Model fiche

Nemotron 3 33B

By NVIDIA · United States

Tags: chat, code, reasoning
Parameters: 33B
License: NVIDIA Open Model License
Context: 128k
VRAM (Q4): 19 GB
Released: 4 May 2026

Overview

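NVIDIA's dense 33B model targeting balanced chat, code, and reasoning workloads. At Q4 it needs about 19 GB with an 8k KV cache, so it fits a single 24 GB RTX 4090. The context window extends to 128k tokens, but longer contexts need additional VRAM.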

When to pick this model

  • Single-GPU local deployment on a 24 GB card (RTX 4090/3090) at Q4
  • Mixed workloads spanning chat, code generation, and step-by-step reasoning
  • Long-document analysis up to 128k tokens
  • Self-hosted alternative to mid-tier API models when data must stay on-prem

VRAM requirements by quantization

Quantization            VRAM required
Q4_K_M (recommended)    19 GB
Q5_K_M                  23 GB
Q8_0                    35 GB
FP16 (no quantization)  66 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
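As a rough illustration of how these figures might be used, the sketch below picks the highest-precision quantization that fits a given VRAM budget. The numbers are copied from the table above and therefore assume an 8k KV cache. The function name `best_quantization` and the `headroom_gb` parameter are hypothetical helpers introduced here, not part of any tool.

```python
# VRAM requirements in GB, copied from the table above (8k KV cache assumed).
VRAM_GB = {
    "Q4_K_M": 19,
    "Q5_K_M": 23,
    "Q8_0": 35,
    "FP16": 66,
}


def best_quantization(available_gb, headroom_gb=0.0):
    """Return the highest-precision quantization that fits, or None.

    headroom_gb is a hypothetical safety margin to reserve for longer
    contexts or other processes sharing the GPU.
    """
    budget = available_gb - headroom_gb
    # Larger VRAM requirement corresponds to higher precision in this table.
    fitting = [(need, name) for name, need in VRAM_GB.items() if need <= budget]
    if not fitting:
        return None
    return max(fitting)[1]
```

On a 24 GB card with no reserve, `best_quantization(24)` selects Q5_K_M. Reserving 2 GB with `best_quantization(24, headroom_gb=2)` falls back to Q4_K_M, which matches the recommended setting when extra context is expected.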

Strengths

  • Dense 33B sized to fit a 24 GB consumer GPU at Q4 (19 GB), leaving roughly 5 GB of headroom
  • 128k context handles long codebases and reports
  • RLHF-tuned for reasoning and code, not just chat
  • Open weights backed by NVIDIA's research stack

Limitations

  • NVIDIA Open Model License has commercial terms worth reviewing carefully
  • Gated on Hugging Face (click-through access required)
  • Dense 33B activates all of its parameters for every token, so inference is heavier than with comparable MoE alternatives

Architecture & training

Architecture: Dense Transformer · 33B parameters · 128k context

Training: NVIDIA Nemotron family, RLHF alignment focused on reasoning and code.

Verdict

A solid single-GPU workhorse for teams that want strong reasoning and code on a 4090 without depending on an API.

Quick start

ollama run nemotron3

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
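For scripted use, a model started with Ollama can also be queried over Ollama's local HTTP API. The following is a minimal sketch, assuming Ollama is running on its default port 11434 and that the model is available under the `nemotron3` tag used in the command above. The helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

# Default local endpoint of Ollama's generate API (assumes a default install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="nemotron3"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="nemotron3", timeout=300):
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Summarize this function in one sentence.")` returns the model's reply as a string. Setting `stream` to false keeps the example simple by returning a single JSON object rather than a stream of partial responses.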
