Model fiche
Nemotron 3 33B
By NVIDIA · United States
chat
code
reasoning
Overview
NVIDIA's dense 33B model targets balanced chat, code, and reasoning workloads. At Q4 it fits on a single RTX 4090 (24 GB); the context window extends to 128k tokens, though long contexts need extra VRAM for the KV cache.
When to pick this model
- Single-GPU local deployment on a 24GB card (RTX 4090/3090) at Q4
- Mixed workloads spanning chat, code generation, and step-by-step reasoning
- Long-document analysis up to 128k tokens
- Self-hosted alternative to mid-tier API models when data must stay on-prem
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 19 GB |
| Q5_K_M | 23 GB |
| Q8_0 | 35 GB |
| FP16 (no quantization) | 66 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
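The headroom advice above can be turned into a quick check. The sketch below copies the VRAM figures from the table and picks the largest quantization that still fits a given card once extra KV-cache headroom is added. The KV-cache formula is the standard one for transformer attention, but this fiche does not publish the model's layer count, KV-head count, or head dimension, so those are passed in as parameters and any values you supply are assumptions. Treating "8k" as 8192 tokens and GB as 10^9 bytes are also assumptions.

```python
from typing import Optional

# VRAM figures (GB) copied from the table above. Per the note, they already
# include a typical 8k KV cache and ~600 MB of runtime overhead.
VRAM_REQUIRED_GB = {
    "Q4_K_M": 19,
    "Q5_K_M": 23,
    "Q8_0": 35,
    "FP16": 66,
}

BASELINE_CONTEXT = 8192  # assumption: "8k" means 8192 tokens


def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: one key and one value vector per layer, KV head, and token.

    The architecture arguments are NOT taken from this fiche; they must come
    from the model's config and are hypothetical until checked.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_tokens


def extra_headroom_gb(context_tokens: int, **arch: int) -> float:
    """Extra VRAM (GB, 10^9 bytes) needed beyond the 8k cache already in the table."""
    extra = kv_cache_bytes(context_tokens, **arch) - kv_cache_bytes(BASELINE_CONTEXT, **arch)
    return max(extra, 0) / 1e9


def pick_quantization(vram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the largest quantization whose table figure plus headroom fits, else None."""
    fitting = [(required, name) for name, required in VRAM_REQUIRED_GB.items()
               if required + headroom_gb <= vram_gb]
    return max(fitting)[1] if fitting else None
```

For example, `pick_quantization(24)` selects from the table as-is, while `pick_quantization(24, headroom_gb=extra_headroom_gb(32768, n_layers=..., n_kv_heads=..., head_dim=...))` accounts for a longer context once the real architecture values are filled in from the model config.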
Strengths
- Dense 33B sized to saturate a 24GB consumer GPU at Q4
- 128k context handles long codebases and reports
- RLHF-tuned for reasoning and code, not just chat
- Open weights backed by NVIDIA's research stack
Limitations
- NVIDIA Open Model License has commercial terms worth reviewing carefully
- Gated on Hugging Face (click-through access required)
- Dense architecture activates all 33B parameters for every token, so inference is heavier than with MoE alternatives of comparable size
Architecture & training
Architecture: Dense Transformer · 33B parameters · 128k context
Training: NVIDIA Nemotron family, RLHF alignment focused on reasoning and code.
Verdict
A solid single-GPU workhorse for teams that want strong reasoning and code on a 4090 without depending on an API.
Quick start
ollama run nemotron3
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
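For scripted use, a locally running Ollama server can also be queried over its HTTP API. The sketch below is a minimal example using only the Python standard library. It assumes Ollama is listening on its default address (`localhost:11434`) and that the model is available under the `nemotron3` tag used in the command above; the helper names `build_request` and `generate` are illustrative, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "nemotron3"  # tag taken from the quick-start command


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize this function in one sentence.")` returns the model's reply as a string. With `"stream": False` the server sends a single JSON object rather than a stream of chunks, which keeps the client simple at the cost of waiting for the full response.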