Model fiche
Nemotron 3 Nano 30B-A3B
By NVIDIA · United States
chat
code
reasoning
moe
Overview
NVIDIA's 30B-parameter mixture-of-experts (MoE) model activates only 3.5B parameters per token, delivering 30B-class quality at small-model speeds across chat, code, and reasoning. It supports a 128k-token context window.
When to pick this model
- Throughput-sensitive serving where latency matters more than peak quality
- Local inference with partial CPU offload (around 39 GB of system RAM)
- Long-context reasoning and coding without paying dense-model compute
- Workloads that previously needed a dense 30B but were too slow
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 17 GB |
| Q5_K_M | 21 GB |
| Q8_0 | 32 GB |
| FP16 (no quantization) | 60 GB |
VRAM figures include model weights, a KV cache sized for a typical 8k-token context, and ~600 MB of runtime overhead (Ollama / llama.cpp baseline). Add headroom for longer contexts, since the KV cache grows with context length.
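As a rough illustration of how to read the table, the sketch below picks the largest quantization that fits a given GPU. The figures are copied from the table above; the `pick_quantization` helper and its `headroom_gb` parameter are hypothetical names introduced here, and the headroom value for longer contexts is left to the reader because the fiche does not state it.

```python
from typing import Optional

# VRAM figures in GB, copied from the table above (8k context assumed).
VRAM_GB = {
    "Q4_K_M": 17,
    "Q5_K_M": 21,
    "Q8_0": 32,
    "FP16": 60,
}


def pick_quantization(available_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the largest quantization whose table figure plus headroom fits.

    headroom_gb is an assumption supplied by the caller, for example to
    reserve extra memory for contexts longer than 8k. Returns None if
    nothing fits.
    """
    fitting = [
        (need, name)
        for name, need in VRAM_GB.items()
        if need + headroom_gb <= available_gb
    ]
    if not fitting:
        return None
    # The entry with the highest VRAM requirement is the least compressed one.
    return max(fitting)[1]
```

For example, `pick_quantization(24)` selects `Q5_K_M`, while reserving 4 GB of headroom on the same card with `pick_quantization(24, headroom_gb=4)` falls back to `Q4_K_M`.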
Strengths
- MoE routing yields 3.5B-class latency with 30B-class capability
- 128k context for large documents and repos
- Strong across chat, code, and reasoning in one checkpoint
- Distillation plus RL alignment from the broader Nemotron family
Limitations
- Needs ~39 GB of system RAM when partially offloaded to CPU
- Released under the NVIDIA Open Model License: review the commercial terms before deploying
- Gated on Hugging Face
Architecture & training
Architecture: MoE · 30B total / 3.5B active · 128k context
Training: Nemotron 3 family, distillation and RL alignment focused on reasoning, code, and chat.
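To make the total-versus-active split concrete, here is a small back-of-the-envelope sketch using the parameter counts quoted above. The "2 FLOPs per active parameter per token" figure is a common rule of thumb for a forward pass, not something stated in this fiche, and it ignores attention cost, which grows with context length. Function names are illustrative.

```python
TOTAL_PARAMS = 30e9    # total parameters, from the fiche
ACTIVE_PARAMS = 3.5e9  # parameters active per token, from the fiche


def active_fraction(total: float = TOTAL_PARAMS, active: float = ACTIVE_PARAMS) -> float:
    """Share of the weights that participate in computing one token."""
    return active / total


def approx_forward_flops_per_token(params: float) -> float:
    """Rule-of-thumb estimate (assumption): about 2 FLOPs per parameter used."""
    return 2.0 * params


def compute_ratio_vs_dense(total: float = TOTAL_PARAMS, active: float = ACTIVE_PARAMS) -> float:
    """How many times more per-token compute a dense model of the same size needs."""
    return approx_forward_flops_per_token(total) / approx_forward_flops_per_token(active)
```

Under these assumptions, roughly 12% of the weights are used per token, and a dense 30B model would need about 8.6 times the per-token compute. All 30B parameters still have to be held in memory, which is why the VRAM table scales with the total count rather than the active count.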
Verdict
The fast lane of the Nemotron 3 family — pick it when you want 30B output quality but can't afford 30B latency.
Quick start
ollama run nemotron-3-nano
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
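Once the model has been pulled, it can also be queried programmatically. The sketch below is a minimal example that assumes a local Ollama server on its default port (11434) and uses Ollama's `/api/generate` endpoint with streaming disabled. The model tag is taken from the quick-start command; the helper names `build_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "nemotron-3-nano"  # model tag from the quick-start command


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout_s: float = 300.0) -> str:
    """Send the prompt and return the completion text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Explain MoE routing in two sentences.")` returns the model's reply as a string. The generous timeout is a guess to allow for first-load latency, not a documented requirement.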