Model fiche
Nemotron Nano 3 30B-A3B
By NVIDIA · United States
Tags: chat · general · reasoning · moe
Overview
NVIDIA's hybrid Mamba-2 + Transformer MoE model, with 3B active parameters out of 30B total. It offers a native 1M-token context and roughly 4× the throughput of Nemotron 2.
When to pick this model
- Million-token context workloads
- Edge and on-device inference at unusually long context
- Throughput-critical pipelines (RAG ingestion, log analysis)
- Hybrid SSM-Transformer research and benchmarking
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 19 GB |
| Q5_K_M | 23 GB |
| Q8_0 | 35 GB |
| FP16 (no quantization) | 62 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
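As a rough illustration of how the table might be used, the sketch below picks the largest listed quantization that fits a given GPU. The VRAM figures are copied from the table and assume its 8k KV cache. The `best_fit` helper and the 2 GB default headroom are hypothetical choices for this example, not part of any tool.

```python
# VRAM requirements in GB, copied from the table above (8k KV cache assumed).
VRAM_GB = {
    "Q4_K_M": 19,
    "Q5_K_M": 23,
    "Q8_0": 35,
    "FP16": 62,
}


def best_fit(gpu_vram_gb, headroom_gb=2.0):
    """Return the largest listed quantization that fits, or None.

    headroom_gb is a hypothetical safety margin; raise it when planning
    for contexts longer than 8k, as the note above suggests.
    """
    budget = gpu_vram_gb - headroom_gb
    fitting = [(need, name) for name, need in VRAM_GB.items() if need <= budget]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for gpu in (16, 24, 48, 80):
        print(f"{gpu} GB GPU -> {best_fit(gpu)}")
```

With the default margin, a 24 GB card lands on Q4_K_M, which matches the recommended row in the table. Q5_K_M fits only if the headroom is dropped to 1 GB or less.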
Strengths
- Native 1M-token context window
- Ultra-efficient MoE with only 3B active parameters
- Roughly 4× throughput improvement over Nemotron 2
- Permissive NVIDIA Open Model license
Limitations
- Full 1M context consumes substantial VRAM in practice
- Hybrid architecture has thinner tooling support than pure-Transformer models
- Distilled from Llama, so it inherits some base-model quirks
Architecture & training
Architecture: MoE · 30B total / 3B active · Nemotron-Nano-3 · 1M native context
Training: by NVIDIA; distilled from Llama and edge-optimized for a 1M-token context.
Verdict
The throughput-and-context champion for edge MoE deployments, built for workloads where 128k context isn't enough.
Quick start
ollama run nemotron3:30b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
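For scripted use, the model can also be queried through Ollama's local HTTP API. The sketch below assumes an Ollama server running on its default port (11434) and that the `nemotron3:30b` tag from the command above has already been pulled. The helper names `build_request` and `generate` are illustrative only.

```python
import json
import urllib.request

# Assumption: Ollama is serving locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_TAG = "nemotron3:30b"  # tag taken from the quick-start command


def build_request(prompt, model=MODEL_TAG):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize this log excerpt: ...")` returns the model's full reply as one string. Setting `"stream": True` would instead yield one JSON object per line, which this sketch does not handle.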