Nemotron Cascade 2 30B-A3B
By NVIDIA · United States
Overview
NVIDIA's 30B MoE (3B active) with both thinking and instruct modes. Earned IMO 2025 and IOI 2025 gold medals — 30B-class reasoning at 3B-active inference speed. Released April 2026.
When to pick this model
- Competition-grade math and code workloads
- Reasoning agents needing fast inference (3B active)
- Single-GPU deployments on 24 GB cards in Q4
- Production systems on NVIDIA Open Model License terms
- Tasks switching between thinking and instruct modes
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 17 GB |
| Q5_K_M | 21 GB |
| Q8_0 | 32 GB |
| FP16 (no quantization) | 60 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
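As a rough way to apply the table, the sketch below picks the highest-precision quantization that fits a given card. It uses only the figures listed above, which assume an 8k KV cache. The function name `largest_fitting_quant` and the `headroom_gb` parameter are illustrative assumptions, not part of any tool.

```python
from typing import Optional

# VRAM figures in GB, copied from the table above (8k KV cache, ~600 MB overhead).
VRAM_GB = {
    "Q4_K_M": 17,
    "Q5_K_M": 21,
    "Q8_0": 32,
    "FP16": 60,
}


def largest_fitting_quant(gpu_vram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the highest-footprint quantization that fits, or None if none does.

    headroom_gb is a hypothetical reserve for longer contexts or other processes.
    """
    budget = gpu_vram_gb - headroom_gb
    fitting = [(need, name) for name, need in VRAM_GB.items() if need <= budget]
    if not fitting:
        return None
    # A larger footprint means higher precision in this table.
    return max(fitting)[1]
```

For example, `largest_fitting_quant(24)` selects Q5_K_M, while reserving 4 GB of headroom with `largest_fitting_quant(24, headroom_gb=4)` falls back to Q4_K_M.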
Published benchmark scores
| Benchmark | Score |
|---|---|
| AIME 2025 | 88 |
Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.
Strengths
- Gold medal at IMO 2025 and IOI 2025 in thinking mode
- Fast inference with only 3B active params
- Fits on a 24 GB GPU at Q4
- Commercial use allowed under NVIDIA Open Model License
Limitations
- NVIDIA Open Model License — not Apache or MIT
- All 30B parameters must stay in memory (about 17 GB at Q4_K_M), even though only 3B are active per token
- Thinking mode generation can be slow
Architecture & training
Architecture: MoE 30B/3B active · unified thinking mode + instruct · 128k ctx
Training: Trained by NVIDIA. Gold medal at IMO 2025 and IOI 2025 in thinking mode. Optimized for mathematical reasoning and competitive code.
Olympiad-grade reasoning at 3B-active inference cost: the sharpest open math and code model in its weight class.
Quick start
ollama run nemotron-cascade-2
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
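For scripted use, a locally served model can also be reached through Ollama's HTTP API. The following is a minimal sketch, assuming Ollama is running on its default port 11434 and that the model tag matches the `ollama run` command. `build_request` and `generate` are illustrative helper names, not part of any library.

```python
import json
import urllib.request

# Default local Ollama endpoint (assumes an unmodified install).
OLLAMA_URL = "http://localhost:11434/api/generate"
# Model tag taken from the quick-start command; adjust if your local tag differs.
MODEL_TAG = "nemotron-cascade-2"


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout_s: float = 300.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Prove that the sum of two even integers is even.")` returns the reply as a string once the server responds. The generous timeout is a guess that allows for slower generation.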