Model fiche

Nemotron 3 Nano 30B-A3B

By NVIDIA · United States

chat code reasoning moe
Parameters: 30B
License: NVIDIA Open Model License
Context: 128k
VRAM (Q4): 17 GB
Released: 11 April 2026

Overview

NVIDIA's 30B-parameter mixture-of-experts (MoE) model activates only 3.5B parameters per token, delivering 30B-class quality at small-model speeds across chat, code, and reasoning. It supports a 128k-token context window.

When to pick this model

  • Throughput-sensitive serving where latency matters more than peak quality
  • Local inference with partial CPU offload (around 39 GB of system RAM)
  • Long-context reasoning and coding without paying dense-model compute
  • Workloads that previously needed a dense 30B but were too slow

VRAM requirements by quantization

Quantization             VRAM required
Q4_K_M (recommended)     17 GB
Q5_K_M                   21 GB
Q8_0                     32 GB
FP16 (no quantization)   60 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
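
As a rough illustration of how figures like these can be estimated, the sketch below adds weight memory, KV cache, and runtime overhead. It is a back-of-the-envelope approximation under stated assumptions, not the method used to produce the table: the function name `estimate_vram_gb` is hypothetical, the bit widths are nominal (real GGUF quantizations such as Q4_K_M average slightly more bits per weight), and the 1.4 GB KV-cache value is a placeholder assumption rather than a measured number. Only the ~600 MB overhead comes from the note above.

```python
def estimate_vram_gb(params_billion, bits_per_weight,
                     kv_cache_gb=1.4,   # ASSUMPTION: placeholder for an 8k KV cache
                     overhead_gb=0.6):  # ~600 MB runtime overhead, from the note above
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes)."""
    # Billions of parameters times bits per weight, divided by 8 bits per byte,
    # gives the weight memory directly in GB.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_cache_gb + overhead_gb


# Nominal bit widths for the quantized rows; treat the results as estimates only.
for name, bits in [("Q4", 4), ("Q5", 5), ("Q8", 8)]:
    print(f"{name}: about {estimate_vram_gb(30, bits):.1f} GB")
```

Longer contexts grow the KV cache term, which is why the note recommends extra headroom beyond the table values.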

Strengths

  • MoE routing yields 3.5B-class latency with 30B-class capability
  • 128k context for large documents and repos
  • Strong across chat, code, and reasoning in one checkpoint
  • Distillation plus RL alignment from the broader Nemotron family

Limitations

  • Needs ~39 GB of system RAM when partially offloaded to CPU
  • NVIDIA Open Model License — review commercial terms
  • Gated on Hugging Face

Architecture & training

Architecture: MoE · 30B total / 3.5B active · 128k context

Training: Nemotron 3 family, distillation and RL alignment focused on reasoning, code, and chat.
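
To make the "30B total / 3.5B active" idea concrete, here is a minimal sketch of generic top-k expert routing, the mechanism that lets an MoE evaluate only a few experts per token. It is an illustration of the general technique, not NVIDIA's actual router: the functions `top_k_route` and `moe_forward`, the toy experts, and the logits are all hypothetical.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Pick the k experts with the largest router logits (ties keep index order).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of only the k selected experts."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Toy usage: 4 hypothetical experts, of which only 2 run for this token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
output = moe_forward(1.0, experts, logits, k=2)
```

Because the unselected experts are never called, per-token compute scales with the active parameters rather than the total, which is the source of the latency advantage claimed for this model.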

Verdict

The fast lane of the Nemotron 3 family — pick it when you want 30B output quality but can't afford 30B latency.

Quick start

ollama run nemotron-3-nano

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
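
Once the model has been pulled with the command above, it can also be queried programmatically through Ollama's local REST API. The following is a minimal sketch using only the Python standard library. It assumes Ollama is running on its default port (11434) and that the model tag matches the quick-start command; the helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "nemotron-3-nano"  # tag taken from the quick-start command


def build_request(prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": MODEL_TAG, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Example call, assuming the server is up and the model is available:
# print(generate("Explain mixture-of-experts routing in two sentences."))
```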
