Model fiche

Nemotron Nano 3 30B-A3B

By NVIDIA · United States

Tags: chat, general, reasoning, moe

Parameters: 30B
License: NVIDIA Open Model License
Context: 976k
VRAM (Q4): 19 GB
Released: May 2025

Overview

NVIDIA's Mamba-2 + Transformer hybrid MoE with 3B active out of 30B total parameters. It offers a native 1M-token context and roughly 4× the throughput of Nemotron 2.

When to pick this model

  • Million-token context workloads
  • Edge and on-device inference at unusually long context
  • Throughput-critical pipelines (RAG ingestion, log analysis)
  • Hybrid SSM-Transformer research and benchmarking

VRAM requirements by quantization

Quantization            VRAM required
Q4_K_M (recommended)    19 GB
Q5_K_M                  23 GB
Q8_0                    35 GB
FP16 (no quantization)  62 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
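To make the table easier to apply, here is a minimal Python sketch that picks the largest quantization fitting a given GPU. The VRAM figures are copied from the table above; the function name `pick_quantization` and the default `headroom_gb` margin are illustrative assumptions, not measured values, and the margin should be raised for contexts beyond 8k.

```python
# VRAM figures (GB) copied from the table above. They already include
# model weights, a typical 8k KV cache, and ~600 MB of runtime overhead.
VRAM_GB = {
    "Q4_K_M": 19,
    "Q5_K_M": 23,
    "Q8_0": 35,
    "FP16": 62,
}


def pick_quantization(available_gb, headroom_gb=2.0):
    """Return the largest quantization that fits, or None if none does.

    headroom_gb is a hypothetical safety margin for longer contexts and
    other processes on the GPU; it is an assumption, not a benchmark.
    """
    budget = available_gb - headroom_gb
    # Keep only the quantizations whose listed footprint fits the budget.
    fitting = [(gb, name) for name, gb in VRAM_GB.items() if gb <= budget]
    if not fitting:
        return None
    # The largest footprint corresponds to the least aggressive quantization.
    return max(fitting)[1]


if __name__ == "__main__":
    for card_gb in (16, 24, 48, 80):
        print(card_gb, "GB ->", pick_quantization(card_gb))
```

With the default 2 GB margin, a 24 GB card lands on Q4_K_M, while a 16 GB card fits none of the listed quantizations.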

Strengths

  • Native 1M-token context window
  • Ultra-efficient MoE with only 3B active parameters
  • Roughly 4× throughput improvement over Nemotron 2
  • Permissive NVIDIA Open Model License

Limitations

  • Full 1M context consumes substantial VRAM in practice
  • Hybrid architecture has thinner tooling support
  • Distilled from Llama — inherits some base-model quirks

Architecture & training

Architecture: MoE · 30B total / 3B active · Nemotron-Nano-3 · 1M native context

Training: NVIDIA — distilled from Llama, edge-optimized with 1 million token context.

Verdict

The throughput-and-context champion for edge MoE deployments — built for workloads where 128k context isn't enough.

Quick start

ollama run nemotron3:30b

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
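Once the model is running under Ollama, it can also be queried programmatically. The sketch below assumes Ollama's default local endpoint (`http://localhost:11434/api/generate`) and reuses the `nemotron3:30b` tag from the quick-start command; the helper names `build_request` and `generate` are illustrative, and only the Python standard library is used.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Model tag taken from the quick-start command above.
MODEL_TAG = "nemotron3:30b"


def build_request(prompt, model=MODEL_TAG, url=OLLAMA_URL):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the prompt and return the generated text.

    Requires a running Ollama server with the model already pulled.
    """
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize this log file: ...")` returns the model's reply as a string. Setting `"stream": False` keeps the example short by returning a single JSON object instead of a stream of chunks.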
