Nemotron 3 Nano Omni 30B-A3B
By NVIDIA · United States
Overview
NVIDIA's omnimodal mixture-of-experts (MoE) model: 30B total parameters with 3B active, handling text, image, audio, and video within a 256k-token context window. Its hybrid Mamba2-Transformer MoE architecture is reported to deliver 9x the throughput of competing open omni models. Released April 2026.
When to pick this model
- High-throughput omnimodal inference on NVIDIA hardware
- Single-GPU deployments needing text + image + audio + video
- Long-context multimodal analysis (256k)
- Production pipelines built on NVIDIA NIM
- English-only voice and video assistants
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 21 GB |
| Q5_K_M | 25 GB |
| Q8_0 | 33 GB |
| FP16 (no quantization) | 62 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
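As a rough illustration of how the table can be used, the sketch below picks the highest-precision quantization that fits a given GPU. The VRAM numbers are copied from the table above; the function name `pick_quantization` and the `headroom_gb` parameter are hypothetical, and the headroom value you need for contexts longer than 8k is an assumption you would have to measure yourself.

```python
# VRAM figures copied from the table above (GB, includes a typical 8k KV cache
# and ~600 MB runtime overhead).
VRAM_GB = {
    "Q4_K_M": 21,
    "Q5_K_M": 25,
    "Q8_0": 33,
    "FP16": 62,
}


def pick_quantization(available_gb, headroom_gb=0.0):
    """Return the largest quantization that fits, or None if nothing fits.

    headroom_gb is a hypothetical safety margin reserved for longer
    contexts or other processes sharing the GPU.
    """
    budget = available_gb - headroom_gb
    # In this table, a larger VRAM requirement means higher precision,
    # so the largest entry that fits is the highest-precision choice.
    fitting = [(need, name) for name, need in VRAM_GB.items() if need <= budget]
    if not fitting:
        return None
    return max(fitting)[1]


if __name__ == "__main__":
    for gpu_gb in (16, 24, 48, 80):
        print(gpu_gb, "GB ->", pick_quantization(gpu_gb, headroom_gb=2.0))
```

With 2 GB of headroom, a 24 GB card lands on Q4_K_M and a 16 GB card fits none of the listed variants.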
Strengths
- Native omnimodal: text, image, audio, video
- 256k context window
- 9x throughput versus other open omni models
- Runs on a single GPU thanks to 3B active MoE
- First-class NVIDIA NIM pipeline
Limitations
- English-only
- Full multimodal support requires llama.cpp or vLLM (Ollama is text-only)
- Licensed under the NVIDIA Open Model License, not Apache or MIT
Architecture & training
Architecture: Hybrid Mamba2-Transformer MoE · 30B total / 3B active · Conv3D + EVS · integrated vision/audio/video
Training: 354.6M samples · ~717B tokens across 1,395 datasets. English only. BF16, FP8, NVFP4 variants released.
The fastest open omnimodal model on a single GPU — as long as you only need English.
Quick start
# HuggingFace: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.