Llama 3.1 Nemotron 70B
By NVIDIA · United States
Overview
NVIDIA's RLHF tune of Llama 3.1 70B, which topped Arena Hard with a score of 85.0 at release. Strong alignment and instruction-following on familiar Llama foundations.
When to pick this model
- Instruction-heavy chat assistants needing strong alignment
- Deployments already standardized on the Llama 3.1 family
- Workloads where human-preference alignment beats raw benchmarks
- NVIDIA-stack deployments leveraging NIM and TensorRT-LLM
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 40 GB |
| Q5_K_M | 48 GB |
| Q8_0 | 75 GB |
| FP16 (no quantization) | 140 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
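The headroom needed for longer contexts can be estimated with a little arithmetic. The sketch below is a rough estimate, not the method used to produce the table: it assumes the publicly documented Llama 3.1 70B attention geometry (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and an unquantized fp16 KV cache. The function names are illustrative.

```python
# Rough KV-cache size estimate for a Llama 3.1 70B-class model.
# Assumptions (not stated on this page): 80 layers, 8 KV heads (GQA),
# head dimension 128, and an fp16 (2-byte) KV cache.
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2


def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes needed to cache keys and values for `context_tokens` tokens."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return per_token * context_tokens


def kv_cache_gib(context_tokens: int) -> float:
    """Same estimate expressed in GiB."""
    return kv_cache_bytes(context_tokens) / 2**30


for ctx in (8192, 32768, 131072):
    print(f"{ctx:>7} tokens: {kv_cache_gib(ctx):.1f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with context, so quadrupling the context from 8k to 32k adds roughly 7.5 GiB on top of the figures in the table. Runtimes that quantize the KV cache would need proportionally less.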
Published benchmark scores
| Benchmark | Score |
|---|---|
| Arena Hard | 85.0 |
| AlpacaEval 2 LC | 57.6 |
| MT-Bench | 8.98 |
Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.
Strengths
- Arena Hard 85.0 — topped the leaderboard at release
- AlpacaEval 2 LC 57.6
- MT-Bench 8.98
- Strong RLHF on real human preference data
Limitations
- Llama 3.1 Community License, including its monthly-active-user (MAU) clause
- Hugging Face gated access
- Since overtaken on reasoning by Qwen 2.5 72B and the DeepSeek-R1 distills
- About 40 GB at Q4_K_M, which exceeds a single 24 GB GPU and needs dual 24 GB GPUs
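The GPU-count arithmetic behind the last point can be sketched as follows. This is a simplified estimate that takes the VRAM figures from the quantization table and assumes the model splits evenly across identical GPUs with no extra per-GPU overhead, which real multi-GPU setups will not quite achieve. The helper names are hypothetical.

```python
import math

# VRAM figures (GB) copied from the quantization table on this page.
VRAM_GB = {"Q4_K_M": 40, "Q5_K_M": 48, "Q8_0": 75, "FP16": 140}


def min_gpus(quant: str, gpu_gb: float) -> int:
    """Smallest number of identical GPUs whose combined VRAM covers `quant`.

    Assumes a perfectly even split and no per-GPU overhead.
    """
    return math.ceil(VRAM_GB[quant] / gpu_gb)


def best_quant(total_gb: float):
    """Highest-footprint quantization that fits in `total_gb`, or None."""
    fitting = [q for q, need in VRAM_GB.items() if need <= total_gb]
    return max(fitting, key=VRAM_GB.get) if fitting else None


print(min_gpus("Q4_K_M", 24))  # dual 24 GB GPUs for Q4_K_M
print(best_quant(48))          # what fits in 48 GB total
```

Treat the result as a lower bound: tensor-parallel or layer-split setups usually need some extra memory per device.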
Architecture & training
Architecture: Dense Llama 3.1 70B · intensive NVIDIA RLHF
Training: RLHF on human preferences.
An excellent RLHF tune of Llama 3.1 70B — still strong for alignment-heavy chat, though reasoning specialists have since pulled ahead.
Quick start
ollama run nemotron:70b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
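Once the model is available locally, it can also be queried from a script through Ollama's local HTTP API. The following is a minimal sketch using only the Python standard library. It assumes an Ollama server is running on its default port (11434) and that the `nemotron:70b` tag from the command above has been pulled. The helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

# Default local Ollama endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "nemotron:70b") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "nemotron:70b", timeout: float = 600.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Explain RLHF in one sentence.")` returns the model's reply as a string. The generous timeout is a guess to allow for slow first-token latency on a 70B model.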