Model fiche
Llama 3.1 405B Instruct
By Meta · United States
chat
general
reasoning
Overview
Meta's reference open-weight dense model at 405B parameters, scoring 88.6 on MMLU and 89.0 on HumanEval. The weights are gated on Hugging Face and need about 240 GB of VRAM even at Q4_K_M.
When to pick this model
- Self-hosted alternative to closed frontier APIs when you have the hardware
- Reproducible research baseline for large dense models
- Long-running batch inference where weight licensing matters more than speed
- Distillation source for smaller specialist models
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 240 GB |
| Q5_K_M | 288 GB |
| Q8_0 | 435 GB |
| FP16 (no quantization) | 810 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
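As a rough sanity check on the table, the figures can be turned into an implied bits-per-parameter value and a lower bound on GPU count. This is a minimal sketch, not a sizing tool: the 80 GB per-GPU capacity is an assumption chosen for illustration, and the helper names are hypothetical.

```python
import math

PARAMS_BILLION = 405  # parameter count stated in this fiche

# VRAM figures in GB, copied from the table above
VRAM_GB = {
    "Q4_K_M": 240,
    "Q5_K_M": 288,
    "Q8_0": 435,
    "FP16": 810,
}


def implied_bits_per_param(vram_gb, params_billion=PARAMS_BILLION):
    """Total VRAM (weights + KV cache + overhead) expressed as bits per parameter."""
    return vram_gb * 8 / params_billion


def min_gpus(vram_gb, gpu_gb=80):
    """Lower bound on GPU count; gpu_gb=80 is an assumed per-card capacity."""
    return math.ceil(vram_gb / gpu_gb)


for name, gb in VRAM_GB.items():
    bits = implied_bits_per_param(gb)
    print(f"{name}: {bits:.2f} bits/param, at least {min_gpus(gb)} x 80 GB GPUs")
```

The GPU count is a floor with no headroom. Longer contexts, activation buffers, and uneven sharding across cards will usually push the real requirement higher.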
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMLU | 88.6 |
| HumanEval | 89.0 |
| GSM8K | 96.8 |
Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.
Strengths
- The reference dense open model — widely benchmarked and well-understood
- MMLU 88.6, HumanEval 89.0
- 128k context
- Mature ecosystem support across all serving frameworks
Limitations
- 240+ GB of VRAM even at Q4_K_M, so it requires a multi-GPU server
- Hugging Face gated access
- Llama 3.1 Community License, including a monthly-active-users (MAU) clause: products above 700 million MAU need a separate license from Meta
- Largely superseded by MoE alternatives at similar quality
Architecture & training
Architecture: Dense 405B · GQA
Training: 15T tokens by Meta.
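Grouped-query attention (GQA) matters mostly for memory: fewer key/value heads mean a smaller KV cache per token. The sketch below estimates that cache size. The layer count, head counts, and head dimension are assumptions taken from Meta's published Llama 3.1 405B configuration, not from this fiche, and the cache is assumed to be stored in 16-bit precision.

```python
# Assumed architecture values (Meta's published 405B config, not stated in this fiche)
N_LAYERS = 126
N_Q_HEADS = 128      # query heads
N_KV_HEADS = 8       # GQA: key/value heads shared across groups of query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumed FP16/BF16 KV cache


def kv_cache_bytes(tokens, kv_heads=N_KV_HEADS):
    """KV cache size in bytes for one sequence; the factor 2 covers keys and values."""
    return 2 * N_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * tokens


for ctx in (8_192, 131_072):
    gqa_gb = kv_cache_bytes(ctx) / 1e9
    mha_gb = kv_cache_bytes(ctx, kv_heads=N_Q_HEADS) / 1e9
    print(f"{ctx} tokens: {gqa_gb:.1f} GB with GQA, {mha_gb:.1f} GB with one KV head per query head")
```

Under these assumptions the cache grows linearly with context length, so a full 128k context needs 16 times the cache of an 8k context.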
Verdict
Still the canonical dense open model, but MoE alternatives now deliver comparable quality at a fraction of the inference cost.
Quick start
ollama run llama3.1:405b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
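Once Ollama is serving the model, it can also be queried programmatically over Ollama's local HTTP API. The following is a minimal sketch using only the Python standard library. It assumes Ollama is listening on its default port (11434) on the same machine; the helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint
MODEL = "llama3.1:405b"  # model tag from the quick-start command


def build_request(prompt, model=MODEL, url=OLLAMA_URL):
    """Build a non-streaming generate request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the request and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]


# Example call (requires a running Ollama server with the model pulled):
# print(generate("Summarize grouped-query attention in one sentence."))
```

Setting `stream` to false returns the whole completion in a single JSON object, which keeps the example short. Interactive use would normally stream tokens instead.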