Llama 4 Scout 109B
By Meta · United States
Overview
Meta's compact Llama 4 mixture-of-experts (MoE) model: 109B total parameters, 17B active per token, natively multimodal, with a 10M-token context window. Fits on a single 80 GB H100 at Q4_K_M or Q5_K_M quantization.
When to pick this model
- Whole-codebase or whole-corpus analysis up to 10M tokens
- Multimodal pipelines where one H100 is the inference budget
- Long-form document understanding without RAG
- Multilingual chat with native image input
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 65 GB |
| Q5_K_M | 78 GB |
| Q8_0 | 117 GB |
| FP16 (no quantization) | 218 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
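As a rough planning aid, the table above can be turned into a fit check against a GPU memory budget. This is a minimal sketch: the figures are copied from the table, while the function name, the `headroom_gb` parameter, and the 80 GB budget used in the example are assumptions for illustration.

```python
# VRAM requirements from the table above, in GB (8k KV cache included).
VRAM_GB = {
    "Q4_K_M": 65,
    "Q5_K_M": 78,
    "Q8_0": 117,
    "FP16": 218,
}


def quantizations_that_fit(budget_gb, headroom_gb=0.0):
    """Return the quantizations whose requirement plus headroom fits the budget.

    headroom_gb is a hypothetical allowance for longer contexts; the table
    does not say how much to add per extra token of context.
    """
    return [
        name
        for name, required in VRAM_GB.items()
        if required + headroom_gb <= budget_gb
    ]


# Assumed budget: one 80 GB card.
print(quantizations_that_fit(80))
print(quantizations_that_fit(80, headroom_gb=8))
```

With no headroom, Q5_K_M leaves only about 2 GB free on an 80 GB card, so Q4_K_M is the safer choice once the context grows beyond 8k.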
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMLU-Pro | 74 |
Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.
Strengths
- 10M token context — unmatched among open models
- Fits on a single 80 GB H100 when quantized to Q4_K_M or Q5_K_M
- Native multimodal input — no separate vision adapter needed
- Only 17B parameters are active per token (MoE sparsity), which keeps inference fast
Limitations
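- Gated access on Hugging Face (license acceptance required before download)
- Llama 4 Community License, which requires a separate license from Meta above 700M monthly active users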
- Long-context quality drops well before the 10M ceiling
- Newer than Llama 3.1 — tooling still catching up
Architecture & training
Architecture: MoE 16 experts · 109B/17B active · iRoPE · natively multimodal
Training: Meta Llama 4 compact flagship.
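The 109B/17B split affects two different budgets. A minimal sketch of the arithmetic follows, assuming all experts stay resident in memory (so weight memory scales with total parameters) and that per-token compute scales roughly with active parameters. The helper names are illustrative, not part of any library.

```python
TOTAL_PARAMS = 109e9   # all experts plus shared layers
ACTIVE_PARAMS = 17e9   # parameters used for any single token


def weight_memory_gb(params, bits_per_weight):
    """Weights-only memory in GB (1 GB = 1e9 bytes), excluding KV cache."""
    return params * bits_per_weight / 8 / 1e9


def active_fraction(total, active):
    """Share of the model's parameters exercised per token."""
    return active / total


fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)
share = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)

print(f"FP16 weights: {fp16_gb:.0f} GB")
print(f"Active per token: {share:.1%}")
```

Sparsity reduces the work done per token, not the memory footprint. Fitting the model on one GPU depends on quantization.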
The long-context champion of open weights — if you actually need 10M tokens, nothing else comes close on a single H100.
Quick start
ollama run llama4:scout
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
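Once the model has been pulled, it can also be queried programmatically through Ollama's local HTTP API. The sketch below assumes an Ollama server on its default address (`localhost:11434`) and uses the non-streaming `/api/generate` endpoint. The helper names and the timeout value are illustrative choices.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama4:scout"):
    """Build a non-streaming generate request for the given prompt."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama4:scout", timeout=600):
    """Send the prompt to the local server and return the completion text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("Describe this model in one sentence.")` returns the completion as a string. The first call can be slow while the weights load.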