OLMoE 1B-7B Instruct
By Allen AI · United States
Overview
Allen AI's OLMoE is the only MoE released with weights, training data, and code fully open. It has 7B total parameters with 1.3B active per token, and matches Llama2-13B-Chat quality.
When to pick this model
- Research that requires fully reproducible MoE training
- Latency-critical chat, where only 1.3B active parameters per token keep inference fast
- Teaching and curriculum use cases needing full provenance
- Cheap CPU or single-GPU inference setups
- Baselines for new MoE architectures
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 4 GB |
| Q5_K_M | 5 GB |
| Q8_0 | 7 GB |
| FP16 (no quantization) | 14 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
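The three components named above (weights, KV cache, runtime overhead) can be combined into a rough estimate. The sketch below is illustrative only: the layer count, KV-head count, and head dimension passed in the example are assumed values, not figures from this page, and real quantization formats carry per-block metadata that a flat bits-per-weight number only approximates.

```python
def estimate_vram_gb(total_params, bits_per_weight, context_tokens,
                     n_layers, n_kv_heads, head_dim,
                     kv_bytes_per_value=2, overhead_bytes=600e6):
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes).

    All architecture arguments are caller-supplied assumptions.
    """
    # Weights: every parameter is stored, even though an MoE activates only a subset.
    weight_bytes = total_params * bits_per_weight / 8
    # KV cache: keys and values (factor 2) for each layer, KV head, and cached token.
    kv_bytes = (2 * n_layers * n_kv_heads * head_dim
                * kv_bytes_per_value * context_tokens)
    return (weight_bytes + kv_bytes + overhead_bytes) / 1e9


# Example with assumed architecture values (hypothetical, for illustration):
fp16_estimate = estimate_vram_gb(
    total_params=7e9, bits_per_weight=16, context_tokens=8192,
    n_layers=16, n_kv_heads=16, head_dim=128,
)
print(f"FP16 estimate: {fp16_estimate:.1f} GB")
```

An estimate like this will not necessarily reproduce the table's figures, which depend on the exact quantization format and runtime. Note that memory scales with the 7B total parameters, not the 1.3B active ones.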
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMLU | 52 |
Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.
Strengths
- Very fast inference with only 1.3B active parameters
- Training corpus is 100% open source (Dolmino + Pile 2)
- Apache 2.0 license throughout
- Competitive with Llama2-13B-Chat at a fraction of the cost
Limitations
- 4096-token context is limiting for modern workloads
- Quality trails recent dense 7B models
- Limited tooling and quantization support
Architecture & training
Architecture: MoE · 7B total / 1.3B active · 64 experts, 8 active per token
Training: Allen AI's OLMoE training run on open data (Dolmino + The Pile 2).
As the only MoE that is open end-to-end, it is a better pick for research and education than for raw production quality.
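To illustrate what "64 experts, 8 active per token" means in practice, here is a minimal pure-Python sketch of generic top-k expert routing. It is not OLMoE's actual implementation: the function names, the toy experts, and the choice not to renormalize the selected weights are all assumptions made for illustration.

```python
import math


def route_token(router_logits, k=8):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts; the rest are skipped entirely.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


def moe_layer(x, experts, router_logits, k=8):
    """Weighted sum of the selected experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for idx, weight in route_token(router_logits, k):
        y = experts[idx](x)  # only k of the experts ever run for this token
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy usage: 64 hypothetical experts, each scaling the input by a constant.
toy_experts = [lambda x, s=i: [s * v for v in x] for i in range(64)]
toy_logits = [float(i) for i in range(64)]
selected = route_token(toy_logits)
print([i for i, _ in selected])
print(moe_layer([1.0, 2.0], toy_experts, toy_logits))
```

Because only 8 of the 64 expert networks run per token, per-token compute tracks the active parameter count, while all experts must still be held in memory.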
Quick start
ollama run olmoe
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
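For scripted use, a locally running Ollama server can also be queried over its HTTP API. The sketch below assumes Ollama is listening on its default port (11434) and that the model is available under the `olmoe` tag used in the command above. The helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Default local endpoint for Ollama's generate API (assumes a default install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="olmoe"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="olmoe"):
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Example call, commented out because it needs a running Ollama server:
# print(generate("Explain mixture-of-experts in one sentence."))
```

Setting `"stream": False` returns one JSON object rather than a stream of partial responses, which keeps the client code short.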