Molmo 72B
By Allen AI · United States
Overview
Molmo 72B is Allen AI's flagship vision-language model (VLM), released under Apache 2.0 and built on Qwen2-72B. In human evaluation of visual understanding it ranked second, behind only GPT-4o.
When to pick this model
- On-prem replacement for GPT-4o vision in regulated environments
- High-stakes visual analysis where quality dominates cost
- Research benchmarks demanding open weights at frontier quality
- Document and diagram understanding at scale
- Multi-GPU deployments already provisioned for 70B-class models
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 42 GB |
| Q5_K_M | 50 GB |
| Q8_0 | 78 GB |
| FP16 (no quantization) | 144 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
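To make the table easier to apply, here is a minimal Python sketch that picks a quantization for a given VRAM budget and estimates a GPU count. The figures are copied from the table above. The function names, the `headroom_gb` parameter, and the assumption that the model splits evenly across GPUs are illustrative only; real multi-GPU splits carry some per-device overhead.

```python
import math

# VRAM figures in GB, copied from the table above.
VRAM_GB = {
    "Q4_K_M": 42,
    "Q5_K_M": 50,
    "Q8_0": 78,
    "FP16": 144,
}

def best_quantization(total_vram_gb, headroom_gb=0.0):
    """Return the highest-precision quantization that fits, or None.

    headroom_gb is a hypothetical reserve for longer contexts.
    """
    budget = total_vram_gb - headroom_gb
    fitting = [(gb, name) for name, gb in VRAM_GB.items() if gb <= budget]
    # In this table a larger footprint means higher precision.
    return max(fitting)[1] if fitting else None

def gpus_needed(quantization, per_gpu_gb):
    """Rough GPU count, assuming an even split with no per-GPU overhead."""
    return math.ceil(VRAM_GB[quantization] / per_gpu_gb)
```

For example, `gpus_needed("Q4_K_M", 24)` gives 2 under the even-split assumption, so a third card may still be needed once headroom for longer contexts is included.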
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMMU (val) | 54.1 |
Scores are published by the model author or aggregated from public leaderboards, and are re-measured monthly by our editorial team.
Strengths
- Top-tier vision quality among open-weight VLMs
- Apache 2.0 license with PixMo open training data
- Strong on complex visual reasoning and dense scenes
- Human evaluation second only to GPT-4o
Limitations
- ~42 GB VRAM at Q4_K_M typically means splitting the model across 2-3 GPUs
- 4096-token context constrains long multimodal sessions
- No official GGUF release complicates llama.cpp use
Architecture & training
Architecture: Dense · 72B vision · based on Qwen2 72B + OpenAI CLIP encoder
Training: Allen AI's PixMo dataset; this is the largest (72B) model in the Molmo family.
The highest-quality fully open VLM: choose it when you have the GPUs and need GPT-4o-class vision on-prem.
Quick start
ollama run molmo:72b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
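For scripted use, the same model can be queried through Ollama's local HTTP API instead of the interactive command. The Python sketch below assumes an Ollama server on its default port (11434) and that the `molmo:72b` tag from the quick-start command is available in your installation. The helper names `build_request` and `describe_image` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Tag taken from the quick-start command above.
MODEL_TAG = "molmo:72b"

def build_request(prompt, image_bytes, model=MODEL_TAG):
    """Build the JSON payload for a single non-streaming image query."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def describe_image(image_path, prompt="Describe this image."):
    """Send one image to the local server and return the text response."""
    with open(image_path, "rb") as f:
        payload = build_request(prompt, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `describe_image("diagram.png")` (the file name is a placeholder) returns the model's answer as a string. Given the 4096-token context noted under Limitations, short single-image prompts are the safer default.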