Llama 3.2 Vision 11B
By Meta · United States
Overview
Meta's first official multimodal Llama. An 11B vision-language model built on Llama 3.1 8B with added image adapters and a 128k text context.
When to pick this model
- OCR and document understanding on a consumer GPU
- Image captioning and description pipelines
- Chart and graph analysis
- Mixed text-and-image RAG workloads
- Llama ecosystem deployments needing vision
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 8 GB |
| Q5_K_M | 10 GB |
| Q8_0 | 14 GB |
| FP16 (no quantization) | 24 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
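As a rough cross-check of the table, the arithmetic in the note above can be sketched as weights plus KV cache plus runtime overhead. This is a back-of-the-envelope estimate, not how Ollama or llama.cpp actually allocate memory. The bits-per-weight values and the 1 GB allowance for an 8k KV cache are illustrative assumptions, and `estimate_vram_gb` is a hypothetical helper name.

```python
# Approximate effective bits per weight for each format.
# These values are assumptions for illustration, not measured figures.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def estimate_vram_gb(params_billion, bits_per_weight,
                     kv_cache_gb=1.0, overhead_gb=0.6):
    """Estimate VRAM in GB as weights + KV cache + runtime overhead.

    kv_cache_gb is an assumed allowance for an 8k context;
    overhead_gb follows the ~600 MB baseline quoted in the note.
    """
    # params (billions) * bits / 8 gives the weight size in GB.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_cache_gb + overhead_gb


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimate_vram_gb(11, bits):.1f} GB")
```

Under these assumptions the estimates land close to the table's figures. For longer contexts, raise `kv_cache_gb` in proportion to the context length.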
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMMU | 50.7 |
| DocVQA | 88.4 |
Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.
Strengths
- 128k text context with image input
- Strong OCR and image description
- Built on the well-supported Llama 3 base
- First-party Meta multimodal release
Limitations
- Vision quality trails Qwen2-VL and LLaVA-OneVision
- Subject to Llama Community license terms
- No video understanding
- Image inputs add significant VRAM overhead
Architecture & training
Architecture: Dense · 11B · vision cross-attention · CLIP encoder · Llama 3.2
Training: Llama 3.1 8B + vision adapters. First official Meta vision model.
A solid Llama-family vision model — but Qwen2-VL is the better open-weight choice when license terms allow.
Quick start
ollama run llama3.2-vision:11b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
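For scripted use, such as a captioning or OCR pipeline, the model can also be called through Ollama's local HTTP API once it has been pulled. The following is a minimal sketch, assuming Ollama is running on its default port 11434. `build_payload` and `describe_image` are hypothetical helper names, and the image path in the usage note is a placeholder.

```python
import base64
import json
import urllib.request

# Default local Ollama endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes, prompt, model="llama3.2-vision:11b"):
    """Build the JSON body for a single-image, non-streaming request."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(path, prompt="Describe this image."):
    """Send one image to the local Ollama server and return the reply text."""
    with open(path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

A call such as `describe_image("invoice.png", "Extract all text from this document.")` returns the model's answer as a string. The request is sent only when `describe_image` is called, so the payload can be inspected without a running server.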