Llama 3.2 3B
By Meta · United States
Overview
Meta's 3B instruct model with a full 128k context, tuned for laptops, mobile, and edge devices where memory and battery matter.
When to pick this model
- On-device assistants for laptops, phones, or tablets
- CPU-only inference where speed beats raw quality
- Long-context summarization on constrained hardware
- Latency-critical agent loops
- Local autocomplete or text classification
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 2.5 GB |
| Q5_K_M | 3 GB |
| Q8_0 | 4.5 GB |
| FP16 (no quantization) | 7 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
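To gauge how much headroom a longer context needs, the KV cache can be estimated from the model's attention shape. The following is a rough sketch, not the method used to produce the table above. The layer count, KV-head count, and head dimension are assumptions about the Llama 3.2 3B configuration that this page does not state, and the estimate assumes an unquantized 16-bit cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Estimate the bytes needed to cache keys and values for context_len tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def to_gib(n_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 1024**3


# ASSUMED architecture values (not taken from this page): verify against the
# model's published config before relying on the numbers.
ASSUMED_SHAPE = dict(n_layers=28, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        size = to_gib(kv_cache_bytes(context_len=ctx, **ASSUMED_SHAPE))
        print(f"{ctx:>7} tokens: {size:.2f} GiB of KV cache")
```

The cache grows linearly with context length, so under these assumptions moving from 8k to 128k tokens multiplies the cache size by 16. Runtimes that quantize the KV cache will need less.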
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMLU | 63.4 |
| HellaSwag | 79.2 |
Scores are published by the model author or aggregated from public leaderboards, and are re-measured monthly by our editorial team.
Strengths
- 128k context in a 3B parameter footprint
- Fast CPU inference
- Strong baseline for edge and mobile use cases
- Distilled from larger Llama models for better quality density
Limitations
- Noticeably weaker than 7B+ models on complex tasks
- No vision in this checkpoint
- Subject to Llama Community license terms
Architecture & training
Architecture: Dense Transformer · Llama 3.2 3B · lightweight architecture for edge
Training: Meta multilingual corpus + distillation from larger Llama models.
The best 3B open-weight model for edge use cases — pick it when memory and latency dominate the brief.
Quick start
ollama run llama3.2:3b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
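For scripted use, a locally running Ollama instance can also be queried over HTTP. The sketch below assumes Ollama is listening on its default local port (11434) and that the `llama3.2:3b` tag has already been pulled. The helper names `build_payload` and `generate` are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="llama3.2:3b"):
    """Build the JSON body for a single non-streaming generation request."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, url=OLLAMA_URL, timeout=120):
    """Send the prompt to the local server and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # With streaming disabled, the reply arrives as one JSON object
    # whose "response" field holds the generated text.
    return body["response"]


if __name__ == "__main__":
    print(generate("Summarize the benefits of small language models in one sentence."))
```

Setting `"stream": False` keeps the example simple by returning one JSON object. Latency-sensitive applications would likely prefer streaming and read the reply line by line.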