Gemma 2 2B
By Google · United States
Overview
Gemma 2 2B is Google's compact instruct model, distilled from larger Gemma models. It is small enough to run on a Raspberry Pi 5 or a modest CPU.
When to pick this model
- Edge devices, microservers, and SBCs
- Background tasks where latency beats sophistication
- Text classification, simple summarization, and routing
- Educational and demo deployments
- Fallback model when GPU resources are unavailable
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 1.8 GB |
| Q5_K_M | 2.2 GB |
| Q8_0 | 3.2 GB |
| FP16 (no quantization) | 5 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
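As a rough illustration of how the table might be used, the sketch below picks the largest quantization that fits a given memory budget. The figures are copied from the table above; the function name `pick_quantization` and the `headroom_gb` parameter are hypothetical helpers introduced here, not part of Ollama or llama.cpp.

```python
# VRAM figures in GB, copied from the table above
# (weights + 8k KV cache + ~600 MB runtime overhead).
VRAM_GB = {
    "Q4_K_M": 1.8,
    "Q5_K_M": 2.2,
    "Q8_0": 3.2,
    "FP16": 5.0,
}


def pick_quantization(available_gb, headroom_gb=0.0):
    """Return the largest quantization that fits the budget, or None.

    headroom_gb is an assumed safety margin (e.g. for longer contexts);
    the right value depends on your workload.
    """
    budget = available_gb - headroom_gb
    fitting = [(gb, name) for name, gb in VRAM_GB.items() if gb <= budget]
    if not fitting:
        return None
    # max() compares by the GB figure first, so this is the largest fit.
    return max(fitting)[1]


if __name__ == "__main__":
    for gb in (1.5, 2.0, 4.0, 8.0):
        print(f"{gb} GB available -> {pick_quantization(gb)}")
```

Reserving headroom simply shrinks the budget before the lookup, so a 4 GB device with 1 GB held back falls from Q8_0 to Q5_K_M.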
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMLU | 52.2 |
| HellaSwag | 74.9 |
Scores are published by the model author or aggregated from public leaderboards, and are re-measured monthly by our editorial team.
Strengths
- Runs comfortably in under 2 GB VRAM at Q4
- Best-in-class 2B quality for its release window
- Workable on commodity CPUs
- Google's Gemma license permits broad use
Limitations
- 8k context is restrictive for modern RAG
- Falls apart on multi-step reasoning
- No vision, no tool calling out of the box
Architecture & training
Architecture: Dense Transformer · Gemma 2 2B · logit-softcapping + local/global attention
Training: 2T tokens, distilled from larger Gemma models.
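For readers unfamiliar with logit soft-capping, here is a minimal sketch of the idea: logits are squashed smoothly into the range (-cap, +cap) with a scaled tanh instead of being hard-clipped. The cap values shown (50.0 for attention logits, 30.0 for final logits) are what the Gemma 2 technical report describes, to the best of our knowledge; treat them as assumptions and check the model config before relying on them. The function name `soft_cap` is our own.

```python
import math


def soft_cap(logits, cap):
    """Soft-cap a list of logits: cap * tanh(x / cap).

    Small values pass through almost unchanged; large values
    approach +/- cap smoothly instead of being clipped.
    """
    return [cap * math.tanh(x / cap) for x in logits]


# Assumed cap values (see the note above); not read from a real config.
ATTENTION_CAP = 50.0
FINAL_LOGIT_CAP = 30.0

if __name__ == "__main__":
    raw = [-1000.0, -5.0, 0.0, 5.0, 1000.0]
    print(soft_cap(raw, FINAL_LOGIT_CAP))
```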
The best 2B for edge and CPU workloads; just don't expect it to reason.
Quick start
ollama run gemma2:2b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
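If you would rather call the model from a script, the sketch below posts a prompt to Ollama's local HTTP API using only the Python standard library. It assumes Ollama is running on its default port (11434) and that the `gemma2:2b` model has already been pulled; the helper names `build_payload` and `generate` are our own.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your install uses another host or port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="gemma2:2b"):
    """Build the JSON body for a single, non-streamed completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, url=OLLAMA_URL, timeout=120):
    """Send the prompt to a running Ollama server and return the reply text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    # Supplying a request body makes urllib issue a POST.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, `generate("Label this ticket as billing, bug, or other: 'I was charged twice.'")` returns the model's reply as a string, which suits the classification and routing tasks listed earlier.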