Gemma 2 9B
By Google · United States
Overview
Gemma 2 9B is Google's distilled instruct model. It outperforms Llama 3 8B on several benchmarks at a slightly larger parameter count.
When to pick this model
- General-purpose chat with stronger output quality than Llama 3 8B
- Workloads that don't need a long context window
- Instruction-following tasks and structured output
- Single consumer GPU deployments
- Fine-tuning baselines under Google's Gemma license
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 6 GB |
| Q5_K_M | 7.5 GB |
| Q8_0 | 11 GB |
| FP16 (no quantization) | 20 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
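As a rough illustration of how the table can be used, the sketch below picks the highest-precision quantization that fits a given GPU. The numbers are copied from the table above. The function name `pick_quantization` and the `headroom_gb` parameter are hypothetical, introduced only for this example. `headroom_gb` is a manual allowance for longer contexts, not a measured value.

```python
# VRAM figures in GB, copied from the table above. They already include
# model weights, a typical 8k KV cache, and ~600 MB of runtime overhead.
VRAM_GB = {
    "Q4_K_M": 6.0,
    "Q5_K_M": 7.5,
    "Q8_0": 11.0,
    "FP16": 20.0,
}


def pick_quantization(available_gb, headroom_gb=0.0):
    """Return the most VRAM-hungry quantization that still fits, or None.

    headroom_gb is an assumed safety margin reserved by the caller,
    for example when planning for a larger KV cache.
    """
    budget = available_gb - headroom_gb
    fitting = [(need, name) for name, need in VRAM_GB.items() if need <= budget]
    if not fitting:
        return None
    # A larger VRAM requirement corresponds to higher precision in this table.
    return max(fitting)[1]
```

For example, `pick_quantization(8)` selects `Q5_K_M`, while `pick_quantization(12, headroom_gb=2)` falls back from `Q8_0` to `Q5_K_M` because only 10 GB remain after the reserved margin.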
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMLU | 71.3 |
| HellaSwag | 87.2 |
| HumanEval | 40.2 |
Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.
Strengths
- Beats Llama 3 8B on multiple benchmarks
- Solid quality-per-parameter
- Reliable instruction following
- Distilled from Gemma 2 27B for better quality density
Limitations
- 8k context is the standout limitation
- No vision capabilities
- Gemma license is more restrictive than Apache 2.0
Architecture & training
Architecture: Dense Transformer · Gemma 2 9B · sliding window attention
Training: 8T tokens. Distilled from Gemma 2 27B.
A strong 9B if you can live with 8k context — otherwise pick Qwen 2.5 7B or Llama 3.1 8B for the 128k window.
Quick start
ollama run gemma2:9b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
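Once the model has been pulled with the command above, it can also be called from a script through Ollama's local HTTP API. The following is a minimal sketch, assuming Ollama is running on its default port (11434) on the same machine. The helper names `build_request` and `generate` are hypothetical, and `num_ctx` is set to 8192 here only to match the 8k context discussed on this page.

```python
import json
import urllib.request

# Default local endpoint of a running Ollama server (assumed, not configured here).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="gemma2:9b", num_ctx=8192):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=120):
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, `print(generate("Summarize sliding window attention in one sentence."))` would print the model's reply. The call raises a `urllib.error.URLError` if nothing is listening on the assumed port.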