Model fiche
Gemma 4 2B
By Google · United States
chat · vision · multilingual · small
Overview
Google's 2B base model in the Gemma 4 family, with text and image input, a 128k context window, and a 1.2 GB Q4 footprint that runs on integrated graphics or a Raspberry Pi 5.
When to pick this model
- On-device assistants for laptops, phones, and SBCs
- Multimodal prototypes that can't justify a dedicated GPU
- Long-context summarization at the edge
- Air-gapped or offline scenarios where latency and privacy matter
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 1.2 GB |
| Q5_K_M | 1.4 GB |
| Q8_0 | 2.1 GB |
| FP16 (no quantization) | 4 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
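As a rough illustration of how the table might be used, the sketch below picks the highest-precision quantization that fits a given VRAM budget. The figures are copied from the table above; the function name `pick_quantization` and the 0.5 GB default headroom are assumptions made for this example, not recommendations from Google or Ollama.

```python
# VRAM figures (GB) copied from the table above.
QUANT_VRAM_GB = {
    "Q4_K_M": 1.2,
    "Q5_K_M": 1.4,
    "Q8_0": 2.1,
    "FP16": 4.0,
}


def pick_quantization(available_gb, headroom_gb=0.5):
    """Return the largest quantization that fits in the budget, or None.

    headroom_gb is an assumed safety margin for longer contexts; tune it
    for your own workload.
    """
    budget = available_gb - headroom_gb
    fitting = [(need, name) for name, need in QUANT_VRAM_GB.items() if need <= budget]
    if not fitting:
        return None
    # The entry with the highest VRAM requirement is the highest precision.
    return max(fitting)[1]
```

For example, `pick_quantization(2.0)` leaves a 1.5 GB budget after headroom and so selects `Q5_K_M`. Because the table assumes an 8k KV cache, a larger headroom value is the safer choice when working near the 128k context limit.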
Strengths
- Runs on integrated GPUs at ~1.2 GB VRAM in Q4
- Multimodal text and image input out of the box
- 128k context, unusual at this parameter count
- Permissive Gemma license
Limitations
- Reasoning lags behind 4B and larger Gemma variants
- Gated on Hugging Face (click-through access)
Architecture & training
Architecture: Gemma 4 base · 2B dense · multimodal text + image · 128k context
Training: Part of Google's Gemma 4 family; this is the 2B multimodal base version, trained for edge and laptop deployment.
Verdict
The smallest Gemma 4 that still feels useful — a strong default for edge multimodal apps.
Quick start
ollama run gemma4
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
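For scripted access, a running Ollama instance also exposes a local HTTP API. The following is a minimal sketch that sends a single prompt to the `/api/generate` endpoint on Ollama's default port (11434). It assumes the model is available locally under the `gemma4` tag shown above; the helper names `build_payload` and `generate` are made up for this example.

```python
import json
import urllib.request

# Ollama's default local endpoint; change this if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="gemma4"):
    """Build the JSON body for a single, non-streaming generation request."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, model="gemma4", url=OLLAMA_URL, timeout=120):
    """Send the prompt to a local Ollama server and return the reply text."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the server returns one JSON object whose
    # "response" field holds the generated text.
    return body["response"]
```

With the Ollama server running, `print(generate("Summarize the Gemma license in one sentence."))` should print the model's reply. Setting `"stream": False` keeps the example short; for interactive use, streaming responses are usually preferable.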