Gemma 4 E2B
By Google · United States
Overview
Google's edge-optimized Gemma 4: 2B effective params, full text + image multimodal, 128k context, and a configurable thinking mode. Built for laptops, mobile, and CPU inference.
When to pick this model
- On-device multimodal apps on laptops and phones
- CPU or low-end GPU inference in about 7 GB at Q4_K_M
- Long-context tasks up to 128k at edge scale
- Quick-toggle thinking mode for harder prompts
- 140+ language coverage in a tiny footprint
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 7 GB |
| Q5_K_M | 9 GB |
| Q8_0 | 13 GB |
| FP16 (no quantization) | 25 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
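As a rough illustration of how the table might be used, the sketch below picks the highest-fidelity quantization that fits a given memory budget. The figures are copied from the table above. The `pick_quantization` helper and its `headroom_gb` parameter are hypothetical names introduced here, and the headroom value is something you would have to estimate yourself for contexts longer than 8k.

```python
# VRAM figures in GB, copied from the table above. Each already includes
# weights, a typical 8k KV cache, and ~600 MB of runtime overhead.
VRAM_GB = {
    "Q4_K_M": 7,
    "Q5_K_M": 9,
    "Q8_0": 13,
    "FP16": 25,
}


def pick_quantization(available_gb, headroom_gb=0.0):
    """Return the largest listed quantization that fits, or None.

    headroom_gb is a hypothetical knob: memory you reserve yourself for
    longer contexts or other processes. It is not a measured value.
    """
    budget = available_gb - headroom_gb
    fitting = [(need, name) for name, need in VRAM_GB.items() if need <= budget]
    if not fitting:
        return None
    # The largest memory requirement that still fits is the least compressed option.
    return max(fitting)[1]


print(pick_quantization(8))                  # Q4_K_M
print(pick_quantization(16))                 # Q8_0
print(pick_quantization(16, headroom_gb=4))  # Q5_K_M
print(pick_quantization(6))                  # None
```

Reserving headroom pushes the choice toward a smaller quantization, which mirrors the advice to leave room for higher context lengths.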
Strengths
- Full multimodal in ~7 GB at Q4
- Runs on CPU or entry-level GPU
- 128k context
- Thinking mode toggle
- Open Gemma license
Limitations
- Quality trails the E4B and 26B variants
- Reasoning benchmark scores fall well below those of larger models
- Gemma license isn't Apache or MIT
Architecture & training
Architecture: Dense E2B (2B effective) · multimodal text+image · 128k ctx · configurable thinking
Training: Ultra-compact edge edition of Gemma 4, with an architecture optimized for on-device and mobile use. Covers 140+ languages.
The Gemma 4 to pick when you're shipping on-device — small, multimodal, and surprisingly long-context.
Quick start
ollama run gemma4:e2b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
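Beyond the interactive command, the model can also be called programmatically once Ollama is serving it. The sketch below is a minimal example, assuming a local Ollama server on its default port (11434) and that the `gemma4:e2b` tag from the quick start has already been pulled. It uses Ollama's `/api/generate` endpoint with streaming disabled. The helper names `build_request` and `generate` are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_TAG = "gemma4:e2b"  # model tag taken from the quick-start command


def build_request(prompt, model=MODEL_TAG, url=OLLAMA_URL):
    """Build a non-streaming generate request for the Ollama HTTP API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        return json.loads(resp.read())["response"]


# Example call (needs a running Ollama server with the model available):
# print(generate("Summarize on-device inference in one sentence."))
```

Separating request construction from the network call keeps the payload easy to inspect and adjust, for example to change the prompt or model tag, without contacting the server.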