Model fiche
Gemma 3 4B
By Google · United States
chat
general
vision
multilingual
small
Overview
Google's compact 4B-parameter multimodal model, with a 128K-token context window, vision input, and coverage of 140+ languages. It is the smallest Gemma 3 model that keeps the full feature set intact.
When to pick this model
- You need vision and long context on a low-VRAM machine or edge device
- You're shipping multilingual apps and need broad language coverage
- You want one small model for both text and image inputs
- You're prototyping before scaling to 12B or 27B
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 10 GB |
| Q5_K_M | 12 GB |
| Q8_0 | 18 GB |
| FP16 (no quantization) | 33 GB |
VRAM figures include model weights, the KV cache for a typical 8K-token context, and ~600 MB of runtime overhead (Ollama / llama.cpp baseline). Add headroom for longer contexts.
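As a rough illustration of how the table can be used, the sketch below picks the largest listed quantization that fits a given VRAM budget. The figures are copied from the table above; the function name `pick_quantization` and the `headroom_gb` parameter are hypothetical helpers introduced here, not part of Ollama or llama.cpp.

```python
# VRAM figures (GB) copied from the table above, ordered smallest to largest.
VRAM_GB = {
    "Q4_K_M": 10,
    "Q5_K_M": 12,
    "Q8_0": 18,
    "FP16": 33,
}


def pick_quantization(available_gb, headroom_gb=0.0):
    """Return the largest listed quantization that fits, or None if none fits.

    headroom_gb is an assumed reserve for longer contexts or other processes.
    """
    budget = available_gb - headroom_gb
    best = None
    for name, required in VRAM_GB.items():
        if required <= budget:
            best = name  # dicts keep insertion order, so later entries are larger
    return best


print(pick_quantization(16))                 # 16 GB card, no reserve
print(pick_quantization(24, headroom_gb=8))  # 24 GB card, 8 GB kept free
```

Reserving headroom up front is one simple way to account for contexts longer than the 8K baseline the table assumes.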
Strengths
- Multimodal in a 4B footprint
- 140+ language coverage
- 128K context
- Sliding-window attention keeps memory in check
Limitations
- Gemma License — review terms before commercial use
- Trails the 12B and 27B on reasoning and code
Architecture & training
Architecture: Dense · multimodal text+vision · sliding-window attention (5:1 local:global)
Training: 4T tokens, 140+ languages.
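To make the 5:1 local:global layout concrete, here is a minimal sketch of why it reduces KV-cache size. It assumes the pattern repeats five local layers followed by one global layer, and it counts cached token slots per layer rather than bytes. The layer count, window size, and context length are illustrative values chosen for the example, not figures from this fiche.

```python
def layer_types(n_layers, local_per_global=5):
    """Label each layer, repeating `local_per_global` local layers then one global."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]


def cached_token_slots(n_layers, context_len, window, local_per_global=5):
    """Total token slots held in the KV cache, summed over all layers."""
    total = 0
    for kind in layer_types(n_layers, local_per_global):
        # A local layer only keeps tokens inside its window;
        # a global layer keeps the whole context.
        total += context_len if kind == "global" else min(context_len, window)
    return total


# Illustrative values only (assumed, not taken from the fiche).
N_LAYERS, WINDOW, CONTEXT = 12, 1024, 8192

all_global = N_LAYERS * CONTEXT
mixed = cached_token_slots(N_LAYERS, CONTEXT, WINDOW)
print(f"all-global: {all_global} slots, 5:1 mix: {mixed} slots")
```

Because only the global layers grow with context length, the gap between the two totals widens as the context gets longer.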
Verdict
The most capable 4B multimodal model you can run locally, and a strong default for resource-constrained deployments.
Quick start
ollama run gemma3:4b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
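Besides the CLI, a locally running Ollama server can be queried over HTTP. The sketch below assumes Ollama is listening on its default address (`http://localhost:11434`) and that the `gemma3:4b` model has already been pulled. The helper names `build_payload` and `generate` are hypothetical, and the optional image argument illustrates the vision input path.

```python
import base64
import json
import urllib.request

# Assumed default address of a local Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, image_bytes=None, model="gemma3:4b"):
    """Build the JSON body for a single non-streaming generation request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if image_bytes is not None:
        # Images are sent as base64-encoded strings.
        payload["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return payload


def generate(prompt, image_bytes=None):
    """Send the request and return the generated text (needs a running server)."""
    body = json.dumps(build_payload(prompt, image_bytes)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Example call, commented out because it requires a running Ollama server:
# print(generate("Describe this image.", open("photo.jpg", "rb").read()))
```

Keeping payload construction separate from the network call makes the request shape easy to inspect or log before anything is sent.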