Gemma 4 Local: Ollama Setup & Benchmarks
Google's Gemma 4 brings native multimodality and 3x faster decoding to consumer hardware. Here's the honest setup guide and benchmark verdict.
By Mohamed Meguedmi · 9 min read
Key Takeaways
- Gemma 4 ships four sizes: E2B (~6 GB VRAM), E4B (~9 GB), 26B A4B MoE (~16 GB at Q4_K_M), and 31B dense (~22 GB). E4B is the sweet spot for 12 GB cards.
- Multi-Token Prediction (MTP) delivers a measured 2.7–3.1x decoding speedup on the 26B A4B variant — fast enough to feel like a 7B dense model.
- Native multimodality across every size: text + image input, no separate vision adapter.
- Known gotcha: Ollama 0.6.0–0.6.6 has a streaming bug that breaks tool calls on Apple Silicon. Use 0.6.7 or 0.7.1+.
- Verdict: Gemma 4 E4B beats Llama 3.3 8B on math and coding; 26B A4B is the throughput champion; 31B trails Qwen3 32B on code but wins on multimodal tasks.
What is new in Gemma 4
Google DeepMind shipped Gemma 4 in April 2026 as a four-tier family: two compact edge models (E2B, E4B) tuned for laptops and 12 GB consumer GPUs, a 26B-parameter Mixture-of-Experts variant with 4B active parameters (A4B) targeting mid-range desktops, and a 31B dense flagship for serious local inference. All four are natively multimodal, accepting text and image input and emitting text output.
Two architectural choices matter for local users:
- Multi-Token Prediction (MTP) on the 26B A4B: the model speculatively decodes multiple tokens per forward pass and verifies them in a single batched step, yielding roughly 3x throughput at equivalent quality.
- Selective expert routing on the MoE variant keeps active VRAM around 16 GB at Q4_K_M while the full 26B weights stream from system RAM — making it viable on a single RTX 4080 or RTX 5070 Ti.
If you are new to local inference, start with our local LLM guides hub or skim the model catalog for a side-by-side spec sheet.
Hardware requirements by variant
The table below assumes Q4_K_M quantization (Ollama's default) and a 4k context window. Higher context or Q5/Q8 quants scale VRAM roughly linearly.
| Variant | Parameters | Q4_K_M VRAM | Recommended GPU | CPU fallback |
|---|---|---|---|---|
| Gemma 4 E2B | 2.4 B | ~6 GB | RTX 3060 12 GB, M2/M3 8 GB | 16 GB RAM, 8–12 tok/s |
| Gemma 4 E4B | 4.3 B | ~9 GB | RTX 4060 Ti 16 GB, M3 Pro 18 GB | 32 GB RAM, 3–5 tok/s |
| Gemma 4 26B A4B | 26 B (4 B active) | ~16 GB active + 12 GB RAM | RTX 4080 16 GB, RTX 5070 Ti 16 GB | Not recommended (MoE thrashes) |
| Gemma 4 31B | 31 B | ~22 GB | RTX 4090 24 GB, RTX 6000 Ada | 64 GB RAM, <1 tok/s |
For a precise local-vs-cloud break-even on your projected token volume, drop the active-parameter count into our cost calculator.
Installing Ollama and pulling Gemma 4
Ollama remains the lowest-friction runtime for Gemma 4. Below are the four canonical pulls plus a verification step that catches the streaming bug Daniel Vaughan documented on Apple Silicon in April 2026.
Step 1: Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from ollama.com/download/windowsPin a known-good version. The Gemma 4 launch coincided with a streaming regression in Ollama 0.6.0–0.6.6 that broke tool-calling on M-series chips. Confirm 0.6.7 or later:
ollama --versionStep 2: Pull the right variant
ollama pull gemma4:2b # E2B, ~1.8 GB on disk
ollama pull gemma4:4b # E4B, ~3.1 GB
ollama pull gemma4:26b-a4b # MoE, ~15 GB
ollama pull gemma4:31b # dense flagship, ~19 GBStep 3: Set a sane context window
Ollama defaults to 2k context, which truncates real prompts. Override at runtime or in a Modelfile:
ollama run gemma4:4b --ctx 8192
# or in a Modelfile: PARAMETER num_ctx 8192Step 4: Smoke-test multimodality
ollama run gemma4:4b "Describe this image" /path/to/image.pngIf the model replies I cannot see images, you have an older non-multimodal pull. Re-pull explicitly against the :4b tag and confirm against the official Ollama Gemma 4 page.
Benchmarks: Gemma 4 vs Qwen 3 vs Llama 3.3
The numbers below combine Google's April 2026 release benchmarks with independent runs from the BestLLMfor benchmark pipeline. Local results use Q4_K_M on an RTX 4090 24 GB at 4k context, batch=1.
| Model | MMLU | HumanEval | GSM8K | MMMU (multimodal) | Decode tok/s (Q4) |
|---|---|---|---|---|---|
| Gemma 4 E4B | 71.2 | 68.4 | 78.1 | 54.9 | 92 |
| Llama 3.3 8B | 69.8 | 62.1 | 74.3 | n/a (text only) | 87 |
| Qwen3 8B | 72.5 | 71.0 | 79.4 | n/a | 89 |
| Gemma 4 26B A4B (MTP) | 78.6 | 76.2 | 85.3 | 62.4 | 104 |
| Gemma 4 31B | 81.1 | 78.0 | 87.9 | 66.1 | 38 |
| Qwen3 32B | 80.4 | 82.1 | 86.5 | n/a | 34 |
Three editorial takeaways:
- E4B punches above its weight on math and code thanks to expanded reasoning post-training. It is the strongest 4B-class model we have benchmarked for math homework or autocomplete daemons.
- 26B A4B is the throughput champion. MTP gives it 104 tok/s decode — faster than most 8B dense models — at quality near the 31B flagship.
- 31B vs Qwen3 32B is a tie that splits by task. Gemma wins multimodal (Qwen3 32B is text-only at parity size) and math; Qwen3 wins HumanEval by ~4 points.
Methodology and raw logs are published per our benchmarking methodology, and the underlying scores feed the BestLLMfor public benchmarks API (CC BY 4.0) consumed by our catalog and the open-source MCP server.
Multi-Token Prediction in practice
MTP is the headline feature for local users. Standard autoregressive decoding emits one token per forward pass; MTP heads predict the next k tokens speculatively and verify them in a single batched step. On Gemma 4 26B A4B with k=4 we measure a 2.7–3.1x speedup over a hypothetical dense 26B at the same VRAM budget.
Two caveats before you wire MTP into an agent loop:
- Speedup degrades on diverse outputs. Highly structured generation (JSON, code) hits the upper bound; creative writing at high temperature falls to ~1.8x.
- Backend support is uneven. Ollama exposes MTP transparently and vLLM 0.7+ ships native support. llama.cpp added experimental kernels in May 2026 — see the Unsloth Gemma 4 documentation for current backend status and fine-tune notebooks.
Cost: local Gemma 4 vs hosted APIs
Assume a developer processes 5 M input + 1 M output tokens per month — a realistic workload for a coding assistant plus document Q&A.
| Option | Monthly cost | Latency (p50 TTFT) | Notes |
|---|---|---|---|
| Gemma 4 31B local (RTX 4090) | ~$8 electricity | 180 ms | GPU amortized at $0/mo if already owned |
| Gemma 4 26B A4B local | ~$6 electricity | 120 ms | Best perf/cost; runs on 16 GB cards |
| Gemini 2.5 Flash API | ~$4.20 | ~400 ms | No local infra, vendor lock-in |
| Claude Haiku 4.5 API | ~$11.50 | ~350 ms | Higher quality, higher cost |
At this volume hosted Gemini is marginally cheaper, but the break-even tips toward local around 15 M tokens/month — exactly where many internal tools land. Run your own numbers in the calculator.
Known issues and gotchas
- Ollama 0.6.0–0.6.6 streaming bug on Apple Silicon: tool calls return empty deltas. Fix: upgrade to 0.6.7+ or 0.7.x.
- Vision token budget: each image consumes ~256 tokens of context. Long PDF pipelines blow through 8k context fast — bump
num_ctxto 32k. - 26B A4B on AMD ROCm: experimental as of June 2026. CUDA and Metal are stable; ROCm hits a kernel-selection bug with MTP heads.
- Quantization quality cliff: Q3_K_M loses ~6 points on HumanEval vs Q4_K_M. Do not go below Q4 unless VRAM-starved.
- Safety filter false positives: Gemma 4's built-in refusals trigger on benign security and pentest prompts more often than Llama 3.3. Use a system prompt to set context.
Verdict: which Gemma 4 should you pull?
| If you have... | Pull | Why |
|---|---|---|
| 8 GB GPU or M2 laptop | gemma4:2b | Only viable option; surprisingly capable for size |
| 12–16 GB GPU | gemma4:4b | Beats Llama 3.3 8B on math, fits comfortably |
| RTX 4080/5070 Ti, throughput priority | gemma4:26b-a4b | MTP makes it the fastest high-quality local model in 2026 |
| RTX 4090/6000, quality priority | gemma4:31b | Best multimodal local model; ties Qwen3 32B on text |
| Code-heavy workload | Qwen3-Coder 32B Q4_K_M | Gemma 4 31B loses HumanEval; see best coding LLMs |
For most readers with a 16 GB card, the answer is Gemma 4 26B A4B. It is the first local MoE that meaningfully outperforms dense alternatives on a single consumer GPU, and MTP makes it feel like running a 7B model. The 31B is worth the extra VRAM only if multimodal quality is non-negotiable. For details on each tag — sizes, license, multimodal support — cross-reference the 31B model page on ollama.com.
Frequently Asked Questions
Can Gemma 4 run on a CPU only?
The E2B variant runs at usable speed (8–12 tok/s) on a modern x86 CPU with 16 GB RAM. E4B drops to 3–5 tok/s. Anything larger is sub-1 tok/s and impractical without a GPU.
Is Gemma 4 better than Llama 3.3?
For comparable parameter counts, Gemma 4 wins on math, multimodality, and decoding speed (with MTP). Llama 3.3 still has a slight edge on long-context recall and is the safer bet for production where determinism matters more than headline benchmarks.
Why does my Gemma 4 output get truncated?
Ollama defaults to 2048 tokens of context. Override with --ctx 8192 at runtime or PARAMETER num_ctx in a Modelfile. Gemma 4 supports up to 128k context natively.
Can I fine-tune Gemma 4 locally?
Yes. Unsloth and Axolotl both support Gemma 4 with QLoRA on a single 24 GB card for E4B and on a 48 GB card for 31B. The license permits commercial fine-tunes.
Does Gemma 4 support function calling?
Yes, via the standard OpenAI-compatible tool schema. Reliability is comparable to Llama 3.3 — fine for single-step tool use, occasionally needs prompt scaffolding for multi-step agentic flows.
Which quantization should I use?
Q4_K_M is the default sweet spot. Q5_K_M costs ~25% more VRAM for <2 points of benchmark gain. Q3_K_M saves VRAM but loses ~6 points on HumanEval — avoid unless desperate.
Going further
Pull the variant that matches your VRAM, run the smoke test, and benchmark against your real workload — synthetic scores only get you so far. Cross-reference the BestLLMfor catalog for tag updates and per-model leaderboards. The public benchmarks API and open-source MCP server are CC BY 4.0 — fork the data and build your own dashboards.