BestLLMfor EN Your hardware. Your LLM. Your call.
APIOpen data Find my LLM
Guide · 2026-06-03

Gemma 4 Local: Ollama Setup & Benchmarks

Google's Gemma 4 brings native multimodality and 3x faster decoding to consumer hardware. Here's the honest setup guide and benchmark verdict.

By Mohamed Meguedmi · 9 min read

Key Takeaways

  • Gemma 4 ships four sizes: E2B (~6 GB VRAM), E4B (~9 GB), 26B A4B MoE (~16 GB at Q4_K_M), and 31B dense (~22 GB). E4B is the sweet spot for 12 GB cards.
  • Multi-Token Prediction (MTP) delivers a measured 2.7–3.1x decoding speedup on the 26B A4B variant — fast enough to feel like a 7B dense model.
  • Native multimodality across every size: text + image input, no separate vision adapter.
  • Known gotcha: Ollama 0.6.0–0.6.6 has a streaming bug that breaks tool calls on Apple Silicon. Use 0.6.7 or 0.7.1+.
  • Verdict: Gemma 4 E4B beats Llama 3.3 8B on math and coding; 26B A4B is the throughput champion; 31B trails Qwen3 32B on code but wins on multimodal tasks.

What is new in Gemma 4

Google DeepMind shipped Gemma 4 in April 2026 as a four-tier family: two compact edge models (E2B, E4B) tuned for laptops and 12 GB consumer GPUs, a 26B-parameter Mixture-of-Experts variant with 4B active parameters (A4B) targeting mid-range desktops, and a 31B dense flagship for serious local inference. All four are natively multimodal, accepting text and image input and emitting text output.

Two architectural choices matter for local users:

  • Multi-Token Prediction (MTP) on the 26B A4B: the model speculatively decodes multiple tokens per forward pass and verifies them in a single batched step, yielding roughly 3x throughput at equivalent quality.
  • Selective expert routing on the MoE variant keeps active VRAM around 16 GB at Q4_K_M while the full 26B weights stream from system RAM — making it viable on a single RTX 4080 or RTX 5070 Ti.

If you are new to local inference, start with our local LLM guides hub or skim the model catalog for a side-by-side spec sheet.

Hardware requirements by variant

The table below assumes Q4_K_M quantization (Ollama's default) and a 4k context window. Higher context or Q5/Q8 quants scale VRAM roughly linearly.

VariantParametersQ4_K_M VRAMRecommended GPUCPU fallback
Gemma 4 E2B2.4 B~6 GBRTX 3060 12 GB, M2/M3 8 GB16 GB RAM, 8–12 tok/s
Gemma 4 E4B4.3 B~9 GBRTX 4060 Ti 16 GB, M3 Pro 18 GB32 GB RAM, 3–5 tok/s
Gemma 4 26B A4B26 B (4 B active)~16 GB active + 12 GB RAMRTX 4080 16 GB, RTX 5070 Ti 16 GBNot recommended (MoE thrashes)
Gemma 4 31B31 B~22 GBRTX 4090 24 GB, RTX 6000 Ada64 GB RAM, <1 tok/s

For a precise local-vs-cloud break-even on your projected token volume, drop the active-parameter count into our cost calculator.

Installing Ollama and pulling Gemma 4

Ollama remains the lowest-friction runtime for Gemma 4. Below are the four canonical pulls plus a verification step that catches the streaming bug Daniel Vaughan documented on Apple Silicon in April 2026.

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from ollama.com/download/windows

Pin a known-good version. The Gemma 4 launch coincided with a streaming regression in Ollama 0.6.0–0.6.6 that broke tool-calling on M-series chips. Confirm 0.6.7 or later:

ollama --version

Step 2: Pull the right variant

ollama pull gemma4:2b         # E2B, ~1.8 GB on disk
ollama pull gemma4:4b         # E4B, ~3.1 GB
ollama pull gemma4:26b-a4b    # MoE, ~15 GB
ollama pull gemma4:31b        # dense flagship, ~19 GB

Step 3: Set a sane context window

Ollama defaults to 2k context, which truncates real prompts. Override at runtime or in a Modelfile:

ollama run gemma4:4b --ctx 8192
# or in a Modelfile: PARAMETER num_ctx 8192

Step 4: Smoke-test multimodality

ollama run gemma4:4b "Describe this image" /path/to/image.png

If the model replies I cannot see images, you have an older non-multimodal pull. Re-pull explicitly against the :4b tag and confirm against the official Ollama Gemma 4 page.

Benchmarks: Gemma 4 vs Qwen 3 vs Llama 3.3

The numbers below combine Google's April 2026 release benchmarks with independent runs from the BestLLMfor benchmark pipeline. Local results use Q4_K_M on an RTX 4090 24 GB at 4k context, batch=1.

ModelMMLUHumanEvalGSM8KMMMU (multimodal)Decode tok/s (Q4)
Gemma 4 E4B71.268.478.154.992
Llama 3.3 8B69.862.174.3n/a (text only)87
Qwen3 8B72.571.079.4n/a89
Gemma 4 26B A4B (MTP)78.676.285.362.4104
Gemma 4 31B81.178.087.966.138
Qwen3 32B80.482.186.5n/a34

Three editorial takeaways:

  • E4B punches above its weight on math and code thanks to expanded reasoning post-training. It is the strongest 4B-class model we have benchmarked for math homework or autocomplete daemons.
  • 26B A4B is the throughput champion. MTP gives it 104 tok/s decode — faster than most 8B dense models — at quality near the 31B flagship.
  • 31B vs Qwen3 32B is a tie that splits by task. Gemma wins multimodal (Qwen3 32B is text-only at parity size) and math; Qwen3 wins HumanEval by ~4 points.

Methodology and raw logs are published per our benchmarking methodology, and the underlying scores feed the BestLLMfor public benchmarks API (CC BY 4.0) consumed by our catalog and the open-source MCP server.

Multi-Token Prediction in practice

MTP is the headline feature for local users. Standard autoregressive decoding emits one token per forward pass; MTP heads predict the next k tokens speculatively and verify them in a single batched step. On Gemma 4 26B A4B with k=4 we measure a 2.7–3.1x speedup over a hypothetical dense 26B at the same VRAM budget.

Two caveats before you wire MTP into an agent loop:

  • Speedup degrades on diverse outputs. Highly structured generation (JSON, code) hits the upper bound; creative writing at high temperature falls to ~1.8x.
  • Backend support is uneven. Ollama exposes MTP transparently and vLLM 0.7+ ships native support. llama.cpp added experimental kernels in May 2026 — see the Unsloth Gemma 4 documentation for current backend status and fine-tune notebooks.

Cost: local Gemma 4 vs hosted APIs

Assume a developer processes 5 M input + 1 M output tokens per month — a realistic workload for a coding assistant plus document Q&A.

OptionMonthly costLatency (p50 TTFT)Notes
Gemma 4 31B local (RTX 4090)~$8 electricity180 msGPU amortized at $0/mo if already owned
Gemma 4 26B A4B local~$6 electricity120 msBest perf/cost; runs on 16 GB cards
Gemini 2.5 Flash API~$4.20~400 msNo local infra, vendor lock-in
Claude Haiku 4.5 API~$11.50~350 msHigher quality, higher cost

At this volume hosted Gemini is marginally cheaper, but the break-even tips toward local around 15 M tokens/month — exactly where many internal tools land. Run your own numbers in the calculator.

Known issues and gotchas

  • Ollama 0.6.0–0.6.6 streaming bug on Apple Silicon: tool calls return empty deltas. Fix: upgrade to 0.6.7+ or 0.7.x.
  • Vision token budget: each image consumes ~256 tokens of context. Long PDF pipelines blow through 8k context fast — bump num_ctx to 32k.
  • 26B A4B on AMD ROCm: experimental as of June 2026. CUDA and Metal are stable; ROCm hits a kernel-selection bug with MTP heads.
  • Quantization quality cliff: Q3_K_M loses ~6 points on HumanEval vs Q4_K_M. Do not go below Q4 unless VRAM-starved.
  • Safety filter false positives: Gemma 4's built-in refusals trigger on benign security and pentest prompts more often than Llama 3.3. Use a system prompt to set context.

Verdict: which Gemma 4 should you pull?

If you have...PullWhy
8 GB GPU or M2 laptopgemma4:2bOnly viable option; surprisingly capable for size
12–16 GB GPUgemma4:4bBeats Llama 3.3 8B on math, fits comfortably
RTX 4080/5070 Ti, throughput prioritygemma4:26b-a4bMTP makes it the fastest high-quality local model in 2026
RTX 4090/6000, quality prioritygemma4:31bBest multimodal local model; ties Qwen3 32B on text
Code-heavy workloadQwen3-Coder 32B Q4_K_MGemma 4 31B loses HumanEval; see best coding LLMs

For most readers with a 16 GB card, the answer is Gemma 4 26B A4B. It is the first local MoE that meaningfully outperforms dense alternatives on a single consumer GPU, and MTP makes it feel like running a 7B model. The 31B is worth the extra VRAM only if multimodal quality is non-negotiable. For details on each tag — sizes, license, multimodal support — cross-reference the 31B model page on ollama.com.

Frequently Asked Questions

Can Gemma 4 run on a CPU only?

The E2B variant runs at usable speed (8–12 tok/s) on a modern x86 CPU with 16 GB RAM. E4B drops to 3–5 tok/s. Anything larger is sub-1 tok/s and impractical without a GPU.

Is Gemma 4 better than Llama 3.3?

For comparable parameter counts, Gemma 4 wins on math, multimodality, and decoding speed (with MTP). Llama 3.3 still has a slight edge on long-context recall and is the safer bet for production where determinism matters more than headline benchmarks.

Why does my Gemma 4 output get truncated?

Ollama defaults to 2048 tokens of context. Override with --ctx 8192 at runtime or PARAMETER num_ctx in a Modelfile. Gemma 4 supports up to 128k context natively.

Can I fine-tune Gemma 4 locally?

Yes. Unsloth and Axolotl both support Gemma 4 with QLoRA on a single 24 GB card for E4B and on a 48 GB card for 31B. The license permits commercial fine-tunes.

Does Gemma 4 support function calling?

Yes, via the standard OpenAI-compatible tool schema. Reliability is comparable to Llama 3.3 — fine for single-step tool use, occasionally needs prompt scaffolding for multi-step agentic flows.

Which quantization should I use?

Q4_K_M is the default sweet spot. Q5_K_M costs ~25% more VRAM for <2 points of benchmark gain. Q3_K_M saves VRAM but loses ~6 points on HumanEval — avoid unless desperate.

Going further

Pull the variant that matches your VRAM, run the smoke test, and benchmark against your real workload — synthetic scores only get you so far. Cross-reference the BestLLMfor catalog for tag updates and per-model leaderboards. The public benchmarks API and open-source MCP server are CC BY 4.0 — fork the data and build your own dashboards.