Which quantization should I use for Gemma 4?

Q4_K_M is the default sweet spot. Q5_K_M costs roughly 25% more VRAM for under 2 points of benchmark gain. Q3_K_M saves VRAM but loses around 6 points on HumanEval - avoid unless VRAM-starved.

Guide · 2026-06-03

Gemma 4 Local: Ollama Setup & Benchmarks

Last updated 2026-06-03

Google's Gemma 4 brings native multimodality and 3x faster decoding to consumer hardware. Here's the honest setup guide and benchmark verdict.

By Mohamed Meguedmi · 9 min read

Key Takeaways

Gemma 4 ships four sizes: E2B (~6 GB VRAM), E4B (~9 GB), 26B A4B MoE (~16 GB at Q4_K_M), and 31B dense (~22 GB). E4B is the sweet spot for 12 GB cards.
Multi-Token Prediction (MTP) delivers a measured 2.7–3.1x decoding speedup on the 26B A4B variant — fast enough to feel like a 7B dense model.
Native multimodality across every size: text + image input, no separate vision adapter.
Known gotcha: Ollama 0.6.0–0.6.6 has a streaming bug that breaks tool calls on Apple Silicon. Use 0.6.7 or 0.7.1+.
Verdict: Gemma 4 E4B beats Llama 3.3 8B on math and coding; 26B A4B is the throughput champion; 31B trails Qwen3 32B on code but wins on multimodal tasks.

What is new in Gemma 4

Google DeepMind shipped Gemma 4 in April 2026 as a four-tier family: two compact edge models (E2B, E4B) tuned for laptops and 12 GB consumer GPUs, a 26B-parameter Mixture-of-Experts variant with 4B active parameters (A4B) targeting mid-range desktops, and a 31B dense flagship for serious local inference. All four are natively multimodal, accepting text and image input and emitting text output.

Two architectural choices matter for local users:

Multi-Token Prediction (MTP) on the 26B A4B: the model speculatively decodes multiple tokens per forward pass and verifies them in a single batched step, yielding roughly 3x throughput at equivalent quality.
Selective expert routing on the MoE variant keeps active VRAM around 16 GB at Q4_K_M while the full 26B weights stream from system RAM — making it viable on a single RTX 4080 or RTX 5070 Ti.

If you are new to local inference, start with our local LLM guides hub or skim the model catalog for a side-by-side spec sheet.

Hardware requirements by variant

The table below assumes Q4_K_M quantization (Ollama's default) and a 4k context window. Higher context or Q5/Q8 quants scale VRAM roughly linearly.

Variant	Parameters	Q4_K_M VRAM	Recommended GPU	CPU fallback
Gemma 4 E2B	2.4 B	~6 GB	RTX 3060 12 GB, M2/M3 8 GB	16 GB RAM, 8–12 tok/s
Gemma 4 E4B	4.3 B	~9 GB	RTX 4060 Ti 16 GB, M3 Pro 18 GB	32 GB RAM, 3–5 tok/s
Gemma 4 26B A4B	26 B (4 B active)	~16 GB active + 12 GB RAM	RTX 4080 16 GB, RTX 5070 Ti 16 GB	Not recommended (MoE thrashes)
Gemma 4 31B	31 B	~22 GB	RTX 4090 24 GB, RTX 6000 Ada	64 GB RAM, <1 tok/s

For a precise local-vs-cloud break-even on your projected token volume, drop the active-parameter count into our cost calculator.

Installing Ollama and pulling Gemma 4

Ollama remains the lowest-friction runtime for Gemma 4. Below are the four canonical pulls plus a verification step that catches the streaming bug Daniel Vaughan documented on Apple Silicon in April 2026.

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from ollama.com/download/windows

Pin a known-good version. The Gemma 4 launch coincided with a streaming regression in Ollama 0.6.0–0.6.6 that broke tool-calling on M-series chips. Confirm 0.6.7 or later:

ollama --version

Step 2: Pull the right variant

ollama pull gemma4:2b         # E2B, ~1.8 GB on disk
ollama pull gemma4:4b         # E4B, ~3.1 GB
ollama pull gemma4:26b-a4b    # MoE, ~15 GB
ollama pull gemma4:31b        # dense flagship, ~19 GB

Step 3: Set a sane context window

Ollama defaults to 2k context, which truncates real prompts. Override at runtime or in a Modelfile:

ollama run gemma4:4b --ctx 8192
# or in a Modelfile: PARAMETER num_ctx 8192

Step 4: Smoke-test multimodality

ollama run gemma4:4b "Describe this image" /path/to/image.png

If the model replies I cannot see images, you have an older non-multimodal pull. Re-pull explicitly against the :4b tag and confirm against the official Ollama Gemma 4 page.

Benchmarks: Gemma 4 vs Qwen 3 vs Llama 3.3

The numbers below combine Google's April 2026 release benchmarks with independent runs from the BestLLMfor benchmark pipeline. Local results use Q4_K_M on an RTX 4090 24 GB at 4k context, batch=1.

Model	MMLU	HumanEval	GSM8K	MMMU (multimodal)	Decode tok/s (Q4)
Gemma 4 E4B	71.2	68.4	78.1	54.9	92
Llama 3.3 8B	69.8	62.1	74.3	n/a (text only)	87
Qwen3 8B	72.5	71.0	79.4	n/a	89
Gemma 4 26B A4B (MTP)	78.6	76.2	85.3	62.4	104
Gemma 4 31B	81.1	78.0	87.9	66.1	38
Qwen3 32B	80.4	82.1	86.5	n/a	34

Three editorial takeaways:

E4B punches above its weight on math and code thanks to expanded reasoning post-training. It is the strongest 4B-class model we have benchmarked for math homework or autocomplete daemons.
26B A4B is the throughput champion. MTP gives it 104 tok/s decode — faster than most 8B dense models — at quality near the 31B flagship.
31B vs Qwen3 32B is a tie that splits by task. Gemma wins multimodal (Qwen3 32B is text-only at parity size) and math; Qwen3 wins HumanEval by ~4 points.

Methodology and raw logs are published per our benchmarking methodology, and the underlying scores feed the BestLLMfor public benchmarks API (CC BY 4.0) consumed by our catalog and the open-source MCP server.

Multi-Token Prediction in practice

MTP is the headline feature for local users. Standard autoregressive decoding emits one token per forward pass; MTP heads predict the next k tokens speculatively and verify them in a single batched step. On Gemma 4 26B A4B with k=4 we measure a 2.7–3.1x speedup over a hypothetical dense 26B at the same VRAM budget.

Two caveats before you wire MTP into an agent loop:

Speedup degrades on diverse outputs. Highly structured generation (JSON, code) hits the upper bound; creative writing at high temperature falls to ~1.8x.
Backend support is uneven. Ollama exposes MTP transparently and vLLM 0.7+ ships native support. llama.cpp added experimental kernels in May 2026 — see the Unsloth Gemma 4 documentation for current backend status and fine-tune notebooks.

Cost: local Gemma 4 vs hosted APIs

Assume a developer processes 5 M input + 1 M output tokens per month — a realistic workload for a coding assistant plus document Q&A.

Option	Monthly cost	Latency (p50 TTFT)	Notes
Gemma 4 31B local (RTX 4090)	~$8 electricity	180 ms	GPU amortized at $0/mo if already owned
Gemma 4 26B A4B local	~$6 electricity	120 ms	Best perf/cost; runs on 16 GB cards
Gemini 2.5 Flash API	~$4.20	~400 ms	No local infra, vendor lock-in
Claude Haiku 4.5 API	~$11.50	~350 ms	Higher quality, higher cost

At this volume hosted Gemini is marginally cheaper, but the break-even tips toward local around 15 M tokens/month — exactly where many internal tools land. Run your own numbers in the calculator.

Known issues and gotchas

Ollama 0.6.0–0.6.6 streaming bug on Apple Silicon: tool calls return empty deltas. Fix: upgrade to 0.6.7+ or 0.7.x.
Vision token budget: each image consumes ~256 tokens of context. Long PDF pipelines blow through 8k context fast — bump num_ctx to 32k.
26B A4B on AMD ROCm: experimental as of June 2026. CUDA and Metal are stable; ROCm hits a kernel-selection bug with MTP heads.
Quantization quality cliff: Q3_K_M loses ~6 points on HumanEval vs Q4_K_M. Do not go below Q4 unless VRAM-starved.
Safety filter false positives: Gemma 4's built-in refusals trigger on benign security and pentest prompts more often than Llama 3.3. Use a system prompt to set context.

Verdict: which Gemma 4 should you pull?

If you have...	Pull	Why
8 GB GPU or M2 laptop	`gemma4:2b`	Only viable option; surprisingly capable for size
12–16 GB GPU	`gemma4:4b`	Beats Llama 3.3 8B on math, fits comfortably
RTX 4080/5070 Ti, throughput priority	`gemma4:26b-a4b`	MTP makes it the fastest high-quality local model in 2026
RTX 4090/6000, quality priority	`gemma4:31b`	Best multimodal local model; ties Qwen3 32B on text
Code-heavy workload	Qwen3-Coder 32B Q4_K_M	Gemma 4 31B loses HumanEval; see best coding LLMs

For most readers with a 16 GB card, the answer is Gemma 4 26B A4B. It is the first local MoE that meaningfully outperforms dense alternatives on a single consumer GPU, and MTP makes it feel like running a 7B model. The 31B is worth the extra VRAM only if multimodal quality is non-negotiable. For details on each tag — sizes, license, multimodal support — cross-reference the 31B model page on ollama.com.

Frequently Asked Questions

Can Gemma 4 run on a CPU only?

The E2B variant runs at usable speed (8–12 tok/s) on a modern x86 CPU with 16 GB RAM. E4B drops to 3–5 tok/s. Anything larger is sub-1 tok/s and impractical without a GPU.

Is Gemma 4 better than Llama 3.3?

For comparable parameter counts, Gemma 4 wins on math, multimodality, and decoding speed (with MTP). Llama 3.3 still has a slight edge on long-context recall and is the safer bet for production where determinism matters more than headline benchmarks.

Why does my Gemma 4 output get truncated?

Ollama defaults to 2048 tokens of context. Override with --ctx 8192 at runtime or PARAMETER num_ctx in a Modelfile. Gemma 4 supports up to 128k context natively.

Can I fine-tune Gemma 4 locally?

Yes. Unsloth and Axolotl both support Gemma 4 with QLoRA on a single 24 GB card for E4B and on a 48 GB card for 31B. The license permits commercial fine-tunes.

Does Gemma 4 support function calling?

Yes, via the standard OpenAI-compatible tool schema. Reliability is comparable to Llama 3.3 — fine for single-step tool use, occasionally needs prompt scaffolding for multi-step agentic flows.

Which quantization should I use?

Q4_K_M is the default sweet spot. Q5_K_M costs ~25% more VRAM for <2 points of benchmark gain. Q3_K_M saves VRAM but loses ~6 points on HumanEval — avoid unless desperate.

Going further

Pull the variant that matches your VRAM, run the smoke test, and benchmark against your real workload — synthetic scores only get you so far. Cross-reference the BestLLMfor catalog for tag updates and per-model leaderboards. The public benchmarks API and open-source MCP server are CC BY 4.0 — fork the data and build your own dashboards.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

Amazon Check RTX 5070 Ti price →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.