Model card
Gemma 4 26B-A4B MoE
By Google · United States
Tags: chat, general, vision, audio, multilingual, moe
Overview
Google's mixture-of-experts (MoE) variant of Gemma 4, with 26B total parameters, 4B active per token, and full text, image, and audio multimodality. It is the smallest open model with native audio understanding at this quality level.
When to pick this model
- Multimodal apps that need text, image, and audio in one model
- Voice-driven assistants and audio analysis pipelines
- Long-context reasoning over mixed-media inputs (128k-token context)
- On-prem deployments where Google's tooling integrates cleanly
- Replacing three separate models (text, vision, audio) with one
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 16 GB |
| Q5_K_M | 19 GB |
| Q8_0 | 28 GB |
| FP16 (no quantization) | 52 GB |
VRAM figures include model weights plus a typical 8k-token KV cache and ~600 MB of runtime overhead (Ollama / llama.cpp baseline). Add headroom for longer context lengths.
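As a rough illustration of how the table can be used, the sketch below picks the highest-precision quantization that fits a given GPU. The figures are copied from the table; the function name `pick_quantization` and the `headroom_gb` parameter are hypothetical, introduced only for this example.

```python
# VRAM requirements in GB, copied from the table above.
# They assume an 8k-token KV cache and ~600 MB of runtime overhead.
VRAM_GB = {
    "Q4_K_M": 16,
    "Q5_K_M": 19,
    "Q8_0": 28,
    "FP16": 52,
}


def pick_quantization(available_gb, headroom_gb=0.0):
    """Return the highest-precision quantization that fits, or None.

    headroom_gb is a hypothetical safety margin reserved for longer
    contexts; it is not a figure from the model card.
    """
    budget = available_gb - headroom_gb
    fitting = [(gb, name) for name, gb in VRAM_GB.items() if gb <= budget]
    if not fitting:
        return None
    # The largest requirement that still fits is the least aggressive quantization.
    return max(fitting)[1]
```

For example, `pick_quantization(24)` selects Q5_K_M, while reserving 6 GB for a longer context with `pick_quantization(24, headroom_gb=6)` falls back to Q4_K_M.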
Strengths
- Unified text, image, and audio in a single 26B-total / 4B-active MoE
- 128k-token context window
- Strong reasoning relative to size
- Backed by Google's training infrastructure and corpus
- 4B active params keep inference cheap
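To put the active-parameter point in numbers, here is a back-of-the-envelope sketch. It assumes the common rule of thumb of about 2 FLOPs per active parameter per generated token, which ignores attention and routing costs, so the result is an order-of-magnitude estimate and not a benchmark. The full set of weights still has to be loaded, so memory use tracks the total parameter count.

```python
TOTAL_PARAMS = 26e9   # total parameters, from the model card
ACTIVE_PARAMS = 4e9   # parameters active per token, from the model card


def forward_flops_per_token(params):
    # Rule-of-thumb assumption: ~2 FLOPs per parameter per token.
    return 2 * params


# Share of the weights that participate in each token's forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# How much more compute a dense model of the same total size would need per token.
dense_to_moe_ratio = (
    forward_flops_per_token(TOTAL_PARAMS) / forward_flops_per_token(ACTIVE_PARAMS)
)

print(f"active fraction: {active_fraction:.1%}")
print(f"dense / MoE compute per token: {dense_to_moe_ratio:.1f}x")
```

Under these assumptions, roughly 15% of the weights are used per token, and a dense 26B model would need about 6.5 times the per-token compute.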
Limitations
- Needs around 16 GB of VRAM even at Q4_K_M
- Gated on Hugging Face behind a click-through license agreement
- The Gemma license is more restrictive than Apache 2.0 or MIT
Architecture & training
Architecture: MoE · 26B · Gemma 4 · multimodal text+image+audio · 128k context
Training: Google Gemma 4 MoE 26B — natively multimodal with audio, vision, and text.
Verdict
The most capable open multimodal model under 30B if you can live with the Gemma license.
Quick start
ollama run gemma4:26b-moe
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
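For scripted use, the model can also be called through Ollama's local HTTP API. The sketch below assumes Ollama is running on its default port (11434) and that the model has already been pulled under the tag shown in the quick-start command. The helper names `build_payload` and `generate` are hypothetical, and the 8192-token `num_ctx` default simply mirrors the 8k context used for the VRAM figures.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:26b-moe"  # tag taken from the quick-start command


def build_payload(prompt, num_ctx=8192):
    """Build the JSON body for a single non-streaming generation request."""
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,                  # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # context window; larger values need more VRAM
    }


def generate(prompt, num_ctx=8192, timeout=300):
    """Send the prompt to a locally running Ollama server and return the reply text."""
    data = json.dumps(build_payload(prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["response"]
```

Calling `generate("Summarize the Gemma license in one sentence.")` returns the model's reply as a string, provided the server is reachable.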