Model fiche

Gemma 4 2B

By Google · United States

chat · vision · multilingual · small
Parameters: 2B
License: Gemma
Context: 128k
VRAM (Q4): 1.2 GB
Released: 6 May 2026

Overview

Google's 2B base model in the Gemma 4 family. It accepts text and image input, supports a 128k context, and has a 1.2 GB Q4 footprint small enough to run on integrated graphics or a Raspberry Pi 5.

When to pick this model

  • On-device assistants for laptops, phones, and SBCs
  • Multimodal prototypes that can't justify a dedicated GPU
  • Long-context summarization at the edge
  • Air-gapped or offline scenarios where latency and privacy matter

VRAM requirements by quantization

Quantization              VRAM required
Q4_K_M (recommended)      1.2 GB
Q5_K_M                    1.4 GB
Q8_0                      2.1 GB
FP16 (no quantization)    4 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Budget extra memory for context lengths beyond 8k, since the KV cache grows with context length.
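As a rough illustration of how the table can be used, the sketch below picks the largest quantization that fits a given memory budget. The figures are copied from the table above; the function name `pick_quantization` and the `headroom_gb` parameter are hypothetical, and the headroom you reserve for longer contexts is your own estimate, not a value from this page.

```python
# VRAM figures (GB) copied from the table above; they assume an 8k context.
VRAM_GB = {
    "Q4_K_M": 1.2,
    "Q5_K_M": 1.4,
    "Q8_0": 2.1,
    "FP16": 4.0,
}


def pick_quantization(available_gb, headroom_gb=0.0):
    """Return the largest-footprint quantization that fits, or None.

    headroom_gb is a hypothetical knob: memory you want to keep free,
    for example for contexts longer than 8k.
    """
    budget = available_gb - headroom_gb
    # Keep (size, name) pairs so max() selects by footprint.
    fitting = [(size, name) for name, size in VRAM_GB.items() if size <= budget]
    if not fitting:
        return None
    return max(fitting)[1]


print(pick_quantization(2.0))  # prints "Q5_K_M"
```

In this table a larger footprint corresponds to a higher-precision format, which is why the sketch prefers the largest option that fits. If nothing fits, it returns `None` rather than guessing.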

Strengths

  • Runs on integrated GPUs at ~1.2 GB VRAM in Q4
  • Multimodal text and image input out of the box
  • 128k context, unusual at this parameter count
  • Permissive Gemma license

Limitations

  • Reasoning lags behind 4B and larger Gemma variants
  • Gated on Hugging Face (click-through access)

Architecture & training

Architecture: Gemma 4 base · 2B dense · multimodal text + image · 128k context

Training: the 2B multimodal base model in Google's Gemma 4 family, trained for edge and laptop deployment.

Verdict

The smallest Gemma 4 that still feels useful, and a strong default for edge multimodal apps.

Quick start

ollama run gemma4

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
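Besides the CLI command above, a running Ollama instance also exposes a local HTTP API, which is convenient for scripting. The following is a minimal sketch using only the Python standard library. It assumes Ollama is listening on its default address (`localhost:11434`) and that the model is available under the `gemma4` tag used in the quick start; the helper names `build_payload` and `generate` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: Ollama's default local endpoint for one-shot generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, image_bytes=None, model="gemma4"):
    """Build the JSON body for a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if image_bytes is not None:
        # Images are passed as a list of base64-encoded strings.
        payload["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return payload


def generate(prompt, image_bytes=None, model="gemma4", url=OLLAMA_URL):
    """Send the request and return the generated text."""
    data = json.dumps(build_payload(prompt, image_bytes, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("Describe this picture.", image_bytes=open("photo.jpg", "rb").read())` would send a text-plus-image prompt; `photo.jpg` is a placeholder file name. Nothing leaves the machine, which suits the offline scenarios listed earlier.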
