Granite 4.1 3B Instruct
By IBM · United States
Overview
IBM's dense 3B Granite 4.1: Apache 2.0, 12 languages, 131k context, with tool calling and FIM code support. The smallest Granite tier, sharing data and pipeline with its larger siblings.
When to pick this model
- Edge and embedded deployments needing ~3 GB VRAM
- Code completion with fill-in-the-middle
- Tool-calling agents on resource-constrained hardware
- Multilingual apps across 12 languages
- Long-context tasks at very small scale (131k)
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 2 GB |
| Q5_K_M | 2.5 GB |
| Q8_0 | 3 GB |
| FP16 (no quantization) | 6 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
Strengths
- Apache 2.0 with full openness
- Tool calling plus FIM code completion
- 12 languages including French
- 131k context at only 3B params
- Fits in 3 GB VRAM
Limitations
- Reasoning lags the 8B and 30B siblings
- Demanding chat use cases really want the 8B model
Architecture & training
Architecture: Dense Transformer · 40 layers · GQA 40Q/8KV · embedding 2560 · MLP hidden 8,192 · SwiGLU · RoPE · RMSNorm
Training: SFT + RL alignment, same data and pipeline as the 8B and 30B. 12 languages including FR. NVIDIA GB200 NVL72 cluster.
A serious 3B option for edge and embedded — same Granite recipe, just smaller.
Quick start
ollama run granite4.1:3bOr use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.