Model fiche
Granite 4.0 H-Tiny 7B-A1B
By IBM · United States
chat · general · moe · small
Overview
IBM's edge-class hybrid MoE with 7B total and only 1B active parameters — Apache 2.0 licensed and built for embedded and low-cost serving.
When to pick this model
- On-device assistants on laptops or edge boxes
- High-QPS endpoints where active-param cost dominates
- Long-context summarization on memory-constrained hardware
- Embedded products needing a clean commercial license
- Prototyping pipelines before scaling to Granite 4.0 Small
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 4 GB |
| Q5_K_M | 5 GB |
| Q8_0 | 7 GB |
| FP16 (no quantization) | 14 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
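As a rough illustration of how the table might be used, the sketch below picks a quantization for a given VRAM budget. The figures are copied from the table above. The function name `pick_quantization` and the `headroom_gb` parameter are hypothetical, and choosing the highest-precision option that fits is an assumption of this sketch, not the page's recommendation (which is Q4_K_M).

```python
# VRAM figures (GB) copied from the table above, ordered from highest
# to lowest precision. They already include an 8k KV cache and runtime overhead.
VRAM_GB = [
    ("FP16", 14),
    ("Q8_0", 7),
    ("Q5_K_M", 5),
    ("Q4_K_M", 4),
]


def pick_quantization(available_gb, headroom_gb=0.0):
    """Return the highest-precision quantization that fits, or None.

    headroom_gb is a hypothetical knob for reserving extra memory,
    for example when running with a context longer than 8k.
    """
    budget = available_gb - headroom_gb
    for name, required_gb in VRAM_GB:
        if required_gb <= budget:
            return name
    return None
```

For example, `pick_quantization(8)` selects Q8_0, while `pick_quantization(6, headroom_gb=1.5)` falls back to Q4_K_M because only 4.5 GB remains after reserving headroom.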
Strengths
- Extremely low compute cost per token via 1B active params
- Apache 2.0 license with no commercial strings attached
- 128k context handled efficiently thanks to hybrid Mamba-2
- Tiny memory footprint suits edge and serverless deploys
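To make the compute-cost claim concrete, here is a back-of-the-envelope sketch. It assumes the common rule of thumb of roughly 2 FLOPs per active parameter per generated token, and it compares against a hypothetical dense 7B model. It ignores routing overhead, the Mamba-2 and attention state costs, and memory bandwidth, so treat the ratio as an order-of-magnitude guide only.

```python
def approx_flops_per_token(active_params):
    # Rule of thumb (an assumption of this sketch): about 2 FLOPs,
    # one multiply and one add, per active parameter per token.
    return 2 * active_params


tiny_flops = approx_flops_per_token(1e9)   # 1B active parameters
dense_flops = approx_flops_per_token(7e9)  # hypothetical dense 7B model
ratio = dense_flops / tiny_flops           # how much cheaper per token

print(f"approx. {ratio:.0f}x fewer FLOPs per token than a dense 7B model")
```

Note that all 7B parameters still have to be held in memory; only the per-token compute scales with the 1B active parameters.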
Limitations
- Quality lags dense 3B models on some single-shot tasks
- Smaller active capacity hurts complex reasoning
- Needs an up-to-date llama.cpp build to run efficiently
Architecture & training
Architecture: Hybrid Mamba-2 + granular MoE · 7B total / 1B active parameters
Training: Edge variant of the Granite 4.0 family.
Verdict
The most efficient Apache-licensed MoE for edge inference — the right pick when cost-per-token and license cleanliness trump raw quality.
Quick start
ollama run granite4:tiny-h
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
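Once the model has been pulled, it can also be called programmatically. The sketch below is a minimal example that assumes a local Ollama server on its default port (11434) and uses Ollama's `/api/generate` endpoint with streaming disabled. The helper names `build_request` and `generate` are hypothetical, and the timeout value is arbitrary.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes Ollama's default local port
MODEL_TAG = "granite4:tiny-h"  # tag taken from the quick-start command


def build_request(prompt, model=MODEL_TAG, url=OLLAMA_URL):
    """Build a non-streaming generate request (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=120):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

With the server running, `generate("Summarize this paragraph: ...")` returns the model's reply as a string. The request is built in a separate function so it can be inspected without a live server.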