Model fiche

Granite 4.0 H-Tiny 7B-A1B

By IBM · United States

chat general moe small
Parameters: 7B total (1B active)
License: Apache 2.0
Context: 128k
VRAM (Q4): 4 GB
Released: October 2025

Overview

IBM's edge-class hybrid MoE with 7B total and only 1B active parameters — Apache 2.0 licensed and built for embedded and low-cost serving.

When to pick this model

  • On-device assistants on laptops or edge boxes
  • High-QPS endpoints where active-param cost dominates
  • Long-context summarization on memory-constrained hardware
  • Embedded products needing a clean commercial license
  • Prototyping pipelines before scaling to Granite 4.0 Small

VRAM requirements by quantization

Quantization            VRAM required
Q4_K_M (recommended)    4 GB
Q5_K_M                  5 GB
Q8_0                    7 GB
FP16 (no quantization)  14 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
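
As a rough illustration of how these figures might be used, the sketch below picks the highest-precision quantization that fits a given VRAM budget. The numbers are copied from this section; the function name `pick_quantization` and the `headroom_gb` parameter are hypothetical, and the headroom value you need for longer contexts is an assumption you would have to measure yourself.

```python
# VRAM figures copied from this section (GB, with an 8k KV cache and
# ~600 MB runtime overhead already included).
VRAM_GB = {
    "Q4_K_M": 4,
    "Q5_K_M": 5,
    "Q8_0": 7,
    "FP16": 14,
}


def pick_quantization(available_gb, headroom_gb=0.0):
    """Return the highest-precision quantization that fits, or None.

    headroom_gb is a hypothetical safety margin, e.g. for contexts
    longer than the 8k assumed by the figures above.
    """
    budget = available_gb - headroom_gb
    # In this table, a larger VRAM requirement means higher precision,
    # so the largest entry that fits is the best available choice.
    fitting = [(need, name) for name, need in VRAM_GB.items() if need <= budget]
    if not fitting:
        return None
    return max(fitting)[1]


# Example calls:
#   pick_quantization(6)                 selects "Q5_K_M"
#   pick_quantization(8, headroom_gb=2)  selects "Q5_K_M"
#   pick_quantization(3)                 returns None (nothing fits)
```

The lookup is deliberately table-driven rather than computed from bits per weight, since the published figures already bundle cache and runtime overhead.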

Strengths

  • Extremely low compute cost per token via 1B active params
  • Apache 2.0 license with no commercial strings attached
  • 128k context handled efficiently thanks to hybrid Mamba-2
  • Tiny memory footprint suits edge and serverless deploys
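
To make the active-parameter point concrete, here is a back-of-the-envelope sketch. It assumes the common rule of thumb of roughly 2 FLOPs per active parameter per generated token, which ignores attention and state-space costs, expert routing overhead, and memory bandwidth, so treat the result as an order-of-magnitude comparison rather than a benchmark. The function name is hypothetical.

```python
def forward_flops_per_token(active_params):
    """Approximate forward-pass FLOPs per token (rule of thumb: 2 * N)."""
    return 2 * active_params


MOE_ACTIVE_PARAMS = 1e9   # 1B active parameters, as listed for this model
DENSE_7B_PARAMS = 7e9     # a dense model with the same total size

moe_flops = forward_flops_per_token(MOE_ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(DENSE_7B_PARAMS)

# Ratio of per-token compute: dense 7B versus 1B-active MoE.
compute_ratio = dense_flops / moe_flops
print(f"MoE:   {moe_flops:.1e} FLOPs/token")
print(f"Dense: {dense_flops:.1e} FLOPs/token")
print(f"Dense 7B needs about {compute_ratio:.0f}x the compute per token")
```

Note that this only covers compute. All 7B parameters still have to be resident in memory, which is why the VRAM figures track the total size rather than the active size.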

Limitations

  • Quality lags dense 3B models on some single-shot tasks
  • Smaller active capacity hurts complex reasoning
  • Needs a recent llama.cpp build that supports the hybrid architecture to run efficiently

Architecture & training

Architecture: Hybrid Mamba-2 + granular MoE · 7B total / 1B active

Training: Edge variant of the Granite 4.0 family.

Verdict

The most efficient Apache-licensed MoE for edge inference — the right pick when cost-per-token and license cleanliness trump raw quality.

Quick start

ollama run granite4:tiny-h

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
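
For scripted use rather than the interactive CLI, a minimal sketch of calling the model through Ollama's local HTTP API might look like the following. It assumes Ollama is running on its default port (11434) and that the `granite4:tiny-h` tag from the command above has already been pulled; the helper names `build_payload` and `generate` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "granite4:tiny-h"  # tag taken from the quick-start command


def build_payload(prompt, model=MODEL_TAG):
    """Build the JSON body for a single, non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, url=OLLAMA_URL, timeout=120):
    """Send the prompt to a local Ollama server and return the reply text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,  # supplying a body makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, `print(generate("Summarize the Apache 2.0 license in one sentence."))` should print the model's reply. Setting `stream` to `False` keeps the sketch simple by returning one JSON object instead of a stream of chunks.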
