ERNIE 4.5 21B-A3B Thinking
By Baidu · China
Overview
Baidu's compact reasoning mixture-of-experts (MoE) model, with 3B active parameters out of 21B total. The small active set keeps inference fast, and the model is notably strong on Chinese-language tasks.
When to pick this model
- Cost-sensitive reasoning workloads
- Chinese-language reasoning tasks
- Single-GPU deployments needing 128k context
- Latency-sensitive applications
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 13 GB |
| Q5_K_M | 16 GB |
| Q8_0 | 23 GB |
| FP16 (no quantization) | 42 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
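As a rough illustration of how the table might be used, the sketch below picks the highest-precision quantization that fits a given GPU. The figures are copied from the table above and so assume an 8k KV cache. The function name `pick_quantization` and the `headroom_gb` parameter are hypothetical, introduced only for this example.

```python
# VRAM figures (GB) copied from the table above; each assumes an 8k KV cache
# plus ~600 MB of runtime overhead.
VRAM_GB = {
    "Q4_K_M": 13,
    "Q5_K_M": 16,
    "Q8_0": 23,
    "FP16": 42,
}


def pick_quantization(available_gb, headroom_gb=0.0):
    """Return the largest listed quantization that fits, or None if none do.

    headroom_gb is a hypothetical knob for reserving extra memory,
    for example when running with a context longer than 8k.
    """
    budget = available_gb - headroom_gb
    fitting = [(gb, name) for name, gb in VRAM_GB.items() if gb <= budget]
    if not fitting:
        return None
    # The largest footprint that still fits is the highest-precision option.
    return max(fitting)[1]
```

For example, `pick_quantization(24)` selects `Q8_0`, while `pick_quantization(24, headroom_gb=4)` falls back to `Q5_K_M`. How much headroom a longer context actually needs is not stated in the table, so that value has to be measured on the target setup.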
Strengths
- Around 13 GB VRAM at Q4_K_M
- Compact MoE optimized for reasoning
- Strong Chinese-language performance
- 128k context window
Limitations
- Weaker multilingual coverage than Qwen
- Baidu license terms need verification
- Smaller community than Qwen or Llama equivalents
Architecture & training
Architecture: MoE · 21B · ERNIE 4.5 compact · reasoning-optimized
Training: Baidu ERNIE 4.5 compact version with reasoning specialization.
An efficient reasoning MoE with real Chinese strength, but Qwen's compact models remain easier to adopt outside China.
Quick start
ollama pull hf.co/baidu/ernie-4.5-21b-GGUF
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
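Once the model has been pulled, it can also be queried from a script through Ollama's local HTTP API. The following is a minimal sketch, assuming Ollama is running on its default port (11434) and that the model is registered under the same name used in the pull command. The helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Assumption: the model is registered under the name used in the pull command.
MODEL = "hf.co/baidu/ernie-4.5-21b-GGUF"


def build_request(prompt, model=MODEL, url=OLLAMA_URL):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Explain mixture-of-experts routing in two sentences.")` returns the model's reply as a string, provided the server is up. Reasoning models can take a while to respond, hence the generous timeout.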