GLM-5.1
By Z.AI · China
Overview
Z.AI's flagship mixture-of-experts (MoE) model, with 744B total and 40B active parameters, released under an MIT license. It ranked #1 among open-weight models on Artificial Analysis as of April 2026.
When to pick this model
- Production agentic systems on dedicated server clusters
- Replacing closed frontier APIs with self-hosted weights
- Long-context document analysis up to 200K tokens
- Open-weight SWE-Bench-grade coding agents
- Commercial deployments that need MIT licensing
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 445 GB |
| Q5_K_M | 535 GB |
| Q8_0 | 800 GB |
| FP16 (no quantization) | 1488 GB |
VRAM figures include model weights plus a typical 8K-token KV cache and ~600 MB of runtime overhead (Ollama / llama.cpp baseline). Longer contexts need additional headroom for the KV cache.
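As a sanity check on the table, the figures can be converted back into an effective bits-per-weight value. The sketch below is a minimal illustration, assuming that GB means 10^9 bytes and that each figure is spread over all 744B parameters; the function name `implied_bits_per_weight` is hypothetical and not part of any tool mentioned here.

```python
# Minimal sketch: back out effective bits per weight from the VRAM table.
# Assumptions (not from the source): 1 GB = 10**9 bytes, and the whole
# figure is attributed to the 744B total parameters.

TOTAL_PARAMS = 744e9  # total parameter count from the overview

# VRAM figures copied from the table, in GB.
VRAM_GB = {
    "Q4_K_M": 445,
    "Q5_K_M": 535,
    "Q8_0": 800,
    "FP16": 1488,
}


def implied_bits_per_weight(vram_gb: float, total_params: float = TOTAL_PARAMS) -> float:
    """Return the bits per parameter implied by a VRAM figure."""
    total_bits = vram_gb * 1e9 * 8
    return total_bits / total_params


for name, gb in VRAM_GB.items():
    print(f"{name}: {implied_bits_per_weight(gb):.2f} bits/weight")
```

Under these assumptions the FP16 row works out to 16 bits per weight, which suggests that row is close to weights alone, while the quantized rows land somewhat above their nominal bit widths.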
Published benchmark scores
| Benchmark | Score |
|---|---|
| SWE-Bench Pro | 58.4 |
Scores are published by the model author or aggregated from public leaderboards, and are re-measured monthly by our editorial team.
Strengths
- #1 open-weight model on Artificial Analysis (April 2026)
- 58.4 on SWE-Bench Pro, the highest score among open-weight models
- 200K context for whole-repo reasoning
- True MIT license with full commercial rights
Limitations
- At 445 GB or more, even the Q4_K_M quantization requires a multi-GPU server
- No official Ollama tag at launch
- Operational complexity rules out single-workstation use
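To make the multi-GPU requirement concrete, the sketch below estimates how many accelerators a given VRAM figure would need. The 80 GB card size and the 90% usable fraction are illustrative assumptions, not figures from this page, and `gpus_needed` is a hypothetical helper.

```python
import math


def gpus_needed(vram_gb: float, gpu_gb: float = 80.0, usable_fraction: float = 0.9) -> int:
    """Estimate the GPU count for a model that needs `vram_gb` of memory.

    gpu_gb and usable_fraction are assumptions: an 80 GB card with about
    10% reserved for fragmentation and framework buffers.
    """
    if gpu_gb <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("gpu_gb must be positive and usable_fraction in (0, 1]")
    return math.ceil(vram_gb / (gpu_gb * usable_fraction))


# VRAM figures from the quantization table, in GB.
for name, gb in {"Q4_K_M": 445, "Q5_K_M": 535, "Q8_0": 800, "FP16": 1488}.items():
    print(f"{name}: at least {gpus_needed(gb)} GPUs")
```

This counts memory only. Real deployments usually round up to whatever GPU counts the serving stack can shard across, and they need extra room for longer contexts.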
Architecture & training
Architecture: MoE · 744B total / 40B active · 200K context · Reasoning variant
Training: Successor to GLM-5 (February 2026).
The strongest open-weight model available today, provided you have the hardware to run a 744B MoE.
Quick start
Hugging Face (GGUF): unsloth/GLM-5.1-GGUF
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
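One way to fetch a single quantization from the GGUF repository named above is the `huggingface_hub` library's `snapshot_download`. The sketch below assumes the files for each quantization carry the quantization tag (for example `Q4_K_M`) somewhere in their path, which has not been verified for this repository; the helper names and the `local_dir` default are hypothetical.

```python
def quant_patterns(quant: str) -> list[str]:
    """Build glob patterns for files whose path contains the quant tag.

    Assumption: the repository names its files or folders after the
    quantization, e.g. something containing "Q4_K_M".
    """
    return [f"*{quant}*"]


def download_quant(
    quant: str = "Q4_K_M",
    repo_id: str = "unsloth/GLM-5.1-GGUF",    # repository from the quick start
    local_dir: str = "models/GLM-5.1-GGUF",   # hypothetical target directory
) -> str:
    """Download only the files for one quantization and return the local path."""
    # Imported here so the helper above works without the dependency.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=quant_patterns(quant),
        local_dir=local_dir,
    )


# Example call (downloads several hundred GB, so it is left commented out):
# path = download_quant("Q4_K_M")
```

Restricting the download with `allow_patterns` avoids pulling every quantization at once, which matters when a single variant is already hundreds of gigabytes.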