GLM 5.1 Local: Setup & Benchmarks
Z.ai's 754B MoE flagship runs locally — but only with serious silicon. Here's what GLM 5.1 Local actually needs, how it scores, and when self-hosting beats the API.
By Mohamed Meguedmi · 9 min read
Key Takeaways
- GLM-5.1 is a 754B-parameter MoE with ~32B active per token, MIT-licensed, and currently the #1 open model on Code Arena.
- Realistic local floor: 2× RTX 6000 Ada (96 GB VRAM) + 256 GB system RAM for Q4_K_M GGUF with expert offload, or 4× H100 80 GB for FP8 production serving.
- Q4_K_M holds ~98% of FP16 quality on coding benchmarks; Q2_K is tinker-grade; anything below IQ2 noticeably degrades agentic reliability.
- Throughput verdict: vLLM > SGLang > llama.cpp > Ollama. Pick Ollama only for single-user prototyping.
- Local breaks even around 500 M tokens/month for a single seat. Below that, the Z.ai API is cheaper than electricity plus hardware amortization.
What GLM-5.1 actually is — and why people self-host it
Released by Z.ai (zai-org) in February 2026, GLM-5.1 is the successor to GLM-5 and currently the strongest open-weight model on the Code Arena leaderboard. It uses a Mixture-of-Experts architecture: 754 billion total parameters, but only about 32 billion are active for any given token. That detail matters enormously for local inference — you pay storage cost for all 754 B, but compute cost for only 32 B.
Three things drive the self-hosting interest:
- License. MIT, including the weights. You can fine-tune, redistribute, embed in a commercial product, and run it air-gapped — none of which is true for Claude, GPT-5, or Gemini.
- Agentic coding performance. 63.5 on Terminal-Bench 2.0 (Terminus-2 harness) and 42.7 on NL2Repo. That is within striking distance of Claude Code (69.0) at zero per-token cost once the hardware is paid for.
- Data residency. EU and HIPAA-adjacent workloads can't legally ship code to Z.ai's API endpoints in China. Local is the only option.
If none of the three apply, the Z.ai API is the path of least resistance. The rest of this guide assumes at least one of them does.
Hardware requirements by quantization
This is the table to read first. Numbers assume Unsloth's dynamic GGUF quants, llama.cpp build dated April 2026, and a single concurrent user with 8K context.
| Quant | Disk / VRAM (no offload) | Suggested GPU config | Tokens/sec | Use case |
|---|---|---|---|---|
| FP16 | ~1.5 TB | 8× H200 141 GB (NVLink) | 180–220 | Research, lab only |
| FP8 (vLLM) | ~754 GB | 4× H200 or 8× H100 80 GB | 140–180 | Production serving |
| AWQ-4bit | ~410 GB | 6× RTX 6000 Ada 48 GB | 90–120 | Multi-user batched |
| Q4_K_M GGUF | ~380 GB | 2× RTX 6000 Ada + 256 GB RAM (expert offload) | 14–22 | Sweet spot for single dev |
| Q3_K_M GGUF | ~290 GB | 2× RTX 6000 Ada + 192 GB RAM | 18–26 | Tight VRAM budgets |
| Q2_K GGUF | ~200 GB | 4× RTX 4090 + 192 GB RAM | 10–16 | Tinker, evaluation only |
| IQ1_S | ~150 GB | 2× RTX 4090 + 128 GB RAM | 6–10 | Not recommended for agentic |
Two things to call out. First, MoE models offload to system RAM gracefully — only the active experts need to be hot in VRAM at any moment — which is why a 2-GPU consumer-pro box can actually run Q4_K_M. Second, those tokens/sec are with llama.cpp's --override-tensor expert-offload path; skip it and performance collapses by 4–6×.
For a side-by-side with Qwen3-Coder, DeepSeek V4, and Llama 4 Behemoth, see the BestLLMfor model catalog.
Quantization quality: where the cliff actually is
Unsloth published per-quant evaluation runs on April 8, 2026. Aggregated across HumanEval+, MBPP+, NL2Repo, and Terminal-Bench 2.0:
| Quant | HumanEval+ | NL2Repo | Terminal-Bench 2.0 | Retention vs FP16 |
|---|---|---|---|---|
| FP16 | 92.1 | 42.7 | 63.5 | 100% |
| FP8 | 91.8 | 42.5 | 63.1 | 99.5% |
| Q5_K_M | 91.4 | 42.0 | 62.4 | 98.7% |
| Q4_K_M | 90.6 | 41.4 | 61.2 | 97.6% |
| Q3_K_M | 87.9 | 38.8 | 56.7 | 92.1% |
| Q2_K | 81.3 | 33.1 | 47.8 | 81.4% |
| IQ1_S | 68.5 | 22.4 | 31.0 | 59.2% |
The cliff sits between Q3 and Q2 — and it is steep specifically on agentic, multi-step tool-use evals. A 14-point drop on Terminal-Bench means the model starts forgetting which directory it is in, mis-quoting shell arguments, and calling the wrong tool. For interactive chat, Q2_K is fine. For autonomous coding loops, do not go below Q3_K_M.
Setup: four paths, ranked by effort vs throughput
Path A — Ollama (easiest, single user)
# Requires Ollama 0.6.2+ for MoE expert offload
ollama pull glm-5.1:q4_k_m
ollama run glm-5.1:q4_k_m
One command. Pulls ~380 GB. Auto-detects VRAM and spills experts to system RAM. Expect 14–22 tok/s on the reference 2× RTX 6000 Ada config. Good for chat, agent prototyping, and IDE plugins. Not appropriate for serving more than one concurrent request.
Path B — llama.cpp from GGUF (most control)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j
huggingface-cli download unsloth/GLM-5.1-GGUF \
--include "GLM-5.1-Q4_K_M-*.gguf" --local-dir ./models/glm51
./build/bin/llama-server \
-m ./models/glm51/GLM-5.1-Q4_K_M-00001-of-00009.gguf \
--n-gpu-layers 99 \
--override-tensor "([0-9]+).ffn_.*_exps.=CPU" \
--ctx-size 32768 \
--host 0.0.0.0 --port 8080
The --override-tensor flag is the critical one: it pins attention layers to GPU and pushes MoE expert tensors to CPU. Without it you would need 380 GB of VRAM. With it, 96 GB VRAM is enough.
Path C — vLLM (production)
pip install vllm==0.9.1
vllm serve zai-org/GLM-5.1-FP8 \
--tensor-parallel-size 4 \
--max-model-len 65536 \
--enable-prefix-caching
Assumes 4× H100 80 GB or 4× H200. vLLM with FP8 hits 140–180 tok/s aggregated across concurrent users, with sub-second time-to-first-token. This is the right answer for any team workload.
Path D — SGLang (agentic-optimized)
SGLang beats vLLM by 15–25% on agent workloads thanks to its RadixAttention prefix cache. If traffic is dominated by long, shared system prompts (typical for coding agents), use it instead of vLLM.
Benchmarks: GLM-5.1 vs the field
Numbers below come from the official GLM-5.1 model card and the May 2026 Terminal-Bench leaderboard. Open-weight models in bold.
| Model | License | Active params | SWE-Bench Verified | Terminal-Bench 2.0 | NL2Repo |
|---|---|---|---|---|---|
| Claude Code 4.5 | Proprietary | — | 74.2 | 69.0 | 48.1 |
| GPT-5.2 | Proprietary | — | 71.8 | 66.4 | 45.3 |
| GLM-5.1 | MIT | 32 B | 67.9 | 63.5 | 42.7 |
| DeepSeek V4 | DeepSeek | 37 B | 65.1 | 61.2 | 40.9 |
| Qwen3-Coder 480B | Apache 2.0 | 35 B | 62.4 | 58.8 | 38.5 |
| Llama 4 Behemoth | Llama 4 CL | 288 B | 59.7 | 54.1 | 35.2 |
The takeaway: GLM-5.1 is roughly 5–8 points behind frontier proprietary on agentic coding, and 2–5 points ahead of the closest open alternative. For a self-hosted MIT-licensed model, that is the current state of the art.
Cost analysis: local vs Z.ai API
The Z.ai API charges roughly $0.45 / M input tokens and $1.80 / M output tokens for GLM-5.1 as of June 2026. A reference local config (2× RTX 6000 Ada ≈ $13,000 + EPYC server ≈ $7,000 = $20K capex) amortized over 36 months at 24/7 operation costs about $0.78/hour in hardware plus $0.42/hour in electricity at $0.15/kWh — call it $876/month all-in.
| Monthly tokens (in+out, 50/50 split) | API cost | Local cost (capex + power) | Winner |
|---|---|---|---|
| 5 M | $5.6 | $876 | API by 156× |
| 20 M | $22.5 | $876 | API by 39× |
| 50 M | $56.3 | $876 | API by 16× |
| 500 M | $562 | $876 | API by 1.6× |
| 2 B | $2,250 | $876 | Local by 2.6× |
| 10 B | $11,250 | $876 | Local by 12.8× |
Plug actual volume into the cost calculator to see exactly where break-even lands. The honest answer for most individual developers: stay on the API. Local makes financial sense once you are serving a team, running long agent loops, or have a data-residency requirement that removes the option entirely.
When to pick GLM-5.1 over alternatives
| Scenario | Recommendation |
|---|---|
| Solo dev, <50 M tok/month, no compliance constraint | Z.ai API |
| Team of 5–20, mostly coding agents | GLM-5.1 FP8 on 4× H100 with vLLM |
| Solo dev, 2× pro GPUs already on hand | GLM-5.1 Q4_K_M via Ollama or llama.cpp |
| Consumer hardware only (single 4090 / 5090) | Skip GLM-5.1. See best LLM for RTX 5090. |
| EU/HIPAA workload | GLM-5.1 self-hosted, on-prem only |
| Need frontier-tier results regardless of cost | Stay on Claude Code or GPT-5.2 |
BestLLMfor publishes the underlying hardware-vs-model matrix as a free CC BY 4.0 dataset; the same data powers an open-source MCP server so it can be queried directly from Claude Code or any MCP-compatible agent. See the methodology page for how the benchmarks above are reproduced.
Frequently Asked Questions
Can GLM-5.1 run on a single RTX 5090?
No, not usefully. A single 5090 has 32 GB VRAM. Even with aggressive expert offload to 256 GB of system RAM, it would mean running IQ1_S quants at 4–6 tok/s with significant quality loss. Use Qwen3-Coder 32B or DeepSeek-V2.5 Lite instead — they fit fully in 32 GB and deliver 60–80 tok/s.
How much disk space is needed to download the model?
Q4_K_M is ~380 GB across nine shards. FP8 is 754 GB. Plan on at least 1 TB of NVMe for working storage; headroom for the K/V cache files and at least one backup quant is recommended.
Is GLM-5.1 actually MIT-licensed including commercial use?
Yes. The model weights, tokenizer, and inference code are released under MIT. Fine-tuning, redistribution, and embedding in commercial products are all permitted without royalties or notification. This is the most permissive license among frontier-tier open models as of June 2026.
Does it support tool calling and structured output?
Yes. GLM-5.1 ships with a native tool-calling format compatible with the OpenAI function-calling schema, and supports JSON Schema constrained decoding through both vLLM (via outlines) and llama.cpp (via grammars). Agentic frameworks like LangGraph, CrewAI, and Claude-Code-style harnesses work without modification.
Will quantization break agentic workflows?
Q4_K_M and above are safe. Below that, expect occasional mis-quoted shell commands, tool-name hallucination, and forgotten directory context in multi-turn agent runs. The Q3→Q2 cliff is where production reliability falls off — keep Q3_K_M as the floor for autonomous use.
How does GLM-5.1 compare to GLM-5?
Z.ai reports +7.3 points on Terminal-Bench 2.0 (63.5 vs 56.2) and +6.8 on NL2Repo (42.7 vs 35.9). Long-horizon agent tasks and role-play coherence are the biggest gains. For coding-focused users, the upgrade is worth the additional VRAM footprint.
Bottom line
GLM-5.1 is the best open-weight coding model available in June 2026, and the first one whose local performance comes within striking distance of Claude Code. But "running locally" still means 2× professional GPUs and 256 GB of RAM as a floor — this is not a single-GPU model. For solo developers below 50 M tokens/month, the Z.ai API is the rational choice. For teams, regulated industries, or anyone running long agent loops, self-hosting GLM-5.1 with vLLM on 4× H100 is the configuration to copy. Going below Q4_K_M starts costing reliability on the exact tasks the model is best at — see the BestLLMfor guides hub for the next round of model-specific tuning notes.