BestLLMfor EN Your hardware. Your LLM. Your call.
APIOpen data Find my LLM
Guide · 2026-06-03

GLM 5.1 Local: Setup & Benchmarks

Z.ai's 754B MoE flagship runs locally — but only with serious silicon. Here's what GLM 5.1 Local actually needs, how it scores, and when self-hosting beats the API.

By Mohamed Meguedmi · 9 min read

Key Takeaways

  • GLM-5.1 is a 754B-parameter MoE with ~32B active per token, MIT-licensed, and currently the #1 open model on Code Arena.
  • Realistic local floor: 2× RTX 6000 Ada (96 GB VRAM) + 256 GB system RAM for Q4_K_M GGUF with expert offload, or 4× H100 80 GB for FP8 production serving.
  • Q4_K_M holds ~98% of FP16 quality on coding benchmarks; Q2_K is tinker-grade; anything below IQ2 noticeably degrades agentic reliability.
  • Throughput verdict: vLLM > SGLang > llama.cpp > Ollama. Pick Ollama only for single-user prototyping.
  • Local breaks even around 500 M tokens/month for a single seat. Below that, the Z.ai API is cheaper than electricity plus hardware amortization.

What GLM-5.1 actually is — and why people self-host it

Released by Z.ai (zai-org) in February 2026, GLM-5.1 is the successor to GLM-5 and currently the strongest open-weight model on the Code Arena leaderboard. It uses a Mixture-of-Experts architecture: 754 billion total parameters, but only about 32 billion are active for any given token. That detail matters enormously for local inference — you pay storage cost for all 754 B, but compute cost for only 32 B.

Three things drive the self-hosting interest:

  • License. MIT, including the weights. You can fine-tune, redistribute, embed in a commercial product, and run it air-gapped — none of which is true for Claude, GPT-5, or Gemini.
  • Agentic coding performance. 63.5 on Terminal-Bench 2.0 (Terminus-2 harness) and 42.7 on NL2Repo. That is within striking distance of Claude Code (69.0) at zero per-token cost once the hardware is paid for.
  • Data residency. EU and HIPAA-adjacent workloads can't legally ship code to Z.ai's API endpoints in China. Local is the only option.

If none of the three apply, the Z.ai API is the path of least resistance. The rest of this guide assumes at least one of them does.

Hardware requirements by quantization

This is the table to read first. Numbers assume Unsloth's dynamic GGUF quants, llama.cpp build dated April 2026, and a single concurrent user with 8K context.

QuantDisk / VRAM (no offload)Suggested GPU configTokens/secUse case
FP16~1.5 TB8× H200 141 GB (NVLink)180–220Research, lab only
FP8 (vLLM)~754 GB4× H200 or 8× H100 80 GB140–180Production serving
AWQ-4bit~410 GB6× RTX 6000 Ada 48 GB90–120Multi-user batched
Q4_K_M GGUF~380 GB2× RTX 6000 Ada + 256 GB RAM (expert offload)14–22Sweet spot for single dev
Q3_K_M GGUF~290 GB2× RTX 6000 Ada + 192 GB RAM18–26Tight VRAM budgets
Q2_K GGUF~200 GB4× RTX 4090 + 192 GB RAM10–16Tinker, evaluation only
IQ1_S~150 GB2× RTX 4090 + 128 GB RAM6–10Not recommended for agentic

Two things to call out. First, MoE models offload to system RAM gracefully — only the active experts need to be hot in VRAM at any moment — which is why a 2-GPU consumer-pro box can actually run Q4_K_M. Second, those tokens/sec are with llama.cpp's --override-tensor expert-offload path; skip it and performance collapses by 4–6×.

For a side-by-side with Qwen3-Coder, DeepSeek V4, and Llama 4 Behemoth, see the BestLLMfor model catalog.

Quantization quality: where the cliff actually is

Unsloth published per-quant evaluation runs on April 8, 2026. Aggregated across HumanEval+, MBPP+, NL2Repo, and Terminal-Bench 2.0:

QuantHumanEval+NL2RepoTerminal-Bench 2.0Retention vs FP16
FP1692.142.763.5100%
FP891.842.563.199.5%
Q5_K_M91.442.062.498.7%
Q4_K_M90.641.461.297.6%
Q3_K_M87.938.856.792.1%
Q2_K81.333.147.881.4%
IQ1_S68.522.431.059.2%

The cliff sits between Q3 and Q2 — and it is steep specifically on agentic, multi-step tool-use evals. A 14-point drop on Terminal-Bench means the model starts forgetting which directory it is in, mis-quoting shell arguments, and calling the wrong tool. For interactive chat, Q2_K is fine. For autonomous coding loops, do not go below Q3_K_M.

Setup: four paths, ranked by effort vs throughput

Path A — Ollama (easiest, single user)

# Requires Ollama 0.6.2+ for MoE expert offload
ollama pull glm-5.1:q4_k_m
ollama run glm-5.1:q4_k_m

One command. Pulls ~380 GB. Auto-detects VRAM and spills experts to system RAM. Expect 14–22 tok/s on the reference 2× RTX 6000 Ada config. Good for chat, agent prototyping, and IDE plugins. Not appropriate for serving more than one concurrent request.

Path B — llama.cpp from GGUF (most control)

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j

huggingface-cli download unsloth/GLM-5.1-GGUF \
  --include "GLM-5.1-Q4_K_M-*.gguf" --local-dir ./models/glm51

./build/bin/llama-server \
  -m ./models/glm51/GLM-5.1-Q4_K_M-00001-of-00009.gguf \
  --n-gpu-layers 99 \
  --override-tensor "([0-9]+).ffn_.*_exps.=CPU" \
  --ctx-size 32768 \
  --host 0.0.0.0 --port 8080

The --override-tensor flag is the critical one: it pins attention layers to GPU and pushes MoE expert tensors to CPU. Without it you would need 380 GB of VRAM. With it, 96 GB VRAM is enough.

Path C — vLLM (production)

pip install vllm==0.9.1
vllm serve zai-org/GLM-5.1-FP8 \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --enable-prefix-caching

Assumes 4× H100 80 GB or 4× H200. vLLM with FP8 hits 140–180 tok/s aggregated across concurrent users, with sub-second time-to-first-token. This is the right answer for any team workload.

Path D — SGLang (agentic-optimized)

SGLang beats vLLM by 15–25% on agent workloads thanks to its RadixAttention prefix cache. If traffic is dominated by long, shared system prompts (typical for coding agents), use it instead of vLLM.

Benchmarks: GLM-5.1 vs the field

Numbers below come from the official GLM-5.1 model card and the May 2026 Terminal-Bench leaderboard. Open-weight models in bold.

ModelLicenseActive paramsSWE-Bench VerifiedTerminal-Bench 2.0NL2Repo
Claude Code 4.5Proprietary74.269.048.1
GPT-5.2Proprietary71.866.445.3
GLM-5.1MIT32 B67.963.542.7
DeepSeek V4DeepSeek37 B65.161.240.9
Qwen3-Coder 480BApache 2.035 B62.458.838.5
Llama 4 BehemothLlama 4 CL288 B59.754.135.2

The takeaway: GLM-5.1 is roughly 5–8 points behind frontier proprietary on agentic coding, and 2–5 points ahead of the closest open alternative. For a self-hosted MIT-licensed model, that is the current state of the art.

Cost analysis: local vs Z.ai API

The Z.ai API charges roughly $0.45 / M input tokens and $1.80 / M output tokens for GLM-5.1 as of June 2026. A reference local config (2× RTX 6000 Ada ≈ $13,000 + EPYC server ≈ $7,000 = $20K capex) amortized over 36 months at 24/7 operation costs about $0.78/hour in hardware plus $0.42/hour in electricity at $0.15/kWh — call it $876/month all-in.

Monthly tokens (in+out, 50/50 split)API costLocal cost (capex + power)Winner
5 M$5.6$876API by 156×
20 M$22.5$876API by 39×
50 M$56.3$876API by 16×
500 M$562$876API by 1.6×
2 B$2,250$876Local by 2.6×
10 B$11,250$876Local by 12.8×

Plug actual volume into the cost calculator to see exactly where break-even lands. The honest answer for most individual developers: stay on the API. Local makes financial sense once you are serving a team, running long agent loops, or have a data-residency requirement that removes the option entirely.

When to pick GLM-5.1 over alternatives

ScenarioRecommendation
Solo dev, <50 M tok/month, no compliance constraintZ.ai API
Team of 5–20, mostly coding agentsGLM-5.1 FP8 on 4× H100 with vLLM
Solo dev, 2× pro GPUs already on handGLM-5.1 Q4_K_M via Ollama or llama.cpp
Consumer hardware only (single 4090 / 5090)Skip GLM-5.1. See best LLM for RTX 5090.
EU/HIPAA workloadGLM-5.1 self-hosted, on-prem only
Need frontier-tier results regardless of costStay on Claude Code or GPT-5.2

BestLLMfor publishes the underlying hardware-vs-model matrix as a free CC BY 4.0 dataset; the same data powers an open-source MCP server so it can be queried directly from Claude Code or any MCP-compatible agent. See the methodology page for how the benchmarks above are reproduced.

Frequently Asked Questions

Can GLM-5.1 run on a single RTX 5090?

No, not usefully. A single 5090 has 32 GB VRAM. Even with aggressive expert offload to 256 GB of system RAM, it would mean running IQ1_S quants at 4–6 tok/s with significant quality loss. Use Qwen3-Coder 32B or DeepSeek-V2.5 Lite instead — they fit fully in 32 GB and deliver 60–80 tok/s.

How much disk space is needed to download the model?

Q4_K_M is ~380 GB across nine shards. FP8 is 754 GB. Plan on at least 1 TB of NVMe for working storage; headroom for the K/V cache files and at least one backup quant is recommended.

Is GLM-5.1 actually MIT-licensed including commercial use?

Yes. The model weights, tokenizer, and inference code are released under MIT. Fine-tuning, redistribution, and embedding in commercial products are all permitted without royalties or notification. This is the most permissive license among frontier-tier open models as of June 2026.

Does it support tool calling and structured output?

Yes. GLM-5.1 ships with a native tool-calling format compatible with the OpenAI function-calling schema, and supports JSON Schema constrained decoding through both vLLM (via outlines) and llama.cpp (via grammars). Agentic frameworks like LangGraph, CrewAI, and Claude-Code-style harnesses work without modification.

Will quantization break agentic workflows?

Q4_K_M and above are safe. Below that, expect occasional mis-quoted shell commands, tool-name hallucination, and forgotten directory context in multi-turn agent runs. The Q3→Q2 cliff is where production reliability falls off — keep Q3_K_M as the floor for autonomous use.

How does GLM-5.1 compare to GLM-5?

Z.ai reports +7.3 points on Terminal-Bench 2.0 (63.5 vs 56.2) and +6.8 on NL2Repo (42.7 vs 35.9). Long-horizon agent tasks and role-play coherence are the biggest gains. For coding-focused users, the upgrade is worth the additional VRAM footprint.

Bottom line

GLM-5.1 is the best open-weight coding model available in June 2026, and the first one whose local performance comes within striking distance of Claude Code. But "running locally" still means 2× professional GPUs and 256 GB of RAM as a floor — this is not a single-GPU model. For solo developers below 50 M tokens/month, the Z.ai API is the rational choice. For teams, regulated industries, or anyone running long agent loops, self-hosting GLM-5.1 with vLLM on 4× H100 is the configuration to copy. Going below Q4_K_M starts costing reliability on the exact tasks the model is best at — see the BestLLMfor guides hub for the next round of model-specific tuning notes.