Guide · 2026-06-03

GLM 5.1 Local: Setup & Benchmarks

Q: Can GLM-5.1 run on a single RTX 5090?

No, not usefully. A single 5090 has 32 GB VRAM. Even with aggressive expert offload to 256 GB of system RAM, it would mean running IQ1_S quants at 4-6 tok/s with significant quality loss. Use Qwen3-Coder 32B or DeepSeek-V2.5 Lite instead.

Q: How much disk space is needed to download the model?

Q4_K_M is approximately 380 GB across nine shards. FP8 is 754 GB. Plan on at least 1 TB of NVMe for working storage plus headroom for K/V cache files and at least one backup quant.

Q: Is GLM-5.1 actually MIT-licensed including commercial use?

Yes. The model weights, tokenizer, and inference code are released under MIT. Fine-tuning, redistribution, and embedding in commercial products are all permitted without royalties or notification, making it the most permissive license among frontier-tier open models as of June 2026.

Q: Does GLM-5.1 support tool calling and structured output?

Yes. GLM-5.1 ships with a native tool-calling format compatible with the OpenAI function-calling schema, and supports JSON Schema constrained decoding through both vLLM (via outlines) and llama.cpp (via grammars). Agentic frameworks like LangGraph and CrewAI work without modification.

Q: Will quantization break agentic workflows?

Q4_K_M and above are safe. Below that, expect occasional mis-quoted shell commands, tool-name hallucination, and forgotten directory context in multi-turn agent runs. The Q3 to Q2 cliff is where production reliability falls off; keep Q3_K_M as the floor for autonomous use.

Q: How does GLM-5.1 compare to GLM-5?

Z.ai reports +7.3 points on Terminal-Bench 2.0 (63.5 vs 56.2) and +6.8 on NL2Repo (42.7 vs 35.9). Long-horizon agent tasks and role-play coherence are the biggest gains. For coding-focused users, the upgrade is worth the additional VRAM footprint.

Last updated 2026-06-03

Z.ai's 754B MoE flagship runs locally — but only with serious silicon. Here's what GLM 5.1 Local actually needs, how it scores, and when self-hosting beats the API.

By Mohamed Meguedmi · 9 min read

Key Takeaways

GLM-5.1 is a 754B-parameter MoE with ~32B active per token, MIT-licensed, and currently the #1 open model on Code Arena.
Realistic local floor: 2× RTX 6000 Ada (96 GB VRAM) + 256 GB system RAM for Q4_K_M GGUF with expert offload, or 4× H100 80 GB for FP8 production serving.
Q4_K_M holds ~98% of FP16 quality on coding benchmarks; Q2_K is tinker-grade; anything below IQ2 noticeably degrades agentic reliability.
Throughput verdict: vLLM > SGLang > llama.cpp > Ollama. Pick Ollama only for single-user prototyping.
Local breaks even around 500 M tokens/month for a single seat. Below that, the Z.ai API is cheaper than electricity plus hardware amortization.

What GLM-5.1 actually is — and why people self-host it

Released by Z.ai (zai-org) in February 2026, GLM-5.1 is the successor to GLM-5 and currently the strongest open-weight model on the Code Arena leaderboard. It uses a Mixture-of-Experts architecture: 754 billion total parameters, but only about 32 billion are active for any given token. That detail matters enormously for local inference — you pay storage cost for all 754 B, but compute cost for only 32 B.

Three things drive the self-hosting interest:

License. MIT, including the weights. You can fine-tune, redistribute, embed in a commercial product, and run it air-gapped — none of which is true for Claude, GPT-5, or Gemini.
Agentic coding performance. 63.5 on Terminal-Bench 2.0 (Terminus-2 harness) and 42.7 on NL2Repo. That is within striking distance of Claude Code (69.0) at zero per-token cost once the hardware is paid for.
Data residency. EU and HIPAA-adjacent workloads can't legally ship code to Z.ai's API endpoints in China. Local is the only option.

If none of the three apply, the Z.ai API is the path of least resistance. The rest of this guide assumes at least one of them does.

Hardware requirements by quantization

This is the table to read first. Numbers assume Unsloth's dynamic GGUF quants, llama.cpp build dated April 2026, and a single concurrent user with 8K context.

Quant	Disk / VRAM (no offload)	Suggested GPU config	Tokens/sec	Use case
FP16	~1.5 TB	8× H200 141 GB (NVLink)	180–220	Research, lab only
FP8 (vLLM)	~754 GB	4× H200 or 8× H100 80 GB	140–180	Production serving
AWQ-4bit	~410 GB	6× RTX 6000 Ada 48 GB	90–120	Multi-user batched
Q4_K_M GGUF	~380 GB	2× RTX 6000 Ada + 256 GB RAM (expert offload)	14–22	Sweet spot for single dev
Q3_K_M GGUF	~290 GB	2× RTX 6000 Ada + 192 GB RAM	18–26	Tight VRAM budgets
Q2_K GGUF	~200 GB	4× RTX 4090 + 192 GB RAM	10–16	Tinker, evaluation only
IQ1_S	~150 GB	2× RTX 4090 + 128 GB RAM	6–10	Not recommended for agentic

Two things to call out. First, MoE models offload to system RAM gracefully — only the active experts need to be hot in VRAM at any moment — which is why a 2-GPU consumer-pro box can actually run Q4_K_M. Second, those tokens/sec are with llama.cpp's --override-tensor expert-offload path; skip it and performance collapses by 4–6×.

For a side-by-side with Qwen3-Coder, DeepSeek V4, and Llama 4 Behemoth, see the BestLLMfor model catalog.

Quantization quality: where the cliff actually is

Unsloth published per-quant evaluation runs on April 8, 2026. Aggregated across HumanEval+, MBPP+, NL2Repo, and Terminal-Bench 2.0:

Quant	HumanEval+	NL2Repo	Terminal-Bench 2.0	Retention vs FP16
FP16	92.1	42.7	63.5	100%
FP8	91.8	42.5	63.1	99.5%
Q5_K_M	91.4	42.0	62.4	98.7%
Q4_K_M	90.6	41.4	61.2	97.6%
Q3_K_M	87.9	38.8	56.7	92.1%
Q2_K	81.3	33.1	47.8	81.4%
IQ1_S	68.5	22.4	31.0	59.2%

The cliff sits between Q3 and Q2 — and it is steep specifically on agentic, multi-step tool-use evals. A 14-point drop on Terminal-Bench means the model starts forgetting which directory it is in, mis-quoting shell arguments, and calling the wrong tool. For interactive chat, Q2_K is fine. For autonomous coding loops, do not go below Q3_K_M.

Setup: four paths, ranked by effort vs throughput

Path A — Ollama (easiest, single user)

# Requires Ollama 0.6.2+ for MoE expert offload
ollama pull glm-5.1:q4_k_m
ollama run glm-5.1:q4_k_m

One command. Pulls ~380 GB. Auto-detects VRAM and spills experts to system RAM. Expect 14–22 tok/s on the reference 2× RTX 6000 Ada config. Good for chat, agent prototyping, and IDE plugins. Not appropriate for serving more than one concurrent request.

Path B — llama.cpp from GGUF (most control)

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j

huggingface-cli download unsloth/GLM-5.1-GGUF \
  --include "GLM-5.1-Q4_K_M-*.gguf" --local-dir ./models/glm51

./build/bin/llama-server \
  -m ./models/glm51/GLM-5.1-Q4_K_M-00001-of-00009.gguf \
  --n-gpu-layers 99 \
  --override-tensor "([0-9]+).ffn_.*_exps.=CPU" \
  --ctx-size 32768 \
  --host 0.0.0.0 --port 8080

The --override-tensor flag is the critical one: it pins attention layers to GPU and pushes MoE expert tensors to CPU. Without it you would need 380 GB of VRAM. With it, 96 GB VRAM is enough.

Path C — vLLM (production)

pip install vllm==0.9.1
vllm serve zai-org/GLM-5.1-FP8 \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --enable-prefix-caching

Assumes 4× H100 80 GB or 4× H200. vLLM with FP8 hits 140–180 tok/s aggregated across concurrent users, with sub-second time-to-first-token. This is the right answer for any team workload.

Path D — SGLang (agentic-optimized)

SGLang beats vLLM by 15–25% on agent workloads thanks to its RadixAttention prefix cache. If traffic is dominated by long, shared system prompts (typical for coding agents), use it instead of vLLM.

Benchmarks: GLM-5.1 vs the field

Numbers below come from the official GLM-5.1 model card and the May 2026 Terminal-Bench leaderboard. Open-weight models in bold.

Model	License	Active params	SWE-Bench Verified	Terminal-Bench 2.0	NL2Repo
Claude Code 4.5	Proprietary	—	74.2	69.0	48.1
GPT-5.2	Proprietary	—	71.8	66.4	45.3
GLM-5.1	MIT	32 B	67.9	63.5	42.7
DeepSeek V4	DeepSeek	37 B	65.1	61.2	40.9
Qwen3-Coder 480B	Apache 2.0	35 B	62.4	58.8	38.5
Llama 4 Behemoth	Llama 4 CL	288 B	59.7	54.1	35.2

The takeaway: GLM-5.1 is roughly 5–8 points behind frontier proprietary on agentic coding, and 2–5 points ahead of the closest open alternative. For a self-hosted MIT-licensed model, that is the current state of the art.

Cost analysis: local vs Z.ai API

The Z.ai API charges roughly $0.45 / M input tokens and $1.80 / M output tokens for GLM-5.1 as of June 2026. A reference local config (2× RTX 6000 Ada ≈ $13,000 + EPYC server ≈ $7,000 = $20K capex) amortized over 36 months at 24/7 operation costs about $0.78/hour in hardware plus $0.42/hour in electricity at $0.15/kWh — call it $876/month all-in.

Monthly tokens (in+out, 50/50 split)	API cost	Local cost (capex + power)	Winner
5 M	$5.6	$876	API by 156×
20 M	$22.5	$876	API by 39×
50 M	$56.3	$876	API by 16×
500 M	$562	$876	API by 1.6×
2 B	$2,250	$876	Local by 2.6×
10 B	$11,250	$876	Local by 12.8×

Plug actual volume into the cost calculator to see exactly where break-even lands. The honest answer for most individual developers: stay on the API. Local makes financial sense once you are serving a team, running long agent loops, or have a data-residency requirement that removes the option entirely.

When to pick GLM-5.1 over alternatives

Scenario	Recommendation
Solo dev, <50 M tok/month, no compliance constraint	Z.ai API
Team of 5–20, mostly coding agents	GLM-5.1 FP8 on 4× H100 with vLLM
Solo dev, 2× pro GPUs already on hand	GLM-5.1 Q4_K_M via Ollama or llama.cpp
Consumer hardware only (single 4090 / 5090)	Skip GLM-5.1. See best LLM for RTX 5090.
EU/HIPAA workload	GLM-5.1 self-hosted, on-prem only
Need frontier-tier results regardless of cost	Stay on Claude Code or GPT-5.2

BestLLMfor publishes the underlying hardware-vs-model matrix as a free CC BY 4.0 dataset; the same data powers an open-source MCP server so it can be queried directly from Claude Code or any MCP-compatible agent. See the methodology page for how the benchmarks above are reproduced.

Frequently Asked Questions

Can GLM-5.1 run on a single RTX 5090?

No, not usefully. A single 5090 has 32 GB VRAM. Even with aggressive expert offload to 256 GB of system RAM, it would mean running IQ1_S quants at 4–6 tok/s with significant quality loss. Use Qwen3-Coder 32B or DeepSeek-V2.5 Lite instead — they fit fully in 32 GB and deliver 60–80 tok/s.

How much disk space is needed to download the model?

Q4_K_M is ~380 GB across nine shards. FP8 is 754 GB. Plan on at least 1 TB of NVMe for working storage; headroom for the K/V cache files and at least one backup quant is recommended.

Is GLM-5.1 actually MIT-licensed including commercial use?

Yes. The model weights, tokenizer, and inference code are released under MIT. Fine-tuning, redistribution, and embedding in commercial products are all permitted without royalties or notification. This is the most permissive license among frontier-tier open models as of June 2026.

Does it support tool calling and structured output?

Yes. GLM-5.1 ships with a native tool-calling format compatible with the OpenAI function-calling schema, and supports JSON Schema constrained decoding through both vLLM (via outlines) and llama.cpp (via grammars). Agentic frameworks like LangGraph, CrewAI, and Claude-Code-style harnesses work without modification.

Will quantization break agentic workflows?

Q4_K_M and above are safe. Below that, expect occasional mis-quoted shell commands, tool-name hallucination, and forgotten directory context in multi-turn agent runs. The Q3→Q2 cliff is where production reliability falls off — keep Q3_K_M as the floor for autonomous use.

How does GLM-5.1 compare to GLM-5?

Z.ai reports +7.3 points on Terminal-Bench 2.0 (63.5 vs 56.2) and +6.8 on NL2Repo (42.7 vs 35.9). Long-horizon agent tasks and role-play coherence are the biggest gains. For coding-focused users, the upgrade is worth the additional VRAM footprint.

Bottom line

GLM-5.1 is the best open-weight coding model available in June 2026, and the first one whose local performance comes within striking distance of Claude Code. But "running locally" still means 2× professional GPUs and 256 GB of RAM as a floor — this is not a single-GPU model. For solo developers below 50 M tokens/month, the Z.ai API is the rational choice. For teams, regulated industries, or anyone running long agent loops, self-hosting GLM-5.1 with vLLM on 4× H100 is the configuration to copy. Going below Q4_K_M starts costing reliability on the exact tasks the model is best at — see the BestLLMfor guides hub for the next round of model-specific tuning notes.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

Amazon Check RTX 5070 Ti price →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.