Model fiche
GLM 4.7 Flash
By Zhipu AI · China
chat
multilingual
Overview
Zhipu AI's compact 3B variant of GLM 4.7, MIT-licensed with a 128k context. Optimized for low-latency bilingual Chinese-English chat.
When to pick this model
- Bilingual zh/en chat assistants where latency is critical
- Lightweight chat backends with a strict permissive license requirement
- Long-context summarization on small GPUs
- Cost-sensitive serving at scale where 30B variants are overkill
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 1.7 GB |
| Q5_K_M | 2.1 GB |
| Q8_0 | 3.2 GB |
| FP16 (no quantization) | 6 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
Strengths
- MIT license — among the most permissive in the open ecosystem
- 128k context in a 3B footprint
- Strong Chinese and English performance
- Compact ~1.7GB VRAM at Q4
Limitations
- Gated on Hugging Face despite the open license
- Less versatile than the 30B GLM 4.7 variants
Architecture & training
Architecture: Dense transformer · 3B parameters · 128k context
Training: GLM 4.7 family from Zhipu AI / THUDM (Tsinghua). Flash variant optimized for latency, focus on zh/en.
Verdict
MIT-licensed, fast, and bilingual — the GLM 4.7 to reach for when you need throughput over peak capability.
Quick start
ollama run glm-4.7-flashOr use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.