BestLLMfor EN Your hardware. Your LLM. Your call.
APIOpen data Find my LLM
Model fiche

GLM 4.7 Flash

By Zhipu AI · China

chat multilingual
Parameters
3B
License
MIT
Context
125k
VRAM (Q4)
1.7 GB
Released
February 2026

Overview

Zhipu AI's compact 3B variant of GLM 4.7, MIT-licensed with a 128k context. Optimized for low-latency bilingual Chinese-English chat.

When to pick this model

  • Bilingual zh/en chat assistants where latency is critical
  • Lightweight chat backends with a strict permissive license requirement
  • Long-context summarization on small GPUs
  • Cost-sensitive serving at scale where 30B variants are overkill

VRAM requirements by quantization

QuantizationVRAM required
Q4_K_M (recommended)1.7 GB
Q5_K_M2.1 GB
Q8_03.2 GB
FP16 (no quantization)6 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.

Strengths

  • MIT license — among the most permissive in the open ecosystem
  • 128k context in a 3B footprint
  • Strong Chinese and English performance
  • Compact ~1.7GB VRAM at Q4

Limitations

  • Gated on Hugging Face despite the open license
  • Less versatile than the 30B GLM 4.7 variants

Architecture & training

Architecture: Dense transformer · 3B parameters · 128k context

Training: GLM 4.7 family from Zhipu AI / THUDM (Tsinghua). Flash variant optimized for latency, focus on zh/en.

Verdict

MIT-licensed, fast, and bilingual — the GLM 4.7 to reach for when you need throughput over peak capability.

Quick start

ollama run glm-4.7-flash

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.

Tools

Is GLM 4.7 Flash the right pick for you?

Compute self-hosted ROI → Back to catalog