Qwen 2.5 32B
By Alibaba · China
Overview
Alibaba's Qwen 2.5 32B was the open-weight 32B reference of late 2024: it matches 70B-class quality on most benchmarks at roughly half the VRAM.
When to pick this model
- Self-hosted assistants on a single 24GB GPU using quantized weights (Q4_K_M or Q5_K_M)
- Long-context reasoning workloads up to 128k tokens
- Math and code-heavy pipelines
- Commercial deployments needing Apache 2.0
- A 70B alternative when VRAM is tight
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 19 GB |
| Q5_K_M | 23 GB |
| Q8_0 | 35 GB |
| FP16 (no quantization) | 64 GB |
VRAM figures include model weights, a typical 8k-token KV cache, and ~600 MB of runtime overhead (Ollama / llama.cpp baseline). Add headroom for longer context lengths.
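To get a rough idea of how much headroom a longer context needs, the sketch below adds extra KV cache on top of the table figures. Only the 64 layers and the use of GQA come from this page. The 8 KV heads, head dimension of 128, 16-bit cache, and reading "8k" as 8192 tokens are assumptions, and the table's GB values are treated as GiB for simplicity. Real usage depends on the runtime and its cache settings, so treat the output as an estimate only.

```python
# VRAM figures from the table above; each already includes an 8k KV cache.
TABLE_VRAM_GB = {"Q4_K_M": 19, "Q5_K_M": 23, "Q8_0": 35, "FP16": 64}

LAYERS = 64               # from the architecture line on this page
KV_HEADS = 8              # assumption: GQA key/value head count
HEAD_DIM = 128            # assumption: per-head dimension
BYTES_PER_VALUE = 2       # assumption: 16-bit (FP16) KV cache
BASELINE_CONTEXT = 8192   # assumption: "8k" means 8192 tokens
GIB = 1024 ** 3


def kv_cache_bytes(tokens: int) -> int:
    """Bytes of KV cache for `tokens` tokens (one key and one value per layer)."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * tokens


def estimated_vram_gb(quant: str, context_tokens: int) -> float:
    """Table figure plus KV cache for tokens beyond the 8k baseline."""
    extra_tokens = max(0, context_tokens - BASELINE_CONTEXT)
    return TABLE_VRAM_GB[quant] + kv_cache_bytes(extra_tokens) / GIB


def fits(quant: str, context_tokens: int, vram_gb: float) -> bool:
    """True if the estimate stays within the given VRAM budget."""
    return estimated_vram_gb(quant, context_tokens) <= vram_gb
```

Under these assumptions, `fits("Q4_K_M", 16384, 24)` is true while `fits("Q4_K_M", 32768, 24)` is false, which is why long-context workloads usually call for more than a single 24GB card.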
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMLU | 83.3 |
| HumanEval | 90.2 |
| MATH | 83.1 |
Scores are published by the model author or aggregated from public leaderboards, and are re-measured monthly by our editorial team.
Strengths
- Quality on par with many 70B models
- 128k context
- Apache 2.0 license
- Strong math, code, and reasoning
Limitations
- Needs ~19GB VRAM at Q4, which pushes the limits of a single 24GB card
- Outperformed by Qwen 3 32B in 2025
- No native vision
Architecture & training
Architecture: Dense Transformer · 64 layers · GQA · Qwen 2.5
Training: 18T tokens. Strong in reasoning, code, and long instructions.
A landmark open-weight 32B that's still a strong default, though it is worth upgrading to Qwen 3 32B when you can.
Quick start
ollama run qwen2.5:32b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
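Once the model has been pulled with the `ollama run` command, it can also be queried from a script. The sketch below uses Ollama's local HTTP endpoint `/api/generate` on its default port 11434 and assumes the server is running on the same machine. The helper names `build_request` and `generate` are illustrative, not part of any library.

```python
import json
import urllib.request

# Ollama's default local address; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen2.5:32b",
                  num_ctx: int = 8192) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                 # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx}, # context window to allocate
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize GQA in one sentence.")` returns the model's reply as a string. Raising `num_ctx` increases KV cache memory, so keep it within the VRAM headroom of your card.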