Qwen 3 8B
By Alibaba · China
Overview
Alibaba's 8B dense model with a toggleable thinking mode and broad multilingual coverage. Punches well above its weight for an 8B and runs comfortably on a single consumer GPU.
When to pick this model
- You want one local 8B that handles both quick chat and harder reasoning via a thinking toggle
- You need multilingual coverage across 100+ languages without paying API fees
- You're prototyping agents and want long context (up to 131K with YaRN) on modest hardware
- You need an Apache 2.0 model for commercial deployment at the 8B tier
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 5 GB |
| Q5_K_M | 6 GB |
| Q8_0 | 9 GB |
| FP16 (no quantization) | 16 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMLU-Pro | 68.7 |
| GPQA | 60 |
| LiveCodeBench | 54.4 |
Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.
Strengths
- Hybrid thinking/fast modes switchable per request
- Strong multilingual performance across 119 languages
- Up to 131K context via YaRN (32K native)
- Apache 2.0 — clean commercial use
Limitations
- Thinking traces are verbose and burn tokens fast
- Ecosystem tooling still less mature than the Qwen 2.5 line
Architecture & training
Architecture: Dense · GQA · hybrid thinking/non-thinking mode
Training: 36T tokens, 119 languages.
The best general-purpose Apache-licensed 8B for teams that want one model covering chat, reasoning, and 100+ languages.
Quick start
ollama run qwen3:8bOr use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.