DeepSeek V3 671B
By DeepSeek · China
Overview
DeepSeek's frontier-open Mixture-of-Experts (MoE) model: 671B total parameters, 37B active per token, with multi-head latent attention (MLA) and an auxiliary-loss-free load-balancing scheme. The V3.1-Terminus update relicenses it under MIT.
When to pick this model
- You're running server-class inference and want frontier-open performance
- You need a non-reasoning frontier model for general chat and code at scale
- You want the MLA architecture's reduced KV-cache footprint
- You can move to V3.1-Terminus for MIT licensing
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 400 GB |
| Q5_K_M | 480 GB |
| Q8_0 | 720 GB |
| FP16 (no quantization) | 1342 GB |
VRAM figures include model weights plus the KV cache for a typical 8k context and ~600 MB of runtime overhead (Ollama / llama.cpp baseline). Add headroom for longer context lengths.
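The arithmetic behind the table can be approximated from the parameter count alone. The sketch below estimates weight memory only (no KV cache or runtime overhead), and the bits-per-weight averages for the K-quant formats are assumptions rather than figures from this page, so expect its output to land near the table values, not on them.

```python
# Rough weight-memory estimate per quantization level.
# BITS_PER_WEIGHT values are approximate assumptions, not figures from this page.
TOTAL_PARAMS = 671e9  # total parameter count quoted in the overview

BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,  # assumed average for this mixed-precision format
    "Q5_K_M": 5.69,  # assumed average for this mixed-precision format
    "Q8_0": 8.5,     # 8-bit weights plus one 16-bit scale per 32-weight block
    "FP16": 16.0,
}


def weight_gb(quant: str, params: float = TOTAL_PARAMS) -> float:
    """Return the estimated weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    for name in BITS_PER_WEIGHT:
        print(f"{name}: ~{weight_gb(name):.0f} GB of weights")
```

The same function can be reused for other model sizes by passing a different `params` value.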
Strengths
- Frontier-open performance in chat, code, and general tasks
- MLA cuts KV memory significantly vs standard attention
- V3.1-Terminus available under MIT
- Pretrained on 14.8T tokens
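To make the KV-cache saving from MLA concrete, here is a back-of-the-envelope comparison of per-token cache size. Every dimension in it (layer count, head count, head size, latent width, rotary width) is an assumption based on the publicly described V3 configuration, not something stated on this page, and the "standard attention" baseline is a hypothetical model with the same head layout.

```python
# Per-token KV-cache size: standard multi-head attention vs MLA.
# All dimensions are assumptions, not values from this page; adjust as needed.
N_LAYERS = 61
N_HEADS = 128
HEAD_DIM = 128
KV_LATENT_DIM = 512   # compressed KV latent cached by MLA (assumed)
ROPE_DIM = 64         # rotary key cached alongside the latent (assumed)
BYTES_PER_VALUE = 2   # 16-bit cache entries


def mha_bytes_per_token() -> int:
    # Standard attention caches a full key and a full value vector per head.
    return 2 * N_HEADS * HEAD_DIM * N_LAYERS * BYTES_PER_VALUE


def mla_bytes_per_token() -> int:
    # MLA caches one shared latent plus one rotary key per layer.
    return (KV_LATENT_DIM + ROPE_DIM) * N_LAYERS * BYTES_PER_VALUE


def cache_gb(bytes_per_token: int, context_tokens: int) -> float:
    """Cache size in decimal gigabytes for a single sequence."""
    return bytes_per_token * context_tokens / 1e9


if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        print(ctx, cache_gb(mha_bytes_per_token(), ctx), cache_gb(mla_bytes_per_token(), ctx))
```

Under these assumptions the saving grows linearly with context length, which is why it matters most for long-context serving.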
Limitations
- Original V3 uses the restrictive DeepSeek License
- Needs 400 GB+ of VRAM even at Q4_K_M, so server-class hardware only
- Overkill for most workloads under 10B requests/month
Architecture & training
Architecture: MoE with 256 experts, 8 active per token · MLA · auxiliary-loss-free load balancing · FP8 training
Training: pre-trained on 14.8T tokens. V3.1-Terminus (Sep 2025) is relicensed under MIT.
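As a rough illustration of "256 experts, 8 active" with bias-based balancing, the sketch below routes one token to its top 8 experts and then nudges a per-expert bias according to load. It is a simplified toy, not DeepSeek's implementation: the sigmoid scoring, the rule that the bias affects selection but not gate weights, and the `BIAS_STEP` value are all assumptions made for illustration.

```python
import math
import random

N_EXPERTS = 256    # experts, from the architecture line
TOP_K = 8          # experts active per token
BIAS_STEP = 0.001  # hypothetical update speed for the balancing bias


def route(logits, bias):
    """Pick TOP_K experts for one token; return {expert_index: gate_weight}."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]  # sigmoid affinity
    # The bias only influences which experts are selected.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:TOP_K]
    # Gate weights use the unbiased scores, normalized to sum to 1.
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


def update_bias(bias, load):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - BIAS_STEP)
        elif n < mean_load:
            new_bias.append(b + BIAS_STEP)
        else:
            new_bias.append(b)
    return new_bias


if __name__ == "__main__":
    rng = random.Random(0)
    bias = [0.0] * N_EXPERTS
    load = [0] * N_EXPERTS
    for _ in range(1024):  # one toy batch of tokens with random router logits
        for expert in route([rng.gauss(0, 1) for _ in range(N_EXPERTS)], bias):
            load[expert] += 1
    bias = update_bias(bias, load)
    print("busiest expert handled", max(load), "tokens")
```

Because balancing happens through the selection bias rather than an extra loss term, the training objective itself is left untouched, which is the point of an auxiliary-loss-free scheme.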
Frontier-open performance for teams with serious inference infrastructure — go straight to V3.1-Terminus for the MIT license.
Quick start
ollama run deepseek-v3:671b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
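Beyond the interactive `ollama run` command, a locally running Ollama server can also be queried over its HTTP API. The sketch below assumes the server is listening on its default local port and that the `deepseek-v3:671b` tag has already been pulled; the helper names `build_request` and `generate` are illustrative, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint
MODEL = "deepseek-v3:671b"  # model tag from the quick-start command


def build_request(prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 600.0) -> str:
    """Send the prompt and return the model's full text response."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Example (needs a running Ollama server with the model available):
#   print(generate("Summarize multi-head latent attention in two sentences."))
```

The generous default timeout is a deliberate choice: a model of this size can take a long time to load and respond on first use.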