Model card
MiMo V2 Flash
By Xiaomi · China
chat
code
moe
Overview
Xiaomi's 309B-parameter sparse MoE (52B active), released under the MIT license. It scored 73.4% on SWE-Bench Verified, the top result at launch. Built for heavy-duty code and reasoning work.
When to pick this model
- Self-hosted coding agents that need frontier SWE-Bench accuracy
- Refactoring and bug-fixing pipelines over large repos
- Long-context code review (up to 128k tokens)
- MIT-licensed deployments where commercial use is non-negotiable
- Teams with multi-GPU infrastructure willing to trade VRAM for quality
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 185 GB |
| Q5_K_M | 222 GB |
| Q8_0 | 330 GB |
| FP16 (no quantization) | 618 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
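As a rough sanity check on the table, each figure can be converted back into an implied number of bits per weight. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes) and treats the whole figure as weights, so the results are upper bounds: the table also counts KV cache and runtime overhead. The function and variable names are illustrative only.

```python
# Figures copied from the table above; 309B total parameters per the overview.
TOTAL_PARAMS_BILLIONS = 309
VRAM_GB = {"Q4_K_M": 185, "Q5_K_M": 222, "Q8_0": 330, "FP16": 618}


def implied_bits_per_weight(vram_gb: float, params_billions: float) -> float:
    """Upper bound on bits per weight, assuming all of vram_gb holds weights.

    Assumes 1 GB = 1e9 bytes, so GB * 8 / billions-of-params = bits per param.
    """
    return vram_gb * 8 / params_billions


for name, gb in VRAM_GB.items():
    bits = implied_bits_per_weight(gb, TOTAL_PARAMS_BILLIONS)
    print(f"{name}: about {bits:.2f} bits per weight")
```

The FP16 row works out to 16 bits per weight, which suggests the cache and overhead terms fall within the table's rounding at an 8k context.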
Strengths
- State-of-the-art SWE-Bench Verified score (73.4%) at release
- MoE design activates only 52B of 309B params, lowering inference cost
- 128k context window suits whole-repo reasoning
- Permissive MIT license for commercial deployment
- Architecture borrows from DeepSeek's proven MoE recipe
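To put the MoE point in numbers, the short sketch below computes the share of parameters active per token from the two figures quoted above. It assumes per-token compute scales roughly with active parameters, which is a simplification. All 309B parameters still have to be resident in memory, which is why the VRAM table is driven by the total count.

```python
TOTAL_PARAMS_B = 309   # total parameters, in billions (from the overview)
ACTIVE_PARAMS_B = 52   # parameters active per token, in billions


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for each token."""
    return active_b / total_b


frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active per token: {frac:.1%} of all parameters")
# Rough comparison against a hypothetical dense model of the same total size.
print(f"Approximate compute reduction vs. dense: {1 / frac:.1f}x")
```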
Limitations
- Requires roughly 185 GB of VRAM at Q4_K_M, which means multi-GPU, H100-class hardware
- Xiaomi's open-weight licensing is relatively new, so a legal review is worthwhile before commercial use
- The architecture is new, so tooling support outside vLLM may lag
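The hardware limitation can be made concrete with a quick estimate of how many cards each quantization needs. This is a minimal sketch that assumes 80 GB per GPU (the common H100 configuration) and ignores any extra memory that multi-GPU sharding or longer contexts may consume, so treat the counts as lower bounds. The VRAM figures are the ones from the quantization table.

```python
import math

# VRAM figures from the quantization table, in GB.
VRAM_GB = {"Q4_K_M": 185, "Q5_K_M": 222, "Q8_0": 330, "FP16": 618}

# Assumption: 80 GB of memory per card; adjust for your hardware.
GPU_MEMORY_GB = 80


def min_gpus(required_gb: float, per_gpu_gb: float = GPU_MEMORY_GB) -> int:
    """Smallest number of cards whose combined memory covers required_gb."""
    return math.ceil(required_gb / per_gpu_gb)


for name, gb in VRAM_GB.items():
    print(f"{name}: at least {min_gpus(gb)} x {GPU_MEMORY_GB} GB GPUs")
```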
Architecture & training
Architecture: MoE · 309B total / 52B active · Xiaomi MiMo V2 Flash
Training: by Xiaomi; strong in code and reasoning, with an architecture inspired by DeepSeek.
Verdict
If you need an MIT-licensed, top-of-the-leaderboard coding model and have the GPUs to run it, MiMo V2 Flash is the pick.
Quick start
ollama pull hf.co/xiaomiteam/MiMo-V2-Flash-GGUF
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
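Once the model has been pulled with the `ollama pull` command above, it can also be queried from a script through Ollama's local REST API. The sketch below assumes Ollama is running on its default port (11434) and that the model is registered under the same name used in the pull command. The helper names `build_request` and `generate` are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Assumption: the model name matches the one used with `ollama pull`.
MODEL = "hf.co/xiaomiteam/MiMo-V2-Flash-GGUF"


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Explain what this function does: ...")` returns the completion as a string. With `stream` set to `False`, the server replies with a single JSON object rather than a stream of chunks.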