Model fiche

MiMo V2 Flash

By Xiaomi · China

Updated 2026-07-13

chat code moe

Parameters

309B

License

MIT

Context

128k

VRAM (Q4)

185 GB

Released

April 2025

Overview

Xiaomi's 309B-parameter sparse MoE (52B active) released under MIT, topping SWE-Bench Verified at 73.4% at launch. Built for heavy-duty code and reasoning work.

When to pick this model

Self-hosted coding agents that need frontier SWE-Bench accuracy
Refactoring and bug-fixing pipelines over large repos
Long-context code review (up to 128k tokens)
MIT-licensed deployments where commercial use is non-negotiable
Teams with multi-GPU infrastructure willing to trade VRAM for quality

VRAM requirements by quantization

Quantization	VRAM required
Q4_K_M (recommended)	185 GB
Q5_K_M	222 GB
Q8_0	330 GB
FP16 (no quantization)	618 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.

In practice, MiMo V2 Flash is server-class even at Q4_K_M (185 GB). Stepping up to Q8_0 nearly doubles the footprint to 330 GB, and unquantized FP16 weights take 618 GB — plan your GPU around the Q4 or Q5 figure unless you specifically need the higher fidelity.
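
The quoted footprints can be sanity-checked with simple arithmetic. The sketch below derives the average bits per weight implied by each figure, assuming the 309B parameter count listed above and 1 GB = 10^9 bytes. It ignores the KV cache and runtime overhead, and the function names are illustrative, so treat the results as rough estimates only.

```python
PARAMS_B = 309  # total parameters in billions, from this fiche

# Quoted VRAM footprints in GB, copied from the table above.
QUOTED_GB = {"Q4_K_M": 185, "Q5_K_M": 222, "Q8_0": 330, "FP16": 618}


def implied_bits_per_weight(vram_gb: float, params_b: float = PARAMS_B) -> float:
    """Average bits per weight implied by a footprint (overhead ignored)."""
    return vram_gb * 8 / params_b


def weights_gb(bits_per_weight: float, params_b: float = PARAMS_B) -> float:
    """Approximate weight size in GB for a given average bit width."""
    return params_b * bits_per_weight / 8


for name, gb in QUOTED_GB.items():
    print(f"{name}: {implied_bits_per_weight(gb):.2f} bits/weight")
```

Under these assumptions the FP16 figure works out to 16 bits per weight, and Q4_K_M to a little under 5 bits, which is why the quantized builds are slightly larger than a naive 4-bit estimate would suggest.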

Without a GPU, MiMo V2 Flash needs roughly 230 GB of system RAM to run on CPU via llama.cpp or Ollama. That is workable for background jobs but far slower than GPU inference. Throughput estimates from our compatibility engine: around 1 token/sec on entry-level GPUs, on the order of 5 tokens/sec on a mid-range card, and up to 15 tokens/sec on high-end hardware, assuming the chosen quantization fully fits in VRAM.

What hardware do you need

The table below matches MiMo V2 Flash to common GPU memory tiers, using the highest-fidelity quantization that fully fits each card class. Spilling layers to system RAM works but costs most of the speed, so size your card to the quantization you actually want to run.

GPU memory	Example cards	Best fit for MiMo V2 Flash
8 GB	RTX 5070 Laptop, RTX 5060, RTX 5060 Ti 8GB	Does not fit — needs 185 GB at Q4_K_M
12 GB	RTX 5070, RTX 5070 Ti Laptop, RTX 4080 Laptop	Does not fit — needs 185 GB at Q4_K_M
16 GB	RTX 5080, RTX 4080 Super, Radeon RX 9070 XT	Does not fit — needs 185 GB at Q4_K_M
24 GB	RTX 4090, Radeon RX 7900 XTX, RTX 5090 Laptop	Does not fit — needs 185 GB at Q4_K_M
32 GB	RTX 5090	Does not fit — needs 185 GB at Q4_K_M
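
Since no single card in the table holds the model, a natural follow-up is how many identical cards would be needed. The sketch below is a back-of-the-envelope estimate, not output from our compatibility engine. The `usable` fraction (memory left after per-card overhead) is an assumed value, and the 80 GB tier is included only as an example of H100-class hardware.

```python
import math

Q4_VRAM_GB = 185  # Q4_K_M footprint quoted in this fiche


def cards_needed(card_gb: float, model_gb: float = Q4_VRAM_GB, usable: float = 0.9) -> int:
    """Smallest number of identical cards whose usable memory covers the model.

    `usable` is an assumed fraction of each card left after runtime overhead.
    """
    if card_gb <= 0 or not 0 < usable <= 1:
        raise ValueError("card_gb must be positive and usable must be in (0, 1]")
    return math.ceil(model_gb / (card_gb * usable))


for size in (24, 32, 80):
    print(f"{size} GB cards: {cards_needed(size)}")
```

The count is a lower bound: splitting a model across cards adds communication overhead, and longer contexts need extra KV cache memory on top of the weights.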

Which GPU should you buy to run MiMo V2 Flash?

To run MiMo V2 Flash locally at Q4, you need ~185 GB of VRAM. The best value for this is an Apple Mac Studio configured with enough unified memory to hold the full 185 GB; a 64 GB configuration is too small.

Strengths

State-of-the-art SWE-Bench Verified score (73.4%) at release
MoE design activates only 52B of 309B params, lowering inference cost
128k context window suits whole-repo reasoning
Permissive MIT license for commercial deployment
Architecture borrows from DeepSeek's proven MoE recipe

Limitations

Requires roughly 185 GB of VRAM at Q4, which means multi-GPU, H100-class hardware
Xiaomi's open-weight licensing is newer and worth a legal review
Newer architecture may lag in tooling support outside vLLM

Typical workloads

In our catalog grid, MiMo V2 Flash is filed under Frontier Code, Dev Agents — the use cases where its size/quality trade-off makes the most sense. Its tags translate to concrete workloads: code generation and review (pair it with an editor integration like Continue.dev or Cline).

The 128k-token context window covers long chats and mid-sized documents, though very large retrieval workloads will need chunking. The MIT license is permissive and allows commercial use, though a legal review is still advisable because Xiaomi's open-weight licensing is new.
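
For inputs that exceed the context window, one common approach is to split the token sequence into overlapping windows. The sketch below is tokenizer-agnostic: it assumes the text has already been converted to token IDs by whatever tokenizer the serving stack uses, and the function name and parameters are illustrative.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], max_tokens: int, overlap: int = 0) -> List[List[int]]:
    """Split a token sequence into windows of at most max_tokens.

    Consecutive windows share `overlap` tokens so context is not cut abruptly.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

In practice `max_tokens` should sit well below the full window, leaving room for the prompt template and the generated answer.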

Architecture & training

Architecture: MoE · 309B total / 52B active · Xiaomi MiMo V2 Flash

Training: Xiaomi — strong in code and reasoning, architecture inspired by DeepSeek.

Verdict

If you need an MIT-licensed, top-of-the-leaderboard coding model and have the GPUs to run it, MiMo V2 Flash is the pick.

Quick start

ollama pull hf.co/xiaomiteam/MiMo-V2-Flash-GGUF

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
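
Once the model is pulled, it can also be queried through Ollama's local HTTP API. A minimal sketch, assuming a default Ollama server listening on localhost:11434 and that the model is addressed by the same name used in the pull command above; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL = "hf.co/xiaomiteam/MiMo-V2-Flash-GGUF"  # name from the pull command


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Write a function that reverses a string.")` returns the completion as a single string. Given the throughput estimates above, expect long waits on anything short of server-class hardware.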

Similar models worth comparing

Model	Parameters	License	VRAM (Q4)	Compared with MiMo V2 Flash
DeepSeek V4 Flash 284B	284B	MIT	170 GB	needs 15 GB less VRAM at Q4
Qwen 3.5 122B-A10B	122B	Apache 2.0	73 GB	needs 112 GB less VRAM at Q4
Llama 3.3 70B Instruct	70B	Llama 3.3 Community	40 GB	needs 145 GB less VRAM at Q4
MiMo V2.5	310B	MIT	180 GB	same family · needs 5 GB less VRAM at Q4
MiMo V2.5 Pro	1020B	MIT	595 GB	same family · needs 410 GB more VRAM at Q4

Frequently asked questions

How much VRAM does MiMo V2 Flash need?

At the recommended Q4_K_M quantization, MiMo V2 Flash needs about 185 GB of VRAM. Q8_0 takes 330 GB, and unquantized FP16 weights take 618 GB.

Can MiMo V2 Flash run without a GPU?

Yes — with roughly 230 GB of system RAM it runs CPU-only through llama.cpp or Ollama. Expect a fraction of GPU speed, which is fine for background or batch jobs but slow for interactive chat.

What context window does MiMo V2 Flash support?

MiMo V2 Flash supports a 128k-token context window (128,000 tokens).

Can I use MiMo V2 Flash commercially?

Yes. MiMo V2 Flash is released under MIT, a permissive open-source license that allows commercial use, modification and redistribution.

How fast is MiMo V2 Flash on consumer hardware?

Our compatibility engine estimates on the order of 5 tokens/sec on a mid-range GPU and up to 15 tokens/sec on high-end cards, assuming the quantization fully fits in VRAM.

Which quantization of MiMo V2 Flash should I download first?

Start with Q4_K_M (185 GB) — the standard size/quality sweet spot. Step up to Q5_K_M or Q8_0 only if you have VRAM headroom. It does not fit a single 24 GB consumer card — plan for multi-GPU or server hardware.
