MiMo V2.5
By Xiaomi · China
Overview
Xiaomi's MIT-licensed omnimodal model: a 310B-parameter MoE with 15B active parameters that handles text, image, video, and audio. It scores 87.7 on Video-MME and supports a 1M-token context window. Released April 2026.
When to pick this model
- Video understanding pipelines (Video-MME 87.7)
- Unified text, image, video, and audio workflows
- Million-token multimodal context tasks
- MIT-licensed alternative to closed omnimodal APIs
- Document and chart reasoning (CharXiv RQ 81.0)
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 180 GB |
| Q5_K_M | 220 GB |
| Q8_0 | 330 GB |
| FP16 (no quantization) | 620 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
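As a rough cross-check on the table above, weight memory can be estimated from the total parameter count and the average bits stored per weight. The sketch below is illustrative only: the bits-per-weight values are approximate assumptions for llama.cpp-style formats (they vary by model and tool version), and the helper name `weight_memory_gb` is hypothetical. It covers weights only, so KV cache and runtime overhead come on top. Note that all 310B parameters must be resident even though only 15B are active per token.

```python
# Approximate average bits per weight for each format.
# These are assumed values for illustration; real figures vary by model and tooling.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

TOTAL_PARAMS = 310e9  # total parameters, from the model card (not the 15B active)


def weight_memory_gb(total_params: float, quant: str) -> float:
    """Estimate memory for the weights alone, in decimal gigabytes."""
    bits = BITS_PER_WEIGHT[quant]
    return total_params * bits / 8 / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{weight_memory_gb(TOTAL_PARAMS, quant):.0f} GB for weights")
```

For FP16 the arithmetic is exact (310B parameters × 2 bytes = 620 GB), which matches the table; the quantized rows depend on the assumed bits-per-weight values and should be read as estimates.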
Published benchmark scores
| Benchmark | Score |
|---|---|
| Video-MME | 87.7 |
| CharXiv RQ | 81.0 |
| MMMU-Pro | 77.9 |
Scores are published by the model author or aggregated from public leaderboards, and are re-measured monthly by our editorial team.
Strengths
- Omnimodal under MIT: text, image, video, audio
- 1M context window
- 87.7 Video-MME and 81.0 CharXiv RQ
- Permissive MIT license at frontier scale
- MoE design keeps active compute reasonable
Limitations
- Needs around 180 GB of VRAM even at Q4_K_M
- Video and audio inference pipelines are not yet standardized
- No Ollama support
Architecture & training
Architecture: MoE 310B/15B active · 48 layers (1 dense + 47 MoE) · 256 experts top-8 · ViT 729M + Audio 261M · MTP 329M · FP8
Training: ≈48T tokens · pipeline text pre-train → projector warmup → multimodal pre-train → agentic SFT → RL+MOPD.
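To illustrate what "256 experts top-8" means in practice, here is a minimal sketch of generic top-k expert routing for a single token. It is not Xiaomi's router: the softmax-over-selected-logits weighting, the random logits, and the function name `top_k_route` are all assumptions made for illustration. Only the expert count and k come from the architecture line above.

```python
import math
import random

NUM_EXPERTS = 256  # from the model card
TOP_K = 8          # from the model card


def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Generic top-k gating: pick the k largest router logits, then apply a
    softmax over just those k so the mixing weights sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Stand-in router scores for one token (random, for demonstration only).
rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top_k_route(router_logits, TOP_K)

if __name__ == "__main__":
    for expert, weight in routes:
        print(f"expert {expert:3d}  weight {weight:.3f}")
```

Because only 8 of 256 experts run for each token in a MoE layer, per-token compute tracks the 15B active parameters while the full 310B must still be stored.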
The first MIT-licensed model that genuinely handles video alongside everything else.
Quick start
# HuggingFace: XiaomiMiMo/MiMo-V2.5
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
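One possible way to fetch the weights for the Hugging Face repository ID listed above is the `huggingface_hub` library's `snapshot_download` function. This is a sketch under assumptions: the local directory name and the `download_weights` helper are hypothetical, the repository's file layout is not described here, and the download is very large, so check free disk space first.

```python
REPO_ID = "XiaomiMiMo/MiMo-V2.5"  # repository ID from the Quick start line


def download_weights(local_dir: str = "./mimo-v2.5") -> str:
    """Download the full repository snapshot and return its local path.

    The default local_dir is a hypothetical choice for this example.
    Requires `pip install huggingface_hub`.
    """
    # Imported lazily so the module can be loaded without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    path = download_weights()
    print(f"Snapshot stored at: {path}")
```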