Mistral Small 3.1 24B
By Mistral AI · France
Overview
Mistral Small 3.1 builds on Small 3 by adding a vision encoder and a 128k-token context window, runs at around 150 tokens/sec, and is released under Apache 2.0. Small 3.2 (June 2025) is a drop-in upgrade.
When to pick this model
- Multimodal assistants needing both text and vision in one model
- Long-context RAG over mixed text and image sources
- Self-hosted Apache 2.0 deployments on a 24 GB GPU
- High-throughput inference where latency matters
- Replacing separate text and vision models with a single 24B
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 14 GB |
| Q5_K_M | 17 GB |
| Q8_0 | 26 GB |
| FP16 (no quantization) | 48 GB |
VRAM figures include model weights, the KV cache for a typical 8k-token context, and ~600 MB of runtime overhead (Ollama / llama.cpp baseline). Add headroom for longer contexts.
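As a rough illustration of how the table can be used, the sketch below picks the highest-precision quantization that fits a given GPU. The function name `best_quantization` and the `headroom_gb` parameter are hypothetical helpers introduced here, not part of Ollama or llama.cpp; the numbers are copied from the table and only hold at roughly 8k context.

```python
# VRAM figures copied from the table above (GB, at roughly 8k context).
VRAM_GB = {
    "Q4_K_M": 14,
    "Q5_K_M": 17,
    "Q8_0": 26,
    "FP16": 48,
}


def best_quantization(available_gb, headroom_gb=0.0):
    """Return the highest-precision quantization that fits, or None.

    headroom_gb is a hypothetical knob: VRAM reserved for longer
    contexts or other processes sharing the GPU.
    """
    budget = available_gb - headroom_gb
    # Larger VRAM need corresponds to higher precision in this table.
    fitting = [(need, name) for name, need in VRAM_GB.items() if need <= budget]
    if not fitting:
        return None
    return max(fitting)[1]


if __name__ == "__main__":
    print(best_quantization(24))                 # 24 GB card, no extra headroom
    print(best_quantization(24, headroom_gb=8))  # reserve 8 GB for long contexts
```

On a 24 GB card this selects Q5_K_M; reserving extra headroom for long contexts pushes the choice down to Q4_K_M, which matches the recommended row in the table.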
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMLU | 80.6 |
| MMMU | 64 |
Scores are published by the model author or aggregated from public leaderboards, and are re-measured monthly by our editorial team.
Strengths
- Vision and text combined in one 24B model
- 128k context window
- Apache 2.0 license
- Around 150 tokens/sec inference
Limitations
- Requires Ollama 0.6.5 or newer
- Superseded by Small 3.2 (June 2025), a marginal improvement that is worth picking instead
- Vision quality trails Qwen2-VL on OCR
Architecture & training
Architecture: Dense · multimodal text+vision · Tekken tokenizer
Training: Successor to Small 3 with added visual encoding.
The best open-weight 24B multimodal model under Apache 2.0 — and Small 3.2 makes it slightly better still.
Quick start
ollama run mistral-small3.1:24b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
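Once the model has been pulled, it can also be queried programmatically. The following is a minimal sketch, assuming an Ollama server running locally on its default port (11434) and using Ollama's `/api/generate` endpoint. The helper names `build_payload` and `generate`, and the file name `chart.png`, are hypothetical and chosen for illustration only.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint
MODEL = "mistral-small3.1:24b"  # model tag from the quick start command


def build_payload(prompt, image_bytes=None):
    """Build the JSON body for a single non-streaming generate request."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    if image_bytes is not None:
        # Images are passed as a list of base64-encoded strings.
        payload["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return payload


def generate(prompt, image_bytes=None, url=OLLAMA_URL):
    """Send the request and return the generated text."""
    data = json.dumps(build_payload(prompt, image_bytes)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    # Text-only prompt.
    print(generate("Summarize the Apache 2.0 license in one sentence."))
    # Text plus image prompt; chart.png is a placeholder file name.
    with open("chart.png", "rb") as f:
        print(generate("Describe this chart.", image_bytes=f.read()))
```

Setting `stream` to false returns one JSON object instead of a stream of chunks, which keeps the example short; a latency-sensitive client would more likely stream the response.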