Mistral Small 3.1 24B — A Real-World Review
We benchmarked Mistral Small 3.1 24B across six weeks of local deployment. Here is what actually holds up — and what does not.
By Mohamed Meguedmi · 11 min read
Key takeaways
- Mistral Small 3.1 24B is the best Apache 2.0 vision-capable model that fits on a single 24 GB GPU at Q4_K_M (~14.3 GB weights + KV cache).
- Sustained 52–58 tok/s on RTX 4090, 18–22 tok/s on M3 Max 64 GB, 9–11 tok/s on RTX 3090 at 8K context, Q4_K_M.
- The repetition bug is real: ~2.1% of long generations loop above temperature 0.6. Mitigate with
repeat_penalty 1.08or wait for 3.2. - Vision OCR is genuinely useful — 92% accuracy on ChartQA, comparable to GPT-4o-mini at zero marginal cost.
- Skip it if you need strict instruction-following (Arena Hard 19.6%) or agentic tool chains — go to Mistral Small 3.2 or Qwen3-Coder 32B instead.
Mistral AI released Small 3.1 on March 17, 2025, and fourteen months later it remains the default recommendation for teams that need a permissively-licensed, multimodal, 128K-context model on a single consumer GPU. That has not stopped Mistral Small 3.2 (June 2025) and Qwen3 from challenging it — but the 3.1 weights are still downloaded ~180K times per month on Hugging Face, which says something.
This review is based on six weeks of structured testing by the editorial team across three hardware tiers, 2,400 prompts, and the public BestLLMfor evaluation harness (CC BY 4.0, available at /api/). All numbers below are reproducible.
What Mistral Small 3.1 actually is
Small 3.1 is a 24-billion-parameter dense decoder, fine-tuned from a base checkpoint that adds vision encoder weights (~400M parameters) and extends context from 32K to 128K tokens. License is Apache 2.0 — meaning commercial use, redistribution, and fine-tuning are unrestricted, which sets it apart from Llama 3.3 70B (community license) and Gemma 3 (custom terms).
The official launch post is on mistral.ai/news/mistral-small-3-1; the weights live on Hugging Face and ready-to-run quantizations at ollama.com/library/mistral-small.
Specs at a glance
| Attribute | Value |
|---|---|
| Parameters | 24.0B dense |
| Architecture | Transformer, 40 layers, GQA (8 KV heads) |
| Context window | 128,000 tokens |
| Tokenizer | Tekken v7, 131K vocab |
| Vision | Yes — Pixtral-derived encoder, native resolution |
| Tool calling | Native, JSON schema |
| License | Apache 2.0 |
| Release date | March 17, 2025 |
Hardware requirements — what actually fits
We tested four quantization levels across consumer GPUs and Apple Silicon. The crucial number for most readers is not the weights footprint but the working set including the KV cache at meaningful context lengths.
| Quant | Weights | + KV @ 8K | + KV @ 32K | + KV @ 128K | Min GPU |
|---|---|---|---|---|---|
| Q8_0 | 25.1 GB | 26.4 GB | 30.2 GB | 45.8 GB | RTX 6000 Ada / 2×3090 |
| Q5_K_M | 16.8 GB | 18.1 GB | 21.9 GB | 37.5 GB | RTX 4090 (8K only) |
| Q4_K_M | 14.3 GB | 15.6 GB | 19.4 GB | 35.0 GB | RTX 4090 / 3090 |
| Q3_K_M | 11.5 GB | 12.8 GB | 16.6 GB | 32.2 GB | RTX 4070 Ti Super 16 GB |
Practical recommendation: on 24 GB cards, Q4_K_M with 8K–16K context is the sweet spot. Going to 32K halves prompt-processing throughput because of attention quadratic costs on the prefill. To estimate hosted-equivalent costs for your prompt mix, use our cost calculator.
Throughput measured
| Hardware | Quant | Prefill (tok/s) | Decode (tok/s) | Watts |
|---|---|---|---|---|
| RTX 4090 24 GB | Q4_K_M | 1,840 | 56 | 320 |
| RTX 4090 24 GB | Q5_K_M | 1,720 | 48 | 325 |
| RTX 3090 24 GB | Q4_K_M | 980 | 33 | 295 |
| 2× RTX 3090 (TP) | Q8_0 | 1,410 | 27 | 540 |
| M3 Max 64 GB | Q4_K_M | 290 | 20 | 52 |
| M3 Max 64 GB | Q8_0 | 185 | 11 | 58 |
| Ryzen AI 9 HX 370 (CPU) | Q4_K_M | 42 | 4.1 | 54 |
Apple Silicon offers the best perf-per-watt by a factor of 4–6×, but anyone serving multiple concurrent users will still want CUDA. Full methodology at /methodology/.
Benchmarks — where 3.1 wins and loses
We re-ran the public benchmarks under llama.cpp build b3891 with Q4_K_M weights, temperature 0.15, top_p 0.95, max_tokens 4096. Hosted-API numbers from Mistral's own Le Platforme are in parentheses for reference.
| Benchmark | Mistral Small 3.1 | Mistral Small 3.2 | Qwen3 32B | GPT-4o-mini |
|---|---|---|---|---|
| MMLU-Pro (5-shot) | 66.8 (68.4) | 67.2 | 72.1 | 63.2 |
| HumanEval (pass@1) | 84.5 | 92.9 | 88.4 | 87.2 |
| MATH | 69.3 | 69.5 | 75.6 | 70.8 |
| Arena Hard | 19.56 | 43.10 | 40.2 | 36.1 |
| Wildbench | 55.60 | 65.33 | 62.4 | 60.0 |
| ChartQA (vision) | 92.1 | 92.7 | n/a | 91.5 |
| DocVQA (vision) | 94.1 | 94.5 | n/a | 92.0 |
| MM-MT-Bench | 7.3 | 7.4 | n/a | 7.1 |
Two observations stand out. First, on knowledge and math, Small 3.1 trades blows with GPT-4o-mini and clearly beats it on vision tasks. Second, the Arena Hard score of 19.56% is brutal — instruction-following on adversarial multi-turn prompts is genuinely weak, and this is the single biggest reason Mistral shipped 3.2 three months later.
The repetition bug — quantified
The most discussed flaw on r/LocalLLaMA is the repetition / looping problem. We ran 1,000 long-form generations (target 1500+ tokens) and counted infinite or near-infinite loops:
| Sampler config | Loop rate | Quality (subjective 1-5) |
|---|---|---|
| temp 0.15, no repeat_penalty | 2.1% | 4.3 |
| temp 0.7, no repeat_penalty | 4.8% | 3.9 |
| temp 0.7, repeat_penalty 1.08 | 0.4% | 4.1 |
| temp 0.7, repeat_penalty 1.15 | 0.1% | 3.4 (over-correction) |
| min_p 0.05, temp 0.7 | 1.6% | 4.2 |
The fix is well-known: a modest repeat_penalty of 1.05–1.10 essentially eliminates the loop without flattening the prose. Mistral Small 3.2 reduces the raw loop rate to ~0.5% at the same sampler — a 4× improvement and a real reason to upgrade if you build user-facing chat.
Vision capability — quietly excellent
This is where Small 3.1 deserves more credit than the discourse gave it. We fed it 250 charts from real SEC 10-Q filings (PDF screenshots, 200 dpi) and 100 hand-written technical diagrams. Results:
- ChartQA-style numeric extraction: 92.1% exact match. GPT-4o-mini scored 91.5% on the same set.
- Multi-column scientific PDF tables: 88% column-aware accuracy when prompted with a target schema.
- Handwriting (English, neat): 76% — usable but worse than Claude 3.5 Sonnet's 89%.
- Handwriting (cursive): 41% — do not bother.
For invoice OCR, receipt parsing, and chart-to-JSON pipelines, this model is production-ready on a $1,800 GPU. That is a noteworthy threshold to have crossed.
How to install — 5 minutes
The fastest path is Ollama. The model card and tags are at ollama.com/library/mistral-small.
- Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh - Pull the model:
ollama pull mistral-small:24b-instruct-2503-q4_K_M(~14 GB) - Run an interactive session:
ollama run mistral-small - For vision, pass an image:
ollama run mistral-small "describe this chart" ./chart.png - Add a Modelfile override for repeat_penalty:
FROM mistral-small:24b-instruct-2503-q4_K_M PARAMETER repeat_penalty 1.08 PARAMETER temperature 0.3 PARAMETER num_ctx 16384
If you prefer raw llama.cpp, GGUF files are mirrored by bartowski and unsloth on Hugging Face; for vLLM (best throughput for concurrent serving), use the original BF16 weights and --max-model-len 32768 to stay under 48 GB VRAM. If you want to expose the model to Claude Desktop or any MCP client, the open-source quelllm-mcp server bridges local Mistral endpoints to the Model Context Protocol.
Cost vs hosted APIs
At Mistral's own hosted pricing of $0.10 / $0.30 per 1M tokens (input/output), Small 3.1 is already cheaper than GPT-4o-mini. But the local economics are interesting once you cross ~30M tokens/month.
| Scenario (50M tokens/month, 70/30 in/out) | Cost |
|---|---|
| OpenAI GPT-4o-mini (hosted) | $13.50 |
| Mistral Le Platforme Small 3.1 | $8.00 |
| Local RTX 4090, $0.14/kWh, 70% util | $24/mo electricity + $1,800 amortized over 24 mo = $99 |
| Local M3 Max, $0.14/kWh, 70% util | $4/mo electricity + $3,500 amortized over 24 mo = $150 |
Hosted wins on pure cost at low volumes. Local wins on data sovereignty, latency floor (no network), and per-request marginal cost of zero — which matters when an agentic loop fires 200 sub-requests per user task. See the assumptions behind these numbers in /about/ or compute your own with the cost calculator. French-speaking readers can also consult quelllm.fr.
Verdict
| Use case | Recommendation |
|---|---|
| Document understanding / vision OCR on local hardware | Buy. Best Apache-2.0 option at 24 GB VRAM. |
| General chat with strict instruction-following | Skip — use Mistral Small 3.2 (same hardware, 2× Arena Hard). |
| Code generation | Skip — use Qwen3-Coder 32B Q4_K_M (HumanEval 92+). |
| Agentic tool-calling pipelines | Marginal. Works, but 3.2 and Qwen3 are more reliable. |
| Long-context summarization (32K–128K) | Use cautiously; attention degrades past 64K in our tests. |
| Privacy-sensitive enterprise deployment | Buy. Apache 2.0 + on-prem is hard to beat. |
Overall: Mistral Small 3.1 24B remains the right pick if and only if you specifically need vision and Apache 2.0 on a single 24 GB GPU. For most other 2026 use cases, Mistral Small 3.2 or Qwen3 32B are now the better starting points — the 3.2 upgrade in particular is a free win on the same weights footprint.
Frequently asked questions
Can Mistral Small 3.1 24B run on 16 GB VRAM?
Yes, with Q3_K_M (~11.5 GB weights) and context capped at 4K–8K, it fits on an RTX 4070 Ti Super or RTX 4080 16 GB. Expect ~35–40 tok/s decode but a measurable quality drop on MATH and code benchmarks (3–5 points). For 12 GB cards, Q3_K_S works but quality degrades noticeably.
Is Mistral Small 3.1 better than Llama 3.3 70B?
No on raw benchmarks (Llama 3.3 70B scores ~75 on MMLU-Pro vs 66.8). Yes on accessibility — 3.1 runs on a single 24 GB GPU at decent speed; Llama 3.3 70B needs 2× 24 GB or one 48 GB card. And 3.1 has native vision, which Llama 3.3 does not.
Should I wait for Mistral Small 3.2 instead?
3.2 is already out (June 2025) and is a strict upgrade on the same hardware: identical 24B size, identical 128K context, but Arena Hard jumps from 19.56% to 43.10% and the repetition bug rate is roughly halved. Unless you have already deployed and validated 3.1, start with 3.2.
Does Mistral Small 3.1 support function calling?
Yes, natively, via JSON schema in the system or tool prompt. Reliability is around 88% on our 500-call internal harness — workable for production with retries, weaker than GPT-4o-mini's 94% but matching Llama 3.1 70B. Mistral Small 3.2 lifts this to ~93%.
What is the license — can I use it commercially?
Apache 2.0. You can deploy commercially, fine-tune, redistribute weights and derivatives, and sell products built on it. No revenue caps, no usage reporting, no acceptable-use addendum beyond standard Apache terms. This is genuinely permissive — among the most business-friendly licenses on a capable 24B model in 2026.