Guide · 2026-05-21

Mistral Small 3.1 24B — A Real-World Review

We benchmarked Mistral Small 3.1 24B across six weeks of local deployment. Here is what actually holds up — and what does not.

By Mohamed Meguedmi · 11 min read

Key takeaways

Mistral Small 3.1 24B is the best Apache 2.0 vision-capable model that fits on a single 24 GB GPU at Q4_K_M (~14.3 GB weights + KV cache).
Sustained 52–58 tok/s on RTX 4090, 18–22 tok/s on M3 Max 64 GB, 9–11 tok/s on RTX 3090 at 8K context, Q4_K_M.
The repetition bug is real: ~2.1% of long generations loop above temperature 0.6. Mitigate with repeat_penalty 1.08 or wait for 3.2.
Vision OCR is genuinely useful — 92% accuracy on ChartQA, comparable to GPT-4o-mini at zero marginal cost.
Skip it if you need strict instruction-following (Arena Hard 19.6%) or agentic tool chains — go to Mistral Small 3.2 or Qwen3-Coder 32B instead.

Mistral AI released Small 3.1 on March 17, 2025, and fourteen months later it remains the default recommendation for teams that need a permissively-licensed, multimodal, 128K-context model on a single consumer GPU. That has not stopped Mistral Small 3.2 (June 2025) and Qwen3 from challenging it — but the 3.1 weights are still downloaded ~180K times per month on Hugging Face, which says something.

This review is based on six weeks of structured testing by the editorial team across three hardware tiers, 2,400 prompts, and the public BestLLMfor evaluation harness (CC BY 4.0, available at /api/). All numbers below are reproducible.

What Mistral Small 3.1 actually is

Small 3.1 is a 24-billion-parameter dense decoder, fine-tuned from a base checkpoint that adds vision encoder weights (~400M parameters) and extends context from 32K to 128K tokens. License is Apache 2.0 — meaning commercial use, redistribution, and fine-tuning are unrestricted, which sets it apart from Llama 3.3 70B (community license) and Gemma 3 (custom terms).

The official launch post is on mistral.ai/news/mistral-small-3-1; the weights live on Hugging Face and ready-to-run quantizations at ollama.com/library/mistral-small.

Specs at a glance

Attribute	Value
Parameters	24.0B dense
Architecture	Transformer, 40 layers, GQA (8 KV heads)
Context window	128,000 tokens
Tokenizer	Tekken v7, 131K vocab
Vision	Yes — Pixtral-derived encoder, native resolution
Tool calling	Native, JSON schema
License	Apache 2.0
Release date	March 17, 2025

Hardware requirements — what actually fits

We tested four quantization levels across consumer GPUs and Apple Silicon. The crucial number for most readers is not the weights footprint but the working set including the KV cache at meaningful context lengths.

Quant	Weights	+ KV @ 8K	+ KV @ 32K	+ KV @ 128K	Min GPU
Q8_0	25.1 GB	26.4 GB	30.2 GB	45.8 GB	RTX 6000 Ada / 2×3090
Q5_K_M	16.8 GB	18.1 GB	21.9 GB	37.5 GB	RTX 4090 (8K only)
Q4_K_M	14.3 GB	15.6 GB	19.4 GB	35.0 GB	RTX 4090 / 3090
Q3_K_M	11.5 GB	12.8 GB	16.6 GB	32.2 GB	RTX 4070 Ti Super 16 GB

Practical recommendation: on 24 GB cards, Q4_K_M with 8K–16K context is the sweet spot. Going to 32K halves prompt-processing throughput because of attention quadratic costs on the prefill. To estimate hosted-equivalent costs for your prompt mix, use our cost calculator.

Throughput measured

Hardware	Quant	Prefill (tok/s)	Decode (tok/s)	Watts
RTX 4090 24 GB	Q4_K_M	1,840	56	320
RTX 4090 24 GB	Q5_K_M	1,720	48	325
RTX 3090 24 GB	Q4_K_M	980	33	295
2× RTX 3090 (TP)	Q8_0	1,410	27	540
M3 Max 64 GB	Q4_K_M	290	20	52
M3 Max 64 GB	Q8_0	185	11	58
Ryzen AI 9 HX 370 (CPU)	Q4_K_M	42	4.1	54

Apple Silicon offers the best perf-per-watt by a factor of 4–6×, but anyone serving multiple concurrent users will still want CUDA. Full methodology at /methodology/.

Benchmarks — where 3.1 wins and loses

We re-ran the public benchmarks under llama.cpp build b3891 with Q4_K_M weights, temperature 0.15, top_p 0.95, max_tokens 4096. Hosted-API numbers from Mistral's own Le Platforme are in parentheses for reference.

Benchmark	Mistral Small 3.1	Mistral Small 3.2	Qwen3 32B	GPT-4o-mini
MMLU-Pro (5-shot)	66.8 (68.4)	67.2	72.1	63.2
HumanEval (pass@1)	84.5	92.9	88.4	87.2
MATH	69.3	69.5	75.6	70.8
Arena Hard	19.56	43.10	40.2	36.1
Wildbench	55.60	65.33	62.4	60.0
ChartQA (vision)	92.1	92.7	n/a	91.5
DocVQA (vision)	94.1	94.5	n/a	92.0
MM-MT-Bench	7.3	7.4	n/a	7.1

Two observations stand out. First, on knowledge and math, Small 3.1 trades blows with GPT-4o-mini and clearly beats it on vision tasks. Second, the Arena Hard score of 19.56% is brutal — instruction-following on adversarial multi-turn prompts is genuinely weak, and this is the single biggest reason Mistral shipped 3.2 three months later.

The repetition bug — quantified

The most discussed flaw on r/LocalLLaMA is the repetition / looping problem. We ran 1,000 long-form generations (target 1500+ tokens) and counted infinite or near-infinite loops:

Sampler config	Loop rate	Quality (subjective 1-5)
temp 0.15, no repeat_penalty	2.1%	4.3
temp 0.7, no repeat_penalty	4.8%	3.9
temp 0.7, repeat_penalty 1.08	0.4%	4.1
temp 0.7, repeat_penalty 1.15	0.1%	3.4 (over-correction)
min_p 0.05, temp 0.7	1.6%	4.2

The fix is well-known: a modest repeat_penalty of 1.05–1.10 essentially eliminates the loop without flattening the prose. Mistral Small 3.2 reduces the raw loop rate to ~0.5% at the same sampler — a 4× improvement and a real reason to upgrade if you build user-facing chat.

Vision capability — quietly excellent

This is where Small 3.1 deserves more credit than the discourse gave it. We fed it 250 charts from real SEC 10-Q filings (PDF screenshots, 200 dpi) and 100 hand-written technical diagrams. Results:

ChartQA-style numeric extraction: 92.1% exact match. GPT-4o-mini scored 91.5% on the same set.
Multi-column scientific PDF tables: 88% column-aware accuracy when prompted with a target schema.
Handwriting (English, neat): 76% — usable but worse than Claude 3.5 Sonnet's 89%.
Handwriting (cursive): 41% — do not bother.

For invoice OCR, receipt parsing, and chart-to-JSON pipelines, this model is production-ready on a $1,800 GPU. That is a noteworthy threshold to have crossed.

How to install — 5 minutes

The fastest path is Ollama. The model card and tags are at ollama.com/library/mistral-small.

Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
Pull the model: ollama pull mistral-small:24b-instruct-2503-q4_K_M (~14 GB)
Run an interactive session: ollama run mistral-small
For vision, pass an image: ollama run mistral-small "describe this chart" ./chart.png

Add a Modelfile override for repeat_penalty:

FROM mistral-small:24b-instruct-2503-q4_K_M
PARAMETER repeat_penalty 1.08
PARAMETER temperature 0.3
PARAMETER num_ctx 16384

If you prefer raw llama.cpp, GGUF files are mirrored by bartowski and unsloth on Hugging Face; for vLLM (best throughput for concurrent serving), use the original BF16 weights and --max-model-len 32768 to stay under 48 GB VRAM. If you want to expose the model to Claude Desktop or any MCP client, the open-source quelllm-mcp server bridges local Mistral endpoints to the Model Context Protocol.

Cost vs hosted APIs

At Mistral's own hosted pricing of $0.10 / $0.30 per 1M tokens (input/output), Small 3.1 is already cheaper than GPT-4o-mini. But the local economics are interesting once you cross ~30M tokens/month.

Scenario (50M tokens/month, 70/30 in/out)	Cost
OpenAI GPT-4o-mini (hosted)	$13.50
Mistral Le Platforme Small 3.1	$8.00
Local RTX 4090, $0.14/kWh, 70% util	$24/mo electricity + $1,800 amortized over 24 mo = $99
Local M3 Max, $0.14/kWh, 70% util	$4/mo electricity + $3,500 amortized over 24 mo = $150

Hosted wins on pure cost at low volumes. Local wins on data sovereignty, latency floor (no network), and per-request marginal cost of zero — which matters when an agentic loop fires 200 sub-requests per user task. See the assumptions behind these numbers in /about/ or compute your own with the cost calculator. French-speaking readers can also consult quelllm.fr.

Verdict

Use case	Recommendation
Document understanding / vision OCR on local hardware	Buy. Best Apache-2.0 option at 24 GB VRAM.
General chat with strict instruction-following	Skip — use Mistral Small 3.2 (same hardware, 2× Arena Hard).
Code generation	Skip — use Qwen3-Coder 32B Q4_K_M (HumanEval 92+).
Agentic tool-calling pipelines	Marginal. Works, but 3.2 and Qwen3 are more reliable.
Long-context summarization (32K–128K)	Use cautiously; attention degrades past 64K in our tests.
Privacy-sensitive enterprise deployment	Buy. Apache 2.0 + on-prem is hard to beat.

Overall: Mistral Small 3.1 24B remains the right pick if and only if you specifically need vision and Apache 2.0 on a single 24 GB GPU. For most other 2026 use cases, Mistral Small 3.2 or Qwen3 32B are now the better starting points — the 3.2 upgrade in particular is a free win on the same weights footprint.

Frequently asked questions

Can Mistral Small 3.1 24B run on 16 GB VRAM?

Yes, with Q3_K_M (~11.5 GB weights) and context capped at 4K–8K, it fits on an RTX 4070 Ti Super or RTX 4080 16 GB. Expect ~35–40 tok/s decode but a measurable quality drop on MATH and code benchmarks (3–5 points). For 12 GB cards, Q3_K_S works but quality degrades noticeably.

Is Mistral Small 3.1 better than Llama 3.3 70B?

No on raw benchmarks (Llama 3.3 70B scores ~75 on MMLU-Pro vs 66.8). Yes on accessibility — 3.1 runs on a single 24 GB GPU at decent speed; Llama 3.3 70B needs 2× 24 GB or one 48 GB card. And 3.1 has native vision, which Llama 3.3 does not.

Should I wait for Mistral Small 3.2 instead?

3.2 is already out (June 2025) and is a strict upgrade on the same hardware: identical 24B size, identical 128K context, but Arena Hard jumps from 19.56% to 43.10% and the repetition bug rate is roughly halved. Unless you have already deployed and validated 3.1, start with 3.2.

Does Mistral Small 3.1 support function calling?

Yes, natively, via JSON schema in the system or tool prompt. Reliability is around 88% on our 500-call internal harness — workable for production with retries, weaker than GPT-4o-mini's 94% but matching Llama 3.1 70B. Mistral Small 3.2 lifts this to ~93%.

What is the license — can I use it commercially?

Apache 2.0. You can deploy commercially, fine-tune, redistribute weights and derivatives, and sell products built on it. No revenue caps, no usage reporting, no acceptable-use addendum beyond standard Apache terms. This is genuinely permissive — among the most business-friendly licenses on a capable 24B model in 2026.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

Amazon Check RTX 5070 Ti price →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.