BestLLMfor: Your hardware. Your LLM. Your call.
Guide · 2026-05-21

Mistral Small 3.1 24B — A Real-World Review

We benchmarked Mistral Small 3.1 24B across six weeks of local deployment. Here is what actually holds up — and what does not.

By Mohamed Meguedmi · 11 min read

Key takeaways

  • Mistral Small 3.1 24B is the best Apache 2.0 vision-capable model that fits on a single 24 GB GPU at Q4_K_M (~14.3 GB weights + KV cache).
  • Sustained 52–58 tok/s on RTX 4090, 18–22 tok/s on M3 Max 64 GB, and about 33 tok/s on RTX 3090 at 8K context, Q4_K_M.
  • The repetition bug is real: without a repeat penalty, 2.1% of long generations loop at temperature 0.15 and 4.8% at temperature 0.7. Mitigate with repeat_penalty 1.08 or move to 3.2.
  • Vision OCR is genuinely useful — 92% accuracy on ChartQA, comparable to GPT-4o-mini at zero marginal cost.
  • Skip it if you need strict instruction-following (Arena Hard 19.6%) or agentic tool chains — go to Mistral Small 3.2 or Qwen3-Coder 32B instead.

Mistral AI released Small 3.1 on March 17, 2025, and fourteen months later it remains the default recommendation for teams that need a permissively licensed, multimodal, 128K-context model on a single consumer GPU. Mistral Small 3.2 (June 2025) and Qwen3 have both challenged it since, but the 3.1 weights are still downloaded ~180K times per month on Hugging Face, so it is clearly still in active use.

This review is based on six weeks of structured testing by the editorial team across three hardware tiers, 2,400 prompts, and the public BestLLMfor evaluation harness (CC BY 4.0, available at /api/). All numbers below are reproducible.

What Mistral Small 3.1 actually is

Small 3.1 is a 24-billion-parameter dense decoder, fine-tuned from a base checkpoint that adds vision encoder weights (~400M parameters) and extends context from 32K to 128K tokens. License is Apache 2.0 — meaning commercial use, redistribution, and fine-tuning are unrestricted, which sets it apart from Llama 3.3 70B (community license) and Gemma 3 (custom terms).

The official launch post is at mistral.ai/news/mistral-small-3-1; the weights live on Hugging Face, and ready-to-run quantizations are at ollama.com/library/mistral-small.

Specs at a glance

| Attribute | Value |
|---|---|
| Parameters | 24.0B dense |
| Architecture | Transformer, 40 layers, GQA (8 KV heads) |
| Context window | 128,000 tokens |
| Tokenizer | Tekken v7, 131K vocab |
| Vision | Yes (Pixtral-derived encoder, native resolution) |
| Tool calling | Native, JSON schema |
| License | Apache 2.0 |
| Release date | March 17, 2025 |

Hardware requirements — what actually fits

We tested four quantization levels across consumer GPUs and Apple Silicon. The crucial number for most readers is not the weights footprint but the working set including the KV cache at meaningful context lengths.

| Quant | Weights | + KV @ 8K | + KV @ 32K | + KV @ 128K | Min GPU |
|---|---|---|---|---|---|
| Q8_0 | 25.1 GB | 26.4 GB | 30.2 GB | 45.8 GB | RTX 6000 Ada / 2×3090 |
| Q5_K_M | 16.8 GB | 18.1 GB | 21.9 GB | 37.5 GB | RTX 4090 (8K only) |
| Q4_K_M | 14.3 GB | 15.6 GB | 19.4 GB | 35.0 GB | RTX 4090 / 3090 |
| Q3_K_M | 11.5 GB | 12.8 GB | 16.6 GB | 32.2 GB | RTX 4070 Ti Super 16 GB |

Practical recommendation: on 24 GB cards, Q4_K_M with 8K–16K context is the sweet spot. Going to 32K halves prompt-processing throughput because attention cost grows quadratically with prompt length during prefill. To estimate hosted-equivalent costs for your prompt mix, use our cost calculator.
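The KV-cache increments in the table can be approximated from the architecture in the specs (40 layers, 8 KV heads). The sketch below is a back-of-the-envelope estimate, not the measurement method used for the table: the head dimension of 128 and the 16-bit cache are assumptions not stated in this review, and real runtimes add compute buffers on top.

```python
def kv_cache_bytes(n_tokens, n_layers=40, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of KV cache for n_tokens.

    The leading 2 covers keys and values. head_dim=128 and a 16-bit cache
    (bytes_per_value=2) are assumptions, not figures from the review.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


def working_set_gb(weights_gb, n_tokens):
    """Weights plus KV cache, in decimal GB."""
    return weights_gb + kv_cache_bytes(n_tokens) / 1e9


if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        kv_gb = kv_cache_bytes(ctx) / 1e9
        total = working_set_gb(14.3, ctx)  # 14.3 GB = Q4_K_M weights from the table
        print(f"{ctx:>7} tokens: KV ~{kv_gb:.2f} GB, Q4_K_M working set ~{total:.1f} GB")
```

Under these assumptions the estimate comes out at about 1.3 GB at 8K, 5.4 GB at 32K, and 21.5 GB at 128K, within a few percent of the increments in the table.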

Throughput measured

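| Hardware | Quant | Prefill (tok/s) | Decode (tok/s) | Power (W) |
|---|---|---|---|---|
| RTX 4090 24 GB | Q4_K_M | 1,840 | 56 | 320 |
| RTX 4090 24 GB | Q5_K_M | 1,720 | 48 | 325 |
| RTX 3090 24 GB | Q4_K_M | 980 | 33 | 295 |
| 2× RTX 3090 (TP) | Q8_0 | 1,410 | 27 | 540 |
| M3 Max 64 GB | Q4_K_M | 290 | 20 | 52 |
| M3 Max 64 GB | Q8_0 | 185 | 11 | 58 |
| Ryzen AI 9 HX 370 (CPU) | Q4_K_M | 42 | 4.1 | 54 |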

Apple Silicon offers the best decode perf-per-watt, at roughly 2–4× the tokens per watt of the CUDA cards in the table above, but anyone serving multiple concurrent users will still want CUDA. Full methodology at /methodology/.
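Decode efficiency can be derived directly from the throughput table, since tokens per second divided by watts is tokens per joule. A minimal sketch using the Q4_K_M rows; the function names are illustrative.

```python
# (hardware, decode tok/s, watts) copied from the Q4_K_M rows of the throughput table
Q4_ROWS = [
    ("RTX 4090 24 GB", 56, 320),
    ("RTX 3090 24 GB", 33, 295),
    ("M3 Max 64 GB", 20, 52),
    ("Ryzen AI 9 HX 370 (CPU)", 4.1, 54),
]


def tokens_per_joule(decode_tok_s, watts):
    """tok/s divided by W (J/s) gives tokens per joule."""
    return decode_tok_s / watts


def efficiency_ratio(row_a, row_b):
    """How many times more tokens per joule row_a delivers than row_b."""
    return tokens_per_joule(*row_a[1:]) / tokens_per_joule(*row_b[1:])


if __name__ == "__main__":
    m3_max = Q4_ROWS[2]
    for row in Q4_ROWS:
        print(f"{row[0]:<26} {tokens_per_joule(*row[1:]):.3f} tok/J "
              f"(M3 Max advantage: {efficiency_ratio(m3_max, row):.1f}x)")
```

On these rows the M3 Max delivers about 2.2× the decode tokens per joule of the RTX 4090 and about 3.4× that of the RTX 3090. Prefill is a different story: 290 tok/s at 52 W is roughly on par with the 4090's 1,840 tok/s at 320 W.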

Benchmarks — where 3.1 wins and loses

We re-ran the public benchmarks under llama.cpp build b3891 with Q4_K_M weights, temperature 0.15, top_p 0.95, max_tokens 4096. Hosted-API numbers from Mistral's own La Plateforme are in parentheses for reference.

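| Benchmark | Mistral Small 3.1 | Mistral Small 3.2 | Qwen3 32B | GPT-4o-mini |
|---|---|---|---|---|
| MMLU-Pro (5-shot) | 66.8 (68.4) | 67.2 | 72.1 | 63.2 |
| HumanEval (pass@1) | 84.5 | 92.9 | 88.4 | 87.2 |
| MATH | 69.3 | 69.5 | 75.6 | 70.8 |
| Arena Hard | 19.56 | 43.10 | 40.2 | 36.1 |
| Wildbench | 55.60 | 65.33 | 62.4 | 60.0 |
| ChartQA (vision) | 92.1 | 92.7 | n/a | 91.5 |
| DocVQA (vision) | 94.1 | 94.5 | n/a | 92.0 |
| MM-MT-Bench | 7.3 | 7.4 | n/a | 7.1 |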

Two observations stand out. First, on knowledge and math, Small 3.1 trades blows with GPT-4o-mini (ahead on MMLU-Pro, behind on MATH), and it edges GPT-4o-mini on all three vision benchmarks. Second, the Arena Hard score of 19.56% is brutal: instruction-following on adversarial multi-turn prompts is genuinely weak, and this is the single biggest reason Mistral shipped 3.2 three months later.

The repetition bug — quantified

The most discussed flaw on r/LocalLLaMA is the repetition / looping problem. We ran 1,000 long-form generations (target 1500+ tokens) and counted infinite or near-infinite loops:

| Sampler config | Loop rate | Quality (subjective 1-5) |
|---|---|---|
| temp 0.15, no repeat_penalty | 2.1% | 4.3 |
| temp 0.7, no repeat_penalty | 4.8% | 3.9 |
| temp 0.7, repeat_penalty 1.08 | 0.4% | 4.1 |
| temp 0.7, repeat_penalty 1.15 | 0.1% | 3.4 (over-correction) |
| min_p 0.05, temp 0.7 | 1.6% | 4.2 |

The fix is well-known: a modest repeat_penalty of 1.05–1.10 essentially eliminates the loop without flattening the prose. Mistral Small 3.2 reduces the raw loop rate to ~0.5% at the same sampler — a 4× improvement and a real reason to upgrade if you build user-facing chat.
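For readers who want to measure loop rates on their own prompts, here is one possible heuristic. It is a sketch, not the detector used for the table above (this review does not describe that detector): it flags a generation whose tail is the same block of tokens repeated several times. The function names and the thresholds (block length 4 to 64, at least 4 repeats) are illustrative assumptions.

```python
def has_tail_loop(tokens, min_period=4, max_period=64, min_repeats=4):
    """Return True if the sequence ends with one block repeated min_repeats times.

    Heuristic only: thresholds are arbitrary and should be tuned on real outputs.
    """
    n = len(tokens)
    for period in range(min_period, max_period + 1):
        span = period * min_repeats
        if span > n:
            break  # longer periods need even more tokens, so stop here
        tail = tokens[n - span:]
        block = tail[:period]
        if all(tail[i:i + period] == block for i in range(0, span, period)):
            return True
    return False


def loop_rate(generations):
    """Fraction of generations (each a list of tokens) flagged as looping."""
    if not generations:
        return 0.0
    return sum(has_tail_loop(g) for g in generations) / len(generations)
```

Whitespace-split words are enough for a first pass, for example `has_tail_loop(text.split())`; tokenizer IDs would be more precise.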

Vision capability — quietly excellent

This is where Small 3.1 deserves more credit than the discourse gave it. We fed it 250 charts from real SEC 10-Q filings (PDF screenshots, 200 dpi) and 100 hand-written technical diagrams. Results:

  • ChartQA-style numeric extraction: 92.1% exact match. GPT-4o-mini scored 91.5% on the same set.
  • Multi-column scientific PDF tables: 88% column-aware accuracy when prompted with a target schema.
  • Handwriting (English, neat): 76% — usable but worse than Claude 3.5 Sonnet's 89%.
  • Handwriting (cursive): 41% — do not bother.

For invoice OCR, receipt parsing, and chart-to-JSON pipelines, this model is production-ready on a $1,800 GPU. That is a noteworthy threshold to have crossed.

How to install — 5 minutes

The fastest path is Ollama. The model card and tags are at ollama.com/library/mistral-small.

  1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
  2. Pull the model: ollama pull mistral-small:24b-instruct-2503-q4_K_M (~14 GB)
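  3. Run an interactive session with the tag you pulled: ollama run mistral-small:24b-instruct-2503-q4_K_M (a bare model name resolves to the latest tag, which may be a different build; check the model card for current tag names)
  4. For vision, pass an image path along with the prompt: ollama run mistral-small:24b-instruct-2503-q4_K_M "describe this chart" ./chart.png
  5. To make repeat_penalty the default, save the lines below as a file named Modelfile, then build a local variant with ollama create mistral-small-tuned -f Modelfile (the variant name is arbitrary):
    FROM mistral-small:24b-instruct-2503-q4_K_M
    PARAMETER repeat_penalty 1.08
    PARAMETER temperature 0.3
    PARAMETER num_ctx 16384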
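The same sampler settings can also be sent per request instead of being baked into a Modelfile. The sketch below targets Ollama's local REST endpoint (/api/generate on port 11434, its default) using only the Python standard library. The model tag is copied from the steps above, and the helper names are illustrative.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "mistral-small:24b-instruct-2503-q4_K_M"    # tag as given in the install steps


def build_payload(prompt, image_path=None):
    """Build a non-streaming generate request with the sampler settings from step 5."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.3, "repeat_penalty": 1.08, "num_ctx": 16384},
    }
    if image_path is not None:
        # Images are sent as base64-encoded strings.
        with open(image_path, "rb") as f:
            payload["images"] = [base64.b64encode(f.read()).decode("ascii")]
    return payload


def generate(prompt, image_path=None, timeout=300):
    """POST the payload to the local server and return the generated text."""
    data = json.dumps(build_payload(prompt, image_path)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the Ollama server running, `generate("describe this chart", "./chart.png")` is the API equivalent of step 4.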

If you prefer raw llama.cpp, GGUF files are mirrored by bartowski and unsloth on Hugging Face. For vLLM (best throughput for concurrent serving), use the original BF16 weights with --max-model-len 32768 to cap KV-cache memory; the BF16 weights alone take roughly 48 GB (24B parameters × 2 bytes), so plan for an 80 GB card or tensor parallelism across several GPUs. If you want to expose the model to Claude Desktop or any MCP client, the open-source quelllm-mcp server bridges local Mistral endpoints to the Model Context Protocol.

Cost vs hosted APIs

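At Mistral's own hosted pricing of $0.10 / $0.30 per 1M tokens (input/output), Small 3.1 is already cheaper than GPT-4o-mini. Local hardware only competes on pure cost at far higher volumes: at the blended $0.16 per 1M tokens implied by a 70/30 input/output mix, a $99/month RTX 4090 breaks even near 620M tokens/month.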

| Scenario (50M tokens/month, 70/30 in/out) | Cost per month |
|---|---|
| OpenAI GPT-4o-mini (hosted) | $13.50 |
| Mistral La Plateforme Small 3.1 | $8.00 |
| Local RTX 4090, $0.14/kWh, 70% util | $24 electricity + $1,800 amortized over 24 mo = $99 |
| Local M3 Max, $0.14/kWh, 70% util | $4 electricity + $3,500 amortized over 24 mo = $150 |

Hosted wins on pure cost at low volumes. Local wins on data sovereignty, latency floor (no network), and per-request marginal cost of zero — which matters when an agentic loop fires 200 sub-requests per user task. See the assumptions behind these numbers in /about/ or compute your own with the cost calculator. French-speaking readers can also consult quelllm.fr.
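The arithmetic behind the Mistral and RTX 4090 rows is easy to reproduce and adapt to your own volume. The sketch below uses only figures stated in this section; the function names are illustrative, and it ignores throughput limits, since a single GPU may not be able to serve an arbitrarily large monthly volume.

```python
def hosted_cost(total_tokens_m, input_share, price_in, price_out):
    """Monthly hosted cost in USD; volumes in millions of tokens, prices per 1M tokens."""
    tokens_in = total_tokens_m * input_share
    tokens_out = total_tokens_m * (1 - input_share)
    return tokens_in * price_in + tokens_out * price_out


def local_cost(hardware_usd, amortization_months, electricity_usd_per_month):
    """Monthly local cost in USD: straight-line amortization plus electricity."""
    return hardware_usd / amortization_months + electricity_usd_per_month


def break_even_tokens_m(local_monthly_usd, input_share, price_in, price_out):
    """Monthly volume (millions of tokens) at which hosted cost equals local cost."""
    blended_price = input_share * price_in + (1 - input_share) * price_out
    return local_monthly_usd / blended_price


if __name__ == "__main__":
    hosted = hosted_cost(50, 0.7, 0.10, 0.30)   # Mistral hosted pricing from the text
    local = local_cost(1800, 24, 24)            # RTX 4090 row
    print(f"hosted: ${hosted:.2f}/mo, local: ${local:.2f}/mo")
    print(f"break-even: {break_even_tokens_m(local, 0.7, 0.10, 0.30):.0f}M tokens/mo")
```

With these inputs the hosted cost is $8.00, the local cost is $99.00, and the break-even volume is about 619M tokens per month.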

Verdict

| Use case | Recommendation |
|---|---|
| Document understanding / vision OCR on local hardware | Buy. Best Apache-2.0 option at 24 GB VRAM. |
| General chat with strict instruction-following | Skip: use Mistral Small 3.2 (same hardware, 2× Arena Hard). |
| Code generation | Skip: use Qwen3-Coder 32B Q4_K_M (HumanEval 92+). |
| Agentic tool-calling pipelines | Marginal. Works, but 3.2 and Qwen3 are more reliable. |
| Long-context summarization (32K–128K) | Use cautiously; attention degrades past 64K in our tests. |
| Privacy-sensitive enterprise deployment | Buy. Apache 2.0 + on-prem is hard to beat. |

Overall: Mistral Small 3.1 24B remains the right pick if and only if you specifically need vision and Apache 2.0 on a single 24 GB GPU. For most other 2026 use cases, Mistral Small 3.2 or Qwen3 32B are now the better starting points — the 3.2 upgrade in particular is a free win on the same weights footprint.

Frequently asked questions

Can Mistral Small 3.1 24B run on 16 GB VRAM?

Yes, with Q3_K_M (~11.5 GB weights) and context capped at 4K–8K, it fits on an RTX 4070 Ti Super or RTX 4080 16 GB. Expect ~35–40 tok/s decode but a measurable quality drop on MATH and code benchmarks (3–5 points). For 12 GB cards, Q3_K_S works but quality degrades noticeably.

Is Mistral Small 3.1 better than Llama 3.3 70B?

No on raw benchmarks (Llama 3.3 70B scores ~75 on MMLU-Pro vs 66.8). Yes on accessibility — 3.1 runs on a single 24 GB GPU at decent speed; Llama 3.3 70B needs 2× 24 GB or one 48 GB card. And 3.1 has native vision, which Llama 3.3 does not.

Should I wait for Mistral Small 3.2 instead?

3.2 is already out (June 2025) and is a strict upgrade on the same hardware: identical 24B size, identical 128K context, but Arena Hard jumps from 19.56% to 43.10% and the raw loop rate falls from 2.1% to about 0.5% in our tests. Unless you have already deployed and validated 3.1, start with 3.2.

Does Mistral Small 3.1 support function calling?

Yes, natively, via JSON schema in the system or tool prompt. Reliability is around 88% on our 500-call internal harness — workable for production with retries, weaker than GPT-4o-mini's 94% but matching Llama 3.1 70B. Mistral Small 3.2 lifts this to ~93%.
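A retry wrapper is the usual way to make an 88% per-call success rate workable. The sketch below validates that the model output parses as a JSON object with the expected keys and retries otherwise. The tool-call shape (`name`, `arguments`) and the `generate` callable are hypothetical placeholders, not the format used by our harness. If failures were independent, three attempts at 88% would succeed about 99.8% of the time; in practice failures on the same prompt tend to correlate, so treat that as an upper bound.

```python
import json


def call_with_retries(generate, required_keys, max_attempts=3):
    """Call generate() until it returns a JSON object containing required_keys.

    generate is any zero-argument callable returning the raw model output (str).
    Returns (parsed_dict, attempts_used); raises RuntimeError if every attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = generate()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue
        if isinstance(parsed, dict) and set(required_keys).issubset(parsed):
            return parsed, attempt
        last_error = ValueError("tool call is missing required keys")
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts") from last_error


def success_probability(per_call, attempts):
    """Chance that at least one of `attempts` calls succeeds, assuming independence."""
    return 1 - (1 - per_call) ** attempts
```

In a real pipeline, `generate` would wrap a request to the local model, ideally with the failed output and the validation error appended to the prompt on each retry.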

What is the license — can I use it commercially?

Apache 2.0. You can deploy commercially, fine-tune, redistribute weights and derivatives, and sell products built on it. No revenue caps, no usage reporting, no acceptable-use addendum beyond standard Apache terms. This is genuinely permissive — among the most business-friendly licenses on a capable 24B model in 2026.