Model card

Mistral Medium 3.5 128B

By Mistral AI · France

Updated 2026-07-13

Tags: chat, general, code, reasoning, vision, multilingual, fr

Parameters	128B
License	Modified MIT
Context	256k
VRAM (Q4)	74 GB
Released	29 April 2026

Overview

Mistral AI's first merged flagship — a dense 128B with vision, 256k context, and configurable reasoning. Hits 77.6% on SWE-Bench Verified, consolidating Medium 3.1, Magistral, and Devstral 2 into one model.

When to pick this model

Agentic coding workflows demanding state-of-the-art SWE-Bench performance
Customer-support automation needing top τ³-Telecom scores
Long-document analysis up to 256k tokens
Multilingual vision tasks across 24 languages
Production deployments wanting reasoning toggleable per request

VRAM requirements by quantization

Quantization	VRAM required
Q4_K_M (recommended)	74 GB
Q5_K_M	91 GB
Q8_0	137 GB
FP16 (no quantization)	256 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.

In practice, Mistral Medium 3.5 128B is server-class even at Q4_K_M (74 GB). Stepping up to Q8_0 nearly doubles the footprint to 137 GB, and unquantized FP16 weights take 256 GB — plan your GPU around the Q4 or Q5 figure unless you specifically need the higher fidelity.
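
As a rough sanity check on these figures, the sketch below derives the bits per parameter implied by each listed footprint and its ratio to Q4_K_M. It assumes 1 GB = 1e9 bytes and treats the whole footprint as weights, so the KV cache and runtime overhead included in the table make the results slight overestimates. The function names are illustrative only.

```python
# Footprints in GB, copied from the VRAM table on this page.
VRAM_GB = {"Q4_K_M": 74, "Q5_K_M": 91, "Q8_0": 137, "FP16": 256}
PARAMS_BILLION = 128  # dense parameter count stated on this page


def implied_bits_per_param(vram_gb: float, params_billion: float = PARAMS_BILLION) -> float:
    """Bits per parameter implied by a listed footprint.

    Assumption: the whole footprint is weights and 1 GB = 1e9 bytes, so the
    KV cache and overhead counted in the table inflate the result a little.
    """
    return vram_gb * 8 / params_billion


def ratio_to_q4(name: str) -> float:
    """How many times larger a quantization is than Q4_K_M."""
    return VRAM_GB[name] / VRAM_GB["Q4_K_M"]


if __name__ == "__main__":
    for name, gb in VRAM_GB.items():
        print(f"{name}: {implied_bits_per_param(gb):.2f} bits/param, "
              f"{ratio_to_q4(name):.2f}x Q4_K_M")
```

On the listed numbers, Q8_0 comes out at about 1.85 times the Q4_K_M footprint, which is where "nearly doubles" comes from.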

Without a GPU, Mistral Medium 3.5 128B needs roughly 160 GB of system RAM to run on CPU via llama.cpp or Ollama. That is workable for background jobs, but far slower than GPU inference. Throughput estimates from our compatibility engine: around 1 token/sec on entry-level GPUs, about 4 tokens/sec on a mid-range card, and up to 12 tokens/sec on high-end hardware, assuming the chosen quantization fully fits in VRAM.
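
To put those throughput estimates in wall-clock terms, here is a minimal sketch that converts tokens per second into generation time for a reply of a given length. It only covers token generation and ignores prompt processing time, which is an assumption of this example, not a claim about the compatibility engine.

```python
# Throughput estimates quoted on this page, in tokens per second.
TOKENS_PER_SEC = {"entry-level": 1, "mid-range": 4, "high-end": 12}


def seconds_to_generate(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate num_tokens at a steady rate (prompt processing excluded)."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return num_tokens / tokens_per_sec


if __name__ == "__main__":
    for tier, rate in TOKENS_PER_SEC.items():
        print(f"{tier}: {seconds_to_generate(500, rate):.0f} s for a 500-token reply")
```

At the mid-range estimate, a 500-token answer takes a little over two minutes, which is why these speeds suit batch jobs better than interactive chat.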

What hardware do you need

The table below matches Mistral Medium 3.5 128B to common GPU memory tiers, using the highest-fidelity quantization that fully fits each card class. Spilling layers to system RAM works but costs most of the speed, so size your card to the quantization you actually want to run.

GPU memory	Example cards	Best fit for Mistral Medium 3.5 128B
8 GB	RTX 5070 Laptop, RTX 5060, RTX 5060 Ti 8GB	Does not fit — needs 74 GB at Q4_K_M
12 GB	RTX 5070, RTX 5070 Ti Laptop, RTX 4080 Laptop	Does not fit — needs 74 GB at Q4_K_M
16 GB	RTX 5080, RTX 4080 Super, Radeon RX 9070 XT	Does not fit — needs 74 GB at Q4_K_M
24 GB	RTX 4090, Radeon RX 7900 XTX, RTX 5090 Laptop	Does not fit — needs 74 GB at Q4_K_M
32 GB	RTX 5090	Does not fit — needs 74 GB at Q4_K_M
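
The matching logic behind a table like this can be sketched as a simple lookup: pool the available VRAM and pick the highest-fidelity quantization whose footprint fits. The footprints come from the quantization table on this page. Summing memory across cards is an idealization assumed for this example, since real multi-GPU splits lose some memory per card to buffers. `best_fit` is a hypothetical helper, not part of any tool mentioned here.

```python
from typing import Optional, Sequence

# (name, GB) pairs from the quantization table, highest fidelity first.
QUANTS = [("FP16", 256), ("Q8_0", 137), ("Q5_K_M", 91), ("Q4_K_M", 74)]


def best_fit(gpu_memory_gb: Sequence[float]) -> Optional[str]:
    """Return the highest-fidelity quantization that fits the pooled VRAM.

    Assumption: VRAM pools perfectly across cards. Returns None when even
    Q4_K_M does not fit.
    """
    total = sum(gpu_memory_gb)
    for name, needed_gb in QUANTS:
        if needed_gb <= total:
            return name
    return None


if __name__ == "__main__":
    print(best_fit([32]))          # single 32 GB card
    print(best_fit([24, 24, 24]))  # three 24 GB cards
    print(best_fit([24] * 4))      # four 24 GB cards
```

Under this idealized pooling, three 24 GB cards (72 GB) still fall short of Q4_K_M, while four (96 GB) reach Q5_K_M.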

Which GPU should you buy to run Mistral Medium 3.5 128B?

To run Mistral Medium 3.5 128B locally at Q4, you need ~74 GB of VRAM. The best value for this is an Apple Mac Studio configured with more unified memory than that 74 GB footprint (a 64 GB configuration is too small).

Published benchmark scores

Benchmark	Score
SWE-Bench Verified	77.6
τ³-Telecom	91.4

Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.

Strengths

SWE-Bench Verified 77.6% — best-in-class for open weights
τ³-Telecom 91.4% for tool-using agents
256k context with strong long-context retention
Vision-enabled and multilingual across 24 languages
Modified MIT — permissive for most commercial use

Limitations

~74 GB at Q4, so it needs a 4-GPU box for comfortable serving
Revenue clause kicks in for large enterprises
Single-model consolidation means no separate specialized variants

Typical workloads

In our catalog grid, Mistral Medium 3.5 128B is filed under Frontier Agentic Coding, Long Context RAG, and Multilingual Vision, the use cases where its size/quality trade-off makes the most sense. Its tags translate to concrete workloads:

Code generation and review (pair it with an editor integration like Continue.dev or Cline)
Multi-step reasoning and math-flavoured tasks
Vision-language work on screenshots, charts, and scanned documents
Multilingual workloads
French-language output where quality matters

The 256k-token context window is large enough to hold many source files or long reports in a single prompt, which is what makes local RAG and document analysis practical. The Modified MIT license is permissive for most commercial use, but its revenue clause applies to large enterprises, so review the terms before shipping the model inside a commercial product.
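
A quick way to judge whether a set of documents fits that window is to estimate their token count before building the prompt. The sketch below assumes about 4 characters per token, a rough rule of thumb for English text and code rather than this model's actual tokenizer, and uses the 256,000-token figure given in the FAQ. The helper names and the output reserve are illustrative.

```python
import math
from typing import Iterable

CONTEXT_TOKENS = 256_000  # context window stated on this page
CHARS_PER_TOKEN = 4       # assumption: rough average, not the real tokenizer


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Crude token estimate from character count."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(documents: Iterable[str], reserve_for_output: int = 4_000) -> bool:
    """True if the documents plus an output reserve fit the context window."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserve_for_output <= CONTEXT_TOKENS


if __name__ == "__main__":
    report = "x" * 400_000  # stand-in for roughly 100k tokens of text
    print(estimate_tokens(report), fits_in_context([report]))
```

For a real decision, count tokens with the model's own tokenizer. Also note that the VRAM figures on this page assume an 8k KV cache, so filling the window needs extra memory.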

Architecture & training

Architecture: Dense 128B · vision encoder · 256k ctx · configurable reasoning · integrated EAGLE draft head

Training: First merged Mistral flagship: replaces Medium 3.1, Magistral and Devstral 2 in Le Chat / Vibe.

Verdict

The first Mistral flagship that bundles coding, reasoning, and vision into one model — and it's competitive on every axis.

Quick start

ollama run mistral-medium-3.5:128b

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
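
Once the model is pulled, it can also be called from code through Ollama's local REST API. The sketch below uses only the Python standard library and assumes a default Ollama install listening on `localhost:11434`. The model tag is the one from the command above, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumption: default local endpoint
MODEL_TAG = "mistral-medium-3.5:128b"               # tag from the quick-start command


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]


# Example, with the Ollama server running and the model already pulled:
#   print(generate("Explain what a KV cache is in one sentence."))
```

Setting `stream` to false returns one JSON object instead of a stream of chunks, which keeps the example short. At the throughput estimates on this page, long replies may take minutes.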

Similar models worth comparing

Model	Parameters	License	VRAM at Q4	Compared with Mistral Medium 3.5 128B
Mistral Large 3 675B	675B	Apache 2.0	405 GB	same family · needs 331 GB more VRAM at Q4
Qwen 3.5 122B-A10B	122B	Apache 2.0	73 GB	needs 1 GB less VRAM at Q4
DeepSeek V4 Flash 284B	284B	MIT	170 GB	needs 96 GB more VRAM at Q4
Mistral Small 4	119B	Apache 2.0	72 GB	same family · needs 2 GB less VRAM at Q4
Mixtral 8x22B Instruct	141B	Apache 2.0	82 GB	same family · needs 8 GB more VRAM at Q4

Frequently asked questions

How much VRAM does Mistral Medium 3.5 128B need?

At the recommended Q4_K_M quantization, Mistral Medium 3.5 128B needs about 74 GB of VRAM. Q8_0 takes 137 GB, and unquantized FP16 weights take 256 GB.

Can Mistral Medium 3.5 128B run without a GPU?

Yes — with roughly 160 GB of system RAM it runs CPU-only through llama.cpp or Ollama. Expect a fraction of GPU speed, which is fine for background or batch jobs but slow for interactive chat.

What context window does Mistral Medium 3.5 128B support?

Mistral Medium 3.5 128B supports a 256k-token context window (256,000 tokens).

Can I use Mistral Medium 3.5 128B commercially?

Yes, in most cases. Mistral Medium 3.5 128B is released under Modified MIT, a permissive license that allows commercial use, modification and redistribution, with a revenue clause that applies to large enterprises.

How fast is Mistral Medium 3.5 128B on consumer hardware?

Our compatibility engine estimates on the order of 4 tokens/sec on a mid-range GPU and up to 12 tokens/sec on high-end cards, assuming the quantization fully fits in VRAM.

Which quantization of Mistral Medium 3.5 128B should I download first?

Start with Q4_K_M (74 GB) — the standard size/quality sweet spot. Step up to Q5_K_M or Q8_0 only if you have VRAM headroom. It does not fit a single 24 GB consumer card — plan for multi-GPU or server hardware.
