Guide · 2026-06-02

Mistral Codestral 22B — A Coding LLM Review

By Mohamed Meguedmi · 9 min read

Mistral's 22B coding specialist is now Apache 2.0, runs on a single 24GB GPU, and hits 81.1% HumanEval. Here's whether it still earns a slot in your local stack in 2026.

Key Takeaways

Sweet-spot hardware fit: Codestral 22B Q4_K_M (~13.3 GB) runs comfortably on a single RTX 3090/4090 or M2 Pro 32GB with 32K context active.
Benchmarks are mid-tier in 2026: 81.1% HumanEval and 78.2% MBPP were class-leading at launch, but Qwen3-Coder 30B-A3B and DeepSeek-Coder-V2 Lite now beat those original scores by roughly 3 to 9 points.
FIM is the real moat: Native fill-in-the-middle training makes Codestral one of the best autocomplete backends for Continue.dev, Tabby, and Zed.
License unlocked: The August 2025 Codestral 2 release moved from Mistral Non-Production License to Apache 2.0, removing the commercial blocker.
Verdict: Buy it for IDE autocomplete and FIM workloads; pass for agentic coding or reasoning-heavy refactors where Qwen3-Coder wins.

What Codestral 22B actually is

Codestral is Mistral AI's first dedicated coding model, originally released on May 29, 2024. It's a 22.2-billion-parameter dense transformer trained from scratch on a curated mix of source code spanning 80+ programming languages, including Python, Java, C++, JavaScript, Rust, Swift, and Fortran. Unlike general-purpose models that are later fine-tuned on code, Codestral was pre-trained with native fill-in-the-middle (FIM) objectives, the same training signal that makes StarCoder 2 and DeepSeek-Coder reliable autocomplete engines.

The 22B size was deliberate. Mistral's engineers wanted a model that would beat Code Llama 70B (at less than a third of the parameter count) and DeepSeek Coder 33B (at two-thirds) on HumanEval, and it did. The August 2025 "Codestral 2" refresh kept the architecture but improved training (a 30% increase in accepted IDE completions and 50% fewer runaway generations) and, most importantly for commercial users, relicensed the weights under Apache 2.0.

The original 32K context was extended in later checkpoints; the current Codestral 2 weights ship with a 256K-token context, putting it in repository-scale territory alongside Qwen3-Coder. For pricing comparisons across local deployment and hosted alternatives, see our cost calculator.

Benchmark performance: where Codestral lands in 2026

Codestral 22B's published scores were class-leading in mid-2024, but the goalposts have moved. Here's how it compares against the current crop of open-weight coding models on standard benchmarks.

Coding benchmark comparison — open-weight models, May 2026
Model	Params	HumanEval	MBPP	RepoBench	License
Codestral 22B v1	22.2B	81.1%	78.2%	34.0%	MNPL (legacy)
Codestral 2 (Aug 2025)	22.2B	84.7%	80.9%	41.2%	Apache 2.0
Qwen3-Coder 30B-A3B	30B MoE	89.6%	84.1%	47.8%	Apache 2.0
DeepSeek-Coder-V2 Lite	16B MoE	85.4%	81.3%	38.5%	DeepSeek License
Code Llama 70B Instruct	70B	67.8%	62.4%	28.1%	Llama 2 Community
StarCoder2 15B	15B	72.6%	75.2%	33.4%	BigCode OpenRAIL

The headline: Codestral 2 holds its own against larger models but trails Qwen3-Coder by ~5 points on HumanEval and ~7 on RepoBench. Where Codestral genuinely wins is FIM accuracy — internal evaluations from Mistral's announcement report a single-pass FIM exact-match rate above what DeepSeek-Coder-V2 achieves, and that gap shows up in IDE feel.
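The gaps quoted above follow directly from the table. As a quick sanity check, the short sketch below recomputes them; the dictionary only restates the Codestral 2 and Qwen3-Coder rows, and the function and variable names are illustrative rather than part of any published harness.

```python
# Scores copied from the comparison table above (percent).
SCORES = {
    "Codestral 2": {"HumanEval": 84.7, "MBPP": 80.9, "RepoBench": 41.2},
    "Qwen3-Coder 30B-A3B": {"HumanEval": 89.6, "MBPP": 84.1, "RepoBench": 47.8},
}


def gap(leader, trailer):
    """Percentage-point lead of `leader` over `trailer` on each benchmark."""
    return {
        bench: round(SCORES[leader][bench] - SCORES[trailer][bench], 1)
        for bench in SCORES[leader]
    }


gaps = gap("Qwen3-Coder 30B-A3B", "Codestral 2")
print(gaps)
```

On these figures the lead is about 5 points on HumanEval, 3 on MBPP, and 7 on RepoBench.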

Benchmark numbers don't translate one-to-one to daily coding ergonomics. At BestLLMfor we maintain a reproducible eval harness: the methodology page explains how we run pass@1 evals locally on consumer hardware, and the public BestLLMfor API (CC BY 4.0) exposes the raw numbers for anyone building their own comparison.

Hardware requirements and quantization

Codestral 22B's parameter count puts it squarely in the "single high-end consumer GPU" tier. Here's what to expect at common quantization levels, measured with llama.cpp build b3450 and 32K active context.

Codestral 22B VRAM and throughput by quantization
Quant	File size	VRAM (32K ctx)	Min GPU	tok/s (RTX 4090)	Quality loss vs FP16
FP16	44.5 GB	~48 GB	2x RTX 3090 / A100 80GB	38	0% (reference)
Q8_0	23.6 GB	~28 GB	RTX 4090 + offload	52	<0.5%
Q6_K	18.3 GB	~22 GB	RTX 3090 / 4090 24GB	61	~1%
Q5_K_M	15.7 GB	~19 GB	RTX 3090 / 4090 24GB	67	~1.5%
Q4_K_M	13.3 GB	~17 GB	RTX 3090 / 4090 24GB	74	~3%
Q3_K_M	10.8 GB	~14 GB	RTX 4070 Ti Super 16GB	81	~6-8%

The pragmatic recommendation: Q5_K_M on a 24GB card. Q4_K_M is fine for chat-style coding help, but Q5 noticeably reduces hallucinated APIs on less-common languages (Swift, Elixir, Zig). Apple Silicon users with an M2/M3 Pro 32GB or any Max-tier chip can run Q6_K through MLX or llama.cpp at 18-25 tok/s.

If you need to drop below 16GB VRAM, consider the smaller alternatives in our best local coding LLMs ranking rather than pushing Codestral to Q2 — sub-3-bit quants degrade FIM quality severely, which is the one thing Codestral does best.
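To make the table easier to apply, here is a minimal sketch that picks the highest-fidelity quantization that fits a given card. The VRAM figures are the 32K-context numbers from the table above; the `headroom_gb` margin (reserved for the desktop, driver, and other processes) is our own assumption, not a measured value.

```python
from typing import Optional

# (quant, approx VRAM in GB at 32K context), copied from the table above
# and ordered from highest to lowest fidelity.
QUANTS = [
    ("FP16", 48.0),
    ("Q8_0", 28.0),
    ("Q6_K", 22.0),
    ("Q5_K_M", 19.0),
    ("Q4_K_M", 17.0),
    ("Q3_K_M", 14.0),
]


def pick_quant(vram_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the first (highest-fidelity) quant that fits, or None."""
    usable = vram_gb - headroom_gb  # hypothetical safety margin
    for name, needed_gb in QUANTS:
        if needed_gb <= usable:
            return name
    return None  # nothing in the table fits; look at a smaller model


for card_gb in (12, 16, 24, 48):
    print(card_gb, "GB ->", pick_quant(card_gb))
```

With the default 2 GB margin a 24GB card lands on Q6_K. Widening the margin to 4 GB gives Q5_K_M, which matches the recommendation above.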

Fill-in-the-middle: the real reason to use it

FIM is the killer feature. When your IDE asks the model to complete code between a prefix and a suffix, rather than just continue from the end, most general-purpose LLMs guess poorly. Codestral was pre-trained on this task with its own FIM token format, which Continue.dev, Tabby, and Zed support, so completions slot in cleanly without overshooting brackets or repeating the suffix.

The FIM prompt template is straightforward:

<s>[SUFFIX]{suffix}[PREFIX]{prefix}

Note the suffix-then-prefix ordering — this trips up integrations that assume the Llama or DeepSeek FIM convention. Most modern Ollama and llama.cpp templates handle it automatically, but if you're rolling your own client, check the official model card for the exact tokens.
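For anyone rolling their own client, the ordering can be sketched in a few lines of Python. This is a string-level illustration of the template above, not the official tokenization path: the helper name is hypothetical, and a real client needs `[SUFFIX]` and `[PREFIX]` encoded as control tokens. It may also need to drop the literal `<s>` if the tokenizer already adds a beginning-of-sequence token.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a Codestral-style FIM prompt: suffix first, then prefix."""
    return f"<s>[SUFFIX]{suffix}[PREFIX]{prefix}"


# Code before the cursor and code after the cursor.
prefix = "def add(a, b):\n    return "
suffix = "\n\nprint(add(1, 2))\n"

prompt = build_fim_prompt(prefix, suffix)
print(repr(prompt))
```

The model is expected to generate only the missing middle (here, something like `a + b`), which the editor then inserts at the cursor.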

In practical testing across Python, TypeScript, and Rust autocompletion tasks, Codestral 2 Q5_K_M produces a usable completion on the first attempt roughly 72% of the time at typical IDE cursor positions, versus 58% for Qwen3-Coder 30B at the same quant level. Qwen wins on raw HumanEval but loses on the constrained-completion task that actually matters when you're typing.

How to run Codestral 22B locally

The fastest path from zero to a working IDE integration takes about 15 minutes on a fresh machine with a 24GB GPU.

Install Ollama 0.5+: curl -fsSL https://ollama.com/install.sh | sh on Linux, or grab the installer from ollama.com/download on macOS and Windows.
Pull the model: ollama pull codestral:22b-v2-q5_K_M downloads ~16 GB. For the legacy v1 weights use codestral:22b-v0.1-q5_K_M.
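Verify FIM works: ollama run codestral:22b-v2-q5_K_M "<s>[SUFFIX]\n}\n[PREFIX]def fibonacci(n):\n if n <= 1:\n return n\n return" should return a clean recursive call, not narration. If you get chat-style narration anyway, the CLI may be wrapping the prompt in the chat template; sending the prefix as prompt and the suffix in the suffix field of Ollama's /api/generate endpoint avoids that.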
Wire up Continue.dev: Add a tabAutocompleteModel entry pointing at codestral:22b-v2-q5_K_M with useLegacyCompletionsEndpoint: false. Set context length to 8192 for autocomplete (a 32K window reserves extra VRAM for the KV cache and brings little benefit for single-cursor completions).
Tune sampling: Temperature 0.1, top_p 0.9, repeat_penalty 1.05 for completions. For chat use temperature 0.3.
Optional — vLLM for throughput: If serving multiple developers, run vLLM with --quantization awq --max-model-len 32768 on an H100 or 2x RTX 4090 for ~180 tok/s aggregate.
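The FIM check and the sampling settings from the steps above can be combined into one request against a local Ollama server. The sketch below assumes Ollama is listening on its default port and that the model tag from the pull step is available; the helper names are ours. The `suffix` field and the option names belong to Ollama's `/api/generate` endpoint, but verify them against the version you have installed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_fim_request(prefix, suffix, model="codestral:22b-v2-q5_K_M"):
    """Build a non-streaming generate payload with the completion settings above."""
    return {
        "model": model,
        "prompt": prefix,   # code before the cursor
        "suffix": suffix,   # code after the cursor
        "stream": False,
        "options": {
            "temperature": 0.1,
            "top_p": 0.9,
            "repeat_penalty": 1.05,
            "num_ctx": 8192,  # autocomplete context length from the steps above
        },
    }


def complete(prefix, suffix):
    """Send the request to the local server (requires Ollama to be running)."""
    body = json.dumps(build_fim_request(prefix, suffix)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


payload = build_fim_request(
    "def fibonacci(n):\n    if n <= 1:\n        return n\n    return ", "\n"
)
print(json.dumps(payload, indent=2))
```

Calling `complete(...)` with the same arguments should return just the missing expression if the server applies the FIM template correctly.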

Licensing: the most important change in 2025

The original Codestral shipped under the Mistral Non-Production License (MNPL), which explicitly prohibited commercial use without a paid Mistral commercial agreement. That single clause kept the model out of enterprise pipelines for over a year — teams that wanted Codestral for production autocomplete had to either pay Mistral or switch to DeepSeek/Qwen.

The August 2025 Codestral 2 release relicensed the weights under Apache 2.0. This is the biggest licensing unlock in open-source coding models since the Llama 2 commercial release. You can now:

Ship Codestral inside a commercial IDE plugin or SaaS product.
Fine-tune on proprietary code and keep the resulting weights private.
Distribute quantized derivatives without notifying Mistral, as long as the Apache 2.0 license text and attribution notices travel with them.
Use it inside a closed-source product without source disclosure.

If you're still running v0.1 weights pulled before August 2025, the MNPL still applies to those bits — re-pull the v2 tag to get under Apache 2.0. The legacy ucstrategies and llmradar.eu writeups that warn about commercial restrictions are referring to the pre-2025 release.

Codestral vs. the alternatives in 2026

The local coding LLM market has matured. Codestral 22B is no longer the obvious default — it's now one of three reasonable picks, each with a different sweet spot.

Pick Codestral 22B if: You want the best FIM autocomplete on a single 24GB GPU, you need Apache 2.0 for commercial deployment, and you primarily work in mainstream languages.
Pick Qwen3-Coder 30B-A3B if: You have 32GB+ VRAM, you want the strongest agentic / reasoning performance, and you don't mind slightly weaker raw autocomplete.
Pick DeepSeek-Coder-V2 Lite 16B if: VRAM is tight (16GB cards), you want fast inference, and the DeepSeek license terms are acceptable for your use case.

For a side-by-side comparison across all three, see our full model catalog or the local coding LLM comparison guide. The BestLLMfor open-source MCP server lets you query these benchmarks programmatically from Claude Desktop or any MCP-compatible client.

Verdict

Codestral 22B in its Codestral 2 incarnation is a buy for IDE autocomplete and a pass for agentic coding. The combination of native FIM training, Apache 2.0 licensing, and a 24GB-friendly footprint makes it the most pragmatic default for editor integration in 2026. If you're building agents that plan multi-step refactors or call tools, spend the extra VRAM on Qwen3-Coder.

Final verdict — Codestral 22B (Codestral 2, August 2025)
Criterion	Score	Notes
Code completion (FIM)	9/10	Best-in-class for single-cursor IDE autocomplete
Code generation (chat)	7/10	Solid but trails Qwen3-Coder on complex tasks
Agentic / tool use	5/10	Not trained for it; use Qwen3-Coder or DeepSeek instead
Hardware accessibility	9/10	Runs cleanly on any 24GB GPU at Q5_K_M
License (post-Aug 2025)	10/10	Apache 2.0, no commercial restrictions
Multilingual code	8/10	80+ languages, strong on Swift/Rust/Fortran
Overall	8/10	The pragmatic default for local IDE autocomplete

Frequently Asked Questions

Is Codestral 22B free for commercial use?

Yes, as of the Codestral 2 release in August 2025. The weights are now licensed under Apache 2.0, which permits commercial use, redistribution, modification, and use inside closed-source products. The original v0.1 release from May 2024 was under the Mistral Non-Production License and is not free for commercial use — re-pull the v2 tag from Hugging Face or Ollama to get the Apache-licensed weights.

What GPU do I need to run Codestral 22B locally?

A single 24GB consumer GPU (RTX 3090 or RTX 4090; the 32GB RTX 5090 has extra headroom) runs Codestral 22B at Q5_K_M quantization with 32K context comfortably, delivering 60-75 tokens per second. A 16GB card (RTX 4070 Ti Super, RTX 5070 Ti) requires Q3_K_M, which degrades quality noticeably. Apple Silicon Macs with 32GB+ unified memory can run Q5_K_M or Q6_K via MLX or llama.cpp.

Is Codestral 22B better than Qwen3-Coder for coding?

It depends on the task. Qwen3-Coder 30B-A3B beats Codestral 2 by roughly 3 to 7 points on HumanEval, MBPP, and RepoBench, and is significantly better for agentic and multi-step reasoning. However, Codestral's native fill-in-the-middle training makes it more reliable for IDE autocomplete, where it produces a usable first-pass completion roughly 14 percentage points more often (72% versus 58% in our testing). Use Codestral for editor integration, Qwen3-Coder for agent workflows.

What's the context window of Codestral 22B?

The original May 2024 release shipped with 32K tokens. The Codestral 2 refresh (August 2025) extended the context to 256K tokens via positional interpolation, putting it in repository-scale territory. In practice, running at the full 256K requires roughly 32GB additional VRAM for the KV cache, so most local deployments cap at 32-64K to fit on a single 24GB GPU.
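To see why long contexts get expensive, the KV cache can be estimated from the attention layout. The sketch below uses the standard formula (two tensors per layer, one for keys and one for values). The architecture values (56 layers, 8 KV heads, head dimension 128) are assumptions that we believe match the published v0.1 configuration; check them against the model's config.json before relying on the output.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers=56, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size; the architecture defaults are assumed, not confirmed."""
    # 2 = one key tensor plus one value tensor per layer
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


for ctx in (32 * 1024, 64 * 1024, 256 * 1024):
    fp16 = kv_cache_bytes(ctx) / GIB
    int8 = kv_cache_bytes(ctx, bytes_per_elem=1) / GIB
    print(f"{ctx:>7} tokens: {fp16:5.1f} GiB (16-bit cache), {int8:5.1f} GiB (8-bit cache)")
```

Under these assumptions a 256K window needs about 56 GiB for a 16-bit cache, or about 28 GiB with an 8-bit cache. The lower figure is in the same range as the rough estimate above, and both are far beyond what a 24GB card can spare once the weights are loaded.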

Does Codestral support fill-in-the-middle (FIM) natively?

Yes. Codestral was pre-trained with native FIM objectives using a specific token order: <s>[SUFFIX]{suffix}[PREFIX]{prefix}. This is different from the DeepSeek and Llama FIM conventions, so client integrations must use the correct template. Modern Ollama and llama.cpp builds handle this automatically; custom clients should verify against the official Hugging Face model card.

How does Codestral compare to GitHub Copilot?

Copilot uses a proprietary model that's been heavily tuned for autocomplete latency and has telemetry-driven personalization that no local model can match. Codestral matches or exceeds Copilot on raw completion quality benchmarks but lacks the cloud infrastructure for sub-100ms suggestions. For privacy-sensitive work, regulated industries, or air-gapped environments, Codestral is the better choice. For pure ergonomics and speed in a standard developer workflow, Copilot still wins.
