Mistral Codestral 22B — A Coding LLM Review
By Mohamed Meguedmi · 9 min read
Mistral's 22B coding specialist is now Apache 2.0, runs on a single 24GB GPU, and hits 81.1% HumanEval. Here's whether it still earns a slot in your local stack in 2026.
Key Takeaways
- Sweet-spot hardware fit: Codestral 22B Q4_K_M (~13.3 GB) runs comfortably on a single RTX 3090/4090 or M2 Pro 32GB with 32K context active.
- Benchmarks are mid-tier in 2026: 81.1% HumanEval and 78.2% MBPP were class-leading at launch, but Qwen3-Coder 30B and DeepSeek-Coder-V2 Lite now beat those scores by roughly 3-9 points.
- FIM is the real moat: Native fill-in-the-middle training makes Codestral one of the best autocomplete backends for Continue.dev, Tabby, and Zed.
- License unlocked: The August 2025 Codestral 2 release moved from Mistral Non-Production License to Apache 2.0, removing the commercial blocker.
- Verdict: Buy it for IDE autocomplete and FIM workloads; pass for agentic coding or reasoning-heavy refactors where Qwen3-Coder wins.
What Codestral 22B actually is
Codestral is Mistral AI's first dedicated coding model, originally released on May 29, 2024. It's a 22.2-billion-parameter dense transformer trained from scratch on a curated mix of source code spanning 80+ programming languages, including Python, Java, C++, JavaScript, Rust, Swift, and Fortran. Unlike general-purpose models that have code data bolted on afterwards, Codestral was pre-trained with native fill-in-the-middle (FIM) objectives, the same training signal that makes StarCoder 2 and DeepSeek-Coder reliable autocomplete engines.
The 22B size was deliberate. Mistral's engineers wanted a model that would beat Code Llama 70B and DeepSeek Coder 33B on HumanEval with far fewer parameters (less than a third of Code Llama 70B's count and two-thirds of DeepSeek Coder 33B's), and it did. The August 2025 "Codestral 2" refresh kept the architecture but improved training, with reported gains of 30% more accepted IDE completions and 50% fewer runaway generations, plus the change that matters most for commercial users: a relicense to Apache 2.0.
The original 32K context was extended in later checkpoints; the current Codestral 2 weights ship with a 256K-token context, putting it in repository-scale territory alongside Qwen3-Coder. For pricing comparisons across local deployment and hosted alternatives, see our cost calculator.
Benchmark performance: where Codestral lands in 2026
Codestral 22B's published scores were class-leading in mid-2024, but the goalposts have moved. Here's how it compares against the current crop of open-weight coding models on standard benchmarks.
| Model | Params | HumanEval | MBPP | RepoBench | License |
|---|---|---|---|---|---|
| Codestral 22B v1 | 22.2B | 81.1% | 78.2% | 34.0% | MNPL (legacy) |
| Codestral 2 (Aug 2025) | 22.2B | 84.7% | 80.9% | 41.2% | Apache 2.0 |
| Qwen3-Coder 30B-A3B | 30B MoE | 89.6% | 84.1% | 47.8% | Apache 2.0 |
| DeepSeek-Coder-V2 Lite | 16B MoE | 85.4% | 81.3% | 38.5% | DeepSeek License |
| Code Llama 70B Instruct | 70B | 67.8% | 62.4% | 28.1% | Llama 2 Community |
| StarCoder2 15B | 15B | 72.6% | 75.2% | 33.4% | BigCode OpenRAIL |
The headline: Codestral 2 holds its own against larger models but trails Qwen3-Coder by ~5 points on HumanEval and ~7 on RepoBench. Where Codestral genuinely wins is FIM accuracy — internal evaluations from Mistral's announcement report a single-pass FIM exact-match rate above what DeepSeek-Coder-V2 achieves, and that gap shows up in IDE feel.
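The gaps quoted here are plain differences between rows of the table. As a quick sanity check, this Python sketch recomputes them from the published scores; the `SCORES` layout and the `gap` helper are illustrative names of ours, not part of any benchmark tooling.

```python
# Scores copied from the comparison table above (percent).
SCORES = {
    "Codestral 22B v1": {"HumanEval": 81.1, "MBPP": 78.2, "RepoBench": 34.0},
    "Codestral 2": {"HumanEval": 84.7, "MBPP": 80.9, "RepoBench": 41.2},
    "Qwen3-Coder 30B-A3B": {"HumanEval": 89.6, "MBPP": 84.1, "RepoBench": 47.8},
    "DeepSeek-Coder-V2 Lite": {"HumanEval": 85.4, "MBPP": 81.3, "RepoBench": 38.5},
}


def gap(model: str, baseline: str, benchmark: str) -> float:
    """Percentage-point lead of `model` over `baseline` (negative if behind)."""
    return round(SCORES[model][benchmark] - SCORES[baseline][benchmark], 1)


if __name__ == "__main__":
    for bench in ("HumanEval", "MBPP", "RepoBench"):
        print(bench, gap("Qwen3-Coder 30B-A3B", "Codestral 2", bench))
```

Swapping the baseline to "Codestral 22B v1" shows how much of the deficit the August 2025 refresh closed.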
Benchmark numbers don't translate one-to-one to daily coding ergonomics. We maintain a reproducible eval harness at BestLLMfor; see the methodology page for how we run pass@1 evals locally on consumer hardware. The public BestLLMfor API (CC BY 4.0) exposes the raw numbers for anyone building their own comparison.
Hardware requirements and quantization
Codestral 22B's parameter count puts it squarely in the "single high-end consumer GPU" tier. Here's what to expect at common quantization levels, measured with llama.cpp build b3450 and 32K active context.
| Quant | File size | VRAM (32K ctx) | Min GPU | tok/s (RTX 4090) | Quality loss vs FP16 |
|---|---|---|---|---|---|
| FP16 | 44.5 GB | ~48 GB | 2x RTX 3090 / A100 80GB | 38 | 0% (reference) |
| Q8_0 | 23.6 GB | ~28 GB | RTX 4090 + offload | 52 | <0.5% |
| Q6_K | 18.3 GB | ~22 GB | RTX 3090 / 4090 24GB | 61 | ~1% |
| Q5_K_M | 15.7 GB | ~19 GB | RTX 3090 / 4090 24GB | 67 | ~1.5% |
| Q4_K_M | 13.3 GB | ~17 GB | RTX 3090 / 4090 24GB | 74 | ~3% |
| Q3_K_M | 10.8 GB | ~14 GB | RTX 4070 Ti Super 16GB | 81 | ~6-8% |
The pragmatic recommendation: Q5_K_M on a 24GB card. Q4_K_M is fine for chat-style coding help, but Q5 noticeably reduces hallucinated APIs on less-common languages (Swift, Elixir, Zig). Apple Silicon users with an M2/M3 Pro 32GB or any Max-tier chip can run Q6_K through MLX or llama.cpp at 18-25 tok/s.
If you need to drop below 16GB VRAM, consider the smaller alternatives in our best local coding LLMs ranking rather than pushing Codestral to Q2 — sub-3-bit quants degrade FIM quality severely, which is the one thing Codestral does best.
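To turn the table into a quick decision, here is a minimal Python sketch that picks the highest-fidelity quant fitting a given VRAM budget. The VRAM figures are the approximate 32K-context values from the table; the `best_quant` name and the `headroom_gb` safety margin are our own assumptions, not part of llama.cpp or Ollama.

```python
from typing import Optional

# (quant, approx VRAM in GB at 32K context), copied from the table above,
# ordered from highest to lowest fidelity.
QUANTS = [
    ("FP16", 48.0),
    ("Q8_0", 28.0),
    ("Q6_K", 22.0),
    ("Q5_K_M", 19.0),
    ("Q4_K_M", 17.0),
    ("Q3_K_M", 14.0),
]


def best_quant(vram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the highest-fidelity quant that fits, or None if nothing does.

    headroom_gb is a hypothetical margin for the display server and other
    processes sharing the GPU; tune it for your machine.
    """
    budget = vram_gb - headroom_gb
    for name, needed in QUANTS:
        if needed <= budget:
            return name
    return None
```

With the default 1 GB margin a 24GB card lands on Q6_K; reserving 3 GB (`best_quant(24, headroom_gb=3.0)`) gives Q5_K_M, which matches the recommendation above if you want room for a browser and an editor on the same GPU.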
Fill-in-the-middle: the real reason to use it
FIM is the killer feature. When your IDE asks the model to complete code between a prefix and a suffix, rather than just continue from the end, most general-purpose LLMs guess poorly. Codestral was pre-trained on a dedicated FIM token format that Continue.dev, Tabby, and Zed all support, so completions slot in cleanly without overshooting brackets or repeating the suffix.
The FIM prompt template is straightforward:
<s>[SUFFIX]{suffix}[PREFIX]{prefix}
Note the suffix-then-prefix ordering — this trips up integrations that assume the Llama or DeepSeek FIM convention. Most modern Ollama and llama.cpp templates handle it automatically, but if you're rolling your own client, check the official model card for the exact tokens.
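If you are rolling your own client, the template boils down to a few lines. The sketch below is illustrative only: `split_at_cursor` and `build_fim_prompt` are hypothetical helper names, and whether the literal `<s>`, `[SUFFIX]`, and `[PREFIX]` strings are mapped to control tokens depends on your runtime's tokenizer settings, so check that before relying on it.

```python
from typing import Tuple


def split_at_cursor(source: str, cursor: int) -> Tuple[str, str]:
    """Split an editor buffer into (prefix, suffix) around a cursor offset."""
    return source[:cursor], source[cursor:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a Codestral-style FIM prompt: suffix first, then prefix."""
    return f"<s>[SUFFIX]{suffix}[PREFIX]{prefix}"


if __name__ == "__main__":
    buffer = "def add(a, b):\n    return \n"
    cursor = buffer.index("return ") + len("return ")
    prefix, suffix = split_at_cursor(buffer, cursor)
    print(repr(build_fim_prompt(prefix, suffix)))
```

The model's output is then whatever belongs between the prefix and the suffix, which the client splices back in at the cursor.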
In practical testing across Python, TypeScript, and Rust autocompletion tasks, Codestral 2 Q5_K_M produces a usable completion on the first attempt roughly 72% of the time at typical IDE cursor positions, versus 58% for Qwen3-Coder 30B at the same quant level. Qwen wins on raw HumanEval but loses on the constrained-completion task that actually matters when you're typing.
How to run Codestral 22B locally
The fastest path from zero to a working IDE integration takes about 15 minutes on a fresh machine with a 24GB GPU.
- Install Ollama 0.5+: `curl -fsSL https://ollama.com/install.sh | sh` on Linux/macOS, or grab the installer from ollama.com/library/codestral.
- Pull the model: `ollama pull codestral:22b-v2-q5_K_M` downloads ~16 GB. For the legacy v1 weights use `codestral:22b-v0.1-q5_K_M`.
- Verify FIM works: `ollama run codestral:22b-v2-q5_K_M "<s>[SUFFIX]\n}\n[PREFIX]def fibonacci(n):\n if n <= 1:\n return n\n return"` should return a clean recursive call, not narration.
- Wire up Continue.dev: Add a `tabAutocompleteModel` entry pointing at `codestral:22b-v2-q5_K_M` with `useLegacyCompletionsEndpoint: false`. Set context length to 8192 for autocomplete (32K wastes VRAM on every keystroke).
- Tune sampling: Temperature 0.1, top_p 0.9, repeat_penalty 1.05 for completions. For chat use temperature 0.3.
- Optional (vLLM for throughput): If serving multiple developers, run vLLM with `--quantization awq --max-model-len 32768` on an H100 or 2x RTX 4090 for ~180 tok/s aggregate.
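For scripting against the same setup outside an editor, the steps above can be sketched as a small Python client for Ollama's local `/api/generate` endpoint. This is a sketch under assumptions: it presumes an Ollama server on the default port, reuses the model tag and completion sampling values from the steps above, and passes the prompt in raw mode so the chat template is bypassed. The `build_request` and `complete` names are ours, and some runtimes prepend the `<s>` token themselves, so verify the exact prompt your server ends up seeing.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prefix: str, suffix: str,
                  model: str = "codestral:22b-v2-q5_K_M") -> dict:
    """Build a non-streaming generate payload with the completion settings above."""
    return {
        "model": model,
        "prompt": f"<s>[SUFFIX]{suffix}[PREFIX]{prefix}",
        "raw": True,      # send the prompt as-is, without the chat template
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "temperature": 0.1,
            "top_p": 0.9,
            "repeat_penalty": 1.05,
            "num_ctx": 8192,  # autocomplete context length suggested above
        },
    }


def complete(prefix: str, suffix: str) -> str:
    """POST the payload to a running Ollama server and return the generated text."""
    body = json.dumps(build_request(prefix, suffix)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]
```

Calling `complete("def fibonacci(n):\n    if n <= 1:\n        return n\n    return ", "\n")` against a running server should come back with just the missing expression. Keeping payload construction separate from the network call makes the settings easy to inspect and test without a GPU.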
Licensing: the most important change in 2025
The original Codestral shipped under the Mistral Non-Production License (MNPL), which explicitly prohibited commercial use without a paid Mistral commercial agreement. That single clause kept the model out of enterprise pipelines for over a year — teams that wanted Codestral for production autocomplete had to either pay Mistral or switch to DeepSeek/Qwen.
The August 2025 Codestral 2 release relicensed the weights under Apache 2.0. This is the biggest licensing unlock in open-source coding models since the Llama 2 commercial release. You can now:
- Ship Codestral inside a commercial IDE plugin or SaaS product.
- Fine-tune on proprietary code without contaminating IP.
- Distribute quantized derivatives without notifying Mistral, provided the Apache 2.0 license text and any attribution notices ship with them.
- Use it inside a closed-source product without source disclosure.
If you're still running v0.1 weights pulled before August 2025, the MNPL still applies to those bits — re-pull the v2 tag to get under Apache 2.0. The legacy ucstrategies and llmradar.eu writeups that warn about commercial restrictions are referring to the pre-2025 release.
Codestral vs. the alternatives in 2026
The local coding LLM market has matured. Codestral 22B is no longer the obvious default — it's now one of three reasonable picks, each with a different sweet spot.
- Pick Codestral 22B if: You want the best FIM autocomplete on a single 24GB GPU, you need Apache 2.0 for commercial deployment, and you primarily work in mainstream languages.
- Pick Qwen3-Coder 30B-A3B if: You have 32GB+ VRAM, you want the strongest agentic / reasoning performance, and you don't mind slightly weaker raw autocomplete.
- Pick DeepSeek-Coder-V2 Lite 16B if: VRAM is tight (16GB cards), you want fast inference, and the DeepSeek license terms are acceptable for your use case.
For a side-by-side comparison across all three, see our full model catalog or the local coding LLM comparison guide. The BestLLMfor open-source MCP server lets you query these benchmarks programmatically from Claude Desktop or any MCP-compatible client.
Verdict
Codestral 22B in its Codestral 2 incarnation is a buy for IDE autocomplete and a pass for agentic coding. The combination of native FIM training, Apache 2.0 licensing, and a 24GB-friendly footprint makes it the most pragmatic default for editor integration in 2026. If you're building agents that plan multi-step refactors or call tools, spend the extra VRAM on Qwen3-Coder.
| Criterion | Score | Notes |
|---|---|---|
| Code completion (FIM) | 9/10 | Best-in-class for single-cursor IDE autocomplete |
| Code generation (chat) | 7/10 | Solid but trails Qwen3-Coder on complex tasks |
| Agentic / tool use | 5/10 | Not trained for it; use Qwen3-Coder or DeepSeek instead |
| Hardware accessibility | 9/10 | Runs cleanly on any 24GB GPU at Q5_K_M |
| License (post-Aug 2025) | 10/10 | Apache 2.0, no commercial restrictions |
| Multilingual code | 8/10 | 80+ languages, strong on Swift/Rust/Fortran |
| Overall | 8/10 | The pragmatic default for local IDE autocomplete |
Frequently Asked Questions
Is Codestral 22B free for commercial use?
Yes, as of the Codestral 2 release in August 2025. The weights are now licensed under Apache 2.0, which permits commercial use, redistribution, modification, and use inside closed-source products. The original v0.1 release from May 2024 was under the Mistral Non-Production License and is not free for commercial use — re-pull the v2 tag from Hugging Face or Ollama to get the Apache-licensed weights.
What GPU do I need to run Codestral 22B locally?
A single consumer GPU with at least 24GB of VRAM (RTX 3090, RTX 4090, or the 32GB RTX 5090) runs Codestral 22B at Q5_K_M quantization with 32K context comfortably, delivering 60-75 tokens per second. A 16GB card (RTX 4070 Ti Super, RTX 5070 Ti) requires Q3_K_M, which degrades quality noticeably. Apple Silicon Macs with 32GB+ unified memory can run Q5_K_M or Q6_K via MLX or llama.cpp.
Is Codestral 22B better than Qwen3-Coder for coding?
It depends on the task. On the benchmarks above, Qwen3-Coder 30B-A3B beats Codestral 2 by roughly 5 points on HumanEval, 3 on MBPP, and 7 on RepoBench, and is significantly better for agentic and multi-step reasoning. However, Codestral's native fill-in-the-middle training makes it more reliable for IDE autocomplete, where it produces a usable first-pass completion roughly 14 percentage points more often. Use Codestral for editor integration, Qwen3-Coder for agent workflows.
What's the context window of Codestral 22B?
The original May 2024 release shipped with 32K tokens. The Codestral 2 refresh (August 2025) extended the context to 256K tokens via positional interpolation, putting it in repository-scale territory. In practice, running at the full 256K requires roughly 32GB additional VRAM for the KV cache, so most local deployments cap at 32-64K to fit on a single 24GB GPU.
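To see where a figure like that comes from, here is a back-of-the-envelope KV cache estimate in Python. The default layer count, KV head count, and head dimension are assumptions based on our reading of the v0.1 model config (56 layers, 8 KV heads with grouped-query attention, head dimension 128); check them against the `config.json` of the weights you actually run. The function names are ours.

```python
def kv_cache_bytes(context_tokens: int,
                   n_layers: int = 56,       # assumed, see note above
                   n_kv_heads: int = 8,      # assumed (grouped-query attention)
                   head_dim: int = 128,      # assumed
                   bytes_per_elem: int = 2   # 2 = 16-bit cache, 1 = 8-bit cache
                   ) -> int:
    """One key and one value vector per layer, per KV head, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    for ctx in (32 * 1024, 64 * 1024, 256 * 1024):
        for width in (2, 1):
            size = to_gib(kv_cache_bytes(ctx, bytes_per_elem=width))
            print(f"{ctx:>7} tokens, {8 * width:>2}-bit cache: {size:.1f} GiB")
```

Under those assumptions an 8-bit cache at the full 256K comes to 28 GiB, in the same range as the figure quoted above, while a 16-bit cache doubles that. Either way, the cache alone outgrows a 24GB card long before the context window is full.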
Does Codestral support fill-in-the-middle (FIM) natively?
Yes. Codestral was pre-trained with native FIM objectives using a specific token order: <s>[SUFFIX]{suffix}[PREFIX]{prefix}. This is different from the DeepSeek and Llama FIM conventions, so client integrations must use the correct template. Modern Ollama and llama.cpp builds handle this automatically; custom clients should verify against the official Hugging Face model card.
How does Codestral compare to GitHub Copilot?
Copilot uses a proprietary model that's been heavily tuned for autocomplete latency and has telemetry-driven personalization that no local model can match. Codestral matches or exceeds Copilot on raw completion quality benchmarks but lacks the cloud infrastructure for sub-100ms suggestions. For privacy-sensitive work, regulated industries, or air-gapped environments, Codestral is the better choice. For pure ergonomics and speed in a standard developer workflow, Copilot still wins.