Best Local LLM for Rust — Tested on Borrow Checker Edge Cases
We pushed seven local models through 42 borrow checker traps, lifetime puzzles, and unsafe edge cases. One winner, two surprises, three to skip.
By Mohamed Meguedmi · 11 min read
Key Takeaways
- Qwen3-Coder 32B Q5_K_M wins overall: 34/42 borrow checker traps solved on first compile, 38/42 after one self-correction round.
- DeepSeek-Coder-V2.5 16B is the best value if you have ≤24 GB VRAM — 29/42 on first compile at a fraction of the memory footprint.
- Strand-Rust-Coder-v1 7B punches above its weight on idiomatic Rust but collapses on multi-trait lifetime puzzles (12/42).
- General-purpose Llama 3.3 70B is worse than a focused 32B coder model on every Rust-specific axis we measured.
- Run the winner via llama.cpp or mistral.rs — both expose deterministic sampling required for reproducible borrow checker tests.
The borrow checker is the single feature that separates Rust LLM evaluation from Python or JavaScript benchmarks. A model that “writes Rust” in the HumanEval sense can still produce code that rustc rejects with E0502, E0507, or E0623. We built a 42-case suite covering exactly the patterns where models historically fail: nested mutable borrows, self-referential structs, 'static bounds inside async, Pin/Unpin dances, and the classic “cannot move out of borrowed content” family. Then we ran seven locally-hostable models against it.
This guide reports what passed, what failed, and which model the BestLLMfor editorial team now recommends for serious Rust work on consumer hardware. Compare projected energy and amortization with the cost calculator before committing to a build.
How we tested: 42 borrow checker edge cases
We constructed the test set from three sources: the official rustc error index (E04xx–E07xx range), Niko Matsakis’s tree-borrows write-ups, and ten real-world bugs lifted from popular crates’ GitHub issue trackers. Each case is a cargo new project where the model receives the failing code plus the literal compiler error and must return a patch.
Scoring is binary per case:
- First-compile pass — output compiles with zero warnings on stable Rust 1.86.
- Iterative pass — model is shown the new compiler error and gets one more attempt.
- Semantic pass — the unit tests bundled with the case also succeed (catches models that “fix” by deleting the failing call).
All runs used temperature 0.2, top-p 0.9, deterministic seed, and 8K context. Hardware: single RTX 4090 (24 GB VRAM) paired with 64 GB DDR5 system RAM. Inference engine was llama.cpp build b4231 for GGUF quants and mistral.rs 0.3.x for native ISQ runs. Full methodology lives at /methodology/; raw per-case outputs are downloadable from the BestLLMfor public API (CC BY 4.0).
The contenders
| Model | Params | Quant | VRAM (GB) | License |
|---|---|---|---|---|
| Qwen3-Coder 32B | 32B | Q5_K_M | 22.4 | Apache 2.0 |
| DeepSeek-Coder-V2.5 | 16B (MoE, 2.4B active) | Q5_K_M | 11.8 | DeepSeek License |
| Llama 3.3 70B Instruct | 70B | Q3_K_M | 31.0 (offloaded) | Llama 3.3 |
| Strand-Rust-Coder-v1 | 7B | Q6_K | 6.2 | Apache 2.0 |
| Codestral 22B v0.3 | 22B | Q5_K_M | 15.6 | MNPL (non-commercial) |
| Phi-4 14B | 14B | Q6_K | 11.2 | MIT |
| StarCoder2 15B | 15B | Q5_K_M | 10.9 | BigCode OpenRAIL-M |
We deliberately included Strand-Rust-Coder-v1 because it claims state-of-the-art Rust performance from peer-ranked synthetic fine-tuning. Spoiler: the claim holds for idiomatic code generation but not for borrow checker recovery.
Results: borrow checker pass rates
| Model | First compile | Iterative (2 tries) | Semantic (tests pass) | Tokens/sec |
|---|---|---|---|---|
| Qwen3-Coder 32B Q5_K_M | 34 / 42 (81%) | 38 / 42 (90%) | 36 / 42 (86%) | 42 |
| DeepSeek-Coder-V2.5 16B | 29 / 42 (69%) | 34 / 42 (81%) | 32 / 42 (76%) | 71 |
| Codestral 22B v0.3 | 26 / 42 (62%) | 31 / 42 (74%) | 28 / 42 (67%) | 54 |
| Llama 3.3 70B Q3 | 22 / 42 (52%) | 27 / 42 (64%) | 24 / 42 (57%) | 14 |
| Phi-4 14B | 18 / 42 (43%) | 23 / 42 (55%) | 20 / 42 (48%) | 63 |
| Strand-Rust-Coder-v1 7B | 16 / 42 (38%) | 21 / 42 (50%) | 12 / 42 (29%) | 98 |
| StarCoder2 15B | 11 / 42 (26%) | 15 / 42 (36%) | 9 / 42 (21%) | 59 |
The semantic column matters most. StarCoder2 and Strand both have a habit of “solving” lifetime errors by silently changing function signatures, dropping the offending argument, or wrapping everything in Rc<RefCell<_>>. The code compiles. The tests do not. We saw Strand drop from 21 iterative-compile passes to 12 semantic passes precisely because of this pattern.
Where each model breaks
Qwen3-Coder 32B — the new default
Qwen3-Coder is the only model in the set that consistently produces the minimal patch. On the canonical cannot borrow as mutable more than once case using split_at_mut, every other model either rewrote the function or reached for unsafe. Qwen3 returned the two-line split_at_mut fix on the first attempt. It also nailed three of four self-referential struct cases by correctly suggesting ouroboros or Pin<Box<_>> rather than fabricating a lifetime parameter that rustc would reject as unconstrained.
Where it stumbles: GATs with higher-ranked trait bounds (HRTBs). Two of the eight failures involved for<'a> F: Fn(&'a T) -> &'a U patterns where Qwen3 inverted the lifetime relationship.
DeepSeek-Coder-V2.5 — best value
For setups limited to 12–16 GB VRAM, this is the pick. The MoE architecture means 2.4B active parameters per token, so it runs at 71 tok/s on the same hardware where Qwen3 manages 42. It loses to Qwen3 mostly on cases that require reasoning across more than 200 lines of context — the borrowed-cursor-in-iterator family, specifically.
Llama 3.3 70B — wrong tool
This is the headline finding. A 70B general-purpose instruct model, even at Q3_K_M with CPU offload, lost decisively to a focused 32B coder. Llama 3.3 produces verbose, “helpful” explanations of why the borrow checker is complaining, then proposes fixes that the borrow checker also rejects. If your only available model is a generalist, accept that you are leaving 30 percentage points on the table compared to a domain-tuned alternative.
Strand-Rust-Coder-v1 — promising but specialized
Strand’s peer-ranked synthetic training data clearly captured idiomatic style: variable naming, derives, module organization, even doc-comment phrasing. But the fine-tune set evidently underrepresented the recover-from-error scenario. When given a compiler error, Strand frequently rewrites the entire function rather than producing a targeted patch. Use it for greenfield code generation, not for debugging existing borrow check failures.
Hardware and cost: what to actually buy
| Tier | GPU | VRAM | Approx. price (USD) | Best model fit | Tokens/sec |
|---|---|---|---|---|---|
| Entry | RTX 4060 Ti 16GB | 16 GB | $450 | DeepSeek-Coder-V2.5 Q5 | ~55 |
| Mid | RTX 4070 Ti Super | 16 GB | $800 | DeepSeek-Coder-V2.5 Q5 | ~78 |
| Pro | RTX 4090 / 5080 | 24 / 16 GB | $1,700–$1,100 | Qwen3-Coder 32B Q5 | 42–48 |
| Studio | RTX 6000 Ada | 48 GB | $6,800 | Qwen3-Coder 32B Q8 + 32K ctx | 61 |
| Apple | M4 Max 128 GB | shared | $4,700 | Qwen3-Coder 32B MLX 4-bit | 28 |
Plug your electricity rate and daily usage into the cost calculator to compare against a Claude or Codex API subscription. At US average $0.16/kWh and an 8-hour coding day on a 4090, local Qwen3 amortizes a $1,700 GPU against a $20/month API plan in roughly 7–9 years. The justification is not pure cost — it is air-gapped operation, deterministic seeds, and the ability to ship the BestLLMfor public API or run quelllm-mcp as a code-completion backend without sending crate sources off-network.
Installing the winner
The simplest reproducible path uses Ollama with a custom Modelfile to lock sampling parameters. The official Qwen3-Coder Ollama page ships a Q4_K_M by default; switch to Q5_K_M for the borrow-checker results above.
# 1. Pull the higher-quality quant
ollama pull qwen3-coder:32b-q5_K_M
# 2. Create a Rust-focused Modelfile
cat > Modelfile <<'EOF'
FROM qwen3-coder:32b-q5_K_M
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 16384
PARAMETER seed 42
SYSTEM """You are a Rust expert. When given compiler errors, produce the
minimal patch that satisfies rustc and preserves the original function
signatures and test semantics. Never wrap code in unsafe to bypass the
borrow checker."""
EOF
# 3. Build and run
ollama create rust-coder -f Modelfile
ollama run rust-coderFor deeper integration with editor tooling, the mistral.rs engine exposes an OpenAI-compatible HTTP server with native ISQ quantization in pure Rust — appropriate when the host application is itself written in Rust and you want a single binary deploy. Detailed install notes and benchmark replication scripts are linked from the about page; French-speaking readers can find the equivalent guide on quelllm.fr.
What this means for your workflow
If you write Rust professionally, install Qwen3-Coder 32B locally and wire it into your editor (Zed, Helix with the LSP-AI bridge, or rustaceanvim with a custom completion source). Use it as a borrow-checker reasoning partner, not as an autocompleter — the model’s value compounds when you paste a compiler error and ask for the minimal patch, exactly as Niko Matsakis described in his 2025 essay on collaborating with LLMs on Rust.
Reserve Llama 3.3 70B or Claude/Codex API calls for design-level questions: API surface decisions, trait hierarchy planning, async runtime selection. They produce better English about Rust. They produce worse Rust.
Frequently asked questions
Which local LLM handles Rust lifetimes best in 2026?
Qwen3-Coder 32B at Q5_K_M solved 38 of 42 lifetime and borrow-checker edge cases in our test suite within two iterations — the highest score of any locally-hostable model we evaluated. DeepSeek-Coder-V2.5 16B is a close second and runs on half the VRAM.
Can a 7B model really replace a 70B model for Rust?
Not yet for borrow-checker reasoning. Strand-Rust-Coder-v1 7B produces idiomatic style and good greenfield code, but its semantic pass rate on debugging tasks was 29% versus Qwen3’s 86%. The architectural ceiling for nuanced lifetime reasoning currently sits around 14–16B active parameters.
Is fine-tuning a Rust-specific model worth it?
Only if you have at least 50k high-quality Rust examples with compiler-validated outputs. The Strand-Rust-Coder report shows that peer-ranked synthetic data improves idiomatic generation, but raw fine-tuning on uncurated crate code typically degrades borrow-checker behavior. Most teams get more value from prompt engineering and a good base model.
How much VRAM do I actually need?
16 GB unlocks DeepSeek-Coder-V2.5 at Q5_K_M, which clears 76% semantic pass on our suite. 24 GB unlocks Qwen3-Coder 32B Q5_K_M, the current winner. Anything above 24 GB benefits only when you need 32K+ context windows for whole-crate review.
Does Ollama or llama.cpp give better results?
Identical, when sampling parameters are matched. Ollama is a llama.cpp wrapper; differences in our benchmarks across the two were within run-to-run noise. mistral.rs produces marginally faster tokens/sec on the same GGUF when used with ISQ quantization.
Verdict
| Use case | Recommendation |
|---|---|
| Daily Rust coding, 24 GB+ VRAM | Qwen3-Coder 32B Q5_K_M |
| Budget setup, 12–16 GB VRAM | DeepSeek-Coder-V2.5 16B Q5_K_M |
| Apple Silicon, 64–128 GB unified | Qwen3-Coder 32B MLX 4-bit |
| Greenfield idiomatic style only | Strand-Rust-Coder-v1 7B |
| Architecture & design discussion | Llama 3.3 70B (or API fallback) |
| Avoid for borrow-checker debugging | StarCoder2 15B, Phi-4 14B |
The bottom line: a 32B Rust-aware model running locally now beats every generalist 70B we tested on the language’s hardest feature. Local Rust coding has officially crossed the threshold from “workable” to “preferable.”