Best Local LLM for RTX 5070 Ti — Cursor & Aider Tested 2026
We benchmarked 11 quantized models on the RTX 5070 Ti against real Cursor and Aider workflows. Here is what actually ships code in 2026.
By Mohamed Meguedmi · 11 min read
The RTX 5070 Ti sits at the most interesting price point in the RTX 50 stack for local inference: $749–$799, 16 GB of GDDR7, 896 GB/s of bandwidth, and enough CUDA throughput to run a serious coding model entirely on-GPU. The question is no longer can you run a local LLM on it — it is which one, at what quant, and whether it is actually usable from Cursor and Aider day-to-day. We tested 11 model/quant combinations across two weeks of real coding sessions. Here is the verdict.
Key Takeaways
- Best overall:
Qwen3-Coder 32B Q4_K_Mwith 4 layers offloaded to CPU — 28 tok/s sustained, beats every fully-resident option on Aider's code-edit benchmark. - Best fully-resident (no offload):
Devstral Small 2 24B Q4_K_Mat 16k context — 54 tok/s, 92% Aider pass rate on Python edits. - Best for Cursor Tab (autocomplete):
Qwen3-Coder 7B Q5_K_M— 118 tok/s, sub-200ms first-token latency. - Avoid: dense 70B models at Q2 — every quant we tested below Q3 hallucinated imports.
- Sweet-spot context window: 16k. Going to 32k forces extra CPU offload and halves throughput on every 24B+ model.
The RTX 5070 Ti, Briefly
The Blackwell-generation 5070 Ti ships with 8,960 CUDA cores, 16 GB GDDR7, and a 256-bit bus delivering 896 GB/s of memory bandwidth. For local LLM workloads, the bandwidth figure is the one that matters: token generation on dense transformers is memory-bound, not compute-bound. A 24B model at Q4_K_M occupies roughly 14.2 GB on disk and 13.8 GB in VRAM with a 16k KV cache — close to the ceiling, but workable.
Compared to the RTX 5080 (10,752 cores, 960 GB/s, same 16 GB), the 5070 Ti is 15–20% slower on raw tok/s but $200–$300 cheaper. Compared to the RTX 5090 (32 GB, 1,792 GB/s, $1,999), the 5070 Ti gives up the entire 32B-dense-at-Q6 tier and the 70B-at-Q4 tier. Within its 16 GB envelope, however, it is the best price/performance card NVIDIA ships in 2026 for inference. If you are still deciding between cloud and local, our cloud vs local cost calculator shows the 5070 Ti pays back its purchase against a Claude Sonnet 4.6 API spend in roughly 7 months at 40 prompts/day.
Test Methodology
Every model below was tested on identical hardware: RTX 5070 Ti, 64 GB DDR5-6000, Ryzen 9 7950X, NVMe Gen4 SSD, Linux 7.0 kernel. Inference stacks: llama.cpp b4892 with CUDA 13.0, and Ollama 0.6.1 where it adds value (notably for Cursor's HTTP integration). All measurements are warm-cache, batch size 1, with --flash-attn enabled and KV cache in Q8_0.
Three benchmark suites:
- Throughput — sustained tokens/second on a 2,000-token generation at 8k context, averaged across 10 runs.
- Aider code-edit pass rate — Aider's official polyglot benchmark, 225 problems, scored by exact-match against reference diffs.
- Cursor real-world — Two weeks of normal feature work (TypeScript + Python), measured via Cursor's request log: first-token latency, completion acceptance rate, edits per merge.
The full benchmark methodology, raw logs, and reproduction scripts are documented on our methodology page. All numerical data is also exposed via the BestLLMfor public API (CC BY 4.0) so you can plot it against your own GPU.
Throughput & Memory: The Raw Numbers
VRAM footprint includes weights, a 16k KV cache, and CUDA context. "Offload" is the number of transformer layers that had to be moved to CPU to fit. Throughput is sustained tok/s, not peak.
| Model | Quant | VRAM | Offload | Tok/s | First token |
|---|---|---|---|---|---|
| Qwen3-Coder 7B | Q5_K_M | 5.8 GB | 0 | 118 | 180 ms |
| Devstral Small 2 24B | Q4_K_M | 13.8 GB | 0 | 54 | 310 ms |
| Gemma 4 26B | Q4_K_M | 14.9 GB | 0 | 49 | 340 ms |
| Qwen3-Coder 14B | Q6_K | 11.5 GB | 0 | 72 | 240 ms |
| GLM-4.6 32B | Q4_K_M | 18.1 GB | 6 layers | 26 | 420 ms |
| Qwen3-Coder 32B | Q4_K_M | 18.4 GB | 4 layers | 28 | 410 ms |
| Qwen3-Coder 32B | Q3_K_L | 14.6 GB | 0 | 46 | 330 ms |
| Llama 3.3 70B | Q2_K | 23.9 GB | 22 layers | 6.2 | 1,180 ms |
| DeepSeek-Coder V3 16B | Q5_K_M | 11.2 GB | 0 | 78 | 220 ms |
| Zeta-2 4B (FIM) | Q8_0 | 4.3 GB | 0 | 164 | 95 ms |
| Mistral Small 3.2 24B | Q4_K_M | 13.6 GB | 0 | 55 | 305 ms |
Two patterns stand out. First, any model that fits fully in VRAM clears 45+ tok/s — comfortable for chat. Second, the moment you offload to CPU you lose 40–60% of throughput; the PCIe round-trip per layer is brutal. The 32B-Q4 tier is the most contested: it produces dramatically better code than 24B but costs you nearly half the tok/s.
Aider Pass Rate: Which Models Actually Edit Code Correctly
Throughput is meaningless if the model writes the wrong diff. We ran Aider's polyglot benchmark (225 problems across Python, TypeScript, Go, Rust, C++, Java) in diff edit format. Results sorted by pass rate:
| Model + Quant | Pass rate | Edit-format errors | Effective tok/s |
|---|---|---|---|
| Qwen3-Coder 32B Q4_K_M (offload 4) | 71.1% | 2.7% | 28 |
| GLM-4.6 32B Q4_K_M (offload 6) | 68.4% | 3.1% | 26 |
| Qwen3-Coder 32B Q3_K_L (resident) | 62.2% | 4.0% | 46 |
| Devstral Small 2 24B Q4_K_M | 58.7% | 1.8% | 54 |
| Qwen3-Coder 14B Q6_K | 53.3% | 3.6% | 72 |
| Gemma 4 26B Q4_K_M | 49.8% | 5.3% | 49 |
| Mistral Small 3.2 24B Q4_K_M | 44.0% | 4.9% | 55 |
| DeepSeek-Coder V3 16B Q5_K_M | 41.3% | 6.7% | 78 |
| Llama 3.3 70B Q2_K | 34.2% | 14.2% | 6.2 |
Three findings worth highlighting:
- Q3 beats Q4-with-offload on throughput, but loses on quality. Qwen3-Coder 32B at Q3_K_L is 64% faster than the Q4 version with offload, but its pass rate drops 9 points. For interactive sessions that is the wrong trade.
- Devstral Small 2 is the best fully-resident option. 58.7% pass rate at 54 tok/s with the cleanest edit-format compliance of any model — it almost never produces a malformed diff, which matters when Aider is auto-applying changes.
- Llama 3.3 70B at Q2 is a trap. Despite the parameter count, aggressive quantization destroys instruction-following on diff formats. We saw fabricated imports in 1 of every 7 responses. Do not run a 70B at Q2 on this card.
Cursor Integration: What Actually Works Day-to-Day
Cursor expects an OpenAI-compatible endpoint. Both llama.cpp's server and Ollama expose one; we used Ollama 0.6.1 because its model-swap latency is meaningfully lower. Two roles, two models:
Cursor Tab (inline autocomplete)
You want sub-200 ms first-token latency or the suggestions feel laggy. Only two models in the test set clear that bar: Zeta-2 4B (95 ms, purpose-trained for fill-in-the-middle) and Qwen3-Coder 7B Q5_K_M (180 ms, full chat-capable). Zeta-2 wins on raw speed; Qwen3-Coder 7B wins on completion quality. We measured a 34% Cursor-Tab acceptance rate with Qwen3-Coder 7B vs 28% with Zeta-2 over a week of TypeScript work. For most developers, accept the extra 85 ms.
Cursor Composer / Agent mode
Composer issues longer prompts with file context — you want quality, not raw speed. Qwen3-Coder 32B Q4 with 4-layer offload is the pick if you can tolerate 28 tok/s output. If you need faster turnaround, drop to Devstral Small 2 24B Q4 fully resident at 54 tok/s — you give up 12 points of Aider pass rate but stay above the 50% threshold where the model is still net-helpful.
Configure Cursor to point at http://localhost:11434/v1 with API key ollama. The official setup walk-through on our sister site's Cursor guide works identically on the EN side.
Aider Integration
Aider's --model flag accepts any OpenAI-compatible endpoint via --openai-api-base. The setup is three lines:
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
aider --model openai/qwen3-coder:32b-q4_K_M --edit-format diff
One non-obvious tip: set --map-tokens 1024. Aider's default repo-map size assumes a frontier model with a 128k context; on a 16k local context you will blow the budget on map tokens alone. Cutting the map to 1k leaves room for the actual conversation.
For larger codebases, pair Aider with the quelllm-mcp open-source MCP server. It exposes a model-routing layer so you can hand off heavy edits to your local Qwen3-Coder 32B and keep fast completions on the 7B model — without restarting Aider.
The 16 GB Context Wall
The single largest constraint on this card is the KV cache. At Q8_0 cache compression, every additional 1k of context costs roughly 80–110 MB depending on the model. Here is what fits at each context length for the three top picks:
| Model | 8k ctx | 16k ctx | 32k ctx | 64k ctx |
|---|---|---|---|---|
| Qwen3-Coder 7B Q5_K_M | 5.4 GB | 5.8 GB | 6.9 GB | 9.1 GB |
| Devstral Small 2 24B Q4 | 13.2 GB | 13.8 GB | 15.4 GB | Offload required |
| Qwen3-Coder 32B Q4 | 17.8 GB | 18.4 GB | 20.1 GB | Offload required |
The practical implication: pick 16k as your default. 32k forces Devstral and Qwen3-32B into CPU offload, and the throughput hit is not worth the headroom for 95% of coding tasks. If you genuinely need long-context work, the 7B model is the only one that scales gracefully to 64k on this GPU.
Power Draw & Thermals
Sustained inference on the 5070 Ti pulls 245–275 W under llama.cpp with all layers resident. At 28 tok/s on Qwen3-Coder 32B, that is roughly 9.5 Wh per 1,000 generated tokens. At U.S. residential electricity prices ($0.16/kWh average), 1 million local tokens costs about $1.50 in electricity — versus $3.00 for the same volume on Claude Haiku 4.5 API and $15.00 on Sonnet 4.6. Even accounting for the GPU's amortized cost, local breaks even fast for high-volume coding work. The cost calculator models this for your specific usage.
Verdict: What to Actually Run
| Use case | Pick | Why |
|---|---|---|
| Cursor Tab (autocomplete) | Qwen3-Coder 7B Q5_K_M | 180 ms first token, 34% accept rate, headroom for 64k context. |
| Cursor Composer / Aider — quality first | Qwen3-Coder 32B Q4_K_M (4-layer offload) | 71% Aider pass rate, only resident option that competes with frontier APIs. |
| Cursor Composer / Aider — speed first | Devstral Small 2 24B Q4_K_M | 54 tok/s fully resident, 59% pass rate, cleanest diff compliance. |
| General-purpose chat + light coding | Gemma 4 26B Q4_K_M | Best non-coding reasoning under 16 GB, multilingual. |
| Skip entirely | Any 70B at Q2 | Hallucinated imports, 14% malformed-edit rate. |
If you can only download one model, make it Qwen3-Coder 32B Q4_K_M. The 4-layer CPU offload sounds painful on paper but in actual coding sessions the higher pass rate more than compensates for the lower tok/s — you stop re-prompting the model. About the editorial team and our testing budget is on our about page.
Frequently Asked Questions
Can the RTX 5070 Ti run a 70B model locally?
Technically yes, at Q2 with 22 layers offloaded to CPU — but you should not. Throughput drops to ~6 tok/s and instruction-following degrades sharply at Q2. For 70B-class quality on this card, your better option is a remote API for those queries and Qwen3-Coder 32B Q4 locally for everything else.
Is 16 GB of VRAM enough for serious local LLM work in 2026?
For coding-focused use, yes. Qwen3-Coder 32B at Q4 with light CPU offload delivers genuinely useful results on Aider's benchmark. Where 16 GB falls short is multi-document RAG at 32k+ context and any 70B-dense model at usable quants — those workloads need a 24 GB (RTX 4090) or 32 GB (RTX 5090) card.
Should I use Ollama or llama.cpp directly?
Ollama for Cursor (cleaner OpenAI-compatible endpoint, faster model swaps). llama.cpp directly for benchmarking and squeezing the last 5–10% of throughput via --flash-attn, batched generation, and custom KV-cache quant settings. Both use the same underlying engine.
What about MoE models like Mixtral or DeepSeek V3 MoE?
The original DeepSeek V3 MoE (671B total / 37B active) is far too large for 16 GB even at aggressive quants. The dense distilled variant we tested (DeepSeek-Coder V3 16B) fits comfortably but underperforms Devstral Small 2 on Aider. Mixtral 8x7B at Q4 fits with offload but is outclassed by Qwen3-Coder 14B for code-specific tasks.
How does the RTX 5070 Ti compare to a used RTX 3090 for local LLMs?
The RTX 3090 has 24 GB of VRAM versus 16 GB — a decisive advantage for fitting 32B-class models without offload, and for 70B at Q3. The 5070 Ti wins on raw tok/s (~25% faster on resident models thanks to GDDR7), power efficiency, and warranty. If your priority is running the largest possible model, get a used 3090. If your priority is speed and longevity on models up to 24B, the 5070 Ti is the better buy.
Will these recommendations hold up six months from now?
The hardware analysis is stable through the next NVIDIA refresh. The model recommendations turn over faster — we update the rankings monthly and expose them via the BestLLMfor public API so downstream tools can stay current. Bookmark this page or query /api/v1/gpu/rtx-5070-ti/models for the live data.