Best Local LLM for RTX 4090 (2026 Benchmarks)
24GB of GDDR6X gives the RTX 4090 access to almost every meaningful open-weight model below 70B. Here is the shortlist that actually earns its VRAM in 2026.
By Mohamed Meguedmi · 11 min read
Key takeaways
- Overall winner: Qwen3-Coder 32B Instruct at Q4_K_M is the single best model the RTX 4090 can run end-to-end at full quality, hitting ~55 tok/s with 32K context fully on-GPU.
- Best for general chat: Llama 3.3 70B Instruct at IQ2_XXS (2.4 bpw) fits in 23.1 GB and delivers ~14 tok/s — usable, not snappy.
- Fastest serious model: Qwen3 14B at Q5_K_M sustains 95–110 tok/s with 32K context and leaves headroom for a Whisper sidecar.
- Framework verdict:
llama.cppfor solo use,vLLMorTensorRT-LLMwhen serving more than one concurrent request — Ollama is fine but loses 10–15% throughput. - Don't bother with: dense 70B at Q4_K_M (won't fit), 4-bit Mixtral 8x7B (Qwen3 32B beats it on every axis), or anything above Q6 for models >14B.
What 24 GB actually buys you on an RTX 4090
The RTX 4090 ships with 24 GB GDDR6X at 1008 GB/s and 16,384 CUDA cores. For local inference, memory bandwidth is the bottleneck on dense decoder-only models, and the 4090 sits roughly 35% above the RTX 3090's 936 GB/s and 60% below an H100 SXM. In practice this means:
- You can fully load a 32B dense model at Q4_K_M (~19 GB weights + 3–4 GB KV cache for 32K context).
- You can run a 70B dense model only at sub-3-bit quantization (IQ2_S, IQ2_XXS, AQLM 2-bit), with degraded quality and tight context budgets.
- You can comfortably host a 14B model at Q6_K or even Q8_0 with 64K context, leaving room for a draft model or vision encoder.
- MoE models like Mixtral 8x7B need ~26–28 GB at Q4 and therefore overflow — expect 30–40% throughput loss versus pure-GPU inference.
If you are sizing hardware rather than picking a model, our cloud-vs-local cost calculator compares the amortized cost of a 4090 against API spend on Claude Sonnet, GPT-4.1, and DeepSeek.
The shortlist: 6 models that earn their VRAM
We benchmarked on a clean Linux setup (driver 565.x, CUDA 12.6) with llama.cpp b4xxx, vLLM 0.7, and Ollama 0.5. Prompt: 512 tokens. Generation: 512 tokens. Single-batch. All numbers are medians over 5 runs, rounded.
| Rank | Model | Quant | VRAM used | Tok/s (gen) | Best for |
|---|---|---|---|---|---|
| 1 | Qwen3-Coder 32B Instruct | Q4_K_M | 21.4 GB | 55 | Coding, agents, tool use |
| 2 | Qwen3 14B Instruct | Q6_K | 13.8 GB | 92 | Daily chat, RAG, summaries |
| 3 | Llama 3.3 70B Instruct | IQ2_XXS | 23.1 GB | 14 | Reasoning, long-form writing |
| 4 | DeepSeek-R1-Distill-Qwen 32B | Q4_K_M | 21.0 GB | 52 | Math, multi-step reasoning |
| 5 | Gemma 3 27B Instruct | Q5_K_M | 20.7 GB | 48 | Multilingual, vision (with adapter) |
| 6 | Phi-4 14B | Q8_0 | 15.6 GB | 78 | Structured output, JSON, classification |
1. Qwen3-Coder 32B Instruct — the default pick
If you only install one model on a 4090, install this one. The 32B variant of Qwen3-Coder outperforms GPT-4o-mini on HumanEval+ and SWE-bench Verified in Alibaba's published numbers, and crucially it fits cleanly at Q4_K_M with 32K context. Native tool-calling works with the standard OpenAI-compatible endpoint, so it slots into Continue, Aider, and OpenCode without a system-prompt shim.
2. Qwen3 14B Instruct — the fast daily driver
For interactive use — chat, RAG, doc summarization — 95 tok/s feels closer to a hosted API than to local inference. At Q6_K you retain near-full-precision quality, and the leftover 10 GB is enough for a 7B draft model to push generation speed past 140 tok/s with speculative decoding.
3. Llama 3.3 70B Instruct — only if you accept 2-bit
Meta's Llama 3.3 70B is the only dense 70B that fits a 4090, and only at IQ2_XXS or AQLM 2-bit. Expect a measurable but not catastrophic quality drop — MMLU stays above 78, but instruction-following on edge cases degrades. Use it for long-form writing where the extra world knowledge matters more than latency.
4. DeepSeek-R1-Distill-Qwen 32B — reasoning specialist
The distilled R1 variant brings chain-of-thought reasoning into a 32B footprint. It's slower in wall-clock terms (it thinks before answering), but on AIME and MATH-500 it matches o1-mini at zero marginal cost.
5. Gemma 3 27B — multilingual and multimodal
Gemma 3 is the only model in this list with a usable vision adapter that fits alongside the language weights in 24 GB. If you need image input or strong non-English performance (especially CJK), this is the pick.
6. Phi-4 14B — structured output champion
Microsoft's Phi-4 punches well above its weight on classification, JSON extraction, and constrained generation. Run it at Q8_0 since the headroom is there — you'll never notice the VRAM cost.
Quantization: what to pick and why
The single most common mistake on 24 GB cards is over-quantizing models that would fit at higher precision. As a rule:
| Model size | Recommended quant | Why |
|---|---|---|
| ≤ 8B | Q8_0 or FP16 | VRAM is not the constraint; quality is. |
| 13–14B | Q6_K or Q8_0 | Q4 leaves 17 GB on the table for no quality reason. |
| 27–32B | Q4_K_M | Sweet spot — full GPU offload, 32K context. |
| 70B | IQ2_XXS / AQLM 2-bit | Anything higher overflows to system RAM and tanks throughput. |
The IQ2 and IQ3 imatrix quants from bartowski are consistently 1–2 points better on MMLU than equivalently sized legacy Q2_K/Q3_K_M. There is no reason to use the older quant formats in 2026.
Framework: llama.cpp vs vLLM vs Ollama vs TensorRT-LLM
Same model, same hardware, four engines:
| Engine | Qwen3 14B Q6_K (tok/s) | Best for | Trade-off |
|---|---|---|---|
| llama.cpp | 92 | Single user, GGUF flexibility | No tensor parallelism |
| Ollama | 81 | Zero-config, model library | Wraps llama.cpp with overhead |
| vLLM (AWQ) | 108 | Multi-request serving | Higher VRAM baseline |
| TensorRT-LLM | 121 | Production inference | Compile step, NVIDIA-only |
For a single developer talking to one model at a time, the speed gap between llama.cpp and TensorRT-LLM rarely justifies the build complexity. The moment you serve 2+ concurrent users, vLLM's continuous batching pulls ahead by 3–5x.
Power, thermals, and the case for undervolting
The RTX 4090 has a 450 W TDP but local LLM inference rarely pulls more than 320–360 W sustained — the workload is memory-bound, not compute-bound. Capping the power limit at 350 W via nvidia-smi -pl 350 costs about 2% throughput and drops package temperature by 8–10°C. For 24/7 operation, that's the right setting.
Rough cost-per-million-tokens
At US average residential electricity ($0.16/kWh) and 4090 amortized over 3 years at $1,600:
- Qwen3-Coder 32B Q4_K_M: ~$0.18 per million output tokens (hardware + power).
- Qwen3 14B Q6_K: ~$0.11 per million output tokens.
- Compare to Claude Sonnet 4.5 at $15/M output or GPT-4.1 at $8/M.
The break-even versus a frontier API is roughly 12–18 million output tokens per month. Below that, the API wins on TCO. The methodology behind these numbers is documented on our methodology page.
Setup: get Qwen3-Coder 32B running in 6 commands
# 1. Install Ollama (or use llama.cpp if you prefer)
curl -fsSL https://ollama.com/install.sh | sh
# 2. Pull the model (Q4_K_M is the default)
ollama pull qwen3-coder:32b
# 3. Cap power for sustained operation
sudo nvidia-smi -pl 350
# 4. Set the context window in the Modelfile or via API
ollama run qwen3-coder:32b "/set parameter num_ctx 32768"
# 5. Verify VRAM usage stays under 23 GB
nvidia-smi --query-gpu=memory.used --format=csv
# 6. Point your IDE (Continue, Aider, Zed) at http://localhost:11434/v1
For more advanced setups (speculative decoding, draft models, vLLM serving), the open-source MCP server server exposes hardware-aware model recommendations directly inside Claude Desktop, Cursor, or any MCP-compatible client. The underlying ranking data is also available as the free BestLLMfor public API (CC BY 4.0).
What to skip on a 4090
- Mixtral 8x7B at Q4: 26–28 GB, overflows. Qwen3 32B is smaller, faster, and scores higher on every public benchmark.
- Llama 3.1 405B at any quant: not happening on a single 4090.
- Command R+ 104B: requires either dual 3090s or aggressive 2-bit quant; Llama 3.3 70B IQ2_XXS is a better use of the same memory.
- GPTQ in 2026: AWQ and GGUF imatrix quants have surpassed it on both quality and speed.
- Anything in FP16 above 8B: there's no quality benefit over Q8_0 that justifies halving your throughput.
Final verdict
| Use case | Pick | Quant | Why |
|---|---|---|---|
| Coding & agents | Qwen3-Coder 32B | Q4_K_M | Best quality model that fully fits on GPU |
| Daily chat / RAG | Qwen3 14B | Q6_K | 90+ tok/s, near-full quality |
| Reasoning / math | DeepSeek-R1-Distill 32B | Q4_K_M | o1-mini class reasoning, local |
| Long-form writing | Llama 3.3 70B | IQ2_XXS | Only dense 70B that fits |
| JSON / classification | Phi-4 14B | Q8_0 | Best structured-output model under 20B |
| Vision / multilingual | Gemma 3 27B | Q5_K_M | Strong vision adapter, CJK support |
For the broader landscape across other GPUs and Apple Silicon, see the full catalog and rankings. Methodology details and the team behind the numbers are on the about page.
FAQ
Can an RTX 4090 run Llama 3.3 70B?
Yes, but only at sub-3-bit quantization (IQ2_XXS or AQLM 2-bit), using about 23 GB of VRAM and generating ~14 tok/s. Quality degrades measurably versus Q4 or higher — MMLU drops 4–6 points — but instruction-following remains usable for non-critical work.
Is the RTX 4090 still worth buying for LLMs in 2026?
For new purchases, the RTX 5090 (32 GB) is a better fit if budget allows, since it lets a 70B model run at Q4 rather than IQ2. But the 4090 remains the best price/performance option on the used market and runs every model below 32B at full quality.
How much faster is a 4090 vs a 3090 for local LLMs?
On dense models at Q4_K_M, the 4090 is 30–40% faster in tokens/second — mostly because of its 1008 GB/s bandwidth vs the 3090's 936 GB/s and the L2 cache size difference. For prompt processing, the gap widens to 50–60% thanks to FP8 tensor cores.
What quantization should I use on a 4090?
Q4_K_M for 27–32B models, Q6_K or Q8_0 for 13–14B models, FP16/Q8_0 for anything 8B and under, and IQ2_XXS for 70B. Always prefer modern imatrix quants (bartowski's GGUFs) over legacy Q2_K/Q3_K_M.
Should I use Ollama or llama.cpp directly?
Ollama for convenience, llama.cpp for the last 10–15% of throughput and access to flags like speculative decoding, custom RoPE, or aggressive batch sizes. For multi-user serving, switch to vLLM or TensorRT-LLM.
Does the 4090 throttle during long inference sessions?
Stock cooling handles continuous LLM inference fine because the workload is memory-bound and rarely exceeds 360 W. Cap power at 350 W via nvidia-smi -pl 350 for 24/7 use — you lose ~2% throughput and gain 8–10°C of thermal headroom.