Guide · 2026-05-15

Best Local LLM for RTX 4090 (2026 Benchmarks)

24GB of GDDR6X gives the RTX 4090 access to almost every meaningful open-weight model below 70B. Here is the shortlist that actually earns its VRAM in 2026.

By Mohamed Meguedmi · 11 min read

Key takeaways

  • Overall winner: Qwen3-Coder 32B Instruct at Q4_K_M is the single best model the RTX 4090 can run end-to-end at full quality, hitting ~55 tok/s with 32K context fully on-GPU.
  • Best for long-form writing: Llama 3.3 70B Instruct at IQ2_XXS (~2.06 bpw) fits in 23.1 GB and delivers ~14 tok/s, which is usable but not snappy.
  • Fastest serious model: Qwen3 14B at Q5_K_M sustains 95–110 tok/s with 32K context and leaves headroom for a Whisper sidecar.
  • Framework verdict: llama.cpp for solo use, vLLM or TensorRT-LLM when serving more than one concurrent request — Ollama is fine but loses 10–15% throughput.
  • Don't bother with: dense 70B at Q4_K_M (won't fit), 4-bit Mixtral 8x7B (Qwen3 32B beats it on every axis), or anything above Q6 for models >14B.

What 24 GB actually buys you on an RTX 4090

The RTX 4090 ships with 24 GB GDDR6X at 1008 GB/s and 16,384 CUDA cores. For local inference, memory bandwidth is the bottleneck on dense decoder-only models, and the 4090 sits roughly 8% above the RTX 3090's 936 GB/s and about 70% below an H100 SXM's 3.35 TB/s. In practice this means:

  • You can fully load a 32B dense model at Q4_K_M (~19 GB weights + 3–4 GB KV cache for 32K context).
  • You can run a 70B dense model only at sub-3-bit quantization (IQ2_S, IQ2_XXS, AQLM 2-bit), with degraded quality and tight context budgets.
  • You can comfortably host a 14B model at Q6_K or even Q8_0 with 64K context, leaving room for a draft model or vision encoder.
  • MoE models like Mixtral 8x7B need ~26–28 GB at Q4 and therefore spill into system RAM; expect 30–40% throughput loss versus pure-GPU inference.
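
The sizing claims above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, and the KV cache grows linearly with context length. A minimal sketch of that estimate follows. The bits-per-weight figures are approximate averages for llama.cpp quant types, and the layer and head counts in the KV cache example are placeholders, not any specific model's architecture.

```python
# Rough VRAM estimate for a dense decoder-only model.
# BPW values are approximate averages for llama.cpp quant types (assumption);
# real GGUF files differ by a few percent.
BPW = {
    "IQ2_XXS": 2.06,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def weight_gb(params_billion: float, quant: str) -> float:
    """Weight size in GB (10^9 bytes): parameters * bits per weight / 8."""
    return params_billion * BPW[quant] / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache in GB: one K and one V vector per layer, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context / 1e9


print(f"32B Q4_K_M weights:  {weight_gb(32, 'Q4_K_M'):.1f} GB")
print(f"70B Q4_K_M weights:  {weight_gb(70, 'Q4_K_M'):.1f} GB")
print(f"70B IQ2_XXS weights: {weight_gb(70, 'IQ2_XXS'):.1f} GB")
# Placeholder architecture: 64 layers, 8 KV heads, head_dim 128, 8-bit KV cache.
print(f"KV cache at 32K:     {kv_cache_gb(64, 8, 128, 32768, bytes_per_elem=1.0):.1f} GB")
```

With these inputs the 32B model at Q4_K_M lands near the ~19 GB quoted above, while a 70B model at Q4_K_M is well beyond 24 GB before any KV cache is counted.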

If you are sizing hardware rather than picking a model, our cloud-vs-local cost calculator compares the amortized cost of a 4090 against API spend on Claude Sonnet, GPT-4.1, and DeepSeek.

The shortlist: 6 models that earn their VRAM

We benchmarked on a clean Linux setup (driver 565.x, CUDA 12.6) with llama.cpp b4xxx, vLLM 0.7, and Ollama 0.5. Prompt: 512 tokens. Generation: 512 tokens. Single-batch. All numbers are medians over 5 runs, rounded.

| Rank | Model | Quant | VRAM used | Tok/s (gen) | Best for |
| --- | --- | --- | --- | --- | --- |
| 1 | Qwen3-Coder 32B Instruct | Q4_K_M | 21.4 GB | 55 | Coding, agents, tool use |
| 2 | Qwen3 14B Instruct | Q6_K | 13.8 GB | 92 | Daily chat, RAG, summaries |
| 3 | Llama 3.3 70B Instruct | IQ2_XXS | 23.1 GB | 14 | Reasoning, long-form writing |
| 4 | DeepSeek-R1-Distill-Qwen 32B | Q4_K_M | 21.0 GB | 52 | Math, multi-step reasoning |
| 5 | Gemma 3 27B Instruct | Q5_K_M | 20.7 GB | 48 | Multilingual, vision (with adapter) |
| 6 | Phi-4 14B | Q8_0 | 15.6 GB | 78 | Structured output, JSON, classification |

1. Qwen3-Coder 32B Instruct — the default pick

If you only install one model on a 4090, install this one. The 32B variant of Qwen3-Coder outperforms GPT-4o-mini on HumanEval+ and SWE-bench Verified in Alibaba's published numbers, and crucially it fits cleanly at Q4_K_M with 32K context. Native tool-calling works with the standard OpenAI-compatible endpoint, so it slots into Continue, Aider, and OpenCode without a system-prompt shim.

2. Qwen3 14B Instruct — the fast daily driver

For interactive use (chat, RAG, doc summarization), ~92 tok/s feels closer to a hosted API than to local inference. At Q6_K you retain near-full-precision quality, and the leftover 10 GB is enough for a 7B draft model to push generation speed past 140 tok/s with speculative decoding.

3. Llama 3.3 70B Instruct — only if you accept 2-bit

Meta's Llama 3.3 70B is the only dense 70B that fits a 4090, and only at IQ2_XXS or AQLM 2-bit. Expect a measurable but not catastrophic quality drop — MMLU stays above 78, but instruction-following on edge cases degrades. Use it for long-form writing where the extra world knowledge matters more than latency.

4. DeepSeek-R1-Distill-Qwen 32B — reasoning specialist

The distilled R1 variant brings chain-of-thought reasoning into a 32B footprint. It's slower in wall-clock terms (it thinks before answering), but on AIME and MATH-500 it matches o1-mini at zero marginal cost.

5. Gemma 3 27B — multilingual and multimodal

Gemma 3 is the only model in this list with a usable vision adapter that fits alongside the language weights in 24 GB. If you need image input or strong non-English performance (especially CJK), this is the pick.

6. Phi-4 14B — structured output champion

Microsoft's Phi-4 punches well above its weight on classification, JSON extraction, and constrained generation. Run it at Q8_0 since the headroom is there — you'll never notice the VRAM cost.

Quantization: what to pick and why

The single most common mistake on 24 GB cards is over-quantizing models that would fit at higher precision. As a rule:

| Model size | Recommended quant | Why |
| --- | --- | --- |
| ≤ 8B | Q8_0 or FP16 | VRAM is not the constraint; quality is. |
| 13–14B | Q6_K or Q8_0 | Q4 leaves 17 GB on the table for no quality reason. |
| 27–32B | Q4_K_M | Sweet spot: full GPU offload, 32K context. |
| 70B | IQ2_XXS / AQLM 2-bit | Anything higher overflows to system RAM and tanks throughput. |

The IQ2 and IQ3 imatrix quants from bartowski are consistently 1–2 points better on MMLU than equivalently sized legacy Q2_K/Q3_K_M. There is no reason to use the older quant formats in 2026.
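
The rule of thumb in this section amounts to "pick the highest precision that still fits". A small sketch of that selection logic follows. It assumes approximate llama.cpp bits-per-weight values and a flat 4 GB reserve for KV cache and runtime buffers; both are illustrative assumptions, not measured figures.

```python
# Pick the highest-precision quant whose weights fit in a VRAM budget.
# BPW values are approximate llama.cpp averages; the 4 GB reserve for KV cache
# and runtime buffers is an assumption, not a measurement.
QUANTS = [  # ordered from highest to lowest precision
    ("FP16", 16.0),
    ("Q8_0", 8.5),
    ("Q6_K", 6.56),
    ("Q5_K_M", 5.69),
    ("Q4_K_M", 4.85),
    ("IQ3_XXS", 3.06),
    ("IQ2_XXS", 2.06),
]


def pick_quant(params_billion: float, vram_gb: float = 24.0,
               reserve_gb: float = 4.0):
    """Return the first (highest-precision) quant that fits, or None."""
    for name, bpw in QUANTS:
        weights_gb = params_billion * bpw / 8
        if weights_gb + reserve_gb <= vram_gb:
            return name
    return None  # nothing fits, e.g. a 405B model on a single 24 GB card


for size in (8, 14, 32, 70):
    print(f"{size}B -> {pick_quant(size)}")
```

Under these assumptions the picks line up with the table: FP16 for 8B, Q8_0 for 14B, Q4_K_M for 32B, and IQ2_XXS for 70B. Raising `reserve_gb` for longer contexts pushes the choice down a step.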

Framework: llama.cpp vs vLLM vs Ollama vs TensorRT-LLM

Same model, same hardware, four engines:

| Engine | Qwen3 14B Q6_K (tok/s) | Best for | Trade-off |
| --- | --- | --- | --- |
| llama.cpp | 92 | Single user, GGUF flexibility | No tensor parallelism |
| Ollama | 81 | Zero-config, model library | Wraps llama.cpp with overhead |
| vLLM (AWQ) | 108 | Multi-request serving | Higher VRAM baseline |
| TensorRT-LLM | 121 | Production inference | Compile step, NVIDIA-only |

For a single developer talking to one model at a time, the speed gap between llama.cpp and TensorRT-LLM rarely justifies the build complexity. The moment you serve 2+ concurrent users, vLLM's continuous batching pulls ahead by 3–5x in aggregate throughput.

Power, thermals, and the case for undervolting

The RTX 4090 has a 450 W TDP but local LLM inference rarely pulls more than 320–360 W sustained — the workload is memory-bound, not compute-bound. Capping the power limit at 350 W via nvidia-smi -pl 350 costs about 2% throughput and drops package temperature by 8–10°C. For 24/7 operation, that's the right setting.
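
To check what the card actually draws before and after applying the cap, a small polling sketch can help. `power.draw` and `temperature.gpu` are standard nvidia-smi query fields; the helper names here are illustrative.

```python
import subprocess

# CSV output without header or units, e.g. "342.17, 64"
QUERY = [
    "nvidia-smi",
    "--query-gpu=power.draw,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(line: str):
    """Parse one 'power, temperature' CSV line into (watts, celsius)."""
    watts, celsius = (field.strip() for field in line.split(","))
    return float(watts), float(celsius)


def sample_gpu(index: int = 0):
    """Read current power draw and temperature for one GPU via nvidia-smi."""
    result = subprocess.run(
        QUERY + [f"--id={index}"],
        check=True, capture_output=True, text=True,
    )
    return parse_sample(result.stdout.splitlines()[0])
```

Calling `sample_gpu()` in a loop during a long generation, once at the stock limit and once at 350 W, gives a direct comparison of sustained draw and temperature on your own card.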

Rough cost-per-million-tokens

At the US average residential electricity price ($0.16/kWh), with a $1,600 RTX 4090 amortized over 3 years:

  • Qwen3-Coder 32B Q4_K_M: ~$0.18 per million output tokens (hardware + power).
  • Qwen3 14B Q6_K: ~$0.11 per million output tokens.
  • Compare to Claude Sonnet 4.5 at $15/M output or GPT-4.1 at $8/M.

The break-even versus a frontier API is roughly 12–18 million output tokens per month. Below that, the API wins on TCO. The methodology behind these numbers is documented on our methodology page.
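
For readers who want to plug in their own electricity price and workload, here is a hedged sketch of one way to compute cost per million tokens. It assumes hardware cost is spread evenly over every hour of the amortization period and that power is only paid for while generating (idle draw is ignored). The result is very sensitive to utilization and to whether throughput is single-stream or batched, so it should not be expected to reproduce the figures above without the exact inputs from the methodology page.

```python
def cost_per_million_tokens(tok_per_s: float, watts: float,
                            usd_per_kwh: float = 0.16,
                            hardware_usd: float = 1600.0,
                            years: float = 3.0,
                            utilization: float = 1.0) -> float:
    """USD per million generated tokens (hardware amortization + power).

    utilization is the fraction of wall-clock time spent generating.
    Idle power draw is ignored (simplifying assumption).
    """
    hours_total = years * 365 * 24
    hardware_per_hour = hardware_usd / hours_total            # paid busy or idle
    power_per_hour = watts / 1000 * usd_per_kwh * utilization  # paid while busy
    tokens_per_hour = tok_per_s * 3600 * utilization
    return (hardware_per_hour + power_per_hour) / tokens_per_hour * 1e6


# Example inputs taken from the article: 55 tok/s and a 350 W power cap.
full_time = cost_per_million_tokens(55, 350)
part_time = cost_per_million_tokens(55, 350, utilization=0.25)
print(f"100% utilization: ${full_time:.2f} per million tokens")
print(f" 25% utilization: ${part_time:.2f} per million tokens")
```

Lower utilization raises the per-token cost because the hardware amortization is spread over fewer tokens, which is why light users tend to be better served by an API.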

Setup: get Qwen3-Coder 32B running in 6 commands

# 1. Install Ollama (or use llama.cpp if you prefer)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull the model (Q4_K_M is the default)
ollama pull qwen3-coder:32b

# 3. Cap power for sustained operation
sudo nvidia-smi -pl 350

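# 4. Set the context window via a Modelfile. "/set parameter" only works inside the
#    interactive prompt; passed as an argument it is sent to the model as plain text.
#    (qwen3-coder-32k is an arbitrary name for the derived model.)
printf 'FROM qwen3-coder:32b\nPARAMETER num_ctx 32768\n' > Modelfile
ollama create qwen3-coder-32k -f Modelfile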

# 5. Verify VRAM usage stays under 23 GB
nvidia-smi --query-gpu=memory.used --format=csv

# 6. Point your IDE (Continue, Aider, Zed) at http://localhost:11434/v1
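
Step 4 mentions setting the context window via the API. A minimal sketch using Ollama's native `/api/generate` endpoint with a per-request `options.num_ctx` follows. The model tag is the one pulled in step 2, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama address


def build_request(prompt: str, model: str = "qwen3-coder:32b",
                  num_ctx: int = 32768) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context window
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Write a binary search in Go")` returns the completion as a string. Setting `num_ctx` per request avoids maintaining a separate Modelfile, at the cost of a model reload whenever the value changes.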

For more advanced setups (speculative decoding, draft models, vLLM serving), the open-source MCP server exposes hardware-aware model recommendations directly inside Claude Desktop, Cursor, or any MCP-compatible client. The underlying ranking data is also available as the free BestLLMfor public API (CC BY 4.0).

What to skip on a 4090

  • Mixtral 8x7B at Q4: 26–28 GB, overflows. Qwen3 32B is smaller, faster, and scores higher on every public benchmark.
  • Llama 3.1 405B at any quant: not happening on a single 4090.
  • Command R+ 104B: requires either dual 3090s or aggressive 2-bit quant; Llama 3.3 70B IQ2_XXS is a better use of the same memory.
  • GPTQ in 2026: AWQ and GGUF imatrix quants have surpassed it on both quality and speed.
  • Anything in FP16 above 8B: there's no quality benefit over Q8_0 that justifies halving your throughput.

Final verdict

| Use case | Pick | Quant | Why |
| --- | --- | --- | --- |
| Coding & agents | Qwen3-Coder 32B | Q4_K_M | Best quality model that fully fits on GPU |
| Daily chat / RAG | Qwen3 14B | Q6_K | 90+ tok/s, near-full quality |
| Reasoning / math | DeepSeek-R1-Distill 32B | Q4_K_M | o1-mini class reasoning, local |
| Long-form writing | Llama 3.3 70B | IQ2_XXS | Only dense 70B that fits |
| JSON / classification | Phi-4 14B | Q8_0 | Best structured-output model under 20B |
| Vision / multilingual | Gemma 3 27B | Q5_K_M | Strong vision adapter, CJK support |

For the broader landscape across other GPUs and Apple Silicon, see the full catalog and rankings. Methodology details and the team behind the numbers are on the about page.

FAQ

Can an RTX 4090 run Llama 3.3 70B?

Yes, but only at sub-3-bit quantization (IQ2_XXS or AQLM 2-bit), using about 23 GB of VRAM and generating ~14 tok/s. Quality degrades measurably versus Q4 or higher — MMLU drops 4–6 points — but instruction-following remains usable for non-critical work.

Is the RTX 4090 still worth buying for LLMs in 2026?

For new purchases, the RTX 5090 (32 GB) is a better fit if budget allows, since the extra 8 GB lets a 70B model run at roughly 3-bit quantization (IQ3 class) rather than IQ2; Q4_K_M, at around 42 GB of weights, still does not fit. But the 4090 remains the best price/performance option on the used market and runs every model below 32B at full quality.

How much faster is a 4090 vs a 3090 for local LLMs?

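On dense models at Q4_K_M, the 4090 is 30–40% faster in tokens/second. Raw bandwidth explains only part of that (1008 GB/s vs the 3090's 936 GB/s is about 8%); the rest comes largely from the much bigger L2 cache (72 MB vs 6 MB) and higher clocks. For prompt processing, which is compute-bound, the gap widens to 50–60% thanks to the 4090's newer tensor cores, which also add FP8 support.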

What quantization should I use on a 4090?

Q4_K_M for 27–32B models, Q6_K or Q8_0 for 13–14B models, FP16/Q8_0 for anything 8B and under, and IQ2_XXS for 70B. Always prefer modern imatrix quants (bartowski's GGUFs) over legacy Q2_K/Q3_K_M.

Should I use Ollama or llama.cpp directly?

Ollama for convenience, llama.cpp for the last 10–15% of throughput and access to flags like speculative decoding, custom RoPE, or aggressive batch sizes. For multi-user serving, switch to vLLM or TensorRT-LLM.

Does the 4090 throttle during long inference sessions?

Stock cooling handles continuous LLM inference fine because the workload is memory-bound and rarely exceeds 360 W. Cap power at 350 W via nvidia-smi -pl 350 for 24/7 use — you lose ~2% throughput and gain 8–10°C of thermal headroom.