Best Local LLM for 24 GB VRAM — RTX 3090, 4090, 5090 Compared
A data-driven 2026 verdict on which 24 GB-class GPU and which open-weight model deliver the best tokens-per-second, per-dollar, and per-watt for local inference.
By Mohamed Meguedmi · 11 min read
Key takeaways
- Best overall model for 24 GB:
Qwen3-32B-Instructat Q4_K_M (~19.5 GB) — beats Llama 3.3 70B IQ2_XS on every reasoning benchmark we ran and leaves headroom for 16k context. - Best coding model:
Qwen3-Coder-30B-A3B(MoE) at Q5_K_M — 92 tok/s on a 4090, 71 tok/s on a 3090, near-Sonnet quality on HumanEval+. - Best card per dollar: used RTX 3090 at $620 median (May 2026). 78% of a 4090's throughput for 42% of the price.
- Best card period: RTX 5090 (32 GB GDDR7, 1792 GB/s). Not a 24 GB card, but the 8 GB headroom unlocks Q5_K_M on 32B models or Q3 on 70B — a different category.
- Skip: RTX 4090 at MSRP. The price premium over a 3090 only pays off if you also fine-tune or run diffusion alongside.
The 24 GB tier in May 2026: what actually changed
The 24 GB VRAM segment is now four cards: the RTX 3090 (used, $550-$720), the RTX 3090 Ti (used, $680-$850), the RTX 4090 (new/refurb, $1,450-$1,750), and AMD's RX 7900 XTX (new, $830-$950). NVIDIA's RTX 5090 jumped to 32 GB GDDR7, which technically removes it from the 24 GB tier — but every buyer cross-shopping a 4090 in 2026 is now cross-shopping a 5090, so we benchmarked it too.
The single most important change since our previous benchmark cycle is that Qwen3-32B at Q4_K_M has displaced Llama 3.3 70B as the default "big model that fits" on 24 GB. The 70B at IQ2_XS technically loads, but the perplexity hit (+27% vs Q4) makes the 32B at higher precision the better tradeoff on every task except long-form creative writing.
Memory bandwidth is still the bottleneck
Inference on a single GPU is memory-bandwidth-bound, not compute-bound. That is why the 3090 (936 GB/s) holds up so well against the 4090 (1008 GB/s, only +7.7%) despite a much wider compute gap. The 5090's 1792 GB/s is the first real generational jump in bandwidth since Ampere.
Hardware: 3090 vs 4090 vs 5090 vs 7900 XTX
| GPU | VRAM | Bandwidth | TDP | Price (May 2026) | Tokens/sec* |
|---|---|---|---|---|---|
| RTX 3090 (used) | 24 GB GDDR6X | 936 GB/s | 350 W | $620 | 52 |
| RTX 3090 Ti (used) | 24 GB GDDR6X | 1008 GB/s | 450 W | $760 | 58 |
| RTX 4090 | 24 GB GDDR6X | 1008 GB/s | 450 W | $1,480 | 67 |
| RX 7900 XTX | 24 GB GDDR6 | 960 GB/s | 355 W | $880 | 41 |
| RTX 5090 | 32 GB GDDR7 | 1792 GB/s | 575 W | $2,350 | 118 |
*Qwen3-32B-Instruct Q4_K_M, llama.cpp b4912, 2k prompt / 512 tokens out, CUDA 12.6 / ROCm 6.3. Numbers are median of 5 runs.
The 7900 XTX is a competent inference card but ROCm 6.3 still leaves ~18% on the table versus equivalent CUDA paths, and Flash Attention 2 support is patchy on Navi 31. For pure inference it remains a hard sell unless you are deliberately avoiding NVIDIA's stack.
What models actually fit in 24 GB?
The honest answer in May 2026 is: every dense model up to 35B at Q4, every MoE up to 60B-A6B at Q4, and 70B-class dense models only at painful sub-3-bit quants. Here is the realistic fit table including a 4k KV cache:
| Model | Quant | Weights size | + 4k KV | Fits 24 GB? | Quality |
|---|---|---|---|---|---|
| Qwen3-32B-Instruct | Q4_K_M | 19.5 GB | 21.8 GB | Yes, comfortably | Excellent |
| Qwen3-32B-Instruct | Q5_K_M | 23.0 GB | 25.3 GB | 5090 only | Near-FP16 |
| Qwen3-Coder-30B-A3B | Q5_K_M | 21.2 GB | 22.9 GB | Yes | Excellent (code) |
| Llama 3.3 70B | IQ2_XS | 20.7 GB | 23.0 GB | Tight | Degraded (-27% PPL) |
| Llama 3.3 70B | Q4_K_M | 42.5 GB | — | No | — |
| Gemma 3 27B-IT | Q5_K_M | 19.4 GB | 21.6 GB | Yes | Very good |
| Mistral Small 3.1 24B | Q6_K | 19.9 GB | 21.7 GB | Yes | Very good |
| DeepSeek-V3.1-Lite 16B | Q8_0 | 17.1 GB | 19.0 GB | Yes | Strong reasoning |
Our pick for general use: Qwen3-32B at Q4_K_M
The official Qwen3-32B-Instruct model card reports a 76.4 on MMLU-Pro and 88.1 on HumanEval+ at FP16. The Q4_K_M quant we tested loses 1.8 points on MMLU-Pro and 0.3 on HumanEval+ — negligible. It runs at 52 tok/s on a 3090 and 67 tok/s on a 4090 with the full 32k context, and reasoning quality holds up far better than any sub-3-bit 70B alternative.
Our pick for coding: Qwen3-Coder-30B-A3B
The 30B MoE only activates 3B parameters per token, so on a 4090 you get 92 tok/s sustained throughput with the model entirely on-GPU. On HumanEval+ it scores 91.7 — within 2 points of Claude Sonnet 4 — and on SWE-bench Verified the official Qwen team's results put it at 38.4%, well above any other open weight that fits in 24 GB.
RTX 3090 vs 4090: the real-money question
The 4090 is 28-30% faster than a 3090 across the board. It also costs roughly 2.4x as much. At the all-in cost of ownership (see our cost calculator), the 3090 wins on every inference-only workload short of "I need to serve more than 3 concurrent users."
The 4090 makes sense if any of these apply: (a) you fine-tune with QLoRA on 7B-13B models regularly — the Ada tensor cores cut training time roughly in half; (b) you run Stable Diffusion XL or Flux alongside the LLM and want both resident; (c) your power costs are above $0.30/kWh and the 4090's better perf/watt at idle matters. Otherwise the 3090 is the rational buy.
What about two 3090s?
Two used 3090s at ~$1,240 total give you 48 GB of VRAM, which unlocks Llama 3.3 70B at Q4_K_M and Qwen3-72B at Q4_K_M. Tensor-parallel throughput on llama.cpp's split mode is roughly 1.6x a single 3090, not 2x, but the capability jump from 24 GB to 48 GB is far more useful than from 24 GB to 32 GB. Two 3090s is the most cost-effective 70B setup that exists in 2026, full stop — assuming you have the PCIe lanes, the 850W+ PSU, and the case airflow to handle 700W of GPU.
The RTX 5090 case
The 5090 is not a 24 GB card, but it is the upgrade path that every 24 GB owner is weighing. At $2,350 street it is expensive, but the value proposition is clear: 1792 GB/s of bandwidth means 118 tok/s on Qwen3-32B Q4_K_M, and the 32 GB of VRAM lets you run that same model at Q5_K_M (near-FP16 quality) with a full 32k context. It is also the only single GPU that runs Llama 3.3 70B at Q3_K_M with usable speed (24 tok/s in our tests).
For a buyer who already owns a 3090, the 5090 upgrade is justified only if the quality jump from Q4 to Q5 on 32B models, or the new ability to run 70B at Q3, materially changes your workflow. For most developers, the answer is no — and a second 3090 at half the price gives you more total VRAM.
Software stack: what to actually install
Our editorial test bench in 2026 uses llama.cpp b4912 with CUDA 12.6 as the reference inference engine, and Ollama 0.5.7 as the user-friendly wrapper. vLLM 0.7 is faster for batched serving but only worth it above ~4 concurrent users.
Recommended install path (Ollama)
- Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh - Pull the model:
ollama pull qwen3:32b-instruct-q4_K_M - Set GPU layers to all and context to 16k in the Modelfile.
- Run:
ollama run qwen3:32b-instruct-q4_K_M - Optional: expose the OpenAI-compatible API on
localhost:11434/v1for editor integration.
For benchmarking we publish raw tokens/sec, perplexity, and watts under the BestLLMfor public API (CC BY 4.0) — you can pull our complete dataset, including the numbers in this article, programmatically. The same data also drives quelllm.fr, our French sister site, and is exposed via the quelllm-mcp open-source MCP server for IDE agents.
Verdict table
| Buyer profile | GPU pick | Model pick | All-in cost |
|---|---|---|---|
| Best value, inference only | RTX 3090 (used) | Qwen3-32B Q4_K_M | $620 |
| Inference + occasional QLoRA | RTX 4090 | Qwen3-32B Q4_K_M | $1,480 |
| Coding-first agentic workflows | RTX 3090 or 4090 | Qwen3-Coder-30B-A3B Q5_K_M | $620-$1,480 |
| Wants 70B-class at home | 2× RTX 3090 | Llama 3.3 70B Q4_K_M | $1,240 |
| Endgame, single GPU | RTX 5090 | Qwen3-32B Q5_K_M | $2,350 |
Read more about how we benchmark on the methodology page, or learn about the team and editorial independence on about.
FAQ
Can I run Llama 3.3 70B on a single 24 GB GPU?
Technically yes, at IQ2_XS or IQ2_S quantization (~20-21 GB). In practice the perplexity penalty is severe (+27% over Q4_K_M) and a 32B model at Q4_K_M will outperform it on every reasoning benchmark. We do not recommend it.
Is the RTX 4090 worth the premium over a used RTX 3090?
For pure inference, no. The 4090 is ~28% faster but costs ~2.4× as much in May 2026. The 4090 is worth it if you also fine-tune with QLoRA, run diffusion models alongside the LLM, or care about idle power draw.
Is the RTX 5090 a 24 GB card?
No, it has 32 GB of GDDR7. We include it because every buyer comparing 24 GB cards in 2026 is also weighing the 5090. The extra 8 GB unlocks Q5_K_M on 32B models and Q3 on 70B models with usable speed.
What about AMD's RX 7900 XTX?
It works. ROCm 6.3 is much better than 2024-vintage builds, but you still leave ~18% performance on the table versus an equivalent NVIDIA card, and Flash Attention 2 support is uneven. Buy it only if you are deliberately avoiding CUDA.
Should I buy two used 3090s instead of one 4090?
If your goal is 70B-class models, yes — two 3090s give you 48 GB total for around $1,240, less than a single 4090, and unlock Llama 3.3 70B at Q4_K_M. You need a 850W+ PSU and a case with the airflow to handle 700W of GPU.
Which model has the best license for commercial use?
Qwen3 models ship under Apache 2.0 (fully commercial). Llama 3.3 uses the Llama 3 Community License (commercial up to 700M MAU). Gemma 3 uses Google's custom terms. For unrestricted commercial use, Qwen3 is the cleanest pick.