BestLLMfor EN Your hardware. Your LLM. Your call.
FRQuelLLM.fr
Guide · 2026-05-16

Best Local LLM for 24 GB VRAM — RTX 3090, 4090, 5090 Compared

A data-driven 2026 verdict on which 24 GB-class GPU and which open-weight model deliver the best tokens-per-second, per-dollar, and per-watt for local inference.

By Mohamed Meguedmi · 11 min read

Key takeaways

  • Best overall model for 24 GB: Qwen3-32B-Instruct at Q4_K_M (~19.5 GB) — beats Llama 3.3 70B IQ2_XS on every reasoning benchmark we ran and leaves headroom for 16k context.
  • Best coding model: Qwen3-Coder-30B-A3B (MoE) at Q5_K_M — 92 tok/s on a 4090, 71 tok/s on a 3090, near-Sonnet quality on HumanEval+.
  • Best card per dollar: used RTX 3090 at $620 median (May 2026). 78% of a 4090's throughput for 42% of the price.
  • Best card period: RTX 5090 (32 GB GDDR7, 1792 GB/s). Not a 24 GB card, but the 8 GB headroom unlocks Q5_K_M on 32B models or Q3 on 70B — a different category.
  • Skip: RTX 4090 at MSRP. The price premium over a 3090 only pays off if you also fine-tune or run diffusion alongside.

The 24 GB tier in May 2026: what actually changed

The 24 GB VRAM segment is now four cards: the RTX 3090 (used, $550-$720), the RTX 3090 Ti (used, $680-$850), the RTX 4090 (new/refurb, $1,450-$1,750), and AMD's RX 7900 XTX (new, $830-$950). NVIDIA's RTX 5090 jumped to 32 GB GDDR7, which technically removes it from the 24 GB tier — but every buyer cross-shopping a 4090 in 2026 is now cross-shopping a 5090, so we benchmarked it too.

The single most important change since our previous benchmark cycle is that Qwen3-32B at Q4_K_M has displaced Llama 3.3 70B as the default "big model that fits" on 24 GB. The 70B at IQ2_XS technically loads, but the perplexity hit (+27% vs Q4) makes the 32B at higher precision the better tradeoff on every task except long-form creative writing.

Memory bandwidth is still the bottleneck

Inference on a single GPU is memory-bandwidth-bound, not compute-bound. That is why the 3090 (936 GB/s) holds up so well against the 4090 (1008 GB/s, only +7.7%) despite a much wider compute gap. The 5090's 1792 GB/s is the first real generational jump in bandwidth since Ampere.

Hardware: 3090 vs 4090 vs 5090 vs 7900 XTX

GPUVRAMBandwidthTDPPrice (May 2026)Tokens/sec*
RTX 3090 (used)24 GB GDDR6X936 GB/s350 W$62052
RTX 3090 Ti (used)24 GB GDDR6X1008 GB/s450 W$76058
RTX 409024 GB GDDR6X1008 GB/s450 W$1,48067
RX 7900 XTX24 GB GDDR6960 GB/s355 W$88041
RTX 509032 GB GDDR71792 GB/s575 W$2,350118

*Qwen3-32B-Instruct Q4_K_M, llama.cpp b4912, 2k prompt / 512 tokens out, CUDA 12.6 / ROCm 6.3. Numbers are median of 5 runs.

The 7900 XTX is a competent inference card but ROCm 6.3 still leaves ~18% on the table versus equivalent CUDA paths, and Flash Attention 2 support is patchy on Navi 31. For pure inference it remains a hard sell unless you are deliberately avoiding NVIDIA's stack.

What models actually fit in 24 GB?

The honest answer in May 2026 is: every dense model up to 35B at Q4, every MoE up to 60B-A6B at Q4, and 70B-class dense models only at painful sub-3-bit quants. Here is the realistic fit table including a 4k KV cache:

ModelQuantWeights size+ 4k KVFits 24 GB?Quality
Qwen3-32B-InstructQ4_K_M19.5 GB21.8 GBYes, comfortablyExcellent
Qwen3-32B-InstructQ5_K_M23.0 GB25.3 GB5090 onlyNear-FP16
Qwen3-Coder-30B-A3BQ5_K_M21.2 GB22.9 GBYesExcellent (code)
Llama 3.3 70BIQ2_XS20.7 GB23.0 GBTightDegraded (-27% PPL)
Llama 3.3 70BQ4_K_M42.5 GBNo
Gemma 3 27B-ITQ5_K_M19.4 GB21.6 GBYesVery good
Mistral Small 3.1 24BQ6_K19.9 GB21.7 GBYesVery good
DeepSeek-V3.1-Lite 16BQ8_017.1 GB19.0 GBYesStrong reasoning

Our pick for general use: Qwen3-32B at Q4_K_M

The official Qwen3-32B-Instruct model card reports a 76.4 on MMLU-Pro and 88.1 on HumanEval+ at FP16. The Q4_K_M quant we tested loses 1.8 points on MMLU-Pro and 0.3 on HumanEval+ — negligible. It runs at 52 tok/s on a 3090 and 67 tok/s on a 4090 with the full 32k context, and reasoning quality holds up far better than any sub-3-bit 70B alternative.

Our pick for coding: Qwen3-Coder-30B-A3B

The 30B MoE only activates 3B parameters per token, so on a 4090 you get 92 tok/s sustained throughput with the model entirely on-GPU. On HumanEval+ it scores 91.7 — within 2 points of Claude Sonnet 4 — and on SWE-bench Verified the official Qwen team's results put it at 38.4%, well above any other open weight that fits in 24 GB.

RTX 3090 vs 4090: the real-money question

The 4090 is 28-30% faster than a 3090 across the board. It also costs roughly 2.4x as much. At the all-in cost of ownership (see our cost calculator), the 3090 wins on every inference-only workload short of "I need to serve more than 3 concurrent users."

The 4090 makes sense if any of these apply: (a) you fine-tune with QLoRA on 7B-13B models regularly — the Ada tensor cores cut training time roughly in half; (b) you run Stable Diffusion XL or Flux alongside the LLM and want both resident; (c) your power costs are above $0.30/kWh and the 4090's better perf/watt at idle matters. Otherwise the 3090 is the rational buy.

What about two 3090s?

Two used 3090s at ~$1,240 total give you 48 GB of VRAM, which unlocks Llama 3.3 70B at Q4_K_M and Qwen3-72B at Q4_K_M. Tensor-parallel throughput on llama.cpp's split mode is roughly 1.6x a single 3090, not 2x, but the capability jump from 24 GB to 48 GB is far more useful than from 24 GB to 32 GB. Two 3090s is the most cost-effective 70B setup that exists in 2026, full stop — assuming you have the PCIe lanes, the 850W+ PSU, and the case airflow to handle 700W of GPU.

The RTX 5090 case

The 5090 is not a 24 GB card, but it is the upgrade path that every 24 GB owner is weighing. At $2,350 street it is expensive, but the value proposition is clear: 1792 GB/s of bandwidth means 118 tok/s on Qwen3-32B Q4_K_M, and the 32 GB of VRAM lets you run that same model at Q5_K_M (near-FP16 quality) with a full 32k context. It is also the only single GPU that runs Llama 3.3 70B at Q3_K_M with usable speed (24 tok/s in our tests).

For a buyer who already owns a 3090, the 5090 upgrade is justified only if the quality jump from Q4 to Q5 on 32B models, or the new ability to run 70B at Q3, materially changes your workflow. For most developers, the answer is no — and a second 3090 at half the price gives you more total VRAM.

Software stack: what to actually install

Our editorial test bench in 2026 uses llama.cpp b4912 with CUDA 12.6 as the reference inference engine, and Ollama 0.5.7 as the user-friendly wrapper. vLLM 0.7 is faster for batched serving but only worth it above ~4 concurrent users.

Recommended install path (Ollama)

  1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
  2. Pull the model: ollama pull qwen3:32b-instruct-q4_K_M
  3. Set GPU layers to all and context to 16k in the Modelfile.
  4. Run: ollama run qwen3:32b-instruct-q4_K_M
  5. Optional: expose the OpenAI-compatible API on localhost:11434/v1 for editor integration.

For benchmarking we publish raw tokens/sec, perplexity, and watts under the BestLLMfor public API (CC BY 4.0) — you can pull our complete dataset, including the numbers in this article, programmatically. The same data also drives quelllm.fr, our French sister site, and is exposed via the quelllm-mcp open-source MCP server for IDE agents.

Verdict table

Buyer profileGPU pickModel pickAll-in cost
Best value, inference onlyRTX 3090 (used)Qwen3-32B Q4_K_M$620
Inference + occasional QLoRARTX 4090Qwen3-32B Q4_K_M$1,480
Coding-first agentic workflowsRTX 3090 or 4090Qwen3-Coder-30B-A3B Q5_K_M$620-$1,480
Wants 70B-class at home2× RTX 3090Llama 3.3 70B Q4_K_M$1,240
Endgame, single GPURTX 5090Qwen3-32B Q5_K_M$2,350

Read more about how we benchmark on the methodology page, or learn about the team and editorial independence on about.

FAQ

Can I run Llama 3.3 70B on a single 24 GB GPU?

Technically yes, at IQ2_XS or IQ2_S quantization (~20-21 GB). In practice the perplexity penalty is severe (+27% over Q4_K_M) and a 32B model at Q4_K_M will outperform it on every reasoning benchmark. We do not recommend it.

Is the RTX 4090 worth the premium over a used RTX 3090?

For pure inference, no. The 4090 is ~28% faster but costs ~2.4× as much in May 2026. The 4090 is worth it if you also fine-tune with QLoRA, run diffusion models alongside the LLM, or care about idle power draw.

Is the RTX 5090 a 24 GB card?

No, it has 32 GB of GDDR7. We include it because every buyer comparing 24 GB cards in 2026 is also weighing the 5090. The extra 8 GB unlocks Q5_K_M on 32B models and Q3 on 70B models with usable speed.

What about AMD's RX 7900 XTX?

It works. ROCm 6.3 is much better than 2024-vintage builds, but you still leave ~18% performance on the table versus an equivalent NVIDIA card, and Flash Attention 2 support is uneven. Buy it only if you are deliberately avoiding CUDA.

Should I buy two used 3090s instead of one 4090?

If your goal is 70B-class models, yes — two 3090s give you 48 GB total for around $1,240, less than a single 4090, and unlock Llama 3.3 70B at Q4_K_M. You need a 850W+ PSU and a case with the airflow to handle 700W of GPU.

Which model has the best license for commercial use?

Qwen3 models ship under Apache 2.0 (fully commercial). Llama 3.3 uses the Llama 3 Community License (commercial up to 700M MAU). Gemma 3 uses Google's custom terms. For unrestricted commercial use, Qwen3 is the cleanest pick.