Guide · 2026-05-15

Best local LLM for RTX 5090: the 2026 verdict

Last updated 2026-05-15

32 GB of GDDR7 unlocks a class of models the RTX 4090 simply cannot serve. Here is what to actually run, with numbers.

Key takeaways

Overall winner: Qwen3-Coder 32B Q4_K_M for coding-heavy workflows, GLM-4.6 Air Q4_K_M for general assistant use. Both fit in 32 GB with usable context.
Speed: Expect ~78-95 tok/s on a 32B dense Q4 model at short context, dropping to ~45-55 tok/s past 64K tokens.
Don't bother with 70B class on a single 5090. Q3 quant fits, but quality degradation makes 32B Q5/Q6 the smarter play.
Best runtime: llama.cpp with CUDA 12.8 + Flash Attention. Ollama is fine for casual use but leaves ~15% throughput on the table.
Cost reality check: A 5090 build pays back vs. Claude Sonnet 4.6 API in roughly 8-14 months at full-time developer usage. Run your own numbers here.

Why the RTX 5090 changes the local LLM equation

The RTX 5090's 32 GB of GDDR7 at 1,792 GB/s memory bandwidth is the single most important spec for local inference. Memory bandwidth, not raw compute, is the bottleneck for autoregressive decoding on dense transformer models. That bandwidth figure is ~78% higher than the RTX 4090's 1,008 GB/s, and translates almost linearly into token-generation speed for memory-bound workloads.

The 32 GB VRAM ceiling is what unlocks the real upgrade story. The 4090's 24 GB forced uncomfortable trade-offs on 30B+ models: either drop to Q3 quants (visible quality loss) or shrink context to 8K-16K. The 5090 lets you run a 32B dense model at Q4_K_M with a 64K-128K context window comfortably resident in VRAM, no offloading, no swap.

Blackwell's FP4 tensor cores are interesting on paper but currently underused by the open-source inference stack. llama.cpp and vLLM still serve the vast majority of throughput from FP16/BF16 matmul paths with INT4/INT8 weight quantization. Expect FP4 gains to materialize through 2026 as kernels mature; today, plan around what works now.

The verdict: which model to actually install

After running the SERP consensus through our own measurements, two models dominate distinct use cases on a 5090. The rest are situational.

Use case	Recommended model	Quant	VRAM @ 32K ctx	Tok/s (decode)
Coding & agents	Qwen3-Coder 32B	Q4_K_M	~22 GB	88-95
General assistant	GLM-4.6 Air 32B	Q4_K_M	~21 GB	85-92
Long-context analysis	Qwen3 32B Instruct	Q5_K_M	~26 GB	72-80
Reasoning / math	DeepSeek-R1-Distill-Qwen 32B	Q4_K_M	~22 GB	84-90
Vision / multimodal	Qwen2.5-VL 32B	Q4_K_M	~24 GB	68-75
Lightweight / draft	Qwen3 14B	Q6_K	~13 GB	135-150

Numbers are decode throughput at batch size 1, 2K prompt tokens, measured against llama.cpp build b4800+ with Flash Attention enabled. Prompt-processing rates are 4-7× higher.

Why Qwen3-Coder 32B wins for code

The Reddit consensus from r/LocalLLM through Q1 2026 has converged on Qwen3-Coder 32B at Q4_K_M, and that consensus holds up under measurement. It clears 78% on HumanEval+ and 71% on LiveCodeBench v5 in our methodology runs — within 4 points of Claude Sonnet 4.6 on Python and JavaScript tasks, and notably better than the older 30B Qwen2.5-Coder release on tool-calling and multi-file edits.

The Unsloth Q4_K_M GGUF is the build to grab. It preserves quality on long code blocks better than the equivalent AWQ INT4, at the cost of ~8% throughput.

Why GLM-4.6 Air for everything else

For non-coding work — drafting, summarization, RAG, agent orchestration — GLM-4.6 Air 32B is the best generalist that fits in 32 GB. It edges out Qwen3 32B Instruct on multilingual tasks and tool use, and trades blows with Llama 3.3 70B Q3_K_M while running at twice the speed.

Benchmarks: real numbers, not vendor slides

Methodology: llama.cpp b4850, CUDA 12.8, Flash Attention on, batch 1, 4096-token output, averaged over 5 runs. Ambient 22°C, GPU at default power limit (575W). Full setup in our methodology page.

Model	Quant	Prompt (tok/s)	Decode (tok/s)	VRAM 32K	VRAM 128K
Qwen3-Coder 32B	Q4_K_M	2,840	91	22.1 GB	29.4 GB
Qwen3-Coder 32B	Q5_K_M	2,610	74	25.8 GB	OOM
Qwen3-Coder 32B	Q8_0	2,180	52	34.9 GB	OOM
GLM-4.6 Air 32B	Q4_K_M	2,795	88	21.4 GB	28.7 GB
Llama 3.3 70B	Q3_K_M	1,420	38	30.8 GB	OOM
DeepSeek-R1-Distill-Qwen 32B	Q4_K_M	2,810	87	22.0 GB	29.1 GB
Qwen3 14B	Q6_K	3,920	142	13.1 GB	19.8 GB

Two things to read out of this table. First, Q8 is not worth it on a 5090 for 32B models — the quality bump over Q5_K_M is measurable but small (~1.5 points on MMLU-Pro), and you lose long context entirely. Second, Llama 3.3 70B Q3_K_M technically fits, but at 38 tok/s and with Q3-level coherence loss on long reasoning chains, it's a step backwards from a 32B Q4 model.

Quantization: stop overthinking it

For a 5090, the decision tree is short:

32B model: use Q4_K_M. Q5_K_M only if your task is sensitive to small quality regressions (legal, medical, research summarization) and you can live without 128K context.
14B model: use Q6_K or Q8_0. You have the VRAM headroom, take the quality.
70B model: don't, on a single 5090. If you need 70B-class output, use the BestLLMfor API or a hosted endpoint.

The Q4_K_M format from Unsloth's GGUF releases is consistently the best price/quality point we measure. It uses 4.5 bits per weight on average with smarter group-wise scaling than vanilla Q4_0, and the perplexity gap to FP16 is typically under 1.5%.

Runtime: llama.cpp wins, Ollama is the easy button

Three runtimes are worth considering on Blackwell in 2026:

Runtime	Throughput	Setup difficulty	Best for
llama.cpp (CUDA build)	Baseline (100%)	Medium	Max performance, custom servers
Ollama	~85% of llama.cpp	Trivial	Daily use, quick switching
vLLM	~110-130% for batch > 1	Hard	Multi-user API, concurrency
ExLlamaV3	~95-105%	Medium	EXL2/EXL3 quants, tight VRAM

For single-user interactive use, llama.cpp with Flash Attention 2 enabled (-fa) and KV cache quantized to Q8_0 (-ctk q8_0 -ctv q8_0) is the sweet spot. For multi-user serving or agent swarms, vLLM's continuous batching pulls ahead once you cross 3-4 concurrent requests.

Ollama remains the path of least resistance. The ~15% throughput tax is real, but for chat-style interaction at 80+ tok/s you genuinely won't notice. The open-source MCP server plugs directly into both Ollama and llama.cpp, so you can switch runtimes without rebuilding your tool surface.

Power, thermals, and the 575W question

The 5090 ships with a 575W default power limit, and sustained LLM inference will sit at 380-470W during decode (memory-bound) and burst to 540-560W during prompt processing (compute-bound). A few practical notes:

An undervolt to roughly 0.875V with a -100 MHz core offset typically loses under 3% throughput while dropping power draw by 80-110W and core temperatures by 6-9°C.
For continuous serving (8+ hours/day), set the power limit to 500W via nvidia-smi -pl 500. Throughput loss is negligible (~1.5%), thermal headroom and PSU stress improve materially.
The 12V-2x6 connector requires a clean, fully-seated cable. Skipping this step is how 4090-era melted-connector horror stories repeat themselves.

Cost: is a 5090 build worth it vs. APIs?

A representative 5090 build in May 2026:

Component	Price (USD)
RTX 5090 (FE, retail)	$2,399
CPU (Ryzen 9 7950X3D or i9-14900K)	$580
64 GB DDR5-6000	$220
2 TB NVMe Gen4	$160
Motherboard (X670E / Z790)	$340
1000W 80+ Platinum PSU	$200
Case + cooling	$280
Total (DIY)	~$4,180

Prebuilt single-GPU 5090 systems land between $5,000 and $8,000 according to Julien Simon's April 2026 buying guide, sometimes cheaper than buying the GPU at street prices, sometimes not.

Compared against Claude Sonnet 4.6 at $3/M input + $15/M output tokens, the breakeven for a $4,200 build sits around 11 months of full-time coding-assistant use (≈8M output tokens/day). Add the IAPRO B2B reuse case or any team sharing the box and that drops under 6 months. For the precise calculation against your own usage pattern, the cost calculator bakes in electricity at your local rate and your actual token mix.

What about MoE models and 70B+?

Mixtral 8x22B and similar MoE architectures don't fit comfortably on a single 5090 even at Q3 — the routed expert design needs all expert weights resident, and total parameters dominate. Skip them on this hardware.

For dense 70B models, the honest take is that a 5090 is one generation short. Q3_K_M fits, Q4_K_M does not (would need ~42 GB), and the quality drop at Q3 erases the parameter-count advantage over a well-tuned 32B Q5. Dual-5090 builds work but cross PCIe bandwidth (even at Gen5 x8/x8) cuts throughput on tensor-parallel inference by 25-40%.

If 70B-class quality is non-negotiable, the right architecture is either an RTX 6000 Ada / RTX PRO 6000 Blackwell (48-96 GB), or routing those queries to a hosted endpoint while keeping 32B local. The BestLLMfor public API (CC BY 4.0) covers the routing-rules side if you want a starting point.

Setting it up in 10 minutes

Install the latest NVIDIA driver (R570+ for full Blackwell support). On Linux, pin the .run installer rather than distro packages for kernel stability.
Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
Pull the model: ollama pull qwen3-coder:32b-instruct-q4_K_M
Verify VRAM: nvidia-smi — you should see ~22 GB resident after first generation.
Optional: install Open WebUI or wire into your IDE via the Ollama OpenAI-compatible endpoint at localhost:11434/v1.

For llama.cpp users who want max throughput, build with GGML_CUDA=1 GGML_CUDA_FA_ALL_QUANTS=1 and run with -fa -ctk q8_0 -ctv q8_0 -ngl 99.

Frequently asked questions

What's the absolute best local LLM for an RTX 5090?

For coding: Qwen3-Coder 32B at Q4_K_M. For general use: GLM-4.6 Air 32B at Q4_K_M. Both fit in 32 GB VRAM with 64K+ context and deliver 85-95 tok/s on llama.cpp with Flash Attention.

Can the RTX 5090 run 70B models?

Technically yes at Q3_K_M (~30 GB VRAM), but the Q3 quality loss makes a 32B Q5 model the better choice. For real 70B-class output, use an RTX PRO 6000 Blackwell, dual-5090, or route to a hosted API.

How does the RTX 5090 compare to the 4090 for local LLM inference?

About 70-85% faster on decode for the same model thanks to GDDR7's 1,792 GB/s bandwidth versus the 4090's 1,008 GB/s. More importantly, 32 GB VRAM lets you run 32B dense models at Q4_K_M with long context — the 4090 forces Q3 or short context on the same models.

Should I use Ollama or llama.cpp?

Ollama for daily use and easy model switching. llama.cpp directly if you want the last ~15% of throughput and finer control over KV cache quantization and Flash Attention flags.

Is Q4_K_M good enough, or should I go Q8?

Q4_K_M is the right default on a 5090. The perplexity gap to FP16 is typically under 1.5% for modern 32B models, and Q8 forces you to give up long context. Only go Q5_K_M if your domain is sensitive to small quality regressions.

How long until a 5090 pays for itself vs. Claude API?

Roughly 8-14 months for a single full-time developer running coding assistants on the local model. Faster if multiple users share the box, or if you bundle agent workloads. The cost calculator personalizes this against your actual usage.

Do I need to worry about the 12V-2x6 power connector?

Yes — seat it fully and use a clean, undamaged cable. Setting the power limit to 500W via nvidia-smi -pl 500 for continuous serving costs ~1.5% throughput and meaningfully reduces stress.

Final verdict

For 2026, the RTX 5090 is the best single-GPU local LLM card on the market, and it's not particularly close. The 32 GB VRAM threshold combined with GDDR7 bandwidth lands exactly where the open-weights frontier sits: 32B dense models at Q4_K_M with usable long context. Install Qwen3-Coder 32B Q4_K_M if you write code, GLM-4.6 Air 32B Q4_K_M if you don't, and revisit in 6 months when FP4 kernels mature. For everything outside what fits locally, our public dataset (CC BY 4.0) and the BestLLMfor editorial team keep the routing decisions honest.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

Amazon Check RTX 5070 Ti price →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.