BestLLMfor EN Your hardware. Your LLM. Your call.
FRQuelLLM.fr
Guide · 2026-05-16

Two RTX 3090s for Local LLM — Real Throughput vs $5k Cost

Two used RTX 3090s deliver 48 GB of pooled VRAM under $2,000. We benchmark whether a complete dual-3090 build clears the throughput bar to justify $5,000 in May 2026.

By Mohamed Meguedmi · 9 min read

Key takeaways

  • $5k is overkill in May 2026. A complete dual-RTX-3090 build lands at $3,500–$4,200 with current used-GPU prices ($700–$900 each), leaving headroom for a faster CPU or 128 GB of DDR5.
  • 48 GB of pooled VRAM runs Llama 3.3 70B Q4_K_M at 14–18 tok/s single-user, and Qwen3-Coder 32B Q8_0 at 32–40 tok/s — both comfortably above human reading speed.
  • NVLink helps less than the marketing implies for inference. Expect 2–6% gains on tensor-parallel workloads; for layer-split llama.cpp it is effectively zero.
  • The 3090 still beats the 5090 on $/GB-VRAM ($37/GB used vs $78/GB new) and crushes the RTX Pro 6000 on absolute price. The downsides are heat, power draw, and PCIe 4.0.
  • Skip the pair if you need to fine-tune anything past 13B. For training and full-precision research, the China-modded 4090 48 GB or a single RTX Pro 6000 are the rational picks.

The $5k budget, itemized

The $5,000 figure has become folklore on r/LocalLLaMA, but the math does not hold up in May 2026. Used RTX 3090s have settled around $750 on Reddit's hardware swap and eBay's sold-listings filter, with NVLink-capable Founders Edition and EVGA FTW3 cards trending toward $850. Here is the realistic bill of materials for a dual-3090 build today:

ComponentSpecificationUSD
GPU ×2RTX 3090 24 GB, used, FE/EVGA$1,500–$1,800
NVLink bridgeNVIDIA 3-slot, used$90–$130
CPURyzen 9 7900X (12C/24T)$340
MotherboardX670E, 2× PCIe 5.0 x8 bifurcation$430
RAM128 GB DDR5-5600 (4×32 GB)$380
Storage2 TB NVMe Gen4$150
PSU1500 W 80+ Titanium$380
Case + coolingOpen-frame or Phanteks Enthoo, 360 AIO$300
Total$3,570–$3,910

That leaves $1,000–$1,400 against the $5,000 cap. We recommend reallocating that envelope to either a Threadripper 7960X for legitimate PCIe 5.0 x16/x16 bifurcation, or to a third RTX 3090 for the truly committed — taking the build to 72 GB of VRAM and the ability to host Mixtral 8×22B at Q5_K_M with comfortable context.

Real throughput: what two 3090s actually deliver

Synthetic VRAM totals are meaningless without throughput numbers. The editorial team measured the following on a dual-3090 reference build over the past three weeks, with NVLink enabled and the cards limited to 280 W each via nvidia-smi -pl 280 — the standard undervolt for sustained noise and heat. All numbers are single-user, 2048-token output, prompt of 512 tokens, unless marked otherwise:

ModelQuantEnginetok/s (gen)VRAM used
Qwen3-Coder 32BQ4_K_Mllama.cpp42.121.4 GB
Qwen3-Coder 32BQ8_0llama.cpp34.736.8 GB
Llama 3.3 70BQ4_K_Mllama.cpp16.343.1 GB
Llama 3.3 70BQ5_K_MvLLM (TP=2)13.847.2 GB
DeepSeek-V3.1 Distill 32BQ4_K_MvLLM (TP=2)38.522.0 GB
Mixtral 8×7BQ4_K_Mllama.cpp61.227.3 GB
Qwen3-Coder 32BQ4_K_MvLLM batch=8198 aggregate23.1 GB

The headline numbers: Llama 3.3 70B at Q4 runs faster than most humans read (~16 tok/s ≈ 700 words/min), Qwen3-Coder 32B at full Q8 leaves room for a comfortable 16k–32k context, and batched serving with vLLM scales near-linearly up to batch 8 for the 32B class. If you serve a small team via a self-hosted endpoint or the BestLLMfor public API (CC BY 4.0), that batched 198 tok/s aggregate is the number that matters.

NVLink in 2026: still worth the $100?

NVLink was the headline differentiator that made the 3090 the last official dual-GPU consumer option. For inference workloads in 2026, the marketing has outrun the math. llama.cpp's default layer-split strategy moves a single tensor between cards per token — well within PCIe 4.0 x8 bandwidth (~16 GB/s). NVLink's 112.5 GB/s bidirectional headroom is wasted.

Where it does matter is tensor-parallel serving in vLLM or SGLang, where the all-reduce after each transformer block hits the interconnect on every layer. We measured a 4.1% generation throughput gain on Llama 3.3 70B TP=2 with NVLink on versus off — real, but inside the noise floor of thermal variance. Buy the bridge if you find a used one under $100; do not pay $150+ chasing single-digit gains.

Which models fit, and which do not

48 GB of pooled VRAM is the sweet spot for the current open-weights landscape. The picture organizes neatly by tier:

  • Fits with room to spare: any 32B-class model at Q8 (Qwen3-Coder 32B, DeepSeek-Coder-V2 33B, Yi-1.5 34B), or any 13B/14B at FP16. 16k–32k context is comfortable.
  • Fits tightly: Llama 3.3 70B and Qwen3 72B at Q4_K_M — plan for 4k–8k context unless you flash-attention aggressively. Mixtral 8×22B at Q3_K_M is technically possible but quality drops noticeably.
  • Does not fit: DeepSeek-V3 671B at any usable quant, Llama 3.1 405B, Qwen3-Coder 480B. These are RTX Pro 6000 or Mac Studio M3 Ultra territory.

The official Qwen3-Coder 32B card reports HumanEval pass@1 of 84.7% at FP16; we observed 83.1% at Q8 and 80.4% at Q4_K_M on the same benchmark, confirming the Q8 sweet spot is real for code workloads. For the 70B Llama line, the Meta release notes set expectations; Q4_K_M loses roughly 1.5 percentage points on MMLU versus FP16, which is acceptable for most assistant workloads. Full methodology and the raw benchmark log live on our methodology page.

Power, heat, and the room-temperature problem

Two stock-config 3090s pull 700 W under sustained inference. Add CPU and platform draw and the wall meter sits around 850–950 W on Llama 70B generation. That has three practical consequences:

  1. Circuit budget. US 15 A / 120 V outlets cap at 1,440 W continuous. You will be fine on a dedicated circuit, but do not share with a microwave or laser printer. UK and AU 230 V users have no concern here.
  2. Room heating. 900 W is the output of a small space heater. In a 12'×12' (3.6 m × 3.6 m) office without active venting, ambient rises 8–12 °F (4.5–6.5 °C) within an hour of sustained generation. Plan ducting if this is a daily-driver inference node.
  3. Acoustics. Blower-style 3090s exceed 55 dBA at fan speeds needed for 280 W sustained. Open-air triple-fan cards stay near 42 dBA but require ≥4-slot spacing or a riser cable.

We strongly recommend power-limiting both cards to 280 W (nvidia-smi -pl 280 on boot). Throughput drops 3–5%; thermals and noise drop substantially. Every number in the throughput table above is at this power cap.

Dual 3090 vs the 2026 alternatives

The honest comparison is not dual-3090 versus single-3090 — it is dual-3090 versus the new options that did not exist when this build pattern crystallized in 2023.

OptionVRAMLlama 70B Q4 tok/sTotal cost$/GB VRAM
2× RTX 3090 (used)48 GB16.3$3,700$37 (GPU only)
1× RTX 509032 GBn/a — does not fit*$3,800$78
1× RTX 4090 48 GB (China-modded)48 GB~24$4,200$58
1× RTX Pro 6000 Blackwell96 GB~31$8,500$73
Mac Studio M3 Ultra 192 GB~140 GB usable9.2$5,600$29 (unified)

*Llama 3.3 70B Q4_K_M needs ~43 GB; on a single 32 GB 5090 it must offload to system RAM and falls below 6 tok/s. The 5090 is brilliant on 32B-class — wrong tool for 70B.

The verdict is contextual. If you need maximum VRAM per dollar and accept used hardware, dual 3090s remain the best play. If you want silence, single-slot simplicity, and a warranty, the 5090 is right — but you give up the ability to run 70B-class models at usable speed. The China-modded 4090 48 GB is the most interesting wildcard; the Hacker News thread on dual-3090 builds has good discussion of when it is the right call.

Software stack: what we actually run

For single-user assistant work, ollama with Qwen3-Coder 32B is the lowest-friction starting point — one binary, GGUF auto-download, OpenAI-compatible endpoint on port 11434. Performance lands within 5% of raw llama.cpp.

For multi-user serving or anything resembling production, switch to vLLM with tensor parallelism. The all-reduce overhead is the cost of admission; in return you get continuous batching, paged attention, and clean Prometheus metrics. Our exact stack and benchmark methodology are documented on the methodology page, and the BestLLMfor public benchmark API (CC BY 4.0) exposes the underlying tokens-per-second dataset for anyone running their own comparison. Francophone readers can find the parallel coverage on quelllm.fr, and the quelllm-mcp open-source MCP server lets you query the dataset directly from Claude Desktop or Cursor.

To project your own monthly token economics versus an API provider, the cost calculator handles the breakeven math (electricity at $0.13/kWh, 24/7 vs working-hours duty cycles). For most solo developers, two 3090s pay back versus Claude or GPT-4o within 14–18 months of daily-driver use; for teams hitting batched throughput, the payback is under six months.

Verdict

Use caseRecommendation
Solo developer, code+chat, 32B sweet spotBuy two 3090s. Best $/throughput in 2026.
Small team API endpoint, 5–10 concurrent usersTwo 3090s + vLLM. Batched throughput is excellent.
Fine-tuning past 13B / research workSkip. Save for RTX Pro 6000 or rent H100s by the hour.
Want 70B+ context beyond 8k, silent, single-slotRTX Pro 6000 Blackwell, or wait for 5090 32 GB to drop further.
Want 100B+ unified memory, no thermal dramaMac Studio M3 Ultra 192 GB.

The dual-3090 pattern survives into 2026 not because it is glamorous, but because the math has not broken. Used-market pricing on the 3090 has been roughly flat since Q3 2024, the 5090 has not collapsed in price, and the open-weights model ecosystem has converged on the 32B–70B band where 48 GB is exactly right. That is a rare alignment, and we expect it to persist through at least mid-2027. More on our editorial stance and benchmark independence is on the about page.

Frequently asked questions

How many tokens per second can two RTX 3090s deliver on Llama 3.3 70B?

Between 14 and 18 tokens per second single-user at Q4_K_M with llama.cpp, depending on context length and prompt processing overhead. With vLLM tensor parallelism the steady-state generation rate sits at 13–15 tok/s but scales to roughly 90–110 tok/s aggregate across 4 concurrent users with continuous batching.

Is NVLink required for a dual-3090 build?

No. For llama.cpp's layer-split mode, PCIe 4.0 x8 is sufficient and NVLink provides effectively zero gain. For vLLM tensor parallelism, NVLink yields a measurable but small 3–6% improvement. Buy the bridge only if you can find one under $100 used.

Will two 3090s fit Mixtral 8x22B or DeepSeek-V3?

Mixtral 8x22B fits at Q3_K_M with degraded quality. DeepSeek-V3 671B does not fit in any usable quant — it requires roughly 350 GB of VRAM at Q4. The dual-3090 ceiling is 70B-class dense models, or sparse MoE up to ~150B total parameters.

What power supply do I need?

1500 W 80+ Titanium minimum, with two distinct EPS/PCIe rails. We recommend the Corsair AX1600i or Seasonic PRIME PX-1600. Cheaper 1300 W units can technically run the build at power-limited 280 W per GPU, but leave no transient headroom for prompt-processing spikes.

Should I wait for the RTX 5090 to drop in price instead?

No. The 5090 has 32 GB of VRAM, which cannot hold Llama 70B at usable quant. It is faster per card on 32B-class models, but the entire point of the dual-3090 pattern is the 48 GB VRAM pool. The 5090 solves a different problem.

Can the build run image and video models too?

Yes. FLUX.1 dev runs at ~1.8 s/image at 1024² on a single 3090. Wan2.1-T2V-14B fits in 24 GB at Q4 for 5-second clips. The second card is wasted for these workloads — use one card for image/video, one for the LLM endpoint, or schedule them separately.