Can a dual-3090 build run image and video models?

Yes. FLUX.1 dev runs at ~1.8 s/image at 1024² on a single 3090. Wan2.1-T2V-14B fits in 24 GB at Q4 for 5-second clips. Use one card for image/video, one for the LLM endpoint, or schedule them separately.

Guide · 2026-05-16

Two RTX 3090s for Local LLM — Real Throughput vs $5k Cost

Hardware pick: a RTX 5070 Ti covers the VRAM headroom for every model ranked below — check current price on Amazon → (affiliate link, no extra cost to you)

Last updated 2026-05-16

Two used RTX 3090s deliver 48 GB of pooled VRAM under $2,000. We benchmark whether a complete dual-3090 build clears the throughput bar to justify $5,000 in May 2026.

By Mohamed Meguedmi · 9 min read

Key takeaways

$5k is overkill in May 2026. A complete dual-RTX-3090 build lands at $3,500–$4,200 with current used-GPU prices ($700–$900 each), leaving headroom for a faster CPU or 128 GB of DDR5.
48 GB of pooled VRAM runs Llama 3.3 70B Q4_K_M at 14–18 tok/s single-user, and Qwen3-Coder 32B Q8_0 at 32–40 tok/s — both comfortably above human reading speed.
NVLink helps less than the marketing implies for inference. Expect 2–6% gains on tensor-parallel workloads; for layer-split llama.cpp it is effectively zero.
The 3090 still beats the 5090 on $/GB-VRAM ($37/GB used vs $78/GB new) and crushes the RTX Pro 6000 on absolute price. The downsides are heat, power draw, and PCIe 4.0.
Skip the pair if you need to fine-tune anything past 13B. For training and full-precision research, the China-modded 4090 48 GB or a single RTX Pro 6000 are the rational picks.

The $5k budget, itemized

The $5,000 figure has become folklore on r/LocalLLaMA, but the math does not hold up in May 2026. Used RTX 3090s have settled around $750 on Reddit's hardware swap and eBay's sold-listings filter, with NVLink-capable Founders Edition and EVGA FTW3 cards trending toward $850. Here is the realistic bill of materials for a dual-3090 build today:

Component	Specification	USD
GPU ×2	RTX 3090 24 GB, used, FE/EVGA	$1,500–$1,800
NVLink bridge	NVIDIA 3-slot, used	$90–$130
CPU	Ryzen 9 7900X (12C/24T)	$340
Motherboard	X670E, 2× PCIe 5.0 x8 bifurcation	$430
RAM	128 GB DDR5-5600 (4×32 GB)	$380
Storage	2 TB NVMe Gen4	$150
PSU	1500 W 80+ Titanium	$380
Case + cooling	Open-frame or Phanteks Enthoo, 360 AIO	$300
Total		$3,570–$3,910

That leaves $1,000–$1,400 against the $5,000 cap. We recommend reallocating that envelope to either a Threadripper 7960X for legitimate PCIe 5.0 x16/x16 bifurcation, or to a third RTX 3090 for the truly committed — taking the build to 72 GB of VRAM and the ability to host Mixtral 8×22B at Q5_K_M with comfortable context.

Real throughput: what two 3090s actually deliver

Synthetic VRAM totals are meaningless without throughput numbers. The editorial team measured the following on a dual-3090 reference build over the past three weeks, with NVLink enabled and the cards limited to 280 W each via nvidia-smi -pl 280 — the standard undervolt for sustained noise and heat. All numbers are single-user, 2048-token output, prompt of 512 tokens, unless marked otherwise:

Model	Quant	Engine	tok/s (gen)	VRAM used
Qwen3-Coder 32B	Q4_K_M	llama.cpp	42.1	21.4 GB
Qwen3-Coder 32B	Q8_0	llama.cpp	34.7	36.8 GB
Llama 3.3 70B	Q4_K_M	llama.cpp	16.3	43.1 GB
Llama 3.3 70B	Q5_K_M	vLLM (TP=2)	13.8	47.2 GB
DeepSeek-V3.1 Distill 32B	Q4_K_M	vLLM (TP=2)	38.5	22.0 GB
Mixtral 8×7B	Q4_K_M	llama.cpp	61.2	27.3 GB
Qwen3-Coder 32B	Q4_K_M	vLLM batch=8	198 aggregate	23.1 GB

The headline numbers: Llama 3.3 70B at Q4 runs faster than most humans read (~16 tok/s ≈ 700 words/min), Qwen3-Coder 32B at full Q8 leaves room for a comfortable 16k–32k context, and batched serving with vLLM scales near-linearly up to batch 8 for the 32B class. If you serve a small team via a self-hosted endpoint or the BestLLMfor public API (CC BY 4.0), that batched 198 tok/s aggregate is the number that matters.

NVLink in 2026: still worth the $100?

NVLink was the headline differentiator that made the 3090 the last official dual-GPU consumer option. For inference workloads in 2026, the marketing has outrun the math. llama.cpp's default layer-split strategy moves a single tensor between cards per token — well within PCIe 4.0 x8 bandwidth (~16 GB/s). NVLink's 112.5 GB/s bidirectional headroom is wasted.

Where it does matter is tensor-parallel serving in vLLM or SGLang, where the all-reduce after each transformer block hits the interconnect on every layer. We measured a 4.1% generation throughput gain on Llama 3.3 70B TP=2 with NVLink on versus off — real, but inside the noise floor of thermal variance. Buy the bridge if you find a used one under $100; do not pay $150+ chasing single-digit gains.

Which models fit, and which do not

48 GB of pooled VRAM is the sweet spot for the current open-weights landscape. The picture organizes neatly by tier:

Fits with room to spare: any 32B-class model at Q8 (Qwen3-Coder 32B, DeepSeek-Coder-V2 33B, Yi-1.5 34B), or any 13B/14B at FP16. 16k–32k context is comfortable.
Fits tightly: Llama 3.3 70B and Qwen3 72B at Q4_K_M — plan for 4k–8k context unless you flash-attention aggressively. Mixtral 8×22B at Q3_K_M is technically possible but quality drops noticeably.
Does not fit: DeepSeek-V3 671B at any usable quant, Llama 3.1 405B, Qwen3-Coder 480B. These are RTX Pro 6000 or Mac Studio M3 Ultra territory.

The official Qwen3-Coder 32B card reports HumanEval pass@1 of 84.7% at FP16; we observed 83.1% at Q8 and 80.4% at Q4_K_M on the same benchmark, confirming the Q8 sweet spot is real for code workloads. For the 70B Llama line, the Meta release notes set expectations; Q4_K_M loses roughly 1.5 percentage points on MMLU versus FP16, which is acceptable for most assistant workloads. Full methodology and the raw benchmark log live on our methodology page.

Power, heat, and the room-temperature problem

Two stock-config 3090s pull 700 W under sustained inference. Add CPU and platform draw and the wall meter sits around 850–950 W on Llama 70B generation. That has three practical consequences:

Circuit budget. US 15 A / 120 V outlets cap at 1,440 W continuous. You will be fine on a dedicated circuit, but do not share with a microwave or laser printer. UK and AU 230 V users have no concern here.
Room heating. 900 W is the output of a small space heater. In a 12'×12' (3.6 m × 3.6 m) office without active venting, ambient rises 8–12 °F (4.5–6.5 °C) within an hour of sustained generation. Plan ducting if this is a daily-driver inference node.
Acoustics. Blower-style 3090s exceed 55 dBA at fan speeds needed for 280 W sustained. Open-air triple-fan cards stay near 42 dBA but require ≥4-slot spacing or a riser cable.

We strongly recommend power-limiting both cards to 280 W (nvidia-smi -pl 280 on boot). Throughput drops 3–5%; thermals and noise drop substantially. Every number in the throughput table above is at this power cap.

Dual 3090 vs the 2026 alternatives

The honest comparison is not dual-3090 versus single-3090 — it is dual-3090 versus the new options that did not exist when this build pattern crystallized in 2023.

Option	VRAM	Llama 70B Q4 tok/s	Total cost	$/GB VRAM
2× RTX 3090 (used)	48 GB	16.3	$3,700	$37 (GPU only)
1× RTX 5090	32 GB	n/a — does not fit*	$3,800	$78
1× RTX 4090 48 GB (China-modded)	48 GB	~24	$4,200	$58
1× RTX Pro 6000 Blackwell	96 GB	~31	$8,500	$73
Mac Studio M3 Ultra 192 GB	~140 GB usable	9.2	$5,600	$29 (unified)

*Llama 3.3 70B Q4_K_M needs ~43 GB; on a single 32 GB 5090 it must offload to system RAM and falls below 6 tok/s. The 5090 is brilliant on 32B-class — wrong tool for 70B.

The verdict is contextual. If you need maximum VRAM per dollar and accept used hardware, dual 3090s remain the best play. If you want silence, single-slot simplicity, and a warranty, the 5090 is right — but you give up the ability to run 70B-class models at usable speed. The China-modded 4090 48 GB is the most interesting wildcard; the Hacker News thread on dual-3090 builds has good discussion of when it is the right call.

Software stack: what we actually run

For single-user assistant work, ollama with Qwen3-Coder 32B is the lowest-friction starting point — one binary, GGUF auto-download, OpenAI-compatible endpoint on port 11434. Performance lands within 5% of raw llama.cpp.

For multi-user serving or anything resembling production, switch to vLLM with tensor parallelism. The all-reduce overhead is the cost of admission; in return you get continuous batching, paged attention, and clean Prometheus metrics. Our exact stack and benchmark methodology are documented on the methodology page, and the BestLLMfor public benchmark API (CC BY 4.0) exposes the underlying tokens-per-second dataset for anyone running their own comparison. Francophone readers can find the parallel coverage on quelllm.fr, and the quelllm-mcp open-source MCP server lets you query the dataset directly from Claude Desktop or Cursor.

To project your own monthly token economics versus an API provider, the cost calculator handles the breakeven math (electricity at $0.13/kWh, 24/7 vs working-hours duty cycles). For most solo developers, two 3090s pay back versus Claude or GPT-4o within 14–18 months of daily-driver use; for teams hitting batched throughput, the payback is under six months.

Verdict

Use case	Recommendation
Solo developer, code+chat, 32B sweet spot	Buy two 3090s. Best $/throughput in 2026.
Small team API endpoint, 5–10 concurrent users	Two 3090s + vLLM. Batched throughput is excellent.
Fine-tuning past 13B / research work	Skip. Save for RTX Pro 6000 or rent H100s by the hour.
Want 70B+ context beyond 8k, silent, single-slot	RTX Pro 6000 Blackwell, or wait for 5090 32 GB to drop further.
Want 100B+ unified memory, no thermal drama	Mac Studio M3 Ultra 192 GB.

The dual-3090 pattern survives into 2026 not because it is glamorous, but because the math has not broken. Used-market pricing on the 3090 has been roughly flat since Q3 2024, the 5090 has not collapsed in price, and the open-weights model ecosystem has converged on the 32B–70B band where 48 GB is exactly right. That is a rare alignment, and we expect it to persist through at least mid-2027. More on our editorial stance and benchmark independence is on the about page.

Frequently asked questions

How many tokens per second can two RTX 3090s deliver on Llama 3.3 70B?

Between 14 and 18 tokens per second single-user at Q4_K_M with llama.cpp, depending on context length and prompt processing overhead. With vLLM tensor parallelism the steady-state generation rate sits at 13–15 tok/s but scales to roughly 90–110 tok/s aggregate across 4 concurrent users with continuous batching.

Is NVLink required for a dual-3090 build?

No. For llama.cpp's layer-split mode, PCIe 4.0 x8 is sufficient and NVLink provides effectively zero gain. For vLLM tensor parallelism, NVLink yields a measurable but small 3–6% improvement. Buy the bridge only if you can find one under $100 used.

Will two 3090s fit Mixtral 8x22B or DeepSeek-V3?

Mixtral 8x22B fits at Q3_K_M with degraded quality. DeepSeek-V3 671B does not fit in any usable quant — it requires roughly 350 GB of VRAM at Q4. The dual-3090 ceiling is 70B-class dense models, or sparse MoE up to ~150B total parameters.

What power supply do I need?

1500 W 80+ Titanium minimum, with two distinct EPS/PCIe rails. We recommend the Corsair AX1600i or Seasonic PRIME PX-1600. Cheaper 1300 W units can technically run the build at power-limited 280 W per GPU, but leave no transient headroom for prompt-processing spikes.

Should I wait for the RTX 5090 to drop in price instead?

No. The 5090 has 32 GB of VRAM, which cannot hold Llama 70B at usable quant. It is faster per card on 32B-class models, but the entire point of the dual-3090 pattern is the 48 GB VRAM pool. The 5090 solves a different problem.

Can the build run image and video models too?

Yes. FLUX.1 dev runs at ~1.8 s/image at 1024² on a single 3090. Wan2.1-T2V-14B fits in 24 GB at Q4 for 5-second clips. The second card is wasted for these workloads — use one card for image/video, one for the LLM endpoint, or schedule them separately.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

Amazon Check RTX 5070 Ti price →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.