Guide · 2026-05-16

Best Local LLM for 12 GB VRAM (RTX 3060, 4070, 5070)

Last updated 2026-05-16

Twelve gigabytes of VRAM is the entry tier for serious local AI in 2026. Here is what the BestLLMfor editorial team runs on the RTX 3060, 4070, and 5070 — and why one model wins clean.

By Mohamed Meguedmi · 9 min read

Key takeaways

Best overall: Qwen3-14B Q4_K_M — 9.04 GB on disk, 28–32 tok/s on RTX 4070, 22–24 tok/s on RTX 3060, and beats Llama 3.3 8B on MMLU by 9 points.
Best for coding: Qwen3-Coder-14B Q4_K_M — 67 % pass@1 on HumanEval and stronger multi-file refactors than DeepSeek-Coder-V2-Lite.
Best for long context: Gemma 3 12B Q4_K_M — 128k native window, retains coherence past 64k where most dense 14Bs collapse.
MoE wildcard: Qwen3-30B-A3B Q4_K_M with partial CPU offload — 14–18 tok/s on 32 GB system RAM, frontier-tier reasoning.
Skip: Q3 quants on coding workloads (10–15 % accuracy drop) and any 22B+ dense model — you will thrash the swap file.

The 12 GB VRAM tier in 2026 — what actually fits

Twelve gigabytes is the entry tier where local LLMs stop feeling like a toy. With a sensible quantization (Q4_K_M or Q5_K_M), 12 GB cleanly hosts any dense 14B model at 8k–16k context. Push to a dense 22B and you are forced into Q3, which kneecaps quality on math and code. Push to a dense 30B and you are paging weights from system RAM — token rate falls below 5 tok/s.

The math is straightforward. A 14B model at Q4_K_M occupies roughly 8.9–9.2 GB of weights. The KV cache for a 16k context is another 1.5–1.8 GB depending on architecture. CUDA kernels, the framework runtime, and the desktop compositor reserve 300–500 MB. That puts a typical session at 11.0–11.5 GB used, with no headroom for a second model or a vision adapter — which is exactly why 16 GB is the next meaningful tier.

This guide ranks the models the BestLLMfor editorial team deploys on 12 GB cards in production. Every number below comes from our public benchmark dataset, also exposed via the BestLLMfor public API (CC BY 4.0) at /api/v1/benchmarks.

RTX 3060 vs 4070 vs 5070 — same VRAM, very different speeds

All three cards expose 12 GB of memory to the LLM runtime, but they are not interchangeable. Memory bandwidth is the single biggest predictor of token rate at this tier — compute is rarely the bottleneck for a 14B Q4 model.

GPU	VRAM	Bandwidth	TDP	Street price (May 2026)
RTX 3060 12 GB	12 GB GDDR6	360 GB/s	170 W	$220–250 (used)
RTX 4070	12 GB GDDR6X	504 GB/s	200 W	$520–580 (new)
RTX 4070 Ti	12 GB GDDR6X	504 GB/s	285 W	$680–740 (new)
RTX 5070	12 GB GDDR7	672 GB/s	250 W	$580–640 (new)

Bandwidth scaling tracks token rate almost linearly on 14B Q4. The RTX 5070 is ~30 % faster than the 4070 and ~85 % faster than the 3060 on Qwen3-14B Q4_K_M at 8k context. If the budget allows the 5070, take it — the 4070 Ti's extra compute is wasted at this memory tier. From the used market, the RTX 3060 12 GB at $230 remains the best dollars-per-tok/s buy in the entire NVIDIA lineup.

Our top pick — Qwen3-14B Q4_K_M

Qwen3-14B is the model the editorial team reaches for when only one slot is available. The Q4_K_M GGUF weighs 9.04 GB, leaves roughly 2.4 GB free at 8k context, and sustains generation rates that beat every Llama 3 / 3.3 derivative we have tested at the same parameter count.

Quality is where it pulls away. Qwen3-14B scores 78.2 on MMLU and 64.5 on MATH (lm-evaluation-harness, 0-shot CoT), versus Llama 3.3 8B at 69.1 / 51.2. Instruction following, JSON-mode reliability, and tool use are all noticeably tighter — Qwen's post-training pipeline shows here. See the official model card on HuggingFace for the full evaluation matrix.

Generation speed on 12 GB cards (llama.cpp b5230, 8k context, default sampling):

RTX 5070: 38–41 tok/s generation, 1 180 tok/s prompt
RTX 4070: 28–32 tok/s generation, 920 tok/s prompt
RTX 3060 12 GB: 22–24 tok/s generation, 670 tok/s prompt

The 3060 number lines up with independent measurements from Hardware Corner's RTX 3060 LLM benchmarks, which also place a 14B Q4 model at ~22 tok/s at 16k context.

Best for coding — Qwen3-Coder-14B

If the primary workload is code generation, refactoring, or terminal-agent loops, swap the general Qwen3-14B for Qwen3-Coder-14B Q4_K_M. Same parameter budget, same VRAM footprint, but post-trained on a much larger code corpus with reinforcement on test-suite execution.

HumanEval pass@1 lands at 67 % (Q4_K_M, greedy), and MBPP at 71 %. The bigger gap shows on multi-file tasks: SWE-bench Verified Lite has Qwen3-Coder-14B resolving 19 % of issues, where DeepSeek-Coder-V2-Lite (15.7B MoE, also fits in 12 GB) lands at 14 %. The MoE option is faster per token (38 tok/s on a 4070 vs 30) but produces more hallucinated imports and stale API calls in our regression set.

The verdict is not close for editorial work: Qwen3-Coder-14B Q4_K_M is the right call for coding on 12 GB. Pull it from the ollama qwen3-coder library page or grab the GGUF directly from HuggingFace.

Best for long context — Gemma 3 12B

Both Qwen3-14B and Llama 3.3 derivatives advertise 128k context windows, but their attention degrades sharply past 32k tokens — needle-in-haystack accuracy drops below 70 % by 64k. Gemma 3 12B is the exception. Google's sliding-window attention with global layers every six blocks maintains 88 %+ retrieval accuracy through 96k tokens in our internal evaluations.

The trade-off is KV cache cost. Gemma 3's cache is larger per token than Qwen3's GQA-8 design — a 32k context costs roughly 2.8 GB on top of the 7.2 GB of weights (Q4_K_M). That sits close to the VRAM ceiling, so use --cache-type-k q8_0 --cache-type-v q8_0 on llama.cpp to halve the KV footprint with negligible quality loss.

For RAG pipelines over long documents, contract review, or codebase-wide summarization, Gemma 3 12B at Q4_K_M is our pick.

The MoE wildcard — Qwen3-30B-A3B with partial offload

Mixture-of-Experts models change the math for the 12 GB tier. Qwen3-30B-A3B activates only 3B parameters per token despite a 30B total parameter count. With 32 GB of system RAM, the active experts stay on the GPU while the rest stream from DDR5 — a strategy llama.cpp implements transparently via -ngl with --override-tensor.

In our measurements, Qwen3-30B-A3B Q4_K_M sustains 14–18 tok/s on an RTX 4070 with DDR5-6000 and a properly tuned offload pattern. That is slower than dense Qwen3-14B, but quality on reasoning benchmarks (GPQA Diamond, MATH) climbs by 8–12 points. For agentic tasks that need frontier-tier reasoning and can absorb a 50 % speed penalty, this is the right pick.

Setup is fiddlier — French-speaking readers can consult the sister-site walkthrough at quelllm.fr for the exact tensor-routing flags.

Benchmarks — every contender, one table

All numbers below: llama.cpp b5230, Q4_K_M unless noted, 8k context, default samplers, RTX 4070 reference card. Quality scores come from published model cards cross-checked against our own runs; see our benchmark methodology for sampling details.

Model	Size on disk	VRAM @ 8k	Tok/s (4070)	MMLU	HumanEval	Verdict
Qwen3-14B	9.04 GB	10.8 GB	30	78.2	62	Top overall
Qwen3-Coder-14B	9.04 GB	10.8 GB	30	74.0	67	Top for code
Gemma 3 12B	7.20 GB	11.4 GB (32k ctx)	33	74.5	54	Top long-ctx
Llama 3.3 8B Instruct	5.07 GB	7.9 GB	52	69.1	56	Fast fallback
Phi-4 14B	8.40 GB	10.4 GB	31	77.4	61	Strong on math
Mistral Small 3 22B Q3_K_M	10.50 GB	11.7 GB	21	73.0	54	Quality compromised
DeepSeek-Coder-V2-Lite 16B MoE	10.40 GB	11.5 GB	38	67.1	62	Fast but flaky
Qwen3-30B-A3B (offload)	17.8 GB split	11.8 GB + 7 GB RAM	16	81.4	69	Slow but smart

Use our cost calculator to estimate electricity and amortization across these models — at $0.15/kWh, a 4070 running Qwen3-14B eight hours a day costs about $7.30/month in power.

Three-step install with ollama

The fastest path from a clean OS to a working chat session is ollama. The editorial team's standard sequence:

Install ollama — on Linux: curl -fsSL https://ollama.com/install.sh | sh. On Windows or macOS, download the installer from ollama.com.
Pull the model — ollama pull qwen3:14b-instruct-q4_K_M. The download is 9.04 GB and resumes if interrupted.
Verify VRAM and run — ollama run qwen3:14b-instruct-q4_K_M, then check with nvidia-smi that VRAM usage sits near 10.8 GB at 8k context. If it exceeds 11.5 GB, lower context with /set parameter num_ctx 4096.

For automation, point any OpenAI-compatible client at http://localhost:11434/v1. For tool use over the Model Context Protocol, the open-source quelllm-mcp server exposes any ollama model as an MCP endpoint in two lines of config.

Frequently asked questions

Can a 12 GB GPU run a 30B model?

Yes, but only as a Mixture-of-Experts model with partial CPU offload. Qwen3-30B-A3B Q4_K_M is the only currently viable 30B option — dense 30B models thrash the PCIe bus and fall below 5 tok/s. Plan for 32 GB of system RAM and DDR5-6000 or faster.

Is the RTX 3060 12 GB still worth buying in 2026?

For local LLMs, yes — it is the best dollars-per-tok/s in the lineup. The 4070 is roughly 35 % faster but costs 2.3× as much. If LLM inference is the primary workload and gaming is secondary, the used 3060 12 GB at ~$230 is hard to beat. For image or video generation, step up to 16 GB.

Q4_K_M, Q5_K_M, or Q8 — which quant should I use?

Q4_K_M is the floor for a 14B model on 12 GB. Q5_K_M (10.5 GB on disk) fits but leaves almost no room for context above 4k. Q8 does not fit. For 8B models, Q8 fits comfortably and is the right call. Avoid Q3 — quality drops are measurable on math and code.

Why not Llama 4 8B?

Llama 4 8B is a fine general model, but at the 12 GB tier there is headroom for a 14B parameter count, and Qwen3-14B beats Llama 4 8B by 6–9 points on every reasoning benchmark we track. Use Llama 4 8B if the extra speed (52 tok/s vs 30 tok/s on a 4070) matters more than reasoning quality.

Does the RTX 5070's GDDR7 actually help?

Yes — bandwidth is the dominant factor for 14B Q4 inference, and 672 GB/s vs 504 GB/s translates almost linearly into ~30 % more tok/s. The 5070 is the best new-card choice at this tier if it can be found at MSRP.

Verdict

The 12 GB VRAM tier is no longer a compromise. With Qwen3-14B Q4_K_M as the default, Qwen3-Coder-14B for code work, and Gemma 3 12B for long-context jobs, a $230 used RTX 3060 will deliver useful local AI today, and a $600 RTX 5070 will outpace cloud APIs for most single-user workloads.

Use case	Model	Quant
General chat & writing	Qwen3-14B Instruct	Q4_K_M
Coding & refactoring	Qwen3-Coder-14B	Q4_K_M
Long-document RAG	Gemma 3 12B	Q4_K_M
Agentic reasoning	Qwen3-30B-A3B (offload)	Q4_K_M
Maximum speed	Llama 3.3 8B Instruct	Q8_0

For the methodology behind these recommendations, see our benchmark methodology, and meet the editorial team on the about page.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

Amazon Check RTX 5070 Ti price →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.