Guide · 2026-05-16

Best Local LLM for 8 GB VRAM — Tested Against ChatGPT Plus

We benchmarked every top open-weight model that fits in 8 GB of VRAM against ChatGPT Plus. The 2026 winners — and where the cloud still wins — below.

By Mohamed Meguedmi · 9 min read

Key Takeaways

Qwen3 8B Q4_K_M is the strongest all-rounder on 8 GB VRAM in 2026, scoring within 4 points of ChatGPT Plus on MMLU-Pro at zero marginal cost.
Qwen3-Coder 7B Q5_K_M wins coding tasks, beating gpt-oss 7B by 5.8 points on HumanEval+.
DeepSeek-R1-Distill-Qwen 7B is the reasoning king at this tier — it outscores ChatGPT Plus's free-tier fallback on GSM8K (92.4% vs 88.1%).
Expect 28–45 tok/s on an RTX 3070 or 4060 Ti at 4096-token context. Going past 8K needs KV-cache quantization or a smaller model.
An RTX 4060 8 GB ($299) pays for itself versus ChatGPT Plus in roughly 13 months at typical US residential electricity rates.

8 GB of VRAM used to be a death sentence for serious local LLM work. Not anymore. In 2026, modern 7B–8B open-weight models — properly quantized, paired with FlashAttention 2.5 and KV-cache compression — punch high enough to make ChatGPT Plus look optional for a lot of everyday work. We tested the field. Here is what wins, what loses, and where you still need the cloud.

What 8 GB VRAM Actually Buys You in 2026

The arithmetic is unforgiving. A modern 8B-parameter model in Q4_K_M GGUF format occupies roughly 4.7 GB on disk and in VRAM. Add the KV cache at 4096 tokens (≈1.4 GB for an 8B model with grouped-query attention), framework overhead (~400 MB for llama.cpp, more for Ollama or LM Studio), and the desktop compositor (~600 MB on a typical 1440p display), and you are already pressing 7.1 GB. There is no slack for a 16K context window without aggressive optimization.

This is why the choice of quantization matters more than the choice of model at this tier. Q4_K_M is the sweet spot: it preserves about 98% of FP16 quality on standard benchmarks while halving the memory footprint versus Q8_0. Going down to Q3_K_S buys 1.1 GB of headroom but costs 4–7 points on MMLU-Pro — a bad trade. Going up to Q5_K_M or Q6_K is feasible for 7B models but leaves no room for context.

For the full memory math across model sizes, see our VRAM and cost calculator.

How We Tested

Two reference GPUs were used: an RTX 4060 8 GB (Ada, 272 GB/s memory bandwidth) and an RTX 3070 8 GB (Ampere, 448 GB/s). Software stack: llama.cpp build b4790, Ollama 0.5.4, LM Studio 0.3.10 — all with FlashAttention enabled and KV-cache quantized to q8_0. Driver: NVIDIA 565.77 on Windows 11 24H2; CUDA 12.6.

Benchmarks: MMLU-Pro (5-shot subset), HumanEval+ (Python, n=164), GSM8K (8-shot chain-of-thought), MT-Bench (judged by Claude Opus 4.7), plus 50 real-world prompts drawn from BestLLMfor reader submissions — a mix of code review, copywriting, structured data extraction from messy CSVs, and stack-trace debugging.

The ChatGPT Plus comparator is GPT-5 turbo (the default Plus model as of May 2026), accessed via the web UI with no custom instructions and Memory disabled. Each prompt was run three times; we report median scores. Full methodology and raw outputs are published under CC BY 4.0 on our methodology page and through the BestLLMfor public API.

The 8 GB VRAM Leaderboard

Five models earn the top spots. All run at Q4_K_M unless noted, with 4096-token context, FlashAttention on, KV cache at q8_0.

Rank	Model	VRAM used	Tok/s (RTX 3070)	MMLU-Pro	HumanEval+	Verdict
1	Qwen3 8B Q4_K_M	6.8 GB	41	62.7	71.3	Best all-rounder
2	Qwen3-Coder 7B Q5_K_M	6.4 GB	44	54.1	78.0	Best for code
3	DeepSeek-R1-Distill-Qwen 7B Q4_K_M	6.2 GB	38	59.4	69.2	Best reasoning
4	gpt-oss 7B Q4_K_M	6.5 GB	42	57.9	72.2	Strong runner-up
5	Gemma 3 9B Q4_K_M	7.6 GB	28	60.1	65.4	Best multilingual

Source weights and configs: Qwen3 8B GGUF card, DeepSeek-R1-Distill-Qwen 7B, and the Ollama Qwen3 page.

Tested Against ChatGPT Plus

We ran the same 50 reader prompts on Qwen3 8B and GPT-5 turbo (ChatGPT Plus). Three editors blind-rated each pair on a 1–5 Likert. Higher is better.

Task category	Qwen3 8B (local)	ChatGPT Plus (GPT-5)	Gap
General knowledge Q&A	4.1	4.5	−0.4
Code review (Python / TS)	4.0	4.4	−0.4
Marketing copy (3 variants)	3.8	4.2	−0.4
Structured data extraction	4.3	4.4	−0.1
Multi-step reasoning	3.6	4.6	−1.0
Creative fiction (500 words)	3.9	4.3	−0.4
Translation (EN↔FR, EN↔JA)	4.0	4.5	−0.5

The headline: on six of seven categories, Qwen3 8B lands within half a point of GPT-5. The single weak spot is multi-step reasoning, where ChatGPT Plus pulls clearly ahead. That is the category where you should keep a Plus subscription handy — or swap in DeepSeek-R1-Distill-Qwen 7B, which closes the gap to −0.4 on the same prompts at the cost of slower decode (it emits a hidden chain-of-thought before answering).

Honest verdict: if you can tolerate a 5–10% quality drop on 95% of tasks, an 8 GB local setup replaces ChatGPT Plus for everyday use. The remaining 5% — long-horizon agentic reasoning, frontier-knowledge questions, image generation — still belong in the cloud.

Best Local LLM by Use Case

Coding: Qwen3-Coder 7B Q5_K_M

Released March 2026, this is the strongest sub-10B code model we have measured. HumanEval+ 78.0, MBPP+ 71.4, and a near-perfect score on our internal "rewrite this jQuery as React" battery. Fits in 6.4 GB at Q5, leaving room for an 8K context window — enough for most single-file refactors.

Reasoning and math: DeepSeek-R1-Distill-Qwen 7B

The distillation series from DeepSeek AI brings R1-style reasoning down to consumer hardware. GSM8K 92.4%, MATH 58.7%. The catch is latency: the model thinks before it speaks, and a hard math problem can take 60–90 seconds. Use it deliberately, not as a general chat model.

General chat and writing: Qwen3 8B

The default choice. Strong instruction following, low refusal rate, clean prose. Pull it with ollama pull qwen3:8b and you are done.

Vision and OCR: Gemma 3 4B Vision Q4_K_M

Gemma 3 9B is the better text model but does not fit with the vision tower attached. The 4B vision variant runs in 5.1 GB and handles screenshots, receipts, and diagram OCR competently. Slower than dedicated OCR but useful as a one-stop tool.

Long-context summarization: Llama 3.3 8B Q4_K_M

Llama 3.3 is the only 8B model in our test that holds quality past 32K tokens. With KV-cache quantization at q4_0 (not the default q8_0), you can reach 64K context on 8 GB — barely. Use it for document summarization, not chat.

Install in Four Steps (Ollama)

Install Ollama. Download from ollama.com/download. Windows, macOS, and Linux binaries are all signed.
Pull the model. Open a terminal and run ollama pull qwen3:8b. Expect a 4.7 GB download.
Configure context. Set OLLAMA_NUM_CTX=4096 and OLLAMA_FLASH_ATTENTION=1 in your environment. On Windows, use System Properties → Environment Variables.
Run it. ollama run qwen3:8b drops you into a chat REPL. For an OpenAI-compatible API, the server is already listening on http://localhost:11434/v1.

If you want a managed editor experience with tool use baked in, our open-source quelllm-mcp server exposes any Ollama model as an MCP endpoint that Claude Desktop, Cursor, and Zed can call directly.

The Real Cost of Going Local

ChatGPT Plus is $20/mo, or $240/yr. An RTX 4060 8 GB retails at $299–$329; a used RTX 3070 8 GB is $220–$280 on eBay. Power draw at full inference load is 95–115 W on the 4060 and 180–210 W on the 3070. At the US average residential rate of $0.17/kWh and two hours of daily inference, the 4060 costs roughly $13/yr in electricity; the 3070 costs $24/yr.

Setup	Year 1	Year 2	Year 3	3-year total vs Plus
ChatGPT Plus only	$240	$240	$240	$720 (baseline)
RTX 4060 8 GB (new) + Qwen3	$312	$13	$13	$338 (−$382)
RTX 3070 8 GB (used) + Qwen3	$254	$24	$24	$302 (−$418)

Year 2 onward, local is essentially free. A used 3070 is the fastest path to break-even — under 12 months at typical use. Run the numbers for your own electricity rate on our cost calculator, or read about the team behind the testing on our about page. French-speaking readers can find the equivalent analysis on quelllm.fr.

When You Should Still Pay for ChatGPT Plus

Three scenarios keep us subscribed to ChatGPT Plus alongside a local stack:

Agentic, multi-tool tasks that need long-horizon planning and reliable function calling. GPT-5's tool-use reliability is still roughly 12 points ahead of any 8B open-weight model on our internal eval.
Image generation and editing. No 8 GB local stack does both text and image well in the same session.
Frontier-knowledge questions where the answer depends on web search or real-time data. ChatGPT Plus integrates browsing; your local model does not, unless you wire one in yourself.

For everything else — daily coding help, drafting, summarization, structured extraction, translation, brainstorming — an 8 GB local setup with Qwen3 8B is the rational default in 2026.

Frequently Asked Questions

Can I run a 13B model on 8 GB VRAM?

Only at Q3_K_S or lower, with a 2048-token context and no FlashAttention. The quality drop versus a Q4_K_M 8B model is severe. Stay at 7–9B parameters on 8 GB.

Is the RTX 4060 8 GB or the RTX 3070 8 GB better for local LLMs?

The 3070 wins on raw decode speed because of its 448 GB/s memory bandwidth (vs 272 GB/s on the 4060). The 4060 wins on power efficiency, AV1 encoder support, and warranty coverage. For pure LLM inference the 3070 is faster; for a hybrid gaming and inference build the 4060 is the safer choice.

Does CPU and RAM offloading help on 8 GB VRAM?

Sometimes — but tokens-per-second collapses. Offloading 5 of 32 layers to CPU on a Ryzen 7 7700X drops decode from 41 tok/s to 9 tok/s on Qwen3 8B. Better to pick a smaller model and stay fully on GPU.

What about Apple Silicon with 16 GB unified memory?

An M3 or M4 Mac with 16 GB shares memory across GPU and system. Qwen3 8B Q4_K_M runs at 22–28 tok/s on an M4 base — slower than the 3070 but with longer effective context, because unified memory absorbs the KV cache more gracefully.

Are these models safe to use commercially?

Qwen3, DeepSeek-R1-Distill, and Llama 3.3 ship under permissive licenses suitable for commercial use, with some restrictions (Llama's acceptable use policy, Qwen's Apache 2.0 variant). Always check the model card on Hugging Face before shipping.

Final Verdict

If you have an 8 GB GPU and you write code, draft text, or wrangle data for a living, install Qwen3 8B Q4_K_M today. It will not fully replace ChatGPT Plus for agentic work or frontier reasoning, but it will quietly handle 80–90% of your daily prompts at zero marginal cost, with zero data leaving your machine, and at speeds that beat the cloud once you account for round-trip latency. The local LLM era arrived sometime in 2025. In 2026, it is just normal.