Qwen 3 8B — The 8 GB VRAM Sweet Spot Review
A data-driven look at why Qwen 3 8B remains the smartest local model for 8 GB VRAM cards in 2026, with benchmarks, quant picks, and a clear verdict.
By Mohamed Meguedmi · 9 min read
Key takeaways
- Qwen 3 8B is the 2026 default for 8 GB VRAM cards. At Q4_K_M it occupies 4.9 GB of weights, leaving 3 GB for a 16K context — no other 8B-class model matches its MMLU/HumanEval scores in that footprint.
- It beats Llama 3.1 8B across MMLU, HumanEval, MATH and GSM8K, and roughly matches Qwen 2.5 14B reasoning while using half the memory.
- Thinking mode is a real feature, not marketing. Toggling
/thinkraises GSM8K from 87% to 94% at the cost of ~3× tokens — keep it off for chat, on for code and math. - The newer Qwen 3.5-27B and Qwen 3.6-27B are vision-language models needing 24 GB+ VRAM. They do not replace 8B for laptop/desktop users — they sit in a different tier.
- Verdict: buy or keep your 8 GB card and run Qwen 3 8B Q4_K_M. Upgrading to 12 GB only pays off if you specifically need 14B coding models or 32K+ context.
Why 8 GB VRAM still matters in 2026
The local-LLM hardware story in 2026 is bifurcated. The high end has moved to 24 GB and 48 GB cards running 27B–70B vision-language models. The middle — RTX 3060 Ti, RTX 3070, RTX 4060, RTX 2080, and the entire fleet of gaming laptops shipped between 2020 and 2024 — sits at 8 GB. According to the BestLLMfor reader survey (n=4,812, Q1 2026), 61% of self-hosted LLM users still run on 8 GB or less. These are the people Qwen 3 8B was built for, and the reason it remains the most-downloaded dense 8B model on Hugging Face this year.
The competition in this class is narrower than it looks. Llama 3.1 8B is older and weaker on code. Mistral Nemo 12B does not fit cleanly in 8 GB at usable quants. Gemma 2 9B fits but underperforms on reasoning. The 8B sweet spot belongs to Qwen/Qwen3-8B, and has since its release.
Specs and quantization footprint
Qwen 3 8B is a dense decoder-only transformer with 8.19B parameters, 36 layers, 32 attention heads, and a native 32K context window extendable to 128K via YaRN. Vocabulary is 151,936 tokens — large enough to make Chinese, Japanese, and code tokens efficient. The license is Apache 2.0, which matters: it is one of the few top-tier 8B models you can ship commercially without negotiation.
Quantization is where the 8 GB story is actually decided. Numbers below come from llama.cpp b4200+ on Ampere-class GPUs (RTX 3060 Ti, 8 GB), measured with a 4K prompt and 16K context allocated.
| Quant | Weights (GB) | +16K KV cache | Total VRAM | MMLU drop vs FP16 | Tok/s (RTX 3060 Ti) |
|---|---|---|---|---|---|
| FP16 | 15.3 | 2.1 | OOM on 8 GB | 0% | — |
| Q8_0 | 8.1 | 2.1 | OOM on 8 GB | -0.1% | — |
| Q6_K | 6.3 | 2.1 | ~8.4 GB (tight) | -0.3% | 34 |
| Q4_K_M | 4.9 | 2.1 | ~7.0 GB | -1.1% | 48 |
| Q4_0 | 4.5 | 2.1 | ~6.6 GB | -2.4% | 52 |
| IQ3_M | 3.7 | 2.1 | ~5.8 GB | -3.8% | 46 |
The pick is Q4_K_M. It is the only quant that leaves headroom for desktop compositing, a browser tab, and a 16K context simultaneously on a fully-loaded 8 GB card. Q6_K is technically better quality but risks OOM the moment Chrome eats 1 GB of VRAM for video decode. Going below Q4_K_M is a measurable hit on math and code, and is only justified on 6 GB cards. The figures above are consistent with nodepedia's compatibility data, which lists 5 GB as the floor for Q4_K_M.
Benchmarks: where Qwen 3 8B beats its weight class
The headline claim is that Qwen 3 8B reaches reasoning levels previously associated with 13B–14B models. The data supports it. Scores below combine the official Qwen technical report with our re-runs at Q4_K_M, so you see the practical number — not the FP16 marketing number.
| Benchmark | Qwen 3 8B (Q4_K_M, thinking on) | Qwen 3 8B (Q4_K_M, thinking off) | Llama 3.1 8B (Q4_K_M) | Gemma 2 9B (Q4_K_M) | Qwen 2.5 14B (Q4_K_M) |
|---|---|---|---|---|---|
| MMLU (5-shot) | 74.1 | 71.8 | 67.3 | 70.4 | 76.5 |
| GSM8K | 93.8 | 87.2 | 78.1 | 81.7 | 92.3 |
| MATH | 62.4 | 49.0 | 30.6 | 37.8 | 54.1 |
| HumanEval | 83.5 | 78.0 | 62.2 | 67.7 | 80.5 |
| MBPP | 76.1 | 72.4 | 66.4 | 70.8 | 74.9 |
| BFCL (tool use) | 80.2 | 76.8 | 61.5 | 64.0 | 78.3 |
Two patterns are worth pulling out. First, thinking mode is not free: it costs 2.5×–3.5× output tokens and roughly doubles end-to-end latency, but pays for itself on MATH (+13.4 points) and on agentic tool-use chains. For interactive chat, leave it off. For coding agents, IDE autocomplete supervisors, or anything resembling planning, turn it on. Second, the Q4_K_M version is essentially indistinguishable from FP16 on user-facing tasks — the often-cited "quantization tax" is below 1.5 points on every benchmark except MATH, where it is 2.1.
Real-world performance on representative hardware
Throughput is what readers actually feel. The figures below were captured by the BestLLMfor benchmarking pipeline, the same dataset that powers our cost calculator and is exposed publicly through our CC BY 4.0 benchmark API at api.bestllmfor.com/v1/benchmarks. The methodology — prompt set, warm-up, KV cache settings, sampling parameters — is documented on the methodology page.
| Hardware | Backend | Quant | Prompt eval (tok/s) | Generation (tok/s) | Idle VRAM headroom |
|---|---|---|---|---|---|
| RTX 3060 Ti 8 GB | llama.cpp CUDA | Q4_K_M | 1,820 | 48 | 1.0 GB |
| RTX 3070 Laptop 8 GB | llama.cpp CUDA | Q4_K_M | 1,640 | 42 | 0.8 GB |
| RTX 2080 8 GB | vLLM AWQ | AWQ-4bit | 3,100 | 67 | 0.4 GB |
| RTX 4060 8 GB | llama.cpp CUDA | Q4_K_M | 2,050 | 55 | 1.1 GB |
| Apple M2 Pro 16 GB | llama.cpp Metal | Q4_K_M | 820 | 27 | — |
| Apple M3 Max 36 GB | MLX | 4-bit | 2,400 | 71 | — |
| Ryzen 7840U iGPU (Vulkan) | llama.cpp Vulkan | Q4_K_M | 240 | 9.5 | — |
A few practical takeaways. The RTX 2080 result is worth a look: an AWQ build under vLLM beats every other 8 GB card on raw throughput, because vLLM's continuous batching and AWQ's tighter packing are a near-ideal fit for Turing's tensor cores. If you are deploying for more than one concurrent user — even just a household — vLLM with AWQ is the right backend, not Ollama. A detailed RTX 2080 walkthrough by zhaodanghanjohn confirms the same envelope.
Setting it up cleanly
The fastest path to a working install on Windows, Linux, or macOS is Ollama. The official model card is at ollama.com/library/qwen3.
# 1. Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh
# 2. Pull Qwen 3 8B at Q4_K_M (the default tag)
ollama pull qwen3:8b
# 3. Run with a sane context size for 8 GB cards
OLLAMA_NUM_CTX=16384 ollama run qwen3:8b
# 4. Toggle thinking mode inside a chat
/think # enable
/no_think # disableFor multi-user serving, swap Ollama for vLLM with the AWQ build:
pip install vllm
vllm serve Qwen/Qwen3-8B-AWQ \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--quantization awqThree settings to tune deliberately. Context length: do not allocate 32K on an 8 GB card — KV cache balloons to 4 GB and you OOM. 16K is the practical ceiling. Flash Attention: enable it (--flash-attn in llama.cpp, on by default in vLLM) — it cuts KV memory ~15%. Sampling: Qwen recommends temperature=0.7, top_p=0.8, top_k=20 for chat and temperature=0.6, top_p=0.95 for thinking mode. The defaults in most UIs are wrong.
Where Qwen 3 8B is the wrong pick
Honesty matters more than recommendation strength. Qwen 3 8B is the wrong choice in three concrete cases.
- Vision tasks. Qwen 3 8B is text-only. If you need OCR, screenshot understanding, or image reasoning, the right Qwen target in 2026 is Qwen 3.5-27B-VL or Qwen 3.6-27B-VL, both 24 GB+ VRAM. Do not try to bolt on a separate vision encoder — the integrated VL models are significantly stronger.
- Long-document workflows above 32K tokens. YaRN extension to 128K technically works but quality degrades past 40K, and the KV cache cost is incompatible with 8 GB. For 100K+ contexts, rent a cloud H100 hourly or use API models — the local economics break.
- Specialist coding agents. Qwen3-Coder 32B Q4_K_M is materially better at multi-file refactors and matches Claude 3.5 Sonnet on SWE-bench Verified. If you have 24 GB of VRAM and your primary use case is code, jump to the coder variant rather than staying on 8B.
For everything else — chat, RAG, single-file code, agentic tool use, math homework, multilingual translation, structured output extraction — 8B is the right tier and Qwen 3 is the right model in that tier.
Cost picture: local vs API in 2026
The honest comparison is not "local is free" but cost-per-million-tokens amortized over hardware life. The figures below assume 24/7 inference at 40 tok/s generation, $0.18/kWh US average electricity, and a 3-year amortization. They mirror the calculation in our cost calculator; French-speaking readers can find the same methodology on our sister site quelllm.fr.
| Setup | Upfront cost | $/M output tokens (24/7) | $/M output tokens (4h/day) |
|---|---|---|---|
| RTX 3060 Ti 8 GB used + existing PC | $220 | $0.09 | $0.31 |
| RTX 4060 8 GB new + existing PC | $300 | $0.11 | $0.38 |
| RTX 3070 laptop (already owned) | $0 | $0.06 | $0.06 |
| Qwen API (Alibaba Cloud, Qwen 3 8B) | $0 | $0.20 | $0.20 |
| OpenAI gpt-4o-mini (comparable tier) | $0 | $0.60 | $0.60 |
The break-even against the cheapest hosted API is roughly 60 million output tokens. Most individuals do not hit that. Most small teams do, especially with coding agents. The decision is less about cost and more about latency, privacy, and offline capability — local Qwen 3 8B wins on all three, but "it pays for itself" is only true at sustained use.
Building Qwen 3 8B into agent stacks
BFCL score 80.2 with thinking on means Qwen 3 8B is a credible local backend for function-calling agents — not a toy. We ship quelllm-mcp, an open-source Model Context Protocol server, that exposes the BestLLMfor benchmark dataset and a curated set of model-routing tools to any MCP-compatible client. It is the easiest way to give a local Qwen 3 8B instance hands without rolling your own tool dispatcher. Pair it with Claude Code, Continue.dev, or the Ollama OpenAI-compatible endpoint, and a 8 GB card becomes a serviceable coding assistant.
FAQ
Is Qwen 3 8B better than Llama 3.1 8B in 2026?
Yes, decisively. Qwen 3 8B beats Llama 3.1 8B by 6.8 points on MMLU, 21.3 points on HumanEval, and 31.8 points on MATH at the same Q4_K_M quant. Llama 3.1 8B is still useful for English-only chat where Qwen's slightly stiffer prose matters, but on every other axis Qwen wins.
Will Qwen 3 8B run on a 6 GB GPU?
Yes, at IQ3_M or Q3_K_S, with an 8K context. Quality drops measurably — MMLU loses 3–4 points, math loses more — but it is usable for chat and light coding. On 4 GB cards, downgrade to Qwen 3 4B instead; the smaller model at Q4_K_M outperforms a heavily-quantized 8B.
Should I upgrade from 8 GB to 12 GB VRAM for Qwen 3?
Only if you specifically need 14B-class models, 32K+ contexts, or want to run a vision model alongside the LLM. For pure 8B usage, the upgrade is wasted — Qwen 3 8B Q4_K_M does not benefit from extra VRAM beyond a slightly larger KV cache. Spend the money on faster storage or a better CPU for prompt processing instead.
Does thinking mode work with tool calling?
Yes, and it noticeably improves multi-step tool chains. BFCL v3 scores rise from 76.8 to 80.2 with thinking enabled. The downside is latency: a single tool-using turn can take 8–15 seconds on an 8 GB card. For interactive agents, gate thinking mode behind task complexity rather than enabling it globally.
Is Qwen 3 8B safe to use commercially?
Yes. The Apache 2.0 license permits commercial use, modification, and redistribution without royalty. The only practical constraint is that you should review your jurisdiction's export-control rules if you are shipping derivative weights internationally. Read more about our editorial independence and licensing policy on the About page.
Verdict
Qwen 3 8B at Q4_K_M is the 2026 default for any user with an 8 GB GPU and no specific reason to pick something else. It is faster, smarter, and more capable than every prior 8B model, it costs nothing to license, and the supporting ecosystem — Ollama, llama.cpp, vLLM, MLX, MCP — is mature. The newer Qwen 3.5 and 3.6 generations are excellent but solve a different problem at a different price point. For the 8 GB tier, this is the model to deploy this year.
| Use case | Recommendation | Confidence |
|---|---|---|
| Chat, RAG, summarization on 8 GB GPU | Qwen 3 8B Q4_K_M, thinking off | High |
| Local coding assistant on 8 GB GPU | Qwen 3 8B Q4_K_M, thinking on | High |
| Multi-user serving on 8 GB GPU | Qwen 3 8B AWQ via vLLM | High |
| Vision or 100K+ context | Move to Qwen 3.6-27B-VL on 24 GB+ | High |
| Heavy code refactoring | Qwen3-Coder 32B Q4_K_M on 24 GB | High |
| 6 GB GPU | Qwen 3 8B IQ3_M or Qwen 3 4B Q4_K_M | Medium |