Guide · 2026-05-16

Best Local LLM for RTX 3080 (10 GB & 12 GB)

A data-driven verdict on which local models actually run well on the RTX 3080's tight VRAM budget, with benchmarks, quantizations, and tok/s.

By Mohamed Meguedmi · 9 min read

Key takeaways

  • Best overall (10 GB): Qwen3 8B Instruct at Q5_K_M — full GPU offload, 8K context, ~62 tok/s on a stock RTX 3080 10 GB.
  • Best overall (12 GB): Qwen3 14B at Q4_K_M with 8K context fits entirely in VRAM and benchmarks at ~38 tok/s.
  • Best coding model: Qwen2.5-Coder 7B Q5_K_M beats every 13B alternative on HumanEval while leaving headroom for a 16K context.
  • Skip: gpt-oss 20B and Mistral-Small 24B unless you accept heavy CPU offload — expect 8–14 tok/s, not 40+.
  • Runtime: llama.cpp (via LM Studio or Ollama) with Flash Attention enabled is the only setup that consistently saturates the 3080's 760 GB/s bandwidth.

The RTX 3080 is now five years old, but it remains one of the most common GPUs in developer machines worldwide. Its problem for local LLMs is not compute — the Ampere tensor cores are still excellent — but VRAM. With 10 GB on the original 2020 SKU and 12 GB on the 2022 refresh, every model decision is constrained by what fits, and what fits dictates what is fast. This guide settles which models the BestLLMfor editorial team recommends in May 2026, based on first-party benchmarks and reproducible runtime configurations.

Why VRAM, not TFLOPS, decides everything

The RTX 3080 10 GB ships with 760 GB/s of memory bandwidth on GDDR6X; the 12 GB variant has 912 GB/s. Both deliver around 30 FP16 TFLOPS through the tensor cores. For inference, bandwidth is the binding constraint: token generation is memory-bound, not compute-bound. Any model that exceeds VRAM and spills layers to system RAM is throttled by the roughly 32 GB/s ceiling of a PCIe 4.0 x16 link, regardless of how fast the CPU is.

That is why our benchmarks below only count a configuration as viable if at least 95% of layers fit on the GPU. Partial offload (the path advocated by NVIDIA's LM Studio blog) does work, but tok/s drops by 4–6×. For interactive use, full GPU residency is the only configuration that feels like a hosted API.
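The memory-bound argument can be sanity-checked with a back-of-the-envelope ceiling. The sketch below assumes that every generated token streams the full weight set once, so tok/s cannot exceed bandwidth divided by model size. It is an idealized upper bound (it ignores KV-cache reads and kernel overhead), the function name is illustrative, and the PCIe figure is a worst case in which every weight crosses the link for every token. The bandwidth and size figures are the ones quoted in this guide.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Idealized upper bound on decode speed for a memory-bound model.

    Assumes each generated token streams the full weight set once.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Figures quoted in this guide: Qwen3 8B at Q5_K_M is about 5.8 GB.
vram_ceiling = bandwidth_ceiling_tok_s(760.0, 5.8)  # RTX 3080 10 GB, weights in VRAM
pcie_ceiling = bandwidth_ceiling_tok_s(32.0, 5.8)   # worst case: weights cross PCIe 4.0

print(f"VRAM-resident ceiling: {vram_ceiling:.0f} tok/s")
print(f"PCIe-bound ceiling:    {pcie_ceiling:.1f} tok/s")
```

Under these assumptions the VRAM-resident ceiling is roughly twice the measured 62 tok/s, while the PCIe-bound ceiling sits in single digits, which is the gap the paragraph above describes.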

| Card | VRAM | Memory bandwidth | FP16 TFLOPS | TDP | 2026 used price (USD) |
|---|---|---|---|---|---|
| RTX 3080 10 GB | 10 GB GDDR6X | 760 GB/s | 29.8 | 320 W | $340–$420 |
| RTX 3080 12 GB | 12 GB GDDR6X | 912 GB/s | 30.6 | 350 W | $430–$520 |
| RTX 3060 12 GB (reference) | 12 GB GDDR6 | 360 GB/s | 12.7 | 170 W | $220–$270 |
| RTX 4070 12 GB (reference) | 12 GB GDDR6X | 504 GB/s | 29.1 | 200 W | $480–$560 |

Note the comparison: the 3080 12 GB has about 2.5× the bandwidth of the much-recommended RTX 3060 12 GB (912 vs. 360 GB/s). In practice, that translates to roughly 2× the generation speed for the same model, which is exactly what we measured.

The VRAM math: what actually fits

A useful rule for GGUF models in llama.cpp: VRAM required ≈ (parameters × bytes-per-weight) + KV cache + overhead. KV cache scales linearly with context length and with the number of layers; it is attention compute, not cache size, that grows quadratically with sequence length. For an 8B model at Q4_K_M with 8K context and Flash Attention 2 enabled, KV cache is around 700 MB; without FA2, plan for 1.4 GB.
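As a rough illustration of the KV-cache term, the sketch below computes the raw size of the K and V tensors from a model's shape. It uses the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and assumes an FP16 cache at 2 bytes per element. The function name is illustrative, and the result covers only the K and V tensors: real figures, including the measured ones quoted above, also depend on cache precision and runtime buffers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_element: float = 2.0) -> float:
    """Raw size of the K and V tensors for one sequence.

    Two tensors (K and V) per layer, each context_len x n_kv_heads x head_dim.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element


# Llama 3.1 8B shape: 32 layers, 8 KV heads (grouped-query attention), head dim 128.
GIB = 1024 ** 3
kv_8k = kv_cache_bytes(32, 8, 128, 8192)    # assumed FP16 cache
kv_16k = kv_cache_bytes(32, 8, 128, 16384)

print(f"8K context:  {kv_8k / GIB:.2f} GiB")
print(f"16K context: {kv_16k / GIB:.2f} GiB")
```

Doubling the context doubles the cache, which is why long contexts that go unused are a direct tax on the VRAM budget.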

| Model | Quant | Weights size | + 8K KV (FA2) | Fits 10 GB? / Fits 12 GB? |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | Q5_K_M | 5.7 GB | 6.4 GB | ✅ comfortable |
| Qwen3 8B Instruct | Q5_K_M | 5.8 GB | 6.5 GB | ✅ comfortable |
| Qwen2.5-Coder 7B | Q5_K_M | 5.4 GB | 6.1 GB (16K) | |
| Qwen3 14B | Q4_K_M | 8.4 GB | 9.4 GB | ⚠️ tight, no other apps |
| Phi-4 14B | Q4_K_M | 8.4 GB | 9.3 GB | ⚠️ |
| Mistral-Small 3.1 24B | Q3_K_M | 11.1 GB | 12.0 GB | ❌ heavy offload / ⚠️ borderline |
| gpt-oss 20B | Q4_K_M | 11.8 GB | 12.6 GB | ⚠️ minimal context |

The conclusion is uncomfortable for fans of 20B–24B models: they are not really 3080-class. They will run, but only with partial CPU offload, and the experience is closer to 10 tok/s than 40. If you want budget figures across this hardware tier and the next, the BestLLMfor local inference cost calculator projects three-year TCO including electricity.
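The fit question reduces to simple subtraction, sketched below with the "weights + 8K KV" totals from the table above. The helper name is illustrative, and the calculation assumes the whole card is available to the model, ignoring whatever the desktop and CUDA context reserve. Qwen2.5-Coder 7B is left out because its table figure is for a 16K context.

```python
# Weights + 8K KV cache in GB, copied from the table in this guide.
FOOTPRINT_GB = {
    "Llama 3.1 8B Instruct Q5_K_M": 6.4,
    "Qwen3 8B Instruct Q5_K_M": 6.5,
    "Qwen3 14B Q4_K_M": 9.4,
    "Phi-4 14B Q4_K_M": 9.3,
    "Mistral-Small 3.1 24B Q3_K_M": 12.0,
    "gpt-oss 20B Q4_K_M": 12.6,
}


def headroom_gb(footprint_gb: float, vram_gb: float) -> float:
    """VRAM left after weights and KV cache; negative means a spill to system RAM."""
    return round(vram_gb - footprint_gb, 1)


for model, footprint in FOOTPRINT_GB.items():
    print(f"{model:32s} 10 GB: {headroom_gb(footprint, 10):+5.1f}   "
          f"12 GB: {headroom_gb(footprint, 12):+5.1f}")
```

A negative value means some layers must live in system RAM; a value near zero means the model loads only if nothing else is using the GPU.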

Ranked picks: best LLMs for the RTX 3080 in 2026

1. Qwen3 8B Instruct — best overall (both 10 GB and 12 GB)

Released by Alibaba's Qwen team in late 2025, Qwen3 8B is the strongest 7–9B class instruct model we have benchmarked. On MMLU-Pro it scores 56.4, beating Llama 3.1 8B Instruct (48.2) and Mistral 7B v0.3 (37.9). At Q5_K_M it leaves 3.5 GB headroom on a 10 GB card for a 16K context or a small whisper.cpp model running in parallel.

Editorial verdict: this is the default model we recommend to any 3080 owner in May 2026. It is the rare 8B that no longer feels obviously worse than GPT-4-class hosted models for everyday assistant tasks.

2. Qwen3 14B — best quality on the 12 GB variant

The 14B Qwen3 at Q4_K_M is the largest dense model that fits entirely on a 3080 12 GB with a working 8K context. MMLU-Pro lands at 62.1, narrowing the gap to 30B-class models from the same family. Throughput is roughly 38 tok/s on the 12 GB card with Flash Attention 2, about 2× the 19 tok/s the same model reaches on an RTX 3060 12 GB.

3. Qwen2.5-Coder 7B — best for code

Qwen2.5-Coder 7B at Q5_K_M scores 88.4 on HumanEval and 83.5 on MBPP — higher than DeepSeek-Coder-V2 16B at Q4 and within 3 points of GPT-4o-mini. With 16K context fitting comfortably on the 10 GB card, it handles whole-file refactors. Pair it with Continue.dev or Zed for an IDE workflow.

4. Llama 3.1 8B Instruct — best ecosystem fit

Not the strongest 8B anymore on raw evals, but Llama 3.1 8B remains the most-fine-tuned base in the open ecosystem. Pick this if you intend to swap between adapters, run vision variants, or rely on the wide tool-use tuning available through community LoRAs.

5. Phi-4 14B — best for the 12 GB on long-context reasoning

Microsoft's Phi-4 14B at Q4_K_M is the strongest small reasoning model under 15B and scores 84.8 on MMLU. It is markedly more verbose than Qwen3 but produces cleaner multi-step chains on MATH and AIME-style problems.

Benchmarks: tok/s on a stock RTX 3080

All numbers below come from a single reproducible methodology: llama.cpp build b4150 (May 2026), CUDA 12.5, Flash Attention 2 enabled, batch size 512, 256-token generation from a 1024-token prompt. Drivers: NVIDIA 555.85. The full procedure is documented on the BestLLMfor methodology page.

| Model | Quant | RTX 3080 10 GB | RTX 3080 12 GB | RTX 3060 12 GB | RTX 4070 12 GB |
|---|---|---|---|---|---|
| Qwen3 8B Instruct | Q5_K_M | 62 tok/s | 71 tok/s | 34 tok/s | 58 tok/s |
| Llama 3.1 8B Instruct | Q5_K_M | 64 tok/s | 73 tok/s | 35 tok/s | 59 tok/s |
| Qwen2.5-Coder 7B | Q5_K_M | 68 tok/s | 78 tok/s | 38 tok/s | 64 tok/s |
| Qwen3 14B | Q4_K_M | 22 tok/s (offload) | 38 tok/s | 19 tok/s | 33 tok/s |
| Phi-4 14B | Q4_K_M | 21 tok/s (offload) | 37 tok/s | 18 tok/s | 32 tok/s |
| gpt-oss 20B | Q4_K_M | 8 tok/s (heavy offload) | 14 tok/s (light offload) | 7 tok/s | 12 tok/s |

Two things stand out. First, the 3080 12 GB is roughly 2× the speed of the popular RTX 3060 12 GB for the same model; the 360 GB/s vs. 912 GB/s bandwidth gap is decisive. Second, a 3080 12 GB beats the newer RTX 4070 by roughly 15–25% on dense inference, because the 4070 has lower memory bandwidth despite a newer architecture.
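The first observation can be checked directly against the numbers. The short script below takes the fully GPU-resident rows from the benchmark table and compares the measured speedup of the 3080 12 GB over the 3060 12 GB with the raw bandwidth ratio from the hardware table. Variable names are illustrative; the data is copied from this guide.

```python
# Measured tok/s from the benchmark table: (RTX 3080 12 GB, RTX 3060 12 GB).
MEASURED = {
    "Qwen3 8B Instruct": (71, 34),
    "Llama 3.1 8B Instruct": (73, 35),
    "Qwen2.5-Coder 7B": (78, 38),
    "Qwen3 14B": (38, 19),
    "Phi-4 14B": (37, 18),
}

BANDWIDTH_RATIO = 912 / 360  # GB/s figures from the hardware table

speedups = {model: fast / slow for model, (fast, slow) in MEASURED.items()}
mean_speedup = sum(speedups.values()) / len(speedups)

for model, ratio in speedups.items():
    print(f"{model:24s} {ratio:.2f}x")
print(f"mean speedup {mean_speedup:.2f}x vs bandwidth ratio {BANDWIDTH_RATIO:.2f}x")
```

The measured speedup lands a little above 2× for every row, somewhat below the 2.5× bandwidth ratio, which suggests bandwidth dominates but is not the only factor.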

How to install and run the recommended setup

The fastest path to a working stack on Windows or Linux is LM Studio (GUI) or Ollama (CLI/server). For a programmable HTTP endpoint we recommend Ollama.

  1. Install Ollama from ollama.com/download. On Linux: curl -fsSL https://ollama.com/install.sh | sh.
  2. Pull Qwen3 8B at Q5_K_M: ollama pull qwen3:8b-instruct-q5_K_M.
  3. Enable Flash Attention via environment variable: OLLAMA_FLASH_ATTENTION=1 before launching the server. This cuts KV-cache memory roughly in half.
  4. Request full GPU offload: in the Modelfile add PARAMETER num_gpu 99. Verify with nvidia-smi that ollama uses 6.5–7 GB of VRAM; a noticeably lower figure means some layers stayed on the CPU.
  5. Set context: PARAMETER num_ctx 8192. Raise to 16384 only on the 12 GB card.
  6. Test throughput: ollama run qwen3:8b-instruct-q5_K_M --verbose and watch the eval rate.
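For a scripted version of the throughput check in the last step, the sketch below calls Ollama's local HTTP endpoint and derives tok/s from the `eval_count` and `eval_duration` fields of the response. It assumes a server running on the default port and the model tag exactly as written in step 2 (adjust it to whatever you actually pulled). The helper names are illustrative, and passing `num_ctx` and `num_gpu` as request options is an alternative to setting them in a Modelfile, not the procedure used for the benchmarks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "qwen3:8b-instruct-q5_K_M"  # tag as written in step 2; adjust if yours differs


def build_payload(prompt: str, num_ctx: int = 8192, num_gpu: int = 99) -> dict:
    """Non-streaming generate request carrying the context and GPU-layer options."""
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_gpu": num_gpu},
    }


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (reported in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(prompt: str) -> float:
    """Send one request to a running Ollama server and return tok/s."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as reply:
        return tokens_per_second(json.load(reply))
```

With the server running, `measure("Explain GGUF quantization in three sentences.")` returns a single tok/s figure that can be compared with the eval rate printed by `--verbose`.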

If you prefer a fully open MCP server for routing requests across multiple local models, the quelllm-mcp project (Apache-2.0) wraps Ollama and exposes tools for Claude Desktop, Zed, and Cursor. Our editorial benchmarks and model metadata are also available via the BestLLMfor public API under CC BY 4.0 — see the about page for details.

What to avoid

  • Models larger than 14B at Q4 on the 10 GB card. Partial CPU offload halves throughput and inflates first-token latency to 1.5–2 s.
  • Q8_0 quantizations on either card. Marginal quality gain over Q5_K_M and Q6_K, large VRAM cost.
  • Disabling Flash Attention. On Ampere, FA2 is roughly free and saves 30–50% of KV memory.
  • Long contexts you don't actually use. A 32K context on Qwen3 14B will not load on a 10 GB card and is wasteful on 12 GB.
  • vLLM on a single 3080. Optimized for batched serving across multiple GPUs; on a single card with low concurrency llama.cpp is faster and uses less memory.

Verdict

| Use case | 10 GB pick | 12 GB pick |
|---|---|---|
| General assistant / chat | Qwen3 8B Instruct Q5_K_M | Qwen3 14B Q4_K_M |
| Coding / IDE autocomplete | Qwen2.5-Coder 7B Q5_K_M | Qwen2.5-Coder 14B Q4_K_M |
| Reasoning / math | Qwen3 8B Instruct Q5_K_M | Phi-4 14B Q4_K_M |
| Maximum compatibility / fine-tunes | Llama 3.1 8B Instruct Q5_K_M | Llama 3.1 8B Instruct Q6_K |
| Lightweight agents (multi-instance) | Qwen3 4B Q5_K_M ×2 | Qwen3 8B + Qwen2.5-Coder 7B |

If we had to pick one image to take away: an RTX 3080 in 2026 is best served by an 8B model at Q5_K_M running fully on the GPU at 60+ tok/s, not by a 20B model wheezing through PCIe at 10. The temptation to chase parameter count costs you the most valuable property of local inference — responsiveness. French-speaking readers can cross-check this verdict on our sister site quelllm.fr, which runs the same benchmark harness.

Frequently asked questions

Can an RTX 3080 10 GB run a 13B model?

Yes, at Q4_K_M with reduced context (around 4K) and Flash Attention enabled — but expect 20–25 tok/s with light CPU offload, versus 60+ tok/s for an 8B at Q5. For most users the 8B at higher quantization is the better trade.

Is the 3080 12 GB worth the price premium over the 10 GB for LLMs?

Yes if your workload is 13B–14B models or long contexts. The 12 GB variant has 20% more memory and 20% more bandwidth, which combine to roughly double throughput on models that previously had to spill to RAM. For pure 7–8B usage, the 10 GB is sufficient.

Should I use llama.cpp, vLLM, or TensorRT-LLM on a single RTX 3080?

llama.cpp (via Ollama or LM Studio). vLLM is designed for batched serving across GPUs and offers no advantage at single-user concurrency. TensorRT-LLM is faster in some scenarios but adds significant build complexity for marginal gains on an 8B model.

Does Q4_K_M lose noticeable quality versus Q8_0?

On 7B–14B instruct models, MMLU and HumanEval differences between Q4_K_M and Q8_0 are typically below 1 point. The visible quality break appears below Q4_0 or at Q3_K_S, where coherence on longer outputs degrades.

Can I run two models simultaneously on a 12 GB RTX 3080?

Yes. A common pairing is Qwen3 8B (assistant) at Q4_K_M (~5 GB) plus Qwen2.5-Coder 7B at Q4_K_M (~4.6 GB) with shared context budget. Ollama handles automatic model swapping; for true concurrent residency configure OLLAMA_MAX_LOADED_MODELS=2.
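The pairing above leaves only a small shared budget for context, as this quick calculation suggests. It uses the approximate weight sizes quoted in the answer; the helper name and the 0.5 GB reserve for the CUDA context and desktop are assumptions for illustration, not measured values.

```python
def remaining_vram_gb(vram_gb: float, weight_sizes_gb: list[float],
                      reserve_gb: float = 0.5) -> float:
    """VRAM left for KV caches once every model's weights are resident.

    reserve_gb is an assumed allowance for the CUDA context and desktop.
    """
    return round(vram_gb - sum(weight_sizes_gb) - reserve_gb, 2)


# Approximate Q4_K_M weight sizes quoted in the answer above.
left = remaining_vram_gb(12.0, [5.0, 4.6])
print(f"Shared KV-cache budget: {left} GB")
```

Under these assumptions both models must share less than 2 GB of cache, so keep `num_ctx` modest on each when running them side by side.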

What about gpt-oss 20B on the RTX 3080?

It runs only with partial CPU offload on both 10 GB and 12 GB variants. Realistic throughput is 8–14 tok/s — usable for batch jobs, painful for interactive chat. A 14B model at full GPU residency is the better choice.