Editorial ranking · 2026

Best local LLM for 8GB VRAM

Q: What is the best local LLM for 8 GB VRAM budgets?

Granite 4.0 H-Tiny 7B-A1B tops this ranking — a 7B model, licensed under Apache 2.0, needing about 4 GB of VRAM at Q4 quantization. See the full list below for the runner-ups and how they compare.

Last updated 2026-05-26 · Page updated 2026-07-13

Top 8 open-source picks for 8 GB VRAM budgets, ranked by benchmark performance and real-world fit. Updated monthly.

Eight gigabytes is the most common VRAM budget in gaming laptops and mid-range desktop cards — and it is genuinely enough for local LLM work, provided you respect the quantization shown on each card below. Every model in this ranking fits within 8 GB at Q4_K_M according to our compatibility data, from compact 7B builds around 5 GB up to 12B-class models that sit right at the ceiling.

Two practical rules. First, prefer the Q4_K_M build unless you have measured headroom for Q5 — on this budget, fidelity gains rarely justify an out-of-memory error. Second, keep context length in check: the KV cache grows with every token, and a long prompt can push an otherwise comfortable model into system RAM, where speed collapses. If your card carries more memory, the 12 GB and 24 GB guides cover the larger models that extra headroom unlocks.

Granite 4.0 H-Tiny 7B-A1B

7B · IBM · Apache 2.0

IBM's edge-class hybrid MoE with 7B total and only 1B active parameters — Apache 2.0 licensed and built for embedded and low-cost serving.

VRAM Q4: 4 GB · Context: 125k

Read full fiche →

Mistral Nemo 12B Instruct

12B · Mistral AI · Apache 2.0

Mistral AI and NVIDIA's co-developed 12B instruct model with 128k context, the Tekken tokenizer, and strong European multilingual coverage.

VRAM Q4: 7 GB · Context: 125k

Read full fiche →

Gemma 3 12B

12B · Google · Gemma

The 12B sweet spot of Google's Gemma 3 line — multimodal, 128K context, and 140 languages. Fits on a single consumer GPU with room for batching.

VRAM Q4: 7 GB · Context: 125k

Read full fiche →

Nemotron Nano v2 VL 12B

12.6B · NVIDIA · NVIDIA Open Model License

NVIDIA's 12.6B enterprise VLM with strong DocVQA and ChartQA scores, tuned for professional document extraction workflows.

VRAM Q4: 8 GB · Context: 125k

Read full fiche →

Lucie 7B

7B · OpenLLM-France · Apache 2.0

A French-sovereign 7B model from OpenLLM-France, backed by CNRS and LINAGORA, with a fully transparent and auditable training corpus.

VRAM Q4: 5 GB · Context: 4k

Read full fiche →

DeepSeek R1 Distill 7B

7B · DeepSeek · MIT

A 7B DeepSeek model distilled from R1 671B with explicit chain-of-thought reasoning. Surprisingly strong on AIME and MATH for its size.

VRAM Q4: 5 GB · Context: 32k

Read full fiche →

Qwen 3 8B

8B · Alibaba · Apache 2.0

Alibaba's 8B dense model with a toggleable thinking mode and broad multilingual coverage. Punches well above its weight for an 8B and runs comfortably on a single consumer GPU.

VRAM Q4: 5 GB · Context: 128k

Read full fiche →

Qwen 2.5 VL 7B

7B · Alibaba · Apache 2.0

A 7B vision-language model from Alibaba with state-of-the-art results in its class, scoring 95.7 on DocVQA. Handles hour-long video, bounding-box grounding, and multilingual OCR.

VRAM Q4: 6 GB · Context: 125k

Read full fiche →

Which GPU should you buy to run Granite 4.0 H-Tiny 7B-A1B?

To run Granite 4.0 H-Tiny 7B-A1B locally at Q4, you need ~4 GB of VRAM. The best value for this is a RTX 5060 (8 GB VRAM).

Check RTX 5060 price on Amazon →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.

Frequently asked questions

What is the best local LLM for 8 GB VRAM budgets?

Granite 4.0 H-Tiny 7B-A1B tops this ranking — a 7B model, licensed under Apache 2.0, needing about 4 GB of VRAM at Q4 quantization. See the full list below for the runner-ups and how they compare.

How much VRAM do I need to run Granite 4.0 H-Tiny 7B-A1B?

At Q4 quantization, Granite 4.0 H-Tiny 7B-A1B needs about 4 GB of VRAM and fits comfortably on a single 24 GB GPU.

Which of these models fit an 8 GB GPU?

At Q4 quantization, Granite 4.0 H-Tiny 7B-A1B, Mistral Nemo 12B Instruct, Gemma 3 12B, Nemotron Nano v2 VL 12B, Lucie 7B and 3 more fit within 8 GB of VRAM.

Are the models on this 8 GB VRAM budgets list free for commercial use?

Licenses across this list include Apache 2.0, Gemma, MIT, NVIDIA Open Model License. Check the specific license of each model on its catalog page before deploying commercially, as terms vary by author.

What context window do these models support?

Context windows on this list range from 4k to 128k tokens, depending on the model.