Editorial ranking · 2026

Best local LLM for RAG

Q: What is the best local LLM for RAG and document retrieval?

Qwen 3 VL 30B-A3B tops this ranking: a 30B model licensed under Apache 2.0 that needs about 19 GB of VRAM at Q4 quantization. See the full list below for the runners-up and how they compare.

Last updated 2026-05-26 · Page updated 2026-07-13

Top 7 open-source picks for RAG and document retrieval, ranked by benchmark performance and real-world fit. Updated monthly.

Qwen 3 VL 30B-A3B

30B · Alibaba · Apache 2.0

Qwen 3 VL's sweet spot: a 30B MoE with 3B active parameters and a 256k context. It delivers most of the quality of the larger 235B model at a fraction of the hardware cost.

VRAM Q4: 19 GB · Context: 256k

Read full model page →

Nemotron Nano 3 30B-A3B

30B · NVIDIA · NVIDIA Open Model License

NVIDIA's Mamba-2 + Transformer hybrid MoE with 3B active out of 30B total parameters. It offers a native 1M-token context and roughly 4× the throughput of Nemotron 2.

VRAM Q4: 19 GB · Context: 976k

Read full model page →

Qwen 3 30B-A3B

30B · Alibaba · Apache 2.0

Alibaba's Qwen 3 MoE with 30B total and just 3B active parameters, supporting hybrid thinking mode. MMLU 81.4, AIME24 80.4, 100+ languages, Apache 2.0.

VRAM Q4: 19 GB · Context: 128k

Read full model page →

Trinity Mini 26B-A3B

26B · Arcee AI · Apache 2.0

Arcee AI's US-built MoE with 3B active parameters out of 26B total. Apache-licensed, fast in practice, and tuned for agent-style workloads.

VRAM Q4: 15 GB · Context: 128k

Read full model page →

Kanana 2 30B-A3B Thinking

30B · Kakao · Apache 2.0

Kakao's agentic 30B MoE (3B active) with native hybrid thinking and Korean-first training. Licensed under Apache 2.0, with MLA attention and a 128k context.

VRAM Q4: 18 GB · Context: 128k

Read full model page →

Qwen 3 Omni 30B-A3B

30B · Alibaba · Apache 2.0

Alibaba's omni-modal 30B MoE (3B active) with streaming speech, 119-language ASR, and Apache 2.0 licensing. The most accessible truly omnimodal open model.

VRAM Q4: 19 GB · Context: 128k

Read full model page →

Granite 4.0 H-Small 32B-A9B

32B · IBM · Apache 2.0

IBM's hybrid Mamba-2 + MoE model with 32B total and 9B active parameters. Released under Apache 2.0 and engineered to cut long-context memory use by roughly 70% versus comparable transformers.

VRAM Q4: 19 GB · Context: 125k

Read full model page →

Which GPU should you buy to run Qwen 3 VL 30B-A3B?

To run Qwen 3 VL 30B-A3B locally at Q4, you need about 19 GB of VRAM. The best-value GPU for this is an RTX 4090 (24 GB VRAM).

Check RTX 4090 price on Amazon →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. This does not influence our independent rankings.

Frequently asked questions

How much VRAM do I need to run Qwen 3 VL 30B-A3B?

At Q4 quantization, Qwen 3 VL 30B-A3B needs about 19 GB of VRAM and fits comfortably on a single 24 GB GPU.
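As a rough sanity check on the 19 GB figure, the footprint of a quantized model can be approximated as total parameters × bits per weight ÷ 8, plus some runtime overhead. The sketch below assumes about 4.8 effective bits per weight for a Q4-style quantization and a flat 1 GB of overhead. Both values are illustrative assumptions, not numbers published on this page, and real usage grows with context length because of the KV cache.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.8,  # assumption: effective bits for a Q4-style quant
                     overhead_gb: float = 1.0) -> float:  # assumption: flat runtime overhead
    """Rough VRAM estimate in GB (1 GB = 10^9 bytes).

    For a MoE model, pass the TOTAL parameter count: all experts must be
    resident in memory even though only a few are active per token.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billion: float, gpu_vram_gb: float, **kwargs) -> bool:
    """True if the estimated footprint fits within the GPU's VRAM."""
    return estimate_vram_gb(params_billion, **kwargs) <= gpu_vram_gb


if __name__ == "__main__":
    print(f"30B at Q4: ~{estimate_vram_gb(30):.1f} GB")
    print("Fits a 24 GB GPU:", fits(30, 24))
    print("Fits a 16 GB GPU:", fits(30, 16))
```

Under these assumptions a 30B model lands at about 19 GB, in line with the figure above. Treat the result as a ballpark only, since actual usage depends on the quantization variant, the runtime, and the context length in use.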

Which of these models fit a 16 GB GPU?

At Q4 quantization, only Trinity Mini 26B-A3B (about 15 GB) fits within 16 GB of VRAM. The other six models need 18 to 19 GB.
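The same check can be run for any GPU size by filtering the list on its Q4 VRAM figures. The sketch below hard-codes the seven models with the VRAM and context values shown on this page. The `Model` type, the `models_that_fit` helper, and the `headroom_gb` parameter are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    vram_q4_gb: int   # "VRAM Q4" figure from the list above
    context_k: int    # "Context" figure from the list above


MODELS = [
    Model("Qwen 3 VL 30B-A3B", 19, 256),
    Model("Nemotron Nano 3 30B-A3B", 19, 976),
    Model("Qwen 3 30B-A3B", 19, 128),
    Model("Trinity Mini 26B-A3B", 15, 128),
    Model("Kanana 2 30B-A3B Thinking", 18, 128),
    Model("Qwen 3 Omni 30B-A3B", 19, 128),
    Model("Granite 4.0 H-Small 32B-A9B", 19, 125),
]


def models_that_fit(gpu_vram_gb: float, headroom_gb: float = 0.0) -> list[str]:
    """Names of models whose Q4 footprint fits the GPU.

    headroom_gb (hypothetical knob) reserves VRAM for the KV cache,
    the embedding model, or the desktop environment.
    """
    budget = gpu_vram_gb - headroom_gb
    return [m.name for m in MODELS if m.vram_q4_gb <= budget]


if __name__ == "__main__":
    print(models_that_fit(16))
    print(len(models_that_fit(24)), "models fit a 24 GB GPU")
    print(models_that_fit(16, headroom_gb=2.0))
```

Reserving even 2 GB of headroom on a 16 GB card leaves no model on this list that fits, which is worth keeping in mind for RAG workloads that send long retrieved passages to the model.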

Are the models on this RAG and document retrieval list free for commercial use?

Six of the seven models on this list are licensed under Apache 2.0; Nemotron Nano 3 30B-A3B uses the NVIDIA Open Model License. Check the specific license of each model on its catalog page before deploying commercially, as terms vary by author.

What context window do these models support?

Context windows on this list range from 125k to 976k tokens, depending on the model.