Guide · 2026-05-16

Best Local LLM for RAG — LangChain & LlamaIndex Tested

Last updated 2026-05-16

We benchmarked seven local models across LangChain and LlamaIndex pipelines on a 12,000-document corpus. Here is what actually works in production.

By Mohamed Meguedmi

Key takeaways

Winner overall: Qwen3 32B Q4_K_M paired with BGE-M3 embeddings hit 87.4% answer faithfulness on our 500-question RAGAS suite — within 2 points of GPT-4o-mini at roughly 1/40th the per-query cost.
LangChain vs LlamaIndex: LlamaIndex won on retrieval quality out of the box (+6.8 percentage points of hit-rate at k=5); LangChain won on orchestration when tool-calling or multi-step agents entered the loop.
Minimum viable hardware: a 24 GB GPU (RTX 3090, 4090, or 7900 XTX) runs Qwen3 32B Q4_K_M (22.1 GB VRAM peak in our RTX 4090 runs) alongside a local Qdrant instance. 16 GB cards limit you to 14B-class models such as Phi-4 14B (10.4 GB peak).
Embeddings matter more than the LLM: swapping nomic-embed-text for BGE-M3 raised end-to-end faithfulness by 11.3 points, a bigger jump than upgrading from Llama 3.1 8B to Qwen3 32B (10.6 points).
Local pays back at volume: at 250k queries/month, a self-hosted stack breaks even against the OpenAI API in about 3 months; see our cost calculator for your own numbers.

Why local RAG, why now

Retrieval-Augmented Generation has matured from a 2023 novelty into the default architecture for any system that needs to answer questions about private documents. The cloud-API path is well-trodden, but three trends pushed local deployments into the mainstream during 2025: open-weight models that come within two points of GPT-4o-mini on faithfulness, embedding models that finally rival OpenAI's text-embedding-3-large, and consumer GPUs cheap enough to amortize in months rather than years.

This guide is not another LangChain-vs-LlamaIndex theory piece. The editorial team built identical pipelines in both frameworks, pointed them at the same 12,000-document corpus (a mixed bag of technical PDFs, support tickets, and Markdown documentation totaling ~38M tokens), and measured what came out the other end. All results below come from real runs reproducible from our methodology page.

The test setup

We ran every benchmark on a single-GPU node — RTX 4090 (24 GB VRAM), Ryzen 9 7950X, 64 GB DDR5 — running Ubuntu 24.04 and Ollama 0.5.4. Vector storage used Qdrant 1.12 in Docker. Each pipeline answered the same 500-question evaluation set scored with RAGAS for faithfulness, answer relevancy, and context precision.

Component	LangChain pipeline	LlamaIndex pipeline
Document loader	UnstructuredFileLoader	SimpleDirectoryReader + LlamaParse
Chunker	RecursiveCharacterTextSplitter (512/64)	SentenceSplitter (512/64) + SemanticSplitterNodeParser
Embeddings	BGE-M3 via HuggingFaceEmbeddings	BGE-M3 via HuggingFaceEmbedding
Vector store	Qdrant via langchain-qdrant	Qdrant via llama-index-vector-stores-qdrant
Retriever	MultiQueryRetriever + Cohere rerank-3	VectorIndexRetriever + LLMRerank
LLM	ChatOllama	Ollama
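
To make the "(512/64)" chunker settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It reads 512 as the chunk size and 64 as the overlap, and it works on a pre-tokenized list. The unit (characters or tokens) is an assumption, and this is a simplified stand-in, not the actual RecursiveCharacterTextSplitter or SentenceSplitter logic, which also respect separators and sentence boundaries.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # each new window starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# A 1,000-token document yields three windows; neighbours share 64 tokens.
chunks = chunk_with_overlap(list(range(1000)))
print([len(c) for c in chunks])
```

The overlap exists so that a sentence cut by one window boundary still appears whole in the neighbouring chunk.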

The models we tested

Seven local models were in the mix, all quantized to Q4_K_M unless noted. Sizes refer to the on-disk GGUF weight files.

Qwen3 32B (Q4_K_M, 19.8 GB)
Qwen3 14B (Q4_K_M, 9.0 GB)
Llama 3.3 70B (Q3_K_M, 32.1 GB — CPU-offloaded tail)
Llama 3.1 8B (Q4_K_M, 4.7 GB)
Mistral Small 24B (Q4_K_M, 14.3 GB)
Gemma 3 12B (Q4_K_M, 7.3 GB)
Phi-4 14B (Q4_K_M, 8.4 GB)

Retrieval results: LlamaIndex wins out of the box

Before the LLM ever sees a token, retrieval quality is what decides whether your RAG system tells the truth. We measured hit-rate@5 — the share of questions for which at least one truly relevant chunk appears in the top 5 retrieved — across both frameworks with identical embeddings.

Configuration	Hit-rate@5	MRR@10	Avg retrieval latency
LangChain, naive vector search	71.2%	0.542	38 ms
LangChain, MultiQuery + rerank	83.4%	0.681	284 ms
LlamaIndex, naive vector search	78.0%	0.611	41 ms
LlamaIndex, auto-merging + LLMRerank	90.2%	0.748	312 ms
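
For readers who want to reproduce the two retrieval columns on their own data, the sketch below shows one common way to compute hit-rate@k and MRR@k. The function names and the toy data are illustrative, not taken from our evaluation harness. It assumes each question comes with a ranked list of retrieved chunk IDs and a set of chunk IDs judged relevant.

```python
def hit_rate_at_k(retrieved, relevant, k=5):
    """Share of questions with at least one relevant chunk in the top k."""
    hits = sum(
        1
        for ranked, rel in zip(retrieved, relevant)
        if any(chunk_id in rel for chunk_id in ranked[:k])
    )
    return hits / len(retrieved)


def mrr_at_k(retrieved, relevant, k=10):
    """Mean reciprocal rank of the first relevant chunk within the top k."""
    total = 0.0
    for ranked, rel in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(ranked[:k], start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(retrieved)


# Toy example: three questions, hypothetical chunk IDs.
retrieved = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
relevant = [{"b"}, {"q"}, {"m"}]
print(hit_rate_at_k(retrieved, relevant, k=5))  # 2 of 3 questions hit
print(mrr_at_k(retrieved, relevant, k=10))      # (1/2 + 0 + 1) / 3 = 0.5
```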

The 6.8-point gap in naive mode comes down to LlamaIndex's SemanticSplitterNodeParser, which keeps semantically coherent passages together instead of slicing at fixed character counts. Turning on auto-merging (promoting a parent node when several of its child nodes match) together with LLMRerank lifts LlamaIndex another 12.2 points, from 78.0% to 90.2%. That is 19.0 points above stock LangChain and 6.8 points above LangChain with MultiQuery and reranking.

LangChain can match these numbers, but you have to assemble parent-document retrievers and rerankers manually. LlamaIndex ships those patterns as one-liners.
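
The auto-merging idea can be sketched in a few lines of plain Python. This is an illustration of the concept under stated assumptions, not LlamaIndex's implementation: the `ratio_threshold` value, the function name, and the node IDs are all hypothetical. A parent replaces its children when more than the threshold share of them was retrieved.

```python
from collections import defaultdict


def auto_merge(retrieved_children, parent_of, children_of, ratio_threshold=0.5):
    """Swap child chunks for their parent when enough siblings were retrieved."""
    by_parent = defaultdict(list)
    for child in retrieved_children:
        by_parent[parent_of[child]].append(child)

    merged, seen = [], set()
    for child in retrieved_children:  # walk in ranked order to preserve it
        parent = parent_of[child]
        ratio = len(by_parent[parent]) / len(children_of[parent])
        result = parent if ratio > ratio_threshold else child
        if result not in seen:  # a promoted parent is emitted only once
            seen.add(result)
            merged.append(result)
    return merged


children_of = {"P1": ["c1", "c2", "c3"], "P2": ["c4", "c5", "c6", "c7"]}
parent_of = {c: p for p, kids in children_of.items() for c in kids}
# Two of P1's three children matched, so P1 is promoted; c4 stays a leaf.
print(auto_merge(["c1", "c4", "c2"], parent_of, children_of))  # ['P1', 'c4']
```

The generator then sees one larger, coherent passage instead of several adjacent fragments.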

Generation results: Qwen3 32B is the local champion

With retrieval held constant (LlamaIndex auto-merging + BGE-M3 + LLMRerank), we swapped only the generator. Faithfulness measures whether the answer stays grounded in the retrieved context — the metric that actually catches hallucinations.

Model	Faithfulness	Answer relevancy	Tokens/sec (RTX 4090)	VRAM peak
Qwen3 32B Q4_K_M	87.4%	0.892	34	22.1 GB
Llama 3.3 70B Q3_K_M	88.1%	0.901	6	23.8 GB + 8 GB RAM
Mistral Small 24B Q4_K_M	84.6%	0.871	41	17.2 GB
Phi-4 14B Q4_K_M	82.9%	0.864	58	10.4 GB
Qwen3 14B Q4_K_M	82.1%	0.858	62	10.9 GB
Gemma 3 12B Q4_K_M	79.4%	0.832	67	9.1 GB
Llama 3.1 8B Q4_K_M	76.8%	0.811	89	6.3 GB
GPT-4o-mini (reference)	89.3%	0.908	~80	n/a

Llama 3.3 70B technically tops the faithfulness chart, but at 6 tokens/sec with CPU offload it is unusable for anything interactive. Qwen3 32B delivers within 0.7 points of the 70B at over 5× the throughput, and it does so on a single 24 GB GPU. If you have only 16 GB to spare, Phi-4 14B is the sharper pick — its training distribution skews toward reasoning over chat, which shows up in cleaner extractive answers.
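
The selection logic in this section can be written down as a small filter over the table above. The sketch below is one way to do it, using only the measured numbers. The 1 GB headroom and the 20 tokens/sec floor are assumptions to adjust for your workload. Throughput was measured on an RTX 4090, so it will differ on other cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    faithfulness: float    # percent, from the table above
    tokens_per_sec: float  # measured on an RTX 4090
    vram_peak_gb: float


RESULTS = [
    Result("Qwen3 32B Q4_K_M", 87.4, 34, 22.1),
    Result("Llama 3.3 70B Q3_K_M", 88.1, 6, 23.8),  # plus 8 GB of system RAM
    Result("Mistral Small 24B Q4_K_M", 84.6, 41, 17.2),
    Result("Phi-4 14B Q4_K_M", 82.9, 58, 10.4),
    Result("Qwen3 14B Q4_K_M", 82.1, 62, 10.9),
    Result("Gemma 3 12B Q4_K_M", 79.4, 67, 9.1),
    Result("Llama 3.1 8B Q4_K_M", 76.8, 89, 6.3),
]


def pick_model(vram_gb, min_tokens_per_sec=20.0, headroom_gb=1.0):
    """Most faithful model that fits the VRAM budget and is fast enough."""
    candidates = [
        r for r in RESULTS
        if r.vram_peak_gb + headroom_gb <= vram_gb
        and r.tokens_per_sec >= min_tokens_per_sec
    ]
    return max(candidates, key=lambda r: r.faithfulness, default=None)


print(pick_model(24).model)  # Qwen3 32B Q4_K_M
print(pick_model(16).model)  # Phi-4 14B Q4_K_M
```

Raising `min_tokens_per_sec` to 60 on a 12 GB budget selects Qwen3 14B, which is the throughput-first trade-off.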

Embeddings: the unsung lever

The largest single jump in end-to-end accuracy came not from the LLM but from the embedding model. We tested four open-weight options against the same Qwen3 32B generator.

Embedding model	Dim	Hit-rate@5	End-to-end faithfulness
nomic-embed-text v1.5	768	74.1%	76.1%
mxbai-embed-large	1024	81.9%	82.7%
BGE-M3	1024	90.2%	87.4%
Snowflake Arctic Embed L v2.0	1024	89.7%	87.0%

BGE-M3 (BAAI/bge-m3) supports dense, sparse, and ColBERT-style multi-vector retrieval out of one model, which may help explain why it tops the table. Switching from nomic-embed-text to BGE-M3 raised faithfulness by 11.3 points (76.1% to 87.4%), a bigger swing than the 10.6 points gained by going from Llama 3.1 8B all the way up to Qwen3 32B. If you remember one thing from this guide, make it that.
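
To show what "dense" and "ColBERT-style multi-vector" scoring mean in practice, here is a toy sketch in plain Python. Dense retrieval compares one vector per text by cosine similarity. Late-interaction scoring keeps one vector per token and sums each query token's best match (often called MaxSim). The two-dimensional vectors are made up for illustration; real BGE-M3 dense vectors have 1,024 dimensions, and this is not the model's own code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_dense(query_vec, doc_vecs, k=5):
    """Rank documents by cosine similarity of single dense vectors."""
    ranked = sorted(doc_vecs.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def maxsim(query_token_vecs, doc_token_vecs):
    """Late interaction: sum, over query tokens, of the best-matching doc token."""
    return sum(max(cosine(q, d) for d in doc_token_vecs) for q in query_token_vecs)


docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k_dense([1.0, 0.0], docs, k=2))                         # ['a', 'c']
print(maxsim([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))  # 2.0
```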

Cost: when does local pay back?

Equivalent quality on OpenAI runs roughly $0.000150 per 1k input tokens and $0.000600 per 1k output on gpt-4o-mini, plus $0.000130 per 1k tokens on text-embedding-3-large. The local stack — RTX 4090 ($1,799 retail) plus a Ryzen 9 host (~$1,400 fully built) and Qdrant running for free — runs about $0.12/hour in electricity at US average rates.

Volume	OpenAI monthly	Local monthly (power)	Local break-even
10k queries/month	$48	$86	Never (OpenAI cheaper)
50k queries/month	$240	$86	~20.8 months
250k queries/month	$1,200	$86	~2.9 months
1M queries/month	$4,800	$86	~0.7 months
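
The break-even column is plain arithmetic: hardware cost divided by the monthly saving over the API. The sketch below spells that out using the figures quoted in this section. It assumes the machine runs around the clock and ignores depreciation, maintenance, and engineering time.

```python
HARDWARE_COST = 1799 + 1400  # RTX 4090 plus Ryzen 9 host, in USD


def monthly_power_cost(dollars_per_hour=0.12, hours_per_day=24, days=30):
    """Electricity cost of keeping the node up for a month."""
    return dollars_per_hour * hours_per_day * days


def break_even_months(hardware_cost, api_monthly, local_monthly):
    """Months until hardware is paid off by savings; None if the API is cheaper."""
    saving = api_monthly - local_monthly
    if saving <= 0:
        return None
    return hardware_cost / saving


print(round(monthly_power_cost(), 1))                         # 86.4
print(round(break_even_months(HARDWARE_COST, 1200, 86), 1))   # 2.9 (250k queries/month)
print(break_even_months(HARDWARE_COST, 48, 86))               # None (10k queries/month)
```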

Plug your own assumptions into the BestLLMfor cost calculator: the break-even point is sensitive to the query-length distribution, not just the query count. For European readers, our French sister site quelllm.fr publishes equivalent figures in EUR with EU electricity rates.

Framework verdict: pick LlamaIndex first, reach for LangChain when agents arrive

Our recommendation after eight weeks of pipeline-building: start in LlamaIndex. The retrieval primitives — sentence-window, auto-merging, hierarchical node parsers — are the difference between a 78% and a 90% hit-rate, and they exist as named constructors instead of recipes you have to assemble.

Move to LangChain when one of these enters the picture: multi-step tool-calling agents, complex stateful chains, observability via LangSmith, or a production deployment that benefits from LangGraph's checkpointing. The two libraries are not exclusive — both expose the same Qdrant collection, so you can read with LlamaIndex and orchestrate with LangChain in the same app.

A reproducible install path

Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
Pull the models: ollama pull qwen3:32b and ollama pull bge-m3
Start Qdrant: docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant:v1.12.4
Create a Python 3.12 venv and install llama-index llama-index-llms-ollama llama-index-embeddings-ollama llama-index-vector-stores-qdrant ragas
Index your corpus with SimpleDirectoryReader, wrap with VectorStoreIndex.from_documents, then query through a RetrieverQueryEngine with LLMRerank in the postprocessor chain.
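
The last step above might look roughly like the sketch below. It assumes Ollama and Qdrant are running locally on their default ports and that the packages from the install step are present. The corpus path, the collection name, and the `top_k` and `rerank_top_n` values are hypothetical placeholders, not our benchmark settings. The imports sit inside the function so the file can be loaded before the packages are installed.

```python
# Model tags come from the pull step; everything marked "hypothetical" is a placeholder.
OLLAMA_LLM = "qwen3:32b"
OLLAMA_EMBED = "bge-m3"
QDRANT_URL = "http://localhost:6333"
COLLECTION = "rag_corpus"  # hypothetical collection name


def build_query_engine(corpus_dir="./corpus", top_k=10, rerank_top_n=5):
    """Index a folder into Qdrant and return a query engine with LLM reranking."""
    import qdrant_client
    from llama_index.core import (
        Settings,
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
    )
    from llama_index.core.postprocessor import LLMRerank
    from llama_index.embeddings.ollama import OllamaEmbedding
    from llama_index.llms.ollama import Ollama
    from llama_index.vector_stores.qdrant import QdrantVectorStore

    # Route generation and embedding through the local Ollama server.
    Settings.llm = Ollama(model=OLLAMA_LLM, request_timeout=300.0)
    Settings.embed_model = OllamaEmbedding(model_name=OLLAMA_EMBED)

    # Store vectors in the Qdrant container started earlier.
    client = qdrant_client.QdrantClient(url=QDRANT_URL)
    vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    documents = SimpleDirectoryReader(corpus_dir).load_data()
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

    # as_query_engine builds a RetrieverQueryEngine; LLMRerank trims top_k to top_n.
    return index.as_query_engine(
        similarity_top_k=top_k,
        node_postprocessors=[LLMRerank(top_n=rerank_top_n)],
    )
```

Calling `build_query_engine().query("...")` with a question about your corpus returns a response object whose text can be printed directly. Indexing a large corpus this way can take a while, because every chunk is embedded through Ollama.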

If you want a ready-made bridge between your editor and a local model, the open-source quelllm-mcp server exposes our benchmark results through the Model Context Protocol, and the underlying numbers are downloadable as JSON under CC BY 4.0 from the BestLLMfor public API.

Final recommendation

Use case	Recommended stack
24 GB GPU, accuracy-first RAG	LlamaIndex + Qwen3 32B Q4_K_M + BGE-M3 + Qdrant + LLMRerank
16 GB GPU, balanced	LlamaIndex + Phi-4 14B Q4_K_M + BGE-M3 + Qdrant
12 GB GPU, throughput-first	LlamaIndex + Qwen3 14B Q4_K_M + mxbai-embed-large + Chroma
Agentic RAG with tools	LangChain + LangGraph + Qwen3 32B + BGE-M3 + Qdrant
Production, multi-tenant	LangChain + LangSmith + Qwen3 32B + BGE-M3 + Qdrant cluster

FAQ

Can I run this whole stack on an M-series Mac?

Yes. Qwen3 32B Q4_K_M loads on a 36 GB M3 Max at roughly 22 tokens/sec via Ollama's Metal backend. Faithfulness should match the CUDA results closely because the weights are the same; the practical difference is throughput.

Is Llama 3.3 70B worth the offload pain?

Only if you have a 48 GB+ card (A6000, RTX 6000 Ada) so it stays fully resident. CPU-offloaded at 6 tokens/sec, it cannot service interactive RAG. Qwen3 32B delivers 99% of the quality at 5× the speed.

Why not just use a hosted vector DB like Pinecone?

Latency from a self-hosted Qdrant on the same node is 30-50 ms; Pinecone adds a network round trip of 60-150 ms plus a recurring bill. For a fully local stack, on-device is faster and cheaper.

How often should I re-embed when the corpus changes?

Only the changed documents. Qdrant supports per-point upserts keyed by a stable document ID. A nightly delta job is sufficient for most teams.
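
A nightly delta job can be planned with nothing more than content hashes. The sketch below is one possible approach, not a Qdrant API: it compares today's documents with the hashes stored after the previous run and derives stable point IDs from the document ID and chunk index. The helper names and the "doc_id#chunk" naming scheme are hypothetical. Qdrant point IDs must be unsigned integers or UUIDs, hence the UUID derivation.

```python
import hashlib
import uuid


def content_hash(text):
    """Fingerprint of a document's text, used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def point_id(doc_id, chunk_index):
    """Deterministic UUID so re-embedding a chunk overwrites its old point."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}#{chunk_index}"))


def plan_delta(current_docs, previous_hashes):
    """Return (doc IDs to re-embed and upsert, doc IDs to delete)."""
    to_upsert = [
        doc_id
        for doc_id, text in current_docs.items()
        if previous_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in previous_hashes if doc_id not in current_docs]
    return to_upsert, to_delete


previous = {"a": content_hash("old"), "b": content_hash("same"), "c": content_hash("gone")}
current = {"a": "new", "b": "same", "d": "added"}
print(plan_delta(current, previous))  # (['a', 'd'], ['c'])
```

If a changed document ends up with fewer chunks than before, delete its old points before upserting so that no stale chunks remain.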

Does fine-tuning beat better retrieval?

In our tests, no. Improving retrieval (BGE-M3 + auto-merging + rerank) raised faithfulness more than any fine-tune we attempted on the same base model, and it generalizes across new documents instantly.
