Best Local LLM for RAG — LangChain & LlamaIndex Tested
We benchmarked seven local models across LangChain and LlamaIndex pipelines on a 12,000-document corpus. Here is what actually works in production.
By Mohamed Meguedmi · 11 min read
Key takeaways
- Winner overall: Qwen3 32B Q4_K_M paired with BGE-M3 embeddings hit 87.4% answer faithfulness on our 500-question RAGAS suite — within 2 points of GPT-4o-mini at roughly 1/40th the per-query cost.
- LangChain vs LlamaIndex: LlamaIndex won on retrieval quality out of the box (+6.8 points hit-rate at k=5); LangChain won on orchestration when tool-calling or multi-step agents entered the loop.
- Minimum viable hardware: a 24 GB GPU (RTX 3090, 4090, or 7900 XTX) fits Qwen3 32B Q4_K_M (22.1 GB peak in our runs) alongside a local Qdrant instance. 16 GB cards cap you at 14B-class models such as Phi-4 14B.
- Embeddings matter more than the LLM: swapping nomic-embed-text for BGE-M3 raised end-to-end accuracy by 11.3 points — a bigger jump than upgrading from Llama 3.1 8B to Qwen3 32B.
- Local pays back at volume: at 250k queries/month, a self-hosted stack breaks even against the OpenAI API in about 3 months; see our cost calculator for your own numbers.
Why local RAG, why now
Retrieval-Augmented Generation has matured from a 2023 novelty into the default architecture for any system that needs to answer questions about private documents. The cloud-API path is well-trodden, but three trends pushed local deployments into the mainstream during 2025: open-weight models that come within two points of GPT-4o-mini on faithfulness, embedding models that finally rival OpenAI's text-embedding-3-large, and consumer GPUs cheap enough to amortize in months rather than years.
This guide is not another LangChain-vs-LlamaIndex theory piece. The editorial team built identical pipelines in both frameworks, pointed them at the same 12,000-document corpus (a mixed bag of technical PDFs, support tickets, and Markdown documentation totaling ~38M tokens), and measured what came out the other end. All results below come from real runs reproducible from our methodology page.
The test setup
We ran every benchmark on a single-GPU node — RTX 4090 (24 GB VRAM), Ryzen 9 7950X, 64 GB DDR5 — running Ubuntu 24.04 and Ollama 0.5.4. Vector storage used Qdrant 1.12 in Docker. Each pipeline answered the same 500-question evaluation set scored with RAGAS for faithfulness, answer relevancy, and context precision.
| Component | LangChain pipeline | LlamaIndex pipeline |
|---|---|---|
| Document loader | UnstructuredFileLoader | SimpleDirectoryReader + LlamaParse |
| Chunker | RecursiveCharacterTextSplitter (512/64) | SentenceSplitter (512/64) + SemanticSplitterNodeParser |
| Embeddings | BGE-M3 via HuggingFaceEmbeddings | BGE-M3 via HuggingFaceEmbedding |
| Vector store | Qdrant via langchain-qdrant | Qdrant via llama-index-vector-stores-qdrant |
| Retriever | MultiQueryRetriever + Cohere rerank-3 | VectorIndexRetriever + LLMRerank |
| LLM | ChatOllama | Ollama |
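Both chunkers in the table use a 512/64 setting, meaning windows of 512 units where consecutive windows share 64. The sketch below shows only that sliding-window idea on a list of tokens. It is not how either library splits text: the real splitters also respect separators and sentence boundaries, and whether a unit is a character or a token depends on the splitter. The function name and the toy document are illustrative assumptions.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new chunk starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Hypothetical 1,000-token document; the "tokens" are placeholders for illustration.
doc = [f"tok{i}" for i in range(1000)]
chunks = chunk_with_overlap(doc)
print([len(c) for c in chunks])  # [512, 512, 104]
```

The overlap is what lets a sentence that straddles a boundary appear whole in at least one chunk, at the price of indexing some text twice.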
The models we tested
Seven local models were in the mix, all quantized to Q4_K_M unless noted. Sizes refer to the on-disk GGUF weight files.
- Qwen3 32B (Q4_K_M, 19.8 GB)
- Qwen3 14B (Q4_K_M, 9.0 GB)
- Llama 3.3 70B (Q3_K_M, 32.1 GB — CPU-offloaded tail)
- Llama 3.1 8B (Q4_K_M, 4.7 GB)
- Mistral Small 24B (Q4_K_M, 14.3 GB)
- Gemma 3 12B (Q4_K_M, 7.3 GB)
- Phi-4 14B (Q4_K_M, 8.4 GB)
Retrieval results: LlamaIndex wins out of the box
Before the LLM ever sees a token, retrieval quality is what decides whether your RAG system tells the truth. We measured hit-rate@5 — the share of questions for which at least one truly relevant chunk appears in the top 5 retrieved — across both frameworks with identical embeddings.
| Configuration | Hit-rate@5 | MRR@10 | Avg retrieval latency |
|---|---|---|---|
| LangChain, naive vector search | 71.2% | 0.542 | 38 ms |
| LangChain, MultiQuery + rerank | 83.4% | 0.681 | 284 ms |
| LlamaIndex, naive vector search | 78.0% | 0.611 | 41 ms |
| LlamaIndex, auto-merging + LLMRerank | 90.2% | 0.748 | 312 ms |
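For readers who want to reproduce the two metric columns on their own data, here is a minimal sketch of hit-rate@k and MRR@k. It assumes you already have, per question, a ranked list of retrieved chunk IDs and a set of chunk IDs labeled relevant; the question IDs, chunk IDs, and labels below are made up for illustration.

```python
from typing import Dict, List, Set


def hit_rate_at_k(retrieved: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 5) -> float:
    """Share of questions with at least one relevant chunk in the top k."""
    hits = sum(
        1 for qid, ranked in retrieved.items()
        if any(chunk_id in relevant[qid] for chunk_id in ranked[:k])
    )
    return hits / len(retrieved)


def mrr_at_k(retrieved: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean of 1/rank of the first relevant chunk, counting 0 if none is in the top k."""
    total = 0.0
    for qid, ranked in retrieved.items():
        for rank, chunk_id in enumerate(ranked[:k], start=1):
            if chunk_id in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(retrieved)


# Toy data, not taken from the benchmark.
retrieved = {
    "q1": ["c3", "c7", "c1"],        # first relevant chunk at rank 2
    "q2": ["c9", "c4", "c8"],        # nothing relevant retrieved
    "q3": ["c5", "c2", "c6"],        # first relevant chunk at rank 1
    "q4": ["c0", "c1", "c2", "c3"],  # first relevant chunk at rank 4
}
relevant = {"q1": {"c7"}, "q2": {"c0"}, "q3": {"c5", "c6"}, "q4": {"c3"}}

print(hit_rate_at_k(retrieved, relevant, k=5))  # 0.75
print(mrr_at_k(retrieved, relevant, k=10))      # 0.4375
```

Hit-rate only asks whether a relevant chunk made the cut, while MRR also rewards placing it near the top, which is why the two columns can move by different amounts.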
The 6.8-point gap in naive mode comes down to LlamaIndex's SemanticSplitterNodeParser, which keeps semantically coherent passages together instead of slicing at fixed character counts. Turning on auto-merging (promoting a parent node when several of its child nodes match) together with LLMRerank adds another 12.2 points over LlamaIndex's own naive baseline. That puts it 19 points ahead of stock LangChain and 6.8 points ahead of LangChain with MultiQuery and reranking.
LangChain can match these numbers, but you have to assemble parent-document retrievers and rerankers manually. LlamaIndex ships those patterns as one-liners.
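To make the auto-merging idea concrete, here is a simplified sketch of the promotion step: when enough children of the same parent show up in the retrieved set, the parent replaces them. This is an illustration of the pattern, not LlamaIndex's implementation. The `threshold` parameter, the function name, and the node IDs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def auto_merge(
    hits: List[str],
    parent_of: Dict[str, str],
    children_of: Dict[str, List[str]],
    threshold: float = 0.5,
) -> List[str]:
    """Replace retrieved children with their parent when enough siblings matched."""
    matched = defaultdict(list)
    for child in hits:
        matched[parent_of[child]].append(child)

    merged, emitted = [], set()
    for child in hits:  # walk in ranking order so the output keeps that order
        parent = parent_of[child]
        ratio = len(matched[parent]) / len(children_of[parent])
        node = parent if ratio > threshold else child
        if node not in emitted:
            emitted.add(node)
            merged.append(node)
    return merged


# Hypothetical two-level hierarchy: parent "A" has 3 children, parent "B" has 4.
children_of = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3", "b4"]}
parent_of = {c: p for p, cs in children_of.items() for c in cs}

hits = ["a1", "b2", "a3"]  # 2 of 3 children of A matched, 1 of 4 children of B
print(auto_merge(hits, parent_of, children_of))  # ['A', 'b2']
```

The generator then sees one coherent parent passage instead of two fragments of it, which is the effect the hit-rate numbers above are measuring.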
Generation results: Qwen3 32B is the local champion
With retrieval held constant (LlamaIndex auto-merging + BGE-M3 + LLMRerank), we swapped only the generator. Faithfulness measures whether the answer stays grounded in the retrieved context — the metric that actually catches hallucinations.
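As a rough guide to what the faithfulness number means, the sketch below computes it as the fraction of claims in an answer that the retrieved context supports. It assumes the per-claim verdicts already exist (in practice an LLM judge produces them) and that the suite score is the mean of the per-answer scores. The verdicts shown are hypothetical.

```python
from typing import List


def faithfulness(claim_supported: List[bool]) -> float:
    """Fraction of an answer's claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("answer produced no claims to check")
    return sum(claim_supported) / len(claim_supported)


def suite_faithfulness(per_answer: List[List[bool]]) -> float:
    """Average the per-answer scores across an evaluation set."""
    return sum(faithfulness(v) for v in per_answer) / len(per_answer)


# Hypothetical verdicts for three answers.
verdicts = [
    [True, True, True, True],   # fully grounded: 1.0
    [True, False],              # one unsupported claim: 0.5
    [True, True, False, True],  # 0.75
]
print(f"{suite_faithfulness(verdicts):.1%}")  # 75.0%
```

A single unsupported claim in a short answer costs more than one in a long answer, so terse models are not automatically favored by this metric.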
| Model | Faithfulness | Answer relevancy | Tokens/sec (RTX 4090) | VRAM peak |
|---|---|---|---|---|
| Qwen3 32B Q4_K_M | 87.4% | 0.892 | 34 | 22.1 GB |
| Llama 3.3 70B Q3_K_M | 88.1% | 0.901 | 6 | 23.8 GB + 8 GB RAM |
| Mistral Small 24B Q4_K_M | 84.6% | 0.871 | 41 | 17.2 GB |
| Phi-4 14B Q4_K_M | 82.9% | 0.864 | 58 | 10.4 GB |
| Qwen3 14B Q4_K_M | 82.1% | 0.858 | 62 | 10.9 GB |
| Gemma 3 12B Q4_K_M | 79.4% | 0.832 | 67 | 9.1 GB |
| Llama 3.1 8B Q4_K_M | 76.8% | 0.811 | 89 | 6.3 GB |
| GPT-4o-mini (reference) | 89.3% | 0.908 | ~80 | n/a |
Llama 3.3 70B technically tops the faithfulness chart, but at 6 tokens/sec with CPU offload it is unusable for anything interactive. Qwen3 32B delivers within 0.7 points of the 70B at over 5× the throughput, and it does so on a single 24 GB GPU. If you have only 16 GB to spare, Phi-4 14B is the sharper pick — its training distribution skews toward reasoning over chat, which shows up in cleaner extractive answers.
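The selection logic in the paragraph above can be written down as a small filter over the table: keep the models that fit the VRAM budget and clear a speed floor, then take the most faithful one. The figures are copied from the table; the 20 tokens/sec floor for "interactive" use and the function name are assumptions for illustration.

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    model: str
    faithfulness: float      # percent, from the generation table
    tokens_per_sec: float    # measured on the RTX 4090
    vram_gb: float           # peak VRAM
    needs_cpu_offload: bool = False


RESULTS = [
    Result("Qwen3 32B Q4_K_M", 87.4, 34, 22.1),
    Result("Llama 3.3 70B Q3_K_M", 88.1, 6, 23.8, needs_cpu_offload=True),
    Result("Mistral Small 24B Q4_K_M", 84.6, 41, 17.2),
    Result("Phi-4 14B Q4_K_M", 82.9, 58, 10.4),
    Result("Qwen3 14B Q4_K_M", 82.1, 62, 10.9),
    Result("Gemma 3 12B Q4_K_M", 79.4, 67, 9.1),
    Result("Llama 3.1 8B Q4_K_M", 76.8, 89, 6.3),
]


def pick_model(vram_budget_gb: float, min_tokens_per_sec: float = 20.0) -> Optional[Result]:
    """Most faithful model that fits in VRAM without offload and meets the speed floor."""
    candidates = [
        r for r in RESULTS
        if r.vram_gb <= vram_budget_gb
        and r.tokens_per_sec >= min_tokens_per_sec
        and not r.needs_cpu_offload
    ]
    return max(candidates, key=lambda r: r.faithfulness, default=None)


for budget in (24, 16, 8):
    choice = pick_model(budget)
    print(budget, "GB ->", choice.model if choice else "nothing fits")
```

With these inputs the filter picks Qwen3 32B at 24 GB, Phi-4 14B at 16 GB, and Llama 3.1 8B at 8 GB. Raising or lowering the speed floor is the knob to adjust if your workload is batch rather than interactive.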
Embeddings: the unsung lever
The largest single jump in end-to-end accuracy came not from the LLM but from the embedding model. We tested four open-weight options against the same Qwen3 32B generator.
| Embedding model | Dim | Hit-rate@5 | End-to-end faithfulness |
|---|---|---|---|
| nomic-embed-text v1.5 | 768 | 74.1% | 76.1% |
| mxbai-embed-large | 1024 | 81.9% | 82.7% |
| BGE-M3 | 1024 | 90.2% | 87.4% |
| Snowflake Arctic Embed L v2.0 | 1024 | 89.7% | 87.0% |
BGE-M3 (BAAI/bge-m3) supports dense, sparse, and ColBERT-style multi-vector retrieval out of one model, which is why it tops the table. Switching from nomic-embed-text to BGE-M3 raised faithfulness by 11.3 points — a bigger swing than going from Llama 3.1 8B all the way up to Qwen3 32B. If you remember one thing from this guide, make it that.
Cost: when does local pay back?
A comparable hosted setup on OpenAI costs roughly $0.000150 per 1k input tokens and $0.000600 per 1k output tokens on gpt-4o-mini, plus $0.000130 per 1k tokens on text-embedding-3-large. The local stack costs about $3,200 up front (RTX 4090 at $1,799 retail plus a Ryzen 9 host at ~$1,400 fully built; Qdrant is free) and about $0.12/hour in electricity at US average rates, or roughly $86/month running around the clock.
| Volume | OpenAI monthly | Local monthly (power) | Local break-even |
|---|---|---|---|
| 10k queries/month | $48 | $86 | Never (OpenAI cheaper) |
| 50k queries/month | $240 | $86 | ~20.8 months |
| 250k queries/month | $1,200 | $86 | ~2.9 months |
| 1M queries/month | $4,800 | $86 | ~0.7 months |
Plug your own assumptions into the BestLLMfor cost calculator: the break-even point is sensitive to query length distribution, not just query count. For European readers, our French sister site quelllm.fr publishes equivalent figures in EUR with EU electricity rates.
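If you would rather run the numbers yourself, a minimal break-even sketch follows. It assumes the simplest possible model: hardware cost divided by the monthly saving over the API bill, with the machine powered around the clock. That formula is an assumption on our part about how to read the table, though it does reproduce the 250k and 1M rows. Hardware and electricity figures are the ones quoted in this section.

```python
from typing import Optional

HARDWARE_USD = 1_799 + 1_400   # RTX 4090 plus Ryzen 9 host, as quoted above
POWER_USD_PER_HOUR = 0.12      # electricity figure quoted above
HOURS_PER_MONTH = 24 * 30      # assumes the box runs 24/7


def local_monthly_cost() -> float:
    """Recurring cost of the local stack (electricity only)."""
    return POWER_USD_PER_HOUR * HOURS_PER_MONTH


def break_even_months(api_monthly_usd: float) -> Optional[float]:
    """Months until the hardware is paid off; None if the API is cheaper to run."""
    saving = api_monthly_usd - local_monthly_cost()
    if saving <= 0:
        return None
    return HARDWARE_USD / saving


for label, api_bill in [("10k", 48), ("250k", 1_200), ("1M", 4_800)]:
    months = break_even_months(api_bill)
    print(label, "never" if months is None else f"~{months:.1f} months")
```

The model ignores resale value, maintenance time, and idle hours, all of which shift the answer, so treat the output as a first approximation.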
Framework verdict: pick LlamaIndex first, reach for LangChain when agents arrive
Our recommendation after eight weeks of pipeline-building: start in LlamaIndex. The retrieval primitives — sentence-window, auto-merging, hierarchical node parsers — are the difference between a 78% and a 90% hit-rate, and they exist as named constructors instead of recipes you have to assemble.
Move to LangChain when one of these enters the picture: multi-step tool-calling agents, complex stateful chains, observability via LangSmith, or a production deployment that benefits from LangGraph's checkpointing. The two libraries are not exclusive — both expose the same Qdrant collection, so you can read with LlamaIndex and orchestrate with LangChain in the same app.
A reproducible install path
- Install Ollama: `curl -fsSL https://ollama.com/install.sh | sh`
- Pull the models: `ollama pull qwen3:32b` and `ollama pull bge-m3`
- Start Qdrant: `docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant:v1.12.4`
- Create a Python 3.12 venv and install `llama-index llama-index-llms-ollama llama-index-embeddings-ollama llama-index-vector-stores-qdrant ragas`
- Index your corpus with `SimpleDirectoryReader`, wrap with `VectorStoreIndex.from_documents`, then query through a `RetrieverQueryEngine` with `LLMRerank` in the postprocessor chain.
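The last step can be sketched roughly as follows. This is a minimal outline, not the exact benchmark code: it assumes Ollama and Qdrant are running locally on their default ports with the models already pulled, and the collection name, corpus path, `top_k`, and `rerank_top_n` values are placeholders to adjust.

```python
OLLAMA_URL = "http://localhost:11434"  # Ollama's default port
QDRANT_URL = "http://localhost:6333"   # matches the docker port mapping in the steps above
COLLECTION = "rag_corpus"              # hypothetical collection name
CORPUS_DIR = "./corpus"                # hypothetical path to your documents


def build_query_engine(top_k: int = 10, rerank_top_n: int = 5):
    """Index CORPUS_DIR into Qdrant and return a query engine with LLM reranking."""
    # Imported inside the function so this file loads even before the packages are installed.
    from qdrant_client import QdrantClient
    from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
    from llama_index.core.postprocessor import LLMRerank
    from llama_index.core.query_engine import RetrieverQueryEngine
    from llama_index.embeddings.ollama import OllamaEmbedding
    from llama_index.llms.ollama import Ollama
    from llama_index.vector_stores.qdrant import QdrantVectorStore

    # Local generator and embedding model, both served by Ollama.
    Settings.llm = Ollama(model="qwen3:32b", base_url=OLLAMA_URL, request_timeout=300.0)
    Settings.embed_model = OllamaEmbedding(model_name="bge-m3", base_url=OLLAMA_URL)

    # Store vectors in the local Qdrant instance instead of in memory.
    vector_store = QdrantVectorStore(client=QdrantClient(url=QDRANT_URL), collection_name=COLLECTION)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    documents = SimpleDirectoryReader(CORPUS_DIR).load_data()
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

    # Retrieve top_k candidates, then let the LLM rerank them down to rerank_top_n.
    retriever = index.as_retriever(similarity_top_k=top_k)
    return RetrieverQueryEngine.from_args(
        retriever,
        node_postprocessors=[LLMRerank(top_n=rerank_top_n)],
    )
```

Calling `build_query_engine().query("your question")` then returns an answer grounded in the reranked chunks. Note that this sketch re-indexes the corpus on every call; a real deployment would index once and reconnect to the existing collection afterwards.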
If you want a ready-made bridge between your editor and a local model, the open-source quelllm-mcp server exposes our benchmark results through the Model Context Protocol, and the underlying numbers are downloadable as JSON under CC BY 4.0 from the BestLLMfor public API.
Final recommendation
| Use case | Recommended stack |
|---|---|
| 24 GB GPU, accuracy-first RAG | LlamaIndex + Qwen3 32B Q4_K_M + BGE-M3 + Qdrant + LLMRerank |
| 16 GB GPU, balanced | LlamaIndex + Phi-4 14B Q4_K_M + BGE-M3 + Qdrant |
| 12 GB GPU, throughput-first | LlamaIndex + Qwen3 14B Q4_K_M + mxbai-embed-large + Chroma |
| Agentic RAG with tools | LangChain + LangGraph + Qwen3 32B + BGE-M3 + Qdrant |
| Production, multi-tenant | LangChain + LangSmith + Qwen3 32B + BGE-M3 + Qdrant cluster |
FAQ
Can I run this whole stack on an M-series Mac?
Yes. Qwen3 32B Q4_K_M loads on a 36 GB M3 Max at roughly 22 tokens/sec via Ollama's Metal backend. Faithfulness scores are identical to CUDA — the model is the same — only throughput differs.
Is Llama 3.3 70B worth the offload pain?
Only if you have a 48 GB+ card (A6000, RTX 6000 Ada) so it stays fully resident. CPU-offloaded at 6 tokens/sec, it cannot service interactive RAG. Qwen3 32B delivers 99% of the quality at 5× the speed.
Why not just use a hosted vector DB like Pinecone?
Latency from a self-hosted Qdrant on the same node is 30-50 ms; Pinecone adds a network round trip of 60-150 ms plus a recurring bill. For a fully local stack, on-device is faster and cheaper.
How often should I re-embed when the corpus changes?
Only the changed documents. Qdrant supports per-point upserts keyed by a stable document ID. A nightly delta job is sufficient for most teams.
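A nightly delta job of this kind can be planned with nothing more than content hashes and deterministic IDs. The sketch below assumes you keep a manifest mapping each document path to the hash of its last-embedded content; the manifest format, function names, and file names are hypothetical. It uses one ID per document for brevity, whereas a chunked corpus would derive one ID per chunk (for example from the path plus a chunk index).

```python
import hashlib
import uuid
from typing import Dict, List, Tuple


def point_id(doc_path: str) -> str:
    """Deterministic UUID, so the same document always maps to the same point."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, doc_path))


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_delta(corpus: Dict[str, str], manifest: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (paths to re-embed and upsert, paths whose points should be deleted)."""
    to_upsert = [p for p, text in corpus.items() if manifest.get(p) != content_hash(text)]
    to_delete = [p for p in manifest if p not in corpus]
    return to_upsert, to_delete


# Hypothetical state: yesterday's manifest versus today's corpus.
manifest = {
    "a.md": content_hash("alpha"),
    "b.md": content_hash("beta"),
    "old.md": content_hash("obsolete"),
}
corpus = {"a.md": "alpha", "b.md": "beta v2", "c.md": "gamma"}

to_upsert, to_delete = plan_delta(corpus, manifest)
print(to_upsert)  # ['b.md', 'c.md']
print(to_delete)  # ['old.md']
```

Because the ID is derived from the path, upserting a changed document overwrites its old vector rather than leaving a stale duplicate behind.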
Does fine-tuning beat better retrieval?
In our tests, no. Improving retrieval (BGE-M3 + auto-merging + rerank) raised faithfulness more than any fine-tune we attempted on the same base model, and it generalizes across new documents instantly.