Best Local Embedding Models for RAG: Self-Hosted
Eight embedding models tested on retrieval quality, throughput, and VRAM. Verdicts for laptops, single-GPU boxes, and CPU-only servers.
By Mohamed Meguedmi · 11 min read
Key takeaways
- Best overall (single GPU): Qwen3-Embedding-8B, with the top MTEB score among open weights, 4096-dim, 32k context.
- Best value: BGE-M3, multilingual, dense+sparse+ColBERT in one model, runs on 4 GB VRAM.
- Best for CPU-only servers: nomic-embed-text-v1.5 at INT8, with 137M params and Matryoshka dims down to 64.
- Best for long documents: Jina-embeddings-v3 with 8192 token context and task-specific LoRAs.
- Avoid: OpenAI text-embedding-3-small if data sovereignty matters. There is no quality reason left to pay per token in 2026.
Most RAG failures are not LLM failures. They are retrieval failures, and retrieval starts with the embedding model. The good news for 2026: the open-weight gap on the MTEB leaderboard has closed. The top open models now beat text-embedding-3-large on average across the 56-task benchmark, and they fit on hardware you already own.
The editorial team at BestLLMfor benchmarked eight production-ready open-weight embedding models on a llama.cpp / Infinity / sentence-transformers stack, with workloads spanning English technical docs, multilingual legal contracts, and a 50k-chunk code corpus. This guide is the verdict.
What "best" means for local embeddings in 2026
Three dimensions matter, in order:
- Retrieval quality — typically measured by nDCG@10 on MTEB Retrieval, plus your own eval set. A 2-point gain on MTEB is roughly a 5-8% gain in answer correctness downstream.
- Throughput — tokens per second per GPU watt. Matters when you re-index 10M chunks or run real-time ingestion.
- Operational fit — context length, embedding dimension (storage cost), and license. A 4096-dim vector takes 4× the disk of a 1024-dim one and 4× the RAM in your vector DB.
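For readers building their own eval set, here is a minimal sketch of how nDCG@10 can be computed for a single query. It assumes binary or graded relevance labels listed in the order the retriever returned the documents, and it assumes the list contains every judged document for that query (otherwise the ideal ranking is underestimated). The function names are illustrative and do not come from the MTEB codebase.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the returned order divided by DCG of the ideal order."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents were judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Labels in retrieval order: 1 = relevant, 0 = not relevant.
perfect = ndcg_at_k([1, 1, 0, 0])   # relevant documents ranked first
swapped = ndcg_at_k([0, 1, 1, 0])   # same documents, worse order
```

Averaging this value over all queries in the eval set gives a single number that can be compared across embedding models.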
Ignore parameter count in isolation. Qwen3-Embedding-0.6B outperforms several 7B models from 2024. The architecture and the training data matter far more than the size.
The 2026 short list
1. Qwen3-Embedding-8B — the new top of the leaderboard
Alibaba's Qwen3-Embedding-8B is the highest-scoring open-weight model on MTEB-v2 multilingual at the time of writing, with 70.58 average. It supports 100+ languages, 32k context, and 4096-dim output (Matryoshka-truncatable to 1024 or 512).
Trade-offs: 16 GB VRAM at FP16, ~9 GB at Q8. Inference is roughly 1,200 tokens/second on an RTX 4090. The 4B and 0.6B siblings are also excellent and scale down cleanly.
2. BGE-M3 — the Swiss army knife
BAAI's BGE-M3 remains the most practical choice for hybrid retrieval. A single model emits dense, sparse (lexical), and multi-vector (ColBERT-style) representations from one forward pass. 8192 token context, 100+ languages, 568M parameters, MIT license.
It does not top MTEB anymore, but for hybrid dense+sparse RAG pipelines — which still outperform pure-dense in domain-specific corpora — it remains the default recommendation.
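One common way to combine a dense ranking with a sparse one is reciprocal rank fusion (RRF). The article does not say which fusion method was used in its tests, so treat the sketch below as an assumption: the function name, the document ids, and the constant k=60 (a conventional default) are all illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_hits = ["doc3", "doc1", "doc7"]    # hypothetical dense top-3
sparse_hits = ["doc1", "doc9", "doc3"]   # hypothetical sparse/BM25 top-3
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine similarities and lexical scores live on different scales.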
3. Nomic-embed-text-v1.5 — the CPU king
nomic-embed-text-v1.5 is a 137M-parameter, Apache 2.0 model with fully reproducible training data and support for Matryoshka representation learning (you can truncate from 768 down to 64 dims with minor quality loss). At INT8 with llama.cpp it reaches about 1,940 tokens/second on a Ryzen 9 7950X, as shown in the throughput table in this guide.
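Matryoshka truncation itself is simple: keep the leading components and re-normalize. The sketch below shows that generic step on a toy vector, assuming cosine similarity is used downstream. Some model cards describe extra steps before slicing (for example a normalization layer), so check the card for the model you deploy; the function name here is illustrative.

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` components and L2-normalize the result."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    # Re-normalize so cosine similarity and dot product stay equivalent.
    return [x / norm for x in head] if norm > 0 else head

full = [0.6, 0.0, 0.8, 0.0]          # toy 4-dim vector standing in for a 768-dim one
short = truncate_embedding(full, 2)  # keeps [0.6, 0.0], then rescales to unit length
```

Queries and passages must be truncated to the same dimension, otherwise their vectors are not comparable.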
4. Jina-embeddings-v3 — task-aware, long context
Jina v3 ships with five task-specific LoRA adapters (retrieval.query, retrieval.passage, classification, separation, text-matching) that swap in at inference time. 570M params, 8192 context, 1024 dims. Strong on long-document retrieval.
Benchmark table — quality and footprint
| Model | Params | Dims | Context | MTEB-v2 avg | License |
|---|---|---|---|---|---|
| Qwen3-Embedding-8B | 8B | 4096 | 32,768 | 70.58 | Apache 2.0 |
| Qwen3-Embedding-4B | 4B | 2560 | 32,768 | 69.45 | Apache 2.0 |
| Qwen3-Embedding-0.6B | 0.6B | 1024 | 32,768 | 64.33 | Apache 2.0 |
| BGE-M3 | 568M | 1024 | 8,192 | 59.84 | MIT |
| Jina-embeddings-v3 | 570M | 1024 | 8,192 | 58.58 | CC BY-NC 4.0 |
| mxbai-embed-large-v1 | 335M | 1024 | 512 | 57.78 | Apache 2.0 |
| nomic-embed-text-v1.5 | 137M | 768 | 8,192 | 56.21 | Apache 2.0 |
| all-MiniLM-L6-v2 | 23M | 384 | 512 | 49.12 | Apache 2.0 |
Scores from public MTEB-v2 leaderboard as of May 2026. Jina v3 is non-commercial; commercial users need a paid license.
Throughput and VRAM — measured on real hardware
All numbers below come from running each model behind Infinity with FP16 (where possible), batch size 32, 512-token chunks. See our methodology for the full harness.
| Model | RTX 4090 (tok/s) | RTX 3060 12GB (tok/s) | CPU (Ryzen 9 7950X, tok/s) | VRAM FP16 |
|---|---|---|---|---|
| Qwen3-Embedding-8B | 1,180 | OOM | 110 | 16.4 GB |
| Qwen3-Embedding-4B | 2,440 | 620 | 240 | 8.2 GB |
| Qwen3-Embedding-0.6B | 9,800 | 3,100 | 1,150 | 1.4 GB |
| BGE-M3 | 7,200 | 2,400 | 880 | 2.1 GB |
| Jina-embeddings-v3 | 6,900 | 2,200 | 820 | 2.3 GB |
| nomic-embed-text-v1.5 | 14,500 | 5,800 | 1,940 | 0.6 GB |
| all-MiniLM-L6-v2 | 38,000 | 14,200 | 4,100 | 0.1 GB |
CPU numbers use INT8 quantization via llama.cpp bindings. Quality loss versus FP16 is under 1 MTEB point for every model tested.
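To turn the throughput table into a planning number, the following back-of-envelope sketch estimates how long a full re-index would take. It assumes the tok/s figures above are sustained for the whole job on a single device and ignores I/O, tokenization, and vector DB write time, so real runs will be slower. The corpus size of 10M chunks is an example, not a measured workload.

```python
def indexing_hours(num_chunks, tokens_per_chunk, tokens_per_second):
    """Hours needed to embed a corpus at a constant token throughput."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / tokens_per_second / 3600

# RTX 4090 throughput figures copied from the table above.
rtx_4090_tok_per_s = {
    "Qwen3-Embedding-8B": 1180,
    "BGE-M3": 7200,
    "nomic-embed-text-v1.5": 14500,
}

for model, tok_per_s in rtx_4090_tok_per_s.items():
    hours = indexing_hours(10_000_000, 512, tok_per_s)
    print(f"{model}: {hours:.0f} h for 10M chunks of 512 tokens")
```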
How to pick — a decision tree
- If you have a single GPU with ≥16 GB VRAM and quality is paramount: Qwen3-Embedding-8B. Use Matryoshka truncation to 1024 dims to cut your vector DB cost by 4× with under 1-point quality loss.
- If you have 8-12 GB VRAM: Qwen3-Embedding-4B at FP16, or BGE-M3 if you need hybrid dense+sparse retrieval out of the box.
- If you are CPU-only (and you should not be ashamed, since most production RAG runs fine on CPU): nomic-embed-text-v1.5 at INT8 via llama.cpp. Truncate to 256 dims for storage savings.
- If your documents are 5k+ tokens and you chunk aggressively: Jina-embeddings-v3 with the retrieval.passage LoRA. Check the license: it is non-commercial.
- If you embed 100M+ chunks: all-MiniLM-L6-v2 is still legitimate. 384 dims, near-zero VRAM, ~38k tok/s on a 4090. You sacrifice 15-20 MTEB points but gain an order of magnitude in throughput.
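The rules above can be written down as a small function, which is handy if model choice is part of a deployment script. This is one possible encoding, not the guide's official logic: the order in which the branches are checked, the parameter names, and the fallback to the CPU pick for GPUs under 8 GB are assumptions made for illustration.

```python
def pick_embedding_model(vram_gb=0.0, needs_hybrid=False, long_docs=False,
                         commercial=True, chunk_count=0):
    """Return a model name following the decision rules in this section."""
    if chunk_count >= 100_000_000:
        return "all-MiniLM-L6-v2"          # massive scale: throughput first
    if long_docs and not commercial:
        return "Jina-embeddings-v3"        # non-commercial license only
    if vram_gb >= 16:
        return "Qwen3-Embedding-8B"
    if vram_gb >= 8:
        return "BGE-M3" if needs_hybrid else "Qwen3-Embedding-4B"
    return "nomic-embed-text-v1.5"         # CPU-only (assumed fallback for small GPUs)

choice = pick_embedding_model(vram_gb=12, needs_hybrid=True)
```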
Serving stack — Ollama, Infinity, or llama.cpp?
For production RAG with concurrent requests, Infinity is the editorial team's recommendation. It supports OpenAI-compatible endpoints, batches dynamically, and handles ONNX/Torch/CTranslate2 backends transparently.
Ollama is simpler and ships pre-quantized embedding models, but its batching is weaker — expect 30-40% lower throughput at high concurrency. Fine for personal use, not for an indexing pipeline.
llama.cpp with --embeddings is the right answer for CPU-only deployments and edge boxes. The new llama-embedding binary added in early 2026 supports continuous batching on CPU.
Minimal install — Infinity with Qwen3-Embedding-4B
pip install "infinity-emb[all]"
infinity_emb v2 \
--model-id Qwen/Qwen3-Embedding-4B \
--batch-size 32 \
--port 7997
curl http://localhost:7997/embeddings \
-H "Content-Type: application/json" \
-d '{"input":["hello world"],"model":"Qwen/Qwen3-Embedding-4B"}'

Drop-in compatible with the OpenAI Python SDK: point base_url at http://localhost:7997 and your LangChain / LlamaIndex code works unchanged. Estimate your full RAG cost (embeddings, vector DB, and the generation model) with our cost calculator.
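The same request as the curl call above can be sent from Python with only the standard library. This is a sketch under two assumptions: the server from the install snippet is running on port 7997, and the response follows the OpenAI embeddings shape (a `data` list whose items carry `index` and `embedding`). The helper names `build_request` and `embed` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7997"        # port from the install snippet
MODEL_ID = "Qwen/Qwen3-Embedding-4B"

def build_request(texts, model=MODEL_ID, base_url=BASE_URL):
    """Build the POST request for the /embeddings endpoint."""
    payload = json.dumps({"input": list(texts), "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(texts, **kwargs):
    """Send texts to the server and return one vector per input, in input order."""
    with urllib.request.urlopen(build_request(texts, **kwargs), timeout=30) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style body: {"data": [{"index": 0, "embedding": [...]}, ...]}
    rows = sorted(body["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]
```

With the server up, `embed(["hello world"])` should return a list containing one vector. Sending several texts per call lets the server batch them.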
Common mistakes that wreck retrieval quality
- Ignoring the query/passage asymmetry. Qwen3 and Jina expect a task prefix on the query side. Skipping it costs 3-5 MTEB points.
- Embedding raw HTML or markdown noise. Strip boilerplate before embedding. Garbage tokens push vectors toward the corpus centroid and destroy ranking.
- Over-chunking. 512-token chunks were a 2023 habit forced by short-context models. With 8k-32k context embeddings, 1024-2048 tokens per chunk gives better recall on long passages.
- Pure dense retrieval on domain-specific corpora. Add BM25 or use BGE-M3's sparse output. Hybrid retrieval reliably adds 5-10% nDCG.
- Re-embedding on every change. Use content hashing so that only new or modified chunks are embedded. Embedding cost is dominated by re-indexing churn, not first-time indexing.
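The content-hashing advice in the last item can be sketched as follows. The idea is to derive a key from the normalized chunk text plus the model id, and to skip any chunk whose key is already stored. The in-memory set stands in for whatever store you use (a column in the vector DB, a key-value store); the function names and the whitespace normalization are assumptions for illustration.

```python
import hashlib

def content_key(text, model_id):
    """Stable key for a chunk: changes if the text or the embedding model changes."""
    normalized = " ".join(text.split())  # collapse whitespace-only differences
    return hashlib.sha256(f"{model_id}\n{normalized}".encode("utf-8")).hexdigest()

def chunks_to_embed(chunks, model_id, seen_keys):
    """Return (key, chunk) pairs not yet in seen_keys, and record them as seen."""
    pending = []
    for chunk in chunks:
        key = content_key(chunk, model_id)
        if key not in seen_keys:
            seen_keys.add(key)
            pending.append((key, chunk))
    return pending

seen = set()
first_pass = chunks_to_embed(["alpha beta", "alpha   beta", "gamma"], "demo-model", seen)
second_pass = chunks_to_embed(["alpha beta", "gamma"], "demo-model", seen)
```

Including the model id in the key means that switching embedding models invalidates every entry, which is the desired behavior since vectors from different models are not comparable.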
Where this guide's data lives
The benchmark numbers in this article are published under CC BY 4.0 via the BestLLMfor public API at api.bestllmfor.com/v1/embeddings/benchmarks. You can also self-host the same measurement harness — it powers our sister site quelllm.fr — through the open-source quelllm-mcp server on GitHub. See about for our independence policy.
Verdict
| Scenario | Pick | Why |
|---|---|---|
| Single RTX 4090, quality-first English/multilingual RAG | Qwen3-Embedding-8B | Top MTEB, 32k context, Matryoshka dims |
| RTX 3060/4060 12 GB, balanced workload | Qwen3-Embedding-4B | 620 tok/s on an RTX 3060 (2,440 on an RTX 4090), 8.2 GB VRAM at FP16, near-flagship quality |
| Hybrid dense+sparse retrieval, multilingual | BGE-M3 | Three retrieval modes in one model, MIT |
| CPU-only or edge deployment | nomic-embed-text-v1.5 | 1,940 tok/s CPU, Matryoshka to 64 dims |
| Long-document RAG, non-commercial | Jina-embeddings-v3 | Task LoRAs, 8k context, strong long-doc scores |
| Massive-scale (100M+ chunks) | all-MiniLM-L6-v2 | 384 dims, 38k tok/s, smallest footprint |
The local embeddings story in 2026 is unambiguous: there is no quality reason to send your documents to a paid API anymore. Pick from the table above, run Infinity behind your vector DB, and the only remaining bottleneck is the quality of your chunking strategy — which is a problem we will tackle in the next guide.
Frequently asked questions
Is Qwen3-Embedding-8B really better than OpenAI text-embedding-3-large?
On public MTEB-v2 multilingual, yes — 70.58 vs roughly 64.6 for text-embedding-3-large. On your specific domain, run your own eval. The gap is large enough that on most corpora the open model wins outright, and you keep your data on-premises.
Can I run a good embedding model without a GPU?
Yes. nomic-embed-text-v1.5 at INT8 reaches ~1,940 tokens/second on a modern desktop CPU, which is plenty for ingestion pipelines under 10M chunks. BGE-M3 also runs on CPU at ~880 tok/s if you need its hybrid features.
What embedding dimension should I use?
For most RAG use cases, 768-1024 dimensions is the sweet spot. Use Matryoshka-trained models (Qwen3, Nomic, mxbai) so you can truncate later without re-embedding. Going below 256 dims loses meaningful recall on diverse corpora.
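The storage side of this trade-off is easy to estimate. The sketch below assumes float32 values (4 bytes each) and counts raw vectors only, so index structures and metadata in a real vector DB will add to these figures. The 10M-vector corpus is an example size.

```python
def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

for dims in (4096, 1024, 768, 256):
    print(f"{dims} dims: {index_size_gb(10_000_000, dims):.2f} GB")
# 4096 dims: 163.84 GB
# 1024 dims: 40.96 GB
# 768 dims: 30.72 GB
# 256 dims: 10.24 GB
```

The 4096 to 1024 step is the 4× saving mentioned earlier in this guide. Storing values as float16 or int8 would cut these numbers further, at some cost in precision.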
Should I use a reranker on top?
Almost always yes. A cross-encoder reranker like bge-reranker-v2-m3 on the top 50 candidates adds 8-15% nDCG@10 for a few milliseconds per query. It is the highest-leverage component of a RAG pipeline after the embedding model itself.
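Structurally, reranking is a second scoring pass over the candidates the embedding model returned. The sketch below shows that shape with a pluggable scoring function. The `token_overlap` scorer is a toy stand-in so the example runs without dependencies; in practice `score_fn` would wrap a cross-encoder such as the one named above, and that wiring is not shown here.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Score each candidate against the query and keep the top_k best."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep retrieval order
    return [doc for _, doc in scored[:top_k]]

def token_overlap(query, doc):
    """Toy scorer: fraction of query tokens that appear in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0

candidates = [
    "how to bake bread",
    "matryoshka embedding truncation explained",
    "truncation of long documents",
]
top = rerank("matryoshka truncation", candidates, token_overlap, top_k=2)
```

Keeping the scorer as a parameter makes it easy to compare rerankers on the same candidate lists during evaluation.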
Does fine-tuning the embedding model help?
Fine-tuning BGE-M3 or Qwen3-Embedding-0.6B on a few thousand domain query/passage pairs typically adds 4-8 MTEB points on that domain. It is worth doing once your retrieval-eval set is stable, not before.
What about Snowflake Arctic Embed or Stella?
Both are competitive and Apache 2.0. Arctic-embed-l-v2.0 sits between BGE-M3 and Qwen3-4B on MTEB. We omitted them from the verdict table to keep recommendations actionable, not because they are bad picks.