Model fiche
Pleias-RAG 1B
By PleIAs · France
chat
fr
small
Overview
A 1.2B-parameter model from PleIAs, specialized for RAG, with built-in citation and grounding behavior. Beats most small language models under 4B parameters on HotPotQA.
When to pick this model
- You're deploying RAG on tight hardware budgets or edge devices
- You need clean citations and grounding from a small model
- You're handling structured Q&A where source attribution matters
- You want a defensible audit trail for regulated RAG deployments
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 0.8 GB |
| Q5_K_M | 1 GB |
| Q8_0 | 1.5 GB |
| FP16 (no quantization) | 2.5 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
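As a rough illustration of how the table can be used, the sketch below picks the largest quantization that fits a given VRAM budget. The figures are copied from the table; the function name `pick_quantization` and the `headroom_gb` parameter are hypothetical, introduced only for this example.

```python
from typing import Optional

# VRAM figures (GB) copied from the table above, ordered smallest to largest.
VRAM_GB = {
    "Q4_K_M": 0.8,
    "Q5_K_M": 1.0,
    "Q8_0": 1.5,
    "FP16": 2.5,
}


def pick_quantization(available_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the largest quantization that fits, or None if nothing fits.

    headroom_gb is a hypothetical safety margin, e.g. for longer contexts.
    """
    budget = available_gb - headroom_gb
    fitting = [name for name, need in VRAM_GB.items() if need <= budget]
    # Dicts keep insertion order, so the last fitting entry is the largest one.
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    print(pick_quantization(2.0))
    print(pick_quantization(0.5))
```

Larger quantizations generally preserve more of the original model quality, which is why this sketch prefers the biggest one that fits rather than the smallest.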
Strengths
- Built-in citation and grounding in RAG responses
- Outperforms most small language models under 4B on HotPotQA
- Runs on lightweight hardware
- Apache 2.0
Limitations
- Context window of only ~2K
- No official Ollama tag
- Specialized for RAG, not a general chat model
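The short context window means retrieved passages usually have to be trimmed before they reach the model. The sketch below shows one simple way to do that, assuming the ~2K figure is in tokens and that passages arrive sorted best-first from the retriever. The helper names, the 600-token reserve for the question and answer, and the 4-characters-per-token estimate are all assumptions for illustration; a real deployment would count tokens with the model's own tokenizer.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    # Swap in the model's tokenizer for accurate counts.
    return max(1, len(text) // 4)


def fit_passages(
    passages: List[str],
    context_tokens: int = 2000,   # assumed reading of the "~2K" limit
    reserved_tokens: int = 600,   # hypothetical reserve for question + answer
    count: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Keep the leading passages that fit in the remaining token budget."""
    budget = context_tokens - reserved_tokens
    kept: List[str] = []
    used = 0
    for passage in passages:  # assumed sorted best-first by the retriever
        cost = count(passage)
        if used + cost > budget:
            break  # stop at the first passage that would overflow the budget
        kept.append(passage)
        used += cost
    return kept
```

Stopping at the first overflow keeps the ranking intact; skipping the oversized passage and trying smaller ones further down is an alternative if recall matters more than order.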
Architecture & training
Architecture: Dense 1.2B · fine-tuned for RAG with built-in citations/grounding
Training: Based on Pleias 1.2B.
Verdict
The most efficient small open model for production RAG with citations.
Quick start
# Hugging Face: PleIAs/Pleias-RAG-1B (GGUF: PleIAs/Pleias-RAG-1B-gguf)
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
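A minimal sketch of loading the Hugging Face repository with the `transformers` library is shown below. It assumes the checkpoint loads through the standard `AutoModelForCausalLM` and `AutoTokenizer` classes, which this fiche does not confirm. The fiche also does not describe the prompt format the model expects for queries and sources, so the `generate` helper (a hypothetical name) takes an already formatted prompt; check the model card for the exact template.

```python
MODEL_ID = "PleIAs/Pleias-RAG-1B"  # repository name taken from the fiche


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the model and complete an already formatted prompt.

    Weights are downloaded on the first call. The prompt template is
    NOT specified here; see the model card for the expected format.
    """
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Slice off the prompt tokens so only newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens)
```

Reloading the model on every call keeps the sketch short; a real service would load the tokenizer and model once and reuse them.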