SmolLM2 1.7B Instruct
By Hugging Face · France
Overview
Hugging Face's 1.7B-parameter instruct model, released under Apache 2.0 and trained on 11T tokens. Beats Qwen2.5-1.5B by roughly 6 points on MMLU-Pro, making it a top pick at the sub-2B tier.
When to pick this model
- On-device assistants where every megabyte counts
- Edge inference on CPUs or low-end GPUs
- Building permissively licensed downstream products
- Fine-tuning experiments on a single consumer GPU
- Latency-critical autocomplete or classification tasks
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 1.2 GB |
| Q5_K_M | 1.5 GB |
| Q8_0 | 2.2 GB |
| FP16 (no quantization) | 3.5 GB |
VRAM figures include model weights, the KV cache for a typical 8K-token context, and ~600 MB of runtime overhead (Ollama / llama.cpp baseline). Add headroom for longer contexts.
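The table above can be turned into a simple lookup when deciding which build to pull. The following is a minimal sketch, not an official tool: the VRAM numbers are copied from the table, while the function name `pick_quantization` and the default `headroom_gb` value are hypothetical choices made for illustration.

```python
# VRAM requirements in GB, copied from the quantization table above.
VRAM_GB = {
    "Q4_K_M": 1.2,
    "Q5_K_M": 1.5,
    "Q8_0": 2.2,
    "FP16": 3.5,
}


def pick_quantization(available_gb, headroom_gb=0.5):
    """Return the highest-precision quantization that fits, or None.

    headroom_gb is an assumed safety margin for longer contexts and
    other processes sharing the GPU; tune it for your own setup.
    """
    budget = available_gb - headroom_gb
    # Keep (requirement, name) pairs so max() selects the largest build that fits.
    fitting = [(need, name) for name, need in VRAM_GB.items() if need <= budget]
    if not fitting:
        return None
    return max(fitting)[1]
```

For example, `pick_quantization(3.0)` leaves a 2.5 GB budget and selects `Q8_0`, while a 1 GB card returns `None` because even Q4_K_M does not fit once headroom is reserved.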
Published benchmark scores
| Benchmark | Score |
|---|---|
| BFCL (function calling) | 27 |
Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.
Strengths
- Best-in-class quality for its size on MMLU-Pro
- Clean Apache 2.0 license with no commercial strings
- Massive 11T-token training corpus for a small model
- One of the most downloaded small models on Hugging Face
Limitations
- English-centric, weak on non-English languages
- 8K context window is tight for modern RAG workflows
- BFCL function-calling score of 27% trails larger peers
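To make the context limitation concrete, here is a rough sketch of how a RAG pipeline might trim retrieved passages to fit the 8K window. It rests on assumptions: the 4-characters-per-token ratio is a crude heuristic for English text rather than the model's real tokenizer, and the `pack_chunks` helper and its `reserve_for_answer` default are hypothetical names chosen for this example.

```python
CONTEXT_TOKENS = 8192   # 8K context window listed above
CHARS_PER_TOKEN = 4     # assumed heuristic, not the real tokenizer


def estimate_tokens(text):
    """Approximate the token count from character length (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def pack_chunks(question, chunks, reserve_for_answer=1024):
    """Keep as many chunks as fit in the window, in the order given.

    chunks is assumed to be sorted most-relevant first.
    reserve_for_answer is the token budget kept free for the reply.
    """
    budget = CONTEXT_TOKENS - reserve_for_answer - estimate_tokens(question)
    kept = []
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break  # stop at the first chunk that no longer fits
        kept.append(chunk)
        budget -= cost
    return kept
```

With ten retrieved passages of about 1,000 tokens each, only seven survive once the question and a 1,024-token reply are accounted for, which illustrates why the window feels tight. For accurate counts, swap the heuristic for the model's actual tokenizer.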
Architecture & training
Architecture: Dense Llama 2-style
Training: 11T tokens · SFT + DPO (UltraFeedback)
If you need an Apache-licensed sub-2B model that punches above its weight, SmolLM2 is the default choice.
Quick start
ollama run smollm2:1.7b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
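Once the model has been pulled, it can also be called programmatically through Ollama's local HTTP API. The sketch below assumes an Ollama server is running on its default port (11434) on the same machine. The helper names `build_request` and `generate` are illustrative, and only the Python standard library is used.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "smollm2:1.7b"  # same tag as the quick start command


def build_request(prompt, url=OLLAMA_URL, model=MODEL):
    """Build a non-streaming generate request for the Ollama API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the prompt and return the generated text (needs a live server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Explain quantization in one sentence.")` should return the model's reply as a string, provided the server is up and the model is available locally. Setting `"stream": False` keeps the example simple by returning one JSON object instead of a stream of partial responses.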