Model fiche
SmolVLM2 2.2B Instruct
By HuggingFace · France
vision
chat
small
Overview
HuggingFace's 2.2B vision-language model built on SmolLM2-1.7B, handling image, video, and text in roughly 5.2GB of VRAM. The smallest serious VLM with video understanding.
When to pick this model
- Adding vision to mobile or embedded apps
- Video frame analysis on a single consumer GPU
- Document and screenshot understanding at the edge
- Permissively licensed multimodal prototypes
- Bandwidth-constrained deployments needing local VLM
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 1.6 GB |
| Q5_K_M | 2 GB |
| Q8_0 | 3 GB |
| FP16 (no quantization) | 4.5 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
Strengths
- Runs full video inference in ~5.2GB VRAM
- Apache 2.0 license suitable for commercial use
- Genuine image + video + text capability at 2.2B scale
- Inherits SmolLM2's tight text fundamentals
Limitations
- 8K context inherited from SmolLM2 limits long video
- No official Ollama distribution yet
- Video understanding is basic compared to frontier VLMs
Architecture & training
Architecture: VLM image+video+text โ text ยท SmolLM2-1.7B backbone
Training: ~5.2 GB VRAM for video inference.
Verdict
The go-to small VLM when you need vision plus video in under 3B parameters and an Apache license.
Quick start
# HuggingFace : HuggingFaceTB/SmolVLM2-2.2B-InstructOr use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.