BestLLMfor EN Your hardware. Your LLM. Your call.
APIOpen data Find my LLM
Model fiche

SmolVLM2 2.2B Instruct

By HuggingFace · France

vision chat small
Parameters
2.2B
License
Apache 2.0
Context
8k
VRAM (Q4)
1.6 GB
Released
February 2025

Overview

HuggingFace's 2.2B vision-language model built on SmolLM2-1.7B, handling image, video, and text in roughly 5.2GB of VRAM. The smallest serious VLM with video understanding.

When to pick this model

  • Adding vision to mobile or embedded apps
  • Video frame analysis on a single consumer GPU
  • Document and screenshot understanding at the edge
  • Permissively licensed multimodal prototypes
  • Bandwidth-constrained deployments needing local VLM

VRAM requirements by quantization

QuantizationVRAM required
Q4_K_M (recommended)1.6 GB
Q5_K_M2 GB
Q8_03 GB
FP16 (no quantization)4.5 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.

Strengths

  • Runs full video inference in ~5.2GB VRAM
  • Apache 2.0 license suitable for commercial use
  • Genuine image + video + text capability at 2.2B scale
  • Inherits SmolLM2's tight text fundamentals

Limitations

  • 8K context inherited from SmolLM2 limits long video
  • No official Ollama distribution yet
  • Video understanding is basic compared to frontier VLMs

Architecture & training

Architecture: VLM image+video+text โ†’ text ยท SmolLM2-1.7B backbone

Training: ~5.2 GB VRAM for video inference.

Verdict

The go-to small VLM when you need vision plus video in under 3B parameters and an Apache license.

Quick start

# HuggingFace : HuggingFaceTB/SmolVLM2-2.2B-Instruct

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.

Tools

Is SmolVLM2 2.2B Instruct the right pick for you?

Compute self-hosted ROI → Back to catalog