Model fiche
LLaVA-OneVision 7B
By LMMs-Lab · Singapore
vision
chat
Overview
An Apache 2.0-licensed 7B vision-language model from LMMs-Lab that pairs the SigLIP SO400M vision encoder with the Qwen2-7B language model. It handles single images, multi-image inputs, and video, and draws over 170k monthly downloads.
When to pick this model
- Self-hosted VLM apps needing a permissive license
- Multi-image reasoning and short video understanding
- Fine-tuning base for domain-specific vision tasks
- Cost-sensitive image captioning and VQA pipelines
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 5 GB |
| Q5_K_M | 6 GB |
| Q8_0 | 9 GB |
| FP16 (no quantization) | 16 GB |
VRAM figures include model weights, the KV cache for a typical 8k-token context, and ~600 MB of runtime overhead (Ollama / llama.cpp baseline). Add headroom for longer contexts.
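The breakdown above (weights + KV cache + runtime overhead) can be approximated with a small calculation. This is a rough sketch, not how the table was produced: the Qwen2-7B shape values (7.6B parameters, 28 layers, 4 KV heads, head dimension 128), the effective bits-per-weight for each quantization, and the FP16 KV cache are all assumptions, and the vision encoder is left out for simplicity. Expect results to differ somewhat from the table.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    params: float      # parameter count of the language backbone
    n_layers: int      # transformer layers
    n_kv_heads: int    # key/value heads (grouped-query attention)
    head_dim: int      # dimension per attention head


# Assumed Qwen2-7B shape; check the model's config.json before relying on it.
QWEN2_7B = ModelShape(params=7.6e9, n_layers=28, n_kv_heads=4, head_dim=128)

# Assumed effective bits per weight for common llama.cpp quantizations.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}


def kv_cache_bytes(shape: ModelShape, context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: one key and one value vector per layer per token."""
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_value
    return per_token * context_tokens


def estimate_vram_gib(
    shape: ModelShape,
    quant: str,
    context_tokens: int = 8192,
    overhead_bytes: int = 600 * 1024 ** 2,  # ~600 MB runtime overhead, per the note above
) -> float:
    """Rough VRAM estimate in GiB: weights + KV cache + fixed overhead."""
    weight_bytes = shape.params * BITS_PER_WEIGHT[quant] / 8
    total = weight_bytes + kv_cache_bytes(shape, context_tokens) + overhead_bytes
    return total / GIB


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant:8s} ~{estimate_vram_gib(QWEN2_7B, quant):.1f} GiB at 8k context")
```

Raising `context_tokens` shows how much headroom a longer context needs: the KV cache term grows linearly with context length while the weight term stays fixed.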
Strengths
- Fully Apache 2.0 with no commercial gotchas
- Genuine multi-image and video support
- Mature ecosystem with strong community traction
- Solid Qwen2-7B language backbone
Limitations
- No official Ollama packaging
- English-first; weaker on non-English vision QA
- Outpaced by Qwen3-VL on most 2025 benchmarks
Architecture & training
Architecture: 7B VLM · SigLIP SO400M vision encoder + Qwen2-7B language model · image/multi-image/video inputs
Training: LMMs-Lab (Singapore).
Verdict
A dependable, truly open VLM for self-hosters who value Apache licensing over the latest leaderboard score.
Quick start
# HuggingFace : lmms-lab/llava-onevision-qwen2-7b-ov
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.