Qwen 2.5 Omni 7B
By Alibaba · China
Overview
Alibaba's first true omni-modal open model — text, image, audio, and video in, with text and speech out. A research-grade preview rather than a production-ready release.
When to pick this model
- You're researching unified multimodal pipelines and want one model end-to-end
- You need speech synthesis alongside text generation in a single model
- You're prototyping voice agents that also handle images and video
- You're willing to wire up vLLM or transformers directly
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 6 GB |
| Q5_K_M | 7 GB |
| Q8_0 | 10 GB |
| FP16 (no quantization) | 18 GB |
VRAM figures include model weights plus a typical 8K-token KV cache and ~600 MB of runtime overhead (Ollama / llama.cpp baseline). Add headroom for longer context lengths.
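The breakdown above (weights plus KV cache plus fixed overhead) can be sketched as simple arithmetic. This is a rough estimator, not the method used to produce the table: the layer count, KV-head count, and head dimension below are illustrative assumptions rather than confirmed values for this model, and the function names are hypothetical.

```python
# Rough VRAM estimate: weights + KV cache + fixed runtime overhead.
# All configuration values here are illustrative assumptions, not measurements.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_elem: int = 2) -> int:
    """Size of the key and value caches for a single sequence."""
    # The leading factor 2 covers storing both keys and values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


def estimate_vram_gb(n_params: float, bits_per_weight: float, kv_bytes: int,
                     overhead_bytes: int = 600 * 1024**2) -> float:
    """Total estimate in GiB for a given parameter count and quantization width."""
    weight_bytes = n_params * bits_per_weight / 8
    return (weight_bytes + kv_bytes + overhead_bytes) / 1024**3


# Assumed configuration: 28 layers, 4 KV heads, head dimension 128, FP16 cache.
kv_8k = kv_cache_bytes(28, 4, 128, 8 * 1024)
kv_32k = kv_cache_bytes(28, 4, 128, 32 * 1024)
print(f"KV cache at 8K:  {kv_8k / 1024**2:.0f} MiB")
print(f"KV cache at 32K: {kv_32k / 1024**2:.0f} MiB")
```

Under these assumed dimensions the KV cache grows linearly with context, from 448 MiB at 8K tokens to 1792 MiB at 32K, which is why longer contexts need extra headroom beyond the table values.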
Published benchmark scores
| Benchmark | Score |
|---|---|
| OmniBench (avg) | 56.13 |
Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.
Strengths
- Text, image, audio, and video input in one model
- Speech output without a separate TTS
- Apache 2.0
- Compact 7B footprint
Limitations
- No official Ollama tag — community GGUFs only
- 32K context is short for video-heavy workloads
- Early-generation omni model — quality lags specialized stacks
Architecture & training
Architecture: Thinker-Talker end-to-end · TMRoPE · streaming speech in+out
Training: First mainstream open omni model.
The first credible open omni model — promising for research, but not a drop-in for production yet.
Quick start
# GGUF: ggml-org/Qwen2.5-Omni-7B-GGUF (no official Ollama tag)
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
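One way to fetch a quantized file from the GGUF repository named above is through the `huggingface_hub` Python package. The sketch below is a minimal example under a few assumptions: the exact file names in the repository are not listed here, so the code discovers them at run time; the helper names `pick_gguf` and `download_gguf` are hypothetical; and skipping files containing `mmproj` assumes the multimodal projector is shipped as a separate file with that prefix.

```python
from typing import List, Optional


def pick_gguf(filenames: List[str], quant_tag: str) -> Optional[str]:
    """Return the first .gguf file whose name contains quant_tag (case-insensitive)."""
    tag = quant_tag.lower()
    for name in filenames:
        lower = name.lower()
        # Assumption: projector files carry "mmproj" in their name and are not the main weights.
        if lower.endswith(".gguf") and tag in lower and "mmproj" not in lower:
            return name
    return None


def download_gguf(repo_id: str = "ggml-org/Qwen2.5-Omni-7B-GGUF",
                  quant_tag: str = "Q4_K_M") -> str:
    """Download the matching GGUF file and return its local cache path."""
    # Imported lazily so pick_gguf works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download, list_repo_files

    files = list_repo_files(repo_id)
    chosen = pick_gguf(files, quant_tag)
    if chosen is None:
        raise FileNotFoundError(f"No {quant_tag} GGUF found in {repo_id}: {files}")
    return hf_hub_download(repo_id=repo_id, filename=chosen)
```

Calling `download_gguf()` needs network access and returns a local path that can be passed to a llama.cpp-based runtime. If the repository does not publish the requested quantization, the error message lists the files that are available.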