Model fiche
Qwen 3 Omni 30B-A3B
By Alibaba · China
Tags: vision · audio · chat · moe
Overview
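Alibaba's omnimodal 30B mixture-of-experts (MoE) model with 3B active parameters per token. It offers streaming speech, text support for 119 languages (speech input covers 19 of them), and Apache 2.0 licensing. It is one of the most accessible natively omnimodal open models.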
When to pick this model
- Voice-first assistants with low-latency speech in/out
- Multilingual voice and text applications (119 text languages, 19 speech-input languages)
- Real-time multimodal agents on a single GPU
- Long-context multimodal reasoning (131k-token context)
- Apache 2.0 commercial deployments
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 19 GB |
| Q5_K_M | 23 GB |
| Q8_0 | 35 GB |
| FP16 (no quantization) | 62 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
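As a rough illustration of how to read the table, the sketch below picks the highest-precision quantization that fits a given VRAM budget. The figures are copied from the table above. The function name `pick_quantization` and the `headroom_gb` parameter are hypothetical names introduced for this example and are not part of Ollama or llama.cpp.

```python
from typing import Optional

# VRAM figures (GB) copied from the table above; they assume an 8k KV cache.
VRAM_GB = {
    "Q4_K_M": 19,
    "Q5_K_M": 23,
    "Q8_0": 35,
    "FP16": 62,
}


def pick_quantization(available_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the highest-precision quantization that fits, or None.

    headroom_gb is a hypothetical allowance for longer contexts; the fiche
    says to add headroom but does not give a number.
    """
    budget = available_gb - headroom_gb
    fitting = [name for name, gb in VRAM_GB.items() if gb <= budget]
    # Among the options that fit, the largest footprint is the least
    # aggressively quantized one.
    return max(fitting, key=VRAM_GB.get) if fitting else None


if __name__ == "__main__":
    for card_gb in (16, 24, 48, 80):
        print(card_gb, "GB ->", pick_quantization(card_gb))
```

For example, a 24 GB card fits Q5_K_M at the 8k baseline, but with 2 GB reserved for a longer context the sketch falls back to Q4_K_M.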
Strengths
- Native omnimodal I/O: text, image, and audio in; text and speech out
- 131k-token context window
- Streaming speech for low-latency voice apps
- Apache 2.0 license
- Only 3B active params per token
Limitations
- Needs around 19 GB of VRAM even at Q4_K_M
- Audio path is still maturing relative to text and vision
- Tooling support is uneven outside vLLM
Architecture & training
Architecture: MoE · 30B · Qwen3-Omni · text + vision + audio end-to-end
Training: Qwen3-Omni 30B, a Qwen omnimodal model (text, image, and audio input; text and speech output).
Verdict
The default open choice if you actually need audio in and out, not just text and images.
Quick start
ollama run qwen3-omni:30b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
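For scripted use, here is a minimal Python sketch. It assumes the model is available under the `qwen3-omni:30b` tag from the command above and that an Ollama server is listening on its default local address, `http://localhost:11434`. It sends a text-only prompt to Ollama's `/api/generate` endpoint. The helper names `build_request` and `ask` are hypothetical, and audio or image input is not covered.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address
MODEL_TAG = "qwen3-omni:30b"  # tag taken from the quick-start command


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False the server replies with one JSON object whose
    # "response" field holds the full completion.
    return body["response"]
```

Calling `ask("Describe this model in one sentence.")` returns the generated text once the server responds. Setting `"stream": False` keeps the example short at the cost of waiting for the whole reply.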