Model fiche
Phi-4 Multimodal 5.6B
By Microsoft · United States
chat
vision
audio
small
Overview
Microsoft's 5.6B multimodal model — text, image, and audio in, text out — using a Mixture-of-LoRAs design. Accepts roughly 2.8 hours of audio per request.
When to pick this model
- You're processing long audio recordings on a laptop or edge device
- You need lightweight multimodal in an English-first context
- You want an MIT-licensed multimodal model with no commercial restrictions
- You're prototyping voice + vision pipelines without server-class hardware
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 4 GB |
| Q5_K_M | 5 GB |
| Q8_0 | 7 GB |
| FP16 (no quantization) | 12 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
Strengths
- Text, image, and audio input in a 5.6B footprint
- MIT license
- 128K context window
- Long audio handling (up to ~2.8 hours)
Limitations
- No official Ollama tag
- English-first — weaker on other languages
- Limited ecosystem tooling vs Qwen VL
Architecture & training
Architecture: Dense · Mixture-of-LoRAs for multimodal · LongRoPE
Training: Up to ~2.8h of audio input.
Verdict
The lightest credible audio-capable multimodal under MIT — ideal for transcription-adjacent pipelines on small hardware.
Quick start
# Via HuggingFace : microsoft/Phi-4-multimodal-instruct (pas d'Ollama officiel)Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.