Model card
Moshi 7B
By Kyutai · France
audio
fr
Overview
Kyutai's full-duplex speech model: 7.6B parameters, roughly 200 ms latency in practice, and two voices, Moshiko and Moshika. It is a speech-to-speech architecture, not a text LLM.
When to pick this model
- You're building real-time voice interfaces and need full-duplex behavior
- You need low-latency speech-to-speech without separate TTS and STT
- You're researching speech architectures rather than text LLMs
- You can run inference directly in PyTorch or via Kyutai's stack
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 5 GB |
| Q5_K_M | 6 GB |
| Q8_0 | 9 GB |
| FP16 (no quantization) | 15 GB |
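VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead, estimated from a generic Ollama / llama.cpp baseline. Moshi's architecture is not supported by Ollama (see Limitations), so treat these numbers as approximations rather than measured values. Add headroom for longer context lengths.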
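As a rough illustration of how figures like these can be derived, the sketch below adds weight memory, KV cache, and runtime overhead. The function name `estimate_vram_gb` and the effective bits-per-weight values are assumptions made for illustration, not numbers published by Kyutai. Only the 7.6B parameter count and the ~600 MB overhead come from this card. Because the table is rounded and may rest on different assumptions, small mismatches with it are expected.

```python
def estimate_vram_gb(params_billions, bits_per_weight, kv_cache_gb=0.0, overhead_gb=0.6):
    """Rough VRAM estimate in GB: weights + KV cache + runtime overhead."""
    # params (in billions) * bits per weight / 8 bits per byte = weight size in GB
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + kv_cache_gb + overhead_gb


# Assumed effective bits per weight for each format (approximate, illustrative only).
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        # 7.6 is the parameter count (in billions) quoted in the overview.
        estimate = estimate_vram_gb(7.6, bits)
        print(f"{name}: ~{estimate:.1f} GB before KV cache")
```

Pass a non-zero `kv_cache_gb` to account for the context length you plan to run. The right value depends on the runtime and is not something this card specifies.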
Strengths
- First open full-duplex speech model
- Sub-second latency (~200ms in practice)
- Mimi codec at 12.5 Hz / 1.1 kbps on 24 kHz audio
- From Kyutai, a respected French AI lab
Limitations
- Not a text LLM — different use case entirely
- Architecture not supported by Ollama
- CC-BY 4.0 license — attribution required
Architecture & training
Architecture: Full-duplex speech-text · Depth Transformer (codebook) + 7B Temporal Transformer
Training: Mimi codec at 12.5 Hz / 1.1 kbps on 24 kHz audio. ~200ms practical latency.
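The codec figures above fit together arithmetically, and the short sketch below works through them. The sample rate, frame rate, and bitrate are taken from this card. The split into 8 codebooks of 2048 entries each is an assumption brought in from outside this card and is labeled as such in the code.

```python
import math

# Figures quoted on this card.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5
BITRATE_BPS = 1_100  # 1.1 kbps

# Each codec frame covers this many raw audio samples and this much time.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ
frame_duration_ms = 1000 / FRAME_RATE_HZ

# Bits available to describe one frame at the quoted bitrate.
bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ

# Assumption (not stated on this card): 8 codebooks with 2048 entries each.
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2048
bits_from_codebooks = NUM_CODEBOOKS * math.log2(CODEBOOK_SIZE)

if __name__ == "__main__":
    print(f"samples per frame: {samples_per_frame:.0f}")
    print(f"frame duration:    {frame_duration_ms:.0f} ms")
    print(f"bits per frame:    {bits_per_frame:.0f}")
    print(f"codebook bits:     {bits_from_codebooks:.0f}")
```

Under these assumptions each frame spans 80 ms of audio, so the ~200 ms practical latency corresponds to a few frames of delay plus processing time.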
Verdict
The reference open full-duplex speech model — niche, but the only credible choice in its category.
Quick start
# GitHub: kyutai-labs/moshi (voices: Moshiko, male / Moshika, female)
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.