BestLLMfor EN Your hardware. Your LLM. Your call.
Model fiche

Moshi 7B

By Kyutai · France

audio fr
Parameters: 7.6B
License: CC-BY 4.0
Context: 4k
VRAM (Q4): 5 GB
Released: September 2024

Overview

Kyutai's full-duplex speech model — 7.6B parameters with sub-second latency (~200ms) and two voices, Moshiko and Moshika. A speech architecture, not a text LLM.

When to pick this model

  • You're building real-time voice interfaces and need full-duplex behavior
  • You need low-latency speech-to-speech without separate TTS and STT
  • You're researching speech architectures rather than text LLMs
  • You can run inference directly in PyTorch or via Kyutai's stack

VRAM requirements by quantization

Quantization              VRAM required
Q4_K_M (recommended)      5 GB
Q5_K_M                    6 GB
Q8_0                      9 GB
FP16 (no quantization)    15 GB

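VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Because Moshi's architecture is not supported by Ollama, treat them as estimates rather than measurements. Add headroom for higher context lengths.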
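
As a rough illustration of how such figures can be approximated, the sketch below adds weight size, an optional KV cache allowance, and the ~600 MB overhead mentioned above. The function name, the bits-per-weight values, and the default KV cache size are assumptions made for this example. The page does not publish its exact formula, so the results will not match the table line for line.

```python
# Approximate bits per weight for each format. These are assumed,
# illustrative values, not figures published on this page.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def estimate_vram_gb(params_billion, bits_per_weight,
                     kv_cache_gb=0.0, overhead_gb=0.6):
    """Return a rough VRAM estimate in decimal GB.

    weights = parameters * bits per weight / 8 bits per byte.
    kv_cache_gb is left to the caller because it depends on the
    architecture and the context length (hypothetical default: 0).
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_cache_gb + overhead_gb


if __name__ == "__main__":
    # 7.6B parameters, as listed for Moshi 7B.
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: {estimate_vram_gb(7.6, bits):.1f} GB")
```

Passing a non-zero kv_cache_gb shows how much headroom a longer context would add on top of the weights.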

Strengths

  • First open full-duplex speech model
  • Sub-second latency (~200ms in practice)
  • Mimi codec at 12.5 Hz / 1.1 kbps on 24 kHz audio
  • From Kyutai, a respected French AI lab

Limitations

  • Not a text LLM — different use case entirely
  • Architecture not supported by Ollama
  • CC-BY 4.0 license — attribution required

Architecture & training

Architecture: Full-duplex speech-text · Depth Transformer (codebook) + 7B Temporal Transformer

Training: Mimi codec at 12.5 Hz / 1.1 kbps on 24 kHz audio. ~200ms practical latency.
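
The codec figures above imply some simple arithmetic, sketched here. Only the three constants come from this page. The 16-bit mono PCM baseline used for the compression ratio is an assumption chosen for comparison, and the reading of 88 bits per frame as 8 codebooks of 11 bits each is one possible interpretation, not something stated here.

```python
# Values quoted on this page for the Mimi codec.
SAMPLE_RATE_HZ = 24_000   # input audio sample rate
FRAME_RATE_HZ = 12.5      # codec frames per second
BITRATE_BPS = 1_100       # 1.1 kbps

# Each codec frame covers this many audio samples and milliseconds.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ
frame_duration_ms = 1000 / FRAME_RATE_HZ

# Bit budget available per frame at the quoted bitrate.
bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ

# Assumed baseline: uncompressed 16-bit mono PCM at the same sample rate.
PCM_BITS_PER_SAMPLE = 16
pcm_bitrate_bps = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE
compression_ratio = pcm_bitrate_bps / BITRATE_BPS

if __name__ == "__main__":
    print(f"samples per frame: {samples_per_frame:.0f}")
    print(f"frame duration:    {frame_duration_ms:.0f} ms")
    print(f"bits per frame:    {bits_per_frame:.0f}")
    print(f"vs 16-bit PCM:     about {compression_ratio:.0f}x smaller")
```

A frame duration well below the quoted ~200 ms is consistent with the latency claim, since the model has to wait for at least one full frame before it can respond.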

Verdict

The reference open full-duplex speech model — niche, but the only credible choice in its category.

Quick start

# GitHub: kyutai-labs/moshi, voices Moshiko (male) / Moshika (female)

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
