LLaDA 2.0 Uni 16B
By Ant Group / inclusionAI · China
Overview
Ant Group's first open, Apache 2.0 licensed diffusion LLM: a Mixture-of-Experts model (16B total, 1B active parameters) paired with a 6.2B diffusion decoder, unifying text and vision generation and editing in one model. Released April 2026.
When to pick this model
- Research on diffusion-based language models
- Unified text + image generation and editing in one stack
- Interleaved thinking workflows during generation
- Apache 2.0 commercial use of dLLM architectures
- Experiments comparing diffusion vs. autoregressive decoding
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 18 GB |
| Q5_K_M | 22 GB |
| Q8_0 | 30 GB |
| FP16 (no quantization) | 47 GB |
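VRAM figures cover model weights, a typical 8k-token KV cache, and roughly 600 MB of runtime overhead, estimated against an Ollama / llama.cpp baseline. Treat them as estimates, since this model does not currently run on either runtime (see Limitations), and leave some headroom on top.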
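The breakdown above (weights + KV cache + runtime overhead) can be sketched as simple arithmetic. This is a rough back-of-the-envelope helper, not the formula used to produce the table; the function name, the default overhead of 0.6 GB (taken from the ~600 MB figure above), and the idea of passing the KV cache size in directly are all assumptions made for illustration.

```python
def estimate_vram_gb(params_billion, bits_per_weight, kv_cache_gb=0.0, overhead_gb=0.6):
    """Rough VRAM estimate in GB (hypothetical helper, not an official formula).

    params_billion  -- parameter count in billions
    bits_per_weight -- average storage width per weight (16 for FP16)
    kv_cache_gb     -- KV cache size, supplied by the caller
    overhead_gb     -- fixed runtime overhead
    """
    # 1e9 parameters at (bits / 8) bytes each is (bits / 8) GB per billion parameters.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_cache_gb + overhead_gb


if __name__ == "__main__":
    # Assumption: both the 16B MoE and the 6.2B decoder stay resident in FP16.
    print(round(estimate_vram_gb(16 + 6.2, 16), 1))
```

Under the assumption that both the 16B MoE and the 6.2B decoder are held in FP16, this gives about 45 GB before any KV cache, which is in the same range as the 47 GB FP16 row. The quantized rows depend on the exact per-weight width of each format, so the sketch should not be expected to reproduce them.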
Strengths
- The first Apache 2.0 open diffusion LLM
- Unified text and vision generation and editing
- Interleaved 'thinking' mode during diffusion
- Decoder-turbo distillation runs 8 diffusion steps instead of 50
- Apache 2.0 commercial license
Limitations
- Diffusion architecture not supported by Ollama or llama.cpp
- Requires Flash Attention 2 and CUDA 12.4
- Around 47 GB VRAM during active generation
- Only 8k context window
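The software requirements listed above can be checked before downloading anything. The following is a minimal preflight sketch: it assumes the Flash Attention package is importable as `flash_attn`, that PyTorch is the runtime, and that "CUDA 12.4" means 12.4 or newer (the source does not say whether it is a minimum or an exact version). The function names are hypothetical, and the check does not verify the Flash Attention major version.

```python
import importlib.util


def version_tuple(text):
    """Turn a version string such as '12.4' into (12, 4) for comparison."""
    return tuple(int(part) for part in text.split(".")[:2])


def preflight(min_cuda=(12, 4)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Presence check only: does not confirm that the installed version is 2.x.
    if importlib.util.find_spec("flash_attn") is None:
        problems.append("flash_attn package not installed")

    if importlib.util.find_spec("torch") is None:
        problems.append("torch not installed; cannot check CUDA")
        return problems

    import torch

    cuda = torch.version.cuda  # None on CPU-only builds of PyTorch
    if cuda is None or not torch.cuda.is_available():
        problems.append("no usable CUDA device")
    elif version_tuple(cuda) < min_cuda:
        # Assumption: the stated CUDA 12.4 requirement is a minimum.
        problems.append(f"PyTorch built against CUDA {cuda}, need {min_cuda[0]}.{min_cuda[1]} or newer")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARN:", problem)
```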
Architecture & training
Architecture: MoE (16B total / 1B active) + Discrete Semantic Tokenizer (SigLIP-VQ) + 6.2B diffusion decoder + VAE
Training: Masked Token Prediction paradigm. Distilled decoder-turbo (10× speedup, 8 steps instead of 50). SPRINT acceleration.
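For readers new to masked token prediction, the toy sketch below shows the general idea of decoding by iterative unmasking: start from a fully masked sequence and, at each step, commit the most confident predictions in parallel. It is a generic illustration only, not this model's sampler and not related to the decoder-turbo step counts; `toy_predict` is a random stand-in for a real network, and every name here is hypothetical.

```python
import math
import random

MASK = None  # placeholder marking a position that has not been decoded yet


def toy_predict(tokens, vocab, rng):
    """Stand-in for a model: propose (token, confidence) for every masked slot."""
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(tokens)
        if tok is MASK
    }


def masked_decode(length, num_steps, vocab, seed=0):
    """Fill a fully masked sequence in at most num_steps parallel unmasking rounds."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    steps_used = 0
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(tokens) if tok is MASK]
        if not masked:
            break
        # Spread the remaining masked slots evenly over the remaining steps.
        k = math.ceil(len(masked) / (num_steps - step))
        proposals = toy_predict(tokens, vocab, rng)
        # Commit only the k most confident proposals; the rest stay masked.
        most_confident = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in most_confident:
            tokens[i] = proposals[i][0]
        steps_used += 1
    return tokens, steps_used


if __name__ == "__main__":
    decoded, steps = masked_decode(length=12, num_steps=4, vocab=["a", "b", "c"])
    print(steps, decoded)
```

Unlike autoregressive decoding, which emits one token per forward pass from left to right, each round here can fill several positions anywhere in the sequence, so the number of rounds is decoupled from the sequence length.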
A research-first release that proves Apache 2.0 dLLMs are real. Production users should wait for tooling to catch up.
Quick start
# Hugging Face: inclusionAI/LLaDA2.0-Uni (Flash Attention 2 + CUDA 12.4 required)
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
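For the Hugging Face route, a minimal sketch of fetching the weights with the `huggingface_hub` library is shown below. The repository ID comes from the line above; the `download_weights` helper and the `./llada2-uni` target directory are hypothetical. Loading and running the model is deliberately left out, since the inference entry point should be taken from the repository's own model card rather than guessed.

```python
REPO_ID = "inclusionAI/LLaDA2.0-Uni"  # repository ID as given in this listing


def download_weights(local_dir="./llada2-uni"):
    """Download the full repository snapshot and return its local path.

    Requires `pip install huggingface_hub`; the import is deferred so this
    file can be imported even where the library is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    # Expect a large download; check free disk space first.
    print(download_weights())
```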