Falcon Mamba 7B
By TII · UAE
Overview
TII's first serious pure Mamba state-space model (SSM) at scale: a 7B model whose inference state stays a fixed size however long the sequence gets, sidestepping transformer attention costs entirely.
When to pick this model
- Streaming workloads that need memory use to stay flat as the sequence grows
- Research on state-space models versus transformers
- Throughput-bound inference where attention is the bottleneck
- Long-running generation where context grows unboundedly
- Edge inference on memory-constrained devices
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 5 GB |
| Q5_K_M | 6 GB |
| Q8_0 | 9 GB |
| FP16 (no quantization) | 14 GB |
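VRAM figures include model weights plus ~600 MB runtime overhead (Ollama / llama.cpp baseline). As a pure SSM, this model keeps a fixed-size recurrent state rather than a transformer-style KV cache, so memory should stay roughly flat as context grows; still leave some headroom for runtime buffers.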
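The weight portion of these figures can be approximated as parameter count times bits per weight. The sketch below is a rough estimate only: the nominal 7e9 parameter count and the bits-per-weight values are assumptions chosen for illustration, so the results will not reproduce the table exactly (published figures are rounded, and real parameter counts and quantization mixes vary).

```python
# Rough VRAM estimate: parameters x bits per weight / 8, plus fixed overhead.
# Every constant here is an assumption for illustration, not a measured value.

NOMINAL_PARAMS = 7e9        # "7B" taken at face value; the exact count may differ
RUNTIME_OVERHEAD_GB = 0.6   # the ~600 MB runtime overhead quoted above

# Approximate effective bits per weight for each format (assumed values).
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def estimate_vram_gb(quant: str, params: float = NOMINAL_PARAMS) -> float:
    """Return an approximate footprint in GB (1 GB = 10^9 bytes)."""
    weight_bytes = params * BITS_PER_WEIGHT[quant] / 8
    return weight_bytes / 1e9 + RUNTIME_OVERHEAD_GB


for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_vram_gb(quant):.1f} GB")
```

Treat the output as a sanity check on the table, not a replacement for measuring actual usage on the target device.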
Published benchmark scores
| Benchmark | Score |
|---|---|
| MMLU | 62 |
Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.
Strengths
- O(1) inference memory with respect to sequence length (fixed-size recurrent state)
- No practical context limit imposed by attention
- Apache 2.0 license
- Demonstrates Mamba viability at production scale
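To make the memory claim concrete, the sketch below compares how a transformer's KV cache and an SSM's recurrent state scale with context length. All layer counts and dimensions are hypothetical placeholders, not a published configuration for Falcon Mamba or any specific transformer, and the SSM figure ignores smaller buffers such as convolution state.

```python
# Compare inference memory scaling with context length.
# Every dimension below is a hypothetical placeholder for illustration.


def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Transformer KV cache: keys and values are kept for every past token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


def ssm_state_bytes(seq_len, n_layers=64, d_inner=8192, d_state=16,
                    bytes_per_elem=2):
    """SSM recurrent state: one fixed-size state per layer.

    seq_len is accepted only to mirror kv_cache_bytes; it is not used,
    which is the point of the comparison.
    """
    return n_layers * d_inner * d_state * bytes_per_elem


for n in (1_024, 8_192, 65_536):
    kv_mib = kv_cache_bytes(n) / 2**20
    ssm_mib = ssm_state_bytes(n) / 2**20
    print(f"{n:>6} tokens: KV cache {kv_mib:>8.0f} MiB | SSM state {ssm_mib:.0f} MiB")
```

Under these assumed dimensions the KV cache grows linearly with the token count while the SSM state stays the same size at every context length.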
Limitations
- Weaker in-context learning than transformers of equal size
- No vision or multimodal support
- Trained context is only 8k despite architectural headroom
Architecture & training
Architecture: Mamba (SSM) · 7B parameters · no Transformer layers · O(1) per-token inference cost
Training: TII (UAE), on a corpus of about 5.5T tokens. Pure state-space model architecture.
The reference pure-Mamba 7B: pick it to study SSMs or to serve streaming workloads where attention costs hurt most.
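For readers new to SSMs, the recurrence behind the fixed-size state can be sketched in a few lines. This is a toy, non-selective diagonal SSM with zero-order-hold discretization, written for illustration only; it is not Falcon Mamba's implementation (Mamba additionally makes the step size and the B and C projections depend on the input), and the function names and parameter values are hypothetical.

```python
import math


def ssm_step(state, x, a_diag, b, c, dt):
    """One recurrent step: h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C . h_t.

    a_diag holds the (nonzero, negative) diagonal of the continuous-time A.
    """
    new_state = []
    for h_i, a_i, b_i in zip(state, a_diag, b):
        a_bar = math.exp(dt * a_i)           # discretized state transition
        b_bar = (a_bar - 1.0) / a_i * b_i    # discretized input projection
        new_state.append(a_bar * h_i + b_bar * x)
    y = sum(c_i * h_i for c_i, h_i in zip(c, new_state))
    return new_state, y


def run(inputs, a_diag, b, c, dt=0.1):
    """Process a sequence token by token, carrying only the state forward."""
    state = [0.0] * len(a_diag)
    outputs = []
    for x in inputs:
        state, y = ssm_step(state, x, a_diag, b, c, dt)
        outputs.append(y)
    return outputs, state
```

However many inputs `run` processes, the only thing carried between steps is `state`, whose length is set by `a_diag` and never by the sequence length. That is the property the O(1) claims above refer to.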
Quick start
ollama run falcon-mamba:7b
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
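A locally running Ollama server can also be queried over HTTP. The sketch below assumes Ollama's default local endpoint (`http://localhost:11434`) and reuses the model tag from the command above; the helper names `build_request` and `generate` are hypothetical, and the call only works once the server is running and the model has been pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default endpoint
MODEL_TAG = "falcon-mamba:7b"                        # tag from the quick-start command


def build_request(prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": MODEL_TAG, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Explain state-space models in one sentence.")` returns the reply as a single string. Setting `"stream": False` keeps the example simple; for streaming workloads, the streamed response would need to be read line by line instead.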