Model fiche
Step 3.5 Flash
By StepFun · China
chat
general
moe
Overview
StepFun's 196B MoE with 11B active parameters delivers 100 tokens/sec at 128K context. Ranks #3 by free-tier volume on OpenRouter under Apache 2.0.
When to pick this model
- High-throughput chat backends
- Long-context workloads needing fast inference
- Apache-licensed commercial deployments
- Cost-sensitive production at scale
- Workloads where latency matters more than top quality
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 118 GB |
| Q5_K_M | 141 GB |
| Q8_0 | 210 GB |
| FP16 (no quantization) | 392 GB |
VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
Strengths
- 100 tokens/sec sustained at 128K context
- 256K maximum context window
- Only 11B active parameters
- Apache 2.0 license
Limitations
- 118GB+ in Q4 needs a multi-GPU server
- Brand awareness still low outside Asia
- Trails top open models on hardest benchmarks
Architecture & training
Architecture: MoE 196B/11B active · 256k ctx
Training: StepFun. 100 tok/s at 128k ctx.
Verdict
A fast, permissively licensed MoE that punches well above its name recognition.
Quick start
# HuggingFace : stepfun-ai/step-3.5-flashOr use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.