Model card

Step 3.5 Flash

By StepFun · China

Tags: chat, general, MoE

Parameters: 196B
License: Apache 2.0
Context: 256k
VRAM (Q4): 118 GB
Released: February 2026

Overview

StepFun's 196B-parameter MoE model activates 11B parameters per token and is reported to sustain 100 tokens/sec at 128K context. It is released under Apache 2.0 and ranks #3 by free-tier volume on OpenRouter.

When to pick this model

  • High-throughput chat backends
  • Long-context workloads needing fast inference
  • Apache-licensed commercial deployments
  • Cost-sensitive production at scale
  • Workloads where latency matters more than top quality

VRAM requirements by quantization

Quantization             VRAM required
Q4_K_M (recommended)     118 GB
Q5_K_M                   141 GB
Q8_0                     210 GB
FP16 (no quantization)   392 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
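The figures in the table can be roughly cross-checked from the parameter count. The sketch below estimates the weight footprint only (no KV cache or runtime overhead). The bits-per-weight values are approximate assumptions for llama.cpp-style quantizations and are not taken from this page, so expect the results to differ from the table by a few GB.

```python
# Approximate effective bits per weight for each quantization.
# ASSUMPTION: rough values for llama.cpp-style formats, not published on this page.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def weight_gb(params_billion: float, quant: str) -> float:
    """Return the approximate weight footprint in GB (1 GB = 10^9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    # params (in billions) * bits per weight / 8 bits per byte = GB of weights
    return params_billion * bits / 8.0


# Step 3.5 Flash has 196B total parameters; all of them must be resident,
# even though only 11B are active per token.
for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{weight_gb(196, quant):.0f} GB of weights")
```

Note that the total parameter count, not the 11B active count, drives memory use: an MoE model reduces compute per token, but every expert still has to be loaded.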

Strengths

  • 100 tokens/sec sustained at 128K context
  • 256K maximum context window
  • Only 11B active parameters
  • Apache 2.0 license

Limitations

  • 118 GB+ at Q4 requires a multi-GPU server
  • Brand awareness still low outside Asia
  • Trails top open models on hardest benchmarks
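To make the multi-GPU limitation concrete, here is a minimal sketch that computes a lower bound on the number of identical GPUs needed for the 118 GB Q4 figure. The helper name `min_gpus` and the card sizes are illustrative assumptions; real deployments usually need extra headroom per card for activations, longer contexts, and uneven layer splits.

```python
import math


def min_gpus(required_gb: float, gpu_vram_gb: float) -> int:
    """Lower bound on the number of identical GPUs needed to hold required_gb."""
    if gpu_vram_gb <= 0:
        raise ValueError("gpu_vram_gb must be positive")
    return math.ceil(required_gb / gpu_vram_gb)


Q4_VRAM_GB = 118  # Q4_K_M figure listed on this page

# Hypothetical card sizes, chosen only for illustration.
for vram in (24, 48, 80):
    print(f"{vram} GB cards: at least {min_gpus(Q4_VRAM_GB, vram)}")
```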

Architecture & training

Architecture: MoE 196B/11B active · 256k ctx

Training: StepFun. Reported throughput: 100 tok/s at 128k ctx.

Verdict

A fast, permissively licensed MoE that punches well above its name recognition.

Quick start

# Hugging Face: stepfun-ai/step-3.5-flash
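One possible way to fetch the weights is the `huggingface_hub` Python package. The sketch below assumes the package is installed and that the repository ID is exactly as written on this page. The helper name `download_model` and the target directory are hypothetical. The page does not say which file formats or quantizations the repository contains, and the download may run to hundreds of GB.

```python
REPO_ID = "stepfun-ai/step-3.5-flash"  # repository ID as listed on this page


def download_model(local_dir: str) -> str:
    """Download the full repository snapshot and return its local path."""
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Usage (not executed here, since it starts a very large download):
#   path = download_model("./step-3.5-flash")
```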

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
