Model fiche
Qwen 3.5 0.8B
By Alibaba · China
chat · general · small · multilingual
Overview
Alibaba's ultra-compact 0.8B chat model, with a 256k context window and a sub-1 GB footprint at Q4. It is distributed under Apache 2.0 on Ollama and runs on CPUs, integrated GPUs, and Raspberry Pi.
When to pick this model
- Embedded assistants on phones, SBCs, and microcontrollers with NPUs
- Cheap classification, routing, or instruction-following at scale
- Offline chat where memory and power budgets are tight
- Long-context retrieval scenarios that don't need deep reasoning
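For the classification and routing use case, a small model is usually asked for a single label and the reply is then mapped onto a fixed set. The sketch below shows one way to do that; the route names, prompt wording, and helper names (`build_routing_prompt`, `parse_route`) are hypothetical and not part of any Qwen or Ollama API.

```python
# Hypothetical route labels, for illustration only.
ROUTES = ("billing", "technical", "other")


def build_routing_prompt(text: str, routes: tuple[str, ...] = ROUTES) -> str:
    """Build a prompt that asks the model for exactly one label."""
    options = ", ".join(routes)
    return (
        f"Classify the message into exactly one of: {options}.\n"
        "Reply with the label only.\n\n"
        f"Message: {text}\nLabel:"
    )


def parse_route(
    reply: str,
    routes: tuple[str, ...] = ROUTES,
    default: str = "other",
) -> str:
    """Map a free-form model reply to a known label, falling back to default."""
    cleaned = reply.strip().lower()
    for route in routes:
        if cleaned.startswith(route):
            return route
    return default
```

Falling back to a default label guards against the off-format replies that a 0.8B model may produce.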
VRAM requirements by quantization
| Quantization | VRAM required |
|---|---|
| Q4_K_M (recommended) | 0.5 GB |
| Q5_K_M | 0.6 GB |
| Q8_0 | 0.9 GB |
| FP16 (no quantization) | 1.6 GB |
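These figures appear to track weight size alone: 0.8B parameters at 2 bytes each is already 1.6 GB at FP16, and the Q4_K_M figure is smaller than the ~600 MB runtime overhead of an Ollama / llama.cpp baseline. Budget for that overhead and a typical 8k KV cache on top of the table values, and add further headroom for higher context lengths.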
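As a rough sanity check on the table, weights-only size can be estimated as parameter count times bits per weight. This is a minimal sketch: the Q8_0 value of 8.5 bits follows from the llama.cpp block layout, while the Q4_K_M and Q5_K_M values are approximate averages assumed here and vary by model. KV cache and runtime overhead are deliberately left out.

```python
# Approximate bits per weight for common llama.cpp quantization types.
# Q8_0 is 8.5 by construction (32 int8 weights plus one fp16 scale per block).
# The K-quant values are rough averages and are assumptions, not exact figures.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only size in decimal gigabytes (no KV cache, no runtime overhead)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 0.8e9  # parameter count stated in this fiche

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: {weights_gb(N_PARAMS, bpw):.2f} GB")
```

Under these assumptions the estimates land close to the table values, which suggests the table is dominated by weight size.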
Strengths
- Negligible memory footprint: under 1 GB at Q4
- 256k context, rare at this size
- Apache 2.0 distribution via Ollama
- Runs comfortably on CPU, integrated GPU, or Raspberry Pi
Limitations
- Reasoning quality is inherently limited at 0.8B
- Text-only — no vision capability
- Hugging Face distribution uses the Qwen license rather than Apache
Architecture & training
Architecture: Dense Transformer · 0.8B parameters
Training: Qwen 3.5 family (Alibaba). Ultra-compact variant aligned for chat/instruct.
Verdict
The right pick when you need a real LLM in under a gigabyte and don't need it to think hard.
Quick start
ollama run qwen3.5
Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
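Once the model has been pulled, it can also be queried programmatically through Ollama's local REST API. The sketch below assumes an Ollama server on its default port (11434) and reuses the `qwen3.5` tag from the command above; the helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

# Default local endpoint of an Ollama server (assumes the default port).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen3.5") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen3.5") -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]


# Example, with the server running:
#   print(generate("Summarize what a KV cache is in one sentence."))
```

Setting `stream` to false returns a single JSON object rather than a stream of chunks, which keeps the client code short.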