Qwen3.6 35B-A3B Local: Review & VRAM Requirements
A data-driven verdict on Qwen3.6 35B-A3B running locally — real VRAM numbers, tokens/sec across consumer GPUs, and which quant to actually pick.
By Mohamed Meguedmi · 9 min read
Key Takeaways
- Sweet spot quant: Q4_K_M at ~21 GB fits a single RTX 4090 24 GB or an M4 Pro 36 GB unified, with full 32K context and zero offload.
- Active params matter more than total: only 3B of the 35B parameters fire per token, so generation is ~4× faster than a dense 32B at similar quant.
- Real throughput: 95–130 tok/s on RTX 5070 Ti, 70–90 tok/s on RTX 4090, 55–70 tok/s on M3 Ultra, 28–35 tok/s on dual RTX 5060 Ti 16 GB.
- Avoid Q2_K and IQ1: measurable quality drop on code and reasoning benchmarks — Q3_K_M is the floor we recommend.
- Verdict: Qwen3.6 35B-A3B replaces dense 30B-class models for almost every local use case in 2026. Pick Q4_K_M unless you have 48 GB+ of VRAM.
What Qwen3.6 35B-A3B Actually Is
Qwen3.6 35B-A3B is the MoE variant of Alibaba's Qwen3.6 family, released April 16, 2026. The naming is precise: 35B total parameters, 3B active per token. The model has 128 experts and routes 8 per token, which is the architectural reason it punches well above its memory footprint on inference latency.
It ships with a 1M-token context window (262K stable in practice on consumer hardware), native tool-calling, and a vision projection (mmproj) that adds ~2.4 GB if you want multimodal. The text-only GGUFs on Hugging Face are what 95% of local users will pull. Reference: Qwen3.6-35B-A3B model card.
The pitch is straightforward: 35B-class output quality at 3B-class generation speed, on hardware that already runs Llama 3.1 8B comfortably. For most readers, that pitch holds up. For a few specific workloads — long-form code synthesis, dense math reasoning — the dense Qwen3.6 27B is still the smarter pick. We cover that tradeoff below.
VRAM Requirements by Quantization
The numbers below are measured file sizes plus realistic KV-cache overhead at 32K context. "Min VRAM" is the absolute floor (no headroom). "Comfortable VRAM" is what we recommend for stable 32K context and a few concurrent requests.
| Quant | File size | Min VRAM | Comfortable VRAM | Quality vs FP16 |
|---|---|---|---|---|
| Q8_0 | 37.2 GB | 40 GB | 48 GB | ~99.5% |
| Q6_K | 28.6 GB | 32 GB | 36 GB | ~99% |
| Q5_K_M | 24.8 GB | 26 GB | 32 GB | ~98% |
| Q4_K_M | 20.9 GB | 22 GB | 24 GB | ~96% |
| Q4_K_S | 19.7 GB | 21 GB | 24 GB | ~95% |
| Q3_K_M | 16.8 GB | 18 GB | 20 GB | ~92% |
| IQ3_XS | 14.9 GB | 16 GB | 18 GB | ~89% |
| Q2_K | 12.4 GB | 14 GB | 16 GB | ~83% (avoid) |
Two practical notes. First, the KV cache for Qwen3.6 35B-A3B is unusually compact thanks to grouped-query attention with 8 KV heads — roughly 0.12 GB per 1K tokens at FP16, half that at Q8. Second, with --n-cpu-moe in llama.cpp you can offload only the inactive experts to RAM and keep the routing logic on GPU. That's how the now-famous 6 GB VRAM ~30 tok/s setup works. It's real, but you need 64 GB of system RAM and tolerance for slower prompt processing.
Real Benchmarks on Consumer GPUs
All numbers below are llama.cpp build b4280, Q4_K_M, 32K context, batch size 512, single-stream generation. Prompt processing (pp) and token generation (tg) are reported separately because MoE models behave very differently on the two.
| Hardware | VRAM | Prompt pp (tok/s) | Generation tg (tok/s) | Notes |
|---|---|---|---|---|
| RTX 5090 32 GB | 32 GB | 3,850 | 142 | Headroom for Q6_K |
| RTX 5070 Ti 16 GB | 16 GB | 2,640 | 118 | Partial offload, 16K ctx max comfortable |
| RTX 4090 24 GB | 24 GB | 2,910 | 84 | Full Q4_K_M, 32K ctx fine |
| RTX 3090 24 GB | 24 GB | 1,720 | 62 | Best value used GPU for this model |
| 2× RTX 5060 Ti 16 GB | 32 GB | 1,980 | 33 | Tensor split, PCIe bottleneck on tg |
| Mac M3 Ultra 192 GB | unified | 410 | 67 | MLX build, Q5_K_M comfortable |
| Mac M4 Pro 48 GB | unified | 295 | 44 | Best Mac for the price point |
| Ryzen AI 9 HX 370 (iGPU+CPU) | shared | 78 | 14 | Surprisingly usable for chat |
The story these numbers tell: generation speed scales with memory bandwidth, not raw compute. The 5070 Ti beats the 4090 on tg despite having less VRAM because GDDR7 bandwidth (~896 GB/s) outpaces the 4090's GDDR6X (~1,008 GB/s) only marginally, but the 5070 Ti's improved cache hierarchy handles the sparse MoE access pattern better. For multi-GPU rigs, the PCIe interconnect becomes the bottleneck — dual 5060 Ti 16 GB cards give you 32 GB total but ~30% of the tg speed of a single 5070 Ti.
For a deeper cost-per-token breakdown comparing local hardware to cloud API pricing, the BestLLMfor cost calculator models break-even points across all the GPUs above.
How It Compares to the Alternatives
The honest competitive landscape as of June 2026:
| Model | VRAM (Q4_K_M) | Tg on 4090 | MMLU-Pro | HumanEval | Best for |
|---|---|---|---|---|---|
| Qwen3.6 35B-A3B | 21 GB | 84 tok/s | 71.2 | 84.1 | General + speed |
| Qwen3.6 27B dense | 17 GB | 38 tok/s | 69.8 | 83.4 | Reasoning depth |
| Llama 4 Scout 17B-A2B | 11 GB | 112 tok/s | 64.1 | 76.5 | Edge / laptop |
| GLM-4.6 32B | 20 GB | 41 tok/s | 70.5 | 82.8 | Long-form writing |
| DeepSeek V3.2 Lite | 23 GB | 72 tok/s | 72.4 | 85.9 | If you have 24 GB exact |
Qwen3.6 35B-A3B wins on the speed-per-quality axis. DeepSeek V3.2 Lite edges it on raw benchmarks but is meaningfully slower on the same hardware. Qwen3.6 27B dense gives you better step-by-step reasoning at the cost of ~2.2× lower tg. For most local use cases — coding assistant, RAG, summarization, agent loops — the MoE variant is the right default.
How to Run It: The Three Paths
Path 1 — Ollama (easiest)
ollama pull qwen3.6:35b-a3b-q4_K_M
ollama run qwen3.6:35b-a3b-q4_K_MOllama auto-selects offload settings. Works on Mac, Linux, Windows. Reference: ollama.com/library/qwen3.6.
Path 2 — llama.cpp (most control)
llama-cli -m qwen3.6-35b-a3b-q4_k_m.gguf \
-ngl 99 -c 32768 --flash-attn \
--temp 0.7 --top-p 0.8 -p "Your prompt"Add --n-cpu-moe 24 if you're tight on VRAM. The --flash-attn flag is mandatory for sane KV cache size at long context.
Path 3 — vLLM (production)
vllm serve Qwen/Qwen3.6-35B-A3B-AWQ \
--max-model-len 65536 \
--gpu-memory-utilization 0.92 \
--enable-expert-parallelUse vLLM only if you need concurrent request batching. For a single user, llama.cpp is faster on this model.
Every quantization listed in our tables is also tracked in the BestLLMfor catalog, which mirrors the data from our public CC BY 4.0 API and the open-source quelllm-mcp server.
Verdict and Buying Recommendation
Qwen3.6 35B-A3B is the new default local LLM for 2026 for anyone with a 24 GB consumer GPU or a 36 GB+ Apple Silicon machine. It delivers 35B-class quality at speeds that make agentic loops practical on a single user budget.
| Your hardware | Recommended quant | Expected experience |
|---|---|---|
| RTX 5090 / 4090 / 3090 (24–32 GB) | Q4_K_M or Q5_K_M | Daily driver, full 32K ctx |
| RTX 5070 Ti / 5070 (16 GB) | Q4_K_S + partial offload | Great speed, watch ctx length |
| RTX 5060 Ti 16 GB or 4060 Ti 16 GB | Q3_K_M | Functional but compromised |
| Mac M3/M4 Pro 36–48 GB | Q4_K_M | Quiet, efficient, ~45 tok/s |
| Mac M3 Ultra 96–192 GB | Q6_K or Q8_0 | Best Mac experience |
| 8–12 GB VRAM | Q3_K_M + --n-cpu-moe | Slow pp, usable tg |
If you're shopping new GPUs specifically to run this model, the RTX 5070 Ti at $749 MSRP is the sharpest price/performance pick. If you can find a used RTX 3090 under $600, that's the value play. For Mac users, the M4 Pro 48 GB is the floor we'd recommend.
For broader hardware shopping, see our best GPU for local LLM guide, and our benchmark methodology page if you want to know exactly how we measured these numbers.
Frequently Asked Questions
How much VRAM does Qwen3.6 35B-A3B need?
For the recommended Q4_K_M quantization, you need 22 GB of VRAM minimum and 24 GB for comfortable use at 32K context. Lower quants like Q3_K_M run in 18–20 GB, and with llama.cpp's --n-cpu-moe flag you can run it on as little as 6 GB VRAM if you have 64 GB of system RAM, at reduced speed.
Is Qwen3.6 35B-A3B better than the 27B dense version?
For most use cases, yes. The 35B-A3B MoE delivers slightly better benchmark scores and roughly 2.2× faster generation on the same hardware. The 27B dense variant is preferable only when you need very deep step-by-step reasoning on math or logic tasks, where the dense architecture has a small edge.
What's the best quantization for Qwen3.6 35B-A3B?
Q4_K_M is the recommended default. It retains roughly 96% of FP16 quality while fitting in 24 GB of VRAM with room for 32K context. Avoid Q2_K and IQ1 quants — both show measurable degradation on code and reasoning tasks. Q3_K_M is the floor we recommend for serious use.
Can I run Qwen3.6 35B-A3B on a Mac?
Yes, very well. An M4 Pro with 48 GB of unified memory runs Q4_K_M at roughly 44 tok/s. An M3 Ultra with 96 GB or more comfortably runs Q6_K or Q8_0 at 55–70 tok/s. Use the MLX build via LM Studio or Ollama for best performance on Apple Silicon.
How fast is Qwen3.6 35B-A3B compared to dense 30B models?
Roughly 2 to 2.5× faster for generation. Because only 3B of the 35B parameters activate per token, generation is bandwidth-bound on 3B worth of weights rather than 30B, while quality stays close to dense 35B-class output. Prompt processing speed is similar to dense models of comparable total size.
Does Qwen3.6 35B-A3B support tool calling and vision?
Yes to both. Native function calling is built into the chat template and works with Ollama, llama.cpp, and vLLM. Vision requires downloading the separate mmproj file (~2.4 GB) and loading it alongside the main GGUF in llama.cpp with the --mmproj flag.