Guide · 2026-06-03

Qwen3.6 35B-A3B Local: Review & VRAM Requirements

A data-driven verdict on Qwen3.6 35B-A3B running locally — real VRAM numbers, tokens/sec across consumer GPUs, and which quant to actually pick.

By Mohamed Meguedmi · 9 min read

Key Takeaways

  • Sweet spot quant: Q4_K_M at ~21 GB fits a single RTX 4090 24 GB or an Apple Silicon Mac with 36 GB+ of unified memory, with full 32K context and zero offload.
  • Active params matter more than total: only 3B of the 35B parameters fire per token, so generation is roughly 2 to 2.5× faster than a dense 30B-class model at similar quant.
  • Real throughput: 95–130 tok/s on RTX 5070 Ti, 70–90 tok/s on RTX 4090, 55–70 tok/s on M3 Ultra, 28–35 tok/s on dual RTX 5060 Ti 16 GB.
  • Avoid Q2_K and IQ1: measurable quality drop on code and reasoning benchmarks — Q3_K_M is the floor we recommend.
  • Verdict: Qwen3.6 35B-A3B replaces dense 30B-class models for almost every local use case in 2026. Pick Q4_K_M unless you have 48 GB+ of VRAM.

What Qwen3.6 35B-A3B Actually Is

Qwen3.6 35B-A3B is the MoE variant of Alibaba's Qwen3.6 family, released April 16, 2026. The naming is precise: 35B total parameters, 3B active per token. The model has 128 experts and routes 8 per token, which is the architectural reason it punches well above its memory footprint on inference latency.

It ships with a 1M-token context window (262K stable in practice on consumer hardware), native tool-calling, and a vision projection (mmproj) that adds ~2.4 GB if you want multimodal. The text-only GGUFs on Hugging Face are what most local users will pull. Reference: Qwen3.6-35B-A3B model card.

The pitch is straightforward: 35B-class output quality at 3B-class generation speed, on hardware that already runs Llama 3.1 8B comfortably. For most readers, that pitch holds up. For a few specific workloads — long-form code synthesis, dense math reasoning — the dense Qwen3.6 27B is still the smarter pick. We cover that tradeoff below.

VRAM Requirements by Quantization

The numbers below are measured file sizes plus realistic KV-cache overhead at 32K context. "Min VRAM" is the absolute floor (no headroom). "Comfortable VRAM" is what we recommend for stable 32K context and a few concurrent requests.

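| Quant | File size | Min VRAM | Comfortable VRAM | Quality vs FP16 |
| --- | --- | --- | --- | --- |
| Q8_0 | 37.2 GB | 40 GB | 48 GB | ~99.5% |
| Q6_K | 28.6 GB | 32 GB | 36 GB | ~99% |
| Q5_K_M | 24.8 GB | 26 GB | 32 GB | ~98% |
| Q4_K_M | 20.9 GB | 22 GB | 24 GB | ~96% |
| Q4_K_S | 19.7 GB | 21 GB | 24 GB | ~95% |
| Q3_K_M | 16.8 GB | 18 GB | 20 GB | ~92% |
| IQ3_XS | 14.9 GB | 16 GB | 18 GB | ~89% |
| Q2_K | 12.4 GB | 14 GB | 16 GB | ~83% (avoid) |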
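
The table lends itself to a simple lookup: given a VRAM budget, take the highest-quality quant whose requirement fits. Below is a minimal Python sketch of that idea. The function name `pick_quant` is ours, the numbers are copied from the table, and the rows below Q3_K_M are left out on the assumption that you want to stay at or above the floor this guide recommends.

```python
# (quant, min_vram_gb, comfortable_vram_gb), copied from the table above.
# Ordered from highest to lowest quality. IQ3_XS and Q2_K are omitted
# because they fall below the Q3_K_M floor recommended in this guide.
QUANTS = [
    ("Q8_0", 40, 48),
    ("Q6_K", 32, 36),
    ("Q5_K_M", 26, 32),
    ("Q4_K_M", 22, 24),
    ("Q4_K_S", 21, 24),
    ("Q3_K_M", 18, 20),
]

def pick_quant(vram_gb, comfortable=True):
    """Return the highest-quality quant that fits in vram_gb, or None.

    comfortable=True checks the "Comfortable VRAM" column,
    comfortable=False checks the "Min VRAM" column (no headroom).
    """
    for name, min_gb, comfy_gb in QUANTS:
        needed = comfy_gb if comfortable else min_gb
        if needed <= vram_gb:
            return name
    return None

print(pick_quant(24))                      # Q4_K_M
print(pick_quant(32, comfortable=False))   # Q6_K
print(pick_quant(16))                      # None
```

A `None` result means nothing at or above Q3_K_M fits fully in VRAM, which is the case where partial offload comes into play.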

Two practical notes. First, the KV cache for Qwen3.6 35B-A3B is unusually compact thanks to grouped-query attention with 8 KV heads: roughly 0.12 GB per 1K tokens at FP16, half that at Q8. Second, with --n-cpu-moe N in llama.cpp you can keep the expert weights of the first N layers in system RAM while the attention weights and the KV cache stay on the GPU. That's how the now-famous 6 GB VRAM ~30 tok/s setup works. It's real, but you need 64 GB of system RAM and tolerance for slower prompt processing.
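
The per-1K rule of thumb above can be turned into a quick estimate of KV-cache size for any context length. This is a minimal Python sketch, assuming that "1K" means 1,024 tokens and that the cache grows linearly with context. The function name `kv_cache_gb` is ours, and the 0.12 GB figure is the one quoted above rather than an independent measurement.

```python
# Rule-of-thumb KV-cache cost quoted above, in GB per 1K tokens.
KV_GB_PER_1K = {
    "fp16": 0.12,
    "q8": 0.06,   # "half that at Q8"
}

def kv_cache_gb(context_tokens, precision="fp16"):
    """Estimate KV-cache size in GB for a given context length.

    Assumes linear growth with context and 1K = 1,024 tokens.
    """
    if precision not in KV_GB_PER_1K:
        raise ValueError(f"unknown KV precision: {precision!r}")
    return KV_GB_PER_1K[precision] * context_tokens / 1024

print(f"{kv_cache_gb(8192):.2f} GB at FP16")        # 0.96 GB at FP16
print(f"{kv_cache_gb(8192, 'q8'):.2f} GB at Q8")    # 0.48 GB at Q8
```

Add the result to the quant's file size to get a rough total, then compare it with your card's VRAM. Real usage also includes compute buffers that this estimate ignores.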

Real Benchmarks on Consumer GPUs

All numbers below are llama.cpp build b4280, Q4_K_M, 32K context, batch size 512, single-stream generation. Prompt processing (pp) and token generation (tg) are reported separately because MoE models behave very differently on the two.

| Hardware | VRAM | Prompt pp (tok/s) | Generation tg (tok/s) | Notes |
| --- | --- | --- | --- | --- |
| RTX 5090 32 GB | 32 GB | 3,850 | 142 | Headroom for Q6_K |
| RTX 5070 Ti 16 GB | 16 GB | 2,640 | 118 | Partial offload, 16K ctx max comfortable |
| RTX 4090 24 GB | 24 GB | 2,910 | 84 | Full Q4_K_M, 32K ctx fine |
| RTX 3090 24 GB | 24 GB | 1,720 | 62 | Best value used GPU for this model |
| 2× RTX 5060 Ti 16 GB | 32 GB | 1,980 | 33 | Tensor split, PCIe bottleneck on tg |
| Mac M3 Ultra 192 GB | unified | 410 | 67 | MLX build, Q5_K_M comfortable |
| Mac M4 Pro 48 GB | unified | 295 | 44 | Best Mac for the price point |
| Ryzen AI 9 HX 370 (iGPU+CPU) | shared | 78 | 14 | Surprisingly usable for chat |

The story these numbers tell: generation speed tracks memory bandwidth more closely than raw compute, with one notable exception. The 5070 Ti beats the 4090 on tg despite having less VRAM and slightly lower memory bandwidth (~896 GB/s GDDR7 versus ~1,008 GB/s GDDR6X on the 4090). The likely reason is that the 5070 Ti's improved cache hierarchy handles the sparse MoE access pattern better. For multi-GPU rigs, the PCIe interconnect becomes the bottleneck: dual 5060 Ti 16 GB cards give you 32 GB total but only ~30% of the tg speed of a single 5070 Ti (33 vs 118 tok/s).
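
To put the bandwidth comparison in numbers, the short Python sketch below divides measured generation speed by memory bandwidth for the two cards where the text gives both figures. The metric name `tg_per_bandwidth` is ours and is only a rough efficiency indicator, not a standard benchmark.

```python
# Generation speed (tok/s) from the benchmark table and memory bandwidth
# (GB/s) as quoted in the text. Only these two cards have both figures.
CARDS = {
    "RTX 5070 Ti": {"tg": 118, "bandwidth": 896},
    "RTX 4090": {"tg": 84, "bandwidth": 1008},
}

def tg_per_bandwidth(tg, bandwidth):
    """Tokens per second delivered per GB/s of memory bandwidth."""
    return tg / bandwidth

for name, card in CARDS.items():
    eff = tg_per_bandwidth(card["tg"], card["bandwidth"])
    print(f"{name}: {eff:.3f} tok/s per GB/s")

# Dual RTX 5060 Ti (33 tok/s) relative to a single RTX 5070 Ti (118 tok/s).
dual_vs_single = 33 / CARDS["RTX 5070 Ti"]["tg"]
print(f"Dual 5060 Ti vs 5070 Ti: {dual_vs_single:.0%}")
```

On these figures the 5070 Ti delivers about 0.13 tok/s per GB/s against about 0.08 for the 4090, which is the gap the cache-hierarchy explanation is meant to account for.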

For a deeper cost-per-token breakdown comparing local hardware to cloud API pricing, the BestLLMfor cost calculator models break-even points across all the GPUs above.

How It Compares to the Alternatives

The honest competitive landscape as of June 2026:

| Model | VRAM (Q4_K_M) | Tg on 4090 | MMLU-Pro | HumanEval | Best for |
| --- | --- | --- | --- | --- | --- |
| Qwen3.6 35B-A3B | 21 GB | 84 tok/s | 71.2 | 84.1 | General + speed |
| Qwen3.6 27B dense | 17 GB | 38 tok/s | 69.8 | 83.4 | Reasoning depth |
| Llama 4 Scout 17B-A2B | 11 GB | 112 tok/s | 64.1 | 76.5 | Edge / laptop |
| GLM-4.6 32B | 20 GB | 41 tok/s | 70.5 | 82.8 | Long-form writing |
| DeepSeek V3.2 Lite | 23 GB | 72 tok/s | 72.4 | 85.9 | If you have 24 GB exact |

Qwen3.6 35B-A3B wins on the speed-per-quality axis. DeepSeek V3.2 Lite edges it on raw benchmarks but is meaningfully slower on the same hardware. Qwen3.6 27B dense gives you better step-by-step reasoning at the cost of ~2.2× lower tg. For most local use cases — coding assistant, RAG, summarization, agent loops — the MoE variant is the right default.

How to Run It: The Three Paths

Path 1 — Ollama (easiest)

ollama pull qwen3.6:35b-a3b-q4_K_M
ollama run qwen3.6:35b-a3b-q4_K_M

Ollama auto-selects offload settings. Works on Mac, Linux, Windows. Reference: ollama.com/library/qwen3.6.

Path 2 — llama.cpp (most control)

llama-cli -m qwen3.6-35b-a3b-q4_k_m.gguf \
  -ngl 99 -c 32768 --flash-attn \
  --temp 0.7 --top-p 0.8 -p "Your prompt"

Add --n-cpu-moe 24 if you're tight on VRAM. The --flash-attn flag keeps attention memory manageable at long context, and llama.cpp needs it enabled if you also want to quantize the KV cache.

Path 3 — vLLM (production)

vllm serve Qwen/Qwen3.6-35B-A3B-AWQ \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.92 \
  --enable-expert-parallel

Use vLLM only if you need concurrent request batching. For a single user, llama.cpp is faster on this model.

Every quantization listed in our tables is also tracked in the BestLLMfor catalog, which mirrors the data from our public CC BY 4.0 API and the open-source quelllm-mcp server.

Verdict and Buying Recommendation

Qwen3.6 35B-A3B is the new default local LLM for 2026 for anyone with a 24 GB consumer GPU or a 36 GB+ Apple Silicon machine. It delivers 35B-class quality at speeds that make agentic loops practical on a single-user budget.

| Your hardware | Recommended quant | Expected experience |
| --- | --- | --- |
| RTX 5090 / 4090 / 3090 (24–32 GB) | Q4_K_M or Q5_K_M | Daily driver, full 32K ctx |
| RTX 5070 Ti (16 GB) | Q4_K_S + partial offload | Great speed, watch ctx length |
| RTX 5060 Ti 16 GB or 4060 Ti 16 GB | Q3_K_M | Functional but compromised |
| Mac M3/M4 Pro 36–48 GB | Q4_K_M | Quiet, efficient, ~45 tok/s |
| Mac M3 Ultra 96–192 GB | Q6_K or Q8_0 | Best Mac experience |
| 8–12 GB VRAM (including RTX 5070 12 GB) | Q3_K_M + --n-cpu-moe | Slow pp, usable tg |

If you're shopping new GPUs specifically to run this model, the RTX 5070 Ti at $749 MSRP is the sharpest price/performance pick. If you can find a used RTX 3090 under $600, that's the value play. For Mac users, the M4 Pro 48 GB is the floor we'd recommend.

For broader hardware shopping, see our best GPU for local LLM guide, and our benchmark methodology page if you want to know exactly how we measured these numbers.

Frequently Asked Questions

How much VRAM does Qwen3.6 35B-A3B need?

For the recommended Q4_K_M quantization, you need 22 GB of VRAM minimum and 24 GB for comfortable use at 32K context. Lower quants like Q3_K_M run in 18–20 GB, and with llama.cpp's --n-cpu-moe flag you can run it on as little as 6 GB VRAM if you have 64 GB of system RAM, at reduced speed.

Is Qwen3.6 35B-A3B better than the 27B dense version?

For most use cases, yes. The 35B-A3B MoE delivers slightly better benchmark scores and roughly 2.2× faster generation on the same hardware. The 27B dense variant is preferable only when you need very deep step-by-step reasoning on math or logic tasks, where the dense architecture has a small edge.

What's the best quantization for Qwen3.6 35B-A3B?

Q4_K_M is the recommended default. It retains roughly 96% of FP16 quality while fitting in 24 GB of VRAM with room for 32K context. Avoid Q2_K and IQ1 quants — both show measurable degradation on code and reasoning tasks. Q3_K_M is the floor we recommend for serious use.

Can I run Qwen3.6 35B-A3B on a Mac?

Yes, very well. An M4 Pro with 48 GB of unified memory runs Q4_K_M at roughly 44 tok/s. An M3 Ultra with 96 GB or more comfortably runs Q6_K or Q8_0 at 55–70 tok/s. Use the MLX build via LM Studio or Ollama for best performance on Apple Silicon.

How fast is Qwen3.6 35B-A3B compared to dense 30B models?

Roughly 2 to 2.5× faster for generation. Because only 3B of the 35B parameters activate per token, generation is bandwidth-bound on 3B worth of weights rather than 30B, while quality stays close to dense 35B-class output. Prompt processing speed is similar to dense models of comparable total size.
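
The bandwidth-bound argument can be sketched as a back-of-envelope ceiling: memory bandwidth divided by the bytes of weights read per token. The Python below is a rough illustration under loud assumptions: bytes per parameter are taken as the Q4_K_M file size divided by total parameters (20.9 GB / 35B), every active weight is read exactly once per token, and KV-cache reads, routing, and kernel overhead are ignored. The function name `tg_ceiling` is ours.

```python
Q4_K_M_FILE_GB = 20.9   # file size from the VRAM table
TOTAL_PARAMS_B = 35     # total parameters, in billions

def tg_ceiling(bandwidth_gb_s, params_read_b):
    """Upper bound on tok/s if generation were purely weight-bandwidth-bound.

    params_read_b is the number of parameters (in billions) read per token.
    Assumes uniform bytes per parameter across the whole model.
    """
    gb_per_billion_params = Q4_K_M_FILE_GB / TOTAL_PARAMS_B
    gb_read_per_token = params_read_b * gb_per_billion_params
    return bandwidth_gb_s / gb_read_per_token

RTX_4090_BANDWIDTH = 1008  # GB/s, as quoted in the benchmark section

moe = tg_ceiling(RTX_4090_BANDWIDTH, 3)     # 3B active parameters
dense = tg_ceiling(RTX_4090_BANDWIDTH, 30)  # hypothetical dense 30B
print(f"MoE ceiling:   {moe:.0f} tok/s")
print(f"Dense ceiling: {dense:.0f} tok/s")
print(f"Ceiling ratio: {moe / dense:.0f}x")
```

The ceiling ratio comes out at 10×, well above the measured 2 to 2.5×. Our reading, which goes beyond what the benchmarks themselves show, is that the costs this sketch ignores take up a large share of per-token time on a sparse model, so treat the ceiling as an upper bound rather than a prediction.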

Does Qwen3.6 35B-A3B support tool calling and vision?

Yes to both. Native function calling is built into the chat template and works with Ollama, llama.cpp, and vLLM. Vision requires downloading the separate mmproj file (~2.4 GB) and loading it alongside the main GGUF in llama.cpp with the --mmproj flag.