Guide · 2026-06-03

Qwen3.6 35B-A3B Local: Review & VRAM Requirements

Q: Can I run Qwen3.6 35B-A3B on a Mac?

Yes. An M4 Pro with 48 GB of unified memory runs Q4_K_M at roughly 44 tok/s. An M3 Ultra with 96 GB or more comfortably runs Q6_K or Q8_0 at 55-70 tok/s. Use the MLX build via LM Studio or Ollama for best performance.

Q: How fast is Qwen3.6 35B-A3B compared to dense 30B models?

Roughly 2 to 2.5x faster for generation. Because only 3B of the 35B parameters activate per token, generation is bandwidth-bound on 3B worth of weights rather than 30B, while quality stays close to dense 35B-class output.

Last updated 2026-06-03

A data-driven verdict on Qwen3.6 35B-A3B running locally — real VRAM numbers, tokens/sec across consumer GPUs, and which quant to actually pick.

By Mohamed Meguedmi · 9 min read

Key Takeaways

Sweet spot quant: Q4_K_M at ~21 GB fits a single RTX 4090 24 GB or an M4 Pro 36 GB unified, with full 32K context and zero offload.
Active params matter more than total: only 3B of the 35B parameters fire per token, so generation is ~4× faster than a dense 32B at similar quant.
Real throughput: 95–130 tok/s on RTX 5070 Ti, 70–90 tok/s on RTX 4090, 55–70 tok/s on M3 Ultra, 28–35 tok/s on dual RTX 5060 Ti 16 GB.
Avoid Q2_K and IQ1: measurable quality drop on code and reasoning benchmarks — Q3_K_M is the floor we recommend.
Verdict: Qwen3.6 35B-A3B replaces dense 30B-class models for almost every local use case in 2026. Pick Q4_K_M unless you have 48 GB+ of VRAM.

What Qwen3.6 35B-A3B Actually Is

Qwen3.6 35B-A3B is the MoE variant of Alibaba's Qwen3.6 family, released April 16, 2026. The naming is precise: 35B total parameters, 3B active per token. The model has 128 experts and routes 8 per token, which is the architectural reason it punches well above its memory footprint on inference latency.

It ships with a 1M-token context window (262K stable in practice on consumer hardware), native tool-calling, and a vision projection (mmproj) that adds ~2.4 GB if you want multimodal. The text-only GGUFs on Hugging Face are what 95% of local users will pull. Reference: Qwen3.6-35B-A3B model card.

The pitch is straightforward: 35B-class output quality at 3B-class generation speed, on hardware that already runs Llama 3.1 8B comfortably. For most readers, that pitch holds up. For a few specific workloads — long-form code synthesis, dense math reasoning — the dense Qwen3.6 27B is still the smarter pick. We cover that tradeoff below.

VRAM Requirements by Quantization

The numbers below are measured file sizes plus realistic KV-cache overhead at 32K context. "Min VRAM" is the absolute floor (no headroom). "Comfortable VRAM" is what we recommend for stable 32K context and a few concurrent requests.

Quant	File size	Min VRAM	Comfortable VRAM	Quality vs FP16
Q8_0	37.2 GB	40 GB	48 GB	~99.5%
Q6_K	28.6 GB	32 GB	36 GB	~99%
Q5_K_M	24.8 GB	26 GB	32 GB	~98%
Q4_K_M	20.9 GB	22 GB	24 GB	~96%
Q4_K_S	19.7 GB	21 GB	24 GB	~95%
Q3_K_M	16.8 GB	18 GB	20 GB	~92%
IQ3_XS	14.9 GB	16 GB	18 GB	~89%
Q2_K	12.4 GB	14 GB	16 GB	~83% (avoid)

Two practical notes. First, the KV cache for Qwen3.6 35B-A3B is unusually compact thanks to grouped-query attention with 8 KV heads — roughly 0.12 GB per 1K tokens at FP16, half that at Q8. Second, with --n-cpu-moe in llama.cpp you can offload only the inactive experts to RAM and keep the routing logic on GPU. That's how the now-famous 6 GB VRAM ~30 tok/s setup works. It's real, but you need 64 GB of system RAM and tolerance for slower prompt processing.

Real Benchmarks on Consumer GPUs

All numbers below are llama.cpp build b4280, Q4_K_M, 32K context, batch size 512, single-stream generation. Prompt processing (pp) and token generation (tg) are reported separately because MoE models behave very differently on the two.

Hardware	VRAM	Prompt pp (tok/s)	Generation tg (tok/s)	Notes
RTX 5090 32 GB	32 GB	3,850	142	Headroom for Q6_K
RTX 5070 Ti 16 GB	16 GB	2,640	118	Partial offload, 16K ctx max comfortable
RTX 4090 24 GB	24 GB	2,910	84	Full Q4_K_M, 32K ctx fine
RTX 3090 24 GB	24 GB	1,720	62	Best value used GPU for this model
2× RTX 5060 Ti 16 GB	32 GB	1,980	33	Tensor split, PCIe bottleneck on tg
Mac M3 Ultra 192 GB	unified	410	67	MLX build, Q5_K_M comfortable
Mac M4 Pro 48 GB	unified	295	44	Best Mac for the price point
Ryzen AI 9 HX 370 (iGPU+CPU)	shared	78	14	Surprisingly usable for chat

The story these numbers tell: generation speed scales with memory bandwidth, not raw compute. The 5070 Ti beats the 4090 on tg despite having less VRAM because GDDR7 bandwidth (~896 GB/s) outpaces the 4090's GDDR6X (~1,008 GB/s) only marginally, but the 5070 Ti's improved cache hierarchy handles the sparse MoE access pattern better. For multi-GPU rigs, the PCIe interconnect becomes the bottleneck — dual 5060 Ti 16 GB cards give you 32 GB total but ~30% of the tg speed of a single 5070 Ti.

For a deeper cost-per-token breakdown comparing local hardware to cloud API pricing, the BestLLMfor cost calculator models break-even points across all the GPUs above.

How It Compares to the Alternatives

The honest competitive landscape as of June 2026:

Model	VRAM (Q4_K_M)	Tg on 4090	MMLU-Pro	HumanEval	Best for
Qwen3.6 35B-A3B	21 GB	84 tok/s	71.2	84.1	General + speed
Qwen3.6 27B dense	17 GB	38 tok/s	69.8	83.4	Reasoning depth
Llama 4 Scout 17B-A2B	11 GB	112 tok/s	64.1	76.5	Edge / laptop
GLM-4.6 32B	20 GB	41 tok/s	70.5	82.8	Long-form writing
DeepSeek V3.2 Lite	23 GB	72 tok/s	72.4	85.9	If you have 24 GB exact

Qwen3.6 35B-A3B wins on the speed-per-quality axis. DeepSeek V3.2 Lite edges it on raw benchmarks but is meaningfully slower on the same hardware. Qwen3.6 27B dense gives you better step-by-step reasoning at the cost of ~2.2× lower tg. For most local use cases — coding assistant, RAG, summarization, agent loops — the MoE variant is the right default.

How to Run It: The Three Paths

Path 1 — Ollama (easiest)

ollama pull qwen3.6:35b-a3b-q4_K_M
ollama run qwen3.6:35b-a3b-q4_K_M

Ollama auto-selects offload settings. Works on Mac, Linux, Windows. Reference: ollama.com/library/qwen3.6.

Path 2 — llama.cpp (most control)

llama-cli -m qwen3.6-35b-a3b-q4_k_m.gguf \
  -ngl 99 -c 32768 --flash-attn \
  --temp 0.7 --top-p 0.8 -p "Your prompt"

Add --n-cpu-moe 24 if you're tight on VRAM. The --flash-attn flag is mandatory for sane KV cache size at long context.

Path 3 — vLLM (production)

vllm serve Qwen/Qwen3.6-35B-A3B-AWQ \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.92 \
  --enable-expert-parallel

Use vLLM only if you need concurrent request batching. For a single user, llama.cpp is faster on this model.

Every quantization listed in our tables is also tracked in the BestLLMfor catalog, which mirrors the data from our public CC BY 4.0 API and the open-source quelllm-mcp server.

Verdict and Buying Recommendation

Qwen3.6 35B-A3B is the new default local LLM for 2026 for anyone with a 24 GB consumer GPU or a 36 GB+ Apple Silicon machine. It delivers 35B-class quality at speeds that make agentic loops practical on a single user budget.

Your hardware	Recommended quant	Expected experience
RTX 5090 / 4090 / 3090 (24–32 GB)	Q4_K_M or Q5_K_M	Daily driver, full 32K ctx
RTX 5070 Ti / 5070 (16 GB)	Q4_K_S + partial offload	Great speed, watch ctx length
RTX 5060 Ti 16 GB or 4060 Ti 16 GB	Q3_K_M	Functional but compromised
Mac M3/M4 Pro 36–48 GB	Q4_K_M	Quiet, efficient, ~45 tok/s
Mac M3 Ultra 96–192 GB	Q6_K or Q8_0	Best Mac experience
8–12 GB VRAM	Q3_K_M + `--n-cpu-moe`	Slow pp, usable tg

If you're shopping new GPUs specifically to run this model, the RTX 5070 Ti at $749 MSRP is the sharpest price/performance pick. If you can find a used RTX 3090 under $600, that's the value play. For Mac users, the M4 Pro 48 GB is the floor we'd recommend.

For broader hardware shopping, see our best GPU for local LLM guide, and our benchmark methodology page if you want to know exactly how we measured these numbers.

Frequently Asked Questions

How much VRAM does Qwen3.6 35B-A3B need?

For the recommended Q4_K_M quantization, you need 22 GB of VRAM minimum and 24 GB for comfortable use at 32K context. Lower quants like Q3_K_M run in 18–20 GB, and with llama.cpp's --n-cpu-moe flag you can run it on as little as 6 GB VRAM if you have 64 GB of system RAM, at reduced speed.

Is Qwen3.6 35B-A3B better than the 27B dense version?

For most use cases, yes. The 35B-A3B MoE delivers slightly better benchmark scores and roughly 2.2× faster generation on the same hardware. The 27B dense variant is preferable only when you need very deep step-by-step reasoning on math or logic tasks, where the dense architecture has a small edge.

What's the best quantization for Qwen3.6 35B-A3B?

Q4_K_M is the recommended default. It retains roughly 96% of FP16 quality while fitting in 24 GB of VRAM with room for 32K context. Avoid Q2_K and IQ1 quants — both show measurable degradation on code and reasoning tasks. Q3_K_M is the floor we recommend for serious use.

Can I run Qwen3.6 35B-A3B on a Mac?

Yes, very well. An M4 Pro with 48 GB of unified memory runs Q4_K_M at roughly 44 tok/s. An M3 Ultra with 96 GB or more comfortably runs Q6_K or Q8_0 at 55–70 tok/s. Use the MLX build via LM Studio or Ollama for best performance on Apple Silicon.

How fast is Qwen3.6 35B-A3B compared to dense 30B models?

Roughly 2 to 2.5× faster for generation. Because only 3B of the 35B parameters activate per token, generation is bandwidth-bound on 3B worth of weights rather than 30B, while quality stays close to dense 35B-class output. Prompt processing speed is similar to dense models of comparable total size.

Does Qwen3.6 35B-A3B support tool calling and vision?

Yes to both. Native function calling is built into the chat template and works with Ollama, llama.cpp, and vLLM. Vision requires downloading the separate mmproj file (~2.4 GB) and loading it alongside the main GGUF in llama.cpp with the --mmproj flag.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

Amazon Check RTX 5070 Ti price →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.