Guide · 2026-05-16

Best Local LLM for Snapdragon X Copilot+ PCs

Last updated 2026-05-16

Verdict-driven picks for running local LLMs on Snapdragon X Elite and X Plus laptops, with NPU vs CPU benchmarks and tooling that actually works in 2026.

By Mohamed Meguedmi · 11 min read

Key takeaways

Best overall (NPU): Qwen3-8B Hybrid (Q4) via Nexa SDK — ~24–28 tok/s decode on the Hexagon NPU with sub-2 W of CPU draw.
Best out-of-the-box: Phi-4-mini-instruct (ONNX, INT4) through Foundry Local; one-click NPU acceleration on every Copilot+ SKU.
Best for coding: Qwen3-Coder 14B Q4_K_M on CPU via Ollama ARM64 — 9–11 tok/s, fits in 12 GB.
Skip: anything > 14B at Q4 on 16 GB machines, and llama.cpp Vulkan on Adreno X1 (regressed vs CPU on current drivers).
The NPU is finally useful — but only through QNN-aware runtimes (Nexa, Foundry Local, AI Toolkit). GGUF stacks still run on the Oryon CPU cores.

Why Snapdragon X is a different beast

Snapdragon X Elite (X1E-80-100, X1E-84-100) and X Plus ship with 10–12 Oryon CPU cores, an Adreno X1 GPU, and a 45 TOPS Hexagon NPU. That NPU is the headline number Qualcomm sells — and for 18 months it sat mostly idle for local LLM workloads, because llama.cpp, Ollama and LM Studio all targeted CPU or CUDA, not QNN.

That changed in late 2025. Microsoft shipped NPU-optimized DeepSeek-R1 distilled models through the Windows Copilot Runtime, Nexa SDK reached production quality for Qwen3 and Parakeet on Hexagon, and Foundry Local exposed a clean OpenAI-compatible endpoint that schedules to the NPU when the ONNX QDQ graph allows it. As of May 2026, the picture is finally clear enough to make a recommendation.

The editorial team tested every viable runtime on a Surface Laptop 7 (X1E-80-100, 16 GB), a Dell XPS 13 9345 (X1E-80-100, 32 GB) and a Lenovo ThinkPad T14s Gen 6 (X1P-64-100, 32 GB). See our methodology for the full bench protocol.

The hardware reality — what 45 TOPS actually buys you

The Hexagon NPU is fast at INT8/INT4 matmul, but it has two hard constraints: weights must be statically quantized at export time (QDQ ONNX or QNN context binaries), and KV cache stays on the CPU side for most current implementations. That caps the practical win at roughly 2–3× over CPU prefill, and 1.5–2× on decode — not the 10× the TOPS number suggests.

SKU	CPU cores	NPU (TOPS)	LPDDR5x	Realistic max model (Q4)	Street price (USD, May 2026)
X1E-84-100	12	45	up to 64 GB	32B dense / 70B at Q2 (slow)	$1,799+
X1E-80-100	12	45	16–32 GB	14B at Q4	$1,099+
X1E-78-100	12	45	16 GB	8B at Q4 comfortably	$899+
X1P-64-100	10	45	16–32 GB	8B at Q4, 14B tight	$849+

If you are buying specifically for local inference, the 32 GB tier is the sweet spot. 16 GB machines will run Phi-4-mini and Qwen3-8B but leave nothing for the browser and IDE you actually use. Use the cost calculator to compare against a refurbished M2 Pro Mac mini if you are flexible on form factor.

Runtime matrix — what actually works on Windows on ARM

Pick the wrong runtime and you will fight x64 emulation, missing AVX2, or Vulkan drivers that hang the compositor. Here is the May 2026 state of play:

Runtime	ARM64 native	NPU support	Best for	Notes
Foundry Local	Yes	Yes (QNN EP)	Phi-4, DeepSeek-R1 distill, Mistral	OpenAI-compatible, ships with Windows AI Foundry SDK
Nexa SDK	Yes	Yes (Hexagon)	Qwen3, Qwen3-VL, Parakeet ASR	Best NPU throughput in our tests
AI Toolkit (VS Code)	Yes	Yes (QNN)	Curated ONNX catalog	Good for prototyping, weak for serving
Ollama 0.6+	Yes	No (CPU only)	GGUF, anything on HF	Solid CPU performance, no NPU path
LM Studio 0.3.x	Yes	Partial (preview)	GGUF browsing, GUI users	NPU preview is Llama-3 8B only as of 0.3.18
llama.cpp (Vulkan)	Yes	No (Adreno X1)	Experimentation	Slower than CPU on current drivers — skip

NPU vs CPU — measured numbers

All numbers are decode tokens/second on an X1E-80-100, 32 GB, balanced power profile, 512-token prompt, batch 1. Three runs, median reported. Full traces are in our public benchmark dataset (CC BY 4.0).

Model	Quant	Runtime	Device	Decode tok/s	Peak RAM
Phi-4-mini-instruct 3.8B	INT4 QDQ	Foundry Local	NPU	32.1	3.4 GB
Phi-4-mini-instruct 3.8B	Q4_K_M	Ollama	CPU	18.6	3.1 GB
Qwen3-8B Hybrid	INT4	Nexa SDK	NPU	26.4	5.9 GB
Qwen3-8B	Q4_K_M	Ollama	CPU	12.7	5.4 GB
DeepSeek-R1-Distill-Qwen-7B	INT4 QDQ	Foundry Local	NPU	22.8	5.2 GB
Qwen3-Coder 14B	Q4_K_M	Ollama	CPU	9.4	9.8 GB
Llama-3.1-8B-Instruct	INT4	LM Studio (NPU preview)	NPU	19.1	5.6 GB
Mistral-Small-3 24B	Q3_K_M	Ollama	CPU	3.2	13.1 GB

The headline: NPU paths are roughly 2× faster than CPU at the 7–8B class, and they sip power — package power stays under 14 W vs 32–38 W on a sustained CPU decode. For battery-bound workflows the NPU is not optional.

The verdict — which model to install today

1. Best overall: Qwen3-8B Hybrid via Nexa SDK

Qwen3 at 8B is the model the Hexagon NPU was waiting for. The hybrid reasoning toggle (/think vs /no_think) lets you trade latency for quality, and Nexa’s pre-packaged QNN context binaries mean you do not touch the ONNX exporter. Qwen3-8B beats Llama-3.1-8B on MMLU (74.1 vs 69.4) and ships a permissive Apache-2.0 license.

2. Best zero-friction option: Phi-4-mini via Foundry Local

If you want one command and an OpenAI endpoint, this is it:

winget install Microsoft.FoundryLocal
foundry model run phi-4-mini

Foundry Local resolves the right ONNX variant for your NPU, downloads it, and exposes http://localhost:5273/v1/chat/completions. Phi-4-mini-instruct is small enough to leave headroom on 16 GB machines and strong enough for summarization, RAG and tool calling.

3. Best for code: Qwen3-Coder 14B on Ollama

No NPU path exists for 14B coders yet, but the Oryon cores handle Q4_K_M at ~9.4 tok/s, which is faster than any cloud round-trip for short completions. Pair with Continue.dev in VS Code.

4. Best for reasoning: DeepSeek-R1-Distill-Qwen-7B (NPU)

Microsoft’s NPU-optimized build of the R1 distill is the only practical way to run long chain-of-thought locally on a Copilot+ PC without melting the fan. Expect 22 tok/s and well-behaved thermals.

What to avoid

x64-emulated builds of LM Studio < 0.3.10. Even on launch-day Snapdragon firmware they were 4–5× slower than ARM64 Ollama.
Vulkan offload on Adreno X1. Driver maturity is not there in May 2026 — expect crashes and worse-than-CPU throughput.
Anything > 14B on 16 GB machines. Windows will swap, decode stalls, and you will blame the model.

How to set up the recommended stack

Update firmware and drivers. Qualcomm’s March 2026 NPU driver (32.x) is a hard floor for current Nexa and Foundry Local builds.
Install Foundry Local via winget (command above). Confirm NPU scheduling with foundry status.
Install Nexa SDK from github.com/NexaAI/nexa-sdk, then nexa pull qwen3-8b-hybrid-npu.
Install ARM64 Ollama from ollama.com/download/windows for GGUF fallback. Verify with ollama --version showing arm64.
Point your IDE at http://localhost:5273/v1 (Foundry) or http://localhost:11434/v1 (Ollama). Both speak the OpenAI API.

The full install matrix, including the QNN runtime path requirements and a tested config.yaml for Continue.dev, is mirrored in the BestLLMfor public API (CC BY 4.0) under /api/v1/guides/snapdragon-x-copilot. If you prefer routing requests through an MCP layer, the open-source quelllm-mcp server adds a model-selection middleware that picks the NPU path when the prompt fits and falls back to CPU otherwise. French-speaking readers can cross-reference benchmark numbers on our sister site quelllm.fr.

Power, thermals, and battery life

The reason to bother with the NPU at all is power. On the Surface Laptop 7, a sustained Qwen3-8B chat session pulls:

Path	Package power	Surface temp	Est. battery hours (66 Wh)
NPU (Nexa, Qwen3-8B)	11–13 W	34 °C	~5.0
CPU (Ollama, Qwen3-8B Q4)	32–38 W	46 °C	~1.7

That difference is the entire reason Copilot+ PCs exist. If you are inferring on battery during a flight or in meetings, the NPU stack is non-negotiable.

What is still broken in May 2026

Vision-language models are limited to Qwen3-VL through Nexa. No NPU path yet for Llama-3.2-Vision or Phi-3.5-Vision.
Speculative decoding on the NPU is not exposed by any current SDK.
Function calling on Foundry Local works but is finicky — expect to retry malformed tool calls roughly 1 in 20.
GGUF on NPU is still vaporware. If a runtime claims it, check whether it is actually compiling GGUF to QNN at load time (slow) or running on CPU (most cases).

Verdict

Use case	Recommended model	Runtime	Min RAM
General chat / RAG on battery	Qwen3-8B Hybrid INT4	Nexa SDK (NPU)	16 GB
Zero-setup / API for apps	Phi-4-mini-instruct INT4	Foundry Local (NPU)	16 GB
Reasoning / math	DeepSeek-R1-Distill-Qwen-7B	Foundry Local (NPU)	16 GB
Code completion	Qwen3-Coder 14B Q4_K_M	Ollama (CPU)	32 GB
Multimodal (image in)	Qwen3-VL 8B	Nexa SDK (NPU)	16 GB

For most readers buying a Copilot+ PC in 2026, the answer is two installs: Foundry Local for the API endpoint and Nexa SDK for serious NPU work. Keep Ollama around for the long tail of GGUF models that have not been ported to QNN yet. Skip the Vulkan and emulated-x64 paths until driver maturity catches up.

FAQ

Does the Snapdragon X NPU actually run local LLMs in 2026?

Yes, but only through QNN-aware runtimes — Foundry Local, Nexa SDK, AI Toolkit and the LM Studio NPU preview. Ollama, llama.cpp and standard LM Studio builds still run on the Oryon CPU cores. Expect roughly 2× the decode throughput and one-third the power draw on NPU paths versus CPU at the 7–8B class.

What is the largest model a 16 GB Copilot+ PC can practically run?

Phi-4-mini (3.8B) and Qwen3-8B at INT4/Q4 are the comfortable ceiling. 14B models technically load but leave too little RAM for the browser, IDE and Windows itself, causing swap-induced stalls. Buy 32 GB if you want to run Qwen3-Coder 14B or Mistral-Small-3.

Is Ollama on Snapdragon X worth using if the NPU is faster?

Absolutely. Ollama’s ARM64 build is the best path for any GGUF model that has not been ported to QNN — which is most of HuggingFace. It is also the simplest way to evaluate new releases the day they ship. Use NPU runtimes for production inference, Ollama for experimentation.

Why is llama.cpp Vulkan slower than CPU on Adreno X1?

Current Adreno X1 drivers have immature compute-shader support and lack the optimized matmul kernels that the desktop AMD/NVIDIA Vulkan paths rely on. Qualcomm is iterating, but as of May 2026 the CPU path on Oryon cores is faster and more stable for every model we tested.

Can I run DeepSeek-R1 (full) on a Snapdragon X laptop?

No. The full R1 is 671B parameters and requires a server-class deployment. The distilled 7B and 14B Qwen-based variants are what you want locally, and Microsoft’s NPU-optimized 7B distill is the best fit for Copilot+ PCs.

Will GGUF ever run on the Hexagon NPU?

Not directly. GGUF is a CPU-oriented format; NPU runtimes need statically quantized QDQ ONNX or compiled QNN context binaries. Expect runtimes to keep shipping pre-converted variants of popular GGUF models rather than converting them on the fly.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

Amazon Check RTX 5070 Ti price →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.