BestLLMfor EN Your hardware. Your LLM. Your call.
FRQuelLLM.fr
Guide · 2026-05-16

Best Local LLM for Snapdragon X Copilot+ PCs

Verdict-driven picks for running local LLMs on Snapdragon X Elite and X Plus laptops, with NPU vs CPU benchmarks and tooling that actually works in 2026.

By Mohamed Meguedmi · 11 min read

Key takeaways

  • Best overall (NPU): Qwen3-8B Hybrid (Q4) via Nexa SDK — ~24–28 tok/s decode on the Hexagon NPU with sub-2 W of CPU draw.
  • Best out-of-the-box: Phi-4-mini-instruct (ONNX, INT4) through Foundry Local; one-click NPU acceleration on every Copilot+ SKU.
  • Best for coding: Qwen3-Coder 14B Q4_K_M on CPU via Ollama ARM64 — 9–11 tok/s, fits in 12 GB.
  • Skip: anything > 14B at Q4 on 16 GB machines, and llama.cpp Vulkan on Adreno X1 (regressed vs CPU on current drivers).
  • The NPU is finally useful — but only through QNN-aware runtimes (Nexa, Foundry Local, AI Toolkit). GGUF stacks still run on the Oryon CPU cores.

Why Snapdragon X is a different beast

Snapdragon X Elite (X1E-80-100, X1E-84-100) and X Plus ship with 10–12 Oryon CPU cores, an Adreno X1 GPU, and a 45 TOPS Hexagon NPU. That NPU is the headline number Qualcomm sells — and for 18 months it sat mostly idle for local LLM workloads, because llama.cpp, Ollama and LM Studio all targeted CPU or CUDA, not QNN.

That changed in late 2025. Microsoft shipped NPU-optimized DeepSeek-R1 distilled models through the Windows Copilot Runtime, Nexa SDK reached production quality for Qwen3 and Parakeet on Hexagon, and Foundry Local exposed a clean OpenAI-compatible endpoint that schedules to the NPU when the ONNX QDQ graph allows it. As of May 2026, the picture is finally clear enough to make a recommendation.

The editorial team tested every viable runtime on a Surface Laptop 7 (X1E-80-100, 16 GB), a Dell XPS 13 9345 (X1E-80-100, 32 GB) and a Lenovo ThinkPad T14s Gen 6 (X1P-64-100, 32 GB). See our methodology for the full bench protocol.

The hardware reality — what 45 TOPS actually buys you

The Hexagon NPU is fast at INT8/INT4 matmul, but it has two hard constraints: weights must be statically quantized at export time (QDQ ONNX or QNN context binaries), and KV cache stays on the CPU side for most current implementations. That caps the practical win at roughly 2–3× over CPU prefill, and 1.5–2× on decode — not the 10× the TOPS number suggests.

SKUCPU coresNPU (TOPS)LPDDR5xRealistic max model (Q4)Street price (USD, May 2026)
X1E-84-1001245up to 64 GB32B dense / 70B at Q2 (slow)$1,799+
X1E-80-100124516–32 GB14B at Q4$1,099+
X1E-78-100124516 GB8B at Q4 comfortably$899+
X1P-64-100104516–32 GB8B at Q4, 14B tight$849+

If you are buying specifically for local inference, the 32 GB tier is the sweet spot. 16 GB machines will run Phi-4-mini and Qwen3-8B but leave nothing for the browser and IDE you actually use. Use the cost calculator to compare against a refurbished M2 Pro Mac mini if you are flexible on form factor.

Runtime matrix — what actually works on Windows on ARM

Pick the wrong runtime and you will fight x64 emulation, missing AVX2, or Vulkan drivers that hang the compositor. Here is the May 2026 state of play:

RuntimeARM64 nativeNPU supportBest forNotes
Foundry LocalYesYes (QNN EP)Phi-4, DeepSeek-R1 distill, MistralOpenAI-compatible, ships with Windows AI Foundry SDK
Nexa SDKYesYes (Hexagon)Qwen3, Qwen3-VL, Parakeet ASRBest NPU throughput in our tests
AI Toolkit (VS Code)YesYes (QNN)Curated ONNX catalogGood for prototyping, weak for serving
Ollama 0.6+YesNo (CPU only)GGUF, anything on HFSolid CPU performance, no NPU path
LM Studio 0.3.xYesPartial (preview)GGUF browsing, GUI usersNPU preview is Llama-3 8B only as of 0.3.18
llama.cpp (Vulkan)YesNo (Adreno X1)ExperimentationSlower than CPU on current drivers — skip

NPU vs CPU — measured numbers

All numbers are decode tokens/second on an X1E-80-100, 32 GB, balanced power profile, 512-token prompt, batch 1. Three runs, median reported. Full traces are in our public benchmark dataset (CC BY 4.0).

ModelQuantRuntimeDeviceDecode tok/sPeak RAM
Phi-4-mini-instruct 3.8BINT4 QDQFoundry LocalNPU32.13.4 GB
Phi-4-mini-instruct 3.8BQ4_K_MOllamaCPU18.63.1 GB
Qwen3-8B HybridINT4Nexa SDKNPU26.45.9 GB
Qwen3-8BQ4_K_MOllamaCPU12.75.4 GB
DeepSeek-R1-Distill-Qwen-7BINT4 QDQFoundry LocalNPU22.85.2 GB
Qwen3-Coder 14BQ4_K_MOllamaCPU9.49.8 GB
Llama-3.1-8B-InstructINT4LM Studio (NPU preview)NPU19.15.6 GB
Mistral-Small-3 24BQ3_K_MOllamaCPU3.213.1 GB

The headline: NPU paths are roughly 2× faster than CPU at the 7–8B class, and they sip power — package power stays under 14 W vs 32–38 W on a sustained CPU decode. For battery-bound workflows the NPU is not optional.

The verdict — which model to install today

1. Best overall: Qwen3-8B Hybrid via Nexa SDK

Qwen3 at 8B is the model the Hexagon NPU was waiting for. The hybrid reasoning toggle (/think vs /no_think) lets you trade latency for quality, and Nexa’s pre-packaged QNN context binaries mean you do not touch the ONNX exporter. Qwen3-8B beats Llama-3.1-8B on MMLU (74.1 vs 69.4) and ships a permissive Apache-2.0 license.

2. Best zero-friction option: Phi-4-mini via Foundry Local

If you want one command and an OpenAI endpoint, this is it:

winget install Microsoft.FoundryLocal
foundry model run phi-4-mini

Foundry Local resolves the right ONNX variant for your NPU, downloads it, and exposes http://localhost:5273/v1/chat/completions. Phi-4-mini-instruct is small enough to leave headroom on 16 GB machines and strong enough for summarization, RAG and tool calling.

3. Best for code: Qwen3-Coder 14B on Ollama

No NPU path exists for 14B coders yet, but the Oryon cores handle Q4_K_M at ~9.4 tok/s, which is faster than any cloud round-trip for short completions. Pair with Continue.dev in VS Code.

4. Best for reasoning: DeepSeek-R1-Distill-Qwen-7B (NPU)

Microsoft’s NPU-optimized build of the R1 distill is the only practical way to run long chain-of-thought locally on a Copilot+ PC without melting the fan. Expect 22 tok/s and well-behaved thermals.

What to avoid

  • x64-emulated builds of LM Studio < 0.3.10. Even on launch-day Snapdragon firmware they were 4–5× slower than ARM64 Ollama.
  • Vulkan offload on Adreno X1. Driver maturity is not there in May 2026 — expect crashes and worse-than-CPU throughput.
  • Anything > 14B on 16 GB machines. Windows will swap, decode stalls, and you will blame the model.

How to set up the recommended stack

  1. Update firmware and drivers. Qualcomm’s March 2026 NPU driver (32.x) is a hard floor for current Nexa and Foundry Local builds.
  2. Install Foundry Local via winget (command above). Confirm NPU scheduling with foundry status.
  3. Install Nexa SDK from github.com/NexaAI/nexa-sdk, then nexa pull qwen3-8b-hybrid-npu.
  4. Install ARM64 Ollama from ollama.com/download/windows for GGUF fallback. Verify with ollama --version showing arm64.
  5. Point your IDE at http://localhost:5273/v1 (Foundry) or http://localhost:11434/v1 (Ollama). Both speak the OpenAI API.

The full install matrix, including the QNN runtime path requirements and a tested config.yaml for Continue.dev, is mirrored in the BestLLMfor public API (CC BY 4.0) under /api/v1/guides/snapdragon-x-copilot. If you prefer routing requests through an MCP layer, the open-source quelllm-mcp server adds a model-selection middleware that picks the NPU path when the prompt fits and falls back to CPU otherwise. French-speaking readers can cross-reference benchmark numbers on our sister site quelllm.fr.

Power, thermals, and battery life

The reason to bother with the NPU at all is power. On the Surface Laptop 7, a sustained Qwen3-8B chat session pulls:

PathPackage powerSurface tempEst. battery hours (66 Wh)
NPU (Nexa, Qwen3-8B)11–13 W34 °C~5.0
CPU (Ollama, Qwen3-8B Q4)32–38 W46 °C~1.7

That difference is the entire reason Copilot+ PCs exist. If you are inferring on battery during a flight or in meetings, the NPU stack is non-negotiable.

What is still broken in May 2026

  • Vision-language models are limited to Qwen3-VL through Nexa. No NPU path yet for Llama-3.2-Vision or Phi-3.5-Vision.
  • Speculative decoding on the NPU is not exposed by any current SDK.
  • Function calling on Foundry Local works but is finicky — expect to retry malformed tool calls roughly 1 in 20.
  • GGUF on NPU is still vaporware. If a runtime claims it, check whether it is actually compiling GGUF to QNN at load time (slow) or running on CPU (most cases).

Verdict

Use caseRecommended modelRuntimeMin RAM
General chat / RAG on batteryQwen3-8B Hybrid INT4Nexa SDK (NPU)16 GB
Zero-setup / API for appsPhi-4-mini-instruct INT4Foundry Local (NPU)16 GB
Reasoning / mathDeepSeek-R1-Distill-Qwen-7BFoundry Local (NPU)16 GB
Code completionQwen3-Coder 14B Q4_K_MOllama (CPU)32 GB
Multimodal (image in)Qwen3-VL 8BNexa SDK (NPU)16 GB

For most readers buying a Copilot+ PC in 2026, the answer is two installs: Foundry Local for the API endpoint and Nexa SDK for serious NPU work. Keep Ollama around for the long tail of GGUF models that have not been ported to QNN yet. Skip the Vulkan and emulated-x64 paths until driver maturity catches up.

FAQ

Does the Snapdragon X NPU actually run local LLMs in 2026?

Yes, but only through QNN-aware runtimes — Foundry Local, Nexa SDK, AI Toolkit and the LM Studio NPU preview. Ollama, llama.cpp and standard LM Studio builds still run on the Oryon CPU cores. Expect roughly 2× the decode throughput and one-third the power draw on NPU paths versus CPU at the 7–8B class.

What is the largest model a 16 GB Copilot+ PC can practically run?

Phi-4-mini (3.8B) and Qwen3-8B at INT4/Q4 are the comfortable ceiling. 14B models technically load but leave too little RAM for the browser, IDE and Windows itself, causing swap-induced stalls. Buy 32 GB if you want to run Qwen3-Coder 14B or Mistral-Small-3.

Is Ollama on Snapdragon X worth using if the NPU is faster?

Absolutely. Ollama’s ARM64 build is the best path for any GGUF model that has not been ported to QNN — which is most of HuggingFace. It is also the simplest way to evaluate new releases the day they ship. Use NPU runtimes for production inference, Ollama for experimentation.

Why is llama.cpp Vulkan slower than CPU on Adreno X1?

Current Adreno X1 drivers have immature compute-shader support and lack the optimized matmul kernels that the desktop AMD/NVIDIA Vulkan paths rely on. Qualcomm is iterating, but as of May 2026 the CPU path on Oryon cores is faster and more stable for every model we tested.

Can I run DeepSeek-R1 (full) on a Snapdragon X laptop?

No. The full R1 is 671B parameters and requires a server-class deployment. The distilled 7B and 14B Qwen-based variants are what you want locally, and Microsoft’s NPU-optimized 7B distill is the best fit for Copilot+ PCs.

Will GGUF ever run on the Hexagon NPU?

Not directly. GGUF is a CPU-oriented format; NPU runtimes need statically quantized QDQ ONNX or compiled QNN context binaries. Expect runtimes to keep shipping pre-converted variants of popular GGUF models rather than converting them on the fly.