Best Local LLM on Intel Arc B580/B570 — Battlemage Tested
Twelve gigabytes of VRAM for $249 looks unbeatable on paper. After six months of Battlemage drivers maturing, here is what actually runs well — and what to skip.
By Mohamed Meguedmi
Key Takeaways
- Buy the B580, skip the B570. The extra 2 GB VRAM and 20% wider memory bus (192-bit vs 160-bit) on the B580 are the difference between running Qwen3 14B comfortably and being stuck at 8B.
- Best model overall: Qwen3 14B Q4_K_M via IPEX-LLM hits ~28 tok/s on the B580 and fits in 10.8 GB, the sweet spot for a $249 card.
- Use IPEX-LLM, not raw llama.cpp SYCL. Intel's optimized runtime was roughly 50-60% faster on Battlemage than the upstream SYCL backend in our benchmarks, as of driver 6559+.
- Linux beats Windows by 10-15% on identical hardware. Ubuntu 24.04 with kernel 6.11+ is the supported path; WSL2 works but runs about 8% slower than native Linux.
- Single B580 verdict: excellent for 7B-14B chat and coding assistants. For 32B+ models, two B580s via tensor parallel beat a single RTX 4060 Ti 16GB on $/tok/s.
Why Battlemage finally matters for local inference
The Intel Arc B580 launched in December 2024 at $249 with 12 GB of GDDR6 on a 192-bit bus, delivering 456 GB/s of memory bandwidth. The cheaper B570 followed in January 2025 at $219 with 10 GB on a 160-bit bus (380 GB/s). Both use Xe2 ("Battlemage") with 20 and 18 Xe cores respectively, and both include the XMX matrix engines that make INT8 and INT4 inference fast.
For eighteen months Arc was a punchline in local-LLM circles — drivers crashed, llama.cpp's SYCL backend was half-broken, and PyTorch XPU was experimental. That changed in late 2025. The Linux Intel Compute Runtime 25.x brought stable OpenCL and Level Zero, IPEX-LLM 2.3 shipped Battlemage-specific kernels, and vLLM merged the XPU backend. As of May 2026 the stack is genuinely usable.
The question is no longer "does it work?" but "which model and which runtime?" That is what we tested. Our full methodology is documented at /methodology/; raw numbers are mirrored to the BestLLMfor public API (CC BY 4.0).
Hardware specs at a glance
| Card | VRAM | Bus | Bandwidth | Xe cores | XMX | TDP | MSRP |
|---|---|---|---|---|---|---|---|
| Arc B580 | 12 GB GDDR6 | 192-bit | 456 GB/s | 20 | 160 engines | 190 W | $249 |
| Arc B570 | 10 GB GDDR6 | 160-bit | 380 GB/s | 18 | 144 engines | 150 W | $219 |
| Arc A770 16 GB | 16 GB GDDR6 | 256-bit | 560 GB/s | 32 Xe1 | 512 engines | 225 W | $299 (used) |
| RTX 3060 12 GB | 12 GB GDDR6 | 192-bit | 360 GB/s | — | — | 170 W | $279 |
| RTX 4060 Ti 16 GB | 16 GB GDDR6 | 128-bit | 288 GB/s | — | — | 165 W | $449 |
Two things jump out. First, the B580 has more raw bandwidth than both the RTX 3060 and the RTX 4060 Ti 16 GB, and bandwidth is what dictates token-generation speed for memory-bound decoder transformers. Second, the older A770 16 GB still has the highest bandwidth in this price band and ties the RTX 4060 Ti for the most VRAM; if you can find one at that price, it is the cheapest path to running 14B at FP8.
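To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes that generating one token streams the whole quantized weight set from VRAM once, so the ceiling is simply bandwidth divided by model size. The helper name `decode_ceiling_tok_s` is ours, and the 5.4 GB footprint is the Llama 3.1 8B Q4_K_M figure from the benchmark table further down.

```python
# Rough upper bound on decode speed for a memory-bound model:
# each generated token reads (approximately) all weights once.
BANDWIDTH_GB_S = {
    "Arc B580": 456,
    "Arc B570": 380,
    "Arc A770 16 GB": 560,
    "RTX 3060 12 GB": 360,
    "RTX 4060 Ti 16 GB": 288,
}


def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Bandwidth-bound ceiling in tokens/s (an estimate, not a measurement)."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 5.4  # Llama 3.1 8B Q4_K_M footprint, taken from the benchmark table

ceilings = {
    card: decode_ceiling_tok_s(bw, MODEL_GB) for card, bw in BANDWIDTH_GB_S.items()
}

for card, ceiling in sorted(ceilings.items(), key=lambda item: -item[1]):
    print(f"{card:<18} <= {ceiling:5.1f} tok/s")
```

Measured throughput lands well below these ceilings because of kernel overhead and KV cache reads, so treat the output as a rough ordering heuristic rather than a prediction.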
Benchmark results — 5 models, 3 runtimes
All tests run on Ubuntu 24.04, kernel 6.11.0, Intel Compute Runtime 25.13, IPEX-LLM 2.3.0, llama.cpp build b5042 with SYCL, vLLM 0.7.3 with XPU. Prompt: 512 tokens, generation: 256 tokens, batch size 1, greedy decoding. Numbers are the median of 5 runs, ±2%.
| Model | Quant | VRAM | B580 IPEX-LLM | B580 llama.cpp SYCL | B570 IPEX-LLM | RTX 3060 CUDA |
|---|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | Q4_K_M | 5.4 GB | 42 tok/s | 28 tok/s | 36 tok/s | 45 tok/s |
| Qwen3 8B | Q4_K_M | 5.6 GB | 40 tok/s | 27 tok/s | 34 tok/s | 43 tok/s |
| Qwen3 14B | Q4_K_M | 10.8 GB | 28 tok/s | 18 tok/s | OOM | 26 tok/s |
| DeepSeek-R1-Distill-Qwen 14B | Q4_K_M | 10.9 GB | 27 tok/s | 17 tok/s | OOM | 25 tok/s |
| Qwen3-Coder 32B | Q4_K_M | 20.1 GB | OOM (single) | OOM | OOM | OOM |
| Qwen3-Coder 32B (2× B580 TP, vLLM XPU) | AWQ | 20.1 GB | 19 tok/s | n/a | n/a | n/a |
The headline: the B580 with IPEX-LLM is within 5-7% of an RTX 3060 on 8B models and actually beats it on 14B, because the higher bandwidth matters more when the weights barely fit. The B570 is a different story — Qwen3 14B Q4_K_M does not fit in 10 GB once you account for KV cache, leaving you stuck on the 7B-9B tier.
Note how badly upstream llama.cpp SYCL underperforms. Until that backend catches up (the tracking issues show active work), IPEX-LLM is the right default on Battlemage.
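The relative gaps are easy to recompute from the table. A minimal sketch, with the throughput numbers typed in by hand from the rows above (the dictionary keys and the `pct_faster` helper are our own naming):

```python
# Throughput figures (tok/s) copied from the benchmark table above.
RESULTS = {
    "Llama 3.1 8B": {"ipex": 42, "sycl": 28, "rtx3060": 45},
    "Qwen3 8B": {"ipex": 40, "sycl": 27, "rtx3060": 43},
    "Qwen3 14B": {"ipex": 28, "sycl": 18, "rtx3060": 26},
    "DeepSeek-R1-Distill-Qwen 14B": {"ipex": 27, "sycl": 17, "rtx3060": 25},
}


def pct_faster(a: float, b: float) -> float:
    """How much faster a is than b, in percent (negative means slower)."""
    return (a / b - 1.0) * 100.0


for model, r in RESULTS.items():
    vs_sycl = pct_faster(r["ipex"], r["sycl"])
    vs_cuda = pct_faster(r["ipex"], r["rtx3060"])
    print(f"{model:<30} IPEX-LLM vs SYCL: {vs_sycl:+5.1f}%   B580 vs RTX 3060: {vs_cuda:+5.1f}%")
```

A negative value in the last column means the RTX 3060 is ahead, which is the case for both 8B rows.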
Which model should you actually run?
Best all-rounder: Qwen3 14B Q4_K_M
If you bought a B580, this is the model to install first. It pushes the card to ~10.8 GB used, leaves about 1 GB of headroom for a 4K-context KV cache, and produces noticeably better reasoning than any 8B model. The official Qwen3 14B model card on Hugging Face documents the chat template and tool-calling format. Use the bartowski GGUF quants for direct Ollama or IPEX-LLM loading.
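To see why a 4K context fits in the remaining headroom, the KV cache size can be estimated from the attention shape. This is a sketch under assumptions: the layer count, KV-head count, and head dimension below are what we believe Qwen3 14B uses and should be checked against the model's config.json, and the cache is assumed to be stored in FP16 (2 bytes per value).

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the K and V caches for one sequence (the factor 2 is keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed Qwen3 14B attention shape; verify against the model's config.json.
QWEN3_14B = dict(n_layers=40, n_kv_heads=8, head_dim=128)

kv_4k = kv_cache_bytes(**QWEN3_14B, context_tokens=4096)
print(f"4K context, FP16 KV cache: {kv_4k / 2**20:.0f} MiB")
```

Under these assumptions the cache grows linearly with context, so doubling the window to 8K would already exceed the roughly 1 GB of headroom.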
Best coding assistant: DeepSeek-R1-Distill-Qwen 14B
For "write me a Python function" workloads, the R1-distilled 14B beats both stock Qwen3 14B and Llama 3.1 8B on HumanEval+ at roughly the same memory footprint. Throughput at 27 tok/s is fast enough that a copilot integration feels live, not laggy.
Best lightweight (B570 owners): Llama 3.1 8B Q4_K_M
The B570's 10 GB ceiling forces you down to 8B for any practical context window. Llama 3.1 8B Instruct at Q4_K_M leaves room for a 16K context and hits 36 tok/s. Qwen3 8B is slightly slower but better at non-English tasks — see the French-focused testing at quelllm.fr for that comparison.
Skip on a single B580
Anything above 14B at Q4. Qwen3-Coder 32B, Llama 3.3 70B, and Mistral Large will not fit. You can run 32B by adding a second B580 with tensor parallel via vLLM XPU — see below — but a single card is firmly in 7B-14B territory.
Recommended software stack
Step 1 — Install the Intel GPU driver stack (Ubuntu 24.04)
sudo apt update
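sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero \
  intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 clinfo
# Verify that OpenCL sees the card and that a kernel driver bound to it
clinfo | grep -i "Arc"
# Battlemage normally binds to the xe kernel driver; i915 covers older Arc cards
sudo dmesg | grep -iwE "i915|xe"
For Battlemage you need kernel 6.11 or newer. Ubuntu 24.04.2 ships 6.11; older 24.04 installs need sudo apt install linux-generic-hwe-24.04.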
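If you script the setup, the kernel requirement can be checked before installing anything else. A small sketch, assuming the usual Linux release string format such as 6.11.0-19-generic; the function names are illustrative:

```python
import platform
import re

MIN_KERNEL = (6, 11)  # minimum kernel stated above for Battlemage


def parse_kernel(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.11.0-19-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_ok(release: str, minimum: tuple[int, int] = MIN_KERNEL) -> bool:
    """True if the release is at least the minimum (major, minor) version."""
    return parse_kernel(release) >= minimum


if platform.system() == "Linux":
    release = platform.release()
    status = "OK" if kernel_ok(release) else "too old, install the HWE kernel"
    print(f"kernel {release}: {status}")
```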
Step 2 — Install IPEX-LLM
conda create -n ipex python=3.11 -y
conda activate ipex
pip install --pre --upgrade ipex-llm[xpu_2.6] \
--extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
# Sanity check
python -c "import torch; import intel_extension_for_pytorch as ipex; \
print(torch.xpu.is_available(), torch.xpu.get_device_name(0))"
Step 3 — Run a model with the IPEX-LLM Ollama fork
# IPEX-LLM ships a drop-in Ollama replacement
pip install --pre --upgrade ipex-llm[cpp]
init-ollama # creates ./ollama symlinks
./ollama serve &
./ollama pull qwen3:14b
./ollama run qwen3:14b
The IPEX-LLM Ollama fork uses the same model registry but routes inference through Battlemage-optimized kernels. This is where the speedup over upstream SYCL shown in the benchmark table comes from.
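Once the server is running you can measure decode speed on your own card through the standard Ollama HTTP API. The sketch below assumes the fork keeps upstream Ollama's /api/generate endpoint and its eval_count and eval_duration (nanoseconds) reply fields; the prompt and function names are placeholders, and this is not the harness used for the numbers in this article.

```python
import json
import statistics
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port


def generate(prompt: str, model: str = "qwen3:14b", url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generation request and return the JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def tokens_per_second(reply: dict) -> float:
    """Decode speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def median_tok_s(replies: list[dict]) -> float:
    """Median decode speed across several runs."""
    return statistics.median(tokens_per_second(r) for r in replies)


def benchmark(runs: int = 5) -> float:
    """Run the same prompt several times against a live server."""
    prompt = "Explain tensor parallelism in two sentences."
    return median_tok_s([generate(prompt) for _ in range(runs)])
```

With ./ollama serve running, calling `benchmark()` returns the median tok/s over five runs.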
Step 4 (optional) — vLLM XPU for serving
pip install vllm # 0.7.3+ has XPU support
vllm serve Qwen/Qwen3-14B-AWQ \
--device xpu --dtype float16 --max-model-len 8192
# Two-GPU tensor parallel for 32B models
vllm serve Qwen/Qwen3-Coder-32B-Instruct-AWQ \
--device xpu --tensor-parallel-size 2 --max-model-len 4096
Two-B580 builds — the real value play
Two B580s cost ~$500 and give you 24 GB of pooled VRAM at 912 GB/s aggregate bandwidth, drawing up to 380 W under load (2 × 190 W TDP). The equivalent NVIDIA path is a single RTX 4090 (~$1,800 used) or two RTX 3060s (~$560, 24 GB, 720 GB/s). Measured in dollars per GB/s of aggregate bandwidth, the dual-B580 path is ~30% cheaper than the dual-3060 one.
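The ~30% figure follows directly from the rounded prices and bandwidths quoted above. A quick worked check (the helper name is ours):

```python
def dollars_per_gb_s(total_price: float, aggregate_bandwidth_gb_s: float) -> float:
    """Price paid per GB/s of aggregate memory bandwidth."""
    return total_price / aggregate_bandwidth_gb_s


dual_b580 = dollars_per_gb_s(500, 2 * 456)     # ~$500 for 912 GB/s
dual_rtx3060 = dollars_per_gb_s(560, 2 * 360)  # ~$560 for 720 GB/s
saving = 1 - dual_b580 / dual_rtx3060

print(f"2x B580:     ${dual_b580:.3f} per GB/s")
print(f"2x RTX 3060: ${dual_rtx3060:.3f} per GB/s")
print(f"B580 advantage: {saving:.0%}")
```

Aggregate bandwidth is only a proxy here, since tensor parallel adds inter-card communication that this ratio ignores.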
The catch is tensor parallel maturity. vLLM XPU works for AWQ and GPTQ checkpoints; GGUF tensor parallel on Battlemage is still experimental in llama.cpp. If your workflow centers on GGUF you are better off with a single card. If you can use AWQ, two B580s are the cheapest way under $600 to run Qwen3-Coder 32B locally.
Cost-per-million-tokens math, including idle power, is in our cost calculator. For a 14B-class workload running 4 hours/day at $0.18/kWh, a single B580 amortizes against a Claude Haiku API spend of roughly $11/month after the first six months.
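For orientation, the electricity term alone can be estimated as follows. This is not the cost calculator's method, just a sketch that assumes the card draws its full 190 W TDP for all four hours, a 30-day month, and no idle draw:

```python
def monthly_energy_cost(watts: float, hours_per_day: float,
                        price_per_kwh: float, days: int = 30) -> float:
    """Electricity cost for a device running at a constant draw."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


# Assumption: full 190 W TDP for all 4 hours per day; idle draw is ignored.
cost = monthly_energy_cost(watts=190, hours_per_day=4, price_per_kwh=0.18)
print(f"B580 electricity at load: ${cost:.2f}/month")
```

Real draw during token generation is usually below TDP, so this is closer to an upper bound for the load hours.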
B580 vs B570 vs A770 — buying verdict
| Use case | Recommended card | Why |
|---|---|---|
| Single-user chat + coding (8B-14B) | Arc B580 12 GB | Best $/tok/s at the 14B tier; mature IPEX-LLM support. |
| Tightest budget, 8B only | Arc B570 10 GB | $30 cheaper but caps you at 8B forever. Not future-proof. |
| Max VRAM under $300 | Arc A770 16 GB (if available) | 16 GB lets you run 14B at FP8 or 22B at Q4. Older Xe1 architecture, no Battlemage uplift. |
| Run 32B+ locally on a budget | 2× Arc B580 (~$500) | Cheapest 24 GB pool with vLLM XPU tensor parallel. |
| Cross-platform tooling priority | RTX 3060 12 GB | If you need MLX, TensorRT, ExLlamaV2, bitsandbytes — CUDA still wins. |
What is still rough
- Flash Attention 2 is not on XPU yet. Long-context (>16K) performance is noticeably worse than CUDA because attention falls back to a slower kernel.
- bitsandbytes 4-bit has no XPU backend. AWQ and GPTQ work; bnb-style on-the-fly quantization does not.
- Windows performance trails Linux by 10-15% on identical drivers as of May 2026. If you must use Windows, install IPEX-LLM inside WSL2, but expect roughly an 8% loss versus native Linux.
- Training and fine-tuning work via PyTorch XPU but ecosystem support (Axolotl, Unsloth) lags. Inference is the right primary use case.
Integrating Battlemage into agent workflows
If you are wiring a local Battlemage box into Claude Code, Cline, or Continue.dev as a fallback model, the path of least resistance is the IPEX-LLM Ollama fork on port 11434 — every IDE plugin already speaks the Ollama API. For MCP-based routing across multiple local providers, the open-source quelllm-mcp server exposes Battlemage endpoints alongside other local backends and handles per-model routing rules. More background on our testing approach is at /about/.
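When the Battlemage box is only a fallback, it helps to confirm that the local server is up and has the model pulled before routing a request to it. A minimal sketch, assuming the fork exposes upstream Ollama's GET /api/tags endpoint on the default port; the function names are illustrative and not part of any plugin's API:

```python
import json
import urllib.error
import urllib.request


def installed_models(tags_reply: dict) -> set[str]:
    """Model names from the JSON returned by Ollama's GET /api/tags."""
    return {entry["name"] for entry in tags_reply.get("models", [])}


def local_model_ready(model: str, base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """True if the local server answers and already has `model` pulled."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return model in installed_models(json.load(response))
    except (urllib.error.URLError, OSError, ValueError):
        return False  # server down, unreachable, or returned non-JSON
```

A router can call `local_model_ready("qwen3:14b")` and use the hosted model only when it returns False.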
Conclusion — should you buy a B580 for local LLMs?
Yes, with one condition: you are comfortable on Linux and you accept that the model ceiling is ~14B at Q4. Under those constraints the B580 is the best $249 you can spend on local inference in 2026. It beats the RTX 3060 12 GB on 14B-class throughput, costs $30 less, and the IPEX-LLM stack is mature enough that day-to-day use is uneventful.
The B570 is harder to recommend. Saving $30 to permanently lose the 14B tier is a false economy; pay the difference. And if you can stretch to $500, two B580s with vLLM XPU are genuinely the cheapest local path to running 32B coders.
| Verdict | Card | Best model | Speed |
|---|---|---|---|
| ★★★★★ Best buy | Arc B580 12 GB ($249) | Qwen3 14B Q4_K_M | 28 tok/s |
| ★★★★ Best value scale-out | 2× Arc B580 ($500) | Qwen3-Coder 32B AWQ | 19 tok/s |
| ★★★ Acceptable | Arc B570 10 GB ($219) | Llama 3.1 8B Q4_K_M | 36 tok/s |
| ★★ Skip | Single B580 for 32B+ models | — | OOM |
Frequently Asked Questions
Does the Intel Arc B580 work with Ollama out of the box?
Not the upstream Ollama binary. You need the IPEX-LLM Ollama fork, which is a drop-in replacement using the same model registry and API. Installation is two pip commands on Ubuntu 24.04 with kernel 6.11+.
How does the B580 compare to the RTX 3060 12 GB for local LLMs?
On 8B models the RTX 3060 is ~5-7% faster (45 vs 42 tok/s on Llama 3.1 8B Q4_K_M). On 14B models the B580 is ~8% faster (28 vs 26 tok/s on Qwen3 14B Q4_K_M) thanks to higher memory bandwidth. The B580 also costs $30 less at MSRP.
Can I run a 32B model on a single B580?
No. A 32B model at Q4_K_M needs ~20 GB of VRAM plus KV cache. You can run 32B by combining two B580s using vLLM XPU tensor parallel, which gives 24 GB of pooled VRAM and ~19 tok/s on Qwen3-Coder 32B AWQ.
Is the B570 worth $30 less than the B580?
No. The B570's 10 GB ceiling means you cannot run any 14B model at Q4_K_M with usable context. You are permanently capped at 7B-9B, which is a significant capability gap. Pay the extra $30 for the B580.
Should I use Windows or Linux for Battlemage LLM inference?
Linux. Ubuntu 24.04 with kernel 6.11+ delivers 10-15% better throughput than Windows on identical drivers, and IPEX-LLM gets Linux fixes first. WSL2 works as a middle ground but runs about 8% slower than native Linux.
Does Flash Attention work on Intel Arc?
Not Flash Attention 2 specifically. IPEX-LLM ships its own optimized attention kernels that are competitive at context lengths up to 8K-16K. Beyond 16K tokens, throughput degrades faster than on CUDA hardware.