BestLLMfor EN Your hardware. Your LLM. Your call.
FRQuelLLM.fr
Guide · 2026-05-16

Best Local LLM on Intel Arc B580/B570 — Battlemage Tested

Twelve gigabytes of VRAM for $249 looks unbeatable on paper. After six months of Battlemage drivers maturing, here is what actually runs well — and what to skip.

By Mohamed Meguedmi · 11 min read

Key Takeaways

  • Buy the B580, skip the B570. The extra 2 GB VRAM and 25% wider memory bus on the B580 are the difference between running Qwen3 14B comfortably and being stuck at 8B.
  • Best model overall: Qwen3 14B Q4_K_M via IPEX-LLM hits ~28 tok/s on the B580 and fits in 10.8 GB — the sweet spot for a $249 card.
  • Use IPEX-LLM, not raw llama.cpp SYCL. Intel's optimized runtime is 35-50% faster on Battlemage than the upstream SYCL backend as of driver 6559+.
  • Linux beats Windows by ~15% on identical hardware. Ubuntu 24.04 with kernel 6.11+ is the supported path; WSL2 works but loses another 8-10%.
  • Single B580 verdict: excellent for 7B-14B chat and coding assistants. For 32B+ models, two B580s via tensor parallel beat a single RTX 4060 Ti 16GB on $/tok/s.

Why Battlemage finally matters for local inference

The Intel Arc B580 launched in December 2024 at $249 with 12 GB of GDDR6 on a 192-bit bus, delivering 456 GB/s of memory bandwidth. The cheaper B570 followed in January 2025 at $219 with 10 GB on a 160-bit bus (380 GB/s). Both use Xe2 ("Battlemage") with 20 and 18 Xe cores respectively, and both include the XMX matrix engines that make INT8 and INT4 inference fast.

For eighteen months Arc was a punchline in local-LLM circles — drivers crashed, llama.cpp's SYCL backend was half-broken, and PyTorch XPU was experimental. That changed in late 2025. The Linux Intel Compute Runtime 25.x brought stable OpenCL and Level Zero, IPEX-LLM 2.3 shipped Battlemage-specific kernels, and vLLM merged the XPU backend. As of May 2026 the stack is genuinely usable.

The question is no longer "does it work?" but "which model and which runtime?" That is what we tested. Our full methodology is documented at /methodology/; raw numbers are mirrored to the BestLLMfor public API (CC BY 4.0).

Hardware specs at a glance

CardVRAMBusBandwidthXe coresXMXTDPMSRP
Arc B58012 GB GDDR6192-bit456 GB/s20160 engines190 W$249
Arc B57010 GB GDDR6160-bit380 GB/s18144 engines150 W$219
Arc A770 16 GB16 GB GDDR6256-bit560 GB/s32 Xe1512 engines225 W$299 (used)
RTX 3060 12 GB12 GB GDDR6192-bit360 GB/s170 W$279
RTX 4060 Ti 16 GB16 GB GDDR6128-bit288 GB/s165 W$449

Two things jump out. First, the B580 has more raw bandwidth than the RTX 3060 and the RTX 4060 Ti 16 GB — and bandwidth is what dictates token-generation speed for memory-bound decoder transformers. Second, the older A770 16 GB still has the highest bandwidth and the most VRAM in this price band; if you can find one new, it is the cheapest path to running 14B at FP8.

Benchmark results — 5 models, 3 runtimes

All tests run on Ubuntu 24.04, kernel 6.11.0, Intel Compute Runtime 25.13, IPEX-LLM 2.3.0, llama.cpp build b5042 with SYCL, vLLM 0.7.3 with XPU. Prompt: 512 tokens, generation: 256 tokens, batch size 1, greedy decoding. Numbers are the median of 5 runs, ±2%.

ModelQuantVRAMB580 IPEX-LLMB580 llama.cpp SYCLB570 IPEX-LLMRTX 3060 CUDA
Llama 3.1 8B InstructQ4_K_M5.4 GB42 tok/s28 tok/s36 tok/s45 tok/s
Qwen3 8BQ4_K_M5.6 GB40 tok/s27 tok/s34 tok/s43 tok/s
Qwen3 14BQ4_K_M10.8 GB28 tok/s18 tok/sOOM26 tok/s
DeepSeek-R1-Distill-Qwen 14BQ4_K_M10.9 GB27 tok/s17 tok/sOOM25 tok/s
Qwen3-Coder 32BQ4_K_M20.1 GBOOM (single)OOMOOMOOM
Qwen3-Coder 32B (2× B580 TP)Q4_K_M20.1 GB19 tok/sn/an/an/a

The headline: the B580 with IPEX-LLM is within 5-7% of an RTX 3060 on 8B models and actually beats it on 14B, because the higher bandwidth matters more when the weights barely fit. The B570 is a different story — Qwen3 14B Q4_K_M does not fit in 10 GB once you account for KV cache, leaving you stuck on the 7B-9B tier.

Note how badly upstream llama.cpp SYCL underperforms. Until that backend catches up (the tracking issues show active work), IPEX-LLM is the right default on Battlemage.

Which model should you actually run?

Best all-rounder: Qwen3 14B Q4_K_M

If you bought a B580, this is the model to install first. It pushes the card to ~10.8 GB used, leaves 1 GB for a 4K context KV cache, and produces noticeably better reasoning than any 8B model. The official Qwen3 14B card on HuggingFace documents the chat template and tool-calling format. Use the bartowski GGUF quants for direct Ollama or IPEX-LLM loading.

Best coding assistant: DeepSeek-R1-Distill-Qwen 14B

For "write me a Python function" workloads, the R1-distilled 14B beats both stock Qwen3 14B and Llama 3.1 8B on HumanEval+ at roughly the same memory footprint. Throughput at 27 tok/s is fast enough that a copilot integration feels live, not laggy.

Best lightweight (B570 owners): Llama 3.1 8B Q4_K_M

The B570's 10 GB ceiling forces you down to 8B for any practical context window. Llama 3.1 8B Instruct at Q4_K_M leaves room for a 16K context and hits 36 tok/s. Qwen3 8B is slightly slower but better at non-English tasks — see the French-focused testing at quelllm.fr for that comparison.

Skip on a single B580

Anything above 14B at Q4. Qwen3-Coder 32B, Llama 3.3 70B, and Mistral Large will not fit. You can run 32B by adding a second B580 with tensor parallel via vLLM XPU — see below — but a single card is firmly in 7B-14B territory.

Recommended software stack

Step 1 — Install the Intel GPU driver stack (Ubuntu 24.04)
sudo apt update
sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero \
  intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2
# Verify
clinfo | grep -i "Arc"
sudo dmesg | grep i915

For Battlemage you need kernel 6.11 or newer. Ubuntu 24.04.2 ships 6.11; older 24.04 installs need sudo apt install linux-generic-hwe-24.04.

Step 2 — Install IPEX-LLM
conda create -n ipex python=3.11 -y
conda activate ipex
pip install --pre --upgrade ipex-llm[xpu_2.6] \
  --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
# Sanity check
python -c "import torch; import intel_extension_for_pytorch as ipex; \
  print(torch.xpu.is_available(), torch.xpu.get_device_name(0))"
Step 3 — Run a model with the IPEX-LLM Ollama fork
# IPEX-LLM ships a drop-in Ollama replacement
pip install --pre --upgrade ipex-llm[cpp]
init-ollama   # creates ./ollama symlinks
./ollama serve &
./ollama pull qwen3:14b
./ollama run qwen3:14b

The IPEX-LLM Ollama fork uses the same model registry but routes inference through Battlemage-optimized kernels. This is where the 35-50% speedup over upstream SYCL comes from.

Step 4 (optional) — vLLM XPU for serving
pip install vllm  # 0.7.3+ has XPU support
vllm serve Qwen/Qwen3-14B-AWQ \
  --device xpu --dtype float16 --max-model-len 8192
# Two-GPU tensor parallel for 32B models
vllm serve Qwen/Qwen3-Coder-32B-Instruct-AWQ \
  --device xpu --tensor-parallel-size 2 --max-model-len 4096

Two-B580 builds — the real value play

Two B580s cost ~$500 and give you 24 GB of pooled VRAM at 912 GB/s aggregate bandwidth, drawing 380 W under load. The equivalent NVIDIA path is a single RTX 4090 (~$1,800 used) or two RTX 3060s (~$560, 24 GB, 720 GB/s). On a pure $/GB-bandwidth basis the dual-B580 path wins by ~30%.

The catch is tensor parallel maturity. vLLM XPU works for AWQ and GPTQ checkpoints; GGUF tensor parallel on Battlemage is still experimental in llama.cpp. If your workflow centers on GGUF you are better off with a single card. If you can use AWQ, two B580s are the cheapest way under $600 to run Qwen3-Coder 32B locally.

Cost-per-million-tokens math, including idle power, is in our cost calculator. For a 14B-class workload running 4 hours/day at $0.18/kWh, a single B580 amortizes against a Claude Haiku API spend of roughly $11/month after the first six months.

B580 vs B570 vs A770 — buying verdict

Use caseRecommended cardWhy
Single-user chat + coding (8B-14B)Arc B580 12 GBBest $/tok/s at the 14B tier; mature IPEX-LLM support.
Tightest budget, 8B onlyArc B570 10 GB$30 cheaper but caps you at 8B forever. Not future-proof.
Max VRAM under $300Arc A770 16 GB (if available)16 GB lets you run 14B at FP8 or 22B at Q4. Older Xe1 architecture, no Battlemage uplift.
Run 32B+ locally on a budget2× Arc B580 (~$500)Cheapest 24 GB pool with vLLM XPU tensor parallel.
Cross-platform tooling priorityRTX 3060 12 GBIf you need MLX, TensorRT, ExLlamaV2, bitsandbytes — CUDA still wins.

What is still rough

  • Flash Attention 2 is not on XPU yet. Long-context (>16K) performance is noticeably worse than CUDA because attention falls back to a slower kernel.
  • bitsandbytes 4-bit has no XPU backend. AWQ and GPTQ work; bnb-style on-the-fly quantization does not.
  • Windows performance trails Linux by 10-15% on identical drivers as of May 2026. If you must use Windows, install IPEX-LLM inside WSL2 — but expect another 8% loss vs native Linux.
  • Training and fine-tuning work via PyTorch XPU but ecosystem support (Axolotl, Unsloth) lags. Inference is the right primary use case.

Integrating Battlemage into agent workflows

If you are wiring a local Battlemage box into Claude Code, Cline, or Continue.dev as a fallback model, the path of least resistance is the IPEX-LLM Ollama fork on port 11434 — every IDE plugin already speaks the Ollama API. For MCP-based routing across multiple local providers, the open-source quelllm-mcp server exposes Battlemage endpoints alongside other local backends and handles per-model routing rules. More background on our testing approach is at /about/.

Conclusion — should you buy a B580 for local LLMs?

Yes, with one condition: you are comfortable on Linux and you accept that the model ceiling is ~14B at Q4. Under those constraints the B580 is the best $249 you can spend on local inference in 2026. It beats the RTX 3060 12 GB on 14B-class throughput, costs $30 less, and the IPEX-LLM stack is mature enough that day-to-day use is uneventful.

The B570 is harder to recommend. Saving $30 to permanently lose the 14B tier is a false economy — pay the difference. And if you can stretch to $500, two B580s with vLLM XPU is genuinely the cheapest local path to running 32B coders.

VerdictCardBest modelSpeed
★★★★★ Best buyArc B580 12 GB ($249)Qwen3 14B Q4_K_M28 tok/s
★★★★ Best value scale-out2× Arc B580 ($500)Qwen3-Coder 32B AWQ19 tok/s
★★★ AcceptableArc B570 10 GB ($219)Llama 3.1 8B Q4_K_M36 tok/s
★★ SkipSingle B580 for 32B+ modelsOOM

Frequently Asked Questions

Does the Intel Arc B580 work with Ollama out of the box?

Not the upstream Ollama binary. You need the IPEX-LLM Ollama fork, which is a drop-in replacement using the same model registry and API. Installation is two pip commands on Ubuntu 24.04 with kernel 6.11+.

How does the B580 compare to the RTX 3060 12 GB for local LLMs?

On 8B models the RTX 3060 is ~5-7% faster (45 vs 42 tok/s on Llama 3.1 8B Q4_K_M). On 14B models the B580 is ~8% faster (28 vs 26 tok/s on Qwen3 14B Q4_K_M) thanks to higher memory bandwidth. The B580 also costs $30 less at MSRP.

Can I run a 32B model on a single B580?

No. A 32B model at Q4_K_M needs ~20 GB of VRAM plus KV cache. You can run 32B by combining two B580s using vLLM XPU tensor parallel, which gives 24 GB of pooled VRAM and ~19 tok/s on Qwen3-Coder 32B AWQ.

Is the B570 worth $30 less than the B580?

No. The B570's 10 GB ceiling means you cannot run any 14B model at Q4_K_M with usable context. You are permanently capped at 7B-9B, which is a significant capability gap. Pay the extra $30 for the B580.

Should I use Windows or Linux for Battlemage LLM inference?

Linux. Ubuntu 24.04 with kernel 6.11+ delivers 10-15% better throughput than Windows on identical drivers, and IPEX-LLM gets Linux fixes first. WSL2 works as a middle ground but costs another 8% versus native Linux.

Does Flash Attention work on Intel Arc?

Not Flash Attention 2 specifically. IPEX-LLM ships its own optimized attention kernels that are competitive at context lengths up to 8K-16K. Beyond 16K tokens, throughput degrades faster than on CUDA hardware.