Guide · 2026-05-16

Local LLM on Raspberry Pi 5 — Yes, It Works (Just Painful)

Last updated 2026-05-16

A Raspberry Pi 5 can run a local LLM. Whether it should run one for your use case is a very different question, and the honest answer is rarely.

By Mohamed Meguedmi · 11 min read

Key takeaways

It works. A Raspberry Pi 5 (8 GB or 16 GB) running Ollama or llama.cpp will usefully serve models up to roughly 4B parameters at Q4_K_M: Llama 3.2 3B, Gemma 3 4B, Qwen 2.5 3B Instruct.
Expect 4–7 tokens/sec on 3B Q4 models with the Pi 5 16 GB at stock 2.4 GHz, dropping to 1.5–2.5 t/s on 7B Q4 once you push past the CPU's sweet spot.
Cooling is non-negotiable. Without an Active Cooler or the official case fan, the SoC throttles within 90 seconds of sustained inference and your tokens/sec drops by roughly 40% (4.1 t/s throttled versus 6.9 t/s cooled in our tests).
The Hailo-8L AI HAT+ does not help LLM decoding in 2026 — it accelerates vision and small transformer encoders, not autoregressive text generation in Ollama/llama.cpp.
Verdict: buy a used mini PC with an N100 or Ryzen 7 5825U for the same money if your goal is local chat. Pick the Pi 5 only when low idle power (under 5 W), GPIO, or fleet deployment matters more than throughput.

Why the Pi 5 is the wrong tool for the right reasons

The Raspberry Pi 5 sells for $80 (8 GB) or $120 (16 GB) as of May 2026. With a power supply, NVMe HAT, 500 GB SSD, Active Cooler and case, a build that can actually host an LLM 24/7 lands at $185–$230. That is roughly the price of a used Beelink S12 Pro (Intel N100, 16 GB DDR4) on eBay, which will deliver 3–5× the tokens/sec on the same models.

Yet the Pi 5 LLM use case persists, and rationally so. The board idles at 3 W and peaks near 12 W under inference. It runs silently with passive cooling at light loads. It has 40 GPIO pins. It boots off a $4 SD card if a disk fails. None of those properties matter for a chat companion. All of them matter when the LLM is glue inside a voice assistant, kiosk, edge sensor hub, or a small fleet of identical units in a classroom or workshop.

This guide assumes you have already decided the form factor matters. If you have not, run the numbers in our cost calculator before buying anything.

Hardware that actually matters

Three components determine whether a Pi 5 LLM build is usable or miserable: RAM, storage, and cooling. The CPU is fixed at a quad-core Cortex-A76 at 2.4 GHz (overclockable to 3.0 GHz with sufficient cooling and an SSD root).

Component	Minimum viable	Recommended	Why it matters
Board	Pi 5 8 GB ($80)	Pi 5 16 GB ($120)	16 GB lets you load Gemma 3 4B Q8 or Qwen 2.5 7B Q4 with KV cache headroom.
Power supply	Official 27 W USB-C PD	Official 27 W USB-C PD	Many third-party chargers cannot supply the 5 V / 5 A the Pi 5 is designed for; the Pi will under-volt and drop tokens/sec by 20–40%.
Cooling	Official Active Cooler ($5)	Argon ONE V3 or Pironman 5 case	SoC throttles at 85 °C. Sustained inference holds the SoC above 80 °C without forced airflow.
Storage	microSD A2 256 GB	NVMe HAT + 500 GB Gen3 SSD	SD cards bottleneck model load (60 MB/s vs 800+ MB/s on NVMe) and wear out under swap.
Optional	—	Hailo-8L AI HAT+ (for vision)	Useful for Whisper + YOLO pipelines, not for LLM token generation in llama.cpp.

The Hailo claim deserves expansion because the marketing implies otherwise. The Hailo-8L is a 13 TOPS INT8 NPU optimized for static graphs such as image classification, object detection, and ASR encoders. Autoregressive decoding is memory-bandwidth bound, not compute bound, and llama.cpp has no Hailo backend in any stable release as of May 2026. If a vendor tells you the AI HAT+ accelerates Llama, ask for tokens/sec numbers and a commit hash.

The software stack that works in 2026

Two stacks are worth installing on Pi OS Bookworm 64-bit: Ollama for ease of use, llama.cpp for tuning. Both run natively on aarch64 and both use the Pi 5's NEON SIMD instructions.

Install Ollama on Pi 5 (10 minutes)

Flash Raspberry Pi OS Bookworm 64-bit Lite to NVMe with rpi-imager. Enable SSH and set a username during imaging.
SSH in, then sudo apt update && sudo apt full-upgrade -y && sudo reboot.
Install Ollama: curl -fsSL https://ollama.com/install.sh | sh. The aarch64 binary is detected automatically.
Pull a model sized for the board: ollama pull gemma3:4b-it-q4_K_M on 16 GB, or ollama pull llama3.2:3b-instruct-q4_K_M on 8 GB.
Test throughput: ollama run gemma3:4b-it-q4_K_M --verbose "Write a haiku about heat sinks." Note the eval rate line — that is your tokens/sec.
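Expose on the LAN by running sudo systemctl edit ollama and adding Environment="OLLAMA_HOST=0.0.0.0:11434" under a [Service] heading, then sudo systemctl daemon-reload && sudo systemctl restart ollama. A drop-in override is safer than editing /etc/systemd/system/ollama.service directly, because the install script rewrites that file when you run it again.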
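Once the endpoint is reachable, the eval rate can also be read programmatically. The sketch below assumes Ollama's documented /api/generate response fields eval_count (generated tokens) and eval_duration (nanoseconds); the helper names and the host name in the usage note are illustrative, not part of Ollama.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation rate from an Ollama /api/generate response body."""
    # eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(host: str, model: str, prompt: str, timeout: float = 600.0) -> float:
    """Send one non-streaming request and return generation tokens/sec."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"http://{host}:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return tokens_per_second(json.load(reply))
```

From another machine on the LAN, a call such as measure("raspberrypi.local", "llama3.2:3b-instruct-q4_K_M", "Write a haiku about heat sinks.") should return a figure close to the eval rate printed by ollama run --verbose. The host name here is a placeholder for your own board.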

For remote access over the open internet, do not port-forward 11434. Use Tailscale or our reference recipe via the methodology guide, which mirrors what the community Tailscale + Chatbox writeup recommends.

Benchmarks: tokens per second, real numbers

Numbers below come from the BestLLMfor test bench: Pi 5 16 GB on the official Active Cooler, NVMe HAT+ with a Samsung 980 1 TB, Pi OS Bookworm 64-bit, kernel 6.6, Ollama 0.5.x with llama.cpp build b4500-class. Prompts are 256 tokens, generation capped at 256. Ambient 22 °C. Each result is the median of five runs after a 60-second warmup. Raw data is in our public benchmark dataset (CC BY 4.0) and via the BestLLMfor public API.
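For readers who want to reproduce the median-of-five procedure, a minimal sketch follows. It simplifies the 60-second warmup to a fixed number of discarded runs, and run_once is a hypothetical callable that performs one generation and returns its tokens/sec; neither is taken from the actual test bench code.

```python
import statistics
from typing import Callable


def median_rate(run_once: Callable[[], float], runs: int = 5, warmup_runs: int = 1) -> float:
    """Discard warmup runs, then return the median of the measured runs."""
    for _ in range(warmup_runs):
        run_once()  # result ignored: caches and clocks are still settling
    return statistics.median(run_once() for _ in range(runs))
```

The median is used rather than the mean so that a single throttled or interrupted run does not drag the reported figure.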

Model	Quant	Size on disk	RAM at runtime	Prompt eval	Generation	Verdict
TinyLlama 1.1B	Q4_K_M	0.7 GB	1.1 GB	42 t/s	14.8 t/s	Fast, but answers are weak. Toy use only.
Llama 3.2 3B Instruct	Q4_K_M	2.0 GB	3.4 GB	18 t/s	6.9 t/s	Best general-purpose pick on 8 GB Pi.
Gemma 3 4B IT	Q4_K_M	2.5 GB	4.1 GB	14 t/s	5.3 t/s	Best instruction following under 5B on Pi 5.
Qwen 2.5 3B Instruct	Q4_K_M	1.9 GB	3.2 GB	20 t/s	7.2 t/s	Strongest for multilingual + JSON output.
Phi-3.5 Mini 3.8B	Q4_K_M	2.3 GB	3.8 GB	15 t/s	5.8 t/s	Great reasoning, slow on long prompts.
Llama 3.1 8B Instruct	Q4_K_M	4.9 GB	7.6 GB	5 t/s	2.1 t/s	Requires 16 GB board. Borderline usable.
Qwen 2.5 7B Instruct	Q4_K_M	4.4 GB	7.1 GB	6 t/s	2.4 t/s	Best 7B quality, still slow. Batch jobs only.

The cliff between 4B and 7B is real. A 3B Q4 model holds your attention. A 7B Q4 model at 2.4 t/s breaks the rhythm of an interactive session — the human side reads faster than the model writes. For batch summarization, log triage, or queued tool calls that the user does not watch token by token, 7B is fine.
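To see what those rates mean in wall-clock terms, here is a rough estimate of end-to-end reply time, assuming prompt evaluation and generation run back to back at the table's rates and ignoring model load time. The function name is illustrative.

```python
def reply_seconds(prompt_tokens: int, reply_tokens: int,
                  prompt_tps: float, generation_tps: float) -> float:
    """Seconds to read the prompt and then write the reply."""
    return prompt_tokens / prompt_tps + reply_tokens / generation_tps


# 256-token prompt and 256-token reply, rates taken from the benchmark table.
llama_3b = reply_seconds(256, 256, prompt_tps=18, generation_tps=6.9)
qwen_7b = reply_seconds(256, 256, prompt_tps=6, generation_tps=2.4)
print(round(llama_3b), round(qwen_7b))  # about 51 s versus about 149 s
```

Under these assumptions the 7B model takes roughly three times as long for the same exchange, which is why it suits queued work better than live chat.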

What about Qwen3-Coder, DeepSeek and the headline 30B models?

Don't. The Pi 5 has roughly 17 GB/s of memory bandwidth (LPDDR4X-4267). Decoding speed for a dense Q4 model is bounded by that bandwidth divided by model size, because every generated token requires reading the full set of weights. A 30B Q4 model weighs around 17 GB; you would not fit it on a 16 GB board even before considering the KV cache, and even if it did fit, 17 GB/s divided by 17 GB caps decoding at roughly 1 t/s. Qwen 2.5-Coder 3B is the realistic ceiling for code completion on this hardware.
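The bandwidth bound can be turned into a quick estimate. This sketch assumes the on-disk size of the quantized model approximates the bytes read per generated token and ignores KV cache traffic, so it gives an upper limit rather than a prediction.

```python
PI5_BANDWIDTH_GB_S = 17.0  # approximate figure quoted above for LPDDR4X-4267


def decode_ceiling_tps(model_gb: float, bandwidth_gb_s: float = PI5_BANDWIDTH_GB_S) -> float:
    """Upper bound on generation tokens/sec for a dense model of the given size."""
    return bandwidth_gb_s / model_gb


for name, size_gb in [("Llama 3.2 3B Q4", 2.0), ("Qwen 2.5 7B Q4", 4.4), ("30B Q4", 17.0)]:
    print(f"{name}: at most {decode_ceiling_tps(size_gb):.1f} t/s")
```

The measured 6.9 t/s and 2.4 t/s in the benchmark table sit below the 8.5 t/s and 3.9 t/s ceilings this gives for the 3B and 7B models, as expected for a bound.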

Thermals, throttling, and the case that matters

The Pi 5 SoC (BCM2712) throttles aggressively. The default kernel governor is ondemand, and the temperature cap kicks in at 80 °C with hard throttle at 85 °C. Under sustained Ollama inference, a bare board hits 85 °C in roughly 90 seconds.
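To check whether a run was throttled, one option is to decode the bitmask printed by vcgencmd get_throttled and read the SoC temperature from sysfs. The sketch below assumes the bit meanings given in the Raspberry Pi documentation and the usual thermal_zone0 path; the function names are illustrative.

```python
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output: str) -> list[str]:
    """Turn a line such as 'throttled=0x50005' into readable flags."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [label for bit, label in THROTTLE_FLAGS.items() if value & (1 << bit)]


def soc_temperature_c(path: str = "/sys/class/thermal/thermal_zone0/temp") -> float:
    """Read the SoC temperature; the kernel reports millidegrees Celsius."""
    with open(path) as handle:
        return int(handle.read().strip()) / 1000.0
```

An empty list from decode_throttled after a benchmark loop suggests the numbers were not affected by throttling or under-voltage since boot.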

Four cooling setups, measured at 25 °C ambient under a 5-minute Llama 3.2 3B Q4 loop:

Cooling setup	Steady-state SoC temp	Generation tokens/sec	Cost
Passive heatsink only	83 °C (throttled)	4.1 t/s	$3
Official Active Cooler	62 °C	6.9 t/s	$5
Argon ONE V3 + 30 mm fan	54 °C	7.1 t/s	$30
Pironman 5 (tower + ICE)	49 °C	7.3 t/s (overclock to 2.8 GHz stable)	$70

The Active Cooler is the only purchase that pays for itself instantly. Premium cases buy you another 8–13 °C and a chassis you can leave on a desk, but only 0.2–0.4 t/s. Overclocking to 2.8 GHz adds about 6% to tokens/sec and requires NVMe boot, because the SD card controller is on the SoC and becomes unstable at the same time the cores do.

When the Pi 5 is the right call (and when it is not)

The honest decision tree, after testing the Pi 5 against an N100 mini PC and a used Ryzen 5825U laptop on the same models:

Voice assistant in a single room, with a wake word and short replies — Pi 5 16 GB + Active Cooler + a Hailo HAT for Whisper. Good fit.
Always-on RAG endpoint for personal notes serving one to two users — Pi 5 16 GB is fine; expect 5–7 t/s with a 3B model and your retrieval doing most of the work. Pair with our open-source quelllm-mcp server for an MCP front-end.
Coding copilot — no. Even Qwen 2.5-Coder 3B at 7 t/s is below the latency floor for inline completion. Use a remote endpoint or local laptop.
Classroom or workshop fleet (5+ identical units) — Pi 5 wins on uniformity, low idle power, and PoE-HAT options. Run TinyLlama or Llama 3.2 1B for snappy demos.
Daily chat with a 7B+ model — buy a mini PC. A new Beelink N100 with 16 GB is $179; a used Lenovo ThinkCentre M75q with a Ryzen 5825U is $220 and triples the throughput.

Model picks: what to actually pull tonight

Three models cover 90% of viable Pi 5 use. Pull all three: together they take about 6.4 GB on disk, well within 32 GB of free SSD, and about 10.7 GB of RAM if all are loaded at once on a 16 GB board.

Llama 3.2 3B Instruct Q4_K_M — default everyday model. Strong English, decent function calling. See the official Meta model card.
Gemma 3 4B IT Q4_K_M — best instruction adherence in this size class. Multilingual, structured output friendly. ollama.com/library/gemma3.
Qwen 2.5 3B Instruct Q4_K_M — best for non-English prompts and JSON tool calls. Sister site quelllm.fr has French-language benchmarks for this one.

A Pi 5 with the right 3B model is not a slow ChatGPT. It is a fast, private, $200 microservice that answers in 3 seconds instead of 300 milliseconds. Stop comparing it to GPT-4o and it becomes useful.

Power, cost, and the five-year math

A Pi 5 running an Ollama endpoint 24/7 with light usage (under 5% inference duty cycle) averages 4.2 W at the wall: about $4.40/year at the US average of $0.12/kWh, or about £9.90/year in the UK at £0.27/kWh. Over five years that is roughly $22, about half the combined cost of the $5 Active Cooler and a $40 NVMe drive. The same workload on a Beelink N100 averages 7 W idle, roughly 1.7× the energy and still trivial.

What this means: do not buy a Pi 5 to save power. Buy it for the form factor, GPIO, or fleet uniformity. The kilowatt-hours are noise either way.
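The arithmetic is easy to rerun with your own tariff. A minimal sketch, assuming a constant average draw over the whole year:

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost(average_watts: float, price_per_kwh: float) -> float:
    """Yearly electricity cost for a device with a constant average draw."""
    kwh_per_year = average_watts * HOURS_PER_YEAR / 1000
    return kwh_per_year * price_per_kwh


pi5 = annual_cost(4.2, 0.12)   # Pi 5 at 4.2 W and $0.12/kWh
n100 = annual_cost(7.0, 0.12)  # N100 mini PC at 7 W idle
print(f"Pi 5: ${pi5:.2f}/year, N100: ${n100:.2f}/year")
```

At these rates the gap between the two machines is about $3 per year.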

FAQ

Can a Raspberry Pi 5 really run an LLM?

Yes. A Raspberry Pi 5 with 8 GB or 16 GB of RAM, an Active Cooler, and Ollama or llama.cpp will serve 1B to 4B parameter models at Q4 quantization between 5 and 15 tokens/sec. 7B models work on 16 GB boards at around 2–3 tokens/sec, which is too slow for interactive chat but fine for batch jobs.

Which is the fastest LLM on Pi 5?

TinyLlama 1.1B Q4_K_M at roughly 15 tokens/sec generation. For useful answers, Llama 3.2 3B Instruct Q4_K_M at 6.9 t/s and Qwen 2.5 3B Instruct at 7.2 t/s are the practical leaders. Anything above 4B parameters drops below the comfortable interactive threshold.

Does the Hailo AI HAT+ speed up Llama on Pi 5?

No, not in May 2026. The Hailo-8L NPU accelerates vision models and ASR encoders such as Whisper, but llama.cpp and Ollama have no production Hailo backend for autoregressive text generation. Use it for Whisper + YOLO pipelines, not for LLM decoding.

How much RAM do I need on a Pi 5 for local LLM?

8 GB is enough for any model up to 4B parameters at Q4. Go 16 GB if you want headroom for 7B Q4 models, larger context windows (above 8K tokens), or to run a model alongside Whisper, embeddings, and a reranker on the same board.
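Context length affects RAM through the KV cache, which grows linearly with the number of tokens held. The sketch below uses the standard size formula for grouped-query attention with a 16-bit cache. The Llama 3.2 3B figures in the example (28 layers, 8 KV heads, head dimension 128) are assumptions taken from the model's published configuration and should be checked against the model you actually run.

```python
def kv_cache_bytes(context_tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for one sequence."""
    # Factor 2 covers one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


# Assumed Llama 3.2 3B shape, 8K context, fp16 cache.
size = kv_cache_bytes(8192, layers=28, kv_heads=8, head_dim=128)
print(f"{size / 2**30:.3f} GiB")  # 0.875 GiB
```

Doubling the context doubles this figure, which is the headroom the 16 GB board buys for long prompts.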

Pi 5 vs N100 mini PC for LLM — which should I buy?

If your only goal is local chat, buy a used N100 or Ryzen 5825U mini PC for $150–$220. You get 3–5× more tokens/sec on the same models. Choose Pi 5 only when you need GPIO, sub-5 W idle power, fleet uniformity across many units, or a specific HAT integration.

Will overclocking the Pi 5 to 3.0 GHz help LLM speed?

A little. Stable 2.8 GHz with NVMe boot and tower cooling gains roughly 6% in tokens/sec. The Pi 5 is memory-bandwidth bound on LLM decoding, so CPU clock has diminishing returns. Spend the time on cooling and an NVMe drive first.

Final verdict

Use case	Recommended build	Verdict
Interactive chat, daily driver	Used Beelink N100, 16 GB	Skip the Pi 5
Always-on private RAG / MCP endpoint	Pi 5 16 GB + NVMe + Active Cooler	Buy it
Voice assistant with wake word	Pi 5 16 GB + Hailo HAT+ + mic array	Buy it
Local coding copilot	Laptop with 16+ GB unified memory	Skip the Pi 5
Classroom fleet / kiosk	Pi 5 8 GB + PoE HAT + Llama 3.2 3B	Buy it
Experiment for under $200	Pi 5 16 GB starter bundle	Buy it, eyes open

The Pi 5 LLM story in 2026 is not about beating a mini PC on tokens/sec. It is about whether a 12 W, fanless-capable, GPIO-equipped board belongs in your stack. For most readers asking the question, the answer is no — but for the ones it fits, nothing else on the market is close.
