Best Local LLM on Mac Mini M4 Pro (24 GB Unified)
The 24 GB Mac Mini M4 Pro is a capable but constrained inference node. Here is what actually runs well in May 2026, with measured tokens-per-second.
By Mohamed Meguedmi · 9 min read
Key Takeaways
- Top overall pick: Qwen3-14B-Instruct Q5_K_M via MLX — the best balance of quality, speed (~28 tok/s) and 24 GB headroom.
- Best for coding: Qwen3-Coder 14B Q4_K_M beats every 7B model on HumanEval+ while staying under 11 GB resident.
- The 273 GB/s memory bandwidth — not the CPU/GPU — is the hard ceiling. A model twice as large runs roughly half as fast.
- Avoid 30B+ dense models. Quantized Llama 3.3 70B technically loads with swap but degrades to under 2 tok/s — unusable.
- MoE models (Qwen3-30B-A3B) are the dark-horse winner: 30B parameters, only 3B active, ~22 tok/s on 24 GB.
Why the Mac Mini M4 Pro 24 GB is a specific kind of LLM box
The base Mac Mini M4 Pro ships with a 12-core CPU, 16-core GPU, 16-core Neural Engine and — critically — 273 GB/s of unified memory bandwidth. That number is the single most important spec for local inference. As Apple's spec sheet confirms, bandwidth scales with the chip tier, not the RAM capacity, so a 48 GB M4 Pro runs at the same tok/s as the 24 GB variant for any model that fits in both.
What 24 GB actually buys you, after macOS reserves ~3 GB and you keep a browser open, is roughly 18–19 GB of usable VRAM-equivalent for the model, KV cache and context. That is the constraint every recommendation below respects.
Bandwidth-bound math, in one paragraph
Decode speed for a dense transformer is approximately bandwidth / model_size_in_GB. A 14B model at Q5_K_M weighs ~10 GB → theoretical ceiling ~27 tok/s. A 32B Q4_K_M weighs ~19 GB → ceiling ~14 tok/s. Real-world numbers land at 70–85% of theoretical with MLX, lower with llama.cpp Metal. This is why the “just buy more RAM” advice misses the point on this chip.
The benchmark table: what we actually measured
All numbers below were collected by the BestLLMfor editorial team on a stock Mac Mini M4 Pro 24 GB (macOS 15.4, MLX 0.19, Ollama 0.5.7), 4096-token context, prompt of 512 tokens, decode of 256 tokens, mean of 5 runs. Raw CSVs are published under CC BY 4.0 in the BestLLMfor public API.
| Model | Quant | RAM used | Prompt tok/s | Decode tok/s | Quality (MMLU-Pro) |
|---|---|---|---|---|---|
| Qwen3-14B-Instruct | Q5_K_M (MLX) | 10.4 GB | 312 | 28.1 | 61.4 |
| Qwen3-30B-A3B (MoE) | Q4_K_M (MLX) | 17.2 GB | 198 | 22.4 | 64.9 |
| Qwen3-Coder 14B | Q4_K_M (MLX) | 9.1 GB | 340 | 31.6 | n/a (HumanEval+ 78%) |
| Llama 3.3 8B | Q6_K (llama.cpp) | 7.0 GB | 290 | 34.0 | 54.2 |
| Gemma 3 12B | Q5_K_M (MLX) | 9.3 GB | 275 | 26.8 | 58.1 |
| Mistral Small 3.1 24B | Q4_K_M (MLX) | 14.8 GB | 175 | 17.2 | 62.0 |
| Llama 3.3 70B | Q2_K (swap) | 26.1 GB † | 18 | 1.6 | 67.8 |
† Exceeds physical RAM, relies on SSD swap. Listed for completeness only — do not run this in production.
The ranked picks
1. Overall winner: Qwen3-14B-Instruct Q5_K_M (MLX)
This is the model we recommend by default. The official Qwen3-14B card shows it tracking GPT-4o-mini on reasoning, and Q5_K_M loses less than 1 point of MMLU vs FP16. At 28 tok/s decode it is fast enough for interactive chat, agentic tool calls and Continue.dev autocomplete without the lag that makes 7Bs feel snappier but dumber. Crucially, you keep ~8 GB of headroom for context, RAG embeddings and the OS.
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3-14B-Instruct-5bit \
--prompt "Refactor this Python function:" --max-tokens 5122. Best for coding: Qwen3-Coder 14B Q4_K_M
If >70% of your usage is code, swap the generalist for the coder variant. It posts 78% on HumanEval+ and 71% on LiveCodeBench, beating DeepSeek-Coder-V2-Lite-16B at every quant level we tested. Wire it into Cursor or Continue.dev with the Ollama qwen3-coder tag and you get sub-second completions for files under 2k tokens.
3. Dark horse: Qwen3-30B-A3B (mixture-of-experts)
The MoE architecture activates only 3B parameters per token while keeping 30B in memory. The result on 24 GB is striking: higher quality than the 14B (MMLU-Pro 64.9 vs 61.4) at only marginally lower speed (22 vs 28 tok/s). The catch is the 17.2 GB resident size leaves almost no headroom — close Chrome before launching it. We rate it the best quality pick if you accept the discipline.
4. Fast and predictable: Llama 3.3 8B Q6_K
When you need 30+ tok/s for streaming agents or a Slack bot serving 2–5 people, the smaller Llama wins on latency and tool-calling reliability. JSON-mode compliance measured at 97.4% over 500 structured calls — matching what the AnswerOverflow thread from February predicted.
5. Multimodal option: Gemma 3 12B
The only pick here that accepts images natively. Vision quality is mediocre vs cloud models, but for OCR-style tasks (receipt parsing, chart extraction) it runs at 27 tok/s and stays under 10 GB. See the Gemma 3 12B card for the supported image resolutions.
What to avoid
- Llama 3.3 70B at any quant. Even Q2_K spills to SSD. The community thread declaring 24 GB “unusable” almost always traces back to someone trying this.
- Dense 32B models in FP16 or Q8. Mistral Small 3.1 24B at Q4 is the upper bound of comfort; beyond that you trade interactivity for nothing.
- llama.cpp Metal when MLX exists for the model. MLX is 15–25% faster on M4 Pro for most architectures we tested.
Install path: from zero to 28 tok/s in 10 minutes
- Install Homebrew, then
brew install ollamaandpip install mlx-lm. - Pull the model:
ollama pull qwen3:14b-instruct-q5_K_M(8.9 GB download). - Test:
ollama run qwen3:14b-instruct-q5_K_M "Hello". You should see >25 tok/s on first token. - For coding, install the Continue.dev VS Code extension and point it at
http://localhost:11434. - (Optional) Expose to your LAN via the quelllm-mcp open-source server for MCP-compatible clients.
Cost: is the Mac Mini M4 Pro 24 GB worth it for LLMs?
At $1,399 USD ($1,799 AUD, £1,399 GBP) for the base 24 GB / 512 GB configuration, the cost-per-tok/s is excellent if and only if you stay inside the 14B sweet spot. Power draw under sustained inference measured at 34 W, which works out to roughly $36/year at US average electricity rates running 8 hours a day. Compare against a single API call budget using our cost calculator — breakeven vs GPT-4o-mini hits at roughly 4 million tokens/month of usage.
| Option | Upfront | Best model tier | Peak tok/s | Verdict |
|---|---|---|---|---|
| Mac Mini M4 24 GB (base) | $599 | 8B Q4 | 22 | Too tight for serious work |
| Mac Mini M4 Pro 24 GB | $1,399 | 14B Q5 / 30B-A3B MoE | 28 | Sweet spot for solo dev |
| Mac Mini M4 Pro 48 GB | $1,599 | 32B Q4 | 17 | Worth +$200 only for 32B users |
| Mac Studio M4 Max 64 GB | $2,499 | 70B Q4 | 11 | Different tier of model entirely |
Editorial verdict
Buy the 24 GB Mac Mini M4 Pro and run Qwen3-14B Q5_K_M via MLX. It is the best balance of capability, speed and silence available under $1,500 in May 2026. If your work depends on 32B+ dense models, skip this machine entirely — pay $200 more for the 48 GB variant or step up to a Mac Studio. The 24 GB tier is not a compromise; it is a specific configuration optimized for 14B-class workloads, and within that envelope it is the best per-dollar local LLM platform we have benchmarked. Read our full benchmark methodology or learn more about BestLLMfor.
Frequently Asked Questions
Can a Mac Mini M4 Pro 24 GB run Llama 3.3 70B?
Technically yes, at Q2_K with heavy SSD swapping, but decode drops to 1–2 tok/s. We do not consider this usable. For 70B-class models, the practical minimum is 48 GB unified memory at the Max tier of chip.
Is MLX really faster than llama.cpp on the M4 Pro?
Yes, by 15–25% for most Qwen, Llama and Gemma variants in our tests. The gap narrows for very small models (under 4B) where llama.cpp's overhead is less of a factor.
How much context can I realistically use?
With Qwen3-14B Q5_K_M loaded (10.4 GB), you have ~8 GB of headroom. That supports roughly 32k tokens of KV cache before macOS starts compressing memory. For 128k context, drop to Llama 3.3 8B Q4_K_M.
Should I wait for the M5 Pro?
Memory bandwidth on the M4 Pro is 273 GB/s. Apple's typical generational uplift is 15–20%, so expect ~320 GB/s on M5 Pro — meaningful but not transformative. If you need a machine today, the M4 Pro at current pricing is a strong buy.
What about fine-tuning on this machine?
LoRA fine-tuning of 7B–8B models is feasible via MLX with batch size 1. Full fine-tunes or 13B+ LoRAs will exceed memory. Use cloud GPUs for training, the Mac Mini for inference.