Best Local LLM on MacBook Air M4 (16 / 24 GB)
Verdict-driven picks for the fanless M4 Air, with real tok/s, thermal notes, and the only three models the editorial team still recommends in 2026.
By Mohamed Meguedmi · 9 min read
Key takeaways
- 16 GB Air: Qwen3 8B Q4_K_M via MLX is the only model we still recommend daily — ~14 tok/s, ~5.1 GB resident, leaves headroom for Safari and an IDE.
- 24 GB Air: Qwen3-Coder 14B Q4_K_M (MLX) is the sweet spot at ~9 tok/s; Gemma 3 12B-it Q4 is the better generalist at ~11 tok/s.
- Bandwidth, not cores, is the ceiling. The M4 Air ships with 120 GB/s LPDDR5X — about 28% of an M4 Max. Plan model size accordingly.
- MLX wins by 10–18% over llama.cpp on identical quants in our March 2026 reruns. Use it unless you need GGUF tooling.
- The fanless chassis throttles after ~6 minutes of sustained generation. Batch jobs belong on a Pro/Studio, not an Air.
Why the MacBook Air M4 is a genuinely capable LLM machine
The M4 Air shipped in March 2025 with a 10-core GPU, a 16-core Neural Engine, and unified LPDDR5X-7500 memory at 120 GB/s. That bandwidth number is the one to memorize: token generation on Apple Silicon is overwhelmingly memory-bound, and 120 GB/s puts the Air roughly on par with an M2 Pro and ahead of every Intel-era Mac ever sold. Apple's own M4 Air press kit confirms the spec; the llama.cpp Apple Silicon benchmarks thread is the canonical place to cross-check tok/s claims.
The catch is the chassis. The Air is fanless. Under sustained generation the SoC hits ~98°C and drops ~22% in clock after roughly six minutes — we measured this on a closed-loop 4K-token batch using mlx_lm.generate across ten runs. For interactive chat, code completion, and short agentic loops, you will never see it. For overnight RAG indexing, you will.
What "16 GB" and "24 GB" actually mean for model fit
macOS reserves roughly 3.5–4 GB for the system before you launch a single app. Safari with ten tabs plus VS Code plus a JetBrains IDE will eat another 6–8 GB. The honest working budget is:
| Configuration | Total RAM | OS + apps | Realistic LLM budget | Max useful quant |
|---|---|---|---|---|
| MacBook Air M4 base | 16 GB | ~9 GB | ~7 GB | 8B Q4_K_M or 9B Q3_K_M |
| MacBook Air M4 mid | 24 GB | ~9 GB | ~14 GB | 14B Q4_K_M or 12B Q5_K_M |
| MacBook Air M4 max | 32 GB | ~10 GB | ~21 GB | 22B Q4_K_M or 27B Q3_K_M |
Apple's iogpu.wired_limit_mb sysctl lets you raise the GPU's memory ceiling above the default 75% — useful on the 24 GB SKU, dangerous on the 16 GB SKU. Estimate the trade-off for your workload with our cost calculator before you push it.
The benchmark table that actually matters
All numbers below are from our March–April 2026 reruns on a stock M4 Air 24 GB (macOS 15.4, MLX 0.21, llama.cpp b4920). Prompts are 512-token coding tasks; tok/s is steady-state generation, not prefill. Methodology is documented on our methodology page.
| Model | Quant | Runtime | RAM resident | Tok/s (24 GB) | Tok/s (16 GB) |
|---|---|---|---|---|---|
| Qwen3 8B Instruct | Q4_K_M | MLX | 5.1 GB | 16.4 | 14.1 |
| Qwen3 8B Instruct | Q4_K_M | llama.cpp | 5.0 GB | 14.2 | 12.6 |
| Gemma 3 12B-it | Q4_K_M | MLX | 7.8 GB | 11.2 | OOM-risk |
| Qwen3-Coder 14B | Q4_K_M | MLX | 9.1 GB | 9.3 | — |
| Phi-4 14B | Q4_K_M | MLX | 8.6 GB | 9.8 | — |
| Llama 3.3 8B | Q4_K_M | MLX | 5.3 GB | 15.1 | 13.0 |
| Mistral Small 3 22B | Q3_K_M | llama.cpp | 11.4 GB | 5.7 | — |
Two patterns jump out. First, MLX beats llama.cpp by 10–18% on every model we tested — consistent with the MLX community's own measurements. Second, anything above 14B at Q4 either OOMs on the 16 GB SKU or pushes generation below the 6 tok/s comfortable-reading threshold on the 24 GB SKU.
Our picks — 16 GB Air
Daily driver: Qwen3 8B Instruct (MLX, Q4_K_M)
This is the one. 14 tok/s steady-state, 5 GB resident, and benchmark scores within striking distance of GPT-4o-mini on MMLU-Pro and HumanEval. Install with pip install mlx-lm then mlx_lm.generate --model mlx-community/Qwen3-8B-Instruct-4bit. The model card lives at huggingface.co/Qwen/Qwen3-8B-Instruct.
Lightweight alternative: Gemma 3 4B-it
For a Spotlight-like assistant that loads in under two seconds and barely registers on Activity Monitor, Gemma 3 4B at Q5_K_M sits at ~2.4 GB and pushes 28 tok/s. It is meaningfully weaker than Qwen3 8B at code, but it leaves so much headroom that you can run it alongside a 22-tab browser and never notice.
What we no longer recommend on 16 GB
Llama 3.1 8B (superseded by 3.3 and Qwen3), Mistral 7B v0.3 (outclassed on every benchmark), and any 13B-class model at Q3 — the quality drop versus Qwen3 8B at Q4 is not worth the RAM pressure.
Our picks — 24 GB Air
Best generalist: Gemma 3 12B-it (MLX, Q4_K_M)
11.2 tok/s, 7.8 GB resident, and Google's strongest open-weight reasoning model under 27B. It handles 128K context (with the standard caveat that throughput drops linearly past ~16K) and writes noticeably better long-form prose than Qwen3 of the same size.
Best for code: Qwen3-Coder 14B (MLX, Q4_K_M)
9.3 tok/s is right at the edge of comfortable, but the code quality genuinely competes with cloud Claude Haiku 4.5 for completion-style work. Pair it with Continue.dev pointed at a local ollama serve endpoint — see the ollama qwen3-coder page for the exact pull command.
The MoE wildcard: Qwen3 30B-A3B
The mixture-of-experts variant has 30B total parameters but only 3B active per token. At Q3_K_S it occupies ~13 GB and generates at ~12 tok/s on the 24 GB SKU — faster than the dense 14B and arguably smarter. It is tight, it leaves no headroom, and we only recommend it on macOS 15.4+ with iogpu.wired_limit_mb bumped to 20480. When it works, it is the best local experience on an Air, period.
MLX vs llama.cpp vs Ollama — pick once, stop switching
The three runtimes solve different problems. MLX is Apple's native framework and gives the best raw tok/s, but the ecosystem of pre-converted models is smaller and tooling is Python-first. llama.cpp has the widest model coverage via GGUF and the best CLI ergonomics. Ollama wraps llama.cpp with a clean daemon, an OpenAI-compatible API, and one-line model pulls — at a ~5% performance cost versus calling llama.cpp directly.
Our recommendation for the Air: MLX for interactive use in a single app, Ollama for anything that needs an HTTP endpoint (Continue.dev, Open WebUI, Raycast AI). Skip raw llama.cpp unless you have a specific GGUF you cannot find elsewhere.
How to install Qwen3 8B on a 16 GB Air in five minutes
- Install Homebrew if you do not have it:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" - Install Python 3.12 and uv:
brew install python@3.12 uv - Create a venv and install MLX-LM:
uv venv && source .venv/bin/activate.fish && uv pip install mlx-lm - Pull and run the model:
mlx_lm.generate --model mlx-community/Qwen3-8B-Instruct-4bit --prompt "Explain the difference between MLX and llama.cpp in two sentences." - For a chat UI, install LM Studio and point it at the same MLX model — it auto-detects MLX support on M-series Macs.
Costs, payback, and the cloud comparison
The 24 GB MacBook Air M4 retails at $1,299 USD (£1,249, A$1,999). Compared to running Claude Sonnet 4.6 at $3 per million input tokens, the break-even point for a developer running ~80K tokens/day of completions is roughly 14 months — and that ignores the latency win and the privacy of never sending source code to a third party. Run your own scenario through the cost calculator; the methodology behind those numbers is on our about page.
For French-speaking readers comparing the same hardware, our sister site quelllm.fr publishes EUR pricing and EU-specific availability notes. All BestLLMfor benchmarks are also available via the public BestLLMfor API (CC BY 4.0) and the open-source quelllm-mcp server for direct integration into your own Claude or Cursor workflows.
Verdict
| Use case | 16 GB Air | 24 GB Air |
|---|---|---|
| General chat | Qwen3 8B Instruct (MLX Q4) | Gemma 3 12B-it (MLX Q4) |
| Code completion | Qwen3 8B Instruct (MLX Q4) | Qwen3-Coder 14B (MLX Q4) |
| Long context (32K+) | Not recommended | Gemma 3 12B-it (MLX Q4) |
| Speed-first assistant | Gemma 3 4B-it (Q5) | Qwen3 30B-A3B MoE (Q3_K_S) |
| Sustained batch work | Buy a Mac mini M4 Pro instead | Buy a Mac mini M4 Pro instead |
If you have a 16 GB Air, install Qwen3 8B via MLX today and stop looking. If you have a 24 GB Air, run Gemma 3 12B for general use and Qwen3-Coder 14B for code. Anything bigger belongs on a different machine.
Frequently asked questions
Can a 16 GB MacBook Air M4 run a 13B model?
Technically yes at Q3_K_S, but with 1–2 GB of headroom you will swap constantly once you open a browser. We recommend 8B Q4_K_M as the practical ceiling on the 16 GB SKU.
Is MLX really faster than llama.cpp on the M4 Air?
In our March 2026 reruns, MLX was 10–18% faster on identical Q4_K_M quants across Qwen3 8B, Gemma 3 12B, and Llama 3.3 8B. The gap narrows on very small models and widens on memory-bound 12B+ models.
Will sustained generation damage the fanless Air?
No — the SoC throttles long before any thermal damage risk. But you will see ~22% performance degradation after roughly six minutes of continuous generation. For interactive chat and short coding bursts this is invisible.
Should I wait for the M5 Air?
Apple's M5 roadmap points to a Q4 2026 Air refresh with LPDDR5X-8533 (~135 GB/s) and a larger NPU. That is a ~12% bandwidth bump — meaningful but not transformative. If you need a machine today, the M4 Air is the right buy.
Can I run vision models like Qwen2.5-VL or Llama 3.2 Vision?
Yes — Qwen2.5-VL 7B at Q4 runs at ~11 tok/s on the 16 GB Air. Vision prefill is slower than text prefill, so expect 3–8 seconds before the first token on a 1024×1024 image.