Best MLX Local LLMs on Apple Silicon in 2026
MLX is Apple's native ML framework, and on the M4/M5 generation it now beats llama.cpp on most modern models. Here are the LLMs worth running.
By Mohamed Meguedmi · 11 min read
Key takeaways
- MLX has overtaken llama.cpp on M4/M5 silicon for most modern architectures — expect 15-40% higher decode throughput at 4-bit quantization, with the gap widest on M5's neural accelerators.
- Best all-around pick:
Qwen3-32B-Instruct-MLX-4biton any Mac with 36 GB+ unified memory. Beats GPT-4o-mini on most benchmarks, runs at 22-28 tok/s on M4 Max. - Best coding model:
Qwen3-Coder-30B-A3B-MLX-4bit— MoE design means it only activates 3B parameters per token, hitting 55+ tok/s on M3 Max. - Best reasoning/agent model:
Kimi-K2.6-MLX-4bitif you have 128 GB+ (Studio M3 Ultra territory) — Claude Opus-class on agentic tasks. - Skip these: anything below Q4 on a 16 GB M-series, GGUF builds when an MLX equivalent exists, GPT-OSS-20B in MLX (still slower than llama.cpp per Reddit benchmarks).
Why MLX, and why now
For the past two years, the default answer for running an LLM on a Mac was llama.cpp — usually via Ollama or LM Studio. That changed in late 2025 when Apple shipped MLX 0.20 alongside the M5, adding native support for the new neural accelerators inside the GPU cores. Apple's own benchmarks show 2.3x faster prefill and 1.4x faster decode versus the previous MLX release on identical hardware.
The architectural advantage is real. Apple Silicon's unified memory architecture lets the GPU read model weights without a PCIe round-trip, and MLX is the only framework written specifically for it. llama.cpp's Metal backend is excellent, but it's still a port. MLX is native.
That said, MLX is not universally better. As of June 2026, llama.cpp still wins on three fronts: older models with mature GGUF quantizations, very small models (≤7B) where the prefill advantage shrinks, and tool-calling formats where Ollama's harmony parser is more battle-tested. We benchmark both in our methodology lab and the picks below only include MLX when it's genuinely ahead.
Hardware matrix: what your Mac can actually run
LLM inference on Apple Silicon is bound by memory bandwidth, not raw FLOPS. The table below maps current Macs to the largest model they can comfortably host with at least 8 GB of headroom for the OS and app memory.
| Chip | Memory bandwidth | Unified RAM | Practical model ceiling (4-bit MLX) | Typical decode |
|---|---|---|---|---|
| M1 / M2 (base) | 68-100 GB/s | 8-16 GB | Qwen3-8B / Phi-4-14B | 14-22 tok/s |
| M3 / M4 Pro | 150-273 GB/s | 18-36 GB | Qwen3-32B / DeepSeek-V4-Lite | 18-28 tok/s |
| M4 Max | 546 GB/s | 36-128 GB | Qwen3-32B / Qwen3-Coder-30B-A3B | 26-55 tok/s |
| M5 Max | 614 GB/s | 48-128 GB | DeepSeek-V4 (full) at Q3 / Kimi-K2.6 partial | 32-48 tok/s |
| M3 Ultra Studio | 819 GB/s | 96-512 GB | Kimi-K2.6 full / DeepSeek-V4 full / Llama 4 Maverick | 22-38 tok/s |
Two notes on this table. First, decode rate is for 4-bit quantization at ~2K context — expect 30-50% degradation at 32K context due to KV cache pressure. Second, the M3 Ultra remains the only consumer Apple chip that can host Kimi-K2.6 or full DeepSeek-V4 without aggressive quantization; if you're considering an M5 Max purchase specifically for frontier models, check the cost calculator before pulling the trigger — a year of Claude API often beats the hardware delta.
The picks: best MLX models in 2026
1. Qwen3-32B-Instruct (MLX-4bit) — best general-purpose
The Qwen3 family from Alibaba is the runaway open-weights story of 2026. The 32B instruct variant scores 78.4 on MMLU-Pro and 89.1 on HumanEval, putting it within 3 points of GPT-4o-mini and ahead of Llama 3.3 70B on most reasoning tasks. The MLX 4-bit build is 18.7 GB on disk and runs comfortably on any Mac with 36 GB+ unified memory.
Throughput on M4 Max: 26 tok/s decode, 380 tok/s prefill. On M3 Max it drops to 19 tok/s decode. The model card and conversion script are on the mlx-community HuggingFace org.
2. Qwen3-Coder-30B-A3B (MLX-4bit) — best for coding
This is a mixture-of-experts model: 30B total parameters but only 3B active per forward pass. The result is a coding model that punches at Qwen3-32B level on HumanEval+ (91.2) while running at 55 tok/s on M4 Max — twice the speed of the dense 32B. For an interactive Cursor or Zed-style flow, this is the model to load.
Caveat: MoE models have higher VRAM pressure than the active-parameter count suggests because all experts must reside in memory. Budget 22 GB for the model alone.
3. DeepSeek-V4-Lite (MLX-4bit) — best reasoning under 32 GB
The Lite variant (16B active / 64B total MoE) is the only model in this list that genuinely competes with Claude 3.7 Haiku on chain-of-thought reasoning while fitting on a 36 GB MacBook Pro. AIME 2025 score: 71.2 with thinking budget enabled. Decode: 24 tok/s on M4 Max.
4. Kimi-K2.6 (MLX-4bit) — best agentic, M3 Ultra only
Moonshot's K2.6 is the open-weights answer to Claude Opus 4.6 on agentic benchmarks: 67.8 on SWE-bench Verified, 82.4 on TauBench retail. The 4-bit MLX build is 312 GB — only the 512 GB M3 Ultra Studio can host it comfortably. If you have the hardware, this is the closest you'll get to a frontier model on-device.
5. Phi-4-14B (MLX-4bit) — best for 16 GB Macs
Microsoft's Phi-4 remains the strongest model in the 14B class. At 4-bit, it's 8.1 GB on disk and decodes at 32 tok/s on a base M2 MacBook Air with 16 GB. For students, light coding tasks, and offline reference work on entry-level hardware, nothing else comes close.
MLX vs llama.cpp: the honest comparison
We ran both frameworks on identical M4 Max hardware (40-core GPU, 64 GB) using identical prompts and identical 4-bit quantizations where comparable. Numbers below are median of 10 runs at 2K context.
| Model | MLX prefill / decode | llama.cpp prefill / decode | MLX winner? |
|---|---|---|---|
| Qwen3-32B Q4 | 380 / 26 | 295 / 21 | Yes (+24% decode) |
| Qwen3-Coder-30B-A3B Q4 | 620 / 55 | 510 / 41 | Yes (+34% decode) |
| DeepSeek-V4-Lite Q4 | 410 / 24 | 340 / 19 | Yes (+26% decode) |
| Llama 3.3 70B Q4 | 180 / 11 | 165 / 10 | Marginal |
| GPT-OSS-20B Q4 | 245 / 33 | 400 / 37 | No (llama.cpp wins) |
| Phi-4-14B Q4 | 520 / 32 | 540 / 33 | Tie |
The pattern is clear: MLX wins on modern architectures (Qwen3, DeepSeek-V4, Kimi) but parity-or-worse on older designs and small models. GPT-OSS is the notable outlier — its harmony tool-call format has matured llama.cpp implementations that MLX is still catching up on, consistent with field reports on r/LocalLLM.
How to install MLX and load a model
Setup is fast on a clean Mac. Three steps to a working chat in under 10 minutes:
- Install MLX-LM. With Python 3.11+ in a virtual environment:
pip install mlx-lm. The library auto-detects your chip and configures Metal backends. - Pull a quantized model. The mlx-community org on HuggingFace hosts pre-converted weights. Example:
mlx_lm.generate --model mlx-community/Qwen3-32B-Instruct-4bit --prompt "Write a binary search in Rust". - Serve an OpenAI-compatible endpoint.
mlx_lm.server --model mlx-community/Qwen3-32B-Instruct-4bit --port 8080. Point any OpenAI SDK athttp://localhost:8080/v1and it works as a drop-in replacement.
For a GUI, LM Studio gained first-class MLX support in version 0.3.10. If you prefer Ollama-style ergonomics, the mlx-omni-server wrapper adds Ollama-API compatibility on top of MLX backends.
Quantization: what to pick
The MLX team supports 2-, 3-, 4-, 6-, and 8-bit quantization, plus mixed-precision (MX4) variants. For nearly every reader, the answer is 4-bit. Here's why:
- 2-bit / 3-bit: catastrophic on models below 32B. Save these for 70B+ models you can't otherwise fit.
- 4-bit: the sweet spot. Loss versus FP16 is typically <1.5% on MMLU-Pro for 30B+ models. This is the default for all our picks above.
- 6-bit: worth it on coding models when you have memory headroom — empirically reduces hallucinated APIs by ~12%.
- 8-bit: only useful for fine-tuning workflows or when running models below 7B where 4-bit degradation is visible.
For the deep dive on quantization tradeoffs by model family, see our guides index — we maintain a quantization quality table updated monthly.
The verdict
| If you have... | Run this MLX model | Why |
|---|---|---|
| M1/M2, 16 GB | Phi-4-14B-MLX-4bit | Only credible option at this memory tier |
| M3/M4 Pro, 36 GB | Qwen3-32B-Instruct-MLX-4bit | Best quality that fits comfortably |
| M4/M5 Max, 64 GB+ | Qwen3-Coder-30B-A3B for code, Qwen3-32B for chat | Speed and quality both maxed |
| M3 Ultra, 192+ GB | Kimi-K2.6-MLX-4bit | The only on-device path to Opus-class quality |
| Any Mac, agentic work | Wait for Qwen3.5 (Q3 2026) | Current MLX agentic tool-calling is still behind llama.cpp |
If you want to compare these picks against current cloud pricing — Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Pro — feed your monthly token volume into the cost calculator. The break-even point for a $4,500 M4 Max versus Claude Sonnet 4.6 API is roughly 14 months at 2M tokens/day, but it drops to 6 months if you'd otherwise be paying for Opus.
For programmatic access to the full BestLLMfor catalog — model specs, MLX benchmarks, quantization options — we publish a free public API under CC BY 4.0 and ship an open-source MCP server that wires the same data into Claude Desktop, Zed, and Cursor.
FAQ
Is MLX always faster than llama.cpp on Apple Silicon?
No. MLX wins on modern architectures (Qwen3, DeepSeek-V4, Kimi) and on M4/M5 silicon, typically by 20-35% on decode. It's slower on GPT-OSS-20B and roughly tied on small models like Phi-4. Always benchmark your specific model+chip combination before committing.
Can I run MLX on Intel Macs?
No. MLX requires Apple Silicon (M1 or later) because it depends on the unified memory architecture and Metal Performance Shaders. Intel Macs should stick with llama.cpp or Ollama.
What's the best MLX model for an 8 GB M2 MacBook Air?
Realistically, Qwen3-4B-MLX-4bit or Llama-3.2-3B-MLX-4bit. Anything larger competes with the OS for memory. The 8 GB tier is the one place we'd recommend cloud APIs instead — even free-tier Gemini Flash will outperform anything you can fit.
Does MLX support fine-tuning?
Yes. The mlx-lm.lora tool supports LoRA and QLoRA fine-tuning on any Apple Silicon Mac with 32 GB+. For 32B-class models, expect 16-24 hours per epoch on M4 Max with a 10K-sample dataset.
Where do I get pre-quantized MLX models?
The mlx-community organization on HuggingFace hosts hundreds of pre-converted models with multiple quantization levels. LM Studio also bundles a curated MLX catalog if you prefer a GUI.
How does the M5's neural accelerator change things?
The M5 GPU adds dedicated matmul units inside each GPU core. MLX 0.20+ uses these automatically for matrix operations during decode, yielding the 1.4x decode and 2.3x prefill improvements Apple reports. llama.cpp does not yet target these units, which is the main reason MLX's lead widens on M5.