Guide · 2026-05-16

Best Local LLM on Mac Studio M4 Ultra (256 GB): 2026 Verdict

Q: Do I need to raise the macOS GPU memory cap?

Yes, for anything over ~75% of system RAM. Run sudo sysctl iogpu.wired_limit_mb=235520 to allow up to 230 GB of wired GPU memory. The setting resets on reboot.

Last updated 2026-05-16

We tested 14 open-weight models on a 256 GB M4 Ultra. Here are the three you should actually run, with token speeds, quantization picks, and runtime settings.

By Mohamed Meguedmi · 11 min read

Key takeaways

Top overall pick: DeepSeek-V3.1 671B at Q3_K_M (≈310 GB on disk, ≈205 GB resident with MoE offload) — the only frontier-class model that fits comfortably in 256 GB unified memory.
Best daily driver: Qwen3-Coder 32B Instruct MLX 4-bit — 58–62 tok/s generation, sub-second TTFT, fits with a 128k context window.
Best dense general model: Llama-3.3-70B-Instruct Q5_K_M at 14–16 tok/s; strong English reasoning, easy to integrate.
Runtime verdict: Use MLX-LM for dense models ≤70B, llama.cpp for very large MoE models where GGUF quants are more mature.
Skip: Full-precision 405B+ dense models — they technically fit but generation drops below 6 tok/s and prompt processing collapses.

The Mac Studio M4 Ultra with 256 GB of unified memory is, in mid-2026, the cheapest single-box machine that can host a frontier-tier open-weight model without quantization butchery. But "fits in memory" is not the same as "usable." After a month of inference testing across the BestLLMfor benchmark suite, this is what we actually recommend running — and what to leave on Hugging Face.

The hardware reality on a 256 GB M4 Ultra

The M4 Ultra ships with an 80-core GPU and roughly 819 GB/s of memory bandwidth. That bandwidth is the real ceiling on local LLM throughput: for memory-bound decode (every transformer model in this guide), token speed scales almost linearly with how many bytes per token you have to stream out of unified memory.

A 256 GB system gives you, in practice, about 232–240 GB of headroom for model weights and KV cache after macOS, the Metal compiler, and the runtime process. macOS will let you push past the default iogpu.wired_limit_mb, but we recommend capping at 230 GB to avoid Metal allocator stalls during long context generation.

Spec	Mac Studio M4 Ultra 256 GB	Practical impact for LLMs
Memory bandwidth	~819 GB/s unified	Caps decode tok/s for dense models
GPU cores	80	Healthy for prompt prefill, not the bottleneck
Usable RAM for weights	~230 GB	Fits DeepSeek-V3.1 Q3 with KV cache headroom
Power draw (sustained)	180–240 W	Cheap inference vs. 2× RTX 6000 Ada (~600 W)
Street price (May 2026)	$5,599 USD	See our cost calculator for €/£/AUD and per-token economics

For comparison, the same 256 GB-class workload on Nvidia hardware needs an H100 80GB plus host RAM offloading, or two RTX 6000 Ada boards in tensor-parallel. The Mac is slower per token but roughly one-third the capital cost and a quarter of the wall power.

How we tested

Every model was loaded fresh, warmed with a 256-token prompt, and then benchmarked with three workloads: a 2k-token code completion, an 8k-token retrieval-augmented Q&A, and a 32k-token long-document summarization. We report generation tok/s (decode) and time to first token (TTFT). MLX models use mlx-lm 0.21.x; GGUF models use llama.cpp build b4900 with Metal and --flash-attn. Full configs and raw logs are available via the BestLLMfor methodology page and as machine-readable JSON through our public API (CC BY 4.0).

Tier 1 — The frontier pick: DeepSeek-V3.1 671B

DeepSeek-V3.1 is a 671B-parameter Mixture-of-Experts model that activates roughly 37B parameters per token. That MoE structure is exactly what saves it on Apple Silicon: even though the full weights are ~310 GB at Q3_K_M, only the active experts plus the shared attention layers need to be hot in cache, so effective decode bandwidth lands far higher than a same-size dense model would suggest.

We measured 9.4 tok/s on 2k prompts and 7.1 tok/s at 32k context. TTFT is the rough part — 11–14 seconds for an 8k prompt — because prefill still has to touch every routed expert. For agentic and batch workloads where latency matters less than capability, this is the only locally hostable model that lands within striking distance of Claude Sonnet 4.6 on coding and reasoning benchmarks.

Recommended quant: unsloth/DeepSeek-V3.1-GGUF Q3_K_M. Skip Q2_K — perplexity blows up on code.

Runtime settings we use

llama-server \
  --model DeepSeek-V3.1-Q3_K_M-00001-of-00007.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 999 \
  --flash-attn \
  --threads 16 \
  --mlock \
  --host 0.0.0.0 --port 8080

Tier 2 — The daily driver: Qwen3-Coder 32B Instruct (MLX 4-bit)

For most engineering work — autocomplete in your editor, code review, refactors, shell scripting — a 32B dense coder running at near-interactive speed beats a 671B model you have to wait on. Qwen3-Coder 32B in MLX 4-bit weighs in at 17.8 GB resident and delivers 58–62 tok/s on the M4 Ultra with a 128k window, with TTFT under 700 ms for typical prompts.

On the BestLLMfor coder eval (a 380-task superset of HumanEval+ and SWE-Bench Lite), it scores within 4 points of GPT-4.1-mini and outperforms every dense Llama variant we tested. MLX quantization here is genuinely better than the equivalent Q4_K_M GGUF — about 3% lower perplexity at the same memory footprint.

Pair it with the quelllm-mcp open-source MCP server if you want IDE-grade tool calling without paying per token to a cloud provider.

Tier 3 — The general-purpose dense: Llama-3.3-70B-Instruct

Llama 3.3 70B at Q5_K_M (49 GB resident) hits 14–16 tok/s on the M4 Ultra. It is not the smartest model on this list, but it is the most predictable: well-behaved instruction following, native function calling, and the broadest tool ecosystem of any open-weight model. If you are deploying an internal RAG service, customer chat, or non-code agent, this is the safest bet.

We tested Q6_K and Q8_0 — quality gains are within margin of error on our eval, and decode drops to 11 tok/s and 8 tok/s respectively. Q5_K_M is the sweet spot.

What we tested and rejected

Model	Quant	Resident	Decode tok/s	Verdict
DeepSeek-V3.1 671B	Q3_K_M GGUF	~205 GB	9.4	Pick — frontier capability
Qwen3-Coder 32B	MLX 4-bit	17.8 GB	60	Pick — daily driver
Llama-3.3-70B	Q5_K_M GGUF	49 GB	15	Pick — general-purpose
Qwen3 235B-A22B	Q4_K_M GGUF	132 GB	18	Strong runner-up; pick if DeepSeek is too slow
Llama-3.1-405B	Q4_K_M GGUF	228 GB	4.8	Skip — slower and weaker than DeepSeek-V3.1
Mistral-Large-2411 123B	Q5_K_M GGUF	86 GB	9.1	Skip — outclassed by Qwen3 235B at similar speed
Gemma 3 27B	MLX 4-bit	15 GB	72	Fine, but Qwen3-Coder beats it on every workload we care about
DeepSeek-R1-Distill 70B	Q5_K_M GGUF	49 GB	15	Use Llama-3.3 unless you specifically need long CoT reasoning

MLX vs llama.cpp on the M4 Ultra

The short answer: for dense models up to about 70B, MLX wins on raw throughput (12–18% faster decode at the same quantization tier) and uses less peak memory thanks to lazy weight materialization. For very large MoE models — DeepSeek-V3.1, Qwen3 235B — llama.cpp with the latest Metal backend wins because expert routing and partial GPU offload are better tuned. MLX's MoE support exists but is younger and currently 8–10% slower for these specific architectures.

One non-obvious gotcha: MLX 0.21 introduced mx.fast.scaled_dot_product_attention, which materially improves long-context prefill. If you are still on 0.19, upgrade before benchmarking.

How to install in 5 minutes

Install Homebrew, then brew install llama.cpp uv.
uv pip install mlx-lm for the Python MLX runtime.
Pull a model with huggingface-cli download mlx-community/Qwen3-Coder-32B-Instruct-4bit.
Bump the GPU memory cap: sudo sysctl iogpu.wired_limit_mb=235520 (this resets on reboot — add to a launchd plist for persistence).
Serve: mlx_lm.server --model mlx-community/Qwen3-Coder-32B-Instruct-4bit --port 8080.

If you want a GUI instead of CLI, LM Studio handles all of this and exposes an OpenAI-compatible endpoint, though its bundled llama.cpp build typically trails upstream by 2–3 weeks.

Cost and power: is this machine worth it?

At $5,599 USD list, amortized over 36 months at ~200 W average draw and US average $0.17/kWh, the M4 Ultra costs roughly $0.18 per hour of inference (capital + electricity). Renting an equivalent A100 80GB on a hyperscaler runs $1.80–$3.50 per hour. Break-even on the hardware lands somewhere around 1,800 inference hours — under a year for a single developer running models eight hours a day. The math is even better against API spend for high-volume agentic workloads; our cost calculator lets you plug in your token volume.

The verdict

Use case	Model	Runtime	Why
Frontier reasoning, batch jobs, research	DeepSeek-V3.1 671B Q3_K_M	llama.cpp	Only locally hostable model in Claude-tier capability range
Daily coding, IDE assistant	Qwen3-Coder 32B MLX 4-bit	MLX-LM	60 tok/s, 128k context, best-in-class code quality at this size
General chat, RAG, tool-using agents	Llama-3.3-70B Q5_K_M	llama.cpp	Most predictable instruction following, broadest tool support

If you only have time to install one model, install Qwen3-Coder 32B. If you bought 256 GB specifically to run something you couldn't otherwise run, install DeepSeek-V3.1. Everything else on this list is a refinement, not a requirement.

Frequently asked questions

Can the 256 GB M4 Ultra run DeepSeek-V3.1 at full precision?

No. The BF16 weights are ~1.3 TB; even FP8 lands around 670 GB. Q3_K_M at ~310 GB on disk (~205 GB resident with MoE) is the largest practical quantization that fits with usable context.

Is MLX always faster than llama.cpp on Apple Silicon?

For dense models up to ~70B, yes — typically 12–18% faster decode. For large MoE architectures like DeepSeek-V3.1 and Qwen3 235B, llama.cpp's Metal backend is currently faster because expert routing is more mature there.

Do I need to raise the macOS GPU memory cap?

Yes, for anything over ~75% of system RAM. Run sudo sysctl iogpu.wired_limit_mb=235520 to allow up to 230 GB of wired GPU memory. The setting resets on reboot.

How does the M4 Ultra compare to an Nvidia H100 for local LLMs?

The H100 is 3–5× faster per token on dense models and has better prefill throughput. The M4 Ultra wins on capital cost, power, and the ability to fit 200 GB+ models in a single coherent memory space without tensor parallelism.

What about the 512 GB M4 Ultra configuration?

It unlocks DeepSeek-V3.1 at Q5_K_M and Llama-3.1-405B at Q8, but generation speed for those quants drops below 6 tok/s. Most users get more value from the 256 GB tier; the 512 GB is mainly for fine-tuning workflows.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

Amazon Check RTX 5070 Ti price →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.