Guide · 2026-05-16

Best Local LLM on AMD EPYC — CPU-Only Server Inference

Last updated 2026-05-16

AMD EPYC servers can run 70B-class and even 671B MoE models without a single GPU — once memory bandwidth, NUMA layout, and quantization are tuned.

By Mohamed Meguedmi · 11 min read

Key takeaways

Memory bandwidth is the metric that matters most. A 12-channel DDR5-4800 EPYC 9004/9005 platform delivers ~460 GB/s, roughly 2.25× the ~205 GB/s an 8-channel DDR4-3200 EPYC 7003 manages. Token throughput tracks bandwidth almost linearly.
The cheapest viable EPYC LLM server in 2026 is a used Rome or Milan box (EPYC 7402P to 7763). Expect 2.6–3.4 tok/s on Llama 3.3 70B Q4_K_M and 8–12 tok/s on Qwen3-235B-A22B MoE.
Genoa-X (9004X) with 1.1 GB of 3D V-Cache changes the math. The EPYC 9684X hits 11 tok/s on 70B Q4 — competitive with a single RTX 4090 for offline batch work, with 10× the RAM capacity.
llama.cpp (ik_llama fork) beats ZenDNN-accelerated vLLM for small-batch interactive use. vLLM only wins at batch size 8 and above.
Skip dual-socket builds. NUMA crossing destroys throughput; one fat socket beats two skinny ones every time for LLM inference.

Why EPYC, not Xeon or Threadripper, for CPU-only LLM inference

CPU inference is a bandwidth problem, not a compute problem. Every forward pass through a 70B model at INT4 reads roughly 35 GB of weights — once per token. The fastest processor in the world will idle waiting for DRAM if the memory channels can't keep up. That is the entire reason AMD EPYC has quietly become the default platform for self-hosted LLM serving without GPUs.

A consumer Ryzen 9 9950X tops out at two channels of DDR5-5600, around 90 GB/s. EPYC Genoa (9004) and Turin (9005) ship with twelve channels of DDR5-4800 (Genoa) or up to DDR5-6400 (Turin), pushing 460–614 GB/s into a single socket. That is GPU-class bandwidth on commodity DRAM. Intel Granite Rapids offers eight or twelve channels depending on SKU and trails Turin by 15–25% per socket at comparable price points; Threadripper Pro 7000 (WRX90) maxes out at eight channels of DDR5-5200 and costs more per GB/s of memory bandwidth than a refurbished Genoa.
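The bandwidth figures quoted above follow from simple arithmetic: channels × transfer rate × bus width. The sketch below reproduces them, assuming a 64-bit (8-byte) data bus per channel and decimal gigabytes; these are theoretical peaks, and sustained bandwidth is lower in practice. The function name is illustrative, not part of any tool mentioned in this guide.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).

    Assumes each channel moves `bus_bytes` bytes per transfer (8 bytes for a
    64-bit DDR4/DDR5 channel, ECC bits excluded).
    """
    return channels * mega_transfers_per_s * bus_bytes / 1000


# Configurations named in the paragraph above: (channels, MT/s).
platforms = {
    "Ryzen 9 9950X, 2 x DDR5-5600": (2, 5600),
    "EPYC Rome/Milan, 8 x DDR4-3200": (8, 3200),
    "Threadripper Pro 7000, 8 x DDR5-5200": (8, 5200),
    "EPYC Genoa, 12 x DDR5-4800": (12, 4800),
    "EPYC Turin, 12 x DDR5-6400": (12, 6400),
}

for name, (channels, mts) in platforms.items():
    print(f"{name}: {peak_bandwidth_gbs(channels, mts):.1f} GB/s")
```

The same function gives the ratio between platforms directly, for example Genoa against Milan by dividing the two results.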

The other reason: ECC RDIMMs. Running a 70B+ model for days at a time on non-ECC memory is a recipe for silent corruption. EPYC is the only platform under $5,000 (used) that supports 768 GB–1 TB of ECC capacity, enough to load DeepSeek-V3 671B Q4 with KV-cache headroom.

Memory bandwidth is the bottleneck — measured

The table below shows measured llama.cpp throughput on Llama 3.3 70B Q4_K_M, single user, 2,048-token context, no speculative decoding. Data aggregated from the BestLLMfor public benchmark dataset (CC BY 4.0, available via api.bestllmfor.com/v1/benchmarks) and corroborated against published llama.cpp GitHub discussion threads.

Platform	Channels × speed	Peak BW (GB/s)	70B Q4 tok/s	Used price (May 2026)
EPYC 7282 (Rome, 16c)	8 × DDR4-2666	170	2.1	$950
EPYC 7402P (Rome, 24c)	8 × DDR4-3200	205	2.6	$1,100
EPYC 7763 (Milan, 64c)	8 × DDR4-3200	205	3.4	$2,200
EPYC 9354P (Genoa, 32c)	12 × DDR5-4800	460	7.2	$3,800
EPYC 9684X (Genoa-X, 96c)	12 × DDR5-4800 + 1.1 GB V-Cache	460	11.4	$6,500
EPYC 9555 (Turin, 64c)	12 × DDR5-6000	576	9.8	$4,900
EPYC 9755 (Turin, 128c)	12 × DDR5-6400	614	10.6	$9,200

Three things jump out. First, adding cores on a fixed memory subsystem buys little: doubling from 32 to 64 cores yields ~10–15%, and even the jump in the table from the 24-core 7402P to the 64-core 7763 on identical 8 × DDR4-3200 adds only about 30% for 2.7× the cores, which confirms the bandwidth ceiling. Second, the 9684X with V-Cache outperforms the higher-bandwidth 9755 on the 70B workload because much of the attention KV cache stays resident in the 1.1 GB L3, cutting DRAM traffic by ~25%. Third, used Milan (EPYC 7003) is the budget winner if you can live with 3–4 tok/s.
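One way to read the table is to compare each measured figure with a bandwidth-only ceiling: peak bandwidth divided by the bytes read per token. The sketch below does that under a loud assumption, namely that each token reads the full 42 GB Q4_K_M footprint quoted later in this guide exactly once from DRAM. With the rougher 35 GB INT4 estimate the ceilings rise by about 20%, so treat the percentages as indicative only.

```python
WEIGHTS_GB = 42.0  # assumed bytes read per token (Llama 3.3 70B Q4_K_M footprint)

# (platform, peak bandwidth in GB/s, measured tok/s), copied from the table above.
ROWS = [
    ("EPYC 7282", 170, 2.1),
    ("EPYC 7402P", 205, 2.6),
    ("EPYC 7763", 205, 3.4),
    ("EPYC 9354P", 460, 7.2),
    ("EPYC 9684X", 460, 11.4),
    ("EPYC 9555", 576, 9.8),
    ("EPYC 9755", 614, 10.6),
]


def decode_ceiling(peak_gbs: float, weights_gb: float = WEIGHTS_GB) -> float:
    """Upper bound on tok/s if DRAM bandwidth were the only limit."""
    return peak_gbs / weights_gb


def efficiency(peak_gbs: float, measured_tok_s: float, weights_gb: float = WEIGHTS_GB) -> float:
    """Measured throughput as a fraction of the bandwidth-only ceiling."""
    return measured_tok_s / decode_ceiling(peak_gbs, weights_gb)


for name, bandwidth, tok_s in ROWS:
    ceiling = decode_ceiling(bandwidth)
    print(f"{name}: ceiling {ceiling:.1f} tok/s, measured {tok_s} ({efficiency(bandwidth, tok_s):.0%})")
```

Under the 42 GB assumption, the 9684X is the only row that lands above its DRAM-only ceiling, which is at least consistent with the V-Cache absorbing part of the traffic; the other rows sit between roughly half and three quarters of theirs.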

Which models actually make sense on EPYC

The reflex is to grab the largest model the RAM will hold. That is wrong. Below 5 tok/s, any model is unusable for chat. EPYC excels at three workloads: large-context batch jobs, MoE inference where only a fraction of weights activate per token, and 7B–32B dense models served to a small team.

Model	Quant	RAM needed	Best EPYC use case	Throughput hint
Qwen3-Coder 32B	Q4_K_M	20 GB	Interactive code completion, 1–3 devs	12–18 tok/s on Genoa
Llama 3.3 70B Instruct	Q4_K_M	42 GB	Batch summarization, RAG backend	7–11 tok/s on Genoa-X
Mistral Large 2 123B	Q4_K_M	73 GB	Long-context document review	4–6 tok/s on Genoa-X
Qwen3-235B-A22B (MoE)	Q4_K_M	140 GB	High-throughput agent server	14–22 tok/s on Turin
DeepSeek-V3 671B (MoE)	Q4_K_M	380 GB	Frontier reasoning, batch only	5–8 tok/s on Turin 9755
DeepSeek-V3 671B	Q5_K_S	480 GB	Same, less quant loss	4–6 tok/s on Turin 9755

MoE models are the unfair advantage of CPU inference. Qwen3-235B-A22B activates only 22B parameters per token despite holding 235B in RAM, so the bytes read per token drop accordingly, and EPYC's vast memory capacity matters more than peak per-core compute. DeepSeek-V3 671B Q4 needs about 380 GB for weights, so it fits on a single 768 GB or 1 TB Turin socket with KV-cache headroom and delivers usable batch speeds there, something no single GPU can claim: even a 141 GB H200 holds well under half of those weights. See the DeepSeek-V3 model card and the Qwen3-235B-A22B card for activation patterns.
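The MoE advantage can be put in numbers with a rough estimate of bytes read per token. The sketch below scales each model's quantized footprint by its active-parameter fraction. That is a simplification: it assumes uniform quantization and ignores shared layers, router overhead, and KV-cache reads. The 37B active-parameter figure for DeepSeek-V3 comes from DeepSeek's published model description rather than from this guide.

```python
def active_gb_per_token(model_gb: float, total_params_b: float, active_params_b: float) -> float:
    """Approximate GB of weights read per token for an MoE model.

    Assumes the quantized footprint is spread evenly across parameters.
    """
    return model_gb * active_params_b / total_params_b


def decode_ceiling(peak_gbs: float, gb_per_token: float) -> float:
    """Bandwidth-only upper bound on tok/s."""
    return peak_gbs / gb_per_token


# Footprints are the "RAM needed" values from the table above.
dense_70b = 42.0                                      # dense: every weight is read
qwen3_moe = active_gb_per_token(140, 235, 22)         # 22B of 235B active
deepseek_v3 = active_gb_per_token(380, 671, 37)       # 37B of 671B active (external figure)

for label, gb in [("Llama 3.3 70B", dense_70b), ("Qwen3-235B-A22B", qwen3_moe), ("DeepSeek-V3", deepseek_v3)]:
    print(f"{label}: ~{gb:.1f} GB/token, ceiling on 12 x DDR5-6000: {decode_ceiling(576, gb):.0f} tok/s")
```

On this estimate Qwen3-235B-A22B reads roughly a third of what the dense 70B model does per token, despite occupying more than three times the RAM.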

Inference engine choice: llama.cpp wins for most teams

There are three serious options on EPYC: stock llama.cpp, the ik_llama.cpp fork, and AMD's ZenDNN-accelerated vLLM. Verdict first.

For 1–4 concurrent users: llama.cpp, specifically the ik_llama fork with --numa distribute and -t set to physical cores on a single NUMA node. It supports every relevant quant (Q4_K_M, IQ4_XS, Q5_K_S, Q6_K), handles MoE routing efficiently, and gets new model architectures within 48 hours of release. See the llama.cpp repository.

For batch sizes ≥ 8 (an internal API serving multiple agents simultaneously): vLLM 0.7+ with ZenDNN 5.2. AMD's ZenDNN 5.2 release brings AVX-512 BF16 and INT8 kernels tuned for Zen 4 and Zen 5, with measured 1.6–2.1× speedups over stock PyTorch CPU on batched workloads. Single-stream interactive throughput sees no comparable gain.

Skip Ollama on EPYC. It is a thin wrapper over llama.cpp that hard-codes thread count and does not expose NUMA flags. For a personal laptop it is fine; on a dual-NUMA Genoa it leaves 25–40% on the floor.

Tuning: the settings that double throughput

A stock Linux install plus a stock llama.cpp build leaves enormous performance on the table on EPYC. The five steps below are non-negotiable for production.

How to tune EPYC for llama.cpp inference

1. BIOS: set NPS=1 (or NPS=2 with manual pinning). NPS=4 splits the socket into four NUMA nodes and devastates throughput for a single-process inference engine. NPS=1 presents the whole socket as one node.
2. BIOS and kernel: enable AVX-512 on Genoa and newer. Disable Spectre-class mitigations (a kernel boot setting such as mitigations=off, not a BIOS option) only if the host is dedicated and isolated; the measurable gain is 8–12%.
3. BIOS: Determinism Slider = Performance, cTDP = maximum, C-states no deeper than C1.
4. OS: use a recent kernel (≥ 6.6) and set transparent_hugepage=always. Page-table walk overhead is real on 380 GB models.
5. llama.cpp: compile with the CMake options -DGGML_NATIVE=ON -DGGML_AVX512=ON -DGGML_AVX512_VBMI=ON -DGGML_AVX512_VNNI=ON, run with --numa distribute -t <physical_cores>, and pin the process with numactl --cpunodebind=0 --membind=0.
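The runtime half of the last step can be sketched as a small helper that assembles the pinned launch command. This is a minimal illustration, not the BestLLMfor profile for any SKU: the model path and the llama-server binary location are hypothetical, and the core count must match your CPU. On hosts that expose more than one NUMA node, llama.cpp also accepts --numa numactl, which tells it to follow the CPU map that numactl provides.

```python
import shlex


def llama_launch_argv(
    model_path: str,
    physical_cores: int,
    numa_node: int = 0,
    binary: str = "./llama-server",   # hypothetical build output location
    numa_mode: str = "distribute",    # as recommended in the tuning steps
) -> list[str]:
    """Build an argv that pins llama.cpp to one NUMA node via numactl."""
    return [
        "numactl",
        f"--cpunodebind={numa_node}",  # run threads only on this node's CPUs
        f"--membind={numa_node}",      # allocate memory only from this node
        binary,
        "-m", model_path,
        "-t", str(physical_cores),     # physical cores, not SMT threads
        "--numa", numa_mode,
    ]


# Example for a 64-core part; the GGUF path is a placeholder.
argv = llama_launch_argv("models/llama-3.3-70b-q4_k_m.gguf", physical_cores=64)
print(shlex.join(argv))
```

Building the command as a list keeps quoting safe if it is later passed to subprocess.run(argv) instead of being printed.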

Step 5 alone routinely yields 30–50% over an unaware default build. The BestLLMfor editorial team publishes exact compile flags and numactl invocations per CPU SKU through the open CC BY 4.0 API at api.bestllmfor.com/v1/profiles/epyc, and the open-source quelllm-mcp server exposes the same data to local agents over the Model Context Protocol.

Cost: EPYC vs GPU for the same throughput

Headline number: a used EPYC 7763 server with 512 GB DDR4-3200 ECC, motherboard, PSU, and chassis lands around $4,500 in May 2026 and delivers ~3.4 tok/s on Llama 3.3 70B Q4. A new RTX 4090 (24 GB) costs ~$2,200 but cannot load 70B Q4 without offloading layers to system RAM, dropping throughput to 8–10 tok/s — not a clean comparison.

The honest comparison is against a dual RTX 3090 build (~$1,800 used GPUs + $1,200 host) running Llama 3.3 70B Q4 at 18–22 tok/s. The GPU build wins on raw tok/s. EPYC wins on three other axes: capacity (DeepSeek-V3 671B simply does not fit on 48 GB of VRAM), idle power (~140 W vs ~280 W), and lifespan (server-grade ECC and 24/7 duty cycles vs consumer GPUs running at 85–90 °C).

The break-even depends on the workload. Use the cost calculator to model your specific hours-per-day, electricity rate, and depreciation assumptions. For inference workloads under 4 hours per day with models above 70B, EPYC wins on total cost of ownership over 3 years. For high-throughput chat serving, stick with GPUs.
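As a rough illustration of the kind of arithmetic involved (this is not the site's cost calculator), the sketch below compares hardware cost per unit of throughput and three years of idle energy for the two builds discussed above. The electricity rate is a hypothetical placeholder, the dual-3090 throughput uses the midpoint of the quoted 18–22 tok/s range, and load power and depreciation are deliberately left out because the guide does not state them.

```python
def usd_per_tok_s(hardware_usd: float, tok_s: float) -> float:
    """Purchase price divided by sustained throughput."""
    return hardware_usd / tok_s


def idle_energy_cost(idle_watts: float, years: float, usd_per_kwh: float) -> float:
    """Cost of sitting idle around the clock for the given number of years."""
    hours = years * 365 * 24
    return idle_watts / 1000 * hours * usd_per_kwh


RATE = 0.15  # hypothetical electricity price in USD per kWh

builds = {
    # name: (hardware USD, 70B Q4 tok/s, idle watts), figures from the text above
    "Used EPYC 7763, 512 GB": (4500, 3.4, 140),
    "Dual RTX 3090 + host": (1800 + 1200, 20.0, 280),
}

for name, (price, tok_s, idle_w) in builds.items():
    print(
        f"{name}: ${usd_per_tok_s(price, tok_s):,.0f} per tok/s, "
        f"${idle_energy_cost(idle_w, 3, RATE):,.0f} idle energy over 3 years"
    )
```

On these inputs the GPU build is far cheaper per token per second on a 70B model, while the EPYC box halves the idle energy bill; the capacity argument for models that do not fit in 48 GB of VRAM sits outside this calculation entirely.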

When NOT to choose EPYC

Be honest about the limits.

Latency-sensitive chat at scale. Time-to-first-token on EPYC is 200–800 ms for any 70B+ model. A user waiting on an agent response will notice.
Training or fine-tuning. CPU training is technically possible. It is also slower than waiting for a colleague with a GPU to do it.
Vision-language models with large image tokens. ViT preprocessing is FLOPS-bound and EPYC trails GPUs by 20–50×.
A single 7B model for one user. A Mac Mini M4 Pro with 64 GB does this for $2,000 at 40+ tok/s and 30 W idle. EPYC is overkill.

The verdict

If you are building a CPU-only LLM server in 2026 with a budget under $5,000, buy a used EPYC 7763 (64 cores, 8-channel DDR4-3200) with 512 GB ECC. Run ik_llama.cpp with the BIOS settings above. Expect 3–4 tok/s on dense 70B and 8–12 tok/s on Qwen3-235B-A22B MoE.

If the budget reaches $8,000–12,000 and you want frontier capability, the EPYC 9684X (Genoa-X) with 768 GB DDR5-4800 is the sweet spot in the market today. It runs DeepSeek-V3 671B Q4 at usable speeds and crushes anything dense up to 123B.

Skip Turin 9755 unless you are running multi-tenant API serving with batched MoE traffic. The price premium over the 9684X does not pay back for single-user or small-team workloads.

Budget	Pick	Best model	Expected tok/s
< $2,000	EPYC 7402P, 256 GB DDR4	Qwen3-Coder 32B Q4	8–12
$2K–5K	EPYC 7763, 512 GB DDR4	Llama 3.3 70B Q4	3–4
$5K–10K	EPYC 9684X, 768 GB DDR5	Mistral Large 2 / DeepSeek-V3	5–11
$10K+	EPYC 9755, 1 TB DDR5	DeepSeek-V3 671B Q5	4–6

For a deeper look at how these numbers are produced, see the BestLLMfor methodology, or visit the French sister site quelllm.fr for European pricing and energy-cost models. Editorial details on the team are in About.

Frequently asked questions

What is the minimum EPYC for running Llama 3.3 70B Q4_K_M usefully?

An EPYC 7402P (24-core Rome) with 8 channels of DDR4-3200 and 128 GB ECC is the practical floor. Expect ~2.6 tok/s — slow for chat, fine for batch summarization or overnight RAG indexing.

Is dual-socket EPYC worth it for LLM inference?

No. NUMA crossing penalties cut effective bandwidth by 30–50% for a single inference process. One Turin 9555 outperforms two Milan 7763s on every model tested and consumes less power, for a similar CPU outlay ($4,900 versus 2 × $2,200 used).

Can I run DeepSeek-V3 671B on a single CPU socket?

Yes. A single EPYC 9755 with 1 TB of DDR5-6400 runs DeepSeek-V3 671B Q4_K_M at 5–8 tok/s, and Q5_K_S at 4–6 tok/s. A single-socket EPYC is the cheapest way to self-host a frontier-class model in 2026.

Does AVX-512 actually matter on EPYC?

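Yes, materially. Zen 4 and Zen 5 both implement AVX-512 with VNNI and BF16: Zen 4 executes 512-bit instructions over 256-bit datapaths, while Zen 5 server parts have a full 512-bit datapath. A llama.cpp build compiled without these flags leaves 15–25% performance on the floor on Genoa and Turin.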

llama.cpp vs ZenDNN vLLM — which is faster?

llama.cpp (ik_llama fork) wins at batch size 1–4. vLLM with ZenDNN 5.2 wins at batch size ≥ 8, where ZenDNN's kernels measure 1.6–2.1× faster than stock PyTorch CPU. For an interactive single-user assistant, llama.cpp is the right answer.

Is ECC RAM mandatory?

Strongly recommended. A 70B model holds tens of billions of weights in memory for hours or days. Single-bit flips silently corrupt outputs. Affordable, high-capacity ECC RDIMM support is one of the main reasons EPYC beats Threadripper at this job.
