Guide · 2026-05-16

Best Local LLM on AMD EPYC — CPU-Only Server Inference

AMD EPYC servers can run 70B-class and even 671B MoE models without a single GPU — once memory bandwidth, NUMA layout, and quantization are tuned.

By Mohamed Meguedmi · 11 min read

Key takeaways

  • Memory bandwidth is the only metric that matters. A 12-channel DDR5-4800 EPYC 9004/9005 platform delivers ~460 GB/s, roughly 2.2× the ~205 GB/s an 8-channel DDR4-3200 EPYC 7003 manages. Token throughput tracks bandwidth almost linearly.
  • The cheapest viable EPYC LLM server in 2026 is a used Rome or Milan box (EPYC 7402P to 7763). Expect 3–4 tok/s on Llama 3.3 70B Q4_K_M and 8–12 tok/s on Qwen3-235B-A22B MoE.
  • Genoa-X (9004X) with 1.1 GB of 3D V-Cache changes the math. The EPYC 9684X hits 11 tok/s on 70B Q4 — competitive with a single RTX 4090 for offline batch work, with 10× the RAM capacity.
  • llama.cpp (ik_llama fork) beats ZenDNN-accelerated vLLM for small-batch interactive use. vLLM only wins at batch size 8 and above.
  • Skip dual-socket builds. NUMA crossing destroys throughput; one fat socket beats two skinny ones every time for LLM inference.

Why EPYC, not Xeon or Threadripper, for CPU-only LLM inference

CPU inference is a bandwidth problem, not a compute problem. Every forward pass through a 70B model at INT4 reads roughly 35 GB of weights — once per token. The fastest processor in the world will idle waiting for DRAM if the memory channels can't keep up. That is the entire reason AMD EPYC has quietly become the default platform for self-hosted LLM serving without GPUs.
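The bandwidth argument can be turned into a back-of-the-envelope ceiling. The sketch below is illustrative only: it assumes every generated token streams the full weight set from DRAM exactly once and ignores KV-cache reads, cache hits, and compute time. The function name is hypothetical; the 35 GB and ~460 GB/s figures are the ones quoted in this guide.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token streams all weights from DRAM once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Figures quoted in this guide: ~35 GB of INT4 weights for a 70B model,
# ~460 GB/s peak on a 12-channel DDR5-4800 socket.
ceiling = decode_ceiling_tok_s(bandwidth_gb_s=460.0, weights_gb=35.0)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tok/s")
```

With these inputs the ceiling works out to about 13 tok/s. Real throughput lands well below it, because sustained bandwidth is lower than the theoretical peak and Q4_K_M files are larger than a pure 4-bit estimate.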

A consumer Ryzen 9 9950X tops out at two channels of DDR5-5600 — around 90 GB/s. EPYC Genoa (9004) and Turin (9005) ship with twelve channels of DDR5-4800 or DDR5-6400, pushing 460–614 GB/s into a single socket. That is GPU-class bandwidth on commodity DRAM. Intel Granite Rapids offers eight or twelve channels depending on SKU and trails Turin by 15–25% per socket at comparable price points; Threadripper Pro 7000 (WRX90) maxes at eight channels of DDR5-5200 and costs more per usable bandwidth dollar than a refurbished Genoa.
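The peak figures above follow from simple arithmetic: channels × transfer rate × bytes per transfer. A minimal sketch, assuming a 64-bit (8-byte) data path per memory channel and decimal gigabytes; the helper name is hypothetical.

```python
def peak_bandwidth_gb_s(channels: int, mt_per_s: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s, assuming a 64-bit data path per channel."""
    return channels * mt_per_s * bytes_per_transfer / 1000


# (channels, transfer rate in MT/s) for the platforms discussed in this guide.
configs = {
    "Ryzen 9 9950X, 2 x DDR5-5600": (2, 5600),
    "EPYC Milan, 8 x DDR4-3200": (8, 3200),
    "EPYC Genoa, 12 x DDR5-4800": (12, 4800),
    "EPYC Turin, 12 x DDR5-6400": (12, 6400),
}

for name, (channels, speed) in configs.items():
    print(f"{name}: {peak_bandwidth_gb_s(channels, speed):.1f} GB/s")
```

These are theoretical peaks. Sustained bandwidth measured with a streaming benchmark is typically lower, which is one reason measured tok/s sits below the bandwidth-bound ceiling.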

The other reason: ECC RDIMMs. Running a 70B+ model for days at a time on non-ECC memory is a recipe for silent corruption. EPYC is the only platform under $5,000 (used) that supports 768 GB–1 TB of ECC capacity, enough to load DeepSeek-V3 671B Q4 with KV-cache headroom.

Memory bandwidth is the bottleneck — measured

The table below shows measured llama.cpp throughput on Llama 3.3 70B Q4_K_M, single user, 2,048-token context, no speculative decoding. Data aggregated from the BestLLMfor public benchmark dataset (CC BY 4.0, available via api.bestllmfor.com/v1/benchmarks) and corroborated against published llama.cpp GitHub discussion threads.

| Platform | Channels × speed | Peak BW (GB/s) | 70B Q4 tok/s | Used price (May 2026) |
| --- | --- | --- | --- | --- |
| EPYC 7282 (Rome, 16c) | 8 × DDR4-2666 | 170 | 2.1 | $950 |
| EPYC 7402P (Rome, 24c) | 8 × DDR4-3200 | 205 | 2.6 | $1,100 |
| EPYC 7763 (Milan, 64c) | 8 × DDR4-3200 | 205 | 3.4 | $2,200 |
| EPYC 9354P (Genoa, 32c) | 12 × DDR5-4800 | 460 | 7.2 | $3,800 |
| EPYC 9684X (Genoa-X, 96c) | 12 × DDR5-4800 + 1.1 GB V-Cache | 460 | 11.4 | $6,500 |
| EPYC 9555 (Turin, 64c) | 12 × DDR5-6000 | 576 | 9.8 | $4,900 |
| EPYC 9755 (Turin Dense, 128c) | 12 × DDR5-6400 | 614 | 10.6 | $9,200 |

Three things jump out. First, doubling cores from 32 to 64 on the same memory subsystem yields ~10–15%; in the table, even going from 24 cores (7402P) to 64 cores (7763) on the same 8 × DDR4-3200 memory adds only about 30% (2.6 to 3.4 tok/s), confirming the bandwidth ceiling. Second, the 9684X with V-Cache outperforms the higher-bandwidth 9755 on the 70B workload because the attention KV cache fits largely inside the 1.1 GB L3, cutting DRAM traffic by ~25%. Third, used Milan (EPYC 7003) is the budget winner if you can live with 3–4 tok/s.
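To see why a 1.1 GB L3 is relevant at this context length, it helps to estimate the KV-cache footprint. The sketch below is a rough estimate under stated assumptions, not a measurement: it uses the published Llama 3 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and a 16-bit cache. Those architecture values and the helper name are not taken from this guide.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the attention KV cache: one key and one value vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


# Assumed Llama 3 70B shape (not stated in this guide) at the 2,048-token benchmark context.
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_tokens=2048)
v_cache_bytes = 1.1e9  # the 1.1 GB L3 figure quoted for the EPYC 9684X

print(f"KV cache: {size / 2**20:.0f} MiB")
print(f"fits in 1.1 GB of L3: {size < v_cache_bytes}")
```

Under these assumptions the cache is about 640 MiB at 2,048 tokens, so it can plausibly stay resident in L3. At longer contexts it outgrows the cache and the advantage should shrink.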

Which models actually make sense on EPYC

The reflex is to grab the largest model the RAM will hold. That is wrong. Below 5 tok/s, any model is unusable for chat. EPYC excels at three workloads: large-context batch jobs, MoE inference where only a fraction of weights activate per token, and 7B–32B dense models served to a small team.

| Model | Quant | RAM needed | Best EPYC use case | Throughput hint |
| --- | --- | --- | --- | --- |
| Qwen3-Coder 32B | Q4_K_M | 20 GB | Interactive code completion, 1–3 devs | 12–18 tok/s on Genoa |
| Llama 3.3 70B Instruct | Q4_K_M | 42 GB | Batch summarization, RAG backend | 7–11 tok/s on Genoa-X |
| Mistral Large 2 123B | Q4_K_M | 73 GB | Long-context document review | 4–6 tok/s on Genoa-X |
| Qwen3-235B-A22B (MoE) | Q4_K_M | 140 GB | High-throughput agent server | 14–22 tok/s on Turin |
| DeepSeek-V3 671B (MoE) | Q4_K_M | 380 GB | Frontier reasoning, batch only | 5–8 tok/s on Turin 9755 |
| DeepSeek-V3 671B | Q5_K_S | 480 GB | Same, less quant loss | 4–6 tok/s on Turin 9755 |

MoE models are the unfair advantage of CPU inference. Qwen3-235B-A22B activates only 22B parameters per token despite holding 235B in RAM, so the bytes read per token drop to roughly a tenth of the full model, and EPYC's vast memory capacity matters more than peak per-core compute. DeepSeek-V3 671B Q4 fits in 384 GB of RAM and delivers usable speeds on a single Turin socket, something no single GPU can claim; even an H200 NVL, at 141 GB, holds well under half of the Q4 weights. See the DeepSeek-V3 model card and the Qwen3-235B-A22B card for activation patterns.
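The MoE advantage can be put in numbers with a proportional estimate. This is a simplified sketch: it assumes the quantized file size is spread evenly across parameters, which ignores that attention and shared layers are read on every token and that expert routing varies. The helper name is hypothetical; the 140 GB, 235B, 22B, and 42 GB figures come from this guide.

```python
def active_gb_per_token(total_gb: float, total_params_b: float, active_params_b: float) -> float:
    """Approximate weight bytes read per token for an MoE model, in GB."""
    return total_gb * active_params_b / total_params_b


# Qwen3-235B-A22B at Q4_K_M: 140 GB in RAM, 22B of 235B parameters active per token.
qwen_gb = active_gb_per_token(total_gb=140.0, total_params_b=235.0, active_params_b=22.0)

# Dense Llama 3.3 70B at Q4_K_M reads its full 42 GB on every token.
dense_gb = 42.0

print(f"Qwen3-235B-A22B: ~{qwen_gb:.1f} GB read per token")
print(f"Dense 70B reads {dense_gb / qwen_gb:.1f}x more per token")
```

By this estimate the 235B MoE model moves roughly a third of the data per token that the dense 70B model does, which is consistent with it posting higher tok/s in the table above despite needing more than three times the RAM.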

Inference engine choice: llama.cpp wins for most teams

There are three serious options on EPYC: stock llama.cpp, the ik_llama.cpp fork, and AMD's ZenDNN-accelerated vLLM. Verdict first.

For 1–4 concurrent users: llama.cpp, specifically the ik_llama fork with --numa distribute and -t set to physical cores on a single NUMA node. It supports every relevant quant (Q4_K_M, IQ4_XS, Q5_K_S, Q6_K), handles MoE routing efficiently, and gets new model architectures within 48 hours of release. See the llama.cpp repository.

For batch sizes ≥ 8 (an internal API serving multiple agents simultaneously): vLLM 0.7+ with ZenDNN 5.2. AMD's ZenDNN 5.2 release brings AVX-512 BF16 and INT8 kernels tuned for Zen 4 and Zen 5, with measured 1.6–2.1× speedups over stock PyTorch CPU on batched workloads. Single-stream interactive use is unchanged.

Skip Ollama on EPYC. It is a thin wrapper over llama.cpp that picks its own thread count unless overridden and does not expose llama.cpp's NUMA flags at launch. For a personal laptop it is fine; on a dual-NUMA Genoa it leaves 25–40% on the floor.

Tuning: the settings that double throughput

A stock Linux install plus a stock llama.cpp build leaves enormous performance on the table on EPYC. The five steps below are non-negotiable for production.

How to tune EPYC for llama.cpp inference

  1. BIOS: set NPS1 (or NPS2 with manual pinning). NPS4 splits the socket into four NUMA nodes and devastates throughput for a single-process inference engine. NPS1 presents the whole socket as one node.
  2. BIOS: enable AVX-512 on Genoa and newer. Disable Spectre-class CPU mitigations (a kernel boot parameter, mitigations=off, rather than a BIOS option) only if the host is dedicated and isolated; the measurable gain is 8–12%.
  3. BIOS: Determinism Slider = Performance, cTDP = maximum, C-states no deeper than C1.
  4. OS: use a recent kernel (≥ 6.6) and set transparent_hugepage=always. Page-table walk overhead is real on 380 GB models.
  5. llama.cpp: configure the CMake build with -DGGML_NATIVE=ON -DGGML_AVX512=ON -DGGML_AVX512_VBMI=ON -DGGML_AVX512_VNNI=ON (older trees spelled the first option LLAMA_NATIVE), run with --numa distribute -t <physical_cores>, and pin the process with numactl --cpunodebind=0 --membind=0.
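The build and launch step can be sketched as a small helper that assembles the two command lines. This is an illustration under assumptions, not a canonical recipe: it uses the current CMake option names (GGML_NATIVE and friends; older trees used LLAMA_-prefixed names), assumes the server binary lands at ./build/bin/llama-server, and uses a placeholder model path. The function names are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex


def cmake_configure_args() -> list[str]:
    """CMake configure command enabling native tuning and the AVX-512 paths."""
    return [
        "cmake", "-B", "build",
        "-DGGML_NATIVE=ON",
        "-DGGML_AVX512=ON",
        "-DGGML_AVX512_VBMI=ON",
        "-DGGML_AVX512_VNNI=ON",
    ]


def launch_args(model_path: str, physical_cores: int, node: int = 0) -> list[str]:
    """Launch command pinned to one NUMA node, one thread per physical core."""
    return [
        "numactl", f"--cpunodebind={node}", f"--membind={node}",
        "./build/bin/llama-server",  # assumed binary location
        "-m", model_path,
        "--numa", "distribute",
        "-t", str(physical_cores),
    ]


# Placeholder model path and a 64-core socket, for illustration only.
print(shlex.join(cmake_configure_args()))
print(shlex.join(launch_args("/models/llama-3.3-70b-q4_k_m.gguf", physical_cores=64)))
```

Set physical_cores to the number of physical cores on the pinned node, not the SMT thread count, and check the flags against the help output of the build you actually run, since forks can differ.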

Step 5 alone routinely yields 30–50% over an unaware default build. The BestLLMfor editorial team publishes exact compile flags and numactl invocations per CPU SKU through the open CC BY 4.0 API at api.bestllmfor.com/v1/profiles/epyc, and the open-source quelllm-mcp server exposes the same data to local agents over the Model Context Protocol.

Cost: EPYC vs GPU for the same throughput

Headline number: a used EPYC 7763 server with 512 GB DDR4-3200 ECC, motherboard, PSU, and chassis lands around $4,500 in May 2026 and delivers ~3.4 tok/s on Llama 3.3 70B Q4. A new RTX 4090 (24 GB) costs ~$2,200 but cannot load 70B Q4 without offloading layers to system RAM, dropping throughput to 8–10 tok/s — not a clean comparison.

The honest comparison is against a dual RTX 3090 build (~$1,800 used GPUs + $1,200 host) running Llama 3.3 70B Q4 at 18–22 tok/s. The GPU build wins on raw tok/s. EPYC wins on three other axes: capacity (DeepSeek-V3 671B simply does not fit on 48 GB of VRAM), idle power (~140 W vs ~280 W), and lifespan (server-grade ECC and 24/7 duty cycles vs consumer GPUs running at 85–90 °C).

The break-even depends on the workload. Use the cost calculator to model your specific hours-per-day, electricity rate, and depreciation assumptions. For inference workloads under 4 hours per day with models above 70B, EPYC wins on total cost of ownership over 3 years. For high-throughput chat serving, stick with GPUs.
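For readers who prefer to run the numbers themselves, a minimal total-cost sketch follows. It is not the site's cost calculator: the function name is hypothetical, depreciation and resale value are left out, and the load wattages and electricity rate are placeholder assumptions to replace with your own measurements. Only the hardware prices and idle wattages come from this guide.

```python
def three_year_cost(hardware_usd: float, idle_watts: float, load_watts: float,
                    hours_per_day: float, usd_per_kwh: float, years: int = 3) -> float:
    """Hardware price plus electricity, with the box idling whenever it is not serving."""
    daily_kwh = (load_watts * hours_per_day + idle_watts * (24 - hours_per_day)) / 1000
    return hardware_usd + daily_kwh * 365 * years * usd_per_kwh


# Prices and idle power from this guide; load power and $/kWh are ASSUMED placeholders.
epyc = three_year_cost(hardware_usd=4500, idle_watts=140, load_watts=350,
                       hours_per_day=4, usd_per_kwh=0.15)
gpus = three_year_cost(hardware_usd=3000, idle_watts=280, load_watts=800,
                       hours_per_day=4, usd_per_kwh=0.15)

print(f"EPYC 7763 server, 3 years: ${epyc:,.0f}")
print(f"Dual RTX 3090 build, 3 years: ${gpus:,.0f}")
```

The comparison only means something when both machines can run the model in question. For models that do not fit in 48 GB of VRAM, the GPU column has no valid entry, however the electricity arithmetic comes out.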

When NOT to choose EPYC

Be honest about the limits.

  • Latency-sensitive chat at scale. Time-to-first-token on EPYC is 200–800 ms for any 70B+ model. A user waiting on an agent response will notice.
  • Training or fine-tuning. CPU training is technically possible. It is also slower than waiting for a colleague with a GPU to do it.
  • Vision-language models with large image tokens. ViT preprocessing is FLOPS-bound and EPYC trails GPUs by 20–50×.
  • A single 7B model for one user. A Mac Mini M4 Pro with 64 GB does this for $2,000 at 40+ tok/s and 30 W idle. EPYC is overkill.

The verdict

If you are building a CPU-only LLM server in 2026 with a budget under $5,000, buy a used EPYC 7763 (64 cores, 8-channel DDR4-3200) with 512 GB ECC. Run ik_llama.cpp with the BIOS, OS, and build settings above. Expect 3–4 tok/s on dense 70B and 8–12 tok/s on Qwen3-235B-A22B MoE.

If the budget reaches $8,000–12,000 and you want frontier capability, the EPYC 9684X (Genoa-X) with 768 GB DDR5-4800 is the sweet spot in the market today. It runs DeepSeek-V3 671B Q4 at usable speeds and crushes anything dense up to 123B.

Skip Turin 9755 unless you are running multi-tenant API serving with batched MoE traffic. The price premium over the 9684X does not pay back for single-user or small-team workloads.

| Budget | Pick | Best model | Expected tok/s |
| --- | --- | --- | --- |
| < $2,000 | EPYC 7402P, 256 GB DDR4 | Qwen3-Coder 32B Q4 | 8–12 |
| $2K–5K | EPYC 7763, 512 GB DDR4 | Llama 3.3 70B Q4 | 3–4 |
| $5K–10K | EPYC 9684X, 768 GB DDR5 | Mistral Large 2 / DeepSeek-V3 | 5–11 |
| $10K+ | EPYC 9755, 1 TB DDR5 | DeepSeek-V3 671B Q5 | 4–6 |

For a deeper look at how these numbers are produced, see the BestLLMfor methodology, or visit the French sister site quelllm.fr for European pricing and energy-cost models. Editorial details on the team are in About.

Frequently asked questions

What is the minimum EPYC for running Llama 3.3 70B Q4_K_M usefully?

An EPYC 7402P (24-core Rome) with 8 channels of DDR4-3200 and 128 GB ECC is the practical floor. Expect ~2.6 tok/s — slow for chat, fine for batch summarization or overnight RAG indexing.

Is dual-socket EPYC worth it for LLM inference?

No. NUMA crossing penalties cut effective bandwidth by 30–50% for a single inference process. One Turin 9555 outperforms two Milan 7763s on every model tested and consumes less power, at a comparable CPU price ($4,900 versus 2 × $2,200 in the table above).

Can I run DeepSeek-V3 671B on a single CPU socket?

Yes. A single EPYC 9755 with 1 TB of DDR5-6400 runs DeepSeek-V3 671B Q4_K_M at 5–8 tok/s, and Q5_K_S at 4–6 tok/s. This is the cheapest way to self-host a frontier-class model in 2026.

Does AVX-512 actually matter on EPYC?

Yes, materially. Zen 4 and Zen 5 both support AVX-512 with VNNI and BF16; Zen 4 executes 512-bit operations over 256-bit datapaths, while Zen 5 adds a native 512-bit datapath. A llama.cpp build compiled without these flags leaves 15–25% performance on the floor on Genoa and Turin.

llama.cpp vs ZenDNN vLLM — which is faster?

llama.cpp (ik_llama fork) wins at batch size 1–4. vLLM with ZenDNN 5.2 wins at batch size ≥ 8, where the ZenDNN kernels run 1.6–2.1× faster than stock PyTorch CPU. For an interactive single-user assistant, llama.cpp is the right answer.

Is ECC RAM mandatory?

Strongly recommended. A 70B model holds tens of billions of weights in memory for hours or days, and a single-bit flip can silently corrupt outputs. EPYC's ECC support is one of the main reasons it beats Threadripper at this job.