Real measurements

What we actually measured.

Name: BestLLMfor LLM benchmark matrix
Published: 2026-04-18
License: https://creativecommons.org/licenses/by/4.0/

Tokens per second below come from a physical test bench. Six hardware configs, six popular models, one reference runtime (llama.cpp). No estimates, no marketing.

Fresh data · 2026-04-18

llama.cpp @ b4280, context 2048, batch 1, FlashAttention ON

Methodology →

Model	RTX 5090 32 GB · Blackwell	RTX 4090 24 GB · Ada	RTX 4070 12 GB · Ada	RTX 3060 12 GB · Ampere	Mac M3 Max 64 GB · Unified 64 GB	Ryzen 7 7700 0 GB · CPU only
Mistral 7B Q4	188 tok/s	142 tok/s	88 tok/s	52 tok/s	72 tok/s	9.2 tok/s
Llama 3.1 8B Q4	172 tok/s	128 tok/s	76 tok/s	44 tok/s	64 tok/s	7.8 tok/s
Qwen 2.5 14B Q4	108 tok/s	82 tok/s	44 tok/s	22 tok/s	38 tok/s	3.9 tok/s
Qwen 2.5 32B Q4	58 tok/s	42 tok/s	—	—	19 tok/s	1.8 tok/s
Llama 3.3 70B Q4	22 tok/s	—	—	—	9.8 tok/s	—
Mixtral 8x7B Q4	96 tok/s	68 tok/s	—	—	34 tok/s	4.2 tok/s

● Fast (top tier) ● Medium ● Slow / borderline — Doesn't fit / can't run

How to read this

Quantization: all results at Q4_K_M. Higher quants are slower; FP16 is roughly 2× slower than Q4 but uses 4× the VRAM.
Context: short (2048 tokens). Long contexts (32k+) reduce throughput by 30–60% on most cards.
Batch: batch=1 (single request, like a chat session). Server-side batched inference is ~5–15× higher throughput.
Apple Silicon: Mac M3 Max uses the unified memory architecture — the 64 GB number is total system memory, not dedicated VRAM.
CPU baseline: Ryzen 7 7700 numbers are single-thread bound; multi-threaded inference can be 2–3× faster but rarely matches GPU.
Refresh cadence: matrix updated when a new top-tier hardware ships or when llama.cpp posts a measurable speedup.

Want a number for your own config?

The configurator on the home page estimates tokens/sec for any GPU we track, derived from this same matrix — not from vendor claims.

Try the configurator Cost ROI calculator