BestLLMfor: Your hardware. Your LLM. Your call.
Real measurements

What we actually measured.

Tokens per second below come from a physical test bench. Six hardware configs, six popular models, one reference runtime (llama.cpp). No estimates, no marketing.

Fresh data · 2026-04-18
llama.cpp @ b4280, context 2048, batch 1, FlashAttention ON
| Model | RTX 5090 (32 GB · Blackwell) | RTX 4090 (24 GB · Ada) | RTX 4070 (12 GB · Ada) | RTX 3060 (12 GB · Ampere) | Mac M3 Max (64 GB · Unified) | Ryzen 7 7700 (0 GB · CPU only) |
|---|---|---|---|---|---|---|
| Mistral 7B Q4 | 188 tok/s | 142 tok/s | 88 tok/s | 52 tok/s | 72 tok/s | 9.2 tok/s |
| Llama 3.1 8B Q4 | 172 tok/s | 128 tok/s | 76 tok/s | 44 tok/s | 64 tok/s | 7.8 tok/s |
| Qwen 2.5 14B Q4 | 108 tok/s | 82 tok/s | 44 tok/s | 22 tok/s | 38 tok/s | 3.9 tok/s |
| Qwen 2.5 32B Q4 | 58 tok/s | 42 tok/s | n/a | n/a | 19 tok/s | 1.8 tok/s |
| Llama 3.3 70B Q4 | 22 tok/s | n/a | n/a | n/a | 9.8 tok/s | n/a |
| Mixtral 8x7B Q4 | 96 tok/s | 68 tok/s | n/a | n/a | 34 tok/s | 4.2 tok/s |

Speed tiers: Fast (top tier) · Medium · Slow / borderline. n/a: doesn't fit / can't run.
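
To make the tok/s figures more tangible, the sketch below converts them into per-token latency and the time needed to generate a reply. It uses only the Mistral 7B Q4 row from the matrix; the 500-token reply length and the function names are illustrative assumptions, and prompt processing time is ignored.

```python
# Generation speeds (tok/s) copied from the Mistral 7B Q4 row of the matrix.
MISTRAL_7B_Q4 = {
    "RTX 5090": 188.0,
    "RTX 4090": 142.0,
    "RTX 4070": 88.0,
    "RTX 3060": 52.0,
    "Mac M3 Max": 72.0,
    "Ryzen 7 7700": 9.2,
}


def ms_per_token(tok_per_s: float) -> float:
    """Average latency per generated token, in milliseconds."""
    return 1000.0 / tok_per_s


def seconds_for_reply(tok_per_s: float, reply_tokens: int = 500) -> float:
    """Time to generate `reply_tokens` tokens (generation only, no prompt processing)."""
    return reply_tokens / tok_per_s


if __name__ == "__main__":
    for hardware, tps in MISTRAL_7B_Q4.items():
        print(
            f"{hardware:<13} {ms_per_token(tps):6.1f} ms/token  "
            f"{seconds_for_reply(tps):5.1f} s per 500 tokens"
        )
```

Read this way, the gap between the fastest GPU and the CPU baseline is the difference between a reply that arrives in a few seconds and one that takes close to a minute.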

How to read this

  • Quantization: all results use Q4_K_M. Higher-precision quantizations (Q5, Q6, Q8) are slower; FP16 is roughly 2× slower than Q4 and needs roughly 3–4× the VRAM for the weights.
  • Context: short (2048 tokens). Long contexts (32k+) reduce throughput by 30–60% on most cards.
  • Batch: batch=1 (single request, like a chat session). Server-side batched inference reaches roughly 5–15× higher aggregate throughput.
  • Apple Silicon: the Mac M3 Max uses a unified memory architecture, so the 64 GB figure is total system memory shared between CPU and GPU, not dedicated VRAM.
  • CPU baseline: Ryzen 7 7700 numbers are single-thread bound; multi-threaded inference can be 2–3× faster but rarely matches a GPU.
  • Refresh cadence: the matrix is updated when new top-tier hardware ships or when llama.cpp posts a measurable speedup.
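
The memory side of the quantization note can be approximated with simple arithmetic: weight size is parameter count times bits per weight. The sketch below is a rough estimate, not the site's method. The bits-per-weight value for Q4_K_M (about 4.85), the 1.5 GB overhead for KV cache and runtime buffers, and the nominal parameter counts are all assumptions chosen for illustration. It counts weights held entirely in VRAM, so a "does not fit" result does not rule out a slower run with part of the model offloaded to system memory.

```python
# Approximate average bits per weight. These values are assumptions:
# Q4_K_M averages a little under 5 bits once per-block scales are included.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "FP16": 16.0}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8.0


def fits_in_vram(
    params_billion: float,
    quant: str,
    vram_gb: float,
    overhead_gb: float = 1.5,  # hypothetical allowance for KV cache and buffers
) -> bool:
    """True if the weights plus a fixed overhead fit entirely in VRAM."""
    return weights_gb(params_billion, quant) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for size in (8, 14, 32):  # nominal parameter counts, in billions
        q4 = weights_gb(size, "Q4_K_M")
        fp16 = weights_gb(size, "FP16")
        print(f"{size}B: Q4_K_M ~{q4:.1f} GB, FP16 ~{fp16:.1f} GB ({fp16 / q4:.1f}x)")
    print("14B Q4_K_M on a 12 GB card:", fits_in_vram(14, "Q4_K_M", 12))
    print("32B Q4_K_M on a 12 GB card:", fits_in_vram(32, "Q4_K_M", 12))
```

Under these assumptions a 14B model at Q4_K_M fits on a 12 GB card while a 32B model does not, which is consistent with the 12 GB cards having no result in the 32B row.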

Want a number for your own config?

The configurator on the home page estimates tokens/sec for any GPU we track. Its estimates are derived from this same matrix, not from vendor claims.
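
The page does not describe how the configurator computes its estimates, so the following is not its algorithm. It is a minimal sketch of how a reader could adjust a measured matrix value by hand using the ranges quoted under "How to read this": a 30–60% loss at long context and a 5–15× gain with server-side batching. The function names are hypothetical, and the baseline is the Llama 3.1 8B Q4 result on the RTX 4090.

```python
from typing import Tuple

# Measured baseline from the matrix: Llama 3.1 8B Q4 on RTX 4090 (tok/s).
BASELINE_TOK_S = 128.0

# Adjustment ranges quoted in "How to read this".
LONG_CONTEXT_LOSS = (0.30, 0.60)  # fraction of throughput lost at 32k+ context
BATCHED_GAIN = (5.0, 15.0)        # multiplier for server-side batched inference


def long_context_range(base: float) -> Tuple[float, float]:
    """(pessimistic, optimistic) tok/s after the long-context reduction."""
    low_loss, high_loss = LONG_CONTEXT_LOSS
    return base * (1.0 - high_loss), base * (1.0 - low_loss)


def batched_range(base: float) -> Tuple[float, float]:
    """(low, high) aggregate tok/s with server-side batching."""
    low_gain, high_gain = BATCHED_GAIN
    return base * low_gain, base * high_gain


if __name__ == "__main__":
    lo, hi = long_context_range(BASELINE_TOK_S)
    print(f"32k+ context: roughly {lo:.1f} to {hi:.1f} tok/s")
    lo, hi = batched_range(BASELINE_TOK_S)
    print(f"batched serving: roughly {lo:.0f} to {hi:.0f} tok/s aggregate")
```

The two adjustments are kept separate because the page gives no guidance on how long context and batching combine.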