Real measurements
What we actually measured.
Tokens per second below come from a physical test bench. Six hardware configs, six popular models, one reference runtime (llama.cpp). No estimates, no marketing.
| Model | RTX 5090 · 32 GB · Blackwell | RTX 4090 · 24 GB · Ada | RTX 4070 · 12 GB · Ada | RTX 3060 · 12 GB · Ampere | Mac M3 Max · 64 GB unified memory | Ryzen 7 7700 · CPU only (no GPU) |
|---|---|---|---|---|---|---|
| Mistral 7B Q4 | 188 tok/s | 142 tok/s | 88 tok/s | 52 tok/s | 72 tok/s | 9.2 tok/s |
| Llama 3.1 8B Q4 | 172 tok/s | 128 tok/s | 76 tok/s | 44 tok/s | 64 tok/s | 7.8 tok/s |
| Qwen 2.5 14B Q4 | 108 tok/s | 82 tok/s | 44 tok/s | 22 tok/s | 38 tok/s | 3.9 tok/s |
| Qwen 2.5 32B Q4 | 58 tok/s | 42 tok/s | — | — | 19 tok/s | 1.8 tok/s |
| Llama 3.3 70B Q4 | 22 tok/s | — | — | — | 9.8 tok/s | — |
| Mixtral 8x7B Q4 | 96 tok/s | 68 tok/s | — | — | 34 tok/s | 4.2 tok/s |
● Fast (top tier)
● Medium
● Slow / borderline
— Doesn't fit / can't run
How to read this
- Quantization: all results at Q4_K_M. Higher-precision quantizations (such as Q5 or Q8) are slower; FP16 is roughly 2× slower than Q4 and uses roughly 4× the VRAM.
- Context: short (2048 tokens). Long contexts (32k+) reduce throughput by 30–60% on most cards.
- Batch: batch=1 (single request, like a chat session). Server-side batched inference delivers roughly 5–15× higher aggregate throughput.
- Apple Silicon: Mac M3 Max uses the unified memory architecture — the 64 GB number is total system memory, not dedicated VRAM.
- CPU baseline: Ryzen 7 7700 numbers are single-thread bound; multi-threaded inference can be 2–3× faster but rarely matches a GPU.
- Refresh cadence: the matrix is updated when new top-tier hardware ships or when llama.cpp posts a measurable speedup.
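To see how these rules of thumb combine, here is a minimal Python sketch that turns a measured Q4 figure into a rough range for a different setup. The function name `adjust_throughput` and its parameters are hypothetical, chosen for illustration; the factors are the approximate ones listed above (FP16 roughly 2× slower, long context costing 30–60%), so the result is a ballpark, not a measurement.

```python
def adjust_throughput(base_tok_s, long_context=False, fp16=False):
    """Return a (low, high) tok/s range from a measured Q4, short-context figure.

    Illustrative only: the factors are the rough rules of thumb from the
    notes above, not measured values.
    """
    low = high = float(base_tok_s)
    if fp16:
        # FP16 is described as roughly 2x slower than Q4.
        low /= 2.0
        high /= 2.0
    if long_context:
        # Long contexts (32k+) are described as costing 30-60% of throughput.
        low *= 1.0 - 0.60   # worst case: 60% reduction
        high *= 1.0 - 0.30  # best case: 30% reduction
    return round(low, 1), round(high, 1)


# Llama 3.1 8B Q4 on the RTX 4090 measured 128 tok/s at 2048 tokens of context.
print(adjust_throughput(128, long_context=True))  # (51.2, 89.6)
```

By this estimate, the 128 tok/s measured on the RTX 4090 would fall to somewhere between about 51 and 90 tok/s at 32k+ context.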
Want a number for your own config?
The configurator on the home page estimates tokens/sec for any GPU we track, derived from this same matrix — not from vendor claims.
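If you would rather query the matrix yourself, the sketch below shows one way to do it in Python. It is not the configurator's actual code; the names `MATRIX`, `lookup`, and `hardware_meeting` are hypothetical, and the numbers are copied from the table above, with `None` standing in for "doesn't fit / can't run".

```python
HARDWARE = ["RTX 5090", "RTX 4090", "RTX 4070", "RTX 3060", "Mac M3 Max", "Ryzen 7 7700"]

# Tokens per second at Q4_K_M, batch=1, 2048-token context (from the table).
# None marks a combination shown as a dash in the table.
MATRIX = {
    "Mistral 7B":    [188, 142, 88, 52, 72, 9.2],
    "Llama 3.1 8B":  [172, 128, 76, 44, 64, 7.8],
    "Qwen 2.5 14B":  [108, 82, 44, 22, 38, 3.9],
    "Qwen 2.5 32B":  [58, 42, None, None, 19, 1.8],
    "Llama 3.3 70B": [22, None, None, None, 9.8, None],
    "Mixtral 8x7B":  [96, 68, None, None, 34, 4.2],
}


def lookup(model, hardware):
    """Return the measured tok/s, or None if the model does not run there."""
    return MATRIX[model][HARDWARE.index(hardware)]


def hardware_meeting(model, min_tok_s):
    """List hardware that reaches at least min_tok_s for a model, fastest first."""
    hits = [(speed, hw) for hw, speed in zip(HARDWARE, MATRIX[model])
            if speed is not None and speed >= min_tok_s]
    return [hw for speed, hw in sorted(hits, reverse=True)]


print(lookup("Qwen 2.5 14B", "RTX 3060"))    # 22
print(hardware_meeting("Qwen 2.5 32B", 20))  # ['RTX 5090', 'RTX 4090']
```

A lookup like this only covers the six configs and six models measured here; any other GPU needs an estimate rather than a measured value.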