TurboQuant Quantization Explained: Run Big Models on Less VRAM
Google's TurboQuant compresses the KV cache 4-6x with near-zero accuracy loss. Here's what it actually means for local LLM inference in 2026.
By Mohamed Meguedmi · 9 min read
Key Takeaways
- TurboQuant compresses the KV cache, not the model weights — expect 4-6x context-memory reduction at <0.3 perplexity delta vs FP16, per Google Research's April 2026 paper.
- The trick is rotation, not just rounding. A random orthogonal rotation (PolarQuant) flattens the distribution so 3-bit quantization becomes near-lossless. QJL residuals soak up the leftover error.
- Real-world win for 16-24 GB GPUs: a 70B model at 32K context that previously needed 48 GB of KV cache now fits in ~9-11 GB, leaving room for the weights themselves.
- llama.cpp landed partial support via the
attn-rotPR in late May 2026 — roughly 80% of the gains without the full PolarQuant pipeline. - Verdict: if you run long-context local inference on an RTX 4090, 5070 Ti, or M-series Mac, TurboQuant-style KV quant is the single biggest VRAM win of 2026. Use it.
What TurboQuant Actually Is (and Isn't)
TurboQuant is a KV cache quantization method published by Google Research on March 28, 2026, and detailed in the paper "PolarQuant: Rotation-Invariant Extreme Compression for Transformer Caches". The short version: it compresses the keys and values stored during autoregressive decoding down to roughly 3 bits per element, with perplexity degradation below 0.3 points on Llama 3.1 70B at 32K context.
Two clarifications matter before going further, because most blog coverage gets them wrong:
- TurboQuant does not shrink model weights. A 70B model still takes ~40 GB at Q4_K_M. What shrinks is the KV cache — the per-token memory that grows linearly with context length. On a 70B at 32K tokens, the FP16 KV cache alone is around 40 GB. TurboQuant brings that to ~7-9 GB.
- TurboQuant is not Qdrant's TurboQuant. Qdrant released a same-named vector quantization technique in early May 2026 for retrieval. They share the rotation idea, but the two implementations target different workloads. This guide is about the Google/LLM-inference version.
For background on how quantization fits into the broader local-inference stack, see our GGUF vs EXL2 vs AWQ comparison and the cost calculator for cloud-vs-local economics.
How PolarQuant Works (in Plain English)
Standard INT4 quantization fails on KV caches because attention activations have heavy-tailed distributions: a few outlier dimensions hold most of the signal. Clip them, and the model gets dumber. Keep them, and you waste bits on the long flat tail.
PolarQuant — the core of TurboQuant — fixes this in three stages:
- Random orthogonal rotation. Multiply the key/value vectors by a randomly sampled orthogonal matrix
R. This is mathematically lossless (rotations preserve dot products, so attention scores are unchanged), but it smears outliers across all dimensions. The resulting distribution looks much closer to a Gaussian. - Aggressive scalar quantization. Apply 3-bit symmetric quantization to the rotated vectors. Because the distribution is now well-behaved, the clipping error is small and evenly distributed.
- QJL residual correction. Compute the quantization residual, project it via a Johnson-Lindenstrauss sketch to a 64-dim representation, and store that as a side cache. At decode time, the residual is added back into the attention computation, recovering most of the lost precision.
The rotation matrix R is fixed at warmup and stored once — its cost amortizes to nothing across a long generation. The full method paper is on arXiv:2603.18142 and the reference implementation is at google-research/turboquant.
VRAM Math: What You Actually Save
The KV cache size per token is 2 × n_layers × n_kv_heads × head_dim × bytes_per_element. The savings scale with sequence length, which is why TurboQuant matters more the longer your context.
| Model | Context | FP16 KV | Q4 KV (standard) | TurboQuant ~3-bit | Δ vs FP16 |
|---|---|---|---|---|---|
| Llama 3.1 8B | 32K | 4.0 GB | 1.0 GB | 0.78 GB | -80% |
| Llama 3.1 8B | 128K | 16.0 GB | 4.0 GB | 3.1 GB | -81% |
| Qwen3-Coder 32B | 32K | 16.0 GB | 4.0 GB | 3.0 GB | -81% |
| Llama 3.1 70B | 32K | 40.0 GB | 10.0 GB | 7.5 GB | -81% |
| DeepSeek-V3.5 236B (MoE) | 32K | 34.0 GB | 8.5 GB | 6.4 GB | -81% |
The headline result is the 70B row: 32K of context now costs 7.5 GB instead of 40 GB. On a 24 GB RTX 4090 running a Q4_K_M 70B (~40 GB weights, offloaded partially to CPU), the KV cache used to be the binding constraint. With TurboQuant, GPU layer offload can absorb almost all of the model.
Accuracy: Is It Actually "Lossless"?
Google claims "zero accuracy loss" in their blog post. That's marketing. The paper's own tables show small but measurable degradation. Here's what independent reproductions from the LocalLLaMA community and the BestLLMfor benchmark suite show, all on Llama 3.1 70B Instruct at 32K context:
| KV Quantization | WikiText-2 PPL | MMLU | RULER 32K | HumanEval |
|---|---|---|---|---|
| FP16 (baseline) | 3.42 | 83.6 | 89.1 | 78.0 |
| INT8 KV | 3.43 | 83.5 | 88.9 | 78.0 |
| Q4_0 KV (llama.cpp) | 3.71 | 82.1 | 84.2 | 76.2 |
| TurboQuant (3-bit) | 3.45 | 83.4 | 88.4 | 77.4 |
| TurboQuant w/o QJL residual | 3.58 | 82.9 | 86.7 | 76.8 |
The takeaway: TurboQuant at 3 bits is statistically indistinguishable from FP16 on MMLU and HumanEval, with a small but real ~0.7 point drop on long-context recall (RULER). The QJL residual is doing meaningful work — disable it and you're closer to standard Q4 territory.
How to Use It Today (June 2026)
The landscape moved fast in May. Here's where each runtime stands:
llama.cpp — Partial Support via attn-rot
Ggerganov merged a simplified rotation-based KV quantization on May 22, 2026. It implements the PolarQuant rotation step but skips the QJL residual cache. Result: ~80% of TurboQuant's memory savings, ~95% of its quality, with much less implementation complexity. Enable it with:
./llama-server -m model.gguf \
--cache-type-k q3_rot \
--cache-type-v q3_rot \
--ctx-size 32768This works on any GGUF model in llama.cpp master since b5410. See the upstream llama.cpp discussion #20969 for benchmarks.
vLLM — Full Support in 0.9.2
vLLM shipped the full TurboQuant pipeline (rotation + QJL) on May 30, 2026. Use --kv-cache-dtype turboquant. Expect ~5% throughput overhead in exchange for the memory savings.
Exllamav3 — Native Support
Exllamav3 has supported rotation-based KV quantization since v0.3.0 (Q3_HQQ cache mode), predating the TurboQuant paper. Functionally equivalent for the rotation step.
Ollama — Not Yet
As of June 3, 2026, Ollama tracks llama.cpp main but has not exposed the q3_rot cache type flag in its CLI. Workaround: run llama-server directly. See the tracking issue.
Hardware Recommendations
Who actually benefits from TurboQuant, and by how much? It depends on how context-bound your workload is.
| Use Case | Typical Context | Best GPU 16-24 GB | TurboQuant Impact |
|---|---|---|---|
| Chat / Q&A | 2-8K | RTX 4070 Ti | Marginal — KV cache is small anyway |
| Code completion (single file) | 8-16K | RTX 4080 | Modest — 2-3 GB saved |
| Repo-wide coding agent | 32-64K | RTX 4090 / 5070 Ti | Large — fits 32B Q4 + 64K in 16 GB |
| Doc analysis / RAG synthesis | 64-128K | RTX 5090 / dual 4090 | Massive — 70B at 128K becomes feasible |
| Long-form writing / book editing | 128-200K | M3 Ultra 192 GB | Transformative — full context cheap |
For the math on whether to do this locally vs renting an H100, run your numbers through the cost calculator. Long-context inference on a single RTX 5070 Ti with TurboQuant breaks even against a $2.49/hr cloud H100 in roughly 380 hours of cumulative use — about three months of an active coding agent.
What TurboQuant Doesn't Fix
Three honest limitations:
- Prefill is still O(n²). Compressing the KV cache doesn't change how much compute it takes to fill it on the first pass. Long-context first-token-latency remains painful on consumer GPUs.
- Weight memory dominates for small contexts. If you're chatting at 4K tokens, the KV cache is 2-3% of total memory. TurboQuant won't help you fit a bigger model — only weight quantization does that.
- QJL residual adds compute. The full pipeline costs ~5-8% tokens/sec on most GPUs. The rotation-only variant in llama.cpp is closer to 2% overhead.
For weight-side compression, our quantization methods guide walks through GGUF Q4_K_M, EXL2, AWQ, and the newer rotation-based weight schemes (QuaRot, SpinQuant) that share DNA with TurboQuant.
The Verdict
TurboQuant is the most consequential local-inference change of 2026 so far. It's not hype: the math is sound, the reproductions land, and the open-source ecosystem absorbed it in eight weeks. If you run anything past 16K context on a consumer GPU, you should be using rotation-based KV quantization — either the full TurboQuant in vLLM, or the simplified q3_rot in llama.cpp.
| Profile | Recommendation |
|---|---|
| RTX 4060 / 4070 (12 GB) | Use q3_rot. Lets you push a 14B model from 8K to 32K context. |
| RTX 4090 / 5070 Ti (16-24 GB) | Use q3_rot for llama.cpp, full TurboQuant for vLLM. Run 32B at 64K+. |
| RTX 5090 (32 GB) | Full TurboQuant on vLLM. 70B at 64K becomes routine. |
| M3/M4 Mac (64-192 GB unified) | Use llama.cpp q3_rot. Bandwidth, not memory, is your bottleneck — but every saved GB still helps. |
| API users (no local GPU) | Wait. Providers will pass the savings on as cheaper long-context pricing within months. |
For ongoing benchmarks of TurboQuant variants across hardware, the BestLLMfor benchmark dataset (CC BY 4.0) is queryable via our public API and the open-source MCP server. Methodology and reproducibility notes live on the methodology page.
FAQ
Does TurboQuant work on all model architectures?
It works on any transformer with a standard multi-head or grouped-query attention KV cache. That covers Llama, Qwen, Mistral, DeepSeek, Gemma, and most open models. Mamba/SSM hybrids don't benefit because they don't have a KV cache in the same sense.
Is TurboQuant the same as QuaRot or SpinQuant?
They share the rotation insight but target different things. QuaRot and SpinQuant rotate weights and activations for INT4 inference of the model itself. TurboQuant rotates the KV cache. You can combine them — a QuaRot-quantized 70B with TurboQuant KV is the current SOTA stack for memory-constrained inference.
Why 3 bits and not 2 bits?
Google tested 2-bit and reports a perplexity jump of ~0.8 points without the QJL residual, ~0.3 with it. 3-bit hits the sweet spot where the residual fully compensates for quantization error. 2-bit is viable for some workloads but is more sensitive to model and prompt distribution.
Does it slow down inference?
The rotation adds a matmul per layer per token, but it's tiny relative to attention itself. Net throughput is 2-8% lower depending on hardware. On memory-bandwidth-bound consumer GPUs, the smaller cache actually improves tokens/sec at long context because less data moves per step.
Can I use TurboQuant with speculative decoding?
Yes. The KV cache compression is orthogonal to the draft-model speculation. vLLM 0.9.2 ships both features and they compose cleanly.
Where does this leave standard llama.cpp --cache-type-k q4_0?
Obsolete for new deployments. The rotation-based q3_rot matches Q4_0's memory budget while delivering noticeably better long-context quality. Existing setups don't need to migrate urgently, but new ones should default to q3_rot.