LM Studio MTP: The Complete Multi-Token Prediction Guide
Multi-Token Prediction (MTP) is now native in LM Studio. Here is how to enable it, which models support it, and the real tokens/sec gains on consumer GPUs.
By Mohamed Meguedmi · 9 min read
Key takeaways
- MTP is now a first-class setting in LM Studio 0.3.x — no beta flag, no llama.cpp recompile. Toggle it under Model → Inference → Speculative decoding.
- Expect 1.5x–2.2x faster generation on supported models (Qwen3.6, Gemma 4 MTP drafters, DeepSeek-V3.2) with zero quality degradation — outputs are bit-identical to non-MTP runs.
- Only specific GGUFs work. You need a model with embedded MTP heads (look for
-MTPin the file name, typically Unsloth or bartowski builds). - VRAM cost is real: +8% to +18% memory overhead. A Qwen3.6-27B Q4_K_M jumps from ~17 GB to ~19.5 GB with MTP enabled.
- Verdict: on any GPU with ≥16 GB VRAM running a supported model, MTP is a no-brainer. Below 12 GB VRAM, the overhead often cancels the gain.
What MTP actually is (and what it is not)
Multi-Token Prediction is a form of self-speculative decoding. Instead of pairing your main model with a smaller draft model (the classic speculative decoding setup popularized by Leviathan et al., 2023), the main model itself drafts k future tokens at each step using auxiliary prediction heads baked into the weights at training time. The main forward pass then verifies the draft in parallel.
The technique was popularized in production by DeepSeek-V3 and formalized in Gloeckle et al., 2024 ("Better & Faster Large Language Models via Multi-token Prediction"). NVIDIA's Megatron-LM documentation describes the training-time objective in detail. The runtime side — verifying the drafted tokens — is mathematically identical to standard speculative decoding, which is why outputs match the non-MTP run exactly.
What MTP is not: it is not a quantization trick, not a context compression scheme, and not a generic speedup that works on any GGUF. The model has to have been trained with MTP heads. Loading a vanilla Llama 3.1 GGUF and toggling the MTP checkbox in LM Studio will do nothing.
Which models support MTP in LM Studio today
As of June 2026, the supported family in LM Studio is narrow but growing fast. The decisive factor is whether the GGUF was exported with the MTP heads preserved — many community quants strip them by default to save space.
| Model | MTP heads | Recommended GGUF | Min VRAM (Q4_K_M) | Typical speedup |
|---|---|---|---|---|
| Qwen3.6-27B-MTP | 4 heads | unsloth/Qwen3.6-27B-MTP-GGUF | 19.5 GB | 1.65x |
| Qwen3.6-7B-MTP | 2 heads | unsloth/Qwen3.6-7B-MTP-GGUF | 6.2 GB | 1.45x |
| Gemma 4 9B + MTP drafter | 3 heads (drafter) | bartowski/gemma-4-9b-mtp-GGUF | 7.8 GB | 2.10x |
| Gemma 4 27B + MTP drafter | 3 heads (drafter) | bartowski/gemma-4-27b-mtp-GGUF | 20.1 GB | 2.20x |
| DeepSeek-V3.2-Distill-32B | 2 heads | unsloth/DeepSeek-V3.2-Distill-32B-MTP-GGUF | 22.4 GB | 1.55x |
| Llama 3.x / Mistral / Phi | None | — | — | Not supported |
Google's release notes for the Gemma 4 MTP drafters (May 2026) explicitly target LM Studio and llama.cpp as reference runtimes. The Qwen team shipped MTP heads with the 3.6 series from day one. If you are choosing a model purely for inference speed on a fixed budget, our best local LLM for coding shortlist now weighs MTP support heavily.
How to enable MTP in LM Studio (step by step)
The following procedure assumes LM Studio 0.3.18 or later. Earlier versions require enabling the beta channel under Settings → Advanced → Beta features.
- Download an MTP-enabled GGUF. In the LM Studio search panel, type
mtpand filter by Compatibility: Full GPU offload. The file name should end in-MTP-Q4_K_M.ggufor similar. Avoid generic quants of the same model — they will load but the MTP checkbox will be greyed out. - Open the model in the Chat or Developer tab and click the cog icon next to the model name.
- Scroll to "Speculative decoding". You will see a new section labelled Multi-Token Prediction (MTP) with a toggle.
- Enable MTP and set Draft tokens (n_draft). Defaults are 3 for Qwen and 4 for Gemma 4. Going higher than 5 almost always hurts because the acceptance rate collapses.
- Set the acceptance threshold if exposed (LM Studio calls it Min acceptance). Leave it at 0.0 for greedy decoding parity, raise to 0.6–0.8 only if you sample with high temperature and tolerate minor drift.
- Reload the model. The runtime allocates MTP head buffers at load time, so the toggle requires a reload to take effect.
- Verify in the server log. Look for
mtp_heads = N, draft_n = M, acceptance_rate = ...in the developer console. An acceptance rate below 50% means n_draft is too high for your prompt distribution.
If you serve the model through the LM Studio OpenAI-compatible API, MTP is applied transparently — no client-side changes are required. The same applies to consumers of our public model catalog API, which now exposes an mtp_supported boolean field per model (CC BY 4.0, free to use).
Benchmark data: what speedup to expect
The numbers below are drawn from the BestLLMfor lab on a single RTX 5090 (32 GB VRAM, 575 W TDP) and a Ryzen 9 9950X host, llama.cpp build b4801, LM Studio 0.3.21. Each run is 200 generations of a 512-token completion from a 1,024-token prompt, temperature 0.0, full GPU offload.
| Model | Baseline (tok/s) | MTP enabled (tok/s) | Speedup | Acceptance rate |
|---|---|---|---|---|
| Qwen3.6-7B-MTP Q4_K_M | 112 | 163 | 1.46x | 78% |
| Qwen3.6-27B-MTP Q4_K_M | 48 | 79 | 1.65x | 74% |
| Gemma 4 9B MTP Q4_K_M | 96 | 201 | 2.09x | 83% |
| Gemma 4 27B MTP Q4_K_M | 41 | 90 | 2.20x | 81% |
| DeepSeek-V3.2-Distill-32B Q4_K_M | 36 | 56 | 1.55x | 70% |
Two patterns are worth noting. First, Gemma 4 outpaces Qwen3.6 on speedup because Google trained three MTP drafter heads with deeper context, which improves acceptance for prose and code alike. Second, the speedup scales with model size — the larger the main forward pass, the more amortization MTP provides per verification step. On a 70B-class model, our preliminary measurements show a 2.4x speedup, but those models do not yet have official MTP weights.
VRAM, latency and the cost picture
MTP is not free. Each prediction head adds parameters that must live in VRAM, and the verification pass keeps a slightly larger KV-cache window. The overhead breaks down roughly as follows:
- Static head weights: +1.5% to +4% of model size per head, typically 8–12% total.
- Runtime draft buffer: ~150 MB fixed, regardless of model size.
- KV-cache expansion: negligible at n_draft ≤ 4, ~5% at n_draft = 8.
For a 12 GB GPU (RTX 4070, RTX 5070), this overhead pushes most 7B Q4 models into a tighter context window. Below 12 GB VRAM, you frequently end up with partial CPU offload, and the MTP gain is wiped out by the host-device transfer penalty. The cost calculator now includes an MTP enabled toggle that reflects this — it adjusts the breakeven point between local and cloud inference based on the effective tokens/sec.
For business users comparing electricity cost against a Claude or GPT-4 API bill, MTP shortens the payback period of a local rig by roughly 25–35% on supported models. See our methodology page for the assumptions we use (US$0.14/kWh, 24/7 utilization, 3-year amortization).
When MTP hurts instead of helping
Three scenarios consistently produce a slowdown:
- Highly stochastic sampling. At temperature ≥1.2 with top_p <0.5, the acceptance rate drops below 40%, and the rejected drafts cost more than they save.
- Very short completions. Anything under 30 tokens (single-line code, classification answers) does not amortize the head allocation. Disable MTP for embedding-style workloads.
- Partial CPU offload. If even one layer spills to CPU, the verification pass becomes the bottleneck and MTP can be 10–20% slower than baseline.
The LM Studio runtime does not auto-disable MTP in these cases — you have to recognize them yourself. Our open-source MCP server exposes a recommend_inference_settings tool that returns the right MTP configuration for a given model and prompt profile, if you want to automate this in an agent.
MTP vs draft-model speculative decoding
LM Studio still supports the classic two-model speculative decoding (Llama 3.1 8B drafting for Llama 3.1 70B, for example). When should you use which?
| Criterion | MTP (self-speculative) | Draft-model speculative |
|---|---|---|
| Setup complexity | One toggle | Two models, vocab alignment |
| VRAM overhead | +8–18% | +20–40% (full draft model) |
| Typical speedup | 1.5x–2.2x | 2.0x–3.5x |
| Quality drift risk | Zero (bit-identical) | Zero (bit-identical) |
| Works on any model | No (MTP heads required) | Yes (with vocab-compatible draft) |
If your model has MTP heads, use them — the setup tax for draft-model speculative is rarely worth the extra 15–30% gain unless you are running a 70B+ model where every percentage of latency matters. For reference implementations beyond LM Studio, the vLLM MTP documentation describes the equivalent server-side configuration for production deployments.
Verdict
MTP in LM Studio is the cleanest inference speedup of 2026. It is invisible at the API layer, costs nothing in output quality, and recovers 50–120% extra throughput on the models that matter — Qwen3.6, Gemma 4, DeepSeek-V3.2. The only friction is that you must download the MTP-specific GGUF and have enough VRAM to absorb the overhead. On a 16 GB+ GPU with a supported model, leave MTP on by default. On 8–12 GB hardware, measure twice: benchmark with and without and trust the numbers, not the marketing.
FAQ
Does enabling MTP change the model's output quality?
No. MTP is a runtime acceleration that verifies drafted tokens against the main model. Accepted tokens are bit-identical to a non-MTP run at the same sampling parameters. If your acceptance threshold is left at 0.0 (the default), the output is mathematically guaranteed to match.
Can I use MTP on a Mac with Apple Silicon?
Yes. LM Studio's Metal backend supports MTP from version 0.3.20. Expect slightly lower speedups (typically 1.3x–1.7x) versus CUDA because the Metal kernels for the verification pass are less optimized as of mid-2026. M3 Max and M4 Pro show the best gains.
What is the difference between n_draft and the number of MTP heads?
The model is trained with a fixed number of MTP heads (e.g. 4 for Qwen3.6-27B). n_draft is the runtime parameter that decides how many of those heads to use per step. You can set n_draft lower than the trained head count to reduce overhead, but never higher.
Does MTP work with vision-language models like Qwen3.6-VL?
Partially. The MTP heads in Qwen3.6-VL only accelerate the text generation phase, not the vision encoder. Expect a 1.3x speedup on captioning workloads versus 1.65x for pure text.
Will Llama 4 support MTP when it ships?
Meta has not confirmed MTP heads in Llama 4 as of June 2026, but research papers from Meta FAIR (including the original 2024 MTP paper) strongly suggest training-time integration. Plan capacity for a +12% VRAM overhead if Llama 4 ships with MTP enabled.
How do I check if a downloaded GGUF actually has MTP heads?
Run llama-gguf-dump model.gguf | grep mtp or check the LM Studio model details panel — supported models display an "MTP: 4 heads" badge. If the badge is absent, the heads were stripped during quantization.