Multi-Token Prediction for MLX: MTPLX Tested on the M3 Max

On Apple Silicon, MLX is today the fastest way to run an LLM locally. A new project called MTPLX promises a good deal more through Multi-Token Prediction; its author cites speedups of up to 2.24×. Sounds good, so let’s check.

What Multi-Token Prediction Does

A classic LLM produces text token by token. Each step needs a full forward pass through the model, and on the Mac this step is the bottleneck for local inference. The hardware is usually not saturated; it waits for the next token.

Speculative decoding tackles exactly this. A small, fast model, the drafter, guesses several tokens at once. The large model, the verifier, checks these proposals in a single forward pass and keeps only the correct ones. The trick is that verification is parallelizable: the large model can check several guessed tokens at the same time and so amortizes its cost.¹ When the guesses are often right, each expensive forward pass yields several finished tokens.
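To make the draft-and-verify loop concrete, here is a minimal toy sketch in Python. It is not MTPLX or mlx_lm code: `target_next` and `draft_next` are hypothetical stand-ins for the large model and the drafter, the "tokens" are single characters, and verification is greedy (keep drafted tokens up to the first mismatch) rather than sampled. All it does is count how many expensive verify passes are needed.

```python
# Toy sketch of draft-and-verify with greedy decoding.
# All names are hypothetical; this is not MTPLX or mlx_lm code.

TARGET_TEXT = list("the quick brown fox jumps")


def target_next(prefix):
    """Stand-in for the large model: the 'correct' next token for a prefix."""
    return TARGET_TEXT[len(prefix)]


def draft_next(prefix):
    """Stand-in for the drafter: usually right, but wrong right after a space."""
    if prefix and prefix[-1] == " ":
        return "?"
    return target_next(prefix)


def generate(depth):
    """Generate TARGET_TEXT and return (tokens, number_of_verify_passes)."""
    out, passes = [], 0
    while len(out) < len(TARGET_TEXT):
        # 1. Draft up to `depth` tokens cheaply.
        draft = []
        while len(draft) < depth and len(out) + len(draft) < len(TARGET_TEXT):
            draft.append(draft_next(out + draft))
        # 2. One verify pass. A real model scores all drafted positions in a
        #    single batched forward pass; the loop below only mimics the result.
        passes += 1
        for token in draft:
            expected = target_next(out)
            out.append(expected)       # the target's own token is always kept
            if token != expected:      # first mismatch ends this round
                break
        else:
            # Every draft was accepted (or depth is 0): the same pass also
            # yields one further token from the target.
            if len(out) < len(TARGET_TEXT):
                out.append(target_next(out))
    return out, passes


for depth in (0, 3):
    tokens, passes = generate(depth)
    print(f"depth {depth}: {len(tokens)} tokens in {passes} verify passes")
```

With depth 0 the loop degenerates to plain token-by-token decoding and needs 25 passes for the 25 characters. With depth 3 and a drafter that stumbles after every space, it gets by with 8.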

The decisive point is that quality does not suffer. A procedure called rejection sampling ensures that the accepted tokens follow exactly the distribution of the large model. The result is the same one that would have come out without speculation, only faster.¹
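The accept-or-resample rule behind that guarantee fits in a few lines. The following is a sketch of the standard single-token rule from the speculative decoding literature, not code taken from MTPLX. The distributions `P` (verifier) and `Q` (drafter) are made-up toy values, and the function names are my own.

```python
import random


def speculative_step(p, q, rng):
    """Draft one token from q, then accept or resample so the result follows p."""
    tokens = list(p)
    drafted = rng.choices(tokens, weights=[q[t] for t in tokens])[0]
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted
    # On rejection, resample from the leftover mass max(0, p - q).
    # random.choices normalizes the weights itself.
    residual = [max(0.0, p[t] - q[t]) for t in tokens]
    return rng.choices(tokens, weights=residual)[0]


def exact_output_distribution(p, q):
    """Probability of each token coming out of speculative_step, computed exactly."""
    tokens = list(p)
    # q(x) * min(1, p(x) / q(x)) simplifies to min(p(x), q(x)).
    accepted = {t: min(p[t], q[t]) for t in tokens}
    rejected_mass = 1.0 - sum(accepted.values())
    residual = {t: max(0.0, p[t] - q[t]) for t in tokens}
    norm = sum(residual.values())
    return {
        t: accepted[t] + (rejected_mass * residual[t] / norm if norm > 0 else 0.0)
        for t in tokens
    }


# Toy distributions over three tokens (illustrative values only).
P = {"a": 0.6, "b": 0.3, "c": 0.1}   # large model
Q = {"a": 0.3, "b": 0.5, "c": 0.2}   # drafter

print(exact_output_distribution(P, Q))
print(speculative_step(P, Q, random.Random(0)))
```

The first printed dictionary should reproduce `P` up to floating-point rounding, even though every proposal came from `Q`. A worse drafter only lowers the acceptance rate; it does not change what comes out.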

Multi-Token Prediction (MTP) is a special variant of this. Instead of a separate drafter model, MTP uses additional prediction heads that sit directly inside the main model. Modern models like Qwen 3.5 and 3.6 already ship with such MTP heads. So the model drafts several tokens ahead with its own heads and verifies them itself.

MTPLX: MTP Heads Instead of an External Drafter

This is exactly where MTPLX comes in. It is an MLX-native runtime for Apple Silicon, not a wrapper around other tools, and it provides an OpenAI- and Anthropic-compatible server.² The difference from ordinary speculative decoding is stated plainly in its own documentation: “Not an external-drafter system. The drafter is the target model’s own MTP heads.”²

In practice that means: no second model, no reconciliation of two weight sets, but a single model that proposes tokens to itself. The acceptance of proposals runs through exact rejection sampling following the theorem of Leviathan and Chen, so the output distribution is preserved at any temperature.²

On first start, MTPLX measures the machine and searches for the fastest speculation depth for this particular Mac. Installation is via Homebrew, via pip, or through a DMG with guided onboarding.²

brew install youssofal/mtplx/mtplx
# or
python3 -m pip install mtplx

The models have to bring built-in MTP heads. The catalog includes, among others, Qwen 3.6 (27B and 35B MoE) as well as Gemma 4, each in speed, balance, and quality builds.²

The Test on the M3 Max

For a meaningful comparison, what matters is running the same model once with and once without MTP. Otherwise you compare models, not methods. MTPLX makes this easy: the mtplx ask command accepts the flags --no-mtp and --mtp, plus --stats to report the measured token rates.

The setup:


Hardware	Apple M3 Max, 64 GB unified memory
OS	macOS 26.3.1
Model	Qwen3.6-27B, MTPLX speed build (4bit)
Parameters	temp 0.6, top_p 0.95, fixed seed
Baseline reference	`mlx_lm` on Qwen3.6-27B-4bit

One detail up front that dampens expectations. MTPLX itself reports on this M3 Max:

M5 TensorOps eligible: false
hardware acceleration confirmed: false

The 2.24× figure the maker cites comes from an M5 Max with its special tensor acceleration. On an M3 Max this path is not available. A more realistic reference point is the roughly 23 percent that the YouTuber Joe Maddalone measured in his video.

The actual A/B is a two-liner. Same model, same prompt, once with MTP off, once on:

mtplx ask --model <path> --no-mtp --max-tokens 256 --temperature 0.6 --top-p 0.95 --seed 0 --stats
mtplx ask --model <path> --mtp    --max-tokens 256 --temperature 0.6 --top-p 0.95 --seed 0 --stats

Results

All runs with the same technical prompt, 256 generated tokens, temp 0.6, top_p 0.95. MTPLX automatically chose speculation depth 3 for this machine.

Run	Mode	tok/s
mlx_lm, standard build	autoregressive	22.17
MTPLX, `--no-mtp`	autoregressive	19.86
MTPLX, `--mtp` (run 1)	MTP, depth 3	27.48
MTPLX, `--mtp` (run 2)	MTP, depth 3	29.32

Two numbers stand out. The pure MTP effect, that is, the same model with and without MTP, is around 40 percent: from 19.86 to between 27.48 and 29.32 tok/s, a gain of 38 to 48 percent. Against my actual everyday setup with mlx_lm, the net gain is 24 to 32 percent, because MTPLX in pure autoregressive mode is about 10 percent slower than mlx_lm (19.86 versus 22.17 tok/s).
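The percentages follow directly from the table. A short Python snippet makes the arithmetic reproducible; the dictionary keys are labels I made up for the four rows, and the rates are the measured values from above.

```python
# Measured token rates from the results table (tok/s).
RATES = {
    "mlx_lm": 22.17,
    "mtplx_no_mtp": 19.86,
    "mtplx_mtp_run1": 27.48,
    "mtplx_mtp_run2": 29.32,
}


def gain_percent(new, base):
    """Relative gain of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


for run in ("mtplx_mtp_run1", "mtplx_mtp_run2"):
    vs_no_mtp = gain_percent(RATES[run], RATES["mtplx_no_mtp"])
    vs_mlx_lm = gain_percent(RATES[run], RATES["mlx_lm"])
    print(f"{run}: {vs_no_mtp:+.1f}% vs --no-mtp, {vs_mlx_lm:+.1f}% vs mlx_lm")

# How much slower MTPLX is than mlx_lm when both decode autoregressively.
ar_gap = gain_percent(RATES["mtplx_no_mtp"], RATES["mlx_lm"])
print(f"MTPLX --no-mtp vs mlx_lm: {ar_gap:+.1f}%")
```

Two runs per mode are a thin basis, so the spread between run 1 and run 2 is better read as a rough range than as a confidence interval.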

The outputs were coherent and equivalent in content in both modes. That is expected, not coincidental: exact rejection sampling guarantees that MTP delivers the same distribution as the model without MTP. More speed, same quality.

Standard decoding needs four expensive forward passes for four tokens. With MTP, the model’s own heads draft several tokens, a single verify pass checks them, and rejection sampling accepts only what matches the model’s distribution.

Conclusion

Three things remain to note.

First: MTP works, even on an M3 Max without the M5’s tensor acceleration. Around 40 percent more token speed without a loss of quality is clearly noticeable in a local coding workflow.

Second: the maker’s 2.24× value is not invented, but tied to the M5 hardware. Anyone transferring it to an M3 or M4 will be disappointed. A realistic expectation there is a third to almost half more, depending on the workload.

Third: the gain depends on the acceptance rate. Structured, predictable text such as code or technical explanations is predicted well, so the drafted tokens are accepted often. With high-entropy, creative text, acceptance drops and with it the advantage. My test prompt was deliberately technical, so the measured value marks the upper end rather than the average.
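How strongly the acceptance rate matters can be estimated with the usual back-of-the-envelope model from the speculative decoding literature. The sketch below assumes that each drafted token is accepted independently with the same probability `alpha`, which is a simplification, and the alpha values are illustrative, not measured. With draft depth k, one verify pass then yields 1 + alpha + alpha² + … + alpha^k tokens on average.

```python
def expected_tokens_per_pass(alpha, depth):
    """Average tokens produced per verify pass.

    Assumes every drafted token is accepted independently with probability
    `alpha` (a simplification). The sum 1 + alpha + ... + alpha**depth equals
    (1 - alpha**(depth + 1)) / (1 - alpha) for alpha < 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if depth < 0:
        raise ValueError("depth must not be negative")
    return sum(alpha ** i for i in range(depth + 1))


# Depth 3 is what MTPLX chose on this machine; the alphas are made up.
for alpha in (0.3, 0.5, 0.8):
    tokens = expected_tokens_per_pass(alpha, depth=3)
    print(f"alpha={alpha:.1f}: {tokens:.3f} tokens per verify pass")
```

This counts tokens per pass, not wall-clock speedup: drafting and the wider verify pass cost time too, so the measured gain stays below these numbers.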

Bottom line, MTPLX delivers what matters at its core: faster local inference without a loss of quality. Only the headline number needs qualifying. On this M3 Max it is a solid 40 percent or so, not 124.

As of June 16, 2026. I measured the performance figures myself on an Apple M3 Max (64 GB). For maker and technical references, see the footnotes.

¹ Foundations of speculative decoding, parallel verification, and preservation of the output distribution via rejection sampling: “Speculative decoding”, Wikipedia, and the original work by Leviathan et al. (2023). https://en.wikipedia.org/wiki/Speculative_decoding
² MTPLX, “Run local LLMs twice as fast on your Mac”, project site and repository (youssofal/MTPLX), on MTP heads, exact rejection sampling, installation, and the model catalog. https://mtplx.com/