Native MTP speculative decoding (Qwen3.5/3.6 reference implementation) #990
AirRunner wants to merge 29 commits into
Conversation
|
Great work! Would this also be possible for models like GLM5? As in, does each model require its own implementation of MTP, or can we reuse your mtp_generate_step function for other models? Thanks for your work so far! |
Thanks! Yes. The Qwen3.5-specific part is small, so the speculative-decoding logic lives in one place, and adding a new model is just a matter of exposing the right interface. For GLM5 specifically it would certainly be feasible, yeah, but I don't think there is even a glm5.py currently. |
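For readers skimming the thread, the draft/verify round that `mtp_generate_step` implements can be sketched as a toy greedy loop. All names below are stand-ins for illustration, not the PR's actual API:

```python
def verify_step(backbone_argmax, confirmed_tok, draft_tok):
    """Toy greedy verification round.

    `backbone_argmax(tokens)` stands in for the backbone forward pass plus
    argmax: it returns one predicted next-token per input position.
    Feed [confirmed, draft]; if the backbone's prediction after the
    confirmed token matches the draft, emit the draft plus a bonus token.
    """
    preds = backbone_argmax([confirmed_tok, draft_tok])
    target = preds[0]  # what the backbone actually predicts after confirmed_tok
    if target == draft_tok:
        return [draft_tok, preds[1]], True   # accept: two tokens this round
    return [target], False                   # reject: emit correction, roll back caches


# Toy backbone that always predicts "previous token + 1":
fake_backbone = lambda toks: [t + 1 for t in toks]
print(verify_step(fake_backbone, 5, 6))  # ([6, 7], True): draft matched
print(verify_step(fake_backbone, 5, 9))  # ([6], False): draft rejected
```

The real loop additionally snapshots and restores the SSM/conv state on rejection, which is what the cache `rollback_state` changes in this PR are for.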
(force-pushed from fb12ea6 to a66d242)
|
Great work on this! We've been using it on M2 Ultra (128GB) with all three Qwen3.5 sizes and it works well.

MoE fix needed

The PR works out of the box for the dense 27B, but MoE models (35B-A3B, 122B-A10B) fail conversion with "768 parameters not in model". The MTP layer's expert weights use the unfused per-expert format, while the model expects the stacked switch_mlp format. Fix (add to the model's sanitize):

```python
# Stack per-expert MTP weights into switch_mlp format.
mtp_num = getattr(self.language_model.args, "mtp_num_hidden_layers", 0)
num_experts = self.language_model.args.num_experts
for l in range(mtp_num):
    prefix = f"language_model.mtp.layers.{l}.mlp"
    test_key = f"{prefix}.experts.0.gate_proj.weight"
    if test_key in new_weights:
        for n in ["gate_proj", "up_proj", "down_proj"]:
            to_join = [
                new_weights.pop(f"{prefix}.experts.{e}.{n}.weight")
                for e in range(num_experts)
            ]
            new_weights[f"{prefix}.switch_mlp.{n}.weight"] = mx.stack(to_join)
```

One more change is also needed; full fix on our fork: Thump604/mlx-lm@04a4383

Benchmark results (M2 Ultra, greedy)

Pre-converted models with MTP weights: Thump604/Qwen3.5-27B-MLX-8bit, 35B, 122B |
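A self-contained illustration of the stacking transform above, with NumPy standing in for mx and a tiny hypothetical checkpoint (4 experts, per-expert weights of shape (8, 16)):

```python
import numpy as np

num_experts = 4
# Unfused per-expert layout, as found in the raw MTP checkpoint.
weights = {
    f"mlp.experts.{e}.gate_proj.weight": np.full((8, 16), float(e))
    for e in range(num_experts)
}

# Pop each expert's tensor and stack into one (num_experts, 8, 16) array,
# which is the fused switch_mlp layout the model expects.
to_join = [
    weights.pop(f"mlp.experts.{e}.gate_proj.weight") for e in range(num_experts)
]
weights["mlp.switch_mlp.gate_proj.weight"] = np.stack(to_join)

print(weights["mlp.switch_mlp.gate_proj.weight"].shape)  # (4, 8, 16)
```

The pop-then-stack order matters: the unfused keys must be removed so the final weight dict contains only keys the model declares.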
|
@Thump604 Thanks for the report and the fix! I've integrated it in AirRunner/mlx-lm@8d06796 with credit. Also, what acceptance rates did you get with MoE? I'm curious if it's somehow correlated to the speedup. |
|
Thanks for the quick integration! Here are the acceptance rates derived from our benchmarks (M2 Ultra 128GB, greedy/temp=0):
At temp=0.6 (production sampling), 122B drops to 1.05x (~5% acceptance). So yes — it does correlate with architecture. MoE acceptance rates are significantly lower than dense. My hypothesis: the MTP layer contains a full 256-expert MoE routing step (same expert count as the backbone), but with only a single layer of context depth it struggles to predict the correct expert routing. The dense 27B's MTP layer is a standard transformer layer — much simpler prediction task, much higher acceptance. The fp16 27B was actually 0.61x (slower) — bandwidth-saturated, the MTP overhead exceeds the savings. 8-bit quantization is the sweet spot where MTP helps most. |
(force-pushed from 8d06796 to 85583a0)
|
Hey @AirRunner — thanks for integrating the MoE sanitize fix! The PR has merge conflicts with main now though. Would you be able to rebase? Happy to help if needed. Also, any thoughts on tagging a maintainer for review? This has been open since March 13 with zero maintainer engagement. The implementation is solid (8 tests, code review feedback addressed, MoE fix integrated), just needs someone to look at it. |
|
Hey @Goekdeniz-Guelmez, would you be able to take a look when you get a chance? Quick summary: 8 unit tests, code review feedback from @janhilgard and @Thump604, rebased on main. |
|
Subject: Successfully running Qwen3.5-27B locally with workaround

Thanks for this PR! I was able to get Qwen3.5-27B working locally with MLX, but encountered an issue that might help others.

The Bug I Was Addressing

When trying to use the model with a client that passes short model IDs, I encountered an error. The error message was misleading: it suggested an expired token, but the real issue was the config/weight mismatch described below.

Issue Encountered

The model failed to load.

Root Cause

The model's config.json declared an MTP head:

```json
{
  "text_config": {
    "mtp_num_hidden_layers": 1
  }
}
```

However, the actual checkpoint did not contain the corresponding mtp.* weights.

Workaround

Set mtp_num_hidden_layers to 0:

```shell
cat config.json | jq '.text_config.mtp_num_hidden_layers = 0' > config_fixed.json
mv config_fixed.json config.json
```

Other Configuration Notes

For anyone trying this setup:

Suggestion

It might be helpful to add a check/warning when the config declares MTP layers but the checkpoint is missing the weights. This would help users identify config/weight mismatches more quickly and avoid confusing auth error messages. |
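The suggested check is straightforward to sketch. This is a hypothetical helper (function name and key layout assumed from this thread, not from mlx-lm itself):

```python
from typing import List, Optional


def check_mtp_consistency(config: dict, weight_names: List[str]) -> Optional[str]:
    """Return a warning string when config and checkpoint disagree about MTP.

    `config` is the parsed config.json; `weight_names` are the checkpoint's
    tensor names (e.g. from the safetensors index).
    """
    declared = config.get("text_config", {}).get("mtp_num_hidden_layers", 0)
    has_mtp = any(".mtp." in n or n.startswith("mtp.") for n in weight_names)
    if declared > 0 and not has_mtp:
        return (
            f"config declares mtp_num_hidden_layers={declared} but the checkpoint "
            "has no mtp.* weights; set it to 0 or re-convert preserving MTP weights"
        )
    if declared == 0 and has_mtp:
        return "checkpoint contains mtp.* weights but config disables MTP"
    return None
```

Run at load time, this turns the misleading failure into an actionable message before any weights are materialized.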
|
@layer4down thanks for the write-up! You're right: to actually use MTP acceleration, the model needs to be re-quantized including the MTP layers using this branch. As you suggested, I just pushed a fix that raises a clear error when the config declares MTP layers but the checkpoint is missing the weights. |
|
@angeloskath -- this PR has been open 11 days with no maintainer review. AirRunner rebased on 2026-03-21, all conflicts resolved, 8 unit tests passing. We've been running this in production on M2 Ultra 128GB since day one. Qwen3.5-122B-A10B-VLM-MTP-5bit, 24/7 inference serving coding agents. MTP acceptance rates:
MoE acceptance rates are lower because a single MTP layer can't predict expert routing well. Still a net win for the latency-sensitive use case. The MoE sanitize fix (commit 8d06796) is essential for Qwen3.5 MoE models -- without it, 768 MTP parameters are silently missing. We've also published pre-converted VLM+MTP models on HuggingFace that depend on this code path. Would be great to get this reviewed and merged so the community models work out of the box. |
|
Can we @ the reviewer again? It's an important update for Qwen3.5. |
|
@angeloskath @awni — this PR has been open 17 days with no maintainer review or feedback. Multiple community members have asked for review (AirRunner, ourselves, cresseelia). Is there a concern with the approach, scope, or implementation that's blocking review? We're happy to help address any issues — split the PR, rework the API surface, add tests, whatever is needed. We're running this in production on 122B and have validated it across three Qwen3.5 model sizes. The community is actively hitting the config/weight mismatch that AirRunner already fixed in this branch (layer4down's report above). Without this merged, users have to manually patch config.json to use MTP on Qwen3.5 models. If the PR needs changes or a different direction, we'd rather know than wait. Let us know how we can help move this forward. |
|
As you can see, 6 files have been changed/added alongside 700 lines of added code. This is a PR that makes big changes in the codebase itself. Reviewing and (correctly) implementing it will take time; 17 days is not that long. My full-weight fine-tuning PR took multiple weeks to be merged. Just keep it open, keep it updated, and please be patient. Adding completely new features takes a while. |
|
@Goekdeniz-Guelmez — fair point, thanks for the perspective. We appreciate you taking the time to look at it. To make the review easier, we can split this into two smaller PRs: PR 1 — Model architecture (~260 lines): MTPModule, MTPDecoderLayer, SSM rollback support in GatedDeltaNet, MoE weight stacking, cache rollback field. Pure model-side changes, reviewable independently. PR 2 — Generation + tests (~420 lines): Would splitting it this way help with the review process? Happy to do the work if so. @AirRunner — would you be open to splitting the PR this way? |
a358ace to
04da246
Compare
@janhilgard I'm not sure splitting would actually help the review here? The PRs you suggest wouldn't be reviewable in isolation, because the architecture changes only make sense in the context of how the generation loop uses them. (Also, 183 of the 683 added lines are just unit tests.) That said, I'm open to whatever helps; happy to reorganize if it does :). |
|
@angeloskath @awni — this PR has been open 20+ days with no maintainer review. It is the foundation for MTP speculative decoding on Qwen3.5 models, which several of us are using in production. My PR #1085 (probabilistic acceptance, 2.3x throughput on 122B) builds directly on top of it. AirRunner's implementation is solid: 8 tests, 80.6% acceptance on M4 Pro. Is there a concern about scope or approach blocking review? |
|
@Thump604 can you stop pinging people? The more annoying you are the less likely anyone is going to respond. |
|
Great work — I've been running MTP on Qwen3.5 MoE models in production (M3 Ultra, 256 GB) and wanted to share findings that might explain the low MoE acceptance rates.

BF16 MTP weights are critical for MoE acceptance

Your quant_predicate excludes mtp.fc from quantization:

```python
if path.endswith("mtp.fc"):
    return False
```

But the MTP transformer layer (attention, MLP, norms) still gets quantized. We found that quantized MTP weights give near-0% acceptance on MoE models — the quantization error compounds through the expert routing prediction. Fix: exclude ALL MTP weights from quantization:

```python
if "mtp." in path:
    return False
```

Our MoE results with BF16 MTP weights:

vs your MoE benchmarks (quantized MTP weights):

The difference is stark: BF16 MTP weights → 79-85% acceptance, quantized → 5-11%.

Batch auto-skip

Your PR disables MTP when batching. Instead of disabling batching entirely, you could dynamically switch:

```python
if len(active_batch) > 1:
    # Skip MTP, fall back to standard generation
    return _orig_step(input_tokens, cache)
```

This gives the best of both worlds:

Weight extraction

We extract BF16 MTP weights from the original HF model (not the quantized MLX model) with a dedicated script. See vllm-mlx PR #245 for details.

Happy to collaborate on getting BF16 MTP weights into the standard conversion pipeline. |
|
I tested your BF16 MTP finding on our models. Sharing the data since it tells a different story on 5-bit and 8-bit backbones. I extracted fresh BF16 MTP weights from the original HF models (not dequantized from quantized), applied the RMSNorm +1.0 shift, stacked MoE experts into SwitchLinear format, and re-quantized to match the backbone (5-bit gs=64 for 122B, 4-bit gs=64 for 4B). This matches the process you describe in your extraction script. Results (probabilistic acceptance, temp=0.6):
No measurable difference. Re-quantizing MTP from the BF16 source produces the same acceptance as the original quantized weights on these models. I also tested with fully unquantized BF16 MTP (no re-quantization, just raw BF16 + norm shift). This gave 0% acceptance across all models. The BF16 MTP forward pass produces a different logit distribution than the quantized backbone expects. Once I re-quantize to match the backbone, the acceptance rate converges to the same ~46%. Your 79-85% acceptance at 4-bit is significantly higher than what I see. A few questions:
Our acceptance ceiling appears to be ~47% with probabilistic sampling regardless of how the MTP weights are prepared, as long as they match the backbone's quantization. If you are getting 79-85%, there may be a difference in the generation loop or sampling strategy that accounts for the gap. |
Previously _prefill only populated the backbone cache, leaving the MTP KVCache cold at the start of decode. The MTP head was trained with full prefix context, so starting from an empty cache is misaligned with training. Now each prefill chunk passes return_hidden=True and immediately calls mtp_forward(hidden, y[1:n+1], mtp_cache). The hidden tensor is transient: consumed within the same iteration before mx.clear_cache().
|
@JJJYmmm Thanks, you're right! I had left the MTP cache cold at decode start to simplify the initial implementation; it's now fixed. It's worth noting though that in multi-turn usage the MTP cache from the previous turn is already populated, so the cold-start effect would only apply to the very first turn. Also, at temp>0 with probabilistic acceptance, the criterion would partially compensate for cold-cache predictions, so the real-world delta will likely be small.

VRAM considerations

This adds a bit of overhead, but it is modest. For Qwen3.5 it just adds about 4 KB/token of permanent VRAM for the MTP KV cache (4 KV heads × 256 head_dim × BF16), so about 40 MB for a 10K-token prompt. |
generate_step calls mx.clear_cache() every 256 tokens to bound the Metal allocator's free list. Introduce _CACHE_CLEAR_INTERVAL = 256 shared by both generate_step and mtp_generate_step to add the equivalent cache-clearing logic to the MTP decode loop. The block-based counter (ntoks // _CACHE_CLEAR_INTERVAL) handles MTP iterations that could emit multiple tokens at once, where a '% interval == 0' check could skip a boundary.
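The block-based boundary check described in that commit can be sketched as follows (hypothetical helper name; the real code keys it off the running token count in the decode loop):

```python
_CACHE_CLEAR_INTERVAL = 256


def crossed_clear_boundary(prev_ntoks: int, ntoks: int,
                           interval: int = _CACHE_CLEAR_INTERVAL) -> bool:
    """True when the token count crossed an interval boundary.

    Unlike `ntoks % interval == 0`, this still fires when an MTP round
    emits two tokens at once and jumps over the boundary (e.g. 255 -> 257).
    """
    return ntoks // interval > prev_ntoks // interval


print(crossed_clear_boundary(255, 257))  # True: boundary 256 was jumped over
print(crossed_clear_boundary(256, 257))  # False: still in the same block
```

The `% interval == 0` form silently misses the 255 → 257 case, which is exactly the multi-token emission pattern MTP introduces.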
|
@AirRunner [report: configuration, convert command, result without MTP, result with MTP] |
|
@atelepov Your setup looks correct. Are the MTP weights present in the quantized version? It's known that on MoE there is not much gain, especially on the 35B-A3B, where it could even drop 1 or 2%. It might simply be the benchmark length (there might be too much variance at 100 tokens). Could you retry with more tokens? |
|
For M1/M2 you always want "--dtype float16" when quantizing, as bfloat16 goes through a software path I think? I still don't see any improvement at 4-bit though, but 8-bit shows a decent boost. |
|
@s-n-t Interesting findings. So collecting different data points across the comments of this PR, we have:
Important note: The M1 Max result is the outlier. The speedup differences across chips could be better explained by the β+δ framework from my reply to Anionex: Also from what I understand, there is no dual-datapath execution (FP32/BF16 and int4 ops serialized) or dedicated matrix accelerators on M1. M3 introduced both. So the M1 GPU generation likely adds disproportionate compute overhead on the MTP head forward pass (short sequence, more compute-bound than the memory-bound backbone). With p≈0.85 (from α=0.46 via α=p/(1+p)), breakeven is at β+δ = 1+p ≈ 1.85. On M4 Pro β+δ = 1.190, well below that. On M1 Max, the MTP head forward pass likely pushes β+δ closer to 1.85, which would explain the near-zero result. Running this bench script would give a measured β+δ for M1 Max. |
|
Without "--dtype float16" on the same hardware (confirms @atelepov's numbers): |
|
@AirRunner [benchmarks: --dtype float16 (MTP / no MTP) and --dtype default (MTP / no MTP)]
|
@s-n-t Ok then your -22% (bf16) vs -2% (fp16) gap is almost entirely explained by BF16 emulation on M1. M1 has no native BF16 GPU support, so BF16 matmuls fall back to a slower path; for M1 users, converting with --dtype float16 is the way to go. The 8-bit case is much less affected by dtype (+25% vs +27%) because the baseline is slower: the same absolute BF16 overhead is proportionally smaller. With fp16, the residual -2% on 4-bit likely reflects the β+δ compute overhead on M1's GPU architecture (see here). @atelepov Nice, around +10% on M1 Max seems to be the expected range on this architecture. |
|
|
@heykb did MTP manage to accelerate prefill by accident? |
Declare u = mx.random.uniform() immediately before its first use (mx.eval) rather than before the unrelated _step_backbone call.
I found that hw.optional.arm.FEAT_BF16: 1 is supported on the M2 chip. If the M2 chip supports BF16, why does converting to FP16 still result in performance improvements? |
|
Empirical benchmarks (Qwen3.6-27B 4-bit, M4 Pro, temp=0/0.6/1.0) show no measurable impact on MTP acceptance rate when mtp.fc is quantized to 4-bit: acceptance delta is within noise (−0.2 to +0.3 pp), speedup delta within noise (−0.003 to +0.026x). Additionally, keeping mtp.fc in BF16 penalizes M1 users where BF16 has no native GPU support.
|
Following up on these: comment1, comment2, comment3. The original rationale for excluding mtp.fc from quantization was the concern that quantizing it would hurt draft acceptance. However I just tested this empirically on Qwen3.6-27B 4-bit. Two models: one with mtp.fc kept in BF16, one with it quantized to 4-bit.

Results

No measurable degradation at any temperature. Acceptance delta is ±0.3 pp across 16 samples/condition, within noise. Also, as a side effect, keeping mtp.fc in BF16 penalizes M1 users, where BF16 has no native GPU support. Raw logs here. |
The test was checking mtp.fc exclusion, which was removed in c47c1cb after empirical benchmarks.
|
M2 Max @ 30 GPU cores here. Qwen3.6-27B + Q4 + FP16: |
… the ledger gap) Brings the MTP machinery fully current with the PR ml-explore#990 tip, closing the temp>0 sampling-correctness gap that the corrected ledger flagged. Adopted commits not previously in our tree: - 13f157b fix(mtp): use residual sampling on rejection at temp>0 - 6594348 fix(mtp): reduce residual sampling to 1 sync, correct z=0 fallback - 87f1b09 feat(mtp): native sampling params, XTC draw sharing, correct lp_accept - a2f1374 quality: from functools import partial - b1dad14 fix(mtp): prefill MTP cache during prompt prefill - 8a52379 fix(mtp): input_embeddings + logits_processors dimensionality - 32fdaa3 fix(mtp): remove spurious mtp_cache trim on draft rejection - a5a82a9 style(mtp): move u after _step_backbone - c47c1cb qwen3_5: remove mtp.fc exclusion from quant_predicate - 6222938 test(mtp): remove stale quant_predicate test (48e1fca / fae9fa1 net content was already present; cache.py and qwen3_5_moe.py were already byte-identical to tip.) Method: checked out generate.py / qwen3_5.py / qwen3_5_moe.py / cache.py / test_mtp.py at pr/990 tip, then re-applied Patch 3's three disjoint additive generate.py hunks (import os, --json-schema arg, the json_schema logits_processors block in main()). The MTP sampling refactor and Patch 3's CLI plumbing touch non-overlapping regions, so the re-application is exact: `git diff pr/990 -- generate.py` is now precisely Patch 3's 29 insertions; the other MTP files are byte-identical to tip. Subsumes the standalone 68b2cd4 (ffac433 cache-clear, cherry-picked out of order with a 2-var adaptation because 87f1b09 was then absent) and the original Patch 1 squash; the final tree state equals tip regardless of that earlier adaptation. Also reconciled the two ml-explore#990 commits that touch files outside the core MTP set: - mlx_lm/sample_utils.py taken wholesale at tip (unpatched by us; adds make_sampler_chain / native sampling params — the missing import that broke collection until reconciled). 
- mlx_lm/server.py: applied only the two ml-explore#990 hunks we lacked (_xtc_special_tokens helper + native sampling kwargs threaded into the _serve_single MTP generate call). Patches 2/4/6/7/8 regions untouched; the ml-explore#990 server.py delta is now fully reconciled. Test status: tests/test_structured.py + tests/test_mtp.py (now the full tip suite) + tests/test_server.py green (88); full suite 263 passed, only the 3 pre-existing test_tokenizers BPE-whitespace environment flakes (unrelated, untouched module). VALIDATION GATE (per CLAUDE.md "Validation requirements before tagging"): this changes bench-validated Patch 1 behavior at temp>0. Unit suite is green here, but the case-project benchmark harness + MTP-preserving converted weights are NOT on this box, so the mandated re-bench has NOT been run. Do not treat any tag built on this commit as bench-valid until that run completes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The accepted draft token was never processed by the MTP head, causing the cache to drift behind the backbone cache by one entry per accept. After k accepts the MTP head operates on k tokens of missing context. Empirically the impact was negligible though (backbone hidden dominates MTP head conditioning). Fix: extend _step_mtp with an optional cache_commit=(hidden, tok) parameter. When set, the alignment position and the draft position are processed in a single 2-token batched mtp_forward, committing the accepted token to mtp_cache at no extra forward-pass cost.
@deepsweet Interesting to see quite high β+δ on M2 Max given that you have FP16 weights. It's actually consistent with atelepov's M1 Max results (+7-10% with FP16 weights). I believe even with FP16 weights MLX intermediate activations likely remain in BF16, which still goes through the FP32 path in Metal's matrix multiplication on M1/M2. To confirm, could you run bench_mtp_timing.py on your M2 Max? It measures β and δ directly and also includes a synthetic lm_head dispatch benchmark:

```shell
python bench_mtp_timing.py --model <path>          # full β/δ measurement
python bench_mtp_timing.py --model <path> --synth  # lm_head kernel dispatch
```

One thing that also stands out: your baseline SD is ±1.2 tok/s, whereas I get ±0.1. That seems unusual for a baseline; how many runs did you have? Maybe worth a rerun to confirm the β+δ is stable.

(Note: the "46.2% acceptance" shown actually means 85.9% in the A/V convention.) |
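The mapping between the two acceptance conventions is easy to verify numerically (x = A/V, y = A/(V+A); plain conversion helpers, not code from the PR):

```python
def ava_from_av(av: float) -> float:
    """A/(V+A) = x / (1 + x) with x = A/V."""
    return av / (1 + av)


def av_from_ava(ava: float) -> float:
    """Inverse mapping: x = y / (1 - y) with y = A/(V+A)."""
    return ava / (1 - ava)


print(round(av_from_ava(0.462), 3))  # 0.859: the 46.2% <-> 85.9% mapping
```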
|
@AirRunner same Qwen3.6-27B + Q4 + FP16 as above: and
I ran the default "quick" preset without any extra arguments. |
|
@deepsweet Hum ok so your β=1.725 means the verify backbone (L=2) costs 72% more than baseline (L=1), which is pretty high. I also see that your baseline SD is unusually high (±8.9ms when I get ±0.4ms). The synth goes in this direction with 1.35x, but doesn't really explain β=1.725. The full backbone ratio is worse than lm_head alone, meaning MLP and attention layers have an even larger L=2/L=1 penalty... Actually this is the same family of issue I documented in ml-explore/mlx#3553, but there the non-linear step shifted at L=3 on my M4 Pro. On M2 it seems to be at L=2. Looking at the dispatch code, So to sum up, on M1/M2 the MTP verify pass ( |
Summary
Qwen3.5 checkpoints ship with a built-in Multi-Token Prediction head (`mtp_num_hidden_layers: 1` in config) that predicts token t+2 from the backbone hidden state at t and the embedding of token t+1. This PR adds support for using it as a native speculative decoding mechanism. No separate draft model needed, at minimal extra compute (1 extra transformer layer).

Changes
- `mlx_lm/generate.py`: MTP generation loop with draft/verify and probabilistic acceptance, `--mtp` CLI flag
- `mlx_lm/models/cache.py`: `rollback_state` slot for conv/SSM snapshot on draft rejection
- `mlx_lm/sample_utils.py`: `p_draw` parameter added to `apply_xtc` to share the XTC draw across draft and verify
- `mlx_lm/models/qwen3_5.py`: MTP head module, `self.norm` moved to `TextModel` to expose pre-norm hidden states for MTP, `n_confirmed` parameter for SSM rollback, `sanitize`: norm +1 shift now triggered only on raw HF checkpoints (unsanitized conv1d), not on presence of MTP weights, and MoE gate weights at 8-bit group=64
- `mlx_lm/models/qwen3_5_moe.py`: MTP checkpoint sanitization for MoE variants + handling both Qwen3.5 and Qwen3.6 (fused `gate_up_proj`)
- `mlx_lm/server.py`: `--mtp` flag, dynamic MTP/batch switching + fix `xtc_special_tokens` construction
- `tests/test_mtp.py`: 10 unit tests

How it works
Each backbone forward pass returns both logits and pre-norm hidden states. The MTP head fuses `pre_fc_norm_hidden(h_t)` and `pre_fc_norm_embedding(embed(t+1))` via a linear projection, runs one full-attention transformer layer, and produces draft logits through the shared `lm_head`.

The generation loop verifies drafts by feeding `[confirmed_tok, draft_tok]` to the backbone with `n_confirmed=1`. This causes `GatedDeltaNet` to snapshot its conv/SSM state after the confirmed token. On acceptance, both tokens are emitted. On rejection, the SSM state is rolled back to the snapshot and KV caches are trimmed.

Results (Qwen3.6-27B 4-bit, M4 Pro)
Pooled tok/s, 3 runs × 8 prompts, conditions interleaved. See bench_mtp.py.
Acceptance is reported as A/V (drafts accepted / drafts proposed), the standard speculative decoding metric. The benchmark script also reports `A/(V+A)` (~46% for Qwen3.6-27B at temp=0.6), which equals `A/V ÷ (1 + A/V)` and is used in some implementations.

Identity check: greedy MTP output == standard `generate_step` output.

Usage
Checkpoint conversion
This requires a checkpoint converted with MTP weights (the default `sanitize()` previously stripped them). Re-convert from HF with this branch to preserve `mtp.*` weights.

Note on M1/M2: M1 and M2 lack native BF16 GPU support (`MTLDataType.bfloat` requires Apple8+). If you choose not to quantise `mtp.fc` on M1/M2, you need to add the flag `--dtype float16` to the convert command. Without it, MTP may drastically slow down on M1 despite positive acceptance rates.

Questions for reviewers
- `sampler is None` as greedy signal: I use `sampler is None` to distinguish greedy from stochastic and apply exact-match vs probabilistic acceptance accordingly. Is this the right signal, or would you prefer an explicit `greedy: bool` parameter, for instance?
- `self.requests.empty()`: MTP for solo requests and `BatchGenerator` for concurrent ones. Is a best-effort queue check the right approach, or is there a preferred pattern in the server architecture?

Future work
DRY refactor + SamplerConfig
A follow-up PR independent of MTP would address:
- `_prefill` logic: 3 variants across `generate_step`, `speculative_generate_step`, and `mtp_generate_step`
- `_process_and_sample`: almost the same pattern in `speculative_generate_step` and `mtp_generate_step`
- `quantize_cache_fn = functools.partial(...)`: same pattern in all three
- `SamplerConfig`: currently `mtp_generate_step` cannot accept a pre-built `sampler=` callable and produce correct acceptance logprobs simultaneously. A `sampler` today returns only a token, but for MTP the acceptance criterion also needs the log-probability distribution the token was drawn from. The fix is a richer sampler interface that returns `(token, lp_distribution)`, allowing both `generate_step` and `mtp_generate_step` to share the same interface without passing a dozen individual parameters.

Beyond DRY,
`SamplerConfig` unlocks a potential performance gain: sparse residual sampling.

On rejection at `temp > 0`, the current implementation samples from `max(p_target - p_draft, 0) / Z` over the full vocabulary (151K tokens for Qwen3.5, 580 µs/call). With `top_k > 0`, the sampler already computes a top-k partition over the vocabulary, so exposing those indices lets the rejection path work on a K-token support instead. Without a `SamplerConfig`, re-running argpartition specifically for the rejection path is slower than or equal to the full-vocab path.

Batched MTP
This PR brings MTP for the solo request path only.
However, per-sequence selective rollback (restore SSM state + trim KV only for rejected sequences) is already implemented in
AirRunner/mlx-lm · feat/mtp-batched, left out of this PR to keep the diff reviewable.Test plan
Relates to #872 — cc @janhilgard
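The residual-sampling rejection path described under Future work can be sketched with NumPy (stand-in code; the real implementation works on mx arrays inside the decode loop):

```python
import numpy as np


def residual_sample(p_target, p_draft, rng):
    """On draft rejection, sample from max(p_target - p_draft, 0) / Z.

    When Z == 0 (draft and target distributions identical), fall back to
    sampling from the target distribution directly.
    """
    resid = np.maximum(p_target - p_draft, 0.0)
    z = resid.sum()
    if z == 0.0:
        return int(rng.choice(len(p_target), p=p_target))
    return int(rng.choice(len(resid), p=resid / z))


rng = np.random.default_rng(0)
# Only index 1 has positive residual, so it is chosen with probability 1.
print(residual_sample(np.array([0.5, 0.5, 0.0]), np.array([1.0, 0.0, 0.0]), rng))  # 1
```

Sampling from the residual rather than re-sampling the target is what keeps the overall output distribution exactly equal to the target model's.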
Update - probabilistic acceptance and MoE benchmarks
Integrated probabilistic draft acceptance with two cases:
- Greedy (`sampler=None`): exact-match acceptance, mathematically correct for deterministic argmax sampling
- Stochastic (`temp > 0`): accept with probability `min(1, p_target / p_draft)`; recovers greedy acceptance level at any temperature
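The stochastic acceptance criterion is one comparison per draft token. A minimal sketch, where the `p_*` arguments are the probabilities each distribution assigns to the draft token and `u` is a uniform(0,1) draw (names are illustrative):

```python
def accept_draft(p_target_tok: float, p_draft_tok: float, u: float) -> bool:
    """Accept the draft token with probability min(1, p_target / p_draft).

    When the target model likes the token at least as much as the draft
    head did (ratio >= 1), the draft is always kept.
    """
    return u < min(1.0, p_target_tok / p_draft_tok)


print(accept_draft(0.30, 0.10, u=0.999))  # True: ratio 3.0, always accepted
print(accept_draft(0.10, 0.50, u=0.50))   # False: acceptance probability is only 0.2
```

At temp=0 the target distribution is a point mass, so this criterion degenerates to the exact-match check in the greedy case.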
A reproducible benchmark script is available: bench_mtp.py
Qwen3.5-27B 4-bit
Qwen3.5-35B-A3B 4-bit
On M4 Pro MoE speedup is marginal regardless of acceptance rate. MTP benefit scales with baseline decode time, so at 85 tok/s (3B active params) the MTP overhead is proportionally too large to yield meaningful speedup. With probabilistic acceptance, acceptance rates are consistent with the dense model (~85%).
Bandwidth model
The cross-hardware speedup variation is explained by `speedup = (1+p) / (β+δ)`, where `p` is the per-round acceptance probability, `β = T_verify_backbone / T_baseline`, and `δ = T_mtp_head / T_baseline`. Full derivation and per-component bandwidth estimates in this comment.

For reference: