From cc52a978be68a41dd0f8bfcea7b6d96b5617f975 Mon Sep 17 00:00:00 2001
From: John Rocky
Date: Wed, 15 Apr 2026 18:39:28 +0900
Subject: [PATCH] docs: retract 43 tok/s projection post-D1b failure; 32 tok/s is the current ANE decode ceiling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PR #78 reframed the value prop around a triad of ~1 W power, ~1 s TTFT,
and ~43 tok/s decode (projected via PR #77's compute-unit-split spike).
PR #79 (open) implemented the full 2-stage pipeline that projection
required and measured a 24 % regression across all 4 prompt categories,
with a bit-exact failure on summary @ token 50 from fp16 rounding
between ANE and GPU backends of chunk 3.

Root cause: the Gemma-4 chunk graph has a strict c3 → c4 data dep (c4
consumes c3's hidden_states_out). The only within-step overlap window
is a ~1 µs Swift dict-build against ~16 ms GPU c3; the cross-step
pipeline is blocked by the symmetric token-feedback edge. No
non-speculative decode overlap is available on the current graph;
PR #79's three future options all require conversion/-side work.

This commit retracts the ~43 tok/s projection on main and propagates
the consequence: 32 tok/s is the measured ANE decode ceiling, item 27
(GPU prefill / TTFT) is now the single critical-path decode-adjacent
lever, and the gap vs LiteRT-LM on decode widens from 20 % to 42 %. The
UX argument (~1 W, ~1 s TTFT, GPU-free host envelope) carries the
pitch, not decode parity.

Touched:
- MOBILE_2K_COMPETITIVE_PLAN.md: retraction callout, triad update,
  competitive table honesty, Projection basis rewrite, D1b removed from
  execution table (B is now the only item).
- PHASE_B_DECISION.md §"What this means for the go-forward target":
  D1b status flipped to REGRESSED with structural cause, item 27
  elevated to sole decode-adjacent lever.
- PRIORITY_ROADMAP.md item 27: footnote marking it as the single
  critical-path decode-adjacent item after D1b invalidation.
- HANDOFF.md: read-order includes the D1b failure doc; opening prompt
  retracts 43 tok/s, next-session starts are item 27 OR one of PR #79's
  three conversion/-side options.

Preserves history — callouts cite PR #79 / commit 7c21c7b rather than
rewriting the prior reasoning chain. Total net-added prose ≈ 99 lines
across 4 docs. Docs only.
---
 docs/HANDOFF.md                    |  82 +++++++++++-------
 docs/MOBILE_2K_COMPETITIVE_PLAN.md | 135 ++++++++++++++++++++---------
 docs/PHASE_B_DECISION.md           |  38 +++++---
 docs/PRIORITY_ROADMAP.md           |  10 +++
 4 files changed, 182 insertions(+), 83 deletions(-)

diff --git a/docs/HANDOFF.md b/docs/HANDOFF.md
index 7cf4558..b95f288 100644
--- a/docs/HANDOFF.md
+++ b/docs/HANDOFF.md
@@ -1,23 +1,29 @@
 # Next-session handoff
 
-**Last updated:** 2026-04-15 post value-prop reframe. The "beat
-LiteRT-LM at 56 tok/s" target is **retired**; go-forward value prop
-is ANE-native (power + TTFT + ~43 tok/s ceiling). See
-`docs/MOBILE_2K_COMPETITIVE_PLAN.md`.
+**Last updated:** 2026-04-15 late (post D1b-impl retraction). The
+"beat LiteRT-LM at 56 tok/s" target is **retired**; go-forward
+value prop is ANE-native (power + TTFT + **measured 32 tok/s
+decode ceiling**). The earlier ~43 tok/s decode projection was
+retracted after PR #79 implemented the required D1b pipeline and
+measured a 24 % regression. See `docs/MOBILE_2K_COMPETITIVE_PLAN.md`
+§"Retraction callout".
 
 ## Read this first
 
 To resume cleanly, the next session should:
 
 1. Open `docs/MOBILE_2K_COMPETITIVE_PLAN.md` — **authoritative
-   value prop**. ANE-native triad: power, TTFT, tok/s ceiling. 56
+   value prop**. ANE-native triad: ~1 W power, ~1 s TTFT
+   (projected, item 27), **32 tok/s decode ceiling (measured)**. 56
    tok/s parity is no longer a goal; we compete on a different axis.
 2. Open this file (`docs/HANDOFF.md`) — takes 5 minutes.
 3. Read `docs/PHASE_B_DECISION.md` §"What this means for the
    go-forward target" — why speculative decoding is off the critical
-   path and the two tractable items that replaced it (D1b + item 27).
-4. Read `docs/BASELINE_SPEED_AUDIT.md` — per-chunk costs that ground
-   the ~43 tok/s pipelining ceiling.
+   path, why D1b pipelining was retracted (PR #79), and why item 27
+   is now the *single* tractable decode-adjacent lever.
+4. Read `docs/BASELINE_SPEED_AUDIT.md` — per-chunk costs (c1=5.9,
+   c2=6.8, c3=8.1, c4=10.4 ms on iPhone 17 Pro); read alongside the
+   retraction to understand why the `c3 → c4` dep kills overlap.
 5. Skim `docs/SESSION_STATE.md` for exact PR / branch / task state.
 6. Historical context (read only if needed):
    - `docs/PHASE_C_TIGHTENING_FINDINGS.md` — why verify-chunk redesign
@@ -26,6 +32,10 @@ To resume cleanly, the next session should:
      `PHASE_B_LIVE_ACCEPT_RATE_GAP.md` — speculative-decode refutation
      chain.
    - `docs/PHASE_A5_DECISION.md` — superseded.
+   - PR #79 / `feat/chunk-pipelining-d1b` — D1b full-impl negative
+     result (the retraction evidence); see `docs/PHASE_D_PIPELINING_IMPL.md`
+     on that branch for the three future options (decoupled c4,
+     speculative h3, model re-chunking) — all require `conversion/`.
 
 No need to touch MTP Path A, PR #17, PR #33, or further Union /
 speculative wiring — all off the critical path with reasons in
@@ -39,37 +49,49 @@ model walks in with correct framing:
 > Value-prop reframed 2026-04-15. **56 tok/s parity with LiteRT-LM
 > is no longer a goal.** LiteRT-LM is Metal GPU at 3–5 W; this repo
 > is ANE-native at ~1 W. Different competitive axis. Current target
-> triad: (1) sustained power ~1 W, (2) TTFT ~1 s (conditional on
-> item 27), (3) decode tok/s ~43 (conditional on D1b). All three
-> projections are documented in docs/MOBILE_2K_COMPETITIVE_PLAN.md.
+> triad: (1) sustained power ~1 W, (2) TTFT ~1 s (projected,
+> conditional on item 27), (3) **decode 32 tok/s (measured ANE
+> ceiling)**. The earlier ~43 tok/s decode projection has been
+> **retracted** — PR #79 implemented the D1b 2-stage pipeline that
+> projection required and measured a 24 % regression across all 4
+> categories; the `c3 → c4` data dep leaves no within-step overlap
+> window. See docs/MOBILE_2K_COMPETITIVE_PLAN.md §"Retraction
+> callout" and docs/PHASE_B_DECISION.md §"What this means for the
+> go-forward target".
 >
-> Next-session focus: **ship D1b full impl + scope item 27 (GPU
-> prefill via MLX-Swift)**. Both are Swift-side, both preserve
-> ANE-resident decode. Neither depends on speculative decoding.
+> Next-session focus: **pick one of the following.** Both are
+> independent of speculative decoding.
+>
+> - **Item 27 (GPU prefill via MLX-Swift).** Sole remaining tractable
+>   non-speculative decode-adjacent lever after D1b was invalidated.
+>   Targets TTFT, not decode rate. MLX-Swift prefill path, 7–10 days.
+>   Decode stays on ANE; GPU is used only during the 512-token
+>   prefill. Exit criterion for the scoping pass: a one-page design
+>   that identifies the Metal↔CoreML handoff cost and the ~1 s TTFT
+>   feasibility gate.
+> - **One of PR #79's three future options** (all require `conversion/`
+>   work — not runtime-only): decoupled c4 (c4 takes h2 instead of h3,
+>   recomputing layers 25–34 independently), speculative h3
+>   (predict-and-verify on the hidden axis), or model re-chunking so
+>   c3/c4 can compute independent residual sub-streams. See
+>   `docs/PHASE_D_PIPELINING_IMPL.md` on branch `feat/chunk-pipelining-d1b`.
 >
 > Speculative decoding is off the critical path. Phase B closed;
 > verify-protocol redesign (C0 option b) is multi-week and is not
-> required for the ANE-native value prop. Do not re-litigate.
+> required for the ANE-native value prop. Do not re-litigate. D1b
+> runtime pipelining is also off the table — do not revisit without
+> first doing one of the three `conversion/` fixes above.
 >
 > Read (in order):
-> 1. docs/MOBILE_2K_COMPETITIVE_PLAN.md — the value prop.
+> 1. docs/MOBILE_2K_COMPETITIVE_PLAN.md — the value prop (with
+>    retraction callout).
 > 2. docs/PHASE_B_DECISION.md §"What this means for the go-forward
->    target" — why speculation is parked.
+>    target" — why speculation is parked and why D1b was retracted.
 > 3. docs/BASELINE_SPEED_AUDIT.md — per-chunk costs; grounds the
->    43 tok/s ceiling.
+>    32 tok/s ceiling and shows why `c3 → c4` is load-bearing.
 > 4. docs/HANDOFF.md (this file) + docs/SESSION_STATE.md.
->
-> Start options:
-> - **Ship D1b pipelining full impl.** PR #77 validated the ANE+GPU
->   compute-unit split (overlap 0.87–0.99). Branch
->   `feat/chunk-pipelining-d1b` is in flight. Deliver the 4-stage
->   pipeline + iPhone measurement; exit criterion is ≥ 40 tok/s at
->   2K on iPhone 17 Pro.
-> - **Scope item 27 (GPU prefill).** MLX-Swift prefill path, 7–10
->   days. Decode stays on ANE; GPU is used only during the 512-token
->   prefill. Exit criterion for the scoping pass: a one-page design
->   that identifies the Metal↔CoreML handoff cost and the ~1 s TTFT
->   feasibility gate.
+> 5. PR #79 / `docs/PHASE_D_PIPELINING_IMPL.md` (on the branch) —
+>    retraction evidence + three future options.
 >
 > Docs auto-merge; bench/harness code auto-merges too.
 
diff --git a/docs/MOBILE_2K_COMPETITIVE_PLAN.md b/docs/MOBILE_2K_COMPETITIVE_PLAN.md
index 6235489..c1e88ef 100644
--- a/docs/MOBILE_2K_COMPETITIVE_PLAN.md
+++ b/docs/MOBILE_2K_COMPETITIVE_PLAN.md
@@ -1,7 +1,24 @@
 # Mobile 2K competitive plan — ANE-native value prop
 
-**Status:** 2026-04-15. Supersedes the previous "beat LiteRT-LM at 56
-tok/s" framing. See §"What changed" for the history.
+**Status:** 2026-04-15 late (post D1b-impl retraction). Supersedes
+the previous "beat LiteRT-LM at 56 tok/s" framing. See §"What
+changed" for the history.
+
+> **Retraction callout (2026-04-15 late).** An earlier revision of
+> this doc claimed a **~43 tok/s projected decode ceiling** grounded
+> in PR #77's compute-unit-split spike. **That projection is
+> withdrawn.** PR #79 implemented the full 2-stage pipeline it
+> required and measured a **24 % regression** across all 4 prompt
+> categories (baseline 32.8–33.2 → pipelined 24.9–25.5 tok/s on Mac
+> Studio), with a bit-exact failure on `summary` at token 50 due to
+> fp16 rounding between ANE and GPU backends of chunk 3. The
+> structural blocker is a strict `c3 → c4` data dependency: c4
+> consumes c3's `hidden_states_out`, so the only within-step overlap
+> window is a ~1 µs Swift dict-build against a ~16 ms GPU c3 — pure
+> regression. The numerical claim collapses to the **measured
+> 32 tok/s ANE decode ceiling**. Full analysis in PR #79 /
+> `docs/PHASE_D_PIPELINING_IMPL.md` (branch `feat/chunk-pipelining-d1b`,
+> HEAD 7c21c7b). Value prop below is updated accordingly.
 
 **Goal:** ship the strongest *ANE-native* Gemma 4 E2B runtime on
 iPhone at ctx=2048. The competitive axis is **not** raw decode tok/s —
@@ -12,13 +29,15 @@ power draw, TTFT, and decode tok/s ceiling under ANE constraints.**
 
 ## Value proposition (one-liner)
 
-> ANE-native LLM runtime for Apple Silicon. Target **~43 tok/s at
-> ~1 W sustained**, with **~1 s TTFT** on 512-token prompts — different
+> ANE-native LLM runtime for Apple Silicon. Sustained **~1 W** with
+> **~1 s TTFT** (projected via item 27) on 512-token prompts and a
+> **32 tok/s decode ceiling** (measured, current) — different
 > competitive axis from LiteRT-LM's 56 tok/s at 3–5 W.
 
-Both the 43 tok/s figure and the 1 s TTFT are **projections, not
-shipped numbers**. See §"Projection basis" for what grounds them and
-what has to ship to realise each.
+The 1 s TTFT is a **projection** conditional on item 27 shipping
+(see §"Projection basis"). The 32 tok/s decode figure is the
+**measured ANE ceiling on iPhone 17 Pro / Mac Studio** after PR #79
+retired the 43 tok/s D1b pipelining projection.
 
 ---
 
@@ -27,9 +46,9 @@ what has to ship to realise each.
 Gemma 4 E2B on iPhone 17 Pro, ctx=2048. Power numbers are rough order-
 of-magnitude, not calibrated measurements.
 
-| Axis | LiteRT-LM iOS (Metal GPU) | This repo (ANE, shipped today) | This repo (after D1b + item 27, projected) |
+| Axis | LiteRT-LM iOS (Metal GPU) | This repo (ANE, shipped today) | This repo (after item 27, projected) |
 |---|---:|---:|---:|
-| Decode tok/s @ 2K | **56** | 31 | **~43** |
+| Decode tok/s @ 2K | **56** | **32** (measured ceiling) | **32** (no non-speculative decode lever left) |
 | TTFT @ 512 prompt | ~1–2 s (Metal prefill) | ~13 s (ANE prefill) | **~1 s** (GPU prefill) |
 | Sustained power draw | 3–5 W (GPU active) | **~1 W** (ANE) | ~1 W + brief GPU prefill spikes |
 | Battery life @ continuous decode | baseline | **~3× baseline** | ~3× baseline |
@@ -37,9 +56,18 @@ of-magnitude, not calibrated measurements.
 | Background / always-on friendly | no (thermal + power) | **yes** | yes |
 | Model-placement footprint | fp16 GPU weights | INT4 ANE weights | INT4 ANE weights |
 
-On tok/s-for-tok/s we are **~20 % slower than LiteRT-LM** even if
-everything on the right-hand column ships. We do not claim to match
-them on decode rate. We claim to win a different envelope.
+On tok/s-for-tok/s we are **~42 % slower than LiteRT-LM** on decode
+(32 vs 56). We do not claim parity, and we do not claim to close
+more than a marginal fraction of that gap on the current chunk
+graph — PR #79 showed the non-speculative decode-overlap lever is
+structurally unavailable. We claim to win a different envelope.
+
+The UX argument carries on that asymmetry: **32 tok/s is ~6× human
+read speed (~5 tok/s)**, so decode is already faster than the user
+can follow. Where LiteRT-LM wins 56-vs-32 is a throughput regime the
+typical chat / assistant UX doesn't consume. The product wedge is
+sustained ~1 W, ~1 s TTFT, and the GPU-free host envelope — not
+decode parity.
 
 ---
 
@@ -70,22 +98,37 @@ always-on notetaker, on-the-go with limited charge).
 
 ## Projection basis
 
-### 43 tok/s ceiling (decode)
-
-Comes from the Phase D compute-unit split spike (PR #77). With
-chunk 3/4 placed on GPU and chunks 1/2 on ANE, the two compute units
-overlap at factor 0.87–0.99 across drivers — the critical-path
-estimate drops from 51.7 ms per step (fully serial, ~19 tok/s on Mac
-Studio audit) to ~23 ms per step at full pipelining, which on iPhone
-translates to ~43 tok/s from the measured 31 tok/s baseline. This
-**requires shipping D1b chunk pipelining** (in flight on
-`feat/chunk-pipelining-d1b`). Until D1b merges and measures on an
-iPhone 17 Pro, 43 tok/s remains a projection.
-
-What ANE-chunk pipelining *cannot* do: PR #75 showed the ANE driver
-serialises all submissions at the driver level, so adding more ANE
-parallelism on the same compute unit does not help. The win is only
-available by splitting across ANE + GPU.
+### 32 tok/s decode ceiling (measured, current)
+
+Mac Studio 128-token decode with drafters OFF (PR #79, 2026-04-15):
+chat 32.80, code 33.24, qa 33.15, summary 33.02 tok/s. iPhone 17 Pro
+measures 31.4 tok/s at 2 K under the same defaults. This is the
+**current ANE decode ceiling** on the shipped chunk graph.
+
+#### Why the earlier 43 tok/s projection was retracted
+
+The 43 tok/s figure came from PR #77's compute-unit-split spike:
+`max(c1+c2+c4_ANE, c1+c3_GPU)` ≈ 23 ms/step, grounded in a measured
+0.87–0.99 kernel-overlap factor between ANE and GPU driver queues.
+The projection assumed c3 and c4 could run in parallel. **PR #79
+implemented the full 2-stage pipeline and refuted that assumption
+empirically:**
+
+- Every category regressed by ~24 % (see retraction callout at top).
+- Root cause: c4 takes c3's `hidden_states_out` as input — a strict
+  data dependency. The only overlap window between c3 and c4 within
+  a step is the ~1 µs Swift dict-build, negligible against the
+  ~16 ms GPU c3. The cross-step pipeline (c3 of step N+1 concurrent
+  with c3 of step N) is similarly blocked because c3 of step N+1
+  needs c4 of step N to emit the just-decoded token.
+- PR #75 independently showed that adding ANE-only parallelism does
+  not help because the ANE driver serialises all submissions.
+
+Conclusion: on the current CoreML chunk graph there is **no
+non-speculative decode overlap available**. Three future options
+documented in PR #79's `docs/PHASE_D_PIPELINING_IMPL.md`
+(decoupled c4, speculative h3, model re-chunking) all require
+`conversion/`-side work — none are runtime-only.
 
 ### 1 s TTFT (prefill)
 
@@ -138,32 +181,42 @@ The previous version of this doc set **"70–110 tok/s at 2K, i.e.
   it would require pivoting this repo to MLX-Swift / GPU decode, which
   is an explicitly rejected direction (the repo's reason for existing
   is ANE-native placement).
-- Under ANE constraints, the decode ceiling is ~43 tok/s (PR #77
-  split + D1b pipelining). That is ~77 % of LiteRT-LM's throughput.
+- Under ANE constraints, the measured decode ceiling is **32 tok/s**.
+  An earlier revision of this doc projected ~43 tok/s via a PR #77
+  compute-unit split and D1b pipelining; PR #79's full
+  implementation empirically refuted that (see retraction callout
+  and §"Projection basis" for the mechanism). 32 tok/s is ~58 % of
+  LiteRT-LM's decode rate — a ~42 % gap, not ~20 %.
 - The speculative-decoding route that was supposed to close the gap is
   blocked at the verify-chunk write-through layer (see above).
 - Reframing on power + TTFT + tok/s-ceiling is **honest about what
   ANE can deliver** and **identifies where we actually win**: a
   different user segment (background-friendly, battery-sensitive,
-  GPU-contended host apps).
+  GPU-contended host apps). The decode-rate gap is real; the UX
+  argument has to carry the pitch, not decode parity.
 
 We do not claim parity on decode rate. We claim a better product on a
 different axis triad.
 
 ---
 
-## Execution — two tractable paths forward
+## Execution — single tractable decode-adjacent path forward
 
-Neither depends on the speculative-decoding work.
+Only one non-speculative decode-adjacent item remains after the D1b
+refutation: **item 27 (GPU prefill)**, which targets the TTFT axis
+rather than the decode axis. The decode axis is parked at the
+measured 32 tok/s ceiling pending either a model re-conversion
+(PR #79's three future options) or the multi-week verify-protocol
+redesign (C0 option b).
 
 | # | Item | Status | Axis unlocked |
 |---|---|---|---|
-| **A** | **D1b chunk pipelining** — overlap chunk N+1 step *t* with chunk N step *t-1* across ANE+GPU split (PR #77 validated the split; D1b implements the full 4-stage pipeline). | in flight on `feat/chunk-pipelining-d1b` | decode tok/s 31 → ~43 (projection) |
-| **B** | **Item 27 GPU prefill via MLX-Swift** — offload the 512-token prefill batch to GPU tensor cores. | not started; elevated to **Phase C critical path** (was "stretch"). | TTFT 13 s → ~1 s (projection) |
+| **A** | ~~**D1b chunk pipelining**~~ — **RETRACTED 2026-04-15** by PR #79 full-impl measurement (−24 % on all 4 categories). Structural blocker: `c3 → c4` data dependency. Three future options (decoupled c4 / speculative h3 / model re-chunking) require `conversion/` work. | refuted; plumbing kept OFF-by-default on `feat/chunk-pipelining-d1b` | — |
+| **B** | **Item 27 GPU prefill via MLX-Swift** — offload the 512-token prefill batch to GPU tensor cores. | not started; elevated to **Phase C critical path** (was "stretch"). Sole remaining tractable decode-adjacent lever. | TTFT 13 s → ~1 s (projection) |
 
-Both are Swift-side work; no chunk reconversion required. Both preserve
-the ANE-resident decode story (decode stays on ANE; GPU is used for
-prefill only in B, and only for chunks 3/4 in A).
+Item B is Swift-side work; no chunk reconversion required. It
+preserves the ANE-resident decode story (decode stays on ANE; GPU
+is used for prefill only).
 
 ### Explicitly out of scope
 
@@ -182,10 +235,10 @@ prefill only in B, and only for chunks 3/4 in A).
 
 | Risk | Mitigation / note |
 |---|---|
-| D1b pipelining hits an iOS-side dispatch quirk | Falls back to serial execution; decode stays at 31 tok/s baseline. We don't regress. |
-| GPU prefill (item 27) surfaces a Metal ↔ CoreML handoff cost that eats the TTFT win | First-pass measurement is cheap; if the handoff dominates we ship only D1b and revise TTFT claim. |
+| ~~D1b pipelining hits an iOS-side dispatch quirk~~ | **N/A — D1b retracted 2026-04-15 by PR #79 full-impl measurement; structural, not an iOS quirk.** |
+| GPU prefill (item 27) surfaces a Metal ↔ CoreML handoff cost that eats the TTFT win | First-pass measurement is cheap; if the handoff dominates we revise the TTFT claim. With D1b retracted, item 27 is the only remaining decode-adjacent projection in this doc. |
 | Power-draw advantage is harder to measure than tok/s and easy to wave hands at | Add Instruments energy trace to the benchmark doc before leaning on it as a competitive claim. Treat the 3× battery-life number as an order-of-magnitude until measured. |
-| 43 tok/s projection is optimistic — PR #77 overlap was measured on Mac; iPhone ANE/GPU concurrency may differ | PR #77 measured 0.87–0.99 overlap across drivers; iPhone measurement is the gating exit criterion for the claim. |
+| 32 tok/s decode ceiling is a hard competitive floor — 42 % behind LiteRT-LM — and no tractable non-speculative lever remains | Pitch leans on power + TTFT + background-friendliness; decode parity is explicitly not claimed. If decode parity becomes a requirement, revisit PR #79's three future options (all need model re-conversion). |
 
 ---
 
diff --git a/docs/PHASE_B_DECISION.md b/docs/PHASE_B_DECISION.md
index b2b5174..84ca669 100644
--- a/docs/PHASE_B_DECISION.md
+++ b/docs/PHASE_B_DECISION.md
@@ -163,19 +163,33 @@ independent reasons:
   the ANE driver serialises submissions, so Mirror-SD's cost-hiding
   model doesn't recover the expected concurrency win.
 
-The remaining tractable path, independent of speculation, is two
-items:
-
-- **(a) D1b chunk pipelining** — ANE+GPU compute-unit split (PR #77
-  validated overlap factor 0.87–0.99). Projected ceiling ~43 tok/s
-  on iPhone 17 Pro. In flight on `feat/chunk-pipelining-d1b`.
+The remaining tractable path, independent of speculation, is now a
+**single** item:
+
+- **(a) D1b chunk pipelining** — ~~Projected ceiling ~43 tok/s on
+  iPhone 17 Pro via ANE+GPU compute-unit split (PR #77).~~
+  **Implemented on `feat/chunk-pipelining-d1b` (PR #79) and
+  REGRESSED by 24 % across all 4 prompt categories** (baseline
+  32.8–33.2 → pipelined 24.9–25.5 tok/s on Mac Studio), with a
+  bit-exact failure on `summary` at token 50 (fp16 rounding between
+  ANE and GPU backends of c3). **Structural blocker:** c4 takes
+  c3's `hidden_states_out` — a strict data dep that leaves only a
+  ~1 µs Swift dict-build as the within-step overlap window against
+  a ~16 ms GPU c3. Cross-step pipelining is blocked by the symmetric
+  token-feedback edge (c3@N+1 needs c4@N). Three future options
+  documented in PR #79's `docs/PHASE_D_PIPELINING_IMPL.md`
+  (decoupled c4, speculative h3, model re-chunking) all require
+  model re-conversion — not Swift-side. D1b is off the critical
+  path; plumbing stays OFF-by-default on main.
 - **(b) GPU prefill via MLX-Swift** — Phase 5 item 27, elevated to
-  Phase C critical path. Projected TTFT 13 s → ~1 s.
-
-Both are Swift-side work, independent of verify-chunk numerics, and
-preserve the ANE-resident decode story. The 43 tok/s and 1 s TTFT
-figures are projections; see `docs/MOBILE_2K_COMPETITIVE_PLAN.md`
-§"Projection basis" for what grounds each.
+  Phase C critical path. Projected TTFT 13 s → ~1 s. **With D1b
+  invalidated, item 27 is now the sole tractable non-speculative
+  decode-adjacent gain** (it targets the TTFT axis, not decode rate).
+
+The decode-rate ceiling on the current chunk graph is the
+**measured 32 tok/s ANE ceiling**; the ~43 tok/s projection has been
+retracted (see `docs/MOBILE_2K_COMPETITIVE_PLAN.md` §"Projection
+basis" and §"Retraction callout").
 
 Phase B itself stays closed. C0 remains filed for anyone who later
 wants to pursue the verify-protocol redesign, but it is not on the
diff --git a/docs/PRIORITY_ROADMAP.md b/docs/PRIORITY_ROADMAP.md
index cb07b53..52d1eb3 100644
--- a/docs/PRIORITY_ROADMAP.md
+++ b/docs/PRIORITY_ROADMAP.md
@@ -200,6 +200,16 @@ KV direct-write → **40–60 tok/s @ 2K, 25–35 tok/s @ 8K**.
 > win once item 27 ships. Effort revised up to 7–10 days (MLX-Swift
 > is recent and underdocumented; original 1–2-day estimate was
 > optimistic).
+>
+> **2026-04-15 late — sole critical-path decode-adjacent item.** The
+> D1b chunk-pipelining projection (~43 tok/s via PR #77's
+> compute-unit split) was retracted after PR #79 implemented the
+> full 2-stage pipeline and measured −24 % on all 4 categories; the
+> `c3 → c4` data dependency eliminates the within-step overlap
+> window. See `docs/MOBILE_2K_COMPETITIVE_PLAN.md` §"Projection basis"
+> and `docs/PHASE_B_DECISION.md` §"What this means for the
+> go-forward target". With D1b invalidated, **item 27 is now the
+> single critical-path decode-adjacent item** on this roadmap.
 
 | Priority | What | Gain | Effort | Source |
 |---|---|---|---|---|