john-rocky · Apr 15, 2026
diff --git a/docs/HANDOFF.md b/docs/HANDOFF.md
@@ -1,23 +1,29 @@
 # Next-session handoff
 
-**Last updated:** 2026-04-15 post value-prop reframe. The "beat
-LiteRT-LM at 56 tok/s" target is **retired**; go-forward value prop
-is ANE-native (power + TTFT + ~43 tok/s ceiling). See
-`docs/MOBILE_2K_COMPETITIVE_PLAN.md`.
+**Last updated:** 2026-04-15 late (post D1b-impl retraction). The
+"beat LiteRT-LM at 56 tok/s" target is **retired**; go-forward
+value prop is ANE-native (power + TTFT + **measured 32 tok/s
+decode ceiling**). The earlier ~43 tok/s decode projection was
+retracted after PR #79 implemented the required D1b pipeline and
+measured a 24 % regression. See `docs/MOBILE_2K_COMPETITIVE_PLAN.md`
+§"Retraction callout".
 
 ## Read this first
 
 To resume cleanly, the next session should:
 
 1. Open `docs/MOBILE_2K_COMPETITIVE_PLAN.md` — **authoritative
-   value prop**. ANE-native triad: power, TTFT, tok/s ceiling. 56
+   value prop**. ANE-native triad: ~1 W power, ~1 s TTFT
+   (projected, item 27), **32 tok/s decode ceiling (measured)**. 56
    tok/s parity is no longer a goal; we compete on a different axis.
 2. Open this file (`docs/HANDOFF.md`) — takes 5 minutes.
 3. Read `docs/PHASE_B_DECISION.md` §"What this means for the
    go-forward target" — why speculative decoding is off the critical
-   path and the two tractable items that replaced it (D1b + item 27).
-4. Read `docs/BASELINE_SPEED_AUDIT.md` — per-chunk costs that ground
-   the ~43 tok/s pipelining ceiling.
+   path, why D1b pipelining was retracted (PR #79), and why item 27
+   is now the *single* tractable decode-adjacent lever.
+4. Read `docs/BASELINE_SPEED_AUDIT.md` — per-chunk costs (c1=5.9,
+   c2=6.8, c3=8.1, c4=10.4 ms on iPhone 17 Pro); read alongside the
+   retraction to understand why the `c3 → c4` dep kills overlap.
 5. Skim `docs/SESSION_STATE.md` for exact PR / branch / task state.
 6. Historical context (read only if needed):
    - `docs/PHASE_C_TIGHTENING_FINDINGS.md` — why verify-chunk redesign
@@ -26,6 +32,10 @@ To resume cleanly, the next session should:
      `PHASE_B_LIVE_ACCEPT_RATE_GAP.md` — speculative-decode refutation
      chain.
    - `docs/PHASE_A5_DECISION.md` — superseded.
+   - PR #79 / `feat/chunk-pipelining-d1b` — D1b full-impl negative
+     result (the retraction evidence); see `docs/PHASE_D_PIPELINING_IMPL.md`
+     on that branch for the three future options (decoupled c4,
+     speculative h3, model re-chunking) — all require `conversion/`.
 
 No need to touch MTP Path A, PR #17, PR #33, or further Union /
 speculative wiring — all off the critical path with reasons in
@@ -39,37 +49,49 @@ model walks in with correct framing:
 > Value-prop reframed 2026-04-15. **56 tok/s parity with LiteRT-LM
 > is no longer a goal.** LiteRT-LM is Metal GPU at 3–5 W; this repo
 > is ANE-native at ~1 W. Different competitive axis. Current target
-> triad: (1) sustained power ~1 W, (2) TTFT ~1 s (conditional on
-> item 27), (3) decode tok/s ~43 (conditional on D1b). All three
-> projections are documented in docs/MOBILE_2K_COMPETITIVE_PLAN.md.
+> triad: (1) sustained power ~1 W, (2) TTFT ~1 s (projected,
+> conditional on item 27), (3) **decode 32 tok/s (measured ANE
+> ceiling)**. The earlier ~43 tok/s decode projection has been
+> **retracted** — PR #79 implemented the D1b 2-stage pipeline that
+> projection required and measured a 24 % regression across all 4
+> categories; the `c3 → c4` data dep leaves no within-step overlap
+> window. See docs/MOBILE_2K_COMPETITIVE_PLAN.md §"Retraction
+> callout" and docs/PHASE_B_DECISION.md §"What this means for the
+> go-forward target".
 >
-> Next-session focus: **ship D1b full impl + scope item 27 (GPU
-> prefill via MLX-Swift)**. Both are Swift-side, both preserve
-> ANE-resident decode. Neither depends on speculative decoding.
+> Next-session focus: **pick one of the following.** Both are
+> independent of speculative decoding.
+>
+> - **Item 27 (GPU prefill via MLX-Swift).** Sole remaining tractable
+>   non-speculative decode-adjacent lever after D1b was invalidated.
+>   Targets TTFT, not decode rate. MLX-Swift prefill path, 7–10 days.
+>   Decode stays on ANE; GPU is used only during the 512-token
+>   prefill. Exit criterion for the scoping pass: a one-page design
+>   that identifies the Metal↔CoreML handoff cost and the ~1 s TTFT
+>   feasibility gate.
+> - **One of PR #79's three future options** (all require `conversion/`
+>   work — not runtime-only): decoupled c4 (c4 takes h2 instead of h3,
+>   recomputing layers 25–34 independently), speculative h3
+>   (predict-and-verify on the hidden axis), or model re-chunking so
+>   c3/c4 can compute independent residual sub-streams. See
+>   `docs/PHASE_D_PIPELINING_IMPL.md` on branch `feat/chunk-pipelining-d1b`.
 >
 > Speculative decoding is off the critical path. Phase B closed;
 > verify-protocol redesign (C0 option b) is multi-week and is not
-> required for the ANE-native value prop. Do not re-litigate.
+> required for the ANE-native value prop. Do not re-litigate. D1b
+> runtime pipelining is also off the table — do not revisit without
+> first doing one of the three `conversion/` fixes above.
 >
 > Read (in order):
-> 1. docs/MOBILE_2K_COMPETITIVE_PLAN.md — the value prop.
+> 1. docs/MOBILE_2K_COMPETITIVE_PLAN.md — the value prop (with
+>    retraction callout).
 > 2. docs/PHASE_B_DECISION.md §"What this means for the go-forward
->    target" — why speculation is parked.
+>    target" — why speculation is parked and why D1b was retracted.
 > 3. docs/BASELINE_SPEED_AUDIT.md — per-chunk costs; grounds the
->    43 tok/s ceiling.
+>    32 tok/s ceiling and shows why `c3 → c4` is load-bearing.
 > 4. docs/HANDOFF.md (this file) + docs/SESSION_STATE.md.
->
-> Start options:
-> - **Ship D1b pipelining full impl.** PR #77 validated the ANE+GPU
->   compute-unit split (overlap 0.87–0.99). Branch
->   `feat/chunk-pipelining-d1b` is in flight. Deliver the 4-stage
->   pipeline + iPhone measurement; exit criterion is ≥ 40 tok/s at
->   2K on iPhone 17 Pro.
-> - **Scope item 27 (GPU prefill).** MLX-Swift prefill path, 7–10
->   days. Decode stays on ANE; GPU is used only during the 512-token
->   prefill. Exit criterion for the scoping pass: a one-page design
->   that identifies the Metal↔CoreML handoff cost and the ~1 s TTFT
->   feasibility gate.
+> 5. PR #79 / `docs/PHASE_D_PIPELINING_IMPL.md` (on the branch) —
+>    retraction evidence + three future options.
 >
 > Docs auto-merge; bench/harness code auto-merges too.
 

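Reviewer sanity check (not part of the patch): the per-chunk costs quoted in the `docs/HANDOFF.md` hunk can be combined to reproduce both headline decode figures. The following is a minimal Python sketch, assuming the four chunk costs simply add up in the serial case and that the retracted PR #77 estimate follows the `max(c1+c2+c4_ANE, c1+c3_GPU)` formula quoted in `docs/MOBILE_2K_COMPETITIVE_PLAN.md`. The function names are hypothetical, and reusing the ANE c3 cost for the GPU branch is an assumption, since the PR quotes no iPhone GPU figure for c3.

```python
# Per-chunk decode costs in ms on iPhone 17 Pro, as quoted in docs/HANDOFF.md.
CHUNK_MS = {"c1": 5.9, "c2": 6.8, "c3": 8.1, "c4": 10.4}


def tok_per_s(step_ms: float) -> float:
    """Convert a per-step latency in milliseconds to tokens per second."""
    return 1000.0 / step_ms


def serial_step_ms(costs: dict) -> float:
    """Fully serial step: every chunk waits for the previous one."""
    return sum(costs.values())


def retracted_projection_step_ms(costs: dict, c3_gpu_ms=None) -> float:
    """Step time under the retracted estimate max(c1+c2+c4_ANE, c1+c3_GPU).

    Assumption: when no GPU cost for c3 is supplied, the ANE cost is reused.
    """
    c3_gpu = costs["c3"] if c3_gpu_ms is None else c3_gpu_ms
    ane_branch = costs["c1"] + costs["c2"] + costs["c4"]
    gpu_branch = costs["c1"] + c3_gpu
    return max(ane_branch, gpu_branch)


serial_tok_s = tok_per_s(serial_step_ms(CHUNK_MS))
projected_tok_s = tok_per_s(retracted_projection_step_ms(CHUNK_MS))

print(f"serial:    {serial_tok_s:.1f} tok/s")
print(f"projected: {projected_tok_s:.1f} tok/s (retracted by PR #79)")
```

Under these assumptions the serial sum comes out at about 31.2 ms per step (roughly 32 tok/s) and the retracted formula at about 23.1 ms per step (roughly 43 tok/s), in line with the figures in the text. The sketch only reproduces the arithmetic; it does not model the `c3 → c4` dependency that PR #79 found to block the overlap.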
diff --git a/docs/MOBILE_2K_COMPETITIVE_PLAN.md b/docs/MOBILE_2K_COMPETITIVE_PLAN.md
@@ -1,7 +1,24 @@
 # Mobile 2K competitive plan — ANE-native value prop
 
-**Status:** 2026-04-15. Supersedes the previous "beat LiteRT-LM at 56
-tok/s" framing. See §"What changed" for the history.
+**Status:** 2026-04-15 late (post D1b-impl retraction). Supersedes
+the previous "beat LiteRT-LM at 56 tok/s" framing. See §"What
+changed" for the history.
+
+> **Retraction callout (2026-04-15 late).** An earlier revision of
+> this doc claimed a **~43 tok/s projected decode ceiling** grounded
+> in PR #77's compute-unit-split spike. **That projection is
+> withdrawn.** PR #79 implemented the full 2-stage pipeline it
+> required and measured a **24 % regression** across all 4 prompt
+> categories (baseline 32.8–33.2 → pipelined 24.9–25.5 tok/s on Mac
+> Studio), plus a bit-exactness failure on `summary` at token 50
+> caused by fp16 rounding differences between the ANE and GPU
+> backends of chunk 3. The structural blocker is a strict `c3 → c4`
+> data dependency: c4 consumes c3's `hidden_states_out`, so the only
+> within-step overlap window is a ~1 µs Swift dict-build against a
+> ~16 ms GPU c3, a pure regression. The numerical claim collapses to
+> the **measured 32 tok/s ANE decode ceiling**. Full analysis in PR #79 /
+> `docs/PHASE_D_PIPELINING_IMPL.md` (branch `feat/chunk-pipelining-d1b`,
+> HEAD 7c21c7b). Value prop below is updated accordingly.
 
 **Goal:** ship the strongest *ANE-native* Gemma 4 E2B runtime on
 iPhone at ctx=2048. The competitive axis is **not** raw decode tok/s —
@@ -12,13 +29,15 @@ power draw, TTFT, and decode tok/s ceiling under ANE constraints.**
 
 ## Value proposition (one-liner)
 
-> ANE-native LLM runtime for Apple Silicon. Target **~43 tok/s at
-> ~1 W sustained**, with **~1 s TTFT** on 512-token prompts — different
+> ANE-native LLM runtime for Apple Silicon. Sustained **~1 W** with
+> **~1 s TTFT** (projected via item 27) on 512-token prompts and a
+> **32 tok/s decode ceiling** (measured, current) — different
 > competitive axis from LiteRT-LM's 56 tok/s at 3–5 W.
 
-Both the 43 tok/s figure and the 1 s TTFT are **projections, not
-shipped numbers**. See §"Projection basis" for what grounds them and
-what has to ship to realise each.
+The 1 s TTFT is a **projection** conditional on item 27 shipping
+(see §"Projection basis"). The 32 tok/s decode figure is the
+**measured ANE ceiling on iPhone 17 Pro / Mac Studio** after PR #79
+retired the 43 tok/s D1b pipelining projection.
 
 ---
 
@@ -27,19 +46,28 @@ what has to ship to realise each.
 Gemma 4 E2B on iPhone 17 Pro, ctx=2048. Power numbers are rough order-
 of-magnitude, not calibrated measurements.
 
-| Axis | LiteRT-LM iOS (Metal GPU) | This repo (ANE, shipped today) | This repo (after D1b + item 27, projected) |
+| Axis | LiteRT-LM iOS (Metal GPU) | This repo (ANE, shipped today) | This repo (after item 27, projected) |
 |---|---:|---:|---:|
-| Decode tok/s @ 2K | **56** | 31 | **~43** |
+| Decode tok/s @ 2K | **56** | **32** (measured ceiling; 31.4 on iPhone 17 Pro) | **32** (no non-speculative decode lever left) |
 | TTFT @ 512 prompt | ~1–2 s (Metal prefill) | ~13 s (ANE prefill) | **~1 s** (GPU prefill) |
 | Sustained power draw | 3–5 W (GPU active) | **~1 W** (ANE) | ~1 W + brief GPU prefill spikes |
 | Battery life @ continuous decode | baseline | **~3× baseline** | ~3× baseline |
 | GPU / CPU availability for host app | low (GPU saturated) | **high** | high (GPU used only during prefill) |
 | Background / always-on friendly | no (thermal + power) | **yes** | yes |
 | Model-placement footprint | fp16 GPU weights | INT4 ANE weights | INT4 ANE weights |
 
-On tok/s-for-tok/s we are **~20 % slower than LiteRT-LM** even if
-everything on the right-hand column ships. We do not claim to match
-them on decode rate. We claim to win a different envelope.
+On tok/s-for-tok/s we are **~43 % slower than LiteRT-LM** on decode
+(32 vs 56). We do not claim parity, and we do not claim to close
+more than a marginal fraction of that gap on the current chunk
+graph: PR #79 showed the non-speculative decode-overlap lever is
+structurally unavailable. We claim to win a different envelope.
+
+The UX argument rests on that asymmetry: **32 tok/s is ~6× human
+read speed (~5 tok/s)**, so decode is already faster than the user
+can follow. LiteRT-LM's 56-vs-32 advantage sits in a throughput
+regime that the typical chat / assistant UX doesn't consume. The
+product wedge is sustained ~1 W, ~1 s TTFT, and the GPU-free host
+envelope, not decode parity.
 
 ---
 
@@ -70,22 +98,37 @@ always-on notetaker, on-the-go with limited charge).
 
 ## Projection basis
 
-### 43 tok/s ceiling (decode)
-
-Comes from the Phase D compute-unit split spike (PR #77). With
-chunk 3/4 placed on GPU and chunks 1/2 on ANE, the two compute units
-overlap at factor 0.87–0.99 across drivers — the critical-path
-estimate drops from 51.7 ms per step (fully serial, ~19 tok/s on Mac
-Studio audit) to ~23 ms per step at full pipelining, which on iPhone
-translates to ~43 tok/s from the measured 31 tok/s baseline. This
-**requires shipping D1b chunk pipelining** (in flight on
-`feat/chunk-pipelining-d1b`). Until D1b merges and measures on an
-iPhone 17 Pro, 43 tok/s remains a projection.
-
-What ANE-chunk pipelining *cannot* do: PR #75 showed the ANE driver
-serialises all submissions at the driver level, so adding more ANE
-parallelism on the same compute unit does not help. The win is only
-available by splitting across ANE + GPU.
+### 32 tok/s decode ceiling (measured, current)
+
+Mac Studio 128-token decode with drafters OFF (PR #79, 2026-04-15):
+chat 32.80, code 33.24, qa 33.15, summary 33.02 tok/s. iPhone 17 Pro
+measures 31.4 tok/s at 2 K under the same defaults. This is the
+**current ANE decode ceiling** on the shipped chunk graph.
+
+#### Why the earlier 43 tok/s projection was retracted
+
+The 43 tok/s figure came from PR #77's compute-unit-split spike:
+`max(c1+c2+c4_ANE, c1+c3_GPU)` ≈ 23 ms/step, grounded in a measured
+0.87–0.99 kernel-overlap factor between ANE and GPU driver queues.
+The projection assumed c3 and c4 could run in parallel. **PR #79
+implemented the full 2-stage pipeline and refuted that assumption
+empirically:**
+
+- Every category regressed by ~24 % (see retraction callout at top).
+- Root cause: c4 takes c3's `hidden_states_out` as input, a strict
+  data dependency. The only overlap window between c3 and c4 within
+  a step is the ~1 µs Swift dict-build, negligible against the
+  ~16 ms GPU c3. The cross-step pipeline (c3 of step N+1 concurrent
+  with c4 of step N) is similarly blocked because c3 of step N+1
+  cannot start until c4 of step N has emitted the just-decoded token.
+- PR #75 independently showed that adding ANE-only parallelism does
+  not help because the ANE driver serialises all submissions.
+
+Conclusion: on the current CoreML chunk graph there is **no
+non-speculative decode overlap available**. Three future options
+documented in PR #79's `docs/PHASE_D_PIPELINING_IMPL.md`
+(decoupled c4, speculative h3, model re-chunking) all require
+`conversion/`-side work — none are runtime-only.
 
 ### 1 s TTFT (prefill)
 
@@ -138,32 +181,42 @@ The previous version of this doc set **"70–110 tok/s at 2K, i.e.
   it would require pivoting this repo to MLX-Swift / GPU decode,
   which is an explicitly rejected direction (the repo's reason for
   existing is ANE-native placement).
-- Under ANE constraints, the decode ceiling is ~43 tok/s (PR #77
-  split + D1b pipelining). That is ~77 % of LiteRT-LM's throughput.
+- Under ANE constraints, the measured decode ceiling is **32 tok/s**.
+  An earlier revision of this doc projected ~43 tok/s via a PR #77
+  compute-unit split and D1b pipelining; PR #79's full
+  implementation empirically refuted that (see retraction callout
+  and §"Projection basis" for the mechanism). 32 tok/s is ~57 % of
+  LiteRT-LM's decode rate, which is a ~43 % gap, not ~20 %.
 - The speculative-decoding route that was supposed to close the gap
   is blocked at the verify-chunk write-through layer (see above).
 - Reframing on power + TTFT + tok/s-ceiling is **honest about what
   ANE can deliver** and **identifies where we actually win**: a
   different user segment (background-friendly, battery-sensitive,
-  GPU-contended host apps).
+  GPU-contended host apps). The decode-rate gap is real; the UX
+  argument has to carry the pitch, not decode parity.
 
 We do not claim parity on decode rate. We claim a better product on
 a different axis triad.
 
 ---
 
-## Execution — two tractable paths forward
+## Execution — single tractable decode-adjacent path forward
 
-Neither depends on the speculative-decoding work.
+Only one non-speculative decode-adjacent item remains after the D1b
+refutation: **item 27 (GPU prefill)**, which targets the TTFT axis
+rather than the decode axis. The decode axis is parked at the
+measured 32 tok/s ceiling pending either a model re-conversion
+(PR #79's three future options) or the multi-week verify-protocol
+redesign (C0 option b).
 
 | # | Item | Status | Axis unlocked |
 |---|---|---|---|
-| **A** | **D1b chunk pipelining** — overlap chunk N+1 step *t* with chunk N step *t-1* across ANE+GPU split (PR #77 validated the split; D1b implements the full 4-stage pipeline). | in flight on `feat/chunk-pipelining-d1b` | decode tok/s 31 → ~43 (projection) |
-| **B** | **Item 27 GPU prefill via MLX-Swift** — offload the 512-token prefill batch to GPU tensor cores. | not started; elevated to **Phase C critical path** (was "stretch"). | TTFT 13 s → ~1 s (projection) |
+| **A** | ~~**D1b chunk pipelining**~~ — **RETRACTED 2026-04-15** by PR #79 full-impl measurement (−24 % on all 4 categories). Structural blocker: `c3 → c4` data dependency. Three future options (decoupled c4 / speculative h3 / model re-chunking) require `conversion/` work. | refuted; plumbing kept OFF-by-default on `feat/chunk-pipelining-d1b` | — |
+| **B** | **Item 27 GPU prefill via MLX-Swift** — offload the 512-token prefill batch to GPU tensor cores. | not started; elevated to **Phase C critical path** (was "stretch"). Sole remaining tractable decode-adjacent lever. | TTFT 13 s → ~1 s (projection) |
 
-Both are Swift-side work; no chunk reconversion required. Both preserve
-the ANE-resident decode story (decode stays on ANE; GPU is used for
-prefill only in B, and only for chunks 3/4 in A).
+Item B is Swift-side work; no chunk reconversion required. It
+preserves the ANE-resident decode story (decode stays on ANE; GPU
+is used for prefill only).
 
 ### Explicitly out of scope
 
@@ -182,10 +235,10 @@ prefill only in B, and only for chunks 3/4 in A).
 
 | Risk | Mitigation / note |
 |---|---|
-| D1b pipelining hits an iOS-side dispatch quirk | Falls back to serial execution; decode stays at 31 tok/s baseline. We don't regress. |
-| GPU prefill (item 27) surfaces a Metal ↔ CoreML handoff cost that eats the TTFT win | First-pass measurement is cheap; if the handoff dominates we ship only D1b and revise TTFT claim. |
+| ~~D1b pipelining hits an iOS-side dispatch quirk~~ | **N/A — D1b retracted 2026-04-15 by PR #79 full-impl measurement; structural, not an iOS quirk.** |
+| GPU prefill (item 27) surfaces a Metal ↔ CoreML handoff cost that eats the TTFT win | First-pass measurement is cheap; if the handoff dominates we revise the TTFT claim. With D1b retracted, item 27 is the only remaining decode-adjacent projection in this doc. |
 | Power-draw advantage is harder to measure than tok/s and easy to wave hands at | Add Instruments energy trace to the benchmark doc before leaning on it as a competitive claim. Treat the 3× battery-life number as an order-of-magnitude until measured. |
-| 43 tok/s projection is optimistic — PR #77 overlap was measured on Mac; iPhone ANE/GPU concurrency may differ | PR #77 measured 0.87–0.99 overlap across drivers; iPhone measurement is the gating exit criterion for the claim. |
+| 32 tok/s decode ceiling is a hard competitive limit (~43 % behind LiteRT-LM) and no tractable non-speculative lever remains | Pitch leans on power + TTFT + background-friendliness; decode parity is explicitly not claimed. If decode parity becomes a requirement, revisit PR #79's three future options (all need model re-conversion). |
 
 ---