From 050f7e471f2cfef6294f203a848d6b4c3e5eded1 Mon Sep 17 00:00:00 2001
From: Ralf Anton Beier
Date: Wed, 24 Jun 2026 20:01:13 +0200
Subject: [PATCH] perf(vcr-ra): scope VCR-RA-002 leaf prologue shrink (spike)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Scoping spike for the 6-register leaf prologue (#428, epic #242,
north-star #390): `push {r4-r8,lr}` / `pop {…,pc}` is the largest
remaining cycle residual (~12 cyc) on the dissolved hot path. No codegen
change — frozen-safe (frozen gate verified green).

Falsifies the naive hypothesis and corrects the layer (both empirically):

1. Promoting locals into caller-saved homes is necessary-but-insufficient —
   the operand-stack TEMPS still occupy r4-r8 and keep the push alive. With
   1 param the free caller-saved set is exactly {r1,r2,r3}; if locals claim
   them, temps are forced onto callee-saved (net-neutral or worse).
   VCR-RA-002 is a JOINT locals+temps coloring problem: both prefer
   caller-saved, spill to callee-saved only under combined pressure, then
   shrink drops the push.

2. The change belongs in `reallocate_function` (liveness.rs), NOT
   `compute_local_promotion`. The non-relocatable path (the host-link /
   native-pointer consumer's path) re-colors every segment over
   POOL=[r0..r8] AFTER selection, washing out any caller-saved reg the
   selector picks (verified: a selector-level reg choice is a no-op on that
   path; LOCAL_PROMOTE's structural reg-vs-frame change survives, a reg
   *choice* does not). The reallocator must be leaf-aware: bias POOL toward
   caller-saved for leaf segments so the body touches no callee-saved
   unless pressure forces it; shrink then removes the dead push.
   This is #193/#220-class — land flag-off + unicorn-differential +
   on-target gate.

Adds the generic neutral-value fixture (1 param + 3 promotable i32 locals —
the non-vacuity case where locals consume all free caller-saved regs,
exposing the temps interaction) + a scoping note.

Adjacent finding filed separately: the optimized/non-relocatable path emits
this EXPORTED leaf clobbering callee-saved r4-r8 with `bx lr` and no push
(independent of promotion + realloc/shrink; the --relocatable build is
correct) — potential AAPCS violation, differential-blind, #470. If real it
outranks this lever.

Refs #428, #242, #390, #470

Co-Authored-By: Claude Opus 4.8
---
 scripts/repro/leaf_caller_saved.md  | 69 +++++++++++++++++++++++++++++
 scripts/repro/leaf_caller_saved.wat | 25 +++++++++++
 2 files changed, 94 insertions(+)
 create mode 100644 scripts/repro/leaf_caller_saved.md
 create mode 100644 scripts/repro/leaf_caller_saved.wat

diff --git a/scripts/repro/leaf_caller_saved.md b/scripts/repro/leaf_caller_saved.md
new file mode 100644
index 0000000..da65ec8
--- /dev/null
+++ b/scripts/repro/leaf_caller_saved.md
@@ -0,0 +1,69 @@
+# VCR-RA-002 scoping spike — leaf-function prologue shrink
+
+**Issue:** #428 (the named dominant cycle residual) · **Epic:** #242 (VCR-*) ·
+north-star #390
+**Status:** SCOPING SPIKE (no codegen change — frozen-safe). The byte-changing
+allocator change is the separate next gated step.
+
+## The lever
+
+On-target measurement: the dissolved hot path is now bounded by the
+**6-register leaf prologue** — `push {r4-r8,lr}` / `pop {…,pc}`, ~12 cyc of pure
+save/restore on a small leaf, the largest remaining cycle residual after
+cmp→select + local-promotion + immediate-shift folding. A leaf never calls, so
+caller-saved r0-r3 are free of clobber; if the whole leaf body stays off
+callee-saved registers, `shrink_callee_saved_saves` drops the push entirely.
+
+## What this spike establishes (and corrects)
+
+The naive hypothesis — "make local **promotion** prefer caller-saved homes" — was
+implemented and **empirically falsified** here. Two findings:
+
+### 1. The pool change is necessary-but-insufficient (it's a JOINT allocation problem)
+Promoting locals into caller-saved r1/r2/r3 was verified to work in isolation
+(disasm: locals moved r4,r5,r6 → r1,r2,r3). But the prologue did **not** shrink:
+the operand-stack **temps** still occupy r4-r7, so they keep the push alive. With
+one param, the free caller-saved set is exactly {r1,r2,r3}; if locals claim all
+three, temps are *forced* onto callee-saved — net-neutral or worse. Dropping the
+push requires locals **and** temps to prefer caller-saved, spilling to callee-saved
+only under combined pressure — then shrink fires. This is a joint
+locals+temps coloring change, not a promotion-pool tweak.
+
+### 2. It must live in the reallocator, not the selector (the measured path re-colors)
+The non-relocatable path (what the host-link / native-pointer consumer compiles)
+runs `select_with_stack` **then** `reallocate_function` + `shrink_callee_saved_saves`
+(arm_backend.rs). The reallocator **re-colors** every straight-line segment over
+`POOL=[r0..r8]`, so any caller-saved home the selector picks is **overwritten** —
+a `compute_local_promotion` pool change is a no-op on this path (verified:
+`SYNTH_NO_LOCAL_PROMOTE` on/off changes the bytes — structural — but a
+selector-level reg *choice* does not survive realloc). The `--relocatable` path
+skips the reallocator, so a selector change shows there but is insufficient (#1).
+
+⇒ **VCR-RA-002 belongs in `reallocate_function` (liveness.rs):** for a leaf
+segment, bias the colourer's `POOL` order toward caller-saved (r0-r3) so the body
+touches no callee-saved reg unless pressure forces it; `shrink_callee_saved_saves`
+then removes the now-dead `push {r4-r8}`. This is a #193/#220-class allocator
+change — land flag-off, unicorn-differential the caller-saved homes (the
+reservation correctness the leaf-only guarantee buys), then on-target gate.
+
+## Fixture
+
+`leaf_caller_saved.wat` — 1 param + 3 promotable i32 locals, low temp pressure; the
+non-vacuity case (with 1 param, all 3 free caller-saved regs are consumed by
+locals, exposing the temps-forced-to-callee-saved interaction). Neutral values.
+
+## Adjacent finding (filed separately)
+
+While measuring, the optimized/non-relocatable path was found to emit this
+**exported** leaf clobbering callee-saved r4-r8 with `bx lr` and **no** push
+(independent of promotion and of realloc/shrink; the `--relocatable` build is
+correct). Potential AAPCS violation, differential-blind — **#470**. If exported
+leaves must preserve callee-saved under synth's model, that bug outranks this perf
+lever and changes its baseline.
+
+## Next gated step (separate PR)
+1. Flag-off leaf-aware caller-saved bias in `reallocate_function` (`SYNTH_LEAF_CALLER_SAVED`,
+   default off ⇒ bit-identical).
+2. Unicorn differential on `leaf_caller_saved.wat`: flag-off == flag-on == wasmtime,
+   with r4-r8 pre-dirtied (the reservation/clobber gate).
+3. On-target cycle gate, then default-on flip + re-freeze.
diff --git a/scripts/repro/leaf_caller_saved.wat b/scripts/repro/leaf_caller_saved.wat
new file mode 100644
index 0000000..1f7f7f2
--- /dev/null
+++ b/scripts/repro/leaf_caller_saved.wat
@@ -0,0 +1,25 @@
+;; perf repro (VCR-RA-002, #428, epic #242): leaf-function prologue shrink.
+;;
+;; Local promotion (v0.14.0) homes eligible i32 locals in callee-saved r4..r8.
+;; For a LEAF function that is the wrong pool: callee-saved regs must be
+;; saved/restored (`push {r4-r8,lr}` / `pop {…,pc}` ~12 cyc of pure overhead),
+;; whereas a leaf never calls, so caller-saved r1..r3 (minus params, minus r0
+;; for the return value) are free homes that need NO prologue save. Promoting
+;; into caller-saved first lets `shrink_callee_saved_saves` drop the callee-saved
+;; push entirely.
+;;
+;; This fixture: 1 param + 3 promotable i32 locals (each written-before-read,
+;; depth-0, >=2 reads), minimal operand-stack temp pressure. Flag-off homes
+;; a,b,c -> r4,r5,r6 (push {r4-r6,lr}); flag-on (SYNTH_LEAF_CALLER_SAVED=1) homes
+;; them -> r1,r2,r3 (no callee-saved push). Same result either way.
+;;
+;; Generic — neutral values, tied to nothing real.
+(module
+  (memory 1)
+  (func (export "leaf3") (param $p i32) (result i32)
+    (local $a i32) (local $b i32) (local $c i32)
+    (local.set $a (i32.add (local.get $p) (i32.const 1)))
+    (local.set $b (i32.add (local.get $a) (i32.const 2)))
+    (local.set $c (i32.add (local.get $b) (i32.const 3)))
+    (i32.add (i32.add (local.get $a) (local.get $b))
+             (i32.add (local.get $c) (local.get $p)))))
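
Reviewer sketch (not part of the patch): the joint locals+temps coloring that the scoping note proposes for `reallocate_function` can be illustrated with a toy linear-scan colourer. Everything here is hypothetical and does not mirror the real code in liveness.rs: `Reg`, `Interval`, `pool_order`, `color`, and `callee_saved_to_push` are illustrative names, r0 is assumed to be held back for the single param and the return value, and the flag-off order is modelled as callee-saved-first only because the fixture comment says flag-off homes locals in r4..r6.

```rust
use std::collections::BTreeSet;

/// Registers modelled as plain indices: 0..=3 caller-saved, 4..=8 callee-saved.
type Reg = u8;

const CALLER_SAVED: [Reg; 4] = [0, 1, 2, 3];
const CALLEE_SAVED: [Reg; 5] = [4, 5, 6, 7, 8];

/// Half-open live range [start, end) of a local or an operand-stack temp.
#[derive(Clone, Copy, Debug)]
struct Interval {
    start: u32,
    end: u32,
}

/// Hypothetical pool order. r0 is excluded (assumed reserved for param/return).
/// `leaf_bias == true` tries caller-saved first; `false` models the flag-off
/// behaviour described in the fixture comment (callee-saved first).
fn pool_order(leaf_bias: bool) -> Vec<Reg> {
    let free_caller = CALLER_SAVED.iter().copied().filter(|&r| r != 0);
    let callee = CALLEE_SAVED.iter().copied();
    if leaf_bias {
        free_caller.chain(callee).collect()
    } else {
        callee.chain(free_caller).collect()
    }
}

/// Greedy linear scan over locals and temps together: each interval takes the
/// first register in `order` that no still-live interval holds.
/// Returns None if the pool is exhausted (a real allocator would spill).
fn color(intervals: &[Interval], order: &[Reg]) -> Option<Vec<Reg>> {
    let mut by_start: Vec<usize> = (0..intervals.len()).collect();
    by_start.sort_by_key(|&i| intervals[i].start);

    let mut assigned: Vec<Option<Reg>> = vec![None; intervals.len()];
    let mut active: Vec<(u32, Reg)> = Vec::new(); // (end, register)

    for i in by_start {
        let iv = intervals[i];
        // Expire intervals that ended at or before this start point.
        active.retain(|&(end, _)| end > iv.start);
        let reg = order
            .iter()
            .copied()
            .find(|r| !active.iter().any(|&(_, a)| a == *r))?;
        active.push((iv.end, reg));
        assigned[i] = Some(reg);
    }
    assigned.into_iter().collect()
}

/// Callee-saved registers the body touches, i.e. what the prologue must push.
/// An empty result is the case where the push can be dropped.
fn callee_saved_to_push(assignment: &[Reg]) -> Vec<Reg> {
    let used: BTreeSet<Reg> = assignment
        .iter()
        .copied()
        .filter(|r| CALLEE_SAVED.contains(r))
        .collect();
    used.into_iter().collect()
}
```

In this toy model, four values with at most three live at once fit entirely in r1-r3 under the leaf-biased order, so `callee_saved_to_push` returns an empty list. The same values under the callee-first order touch r4-r6. Adding a fourth simultaneously live value forces one value onto r4, which is the "combined pressure" case from the note.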