69 changes: 69 additions & 0 deletions scripts/repro/leaf_caller_saved.md
@@ -0,0 +1,69 @@
# VCR-RA-002 scoping spike — leaf-function prologue shrink

**Issue:** #428 (the named dominant cycle residual) · **Epic:** #242 (VCR-*) ·
north-star #390
**Status:** SCOPING SPIKE (no codegen change — frozen-safe). The byte-changing
allocator change is the separate next gated step.

## The lever

On-target measurement: the dissolved hot path is now bounded by the
**6-register leaf prologue**, `push {r4-r8,lr}` / `pop {…,pc}`: ~12 cyc of pure
save/restore on a small leaf, and the largest remaining cycle residual after
cmp→select + local-promotion + immediate-shift folding. A leaf makes no calls, so
values held in caller-saved r0-r3 cannot be clobbered by a callee; if the whole
leaf body stays off callee-saved registers, `shrink_callee_saved_saves` drops the
push entirely.
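
The shrink condition can be sketched as a set computation. This is a minimal model, not the body of `shrink_callee_saved_saves`: the function name `required_saves` and the "registers written by the body" input are assumptions made for illustration.

```rust
use std::collections::BTreeSet;

/// Callee-saved core registers covered by the prologue discussed above (r4-r8).
const CALLEE_SAVED: [u8; 5] = [4, 5, 6, 7, 8];

/// Hypothetical model of the shrink decision: given the registers a leaf body
/// writes, return the callee-saved registers that still need a save/restore.
/// An empty set means the push/pop pair has nothing left to protect.
fn required_saves(written: &[u8]) -> BTreeSet<u8> {
    written
        .iter()
        .copied()
        .filter(|r| CALLEE_SAVED.contains(r))
        .collect()
}
```

Under this model a body that writes only r0-r3 yields an empty set, while a single write to r4 is enough to keep a save alive.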

## What this spike establishes (and corrects)

The naive hypothesis — "make local **promotion** prefer caller-saved homes" — was
implemented and **empirically falsified** here. Two findings:

### 1. The pool change is necessary-but-insufficient (it's a JOINT allocation problem)
Promoting locals into caller-saved r1/r2/r3 was verified to work in isolation
(disasm: locals moved r4,r5,r6 → r1,r2,r3). But the prologue did **not** shrink:
the operand-stack **temps** still occupy r4-r7, so they keep the push alive. With
one param, the free caller-saved set is exactly {r1,r2,r3}; if locals claim all
three, temps are *forced* onto callee-saved — net-neutral or worse. Dropping the
push requires locals **and** temps to prefer caller-saved, spilling to callee-saved
only under combined pressure — then shrink fires. This is a joint
locals+temps coloring change, not a promotion-pool tweak.
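
A toy colourer makes the interaction concrete. Everything here is an illustrative assumption: the live ranges are hand-written for a function shaped like the fixture, promotion is modelled as pinning each local's home for the whole function, and the greedy first-fit colourer is not the project's allocator.

```rust
/// Half-open live range [start, end) over hypothetical instruction indices.
#[derive(Clone, Copy, Debug)]
struct Interval {
    start: u32,
    end: u32,
}

fn overlaps(a: Interval, b: Interval) -> bool {
    a.start < b.end && b.start < a.end
}

/// Registers tried in order; r0 is left out because it holds the param here.
const POOL: [u8; 8] = [1, 2, 3, 4, 5, 6, 7, 8];

/// Greedy first-fit: visit intervals by start point and give each the first
/// pool register not held by an overlapping, already-coloured interval.
fn color(intervals: &[Interval], pool: &[u8]) -> Vec<Option<u8>> {
    let mut order: Vec<usize> = (0..intervals.len()).collect();
    order.sort_by_key(|&i| intervals[i].start); // stable: ties keep index order
    let mut assigned: Vec<Option<u8>> = vec![None; intervals.len()];
    for &i in &order {
        let choice = pool.iter().copied().find(|&r| {
            !(0..intervals.len())
                .any(|j| assigned[j] == Some(r) && overlaps(intervals[i], intervals[j]))
        });
        assigned[i] = choice;
    }
    assigned
}

fn uses_callee_saved(assigned: &[Option<u8>]) -> bool {
    assigned.iter().flatten().any(|&r| r >= 4)
}

/// Locals a, b, c followed by temps t0, t1. With `pin_locals`, each local
/// occupies its home for the whole function (the assumed promotion model).
fn fixture(pin_locals: bool) -> Vec<Interval> {
    let end_of_fn: u32 = 6;
    let local = |start: u32, end: u32| Interval {
        start: if pin_locals { 0 } else { start },
        end: if pin_locals { end_of_fn } else { end },
    };
    vec![
        local(0, 3),                   // a
        local(1, 3),                   // b
        local(2, 4),                   // c
        Interval { start: 3, end: 5 }, // t0 = a + b
        Interval { start: 4, end: 5 }, // t1 = c + p
    ]
}
```

With pinned locals the three caller-saved registers are exhausted before any temp is coloured, so the temps land on r4 and r5. Colouring locals and temps together over their actual ranges lets the temps reuse r1 and r2 once `a` and `b` are dead, and no callee-saved register is touched.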

### 2. It must live in the reallocator, not the selector (the measured path re-colors)
The non-relocatable path (what the host-link / native-pointer consumer compiles)
runs `select_with_stack` **then** `reallocate_function` + `shrink_callee_saved_saves`
(arm_backend.rs). The reallocator **re-colors** every straight-line segment over
`POOL=[r0..r8]`, so any caller-saved home the selector picks is **overwritten**:
a `compute_local_promotion` pool change is a no-op on this path. (Verified:
toggling `SYNTH_NO_LOCAL_PROMOTE` changes the bytes, because that change is
structural, but a selector-level register *choice* does not survive realloc.)
The `--relocatable` path skips the reallocator, so a selector change is visible
there but is insufficient on its own (finding 1 above).

⇒ **VCR-RA-002 belongs in `reallocate_function` (liveness.rs):** for a leaf
segment, bias the colourer's `POOL` order toward caller-saved (r0-r3) so the body
touches no callee-saved reg unless pressure forces it; `shrink_callee_saved_saves`
then removes the now-dead `push {r4-r8}`. This is a #193/#220-class allocator
change: land it flag-off, run a unicorn differential over the caller-saved homes
(the reservation correctness that the leaf-only guarantee buys), then gate
on-target.
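
One possible shape for the bias, sketched as a stable reorder of the pool. The function `biased_pool`, its parameters, and the idea that the colourer tries registers in pool order are assumptions; the real change in `reallocate_function` may look quite different.

```rust
/// True for the AAPCS argument/scratch registers r0-r3.
fn is_caller_saved(reg: u8) -> bool {
    reg <= 3
}

/// Reorder a colourer pool for one segment. For a leaf with the bias enabled,
/// caller-saved registers are tried first and relative order is otherwise
/// kept. Non-leaf segments, or a disabled bias, get the pool back unchanged.
fn biased_pool(pool: &[u8], is_leaf: bool, bias_enabled: bool) -> Vec<u8> {
    let mut order = pool.to_vec();
    if is_leaf && bias_enabled {
        // Stable sort on a bool key: `false` (caller-saved) sorts before `true`.
        order.sort_by_key(|&r| !is_caller_saved(r));
    }
    order
}
```

Keeping the reorder stable and gated on both conditions means the flag-off output is the input pool verbatim, which is what a bit-identical default requires of this step.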

## Fixture

`leaf_caller_saved.wat` — 1 param + 3 promotable i32 locals, low temp pressure; the
non-vacuity case (with 1 param, all 3 free caller-saved regs are consumed by
locals, exposing the temps-forced-to-callee-saved interaction). Neutral values.
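
For checking results by hand, the fixture's arithmetic can be mirrored in a small reference model. It is an illustration written from the `.wat` body, not part of the repro; wrapping addition is used because WebAssembly `i32.add` wraps on overflow.

```rust
/// Reference model of the `leaf3` export: a = p+1, b = a+2, c = b+3,
/// result = (a + b) + (c + p), all in wrapping 32-bit arithmetic.
fn leaf3(p: i32) -> i32 {
    let a = p.wrapping_add(1);
    let b = a.wrapping_add(2);
    let c = b.wrapping_add(3);
    a.wrapping_add(b).wrapping_add(c.wrapping_add(p))
}
```

Algebraically this is `4*p + 10` modulo 2^32, so `leaf3(0)` is 10 and `leaf3(1)` is 14.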

## Adjacent finding (filed separately)

While measuring, the optimized/non-relocatable path was found to emit this
**exported** leaf with a body that clobbers callee-saved r4-r8 and returns via
`bx lr` with **no** push (independent of promotion and of realloc/shrink; the
`--relocatable` build is correct). This is a potential AAPCS violation that the
differential does not detect: **#470**. If exported leaves must preserve
callee-saved registers under synth's model, that bug outranks this perf lever and
changes its baseline.

## Next gated step (separate PR)
1. Flag-off leaf-aware caller-saved bias in `reallocate_function` (`SYNTH_LEAF_CALLER_SAVED`,
default off ⇒ bit-identical).
2. Unicorn differential on `leaf_caller_saved.wat`: flag-off == flag-on == wasmtime,
with r4-r8 pre-dirtied (the reservation/clobber gate).
3. On-target cycle gate, then default-on flip + re-freeze.
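
The clobber gate in step 2 can be sketched independently of any emulator API. The register-snapshot type, the sentinel values, and the reading of `SYNTH_LEAF_CALLER_SAVED` as "enabled only when set to `1`" are all assumptions for illustration; no Unicorn or wasmtime calls are shown.

```rust
/// Hypothetical snapshot of core registers r0..r12 around an emulated call.
type Regs = [u32; 13];

const SENTINEL: u32 = 0xDEAD_0000;

/// Assumed flag semantics: enabled only when the variable is exactly "1".
fn leaf_caller_saved_enabled(value: Option<&str>) -> bool {
    matches!(value, Some("1"))
}

/// Pre-dirty r4-r8 with distinct sentinels before the call.
fn predirty(mut regs: Regs) -> Regs {
    for r in 4..=8 {
        regs[r] = SENTINEL | r as u32;
    }
    regs
}

/// The gate: r0 must match the oracle result and every r4-r8 sentinel
/// must survive the call.
fn gate(after: &Regs, expected_r0: u32) -> Result<(), String> {
    if after[0] != expected_r0 {
        return Err(format!("r0 = {}, expected {}", after[0], expected_r0));
    }
    for r in 4..=8 {
        if after[r] != (SENTINEL | r as u32) {
            return Err(format!("r{} clobbered", r));
        }
    }
    Ok(())
}
```

A harness would run the same check for the flag-off and flag-on builds and compare both r0 values against the wasmtime result; a body that leaves through `bx lr` after overwriting r5 fails the gate even when r0 is correct.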
25 changes: 25 additions & 0 deletions scripts/repro/leaf_caller_saved.wat
@@ -0,0 +1,25 @@
;; perf repro (VCR-RA-002, #428, epic #242): leaf-function prologue shrink.
;;
;; Local promotion (v0.14.0) homes eligible i32 locals in callee-saved r4..r8.
;; For a LEAF function that is the wrong pool: callee-saved regs must be
;; saved/restored (`push {r4-r8,lr}` / `pop {…,pc}` ~12 cyc of pure overhead),
;; whereas a leaf never calls, so caller-saved r1..r3 (minus params, minus r0
;; for the return value) are free homes that need NO prologue save. Promoting
;; into caller-saved first lets `shrink_callee_saved_saves` drop the callee-saved
;; push entirely.
;;
;; This fixture: 1 param + 3 promotable i32 locals (each written-before-read,
;; depth-0, >=2 reads), minimal operand-stack temp pressure. Flag-off homes
;; a,b,c -> r4,r5,r6 (push {r4-r6,lr}); flag-on (SYNTH_LEAF_CALLER_SAVED=1) homes
;; them -> r1,r2,r3 (no callee-saved push). Same result either way.
;;
;; Generic — neutral values, tied to nothing real.
(module
(memory 1)
(func (export "leaf3") (param $p i32) (result i32)
(local $a i32) (local $b i32) (local $c i32)
(local.set $a (i32.add (local.get $p) (i32.const 1)))
(local.set $b (i32.add (local.get $a) (i32.const 2)))
(local.set $c (i32.add (local.get $b) (i32.const 3)))
(i32.add (i32.add (local.get $a) (local.get $b))
(i32.add (local.get $c) (local.get $p)))))