perf(vcr-ra): scope VCR-RA-002 leaf prologue shrink (spike) #471

Merged
avrabe merged 1 commit into main from vcr-ra/428-leaf-prologue-spike
Jun 24, 2026
Conversation

@avrabe avrabe commented Jun 24, 2026

Scoping spike for the 6-register leaf prologue (#428, epic #242, north-star #390), the largest remaining cycle residual (~12 cycles for push {r4-r8,lr} / pop). No codegen change, so it is frozen-safe (frozen gate green). The byte-changing allocator change is the separate, next gated step.

Two empirical findings (the value of the spike)

1. Pool change is necessary-but-insufficient — it's a JOINT locals+temps problem.
Promoting locals into caller-saved r1/r2/r3 works in isolation (disasm: locals moved r4,r5,r6 → r1,r2,r3), but the push does not shrink because operand-stack temps still occupy r4-r7. With 1 param the free caller-saved set is exactly {r1,r2,r3}; if locals claim all three, temps are forced onto callee-saved registers (net-neutral or worse). Dropping the push requires both locals and temps to prefer caller-saved registers and to spill to callee-saved only under combined pressure. Only then can shrink_callee_saved_saves remove the push.

2. Wrong layer — it must live in reallocate_function, not compute_local_promotion.
The non-relocatable path (host-link / native-pointer consumer) runs select_with_stack, then reallocate_function + shrink. The reallocator re-colors every segment over POOL=[r0..r8], which washes out any caller-saved register the selector picked. Verified: toggling SYNTH_NO_LOCAL_PROMOTE changes bytes (the structural reg-vs-frame decision survives), but a selector-level register choice is a no-op on that path. Conclusion: make the reallocator leaf-aware (bias POOL toward caller-saved for leaf segments); shrink then drops the dead push. This is a #193/#220-class change, so it should land flag-off with a unicorn differential and an on-target gate.
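
Finding 1 can be illustrated with a toy model. This is a minimal sketch, not the project's allocator: the names `assign`, `model`, `CALLER_FIRST`, and `CALLEE_FIRST` are hypothetical, the temp counts are made up, and liveness is ignored (every value holds its register for the whole function, whereas a real allocator reuses registers as live ranges end). It only shows why the push set stays non-empty unless both locals and temps prefer caller-saved registers and their combined demand fits in {r1,r2,r3}.

```rust
use std::collections::BTreeSet;

// Hypothetical preference orders over r1..r8 (r0 holds the single parameter).
const CALLER_FIRST: [u8; 8] = [1, 2, 3, 4, 5, 6, 7, 8];
const CALLEE_FIRST: [u8; 8] = [4, 5, 6, 7, 8, 1, 2, 3];

/// In this toy model r4-r8 are the callee-saved registers under discussion.
fn is_callee_saved(reg: u8) -> bool {
    (4..=8).contains(&reg)
}

/// Give `count` values the first free register in `preference` order.
/// No liveness: a register, once taken, stays taken.
fn assign(count: usize, preference: &[u8], used: &mut BTreeSet<u8>) {
    for _ in 0..count {
        let reg = preference
            .iter()
            .copied()
            .find(|r| !used.contains(r))
            .expect("toy model: register pressure exceeds the pool");
        used.insert(reg);
    }
}

/// Callee-saved registers a one-parameter leaf would have to push,
/// given how many locals and temps it has and what each class prefers.
fn model(locals: usize, temps: usize, local_pref: &[u8], temp_pref: &[u8]) -> Vec<u8> {
    let mut used = BTreeSet::from([0u8]); // r0 is occupied by the parameter
    assign(locals, local_pref, &mut used);
    assign(temps, temp_pref, &mut used);
    used.iter().copied().filter(|r| is_callee_saved(*r)).collect()
}
```

In this model, `model(3, 2, &CALLER_FIRST, &CALLER_FIRST)` still needs r4 and r5 because the three locals exhaust {r1,r2,r3}. `model(1, 2, &CALLER_FIRST, &CALLEE_FIRST)` needs them too, because only the local prefers caller-saved. Only `model(1, 2, &CALLER_FIRST, &CALLER_FIRST)` returns an empty push set.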

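Finding 2 can be sketched the same way. The code below is an illustration under stated assumptions, not the code in reallocate_function: `Interval`, `recolor`, `leaf_biased_pool`, and `ASSUMED_DEFAULT_ORDER` are hypothetical names, and the default order (callee-saved first) is chosen only so the bias has a visible effect, since the real POOL order is not stated here. It shows a re-coloring pass ignoring the selector's register choice, and a leaf-biased pool order leaving no callee-saved register touched.

```rust
use std::collections::BTreeSet;

// Assumed order for the unbiased pass; the real POOL order may differ.
const ASSUMED_DEFAULT_ORDER: [u8; 8] = [4, 5, 6, 7, 8, 1, 2, 3];

/// A live range [start, end], inclusive, in instruction indices.
#[derive(Clone, Copy)]
struct Interval {
    start: u32,
    end: u32,
    /// Register a (hypothetical) selector picked; recolor never reads it.
    selector_hint: u8,
}

fn overlaps(a: &Interval, b: &Interval) -> bool {
    a.start <= b.end && b.start <= a.end
}

fn is_callee_saved(reg: u8) -> bool {
    (4..=8).contains(&reg)
}

/// Greedy re-coloring: each interval takes the first pool register not held
/// by an earlier, overlapping interval. Returns None if the pool runs out.
fn recolor(intervals: &[Interval], pool: &[u8]) -> Option<Vec<u8>> {
    let mut colors: Vec<u8> = Vec::with_capacity(intervals.len());
    for (i, iv) in intervals.iter().enumerate() {
        let reg = pool.iter().copied().find(|&r| {
            (0..i).all(|j| colors[j] != r || !overlaps(&intervals[j], iv))
        })?;
        colors.push(reg);
    }
    Some(colors)
}

/// Leaf bias: stable-sort the pool so caller-saved registers come first.
fn leaf_biased_pool(pool: &[u8]) -> Vec<u8> {
    let mut biased = pool.to_vec();
    biased.sort_by_key(|&r| is_callee_saved(r)); // false sorts before true
    biased
}

/// What a shrink step would still have to save after re-coloring.
fn callee_saved_touched(colors: &[u8]) -> BTreeSet<u8> {
    colors.iter().copied().filter(|&r| is_callee_saved(r)).collect()
}

/// True if every interval ended up in the register the selector chose.
fn hints_survive(intervals: &[Interval], colors: &[u8]) -> bool {
    intervals.iter().zip(colors).all(|(iv, &c)| iv.selector_hint == c)
}
```

With three overlapping intervals hinted at r1, r2, r3, re-coloring over the assumed default order yields r4, r5, r6 and discards the hints. Re-coloring over `leaf_biased_pool(&ASSUMED_DEFAULT_ORDER)` yields r1, r2, r3, so `callee_saved_touched` is empty and there is nothing left to push.
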
Adjacent finding (filed #470)

On the optimized/non-relocatable path, this exported leaf is emitted clobbering callee-saved r4-r8, returning with bx lr and no push. This is independent of promotion and of realloc/shrink; the --relocatable build is correct. It is a potential AAPCS violation and is differential-blind. If real, it outranks this lever and changes its baseline.
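
The condition described here can be stated as a small check. This is a hedged sketch with a hypothetical helper name (`unsaved_clobbers`), restricted to r4-r8 because that is the range under discussion; it is not a tool that exists in the repository. Given the registers a function body writes and the registers its prologue pushes, it reports the callee-saved registers that are clobbered without being saved.

```rust
use std::collections::BTreeSet;

/// Callee-saved registers (r4-r8 in this sketch) that the body writes
/// but the prologue does not push. Registers are named by number.
fn unsaved_clobbers(written: &[u8], pushed: &[u8]) -> BTreeSet<u8> {
    written
        .iter()
        .copied()
        .filter(|r| (4..=8).contains(r) && !pushed.contains(r))
        .collect()
}
```

A leaf that writes r4-r8 and returns with bx lr and no push would report all five registers; the same body behind push {r4-r8,lr} would report none.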

Fixture

leaf_caller_saved.wat — 1 param + 3 promotable i32 locals, the non-vacuity case (locals consume all free caller-saved regs, exposing the temps interaction). Neutral values.

Refs #428, #242, #390, #470

🤖 Generated with Claude Code

Scoping spike for the 6-register leaf prologue (#428, epic #242, north-star
#390): `push {r4-r8,lr}` / `pop {…,pc}` is the largest remaining cycle residual
(~12 cyc) on the dissolved hot path. No codegen change — frozen-safe (frozen gate
verified green).

Falsifies the naive hypothesis and corrects the layer (both empirically):

1. Promoting locals into caller-saved homes is necessary-but-insufficient — the
   operand-stack TEMPS still occupy r4-r8 and keep the push alive. With 1 param
   the free caller-saved set is exactly {r1,r2,r3}; if locals claim them, temps
   are forced onto callee-saved (net-neutral or worse). VCR-RA-002 is a JOINT
   locals+temps coloring problem: both prefer caller-saved, spill to callee-saved
   only under combined pressure, then shrink drops the push.

2. The change belongs in `reallocate_function` (liveness.rs), NOT
   `compute_local_promotion`. The non-relocatable path (the host-link / native-
   pointer consumer's path) re-colors every segment over POOL=[r0..r8] AFTER
   selection, washing out any caller-saved reg the selector picks (verified: a
   selector-level reg choice is a no-op on that path; LOCAL_PROMOTE's structural
   reg-vs-frame change survives, a reg *choice* does not). The reallocator must be
   leaf-aware: bias POOL toward caller-saved for leaf segments so the body touches
   no callee-saved unless pressure forces it; shrink then removes the dead push.
   #193/#220-class — land flag-off + unicorn-differential + on-target gate.

Adds the generic neutral-value fixture (1 param + 3 promotable i32 locals — the
non-vacuity case where locals consume all free caller-saved regs, exposing the
temps interaction) + a scoping note.

Adjacent finding filed separately: the optimized/non-relocatable path emits this
EXPORTED leaf clobbering callee-saved r4-r8 with `bx lr` and no push (independent
of promotion + realloc/shrink; the --relocatable build is correct) — potential
AAPCS violation, differential-blind, #470. If real it outranks this lever.

Refs #428, #242, #390, #470

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
codecov Bot commented Jun 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@avrabe avrabe merged commit 2847726 into main Jun 24, 2026
11 checks passed
@avrabe avrabe deleted the vcr-ra/428-leaf-prologue-spike branch June 24, 2026 18:38
avrabe added a commit that referenced this pull request Jun 24, 2026
… design (analysis) (#473)

#470 asked whether the optimized (non-relocatable) path emitting an exported leaf
that clobbers callee-saved r4-r8 with `bx lr` and no push is an AAPCS violation.

Determination: by-design, not a bug. synth has two ABI models selected by
--relocatable: the optimized path is the self-contained bare-metal model (non-AAPCS
by design — documented in arm_backend.rs: "does not preserve caller-saved registers
across calls"; ir_to_arm also emits no callee-saved push and its temp pool prefers
r4-r8), and --relocatable is the AAPCS/host-link path (select_with_stack, fp-relative,
proper preservation).

The self-contained model is self-consistent (verified — intra_module_callee_saved.wat):
a call-containing function is declined by the optimizer to the direct selector, which
holds across-call values on the FRAME (caller `a`: str [sp] before `bl`, ldr [sp]
after — never relies on the callee preserving r4-r8) and pushes/pops {r4-r8,lr} for
ITS caller. So an optimized leaf's callee-saved clobber corrupts nothing intra-module;
the only exposure is an external AAPCS caller, for which --relocatable is the required
documented path. No fix warranted.

Also resolves the VCR-RA-002 (#471) baseline worry: the measured `push {r4-r8,lr}` is
on the call-containing/relocatable path, NOT optimized leaves — so #470 does not reset
VCR-RA-002's baseline.

Adds the generic evidence fixture + the ABI-model note.

Refs #470, #428, #242

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>