Measured codegen gap: dissolved gust_mix is 2.81× the cycles of native LLVM — four mechanical lowering issues
A new cycle bench in gale (benches/gust/src/bin/gust_codegen_bench.rs, PR pulseengine/gale#97) times the same function gust_mix (a Q8 clamp) lowered native (rustc/LLVM→thumbv7m) vs dissolved (wasm→loom→synth→cortex-m3), under one SysTick harness (qemu -icount, instr≈cycles on M3), correctness-gated bit-identical over [0,2047]:
| lowering |
fn-only ticks/call |
instructions |
callee-saves |
frame |
| native (LLVM) |
0.40 |
15 |
{r7,lr} |
none |
| dissolved (synth) |
1.125 |
45 (11 wrapper + 34 inner) |
{r4–r8,lr} ×2 |
24 B |
| ratio |
2.81× |
3.0× |
|
|
Reproduce: cargo run --release --bin gust_codegen_bench (gale). Kill-criterion: this is wrong if a fresh run shows fn-only ratio ≤1.3× or the disasm below stops reproducing.
Root structure
gust_mix is split into an export wrapper (bls an inner body) + the inner body, lowered by two different synth paths: the wrapper contains a local call, so OptimizerBridge::optimize_full/ir_to_arm declines it (optimizer_bridge.rs:2279-2287, the #188 "no caller-saved clobber model" decline) → falls to the inferior direct select_with_stack; the callless inner body takes the optimized bridge. Neither reaches native quality.
Four issues (file:function grounded)
- Unconditional 6-register callee-save in both functions (instruction_selector.rs:5305-5310 hardcodes
{r4-r8,lr}; FIXED_PUSH_BYTES=24 :988). The fix exists — shrink_callee_saved_saves (liveness.rs:1887, wired arm_backend.rs:370) — but declines on any SP-touch (liveness.rs:1923-1941), so the framed wrapper is skipped. Ask: relax the SP-touch decline for fixed (balanced) frames; wire the shadow allocator (below) to make leaf prologue-elision sound.
- Wrapper arg spilled+reloaded through stack (
str [sp,#0x14]; ldr back): the direct path treats wasm local 0 as a stack slot (compute_local_layout :836+) and never promotes it. The peephole STR→LDR-same-addr pattern exists (peephole.rs:124-136) but only runs on the IR via the bridge, never on the direct selector's ARM stream. Ask: run peephole/store-load-forwarding on the direct path too. (Inlining the wrapper itself is loom's job — filed at loom.)
- Constants materialized into registers instead of folded into immediates.
ir_to_arm lowers i32.const to Mov/Movw+Movt (optimizer_bridge.rs:2700-2776) and shifts as register shifts with runtime masking (and r12,#31; lsl rd,rn,r12, optimizer_bridge.rs:3082-3103) even for compile-time-constant shift amounts — hence movw r4,#8; lsl.w r5,r3,r4 where native uses lsl #8; the u16 narrow is movw#0xffff; and vs uxth. The encoder already supports the immediate forms (Operand2::Imm, Lsl{shift}). Ask: fold constant operands into Operand2::Imm; emit immediate shift + drop the and #31 mask when the amount is constant; recognize the uxth narrowing idiom. (SYNTH_CONST_CSE at arm_backend.rs:317 is adjacent but distinct.)
- compare→select as materialize-0/1-bool-then-test (highest value).
i32.lt_s becomes Cmp; SetCond{LT} → encoder expands to ITE; MOV#1; MOV#0 (optimizer_bridge.rs:3323-3327, arm_encoder.rs:2847-2897); then Select does Cmp rd,#0; SelectMove{EQ} — flags computed, dumped to a register, recomputed against #0. Ask: fuse comparison-feeds-select into cmp src1,src2; movXX (IT-block predication). Collapses ~7 insns→2-3 per clamp; two clamps here ≈8-10 of the ~30 excess instructions.
(1)+(2) are pure overhead; (3)+(4) are the arithmetic-body gap that also drives the control_step 2.1× and gust_poll 3.9× size ratios in gale's COMPARE.md. None change proof obligations — lowering quality only.
Beat-LLVM (the thesis: go below the 15-insn floor)
The graph-coloring allocator already exists but runs shadow-only (SYNTH_SHADOW_ALLOC=1, arm_backend.rs:382; allocate_function/function_peak_pressure/color_graph_precolored_costed in liveness.rs). Wiring it to drive codegen enables, for this leaf: (a) zero prologue — provably fits R0-R3/R12, no push at all on bare-metal (no unwinder), −2..−4 vs LLVM's {r7,lr}; (b) ssat/proof-carried clamp — constant bounds known, and loom could export v∈[lo,hi] to elide a bound; (c) optimal small-graph coloring makes (a) guaranteed not heuristic. Honest target: body ~10-12 insns (sub-LLVM) + wrapper deleted.
Refs: synth#188 (the local-call bridge decline that forces the wrapper onto the inferior path — fixing that alone would help). Full write-up: gale benches/gust/optimization/RECO-synth-cycles.md.
Measured codegen gap: dissolved
gust_mixis 2.81× the cycles of native LLVM — four mechanical lowering issuesA new cycle bench in gale (
benches/gust/src/bin/gust_codegen_bench.rs, PR pulseengine/gale#97) times the same functiongust_mix(a Q8 clamp) lowered native (rustc/LLVM→thumbv7m) vs dissolved (wasm→loom→synth→cortex-m3), under one SysTick harness (qemu-icount, instr≈cycles on M3), correctness-gated bit-identical over [0,2047]:{r7,lr}{r4–r8,lr}×2Reproduce:
cargo run --release --bin gust_codegen_bench(gale). Kill-criterion: this is wrong if a fresh run shows fn-only ratio ≤1.3× or the disasm below stops reproducing.Root structure
gust_mixis split into an export wrapper (bls an inner body) + the inner body, lowered by two different synth paths: the wrapper contains a local call, soOptimizerBridge::optimize_full/ir_to_armdeclines it (optimizer_bridge.rs:2279-2287, the #188 "no caller-saved clobber model" decline) → falls to the inferior directselect_with_stack; the callless inner body takes the optimized bridge. Neither reaches native quality.Four issues (file:function grounded)
{r4-r8,lr};FIXED_PUSH_BYTES=24:988). The fix exists —shrink_callee_saved_saves(liveness.rs:1887, wired arm_backend.rs:370) — but declines on any SP-touch (liveness.rs:1923-1941), so the framed wrapper is skipped. Ask: relax the SP-touch decline for fixed (balanced) frames; wire the shadow allocator (below) to make leaf prologue-elision sound.str [sp,#0x14]; ldrback): the direct path treats wasmlocal 0as a stack slot (compute_local_layout:836+) and never promotes it. The peephole STR→LDR-same-addr pattern exists (peephole.rs:124-136) but only runs on the IR via the bridge, never on the direct selector's ARM stream. Ask: run peephole/store-load-forwarding on the direct path too. (Inlining the wrapper itself is loom's job — filed at loom.)ir_to_armlowersi32.consttoMov/Movw+Movt(optimizer_bridge.rs:2700-2776) and shifts as register shifts with runtime masking (and r12,#31; lsl rd,rn,r12, optimizer_bridge.rs:3082-3103) even for compile-time-constant shift amounts — hencemovw r4,#8; lsl.w r5,r3,r4where native useslsl #8; the u16 narrow ismovw#0xffff; andvsuxth. The encoder already supports the immediate forms (Operand2::Imm,Lsl{shift}). Ask: fold constant operands intoOperand2::Imm; emit immediate shift + drop theand #31mask when the amount is constant; recognize theuxthnarrowing idiom. (SYNTH_CONST_CSEat arm_backend.rs:317 is adjacent but distinct.)i32.lt_sbecomesCmp; SetCond{LT}→ encoder expands toITE; MOV#1; MOV#0(optimizer_bridge.rs:3323-3327, arm_encoder.rs:2847-2897); thenSelectdoesCmp rd,#0; SelectMove{EQ}— flags computed, dumped to a register, recomputed against #0. Ask: fuse comparison-feeds-select intocmp src1,src2; movXX(IT-block predication). Collapses ~7 insns→2-3 per clamp; two clamps here ≈8-10 of the ~30 excess instructions.(1)+(2) are pure overhead; (3)+(4) are the arithmetic-body gap that also drives the
control_step2.1× andgust_poll3.9× size ratios in gale's COMPARE.md. None change proof obligations — lowering quality only.Beat-LLVM (the thesis: go below the 15-insn floor)
The graph-coloring allocator already exists but runs shadow-only (
SYNTH_SHADOW_ALLOC=1, arm_backend.rs:382;allocate_function/function_peak_pressure/color_graph_precolored_costedin liveness.rs). Wiring it to drive codegen enables, for this leaf: (a) zero prologue — provably fits R0-R3/R12, nopushat all on bare-metal (no unwinder), −2..−4 vs LLVM's{r7,lr}; (b)ssat/proof-carried clamp — constant bounds known, and loom could exportv∈[lo,hi]to elide a bound; (c) optimal small-graph coloring makes (a) guaranteed not heuristic. Honest target: body ~10-12 insns (sub-LLVM) + wrapper deleted.Refs: synth#188 (the local-call bridge decline that forces the wrapper onto the inferior path — fixing that alone would help). Full write-up: gale
benches/gust/optimization/RECO-synth-cycles.md.