Skip to content

Close the per-function cycle gap: dissolved gust_mix is 2.81× LLVM — fuse cmp→select, fold constant immediates, shrink leaf prologue #428

Description

@avrabe

Measured codegen gap: dissolved gust_mix is 2.81× the cycles of native LLVM — four mechanical lowering issues

A new cycle bench in gale (benches/gust/src/bin/gust_codegen_bench.rs, PR pulseengine/gale#97) times the same function gust_mix (a Q8 clamp) lowered native (rustc/LLVM→thumbv7m) vs dissolved (wasm→loom→synth→cortex-m3), under one SysTick harness (qemu -icount, instr≈cycles on M3), correctness-gated bit-identical over [0,2047]:

lowering fn-only ticks/call instructions callee-saves frame
native (LLVM) 0.40 15 {r7,lr} none
dissolved (synth) 1.125 45 (11 wrapper + 34 inner) {r4–r8,lr} ×2 24 B
ratio 2.81× 3.0×

Reproduce: cargo run --release --bin gust_codegen_bench (gale). Kill-criterion: this is wrong if a fresh run shows fn-only ratio ≤1.3× or the disasm below stops reproducing.

Root structure

gust_mix is split into an export wrapper (bls an inner body) + the inner body, lowered by two different synth paths: the wrapper contains a local call, so OptimizerBridge::optimize_full/ir_to_arm declines it (optimizer_bridge.rs:2279-2287, the #188 "no caller-saved clobber model" decline) → falls to the inferior direct select_with_stack; the callless inner body takes the optimized bridge. Neither reaches native quality.

Four issues (file:function grounded)

  1. Unconditional 6-register callee-save in both functions (instruction_selector.rs:5305-5310 hardcodes {r4-r8,lr}; FIXED_PUSH_BYTES=24 :988). The fix exists — shrink_callee_saved_saves (liveness.rs:1887, wired arm_backend.rs:370) — but declines on any SP-touch (liveness.rs:1923-1941), so the framed wrapper is skipped. Ask: relax the SP-touch decline for fixed (balanced) frames; wire the shadow allocator (below) to make leaf prologue-elision sound.
  2. Wrapper arg spilled+reloaded through stack (str [sp,#0x14]; ldr back): the direct path treats wasm local 0 as a stack slot (compute_local_layout :836+) and never promotes it. The peephole STR→LDR-same-addr pattern exists (peephole.rs:124-136) but only runs on the IR via the bridge, never on the direct selector's ARM stream. Ask: run peephole/store-load-forwarding on the direct path too. (Inlining the wrapper itself is loom's job — filed at loom.)
  3. Constants materialized into registers instead of folded into immediates. ir_to_arm lowers i32.const to Mov/Movw+Movt (optimizer_bridge.rs:2700-2776) and shifts as register shifts with runtime masking (and r12,#31; lsl rd,rn,r12, optimizer_bridge.rs:3082-3103) even for compile-time-constant shift amounts — hence movw r4,#8; lsl.w r5,r3,r4 where native uses lsl #8; the u16 narrow is movw#0xffff; and vs uxth. The encoder already supports the immediate forms (Operand2::Imm, Lsl{shift}). Ask: fold constant operands into Operand2::Imm; emit immediate shift + drop the and #31 mask when the amount is constant; recognize the uxth narrowing idiom. (SYNTH_CONST_CSE at arm_backend.rs:317 is adjacent but distinct.)
  4. compare→select as materialize-0/1-bool-then-test (highest value). i32.lt_s becomes Cmp; SetCond{LT} → encoder expands to ITE; MOV#1; MOV#0 (optimizer_bridge.rs:3323-3327, arm_encoder.rs:2847-2897); then Select does Cmp rd,#0; SelectMove{EQ} — flags computed, dumped to a register, recomputed against #0. Ask: fuse comparison-feeds-select into cmp src1,src2; movXX (IT-block predication). Collapses ~7 insns→2-3 per clamp; two clamps here ≈8-10 of the ~30 excess instructions.

(1)+(2) are pure overhead; (3)+(4) are the arithmetic-body gap that also drives the control_step 2.1× and gust_poll 3.9× size ratios in gale's COMPARE.md. None change proof obligations — lowering quality only.

Beat-LLVM (the thesis: go below the 15-insn floor)

The graph-coloring allocator already exists but runs shadow-only (SYNTH_SHADOW_ALLOC=1, arm_backend.rs:382; allocate_function/function_peak_pressure/color_graph_precolored_costed in liveness.rs). Wiring it to drive codegen enables, for this leaf: (a) zero prologue — provably fits R0-R3/R12, no push at all on bare-metal (no unwinder), −2..−4 vs LLVM's {r7,lr}; (b) ssat/proof-carried clamp — constant bounds known, and loom could export v∈[lo,hi] to elide a bound; (c) optimal small-graph coloring makes (a) guaranteed not heuristic. Honest target: body ~10-12 insns (sub-LLVM) + wrapper deleted.

Refs: synth#188 (the local-call bridge decline that forces the wrapper onto the inferior path — fixing that alone would help). Full write-up: gale benches/gust/optimization/RECO-synth-cycles.md.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions