Close the per-function cycle gap: dissolved gust_mix is 2.81× LLVM — fuse cmp→select, fold constant immediates, shrink leaf prologue

## Measured codegen gap: dissolved `gust_mix` is 2.81× the cycles of native LLVM — four mechanical lowering issues

A new cycle bench in gale (`benches/gust/src/bin/gust_codegen_bench.rs`, PR pulseengine/gale#97) times the **same** function `gust_mix` (a Q8 clamp) lowered native (rustc/LLVM→thumbv7m) vs dissolved (wasm→loom→synth→cortex-m3), under one SysTick harness (qemu `-icount`, instr≈cycles on M3), correctness-gated bit-identical over [0,2047]:

| lowering | fn-only ticks/call | instructions | callee-saves | frame |
|---|---|---|---|---|
| native (LLVM) | 0.40 | 15 | `{r7,lr}` | none |
| dissolved (synth) | 1.125 | 45 (11 wrapper + 34 inner) | `{r4–r8,lr}` ×2 | 24 B |
| **ratio** | **2.81×** | 3.0× | | |

**Reproduce:** `cargo run --release --bin gust_codegen_bench` (gale). **Kill-criterion:** this is wrong if a fresh run shows fn-only ratio ≤1.3× or the disasm below stops reproducing.

### Root structure
`gust_mix` is split into an export **wrapper** (`bl`s an inner body) + the **inner body**, lowered by **two different synth paths**: the wrapper contains a local call, so `OptimizerBridge::optimize_full`/`ir_to_arm` declines it (optimizer_bridge.rs:2279-2287, the #188 "no caller-saved clobber model" decline) → falls to the inferior direct `select_with_stack`; the callless inner body takes the optimized bridge. Neither reaches native quality.

### Four issues (file:function grounded)
1. **Unconditional 6-register callee-save** in both functions (instruction_selector.rs:5305-5310 hardcodes `{r4-r8,lr}`; `FIXED_PUSH_BYTES=24` :988). The fix exists — `shrink_callee_saved_saves` (liveness.rs:1887, wired arm_backend.rs:370) — but **declines on any SP-touch** (liveness.rs:1923-1941), so the framed wrapper is skipped. *Ask:* relax the SP-touch decline for fixed (balanced) frames; wire the shadow allocator (below) to make leaf prologue-elision sound.
2. **Wrapper arg spilled+reloaded through stack** (`str [sp,#0x14]; ldr` back): the direct path treats wasm `local 0` as a stack slot (`compute_local_layout` :836+) and never promotes it. The peephole STR→LDR-same-addr pattern exists (peephole.rs:124-136) **but only runs on the IR via the bridge, never on the direct selector's ARM stream**. *Ask:* run peephole/store-load-forwarding on the direct path too. (Inlining the wrapper itself is loom's job — filed at loom.)
3. **Constants materialized into registers instead of folded into immediates.** `ir_to_arm` lowers `i32.const` to `Mov/Movw+Movt` (optimizer_bridge.rs:2700-2776) and shifts as **register shifts with runtime masking** (`and r12,#31; lsl rd,rn,r12`, optimizer_bridge.rs:3082-3103) even for compile-time-constant shift amounts — hence `movw r4,#8; lsl.w r5,r3,r4` where native uses `lsl #8`; the u16 narrow is `movw#0xffff; and` vs `uxth`. The encoder already supports the immediate forms (`Operand2::Imm`, `Lsl{shift}`). *Ask:* fold constant operands into `Operand2::Imm`; emit immediate shift + drop the `and #31` mask when the amount is constant; recognize the `uxth` narrowing idiom. (`SYNTH_CONST_CSE` at arm_backend.rs:317 is adjacent but distinct.)
4. **compare→select as materialize-0/1-bool-then-test** (highest value). `i32.lt_s` becomes `Cmp; SetCond{LT}` → encoder expands to `ITE; MOV#1; MOV#0` (optimizer_bridge.rs:3323-3327, arm_encoder.rs:2847-2897); then `Select` does `Cmp rd,#0; SelectMove{EQ}` — flags computed, dumped to a register, recomputed against #0. *Ask:* fuse comparison-feeds-select into `cmp src1,src2; movXX` (IT-block predication). Collapses ~7 insns→2-3 per clamp; two clamps here ≈8-10 of the ~30 excess instructions.

(1)+(2) are pure overhead; (3)+(4) are the arithmetic-body gap that also drives the `control_step` 2.1× and `gust_poll` 3.9× **size** ratios in gale's COMPARE.md. None change proof obligations — lowering quality only.

### Beat-LLVM (the thesis: go *below* the 15-insn floor)
The graph-coloring allocator already exists but runs **shadow-only** (`SYNTH_SHADOW_ALLOC=1`, arm_backend.rs:382; `allocate_function`/`function_peak_pressure`/`color_graph_precolored_costed` in liveness.rs). Wiring it to drive codegen enables, for this leaf: (a) **zero prologue** — provably fits R0-R3/R12, no `push` at all on bare-metal (no unwinder), −2..−4 vs LLVM's `{r7,lr}`; (b) **`ssat`/proof-carried clamp** — constant bounds known, and loom could export `v∈[lo,hi]` to elide a bound; (c) **optimal small-graph coloring** makes (a) guaranteed not heuristic. Honest target: body ~10-12 insns (sub-LLVM) + wrapper deleted.

Refs: synth#188 (the local-call bridge decline that forces the wrapper onto the inferior path — fixing that alone would help). Full write-up: gale `benches/gust/optimization/RECO-synth-cycles.md`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Close the per-function cycle gap: dissolved gust_mix is 2.81× LLVM — fuse cmp→select, fold constant immediates, shrink leaf prologue #428

Measured codegen gap: dissolved `gust_mix` is 2.81× the cycles of native LLVM — four mechanical lowering issues

Root structure

Four issues (file:function grounded)

Beat-LLVM (the thesis: go below the 15-insn floor)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

lowering	fn-only ticks/call	instructions	callee-saves	frame
native (LLVM)	0.40	15	`{r7,lr}`	none
dissolved (synth)	1.125	45 (11 wrapper + 34 inner)	`{r4–r8,lr}` ×2	24 B
ratio	2.81×	3.0×

Close the per-function cycle gap: dissolved gust_mix is 2.81× LLVM — fuse cmp→select, fold constant immediates, shrink leaf prologue #428

Description

Measured codegen gap: dissolved gust_mix is 2.81× the cycles of native LLVM — four mechanical lowering issues

Root structure

Four issues (file:function grounded)

Beat-LLVM (the thesis: go below the 15-insn floor)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Measured codegen gap: dissolved `gust_mix` is 2.81× the cycles of native LLVM — four mechanical lowering issues