Codegen: close the 3.9× size gap vs LLVM on the gust hot path (regalloc + addressing-mode folding)

# Proposal for `pulseengine/synth`: close the 3.9× codegen gap vs LLVM on the gust hot path

**Context.** Benchmarked synth v0.11.50 lowering the `gust` mini-RTOS scheduler hot
path. Same Rust source, two toolchains:

| function | native rustc/LLVM → thumbv7m | synth dissolved (wasm→loom→synth→cm3) | ratio |
|---|---|---|---|
| `gust_poll` (scheduler poll round) | **208 B** (104 insns) | **816 B** (240 insns) | **3.9×** |
| `gust_mix` (Q8 mixer wrapper) | **12 B** (5 insns) | **44 B** (12 insns) | **3.7×** |

The wasm that synth received from loom is already well-optimized (compact 8-local body,
`select`-based branch-free wrap, offset loads — see the loom proposal). **The gap is
almost entirely in synth's wasm→ARM lowering, not the input wasm.** Disassembly evidence
below; reproduce with `compare-codegen.sh` + `llvm-objdump -d --triple=thumbv7m`.

## Attribution of the 816 − 208 = 608 B gap (synth-side)

| waste category | insns | est. bytes | what LLVM does instead |
|---|---:|---:|---|
| 1. No register allocation — every wasm local/operand spilled to a stack slot | **72** `ldr/str [sp,#…]` | **~288 B** | keeps values in r0–r10 across the whole loop; ~0 spills |
| 2. `__stack_pointer` shadow-stack emulation (entry + exit) | ~14 | ~64 B | none — no shadow stack at all |
| 3. Address materialization into r12 before every linmem access | 14 `add.w r12,rN,#imm` + 18 `[r11,…]` | ~128 B | folds offset into the load: `ldr r0,[rN,#imm]` / scaled-index `ldr.w r10,[r6,r0,lsl #3]` |
| 4. Boolean materialization + redundant compare-against-zero | 10 `ite/it` + 11 `cmp rN,#0` | ~46 B | branches on flags directly; if-converts with `it`/`movne` (1 block, not 2) |
| 5. No copy coalescing — identity `mov rN,rN` survive | 2 (+ several near-identity) | ~6 B | coalesced away |
| 6. Unconditional callee-save of r4–r8 in leaf-ish wrappers (`gust_mix`) | — | ~4 B/fn | saves only what's used |

(Categories overlap slightly; #1 and #3 together dominate and account for the bulk of the
608 B.)

### Evidence #1 — spill churn (the single biggest cost, ~288 B)

`gust_mix` is the clean microcosm. Native, 12 B:

```asm
push {r7, lr}
mov  r7, sp
bl   <mix>
uxth r0, r0        ; zero-extend low 16 bits
pop  {r7, pc}
```

synth, 44 B:

```asm
push.w {r4, r5, r6, r7, r8, lr}   ; saves 5 regs it never uses
sub.w  sp, sp, #0x18              ; 24-byte frame for a wrapper
str.w  r0, [sp, #0x14]            ; spill the incoming arg…
ldr.w  r1, [sp, #0x14]            ; …and immediately reload it
mov    r0, r1                     ; un-coalesced copy
bl     <mix>
ldr.w  r2, [pc, #0x10]            ; load 0xffff from const pool…
and.w  r3, r0, r2                 ; …AND instead of uxth (missed peephole)
mov    r0, r3                     ; un-coalesced copy
add.w  sp, sp, #0x18
pop.w  {r4, r5, r6, r7, r8, pc}
```

In `gust_poll` this pattern is pervasive: e.g. `ldr.w r0,[sp,#0x3c]` (the `sched` pointer)
is **reloaded from the same stack slot at offsets 0x260, 0x26c, 0x284, 0x2c2, 0x2e4, 0x300,
0x422, 0x43e, 0x450, 0x454…** — the same value, never kept in a register. LLVM loads
`sched` once into `r6` and reuses it for the entire function.

### Evidence #2 — `__stack_pointer` shadow-stack emulation

Entry (0x240–0x25e) and exit (0x4fc–0x50e) load `__synth_globals`, do pointer arithmetic,
and store it back — emulating the wasm `global.get/set $__stack_pointer` that loom emitted
purely as ABI boilerplate. `gust_poll` takes its state **by pointer** and uses **no
wasm-linmem statics**, so it needs no shadow stack at all.

```asm
ldr.w r0, [pc,#0x300]   ; &__synth_globals
ldr   r1, [r0]          ; sp_global
ldr.w r0, [pc,#0x300]   ; &__synth_wasm_data base
adds  r0, r1, r0
sub.w r1, r0, #0x10
str.w r1, [sp, #0x8]    ; …and the mirror dance on exit
```

### Evidence #3 — address materialization not folded into the addressing mode

```asm
; synth: 3 insns / 10 bytes per field access
add.w  r12, r2, #0x34      ; r12 = base + 0x34
ldr.w  r3, [r11, r12]      ; load via r11(=0)+r12
; LLVM: 1 insn / 4 bytes
ldrd   r0, r1, [r0, #48]   ; offset folded; even loads two words at once
```

synth assumes `r11 == 0` (verified: r11 is never written in the function) and routes every
linmem access through `[r11, r12]` after computing the effective address in r12. Thumb-2
offers `ldr rD,[rN,#imm]` (imm up to 4095) and scaled-index `ldr.w rD,[rN,rM,lsl #k]` —
both unused.

### Evidence #4 — boolean materialization

wasm `i32.eq`/`i32.ge_u` feeding a `br_if` is lowered as "materialize 0/1 into a reg, then
`cmp reg,#0`, then branch":

```asm
cmn.w  r0, #0x1
ite    eq
moveq  r2, #0x1
movne  r2, #0x0
cmp    r2, #0x0          ; redundant — flags already set
bne.w  <target>
```

LLVM branches straight off the compare flags (`beq`/`bhs`) — 1 insn vs 4.

## Recommended passes (in payoff order)

1. **Linear-scan (or tree-scan) register allocator over wasm locals + operand stack.**
   This is the headline fix (~288 B / ~47% of the gap). The dissolved-native target is
   tiny and leaf-heavy; even a simple linear-scan with the wasm validation-derived
   liveness would eliminate the redundant reload-from-same-slot pattern. Target: keep the
   `sched` pointer and loop induction vars in registers across the loop.
2. **Fold constant linmem offsets into the load/store addressing mode**, and emit the
   scaled-index form for `i32.shl k; i32.add; load`. Removes the `add.w r12,…` prologue on
   every access (~128 B). Pairs naturally with #1.
3. **Elide the `__stack_pointer` shadow-stack prologue/epilogue when the function touches
   no wasm linear memory below the shadow frame** (provable: no `memory.*`/static-offset
   ops). Saves ~64 B here and on every pointer-ABI kernel function.
4. **Branch-on-flags peephole**: when a comparison result feeds only a `br_if`, branch
   directly off the condition codes instead of materializing a bool + `cmp #0`.
   (~46 B; also a cycle win.)
5. **Copy coalescing + identity-`mov` elimination**, and **`and #0xffff` → `uxth` /
   `and #0xff` → `uxtb` peephole**.
6. **Save only clobbered callee-saved registers** in the prologue.

Estimated combined effect: passes 1–3 alone should bring `gust_poll` from 816 B toward the
~300–350 B range (≈1.5–1.7× vs LLVM), within striking distance of the project's
10–20%-overhead thesis. Passes 4–6 close most of the remainder.


---
_Filed from gale's `gust` dissolve benchmark (`benches/gust/compare-codegen.sh`). Reproduce: build native thumbv7m vs wasm→loom→synth→cortex-m3, disassemble with `llvm-objdump -d --triple=thumbv7m`. Measured on synth v0.11.50 / loom v1.1.14._

function	native rustc/LLVM → thumbv7m	synth dissolved (wasm→loom→synth→cm3)	ratio
`gust_poll` (scheduler poll round)	208 B (104 insns)	816 B (240 insns)	3.9×
`gust_mix` (Q8 mixer wrapper)	12 B (5 insns)	44 B (12 insns)	3.7×

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Codegen: close the 3.9× size gap vs LLVM on the gust hot path (regalloc + addressing-mode folding) #390

Proposal for `pulseengine/synth`: close the 3.9× codegen gap vs LLVM on the gust hot path

Attribution of the 816 − 208 = 608 B gap (synth-side)

Evidence #1 — spill churn (the single biggest cost, ~288 B)

Evidence #2 — `__stack_pointer` shadow-stack emulation

Evidence #3 — address materialization not folded into the addressing mode

Evidence #4 — boolean materialization

Recommended passes (in payoff order)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

waste category	insns	est. bytes	what LLVM does instead
1. No register allocation — every wasm local/operand spilled to a stack slot	72 `ldr/str [sp,#…]`	~288 B	keeps values in r0–r10 across the whole loop; ~0 spills
2. `__stack_pointer` shadow-stack emulation (entry + exit)	~14	~64 B	none — no shadow stack at all
3. Address materialization into r12 before every linmem access	14 `add.w r12,rN,#imm` + 18 `[r11,…]`	~128 B	folds offset into the load: `ldr r0,[rN,#imm]` / scaled-index `ldr.w r10,[r6,r0,lsl #3]`
4. Boolean materialization + redundant compare-against-zero	10 `ite/it` + 11 `cmp rN,#0`	~46 B	branches on flags directly; if-converts with `it`/`movne` (1 block, not 2)
5. No copy coalescing — identity `mov rN,rN` survive	2 (+ several near-identity)	~6 B	coalesced away
6. Unconditional callee-save of r4–r8 in leaf-ish wrappers (`gust_mix`)	—	~4 B/fn	saves only what's used

Codegen: close the 3.9× size gap vs LLVM on the gust hot path (regalloc + addressing-mode folding) #390

Description

Proposal for pulseengine/synth: close the 3.9× codegen gap vs LLVM on the gust hot path

Attribution of the 816 − 208 = 608 B gap (synth-side)

Evidence #1 — spill churn (the single biggest cost, ~288 B)

Evidence #2 — __stack_pointer shadow-stack emulation

Evidence #3 — address materialization not folded into the addressing mode

Evidence #4 — boolean materialization

Recommended passes (in payoff order)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Proposal for `pulseengine/synth`: close the 3.9× codegen gap vs LLVM on the gust hot path

Evidence #2 — `__stack_pointer` shadow-stack emulation