Skip to content

Codegen: close the 3.9× size gap vs LLVM on the gust hot path (regalloc + addressing-mode folding) #390

Description

@avrabe

Proposal for pulseengine/synth: close the 3.9× codegen gap vs LLVM on the gust hot path

Context. Benchmarked synth v0.11.50 lowering the gust mini-RTOS scheduler hot
path. Same Rust source, two toolchains:

function native rustc/LLVM → thumbv7m synth dissolved (wasm→loom→synth→cm3) ratio
gust_poll (scheduler poll round) 208 B (104 insns) 816 B (240 insns) 3.9×
gust_mix (Q8 mixer wrapper) 12 B (5 insns) 44 B (12 insns) 3.7×

The wasm that synth received from loom is already well-optimized (compact 8-local body,
select-based branch-free wrap, offset loads — see the loom proposal). The gap is
almost entirely in synth's wasm→ARM lowering, not the input wasm.
Disassembly evidence
below; reproduce with compare-codegen.sh + llvm-objdump -d --triple=thumbv7m.

Attribution of the 816 − 208 = 608 B gap (synth-side)

waste category insns est. bytes what LLVM does instead
1. No register allocation — every wasm local/operand spilled to a stack slot 72 ldr/str [sp,#…] ~288 B keeps values in r0–r10 across the whole loop; ~0 spills
2. __stack_pointer shadow-stack emulation (entry + exit) ~14 ~64 B none — no shadow stack at all
3. Address materialization into r12 before every linmem access 14 add.w r12,rN,#imm + 18 [r11,…] ~128 B folds offset into the load: ldr r0,[rN,#imm] / scaled-index ldr.w r10,[r6,r0,lsl #3]
4. Boolean materialization + redundant compare-against-zero 10 ite/it + 11 cmp rN,#0 ~46 B branches on flags directly; if-converts with it/movne (1 block, not 2)
5. No copy coalescing — identity mov rN,rN survive 2 (+ several near-identity) ~6 B coalesced away
6. Unconditional callee-save of r4–r8 in leaf-ish wrappers (gust_mix) ~4 B/fn saves only what's used

(Categories overlap slightly; #1 and #3 together dominate and account for the bulk of the
608 B.)

Evidence #1 — spill churn (the single biggest cost, ~288 B)

gust_mix is the clean microcosm. Native, 12 B:

push {r7, lr}
mov  r7, sp
bl   <mix>
uxth r0, r0        ; zero-extend low 16 bits
pop  {r7, pc}

synth, 44 B:

push.w {r4, r5, r6, r7, r8, lr}   ; saves 5 regs it never uses
sub.w  sp, sp, #0x18              ; 24-byte frame for a wrapper
str.w  r0, [sp, #0x14]            ; spill the incoming arg…
ldr.w  r1, [sp, #0x14]            ; …and immediately reload it
mov    r0, r1                     ; un-coalesced copy
bl     <mix>
ldr.w  r2, [pc, #0x10]            ; load 0xffff from const pool…
and.w  r3, r0, r2                 ; …AND instead of uxth (missed peephole)
mov    r0, r3                     ; un-coalesced copy
add.w  sp, sp, #0x18
pop.w  {r4, r5, r6, r7, r8, pc}

In gust_poll this pattern is pervasive: e.g. ldr.w r0,[sp,#0x3c] (the sched pointer)
is reloaded from the same stack slot at offsets 0x260, 0x26c, 0x284, 0x2c2, 0x2e4, 0x300,
0x422, 0x43e, 0x450, 0x454…
— the same value, never kept in a register. LLVM loads
sched once into r6 and reuses it for the entire function.

Evidence #2__stack_pointer shadow-stack emulation

Entry (0x240–0x25e) and exit (0x4fc–0x50e) load __synth_globals, do pointer arithmetic,
and store it back — emulating the wasm global.get/set $__stack_pointer that loom emitted
purely as ABI boilerplate. gust_poll takes its state by pointer and uses no
wasm-linmem statics
, so it needs no shadow stack at all.

ldr.w r0, [pc,#0x300]   ; &__synth_globals
ldr   r1, [r0]          ; sp_global
ldr.w r0, [pc,#0x300]   ; &__synth_wasm_data base
adds  r0, r1, r0
sub.w r1, r0, #0x10
str.w r1, [sp, #0x8]    ; …and the mirror dance on exit

Evidence #3 — address materialization not folded into the addressing mode

; synth: 3 insns / 10 bytes per field access
add.w  r12, r2, #0x34      ; r12 = base + 0x34
ldr.w  r3, [r11, r12]      ; load via r11(=0)+r12
; LLVM: 1 insn / 4 bytes
ldrd   r0, r1, [r0, #48]   ; offset folded; even loads two words at once

synth assumes r11 == 0 (verified: r11 is never written in the function) and routes every
linmem access through [r11, r12] after computing the effective address in r12. Thumb-2
offers ldr rD,[rN,#imm] (imm up to 4095) and scaled-index ldr.w rD,[rN,rM,lsl #k]
both unused.

Evidence #4 — boolean materialization

wasm i32.eq/i32.ge_u feeding a br_if is lowered as "materialize 0/1 into a reg, then
cmp reg,#0, then branch":

cmn.w  r0, #0x1
ite    eq
moveq  r2, #0x1
movne  r2, #0x0
cmp    r2, #0x0          ; redundant — flags already set
bne.w  <target>

LLVM branches straight off the compare flags (beq/bhs) — 1 insn vs 4.

Recommended passes (in payoff order)

  1. Linear-scan (or tree-scan) register allocator over wasm locals + operand stack.
    This is the headline fix (~288 B / ~47% of the gap). The dissolved-native target is
    tiny and leaf-heavy; even a simple linear-scan with the wasm validation-derived
    liveness would eliminate the redundant reload-from-same-slot pattern. Target: keep the
    sched pointer and loop induction vars in registers across the loop.
  2. Fold constant linmem offsets into the load/store addressing mode, and emit the
    scaled-index form for i32.shl k; i32.add; load. Removes the add.w r12,… prologue on
    every access (~128 B). Pairs naturally with feat(backend): Add register allocation, code generation, and CFG optimizations #1.
  3. Elide the __stack_pointer shadow-stack prologue/epilogue when the function touches
    no wasm linear memory below the shadow frame
    (provable: no memory.*/static-offset
    ops). Saves ~64 B here and on every pointer-ABI kernel function.
  4. Branch-on-flags peephole: when a comparison result feeds only a br_if, branch
    directly off the condition codes instead of materializing a bool + cmp #0.
    (~46 B; also a cycle win.)
  5. Copy coalescing + identity-mov elimination, and and #0xffffuxth /
    and #0xffuxtb peephole
    .
  6. Save only clobbered callee-saved registers in the prologue.

Estimated combined effect: passes 1–3 alone should bring gust_poll from 816 B toward the
~300–350 B range (≈1.5–1.7× vs LLVM), within striking distance of the project's
10–20%-overhead thesis. Passes 4–6 close most of the remainder.


Filed from gale's gust dissolve benchmark (benches/gust/compare-codegen.sh). Reproduce: build native thumbv7m vs wasm→loom→synth→cortex-m3, disassemble with llvm-objdump -d --triple=thumbv7m. Measured on synth v0.11.50 / loom v1.1.14.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions