You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Proposal for pulseengine/synth: close the 3.9× codegen gap vs LLVM on the gust hot path
Context. Benchmarked synth v0.11.50 lowering the gust mini-RTOS scheduler hot
path. Same Rust source, two toolchains:
function
native rustc/LLVM → thumbv7m
synth dissolved (wasm→loom→synth→cm3)
ratio
gust_poll (scheduler poll round)
208 B (104 insns)
816 B (240 insns)
3.9×
gust_mix (Q8 mixer wrapper)
12 B (5 insns)
44 B (12 insns)
3.7×
The wasm that synth received from loom is already well-optimized (compact 8-local body, select-based branch-free wrap, offset loads — see the loom proposal). The gap is
almost entirely in synth's wasm→ARM lowering, not the input wasm. Disassembly evidence
below; reproduce with compare-codegen.sh + llvm-objdump -d --triple=thumbv7m.
Attribution of the 816 − 208 = 608 B gap (synth-side)
waste category
insns
est. bytes
what LLVM does instead
1. No register allocation — every wasm local/operand spilled to a stack slot
72ldr/str [sp,#…]
~288 B
keeps values in r0–r10 across the whole loop; ~0 spills
In gust_poll this pattern is pervasive: e.g. ldr.w r0,[sp,#0x3c] (the sched pointer)
is reloaded from the same stack slot at offsets 0x260, 0x26c, 0x284, 0x2c2, 0x2e4, 0x300,
0x422, 0x43e, 0x450, 0x454… — the same value, never kept in a register. LLVM loads sched once into r6 and reuses it for the entire function.
Entry (0x240–0x25e) and exit (0x4fc–0x50e) load __synth_globals, do pointer arithmetic,
and store it back — emulating the wasm global.get/set $__stack_pointer that loom emitted
purely as ABI boilerplate. gust_poll takes its state by pointer and uses no
wasm-linmem statics, so it needs no shadow stack at all.
Evidence #3 — address materialization not folded into the addressing mode
; synth: 3 insns / 10 bytes per field accessadd.w r12, r2, #0x34 ; r12 = base + 0x34ldr.w r3,[r11,r12] ; load via r11(=0)+r12; LLVM: 1 insn / 4 bytesldrd r0, r1,[r0, #48] ; offset folded; even loads two words at once
synth assumes r11 == 0 (verified: r11 is never written in the function) and routes every
linmem access through [r11, r12] after computing the effective address in r12. Thumb-2
offers ldr rD,[rN,#imm] (imm up to 4095) and scaled-index ldr.w rD,[rN,rM,lsl #k] —
both unused.
LLVM branches straight off the compare flags (beq/bhs) — 1 insn vs 4.
Recommended passes (in payoff order)
Linear-scan (or tree-scan) register allocator over wasm locals + operand stack.
This is the headline fix (~288 B / ~47% of the gap). The dissolved-native target is
tiny and leaf-heavy; even a simple linear-scan with the wasm validation-derived
liveness would eliminate the redundant reload-from-same-slot pattern. Target: keep the sched pointer and loop induction vars in registers across the loop.
Elide the __stack_pointer shadow-stack prologue/epilogue when the function touches
no wasm linear memory below the shadow frame (provable: no memory.*/static-offset
ops). Saves ~64 B here and on every pointer-ABI kernel function.
Branch-on-flags peephole: when a comparison result feeds only a br_if, branch
directly off the condition codes instead of materializing a bool + cmp #0.
(~46 B; also a cycle win.)
Copy coalescing + identity-mov elimination, and and #0xffff → uxth / and #0xff → uxtb peephole.
Save only clobbered callee-saved registers in the prologue.
Estimated combined effect: passes 1–3 alone should bring gust_poll from 816 B toward the
~300–350 B range (≈1.5–1.7× vs LLVM), within striking distance of the project's
10–20%-overhead thesis. Passes 4–6 close most of the remainder.
Filed from gale's gust dissolve benchmark (benches/gust/compare-codegen.sh). Reproduce: build native thumbv7m vs wasm→loom→synth→cortex-m3, disassemble with llvm-objdump -d --triple=thumbv7m. Measured on synth v0.11.50 / loom v1.1.14.
Proposal for
pulseengine/synth: close the 3.9× codegen gap vs LLVM on the gust hot pathContext. Benchmarked synth v0.11.50 lowering the
gustmini-RTOS scheduler hotpath. Same Rust source, two toolchains:
gust_poll(scheduler poll round)gust_mix(Q8 mixer wrapper)The wasm that synth received from loom is already well-optimized (compact 8-local body,
select-based branch-free wrap, offset loads — see the loom proposal). The gap isalmost entirely in synth's wasm→ARM lowering, not the input wasm. Disassembly evidence
below; reproduce with
compare-codegen.sh+llvm-objdump -d --triple=thumbv7m.Attribution of the 816 − 208 = 608 B gap (synth-side)
ldr/str [sp,#…]__stack_pointershadow-stack emulation (entry + exit)add.w r12,rN,#imm+ 18[r11,…]ldr r0,[rN,#imm]/ scaled-indexldr.w r10,[r6,r0,lsl #3]ite/it+ 11cmp rN,#0it/movne(1 block, not 2)mov rN,rNsurvivegust_mix)(Categories overlap slightly; #1 and #3 together dominate and account for the bulk of the
608 B.)
Evidence #1 — spill churn (the single biggest cost, ~288 B)
gust_mixis the clean microcosm. Native, 12 B:synth, 44 B:
In
gust_pollthis pattern is pervasive: e.g.ldr.w r0,[sp,#0x3c](theschedpointer)is reloaded from the same stack slot at offsets 0x260, 0x26c, 0x284, 0x2c2, 0x2e4, 0x300,
0x422, 0x43e, 0x450, 0x454… — the same value, never kept in a register. LLVM loads
schedonce intor6and reuses it for the entire function.Evidence #2 —
__stack_pointershadow-stack emulationEntry (0x240–0x25e) and exit (0x4fc–0x50e) load
__synth_globals, do pointer arithmetic,and store it back — emulating the wasm
global.get/set $__stack_pointerthat loom emittedpurely as ABI boilerplate.
gust_polltakes its state by pointer and uses nowasm-linmem statics, so it needs no shadow stack at all.
Evidence #3 — address materialization not folded into the addressing mode
synth assumes
r11 == 0(verified: r11 is never written in the function) and routes everylinmem access through
[r11, r12]after computing the effective address in r12. Thumb-2offers
ldr rD,[rN,#imm](imm up to 4095) and scaled-indexldr.w rD,[rN,rM,lsl #k]—both unused.
Evidence #4 — boolean materialization
wasm
i32.eq/i32.ge_ufeeding abr_ifis lowered as "materialize 0/1 into a reg, thencmp reg,#0, then branch":LLVM branches straight off the compare flags (
beq/bhs) — 1 insn vs 4.Recommended passes (in payoff order)
This is the headline fix (~288 B / ~47% of the gap). The dissolved-native target is
tiny and leaf-heavy; even a simple linear-scan with the wasm validation-derived
liveness would eliminate the redundant reload-from-same-slot pattern. Target: keep the
schedpointer and loop induction vars in registers across the loop.scaled-index form for
i32.shl k; i32.add; load. Removes theadd.w r12,…prologue onevery access (~128 B). Pairs naturally with feat(backend): Add register allocation, code generation, and CFG optimizations #1.
__stack_pointershadow-stack prologue/epilogue when the function touchesno wasm linear memory below the shadow frame (provable: no
memory.*/static-offsetops). Saves ~64 B here and on every pointer-ABI kernel function.
br_if, branchdirectly off the condition codes instead of materializing a bool +
cmp #0.(~46 B; also a cycle win.)
movelimination, andand #0xffff→uxth/and #0xff→uxtbpeephole.Estimated combined effect: passes 1–3 alone should bring
gust_pollfrom 816 B toward the~300–350 B range (≈1.5–1.7× vs LLVM), within striking distance of the project's
10–20%-overhead thesis. Passes 4–6 close most of the remainder.
Filed from gale's
gustdissolve benchmark (benches/gust/compare-codegen.sh). Reproduce: build native thumbv7m vs wasm→loom→synth→cortex-m3, disassemble withllvm-objdump -d --triple=thumbv7m. Measured on synth v0.11.50 / loom v1.1.14.