inline_functions: small multi-call-site leaves are never inlined — size < 10 OR-term shadows MULTI_CALL_SITE_LIMIT (dead constant); + loom optimize has no whole-function DCE
Context
gale's gust codegen bench (pulseengine/gale#97) measures dissolved gust_mix at 2.81× native LLVM cycles. A large share is an export wrapper that bls an inner body (gust_wasm_scout::mix); they're never merged, so synth lowers two functions + a call + a stack round-trip where native has one.
Root cause (verified live on the real module, loom 1.1.14)
The inner mix is 23 instructions, fully modelable (local.tee+select), with 2 call sites (the gust_mix export wrapper and gust_poll). It is never inlined — loom optimize --passes inline and full loom optimize both leave 6 functions with gust_mix still calling it.
The candidate predicate, loom-core/src/lib.rs:13476:
if (call_count == 1 || size < 10) && size < limit {
    inline_candidates.push(*func_idx);
}
With call_count==2 and size==23, both OR terms are false → never a candidate, never offered to the Z3 gate. MULTI_CALL_SITE_LIMIT = 50 (lib.rs:13469-13475) is dead code: it's only read when call_count != 1, but in that branch the OR already requires size < 10, so nothing in [10,50) ever reaches size < limit. The "small function cheap from multiple sites" doc comment (13456-13462) describes only the size<10 behavior.
This is NOT size/modelability (23 < 50, fully modelable), NOT Z3 revert (never reached), NOT export-caller protection (none exists in the candidate/inline loops 13429-13540), NOT the fixpoint guard.
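The shadowing can be checked in isolation with a small model of the predicate. This is a sketch, not loom's code: `limit_for`, `is_candidate_current`, `admitted_mid_sizes`, and the single-site budget `SINGLE_CALL_SITE_LIMIT = 100` are hypothetical names and values introduced for illustration. Only `MULTI_CALL_SITE_LIMIT = 50` and the predicate's shape come from the report above.

```rust
/// Budget for functions with more than one call site (value quoted in this report).
const MULTI_CALL_SITE_LIMIT: usize = 50;
/// Hypothetical single-site budget; the real value is not quoted in this report.
const SINGLE_CALL_SITE_LIMIT: usize = 100;

/// Assumed shape of the site-count-dependent budget: the multi-site constant
/// is only consulted when `call_count != 1`.
fn limit_for(call_count: usize) -> usize {
    if call_count == 1 {
        SINGLE_CALL_SITE_LIMIT
    } else {
        MULTI_CALL_SITE_LIMIT
    }
}

/// The candidate predicate as quoted from lib.rs:13476.
fn is_candidate_current(call_count: usize, size: usize) -> bool {
    let limit = limit_for(call_count);
    (call_count == 1 || size < 10) && size < limit
}

/// Sizes in [10, MULTI_CALL_SITE_LIMIT) that the predicate admits for a
/// given call count. For any multi-site function this comes back empty.
fn admitted_mid_sizes(call_count: usize) -> Vec<usize> {
    (10..MULTI_CALL_SITE_LIMIT)
        .filter(|&size| is_candidate_current(call_count, size))
        .collect()
}
```

Under these assumptions, `is_candidate_current(2, 23)` is false and `admitted_mid_sizes(2)` is empty, which matches the observed behavior: for multi-site functions the effective limit is 10, regardless of the constant.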
Fix
if size < limit { // let the site-count-dependent budget govern; Z3 (13532) stays the backstop
    inline_candidates.push(*func_idx);
}
This admits a 23-insn 2-site leaf under MULTI_CALL_SITE_LIMIT=50; the existing per-inline Z3 verify_or_revert (13532) bounds correctness risk.
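To see what the change does and does not admit, the two predicates can be compared over a grid of call counts and sizes. As before, this is a sketch under assumptions: `limit_for`, `current`, `proposed`, `newly_admitted`, and `SINGLE_CALL_SITE_LIMIT = 100` are hypothetical, and the sketch ignores whatever other filters the real candidate loop applies.

```rust
const MULTI_CALL_SITE_LIMIT: usize = 50;
/// Hypothetical single-site budget; the real value is not quoted in this report.
const SINGLE_CALL_SITE_LIMIT: usize = 100;

fn limit_for(call_count: usize) -> usize {
    if call_count == 1 {
        SINGLE_CALL_SITE_LIMIT
    } else {
        MULTI_CALL_SITE_LIMIT
    }
}

/// Predicate as it stands today.
fn current(call_count: usize, size: usize) -> bool {
    (call_count == 1 || size < 10) && size < limit_for(call_count)
}

/// Proposed predicate: the budget alone decides.
fn proposed(call_count: usize, size: usize) -> bool {
    size < limit_for(call_count)
}

/// All (call_count, size) pairs that the proposed predicate accepts and the
/// current one rejects, for call counts 1..=max_calls and sizes 0..=max_size.
fn newly_admitted(max_calls: usize, max_size: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for call_count in 1..=max_calls {
        for size in 0..=max_size {
            if proposed(call_count, size) && !current(call_count, size) {
                out.push((call_count, size));
            }
        }
    }
    out
}
```

In this model the only newly admitted pairs are multi-site functions with sizes in [10, 50); single-site behavior is unchanged and nothing previously accepted is rejected.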
Companion finding — loom optimize has no whole-function DCE (must land with the above)
Multi-site inlining duplicates the body into both callers but leaves the original orphaned, growing the module. The remover eliminate_dead_functions (fused_optimizer.rs:1923/2025) is reached only via optimize_module Phase 0 / run_fused_optimizations (fused_optimizer.rs:170) — optimize_command never calls it (loom-cli/src/main.rs:489-652 runs a hand-rolled pass list whose dce = eliminate_dead_code lib.rs:7149, intra-function only). So an orphaned function survives every loom optimize run. Ask: expose module-level dead-function elimination as a CLI pass (e.g. dce-functions) after inline, or route the CLI default through optimize_module.
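The requested pass amounts to a reachability sweep over the call graph. The sketch below uses a toy module model (`ToyModule`, `live_functions`, `dead_functions` are hypothetical names) and is not the implementation of eliminate_dead_functions. A real pass would presumably also need to treat the start function, element-segment entries, and ref.func targets as roots, and to renumber function indices after removal.

```rust
use std::collections::BTreeSet;

/// Toy module: `calls[f]` lists the functions that `f` calls directly.
/// `roots` stands in for everything externally reachable (exports here).
struct ToyModule {
    calls: Vec<Vec<usize>>,
    roots: Vec<usize>,
}

/// Functions reachable from the roots through direct calls.
fn live_functions(m: &ToyModule) -> BTreeSet<usize> {
    let mut live = BTreeSet::new();
    let mut work: Vec<usize> = m.roots.clone();
    while let Some(f) = work.pop() {
        // `insert` returns true only the first time, so each function is
        // expanded once even if the call graph has cycles.
        if live.insert(f) {
            work.extend(m.calls[f].iter().copied());
        }
    }
    live
}

/// Functions no root can reach; these are safe to drop in this toy model.
fn dead_functions(m: &ToyModule) -> Vec<usize> {
    let live = live_functions(m);
    (0..m.calls.len()).filter(|f| !live.contains(f)).collect()
}
```

Modeling the case from this report (index 0 = gust_mix wrapper, 1 = gust_poll, 2 = inner mix, with both callers assumed to be roots): before inlining both callers reference function 2 and nothing is dead; once both call sites are inlined, function 2 is the only dead function and is exactly the orphan that currently survives.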
Secondary (lower impact, real)
- Inliner emits a redundant arg-copy: inline_calls_in_block (lib.rs:13686-13688) unconditionally local.sets each param into a fresh temp even when the single arg is already stack-positioned, leaving a local.set N; … local.get N round-trip that downstream passes must clean up. Add argument-forwarding when the arg is a local.get with no intervening write.
- simplify_locals can't clean it up: it discards ALL equivalences if the function has any If/Block/Loop (lib.rs:8905-8919), and removes the equivalence on every LocalTee (8997-9000). So the arg-copy survives the full pipeline.
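A minimal sketch of the argument-forwarding idea, on a toy instruction set rather than loom's IR. Everything here is hypothetical (`Insn`, `never_written`, `inline_one_param`), and it assumes a callee with exactly one parameter (local 0) and no other locals, so the inlined body cannot write the caller's source local.

```rust
#[derive(Clone, Debug, PartialEq)]
enum Insn {
    LocalGet(u32),
    LocalSet(u32),
    LocalTee(u32),
    I32Const(i32),
    I32Add,
}

/// True if `body` never writes local `idx` (no local.set / local.tee of it).
fn never_written(body: &[Insn], idx: u32) -> bool {
    !body
        .iter()
        .any(|i| matches!(i, Insn::LocalSet(n) | Insn::LocalTee(n) if *n == idx))
}

/// Inline a one-parameter callee body at a call site whose argument is `arg`.
/// `temp` is a fresh caller local used only on the fallback path.
fn inline_one_param(arg: &Insn, body: &[Insn], temp: u32) -> Vec<Insn> {
    if let Insn::LocalGet(src) = arg {
        if never_written(body, 0) {
            // Forwarding: read the caller's local directly, no copy emitted.
            return body
                .iter()
                .map(|i| match i {
                    Insn::LocalGet(0) => Insn::LocalGet(*src),
                    other => other.clone(),
                })
                .collect();
        }
    }
    // Fallback: materialize the argument into `temp`, then remap the param.
    let mut out = vec![arg.clone(), Insn::LocalSet(temp)];
    out.extend(body.iter().map(|i| match i {
        Insn::LocalGet(0) => Insn::LocalGet(temp),
        Insn::LocalSet(0) => Insn::LocalSet(temp),
        Insn::LocalTee(0) => Insn::LocalTee(temp),
        other => other.clone(),
    }));
    out
}
```

When the argument is a plain local.get and the callee never writes its parameter, the copy disappears at the source instead of relying on simplify_locals to remove it later. Any other argument shape, or a callee that writes its parameter, keeps the existing temp-copy behavior.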
Not loom's gap (filed at synth)
The register-materialized shift (movw r4,#8; lsl.w r5,r3,r4) and the materialized-bool select are synth instruction-selection — loom emits the ideal i32.const 8; i32.shl and i32.lt_s; select.
Beat-LLVM (whole-program leverage)
loom sees the closed-world fused module and already runs a Z3 validator. It could forward proof-carrying facts (operand ranges, shift<32, select totality) as side-band metadata so synth folds/branchless-lowers with a proof, and specialize on the closed call set (e.g. drop the u16 mask/clamp bound if all call sites pass in-range). That is information an MCU LLVM build, compiling an externally-visible symbol, structurally cannot use.
Reproduce: run loom optimize <gust_kernel.wasm> --passes inline, then check that gust_mix still contains the call.