inline_functions never inlines small multi-call-site leaves (size<10 OR-term shadows MULTI_CALL_SITE_LIMIT); + optimize CLI lacks whole-function DCE #228

Description

@avrabe

inline_functions: small multi-call-site leaves are never inlined — size < 10 OR-term shadows MULTI_CALL_SITE_LIMIT (dead constant); + loom optimize has no whole-function DCE

Context

gale's gust codegen bench (pulseengine/gale#97) measures dissolved gust_mix at 2.81× native LLVM cycles. A large share of that comes from an export wrapper that calls (bl) an inner body (gust_wasm_scout::mix); the two are never merged, so synth lowers two functions, a call, and a stack round-trip where native has a single function.

Root cause (verified live on the real module, loom 1.1.14)

The inner mix is 23 instructions, fully modelable (local.tee+select), with 2 call sites (the gust_mix export wrapper and gust_poll). It is never inlined: loom optimize --passes inline and full loom optimize both leave 6 functions, with gust_mix still calling it.

The candidate predicate, loom-core/src/lib.rs:13476:

if (call_count == 1 || size < 10) && size < limit {
    inline_candidates.push(*func_idx);
}

With call_count==2 and size==23, both OR terms are false → never a candidate, never offered to the Z3 gate. MULTI_CALL_SITE_LIMIT = 50 (lib.rs:13469-13475) is dead code: it's only read when call_count != 1, but in that branch the OR already requires size < 10, so nothing in [10,50) ever reaches size < limit. The "small function cheap from multiple sites" doc comment (13456-13462) describes only the size<10 behavior.
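The shadowing can be checked mechanically. The following is a minimal sketch, not loom's code: it copies the predicate quoted above into a standalone function and takes both limits as parameters. The single-site limit of 200 is an assumed placeholder (the issue does not state it), and `multi_limit_is_observable` is a hypothetical helper name.

```rust
/// Standalone copy of the current candidate predicate.
/// `single_limit` and `multi_limit` are parameters so the effect of the
/// multi-site limit can be probed; only the predicate shape is from the issue.
fn is_candidate_current(
    call_count: usize,
    size: usize,
    single_limit: usize,
    multi_limit: usize,
) -> bool {
    let limit = if call_count == 1 { single_limit } else { multi_limit };
    (call_count == 1 || size < 10) && size < limit
}

/// Returns true if switching the multi-site limit from `a` to `b` changes the
/// decision for any multi-site function (2..8 call sites, sizes below 1000).
/// The single-site limit is fixed at an assumed 200; it does not matter here.
fn multi_limit_is_observable(a: usize, b: usize) -> bool {
    (2..8usize).any(|call_count| {
        (0..1000usize).any(|size| {
            is_candidate_current(call_count, size, 200, a)
                != is_candidate_current(call_count, size, 200, b)
        })
    })
}
```

Under these assumptions, `multi_limit_is_observable(50, 1000)` and `multi_limit_is_observable(10, 50)` are both false: any multi-site limit of 10 or more yields the same decisions, which is what "dead constant" means here. Only values below 10 (for example 5 versus 50) are observable.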

This is NOT size/modelability (23 < 50, fully modelable), NOT Z3 revert (never reached), NOT export-caller protection (none exists in the candidate/inline loops 13429-13540), NOT the fixpoint guard.

Fix

if size < limit {   // let the site-count-dependent budget govern; Z3 (13532) stays the backstop
    inline_candidates.push(*func_idx);
}

This admits a 23-insn 2-site leaf under MULTI_CALL_SITE_LIMIT=50; the existing per-inline Z3 verify_or_revert (13532) bounds correctness risk.
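To gauge the blast radius of the change, here is a small sketch comparing the current and proposed predicates side by side. It is illustrative only: `SINGLE_CALL_SITE_LIMIT` and its value are assumptions (only `MULTI_CALL_SITE_LIMIT = 50` comes from the issue), and `limit_for` and `changed_sizes` are hypothetical helper names.

```rust
const MULTI_CALL_SITE_LIMIT: usize = 50; // value from the issue
const SINGLE_CALL_SITE_LIMIT: usize = 200; // assumed placeholder, not from loom

/// Site-count-dependent budget (assumed shape of the existing selection).
fn limit_for(call_count: usize) -> usize {
    if call_count == 1 {
        SINGLE_CALL_SITE_LIMIT
    } else {
        MULTI_CALL_SITE_LIMIT
    }
}

/// Predicate as it stands today.
fn is_candidate_current(call_count: usize, size: usize) -> bool {
    (call_count == 1 || size < 10) && size < limit_for(call_count)
}

/// Proposed predicate: the budget alone decides.
fn is_candidate_proposed(call_count: usize, size: usize) -> bool {
    size < limit_for(call_count)
}

/// Sizes below `max_size` whose candidacy differs between the two predicates.
fn changed_sizes(call_count: usize, max_size: usize) -> Vec<usize> {
    (0..max_size)
        .filter(|&size| {
            is_candidate_current(call_count, size) != is_candidate_proposed(call_count, size)
        })
        .collect()
}
```

In this model, `changed_sizes(1, 300)` is empty, so single-call-site behavior is unchanged whatever the single-site limit is, and `changed_sizes(2, 300)` is exactly the sizes 10 through 49. The only functions newly offered to the Z3 gate are multi-site functions in [10, 50).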

Companion finding — loom optimize has no whole-function DCE (must land with the above)

Multi-site inlining duplicates the body into both callers but leaves the original orphaned, growing the module. The remover eliminate_dead_functions (fused_optimizer.rs:1923/2025) is reached only via optimize_module Phase 0 / run_fused_optimizations (fused_optimizer.rs:170) — optimize_command never calls it (loom-cli/src/main.rs:489-652 runs a hand-rolled pass list whose dce = eliminate_dead_code lib.rs:7149, intra-function only). So an orphaned function survives every loom optimize run. Ask: expose module-level dead-function elimination as a CLI pass (e.g. dce-functions) after inline, or route the CLI default through optimize_module.
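For reference, module-level dead-function elimination boils down to call-graph reachability from the module's roots. The sketch below is a simplified illustration and not the implementation of eliminate_dead_functions: the `Module` type, its fields, and both function names are hypothetical, and roots are assumed to cover exports, the start function, and functions referenced from element segments.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified module model: function `i` directly calls the
/// function indices listed in `calls[i]`.
struct Module {
    calls: Vec<Vec<usize>>,
    /// Functions that must be kept regardless of callers (assumed: exports,
    /// start function, element-segment entries).
    roots: Vec<usize>,
}

/// Worklist traversal: everything reachable from a root is live.
fn live_functions(m: &Module) -> HashSet<usize> {
    let mut live = HashSet::new();
    let mut worklist: Vec<usize> = m.roots.clone();
    while let Some(f) = worklist.pop() {
        // `insert` returns true only the first time, so each function is
        // expanded once even if the call graph has cycles.
        if live.insert(f) {
            worklist.extend(m.calls[f].iter().copied());
        }
    }
    live
}

/// Indices of functions no root can reach, in ascending order.
fn dead_functions(m: &Module) -> Vec<usize> {
    let live = live_functions(m);
    (0..m.calls.len()).filter(|f| !live.contains(f)).collect()
}
```

With two rooted callers (indices 0 and 1) and an inner leaf (index 2), `calls = [[2], [2], []]` reports no dead functions, while the post-inlining graph `calls = [[], [], []]` reports `[2]`. A real pass would additionally have to renumber the remaining function indices in calls, exports, and element segments after removal.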

Secondary (lower impact, real)

  • Inliner emits a redundant arg-copy: inline_calls_in_block (lib.rs:13686-13688) unconditionally local.sets each param into a fresh temp, even when the single arg is already stack-positioned. This leaves a local.set N; … local.get N round-trip that downstream passes must clean up. Add argument-forwarding when the arg is a local.get with no intervening write.
  • simplify_locals can't clean it up: it discards ALL equivalences if the function has any If/Block/Loop (lib.rs:8905-8919), and it removes the equivalence on every LocalTee (8997-9000). So the arg-copy survives the full pipeline.
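A rough sketch of what argument-forwarding could look like follows. It is deliberately narrow and is not loom's inline_calls_in_block: the `Instr` enum is a toy subset, the callee is assumed to have exactly one parameter, no other locals, and a straight-line body, and all function names are hypothetical. Forwarding is only taken when the argument is a plain local.get and the callee never writes its parameter, since a write would otherwise clobber the caller's local.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Instr {
    LocalGet(u32),
    LocalSet(u32),
    LocalTee(u32),
    I32Const(i32),
    I32Add,
}

/// Rewrite every access to local `from` so that it targets local `to`.
fn remap(body: &[Instr], from: u32, to: u32) -> Vec<Instr> {
    body.iter()
        .map(|&i| match i {
            Instr::LocalGet(n) if n == from => Instr::LocalGet(to),
            Instr::LocalSet(n) if n == from => Instr::LocalSet(to),
            Instr::LocalTee(n) if n == from => Instr::LocalTee(to),
            other => other,
        })
        .collect()
}

/// True if the body contains a local.set or local.tee targeting local `n`.
fn writes_local(body: &[Instr], n: u32) -> bool {
    body.iter()
        .any(|i| matches!(i, Instr::LocalSet(m) | Instr::LocalTee(m) if *m == n))
}

/// Produce the sequence replacing `arg; call callee` for a one-parameter callee.
/// `param` is the callee's parameter index, `fresh` a caller temp reserved for it.
fn inline_single_arg(arg: Instr, body: &[Instr], param: u32, fresh: u32) -> Vec<Instr> {
    match arg {
        // Forward: read the caller's local directly, no temp and no copy.
        Instr::LocalGet(src) if !writes_local(body, param) => remap(body, param, src),
        // Fallback: materialize the argument into the fresh temp, as today.
        _ => {
            let mut out = vec![arg, Instr::LocalSet(fresh)];
            out.extend(remap(body, param, fresh));
            out
        }
    }
}
```

In this model, inlining the body `local.get 0; i32.const 8; i32.add` at a site whose argument is `local.get 3` yields `local.get 3; i32.const 8; i32.add` with no temp, so there is nothing left for simplify_locals to clean up. A multi-argument version would also need to check that no later argument expression writes the forwarded local.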

Not loom's gap (filed at synth)

The register-materialized shift (movw r4,#8; lsl.w r5,r3,r4) and the materialized-bool select are synth instruction-selection — loom emits the ideal i32.const 8; i32.shl and i32.lt_s; select.

Beat-LLVM (whole-program leverage)

loom sees the closed-world fused module + already runs a Z3 validator. It could forward proof-carrying facts (operand ranges, shift<32, select totality) as side-band metadata so synth folds/branchless-lowers with a proof, and specialize on the closed call set (e.g. drop the u16 mask/clamp bound if all call sites pass in-range) — information an MCU LLVM build, compiling an externally-visible symbol, structurally cannot use.

Reproduce: loom optimize <gust_kernel.wasm> --passes inline then check gust_mix still contains the call.
