Skip to content

v1.1.15 — inliner: acyclic CF (#226) + multi-site small leaves & whole-function DCE (#228)#229

Merged
avrabe merged 16 commits into
mainfrom
release/v1.1.15-acyclic-inline
Jun 22, 2026
Merged

v1.1.15 — inliner: acyclic CF (#226) + multi-site small leaves & whole-function DCE (#228)#229
avrabe merged 16 commits into
mainfrom
release/v1.1.15-acyclic-inline

Conversation

@avrabe

@avrabe avrabe commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Two Z3-gated inliner improvements found by gale's gust codegen bench. The #219 seam-teardown (carrier forwarding) is intentionally excluded — it stays frozen on feat/219-seam-sroa until the G474RE silicon dual-flash confirms it.

#226 — acyclic control-flow inlining

v1.1.14's inline-modelability gate was straight-line only, so a #[inline] leaf whose body is if oob { call panic_bounds_check; unreachable } + a load + arithmetic (gale's mix) stayed an opaque call. This brings the precise acyclic-CF symbolic executor (PR-C) + the Regime B candidate gate to a release: acyclic CF + by-body calls + divergent (no-return) calls are now admitted and proven by the translation validator. Verified: a mix-shaped repro inlines (call $mix → 0, validates).

#228 — multi-site small leaves + whole-function DCE

  • Predicate bug: (call_count == 1 || size < 10) && size < limit made MULTI_CALL_SITE_LIMIT (50) dead — a multi-site callee only passed when size < 10, so nothing in [10,50) ever inlined (gale's 23-insn 2-site gust_mix). Now size < limit (site-count budget governs; Z3 verify-or-revert stays the correctness gate).
  • Whole-function DCE: multi-site inlining orphans the original body; loom optimize had no module-level DCE. New default-on dce-functions CLI pass exposes eliminate_dead_functions (keeps exports / start / element-table targets live; v1.1.11 element remap scrambles the function-pointer table — valid-but-wrong code, falcon SIL gate fails (0.13m → 593.8m) #196-safe).

Validation

  • 414 lib tests green (incl. new test_inline_multisite_small_leaf_228); clippy clean.
  • e2e: loom optimize --passes inline,dce-functions on an 11-insn 2-site leaf → inlined at both sites + orphan removed + validates. mix-shaped repro still inlines.
  • Merge only through the required status checks.

Refs #226 #228

avrabe and others added 16 commits June 22, 2026 21:07
Faithful local repro of gale #219's C<->Rust decide seam: `decide` returns
{action, new_count} packed into an i64; the shim unpacks them. After
`--passes inline` the `run` body contains the pack (extend_i32_u/shl/or) →
i64 carrier local → unpack (and/shr_u/wrap_i64) round-trip on one value —
exactly the residue to dissolve.

Verified: optimizes + validates today; inline leaves the round-trip intact
(confirmed via `loom optimize --passes inline` + wasm-tools print). The
seam-SROA pass (next) must forward the scalar components and drop the carrier,
yielding `(action & 0xff) + new_count` — each rewrite Z3-proven.

Refs #219
gale attached the authentic post-loom-inline dissolved z_impl_k_sem_give on the
issue. Decoded + verified: sha256 f81da42d… (matches), 5254 B, valid wasm. It
contains the u64 pack/unpack round-trip to dissolve — pack (extend_i32_u/shl/or)
at body lines 59-63, unpack (and/shr_u/wrap_i64) at 90/104, with a dead i64
carrier. Grounds the seam-SROA pass against the real body, not just the
synthetic fixture.

Design confirmed on #219 (proof-carrying: Z3 obligation per ISLE rewrite;
ISLE-term route preferred). Implementation (mem2reg + pack/unpack forwarding +
DCE) follows.

Refs #219
…tend_i32_u)

Lays the ISLE-layer foundation for the #219 pack/unpack forwarding rules:
- types.isle: I32WrapI64 / I64ExtendI32U ValueData variants.
- constructors.isle: decl + extern constructor + extern extractor for each.
- loom-shared/src/lib.rs: i32_wrap_i64_extract / i64_extend_i32_u_extract
  (mirror iadd32_extract, single-field).
- build.rs: register seam_sroa.isle AND add the two extractors to the
  fix_extractor_ownership list (cranelift-isle 0.132.1 emits extractor args as
  owned `Value`; the post-processor rewrites them to `&Value` — without this the
  generated trie move-then-borrows arg0 and loom-shared won't compile).
- seam_sroa.isle: the unconditional, Z3-trivial rule
  `wrap_i64(extend_i32_u(x)) → x`.

loom-shared builds clean.

IMPORTANT (next step): the generated ISLE `simplify` is NOT in loom's live
rewrite path — build.rs documents this, and the active rewriters are the Rust
`rewrite_pure_impl` / `rewrite_with_dataflow` (loom-shared/src/lib.rs), validated
by the Z3 translation validator per pass. So this rule does not fire yet; the
LIVE seam-SROA rewrite must be added as a `rewrite_pure_impl` match arm
(ValueData::I32WrapI64{val} where val is I64ExtendI32U{inner} → inner), with the
Z3 verifier discharging the obligation. ISLE terms retained as proof-layer
documentation of the identity.

Refs #219
The actual rewrite (the ISLE rule committed earlier is in loom's dead simplify
path; the live rewriters are the Rust rewrite_pure_impl/rewrite_with_dataflow,
validated by the Z3 translation validator per pass). Adds rewrite_pure_impl
match arms for the two #219 conversion terms:

- I32WrapI64: recurse operand; if it is I64ExtendI32U(inner) → inner
  (unconditionally sound: extract(31,0, zero_ext(32,x)) = x — the u64-ABI
  round-trip a packed scalar pays at the dissolved decide seam); also folds
  wrap_i64(i64.const N) → i32.const (low 32).
- I64ExtendI32U: recurse operand; folds extend_i32_u(i32.const N) → i64.const
  (zero-extended).

Test: test_seam_sroa_wrap_extend_identity — wrap_i64(extend_i32_u(local.get 0))
dissolves to local.get 0. Full `cargo test --release --lib --test-threads=4`
green, 0 failures.

This is the foundation case. The sem repro's full dissolution additionally
needs the carrier mem2reg + the masked and/shr_u forwarding rules (steps 2-3),
since its unpack uses and/shr_u (not wrap) on the packed u64.

Refs #219
…lves)

Dissolves the u64-pack unpack arithmetic, in rewrite_pure_impl (Z3-validated per
pass; pure-integer so the verifier actually verifies — no skip). Each rule is
unconditionally sound with only constant side-conditions (no range analysis):

I64And (the `& mask` unpack):
- (shl Z k) & M → 0           when M's set bits are all below k (M>>k==0, 0<k<64)
- (or A B) & M → survivor & M  when one OR operand is such a cleared shift

I64ShrU (the `>> k` unpack):
- (shr_u (shl Z k) k) → Z & (u64::MAX>>k)   (shl-then-shr-same masks low 64-k)
- (shr_u (or P Q) k) → (P>>k)|(Q>>k)         when an operand is (shl _ k)
  (logical shift distributes over OR; targeted to the pack shape — no general
  bloat; recursion collapses the shl side via the rule above)

Together these reduce a forwarded pack `(or (shl (extend_u hi) 32) lo)`:
  & 0xff  → lo & 0xff                         (low field)
  >> 32   → (extend_u hi) & 0xffffffff         (high field; lo const shifts out)

Tests: test_seam_sroa_mask_clears_shifted_pack, test_seam_sroa_shr_extracts_
high_field (+ the step-1 wrap/extend test). Full `cargo test --release --lib
--test-threads=4` green (399/28/17, 0 failures).

These fire once inlining delivers pack/unpack adjacency (the inline-modelability
extension for control-flow + trap-call decides is the remaining prerequisite,
tracked separately — higher blast radius, needs gale silicon validation).

Refs #219
…ier)

Design for the #219 inline prerequisite: wire in precise br_table/br_if/br
symbolic execution (the deferred ExecutionState/merge_states scaffolding) so the
verifier can soundly prove inline-equivalence of br_table-dispatching decides.

Approach: state-merging-with-ITE via a label/continuation stack (NOT path
forking), exact for the acyclic single-exit case. Grounded in Alive2 (PLDI'21,
the production SMT bounded-TV precedent), Kuznetsov state-merging (PLDI'12), WASP
(ECOOP'22 wasm br_table rules), Boogie block-predicate WP. Includes the soundness
argument, the explicit pitfalls (selector constraint, fall-through join, ⊥-not-
havoc for trap arms, branch arity, over-approx-and-skip), 2-phase staging
(sem_give integer first, then memory for mutex/pipe), and the adversarial
verify-or-revert + silicon test gate. For gale review before implementation.

Refs #219
…tection)

First component of the PR-C precise-CF inline-verifier extension (design approved
by gale on #219, 2026-06-17). `is_noreturn_callee(func)`: conservative-sound —
true iff the body has no Return and no Br/BrIf/BrTable anywhere (recursing) and
ends in Unreachable. Admits exactly the core::panicking helpers (panic_fmt →
unreachable; panic_const_add_overflow → call panic_fmt; unreachable) and rejects
any normal-returning or branching function.

The by-body inline modeler (PR-C M3, next) will use this to encode a `call` to
such a callee as ⊥ "diverge if reached" — never havoc — so a trapping branch
constrains only reachability, never the post-state (Alive2 `ub` discipline;
avoids the #155/#159 havoc-vs-concrete false-positive surface).

Test: test_is_noreturn_callee (bare trap / trap-after-call / normal-return /
no-trailing-trap / branch-escape). #[allow(dead_code)] until the M3 gate wires it.

Refs #219
The soundness-critical core of the inline-modelability extension: a precise
symbolic executor for structured, acyclic wasm control flow (block / if / br /
br_if / br_table) by state-merging-with-ITE over a label/join stack — never
path-forking. For a bounded acyclic region this is EXACT (a passing SMT check is
a real proof). Precedent: Alive2 (PLDI'21), Kuznetsov state-merging (PLDI'12),
WASP br_table rules (ECOOP'22). See docs/design/pr-c-precise-cf-verification.md.

Built as a NEW ISOLATED function, not by mutating the shared encoders
(encode_block_body / encode_function_to_smt_impl_inner stay untouched -> zero
regression risk to existing passes). encode_simple_instruction's catch-all
HAVOCS (pushes a fresh BV without popping operands), so delegating an unmodeled
op would desync the stack; instead an allowlist gate (is_acyclic_modelable_body)
front-loads soundness and the executor runs only on admitted functions.
Over-approximate-and-skip everything else: loops (back-edge), calls, memory,
unknown opcodes, multi-result/param blocks.

br_table default guard is UNSIGNED bvuge(sel, n) — a signed >= would leave
selectors in [2^31, 2^32) in a partition gap.

All #[cfg(feature="verification")] + #[allow(dead_code)] until M3 wires it into
inline_functions' verify-or-revert. Tests (5, green): br_table switch proves
equivalent to the if/else chain; identity; detects wrong-default and
swapped-arm counterexamples; gate rejects loops/memory, admits the switch.
Also folds in cargo-fmt drift in loom-shared from #219 SROA @4356441.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e acyclic executor

Extends exec_acyclic to model a `call F` BY BODY (the inline-soundness principle
from #155 extended to control flow): pop args, recursively encode the callee
through exec_acyclic with params bound to the args, declared locals zero-init,
globals SHARED with the caller (callee global writes propagate back); push the
merged single result and narrow the caller path to the callee's returning paths.
A call to a no-return callee (is_noreturn_callee: panic -> unreachable) sets
reachable=false — the path DIVERGES (drops out of the result ITE), NEVER havoc
(Alive2 ub discipline; havoc reopens the #155/#159 false-positive surface).

Threads VerificationSignatureContext (function_bodies, the existing #151 callee
lookup) + a recursion depth into exec_acyclic / encode_acyclic_function /
is_acyclic_modelable_body. MAX_ACYCLIC_CALL_DEPTH=8 bounds by-body inlining and
makes a cyclic/recursive call graph fail safely (reject, never infinite-unroll).
The gate now admits Call(g) iff g is a local function that is no-return or
recursively acyclic-modelable; imports/call_indirect/over-deep stay rejected.
exec_acyclic remains the soundness enforcer (Err, never havoc, on anything
unmodelable).

Still #[cfg(verification)] + #[allow(dead_code)] until M3 wires it into
inline_functions' verify-or-revert. Tests (7 total, green): + call-by-body of a
br_table callee proves equivalent to its inlined switch (the seam capability);
+ no-return call diverges not havoc.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the equivalence harness over the precise acyclic CF executor: encodes both
functions via encode_acyclic_function (deterministic input names unify in Z3) and
checks a == b has no counterexample. Mirrors verify_function_equivalence's
soundness boundaries (no width-matching the final equality; Unknown -> Err).

Returns Err whenever either side is not acyclic-modelable, a width mismatches, or
the encoding fails — so the caller (inline_functions' verify-or-revert, wired in
M3.2) FALLS BACK to the existing encoder. This only ADDS proving power for the
br_table-callee inline case; it never removes a check. Void functions defer to
the fallback (no side-effect equivalence modeled yet).

Still #[allow(dead_code)] until M3.2 wires it into inline_functions. Tests
(9 total, green): + test_verify_acyclic_equivalence_inline — a caller that
calls a br_table decide verifies equivalent to the inlined switch, a
wrong-default body does NOT verify, and a loop function Errs (caller falls back).

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…liner

TranslationValidator::verify() now tries verify_acyclic_equivalence for the
inline pass before the existing encoder. The main encoder models br_table
approximately and a br_table callee is not straight-line-by-body-modelable, so
it cannot prove `call F` (br_table callee) equal to its inlined body → the inline
reverts. The acyclic executor is EXACT for acyclic CF + by-body calls, so it can.

PURELY ADDITIVE + scoped to pass_name == "inline_functions": ONLY a precise
Ok(true) short-circuits to accept; Ok(false)/Err/panic FALL THROUGH to the
existing verify_function_equivalence_with_context unchanged (acyclic call wrapped
in its own catch_unwind). This can only ADD accepts (the br_table-callee inlines),
never weaken an existing check; every non-inline pass is byte-identical. This also
makes the M1a/M1b/M3.1 executor chain live (the #[allow(dead_code)] markers are
now harmless no-ops).

No br_table callee is admitted as an inline CANDIDATE yet (function_inline_modelable
is still straight-line — M3.3), so this is a behavioral no-op on the current corpus:
full `cargo test --release --lib` is 408 passed / 0 failed / 2 ignored, unchanged.
clippy + fmt clean. The capability is proven by the 9 acyclic unit tests; M3.3
enables it end-to-end on the seam.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…+ Unreachable trap

function_inline_modelable now admits a callee under EITHER regime A (straight-line
incl. memory — the existing by-body modeler, #155/#157/#161) OR regime B (acyclic
control flow + pure-integer ops + by-body calls to no-return/recursively-modelable
local callees — the precise acyclic executor, PR-C). Mirrors
verify::is_acyclic_modelable_body so the candidate gate admits exactly what the
acyclic verifier can prove (admitting an unprovable callee → caller reverts, the
#159 collateral problem). Threads module context (functions + num_imported_funcs)
into the gate + a recursion-depth bound; a br_table+memory callee passes neither
regime → stays opaque until Phase 2.

Also handles Unreachable as a trap (⊥) in exec_acyclic and both gates: the path
diverges (reachable=false, drops out of the result ITE), never havoc — the seam
decides end their overflow arms in `call panic_const; unreachable`, so this is
required to model them at all.

Full `cargo test --release --lib` = 409 passed / 0 failed / 2 ignored (+1:
test_acyclic_unreachable_diverges). fmt + clippy clean.

SCOPE NOTE (found by inspecting the real repro-219/sem.loom.wasm): decide
(gale_k_sem_give_decide) IS regime-B-eligible (acyclic br_table + i64 pack +
no-return panic call, no memory), so it is now an inline candidate. BUT its caller
z_impl_k_sem_give READS MEMORY (i32.load offset=8/12 for decide's args) and calls
imports, so proving decide's inline INTO z_impl needs the acyclic executor to model
z_impl's memory = PHASE 2. So sem_give's seam will not dissolve until Phase 2
(memory Array threaded through exec_acyclic). M3.1-M3.3 are correct + necessary
infra; Phase 2 is the next real enabler for sem_give.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ecutor

The acyclic executor now models linear memory (loads/stores), required because
real seam callers (z_impl_k_sem_give) read memory for the decide's args.

AcyclicState gains `memory: Array`. Memory is MUTABLE branch state, so it is
ITE-merged at every join exactly like locals/globals: BranchEntry/MergedState
carry the Array, acyclic_merge_entries ITE-merges it (Bool::ite over the Array
sort), acyclic_push_branch records it, acyclic_apply_join sets it, and the If arm
saves/restores it for the else branch. exec_acyclic handles I32/I64 load/store +
partial-width load8/16/store8/16 EXPLICITLY via the shared encode_i32_load_from /
encode_i64_load_from / encode_*_store_into / encode_partial_* helpers (the same
ones the main encoder + by-body modeler use -> bit-identical memory expressions);
they are NEVER delegated to encode_simple_instruction (which havocs). memory is
threaded through encode_acyclic_call (shared + propagated like globals) and
initialized in encode_acyclic_function as Array::new_const("memory", ...) -- same
name as the main encoder so both encodings unify.

Gates: new is_acyclic_memory_op (verify.rs) admitted by is_acyclic_modelable_body;
lib-side is_acyclic_memory_modelable_instr admitted by the regime-B inline gate.

Call gate stays STRICT (no impure-call havoc yet -- that is Phase 2b, the
soundness-critical part). So sem_give's z_impl (which also calls k_spin_lock import
+ z_unpend) is not yet fully modelable; 2a is the contained, testable foundation.

Full `cargo test --release --lib` = 410 passed / 0 failed / 2 ignored. New test
test_acyclic_memory_merge: store-on-each-branch then load proves equivalent to the
direct if/else (validates the per-join memory ITE-merge feeds the loaded value).
test_acyclic_modelable_gate updated (memory now admitted). fmt + clippy clean.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The acyclic executor now models impure calls (imports + non-acyclic-modelable
local callees) by HAVOC, so memory-bearing seam callers like z_impl_k_sem_give
(which call k_spin_lock + z_unpend) become verifier-modelable end-to-end.

exec_acyclic's Call arm is now 3-tier (mirrors the main encoder): no-return ->
diverge; acyclic single-result local -> by-body (encode_acyclic_call); else
(import / multi-result / too-deep / loop-containing) -> acyclic_havoc_call: pop
args, push fresh `call_{g}_result_{i}`, havoc all globals + memory with a SHARED
per-encode counter. The counter is threaded (&mut u64) through exec_acyclic +
encode_acyclic_call and reset per encode_acyclic_function -- so the Nth impure
call gets the SAME havoc id in the original and optimized encodings -> they unify
in Z3. Havoc is conservative-sound: it never forges a value equal to a concrete
one, so a havoc-vs-concrete mismatch refuses to prove (the #155/#159 guard).

is_acyclic_modelable_body is now purely STRUCTURAL (drops ctx/depth params -- any
direct call is modelable since havoc covers the opaque case; clippy
only_used_in_recursion). The lib candidate gate is unchanged (decide already
admitted via its no-return panic call).

Full `cargo test --release --lib` = 413 passed / 0 failed / 2 ignored. New tests:
test_acyclic_havoc_call_unifies (havoc id parity); test_acyclic_havoc_not_concrete
(havoc'd result NOT provably equal to a concrete value -- soundness);
test_acyclic_havoc_memory (load after impure call reads havoc'd memory, not the
initial -- soundness). fmt + clippy clean.

Next: e2e dissolve on repro-219/sem.loom.wasm (decide's inline into z_impl should
now be PROVEN by the acyclic path -> no `call $gale_k_sem_give_decide`).

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ction DCE

The inline_functions candidate predicate `(call_count==1 || size<10) && size<limit`
made MULTI_CALL_SITE_LIMIT dead: a multi-call-site callee only passed when size<10,
so nothing in [10,50) ever inlined (gale's 23-insn gust `mix` seam). Govern purely
by the site-count-dependent budget (`size < limit`); the per-inline Z3 verify-or-revert
stays the correctness backstop.

Companion: multi-site inlining duplicates the body and orphans the original, but
`loom optimize` never ran whole-function DCE (its `dce` = eliminate_dead_code is
intra-function only). Expose eliminate_dead_functions as a CLI `dce-functions` pass
after `inline` (default-on). It keeps exports, the start function, and element-table
(indirect-call) targets live, conservative on parse failure (#196-safe).

Test: test_inline_multisite_small_leaf_228 (11-insn 2-site leaf inlines at both sites).
e2e: --passes inline,dce-functions removes the orphan + validates. 414 lib tests green.

Refs #228

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ves & whole-function DCE (#228)

Found by gale's gust codegen bench. Both fixes are Z3-gated (verify-or-revert).

- #226: precise acyclic-CF symbolic executor (PR-C) + Regime B candidate gate —
  inlines acyclic-CF callees with by-body / divergent (no-return) calls (gale's
  mix: bounds-check if + panic + load). Was straight-line-only before.
- #228: candidate predicate governed by the site-count budget (size < limit) so
  multi-site leaves in [10,50) inline; new default-on CLI dce-functions pass
  (#196-safe whole-function DCE) removes the orphans multi-site inlining creates.

The #219 seam-teardown (carrier forwarding) is intentionally NOT here — frozen on
feat/219-seam-sroa until the G474RE silicon dual-flash confirms it.

Refs #226 #228

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@avrabe avrabe merged commit af438fc into main Jun 22, 2026
22 of 24 checks passed
@avrabe avrabe deleted the release/v1.1.15-acyclic-inline branch June 22, 2026 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant