Skip to content

Latest commit

 

History

History
881 lines (695 loc) · 39.7 KB

File metadata and controls

881 lines (695 loc) · 39.7 KB

Performance Optimisations

Tracks engine performance work: what was tried, what failed, and what's planned.

Summary

# Optimisation Date Result
1 WeakMap-cached DocumentIndex March 2026 ✗ Failed (–4–11%)
2 Lightweight shadow construction March 2026 ✅ Done (+5–7%)
3 Wire index by trunk key March 2026 ✗ Failed (–10–23%)
4 Cached element trunk key March 2026 ✅ Done (~0%, code cleanup)
5 Skip OTel when idle March 2026 ✅ Done (+7–9% tool-heavy)
6 Constant cache March 2026 ✅ Done (~0%, no regression)
7 pathEquals loop March 2026 ✅ Done (~0%, code cleanup)
8 Pre-group element wires March 2026 ✅ Done (see #9)
9 Batch element materialisation March 2026 ✅ Done (+44–130% arrays)
10 Sync fast path for resolved values March 2026 ✅ Done (+8–17% all, +42–114% arrays)
11 Pre-compute keys & cache wire tags March 2026 ✅ Done (+12–16% all, +60–129% arrays)
12 De-async schedule() & callTool() March 2026 ✅ Done (+11–18% tool, ~0% arrays)
13 Share toolDefCache across shadows March 2026 🔲 Planned
14 Pre-compute output wire topology March 2026 🔲 Planned
15 Cache executeBridge setup per doc March 2026 🔲 Planned
16 Cheap strict-path hot-path guards March 2026 ✅ Done (partial recovery after error-mapping work)
17 v3 scope-based engine March 2026 ✅ Done (+2.2× flat-1000 vs v2 baseline, 43% of compiled)

Baseline (main, March 2026)

Benchmarks live in packages/bridge/bench/engine.bench.ts (tinybench). Historical tracking via Bencher.

Run locally: pnpm bench

Hardware: MacBook Air M4 (4th gen, 15″). All numbers in this document are from this machine — compare only against the same hardware.

Benchmark ops/sec avg (ms)
parse: simple bridge ~42K 0.024
parse: large bridge (20×5) ~1.1K 0.930
exec: passthrough (no tools) ~463K 0.002
exec: short-circuit ~423K 0.002
exec: simple chain (1 tool) ~276K 0.004
exec: chained 3-tool fan-out ~89K 0.012
exec: flat array 10 ~180K 0.006
exec: flat array 100 ~50.5K 0.020
exec: flat array 1000 ~6,420 0.157
exec: nested array 5×5 ~58.8K 0.017
exec: nested array 10×10 ~26K 0.039
exec: nested array 20×10 ~14K 0.072
exec: array + tool-per-element 10 ~42.5K 0.024
exec: array + tool-per-element 100 ~4.96K 0.206

This table is the current perf level. It is updated after a successful optimisation is committed.


Optimisations

1. WeakMap-cached DocumentIndex

Date: March 2026 Branch: perf1 Result: ✗ 4–11% slower across every benchmark. Reverted.

Introduced a DocumentIndex class that pre-indexed:

  • Wires by target trunk key (Map<string, Wire[]>)
  • Bridge lookups by type:field key
  • Tool definitions by name

Cached in a WeakMap<Instruction[], DocumentIndex> keyed by the document's instruction array, so the index was built once and reused across ExecutionTree instances sharing the same document.

Shadow trees looked up the index via WeakMap.get(this.document.instructions) in the constructor.

Why it failed:

  1. wireTargetKey() string allocation. The new index function wireTargetKey(w) built a string key for every wire lookup (${module}:${type}:${field}:${instance ?? ""}). This replaced sameTrunk() which does zero-allocation 4-field equality comparison. The Map lookup saved O(n) filtering but the per-call string allocation was more expensive than scanning a small array (~5–20 wires per bridge).

  2. WeakMap.get() on every shadow tree. Each new ExecutionTree() called WeakMap.get(instructions) in the constructor to retrieve the cached index. WeakMap.get() costs ~50–100ns per call (hash + GC barrier), far more than the ~5ns property access it was replacing. For 1000 shadow trees this added ~50–100µs of pure overhead.

  3. Extra indirection layers. The DocumentIndex class added a method call + property access on every wire lookup that wasn't there before.

Lesson learned:

  • For small arrays (≤20 elements), linear scan with zero-allocation comparison (sameTrunk) beats Map lookup with string-key construction.
  • WeakMap is not free — avoid it on the per-shadow-tree hot path.
  • Measure before assuming a theoretical complexity improvement translates to real-world speed. N is usually small in bridge wire arrays.

2. Lightweight shadow tree construction

Date: March 2026 Result: ✅ +5–7% on array benchmarks, +2–4% elsewhere.

Benchmark Before After Change
exec: passthrough 613K 625K +2%
exec: short-circuit 754K 759K +1%
exec: simple chain 378K 391K +3%
exec: chained 3-tool fan-out 138K 143K +4%
exec: flat array 10 43K 46K +6%
exec: flat array 100 4.7K 5.0K +6%
exec: flat array 1000 258 270 +5%
exec: nested array 5×5 14.6K 15.5K +6%
exec: nested array 10×10 4.0K 4.3K +7%
exec: nested array 20×10 2.0K 2.1K +6%
exec: array + tool-per-element 10 20K 21K +6%
exec: array + tool-per-element 100 2.1K 2.2K +5%

Every shadow() call ran the full ExecutionTree constructor, which redundantly re-derived data identical to the parent:

  • instructions.find() — O(I) scan to locate the bridge (same result)
  • pipeHandleMap — rebuilt from bridge.pipeHandles (identical)
  • handleVersionMap — rebuilt by iterating all handles (identical)
  • constObj — rebuilt by iterating all instructions (identical constants)
  • { internal, ...(toolFns ?? {}) } — new object spread (same tools)

Refactored shadow() to bypass the constructor and copy pre-computed fields from the parent via Object.create(ExecutionTree.prototype).

shadow(): ExecutionTree {
  const child = Object.create(ExecutionTree.prototype) as ExecutionTree;
  child.trunk = this.trunk;
  child.document = this.document;
  child.parent = this;
  child.depth = this.depth + 1;
  child.state = {};
  child.toolDepCache = new Map();
  child.toolDefCache = this.toolDefCache;
  child.bridge = this.bridge;
  child.pipeHandleMap = this.pipeHandleMap;
  child.handleVersionMap = this.handleVersionMap;
  child.toolFns = this.toolFns;
  child.tracer = this.tracer;
  child.logger = this.logger;
  child.signal = this.signal;
  return child;
}

Key constraint: Shadow trees must not mutate shared maps (pipeHandleMap, handleVersionMap, toolDefCache) — they are populated in the constructor and only read thereafter.

3. Wire index by target trunk key

Date: March 2026 Result: ✗ 10–23% slower across every benchmark. Reverted.

Added a wiresByTrunk: Map<string, Wire[]> field to ExecutionTree, built once in the root constructor by iterating all wires and keying them with a new wireTrunkKey() function (ignoring element flag). A getWiresForTrunk(target) helper replaced all 11 occurrences of bridge.wires.filter(w => sameTrunk(w.to, target)).

Shared the pre-built index to shadow trees via #2.

Benchmark With #2 only With #2 + #3 Change
exec: passthrough 625K 506K -19%
exec: short-circuit 759K 587K -23%
exec: simple chain 391K 327K -16%
exec: chained 3-tool 143K 120K -16%
exec: flat array 10 46K 40K -13%
exec: flat array 100 5.0K 4.5K -10%
exec: flat array 1000 270 240 -11%

Why it failed:

Same root cause as #1: wireTrunkKey(target) builds a template string (${module}:${type}:${field}:${instance ?? ""}) on every getWiresForTrunk() call. At ~70ns per string allocation, this exceeds the cost of linearly scanning 10 wires with sameTrunk() (~30ns). Even though Map.get() is O(1), the key construction dominates.

Additional bugs found during testing:

  1. trunkKey() treats element: true differently (:* suffix), so element wires went to wrong buckets — required a separate wireTrunkKey() that ignores element.
  2. Removing { type, field } destructuring from run() left orphaned variable references → ReferenceError at runtime.

Lesson learned:

  • Reinforces #1's lesson: any scheme that replaces sameTrunk()'s zero-allocation 4-field comparison with string-keyed Map lookup loses for typical wire counts (5–20).
  • This rules out ALL Map-based wire indexing approaches for the current architecture unless wire counts grow significantly.

4. Cache element trunk key

Date: March 2026 Result: ✅ ~0% (code cleanup, no measurable impact).

trunkKey({ ...this.trunk, element: true }) was called 5 times across shadow-tree hot paths. Each call spread the trunk object and built a template string. Since this.trunk is fixed per tree, the result is constant.

Pre-computed elementTrunkKey as a field, set once in the constructor and copied in the shadow() factory:

private elementTrunkKey: string;
// In constructor:
this.elementTrunkKey = `${trunk.module}:${trunk.type}:${trunk.field}:*`;
// In shadow():
child.elementTrunkKey = this.elementTrunkKey;

Impact within noise (+1–2% on some array benchmarks). Each trunkKey() call saves ~15ns; with 5 calls per shadow tree, a 1000-element array saves ~75µs on a ~3.7ms benchmark (~2%). Kept because the code is cleaner.

5. Skip OTel span when no tracer is configured

Date: March 2026 Result: ✅ +7–9% on tool-heavy benchmarks.

Benchmark Before (#2 only) After Change
exec: passthrough 625K 617K ~0% (no tools)
exec: short-circuit 759K 743K ~0% (no tools)
exec: simple chain 391K 419K +7%
exec: chained 3-tool fan-out 143K 156K +9%
exec: flat array 10 46K 48K +4%
exec: flat array 100 5.0K 5.1K +2%
exec: flat array 1000 270 271 ~0%
exec: nested array 5×5 15.5K 15.9K +3%
exec: nested array 10×10 4.3K 4.4K +2%
exec: nested array 20×10 2.1K 2.1K ~0%
exec: array + tool-per-element 10 21K 22.4K +7%
exec: array + tool-per-element 100 2.2K 2.3K +5%

callTool() always called otelTracer.startActiveSpan(...), allocated metricAttrs, and recorded metrics — even when OpenTelemetry had its default no-op provider and no internal tracer/logger was configured. The span callback closure allocation + template string building added ~200–300ns overhead per tool call.

Lazy-probe the OTel tracer once on first tool call using span.isRecording(). When the tracer is no-op AND no internal tracer or logger is configured, take a fast path that calls fnImpl() directly:

let _otelActive: boolean | undefined;
function isOtelActive(): boolean {
  if (_otelActive === undefined) {
    const probe = otelTracer.startSpan("_bridge_probe_");
    _otelActive = probe.isRecording();
    probe.end();
  }
  return _otelActive;
}

// In callTool():
if (!tracer && !logger && !isOtelActive()) {
  return fnImpl(input, toolContext);
}

Biggest gains on benchmarks with many tool calls per operation (simple chain, chained fan-out, tool-per-element). Array benchmarks with one implicit tool see smaller gains because tool-call overhead is amortised over per-element work.

Caveat: _otelActive is probed once on first tool call and cached. If the OTel SDK is registered after the first tool call runs, the flag will remain false. In practice, OTel SDKs are always registered at application startup before any business logic.

6. Cache coerceConstant() results

Date: March 2026 Result: ✅ ~0% (no measurable impact, no regression).

Module-level Map<string, unknown> cache for coerceConstant(). Avoids repeated JSON.parse for the same constant strings across shadow trees. JSON.parse for short primitives ("true", "42") is already very fast (~15ns), so no measurable improvement. Kept because it prevents redundant work in constant-heavy bridges and has zero regression.

Caveat: Only safe for immutable values (primitives, frozen objects). If callers mutate the returned object, they'd corrupt the cache. Current code does not mutate constant values, so this is safe today.

7. Replace pathEquals .every() with for-loop

Date: March 2026 Result: ✅ ~0% (code cleanup).

Replaced .every() callback with a manual for-loop. No measurable impact — paths are typically 1–2 segments, so the closure overhead was already negligible. Kept for consistency and micro-optimisation hygiene.

8. Pre-group element wires

Date: March 2026 Result: ✅ Combined with #9 (eliminates per-element wire filtering).

Every pullOutputField call did bridge.wires.filter(w => sameTrunk(...) && pathEquals(...)) — for a 1000-element array with 3 output fields that's 5 wires × 3 fields × 1000 elements = 15,000 comparisons per execution.

Added a wireGroupsByPath: Map<string, Wire[]> built once per materializeShadows call, keyed by \0-joined path. Added a thin resolvePreGrouped(wires) method to ExecutionTree that lets materializeShadows call resolveWires on a shadow with pre-grouped wires rather than passing a path to re-filter. The map key uses \0 as a separator since field names are identifiers and can't contain it.

9. Batch element materialisation

Date: March 2026 Result: ✅ +44–130% on flat array benchmarks, +14–43% on all array benchmarks.

Benchmark Before After Change
exec: passthrough 610K 610K ~0%
exec: short-circuit 751K 745K ~0%
exec: simple chain 417K 418K ~0%
exec: chained 3-tool fan-out 152K 156K ~0%
exec: flat array 10 48K 69K +44%
exec: flat array 100 5.1K 8.0K +57%
exec: flat array 1000 270 627 +132%
exec: nested array 5×5 15.8K 21K +33%
exec: nested array 10×10 4.3K 6.1K +42%
exec: nested array 20×10 2.1K 3.0K +43%
exec: array + tool-per-element 10 22K 25K +14%
exec: array + tool-per-element 100 2.2K 2.6K +18%

Instead of Promise.all(N × Promise.all(F fields)), the common case (no nested arrays in the output — deepPaths.size === 0) now uses a single flat Promise.all over all N × F resolutions:

// Before: Promise.all(1000 × Promise.all(3 fields))
// After:  Promise.all(3000 flat resolutions)
const flatValues = await Promise.all(
  items.flatMap((shadow) =>
    directFieldArray.map((name) =>
      shadow.resolvePreGrouped(wireGroupsByPath.get(pathKey)!),
    ),
  ),
);

This collapses 1001 nested Promise.all calls into one, cutting significant microtask scheduling overhead. Combined with #8 (pre-grouped wires), each resolution also skips the bridge.wires.filter call entirely.

Nested arrays (where deepPaths.size > 0) take a slow path that uses #8 pre-grouped wires for direct fields but keeps the existing Promise.all(tasks) structure. Inner nested levels — which have no deepPaths of their own — also benefit from the fast path, which explains the +33–43% gains on nested benchmarks.

Why non-tools aren't affected: Benchmarks without array iteration (passthrough, simple chain, chained fan-out) don't call materializeShadows at all, so they see no change.

10. Sync fast path for resolved values

Date: March 2026 Result: ✅ +8–17% on all benchmarks, +42–114% on array benchmarks.

Benchmark Before After Change
exec: passthrough 610K 728K +19%
exec: short-circuit 745K 778K +4%
exec: simple chain 418K 457K +9%
exec: chained 3-tool fan-out 156K 175K +12%
exec: flat array 10 69K 101K +46%
exec: flat array 100 8.0K 13.0K +63%
exec: flat array 1000 627 1,336 +113%
exec: nested array 5×5 21K 29.4K +40%
exec: nested array 10×10 6.1K 9.0K +48%
exec: nested array 20×10 3.0K 4.6K +53%
exec: array + tool-per-element 10 25K 27.6K +10%
exec: array + tool-per-element 100 2.6K 2.97K +14%

pullSingle() always returned Promise<any>, but for element wires like .id <- it.id the value is already synchronously available in this.state[key]. The previous code did await Promise.resolve(value) even when the value was not a Promise, producing 6–7 microtask hops per element × 1000 elements = 6000–7000 scheduled microtasks costing ~2.8ms of the 3.7ms total for flat-array-1000.

Changes made:

  1. MaybePromise<T> type + isPromise() helper — module-level type alias and guard ('then' in (value as any)) to distinguish live Promises from synchronous values without ever constructing a new Promise.

  2. pullSingle de-asyncedasync pullSingle() replaced with a sync-first implementation:

    // sync fast path
    if (!isPromise(value)) return this.applyPath(value, ref);
    // async path only when tool result is still pending
    return (value as Promise<any>).then((resolved) =>
      this.applyPath(resolved, ref),
    );

    Extracted applyPath(resolved, ref) as a private helper for the shared path-traversal logic used by both paths.

  3. resolveWires fast path — new method that detects the common case: a single from wire with no modifiers (no safe, no falsy/nullish/catch fallbacks). In that case it calls pullSingle directly and returns MaybePromise<any>. All other cases fall through to the existing async resolveWiresAsync (renamed from the old resolveWires).

  4. materializeShadows sync collection — replaced Promise.all(items.flatMap(...)) with a loop that writes into a pre-allocated flat array and sets a hasAsync flag on the first Promise it encounters:

    const rawValues: MaybePromise<unknown>[] = new Array(nItems * nFields);
    let hasAsync = false;
    for (...) {
      const v = shadow.resolvePreGrouped(wireGroupsByPath.get(pathKey)!);
      rawValues[i * nFields + j] = v;
      if (!hasAsync && isPromise(v)) hasAsync = true;
    }
    const flatValues = hasAsync
      ? await Promise.all(rawValues)
      : (rawValues as unknown[]);

    For element wires where all values come from state, hasAsync stays false and no Promise.all is ever constructed — zero microtask overhead.

Why non-array benchmarks also improve (+4–19%): resolveWires is called for every output field, not just inside array loops. PassThrough, simple-chain, and fan-out all resolve output wires after tools complete; those values are already in state, so they now go through the sync path too, eliminating one microtask hop per resolved output field.

11. Pre-compute keys & cache wire tags

Date: March 2026 Result: ✅ +12–16% on all benchmarks, +60–129% on array benchmarks.

Benchmark Before After Change
exec: passthrough 728K 846K +16%
exec: short-circuit 778K 811K +4%
exec: simple chain 457K 486K +6%
exec: chained 3-tool fan-out 175K 194K +11%
exec: flat array 10 101K 170K +68%
exec: flat array 100 13.0K 28.3K +118%
exec: flat array 1000 1,336 3,064 +129%
exec: nested array 5×5 29.4K 46.7K +59%
exec: nested array 10×10 9.0K 17.5K +94%
exec: nested array 20×10 4.6K 8.8K +91%
exec: array + tool-per-element 10 27.6K 30.9K +12%
exec: array + tool-per-element 100 2.97K 3.43K +15%

Four micro-optimisations that eliminate string allocation and redundant property checks from the hottest loops:

  1. Cached trunkKey on NodeRefpullSingle memoises the state-map key per AST node as ref[TRUNK_KEY_CACHE] ??= trunkKey(ref). For a 1000-element array pulling 3 fields, this eliminates 3000 template-literal concatenations per execution. The cache is stored under a Symbol key so V8 keeps it in a separate backing store that doesn't participate in hidden-class transitions — the parser's object shapes remain stable even though the engine writes to them at runtime.

  2. Pre-computed pathKeys in materializeShadows — the path-key array ([...pathPrefix, field].join("\0")) only depends on the field index, not the element index. Hoisted out of the N×F loop into an F-length pre-computed array, eliminating N×F array spreads and joins (e.g. 3000 down to 3 for flat-array-1000).

  3. Cached getSimplePullRef(wire) — the 11-property fast-path check in resolveWires is now computed once per wire and cached as wire[SIMPLE_PULL_CACHE] (the from NodeRef, or null). Subsequent calls are a single property read. Also a Symbol key (same rationale as TRUNK_KEY_CACHE). For element wires in the hot path this turns 11 sequential null checks per field per element into 1.

  4. Constant cache capconstantCache is now hard-capped at 10,000 entries. When exceeded the Map is cleared rather than growing unboundedly. No performance impact; pure safety hygiene for long-lived processes.

12. De-async schedule() & callTool()

Date: March 2026 Result: ✅ +11–18% on tool-calling benchmarks, ~0% on pure array benchmarks.

Benchmark Before After Change
exec: passthrough 846K 830K ~0%
exec: short-circuit 811K 801K ~0%
exec: simple chain 486K 558K +15%
exec: chained 3-tool fan-out 194K 216K +11%
exec: flat array 10 170K 175K ~0%
exec: flat array 100 28.3K 28.2K ~0%
exec: flat array 1000 3,064 2,980 ~0%
exec: nested array 5×5 46.7K 47.7K ~0%
exec: nested array 10×10 17.5K 17.5K ~0%
exec: nested array 20×10 8.8K 9.0K ~0%
exec: array + tool-per-element 10 30.9K 36.5K +18%
exec: array + tool-per-element 100 3.43K 3.98K +16%

schedule() previously wrapped its entire body in (async () => { ... })(), always creating a Promise — even for __local bindings, __define_ pass-throughs, and __and/__or logic nodes that need no tool call and whose wires resolve synchronously. Similarly, callTool was declared async, forcing a Promise wrapper even when the tool function (e.g. internal math/string ops) returns synchronously.

Changes made:

  1. schedule returns MaybePromise<any> — Wire collection, grouping, and resolveToolDefByName remain synchronous at the top. For targets with a toolDef, the new scheduleToolDef async helper handles the full async path. For targets without a toolDef (locals, defines, logic nodes, pipe forks with sync tools), bridge wires are resolved via resolveWires (which already returns MaybePromise). If all wires resolve sync, scheduleFinish assembles the result and returns synchronously.

  2. callTool de-asynced — Removed the async keyword. The no-instrumentation fast path (return fnImpl(input, toolContext)) now returns whatever fnImpl returns directly — sync for internal tools (math, string ops, concat, etc.), Promise for async tools (httpCall). The instrumented path returns a Promise via otelTracer.startActiveSpan.

  3. scheduleFinish helper — Extracted the input-assembly + direct-fn-lookup

    • pass-through logic into a private method that returns MaybePromise<any>. For __local/__define_/logic targets with no direct function, this returns synchronously. For pipe forks backed by sync internal tools, callTool returns sync too, so the entire call chain stays sync.

Why pure array benchmarks are unchanged: Flat/nested array benchmarks use element passthrough wires (.id <- it.id) with no per-element tool calls. Their inner loop never calls schedule or callTool — it resolves wires directly via resolvePreGroupedresolveWirespullSingle, which were already sync-capable from #10.

Why tool-calling benchmarks improve (+11–18%): simple chain and chained 3-tool fan-out each schedule 1–3 internal tool calls. Now that schedule skips the async IIFE and callTool skips the async wrapper, each tool call eliminates 2 microtask hops. tool-per-element benefits the most: 10–100 tool calls per execution, each now fully synchronous.

13. Share toolDefCache across shadow trees

Date: March 2026 Status: 🔲 Planned Target: Array benchmarks (+2–5% expected)

Hypothesis:

shadow() currently creates new Map() for both toolDepCache and toolDefCache on every shadow tree. For flat-array-1000 that's 1,000 Map allocations per execution — 2,000 Maps total.

toolDefCache caches ToolDef resolution (resolveToolDefByName), which is a pure function of document.instructions. Since all shadow trees share the same document, the cache produces identical results. Sharing the parent's toolDefCache eliminates 1,000 Map allocations and gives every shadow tree instant cache hits for tool definitions already resolved by the root or a sibling.

Profiling evidence:

  • ExecutionTree constructor/shadow = 0.8% of total ticks (400 ticks)
  • In the V8 tick bottom-up profile, shadow() appears at 0.4% self-time
  • For 1,000-element arrays, the new Map() call is amplified 1,000×

What to change:

In ExecutionTree.shadow() (ExecutionTree.ts ~L371):

// Before:
child.toolDefCache = new Map();

// After:
child.toolDefCache = this.toolDefCache; // share — ToolDef is pure fn of instructions

toolDepCache must remain per-tree because it stores tool dependency promises (async results) that depend on each tree's specific context and element data.

Risk: Low. resolveToolDefByName caches by tool name and the merged result depends only on instructions (same across all trees). No mutation of cached ToolDef objects downstream.

14. Pre-compute output wire topology

Date: March 2026 Status: 🔲 Planned Target: All benchmarks (+1–3% expected)

Hypothesis:

run() calls bridge.wires.some(...) twice on every execution to classify the output wire topology:

const hasRootWire = bridge.wires.some(w =>
  "from" in w && w.to.module === SELF_MODULE && w.to.type === type
  && w.to.field === field && w.to.path.length === 0);

const hasElementWires = bridge.wires.some(w =>
  "from" in w && ((w.from as NodeRef).element === true || ...)
  && w.to.module === SELF_MODULE && ...);

These booleans are static per bridge definition — they depend only on the bridge's wire array, which is immutable after parsing. Computing them once (either in the constructor or lazily with a Symbol-keyed cache on the Bridge object) eliminates redundant O(W) wire scans per executeBridge().

Profiling evidence:

  • run = 1.3% of total ticks (627 ticks)
  • resolveNestedField = 0.9% (433 ticks) — called in the object-output branch, also does bridge.wires.filter(...) per field
  • Every executeBridge() call triggers these scans

What to change:

Option A — Compute in the ExecutionTree constructor and store as fields:

// In constructor, after bridge lookup:
this._hasRootWire = bridge.wires.some(w => ...);
this._hasElementWires = bridge.wires.some(w => ...);

Option B — Cache on the Bridge object via a Symbol key (like TRUNK_KEY_CACHE):

const TOPOLOGY_CACHE = Symbol.for("bridge.topology");
// In run():
const topo = ((bridge as any)[TOPOLOGY_CACHE] ??= computeTopology(bridge));

Option B avoids per-constructor cost and is shared across all trees for the same bridge.

Risk: Low. Wire arrays are immutable after parsing.

15. Cache executeBridge() setup per document

Date: March 2026 Status: 🔲 Planned Target: All benchmarks (+1–3% expected, more for repeated execution)

Hypothesis:

executeBridge() performs three setup operations on every call that produce identical results for the same document:

  1. resolveStd() (0.7% of ticks, 325 ticks) — Calls version.split(".").map(Number) on both the bridge version and bundled std version. For the common case (no version or bundled std satisfies), this is pure overhead on repeat calls.

  2. checkHandleVersions()collectVersionedHandles() (0.3%, 73 ticks) — Iterates all instructions to find versioned handles, then validates each one. The handle list and validation result are static per document.

  3. new ExecutionTree() instruction scan — The constructor does instructions.find(...) to locate the matching bridge, and iterates instructions again for const definitions. These results are the same for every execution of the same document+operation.

Profiling evidence:

  • resolveStd = 0.7% self-time (disproportionate for a pure setup function)
  • collectVersionedHandles = 0.3%
  • executeBridge = 0.4% (includes both setup calls)
  • Total setup overhead ≈ 1.4% of execution time

What to change:

Cache setup results on the BridgeDocument via Symbol keys:

const STD_CACHE = Symbol.for("bridge.resolvedStd");
const VERSIONS_CHECKED = Symbol.for("bridge.versionsChecked");

// In executeBridge():
const stdResult = ((doc as any)[STD_CACHE] ??= resolveStd(
  doc.version,
  bundledStd,
  BUNDLED_STD_VERSION,
  userTools,
));
const { namespace: activeStd, version: activeStdVersion } = stdResult;

if (!(doc as any)[VERSIONS_CHECKED]) {
  checkHandleVersions(doc.instructions, allTools, activeStdVersion);
  (doc as any)[VERSIONS_CHECKED] = true;
}

Risk: Medium. The cache key must account for different tools maps passed to successive executeBridge() calls. If the same document is reused with different tool maps, the cached std resolution may be wrong. Safest approach: cache only when no user tools are provided, or use a WeakMap<ToolMap, result> keyed by the tools object.

16. Cheap strict-path hot-path guards

Date: March 2026 Status: ✅ Done Target: Recover part of the runtime hit introduced by precise runtime error source mapping

Context:

The source-mapping work added stricter path traversal semantics and more precise error attribution. That was expected to cost something, but the runtime benchmark rerun showed a larger-than-desired hit on interpreter hot paths, especially tool-heavy and array-heavy workloads.

Observed branch-level runtime numbers before this mitigation:

Benchmark Baseline Before Change
exec: passthrough (no tools) ~830K ~638K -23%
exec: short-circuit ~801K ~560K -30%
exec: simple chain (1 tool) ~558K ~318K -43%
exec: flat array 1000 ~2,980 ~2,408 -19%
exec: array + tool-per-element 100 ~3,980 ~2,085 -48%

Hypothesis:

Two new costs were landing directly in the hottest interpreter path:

  1. Every single-segment path access paid the full generic multi-segment loop.
  2. Array warning checks still evaluated Array.isArray(...) and the numeric segment regex even when no logger was configured.

Both are pure overhead on the benchmark path.

What changed:

  • Added a dedicated single-segment fast path in ExecutionTree.applyPath().
  • Kept the strict primitive-property failure semantics from the source-mapping work, but avoided the multi-segment loop setup for the common case.
  • Gated array warning work behind this.logger?.warn so the benchmark fast path skips the branch entirely when logging is disabled.

Result:

This did not restore the full pre-source-mapping runtime baseline, but it did recover some of the avoidable overhead while preserving the new error behavior.

Current runtime numbers after the mitigation:

Benchmark Before After Change
exec: passthrough (no tools) ~588K ~636K +8%
exec: short-circuit ~525K ~558K +6%
exec: array + tool-per-element 100 ~1,862 ~2,102 +13%

What remains:

The remaining interpreter regression appears to be real cost from stricter error semantics rather than an obvious accidental slowdown. The next sensible steps are profiling-driven, not blind micro-optimisation:

  • measure ExecutionTree.applyPath() in the runtime flamegraph on tool-heavy cases
  • consider additional small-shape fast paths for 2- and 3-segment strict traversal
  • evaluate whether any error-metadata work can be deferred off the success path

17. v3 scope-based engine (AST interpreter)

Date: March 2026 Result: ✅ +2.2× flat-1000 vs v2 baseline, 43% of compiled engine

Context: The v3 engine (execute-bridge.ts) replaced the v2 shadow-tree based engine with a direct AST interpreter using ExecutionScope and pull-based evaluation. The initial v3 port was ~10× slower on array workloads due to per-element overhead from scope construction, AST re-indexing, and excessive Promise allocation. A series of targeted optimisations closed the gap and then surpassed the v2 baseline.

Key optimisations applied (in order):

  1. StaticScopeIndex — Shared read-only pre-computed maps (ownedTools, toolInputWires, outputWires, aliases, etc.) built once per array body via buildStaticIndex(). Child scopes reference this shared index instead of eagerly allocating 9+ Map/Set objects per element.

  2. Lazy map allocation — All ExecutionScope maps use private _x: T | null with getter properties: if (this.staticIndex) return this.staticIndex.x; return (this._x ??= new Map()). Elements that only read (the common case) never allocate their own maps.

  3. Execution plan hoistingbuildArrayExecutionPlan() pre-computes wire groups, ordering, and sub-field slicing once per array. The inner element loop uses this plan directly, bypassing per-element resolveRequestedFields.

  4. Chunked processing — Elements processed in batches of 2048 via Promise.all. Bounds concurrent promises to prevent GC panic on large arrays.

  5. Promise.allSettled → Promise.all — On the array hot path, replaced Promise.allSettled with Promise.all in both evaluateArrayElement and evaluateArrayExpr. The allSettled wrapper objects (3000+ per iteration for flat-1000) were pure overhead since the inner code already handles errors.

  6. Module-level catchSafe() — Replaced per-call closure wrappers with a single module-level function for safe expression evaluation.

  7. Inlined getActiveSourceLoc — Eliminated a per-wire closure allocation in evaluateSourceChain's catch block.

  8. Synchronous fast path (isPlanSynchronous) — The biggest single win. Detects when every wire in an array body is a simple element property read (no fallbacks, catches, overdefinition, or spreads). Evaluates the entire array in a tight synchronous for loop: getPath(element, ref.path)setPath(output, target). Zero Promises, zero ExecutionScope allocations, zero async function calls. Eliminated ~47% CPU overhead (26.5% microtask scheduling, 15.9% async closures, 5.1% GC) that was spent on async machinery for zero actual I/O.

Benchmark results (v3 vs v2 baseline):

Benchmark v2 baseline v3 current Change
exec: passthrough ~830K ~463K −44%
exec: short-circuit ~801K ~423K −47%
exec: simple chain (1 tool) ~558K ~276K −51%
exec: 3-tool fan-out ~216K ~89K −59%
exec: flat array 10 ~175K ~180K ≈ parity
exec: flat array 100 ~28.2K ~50.5K +79%
exec: flat array 1000 ~2,980 ~6,420 +2.2×
exec: nested array 5×5 ~47.7K ~58.8K +23%
exec: nested array 10×10 ~17.5K ~26K +49%
exec: nested array 20×10 ~9.0K ~14K +56%
exec: tool-per-element 10 ~36.5K ~42.5K +16%
exec: tool-per-element 100 ~3.98K ~4.96K +25%

Summary:

  • Array workloads: v3 is significantly faster. The sync fast path makes pure element-ref arrays 2.2× faster than v2 for flat-1000, and nested arrays see +23–56% gains.
  • Non-array workloads: v3 is ~44–59% slower. The v3 scope-based pull engine has inherent overhead from ExecutionScope construction, recursive expression evaluation, and async function wrappers. The v2 shadow-tree engine's flat wire-loop was simpler for non-array cases.
  • The v3 engine gains scope-based features (defines, nested scopes, aliases, lexical shadowing) that were not possible in the v2 flat wire model.