Skip to content

ArenaNode for array-backed expressions#167

Open
MilesCranmerBot wants to merge 66 commits into
SymbolicML:masterfrom
MilesCranmerBot:arenanode-prototype
Open

ArenaNode for array-backed expressions#167
MilesCranmerBot wants to merge 66 commits into
SymbolicML:masterfrom
MilesCranmerBot:arenanode-prototype

Conversation

@MilesCranmerBot

@MilesCranmerBot MilesCranmerBot commented Feb 20, 2026

Copy link
Copy Markdown
Contributor

Prototype ArenaNode: an array-backed arena representation for expressions, implementing the full NodeInterface. Evaluation runs directly on the arena facade and matches Node on allocations (scripts/bench_arenanode.jl), with cheaper copy and a smaller memory footprint.

@github-actions

github-actions Bot commented Feb 20, 2026

Copy link
Copy Markdown
Contributor

Benchmark Results (Julia v1)

Time benchmarks
master 60d0023... master / 60d0023...
eval/ComplexF32/evaluation 6.98 ± 0.52 ms 6.99 ± 0.55 ms 0.997 ± 0.11
eval/ComplexF64/evaluation 10.4 ± 0.77 ms 10.4 ± 0.96 ms 0.998 ± 0.12
eval/Float32/derivative 11.6 ± 1.5 ms 11.7 ± 2.1 ms 0.997 ± 0.22
eval/Float32/derivative_turbo 11.9 ± 2.2 ms 11.5 ± 1.7 ms 1.03 ± 0.24
eval/Float32/evaluation 2.5 ± 0.26 ms 2.49 ± 0.26 ms 1.01 ± 0.15
eval/Float32/evaluation_bumper 0.607 ± 0.017 ms 0.61 ± 0.019 ms 0.995 ± 0.043
eval/Float32/evaluation_turbo 0.542 ± 0.037 ms 0.537 ± 0.04 ms 1.01 ± 0.1
eval/Float32/evaluation_turbo_bumper 0.608 ± 0.018 ms 0.603 ± 0.017 ms 1.01 ± 0.041
eval/Float64/derivative 15.1 ± 3.6 ms 14.9 ± 3.6 ms 1.01 ± 0.35
eval/Float64/derivative_turbo 14.5 ± 3.1 ms 14.7 ± 3.9 ms 0.987 ± 0.34
eval/Float64/evaluation 2.89 ± 0.29 ms 2.91 ± 0.31 ms 0.994 ± 0.15
eval/Float64/evaluation_bumper 1.28 ± 0.046 ms 1.27 ± 0.046 ms 1.01 ± 0.052
eval/Float64/evaluation_turbo 1.04 ± 0.07 ms 1.03 ± 0.065 ms 1.01 ± 0.092
eval/Float64/evaluation_turbo_bumper 1.28 ± 0.045 ms 1.27 ± 0.046 ms 1.01 ± 0.051
utils/combine_operators/break_sharing 0.0431 ± 0.0011 ms 0.0408 ± 0.00081 ms 1.06 ± 0.035
utils/convert/break_sharing 27.9 ± 3.5 μs 27.8 ± 2.9 μs 1 ± 0.16
utils/convert/preserve_sharing 0.0999 ± 0.0095 ms 0.0992 ± 0.0098 ms 1.01 ± 0.14
utils/copy/break_sharing 27.8 ± 3.5 μs 27.2 ± 3.1 μs 1.02 ± 0.17
utils/copy/preserve_sharing 0.0997 ± 0.0092 ms 0.0995 ± 0.0099 ms 1 ± 0.14
utils/count_constant_nodes/break_sharing 12.7 ± 0.73 μs 13 ± 0.66 μs 0.975 ± 0.075
utils/count_constant_nodes/preserve_sharing 0.0854 ± 0.0083 ms 0.0851 ± 0.0067 ms 1 ± 0.13
utils/count_depth/break_sharing 13.5 ± 0.91 μs 14 ± 0.72 μs 0.964 ± 0.082
utils/count_nodes/break_sharing 12.8 ± 0.82 μs 12.7 ± 0.69 μs 1.01 ± 0.085
utils/count_nodes/preserve_sharing 0.0874 ± 0.0072 ms 0.0864 ± 0.0051 ms 1.01 ± 0.1
utils/get_set_constants!/break_sharing 0.0329 ± 0.004 ms 0.0326 ± 0.0034 ms 1.01 ± 0.16
utils/get_set_constants!/preserve_sharing 0.173 ± 0.011 ms 0.175 ± 0.012 ms 0.989 ± 0.093
utils/get_set_constants_parametric 0.0464 ± 0.0038 ms 0.0452 ± 0.0038 ms 1.03 ± 0.12
utils/has_constants/break_sharing 7.54 ± 0.8 μs 7.86 ± 1.3 μs 0.96 ± 0.19
utils/has_operators/break_sharing 2.62 ± 0.14 μs 2.68 ± 0.33 μs 0.978 ± 0.13
utils/hash/break_sharing 23.6 ± 1.5 μs 24.3 ± 2.1 μs 0.968 ± 0.1
utils/hash/preserve_sharing 0.1 ± 0.0094 ms 0.1 ± 0.0055 ms 1 ± 0.11
utils/index_constant_nodes/break_sharing 28.4 ± 1.8 μs 28.6 ± 1.9 μs 0.992 ± 0.091
utils/index_constant_nodes/preserve_sharing 0.0999 ± 0.0069 ms 0.0994 ± 0.0052 ms 1 ± 0.087
utils/is_constant/break_sharing 8.21 ± 0.89 μs 8.06 ± 0.82 μs 1.02 ± 0.15
utils/simplify_tree/break_sharing 30.6 ± 1.1 μs 30.4 ± 1 μs 1.01 ± 0.05
utils/simplify_tree/preserve_sharing 0.111 ± 0.0077 ms 0.11 ± 0.0051 ms 1.01 ± 0.084
utils/string_tree/break_sharing 0.451 ± 0.026 ms 0.456 ± 0.024 ms 0.989 ± 0.078
utils/string_tree/preserve_sharing 0.547 ± 0.028 ms 0.55 ± 0.026 ms 0.995 ± 0.069
time_to_load 0.175 ± 0.006 s 0.172 ± 0.0019 s 1.01 ± 0.036
Memory benchmarks
master 60d0023... master / 60d0023...
eval/ComplexF32/evaluation 0.975 k allocs: 2.5 MB 0.981 k allocs: 2.51 MB 0.994
eval/ComplexF64/evaluation 0.969 k allocs: 4.94 MB 1.01 k allocs: 5.12 MB 0.964
eval/Float32/derivative 4.63 k allocs: 17.4 MB 4.69 k allocs: 17.7 MB 0.987
eval/Float32/derivative_turbo 4.7 k allocs: 17.7 MB 4.68 k allocs: 17.6 MB 1
eval/Float32/evaluation 0.984 k allocs: 1.28 MB 0.984 k allocs: 1.28 MB 1
eval/Float32/evaluation_bumper 0.303 k allocs: 0.393 MB 0.303 k allocs: 0.393 MB 1
eval/Float32/evaluation_turbo 0.972 k allocs: 1.27 MB 0.957 k allocs: 1.25 MB 1.02
eval/Float32/evaluation_turbo_bumper 0.303 k allocs: 0.393 MB 0.303 k allocs: 0.393 MB 1
eval/Float64/derivative 4.74 k allocs: 0.0346 GB 4.75 k allocs: 0.0348 GB 0.996
eval/Float64/derivative_turbo 4.83 k allocs: 0.0353 GB 4.77 k allocs: 0.0349 GB 1.01
eval/Float64/evaluation 0.999 k allocs: 2.56 MB 0.993 k allocs: 2.54 MB 1.01
eval/Float64/evaluation_bumper 0.303 k allocs: 0.771 MB 0.303 k allocs: 0.771 MB 1
eval/Float64/evaluation_turbo 0.99 k allocs: 2.53 MB 0.996 k allocs: 2.55 MB 0.994
eval/Float64/evaluation_turbo_bumper 0.303 k allocs: 0.771 MB 0.303 k allocs: 0.771 MB 1
utils/combine_operators/break_sharing 4 allocs: 0.953 kB 4 allocs: 0.953 kB 1
utils/convert/break_sharing 2 k allocs: 0.123 MB 2 k allocs: 0.123 MB 1
utils/convert/preserve_sharing 2.4 k allocs: 0.192 MB 2.4 k allocs: 0.192 MB 1
utils/copy/break_sharing 2 k allocs: 0.123 MB 2 k allocs: 0.123 MB 1
utils/copy/preserve_sharing 2.4 k allocs: 0.192 MB 2.4 k allocs: 0.192 MB 1
utils/count_constant_nodes/break_sharing 4 allocs: 0.953 kB 4 allocs: 0.953 kB 1
utils/count_constant_nodes/preserve_sharing 0.404 k allocs: 0.0696 MB 0.404 k allocs: 0.0696 MB 1
utils/count_depth/break_sharing 4 allocs: 0.953 kB 4 allocs: 0.953 kB 1
utils/count_nodes/break_sharing 4 allocs: 0.953 kB 4 allocs: 0.953 kB 1
utils/count_nodes/preserve_sharing 0.404 k allocs: 0.0696 MB 0.404 k allocs: 0.0696 MB 1
utils/get_set_constants!/break_sharing 0.898 k allocs: 25.2 kB 0.898 k allocs: 25.2 kB 1
utils/get_set_constants!/preserve_sharing 1.7 k allocs: 0.138 MB 1.7 k allocs: 0.138 MB 1
utils/get_set_constants_parametric 1.42 k allocs: 0.0663 MB 1.42 k allocs: 0.0663 MB 1
utils/has_constants/break_sharing 4 allocs: 0.203 kB 4 allocs: 0.203 kB 1
utils/has_operators/break_sharing 4 allocs: 0.203 kB 4 allocs: 0.203 kB 1
utils/hash/break_sharing 0.104 k allocs: 2.52 kB 0.104 k allocs: 2.52 kB 1
utils/hash/preserve_sharing 0.504 k allocs: 0.0711 MB 0.504 k allocs: 0.0711 MB 1
utils/index_constant_nodes/break_sharing 2.1 k allocs: 0.094 MB 2.1 k allocs: 0.094 MB 1
utils/index_constant_nodes/preserve_sharing 2.5 k allocs: 0.163 MB 2.5 k allocs: 0.163 MB 1
utils/is_constant/break_sharing 4 allocs: 0.203 kB 4 allocs: 0.203 kB 1
utils/simplify_tree/break_sharing 0.104 k allocs: 2.52 kB 0.104 k allocs: 2.52 kB 1
utils/simplify_tree/preserve_sharing 0.504 k allocs: 0.0711 MB 0.504 k allocs: 0.0711 MB 1
utils/string_tree/break_sharing 11.9 k allocs: 0.999 MB 11.9 k allocs: 0.999 MB 1
utils/string_tree/preserve_sharing 12.3 k allocs: 1.07 MB 12.3 k allocs: 1.07 MB 1
time_to_load 0.145 k allocs: 11 kB 0.145 k allocs: 11 kB 1

@MilesCranmer

Copy link
Copy Markdown
Member

@MilesCranmerBot comments from Claude... pls integrate. Maybe benchmark against master too, to see if it works...

Please remove the eval shim and let evaluation run directly on the arena facade:

1. In `src/Evaluate.jl`, delete the `eval_tree_array(tree::ArenaNode{T}, ...)` method and the `import ..ArenaNodeModule: ArenaNode, tree_from_arena` line. ArenaNode is an AbstractExpressionNode, so the generic method should handle it. Converting to a heap `Node` on every call allocates O(tree size) garbage per eval, which defeats the purpose of the arena.

2. Run the full test suite with `dispatch_doctor_mode = "error"` to confirm nothing fires on the direct path. If something does fire, fix it locally with concrete accessors on the arena (same pattern as the existing `branch_equal`/`leaf_equal` specializations). Do not reintroduce any representation conversion to satisfy the linter.

3. In `test/test_arenanode.jl`, extend the fresh-subprocess allocation checks with an `eval_tree_array` case: after warmup, `@allocated eval_tree_array(atree, X, operators)` should be comparable to the `Node` equivalent on the same tree (within ~10%), not proportional to conversion overhead.

4. While you're in there: delete `ArenaScratch` (unused), and re-evaluate whether the `is_constant(::ArenaNode)` specialization in `src/NodeUtils.jl` is still needed once (2) passes. If it is needed, rewrite it to use a plain local `Vector{Int32}` stack instead of allocating an `ArenaCursor` per call.

Post the before/after numbers from `scripts/bench_arenanode.jl` in the PR description.

Co-authored-by: Miles Cranmer <miles.cranmer@gmail.com>
@MilesCranmerBot

Copy link
Copy Markdown
Contributor Author

Integrated in 6b91ec70 and pushed to arenanode-prototype.

What changed:

  • removed the eval_tree_array(::ArenaNode, ...) heap-Node conversion shim
  • stabilized ArenaNode facade property access for the generic eval path
  • removed unused ArenaScratch
  • kept is_constant(::ArenaNode) with a plain Vector{Int32} stack
  • added the fresh-subprocess eval_tree_array allocation comparison

Validation: SR_TEST=main /usr/local/bin/julia --startup-file=no --project=. -e 'using Pkg; Pkg.test(; coverage=false)' passes: 899/899.

Benchmark (scripts/bench_arenanode.jl): Arena eval allocation went from 243352 bytes before to 241064 bytes after; Node baseline is 240216 bytes. origin/master does not have the ArenaNode prototype/script, so there is no direct master run for this benchmark.

@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 91.32150% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.02%. Comparing base (efbc2aa) to head (60d0023).

Files with missing lines Patch % Lines
src/ArenaNode.jl 92.36% 22 Missing ⚠️
src/ArenaNodeEval.jl 92.89% 15 Missing ⚠️
ext/DynamicExpressionsLoopVectorizationExt.jl 0.00% 7 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #167       +/-   ##
===========================================
+ Coverage   59.43%   74.02%   +14.59%     
===========================================
  Files          30       32        +2     
  Lines        2682     3188      +506     
===========================================
+ Hits         1594     2360      +766     
+ Misses       1088      828      -260     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@MilesCranmer MilesCranmer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MilesCranmerBot please address above comments.

Also, shouldn't we have a tree_mapreduce? Then we wouldn't need to write a custom is_constant and so forth.

Btw, please expand your test suite. You should actually be using the full Node Interface tests. ArenaNode should pass ALL of them.

Comment thread src/ArenaNode.jl
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread test/test_arenanode.jl Outdated
Comment thread test/unittest.jl Outdated
Comment thread Project.toml Outdated
Comment thread src/ArenaNode.jl Outdated
MilesCranmerBot and others added 3 commits June 10, 2026 12:11
Co-authored-by: Miles Cranmer <miles.cranmer@gmail.com>
Co-authored-by: Miles Cranmer <miles.cranmer@gmail.com>
@MilesCranmerBot

Copy link
Copy Markdown
Contributor Author

Addressed the review in the latest pushes (fb799dd0, a4b86496, 6b2aaed3).

Changes:

  • added full NodeInterface coverage for ArenaNode
  • removed the custom is_constant(::ArenaNode) in favor of generic tree_mapreduce
  • switched the allocation checks to PerformanceTestTools.include_foreach
  • moved the arena @testitem into test/test_arenanode.jl
  • restored the main Project.toml compat changes out of this PR
  • added the cross-arena child-copy regression test

Local validation:

  • JULIA_PKG_PRECOMPILE_AUTO=0 SR_TEST=main /usr/local/bin/julia --startup-file=no --project=. -e 'using Pkg; Pkg.test(; coverage=false)' -> 855/855 passed
  • JULIA_PKG_PRECOMPILE_AUTO=0 /usr/local/bin/julia --startup-file=no --project=. scripts/bench_arenanode.jl -> Node and ArenaNode eval both allocate 240216 bytes

All addressed review threads are resolved. CI is pending on the latest push.

Comment thread src/NodeUtils.jl Outdated
Co-authored-by: Miles Cranmer <miles.cranmer@gmail.com>

@MilesCranmer MilesCranmer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MilesCranmerBot

See new comments.

Also.

Please test with expressions types too. eg Expression with ArenaNode - do the full expression interface tests.

Also maybe try derivatives too?

Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
MilesCranmerBot and others added 2 commits June 10, 2026 19:51
- Test Expression wrapping ArenaNode with the full ExpressionInterface
- Test eval_grad_tree_array and Zygote derivatives for ArenaNode-based expressions
- Cover both variable and constant gradients

Co-authored-by: Miles Cranmer <miles.cranmer@gmail.com>

@MilesCranmer MilesCranmer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MilesCranmerBot please see comments above

Also please resolve any comments that have already been addressed.

Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread scripts/bench_arenanode.jl Outdated
e/a/n/c/d/f/v/st and the kernel abbreviations (sp, ks, is, svs, svals,
desc, nfree, fbase, doff, offs, isscal, off, rem, B) become entry, arena,
node, child_idx, degree, feature, value, state, stack_top, kinds, idxs,
scalar_args, scalar_vals, descriptors, num_free, free_base, dest_offset,
offsets, is_scalar, offset, remaining, buffer_rows. No behavior change
(2614 tests pass; bench ratios 0.55/0.43/0.39, unchanged).

@MilesCranmer MilesCranmer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address all comments. Resolve them individually as you address each comment locally. And before you are done, please audit your changes and ensure that all logic is maintained, and no speed is lost due to, e.g., type instabilities.

Comment thread test/test_arenanode.jl Outdated
Comment thread src/ArenaNode.jl
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
Comment thread src/ArenaNode.jl Outdated
- expose ArenaNode/Arena as DynamicExpressions.ArenaNode etc.; tests use
  'using DynamicExpressions: ArenaNode' instead of a module alias
- _push_node! takes keyword defaults; push_constant!/push_feature! only
  override the relevant fields
- PlanState/PlanRegisters (no leading underscore on internal types)
- constrain ArenaEntry/Arena/ArenaNode to T<:Number
- split _arena_eval into _materialize_features! and _write_root_to_output!
- readable planner names: feature_mask, scalar_stack, permanent_stack,
  arity_mask, num_recyclable_args; the feature-slot computation lives in
  a named _feature_slot helper shared with _push_leaf!
- iszero/isone instead of == 0 / != 1 comparisons
- shorten testitem names so headers fit on one line

2614 tests pass; bench ratios 0.56/0.47/0.37 vs Node (unchanged).
_exec_op! is now a thin compile-time arity dispatch; _exec_op_arity! pops
the operand descriptors and branches once into _fold_constant_args! (the
scalar-lane fold, written with guard clauses) or _run_op_kernel! (slot
recycling, destination allocation, operand validation, kernel dispatch).
Max nesting depth drops from 6 to 2.

No behavior or speed change: 2614 tests pass, and an interleaved A/B
against the previous commit gives identical benchmark ratios.
set_child! and set_children! both duplicated the validate-then-link-or-copy
logic; it now lives in one helper with the cross-arena copy rationale on it.
_plan_scratch previously re-implemented the executor's stack machine as raw
bitmask arithmetic — the two had to agree exactly, since the executor trusts
the planner's slot counts under @inbounds, but nothing tied them together.

The kind and recycling policy now lives once: _leaf_kind, _op_result_kind,
and _is_recyclable are called by both sides, and the planner's bitmask pair
becomes a KindStack with push/pop/query operations named after the
executor's concepts. Drift between the passes is now a compile error or an
obviously-local edit instead of a silent buffer-size mismatch.

No behavior change: 2614 tests pass, bench ratios within session noise.
…inbounds in set_scalar_constants!

get_arena/get_index are now the only getfield call sites: internal code calls
them instead of raw getfield (named, greppable) and instead of property
access (functions reachable from getproperty, like get_child via the :l/:r
branches, must not cycle back through it). Arena has no custom getproperty,
so its getfield(arena, :nodes) calls become plain dot access.

set_scalar_constants! gets its @inbounds back: refs are produced by
get_scalar_constants, and callers passing stale refs are out of contract.
_pack_descriptor/_descriptor_kind/_descriptor_slot replace the raw
kind|slot<<2 packing, _feature_bit names the used-feature bitset position,
and _slot_offset replaces _slotoff. All @inline; perf-neutral (verified by
interleaved A/B against the pre-refactor base 74438bd: with early_exit
disabled the two are indistinguishable; the ~5-8% ratio gap with early_exit
on is entirely the operand-validation correctness fix, not the refactors).
The parity fix had moved slot validation to consumption, re-summing feature
columns at every use (~5-8% relative vs the pre-fix base). Validation now
happens at equivalent-but-cheaper points:

- features: one check at materialization (a separate pass over the
  just-written, cache-hot slot; fusing the sum into the copy loop spoiled
  its memcpy pattern). Skipped for single-leaf trees, where no operator
  consumes the feature.
- intermediates: checked at production, right after the kernel writes them.
  Every non-root intermediate is consumed exactly once, so this rejects the
  same trees as consumption checks.
- scalar operands: still checked at consumption (O(1)).
- root output: never checked, as in the generic evaluator.

Also adds mark_compact!/invalidate_compact!/is_compact instead of raw
arena.compact[] writes.

Interleaved A/B vs 74438bd: ratios now overlap at all sizes
(0.69-0.72 / 0.59-0.64 / 0.57-0.61). 2616 tests pass, including a new
bare-feature-root NaN parity case.
@enum OperandKind::UInt8 FoldedConstant PinnedSlot ScratchSlot replaces the
raw _K_SCALAR/_K_PSLOT/_K_SLOT byte constants. Interleaved A/B shows no
measurable cost: packing and comparisons compile to the same integer ops,
and the membership check in the unpacking constructor is branch-predicted
away. 2616 tests pass.
Only the three types (ArenaEntry, Arena, ArenaNode) keep docstrings;
every internal function's docstring becomes a short '#' comment, so
Documenter does not pick them up.
All 38 @inline annotations removed (interleaved A/B shows the inliner does
the same job unannotated; the zero-allocation copy tests still pass). The
one survivor is the call-site @inline on the predicate in _arena_any, which
forces the closure to specialize into the recursion.

Also: setproperty! branches share a single trailing 'return value';
getproperty loads the entry once instead of in five branches; trivial
helpers collapse to one-line definitions; _nullable_child and setproperty!
use get_arena/get_index like everything else.
Mutation hot path; inlining guarantees the constant-symbol branch folds at
the call site rather than relying on interprocedural constprop.
Documents the self-buffering experiment result: the generic fused
unbuffered evaluator benches even with a plan running on a self-allocated
pool, while allocating half the bytes, so the plan stays buffer-gated. Also
shows buffered Node is slower than unbuffered Node (strided buffer rows),
which is the headroom the plan's contiguous slots reclaim.
ArenaNode.jl keeps the types and the AbstractExpressionNode interface
(~510 lines); ArenaNodeEval.jl holds the buffered plan evaluator
(~530 lines), included from within the module. Also drops a tutorial-grade
comment and records why the plan does not self-allocate a pool.
…t compacting copy

CI:
- _scalar_degn binds T via Tuple{T,Vararg{T}} and derives the arity from
  the tuple type, fixing Aqua's unbound-args failure
- formatted with JuliaFormatter v1 (the CI pin; local runs had used v2)

Investigation of the slow any(f, ::ArenaNode) found the generic
AbstractNode traversal machinery is already fast on ArenaNode (within 15%
of Node, zero allocations) -- the custom _arena_any/_arena_mapreduce/
is_constant overrides were unnecessary, and the closure-based recursion in
the any override was nondeterministically up to 8x slower (inlining of the
Base.any layers is compilation-order dependent). So:
- the traversal overrides are deleted; ArenaNode rides the generic
  machinery, except a 3-line compact-arena flat scan for  (1.4-1.7x
  faster than Node, and has_constants/is_constant inherit it)
- the unset-child UndefRefError guard moves from the deleted override into
  _load_entry at the facade layer, so poison facades throw like Node's
  undefined fields for every generic consumer

Copying out of a non-compact arena is now an entry-level compacting write
(_write_subtree! into a pre-sized vector; no facade traversal, no per-push
growth checks), also used by copy_into! and cross-arena attachment.
copy(::ArenaNode): compact 5.7-11x faster than Node at n=15-255;
non-compact crosses over near n=20 (1.1-1.2x above).

ArenaNodeEval.jl is now its own module included from DynamicExpressions.jl
rather than a nested include. The eval benchmark also reports unbuffered
paths.
Property tests (using the existing supposition_utils generators): Node ->
ArenaNode -> Node round-trips with all read-only interface results equal;
evaluation matches Node (unbuffered exactly; the buffered plan's ok flag is
best-effort, values must agree whenever both report ok); random valid
mutations applied to both representations keep them equivalent; copy
re-compacts and preserves evaluation.
…_nodes

hash went through the generic tree_mapreduce machinery (0.64x vs Node);
direct recursion over entries through the shared leaf_hash/branch_hash
combinators recovers it to 0.92x -- the rest is the hash arithmetic itself,
which must be identical across representations since cross-representation
== requires equal hashes (now pinned by the supposition round-trip
property). count_constant_nodes on a compact arena is a flat count over
the entry vector (1.6x vs Node).
The generic traversal skeleton re-reads the entry through the facade for
every field access and shuffles Nullable-wrapped children, which taxed all
tree_mapreduce clients at once (hash 0.64x, collect 0.70x, count_depth
0.76x, count_constant_nodes 0.73x vs Node). One override of the primitive
-- read each entry exactly once, dispatch on the local degree, hand f the
facade -- normalizes the whole family (1.0x for collect/count_depth/
count_constant_nodes, 0.94x index_constant_nodes, 0.85x hash) in both
compact and non-compact arenas.

This replaces the per-function fixes: the previous hash and
count_constant_nodes specializations are deleted, and since the generic
hash combinators now just run on the fast skeleton, hash equality across
representations holds again (pinned in the supposition round-trip
property).
At 31-node trees the generic any falls to 0.52x vs Node on non-compact
arenas (facade re-reads per field, like the tree_mapreduce skeleton);
the entry-level recursion recovers to 0.67x. The remainder is f reading
fields through the facade (arena -> nodes -> entry per access vs Node's
single deref), inherent to the facade contract; compact arenas bypass it
entirely via the flat scan (1.8x ahead of Node at n=31).
A linear postfix scan over compact arenas benches ~20% faster for pure
reductions (hash 1.03x vs Node), but f-application order (parent first,
siblings left to right) is observable API through collect/filter/foreach:
the postfix scan reorders them, applying positional mutations to different
nodes than Node would. Caught by the supposition properties with shrunk
counterexamples (abs(5.0e-324), one set_val). Recursion stays; the comment
on _entry_mapreduce documents the constraint.
The flat scan was the recursive traversal with recursion disabled, so the
two share a single _entry_any whose children loop is gated on a static
Val{recurse} flag: the compact path drives it from a flat loop with the
recursion compiled out, the non-compact path recurses. Performance
unchanged (compact 1.6-2.0x vs Node at n=31, non-compact 0.63-0.68x).
…tant ops

One driver for order-free whole-tree visits, with both walks side by side
and the unused one compiled out via a static flag: compact arenas walk the
entry array (indices valid by construction, so the load is uncheckable and
dead when unused); non-compact arenas walk the children pointers with the
per-degree dispatch unrolled like the generic any (a children loop benches
~40% slower; a worklist allocates and loses at SR sizes; closure-driven
recursion through Base.any inlines nondeterministically -- all measured).
any/count_nodes/get_scalar_constants are one-line clients, replacing the
recursive _entry_any, the count_nodes invoke fallback, and the filter_map
path.

Also, on the canonical-form principle (pure or order-free ops on a compact
arena may use the entries directly):
- == between compact trees is a flat field-compare (equal trees have
  identical entry vectors); 2.7x vs Node, from 1.1x
- count_constant_nodes: flat count when compact (2.0x), entry-level
  mapreduce otherwise
- set_scalar_constants! writes entries directly: val-only writes cannot
  change structure, so the setindex! degree/children diff is skipped
- @inline restored on getproperty (the removal had been validated against
  an eval-only benchmark that never exercises property reads)

n=31 vs Node: compact 1.4-3.4x on ten ops, floor 0.81x (hash); non-compact
0.85-1.75x, floor 0.65x (copy_into!, the re-compaction price that wins
back above n~20).
@MilesCranmer

Copy link
Copy Markdown
Member

@codex review

Older Supposition versions resolved by the downgrade-compat CI job lack
length(::Data.OneOf); a single multi-arg map selecting the mutation kind
by integer generates the same tuples on any version.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fda350b22

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/ArenaNode.jl
function copy_into!(
dest::Arena{T,D},
src::ArenaNode{T,D};
ref::Union{Nothing,Base.RefValue{<:Integer}}=nothing,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor the copy_into! ref counter

When callers pass ref to copy_into!, this ArenaNode overload currently accepts it but never resets or increments it, so ref[] remains stale/zero even after a successful copy. The generic preallocation path uses this counter to report how many nodes were consumed and tests assert that behavior for other node types; ArenaNode should either update it to the copied node count (including the dest === src.arena fast path) or not expose the keyword.

Useful? React with 👍 / 👎.

Comment thread src/ArenaNode.jl Outdated
i = refs[j]
nodes[i] = _replace(nodes[i]; val=convert(T, constants[j]))
end
return nothing

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 Badge Return the mutated tree from set_scalar_constants!

For ArenaNode callers, set_scalar_constants! now returns nothing, while the existing generic implementation returns the mutated tree. Code that follows the established API by chaining or reassigning the result, e.g. tree = set_scalar_constants!(tree, vals, refs), will lose the ArenaNode even though the constants were updated successfully.

Useful? React with 👍 / 👎.

Comment thread src/ArenaNode.jl Outdated
end

function set_child!(node::ArenaNode{T,D}, child::AbstractNode{D}, i::Int) where {T,D}
idx = _resolve_child_index!(node, child)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 Badge Validate child indices before copying

When i is out of range and child comes from another arena, this resolves the child by copying it into node.arena before the bounds-checked entry.children[i] access throws. That leaves an orphaned subtree appended to the parent arena after a failed set_child!, unlike the generic tuple-based implementation which throws without mutating state.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants