ArenaNode for array-backed expressions#167
Conversation
Benchmark Results (Julia v1)Time benchmarks
Memory benchmarks
|
e8acc61 to
847c2c6
Compare
847c2c6 to
04e763d
Compare
|
@MilesCranmerBot comments from Claude... pls integrate. Maybe benchmark against master too, to see if it works... |
Co-authored-by: Miles Cranmer <miles.cranmer@gmail.com>
|
Integrated in What changed:
Validation: Benchmark ( |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #167 +/- ##
===========================================
+ Coverage 59.43% 74.02% +14.59%
===========================================
Files 30 32 +2
Lines 2682 3188 +506
===========================================
+ Hits 1594 2360 +766
+ Misses 1088 828 -260 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
MilesCranmer
left a comment
There was a problem hiding this comment.
@MilesCranmerBot please address above comments.
Also, shouldn't we have a tree_mapreduce? Then we wouldn't need to write a custom is_constant and so forth.
Btw, please expand your test suite. You should actually be using the full Node Interface tests. ArenaNode should pass ALL of them.
Co-authored-by: Miles Cranmer <miles.cranmer@gmail.com>
Co-authored-by: Miles Cranmer <miles.cranmer@gmail.com>
|
Addressed the review in the latest pushes ( Changes:
Local validation:
All addressed review threads are resolved. CI is pending on the latest push. |
Co-authored-by: Miles Cranmer <miles.cranmer@gmail.com>
MilesCranmer
left a comment
There was a problem hiding this comment.
See new comments.
Also.
Please test with expressions types too. eg Expression with ArenaNode - do the full expression interface tests.
Also maybe try derivatives too?
- Test Expression wrapping ArenaNode with the full ExpressionInterface - Test eval_grad_tree_array and Zygote derivatives for ArenaNode-based expressions - Cover both variable and constant gradients Co-authored-by: Miles Cranmer <miles.cranmer@gmail.com>
MilesCranmer
left a comment
There was a problem hiding this comment.
@MilesCranmerBot please see comments above
Also please resolve any comments that have already been addressed.
e/a/n/c/d/f/v/st and the kernel abbreviations (sp, ks, is, svs, svals, desc, nfree, fbase, doff, offs, isscal, off, rem, B) become entry, arena, node, child_idx, degree, feature, value, state, stack_top, kinds, idxs, scalar_args, scalar_vals, descriptors, num_free, free_base, dest_offset, offsets, is_scalar, offset, remaining, buffer_rows. No behavior change (2614 tests pass; bench ratios 0.55/0.43/0.39, unchanged).
MilesCranmer
left a comment
There was a problem hiding this comment.
Please address all comments. Resolve them individually as you address each comment locally. And before you are done, please audit your changes and ensure that all logic is maintained, and no speed is lost due to, e.g., type instabilities.
- expose ArenaNode/Arena as DynamicExpressions.ArenaNode etc.; tests use 'using DynamicExpressions: ArenaNode' instead of a module alias - _push_node! takes keyword defaults; push_constant!/push_feature! only override the relevant fields - PlanState/PlanRegisters (no leading underscore on internal types) - constrain ArenaEntry/Arena/ArenaNode to T<:Number - split _arena_eval into _materialize_features! and _write_root_to_output! - readable planner names: feature_mask, scalar_stack, permanent_stack, arity_mask, num_recyclable_args; the feature-slot computation lives in a named _feature_slot helper shared with _push_leaf! - iszero/isone instead of == 0 / != 1 comparisons - shorten testitem names so headers fit on one line 2614 tests pass; bench ratios 0.56/0.47/0.37 vs Node (unchanged).
_exec_op! is now a thin compile-time arity dispatch; _exec_op_arity! pops the operand descriptors and branches once into _fold_constant_args! (the scalar-lane fold, written with guard clauses) or _run_op_kernel! (slot recycling, destination allocation, operand validation, kernel dispatch). Max nesting depth drops from 6 to 2. No behavior or speed change: 2614 tests pass, and an interleaved A/B against the previous commit gives identical benchmark ratios.
set_child! and set_children! both duplicated the validate-then-link-or-copy logic; it now lives in one helper with the cross-arena copy rationale on it.
_plan_scratch previously re-implemented the executor's stack machine as raw bitmask arithmetic — the two had to agree exactly, since the executor trusts the planner's slot counts under @inbounds, but nothing tied them together. The kind and recycling policy now lives once: _leaf_kind, _op_result_kind, and _is_recyclable are called by both sides, and the planner's bitmask pair becomes a KindStack with push/pop/query operations named after the executor's concepts. Drift between the passes is now a compile error or an obviously-local edit instead of a silent buffer-size mismatch. No behavior change: 2614 tests pass, bench ratios within session noise.
…inbounds in set_scalar_constants! get_arena/get_index are now the only getfield call sites: internal code calls them instead of raw getfield (named, greppable) and instead of property access (functions reachable from getproperty, like get_child via the :l/:r branches, must not cycle back through it). Arena has no custom getproperty, so its getfield(arena, :nodes) calls become plain dot access. set_scalar_constants! gets its @inbounds back: refs are produced by get_scalar_constants, and callers passing stale refs are out of contract.
_pack_descriptor/_descriptor_kind/_descriptor_slot replace the raw kind|slot<<2 packing, _feature_bit names the used-feature bitset position, and _slot_offset replaces _slotoff. All @inline; perf-neutral (verified by interleaved A/B against the pre-refactor base 74438bd: with early_exit disabled the two are indistinguishable; the ~5-8% ratio gap with early_exit on is entirely the operand-validation correctness fix, not the refactors).
The parity fix had moved slot validation to consumption, re-summing feature columns at every use (~5-8% relative vs the pre-fix base). Validation now happens at equivalent-but-cheaper points: - features: one check at materialization (a separate pass over the just-written, cache-hot slot; fusing the sum into the copy loop spoiled its memcpy pattern). Skipped for single-leaf trees, where no operator consumes the feature. - intermediates: checked at production, right after the kernel writes them. Every non-root intermediate is consumed exactly once, so this rejects the same trees as consumption checks. - scalar operands: still checked at consumption (O(1)). - root output: never checked, as in the generic evaluator. Also adds mark_compact!/invalidate_compact!/is_compact instead of raw arena.compact[] writes. Interleaved A/B vs 74438bd: ratios now overlap at all sizes (0.69-0.72 / 0.59-0.64 / 0.57-0.61). 2616 tests pass, including a new bare-feature-root NaN parity case.
@enum OperandKind::UInt8 FoldedConstant PinnedSlot ScratchSlot replaces the raw _K_SCALAR/_K_PSLOT/_K_SLOT byte constants. Interleaved A/B shows no measurable cost: packing and comparisons compile to the same integer ops, and the membership check in the unpacking constructor is branch-predicted away. 2616 tests pass.
This reverts commit bd0a2c7.
Only the three types (ArenaEntry, Arena, ArenaNode) keep docstrings; every internal function's docstring becomes a short '#' comment, so Documenter does not pick them up.
All 38 @inline annotations removed (interleaved A/B shows the inliner does the same job unannotated; the zero-allocation copy tests still pass). The one survivor is the call-site @inline on the predicate in _arena_any, which forces the closure to specialize into the recursion. Also: setproperty! branches share a single trailing 'return value'; getproperty loads the entry once instead of in five branches; trivial helpers collapse to one-line definitions; _nullable_child and setproperty! use get_arena/get_index like everything else.
Mutation hot path; inlining guarantees the constant-symbol branch folds at the call site rather than relying on interprocedural constprop.
Documents the self-buffering experiment result: the generic fused unbuffered evaluator benches even with a plan running on a self-allocated pool, while allocating half the bytes, so the plan stays buffer-gated. Also shows buffered Node is slower than unbuffered Node (strided buffer rows), which is the headroom the plan's contiguous slots reclaim.
ArenaNode.jl keeps the types and the AbstractExpressionNode interface (~510 lines); ArenaNodeEval.jl holds the buffered plan evaluator (~530 lines), included from within the module. Also drops a tutorial-grade comment and records why the plan does not self-allocate a pool.
…t compacting copy
CI:
- _scalar_degn binds T via Tuple{T,Vararg{T}} and derives the arity from
the tuple type, fixing Aqua's unbound-args failure
- formatted with JuliaFormatter v1 (the CI pin; local runs had used v2)
Investigation of the slow any(f, ::ArenaNode) found the generic
AbstractNode traversal machinery is already fast on ArenaNode (within 15%
of Node, zero allocations) -- the custom _arena_any/_arena_mapreduce/
is_constant overrides were unnecessary, and the closure-based recursion in
the any override was nondeterministically up to 8x slower (inlining of the
Base.any layers is compilation-order dependent). So:
- the traversal overrides are deleted; ArenaNode rides the generic
machinery, except a 3-line compact-arena flat scan for (1.4-1.7x
faster than Node, and has_constants/is_constant inherit it)
- the unset-child UndefRefError guard moves from the deleted override into
_load_entry at the facade layer, so poison facades throw like Node's
undefined fields for every generic consumer
Copying out of a non-compact arena is now an entry-level compacting write
(_write_subtree! into a pre-sized vector; no facade traversal, no per-push
growth checks), also used by copy_into! and cross-arena attachment.
copy(::ArenaNode): compact 5.7-11x faster than Node at n=15-255;
non-compact crosses over near n=20 (1.1-1.2x above).
ArenaNodeEval.jl is now its own module included from DynamicExpressions.jl
rather than a nested include. The eval benchmark also reports unbuffered
paths.
Property tests (using the existing supposition_utils generators): Node -> ArenaNode -> Node round-trips with all read-only interface results equal; evaluation matches Node (unbuffered exactly; the buffered plan's ok flag is best-effort, values must agree whenever both report ok); random valid mutations applied to both representations keep them equivalent; copy re-compacts and preserves evaluation.
…_nodes hash went through the generic tree_mapreduce machinery (0.64x vs Node); direct recursion over entries through the shared leaf_hash/branch_hash combinators recovers it to 0.92x -- the rest is the hash arithmetic itself, which must be identical across representations since cross-representation == requires equal hashes (now pinned by the supposition round-trip property). count_constant_nodes on a compact arena is a flat count over the entry vector (1.6x vs Node).
The generic traversal skeleton re-reads the entry through the facade for every field access and shuffles Nullable-wrapped children, which taxed all tree_mapreduce clients at once (hash 0.64x, collect 0.70x, count_depth 0.76x, count_constant_nodes 0.73x vs Node). One override of the primitive -- read each entry exactly once, dispatch on the local degree, hand f the facade -- normalizes the whole family (1.0x for collect/count_depth/ count_constant_nodes, 0.94x index_constant_nodes, 0.85x hash) in both compact and non-compact arenas. This replaces the per-function fixes: the previous hash and count_constant_nodes specializations are deleted, and since the generic hash combinators now just run on the fast skeleton, hash equality across representations holds again (pinned in the supposition round-trip property).
At 31-node trees the generic any falls to 0.52x vs Node on non-compact arenas (facade re-reads per field, like the tree_mapreduce skeleton); the entry-level recursion recovers to 0.67x. The remainder is f reading fields through the facade (arena -> nodes -> entry per access vs Node's single deref), inherent to the facade contract; compact arenas bypass it entirely via the flat scan (1.8x ahead of Node at n=31).
A linear postfix scan over compact arenas benches ~20% faster for pure reductions (hash 1.03x vs Node), but f-application order (parent first, siblings left to right) is observable API through collect/filter/foreach: the postfix scan reorders them, applying positional mutations to different nodes than Node would. Caught by the supposition properties with shrunk counterexamples (abs(5.0e-324), one set_val). Recursion stays; the comment on _entry_mapreduce documents the constraint.
The flat scan was the recursive traversal with recursion disabled, so the
two share a single _entry_any whose children loop is gated on a static
Val{recurse} flag: the compact path drives it from a flat loop with the
recursion compiled out, the non-compact path recurses. Performance
unchanged (compact 1.6-2.0x vs Node at n=31, non-compact 0.63-0.68x).
…tant ops One driver for order-free whole-tree visits, with both walks side by side and the unused one compiled out via a static flag: compact arenas walk the entry array (indices valid by construction, so the load is uncheckable and dead when unused); non-compact arenas walk the children pointers with the per-degree dispatch unrolled like the generic any (a children loop benches ~40% slower; a worklist allocates and loses at SR sizes; closure-driven recursion through Base.any inlines nondeterministically -- all measured). any/count_nodes/get_scalar_constants are one-line clients, replacing the recursive _entry_any, the count_nodes invoke fallback, and the filter_map path. Also, on the canonical-form principle (pure or order-free ops on a compact arena may use the entries directly): - == between compact trees is a flat field-compare (equal trees have identical entry vectors); 2.7x vs Node, from 1.1x - count_constant_nodes: flat count when compact (2.0x), entry-level mapreduce otherwise - set_scalar_constants! writes entries directly: val-only writes cannot change structure, so the setindex! degree/children diff is skipped - @inline restored on getproperty (the removal had been validated against an eval-only benchmark that never exercises property reads) n=31 vs Node: compact 1.4-3.4x on ten ops, floor 0.81x (hash); non-compact 0.85-1.75x, floor 0.65x (copy_into!, the re-compaction price that wins back above n~20).
|
@codex review |
Older Supposition versions resolved by the downgrade-compat CI job lack length(::Data.OneOf); a single multi-arg map selecting the mutation kind by integer generates the same tuples on any version.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6fda350b22
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| function copy_into!( | ||
| dest::Arena{T,D}, | ||
| src::ArenaNode{T,D}; | ||
| ref::Union{Nothing,Base.RefValue{<:Integer}}=nothing, |
There was a problem hiding this comment.
Honor the copy_into! ref counter
When callers pass ref to copy_into!, this ArenaNode overload currently accepts it but never resets or increments it, so ref[] remains stale/zero even after a successful copy. The generic preallocation path uses this counter to report how many nodes were consumed and tests assert that behavior for other node types; ArenaNode should either update it to the copied node count (including the dest === src.arena fast path) or not expose the keyword.
Useful? React with 👍 / 👎.
| i = refs[j] | ||
| nodes[i] = _replace(nodes[i]; val=convert(T, constants[j])) | ||
| end | ||
| return nothing |
There was a problem hiding this comment.
Return the mutated tree from set_scalar_constants!
For ArenaNode callers, set_scalar_constants! now returns nothing, while the existing generic implementation returns the mutated tree. Code that follows the established API by chaining or reassigning the result, e.g. tree = set_scalar_constants!(tree, vals, refs), will lose the ArenaNode even though the constants were updated successfully.
Useful? React with 👍 / 👎.
| end | ||
|
|
||
| function set_child!(node::ArenaNode{T,D}, child::AbstractNode{D}, i::Int) where {T,D} | ||
| idx = _resolve_child_index!(node, child) |
There was a problem hiding this comment.
Validate child indices before copying
When i is out of range and child comes from another arena, this resolves the child by copying it into node.arena before the bounds-checked entry.children[i] access throws. That leaves an orphaned subtree appended to the parent arena after a failed set_child!, unlike the generic tuple-based implementation which throws without mutating state.
Useful? React with 👍 / 👎.
Prototype
ArenaNode: an array-backed arena representation for expressions, implementing the fullNodeInterface. Evaluation runs directly on the arena facade and matchesNodeon allocations (scripts/bench_arenanode.jl), with cheapercopyand a smaller memory footprint.