nanoarrow Migration: Revisiting C-Level Optimization Passes (#61, #62) #63

justinjoy (Maintainer) · 2026-03-01T14:45:37Z

Context

Phase 1 included two optimization passes that were closed as wontfix after profiling and DD source-level analysis:

#61 Subplan Sharing ("Add Subplan Sharing optimization pass"): detect identical sub-expressions across rules and compute them once
#62 Boolean Specialization ("Add Boolean Specialization optimization pass"): replace JOINs with unary relations using set-membership filters

Both were closed because Differential Dataflow's internal machinery absorbs the optimizations:

DD's Variable handles are lightweight references — collections are already shared
DD's join_map and semijoin both call the same join_core. For unary relations, arrange_by_key already produces empty-value arrangements equivalent to arrange_by_self
Profiling confirmed this: with subplan sharing, execution on DOOP was 1.9% slower (within noise), and memory use dropped by only 15.6%

Why nanoarrow Changes Everything

If wirelog migrates from DD's row-oriented execution to an Apache Arrow columnar backend (via nanoarrow), the cost model changes fundamentally:

Aspect	DD (current)	Arrow columnar
Collection reference	`Variable` = lightweight handle (implicit sharing)	RecordBatch (explicit copy or zero-copy ref)
JOIN internals	`arrange_by_key` + `join_core` (arrangement reuse)	Hash join / sort-merge join on column arrays (no arrangement infrastructure)
FILTER	DD operator applies closure to stream	Arrow compute kernel scans column array
Duplicate filter	DD may share underlying stream	Actually scans the column twice

#61 Subplan Sharing → Reopen Candidate

DD implicitly shares collections via Variable handles. Arrow has no such mechanism — if two rules apply the same filter to the same RecordBatch, the filter kernel executes twice. Common Sub-Expression Elimination (CSE) would provide real savings by materializing the shared filtered result once.

Estimated impact: DOOP has 4 sharing groups with identical multi-way join bodies. Without DD's implicit sharing, the 8-way virtual dispatch join would be computed 3x instead of 1x.
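
As a rough illustration of what plan-level CSE could look like for a repeated filter, the sketch below memoizes a filter result by (input id, predicate id), so that a second rule requesting the same filter reuses the materialized selection vector instead of rescanning the column. Every name here (`filter_cache_t`, `cached_filter`, `pred_gt_10`) is hypothetical and not part of wirelog or nanoarrow, and a plain `int64_t` array stands in for an Arrow column buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical predicate over one int64 column value. */
typedef int (*pred_fn)(int64_t);

/* One memoized filter result: indices of the rows that passed. */
typedef struct {
    int     input_id;  /* identifies the input column (assumed stable) */
    int     pred_id;   /* identifies the predicate expression */
    size_t *sel;       /* selection vector of row indices */
    size_t  n_sel;
} filter_entry_t;

#define FILTER_CACHE_CAP 16

typedef struct {
    filter_entry_t entries[FILTER_CACHE_CAP];
    size_t         n_entries;
    size_t         n_scans;  /* number of real column scans performed */
} filter_cache_t;

/* Return the selection vector for (input_id, pred_id). The column is scanned
 * only on the first request; later requests reuse the stored result.
 * Returns NULL if the cache is full or allocation fails. */
static const filter_entry_t *
cached_filter(filter_cache_t *c, int input_id, int pred_id,
              const int64_t *col, size_t n_rows, pred_fn pred)
{
    assert(c != NULL && col != NULL && pred != NULL);

    for (size_t i = 0; i < c->n_entries; i++) {
        if (c->entries[i].input_id == input_id &&
            c->entries[i].pred_id == pred_id)
            return &c->entries[i];  /* shared subplan: no rescan */
    }
    if (c->n_entries == FILTER_CACHE_CAP)
        return NULL;

    size_t *sel = malloc((n_rows ? n_rows : 1) * sizeof *sel);
    if (sel == NULL)
        return NULL;

    size_t k = 0;
    for (size_t r = 0; r < n_rows; r++) {
        if (pred(col[r]))
            sel[k++] = r;
    }
    c->n_scans++;

    filter_entry_t *e = &c->entries[c->n_entries++];
    e->input_id = input_id;
    e->pred_id = pred_id;
    e->sel = sel;
    e->n_sel = k;
    return e;
}

/* Release all memoized selection vectors. */
static void filter_cache_free(filter_cache_t *c)
{
    for (size_t i = 0; i < c->n_entries; i++)
        free(c->entries[i].sel);
    c->n_entries = 0;
}

/* Example predicate: value > 10. */
static int pred_gt_10(int64_t v) { return v > 10; }
```

Calling `cached_filter` twice with the same ids leaves `n_scans` at 1. A real pass would key on a structural hash of the plan node rather than hand-assigned integer ids, and would need an invalidation rule for inputs that change between iterations of a recursive stratum.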

#62 Boolean Specialization → Reopen Candidate

DD's join_core processes both JOIN and SEMIJOIN identically for unary relations. In Arrow:

JOIN = hash table build + probe (expensive: allocate hash map, insert all rows, probe all rows)
Unary set membership = bitmap filter or dictionary lookup (cheap: a boolean-mask filter kernel, e.g. arrow::compute::Filter in Arrow C++)

The cost difference is potentially 10-100x for set-membership vs. hash join on columnar data.

Across all benchmarks there are 42 JOINs involving unary relations (20 in DOOP alone, 12 in recursive strata). With Arrow, converting these to bitmap filters would eliminate hash table construction for those joins entirely.
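
To make the contrast concrete, here is a minimal sketch of the set-membership path. It assumes the join key is a dense, dictionary-encoded integer in `[0, domain)`: the unary relation is turned into a bitmap once, and the "join" collapses to one bit test per probe row. The function names and the dense-key assumption are illustrative only and are not taken from wirelog or nanoarrow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bitmap over a dense key domain [0, domain). */
typedef struct {
    uint64_t *words;
    size_t    domain;
} key_bitmap_t;

/* Mark one key as present. */
static void bitmap_set(key_bitmap_t *bm, uint32_t key)
{
    assert(key < bm->domain);
    bm->words[key / 64] |= UINT64_C(1) << (key % 64);
}

/* Nonzero if key is in the set; keys outside the domain are never members. */
static int bitmap_test(const key_bitmap_t *bm, uint32_t key)
{
    return key < bm->domain &&
           ((bm->words[key / 64] >> (key % 64)) & 1u) != 0;
}

/* Build the bitmap from the single column of a unary relation.
 * Returns 0 on success, -1 on allocation failure. */
static int bitmap_from_unary(key_bitmap_t *bm, size_t domain,
                             const uint32_t *keys, size_t n_keys)
{
    assert(bm != NULL && keys != NULL);
    bm->domain = domain;
    bm->words = calloc(domain / 64 + 1, sizeof *bm->words);
    if (bm->words == NULL)
        return -1;
    for (size_t i = 0; i < n_keys; i++)
        bitmap_set(bm, keys[i]);
    return 0;
}

/* Probe side: write the indices of rows whose key is a member into out_sel
 * (capacity n_rows) and return how many rows survived. No hash table is
 * built or probed. */
static size_t semijoin_filter(const key_bitmap_t *bm,
                              const uint32_t *probe_keys, size_t n_rows,
                              size_t *out_sel)
{
    size_t k = 0;
    for (size_t r = 0; r < n_rows; r++) {
        if (bitmap_test(bm, probe_keys[r]))
            out_sel[k++] = r;
    }
    return k;
}
```

The result matches a join against the unary relation only under set semantics, where the unary side holds no duplicate keys. For sparse or non-integer keys the bitmap would have to be replaced by a dictionary lookup, which narrows the gap to a hash join.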

Recommendation

When the nanoarrow migration begins:

Reopen #61 (Add Subplan Sharing optimization pass): implement CSE at the IR or execution-plan level. Without DD's arrangement sharing, this becomes a high-value optimization.
Reopen #62 (Add Boolean Specialization optimization pass): implement boolean specialization as bitmap-filter replacement for unary JOINs. A boolean-mask filter kernel could make this a significant win.
Profile first: as with the DD analysis, validate with profiling before committing to implementation. Arrow's vectorized execution may have its own optimizations (SIMD filter, predicate pushdown) that partially absorb these gains.

References

#61, Add Subplan Sharing optimization pass (closed: wontfix, profiling results)
#62, Add Boolean Specialization optimization pass (closed: wontfix, DD source analysis)
DD source: differential-dataflow/src/operators/join.rs:143-155 — join_map and semijoin share join_core
wirelog executor: rust/wirelog-dd/src/dataflow.rs:540-544 — unary right side produces empty-value split

justinjoy (Maintainer, Author) · 2026-03-17T03:40:36Z

Status Update (2026-03-17)

The nanoarrow migration that this discussion anticipated has been completed. The Differential Dataflow (Rust) backend was removed in commit 8f03049 and replaced with a pure C11 columnar backend built on nanoarrow. The columnar backend has progressed through multiple phases (2C → 3A → 3B → current) and is now the sole execution engine.

Current state of the referenced issues

#61 Subplan Sharing: still closed (wontfix). The original closure was based on DD's implicit sharing via Variable handles. As this discussion predicted, the Arrow columnar backend has no equivalent implicit sharing mechanism. CSE at the IR/plan level remains a valid future optimization.
#62 Boolean Specialization: still closed (wontfix). The columnar backend now uses explicit hash join and sort-merge join on column arrays, where unary set-membership could be replaced by bitmap filters. This is the scenario described in the original post.

What has changed since this discussion was opened

LFTJ (Leapfrog Triejoin) is being implemented for multi-way joins (#195, "Implement worst-case optimal join ordering (WCOJ) for 8-9 way joins"). This partially addresses the join cost model concerns raised in #62, because LFTJ avoids hash table construction entirely for multi-way joins on sorted arrangements.
The columnar backend includes a sorted arrangement cache (wl_col_arrangement_cache_t), which provides some implicit sharing of sorted materializations and so partially mitigates the CSE concern from #61.
K-fusion plan generation was implemented, adding another optimization layer that wasn't present when this discussion was opened.
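
For readers unfamiliar with LFTJ, the reason it needs no hash tables is that its core step is a leapfrog intersection over sorted key arrays: each iterator in turn seeks to the current maximum key until all of them agree. The sketch below shows only that single-attribute step on plain sorted `int64_t` arrays. It is not wirelog's implementation (see #195), and the names `seek_ge` and `leapfrog_intersect` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEAPFROG_MAX_ITERS 8

/* First index in [lo, n) with a[idx] >= target, found by binary search. */
static size_t seek_ge(const int64_t *a, size_t n, size_t lo, int64_t target)
{
    size_t hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Intersect k sorted, duplicate-free arrays. Common keys are written to out,
 * which must have room for the shortest input. Returns the number of keys. */
static size_t leapfrog_intersect(const int64_t *const *arrs,
                                 const size_t *lens, size_t k, int64_t *out)
{
    assert(k >= 1 && k <= LEAPFROG_MAX_ITERS);

    size_t pos[LEAPFROG_MAX_ITERS] = {0};
    size_t n_out = 0;

    for (size_t i = 0; i < k; i++) {
        if (lens[i] == 0)
            return 0;  /* an empty input empties the intersection */
    }

    /* Start from the largest first key: nothing smaller can be common. */
    int64_t max = arrs[0][0];
    for (size_t i = 1; i < k; i++) {
        if (arrs[i][0] > max)
            max = arrs[i][0];
    }

    size_t agree = 0;  /* consecutive iterators currently positioned at max */
    size_t i = 0;
    for (;;) {
        pos[i] = seek_ge(arrs[i], lens[i], pos[i], max);
        if (pos[i] == lens[i])
            return n_out;  /* one input is exhausted */

        int64_t v = arrs[i][pos[i]];
        if (v == max) {
            if (++agree == k) {
                out[n_out++] = max;  /* all k iterators agree */
                if (++pos[i] == lens[i])
                    return n_out;
                max = arrs[i][pos[i]];
                agree = 0;
                continue;  /* re-examine iterator i at the new max */
            }
        } else {
            max = v;  /* overshoot: v becomes the new candidate */
            agree = 1;
        }
        i = (i + 1) % k;
    }
}
```

The full triejoin applies this step once per join variable over trie-shaped sorted arrangements, which is where a sorted arrangement cache becomes useful: the sort cost is paid once and reused.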

Recommendation

The core analysis in this discussion remains sound, but the urgency has shifted:

#62 (Boolean Specialization): Lower priority, since LFTJ (in progress, #195) is meant to handle multi-way joins without hash tables. Still valuable for binary joins involving unary relations outside LFTJ chains.
#61 (Subplan Sharing / CSE): Still relevant. The arrangement cache helps but doesn't cover all sharing opportunities (e.g., identical filter subexpressions across rules).

Both should be revisited when profiling identifies join or redundant-computation bottlenecks in the columnar backend. Closing this discussion as the migration is complete and the analysis has been captured.
