Status Update (2026-03-17)
The nanoarrow migration that this discussion anticipated has been completed. The Differential Dataflow (Rust) backend was removed in commit
Current state of the referenced issues
What has changed since this discussion was opened
Recommendation
The core analysis in this discussion remains sound, but the urgency has shifted:
Both should be revisited when profiling identifies join or redundant-computation bottlenecks in the columnar backend. Closing this discussion as the migration is complete and the analysis has been captured.
Context
Phase 1 included two optimization passes that were closed as wontfix after profiling and DD source-level analysis:
- #61 Subplan Sharing
- #62 Boolean Specialization
Both were closed because Differential Dataflow's internal machinery absorbs the optimizations:
- Variable handles are lightweight references, so collections are already shared
- join_map and semijoin both call the same join_core; for unary relations, arrange_by_key already produces empty-value arrangements equivalent to arrange_by_self
Why nanoarrow Changes Everything
If wirelog migrates from DD's row-oriented execution to an Apache Arrow columnar backend (via nanoarrow), the cost model changes fundamentally:
- Variable = lightweight handle (implicit sharing)
- arrange_by_key + join_core (arrangement reuse)
#61 Subplan Sharing → Reopen Candidate
DD implicitly shares collections via Variable handles. Arrow has no such mechanism: if two rules apply the same filter to the same RecordBatch, the filter kernel executes twice. Common Sub-Expression Elimination (CSE) would provide real savings by materializing the shared filtered result once.
Estimated impact: DOOP has 4 sharing groups with identical multi-way join bodies. Without DD's implicit sharing, the 8-way virtual dispatch join would be computed 3x instead of 1x.
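The sharing idea can be illustrated with a minimal sketch. It assumes plain Python lists as a stand-in for Arrow columns; `filter_batch`, `SharedFilters`, and the cache key are hypothetical names for illustration, not part of wirelog, nanoarrow, or Arrow. The counter shows the filter kernel running twice without sharing and once with it.

```python
from typing import Callable, Dict, List, Tuple

# Stand-in for a RecordBatch: column name -> column values (assumption for this sketch).
Batch = Dict[str, List[int]]

kernel_calls = 0  # counts how often the filter "kernel" actually runs


def filter_batch(batch: Batch, column: str, pred: Callable[[int], bool]) -> Batch:
    """Evaluate the predicate on one column and keep the matching rows."""
    global kernel_calls
    kernel_calls += 1
    mask = [pred(v) for v in batch[column]]
    return {name: [v for v, keep in zip(col, mask) if keep]
            for name, col in batch.items()}


class SharedFilters:
    """Hypothetical CSE cache: one materialized result per (batch, column, predicate name)."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[int, str, str], Batch] = {}

    def get(self, batch: Batch, column: str, pred_name: str,
            pred: Callable[[int], bool]) -> Batch:
        key = (id(batch), column, pred_name)
        if key not in self._cache:
            # First rule to ask pays for the kernel; later rules reuse the result.
            self._cache[key] = filter_batch(batch, column, pred)
        return self._cache[key]


def is_even(v: int) -> bool:
    return v % 2 == 0


edges: Batch = {"src": [1, 2, 3, 4], "dst": [2, 3, 4, 5]}

# Two rules applying the same filter independently: the kernel runs twice.
kernel_calls = 0
r1 = filter_batch(edges, "src", is_even)
r2 = filter_batch(edges, "src", is_even)
unshared_calls = kernel_calls

# The same two rules going through the shared cache: the kernel runs once.
kernel_calls = 0
shared = SharedFilters()
s1 = shared.get(edges, "src", "is_even", is_even)
s2 = shared.get(edges, "src", "is_even", is_even)
shared_calls = kernel_calls
```

Keying on a predicate name rather than the callable is a simplification; a real planner would more likely key on a canonical form of the rule body so that structurally identical subplans are recognized.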
#62 Boolean Specialization → Reopen Candidate
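DD's join_core processes both JOIN and SEMIJOIN identically for unary relations. In Arrow the two differ: a JOIN needs a hash join, while a SEMIJOIN against a unary relation reduces to a set-membership test (arrow::compute::filter with boolean mask).
The cost difference is potentially 10-100x for set-membership vs. hash join on columnar data.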
Across all benchmarks there are 42 JOINs involving unary relations (20 in DOOP alone, 12 in recursive strata). With Arrow, converting these to bitmap filters would eliminate hash table construction for them entirely.
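A minimal sketch of the specialization, assuming keys are small, dense, non-negative integer ids (for example dictionary-encoded symbols) and that the unary relation holds distinct values. The function names are hypothetical, and plain Python lists stand in for Arrow columns. The generic path builds and probes a hash table; the specialized path builds a membership bitmap and applies it as a boolean mask.

```python
from typing import Dict, List


def hash_join_rows(left_keys: List[int], unary: List[int]) -> List[int]:
    """Generic hash join: build a table on the unary side, probe with the left keys.

    Returns the left row index once per matching right row.
    """
    table: Dict[int, List[int]] = {}
    for i, k in enumerate(unary):
        table.setdefault(k, []).append(i)
    rows: List[int] = []
    for row, k in enumerate(left_keys):
        for _ in table.get(k, []):
            rows.append(row)
    return rows


def membership_bitmap(unary: List[int], domain_size: int) -> bytearray:
    """One byte per possible id; 1 marks ids present in the unary relation."""
    bitmap = bytearray(domain_size)
    for k in unary:
        bitmap[k] = 1
    return bitmap


def semijoin_mask(left_keys: List[int], bitmap: bytearray) -> List[bool]:
    """Boolean mask over the left rows: True where the key is in the unary relation."""
    return [bool(bitmap[k]) for k in left_keys]


def apply_mask(column: List[int], mask: List[bool]) -> List[int]:
    """Stand-in for a columnar filter kernel driven by a boolean mask."""
    return [v for v, keep in zip(column, mask) if keep]


call_src = [0, 3, 1, 3, 2]       # join key column of the binary relation
call_dst = [10, 11, 12, 13, 14]  # payload column of the binary relation
reachable = [3, 1]               # unary relation (distinct ids)

# Generic path: hash join, then gather the payload rows.
joined = [call_dst[r] for r in hash_join_rows(call_src, reachable)]

# Specialized path: bitmap membership test, then mask filter.
mask = semijoin_mask(call_src, membership_bitmap(reachable, domain_size=4))
filtered = apply_mask(call_dst, mask)
```

Both paths select the same rows here because the unary side has no duplicates; with duplicates, the hash join would repeat rows while the mask filter would not, which is the semijoin behaviour.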
Recommendation
When the nanoarrow migration begins:
compute::filter makes this a significant win.
References
- differential-dataflow/src/operators/join.rs:143-155: join_map and semijoin share join_core
- rust/wirelog-dd/src/dataflow.rs:540-544: unary right side produces empty-value split