chore: Add existence (semi / anti) benchmarks for hashjoinexec #21821
coderfender wants to merge 8 commits into apache:main from
|
@Dandandan, @2010YOUY01, please take a look at these benchmarks, which I plan to refer to for the bitmap-based optimizations in #21817. This essentially has a cargo bench (for faster / simpler bench tests through |
|
Thank you for working on this! I have some suggestions for you to consider.
High-level issue
I think the main issue is using
A good benchmark should reflect realistic workloads. To achieve that, we should define a set of core axes and vary them systematically, I think for equi-joins, it could be:
In contrast, I believe we'd better remove
For this PR
For this PR, I suggest keeping the end-to-end
For the Criterion micro-benchmarks, it would be better to first focus on a few representative workloads (e.g., join size, type), and then optionally add a small number of targeted cases for specific fast paths, such as right semi/anti joins with
In short, fewer end-to-end queries should be sufficient for this PR. We could add criterion micro-benches later based on the above design. |
|
@2010YOUY01 , I updated the benches per your review comments. I do agree that we need |
|
Latest critcmp results from the above benches, compared with #21817 |
| probe_size: "60M", | ||
| }, | ||
| // RightSemi Join benchmarks with Int32 keys | ||
| // Q16: RightSemi, Small build (25 rows), 100% Hit rate |
Let's also document the fanout here: if we change the join type to an inner join, it is the average number of matches found for each probe row.
This can be obtained automatically by running EXPLAIN ANALYZE on the query after changing the join type to inner join; it will show up in the HashJoinExec metrics.
And later we should ensure those queries have covered different fanouts.
Hmm, my previous explanation may have been confusing. Let me try again.
Suppose we have the query:
SELECT *
FROM generate_series(100) AS t1(v1)
RIGHT SEMI JOIN generate_series(10) AS t2(v1)
ON (t1.v1 % 10) = t2.v1
Here, each probe row from t2 matches 10 rows on average from t1, so the matching rows per probe row ratio is 10:1.
Although a semi join only returns whether a match exists, this ratio still matters for execution behavior, because we are evaluating short-circuit optimizations here.
So I suggest we could doc this metric here. See the original reply for how to get this matching ratio metric automatically.
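To make the metric concrete, here is a minimal sketch of how the matching-rows-per-probe-row ratio and the hit rate could be computed for a workload shaped like the query above. It is not DataFusion code: `average_fanout` and `hit_rate` are hypothetical helper names, and the data assumes exactly 100 build rows and 10 probe rows rather than relying on the exact bounds of generate_series.

```rust
use std::collections::HashMap;

/// Count how many build-side rows carry each join key.
fn build_counts(build_keys: &[i64]) -> HashMap<i64, usize> {
    let mut counts = HashMap::new();
    for key in build_keys {
        *counts.entry(*key).or_insert(0) += 1;
    }
    counts
}

/// Average number of build-side matches per probe row, i.e. the fanout an
/// inner join on the same keys would produce. (Hypothetical helper.)
fn average_fanout(build_keys: &[i64], probe_keys: &[i64]) -> f64 {
    if probe_keys.is_empty() {
        return 0.0;
    }
    let counts = build_counts(build_keys);
    let total: usize = probe_keys
        .iter()
        .map(|key| counts.get(key).copied().unwrap_or(0))
        .sum();
    total as f64 / probe_keys.len() as f64
}

/// Fraction of probe rows with at least one match, which is what a semi
/// join actually needs to know. (Hypothetical helper.)
fn hit_rate(build_keys: &[i64], probe_keys: &[i64]) -> f64 {
    if probe_keys.is_empty() {
        return 0.0;
    }
    let counts = build_counts(build_keys);
    let hits = probe_keys
        .iter()
        .filter(|key| counts.contains_key(*key))
        .count();
    hits as f64 / probe_keys.len() as f64
}

fn main() {
    // Assumed data: 100 build rows keyed by v % 10, 10 probe rows keyed 0..10.
    let build: Vec<i64> = (0..100).map(|v| v % 10).collect();
    let probe: Vec<i64> = (0..10).collect();
    println!("fanout   = {}", average_fanout(&build, &probe));
    println!("hit rate = {}", hit_rate(&build, &probe));
}
```

With this data every probe key has a match, so two workloads with the same 100% hit rate can still differ in fanout, which is the part that affects how much work a short-circuiting semi/anti join can skip.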
| // RightSemi Join benchmarks | ||
| // ========================================================================= | ||
|
|
||
| // RightSemi - 100% Density, 100% hit rate |
The density parameter is a bit hard to interpret, could you add a comment to make the workload easier to understand?
This density metric is still not obvious to me, and I think it would also be confusing for others.
I would recommend adding a comment that explains:
- what is it
- why it matters for this workload
I suspect it is related to the average number of matching rows per probe row, but we still need to look at the implementation to figure it out, which takes time.
|
Thank you for the feedback @2010YOUY01 , I added more helpful comments and made sure the join order is sustained per your feedback. Please take a look whenever you get a chance |
Add Existence Join Benchmarks
What changes are included in this PR?
1. End-to-end benchmarks (benchmarks/src/hj.rs)
Adds Q16-Q21 for RightSemi and RightAnti joins, following reviewer feedback to focus on core axes:
2. Criterion micro-benchmark (datafusion/physical-plan/benches/hash_join_semi_anti.rs)
Density variations: