Skip to content

chore: Add existence (semi / anti ) benchmarks for hashjoinexec#21821

Open
coderfender wants to merge 8 commits intoapache:mainfrom
coderfender:implement_additional_benchmarks_hashjoin
Open

chore: Add existence (semi / anti ) benchmarks for hashjoinexec#21821
coderfender wants to merge 8 commits intoapache:mainfrom
coderfender:implement_additional_benchmarks_hashjoin

Conversation

@coderfender
Copy link
Copy Markdown
Contributor

@coderfender coderfender commented Apr 24, 2026

Add Existence Join Benchmarks

What changes are included in this PR?

1. End-to-end benchmarks (benchmarks/src/hj.rs)

Adds Q16-Q21 for RightSemi and RightAnti joins, following reviewer feedback to focus on core axes:

Query Join Type Build Size Probe Size Hit Rate
Q16 RightSemi 25 (nation) 1.5M (customer) 100%
Q17 RightSemi 100K (supplier) 60M (lineitem) 100%
Q18 RightSemi 100K (supplier) 60M (lineitem) 10%
Q19 RightAnti 25 (nation) 1.5M (customer) 100%
Q20 RightAnti 100K (supplier) 60M (lineitem) 100%
Q21 RightAnti 100K (supplier) 60M (lineitem) 10%

2. Criterion micro-benchmark (datafusion/physical-plan/benches/hash_join_semi_anti.rs)

Density variations :

Benchmark Join Type Density Hit Rate
right_semi_d100_h100 RightSemi 100% 100%
right_semi_d100_h10 RightSemi 100% 10%
right_semi_d50_h100 RightSemi 50% 100%
right_semi_d50_h10 RightSemi 50% 10%
right_semi_d10_h100 RightSemi 10% 100%
right_semi_d10_h10 RightSemi 10% 10%
right_anti_d100_h100 RightAnti 100% 100%
right_anti_d100_h10 RightAnti 100% 10%
right_anti_d50_h100 RightAnti 50% 100%
right_anti_d50_h10 RightAnti 50% 10%
right_anti_d10_h100 RightAnti 10% 100%
right_anti_d10_h10 RightAnti 10% 10%

@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Apr 24, 2026
@coderfender
Copy link
Copy Markdown
Contributor Author

@Dandandan , @2010YOUY01 , Please take a look at these benchmarks I plan to refer for bitmap based optimizations : #21817 . This essentially has a cargo ben h (for faster / simpler bench tests through critcmp ) along with additional existence tests to TPCH datasets . I tried adding various densities and match rates to try and replicate as many real worlds scenarios as possible as well

@coderfender coderfender force-pushed the implement_additional_benchmarks_hashjoin branch from 2fbe56f to 6eb40b7 Compare April 24, 2026 07:03
@coderfender coderfender changed the title feat: Add existence (semi / anti ) benchmarks hashjoin chore: Add existence (semi / anti ) benchmarks hashjoin Apr 24, 2026
@coderfender coderfender changed the title chore: Add existence (semi / anti ) benchmarks hashjoin chore: Add existence (semi / anti ) benchmarks for hashjoinexec Apr 24, 2026
@2010YOUY01
Copy link
Copy Markdown
Contributor

Thank you for working on this! I have some suggestions for you to consider.

High-level issue

I think the main issue is using density as a primary axis when evaluating equi-join performance. While it was introduced in #21821 for perfect hash join experiments, it seems it is not a good axis for designing representative benchmarks.

A good benchmark should reflect realistic workloads. To achieve that, we should define a set of core axes and vary them systematically, I think for equi-joins, it could be:

Equi-join benchmark key axes:
- Build/probe side size
- Join type (inner, outer, semi, etc.)
- Number of join keys
- Join key data type
- Probe hit rate
- Fanout (average number of matches per probe key)

In contrast, density (i.e., key range span divided by key count) is not representative of typical workloads. It is primarily useful for evaluating specific fast paths (e.g., perfect hash join), but making it a primary axis complicates the benchmark design, and may mislead future optimization efforts.

I believe we'd better remove density from the key axes in the future. For fast paths like perfect Hj and semi/anti join, we could simply add a few queries that the fast path wins.

For this PR

For this PR, I suggest keeping the end-to-end hj benchmark simple. We don’t need to enumerate all density combinations here—a smaller set of representative queries should be enough to evaluate the optimization.

For the Criterion micro-benchmarks, it would be better to first focus on a few representative workloads (e.g., join size, type), and then optionally add a small number of targeted cases for specific fast paths, such as right semi/anti joins with Int32 keys, otherwise it would be hard to extend and maintain.

In short, fewer end-to-end queries should be sufficient for this PR. We could add criterion micro-benches later based on the above design.

Comment thread benchmarks/src/hj.rs Outdated
@coderfender
Copy link
Copy Markdown
Contributor Author

coderfender commented May 5, 2026

@2010YOUY01 , I updated the benches per your review comments. I do agree that we need density as an axis in criterion micro-benchmark and added them

@coderfender
Copy link
Copy Markdown
Contributor Author

coderfender commented May 5, 2026

Latest critcmp results from above benches icompared with #21817

group                                               hashmap                                roaring_bitmap
-----                                               -------                                --------------
hash_join_semi_anti/right_anti_d100_h10/1000000     3.01      5.3±0.10ms        ? ?/sec    1.00  1750.0±49.28µs        ? ?/sec
hash_join_semi_anti/right_anti_d100_h100/1000000    1.84      3.3±0.08ms        ? ?/sec    1.00  1806.6±44.54µs        ? ?/sec
hash_join_semi_anti/right_anti_d10_h10/1000000      4.02     11.0±0.12ms        ? ?/sec    1.00      2.8±0.04ms        ? ?/sec
hash_join_semi_anti/right_anti_d10_h100/1000000     1.62      5.3±0.18ms        ? ?/sec    1.00      3.3±0.08ms        ? ?/sec
hash_join_semi_anti/right_anti_d50_h10/1000000      2.49      5.3±0.09ms        ? ?/sec    1.00      2.1±0.02ms        ? ?/sec
hash_join_semi_anti/right_anti_d50_h100/1000000     1.53      3.3±0.07ms        ? ?/sec    1.00      2.2±0.03ms        ? ?/sec
hash_join_semi_anti/right_semi_d100_h10/1000000     1.00  1574.1±31.33µs        ? ?/sec    1.11  1743.7±13.23µs        ? ?/sec
hash_join_semi_anti/right_semi_d100_h100/1000000    4.40      8.0±0.14ms        ? ?/sec    1.00  1814.4±13.57µs        ? ?/sec
hash_join_semi_anti/right_semi_d10_h10/1000000      2.71      7.4±0.30ms        ? ?/sec    1.00      2.7±0.04ms        ? ?/sec
hash_join_semi_anti/right_semi_d10_h100/1000000     3.00     10.0±0.26ms        ? ?/sec    1.00      3.3±0.06ms        ? ?/sec
hash_join_semi_anti/right_semi_d50_h10/1000000      1.00  1604.7±26.37µs        ? ?/sec    1.31      2.1±0.03ms        ? ?/sec
hash_join_semi_anti/right_semi_d50_h100/1000000     3.66      8.0±0.10ms        ? ?/sec    1.00      2.2±0.03ms        ? ?/sec

Comment thread benchmarks/src/hj.rs
probe_size: "60M",
},
// RightSemi Join benchmarks with Int32 keys
// Q16: RightSemi, Small build (25 rows), 100% Hit rate
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also doc the fanout here, it means if we change the join type to inner join, for each probe row, how many matches can be found on average.

This can be automatically calculated from explain analyze the query, after changing join type to inner join, it will show up in the HashJoinExec's metrics.

And later we should ensure those queries have covered different fanouts.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, my previous explanation may have been confusing. Let me try again.

Suppose we have the query:

SELECT *
FROM generate_series(100) AS t1(v1)
RIGHT SEMI JOIN generate_series(10) AS t2(v1)
ON (t1.v1 % 10) = t2.v1

Here, each probe row from t2 matches 10 rows on average from t1, so the matching rows per probe row ratio is 10:1.

Although a semi join only returns whether a match exists, this ratio still matters for execution behavior, because we are evaluating short-circuit optimizations here.

So I suggest we could doc this metric here. See the original reply for how to get this matching ratio metric automatically.

Comment thread datafusion/physical-plan/benches/hash_join_semi_anti.rs Outdated
Comment thread benchmarks/src/hj.rs
// RightSemi Join benchmarks
// =========================================================================

// RightSemi - 100% Density, 100% hit rate
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The density parameter is a bit hard to interpret, could you add a comment to make the workload easier to understand?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This density metric is still not obvious to me, I think it would also be confusing for others.

I would recommend to add comment to explain

  • what is it
  • why it matters for this workload

I suspect it is related to the average number of matching rows per probe row, but we still need to look at the implementation to figure it out, which takes time.

@coderfender
Copy link
Copy Markdown
Contributor Author

Thank you for the feedback @2010YOUY01 , I added more helpful comments and made sure the join order is sustained per your feedback. Please take a look whenever you get a chance

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants