chore: Add existence (semi / anti ) benchmarks for hashjoinexec by coderfender · Pull Request #21821 · apache/datafusion

coderfender · 2026-04-24T06:55:20Z

Add Existence Join Benchmarks

What changes are included in this PR?

1. End-to-end benchmarks (`benchmarks/src/hj.rs`)

Adds Q16-Q21 for RightSemi and RightAnti joins, following reviewer feedback to focus on core axes:

Query	Join Type	Build Size	Probe Size	Hit Rate
Q16	RightSemi	25 (nation)	1.5M (customer)	100%
Q17	RightSemi	100K (supplier)	60M (lineitem)	100%
Q18	RightSemi	100K (supplier)	60M (lineitem)	10%
Q19	RightAnti	25 (nation)	1.5M (customer)	100%
Q20	RightAnti	100K (supplier)	60M (lineitem)	100%
Q21	RightAnti	100K (supplier)	60M (lineitem)	10%

2. Criterion micro-benchmark (`datafusion/physical-plan/benches/hash_join_semi_anti.rs`)

Density variations :

Benchmark	Join Type	Density	Hit Rate
right_semi_d100_h100	RightSemi	100%	100%
right_semi_d100_h10	RightSemi	100%	10%
right_semi_d50_h100	RightSemi	50%	100%
right_semi_d50_h10	RightSemi	50%	10%
right_semi_d10_h100	RightSemi	10%	100%
right_semi_d10_h10	RightSemi	10%	10%
right_anti_d100_h100	RightAnti	100%	100%
right_anti_d100_h10	RightAnti	100%	10%
right_anti_d50_h100	RightAnti	50%	100%
right_anti_d50_h10	RightAnti	50%	10%
right_anti_d10_h100	RightAnti	10%	100%
right_anti_d10_h10	RightAnti	10%	10%

coderfender · 2026-04-24T06:58:08Z

@Dandandan , @2010YOUY01 , Please take a look at these benchmarks I plan to refer for bitmap based optimizations : #21817 . This essentially has a cargo ben h (for faster / simpler bench tests through critcmp ) along with additional existence tests to TPCH datasets . I tried adding various densities and match rates to try and replicate as many real worlds scenarios as possible as well

2010YOUY01 · 2026-04-25T07:10:56Z

Thank you for working on this! I have some suggestions for you to consider.

High-level issue

I think the main issue is using density as a primary axis when evaluating equi-join performance. While it was introduced in #21821 for perfect hash join experiments, it seems it is not a good axis for designing representative benchmarks.

A good benchmark should reflect realistic workloads. To achieve that, we should define a set of core axes and vary them systematically, I think for equi-joins, it could be:

Equi-join benchmark key axes:
- Build/probe side size
- Join type (inner, outer, semi, etc.)
- Number of join keys
- Join key data type
- Probe hit rate
- Fanout (average number of matches per probe key)

In contrast, density (i.e., key range span divided by key count) is not representative of typical workloads. It is primarily useful for evaluating specific fast paths (e.g., perfect hash join), but making it a primary axis complicates the benchmark design, and may mislead future optimization efforts.

I believe we'd better remove density from the key axes in the future. For fast paths like perfect Hj and semi/anti join, we could simply add a few queries that the fast path wins.

For this PR

For this PR, I suggest keeping the end-to-end hj benchmark simple. We don’t need to enumerate all density combinations here—a smaller set of representative queries should be enough to evaluate the optimization.

For the Criterion micro-benchmarks, it would be better to first focus on a few representative workloads (e.g., join size, type), and then optionally add a small number of targeted cases for specific fast paths, such as right semi/anti joins with Int32 keys, otherwise it would be hard to extend and maintain.

In short, fewer end-to-end queries should be sufficient for this PR. We could add criterion micro-benches later based on the above design.

coderfender · 2026-05-05T07:24:23Z

@2010YOUY01 , I updated the benches per your review comments. I do agree that we need density as an axis in criterion micro-benchmark and added them

coderfender · 2026-05-05T07:51:34Z

Latest critcmp results from above benches icompared with #21817

group                                               hashmap                                roaring_bitmap
-----                                               -------                                --------------
hash_join_semi_anti/right_anti_d100_h10/1000000     3.01      5.3±0.10ms        ? ?/sec    1.00  1750.0±49.28µs        ? ?/sec
hash_join_semi_anti/right_anti_d100_h100/1000000    1.84      3.3±0.08ms        ? ?/sec    1.00  1806.6±44.54µs        ? ?/sec
hash_join_semi_anti/right_anti_d10_h10/1000000      4.02     11.0±0.12ms        ? ?/sec    1.00      2.8±0.04ms        ? ?/sec
hash_join_semi_anti/right_anti_d10_h100/1000000     1.62      5.3±0.18ms        ? ?/sec    1.00      3.3±0.08ms        ? ?/sec
hash_join_semi_anti/right_anti_d50_h10/1000000      2.49      5.3±0.09ms        ? ?/sec    1.00      2.1±0.02ms        ? ?/sec
hash_join_semi_anti/right_anti_d50_h100/1000000     1.53      3.3±0.07ms        ? ?/sec    1.00      2.2±0.03ms        ? ?/sec
hash_join_semi_anti/right_semi_d100_h10/1000000     1.00  1574.1±31.33µs        ? ?/sec    1.11  1743.7±13.23µs        ? ?/sec
hash_join_semi_anti/right_semi_d100_h100/1000000    4.40      8.0±0.14ms        ? ?/sec    1.00  1814.4±13.57µs        ? ?/sec
hash_join_semi_anti/right_semi_d10_h10/1000000      2.71      7.4±0.30ms        ? ?/sec    1.00      2.7±0.04ms        ? ?/sec
hash_join_semi_anti/right_semi_d10_h100/1000000     3.00     10.0±0.26ms        ? ?/sec    1.00      3.3±0.06ms        ? ?/sec
hash_join_semi_anti/right_semi_d50_h10/1000000      1.00  1604.7±26.37µs        ? ?/sec    1.31      2.1±0.03ms        ? ?/sec
hash_join_semi_anti/right_semi_d50_h100/1000000     3.66      8.0±0.10ms        ? ?/sec    1.00      2.2±0.03ms        ? ?/sec

2010YOUY01 · 2026-05-05T11:26:57Z

        probe_size: "60M",
    },
+    // RightSemi Join benchmarks with Int32 keys
+    // Q16: RightSemi, Small build (25 rows), 100% Hit rate


Let's also doc the fanout here, it means if we change the join type to inner join, for each probe row, how many matches can be found on average.

This can be automatically calculated from explain analyze the query, after changing join type to inner join, it will show up in the HashJoinExec's metrics.

And later we should ensure those queries have covered different fanouts.

Hmm, my previous explanation may have been confusing. Let me try again.

Suppose we have the query:

SELECT * FROM generate_series(100) AS t1(v1) RIGHT SEMI JOIN generate_series(10) AS t2(v1) ON (t1.v1 % 10) = t2.v1

Here, each probe row from t2 matches 10 rows on average from t1, so the matching rows per probe row ratio is 10:1.

Although a semi join only returns whether a match exists, this ratio still matters for execution behavior, because we are evaluating short-circuit optimizations here.

So I suggest we could doc this metric here. See the original reply for how to get this matching ratio metric automatically.

2010YOUY01 · 2026-05-05T11:36:26Z

+    // RightSemi Join benchmarks
+    // =========================================================================
+
+    // RightSemi - 100% Density, 100% hit rate


The density parameter is a bit hard to interpret, could you add a comment to make the workload easier to understand?

This density metric is still not obvious to me, I think it would also be confusing for others.

I would recommend to add comment to explain

what is it

why it matters for this workload

I suspect it is related to the average number of matching rows per probe row, but we still need to look at the implementation to figure it out, which takes time.

coderfender · 2026-05-06T05:29:28Z

Thank you for the feedback @2010YOUY01 , I added more helpful comments and made sure the join order is sustained per your feedback. Please take a look whenever you get a chance

add_existence_benchmarks_hashjoin

a9d6d68

github-actions Bot added the physical-plan Changes to the physical-plan crate label Apr 24, 2026

add_existence_benchmarks_hashjoin

6eb40b7

coderfender force-pushed the implement_additional_benchmarks_hashjoin branch from 2fbe56f to 6eb40b7 Compare April 24, 2026 07:03

Merge branch 'main' into implement_additional_benchmarks_hashjoin

57cfc3e

coderfender changed the title ~~feat: Add existence (semi / anti ) benchmarks hashjoin~~ chore: Add existence (semi / anti ) benchmarks hashjoin Apr 24, 2026

coderfender changed the title ~~chore: Add existence (semi / anti ) benchmarks hashjoin~~ chore: Add existence (semi / anti ) benchmarks for hashjoinexec Apr 24, 2026

Merge branch 'main' into implement_additional_benchmarks_hashjoin

ad83914

2010YOUY01 reviewed Apr 25, 2026

View reviewed changes

Comment thread benchmarks/src/hj.rs Outdated

address_review_comments

4cfb0c5

Merge branch 'main' into implement_additional_benchmarks_hashjoin

499c5d7

2010YOUY01 reviewed May 5, 2026

View reviewed changes

address_review_comments

436cce4

Merge branch 'main' into implement_additional_benchmarks_hashjoin

c9fab55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: Add existence (semi / anti ) benchmarks for hashjoinexec#21821

chore: Add existence (semi / anti ) benchmarks for hashjoinexec#21821
coderfender wants to merge 8 commits intoapache:mainfrom
coderfender:implement_additional_benchmarks_hashjoin

coderfender commented Apr 24, 2026 •

edited

Loading

Uh oh!

coderfender commented Apr 24, 2026

Uh oh!

2010YOUY01 commented Apr 25, 2026

Uh oh!

Uh oh!

coderfender commented May 5, 2026 •

edited

Loading

Uh oh!

coderfender commented May 5, 2026 •

edited

Loading

Uh oh!

2010YOUY01 May 5, 2026

Uh oh!

2010YOUY01 May 7, 2026

Uh oh!

Uh oh!

Uh oh!

2010YOUY01 May 5, 2026

Uh oh!

2010YOUY01 May 7, 2026

Uh oh!

coderfender commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

coderfender commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add Existence Join Benchmarks

What changes are included in this PR?

1. End-to-end benchmarks (benchmarks/src/hj.rs)

2. Criterion micro-benchmark (datafusion/physical-plan/benches/hash_join_semi_anti.rs)

Uh oh!

coderfender commented Apr 24, 2026

Uh oh!

2010YOUY01 commented Apr 25, 2026

High-level issue

For this PR

Uh oh!

Uh oh!

coderfender commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderfender commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

2010YOUY01 May 5, 2026

Choose a reason for hiding this comment

Uh oh!

2010YOUY01 May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

2010YOUY01 May 5, 2026

Choose a reason for hiding this comment

Uh oh!

2010YOUY01 May 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderfender commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderfender commented Apr 24, 2026 •

edited

Loading

1. End-to-end benchmarks (`benchmarks/src/hj.rs`)

2. Criterion micro-benchmark (`datafusion/physical-plan/benches/hash_join_semi_anti.rs`)

coderfender commented May 5, 2026 •

edited

Loading

coderfender commented May 5, 2026 •

edited

Loading