[Research] Use custom cost model when deciding between SMJ and SHJ

### What is the problem the feature request solves?

Spark often prefers SortMergeJoin (SMJ) to ShuffledHashJoin (SHJ) because it is more stable (less likely to OOM) and has good performance.

However, Comet (and other vectorized Spark accelerators) tend to have better performance with SHJ. In https://github.com/apache/datafusion-comet/pull/1007 we add a configuration option that lets the user choose between SMJ and SHJ at the time Comet translates Spark's plan, but it would be better if we could automatically enable SHJ.

Spark AQE already re-optimizes each query stage leveraging basic statistics (row count / data size) from completed child query stages and will choose between SMJ and SHJ based on Spark's cost model. Spark supports custom cost models, so we should explore providing a Comet-specific cost model that can make a better choice between SMJ and SHJ.


### Describe the potential solution

_No response_

### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Research] Use custom cost model when deciding between SMJ and SHJ #1011

What is the problem the feature request solves?

Describe the potential solution

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Research] Use custom cost model when deciding between SMJ and SHJ #1011

Description

What is the problem the feature request solves?

Describe the potential solution

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions