[Research] Use custom cost model when deciding between SMJ and SHJ #1011

@andygrove

What is the problem the feature request solves?

Spark often prefers SortMergeJoin (SMJ) over ShuffledHashJoin (SHJ) because SMJ is more stable (less likely to run out of memory) and still performs well.

However, Comet (like other vectorized Spark accelerators) tends to perform better with SHJ. In #1007 we add a configuration option that lets the user choose between SMJ and SHJ at the time Comet translates Spark's plan, but it would be better if Comet could enable SHJ automatically.

Spark AQE already re-optimizes each query stage using basic statistics (row count and data size) from completed child query stages, and it chooses between SMJ and SHJ based on Spark's cost model. Spark supports custom cost models, so we should explore providing a Comet-specific cost model that makes a better choice between SMJ and SHJ.
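
To make the idea concrete, here is a rough sketch of the kind of decision such a cost model might make from the two statistics AQE exposes. It is plain Python rather than Spark code, and nothing in it comes from Spark or Comet: `StageStats`, `choose_join`, the per-row weights, and the `max_build_bytes` memory guard are all hypothetical names and values chosen for illustration. A real implementation would be written against Spark's AQE cost evaluation hook and tuned with benchmarks.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageStats:
    """Statistics from a completed child query stage (row count and data size)."""
    row_count: int
    size_in_bytes: int


# Hypothetical per-row weights; real values would have to come from benchmarks.
SORT_WEIGHT = 1.0
MERGE_WEIGHT = 1.0
HASH_BUILD_WEIGHT = 2.0
HASH_PROBE_WEIGHT = 1.0


def smj_cost(left: StageStats, right: StageStats) -> float:
    """Estimate SMJ cost: sort both inputs (n log n each), then merge once."""
    def sort_cost(rows: int) -> float:
        return SORT_WEIGHT * rows * math.log2(max(rows, 2))

    merge = MERGE_WEIGHT * (left.row_count + right.row_count)
    return sort_cost(left.row_count) + sort_cost(right.row_count) + merge


def shj_cost(left: StageStats, right: StageStats, max_build_bytes: int) -> float:
    """Estimate SHJ cost: build a hash table on the smaller input, probe with the other."""
    if left.size_in_bytes <= right.size_in_bytes:
        build, probe = left, right
    else:
        build, probe = right, left
    # Assumed stability guard: never pick SHJ if the build side may not fit in memory.
    if build.size_in_bytes > max_build_bytes:
        return math.inf
    return HASH_BUILD_WEIGHT * build.row_count + HASH_PROBE_WEIGHT * probe.row_count


def choose_join(left: StageStats, right: StageStats, max_build_bytes: int) -> str:
    """Return "SHJ" only when it is estimated to be strictly cheaper than SMJ."""
    if shj_cost(left, right, max_build_bytes) < smj_cost(left, right):
        return "SHJ"
    return "SMJ"


small = StageStats(row_count=1_000, size_in_bytes=64_000)
large = StageStats(row_count=1_000_000, size_in_bytes=64_000_000)
print(choose_join(small, large, max_build_bytes=10_000_000))
print(choose_join(small, large, max_build_bytes=1_000))
```

With these made-up weights, the first call picks SHJ because the small side fits within the memory guard, and the second falls back to SMJ because it does not. The guard is there to preserve the stability that makes Spark favor SMJ in the first place; whether a single byte threshold is the right proxy for OOM risk is an open question for this research.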

Describe the potential solution

No response

Additional context

No response
