Replace ANY/ALL CASE planning with array_has/min/max desugaring #22102

Open
cetra3 wants to merge 2 commits into apache:main from pydantic:parquet_pruning_for_any

Conversation

cetra3 commented May 11, 2026

Which issue does this PR close?

Rationale for this change

This partially reverts the changes in PR #21743 but keeps the cardinality check when desugaring to array_min and array_max.

This aligns more closely with the output of the existing DataFusion functions, rather than going down the path of implementing full PostgreSQL null semantics.

What changes are included in this PR?

Adjusts how we desugar quantified comparisons such as > ANY. Rather than using a full CASE chain, we use a simplified form that checks the cardinality first and combines it with array_min/array_max.

For example:

SELECT * FROM t WHERE col > ANY([1, 2, 3])

Desugars to:

cardinality([1, 2, 3]) > 0 AND col > array_min([1, 2, 3])

Which gets simplified to:

col > 1
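
The equivalence above can be checked with a small model. This is a minimal Python sketch, not DataFusion code: `array_min`, `desugared_gt_any` and `brute_force_gt_any` are hypothetical helpers written for illustration, and the sketch assumes a non-null `col` and a haystack of literals.

```python
def array_min(values):
    """Smallest non-null element, or None (standing in for NULL) if there is none."""
    non_null = [v for v in values if v is not None]
    return min(non_null) if non_null else None


def desugared_gt_any(col, haystack):
    # cardinality(haystack) > 0 AND col > array_min(haystack)
    # Python's `and` short-circuits, so the comparison is skipped when empty.
    return len(haystack) > 0 and col > array_min(haystack)


def brute_force_gt_any(col, haystack):
    # Direct reading of `col > ANY(haystack)` for non-null inputs.
    return any(col > item for item in haystack)


haystack = [1, 2, 3]
for col in range(-1, 6):
    assert desugared_gt_any(col, haystack) == brute_force_gt_any(col, haystack)
    assert desugared_gt_any(col, haystack) == (col > 1)  # the simplified form
```

With a literal haystack both the cardinality check and `array_min` are constants, which is why the whole predicate can fold down to a plain comparison on `col`.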

Are these changes tested?

Yes, they are tested.

Are there any user-facing changes?

Yes, there are some changes to the output of some queries.

However, the behaviour being replaced was not shipped as part of 53.1.0 and exists only on main.

alamb commented May 11, 2026

FYI @buraksenn @berkaysynnada and @Jefffrey

Perhaps you can help review this PR as you helped review #21743

Jefffrey left a comment

So do we have a comprehensive view of how empty haystacks/null haystacks/haystacks containing nulls/null needles look with any/all and all supported operators with this PR?

I've lost track a bit of how the behaviour has evolved over the PRs:

So I want to ensure we have a clear understanding of the final behaviour we're agreeing on. As I understand it, this PR restores the = ANY behaviour to what it previously was and aligns the other operators (and ALL) to similar behaviour?

Comment thread datafusion/sqllogictest/test_files/array/array_all.slt Outdated
Comment thread datafusion/sqllogictest/test_files/array/array_all.slt Outdated
cetra3 commented May 14, 2026

I am basing this PR on the latest tagged release, which is 53.1.0. That release only supported the needle = ANY(array) shape and nothing else.

For this one shape, this PR restores the 53.1.0 behaviour, which desugared to array_has. No other shapes were supported.

In this PR I have also tried to keep the rules for what each expression desugars to simple. Essentially, all non-null behaviour matches the semantics you'd expect; only some edge cases involving null values diverge.

I'm open to expanding/adjusting this, as long as it doesn't impact the existing use case (= ANY), but I think this PR is a balance of pragmatism & correctness.

Here's some example desugaring:

needle OP ANY(haystack):

| op | desugar |
|----|---------|
| = | array_has(haystack, needle) |
| <> | cardinality(haystack) > 0 AND (array_min(haystack) <> needle OR array_max(haystack) <> needle) |
| > | cardinality(haystack) > 0 AND needle > array_min(haystack) |
| < | cardinality(haystack) > 0 AND needle < array_max(haystack) |
| >= | cardinality(haystack) > 0 AND needle >= array_min(haystack) |
| <= | cardinality(haystack) > 0 AND needle <= array_max(haystack) |
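
The ANY rules can be modelled to confirm they agree with a straightforward `any()` when no nulls are involved. The following is a rough Python sketch, not the planner code: `None` stands in for NULL, `and3`/`or3`/`cmp3` are hypothetical three-valued-logic helpers, and the null handling of `array_has`, `array_min` and `array_max` is an assumption inferred from the examples in this thread. The haystack itself is assumed to be a non-null array.

```python
import itertools
import operator

OPS = {"=": operator.eq, "<>": operator.ne, ">": operator.gt,
       "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def and3(a, b):
    # Three-valued AND: FALSE dominates, then NULL.
    if a is False or b is False:
        return False
    return None if a is None or b is None else True


def or3(a, b):
    # Three-valued OR: TRUE dominates, then NULL.
    if a is True or b is True:
        return True
    return None if a is None or b is None else False


def cmp3(op, a, b):
    # Any comparison with NULL is NULL.
    return None if a is None or b is None else OPS[op](a, b)


def non_null(haystack):
    return [v for v in haystack if v is not None]


def array_min(haystack):
    return min(non_null(haystack), default=None)


def array_max(haystack):
    return max(non_null(haystack), default=None)


def array_has(haystack, needle):
    # Assumption: a NULL needle yields NULL; NULL elements are ignored.
    return None if needle is None else needle in non_null(haystack)


def desugar_any(op, needle, haystack):
    if op == "=":
        return array_has(haystack, needle)
    lo, hi = array_min(haystack), array_max(haystack)
    if op == "<>":
        rhs = or3(cmp3("<>", lo, needle), cmp3("<>", hi, needle))
    elif op in (">", ">="):
        rhs = cmp3(op, needle, lo)
    else:  # "<" and "<=" compare against the maximum
        rhs = cmp3(op, needle, hi)
    return and3(len(haystack) > 0, rhs)


# For non-null inputs the desugared form should agree with a plain any().
for op, size in itertools.product(OPS, range(4)):
    for haystack in itertools.product(range(3), repeat=size):
        for needle in range(-1, 4):
            expected = any(OPS[op](needle, item) for item in haystack)
            assert desugar_any(op, needle, list(haystack)) == expected
```

The loop only covers small integer haystacks, so it is a sanity check of the rules rather than a proof.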

needle OP ALL(haystack):

| op | desugar |
|----|---------|
| = | cardinality(haystack) = 0 OR (array_min(haystack) = needle AND array_max(haystack) = needle) |
| <> | NOT array_has(haystack, needle) |
| > | cardinality(haystack) = 0 OR needle > array_max(haystack) |
| < | cardinality(haystack) = 0 OR needle < array_min(haystack) |
| >= | cardinality(haystack) = 0 OR needle >= array_max(haystack) |
| <= | cardinality(haystack) = 0 OR needle <= array_min(haystack) |
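
The ALL rules can be sketched the same way. As before, this is an illustrative Python model rather than DataFusion code: `None` stands in for NULL, the helper names are hypothetical, the null handling of the array functions is an assumption inferred from the examples in this thread, and the haystack itself is assumed to be a non-null array.

```python
import itertools
import operator

OPS = {"=": operator.eq, "<>": operator.ne, ">": operator.gt,
       "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def and3(a, b):
    # Three-valued AND: FALSE dominates, then NULL.
    if a is False or b is False:
        return False
    return None if a is None or b is None else True


def or3(a, b):
    # Three-valued OR: TRUE dominates, then NULL.
    if a is True or b is True:
        return True
    return None if a is None or b is None else False


def not3(a):
    return None if a is None else not a


def cmp3(op, a, b):
    # Any comparison with NULL is NULL.
    return None if a is None or b is None else OPS[op](a, b)


def non_null(haystack):
    return [v for v in haystack if v is not None]


def array_has(haystack, needle):
    # Assumption: a NULL needle yields NULL; NULL elements are ignored.
    return None if needle is None else needle in non_null(haystack)


def desugar_all(op, needle, haystack):
    if op == "<>":
        return not3(array_has(haystack, needle))
    lo = min(non_null(haystack), default=None)  # array_min
    hi = max(non_null(haystack), default=None)  # array_max
    if op == "=":
        rhs = and3(cmp3("=", lo, needle), cmp3("=", hi, needle))
    elif op in (">", ">="):
        rhs = cmp3(op, needle, hi)
    else:  # "<" and "<=" compare against the minimum
        rhs = cmp3(op, needle, lo)
    return or3(len(haystack) == 0, rhs)


# For non-null inputs the desugared form should agree with a plain all().
for op, size in itertools.product(OPS, range(4)):
    for haystack in itertools.product(range(3), repeat=size):
        for needle in range(-1, 4):
            expected = all(OPS[op](needle, item) for item in haystack)
            assert desugar_all(op, needle, list(haystack)) == expected
```

Note that ALL mirrors ANY: the guard becomes `cardinality = 0 OR ...`, and the roles of the minimum and maximum swap.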

Cardinality Check

The cardinality check is there to deal with empty haystacks and to ensure we return a boolean true/false rather than null. If we desugared to array_min([]) directly, the comparison would evaluate to null. The cardinality check turns that into a definite result: false for ANY and true for ALL.

Here's a table:

| Expression | cardinality | array_min / array_max | Desugar evaluated | Result |
|------------|-------------|-----------------------|-------------------|--------|
| 5 > ANY([]) | 0 | NULL / NULL | 0 > 0 AND (5 > NULL) → FALSE AND NULL | F |
| 5 > ALL([]) | 0 | NULL / NULL | 0 = 0 OR (5 > NULL) → TRUE OR NULL | T |
| 5 > ANY([3, 7]) | 2 | 3 / 7 | 2 > 0 AND 5 > 3 → TRUE AND TRUE | T |
| 5 > ALL([3, 7]) | 2 | 3 / 7 | 2 = 0 OR 5 > 7 → FALSE OR FALSE | F |
| 5 > ANY([3, NULL]) | 2 | 3 / 3 | 2 > 0 AND 5 > 3 → TRUE AND TRUE | T |
| 5 > ALL([3, NULL]) | 2 | 3 / 3 | 2 = 0 OR 5 > 3 → FALSE OR TRUE | T |
| 5 > ANY([6, NULL]) | 2 | 6 / 6 | 2 > 0 AND 5 > 6 → TRUE AND FALSE | F |
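
To make the effect of the guard concrete, here is a small Python sketch comparing the guarded and unguarded forms on an empty haystack. It is a model only: `None` stands in for NULL, and all function names are hypothetical helpers, not DataFusion APIs.

```python
def array_min(haystack):
    return min((v for v in haystack if v is not None), default=None)


def array_max(haystack):
    return max((v for v in haystack if v is not None), default=None)


def gt3(a, b):
    # Three-valued `a > b`: NULL if either side is NULL.
    return None if a is None or b is None else a > b


def and3(a, b):
    if a is False or b is False:
        return False
    return None if a is None or b is None else True


def or3(a, b):
    if a is True or b is True:
        return True
    return None if a is None or b is None else False


def gt_any_unguarded(needle, haystack):
    # needle > array_min(haystack), with no cardinality check
    return gt3(needle, array_min(haystack))


def gt_any_guarded(needle, haystack):
    # cardinality(haystack) > 0 AND needle > array_min(haystack)
    return and3(len(haystack) > 0, gt3(needle, array_min(haystack)))


def gt_all_guarded(needle, haystack):
    # cardinality(haystack) = 0 OR needle > array_max(haystack)
    return or3(len(haystack) == 0, gt3(needle, array_max(haystack)))


assert gt_any_unguarded(5, []) is None   # NULL leaks out without the guard
assert gt_any_guarded(5, []) is False    # FALSE AND NULL = FALSE
assert gt_all_guarded(5, []) is True     # TRUE OR NULL = TRUE
```

The guard works because FALSE absorbs NULL under AND and TRUE absorbs NULL under OR.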

PostgreSQL Divergence

It's when you start mixing null values into needles and haystacks that things diverge from other systems. Each of these functions treats null as absent, whereas PostgreSQL semantics treat null as unknown. In the tables below, T = TRUE, F = FALSE and N = NULL:

  • For null needles, every shape behaves the same as PostgreSQL, with two exceptions around empty haystacks (which are unusual edge cases):

| Expression | cardinality | This PR | PostgreSQL |
|------------|-------------|---------|------------|
| NULL = ANY([]) | 0 | N | F |
| NULL <> ALL([]) | 0 | N | T |

  • For null values in the haystack, PostgreSQL returns null whenever the non-null elements do not already decide the result, whereas we diverge slightly since we desugar to existing functions which filter out nulls:

| Expression | This PR | PostgreSQL | Why |
|------------|---------|------------|-----|
| 5 = ANY([NULL, NULL]) | F | N | array_has([NULL,NULL], 5) = FALSE. PG: 5=N OR 5=N = N. |
| 5 = ANY([3, NULL]) | F | N | array_has([3,NULL], 5) = FALSE. PG: 5=3 OR 5=N = F OR N = N. |
| 5 <> ALL([NULL, NULL]) | T | N | NOT array_has(…) = NOT FALSE = TRUE. PG: 5<>N AND 5<>N = N. |
| 5 <> ALL([3, NULL]) | T | N | Same shape. |
| 5 > ALL([3, NULL]) | T | N | array_max([3,NULL]) = 3, so 5 > 3 = TRUE. PG: 5>3 AND 5>N = T AND N = N. |
| 5 < ANY([3, NULL]) | F | N | array_max([3,NULL]) = 3, so 5 < 3 = FALSE. PG: 5<3 OR 5<N = F OR N = N. |
| 5 >= ALL([3, NULL]) | T | N | Same pattern as >. |
| 5 = ALL([5, NULL]) | T | N | min=max=5, so both are 5=5 = TRUE. PG: 5=5 AND 5=N = T AND N = N. (Subtle: the needle equals the only non-NULL element.) |
| 5 <> ANY([5, NULL]) | F | N | min=max=5, so both are 5<>5 = FALSE. PG: 5<>5 OR 5<>N = F OR N = N. |
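
For comparison, the PostgreSQL column can be reproduced by folding three-valued comparisons with OR (for ANY) or AND (for ALL). This is a simplified Python model of those semantics, not PostgreSQL or DataFusion code: `None` stands in for NULL, `pg_any` and `pg_all` are hypothetical names, and the haystack itself is assumed to be a non-null array.

```python
import operator

OPS = {"=": operator.eq, "<>": operator.ne, ">": operator.gt,
       "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def cmp3(op, a, b):
    # Any comparison with NULL is NULL.
    return None if a is None or b is None else OPS[op](a, b)


def pg_any(op, needle, haystack):
    # OR-fold: TRUE if any comparison is TRUE, else NULL if any is NULL.
    results = [cmp3(op, needle, item) for item in haystack]
    if any(r is True for r in results):
        return True
    return None if any(r is None for r in results) else False


def pg_all(op, needle, haystack):
    # AND-fold: FALSE if any comparison is FALSE, else NULL if any is NULL.
    results = [cmp3(op, needle, item) for item in haystack]
    if any(r is False for r in results):
        return False
    return None if any(r is None for r in results) else True


# Rows from the tables above (PostgreSQL column).
assert pg_any("=", None, []) is False
assert pg_all("<>", None, []) is True
assert pg_any("=", 5, [None, None]) is None
assert pg_any("=", 5, [3, None]) is None
assert pg_all("<>", 5, [3, None]) is None
assert pg_all(">", 5, [3, None]) is None
assert pg_any("<", 5, [3, None]) is None
assert pg_all("=", 5, [5, None]) is None
assert pg_any("<>", 5, [5, None]) is None
```

In this model a null element only produces a null result when no other element decides the outcome: `pg_any("=", 5, [5, None])` is still `True`, and `pg_all(">", 5, [7, None])` is still `False`.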

cetra3 force-pushed the parquet_pruning_for_any branch from 8b46e1e to 1753037 on May 14, 2026 06:25
github-actions bot added the core (Core DataFusion crate) label May 14, 2026
Labels

core (Core DataFusion crate), sql (SQL Planner), sqllogictest (SQL Logic Tests (.slt))

Development

Successfully merging this pull request may close these issues.

PR #21743 disables Parquet pruning for = ANY([literals])

3 participants