Skip to content

Conversation

@sdf-jkl
Copy link
Contributor

@sdf-jkl sdf-jkl commented Jan 10, 2026

Which issue does this PR close?

Rationale for this change

Check issue.

What changes are included in this PR?

Added preimage impl for date_part udf.

Added sqllogictests for the impl.

Are these changes tested?

Yes, sqllogictests.

Are there any user-facing changes?

No

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Jan 10, 2026
@sdf-jkl sdf-jkl changed the title Smaller preimage pr 2 Support "pre-image" for pruning predicate evaluation #2 Jan 10, 2026
@github-actions github-actions bot removed the core Core DataFusion crate label Jan 19, 2026
@sdf-jkl sdf-jkl changed the title Support "pre-image" for pruning predicate evaluation #2 Optimize the evaluation of date_part(<col>) == <constant> when pushed down Jan 19, 2026
@github-actions github-actions bot removed logical-expr Logical plan and expressions optimizer Optimizer rules labels Jan 25, 2026
Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Left some minor suggestions.

// date_part(col, YEAR) = 2024 => col >= '2024-01-01' and col < '2025-01-01'
// But for anything less than YEAR simplifying is not possible without specifying the bigger interval
// date_part(col, MONTH) = 1 => col = '2023-01-01' or col = '2024-01-01' or ... or col = '3000-01-01'
fn preimage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a section to docs/source/library-user-guide/functions/adding-udfs.md explaining:

  • What preimage is and when to implement it
  • How it enables predicate pushdown
  • Example implementation (perhaps referencing date_part)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a PR for preimage doc improvement here #20008.

I am however, not sure that this doc needs to explain preimage. I think the doc's goal is to be a very minimal guide on adding and registering a function. There is also no mention of simplify too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the API docs is probably adequate. We could potentially add a note to adding-udfs.md that says something generic like "The ScalarUDFImpl has additional methods that support specialized optimizations such as preimage -- see the API documentation for additional details"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A separate PR should do.

@sdf-jkl
Copy link
Contributor Author

sdf-jkl commented Jan 28, 2026

Thanks for your review @kosiew. I've addressed your comments and also added proper handling for timestamps with timezones, please check it out too when you have time.

@sdf-jkl
Copy link
Contributor Author

sdf-jkl commented Jan 28, 2026

@alamb I'm still thinking about is_literal_or_literal_cast check in expr_simplifier: #19722 (comment)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sdf-jkl -- this is looking very close

I have some suggestions on avoiding adding new date/time arithmetic here.

I also recommend an additional test for non-int32 arguments (coercion works fine, but it is good to show that):

diff --git a/datafusion/sqllogictest/test_files/udf_preimage.slt b/datafusion/sqllogictest/test_files/udf_preimage.slt
index 12f3a9a45b..c9d408227a 100644
--- a/datafusion/sqllogictest/test_files/udf_preimage.slt
+++ b/datafusion/sqllogictest/test_files/udf_preimage.slt
@@ -30,6 +30,11 @@ select c from t1 where extract(year from c) = 2024;
 ----
 2024-01-01

+query D
+select c from t1 where extract(year from c) = cast(2024 as bigint);
+----
+2024-01-01
+
 query D
 select c from t1 where extract(year from c) <> 2024;
 ----
@@ -83,6 +88,16 @@ physical_plan
 01)FilterExec: c@0 >= 2024-01-01 AND c@0 < 2025-01-01
 02)--DataSourceExec: partitions=1, partition_sizes=[1]

+query TT
+explain select c from t1 where extract (year from c) = cast(2024 as bigint)
+----
+logical_plan
+01)Filter: t1.c >= Date32("2024-01-01") AND t1.c < Date32("2025-01-01")
+02)--TableScan: t1 projection=[c]
+physical_plan
+01)FilterExec: c@0 >= 2024-01-01 AND c@0 < 2025-01-01
+02)--DataSourceExec: partitions=1, partition_sizes=[1]
+
 query TT
 explain select c from t1 where extract (year from c) <> 2024
 ----

// date_part(col, YEAR) = 2024 => col >= '2024-01-01' and col < '2025-01-01'
// But for anything less than YEAR simplifying is not possible without specifying the bigger interval
// date_part(col, MONTH) = 1 => col = '2023-01-01' or col = '2024-01-01' or ... or col = '3000-01-01'
fn preimage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the API docs is probably adequate. We could potentially add a note to adding-udfs.md that says something generic like "The ScalarUDFImpl has additional methods that support specialized optimizations such as preimage -- see the API documentation for additional details"

Date32 => ScalarValue::Date32(Some(days as i32)),
Date64 => ScalarValue::Date64(Some(days * MILLISECONDS_IN_DAY)),

Timestamp(unit, tz_opt) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels to me like this code should be able to re-use the conversion functions in arrow rather than re-implementing them here

For example

        Date32 => ScalarValue::Date32(Some(Date32Type::from_naive_date(date))),
        Date64 => ScalarValue::Date64(Some(Date64Type::from_naive_date(date))),
...

I didn't have a chance to figure out how to do it for Timestamp, but it seems like there should be a function like this for the timestamps too -- for example

https://docs.rs/arrow/latest/arrow/array/types/struct.TimestampSecondType.html and https://docs.rs/arrow/latest/arrow/datatypes/trait.ArrowTimestampType.html

Maybe something like

TimestampSecondType::make_value(date)

(We will have to figure out timestamps)

Copy link
Contributor Author

@sdf-jkl sdf-jkl Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed date32/64 62b0841

As for timestamp: dyn TimestampType::make_value() is using a NaiveDate, not DateTime<Tz>. We'd still have to do some tz math to create an offset for NaiveDate. (if only there was an existing API to help...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize the evaluation of date_part(<col>) == <constant> when pushed down

3 participants