34 commits
d94889a  Add udf_preimage logic (sdf-jkl, Jan 9, 2026)
4aa7f4e  Cargo fmt (sdf-jkl, Jan 10, 2026)
2329c12  Fix err in rewrite_with_preimage (sdf-jkl, Jan 10, 2026)
7ac8325  Rewrite the preimage_in_comparison (sdf-jkl, Jan 10, 2026)
7a3e8b3  cargo fmt (sdf-jkl, Jan 10, 2026)
fbd5dcc  Fix ci (sdf-jkl, Jan 10, 2026)
d920735  Fix GtEq, Lt logic (sdf-jkl, Jan 10, 2026)
d3318ff  Add datepart preimage + tests (sdf-jkl, Jan 10, 2026)
9fb245b  Fix asf header (sdf-jkl, Jan 10, 2026)
5ffb704  Merge branch 'main' into smaller-preimage-pr-1 (sdf-jkl, Jan 10, 2026)
372f704  Merge branch 'main' into smaller-preimage-pr-2 (sdf-jkl, Jan 10, 2026)
2fdc14c  Merge branch 'main' of https://github.com/apache/datafusion into smal… (sdf-jkl, Jan 18, 2026)
c2b0cd3  Replace BinaryExpression with binary_expr() fn (sdf-jkl, Jan 18, 2026)
a0b6564  Add unit tests + add doc part about upper bound (sdf-jkl, Jan 19, 2026)
0a24d60  Fix docs (sdf-jkl, Jan 19, 2026)
86b7627  Add datepart preimage + tests (sdf-jkl, Jan 10, 2026)
0158662  Fix asf header (sdf-jkl, Jan 10, 2026)
b491d4f  Merge branch 'smaller-preimage-pr-2' of https://github.com/sdf-jkl/da… (sdf-jkl, Jan 19, 2026)
59235de  clippy (alamb, Jan 19, 2026)
9ae434e  Merge remote-tracking branch 'apache/main' into smaller-preimage-pr-1 (alamb, Jan 19, 2026)
9f845e7  Make test field nullable (sdf-jkl, Jan 19, 2026)
08ef1f1  Add datepart preimage + tests (sdf-jkl, Jan 10, 2026)
57f6c4c  Fix asf header (sdf-jkl, Jan 10, 2026)
e4dc727  Merge branch 'smaller-preimage-pr-2' of https://github.com/sdf-jkl/da… (sdf-jkl, Jan 19, 2026)
f36257e  Merge branch 'main' of https://github.com/apache/datafusion into smal… (sdf-jkl, Jan 25, 2026)
6bceb43  Fix date_part.rs (sdf-jkl, Jan 25, 2026)
13f1164  Fix udf_preimage.slt (sdf-jkl, Jan 25, 2026)
d902f65  Small fix (sdf-jkl, Jan 25, 2026)
6992c8f  fix typo (sdf-jkl, Jan 26, 2026)
b456a22  Add proper error handling (sdf-jkl, Jan 28, 2026)
3669fb9  Add tz slt tests (sdf-jkl, Jan 28, 2026)
bb6625f  Add tz aware timestamp logic (sdf-jkl, Jan 28, 2026)
aeafe1a  Move tests to date_part.slt (sdf-jkl, Jan 29, 2026)
62b0841  Use Arrow Type methods for date32/64 (sdf-jkl, Jan 29, 2026)
122 changes: 119 additions & 3 deletions datafusion/functions/src/datetime/date_part.rs
@@ -19,6 +19,7 @@ use std::any::Any;
use std::str::FromStr;
use std::sync::Arc;

use arrow::array::timezone::Tz;
use arrow::array::{Array, ArrayRef, Float64Array, Int32Array};
use arrow::compute::kernels::cast_utils::IntervalUnit;
use arrow::compute::{DatePart, binary, date_part};
@@ -27,8 +28,10 @@ use arrow::datatypes::DataType::{
};
use arrow::datatypes::TimeUnit::{Microsecond, Millisecond, Nanosecond, Second};
use arrow::datatypes::{
-    DataType, Field, FieldRef, IntervalUnit as ArrowIntervalUnit, TimeUnit,
+    DataType, Date32Type, Date64Type, Field, FieldRef, IntervalUnit as ArrowIntervalUnit,
+    TimeUnit,
};
use chrono::{Datelike, NaiveDate, TimeZone, Utc};
use datafusion_common::types::{NativeType, logical_date};

use datafusion_common::{
@@ -44,9 +47,11 @@ use datafusion_common::{
    types::logical_string,
    utils::take_function_args,
};
use datafusion_expr::preimage::PreimageResult;
use datafusion_expr::simplify::SimplifyContext;
use datafusion_expr::{
-    ColumnarValue, Documentation, ReturnFieldArgs, ScalarUDFImpl, Signature,
-    TypeSignature, Volatility,
+    ColumnarValue, Documentation, Expr, ReturnFieldArgs, ScalarUDFImpl, Signature,
+    TypeSignature, Volatility, interval_arithmetic,
};
use datafusion_expr_common::signature::{Coercion, TypeSignatureClass};
use datafusion_macros::user_doc;
@@ -237,6 +242,71 @@ impl ScalarUDFImpl for DatePartFunc {
        })
    }

    // Only extracting the YEAR part is supported, because only a year maps to one contiguous range:
    // date_part(YEAR, col) = 2024 => col >= '2024-01-01' and col < '2025-01-01'
    // For anything smaller than YEAR, simplifying is not possible without also fixing the bigger unit:
    // date_part(MONTH, col) = 1 matches January of every year (2023-01, 2024-01, ..., 3000-01), which is not a single range
    fn preimage(
Contributor

Consider adding a section to docs/source/library-user-guide/functions/adding-udfs.md explaining:

  • What preimage is and when to implement it
  • How it enables predicate pushdown
  • Example implementation (perhaps referencing date_part)

Contributor Author

There is a PR for preimage doc improvement here: #20008.

I am, however, not sure that this doc needs to explain preimage. I think the doc's goal is to be a very minimal guide on adding and registering a function. It does not mention simplify either.

Contributor

I agree that the API docs are probably adequate. We could potentially add a note to adding-udfs.md that says something generic like "The ScalarUDFImpl has additional methods that support specialized optimizations such as preimage -- see the API documentation for additional details".

Contributor Author

A separate PR should do.

        &self,
        args: &[Expr],
        lit_expr: &Expr,
        info: &SimplifyContext,
    ) -> Result<PreimageResult> {
        let [part, col_expr] = take_function_args(self.name(), args)?;

        // Get the interval unit from the part argument
        let interval_unit = part
            .as_literal()
            .and_then(|sv| sv.try_as_str().flatten())
            .map(part_normalization)
            .and_then(|s| IntervalUnit::from_str(s).ok());

        // only support extracting year
        match interval_unit {
            Some(IntervalUnit::Year) => (),
            _ => return Ok(PreimageResult::None),
        }

        // Check if the argument is a literal (e.g. date_part(YEAR, col) = 2024)
        let Some(argument_literal) = lit_expr.as_literal() else {
            return Ok(PreimageResult::None);
        };

        // Extract i32 year from Scalar value
        let year = match argument_literal {
            ScalarValue::Int32(Some(y)) => *y,
            _ => return Ok(PreimageResult::None),
        };

        // Can only extract year from Date32/64 and Timestamp column
        let target_type = match info.get_data_type(col_expr)? {
            Date32 | Date64 | Timestamp(_, _) => &info.get_data_type(col_expr)?,
            _ => return Ok(PreimageResult::None),
        };

        // Compute the Interval bounds
        let Some(start_time) = NaiveDate::from_ymd_opt(year, 1, 1) else {
            return Ok(PreimageResult::None);
        };
        let Some(end_time) = start_time.with_year(year + 1) else {
            return Ok(PreimageResult::None);
        };

        // Convert to ScalarValues
        let (Some(lower), Some(upper)) = (
            date_to_scalar(start_time, target_type),
            date_to_scalar(end_time, target_type),
        ) else {
            return Ok(PreimageResult::None);
        };
        let interval = Box::new(interval_arithmetic::Interval::try_new(lower, upper)?);

        Ok(PreimageResult::Range {
            expr: col_expr.clone(),
            interval,
        })
    }
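
Editorial note: as a rough illustration of the range this `preimage` implementation produces for a Date32 column, the sketch below computes the half-open interval of days since the Unix epoch for a given year, using only the standard library. The names `days_from_civil`, `year_preimage_date32`, and `in_preimage` are hypothetical and not part of DataFusion; the code in this PR relies on chrono and Arrow conversions rather than the hand-rolled calendar arithmetic shown here.

```rust
use std::convert::TryFrom;

/// Days from 1970-01-01 to the given proleptic Gregorian date.
/// Hypothetical helper (Howard Hinnant's `days_from_civil` algorithm);
/// the PR itself uses chrono's `NaiveDate` for this step.
fn days_from_civil(year: i32, month: u32, day: u32) -> i64 {
    // Treat January and February as months 11 and 12 of the previous year.
    let y = i64::from(year) - if month <= 2 { 1 } else { 0 };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, in [0, 399]
    let m = i64::from(month);
    let d = i64::from(day);
    let shifted_month = if m > 2 { m - 3 } else { m + 9 }; // March = 0
    let doy = (153 * shifted_month + 2) / 5 + d - 1; // day of year, in [0, 365]
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Half-open range `[lower, upper)` of Date32 values (days since the epoch)
/// whose calendar year equals `year`. Returns `None` if a bound does not fit,
/// mirroring the `PreimageResult::None` fallbacks in the diff.
fn year_preimage_date32(year: i32) -> Option<(i32, i32)> {
    let next_year = year.checked_add(1)?;
    let lower = i32::try_from(days_from_civil(year, 1, 1)).ok()?;
    let upper = i32::try_from(days_from_civil(next_year, 1, 1)).ok()?;
    Some((lower, upper))
}

/// True when `value` lies inside the half-open range: the lower bound is
/// inclusive and the upper bound is exclusive.
fn in_preimage(value: i32, range: (i32, i32)) -> bool {
    range.0 <= value && value < range.1
}
```

For example, `year_preimage_date32(2024)` yields `Some((19723, 20089))`, that is `2024-01-01` inclusive to `2025-01-01` exclusive, which corresponds to the `col >= '2024-01-01' and col < '2025-01-01'` rewrite described in the comment above `preimage`.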

    fn aliases(&self) -> &[String] {
        &self.aliases
    }
@@ -251,6 +321,52 @@ fn is_epoch(part: &str) -> bool {
    matches!(part.to_lowercase().as_str(), "epoch")
}

fn date_to_scalar(date: NaiveDate, target_type: &DataType) -> Option<ScalarValue> {
    Some(match target_type {
        Date32 => ScalarValue::Date32(Some(Date32Type::from_naive_date(date))),
        Date64 => ScalarValue::Date64(Some(Date64Type::from_naive_date(date))),

        Timestamp(unit, tz_opt) => {
Contributor

It feels to me like this code should be able to re-use the conversion functions in arrow rather than re-implementing them here

For example

        Date32 => ScalarValue::Date32(Some(Date32Type::from_naive_date(date))),
        Date64 => ScalarValue::Date64(Some(Date64Type::from_naive_date(date))),
...

I didn't have a chance to figure out how to do it for Timestamp, but it seems like there should be a function like this for the timestamps too -- for example

https://docs.rs/arrow/latest/arrow/array/types/struct.TimestampSecondType.html and https://docs.rs/arrow/latest/arrow/datatypes/trait.ArrowTimestampType.html

Maybe something like

TimestampSecondType::make_value(date)

(We will have to figure out timestamps)

Contributor Author

@sdf-jkl sdf-jkl Jan 29, 2026

Fixed date32/64 in 62b0841.

As for timestamp: dyn TimestampType::make_value() takes a NaiveDate, not a DateTime<Tz>. We'd still have to do some tz math to create an offset for the NaiveDate. (If only there were an existing API to help...)

            let naive_midnight = date.and_hms_opt(0, 0, 0)?;

            let utc_dt = if let Some(tz_str) = tz_opt {
                let tz: Tz = tz_str.parse().ok()?;

                let local = tz.from_local_datetime(&naive_midnight);

                let local_dt = match local {
                    chrono::offset::LocalResult::Single(dt) => dt,
                    chrono::offset::LocalResult::Ambiguous(dt1, _dt2) => dt1,
                    chrono::offset::LocalResult::None => local.earliest()?,
                };

                local_dt.with_timezone(&Utc)
            } else {
                Utc.from_utc_datetime(&naive_midnight)
            };

            match unit {
                Second => {
                    ScalarValue::TimestampSecond(Some(utc_dt.timestamp()), tz_opt.clone())
                }
                Millisecond => ScalarValue::TimestampMillisecond(
                    Some(utc_dt.timestamp_millis()),
                    tz_opt.clone(),
                ),
                Microsecond => ScalarValue::TimestampMicrosecond(
                    Some(utc_dt.timestamp_micros()),
                    tz_opt.clone(),
                ),
                Nanosecond => ScalarValue::TimestampNanosecond(
                    Some(utc_dt.timestamp_nanos_opt()?),
                    tz_opt.clone(),
                ),
            }
        }
        _ => return None,
    })
}
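
Editorial note: the timestamp branch of `date_to_scalar` interprets the date as local midnight in the column's time zone, converts that to a UTC instant, and scales it to the column's time unit. The sketch below shows only that arithmetic, under a simplifying assumption: the zone is a fixed UTC offset in seconds, so daylight-saving gaps and overlaps (which the PR handles through chrono's `LocalResult`) do not arise. `Unit` and `local_midnight_to_utc` are hypothetical names, not DataFusion or Arrow APIs.

```rust
/// Hypothetical stand-in for Arrow's `TimeUnit`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Unit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Converts "local midnight on the day `days_since_epoch`" in a zone with a
/// fixed UTC offset into a UTC timestamp expressed in `unit`.
/// Assumption: a fixed offset, not a named zone with DST rules.
/// Returns `None` on overflow, similar in spirit to `timestamp_nanos_opt()?`.
fn local_midnight_to_utc(
    days_since_epoch: i64,
    utc_offset_secs: i64,
    unit: Unit,
) -> Option<i64> {
    // Local wall-clock time = UTC + offset, so UTC = local - offset.
    let utc_secs = days_since_epoch
        .checked_mul(86_400)?
        .checked_sub(utc_offset_secs)?;
    let scale: i64 = match unit {
        Unit::Second => 1,
        Unit::Millisecond => 1_000,
        Unit::Microsecond => 1_000_000,
        Unit::Nanosecond => 1_000_000_000,
    };
    utc_secs.checked_mul(scale)
}
```

With day `19723` (2024-01-01) and an offset of `-18000` seconds (UTC-05:00), the second-resolution result is `1704085200`, five hours after UTC midnight. This is why the lower bound for a time-zone-aware column differs from the bound for a column without a time zone.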

// Try to remove quote if exist, if the quote is invalid, return original string and let the downstream function handle the error
fn part_normalization(part: &str) -> &str {
    part.strip_prefix(|c| c == '\'' || c == '\"')