Conversation

@devanshu0987 (Contributor)
This adds a preimage implementation for the floor() function that transforms floor(x) = N into x >= N AND x < N+1. This enables statistics-based predicate pushdown for queries using floor().

For example, a query like:
SELECT * FROM t WHERE floor(price) = 100

is rewritten to:
SELECT * FROM t WHERE price >= 100 AND price < 101

This allows the query engine to leverage min/max statistics from Parquet row groups, significantly reducing the amount of data scanned.
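
The rewrite boils down to building the half-open interval [N, N + 1). A minimal sketch of that bound construction, covering only the Int64 case (the function name and shape are illustrative, not the PR's actual API):

    use datafusion_common::ScalarValue;

    /// Given `floor(x) = n` for an Int64 literal `n`, return the half-open
    /// interval [n, n + 1) that `x` must lie in. Returns None on overflow
    /// or for literal types this sketch does not cover.
    fn floor_preimage_bounds(n: &ScalarValue) -> Option<(ScalarValue, ScalarValue)> {
        match n {
            ScalarValue::Int64(Some(lo)) => {
                let hi = lo.checked_add(1)?; // guard against i64::MAX wrapping
                Some((ScalarValue::Int64(Some(*lo)), ScalarValue::Int64(Some(hi))))
            }
            _ => None, // floats, decimals, etc. omitted in this sketch
        }
    }

The half-open upper bound matters: floor(x) = 100 must not match x = 101, so the rewrite uses x < 101 rather than x <= 101.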

Benchmarks on the ClickBench hits dataset show:

  • 80% file pruning (89 out of 111 files skipped)
  • 70x fewer rows scanned (1.4M vs 100M)

Benchmark reproduction:

CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'benchmarks/data/hits_partitioned/';

-- Test the floor preimage optimization
EXPLAIN ANALYZE SELECT COUNT(*) FROM hits WHERE floor(CAST("CounterID" AS DOUBLE)) = 62;

Metric             | Before (no preimage)   | After (with preimage)
-------------------+------------------------+------------------------------------------------------
Files pruned       | 111 → 111 (0 pruned)   | 111 → 22 (89 pruned)
Row groups pruned  | 325 → 325 (0 pruned)   | 51 → 4 (47 pruned)
Rows scanned       | 99,997,497             | 1,410,000
Output rows        | 738,172                | 738,172
Pruning predicate  | None                   | CAST(CounterID_max) >= 62 AND CAST(CounterID_min) < 63

Which issue does this PR close?

  • Closes #.

Rationale for this change

#19946

That epic introduced the preimage API. This PR uses it to provide a preimage for the floor() function where applicable.

What changes are included in this PR?

Are these changes tested?

  • Unit tests added
  • Existing SLT tests pass.

Are there any user-facing changes?

No

@github-actions bot added the functions label (Changes to functions implementation) on Jan 29, 2026
Review comment (Contributor) on this diff hunk:

    (ScalarValue::Int64(Some(lo)), ScalarValue::Int64(Some(hi)))
    }),

    // Unsupported types
floor also supports decimal types. Should we add those here?

Review comment (Contributor) on this diff hunk:

        _info: &SimplifyContext,
    ) -> Result<PreimageResult> {
        // floor takes exactly one argument
        if args.len() != 1 {
perhaps it's good to debug_assert! here?
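
For reference, a minimal sketch of that suggestion (the assertion message is illustrative):

    // Catch a wrong arity loudly in debug builds, in addition to the
    // runtime check that follows.
    debug_assert!(args.len() == 1, "floor preimage expects exactly one argument");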

@comphead (Contributor) left a comment:
The PR looks good to me, thanks @devanshu0987 and @masonh22 for the review. Decimal type support makes sense. WDYT, can decimals be addressed in this PR or in a follow-up?

@devanshu0987 (Contributor, Author) commented Jan 30, 2026

Hi @comphead and @masonh22

Can you advise on what I should do?

  • The preimage result is an interval built with Interval::try_new(lower, upper).
  • Inside Interval::try_new, there is a data type assert:
          assert_eq_or_internal_err!(
              lower.data_type(),
              upper.data_type(),
              "Endpoints of an Interval should have the same type"
          );

  • For Decimal types, data type equality also covers precision and scale:
              ScalarValue::Decimal32(_, precision, scale) => {
                  DataType::Decimal32(*precision, *scale)
              }

  • When we add two Decimals, the precision grows: result_precision = max(s1, s2) + max(p1 - s1, p2 - s2) + 1.
    • Which makes sense, since it accommodates the carry case: 99 + 1 = 100.
  • Hence, when we add 1 to compute the upper bound, its precision is one greater than the lower bound's, and Interval::try_new fails. For DECIMAL(5, 2) + DECIMAL(5, 2), the result is Decimal128(6, 2), since max(2, 2) + max(3, 3) + 1 = 6:
      SELECT arrow_typeof(CAST(100.00 AS DECIMAL(5,2)) + CAST(1.00 AS DECIMAL(5,2)));
      +-----------------------------------------+
      | arrow_typeof(Float64(100) + Float64(1)) |
      +-----------------------------------------+
      | Decimal128(6, 2)                        |
      +-----------------------------------------+

  • I am hesitant to write my own manual precision calculation to keep the type unchanged after the addition, given the edge cases (scale < 0, scale > 0, etc.); one possible direction is sketched below.
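
For illustration only, one way to sidestep the precision widening is to add 10^scale to the unscaled i128 value directly, instead of going through decimal addition. A hypothetical sketch (the function name and overflow policy are assumptions, it covers only Decimal128 with a non-negative scale, and it assumes the literal is an exact integer at that scale), not the PR's code:

    use datafusion_common::ScalarValue;

    /// Sketch: build [n, n + 1) for a Decimal128 literal without changing
    /// precision or scale, by adding 10^scale to the unscaled i128 value.
    /// Returns None on a negative scale, on overflow, or if the incremented
    /// value no longer fits the declared precision.
    fn decimal_floor_bounds(n: &ScalarValue) -> Option<(ScalarValue, ScalarValue)> {
        if let ScalarValue::Decimal128(Some(lo), p, s) = n {
            if *s < 0 {
                return None; // negative scale left out of this sketch
            }
            // `1` at scale s is 10^s in the unscaled representation.
            let one = 10_i128.checked_pow(*s as u32)?;
            let hi = lo.checked_add(one)?;
            // The largest unscaled value at precision p is 10^p - 1.
            if hi > 10_i128.checked_pow(*p as u32)? - 1 {
                return None;
            }
            Some((
                ScalarValue::Decimal128(Some(*lo), *p, *s),
                ScalarValue::Decimal128(Some(hi), *p, *s),
            ))
        } else {
            None
        }
    }

This keeps both endpoints at exactly (precision, scale), so Interval::try_new's same-type assertion would hold; whether it handles scale < 0 and the other edge cases acceptably is exactly the open question above.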
