Skip to content

[pull] main from apache:main#192

Merged
pull[bot] merged 4 commits into
buraksenn:mainfrom
apache:main
May 15, 2026
Merged

[pull] main from apache:main#192
pull[bot] merged 4 commits into
buraksenn:mainfrom
apache:main

Conversation

@pull
Copy link
Copy Markdown

@pull pull Bot commented May 15, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

xudong963 and others added 4 commits May 15, 2026 02:22
## Which issue does this PR close?

- Closes #19028.

## Rationale for this change

When DataFusion evaluates a Parquet scan with filter pushdown, it uses
row group statistics to determine which row groups contain matching
rows. The `RowGroupAccessPlanFilter` already tracks which row groups are
"fully matched" — where statistics prove that **all** rows satisfy the
predicate (via `is_fully_matched`).

However, this information was not propagated downstream. Even for fully
matched row groups:
1. **Page index pruning** still evaluated page-level statistics (wasted
work since no pages can be pruned)
2. **RowFilter evaluation** still decoded filter columns and evaluated
the predicate for every row (wasted work since every row passes)

This is especially costly when filter columns are expensive to decode
(e.g., large strings) or when predicates are complex. Common real-world
examples include time-range filters where entire row groups fall within
the range, or `WHERE status != 'DELETED'` on data with no deleted rows.

## What changes are included in this PR?

### DataFusion changes (this PR)

1. **`row_group_filter.rs`**: `RowGroupAccessPlanFilter::build()` now
returns `(ParquetAccessPlan, Vec<usize>)` — the access plan plus the
indices of fully matched row groups.

2. **`page_filter.rs`**: `prune_plan_with_page_index()` accepts a
`fully_matched_row_groups` parameter and skips page-level pruning for
those row groups.

3. **`opener.rs`**: Wires fully matched row groups through the pipeline
— passes them to page pruning and to the `ParquetPushDecoderBuilder` via
`with_fully_matched_row_groups()`.

### Arrow-rs dependency (apache/arrow-rs#9694)

The new `ArrowReaderBuilder::with_fully_matched_row_groups()` API in
arrow-rs allows skipping `RowFilter` evaluation during Parquet decoding
for specified row groups. This PR uses `[patch.crates-io]` pointing to
the arrow-rs fork branch until that PR is merged and released.

### Benchmark

Includes a criterion benchmark (`parquet_fully_matched_filter`) using
`ParquetPushDecoder` directly — the same code path DataFusion's async
opener uses. Dataset: 20 row groups × 50K rows, with a 1KB string
payload column and predicate `x < 200` (all row groups fully matched).

| Scenario | Time | vs. baseline |
|----------|------|-------------|
| Filter pushdown, no skip | ~43 ms | baseline |
| **Filter pushdown, with skip** | **~20 ms** | **~2.2x faster** |
| No pushdown at all | ~24 ms | — |

## Are these changes tested?

- All 82 existing non-submodule `datafusion-datasource-parquet` tests
pass (16 failures are pre-existing, caused by missing `parquet-testing`
submodule)
- The benchmark verifies correctness by asserting the expected row count
- Clippy and fmt pass

## Are there any user-facing changes?

No user-facing API changes. This is a transparent performance
optimization — queries that previously worked will now be faster when
row group statistics prove all rows match the predicate.

~~**Note:** This PR depends on apache/arrow-rs#9694. The
`[patch.crates-io]` in `Cargo.toml` will be removed once that arrow-rs
change is released.~~ all logic is on df side now

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Which issue does this PR close?

- Closes #22184.

## Rationale for this change

The SQL planner should return an error for unsupported user SQL rather
than panic. Unknown compound identifiers with six or more parts could
reach an unchecked `unwrap()` after failing schema lookup.

## What changes are included in this PR?

Updates compound identifier handling so unmatched identifiers with five
or more parts return the existing `not_impl_err!` path instead of
falling through to `form_identifier`.

Adds a regression test for `SELECT a.b.c.d.e.f` to verify planning
returns an unsupported compound identifier error.

## Are these changes tested?

Yes:

- `cargo fmt --all`
- `cargo test -p datafusion-sql --test sql_integration
select_unknown_deep_compound_identifier_returns_error`
- `cargo clippy --all-targets --all-features -- -D warnings`

## Are there any user-facing changes?

No API changes. Invalid deeply nested compound identifiers now return a
planner error instead of panicking.
## Which issue does this PR close?

- Closes: #22178

## Rationale for this change

Update the pinned Rust toolchain to 1.95.0 and keep the workspace
passing the required formatting and clippy checks with the new
toolchain.

## What changes are included in this PR?

- Bump rust-toolchain.toml from 1.94.0 to 1.95.0.
- Update the contributor docs rust-analyzer example to reference 1.95.0.
- Apply the minimal clippy and compile fixes needed under Rust 1.95.0.

## Are these changes tested?

- cargo fmt --all
- cargo clippy --all-targets --all-features -- -D warnings

## Are there any user-facing changes?

No user-facing runtime changes. This updates the project toolchain and
related developer-facing docs.
## Which issue does this PR close?

No issue reported

## Rationale for this change

`Statistics::with_fetch` unconditionally returned `Precision::Exact(0)`
when the input had `nr <= skip`, even when the input was `Inexact(nr)` —
promoting an estimated upper bound into an exact zero. The exactness
flag then misleads downstream consumers (notably `AggregateStatistics`
via `Count::value_from_stats`) into trusting a derived "0" and folding
the count subtree to a literal.

Concrete user-visible symptom reported on TPC-H Q22:

```rust
let df = ctx.sql(q22_sql).await?;
df.clone().show().await?;   // prints 7 rows (correct)
df.count().await?;          // returns 0 (wrong)
```

`EXPLAIN` for the count plan shows the outer count aggregate collapsed
to `ProjectionExec([lit(0)]) -> PlaceholderRowExec`.

After PR #21240 left uncorrelated scalar subqueries in the filter rather
than rewriting them to joins, `FilterExec` can't use interval analysis
on `ScalarSubqueryExpr`, falls back to the 20% default selectivity, and
produces a small `Inexact` row estimate. A `LeftAnti` join whose
estimated semi-overlap covers the outer estimate then yields
`Inexact(0)`. That zero propagates through grouped aggregates whose
`estimate_num_rows` returns the child stats unchanged when `value == 0`.
The pre-existing `with_fetch` bug on a downstream `SortExec` finally
promotes it to `Exact(0)`, which `AggregateStatistics` trusts.

The root cause is the precision promotion in `with_fetch`. The PR fixes
that; the surrounding plan-shape changes after #21240 just made it
reachable.

## What changes are included in this PR?

- `Statistics::with_fetch`: when `nr <= skip`, preserve the exactness of
the input via `check_num_rows(Some(0),
self.num_rows.is_exact().unwrap())` instead of always returning
`Exact(0)`.
- `datafusion-common`: new unit test
`test_with_fetch_skip_all_rows_inexact` pinning the new behaviour.
- `datafusion-physical-plan`: update the existing
`test_row_number_statistics_for_global_limit` expectation that encoded
the old (incorrect) promotion to expect `Inexact(0)` now.
- `datafusion/sqllogictest/test_files/subquery.slt`: SLT regression test
reproducing the user-visible `count(*)` symptom over a query that
contains a scalar subquery, `not exists`, and a group-by on a derived
column, backed by parquet sources so the data sources report Exact
statistics.

## Are these changes tested?

Yes:
- Unit test in `datafusion-common` for the precision-preserving
behaviour.
- Updated unit test in `datafusion-physical-plan/limit.rs`.
- SLT regression test in `subquery.slt` that fails without the fix
(`count(*)` returns `0` instead of `2`) and passes with it.

## Are there any user-facing changes?

Only the bug fix. No public API changes.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pull pull Bot locked and limited conversation to collaborators May 15, 2026
@pull pull Bot added the ⤵️ pull label May 15, 2026
@pull pull Bot merged commit 9d92944 into buraksenn:main May 15, 2026
20 of 21 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants