refactor(parquet-datasource): mechanical cleanup of opener / file_format / row_group_filter #22156
Draft
adriangb wants to merge 1 commit into
Contributor
We could potentially split this into module moves too
…ile_format.rs, row_group_filter.rs
A series of pure code-motion splits to make the parquet datasource crate
easier to navigate and modify. No public API changes (existing import
paths preserved via re-exports), no behavior changes.
Five separate moves:
1. `opener.rs` (~2,700 LOC) → `opener/` module:
- `opener/early_stop.rs` — `EarlyStoppingStream`, the dynamic-filter
early-termination wrapper used at the end of `build_stream`.
- `opener/encryption.rs` — `EncryptionContext` + the
`ParquetMorselizer::get_encryption_context` helpers. Isolates the
`#[cfg(feature = "parquet_encryption")]` gating.
- `opener/push_decoder_stream.rs` — `PushDecoderStreamState`, the
inner stream state machine that drives a `ParquetPushDecoder`.
2. `file_format.rs` (~2,000 LOC) split:
- `sink.rs` — `ParquetSink` + the parallel-write machinery
(`column_serializer_task`, `spawn_column_parallel_row_group_writer`,
`output_single_parquet_file_parallelized`, etc.). The historical
`file_format::ParquetSink` path is preserved via `pub use`.
- `schema_coercion.rs` — Arrow-schema coercion utilities
(`apply_file_schema_type_coercions`, `coerce_int96_to_resolution`,
`coerce_file_schema_to_view_type`, `coerce_file_schema_to_string_type`,
`transform_schema_to_view`, `transform_binary_to_string`,
`field_with_new_type`) and their tests. Re-exported at the crate
root for backward compat.
3. `row_group_filter.rs` (~1,900 LOC):
- `bloom_filter.rs` — `BloomFilterStatistics` (the loaded SBBF data
+ its `PruningStatistics` adapter). Separates "data we loaded
from the file" from "the access-plan filter that consumes it".
After this PR:
- `opener.rs`: 2,717 → 2,433 LOC (-10%; further drop possible by
extracting the test module, but tests stay near the code under
test for now).
- `file_format.rs`: 2,038 → 633 LOC (-69%).
- `row_group_filter.rs`: 1,929 → 1,756 LOC (-9%; bloom code moved
out, the file is now focused entirely on `RowGroupAccessPlanFilter`).
Which issue does this PR close?
Relates to the discussion in #22024 about the Parquet datasource crate becoming hard to navigate as more features land. Does not close that issue — this is a small, mechanical step that makes future hook/extension work easier.
Rationale for this change
Several files in the parquet datasource crate have grown to bundle multiple responsibilities into one place — making it hard to read and review changes in isolation. This PR is pure code motion: no public API changes (existing import paths preserved via re-exports), no behavior changes.
The hook/extension trait that addresses #22024's broader concern ("the opener is becoming a mini planner") is being prepared as a separate PR so reviewers can evaluate each change on its own merits.
What changes are included in this PR?
Five separate moves grouped into one PR because they're all the same flavor of work:
1. `opener.rs` (~2,700 LOC) → `opener/` module:
   - `opener/early_stop.rs` — `EarlyStoppingStream` (~100 LOC), the dynamic-filter early-termination wrapper used at the end of `build_stream`.
   - `opener/encryption.rs` — `EncryptionContext` + the `ParquetMorselizer::get_encryption_context` helpers. Isolates the `#[cfg(feature = "parquet_encryption")]` gating that previously bled through the main file.
   - `opener/push_decoder_stream.rs` — `PushDecoderStreamState`, the inner stream state machine that drives a `ParquetPushDecoder`. Distinct from the outer (open-file) state machine they previously shared a file with.
2. `file_format.rs` (~2,000 LOC) split:
   - `sink.rs` — `ParquetSink` + the parallel-write machinery (`column_serializer_task`, `spawn_column_parallel_row_group_writer`, `output_single_parquet_file_parallelized`, `concatenate_parallel_row_groups`, etc.). The historical `file_format::ParquetSink` path is preserved via `pub use`.
   - `schema_coercion.rs` — Arrow-schema coercion utilities (`apply_file_schema_type_coercions`, `coerce_int96_to_resolution`, `coerce_file_schema_to_view_type`, `coerce_file_schema_to_string_type`, `transform_schema_to_view`, `transform_binary_to_string`, `field_with_new_type`) and their tests. Re-exported at the crate root for backward compat.
3. `row_group_filter.rs` (~1,900 LOC):
   - `bloom_filter.rs` — `BloomFilterStatistics` (the loaded SBBF data + its `PruningStatistics` adapter). Separates "data we loaded from the file" from "the access-plan filter that consumes it."

Net effect
- `opener.rs` (now `opener/mod.rs`): 2,717 → 2,433 LOC (-10%).
- `file_format.rs`: 2,038 → 633 LOC (-69%).
- `row_group_filter.rs`: 1,929 → 1,756 LOC (-9%).
Are these changes tested?
Yes:
- `datafusion-datasource-parquet` unit tests pass (the two `coerce_int96_to_resolution_*` tests moved with the function to `schema_coercion.rs`).
- `cargo fmt --all`, `./dev/rust_lint.sh`, and `cargo clippy -p datafusion-datasource-parquet --all-targets --all-features -- -D warnings` all pass.
- `datafusion` core builds clean.
- `datafusion-proto` (which imports `ParquetSink` via the historical `file_format::ParquetSink` path) builds clean thanks to the `pub use` re-export.
Are there any user-facing changes?
No. Public API is unchanged — every previously-public item is still reachable at the same crate-root path. The only difference is the file organization inside the crate.