Record sort order when writing Parquet with WITH ORDER #19595
Merged
adriangb merged 2 commits into apache:main on Jan 8, 2026
Pull request overview
This PR implements the recording of sort order metadata in Parquet files when writing data with WITH ORDER clauses. When an external table is created with an ordering specification, subsequent INSERT INTO or COPY operations will now embed sorting column information in the Parquet row group metadata, enabling downstream readers to potentially skip redundant sort operations.
- Adds conversion functions to translate DataFusion ordering expressions to Parquet `SortingColumn` metadata
- Updates `ParquetSink` to accept and propagate sorting column information through the writer pipeline
- Includes comprehensive test coverage to verify metadata is correctly written
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| datafusion/datasource-parquet/src/metadata.rs | Adds sort_expr_to_sorting_column() and lex_ordering_to_sorting_columns() helper functions to convert DataFusion ordering to Parquet sorting metadata |
| datafusion/datasource-parquet/src/file_format.rs | Integrates sorting column conversion into create_writer_physical_plan() and updates ParquetSink with builder pattern support for sorting columns; modifies create_writer_props() to set sorting columns on writer properties |
| datafusion/core/tests/parquet/ordering.rs | Adds new test file with test_create_table_with_order_writes_sorting_columns to verify sorting metadata is correctly written to Parquet files |
| datafusion/core/tests/parquet/mod.rs | Registers the new ordering test module |
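The conversion described in the table can be pictured roughly as follows. This is a minimal, self-contained sketch and not the code from `metadata.rs`: `SortingColumn`, `SortKey`, and `SortExpr` here are local stand-in types (hypothetical simplifications of the Parquet and DataFusion types), and returning an error for non-column sort expressions is an assumption about how such cases might be handled.

```rust
use std::convert::TryFrom;

/// Local stand-in for a Parquet sorting-column metadata entry (hypothetical copy).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Simplified stand-in for the expression part of a physical sort expression.
#[derive(Debug, Clone)]
pub enum SortKey {
    /// A plain column reference, identified by its index in the file schema.
    Column { index: usize },
    /// Any other expression, kept only as display text (for example `a + 1`).
    Other(String),
}

/// Simplified stand-in for one entry of a lexicographic ordering.
#[derive(Debug, Clone)]
pub struct SortExpr {
    pub key: SortKey,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Convert a single sort expression into a sorting-column entry.
/// Assumption: only plain column references can be recorded.
pub fn sort_expr_to_sorting_column(expr: &SortExpr) -> Result<SortingColumn, String> {
    match &expr.key {
        SortKey::Column { index } => {
            let column_idx = i32::try_from(*index)
                .map_err(|_| format!("column index {} does not fit in i32", index))?;
            Ok(SortingColumn {
                column_idx,
                descending: expr.descending,
                nulls_first: expr.nulls_first,
            })
        }
        SortKey::Other(text) => Err(format!(
            "cannot record non-column sort expression `{}`",
            text
        )),
    }
}

/// Convert a whole ordering, preserving its order; fails on the first
/// expression that cannot be converted.
pub fn lex_ordering_to_sorting_columns(
    ordering: &[SortExpr],
) -> Result<Vec<SortingColumn>, String> {
    ordering.iter().map(sort_expr_to_sorting_column).collect()
}
```

Under these assumptions, an ordering such as `a ASC NULLS FIRST, b DESC NULLS LAST` over columns 0 and 1 maps to two entries: `(0, descending = false, nulls_first = true)` and `(1, descending = true, nulls_first = false)`.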
When writing data to a table created with `CREATE EXTERNAL TABLE ... WITH ORDER`, this change records the sorting columns in the Parquet file's row group metadata.

Changes:
- Add `sort_expr_to_sorting_column()` and `lex_ordering_to_sorting_columns()` functions in metadata.rs to convert DataFusion ordering to Parquet SortingColumn
- Add `sorting_columns` field to ParquetSink with `with_sorting_columns()` builder
- Update `create_writer_physical_plan()` to pass order requirements to ParquetSink
- Update `create_writer_props()` to set sorting columns on WriterProperties
- Add test verifying sorting_columns metadata is written correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
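To illustrate the builder step in the commit message, here is a small sketch of how a sink might carry sorting columns and hand them to its writer properties. `SketchSink`, `SketchWriterProps`, and the local `SortingColumn` are hypothetical stand-ins; the real `ParquetSink::with_sorting_columns()` and `create_writer_props()` signatures are not shown in this PR page and may differ.

```rust
/// Local stand-in for a Parquet sorting-column metadata entry (hypothetical copy).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Hypothetical stand-in for the writer properties; only the field
/// relevant to this change is modelled.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct SketchWriterProps {
    pub sorting_columns: Option<Vec<SortingColumn>>,
}

/// Hypothetical stand-in for the sink that writes Parquet files.
#[derive(Debug, Default)]
pub struct SketchSink {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl SketchSink {
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder-style setter: consumes the sink and returns it with the
    /// sorting columns attached (`None` means "no known ordering").
    pub fn with_sorting_columns(mut self, cols: Option<Vec<SortingColumn>>) -> Self {
        self.sorting_columns = cols;
        self
    }

    /// Build writer properties, copying the sorting columns (if any)
    /// so that they end up in the written metadata.
    pub fn create_writer_props(&self) -> SketchWriterProps {
        SketchWriterProps {
            sorting_columns: self.sorting_columns.clone(),
        }
    }
}
```

Keeping the field optional means a sink created without `WITH ORDER` information produces the same writer properties as before, which matches the PR's statement that there are no user-facing API changes.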
Contributor
Author
@zhuqi-lucas are you able to review this?
zhuqi-lucas approved these changes on Jan 7, 2026
LGTM @adriangb, sorry I missed this PR.
Which issue does this PR close?
Part of #19433
Rationale for this change
When writing data to a table created with `CREATE EXTERNAL TABLE ... WITH ORDER`, the sorting columns should be recorded in the Parquet file's row group metadata. This allows downstream readers to know the data is sorted and potentially skip sorting operations.
What changes are included in this PR?
- Add `sort_expr_to_sorting_column()` and `lex_ordering_to_sorting_columns()` functions in `metadata.rs` to convert DataFusion ordering to Parquet `SortingColumn`
- Add `sorting_columns` field to `ParquetSink` with `with_sorting_columns()` builder method
- Update `create_writer_physical_plan()` to pass order requirements to `ParquetSink`
- Update `create_writer_props()` to set sorting columns on `WriterProperties`
- Add test verifying `sorting_columns` metadata is written correctly
Are these changes tested?
Yes, added `test_create_table_with_order_writes_sorting_columns`, which:
- creates a table `WITH ORDER (a ASC NULLS FIRST, b DESC NULLS LAST)`
- verifies that the `sorting_columns` metadata matches the expected order
Are there any user-facing changes?
No user-facing API changes. Parquet files written via `INSERT INTO` or `COPY` for tables with `WITH ORDER` will now contain `sorting_columns` metadata in the row group.
🤖 Generated with Claude Code
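As a rough illustration of how a downstream reader could use this metadata, the sketch below checks whether every row group records an ordering that begins with the ordering a query needs. The `SortingColumn` struct and `recorded_order_satisfies` function are hypothetical and not part of this PR. The check only inspects the recorded metadata; whether the ordering also holds across row groups or across files is an assumption a real reader would still have to establish before skipping a sort.

```rust
/// Local stand-in for a Parquet sorting-column metadata entry (hypothetical copy).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Returns true when every row group records sorting columns whose
/// leading entries equal `required`.
///
/// `row_groups` holds one entry per row group: `Some(cols)` if the row
/// group recorded sorting columns, `None` if it did not. A row group
/// without metadata only satisfies an empty requirement. A file with no
/// row groups satisfies any requirement, since it contains no rows.
pub fn recorded_order_satisfies(
    row_groups: &[Option<Vec<SortingColumn>>],
    required: &[SortingColumn],
) -> bool {
    row_groups.iter().all(|recorded| match recorded {
        // A longer recorded ordering still satisfies a shorter prefix.
        Some(cols) => cols.starts_with(required),
        None => required.is_empty(),
    })
}
```

A prefix comparison is used because data sorted by `(a, b)` is also sorted by `a` alone, while the reverse does not hold.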