Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O#186
Draft
zhuqi-lucas wants to merge 2 commits into
Draft
Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O#186zhuqi-lucas wants to merge 2 commits into
zhuqi-lucas wants to merge 2 commits into
Conversation
A walkthrough of the sort pushdown work landed and in flight on Apache DataFusion. Covers: - Why SortExec is expensive and what `Exact` / `Inexact` ordering mean at runtime (static `fetch` vs `TopK` dynamic filter). - Phase 1 (#19064): the `PushdownSort` rule + reverse row-group case. - Phase 2 (#21182): statistics-based file sort that upgrades `Unsupported` to `Exact`, eliminating the `SortExec` on non-overlapping ASC scans. Includes the BufferExec compensation (#21426) so the SPM above doesn't lose its implicit memory buffer. - Reverse scans: today's row-group reverse (Inexact, #18817) and the community decision to wait for arrow-rs page-level reverse (#9937) before pursuing Exact reverse, after memory-profile pushback on the original row-group-level proposal. - Benchmarks: 2.1×-49× on the ASC-LIMIT sort_pushdown suite. - What's next: the dynamic / TopK-driven path (#21351 merged, #21733, #21712, #21956, #21580) including the precise RG-pruning vs mid-stream-early-return distinction, and the EnsureRequirements unification (#21976). - Links into the prior dynamic filters and limit pruning posts so the series reads as a coherent thread.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A walkthrough of the sort pushdown work landed and in flight on Apache DataFusion. Opening as a draft to share the narrative early — the in-flight PRs the post discusses are still in flight, but the structure and the merged work (Phase 1 #19064, Phase 2 #21182, BufferExec #21426, row-group reverse #18817) are in their final shape.
What this post covers
SortExecis expensive, and whatExact/Inexactmean at runtime (staticfetchvsTopKdynamic filter).PushdownSortrule + reverse row-group case.UnsupportedtoExact, eliminating theSortExecon non-overlapping ASC scans. Includes theBufferExeccompensation (#21426) so the SPM above doesn't lose its implicit memory buffer.Exactreverse, after memory-profile pushback on the original row-group-level proposal.sort_pushdownsuite.EnsureRequirementsunification (#21976).Why a draft
A few of the in-flight PRs the "What's next" section references may evolve in review (e.g. #21580 may be split into smaller pieces, dynamic RG scheduling on top of #21351 is described but not yet on a PR). Opening as draft so we can adjust wording as those land or change shape — happy to flip to ready for review when the dust settles, or earlier if reviewers prefer.
Test plan