Fix bin-pack compaction producing undersized output files #233
Merged
teamurko merged 5 commits into linkedin:openhouse-1.5.2 on Mar 6, 2026
Conversation
SparkStagedScan uses task.sizeBytes() (data + delete files) as the weight function for bin-packing, but the split size passed from SparkBinPackDataRewriter is computed using ContentScanTask::length (data-only). This mismatch causes each Spark partition to hold less actual data than intended, producing output files smaller than target-file-size-bytes.

Add a weight function parameter to TableScanUtil.planTaskGroups() and a USE_DATA_ONLY_WEIGHT read option that SparkBinPackDataRewriter sets to signal SparkStagedScan to use ContentScanTask::length as the bin-packing weight, consistent with how splitSize was computed.

Also fix a pre-existing bug in SparkStagedScan.hashCode() where splitSize was used twice instead of including splitLookback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
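The mismatch described above can be sketched with a toy bin-packer. This is a minimal illustration, not the actual Iceberg code: `Task`, `pack`, and the byte values are hypothetical, and it assumes first-fit sequential packing against a split size derived from data-only lengths.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

public class WeightMismatchSketch {
  // Hypothetical stand-in for a scan task: data bytes plus attached delete-file bytes.
  record Task(long dataLength, long deleteBytes) {
    long sizeBytes() { return dataLength + deleteBytes; }
  }

  // Sequential bin packing: start a new group when adding the next task
  // would exceed the split size.
  static List<List<Task>> pack(List<Task> tasks, long splitSize, ToLongFunction<Task> weight) {
    List<List<Task>> groups = new ArrayList<>();
    List<Task> current = new ArrayList<>();
    long used = 0;
    for (Task t : tasks) {
      long w = weight.applyAsLong(t);
      if (!current.isEmpty() && used + w > splitSize) {
        groups.add(current);
        current = new ArrayList<>();
        used = 0;
      }
      current.add(t);
      used += w;
    }
    if (!current.isEmpty()) groups.add(current);
    return groups;
  }

  public static void main(String[] args) {
    // Four files with 100 data bytes each, each carrying 100 delete-file bytes.
    List<Task> tasks = List.of(new Task(100, 100), new Task(100, 100),
                               new Task(100, 100), new Task(100, 100));
    long splitSize = 400; // computed from data-only lengths: 4 x 100

    // Weighting by sizeBytes() (data + deletes) splits the work into two
    // groups holding only 200 data bytes each -> undersized output files.
    System.out.println(pack(tasks, splitSize, Task::sizeBytes).size()); // 2

    // Weighting by data length packs all four files into one group,
    // matching how splitSize was computed.
    System.out.println(pack(tasks, splitSize, Task::dataLength).size()); // 1
  }
}
```

The example shows why output files come out at roughly half the target when deletes are heavy: the delete bytes inflate the weight but contribute nothing to the rewritten data size.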
sumedhsakdeo reviewed on Mar 5, 2026
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackDataRewriter.java
SparkStagedScan is only reachable via rewrite procedures, so the opt-in USE_DATA_ONLY_WEIGHT flag and branching are unnecessary. Use ContentScanTask::length as the weight function unconditionally and remove the read option, config method, and caller flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sumedhsakdeo reviewed on Mar 5, 2026
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkStagedScan.java
Add planTaskGroupsWithDataSize() to TableScanUtil that uses ContentScanTask::length instead of sizeBytes() for bin-packing weight. This avoids exposing weight calculation internals in SparkStagedScan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
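The shape of that refactor can be sketched as two public entry points delegating to a shared core parameterized by the weight function. This is an illustrative sketch only: the real TableScanUtil operates on Iceberg ScanTask/TaskGroup types with different signatures, and `ScanTask`, `planTaskGroups`, and `planTaskGroupsWithDataSize` here are simplified analogues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

public class TaskGroupPlannerSketch {
  // Simplified analogue of an Iceberg scan task.
  interface ScanTask {
    long length();    // data bytes only (ContentScanTask::length analogue)
    long sizeBytes(); // data + delete bytes
  }

  // Existing entry point: weight by sizeBytes(), unchanged for other callers.
  static <T extends ScanTask> List<List<T>> planTaskGroups(List<T> tasks, long splitSize) {
    return planTaskGroups(tasks, splitSize, ScanTask::sizeBytes);
  }

  // New entry point: weight by data length, matching how splitSize is computed
  // in the rewriter, without exposing the weight function to SparkStagedScan.
  static <T extends ScanTask> List<List<T>> planTaskGroupsWithDataSize(
      List<T> tasks, long splitSize) {
    return planTaskGroups(tasks, splitSize, ScanTask::length);
  }

  // Shared bin-packing core, parameterized by the weight function.
  private static <T extends ScanTask> List<List<T>> planTaskGroups(
      List<T> tasks, long splitSize, ToLongFunction<T> weight) {
    List<List<T>> groups = new ArrayList<>();
    List<T> current = new ArrayList<>();
    long used = 0;
    for (T task : tasks) {
      long w = weight.applyAsLong(task);
      if (!current.isEmpty() && used + w > splitSize) {
        groups.add(current);
        current = new ArrayList<>();
        used = 0;
      }
      current.add(task);
      used += w;
    }
    if (!current.isEmpty()) groups.add(current);
    return groups;
  }
}
```

Keeping the weight-parameterized overload private preserves the point of the review comment: callers choose between two named planners, and the weight calculation stays an internal detail of the utility.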
sumedhsakdeo approved these changes on Mar 6, 2026
teamurko added a commit to teamurko/openhouse that referenced this pull request on Mar 9, 2026
…ceberg#233), add integration test

Upgrades iceberg from 1.5.2.7 to 1.5.2.8, which includes a fix for budgeted rewrite task grouping to use data file length instead of sizeBytes (linkedin/iceberg#233). Adds an integration test that validates the corrected behavior.
teamurko added a commit to linkedin/openhouse that referenced this pull request on Mar 9, 2026
Upgrades iceberg from 1.5.2.7 to 1.5.2.8, which includes a fix for data rewrite task grouping to use data file length instead of sizeBytes (linkedin/iceberg#233). Adds an integration test that validates the corrected behavior.

## Changes

- [x] Bug Fixes
- [x] Tests

## Testing Done

- [x] Added new tests for the changes made.
SparkStagedScan uses task.sizeBytes() (data + delete files) as the weight function for bin-packing, but the split size passed from SparkBinPackDataRewriter is computed using ContentScanTask::length (data-only). This mismatch causes each Spark partition to hold less actual data than intended, producing output files smaller than target-file-size-bytes.
Change SparkStagedScan to use ContentScanTask::length as the bin-packing weight, consistent with how splitSize was computed. Also fix a pre-existing bug in SparkStagedScan.hashCode() where splitSize was used twice instead of including splitLookback.
Fix task weight in budget truncation logic from sizeBytes to length.
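The budget truncation fix follows the same logic as the grouping fix: tasks are taken until a byte budget is exhausted, so weighting by sizeBytes() under-fills the budget with actual data. A minimal sketch, with hypothetical `Task` and `truncateToBudget` names standing in for the real budgeted-rewrite code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

public class BudgetTruncationSketch {
  // Hypothetical rewrite task: data bytes plus attached delete-file bytes.
  record Task(long dataLength, long deleteBytes) {
    long sizeBytes() { return dataLength + deleteBytes; }
  }

  // Keep taking tasks until the cumulative weight would exceed the budget.
  static List<Task> truncateToBudget(List<Task> tasks, long budget, ToLongFunction<Task> weight) {
    List<Task> selected = new ArrayList<>();
    long used = 0;
    for (Task t : tasks) {
      long w = weight.applyAsLong(t);
      if (used + w > budget) break;
      selected.add(t);
      used += w;
    }
    return selected;
  }

  public static void main(String[] args) {
    List<Task> tasks = List.of(new Task(100, 100), new Task(100, 100), new Task(100, 100));
    long budget = 300;

    // Weighting by sizeBytes() (200 each): only one task fits,
    // so only 100 data bytes get rewritten under a 300-byte budget.
    System.out.println(truncateToBudget(tasks, budget, Task::sizeBytes).size()); // 1

    // Weighting by data length (100 each): all three tasks fit,
    // using the full 300-byte data budget as intended.
    System.out.println(truncateToBudget(tasks, budget, Task::dataLength).size()); // 3
  }
}
```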