Fix bin-pack compaction producing undersized output files by teamurko · Pull Request #233 · linkedin/iceberg

teamurko · 2026-03-04T22:56:33Z

SparkStagedScan uses task.sizeBytes() (data + delete files) as the weight function for bin-packing, but the split size passed from SparkBinPackDataRewriter is computed using ContentScanTask::length (data-only). This mismatch causes each Spark partition to hold less actual data than intended, producing output files smaller than target-file-size-bytes.
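To illustrate the mismatch, here is a minimal sketch. It is not the Iceberg implementation: `Task`, `pack`, and `dataBytes` are hypothetical stand-ins, and the packer is a simplified greedy one with no lookback. It only shows that when the split size is derived from data-only lengths but tasks are weighed by data plus delete bytes, each group carries less data than the target.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

public class WeightMismatchSketch {

  // Hypothetical stand-in for a scan task: data file bytes plus attached delete file bytes.
  static final class Task {
    final long length;      // data-only bytes
    final long deleteBytes; // bytes of associated delete files

    Task(long length, long deleteBytes) {
      this.length = length;
      this.deleteBytes = deleteBytes;
    }

    long sizeBytes() {
      return length + deleteBytes;
    }
  }

  // Simplified greedy packer (an assumption, not Iceberg's bin-packing):
  // start a new group when adding the next task would exceed targetWeight.
  static List<List<Task>> pack(List<Task> tasks, long targetWeight, ToLongFunction<Task> weight) {
    List<List<Task>> groups = new ArrayList<>();
    List<Task> current = new ArrayList<>();
    long currentWeight = 0;
    for (Task task : tasks) {
      long w = weight.applyAsLong(task);
      if (!current.isEmpty() && currentWeight + w > targetWeight) {
        groups.add(current);
        current = new ArrayList<>();
        currentWeight = 0;
      }
      current.add(task);
      currentWeight += w;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  // Sum of data-only bytes in one group, i.e. roughly what ends up in the output file.
  static long dataBytes(List<Task> group) {
    long total = 0;
    for (Task t : group) {
      total += t.length;
    }
    return total;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    List<Task> tasks = new ArrayList<>();
    for (int i = 0; i < 8; i++) {
      tasks.add(new Task(64 * mb, 64 * mb)); // 64 MB of data, 64 MB of deletes
    }
    long splitSize = 256 * mb; // assumed to be derived from data-only lengths

    List<List<Task>> bySizeBytes = pack(tasks, splitSize, Task::sizeBytes);
    List<List<Task>> byLength = pack(tasks, splitSize, t -> t.length);

    // Weighing by sizeBytes: 4 groups, each with 128 MB of data.
    System.out.println(bySizeBytes.size() + " groups, "
        + dataBytes(bySizeBytes.get(0)) / mb + " MB data each");
    // Weighing by length: 2 groups, each with 256 MB of data.
    System.out.println(byLength.size() + " groups, "
        + dataBytes(byLength.get(0)) / mb + " MB data each");
  }
}
```

In this toy setup the `sizeBytes` weighting halves the data per group, which corresponds to the undersized output files described above.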

Change SparkStagedScan to use ContentScanTask::length as the bin-packing weight, consistent with how splitSize was computed. Also fix a pre-existing bug in SparkStagedScan.hashCode() where splitSize was used twice instead of including splitLookback.
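The hashCode() issue can be sketched as follows. `ScanKey` is a hypothetical class with only the two fields named above; the real SparkStagedScan has other fields that are not reproduced here, and the sketch assumes equals() compares both fields. Hashing splitSize twice does not break the equals/hashCode contract, but objects that differ only in splitLookback end up with the same hash.

```java
import java.util.Objects;

public class HashCodeSketch {

  // Hypothetical class holding only the two fields mentioned in the description.
  static final class ScanKey {
    final long splitSize;
    final int splitLookback;

    ScanKey(long splitSize, int splitLookback) {
      this.splitSize = splitSize;
      this.splitLookback = splitLookback;
    }

    @Override
    public boolean equals(Object other) {
      if (this == other) {
        return true;
      }
      if (!(other instanceof ScanKey)) {
        return false;
      }
      ScanKey that = (ScanKey) other;
      return splitSize == that.splitSize && splitLookback == that.splitLookback;
    }

    // Shape of the bug: splitSize is hashed twice, splitLookback is ignored.
    int buggyHashCode() {
      return Objects.hash(splitSize, splitSize);
    }

    // Shape of the fix: every field used in equals() also feeds the hash.
    @Override
    public int hashCode() {
      return Objects.hash(splitSize, splitLookback);
    }
  }

  public static void main(String[] args) {
    ScanKey a = new ScanKey(128L, 10);
    ScanKey b = new ScanKey(128L, 20);

    System.out.println(a.equals(b));                           // false
    System.out.println(a.buggyHashCode() == b.buggyHashCode()); // true: unnecessary collision
    System.out.println(a.hashCode() == b.hashCode());           // false
  }
}
```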

Also change the task weight used in the budget truncation logic from sizeBytes to length.

SparkStagedScan uses task.sizeBytes() (data + delete files) as the weight function for bin-packing, but the split size passed from SparkBinPackDataRewriter is computed using ContentScanTask::length (data-only). This mismatch causes each Spark partition to hold less actual data than intended, producing output files smaller than target-file-size-bytes.

Add a weight function parameter to TableScanUtil.planTaskGroups() and a USE_DATA_ONLY_WEIGHT read option that SparkBinPackDataRewriter sets to signal SparkStagedScan to use ContentScanTask::length as the bin-packing weight, consistent with how splitSize was computed. Also fix a pre-existing bug in SparkStagedScan.hashCode() where splitSize was used twice instead of including splitLookback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackDataRewriter.java

SparkStagedScan is only reachable via rewrite procedures, so the opt-in USE_DATA_ONLY_WEIGHT flag and branching are unnecessary. Use ContentScanTask::length as the weight function unconditionally and remove the read option, config method, and caller flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkStagedScan.java

Add planTaskGroupsWithDataSize() to TableScanUtil that uses ContentScanTask::length instead of sizeBytes() for bin-packing weight. This avoids exposing weight calculation internals in SparkStagedScan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ceberg#233), add integration test

Upgrades iceberg from 1.5.2.7 to 1.5.2.8 which includes a fix for budgeted rewrite task grouping to use data file length instead of sizeBytes (linkedin/iceberg#233). Adds an integration test that validates the corrected behavior.

Upgrades iceberg from 1.5.2.7 to 1.5.2.8 which includes a fix for data rewrite task grouping to use data file length instead of sizeBytes (linkedin/iceberg#233). Adds an integration test that validates the corrected behavior.

## Summary

[Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly discuss the summary of the changes made in this pull request in 2-3 lines.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [x] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [x] Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

## Testing Done

- [ ] Manually Tested on local docker setup. Please include commands ran, and their output.
- [x] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

github-actions bot added the SPARK label Mar 4, 2026

teamurko force-pushed the cmp_fix1 branch from e909b0a to 48e4cc4 March 5, 2026 05:36

github-actions bot added the CORE label Mar 5, 2026

teamurko added 2 commits March 4, 2026 21:46

Style fix

2ac47ac

Use length instead of sizeBytes when truncating task to fit into budget

151dc53

teamurko requested review from cbb330 and sumedhsakdeo March 5, 2026 06:02

sumedhsakdeo reviewed Mar 5, 2026

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackDataRewriter.java (outdated, resolved)

teamurko requested a review from sumedhsakdeo March 5, 2026 20:39

sumedhsakdeo reviewed Mar 5, 2026

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkStagedScan.java (outdated, resolved)

teamurko requested a review from sumedhsakdeo March 5, 2026 21:22

sumedhsakdeo approved these changes Mar 6, 2026

teamurko merged commit a6aef67 into linkedin:openhouse-1.5.2 Mar 6, 2026
23 checks passed

teamurko mentioned this pull request Mar 9, 2026

Upgrade iceberg-1.5.2.8 to include compaction improvement linkedin/openhouse#489

Merged

17 tasks

teamurko merged 5 commits into linkedin:openhouse-1.5.2 from teamurko:cmp_fix1


2 participants
