Skip to content

Fix bin-pack compaction producing undersized output files#233

Merged
teamurko merged 5 commits intolinkedin:openhouse-1.5.2from
teamurko:cmp_fix1
Mar 6, 2026
Merged

Fix bin-pack compaction producing undersized output files#233
teamurko merged 5 commits intolinkedin:openhouse-1.5.2from
teamurko:cmp_fix1

Conversation

@teamurko
Copy link
Collaborator

@teamurko teamurko commented Mar 4, 2026

SparkStagedScan uses task.sizeBytes() (data + delete files) as the weight function for bin-packing, but the split size passed from SparkBinPackDataRewriter is computed using ContentScanTask::length (data-only). This mismatch causes each Spark partition to hold less actual data than intended, producing output files smaller than target-file-size-bytes.

Change SparkStagedScan to use ContentScanTask::length as the bin-packing weight, consistent with how splitSize was computed. Also fix a pre-existing bug in SparkStagedScan.hashCode() where splitSize was used twice instead of including splitLookback.

Fix task weight in budget truncation logic from sizeBytes to length.

@github-actions github-actions bot added the SPARK label Mar 4, 2026
SparkStagedScan uses task.sizeBytes() (data + delete files) as the
weight function for bin-packing, but the split size passed from
SparkBinPackDataRewriter is computed using ContentScanTask::length
(data-only). This mismatch causes each Spark partition to hold less
actual data than intended, producing output files smaller than
target-file-size-bytes.

Add a weight function parameter to TableScanUtil.planTaskGroups() and
a USE_DATA_ONLY_WEIGHT read option that SparkBinPackDataRewriter sets
to signal SparkStagedScan to use ContentScanTask::length as the
bin-packing weight, consistent with how splitSize was computed. Also
fix a pre-existing bug in SparkStagedScan.hashCode() where splitSize
was used twice instead of including splitLookback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@teamurko teamurko requested review from cbb330 and sumedhsakdeo March 5, 2026 06:02
SparkStagedScan is only reachable via rewrite procedures, so the
opt-in USE_DATA_ONLY_WEIGHT flag and branching are unnecessary.
Use ContentScanTask::length as the weight function unconditionally
and remove the read option, config method, and caller flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@teamurko teamurko requested a review from sumedhsakdeo March 5, 2026 20:39
Add planTaskGroupsWithDataSize() to TableScanUtil that uses
ContentScanTask::length instead of sizeBytes() for bin-packing weight.
This avoids exposing weight calculation internals in SparkStagedScan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@teamurko teamurko requested a review from sumedhsakdeo March 5, 2026 21:22
@teamurko teamurko merged commit a6aef67 into linkedin:openhouse-1.5.2 Mar 6, 2026
23 checks passed
teamurko added a commit to teamurko/openhouse that referenced this pull request Mar 9, 2026
…ceberg#233), add integration test

Upgrades iceberg from 1.5.2.7 to 1.5.2.8 which includes a fix for
budgeted rewrite task grouping to use data file length instead of
sizeBytes (linkedin/iceberg#233). Adds an integration test that
validates the corrected behavior.
teamurko added a commit to linkedin/openhouse that referenced this pull request Mar 9, 2026
Upgrades iceberg from 1.5.2.7 to 1.5.2.8 which includes a fix for data
rewrite task grouping to use data file length instead of sizeBytes
(linkedin/iceberg#233). Adds an integration test that validates the
corrected behavior.

## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

[Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly
discuss the summary of the changes made in this
pull request in 2-3 lines.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [x] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [x] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants