[DataLoader] Add batch_size for intra-file streaming#493
Closed
cbb330 wants to merge 2 commits into linkedin:main from
Conversation
Tests that validate how Iceberg snapshot expiration interacts with tags and branches, demonstrating that refs protect snapshots from expiration and that the `history.expire.max-ref-age-ms` table property can be used to automatically expire refs and their protected snapshots.
Use ArrivalOrder(concurrent_streams=1) from pyiceberg PR #3046 to stream RecordBatches incrementally instead of materializing entire files into memory. The new batch_size parameter controls rows per batch, preventing OOM on large files in distributed workers.
Folded into #491.
Summary
- Add a `batch_size: int | None` parameter to `OpenHouseDataLoader` and `DataLoaderSplit`
- Switch from `TaskOrder` (materializes the entire file via `list()`) to `ArrivalOrder(concurrent_streams=1)` for intra-file streaming, preventing OOM on large files in distributed workers
- `batch_size` controls rows per RecordBatch; `None` (default) uses the PyArrow default (~131K rows)
- Stacked on #491; merge that first.
Test plan
- `test_batch_size_default_returns_all_data`: backwards compatibility
- `test_batch_size_limits_rows_per_batch`: 100 rows with `batch_size=10` produces ≤10-row batches
- `test_batch_size_returns_correct_data`: data integrity preserved
- `test_batch_size_with_columns_and_filters`: works with projection + filters
- `test_batch_size_with_empty_table`: no crash on an empty table
- `test_split_batch_size_limits_rows_per_batch`: split-level enforcement
- `test_split_batch_size_none_returns_all_rows`: default preserves all data
- `test_split_batch_size_preserves_data`: non-even row counts handled correctly