[DataLoader] Add batch_size for intra-file streaming#493

Closed
cbb330 wants to merge 2 commits into linkedin:main from cbb330:chbush/dataloader-batch-size

cbb330 commented Mar 10, 2026

Summary

  • Add batch_size: int | None parameter to OpenHouseDataLoader and DataLoaderSplit
  • Switch from TaskOrder (materializes entire file via list()) to ArrivalOrder(concurrent_streams=1) for intra-file streaming — prevents OOM on large files in distributed workers
  • batch_size controls rows per RecordBatch; None (default) uses PyArrow default (~131K rows)

Stacked on #491 — merge that first.
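
For readers unfamiliar with the distinction in the summary, here is a minimal pure-Python sketch of materializing a file versus streaming it in bounded batches. It does not use pyiceberg or PyArrow: `read_rows`, `stream_batches`, and `DEFAULT_BATCH_SIZE` are hypothetical stand-ins, and the default value only mirrors the ~131K figure quoted above.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")

# Assumption: mirrors the "~131K rows" default mentioned in the summary.
DEFAULT_BATCH_SIZE = 131_072


def read_rows(n: int) -> Iterator[int]:
    """Hypothetical stand-in for a file reader that yields rows lazily."""
    for i in range(n):
        yield i


def stream_batches(
    rows: Iterable[T], batch_size: Optional[int] = None
) -> Iterator[List[T]]:
    """Yield lists of at most `batch_size` rows without reading ahead.

    Only one batch is held in memory at a time, unlike list(rows),
    which pulls the whole source into memory before returning.
    """
    size = DEFAULT_BATCH_SIZE if batch_size is None else batch_size
    if size <= 0:
        raise ValueError("batch_size must be a positive integer")
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Materializing: every row is resident at once.
all_rows = list(read_rows(100))

# Streaming: at most 10 rows are resident per iteration.
for batch in stream_batches(read_rows(100), batch_size=10):
    assert len(batch) <= 10
```

Peak memory in the streaming path scales with `batch_size` rather than with file size, which is the property the PR relies on to avoid OOM in distributed workers.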

Test plan

  • test_batch_size_default_returns_all_data — backwards compatibility
  • test_batch_size_limits_rows_per_batch — 100 rows with batch_size=10 produces ≤10-row batches
  • test_batch_size_returns_correct_data — data integrity preserved
  • test_batch_size_with_columns_and_filters — works with projection + filters
  • test_batch_size_with_empty_table — no crash on empty table
  • test_split_batch_size_limits_rows_per_batch — split-level enforcement
  • test_split_batch_size_none_returns_all_rows — default preserves all data
  • test_split_batch_size_preserves_data — non-even row counts handled correctly
  • All 86 existing tests pass
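
The invariants these tests check can be sketched roughly as follows. This is not the test code from the PR: `split_into_batches` is a hypothetical stand-in for the loader, `check_batches` is an illustrative helper, and treating `batch_size=None` as a single batch is a simplification.

```python
from typing import List, Optional, Sequence


def split_into_batches(
    rows: Sequence[int], batch_size: Optional[int] = None
) -> List[List[int]]:
    """Hypothetical stand-in for the loader: slice rows into batches."""
    if batch_size is None:
        # Simplification: no limit means a single batch (or none if empty).
        return [list(rows)] if rows else []
    return [
        list(rows[i : i + batch_size]) for i in range(0, len(rows), batch_size)
    ]


def check_batches(
    batches: List[List[int]], expected_rows: Sequence[int], batch_size: Optional[int]
) -> None:
    """Assert the row cap and that no rows are lost, duplicated, or reordered."""
    if batch_size is not None:
        assert all(len(b) <= batch_size for b in batches)
    flattened = [row for batch in batches for row in batch]
    assert flattened == list(expected_rows)


rows = list(range(103))  # deliberately not a multiple of the batch size
check_batches(split_into_batches(rows, 10), rows, 10)      # capped batches
check_batches(split_into_batches(rows, None), rows, None)  # default keeps all rows
check_batches(split_into_batches([], 10), [], 10)          # empty table
```

With 103 rows and `batch_size=10`, the final batch holds the 3 leftover rows, which is the non-even case the split-level test is meant to cover.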

cbb330 added 2 commits March 10, 2026 10:52

Tests that validate how Iceberg snapshot expiration interacts with tags
and branches, demonstrating that refs protect snapshots from expiration
and that the `history.expire.max-ref-age-ms` table property can be used
to automatically expire refs and their protected snapshots.

Use ArrivalOrder(concurrent_streams=1) from pyiceberg PR #3046 to
stream RecordBatches incrementally instead of materializing entire
files into memory. The new batch_size parameter controls rows per
batch, preventing OOM on large files in distributed workers.

cbb330 commented Mar 10, 2026

Folded into #491

cbb330 closed this Mar 10, 2026