[Feature][DataLoader] Make DataLoaderSplit picklable for distributed computing #490

Open
ShreyeshArangath wants to merge 3 commits into linkedin:main from ShreyeshArangath:feat/pickle-dataloader-split

ShreyeshArangath (Collaborator) commented Mar 9, 2026

Summary

Makes DataLoaderSplit safely picklable so splits can be serialized and sent to distributed workers (e.g. Ray). This is required for frameworks that ship work units across process boundaries.

  • Eager Substrait serialization: LogicalPlan and SessionContext are transient DataFusion FFI objects that cannot be pickled. Instead of deferring serialization to __getstate__, we eagerly convert the plan to Substrait bytes in __init__ and never store the FFI objects as instance fields. This removes the need for __getstate__/__setstate__ entirely; the object is picklable by construction.
  • TableScanContext pickling via __reduce__: extracts io.properties for serialization and reconstructs FileIO on unpickle. Passes location=table_metadata.location to load_file_io so that the reconstructed FileIO selects the correct filesystem implementation.
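
For reviewers who want the shape of the two ideas above without reading the diff, here is a minimal sketch. It is not the code in this PR: FakePlan, FakeFileIO, load_fake_io, ScanContext and Split are hypothetical stand-ins for LogicalPlan, FileIO, load_file_io, TableScanContext and DataLoaderSplit, and a threading.Lock plays the role of an unpicklable FFI handle.

```python
import pickle
import threading
from typing import Optional


class FakePlan:
    """Hypothetical stand-in for an FFI-backed plan; the lock makes it unpicklable."""

    def __init__(self, sql: str) -> None:
        self.sql = sql
        self._handle = threading.Lock()  # locks cannot be pickled

    def to_bytes(self) -> bytes:
        # Stand-in for converting the plan to Substrait bytes.
        return self.sql.encode("utf-8")


class FakeFileIO:
    """Hypothetical stand-in for a FileIO built from a properties mapping."""

    def __init__(self, properties: dict, location: Optional[str]) -> None:
        self.properties = properties
        self.location = location
        self._client = threading.Lock()  # unpicklable resource


def load_fake_io(properties: dict, location: Optional[str] = None) -> FakeFileIO:
    # Assumed to mirror the role of load_file_io: build the IO from plain data.
    return FakeFileIO(properties, location)


def _rebuild_scan_context(properties: dict, table_location: str) -> "ScanContext":
    # Runs on the receiving side: the IO object is rebuilt, not transferred.
    return ScanContext(load_fake_io(properties, location=table_location), table_location)


class ScanContext:
    def __init__(self, io: FakeFileIO, table_location: str) -> None:
        self.io = io
        self.table_location = table_location

    def __reduce__(self):
        # Ship only plain data (a dict copy and a string).
        return (_rebuild_scan_context, (dict(self.io.properties), self.table_location))


class Split:
    def __init__(self, plan: Optional[FakePlan], scan_context: ScanContext) -> None:
        # Convert eagerly; the plan object itself is never stored on the instance.
        self.plan_bytes: Optional[bytes] = plan.to_bytes() if plan is not None else None
        self.scan_context = scan_context


location = "s3://bucket/tbl"
ctx = ScanContext(load_fake_io({"s3.region": "us-east-1"}, location), location)
split = Split(FakePlan("SELECT 1"), ctx)
restored = pickle.loads(pickle.dumps(split))
```

Because Split holds only bytes and a ScanContext that knows how to rebuild itself, the default pickle machinery is enough and no __getstate__/__setstate__ pair is needed in this sketch.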

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

ShreyeshArangath and others added 3 commits March 9, 2026 11:54
…computing

Add pickle support to DataLoaderSplit and TableScanContext so splits
can be serialized for Spark/multiprocessing. LogicalPlan is serialized
to substrait bytes via DataFusion's Producer, and FileIO is
reconstructed via load_file_io in TableScanContext.__reduce__.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eagerly serialize LogicalPlan to substrait bytes in __init__ to eliminate
dual representation and silent data loss. Pass location in unpickle to
ensure correct FileIO dispatch. Add comprehensive test coverage for
double round-trip, plan=None path, and invalid argument guard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
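
As an illustration of the double round-trip and plan=None cases mentioned in this commit, the tests could look roughly like the sketch below. SplitSketch and round_trip are hypothetical names used only here; the actual tests in the PR may be structured differently.

```python
import pickle
from dataclasses import dataclass
from typing import Optional


@dataclass
class SplitSketch:
    """Hypothetical stand-in for a split that stores only plain data."""

    plan_bytes: Optional[bytes]
    file_path: str


def round_trip(obj, times: int = 1):
    # Pickle and unpickle the object `times` times in a row.
    for _ in range(times):
        obj = pickle.loads(pickle.dumps(obj))
    return obj


def test_double_round_trip() -> None:
    original = SplitSketch(plan_bytes=b"\x00plan", file_path="data/part-0.parquet")
    # A second round trip catches state that survives one hop but not two.
    assert round_trip(original, times=2) == original


def test_plan_none_round_trip() -> None:
    original = SplitSketch(plan_bytes=None, file_path="data/part-0.parquet")
    restored = round_trip(original, times=2)
    assert restored.plan_bytes is None
    assert restored == original
```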
… add UDF ordering test

- Simplify redundant `if plan is not None and session_context is not None`
  to `if plan is not None` with assert for mypy narrowing
- Add TODO for future DataFusion integration (Substrait -> LogicalPlan)
- Register UDFs before Substrait serialization in __init__
- Add tests: session_context-without-plan raises, UDF registration ordering
- Materialize io.properties to plain dict in TableScanContext.__reduce__

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
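
The guard and ordering described in this commit can be sketched as follows. This is an assumption-laden illustration, not the PR's code: SessionSketch and plan_to_bytes are hypothetical, the plan is a plain string, and ValueError is an assumed exception type.

```python
from typing import List, Optional, Sequence


class SessionSketch:
    """Hypothetical session that records registered UDF names."""

    def __init__(self) -> None:
        self.udfs: List[str] = []

    def register_udf(self, name: str) -> None:
        self.udfs.append(name)

    def serialize(self, plan: str) -> bytes:
        # Stand-in for Substrait serialization; embeds the UDFs visible at this point.
        return f"{plan}|udfs={','.join(self.udfs)}".encode("utf-8")


def plan_to_bytes(
    plan: Optional[str],
    session: Optional[SessionSketch],
    udf_names: Sequence[str] = (),
) -> Optional[bytes]:
    if plan is None:
        if session is not None:
            # Assumed behaviour: a session without a plan is a caller error.
            raise ValueError("session_context was provided without a plan")
        return None
    # A single `plan is not None` branch; the assert narrows Optional for mypy.
    assert session is not None, "a plan requires a session_context"
    # Register UDFs first so the serializer can see them.
    for name in udf_names:
        session.register_udf(name)
    return session.serialize(plan)
```

For example, `plan_to_bytes("scan", SessionSketch(), ["mask"])` returns `b"scan|udfs=mask"` in this sketch, while `plan_to_bytes(None, SessionSketch())` raises.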
@ShreyeshArangath ShreyeshArangath marked this pull request as ready for review March 9, 2026 21:18