[Feature][DataLoader] Make DataLoaderSplit picklable for distributed computing #490
Open
ShreyeshArangath wants to merge 3 commits into linkedin:main
…computing
Add pickle support to DataLoaderSplit and TableScanContext so splits can be serialized for Spark/multiprocessing. LogicalPlan is serialized to Substrait bytes via DataFusion's Producer, and FileIO is reconstructed via load_file_io in TableScanContext.__reduce__.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Eagerly serialize LogicalPlan to Substrait bytes in __init__ to eliminate dual representation and silent data loss. Pass location in unpickle to ensure correct FileIO dispatch. Add comprehensive test coverage for double round-trip, plan=None path, and invalid argument guard.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… add UDF ordering test
- Simplify redundant `if plan is not None and session_context is not None` to `if plan is not None` with assert for mypy narrowing
- Add TODO for future DataFusion integration (Substrait -> LogicalPlan)
- Register UDFs before Substrait serialization in __init__
- Add tests: session_context-without-plan raises, UDF registration ordering
- Materialize io.properties to plain dict in TableScanContext.__reduce__
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Makes DataLoaderSplit safely picklable so splits can be serialized and sent to distributed workers (e.g. Ray). This is required for frameworks that ship work units across process boundaries.
- DataLoaderSplit: Instead of implementing `__getstate__`, we eagerly convert the plan to Substrait bytes in `__init__` and never store the FFI objects as instance fields. This eliminates `__getstate__`/`__setstate__` entirely; the object is unconditionally picklable by construction.
- `TableScanContext.__reduce__`: Extracts `io.properties` for serialization and reconstructs FileIO on unpickle. Passes `location=table_metadata.location` to `load_file_io` so the reconstructed FileIO selects the correct filesystem implementation.

Changes
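As an illustration of the two mechanisms summarized above, here is a minimal sketch using stand-in classes only. It does not use the real DataLoaderSplit, TableScanContext, DataFusion, or FileIO APIs; every name in it (`SplitSketch`, `ScanContextSketch`, `FakePlan`, `FakeFileIO`, `fake_load_file_io`, `plan_to_bytes`) is hypothetical, and a `threading.Lock` stands in for any unpicklable FFI or I/O handle. The actual implementation in this PR may differ in signatures and details.

```python
import pickle
import threading
from typing import Callable, Dict, Optional


class FakeFileIO:
    """Stand-in for a FileIO that holds a live, unpicklable resource."""

    def __init__(self, properties: Dict[str, str], location: Optional[str] = None):
        self.properties = properties
        self.location = location
        self._handle = threading.Lock()  # locks cannot be pickled


def fake_load_file_io(properties: Dict[str, str], location: Optional[str] = None) -> FakeFileIO:
    """Hypothetical factory playing the role of load_file_io."""
    return FakeFileIO(properties, location)


def _rebuild_scan_context(location: str, properties: Dict[str, str]) -> "ScanContextSketch":
    # Runs on the receiving side: build a fresh FileIO from plain data.
    return ScanContextSketch(location, fake_load_file_io(properties, location=location))


class ScanContextSketch:
    """Pickles as (location, properties) instead of the live FileIO."""

    def __init__(self, location: str, io: FakeFileIO):
        self.location = location
        self.io = io

    def __reduce__(self):
        # Materialize properties to a plain dict so only simple data is pickled.
        return (_rebuild_scan_context, (self.location, dict(self.io.properties)))


class FakePlan:
    """Stand-in for an FFI-backed logical plan that cannot be pickled."""

    def __init__(self, text: str):
        self.text = text
        self._ffi = threading.Lock()


class SplitSketch:
    """Stores only bytes and a picklable context, so default pickling works."""

    def __init__(
        self,
        scan_context: ScanContextSketch,
        plan: Optional[FakePlan] = None,
        plan_to_bytes: Optional[Callable[[FakePlan], bytes]] = None,
    ):
        if plan is None and plan_to_bytes is not None:
            raise ValueError("plan_to_bytes was given without a plan")
        if plan is not None and plan_to_bytes is None:
            raise ValueError("a plan requires plan_to_bytes")
        self.scan_context = scan_context
        # Serialize eagerly; the plan object itself is never kept on self.
        self.plan_bytes: Optional[bytes] = None
        if plan is not None:
            assert plan_to_bytes is not None  # narrows the type for checkers
            self.plan_bytes = plan_to_bytes(plan)


# Usage: round-trip a split through pickle, as a worker process would.
io = fake_load_file_io({"region": "us-east-1"}, location="s3://bucket/table")
context = ScanContextSketch("s3://bucket/table", io)
split = SplitSketch(context, FakePlan("SELECT 1"), lambda p: p.text.encode("utf-8"))
restored = pickle.loads(pickle.dumps(split))
```

Because the split never holds the plan object and the context reduces itself to plain data, no `__getstate__`/`__setstate__` pair is needed in this sketch, and a restored split can be pickled again without special handling.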
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.