ParquetSet is misleading and not interoperable with the Python data ecosystem #228

@Jdev256

Problem: ParquetSet is neither discoverable nor interoperable

The object returned by sinan.download() is a ParquetSet, but its current API makes it unnecessarily hard to use and violates common Python and data ecosystem conventions.

Current issues

The ParquetSet object:

  • Prints a filesystem path via __str__(), which misleads users into assuming it is a path-like object
  • Is not iterable, breaking standard Python expectations for a “set”-like container
  • Does not expose any explicit path attributes (.path, .paths, .files)
  • Is not compatible with pandas or polars readers
  • Does not document the correct way to load the underlying parquet data

As a result, users are forced to reverse-engineer the object behavior, effectively turning them into testers.
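The first two points can be reproduced without the library. The sketch below uses a hypothetical stand-in class, PathPrintingSet, which is not the real ParquetSet (its internals are not shown in this issue). It assumes only what is described above: the object defines __str__() returning a path and nothing else. Printing it looks like a path, yet os.fspath() rejects it and it cannot be iterated.

```python
import os


class PathPrintingSet:
    """Hypothetical stand-in: prints like a path but is not path-like."""

    def __init__(self, path):
        self._path = path

    def __str__(self):
        return self._path


obj = PathPrintingSet("/tmp/data.parquet")
printed = str(obj)  # looks exactly like a filesystem path

# os.fspath() only accepts str, bytes, or objects defining __fspath__.
try:
    os.fspath(obj)
    is_path_like = True
except TypeError:
    is_path_like = False

# No __iter__ (and no __getitem__), so iter(obj) would raise TypeError.
is_iterable = hasattr(obj, "__iter__")
```

Any function that calls os.fspath() on its argument, which includes open() and most path-accepting readers, fails the same way on such an object.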

Violated principles

  • Principle of Least Surprise
  • Self-describing API
  • Interoperability with the Python data ecosystem

Proposed solution

Implement the file system path protocol (os.PathLike, PEP 519) by adding __fspath__ to ParquetSet:

class ParquetSet:
    def __fspath__(self):
        # Reuse the path that __str__() already returns.
        return str(self)

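This small change would let the object be passed directly to readers that accept os.PathLike arguments:

import pandas as pd
import polars as pl
# files is the ParquetSet returned by sinan.download()
pd.read_parquet(files)
pl.read_parquet(files)
pl.scan_parquet(files)

It requires no breaking changes, no new abstractions, and no additional documentation burden.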
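As a rough check of the proposal, here is a minimal sketch using only the standard library. PathLikeSet is a hypothetical stand-in for ParquetSet with the proposed method added, and the file it points to is an empty placeholder rather than real parquet data. It shows that once __fspath__ exists, os.fspath(), os.path.exists(), pathlib.Path, and open() all accept the object.

```python
import os
import tempfile
from pathlib import Path


class PathLikeSet:
    """Hypothetical stand-in for ParquetSet with __fspath__ added."""

    def __init__(self, path):
        self._path = str(path)

    def __str__(self):
        return self._path

    def __fspath__(self):
        # Must return str or bytes; here it reuses the printed path.
        return str(self)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.parquet")
    Path(target).write_bytes(b"")  # empty placeholder, not real parquet

    files = PathLikeSet(target)

    same_path = os.fspath(files) == target      # protocol returns the path
    recognized = isinstance(files, os.PathLike)  # detected via __fspath__
    exists = os.path.exists(files)               # os.path accepts it
    as_pathlib = Path(files)                     # pathlib accepts it
    with open(files, "rb") as fh:                # open() accepts it
        content = fh.read()
```

Whether a particular pandas or polars version accepts an arbitrary os.PathLike object (rather than only str or pathlib.Path) is worth verifying against the installed release. If a ParquetSet refers to a directory of files rather than a single file, the readers would receive that directory path.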


Benefits

  • Restores expected Python behavior
  • Enables seamless integration with pandas and polars
  • Reduces API surface and user confusion
  • Eliminates the need for helper methods such as to_dataframe()
  • Improves usability without altering internal design

This change optimizes developer experience while preserving the original intent of ParquetSet.
