DataFrame API for obs and var keys via runtime-checkable Protocol #2043

Description

@ilan-gold

Please describe your wishes and possible alternatives to achieve the desired result.

Our pandas.DataFrame API usage is relatively small: we use only a few built-in methods, such as DataFrame.iloc and DataFrame.reindex. That usage is now clearly "documented" by the adapter class Dataset2D, which is meant to cover all of those use cases (with the exception of concat) so that Dataset2D works internally wherever things like my_df.iloc are called. In theory, anything that satisfies the Protocol that Dataset2D would define (made generic over the underlying type, whether that is Dataset, DataFrame, Polars, etc.) should work as a DataFrame-like class in anndata along obs and var, in obsm, etc.:

    class Dataset2D:
        r"""
        Bases :class:`~collections.abc.Mapping`\ [:class:`~collections.abc.Hashable`, :class:`~xarray.DataArray` | :class:`~anndata.experimental.backed.Dataset2D`\ ]

        A wrapper class meant to enable working with lazy dataframe data according to
        :class:`~anndata.AnnData`'s internal API. This class ensures that "dataframe-invariants"
        are respected, namely that there is only one 1d dim and coord with the same name i.e.,
        like a :class:`pandas.DataFrame`.

        You should not have to initiate this class yourself. Setting an :class:`xarray.Dataset`
        into a relevant part of the :class:`~anndata.AnnData` object will attempt to wrap that
        object in this object, trying to enforce the "dataframe-invariants."

        Because xarray requires :attr:`xarray.Dataset.coords` to be in-memory, this class provides
        handling for an out-of-memory index via :attr:`~anndata.experimental.backed.Dataset2D.true_index`.
        This feature is helpful for loading remote data faster where the index itself may not be initially useful
        for constructing the object e.g., cell ids.
        """
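To make the Protocol idea a bit more concrete, here is a minimal sketch of what a runtime-checkable Protocol could look like. The name `DataFrameLike` and its member list are assumptions for illustration only: the issue names `iloc` and `reindex`, while `index`, `columns`, `shape` and `__getitem__` are guesses at what else anndata might need, not the actual surface of Dataset2D.

```python
from typing import Any, Protocol, runtime_checkable

import pandas as pd


@runtime_checkable
class DataFrameLike(Protocol):
    """Hypothetical protocol; the member list is an assumption, not anndata's API."""

    @property
    def index(self) -> Any: ...

    @property
    def columns(self) -> Any: ...

    @property
    def shape(self) -> tuple[int, ...]: ...

    @property
    def iloc(self) -> Any: ...

    def reindex(self, *args: Any, **kwargs: Any) -> Any: ...

    def __getitem__(self, key: Any) -> Any: ...


df = pd.DataFrame({"cell_type": ["B", "T"]}, index=["cell_0", "cell_1"])

# isinstance() against a runtime-checkable Protocol only checks that the
# members exist; it does not validate signatures or return types.
is_df_like = isinstance(df, DataFrameLike)
is_dict_like = isinstance({"cell_type": ["B", "T"]}, DataFrameLike)
```

Because the runtime check is purely structural, a plain pandas.DataFrame passes without being wrapped, while a dict does not (it lacks `index`, `iloc`, and so on). Static type checkers would still be needed to catch signature mismatches.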

To me the concrete steps would be

  • Refactor the current Dataset2D class to inherit from a runtime-checkable Protocol, then replace all uses of pd.DataFrame or Dataset2D throughout the codebase with that Protocol. This will likely entail adding a new method to the Protocol to handle anndata.concat, i.e., removing the custom concat code for these two types and adding whatever is needed to perform the concat operation to the new Protocol. For pd.DataFrame specifically, it might make sense to leave the pandas-specific code in place instead of trying to wrap all incoming pd.DataFrame objects
  • Move the xarray/Dataset dependency out of anndata into its own package (optional, likely a later step after the others are completed)
  • Create new test cases that use both the Protocol and an actual pandas.DataFrame object to test the functionality in the absence of Dataset2D, i.e., unit tests of a class that inherits the Protocol and wraps a pandas.DataFrame, and therefore definitely won't go through any remaining pandas-specific code in the codebase
  • Create other readers/adapters for other dataframe-like libraries (dask.DataFrame, polars, cudf etc.) and their own read functions (likely in other repos, or one big new repo)
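The first and third steps could be sketched roughly as follows: a Protocol with a concat hook, plus a test double that wraps a pandas.DataFrame without being one. The names `ConcatableFrame`, `concat_with` and `PandasBacked` are hypothetical, and the real Protocol would presumably carry more members than shown here.

```python
from __future__ import annotations

from typing import Any, Protocol, Sequence, runtime_checkable

import pandas as pd


@runtime_checkable
class ConcatableFrame(Protocol):
    """Hypothetical protocol with a concat hook (all names are assumptions)."""

    @property
    def shape(self) -> tuple[int, int]: ...

    def reindex(self, index: Any) -> ConcatableFrame: ...

    def concat_with(self, others: Sequence[ConcatableFrame]) -> ConcatableFrame: ...


class PandasBacked:
    """Test double: wraps a DataFrame but is deliberately not a DataFrame subclass,
    so isinstance(obj, pd.DataFrame) branches in the codebase are never taken."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    @property
    def shape(self) -> tuple[int, int]:
        return self._df.shape

    def reindex(self, index: Any) -> PandasBacked:
        return PandasBacked(self._df.reindex(index))

    def concat_with(self, others: Sequence[PandasBacked]) -> PandasBacked:
        # Row-wise concatenation, delegated to pandas inside the adapter.
        frames = [self._df, *(other._df for other in others)]
        return PandasBacked(pd.concat(frames, axis=0))


a = PandasBacked(pd.DataFrame({"n_genes": [10, 20]}, index=["c0", "c1"]))
b = PandasBacked(pd.DataFrame({"n_genes": [30]}, index=["c2"]))
merged = a.concat_with([b])
```

Putting concat on the Protocol keeps the library-specific logic inside each adapter, so a generic concat routine would only need to call the hook. Whether it should be an instance method, a classmethod, or a free function registered per type is an open design choice.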
