Please describe your wishes and possible alternatives to achieve the desired result.
Our `pandas.DataFrame` API usage is relatively small: we use only a few built-in methods, such as `DataFrame.iloc` and `DataFrame.reindex`. That usage is now clearly "documented" by the adapter class `Dataset2D`, which is meant to cover all of those use cases (with the exception of `concat`) so that `Dataset2D` works internally wherever things like `my_df.iloc` are called. In theory, anything that satisfies the `Protocol` which `Dataset2D` would define (made generic over the underlying type, whether that is `Dataset`, `DataFrame`, Polars, etc.) should work as a `DataFrame`-like class in `anndata` along `obs` and `var`, in `obsm`, etc.:

    class Dataset2D:
        r"""
        Bases :class:`~collections.abc.Mapping`\ [:class:`~collections.abc.Hashable`, :class:`~xarray.DataArray` | :class:`~anndata.experimental.backed.Dataset2D`\ ]

        A wrapper class meant to enable working with lazy dataframe data according to
        :class:`~anndata.AnnData`'s internal API. This class ensures that "dataframe-invariants"
        are respected, namely that there is only one 1d dim and coord with the same name i.e.,
        like a :class:`pandas.DataFrame`.

        You should not have to initiate this class yourself. Setting an :class:`xarray.Dataset`
        into a relevant part of the :class:`~anndata.AnnData` object will attempt to wrap that
        object in this object, trying to enforce the "dataframe-invariants."

        Because xarray requires :attr:`xarray.Dataset.coords` to be in-memory, this class provides
        handling for an out-of-memory index via :attr:`~anndata.experimental.backed.Dataset2D.true_index`.
        This feature is helpful for loading remote data faster where the index itself may not be initially useful
        for constructing the object e.g., cell ids.
        """
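To make the idea concrete, here is a minimal sketch of what such a runtime-checkable `Protocol` could look like. Everything in it is an assumption for illustration: the name `DataFrameLike`, the choice of members (`shape`, `columns`, `iloc`, `reindex`), and the toy `ColumnTable` backend are hypothetical and are not part of `anndata`. The point is only that a class can satisfy the protocol structurally, without inheriting from it.

```python
from __future__ import annotations

from collections.abc import Sequence
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DataFrameLike(Protocol):
    """Hypothetical protocol; the member list is an assumption for illustration."""

    @property
    def shape(self) -> tuple[int, int]: ...

    @property
    def columns(self) -> Sequence[str]: ...

    @property
    def iloc(self) -> Any: ...

    def reindex(self, index: Sequence[Any]) -> DataFrameLike: ...


class _RowIndexer:
    """Positional row indexer, loosely mimicking ``DataFrame.iloc`` for slices."""

    def __init__(self, table: ColumnTable) -> None:
        self._table = table

    def __getitem__(self, rows: slice) -> ColumnTable:
        t = self._table
        return ColumnTable(t.index[rows], {k: v[rows] for k, v in t.data.items()})


class ColumnTable:
    """Toy backend that satisfies DataFrameLike structurally, without inheriting."""

    def __init__(self, index: Sequence[Any], data: dict[str, Sequence[Any]]) -> None:
        self.index = list(index)
        self.data = {name: list(values) for name, values in data.items()}

    @property
    def shape(self) -> tuple[int, int]:
        return (len(self.index), len(self.data))

    @property
    def columns(self) -> list[str]:
        return list(self.data)

    @property
    def iloc(self) -> _RowIndexer:
        return _RowIndexer(self)

    def reindex(self, index: Sequence[Any]) -> ColumnTable:
        # Labels missing from the current index are filled with None.
        pos = {label: i for i, label in enumerate(self.index)}
        data = {
            name: [values[pos[label]] if label in pos else None for label in index]
            for name, values in self.data.items()
        }
        return ColumnTable(index, data)


table = ColumnTable(["c1", "c2", "c3"], {"n_genes": [10, 20, 30]})
print(isinstance(table, DataFrameLike))              # structural check passes
print(isinstance({"n_genes": [10]}, DataFrameLike))  # a plain dict does not qualify
```

Note that a runtime `isinstance` check against a protocol only verifies that the members exist, not that their signatures or return types match, so static type checking and unit tests would still be needed on top of it.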
|
The `Dataset2D` docstring quoted above is from `anndata/src/anndata/_core/xarray.py`, lines 33 to 51 in 401e0d1.

To me the concrete steps would be:

1. Have the `Dataset2D` class inherit from a runtime-checkable `Protocol`, then replace all instances of `pd.DataFrame` or `Dataset2D` throughout the codebase with said `Protocol`. This will likely entail adding a new method to the `Protocol` to handle `anndata.concat`, i.e., removing the custom `concat` code for these two types and adding whatever is needed to perform the `concat` operation to the new `Protocol`. For `pd.DataFrame` specifically, it might make sense to leave the `pandas`-specific code in place instead of trying to wrap all incoming `pd.DataFrame` objects.
2. Move the `xarray`/`Dataset` dependency out of `anndata` into its own package (optional, likely a later step after the others are completed).
3. Use the `Protocol` and an actual `pandas.DataFrame` object to test the functionality in the absence of `Dataset2D`, i.e., unit tests of a class inheriting the `Protocol` that wraps a `pandas.DataFrame` and therefore definitely won't go through any remaining `pandas` code in the codebase.
4. Add implementations for other backends (`dask.DataFrame`, `polars`, `cudf`, etc.) and their own read functions (likely other repos, or one big new repo).
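
The testing idea in the steps above (a class that inherits the `Protocol` and wraps a `pandas.DataFrame`) could be sketched roughly as follows. All names here are hypothetical: `ConcatableFrame`, `concat_with`, `PandasFrame`, `to_pandas`, and `concat_frames` are invented for illustration and do not exist in `anndata`. The sketch assumes `concat` is handled by a method on the protocol, which is only one possible design.

```python
from __future__ import annotations

from collections.abc import Sequence
from typing import Any, Protocol, runtime_checkable

import pandas as pd


@runtime_checkable
class ConcatableFrame(Protocol):
    """Hypothetical protocol; ``concat_with`` is an invented name, not anndata API."""

    @property
    def shape(self) -> tuple[int, int]: ...

    @property
    def iloc(self) -> Any: ...

    def reindex(self, index: Sequence[Any]) -> ConcatableFrame: ...

    def concat_with(self, others: Sequence[Any]) -> ConcatableFrame: ...


class _WrappedILoc:
    """Forwards positional indexing (slices or lists) and re-wraps the result."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def __getitem__(self, key: slice | list[int]) -> PandasFrame:
        return PandasFrame(self._df.iloc[key])


class PandasFrame(ConcatableFrame):
    """Test double: wraps a DataFrame but is deliberately not a DataFrame subclass."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    @property
    def shape(self) -> tuple[int, int]:
        return self._df.shape

    @property
    def iloc(self) -> _WrappedILoc:
        return _WrappedILoc(self._df)

    def reindex(self, index: Sequence[Any]) -> PandasFrame:
        return PandasFrame(self._df.reindex(index=list(index)))

    def concat_with(self, others: Sequence[PandasFrame]) -> PandasFrame:
        # Row-wise concatenation of the wrapped frames.
        return PandasFrame(pd.concat([self._df, *(o._df for o in others)]))

    def to_pandas(self) -> pd.DataFrame:
        return self._df


def concat_frames(frames: Sequence[ConcatableFrame]) -> ConcatableFrame:
    """Backend-agnostic entry point: delegates to the first frame's own method."""
    first, *rest = frames
    return first.concat_with(rest)


a = PandasFrame(pd.DataFrame({"x": [1, 2]}, index=["c1", "c2"]))
b = PandasFrame(pd.DataFrame({"x": [3]}, index=["c3"]))
combined = concat_frames([a, b])
print(combined.to_pandas())
```

Because `PandasFrame` is not a `pd.DataFrame`, any remaining `isinstance(x, pd.DataFrame)` branch in the codebase would be skipped for it, which is what makes this kind of wrapper useful as a test of the protocol path.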