`DataFrame` API for `obs` and `var` keys via runtime-checkable `Protocol` #2043

### Please describe your wishes and possible alternatives to achieve the desired result.


Our `pandas.DataFrame` API usage is relatively small: we use only a few built-in methods, such as `DataFrame.iloc` and `DataFrame.reindex`. That usage is now effectively "documented" by the adapter class `Dataset2D`, which is meant to cover all of those use cases (with the exception of `concat`) so that `Dataset2D` works internally wherever things like `my_df.iloc` are called. In theory, anything that satisfies the `Protocol` that `Dataset2D` would define (made generic over the underlying type, whether that is an `xarray.Dataset`, a `pandas.DataFrame`, a `polars.DataFrame`, etc.) should work as a `DataFrame`-like class in `anndata` for `obs` and `var`, in `obsm`, etc.:

https://github.com/scverse/anndata/blob/401e0d140b204bfdd1b46b45a3557ad3775f55d7/src/anndata/_core/xarray.py#L33-L51
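
As a rough illustration of the idea, a runtime-checkable `Protocol` could look something like the sketch below. The name `DataFrameLike` and its member list are assumptions made for this example, not the actual surface of `Dataset2D`; the real `Protocol` would be derived from whatever `anndata` calls internally.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable

import pandas as pd


@runtime_checkable
class DataFrameLike(Protocol):
    """Hypothetical protocol; the members listed here are illustrative only."""

    @property
    def index(self) -> Any: ...

    @property
    def columns(self) -> Any: ...

    @property
    def shape(self) -> tuple[int, int]: ...

    @property
    def iloc(self) -> Any: ...

    def reindex(self, *args: Any, **kwargs: Any) -> Any: ...


# A plain pandas.DataFrame satisfies the protocol structurally,
# without inheriting from it.
df = pd.DataFrame({"cell_type": ["B", "T"]}, index=["c0", "c1"])
df_is_frame_like = isinstance(df, DataFrameLike)

# A dict has none of the required members, so the check fails.
dict_is_frame_like = isinstance({"cell_type": ["B", "T"]}, DataFrameLike)
```

Note that a runtime `isinstance` check against a `Protocol` only verifies that the members exist, not that their signatures or return types match, so static type checking and tests would still be needed on top of it.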

To me, the concrete steps would be:

- [ ] Refactor the current `Dataset2D` class to inherit from a runtime-checkable `Protocol`, then replace every use of `pd.DataFrame` or `Dataset2D` throughout the codebase with that `Protocol`. This will likely entail adding a new method to the `Protocol` to handle `anndata.concat`, i.e., removing the custom `concat` code for these two types and adding whatever the `concat` operation needs to the new `Protocol`. For `pd.DataFrame` specifically, it might make sense to leave the `pandas`-specific code in place instead of trying to wrap all incoming `pd.DataFrame` objects.
- [ ] Move the `xarray`/`Dataset` dependency out of `anndata` into its own package (optional, likely a later step after the others are completed).
- [ ] Create new test cases that use both the `Protocol` and an actual `pandas.DataFrame` object to test the functionality in the absence of `Dataset2D`, i.e., unit tests of a class that inherits from the `Protocol` and wraps a `pandas.DataFrame`, and therefore _definitely_ won't go through any remaining `pandas` code in the codebase.
- [ ] Create readers/adapters for other dataframe-like libraries (`dask.DataFrame`, `polars`, `cudf`, etc.), each with its own read functions (likely in other repos, or one big new repo).
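
To make the first and third steps more concrete, here is a minimal sketch of a test-only adapter that hides a `pandas.DataFrame` behind a `Protocol` with a `concat` hook. Everything named here (`DataFrameLike`, `WrappedFrame`, `_ILocIndexer`, `head_rows`, and the `concat` classmethod signature) is hypothetical and only meant to show the shape such a test class could take.

```python
from __future__ import annotations

from typing import Any, Protocol, Sequence, runtime_checkable

import pandas as pd


@runtime_checkable
class DataFrameLike(Protocol):
    """Hypothetical protocol with a concat hook; members are illustrative."""

    @property
    def shape(self) -> tuple[int, int]: ...

    @property
    def iloc(self) -> Any: ...

    def reindex(self, index: Any) -> Any: ...

    @classmethod
    def concat(cls, frames: Sequence[Any]) -> Any: ...


class _ILocIndexer:
    """Positional indexer that re-wraps results (row slices/lists only)."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def __getitem__(self, key: Any) -> WrappedFrame:
        return WrappedFrame(self._df.iloc[key])


class WrappedFrame:
    """Hypothetical adapter: not a pandas.DataFrame, but protocol-conformant."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    @property
    def shape(self) -> tuple[int, int]:
        return self._df.shape

    @property
    def iloc(self) -> _ILocIndexer:
        return _ILocIndexer(self._df)

    def reindex(self, index: Any) -> WrappedFrame:
        return WrappedFrame(self._df.reindex(index))

    @classmethod
    def concat(cls, frames: Sequence[WrappedFrame]) -> WrappedFrame:
        # Row-wise concatenation delegated to the wrapped library.
        return cls(pd.concat([f._df for f in frames]))


def head_rows(frame: DataFrameLike, n: int) -> Any:
    """Stand-in for internal code that only relies on the protocol."""
    if not isinstance(frame, DataFrameLike):
        raise TypeError("expected a DataFrameLike object")
    return frame.iloc[:n]


a = WrappedFrame(pd.DataFrame({"x": [1, 2]}, index=["c0", "c1"]))
b = WrappedFrame(pd.DataFrame({"x": [3]}, index=["c2"]))
merged = WrappedFrame.concat([a, b])
```

Because `WrappedFrame` is not a `pd.DataFrame` subclass, any code path that accepts it cannot be relying on an `isinstance(..., pd.DataFrame)` branch. In this sketch a bare `pd.DataFrame` does not satisfy the protocol (it has no `concat` member), which is one reason keeping the `pandas`-specific code path may be simpler than wrapping every incoming frame.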

	class Dataset2D:
	    r"""
	    Bases :class:`~collections.abc.Mapping`\ [:class:`~collections.abc.Hashable`, :class:`~xarray.DataArray` \| :class:`~anndata.experimental.backed.Dataset2D`\ ]

	    A wrapper class meant to enable working with lazy dataframe data according to
	    :class:`~anndata.AnnData`'s internal API. This class ensures that "dataframe-invariants"
	    are respected, namely that there is only one 1d dim and coord with the same name i.e.,
	    like a :class:`pandas.DataFrame`.

	    You should not have to initiate this class yourself. Setting an :class:`xarray.Dataset`
	    into a relevant part of the :class:`~anndata.AnnData` object will attempt to wrap that
	    object in this object, trying to enforce the "dataframe-invariants."

	    Because xarray requires :attr:`xarray.Dataset.coords` to be in-memory, this class provides
	    handling for an out-of-memory index via :attr:`~anndata.experimental.backed.Dataset2D.true_index`.
	    This feature is helpful for loading remote data faster where the index itself may not be initially useful
	    for constructing the object e.g., cell ids.
	    """
