Anndata and zarr #2145

Description

@Amitbergman

Question

Hey all,

I have a few anndata datasets with sparse CSR X matrices (each has ~10M cells and 40K genes, with a sparsity of about 5%).

I want to be able to quickly load whole rows from these datasets (for example, given a query, load all rows that match a condition on the obs table).
Currently I convert each anndata object to TileDB, but I recently came across the Zarr file format, and specifically the Zarr v3 support in anndata.
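
To make the access pattern concrete, here is a minimal sketch of what I mean by loading whole rows based on an obs condition. It uses plain Python lists in place of the real on-disk arrays, and the toy matrix, the `cell_type` column, and the helper name `gather_csr_rows` are made up for illustration. The point is only that in CSR each selected row is one contiguous slice of `data` and `indices`, located through `indptr`.

```python
def gather_csr_rows(data, indices, indptr, rows):
    """Collect the selected rows of a CSR matrix into a new CSR triple."""
    out_data, out_indices, out_indptr = [], [], [0]
    for r in rows:
        # Row r occupies the half-open range [indptr[r], indptr[r + 1]).
        start, stop = indptr[r], indptr[r + 1]
        out_data.extend(data[start:stop])
        out_indices.extend(indices[start:stop])
        out_indptr.append(len(out_data))
    return out_data, out_indices, out_indptr


# Toy 4 cells x 5 genes matrix (illustrative values only):
# row 0: [1, 0, 0, 2, 0]
# row 1: [0, 0, 0, 0, 0]
# row 2: [0, 3, 0, 0, 4]
# row 3: [5, 0, 0, 0, 0]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
indices = [0, 3, 1, 4, 0]
indptr = [0, 2, 2, 4, 5]

# Hypothetical obs column; the "query" is a condition on it.
cell_type = ["T", "B", "T", "B"]
rows = [i for i, label in enumerate(cell_type) if label == "T"]

sub_data, sub_indices, sub_indptr = gather_csr_rows(data, indices, indptr, rows)
print(rows, sub_data, sub_indices, sub_indptr)
```

In the real datasets the three arrays would live on disk and the row list would come from filtering the obs table, so the cost is dominated by how many separate slices have to be read.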

I have a few questions regarding Zarr:

  1. Would Zarr v3 be a good fit for our use case? Should I expect an improvement over TileDB?
  2. Are there any guidelines on which codec and chunk sizes to use?
  3. Are there any guidelines on how to benefit from concurrency? I see dask being used together with Zarr in many places.
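
For questions 2 and 3, this is roughly how I am thinking about the trade-off, as a sketch under assumptions rather than a claim about how anndata or Zarr actually schedule reads. I assume the CSR `data` array is split into fixed-length 1-D chunks and that each chunk is fetched and decoded as a unit. `chunks_touched` and `fetch_chunk` are hypothetical helpers, and `fetch_chunk` is only a stand-in for an I/O-bound read.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks_touched(indptr, rows, chunk_len):
    """Ids of the 1-D chunks that the selected CSR rows overlap."""
    touched = set()
    for r in rows:
        start, stop = indptr[r], indptr[r + 1]
        if stop > start:  # empty rows touch no chunk
            first = start // chunk_len
            last = (stop - 1) // chunk_len
            touched.update(range(first, last + 1))
    return sorted(touched)


def fetch_chunk(chunk_id):
    # Hypothetical stand-in for reading and decompressing one chunk.
    return chunk_id, f"bytes-of-chunk-{chunk_id}"


indptr = [0, 2, 2, 4, 5]  # same toy 4 x 5 matrix layout as a CSR indptr
rows = [0, 2]             # rows selected by some obs condition

for chunk_len in (2, 4):
    ids = chunks_touched(indptr, rows, chunk_len)
    # Independent chunk reads can be issued concurrently.
    with ThreadPoolExecutor(max_workers=4) as pool:
        fetched = dict(pool.map(fetch_chunk, ids))
    print(f"chunk_len={chunk_len}: chunks {ids}, fetched {len(fetched)}")
```

Smaller chunks mean more independent requests that can run in parallel, while larger chunks mean fewer requests but more unrelated data read per selected row. I would like to know where the sweet spot is for matrices of this size.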

Thanks!