Feature Request: Apply reasonable default chunking to 1D arrays in obs/var
Summary
When writing to zarr, 1D arrays in obs/var use zarr's default auto-chunking, which targets ~500KB chunks. Arrays smaller than this threshold remain as single chunks. For typical single-cell data sizes (10k-100k elements = 40-400KB for float32), this results in single-chunk arrays that prevent efficient partial reads—problematic for lazy/remote access patterns.
Why 1D arrays need smaller chunks than 2D arrays: Row subsetting (cells in obs, genes in var) behaves very differently for 2D vs 1D storage. With the same ~500KB chunk target:
Reading 1000 rows from 10,000 rows × 5 columns:
Case 1: 2D array, ~500KB chunks → (2500 rows × 5 cols) per chunk
┌─────────────────────────────┐
│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 1 chunk loaded (2500 × 5)
│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ = 50KB for 1000 rows
├─────────────────────────────┤
│ │
│ │
├─────────────────────────────┤
│ │
│ │
├─────────────────────────────┤
│ │
└─────────────────────────────┘
col1 col2 col3 col4 col5
Case 2: 5× 1D arrays, ~500KB chunks → 10,000 rows per chunk (under threshold)
┌─────┬─────┬─────┬─────┬─────┐
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│ ← 5 chunks loaded (entire columns!)
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│ = 200KB for 1000 rows (4x more)
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
└─────┴─────┴─────┴─────┴─────┘
col1 col2 col3 col4 col5
Case 3: 5× 1D arrays, 1000-row chunks → 1000 rows per chunk
┌─────┬─────┬─────┬─────┬─────┐
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│ ← 5 chunks loaded (just needed rows)
├─────┼─────┼─────┼─────┼─────┤ = 20KB for 1000 rows (10x less than Case 2)
│ │ │ │ │ │
├─────┼─────┼─────┼─────┼─────┤
│ │ │ │ │ │
├─────┼─────┼─────┼─────┼─────┤
│ │ │ │ │ │
├─────┼─────┼─────┼─────┼─────┤
│ │ │ │ │ │
└─────┴─────┴─────┴─────┴─────┘
col1 col2 col3 col4 col5
░ = data loaded
The ~500KB byte-size target works for 2D arrays because chunks span multiple columns. For 1D arrays, each column is chunked independently, so the same byte budget creates tall, narrow chunks that load far more rows than needed for typical subsets.
Current Behavior
The chunking logic in write_zarr() only applies explicit chunking to the X matrix:
|
def callback( |
|
write_func, store, elem_name: str, elem, *, dataset_kwargs, iospec |
|
) -> None: |
|
if ( |
|
chunks is not None |
|
and not isinstance(elem, sparse.spmatrix) |
|
and elem_name.lstrip("/") == "X" |
|
): |
|
dataset_kwargs = dict(dataset_kwargs, chunks=chunks) |
|
write_func(store, elem_name, elem, dataset_kwargs=dataset_kwargs) |
All other elements (obs, var, obsm, varm, etc.) rely on zarr's auto-chunking, which targets ~500KB chunks:
# Zarr 1D auto-chunking (float32 = 4 bytes):
n=10,000 (39KB) -> single chunk # under threshold
n=50,000 (195KB) -> single chunk # under threshold
n=100,000 (391KB) -> 2 chunks # crosses threshold
n=1,000,000 (3.9MB) -> 8 chunks
# Variable-length strings: can't estimate size, stays whole
n=100,000 strings -> single chunk
Zarr's auto-chunking works, but the ~500KB threshold is too high for typical obs/var sizes. This is zarr's intended behavior optimizing for storage efficiency (zarr-python#270).
Impact
Benchmark: Realistic obs with categorical columns
Real obs/var DataFrames contain mostly categorical columns (cell types, sample IDs, etc.). With zstd compression:
Dataset: 500,000 cells × 50 columns (5 float, 45 categorical)
Categorical cardinalities: 10×20 cats, 20×1000 cats, 15×50000 cats
Platform: Darwin 24.6.0, arm64 (Apple Silicon), local SSD
Cells Single 10k 1k 10k speedup 1k speedup
--------------------------------------------------------------------------------
100 1384 ms 167 ms 259 ms 8.3x 5.3x
1,000 1414 ms 164 ms 272 ms 8.6x 5.2x
10,000 1387 ms 162 ms 338 ms 8.6x 4.1x
50,000 1397 ms 195 ms 710 ms 7.2x 2.0x
500,000 1394 ms 634 ms 4964 ms 2.2x 0.3x (4x slower)
10k chunks provide 8x speedup for partial reads and 2x speedup even for full reads (due to better string decompression). 1k chunks have severe full-read penalty.
Remote storage (NFS and S3)
Remote storage introduces per-request overhead that changes the tradeoffs. Chunking still helps partial reads, but benefits are reduced compared to local storage:
Same dataset, Linux 4.15.0 x86_64, NFS filesystem
Cells Single 10k 1k 10k speedup 1k speedup
--------------------------------------------------------------------------------
100 3564 ms 553 ms 994 ms 6.4x 3.6x
1,000 3518 ms 566 ms 926 ms 6.2x 3.8x
10,000 3495 ms 540 ms 1230 ms 6.5x 2.8x
50,000 3469 ms 670 ms 2558 ms 5.2x 1.4x
500,000 3530 ms 2266 ms 18039 ms 1.6x 0.2x (5x slower)
Same dataset, Linux 4.15.0 x86_64, AWS S3 (us-west-2)
Network: 3.2 Gbps down / 1.2 Gbps up, 41ms ping
Cells Single 10k 1k 10k speedup 1k speedup
--------------------------------------------------------------------------------
100 13374 ms 9078 ms 13438 ms 1.5x 1.0x
1,000 12743 ms 9096 ms 11567 ms 1.4x 1.1x
10,000 12080 ms 8890 ms 13108 ms 1.4x 0.9x
50,000 13330 ms 9904 ms 22491 ms 1.3x 0.6x
500,000 12981 ms 20044 ms 115582 ms 0.6x 0.1x (9x slower)
Same dataset, Darwin 24.6.0, arm64 (Apple M3 Max), AWS S3 (us-west-2)
Network: 103 Mbps down / 87 Mbps up, 23ms ping (home internet)
Cells Single 10k 1k 10k speedup 1k speedup
--------------------------------------------------------------------------------
100 27134 ms 15582 ms 19274 ms 1.7x 1.4x
1,000 21627 ms 14532 ms 19051 ms 1.5x 1.1x
10,000 26384 ms 14243 ms 19760 ms 1.9x 1.3x
50,000 22060 ms 15549 ms 34257 ms 1.4x 0.6x (2x slower)
500,000 22608 ms 31140 ms 194768 ms 0.7x (1.4x slower) 0.1x (9x slower)
Key observations:
- Partial reads: 10k chunks provide 6x speedup on NFS, 1.4-1.9x on S3 (reduced vs 8x local due to per-request overhead)
- Full reads: 10k chunks are 1.6x faster on NFS (string decompression benefit), but 1.4-1.5x slower on S3 (request overhead dominates)
- 1k chunks: Catastrophic everywhere—5x slower on NFS, 9x slower on S3 due to thousands of requests
- Single chunks: Baseline time is dominated by network latency (~3.5s NFS, ~13-27s S3 depending on connection) regardless of subset size
- Home internet vs datacenter: Similar patterns observed on slower home connection (103 Mbps) vs datacenter (3.2 Gbps) - the fundamental tradeoffs hold
10k chunks provide modest improvements for partial reads on S3, but the full-read penalty (1.4x slower) makes it a net negative for typical access patterns where full obs/var reads dominate.
Caveat: Purely numerical data
For obs/var with only numeric columns (uncommon), chunking has a full-read penalty:
Dataset: 500,000 cells × 50 columns (all float32)
Cells Single 10k 1k 10k speedup 1k speedup
--------------------------------------------------------------------------------
100 128 ms 39 ms 34 ms 3.3x 3.8x
1,000 119 ms 37 ms 34 ms 3.2x 3.5x
10,000 120 ms 37 ms 114 ms 3.3x 1.1x
50,000 119 ms 71 ms 482 ms 1.7x 0.2x
500,000 130 ms 575 ms 4737 ms 0.2x 0.03x (36x slower)
For purely numeric data, 10k chunks still provide 3x speedup for partial reads but 5x slowdown for full reads.
Benchmark code
import pandas as pd
import numpy as np
import time
import os
import zarr
from zarr.codecs import ZstdCodec
import anndata as ad
from anndata._io.specs import write_elem
tmpdir = "benchmark_chunking"
os.makedirs(tmpdir, exist_ok=True)
n_obs = 500_000
compressors = [ZstdCodec(level=3)]
# Realistic categorical data
obs_dict = {}
for i in range(5):
obs_dict[f'float_{i}'] = np.random.randn(n_obs).astype(np.float32)
for i in range(45):
n_cats = 20 if i < 10 else (1000 if i < 30 else 50000)
categories = [f'cat_{i}_{j}' for j in range(n_cats)]
obs_dict[f'cat_{i}'] = pd.Categorical(np.random.choice(categories, n_obs))
obs_data = pd.DataFrame(obs_dict)
adata = ad.AnnData(X=np.zeros((n_obs, 10), dtype=np.float32), obs=obs_data)
for chunk_name, chunk_size in [('single', n_obs), ('10k', 10000), ('1k', 1000)]:
z = zarr.open(f"{tmpdir}/{chunk_name}.zarr", mode='w')
write_elem(z, 'obs', adata.obs, dataset_kwargs={
'chunks': (chunk_size,), 'compressors': compressors
})
columns = list(obs_data.columns)
for n_cells in [100, 1000, 10000, 50000, 500000]:
for name in ['single', '10k', '1k']:
times = []
for _ in range(5):
z = zarr.open(f"{tmpdir}/{name}.zarr", mode='r')
start = time.perf_counter()
for col in columns:
obs_col = z[f'obs/{col}']
if hasattr(obs_col, 'shape'):
_ = obs_col[:n_cells]
else:
_ = obs_col['codes'][:n_cells]
_ = obs_col['categories'][:]
times.append(time.perf_counter() - start)
print(f"{name} {n_cells}: {np.median(times)*1000:.1f}ms")
Example: Categorical metadata in perturbation screens
PR #2288 introduced LazyCategoricalDtype with head_categories()/tail_categories() for efficient partial reads. Benchmarks with 100k categories show the chunking limitation:
| Method |
H5AD |
Zarr |
head_categories(10) |
0.19 ms |
8.82 ms |
| Full load |
30.32 ms |
19.19 ms |
| Speedup |
160x |
2.2x |
The zarr speedup is only 2.2x (vs 160x for H5AD) because it must read the entire single chunk.
Common sources of large 1D arrays
The default strings_to_categoricals() converts string columns where n_unique < n_obs, commonly producing large arrays from:
- Gene symbols in
.var (~30k unique)
- Guide IDs in perturbation screens (50k-200k)
- Sample/batch/donor IDs in large atlases
- Cell barcodes stored as metadata
Proposed Solution
Apply default chunking to 1D arrays in obs/var, with chunk size based on access patterns rather than byte size:
def callback(write_func, store, elem_name, elem, *, dataset_kwargs, iospec):
# Existing X chunking logic...
# Apply default chunking to 1D obs/var arrays
if (
"chunks" not in dataset_kwargs
and hasattr(elem, "shape")
and len(elem.shape) == 1
and elem.shape[0] > 10000
and (elem_name.startswith("/obs") or elem_name.startswith("/var"))
):
dataset_kwargs = dict(dataset_kwargs, chunks=(10000,))
write_func(store, elem_name, elem, dataset_kwargs=dataset_kwargs)
Why 10,000 elements?
The chunk size should align with typical subset sizes, not byte size:
- Typical cell subsets: 1,000–10,000 cells
- 10k balances partial read efficiency with chunk access overhead
- Variable-length strings (categorical
categories) decompress more efficiently with smaller chunks
A byte-based target doesn't work well here because:
- Many small 1D arrays each get the full budget independently
- No coordination across co-indexed arrays
- Result: every column is a single chunk, defeating partial reads
Element-count chunking ensures all obs columns (and var columns) are chunked consistently, enabling efficient slicing across the entire dataframe.
Considerations
-
Backward compatibility: Changing chunking doesn't break reading—zarr handles any chunk layout transparently.
-
Chunk size tradeoffs:
- Smaller chunks = better partial reads, more metadata overhead
- Larger chunks = worse partial reads, less overhead
- 1000 elements is a reasonable balance
-
Compression: Each chunk is compressed independently. More chunks = slightly worse compression ratio but enables decompressing only needed data.
-
Not a zarr bug: Zarr's ~500KB threshold is reasonable for general use. Anndata's access patterns (lazy loading, partial reads of small arrays) benefit from smaller chunks.
Related
- PR #2288:
LazyCategoricalDtype exposes the performance limitation
- zarr-python#270: Discussion of chunk size configuration in zarr
Edit: After gathering benchmarks on remote storage (NFS and S3), I'm reconsidering this proposal. The lazy nature of anndata primarily involves loading specific sections (e.g., just obs or just X) rather than subsetting rows. Full reads of obs/var are the common case, and the benchmarks show:
- Local SSD: 10k chunks help even for full reads (2x faster due to string decompression)
- NFS: Modest benefit for full reads (1.6x faster)
- S3: 10k chunks are actually 1.5x slower for full reads due to request overhead
Given that remote/cloud storage is increasingly common and full reads are the dominant access pattern, the tradeoffs are less favorable than initially thought. The benefit is mostly limited to local storage with categorical-heavy data. I'm closing this issue as the change may not be a net improvement across typical use cases.
Feature Request: Apply reasonable default chunking to 1D arrays in obs/var
Summary
When writing to zarr, 1D arrays in obs/var use zarr's default auto-chunking, which targets ~500KB chunks. Arrays smaller than this threshold remain as single chunks. For typical single-cell data sizes (10k-100k elements = 40-400KB for float32), this results in single-chunk arrays that prevent efficient partial reads—problematic for lazy/remote access patterns.
Why 1D arrays need smaller chunks than 2D arrays: Row subsetting (cells in obs, genes in var) behaves very differently for 2D vs 1D storage. With the same ~500KB chunk target:
The ~500KB byte-size target works for 2D arrays because chunks span multiple columns. For 1D arrays, each column is chunked independently, so the same byte budget creates tall, narrow chunks that load far more rows than needed for typical subsets.
Current Behavior
The chunking logic in
write_zarr()only applies explicit chunking to theXmatrix:anndata/src/anndata/_io/zarr.py
Lines 46 to 55 in c6f6f54
All other elements (obs, var, obsm, varm, etc.) rely on zarr's auto-chunking, which targets ~500KB chunks:
Zarr's auto-chunking works, but the ~500KB threshold is too high for typical obs/var sizes. This is zarr's intended behavior optimizing for storage efficiency (zarr-python#270).
Impact
Benchmark: Realistic obs with categorical columns
Real obs/var DataFrames contain mostly categorical columns (cell types, sample IDs, etc.). With zstd compression:
10k chunks provide 8x speedup for partial reads and 2x speedup even for full reads (due to better string decompression). 1k chunks have severe full-read penalty.
Remote storage (NFS and S3)
Remote storage introduces per-request overhead that changes the tradeoffs. Chunking still helps partial reads, but benefits are reduced compared to local storage:
Key observations:
10k chunks provide modest improvements for partial reads on S3, but the full-read penalty (1.4x slower) makes it a net negative for typical access patterns where full obs/var reads dominate.
Caveat: Purely numerical data
For obs/var with only numeric columns (uncommon), chunking has a full-read penalty:
For purely numeric data, 10k chunks still provide 3x speedup for partial reads but 5x slowdown for full reads.
Benchmark code
Example: Categorical metadata in perturbation screens
PR #2288 introduced
LazyCategoricalDtypewithhead_categories()/tail_categories()for efficient partial reads. Benchmarks with 100k categories show the chunking limitation:head_categories(10)The zarr speedup is only 2.2x (vs 160x for H5AD) because it must read the entire single chunk.
Common sources of large 1D arrays
The default
strings_to_categoricals()converts string columns wheren_unique < n_obs, commonly producing large arrays from:.var(~30k unique)Proposed Solution
Apply default chunking to 1D arrays in obs/var, with chunk size based on access patterns rather than byte size:
Why 10,000 elements?
The chunk size should align with typical subset sizes, not byte size:
categories) decompress more efficiently with smaller chunksA byte-based target doesn't work well here because:
Element-count chunking ensures all obs columns (and var columns) are chunked consistently, enabling efficient slicing across the entire dataframe.
Considerations
Backward compatibility: Changing chunking doesn't break reading—zarr handles any chunk layout transparently.
Chunk size tradeoffs:
Compression: Each chunk is compressed independently. More chunks = slightly worse compression ratio but enables decompressing only needed data.
Not a zarr bug: Zarr's ~500KB threshold is reasonable for general use. Anndata's access patterns (lazy loading, partial reads of small arrays) benefit from smaller chunks.
Related
LazyCategoricalDtypeexposes the performance limitationEdit: After gathering benchmarks on remote storage (NFS and S3), I'm reconsidering this proposal. The lazy nature of anndata primarily involves loading specific sections (e.g., just
obsor justX) rather than subsetting rows. Full reads of obs/var are the common case, and the benchmarks show:Given that remote/cloud storage is increasingly common and full reads are the dominant access pattern, the tradeoffs are less favorable than initially thought. The benefit is mostly limited to local storage with categorical-heavy data. I'm closing this issue as the change may not be a net improvement across typical use cases.