Skip to content

feat: Apply reasonable default chunking to 1D arrays in obs/var #2295

Description

@katosh

Feature Request: Apply reasonable default chunking to 1D arrays in obs/var

Summary

When writing to zarr, 1D arrays in obs/var use zarr's default auto-chunking, which targets ~500KB chunks. Arrays smaller than this threshold remain as single chunks. For typical single-cell data sizes (10k-100k elements = 40-400KB for float32), this results in single-chunk arrays that prevent efficient partial reads—problematic for lazy/remote access patterns.

Why 1D arrays need smaller chunks than 2D arrays: Row subsetting (cells in obs, genes in var) behaves very differently for 2D vs 1D storage. With the same ~500KB chunk target:

Reading 1000 rows from 10,000 rows × 5 columns:

Case 1: 2D array, ~500KB chunks → (2500 rows × 5 cols) per chunk
┌─────────────────────────────┐
│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 1 chunk loaded (2500 × 5)
│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│   = 50KB for 1000 rows
├─────────────────────────────┤
│                             │
│                             │
├─────────────────────────────┤
│                             │
│                             │
├─────────────────────────────┤
│                             │
└─────────────────────────────┘
  col1 col2 col3 col4 col5

Case 2: 5× 1D arrays, ~500KB chunks → 10,000 rows per chunk (under threshold)
┌─────┬─────┬─────┬─────┬─────┐
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│ ← 5 chunks loaded (entire columns!)
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│   = 200KB for 1000 rows (4x more)
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│
└─────┴─────┴─────┴─────┴─────┘
  col1  col2  col3  col4  col5

Case 3: 5× 1D arrays, 1000-row chunks → 1000 rows per chunk
┌─────┬─────┬─────┬─────┬─────┐
│░░░░░│░░░░░│░░░░░│░░░░░│░░░░░│ ← 5 chunks loaded (just needed rows)
├─────┼─────┼─────┼─────┼─────┤   = 20KB for 1000 rows (10x less than Case 2)
│     │     │     │     │     │
├─────┼─────┼─────┼─────┼─────┤
│     │     │     │     │     │
├─────┼─────┼─────┼─────┼─────┤
│     │     │     │     │     │
├─────┼─────┼─────┼─────┼─────┤
│     │     │     │     │     │
└─────┴─────┴─────┴─────┴─────┘
  col1  col2  col3  col4  col5

░ = data loaded

The ~500KB byte-size target works for 2D arrays because chunks span multiple columns. For 1D arrays, each column is chunked independently, so the same byte budget creates tall, narrow chunks that load far more rows than needed for typical subsets.

Current Behavior

The chunking logic in write_zarr() only applies explicit chunking to the X matrix:

def callback(
write_func, store, elem_name: str, elem, *, dataset_kwargs, iospec
) -> None:
if (
chunks is not None
and not isinstance(elem, sparse.spmatrix)
and elem_name.lstrip("/") == "X"
):
dataset_kwargs = dict(dataset_kwargs, chunks=chunks)
write_func(store, elem_name, elem, dataset_kwargs=dataset_kwargs)

All other elements (obs, var, obsm, varm, etc.) rely on zarr's auto-chunking, which targets ~500KB chunks:

# Zarr 1D auto-chunking (float32 = 4 bytes):
n=10,000   (39KB)  -> single chunk   # under threshold
n=50,000   (195KB) -> single chunk   # under threshold
n=100,000  (391KB) -> 2 chunks       # crosses threshold
n=1,000,000 (3.9MB) -> 8 chunks

# Variable-length strings: can't estimate size, stays whole
n=100,000 strings -> single chunk

Zarr's auto-chunking works, but the ~500KB threshold is too high for typical obs/var sizes. This is zarr's intended behavior optimizing for storage efficiency (zarr-python#270).

Impact

Benchmark: Realistic obs with categorical columns

Real obs/var DataFrames contain mostly categorical columns (cell types, sample IDs, etc.). With zstd compression:

Dataset: 500,000 cells × 50 columns (5 float, 45 categorical)
Categorical cardinalities: 10×20 cats, 20×1000 cats, 15×50000 cats
Platform: Darwin 24.6.0, arm64 (Apple Silicon), local SSD

Cells      Single       10k          1k           10k speedup  1k speedup
--------------------------------------------------------------------------------
100        1384 ms      167 ms       259 ms       8.3x         5.3x
1,000      1414 ms      164 ms       272 ms       8.6x         5.2x
10,000     1387 ms      162 ms       338 ms       8.6x         4.1x
50,000     1397 ms      195 ms       710 ms       7.2x         2.0x
500,000    1394 ms      634 ms       4964 ms      2.2x         0.3x (4x slower)

10k chunks provide 8x speedup for partial reads and 2x speedup even for full reads (due to better string decompression). 1k chunks have severe full-read penalty.

Remote storage (NFS and S3)

Remote storage introduces per-request overhead that changes the tradeoffs. Chunking still helps partial reads, but benefits are reduced compared to local storage:

Same dataset, Linux 4.15.0 x86_64, NFS filesystem

Cells      Single       10k          1k           10k speedup  1k speedup
--------------------------------------------------------------------------------
100        3564 ms      553 ms       994 ms       6.4x         3.6x
1,000      3518 ms      566 ms       926 ms       6.2x         3.8x
10,000     3495 ms      540 ms       1230 ms      6.5x         2.8x
50,000     3469 ms      670 ms       2558 ms      5.2x         1.4x
500,000    3530 ms      2266 ms      18039 ms     1.6x         0.2x (5x slower)
Same dataset, Linux 4.15.0 x86_64, AWS S3 (us-west-2)
Network: 3.2 Gbps down / 1.2 Gbps up, 41ms ping

Cells      Single       10k          1k           10k speedup  1k speedup
--------------------------------------------------------------------------------
100        13374 ms     9078 ms      13438 ms     1.5x         1.0x
1,000      12743 ms     9096 ms      11567 ms     1.4x         1.1x
10,000     12080 ms     8890 ms      13108 ms     1.4x         0.9x
50,000     13330 ms     9904 ms      22491 ms     1.3x         0.6x
500,000    12981 ms     20044 ms     115582 ms    0.6x         0.1x (9x slower)
Same dataset, Darwin 24.6.0, arm64 (Apple M3 Max), AWS S3 (us-west-2)
Network: 103 Mbps down / 87 Mbps up, 23ms ping (home internet)

Cells      Single       10k          1k           10k speedup  1k speedup
--------------------------------------------------------------------------------
100        27134 ms     15582 ms     19274 ms     1.7x         1.4x
1,000      21627 ms     14532 ms     19051 ms     1.5x         1.1x
10,000     26384 ms     14243 ms     19760 ms     1.9x         1.3x
50,000     22060 ms     15549 ms     34257 ms     1.4x         0.6x (2x slower)
500,000    22608 ms     31140 ms     194768 ms    0.7x (1.4x slower)  0.1x (9x slower)

Key observations:

  • Partial reads: 10k chunks provide 6x speedup on NFS, 1.4-1.9x on S3 (reduced vs 8x local due to per-request overhead)
  • Full reads: 10k chunks are 1.6x faster on NFS (string decompression benefit), but 1.4-1.5x slower on S3 (request overhead dominates)
  • 1k chunks: Catastrophic everywhere—5x slower on NFS, 9x slower on S3 due to thousands of requests
  • Single chunks: Baseline time is dominated by network latency (~3.5s NFS, ~13-27s S3 depending on connection) regardless of subset size
  • Home internet vs datacenter: Similar patterns observed on slower home connection (103 Mbps) vs datacenter (3.2 Gbps) - the fundamental tradeoffs hold

10k chunks provide modest improvements for partial reads on S3, but the full-read penalty (1.4x slower) makes it a net negative for typical access patterns where full obs/var reads dominate.

Caveat: Purely numerical data

For obs/var with only numeric columns (uncommon), chunking has a full-read penalty:

Dataset: 500,000 cells × 50 columns (all float32)

Cells      Single       10k          1k           10k speedup  1k speedup
--------------------------------------------------------------------------------
100        128 ms       39 ms        34 ms        3.3x         3.8x
1,000      119 ms       37 ms        34 ms        3.2x         3.5x
10,000     120 ms       37 ms        114 ms       3.3x         1.1x
50,000     119 ms       71 ms        482 ms       1.7x         0.2x
500,000    130 ms       575 ms       4737 ms      0.2x         0.03x (36x slower)

For purely numeric data, 10k chunks still provide 3x speedup for partial reads but 5x slowdown for full reads.

Benchmark code
import pandas as pd
import numpy as np
import time
import os
import zarr
from zarr.codecs import ZstdCodec
import anndata as ad
from anndata._io.specs import write_elem

tmpdir = "benchmark_chunking"
os.makedirs(tmpdir, exist_ok=True)

n_obs = 500_000
compressors = [ZstdCodec(level=3)]

# Realistic categorical data
obs_dict = {}
for i in range(5):
    obs_dict[f'float_{i}'] = np.random.randn(n_obs).astype(np.float32)
for i in range(45):
    n_cats = 20 if i < 10 else (1000 if i < 30 else 50000)
    categories = [f'cat_{i}_{j}' for j in range(n_cats)]
    obs_dict[f'cat_{i}'] = pd.Categorical(np.random.choice(categories, n_obs))
obs_data = pd.DataFrame(obs_dict)
adata = ad.AnnData(X=np.zeros((n_obs, 10), dtype=np.float32), obs=obs_data)

for chunk_name, chunk_size in [('single', n_obs), ('10k', 10000), ('1k', 1000)]:
    z = zarr.open(f"{tmpdir}/{chunk_name}.zarr", mode='w')
    write_elem(z, 'obs', adata.obs, dataset_kwargs={
        'chunks': (chunk_size,), 'compressors': compressors
    })

columns = list(obs_data.columns)
for n_cells in [100, 1000, 10000, 50000, 500000]:
    for name in ['single', '10k', '1k']:
        times = []
        for _ in range(5):
            z = zarr.open(f"{tmpdir}/{name}.zarr", mode='r')
            start = time.perf_counter()
            for col in columns:
                obs_col = z[f'obs/{col}']
                if hasattr(obs_col, 'shape'):
                    _ = obs_col[:n_cells]
                else:
                    _ = obs_col['codes'][:n_cells]
                    _ = obs_col['categories'][:]
            times.append(time.perf_counter() - start)
        print(f"{name} {n_cells}: {np.median(times)*1000:.1f}ms")

Example: Categorical metadata in perturbation screens

PR #2288 introduced LazyCategoricalDtype with head_categories()/tail_categories() for efficient partial reads. Benchmarks with 100k categories show the chunking limitation:

Method H5AD Zarr
head_categories(10) 0.19 ms 8.82 ms
Full load 30.32 ms 19.19 ms
Speedup 160x 2.2x

The zarr speedup is only 2.2x (vs 160x for H5AD) because it must read the entire single chunk.

Common sources of large 1D arrays

The default strings_to_categoricals() converts string columns where n_unique < n_obs, commonly producing large arrays from:

  • Gene symbols in .var (~30k unique)
  • Guide IDs in perturbation screens (50k-200k)
  • Sample/batch/donor IDs in large atlases
  • Cell barcodes stored as metadata

Proposed Solution

Apply default chunking to 1D arrays in obs/var, with chunk size based on access patterns rather than byte size:

def callback(write_func, store, elem_name, elem, *, dataset_kwargs, iospec):
    # Existing X chunking logic...

    # Apply default chunking to 1D obs/var arrays
    if (
        "chunks" not in dataset_kwargs
        and hasattr(elem, "shape")
        and len(elem.shape) == 1
        and elem.shape[0] > 10000
        and (elem_name.startswith("/obs") or elem_name.startswith("/var"))
    ):
        dataset_kwargs = dict(dataset_kwargs, chunks=(10000,))

    write_func(store, elem_name, elem, dataset_kwargs=dataset_kwargs)

Why 10,000 elements?

The chunk size should align with typical subset sizes, not byte size:

  • Typical cell subsets: 1,000–10,000 cells
  • 10k balances partial read efficiency with chunk access overhead
  • Variable-length strings (categorical categories) decompress more efficiently with smaller chunks

A byte-based target doesn't work well here because:

  • Many small 1D arrays each get the full budget independently
  • No coordination across co-indexed arrays
  • Result: every column is a single chunk, defeating partial reads

Element-count chunking ensures all obs columns (and var columns) are chunked consistently, enabling efficient slicing across the entire dataframe.

Considerations

  1. Backward compatibility: Changing chunking doesn't break reading—zarr handles any chunk layout transparently.

  2. Chunk size tradeoffs:

    • Smaller chunks = better partial reads, more metadata overhead
    • Larger chunks = worse partial reads, less overhead
    • 1000 elements is a reasonable balance
  3. Compression: Each chunk is compressed independently. More chunks = slightly worse compression ratio but enables decompressing only needed data.

  4. Not a zarr bug: Zarr's ~500KB threshold is reasonable for general use. Anndata's access patterns (lazy loading, partial reads of small arrays) benefit from smaller chunks.

Related

  • PR #2288: LazyCategoricalDtype exposes the performance limitation
  • zarr-python#270: Discussion of chunk size configuration in zarr

Edit: After gathering benchmarks on remote storage (NFS and S3), I'm reconsidering this proposal. The lazy nature of anndata primarily involves loading specific sections (e.g., just obs or just X) rather than subsetting rows. Full reads of obs/var are the common case, and the benchmarks show:

  • Local SSD: 10k chunks help even for full reads (2x faster due to string decompression)
  • NFS: Modest benefit for full reads (1.6x faster)
  • S3: 10k chunks are actually 1.5x slower for full reads due to request overhead

Given that remote/cloud storage is increasingly common and full reads are the dominant access pattern, the tradeoffs are less favorable than initially thought. The benefit is mostly limited to local storage with categorical-heavy data. I'm closing this issue as the change may not be a net improvement across typical use cases.

Metadata

Metadata

Assignees

No one assigned
    No fields configured for Enhancement.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions