Skip to content

perf: avoid redundant sorting in h5py indexing#2496

Open
tippered1-debug wants to merge 1 commit into
scverse:mainfrom
tippered1-debug:perf-process-index-h5py-unique
Open

perf: avoid redundant sorting in h5py indexing#2496
tippered1-debug wants to merge 1 commit into
scverse:mainfrom
tippered1-debug:perf-process-index-h5py-unique

Conversation

@tippered1-debug

Copy link
Copy Markdown

Summary

Avoid sorting HDF5 indices twice when they are already unique.

Duplicate indices still use the existing sorting and reconstruction path. Tests cover sorted and unsorted unique indices, duplicates, boolean masks, empty indices, and 1D/2D indexing.

Benchmark

Benchmarks were run from benchmarks/benchmarks/h5py_indexing.py. Before and after measurements used the same index arrays in the same process. Results are medians of 25 runs after 5 warmups.

Size Scenario Before After Speedup
10,000 sorted_unique 0.158 ms 0.077 ms 2.04×
10,000 unsorted_unique 0.822 ms 0.270 ms 3.04×
10,000 duplicate_heavy 0.252 ms 0.251 ms 1.01×
100,000 sorted_unique 1.839 ms 0.878 ms 2.10×
100,000 unsorted_unique 13.168 ms 4.621 ms 2.85×
100,000 duplicate_heavy 4.469 ms 4.437 ms 1.01×
1,000,000 sorted_unique 23.180 ms 11.251 ms 2.06×
1,000,000 unsorted_unique 171.516 ms 62.111 ms 2.76×
1,000,000 duplicate_heavy 59.565 ms 60.078 ms 0.99×

Unique-index cases are around 2–3× faster. Duplicate-heavy cases are unchanged.

Checklist

  • Closes #
  • Tests added
  • Release note not necessary because:

@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.61%. Comparing base (58712ff) to head (1e6a575).
⚠️ Report is 13 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2496      +/-   ##
==========================================
- Coverage   87.61%   85.61%   -2.00%     
==========================================
  Files          49       49              
  Lines        7684     7677       -7     
==========================================
- Hits         6732     6573     -159     
- Misses        952     1104     +152     
Files with missing lines Coverage Δ
src/anndata/_core/index.py 95.11% <100.00%> (-0.13%) ⬇️

... and 8 files with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.12.17 milestone Jun 15, 2026
@ilan-gold

Copy link
Copy Markdown
Contributor

@sjfleming Could you take a look at this?

@ilan-gold ilan-gold modified the milestones: 0.12.17, 0.12.18 Jun 16, 2026
@sjfleming

Copy link
Copy Markdown
Contributor

From what I can tell, I think this looks good, and I think it makes sense.

Selfishly, I'm looking at related tests I've written in cellarium-ml. All my tests pass using anndata 0.12.17. All tests pass using the current main branch of anndata. All tests pass using anndata from this PR branch as well. Those tests were my original impetus for #2066 . So this PR breaks nothing.

For some reason when I install from this PR branch the version shows up as 0.1.0.dev1746+g1e6a5757e which I thought was a bit strange, but the diff here seems like it's based on current main. I have not confirmed the above speedups, but it seems plausible.

Incidentally, is @tippered1-debug a bot? Is it okay to ask that? :)

@tippered1-debug

Copy link
Copy Markdown
Author

No im not a bot:) I want to help make things better, especially tools that are used in research, medicine. I think that open source is one of the best ways to do that

@sjfleming

Copy link
Copy Markdown
Contributor

Totally agree @tippered1-debug , my bad! ;) I think it’s a nice PR

Comment on lines +24 to +25
def time_process_index_for_h5py(self, size, scenario):
_process_index_for_h5py(self.idx)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be benchmarking private methods. Please either remove the benchmark or do it against a public method

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants