perf: avoid redundant sorting in h5py indexing#2496
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2496 +/- ##
==========================================
- Coverage 87.61% 85.61% -2.00%
==========================================
Files 49 49
Lines 7684 7677 -7
==========================================
- Hits 6732 6573 -159
- Misses 952 1104 +152
|
|
@sjfleming Could you take a look at this? |
|
From what I can tell, I think this looks good, and I think it makes sense. Selfishly, I'm looking at related tests I've written in For some reason when I install from this PR branch the version shows up as Incidentally, is @tippered1-debug a bot? Is it okay to ask that? :) |
|
No im not a bot:) I want to help make things better, especially tools that are used in research, medicine. I think that open source is one of the best ways to do that |
|
Totally agree @tippered1-debug , my bad! ;) I think it’s a nice PR |
| def time_process_index_for_h5py(self, size, scenario): | ||
| _process_index_for_h5py(self.idx) |
There was a problem hiding this comment.
We shouldn't be benchmarking private methods. Please either remove the benchmark or do it against a public method
Summary
Avoid sorting HDF5 indices twice when they are already unique.
Duplicate indices still use the existing sorting and reconstruction path. Tests cover sorted and unsorted unique indices, duplicates, boolean masks, empty indices, and 1D/2D indexing.
Benchmark
Benchmarks were run from
benchmarks/benchmarks/h5py_indexing.py. Before and after measurements used the same index arrays in the same process. Results are medians of 25 runs after 5 warmups.sorted_uniqueunsorted_uniqueduplicate_heavysorted_uniqueunsorted_uniqueduplicate_heavysorted_uniqueunsorted_uniqueduplicate_heavyUnique-index cases are around 2–3× faster. Duplicate-heavy cases are unchanged.
Checklist