Fix 870: Significant memory and runtime improvements for Ripley's L #1236

Open
jberg5 wants to merge 3 commits into scverse:main from jberg5:ripley-L-mem-usage

@jberg5 jberg5 commented Jun 30, 2026

Description

Headline number: at 250,000 cells in a single cluster, peak memory drops from ~1.8 TB to ~0.14 GB, and runtime drops from roughly 40 minutes (extrapolated, because I don't have 1.8 TB of RAM lol) to 6 minutes on a GCP n2-highmem-8.

Peak memory

| n (cells) | main (pdist) | branch (KDTree) | reduction |
|---|---|---|---|
| 20,000 | 11.5 GB | 0.13 GB | ~90x |
| 30,000 | 25.6 GB | 0.13 GB | ~200x |
| 40,000 | 45.4 GB | 0.13 GB | ~350x |
| 100,000 | ~290 GB* | 0.13 GB | ~2,200x |
| 250,000 | ~1.8 TB* | 0.14 GB | ~13,000x |
| 500,000 | ~7.2 TB* | 0.15 GB | ~50,000x |

Runtime

| n (cells) | main (pdist) | branch (KDTree) | speedup |
|---|---|---|---|
| 20,000 | 15.5 s | 5.7 s | 2.7x |
| 30,000 | 34.0 s | 10.8 s | 3.2x |
| 40,000 | 60.1 s | 19.2 s | 3.1x |
| 100,000 | ~6.4 min* | 77.5 s | ~5x |
| 250,000 | ~40 min* | 6.0 min | ~7x |
| 500,000 | ~2.7 hr* | 20.8 min | ~8x |

(* extrapolated, because the process OOMed on main)

Previously, the Ripley's L calculation materialized all O(n^2) pairwise distances (via pdist) and then broadcast them against every step in support. For issue #870, at 250,000 cells, that is n * (n - 1) / 2 = 31,249,875,000 unordered pairs. With 50 steps, distances < support.reshape(-1, 1) is a 2D bool array of 50 * 31,249,875,000 bytes, so roughly 1.5 TB of memory, assuming all cells are in the same cluster. The pdist intermediate still exists alongside it and adds another ~250 GB (8 bytes per pair). This would OOM on pretty much any reasonable hardware.
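The arithmetic above can be reproduced with a few lines of plain Python. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes pdist returns float64 distances (8 bytes each) and that the boolean mask uses 1 byte per element.

```python
def pdist_memory_estimate(n_cells, n_steps=50):
    """Rough byte counts for the pdist-based approach on a single cluster.

    Hypothetical helper for illustration; not part of squidpy.
    """
    n_pairs = n_cells * (n_cells - 1) // 2  # unordered pairs
    condensed_bytes = n_pairs * 8           # float64 output of pdist
    mask_bytes = n_pairs * n_steps          # 1-byte bools, one row per support step
    return n_pairs, condensed_bytes, mask_bytes


n_pairs, condensed_bytes, mask_bytes = pdist_memory_estimate(250_000)
print(f"{n_pairs:,} unordered pairs")
print(f"pdist output: {condensed_bytes / 1e9:.0f} GB")
print(f"boolean mask: {mask_bytes / 1e12:.2f} TB")
print(f"combined:     {(condensed_bytes + mask_bytes) / 1e12:.2f} TB")
```

The combined figure is what the ~1.8 TB entry in the peak memory table corresponds to.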

Fortunately, we can skip all of that by using a KDTree. Long story short: we build the binary tree once, using O(n) memory, and then two_point_correlation traverses this structure to count the points within each radius without materializing every pairwise distance. This gives us O(n) memory usage instead of O(n^2).
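A minimal sketch of the idea, not the code in this PR. It assumes KDTree is sklearn.neighbors.KDTree, and the helper names, random coordinates, and support grid are all made up for illustration. two_point_correlation counts ordered pairs and also pairs each point with itself, so the sketch subtracts n and halves the result before comparing it with the pdist count.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.neighbors import KDTree


def pair_counts_pdist(coords, support):
    """Old approach: materialize every pairwise distance, then threshold."""
    distances = pdist(coords)  # length n * (n - 1) / 2
    return (distances < support.reshape(-1, 1)).sum(axis=1)


def pair_counts_kdtree(coords, support):
    """Tree approach: count neighbours per radius without storing distances."""
    tree = KDTree(coords)
    # Counts ordered pairs with distance <= r, including each point with itself.
    ordered = tree.two_point_correlation(coords, support)
    n = coords.shape[0]
    return (ordered - n) // 2  # drop self pairs, fold (i, j) and (j, i) together


# Made-up data: 500 points in a 100 x 100 square, 10 radii.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 100.0, size=(500, 2))
support = np.linspace(1.0, 50.0, 10)

counts_old = pair_counts_pdist(coords, support)
counts_new = pair_counts_kdtree(coords, support)
print(np.array_equal(counts_old, counts_new))
```

One caveat of this sketch: the tree uses <= while the pdist version uses <, so the two counts could differ for a pair whose distance exactly equals a radius. With continuous random coordinates that is not expected to happen.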

One thing to note: this narrows the list of valid metrics down to:

>>> KDTree.valid_metrics
['euclidean', 'l2', 'minkowski', 'p', 'manhattan', 'cityblock', 'l1', 'chebyshev', 'infinity']

whereas previously pdist would have accepted any of

['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalsneath', 'sqeuclidean', 'yule']

but I don't think any of the dropped ones were valid or sensible metrics for the kind of spatial statistics happening here.
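To make the change in accepted metrics concrete, the two lists quoted above can be compared directly. The lists are copied from this description rather than queried from the libraries, and validate_metric is a hypothetical helper, not something this PR adds.

```python
# Copied from the lists quoted in this description.
KDTREE_METRICS = [
    "euclidean", "l2", "minkowski", "p", "manhattan",
    "cityblock", "l1", "chebyshev", "infinity",
]
PDIST_METRICS = [
    "braycurtis", "canberra", "chebyshev", "cityblock", "correlation",
    "cosine", "dice", "euclidean", "hamming", "jaccard", "jensenshannon",
    "mahalanobis", "matching", "minkowski", "rogerstanimoto", "russellrao",
    "seuclidean", "sokalsneath", "sqeuclidean", "yule",
]

still_supported = sorted(set(PDIST_METRICS) & set(KDTREE_METRICS))
dropped = sorted(set(PDIST_METRICS) - set(KDTREE_METRICS))


def validate_metric(metric):
    """Hypothetical guard: fail early with a readable message."""
    if metric not in KDTREE_METRICS:
        raise ValueError(
            f"metric {metric!r} is not supported by KDTree; "
            f"choose one of {KDTREE_METRICS}"
        )
    return metric


print("still supported:", still_supported)
print("dropped:", dropped)
```

Of the 20 pdist names, four carry over unchanged and 16 are no longer accepted. The remaining KDTree names ('l2', 'p', 'manhattan', 'l1', 'infinity') are new spellings that pdist did not list.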

How has this been tested?

  • Running existing tests
  • Extensive benchmarking (both memory and runtime) across various problem sizes, while verifying correctness against main.
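The benchmarking harness itself is not shown in this description. One way such measurements could be taken, using only the standard library, is sketched below. The measure helper is hypothetical, and tracemalloc only sees allocations made through Python's allocators and adds runtime overhead, so numbers from it are indicative rather than exact.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds, peak traced bytes).

    Hypothetical helper for illustration; not the harness used for the tables.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Toy workload: allocate a 10 MB buffer.
result, seconds, peak_bytes = measure(bytearray, 10_000_000)
print(f"{seconds:.4f} s, peak {peak_bytes / 1e6:.1f} MB")
```

Running the same wrapper around the Ripley's L call on main and on the branch, at increasing n, would give tables of the kind shown above.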

Closes

closes #870

codecov Bot commented Jun 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.81%. Comparing base (a9966fd) to head (c76ef30).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1236      +/-   ##
==========================================
+ Coverage   75.32%   76.81%   +1.49%     
==========================================
  Files          56       63       +7     
  Lines        7936     9270    +1334     
  Branches     1295     1566     +271     
==========================================
+ Hits         5978     7121    +1143     
- Misses       1447     1547     +100     
- Partials      511      602      +91     

| Files with missing lines | Coverage Δ |
|---|---|
| src/squidpy/gr/_ripley.py | 96.52% <100.00%> (+0.03%) ⬆️ |

... and 24 files with indirect coverage changes

Development

Successfully merging this pull request may close these issues.

sq.gr.ripley() cost too much memory
