
Fix memory not released after swath matchup with large batch_size#74

Merged
eeholmes merged 2 commits into main from copilot/fix-high-memory-issue-swath on Mar 10, 2026

Conversation

Contributor

Copilot AI commented Mar 10, 2026

With geometry="swath", setting batch_size larger than the number of granules results in 2–3× higher retained memory after matchup() returns compared to batch_size=1, despite processing the same data.

Root cause

xr.open_datatree() creates DataTree objects with internal parent↔child reference cycles. Python's reference counting cannot free these — they require gc.collect(). The engine called gc.collect() only once per batch, so with a large batch_size all granules' DataTree objects (~200 MB each) accumulated before GC ran. Python's allocator then didn't return the freed arenas to the OS promptly, leaving the cgroup memory high even after matchup() returned.
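The cycle-collector requirement can be demonstrated with a minimal stand-in (a toy `Node` class with parent↔child links, not the actual DataTree implementation):

```python
import gc
import weakref

class Node:
    """Toy stand-in for a DataTree node: parent and children reference each other."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

gc.disable()  # rely on reference counting alone, as between explicit collections
root = Node()
child = Node(parent=root)
alive = weakref.ref(root)  # lets us observe whether root has been freed

del root, child
# The parent<->child cycle keeps both objects alive: refcounting cannot free them.
assert alive() is not None

gc.collect()  # the cycle collector detects and breaks the unreachable cycle
assert alive() is None
gc.enable()
```

This is exactly why deferring `gc.collect()` to the end of a batch lets unreachable DataTree cycles pile up in the meantime.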

geometry="grid" is unaffected: it uses xr.open_dataset() with no DataTree hierarchy and no cycles.

Changes

  • engine.py — inside the inner granule loop in _execute_plan, call gc.collect() after _safe_close(file_obj) when open_method == "datatree-merge". This breaks DataTree cycles immediately after each granule rather than waiting until the entire batch finishes, bounding peak memory to ~1 DataTree at a time regardless of batch_size.

  • tests/test_plan.py — added test_swath_gc_called_per_granule_not_per_batch: monkeypatches gc.collect in the engine module, runs a 3-granule swath matchup with batch_size=1000, and asserts gc.collect() was called at least once per granule.
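A minimal sketch of the resulting control flow (the helper bodies and loop shape here are assumptions for illustration; only the `_safe_close`-then-`gc.collect()` ordering under `open_method == "datatree-merge"` comes from this PR):

```python
import gc

# Hypothetical stand-ins for the engine's granule helpers.
def _open_granule(granule):
    return {"granule": granule, "rows": [granule * 10]}

def _extract_rows(file_obj):
    return file_obj["rows"]

def _safe_close(file_obj):
    file_obj.clear()

collect_calls = []  # records when gc.collect() ran, one entry per granule

def process_batch(granules, open_method="datatree-merge"):
    batch_rows = []
    for granule in granules:
        file_obj = _open_granule(granule)
        try:
            batch_rows.extend(_extract_rows(file_obj))
        finally:
            _safe_close(file_obj)
            if open_method == "datatree-merge":
                # Collect before opening the next granule so peak memory
                # stays bounded at ~1 DataTree regardless of batch_size.
                gc.collect()
                collect_calls.append(granule)
    return batch_rows

rows = process_batch([1, 2, 3])
assert collect_calls == [1, 2, 3]  # once per granule, not once per batch
```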

Original prompt

This section details the original issue to resolve

<issue_title>reopening high memory when batch_size large</issue_title>
<issue_description>Task: Diagnose and fix if it is a bug. You worked on this in #70 but the issue persists. It ONLY occurs for geometry="swath". Is there something different happening with geometry="swath"?

Question: Why does setting batch_size to 1000 (the number of granules is 24) lead to a final memory of 3 GB, while batch_size=1 leads to a final of 1 GB? The memory is not climbing, but it is higher at the end of the pc.matchup() call when batch_size is larger. It is not clear why one would end with larger memory usage. Is something not being released?

Expected behavior: I expected that peak memory might be higher with batch_size=1000, but would drop after memory was released.

res = pc.matchup(plan, geometry="swath", variables=["Rrs"], batch_size=1000)

3.3Gb at end, 1.3 minutes

res = pc.matchup(plan, geometry="swath", variables=["Rrs"], batch_size=1)

1Gb at end, 1.3 minutes

This is how I diagnosed the amount of memory:

with open("/sys/fs/cgroup/memory.current") as f:
    used = int(f.read())

print(used / 1024**3, "GB used")

Note, there is no difference in memory usage with batch_size for geometry="grid", that I can notice at least. Memory usage stays low throughout matching whether I use spatial_method="nearest" or "xoak".

What is different with geometry?

"grid" uses xr.open_dataset() while "swath" uses ds = xr.open_datatreee() and ds.merge()</issue_description>

Comments on the Issue (you are @copilot in this section)



DataTree objects created by open_method="datatree-merge" have internal
parent<->child reference cycles that Python's reference counting cannot
collect. Previously gc.collect() was called only once per batch, so with
a large batch_size all granules' DataTree objects accumulated in memory
before GC ran.

Fix: call gc.collect() once per granule (inside the inner loop) when
open_method="datatree-merge". This bounds peak memory to ~1 DataTree
at a time regardless of batch_size, matching the behaviour previously
only achieved with batch_size=1.

Also add test_swath_gc_called_per_granule_not_per_batch to verify the
per-granule GC call is made for the datatree-merge (swath) path.

Co-authored-by: eeholmes <2545978+eeholmes@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix high memory usage when batch size is large for swath" to "Fix memory not released after swath matchup with large batch_size" on Mar 10, 2026
@eeholmes
Member

@copilot
batch_size is used to create an intermediate dataframe that is saved to parquet in case where the user wants intermediate files. This gc.collect() solution won't break that will it?

Contributor Author

Copilot AI commented Mar 10, 2026

@copilot
batch_size is used to create an intermediate dataframe that is saved to parquet in case where the user wants intermediate files. This gc.collect() solution won't break tha...

No, the parquet saving is completely unaffected. gc.collect() only frees unreachable Python objects (the DataTree reference cycles). It has no effect on live objects like batch_rows or output_rows.

The sequence is:

  1. Inner granule loop runs: extracted rows are appended to batch_rows and output_rows as usual.
  2. After each granule, _safe_close() drops the file handle, then gc.collect() frees the now-unreachable DataTree.
  3. After all granules in the batch finish, batch_rows (intact, fully populated) is written to parquet — unchanged from before.

batch_size still controls exactly which granules go into each parquet file, just as before.
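That gc.collect() frees only unreachable cycles while leaving live containers like batch_rows untouched can be checked with a toy example (plain dicts standing in for DataTree nodes):

```python
import gc

# A live accumulator, analogous to batch_rows in the engine.
batch_rows = [{"granule": i, "value": i * 2} for i in range(3)]

# Build a cyclic structure analogous to a closed DataTree, then drop all
# external references to it.
a, b = {}, {}
a["child"], b["parent"] = b, a
del a, b

gc.collect()  # reclaims the unreachable cycle only

# Live objects are untouched: the rows destined for parquet are intact.
assert batch_rows == [
    {"granule": 0, "value": 0},
    {"granule": 1, "value": 2},
    {"granule": 2, "value": 4},
]
```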

@eeholmes eeholmes marked this pull request as ready for review March 10, 2026 01:32
@eeholmes eeholmes merged commit a3cc5fb into main Mar 10, 2026
1 check passed
@eeholmes eeholmes deleted the copilot/fix-high-memory-issue-swath branch March 10, 2026 01:34


Development

Successfully merging this pull request may close these issues.

reopening high memory when batch_size large
