
Fix memory not released after swath matchup with large batch_size#74

Merged
eeholmes merged 2 commits into main from copilot/fix-high-memory-issue-swath on Mar 10, 2026

Conversation

Contributor

Copilot AI commented Mar 10, 2026

With geometry="swath", setting batch_size larger than the number of granules results in 2–3× higher retained memory after matchup() returns compared to batch_size=1, despite processing the same data.

Root cause

xr.open_datatree() creates DataTree objects with internal parent↔child reference cycles. Python's reference counting cannot free these — they require gc.collect(). The engine called gc.collect() only once per batch, so with a large batch_size all granules' DataTree objects (~200 MB each) accumulated before GC ran. Python's allocator then didn't return the freed arenas to the OS promptly, leaving the cgroup memory high even after matchup() returned.
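The cycle-collector requirement can be demonstrated with a minimal stand-in (a toy `Node` class with parent↔child links, not the actual DataTree implementation):

```python
import gc
import weakref

class Node:
    """Toy stand-in for a DataTree node: parent and children reference each other."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

gc.disable()  # rely on reference counting alone, as between explicit collections
root = Node()
child = Node(parent=root)
alive = weakref.ref(root)  # lets us observe whether root has been freed

del root, child
# The parent<->child cycle keeps both objects alive: refcounting cannot free them.
assert alive() is not None

gc.collect()  # the cycle collector detects and breaks the unreachable cycle
assert alive() is None
gc.enable()
```

This is exactly why deferring `gc.collect()` to the end of a batch lets unreachable DataTree cycles pile up in the meantime.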

geometry="grid" is unaffected: it uses xr.open_dataset() with no DataTree hierarchy and no cycles.

Changes

  • engine.py — inside the inner granule loop in _execute_plan, call gc.collect() after _safe_close(file_obj) when open_method == "datatree-merge". This breaks DataTree cycles immediately after each granule rather than waiting until the entire batch finishes, bounding peak memory to ~1 DataTree at a time regardless of batch_size.

  • tests/test_plan.py — added test_swath_gc_called_per_granule_not_per_batch: monkeypatches gc.collect in the engine module, runs a 3-granule swath matchup with batch_size=1000, and asserts gc.collect() was called at least once per granule.
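A minimal sketch of the resulting control flow (the helper bodies and loop shape here are assumptions for illustration; only the `_safe_close`-then-`gc.collect()` ordering under `open_method == "datatree-merge"` comes from this PR):

```python
import gc

# Hypothetical stand-ins for the engine's granule helpers.
def _open_granule(granule):
    return {"granule": granule, "rows": [granule * 10]}

def _extract_rows(file_obj):
    return file_obj["rows"]

def _safe_close(file_obj):
    file_obj.clear()

collect_calls = []  # records when gc.collect() ran, one entry per granule

def process_batch(granules, open_method="datatree-merge"):
    batch_rows = []
    for granule in granules:
        file_obj = _open_granule(granule)
        try:
            batch_rows.extend(_extract_rows(file_obj))
        finally:
            _safe_close(file_obj)
            if open_method == "datatree-merge":
                # Collect before opening the next granule so peak memory
                # stays bounded at ~1 DataTree regardless of batch_size.
                gc.collect()
                collect_calls.append(granule)
    return batch_rows

rows = process_batch([1, 2, 3])
assert collect_calls == [1, 2, 3]  # once per granule, not once per batch
```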

Original prompt

This section details the original issue to resolve

<issue_title>reopening high memory when batch_size large</issue_title>
<issue_description>Task: Diagnose and fix if it is a bug. You worked on this in #70 but the issue persists. It ONLY occurs for geometry="swath". Is there something different happening with geometry="swath"?

Question: Why does setting batch_size to 1000 (the number of granules is 24) lead to a final memory of 3 GB, while batch_size=1 leads to a final of 1 GB? The memory is not climbing, but it is higher at the end of the pc.matchup() call when batch_size is larger. It is not clear why one would end with larger memory usage. Is something not being released?

Expected behavior: I expected that peak memory might be higher with batch_size=1000, but would drop after memory was released.

res = pc.matchup(plan, geometry="swath", variables=["Rrs"], batch_size=1000)

3.3Gb at end, 1.3 minutes

res = pc.matchup(plan, geometry="swath", variables=["Rrs"], batch_size=1)

1Gb at end, 1.3 minutes

This is how I diagnosed the amount of memory:

with open("/sys/fs/cgroup/memory.current") as f:
    used = int(f.read())

print(used / 1024**3, "GB used")

Note, there is no difference in memory usage with batch_size for geometry="grid", that I can notice at least. Memory usage stays low throughout matching whether I use spatial_method="nearest" or "xoak".

What is different with geometry?

"grid" uses xr.open_dataset() while "swath" uses ds = xr.open_datatreee() and ds.merge()</issue_description>

Comments on the Issue (you are @copilot in this section)



DataTree objects created by open_method="datatree-merge" have internal
parent<->child reference cycles that Python's reference counting cannot
collect. Previously gc.collect() was called only once per batch, so with
a large batch_size all granules' DataTree objects accumulated in memory
before GC ran.

Fix: call gc.collect() once per granule (inside the inner loop) when
open_method="datatree-merge". This bounds peak memory to ~1 DataTree
at a time regardless of batch_size, matching the behaviour previously
only achieved with batch_size=1.

Also add test_swath_gc_called_per_granule_not_per_batch to verify the
per-granule GC call is made for the datatree-merge (swath) path.

Co-authored-by: eeholmes <2545978+eeholmes@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix high memory usage when batch size is large for swath" to "Fix memory not released after swath matchup with large batch_size" on Mar 10, 2026
@eeholmes
Member

@copilot
batch_size is used to create an intermediate dataframe that is saved to parquet in case where the user wants intermediate files. This gc.collect() solution won't break that will it?

Contributor Author

Copilot AI commented Mar 10, 2026

@copilot
batch_size is used to create an intermediate dataframe that is saved to parquet in case where the user wants intermediate files. This gc.collect() solution won't break tha...

No, the parquet saving is completely unaffected. gc.collect() only frees unreachable Python objects (the DataTree reference cycles). It has no effect on live objects like batch_rows or output_rows.

The sequence is:

  1. Inner granule loop runs: extracted rows are appended to batch_rows and output_rows as usual.
  2. After each granule, _safe_close() drops the file handle, then gc.collect() frees the now-unreachable DataTree.
  3. After all granules in the batch finish, batch_rows (intact, fully populated) is written to parquet — unchanged from before.

batch_size still controls exactly which granules go into each parquet file, just as before.
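That gc.collect() frees only unreachable cycles while leaving live containers like batch_rows untouched can be checked with a toy example (plain dicts standing in for DataTree nodes):

```python
import gc

# A live accumulator, analogous to batch_rows in the engine.
batch_rows = [{"granule": i, "value": i * 2} for i in range(3)]

# Build a cyclic structure analogous to a closed DataTree, then drop all
# external references to it.
a, b = {}, {}
a["child"], b["parent"] = b, a
del a, b

gc.collect()  # reclaims the unreachable cycle only

# Live objects are untouched: the rows destined for parquet are intact.
assert batch_rows == [
    {"granule": 0, "value": 0},
    {"granule": 1, "value": 2},
    {"granule": 2, "value": 4},
]
```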

@eeholmes eeholmes marked this pull request as ready for review March 10, 2026 01:32
@eeholmes eeholmes merged commit a3cc5fb into main Mar 10, 2026
1 check passed
@eeholmes eeholmes deleted the copilot/fix-high-memory-issue-swath branch March 10, 2026 01:34


Development

Successfully merging this pull request may close these issues.

reopening high memory when batch_size large
