Fix memory not released after swath matchup with large batch_size (#74)
Merged
DataTree objects created by open_method="datatree-merge" have internal parent<->child reference cycles that Python's reference counting cannot collect. Previously gc.collect() was called only once per batch, so with a large batch_size all granules' DataTree objects accumulated in memory before GC ran. Fix: call gc.collect() once per granule (inside the inner loop) when open_method="datatree-merge". This bounds peak memory to ~1 DataTree at a time regardless of batch_size, matching the behaviour previously only achieved with batch_size=1. Also add test_swath_gc_called_per_granule_not_per_batch to verify the per-granule GC call is made for the datatree-merge (swath) path. Co-authored-by: eeholmes <2545978+eeholmes@users.noreply.github.com>
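To make the root cause concrete, here is a minimal, self-contained sketch (not the project's code) of why parent<->child reference cycles defeat reference counting and require `gc.collect()`. The `Node` class is a hypothetical stand-in for a DataTree node:

```python
import gc

class Node:
    """Stand-in for a tree node with a parent<->child reference cycle,
    analogous to the DataTree objects described above (illustrative only)."""
    def __init__(self):
        self.parent = None
        self.children = []

def make_tree():
    root = Node()
    child = Node()
    child.parent = root          # child -> parent reference
    root.children.append(child)  # parent -> child reference: a cycle
    return root

tree = make_tree()
del tree  # drops the last external reference, but the cycle keeps
          # both nodes' refcounts above zero, so nothing is freed yet

# The cyclic garbage collector finds and frees the unreachable cycle;
# gc.collect() returns the number of unreachable objects it collected.
freed = gc.collect()
print(freed >= 2)  # at least the two Node objects were reclaimed
```

Delaying that `gc.collect()` call is exactly how many such cycles can pile up before any of them is reclaimed.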
Copilot (AI) changed the title from "[WIP] Fix high memory usage when batch size is large for swath" to "Fix memory not released after swath matchup with large batch_size" on Mar 10, 2026.
Member: @copilot …
Contributor (Author): No, the parquet saving is completely unaffected. The sequence is: …
With `geometry="swath"`, setting `batch_size` larger than the number of granules results in 2–3× higher retained memory after `matchup()` returns compared to `batch_size=1`, despite processing the same data.

Root cause

`xr.open_datatree()` creates DataTree objects with internal parent↔child reference cycles. Python's reference counting cannot free these — they require `gc.collect()`. The engine called `gc.collect()` only once per batch, so with a large `batch_size` all granules' DataTree objects (~200 MB each) accumulated before GC ran. Python's allocator then didn't return the freed arenas to the OS promptly, leaving the cgroup memory high even after `matchup()` returned. `geometry="grid"` is unaffected: it uses `xr.open_dataset()` with no DataTree hierarchy and no cycles.
Changes

- `engine.py` — inside the inner granule loop in `_execute_plan`, call `gc.collect()` after `_safe_close(file_obj)` when `open_method == "datatree-merge"`. This breaks DataTree cycles immediately after each granule rather than waiting until the entire batch finishes, bounding peak memory to ~1 DataTree at a time regardless of `batch_size`.
- `tests/test_plan.py` — added `test_swath_gc_called_per_granule_not_per_batch`: monkeypatches `gc.collect` in the engine module, runs a 3-granule swath matchup with `batch_size=1000`, and asserts `gc.collect()` was called at least once per granule.
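A self-contained sketch of how such a test can count per-granule GC calls, using `unittest.mock.patch` in place of the pytest `monkeypatch` fixture. The `run_swath_matchup` function is a hypothetical stand-in for the engine loop, not the real one; the real test patches `gc.collect` where the engine module looks it up:

```python
import gc
from unittest import mock

def run_swath_matchup(granules, batch_size):
    """Minimal stand-in for the swath matchup loop (structure assumed)."""
    for i in range(0, len(granules), batch_size):
        for granule in granules[i:i + batch_size]:
            # ... open, match, and close the granule ...
            gc.collect()  # per-granule GC, the behaviour under test

def test_gc_called_per_granule():
    # Patch gc.collect and count calls: with 3 granules and a batch_size
    # far larger than the granule count, per-batch GC would yield only
    # one call, while per-granule GC yields at least three.
    with mock.patch("gc.collect") as collect:
        run_swath_matchup(["g1", "g2", "g3"], batch_size=1000)
    assert collect.call_count >= 3

test_gc_called_per_granule()
print("ok")
```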