Skip to content

Make each worker access cached topography file#29

Merged
ray-chew merged 78 commits into
mainfrom
ext/cached_topo
May 12, 2026
Merged

Make each worker access cached topography file#29
ray-chew merged 78 commits into
mainfrom
ext/cached_topo

Conversation

@ray-chew

Copy link
Copy Markdown
Owner

... instead of reopening the topography files each time; made sure everything works with Dask parallelization.

ray-chew added 30 commits May 13, 2024 13:01
got to create MWE and complete netcdf4 io routines
removed it from deprecated status; added comments
all paths and filenames should import from local_paths; although I have not been thorough with checking this, and not all run scripts have been updated accordingly
technically, this should work now, although I need to find a way to parallelise the embarrassing loop and possibly move the writing routine out. I will also need to implement the skipping of ocean grid cells. Finally, the south pole looks pretty okay.
I finally have a reason to move away from the iPython environment, and so now all imports should be done properly
I must update the script to the latest wrapper components that are also used in Delaunay runs to make sure that we are really doing what is mentioned in the manuscript.
checked; the dfft and lsff results have been reproduced
it runs and parallel reading of the topographic dataset sems to work; I have to move the writer out of the parallel routine
however, there is a bug with the REMA IO routine when both the lat and lon extent span multiple files, e.g. with grid cell 47 in this commit
so the code runs on most cases now, except grid cell indices 0 and 1 on the ICON R2B4 grid. Related to these issues, I anticipate I/O errors across the W180-E180 boundary.
requires cleaning up, and some corner cases still exist...
I got to comment up what I am doing though, since I am dealing with the corner cases separately.
…tting diagnostics

however, there is a gap between the MERIT and REMA datasets, and the lat-lon grid are also different. This latitudinal strip is not important for orographic data, as only a few small islands exist here, but it is important for me to find a sensible way to glue these two datasets together, e.g., by interpolation.
ray-chew and others added 28 commits October 23, 2025 03:20
Small difference of about 0.1% improvement for grid cells close to the poles.
Parallelisation, NetCDF outputs, plotting also verified to work; should be ready for HPC runs.
When one of the ICON grid cell's vertex is at 180 longitude, the ETOPO
reader was loading all the topography around the globe. The error has been
fixed by this commit.
These topographic features are not taken into account in the second approximation step.
Based on the latitude of the cells.
Verified that we are loading correct geographical features at the correct lat-lon domains, thanks Claude!
Need to load threadsafe HDF, e.g.,
module load hdf5/1.12.1-threadsafe-gcc-11.2.0
module load netcdf-c/4.8.1-gcc-11.2.0

and reinstall Python library against these libraries:
pip install --no-binary=...
Ocassionally, the plotter crashes ...?
The idea of the land-ocean ratio verification is to make sure that the NetCDF outputs have roughly as many number of land and ocean cells as we expect.
Verification script with etopo_cg=40:
  Land cells: 8509
  Ocean cells: 11971

Merging and verification scripts with outputs from etopo_cg=4:
  Land cells (is_land=1): 8622
  Ocean cells (is_land=0): 11858

Both scripts are consistent. Furthermore, the plots are sensible.
* grid: add cell_area attribute, auto-populated by read_dat from ICON grid file
* grid.apply_f: skip cell_area in non_convertibles so radian conversion leaves it untouched
* nc_writer.grp_struct: accept cell_area; write as "m^2" variable per cell when provided
* icon_etopo_global.do_cell: pass grid.cell_area[c_idx] to grp_struct on both ocean and land branches
* merge_icon_etopo_outputs: extract cell_area per group, propagate into merged NetCDF; backward-compatible if absent
* io.py: global _NETCDF_GLOBAL_LOCK serialises every nc.Dataset open across threads.
* icon_etopo_global: wrap do_cell body in try/except that logs the traceback to logger and stderr before re-raising, so worker crashes surface a stack instead of dying silently.
* icon_etopo_global: invert loop nesting so memory batches are the outer loop. A crash mid-run leaves all earlier memory batches complete; restart skips to the failing batch.
* Dask client now created and closed per memory batch (not per NetCDF chunk); threads_per_worker hardcoded to 1 since the global lock makes >1 threads pointless.
* Preserve existing single-worker = full-machine-memory feature for high-memory polar batches inside the new loop.
* TopographyTileCache pre-loads MERIT/ETOPO/REMA tiles into memory and exposes fast subset access for individual grid cells, avoiding repeated NetCDF opens during parallel cell processing.
* Not yet wired into the main loop: Added as a building block for future I/O optimisation.
* runs/submit_etopo_global.sh: SLURM template (1 node, 128 CPUs, 256 GB, 48 h) targeting the adaptive runs.icon_etopo_global entry point.
* scripts/check_slurm_resources, check_etopo_sizes, diagnose_netcdf_issue: pre-run diagnostics for HPC submission and data-tile integrity.
* scripts/plot_pacific_detail, plot_verification_improved: post-run verification plots over land/ocean fractions.
* verify_icon_etopo_land_ocean: drop stale reference to the now-removed legacy HPC script in a parameter-source comment.
* Conditional stub kicks in only if cartopy import fails — no-op in envs where cartopy is installed.
* pycsa.__init__ eagerly imports pycsa.plotting.cart_plot which imports cartopy; the test suite doesn't actually call plotting, so the import chain just needs to succeed. Stub provides empty cartopy / cartopy.crs / cartopy.mpl.ticker / cartopy.feature / cartopy.io.shapereader packages.
* New get_etopo_data(lat_extent, lon_extent, etopo_cg) — byte-equivalent port of io.read_etopo_topo.get_topo. Mirrors __compute_idx, __get_fns, __get_lon_idxs, __load_topo, and the three-branch dateline logic (global / split_EW / normal).
* Lazy ETOPO mode: constructor skips eager file opens when called with tile_filenames=[] and dataset_type='ETOPO'. Files open on first access in _open_etopo_tile; handles + coord arrays cached for the rest of the worker's lifetime.
* Free function compute_split_EW(lon_verts) using the robust span_360 < lon_span formula (matches the post-550d1d5 io.py fix).
* Use the shared _NETCDF_GLOBAL_LOCK from pycsa.core.io for every nc.Dataset open and every ds[var][...] slice — HDF5 isn't thread-safe.
* MERIT-path fix: get_data_for_region's old (lon_max - lon_min) > 180 check false-positived on cells like the Aleutians; now delegates to compute_split_EW. No other MERIT changes.
* tests/test_tile_cache_etopo_equivalence.py parametrises 4 cells: a typical non-dateline cell (1086), the Aleutians false-positive case (2311), a genuine dateline crossing (1074, split_EW=True), and an extreme south-polar cell (17408). Asserts array_equal for lat / lon / topo against the reference reader. All 4 pass.
* tile_cache: add module-level _WORKER_CACHE + init_worker_cache / get_worker_cache / close_worker_cache helpers. Each Dask worker is a separate process, so the cache lives as per-process state; handles stay open across cells in the same worker. init_worker_cache is idempotent — a second call with the same data_dir is a no-op.
* icon_etopo_global.do_cell: replace the per-cell read_etopo_topo construction with a get_worker_cache().get_etopo_data(...) call. split_EW now comes from the module-level compute_split_EW helper instead of the reader instance.
* icon_etopo_global per-batch loop: after Client(...) is created, call client.run(tile_cache.init_worker_cache, params.path_etopo, "ETOPO") to populate the cache on every worker once per batch.
* read_etopo_topo in pycsa/core/io.py is untouched — tests and archived debug scripts still use it directly.
* tests/test_tile_cache_etopo_equivalence.py: add test_worker_cache_lifecycle covering init/get/close happy path, idempotent re-init, RuntimeError when uninitialised, and a functional round-trip via the worker-cache path for cell 1086. All 5 tests pass.
* Triggers restricted to push/PR on main + workflow_dispatch, removing the double-run that occurred when a feature branch had an open PR (both push and pull_request events fired the same workflow).
* New format-check job runs black --check . (codebase is already clean).
* docs job preserves the existing Sphinx build + peaceiris deploy to gh-pages on push-to-main.
@ray-chew ray-chew merged commit afdbc11 into main May 12, 2026
2 checks passed
@ray-chew ray-chew deleted the ext/cached_topo branch May 12, 2026 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant