Make each worker access cached topography file#29
Merged
Conversation
got to create MWE and complete netcdf4 io routines
… MS-GWaM's IO module
removed it from deprecated status; added comments
all paths and filenames should import from local_paths; although I have not been thorough with checking this, and not all run scripts have been updated accordingly
see changes for more details
technically, this should work now, although I need to find a way to parallelise the embarrassing loop and possibly move the writing routine out. I will also need to implement the skipping of ocean grid cells. Finally, the south pole looks pretty okay.
I finally have a reason to move away from the iPython environment, and so now all imports should be done properly
I must update the script to the latest wrapper components that are also used in Delaunay runs to make sure that we are really doing what is mentioned in the manuscript.
checked; the dfft and lsff results have been reproduced
it runs and parallel reading of the topographic dataset sems to work; I have to move the writer out of the parallel routine
however, there is a bug with the REMA IO routine when both the lat and lon extent span multiple files, e.g. with grid cell 47 in this commit
so the code runs on most cases now, except grid cell indices 0 and 1 on the ICON R2B4 grid. Related to these issues, I anticipate I/O errors across the W180-E180 boundary.
requires cleaning up, and some corner cases still exist...
I got to comment up what I am doing though, since I am dealing with the corner cases separately.
…tting diagnostics however, there is a gap between the MERIT and REMA datasets, and the lat-lon grid are also different. This latitudinal strip is not important for orographic data, as only a few small islands exist here, but it is important for me to find a sensible way to glue these two datasets together, e.g., by interpolation.
Small difference of about 0.1% improvement for grid cells close to the poles.
These topographic features are not taken into account in the second approximation step.
Based on the latitude of the cells.
Verified that we are loading correct geographical features at the correct lat-lon domains, thanks Claude!
Need to load threadsafe HDF, e.g., module load hdf5/1.12.1-threadsafe-gcc-11.2.0 module load netcdf-c/4.8.1-gcc-11.2.0 and reinstall Python library against these libraries: pip install --no-binary=...
Ocassionally, the plotter crashes ...?
The idea of the land-ocean ratio verification is to make sure that the NetCDF outputs have roughly as many number of land and ocean cells as we expect. Verification script with etopo_cg=40: Land cells: 8509 Ocean cells: 11971 Merging and verification scripts with outputs from etopo_cg=4: Land cells (is_land=1): 8622 Ocean cells (is_land=0): 11858 Both scripts are consistent. Furthermore, the plots are sensible.
* grid: add cell_area attribute, auto-populated by read_dat from ICON grid file * grid.apply_f: skip cell_area in non_convertibles so radian conversion leaves it untouched * nc_writer.grp_struct: accept cell_area; write as "m^2" variable per cell when provided * icon_etopo_global.do_cell: pass grid.cell_area[c_idx] to grp_struct on both ocean and land branches * merge_icon_etopo_outputs: extract cell_area per group, propagate into merged NetCDF; backward-compatible if absent
* io.py: global _NETCDF_GLOBAL_LOCK serialises every nc.Dataset open across threads. * icon_etopo_global: wrap do_cell body in try/except that logs the traceback to logger and stderr before re-raising, so worker crashes surface a stack instead of dying silently. * icon_etopo_global: invert loop nesting so memory batches are the outer loop. A crash mid-run leaves all earlier memory batches complete; restart skips to the failing batch. * Dask client now created and closed per memory batch (not per NetCDF chunk); threads_per_worker hardcoded to 1 since the global lock makes >1 threads pointless. * Preserve existing single-worker = full-machine-memory feature for high-memory polar batches inside the new loop.
* TopographyTileCache pre-loads MERIT/ETOPO/REMA tiles into memory and exposes fast subset access for individual grid cells, avoiding repeated NetCDF opens during parallel cell processing. * Not yet wired into the main loop: Added as a building block for future I/O optimisation.
* runs/submit_etopo_global.sh: SLURM template (1 node, 128 CPUs, 256 GB, 48 h) targeting the adaptive runs.icon_etopo_global entry point. * scripts/check_slurm_resources, check_etopo_sizes, diagnose_netcdf_issue: pre-run diagnostics for HPC submission and data-tile integrity. * scripts/plot_pacific_detail, plot_verification_improved: post-run verification plots over land/ocean fractions. * verify_icon_etopo_land_ocean: drop stale reference to the now-removed legacy HPC script in a parameter-source comment.
* Conditional stub kicks in only if cartopy import fails — no-op in envs where cartopy is installed. * pycsa.__init__ eagerly imports pycsa.plotting.cart_plot which imports cartopy; the test suite doesn't actually call plotting, so the import chain just needs to succeed. Stub provides empty cartopy / cartopy.crs / cartopy.mpl.ticker / cartopy.feature / cartopy.io.shapereader packages.
* New get_etopo_data(lat_extent, lon_extent, etopo_cg) — byte-equivalent port of io.read_etopo_topo.get_topo. Mirrors __compute_idx, __get_fns, __get_lon_idxs, __load_topo, and the three-branch dateline logic (global / split_EW / normal). * Lazy ETOPO mode: constructor skips eager file opens when called with tile_filenames=[] and dataset_type='ETOPO'. Files open on first access in _open_etopo_tile; handles + coord arrays cached for the rest of the worker's lifetime. * Free function compute_split_EW(lon_verts) using the robust span_360 < lon_span formula (matches the post-550d1d5 io.py fix). * Use the shared _NETCDF_GLOBAL_LOCK from pycsa.core.io for every nc.Dataset open and every ds[var][...] slice — HDF5 isn't thread-safe. * MERIT-path fix: get_data_for_region's old (lon_max - lon_min) > 180 check false-positived on cells like the Aleutians; now delegates to compute_split_EW. No other MERIT changes. * tests/test_tile_cache_etopo_equivalence.py parametrises 4 cells: a typical non-dateline cell (1086), the Aleutians false-positive case (2311), a genuine dateline crossing (1074, split_EW=True), and an extreme south-polar cell (17408). Asserts array_equal for lat / lon / topo against the reference reader. All 4 pass.
* tile_cache: add module-level _WORKER_CACHE + init_worker_cache / get_worker_cache / close_worker_cache helpers. Each Dask worker is a separate process, so the cache lives as per-process state; handles stay open across cells in the same worker. init_worker_cache is idempotent — a second call with the same data_dir is a no-op. * icon_etopo_global.do_cell: replace the per-cell read_etopo_topo construction with a get_worker_cache().get_etopo_data(...) call. split_EW now comes from the module-level compute_split_EW helper instead of the reader instance. * icon_etopo_global per-batch loop: after Client(...) is created, call client.run(tile_cache.init_worker_cache, params.path_etopo, "ETOPO") to populate the cache on every worker once per batch. * read_etopo_topo in pycsa/core/io.py is untouched — tests and archived debug scripts still use it directly. * tests/test_tile_cache_etopo_equivalence.py: add test_worker_cache_lifecycle covering init/get/close happy path, idempotent re-init, RuntimeError when uninitialised, and a functional round-trip via the worker-cache path for cell 1086. All 5 tests pass.
* Triggers restricted to push/PR on main + workflow_dispatch, removing the double-run that occurred when a feature branch had an open PR (both push and pull_request events fired the same workflow). * New format-check job runs black --check . (codebase is already clean). * docs job preserves the existing Sphinx build + peaceiris deploy to gh-pages on push-to-main.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
... instead of reopening the topography files each time; made sure everything works with Dask parallelization.