Make each worker access cached topography file by ray-chew · Pull Request #29 · ray-chew/pyCSA

ray-chew · 2026-05-12T09:08:53Z

... instead of reopening the topography files each time; made sure everything works with Dask parallelization.

got to create MWE and complete netcdf4 io routines

… MS-GWaM's IO module

removed it from deprecated status; added comments

all paths and filenames should import from local_paths; although I have not been thorough with checking this, and not all run scripts have been updated accordingly

see changes for more details

technically, this should work now, although I need to find a way to parallelise the embarrassing loop and possibly move the writing routine out. I will also need to implement the skipping of ocean grid cells. Finally, the south pole looks pretty okay.

I finally have a reason to move away from the iPython environment, and so now all imports should be done properly

I must update the script to the latest wrapper components that are also used in Delaunay runs to make sure that we are really doing what is mentioned in the manuscript.

checked; the dfft and lsff results have been reproduced

it runs and parallel reading of the topographic dataset sems to work; I have to move the writer out of the parallel routine

however, there is a bug with the REMA IO routine when both the lat and lon extent span multiple files, e.g. with grid cell 47 in this commit

so the code runs on most cases now, except grid cell indices 0 and 1 on the ICON R2B4 grid. Related to these issues, I anticipate I/O errors across the W180-E180 boundary.

requires cleaning up, and some corner cases still exist...

I got to comment up what I am doing though, since I am dealing with the corner cases separately.

…tting diagnostics however, there is a gap between the MERIT and REMA datasets, and the lat-lon grid are also different. This latitudinal strip is not important for orographic data, as only a few small islands exist here, but it is important for me to find a sensible way to glue these two datasets together, e.g., by interpolation.

…on-grid

Small difference of about 0.1% improvement for grid cells close to the poles.

Parallelisation, NetCDF outputs, plotting also verified to work; should be ready for HPC runs.

When one of the ICON grid cell's vertex is at 180 longitude, the ETOPO reader was loading all the topography around the globe. The error has been fixed by this commit.

These topographic features are not taken into account in the second approximation step.

Based on the latitude of the cells.

Verified that we are loading correct geographical features at the correct lat-lon domains, thanks Claude!

Need to load threadsafe HDF, e.g., module load hdf5/1.12.1-threadsafe-gcc-11.2.0 module load netcdf-c/4.8.1-gcc-11.2.0 and reinstall Python library against these libraries: pip install --no-binary=...

Ocassionally, the plotter crashes ...?

The idea of the land-ocean ratio verification is to make sure that the NetCDF outputs have roughly as many number of land and ocean cells as we expect. Verification script with etopo_cg=40: Land cells: 8509 Ocean cells: 11971 Merging and verification scripts with outputs from etopo_cg=4: Land cells (is_land=1): 8622 Ocean cells (is_land=0): 11858 Both scripts are consistent. Furthermore, the plots are sensible.

* grid: add cell_area attribute, auto-populated by read_dat from ICON grid file * grid.apply_f: skip cell_area in non_convertibles so radian conversion leaves it untouched * nc_writer.grp_struct: accept cell_area; write as "m^2" variable per cell when provided * icon_etopo_global.do_cell: pass grid.cell_area[c_idx] to grp_struct on both ocean and land branches * merge_icon_etopo_outputs: extract cell_area per group, propagate into merged NetCDF; backward-compatible if absent

* io.py: global _NETCDF_GLOBAL_LOCK serialises every nc.Dataset open across threads. * icon_etopo_global: wrap do_cell body in try/except that logs the traceback to logger and stderr before re-raising, so worker crashes surface a stack instead of dying silently. * icon_etopo_global: invert loop nesting so memory batches are the outer loop. A crash mid-run leaves all earlier memory batches complete; restart skips to the failing batch. * Dask client now created and closed per memory batch (not per NetCDF chunk); threads_per_worker hardcoded to 1 since the global lock makes >1 threads pointless. * Preserve existing single-worker = full-machine-memory feature for high-memory polar batches inside the new loop.

* TopographyTileCache pre-loads MERIT/ETOPO/REMA tiles into memory and exposes fast subset access for individual grid cells, avoiding repeated NetCDF opens during parallel cell processing. * Not yet wired into the main loop: Added as a building block for future I/O optimisation.

* runs/submit_etopo_global.sh: SLURM template (1 node, 128 CPUs, 256 GB, 48 h) targeting the adaptive runs.icon_etopo_global entry point. * scripts/check_slurm_resources, check_etopo_sizes, diagnose_netcdf_issue: pre-run diagnostics for HPC submission and data-tile integrity. * scripts/plot_pacific_detail, plot_verification_improved: post-run verification plots over land/ocean fractions. * verify_icon_etopo_land_ocean: drop stale reference to the now-removed legacy HPC script in a parameter-source comment.

* Conditional stub kicks in only if cartopy import fails — no-op in envs where cartopy is installed. * pycsa.__init__ eagerly imports pycsa.plotting.cart_plot which imports cartopy; the test suite doesn't actually call plotting, so the import chain just needs to succeed. Stub provides empty cartopy / cartopy.crs / cartopy.mpl.ticker / cartopy.feature / cartopy.io.shapereader packages.

* New get_etopo_data(lat_extent, lon_extent, etopo_cg) — byte-equivalent port of io.read_etopo_topo.get_topo. Mirrors __compute_idx, __get_fns, __get_lon_idxs, __load_topo, and the three-branch dateline logic (global / split_EW / normal). * Lazy ETOPO mode: constructor skips eager file opens when called with tile_filenames=[] and dataset_type='ETOPO'. Files open on first access in _open_etopo_tile; handles + coord arrays cached for the rest of the worker's lifetime. * Free function compute_split_EW(lon_verts) using the robust span_360 < lon_span formula (matches the post-550d1d5 io.py fix). * Use the shared _NETCDF_GLOBAL_LOCK from pycsa.core.io for every nc.Dataset open and every ds[var][...] slice — HDF5 isn't thread-safe. * MERIT-path fix: get_data_for_region's old (lon_max - lon_min) > 180 check false-positived on cells like the Aleutians; now delegates to compute_split_EW. No other MERIT changes. * tests/test_tile_cache_etopo_equivalence.py parametrises 4 cells: a typical non-dateline cell (1086), the Aleutians false-positive case (2311), a genuine dateline crossing (1074, split_EW=True), and an extreme south-polar cell (17408). Asserts array_equal for lat / lon / topo against the reference reader. All 4 pass.

* tile_cache: add module-level _WORKER_CACHE + init_worker_cache / get_worker_cache / close_worker_cache helpers. Each Dask worker is a separate process, so the cache lives as per-process state; handles stay open across cells in the same worker. init_worker_cache is idempotent — a second call with the same data_dir is a no-op. * icon_etopo_global.do_cell: replace the per-cell read_etopo_topo construction with a get_worker_cache().get_etopo_data(...) call. split_EW now comes from the module-level compute_split_EW helper instead of the reader instance. * icon_etopo_global per-batch loop: after Client(...) is created, call client.run(tile_cache.init_worker_cache, params.path_etopo, "ETOPO") to populate the cache on every worker once per batch. * read_etopo_topo in pycsa/core/io.py is untouched — tests and archived debug scripts still use it directly. * tests/test_tile_cache_etopo_equivalence.py: add test_worker_cache_lifecycle covering init/get/close happy path, idempotent re-init, RuntimeError when uninitialised, and a functional round-trip via the worker-cache path for cell 1086. All 5 tests pass.

* Triggers restricted to push/PR on main + workflow_dispatch, removing the double-run that occurred when a feature branch had an open PR (both push and pull_request events fired the same workflow). * New format-check job runs black --check . (codebase is already clean). * docs job preserves the existing Sphinx build + peaceiris deploy to gh-pages on push-to-main.

ray-chew added 30 commits May 13, 2024 13:01

updated compactifier notebook to the latest version

6f419b6

icon_usgs_test.py: checked output, it runs

8749a22

icon grid with merit data works

2a117af

got to create MWE and complete netcdf4 io routines

ICON regional MERIT run with CSAM implemented

52a446d

added ICON regional run parameters file

39ca7e4

moved (kks,lls) computation from physics.py to var.py entirely

7305202

updated .gitignore file

969bb3a

implemented NetCDF4 writer class; outputs data structure according to…

6f8171d

… MS-GWaM's IO module

utils.py:pick_cell is required in the prepare_orog notebook

c1a8739

removed it from deprecated status; added comments

latest prepare_orog notebook

7dbb127

updated io.py to support reading of REMA datasets (#5)

df9274f

remove hard-coded paths and filenames

034cda8

all paths and filenames should import from local_paths; although I have not been thorough with checking this, and not all run scripts have been updated accordingly

minor changes to vis.plotter and wrappers.diagnostics

5ac0be0

see changes for more details

made all imports relative; package now works with pip install

78ee0b6

I finally have a reason to move away from the iPython environment, and so now all imports should be done properly

improved src.delaunay.get_land_cells documentation

eaf1a83

ICON merit global run script seems to give sensible results now

82bc4f5

I must update the script to the latest wrapper components that are also used in Delaunay runs to make sure that we are really doing what is mentioned in the manuscript.

updated relative import paths in diagnostics.py

1d815da

updated global ICON script to latest machinery

ef2d8ae

checked; the dfft and lsff results have been reproduced

icon_merit_global: removed commented code

d87c0d6

icon_merit_global: fixed bug in the is_land indexing

2d37f93

intermediate commit for parallel runs

b53b06c

it runs and parallel reading of the topographic dataset sems to work; I have to move the writer out of the parallel routine

parallel run seems to work now

5ee9e01

however, there is a bug with the REMA IO routine when both the lat and lon extent span multiple files, e.g. with grid cell 47 in this commit

updated icon_merit_regional to latest imports

ab1b992

this is an interim commit with parallelisation switched off

fe6cdf8

so the code runs on most cases now, except grid cell indices 0 and 1 on the ICON R2B4 grid. Related to these issues, I anticipate I/O errors across the W180-E180 boundary.

intermediate commit for robust MERIT I/O across E-W split

dbf7915

requires cleaning up, and some corner cases still exist...

the I/O for MERIT REMA seems to work for tricky grid cells now

e04e117

I got to comment up what I am doing though, since I am dealing with the corner cases separately.

intermediate commit before tackling discrepancies in MERIT and REMA l…

fe661f6

…on-grid

i/o routine now works for all cells on the ICON R2B4 grid

e03aefd

ray-chew and others added 28 commits October 23, 2025 03:20

Fixed bug in lin_reg module

f47b0c1

(#13) Getting closer to global ETOPO runs

4d4b622

(#13) Testing ETOPO global ICON runs

9501be3

(#13) Chunking by NetCDF outputs

cde6624

(#8) Attempt at improving planar projection

6fd7d98

Small difference of about 0.1% improvement for grid cells close to the poles.

(#8, #13) Validated global run with centered projection

41c47b1

Parallelisation, NetCDF outputs, plotting also verified to work; should be ready for HPC runs.

Restructured tests

68cec16

(#7, #8) Fixed bugs in ETOPO planar projection

550d1d5

When one of the ICON grid cell's vertex is at 180 longitude, the ETOPO reader was loading all the topography around the globe. The error has been fixed by this commit.

(#3) Updated some straggler imports

f3ec037

(#17) Mask out elevation below -200m

8a008a0

These topographic features are not taken into account in the second approximation step.

(#18) Implement dynamic workers and memory allocation

4dde4a6

Based on the latitude of the cells.

(#19) Added sanity check for loaded topography

1d85321

Verified that we are loading correct geographical features at the correct lat-lon domains, thanks Claude!

(#14) Added logging function

d05ebe8

(#13) Added hardware specific configurations

ad1a4cc

(#15) Fresh HPC install

0f7759e

(#13) Threadsafe I/O support

35ad5ac

Need to load threadsafe HDF, e.g., module load hdf5/1.12.1-threadsafe-gcc-11.2.0 module load netcdf-c/4.8.1-gcc-11.2.0 and reinstall Python library against these libraries: pip install --no-binary=...

(#13) Resolve bug in diagnostic plotter

37b6f10

Ocassionally, the plotter crashes ...?

(#13) If only 1 worker, use all available memory.

e6a6a26

(#30) Blacked

5e6c6ca

ray-chew merged commit afdbc11 into main May 12, 2026
2 checks passed

ray-chew deleted the ext/cached_topo branch May 12, 2026 09:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make each worker access cached topography file#29

Make each worker access cached topography file#29
ray-chew merged 78 commits into
mainfrom
ext/cached_topo

ray-chew commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ray-chew commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant