Andy pointed out this slowdown during last week's TWG meeting. As Dougie noted in this comment, adding even a single density-mapped diagnostic adds ~20% (or more) to the runtime, while adding further density-mapped diagnostics does not increase the runtime any more.
With help from Angus and Ed, I have now managed to reproduce the same issue in a standalone benchmark using the following steps (with these aggressive compiler flags for vectorisation):
git clone --recursive https://github.com/noaa-gfdl/mom6-examples.git
cd mom6-examples/ocean_only
module load openmpi/5.0.8 intel-llvm-compiler/2025.2.0 netcdf/4.9.2 python3-as-python
FCFLAGS="-O2 -g3 -fno-omit-frame-pointer -O2 -xcascadelake -qopt-zmm-usage=high -vec-threshold0" FFLAGS="-O2 -g3 -fno-omit-frame-pointer -O2 -xcascadelake -qopt-zmm-usage=high -vec-threshold0" LDFLAGS="-O2 -g3 -fno-omit-frame-pointer -O2 -xcascadelake -qopt-zmm-usage=high -vec-threshold0" make -j
cd ../../
git clone --recursive https://github.com/marshallward/mom6-benchmark
cd mom6-benchmark/benchmark_ALE
ln -s ../../mom6-examples/ocean_only/build/MOM6 ./MOM6
mpiexec -n 1 ./MOM6
Adding the density-mapped levels required applying the following patch in the benchmark_ALE directory (the patch was generated with git diff --patch MOM_input diag_table):
diff --git a/benchmark_ALE/MOM_input b/benchmark_ALE/MOM_input
index b52b28e..39fbc55 100644
--- a/benchmark_ALE/MOM_input
+++ b/benchmark_ALE/MOM_input
@@ -184,6 +184,14 @@ INITIAL_T_RANGE = -9.0 ! [degC] default = 0.0
! Initial temperature range (bottom - surface)
! === module MOM_diag_mediator ===
+NUM_DIAG_COORDS = 1
+DIAG_COORDS = "rho2 RHO2 RHO" !"z Z ZSTAR" !
+DIAG_COORD_DEF_RHO2 = "RFNC1:76,999.5,1020.,1034.1,3.1,1041.,0.002" ! default = "WOA09"
+REGRIDDING_ANSWER_DATE = 99991231 ! default = 20181231
! === module MOM_MEKE ===
USE_MEKE = True ! [Boolean] default = False
diff --git a/benchmark_ALE/diag_table b/benchmark_ALE/diag_table
index 42b4a98..02ceeb8 100644
--- a/benchmark_ALE/diag_table
+++ b/benchmark_ALE/diag_table
@@ -47,6 +47,10 @@ benchmark_ALE
"ocean_model", "zos", "zos", "ocean_month", "all", "mean", "none",2
"ocean_model", "Rd1", "Rd1", "ocean_month", "all", "mean", "none",2
+# monthly 3d fields on rho2
+"access-om3.mom6.3d.umo+rho2.1mon.mean.%4yr", 1, "months", 1, "days", "time", 1, "years"
+"ocean_model_rho2", "umo", "umo", "access-om3.mom6.3d.umo+rho2.1mon.mean.%4yr", "all", "average", "none", 2
+
# 3d annual
"ocean_model_z", "agessc", "agessc", "ocean_annual_z", "all", "mean", "none",2
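To put a number on the slowdown, the benchmark can be timed once before and once after applying the patch above. The following is only a sketch: `time_run` and `slowdown_pct` are hypothetical helper names, and `rho2_diag.patch` is an assumed file name for the saved patch. It is not how the ~20% figure quoted earlier was measured.

```shell
#!/bin/sh
# Hypothetical helper: wall-clock seconds taken by a command.
# The command's own output is written to the log file given as $1.
time_run() {
    log="$1"; shift
    start=$(date +%s)
    "$@" > "$log" 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical helper: percentage runtime increase of the patched run
# relative to the baseline run (both given in seconds).
slowdown_pct() {
    awk -v base="$1" -v patched="$2" \
        'BEGIN { printf "%.1f\n", 100 * (patched - base) / base }'
}

# Intended usage from benchmark_ALE (left commented out, since it needs the MOM6 build):
#   ulimit -s unlimited
#   base=$(time_run baseline.log mpiexec -n 1 ./MOM6)
#   git apply rho2_diag.patch      # assumed file name for the patch above
#   patched=$(time_run rho2.log mpiexec -n 1 ./MOM6)
#   slowdown_pct "$base" "$patched"
```

Whole-second resolution from `date +%s` is coarse, so this is only meaningful when the benchmark runs for at least a few minutes.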
I am going to use this issue to report back on my findings: first from benchmark_ALE, and then from the full OM3 25k IAF config.
Misc runtime notes:
- I needed to run ulimit -s unlimited on the command line before running MOM6; otherwise I was getting segfaults from exceeding the stack size. I was surprised by this because I thought MOM6 used dynamic memory. This suggests that we might need to add a compiler flag to the effect of mcmodel=huge
- make clean in mom6-examples does not clean the FMS build; make clean.fms needs to be executed as well (to make sure that new compiler/linker flags are applied consistently to the entire build)
Adding link-time optimisation with -flto requires these changes:
- Add -flto to all the compiler flags
- Add the Fortran compiler flags to LDFLAGS, and then add -flto -fuse-ld=lld
- Edit the FMS make to change AR (which is hardcoded to ar) to llvm-ar, corresponding to the loaded oneAPI compiler
- Fix the missing path for libnetcdff.so by adding LD_LIBRARY_PATH=$NC_ROOT/lib/Intel:$LD_LIBRARY_PATH (and confirm that ldd shows that MOM6 resolves the correct netcdf fortran library)
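The LTO changes and the clean-build note above can be collected into one script. This is a sketch under assumptions, not the exact commands that were run: it assumes it is started from mom6-examples/ocean_only with the modules already loaded, `BASE_FLAGS`, `LTO_FCFLAGS`, `LTO_LDFLAGS`, `DRY_RUN` and `run` are names introduced here for illustration, and the AR edit in the FMS make still has to be done by hand because that value is hardcoded.

```shell
#!/bin/sh
# Flags from the build command earlier in this issue (duplicate -O2 dropped).
BASE_FLAGS="-O2 -g3 -fno-omit-frame-pointer -xcascadelake -qopt-zmm-usage=high -vec-threshold0"

# Compiler flags: the base flags plus -flto.
LTO_FCFLAGS="$BASE_FLAGS -flto"

# Linker flags: the Fortran flags plus -flto -fuse-ld=lld.
LTO_LDFLAGS="$BASE_FLAGS -flto -fuse-ld=lld"

# By default only print the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Make libnetcdff.so resolvable at run time (NC_ROOT is set by the netcdf module).
if [ "$DRY_RUN" != "1" ]; then
    export LD_LIBRARY_PATH="$NC_ROOT/lib/Intel:$LD_LIBRARY_PATH"
fi

# Clean both the MOM6 and the FMS builds so the new flags apply everywhere.
run make clean
run make clean.fms

# Rebuild with LTO enabled (AR in the FMS make must already point to llvm-ar).
run env FCFLAGS="$LTO_FCFLAGS" FFLAGS="$LTO_FCFLAGS" LDFLAGS="$LTO_LDFLAGS" make -j

# Confirm which netcdf Fortran library the executable resolves.
run ldd build/MOM6
```

Running it with the default `DRY_RUN=1` just lists the commands, which is a cheap way to check the flag strings before committing to a full rebuild.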