This repository contains a Docker environment and execution tools for running the SICOPOLIS ice sheet model with checkpointing capabilities. The Center for High Throughput Computing (CHTC) at the University of Wisconsin-Madison (USA) is used as the reference execution infrastructure, with HTCondor as the workload manager. However, the workflow presented here can be adapted to any system.
This repository assumes you already have a local copy of SICOPOLIS. If not, run the commands below to download the source code and required input files. For further details on the setup procedure, refer to the official documentation.
git clone https://github.com/sicopolis/sicopolis.git
cd sicopolis
./get_input_files.sh
./copy_templates.sh
├── src/
│   ├── Dockerfile                    # Dockerfile to create the sicopolis-chtc image
│   ├── exec.sh                       # Main execution script with snapshot management
│   ├── sico.sub                      # HTCondor submit file
│   ├── sicoCheckpoints.py            # Python tool for concatenating snapshot outputs
│   └── sico_specs_CheckpointTest.h   # SICOPOLIS configuration file for testing
├── LICENSE
└── README.md
The Docker image contains only the runtime environment and dependencies required to run SICOPOLIS:
- Ubuntu 22.04 base
- Fortran compiler (gfortran)
- NetCDF libraries (libnetcdf-dev, netcdf-bin, libnetcdff-dev, nco)
- LIS library (v2.1.8) for solving linear systems
Usage: Designed for CHTC deployments where the SICOPOLIS source code is transferred separately.
To build the image locally:
cd src
docker build -t sicopolis-chtc:latest .
Alternatively, pull the prebuilt image:
docker pull nsartore/sicopolis-chtc:latest
The checkpointing workflow is orchestrated by a single Bash script, exec.sh, which manages SICOPOLIS execution with snapshot and restart capabilities. To start a simulation or resume it after a timeout, invoke the same command again.
exec.sh <SIMULATION_NAME> <OUTPUT_NAME> <SICOPOLIS_FILE> [MAX_RUN_TIME] [CORE_NB] [ANF_PATH_INIT] [VAR_TO_KEEP]
- SIMULATION_NAME: Name of the simulation configuration (must match the header file name)
- OUTPUT_NAME: Identifier for the output archive (typically $(Cluster)_$(Process))
- SICOPOLIS_FILE: Name of the compressed SICOPOLIS source archive (e.g., sicopolis.zip)
- MAX_RUN_TIME: Maximum runtime before timeout (default: 1d)
- CORE_NB: Number of CPU cores (default: 1)
- ANF_PATH_INIT: Optional path to the initial NetCDF restart file. Used only on the first execution of exec.sh.
- VAR_TO_KEEP: Optional comma-separated list of variables. When set, all variables except those listed are removed from the 2D and 3D files, which saves space and transfer time. Format: var1,var2,var3,...
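The MAX_RUN_TIME values that appear in this README (1d, 4m) are a number followed by a unit suffix. As a rough illustration only, the sketch below converts such strings to seconds. It assumes the suffixes follow the GNU timeout convention (s, m, h, d); whether exec.sh accepts exactly this set is an assumption, and parse_run_time is a hypothetical helper that is not part of the repository.

```python
import re

# Assumed unit suffixes (GNU timeout convention); not confirmed against exec.sh.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_run_time(value: str) -> float:
    """Convert a duration such as '4m' or '1d' to seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smhd]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    number, unit = match.groups()
    # A bare number without a suffix is treated as seconds.
    return float(number) * _UNIT_SECONDS[unit or "s"]


print(parse_run_time("1d"))  # the documented default for MAX_RUN_TIME
print(parse_run_time("4m"))
```

A helper like this can be handy when estimating how many outputs fit inside one run window before submitting a job.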
First Run (no existing snapshots):
- Extracts and configures SICOPOLIS
- Runs the simulation with an optional initial restart file (-a flag)
- On timeout (exit code 85), moves output to snapshot/00000/
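The timeout behaviour described above can be pictured with a small Python analogue. This is a sketch of the general pattern (run a command under a time limit and report exit code 85 when the limit is hit), not the Bash logic that exec.sh actually uses; run_with_timeout and the sleeping child process are hypothetical stand-ins.

```python
import subprocess
import sys

CHECKPOINT_EXIT_CODE = 85  # exit code this README associates with a timeout


def run_with_timeout(command: list[str], max_seconds: float) -> int:
    """Run command; return its exit code, or 85 if it exceeds max_seconds."""
    try:
        completed = subprocess.run(command, timeout=max_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return CHECKPOINT_EXIT_CODE
    return completed.returncode


# Stand-in for a long simulation: a child process that sleeps for 5 seconds.
slow_job = [sys.executable, "-c", "import time; time.sleep(5)"]
exit_code = run_with_timeout(slow_job, max_seconds=0.5)
print(exit_code)
```

A caller would then treat 85 as "save a snapshot and resubmit" and any other non-zero code as a real failure.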
Subsequent Runs (snapshots exist):
- Identifies the last completed snapshot directory: snapshot/XXXXX/
- Selects the second-to-last 3D NetCDF file: ${SIMULATION_NAME}NNNN.nc. The last file may be incomplete; the second-to-last provides a safe restart point
- Extracts the time value from the NetCDF file using ncdump
- Modifies the SICOPOLIS header file (sico_specs_${SIMULATION_NAME}.h):
  #define TIME_INIT0 '<extracted_time>'
  #define ANFDATNAME '<snapshot_file.nc>'
  #define ANF_DAT 3
- Runs SICOPOLIS with the updated restart configuration
- On timeout, moves output files to an incremented snapshot/XXXXX/ directory
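The restart steps above (pick the second-to-last 3D file, then rewrite three #define lines) can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of exec.sh: pick_restart_file and patch_header are hypothetical names, the time is passed in as a string rather than read with ncdump, and the quoting of the values simply mirrors the header snippet shown in this README.

```python
import re


def pick_restart_file(files_3d: list[str]) -> str:
    """Return the second-to-last file in sorted order (the last may be incomplete)."""
    ordered = sorted(files_3d)
    if len(ordered) < 2:
        raise ValueError("need at least two 3D files for a safe restart")
    return ordered[-2]


def patch_header(header_text: str, time_init: str, restart_file: str) -> str:
    """Rewrite the three restart-related #define lines in a header string."""
    new_values = {
        "TIME_INIT0": f"'{time_init}'",
        "ANFDATNAME": f"'{restart_file}'",
        "ANF_DAT": "3",
    }
    for name, value in new_values.items():
        # Requiring whitespace after the name keeps ANF_DAT from matching ANFDATNAME.
        pattern = rf"^(#define\s+{name})\s+.*$"
        header_text = re.sub(
            pattern,
            lambda m, value=value: f"{m.group(1)} {value}",
            header_text,
            flags=re.MULTILINE,
        )
    return header_text


files = ["CheckpointTest0003.nc", "CheckpointTest0001.nc", "CheckpointTest0002.nc"]
restart = pick_restart_file(files)

# Made-up header fragment and time value, for illustration only.
header = "#define TIME_INIT0 0.0\n#define ANFDATNAME 'none'\n#define ANF_DAT 1\n"
patched = patch_header(header, "1500.0", restart)
print(patched)
```

Sorting by name works here only because the NNNN counter is zero-padded; with unpadded counters a numeric sort would be needed.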
Each checkpoint run produces an individual snapshot directory. The Python utility sicoCheckpoints.py concatenates the outputs from all snapshots into a single, continuous time series.
- Python 3: numpy, xarray, netCDF4
- NCO tools: ncecat, ncrcat
- Compression tools: tar, pigz
python3 sicoCheckpoints.py [OPTION] <output_path>
- Cleanup: Removes any existing concatenated files (*1D*, *2D*, *3D*)
- Iterate through snapshots: Processes each numbered subdirectory (00000, 00001, ...)
- Extract time bounds:
  - For intermediate snapshots: reads the time from the second-to-last 3D file
  - For the final snapshot: uses all available data
- Slice 1D data: Extracts the time series up to the checkpoint time from *_ser.nc
- Slice 2D data: Selects all 2D files (*_2d_*.nc) with time ≤ checkpoint time
- Concatenate: Uses NCO tools to merge the sliced outputs
- Cleanup: Removes intermediate files; retains the final ${SIMULATION_NAME}_1D.nc and ${SIMULATION_NAME}_2D.nc
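The slice-then-concatenate idea behind these steps can be shown on plain Python lists. The sketch below is a simplified stand-in, not the logic of sicoCheckpoints.py, which works on NetCDF files with NCO. concatenate_segments is a hypothetical name, and dropping samples at or before the previous cutoff (to avoid a duplicated restart time) is an assumption of this sketch.

```python
import math


def concatenate_segments(segments, checkpoint_times):
    """Trim each snapshot's series at its checkpoint time, then join them.

    segments: list of (times, values) pairs, one per snapshot directory.
    checkpoint_times: one cutoff per intermediate snapshot; the final
    snapshot is kept in full.
    """
    merged_times, merged_values = [], []
    previous_cutoff = -math.inf
    for index, (times, values) in enumerate(segments):
        is_last = index == len(segments) - 1
        cutoff = math.inf if is_last else checkpoint_times[index]
        for t, v in zip(times, values):
            # Keep samples after the previous restart point and up to this cutoff.
            if previous_cutoff < t <= cutoff:
                merged_times.append(t)
                merged_values.append(v)
        previous_cutoff = cutoff
    return merged_times, merged_values


# Two made-up snapshots: the first timed out after writing t=30,
# and the second restarted from the safe checkpoint at t=20.
snapshot_0 = ([0, 10, 20, 30], [1.0, 2.0, 3.0, 4.0])
snapshot_1 = ([20, 30, 40, 50], [3.0, 4.0, 5.0, 6.0])
times, values = concatenate_segments([snapshot_0, snapshot_1], checkpoint_times=[20])
print(times)
print(values)
```

The sample at t=30 from the first snapshot is discarded because it lies past the restart point and is recomputed by the second run.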
- -e or --extract2dFrom3d: if the 2D output files are not present, extracts them from the 3D files.
The CHTC uses HTCondor as its workload management platform. Job submission requires a configuration file: sico.sub.
STORAGE_PATH = file:///<path_to_storage>
SICOPOLIS_FILE = <name_of_sicopolis_archive>
MAX_RUN_TIME = <desired_maximum_runtime>
ANF_PATH_INIT = <initial_restart_file> # optional
CORE_NB = <number_of_cpu_cores>
RAM = <maximum_ram>
DISK = <maximum_disk_space>
Ensure sicopolis.zip is present in the storage area: ${STORAGE_PATH}/sicopolis.zip
Edit sico.sub to set the resource requirements and queue mode. Two queue modes are supported:
Single simulation (uncomment the last two lines):
...
# SIMULATION_NAME = grl04_bm5_spinup02_holo_... # Must match a header file
# queue 1
Multiple simulations (provide a file listing simulation names, one per line):
...
queue SIMULATION_NAME from simulation_list.txt
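For the multiple-simulation mode, the list file can be generated from the header files, since each SIMULATION_NAME must match a header named sico_specs_<SIMULATION_NAME>.h. The following sketch shows one way to do that; the helper names are hypothetical, the demonstration runs in a temporary directory, and "ExampleRun" is a made-up simulation name.

```python
import tempfile
from pathlib import Path

PREFIX, SUFFIX = "sico_specs_", ".h"


def simulation_names(headers_dir: Path) -> list[str]:
    """List simulation names derived from sico_specs_<NAME>.h files."""
    headers = sorted(headers_dir.glob(f"{PREFIX}*{SUFFIX}"))
    return [h.name[len(PREFIX):-len(SUFFIX)] for h in headers]


def write_simulation_list(headers_dir: Path, list_path: Path) -> None:
    """Write one simulation name per line, as the queue statement expects."""
    lines = [f"{name}\n" for name in simulation_names(headers_dir)]
    list_path.write_text("".join(lines))


# Demonstration in a temporary directory with two empty header files.
with tempfile.TemporaryDirectory() as tmp:
    headers_dir = Path(tmp) / "headers"
    headers_dir.mkdir()
    for name in ("CheckpointTest", "ExampleRun"):  # "ExampleRun" is hypothetical
        (headers_dir / f"sico_specs_{name}.h").touch()
    list_path = Path(tmp) / "simulation_list.txt"
    write_simulation_list(headers_dir, list_path)
    listing = list_path.read_text()

print(listing)
```

In practice the list would usually be trimmed by hand afterwards, since a headers directory often contains more configurations than one wants to submit.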
condor_submit sico.sub
condor_q
condor_tail -f <job_id>
Download the output archive from the staging area:
scp <your_username>@CHTC:${STORAGE_PATH}/output_<cluster>_<process>_<simulation>.tar.gz output.tar.gz
Extract the output archive:
mkdir output && tar -xvf output.tar.gz -C output # Single-threaded decompression
Concatenate the time series:
python3 sicoCheckpoints.py output
The following example demonstrates how to run SICOPOLIS inside the sicopolis-chtc Docker image using exec.sh for checkpoint management. The commands use the sico_specs_CheckpointTest.h configuration file, running on 1 CPU core with a 9-minute timeout, which should produce approximately 3 checkpoints.
After copying the SICOPOLIS source code into the working directory and sico_specs_CheckpointTest.h into the headers subdirectory, the working directory should have the following structure:
.
├── sicopolis_source_code/
│   ├── docs/
│   ├── headers/
│   │   ├── sico_specs_CheckpointTest.h
│   │   └── ...
│   └── ...
└── exec.sh
docker run --user $(id -u):$(id -g) -v $(pwd):/sico -w /sico nsartore/sicopolis-chtc:latest ./exec.sh CheckpointTest prefix sicopolis.zip 4m 1
Depending on the computational speed of the host machine, one of two outcomes is expected:
- A snapshot/ directory is created if the simulation did not complete within the timeout. In this case, re-run the same Docker command to resume the simulation.
- A prefix_output_CheckpointTest.tar.gz archive is produced if the simulation completed successfully. Proceed to the next step.
The snapshot directory contains numbered subdirectories (00000, 00001, 00002, etc.), one per checkpoint run. The total number of subdirectories will vary depending on the computational speed of the host machine. To concatenate all snapshots into a single output file, place sicoCheckpoints.py in the working directory. The expected file structure is:
.
├── sicopolis_source_code/
│   └── ...
├── exec.sh
├── prefix_output_CheckpointTest.tar.gz
└── sicoCheckpoints.py
Extract the output archive:
mkdir output && tar -xvf prefix_output_CheckpointTest.tar.gz -C output
Then run the concatenation utility:
python3 sicoCheckpoints.py output
The files CheckpointTest_1D.nc and CheckpointTest_2D.nc will be generated in the snapshot/ directory, containing the complete concatenated time series. The final directory structure should be:
.
├── snapshot/
│   ├── 00000/
│   ├── 00001/
│   ├── ...
│   ├── CheckpointTest_1D.nc
│   └── CheckpointTest_2D.nc
├── sicopolis_source_code/
│   └── ...
├── exec.sh
├── prefix_output_CheckpointTest.tar.gz
└── sicoCheckpoints.py
- Path Configuration: exec.sh automatically updates sico_configs.sh to use the container paths NETCDFHOME=/usr and LISHOME=/opt/lis, so there is no need to modify them before running simulations.
- Header Files: The SICOPOLIS header file must exist at headers/sico_specs_${SIMULATION_NAME}.h within the source tree.
- Timeout Handling: Exit code 85 is intercepted by HTCondor as a checkpoint signal, not a job failure.
- ⚠️ ⚠️ ⚠️ SICOPOLIS Header Requirements ⚠️ ⚠️ ⚠️: Proper checkpointing requires the following SICOPOLIS header parameters to be configured correctly:
  - OUT_TIMES must be set to 1
  - When saving 2D and 3D fields separately, the 2D output interval must evenly divide the 3D output interval. In other words, every 3D snapshot must coincide with a 2D snapshot.
  - The 3D output interval must be configured such that at least two output files are written within every MAX_RUN_TIME window.
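The two interval requirements can be checked before submitting a job. The sketch below is a rough pre-flight check, not something the repository provides: both function names are hypothetical, intervals are assumed to be in model years, and years_per_hour is a throughput figure that would have to be measured on the target machine.

```python
def intervals_compatible(dt_2d: float, dt_3d: float, tol: float = 1e-9) -> bool:
    """True if the 2D interval evenly divides the 3D interval."""
    ratio = dt_3d / dt_2d
    return ratio >= 1 and abs(ratio - round(ratio)) < tol


def expected_3d_outputs(dt_3d: float, years_per_hour: float, window_hours: float) -> float:
    """Estimate how many 3D files are written in one MAX_RUN_TIME window.

    years_per_hour is an assumed, user-measured simulation throughput.
    """
    return years_per_hour * window_hours / dt_3d


# Made-up numbers for illustration only.
print(intervals_compatible(dt_2d=100.0, dt_3d=500.0))
print(intervals_compatible(dt_2d=200.0, dt_3d=500.0))
print(expected_3d_outputs(dt_3d=500.0, years_per_hour=1000.0, window_hours=24.0))
```

Because throughput varies between machines and over the course of a run, the estimate should sit comfortably above 2 rather than right at it.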
Nicolas B. Sartore
Research Assistant, Till Wagner Group
Department of Atmospheric and Oceanic Sciences
University of Wisconsin-Madison, USA
- Website: nsartore.me
- Email: nsartore@wisc.edu
- GitHub: github.com/nicsar2
For questions or issues related to this repository, please open a GitHub issue or contact the author directly by email.