Feat/forecast inference dataset creation from superjuice #18
Merged: khintz merged 32 commits into dmidk:main from leifdenby:feat/forecast-inference-dataset-creation-from-superjuice on Jan 27, 2026
Commits
- ae0411c cli for building forecast inference datasets (leifdenby)
- 00d501d Merge branch 'main' of https://github.com/dmidk/mlwm-deployment into … (leifdenby)
- 7705d85 working inference dataset creation (leifdenby)
- 94742f8 able to load inference datastore and config in neural-lam (leifdenby)
- 081e605 wip on inference run (leifdenby)
- 9ff6393 Merge branch 'main' of https://github.com/dmidk/mlwm-deployment into … (leifdenby)
- 316ccb5 first working inference entry-point! (leifdenby)
- 91b7104 cleanup (leifdenby)
- d0b6acc more cleanup (leifdenby)
- 2982d4c more cleanup #2 (leifdenby)
- ff7966c remove src from pyproject.toml (leifdenby)
- 9dcfa6a include src/ in container image (leifdenby)
- 16f5c99 disable wandb (leifdenby)
- 5f56aef move runtime args to env vars and support multiple datastores (leifdenby)
- db44c91 add developing notes (leifdenby)
- ebbf192 ":" -> "." in datastore input path overrides (leifdenby)
- c0e06a1 update for upstream fixes (leifdenby)
- 62bd766 expose workdir through env var (leifdenby)
- 47904d4 use single gpu during inference (leifdenby)
- 8835370 no inline comments in multiline bash commands (leifdenby)
- c6368f0 Fixes for supporting inference on DGX Spark (leifdenby)
- 29f7cf9 fix linting (leifdenby)
- 54e6027 add comment wrt inference workdir (leifdenby)
- c58cc36 add comment about splits in config (leifdenby)
- 0f4a959 add missing docstring (leifdenby)
- aaa75c1 use new artifact without orography (leifdenby)
- 2d1d0cc add .env creation util (leifdenby)
- 3696515 Final fixes for inference on superjuice
- 8dbdc1a last changes from superjuice
- 93fe177 Merge branch 'main' of https://github.com/dmidk/mlwm-deployment into … (leifdenby)
- f8ca5c7 remove uv cache bust (leifdenby)
- 9f3f023 fix linting (leifdenby)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,3 +4,4 @@ inference_artifact/ | |
| *.yaml | ||
| inference_workdir/ | ||
| .env | ||
| wandb/ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,6 +5,95 @@ surface variables from DANRA, only 10 days of data and only trained 10 | |
| epochs. It is intended only as a demonstration of the inference pipeline and is | ||
| expected to give very poor results. | ||
|
|
||
| ## Building image and running inference | ||
|
|
||
| Currently, building the image and running inference are only supported on the "superjuice" machine (`27sj894.dmi.dk`). | ||
|
|
||
| ### Building the image | ||
|
|
||
| To build the image on "superjuice" (`27sj894.dmi.dk`) we need to set the AWS tokens to read the inference artifact and also use the local http proxy for pulling the base image: | ||
|
|
||
| ```bash | ||
| export AWS_SECRET_ACCESS_KEY=<secret-key-to-read-inference-artifact> | ||
| export AWS_ACCESS_KEY_ID=<access-key-to-read-inference-artifact> | ||
| export MLWM_PULL_PROXY=http://squid1.dmi.dk:3128 | ||
| ``` | ||
|
|
||
| Then build the image with: | ||
|
|
||
| ```bash | ||
| ./build_image.sh | ||
| ``` | ||
|
|
||
| ### Running inference | ||
|
|
||
| On "superjuice" (`27sj894.dmi.dk`), run inference for a given analysis time (e.g. `2019-02-04T12:00`) and forecast duration (e.g. `PT18H`) using DINI initial conditions (read from AWS S3) with: | ||
|
|
||
| ```bash | ||
| ./run_inference_container.sh 2019-02-04T12:00 PT18H | ||
| ``` | ||
|
|
||
| Currently this script uses a workaround to get GPU access with rootless Podman. This is required because the necessary system-level NVIDIA Container Toolkit integration is not available on this system. This means that the standard Podman/Docker flag: | ||
|
|
||
| --gpus all | ||
|
|
||
| does not work out of the box, even though the host has a functioning NVIDIA driver and GPUs. | ||
|
|
||
| WHY THIS IS NECESSARY | ||
|
|
||
| Normally, GPU support in containers relies on the NVIDIA Container Toolkit, which at runtime: | ||
|
|
||
| - exposes /dev/nvidia* device nodes to the container | ||
| - bind-mounts the host NVIDIA driver libraries (most importantly libcuda.so.1) | ||
| - injects utilities such as nvidia-smi | ||
|
|
||
| In a rootless Podman setup without system-level NVIDIA integration: | ||
|
|
||
| - --gpus all is a no-op (the flag is accepted but does not expose any GPU to the container) | ||
|
Contributor
no-op ? |
||
| - libcuda.so.1 is not available inside the container | ||
| - CUDA frameworks (PyTorch, Lightning, etc.) report that no GPU is available | ||
|
|
||
| WORKING COMMAND (ROOTLESS, NO SUDO) | ||
|
|
||
| ```bash | ||
| podman run --rm \ | ||
| --device /dev/nvidia0 \ | ||
| --device /dev/nvidiactl \ | ||
| --device /dev/nvidia-uvm \ | ||
| --device /dev/nvidia-uvm-tools \ | ||
| --device /dev/nvidia-modeset \ | ||
| --shm-size=32g \ | ||
| -v /lib/x86_64-linux-gnu/libcuda.so.1:/lib/x86_64-linux-gnu/libcuda.so.1:ro \ | ||
| -v /lib/x86_64-linux-gnu/libnvidia-ml.so.1:/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro \ | ||
| -v /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:ro \ | ||
| -v ./inference_workdir/:/workspace/inference_workdir/ \ | ||
| localhost/surface-dummy-model_dini:latest | ||
| ``` | ||
|
|
||
| With this setup, CUDA becomes available inside the container. | ||
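The workaround above only works if every listed device node and driver library exists on the host. As a minimal sketch (not part of this PR), a pre-flight check could report which of them are missing before `podman run` is called. The helper name `missing_paths` and the `REQUIRED_PATHS` array are hypothetical; the paths themselves are copied from the command above.

```bash
#!/bin/bash
# Hypothetical pre-flight helper (not part of the PR): print every path
# given as an argument that does not exist on the host.
missing_paths() {
    local path
    for path in "$@"; do
        # -e is true for regular files, resolved symlinks and device nodes alike
        [ -e "$path" ] || echo "$path"
    done
}

# Device nodes and driver libraries taken from the podman command above
REQUIRED_PATHS=(
    /dev/nvidia0
    /dev/nvidiactl
    /dev/nvidia-uvm
    /dev/nvidia-uvm-tools
    /dev/nvidia-modeset
    /lib/x86_64-linux-gnu/libcuda.so.1
    /lib/x86_64-linux-gnu/libnvidia-ml.so.1
    /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
)

missing="$(missing_paths "${REQUIRED_PATHS[@]}")"
if [ -n "$missing" ]; then
    echo "Missing on this host (GPU passthrough will likely fail):" >&2
    echo "$missing" >&2
fi
```

On a host with a different library layout (for example a non-x86_64 machine) the library paths would need adjusting, so the list is best read as specific to "superjuice".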
|
|
||
| WHAT IS NEEDED TO USE `--gpus all` INSTEAD (RECOMMENDED) | ||
|
|
||
| To enable the standard workflow: | ||
|
|
||
| ```bash | ||
| podman run --gpus all ... | ||
| ``` | ||
|
|
||
| the following needs to be provided system-wide by IT: | ||
|
|
||
| 1. Install NVIDIA Container Toolkit on the host | ||
| 2. Enable Container Device Interface (CDI) or OCI hooks for Podman | ||
| 3. Generate the NVIDIA CDI specification using: | ||
| nvidia-ctk cdi generate | ||
| 4. Ensure Podman is configured to consume CDI devices | ||
|
|
||
| Once enabled: | ||
| - GPU devices and driver libraries are injected automatically | ||
| - nvidia-smi works inside containers | ||
| - No manual --device or library mounts are required | ||
| - --gpus all works as expected | ||
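The steps listed above could look roughly like the following on the host. This is a sketch under assumptions, not a tested procedure for "superjuice": the output path `/etc/cdi/nvidia.yaml`, the `ubuntu` test image and the `run` dry-run wrapper are illustrative choices, and the final command requests GPUs through the CDI device name (`--device nvidia.com/gpu=all`) rather than `--gpus all`. Installing the NVIDIA Container Toolkit itself is distribution-specific and is left out.

```bash
#!/bin/bash
# Dry-run wrapper (hypothetical): print each command instead of executing it
# unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

cdi_setup_steps() {
    # Generate the CDI specification for the GPUs on this host
    # (the output path is an assumption)
    run sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    # List the CDI device names found in the generated specification
    run nvidia-ctk cdi list
    # Smoke test: request all GPUs via CDI and list them from inside a container
    run podman run --rm --device nvidia.com/gpu=all \
        --security-opt=label=disable ubuntu nvidia-smi -L
}

cdi_setup_steps
```

With the default `DRY_RUN=1` the script only prints the three commands, which makes it safe to review with IT before anything is changed on the host.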
|
|
||
| ## Upstream package change requirements | ||
|
|
||
| Relative to the `main` branch on both github.com/mllam/mllam-data-prep and | ||
|
|
@@ -69,3 +158,5 @@ adds: | |
| - make logging of validation steps optional in the training CLI (i.e. `--eval` mode) | ||
|
|
||
| - needs its own branch and PR | ||
|
|
||
| - `torch >= 2.6.0` defaults to `weights_only=True` when loading checkpoints | ||
76 changes: 76 additions & 0 deletions
configurations/surface-dummy-model_DINI/run_inference_container.sh
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,76 @@ | ||
| #!/bin/bash | ||
|
|
||
| # This script runs the inference container using initial conditions from DINI | ||
| # stored on AWS | ||
|
|
||
| # The script takes one required argument: the analysis time to use for | ||
| # inference, in ISO8601 format (e.g. 2025-11-05T09:00:00Z). If "Z" is omitted, | ||
| # UTC is assumed. An optional second argument can be provided to specify the | ||
| # forecast duration in ISO8601 duration format (e.g. PT18H for 18 hours). If | ||
| # not provided, the default is PT18H. | ||
|
|
||
| if [ "$#" -lt 1 ] || [ "$#" -gt 2 ] ; then | ||
| echo "Usage: $0 <ANALYSIS_TIME> [<FORECAST_DURATION>]" >&2 | ||
| echo "" >&2 | ||
| echo " ANALYSIS_TIME: the analysis time to start the forecast from in ISO8601 format" >&2 | ||
| echo " FORECAST_DURATION: the duration of the forecast in ISO8601 duration format (default PT18H)" >&2 | ||
| exit 1 | ||
| fi | ||
| ANALYSIS_TIME="$1" | ||
| if [ "$#" -eq 2 ] ; then | ||
| FORECAST_DURATION="$2" | ||
| else | ||
| FORECAST_DURATION="PT18H" | ||
| fi | ||
|
|
||
| # function to format analysis time to remove colons and ensure UTC 'Z' suffix | ||
| format_analysis_time() { | ||
| local iso="$1" | ||
|
|
||
| if [[ -z "$iso" ]]; then | ||
| echo "format_analysis_time: missing ISO8601 datetime" >&2 | ||
| return 1 | ||
| fi | ||
|
|
||
| if date -u -d "1970-01-01T00:00:00Z" >/dev/null 2>&1; then | ||
| # GNU date (Linux) | ||
| date -u -d "$iso" +"%Y-%m-%dT%H%M%SZ" || return 1 | ||
| else | ||
| # macOS / BSD fallback using Python stdlib | ||
| python3 - <<'EOF' "$iso" | ||
| from datetime import datetime, timezone | ||
| import sys | ||
|
|
||
| dt = datetime.fromisoformat(sys.argv[1].replace("Z", "+00:00")) | ||
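| dt = dt.replace(tzinfo=timezone.utc) if dt.tzinfo is None else dt.astimezone(timezone.utc)  # naive input is treated as UTC, not local time | ||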
| print(dt.strftime("%Y-%m-%dT%H%M%SZ")) | ||
| EOF | ||
| fi | ||
| } | ||
|
|
||
| # Create the inference working directory if it doesn't exist | ||
| mkdir -p ./inference_workdir/ | ||
|
|
||
| # prepare environment variables for container | ||
| ANALYSIS_TIME=$(format_analysis_time "${ANALYSIS_TIME}") | ||
| DINI_ZARR="s3://harmonie-zarr/dini/control/${ANALYSIS_TIME}/single_levels.zarr/" | ||
| DATASTORE_INPUT_PATHS="danra.danra_surface=${DINI_ZARR},danra.danra_static=${DINI_ZARR}" | ||
| TIME_DIMENSIONS="time" | ||
| INFERENCE_WORKDIR="$(pwd)/inference_workdir/" | ||
|
|
||
| podman run --rm \ | ||
| --device /dev/nvidia0 \ | ||
| --device /dev/nvidiactl \ | ||
| --device /dev/nvidia-uvm \ | ||
| --device /dev/nvidia-uvm-tools \ | ||
| --device /dev/nvidia-modeset \ | ||
| -v /lib/x86_64-linux-gnu/libcuda.so.1:/lib/x86_64-linux-gnu/libcuda.so.1:ro \ | ||
| -v /lib/x86_64-linux-gnu/libnvidia-ml.so.1:/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro \ | ||
| -v /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:ro \ | ||
| --shm-size=32g \ | ||
| -v "${INFERENCE_WORKDIR}":/workspace/inference_workdir:Z \ | ||
| -e DATASTORE_INPUT_PATHS="${DATASTORE_INPUT_PATHS}" \ | ||
| -e TIME_DIMENSIONS="${TIME_DIMENSIONS}" \ | ||
| -e ANALYSIS_TIME="${ANALYSIS_TIME}" \ | ||
| -e FORECAST_DURATION="${FORECAST_DURATION}" \ | ||
| localhost/surface-dummy-model_dini:latest |
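The script passes `DATASTORE_INPUT_PATHS` to the container as a comma-separated list of `key=path` pairs, where each key appears to have the form `<datastore>.<input>`. The code that consumes this variable inside the container is not shown in this diff, so the following is only a sketch of how such a value can be split apart; `parse_datastore_input_paths` is a hypothetical helper and the analysis time is an example value in the colon-free format the script produces.

```bash
#!/bin/bash
# Sketch only: split a DATASTORE_INPUT_PATHS-style string into its pairs.
# The real parsing happens inside the container and may differ.
parse_datastore_input_paths() {
    local spec="$1" pair
    local -a pairs
    # Split on commas into an array; IFS is changed for this read only
    IFS=',' read -r -a pairs <<< "$spec"
    for pair in "${pairs[@]}"; do
        # Key is everything before the first "=", value everything after it
        printf '%s -> %s\n' "${pair%%=*}" "${pair#*=}"
    done
}

ANALYSIS_TIME="2019-02-04T120000Z"
DINI_ZARR="s3://harmonie-zarr/dini/control/${ANALYSIS_TIME}/single_levels.zarr/"
parse_datastore_input_paths \
    "danra.danra_surface=${DINI_ZARR},danra.danra_static=${DINI_ZARR}"
```

Splitting on the first `=` only keeps any later `=` characters in the path as part of the value.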
Would you not expect it to work on one of the Sparks?