feat(adastra): port ChargE3Net fine-tuning to AMD MI250X on CINES Adastra #1
Draft
speckhard wants to merge 8 commits into
…stra

Adds an Adastra-side variant of submit_charge3net.sh and a runbook
covering the seven blockers encountered during the port:
- HTTP proxy must be set explicitly (Adastra doesn't auto-export it),
- 30-day scratch purge wipes setup, so $LEMATRHO_ADASTRA_SETUP is
  rebuildable from sources,
- pip on Adastra defaults to gorgone.cines.fr (missing boto3 etc);
  --index-url https://pypi.org/simple is required,
- huggingface_hub Xet backend silently no-ops the payload fetch, so
  raw curl with Authorization: Bearer is used for the dataset,
- --qos=debug is not granted on the team accounts,
- group inode quota on /lus/scratch/CT10/c1816212/ is at the hard cap,
  so the submit dir lives on cad16353 scratch while the job is billed
  to c1816212 (account and scratch dir are independent dimensions),
- sbatch over SSH defaults WorkDir to $HOME unless cd'd first.

submit_charge3net_adastra.sh mirrors the Jean Zay script (auto-resume
from latest.pt, 50-epoch budget) but with MI250 SLURM headers, ROCm
HIP_VISIBLE_DEVICES alignment, batch_size=8 (HBM2e has 64 GB per GCD
vs A100's 40-80), val_probes=1000, and online W&B (the Adastra proxy
gives us live internet, so the Jean Zay offline-then-sync dance is
unnecessary).

Adds a regression test test_ignores_extra_columns for the dataset
loader: Entalpic/lemat-rho-v1 added Bader analysis columns
(bader_charges, bader_volumes, material_id) which would have broken
_build_parquet_index if it didn't honor the four-column _COLUMNS
allowlist. The test confirms the allowlist still holds.

Reference smoke run: job 4969516 on g1342, May 19 2026. 65,239 of
68,549 valid materials loaded from 69 parquet chunks. 1,150 training
steps in 12 min wall, train L1 down from 29.95 at step 50 to 5.67 at
step 1,000. Hit TIMEOUT before completing the epoch (expected: one
epoch needs ~150 min at batch=4), no val/test metrics yet; a follow-up
6h job under the production knobs will produce those.
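The raw-curl workaround for the dataset fetch can also be sketched in Python. This is a minimal illustration, not the commands from the runbook: the in-repo file path in the usage note and the helper names `build_request` and `download` are hypothetical, and the URL follows the Hugging Face `resolve` pattern for dataset files. `urllib` picks up `http_proxy` / `https_proxy` from the environment, which matters given the explicit proxy setup described above.

```python
import urllib.request

# URL pattern for fetching one file of a Hugging Face dataset repo.
HF_RESOLVE = "https://huggingface.co/datasets/{repo}/resolve/{revision}/{path}"


def build_request(repo, path, token, revision="main"):
    """Build an authenticated GET for one dataset file (no network I/O here)."""
    url = HF_RESOLVE.format(repo=repo, revision=revision, path=path)
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def download(req, dest, chunk=1 << 20):
    """Stream the response body to `dest` in 1 MiB blocks.

    urllib's default opener reads proxy settings from the environment,
    so the proxy exported in the shell is used automatically.
    """
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        while True:
            block = resp.read(chunk)
            if not block:
                break
            out.write(block)
```

A call such as `download(build_request("Entalpic/lemat-rho-v1", "chunk_000.parquet", token), "chunk_000.parquet")` would fetch one file, assuming that path exists in the repo; parallelism (the `xargs -P 8` part) is left out of the sketch.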
…uards

Adds tests/test_equivariance.py with 7 structural tests that pin down
the architectural properties needed for ChargE3Net's rotational
equivariance guarantee:
- Production model has 1.9M params (catches drift that would break
  loading charge3net_mp.pt).
- atom_irreps_sequence reaches lmax >= 4 (the "higher-order" in the
  paper title; a silent drop to lmax=0 would degenerate the model to a
  much weaker scalar-only baseline).
- Atom representation includes both even and odd parity components.
- get_irreps(500, lmax=4) returns 10 entries with no zero-multiplicity
  irreps (catches a regression that would silently delete some irreps).
- atom_irreps_sequence length matches num_interactions.
- Atom-model cutoff matches the 4.0 A baked into KdTreeGraphConstructor
  in LeMatRhoDataset.
- Final irreps are an e3nn o3.Irreps instance (replacing this with a
  plain list would silently break equivariance while still producing
  output).

A runtime equivariance check (rotate inputs, predict, compare) is the
gold standard but requires a real forward pass at production
hyperparameters that is too slow for a CPU unit test. The structural
tests cover the same property at the architecture level.

Tests autoskip when the sibling AIforGreatGood/charge3net repo is
absent.
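For readers unfamiliar with the "rotate inputs, predict, compare" check mentioned above, here is a toy version in pure Python. It is only a sketch of the technique: `toy_model` is a hypothetical vector-valued function, not ChargE3Net, and the rotation is restricted to the z-axis. For a scalar output such as a charge density, the comparison would instead be against the unrotated prediction.

```python
import math


def rotate_z(v, theta):
    """Rotate a 3-vector about the z-axis by `theta` radians."""
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)


def toy_model(points):
    """Hypothetical equivariant map: distance-weighted sum of positions.

    The weight depends only on |p| (rotation invariant), so the output
    vector rotates together with the inputs.
    """
    out = [0.0, 0.0, 0.0]
    for p in points:
        w = math.exp(-math.sqrt(sum(c * c for c in p)))
        for i in range(3):
            out[i] += w * p[i]
    return tuple(out)


def equivariance_error(model, points, theta):
    """Largest component of model(R x) - R model(x)."""
    lhs = model([rotate_z(p, theta) for p in points])
    rhs = rotate_z(model(points), theta)
    return max(abs(a - b) for a, b in zip(lhs, rhs))
```

An equivariant model should give an error at floating-point noise level, while a model that singles out a fixed axis gives an error of order one.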
… training
Two changes motivated by job 4969727 (FAILED after 1h47m on the previous
single-GPU submit):
1. Multi-GPU via torch DistributedDataParallel. The paper uses per-GPU
batch=16 across 4 GPUs (effective batch=64). Our previous Adastra
submit was single-GPU batch=8 — 8x smaller effective batch. With the
half-node submit (4 GCDs, 64 CPUs, 128 GB RAM, batch=16 per GCD) the
effective batch now matches the paper.
Implementation:
- New _setup_ddp / _is_ddp / _is_main helpers in train.py read
WORLD_SIZE / RANK / LOCAL_RANK / MASTER_ADDR / MASTER_PORT from
the env (set in the submit script via srun + scontrol show hostname).
- Backend is nccl which routes through RCCL on AMD ROCm builds.
- Model wrapped in DistributedDataParallel after .to(device).
- DistributedSampler injected into the train loader via a new
distributed=True flag on build_dataloaders. Val/test stay
non-distributed; cheap enough at 5% of 65k.
- DistributedSampler.set_epoch called each epoch for proper shuffling.
- All prints and wandb logs gated on is_main (rank 0 only).
- Save and load go through a new _unwrap helper so checkpoints are
interchangeable between single-GPU and DDP runs.
- dist.barrier at end of each epoch to keep ranks in lockstep
during checkpoint saves.
- dist.destroy_process_group at the very end.
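A rough sketch of what the env-reading helpers might look like, kept free of torch so it runs anywhere. The names `DdpEnv`, `read_ddp_env`, `is_main` and `unwrap` are illustrative and are not the actual `_setup_ddp` / `_is_main` / `_unwrap` code in train.py; the only library fact relied on is that DistributedDataParallel exposes the wrapped model as `.module`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DdpEnv:
    world_size: int
    rank: int
    local_rank: int
    master_addr: str
    master_port: int


def read_ddp_env(env=None):
    """Return DDP settings from the environment, or None for single-GPU runs."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    if world_size <= 1:
        return None  # no launcher variables: fall back to the single-GPU path
    return DdpEnv(
        world_size=world_size,
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )


def is_main(ddp):
    """Rank 0, or any single-GPU run, does the printing and checkpointing."""
    return ddp is None or ddp.rank == 0


def unwrap(model):
    """Strip a DDP-style wrapper so state dicts have the same keys either way."""
    return getattr(model, "module", model)
```

Saving `unwrap(model).state_dict()` rather than `model.state_dict()` is what keeps checkpoints free of the `module.` key prefix, so a file written by a DDP run loads in a single-GPU run and vice versa.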
2. Wandb soft-fail. wandb.init now sits inside try/except — if the
compute node can't reach api.wandb.ai through the proxy (which is
what killed job 4969727 after 5min of timeouts and 1h47m elapsed
total), the script logs a warning and sets use_wandb=False so
training proceeds with stdout + checkpoints only.
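The soft-fail pattern can be sketched as follows. This is an illustration under assumptions, not the code in train.py: `init_tracking` is a hypothetical helper, and the tracker's init function is passed in as an argument so the sketch has no dependency on wandb.

```python
import logging

log = logging.getLogger("train")


def init_tracking(init_fn, **kwargs):
    """Try to start experiment tracking; never let a failure stop training.

    Returns (run, use_wandb). On any exception the run is None and
    use_wandb is False, so callers can skip logging calls.
    """
    try:
        return init_fn(**kwargs), True
    except Exception as exc:  # deliberately broad: network, auth, timeouts
        log.warning("tracking init failed (%s); continuing without it", exc)
        return None, False
```

In the training script this would be called as `run, use_wandb = init_tracking(wandb.init, project=...)`, with every later logging call guarded by `use_wandb`.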
Submit script (submit_charge3net_adastra.sh) updated for half-node:
--nodes=1 --ntasks-per-node=4
--gpus-per-node=4 --cpus-per-task=16
--mem=125000M --time=06:00:00
plus srun-based DDP launcher that exports RANK/LOCAL_RANK per task,
batch_size=16 per GPU, val_probes=1000, wandb-mode=offline.
Test plan
- pytest tests/ ... 34 passed, 1 failure pre-existing (test_metrics
collection error from src.charge3net path shadowing in pytest;
unrelated, same on main).
- ruff format + check clean on the touched files.
- DDP path not yet exercised end-to-end on Adastra; the immediate
next step is a 6h submission. If the DDP init fails, the
single-GPU code path is still reachable by running without srun.
…om-scratch (TDD)
The submit script now reads LEMATRHO_TRAINING_MODE to switch between
two runs that share all infrastructure (same DDP, same hyperparams,
same dataset, same node layout) but differ in init:
pretrained (default) --ckpt-path charge3net_mp.pt
save-dir charge3net_checkpoints/
WANDB_NAME=pretrained_mp
from_scratch no --ckpt-path (random init)
save-dir charge3net_checkpoints_fromscratch/
WANDB_NAME=from_scratch
Auto-resume from latest.pt is per-mode (the two save-dirs don't
collide), so each arm can be relaunched independently via
sbatch ... submit_charge3net_adastra.sh until val NMAPE plateaus.
Also adds a LEMATRHO_DRY_RUN=1 escape hatch that prints the resolved
train command and exits 0 without sourcing the venv or invoking srun.
Used by the 9 new pytest tests in tests/test_submit_script.py:
- dry-run prints train command
- default mode is pretrained, uses MP checkpoint
- pretrained writes to charge3net_checkpoints (not fromscratch dir)
- from_scratch drops --ckpt-path completely and never references
charge3net_mp.pt
- from_scratch uses a separate save dir
- WANDB_NAME differs between modes
- invalid mode exits non-zero with a clear error
- batch-size 16, val-probes 1000 (paper-matching)
- wandb-mode is offline
TDD: 9 tests RED before the refactor, all GREEN after. Full suite
still 33 passed (data + model + equivariance + submit). ruff format
+ check clean.
Submission examples in the script header and in ADASTRA.md.
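The mode switch lives in the shell script, but its contract is easy to restate in Python. The sketch below mirrors the table above under assumptions: `resolve_mode` is a hypothetical function, and the exact flag spellings (`--save-dir`, `--batch-size`, `--val-probes`, `--wandb-mode`) are inferred from the test names rather than copied from the script.

```python
MODES = {
    "pretrained": {
        "ckpt": "charge3net_mp.pt",
        "save_dir": "charge3net_checkpoints",
        "wandb_name": "pretrained_mp",
    },
    "from_scratch": {
        "ckpt": None,  # random init: no --ckpt-path at all
        "save_dir": "charge3net_checkpoints_fromscratch",
        "wandb_name": "from_scratch",
    },
}


def resolve_mode(mode="pretrained"):
    """Return (wandb_name, train_args) for a training mode, or raise."""
    if mode not in MODES:
        raise ValueError(f"unknown training mode {mode!r}; expected one of {sorted(MODES)}")
    cfg = MODES[mode]
    args = [
        "--save-dir", cfg["save_dir"],
        "--batch-size", "16",
        "--val-probes", "1000",
        "--wandb-mode", "offline",
    ]
    if cfg["ckpt"] is not None:
        args = ["--ckpt-path", cfg["ckpt"]] + args
    return cfg["wandb_name"], args
```

Keeping the two arms in one table makes the "differ only in init" property visible: everything outside `MODES` is shared.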
PR 1 of a 2-PR stack to land DeepDFT as a baseline for the
ChargE3Net VASP-speedup experiment. This PR adds only the data
adapter; PR 2 will add the training submission (DDP-patched).
What's here:
deepdft_ft/__init__.py empty package marker
deepdft_ft/data.py LeMatRhoDeepDFTDataset adapter
tests/test_deepdft_data.py 11 TDD tests pinning the contract
The adapter reuses charge3net_ft.data's _row_to_atoms_and_density and
_build_parquet_index, then re-shapes the per-sample output into the
dict that DeepDFT's CollateFuncRandomSample expects:
{
"density": np.ndarray (Nx, Ny, Nz),
"atoms": ase.Atoms,
"origin": np.ndarray (3,),
"grid_position": np.ndarray (Nx, Ny, Nz, 3),
"metadata": {"filename": str},
}
_calculate_grid_pos is inlined from upstream DeepDFT/dataset.py so
this adapter has no runtime dependency on the DeepDFT sibling repo
(which keeps the test suite hermetic).
Tests pinned (RED then GREEN):
- dataset length matches the count of valid parquet rows
- sample dict has all 5 required keys
- density is a 3D numpy array
- atoms is ase.Atoms with PBC True/True/True
- origin is zeros (matches LeMat-Rho convention)
- grid_position has shape (Nx, Ny, Nz, 3)
- grid_position[0,0,0] = (0,0,0)
- grid_position[1,0,0] = (a_lattice / Nx, 0, 0)
- metadata.filename present and unique per sample
- extra columns (bader_charges, material_id) ignored
- empty parquet dir raises FileNotFoundError
Caching is keyed by absolute parquet path (not file index) so multiple
LeMatRhoDeepDFTDataset instances pointing at different directories
don't collide on fi=0 (which bit me writing the metadata test).
Full LeMat-Rho test suite: 44 passed. Ruff format + check clean.
Next: PR 2 will add deepdft_ft/runner.py (vendored from upstream
DeepDFT + DDP patches) and submit_deepdft_adastra.sh (4-GCD half-node
DDP, PaiNN model variant for equivariance parity with ChargE3Net).
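The grid_position contract pinned by the tests above (fractional index times lattice vector, plus origin) can be restated in pure Python. This is a sketch for illustration, not the inlined `_calculate_grid_pos`; the function name `grid_positions` and the nested-list return type are assumptions, whereas the adapter returns a NumPy array.

```python
def grid_positions(cell, shape, origin=(0.0, 0.0, 0.0)):
    """Cartesian position of every grid point.

    cell   : three lattice vectors a, b, c (each a 3-sequence)
    shape  : (Nx, Ny, Nz) grid dimensions
    result : nested lists indexed [i][j][k] -> (x, y, z), where
             r = origin + (i/Nx) a + (j/Ny) b + (k/Nz) c
    """
    nx, ny, nz = shape
    a, b, c = cell
    grid = []
    for i in range(nx):
        plane = []
        for j in range(ny):
            row = []
            for k in range(nz):
                fi, fj, fk = i / nx, j / ny, k / nz
                row.append(
                    tuple(origin[d] + fi * a[d] + fj * b[d] + fk * c[d] for d in range(3))
                )
            plane.append(row)
        grid.append(plane)
    return grid
```

With an orthorhombic cell this reproduces the two pinned values directly: index (0, 0, 0) sits at the origin and index (1, 0, 0) sits one step `a / Nx` along the first lattice vector.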
PR 2 of the DeepDFT-on-LeMat-Rho stack (PR 1 was the data adapter).
Closes the gap from "we have a DeepDFT-compatible Dataset" to "we
can sbatch a 4-GCD DDP DeepDFT training run on Adastra".
What's here:
deepdft_ft/runner.py vendored from peterbjorgensen/DeepDFT@main
+ DDP patches + LeMat-Rho parquet auto-detect
+ asap3 stub (no C++ headers on Adastra)
submit_deepdft_adastra.sh half-node 4-GCD DDP submission, PaiNN default,
LEMATRHO_DEEPDFT_VARIANT={painn,schnet} env var,
LEMATRHO_DRY_RUN=1 supported
DDP patches mirror what we did in charge3net_ft/train.py:
- _setup_ddp + _is_main + _unwrap helpers
- DistributedSampler when WORLD_SIZE>1, RandomSampler otherwise
- DistributedDataParallel wrap of the PaiNN/SchNet model
- All logging.info and checkpoint saves gated on rank 0
- Device pinned to cuda:LOCAL_RANK via torch.cuda.set_device
LeMat-Rho parquet auto-detect: if --dataset points at a directory
containing chunk_*.parquet, the runner uses LeMatRhoDeepDFTDataset
(PR 1). Other dataset paths (.tar, .txt, dir of cube/CHGCAR) still
work unchanged — upstream's dataset.DensityData path is preserved.
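The directory check is small enough to sketch in full. The helper name `is_lematrho_parquet_dir` is hypothetical and the runner's actual detection code may differ; only the `chunk_*.parquet` pattern comes from the description above.

```python
from pathlib import Path


def is_lematrho_parquet_dir(dataset):
    """True when `dataset` is a directory holding at least one chunk_*.parquet."""
    path = Path(dataset)
    return path.is_dir() and any(path.glob("chunk_*.parquet"))
```

Anything that fails the check (a .tar, a .txt file list, a directory of cube files) would fall through to the upstream loading path.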
asap3 stub: upstream DeepDFT imports asap3 at module load. asap3
needs Python.h to build from source which isn't on Adastra (and would
need admin). The stub at the top of runner.py registers a fake asap3
module with a FullNeighborList class that delegates to ASE's
NewPrimitiveNeighborList. Slower than real asap3 but functionally
identical for DeepDFT's call sites. Skipped when real asap3 is
installed.
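The registration mechanism behind such a stub looks roughly like this. The sketch shows only the `sys.modules` trick: `ensure_module` and `FullNeighborListPlaceholder` are hypothetical names, and the placeholder class does not implement the ASE delegation that the real stub performs.

```python
import importlib
import sys
import types


def ensure_module(name, attrs):
    """Import `name`; if it is missing, register a stub exposing `attrs`.

    The real module always wins, so the stub is skipped when the
    dependency is actually installed.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        stub = types.ModuleType(name)
        stub.__dict__.update(attrs)
        sys.modules[name] = stub  # later `import name` statements now succeed
        return stub


class FullNeighborListPlaceholder:
    """Stand-in for the class a stub would provide (delegation not shown)."""

    def __init__(self, *args, **kwargs):
        raise NotImplementedError("delegate to an ASE neighbour list here")
```

The call has to run before the first `import asap3` in the vendored code, which is why it sits at the top of the module.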
Submit script defaults:
- PaiNN model (matches equivariance of ChargE3Net for the comparison)
- batch=2 (DeepDFT's upstream default — they iterate on probes,
not materials, so per-batch counts work differently from ChargE3Net)
- cutoff=4.0, num_interactions=3, node_size=128
- max_steps=1e8 (effectively unbounded; SLURM walltime is the limiter)
- WANDB_NAME=deepdft_painn (or deepdft_schnet)
Verified on Adastra: runner module imports cleanly under the venv311,
asap3 stub kicks in without error, parquet directory detection works.
The actual training run will be submitted next.
Root-causes job 4971720's OOM-kill at startup and aligns the DeepDFT
training to the upstream paper's submission settings.
Two changes:
1. submit_deepdft_adastra.sh: switch from half-node DDP (4 GCDs) to
paper-faithful single-GPU (1 GCD on mi250-shared, HIP_VISIBLE_DEVICES=0,
WORLD_SIZE unset). Upstream DeepDFT was trained on 1x RTX 3090 per
pretrained_models/*/submit_script.sh. Single-GPU keeps gradient-step
semantics identical to the paper's batch=2; no LR sweep needed.
Effective hyperparameters are now exactly the upstream PaiNN settings
from pretrained_models/{nmc,qm9,ethylenecarbonate}_painn/commandline_args.txt:
--cutoff 4
--num_interactions 3
--node_size 128
--max_steps 10000000
--use_painn_model
batch_size=2 materials (hardcoded in runner.py)
train_probes=1000 per material (hardcoded)
val_probes=5000 per material (hardcoded)
DDP code paths in runner.py stay in place but only fire when
WORLD_SIZE>1, so a future DDP variant of DeepDFT is one env flip away.
2. deepdft_ft/runner.py: replace upstream's eager validation preload
`val_loader = [b for b in val_loader]` with a comment explaining
why we left it as a streaming DataLoader. Upstream's val sets are
~100 materials (NMC, QM9 ethylenecarbonate subsets) so the preload
is cheap. Our val set is 3,261 materials at 5000 probes each, x4
ranks under DDP, which materialised ~150 GB and OOM-killed job
4971720 at startup before a single training step. Streaming the
val loader is a data-loading detail, not a hyperparameter; the
model math is unchanged.
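The difference between the eager preload and a streaming loader can be shown with a plain generator. This toy sketch is not the runner code: `batches` is a hypothetical stand-in for a DataLoader, and it counts how many batches have been built rather than measuring memory.

```python
def batches(n, produced):
    """Yield n toy batches, recording each one as it is built."""
    for i in range(n):
        produced.append(i)
        yield {"index": i}


def first_batch_eager(n):
    """Upstream-style preload: every batch exists before the first is used."""
    produced = []
    preloaded = [b for b in batches(n, produced)]
    return preloaded[0], len(produced)


def first_batch_streaming(n):
    """Streaming: only the batch being consumed has been built."""
    produced = []
    first = next(iter(batches(n, produced)))
    return first, len(produced)
```

With the eager form, peak memory scales with the whole validation set (and again with the number of ranks under DDP); with streaming it scales with one batch.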
Test plan:
- 44/44 local tests still pass (no behavioural changes to the data
adapter or submit-script env contract; only the runner internals
and the SLURM headers move).
- New job to be submitted as the next step; will confirm DeepDFT
trains and produces step-level loss in the .out log.
Observation from jobs 4971293 and 4971343: SLURM bumped both to
EXCLUSIVE mode despite us requesting half-node resources. The
--mem=125000M line was exactly half the 256 GB node's memory, which
crosses SLURM's auto-exclusive threshold.

Dropping --mem entirely lets SLURM allocate memory proportional to our
CPU share (64 of 128 logical CPUs -> ~128 GB out of 256 GB). The other
half of the node stays schedulable for other users / jobs.

The currently running jobs 4971293 and 4971343 keep their exclusive
allocations; only future submissions are affected.

Test plan
- 9/9 tests in tests/test_submit_script.py still pass (no memory
  assertion).
- Will confirm on next sbatch by inspecting AllocTRES.
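Inspecting AllocTRES by eye is error-prone, so a small parser may help. This is a sketch under assumptions: the sample string in the usage note is illustrative rather than output from a real job, and `parse_tres` / `mem_to_mib` are hypothetical helpers that expect the comma-separated `key=value` form with a unit suffix on `mem`.

```python
def parse_tres(tres):
    """Split a TRES string such as 'cpu=64,mem=128G,node=1' into a dict."""
    out = {}
    for item in tres.split(","):
        if not item:
            continue
        key, _, value = item.partition("=")
        out[key] = value
    return out


def mem_to_mib(value):
    """Convert a memory value with a K/M/G/T suffix to MiB."""
    units = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}
    suffix = value[-1].upper()
    if suffix not in units:
        raise ValueError(f"expected a K/M/G/T suffix, got {value!r}")
    return float(value[:-1]) * units[suffix]
```

For an illustrative string like `"billing=64,cpu=64,gres/gpu=4,mem=128G,node=1"`, comparing `cpu` and `mem` against the node totals (128 logical CPUs, 256 GB) shows at a glance whether the job received a half-node or the whole node.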
Summary
Stacked on PR LeMaterial#8. Adds an Adastra-side variant of the ChargE3Net fine-tuning pipeline (NVIDIA A100 on Jean Zay → AMD MI250X on Adastra/CINES) without touching `charge3net_ft/`. Same training code, same dataset layout; only the submit script + setup runbook differ.

What's in this PR

| File | What |
| --- | --- |
| `submit_charge3net_adastra.sh` | MI250 SLURM headers, ROCm `HIP_VISIBLE_DEVICES` alignment, `batch_size=8` (vs A100's 4, since MI250X has 64 GB HBM2e per GCD), `val_probes=1000`, online W&B (Adastra proxy gives live internet), auto-resume from `latest.pt`. Submit dir defaults to `cad16353` scratch, account billed to `c1816212`. |
| `ADASTRA.md` | Setup runbook covering the port blockers listed below. |
| `tests/test_data.py` | `test_ignores_extra_columns` regression test for the Bader-analysis columns that `Entalpic/lemat-rho-v1` added (`bader_charges`, `bader_volumes`, `material_id`). |

Port blockers solved

| Symptom | Root cause | Fix |
| --- | --- | --- |
| `pip install` returns HTTP 000 | `HTTP_PROXY` is not auto-exported on Adastra | `HTTP_PROXY=http://proxy-l-adastra.cines.fr:3128` (+ HTTPS, lowercase); now in `~/.bashrc` on Adastra |
| … | 30-day scratch purge wipes setup | `$LEMATRHO_ADASTRA_SETUP` is now rebuildable from sources |
| `pip install boto3` times out | pip defaults to `gorgone.cines.fr` (missing boto3) | `pip install --index-url https://pypi.org/simple ...` for non-torch deps |
| `snapshot_download` reports 100% but cache is empty | `huggingface_hub` Xet backend silently no-ops the payload fetch | Raw `curl` with `Authorization: Bearer` per file (3.5 GB in 16 s with `xargs -P 8`) |
| `sbatch: You are not allowed to ask for a qos` | `--qos=debug` not granted on team accounts | Omit `--qos`; default works with 6 h MaxWall |
| Job exits `0:53` (signal 53 = prolog failure), no log files | `c1816212` group inode quota at hard cap (Ali owns ~85% of 1.1M files) | Submit dir on `cad16353` scratch, billed via `--account=c1816212` (active window). Account and scratch dir are independent in SLURM. |
| `.out` lands in `$HOME` | `sbatch` over SSH without `cd` defaults `WorkDir=$HOME` | `cd $WORK_DIR && sbatch ...` in the submit script |

Reference smoke run
Job 4969516 on g1342, 2026-05-19. Loaded 65,239 / 68,549 valid materials from 69 parquet chunks. 1,150 training steps in 12 min wall, train L1 down from 29.95 (step 50) → 5.67 (step 1,000). Hit TIMEOUT before completing the epoch (expected: one epoch ≈ 150 min at the debug-run knobs); no val/test metrics yet. A 6 h job under the production knobs in this script is the next step.
Test plan