feat(adastra): port ChargE3Net fine-tuning to AMD MI250X on CINES Adastra by speckhard · Pull Request #1 · speckhard/LeMat-Rho

speckhard · 2026-05-19T16:01:25Z

Summary

Stacked on PR LeMaterial#8. Adds an Adastra-side variant of the ChargE3Net fine-tuning pipeline (NVIDIA A100 on Jean Zay → AMD MI250X on Adastra/CINES) without touching charge3net_ft/. Same training code, same dataset layout; only the submit script + setup runbook differ.

What's in this PR

| File | What |
| --- | --- |
| `submit_charge3net_adastra.sh` | MI250 SLURM headers, ROCm `HIP_VISIBLE_DEVICES` alignment, `batch_size=8` (vs A100's 4, since MI250X has 64 GB HBM2e per GCD), `val_probes=1000`, online W&B (Adastra proxy gives live internet), auto-resume from `latest.pt`. Submit dir defaults to `cad16353` scratch, account billed to `c1816212`. |
| `ADASTRA.md` | Step-by-step setup (proxy, venv, dataset transfer) + a gotchas table covering the seven port blockers. |
| `tests/test_data.py` | New `test_ignores_extra_columns` regression test for the Bader-analysis columns that `Entalpic/lemat-rho-v1` added (`bader_charges`, `bader_volumes`, `material_id`). |

Port blockers solved

| # | Symptom | Cause | Fix |
| --- | --- | --- | --- |
| 1 | `pip install` returns HTTP 000 | Adastra doesn't auto-set `HTTP_PROXY` | Export `HTTP_PROXY=http://proxy-l-adastra.cines.fr:3128` (+ HTTPS, lowercase); now in `~/.bashrc` on Adastra |
| 2 | April setup vanished | CINES 30-day scratch purge | Setup tree under `$LEMATRHO_ADASTRA_SETUP` is now rebuildable from sources |
| 3 | `pip install boto3` times out | Adastra's pip prefers `gorgone.cines.fr` (missing boto3) | `pip install --index-url https://pypi.org/simple ...` for non-torch deps |
| 4 | `snapshot_download` reports 100% but cache is empty | HF Xet backend silently no-ops on Adastra | Raw `curl` with `Authorization: Bearer` per file (3.5 GB in 16 s with `xargs -P 8`) |
| 5 | `sbatch: You are not allowed to ask for a qos` | `--qos=debug` not granted on team accounts | Omit `--qos`; default works with 6 h MaxWall |
| 6 | Exit code `0:53` (signal 53 = prolog failure), no log files | `c1816212` group inode quota at hard cap (Ali owns ~85% of 1.1M files) | Cross-account setup: submit dir on cad16353 scratch (390 k headroom), `--account=c1816212` (active window). Account and scratch dir are independent in SLURM. |
| 7 | sbatch `.out` lands in `$HOME` | sbatch over SSH without `cd` defaults `WorkDir=$HOME` | `cd $WORK_DIR && sbatch ...` in the submit script |
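The per-file download from blocker 4 can be sketched in Python as a thin wrapper around `curl`. This is a minimal sketch, not the commands used in the PR: the helper names (`resolve_url`, `curl_command`, `fetch_all`) are hypothetical, the chosen `curl` flags are an assumption, and the file names must be supplied by the caller. It relies on the standard Hugging Face `resolve` URL layout and uses 8 worker threads to mirror `xargs -P 8`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

REPO_ID = "Entalpic/lemat-rho-v1"  # dataset named in this PR


def resolve_url(filename, repo_id=REPO_ID, revision="main"):
    """Direct-download URL for one file of a Hugging Face dataset repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"


def curl_command(filename, dest_dir, token):
    """Build the curl argv for one file (flag choice is an assumption)."""
    return [
        "curl", "--fail", "--silent", "--show-error", "--location",
        "--create-dirs", "--output", str(Path(dest_dir) / filename),
        "--header", f"Authorization: Bearer {token}",
        resolve_url(filename),
    ]


def fetch_all(filenames, dest_dir, token, workers=8):
    """Download every file in parallel; raises if any curl call fails."""
    def fetch(name):
        subprocess.run(curl_command(name, dest_dir, token), check=True)
        return Path(dest_dir) / name

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, filenames))
```

Because `--fail` makes `curl` exit non-zero on HTTP errors and `check=True` turns that into an exception, an empty download cannot pass silently the way the Xet no-op did.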

Reference smoke run

Job 4969516 on g1342, 2026-05-19. Loaded 65,239 / 68,549 valid materials from 69 parquet chunks. 1,150 training steps in 12 min wall, train L1 down from 29.95 (step 50) → 5.67 (step 1,000). Hit TIMEOUT before completing the epoch (expected: one epoch ≈ 150 min at the debug-run knobs); no val/test metrics yet. A 6 h job under the production knobs in this script is the next step.

Test plan

`pytest tests/test_data.py -v` — 11/11 pass including the new `test_ignores_extra_columns`
`ruff format` + `ruff check` — clean
Manual smoke run on Adastra (job 4969516) — pipeline trains end-to-end on AMD MI250X
Real-data 6 h run with production knobs (`batch=8`, `val_probes=1000`, online W&B) — follow-up, not in this PR

…stra

Adds an Adastra-side variant of submit_charge3net.sh and a runbook covering the seven blockers encountered during the port:

- HTTP proxy must be set explicitly (Adastra doesn't auto-export it),
- 30-day scratch purge wipes setup, so $LEMATRHO_ADASTRA_SETUP is rebuildable from sources,
- pip on Adastra defaults to gorgone.cines.fr (missing boto3 etc); --index-url https://pypi.org/simple is required,
- huggingface_hub Xet backend silently no-ops the payload fetch, so raw curl with Authorization: Bearer is used for the dataset,
- --qos=debug is not granted on the team accounts,
- group inode quota on /lus/scratch/CT10/c1816212/ is at the hard cap, so the submit dir lives on cad16353 scratch while the job is billed to c1816212 (account and scratch dir are independent dimensions),
- sbatch over SSH defaults WorkDir to $HOME unless cd'd first.

submit_charge3net_adastra.sh mirrors the Jean Zay script (auto-resume from latest.pt, 50-epoch budget) but with MI250 SLURM headers, ROCm HIP_VISIBLE_DEVICES alignment, batch_size=8 (HBM2e has 64 GB per GCD vs A100's 40-80), val_probes=1000, and online W&B (the Adastra proxy gives us live internet, so the Jean Zay offline-then-sync dance is unnecessary).

Adds a regression test test_ignores_extra_columns for the dataset loader: Entalpic/lemat-rho-v1 added Bader analysis columns (bader_charges, bader_volumes, material_id) which would have broken _build_parquet_index if it didn't honor the four-column _COLUMNS allowlist. The test confirms the allowlist still holds.

Reference smoke run: job 4969516 on g1342, May 19 2026. 65,239 of 68,549 valid materials loaded from 69 parquet chunks. 1,150 training steps in 12 min wall, train L1 down from 29.95 at step 50 to 5.67 at step 1,000. Hit TIMEOUT before completing the epoch (expected: one epoch needs ~150 min at batch=4), no val/test metrics yet; a follow-up 6h job under the production knobs will produce those.

…uards

Adds tests/test_equivariance.py with 7 structural tests that pin down the architectural properties needed for ChargE3Net's rotational equivariance guarantee:

- Production model has 1.9M params (catches drift that would break loading charge3net_mp.pt).
- atom_irreps_sequence reaches lmax >= 4 (the "higher-order" in the paper title; a silent drop to lmax=0 would degenerate the model to a much weaker scalar-only baseline).
- Atom representation includes both even and odd parity components.
- get_irreps(500, lmax=4) returns 10 entries with no zero-multiplicity irreps (catches a regression that would silently delete some irreps).
- atom_irreps_sequence length matches num_interactions.
- Atom-model cutoff matches the 4.0 Å baked into KdTreeGraphConstructor in LeMatRhoDataset.
- Final irreps are an e3nn o3.Irreps instance (replacing this with a plain list would silently break equivariance while still producing output).

A runtime equivariance check (rotate inputs, predict, compare) is the gold standard but requires a real forward pass at production hyperparameters that is too slow for a CPU unit test. The structural tests cover the same property at the architecture level.

Tests autoskip when the sibling AIforGreatGood/charge3net repo is absent.
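For reference, the "rotate inputs, predict, compare" check that the commit defers can be sketched on a toy model. Everything here is an assumption for illustration: `toy_density` is a sum of Gaussians standing in for the real network, the system is non-periodic, and only rotations about the z axis are tried. A charge density is a scalar field, so rotating the atoms and the probe point together should leave the predicted value unchanged.

```python
import math


def rotate_z(point, theta):
    """Rotate a 3D point about the z axis by theta radians."""
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)


def toy_density(atoms, probe):
    """Stand-in scalar model: sum of unit Gaussians centred on the atoms."""
    return sum(
        math.exp(-sum((a - p) ** 2 for a, p in zip(atom, probe)))
        for atom in atoms
    )


def is_rotation_invariant(model, atoms, probes, theta, tol=1e-9):
    """True if model(atoms, probe) is unchanged when both are rotated."""
    rotated_atoms = [rotate_z(a, theta) for a in atoms]
    for probe in probes:
        before = model(atoms, probe)
        after = model(rotated_atoms, rotate_z(probe, theta))
        if not math.isclose(before, after, rel_tol=tol, abs_tol=tol):
            return False
    return True


atoms = [(0.0, 0.0, 0.0), (1.2, 0.3, -0.5)]
probes = [(1.0, 0.0, 0.2), (0.4, 0.9, 0.1)]
ok = is_rotation_invariant(toy_density, atoms, probes, theta=0.7)
# A model that reads an absolute coordinate fails the same check.
broken = is_rotation_invariant(lambda a, p: p[0], atoms, probes, theta=0.7)
```

The toy model depends only on atom-probe distances, which is why it passes; the `broken` variant shows the check catching a model that is not invariant.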

… training

Two changes motivated by job 4969727 (FAILED after 1h47m on the previous single-GPU submit):

1. Multi-GPU via torch DistributedDataParallel. The paper uses per-GPU batch=16 across 4 GPUs (effective batch=64). Our previous Adastra submit was single-GPU batch=8, an 8x smaller effective batch. With the half-node submit (4 GCDs, 64 CPUs, 128 GB RAM, batch=16 per GCD) the effective batch now matches the paper.

   Implementation:

   - New _setup_ddp / _is_ddp / _is_main helpers in train.py read WORLD_SIZE / RANK / LOCAL_RANK / MASTER_ADDR / MASTER_PORT from the env (set in the submit script via srun + scontrol show hostname).
   - Backend is nccl which routes through RCCL on AMD ROCm builds.
   - Model wrapped in DistributedDataParallel after .to(device).
   - DistributedSampler injected into the train loader via a new distributed=True flag on build_dataloaders. Val/test stay non-distributed; cheap enough at 5% of 65k.
   - DistributedSampler.set_epoch called each epoch for proper shuffling.
   - All prints and wandb logs gated on is_main (rank 0 only).
   - Save and load go through a new _unwrap helper so checkpoints are interchangeable between single-GPU and DDP runs.
   - dist.barrier at end of each epoch to keep ranks in lockstep during checkpoint saves.
   - dist.destroy_process_group at the very end.

2. Wandb soft-fail. wandb.init now sits inside try/except: if the compute node can't reach api.wandb.ai through the proxy (which is what killed job 4969727 after 5min of timeouts and 1h47m elapsed total), the script logs a warning and sets use_wandb=False so training proceeds with stdout + checkpoints only.

Submit script (submit_charge3net_adastra.sh) updated for half-node:

    --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --cpus-per-task=16 --mem=125000M --time=06:00:00

plus srun-based DDP launcher that exports RANK/LOCAL_RANK per task, batch_size=16 per GPU, val_probes=1000, wandb-mode=offline.

Test plan

- pytest tests/ ... 34 passed, 1 failure pre-existing (test_metrics collection error from src.charge3net path shadowing in pytest; unrelated, same on main).
- ruff format + check clean on the touched files.
- DDP path not yet exercised end-to-end on Adastra; the immediate next step is a 6h submission. If the DDP init fails, the single-GPU code path is still reachable by running without srun.
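The helper pattern described above can be sketched without importing torch. This is a hedged illustration, not the code in train.py: the names (`DdpEnv`, `read_ddp_env`, `unwrap`, `effective_batch`, `init_tracker`) are hypothetical, and the real helpers additionally call into `torch.distributed`. The one torch fact relied on is that `DistributedDataParallel` exposes the wrapped model as `.module`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DdpEnv:
    world_size: int
    rank: int
    local_rank: int


def read_ddp_env(env=None):
    """Read launcher variables, defaulting to a single-process run."""
    env = os.environ if env is None else env
    return DdpEnv(
        world_size=int(env.get("WORLD_SIZE", "1")),
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
    )


def is_ddp(e):
    return e.world_size > 1


def is_main(e):
    """Rank 0 is the only rank that prints, logs, and saves."""
    return e.rank == 0


def unwrap(model):
    """Return the underlying model whether or not it is DDP-wrapped."""
    return getattr(model, "module", model)


def effective_batch(per_gpu_batch, e):
    return per_gpu_batch * e.world_size


def init_tracker(init_fn, log=print):
    """Soft-fail: return (run, True), or (None, False) if init_fn raises."""
    try:
        return init_fn(), True
    except Exception as exc:  # e.g. a proxy timeout while reaching the tracker
        log(f"warning: tracker init failed ({exc}); continuing without it")
        return None, False
```

With `WORLD_SIZE=4` and a per-GPU batch of 16, `effective_batch` gives 64, the figure quoted for the paper. Saving `unwrap(model).state_dict()` is what keeps checkpoints loadable in both single-GPU and DDP runs.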

…om-scratch (TDD)

The submit script now reads LEMATRHO_TRAINING_MODE to switch between two runs that share all infrastructure (same DDP, same hyperparams, same dataset, same node layout) but differ in init:

    pretrained (default)
        --ckpt-path charge3net_mp.pt
        save-dir charge3net_checkpoints/
        WANDB_NAME=pretrained_mp

    from_scratch
        no --ckpt-path (random init)
        save-dir charge3net_checkpoints_fromscratch/
        WANDB_NAME=from_scratch

Auto-resume from latest.pt is per-mode (the two save-dirs don't collide), so each arm can be relaunched independently via sbatch ... submit_charge3net_adastra.sh until val NMAPE plateaus.

Also adds a LEMATRHO_DRY_RUN=1 escape hatch that prints the resolved train command and exits 0 without sourcing the venv or invoking srun. Used by the 9 new pytest tests in tests/test_submit_script.py:

- dry-run prints train command
- default mode is pretrained, uses MP checkpoint
- pretrained writes to charge3net_checkpoints (not fromscratch dir)
- from_scratch drops --ckpt-path completely and never references charge3net_mp.pt
- from_scratch uses a separate save dir
- WANDB_NAME differs between modes
- invalid mode exits non-zero with a clear error
- batch-size 16, val-probes 1000 (paper-matching)
- wandb-mode is offline

TDD: 9 tests RED before the refactor, all GREEN after. Full suite still 33 passed (data + model + equivariance + submit). ruff format + check clean. Submission examples in the script header and in ADASTRA.md.
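The mode switch lives in a shell script; the same decision table can be mirrored in Python to make the contract explicit. This is a sketch under assumptions: `resolve_mode` and `build_train_args` are hypothetical helpers, and the flag spellings `--save-dir`, `--batch-size`, `--val-probes`, and `--wandb-mode` are inferred from the commit message and not confirmed against the script (`--ckpt-path` is the only flag quoted verbatim).

```python
MODES = {
    "pretrained": {
        "ckpt_path": "charge3net_mp.pt",
        "save_dir": "charge3net_checkpoints",
        "wandb_name": "pretrained_mp",
    },
    "from_scratch": {
        "ckpt_path": None,  # random init
        "save_dir": "charge3net_checkpoints_fromscratch",
        "wandb_name": "from_scratch",
    },
}


def resolve_mode(mode="pretrained"):
    """Look up one training arm; reject anything else with a clear error."""
    if mode not in MODES:
        raise ValueError(
            f"invalid training mode {mode!r}; expected one of {sorted(MODES)}"
        )
    return MODES[mode]


def build_train_args(mode="pretrained"):
    """Assemble the argument list that a dry run would print."""
    cfg = resolve_mode(mode)
    args = [
        "--save-dir", cfg["save_dir"],
        "--batch-size", "16",
        "--val-probes", "1000",
        "--wandb-mode", "offline",
    ]
    if cfg["ckpt_path"] is not None:
        args += ["--ckpt-path", cfg["ckpt_path"]]
    return args
```

Keeping the two arms in one table makes it hard for them to drift apart on anything except the three fields that are meant to differ.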

PR 1 of a 2-PR stack to land DeepDFT as a baseline for the ChargE3Net VASP-speedup experiment. This PR adds only the data adapter; PR 2 will add the training submission (DDP-patched).

What's here:

    deepdft_ft/__init__.py        empty package marker
    deepdft_ft/data.py            LeMatRhoDeepDFTDataset adapter
    tests/test_deepdft_data.py    11 TDD tests pinning the contract

The adapter reuses charge3net_ft.data's _row_to_atoms_and_density and _build_parquet_index, then re-shapes the per-sample output into the dict that DeepDFT's CollateFuncRandomSample expects:

    {
        "density": np.ndarray (Nx, Ny, Nz),
        "atoms": ase.Atoms,
        "origin": np.ndarray (3,),
        "grid_position": np.ndarray (Nx, Ny, Nz, 3),
        "metadata": {"filename": str},
    }

_calculate_grid_pos is inlined from upstream DeepDFT/dataset.py so this adapter has no runtime dependency on the DeepDFT sibling repo (which keeps the test suite hermetic).

Tests pinned (RED then GREEN):

- dataset length matches the count of valid parquet rows
- sample dict has all 5 required keys
- density is a 3D numpy array
- atoms is ase.Atoms with PBC True/True/True
- origin is zeros (matches LeMat-Rho convention)
- grid_position has shape (Nx, Ny, Nz, 3)
- grid_position[0,0,0] = (0,0,0)
- grid_position[1,0,0] = (a_lattice / Nx, 0, 0)
- metadata.filename present and unique per sample
- extra columns (bader_charges, material_id) ignored
- empty parquet dir raises FileNotFoundError

Caching is keyed by absolute parquet path (not file index) so multiple LeMatRhoDeepDFTDataset instances pointing at different directories don't collide on fi=0 (which bit me writing the metadata test).

Full LeMat-Rho test suite: 44 passed. Ruff format + check clean.

Next: PR 2 will add deepdft_ft/runner.py (vendored from upstream DeepDFT + DDP patches) and submit_deepdft_adastra.sh (4-GCD half-node DDP, PaiNN model variant for equivariance parity with ChargE3Net).
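The two pinned `grid_position` values suggest the usual uniform-grid convention, which can be sketched for a single voxel in plain Python. This is an assumption-laden illustration, not the inlined `_calculate_grid_pos`: `grid_point` is a hypothetical helper, and it assumes voxel index `i` along an axis of `N` points maps to fractional coordinate `i / N` with no half-voxel offset and a zero origin.

```python
def grid_point(cell, shape, index):
    """Cartesian position of one voxel on a uniform grid spanning the cell.

    cell  : three lattice vectors, each an (x, y, z) tuple
    shape : (Nx, Ny, Nz) grid dimensions
    index : (i, j, k) voxel index
    """
    fracs = [i / n for i, n in zip(index, shape)]
    # Position = sum over lattice vectors of fractional coordinate * vector.
    return tuple(
        sum(f * vec[axis] for f, vec in zip(fracs, cell))
        for axis in range(3)
    )


# Orthorhombic example cell with a = 4, b = 6, c = 8 on a 2 x 3 x 4 grid.
cell = [(4.0, 0.0, 0.0), (0.0, 6.0, 0.0), (0.0, 0.0, 8.0)]
shape = (2, 3, 4)
first = grid_point(cell, shape, (0, 0, 0))
step_x = grid_point(cell, shape, (1, 0, 0))
```

Under these assumptions `first` is the origin and `step_x` is one grid spacing `a / Nx` along x, matching the `grid_position[0,0,0]` and `grid_position[1,0,0]` tests listed above.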

PR 2 of the DeepDFT-on-LeMat-Rho stack (PR 1 was the data adapter). Closes the gap from "we have a DeepDFT-compatible Dataset" to "we can sbatch a 4-GCD DDP DeepDFT training run on Adastra".

What's here:

    deepdft_ft/runner.py          vendored from peterbjorgensen/DeepDFT@main + DDP patches + LeMat-Rho parquet auto-detect + asap3 stub (no C++ headers on Adastra)
    submit_deepdft_adastra.sh     half-node 4-GCD DDP submission, PaiNN default, LEMATRHO_DEEPDFT_VARIANT={painn,schnet} env var, LEMATRHO_DRY_RUN=1 supported

DDP patches mirror what we did in charge3net_ft/train.py:

- _setup_ddp + _is_main + _unwrap helpers
- DistributedSampler when WORLD_SIZE>1, RandomSampler otherwise
- DistributedDataParallel wrap of the PaiNN/SchNet model
- All logging.info and checkpoint saves gated on rank 0
- Device pinned to cuda:LOCAL_RANK via torch.cuda.set_device

LeMat-Rho parquet auto-detect: if --dataset points at a directory containing chunk_*.parquet, the runner uses LeMatRhoDeepDFTDataset (PR 1). Other dataset paths (.tar, .txt, dir of cube/CHGCAR) still work unchanged; upstream's dataset.DensityData path is preserved.

asap3 stub: upstream DeepDFT imports asap3 at module load. asap3 needs Python.h to build from source which isn't on Adastra (and would need admin). The stub at the top of runner.py registers a fake asap3 module with a FullNeighborList class that delegates to ASE's NewPrimitiveNeighborList. Slower than real asap3 but functionally identical for DeepDFT's call sites. Skipped when real asap3 is installed.

Submit script defaults:

- PaiNN model (matches equivariance of ChargE3Net for the comparison)
- batch=2 (DeepDFT's upstream default; they iterate on probes, not materials, so per-batch counts work differently from ChargE3Net)
- cutoff=4.0, num_interactions=3, node_size=128
- max_steps=1e8 (effectively unbounded; SLURM walltime is the limiter)
- WANDB_NAME=deepdft_painn (or deepdft_schnet)

Verified on Adastra: runner module imports cleanly under the venv311, asap3 stub kicks in without error, parquet directory detection works. The actual training run will be submitted next.
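The stub mechanism itself (registering a stand-in module before the vendored code imports it) can be sketched as follows. This is not the stub from runner.py: `install_stub` is a hypothetical helper, and the `FullNeighborList` below is only a placeholder whose constructor signature is an assumption; the stub described in the commit delegates to ASE's `NewPrimitiveNeighborList`, which is omitted here so the block runs without ASE.

```python
import importlib.util
import sys
import types


def install_stub(name, attrs):
    """Register a stand-in module unless the real one is importable.

    Returns True if the stub was installed, False if it was skipped.
    """
    if name in sys.modules or importlib.util.find_spec(name) is not None:
        return False
    stub = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(stub, key, value)
    sys.modules[name] = stub  # later `import name` resolves to the stub
    return True


class FullNeighborList:
    """Placeholder with an assumed constructor; real delegation is omitted."""

    def __init__(self, rCut, atoms=None):
        self.rCut = rCut
        self.atoms = atoms

    def get_neighbors(self, i):
        raise NotImplementedError("delegate to an ASE neighbor list here")


# Must run before any module that does `import asap3` at load time.
install_stub("asap3", {"FullNeighborList": FullNeighborList})
```

Checking `find_spec` first is what gives the "skipped when real asap3 is installed" behaviour: the stub only fills the gap and never shadows a working install.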

Root-causes job 4971720's OOM-kill at startup and aligns the DeepDFT training to the upstream paper's submission settings. Two changes:

1. submit_deepdft_adastra.sh: switch from half-node DDP (4 GCDs) to paper-faithful single-GPU (1 GCD on mi250-shared, HIP_VISIBLE_DEVICES=0, WORLD_SIZE unset). Upstream DeepDFT was trained on 1x RTX 3090 per pretrained_models/*/submit_script.sh. Single-GPU keeps gradient-step semantics identical to the paper's batch=2; no LR sweep needed.

   Effective hyperparameters are now exactly the upstream PaiNN settings from pretrained_models/{nmc,qm9,ethylenecarbonate}_painn/commandline_args.txt:

       --cutoff 4 --num_interactions 3 --node_size 128 --max_steps 10000000 --use_painn_model
       batch_size=2 materials (hardcoded in runner.py)
       train_probes=1000 per material (hardcoded)
       val_probes=5000 per material (hardcoded)

   DDP code paths in runner.py stay in place but only fire when WORLD_SIZE>1, so a future DDP variant of DeepDFT is one env flip away.

2. deepdft_ft/runner.py: replace upstream's eager validation preload `val_loader = [b for b in val_loader]` with a comment explaining why we left it as a streaming DataLoader. Upstream's val sets are ~100 materials (NMC, QM9 ethylenecarbonate subsets) so the preload is cheap. Our val set is 3,261 materials at 5000 probes each, x4 ranks under DDP, which materialised ~150 GB and OOM-killed job 4971720 at startup before a single training step. Streaming the val loader is a data-loading detail, not a hyperparameter; the model math is unchanged.

Test plan:

- 44/44 local tests still pass (no behavioural changes to the data adapter or submit-script env contract; only the runner internals and the SLURM headers move).
- New job to be submitted as the next step; will confirm DeepDFT trains and produces step-level loss in the .out log.
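The difference between the eager preload and the streaming loader can be shown with a toy loader that counts how many batches are alive at once. This is only an illustration: `Batch`, `loader`, `eager`, and `streaming` are hypothetical names, no real DataLoader is involved, and the counts rely on CPython's reference counting freeing objects as soon as they become unreachable.

```python
class Batch:
    """Toy batch that tracks how many instances are alive at once."""

    live = 0
    peak = 0

    def __init__(self):
        Batch.live += 1
        Batch.peak = max(Batch.peak, Batch.live)

    def __del__(self):
        Batch.live -= 1


def loader(n):
    """Stand-in for a DataLoader: produces batches lazily."""
    for _ in range(n):
        yield Batch()


def eager(it):
    batches = [b for b in it]  # upstream-style preload: all batches resident
    for _ in batches:
        pass


def streaming(it):
    for _ in it:  # each batch is released once the next one arrives
        pass


def peak_live(consume, n):
    """Peak number of simultaneously live batches while consuming n of them."""
    Batch.live = Batch.peak = 0
    consume(loader(n))
    return Batch.peak


eager_peak = peak_live(eager, 50)
stream_peak = peak_live(streaming, 50)
```

The eager path peaks at the full number of batches, while the streaming path stays at a small constant regardless of validation-set size, which is the property that matters when each batch carries 5000 probes per material.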

Observation from jobs 4971293 and 4971343: SLURM bumped both to EXCLUSIVE mode despite us requesting half-node resources. The --mem=125000M line was exactly half the 256 GB node's memory, which crosses SLURM's auto-exclusive threshold.

Dropping --mem entirely lets SLURM allocate memory proportional to our CPU share (64 of 128 logical CPUs -> ~128 GB out of 256 GB). The other half of the node stays schedulable for other users / jobs.

The currently running jobs 4971293 and 4971343 keep their exclusive allocations; only future submissions are affected.

Test plan

- 9/9 tests in tests/test_submit_script.py still pass (no memory assertion).
- Will confirm on next sbatch by inspecting AllocTRES.

speckhard added 6 commits May 20, 2026 13:26

speckhard force-pushed the feat/charge3net-adastra branch from 8487ae9 to 8d510d2 Compare May 20, 2026 11:27

speckhard added 2 commits May 20, 2026 13:28

speckhard wants to merge 8 commits into feat/charge3net from feat/charge3net-adastra
