
feat(adastra): port ChargE3Net fine-tuning to AMD MI250X on CINES Adastra #1

Draft
speckhard wants to merge 8 commits into feat/charge3net from feat/charge3net-adastra

@speckhard
Owner

Summary

Stacked on PR LeMaterial#8. Adds an Adastra-side variant of the ChargE3Net fine-tuning pipeline (NVIDIA A100 on Jean Zay → AMD MI250X on Adastra/CINES) without touching charge3net_ft/. Same training code, same dataset layout; only the submit script + setup runbook differ.

What's in this PR

| File | What |
| --- | --- |
| `submit_charge3net_adastra.sh` | MI250 SLURM headers, ROCm `HIP_VISIBLE_DEVICES` alignment, `batch_size=8` (vs A100's 4, since MI250X has 64 GB HBM2e per GCD), `val_probes=1000`, online W&B (Adastra proxy gives live internet), auto-resume from `latest.pt`. Submit dir defaults to `cad16353` scratch, account billed to `c1816212`. |
| `ADASTRA.md` | Step-by-step setup (proxy, venv, dataset transfer) + a gotchas table covering the seven port blockers. |
| `tests/test_data.py` | New `test_ignores_extra_columns` regression test for the Bader-analysis columns that `Entalpic/lemat-rho-v1` added (`bader_charges`, `bader_volumes`, `material_id`). |

Port blockers solved

| # | Symptom | Cause | Fix |
| --- | --- | --- | --- |
| 1 | `pip install` returns HTTP 000 | Adastra doesn't auto-set `HTTP_PROXY` | Export `HTTP_PROXY=http://proxy-l-adastra.cines.fr:3128` (+ HTTPS, lowercase); now in `~/.bashrc` on Adastra |
| 2 | April setup vanished | CINES 30-day scratch purge | Setup tree under `$LEMATRHO_ADASTRA_SETUP` is now rebuildable from sources |
| 3 | `pip install boto3` times out | Adastra's pip prefers `gorgone.cines.fr` (missing boto3) | `pip install --index-url https://pypi.org/simple ...` for non-torch deps |
| 4 | `snapshot_download` reports 100% but cache is empty | HF Xet backend silently no-ops on Adastra | Raw `curl` with `Authorization: Bearer` per file (3.5 GB in 16 s with `xargs -P 8`) |
| 5 | `sbatch: You are not allowed to ask for a qos` | `--qos=debug` not granted on team accounts | Omit `--qos`; default works with 6 h MaxWall |
| 6 | Exit code 0:53 (signal 53 = prolog failure), no log files | `c1816212` group inode quota at hard cap (Ali owns ~85% of 1.1M files) | Cross-account setup: submit dir on `cad16353` scratch (390 k headroom), `--account=c1816212` (active window). Account and scratch dir are independent in SLURM. |
| 7 | sbatch `.out` lands in `$HOME` | sbatch over SSH without `cd` defaults `WorkDir=$HOME` | `cd $WORK_DIR && sbatch ...` in the submit script |
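
Blockers 1 and 4 can be sketched in Python as follows. This is a minimal illustration, not the script used in this PR: the helper names (`export_proxy`, `curl_command`, `fetch_all`) are hypothetical, the file name in the usage note is a placeholder, and the `resolve/main` URL layout is the standard Hugging Face download path, assumed here to apply to the dataset repo.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Proxy URL quoted in the blockers table.
PROXY = "http://proxy-l-adastra.cines.fr:3128"


def export_proxy(env=None):
    """Set upper- and lowercase HTTP(S) proxy variables if they are missing."""
    env = os.environ if env is None else env
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
        env.setdefault(name, PROXY)
    return env


def resolve_url(repo_id, filename, revision="main"):
    """Direct-download URL for one file of a Hugging Face dataset repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"


def curl_command(repo_id, filename, token, dest_dir):
    """Build the per-file curl invocation with a bearer token."""
    dest = Path(dest_dir) / filename
    return [
        "curl", "--fail", "--silent", "--show-error", "--location",
        "--create-dirs", "--output", str(dest),
        "--header", f"Authorization: Bearer {token}",
        resolve_url(repo_id, filename),
    ]


def fetch_all(repo_id, filenames, token, dest_dir, workers=8):
    """Download files in parallel, similar in spirit to `xargs -P 8`."""
    def run(name):
        subprocess.run(curl_command(repo_id, name, token, dest_dir), check=True)
        return Path(dest_dir) / name

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, filenames))
```

A call such as `fetch_all("Entalpic/lemat-rho-v1", ["chunk_0.parquet"], token, "data")` would then fetch one file; the file name is illustrative only, since the actual file listing is not given here.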

Reference smoke run

Job 4969516 on g1342, 2026-05-19. Loaded 65,239 / 68,549 valid materials from 69 parquet chunks. 1,150 training steps in 12 min wall, train L1 down from 29.95 (step 50) → 5.67 (step 1,000). Hit TIMEOUT before completing the epoch (expected: one epoch ≈ 150 min at the debug-run knobs); no val/test metrics yet. A 6 h job under the production knobs in this script is the next step.

Test plan

  • `pytest tests/test_data.py -v` — 11/11 pass including the new `test_ignores_extra_columns`
  • `ruff format` + `ruff check` — clean
  • Manual smoke run on Adastra (job 4969516) — pipeline trains end-to-end on AMD MI250X
  • Real-data 6 h run with production knobs (`batch=8`, `val_probes=1000`, online W&B) — follow-up, not in this PR

speckhard added 6 commits May 20, 2026 13:26
…stra

Adds an Adastra-side variant of submit_charge3net.sh and a runbook
covering the seven blockers encountered during the port:
- HTTP proxy must be set explicitly (Adastra doesn't auto-export it),
- 30-day scratch purge wipes setup, so $LEMATRHO_ADASTRA_SETUP is
  rebuildable from sources,
- pip on Adastra defaults to gorgone.cines.fr (missing boto3 etc);
  --index-url https://pypi.org/simple is required,
- huggingface_hub Xet backend silently no-ops the payload fetch, so
  raw curl with Authorization: Bearer is used for the dataset,
- --qos=debug is not granted on the team accounts,
- group inode quota on /lus/scratch/CT10/c1816212/ is at the hard cap,
  so the submit dir lives on cad16353 scratch while the job is billed
  to c1816212 (account and scratch dir are independent dimensions),
- sbatch over SSH defaults WorkDir to $HOME unless cd'd first.

submit_charge3net_adastra.sh mirrors the Jean Zay script (auto-resume
from latest.pt, 50-epoch budget) but with MI250 SLURM headers, ROCm
HIP_VISIBLE_DEVICES alignment, batch_size=8 (HBM2e has 64 GB per GCD
vs A100's 40-80), val_probes=1000, and online W&B (the Adastra proxy
gives us live internet, so the Jean Zay offline-then-sync dance is
unnecessary).

Adds a regression test test_ignores_extra_columns for the dataset
loader: Entalpic/lemat-rho-v1 added Bader analysis columns
(bader_charges, bader_volumes, material_id) which would have broken
_build_parquet_index if it didn't honor the four-column _COLUMNS
allowlist. The test confirms the allowlist still holds.

Reference smoke run: job 4969516 on g1342, May 19 2026. 65,239 of
68,549 valid materials loaded from 69 parquet chunks. 1,150 training
steps in 12 min wall, train L1 down from 29.95 at step 50 to 5.67
at step 1,000. Hit TIMEOUT before completing the epoch (expected:
one epoch needs ~150 min at batch=4), no val/test metrics yet; a
follow-up 6h job under the production knobs will produce those.
…uards

Adds tests/test_equivariance.py with 7 structural tests that pin down the
architectural properties needed for ChargE3Net's rotational equivariance
guarantee:

- Production model has 1.9M params (catches drift that would break loading
  charge3net_mp.pt).
- atom_irreps_sequence reaches lmax >= 4 (the "higher-order" in the paper
  title; a silent drop to lmax=0 would degenerate the model to a much
  weaker scalar-only baseline).
- Atom representation includes both even and odd parity components.
- get_irreps(500, lmax=4) returns 10 entries with no zero-multiplicity
  irreps (catches a regression that would silently delete some irreps).
- atom_irreps_sequence length matches num_interactions.
- Atom-model cutoff matches the 4.0 A baked into KdTreeGraphConstructor in
  LeMatRhoDataset.
- Final irreps are an e3nn o3.Irreps instance (replacing this with a plain
  list would silently break equivariance while still producing output).

A runtime equivariance check (rotate inputs, predict, compare) is the gold
standard but requires a real forward pass at production hyperparameters
that is too slow for a CPU unit test. The structural tests cover the same
property at the architecture level.

Tests autoskip when the sibling AIforGreatGood/charge3net repo is absent.
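
For reference, the "rotate inputs, predict, compare" check has the following shape. This is a toy sketch only: `toy_density` is a hypothetical sum-of-Gaussians stand-in for the model, with no periodic boundary conditions, chosen so the check runs instantly on CPU. Charge density is a scalar, so rotating atoms and probes together should leave the prediction unchanged.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z axis."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def toy_density(atoms, probe, width=0.5):
    """Toy scalar field: one Gaussian per atom (a stand-in, not ChargE3Net)."""
    return sum(
        math.exp(-math.dist(atom, probe) ** 2 / (2 * width**2)) for atom in atoms
    )


def max_rotation_error(predict, atoms, probes, angle):
    """Largest change in the prediction when atoms and probes rotate together."""
    rotated_atoms = [rotate_z(atom, angle) for atom in atoms]
    worst = 0.0
    for probe in probes:
        before = predict(atoms, probe)
        after = predict(rotated_atoms, rotate_z(probe, angle))
        worst = max(worst, abs(before - after))
    return worst


atoms = [(0.3, 0.0, 0.0), (-0.4, 0.9, 0.2)]
probes = [(1.0, 0.0, 0.0), (0.2, 0.5, -0.7)]
error = max_rotation_error(toy_density, atoms, probes, angle=0.7)
```

A predictor that depends on absolute coordinates (for example one returning the probe's x value) fails the same check, which is the regression a runtime test would catch.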
… training

Two changes motivated by job 4969727 (FAILED after 1h47m on the previous
single-GPU submit):

1. Multi-GPU via torch DistributedDataParallel. The paper uses per-GPU
   batch=16 across 4 GPUs (effective batch=64). Our previous Adastra
   submit was single-GPU batch=8 — 8x smaller effective batch. With the
   half-node submit (4 GCDs, 64 CPUs, 128 GB RAM, batch=16 per GCD) the
   effective batch now matches the paper.

   Implementation:
   - New _setup_ddp / _is_ddp / _is_main helpers in train.py read
     WORLD_SIZE / RANK / LOCAL_RANK / MASTER_ADDR / MASTER_PORT from
     the env (set in the submit script via srun + scontrol show hostname).
   - Backend is nccl which routes through RCCL on AMD ROCm builds.
   - Model wrapped in DistributedDataParallel after .to(device).
   - DistributedSampler injected into the train loader via a new
     distributed=True flag on build_dataloaders. Val/test stay
     non-distributed; cheap enough at 5% of 65k.
   - DistributedSampler.set_epoch called each epoch for proper shuffling.
   - All prints and wandb logs gated on is_main (rank 0 only).
   - Save and load go through a new _unwrap helper so checkpoints are
     interchangeable between single-GPU and DDP runs.
   - dist.barrier at end of each epoch to keep ranks in lockstep
     during checkpoint saves.
   - dist.destroy_process_group at the very end.

2. Wandb soft-fail. wandb.init now sits inside try/except — if the
   compute node can't reach api.wandb.ai through the proxy (which is
   what killed job 4969727 after 5min of timeouts and 1h47m elapsed
   total), the script logs a warning and sets use_wandb=False so
   training proceeds with stdout + checkpoints only.
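
The rank gating, checkpoint unwrapping and soft-fail logic described above can be sketched without torch. This is an illustration under assumptions, not the code in train.py: the function names are hypothetical, `init_fn` stands in for the `wandb.init` call, and the only library fact relied on is that `DistributedDataParallel` exposes the wrapped model as `.module`.

```python
import logging
import os

log = logging.getLogger(__name__)


def ddp_env(env=None):
    """Read the torch-style rendezvous variables, defaulting to single-process."""
    env = os.environ if env is None else env
    return {
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
    }


def is_ddp(cfg):
    return cfg["world_size"] > 1


def is_main(cfg):
    """Only rank 0 prints, logs to wandb and writes checkpoints."""
    return cfg["rank"] == 0


def unwrap(model):
    """Return the inner module if the model is DDP-wrapped, else the model."""
    return getattr(model, "module", model)


def init_tracking(init_fn):
    """Soft-fail wrapper: a tracking failure must not kill the training job."""
    try:
        return init_fn(), True
    except Exception as exc:  # broad on purpose: any init failure disables tracking
        log.warning("experiment tracking disabled: %s", exc)
        return None, False
```

Saving `unwrap(model).state_dict()` rather than the wrapper's state dict is what keeps checkpoints loadable by both single-GPU and DDP runs.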

Submit script (submit_charge3net_adastra.sh) updated for half-node:
   --nodes=1 --ntasks-per-node=4
   --gpus-per-node=4 --cpus-per-task=16
   --mem=125000M  --time=06:00:00
plus srun-based DDP launcher that exports RANK/LOCAL_RANK per task,
batch_size=16 per GPU, val_probes=1000, wandb-mode=offline.
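
One plausible way to derive the per-task variables is to map SLURM's task variables onto the names torch expects. The sketch below assumes that mapping (`SLURM_NTASKS`, `SLURM_PROCID` and `SLURM_LOCALID` are standard srun variables); the function names and the port value 29500 are assumptions, not taken from the submit script.

```python
def torch_env_from_slurm(slurm_env, master_addr, master_port=29500):
    """Translate srun's per-task variables into torch.distributed's names.

    master_addr would typically be the first host printed by
    `scontrol show hostnames "$SLURM_JOB_NODELIST"`.
    """
    return {
        "WORLD_SIZE": slurm_env["SLURM_NTASKS"],
        "RANK": slurm_env["SLURM_PROCID"],
        "LOCAL_RANK": slurm_env["SLURM_LOCALID"],
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


def effective_batch(per_gpu_batch, world_size):
    """Effective batch under data parallelism is per-GPU batch times ranks."""
    return per_gpu_batch * world_size


# Third of four tasks on a single node (host name taken from the smoke run).
task_env = torch_env_from_slurm(
    {"SLURM_NTASKS": "4", "SLURM_PROCID": "2", "SLURM_LOCALID": "2"},
    master_addr="g1342",
)
```

With `batch_size=16` on 4 GCDs, `effective_batch(16, 4)` gives the paper's effective batch of 64.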

Test plan
- pytest tests/ ... 34 passed, 1 failure pre-existing (test_metrics
  collection error from src.charge3net path shadowing in pytest;
  unrelated, same on main).
- ruff format + check clean on the touched files.
- DDP path not yet exercised end-to-end on Adastra; the immediate
  next step is a 6h submission. If the DDP init fails, the
  single-GPU code path is still reachable by running without srun.
…om-scratch (TDD)

The submit script now reads LEMATRHO_TRAINING_MODE to switch between
two runs that share all infrastructure (same DDP, same hyperparams,
same dataset, same node layout) but differ in init:

  pretrained   (default)  --ckpt-path charge3net_mp.pt
                          save-dir charge3net_checkpoints/
                          WANDB_NAME=pretrained_mp
  from_scratch            no --ckpt-path (random init)
                          save-dir charge3net_checkpoints_fromscratch/
                          WANDB_NAME=from_scratch

Auto-resume from latest.pt is per-mode (the two save-dirs don't
collide), so each arm can be relaunched independently via
sbatch ... submit_charge3net_adastra.sh until val NMAPE plateaus.

Also adds a LEMATRHO_DRY_RUN=1 escape hatch that prints the resolved
train command and exits 0 without sourcing the venv or invoking srun.
Used by the 9 new pytest tests in tests/test_submit_script.py:
  - dry-run prints train command
  - default mode is pretrained, uses MP checkpoint
  - pretrained writes to charge3net_checkpoints (not fromscratch dir)
  - from_scratch drops --ckpt-path completely and never references
    charge3net_mp.pt
  - from_scratch uses a separate save dir
  - WANDB_NAME differs between modes
  - invalid mode exits non-zero with a clear error
  - batch-size 16, val-probes 1000 (paper-matching)
  - wandb-mode is offline

TDD: 9 tests RED before the refactor, all GREEN after. Full suite
still 33 passed (data + model + equivariance + submit). ruff format
+ check clean.
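
The mode switch itself lives in the shell script; the same contract can be written in Python for illustration. In this sketch `--ckpt-path` and the directory and run names come from the description above, while the `python -m charge3net_ft.train` entry point and the exact spelling of the other flags are assumptions.

```python
MODES = {
    "pretrained": {
        "ckpt": "charge3net_mp.pt",
        "save_dir": "charge3net_checkpoints",
        "wandb_name": "pretrained_mp",
    },
    "from_scratch": {
        "ckpt": None,  # random init: no checkpoint flag at all
        "save_dir": "charge3net_checkpoints_fromscratch",
        "wandb_name": "from_scratch",
    },
}


def resolve_train_command(env):
    """Return (WANDB_NAME, argv) for the mode selected in the environment."""
    mode = env.get("LEMATRHO_TRAINING_MODE", "pretrained")
    if mode not in MODES:
        raise ValueError(
            f"invalid LEMATRHO_TRAINING_MODE={mode!r}; expected one of {sorted(MODES)}"
        )
    cfg = MODES[mode]
    cmd = [
        "python", "-m", "charge3net_ft.train",  # assumed entry point
        "--save-dir", cfg["save_dir"],
        "--batch-size", "16",
        "--val-probes", "1000",
        "--wandb-mode", "offline",
    ]
    if cfg["ckpt"] is not None:
        cmd += ["--ckpt-path", cfg["ckpt"]]
    return cfg["wandb_name"], cmd


def dry_run(env):
    """Mimic LEMATRHO_DRY_RUN=1: show the resolved command without launching."""
    wandb_name, cmd = resolve_train_command(env)
    return f"WANDB_NAME={wandb_name} " + " ".join(cmd)
```

Keeping the resolution step free of side effects is what makes it testable: the dry-run path can be asserted on without a venv, a GPU or srun.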

Submission examples in the script header and in ADASTRA.md.

PR 1 of a 2-PR stack to land DeepDFT as a baseline for the
ChargE3Net VASP-speedup experiment. This PR adds only the data
adapter; PR 2 will add the training submission (DDP-patched).

What's here:
  deepdft_ft/__init__.py         empty package marker
  deepdft_ft/data.py             LeMatRhoDeepDFTDataset adapter
  tests/test_deepdft_data.py     11 TDD tests pinning the contract

The adapter reuses charge3net_ft.data's _row_to_atoms_and_density and
_build_parquet_index, then re-shapes the per-sample output into the
dict that DeepDFT's CollateFuncRandomSample expects:

  {
      "density":       np.ndarray (Nx, Ny, Nz),
      "atoms":         ase.Atoms,
      "origin":        np.ndarray (3,),
      "grid_position": np.ndarray (Nx, Ny, Nz, 3),
      "metadata":      {"filename": str},
  }

_calculate_grid_pos is inlined from upstream DeepDFT/dataset.py so
this adapter has no runtime dependency on the DeepDFT sibling repo
(which keeps the test suite hermetic).
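
The grid-position computation can be sketched in pure Python as below. This is an illustration, not the inlined function: it assumes grid point (i, j, k) sits at fractional coordinates (i/Nx, j/Ny, k/Nz) of the cell, offset by the origin, and it returns nested lists where the adapter returns a NumPy array.

```python
def calculate_grid_pos(shape, origin, cell):
    """Cartesian position of every grid point of a periodic density grid.

    shape  -- (Nx, Ny, Nz) grid dimensions
    origin -- (x, y, z) offset of grid point (0, 0, 0)
    cell   -- three lattice vectors a, b, c, each an (x, y, z) tuple
    """
    nx, ny, nz = shape
    a, b, c = cell

    def point(i, j, k):
        fi, fj, fk = i / nx, j / ny, k / nz  # fractional coordinates
        return tuple(
            origin[d] + fi * a[d] + fj * b[d] + fk * c[d] for d in range(3)
        )

    return [
        [[point(i, j, k) for k in range(nz)] for j in range(ny)]
        for i in range(nx)
    ]


# Cubic 4 Å cell sampled on a 2 x 3 x 4 grid.
cubic = ((4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0))
grid = calculate_grid_pos((2, 3, 4), (0.0, 0.0, 0.0), cubic)
```

Under these assumptions the first point coincides with the origin and one step along the first axis moves by the lattice vector divided by Nx, which is what the `grid_position` tests below pin down.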

Tests pinned (RED then GREEN):
  - dataset length matches the count of valid parquet rows
  - sample dict has all 5 required keys
  - density is a 3D numpy array
  - atoms is ase.Atoms with PBC True/True/True
  - origin is zeros (matches LeMat-Rho convention)
  - grid_position has shape (Nx, Ny, Nz, 3)
  - grid_position[0,0,0] = (0,0,0)
  - grid_position[1,0,0] = (a_lattice / Nx, 0, 0)
  - metadata.filename present and unique per sample
  - extra columns (bader_charges, material_id) ignored
  - empty parquet dir raises FileNotFoundError

Caching is keyed by absolute parquet path (not file index) so multiple
LeMatRhoDeepDFTDataset instances pointing at different directories
don't collide on fi=0 (which bit me writing the metadata test).
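
The collision is easy to reproduce in miniature. In this sketch the cache, `load_cached` and the loader callback are hypothetical names; the point is only that an absolute path is unique across dataset directories while a per-dataset file index is not.

```python
import os

_cache = {}


def cache_key(parquet_path):
    """Absolute path: unique across datasets, unlike a per-dataset file index."""
    return os.path.abspath(parquet_path)


def load_cached(parquet_path, loader):
    """Load a parquet chunk once per distinct file, whichever dataset asks."""
    key = cache_key(parquet_path)
    if key not in _cache:
        _cache[key] = loader(parquet_path)
    return _cache[key]


# Two datasets whose first file (index 0) has the same name in different dirs.
calls = []


def fake_loader(path):
    calls.append(path)
    return {"source": path}


first = load_cached("dir_a/chunk_0.parquet", fake_loader)
second = load_cached("dir_b/chunk_0.parquet", fake_loader)
again = load_cached("dir_a/chunk_0.parquet", fake_loader)
```

With an index-keyed cache, `second` would have silently returned the data loaded for `dir_a`.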

Full LeMat-Rho test suite: 44 passed. Ruff format + check clean.

Next: PR 2 will add deepdft_ft/runner.py (vendored from upstream
DeepDFT + DDP patches) and submit_deepdft_adastra.sh (4-GCD half-node
DDP, PaiNN model variant for equivariance parity with ChargE3Net).

PR 2 of the DeepDFT-on-LeMat-Rho stack (PR 1 was the data adapter).
Closes the gap from "we have a DeepDFT-compatible Dataset" to "we
can sbatch a 4-GCD DDP DeepDFT training run on Adastra".

What's here:
  deepdft_ft/runner.py            vendored from peterbjorgensen/DeepDFT@main
                                  + DDP patches + LeMat-Rho parquet auto-detect
                                  + asap3 stub (no C++ headers on Adastra)
  submit_deepdft_adastra.sh       half-node 4-GCD DDP submission, PaiNN default,
                                  LEMATRHO_DEEPDFT_VARIANT={painn,schnet} env var,
                                  LEMATRHO_DRY_RUN=1 supported

DDP patches mirror what we did in charge3net_ft/train.py:
- _setup_ddp + _is_main + _unwrap helpers
- DistributedSampler when WORLD_SIZE>1, RandomSampler otherwise
- DistributedDataParallel wrap of the PaiNN/SchNet model
- All logging.info and checkpoint saves gated on rank 0
- Device pinned to cuda:LOCAL_RANK via torch.cuda.set_device

LeMat-Rho parquet auto-detect: if --dataset points at a directory
containing chunk_*.parquet, the runner uses LeMatRhoDeepDFTDataset
(PR 1). Other dataset paths (.tar, .txt, dir of cube/CHGCAR) still
work unchanged — upstream's dataset.DensityData path is preserved.
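
A minimal sketch of the detection rule, assuming it is a plain glob on the directory; the function names and the returned labels are hypothetical.

```python
from pathlib import Path


def is_lematrho_parquet_dir(dataset_path):
    """True if the path is a directory holding at least one chunk_*.parquet."""
    path = Path(dataset_path)
    return path.is_dir() and any(path.glob("chunk_*.parquet"))


def pick_dataset_kind(dataset_path):
    """Route LeMat-Rho parquet dirs to the adapter, everything else upstream."""
    return "lematrho" if is_lematrho_parquet_dir(dataset_path) else "upstream"
```

Anything that is not such a directory (a .tar, a .txt list, a directory of cube or CHGCAR files) falls through to the upstream path untouched.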

asap3 stub: upstream DeepDFT imports asap3 at module load. asap3
needs Python.h to build from source which isn't on Adastra (and would
need admin). The stub at the top of runner.py registers a fake asap3
module with a FullNeighborList class that delegates to ASE's
NewPrimitiveNeighborList. Slower than real asap3 but functionally
identical for DeepDFT's call sites. Skipped when real asap3 is
installed.
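
The registration mechanism can be sketched as follows. Only the `sys.modules` trick is shown: the placeholder class raises instead of delegating to ASE, so this is not a working neighbor list, and `install_asap3_stub` is a hypothetical name.

```python
import importlib.util
import sys
import types


def install_asap3_stub():
    """Register a fake asap3 module unless a real one is importable.

    Returns True if the stub was installed, False if nothing was changed.
    """
    if "asap3" in sys.modules:
        return False
    if importlib.util.find_spec("asap3") is not None:
        return False  # real asap3 is installed; leave it alone

    stub = types.ModuleType("asap3")

    class FullNeighborList:
        """Placeholder only; the runner's stub delegates to ASE instead."""

        def __init__(self, *args, **kwargs):
            raise NotImplementedError("sketch: no neighbor-list backend wired in")

    stub.FullNeighborList = FullNeighborList
    sys.modules["asap3"] = stub
    return True


install_asap3_stub()
import asap3  # resolves to the stub when the real package is absent
```

Because Python consults `sys.modules` before searching the filesystem, any later `import asap3` in vendored code picks up the stub without modification.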

Submit script defaults:
- PaiNN model (matches equivariance of ChargE3Net for the comparison)
- batch=2 (DeepDFT's upstream default — they iterate on probes,
  not materials, so per-batch counts work differently from ChargE3Net)
- cutoff=4.0, num_interactions=3, node_size=128
- max_steps=1e8 (effectively unbounded; SLURM walltime is the limiter)
- WANDB_NAME=deepdft_painn (or deepdft_schnet)

Verified on Adastra: runner module imports cleanly under the venv311,
asap3 stub kicks in without error, parquet directory detection works.
The actual training run will be submitted next.
@speckhard force-pushed the feat/charge3net-adastra branch from 8487ae9 to 8d510d2 on May 20, 2026 11:27
speckhard added 2 commits May 20, 2026 13:28
Root-causes job 4971720's OOM-kill at startup and aligns the DeepDFT
training to the upstream paper's submission settings.

Two changes:

1. submit_deepdft_adastra.sh: switch from half-node DDP (4 GCDs) to
   paper-faithful single-GPU (1 GCD on mi250-shared, HIP_VISIBLE_DEVICES=0,
   WORLD_SIZE unset). Upstream DeepDFT was trained on 1x RTX 3090 per
   pretrained_models/*/submit_script.sh. Single-GPU keeps gradient-step
   semantics identical to the paper's batch=2; no LR sweep needed.

   Effective hyperparameters are now exactly the upstream PaiNN settings
   from pretrained_models/{nmc,qm9,ethylenecarbonate}_painn/commandline_args.txt:
     --cutoff 4
     --num_interactions 3
     --node_size 128
     --max_steps 10000000
     --use_painn_model
     batch_size=2 materials (hardcoded in runner.py)
     train_probes=1000 per material (hardcoded)
     val_probes=5000 per material (hardcoded)

   DDP code paths in runner.py stay in place but only fire when
   WORLD_SIZE>1, so a future DDP variant of DeepDFT is one env flip away.

2. deepdft_ft/runner.py: replace upstream's eager validation preload
   `val_loader = [b for b in val_loader]` with a comment explaining
   why we left it as a streaming DataLoader. Upstream's val sets are
   ~100 materials (NMC, QM9, ethylenecarbonate subsets) so the preload
   is cheap. Our val set is 3,261 materials at 5000 probes each, x4
   ranks under DDP, which materialised ~150 GB and OOM-killed job
   4971720 at startup before a single training step. Streaming the
   val loader is a data-loading detail, not a hyperparameter; the
   model math is unchanged.
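
A back-of-envelope check of the preload size, using only the figures quoted above. The per-record footprint is implied by the ~150 GB estimate rather than measured, and the function names are illustrative.

```python
def total_probe_records(n_materials, probes_per_material, n_ranks):
    """Probe records materialised when every rank preloads the full val set."""
    return n_materials * probes_per_material * n_ranks


def implied_bytes_per_record(total_bytes, n_records):
    """Average footprint per probe record implied by a total memory figure."""
    return total_bytes / n_records


records = total_probe_records(3261, 5000, 4)
per_record = implied_bytes_per_record(150e9, records)
print(f"{records:,} probe records, about {per_record / 1e3:.1f} kB each")
```

That is 65,220,000 records at roughly 2.3 kB each. A streaming loader holds only the batches in flight, so its footprint does not grow with the size of the validation set.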

Test plan:
- 44/44 local tests still pass (no behavioural changes to the data
  adapter or submit-script env contract; only the runner internals
  and the SLURM headers move).
- New job to be submitted as the next step; will confirm DeepDFT
  trains and produces step-level loss in the .out log.

Observation from jobs 4971293 and 4971343: SLURM bumped both to
EXCLUSIVE mode despite us requesting half-node resources. The
--mem=125000M line was exactly half the 256 GB node's memory, which
crosses SLURM's auto-exclusive threshold.

Dropping --mem entirely lets SLURM allocate memory proportional to
our CPU share (64 of 128 logical CPUs -> ~128 GB out of 256 GB).
The other half of the node stays schedulable for other users / jobs.

The currently running jobs 4971293 and 4971343 keep their exclusive
allocations; only future submissions are affected.

Test plan
- 9/9 tests in tests/test_submit_script.py still pass (no memory
  assertion).
- Will confirm on next sbatch by inspecting AllocTRES.