Add chained sbatch submissions for runs that exceed cluster time limits #184
Open
j321m wants to merge 1 commit into
A long training run that won't fit in one SLURM time slot can now be
submitted as N chained jobs that share one checkpoint dir and one wandb
run. Set `chain.n_jobs: N` in the experiment yaml.
How it works:
- run_exp.py generates an EXPERIMENT_ID once per submission, exports it via
`sbatch --export=ALL,EXPERIMENT_ID=...`, and submits N sbatch calls in
the same tmux pane chained with `--dependency=afterany:<prev>`. afterany
(not aftercorr) is required so a SLURM time-limit kill (non-zero exit)
doesn't break the chain.
- checkpointing.get_full_checkpoint_path prefers EXPERIMENT_ID over
SLURM_JOB_ID, so the checkpoint dir is stable across chain steps.
- _resolve_load_path applies the same {EXPERIMENT_ID}/{ARRAY_TASK_ID}
nesting that save uses, then picks the highest step_*. First chain step
finds nothing → fresh start; subsequent steps find the latest
checkpoint and resume.
- main.run() exits early when next_step >= n_steps, so chain steps after
training is complete are cheap no-ops.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
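The chaining described above can be sketched as follows. This is a minimal illustration, not the code in `run_exp.py`: the function names (`build_sbatch_cmd`, `run_sbatch`, `submit_chain`) are hypothetical, and it assumes `sbatch --parsable` is used to read back each job ID, which the PR does not state. It also leaves out the tmux pane handling.

```python
import subprocess
from typing import Callable, List, Optional


def build_sbatch_cmd(script: str, experiment_id: str,
                     prev_job_id: Optional[str] = None) -> List[str]:
    """Build one sbatch call; every chain step exports the same EXPERIMENT_ID."""
    cmd = ["sbatch", "--parsable", f"--export=ALL,EXPERIMENT_ID={experiment_id}"]
    if prev_job_id is not None:
        # afterany: start once the previous job ends, whatever its exit status,
        # so a time-limit kill of the previous step still releases this one.
        cmd.append(f"--dependency=afterany:{prev_job_id}")
    cmd.append(script)
    return cmd


def run_sbatch(cmd: List[str]) -> str:
    """Run sbatch and return the job ID (--parsable prints 'jobid[;cluster]')."""
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return out.strip().split(";")[0]


def submit_chain(script: str, experiment_id: str, n_jobs: int,
                 submit: Callable[[List[str]], str] = run_sbatch) -> List[str]:
    """Submit n_jobs steps, each depending on the one before it."""
    job_ids: List[str] = []
    prev: Optional[str] = None
    for _ in range(n_jobs):
        prev = submit(build_sbatch_cmd(script, experiment_id, prev))
        job_ids.append(prev)
    return job_ids
```

Passing a fake `submit` callable lets the command construction be checked without a cluster; with the default, `submit_chain("train.sbatch", "my_exp", 2)` would issue two real `sbatch` calls.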
Summary
Long training runs that won't fit in one SLURM time slot can now be submitted as N chained jobs that share one checkpoint dir and one wandb run. Just set `chain.n_jobs: N` in the experiment yaml; no code changes are needed per experiment.
How it works
- `run_exp.py` generates an `EXPERIMENT_ID` once per submission (the existing `experiment_branch_name`), exports it via `sbatch --export=ALL,EXPERIMENT_ID=...`, and submits N sbatch calls in the same tmux pane chained with `--dependency=afterany:<prev>`. `afterany` (not `aftercorr`) is required so a SLURM time-limit kill (non-zero exit) doesn't break the chain.
- `checkpointing.get_full_checkpoint_path` prefers `EXPERIMENT_ID` over `SLURM_JOB_ID`, so the checkpoint dir is stable across chain steps. Falls back to `SLURM_JOB_ID` for non-chained runs.
- `_resolve_load_path` applies the same `{EXPERIMENT_ID}/{ARRAY_TASK_ID}` nesting that save uses, then picks the highest `step_*`. First chain step finds nothing → fresh start; subsequent steps find the latest checkpoint and resume.
- `main.run()` exits early when `next_step >= n_steps`, so chain steps after training is complete are cheap no-ops.
Disk layout
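A minimal sketch of the path rules behind this layout, under stated assumptions. The helper names (`checkpoint_dir`, `latest_step`, `next_step_after`) are hypothetical and are not the signatures of `get_full_checkpoint_path` or `_resolve_load_path`. It assumes the task index comes from the `SLURM_ARRAY_TASK_ID` environment variable, that the task level is nested only when that variable is set, and that the resume step is the highest saved step plus one, which matches the `step_499` → `next_step=500` case reported under Tested.

```python
import re
from pathlib import Path
from typing import Mapping, Optional

_STEP_RE = re.compile(r"^step_(\d+)$")


def checkpoint_dir(save_path: str, env: Mapping[str, str]) -> Path:
    """Resolve {save_path}/{run_id}[/{task_id}], preferring EXPERIMENT_ID."""
    # EXPERIMENT_ID is shared by all chain steps; SLURM_JOB_ID changes per job
    # and is only the fallback for non-chained runs.
    run_id = env.get("EXPERIMENT_ID") or env.get("SLURM_JOB_ID")
    if run_id is None:
        raise RuntimeError("neither EXPERIMENT_ID nor SLURM_JOB_ID is set")
    path = Path(save_path) / run_id
    task_id = env.get("SLURM_ARRAY_TASK_ID")  # assumption: grid cell index
    if task_id is not None:
        path = path / task_id
    return path


def latest_step(ckpt_dir: Path) -> Optional[int]:
    """Return the highest N among step_N entries, or None if there are none."""
    if not ckpt_dir.is_dir():
        return None  # first chain step: nothing saved yet, so fresh start
    # Compare numerically: as strings, "step_99" would sort after "step_499".
    steps = [int(m.group(1)) for p in ckpt_dir.iterdir()
             if (m := _STEP_RE.match(p.name))]
    return max(steps, default=None)


def next_step_after(latest: Optional[int]) -> int:
    """Step to resume from: 0 on a fresh start, else one past the last save."""
    return 0 if latest is None else latest + 1
```

A chain step would then compare `next_step_after(latest_step(ckpt))` against `n_steps` and return immediately when training has already finished, which is what makes surplus chain steps cheap.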
Grid + chain composition
For a grid of M cells × chain length N, total = N sbatch arrays of M tasks each. Each grid cell K has its own consistent dir at `{save.path}/{EXPERIMENT_ID}/K/step_*` regardless of which chain step is running.
Caveats
- `afterany` is whole-array level: chain step K+1 waits for ALL tasks of chain step K to finish. SLURM has no per-task `afterany`, so independent per-cell chains aren't possible without restructuring as M independent chains. For most uses this is fine since grid cells are similarly costly.
- The `only_weights: false` path is still flagged "not tested" in the existing checkpoint configs. Verified end-to-end on entropy with the smoke test.
Tested
End-to-end on entropy h100, `n_steps=500`, `chain.n_jobs=2`:
`step_499` via auto-resolve, hit early-exit branch (`next_step=500 >= n_steps=500`), exited cleanly in 10s. `COMPLETED 0:0`.
Verified path resolution and `EXPERIMENT_ID` propagation produced the expected disk layout:
`chain_smoke/chain_smoke_2026-05-01_19-40-36/0/step_*`.
🤖 Generated with Claude Code