[WIP] Allow overriding srt-slurm repo/ref at the launcher level #1118
Oseltamivir wants to merge 3 commits into main
Conversation
Adds optional `SRT_SLURM_REPO` / `SRT_SLURM_REF` env vars that the GB200 launcher reads to choose a non-default srt-slurm fork/branch. Defaults are unchanged — production runs that don't set these still clone `NVIDIA/srt-slurm@sa-submission-q2-2026` (dynamo-vllm, dynamo-trt+kimik2.5) or `ishandhanani/srt-slurm@sa-submission-q1-2026` (everything else) as before.

Also exports `LM_EVAL_WORKSPACE=$GITHUB_WORKSPACE` alongside the existing `INFMAX_WORKSPACE`. The generalized lm-eval support in NVIDIA/srt-slurm#41 reads `LM_EVAL_WORKSPACE`; the legacy sa-submission branches read `INFMAX_WORKSPACE`. Exporting both keeps a single launcher working with either version.

Plumbs `srt-slurm-repo` and `srt-slurm-ref` as optional inputs through `e2e-tests.yml` → `benchmark-multinode-tmpl.yml` so `workflow_dispatch` runs can target a specific srt-slurm fork/branch for validation without touching production defaults.
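As a sketch, the dual workspace export amounts to something like the following (illustrative, not the literal launcher lines; the `/tmp/workspace` fallback only exists to make the snippet self-contained — in CI, `$GITHUB_WORKSPACE` is set by the Actions runner):

```shell
# Illustrative sketch of the dual export, not the exact launcher code.
# GITHUB_WORKSPACE is normally provided by the GitHub Actions runner;
# the placeholder default below just keeps this snippet runnable standalone.
GITHUB_WORKSPACE="${GITHUB_WORKSPACE:-/tmp/workspace}"

# Legacy sa-submission-* branches read INFMAX_WORKSPACE; the generalized
# lm-eval runner in NVIDIA/srt-slurm#41 reads LM_EVAL_WORKSPACE.
# Exporting both keeps one launcher compatible with either version.
export INFMAX_WORKSPACE="$GITHUB_WORKSPACE"
export LM_EVAL_WORKSPACE="$GITHUB_WORKSPACE"

echo "INFMAX_WORKSPACE=$INFMAX_WORKSPACE"
echo "LM_EVAL_WORKSPACE=$LM_EVAL_WORKSPACE"
```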
LGTM — additive override plumbing with unchanged defaults.
Overview
This PR adds optional srt-slurm-repo / srt-slurm-ref inputs to e2e-tests.yml (both workflow_dispatch and workflow_call) and to benchmark-multinode-tmpl.yml, exposes them to the launcher as SRT_SLURM_REPO / SRT_SLURM_REF env vars, and refactors the three git clone branches in runners/launch_gb200-nv.sh into a single clone block driven by DEFAULT_SRT_REPO/DEFAULT_SRT_REF variables with env-var override. It also exports LM_EVAL_WORKSPACE=$GITHUB_WORKSPACE alongside the existing INFMAX_WORKSPACE for forward-compat with a generalized upstream srt-slurm branch.
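A minimal sketch of the refactored clone block — the `DEFAULT_SRT_REPO`/`DEFAULT_SRT_REF` names come from this PR, but the URL form is an assumption and an `echo` stands in for the real `git clone` so the sketch runs offline:

```shell
# Per-framework defaults chosen earlier in the launcher (values shown are the
# dynamo-vllm defaults named in the PR; the clone-URL form is an assumption).
DEFAULT_SRT_REPO="https://github.com/NVIDIA/srt-slurm.git"
DEFAULT_SRT_REF="sa-submission-q2-2026"

# Env-var override with fall-through to the defaults. The ${VAR:-default}
# form also covers a workflow that passes an empty string for an unset input.
SRT_REPO="${SRT_SLURM_REPO:-$DEFAULT_SRT_REPO}"
SRT_REF="${SRT_SLURM_REF:-$DEFAULT_SRT_REF}"

# echo instead of a real clone so this sketch is runnable without network.
echo git clone --depth 1 --branch "$SRT_REF" "$SRT_REPO" srt-slurm
```

Setting `SRT_SLURM_REPO`/`SRT_SLURM_REF` in the environment before this block swaps in the fork/branch; leaving them unset (or empty) preserves today's behavior.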
Security risks
None meaningful. The override inputs let a user with workflow-dispatch permission point the launcher at an arbitrary git URL, but that capability already exists for anyone who can edit the launcher script, and the clone/checkout runs inside the benchmarking job — not a trust boundary that would elevate privilege. No secrets handling changes.
Level of scrutiny
Low. This is CI/benchmark infra, not production-serving code. Defaults are byte-identical to the prior behavior: the three framework branches still map to the same (repo, ref) pairs, and ${VAR:-default} falls through to the default when the workflow passes an empty string. The change is isolated to one launcher (launch_gb200-nv.sh) — the other GPU launchers are explicitly out of scope per the PR description.
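The empty-string fall-through relies on the `:-` (rather than `-`) form of POSIX parameter expansion; a two-line demonstration:

```shell
# A dispatch input left blank reaches the launcher as an empty string, not
# as an unset variable — so :- (substitute when unset OR empty) is the form
# that preserves the default, while plain - would pass the empty string on.
SRT_SLURM_REF=""
echo "with :- -> ${SRT_SLURM_REF:-sa-submission-q2-2026}"   # prints the default
echo "with -  -> ${SRT_SLURM_REF-sa-submission-q2-2026}"    # prints nothing after the arrow
```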
Other factors
No bugs were found by the bug hunting system. The refactor is mechanical and the override pattern is idiomatic for shell scripts. Adding LM_EVAL_WORKSPACE alongside INFMAX_WORKSPACE is purely additive and justified by the compatibility rationale.
One-entry sweep config used by the manual workflow_dispatch that validates NVIDIA/srt-slurm#41 on GB200. Mirrors the cheapest entry from dsr1-fp8-gb200-dynamo-trt (8k1k stp, eval-conc=63, 1P/3D) so the end-to-end eval path is exercised without running the full gb200 sweep. Not referenced by any automated workflow; picked up only when passed explicitly via --config-files.
The fp8 variant mapped to /mnt/numa1/.../deepseek-r1-0528/, which is only visible from SLURM compute nodes and not from the gb200 GH runners, so the new preflight validation in NVIDIA/srt-slurm@lm-eval-main (not present in sa-submission-q2-2026) rejected the submission on the runner side before sbatch. The dsr1-fp4 path in the launcher maps to /mnt/lustre01/models/deepseek-r1-0528-fp4-v2/, which is on the shared lustre mount accessible from both the runner and compute nodes. Recipe ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml is the smallest 8k1k gb200 trtllm fp4 recipe (1P+8D=9 nodes, single variant, nginx-sqsh container alias matches the launcher's srtslurm.yaml).
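The runner-side preflight described above can be sketched as a simple existence check before `sbatch` — this is an illustration of the failure mode, not srt-slurm's actual validation code, and the `/tmp` fallback exists only so the snippet runs outside the cluster:

```shell
# Illustrative preflight sketch (not srt-slurm's real check). On the cluster
# this would be the shared-lustre model path, e.g.
# /mnt/lustre01/models/deepseek-r1-0528-fp4-v2/, which both the GH runner and
# the SLURM compute nodes can see; a numa-local path fails here on the runner.
MODEL_PATH="${MODEL_PATH:-/tmp}"

if [ ! -d "$MODEL_PATH" ]; then
  echo "model path $MODEL_PATH not visible from this node; refusing to submit" >&2
  exit 1
fi
echo "preflight ok: $MODEL_PATH"
```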
Summary
- `SRT_SLURM_REPO`/`SRT_SLURM_REF` env-var override in `runners/launch_gb200-nv.sh` so a `workflow_dispatch` can test a WIP srt-slurm fork/branch without touching production defaults.
- `LM_EVAL_WORKSPACE=$GITHUB_WORKSPACE` alongside the existing `INFMAX_WORKSPACE` so the launcher is forward-compatible with the generalized lm-eval runner in NVIDIA/srt-slurm#41 (which reads `LM_EVAL_WORKSPACE`) while legacy `sa-submission-*` branches keep working (they read `INFMAX_WORKSPACE`).
- `srt-slurm-repo` and `srt-slurm-ref` as optional inputs through `e2e-tests.yml` → `benchmark-multinode-tmpl.yml` so the override can be set at dispatch time.
- `.github/configs/srt-slurm-validation.yaml`, a single-entry sweep config mirroring the cheapest `dsr1-fp8-gb200-dynamo-trt` entry so the validation dispatch runs exactly one eval job instead of the full gb200 matrix.

Why
NVIDIA/srt-slurm#41 generalizes the lm-eval runner so it no longer requires SA-specific infra (the reviewer's ask on #12). Before switching defaults, we want to validate the generalized version end-to-end on GB200 with the existing InferenceX eval harness. This PR adds that knob and leaves every existing code path unchanged.
How to test
Runs exactly one eval-only job:
`dsr1_8k1k dynamo-trt fp8 | P(tp8/ep8/dptrue/nw1) D(tp8/ep8/dpfalse/nw3) | disagg-true spec-none conc-63 | eval-only`. Clones the generalized srt-slurm fork (`Oseltamivir/srt-slurm-nvidia@lm-eval-main`) instead of the default `NVIDIA/srt-slurm@sa-submission-q2-2026`.

Scope / non-goals
- Only `launch_gb200-nv.sh` is updated. The other NVIDIA launchers (`launch_{b200,b300,gb300,h100,h200}-*.sh`) keep their current `sa-submission-q2-2026` clone and can be updated in a follow-up once the smoke test passes.
- `srt-slurm-validation.yaml` is a test fixture referenced only by explicit `--config-files` in the dispatch command above; no automated workflow picks it up.