
Replace DSv4 8k1k recipes with NVIDIA/srt-slurm PR #78 #1164

Closed
Oseltamivir wants to merge 2 commits into main from dsv4-8k1k-nvidia-pr78-recipes

Conversation

@Oseltamivir
Collaborator

Summary

  • Replaces all 8k1k DSv4-Pro GB200 disagg recipes with the tuned configs from NVIDIA/srt-slurm PR #78 ("vLLM DSv4 Disagg Improvement 042526" — PD tuning + FlashInfer all2all)
  • Drops the old low-conc TEP recipe (1p1d-dep8-tep8) and hand-rolled mid/high topologies
  • New topologies match NVIDIA's latest recommended configurations

Old → New topology mapping

| Old | New | Nodes | Conc |
| --- | --- | --- | --- |
| 1p1d-dep8-tep8 (DP=8 prefill, TP=8 decode) | deleted | 4 | 1-64 |
| — | 2p1d-dep8-dep8 (2× DP=8 prefill, DP=8 decode) | 6 | 256, 512, 1024 |
| 3p1d-dep8-dep16 (3× DP=8 prefill, DP=16 decode) | 3p1d-dep8-dep8 (3× DP=8 prefill, DP=8 decode) | 8 (was 10) | 2048 |
| 7p1d-dep8-dep16 | 7p1d-dep8-dep16 (updated config) | 18 | 4096, 8192 |

Key vllm_config changes from NVIDIA PR #78

  • Prefill: max-model-len 16384, max-num-seqs 16, max-num-batched-tokens 32768, offload (3/1/2), numa-bind, no-async-scheduling, tokenizer-mode deepseek_v4, enable-ep-weight-filter
  • Decode: all2all-backend flashinfer_nvlink_one_sided, tokenizer-mode deepseek_v4, enable-ep-weight-filter
  • Env vars: VLLM_RANDOMIZE_DP_DUMMY_INPUTS, VLLM_MOE_ROUTING_SIMULATION_STRATEGY=uniform_random, VLLM_LOG_STATS_INTERVAL=1
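The flag list above could be collected into a recipe fragment roughly like the following. This is an illustrative sketch only: the key names are taken verbatim from the bullets, but the surrounding block structure (`prefill:` / `decode:` / `env:` nesting) and the value for `VLLM_RANDOMIZE_DP_DUMMY_INPUTS` are assumptions, and the offload (3/1/2) settings are omitted because their exact key names are not given here.

```yaml
# Sketch of the tuned config described above — structure assumed, not
# copied from the actual recipe files in the PR.
prefill:
  vllm_config:
    max-model-len: 16384
    max-num-seqs: 16
    max-num-batched-tokens: 32768
    numa-bind: true              # flagged by a reviewer below as problematic
    no-async-scheduling: true
    tokenizer-mode: deepseek_v4
    enable-ep-weight-filter: true
    # offload (3/1/2) settings omitted: exact key names not listed here
decode:
  vllm_config:
    all2all-backend: flashinfer_nvlink_one_sided
    tokenizer-mode: deepseek_v4
    enable-ep-weight-filter: true
env:
  VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"   # value assumed; PR lists only the name
  VLLM_MOE_ROUTING_SIMULATION_STRATEGY: uniform_random
  VLLM_LOG_STATS_INTERVAL: "1"
```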

Local adaptations (kept from our env)

  • model.path deepseek-v4-pro (launch script alias)
  • Container floating tag :deepseekv4-cu130 (vs NVIDIA's sha256 pin)
  • Dynamo hash-based pinning (vs version-based)
  • slurm.time_limit + health_check (Lustre cold-cache loads)
  • Benchmark: sa-bench (our CI) instead of vllm-bench

Test plan

  • nvidia-master.yaml 8k1k section has 3 search-space entries matching new recipe files
  • Recipe filenames match CONFIG_FILE references in nvidia-master.yaml
  • perf-changelog.yaml is additions-only vs main

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

block-size: 256
gpu-memory-utilization: 0.8
no-disable-hybrid-kv-cache-manager: true
numa-bind: true
Contributor


🔴 All three new 8k1k recipes (2p1d, 3p1d, 7p1d at line 90) set numa-bind: true on prefill, but our NVIDIA/srt-slurm@sa-submission-q2-2026 clone (still pinned in runners/launch_gb200-nv.sh:146) does not ship the vllm_numa_bind_hash_fix.py patch — the existing 1k1k recipe disagg-gb200-1p1d-dep8-tep8.yaml:108-110 documents this exact constraint. Either drop numa-bind: true from the new prefill blocks (matching the deleted 8k1k 1p1d recipe and its 1k1k sibling), or land the patch in the local clone and remove the now-stale 1k1k comment.

Extended reasoning...

The bug. All three new 8k1k recipes added in this PR set numa-bind: true in the prefill vllm_config block:

  • benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8.yaml:90
  • benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep8.yaml:90
  • benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml:90

This contradicts a documented limitation of our local fork. The unchanged 1k1k sibling recipe benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:108-110 explicitly states:

Numa-bind from upstream is still off because our NVIDIA/srt-slurm@sa-submission-q2-2026 clone doesn't ship the vllm_numa_bind_hash_fix.py patch.

The 8k1k 1p1d recipe deleted in this same PR carried the same comment, and perf-changelog.yaml line 1755 ('1p1d-dep8-tep8 ... with offload kept and numa-bind dropped') reinforces it. nvidia-master.yaml also notes 'offload + numa-bind stripped — see recipe header' for the 1k1k variant.

Why the constraint still applies. The launch script runners/launch_gb200-nv.sh:146 still pins the srt-slurm clone to branch sa-submission-q2-2026 — the exact clone the 1k1k comment refers to. This PR does not bump that pin, and vllm_numa_bind_hash_fix.py does not appear anywhere else in the repo, so the patch is still missing.

Why this isn't caught by upstream agreement. This PR ports configs from NVIDIA/srt-slurm PR #78 wholesale. The PR description's 'Local adaptations (kept from our env)' section enumerates every preserved local delta — model.path, container tag, dynamo hash, slurm.time_limit, health_check, sa-bench — but does NOT mention dropping numa-bind. That, combined with the fact that the new files mirror upstream verbatim on this flag, strongly suggests the numa-bind-drop adaptation was simply forgotten when porting.

Impact. Per the prefill recipe header that previously shipped (and the matching 1k1k comment), prefill workers fail to start without the hash-fix patch. Concretely, when vllm serve parses numa-bind: true on the unpatched fork, the numa-binding code path crashes/hangs because it relies on the missing hash helpers. This would block the entire 8k1k DSv4 sweep at startup — every run of the new 2p1d, 3p1d, and 7p1d recipes.

Step-by-step proof of the failure path.

  1. CI invokes runners/launch_gb200-nv.sh for one of the new 8k1k entries in nvidia-master.yaml (e.g. CONFIG_FILE recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8.yaml).
  2. Line 146 of the launch script clones NVIDIA/srt-slurm at branch sa-submission-q2-2026. The local 8k1k recipe is overlaid into that checkout.
  3. srt-slurm reads the recipe and, because of numa-bind: true in the prefill block, takes the numa-binding code path that requires vllm_numa_bind_hash_fix.py.
  4. That file is not present in the sa-submission-q2-2026 branch (as documented in the still-shipping 1k1k recipe and verified by the absence of any patch-landing commit).
  5. The prefill workers fail to start. Decode workers depend on the prefill side, so the run aborts before producing any benchmark numbers.

Fix. Two acceptable options:

  1. (simpler) Remove numa-bind: true from the prefill block in all three new recipes, matching the deleted 8k1k 1p1d recipe and the surviving 1k1k sibling. Add a one-line comment pointing at the constraint so the next port doesn't repeat the mistake.
  2. (if numa-bind is now actually wanted) Bump the srt-slurm clone in runners/launch_gb200-nv.sh to a branch that includes vllm_numa_bind_hash_fix.py, drop the now-stale comment from the 1k1k recipe, and update the perf-changelog narrative for the 1k1k entry. The PR description should also call out the bump as a local adaptation.
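Fix option 1 would look roughly like the following for the prefill block of each new 8k1k recipe. The first three keys are taken from the hunk quoted at the top of this thread; everything else is illustrative, not the recipes' actual surrounding structure.

```yaml
# Sketch of fix option 1: prefill vllm_config with numa-bind dropped.
# Block nesting is assumed; only the listed keys come from the quoted hunk.
prefill:
  vllm_config:
    block-size: 256
    gpu-memory-utilization: 0.8
    no-disable-hybrid-kv-cache-manager: true
    # numa-bind intentionally dropped: our NVIDIA/srt-slurm@sa-submission-q2-2026
    # clone lacks vllm_numa_bind_hash_fix.py (see the 1k1k 1p1d recipe header).
```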

Comment thread: perf-changelog.yaml
- "Old topologies: 1p1d-dep8-tep8 (c1-64), 3p1d-dep8-dep16 (c512-1024), 7p1d-dep8-dep16 (c4096-8192)"
- "New topologies: 2p1d-dep8-dep8 (c256-1024), 3p1d-dep8-dep8 (c2048), 7p1d-dep8-dep16 (c4096-8192)"
- "Key changes: max-model-len 16384, prefill max-num-seqs 16, offload+numa-bind on prefill, all2all-backend flashinfer_nvlink_one_sided on decode, enable-ep-weight-filter, tokenizer-mode deepseek_v4"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
Contributor


🟡 The new perf-changelog.yaml entry on line 1844 sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD instead of the actual PR number. Every other entry in the file uses a concrete numeric pull/<N> URL, so after merge this URL resolves to a 404 and breaks any tooling that follows pr-link entries. Replace TBD with 1164 (this PR's number) before merging.

Extended reasoning...

What the bug is

perf-changelog.yaml line 1844 (the new changelog entry added by this PR) reads:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD

This is the only TBD placeholder in the entire file. A grep across the rest of perf-changelog.yaml shows ~125 prior entries and every single one uses a concrete numeric URL like /pull/1144, /pull/1148, /pull/1160 (the entry directly above this one). The PR number is already known — it is #1164, the PR adding this very entry.

How it manifests

After this PR merges, anyone (or any tool) that resolves the pr-link field will hit https://github.com/SemiAnalysisAI/InferenceX/pull/TBD. GitHub interprets TBD as a non-numeric pull-request ID, so the URL returns a 404 "page not found" forever — there is no path that retroactively makes the link valid once committed.

Why the existing tooling/process doesn't prevent it

There is nothing in the diff (or apparently in repo CI) that validates pr-link values are numeric or resolvable. The entry was clearly added during PR drafting before the PR number was assigned, and the placeholder was not swept after the PR was opened (#1164). It's a common authoring pattern that's typically resolved manually at review time.

Impact

Docs-only — no runtime behavior is affected. But the cross-reference is permanently dead post-merge: changelog viewers, search tooling, audit scripts, or anyone clicking through from the changelog to the introducing PR for context will land on a 404. Since every other entry in the file resolves correctly, this single broken entry is also a long-term inconsistency in the file.

How to fix

Replace TBD with 1164 on line 1844:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1164

Step-by-step proof

  1. Open perf-changelog.yaml line 1844 — see pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD.
  2. grep -c 'pull/TBD' perf-changelog.yaml → 1 (only this new entry).
  3. grep -c 'pull/[0-9]' perf-changelog.yaml → ~125 (every other entry uses a numeric ID).
  4. The previous entry on line 1835 uses /pull/1160 — a fully resolved link — confirming the convention for the file.
  5. PR metadata: this is PR Replace DSv4 8k1k recipes with NVIDIA/srt-slurm PR #78 #1164, so the correct value is https://github.com/SemiAnalysisAI/InferenceX/pull/1164.
  6. After merge, fetching https://github.com/SemiAnalysisAI/InferenceX/pull/TBD returns 404 because TBD is not a valid pull-request ID — confirming the link is broken.

Comment on lines +107 to +109
max-num-seqs: 512
max-cudagraph-capture-size: 512
max-num-batched-tokens: 512
Contributor


🟡 Stale cross-file comment in the unmodified 1k1k sibling benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-dep16.yaml (lines 13-14). It lists decode max-num-seqs: 512 instead of 256 and max-cudagraph-capture-size / max-num-batched-tokens (decode): 512 as differences from the 8k1k 7p1d-dep8-dep16 recipe, but this PR bumps the 8k1k 7p1d decode values to 512/512/512 (lines 107-109), so those bullets no longer describe differences; both recipes now share those values. Comment-only doc rot with no runtime impact; recommend updating the 1k1k delta block (drop the two bullets) in this PR or a quick follow-up.

Extended reasoning...

What the bug is

The 1k1k sibling benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-dep16.yaml carries a "Differences from our 8k1k 7p1d-dep8-dep16" comment block (lines 9-14) that enumerates how the 1k1k recipe diverges from the 8k1k 7p1d-dep8-dep16 recipe. Two of those bullets call out decode-side parameters:

* decode max-num-seqs: 512 instead of 256 (shorter KV, more parallelism)
* max-cudagraph-capture-size / max-num-batched-tokens (decode): 512

This PR bumps the 8k1k 7p1d-dep8-dep16 decode block (lines 107-109 of the new file) to exactly the same values the 1k1k comment claims are unique to 1k1k.

Step-by-step proof

  1. Read benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-dep16.yaml:13-14 — it asserts that 1k1k's decode max-num-seqs is 512 instead of 256, and that max-cudagraph-capture-size / max-num-batched-tokens are 512 (implicitly, vs the 8k1k baseline's lower values).
  2. Look at the new 8k1k file in this PR, benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml:107-109:
    max-num-seqs: 512
    max-cudagraph-capture-size: 512
    max-num-batched-tokens: 512
  3. Compare: both recipes now share max-num-seqs=512, max-cudagraph-capture-size=512, max-num-batched-tokens=512 on the decode side. The 1k1k comment's premise ("512 instead of 256") is now wrong: the 8k1k baseline is also 512.

Why existing code does not prevent this

The cross-file comment is plain prose tied to a specific snapshot of the 8k1k recipe. There is no test or lint that checks whether documented deltas match reality, and the 1k1k file is not modified by this PR, so a normal diff review would not flag it.

Impact

Documentation-only — zero runtime effect, no behavior change, no CI breakage. The risk is purely confusion: a future reader comparing the two recipes (which the comment block exists specifically to help with) will see contradictory information and waste time figuring out which file is authoritative.

How to fix

Either drop the two now-stale bullets from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-dep16.yaml:13-14, or rewrite the delta block from scratch against the post-PR 8k1k recipe (which has many other legitimate deltas now: prefill max-model-len, numa-bind, offload, all2all backend, etc., that the comment block doesn't currently mention). Cheap to fix in this PR while the author has the context, but a follow-up is also fine since this is documentation rot, not a runtime regression.
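The cheaper of the two fixes (dropping the stale bullets) might leave a delta block along these lines. The wording is illustrative; the surviving bullets would need to be checked against the actual 1k1k file, which is not fully quoted in this thread.

```yaml
# Sketch of the trimmed delta block in 1k1k/disagg-gb200-1p1d-dep8-dep16.yaml.
# The decode max-num-seqs / max-cudagraph-capture-size / max-num-batched-tokens
# bullets are removed because both recipes now use 512 for all three.
# Differences from our 8k1k 7p1d-dep8-dep16:
#   (remaining genuine deltas only — e.g. prefill max-model-len, offload,
#    all2all backend — verified against the post-PR 8k1k recipe)
```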

# Low-concurrency / interactivity: 1 prefill (DP=8) + 1 decode (TP=8).
# 4 nodes total. Mirrors NVIDIA aflowers/gb200-dsv4-recipes branch.
- conc-list: [1, 4, 8, 16, 32, 64]
# 2P1D: 2 prefills (DP=8) + 1 decode (DP=8). 6 nodes.
Contributor


🟡 After this PR merges, the 1k1k sibling benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml (lines 11-12) still defers documentation to ../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml — but this PR deletes that file, so the cross-reference is dead and the documented deltas (model.path alias, dropped numa-bind, kept offload, container floating tag, slurm.time_limit, etc.) are lost. Either inline those deltas into the 1k1k recipe header or remove the cross-reference. Note: the bug location is benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:11-12, not .github/configs/nvidia-master.yaml as listed in the bug header.

Extended reasoning...

What the bug is

This PR deletes benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-1p1d-dep8-tep8.yaml. However, the 1k1k sibling recipe at benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml lines 11-12 contains:

# Local deltas vs upstream 8k/1k sibling: same as the 8k/1k recipe — see
# ../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml for the full deviation list.

After merge, that relative reference points to a non-existent file.

Why the existing code doesn't prevent it

The 1k1k recipe was authored when the 8k1k 1p1d-dep8-tep8 file existed (see PR #1129 in perf-changelog.yaml). The 1k1k header chose to defer to the 8k1k sibling rather than inline the deltas, on the assumption that both recipes would evolve together. This PR breaks that assumption by deleting only the 8k1k variant (the new NVIDIA/srt-slurm PR #78 topologies start at 2p1d, leaving the 1p1d-dep8-tep8 low-concurrency entry point owned solely by the 1k1k variant).

Impact

Documentation-only — no runtime effect. But the lost documentation is non-trivial: the upstream-vs-local deviation list explained (1) why model.path is renamed deepseekv4-fp4 -> deepseek-v4-pro (launch script alias), (2) why numa-bind is dropped (the vllm_numa_bind_hash_fix.py patch is missing in our srt-slurm clone), (3) why CPU/DRAM offload is kept despite numa-bind being off (load-bearing for KV cache fit), (4) why the container uses a floating tag instead of the upstream sha256 pin, (5) why slurm.time_limit was added (Lustre cold-cache loads), and (6) why benchmark.use_chat_template: false (sa-bench tokenizer support gap). A future maintainer touching the 1k1k recipe will lose all of that rationale and may inadvertently re-enable numa-bind or change the offload settings without realizing the constraints.

Step-by-step proof

  1. Before this PR, 8k1k/disagg-gb200-1p1d-dep8-tep8.yaml existed and contained a detailed comment block explaining all local deltas (visible in the diff's deletion section).
  2. The 1k1k sibling at 1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:11-12 references that file by relative path: ../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml.
  3. This PR's diff shows the 8k1k file deleted (+++ /dev/null).
  4. Verified post-merge state: ls benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/ returns only disagg-gb200-2p1d-dep8-dep8.yaml, disagg-gb200-3p1d-dep8-dep8.yaml, and disagg-gb200-7p1d-dep8-dep16.yaml — no 1p1d-dep8-tep8.
  5. Therefore the relative path in the 1k1k header resolves to a non-existent file.

How to fix

Two options, both small:

  • Inline the deltas into the 1k1k recipe header. Copy the bullet list from the deleted 8k1k file's comment block (model.path alias, numa-bind dropped + reason, offload kept + reason, floating container tag, slurm.time_limit / health_check, sa-bench benchmark switch) into 1k1k/disagg-gb200-1p1d-dep8-tep8.yaml lines 11-12, replacing the cross-reference. This is the higher-value fix since the 1k1k variant now owns the only 1p1d-dep8-tep8 recipe in the tree.
  • Remove the cross-reference. Delete lines 11-12 entirely. Cheaper but loses the documentation chain.
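The higher-value fix (inlining) could replace the dead cross-reference with a header block like this. The bullet content is drawn from the deviation list summarized above in this thread; the exact wording and ordering are illustrative, not the deleted file's verbatim text.

```yaml
# Sketch of an inlined header for 1k1k/disagg-gb200-1p1d-dep8-tep8.yaml,
# replacing the now-dead ../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml reference.
# Local deltas vs upstream:
#   - model.path renamed deepseekv4-fp4 -> deepseek-v4-pro (launch script alias)
#   - numa-bind dropped: vllm_numa_bind_hash_fix.py missing from our
#     NVIDIA/srt-slurm@sa-submission-q2-2026 clone
#   - CPU/DRAM offload kept despite numa-bind being off (load-bearing for KV cache fit)
#   - container uses floating tag :deepseekv4-cu130 instead of upstream's sha256 pin
#   - slurm.time_limit + health_check added (Lustre cold-cache loads)
#   - benchmark.use_chat_template: false (sa-bench tokenizer support gap)
```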

Note on file location

The synthesis agent listed the file as .github/configs/nvidia-master.yaml but the actual broken cross-reference lives in benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:11-12. The substance of the bug is unchanged.

@Oseltamivir
Collaborator Author

Taken over by #1163
