28 changes: 14 additions & 14 deletions .github/configs/nvidia-master.yaml
@@ -7622,37 +7622,37 @@ dsv4-fp4-gb200-dynamo-vllm:
  - isl: 8192
    osl: 1024
    search-space:
-   # Low-concurrency / interactivity: 1 prefill (DP=8) + 1 decode (TP=8).
-   # 4 nodes total. Mirrors NVIDIA aflowers/gb200-dsv4-recipes branch.
-   - conc-list: [1, 4, 8, 16, 32, 64]
+   # 2P1D: 2 prefills (DP=8) + 1 decode (DP=8). 6 nodes.

🟡 After this PR merges, the 1k1k sibling benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml (lines 11-12) still defers documentation to ../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml — but this PR deletes that file, so the cross-reference is dead and the documented deltas (model.path alias, dropped numa-bind, kept offload, container floating tag, slurm.time_limit, etc.) are lost. Either inline those deltas into the 1k1k recipe header or remove the cross-reference. Note: the bug location is benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:11-12, not .github/configs/nvidia-master.yaml as listed in the bug header.

Extended reasoning...

What the bug is

This PR deletes benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-1p1d-dep8-tep8.yaml. However, the 1k1k sibling recipe at benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml lines 11-12 contains:

# Local deltas vs upstream 8k/1k sibling: same as the 8k/1k recipe — see
# ../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml for the full deviation list.

After merge, that relative reference points to a non-existent file.

Why the existing code doesn't prevent it

The 1k1k recipe was authored when the 8k1k 1p1d-dep8-tep8 file existed (see PR #1129 in perf-changelog.yaml). The 1k1k header chose to defer to the 8k1k sibling rather than inline the deltas, on the assumption that both recipes would evolve together. This PR breaks that assumption by deleting only the 8k1k variant (the new NVIDIA/srt-slurm PR #78 topologies start at 2p1d, leaving the 1p1d-dep8-tep8 low-concurrency entry point owned solely by the 1k1k variant).

Impact

Documentation-only — no runtime effect. But the lost documentation is non-trivial: the upstream-vs-local deviation list explained (1) why model.path is renamed deepseekv4-fp4 -> deepseek-v4-pro (launch script alias), (2) why numa-bind is dropped (the vllm_numa_bind_hash_fix.py patch is missing in our srt-slurm clone), (3) why CPU/DRAM offload is kept despite numa-bind being off (load-bearing for KV cache fit), (4) why the container uses a floating tag instead of the upstream sha256 pin, (5) why slurm.time_limit was added (Lustre cold-cache loads), and (6) why benchmark.use_chat_template: false (sa-bench tokenizer support gap). A future maintainer touching the 1k1k recipe will lose all of that rationale and may inadvertently re-enable numa-bind or change the offload settings without realizing the constraints.

Step-by-step proof

  1. Before this PR, 8k1k/disagg-gb200-1p1d-dep8-tep8.yaml existed and contained a detailed comment block explaining all local deltas (visible in the diff's deletion section).
  2. The 1k1k sibling at 1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:11-12 references that file by relative path: ../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml.
  3. This PR's diff shows the 8k1k file deleted (+++ /dev/null).
  4. Verified post-merge state: ls benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/ returns only disagg-gb200-2p1d-dep8-dep8.yaml, disagg-gb200-3p1d-dep8-dep8.yaml, and disagg-gb200-7p1d-dep8-dep16.yaml — no 1p1d-dep8-tep8.
  5. Therefore the relative path in the 1k1k header resolves to a non-existent file.
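
The resolution in step 5 is easy to reproduce in a throwaway directory tree (a sketch; the temp layout below stands in for the repo paths, with only the post-merge 8k1k files present):

```shell
# Recreate the post-merge 8k1k directory from step 4 in a temp tree, then
# resolve the 1k1k header's relative reference against it.
tmp=$(mktemp -d)
mkdir -p "$tmp/1k1k" "$tmp/8k1k"
touch "$tmp/8k1k/disagg-gb200-2p1d-dep8-dep8.yaml" \
      "$tmp/8k1k/disagg-gb200-3p1d-dep8-dep8.yaml" \
      "$tmp/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml"
# The path referenced by the 1k1k recipe header, resolved relative
# to the 1k1k directory:
ref="$tmp/1k1k/../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml"
if [ -e "$ref" ]; then result="resolves"; else result="dead"; fi
echo "cross-reference: $result"   # prints "cross-reference: dead"
rm -rf "$tmp"
```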

How to fix

Two options, both small:

  • Inline the deltas into the 1k1k recipe header. Copy the bullet list from the deleted 8k1k file's comment block (model.path alias, numa-bind dropped + reason, offload kept + reason, floating container tag, slurm.time_limit / health_check, sa-bench benchmark switch) into 1k1k/disagg-gb200-1p1d-dep8-tep8.yaml lines 11-12, replacing the cross-reference. This is the higher-value fix since the 1k1k variant now owns the only 1p1d-dep8-tep8 recipe in the tree.
  • Remove the cross-reference. Delete lines 11-12 entirely. Cheaper but loses the documentation chain.
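
For the first option, the inlined 1k1k header could look roughly like this (a sketch reconstructed from the delta list above, not the verbatim text of the deleted file):

```yaml
# Local deltas vs upstream (inlined; formerly documented in the now-deleted
# ../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml):
# * model.path: deepseekv4-fp4 -> deepseek-v4-pro (launch script alias)
# * numa-bind dropped: our srt-slurm clone lacks vllm_numa_bind_hash_fix.py
# * CPU/DRAM offload kept despite numa-bind off (load-bearing for KV cache fit)
# * container: sha256 pin -> floating tag
# * slurm.time_limit added (Lustre cold-cache loads)
# * benchmark.use_chat_template: false (sa-bench tokenizer support gap)
```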

Note on file location

The synthesis agent listed the file as .github/configs/nvidia-master.yaml but the actual broken cross-reference lives in benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:11-12. The substance of the bug is unchanged.

+   # From NVIDIA/srt-slurm PR #78.
+   - conc-list: [256, 512, 1024]
      prefill:
-       num-worker: 1
+       num-worker: 2
        tp: 8
        ep: 8
        dp-attn: true
      additional-settings:
-     - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-1p1d-dep8-tep8.yaml"
+     - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8.yaml"
      decode:
        num-worker: 1
        tp: 8
-       ep: 1
-       dp-attn: false
-   # Mid: 3 prefills (DP=8) + 1 wide decode (DP=16). 10 nodes total.
-   - conc-list: [512, 1024]
+       ep: 8
+       dp-attn: true
+   # 3P1D: 3 prefills (DP=8) + 1 decode (DP=8). 8 nodes.
+   - conc-list: [2048]
      prefill:
        num-worker: 3
        tp: 8
        ep: 8
        dp-attn: true
      additional-settings:
-     - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep16.yaml"
+     - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep8.yaml"
      decode:
        num-worker: 1
-       tp: 16
-       ep: 16
+       tp: 8
+       ep: 8
        dp-attn: true
-   # Max throughput: 7 prefills (DP=8) + 1 wide decode (DP=16). 18 nodes
-   # (full cluster). Mirrors NVIDIA/srt-slurm PR #67.
+   # 7P1D: 7 prefills (DP=8) + 1 decode (DP=16). 18 nodes.
+   # From NVIDIA/srt-slurm PR #78.
    - conc-list: [4096, 8192]
      prefill:
        num-worker: 7

benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-1p1d-dep8-tep8.yaml — This file was deleted.

benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8.yaml
@@ -0,0 +1,128 @@
name: "dsv4-vllm-disagg-gb200-2p1d-dep8-dep8"

# From NVIDIA/srt-slurm PR #78. 2P1D topology: 2 prefill workers (DP=8) +
# 1 decode (DP=8). 6 nodes total. Targets conc 256-1024.
#
# Local deltas vs upstream:
# * model.path: deepseekv4-fp4 -> deepseek-v4-pro (launch script alias)
# * container: sha256 pin -> floating tag :deepseekv4-cu130
# * dynamo: version 1.0.2 -> hash pin (our env uses hash-based pinning)
# * Added slurm.time_limit + health_check (Lustre cold-cache loads)
# * benchmark: vllm-bench -> sa-bench (our CI tooling)

model:
  path: "deepseek-v4-pro"
  container: "vllm/vllm-openai:deepseekv4-cu130"
  precision: "fp4"

dynamo:
  hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
  install: true

setup_script: vllm-container-deps.sh

slurm:
  time_limit: "8:00:00"

health_check:
  max_attempts: 1440
  interval_seconds: 10

resources:
  gpu_type: "gb200"
  gpus_per_node: 4
  prefill_nodes: 4
  decode_nodes: 2
  prefill_workers: 2
  decode_workers: 1
  gpus_per_prefill: 8
  gpus_per_decode: 8

frontend:
  type: dynamo
  enable_multiple_frontends: false

backend:
  type: vllm
  connector: null

prefill_environment:
  TILELANG_CLEANUP_TEMP_FILES: "1"
  VLLM_USE_NCCL_SYMM_MEM: "1"
  NCCL_CUMEM_ENABLE: "1"
  NCCL_MNNVL_ENABLE: "1"
  NCCL_NVLS_ENABLE: "1"
  VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
  VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
  VLLM_LOG_STATS_INTERVAL: "1"

decode_environment:
  TILELANG_CLEANUP_TEMP_FILES: "1"
  VLLM_USE_NCCL_SYMM_MEM: "1"
  NCCL_CUMEM_ENABLE: "1"
  NCCL_MNNVL_ENABLE: "1"
  NCCL_NVLS_ENABLE: "1"
  VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
  VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
  VLLM_LOG_STATS_INTERVAL: "1"

vllm_config:
  prefill:
    kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
    served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
    kv-cache-dtype: "fp8"
    tensor-parallel-size: 1
    pipeline-parallel-size: 1
    data-parallel-size: 8
    data-parallel-rpc-port: 13345
    enable-expert-parallel: true
    enforce-eager: true
    max-model-len: 16384
    max-num-seqs: 16
    max-num-batched-tokens: 32768
    trust-remote-code: true
    no-enable-prefix-caching: true
    no-enable-flashinfer-autotune: true
    no-async-scheduling: true
    block-size: 256
    gpu-memory-utilization: 0.8
    no-disable-hybrid-kv-cache-manager: true
    numa-bind: true

🔴 All three new 8k1k recipes (2p1d, 3p1d, 7p1d at line 90) set numa-bind: true on prefill, but our NVIDIA/srt-slurm@sa-submission-q2-2026 clone (still pinned in runners/launch_gb200-nv.sh:146) does not ship the vllm_numa_bind_hash_fix.py patch — the existing 1k1k recipe disagg-gb200-1p1d-dep8-tep8.yaml:108-110 documents this exact constraint. Either drop numa-bind: true from the new prefill blocks (matching the deleted 8k1k 1p1d recipe and its 1k1k sibling), or land the patch in the local clone and remove the now-stale 1k1k comment.

Extended reasoning...

The bug. All three new 8k1k recipes added in this PR set numa-bind: true in the prefill vllm_config block:

  • benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8.yaml:90
  • benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep8.yaml:90
  • benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml:90

This contradicts a documented limitation of our local fork. The unchanged 1k1k sibling recipe benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:108-110 explicitly states:

Numa-bind from upstream is still off because our NVIDIA/srt-slurm@sa-submission-q2-2026 clone doesn't ship the vllm_numa_bind_hash_fix.py patch.

The 8k1k 1p1d recipe deleted in this same PR carried the same comment, and perf-changelog.yaml line 1755 ('1p1d-dep8-tep8 ... with offload kept and numa-bind dropped') reinforces it. nvidia-master.yaml also notes 'offload + numa-bind stripped — see recipe header' for the 1k1k variant.

Why the constraint still applies. The launch script runners/launch_gb200-nv.sh:146 still pins the srt-slurm clone to branch sa-submission-q2-2026 — the exact clone the 1k1k comment refers to. This PR does not bump that pin, and vllm_numa_bind_hash_fix.py does not appear anywhere else in the repo, so the patch is still missing.

Why this isn't caught by upstream agreement. This PR ports configs from NVIDIA/srt-slurm PR #78 wholesale. The PR description's 'Local adaptations (kept from our env)' section enumerates every preserved local delta — model.path, container tag, dynamo hash, slurm.time_limit, health_check, sa-bench — but does NOT mention dropping numa-bind. That, combined with the fact that the new files mirror upstream verbatim on this flag, strongly suggests the numa-bind-drop adaptation was simply forgotten when porting.

Impact. Per the prefill recipe header that previously shipped (and the matching 1k1k comment), prefill workers fail to start without the hash-fix patch. Concretely, when vllm serve parses numa-bind: true on the unpatched fork, the numa-binding code path crashes/hangs because it relies on the missing hash helpers. This would block the entire 8k1k DSv4 sweep at startup — every run of the new 2p1d, 3p1d, and 7p1d recipes.

Step-by-step proof of the failure path.

  1. CI invokes runners/launch_gb200-nv.sh for one of the new 8k1k entries in nvidia-master.yaml (e.g. CONFIG_FILE recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8.yaml).
  2. Line 146 of the launch script clones NVIDIA/srt-slurm at branch sa-submission-q2-2026. The local 8k1k recipe is overlaid into that checkout.
  3. srt-slurm reads the recipe and, because of numa-bind: true in the prefill block, takes the numa-binding code path that requires vllm_numa_bind_hash_fix.py.
  4. That file is not present in the sa-submission-q2-2026 branch (as documented in the still-shipping 1k1k recipe and verified by the absence of any patch-landing commit).
  5. The prefill workers fail to start. Decode workers depend on the prefill side, so the run aborts before producing any benchmark numbers.

Fix. Two acceptable options:

  1. (simpler) Remove numa-bind: true from the prefill block in all three new recipes, matching the deleted 8k1k 1p1d recipe and the surviving 1k1k sibling. Add a one-line comment pointing at the constraint so the next port doesn't repeat the mistake.
  2. (if numa-bind is now actually wanted) Bump the srt-slurm clone in runners/launch_gb200-nv.sh to a branch that includes vllm_numa_bind_hash_fix.py, drop the now-stale comment from the 1k1k recipe, and update the perf-changelog narrative for the 1k1k entry. The PR description should also call out the bump as a local adaptation.
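
For option 1, a cheap guard along these lines could keep the next port honest (hypothetical — the inline `recipe` string below is a trimmed stand-in for the new 8k1k prefill blocks; a real check would loop over the recipe files):

```shell
# Flag numa-bind: true in recipe text; on the unpatched
# sa-submission-q2-2026 clone this flag must stay off.
recipe='vllm_config:
  prefill:
    numa-bind: true
    offload-group-size: 3'
if printf '%s\n' "$recipe" | grep -q '^ *numa-bind: true'; then
  verdict="numa-bind enabled"
else
  verdict="clean"
fi
echo "$verdict"   # prints "numa-bind enabled" for this sample
```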

    offload-group-size: 3
    offload-num-in-group: 1
    offload-prefetch-step: 2
    tokenizer-mode: deepseek_v4
    enable-ep-weight-filter: true

  decode:
    kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
    served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
    kv-cache-dtype: "fp8"
    tensor-parallel-size: 1
    pipeline-parallel-size: 1
    data-parallel-size: 8
    data-parallel-rpc-port: 13345
    enable-expert-parallel: true
    max-model-len: 16384
    max-num-seqs: 128
    max-cudagraph-capture-size: 128
    max-num-batched-tokens: 128
    trust-remote-code: true
    no-enable-prefix-caching: true
    block-size: 256
    compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
    gpu-memory-utilization: 0.9
    stream-interval: 50
    no-disable-hybrid-kv-cache-manager: true
    tokenizer-mode: deepseek_v4
    all2all-backend: "flashinfer_nvlink_one_sided"
    enable-ep-weight-filter: true

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "256x512x1024"
  num_warmups: 64
  req_rate: "inf"
  use_chat_template: false