Replace DSv4 8k1k recipes with NVIDIA/srt-slurm PR #78 #1164
New file (128 lines, per the `@@ -0,0 +1,128 @@` hunk header):

```yaml
name: "dsv4-vllm-disagg-gb200-2p1d-dep8-dep8"

# From NVIDIA/srt-slurm PR #78. 2P1D topology: 2 prefill workers (DP=8) +
# 1 decode (DP=8). 6 nodes total. Targets conc 256-1024.
#
# Local deltas vs upstream:
# * model.path: deepseekv4-fp4 -> deepseek-v4-pro (launch script alias)
# * container: sha256 pin -> floating tag :deepseekv4-cu130
# * dynamo: version 1.0.2 -> hash pin (our env uses hash-based pinning)
# * Added slurm.time_limit + health_check (Lustre cold-cache loads)
# * benchmark: vllm-bench -> sa-bench (our CI tooling)

model:
  path: "deepseek-v4-pro"
  container: "vllm/vllm-openai:deepseekv4-cu130"
  precision: "fp4"

dynamo:
  hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
  install: true

setup_script: vllm-container-deps.sh

slurm:
  time_limit: "8:00:00"

health_check:
  max_attempts: 1440
  interval_seconds: 10

resources:
  gpu_type: "gb200"
  gpus_per_node: 4
  prefill_nodes: 4
  decode_nodes: 2
  prefill_workers: 2
  decode_workers: 1
  gpus_per_prefill: 8
  gpus_per_decode: 8

frontend:
  type: dynamo
  enable_multiple_frontends: false

backend:
  type: vllm
  connector: null

prefill_environment:
  TILELANG_CLEANUP_TEMP_FILES: "1"
  VLLM_USE_NCCL_SYMM_MEM: "1"
  NCCL_CUMEM_ENABLE: "1"
  NCCL_MNNVL_ENABLE: "1"
  NCCL_NVLS_ENABLE: "1"
  VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
  VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
  VLLM_LOG_STATS_INTERVAL: "1"

decode_environment:
  TILELANG_CLEANUP_TEMP_FILES: "1"
  VLLM_USE_NCCL_SYMM_MEM: "1"
  NCCL_CUMEM_ENABLE: "1"
  NCCL_MNNVL_ENABLE: "1"
  NCCL_NVLS_ENABLE: "1"
  VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
  VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
  VLLM_LOG_STATS_INTERVAL: "1"

vllm_config:
  prefill:
    kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
    served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
    kv-cache-dtype: "fp8"
    tensor-parallel-size: 1
    pipeline-parallel-size: 1
    data-parallel-size: 8
    data-parallel-rpc-port: 13345
    enable-expert-parallel: true
    enforce-eager: true
    max-model-len: 16384
    max-num-seqs: 16
    max-num-batched-tokens: 32768
    trust-remote-code: true
    no-enable-prefix-caching: true
    no-enable-flashinfer-autotune: true
    no-async-scheduling: true
    block-size: 256
    gpu-memory-utilization: 0.8
    no-disable-hybrid-kv-cache-manager: true
    numa-bind: true
```
Contributor

🔴 All three new 8k1k recipes (2p1d, 3p1d, 7p1d at line 90) set `numa-bind: true`.

The bug

All three new 8k1k recipes added in this PR set `numa-bind: true`. This contradicts a documented limitation of our local fork. The unchanged 1k1k sibling recipe documents that `numa-bind` must stay off because the `vllm_numa_bind_hash_fix.py` patch is missing from our srt-slurm clone. The 8k1k 1p1d recipe deleted in this same PR carried the same comment.

Why the constraint still applies

This PR ports new configs but does not add the missing `vllm_numa_bind_hash_fix.py` patch, so the launch script still cannot honor `numa-bind: true`.

Why this isn't caught by upstream agreement

This PR ports configs from NVIDIA/srt-slurm PR #78 wholesale. The PR description's "Local adaptations (kept from our env)" section enumerates every preserved local delta, but the dropped `numa-bind` is not among them, so the upstream value came through unchanged.

Impact

Per the prefill recipe header that previously shipped (and the matching 1k1k comment), prefill workers fail to start without the hash-fix patch.

Step-by-step proof of the failure path

1. Each new recipe sets `numa-bind: true` under `vllm_config.prefill`.
2. The launch script forwards the flag to the prefill workers.
3. Our srt-slurm clone lacks the `vllm_numa_bind_hash_fix.py` patch (per the 1k1k recipe header).
4. Prefill workers fail to start.

Fix

Two acceptable options: drop `numa-bind: true` from all three new recipes, restoring the documented local delta, or port the `vllm_numa_bind_hash_fix.py` patch into our srt-slurm clone and update the recipe headers.
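The guard the review implies can be sketched as a small recipe lint. This is illustrative only: the `numa-bind` key comes from this diff, but the inlined snippet and the function are hypothetical; a real check would glob the recipe tree under `benchmarks/multi_node/srt-slurm-recipes/`.

```python
import re

# Hypothetical snippet standing in for a recipe file's vllm_config section.
RECIPE_SNIPPET = """\
vllm_config:
  prefill:
    enforce-eager: true
    numa-bind: true
"""

def find_numa_bind(recipe_text: str) -> list[int]:
    """Return 1-based line numbers where numa-bind is enabled.

    Flagging the key guards against recipes that assume the
    vllm_numa_bind_hash_fix.py patch, which our clone lacks.
    """
    hits = []
    for lineno, line in enumerate(recipe_text.splitlines(), start=1):
        if re.match(r"\s*numa-bind:\s*true\b", line):
            hits.append(lineno)
    return hits

print(find_numa_bind(RECIPE_SNIPPET))  # → [4]
```

Running this over the three new 8k1k recipes would flag each one, which is exactly the failure mode the reviewer describes.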
The prefill section continues, followed by the decode and benchmark sections:

```yaml
    offload-group-size: 3
    offload-num-in-group: 1
    offload-prefetch-step: 2
    tokenizer-mode: deepseek_v4
    enable-ep-weight-filter: true

  decode:
    kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
    served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
    kv-cache-dtype: "fp8"
    tensor-parallel-size: 1
    pipeline-parallel-size: 1
    data-parallel-size: 8
    data-parallel-rpc-port: 13345
    enable-expert-parallel: true
    max-model-len: 16384
    max-num-seqs: 128
    max-cudagraph-capture-size: 128
    max-num-batched-tokens: 128
    trust-remote-code: true
    no-enable-prefix-caching: true
    block-size: 256
    compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
    gpu-memory-utilization: 0.9
    stream-interval: 50
    no-disable-hybrid-kv-cache-manager: true
    tokenizer-mode: deepseek_v4
    all2all-backend: "flashinfer_nvlink_one_sided"
    enable-ep-weight-filter: true

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "256x512x1024"
  num_warmups: 64
  req_rate: "inf"
  use_chat_template: false
```
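The recipe's topology can be sanity-checked arithmetically. The values below are copied from the `resources` and `benchmark` blocks above; the invariant that node GPUs must equal workers times GPUs per worker, and the "x"-separated concurrency format, are assumptions about the schema inferred from this recipe.

```python
# Values copied from the recipe's resources block.
resources = {
    "gpus_per_node": 4,
    "prefill_nodes": 4,
    "decode_nodes": 2,
    "prefill_workers": 2,
    "decode_workers": 1,
    "gpus_per_prefill": 8,
    "gpus_per_decode": 8,
}

def check_topology(r: dict) -> dict:
    """Check that allocated node GPUs cover the requested workers."""
    prefill_gpus = r["prefill_nodes"] * r["gpus_per_node"]   # 4 * 4 = 16
    decode_gpus = r["decode_nodes"] * r["gpus_per_node"]     # 2 * 4 = 8
    return {
        "prefill_ok": prefill_gpus == r["prefill_workers"] * r["gpus_per_prefill"],
        "decode_ok": decode_gpus == r["decode_workers"] * r["gpus_per_decode"],
        "total_nodes": r["prefill_nodes"] + r["decode_nodes"],
    }

print(check_topology(resources))
# → {'prefill_ok': True, 'decode_ok': True, 'total_nodes': 6}

# The benchmark block's concurrency sweep, assuming "x" separates values:
concurrencies = [int(c) for c in "256x512x1024".split("x")]
print(concurrencies)  # → [256, 512, 1024]
```

The `total_nodes` result of 6 matches the "6 nodes total" claim in the recipe's header comment.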
🟡 After this PR merges, the 1k1k sibling recipe `benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml` (lines 11-12) still defers documentation to `../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml`, but this PR deletes that file, so the cross-reference is dead and the documented deltas (model.path alias, dropped numa-bind, kept offload, container floating tag, slurm.time_limit, etc.) are lost. Either inline those deltas into the 1k1k recipe header or remove the cross-reference. Note: the bug location is `benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:11-12`, not `.github/configs/nvidia-master.yaml` as listed in the bug header.
What the bug is

This PR deletes `benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-1p1d-dep8-tep8.yaml`. However, the 1k1k sibling recipe at `benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml` lines 11-12 contains a relative cross-reference to `../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml`. After merge, that relative reference points to a non-existent file.
Why the existing code doesn't prevent it
The 1k1k recipe was authored when the 8k1k 1p1d-dep8-tep8 file existed (see PR #1129 in perf-changelog.yaml). The 1k1k header chose to defer to the 8k1k sibling rather than inline the deltas, on the assumption that both recipes would evolve together. This PR breaks that assumption by deleting only the 8k1k variant (the new NVIDIA/srt-slurm PR #78 topologies start at 2p1d, leaving the 1p1d-dep8-tep8 low-concurrency entry point owned solely by the 1k1k variant).
Impact

Documentation-only; no runtime effect. But the lost documentation is non-trivial: the upstream-vs-local deviation list explained (1) why `model.path` is renamed `deepseekv4-fp4 -> deepseek-v4-pro` (launch script alias), (2) why `numa-bind` is dropped (the `vllm_numa_bind_hash_fix.py` patch is missing in our srt-slurm clone), (3) why CPU/DRAM offload is kept despite numa-bind being off (load-bearing for KV cache fit), (4) why the container uses a floating tag instead of the upstream sha256 pin, (5) why `slurm.time_limit` was added (Lustre cold-cache loads), and (6) why `benchmark.use_chat_template: false` (sa-bench tokenizer support gap). A future maintainer touching the 1k1k recipe will lose all of that rationale and may inadvertently re-enable numa-bind or change the offload settings without realizing the constraints.

Step-by-step proof

1. `8k1k/disagg-gb200-1p1d-dep8-tep8.yaml` existed and contained a detailed comment block explaining all local deltas (visible in the diff's deletion section).
2. `1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:11-12` references that file by relative path: `../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml`.
3. This PR deletes the 8k1k file (`+++ /dev/null` in the diff).
4. After merge, `ls benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/` returns only `disagg-gb200-2p1d-dep8-dep8.yaml`, `disagg-gb200-3p1d-dep8-dep8.yaml`, and `disagg-gb200-7p1d-dep8-dep16.yaml`; no `1p1d-dep8-tep8`.

How to fix

Two options, both small:

1. Inline the documented deltas into `1k1k/disagg-gb200-1p1d-dep8-tep8.yaml` lines 11-12, replacing the cross-reference. This is the higher-value fix since the 1k1k variant now owns the only 1p1d-dep8-tep8 recipe in the tree.
2. Simply remove the dead cross-reference from the 1k1k header.

Note on file location
The synthesis agent listed the file as `.github/configs/nvidia-master.yaml`, but the actual broken cross-reference lives in `benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:11-12`. The substance of the bug is unchanged.
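The CI-style check the review suggests can be sketched as a scan of recipe header comments for relative `../` cross-references whose targets no longer exist. The reference pattern and the demo directory layout are illustrative assumptions; only the dead path itself comes from this review.

```python
import re
import tempfile
from pathlib import Path

# Matches relative ../path/to/file.yaml references inside comment lines.
REF_RE = re.compile(r"(\.\./[\w./-]+\.yaml)")

def dead_refs(recipe: Path) -> list[str]:
    """Return relative .yaml cross-references in header comments
    that do not resolve to an existing file."""
    missing = []
    for line in recipe.read_text().splitlines():
        if not line.lstrip().startswith("#"):
            continue  # only inspect comment lines
        for ref in REF_RE.findall(line):
            if not (recipe.parent / ref).resolve().exists():
                missing.append(ref)
    return missing

# Demo in a temp tree: a 1k1k recipe pointing at a deleted 8k1k sibling.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "1k1k").mkdir()
    (root / "8k1k").mkdir()
    recipe = root / "1k1k" / "disagg.yaml"
    recipe.write_text("# See ../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml\n")
    print(dead_refs(recipe))
    # → ['../8k1k/disagg-gb200-1p1d-dep8-tep8.yaml']
```

Run over the recipe tree after this PR, a check like this would flag the 1k1k header immediately, and it would also prevent the same breakage in future deletions.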