-
Notifications
You must be signed in to change notification settings - Fork 150
Add DeepSeek-V4-Pro SGLang aggregated GB200 benchmarks (NVIDIA srt-slurm PR #69) #1137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,13 @@ | ||
| - config-keys: | ||
| - dsv4-fp4-gb200-sglang | ||
| description: | ||
| - "Add DeepSeek-V4-Pro SGLang aggregated GB200 benchmarks (1k/1k, TP=8, 2 nodes)" | ||
| - "Recipes from YAMY1234/srt-slurm-nv:dsv4-pro-recipes (NVIDIA srt-slurm PR #69)" | ||
| - "Image: lmsysorg/sglang:deepseek-v4-grace-blackwell" | ||
| - "Two recipes: agg-2n-low-latency (EAGLE 3/4 spec decoding) for conc 1-64, agg-2n-nomtp for conc 128-1024" | ||
| - "Runner script clones the YAMY1234 fork pinned at commit da535e87 instead of NVIDIA/srt-slurm" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD | ||
|
Comment on lines
+1
to
+9
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🔴 The new Extended reasoning...What the bug is
The diff for Why existing code/convention doesnt prevent itThe rule is a human-enforced convention documented in AGENTS.md — there is no lint or CI check that validates ordering, so the only guard is reviewer/author attention. Every other recent PR in the repo (e.g. #1120 evals trigger, #1040 qwen atom, #1043 glm5.1 atom, #1098, #1106, #1094) lands its entry at the bottom, which is how new readers correctly infer the chronology. ImpactAnyone scanning How to fixMove the 9-line block currently at lines 1-9 of Step-by-step proof
|
||
|
|
||
| - config-keys: | ||
| - dsr1-fp8-h100-dynamo-trt | ||
| - dsr1-fp8-h100-dynamo-sglang | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 The low-latency search-space entry (conc 1-64) references
agg-2n-low-latency.yaml, which this PR's own description and the adjacent YAML comment describe as using EAGLE 3/4 speculative decoding, but the entry omitsspec-decoding: "mtp"and therefore defaults to"none". BecauseSPEC_DECODINGis plumbed through the runner into the result metadata (utils/process_result.pyline 53), every EAGLE run from this config will be mislabeled as non-speculative in downstream dashboards, and it will be indistinguishable from theagg-2n-nomtpentry in any aggregation that keys on spec-decoding. Fix: addspec-decoding: "mtp"to the conc 1-64 entry at lines 7452-7465, matching the convention used by every other EAGLE/MTP config in this file.Extended reasoning...
What the bug is
The new
dsv4-fp4-gb200-sglangconfig in.github/configs/nvidia-master.yamlhas two search-space entries for the 1k/1k seq-len. The first entry (conc[1, 2, 4, 8, 16, 32, 64]) points atrecipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml. This PR's own description saysagg-2n-low-latency.yaml — TP=8 + EAGLE 3/4 speculative decoding, the adjacent YAML comment at line 7450 saysLow-latency: TP=8 + EAGLE 3/4 speculative decoding, and the perf-changelog entry saysagg-2n-low-latency (EAGLE 3/4 spec decoding). Despite this, the search-space entry does not setspec-decoding: "mtp".Why the default is wrong here
Per
utils/matrix_logic/validation.py:227-228,MultiNodeSearchSpaceEntry.spec_decodingis declared asLiteral["mtp", "draft_model", "none"]withdefault="none". So the low-latency EAGLE entry silently becomesspec_decoding="none"when the matrix is generated. Every other EAGLE/MTP recipe in this same file explicitly setsspec-decoding: "mtp"(the*-trt-mtp,*-sglang-mtp,*-dynamo-sglang-mtpentries; the changelog's own description of the other MTP PRs in this file confirms this is the established convention).Impact
utils/process_result.pyis the result-metadata writer run for every benchmark point. At module load (line 27) it requiresSPEC_DECODINGas an env var, and at line 53 it writes'spec_decoding': spec_decodinginto the result JSON that is uploaded for dashboards:With
spec-decodingmissing from the YAML, the generated matrix entry carriesspec_decoding="none", the job'sSPEC_DECODINGenv var is set to"none", and all seven low-latency EAGLE points (conc 1, 2, 4, 8, 16, 32, 64) get written into result metadata as non-speculative. Downstream dashboards and aggregations that key onspec_decodingwill silently fold the EAGLE points in with the non-MTP entry (agg-2n-nomtp.yaml, conc 128-1024), producing a combined sweep that looks like a pure non-MTP config — you lose the ability to see the EAGLE speedup at low concurrency.The multi-node eval-grouping logic in
utils/matrix_logic/generate_sweep_configs.py:106-114also keys on spec-decoding, but the grouping targets only 8k1k (seetarget_isl, target_osl = seq_len_stoi["8k1k"]at line 51) and this config is 1k1k, so eval selection is not directly affected. The metadata-correctness impact on result labeling remains, which is the primary defect.Proof (step-by-step)
generate_sweep_configs.pyexpands the config. For the conc=16 point in the first search-space entry, it calls into the validatedMultiNodeSearchSpaceEntrymodel. Becausespec-decodingis absent, Pydantic applies the default"none"(validation.py:227-228).spec-decoding: "none".SPEC_DECODING=nonein the environment alongsideFRAMEWORK=sglang,MODEL_PREFIX=dsv4,CONFIG_FILE=recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml.srtctl applyruns theagg-2n-low-latency.yamlrecipe, which actually does use EAGLE 3/4 speculative decoding (per the YAMY1234/srt-slurm-nv fork linked in the diff).utils/process_result.pyreadsSPEC_DECODING="none"and writes{"spec_decoding": "none", ...}into the per-run result JSON that is uploaded for dashboards.Fix
Add
spec-decoding: "mtp"to the first search-space entry:The second (throughput) entry correctly leaves it at the default since
agg-2n-nomtp.yamlhas no MTP.