[NVIDIA] chore: B200 single node DeepSeek v4 SGLang MTP #1145
Conversation
Adds the DeepSeek-V4-Flash B200 SGLang recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4. Prefix caching and speculative decoding are disabled for baseline numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Uses deepseek-ai/DeepSeek-V4-Pro with tp=8, ep=8, dp-attention enabled and sweep concurrency ranges aligned with dsv4-fp4-b200-vllm (4-1024 at 1k/1k, 4-512 at 8k/1k). Script now passes --enable-dp-attention when DP_ATTENTION=true and sets --mem-fraction-static per the Pro recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server launch now mirrors the DeepSeek-V4-Pro command from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4: --tp N, --moe-runner-backend flashinfer_mxfp4, --mem-fraction-static 0.82, SGLANG_JIT_DEEPGEMM_PRECOMPILE=0. Speculative decoding omitted and --disable-radix-cache added per the no-spec / no-prefix-cache baseline. YAML search-space drops ep/dp-attn to tp=8, ep=1. Also syncs runners/launch_b200-dgxc-slurm.sh with the HF cache mount path from origin/claude/add-dsv4-fp4-b200-vllm so both PRs stay in agreement on runner layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The deepseek-v4-blackwell image doesn't expose sglang via system python3, so the module import fails: /usr/bin/python3: Error while finding module specification for 'sglang.launch_server' (ModuleNotFoundError: No module named 'sglang') Switch to the `sglang serve` entrypoint that the cookbook uses; the CLI resolves the correct interpreter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lmsysorg/sglang:deepseek-v4-blackwell image installs sglang editable at /workspace/sglang/python — unlike every prior sglang tag which uses /sgl-workspace/sglang. Our $GITHUB_WORKSPACE:/workspace/ bind-mount masks that directory, breaking `import sglang`. Conditionally mount at /ix for this image only and make the dsv4 benchmark script use $PWD for server/metrics/result paths so it works regardless of the mount target. All other configs still mount at /workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lmsysorg/sglang:deepseek-v4-blackwell image installs sglang editable at /workspace/sglang/python, which our $GITHUB_WORKSPACE:/workspace/ bind-mount masks. Temporary one-line workaround: pip install --no-deps sglang in the benchmark script to restore a non-editable copy in site-packages. Runner reverted to the standard /workspace mount. Marked with a TODO(Cam) for the proper fix once lmsys publishes an image that doesn't editable-install under /workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
'pip install --no-deps sglang' is a no-op when sglang is already registered in site-packages -- even if the underlying editable path is missing -- so the prior workaround never actually swapped in a working install. Uninstall the broken egg-link first, then reinstall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Back to the proper mount fix so we use the same 'PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...' invocation as every other sglang single_node script. Conditional mount target keeps the blast radius to this one config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The image ENV pins CUDA_VISIBLE_DEVICES=4,5,6,7 (leftover from lmsys's internal testing). With --no-container-entrypoint it isn't cleared, so the container only sees 4 GPUs and TP=8 fails with torch.AcceleratorError: CUDA error: invalid device ordinal Unset it at the top of the script so Slurm's 8-GPU allocation is visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
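The env-handling part of this fix can be sketched without GPUs — a minimal reproduction of the mask-and-unset sequence only (the `4,5,6,7` value mirrors the image's baked-in ENV; `nvidia-smi` is deliberately omitted):

```shell
# Simulate the image's leftover device mask, then apply the script's fix.
export CUDA_VISIBLE_DEVICES=4,5,6,7   # baked into the image's ENV
unset CUDA_VISIBLE_DEVICES            # top-of-script fix: clear the mask
# With the variable unset, CUDA enumerates every GPU in the Slurm allocation.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
```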
Only patched launch_b200-dgxc-slurm.sh last time; the b200-nb runner still had the default $GITHUB_WORKSPACE:/workspace/ mount, which masks the deepseek-v4-blackwell image's /workspace/sglang editable install. Most B200 jobs in this repo run on b200-nb. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ding Only replace the sglang launch command, keep all surrounding logic intact. Add PYTHONNOUSERSITE=1, SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1, SGLANG_OPT_USE_TOPK_V2=1 env prefixes. Switch to sglang serve with EAGLE speculative decoding (3 steps, topk=1, 4 draft tokens), chunked prefill 4096, and disable-flashinfer-autotune. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EAGLE speculative decoding is enabled in the benchmark script, so the YAML search-space entries need spec-decoding: "mtp" to ensure correct classification in config generation and eval selection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy of dsv4_fp4_b200.sh with --use-chat-template added to run_benchmark_serving, as required by AGENTS.md for MTP scripts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- config-keys:
    - dsv4-fp4-b200-sglang
  description:
    - "Add DeepSeek-V4-Pro single-node B200 SGLang benchmark (TP8, EP8, dp-attention)"
    - "Container: lmsysorg/sglang:deepseek-v4-blackwell"
    - "Recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4"
    - "Parallelism and sweep conc ranges match the dsv4-fp4-b200-vllm config"
    - "Prefix caching and speculative decoding disabled for baseline numbers"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1131
🔴 The new perf-changelog.yaml entry for dsv4-fp4-b200-sglang misrepresents what is actually shipped in this PR: it claims "TP8, EP8, dp-attention" and "Prefix caching and speculative decoding disabled", but nvidia-master.yaml has ep:1 (not 8), neither dsv4_fp4_b200.sh nor dsv4_fp4_b200_mtp.sh passes --enable-dp-attention/--dp-size, both scripts enable EAGLE speculative decoding (--speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4), and neither script disables prefix caching (no --disable-radix-cache). The pr-link also still points at #1131 (the predecessor PR) instead of this PR. The text appears to have been copy-pasted from #1131's baseline description and not reconciled with the MTP recipe being shipped here — please update it to reflect EP=1, no dp-attention, EAGLE/MTP enabled, prefix caching enabled, and the correct PR number, since perf-changelog.yaml is consumed by tooling and readers to interpret the published numbers.
Extended reasoning...
Summary
The perf-changelog.yaml entry added by this PR contradicts the YAML config and benchmark scripts shipped in the same diff on four points, plus has a stale pr-link.
What the entry claims vs. what is shipped
The new entry (perf-changelog.yaml, top of file):
- config-keys:
- dsv4-fp4-b200-sglang
description:
- "Add DeepSeek-V4-Pro single-node B200 SGLang benchmark (TP8, EP8, dp-attention)"
- ...
- "Prefix caching and speculative decoding disabled for baseline numbers"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1131

Claim 1: "EP8". `.github/configs/nvidia-master.yaml` shipped in the same diff has, for both seq-len configs:
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 1024, spec-decoding: "mtp" }

EP is 1, not 8.
Claim 2: "dp-attention". Neither benchmarks/single_node/dsv4_fp4_b200.sh nor benchmarks/single_node/dsv4_fp4_b200_mtp.sh passes --enable-dp-attention or --dp-size. dp-attention is not enabled.
Claim 3: "speculative decoding disabled for baseline numbers". Both scripts pass:
--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
Speculative decoding is enabled, not disabled. The PR title is literally "dsv4 B200 MTP SGLang launch" and the PR description bullet point is "Add EAGLE speculative decoding".
Claim 4: "Prefix caching ... disabled". sglang has prefix caching on by default; disabling it requires --disable-radix-cache, which neither script passes. So prefix caching is on.
Stale pr-link. The entry points at #1131, but this is PR #1145 (the description says "Based on #1131"). The link should be updated to this PR.
Why this happened
The PR description says "Based on #1131", and #1131 is the predecessor non-MTP, baseline-numbers submission. The author appears to have copy-pasted #1131's changelog entry verbatim and updated the YAML/scripts to the new MTP recipe without reconciling the changelog text. Every clause that's wrong is correct for #1131's baseline configuration.
Step-by-step proof
- Open `.github/configs/nvidia-master.yaml` at the new `dsv4-fp4-b200-sglang` block. For both `isl: 1024 / osl: 1024` and `isl: 8192 / osl: 1024`, search-space is `{ tp: 8, ep: 1, ... }` — EP=1, contradicting "EP8".
- Grep both new scripts for `enable-dp-attention` or `dp-size`: zero matches. Contradicts "dp-attention".
- Grep both new scripts for `speculative-algo`: each script has `--speculative-algo EAGLE` plus the three EAGLE tuning flags. Contradicts "speculative decoding disabled".
- Grep both new scripts for `disable-radix-cache`: zero matches; sglang defaults prefix caching on. Contradicts "Prefix caching ... disabled".
- The `pr-link` field is `https://github.com/SemiAnalysisAI/InferenceX/pull/1131` but the PR being reviewed is #1145.
Impact
perf-changelog.yaml is consumed by both tooling and humans to interpret what each published benchmark number represents. A reader comparing dsv4-fp4-b200-sglang numbers against a competing config will be misled into thinking this is a TP8/EP8/dp-attention baseline with speculative decoding off, when it is actually TP8/EP1 with EAGLE MTP enabled and prefix caching on — a fundamentally different operating point. No runtime impact (documentation only), so this is on the boundary between "nit" and "normal"; I'm flagging it as normal because the entry actively misleads about the test conditions of all numbers shipped under this config key.
Suggested fix
Update the entry to reflect what ships:
- Replace `(TP8, EP8, dp-attention)` with `(TP8, EP1, EAGLE MTP)` (or similar accurate wording).
- Replace `"Prefix caching and speculative decoding disabled for baseline numbers"` with a description that matches the MTP recipe, e.g. `"EAGLE speculative decoding (num-steps 3, draft-tokens 4, topk 1); prefix caching enabled"`.
- Update `pr-link` to point at this PR ([NVIDIA] chore: B200 single node DeepSeek v4 SGLang MTP #1145) instead of [NVIDIA] chore: B200 single node DeepSeek v4 SGLang #1131.
sglang serve \
    --trust-remote-code \
    --model-path $MODEL \
    --tp 8 \
    --moe-runner-backend flashinfer_mxfp4 \
    --speculative-algo EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --chunked-prefill-size 4096 \
    --disable-flashinfer-autotune \
    --mem-fraction-static 0.82 \
    --host 0.0.0.0 \
    --port $PORT > $SERVER_LOG 2>&1 &
🔴 EVAL_CONTEXT_ARGS is populated to '--context-length $EVAL_MAX_MODEL_LEN' inside the EVAL_ONLY branch (lines 40-44) but never expanded into the sglang serve invocation (lines 49-66), so the server boots with the default model context. When this benchmark runs in EVAL_ONLY mode (e.g. via the multi-node lm-eval flow added in #1000/#1094/#1120), long-context evals will be silently truncated or fail. Append $EVAL_CONTEXT_ARGS to the sglang serve command (right before > $SERVER_LOG); the same fix is needed in dsv4_fp4_b200.sh.
Extended reasoning...
What the bug is
In benchmarks/single_node/dsv4_fp4_b200_mtp.sh (and identically in dsv4_fp4_b200.sh), the EVAL_ONLY branch declares and populates EVAL_CONTEXT_ARGS:
EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

…but the variable is never referenced again in the file. The sglang serve invocation that follows ends with `--port $PORT > $SERVER_LOG 2>&1 &` with no `$EVAL_CONTEXT_ARGS` expansion. Net effect: the server boots with whatever the model's default max context is, regardless of what setup_eval_context decided.
The established pattern this diverges from
Every sibling script that defines EVAL_CONTEXT_ARGS also appends it to the launch line. Examples in this repo: dsr1_fp8_b200.sh:80, dsr1_fp4_b200.sh:48, glm5_fp8_b200.sh:56, dsr1_fp8_h200.sh:44, dsr1_fp8_b200_mtp.sh:93, glm5_fp4_b200_mtp.sh:61. benchmarks/benchmark_lib.sh:648 documents the convention: "Scripts then wire $EVAL_MAX_MODEL_LEN into whichever server variable they need." The new dsv4 scripts skip step two.
Why existing code does not prevent it
Bash silently tolerates an unused variable — there is no static check that catches EVAL_CONTEXT_ARGS being set-but-never-expanded. setup_eval_context does its job (computing EVAL_MAX_MODEL_LEN); the breakage is purely at the consumer.
Impact
The end-of-script branch if [ "${RUN_EVAL}" = "true" ]; then run_eval --framework lm-eval --port "$PORT" (lines 87-90) shows the author intended this benchmark to be eval-runnable. Per perf-changelog.yaml, recent PRs (#1000, #1094, #1120) added a multi-node lm-eval flow that exercises EVAL_ONLY, and the new dsv4-fp4-b200-sglang config in nvidia-master.yaml will be picked up by future evals-only entries that follow this pattern. When that happens, lm-eval tasks with prompts longer than the model's default context will either be truncated (silent accuracy degradation) or error with "prompt is longer than maximum context length". This is dead code today, latent eval breakage tomorrow.
Step-by-step proof
- Operator sets `EVAL_ONLY=true` and runs the runner script, which dispatches `dsv4_fp4_b200_mtp.sh`.
- Line 40 sets `EVAL_CONTEXT_ARGS=""`.
- Lines 41-44 enter the `if [ "${EVAL_ONLY}" = "true" ]` branch, call `setup_eval_context` (which sets `EVAL_MAX_MODEL_LEN` to e.g. 8192 based on lm-eval task requirements), and assign `EVAL_CONTEXT_ARGS="--context-length 8192"`.
- Lines 49-66 launch `sglang serve … --port $PORT > $SERVER_LOG 2>&1 &`. `grep EVAL_CONTEXT_ARGS` against this block returns no match — the variable's expanded value never reaches the server's argv.
- SGLang reads `max_position_embeddings` from the model config and uses that (or its own internal default) as the effective context length.
- lm-eval submits a request whose token count exceeds the default → server returns 4xx with a context-length error, or silently truncates depending on backend behavior. The eval result is wrong.
How to fix
Append $EVAL_CONTEXT_ARGS to the sglang serve invocation, right before > $SERVER_LOG:
--host 0.0.0.0 \
--port $PORT $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

Apply the same fix in dsv4_fp4_b200.sh. When EVAL_ONLY != true, EVAL_CONTEXT_ARGS stays empty and the launch is unchanged — zero risk to the non-eval benchmark path.
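Why the unquoted expansion is safe in both modes comes down to shell word splitting: an empty unquoted variable contributes zero argv entries, while a populated one contributes one entry per word. A standalone sketch (`count_args` is a hypothetical helper, not from the repo):

```shell
# An unquoted empty variable adds nothing to argv; a populated one adds
# one entry per whitespace-separated word.
count_args() { echo $#; }

EVAL_CONTEXT_ARGS=""
count_args --port 8888 $EVAL_CONTEXT_ARGS   # prints 2: launch unchanged
EVAL_CONTEXT_ARGS="--context-length 8192"
count_args --port 8888 $EVAL_CONTEXT_ARGS   # prints 4: flag and value appended
```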
--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
🔴 The non-MTP benchmarks/single_node/dsv4_fp4_b200.sh enables EAGLE speculative decoding (--speculative-algo EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4), contradicting both its name and the convention used by every other non-MTP variant in this directory. Drop the four --speculative-* flags from the non-MTP script so the runner picks up genuinely-disabled spec decoding when a non-MTP YAML entry is added later (matches dsr1_fp4_b200.sh, dsr1_fp8_b200.sh, glm5_fp8_b200.sh, etc.). The current YAML only references spec-decoding: mtp, so this is dormant today, but it ships in the PR and the changelog explicitly states "Prefix caching and speculative decoding disabled for baseline numbers," which this script contradicts.
Extended reasoning...
What is broken
benchmarks/single_node/dsv4_fp4_b200.sh (the non-MTP variant added in this PR) launches sglang at lines 58–61 with:
--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
The two new dsv4 scripts dsv4_fp4_b200.sh and dsv4_fp4_b200_mtp.sh are byte-for-byte identical except that the _mtp script appends --use-chat-template to the benchmark client (line 85 of the _mtp script). This pattern strongly suggests the non-MTP file was created by copy-pasting the MTP script and forgetting to strip the spec-decoding block.
Why this contradicts repo convention
Across benchmarks/single_node/ every other framework + GPU pair follows a strict <model>_<precision>_<gpu>.sh (no spec) vs <model>_<precision>_<gpu>_mtp.sh (spec on) split — e.g. dsr1_fp4_b200.sh, dsr1_fp8_b200.sh, glm5_fp4_b200.sh, and glm5_fp8_b200.sh contain no --speculative-* flags, while their _mtp.sh siblings do. dsv4_fp4_b200.sh is the only non-_mtp script in this directory that enables EAGLE.
How the runner selects scripts
runners/launch_b200-dgxc-slurm.sh (and launch_b200-nb.sh) build the script name with:
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
... benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_b200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh

So an entry without `spec-decoding: mtp` resolves to `dsv4_fp4_b200.sh` (no suffix). The current nvidia-master.yaml only adds entries with `spec-decoding: "mtp"` for dsv4-fp4-b200-sglang, so the broken script is dormant today — that's why no one will see incorrect numbers right now.
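The suffix resolution can be exercised in isolation — a sketch that wraps the runner's two lines in a function, with illustrative values for `EXP_NAME`, `PRECISION`, and `FRAMEWORK_SUFFIX` (not taken from the repo):

```shell
# Reproduce the runner's script-name construction for a given sweep entry.
resolve_script() {
  EXP_NAME="$1"; PRECISION="$2"; FRAMEWORK_SUFFIX="$3"; SPEC_DECODING="$4"
  # mtp entries get the _mtp suffix; everything else resolves to the baseline.
  SPEC_SUFFIX=$([ "$SPEC_DECODING" = "mtp" ] && printf '_mtp' || printf '')
  printf 'benchmarks/single_node/%s_%s_b200%s%s.sh\n' \
    "${EXP_NAME%%_*}" "$PRECISION" "$FRAMEWORK_SUFFIX" "$SPEC_SUFFIX"
}

resolve_script dsv4_sweep fp4 "" mtp   # -> benchmarks/single_node/dsv4_fp4_b200_mtp.sh
resolve_script dsv4_sweep fp4 "" ""    # -> benchmarks/single_node/dsv4_fp4_b200.sh
```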
Step-by-step proof of the latent bug
- A future PR adds a baseline (non-MTP) entry to the dsv4-fp4-b200-sglang config, e.g. `{ tp: 8, ep: 1, conc-start: 4, conc-end: 1024 }` (no `spec-decoding` field).
- `SPEC_DECODING` is empty for that entry, so `SPEC_SUFFIX` is empty.
- The runner invokes `benchmarks/single_node/dsv4_fp4_b200.sh`.
- That script still launches sglang with `--speculative-algo EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- The benchmark records numbers from EAGLE-accelerated decoding and the user labels them as "baseline (no spec decoding)" results. Silently wrong perf numbers result.
Conflict with the PR's own changelog
The perf-changelog.yaml entry added in this PR states for dsv4-fp4-b200-sglang:
"Prefix caching and speculative decoding disabled for baseline numbers"
That promise is correct for the MTP variant only. The non-MTP script as written breaks this contract the moment any non-MTP YAML entry is added.
Fix
Remove these four lines from benchmarks/single_node/dsv4_fp4_b200.sh (lines 58–61):
--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
After this, dsv4_fp4_b200.sh matches its name and matches the convention established by every other non-MTP variant in the directory. The MTP script (dsv4_fp4_b200_mtp.sh) is correct as-is and should keep the spec flags.
sglang serve \
    --trust-remote-code \
    --model-path $MODEL \
    --tp 8 \
🟡 Hard-coded --tp 8 on line 56 ignores the $TP env var that check_env_vars enforces and that the script echoes for logging on line 38; same issue at dsv4_fp4_b200.sh:56. Latent today (yaml only has tp:8 entries) but inconsistent with every sibling *_b200*.sh script in this directory which all parameterize via --tensor-parallel-size=$TP or --tp "$TP" — worth fixing now while the script is being added. Trivial fix: change --tp 8 to --tp "$TP".
Extended reasoning...
What the bug is
Both benchmarks/single_node/dsv4_fp4_b200.sh:56 and benchmarks/single_node/dsv4_fp4_b200_mtp.sh:56 invoke sglang serve with a literal --tp 8, even though the same script:
- Calls `check_env_vars … TP …` at the top, enforcing that `$TP` is set by the runner.
- Echoes `"TP: $TP, …"` for logging on line 38.
- Then ignores `$TP` and pins TP=8 on line 56.
Every other comparable single-node b200 sglang script parameterizes via $TP. grep in the same directory:
benchmarks/single_node/dsr1_fp4_b200.sh:44:--tensor-parallel-size=$TP …
benchmarks/single_node/dsr1_fp8_b200.sh:76:--tensor-parallel-size=$TP …
benchmarks/single_node/dsr1_fp8_b200_mtp.sh:72:--tensor-parallel-size=$TP \
benchmarks/single_node/dsv4_fp4_b200.sh:56: --tp 8 \
benchmarks/single_node/dsv4_fp4_b200_mtp.sh:56: --tp 8 \
Why this is latent today
In .github/configs/nvidia-master.yaml, the new dsv4-fp4-b200-sglang block only defines { tp: 8, … } entries for both seq-len configs — so today $TP always equals 8 and there is no functional divergence. This is why I'm filing as nit, not normal.
Why this still warrants a fix in this PR
If a future PR adds a { tp: 4, … } entry to this config (consistent with dsr1-fp4-b200-sglang, which already has both tp: 4 and tp: 8 entries), the failure mode is silent and confusing:
- The runner sweep loop in `runners/launch_b200-dgxc-slurm.sh:285` reserves `--gres=gpu:$TP` — so slurm only allocates 4 GPUs for the job.
- The script then runs `unset CUDA_VISIBLE_DEVICES` (per the comment on line 25, to clear the image's baked-in mask) and invokes `sglang serve --tp 8`.
- sglang would try to fan out to 8 ranks against a 4-GPU allocation. Worse, the unset `CUDA_VISIBLE_DEVICES` makes the symptom less obvious because the original 4-GPU mask is being cleared right before the launch.
Addressing the refutation
The refutation argues this is intentional design because:
- "Comment says TP=8": The comment on lines 23-25 explains why `unset CUDA_VISIBLE_DEVICES` is needed (the image bakes in a 4-GPU mask) — it documents an image-specific quirk, not a hard TP=8 design constraint. The comment says "so TP=8 can bind to all ranks", not "this script only supports TP=8".
- "dsr1_fp8_b200_mtp.sh also hardcodes TP=8": This is factually incorrect. `dsr1_fp8_b200_mtp.sh` uses `--tensor-parallel-size=$TP` on line 72 (not hardcoded), with a defensive guard `if [[ $TP -ne 8 ]]; then exit 1; fi` on line 29. That's the right pattern: parameterize via $TP, validate the input. The dsv4 scripts do neither.
- "Yaml only has tp:8": Acknowledged; that's why this is filed as nit, not normal.
Step-by-step proof of the latent failure
Suppose someone adds a TP=4 entry to the yaml in a future PR:
dsv4-fp4-b200-sglang:
...
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 256, spec-decoding: "mtp" }

Then for that sweep entry:
- Runner sets `TP=4` and runs `salloc --gres=gpu:4 …` (line 285 in `launch_b200-dgxc-slurm.sh`).
- Slurm allocates a 4-GPU job. `CUDA_VISIBLE_DEVICES` at this point reflects the 4 allocated GPUs.
- `dsv4_fp4_b200_mtp.sh` runs, echoes "TP: 4, …".
- Script runs `unset CUDA_VISIBLE_DEVICES` (clears the visible mask).
- Script invokes `sglang serve --tp 8` — tries to use 8 ranks. Crashes or hangs at startup. The user sees a TP=8 error in a "TP: 4" log context and has to reverse-engineer why.
How to fix
One-line change on line 56 in each file: --tp 8 → --tp "$TP". Optionally also add a if [[ $TP -ne 8 ]]; then …; fi guard like dsr1_fp8_b200_mtp.sh:29 if the author wants to keep TP=8-only enforcement explicit. Either way matches the established pattern in the rest of the repo.
--model-path $MODEL \
--tp 8 \
--moe-runner-backend flashinfer_mxfp4 \
--speculative-algo EAGLE \
🟡 Both new dsv4 scripts pass --speculative-algo EAGLE (dsv4_fp4_b200.sh:58 and dsv4_fp4_b200_mtp.sh:58), while all 17 sibling MTP scripts in benchmarks/single_node/ (dsr1_mtp.sh, glm5mtp.sh, qwen3.5*_mtp.sh) use the canonical --speculative-algorithm. The abbreviated form likely works today via argparse prefix-matching, but it's a clear convention break and is fragile (would silently break if sglang ever adds another --speculative-algo* flag, making the prefix ambiguous, or switches sglang serve from argparse to click). Recommend changing both lines to --speculative-algorithm EAGLE for consistency.
Extended reasoning...
What the bug is
Both new dsv4 launch scripts use a non-canonical flag name:
- `benchmarks/single_node/dsv4_fp4_b200.sh:58` — `--speculative-algo EAGLE`
- `benchmarks/single_node/dsv4_fp4_b200_mtp.sh:58` — `--speculative-algo EAGLE`
Every other MTP script in this directory (17 files: dsr1_fp8_b200_mtp.sh, dsr1_fp8_b300_mtp.sh, glm5_fp4_b200_mtp.sh, glm5_fp4_b300_mtp.sh, glm5_fp8_b200_mtp.sh, glm5_fp8_b300_mtp.sh, glm5_fp8_mi355x_mtp.sh, qwen3.5_bf16_b200_mtp.sh, qwen3.5_bf16_b300_mtp.sh, qwen3.5_bf16_mi355x_mtp.sh, qwen3.5_fp4_b200_mtp.sh, qwen3.5_fp4_b300_mtp.sh, qwen3.5_fp8_b200_mtp.sh, qwen3.5_fp8_b300_mtp.sh, qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_mi355x_mtp.sh) uses the full --speculative-algorithm EAGLE. Upstream sglang's server_args.py defines the canonical flag as --speculative-algorithm; --speculative-algo is not declared as an alias.
Why it likely works today
Python's argparse with allow_abbrev=True (the default) accepts any unambiguous prefix of a registered long option. The other speculative flags in the same launch (--speculative-num-steps, --speculative-eagle-topk, --speculative-num-draft-tokens) do not share the --speculative-algo prefix, so today --speculative-algo resolves uniquely to --speculative-algorithm. So in practice the launch most likely works on the deepseek-v4-blackwell image.
Why it's still worth fixing
- Fragile: if sglang ever adds another flag starting with `--speculative-algo` (e.g., `--speculative-algo-version`), argparse will then fail with an ambiguous-prefix error and EAGLE will be silently disabled or the launch will hard-fail. This is exactly the kind of breakage that surfaces only after an image bump.
- Disabled in some setups: argparse `allow_abbrev` was made tunable in Python 3.5; some sglang code paths or future click-based rewrites of `sglang serve` (which is the new 0.5.x entry point — these are the FIRST scripts in the repo to use `sglang serve` rather than `python3 -m sglang.launch_server`) will not honor the abbreviation. Click does not do prefix matching at all.
- Consistency: 17 sibling scripts use the full name. Matching the convention costs nothing and removes the question of whether this is intentional.
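The prefix-matching behavior and its ambiguity failure mode can be shown with stock argparse — a self-contained sketch, not sglang's actual parser (the `--speculative-algo-version` flag is hypothetical):

```python
import argparse

# With only the canonical flag registered, the abbreviation resolves uniquely.
p1 = argparse.ArgumentParser()  # allow_abbrev=True is the default
p1.add_argument("--speculative-algorithm")
args = p1.parse_args(["--speculative-algo", "EAGLE"])
print(args.speculative_algorithm)  # EAGLE

# Register a second flag sharing the prefix: the same spelling becomes ambiguous.
p2 = argparse.ArgumentParser()
p2.add_argument("--speculative-algorithm")
p2.add_argument("--speculative-algo-version")  # hypothetical future knob
try:
    p2.parse_args(["--speculative-algo", "EAGLE"])
except SystemExit:
    # argparse calls sys.exit(2) after printing "ambiguous option: ..."
    print("argparse rejected --speculative-algo as ambiguous")
```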
Addressing the refutation
A reviewer noted the PR description cites the upstream cookbook recipe (https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4) and suggested the PR author may have copied the abbreviated flag from there. That's possible, but it doesn't change the recommendation: the canonical sglang flag is --speculative-algorithm, every existing repo script uses it, and the fix is a one-character-per-line edit (-algo → -algorithm). Even if the cookbook also uses the short form, the canonical name is unambiguously safer and consistent with the rest of this repo.
Step-by-step proof
- Run `grep -n 'speculative-alg' benchmarks/single_node/*.sh` — only `dsv4_fp4_b200.sh` and `dsv4_fp4_b200_mtp.sh` use `--speculative-algo`; the other 17 use `--speculative-algorithm`.
- Inspect upstream `sglang/srt/server_args.py` — argparse registers `--speculative-algorithm` (no `--speculative-algo` alias).
- Today: argparse with default `allow_abbrev=True` resolves `--speculative-algo` → `--speculative-algorithm` because no other registered long option starts with `--speculative-algo`.
- Hypothetical near-future scenario: sglang adds `--speculative-algo-version` (a plausible sub-knob name). Now `--speculative-algo` is an ambiguous prefix and argparse exits with `error: ambiguous option: --speculative-algo could match --speculative-algorithm, --speculative-algo-version`. The dsv4 launch fails or silently disables EAGLE; every other MTP script in the repo continues to work because they pass the full name.
Fix
Change both lines to:
--speculative-algorithm EAGLE \
Severity
nit — works today via argparse prefix-matching, but inconsistent with 17 sibling scripts and fragile to upstream additions. Worth fixing while the PR is open.
…gl-mtp-b200 # Conflicts: # .github/configs/nvidia-master.yaml # benchmarks/single_node/dsv4_fp4_b200.sh
…env vars Both flags were on the sglang serve invocation in the original PR (#1139) and got dropped when the script was restructured to mirror the baseline 3-recipe layout. Re-add as exports so they apply across all recipes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror dsv4-fp4-b200-sglang's low-latency / balanced / max-throughput split across both seq-len-configs so result filenames carry ep=/dpa= labels per recipe. spec-decoding: "mtp" stays on every entry; the script picks the MTP params (or omits them at max-throughput) by CONC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
need to support jinja chat template. we are working on it
Summary
- Switch `python3 -m sglang.launch_server` to `sglang serve`
- Add EAGLE speculative decoding (`--speculative-num-steps 3`, `--speculative-eagle-topk 1`, `--speculative-num-draft-tokens 4`)
- Add `--chunked-prefill-size 4096` and `--disable-flashinfer-autotune`
- Add `SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1`, `SGLANG_OPT_USE_TOPK_V2=1`
- `--moe-runner-backend flashinfer_mxfp4`, port 8888, benchmark backend vllm unchanged

Based on #1131
Test plan
🤖 Generated with Claude Code