[NVIDIA] chore: B200 single node DeepSeek v4 SGLang MTP #1145
Conversation
Adds the DeepSeek-V4-Flash B200 SGLang recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4. Prefix caching and speculative decoding are disabled for baseline numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Uses deepseek-ai/DeepSeek-V4-Pro with tp=8, ep=8, dp-attention enabled and sweep concurrency ranges aligned with dsv4-fp4-b200-vllm (4-1024 at 1k/1k, 4-512 at 8k/1k). Script now passes --enable-dp-attention when DP_ATTENTION=true and sets --mem-fraction-static per the Pro recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server launch now mirrors the DeepSeek-V4-Pro command from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4: --tp N, --moe-runner-backend flashinfer_mxfp4, --mem-fraction-static 0.82, SGLANG_JIT_DEEPGEMM_PRECOMPILE=0. Speculative decoding omitted and --disable-radix-cache added per the no-spec / no-prefix-cache baseline. YAML search-space drops ep/dp-attn to tp=8, ep=1. Also syncs runners/launch_b200-dgxc-slurm.sh with the HF cache mount path from origin/claude/add-dsv4-fp4-b200-vllm so both PRs stay in agreement on runner layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The deepseek-v4-blackwell image doesn't expose sglang via system python3, so the module import fails: /usr/bin/python3: Error while finding module specification for 'sglang.launch_server' (ModuleNotFoundError: No module named 'sglang') Switch to the `sglang serve` entrypoint that the cookbook uses; the CLI resolves the correct interpreter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lmsysorg/sglang:deepseek-v4-blackwell image installs sglang editable at /workspace/sglang/python — unlike every prior sglang tag which uses /sgl-workspace/sglang. Our $GITHUB_WORKSPACE:/workspace/ bind-mount masks that directory, breaking `import sglang`. Conditionally mount at /ix for this image only and make the dsv4 benchmark script use $PWD for server/metrics/result paths so it works regardless of the mount target. All other configs still mount at /workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lmsysorg/sglang:deepseek-v4-blackwell image installs sglang editable at /workspace/sglang/python, which our $GITHUB_WORKSPACE:/workspace/ bind-mount masks. Temporary one-line workaround: pip install --no-deps sglang in the benchmark script to restore a non-editable copy in site-packages. Runner reverted to the standard /workspace mount. Marked with a TODO(Cam) for the proper fix once lmsys publishes an image that doesn't editable-install under /workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
'pip install --no-deps sglang' is a no-op when sglang is already registered in site-packages -- even if the underlying editable path is missing -- so the prior workaround never actually swapped in a working install. Uninstall the broken egg-link first, then reinstall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Back to the proper mount fix so we use the same 'PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...' invocation as every other sglang single_node script. Conditional mount target keeps the blast radius to this one config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The image ENV pins CUDA_VISIBLE_DEVICES=4,5,6,7 (leftover from lmsys's internal testing). With --no-container-entrypoint it isn't cleared, so the container only sees 4 GPUs and TP=8 fails with torch.AcceleratorError: CUDA error: invalid device ordinal Unset it at the top of the script so Slurm's 8-GPU allocation is visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
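The env-handling part of this fix can be sketched without GPUs — a minimal reproduction of the mask-and-unset sequence only (the `4,5,6,7` value mirrors the image's baked-in ENV; `nvidia-smi` is deliberately omitted):

```shell
# Simulate the image's leftover device mask, then apply the script's fix.
export CUDA_VISIBLE_DEVICES=4,5,6,7   # baked into the image's ENV
unset CUDA_VISIBLE_DEVICES            # top-of-script fix: clear the mask
# With the variable unset, CUDA enumerates every GPU in the Slurm allocation.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
```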
Only patched launch_b200-dgxc-slurm.sh last time; the b200-nb runner still had the default $GITHUB_WORKSPACE:/workspace/ mount, which masks the deepseek-v4-blackwell image's /workspace/sglang editable install. Most B200 jobs in this repo run on b200-nb. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ding Only replace the sglang launch command, keep all surrounding logic intact. Add PYTHONNOUSERSITE=1, SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1, SGLANG_OPT_USE_TOPK_V2=1 env prefixes. Switch to sglang serve with EAGLE speculative decoding (3 steps, topk=1, 4 draft tokens), chunked prefill 4096, and disable-flashinfer-autotune. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EAGLE speculative decoding is enabled in the benchmark script, so the YAML search-space entries need spec-decoding: "mtp" to ensure correct classification in config generation and eval selection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy of dsv4_fp4_b200.sh with --use-chat-template added to run_benchmark_serving, as required by AGENTS.md for MTP scripts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- config-keys:
    - dsv4-fp4-b200-sglang
  description:
    - "Add DeepSeek-V4-Pro single-node B200 SGLang benchmark (TP8, EP8, dp-attention)"
    - "Container: lmsysorg/sglang:deepseek-v4-blackwell"
    - "Recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4"
    - "Parallelism and sweep conc ranges match the dsv4-fp4-b200-vllm config"
    - "Prefix caching and speculative decoding disabled for baseline numbers"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1131
🔴 The new perf-changelog.yaml entry for dsv4-fp4-b200-sglang misrepresents what is actually shipped in this PR: it claims "TP8, EP8, dp-attention" and "Prefix caching and speculative decoding disabled", but nvidia-master.yaml has ep:1 (not 8), neither dsv4_fp4_b200.sh nor dsv4_fp4_b200_mtp.sh passes --enable-dp-attention/--dp-size, both scripts enable EAGLE speculative decoding (--speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4), and neither script disables prefix caching (no --disable-radix-cache). The pr-link also still points at #1131 (the predecessor PR) instead of this PR. The text appears to have been copy-pasted from #1131's baseline description and not reconciled with the MTP recipe being shipped here — please update it to reflect EP=1, no dp-attention, EAGLE/MTP enabled, prefix caching enabled, and the correct PR number, since perf-changelog.yaml is consumed by tooling and readers to interpret the published numbers.
Extended reasoning...
Summary
The perf-changelog.yaml entry added by this PR contradicts the YAML config and benchmark scripts shipped in the same diff on four points, plus has a stale pr-link.
What the entry claims vs. what is shipped
The new entry (perf-changelog.yaml, top of file):
- config-keys:
- dsv4-fp4-b200-sglang
description:
- "Add DeepSeek-V4-Pro single-node B200 SGLang benchmark (TP8, EP8, dp-attention)"
- ...
- "Prefix caching and speculative decoding disabled for baseline numbers"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1131

Claim 1: "EP8". `.github/configs/nvidia-master.yaml` shipped in the same diff has, for both seq-len configs:
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 1024, spec-decoding: "mtp" }

EP is 1, not 8.
Claim 2: "dp-attention". Neither benchmarks/single_node/dsv4_fp4_b200.sh nor benchmarks/single_node/dsv4_fp4_b200_mtp.sh passes --enable-dp-attention or --dp-size. dp-attention is not enabled.
Claim 3: "speculative decoding disabled for baseline numbers". Both scripts pass:
--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
Speculative decoding is enabled, not disabled. The PR title is literally "dsv4 B200 MTP SGLang launch" and the PR description bullet point is "Add EAGLE speculative decoding".
Claim 4: "Prefix caching ... disabled". sglang has prefix caching on by default; disabling it requires --disable-radix-cache, which neither script passes. So prefix caching is on.
Stale pr-link. The entry points at #1131, but this is PR #1145 (the description says "Based on #1131"). The link should be updated to this PR.
Why this happened
The PR description says "Based on #1131", and #1131 is the predecessor non-MTP, baseline-numbers submission. The author appears to have copy-pasted #1131's changelog entry verbatim and updated the YAML/scripts to the new MTP recipe without reconciling the changelog text. Every clause that's wrong is correct for #1131's baseline configuration.
Step-by-step proof
- Open `.github/configs/nvidia-master.yaml` at the new `dsv4-fp4-b200-sglang` block. For both `isl: 1024 / osl: 1024` and `isl: 8192 / osl: 1024`, search-space is `{ tp: 8, ep: 1, ... }` — EP=1, contradicting "EP8".
- Grep both new scripts for `enable-dp-attention` or `dp-size`: zero matches. Contradicts "dp-attention".
- Grep both new scripts for `speculative-algo`: each script has `--speculative-algo EAGLE` plus the three EAGLE tuning flags. Contradicts "speculative decoding disabled".
- Grep both new scripts for `disable-radix-cache`: zero matches; sglang defaults prefix caching on. Contradicts "Prefix caching ... disabled".
- The `pr-link` field is `https://github.com/SemiAnalysisAI/InferenceX/pull/1131` but the PR being reviewed is #1145.
Impact
perf-changelog.yaml is consumed by both tooling and humans to interpret what each published benchmark number represents. A reader comparing dsv4-fp4-b200-sglang numbers against a competing config will be misled into thinking this is a TP8/EP8/dp-attention baseline with speculative decoding off, when it is actually TP8/EP1 with EAGLE MTP enabled and prefix caching on — a fundamentally different operating point. No runtime impact (documentation only), so this is on the boundary between "nit" and "normal"; I'm flagging it as normal because the entry actively misleads about the test conditions of all numbers shipped under this config key.
Suggested fix
Update the entry to reflect what ships:
- Replace `(TP8, EP8, dp-attention)` with `(TP8, EP1, EAGLE MTP)` (or similar accurate wording).
- Replace `"Prefix caching and speculative decoding disabled for baseline numbers"` with a description that matches the MTP recipe, e.g. `"EAGLE speculative decoding (num-steps 3, draft-tokens 4, topk 1); prefix caching enabled"`.
- Update `pr-link` to point at this PR ([NVIDIA] chore: B200 single node DeepSeek v4 SGLang MTP #1145) instead of [NVIDIA] chore: B200 single node DeepSeek v4 SGLang #1131.
sglang serve \
    --trust-remote-code \
    --model-path $MODEL \
    --tp 8 \
    --moe-runner-backend flashinfer_mxfp4 \
    --speculative-algo EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --chunked-prefill-size 4096 \
    --disable-flashinfer-autotune \
    --mem-fraction-static 0.82 \
    --host 0.0.0.0 \
    --port $PORT > $SERVER_LOG 2>&1 &
🔴 EVAL_CONTEXT_ARGS is populated to '--context-length $EVAL_MAX_MODEL_LEN' inside the EVAL_ONLY branch (lines 40-44) but never expanded into the sglang serve invocation (lines 49-66), so the server boots with the default model context. When this benchmark runs in EVAL_ONLY mode (e.g. via the multi-node lm-eval flow added in #1000/#1094/#1120), long-context evals will be silently truncated or fail. Append $EVAL_CONTEXT_ARGS to the sglang serve command (right before > $SERVER_LOG); the same fix is needed in dsv4_fp4_b200.sh.
Extended reasoning...
What the bug is
In benchmarks/single_node/dsv4_fp4_b200_mtp.sh (and identically in dsv4_fp4_b200.sh), the EVAL_ONLY branch declares and populates EVAL_CONTEXT_ARGS:
EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

…but the variable is never referenced again in the file. The sglang serve invocation that follows ends with `--port $PORT > $SERVER_LOG 2>&1 &` with no `$EVAL_CONTEXT_ARGS` expansion. Net effect: the server boots with whatever the model's default max context is, regardless of what setup_eval_context decided.
The established pattern this diverges from
Every sibling script that defines EVAL_CONTEXT_ARGS also appends it to the launch line. Examples in this repo: dsr1_fp8_b200.sh:80, dsr1_fp4_b200.sh:48, glm5_fp8_b200.sh:56, dsr1_fp8_h200.sh:44, dsr1_fp8_b200_mtp.sh:93, glm5_fp4_b200_mtp.sh:61. benchmarks/benchmark_lib.sh:648 documents the convention: "Scripts then wire $EVAL_MAX_MODEL_LEN into whichever server variable they need." The new dsv4 scripts skip step two.
Why existing code does not prevent it
Bash silently tolerates an unused variable — there is no static check that catches EVAL_CONTEXT_ARGS being set-but-never-expanded. setup_eval_context does its job (computing EVAL_MAX_MODEL_LEN); the breakage is purely at the consumer.
Impact
The end-of-script branch if [ "${RUN_EVAL}" = "true" ]; then run_eval --framework lm-eval --port "$PORT" (lines 87-90) shows the author intended this benchmark to be eval-runnable. Per perf-changelog.yaml, recent PRs (#1000, #1094, #1120) added a multi-node lm-eval flow that exercises EVAL_ONLY, and the new dsv4-fp4-b200-sglang config in nvidia-master.yaml will be picked up by future evals-only entries that follow this pattern. When that happens, lm-eval tasks with prompts longer than the model's default context will either be truncated (silent accuracy degradation) or error with "prompt is longer than maximum context length". This is dead code today, latent eval breakage tomorrow.
Step-by-step proof
- Operator sets `EVAL_ONLY=true` and runs the runner script, which dispatches `dsv4_fp4_b200_mtp.sh`.
- Line 40 sets `EVAL_CONTEXT_ARGS=""`.
- Lines 41-44 enter the `if [ "${EVAL_ONLY}" = "true" ]` branch, call `setup_eval_context` (which sets `EVAL_MAX_MODEL_LEN` to e.g. 8192 based on lm-eval task requirements), and assign `EVAL_CONTEXT_ARGS="--context-length 8192"`.
- Lines 49-66 launch `sglang serve … --port $PORT > $SERVER_LOG 2>&1 &`. `grep EVAL_CONTEXT_ARGS` against this block returns no match — the variable's expanded value never reaches the server's argv.
- SGLang reads `max_position_embeddings` from the model config and uses that (or its own internal default) as the effective context length.
- lm-eval submits a request whose token count exceeds the default → server returns 4xx with a context-length error, or silently truncates depending on backend behavior. The eval result is wrong.
How to fix
Append $EVAL_CONTEXT_ARGS to the sglang serve invocation, right before > $SERVER_LOG:
--host 0.0.0.0 \
--port $PORT $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

Apply the same fix in dsv4_fp4_b200.sh. When EVAL_ONLY != true, EVAL_CONTEXT_ARGS stays empty and the launch is unchanged — zero risk to the non-eval benchmark path.
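Why the unquoted expansion is safe in both modes comes down to shell word splitting: an empty unquoted variable contributes zero argv entries, while a populated one contributes one entry per word. A standalone sketch (`count_args` is a hypothetical helper, not from the repo):

```shell
# An unquoted empty variable adds nothing to argv; a populated one adds
# one entry per whitespace-separated word.
count_args() { echo $#; }

EVAL_CONTEXT_ARGS=""
count_args --port 8888 $EVAL_CONTEXT_ARGS   # prints 2: launch unchanged
EVAL_CONTEXT_ARGS="--context-length 8192"
count_args --port 8888 $EVAL_CONTEXT_ARGS   # prints 4: flag and value appended
```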
--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
🔴 The non-MTP benchmarks/single_node/dsv4_fp4_b200.sh enables EAGLE speculative decoding (--speculative-algo EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4), contradicting both its name and the convention used by every other non-MTP variant in this directory. Drop the four --speculative-* flags from the non-MTP script so the runner picks up genuinely-disabled spec decoding when a non-MTP YAML entry is added later (matches dsr1_fp4_b200.sh, dsr1_fp8_b200.sh, glm5_fp8_b200.sh, etc.). The current YAML only references spec-decoding: mtp, so this is dormant today, but it ships in the PR and the changelog explicitly states "Prefix caching and speculative decoding disabled for baseline numbers," which this script contradicts.
Extended reasoning...
What is broken
benchmarks/single_node/dsv4_fp4_b200.sh (the non-MTP variant added in this PR) launches sglang at lines 58–61 with:
--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
The two new dsv4 scripts dsv4_fp4_b200.sh and dsv4_fp4_b200_mtp.sh are byte-for-byte identical except that the _mtp script appends --use-chat-template to the benchmark client (line 85 of the _mtp script). This pattern strongly suggests the non-MTP file was created by copy-pasting the MTP script and forgetting to strip the spec-decoding block.
Why this contradicts repo convention
Across benchmarks/single_node/ every other framework + GPU pair follows a strict <model>_<precision>_<gpu>.sh (no spec) vs <model>_<precision>_<gpu>_mtp.sh (spec on) split — e.g. dsr1_fp4_b200.sh, dsr1_fp8_b200.sh, glm5_fp4_b200.sh, and glm5_fp8_b200.sh contain no --speculative-* flags, while their _mtp.sh siblings do. dsv4_fp4_b200.sh is the only non-_mtp script in this directory that enables EAGLE.
How the runner selects scripts
runners/launch_b200-dgxc-slurm.sh (and launch_b200-nb.sh) build the script name with:
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
... benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_b200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh

So an entry without `spec-decoding: mtp` resolves to `dsv4_fp4_b200.sh` (no suffix). The current nvidia-master.yaml only adds entries with `spec-decoding: "mtp"` for dsv4-fp4-b200-sglang, so the broken script is dormant today — that's why no one will see incorrect numbers right now.
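The suffix resolution can be exercised in isolation — a sketch that wraps the runner's two lines in a function, with illustrative values for `EXP_NAME`, `PRECISION`, and `FRAMEWORK_SUFFIX` (not taken from the repo):

```shell
# Reproduce the runner's script-name construction for a given sweep entry.
resolve_script() {
  EXP_NAME="$1"; PRECISION="$2"; FRAMEWORK_SUFFIX="$3"; SPEC_DECODING="$4"
  # mtp entries get the _mtp suffix; everything else resolves to the baseline.
  SPEC_SUFFIX=$([ "$SPEC_DECODING" = "mtp" ] && printf '_mtp' || printf '')
  printf 'benchmarks/single_node/%s_%s_b200%s%s.sh\n' \
    "${EXP_NAME%%_*}" "$PRECISION" "$FRAMEWORK_SUFFIX" "$SPEC_SUFFIX"
}

resolve_script dsv4_sweep fp4 "" mtp   # -> benchmarks/single_node/dsv4_fp4_b200_mtp.sh
resolve_script dsv4_sweep fp4 "" ""    # -> benchmarks/single_node/dsv4_fp4_b200.sh
```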
Step-by-step proof of the latent bug
- A future PR adds a baseline (non-MTP) entry to the dsv4-fp4-b200-sglang config, e.g. `{ tp: 8, ep: 1, conc-start: 4, conc-end: 1024 }` (no `spec-decoding` field).
- `SPEC_DECODING` is empty for that entry, so `SPEC_SUFFIX` is empty.
- The runner invokes `benchmarks/single_node/dsv4_fp4_b200.sh`.
- That script still launches sglang with `--speculative-algo EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- The benchmark records numbers from EAGLE-accelerated decoding and the user labels them as "baseline (no spec decoding)" results. Silently wrong perf numbers result.
Conflict with the PR's own changelog
The perf-changelog.yaml entry added in this PR states for dsv4-fp4-b200-sglang:
"Prefix caching and speculative decoding disabled for baseline numbers"
That promise is correct for the MTP variant only. The non-MTP script as written breaks this contract the moment any non-MTP YAML entry is added.
Fix
Remove these four lines from benchmarks/single_node/dsv4_fp4_b200.sh (lines 58–61):
--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
After this, dsv4_fp4_b200.sh matches its name and matches the convention established by every other non-MTP variant in the directory. The MTP script (dsv4_fp4_b200_mtp.sh) is correct as-is and should keep the spec flags.
sglang serve \
    --trust-remote-code \
    --model-path $MODEL \
    --tp 8 \
🟡 Hard-coded --tp 8 on line 56 ignores the $TP env var that check_env_vars enforces and that the script echoes for logging on line 38; same issue at dsv4_fp4_b200.sh:56. Latent today (yaml only has tp:8 entries) but inconsistent with every sibling *_b200*.sh script in this directory which all parameterize via --tensor-parallel-size=$TP or --tp "$TP" — worth fixing now while the script is being added. Trivial fix: change --tp 8 to --tp "$TP".
Extended reasoning...
What the bug is
Both benchmarks/single_node/dsv4_fp4_b200.sh:56 and benchmarks/single_node/dsv4_fp4_b200_mtp.sh:56 invoke sglang serve with a literal --tp 8, even though the same script:
- Calls `check_env_vars … TP …` at the top, enforcing that `$TP` is set by the runner.
- Echoes `"TP: $TP, …"` for logging on line 38.
- Then ignores `$TP` and pins TP=8 on line 56.
Every other comparable single-node b200 sglang script parameterizes via $TP. grep in the same directory:
benchmarks/single_node/dsr1_fp4_b200.sh:44:--tensor-parallel-size=$TP …
benchmarks/single_node/dsr1_fp8_b200.sh:76:--tensor-parallel-size=$TP …
benchmarks/single_node/dsr1_fp8_b200_mtp.sh:72:--tensor-parallel-size=$TP \
benchmarks/single_node/dsv4_fp4_b200.sh:56: --tp 8 \
benchmarks/single_node/dsv4_fp4_b200_mtp.sh:56: --tp 8 \
Why this is latent today
In .github/configs/nvidia-master.yaml, the new dsv4-fp4-b200-sglang block only defines { tp: 8, … } entries for both seq-len configs — so today $TP always equals 8 and there is no functional divergence. This is why I'm filing as nit, not normal.
Why this still warrants a fix in this PR
If a future PR adds a { tp: 4, … } entry to this config (consistent with dsr1-fp4-b200-sglang, which already has both tp: 4 and tp: 8 entries), the failure mode is silent and confusing:
- The runner sweep loop in `runners/launch_b200-dgxc-slurm.sh:285` reserves `--gres=gpu:$TP` — so slurm only allocates 4 GPUs for the job.
- The script then runs `unset CUDA_VISIBLE_DEVICES` (per the comment on line 25, to clear the image's baked-in mask) and invokes `sglang serve --tp 8`.
- sglang would try to fan out to 8 ranks against a 4-GPU allocation. Worse, the unset `CUDA_VISIBLE_DEVICES` makes the symptom less obvious because the original 4-GPU mask is being cleared right before the launch.
Addressing the refutation
The refutation argues this is intentional design because:
- "Comment says TP=8": The comment on lines 23-25 explains why `unset CUDA_VISIBLE_DEVICES` is needed (the image bakes in a 4-GPU mask) — it documents an image-specific quirk, not a hard TP=8 design constraint. The comment says "so TP=8 can bind to all ranks", not "this script only supports TP=8".
- "dsr1_fp8_b200_mtp.sh also hardcodes TP=8": This is factually incorrect. `dsr1_fp8_b200_mtp.sh` uses `--tensor-parallel-size=$TP` on line 72 (not hardcoded), with a defensive guard `if [[ $TP -ne 8 ]]; then exit 1; fi` on line 29. That's the right pattern: parameterize via $TP, validate the input. The dsv4 scripts do neither.
- "Yaml only has tp:8": Acknowledged; that's why this is filed as nit, not normal.
Step-by-step proof of the latent failure
Suppose someone adds a TP=4 entry to the yaml in a future PR:
dsv4-fp4-b200-sglang:
...
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 256, spec-decoding: "mtp" }

Then for that sweep entry:
- Runner sets `TP=4` and runs `salloc --gres=gpu:4 …` (line 285 in `launch_b200-dgxc-slurm.sh`).
- Slurm allocates a 4-GPU job. `CUDA_VISIBLE_DEVICES` at this point reflects the 4 allocated GPUs.
- `dsv4_fp4_b200_mtp.sh` runs, echoes "TP: 4, …".
- Script runs `unset CUDA_VISIBLE_DEVICES` (clears the visible mask).
- Script invokes `sglang serve --tp 8` — tries to use 8 ranks. Crashes or hangs at startup. The user sees a TP=8 error in a "TP: 4" log context and has to reverse-engineer why.
How to fix
One-line change on line 56 in each file: --tp 8 → --tp "$TP". Optionally also add a if [[ $TP -ne 8 ]]; then …; fi guard like dsr1_fp8_b200_mtp.sh:29 if the author wants to keep TP=8-only enforcement explicit. Either way matches the established pattern in the rest of the repo.
--model-path $MODEL \
--tp 8 \
--moe-runner-backend flashinfer_mxfp4 \
--speculative-algo EAGLE \
🟡 Both new dsv4 scripts pass --speculative-algo EAGLE (dsv4_fp4_b200.sh:58 and dsv4_fp4_b200_mtp.sh:58), while all 17 sibling MTP scripts in benchmarks/single_node/ (dsr1_mtp.sh, glm5mtp.sh, qwen3.5*_mtp.sh) use the canonical --speculative-algorithm. The abbreviated form likely works today via argparse prefix-matching, but it's a clear convention break and is fragile (would silently break if sglang ever adds another --speculative-algo* flag, making the prefix ambiguous, or switches sglang serve from argparse to click). Recommend changing both lines to --speculative-algorithm EAGLE for consistency.
Extended reasoning...
What the bug is
Both new dsv4 launch scripts use a non-canonical flag name:
- `benchmarks/single_node/dsv4_fp4_b200.sh:58` — `--speculative-algo EAGLE`
- `benchmarks/single_node/dsv4_fp4_b200_mtp.sh:58` — `--speculative-algo EAGLE`
Every other MTP script in this directory (17 files: dsr1_fp8_b200_mtp.sh, dsr1_fp8_b300_mtp.sh, glm5_fp4_b200_mtp.sh, glm5_fp4_b300_mtp.sh, glm5_fp8_b200_mtp.sh, glm5_fp8_b300_mtp.sh, glm5_fp8_mi355x_mtp.sh, qwen3.5_bf16_b200_mtp.sh, qwen3.5_bf16_b300_mtp.sh, qwen3.5_bf16_mi355x_mtp.sh, qwen3.5_fp4_b200_mtp.sh, qwen3.5_fp4_b300_mtp.sh, qwen3.5_fp8_b200_mtp.sh, qwen3.5_fp8_b300_mtp.sh, qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_mi355x_mtp.sh) uses the full --speculative-algorithm EAGLE. Upstream sglang's server_args.py defines the canonical flag as --speculative-algorithm; --speculative-algo is not declared as an alias.
Why it likely works today
Python's argparse with allow_abbrev=True (the default) accepts any unambiguous prefix of a registered long option. The other speculative flags in the same launch (--speculative-num-steps, --speculative-eagle-topk, --speculative-num-draft-tokens) do not share the --speculative-algo prefix, so today --speculative-algo resolves uniquely to --speculative-algorithm. So in practice the launch most likely works on the deepseek-v4-blackwell image.
Why it's still worth fixing
- Fragile: if sglang ever adds another flag starting with `--speculative-algo` (e.g., `--speculative-algo-version`), argparse will then fail with an ambiguous-prefix error and EAGLE will be silently disabled or the launch will hard-fail. This is exactly the kind of breakage that surfaces only after an image bump.
- Disabled in some setups: argparse `allow_abbrev` was made tunable in Python 3.5; some sglang code paths or future click-based rewrites of `sglang serve` (which is the new 0.5.x entry point — these are the FIRST scripts in the repo to use `sglang serve` rather than `python3 -m sglang.launch_server`) will not honor the abbreviation. Click does not do prefix matching at all.
- Consistency: 17 sibling scripts use the full name. Matching the convention costs nothing and removes the question of whether this is intentional.
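The prefix-matching behavior and its ambiguity failure mode can be shown with stock argparse — a self-contained sketch, not sglang's actual parser (the `--speculative-algo-version` flag is hypothetical):

```python
import argparse

# With only the canonical flag registered, the abbreviation resolves uniquely.
p1 = argparse.ArgumentParser()  # allow_abbrev=True is the default
p1.add_argument("--speculative-algorithm")
args = p1.parse_args(["--speculative-algo", "EAGLE"])
print(args.speculative_algorithm)  # EAGLE

# Register a second flag sharing the prefix: the same spelling becomes ambiguous.
p2 = argparse.ArgumentParser()
p2.add_argument("--speculative-algorithm")
p2.add_argument("--speculative-algo-version")  # hypothetical future knob
try:
    p2.parse_args(["--speculative-algo", "EAGLE"])
except SystemExit:
    # argparse calls sys.exit(2) after printing "ambiguous option: ..."
    print("argparse rejected --speculative-algo as ambiguous")
```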
Addressing the refutation
A reviewer noted the PR description cites the upstream cookbook recipe (https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4) and suggested the PR author may have copied the abbreviated flag from there. That's possible, but it doesn't change the recommendation: the canonical sglang flag is --speculative-algorithm, every existing repo script uses it, and the fix is a one-character-per-line edit (-algo → -algorithm). Even if the cookbook also uses the short form, the canonical name is unambiguously safer and consistent with the rest of this repo.
Step-by-step proof
- Run `grep -n 'speculative-alg' benchmarks/single_node/*.sh` — only `dsv4_fp4_b200.sh` and `dsv4_fp4_b200_mtp.sh` use `--speculative-algo`; the other 17 use `--speculative-algorithm`.
- Inspect upstream `sglang/srt/server_args.py` — argparse registers `--speculative-algorithm` (no `--speculative-algo` alias).
- Today: argparse with default `allow_abbrev=True` resolves `--speculative-algo` → `--speculative-algorithm` because no other registered long option starts with `--speculative-algo`.
- Hypothetical near-future scenario: sglang adds `--speculative-algo-version` (a plausible sub-knob name). Now `--speculative-algo` is an ambiguous prefix and argparse exits with `error: ambiguous option: --speculative-algo could match --speculative-algorithm, --speculative-algo-version`. The dsv4 launch fails or silently disables EAGLE; every other MTP script in the repo continues to work because they pass the full name.
Fix
Change both lines to:
--speculative-algorithm EAGLE \
Severity
nit — works today via argparse prefix-matching, but inconsistent with 17 sibling scripts and fragile to upstream additions. Worth fixing while the PR is open.
…gl-mtp-b200 # Conflicts: # .github/configs/nvidia-master.yaml # benchmarks/single_node/dsv4_fp4_b200.sh
…env vars Both flags were on the sglang serve invocation in the original PR (#1139) and got dropped when the script was restructured to mirror the baseline 3-recipe layout. Re-add as exports so they apply across all recipes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror dsv4-fp4-b200-sglang's low-latency / balanced / max-throughput split across both seq-len-configs so result filenames carry ep=/dpa= labels per recipe. spec-decoding: "mtp" stays on every entry; the script picks the MTP params (or omits them at max-throughput) by CONC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
need to support jinja chat template. we are working on it
Summary
- Switch `python3 -m sglang.launch_server` to `sglang serve`
- Add EAGLE speculative decoding (`--speculative-num-steps 3`, `--speculative-eagle-topk 1`, `--speculative-num-draft-tokens 4`)
- Add `--chunked-prefill-size 4096` and `--disable-flashinfer-autotune`
- Add `SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1`, `SGLANG_OPT_USE_TOPK_V2=1`
- `--moe-runner-backend flashinfer_mxfp4`, port 8888, benchmark backend vllm unchanged

Based on #1131
Test plan
🤖 Generated with Claude Code