Skip to content

Day 0 DeepSeek V4 Pro FP4 GB200 disaggregated SGLang benchmarks #1157

Open
Oseltamivir wants to merge 29 commits into main from dsv4-fp4-gb200-dynamo-sglang

Conversation


@Oseltamivir Oseltamivir commented Apr 25, 2026

Summary

Adds dsv4-fp4-gb200-dynamo-sglang for DeepSeek-V4-Pro on GB200 (Dynamo + SGLang disagg). Mirrors the topology of #1129 (Dynamo + vLLM) so cross-framework numbers stay directly comparable. Both 1k/1k and 8k/1k sweeps are active.

Active sweep

6 cluster startups / 27 benchmark points across the two seq-lens:

| ISL | Topology | Conc | Nodes |
|---|---|---|---|
| 1k/1k | 1p1d-dep8-tep8 | 1, 4, 8, 16, 32, 64 | 4 |
| 1k/1k | 1p1d-dep8-dep16 | 128, 256, 1024, 2048, 4096 | 4 |
| 1k/1k | 3p1d-dep8-dep16 | 4096, 8192 | 8 |
| 8k/1k | 1p1d-dep8-tep8 | 1, 4, 8, 16, 32, 64 | 4 |
| 8k/1k | 3p1d-dep8-dep16 | 512, 1024 | 8 |
| 8k/1k | 7p1d-dep8-dep16 | 4096, 8192 | 16 |

The dep8/dep16 filenames are kept for symmetry with the dsv4-fp4-gb200-dynamo-vllm sibling (#1129) — the actual sglang config is TP=8 / DP=8 / no DeepEP for every entry. See "Known issue: DeepEP path broken on this image" below for why.

These recipes are hand-rolled — no upstream GB200 DSV4 sglang disagg recipe exists. The per-worker sglang_config is mirrored from NVIDIA/srt-slurm PR #69 (recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml — the GB200 DSV4 sglang aggregated baseline: env-var set + flashinfer_mxfp4 + chunked-prefill-size 4096 + disable-flashinfer-autotune + mem-fraction-static 0.82). The disagg flag set (nixl transfer backend) was cross-checked against PR #75 (recipes/gb300-fp4/1k1k-dsv4/disagg-1p1d-tp4-mxfp4.yaml — GB300 DSV4 sglang disagg) and the SGLang DeepSeek-V4 cookbook.

Files

  • .github/configs/nvidia-master.yaml — new sweep config keyed dsv4-fp4-gb200-dynamo-sglang
  • benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/{1k1k,8k1k}/*.yaml — 6 recipe YAMLs (overlaid onto the upstream srt-slurm checkout at runtime)
  • runners/launch_gb200-nv.sh — dsv4+dynamo-sglang model-prefix branch + recipe-overlay step (cp -rT matching the existing dsv4+dynamo-vllm branch)
  • perf-changelog.yaml — sweep changelog entry

Image

lmsysorg/sglang:deepseek-v4-grace-blackwell (linux/arm64) — the only DSV4 SGLang image published for aarch64 on Docker Hub. The -blackwell tag is amd64-only and won't run on GB200.

Known issue: every EP backend is dead on this image

Both EP backends sglang exposes in this image fail deterministically — confirmed across four separate runs:

| moe-a2a-backend | deepep-mode | Failure |
|---|---|---|
| deepep | auto (default → "normal" for prefill) | mxfp4_deepseek.py:347 AttributeError: 'DeepEPNormalDispatchOutput' object has no attribute 'topk_output' |
| deepep | low_latency | mxfp4_deepseek.py:347 AttributeError: 'DeepEPLLDispatchOutput' object has no attribute 'topk_output' |
| deepep | low_latency (high-conc 3p1d retry) | same as above |
| flashinfer | n/a | server_args.py:2133 AssertionError: Flashinfer MoE A2A is only supported with flashinfer_cutlass moe runner backend |

DeepEP: mxfp4_deepseek.py is a fork-only file in this image (verified via gh search code — does not exist in upstream sgl-project/sglang). The kernel reads dispatch_output.topk_output, but neither DeepEPLLDispatchOutput nor DeepEPNormalDispatchOutput exposes that field in this fork's DeepEP token-dispatcher.

flashinfer: _handle_a2a_moe asserts moe_runner_backend == "flashinfer_cutlass" whenever moe-a2a-backend: flashinfer is set. flashinfer_cutlass is FP8-only — it won't load DSV4-Pro's MXFP4-quantized weights. So the only path that satisfies the assertion would also fail to load the model.

The remaining EP backends (mooncake, nixl-ep, mori, ascend_fuseep) are either Ascend-NPU-only or have not been wired into this image.

The sweep therefore runs everything through the forward_normal MoE path (default moe_a2a_backend="none", no EP), with experts TP-sharded instead of EP-sharded. Memory still fits comfortably — DSV4-Pro at MXFP4 is ~340 GB total, ≈42 GB per rank at TP=8 — but throughput at high concurrency lags what an EP-sharded run could deliver. We give up the EP throughput-scaling data point on this image; the rest of the sweep is unaffected. This cannot be fixed from the recipe — it requires either rebuilding the image with mxfp4_deepseek.py patched, an upstream sglang fix, or a flashinfer_mxfp4_cutlass runner that doesn't exist yet.

Why enable-dp-attention: true even on the 1p1d-tep8 topology

DP-attention is on everywhere — even on the low-conc 1p1d topology where it's not strictly needed for parallelism. It's there to keep the FP8 block-quant validator from rejecting the model at load time, and removing it cascades into a chain of failures we already burned through during early iteration. The chain:

  1. TP=4 doesn't fit. DSV4-Pro at MXFP4 (~340 GB) OOMs on a single GB200 node (4 GPUs × 96 GB = 384 GB). Confirmed by torch.OutOfMemoryError on the first TP=4 attempt. We need TP=8 across 2 nodes (768 GB) for any worker to load.
  2. TP=8 trips FP8 block-quant. DSV4-Pro's shared-experts gate_up_proj (intermediate ~1536) FP8-quants in 128-element blocks. With TP=8 the per-rank slice is 1536 / 8 = 192, which fails validate_block_quant_shapes (192 % 128 ≠ 0). PR #75 sidesteps this with TP=4 (1536 / 4 = 384), but per (1) we can't.
  3. moe-dense-tp-size: 1 is the documented escape hatch — it runs the dense / shared-MLP layers replicated (TP=1) so the divisibility check passes on a per-replica basis while attention + routed experts keep TP=8. Per python/sglang/srt/server_args.py: "useful when, with large TP size, there are errors caused by weights in MLP layers having dimension smaller than the min dimension GEMM supports."
  4. The flag is silently ignored without DP-attention. Verified in upstream sglang python/sglang/srt/layers/dp_attention.py:compute_dp_attention_local_info:
    if not enable_dp_attention:
        return tp_rank, tp_size, 0   # moe_dense_tp_size IGNORED
    local_tp_size = moe_dense_tp_size if moe_dense_tp_size else tp_size
    The function returns the full tp_size and never consults moe_dense_tp_size when DP-attn is off. So setting the flag without enabling DP-attention is a no-op, and the FP8 ValueError fires.
  5. DP-attention must be on, but DeepEP must stay off (see "Known issue" above). We leave moe-a2a-backend at its default "none" (sglang server_args.py), which routes the model through forward_normal instead — verified in deepseek_v2.py:
    self._enable_a2a_moe = (
        get_moe_a2a_backend().is_deepep() or ...is_flashinfer())
    ...
    if not self._enable_a2a_moe:
        return self.forward_normal(...)   # avoids the buggy mxfp4_deepseek path
    else:
        return self.forward_deepep(...)

The single combination that satisfies all constraints is enable-dp-attention: true + dp-size: 8 + moe-dense-tp-size: 1 + (no moe-a2a-backend set, default "none"), applied uniformly on prefill and decode in every recipe.

Recipe-reminder bot response

Test plan

  • python3 utils/matrix_logic/generate_sweep_configs.py test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-gb200-dynamo-sglang expands to 6 entries / 27 conc points
  • All 6 recipe YAMLs parse via yaml.safe_load
  • bash -n runners/launch_gb200-nv.sh passes
  • perf-changelog.yaml diff vs main is additions-only (gate passes)
  • First-run smoke test on GB200: confirm --moe-runner-backend flashinfer_mxfp4 + DP-attention + no DeepEP combination loads without OOM at low-conc (1p1d-dep8-tep8); may require tuning mem-fraction-static or cuda-graph-max-bs if first-run hits memory pressure.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment


@claude Bot left a comment


Mechanical, additive benchmark config PR mirroring the dynamo-vllm dsv4 sibling — no production code paths affected.


Overview

This PR adds a new SGLang variant of the DeepSeek-V4-Pro FP4 GB200 disaggregated benchmark sweep (dsv4-fp4-gb200-dynamo-sglang), structurally mirroring the previously merged vLLM sibling (#1129). Changes touch only benchmark configuration: a new entry in .github/configs/nvidia-master.yaml, 6 new isolated SGLang recipe YAMLs, a single elif branch in runners/launch_gb200-nv.sh that mirrors the existing dsv4+dynamo-vllm branch (sets MODEL_PATH/SRT_SLURM_MODEL_PREFIX, clones srt-slurm at the same pinned branch, and overlays the new recipes), and a perf-changelog.yaml entry.

Security risks

None. No authentication, authorization, cryptography, secrets handling, or production-runtime code paths are modified. The new shell branch only writes to the runner workspace and clones from the same upstream NVIDIA/srt-slurm repository on the same pinned branch already used by the merged vLLM sibling.

Level of scrutiny

Low. This is a benchmark config addition following an established and recently merged pattern. The 1k/1k sweep is the only active block (8k/1k recipes are shipped but commented out), and the per-worker tunings have explicit citations to upstream srt-slurm references (PR #69 GB200 agg, PR #75 GB300 disagg) plus the SGLang DeepSeek-V4 cookbook. Failure modes are confined to benchmark runs, not infrastructure or other workloads.

Other factors

  • No bugs reported by the bug-hunting system.
  • No outstanding reviewer comments — only the standard automated recipe-reminder bot pings.
  • The launch script change is a small, mechanical mirror of the existing dynamo-vllm dsv4 branch added in the merged sibling PR #1129.
  • The PR author has explicitly flagged that first-run smoke testing on GB200 may require mem-fraction-static/cuda-graph-max-bs tuning, which is appropriate for a Day-0 benchmark.

Oseltamivir and others added 15 commits April 25, 2026 13:35
srtctl SrtConfig schema rejects backend.connector for the sglang
backend type. The field was carried over from the dynamo-vllm dsv4
recipes (where it is valid and set to null). PR #69/#75 sglang
recipes upstream do not declare it.
…ckwell sglang fork

Re-installing dynamo 0.8.1 over the lmsysorg/sglang:deepseek-v4-grace-blackwell
container's pre-baked sglang fails at import time:

    File ".../dynamo/sglang/health_check.py", line 20
      def _get_bos_token_id_from_engine(engine: Optional[sgl.Engine])
    AttributeError: module 'sglang' has no attribute 'Engine'

The DSV4 sglang fork bundled in this image does not expose sgl.Engine.
Drop the dynamo: block so srtctl uses the dynamo build pre-installed in
the container — matches NVIDIA/srt-slurm PR #75 (the only upstream
DSV4 sglang disagg recipe), which also has no dynamo: block.
srtctl's DynamoConfig (src/srtctl/core/schema.py L680) defaults to
install=True, which pip installs dynamo 0.8.0 even when no `dynamo:`
block is specified. Use the explicit opt-out so srtctl uses the dynamo
build baked into the lmsysorg/sglang:deepseek-v4-grace-blackwell
image. This image's sglang fork doesn't expose sgl.Engine, which
dynamo.sglang.health_check imports at top level — re-installing
dynamo over it breaks startup.
install: false fixed the pip-install crash, but the
lmsysorg/sglang:deepseek-v4-grace-blackwell image doesn't have dynamo
pre-installed (ModuleNotFoundError: No module named 'dynamo'), so
srtctl needs to install something compatible.

The DSV4-targeted dynamo tag v1.2.0-sglang-deepseek-v4-dev.1 (sha
21f135f5edf40e12e6ff5db2b462d862a6d6ab9b) includes
'from __future__ import annotations' in dynamo/sglang/health_check.py
(ai-dynamo PR #7255, commit cdb7218a, 2026-03-12), which makes the
Optional[sgl.Engine] annotation lazy. The PyPI 0.8.0/0.8.1 releases
predate that fix and crash with AttributeError on this image's
sglang fork.
…patch bug

Prefill warmup crashed in run 24941291328 with:

  File ".../sglang/srt/layers/quantization/mxfp4_deepseek.py", line 347
    topk_output = dispatch_output.topk_output
  AttributeError: 'DeepEPNormalDispatchOutput' object has no attribute 'topk_output'

Per sglang server_args.py, --deepep-mode defaults to 'auto', which
picks 'normal' for prefill batches and 'low_latency' for decode. The
mxfp4_deepseek MoE kernel only handles the low_latency dispatch
output shape (which carries topk_output); the normal-dispatch output
type does not, so any prefill forward (or decode warmup using
forward_idle) hits the AttributeError before the worker can serve.

Force deepep-mode: low_latency on every prefill + decode block that
uses moe-a2a-backend: deepep. The two 1p1d-dep8-tep8 decode blocks
remain TP-only (no DeepEP) and are unaffected.

Run reference: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24941291328
…tch types broken

Run after the deepep-mode: low_latency change failed again. Logs show
two distinct DeepEP-path failures:

1. Prefill scheduler crash:
     File '.../sglang/srt/layers/quantization/mxfp4_deepseek.py', line 347
       topk_output = dispatch_output.topk_output
     AttributeError: 'DeepEPLLDispatchOutput' object has no attribute 'topk_output'
   The earlier crash had 'DeepEPNormalDispatchOutput' — neither dispatch
   output type in this image's sglang fork exposes topk_output, so
   forcing low_latency vs normal mode does not help. mxfp4_deepseek.py
   is a fork-only file (does not exist in upstream sgl-project/sglang),
   so the API mismatch can only be fixed by rebuilding the image.

2. Decode CUDA graph capture crash:
     RuntimeError: Failed: Assertion error /sgl-workspace/DeepEP/csrc/deep_ep.cpp:1233
       'x.size(0) == topk_idx.size(0) and x.size(0) <= num_max_dispatch_tokens_per_rank'
   DeepEP low_latency_dispatch's per-rank token cap is exceeded by the
   cuda-graph-max-bs we configured.

Both failures are in the DeepEP path. Per upstream sgl-project/sglang
(server_args.py), moe_a2a_backend defaults to 'none', which uses
all-reduce/all-gather dispatch and lets TP shard the expert weights
across ranks (no separate EP needed). NVIDIA/srt-slurm PR #75 (the
only upstream DSV4 sglang disagg recipe) takes the same TP-only stance
— pure tensor-parallel-size: N with no enable-dp-attention, no
moe-a2a-backend deepep, no dp-size, no ep-size.

Drop those five fields from all 6 recipes. Topology shape preserved:
- 1k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 1k1k 1p1d-wide: P TP=8 / D TP=16 (6 nodes)
- 1k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 8k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 7p1d-wide: P 7*TP=8 / D TP=16 (18 nodes)

DSV4-Pro at MXFP4 (~340 GB) shards comfortably under TP=8 (~42 GB/rank)
or TP=16 (~21 GB/rank) with mem-fraction-static: 0.82 leaving plenty of
KV cache headroom on each 96 GB GB200 GPU.

Topology filenames retain the 'dep8' / 'dep16' historical names from
the vLLM PR #1129 sibling for symmetry — the actual sglang_config is
TP-only.
…ility at TP=8/16

After the DeepEP removal, model load crashed at:

  File '.../sglang/srt/layers/quantization/fp8.py', line 282, in validate_block_quant_shapes
    raise ValueError(
  ValueError: Weight output_partition_size = 192 is not divisible
              by weight quantization block_n = 128.

DSV4-Pro's shared-experts gate_up_proj (intermediate ~1536) FP8-quants
in 128-element blocks. With TP=8 the per-rank slice is 1536/8=192,
which fails the divisibility check. PR #75 sidesteps this by using
TP=4 (1536/4=384), but that locks us into single-node workers.

sglang's --moe-dense-tp-size flag is the documented workaround
(server_args.py: 'useful when, with large TP size, there are errors
caused by weights in MLP layers having dimension smaller than the
min dimension GEMM supports'). Setting moe-dense-tp-size: 1 runs the
shared / dense-MLP layers replicated across ranks (TP=1) while the
rest of the model — attention, routed experts — keeps TP=8/16. Memory
cost is small since shared experts are a fraction of total weights.

Applied to all 6 recipes; topology/node counts unchanged.
…ocks

Belt-and-suspenders for the DeepEP per-rank dispatch buffer cap. The
default is too low; with this set we'll have headroom if EP / DeepEP
is re-enabled later (e.g., once the fork's mxfp4_deepseek dispatch API
mismatch is fixed). 1024 matches the cookbook's B200 decode reference.
A run after adding moe-dense-tp-size: 1 still hit:
  ValueError: Weight output_partition_size = 192 is not divisible
              by weight quantization block_n = 128.

Verified in upstream sglang dp_attention.py (compute_dp_attention_local_info):
  if not enable_dp_attention:
      return tp_rank, tp_size, 0   # moe_dense_tp_size IGNORED
The flag is only honored when enable_dp_attention=True. Since we
already dropped DP-attention to avoid the fork's mxfp4_deepseek bug,
moe-dense-tp-size: 1 was a no-op.

Two valid paths:
  (a) re-enable DP-attention without DeepEP — speculative, never tested
  (b) drop to TP=4 — 1536/4=384 divides cleanly by 128, FP8 quant
      passes. Matches NVIDIA/srt-slurm PR #75 (the only verified-
      working DSV4 sglang disagg recipe upstream) verbatim.

Going with (b). Recipes drop moe-dense-tp-size (no longer needed at
TP=4) and switch tensor-parallel-size to 4 in both prefill+decode.
gpus_per_prefill / gpus_per_decode drop to 4 (single GB200 node per
worker). prefill_nodes / decode_nodes track worker counts.

Topology shape (filenames keep historical dep8/dep16 naming for
symmetry with the vLLM #1129 sibling; actual config is TP=4):
  - 1k1k 1p1d-tep8:    P TP=4 / D TP=4 (2 nodes total)
  - 1k1k 1p1d-dep16:   P TP=4 / D TP=4 (2 nodes total) — same shape, different conc
  - 1k1k 3p1d-dep16:   P 3*TP=4 / D TP=4 (4 nodes)
  - 8k1k 1p1d-tep8:    P TP=4 / D TP=4 (2 nodes)
  - 8k1k 3p1d-dep16:   P 3*TP=4 / D TP=4 (4 nodes)
  - 8k1k 7p1d-dep16:   P 7*TP=4 / D TP=4 (8 nodes)

nvidia-master.yaml updated to match (tp: 4, ep: 1, dp-attn: false on
every prefill+decode block — including the commented 8k/1k block).

Also bumped SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK 1024 → 2048
in all env blocks (DeepEP path is dormant in this config but the env
var is in place for re-enabling later).
The merge of main into this branch (c0aec93) accidentally overwrote
the two dsv4-fp8-mi355x-sglang retry entries (PR #1148 retry-pair tail
and PR #1159 retry-pair) with duplicated copies of our own
dsv4-fp4-gb200-dynamo-sglang entry. The process_changelog.py gate
rejects deletions, so the workflow blocked.

Restore the two mi355x entries verbatim from origin/main and keep a
single copy of our dsv4 entry, appended after the restored mi355x
block. perf-changelog.yaml diff vs origin/main is now additions-only.
…oe-a2a-backend

TP=4 OOMed — DSV4-Pro at MXFP4 doesn't fit on a single GB200 node.
Need TP=8 across 2 nodes (768 GB total).

But TP=8 trips two issues that earlier rounds papered over:
  a) shared-experts gate_up_proj FP8 block-quant divisibility
     (1536/8=192, not a multiple of block_n=128)
  b) the lmsysorg/sglang:deepseek-v4-grace-blackwell fork's
     mxfp4_deepseek kernel crashes on every DeepEP forward path

Single combo that solves both — verified in upstream sglang source:
  * enable-dp-attention: true  +  moe-dense-tp-size: 1
    Runs dense / shared-MLP layers replicated (TP=1) — fixes (a).
    moe-dense-tp-size IS gated on enable_dp_attention=True per
    python/sglang/srt/layers/dp_attention.py
    (compute_dp_attention_local_info ignores it when DP-attn is off).
  * NO moe-a2a-backend set (default 'none')
    Lands the model on forward_normal instead of forward_deepep —
    avoids (b). Verified in deepseek_v2.py:
      _enable_a2a_moe = is_deepep | is_mooncake | is_nixl | is_mori
                       | is_ascend_fuseep | is_flashinfer
    With backend='none' this is False and forward_normal runs.

Recipes: tensor-parallel-size 4 → 8 (both prefill+decode); add
moe-dense-tp-size: 1, enable-dp-attention: true, dp-size: 8 to every
sglang_config block; gpus_per_prefill / gpus_per_decode 4 → 8;
prefill_nodes / decode_nodes scale to workers × 2.

nvidia-master.yaml mirrors: tp 4 → 8, dp-attn false → true on every
prefill+decode block (active 1k/1k + commented 8k/1k). Topology shape
restored to:
  - 1k1k 1p1d-* : 4 nodes (was 2)
  - 1k1k 3p1d-* : 8 nodes (was 4)
  - 8k1k 1p1d-* : 4 nodes (commented)
  - 8k1k 3p1d-* : 8 nodes (commented)
  - 8k1k 7p1d-* : 16 nodes (commented)
Comment out the low-conc (1-64) and mid-conc (128-4096) search-space
entries in nvidia-master.yaml so the sweep iterates only on the high-
conc 3p1d-dep8-dep16 topology. Re-enable DeepEP on that one recipe to
exercise the EP path:

  3p1d-dep8-dep16 prefill+decode:
    + ep-size: 8
    + moe-a2a-backend: "deepep"
    + deepep-mode: low_latency
    (kept enable-dp-attention + moe-dense-tp-size: 1 + tp=8 / dp=8)

Master matrix label updated to ep=8 to reflect the recipe.

Sibling 1p1d recipes on disk are unchanged (still TP=8 + DP-attn,
no DeepEP). They are still referenced by the commented-out master
entries — restore them by uncommenting.
@Oseltamivir Oseltamivir requested a review from Qiaolin-Yu as a code owner April 26, 2026 16:19
Oseltamivir and others added 5 commits April 26, 2026 09:22
…L hard ceiling

DeepEP run (3p1d-dep8-dep16) crashed at:

  File '.../sglang/srt/layers/moe/token_dispatcher/deepep.py', line 325
    assert self.num_max_dispatch_tokens_per_rank <= 1024
  AssertionError

_DeepEPDispatcherImplLowLatency enforces a hard upper bound of 1024 in
low_latency mode. We had bumped the env var to 2048 to give headroom
above the earlier C++ side cap (deep_ep.cpp:1233 'x.size(0) <=
num_max_dispatch_tokens_per_rank'), but 2048 trips this Python-side
assertion at scheduler init. 1024 is the maximum allowed value: high
enough to cover the cuda-graph-max-bs we use, low enough to satisfy
the LL dispatcher constructor.

Apply 2048 → 1024 across all 6 recipes (every prefill + decode env
block).
…k/1k sweep

DeepEP is broken on the lmsysorg/sglang:deepseek-v4-grace-blackwell
image — verified across three runs (deepep-mode auto/normal,
deepep-mode low_latency, and the latest 3p1d try). All hit the
fork-only mxfp4_deepseek.py:347 reading dispatch_output.topk_output,
which neither DeepEPLLDispatchOutput nor DeepEPNormalDispatchOutput
exposes in this fork. Cannot be fixed from the recipe — needs the
image rebuilt with mxfp4_deepseek patched, or an upstream sglang fix.

3p1d-dep8-dep16 recipe: drop ep-size, moe-a2a-backend, deepep-mode
from prefill+decode. Now matches the 1p1d siblings: TP=8 + DP=8 +
moe-dense-tp-size: 1, default 'none' a2a backend (forward_normal
path bypasses the buggy mxfp4_deepseek kernel).

nvidia-master.yaml:
  * Uncomment the 1k/1k mid-conc and 8k/1k blocks (low + mid + high).
  * 3p1d-dep8-dep16 matrix label ep: 8 → ep: 1 to match recipe.

Sweep now expands to 6 entries / 27 conc points (3 1k/1k + 3 8k/1k).
Oseltamivir and others added 8 commits April 26, 2026 11:55
DeepEP is dead in this image (mxfp4_deepseek.py:347 reads
dispatch_output.topk_output, neither DeepEPNormal nor DeepEPLL output
exposes that field). Smoke test the only other plausible EP backend
upstream sglang offers: flashinfer.

Per upstream docs/advanced_features/expert_parallelism.md, flashinfer
is the documented option for 'Large-scale EP deployments' and uses a
different dispatcher than DeepEP — its output class may or may not
trip the same mxfp4_deepseek bug. Per server_args.py _handle_a2a_moe,
flashinfer auto-sets SGLANG_MOE_NVFP4_DISPATCH=True and forces
ep_size = tp_size, so we set ep-size: 8 explicitly. Everything else
(TP=8 / DP=8 / moe-dense-tp-size: 1) stays so the FP8 block-quant
path remains valid.

Scope: 1k/1k 3p1d-dep8-dep16 only. If the EP path serves on this
image, port back to the 1p1d siblings; if it crashes the same way
DeepEP did, revert to the no-EP forward_normal path and accept the
TP-only pareto.

nvidia-master.yaml matrix labels for the 3p1d entry updated to ep=8
to match the recipe.
…d dead on this image

flashinfer EP smoke test (3p1d-dep8-dep16 1k/1k) crashed at startup:

  File '.../sglang/srt/server_args.py', line 2133, in _handle_a2a_moe
    assert self.moe_runner_backend in [...]
  AssertionError: Flashinfer MoE A2A is only supported with
                  flashinfer_cutlass moe runner backend

flashinfer_cutlass is FP8-only — won't load DSV4-Pro's MXFP4 weights.
The only path that satisfies the assertion would also fail at model
load. So flashinfer is unusable for DSV4 on any image that doesn't
ship a flashinfer_mxfp4_cutlass runner (which doesn't exist).

Combined with the earlier deepep failure (mxfp4_deepseek.py:347
AttributeError on dispatch_output.topk_output, both Normal and LL
dispatch types), every EP backend sglang exposes in this image is
dead. Remaining options (mooncake, nixl-ep, mori, ascend_fuseep) are
either Ascend-NPU-only or not wired into this image.

Revert 3p1d-dep8-dep16 recipe to no-EP TP-only (matches the 5 sibling
recipes) and master.yaml matrix labels (ep: 8 → ep: 1).

PR description's Known Issues section updated to a 4-row table
covering every EP backend tried, all accepted as dead ends.
sglang computes per-rank capacity as max_running_requests // dp_size.
With dp-size=8, a value of 4 floors to 0, hitting the
"max_running_request is zero" assertion in tp_worker.py:277.
Bump to 8 so each DP rank gets at least 1 slot — matches the
working 1p1d recipe.