Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Actions jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```bash
export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
export SGLANG_OPT_FIX_HASH_MEGA_MOE=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=288
PARALLEL_ARGS=(
  --dp-size "$TP"
  --enable-dp-attention
  --moe-a2a-backend deepep
  --cuda-graph-max-bs 288
  --deepep-config "$DEEPEP_CONFIG"
  --chunked-prefill-size 65536
  --tokenizer-worker-num 4
  --enable-prefill-delayer
)
MAX_RUNNING_REQUESTS=2560
MEM_FRACTION_STATIC=0.87
```
🟡 Two pre-existing comments immediately above the DP_ATTENTION block became inaccurate after this PR added the CONC=2048 branch. The block comment at lines 63-66 still describes the recipe as "flashinfer_mxfp4 runner + halved prefill chunks + prefill-delayer", but the new CONC=2048 path uses `--moe-a2a-backend deepep` and `--chunked-prefill-size 65536` (8x the non-DP value of 8192, not halved). Line 69 says the DP-attn branch "overrides to 0.94", but it now overrides to either 0.94 or 0.87 depending on CONC. It is worth refreshing the comments alongside this change so future maintainers don't trust stale assumptions.
**Extended reasoning**
**What the stale comments say**
Lines 63-66 contain the rationale comment for the DP_ATTENTION dispatch block:
> Pick the parallelism + MoE backend based on DP_ATTENTION (mirrors the vllm script's pattern). DP-attention runs the empirically-tuned high-concurrency recipe (flashinfer_mxfp4 runner + halved prefill chunks + prefill-delayer); single-instance uses flashinfer_mxfp4 with the cookbook defaults.
Line 69 contains:
> `# Default; the DP-attn branch below overrides to 0.94.`
Both were accurate before this PR: the DP-attn branch was a single recipe that always used `flashinfer_mxfp4`, set `--chunked-prefill-size 16384` (half the previous 32768 cookbook value, hence "halved"), and always set `MEM_FRACTION_STATIC=0.94`.
**Why this PR makes them inaccurate**

The new `if [ "$CONC" = "2048" ]; then ... else ...` split inside the DP-attn branch breaks both invariants:
- The CONC=2048 path uses `--moe-a2a-backend deepep` (not `flashinfer_mxfp4`), `SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1` (the mega_moe deepep recipe, described in the PR description and changelog as a different recipe family entirely), and `--chunked-prefill-size 65536`. The block comment now describes only half of the DP-attn cases.
- The wording "halved prefill chunks" is now actively misleading: 65536 is 8x the non-DP path's `--chunked-prefill-size 8192`, i.e. multiplied, not halved. A reader looking at line 65 next to lines 78-94 will see a direct contradiction.
- `MEM_FRACTION_STATIC` is now overridden to 0.94 (CONC<2048) or 0.87 (CONC=2048), so line 69's single-value claim is no longer correct.
**Step-by-step proof**

- Before this PR: `DP_ATTENTION=true` always meant `--moe-runner-backend flashinfer_mxfp4`, `--chunked-prefill-size 16384`, `MEM_FRACTION_STATIC=0.94`. The comments are correct.
- After this PR, with `DP_ATTENTION=true CONC=2048`: `--moe-a2a-backend deepep` (not flashinfer_mxfp4) ✗, `--chunked-prefill-size 65536` (not halved relative to the non-DP 8192; it is 8x) ✗, `MEM_FRACTION_STATIC=0.87` (not 0.94) ✗. All three claims fail.
- After this PR, with `DP_ATTENTION=true CONC=1024`: the comments still happen to be correct, but a maintainer reading them as describing "the DP-attn recipe" will be wrong about the other branch.
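For orientation, the dispatch after this PR has roughly this shape (a sketch reconstructed from the flags quoted above, not the actual script; branch bodies and variable names are elided):

```bash
if [ "$DP_ATTENTION" = "true" ]; then
  if [ "$CONC" = "2048" ]; then
    # mega_moe deepep recipe: --moe-a2a-backend deepep,
    # --chunked-prefill-size 65536
    MEM_FRACTION_STATIC=0.87
  else
    # flashinfer_mxfp4 recipe: --moe-runner-backend flashinfer_mxfp4,
    # --chunked-prefill-size 16384
    MEM_FRACTION_STATIC=0.94
  fi
fi
```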
**Severity / impact**
This is a documentation accuracy issue, not a behavioral bug; runtime behavior is unaffected. But the file's comments are explicitly there to give future maintainers the empirical rationale ("empirically-tuned", "cookbook defaults"), and silently letting them drift turns future debugging into a trap. The easiest fix is to update lines 63-66 to mention both recipes (flashinfer_mxfp4 + halved chunks for CONC<2048; mega_moe deepep + larger chunks for CONC=2048) and to reword line 69 to say the DP-attn branch overrides to 0.94 or 0.87 depending on CONC.
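A sketch of refreshed comments along those lines (wording illustrative, not an actual patch):

```bash
# Pick the parallelism + MoE backend based on DP_ATTENTION (mirrors the
# vllm script's pattern). DP-attention dispatches on CONC:
#   CONC < 2048: flashinfer_mxfp4 runner + halved prefill chunks + prefill-delayer
#   CONC = 2048: mega_moe deepep recipe (--moe-a2a-backend deepep,
#                --chunked-prefill-size 65536)
# Single-instance uses flashinfer_mxfp4 with the cookbook defaults.

# Default; the DP-attn branch below overrides to 0.94 (CONC < 2048)
# or 0.87 (CONC = 2048).
```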
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24961231373
… configs
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24962186268
@yhyang201 Hi, please hold off on sweeps until we get some CI unblocked.
…nc=2048
- YAML: conc=2048 and conc=4096 (both 1k1k and 8k1k) had tp=4, should be tp=8
- Script: conc=2048 was missing an explicit SWA_FULL_TOKENS_RATIO=0.1, causing 1k1k to incorrectly use 0.5 from the ISL-based default
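A minimal sketch of the script-side fix, assuming a dispatch on `$CONC` like the ones quoted elsewhere in this thread:

```bash
# Set the ratio explicitly for conc=2048 so the 1k1k shape does not
# fall through to the ISL-based default of 0.5.
if [ "$CONC" = "2048" ]; then
  SWA_FULL_TOKENS_RATIO=0.1
fi
```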
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24978717689
Disable NVSHMEM IB transport in the two code paths that explicitly use `--moe-a2a-backend deepep` (EP_SIZE=8 and CONC=2048/4096).
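A sketch of one way this could look in the script, assuming the mechanism is NVSHMEM's `NVSHMEM_REMOTE_TRANSPORT` environment variable (the commit may use a different knob; the guard mirrors the two code paths named above):

```bash
# Hypothetical sketch: route deepep's NVSHMEM traffic off the IB
# transport in the branches that pass --moe-a2a-backend deepep.
if [ "$EP_SIZE" = "8" ] || [ "$CONC" = "2048" ] || [ "$CONC" = "4096" ]; then
  export NVSHMEM_REMOTE_TRANSPORT=none  # disable the IB remote transport
fi
```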
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24991420778
Pin `dsv4-fp4-b300-sglang` to `lmsysorg/sglang:deepseek-v4-b300@sha256:2fec8d7958bb0d53b50d7bf04d6ae6a7de8a35503775826e0550a45dd8c3ee15`.
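Pinning by digest rather than by tag makes later sweeps reproducible even if the tag is re-pushed; the same reference can be pulled locally:

```bash
# The digest resolves to exact image content regardless of where the
# deepseek-v4-b300 tag later points.
docker pull lmsysorg/sglang:deepseek-v4-b300@sha256:2fec8d7958bb0d53b50d7bf04d6ae6a7de8a35503775826e0550a45dd8c3ee15
```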
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24993602429

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24994940494

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24997173342

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24997928458

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24998946908

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24999947919

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25000588798

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25002151282

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25002547656
Force-pushed from 85d3b27 to 7cc1c12.
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@Qiaolin-Yu Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25017583560
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25030503512
Both high-conc (CONC=2048/4096) and medium-conc recipes use ep=8 in the YAML, so EP_SIZE is always "8" for both. The previous if/elif order meant EP_SIZE=8 matched first, shadowing the CONC=2048/4096 branch entirely. Swap the order so the more specific high-conc check runs first.
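A minimal sketch of the reorder (branch bodies elided; variable names follow the script excerpts quoted in this thread):

```bash
# Before: the EP_SIZE=8 test came first, and since both recipes set
# ep=8 in the YAML, the high-conc branch was unreachable.
# After: test the more specific high-concurrency condition first.
if [ "$CONC" = "2048" ] || [ "$CONC" = "4096" ]; then
  : # high-concurrency deepep recipe
elif [ "$EP_SIZE" = "8" ]; then
  : # medium-concurrency recipe
fi
```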
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25031820374
- max-running-requests: 4608 → 4352
- swa-full-tokens-ratio: 0.06 → 0.075
- MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: 544 → 8320
- add --decode-log-interval 5
- move SGLANG_LOG_FORWARD_ITERS to conc-2048 only
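Sketched against the script style used earlier in this thread (`SERVER_ARGS` is a hypothetical array name, and the long env var spelling is assumed to match the `SGLANG_OPT_DEEPGEMM_...` variable from the excerpt at the top):

```bash
MAX_RUNNING_REQUESTS=4352                                         # was 4608
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=8320  # was 544
SERVER_ARGS+=(
  --swa-full-tokens-ratio 0.075  # was 0.06
  --decode-log-interval 5        # newly added
)
# SGLANG_LOG_FORWARD_ITERS is now exported only in the conc-2048 branch.
```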
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25032591138