Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Actions jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```bash
export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
export SGLANG_OPT_FIX_HASH_MEGA_MOE=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=288
PARALLEL_ARGS=(
  --dp-size "$TP"
  --enable-dp-attention
  --moe-a2a-backend deepep
  --cuda-graph-max-bs 288
  --deepep-config "$DEEPEP_CONFIG"
  --chunked-prefill-size 65536
  --tokenizer-worker-num 4
  --enable-prefill-delayer
)
MAX_RUNNING_REQUESTS=2560
MEM_FRACTION_STATIC=0.87
```
🟡 Two pre-existing comments immediately above the DP_ATTENTION block became inaccurate after this PR added the CONC=2048 branch. The block comment at lines 63-66 still describes the recipe as "flashinfer_mxfp4 runner + halved prefill chunks + prefill-delayer", but the new CONC=2048 path uses `--moe-a2a-backend deepep` and `--chunked-prefill-size 65536` (8x the non-DP value of 8192, not halved). Line 69 says the DP-attn branch "overrides to 0.94", but it now overrides to either 0.94 or 0.87 depending on CONC. It is worth refreshing the comments alongside this change so future maintainers don't trust stale assumptions.
**Extended reasoning**
**What the stale comments say**
Lines 63-66 contain the rationale comment for the DP_ATTENTION dispatch block:
> Pick the parallelism + MoE backend based on DP_ATTENTION (mirrors the vllm script's pattern). DP-attention runs the empirically-tuned high-concurrency recipe (flashinfer_mxfp4 runner + halved prefill chunks + prefill-delayer); single-instance uses flashinfer_mxfp4 with the cookbook defaults.
Line 69 contains:
> `# Default; the DP-attn branch below overrides to 0.94.`
Both were accurate before this PR: the DP-attn branch was a single recipe that always used `flashinfer_mxfp4`, set `--chunked-prefill-size 16384` (half the previous 32768 cookbook value, hence "halved"), and always set `MEM_FRACTION_STATIC=0.94`.
**Why this PR makes them inaccurate**

The new `if [ "$CONC" = "2048" ]; then ... else ...` split inside the DP-attn branch breaks both invariants:
- The CONC=2048 path uses `--moe-a2a-backend deepep` (not `flashinfer_mxfp4`), `SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1` (the mega_moe deepep recipe, described in the PR description and changelog as a different recipe family entirely), and `--chunked-prefill-size 65536`. The block comment now describes only half of the DP-attn cases.
- The wording "halved prefill chunks" is now actively misleading: 65536 is 8x the non-DP path's `--chunked-prefill-size 8192`, i.e. multiplied, not halved. A reader looking at line 65 next to lines 78-94 will see a direct contradiction.
- `MEM_FRACTION_STATIC` is now overridden to 0.94 (CONC<2048) or 0.87 (CONC=2048), so line 69's single-value claim is no longer correct.
**Step-by-step proof**

- Before this PR: `DP_ATTENTION=true` always meant `--moe-runner-backend flashinfer_mxfp4`, `--chunked-prefill-size 16384`, `MEM_FRACTION_STATIC=0.94`. The comments are correct.
- After this PR, with `DP_ATTENTION=true CONC=2048`: `--moe-a2a-backend deepep` (not flashinfer_mxfp4) ✗, `--chunked-prefill-size 65536` (not halved relative to the non-DP 8192; it is 8x) ✗, `MEM_FRACTION_STATIC=0.87` (not 0.94) ✗. All three claims fail.
- After this PR, with `DP_ATTENTION=true CONC=1024`: the comments still happen to be correct, but a maintainer reading them as describing "the DP-attn recipe" will be wrong about the other branch.
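For orientation, the dispatch after this PR has roughly this shape (a sketch reconstructed from the flags quoted above, not the actual script; branch bodies and variable names are elided):

```bash
if [ "$DP_ATTENTION" = "true" ]; then
  if [ "$CONC" = "2048" ]; then
    # mega_moe deepep recipe: --moe-a2a-backend deepep,
    # --chunked-prefill-size 65536
    MEM_FRACTION_STATIC=0.87
  else
    # flashinfer_mxfp4 recipe: --moe-runner-backend flashinfer_mxfp4,
    # --chunked-prefill-size 16384
    MEM_FRACTION_STATIC=0.94
  fi
fi
```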
**Severity / impact**
This is a documentation accuracy issue, not a behavioral bug; runtime behavior is unaffected. But the file's comments are explicitly there to give future maintainers the empirical rationale ("empirically-tuned", "cookbook defaults"), and silently letting them drift turns future debugging into a trap. The easiest fix is to update lines 63-66 to mention both recipes (flashinfer_mxfp4 + halved chunks for CONC<2048; mega_moe deepep + larger chunks for CONC=2048) and to reword line 69 to say the DP-attn branch overrides to 0.94 or 0.87 depending on CONC.
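A sketch of refreshed comments along those lines (wording illustrative, not an actual patch):

```bash
# Pick the parallelism + MoE backend based on DP_ATTENTION (mirrors the
# vllm script's pattern). DP-attention dispatches on CONC:
#   CONC < 2048: flashinfer_mxfp4 runner + halved prefill chunks + prefill-delayer
#   CONC = 2048: mega_moe deepep recipe (--moe-a2a-backend deepep,
#                --chunked-prefill-size 65536)
# Single-instance uses flashinfer_mxfp4 with the cookbook defaults.

# Default; the DP-attn branch below overrides to 0.94 (CONC < 2048)
# or 0.87 (CONC = 2048).
```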
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24961231373
… configs
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24962186268
@yhyang201 Hi, please hold off on sweeps until we get some CI unblocked.
…nc=2048
- YAML: conc=2048 and conc=4096 (both 1k1k and 8k1k) had tp=4, should be tp=8
- Script: conc=2048 was missing an explicit SWA_FULL_TOKENS_RATIO=0.1, causing 1k1k to incorrectly use 0.5 from the ISL-based default
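A minimal sketch of the script-side fix, assuming a dispatch on `$CONC` like the ones quoted elsewhere in this thread:

```bash
# Set the ratio explicitly for conc=2048 so the 1k1k shape does not
# fall through to the ISL-based default of 0.5.
if [ "$CONC" = "2048" ]; then
  SWA_FULL_TOKENS_RATIO=0.1
fi
```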
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24978717689
Disable NVSHMEM IB transport in the two code paths that explicitly use `--moe-a2a-backend deepep` (EP_SIZE=8 and CONC=2048/4096).
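A sketch of one way this could look in the script, assuming the mechanism is NVSHMEM's `NVSHMEM_REMOTE_TRANSPORT` environment variable (the commit may use a different knob; the guard mirrors the two code paths named above):

```bash
# Hypothetical sketch: route deepep's NVSHMEM traffic off the IB
# transport in the branches that pass --moe-a2a-backend deepep.
if [ "$EP_SIZE" = "8" ] || [ "$CONC" = "2048" ] || [ "$CONC" = "4096" ]; then
  export NVSHMEM_REMOTE_TRANSPORT=none  # disable the IB remote transport
fi
```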
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24991420778
Pin `dsv4-fp4-b300-sglang` to `lmsysorg/sglang:deepseek-v4-b300@sha256:2fec8d7958bb0d53b50d7bf04d6ae6a7de8a35503775826e0550a45dd8c3ee15`.
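Pinning by digest rather than by tag makes later sweeps reproducible even if the tag is re-pushed; the same reference can be pulled locally:

```bash
# The digest resolves to exact image content regardless of where the
# deepseek-v4-b300 tag later points.
docker pull lmsysorg/sglang:deepseek-v4-b300@sha256:2fec8d7958bb0d53b50d7bf04d6ae6a7de8a35503775826e0550a45dd8c3ee15
```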
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24993602429

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24994940494

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24997173342

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24997928458

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24998946908

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24999947919

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25000588798

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25002151282

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25002547656
Force-pushed from 85d3b27 to 7cc1c12.
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@Qiaolin-Yu Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25017583560
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25030503512
Both high-conc (CONC=2048/4096) and medium-conc recipes use ep=8 in the YAML, so EP_SIZE is always "8" for both. The previous if/elif order meant EP_SIZE=8 matched first, shadowing the CONC=2048/4096 branch entirely. Swap the order so the more specific high-conc check runs first.
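A minimal sketch of the reorder (branch bodies elided; variable names follow the script excerpts quoted in this thread):

```bash
# Before: the EP_SIZE=8 test came first, and since both recipes set
# ep=8 in the YAML, the high-conc branch was unreachable.
# After: test the more specific high-concurrency condition first.
if [ "$CONC" = "2048" ] || [ "$CONC" = "4096" ]; then
  : # high-concurrency deepep recipe
elif [ "$EP_SIZE" = "8" ]; then
  : # medium-concurrency recipe
fi
```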
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25031820374
- max-running-requests: 4608 → 4352
- swa-full-tokens-ratio: 0.06 → 0.075
- MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: 544 → 8320
- add --decode-log-interval 5
- move SGLANG_LOG_FORWARD_ITERS to conc-2048 only
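Sketched against the script style used earlier in this thread (`SERVER_ARGS` is a hypothetical array name, and the long env var spelling is assumed to match the `SGLANG_OPT_DEEPGEMM_...` variable from the excerpt at the top):

```bash
MAX_RUNNING_REQUESTS=4352                                         # was 4608
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=8320  # was 544
SERVER_ARGS+=(
  --swa-full-tokens-ratio 0.075  # was 0.06
  --decode-log-interval 5        # newly added
)
# SGLANG_LOG_FORWARD_ITERS is now exported only in the conc-2048 branch.
```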
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25032591138