
Expand SGLang downstream coverage with MI35X model E2E tests#2884

Open
bingxche wants to merge 3 commits into main from bingxche/add-sglang-test

Conversation

@bingxche
Contributor

Summary

  • replace the current SGLang downstream 1-GPU MI300X smoke coverage with MI35X 8-GPU model end-to-end coverage
  • reuse a single downstream setup flow, then run the MI35X DeepSeek-R1-MXFP4, Qwen3-235B-MXFP4, and DeepSeek-V3.2 test steps in sequence
  • align the downstream runner, GPU settings, dependency install, and per-step timeouts with the MI35X nightly model test paths

Test plan

  • Run Sglang Model E2E Test (8 GPU) in GitHub Actions
  • Verify DeepSeek-R1-MXFP4 accuracy and perf steps complete on linux-aiter-mi35x-8
  • Verify Qwen3-235B-MXFP4 combined suite completes on linux-aiter-mi35x-8
  • Verify DeepSeek-V3.2 accuracy and perf steps complete on linux-aiter-mi35x-8

Made with Cursor

@bingxche bingxche requested a review from a team April 23, 2026 15:45
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2884 --add-label <label>`

@bingxche bingxche marked this pull request as draft April 23, 2026 15:47
Drop the downstream-specific image override so the MI35X model E2E job follows SGLang's own container selection logic.

Made-with: Cursor
@gyohuangxin gyohuangxin self-assigned this Apr 24, 2026
@gyohuangxin
Member

Related to #2751

@gyohuangxin gyohuangxin marked this pull request as ready for review April 24, 2026 05:10
Copilot AI review requested due to automatic review settings April 24, 2026 05:10
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the SGLang downstream GitHub Actions workflow to shift from MI300X 1-GPU smoke coverage to MI35X 8-GPU end-to-end model test coverage, running multiple MI35X model suites sequentially after a single downstream setup.

Changes:

  • Switch downstream runner target from MI300X (1 GPU) to MI35X (8 GPU) and update related GPU/hostname settings.
  • Remove dynamic SGLang base-image resolution and run MI35X nightly-style accuracy/performance suites in sequence (with a final aggregation step to fail the job if any suite fails).
  • Add extra Python dependencies needed by the MI35X E2E suites.
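The "final aggregation step" mentioned above can be sketched as a plain shell check (the `check_outcomes` helper and the hard-coded outcome strings are illustrative only; in the real workflow the values would come from `${{ steps.<id>.outcome }}`):

```shell
#!/bin/sh
# Sketch of an aggregation step: each E2E suite runs with
# continue-on-error, so the job is only failed here, after every
# suite has had a chance to run and report.
check_outcomes() {
  failed=0
  for outcome in "$@"; do
    if [ "$outcome" != "success" ]; then
      failed=1
    fi
  done
  return "$failed"
}

# Example: one failed suite fails the aggregate check.
if check_outcomes "success" "failure" "success"; then
  echo "all suites passed"
else
  echo "at least one suite failed"   # this branch runs
fi
```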


Comment on lines +55 to +57
```yaml
SGL_BRANCH: main
GPU_ARCH: gfx950
SGLANG_CI_HOSTNAME_OVERRIDE: linux-mi35x-gpu-8
```

Copilot AI Apr 24, 2026


SGL_BRANCH is set to main, and the container start no longer pins a specific SGLang base image. This makes the downstream CI non-reproducible and can introduce unrelated breakages when SGLang (or its base image) changes. Consider pinning to a specific tag/commit (and/or explicitly selecting an image) or documenting why tracking main is required for these E2E suites.
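One way to act on this suggestion would be to pin the env block to a fixed ref (a sketch only; `v0.4.6` is a hypothetical tag, not one verified to exist):

```yaml
env:
  # Pin SGLang to a released tag or commit SHA rather than tracking
  # main, so downstream CI runs are reproducible.
  SGL_BRANCH: v0.4.6   # hypothetical example tag; a commit SHA also works
  GPU_ARCH: gfx950
  SGLANG_CI_HOSTNAME_OVERRIDE: linux-mi35x-gpu-8
```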

Comment on lines +186 to +213
```yaml
- name: Accuracy Test MI35x (8-GPU Qwen 3.5)
  id: qwen35_accuracy
  timeout-minutes: 70
  continue-on-error: true
  run: |
    set -ex
    cd "${SGLANG_WORKSPACE}"
    > github_summary.md
    bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" \
      -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
      python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-qwen35 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$?
    echo "$(<github_summary.md )" >> "$GITHUB_STEP_SUMMARY" || true
    exit ${TEST_EXIT_CODE:-0}

- name: Performance Test MI35x (8-GPU Qwen 3.5 FP8)
  id: qwen35_perf
  timeout-minutes: 100
  continue-on-error: true
  run: |
    set -ex
    cd "${SGLANG_WORKSPACE}"
    > github_summary.md
    bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" \
      -e SGLANG_USE_AITER=1 \
      -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
      python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-qwen35-fp8 --nightly --timeout-per-file 5400 || TEST_EXIT_CODE=$?
    echo "$(<github_summary.md )" >> "$GITHUB_STEP_SUMMARY" || true
    exit ${TEST_EXIT_CODE:-0}
```
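The `|| TEST_EXIT_CODE=$?` / `exit ${TEST_EXIT_CODE:-0}` idiom in these steps is worth noting: it keeps `set -e` from aborting the step before the summary is copied into `$GITHUB_STEP_SUMMARY`, then re-raises the original failure. A standalone sketch of the pattern (the `run_step` helper is hypothetical; `sh -c "exit N"` stands in for the test command):

```shell
#!/bin/sh
set -e

run_step() {
  unset TEST_EXIT_CODE
  # Capture the test command's exit code instead of letting
  # `set -e` kill the step immediately on failure.
  sh -c "exit $1" || TEST_EXIT_CODE=$?
  echo "post-processing still runs (e.g. appending the summary)"
  # Re-raise the captured failure, defaulting to success.
  return ${TEST_EXIT_CODE:-0}
}
```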

Copilot AI Apr 24, 2026


PR description lists MI35X DeepSeek-R1-MXFP4, Qwen3-235B-MXFP4, and DeepSeek-V3.2 E2E coverage, but this workflow also adds a separate Qwen 3.5 accuracy + perf sequence. Either update the PR description/test plan to include Qwen 3.5, or remove these extra steps if they’re not intended as part of this PR’s scope.

