Expand SGLang downstream coverage with MI35X model E2E tests #2884
Drop the downstream-specific image override so the MI35X model E2E job follows SGLang's own container selection logic. Made-with: Cursor
Related to #2751
Pull request overview
This PR updates the SGLang downstream GitHub Actions workflow to shift from MI300X 1-GPU smoke coverage to MI35X 8-GPU end-to-end model test coverage, running multiple MI35X model suites sequentially after a single downstream setup.
Changes:
- Switch downstream runner target from MI300X (1 GPU) to MI35X (8 GPU) and update related GPU/hostname settings.
- Remove dynamic SGLang base-image resolution and run MI35X nightly-style accuracy/performance suites in sequence (with a final aggregation step to fail the job if any suite fails).
- Add extra Python dependencies needed by the MI35X E2E suites.
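The final aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual step: `aggregate_outcomes` is a hypothetical helper, and the `<step-id>=<outcome>` argument format is an assumption made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return non-zero if any recorded step outcome is not "success".
# Each argument has the form <step-id>=<outcome>, e.g. qwen35_accuracy=failure.
aggregate_outcomes() {
  local failed=0 pair id outcome
  for pair in "$@"; do
    id="${pair%%=*}"       # text before the first "="
    outcome="${pair#*=}"   # text after the first "="
    if [ "$outcome" != "success" ]; then
      echo "suite step '${id}' finished with outcome: ${outcome}" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

In a workflow, the arguments would typically be filled from `steps.<id>.outcome`, which holds a step's result before `continue-on-error` is applied (`steps.<id>.conclusion` reports `success` for such steps). Note that this sketch also treats `skipped` and `cancelled` as failures.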
    SGL_BRANCH: main
    GPU_ARCH: gfx950
    SGLANG_CI_HOSTNAME_OVERRIDE: linux-mi35x-gpu-8
SGL_BRANCH is set to main, and the container start no longer pins a specific SGLang base image. This makes the downstream CI non-reproducible and can introduce unrelated breakages when SGLang (or its base image) changes. Consider pinning to a specific tag/commit (and/or explicitly selecting an image) or documenting why tracking main is required for these E2E suites.
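One way to make this concern visible in CI logs is a small guard that flags moving refs. The sketch below is only an illustration: `is_pinned_ref` and `warn_if_unpinned` are hypothetical names, and it assumes SHA-1 commit IDs (40 lowercase hex characters). It deliberately treats tags as unpinned, which is stricter than the suggestion above.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only when the ref is a full 40-character commit SHA.
is_pinned_ref() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{40}$'
}

# Hypothetical helper: print a warning for moving refs such as branch names.
# It never fails the job; it only records that the run is not reproducible.
warn_if_unpinned() {
  if ! is_pinned_ref "$1"; then
    echo "warning: '$1' is a moving ref; results may change between runs" >&2
  fi
}
```

Calling `warn_if_unpinned "$SGL_BRANCH"` early in the job would leave tracking `main` possible while documenting it in every run.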
    - name: Accuracy Test MI35x (8-GPU Qwen 3.5)
      id: qwen35_accuracy
      timeout-minutes: 70
      continue-on-error: true
      run: |
        set -ex
        cd "${SGLANG_WORKSPACE}"
        > github_summary.md
        bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" \
          -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
          python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-mi35x-qwen35 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$?
        echo "$(<github_summary.md )" >> "$GITHUB_STEP_SUMMARY" || true
        exit ${TEST_EXIT_CODE:-0}

    - name: Performance Test MI35x (8-GPU Qwen 3.5 FP8)
      id: qwen35_perf
      timeout-minutes: 100
      continue-on-error: true
      run: |
        set -ex
        cd "${SGLANG_WORKSPACE}"
        > github_summary.md
        bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" \
          -e SGLANG_USE_AITER=1 \
          -e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
          python3 run_suite.py --hw amd --suite nightly-perf-8-gpu-mi35x-qwen35-fp8 --nightly --timeout-per-file 5400 || TEST_EXIT_CODE=$?
        echo "$(<github_summary.md )" >> "$GITHUB_STEP_SUMMARY" || true
        exit ${TEST_EXIT_CODE:-0}
PR description lists MI35X DeepSeek-R1-MXFP4, Qwen3-235B-MXFP4, and DeepSeek-V3.2 E2E coverage, but this workflow also adds a separate Qwen 3.5 accuracy + perf sequence. Either update the PR description/test plan to include Qwen 3.5, or remove these extra steps if they’re not intended as part of this PR’s scope.
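Both Qwen 3.5 steps quoted above use the `|| TEST_EXIT_CODE=$?` idiom so that the summary is still published when a suite fails under `set -e`. The following is a minimal sketch of that pattern in isolation; `run_suite_stub` is a hypothetical stand-in for the real suite invocation, and the variables `summary_published` and `final_code` exist only for this example.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical stand-in for the real suite invocation; it always fails with status 3.
run_suite_stub() { return 3; }

# Start from a clean state so an inherited value cannot leak into the result.
unset TEST_EXIT_CODE

# A command on the left side of `||` does not trigger `set -e`,
# so the failure status is recorded instead of aborting the script.
run_suite_stub || TEST_EXIT_CODE=$?

# Later commands still run; this is where the summary would be published.
summary_published=yes

# Default to 0 when the variable was never set, i.e. when the suite passed.
final_code="${TEST_EXIT_CODE:-0}"
echo "summary_published=${summary_published} final_code=${final_code}"
```

In the real steps, the last line is `exit ${TEST_EXIT_CODE:-0}`, which propagates the suite's status only after the summary has been appended.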
Summary
Test plan
- Sglang Model E2E Test (8 GPU) in GitHub Actions
- linux-aiter-mi35x-8
- linux-aiter-mi35x-8
- linux-aiter-mi35x-8

Made with Cursor