
[AMD/Hyperloom] Optimize dsr1, gptoss, kimik2.5, qwen3.5 (mi355x) #7

Open
lishuoshuo-amd wants to merge 20 commits into main from hyperloom/ci-20260420-1024-replay

Conversation

@lishuoshuo-amd
Owner

Description

Automated performance optimization update from Hyperloom CI.

dsr1-fp8-mi355x-sglang

Metric Value
Baseline (tok/s/GPU) 311.65
Optimized (tok/s/GPU) 331.22
Optimization Gain +6.3%

Server flag changes:

  • --num-continuous-decode-steps: 4 -> 8

gptoss-fp4-mi355x-vllm

Metric Value
Baseline (tok/s/GPU) 7389.00
Optimized (tok/s/GPU) 7762.36
Optimization Gain +5.0%

Server flag changes:

  • Add --max-num-seqs 512

kimik2.5-int4-mi355x-vllm

Metric Value
Baseline (tok/s/GPU) 184.94
Optimized (tok/s/GPU) 198.81
Optimization Gain +7.5%

Server flag changes:

  • --max-num-seqs: 256 -> 512

qwen3.5-bf16-mi355x-sglang

Metric Value
Baseline (tok/s/GPU) 260.55
Optimized (tok/s/GPU) 272.63
Optimization Gain +4.6%

Server flag changes:

  • Add --enable-mixed-chunk
  • Add --num-continuous-decode-steps 8

Related Issue

Automated by Hyperloom CI

Type of Change

  • Configuration change

Checklist

  • I have tested my changes locally
  • I have updated documentation if necessary
  • If I changed a container image or config, I have already updated perf-changelog.yaml

- dsr1-fp8-mi355x-sglang: --num-continuous-decode-steps: 4 -> 8
- gptoss-fp4-mi355x-vllm: Add --max-num-seqs 512
- kimik2.5-int4-mi355x-vllm: --max-num-seqs: 256 -> 512
- qwen3.5-bf16-mi355x-sglang: Add --enable-mixed-chunk; Add --num-continuous-decode-steps 8
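
The flag changes listed above are either value updates (`4 -> 8`) or additions. A minimal sketch of applying such a change to a server argument list follows. The helper `set_flag` is hypothetical and is not part of Hyperloom, vLLM, or SGLang. It assumes flags are passed as separate `--flag value` tokens and does not handle the `--flag=value` form.

```python
from typing import List, Optional


def set_flag(args: List[str], flag: str, value: Optional[str] = None) -> List[str]:
    """Return a copy of args with `flag` set to `value`.

    Hypothetical helper for illustration only. With value=None the flag is
    treated as a boolean switch and is added if missing.
    """
    out = list(args)
    if flag in out:
        i = out.index(flag)
        if value is not None:
            # Assumes an existing value sits right after the flag.
            if i + 1 < len(out) and not out[i + 1].startswith("--"):
                out[i + 1] = value
            else:
                out.insert(i + 1, value)
        return out
    # Flag not present yet: append it (and its value, if any).
    out.append(flag)
    if value is not None:
        out.append(value)
    return out


# dsr1: update an existing value; qwen3.5: add a switch and a valued flag.
dsr1_args = set_flag(["--num-continuous-decode-steps", "4"],
                     "--num-continuous-decode-steps", "8")
qwen_args = set_flag(set_flag([], "--enable-mixed-chunk"),
                     "--num-continuous-decode-steps", "8")
```

Returning a copy rather than mutating the input keeps the baseline argument list available for the base-versus-head comparison.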
@lishuoshuo-amd lishuoshuo-amd added the verify-enabled Validate PR label Apr 20, 2026
@github-actions

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@lishuoshuo-amd lishuoshuo-amd added verify-enabled Validate PR and removed verify-enabled Validate PR labels Apr 20, 2026
@lishuoshuo-amd lishuoshuo-amd changed the title [AMD/Hyperloom] Optimize 4 AMD models [AMD/Hyperloom] Optimize dsr1, gptoss, kimik2.5, qwen3.5 (mi355x) Apr 20, 2026
@lishuoshuo-amd lishuoshuo-amd force-pushed the hyperloom/ci-20260420-1024-replay branch from c0f3247 to 08ab55b Compare April 21, 2026 01:16
…eanup

- paths: trigger on infra file changes (scripts, configs, workflow)
- detect: run all whitelisted scripts when infra files change
- runner: validate summary quality, fail CI if no results produced
- entrypoint: fail-fast on git clone/SHA errors
- cleanup: SIGKILL + sglang.srt/ray workers + longer sleep

Made-with: Cursor
@lishuoshuo-amd
Owner Author

Verify PR (Hyperloom) — Unofficial smoke test

Only the upstream InferenceX repo contains the official benchmark of record.
This fork's verify run reproduces the full (tp, conc, isl, osl) search space
from amd-master.yaml for each changed script, running both the PR base and the
PR head; the results below are a smoke-test indicator. ±2% is treated as noise
(verdict OK).
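
A minimal sketch of how the gain and verdict columns could be derived. The rule is inferred from the tables rather than taken from the verify scripts: a gain below -2% is flagged as a regression, anything else (including large positive gains) is OK, and a missing measurement is FAIL. The names `gain_pct`, `verdict`, and `NOISE_BAND_PCT` are hypothetical.

```python
from typing import Optional

NOISE_BAND_PCT = 2.0  # the comment treats ±2% as noise


def gain_pct(baseline: float, optimized: float) -> float:
    """Relative throughput change of the PR head over the PR base, in percent."""
    return (optimized - baseline) / baseline * 100.0


def verdict(baseline: Optional[float], optimized: Optional[float]) -> str:
    """Classify one (tp, conc, isl, osl) point. Assumed rule, see lead-in."""
    if baseline is None or optimized is None:
        return "FAIL"  # one side produced no result
    if gain_pct(baseline, optimized) < -NOISE_BAND_PCT:
        return "WARN: regression"
    return "OK"


# First dsr1 row: 399.90 -> 417.60 tok/s
first_row_gain = round(gain_pct(399.90, 417.60), 2)
```

Under this assumed rule, the qwen3.5 8k1k row at conc 4 (344.03 to 337.05 tok/s, about -2.03%) falls just outside the noise band, which matches its WARN verdict.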

dsr1_fp8_mi355x.sh

isl/osl tp conc baseline (tok/s) optimized (tok/s) gain verdict
1k1k 8 4 399.90 417.60 +4.43% OK
1k1k 8 8 729.10 750.26 +2.90% OK
1k1k 8 16 1140.92 1173.48 +2.85% OK
1k1k 8 32 1683.81 1739.49 +3.31% OK
1k1k 8 64 2614.64 2654.50 +1.52% OK
8k1k 4 32 770.98 821.53 +6.56% OK
8k1k 4 64 991.61 1031.77 +4.05% OK
8k1k 8 4 310.30 366.00 +17.95% OK
8k1k 8 8 625.61 636.92 +1.81% OK
8k1k 8 16 902.98 925.49 +2.49% OK
8k1k 8 32 1213.94 1279.19 +5.38% OK
8k1k 8 64 1664.40 1709.37 +2.70% OK

glm5_fp8_mi355x.sh

isl/osl tp conc baseline (tok/s) optimized (tok/s) gain verdict
1k1k 8 4 176.32 176.24 -0.05% OK
1k1k 8 8 324.05 323.87 -0.06% OK
1k1k 8 16 582.42 582.80 +0.07% OK
1k1k 8 32 982.46 988.96 +0.66% OK
1k1k 8 64 1547.23 1553.80 +0.42% OK
8k1k 8 4 152.00 153.37 +0.90% OK
8k1k 8 8 259.40 261.49 +0.81% OK
8k1k 8 16 406.97 411.06 +1.01% OK
8k1k 8 32 592.49 601.82 +1.57% OK
8k1k 8 64 771.44 780.61 +1.19% OK

gptoss_fp4_mi355x.sh

isl/osl tp conc baseline (tok/s) optimized (tok/s) gain verdict
1k1k 1 4 835.68 847.40 +1.40% OK
1k1k 1 8 1438.74 1416.76 -1.53% OK
1k1k 1 16 2258.12 2239.38 -0.83% OK
1k1k 1 32 3514.46 3560.56 +1.31% OK
1k1k 1 64 5285.79 5421.64 +2.57% OK
1k1k 1 128 7607.70 7719.08 +1.46% OK
1k1k 4 4 1124.75 1142.61 +1.59% OK
1k1k 4 8 1818.63 1822.67 +0.22% OK
1k1k 8 4 1119.82 1127.07 +0.65% OK
1k1k 8 8 2129.35 2157.85 +1.34% OK
1k1k 8 16 3854.10 4006.08 +3.94% OK
8k1k 1 4 714.90 735.80 +2.92% OK
8k1k 1 8 1124.70 1173.86 +4.37% OK
8k1k 1 16 1609.91 1617.08 +0.45% OK
8k1k 1 32 2240.54 2242.82 +0.10% OK
8k1k 1 64 2822.34 2825.84 +0.12% OK
8k1k 1 128 3353.39 3351.61 -0.05% OK
8k1k 4 4 1018.39 1050.47 +3.15% OK
8k1k 8 4 1024.98 1067.11 +4.11% OK
8k1k 8 8 1914.10 1950.51 +1.90% OK

kimik2.5_int4_mi355x.sh

isl/osl tp conc baseline (tok/s) optimized (tok/s) gain verdict
1k1k 8 4 229.76 229.94 +0.08% OK
1k1k 8 8 396.17 398.16 +0.50% OK
1k1k 8 16 672.49 671.10 -0.21% OK
1k1k 8 32 1019.56 1013.30 -0.61% OK
1k1k 8 64 1587.47 1590.80 +0.21% OK
8k1k 8 4 212.63 212.30 -0.15% OK
8k1k 8 8 356.20 355.70 -0.14% OK
8k1k 8 16 561.59 560.71 -0.16% OK
8k1k 8 32 800.30 800.63 +0.04% OK
8k1k 8 64 1100.29 1098.94 -0.12% OK

minimaxm2.5_fp8_mi355x.sh

isl/osl tp conc baseline (tok/s) optimized (tok/s) gain verdict
1k1k 2 2 172.65 172.07 -0.33% OK
1k1k 2 4 318.39 316.01 -0.75% OK
1k1k 2 8 543.05 542.29 -0.14% OK
1k1k 2 16 915.67 917.36 +0.19% OK
1k1k 2 32 1521.84 1513.49 -0.55% OK
1k1k 2 64 2271.16 2265.32 -0.26% OK
1k1k 2 128 3740.62 3750.39 +0.26% OK
1k1k 2 256 5645.52 5633.46 -0.21% OK
1k1k 2 512 8112.39 8106.57 -0.07% OK
1k1k 4 4 351.99 355.63 +1.03% OK
1k1k 4 8 663.45 661.60 -0.28% OK
1k1k 4 16 1121.67 1122.70 +0.09% OK
1k1k 4 32 1980.79 1968.42 -0.62% OK
1k1k 4 64 3243.74 3274.35 +0.94% OK
1k1k 4 128 5378.00 5389.17 +0.21% OK
1k1k 4 256 7891.14 7880.06 -0.14% OK
1k1k 8 2 190.94 193.15 +1.16% OK
8k1k 2 2 166.82 164.70 -1.27% OK
8k1k 2 4 288.88 292.45 +1.24% OK
8k1k 2 8 479.08 476.09 -0.62% OK
8k1k 2 16 730.34 738.51 +1.12% OK
8k1k 2 32 1095.36 1091.24 -0.38% OK
8k1k 2 64 1442.24 1443.35 +0.08% OK
8k1k 2 128 1913.58 1906.63 -0.36% OK
8k1k 2 256 2294.86 2295.02 +0.01% OK
8k1k 4 4 331.52 336.44 +1.49% OK
8k1k 4 8 594.88 600.51 +0.95% OK
8k1k 4 16 956.31 964.51 +0.86% OK
8k1k 4 32 1538.30 1545.08 +0.44% OK
8k1k 4 64 2286.73 2288.51 +0.08% OK
8k1k 4 128 3081.30 3098.23 +0.55% OK
8k1k 4 256 3730.41 FAIL n/a FAIL

qwen3.5_bf16_mi355x.sh

isl/osl tp conc baseline (tok/s) optimized (tok/s) gain verdict
1k1k 8 4 375.14 377.09 +0.52% OK
1k1k 8 8 698.01 702.18 +0.60% OK
1k1k 8 16 1219.10 1225.60 +0.53% OK
1k1k 8 32 1958.72 1959.46 +0.04% OK
1k1k 8 64 3048.54 3049.67 +0.04% OK
1k1k 8 128 4667.29 4577.29 -1.93% OK
1k1k 8 256 7150.10 6426.82 -10.12% WARN: regression
8k1k 8 4 344.03 337.05 -2.03% WARN: regression
8k1k 8 8 624.26 591.99 -5.17% WARN: regression
8k1k 8 16 1027.84 862.79 -16.06% WARN: regression
8k1k 8 32 1515.61 1013.75 -33.11% WARN: regression
8k1k 8 64 2086.05 910.82 -56.34% WARN: regression
8k1k 8 128 2672.82 626.84 -76.55% WARN: regression
8k1k 8 256 3297.20 370.43 -88.77% WARN: regression
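
The result rows above are whitespace-separated, so they can be loaded for further analysis with a few lines of Python. This is a sketch under assumptions: the `Row` type and `parse_row` function are hypothetical, the column order is taken from the table header, and a non-numeric throughput cell such as `FAIL` is mapped to `None`.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    shape: str                  # isl/osl, e.g. "8k1k"
    tp: int
    conc: int
    baseline: Optional[float]   # tok/s, None if the run failed
    optimized: Optional[float]  # tok/s, None if the run failed
    verdict: str


def _num(token: str) -> Optional[float]:
    """Parse a throughput cell; non-numeric cells (e.g. 'FAIL') become None."""
    try:
        return float(token)
    except ValueError:
        return None


def parse_row(line: str) -> Row:
    parts = line.split()
    shape, tp, conc, base, opt = parts[:5]
    # parts[5] is the gain column ("+4.43%" or "n/a"); the rest is the verdict,
    # which may contain a space ("WARN: regression").
    return Row(shape, int(tp), int(conc), _num(base), _num(opt), " ".join(parts[6:]))


worst = parse_row("8k1k 8 256 3297.20 370.43 -88.77% WARN: regression")
```

The gain column is skipped on purpose, since it can be recomputed from the two throughput values.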

@lishuoshuo-amd
Owner Author

Verify PR (Hyperloom) — failed before producing results

The verify jobs did not upload any summary. Check the workflow logs.
