[AMD/Hyperloom] Optimize dsr1, gptoss, kimik2.5, qwen3.5 (mi355x) by lishuoshuo-amd · Pull Request #7 · lishuoshuo-amd/InferenceX

lishuoshuo-amd · 2026-04-20T10:24:47Z

Description

Automated performance optimization update from Hyperloom CI.

dsr1-fp8-mi355x-sglang

Metric	Value
Baseline (tok/s/GPU)	311.65
Optimized (tok/s/GPU)	331.22
Optimization Gain	+6.3%

Server flag changes:

--num-continuous-decode-steps: 4 → 8
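
The gain figures in these tables appear to be the relative change in per-GPU throughput, rounded to one decimal place. Below is a minimal sketch of that arithmetic, assuming this is how the number is derived; the function name `gain_percent` is illustrative and not part of Hyperloom CI.

```python
def gain_percent(baseline: float, optimized: float) -> float:
    """Relative throughput change in percent, rounded to one decimal place."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return round((optimized - baseline) / baseline * 100.0, 1)


# dsr1-fp8-mi355x-sglang figures from the table above.
print(f"{gain_percent(311.65, 331.22):+.1f}%")  # prints "+6.3%"
```

The dsr1, kimik2.5, and qwen3.5 figures reproduce under this formula. The gptoss pair computes to roughly +5.05%, so the CI may truncate or work from unrounded inputs.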

gptoss-fp4-mi355x-vllm

Metric	Value
Baseline (tok/s/GPU)	7389.00
Optimized (tok/s/GPU)	7762.36
Optimization Gain	+5.0%

Server flag changes:

Add --max-num-seqs 512

kimik2.5-int4-mi355x-vllm

Metric	Value
Baseline (tok/s/GPU)	184.94
Optimized (tok/s/GPU)	198.81
Optimization Gain	+7.5%

Server flag changes:

--max-num-seqs: 256 → 512

qwen3.5-bf16-mi355x-sglang

Metric	Value
Baseline (tok/s/GPU)	260.55
Optimized (tok/s/GPU)	272.63
Optimization Gain	+4.6%

Server flag changes:

Add --enable-mixed-chunk
Add --num-continuous-decode-steps 8

Related Issue

Automated by Hyperloom CI

Type of Change

Configuration change

Checklist

I have tested my changes locally
I have updated documentation if necessary
If I changed a container image or config, I have already updated perf-changelog.yaml

- dsr1-fp8-mi355x-sglang: --num-continuous-decode-steps: 4 -> 8
- gptoss-fp4-mi355x-vllm: Add --max-num-seqs 512
- kimik2.5-int4-mi355x-vllm: --max-num-seqs: 256 -> 512
- qwen3.5-bf16-mi355x-sglang: Add --enable-mixed-chunk; Add --num-continuous-decode-steps 8
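
One way to read the four changes is as edits to each server's launch arguments. The sketch below applies them to hypothetical base argument lists. The helper `apply_flag`, the `<model>` placeholder, and the base commands are assumptions for illustration, and the actual benchmark scripts may build their launch lines differently. Only the flag names and values come from this PR.

```python
from typing import List, Optional


def apply_flag(argv: List[str], flag: str, value: Optional[str] = None) -> List[str]:
    """Return a copy of argv with `flag` set to `value`, or added as a bare switch."""
    out = list(argv)
    if flag in out:
        idx = out.index(flag)
        if value is not None:
            # Overwrite the existing value (or append it if the flag was last).
            out[idx + 1 : idx + 2] = [value]
        return out
    out.append(flag)
    if value is not None:
        out.append(value)
    return out


# Hypothetical base commands; "<model>" stands in for the real model path.
sglang_base = ["python", "-m", "sglang.launch_server", "--model-path", "<model>"]
vllm_base = ["vllm", "serve", "<model>"]

dsr1 = apply_flag(sglang_base + ["--num-continuous-decode-steps", "4"],
                  "--num-continuous-decode-steps", "8")
gptoss = apply_flag(vllm_base, "--max-num-seqs", "512")
kimik = apply_flag(vllm_base + ["--max-num-seqs", "256"], "--max-num-seqs", "512")
qwen = apply_flag(apply_flag(sglang_base, "--enable-mixed-chunk"),
                  "--num-continuous-decode-steps", "8")

for name, argv in [("dsr1", dsr1), ("gptoss", gptoss),
                   ("kimik2.5", kimik), ("qwen3.5", qwen)]:
    print(name, " ".join(argv))
```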

github-actions · 2026-04-20T10:24:55Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.


…eanup
- paths: trigger on infra file changes (scripts, configs, workflow)
- detect: run all whitelisted scripts when infra files change
- runner: validate summary quality, fail CI if no results produced
- entrypoint: fail-fast on git clone/SHA errors
- cleanup: SIGKILL + sglang.srt/ray workers + longer sleep

Made-with: Cursor


lishuoshuo-amd · 2026-04-21T11:06:33Z

Verify PR (Hyperloom) — Unofficial smoke test

Only the upstream InferenceX repo contains the official benchmark of record. This fork's verify run reproduces the full (tp, conc, isl, osl) search space from amd-master.yaml for each changed script, running both the PR base and the PR head. The results below are a smoke-test indicator only. Differences within ±2% are treated as noise (verdict OK).
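
The verdict column in the tables below appears to follow a simple threshold rule. Here is a minimal sketch, assuming that only drops beyond -2% are flagged and that a missing measurement yields FAIL. The function name and the exact boundary handling are assumptions; the CI's actual logic is not shown in this PR.

```python
from typing import Optional

NOISE_BAND_PCT = 2.0  # "±2% is treated as noise" per the comment above


def verdict(baseline: Optional[float], optimized: Optional[float]) -> str:
    """Classify one (tp, conc, isl, osl) point from paired throughput numbers."""
    if baseline is None or optimized is None:
        return "FAIL"  # one side produced no result
    gain_pct = (optimized - baseline) / baseline * 100.0
    if gain_pct < -NOISE_BAND_PCT:
        return "WARN: regression"
    return "OK"  # within noise, or an improvement


print(verdict(4667.29, 4577.29))  # -1.93% -> OK
print(verdict(344.03, 337.05))    # -2.03% -> WARN: regression
print(verdict(3730.41, None))     # missing head result -> FAIL
```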

`dsr1_fp8_mi355x.sh`

isl/osl	tp	conc	baseline (tok/s)	optimized (tok/s)	gain	verdict
1k1k	8	4	399.90	417.60	+4.43%	OK
1k1k	8	8	729.10	750.26	+2.90%	OK
1k1k	8	16	1140.92	1173.48	+2.85%	OK
1k1k	8	32	1683.81	1739.49	+3.31%	OK
1k1k	8	64	2614.64	2654.50	+1.52%	OK
8k1k	4	32	770.98	821.53	+6.56%	OK
8k1k	4	64	991.61	1031.77	+4.05%	OK
8k1k	8	4	310.30	366.00	+17.95%	OK
8k1k	8	8	625.61	636.92	+1.81%	OK
8k1k	8	16	902.98	925.49	+2.49%	OK
8k1k	8	32	1213.94	1279.19	+5.38%	OK
8k1k	8	64	1664.40	1709.37	+2.70%	OK

`glm5_fp8_mi355x.sh`

isl/osl	tp	conc	baseline (tok/s)	optimized (tok/s)	gain	verdict
1k1k	8	4	176.32	176.24	-0.05%	OK
1k1k	8	8	324.05	323.87	-0.06%	OK
1k1k	8	16	582.42	582.80	+0.07%	OK
1k1k	8	32	982.46	988.96	+0.66%	OK
1k1k	8	64	1547.23	1553.80	+0.42%	OK
8k1k	8	4	152.00	153.37	+0.90%	OK
8k1k	8	8	259.40	261.49	+0.81%	OK
8k1k	8	16	406.97	411.06	+1.01%	OK
8k1k	8	32	592.49	601.82	+1.57%	OK
8k1k	8	64	771.44	780.61	+1.19%	OK

`gptoss_fp4_mi355x.sh`

isl/osl	tp	conc	baseline (tok/s)	optimized (tok/s)	gain	verdict
1k1k	1	4	835.68	847.40	+1.40%	OK
1k1k	1	8	1438.74	1416.76	-1.53%	OK
1k1k	1	16	2258.12	2239.38	-0.83%	OK
1k1k	1	32	3514.46	3560.56	+1.31%	OK
1k1k	1	64	5285.79	5421.64	+2.57%	OK
1k1k	1	128	7607.70	7719.08	+1.46%	OK
1k1k	4	4	1124.75	1142.61	+1.59%	OK
1k1k	4	8	1818.63	1822.67	+0.22%	OK
1k1k	8	4	1119.82	1127.07	+0.65%	OK
1k1k	8	8	2129.35	2157.85	+1.34%	OK
1k1k	8	16	3854.10	4006.08	+3.94%	OK
8k1k	1	4	714.90	735.80	+2.92%	OK
8k1k	1	8	1124.70	1173.86	+4.37%	OK
8k1k	1	16	1609.91	1617.08	+0.45%	OK
8k1k	1	32	2240.54	2242.82	+0.10%	OK
8k1k	1	64	2822.34	2825.84	+0.12%	OK
8k1k	1	128	3353.39	3351.61	-0.05%	OK
8k1k	4	4	1018.39	1050.47	+3.15%	OK
8k1k	8	4	1024.98	1067.11	+4.11%	OK
8k1k	8	8	1914.10	1950.51	+1.90%	OK

`kimik2.5_int4_mi355x.sh`

isl/osl	tp	conc	baseline (tok/s)	optimized (tok/s)	gain	verdict
1k1k	8	4	229.76	229.94	+0.08%	OK
1k1k	8	8	396.17	398.16	+0.50%	OK
1k1k	8	16	672.49	671.10	-0.21%	OK
1k1k	8	32	1019.56	1013.30	-0.61%	OK
1k1k	8	64	1587.47	1590.80	+0.21%	OK
8k1k	8	4	212.63	212.30	-0.15%	OK
8k1k	8	8	356.20	355.70	-0.14%	OK
8k1k	8	16	561.59	560.71	-0.16%	OK
8k1k	8	32	800.30	800.63	+0.04%	OK
8k1k	8	64	1100.29	1098.94	-0.12%	OK

`minimaxm2.5_fp8_mi355x.sh`

isl/osl	tp	conc	baseline (tok/s)	optimized (tok/s)	gain	verdict
1k1k	2	2	172.65	172.07	-0.33%	OK
1k1k	2	4	318.39	316.01	-0.75%	OK
1k1k	2	8	543.05	542.29	-0.14%	OK
1k1k	2	16	915.67	917.36	+0.19%	OK
1k1k	2	32	1521.84	1513.49	-0.55%	OK
1k1k	2	64	2271.16	2265.32	-0.26%	OK
1k1k	2	128	3740.62	3750.39	+0.26%	OK
1k1k	2	256	5645.52	5633.46	-0.21%	OK
1k1k	2	512	8112.39	8106.57	-0.07%	OK
1k1k	4	4	351.99	355.63	+1.03%	OK
1k1k	4	8	663.45	661.60	-0.28%	OK
1k1k	4	16	1121.67	1122.70	+0.09%	OK
1k1k	4	32	1980.79	1968.42	-0.62%	OK
1k1k	4	64	3243.74	3274.35	+0.94%	OK
1k1k	4	128	5378.00	5389.17	+0.21%	OK
1k1k	4	256	7891.14	7880.06	-0.14%	OK
1k1k	8	2	190.94	193.15	+1.16%	OK
8k1k	2	2	166.82	164.70	-1.27%	OK
8k1k	2	4	288.88	292.45	+1.24%	OK
8k1k	2	8	479.08	476.09	-0.62%	OK
8k1k	2	16	730.34	738.51	+1.12%	OK
8k1k	2	32	1095.36	1091.24	-0.38%	OK
8k1k	2	64	1442.24	1443.35	+0.08%	OK
8k1k	2	128	1913.58	1906.63	-0.36%	OK
8k1k	2	256	2294.86	2295.02	+0.01%	OK
8k1k	4	4	331.52	336.44	+1.49%	OK
8k1k	4	8	594.88	600.51	+0.95%	OK
8k1k	4	16	956.31	964.51	+0.86%	OK
8k1k	4	32	1538.30	1545.08	+0.44%	OK
8k1k	4	64	2286.73	2288.51	+0.08%	OK
8k1k	4	128	3081.30	3098.23	+0.55%	OK
8k1k	4	256	3730.41	FAIL	n/a	FAIL

`qwen3.5_bf16_mi355x.sh`

isl/osl	tp	conc	baseline (tok/s)	optimized (tok/s)	gain	verdict
1k1k	8	4	375.14	377.09	+0.52%	OK
1k1k	8	8	698.01	702.18	+0.60%	OK
1k1k	8	16	1219.10	1225.60	+0.53%	OK
1k1k	8	32	1958.72	1959.46	+0.04%	OK
1k1k	8	64	3048.54	3049.67	+0.04%	OK
1k1k	8	128	4667.29	4577.29	-1.93%	OK
1k1k	8	256	7150.10	6426.82	-10.12%	WARN: regression
8k1k	8	4	344.03	337.05	-2.03%	WARN: regression
8k1k	8	8	624.26	591.99	-5.17%	WARN: regression
8k1k	8	16	1027.84	862.79	-16.06%	WARN: regression
8k1k	8	32	1515.61	1013.75	-33.11%	WARN: regression
8k1k	8	64	2086.05	910.82	-56.34%	WARN: regression
8k1k	8	128	2672.82	626.84	-76.55%	WARN: regression
8k1k	8	256	3297.20	370.43	-88.77%	WARN: regression

Made-with: Cursor
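
The result tables in this comment are tab-separated, so they are straightforward to post-process. The sketch below recomputes the gain for a few rows copied from the `qwen3.5_bf16_mi355x.sh` table and lists those that fall below the -2% band. The helper name `regressions` and its default threshold are illustrative assumptions.

```python
# Columns: isl/osl, tp, conc, baseline (tok/s), optimized (tok/s)
ROWS = """\
1k1k\t8\t128\t4667.29\t4577.29
1k1k\t8\t256\t7150.10\t6426.82
8k1k\t8\t64\t2086.05\t910.82
8k1k\t8\t256\t3297.20\t370.43
"""


def regressions(table: str, threshold_pct: float = -2.0):
    """Yield (isl/osl, tp, conc, gain%) for rows whose gain is below threshold_pct."""
    for line in table.strip().splitlines():
        shape, tp, conc, base, head = line.split("\t")[:5]
        gain = (float(head) - float(base)) / float(base) * 100.0
        if gain < threshold_pct:
            yield shape, int(tp), int(conc), round(gain, 2)


for row in regressions(ROWS):
    print(row)
```

In this sample the regression grows with concurrency at 8k1k, which matches the pattern in the full table.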

lishuoshuo-amd · 2026-04-21T13:01:03Z

Verify PR (Hyperloom) — failed before producing results

The verify jobs did not upload any summary. Check the workflow logs.

lishuoshuo-amd added the verify-enabled Validate PR label Apr 20, 2026

lishuoshuo-amd added verify-enabled Validate PR and removed verify-enabled Validate PR labels Apr 20, 2026

lishuoshuo-amd added 12 commits April 20, 2026 11:46

Merge remote-tracking branch 'origin/main' into hyperloom/ci-20260420…

ceeb713

…-1024-replay

fix: use curl instead of wget (not available on runner)

9c75359

Made-with: Cursor

fix: add error logging + disable SSL verify for SaFE API

d4752d0

Made-with: Cursor

fix: sanitize workload name (no underscores, max 44 chars)

d5a20a5

Made-with: Cursor

fix: strip all non-alphanumeric chars from workload name

d40fee9

Made-with: Cursor

fix: use /workspace as working dir (benchmark_lib.sh expects it)

80118a2

Made-with: Cursor

fix: clear NFS result dir before creating workload (avoid stale data)

43cf864

Made-with: Cursor

fix: tee all pod output to NFS pod.log + keep pod 10min after finish

a9fc548

Made-with: Cursor

feat: print pod.log to CI stdout for easier debugging

8fb3909

Made-with: Cursor

fix: pass EP_SIZE env var to benchmark scripts (required by qwen/etc)

f8d1079

Made-with: Cursor

feat: add glm5+minimax to verify sweep + stream pod.log to CI in real…

2371d01

…-time Made-with: Cursor

fix: correct GLM-5-FP8 model path (was zai-org-GLM-5-FP8, doesn't exist)

b45f058

Made-with: Cursor
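
Commits d5a20a5 and d40fee9 describe constraints on the workload name. Below is a minimal sketch of what such sanitization might look like, assuming the 44-character cap from d5a20a5 still applies after d40fee9 and that only ASCII letters and digits are kept. The function name and the example input are hypothetical; the actual implementation is not shown on this page.

```python
import re

MAX_WORKLOAD_NAME_LEN = 44  # "max 44 chars" per commit d5a20a5


def sanitize_workload_name(raw: str) -> str:
    """Drop every non-alphanumeric ASCII character, then truncate to the length cap."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", raw)
    return cleaned[:MAX_WORKLOAD_NAME_LEN]


print(sanitize_workload_name("verify_dsr1-fp8.mi355x"))  # verifydsr1fp8mi355x
```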

lishuoshuo-amd changed the title ~~[AMD/Hyperloom] Optimize 4 AMD models~~ [AMD/Hyperloom] Optimize dsr1, gptoss, kimik2.5, qwen3.5 (mi355x) Apr 20, 2026

lishuoshuo-amd added 2 commits April 20, 2026 13:43

fix: bump pod resources to 64C/1024Gi + line-buffered tee + suppress …

db1f8d5

…SSL warnings Made-with: Cursor

fix: increase timeout to 6h + kill residual server processes between …

08ab55b

…runs Made-with: Cursor

lishuoshuo-amd force-pushed the hyperloom/ci-20260420-1024-replay branch from c0f3247 to 08ab55b April 21, 2026 01:16

lishuoshuo-amd added 4 commits April 21, 2026 01:30

fix: graceful SIGTERM first, SIGKILL fallback + clean /dev/shm leaks

6b3a28c

Made-with: Cursor

fix: strict paired validation + fetch upstream base SHA for fork PRs

8fc979c

Made-with: Cursor

fix: handle NFS rmtree errors with ignore_errors + retry

c0a8256

Made-with: Cursor

feat: add Teams webhook notification + increase timeout to 15h

8980e53

Made-with: Cursor
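
Commit 6b3a28c mentions a graceful SIGTERM followed by a SIGKILL fallback. Below is a minimal sketch of that pattern using Python's `subprocess` module. The function name, the grace period, and the use of `subprocess` rather than shell commands are assumptions, since the actual cleanup code is not shown on this page.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 10.0) -> str:
    """Ask proc to exit with SIGTERM; escalate to SIGKILL if it is still alive."""
    if proc.poll() is not None:
        return "already exited"
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return "killed"


# Stand-in for a leftover server process.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print(stop_process(child, grace_seconds=5.0))
```

Waiting after `kill()` reaps the child so that no zombie process is left behind between benchmark runs.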

lishuoshuo-amd wants to merge 20 commits into main from hyperloom/ci-20260420-1024-replay
