
[AMD/Hyperloom] Tune dsr1-fp8-mi355x-sglang: --num-continuous-decode-steps 4 → 8 #1109

Open
lishuoshuo-amd wants to merge 2 commits into SemiAnalysisAI:main from lishuoshuo-amd:amd/hyperloom/mi355x-tune-dsr1

Conversation


@lishuoshuo-amd lishuoshuo-amd commented Apr 21, 2026

Description

Tune --num-continuous-decode-steps from 4 to 8 for DeepSeek-R1-0528 FP8 on MI355X (SGLang).
Increasing continuous decode steps reduces prefill/decode scheduling overhead, lowering per-token latency (TPOT) and improving overall throughput.
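The tuned flag is a standard SGLang server argument. As a hypothetical sketch of how the change reads in a launch command (the model path and surrounding arguments here are illustrative assumptions, not copied from `dsr1_fp8_mi355x.sh`; only the flag value change comes from this PR):

```shell
# Sketch only: arguments other than the tuned flag are illustrative.
# --num-continuous-decode-steps controls how many decode steps the scheduler
# runs back-to-back before returning to check for new (prefill) work.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --tp 8 \
  --num-continuous-decode-steps 8
# previously: --num-continuous-decode-steps 4
```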

Changes

  • benchmarks/single_node/dsr1_fp8_mi355x.sh: --num-continuous-decode-steps 4 → 8
  • perf-changelog.yaml: Added changelog entry

Performance Results

Hyperloom CI Optimization Report (conc=64, 1k/1k)

| Metric | Baseline | Optimized | Change |
|---|---|---|---|
| Output Throughput (per GPU) | 311.65 tok/s | 331.22 tok/s | +6.28% |
| TPOT | 24.46 ms | 23.01 ms | -5.93% |
| TTFT | 581.20 ms | 528.52 ms | -9.07% |
| vs InferenceX Official | +0.50% | +6.81% | |

Full Parameter Sweep (12 points, 0 failures)

Verified across the complete (tp, conc, isl, osl) search space from amd-master.yaml:

| ISL/OSL | TP | Conc | Baseline (tok/s) | Optimized (tok/s) | Gain |
|---|---|---|---|---|---|
| 1k/1k | 8 | 4 | 399.90 | 417.60 | +4.43% |
| 1k/1k | 8 | 8 | 729.10 | 750.26 | +2.90% |
| 1k/1k | 8 | 16 | 1140.92 | 1173.48 | +2.85% |
| 1k/1k | 8 | 32 | 1683.81 | 1739.49 | +3.31% |
| 1k/1k | 8 | 64 | 2614.64 | 2654.50 | +1.52% |
| 8k/1k | 4 | 32 | 770.98 | 821.53 | +6.56% |
| 8k/1k | 4 | 64 | 991.61 | 1031.77 | +4.05% |
| 8k/1k | 8 | 4 | 310.30 | 366.00 | +17.95% |
| 8k/1k | 8 | 8 | 625.61 | 636.92 | +1.81% |
| 8k/1k | 8 | 16 | 902.98 | 925.49 | +2.49% |
| 8k/1k | 8 | 32 | 1213.94 | 1279.19 | +5.38% |
| 8k/1k | 8 | 64 | 1664.40 | 1709.37 | +2.70% |

Average gain: +4.7%, a positive improvement across all 12 parameter combinations with no regressions.
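As a sanity check, the +4.7% headline figure can be reproduced from the Gain column of the sweep table (a minimal sketch using standard shell tools):

```shell
# Mean of the 12 per-point gains listed in the sweep table above.
gains="4.43 2.90 2.85 3.31 1.52 6.56 4.05 17.95 1.81 2.49 5.38 2.70"
avg=$(printf '%s\n' $gains | awk '{s += $1; n++} END {printf "%.1f", s/n}')
echo "+${avg}%"   # prints +4.7%
```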

Baseline Validation Against InferenceX Official

| Conc | Official (tok/s/GPU) | Our Baseline (tok/s/GPU) | Diff |
|---|---|---|---|
| 4 | 49.82 | 49.99 | +0.3% |

The baseline aligns with the official InferenceX data to within 1%, confirming the reliability of the test environment.

Note: All throughput numbers in this PR refer to output (decode) token throughput, never total. The "Optimization Report" and "Baseline Validation" tables show per-GPU values; the "Full Parameter Sweep" table shows aggregate (TP-summed) values from raw SGLang output_throughput. Per-GPU = aggregate / TP. Gain percentages are unit-invariant.

Related Issue

Automated optimization by Hyperloom CI.

Type of Change

  • Configuration change

Checklist

  • I have tested my changes locally
  • I have updated documentation if necessary
  • If I changed a container image or config, I have already updated perf-changelog.yaml


@claude (bot) left a comment:


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@lishuoshuo-amd (Author) commented:

@claude review

@billishyahao (Collaborator) commented:

cc @Duyi-Wang

lishuoshuo-amd force-pushed the amd/hyperloom/mi355x-tune-dsr1 branch from 54aee90 to b10c872 on April 27, 2026 at 12:12.