Skip to content

[AMD/ROCM] Update gptoss-fp4-mi355x-atom config#1195

Open
seungrokj wants to merge 2 commits intomainfrom
srok/atom_gptoss_fp4
Open

[AMD/ROCM] Update gptoss-fp4-mi355x-atom config#1195
seungrokj wants to merge 2 commits intomainfrom
srok/atom_gptoss_fp4

Conversation

@seungrokj
Copy link
Copy Markdown
Collaborator

@seungrokj seungrokj commented Apr 27, 2026

Hi, This is just a minor change. Will do sweep after high priority dsv4 stuffs.

Summary

  • Update Atom image to rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post
  • Expand conc-end to 256 for tp1 search-space configs
  • Add --gpu-memory-utilization 0.9 flag to benchmark script
  • Add perf-changelog entry for gptoss-fp4-mi355x-atom

Test plan

  • CI benchmark run for gptoss-fp4-mi355x-atom

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@seungrokj seungrokj changed the title Update gptoss-fp4-mi355x-atom config [AMD/ROCM] Update gptoss-fp4-mi355x-atom config Apr 27, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread perf-changelog.yaml Outdated
Comment on lines +1895 to +1900

- config-keys:
- gptoss-fp4-mi355x-atom
description:
- ""
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new perf-changelog.yaml entry for gptoss-fp4-mi355x-atom (lines 1895-1900) is a forgotten template: description is an empty string - "" and pr-link is .../pull/ with no PR number. As written, the entry documents nothing and links nowhere — please fill in a real description (atom image bump rocm7.1.1/atom0.1.1 → rocm7.2.2/atom0.1.2.post, conc-end 128 → 256 for tp1 configs, and the new --gpu-memory-utilization 0.9 flag) and set pr-link to https://github.com/SemiAnalysisAI/InferenceX/pull/1195 before merging.

Extended reasoning...

What the bug is

The new entry appended to perf-changelog.yaml for gptoss-fp4-mi355x-atom (lines 1895-1900 of the diff) is structurally valid YAML but contains placeholder content that was never filled in:

- config-keys:
    - gptoss-fp4-mi355x-atom
  description:
    - ""
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/

Both user-visible fields are empty: description is a list containing only the empty string, and pr-link ends in pull/ with no PR number.

Why existing validation does not catch it

The Pydantic schema in utils/matrix_logic/validation.py requires description: list[str] = Field(min_length=1) and pr_link: str. A one-element list whose only element is "" satisfies min_length=1, and any string — including a URL missing its terminal segment — satisfies pr_link: str. So the entry passes validation but conveys no information.

Impact

The entire reason this PR touches perf-changelog.yaml is to document the changes to gptoss-fp4-mi355x-atom. As written, that goal is not achieved:

  • Anyone scanning the changelog sees gptoss-fp4-mi355x-atom with no description.
  • The pr-link does not resolve to this PR (or any PR) — clicking it goes to https://github.com/SemiAnalysisAI/InferenceX/pull/, which is not a valid PR URL.
  • Every other entry in this 1900-line file (including the immediately preceding entry at line 1887 for PR dsv4-fp4-b300-sglang-mtp: pass --dsv4 to use DSv4 chat template #1182) follows the convention of a meaningful description and a numeric pr-link. This entry breaks the convention.

Step-by-step proof

  1. Open perf-changelog.yaml at the bottom (lines 1895-1900 in the diff).
  2. Read description: - "" — that is literally a single empty string, not a description.
  3. Read pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/ — copy/paste into a browser; GitHub redirects to the pull-request index, not to PR [AMD/ROCM] Update gptoss-fp4-mi355x-atom config #1195.
  4. Compare with the entry just above (PR dsv4-fp4-b300-sglang-mtp: pass --dsv4 to use DSv4 chat template #1182): it has a real description ("Pass --dsv4 to run_benchmark_serving ...") and a complete link (.../pull/1182). Compare with any of the other ~120 entries — same pattern.
  5. The PR description itself enumerates three substantive changes (image bump to rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post, conc-end 128 → 256 for tp1 search-space, --gpu-memory-utilization 0.9 flag) — none of which are recorded in the changelog entry.

How to fix

Replace lines 1898-1900 with something like:

  description:
    - "Update Atom image from rocm7.1.1-ubuntu24.04-pytorch2.9-atom0.1.1-MI350x to rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post"
    - "Extend conc-end from 128 to 256 for tp1 configs (1k1k and 8k1k)"
    - "Add --gpu-memory-utilization 0.9 to benchmarks/single_node/gptoss_fp4_mi355x_atom.sh"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1195

Severity: normal — not a runtime defect (benchmarks still run, validation still passes), but it defeats the purpose of the changelog edit, which is the only documentation artifact in this PR.

@seungrokj seungrokj added the AMD label Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant