[Tracking] Release v0.1.13 validation (DO NOT MERGE) #2913

Open

sunway513 wants to merge 4 commits into main from release/v0.1.13

Conversation

@sunway513
Collaborator

Purpose

Tracking-only PR to fire ci:all extended validation against release/v0.1.13.

DO NOT MERGE.

Branch state

| Commit | Source | Notes |
| --- | --- | --- |
| 74dcf38d | this branch | ci(release): cherry-pick #2875 — torch_pin + torch_index_url workflow inputs |
| e039817a | main, 2026-04-23 11:15Z | MLA PS mode support nhead8,2 in MI308 (#2852) — release base |

Why this commit as v0.1.13 base

  • The ATOM benchmark dashboard shows ATOM commit 4f3c57970c (2026-04-23 14:46Z) passing 14/14 accuracy runs against this AITER head.
  • ATOM commit 9ea09987bf (2026-04-23 13:05Z) also passes 14/14 against this AITER head.
  • AITER's own atom-test CI on e039817a is 3/3 green (DSR1 MI300X + MI355X, gpt-oss-120b MI355X).
  • Avoids the 04-24 churn around fmha / sliding-window-mtp / gpt-oss-tuned-config, which had cancelled or failed CI runs.

Validation matrix (target)

  • aiter-test (1 GPU, 8 GPU)
  • sglang_downstream
  • vllm benchmark — no regression vs post1
  • atom-test (DSR1, GPT-OSS, etc.)
  • triton-test (MI355)
  • Manual GSM8K accuracy (5 models)
  • vLLM + ATOM plugin OOT
  • SGLang + ATOM plugin OOT
  • 6-wheel build matrix (manylinux + torch 2.10)

ci(release): cherry-pick #2875 — torch_pin + torch_index_url workflow inputs

Brings the manylinux + torch ABI pin workflow controls onto release/v0.1.13
so v0.1.13 release wheel builds can dispatch torch_pin=2.10.0 directly,
matching the post2 build path.

Cherry-pick of: 7c4cc6c (#2875)
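For reference, a hypothetical dispatch of the wheel build with the new inputs. This is a sketch only: the workflow file name build-wheels.yml and the rocm7.1 index path are assumptions, not taken from this PR.

```python
# Hypothetical sketch: dispatch the release wheel workflow with torch_pin.
# Assumes the gh CLI is installed; "build-wheels.yml" is a placeholder name.
import subprocess

subprocess.run(
    [
        "gh", "workflow", "run", "build-wheels.yml",
        "--ref", "release/v0.1.13",
        "-f", "torch_pin=2.10.0",
        "-f", "torch_index_url=https://download.pytorch.org/whl/rocm7.1",
    ],
    check=True,
)
```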
@sunway513 sunway513 requested review from a team and Copilot April 24, 2026 23:13
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or `gh pr edit 2913 --add-label <label>`.

Contributor

Copilot AI left a comment


Pull request overview

Tracking PR to run extended CI validation for release/v0.1.13, including a cherry-pick that makes the release workflow’s manylinux build able to pin/override the PyTorch wheel source for ABI-targeted builds.

Changes:

  • Add torch_pin and torch_index_url inputs to workflow_dispatch and workflow_call.
  • Wire the new inputs into job environment variables.
  • Update the manylinux dependency install step to optionally use the provided index URL and torch version spec.
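A minimal Python sketch of this plumbing (variable names follow the workflow snippets quoted below; the real step is shell inside the workflow, and the ROCM_NUM fallback here is an assumption):

```python
# Sketch only: how torch_index_url / torch_pin would feed the install step.
import os
import shlex

# The two workflow inputs surface as job environment variables.
torch_index = os.environ.get("TORCH_INDEX_URL") or (
    "https://download.pytorch.org/whl/rocm" + os.environ.get("ROCM_NUM", "7.1")
)
torch_pin = os.environ.get("TORCH_PIN", "")
torch_spec = f"torch=={torch_pin}" if torch_pin else "torch"

# Equivalent of: pip install --index-url "$TORCH_INDEX" "$TORCH_SPEC"
print("pip install --index-url", shlex.quote(torch_index), shlex.quote(torch_spec))
```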


Comment on lines +328 to +332

```sh
# legacy auto-derived behavior.
if [ -n "${TORCH_INDEX_URL}" ]; then
  TORCH_INDEX="${TORCH_INDEX_URL}"
else
  TORCH_INDEX="https://download.pytorch.org/whl/rocm${ROCM_NUM}"
```

Copilot AI Apr 24, 2026


TORCH_INDEX_URL is user-provided now, but it’s accepted verbatim as the pip --index-url. To reduce supply-chain risk and accidental misuse, consider validating this input before use (e.g., require https:// and/or restrict to an allowlist like download.pytorch.org unless an explicit “unsafe_allow_custom_index” flag is set).

Suggested change

```diff
-# legacy auto-derived behavior.
-if [ -n "${TORCH_INDEX_URL}" ]; then
-  TORCH_INDEX="${TORCH_INDEX_URL}"
-else
-  TORCH_INDEX="https://download.pytorch.org/whl/rocm${ROCM_NUM}"
+# legacy auto-derived behavior. Custom indexes must use HTTPS and
+# are restricted to download.pytorch.org unless explicitly allowed.
+DEFAULT_TORCH_INDEX="https://download.pytorch.org/whl/rocm${ROCM_NUM}"
+if [ -n "${TORCH_INDEX_URL}" ]; then
+  case "${TORCH_INDEX_URL}" in
+    https://*)
+      ;;
+    *)
+      echo "TORCH_INDEX_URL must start with https://" >&2
+      exit 1
+      ;;
+  esac
+  case "${TORCH_INDEX_URL}" in
+    https://download.pytorch.org/*)
+      TORCH_INDEX="${TORCH_INDEX_URL}"
+      ;;
+    *)
+      if [ "${UNSAFE_ALLOW_CUSTOM_TORCH_INDEX}" = "true" ]; then
+        TORCH_INDEX="${TORCH_INDEX_URL}"
+      else
+        echo "TORCH_INDEX_URL host is not allowed. Set UNSAFE_ALLOW_CUSTOM_TORCH_INDEX=true to permit a custom HTTPS index." >&2
+        exit 1
+      fi
+      ;;
+  esac
+else
+  TORCH_INDEX="${DEFAULT_TORCH_INDEX}"
```

Comment on lines +334 to 340
```sh
# Optional torch version pin (e.g. 2.10.0+rocm7.1). Empty = latest.
if [ -n "${TORCH_PIN}" ]; then
  TORCH_SPEC="torch==${TORCH_PIN}"
else
  TORCH_SPEC="torch"
fi
echo "Torch index: ${TORCH_INDEX}"
```

Copilot AI Apr 24, 2026


Since TORCH_INDEX can now come from the caller-provided torch_index_url, echoing it verbatim can leak credentials if someone passes a URL containing embedded auth (or other sensitive query params). Consider either rejecting URLs with credentials, or logging a sanitized/redacted form of the URL.

Suggested change

```diff
+SAFE_TORCH_INDEX="$(TORCH_INDEX="${TORCH_INDEX}" python3 - <<'PY'
+import os
+from urllib.parse import urlsplit, urlunsplit
+url = os.environ["TORCH_INDEX"]
+parts = urlsplit(url)
+if not parts.scheme or not parts.netloc:
+    print(url)
+else:
+    host = parts.hostname or ""
+    if parts.port is not None:
+        host = f"{host}:{parts.port}"
+    print(urlunsplit((parts.scheme, host, parts.path, "", "")))
+PY
+)"
 # Optional torch version pin (e.g. 2.10.0+rocm7.1). Empty = latest.
 if [ -n "${TORCH_PIN}" ]; then
   TORCH_SPEC="torch==${TORCH_PIN}"
 else
   TORCH_SPEC="torch"
 fi
-echo "Torch index: ${TORCH_INDEX}"
+echo "Torch index: ${SAFE_TORCH_INDEX}"
```

Jeff-Huang and others added 3 commits April 24, 2026 23:37
… (#2893)

* test: expand test_batch_prefill_large_kvcache for >4GB KV cache overflow

Rewrite test_batch_prefill_large_kvcache to validate the per-tile SRD
rebase fix for >4GB KV caches across all page sizes, dtypes, and
attention configurations:

- Add page_size=1 and 16 (page_size < kN0, exercises rebase path)
- Add GQA (16, 8) in addition to MHA (8, 8)
- Add causal masking with CK-compatible attn_mask for SDPA reference
- Use full KV cache (4.5GB) with pages spanning the overflow boundary
- Use torch SDPA as reference (memory-efficient backend, no score
  matrix materialization)
- Add scatter_pages parameter (False only; True for future
  global_load_lds flat addressing)
- Add GPU memory check to skip configs that exceed HBM capacity

Test matrix: 24 cases (3 page_sizes × 2 dtypes × 2 causal × 2 GQA × 1 scatter)

* test: add GPU sync after CK kernel in large_kvcache test

Add torch.cuda.synchronize() after CK kernel launch in
test_batch_prefill_large_kvcache to ensure all async GPU work
completes before memory is freed between tests.

Without this sync, repeated allocate/free cycles of large KV cache
buffers (~20GB) with mixed dtype (bf16→fp8) can trigger GPU page
faults when the HIP memory allocator reuses virtual addresses that
are still referenced by pending async GPU work. The fault manifests
as VM_L2_PROTECTION_FAULT at address 0x0 (NULL), causing GPU reset
and kernel soft lockup.
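
A minimal sketch of the pattern this commit adds (the helper and argument names are illustrative, not the test's real API):

```python
import torch

def run_case(make_inputs, run_ck_kernel):
    q, k_cache, v_cache = make_inputs()       # ~20GB of KV cache buffers
    out = run_ck_kernel(q, k_cache, v_cache)  # async launch on the HIP stream
    torch.cuda.synchronize()                  # drain pending GPU work *before*
    del k_cache, v_cache                      # freeing, so the allocator cannot
    torch.cuda.empty_cache()                  # hand a still-referenced address
    return out                                # to the next (bf16 -> fp8) case
```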

* feat(fmha): runtime dispatch for >4GB KV cache in batch prefill

Add use_64bit_load to batch prefill traits and runtime overflow detection.
When page_block_size < 128 and max_page_byte_offset > INT32_MAX, dispatch
to the flat 64-bit load kernel variant for correctness.

Also add vectorized KV layout coverage to test_batch_prefill_large_kvcache.

* fix: remove unused k_vector_size variable in large_kvcache test

* fix(mha): improve batch_prefill TORCH_CHECK error message for >4GB KV cache

Include page_size, num_pages, and dtype in the error message when kernel
dispatch fails. Add hint about CDNA3+ GPU requirement when KV cache
exceeds 4GB with page_size < 128.

* test: update scatter_pages comment in large_kvcache test

The comment incorrectly stated scatter_pages=True was "expected to FAIL".
This is no longer true — the flat 64-bit load path handles scattered
pages correctly. Update to describe the test's purpose instead.

* fix(mha): widen batch_prefill 64-bit threshold to total KV bytes

The previous check used (num_total_pages - 1) * batch_stride * element_size
which measures the last-page base offset, missing within-page offsets and
producing an off-by-one at exactly INT32_MAX (the largest representable
SRD voffset). Switch to total KV cache footprint (num_total_pages *
batch_stride * element_size > INT32_MAX) so within-page reads on the last
page are covered, and drop the redundant num_total_pages > 1 guard since
single-page configs trivially fit in 32 bits.

Also unify wording: 4GB → 2GB (INT32_MAX byte offset for SRD voffset),
matching CK's TwoGB convention. The actual hardware bound has always been
2GB; the prior comments were imprecise.

Found during batch prefill template dispatch review.
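
In Python terms, the change amounts to the following (a sketch of the arithmetic only, not the C++ dispatch code):

```python
INT32_MAX = 2**31 - 1  # largest SRD voffset: the 2GB byte-offset bound

def old_check(num_total_pages, batch_stride, element_size):
    # Measures only the last page's *base* offset; misses within-page reads
    # and is off by one at exactly INT32_MAX.
    return num_total_pages > 1 and \
        (num_total_pages - 1) * batch_stride * element_size > INT32_MAX

def new_check(num_total_pages, batch_stride, element_size):
    # Total KV footprint, so within-page reads on the last page are covered.
    return num_total_pages * batch_stride * element_size > INT32_MAX
```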

* docs(mha): unify >2GB wording in batch_prefill error and test

The 4GB number in the TORCH_CHECK error message and the test comment was
imprecise — the actual SRD voffset bound is 2GB (INT32_MAX). Update both
to match the threshold check and CK's TwoGB convention.

Found during batch prefill template dispatch review.

* refactor(mha): drop wrapper-side use_64bit_load; let CK dispatcher decide

The wrapper hardcoded kN0_min = 128 to compute the >2GB KV cache
predicate, which leaked CK tile config into aiter and would silently
break if a new arm with bn0 != 128 were added. The CK auto-generated
dispatcher now decides per-arm using its own compile-time bn0 and
per-dtype kElementBytes, so the wrapper just forwards args.

Remove the `use_64bit_load` runtime field from `mha_batch_prefill_traits`,
the parameter from `get_mha_batch_prefill_traits()`, and the entire
predicate computation block from the dispatcher call site. Bumps CK
submodule to pull in the matching codegen change.

* chore(mha): bump CK + update wrapper wording for kUseGlobalLoad rename

Bumps 3rdparty/composable_kernel to dd8d293ea (refactor(fmha): batch
prefill review polish — assert helper + setter guards) which builds on
the prior 99a3ca9af kUseGlobalLoad rename.

Wrapper-side updates to match:

* csrc/cpp_itfs/mha_fwd_batch_prefill.cu: rename "64-bit-load" wording
  in the per-arm dispatcher comment to "kUseGlobalLoad" so the wrapper
  comment matches the CK-side identifier. Also drops the trailing
  `false /* skip_min_seqlen_q */` argument from the
  get_mha_batch_prefill_traits call to match the upstream CK API
  signature change.

* csrc/py_itfs_ck/mha_batch_prefill_kernels.cu: change the >2GB error
  message from "page_size < 128" to "page_size < kN0" so the diagnostic
  tracks the tile-size constant rather than a magic number.

* op_tests/test_batch_prefill.py (test_batch_prefill_large_kvcache):
  three documentation enhancements with no behavior change —
  - explain why qo_len caps at 128 (causal) / 1024 (non-causal): the
    causal cap is a math-backend cliff for the SDPA reference, not a
    kernel limit;
  - explain that the +256 padding on kv_page_indices is a batch_prefill
    ABI requirement (kernel may speculatively read up to bn0=256
    entries past the last valid page index);
  - expand the torch.cuda.synchronize comment to call out the
    misattribution failure mode and GPU-reset cascade risk.

* test(fmha): parametrize test_batch_prefill_large_kvcache over batch_size {1, 4}

Adds multi-batch coverage to the >2GB KV cache regression test.

The previous single-batch coverage left the kernel's per-sequence SRD
rebase path unexercised: with cu_seqlens_q=[0, qo_len] and kv_indptr=
[0, num_blocks], the kernel never walks the indptr to reposition K/V
SRDs across batch boundaries. After the kUseGlobalLoad rename and the
new positive static_assert(kUseGlobalLoad_) calls in update_physical_pages
and set_page_stride_elements, we want a regression that catches any
boundary-crossing SRD bug -- the failure mode no single-batch test can
detect (one batch correct, others wrong).

batch_size=4 partitions the >2GB page pool across 4 sequences (last
sequence absorbs the remainder), exercising 3 cross-batch SRD transitions.
The SDPA reference is computed per-batch and concatenated; per-iteration
free + empty_cache keeps peak memory at one batch's worth.
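
A sketch of that partitioning (illustrative helper, not the test's actual code):

```python
def make_kv_indptr(num_blocks: int, batch_size: int) -> list[int]:
    # Even split; the last sequence absorbs the remainder.
    per_seq = num_blocks // batch_size
    indptr = [i * per_seq for i in range(batch_size)]
    indptr.append(num_blocks)
    return indptr

# batch_size=4 over a 10-page pool -> [0, 2, 4, 6, 10]:
# three cross-batch SRD transitions; the last sequence takes 4 pages.
assert make_kv_indptr(10, 4) == [0, 2, 4, 6, 10]
```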

Verified on:
  - gfx950 (smci355-gfx950, MI355X): 160 passed, 32 skipped
  - gfx942 (smc300x-clt, MI308X):    160 passed, 32 skipped

Skips are the existing vectorized + page_size=1 incompatibility
(3D tensor layout), now 16 per batch_size value.

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
---------

Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>
