[Tracking] Release v0.1.13 validation (DO NOT MERGE) #2913

Open

sunway513 wants to merge 4 commits into main from release/v0.1.13

Conversation

@sunway513
Collaborator

Purpose

Tracking-only PR to fire ci:all extended validation against release/v0.1.13.

DO NOT MERGE.

Branch state

| Commit | Source | Notes |
| --- | --- | --- |
| 74dcf38d | this branch | ci(release): cherry-pick #2875 — torch_pin + torch_index_url workflow inputs |
| e039817a | main, 2026-04-23 11:15Z | MLA PS mode support nhead8,2 in MI308 (#2852) — release base |

Why this commit as v0.1.13 base

  • The ATOM benchmark dashboard shows ATOM commit 4f3c57970c (2026-04-23 14:46Z) passing 14/14 accuracy runs against this AITER head.
  • ATOM commit 9ea09987bf (2026-04-23 13:05Z) also passes 14/14 against this AITER head.
  • AITER's own atom-test CI on e039817a is 3/3 green (DSR1 MI300X + MI355X, gpt-oss-120b MI355X).
  • Avoids the 04-24 churn around fmha / sliding-window-mtp / gpt-oss-tuned-config, which had cancelled or failed CI runs.

Validation matrix (target)

  • aiter-test (1 GPU, 8 GPU)
  • sglang_downstream
  • vllm benchmark — no regression vs post1
  • atom-test (DSR1, GPT-OSS, etc.)
  • triton-test (MI355)
  • Manual GSM8K accuracy (5 models)
  • vLLM + ATOM plugin OOT
  • SGLang + ATOM plugin OOT
  • 6-wheel build matrix (manylinux + torch 2.10)

ci(release): cherry-pick #2875 — torch_pin + torch_index_url workflow inputs

Brings the manylinux + torch ABI pin workflow controls onto release/v0.1.13
so v0.1.13 release wheel builds can dispatch torch_pin=2.10.0 directly,
matching the post2 build path.

Cherry-pick of: 7c4cc6c (#2875)
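For reference, a hypothetical dispatch of the wheel build with the new inputs. This is a sketch only: the workflow file name build-wheels.yml and the rocm7.1 index path are assumptions, not taken from this PR.

```python
# Hypothetical sketch: dispatch the release wheel workflow with torch_pin.
# Assumes the gh CLI is installed; "build-wheels.yml" is a placeholder name.
import subprocess

subprocess.run(
    [
        "gh", "workflow", "run", "build-wheels.yml",
        "--ref", "release/v0.1.13",
        "-f", "torch_pin=2.10.0",
        "-f", "torch_index_url=https://download.pytorch.org/whl/rocm7.1",
    ],
    check=True,
)
```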
@sunway513 sunway513 requested review from a team and Copilot April 24, 2026 23:13
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or `gh pr edit 2913 --add-label <label>`.

Contributor

Copilot AI left a comment


Pull request overview

Tracking PR to run extended CI validation for release/v0.1.13, including a cherry-pick that makes the release workflow’s manylinux build able to pin/override the PyTorch wheel source for ABI-targeted builds.

Changes:

  • Add torch_pin and torch_index_url inputs to workflow_dispatch and workflow_call.
  • Wire the new inputs into job environment variables.
  • Update the manylinux dependency install step to optionally use the provided index URL and torch version spec.
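A minimal Python sketch of this plumbing (variable names follow the workflow snippets quoted below; the real step is shell inside the workflow, and the ROCM_NUM fallback here is an assumption):

```python
# Sketch only: how torch_index_url / torch_pin would feed the install step.
import os
import shlex

# The two workflow inputs surface as job environment variables.
torch_index = os.environ.get("TORCH_INDEX_URL") or (
    "https://download.pytorch.org/whl/rocm" + os.environ.get("ROCM_NUM", "7.1")
)
torch_pin = os.environ.get("TORCH_PIN", "")
torch_spec = f"torch=={torch_pin}" if torch_pin else "torch"

# Equivalent of: pip install --index-url "$TORCH_INDEX" "$TORCH_SPEC"
print("pip install --index-url", shlex.quote(torch_index), shlex.quote(torch_spec))
```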


Comment on lines +328 to +332

```sh
# legacy auto-derived behavior.
if [ -n "${TORCH_INDEX_URL}" ]; then
  TORCH_INDEX="${TORCH_INDEX_URL}"
else
  TORCH_INDEX="https://download.pytorch.org/whl/rocm${ROCM_NUM}"
```

Copilot AI Apr 24, 2026


TORCH_INDEX_URL is user-provided now, but it’s accepted verbatim as the pip --index-url. To reduce supply-chain risk and accidental misuse, consider validating this input before use (e.g., require https:// and/or restrict to an allowlist like download.pytorch.org unless an explicit “unsafe_allow_custom_index” flag is set).

Suggested change

```diff
-# legacy auto-derived behavior.
-if [ -n "${TORCH_INDEX_URL}" ]; then
-  TORCH_INDEX="${TORCH_INDEX_URL}"
-else
-  TORCH_INDEX="https://download.pytorch.org/whl/rocm${ROCM_NUM}"
+# legacy auto-derived behavior. Custom indexes must use HTTPS and
+# are restricted to download.pytorch.org unless explicitly allowed.
+DEFAULT_TORCH_INDEX="https://download.pytorch.org/whl/rocm${ROCM_NUM}"
+if [ -n "${TORCH_INDEX_URL}" ]; then
+  case "${TORCH_INDEX_URL}" in
+    https://*)
+      ;;
+    *)
+      echo "TORCH_INDEX_URL must start with https://" >&2
+      exit 1
+      ;;
+  esac
+  case "${TORCH_INDEX_URL}" in
+    https://download.pytorch.org/*)
+      TORCH_INDEX="${TORCH_INDEX_URL}"
+      ;;
+    *)
+      if [ "${UNSAFE_ALLOW_CUSTOM_TORCH_INDEX}" = "true" ]; then
+        TORCH_INDEX="${TORCH_INDEX_URL}"
+      else
+        echo "TORCH_INDEX_URL host is not allowed. Set UNSAFE_ALLOW_CUSTOM_TORCH_INDEX=true to permit a custom HTTPS index." >&2
+        exit 1
+      fi
+      ;;
+  esac
+else
+  TORCH_INDEX="${DEFAULT_TORCH_INDEX}"
```

Comment on lines +334 to 340
```sh
# Optional torch version pin (e.g. 2.10.0+rocm7.1). Empty = latest.
if [ -n "${TORCH_PIN}" ]; then
  TORCH_SPEC="torch==${TORCH_PIN}"
else
  TORCH_SPEC="torch"
fi
echo "Torch index: ${TORCH_INDEX}"
```

Copilot AI Apr 24, 2026


Since TORCH_INDEX can now come from the caller-provided torch_index_url, echoing it verbatim can leak credentials if someone passes a URL containing embedded auth (or other sensitive query params). Consider either rejecting URLs with credentials, or logging a sanitized/redacted form of the URL.

Suggested change

```diff
+SAFE_TORCH_INDEX="$(TORCH_INDEX="${TORCH_INDEX}" python3 - <<'PY'
+import os
+from urllib.parse import urlsplit, urlunsplit
+url = os.environ["TORCH_INDEX"]
+parts = urlsplit(url)
+if not parts.scheme or not parts.netloc:
+    print(url)
+else:
+    host = parts.hostname or ""
+    if parts.port is not None:
+        host = f"{host}:{parts.port}"
+    print(urlunsplit((parts.scheme, host, parts.path, "", "")))
+PY
+)"
 # Optional torch version pin (e.g. 2.10.0+rocm7.1). Empty = latest.
 if [ -n "${TORCH_PIN}" ]; then
   TORCH_SPEC="torch==${TORCH_PIN}"
 else
   TORCH_SPEC="torch"
 fi
-echo "Torch index: ${TORCH_INDEX}"
+echo "Torch index: ${SAFE_TORCH_INDEX}"
```

Jeff-Huang and others added 3 commits April 24, 2026 23:37
… (#2893)

* test: expand test_batch_prefill_large_kvcache for >4GB KV cache overflow

Rewrite test_batch_prefill_large_kvcache to validate the per-tile SRD
rebase fix for >4GB KV caches across all page sizes, dtypes, and
attention configurations:

- Add page_size=1 and 16 (page_size < kN0, exercises rebase path)
- Add GQA (16, 8) in addition to MHA (8, 8)
- Add causal masking with CK-compatible attn_mask for SDPA reference
- Use full KV cache (4.5GB) with pages spanning the overflow boundary
- Use torch SDPA as reference (memory-efficient backend, no score
  matrix materialization)
- Add scatter_pages parameter (False only; True for future
  global_load_lds flat addressing)
- Add GPU memory check to skip configs that exceed HBM capacity

Test matrix: 24 cases (3 page_sizes × 2 dtypes × 2 causal × 2 GQA × 1 scatter)

* test: add GPU sync after CK kernel in large_kvcache test

Add torch.cuda.synchronize() after CK kernel launch in
test_batch_prefill_large_kvcache to ensure all async GPU work
completes before memory is freed between tests.

Without this sync, repeated allocate/free cycles of large KV cache
buffers (~20GB) with mixed dtype (bf16→fp8) can trigger GPU page
faults when the HIP memory allocator reuses virtual addresses that
are still referenced by pending async GPU work. The fault manifests
as VM_L2_PROTECTION_FAULT at address 0x0 (NULL), causing GPU reset
and kernel soft lockup.
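
A minimal sketch of the pattern this commit adds (the helper and argument names are illustrative, not the test's real API):

```python
import torch

def run_case(make_inputs, run_ck_kernel):
    q, k_cache, v_cache = make_inputs()       # ~20GB of KV cache buffers
    out = run_ck_kernel(q, k_cache, v_cache)  # async launch on the HIP stream
    torch.cuda.synchronize()                  # drain pending GPU work *before*
    del k_cache, v_cache                      # freeing, so the allocator cannot
    torch.cuda.empty_cache()                  # hand a still-referenced address
    return out                                # to the next (bf16 -> fp8) case
```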

* feat(fmha): runtime dispatch for >4GB KV cache in batch prefill

Add use_64bit_load to batch prefill traits and runtime overflow detection.
When page_block_size < 128 and max_page_byte_offset > INT32_MAX, dispatch
to the flat 64-bit load kernel variant for correctness.

Also add vectorized KV layout coverage to test_batch_prefill_large_kvcache.

* fix: remove unused k_vector_size variable in large_kvcache test

* fix(mha): improve batch_prefill TORCH_CHECK error message for >4GB KV cache

Include page_size, num_pages, and dtype in the error message when kernel
dispatch fails. Add hint about CDNA3+ GPU requirement when KV cache
exceeds 4GB with page_size < 128.

* test: update scatter_pages comment in large_kvcache test

The comment incorrectly stated scatter_pages=True was "expected to FAIL".
This is no longer true — the flat 64-bit load path handles scattered
pages correctly. Update to describe the test's purpose instead.

* fix(mha): widen batch_prefill 64-bit threshold to total KV bytes

The previous check used (num_total_pages - 1) * batch_stride * element_size
which measures the last-page base offset, missing within-page offsets and
producing an off-by-one at exactly INT32_MAX (the largest representable
SRD voffset). Switch to total KV cache footprint (num_total_pages *
batch_stride * element_size > INT32_MAX) so within-page reads on the last
page are covered, and drop the redundant num_total_pages > 1 guard since
single-page configs trivially fit in 32 bits.

Also unify wording: 4GB → 2GB (INT32_MAX byte offset for SRD voffset),
matching CK's TwoGB convention. The actual hardware bound has always been
2GB; the prior comments were imprecise.

Found during batch prefill template dispatch review.
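
In Python terms, the change amounts to the following (a sketch of the arithmetic only, not the C++ dispatch code):

```python
INT32_MAX = 2**31 - 1  # largest SRD voffset: the 2GB byte-offset bound

def old_check(num_total_pages, batch_stride, element_size):
    # Measures only the last page's *base* offset; misses within-page reads
    # and is off by one at exactly INT32_MAX.
    return num_total_pages > 1 and \
        (num_total_pages - 1) * batch_stride * element_size > INT32_MAX

def new_check(num_total_pages, batch_stride, element_size):
    # Total KV footprint, so within-page reads on the last page are covered.
    return num_total_pages * batch_stride * element_size > INT32_MAX
```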

* docs(mha): unify >2GB wording in batch_prefill error and test

The 4GB number in the TORCH_CHECK error message and the test comment was
imprecise — the actual SRD voffset bound is 2GB (INT32_MAX). Update both
to match the threshold check and CK's TwoGB convention.

Found during batch prefill template dispatch review.

* refactor(mha): drop wrapper-side use_64bit_load; let CK dispatcher decide

The wrapper hardcoded kN0_min = 128 to compute the >2GB KV cache
predicate, which leaked CK tile config into aiter and would silently
break if a new arm with bn0 != 128 were added. The CK auto-generated
dispatcher now decides per-arm using its own compile-time bn0 and
per-dtype kElementBytes, so the wrapper just forwards args.

Remove the `use_64bit_load` runtime field from `mha_batch_prefill_traits`,
the parameter from `get_mha_batch_prefill_traits()`, and the entire
predicate computation block from the dispatcher call site. Bumps CK
submodule to pull in the matching codegen change.

* chore(mha): bump CK + update wrapper wording for kUseGlobalLoad rename

Bumps 3rdparty/composable_kernel to dd8d293ea (refactor(fmha): batch
prefill review polish — assert helper + setter guards) which builds on
the prior 99a3ca9af kUseGlobalLoad rename.

Wrapper-side updates to match:

* csrc/cpp_itfs/mha_fwd_batch_prefill.cu: rename "64-bit-load" wording
  in the per-arm dispatcher comment to "kUseGlobalLoad" so the wrapper
  comment matches the CK-side identifier. Also drops the trailing
  `false /* skip_min_seqlen_q */` argument from the
  get_mha_batch_prefill_traits call to match the upstream CK API
  signature change.

* csrc/py_itfs_ck/mha_batch_prefill_kernels.cu: change the >2GB error
  message from "page_size < 128" to "page_size < kN0" so the diagnostic
  tracks the tile-size constant rather than a magic number.

* op_tests/test_batch_prefill.py (test_batch_prefill_large_kvcache):
  three documentation enhancements with no behavior change —
  - explain why qo_len caps at 128 (causal) / 1024 (non-causal): the
    causal cap is a math-backend cliff for the SDPA reference, not a
    kernel limit;
  - explain that the +256 padding on kv_page_indices is a batch_prefill
    ABI requirement (kernel may speculatively read up to bn0=256
    entries past the last valid page index);
  - expand the torch.cuda.synchronize comment to call out the
    misattribution failure mode and GPU-reset cascade risk.

* test(fmha): parametrize test_batch_prefill_large_kvcache over batch_size {1, 4}

Adds multi-batch coverage to the >2GB KV cache regression test.

The previous single-batch coverage left the kernel's per-sequence SRD
rebase path unexercised: with cu_seqlens_q=[0, qo_len] and kv_indptr=
[0, num_blocks], the kernel never walks the indptr to reposition K/V
SRDs across batch boundaries. After the kUseGlobalLoad rename and the
new positive static_assert(kUseGlobalLoad_) calls in update_physical_pages
and set_page_stride_elements, we want a regression that catches any
boundary-crossing SRD bug -- the failure mode no single-batch test can
detect (one batch correct, others wrong).

batch_size=4 partitions the >2GB page pool across 4 sequences (last
sequence absorbs the remainder), exercising 3 cross-batch SRD transitions.
The SDPA reference is computed per-batch and concatenated; per-iteration
free + empty_cache keeps peak memory at one batch's worth.
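
A sketch of that partitioning (illustrative helper, not the test's actual code):

```python
def make_kv_indptr(num_blocks: int, batch_size: int) -> list[int]:
    # Even split; the last sequence absorbs the remainder.
    per_seq = num_blocks // batch_size
    indptr = [i * per_seq for i in range(batch_size)]
    indptr.append(num_blocks)
    return indptr

# batch_size=4 over a 10-page pool -> [0, 2, 4, 6, 10]:
# three cross-batch SRD transitions; the last sequence takes 4 pages.
assert make_kv_indptr(10, 4) == [0, 2, 4, 6, 10]
```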

Verified on:
  - gfx950 (smci355-gfx950, MI355X): 160 passed, 32 skipped
  - gfx942 (smc300x-clt, MI308X):    160 passed, 32 skipped

Skips are the existing vectorized + page_size=1 incompatibility
(3D tensor layout), now 16 per batch_size value.

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
---------

Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>
