[Tracking] Release v0.1.13 validation (DO NOT MERGE)#2913
Conversation
🏷️ CI Guide
Runs automatically on every PR:
Extended tests (opt-in via labels):
Pull request overview
Tracking PR to run extended CI validation for `release/v0.1.13`, including a cherry-pick that lets the release workflow's manylinux build pin or override the PyTorch wheel source for ABI-targeted builds.
Changes:
- Add `torch_pin` and `torch_index_url` inputs to `workflow_dispatch` and `workflow_call`.
- Wire the new inputs into job environment variables.
- Update the manylinux dependency install step to optionally use the provided index URL and torch version spec.
```sh
# legacy auto-derived behavior.
if [ -n "${TORCH_INDEX_URL}" ]; then
  TORCH_INDEX="${TORCH_INDEX_URL}"
else
  TORCH_INDEX="https://download.pytorch.org/whl/rocm${ROCM_NUM}"
```
`TORCH_INDEX_URL` is user-provided now, but it's accepted verbatim as the pip `--index-url`. To reduce supply-chain risk and accidental misuse, consider validating this input before use (e.g., require `https://` and/or restrict to an allowlist like `download.pytorch.org` unless an explicit `unsafe_allow_custom_index` flag is set).
```diff
-# legacy auto-derived behavior.
-if [ -n "${TORCH_INDEX_URL}" ]; then
-  TORCH_INDEX="${TORCH_INDEX_URL}"
-else
-  TORCH_INDEX="https://download.pytorch.org/whl/rocm${ROCM_NUM}"
+# legacy auto-derived behavior. Custom indexes must use HTTPS and
+# are restricted to download.pytorch.org unless explicitly allowed.
+DEFAULT_TORCH_INDEX="https://download.pytorch.org/whl/rocm${ROCM_NUM}"
+if [ -n "${TORCH_INDEX_URL}" ]; then
+  case "${TORCH_INDEX_URL}" in
+    https://*)
+      ;;
+    *)
+      echo "TORCH_INDEX_URL must start with https://" >&2
+      exit 1
+      ;;
+  esac
+  case "${TORCH_INDEX_URL}" in
+    https://download.pytorch.org/*)
+      TORCH_INDEX="${TORCH_INDEX_URL}"
+      ;;
+    *)
+      if [ "${UNSAFE_ALLOW_CUSTOM_TORCH_INDEX}" = "true" ]; then
+        TORCH_INDEX="${TORCH_INDEX_URL}"
+      else
+        echo "TORCH_INDEX_URL host is not allowed. Set UNSAFE_ALLOW_CUSTOM_TORCH_INDEX=true to permit a custom HTTPS index." >&2
+        exit 1
+      fi
+      ;;
+  esac
+else
+  TORCH_INDEX="${DEFAULT_TORCH_INDEX}"
```
```sh
# Optional torch version pin (e.g. 2.10.0+rocm7.1). Empty = latest.
if [ -n "${TORCH_PIN}" ]; then
  TORCH_SPEC="torch==${TORCH_PIN}"
else
  TORCH_SPEC="torch"
fi
echo "Torch index: ${TORCH_INDEX}"
```
Since `TORCH_INDEX` can now come from the caller-provided `torch_index_url`, echoing it verbatim can leak credentials if someone passes a URL containing embedded auth (or other sensitive query params). Consider either rejecting URLs with credentials, or logging a sanitized/redacted form of the URL.
```diff
+SAFE_TORCH_INDEX="$(TORCH_INDEX="${TORCH_INDEX}" python3 - <<'PY'
+import os
+from urllib.parse import urlsplit, urlunsplit
+url = os.environ["TORCH_INDEX"]
+parts = urlsplit(url)
+if not parts.scheme or not parts.netloc:
+    print(url)
+else:
+    host = parts.hostname or ""
+    if parts.port is not None:
+        host = f"{host}:{parts.port}"
+    print(urlunsplit((parts.scheme, host, parts.path, "", "")))
+PY
+)"
 # Optional torch version pin (e.g. 2.10.0+rocm7.1). Empty = latest.
 if [ -n "${TORCH_PIN}" ]; then
   TORCH_SPEC="torch==${TORCH_PIN}"
 else
   TORCH_SPEC="torch"
 fi
-echo "Torch index: ${TORCH_INDEX}"
+echo "Torch index: ${SAFE_TORCH_INDEX}"
```
#2893)

* test: expand test_batch_prefill_large_kvcache for >4GB KV cache overflow

  Rewrite test_batch_prefill_large_kvcache to validate the per-tile SRD rebase fix for >4GB KV caches across all page sizes, dtypes, and attention configurations:

  - Add page_size=1 and 16 (page_size < kN0, exercises rebase path)
  - Add GQA (16, 8) in addition to MHA (8, 8)
  - Add causal masking with CK-compatible attn_mask for SDPA reference
  - Use full KV cache (4.5GB) with pages spanning the overflow boundary
  - Use torch SDPA as reference (memory-efficient backend, no score matrix materialization)
  - Add scatter_pages parameter (False only; True for future global_load_lds flat addressing)
  - Add GPU memory check to skip configs that exceed HBM capacity

  Test matrix: 24 cases (3 page_sizes × 2 dtypes × 2 causal × 2 GQA × 1 scatter)

* test: add GPU sync after CK kernel in large_kvcache test

  Add torch.cuda.synchronize() after the CK kernel launch in test_batch_prefill_large_kvcache to ensure all async GPU work completes before memory is freed between tests. Without this sync, repeated allocate/free cycles of large KV cache buffers (~20GB) with mixed dtype (bf16→fp8) can trigger GPU page faults when the HIP memory allocator reuses virtual addresses that are still referenced by pending async GPU work. The fault manifests as VM_L2_PROTECTION_FAULT at address 0x0 (NULL), causing GPU reset and kernel soft lockup.

* feat(fmha): runtime dispatch for >4GB KV cache in batch prefill

  Add use_64bit_load to batch prefill traits and runtime overflow detection. When page_block_size < 128 and max_page_byte_offset > INT32_MAX, dispatch to the flat 64-bit load kernel variant for correctness. Also add vectorized KV layout coverage to test_batch_prefill_large_kvcache.

* fix: remove unused k_vector_size variable in large_kvcache test

* fix(mha): improve batch_prefill TORCH_CHECK error message for >4GB KV cache

  Include page_size, num_pages, and dtype in the error message when kernel dispatch fails. Add a hint about the CDNA3+ GPU requirement when the KV cache exceeds 4GB with page_size < 128.

* test: update scatter_pages comment in large_kvcache test

  The comment incorrectly stated scatter_pages=True was "expected to FAIL". This is no longer true; the flat 64-bit load path handles scattered pages correctly. Update to describe the test's purpose instead.

* fix(mha): widen batch_prefill 64-bit threshold to total KV bytes

  The previous check used (num_total_pages - 1) * batch_stride * element_size, which measures the last-page base offset, missing within-page offsets and producing an off-by-one at exactly INT32_MAX (the largest representable SRD voffset). Switch to the total KV cache footprint (num_total_pages * batch_stride * element_size > INT32_MAX) so within-page reads on the last page are covered, and drop the redundant num_total_pages > 1 guard since single-page configs trivially fit in 32 bits.

  Also unify wording: 4GB → 2GB (INT32_MAX byte offset for the SRD voffset), matching CK's TwoGB convention. The actual hardware bound has always been 2GB; the prior comments were imprecise.

  Found during batch prefill template dispatch review.

* docs(mha): unify >2GB wording in batch_prefill error and test

  The 4GB number in the TORCH_CHECK error message and the test comment was imprecise; the actual SRD voffset bound is 2GB (INT32_MAX). Update both to match the threshold check and CK's TwoGB convention.

  Found during batch prefill template dispatch review.
* refactor(mha): drop wrapper-side use_64bit_load; let CK dispatcher decide

  The wrapper hardcoded kN0_min = 128 to compute the >2GB KV cache predicate, which leaked CK tile config into aiter and would silently break if a new arm with bn0 != 128 were added. The CK auto-generated dispatcher now decides per-arm using its own compile-time bn0 and per-dtype kElementBytes, so the wrapper just forwards args.

  Remove the `use_64bit_load` runtime field from `mha_batch_prefill_traits`, the parameter from `get_mha_batch_prefill_traits()`, and the entire predicate computation block from the dispatcher call site.

  Bumps the CK submodule to pull in the matching codegen change.

* chore(mha): bump CK + update wrapper wording for kUseGlobalLoad rename

  Bumps 3rdparty/composable_kernel to dd8d293ea (refactor(fmha): batch prefill review polish: assert helper + setter guards), which builds on the prior 99a3ca9af kUseGlobalLoad rename. Wrapper-side updates to match:

  * csrc/cpp_itfs/mha_fwd_batch_prefill.cu: rename "64-bit-load" wording in the per-arm dispatcher comment to "kUseGlobalLoad" so the wrapper comment matches the CK-side identifier. Also drop the trailing `false /* skip_min_seqlen_q */` argument from the get_mha_batch_prefill_traits call to match the upstream CK API signature change.
  * csrc/py_itfs_ck/mha_batch_prefill_kernels.cu: change the >2GB error message from "page_size < 128" to "page_size < kN0" so the diagnostic tracks the tile-size constant rather than a magic number.
  * op_tests/test_batch_prefill.py (test_batch_prefill_large_kvcache): three documentation enhancements with no behavior change:
    - explain why qo_len caps at 128 (causal) / 1024 (non-causal): the causal cap is a math-backend cliff for the SDPA reference, not a kernel limit;
    - explain that the +256 padding on kv_page_indices is a batch_prefill ABI requirement (the kernel may speculatively read up to bn0=256 entries past the last valid page index);
    - expand the torch.cuda.synchronize comment to call out the misattribution failure mode and GPU-reset cascade risk.

* test(fmha): parametrize test_batch_prefill_large_kvcache over batch_size {1, 4}

  Adds multi-batch coverage to the >2GB KV cache regression test. The previous single-batch coverage left the kernel's per-sequence SRD rebase path unexercised: with cu_seqlens_q=[0, qo_len] and kv_indptr=[0, num_blocks], the kernel never walks the indptr to reposition K/V SRDs across batch boundaries. After the kUseGlobalLoad rename and the new positive static_assert(kUseGlobalLoad_) calls in update_physical_pages and set_page_stride_elements, we want a regression that catches any boundary-crossing SRD bug: the failure mode no single-batch test can detect (one batch correct, others wrong).

  batch_size=4 partitions the >2GB page pool across 4 sequences (the last sequence absorbs the remainder), exercising 3 cross-batch SRD transitions. The SDPA reference is computed per-batch and concatenated; per-iteration free + empty_cache keeps peak memory at one batch's worth.

  Verified on:
  - gfx950 (smci355-gfx950, MI355X): 160 passed, 32 skipped
  - gfx942 (smc300x-clt, MI308X): 160 passed, 32 skipped

  Skips are the existing vectorized + page_size=1 incompatibility (3D tensor layout), now 16 per batch_size value.

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
---------

Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>
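To make the widened 64-bit threshold above concrete, here is a minimal Python sketch of the two predicates from the fix(mha) commit; the sizes are illustrative, not taken from the kernel:

```python
# Old vs. new >2GB predicates, with made-up sizes chosen so the total
# KV footprint is exactly 2 GiB (one byte past the largest SRD voffset).
INT32_MAX = 2**31 - 1

num_total_pages = 2048   # illustrative
batch_stride = 2**19     # elements per page (illustrative)
element_size = 2         # bytes per element (bf16)

# Old predicate: only the last-page *base* offset.
old = (num_total_pages - 1) * batch_stride * element_size
# New predicate: total KV cache footprint.
new = num_total_pages * batch_stride * element_size

print(old > INT32_MAX)   # False: old check stays on the 32-bit SRD path
print(new > INT32_MAX)   # True:  widened check dispatches kUseGlobalLoad
# Reads within the last page land past its base offset, which is the
# region the old predicate failed to cover.
```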
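And a rough sketch of the batch_size=4 page-pool partitioning from the last commit, assuming a CSR-style kv_indptr layout (the page count is made up):

```python
# Split a page pool across batch_size sequences; the last sequence
# absorbs the remainder, as the commit message describes.
num_blocks = 1030   # illustrative page count spanning the 2GB boundary
batch_size = 4

per_seq = num_blocks // batch_size
kv_indptr = [i * per_seq for i in range(batch_size)] + [num_blocks]
print(kv_indptr)  # [0, 257, 514, 771, 1030]
# The three interior boundaries (257, 514, 771) are the cross-batch SRD
# transitions that a single-batch test (kv_indptr=[0, num_blocks]) never hits.
```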
Purpose
Tracking-only PR to fire `ci:all` extended validation against `release/v0.1.13`. DO NOT MERGE.
Branch state
- 74dcf38d
- e039817a

Why this commit as v0.1.13 base
- 4f3c57970c (2026-04-23 14:46Z) ran 14/14 accuracy pass against this AITER head.
- 9ea09987bf (2026-04-23 13:05Z) also 14/14 pass against this AITER head.
- e039817a is 3/3 green (DSR1 MI300X + MI355X, gpt-oss-120b MI355X).

Validation matrix (target)