@dbsanfte dbsanfte commented Jan 16, 2026

Summary

This PR is an aggregation of fixes discovered while working with ComposableKernel on gfx906 (MI50/MI60) GPUs. These GPUs don't have MFMA instructions, so they rely on the DeviceGemmDl path, which has some edge cases that aren't well tested.

Note: This is a draft PR that will be updated as we discover more issues.


Fix 1: Buffer Load OOB Crash with Large K and Small M

Problem

DeviceGemmDl crashes on gfx906 when K >= 1472 with small M (e.g., M=1 decode case in LLM inference).

The crash occurs in gridwise_gemm_dl_v1r3.hpp during block_sync_lds() after an invalid buffer load.

Root Cause

CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK was disabled by default (set to 0).

Without the offset trick:

  1. Buffer loads execute unconditionally
  2. Bounds check happens after the load returns
  3. If the load address falls in unmapped memory, the GPU faults before the bounds check can discard the result

With the offset trick enabled:

  1. Invalid coordinates get 0x80000000 added to their offset
  2. This flags the load as out-of-bounds to the hardware
  3. The hardware safely returns zero instead of accessing unmapped memory
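
As a minimal sketch of the mechanism (the constant comes from the list above; the helper name and structure are illustrative, not CK's actual implementation):

#include <cstdint>

// Hypothetical helper illustrating the offset trick. Offsets at or beyond
// 0x80000000 fall outside the buffer descriptor's num_records range, so the
// hardware buffer_load returns zero instead of touching unmapped memory.
constexpr std::uint32_t kOobOffsetFlag = 0x80000000u;

inline std::uint32_t make_buffer_load_offset(std::uint32_t byte_offset, bool coord_is_valid)
{
    // Valid coordinates keep their offset; invalid ones are pushed past the
    // descriptor range so the hardware clamps the load result to zero.
    return coord_is_valid ? byte_offset : byte_offset + kOobOffsetFlag;
}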

Solution

  1. include/ck/ck.hpp: Enable CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK by default
  2. include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v5r1.hpp: Use coordinate_has_valid_offset() instead of coordinate_has_valid_offset_assuming_visible_index_is_valid() for full bounds validation
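
A hedged sketch of what the two changes amount to (the exact guard pattern and call sites in the CK headers may differ):

// include/ck/ck.hpp (sketch): default the flag to on so the offset trick is
// active unless the user overrides it at configure time.
#ifndef CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK
#define CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK 1
#endif

// threadwise_tensor_slice_transfer_v5r1.hpp (sketch): validate the full
// coordinate, including the K dimension, before issuing the load.
// Before: coordinate_has_valid_offset_assuming_visible_index_is_valid(src_desc, src_coord)
const bool is_src_valid = coordinate_has_valid_offset(src_desc, src_coord);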

Verification

INT8 GEMM tests pass for:

  • M=1 (single-row decode)
  • K up to 14336
  • FFN projection dimensions for Qwen2.5 and Llama3

Fix 2: GridwiseGemmDlMultipleD Element Op Type Mismatch (FloatAcc != FloatC)

Problem

When FloatAcc differs from FloatC (e.g., INT8×INT8→INT32 accumulator with FP32 output scaling), the CDE element op is invoked with wrong storage types.

The element op contract is: (E& e, const C& c, const D& d...) where:

  • E = FloatC (final output type, e.g., float)
  • C = FloatAcc (accumulator type, e.g., int32_t)
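
For illustration, a strictly-typed element op that follows this contract in the INT8→FP32 case could look like the sketch below (the op name and scaling scheme are hypothetical, not taken from the PR):

#include <cstdint>

// Hypothetical strictly-typed CDE element op: dequantize an int32_t
// accumulator into a float output using a scale carried in D.
struct DequantScale
{
    // e = FloatC storage (float), c = FloatAcc storage (int32_t), d = scale
    void operator()(float& e, const std::int32_t& c, const float& d) const
    {
        e = static_cast<float>(c) * d;
    }
};

With such an op, passing an int32_t reference where E& is expected fails to compile, which is exactly the mismatch described below.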

Root Cause

The original code at lines 615-618 used generate_tie() to return the same c_thread_buf reference for both E& and C&:

auto dst_data_refs = generate_tie(
    [&](auto) -> auto& { return c_thread_buf(Number<c_offset>{}); },
    Number<2>{});

This causes:

  1. Type mismatch when FloatAcc != FloatC (element op expects float& for e, gets int32_t&)
  2. Compile errors with strictly-typed element ops
  3. Undefined behavior during ThreadwiseTensorSliceTransfer, which type-puns FloatAcc bits as FloatC

This bug has existed since the file was created in December 2022 (PR #517).

Solution

include/ck/tensor_operation/gpu/grid/gridwise_gemm_dl_multiple_d.hpp:

  1. Introduce separate e_thread_buf<FloatC> for element op output
  2. Pass (E& e) from e_thread_buf and (const C& c) from c_thread_buf using tie()
  3. Transfer e_thread_buf (not c_thread_buf) to global memory
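
A hedged sketch of the resulting structure (buffer type, size, and loop are illustrative; the actual kernel uses CK's descriptors and transfer classes, and the D operands are omitted for brevity):

// Separate FloatC register buffer for the element-op output; the FloatAcc
// accumulator buffer is left untouched.
StaticBuffer<AddressSpaceEnum::Vgpr, FloatC, c_thread_buf_size, true> e_thread_buf;

static_for<0, c_thread_buf_size, 1>{}([&](auto i) {
    // (E& e) is bound to e_thread_buf and (const C& c) to c_thread_buf, so the
    // element op sees the storage types its signature declares.
    cde_element_op(e_thread_buf(i), c_thread_buf[i]);
});

// ThreadwiseTensorSliceTransfer then writes e_thread_buf (FloatC) to global
// memory, so no FloatAcc bits are reinterpreted as FloatC.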

Minimal Repro

See the original PR #3565 for a compile-time repro that demonstrates the type mismatch.


Environment

  • GPU: gfx906 (MI50)
  • ROCm: 7.1.1
  • Use case: INT8×INT8→INT32 GEMM with FP32 output for LLM inference
