[gfx906] Collection of fixes for MI50/MI60 (non-MFMA) GPUs #3593
Summary
This PR is an aggregation of fixes discovered while working with ComposableKernel on gfx906 (MI50/MI60) GPUs. These GPUs don't have MFMA instructions, so they rely on the `DeviceGemmDl` path, which has some edge cases that aren't well tested.

Note: This is a draft PR that will be updated as we discover more issues.
Fix 1: Buffer Load OOB Crash with Large K and Small M
Problem
`DeviceGemmDl` crashes on gfx906 when K >= 1472 with small M (e.g., the M=1 decode case in LLM inference). The crash occurs in `gridwise_gemm_dl_v1r3.hpp` during `block_sync_lds()` after an invalid buffer load.

Root Cause
`CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` was disabled by default (set to 0).

Without the offset trick:

- loads for out-of-bounds elements are issued with their unguarded offsets, and the resulting invalid access crashes the kernel

With the offset trick enabled:

- `0x80000000` is added to the offset of invalid elements, so the hardware buffer-load range check rejects the access and returns 0
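To make the mechanism concrete, here is a CPU-side model of the trick. This is not CK's actual device code (which uses raw buffer-load intrinsics); the function name and parameters are illustrative:

```cpp
#include <cstdint>
#include <cstdio>

// CPU model of the AMD buffer-load range check, for illustration only.
// On the GPU, a raw buffer_load returns 0 whenever the byte offset falls
// outside the buffer resource's declared size. Adding 0x80000000 to an
// invalid element's offset guarantees the range check rejects the access
// instead of dereferencing a bad address.
float modeled_buffer_load(const float* buf, uint32_t buf_bytes,
                          uint32_t byte_offset, bool element_is_valid)
{
    constexpr uint32_t kOobBump = 0x80000000u; // the offset trick
    const uint32_t effective = byte_offset + (element_is_valid ? 0u : kOobBump);
    if(effective >= buf_bytes)   // hardware range check (modeled in software)
        return 0.0f;             // OOB loads read back as zero
    return buf[effective / sizeof(float)];
}

int main()
{
    const float data[4] = {1.f, 2.f, 3.f, 4.f};
    // A valid element reads real data; an invalid one safely yields 0.
    std::printf("%g %g\n",
                modeled_buffer_load(data, sizeof(data), 4, true),
                modeled_buffer_load(data, sizeof(data), 4, false));
    return 0;
}
```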
Solution

- `include/ck/ck.hpp`: Enable `CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` by default (a sketch of the change follows this list)
- `include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v5r1.hpp`: Use `coordinate_has_valid_offset()` instead of `coordinate_has_valid_offset_assuming_visible_index_is_valid()` for full bounds validation
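What the `ck.hpp` change amounts to, assuming the usual guarded-define pattern (the exact form in the header may differ):

```cpp
// include/ck/ck.hpp (sketch): default the offset trick to enabled (was 0)
#ifndef CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK
#define CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK 1
#endif
```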
Verification

INT8 GEMM tests pass for:
Fix 2: GridwiseGemmDlMultipleD Element Op Type Mismatch (FloatAcc != FloatC)
Problem
When `FloatAcc` differs from `FloatC` (e.g., INT8×INT8→INT32 accumulator with FP32 output scaling), the CDE element op is invoked with the wrong storage types. The element op contract is `(E& e, const C& c, const D& d...)`, where:

- `E` = `FloatC` (final output type, e.g., `float`)
- `C` = `FloatAcc` (accumulator type, e.g., `int32_t`)
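For illustration, here is a hypothetical op obeying this contract for the INT8×INT8→INT32 case with one FP32 scale tensor. The struct name is made up, and the `__host__ __device__` qualifiers CK would use are omitted to keep it plain C++:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical CDE element op: e is the final output (E = FloatC = float),
// c is the accumulator (C = FloatAcc = int32_t), d is an FP32 scale tensor.
struct ScaleInt32ToFloat
{
    void operator()(float& e, const int32_t& c, const float& d) const
    {
        e = static_cast<float>(c) * d; // dequantize the INT32 accumulator
    }
};

int main()
{
    float e = 0.f;
    ScaleInt32ToFloat{}(e, int32_t{300}, 0.5f);
    std::printf("%g\n", e); // 150
    return 0;
}
```

Root Cause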
The original code at lines 615-618 used `generate_tie()`, returning the same `c_thread_buf` for both `E&` and `C&`. This causes:

- a type mismatch when `FloatAcc != FloatC` (the element op expects `float&` for `e` but gets `int32_t&`)
- a subsequent `ThreadwiseTensorSliceTransfer` which type-puns the `FloatAcc` bits as `FloatC`

This bug has existed since the file was created in December 2022 (PR #517).
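The effect of that type pun is easy to reproduce on the CPU. This standalone snippet (illustrative, not CK code) reinterprets an `int32_t` accumulator's bits as `float`, which is exactly what the transfer ends up doing:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    const int32_t acc = 150;       // a correct FloatAcc value in c_thread_buf
    float punned;
    std::memcpy(&punned, &acc, sizeof(punned)); // what the type pun does
    std::printf("int32 %d read back as float: %g\n", acc, punned);
    // Prints roughly 2.1e-43 -- denormal garbage, not 150.0f.
    return 0;
}
```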
Solution
`include/ck/tensor_operation/gpu/grid/gridwise_gemm_dl_multiple_d.hpp`:

- Allocate a separate `e_thread_buf<FloatC>` for the element op output
- Pass `(E& e)` from `e_thread_buf` and `(const C& c)` from `c_thread_buf` using `tie()`
- Write `e_thread_buf` (not `c_thread_buf`) to global memory (see the sketch after this list)
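A simplified model of the corrected pattern, assuming `FloatAcc = int32_t` and `FloatC = float`. CK's real code iterates a `StaticBuffer` thread tile with `tie()`, so this shows only the buffer separation:

```cpp
#include <array>
#include <cstdint>

// Model of the fix: the element op writes into a separate FloatC buffer,
// and only that buffer is transferred to global memory afterwards.
template <std::size_t N, typename CDEOp>
void apply_cde_op(const std::array<int32_t, N>& c_thread_buf, // FloatAcc
                  const std::array<float, N>&   d_thread_buf, // D tensor
                  std::array<float, N>&         e_thread_buf, // FloatC output
                  const CDEOp&                  cde_op)
{
    for(std::size_t i = 0; i < N; ++i)
    {
        // (E& e, const C& c, const D& d): every argument now has its
        // declared storage type, so no bits are reinterpreted.
        cde_op(e_thread_buf[i], c_thread_buf[i], d_thread_buf[i]);
    }
    // e_thread_buf (not c_thread_buf) is what the caller writes out.
}
```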
Minimal Repro

See the original PR #3565 for a compile-time repro that demonstrates the type mismatch.
Environment