Conversation

@ErwinTerpstra (Contributor) commented Jan 8, 2026

Proposed changes

This MR implements batched GEMM + bias + permute for RDNA3/4. In practice, this is a multidimensional contraction operation. The MR contains the following:

  • Profiler and test infrastructure for the batched contraction instances, as this was not implemented yet for XDL versions
  • Device struct for batched contraction using WMMA instructions (device_batched_contraction_multiple_d_wmma_cshuffle_v3)
  • Changes to the GridwiseGemmWmmaCShuffleV3 to allow passing in non-naive grid descriptors

Note that support for different dimensions and D tensor configurations is very limited at the moment. More scaffolding would be needed to add generic support for a variable number of dimensions, but with this limited implementation there is at least parity with the XDL versions.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@EnricoDeg (Contributor):
Can you also add an example for wmma?

@EnricoDeg (Contributor) left a comment:
Nice work !

Comment on lines +554 to +580
__host__ __device__ constexpr long_index_t GetAPtrOffset(index_t g_idx) const
{
    return static_cast<long_index_t>(g_idx) * batch_stride_A_;
}

__host__ __device__ constexpr long_index_t GetBPtrOffset(index_t g_idx) const
{
    return static_cast<long_index_t>(g_idx) * batch_stride_B_;
}

__host__ __device__ constexpr auto GetDsPtrOffset(index_t g_idx) const
{
    std::array<long_index_t, NumDTensor> ds_offset;

    static_for<0, NumDTensor, 1>{}([&](auto i) {
        ds_offset[i] = static_cast<long_index_t>(g_idx) *
                       ds_grid_desc_g_m_n_[i].CalculateOffset(make_multi_index(1, 0, 0));
    });

    return ds_offset;
}

__host__ __device__ constexpr long_index_t GetEPtrOffset(index_t g_idx) const
{
    return static_cast<long_index_t>(g_idx) *
           e_grid_desc_g_m_n_.CalculateOffset(make_multi_index(1, 0, 0));
}
Contributor:
This is confusing to me. Why is the stride used for A and B, but the grid descriptor for D and E?

Contributor Author:
For A and B no 3D grid descriptor is created (no GMN, just MN), because it isn't used anywhere. I assumed D and E must use a grid descriptor because the transformation can be non-trivial (I think because E is permuted, although that probably doesn't change the batch stride).

But yeah, it's a bit inconsistent. I could make it more consistent if you want.

Contributor:

I think it's fine. It was just unclear to me looking at the code. Maybe add a comment about it

Comment on lines 780 to 787
if(GridwiseGemm::CalculateHasMainKBlockLoop(arg.K))
{
    return launch_kernel(integral_constant<bool, true>{});
}
else
{
    return launch_kernel(integral_constant<bool, false>{});
}
Contributor:
Shouldn't we define tailNum?

Contributor Author:

Good catch, that was missing. I added it and added some small test cases (where HasMainKBlock == false) to verify that it works.

Comment on lines 47 to 49
DeviceBatchedContractionMultipleD_Wmma_CShuffle_V3< 1, 2, 3, 1, F16, F16, F32, F16, F16_Tuple, F16, PassThrough, PassThrough, Add, GemmSpec, ABSpec, ABSpec, DESpec, 256, 256, 128, 32, 8, 8, 16, 16, 4, 4, S<4, 64, 1>, S<1, 0, 2>, S<1, 0, 2>, 2, 8, 8, 1, S<4, 64, 1>, S<1, 0, 2>, S<1, 0, 2>, 2, 8, 8, 1, 1, 1, S<1, 32, 1, 8>, S<1, 1>>
// clang-format on
>;
Contributor:
Maybe it's better to have a few more instances to check correctness in the tests

Contributor Author:

Done

@ApoorvaKalyani (Contributor) left a comment:

Great work! I also think we need more instances, and we need to re-verify the tests for those.

@ErwinTerpstra (Contributor Author):

@EnricoDeg @ApoorvaKalyani Thank you for the reviews. I processed the comments, added an example, and added a couple of instances for both the v1 and v3 pipelines. Let me know if there's still something you'd like to see changed.


4 participants