Add unaligned Blit kernel #1780

Draft
AidanBeltonS wants to merge 1 commit into ROCm:amd-staging from AidanBeltonS:users/abeltons/unaligned_blit_kernel


@AidanBeltonS AidanBeltonS commented Mar 17, 2026

Currently, an unaligned memset operation is implemented with three separate kernels. This kernel handles both the aligned and unaligned cases in a single kernel, achieving better performance for the unaligned case while maintaining aligned performance.
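To illustrate the idea of covering an unaligned range in one pass, here is a minimal host-side sketch, not the kernel from this PR. It assumes a 16-byte store width (`kVec`); the names `Split`, `split_range`, and `fill_unaligned` are hypothetical. The range is split into an unaligned head, a body of full-width stores, and a tail, and an aligned range simply ends up with an empty head and tail.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed store width in bytes; the real kernel may use a different width.
constexpr std::size_t kVec = 16;

// Hypothetical helper type: how a byte range splits around kVec boundaries.
struct Split {
    std::size_t head;  // bytes before the first kVec-aligned address
    std::size_t body;  // bytes covered by full kVec-wide stores
    std::size_t tail;  // bytes left after the last full store
};

inline Split split_range(std::uintptr_t addr, std::size_t size) {
    const std::size_t misalign = addr % kVec;
    std::size_t head = (misalign == 0) ? 0 : kVec - misalign;
    if (head > size) head = size;  // range ends before the first boundary
    const std::size_t rest = size - head;
    const std::size_t body = rest - rest % kVec;
    return {head, body, rest - body};
}

// Fills size bytes at dst with value, using wide copies for the body.
inline void fill_unaligned(unsigned char* dst, unsigned char value, std::size_t size) {
    const Split s = split_range(reinterpret_cast<std::uintptr_t>(dst), size);

    unsigned char pattern[kVec];
    for (std::size_t i = 0; i < kVec; ++i) pattern[i] = value;

    std::size_t off = 0;
    for (; off < s.head; ++off) dst[off] = value;            // head: byte stores
    for (const std::size_t end = s.head + s.body; off < end; off += kVec)
        std::memcpy(dst + off, pattern, kVec);               // body: stand-in for wide stores
    for (; off < size; ++off) dst[off] = value;              // tail: byte stores
}
```

On a GPU the three regions would be assigned to different work-items rather than run as sequential loops; the sketch only shows the address arithmetic.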

This has been tested inside HIP and will be wired into CLR in a separate PR.

Unaligned results:

| Size (bytes) | Develop: Time (ns) | Develop: Throughput (GiB/s) | UnalignedKernel: Time (ns) | UnalignedKernel: Throughput (GiB/s) | Change in time (- is better) | Change in throughput (+ is better) |
|---|---|---|---|---|---|---|
| 512 | 7857 | 0.242758 | 4550 | 0.419198 | -42% | 73% |
| 4096 | 8606 | 1.77304 | 4607 | 3.31209 | -46% | 87% |
| 32768 | 8731 | 13.9813 | 4206 | 29.0229 | -52% | 108% |
| 262144 | 8682 | 112.481 | 4159 | 234.807 | -52% | 109% |
| 2097152 | 8587 | 909.806 | 4748 | 1645.43 | -45% | 81% |
| 16777216 | 15882 | 3935.27 | 11998 | 5209.2 | -24% | 32% |
| 134217728 | 115994 | 4310.57 | 112568 | 4441.76 | -3% | 3% |
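For reference, the percentage columns in the table above are consistent with a plain relative change against the Develop baseline, rounded to the nearest whole percent. The helpers below are a hypothetical sketch of that arithmetic, not code from the benchmark itself.

```cpp
#include <cmath>

// Hypothetical helper: relative change from a baseline value to an updated
// value, expressed in percent. Negative means the updated value is smaller.
inline double percent_change(double baseline, double updated) {
    return (updated - baseline) / baseline * 100.0;
}

// Rounds to the nearest whole percent, which appears to match the table.
inline long rounded_percent_change(double baseline, double updated) {
    return std::lround(percent_change(baseline, updated));
}
```

For the 512 row, `rounded_percent_change(7857, 4550)` gives the time change and `rounded_percent_change(0.242758, 0.419198)` gives the throughput change.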

@z1-cciauto (Collaborator)

uint g = __builtin_amdgcn_workgroup_id_x();
ulong id = (g * 256 + l);

// Handle head, body and tail - each store in a separate warp to reduce divergence

Future hardware could have a wavefront size of 32. How would that affect the kernel's performance?
