Kefan.cao/fix universal copy by kefan203 · Pull Request #429 · ROCm/FlyDSL

kefan203 · 2026-04-23T06:43:45Z

Motivation

fly.copy_atom_call using a universal_copy atom with a strided !fly.memref on one side (e.g. !fly.memref<f16, global, 4:8>) was silently lowered to a single contiguous llvm.load / llvm.store / llvm.memcpy against the memory-side pointer, ignoring the memref layout's stride. The result is that adjacent lanes are read or written instead of stride-spaced elements — a silent correctness bug whenever a single atom moves a non-unit-stride slice of memory.

Technical Details

Bug is in CopyOpUniversalCopyType::emitAtomCallSSA and emitAtomCall (lib/Dialect/Fly/IR/FlyUniversalOps.cpp): both paths assumed contiguous memory and emitted a single vector<N × elemTy> load/store (or a single llvm.memcpy for the non-SSA memref-to-memref path).

Fix:

Consult the coalesced leaf of the memref's LayoutAttr to obtain static (count, stride).
count <= 1 || stride == 1: keep the existing fast path (single vector load/store, or llvm.memcpy for the memref-to-memref path).
Otherwise: emit element-wise gather/scatter — per-element llvm.getelementptr with offset i * stride (in elements), applySwizzleOnPtr, then llvm.load + llvm.insertelement (gather side) or llvm.extractelement + llvm.store (scatter side).
Swizzle is applied per element pointer rather than once on the base, which also fixes a latent issue for non-trivial swizzles.

Helpers added in the same file: getCoalescedLeafCountAndStride, emitStridedLoadAsVector, emitStridedStoreFromVector.

Behavior change worth noting: memrefs whose layout does not coalesce to a single static leaf previously went through the silently-wrong contiguous path; they now make the lowering fail explicitly. Register-promoted memrefs are unaffected because ConvertAtomCallToSSAForm::isEligibleToPromote already restricts them to single-leaf layouts with stride 1 or shape 1.

Changes are split into two review-friendly commits:

[Test] Add reproducer for strided universal_copy in convert-fly-to-rocdl — adds tests/mlir/Transforms/convert_fly_to_rocdl_universal_copy_strided.mlir with FileCheck lines frozen to the currently-buggy IR.
[Fix] Honor stride in universal copy atom lowering — fixes emitAtomCallSSA / emitAtomCall and updates the CHECK lines to the corrected IR. The diff on the test file in commit 2 is exactly the IR before/after the fix.

Test Plan

New reproducer tests/mlir/Transforms/convert_fly_to_rocdl_universal_copy_strided.mlir runs --fly-convert-atom-call-to-ssa-form --convert-fly-to-rocdl against two kernels:
- load_strided_global_into_register: !fly.memref<f16, global, 4:8> -> !fly.memref<f16, register, 4:1>.
- store_register_into_strided_global: the reverse direction.
FileCheck asserts the lowered IR contains four per-element llvm.getelementptr %arg0[0]/[8]/[16]/[24] with f16 loads/stores and the matching llvm.insertelement / llvm.extractelement chain, proving the stride-8 layout is honored.
Full regression over tests/mlir/Transforms/*.mlir (including canonicalize, convert-atom-call-to-ssa-form, layout_lowering, promote_regmem_to_ssa*, rewrite_func_signature) to ensure the strided path did not regress existing contiguous/register lowerings.

Test Result

PASS  canonicalize.mlir
PASS  convert-atom-call-to-ssa-form.mlir
PASS  convert_fly_to_rocdl_universal_copy_strided.mlir
PASS  layout_lowering.mlir
PASS  promote_regmem_to_ssa_copy_pred.mlir
PASS  promote_regmem_to_ssa_invalid.mlir
PASS  promote_regmem_to_ssa.mlir
PASS  rewrite_func_signature.mlir
8 passed, 0 failed

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

kefan203 · 2026-04-23T06:45:37Z

@coderfeli @sjfeng1999 Could you please review this PR when you have time? Thanks!

sjfeng1999 · 2026-04-23T16:01:51Z

I think a better general approach is to verify the op before lowering atom-call and reject invalid cases at that stage (like contiguous byte count is smaller than copy granularity). Non-contiguous memrefs cannot be handled by copy64b at all. We should better report a clearer diagnostic when constraints conflict, rather than silently generating code that deviates from the program's original intent.`

Freezes current (buggy) lowering: a non-unit-stride !fly.memref on one side of fly.copy_atom_call is lowered to a single contiguous llvm.load / llvm.store against the memory-side pointer, silently ignoring the stride. Next commit will fix emitAtomCallSSA and update these CHECKs.

`fly.copy_atom_call` / `fly.copy_atom_call_ssa` with `!fly.universal_copy<...>` used to accept non-contiguous memrefs and lower them as if the atom operated on a contiguous slice. Fix this by verifying universal copy operands before lowering: * the memref layout must coalesce to a single static leaf * the coalesced bit count must match the copy atom bit width * the contiguous bit count must not be smaller than the copy granularity This turns strided/otherwise incompatible memrefs into a clear verification error instead of silently generating code that deviates from the original atom semantics.

kefan203 · 2026-04-24T02:40:49Z

I think a better general approach is to verify the op before lowering atom-call and reject invalid cases at that stage (like contiguous byte count is smaller than copy granularity). Non-contiguous memrefs cannot be handled by copy64b at all. We should better report a clearer diagnostic when constraints conflict, rather than silently generating code that deviates from the program's original intent.`

Makes sense, thanks. I updated the fix to validate the atom-call operands up front and emit a clear diagnostic for incompatible layouts instead.

coderfeli requested a review from sjfeng1999 April 23, 2026 07:03

kefan203 added 2 commits April 24, 2026 02:35

kefan203 force-pushed the kefan.cao/fix-universal-copy branch from b4130b0 to cd1ac59 Compare April 24, 2026 02:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kefan.cao/fix universal copy#429

Kefan.cao/fix universal copy#429
kefan203 wants to merge 2 commits intoROCm:mainfrom
Deep-Spark:kefan.cao/fix-universal-copy

kefan203 commented Apr 23, 2026

Uh oh!

kefan203 commented Apr 23, 2026

Uh oh!

sjfeng1999 commented Apr 23, 2026

Uh oh!

kefan203 commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kefan203 commented Apr 23, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

kefan203 commented Apr 23, 2026

Uh oh!

sjfeng1999 commented Apr 23, 2026

Uh oh!

kefan203 commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants