Skip to content

Kefan.cao/fix universal copy#429

Open
kefan203 wants to merge 2 commits intoROCm:mainfrom
Deep-Spark:kefan.cao/fix-universal-copy
Open

Kefan.cao/fix universal copy#429
kefan203 wants to merge 2 commits intoROCm:mainfrom
Deep-Spark:kefan.cao/fix-universal-copy

Conversation

@kefan203
Copy link
Copy Markdown
Contributor

Motivation

fly.copy_atom_call using a universal_copy atom with a strided !fly.memref on one side (e.g. !fly.memref<f16, global, 4:8>) was silently lowered to a single contiguous llvm.load / llvm.store / llvm.memcpy against the memory-side pointer, ignoring the memref layout's stride. The result is that adjacent lanes are read or written instead of stride-spaced elements — a silent correctness bug whenever a single atom moves a non-unit-stride slice of memory.

Technical Details

Bug is in CopyOpUniversalCopyType::emitAtomCallSSA and emitAtomCall (lib/Dialect/Fly/IR/FlyUniversalOps.cpp): both paths assumed contiguous memory and emitted a single vector<N × elemTy> load/store (or a single llvm.memcpy for the non-SSA memref-to-memref path).

Fix:

  • Consult the coalesced leaf of the memref's LayoutAttr to obtain static (count, stride).
  • count <= 1 || stride == 1: keep the existing fast path (single vector load/store, or llvm.memcpy for the memref-to-memref path).
  • Otherwise: emit element-wise gather/scatter — per-element llvm.getelementptr with offset i * stride (in elements), applySwizzleOnPtr, then llvm.load + llvm.insertelement (gather side) or llvm.extractelement + llvm.store (scatter side).
  • Swizzle is applied per element pointer rather than once on the base, which also fixes a latent issue for non-trivial swizzles.

Helpers added in the same file: getCoalescedLeafCountAndStride, emitStridedLoadAsVector, emitStridedStoreFromVector.

Behavior change worth noting: memrefs whose layout does not coalesce to a single static leaf previously went through the silently-wrong contiguous path; they now make the lowering fail explicitly. Register-promoted memrefs are unaffected because ConvertAtomCallToSSAForm::isEligibleToPromote already restricts them to single-leaf layouts with stride 1 or shape 1.

Changes are split into two review-friendly commits:

  1. [Test] Add reproducer for strided universal_copy in convert-fly-to-rocdl — adds tests/mlir/Transforms/convert_fly_to_rocdl_universal_copy_strided.mlir with FileCheck lines frozen to the currently-buggy IR.
  2. [Fix] Honor stride in universal copy atom lowering — fixes emitAtomCallSSA / emitAtomCall and updates the CHECK lines to the corrected IR. The diff on the test file in commit 2 is exactly the IR before/after the fix.

Test Plan

  • New reproducer tests/mlir/Transforms/convert_fly_to_rocdl_universal_copy_strided.mlir runs --fly-convert-atom-call-to-ssa-form --convert-fly-to-rocdl against two kernels:
    • load_strided_global_into_register: !fly.memref<f16, global, 4:8> -> !fly.memref<f16, register, 4:1>.
    • store_register_into_strided_global: the reverse direction.
  • FileCheck asserts the lowered IR contains four per-element llvm.getelementptr %arg0[0]/[8]/[16]/[24] with f16 loads/stores and the matching llvm.insertelement / llvm.extractelement chain, proving the stride-8 layout is honored.
  • Full regression over tests/mlir/Transforms/*.mlir (including canonicalize, convert-atom-call-to-ssa-form, layout_lowering, promote_regmem_to_ssa*, rewrite_func_signature) to ensure the strided path did not regress existing contiguous/register lowerings.

Test Result

PASS  canonicalize.mlir
PASS  convert-atom-call-to-ssa-form.mlir
PASS  convert_fly_to_rocdl_universal_copy_strided.mlir
PASS  layout_lowering.mlir
PASS  promote_regmem_to_ssa_copy_pred.mlir
PASS  promote_regmem_to_ssa_invalid.mlir
PASS  promote_regmem_to_ssa.mlir
PASS  rewrite_func_signature.mlir
8 passed, 0 failed

Submission Checklist

@kefan203
Copy link
Copy Markdown
Contributor Author

@coderfeli @sjfeng1999 Could you please review this PR when you have time? Thanks!

@coderfeli coderfeli requested a review from sjfeng1999 April 23, 2026 07:03
@sjfeng1999
Copy link
Copy Markdown
Collaborator

I think a better general approach is to verify the op before lowering atom-call and reject invalid cases at that stage (like contiguous byte count is smaller than copy granularity). Non-contiguous memrefs cannot be handled by copy64b at all. We should better report a clearer diagnostic when constraints conflict, rather than silently generating code that deviates from the program's original intent.`

Freezes current (buggy) lowering: a non-unit-stride !fly.memref on one
side of fly.copy_atom_call is lowered to a single contiguous llvm.load
/ llvm.store against the memory-side pointer, silently ignoring the
stride. Next commit will fix emitAtomCallSSA and update these CHECKs.
`fly.copy_atom_call` / `fly.copy_atom_call_ssa` with
`!fly.universal_copy<...>` used to accept non-contiguous memrefs and
lower them as if the atom operated on a contiguous slice.

Fix this by verifying universal copy operands before lowering:

* the memref layout must coalesce to a single static leaf
* the coalesced bit count must match the copy atom bit width
* the contiguous bit count must not be smaller than the copy granularity

This turns strided/otherwise incompatible memrefs into a clear
verification error instead of silently generating code that deviates
from the original atom semantics.
@kefan203 kefan203 force-pushed the kefan.cao/fix-universal-copy branch from b4130b0 to cd1ac59 Compare April 24, 2026 02:39
@kefan203
Copy link
Copy Markdown
Contributor Author

I think a better general approach is to verify the op before lowering atom-call and reject invalid cases at that stage (like contiguous byte count is smaller than copy granularity). Non-contiguous memrefs cannot be handled by copy64b at all. We should better report a clearer diagnostic when constraints conflict, rather than silently generating code that deviates from the program's original intent.`

Makes sense, thanks. I updated the fix to validate the atom-call operands up front and emit a clear diagnostic for incompatible layouts instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants