Skip to content

Refactor Memory.h and simply the API#3742

Open
guangyey wants to merge 1 commit into
mainfrom
guangyey/memcpy
Open

Refactor Memory.h and simply the API#3742
guangyey wants to merge 1 commit into
mainfrom
guangyey/memcpy

Conversation

@guangyey
Copy link
Copy Markdown
Contributor

@guangyey guangyey commented May 22, 2026

Motivation

Centralize XPU memory copy/set operations into comm/Memory.h. Previously, all APIs defined in Memory.h are unused. I refactor it and provide some APIs could be reused in code.

Raw queue.memcpy() + record_event() pairs were scattered across the codebase, making it easy to miss event recording and leak pinned memory lifetime bugs. This introduces a small set of typed wrappers that encapsulate the correct synchronization and event-recording protocol, then migrates all callsites to use them.

Benefits: correctness by construction (can't forget record_event), single point of maintenance.

@github-actions github-actions Bot added disable_e2e Disable all e2e test jobs for the PR disable_distributed Disable distributed UT test jobs for the PR labels May 22, 2026
@chuanqi129 chuanqi129 marked this pull request as draft May 22, 2026 10:35
@chuanqi129 chuanqi129 marked this pull request as ready for review May 22, 2026 10:35
@guangyey guangyey force-pushed the guangyey/memcpy branch 4 times, most recently from 1b15397 to 529a550 Compare May 22, 2026 11:18
@guangyey guangyey requested a review from Copilot May 22, 2026 11:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Skill file(s) read: .github/skills/xpu-ops-pr-review/SKILL.md.

This PR refactors XPU SYCL memory copy/memset helpers into src/comm/Memory.h, introducing a small set of wrappers intended to standardize synchronization and pinned-memory event recording, and migrates multiple callsites to use the new APIs.

Changes:

  • Replaced scattered queue.memcpy()/memset() usages (plus ad-hoc record_event) with centralized wrappers (memcpyAndSync, memcpyAsync, memcpyHostToDeviceAsync, memcpyPinnedHostToDeviceAsync, memcpyDeviceToHostAsync, memsetAndSync, memsetAsync).
  • Migrated key callsites (copy path, foreach/multi-tensor metadata uploads, scalar extraction, sparse CSR add, resize) to use the wrappers.

Must-fix (blocking):

  • src/ATen/native/xpu/Copy.cpp device-to-device “memcpy-eligible” path now enqueues the copy via xpu::sycl::memcpyAsync, which uses the current stream internally rather than the provided copy_stream. This can break the cross-device barrier/synchronization logic and cause races.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/comm/Memory.h Adds centralized SYCL memcpy/memset wrappers with pinned-host staging and event recording.
src/ATen/native/xpu/XPUScalar.cpp Uses memcpyAndSync for scalar extraction instead of raw queue memcpy+wait.
src/ATen/native/xpu/sycl/ResizeKernel.cpp Uses centralized async memcpy for storage resize copy.
src/ATen/native/xpu/sycl/MultiTensorApply.h Uses pinned-H2D wrapper for metadata uploads and encapsulated event recording.
src/ATen/native/xpu/sycl/ForeachReduceKernels.cpp Uses pinned-H2D wrapper for metadata/count uploads and encapsulated event recording.
src/ATen/native/xpu/Copy.cpp Refactors memcpy paths to use centralized wrappers (introduces stream/queue mismatch bug).
src/ATen/native/sparse/xpu/sycl/SparseCsrTensorAddKernels.cpp Uses centralized memset/memcpy wrappers for async set and sync readback of nnz.

auto src = iter.data_ptr(1);
size_t size = iter.numel() * iter.element_size(0);
q.copy(src, dst, size);
xpu::sycl::memcpyAsync(dst, src, size);
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

disable_distributed Disable distributed UT test jobs for the PR disable_e2e Disable all e2e test jobs for the PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants