Add intrinsic for launch-sized workgroup memory on GPUs by Flakebi · Pull Request #146181 · rust-lang/rust

Flakebi · 2025-09-03T22:32:23Z

Workgroup memory is a memory region that is shared between all
threads in a workgroup on GPUs. Workgroup memory can be allocated
statically or after compilation, when launching a gpu-kernel.
The intrinsic added here returns the pointer to the memory that is
allocated at launch-time.

Interface

With this change, workgroup memory can be accessed in Rust by
calling the new gpu_launch_sized_workgroup_mem<T>() -> *mut T
intrinsic.

It returns the pointer to workgroup memory guaranteeing that it is
aligned to at least the alignment of T.
The pointer is dereferenceable for the size specified when launching the
current gpu-kernel (which may be the size of T but can also be larger
or smaller or zero).

All calls to this intrinsic return a pointer to the same address.

See the intrinsic documentation for more details.
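
To make the contract above concrete, here is a minimal host-side model in Rust. It does not call the real intrinsic, which is only meaningful on GPU targets. The names `LaunchBuffer`, `LAUNCH_SIZE_BYTES`, `mock_launch_sized_workgroup_mem` and `contract_demo` are hypothetical and exist only in this sketch, and a 64-byte static buffer stands in for whatever the host would allocate at launch.

```rust
use std::cell::UnsafeCell;
use std::mem::{align_of, size_of};

/// Hypothetical launch size; on a real GPU the host picks this at launch.
const LAUNCH_SIZE_BYTES: usize = 64;

/// Hypothetical stand-in for the launch-sized workgroup memory region.
#[repr(align(16))]
struct LaunchBuffer(UnsafeCell<[u8; LAUNCH_SIZE_BYTES]>);

// SAFETY (model only): this sketch never accesses the buffer concurrently.
unsafe impl Sync for LaunchBuffer {}

static LAUNCH_BUFFER: LaunchBuffer = LaunchBuffer(UnsafeCell::new([0; LAUNCH_SIZE_BYTES]));

/// Host-side model of the intrinsic (assumption: NOT the real implementation).
/// Every call returns the same address, cast to the requested pointee type.
fn mock_launch_sized_workgroup_mem<T>() -> *mut T {
    // The model can only honor alignments up to the buffer's own alignment.
    assert!(align_of::<T>() <= align_of::<LaunchBuffer>());
    LAUNCH_BUFFER.0.get().cast::<T>()
}

/// Returns (same address for all calls, aligned for u64, number of u32 that fit).
fn contract_demo() -> (bool, bool, usize) {
    let as_u32 = mock_launch_sized_workgroup_mem::<u32>();
    let as_u64 = mock_launch_sized_workgroup_mem::<u64>();
    let same_address = (as_u32 as usize) == (as_u64 as usize);
    let aligned = (as_u64 as usize) % align_of::<u64>() == 0;
    // How many elements are dereferenceable depends on the launch size, not on T.
    let u32_capacity = LAUNCH_SIZE_BYTES / size_of::<u32>();
    (same_address, aligned, u32_capacity)
}
```

In this model the type parameter only influences the pointee type and the alignment guarantee; the number of usable bytes comes entirely from the launch size.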

Alternative Interfaces

Exposing dynamic workgroup memory as extern static variables in Rust,
as they are represented in LLVM IR, was also considered.
However, because the pointer is not guaranteed to be dereferenceable
(that depends on the allocated size at runtime), such a global must be
zero-sized, which makes global variables a bad fit.

Implementation Details

Workgroup memory in amdgpu and nvptx lives in address space 3.
Workgroup memory from a launch is implemented by creating an
external global variable in address space 3. The global is declared with
size 0, as the actual size is only known at runtime. It is defined
behavior in LLVM to access an external global outside the defined size.

There is no similar way to get the allocated size of launch-sized
workgroup memory on amdgpu and nvptx, so users have to pass this
out-of-band or rely on target-specific ways for now.
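
As a rough illustration of passing the size out-of-band, the sketch below combines a base pointer with a byte count that the kernel would receive separately (for example as a kernel argument). `workgroup_slice` and `out_of_band_demo` are hypothetical helpers, and a host `Vec` stands in for workgroup memory so the code runs on a CPU.

```rust
use std::mem::size_of;
use std::ptr;

/// Hypothetical helper: combine the base pointer with a launch size (in bytes)
/// that was passed out-of-band. Returns a raw slice pointer, so no reference
/// is created here and nothing is claimed about aliasing or synchronization.
fn workgroup_slice<T>(base: *mut T, launch_size_bytes: usize) -> *mut [T] {
    assert!(size_of::<T>() != 0, "zero-sized element types are not handled");
    // Only whole elements are usable; a trailing partial element is ignored.
    let len = launch_size_bytes / size_of::<T>();
    ptr::slice_from_raw_parts_mut(base, len)
}

/// Host-side demo: fills the usable elements with 1, 2, 3, ... and returns
/// the whole backing buffer so the effect of the launch size is visible.
fn out_of_band_demo(launch_size_bytes: usize) -> Vec<u32> {
    // Stand-in for workgroup memory (assumption: 8 elements are available).
    let mut backing = vec![0u32; 8];
    assert!(launch_size_bytes <= backing.len() * size_of::<u32>());

    let raw = workgroup_slice(backing.as_mut_ptr(), launch_size_bytes);
    // SAFETY: `raw` points into `backing`, its length was checked above, and
    // nothing else accesses `backing` while `elems` is in use.
    let elems: &mut [u32] = unsafe { &mut *raw };
    for (i, e) in elems.iter_mut().enumerate() {
        *e = i as u32 + 1;
    }
    backing
}
```

On a GPU, creating a `&mut [T]` into memory that other threads also access would need far more care; the mutable reference here is only acceptable because the demo is single-threaded.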

Tracking issue: #135516

rustbot · 2025-09-03T22:32:28Z

r? @petrochenkov

rustbot has assigned @petrochenkov.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

rustbot · 2025-09-03T22:32:32Z

Some changes occurred in src/tools/compiletest

cc @jieyouxu

Some changes occurred in compiler/rustc_codegen_ssa

cc @WaffleLapkin

Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter
gets adapted for the changes, if necessary.

cc @rust-lang/miri, @RalfJung, @oli-obk, @lcnr

RalfJung · 2025-09-04T06:28:17Z

+#[rustc_nounwind]
+#[unstable(feature = "dynamic_shared_memory", issue = "135513")]
+#[cfg(any(target_arch = "amdgpu", target_arch = "nvptx64"))]
+pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T;


Note that outside the GPU world, "shared memory" typically refers to memory shared between processes. So I would suggest using a name that's less likely to be confused, like something that explicitly involves "GPU" or so.

This sounds like a form of "global" memory (similar to a static item), but then apparently OpenCL calls it "local" which is very confusing...

Does it make sense to add a mod gpu?
I think there are more intrinsics for gpus that can be added (although more in the traditional intrinsic sense, relating to an instruction, edit: re-exposing intrinsics from core::arch::nvptx and the amdgpu equivalent).

Or should it be in core::arch::gpu?
(From #135516 (comment), cc @workingjubilee)

Rust intrinsic names are not namespaced. They are exposed in a module, but inside the compiler they are identified entirely by their name. So moving them into a different module doesn't alleviate the need for a clear name that will be understandable to non-GPU people working in the compiler (which is the vast majority of compiler devs).

If there's more GPU intrinsics to come, moving them into a gpu.rs file here still might make sense.

I don't have a strong opinion on how the eventually stable public API is organized, I am commenting entirely as someone who has an interest in keeping the set of intrinsics the Rust compiler offers understandable and well-defined (the ones in this folder, not the ones in core::arch which you call "more traditional" but that's very dependent on your background ;). These intrinsics are just an implementation detail, but every intrinsic we add here is a new language primitive -- it's like adding a new keyword, just without the syntax discussions and perma-unstable. In the past we used to have intrinsics that entirely break the internal consistency of the language, and we used to have intrinsics whose safety requirements were very poorly documented.

RalfJung · 2025-09-04T10:28:19Z

Sorry for drowning you in questions here, but extending the core language with new operations (as in, adding a new intrinsic doing things that couldn't be done before) is a big deal, and we had a bad experience in the past when this was done without wider discussion in the team to ensure that the intrinsics actually make sense in the context of Rust. Not everything that exists in the hardware can be 1:1 exposed in Rust, sometimes this requires a lot of work and sometimes it's just basically impossible. It can be a lot of work to clean these things up later, and as someone who did a bunch of that work, I'd rather not have to do it again. :)

Flakebi · 2025-09-04T11:33:41Z

I agree that it makes a lot of sense to have the discussion now. Thanks for taking a look and helping to design something useful!

Speaking of safety requirements... how does one use this pointer?

Heh, yes, that’s something that should be mentioned in the doc comment as well. (Especially comments on how to safely use it.)

I get that it is aligned, but does it point to enough memory to store a T?

Depends on the size specified on the CPU side when launching the gpu-kernel. It may or it may not.

If it's always the same address, doesn't everyone overwrite each other's data all the time? This API looks very odd for a non-GPU person, and it's not clear to me whether that is resolved by having more magic behavior (which should be documented or at least referenced here), or whether there's higher-level APIs built on top that deal with this (but this intrinsic provides so few guarantees, I can't see how that should be possible).

There are “higher-level APIs” like “do a fast matrix-matrix multiplication”, but not much in-between. I’d assume that people usually use this in its raw form.
On GPUs, accessing memory is orders of magnitude slower than it is on CPUs. But GPUs have a lot more registers (e.g. up to 256 32-bit registers on amdgpu) and shared memory, which is essentially a software-defined cache.

Two general use cases are: 1) All threads in a group load a part from global memory (the RAM/VRAM) and store it in shared memory. Then all threads read from the collaboratively loaded data. 2) All threads in a group do some work and collaborate on shared memory (with atomics or so) to aggregate results. Then one of the threads stores the final result to global memory.

So, shared memory is meant to be accessed collaboratively and the developer must ensure proper synchronization. It is hard to provide a safe abstraction for this and tbh, I don’t want to try 😅 (though I can see 3rd party crates doing this – at least to some extent).
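
The two use cases can be modelled on the host with ordinary threads. This is only a sketch under assumptions: each `std::thread` plays one GPU thread of a workgroup, an array of atomics stands in for workgroup memory, and `std::sync::Barrier` stands in for a workgroup barrier. `GROUP_SIZE` and `workgroup_sum` are hypothetical names, and no GPU API is involved.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Barrier;
use std::thread;

/// Hypothetical workgroup size for this model.
const GROUP_SIZE: usize = 4;

/// Host-side model of one workgroup summing its slice of "global memory".
fn workgroup_sum(global_input: &[u32; GROUP_SIZE]) -> u32 {
    // Stand-ins for workgroup memory: a tile of loaded data and an accumulator.
    let tile: [AtomicU32; GROUP_SIZE] = std::array::from_fn(|_| AtomicU32::new(0));
    let total = AtomicU32::new(0);
    // Stand-in for the result location in global memory.
    let global_output = AtomicU32::new(0);
    let barrier = Barrier::new(GROUP_SIZE);

    thread::scope(|s| {
        for tid in 0..GROUP_SIZE {
            let (tile, total, global_output, barrier) =
                (&tile, &total, &global_output, &barrier);
            s.spawn(move || {
                // Use case 1: every thread loads one element into the tile.
                tile[tid].store(global_input[tid], Ordering::Relaxed);
                // All loads must be finished before anyone reads the tile.
                barrier.wait();
                // Each thread now reads data that a different thread loaded.
                let neighbour = tile[(tid + 1) % GROUP_SIZE].load(Ordering::Relaxed);

                // Use case 2: aggregate results with atomics.
                total.fetch_add(neighbour, Ordering::Relaxed);
                barrier.wait();
                // One thread stores the final result to "global memory".
                if tid == 0 {
                    global_output.store(total.load(Ordering::Relaxed), Ordering::Relaxed);
                }
            });
        }
    });

    global_output.load(Ordering::Relaxed)
}
```

Removing either `barrier.wait()` makes the result depend on thread timing, which is the kind of synchronization responsibility the comment above leaves to the developer.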

From Rust’s perspective, guarantees should be the same as with memory that’s shared between processes.

Typically, intrinsic documentation should be detailed enough that I can read and write code using the intrinsic and know exactly whether the code is correct and what it will do in all circumstances. I don't know if there's any hope of achieving that with GPU intrinsics, but if not then we need to have a bit of a wider discussion -- we have had bad experience with just importing "externally defined" semantics into Rust without considering all the interactions (in general, it is not logically coherent to have semantics externally defined).

I agree, it would be nice to have good documentation for the intrinsics in Rust!

RalfJung · 2025-09-04T13:07:49Z

Depends on the size specified on the CPU side when launching the gpu-kernel. It may or it may not.

Wait, there's a single static size set when launching the kernel? Why is it called "dynamic" memory? "dynamic" memory usually means malloc/free, i.e. you can get any amount of fresh memory during runtime (until RAM is full obviously).

Are you saying dynamic shared memory is neither dynamic in the normal sense nor shared in the normal sense? ;)

petrochenkov · 2025-09-04T13:13:20Z

r? @RalfJung

RalfJung · 2025-09-04T13:31:21Z

I won't be able to do the final approval here, I can just help with ensuring that the intrinsics are documented well enough that they can be understood without GPU expertise, and that the LLVM codegen looks vaguely reasonable.

I don't know if we have anyone who actually knows what the generated LLVM IR should look like and can ensure it makes sense. r? @nikic maybe?

rust-bors · 2026-04-02T14:22:14Z

☀️ Try build successful (CI)
Build commit: 0665f15 (0665f1549ce7f4f25322095d9ed05a2f6a98975c, parent: e6b64a2f4c696b840f8a384ec28690eed6a5d267)

Flakebi · 2026-04-07T08:45:53Z

Rebased to fix merge conflicts in a use statement.
Would be nice to get this merged :)

ZuseZ4 · 2026-04-07T12:23:49Z

If Jubilee doesn't get to it by next week and doesn't mind, Marcelo and I can also have a look then.
At a brief skim, it looks mostly like what I've been expecting on the LLVM side, and I was able to use it successfully to implement a simple matrix multiplication. I just need a bit more time to poke at the alignment.

Flakebi · 2026-04-08T21:54:48Z

Fixed merge conflict in another use statement.

Flakebi · 2026-04-19T08:58:28Z

Fixed another merge conflict in use statement.

rustbot · 2026-04-19T20:39:26Z

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

Flakebi · 2026-04-19T20:39:50Z

And same use statement again! :D

ZuseZ4 · 2026-04-22T21:08:55Z

Thanks for taking the time and working through all the feedback!

I don't know if we have anyone who actually knows what the generated LLVM IR should look like and can ensure it makes sense.

@Sa4dUs and I reviewed the IR and it does look like what we were expecting. From LLVM 23 onwards, we should now also have the same IR for both vendors, which is nice. For testing, I also built a minimal frontend for it and wrote a shared-memory matmul on top of it, which worked like a charm.

Just to summarize the previous discussion for those who got lost in ~140 comments:

There is a little bit of magic going on in the LLVM backend: multiple globals getting fused, all invocations returning the same ptr, alignment+size responsibility being split between host (launching code) and kernel (launched code).
Ralf/Jubilee raised a question about DCE affecting the alignment: #146181 (comment). It should work fine in general. Users could make wrong assumptions that we wouldn't catch, but it is an intrinsic, and it returns a raw pointer. We generally accept that users could make wrong assumptions and use either of those incorrectly, even if we try to prevent it.
Other ways of mapping this intrinsic/behaviour to Rust were discussed, but this implementation seemed (so far) like the best we can express in today's Rust.
The naming had a lot of iterations. I agree that the final name is sensible.

All previous questions/feedback (especially from Jubilee as the last reviewer) were addressed, and code, design and documentation look reasonable to Marcelo and me. Individual aspects were also thoroughly reviewed by others before: #146181 (comment). I'm happy to make the final sign-off.

I was told it's conventional to add everyone involved in the partial reviewing, so let's see who is on a team. I added everyone from #146181 (comment), with the exception of Flakebi, who's the author.

cc @workingjubilee @RalfJung @nikic @kjetilkjeka @kulst

@bors r=ZuseZ4,Sa4dus,workingjubilee,RalfJung,nikic,kjetilkjeka,kulst

RalfJung · 2026-04-22T21:12:01Z

Ralf/Jubilee raised a question about DCE affecting the alignment: #146181 (comment). It should work fine in general. Users could make wrong assumptions that we wouldn't catch, but it is an intrinsic, and it returns a raw pointer. We generally accept that users could make wrong assumptions and use either of those incorrectly, even if we try to prevent it.

Yeah that seems fine as long as it is properly documented.

rustbot assigned petrochenkov Sep 3, 2025

Flakebi mentioned this pull request Sep 3, 2025: Tracking Issue for NVPTX shared memory #135516 (open, 3 tasks)

Flakebi force-pushed the dynamic-shared-memory branch from 0aa0e58 to 3ebaccb, September 3, 2025 22:43

Flakebi force-pushed the dynamic-shared-memory branch from 3ebaccb to 2378959, September 3, 2025 22:50

RalfJung reviewed Sep 4, 2025: comment thread on library/core/src/intrinsics/mod.rs (outdated)

RalfJung reviewed Sep 4, 2025: comment thread on compiler/rustc_codegen_llvm/src/intrinsic.rs

RalfJung reviewed Sep 4, 2025: comment thread on compiler/rustc_codegen_llvm/src/intrinsic.rs (outdated)

RalfJung reviewed Sep 4, 2025: comment thread on compiler/rustc_abi/src/lib.rs (outdated)

RalfJung reviewed Sep 4, 2025: comment thread on library/core/src/intrinsics/mod.rs (outdated)

RalfJung reviewed Sep 4, 2025: comment thread on library/core/src/intrinsics/mod.rs (outdated)

RalfJung reviewed Sep 4, 2025: comment thread on library/core/src/intrinsics/mod.rs (outdated)

rustbot assigned RalfJung and unassigned petrochenkov Sep 4, 2025

rust-bors Bot pushed a commit that referenced this pull request Apr 2, 2026: Auto merge of #146181 - Flakebi:dynamic-shared-memory, r=<try> (0665f15). Add intrinsic for launch-sized workgroup memory on GPUs; try-job: x86_64-gnu-nopt; try-job: x86_64-gnu-debug

ZuseZ4 mentioned this pull request Apr 5, 2026: std::offload sharedmem #154835 (draft)

Flakebi force-pushed the dynamic-shared-memory branch from 6236dd9 to ceb1be7, April 7, 2026 08:45

Flakebi force-pushed the dynamic-shared-memory branch from ceb1be7 to 8a95ca4, April 8, 2026 21:54

Flakebi force-pushed the dynamic-shared-memory branch from 8a95ca4 to 1f7a58d, April 19, 2026 08:56

Flakebi force-pushed the dynamic-shared-memory branch from 1f7a58d to 752c152, April 19, 2026 20:39

ZuseZ4 reviewed Apr 20, 2026: comment thread on compiler/rustc_codegen_llvm/src/intrinsic.rs (outdated)

Flakebi force-pushed the dynamic-shared-memory branch from 752c152 to 1270c5d, April 21, 2026 07:29

Flakebi force-pushed the dynamic-shared-memory branch from 1270c5d to acdf598, April 21, 2026 07:30

ZuseZ4 reviewed Apr 23, 2026: comment thread on library/core/src/intrinsics/gpu.rs

Flakebi force-pushed the dynamic-shared-memory branch from acdf598 to 13ec3de, April 24, 2026 08:09
