Skip to content

Use fork_join_executor by default#18

Draft
K-ballo wants to merge 90 commits intomainfrom
fork_join_executor
Draft

Use fork_join_executor by default#18
K-ballo wants to merge 90 commits intomainfrom
fork_join_executor

Conversation

@K-ballo
Copy link
Copy Markdown
Member

@K-ballo K-ballo commented Aug 7, 2025

No description provided.

fbusato and others added 30 commits March 20, 2025 02:01
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Wesley Maxey <71408887+wmaxey@users.noreply.github.com>
Those were forgotten to add to the migration guide

(cherry picked from commit 53ef7fd)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
This avoids one of the last remaining patches they need to apply to CCCL
* Fix `not_fn`

Our implementation of `perfect_forward` stores the functor in a tuple.

However, that seems to break with e.g. device lambdas with captures

* Drop some thrust macros from repo.toml

(cherry picked from commit 185832a)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
…rs try to use pack indexing (NVIDIA#4278) (NVIDIA#4284)

(cherry picked from commit 863bb97)

Co-authored-by: Eric Niebler <eniebler@nvidia.com>
Fixes NVIDIA#4318

(cherry picked from commit ed275a8)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Co-authored-by: Federico Busato <50413820+fbusato@users.noreply.github.com>
…IDIA#4347)

Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
…uition (NVIDIA#4279) (NVIDIA#4287)

* tweak the cccl compiler version check macros to better agree with intuition

prior to this commit, a compiler check such as:

```c++
```

would fail if the compiler was actually v19.1. that is because 19.1 is
greater than 19. what the author of this code probably intended was to
check only the compiler's major version number, in which case the check
would have succeed.

this commit changes the behavior of the following macros when only a
major version number is specified:

* `_CCCL_COMPILER`
* `_CCCL_CUDA_COMPILER`
* `_CCCL_CUDACC_BELOW`
* `_CCCL_CUDACC_AT_LEAST`

* guard `_CCCL_COMPILER(FOO)` with an extra set of parens

Co-authored-by: Eric Niebler <eniebler@nvidia.com>
…DIA#4394)

* Make compiler version comparisons safer (NVIDIA#4185)

* Make compiler version comparisons safer

* remove MSVC2017 check

* use `_CCCL_HAS_CUDA_COMPILER()` instead of `_CCCL_CUDACC()`

* Fix compile issues

---------

Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
…IA#4425)

* Fix uninitialized read in local atomic code path.

* PTX assumes '=' operands are always overwritten. For this code path the predicated mov instruction will only sometimes overwrite the original value. The compiler may or may not initialize `__temp`. This patch fixes this by always signing a 0 or 1 to the output register removing the need to initialize `__temp`. An alternative is to use `+` instead of `=` on the output operand.

* Create a test to cover the PTX path of local storage atomics regardless of CTK version

* Disable test for nvrtc

* Use new additional compile flags

* Try and fix checking for compile flags

* Update comments in is_local codepath

* Make test compatible with older NVCC

* Revert "Try and fix checking for compile flags"

This reverts commit a846ea5.

* Remove unroll pragma, it is unneeded for repro.

---------

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
(cherry picked from commit 3200156)

Co-authored-by: Wesley Maxey <71408887+wmaxey@users.noreply.github.com>
…ync` policy (NVIDIA#4204) (NVIDIA#4483)

* [Thrust] Perform asynchronous allocations by default for the `par_nosync` policy.
This will make algorithms (like scans) that don't have a computation-dependent
result but do temporary allocation properly asynchronous under `par_nosync`.

* Cleanup

* Apply suggestions from code review to `par_nosync` async allocation

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

* Switch from `reinterpret_pointer_cast` to `raw_pointer_cast` when we're going to `void*`.

* [Thrust] Pass a raw pointer instead of a Thrust pointer to `cudaFree`.

* Run pre-commit.

* [Thrust]: Correct comment on `par_nosync` fallback path.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
(cherry picked from commit fbf517d)

Co-authored-by: Bryce Adelstein Lelbach aka wash <brycelelbach@gmail.com>
* Drop invalid relative includes. (NVIDIA#4468)

They can pull in stale headers

* Drop cudax includes too
…A#4492)

* Use `cudaStream_t` for `thrust::device.on(...)`.

This was recently switched to use `cuda::stream_ref`, which broke users that have their own custom stream wrappers (nvbench, rmm, probably others).
There's no real benefit to using a stream_ref here, and it breaks existing implicit conversions.

* Add test to ensure that stream wrappers work with thrust::device.on
…IDIA#4516)

Co-authored-by: Giannis Gonidelis <gonidelis@hotmail.com>
…NVIDIA#4526)

We cannot control the predicate used nor the iterator. For example rapids uses device only predicates a lot

(cherry picked from commit a7d76b3)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
We already dropped it in main
* Avoid deprecated CUDART usage. (NVIDIA#4505)

* Disable NVTX tests for NVHPC in C++20 (NVIDIA#4686)

The nvtx headers contain a variable named `module` which nvc++ rejects as a c++20 keyword

---------

Co-authored-by: Allison Piper <alliepiper16@gmail.com>
…ts (NVIDIA#4594) (NVIDIA#4801)

* Switch cuCtxCreate to cuDevicePrimaryCtxRetain

* Correct release argument

* Set the retained context current

(cherry picked from commit cb926fb)

Co-authored-by: pciolkosz <pciolkosz@nvidia.com>
…mpiler issues (NVIDIA#4586) (NVIDIA#4853)

* Always bypass automatic atomic storage checks to prevent potential compiler issues

* Waive atomic.local.pass.cpp if `_LIBCUDACXX_ATOMIC_UNSAFE_AUTOMATIC_STORAGE` is set.

---------


(cherry picked from commit 549dd45)

Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>
Co-authored-by: Wesley Maxey <71408887+wmaxey@users.noreply.github.com>
Depending on certain device properties, `cuda::std::addressof` returns a valid device pointer even when called on host.

Avoid spurious test failures by checking that if that is the case we get the correct result;
The warning seems bogus, but there is no reason not to work around it.
…A#4990) (NVIDIA#4994)

(cherry picked from commit 8c1195a)

Co-authored-by: Wesley Maxey <71408887+wmaxey@users.noreply.github.com>
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
@K-ballo K-ballo requested a review from hkaiser August 7, 2025 09:16
Copy link
Copy Markdown
Member

@hkaiser hkaiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@K-ballo
Copy link
Copy Markdown
Member Author

K-ballo commented Aug 26, 2025

The mere construction of a fork-join executor prevents other executors from making progress.

@K-ballo K-ballo force-pushed the fork_join_executor branch from a2db469 to 2175af8 Compare August 29, 2025 13:22
@hkaiser
Copy link
Copy Markdown
Member

hkaiser commented Sep 17, 2025

The mere construction of a fork-join executor prevents other executors from making progress.

Yes, that's a 'feature' and not a bug ;-)

@K-ballo
Copy link
Copy Markdown
Member Author

K-ballo commented Sep 20, 2025

The mere construction of a fork-join executor prevents other executors from making progress.

Yes, that's a 'feature' and not a bug ;-)

In that case any choice of executor should be removed entirely from the backend, as it will always lead to incorrect programs.

@hkaiser
Copy link
Copy Markdown
Member

hkaiser commented Sep 20, 2025

The mere construction of a fork-join executor prevents other executors from making progress.

Yes, that's a 'feature' and not a bug ;-)

In that case any choice of executor should be removed entirely from the backend, as it will always lead to incorrect programs.

If there is a way to customize the executor in the benchmark without adding it to the backend itself - sure, I'd agree.

@K-ballo
Copy link
Copy Markdown
Member Author

K-ballo commented Sep 20, 2025

This PR makes the fork_join_executor the default executor, not the only possible executor. When explicit policies are used, execution happens on the executor specified by the policy. Those executions can never progress since a fork_join_executor has been constructed. That's the reason this PR is still a draft: it does not work.

For things to work, the fork_join_executor has to be the only executor in the system.

If there is a way to customize the executor in the benchmark without adding it to the backend itself - sure, I'd agree.

I'm not aware of such a way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants