Use fork_join_executor by default by K-ballo · Pull Request #18 · STEllAR-GROUP/cccl

K-ballo · 2025-08-07T09:16:25Z

No description provided.

* Fix `not_fn`

  Our implementation of `perfect_forward` stores the functor in a tuple. However, that seems to break with, e.g., device lambdas with captures.
* Drop some thrust macros from repo.toml

(cherry picked from commit 185832a)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

…uition (NVIDIA#4279) (NVIDIA#4287)

* Tweak the CCCL compiler version check macros to better agree with intuition.

  Prior to this commit, a compiler check such as:

  ```c++
  ```

  would fail if the compiler was actually v19.1, because 19.1 is greater than 19. What the author of such a check probably intended was to check only the compiler's major version number, in which case the check would have succeeded.

  This commit changes the behavior of the following macros when only a major version number is specified:

  * `_CCCL_COMPILER`
  * `_CCCL_CUDA_COMPILER`
  * `_CCCL_CUDACC_BELOW`
  * `_CCCL_CUDACC_AT_LEAST`
* Guard `_CCCL_COMPILER(FOO)` with an extra set of parens.

Co-authored-by: Eric Niebler <eniebler@nvidia.com>
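The difference between the two comparison semantics described in this commit message can be sketched in plain C++. This is only an illustration: it assumes the elided check was an "at most 19" comparison, and `version`, `at_most_strict`, and `at_most_major_only` are hypothetical names, not CCCL macros or types.

```cpp
#include <cassert>

// Hypothetical stand-in for a compiler version; not a CCCL type.
struct version {
    int major_version;
    int minor_version;
};

// Old behaviour as described above: a major-only bound N is read as N.0,
// so 19.1 compares greater than 19 and the check fails.
constexpr bool at_most_strict(version actual, int bound_major) {
    return actual.major_version < bound_major ||
           (actual.major_version == bound_major && actual.minor_version <= 0);
}

// New behaviour as described above: when only a major version is given,
// only the major version numbers are compared.
constexpr bool at_most_major_only(version actual, int bound_major) {
    return actual.major_version <= bound_major;
}

static_assert(!at_most_strict(version{19, 1}, 19), "19.1 <= 19.0 is false");
static_assert(at_most_major_only(version{19, 1}, 19), "major 19 <= major 19");
```

With the major-only semantics, every 19.x release satisfies a check written against 19, which matches the intent the commit message attributes to the author of such checks.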

…DIA#4394)

* Make compiler version comparisons safer (NVIDIA#4185)
* Make compiler version comparisons safer
* remove MSVC2017 check
* use `_CCCL_HAS_CUDA_COMPILER()` instead of `_CCCL_CUDACC()`
* Fix compile issues

---------

Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

…IA#4425)

* Fix uninitialized read in local atomic code path.
* PTX assumes '=' operands are always overwritten. For this code path the predicated mov instruction will only sometimes overwrite the original value. The compiler may or may not initialize `__temp`. This patch fixes this by always assigning a 0 or 1 to the output register, removing the need to initialize `__temp`. An alternative is to use `+` instead of `=` on the output operand.
* Create a test to cover the PTX path of local storage atomics regardless of CTK version
* Disable test for nvrtc
* Use new additional compile flags
* Try and fix checking for compile flags
* Update comments in is_local codepath
* Make test compatible with older NVCC
* Revert "Try and fix checking for compile flags". This reverts commit a846ea5.
* Remove unroll pragma, it is unneeded for repro.

---------

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

(cherry picked from commit 3200156)

Co-authored-by: Wesley Maxey <71408887+wmaxey@users.noreply.github.com>

…ync` policy (NVIDIA#4204) (NVIDIA#4483)

* [Thrust] Perform asynchronous allocations by default for the `par_nosync` policy. This will make algorithms (like scans) that don't have a computation-dependent result but do temporary allocation properly asynchronous under `par_nosync`.
* Cleanup
* Apply suggestions from code review to `par_nosync` async allocation

  Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
* Switch from `reinterpret_pointer_cast` to `raw_pointer_cast` when we're going to `void*`.
* [Thrust] Pass a raw pointer instead of a Thrust pointer to `cudaFree`.
* Run pre-commit.
* [Thrust]: Correct comment on `par_nosync` fallback path.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

(cherry picked from commit fbf517d)

Co-authored-by: Bryce Adelstein Lelbach aka wash <brycelelbach@gmail.com>

…A#4492)

* Use `cudaStream_t` for `thrust::device.on(...)`.

  This was recently switched to use `cuda::stream_ref`, which broke users that have their own custom stream wrappers (nvbench, rmm, probably others). There's no real benefit to using a stream_ref here, and it breaks existing implicit conversions.
* Add test to ensure that stream wrappers work with `thrust::device.on`

…NVIDIA#4526) We cannot control the predicate used nor the iterator. For example rapids uses device only predicates a lot (cherry picked from commit a7d76b3) Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

* Avoid deprecated CUDART usage. (NVIDIA#4505)
* Disable NVTX tests for NVHPC in C++20 (NVIDIA#4686)

  The nvtx headers contain a variable named `module`, which nvc++ rejects as a C++20 keyword.

---------

Co-authored-by: Allison Piper <alliepiper16@gmail.com>

…ts (NVIDIA#4594) (NVIDIA#4801)

* Switch cuCtxCreate to cuDevicePrimaryCtxRetain
* Correct release argument
* Set the retained context current

(cherry picked from commit cb926fb)

Co-authored-by: pciolkosz <pciolkosz@nvidia.com>

…mpiler issues (NVIDIA#4586) (NVIDIA#4853)

* Always bypass automatic atomic storage checks to prevent potential compiler issues
* Waive atomic.local.pass.cpp if `_LIBCUDACXX_ATOMIC_UNSAFE_AUTOMATIC_STORAGE` is set.

---------

(cherry picked from commit 549dd45)

Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>
Co-authored-by: Wesley Maxey <71408887+wmaxey@users.noreply.github.com>

hkaiser

LGTM, thanks!

K-ballo · 2025-08-26T09:22:32Z

The mere construction of a fork-join executor prevents other executors from making progress.

hkaiser · 2025-09-17T12:45:33Z

> The mere construction of a fork-join executor prevents other executors from making progress.

Yes, that's a 'feature' and not a bug ;-)

K-ballo · 2025-09-20T07:30:29Z

> > The mere construction of a fork-join executor prevents other executors from making progress.
>
> Yes, that's a 'feature' and not a bug ;-)

In that case any choice of executor should be removed entirely from the backend, as it will always lead to incorrect programs.

hkaiser · 2025-09-20T14:47:24Z

> > > The mere construction of a fork-join executor prevents other executors from making progress.
> >
> > Yes, that's a 'feature' and not a bug ;-)
>
> In that case any choice of executor should be removed entirely from the backend, as it will always lead to incorrect programs.

If there is a way to customize the executor in the benchmark without adding it to the backend itself - sure, I'd agree.

K-ballo · 2025-09-20T16:58:44Z

This PR makes the fork_join_executor the default executor, not the only possible executor. When explicit policies are used, execution happens on the executor specified by the policy. Those executions can never progress since a fork_join_executor has been constructed. That's the reason this PR is still a draft: it does not work.

For things to work, the fork_join_executor has to be the only executor in the system.

> If there is a way to customize the executor in the benchmark without adding it to the backend itself - sure, I'd agree.

I'm not aware of such a way.
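The starvation described in this thread can be pictured with a deliberately simplified, single-threaded model. It is not HPX code and does not use any HPX API: `worker_pool`, `pinning_executor`, `on_demand_executor`, and `progressed_tasks` are hypothetical names, and the model assumes only what is claimed above, namely that an executor holding every worker for its whole lifetime leaves nothing for executors that need a free worker per task.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Toy model of a fixed set of workers shared by all executors (not HPX).
class worker_pool {
public:
    explicit worker_pool(std::size_t workers) : capacity_(workers), free_(workers) {}
    std::size_t free_workers() const { return free_; }
    // Claims up to `wanted` workers and returns how many were granted.
    std::size_t claim(std::size_t wanted) {
        std::size_t granted = wanted < free_ ? wanted : free_;
        free_ -= granted;
        return granted;
    }
    void release(std::size_t count) {
        assert(free_ + count <= capacity_);
        free_ += count;
    }
private:
    std::size_t capacity_;
    std::size_t free_;
};

// Models an executor that holds every free worker from construction
// until destruction, whether or not it has work to do.
class pinning_executor {
public:
    explicit pinning_executor(worker_pool& pool)
        : pool_(pool), held_(pool.claim(pool.free_workers())) {}
    ~pinning_executor() { pool_.release(held_); }
    pinning_executor(pinning_executor const&) = delete;
    pinning_executor& operator=(pinning_executor const&) = delete;
private:
    worker_pool& pool_;
    std::size_t held_;
};

// Models an ordinary executor that needs one free worker per task.
class on_demand_executor {
public:
    explicit on_demand_executor(worker_pool& pool) : pool_(pool) {}
    // Returns false when no worker is available, i.e. the task cannot progress.
    bool try_run(std::function<void()> const& task) {
        if (pool_.claim(1) == 0) return false;
        task();
        pool_.release(1);
        return true;
    }
private:
    worker_pool& pool_;
};

// Counts how many of `attempts` tasks run on the on-demand executor,
// optionally while a pinning executor is alive.
std::size_t progressed_tasks(std::size_t workers, bool construct_pinning,
                             std::size_t attempts) {
    worker_pool pool(workers);
    on_demand_executor other(pool);
    std::size_t done = 0;
    auto submit_all = [&] {
        for (std::size_t i = 0; i < attempts; ++i)
            other.try_run([&done] { ++done; });
    };
    if (construct_pinning) {
        pinning_executor default_exec(pool);  // merely constructed, never used
        submit_all();
    } else {
        submit_all();
    }
    return done;
}
```

In this model, `progressed_tasks(4, false, 3)` runs all three tasks, while `progressed_tasks(4, true, 3)` runs none, even though the pinning executor is never asked to do any work. Real HPX scheduling is more involved than this sketch.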

fbusato and others added 30 commits March 20, 2025 02:01

[backport/3.0] Replace CUB util_arch.cuh macros with inline constexpr…

72ff92c

… variables NVIDIA#4202

Set NO_CMAKE_FIND_ROOT_PATH for cudax. (NVIDIA#4162) (NVIDIA#4215)

f90f3d7

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

[BACKPORT] Fix the cuda python setup (NVIDIA#4217)

68ec3d3

Co-authored-by: Wesley Maxey <71408887+wmaxey@users.noreply.github.com>

Remove python/cuda_cooperative/setup.py (NVIDIA#4221) (NVIDIA#4234)

243ce3e

Document deprecations from NVIDIA#4165 (NVIDIA#4237) (NVIDIA#4242)

dd0b071

Those were forgotten to add to the migration guide (cherry picked from commit 53ef7fd) Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Allow rapids to avoid unrolling some loops in sort (NVIDIA#4254)

ca0eca3

This avoids one of the last remaining patches they need to apply to CCCL

change version check in type_list.h so that *NO* clang-19.X compile…

d2a46f1

…rs try to use pack indexing (NVIDIA#4278) (NVIDIA#4284) (cherry picked from commit 863bb97) Co-authored-by: Eric Niebler <eniebler@nvidia.com>

Remove invalid single # in builtin.h (NVIDIA#4319) (NVIDIA#4327)

3157751

Fixes NVIDIA#4318 (cherry picked from commit ed275a8) Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Update PTX ld/st (NVIDIA#4324) (NVIDIA#4345)

e6562b6

Co-authored-by: Federico Busato <50413820+fbusato@users.noreply.github.com>

Rename WarpShuffleResult to warp_shuffle_result (NVIDIA#4332) (NV…

36958cc

…IDIA#4347) Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>

[Backport] Drop relative includes (NVIDIA#4470)

fbcc9bc

* Drop invalid relative includes. (NVIDIA#4468) They can pull in stale headers * Drop cudax includes too

Missing forward include in iterator facade category (NVIDIA#4512) (NV…

16ca0ed

…IDIA#4516) Co-authored-by: Giannis Gonidelis <gonidelis@hotmail.com>

Drop cuspatial for 3.0 (NVIDIA#4518)

e990993

We already dropped it in main

Ignore Wmaybe-uninitialized in dispatch_reduce.cuh. (NVIDIA#4622)

3134516

Do not use open-coded INFINITY for tests that also test extended fl…

1d6c99b

…oating points (NVIDIA#4751)

Fix uninitialized read in local atomics test when compiled for SM lig…

16cf0a2

…ht architectures (NVIDIA#4440) (NVIDIA#4850)

Avoid errors in get_device_address tests (NVIDIA#4209) (NVIDIA#4892)

a1c23b0

Depending on certain device properties, `cuda::std::addressof` returns a valid device pointer even when called on host. Avoid spurious test failures by checking that if that is the case we get the correct result;

Avoid warning in cuda::ilog10 (NVIDIA#4908) (NVIDIA#4918)

25b2c1c

The warning seems bogus, but there is no reason not to work around it.

Support more arguments to CCCL_PP_SPLICE_WITH (NVIDIA#4972) (NVIDIA#4993

3878bea

) Fixes: NVIDIA#4967

Add potential search path for cccl headers in potential layout (NVIDI…

f1a4d8d

…A#4990) (NVIDIA#4994) (cherry picked from commit 8c1195a) Co-authored-by: Wesley Maxey <71408887+wmaxey@users.noreply.github.com> Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

hkaiser and others added 12 commits August 1, 2025 12:51

Merge pull request #13 from STEllAR-GROUP/topology_resource

864d2c7

Use a custom memory_resource to touch the first byte of each page

Test fill with HPX policies

685270a

Impl copy/copy_n

ae1f80a

Test copy with HPX policies

31587f6

Merge pull request #14 from STEllAR-GROUP/impl-copy

be029b8

HPX implementation of copy/copy_n

Impl count/count_if

6fbc937

Test count with HPX policies

f9334bd

Merge pull request #15 from STEllAR-GROUP/impl-count

dbdd9eb

HPX implementation of count/count_if

Impl equal

4355373

Test equal with HPX policies

ef57a2a

Impl merge

1c64095

Test merge with HPX policies

b9b4064

K-ballo requested a review from hkaiser August 7, 2025 09:16

hkaiser added 2 commits August 7, 2025 10:54

Merge pull request #16 from STEllAR-GROUP/impl-equal

91d6b7d

HPX implementation of equal

Merge pull request #17 from STEllAR-GROUP/impl-merge

3e00c11

HPX implementation of merge

hkaiser approved these changes Aug 7, 2025

K-ballo and others added 6 commits August 12, 2025 11:39

Unwrap contiguous iterators when dispatching to HPX

0e6586a

Contiguous iterator utilities

ddbf51a

Merge pull request #19 from STEllAR-GROUP/unwrap_contiguous_iterator

0cddcd5

Unwrap contiguous iterator

Use fork_join_executor by default

1954019

Fix default executor construction/destruction order

f8a65c0

Request bound thread priority for run_as_hpx_thread

312208b

Use default fork-join executor for memory touching

2175af8

K-ballo force-pushed the fork_join_executor branch from a2db469 to 2175af8 Compare August 29, 2025 13:22

kollanur force-pushed the main branch from 4bbe8a3 to c5ddbff Compare January 29, 2026 04:50

K-ballo wants to merge 90 commits into `main` from `fork_join_executor`
