Skip to content

executors: fix performance regressions#7209

Open
arpittkhandelwal wants to merge 1 commit intoTheHPXProject:masterfrom
arpittkhandelwal:fix/executor-performance-final
Open

executors: fix performance regressions#7209
arpittkhandelwal wants to merge 1 commit intoTheHPXProject:masterfrom
arpittkhandelwal:fix/executor-performance-final

Conversation

@arpittkhandelwal
Copy link
Copy Markdown
Contributor

@arpittkhandelwal arpittkhandelwal commented Apr 20, 2026

Description

This Pull Request addresses critical performance regressions in fork_join_executor and parallel_executor by implementing missing traits, optimizing caching strategies, and refactoring the executor template structure.

Key Changes

  1. fork_join_executor:
    • Implemented tag_invoke for processing_units_count_t and get_first_core_t. This prevents the executor from falling back to a default (often 1) core count, which was causing algorithms to run sequentially.
    • Added stacksize validation to ensure correct executor instantiation.
      reducing code duplication and maintenance burden.

📊 Performance Results

The following results were obtained using the foreach_report_test benchmark with 8 threads in the standard Docker build environment:

Executor Execution Time (ms) Status
fork_join_executor ~2.3ms Fixed (from ~150ms regression)
parallel_executor ~16.8ms Stable (at parallel baseline)

These results confirm that the sequential fallback regression in fork_join_executor is fully resolved and the parallel_executor optimizations are performing as intended.


Closes

This PR supersedes and Closes #6919.

Copilot AI review requested due to automatic review settings April 20, 2026 17:40
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented Apr 20, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets performance regressions in HPX executors by ensuring fork_join_executor reports correct processing unit traits (avoiding sequential fallbacks) and by optimizing parallel_executor PU-mask handling while reducing template duplication.

Changes:

  • Add missing executor trait customizations to fork_join_executor and validate unsupported nostack stacksize.
  • Add lazy caching of pu_mask in parallel_policy_executor_base to avoid repeated recomputation in hot paths.
  • Unify flat vs hierarchical parallel_policy_executor specializations into a single template with if constexpr and preserve serialization format.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
libs/core/executors/include/hpx/executors/parallel_executor.hpp Adds PU-mask caching and merges hierarchical/flat executor implementations using if constexpr.
libs/core/executors/include/hpx/executors/fork_join_executor.hpp Implements missing trait CPOs and rejects thread_stacksize::nostack for correctness.

Comment thread libs/core/executors/include/hpx/executors/parallel_executor.hpp Outdated
Comment thread libs/core/executors/include/hpx/executors/parallel_executor.hpp Outdated
Comment thread libs/core/executors/include/hpx/executors/fork_join_executor.hpp Outdated
Comment thread libs/core/executors/include/hpx/executors/parallel_executor.hpp Outdated
@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Apr 23, 2026

@arpittkhandelwal Would you mind addressing/reacting to the copilot comments, please? I think at least some of them make sense.

Copilot AI review requested due to automatic review settings April 24, 2026 00:34
@arpittkhandelwal arpittkhandelwal force-pushed the fix/executor-performance-final branch from 1ed042c to 0331efb Compare April 24, 2026 00:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread libs/core/executors/include/hpx/executors/parallel_executor.hpp Outdated
Comment thread libs/core/executors/include/hpx/executors/parallel_executor.hpp Outdated
Copilot AI review requested due to automatic review settings April 24, 2026 00:54
@arpittkhandelwal arpittkhandelwal force-pushed the fix/executor-performance-final branch from 16bf461 to 6b8877e Compare April 24, 2026 00:54
Copy link
Copy Markdown
Contributor

@hkaiser hkaiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you wrote that you turned mask_ into an atomic, the current state of the PR doesn't relect that.

Comment thread libs/core/executors/include/hpx/executors/fork_join_executor.hpp Outdated
Comment thread libs/core/executors/include/hpx/executors/fork_join_executor.hpp Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread libs/core/executors/include/hpx/executors/parallel_executor.hpp
@arpittkhandelwal arpittkhandelwal force-pushed the fix/executor-performance-final branch from 6b8877e to 9117541 Compare April 24, 2026 01:24
@hkaiser hkaiser changed the title executors: fix performance regressions and optimize parallel_executor caching executors: fix performance regressions Apr 25, 2026
@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Apr 25, 2026

@arpittkhandelwal thanks for the fixes. However the CIs fail compiling your test.

@arpittkhandelwal arpittkhandelwal force-pushed the fix/executor-performance-final branch from 9117541 to 1da76d8 Compare April 26, 2026 05:29
Copilot AI review requested due to automatic review settings April 26, 2026 05:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

libs/core/executors/include/hpx/executors/fork_join_executor.hpp:1284

  • The PR description mentions adding stacksize validation (and the previous docs stated stacksize must not be nostack), but in the current constructors the stacksize value is passed through without validation and the doc note was removed. Either add the promised validation (e.g. reject thread_stacksize::nostack if unsupported) or update the docs/PR description to reflect the actual supported values and behavior (yielding is already conditional on nostack).
        /// \brief Construct a fork_join_executor.
        ///
        /// \param priority  The priority of the worker threads.
        /// \param stacksize The stacksize of the worker threads.
        /// \param sched     The loop schedule of the parallel regions.
        /// \param yield_delay The time after which the executor yields to other
        ///        work if it has not received any new work for execution.
        explicit fork_join_executor(
            threads::thread_priority priority = threads::thread_priority::bound,
            threads::thread_stacksize stacksize =
                threads::thread_stacksize::small_,
            loop_schedule sched = loop_schedule::dynamic,
            std::chrono::nanoseconds yield_delay = std::chrono::microseconds(
                300))
        {

Comment thread libs/core/executors/tests/unit/fork_join_executor.cpp
Comment thread libs/core/testing/include/hpx/testing/api.hpp Outdated
Comment thread libs/core/datastructures/include/hpx/datastructures/tuple.hpp Outdated
Comment thread libs/core/executors/tests/unit/parallel_executor.cpp Outdated
Comment thread libs/core/executors/tests/unit/fork_join_executor.cpp Outdated
@arpittkhandelwal arpittkhandelwal force-pushed the fix/executor-performance-final branch from e14c53a to c6505f9 Compare April 26, 2026 08:36
Copilot AI review requested due to automatic review settings April 26, 2026 13:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread libs/core/executors/include/hpx/executors/fork_join_executor.hpp
Comment on lines +1355 to +1359
template <typename Parameters>
friend std::size_t tag_invoke(
hpx::execution::experimental::processing_units_count_t,
Parameters&&, fork_join_executor const& exec,
hpx::chrono::steady_duration const& = hpx::chrono::null_duration,
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processing_units_count_t customization is introduced with an unconstrained template <typename Parameters>. Other executors constrain this overload to executor_parameters/is_executor_parameters_v, which avoids overly-permissive tag_invoke participation and improves diagnostics. Consider adding the same constraint here to match the rest of the codebase’s patterns for processing_units_count_t.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants