executors: fix performance regressions#7209
executors: fix performance regressions#7209arpittkhandelwal wants to merge 1 commit intoTheHPXProject:masterfrom
Conversation
Up to standards ✅🟢 Issues
|
There was a problem hiding this comment.
Pull request overview
This PR targets performance regressions in HPX executors by ensuring fork_join_executor reports correct processing unit traits (avoiding sequential fallbacks) and by optimizing parallel_executor PU-mask handling while reducing template duplication.
Changes:
- Add missing executor trait customizations to
fork_join_executorand validate unsupportednostackstacksize. - Add lazy caching of
pu_maskinparallel_policy_executor_baseto avoid repeated recomputation in hot paths. - Unify flat vs hierarchical
parallel_policy_executorspecializations into a single template withif constexprand preserve serialization format.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| libs/core/executors/include/hpx/executors/parallel_executor.hpp | Adds PU-mask caching and merges hierarchical/flat executor implementations using if constexpr. |
| libs/core/executors/include/hpx/executors/fork_join_executor.hpp | Implements missing trait CPOs and rejects thread_stacksize::nostack for correctness. |
|
@arpittkhandelwal Would you mind addressing/reacting to the copilot comments, please? I think at least some of them make sense. |
1ed042c to
0331efb
Compare
16bf461 to
6b8877e
Compare
hkaiser
left a comment
There was a problem hiding this comment.
While you wrote that you turned mask_ into an atomic, the current state of the PR doesn't relect that.
6b8877e to
9117541
Compare
|
@arpittkhandelwal thanks for the fixes. However the CIs fail compiling your test. |
9117541 to
1da76d8
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
libs/core/executors/include/hpx/executors/fork_join_executor.hpp:1284
- The PR description mentions adding stacksize validation (and the previous docs stated stacksize must not be
nostack), but in the current constructors thestacksizevalue is passed through without validation and the doc note was removed. Either add the promised validation (e.g. rejectthread_stacksize::nostackif unsupported) or update the docs/PR description to reflect the actual supported values and behavior (yielding is already conditional onnostack).
/// \brief Construct a fork_join_executor.
///
/// \param priority The priority of the worker threads.
/// \param stacksize The stacksize of the worker threads.
/// \param sched The loop schedule of the parallel regions.
/// \param yield_delay The time after which the executor yields to other
/// work if it has not received any new work for execution.
explicit fork_join_executor(
threads::thread_priority priority = threads::thread_priority::bound,
threads::thread_stacksize stacksize =
threads::thread_stacksize::small_,
loop_schedule sched = loop_schedule::dynamic,
std::chrono::nanoseconds yield_delay = std::chrono::microseconds(
300))
{
e14c53a to
c6505f9
Compare
c6505f9 to
2e0a0f9
Compare
| template <typename Parameters> | ||
| friend std::size_t tag_invoke( | ||
| hpx::execution::experimental::processing_units_count_t, | ||
| Parameters&&, fork_join_executor const& exec, | ||
| hpx::chrono::steady_duration const& = hpx::chrono::null_duration, |
There was a problem hiding this comment.
processing_units_count_t customization is introduced with an unconstrained template <typename Parameters>. Other executors constrain this overload to executor_parameters/is_executor_parameters_v, which avoids overly-permissive tag_invoke participation and improves diagnostics. Consider adding the same constraint here to match the rest of the codebase’s patterns for processing_units_count_t.
Description
This Pull Request addresses critical performance regressions in
fork_join_executorandparallel_executorby implementing missing traits, optimizing caching strategies, and refactoring the executor template structure.Key Changes
tag_invokeforprocessing_units_count_tandget_first_core_t. This prevents the executor from falling back to a default (often 1) core count, which was causing algorithms to run sequentially.reducing code duplication and maintenance burden.
📊 Performance Results
The following results were obtained using the
foreach_report_testbenchmark with 8 threads in the standard Docker build environment:These results confirm that the sequential fallback regression in
fork_join_executoris fully resolved and theparallel_executoroptimizations are performing as intended.Closes
This PR supersedes and Closes #6919.