Continuing the work done in #7056, I looked for further optimizations. In thread_queue::cleanup_terminated_locked(), every iteration of the cleanup loop issues two separate atomic decrements
(--terminated_items_count_, --thread_map_count_), causing repeated cache-line coherence round trips on every call. Instead, we could accumulate the deltas locally and apply them with a single atomic update per counter once the loop finishes.
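
To illustrate the idea, here is a minimal standalone sketch, not the actual HPX code. Only the two counter names come from the description above; the `queue_counters` struct, the `item` type, and the `cleanup_batched` function are hypothetical stand-ins, and the assumption that some items are not in the thread map is only there to show that the two deltas can differ.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the relevant queue state. Only the two
// counter names are taken from the PR description.
struct queue_counters
{
    std::atomic<std::int64_t> terminated_items_count_{0};
    std::atomic<std::int64_t> thread_map_count_{0};
};

// Hypothetical terminated item; in_map marks whether it also has to be
// removed from the thread map (an assumption for illustration).
struct item
{
    bool in_map;
};

// Batched variant: the loop only touches plain local integers, and each
// shared counter is updated once after the loop.
void cleanup_batched(queue_counters& q, std::vector<item>& terminated)
{
    std::int64_t terminated_delta = 0;
    std::int64_t map_delta = 0;

    while (!terminated.empty())
    {
        item const it = terminated.back();
        terminated.pop_back();

        ++terminated_delta;
        if (it.in_map)
            ++map_delta;
    }

    // One atomic read-modify-write per counter instead of one per
    // iteration. The default (sequentially consistent) ordering is kept,
    // matching what the pre-decrement operator on std::atomic uses.
    if (terminated_delta != 0)
        q.terminated_items_count_.fetch_sub(terminated_delta);
    if (map_delta != 0)
        q.thread_map_count_.fetch_sub(map_delta);
}
```

The trade-off is that other threads reading the counters see the old values until the loop ends, so this only fits if nothing relies on the counters being accurate mid-cleanup.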