[Deepin-Kernel-SIG] [linux 6.18-y] [Fromlist] steal tasks to improve CPU utilization#1618
Avenger-285714 wants to merge 10 commits into deepin-community:linux-6.18.y from
Conversation
Provide struct sparsemask and functions to manipulate it. A sparsemask is a sparse bitmap. It reduces cache contention vs the usual bitmap when many threads concurrently set, clear, and visit elements, by reducing the number of significant bits per cacheline. For each cacheline chunk of the mask, only the first K bits of the first word are used, and the remaining bits are ignored, where K is a creation time parameter. Thus a sparsemask that can represent a set of N elements is approximately (N/K * CACHELINE) bytes in size. This type is simpler and more efficient than the struct sbitmap used by block drivers. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
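The size relationship described above can be sketched as follows. This is a userspace illustration, not the kernel code: `sparsemask_bytes` and `CACHELINE` are hypothetical names standing in for the actual allocation-size computation.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* bytes per cacheline-sized chunk (typical x86) */

/*
 * Illustrative sizing: only the first K bits of the first word of each
 * cacheline chunk are significant, so representing N elements needs
 * roughly ceil(N / K) chunks of CACHELINE bytes each, i.e. approximately
 * (N / K * CACHELINE) bytes as the commit message states.
 */
static size_t sparsemask_bytes(int nelems, int bits_per_chunk)
{
	int chunks = (nelems + bits_per_chunk - 1) / bits_per_chunk;

	return (size_t)chunks * CACHELINE;
}
```

Note the trade-off: a smaller K spreads the same set over more cachelines, reducing contention at the cost of memory.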
Add functions sd_llc_alloc_all() and sd_llc_free_all() to allocate and free data pointed to by struct sched_domain_shared at the last-level-cache domain. sd_llc_alloc_all() is called after the SD hierarchy is known, to eliminate the unnecessary allocations that would occur if we instead allocated in __sdt_alloc() and then figured out which shared nodes are redundant. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Define and initialize a sparse bitmap of overloaded CPUs, per last-level-cache scheduling domain, for use by the CFS scheduling class. Save a pointer to cfs_overload_cpus in the rq for efficient access. Signed-off-by: Steve Sistare <steve.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
An overloaded CPU has more than 1 runnable task. When a CFS task wakes on a CPU, if h_nr_runnable transitions from 1 to more, then set the CPU in the cfs_overload_cpus bitmap. When a CFS task sleeps, if h_nr_runnable transitions from 2 to less, then clear the CPU in cfs_overload_cpus. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
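The transition logic above can be sketched with a toy model; `overloaded[]`, `toy_cfs_rq`, `on_enqueue`, and `on_dequeue` are illustrative stand-ins for the sparsemask and the fair-class enqueue/dequeue hooks, not the kernel's actual names.

```c
#include <assert.h>

#define NCPU 4

/* Plays the role of the per-LLC cfs_overload_cpus sparsemask. */
static int overloaded[NCPU];

struct toy_cfs_rq { int h_nr_runnable; };

static void on_enqueue(struct toy_cfs_rq *cfs, int cpu)
{
	cfs->h_nr_runnable++;
	if (cfs->h_nr_runnable == 2)	/* 1 -> 2: CPU becomes overloaded */
		overloaded[cpu] = 1;
}

static void on_dequeue(struct toy_cfs_rq *cfs, int cpu)
{
	cfs->h_nr_runnable--;
	if (cfs->h_nr_runnable == 1)	/* 2 -> 1: CPU no longer overloaded */
		overloaded[cpu] = 0;
}
```

Checking only at the 1↔2 boundary means the bitmap is touched once per overload transition, not on every enqueue/dequeue, which keeps the hot paths cheap.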
Move the update of idle_stamp from idle_balance to the call site in pick_next_task_fair, to prepare for a future patch that adds work to pick_next_task_fair which must be included in the idle_stamp interval. No functional change. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
The detach_task function takes a struct lb_env argument, but only needs a few of its members. Pass the rq and cpu arguments explicitly so the function may be called from code that is not based on lb_env. No functional change. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Define a simpler version of can_migrate_task called can_migrate_task_llc which does not require a struct lb_env argument, and judges whether a migration from one CPU to another within the same LLC should be allowed. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
When a CPU has no more CFS tasks to run, and idle_balance() fails to find a task, then attempt to steal a task from an overloaded CPU in the same LLC, using the cfs_overload_cpus bitmap to efficiently identify candidates. To minimize search time, steal the first migratable task that is found when the bitmap is traversed. For fairness, search for migratable tasks on an overloaded CPU in order of next to run.

This simple stealing yields higher CPU utilization than idle_balance() alone, because the search is cheap, so it may be called every time the CPU is about to go idle. idle_balance() does more work because it searches widely for the busiest queue, so to limit its CPU consumption, it declines to search if the system is too busy. Simple stealing does not offload the globally busiest queue, but it is much better than running nothing at all.

Stealing is controlled by the sched feature SCHED_STEAL, which is enabled by default. Note that all test results presented below are based on the NO_DELAY_DEQUEUE implementation.

Stealing improves utilization with only a modest CPU overhead in scheduler code. In the following experiment, hackbench is run with varying numbers of groups (40 tasks per group), and the delta in /proc/schedstat is shown for each run, averaged per CPU, augmented with these non-standard stats:

steal - number of times a task is stolen from another CPU.
X6-2: 2 socket * 40 cores * 2 hyperthreads = 160 CPUs
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
hackbench <grps> process 100000

baseline
grps  time   %busy  sched   idle   wake    steal
1     2.182  20.00  35876   17905  17958   0
2     2.391  39.00  67753   33808  33921   0
3     2.871  47.00  100944  48966  51538   0
4     2.928  62.00  114489  55171  59059   0
8     4.852  83.00  219907  92961  121703  0

new
grps  time   %busy  sched   idle   wake    steal  %speedup
1     2.229  18.00  45450   22691  22751   52     -2.1
2     2.123  40.00  49975   24977  24990   6      12.6
3     2.690  61.00  56118   22641  32780   9073   6.7
4     2.828  80.00  37927   12828  24165   8442   3.5
8     4.120  95.00  85929   8613   57858   11098  17.8

Elapsed time improves by up to 17.8%, and CPU busy utilization is up by 1 to 18%, hitting 95% at peak load. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
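The %speedup column above is consistent with the relative reduction in elapsed time, which can be checked as follows (function name is illustrative):

```c
#include <assert.h>

/* %speedup = (baseline_time / new_time - 1) * 100, so a slower "new"
 * run yields a negative value, as in the 1-group row above. */
static double speedup_pct(double baseline, double new_time)
{
	return (baseline / new_time - 1.0) * 100.0;
}
```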
Replace zero-length array chunks[0] with C99 flexible array member chunks[] in struct sparsemask to fix UBSAN warning: UBSAN: array-index-out-of-bounds in kernel/sched/sparsemask.h:181:32 index 0 is out of range for type 'struct sparsemask_chunk[0]' The zero-length array is a deprecated GCC extension. Using a proper flexible array member eliminates the UBSAN false positive while maintaining the same runtime behavior. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202603242133.f66e336f-lkp@intel.com Cc: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
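The fix can be illustrated in isolation. The struct layout below mirrors the one in this series; `mask_alloc` is a hypothetical helper showing the allocation pattern a flexible array member requires.

```c
#include <assert.h>
#include <stdlib.h>

struct sparsemask_chunk { unsigned long word; };

/* With a C99 flexible array member, sizeof() covers only the fixed
 * header and the trailing array is sized at allocation time; indexing
 * chunks[i] for i < nchunks is well-defined, which eliminates UBSAN's
 * array-index-out-of-bounds report against the old chunks[0] form. */
struct sparsemask {
	short nelems;				/* current number of elements */
	short density;				/* 2^density elements per chunk */
	struct sparsemask_chunk chunks[];	/* was: chunks[0] (GCC extension) */
};

/* Hypothetical allocator: header plus nchunks trailing chunks. */
static struct sparsemask *mask_alloc(int nchunks)
{
	return calloc(1, sizeof(struct sparsemask) +
			 nchunks * sizeof(struct sparsemask_chunk));
}
```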
Disable the STEAL scheduler feature by default to minimize the risk of unexpected regressions on existing workloads. The steal-task mechanism is still compiled in and can be dynamically enabled at runtime via: echo STEAL > /sys/kernel/debug/sched/features Once sufficient testing evidence demonstrates that this feature is universally beneficial with no adverse side effects, this commit can simply be reverted to re-enable it by default. Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Reviewer's Guide

Introduces a sparse bitmap abstraction (sparsemask) and uses it to track overloaded CFS runqueues per-LLC, wiring this into the fair scheduler and topology code to enable low-cost task stealing from overloaded CPUs when a CPU goes idle, guarded by a new SCHED_STEAL feature flag.

Sequence diagram for stealing a CFS task when a CPU goes idle

sequenceDiagram
actor CPU
participant rq
participant pick_next_task_fair
participant sched_balance_newidle
participant try_steal
participant steal_from
participant src_rq
participant detach_next_task
participant attach_task
CPU->>pick_next_task_fair: call with rq, rq_flags
pick_next_task_fair->>rq: no CFS task, go to idle
pick_next_task_fair->>rq: rq_idle_stamp_update
pick_next_task_fair->>sched_balance_newidle: new_tasks = sched_balance_newidle(rq, rf)
sched_balance_newidle-->>pick_next_task_fair: return new_tasks
alt no_task_pulled
pick_next_task_fair->>try_steal: new_tasks = try_steal(rq, rf)
try_steal-->>try_steal: check sched_feat(STEAL), cpu_active, avg_idle
try_steal-->>try_steal: rcu_dereference cfs_overload_cpus
alt SMT_present
loop cpus_in_cpu_smt_mask
try_steal->>steal_from: attempt steal_from(dst_rq, dst_rf, locked, src_cpu)
alt stolen_from_smt_sibling
steal_from->>src_rq: rq_lock_irqsave(src_rq, rf)
steal_from->>src_rq: update_rq_clock
steal_from->>detach_next_task: p = detach_next_task(src_rq->cfs, dst_rq)
detach_next_task-->>steal_from: return p
steal_from->>src_rq: rq_unlock(src_rq, rf)
steal_from->>rq: raw_spin_rq_lock(dst_rq)
steal_from->>rq: rq_repin_lock(dst_rq, dst_rf)
steal_from->>rq: update_rq_clock
steal_from->>attach_task: attach_task(dst_rq, p)
attach_task-->>steal_from: done
steal_from-->>try_steal: return 1
else not_stolen_here
steal_from-->>try_steal: return 0
end
end
end
alt not_stolen_yet
loop sparsemask_for_each overloaded_cpu_in_llc
try_steal->>steal_from: attempt steal_from(dst_rq, dst_rf, locked, src_cpu)
alt stolen_from_llc_peer
steal_from->>src_rq: rq_lock_irqsave(src_rq, rf)
steal_from->>src_rq: update_rq_clock
steal_from->>detach_next_task: p = detach_next_task(src_rq->cfs, dst_rq)
detach_next_task-->>steal_from: return p
steal_from->>src_rq: rq_unlock(src_rq, rf)
steal_from->>rq: raw_spin_rq_lock(dst_rq)
steal_from->>rq: rq_repin_lock(dst_rq, dst_rf)
steal_from->>rq: update_rq_clock
steal_from->>attach_task: attach_task(dst_rq, p)
attach_task-->>steal_from: done
steal_from-->>try_steal: return 1
else no_task_migrated
steal_from-->>try_steal: return 0
end
end
end
try_steal-->>pick_next_task_fair: return stolen_flag
alt task_stolen
pick_next_task_fair->>rq: rq_idle_stamp_clear
pick_next_task_fair-->>CPU: RETRY_TASK (reselect next task)
else no_task_stolen
pick_next_task_fair-->>CPU: remain idle
end
else task_pulled_by_newidle
pick_next_task_fair->>rq: rq_idle_stamp_clear
pick_next_task_fair-->>CPU: run pulled task
end
Updated class diagram for runqueue, sched_domain_shared, and sparsemask

classDiagram
class rq {
+cfs : cfs_rq
+rt : rt_rq
+dl : dl_rq
+cfs_overload_cpus : sparsemask*
}
class sched_domain_shared {
+ref : atomic_t
+nr_busy_cpus : atomic_t
+has_idle_cores : int
+cfs_overload_cpus : sparsemask*
+nr_idle_scan : int
}
class sparsemask_chunk {
+word : unsigned long
}
class sparsemask {
+nelems : short
+density : short
+chunks : sparsemask_chunk[]
+sparsemask_size(nelems int, density int) size_t
+sparsemask_init(mask sparsemask*, nelems int, density int) void
+sparsemask_alloc_node(nelems int, density int, flags gfp_t, node int) sparsemask*
+sparsemask_free(mask sparsemask*) void
+sparsemask_set_elem(dst sparsemask*, elem int) void
+sparsemask_clear_elem(dst sparsemask*, elem int) void
+sparsemask_test_elem(mask sparsemask*, elem int) int
+sparsemask_next(mask sparsemask*, origin int, prev int) int
}
rq --> sparsemask : cfs_overload_cpus
sched_domain_shared --> sparsemask : cfs_overload_cpus
sparsemask o-- sparsemask_chunk : contains
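The chunk/bit indexing implied by `density` above can be modeled in a few lines. This is a toy userspace sketch: `DENSITY`, `words[]`, and the `toy_*` functions are illustrative names, not the kernel's (which uses internal macros such as `_SMASK_INDEX`).

```c
#include <assert.h>

#define DENSITY 3			/* 2^3 = 8 significant bits per chunk */

/* One "first word" per cacheline chunk; the rest of each chunk is unused. */
static unsigned long words[16];

/* Element i lives in chunk i >> DENSITY, at bit i & (2^DENSITY - 1). */
static int chunk_of(int elem) { return elem >> DENSITY; }
static int bit_of(int elem)   { return elem & ((1 << DENSITY) - 1); }

static void toy_set(int elem)   { words[chunk_of(elem)] |=  1UL << bit_of(elem); }
static void toy_clear(int elem) { words[chunk_of(elem)] &= ~(1UL << bit_of(elem)); }
static int  toy_test(int elem)  { return !!(words[chunk_of(elem)] & (1UL << bit_of(elem))); }
```

Because neighboring elements land in different cachelines once they cross a chunk boundary, concurrent set/clear traffic from different CPUs mostly avoids bouncing the same line.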
File-Level Changes
File-Level Changes
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of the changed files.
Hey - I've left some high-level feedback:

- The lifetime of `cfs_overload_cpus` relative to RCU readers looks unsafe: it is freed in `sd_llc_free`/`sd_llc_free_all` while rqs still hold RCU-protected pointers to it, so you likely need an RCU-safe free (e.g. call_rcu) or a synchronize_rcu() after clearing the pointers before freeing the sparsemask.
- The rq locking and IRQ state transitions in `steal_from`/`try_steal` are non-standard for scheduler code (e.g. mixing `rq_lock_irqsave` on src_rq, raw locking on dst_rq, and `local_irq_restore(rf.flags)` after reacquiring dst_rq), so consider revisiting this to follow the usual rq lock + IRQ discipline and avoid subtle deadlocks or IRQ-state mismatches.
- The description says SCHED_STEAL is enabled by default, but `SCHED_FEAT(STEAL, false)` disables it, so either the feature default or the description should be updated to avoid confusion.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The lifetime of `cfs_overload_cpus` relative to RCU readers looks unsafe: it is freed in `sd_llc_free`/`sd_llc_free_all` while rqs still hold RCU-protected pointers to it, so you likely need an RCU-safe free (e.g. call_rcu) or a synchronize_rcu() after clearing the pointers before freeing the sparsemask.
- The rq locking and IRQ state transitions in `steal_from`/`try_steal` are non-standard for scheduler code (e.g. mixing `rq_lock_irqsave` on src_rq, raw locking on dst_rq, and `local_irq_restore(rf.flags)` after reacquiring dst_rq), so consider revisiting this to follow the usual rq lock + IRQ discipline and avoid subtle deadlocks or IRQ-state mismatches.
- The description says SCHED_STEAL is enabled by default, but `SCHED_FEAT(STEAL, false)` disables it, so either the feature default or the description should be updated to avoid confusion.
Pull request overview
This PR introduces an LLC-scoped “overloaded CPU” tracking bitmap (implemented as a new sparse bitmap type) and uses it to opportunistically steal a migratable CFS task when a CPU is about to go idle and sched_balance_newidle() doesn’t pull anything, aiming to improve CPU utilization with a cheap, local search.
Changes:
- Add `sparsemask` (a cacheline-sparse bitmap) and integrate it as `sched_domain_shared::cfs_overload_cpus`.
- Track per-LLC overloaded CPUs and expose the LLC bitmap to each `rq` via an RCU-updated pointer.
- Add a new idle-time stealing path in `fair.c` which scans overloaded CPUs (SMT siblings first) and migrates the first suitable CFS task found.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| kernel/sched/topology.c | Allocates/frees per-LLC sparsemask and wires it into per-CPU rq via RCU assignment. |
| kernel/sched/sparsemask.h | Adds the new sparse bitmap type and iteration API used by the scheduler. |
| kernel/sched/sched.h | Adds rq->cfs_overload_cpus pointer for fast access to the LLC overloaded-CPU bitmap. |
| kernel/sched/features.h | Introduces the STEAL scheduler feature flag. |
| kernel/sched/fair.c | Updates overload tracking on enqueue/dequeue and implements try_steal() in the idle path. |
| include/linux/sched/topology.h | Extends sched_domain_shared with a pointer to the overload bitmap. |
```c
struct sched_domain_shared {
	atomic_t		ref;
	atomic_t		nr_busy_cpus;
	int			has_idle_cores;
	struct sparsemask	*cfs_overload_cpus;
	int			nr_idle_scan;
};
```
include/linux/sched/topology.h now references struct sparsemask in struct sched_domain_shared, but there is no forward declaration or header include that declares struct sparsemask here. This will fail to compile for any translation unit that includes this header. Add struct sparsemask; (or move the type to an appropriate exported header) before struct sched_domain_shared.
```c
	struct cfs_rq		cfs;
	struct rt_rq		rt;
	struct dl_rq		dl;
	struct sparsemask	*cfs_overload_cpus;
```
rq->cfs_overload_cpus is accessed using RCU primitives (rcu_assign_pointer/rcu_dereference), but the field itself is not annotated as an RCU-protected pointer. This can lead to sparse/RCU checking issues and makes the intended lifetime rules unclear. Consider declaring it as struct sparsemask __rcu *cfs_overload_cpus (and updating accesses accordingly).
Suggested change:

```diff
-	struct sparsemask *cfs_overload_cpus;
+	struct sparsemask __rcu *cfs_overload_cpus;
```
```c
/*
 * Steal a CFS task from another CPU when going idle.
 * Improves CPU utilization.
 */
SCHED_FEAT(STEAL, false)
```
SCHED_FEAT(STEAL, false) disables stealing by default, but the PR description states that SCHED_STEAL is enabled by default. Please reconcile this (either set the default to true or update the description / feature naming so behavior matches expectations).
```c
struct sparsemask {
	short nelems;				/* current number of elements */
	short density;				/* store 2^density elements per chunk */
	struct sparsemask_chunk chunks[];	/* embedded array of chunks */
};
```
The struct sparsemask stores nelems and density in short. Since callers pass int nelems (e.g., nr_cpu_ids), this can truncate on larger configurations and make iteration/allocation incorrect. Use int (or at least u32) for these fields, and validate inputs (including nelems upper bound) in sparsemask_init().
```c
	int index = _SMASK_INDEX(density, nelems) + 1;
```
sparsemask_size() computes index = _SMASK_INDEX(density, nelems) + 1, which over-allocates by one chunk when nelems is an exact multiple of 2^density (e.g., nelems=8, density=3). If valid elements are 0..nelems-1, this should typically be based on nelems-1 (with a special-case for nelems==0).
Suggested change:

```diff
-	int index = _SMASK_INDEX(density, nelems) + 1;
+	int index;
+
+	if (nelems == 0)
+		index = 0;
+	else
+		index = _SMASK_INDEX(density, nelems - 1) + 1;
```
```c
	int dst_cpu = dst_rq->cpu;
	struct rq *src_rq = cpu_rq(src_cpu);

	if (dst_cpu == src_cpu || src_rq->cfs.h_nr_runnable < 2)
```
steal_from() reads src_rq->cfs.h_nr_runnable before taking src_rq's lock. Even though this is an early-bailout check, it can trigger KCSAN data-race reports and may observe transient values. Prefer moving this check under the rq_lock_irqsave(src_rq, ...) region or using READ_ONCE() for the unlocked read.
Suggested change:

```diff
-	if (dst_cpu == src_cpu || src_rq->cfs.h_nr_runnable < 2)
+	if (dst_cpu == src_cpu || READ_ONCE(src_rq->cfs.h_nr_runnable) < 2)
```
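For reference, READ_ONCE() can be modeled in userspace as a volatile access, which forces the compiler to emit exactly one untorn load rather than refetching or caching the value. This is a simplified sketch; the kernel's real definition has additional machinery for non-scalar sizes, and `sample_h_nr_runnable`/`worth_stealing` are hypothetical names.

```c
#include <assert.h>

/* Simplified model of the kernel's READ_ONCE(): read through a
 * volatile-qualified pointer so the load cannot be torn or repeated. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

static int sample_h_nr_runnable = 3;

/* Unlocked early-bailout check, as the suggestion above recommends. */
static int worth_stealing(void)
{
	return READ_ONCE(sample_h_nr_runnable) >= 2;
}
```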
```diff
@@ -680,8 +691,10 @@ static void update_top_cache_domain(int cpu)
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
 		sds = sd->shared;
+		cfs_overload_cpus = sds->cfs_overload_cpus;
 	}

+	rcu_assign_pointer(rq->cfs_overload_cpus, cfs_overload_cpus);
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
```
rq->cfs_overload_cpus is an RCU-protected pointer, but it is updated in update_top_cache_domain() while the backing sds->cfs_overload_cpus is freed from the sched-domain RCU teardown path. To avoid a potential UAF, ensure that all CPUs switch rq->cfs_overload_cpus away from the old mask before the old mask can be freed (e.g., update/clear this pointer before scheduling destroy_sched_domains() for the old domain, or otherwise guarantee ordering with RCU grace periods).
When a CPU has no more CFS tasks to run, and idle_balance() fails to
find a task, then attempt to steal a task from an overloaded CPU in the
same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
identify candidates. To minimize search time, steal the first migratable
task that is found when the bitmap is traversed. For fairness, search
for migratable tasks on an overloaded CPU in order of next to run.
This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle. idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy. Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.
The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
reduce cache contention vs the usual bitmap when many threads concurrently
set, clear, and visit elements.
Patch 1 defines the sparsemask type and its operations.
Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
Patches 5 and 6 refactor existing code for a cleaner merge of later
patches.
Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
Patch 9 adds schedstats for comparing the new behavior to the old, and
is provided as a convenience for developers only, not for integration.
The patch series is based on kernel 7.0.0-rc1. It compiles, boots, and
runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
and CONFIG_PREEMPT.
Stealing is controlled by the sched feature SCHED_STEAL, which is enabled
by default.
Stealing improves utilization with only a modest CPU overhead in scheduler
code. In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:
steal - number of times a task is stolen from another CPU.
X6-2: 2 socket * 40 cores * 2 hyperthreads = 160 CPUs
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
hackbench <grps> process 100000
baseline
grps time %busy sched idle wake steal
1 2.182 20.00 35876 17905 17958 0
2 2.391 39.00 67753 33808 33921 0
3 2.871 47.00 100944 48966 51538 0
4 2.928 62.00 114489 55171 59059 0
8 4.852 83.00 219907 92961 121703 0
new
grps time %busy sched idle wake steal %speedup
1 2.229 18.00 45450 22691 22751 52 -2.1
2 2.123 40.00 49975 24977 24990 6 12.6
3 2.690 61.00 56118 22641 32780 9073 6.7
4 2.828 80.00 37927 12828 24165 8442 3.5
8 4.120 95.00 85929 8613 57858 11098 17.8
Elapsed time improves by up to 17.8%, and CPU busy utilization is up
by 1 to 18% hitting 95% at peak load.
Note that all test results presented below are based on the
NO_DELAY_DEQUEUE implementation. Although I have implemented the necessary
adaptations to support DELAY_DEQUEUE, I observed a noticeable performance
regression in hackbench when both DELAY_DEQUEUE and SCHED_STEAL are enabled
simultaneously, specifically in heavily overloaded scenarios where the
number of tasks far exceeds the number of CPUs. Any suggestions on how to
address this would be appreciated.
Link: https://lore.kernel.org/lkml/20260320055920.2518389-1-chenjinghuang2@huawei.com/
Summary by Sourcery
Introduce a sparse bitmap-backed tracking of overloaded CPUs per LLC and use it to opportunistically steal CFS tasks when a CPU goes idle, improving utilization while integrating with existing scheduler topology and features.
New Features:
Enhancements: