[Deepin-Kernel-SIG] [linux 6.18-y] [Fromlist] steal tasks to improve CPU utilization#1618
Avenger-285714 wants to merge 10 commits into deepin-community:linux-6.18.y from
Conversation
Provide struct sparsemask and functions to manipulate it. A sparsemask is a sparse bitmap. It reduces cache contention vs the usual bitmap when many threads concurrently set, clear, and visit elements, by reducing the number of significant bits per cacheline. For each cacheline chunk of the mask, only the first K bits of the first word are used, and the remaining bits are ignored, where K is a creation time parameter. Thus a sparsemask that can represent a set of N elements is approximately (N/K * CACHELINE) bytes in size. This type is simpler and more efficient than the struct sbitmap used by block drivers. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
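The size relationship described above can be sketched as follows. This is a userspace illustration, not the kernel code: `sparsemask_bytes` and `CACHELINE` are hypothetical names standing in for the actual allocation-size computation.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* bytes per cacheline-sized chunk (typical x86) */

/*
 * Illustrative sizing: only the first K bits of the first word of each
 * cacheline chunk are significant, so representing N elements needs
 * roughly ceil(N / K) chunks of CACHELINE bytes each, i.e. approximately
 * (N / K * CACHELINE) bytes as the commit message states.
 */
static size_t sparsemask_bytes(int nelems, int bits_per_chunk)
{
	int chunks = (nelems + bits_per_chunk - 1) / bits_per_chunk;

	return (size_t)chunks * CACHELINE;
}
```

Note the trade-off: a smaller K spreads the same set over more cachelines, reducing contention at the cost of memory.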
Add functions sd_llc_alloc_all() and sd_llc_free_all() to allocate and free data pointed to by struct sched_domain_shared at the last-level-cache domain. sd_llc_alloc_all() is called after the SD hierarchy is known, to eliminate the unnecessary allocations that would occur if we instead allocated in __sdt_alloc() and then figured out which shared nodes are redundant. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Define and initialize a sparse bitmap of overloaded CPUs, per last-level-cache scheduling domain, for use by the CFS scheduling class. Save a pointer to cfs_overload_cpus in the rq for efficient access. Signed-off-by: Steve Sistare <steve.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
An overloaded CPU has more than 1 runnable task. When a CFS task wakes on a CPU, if h_nr_runnable transitions from 1 to more, then set the CPU in the cfs_overload_cpus bitmap. When a CFS task sleeps, if h_nr_runnable transitions from 2 to less, then clear the CPU in cfs_overload_cpus. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
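The transition logic above can be sketched with a toy model; `overloaded[]`, `toy_cfs_rq`, `on_enqueue`, and `on_dequeue` are illustrative stand-ins for the sparsemask and the fair-class enqueue/dequeue hooks, not the kernel's actual names.

```c
#include <assert.h>

#define NCPU 4

/* Plays the role of the per-LLC cfs_overload_cpus sparsemask. */
static int overloaded[NCPU];

struct toy_cfs_rq { int h_nr_runnable; };

static void on_enqueue(struct toy_cfs_rq *cfs, int cpu)
{
	cfs->h_nr_runnable++;
	if (cfs->h_nr_runnable == 2)	/* 1 -> 2: CPU becomes overloaded */
		overloaded[cpu] = 1;
}

static void on_dequeue(struct toy_cfs_rq *cfs, int cpu)
{
	cfs->h_nr_runnable--;
	if (cfs->h_nr_runnable == 1)	/* 2 -> 1: CPU no longer overloaded */
		overloaded[cpu] = 0;
}
```

Checking only at the 1↔2 boundary means the bitmap is touched once per overload transition, not on every enqueue/dequeue, which keeps the hot paths cheap.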
Move the update of idle_stamp from idle_balance to the call site in pick_next_task_fair, to prepare for a future patch that adds work to pick_next_task_fair which must be included in the idle_stamp interval. No functional change. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
The detach_task function takes a struct lb_env argument, but only needs a few of its members. Pass the rq and cpu arguments explicitly so the function may be called from code that is not based on lb_env. No functional change. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Define a simpler version of can_migrate_task called can_migrate_task_llc which does not require a struct lb_env argument, and judges whether a migration from one CPU to another within the same LLC should be allowed. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
When a CPU has no more CFS tasks to run, and idle_balance() fails to find a task, then attempt to steal a task from an overloaded CPU in the same LLC, using the cfs_overload_cpus bitmap to efficiently identify candidates. To minimize search time, steal the first migratable task that is found when the bitmap is traversed. For fairness, search for migratable tasks on an overloaded CPU in order of next to run.

This simple stealing yields higher CPU utilization than idle_balance() alone, because the search is cheap, so it may be called every time the CPU is about to go idle. idle_balance() does more work because it searches widely for the busiest queue, so to limit its CPU consumption, it declines to search if the system is too busy. Simple stealing does not offload the globally busiest queue, but it is much better than running nothing at all.

Stealing is controlled by the sched feature SCHED_STEAL, which is enabled by default. Note that all test results presented below are based on the NO_DELAY_DEQUEUE implementation.

Stealing improves utilization with only a modest CPU overhead in scheduler code. In the following experiment, hackbench is run with varying numbers of groups (40 tasks per group), and the delta in /proc/schedstat is shown for each run, averaged per CPU, augmented with these non-standard stats:

steal - number of times a task is stolen from another CPU.
X6-2: 2 socket * 40 cores * 2 hyperthreads = 160 CPUs
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
hackbench <grps> process 100000

baseline
grps  time   %busy  sched   idle   wake    steal
1     2.182  20.00  35876   17905  17958   0
2     2.391  39.00  67753   33808  33921   0
3     2.871  47.00  100944  48966  51538   0
4     2.928  62.00  114489  55171  59059   0
8     4.852  83.00  219907  92961  121703  0

new
grps  time   %busy  sched   idle   wake    steal  %speedup
1     2.229  18.00  45450   22691  22751   52     -2.1
2     2.123  40.00  49975   24977  24990   6      12.6
3     2.690  61.00  56118   22641  32780   9073   6.7
4     2.828  80.00  37927   12828  24165   8442   3.5
8     4.120  95.00  85929   8613   57858   11098  17.8

Elapsed time improves by up to 17.8%, and CPU busy utilization is up by 1 to 18%, hitting 95% at peak load. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
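The %speedup column above is consistent with the relative reduction in elapsed time, which can be checked as follows (function name is illustrative):

```c
#include <assert.h>

/* %speedup = (baseline_time / new_time - 1) * 100, so a slower "new"
 * run yields a negative value, as in the 1-group row above. */
static double speedup_pct(double baseline, double new_time)
{
	return (baseline / new_time - 1.0) * 100.0;
}
```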
Replace zero-length array chunks[0] with C99 flexible array member chunks[] in struct sparsemask to fix UBSAN warning: UBSAN: array-index-out-of-bounds in kernel/sched/sparsemask.h:181:32 index 0 is out of range for type 'struct sparsemask_chunk[0]' The zero-length array is a deprecated GCC extension. Using a proper flexible array member eliminates the UBSAN false positive while maintaining the same runtime behavior. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202603242133.f66e336f-lkp@intel.com Cc: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
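The fix can be illustrated in isolation. The struct layout below mirrors the one in this series; `mask_alloc` is a hypothetical helper showing the allocation pattern a flexible array member requires.

```c
#include <assert.h>
#include <stdlib.h>

struct sparsemask_chunk { unsigned long word; };

/* With a C99 flexible array member, sizeof() covers only the fixed
 * header and the trailing array is sized at allocation time; indexing
 * chunks[i] for i < nchunks is well-defined, which eliminates UBSAN's
 * array-index-out-of-bounds report against the old chunks[0] form. */
struct sparsemask {
	short nelems;				/* current number of elements */
	short density;				/* 2^density elements per chunk */
	struct sparsemask_chunk chunks[];	/* was: chunks[0] (GCC extension) */
};

/* Hypothetical allocator: header plus nchunks trailing chunks. */
static struct sparsemask *mask_alloc(int nchunks)
{
	return calloc(1, sizeof(struct sparsemask) +
			 nchunks * sizeof(struct sparsemask_chunk));
}
```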
Disable the STEAL scheduler feature by default to minimize the risk of unexpected regressions on existing workloads. The steal-task mechanism is still compiled in and can be dynamically enabled at runtime via: echo STEAL > /sys/kernel/debug/sched/features Once sufficient testing evidence demonstrates that this feature is universally beneficial with no adverse side effects, this commit can simply be reverted to re-enable it by default. Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Reviewer's Guide

Introduces a sparse bitmap abstraction (sparsemask) and uses it to track overloaded CFS runqueues per-LLC, wiring this into the fair scheduler and topology code to enable low-cost task stealing from overloaded CPUs when a CPU goes idle, guarded by a new SCHED_STEAL feature flag.

Sequence diagram for stealing a CFS task when a CPU goes idle

sequenceDiagram
actor CPU
participant rq
participant pick_next_task_fair
participant sched_balance_newidle
participant try_steal
participant steal_from
participant src_rq
participant detach_next_task
participant attach_task
CPU->>pick_next_task_fair: call with rq, rq_flags
pick_next_task_fair->>rq: no CFS task, go to idle
pick_next_task_fair->>rq: rq_idle_stamp_update
pick_next_task_fair->>sched_balance_newidle: new_tasks = sched_balance_newidle(rq, rf)
sched_balance_newidle-->>pick_next_task_fair: return new_tasks
alt no_task_pulled
pick_next_task_fair->>try_steal: new_tasks = try_steal(rq, rf)
try_steal-->>try_steal: check sched_feat(STEAL), cpu_active, avg_idle
try_steal-->>try_steal: rcu_dereference cfs_overload_cpus
alt SMT_present
loop cpus_in_cpu_smt_mask
try_steal->>steal_from: attempt steal_from(dst_rq, dst_rf, locked, src_cpu)
alt stolen_from_smt_sibling
steal_from->>src_rq: rq_lock_irqsave(src_rq, rf)
steal_from->>src_rq: update_rq_clock
steal_from->>detach_next_task: p = detach_next_task(src_rq->cfs, dst_rq)
detach_next_task-->>steal_from: return p
steal_from->>src_rq: rq_unlock(src_rq, rf)
steal_from->>rq: raw_spin_rq_lock(dst_rq)
steal_from->>rq: rq_repin_lock(dst_rq, dst_rf)
steal_from->>rq: update_rq_clock
steal_from->>attach_task: attach_task(dst_rq, p)
attach_task-->>steal_from: done
steal_from-->>try_steal: return 1
else not_stolen_here
steal_from-->>try_steal: return 0
end
end
end
alt not_stolen_yet
loop sparsemask_for_each overloaded_cpu_in_llc
try_steal->>steal_from: attempt steal_from(dst_rq, dst_rf, locked, src_cpu)
alt stolen_from_llc_peer
steal_from->>src_rq: rq_lock_irqsave(src_rq, rf)
steal_from->>src_rq: update_rq_clock
steal_from->>detach_next_task: p = detach_next_task(src_rq->cfs, dst_rq)
detach_next_task-->>steal_from: return p
steal_from->>src_rq: rq_unlock(src_rq, rf)
steal_from->>rq: raw_spin_rq_lock(dst_rq)
steal_from->>rq: rq_repin_lock(dst_rq, dst_rf)
steal_from->>rq: update_rq_clock
steal_from->>attach_task: attach_task(dst_rq, p)
attach_task-->>steal_from: done
steal_from-->>try_steal: return 1
else no_task_migrated
steal_from-->>try_steal: return 0
end
end
end
try_steal-->>pick_next_task_fair: return stolen_flag
alt task_stolen
pick_next_task_fair->>rq: rq_idle_stamp_clear
pick_next_task_fair-->>CPU: RETRY_TASK (reselect next task)
else no_task_stolen
pick_next_task_fair-->>CPU: remain idle
end
else task_pulled_by_newidle
pick_next_task_fair->>rq: rq_idle_stamp_clear
pick_next_task_fair-->>CPU: run pulled task
end
Updated class diagram for runqueue, sched_domain_shared, and sparsemask

classDiagram
class rq {
+cfs : cfs_rq
+rt : rt_rq
+dl : dl_rq
+cfs_overload_cpus : sparsemask*
}
class sched_domain_shared {
+ref : atomic_t
+nr_busy_cpus : atomic_t
+has_idle_cores : int
+cfs_overload_cpus : sparsemask*
+nr_idle_scan : int
}
class sparsemask_chunk {
+word : unsigned long
}
class sparsemask {
+nelems : short
+density : short
+chunks : sparsemask_chunk[]
+sparsemask_size(nelems int, density int) size_t
+sparsemask_init(mask sparsemask*, nelems int, density int) void
+sparsemask_alloc_node(nelems int, density int, flags gfp_t, node int) sparsemask*
+sparsemask_free(mask sparsemask*) void
+sparsemask_set_elem(dst sparsemask*, elem int) void
+sparsemask_clear_elem(dst sparsemask*, elem int) void
+sparsemask_test_elem(mask sparsemask*, elem int) int
+sparsemask_next(mask sparsemask*, origin int, prev int) int
}
rq --> sparsemask : cfs_overload_cpus
sched_domain_shared --> sparsemask : cfs_overload_cpus
sparsemask o-- sparsemask_chunk : contains
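The chunk/bit indexing implied by `density` above can be modeled in a few lines. This is a toy userspace sketch: `DENSITY`, `words[]`, and the `toy_*` functions are illustrative names, not the kernel's (which uses internal macros such as `_SMASK_INDEX`).

```c
#include <assert.h>

#define DENSITY 3			/* 2^3 = 8 significant bits per chunk */

/* One "first word" per cacheline chunk; the rest of each chunk is unused. */
static unsigned long words[16];

/* Element i lives in chunk i >> DENSITY, at bit i & (2^DENSITY - 1). */
static int chunk_of(int elem) { return elem >> DENSITY; }
static int bit_of(int elem)   { return elem & ((1 << DENSITY) - 1); }

static void toy_set(int elem)   { words[chunk_of(elem)] |=  1UL << bit_of(elem); }
static void toy_clear(int elem) { words[chunk_of(elem)] &= ~(1UL << bit_of(elem)); }
static int  toy_test(int elem)  { return !!(words[chunk_of(elem)] & (1UL << bit_of(elem))); }
```

Because neighboring elements land in different cachelines once they cross a chunk boundary, concurrent set/clear traffic from different CPUs mostly avoids bouncing the same line.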
File-Level Changes
File-Level Changes
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of the changed files.
Hey - I've left some high-level feedback:

- The lifetime of `cfs_overload_cpus` relative to RCU readers looks unsafe: it is freed in `sd_llc_free`/`sd_llc_free_all` while rqs still hold RCU-protected pointers to it, so you likely need an RCU-safe free (e.g. call_rcu) or a synchronize_rcu() after clearing the pointers before freeing the sparsemask.
- The rq locking and IRQ state transitions in `steal_from`/`try_steal` are non-standard for scheduler code (e.g. mixing `rq_lock_irqsave` on src_rq, raw locking on dst_rq, and `local_irq_restore(rf.flags)` after reacquiring dst_rq), so consider revisiting this to follow the usual rq lock + IRQ discipline and avoid subtle deadlocks or IRQ-state mismatches.
- The description says SCHED_STEAL is enabled by default, but `SCHED_FEAT(STEAL, false)` disables it, so either the feature default or the description should be updated to avoid confusion.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The lifetime of `cfs_overload_cpus` relative to RCU readers looks unsafe: it is freed in `sd_llc_free`/`sd_llc_free_all` while rqs still hold RCU-protected pointers to it, so you likely need an RCU-safe free (e.g. call_rcu) or a synchronize_rcu() after clearing the pointers before freeing the sparsemask.
- The rq locking and IRQ state transitions in `steal_from`/`try_steal` are non-standard for scheduler code (e.g. mixing `rq_lock_irqsave` on src_rq, raw locking on dst_rq, and `local_irq_restore(rf.flags)` after reacquiring dst_rq), so consider revisiting this to follow the usual rq lock + IRQ discipline and avoid subtle deadlocks or IRQ-state mismatches.
- The description says SCHED_STEAL is enabled by default, but `SCHED_FEAT(STEAL, false)` disables it, so either the feature default or the description should be updated to avoid confusion.
Pull request overview
This PR introduces an LLC-scoped “overloaded CPU” tracking bitmap (implemented as a new sparse bitmap type) and uses it to opportunistically steal a migratable CFS task when a CPU is about to go idle and sched_balance_newidle() doesn’t pull anything, aiming to improve CPU utilization with a cheap, local search.
Changes:
- Add `sparsemask` (a cacheline-sparse bitmap) and integrate it as `sched_domain_shared::cfs_overload_cpus`.
- Track per-LLC overloaded CPUs and expose the LLC bitmap to each `rq` via an RCU-updated pointer.
- Add a new idle-time stealing path in `fair.c` which scans overloaded CPUs (SMT siblings first) and migrates the first suitable CFS task found.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| kernel/sched/topology.c | Allocates/frees per-LLC sparsemask and wires it into per-CPU rq via RCU assignment. |
| kernel/sched/sparsemask.h | Adds the new sparse bitmap type and iteration API used by the scheduler. |
| kernel/sched/sched.h | Adds rq->cfs_overload_cpus pointer for fast access to the LLC overloaded-CPU bitmap. |
| kernel/sched/features.h | Introduces the STEAL scheduler feature flag. |
| kernel/sched/fair.c | Updates overload tracking on enqueue/dequeue and implements try_steal() in the idle path. |
| include/linux/sched/topology.h | Extends sched_domain_shared with a pointer to the overload bitmap. |
```c
struct sched_domain_shared {
	atomic_t		ref;
	atomic_t		nr_busy_cpus;
	int			has_idle_cores;
	struct sparsemask	*cfs_overload_cpus;
	int			nr_idle_scan;
};
```
include/linux/sched/topology.h now references struct sparsemask in struct sched_domain_shared, but there is no forward declaration or header include that declares struct sparsemask here. This will fail to compile for any translation unit that includes this header. Add struct sparsemask; (or move the type to an appropriate exported header) before struct sched_domain_shared.
```c
	struct cfs_rq		cfs;
	struct rt_rq		rt;
	struct dl_rq		dl;
	struct sparsemask	*cfs_overload_cpus;
```
rq->cfs_overload_cpus is accessed using RCU primitives (rcu_assign_pointer/rcu_dereference), but the field itself is not annotated as an RCU-protected pointer. This can lead to sparse/RCU checking issues and makes the intended lifetime rules unclear. Consider declaring it as struct sparsemask __rcu *cfs_overload_cpus (and updating accesses accordingly).
Suggested change:

```diff
-	struct sparsemask *cfs_overload_cpus;
+	struct sparsemask __rcu *cfs_overload_cpus;
```
```c
/*
 * Steal a CFS task from another CPU when going idle.
 * Improves CPU utilization.
 */
SCHED_FEAT(STEAL, false)
```
SCHED_FEAT(STEAL, false) disables stealing by default, but the PR description states that SCHED_STEAL is enabled by default. Please reconcile this (either set the default to true or update the description / feature naming so behavior matches expectations).
```c
struct sparsemask {
	short nelems;				/* current number of elements */
	short density;				/* store 2^density elements per chunk */
	struct sparsemask_chunk chunks[];	/* embedded array of chunks */
};
```
The struct sparsemask stores nelems and density in short. Since callers pass int nelems (e.g., nr_cpu_ids), this can truncate on larger configurations and make iteration/allocation incorrect. Use int (or at least u32) for these fields, and validate inputs (including nelems upper bound) in sparsemask_init().
```c
	int index = _SMASK_INDEX(density, nelems) + 1;
```
sparsemask_size() computes index = _SMASK_INDEX(density, nelems) + 1, which over-allocates by one chunk when nelems is an exact multiple of 2^density (e.g., nelems=8, density=3). If valid elements are 0..nelems-1, this should typically be based on nelems-1 (with a special-case for nelems==0).
Suggested change:

```diff
-	int index = _SMASK_INDEX(density, nelems) + 1;
+	int index;
+
+	if (nelems == 0)
+		index = 0;
+	else
+		index = _SMASK_INDEX(density, nelems - 1) + 1;
```
```c
	int dst_cpu = dst_rq->cpu;
	struct rq *src_rq = cpu_rq(src_cpu);

	if (dst_cpu == src_cpu || src_rq->cfs.h_nr_runnable < 2)
```
steal_from() reads src_rq->cfs.h_nr_runnable before taking src_rq's lock. Even though this is an early-bailout check, it can trigger KCSAN data-race reports and may observe transient values. Prefer moving this check under the rq_lock_irqsave(src_rq, ...) region or using READ_ONCE() for the unlocked read.
Suggested change:

```diff
-	if (dst_cpu == src_cpu || src_rq->cfs.h_nr_runnable < 2)
+	if (dst_cpu == src_cpu || READ_ONCE(src_rq->cfs.h_nr_runnable) < 2)
```
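For reference, READ_ONCE() can be modeled in userspace as a volatile access, which forces the compiler to emit exactly one untorn load rather than refetching or caching the value. This is a simplified sketch; the kernel's real definition has additional machinery for non-scalar sizes, and `sample_h_nr_runnable`/`worth_stealing` are hypothetical names.

```c
#include <assert.h>

/* Simplified model of the kernel's READ_ONCE(): read through a
 * volatile-qualified pointer so the load cannot be torn or repeated. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

static int sample_h_nr_runnable = 3;

/* Unlocked early-bailout check, as the suggestion above recommends. */
static int worth_stealing(void)
{
	return READ_ONCE(sample_h_nr_runnable) >= 2;
}
```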
```diff
@@ -680,8 +691,10 @@ static void update_top_cache_domain(int cpu)
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
 		sds = sd->shared;
+		cfs_overload_cpus = sds->cfs_overload_cpus;
 	}

+	rcu_assign_pointer(rq->cfs_overload_cpus, cfs_overload_cpus);
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
```
rq->cfs_overload_cpus is an RCU-protected pointer, but it is updated in update_top_cache_domain() while the backing sds->cfs_overload_cpus is freed from the sched-domain RCU teardown path. To avoid a potential UAF, ensure that all CPUs switch rq->cfs_overload_cpus away from the old mask before the old mask can be freed (e.g., update/clear this pointer before scheduling destroy_sched_domains() for the old domain, or otherwise guarantee ordering with RCU grace periods).
When a CPU has no more CFS tasks to run, and idle_balance() fails to
find a task, then attempt to steal a task from an overloaded CPU in the
same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
identify candidates. To minimize search time, steal the first migratable
task that is found when the bitmap is traversed. For fairness, search
for migratable tasks on an overloaded CPU in order of next to run.
This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle. idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy. Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.
The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
reduce cache contention vs the usual bitmap when many threads concurrently
set, clear, and visit elements.
Patch 1 defines the sparsemask type and its operations.
Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
Patches 5 and 6 refactor existing code for a cleaner merge of later
patches.
Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
Patch 9 adds schedstats for comparing the new behavior to the old, and
is provided as a convenience for developers only, not for integration.
The patch series is based on kernel 7.0.0-rc1. It compiles, boots, and
runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
and CONFIG_PREEMPT.
Stealing is controlled by the sched feature SCHED_STEAL, which is enabled
by default.
Stealing improves utilization with only a modest CPU overhead in scheduler
code. In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:
steal - number of times a task is stolen from another CPU.
X6-2: 2 socket * 40 cores * 2 hyperthreads = 160 CPUs
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
hackbench <grps> process 100000
baseline
grps time %busy sched idle wake steal
1 2.182 20.00 35876 17905 17958 0
2 2.391 39.00 67753 33808 33921 0
3 2.871 47.00 100944 48966 51538 0
4 2.928 62.00 114489 55171 59059 0
8 4.852 83.00 219907 92961 121703 0
new
grps time %busy sched idle wake steal %speedup
1 2.229 18.00 45450 22691 22751 52 -2.1
2 2.123 40.00 49975 24977 24990 6 12.6
3 2.690 61.00 56118 22641 32780 9073 6.7
4 2.828 80.00 37927 12828 24165 8442 3.5
8 4.120 95.00 85929 8613 57858 11098 17.8
Elapsed time improves by up to 17.8%, and CPU busy utilization is up
by 1 to 18% hitting 95% at peak load.
Note that all test results presented below are based on the
NO_DELAY_DEQUEUE implementation. Although I have implemented the necessary
adaptations to support DELAY_DEQUEUE, I observed a noticeable performance
regression in hackbench when both DELAY_DEQUEUE and SCHED_STEAL are enabled
simultaneously, specifically in heavily overloaded scenarios where the
number of tasks far exceeds the number of CPUs. Any suggestions on how to
address this would be appreciated.
Link: https://lore.kernel.org/lkml/20260320055920.2518389-1-chenjinghuang2@huawei.com/
Summary by Sourcery
Introduce a sparse bitmap-backed tracking of overloaded CPUs per LLC and use it to opportunistically steal CFS tasks when a CPU goes idle, improving utilization while integrating with existing scheduler topology and features.
New Features:
Enhancements: