
[Deepin-Kernel-SIG] [linux 6.18-y] [Fromlist] steal tasks to improve CPU utilization #1618

Open
Avenger-285714 wants to merge 10 commits into deepin-community:linux-6.18.y from Avenger-285714:feature/steal-tasks

Conversation


@Avenger-285714 Avenger-285714 commented Apr 13, 2026

When a CPU has no more CFS tasks to run, and idle_balance() fails to
find a task, then attempt to steal a task from an overloaded CPU in the
same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
identify candidates. To minimize search time, steal the first migratable
task that is found when the bitmap is traversed. For fairness, search
for migratable tasks on an overloaded CPU in order of next to run.

This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle. idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy. Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.

The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
reduce cache contention vs the usual bitmap when many threads concurrently
set, clear, and visit elements.
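The search order described above can be sketched in plain C. This is an illustrative userspace model, not the patch code: `steal_from()` here is a toy stand-in that just moves a run count, and `next_overloaded()` plays the role of the sparsemask iterator.

```c
#include <stdbool.h>

#define NCPUS 8

static int nr_runnable[NCPUS];          /* toy per-CPU runnable counts */

/* Toy steal_from(): succeeds only when the source CPU has a surplus task. */
static bool steal_from(int dst_cpu, int src_cpu)
{
    if (nr_runnable[src_cpu] < 2)
        return false;
    nr_runnable[src_cpu]--;
    nr_runnable[dst_cpu]++;
    return true;
}

/* Stand-in for the sparsemask iterator over overloaded CPUs. */
static int next_overloaded(unsigned long mask, int prev)
{
    for (int cpu = prev + 1; cpu < NCPUS; cpu++)
        if (mask & (1UL << cpu))
            return cpu;
    return -1;
}

/* Search order: SMT siblings first, then the first overloaded CPU in
 * the LLC from which a task can actually be migrated. */
static bool try_steal(int dst_cpu, unsigned long overload_mask,
                      const int *smt_siblings, int nsiblings)
{
    for (int i = 0; i < nsiblings; i++)
        if (smt_siblings[i] != dst_cpu &&
            steal_from(dst_cpu, smt_siblings[i]))
            return true;

    for (int cpu = next_overloaded(overload_mask, -1); cpu >= 0;
         cpu = next_overloaded(overload_mask, cpu))
        if (cpu != dst_cpu && steal_from(dst_cpu, cpu))
            return true;

    return false;
}
```

Taking the first hit rather than hunting for the globally busiest queue is what keeps this search cheap enough to run on every entry to idle.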

Patch 1 defines the sparsemask type and its operations.

Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.

Patches 5 and 6 refactor existing code for a cleaner merge of later
patches.

Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.

Patch 9 adds schedstats for comparing the new behavior to the old. It is
provided as a convenience for developers only, not for integration.

The patch series is based on kernel 7.0.0-rc1. It compiles, boots, and
runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
and CONFIG_PREEMPT.

Stealing is controlled by the sched feature SCHED_STEAL. The original
series enables it by default; the final patch in this port flips the
default to off to minimize regression risk.

Stealing improves utilization with only a modest CPU overhead in scheduler
code. In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:

steal - number of times a task is stolen from another CPU.

X6-2: 2 socket * 40 cores * 2 hyperthreads = 160 CPUs
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
hackbench <grps> process 100000

  baseline
  grps  time   %busy  sched   idle    wake   steal
  1     2.182  20.00  35876   17905   17958  0
  2     2.391  39.00  67753   33808   33921  0
  3     2.871  47.00  100944  48966   51538  0
  4     2.928  62.00  114489  55171   59059  0
  8     4.852  83.00  219907  92961   121703 0

  new
  grps  time   %busy  sched   idle    wake   steal   %speedup
  1     2.229  18.00  45450   22691   22751  52      -2.1
  2     2.123  40.00  49975   24977   24990  6       12.6
  3     2.690  61.00  56118   22641   32780  9073    6.7
  4     2.828  80.00  37927   12828   24165  8442    3.5
  8     4.120  95.00  85929   8613    57858  11098   17.8

Elapsed time improves by up to 17.8%, and CPU busy utilization rises by
1 to 18 percentage points, hitting 95% at peak load.

Note that all test results presented below are based on the
NO_DELAY_DEQUEUE implementation. Although I have implemented the necessary
adaptations to support DELAY_DEQUEUE, I observed a noticeable performance
regression in hackbench when both DELAY_DEQUEUE and SCHED_STEAL are enabled
simultaneously, specifically in heavily overloaded scenarios where the
number of tasks far exceeds the number of CPUs. Any suggestions on how to
address this would be appreciated.

Link: https://lore.kernel.org/lkml/20260320055920.2518389-1-chenjinghuang2@huawei.com/

Summary by Sourcery

Introduce a sparse bitmap-backed tracking of overloaded CPUs per LLC and use it to opportunistically steal CFS tasks when a CPU goes idle, improving utilization while integrating with existing scheduler topology and features.

New Features:

  • Add a sparsemask data structure and API for low-contention sparse bitmap operations.
  • Track overloaded CFS runqueues per LLC using a shared sparsemask and expose it to each rq.
  • Implement opportunistic CFS task stealing from overloaded CPUs in the same LLC when a CPU is going idle, gated by the SCHED_STEAL feature flag.

Enhancements:

  • Integrate LLC-level overload tracking into scheduler topology setup and teardown, including allocation and RCU wiring of shared state.
  • Adjust idle time accounting and newidle balancing to cooperate with task stealing without double-accounting idle time.

Steve Sistare and others added 10 commits April 13, 2026 20:28
Provide struct sparsemask and functions to manipulate it.  A sparsemask is
a sparse bitmap.  It reduces cache contention vs the usual bitmap when many
threads concurrently set, clear, and visit elements, by reducing the number
of significant bits per cacheline.  For each cacheline chunk of the mask,
only the first K bits of the first word are used, and the remaining bits
are ignored, where K is a creation time parameter.  Thus a sparsemask that
can represent a set of N elements is approximately (N/K * CACHELINE) bytes
in size.

This type is simpler and more efficient than the struct sbitmap used by
block drivers.
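A userspace model of the layout makes the size arithmetic concrete. This is illustrative only: the field names follow the patch, but the helpers are simplified, the counters are plain ints, and it assumes density is small enough that the significant bits fit in one word.

```c
/* Userspace sketch of the sparsemask layout described above: each
 * cacheline-sized "chunk" carries one significant word, and only the
 * low K = 2^density bits of that word are used. */
#include <stdlib.h>

#define SMASK_CACHELINE 64

struct sparsemask_chunk {
    unsigned long word;                      /* significant bits live here */
    char pad[SMASK_CACHELINE - sizeof(unsigned long)];
};

struct sparsemask {
    int nelems;                              /* number of representable elements */
    int density;                             /* 2^density elements per chunk */
    struct sparsemask_chunk chunks[];        /* one chunk per cacheline */
};

static int smask_nchunks(int nelems, int density)
{
    int per_chunk = 1 << density;
    return (nelems + per_chunk - 1) / per_chunk;   /* round up */
}

static struct sparsemask *smask_alloc(int nelems, int density)
{
    int n = smask_nchunks(nelems, density);
    struct sparsemask *m =
        calloc(1, sizeof(*m) + n * sizeof(struct sparsemask_chunk));

    if (m) {
        m->nelems = nelems;
        m->density = density;
    }
    return m;
}

static void smask_set(struct sparsemask *m, int elem)
{
    m->chunks[elem >> m->density].word |=
        1UL << (elem & ((1 << m->density) - 1));
}

static void smask_clear(struct sparsemask *m, int elem)
{
    m->chunks[elem >> m->density].word &=
        ~(1UL << (elem & ((1 << m->density) - 1)));
}

static int smask_test(const struct sparsemask *m, int elem)
{
    return (m->chunks[elem >> m->density].word >>
            (elem & ((1 << m->density) - 1))) & 1;
}
```

For 160 CPUs with density 3 (K = 8), this yields 20 chunks, i.e. roughly N/K * CACHELINE = 1280 bytes, matching the size formula above.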

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Add functions sd_llc_alloc_all() and sd_llc_free_all() to allocate and
free data pointed to by struct sched_domain_shared at the last-level-cache
domain.  sd_llc_alloc_all() is called after the SD hierarchy is known, to
eliminate the unnecessary allocations that would occur if we instead
allocated in __sdt_alloc() and then figured out which shared nodes are
redundant.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Define and initialize a sparse bitmap of overloaded CPUs, per
last-level-cache scheduling domain, for use by the CFS scheduling class.
Save a pointer to cfs_overload_cpus in the rq for efficient access.

Signed-off-by: Steve Sistare <steve.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
An overloaded CPU has more than 1 runnable task.  When a CFS task wakes
on a CPU, if h_nr_runnable transitions from 1 to more, then set the CPU in
the cfs_overload_cpus bitmap.  When a CFS task sleeps, if h_nr_runnable
transitions from 2 to less, then clear the CPU in cfs_overload_cpus.
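The transition rule can be modeled in a few lines. This is a sketch only: `overload_set`/`overload_clear` here manipulate a plain bitmask rather than the shared sparsemask, and the enqueue/dequeue helpers are stand-ins for the kernel paths.

```c
#include <stdbool.h>

static unsigned long long overload_mask;   /* stand-in for cfs_overload_cpus */

static void overload_set(int cpu)   { overload_mask |=  1ULL << cpu; }
static void overload_clear(int cpu) { overload_mask &= ~(1ULL << cpu); }
static bool cpu_overloaded(int cpu) { return (overload_mask >> cpu) & 1; }

/* Called when a CFS task becomes runnable on @cpu. */
static void toy_enqueue(int cpu, int *h_nr_runnable)
{
    int prev = (*h_nr_runnable)++;

    if (prev == 1)                 /* 1 -> 2: CPU becomes overloaded */
        overload_set(cpu);
}

/* Called when a CFS task stops being runnable on @cpu. */
static void toy_dequeue(int cpu, int *h_nr_runnable)
{
    int prev = (*h_nr_runnable)--;

    if (prev == 2)                 /* 2 -> 1: CPU no longer overloaded */
        overload_clear(cpu);
}
```

Only the 1-to-2 and 2-to-1 transitions touch the shared mask, so a busy CPU churning through many runnable tasks does not generate cross-CPU cacheline traffic on every enqueue.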

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Move the update of idle_stamp from idle_balance to the call site in
pick_next_task_fair, to prepare for a future patch that adds work to
pick_next_task_fair which must be included in the idle_stamp interval.
No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
The detach_task function takes a struct lb_env argument, but only needs a
few of its members.  Pass the rq and cpu arguments explicitly so the
function may be called from code that is not based on lb_env.  No
functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Define a simpler version of can_migrate_task called can_migrate_task_llc
which does not require a struct lb_env argument, and judges whether a
migration from one CPU to another within the same LLC should be allowed.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
When a CPU has no more CFS tasks to run, and idle_balance() fails to find a
task, then attempt to steal a task from an overloaded CPU in the same LLC,
using the cfs_overload_cpus bitmap to efficiently identify candidates.  To
minimize search time, steal the first migratable task that is found when
the bitmap is traversed.  For fairness, search for migratable tasks on an
overloaded CPU in order of next to run.

This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle.  idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy.  Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.

Stealing is controlled by the sched feature SCHED_STEAL, which is enabled
by default. Note that all test results presented below are based on the
NO_DELAY_DEQUEUE implementation.

Stealing improves utilization with only a modest CPU overhead in scheduler
code.  In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:

  steal - number of times a task is stolen from another CPU.

X6-2: 2 socket * 40 cores * 2 hyperthreads = 160 CPUs
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
hackbench <grps> process 100000

  baseline
  grps  time   %busy  sched   idle    wake   steal
  1     2.182  20.00  35876   17905   17958  0
  2     2.391  39.00  67753   33808   33921  0
  3     2.871  47.00  100944  48966   51538  0
  4     2.928  62.00  114489  55171   59059  0
  8     4.852  83.00  219907  92961   121703 0

  new
  grps  time   %busy  sched   idle    wake   steal   %speedup
  1     2.229  18.00  45450   22691   22751  52      -2.1
  2     2.123  40.00  49975   24977   24990  6       12.6
  3     2.690  61.00  56118   22641   32780  9073    6.7
  4     2.828  80.00  37927   12828   24165  8442    3.5
  8     4.120  95.00  85929   8613    57858  11098   17.8

Elapsed time improves by up to 17.8%, and CPU busy utilization rises by
1 to 18 percentage points, hitting 95% at peak load.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Replace zero-length array chunks[0] with C99 flexible array member
chunks[] in struct sparsemask to fix UBSAN warning:

  UBSAN: array-index-out-of-bounds in kernel/sched/sparsemask.h:181:32
  index 0 is out of range for type 'struct sparsemask_chunk[0]'

The zero-length array is a deprecated GCC extension. Using a proper
flexible array member eliminates the UBSAN false positive while
maintaining the same runtime behavior.
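The change is mechanical; a minimal standalone illustration of the pattern (not the kernel struct) is:

```c
/* Zero-length-array -> flexible-array-member change: allocation size and
 * indexing are identical, but chunks[] gives UBSAN enough type
 * information to stop flagging valid accesses past "index 0". */
#include <stdlib.h>

struct chunk { unsigned long word; };

struct mask {
    int nchunks;
    struct chunk chunks[];      /* C99 FAM; was "struct chunk chunks[0]" */
};

static struct mask *mask_alloc(int nchunks)
{
    struct mask *m = calloc(1, sizeof(*m) + nchunks * sizeof(struct chunk));

    if (m)
        m->nchunks = nchunks;
    return m;
}
```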

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202603242133.f66e336f-lkp@intel.com
Cc: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Disable the STEAL scheduler feature by default to minimize the risk of
unexpected regressions on existing workloads. The steal-task mechanism
is still compiled in and can be dynamically enabled at runtime via:

  echo STEAL > /sys/kernel/debug/sched/features

Once sufficient testing evidence demonstrates that this feature is
universally beneficial with no adverse side effects, this commit can
simply be reverted to re-enable it by default.

Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>

sourcery-ai bot commented Apr 13, 2026

Reviewer's Guide

Introduces a sparse bitmap abstraction (sparsemask) and uses it to track overloaded CFS runqueues per-LLC, wiring this into the fair scheduler and topology code to enable low-cost task stealing from overloaded CPUs when a CPU goes idle, guarded by a new SCHED_STEAL feature flag.

Sequence diagram for stealing a CFS task when a CPU goes idle

sequenceDiagram
    actor CPU
    participant rq
    participant pick_next_task_fair
    participant sched_balance_newidle
    participant try_steal
    participant steal_from
    participant src_rq
    participant detach_next_task
    participant attach_task

    CPU->>pick_next_task_fair: call with rq, rq_flags
    pick_next_task_fair->>rq: no CFS task, go to idle
    pick_next_task_fair->>rq: rq_idle_stamp_update
    pick_next_task_fair->>sched_balance_newidle: new_tasks = sched_balance_newidle(rq, rf)
    sched_balance_newidle-->>pick_next_task_fair: return new_tasks
    alt no_task_pulled
        pick_next_task_fair->>try_steal: new_tasks = try_steal(rq, rf)
        try_steal-->>try_steal: check sched_feat(STEAL), cpu_active, avg_idle
        try_steal-->>try_steal: rcu_dereference cfs_overload_cpus
        alt SMT_present
            loop cpus_in_cpu_smt_mask
                try_steal->>steal_from: attempt steal_from(dst_rq, dst_rf, locked, src_cpu)
                alt stolen_from_smt_sibling
                    steal_from->>src_rq: rq_lock_irqsave(src_rq, rf)
                    steal_from->>src_rq: update_rq_clock
                    steal_from->>detach_next_task: p = detach_next_task(src_rq->cfs, dst_rq)
                    detach_next_task-->>steal_from: return p
                    steal_from->>src_rq: rq_unlock(src_rq, rf)
                    steal_from->>rq: raw_spin_rq_lock(dst_rq)
                    steal_from->>rq: rq_repin_lock(dst_rq, dst_rf)
                    steal_from->>rq: update_rq_clock
                    steal_from->>attach_task: attach_task(dst_rq, p)
                    attach_task-->>steal_from: done
                    steal_from-->>try_steal: return 1
                else not_stolen_here
                    steal_from-->>try_steal: return 0
                end
            end
        end
        alt not_stolen_yet
            loop sparsemask_for_each overloaded_cpu_in_llc
                try_steal->>steal_from: attempt steal_from(dst_rq, dst_rf, locked, src_cpu)
                alt stolen_from_llc_peer
                    steal_from->>src_rq: rq_lock_irqsave(src_rq, rf)
                    steal_from->>src_rq: update_rq_clock
                    steal_from->>detach_next_task: p = detach_next_task(src_rq->cfs, dst_rq)
                    detach_next_task-->>steal_from: return p
                    steal_from->>src_rq: rq_unlock(src_rq, rf)
                    steal_from->>rq: raw_spin_rq_lock(dst_rq)
                    steal_from->>rq: rq_repin_lock(dst_rq, dst_rf)
                    steal_from->>rq: update_rq_clock
                    steal_from->>attach_task: attach_task(dst_rq, p)
                    attach_task-->>steal_from: done
                    steal_from-->>try_steal: return 1
                else no_task_migrated
                    steal_from-->>try_steal: return 0
                end
            end
        end
        try_steal-->>pick_next_task_fair: return stolen_flag
        alt task_stolen
            pick_next_task_fair->>rq: rq_idle_stamp_clear
            pick_next_task_fair-->>CPU: RETRY_TASK (reselect next task)
        else no_task_stolen
            pick_next_task_fair-->>CPU: remain idle
        end
    else task_pulled_by_newidle
        pick_next_task_fair->>rq: rq_idle_stamp_clear
        pick_next_task_fair-->>CPU: run pulled task
    end

Updated class diagram for runqueue, sched_domain_shared, and sparsemask

classDiagram
    class rq {
      +cfs : cfs_rq
      +rt : rt_rq
      +dl : dl_rq
      +cfs_overload_cpus : sparsemask*
    }

    class sched_domain_shared {
      +ref : atomic_t
      +nr_busy_cpus : atomic_t
      +has_idle_cores : int
      +cfs_overload_cpus : sparsemask*
      +nr_idle_scan : int
    }

    class sparsemask_chunk {
      +word : unsigned long
    }

    class sparsemask {
      +nelems : short
      +density : short
      +chunks : sparsemask_chunk[]
      +sparsemask_size(nelems int, density int) size_t
      +sparsemask_init(mask sparsemask*, nelems int, density int) void
      +sparsemask_alloc_node(nelems int, density int, flags gfp_t, node int) sparsemask*
      +sparsemask_free(mask sparsemask*) void
      +sparsemask_set_elem(dst sparsemask*, elem int) void
      +sparsemask_clear_elem(dst sparsemask*, elem int) void
      +sparsemask_test_elem(mask sparsemask*, elem int) int
      +sparsemask_next(mask sparsemask*, origin int, prev int) int
    }

    rq --> sparsemask : cfs_overload_cpus
    sched_domain_shared --> sparsemask : cfs_overload_cpus
    sparsemask o-- sparsemask_chunk : contains

File-Level Changes

Change Details Files
Add a sparsemask bitmap implementation for low-contention CPU-set tracking and hook it into scheduler shared state.
  • Define sparsemask data structures and helpers, including allocation, basic bit operations, and an iterator that wraps around the bitmap.
  • Align sparsemask chunks to cachelines and parameterize density to trade memory for lower contention.
  • Extend sched_domain_shared with a sparsemask pointer for CFS overload tracking and ensure proper allocation and freeing of this bitmap at LLC level and during topology teardown.
kernel/sched/sparsemask.h
include/linux/sched/topology.h
kernel/sched/topology.c
Track overloaded CFS runqueues using the sparsemask and maintain per-rq pointers to the LLC-shared overload bitmap.
  • Add a cfs_overload_cpus pointer to struct rq and wire it to the LLC shared domain in update_top_cache_domain.
  • Allocate and free LLC-level overload bitmaps after sched domains are built, including helpers to allocate once per LLC and free for all CPUs.
  • Provide overload_set/overload_clear helpers that RCU-read the shared sparsemask and set/clear the bit for a CPU, no-op when SCHED_STEAL is disabled.
kernel/sched/sched.h
kernel/sched/topology.c
include/linux/sched/topology.h
kernel/sched/fair.c
Integrate overload tracking into CFS enqueue/dequeue so overloaded CPUs are marked when they have multiple runnable tasks and cleared when they fall back to at most one.
  • Track previous h_nr_runnable in enqueue_task_fair and dequeue_entities and use it to detect transitions between <=1 and >=2 runnable CFS tasks.
  • On enqueue, set the overload bit when a CPU moves from at most one runnable task to two or more, including delayed enqueue paths.
  • On dequeue, clear the overload bit when a CPU drops from at least two runnable tasks to at most one, handling both the early-return and normal paths.
kernel/sched/fair.c
Introduce a low-cost task stealing path for idle CPUs that uses the overloaded-CPU sparsemask within the same LLC, including SMT-aware stealing and fairness considerations.
  • Add SCHED_FEAT(STEAL) feature flag and guard stealing logic and overload updates behind it.
  • When pick_next_task_fair enters the idle path, set idle_stamp, call sched_balance_newidle, then try_steal if newidle balancing finds nothing, clearing idle_stamp if any task was obtained.
  • Implement try_steal to check avg_idle vs SCHED_STEAL_COST, look up the LLC-shared overload sparsemask, prefer SMT siblings first (if enabled), then iterate other overloaded CPUs in the LLC with sparsemask_for_each, using steal_from to do the actual migration while preserving rq locking discipline; return -1 when another class becomes runnable.
  • Implement steal_from to temporarily drop dst rq lock, lock the source rq, pick the first migratable CFS task in next-to-run order via detach_next_task, then attach it to dst rq with proper clock updates and IRQ handling, and signal success/failure back to try_steal.
  • Add can_migrate_task_llc and detach_task_steal helpers tailored for same-LLC stealing, skipping coloc/temperature checks and avoiding delayed entities.
  • Ensure final try_steal return encodes both whether any CFS tasks are now runnable and whether a non-CFS task appeared (forcing RETRY_TASK).
kernel/sched/fair.c
kernel/sched/features.h
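The lock choreography in steal_from() described above can be modeled with a toy lock type. This is illustrative only: real rq locks, IRQ flags, pinning, and clock updates are omitted, and the lock/rq helpers are stand-ins.

```c
#include <assert.h>

struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
static void toy_lock_release(struct toy_lock *l) { assert(l->held);  l->held = 0; }

struct toy_rq {
    struct toy_lock lock;
    int nr_tasks;
};

/* Called with dst->lock held; returns with dst->lock held again. */
static int toy_steal_from(struct toy_rq *dst, struct toy_rq *src)
{
    int stolen = 0;

    toy_lock_release(&dst->lock);        /* never hold both rq locks */

    toy_lock_acquire(&src->lock);
    if (src->nr_tasks >= 2) {            /* only steal from an overloaded rq */
        src->nr_tasks--;                 /* detach_next_task() stand-in */
        stolen = 1;
    }
    toy_lock_release(&src->lock);

    toy_lock_acquire(&dst->lock);        /* re-take before attaching */
    if (stolen)
        dst->nr_tasks++;                 /* attach_task() stand-in */
    return stolen;
}
```

Dropping the destination lock before taking the source lock avoids ever holding two rq locks at once on this path, which is what makes the review comments about IRQ-state ordering the interesting part of the real code.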
Adjust existing idle balancing and rq idle accounting to coexist correctly with the new stealing path.
  • Refactor idle_stamp updates: introduce rq_idle_stamp_update/clear helpers and move idle_stamp setting from sched_balance_newidle into the pick_next_task_fair idle path, so both balancing and stealing account against idle time.
  • Avoid clearing idle_stamp in sched_balance_newidle when a task is pulled; the clearing is now centralized in pick_next_task_fair after either sched_balance_newidle or try_steal succeeds.
  • Ensure nohz_newidle_balance is still called when no task was pulled and stealing did not succeed, keeping existing NOHZ behavior intact.
kernel/sched/fair.c


@deepin-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from avenger-285714. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@sourcery-ai sourcery-ai bot left a comment


Hey - I've left some high level feedback:

  • The lifetime of cfs_overload_cpus relative to RCU readers looks unsafe: it is freed in sd_llc_free/sd_llc_free_all while rqs still hold RCU-protected pointers to it, so you likely need an RCU-safe free (e.g. call_rcu) or a synchronize_rcu() after clearing the pointers before freeing the sparsemask.
  • The rq locking and IRQ state transitions in steal_from/try_steal are non-standard for scheduler code (e.g. mixing rq_lock_irqsave on src_rq, raw locking on dst_rq, and local_irq_restore(rf.flags) after reacquiring dst_rq), so consider revisiting this to follow the usual rq lock + IRQ discipline and avoid subtle deadlocks or IRQ-state mismatches.
  • The description says SCHED_STEAL is enabled by default, but SCHED_FEAT(STEAL, false) disables it, so either the feature default or the description should be updated to avoid confusion.


Copilot AI left a comment


Pull request overview

This PR introduces an LLC-scoped “overloaded CPU” tracking bitmap (implemented as a new sparse bitmap type) and uses it to opportunistically steal a migratable CFS task when a CPU is about to go idle and sched_balance_newidle() doesn’t pull anything, aiming to improve CPU utilization with a cheap, local search.

Changes:

  • Add sparsemask (a cacheline-sparse bitmap) and integrate it as sched_domain_shared::cfs_overload_cpus.
  • Track per-LLC overloaded CPUs and expose the LLC bitmap to each rq via an RCU-updated pointer.
  • Add a new idle-time stealing path in fair.c which scans overloaded CPUs (SMT siblings first) and migrates the first suitable CFS task found.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
kernel/sched/topology.c Allocates/frees per-LLC sparsemask and wires it into per-CPU rq via RCU assignment.
kernel/sched/sparsemask.h Adds the new sparse bitmap type and iteration API used by the scheduler.
kernel/sched/sched.h Adds rq->cfs_overload_cpus pointer for fast access to the LLC overloaded-CPU bitmap.
kernel/sched/features.h Introduces the STEAL scheduler feature flag.
kernel/sched/fair.c Updates overload tracking on enqueue/dequeue and implements try_steal() in the idle path.
include/linux/sched/topology.h Extends sched_domain_shared with a pointer to the overload bitmap.


Comment on lines 66 to 72
struct sched_domain_shared {
	atomic_t		ref;
	atomic_t		nr_busy_cpus;
	int			has_idle_cores;
	struct sparsemask	*cfs_overload_cpus;
	int			nr_idle_scan;
};

Copilot AI Apr 13, 2026


include/linux/sched/topology.h now references struct sparsemask in struct sched_domain_shared, but there is no forward declaration or header include that declares struct sparsemask here. This will fail to compile for any translation unit that includes this header. Add struct sparsemask; (or move the type to an appropriate exported header) before struct sched_domain_shared.

Comment thread kernel/sched/sched.h
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
struct sparsemask *cfs_overload_cpus;

Copilot AI Apr 13, 2026


rq->cfs_overload_cpus is accessed using RCU primitives (rcu_assign_pointer/rcu_dereference), but the field itself is not annotated as an RCU-protected pointer. This can lead to sparse/RCU checking issues and makes the intended lifetime rules unclear. Consider declaring it as struct sparsemask __rcu *cfs_overload_cpus (and updating accesses accordingly).

Suggested change:
-	struct sparsemask *cfs_overload_cpus;
+	struct sparsemask __rcu *cfs_overload_cpus;

Comment thread kernel/sched/features.h
Comment on lines +90 to +95
/*
* Steal a CFS task from another CPU when going idle.
* Improves CPU utilization.
*/
SCHED_FEAT(STEAL, false)


Copilot AI Apr 13, 2026


SCHED_FEAT(STEAL, false) disables stealing by default, but the PR description states that SCHED_STEAL is enabled by default. Please reconcile this (either set the default to true or update the description / feature naming so behavior matches expectations).

Comment thread kernel/sched/sparsemask.h
Comment on lines +42 to +46
struct sparsemask {
short nelems; /* current number of elements */
short density; /* store 2^density elements per chunk */
struct sparsemask_chunk chunks[]; /* embedded array of chunks */
};

Copilot AI Apr 13, 2026


The struct sparsemask stores nelems and density in short. Since callers pass int nelems (e.g., nr_cpu_ids), this can truncate on larger configurations and make iteration/allocation incorrect. Use int (or at least u32) for these fields, and validate inputs (including nelems upper bound) in sparsemask_init().

Comment thread kernel/sched/sparsemask.h
Comment on lines +138 to +139
int index = _SMASK_INDEX(density, nelems) + 1;


Copilot AI Apr 13, 2026


sparsemask_size() computes index = _SMASK_INDEX(density, nelems) + 1, which over-allocates by one chunk when nelems is an exact multiple of 2^density (e.g., nelems=8, density=3). If valid elements are 0..nelems-1, this should typically be based on nelems-1 (with a special-case for nelems==0).

Suggested change:
-	int index = _SMASK_INDEX(density, nelems) + 1;
+	int index;
+
+	if (nelems == 0)
+		index = 0;
+	else
+		index = _SMASK_INDEX(density, nelems - 1) + 1;

Comment thread kernel/sched/fair.c
int dst_cpu = dst_rq->cpu;
struct rq *src_rq = cpu_rq(src_cpu);

if (dst_cpu == src_cpu || src_rq->cfs.h_nr_runnable < 2)

Copilot AI Apr 13, 2026


steal_from() reads src_rq->cfs.h_nr_runnable before taking src_rq's lock. Even though this is an early-bailout check, it can trigger KCSAN data-race reports and may observe transient values. Prefer moving this check under the rq_lock_irqsave(src_rq, ...) region or using READ_ONCE() for the unlocked read.

Suggested change:
-	if (dst_cpu == src_cpu || src_rq->cfs.h_nr_runnable < 2)
+	if (dst_cpu == src_cpu || READ_ONCE(src_rq->cfs.h_nr_runnable) < 2)

Comment thread kernel/sched/topology.c
Comment on lines 680 to 699
@@ -680,8 +691,10 @@ static void update_top_cache_domain(int cpu)
id = cpumask_first(sched_domain_span(sd));
size = cpumask_weight(sched_domain_span(sd));
sds = sd->shared;
cfs_overload_cpus = sds->cfs_overload_cpus;
}

rcu_assign_pointer(rq->cfs_overload_cpus, cfs_overload_cpus);
rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_size, cpu) = size;

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rq->cfs_overload_cpus is an RCU-protected pointer, but it is updated in update_top_cache_domain() while the backing sds->cfs_overload_cpus is freed from the sched-domain RCU teardown path. To avoid a potential UAF, ensure that all CPUs switch rq->cfs_overload_cpus away from the old mask before the old mask can be freed (e.g., update/clear this pointer before scheduling destroy_sched_domains() for the old domain, or otherwise guarantee ordering with RCU grace periods).

