
scx_mitosis: Drain ineligible tasks due to affinity restrictions #3668

Open
kkdwvd wants to merge 2 commits into sched-ext:main from kkdwvd:mitosis-affn-viol

kkdwvd commented Jun 23, 2026

We have cases where tasks were queued on cell DSQs with specific affinity masks that matched the cell's cpumask, so all_cell_cpus_allowed was true for them. Later, such a cell may be rebalanced and lose CPUs on the LLCs where these tasks are queued. CPUs that then try to drain these tasks fail every time, because the tasks' affinity masks do not allow them to run on those CPUs.

In such cases, when consuming with scx_bpf_dsq_move_to_local() fails, we dispatch the task to the per-CPU DSQ of a CPU it is allowed to run on. If the task's affinity mask changes later, it goes through another dequeue()/enqueue() cycle.

When such a task starts running, we set all_cell_cpus_allowed=false for it so that its LLC is not updated, and invalidate its current selection. The maybe_refresh_cell() call in the enqueue path eventually corrects any drift from the task's intended state.

The accompanying script reproduces the failure without the fix and verifies a non-zero drain_affn_cnt with it.

kkdwvd requested review from dschatzberg and tommy-u June 23, 2026 01:30
kkdwvd force-pushed the mitosis-affn-viol branch 2 times, most recently from 2a84b8f to 44ba238 June 23, 2026 17:41
kkdwvd added 2 commits June 24, 2026 14:23
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
kkdwvd force-pushed the mitosis-affn-viol branch from 44ba238 to 51a13a6 June 24, 2026 12:23