scx_lavd, lib/cgroup_bw: cap task-stall latency under heavy cpu.max throttling by multics69 · Pull Request #3618 · sched-ext/scx

multics69 · 2026-06-02T14:43:32Z

When a cgroup's cpu.max quota is tight relative to the work its tasks
want to do, the cgroup spends most of its time throttled. Tasks then
wait a long time for the small amount of CPU the quota allows, and
that wait is not bounded: it can grow until a task trips the
30-second SCX runnable-task-stall watchdog and brings the scheduler
down. Worse, the wait is not even fairly distributed -- some tasks
can be starved indefinitely while others make progress.

A single throttled cgroup should never be able to stall the whole
scheduler. This series makes that guarantee by attacking the two
ways an unbounded wait arises.

The first is slice length. Under a tight quota, the scheduler can
only admit a few tasks per period, and if each runs a full slice,
the rest wait proportionally longer -- the more contended the cgroup,
the worse it gets. The scheduler needs to know how contended a
cgroup is so it can react. [2/3] gives it that: scx_cgroup_bw_pressure()
reports a cgroup's bandwidth pressure, and scx_lavd shortens slices as
pressure rises, so more tasks get a turn within the watchdog window
instead of a few monopolizing each period.

The second is queue ordering. A throttled task's place in line is
its scheduler vtime, which is fine for fairness but offers no
liveness guarantee: a compute-heavy task accumulates large vtime and
can sit behind an endless stream of short, small-vtime tasks, never
reaching the front. Fairness alone does not bound how long it waits.
[3/3] adds that bound -- a task is guaranteed to reach the head of
the queue within a few seconds regardless of vtime -- while keeping
the scheduler's vtime ordering intact for tasks queued close
together.

[1/3] is preparatory refactoring with no functional change: it
factors the per-task context-cache accessors into helpers so [2/3]
can reuse them.

[1/3] lib/cgroup_bw: factor taskc-cached cgx/llcx accessors into helpers
[2/3] lib/cgroup_bw, scx_lavd: add scx_cgroup_bw_pressure() API
[3/3] lib/cgroup_bw: blend wall-clock time into BTQ vtime to bound throttle delay

Signed-off-by: Changwoo Min <changwoo@igalia.com>

multics69 · 2026-06-11T15:09:22Z

Rebased to the main branch.

Earlier commits introduced per-task caching of the cgroup and per-LLC contexts to eliminate hot-path map lookups:

  335e754 ("lib/cgroup_bw: cache cgx per task to eliminate throttle-path map lookup")
  d848162 ("lib/cgroup_bw: cache llcx per task to eliminate consume-path map lookup")

However, the cache lookup and invalidation patterns are open-coded at every call site -- cbw_cgroup_bw_throttled(), scx_cgroup_bw_consume(), and scx_cgroup_bw_pressure() for the lookup; cbw_drain_atq_to_root(), cbw_free_llc_ctx(), and scx_cgroup_bw_move() for the invalidation. Duplicated code means duplicated bugs.

Add three static __always_inline helpers that centralise the pattern:

  cbw_taskc_get_cgx_raw(taskc, cgrp_id)
  cbw_taskc_get_llcx_raw(taskc, cgrp_id, llc_id)
  cbw_taskc_invalidate(taskc)

The getters accept a possibly-NULL taskc and return 0 on miss so each caller keeps its own miss policy. cbw_taskc_invalidate() centralises the __sync_lock_test_and_set workaround for the arena-pointer fields, letting scx_cgroup_bw_move() drop its local `volatile` qualifier.

No semantic change.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
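For readers who have not seen the helpers, the getter/invalidate contract described above can be pictured with the following userspace sketch. It is not the code from the patch: `struct demo_taskc`, its field names, and the `demo_*` function names are hypothetical stand-ins, and only the cgx getter is shown. The parts taken from the commit message are the NULL-tolerant getter, the "return 0 on miss" convention, and the use of `__sync_lock_test_and_set` when clearing the cached pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task cache; the real taskc layout is not shown in this PR. */
struct demo_taskc {
	uint64_t cached_cgrp_id;	/* cgroup the cached context belongs to */
	uint64_t cached_cgx;		/* raw context pointer, 0 when empty */
};

/* The getter hands back the pointer as a raw 64-bit value. */
static_assert(sizeof(uintptr_t) <= sizeof(uint64_t),
	      "raw pointer must fit in the 64-bit return value");

/*
 * Return the cached raw context, or 0 on a miss. A miss is a NULL taskc,
 * an empty cache, or a cache filled for a different cgroup. The caller
 * decides what to do on a miss (fall back to a map lookup, bail out, ...).
 */
static inline uint64_t demo_taskc_get_cgx_raw(const struct demo_taskc *taskc,
					      uint64_t cgrp_id)
{
	if (!taskc || !taskc->cached_cgx || taskc->cached_cgrp_id != cgrp_id)
		return 0;
	return taskc->cached_cgx;
}

/*
 * Drop the cached context. The pointer field is cleared with the
 * __sync_lock_test_and_set builtin, mirroring the workaround named in
 * the commit message; the previous value is discarded.
 */
static inline void demo_taskc_invalidate(struct demo_taskc *taskc)
{
	if (!taskc)
		return;
	(void)__sync_lock_test_and_set(&taskc->cached_cgx, 0);
	taskc->cached_cgrp_id = 0;
}
```

Returning 0 instead of falling back inside the getter keeps the helper free of policy, which is what lets the lookup call sites share it while handling a miss differently.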

When a cgroup is throttled, giving its tasks a full time slice is not ideal. For example, if the cgroup's quota is tight and many tasks are competing for it, each task holds the CPU for its full slice before the scheduler can recheck the throttle. Shortening the slice lets more tasks make forward progress within a reasonable window -- well before the 30 s SCX watchdog would fire on a stalled runnable task.

Add scx_cgroup_bw_pressure() to expose a 1024-scale pressure hint that BPF schedulers can use to shorten time slices proportionally:

  slice = (base_slice * 1024) / pressure

Pressure is computed at each replenishment boundary from two signals that are combined by addition so that both contribute independently:

  Budget pressure: a hyperbolic curve that rises steeply below 25% of the replenished period_budget. A small budget after replenishment also indicates accumulated debt from prior over-consumption, so high pressure is correct in that case too.

  Backlog pressure: a linear term proportional to the number of tasks queued in the BTQ across all LLC domains. A growing backlog signals that the reenqueue path cannot drain fast enough; shorter slices reduce the time any single task monopolises the CPU.

The combined pressure is clamped to [1024, 16384], limiting the maximum reduction to 1/16 of the base slice.

scx_lavd adopts the new API in calc_time_slice(): pressure is fetched once per scheduling decision, slice boost is suppressed under any throttle pressure, and the final slice is scaled by the pressure before being assigned to the task.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
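The slice formula and clamp range above work out as in the following userspace sketch. The `DEMO_*` constants and `demo_*` functions are illustrative names, not identifiers from the patch, and the sketch takes the pressure value as an input instead of computing it, because the commit message does not spell out the budget and backlog curves.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp range quoted in the commit message; 1024 means "no pressure". */
#define DEMO_PRESSURE_MIN	1024ULL
#define DEMO_PRESSURE_MAX	16384ULL

static_assert(DEMO_PRESSURE_MIN > 0, "pressure is used as a divisor");

/* Force a pressure hint into [DEMO_PRESSURE_MIN, DEMO_PRESSURE_MAX]. */
static inline uint64_t demo_clamp_pressure(uint64_t pressure)
{
	if (pressure < DEMO_PRESSURE_MIN)
		return DEMO_PRESSURE_MIN;
	if (pressure > DEMO_PRESSURE_MAX)
		return DEMO_PRESSURE_MAX;
	return pressure;
}

/*
 * slice = (base_slice * 1024) / pressure
 *
 * At pressure 1024 the slice is unchanged; at the 16384 ceiling it is
 * 1/16 of the base slice. Clamping here also guards the division.
 */
static inline uint64_t demo_scale_slice(uint64_t base_slice_ns, uint64_t pressure)
{
	return (base_slice_ns * 1024) / demo_clamp_pressure(pressure);
}
```

With a 5 ms base slice, for example, a pressure of 2048 halves the slice to 2.5 ms, and any pressure at or above 16384 yields 312.5 us.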

… delay

When a cgroup's quota is tight and it runs a mixture of short-running and compute-intensive tasks, the compute-intensive tasks accumulate large scheduler vtime values while short-running tasks stay close to the minimum. In the BTQ, position is determined purely by vtime, so a steady stream of newly enqueued short-running (low-vtime) tasks keeps overtaking the compute-intensive (high-vtime) tasks already waiting. The compute-intensive tasks can be starved arbitrarily long -- especially when the cgroup is throttled frequently and the queue keeps refilling.

Fix this by blending wall-clock time into the BTQ key so a task's queue position eventually advances regardless of its vtime:

  btq_vtime = (scx_bpf_now() & CBW_BTQ_VTIME_UPPER_MASK) |
              (vtime & CBW_BTQ_VTIME_LOWER_MASK)

The upper 32 bits come from the current nanosecond timestamp; the lower 32 bits come from the scheduler-provided vtime. The 64-bit key is split evenly so each side contributes 32 bits.

Tasks enqueued within the same ~4-second window (2^32 ns ~= 4.29 s) still compete by their scheduler vtime, preserving relative fairness. Once a new wall-clock epoch begins, earlier-queued tasks take priority regardless of their vtime, so no task waits more than ~4 seconds in the BTQ due to vtime ordering alone.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
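The effect of the blended key on ordering can be checked with a small userspace sketch. The mask values below are an assumption inferred from the "split evenly, 32 bits each" description, since the actual CBW_BTQ_VTIME_*_MASK definitions are not shown here. The timestamp is passed in as a parameter in place of scx_bpf_now(), and the `DEMO_*` and `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed even split of the 64-bit key: upper half time, lower half vtime. */
#define DEMO_BTQ_UPPER_MASK	0xFFFFFFFF00000000ULL
#define DEMO_BTQ_LOWER_MASK	0x00000000FFFFFFFFULL

static_assert((DEMO_BTQ_UPPER_MASK & DEMO_BTQ_LOWER_MASK) == 0,
	      "the two halves must not overlap");
static_assert((DEMO_BTQ_UPPER_MASK | DEMO_BTQ_LOWER_MASK) == UINT64_MAX,
	      "the two halves must cover the whole key");

/*
 * Build a queue key whose upper 32 bits are the wall-clock epoch
 * (now_ns / 2^32, i.e. one step every ~4.29 s) and whose lower 32 bits
 * are the low half of the scheduler vtime. Smaller keys are served first.
 */
static inline uint64_t demo_btq_key(uint64_t now_ns, uint64_t vtime)
{
	return (now_ns & DEMO_BTQ_UPPER_MASK) | (vtime & DEMO_BTQ_LOWER_MASK);
}
```

Two tasks enqueued in the same epoch compare by the low vtime bits alone, so a later low-vtime task can still go first. A task enqueued in an earlier epoch always gets the smaller key, whatever its vtime, which is where the ~4-second bound comes from.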

multics69 requested review from bboymimi, daidavid, etsal and rrnewton June 2, 2026 14:43

multics69 self-assigned this Jun 2, 2026

multics69 mentioned this pull request Jun 3, 2026

lib/cgroup_bw: cap task-stall latency under cpu.max #3554

Closed

multics69 force-pushed the pr-lavd-cap-cbw branch from cadcb01 to b985cf3 Compare June 11, 2026 15:08

multics69 added 3 commits June 27, 2026 19:40

multics69 force-pushed the pr-lavd-cap-cbw branch from b985cf3 to c584bb1 Compare June 27, 2026 10:43

multics69 wants to merge 3 commits into sched-ext:main from multics69:pr-lavd-cap-cbw