scx_lavd, lib/cgroup_bw: cap task-stall latency under heavy cpu.max throttling #3618
Open
multics69 wants to merge 3 commits into
cadcb01 to
b985cf3
Rebased to the main branch.
Earlier commits introduced per-task caching of the cgroup and per-LLC
contexts to eliminate hot-path map lookups:

  335e754 ("lib/cgroup_bw: cache cgx per task to eliminate throttle-path map lookup")
  d848162 ("lib/cgroup_bw: cache llcx per task to eliminate consume-path map lookup")

However, the cache lookup and invalidation patterns are open-coded at
every call site -- cbw_cgroup_bw_throttled(), scx_cgroup_bw_consume(),
and scx_cgroup_bw_pressure() for the lookup; cbw_drain_atq_to_root(),
cbw_free_llc_ctx(), and scx_cgroup_bw_move() for the invalidation.
Duplicated code means duplicated bugs.

Add three static __always_inline helpers that centralise the pattern:

  cbw_taskc_get_cgx_raw(taskc, cgrp_id)
  cbw_taskc_get_llcx_raw(taskc, cgrp_id, llc_id)
  cbw_taskc_invalidate(taskc)

The getters accept a possibly-NULL taskc and return 0 on miss so each
caller keeps its own miss policy. cbw_taskc_invalidate() centralises
the __sync_lock_test_and_set workaround for the arena-pointer fields,
letting scx_cgroup_bw_move() drop its local `volatile` qualifier.

No semantic change.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
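For readers unfamiliar with the pattern, the helper contract described above (NULL-tolerant getters that return 0 on a miss, plus a single invalidation point) might look roughly like the following plain-C sketch. The struct layout, field names, and the `sketch_` function names are hypothetical and chosen only for illustration; the real helpers operate on the library's own task context and arena pointers.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-task cache; field names are assumptions, not the
 * library's actual task-context layout. A raw value of 0 means "empty". */
struct sketch_taskc {
	uint64_t cgrp_id;   /* cgroup the cached pointers belong to */
	int32_t  llc_id;    /* LLC domain the cached llcx belongs to */
	uint64_t cgx_raw;   /* cached cgroup context, as a raw value */
	uint64_t llcx_raw;  /* cached per-LLC context, as a raw value */
};

/* Return the cached cgroup context, or 0 on a miss (NULL taskc,
 * different cgroup, or empty slot). The caller decides what a miss means. */
static inline uint64_t sketch_get_cgx_raw(struct sketch_taskc *taskc,
					  uint64_t cgrp_id)
{
	if (!taskc || taskc->cgrp_id != cgrp_id)
		return 0;
	return taskc->cgx_raw;
}

/* Same contract for the per-LLC context; the LLC id must match too. */
static inline uint64_t sketch_get_llcx_raw(struct sketch_taskc *taskc,
					   uint64_t cgrp_id, int32_t llc_id)
{
	if (!taskc || taskc->cgrp_id != cgrp_id || taskc->llc_id != llc_id)
		return 0;
	return taskc->llcx_raw;
}

/* Clear both cached values with an atomic exchange, mirroring the
 * __sync_lock_test_and_set approach the commit message mentions. */
static inline void sketch_invalidate(struct sketch_taskc *taskc)
{
	if (!taskc)
		return;
	(void)__sync_lock_test_and_set(&taskc->cgx_raw, 0);
	(void)__sync_lock_test_and_set(&taskc->llcx_raw, 0);
}
```

Returning 0 instead of falling back to a map lookup inside the getter keeps the miss policy with each caller, which is the design choice the commit message calls out.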
When a cgroup is throttled, giving its tasks a full time slice is not
ideal. For example, if the cgroup's quota is tight and many tasks are
competing for it, each task holds the CPU for its full slice before
the scheduler can recheck the throttle. Shortening the slice lets more
tasks make forward progress within a reasonable window -- well before
the 30 s SCX watchdog would fire on a stalled runnable task.

Add scx_cgroup_bw_pressure() to expose a 1024-scale pressure hint that
BPF schedulers can use to shorten time slices proportionally:

  slice = (base_slice * 1024) / pressure

Pressure is computed at each replenishment boundary from two signals
that are combined by addition so that both contribute independently:

  Budget pressure: a hyperbolic curve that rises steeply below 25% of
  the replenished period_budget. A small budget after replenishment
  also indicates accumulated debt from prior over-consumption, so high
  pressure is correct in that case too.

  Backlog pressure: a linear term proportional to the number of tasks
  queued in the BTQ across all LLC domains. A growing backlog signals
  that the reenqueue path cannot drain fast enough; shorter slices
  reduce the time any single task monopolises the CPU.

The combined pressure is clamped to [1024, 16384], limiting the
maximum reduction to 1/16 of the base slice.

scx_lavd adopts the new API in calc_time_slice(): pressure is fetched
once per scheduling decision, slice boost is suppressed under any
throttle pressure, and the final slice is scaled by the pressure
before being assigned to the task.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
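The arithmetic in this commit message can be illustrated with a small plain-C sketch. Only the 1024 scale, the [1024, 16384] clamp, the additive combination, and the slice formula come from the text above; the function and macro names are hypothetical, and the two input terms are taken as given because the exact budget and backlog curves are not spelled out here.

```c
#include <stdint.h>

#define SKETCH_PRESSURE_MIN 1024ULL   /* 1024 == no pressure (1.0x) */
#define SKETCH_PRESSURE_MAX 16384ULL  /* caps the slice reduction at 1/16 */

/* Clamp a raw pressure value into the documented [1024, 16384] range. */
static inline uint64_t sketch_clamp_pressure(uint64_t p)
{
	if (p < SKETCH_PRESSURE_MIN)
		return SKETCH_PRESSURE_MIN;
	if (p > SKETCH_PRESSURE_MAX)
		return SKETCH_PRESSURE_MAX;
	return p;
}

/* Combine the two signals by addition, then clamp. How each term is
 * derived (hyperbolic budget curve, linear backlog term) is left to
 * the caller in this sketch. */
static inline uint64_t sketch_combine_pressure(uint64_t budget_term,
					       uint64_t backlog_term)
{
	return sketch_clamp_pressure(budget_term + backlog_term);
}

/* slice = (base_slice * 1024) / pressure, with pressure clamped first
 * so the divisor is never below 1024. */
static inline uint64_t sketch_scale_slice(uint64_t base_slice_ns,
					  uint64_t pressure)
{
	return (base_slice_ns * SKETCH_PRESSURE_MIN) /
	       sketch_clamp_pressure(pressure);
}
```

With these definitions, a pressure of 2048 halves a 5 ms base slice to 2.5 ms, and any pressure at or above 16384 reduces a 16 ms base slice to 1 ms.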
… delay
When a cgroup's quota is tight and it runs a mixture of short-running
and compute-intensive tasks, the compute-intensive tasks accumulate
large scheduler vtime values while short-running tasks stay close
to the minimum. In the BTQ, position is determined purely by vtime,
so a steady stream of newly enqueued short-running (low-vtime) tasks
keeps overtaking the compute-intensive (high-vtime) tasks already
waiting. The compute-intensive tasks can be starved arbitrarily
long -- especially when the cgroup is throttled frequently and the
queue keeps refilling.
Fix this by blending wall-clock time into the BTQ key so a task's
queue position eventually advances regardless of its vtime:
    btq_vtime = (scx_bpf_now() & CBW_BTQ_VTIME_UPPER_MASK) |
                (vtime & CBW_BTQ_VTIME_LOWER_MASK)
The upper 32 bits come from the current nanosecond timestamp; the
lower 32 bits come from the scheduler-provided vtime. The 64-bit
key is split evenly so each side contributes 32 bits. Tasks enqueued
within the same ~4-second window (2^32 ns ~= 4.29 s) still compete
by their scheduler vtime, preserving relative fairness. Once a new
wall-clock epoch begins, earlier-queued tasks take priority
regardless of their vtime, so no task waits more than ~4 seconds
in the BTQ due to vtime ordering alone.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
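The key construction described in this commit message can be sketched in plain C as follows. The mask values are assumptions inferred from the "upper 32 bits / lower 32 bits" description, the names carry a `SKETCH_` prefix to mark them as hypothetical, and the timestamp is passed in as a parameter rather than read from scx_bpf_now() so the example compiles outside BPF.

```c
#include <stdint.h>

/* Assumed mask values for an even 32/32 split of the 64-bit key. */
#define SKETCH_BTQ_VTIME_UPPER_MASK 0xFFFFFFFF00000000ULL
#define SKETCH_BTQ_VTIME_LOWER_MASK 0x00000000FFFFFFFFULL

/* Build the queue key: the wall-clock epoch (about 4.29 s wide) goes in
 * the upper half, the scheduler vtime in the lower half. Smaller keys
 * sort earlier. */
static inline uint64_t sketch_btq_key(uint64_t now_ns, uint64_t vtime)
{
	return (now_ns & SKETCH_BTQ_VTIME_UPPER_MASK) |
	       (vtime & SKETCH_BTQ_VTIME_LOWER_MASK);
}
```

Within one epoch the upper halves are equal, so ordering falls back to vtime. Across epochs the upper half dominates, so a task queued in an earlier epoch sorts ahead of any task queued later, however large its own vtime is.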
b985cf3 to
c584bb1
When a cgroup's cpu.max quota is tight relative to the work its tasks
want to do, the cgroup spends most of its time throttled. Tasks then
wait a long time for the small amount of CPU the quota allows, and
that wait is not bounded: it can grow until a task trips the
30-second SCX runnable-task-stall watchdog and brings the scheduler
down. Worse, the wait is not even fairly distributed -- some tasks
can be starved indefinitely while others make progress.
A single throttled cgroup should never be able to stall the whole
scheduler. This series makes that guarantee by attacking the two
ways an unbounded wait arises.
The first is slice length. Under a tight quota, the scheduler can
only admit a few tasks per period, and if each runs a full slice,
the rest wait proportionally longer -- the more contended the cgroup,
the worse it gets. The scheduler needs to know how contended a
cgroup is so it can react. [2/3] gives it that: scx_cgroup_bw_pressure()
reports a cgroup's bandwidth pressure, and scx_lavd shortens slices as
pressure rises, so more tasks get a turn within the watchdog window
instead of a few monopolizing each period.
The second is queue ordering. A throttled task's place in line is
its scheduler vtime, which is fine for fairness but offers no
liveness guarantee: a compute-heavy task accumulates large vtime and
can sit behind an endless stream of short, small-vtime tasks, never
reaching the front. Fairness alone does not bound how long it waits.
[3/3] adds that bound -- a task is guaranteed to reach the head of
the queue within a few seconds regardless of vtime -- while keeping
the scheduler's vtime ordering intact for tasks queued close
together.
[1/3] is preparatory refactoring with no functional change: it
factors the per-task context-cache accessors into helpers so [2/3]
can reuse them.
Signed-off-by: Changwoo Min <changwoo@igalia.com>