scx_lavd, lib/cgroup_bw: cap task-stall latency under heavy cpu.max throttling #3618

Open: multics69 wants to merge 3 commits into sched-ext:main from multics69:pr-lavd-cap-cbw

@multics69

When a cgroup's cpu.max quota is tight relative to the work its tasks
want to do, the cgroup spends most of its time throttled. Tasks then
wait a long time for the small amount of CPU the quota allows, and
that wait is not bounded: it can grow until a task trips the
30-second SCX runnable-task-stall watchdog and brings the scheduler
down. Worse, the wait is not even fairly distributed -- some tasks
can be starved indefinitely while others make progress.

A single throttled cgroup should never be able to stall the whole
scheduler. This series makes that guarantee by attacking the two
ways an unbounded wait arises.

The first is slice length. Under a tight quota, the scheduler can
only admit a few tasks per period, and if each runs a full slice,
the rest wait proportionally longer -- the more contended the cgroup,
the worse it gets. The scheduler needs to know how contended a
cgroup is so it can react. [2/3] gives it that: scx_cgroup_bw_pressure()
reports a cgroup's bandwidth pressure, and scx_lavd shortens slices as
pressure rises, so more tasks get a turn within the watchdog window
instead of a few monopolizing each period.

The second is queue ordering. A throttled task's place in line is
its scheduler vtime, which is fine for fairness but offers no
liveness guarantee: a compute-heavy task accumulates large vtime and
can sit behind an endless stream of short, small-vtime tasks, never
reaching the front. Fairness alone does not bound how long it waits.
[3/3] adds that bound -- a task is guaranteed to reach the head of
the queue within a few seconds regardless of vtime -- while keeping
the scheduler's vtime ordering intact for tasks queued close
together.

[1/3] is preparatory refactoring with no functional change: it
factors the per-task context-cache accessors into helpers so [2/3]
can reuse them.

  • [1/3] lib/cgroup_bw: factor taskc-cached cgx/llcx accessors into helpers
  • [2/3] lib/cgroup_bw, scx_lavd: add scx_cgroup_bw_pressure() API
  • [3/3] lib/cgroup_bw: blend wall-clock time into BTQ vtime to bound throttle delay

Signed-off-by: Changwoo Min <changwoo@igalia.com>

@multics69

Rebased to the main branch.

Earlier commits introduced per-task caching of the cgroup and
per-LLC contexts to eliminate hot-path map lookups:

  335e754 ("lib/cgroup_bw: cache cgx per task to eliminate throttle-path map lookup")
  d848162 ("lib/cgroup_bw: cache llcx per task to eliminate consume-path map lookup")

However, the cache lookup and invalidation patterns are open-coded
at every call site -- cbw_cgroup_bw_throttled(),
scx_cgroup_bw_consume(), and scx_cgroup_bw_pressure() for the
lookup; cbw_drain_atq_to_root(), cbw_free_llc_ctx(), and
scx_cgroup_bw_move() for the invalidation. Duplicated code means
duplicated bugs.

Add three static __always_inline helpers that centralise the
pattern:

  cbw_taskc_get_cgx_raw(taskc, cgrp_id)
  cbw_taskc_get_llcx_raw(taskc, cgrp_id, llc_id)
  cbw_taskc_invalidate(taskc)

The getters accept a possibly-NULL taskc and return 0 on miss so
each caller keeps its own miss policy. cbw_taskc_invalidate()
centralises the __sync_lock_test_and_set workaround for the
arena-pointer fields, letting scx_cgroup_bw_move() drop its local
`volatile` qualifier.
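
The helper contract described above can be modelled in plain userspace C.
This is only a sketch: the struct layout, the field names, and the
`model_` function names are hypothetical, since the real task context is
not shown in this series. What it illustrates is the stated behaviour: a
getter tolerates a NULL taskc and returns 0 on any miss, and invalidation
clears the cached pointers with `__sync_lock_test_and_set`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task cache; the real task context layout may differ. */
struct taskc_model {
	uint64_t cached_cgrp_id; /* cgroup the cached pointers belong to */
	int32_t  cached_llc_id;  /* LLC the cached llcx belongs to */
	uint64_t cgx;            /* cached cgroup context, 0 if none */
	uint64_t llcx;           /* cached per-LLC context, 0 if none */
};

/* Return the cached cgx, or 0 on a miss (NULL taskc, empty, or stale id). */
static inline uint64_t model_get_cgx_raw(const struct taskc_model *taskc,
					 uint64_t cgrp_id)
{
	if (!taskc || !taskc->cgx || taskc->cached_cgrp_id != cgrp_id)
		return 0;
	return taskc->cgx;
}

/* Same policy for the per-LLC context, which must also match llc_id. */
static inline uint64_t model_get_llcx_raw(const struct taskc_model *taskc,
					  uint64_t cgrp_id, int32_t llc_id)
{
	if (!taskc || !taskc->llcx || taskc->cached_cgrp_id != cgrp_id ||
	    taskc->cached_llc_id != llc_id)
		return 0;
	return taskc->llcx;
}

/* Clear both cached pointers in one place instead of at every call site. */
static inline void model_invalidate(struct taskc_model *taskc)
{
	if (!taskc)
		return;
	(void)__sync_lock_test_and_set(&taskc->cgx, 0);
	(void)__sync_lock_test_and_set(&taskc->llcx, 0);
}
```

Because a miss is reported as 0 rather than handled inside the getter,
each caller can still choose between falling back to a map lookup and
bailing out.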

No semantic change.

Signed-off-by: Changwoo Min <changwoo@igalia.com>

When a cgroup is throttled, giving each of its tasks a full time slice
works against it. If the cgroup's quota is tight and many tasks are
competing for it, each task holds the CPU for its full slice before
the scheduler can recheck the throttle, and every other task waits
that much longer. Shortening the slice lets more tasks make forward
progress within a reasonable window -- well before the 30 s SCX
watchdog would fire on a stalled runnable task.

Add scx_cgroup_bw_pressure() to expose a 1024-scale pressure hint
that BPF schedulers can use to shorten time slices proportionally:

  slice = (base_slice * 1024) / pressure

Pressure is computed at each replenishment boundary from two signals
that are combined by addition so that both contribute independently:

  Budget pressure: a hyperbolic curve that rises steeply below 25%
  of the replenished period_budget.  A small budget after
  replenishment also indicates accumulated debt from prior
  over-consumption, so high pressure is correct in that case too.

  Backlog pressure: a linear term proportional to the number of
  tasks queued in the BTQ across all LLC domains.  A growing backlog
  signals that the reenqueue path cannot drain fast enough; shorter
  slices reduce the time any single task monopolises the CPU.
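
One way the two signals might be combined is sketched below. Every
constant and curve shape here is an assumption made for illustration:
the 25% knee is read as applying to the budget left after replenishment,
the hyperbola is `knee / budget - 1`, and the per-task backlog step of 64
is invented. The `model_` names are hypothetical, and the result is the
raw sum before any clamping.

```c
#include <stdint.h>

#define PRESSURE_UNIT 1024ULL /* 1024-scale: 1024 means no pressure */
#define BACKLOG_STEP  64ULL   /* hypothetical pressure added per queued task */

/*
 * Illustrative hyperbolic budget term: 0 at or above 25% of period_budget,
 * growing as (knee / budget - 1) in 1024-scale below it. Assumes
 * period_budget is small enough that PRESSURE_UNIT * knee fits in 64 bits.
 */
static inline uint64_t model_budget_pressure(uint64_t budget,
					     uint64_t period_budget)
{
	uint64_t knee = period_budget / 4;

	if (budget >= knee)
		return 0;
	if (budget == 0)
		budget = 1; /* avoid dividing by zero on an exhausted budget */
	return PRESSURE_UNIT * knee / budget - PRESSURE_UNIT;
}

/* Illustrative linear backlog term: grows with tasks queued in the BTQ. */
static inline uint64_t model_backlog_pressure(uint64_t nr_queued)
{
	return nr_queued * BACKLOG_STEP;
}

/* The two terms are added so that either one alone can raise pressure. */
static inline uint64_t model_pressure(uint64_t budget, uint64_t period_budget,
				      uint64_t nr_queued)
{
	return PRESSURE_UNIT + model_budget_pressure(budget, period_budget) +
	       model_backlog_pressure(nr_queued);
}
```

Adding the terms rather than taking their maximum means that a cgroup
with a healthy budget but a long queue still sees shorter slices, and
the reverse.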

The combined pressure is clamped to [1024, 16384], limiting the
maximum reduction to 1/16 of the base slice.
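
The scaling arithmetic can be checked with a small userspace sketch. The
formula and the [1024, 16384] bounds come from the description above; the
function and macro names are hypothetical, and clamping inside the scaling
helper is a defensive choice of this sketch, not a statement about where
the library applies it.

```c
#include <stdint.h>

#define PRESSURE_MIN 1024ULL  /* no pressure: slice is unchanged */
#define PRESSURE_MAX 16384ULL /* maximum pressure: slice is base / 16 */

/* Clamp a pressure hint into the documented [1024, 16384] range. */
static inline uint64_t model_clamp_pressure(uint64_t pressure)
{
	if (pressure < PRESSURE_MIN)
		return PRESSURE_MIN;
	if (pressure > PRESSURE_MAX)
		return PRESSURE_MAX;
	return pressure;
}

/* slice = (base_slice * 1024) / pressure, with pressure clamped first. */
static inline uint64_t model_scale_slice(uint64_t base_slice, uint64_t pressure)
{
	return base_slice * 1024 / model_clamp_pressure(pressure);
}
```

With a 5 ms base slice (5,000,000 ns), a pressure of 2048 halves the
slice to 2.5 ms, and any pressure at or above 16384 yields 312,500 ns,
which is the 1/16 floor.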

scx_lavd adopts the new API in calc_time_slice(): pressure is
fetched once per scheduling decision, slice boost is suppressed
under any throttle pressure, and the final slice is scaled by the
pressure before being assigned to the task.

Signed-off-by: Changwoo Min <changwoo@igalia.com>

… delay

When a cgroup's quota is tight and it runs a mixture of short-running
and compute-intensive tasks, the compute-intensive tasks accumulate
large scheduler vtime values while short-running tasks stay close
to the minimum. In the BTQ, position is determined purely by vtime,
so a steady stream of newly enqueued short-running (low-vtime) tasks
keeps overtaking the compute-intensive (high-vtime) tasks already
waiting. The compute-intensive tasks can be starved arbitrarily
long -- especially when the cgroup is throttled frequently and the
queue keeps refilling.

Fix this by blending wall-clock time into the BTQ key so a task's
queue position eventually advances regardless of its vtime:

    btq_vtime = (scx_bpf_now() & CBW_BTQ_VTIME_UPPER_MASK) |
                (vtime & CBW_BTQ_VTIME_LOWER_MASK)

The upper 32 bits come from the current nanosecond timestamp; the
lower 32 bits come from the scheduler-provided vtime. The 64-bit
key is split evenly so each side contributes 32 bits. Tasks enqueued
within the same ~4-second window (2^32 ns ~= 4.29 s) still compete
by their scheduler vtime, preserving relative fairness. Once a new
wall-clock epoch begins, earlier-queued tasks take priority
regardless of their vtime, so no task waits more than ~4 seconds
in the BTQ due to vtime ordering alone.
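
The ordering property can be demonstrated outside BPF with a short
sketch. The mask values are assumptions derived from the even 32/32 split
described above, `model_btq_key` is a hypothetical name, and a plain
`now_ns` argument stands in for `scx_bpf_now()`.

```c
#include <stdint.h>

/* Assumed values for the two masks, based on the even 32/32 split. */
#define MODEL_BTQ_UPPER_MASK 0xFFFFFFFF00000000ULL
#define MODEL_BTQ_LOWER_MASK 0x00000000FFFFFFFFULL

/*
 * Build the queue key: the upper half is the wall-clock epoch (one epoch is
 * 2^32 ns), the lower half is the low 32 bits of the scheduler vtime. The
 * upper 32 bits of vtime are discarded. Smaller keys are dequeued first.
 */
static inline uint64_t model_btq_key(uint64_t now_ns, uint64_t vtime)
{
	return (now_ns & MODEL_BTQ_UPPER_MASK) | (vtime & MODEL_BTQ_LOWER_MASK);
}
```

Two tasks enqueued in the same epoch compare by vtime alone, so a
low-vtime task enqueued later can still go first. Once the timestamp
crosses into the next epoch, every newly enqueued task sorts behind all
tasks from the earlier epoch, whatever its vtime.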

Signed-off-by: Changwoo Min <changwoo@igalia.com>