Skip to content

scx_pandemoniumv5.15.0:#3675

Open
wllclngn wants to merge 1 commit into
sched-ext:mainfrom
wllclngn:update-pandemonium-5.15.0
Open

scx_pandemoniumv5.15.0:#3675
wllclngn wants to merge 1 commit into
sched-ext:mainfrom
wllclngn:update-pandemonium-5.15.0

Conversation

@wllclngn

Copy link
Copy Markdown
Contributor

v5.15.0 is rebuilt from the v5.12.0 base -- the last clean floor -- and what makes it v5.15.0 is that the latent bug in that floor was finally found and resolved. v5.12.0's per-CPU kthread and interactive verdict was scattered across six uncoordinated sites that overlapped on the easy on-home wakeup and left the cross-CPU strand with no net, so the base stranded pinned kthreads (ksoftirqd/N, kworker/N) worse than EEVDF. Collapsing all six into ONE judgement -- set once in runnable(), read everywhere else -- clears it: montauk kstrand at 2C, the worst case, reports the strands gone. Everything else is re-added onto the cleaned base one gated, measured piece at a time so each earns its place against the storm and latency numbers before the next lands. ONE-JUDGEMENT PER-CPU KTHREAD UNIFICATION below carries the fix; v5.15.0-CHECKLIST.md carries the full re-add ledger and the original release notes; this records what has landed and validated.

RESET TO THE v5.12.0 BASE -- src/

  • The scheduler crate (BPF, the Rust loader and intf.h) resets to the v5.12.0 commit, the clean placer the rest stands on: per-CPU DSQs, one flattened cache_domain, the node_dsq spill, no CCX/Phi machinery yet. The harness under tests/ is kept current so its fixes survive the reset.

IDLE QUIESCENCE ENVELOPE -- src/bpf/main.bpf.c, src/bpf/intf.h, src/adaptive.rs, src/scheduler.rs

  • The v5.13.0 envelope re-adds onto the base's existing CoDel oscillator: a decayed disp^2+v^2 energy reservoir parks the oscillator at codel equilibrium when it settles -- pin the target, freeze the velocity integrator, stop the per-tick recompute -- until a rescue event or an equilibrium retune disturbs it. A Schmitt trigger on the energy (park below env_park, release at 2x) with a heartbeat valve. nr_osc_park lands as a stat with adaptive park: telemetry.

INTERACTIVE-WAITING PREEMPT -- src/bpf/main.bpf.c

  • The per-CPU tick preempt gated only on pcpu_enqueue_ns -- the per-CPU DSQ waiter -- and was blind to interactive_enqueue_ns, the shared node_dsq waiter. An interactive task spilled to node_dsq under load forced no resident off its core, dispatch never ran, and the task starved behind a batch hog up to the 167ms net: the 116-195ms interactive-stall tail under mixed and long-run load. The v5.11->v5.12 boundary split one waiter signal in two and wired only the per-CPU half.
  • The fix restores v5.11.0's global interactive-waiting path keyed on the existing interactive_enqueue_ns, placed before the per-CPU early-return so it is reached: a BATCH resident held past preempt_thresh while a node_dsq interactive waiter is pending is preempted -- dispatch runs and STEP 2 drains the waiter (sojourn_gate_pass already keys on the same signal). BATCH residents only; interactive residents keep their protection.
  • Measured: long-run worst 153ms -> 2.6ms on the BPF arm at every width (2/4/8/16C), mixed worst 167ms -> 4ms, no preemption storm at 16 threads.

V7.1 REENQUEUE GUARD REMOVED -- src/bpf/main.bpf.c

  • The SCX_ENQ_REENQ short-circuit added for the 7.1 cpu_release rework routed a reenqueued task straight to node_dsq with no kick, on the theory it broke a 7.1.1 reenqueue<->kick livelock. It moved nothing measurable on 7.0.13 and is removed: pandemonium_enqueue falls from the wakeup classify straight into Tier 1 as on the v5.12.0 base. cpu_release reenqueues unconditionally.

SOJOURN UNIFICATION -- src/bpf/main.bpf.c, src/bpf/intf.h, src/tuning.rs, src/adaptive.rs

  • codel_thresh_ns renames sojourn_thresh_ns (the batch-DSQ rescue deadline, a Rust-set knob); the pcpu_sojourn_thresh local alias collapses into it.
  • codel_starve_ns renames starvation_rescue_ns (the ~100ms last-resort net, KEPT distinct from the ~6ms target band).
  • codel_seed_ns renames codel_target_equilibrium_ns (the R_eff x tau equilibrium the oscillator parks at; codel_eq_ns stays the feeding knob, the clamp preserved).
  • sojourn_stamp_overflow[] as struct { u64 inter, batch; } folds interactive_enqueue_ns + batch_enqueue_ns onto one cacheline per domain (per-tier x per-domain preserved, zero cells lost); sojourn_stamp_pcpu[] renames pcpu_enqueue_ns (stays separate, 64-byte padded).
  • DELETED overflow_sojourn_rescue_ns. The storm fix had the CPU-0 tick set it = codel_target_ns every beat while apply_tau_scaling still seeded it = the equilibrium -- a two-writer conflict that disagreed whenever the oscillator drove the target off equilibrium. Resolved to the LIVE target (every other sojourn deadline in dispatch keys off codel_target_ns), dropped the vestigial equilibrium writer; the reads read codel_target_ns directly.
  • Effect: behavior-preserving. montauk sojourn-pressure A/B -- 16C p99 1018us against the v5.14.0 baseline ~2994us, held and under, 4/4 widths survived. The redundancy was real and free.

ONE-JUDGEMENT PER-CPU KTHREAD UNIFICATION -- src/bpf/main.bpf.c

  • The per-CPU kthread (ksoftirqd/N, kworker/N) latency verdict was scattered across six uncoordinated sites -- the procdb tier vote, the runnable behavioral classify, a three-branch override cascade, an enqueue fast path, the enqueue affinity skip and the tick preempt -- each added to patch a different strand, overlapping on the on-home wakeup and leaving the cross-CPU wakeup with no net. The result stranded pinned kthreads worse than EEVDF: pcpu-burst 5 strands to 58ms, sojourn-pressure 2 to 56ms, fork-thread 26 to 50ms where EEVDF stranded none.
  • The collapse: runnable() is the sole decider. The kthread flags (PF_KTHREAD, nr_cpus_allowed, static_prio) are tested once and folded into tctx->pinned_service alongside the tier; enqueue() and tick() become pure reads of that bit. No struct growth -- the bit reuses existing pad.
  • enqueue() routes a pinned-service task straight to its home CPU's per-CPU DSQ (drained first by dispatch STEP 0, cache-hot) and PREEMPT-kicks it. A pinned task has no placement choice, so this covers BOTH the on-home and the cross-CPU wakeup -- the old on-home-only fast path left the cross-CPU strand falling to the ~167ms net.
  • tick() restores the pinned-waiter backstop keyed on the per-CPU sojourn signal, placed before the LAT_CRITICAL tight-band exemption so a welded ksoftirqd/kworker behind a LAT_CRITICAL resident is rescued -- steal cannot move a pinned task. The separate pcpu_pinned_strand_ns[] stamp array is gone; the one bit carries it.
  • Measured: montauk kstrand at 2C, the worst case. ipc, longrun, mixed-ADAPTIVE and longrun-ADAPTIVE report no strands over threshold; mixed-BPF shows one 8.1ms kworker/1:1 held 99% by montauk itself -- the tracer, not the scheduler -- far under the net. dispatch-stall attributes the residual floored wakes to python3 saturation at 2 CPUs, 0% priority inversion and 0% FIFO violation.

"Everywhere it is machines—real ones, not figurative ones: machines driving
other machines, machines being driven by other machines, with all the
necessary couplings and connections."
— Gilles Deleuze and Félix Guattari, Anti-Oedipus

v5.15.0 is rebuilt from the v5.12.0 base -- the last clean floor -- and what makes it v5.15.0 is
that the latent bug in that floor was finally found and resolved. v5.12.0's per-CPU kthread and
interactive verdict was scattered across six uncoordinated sites that overlapped on the easy
on-home wakeup and left the cross-CPU strand with no net, so the base stranded pinned kthreads
(ksoftirqd/N, kworker/N) worse than EEVDF. Collapsing all six into ONE judgement -- set once in
runnable(), read everywhere else -- clears it: montauk kstrand at 2C, the worst case, reports the
strands gone. Everything else is re-added onto the cleaned base one gated, measured piece at a
time so each earns its place against the storm and latency numbers before the next lands.
ONE-JUDGEMENT PER-CPU KTHREAD UNIFICATION below carries the fix; v5.15.0-CHECKLIST.md carries the
full re-add ledger and the original release notes; this records what has landed and validated.

RESET TO THE v5.12.0 BASE -- src/
- The scheduler crate (BPF, the Rust loader and intf.h) resets to the v5.12.0 commit, the clean
  placer the rest stands on: per-CPU DSQs, one flattened cache_domain, the node_dsq spill, no
  CCX/Phi machinery yet. The harness under tests/ is kept current so its fixes survive the reset.

IDLE QUIESCENCE ENVELOPE -- src/bpf/main.bpf.c, src/bpf/intf.h, src/adaptive.rs, src/scheduler.rs
- The v5.13.0 envelope re-adds onto the base's existing CoDel oscillator: a decayed disp^2+v^2
  energy reservoir parks the oscillator at codel equilibrium when it settles -- pin the target,
  freeze the velocity integrator, stop the per-tick recompute -- until a rescue event or an
  equilibrium retune disturbs it. A Schmitt trigger on the energy (park below env_park, release at
  2x) with a heartbeat valve. nr_osc_park lands as a stat with adaptive park: telemetry.

INTERACTIVE-WAITING PREEMPT -- src/bpf/main.bpf.c
- The per-CPU tick preempt gated only on pcpu_enqueue_ns -- the per-CPU DSQ waiter -- and was blind
  to interactive_enqueue_ns, the shared node_dsq waiter. An interactive task spilled to node_dsq
  under load forced no resident off its core, dispatch never ran, and the task starved behind a
  batch hog up to the 167ms net: the 116-195ms interactive-stall tail under mixed and long-run
  load. The v5.11->v5.12 boundary split one waiter signal in two and wired only the per-CPU half.
- The fix restores v5.11.0's global interactive-waiting path keyed on the existing
  interactive_enqueue_ns, placed before the per-CPU early-return so it is reached: a BATCH resident
  held past preempt_thresh while a node_dsq interactive waiter is pending is preempted -- dispatch
  runs and STEP 2 drains the waiter (sojourn_gate_pass already keys on the same signal). BATCH
  residents only; interactive residents keep their protection.
- Measured: long-run worst 153ms -> 2.6ms on the BPF arm at every width (2/4/8/16C), mixed worst
  167ms -> 4ms, no preemption storm at 16 threads.

V7.1 REENQUEUE GUARD REMOVED -- src/bpf/main.bpf.c
- The SCX_ENQ_REENQ short-circuit added for the 7.1 cpu_release rework routed a reenqueued task
  straight to node_dsq with no kick, on the theory it broke a 7.1.1 reenqueue<->kick livelock. It
  moved nothing measurable on 7.0.13 and is removed: pandemonium_enqueue falls from the wakeup
  classify straight into Tier 1 as on the v5.12.0 base. cpu_release reenqueues unconditionally.

SOJOURN UNIFICATION -- src/bpf/main.bpf.c, src/bpf/intf.h, src/tuning.rs, src/adaptive.rs
- codel_thresh_ns renames sojourn_thresh_ns (the batch-DSQ rescue deadline, a Rust-set
  knob); the pcpu_sojourn_thresh local alias collapses into it.
- codel_starve_ns renames starvation_rescue_ns (the ~100ms last-resort net, KEPT distinct
  from the ~6ms target band).
- codel_seed_ns renames codel_target_equilibrium_ns (the R_eff x tau equilibrium the
  oscillator parks at; codel_eq_ns stays the feeding knob, the clamp preserved).
- sojourn_stamp_overflow[] as struct { u64 inter, batch; } folds interactive_enqueue_ns +
  batch_enqueue_ns onto one cacheline per domain (per-tier x per-domain preserved, zero cells
  lost); sojourn_stamp_pcpu[] renames pcpu_enqueue_ns (stays separate, 64-byte padded).
- DELETED overflow_sojourn_rescue_ns. The storm fix had the CPU-0 tick set it = codel_target_ns
  every beat while apply_tau_scaling still seeded it = the equilibrium -- a two-writer conflict
  that disagreed whenever the oscillator drove the target off equilibrium. Resolved to the LIVE
  target (every other sojourn deadline in dispatch keys off codel_target_ns), dropped the
  vestigial equilibrium writer; the reads read codel_target_ns directly.
- Effect: behavior-preserving. montauk sojourn-pressure A/B -- 16C p99 1018us against the
  v5.14.0 baseline ~2994us, held and under, 4/4 widths survived. The redundancy was real and free.

ONE-JUDGEMENT PER-CPU KTHREAD UNIFICATION -- src/bpf/main.bpf.c
- The per-CPU kthread (ksoftirqd/N, kworker/N) latency verdict was scattered across six
  uncoordinated sites -- the procdb tier vote, the runnable behavioral classify, a three-branch
  override cascade, an enqueue fast path, the enqueue affinity skip and the tick preempt -- each
  added to patch a different strand, overlapping on the on-home wakeup and leaving the cross-CPU
  wakeup with no net. The result stranded pinned kthreads worse than EEVDF: pcpu-burst 5 strands to
  58ms, sojourn-pressure 2 to 56ms, fork-thread 26 to 50ms where EEVDF stranded none.
- The collapse: runnable() is the sole decider. The kthread flags (PF_KTHREAD, nr_cpus_allowed,
  static_prio) are tested once and folded into tctx->pinned_service alongside the tier; enqueue()
  and tick() become pure reads of that bit. No struct growth -- the bit reuses existing pad.
- enqueue() routes a pinned-service task straight to its home CPU's per-CPU DSQ (drained first by
  dispatch STEP 0, cache-hot) and PREEMPT-kicks it. A pinned task has no placement choice, so this
  covers BOTH the on-home and the cross-CPU wakeup -- the old on-home-only fast path left the
  cross-CPU strand falling to the ~167ms net.
- tick() restores the pinned-waiter backstop keyed on the per-CPU sojourn signal, placed before the
  LAT_CRITICAL tight-band exemption so a welded ksoftirqd/kworker behind a LAT_CRITICAL resident is
  rescued -- steal cannot move a pinned task. The separate pcpu_pinned_strand_ns[] stamp array is
  gone; the one bit carries it.
- Measured: montauk kstrand at 2C, the worst case. ipc, longrun, mixed-ADAPTIVE and longrun-ADAPTIVE
  report no strands over threshold; mixed-BPF shows one 8.1ms kworker/1:1 held 99% by montauk itself
  -- the tracer, not the scheduler -- far under the net. dispatch-stall attributes the residual
  floored wakes to python3 saturation at 2 CPUs, 0% priority inversion and 0% FIFO violation.

HARNESS -- HONEST EJECTION DETECTION -- tests/pandemonium-tests.py
- The bench phases checked guard.proc.poll() -- the userspace process alive -- never the scx
  registration, so a watchdog ejection that left the process running read as still-PANDEMONIUM
  while the kernel was on EEVDF. scheduler_active() reads is_scx_active() + scx_scheduler_name(),
  and sched_ejected(guard) folds it into every survival check: an ejected scheduler reports
  CRASHED, never a lie.

HARNESS -- ONE EXIT GUARD (SHARED) -- pandemonium_common.py, tests/prism.py, tests/pandemonium-tests.py
- prism's Ctrl+C cleanup becomes the suite's single exit guard, lifted into pandemonium_common:
  eject_scheduler (systemctl stop, then force-kill by exact comm for a bare loader systemd does not
  own) + install_exit_guard (SIGINT/SIGTERM/atexit -> registered cleanups, eject, re-online every
  offlined CPU) + register_exit_cleanup. prism and pandemonium-tests both install it; the
  hand-rolled _cleanup_on_exit / _fatal_signal_eject is deleted, no other signal/atexit handler
  remains.
- The bug it closes: pandemonium-tests ejected only the schedulers it spawned via a process-group
  SIGINT, which missed the systemd pandemonium service and left the box stuck on a CPU subset after
  Ctrl+C. The canonical eject plus the unconditional CPU re-online fix both.

HARNESS -- MULTI-WORKLOAD --dev, EXPLICIT --trace ONLY, PROBE CAPTURE -- tests/prism.py, src/cli/probe.rs, tests/pandemonium-tests.py
- prism --dev takes one or more workloads (prism --dev ipc fork-thread runs both in sequence).
  --trace is respected only when passed: a bare --dev stays pre-elevation and captures nothing, and
  the trace-capable allow-list that auto-captured is removed -- one flag, one behaviour.
- The latency probe sets a distinct comm pand-probe via prctl so montauk can target just the
  probe; the long-run and mixed scale phases record it gated on --trace -- the interactive-stall
  tail lands in .events for montauk_analyze, the event the scale numbers summarize but could not
  show before.

HARNESS -- COMPREHENSIVE prism --dev ipc -- tests/pandemonium-tests.py
- prism --dev ipc was one number per cell (RTT p99). It now carries per-primitive throughput
  (RT/s, implied from the median round-trip) beside the latency, and under --trace folds montauk's
  own analysis of the IPC capture -- wake2run dispatch latency, cache locality / cross-CCX, the
  task that held the core under saturation, dispatch fractality -- per scheduler against the EEVDF
  baseline. Localized to the --ipc path: prism-scale and the IPC engine are byte-identical.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant