scx_pandemonium v5.15.0: by wllclngn · Pull Request #3675 · sched-ext/scx

wllclngn · 2026-06-27T17:57:27Z

v5.15.0 is rebuilt from the v5.12.0 base -- the last clean floor -- and what makes it v5.15.0 is that the latent bug in that floor was finally found and resolved. v5.12.0's per-CPU kthread and interactive verdict was scattered across six uncoordinated sites that overlapped on the easy on-home wakeup and left the cross-CPU strand with no net, so the base stranded pinned kthreads (ksoftirqd/N, kworker/N) worse than EEVDF. Collapsing all six into ONE judgement -- set once in runnable(), read everywhere else -- clears it: montauk kstrand at 2C, the worst case, reports the strands gone. Everything else is re-added onto the cleaned base one gated, measured piece at a time so each earns its place against the storm and latency numbers before the next lands. ONE-JUDGEMENT PER-CPU KTHREAD UNIFICATION below carries the fix; v5.15.0-CHECKLIST.md carries the full re-add ledger and the original release notes; this records what has landed and validated.

RESET TO THE v5.12.0 BASE -- src/

The scheduler crate (BPF, the Rust loader and intf.h) resets to the v5.12.0 commit, the clean placer the rest stands on: per-CPU DSQs, one flattened cache_domain, the node_dsq spill, no CCX/Phi machinery yet. The harness under tests/ is kept current so its fixes survive the reset.

IDLE QUIESCENCE ENVELOPE -- src/bpf/main.bpf.c, src/bpf/intf.h, src/adaptive.rs, src/scheduler.rs

The v5.13.0 envelope re-adds onto the base's existing CoDel oscillator: a decayed disp^2+v^2 energy reservoir parks the oscillator at codel equilibrium when it settles -- pin the target, freeze the velocity integrator, stop the per-tick recompute -- until a rescue event or an equilibrium retune disturbs it. The gate is a Schmitt trigger on the energy (park below env_park, release at 2x env_park) with a heartbeat valve. nr_osc_park lands as a stat with "adaptive park:" telemetry.
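
The park/release hysteresis can be sketched in plain Python. This is an illustrative model only, not the BPF code: the decay factor, the class name and the tick() signature are assumptions, and the heartbeat valve and the rescue/retune disturbances are left out. Only env_park and the 2x release point come from the notes above.

```python
from dataclasses import dataclass


@dataclass
class Envelope:
    """Schmitt-trigger gate on a decayed energy reservoir (illustrative model)."""
    env_park: float        # park threshold, name taken from the notes
    decay: float = 0.5     # hypothetical per-tick decay factor
    energy: float = 0.0
    parked: bool = False

    def tick(self, disp: float, vel: float) -> bool:
        # Decay the reservoir, then add this tick's disp^2 + v^2.
        self.energy = self.energy * self.decay + disp * disp + vel * vel
        if not self.parked and self.energy < self.env_park:
            self.parked = True       # settled: park below env_park
        elif self.parked and self.energy > 2 * self.env_park:
            self.parked = False      # disturbed: release at 2x env_park
        return self.parked
```

Between env_park and 2x env_park the gate keeps whatever state it already had, which is what stops it from flapping when the energy hovers near one threshold.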

INTERACTIVE-WAITING PREEMPT -- src/bpf/main.bpf.c

The per-CPU tick preempt gated only on pcpu_enqueue_ns -- the per-CPU DSQ waiter -- and was blind to interactive_enqueue_ns, the shared node_dsq waiter. An interactive task spilled to node_dsq under load forced no resident off its core, dispatch never ran, and the task starved behind a batch hog up to the 167ms net: the 116-195ms interactive-stall tail under mixed and long-run load. The v5.11->v5.12 boundary split one waiter signal in two and wired only the per-CPU half.
The fix restores v5.11.0's global interactive-waiting path keyed on the existing interactive_enqueue_ns, placed before the per-CPU early-return so it is reached: a BATCH resident held past preempt_thresh while a node_dsq interactive waiter is pending is preempted -- dispatch runs and STEP 2 drains the waiter (sojourn_gate_pass already keys on the same signal). BATCH residents only; interactive residents keep their protection.
Measured: long-run worst 153ms -> 2.6ms on the BPF arm at every width (2/4/8/16C), mixed worst 167ms -> 4ms, no preemption storm at 16 threads.
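
The ordering that matters here (global interactive-waiting check first, per-CPU early-return second) can be modelled in a few lines of Python. This is a sketch under assumptions, not the scheduler's tick(): the TickState fields other than the two stamp names, the tier constants, and the shape of the per-CPU path are hypothetical.

```python
from dataclasses import dataclass

BATCH, INTERACTIVE = "batch", "interactive"


@dataclass
class TickState:
    resident_tier: str            # tier of the task currently on the CPU
    resident_ran_ns: int          # how long the resident has held the core
    interactive_enqueue_ns: int   # node_dsq interactive waiter stamp, 0 = none
    pcpu_enqueue_ns: int          # per-CPU DSQ waiter stamp, 0 = none


def should_preempt(s: TickState, preempt_thresh_ns: int) -> bool:
    # Global interactive-waiting path runs first so the per-CPU early-return
    # below cannot hide a node_dsq waiter. BATCH residents only.
    if (s.interactive_enqueue_ns != 0
            and s.resident_tier == BATCH
            and s.resident_ran_ns > preempt_thresh_ns):
        return True
    # Per-CPU path (simplified assumption): no waiter on this CPU's own DSQ
    # means there is nothing to preempt for.
    if s.pcpu_enqueue_ns == 0:
        return False
    return s.resident_ran_ns > preempt_thresh_ns
```

With the two checks swapped, a CPU whose own DSQ is empty returns early and never sees the node_dsq waiter, which is the starvation described above.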

V7.1 REENQUEUE GUARD REMOVED -- src/bpf/main.bpf.c

The SCX_ENQ_REENQ short-circuit added for the 7.1 cpu_release rework routed a reenqueued task straight to node_dsq with no kick, on the theory it broke a 7.1.1 reenqueue<->kick livelock. It moved nothing measurable on 7.0.13 and is removed: pandemonium_enqueue falls from the wakeup classify straight into Tier 1 as on the v5.12.0 base. cpu_release reenqueues unconditionally.

SOJOURN UNIFICATION -- src/bpf/main.bpf.c, src/bpf/intf.h, src/tuning.rs, src/adaptive.rs

sojourn_thresh_ns is renamed codel_thresh_ns (the batch-DSQ rescue deadline, a Rust-set knob); the pcpu_sojourn_thresh local alias collapses into it.
starvation_rescue_ns is renamed codel_starve_ns (the ~100ms last-resort net, KEPT distinct from the ~6ms target band).
codel_target_equilibrium_ns is renamed codel_seed_ns (the R_eff x tau equilibrium the oscillator parks at; codel_eq_ns stays the feeding knob, the clamp preserved).
interactive_enqueue_ns + batch_enqueue_ns fold into sojourn_stamp_overflow[], a struct { u64 inter, batch; }, on one cacheline per domain (per-tier x per-domain preserved, zero cells lost); pcpu_enqueue_ns is renamed sojourn_stamp_pcpu[] (stays separate, 64-byte padded).
DELETED overflow_sojourn_rescue_ns. The storm fix had the CPU-0 tick set it to codel_target_ns every beat while apply_tau_scaling still seeded it with the equilibrium -- a two-writer conflict that disagreed whenever the oscillator drove the target off equilibrium. Resolved to the LIVE target (every other sojourn deadline in dispatch keys off codel_target_ns) and dropped the vestigial equilibrium writer; the former readers now read codel_target_ns directly.
Effect: behavior-preserving. montauk sojourn-pressure A/B -- 16C p99 1018us against the v5.14.0 baseline ~2994us, held and under, 4/4 widths survived. The redundancy was real and free.
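
The two-writer conflict behind the deleted overflow_sojourn_rescue_ns can be shown with a toy model. The class and function names below are hypothetical and the nanosecond values are made up for illustration; only the variable name and the two writers (the CPU-0 tick and apply_tau_scaling) come from the notes.

```python
class OldShape:
    """One stored deadline written from two places (the conflict)."""

    def __init__(self) -> None:
        self.overflow_sojourn_rescue_ns = 0

    def cpu0_tick(self, codel_target_ns: int) -> None:
        # Writer 1: the tick copies the live target every beat.
        self.overflow_sojourn_rescue_ns = codel_target_ns

    def apply_tau_scaling(self, equilibrium_ns: int) -> None:
        # Writer 2: the retune seeds the equilibrium value.
        self.overflow_sojourn_rescue_ns = equilibrium_ns


def rescue_deadline(codel_target_ns: int) -> int:
    """New shape: no stored copy, readers take the live target directly."""
    return codel_target_ns
```

In the old shape the value a reader sees depends on which writer ran last, and the two only agree while the target sits exactly at equilibrium. Reading the live target removes the stored copy and with it the ordering dependence.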

ONE-JUDGEMENT PER-CPU KTHREAD UNIFICATION -- src/bpf/main.bpf.c

The per-CPU kthread (ksoftirqd/N, kworker/N) latency verdict was scattered across six uncoordinated sites -- the procdb tier vote, the runnable behavioral classify, a three-branch override cascade, an enqueue fast path, the enqueue affinity skip and the tick preempt -- each added to patch a different strand, overlapping on the on-home wakeup and leaving the cross-CPU wakeup with no net. The result stranded pinned kthreads worse than EEVDF: pcpu-burst 5 strands to 58ms, sojourn-pressure 2 to 56ms, fork-thread 26 to 50ms where EEVDF stranded none.
The collapse: runnable() is the sole decider. The kthread flags (PF_KTHREAD, nr_cpus_allowed, static_prio) are tested once and folded into tctx->pinned_service alongside the tier; enqueue() and tick() become pure reads of that bit. No struct growth -- the bit reuses existing pad.
enqueue() routes a pinned-service task straight to its home CPU's per-CPU DSQ (drained first by dispatch STEP 0, cache-hot) and PREEMPT-kicks it. A pinned task has no placement choice, so this covers BOTH the on-home and the cross-CPU wakeup -- the old on-home-only fast path left the cross-CPU strand falling to the ~167ms net.
tick() restores the pinned-waiter backstop keyed on the per-CPU sojourn signal, placed before the LAT_CRITICAL tight-band exemption so a welded ksoftirqd/kworker behind a LAT_CRITICAL resident is rescued -- steal cannot move a pinned task. The separate pcpu_pinned_strand_ns[] stamp array is gone; the one bit carries it.
Measured: montauk kstrand at 2C, the worst case. ipc, longrun, mixed-ADAPTIVE and longrun-ADAPTIVE report no strands over threshold; mixed-BPF shows one 8.1ms kworker/1:1 held 99% by montauk itself -- the tracer, not the scheduler -- far under the net. dispatch-stall attributes the residual floored wakes to python3 saturation at 2 CPUs, 0% priority inversion and 0% FIFO violation.
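
The decide-once, read-everywhere shape can be modelled as below. This is a plain-Python sketch, not the BPF callbacks: the Task fields, the list/dict stand-ins for DSQs and the kick list are assumptions. The static_prio test is omitted because the notes do not state what it compares against, and nr_cpus_allowed == 1 is assumed as the meaning of "pinned".

```python
from dataclasses import dataclass

PF_KTHREAD = 0x00200000  # kernel-thread flag value from include/linux/sched.h


@dataclass
class Task:
    flags: int
    nr_cpus_allowed: int
    home_cpu: int
    pinned_service: bool = False  # the one bit, written only in runnable()


def runnable(t: Task) -> None:
    # Sole decider: test the kthread flags once and store the verdict.
    t.pinned_service = bool(t.flags & PF_KTHREAD) and t.nr_cpus_allowed == 1


def enqueue(t: Task, pcpu_dsq: dict, node_dsq: list, kicks: list) -> None:
    # Pure read of the bit: no reclassification here.
    if t.pinned_service:
        pcpu_dsq.setdefault(t.home_cpu, []).append(t)
        kicks.append(t.home_cpu)   # stands in for the PREEMPT kick
    else:
        node_dsq.append(t)         # placeholder for the normal tier path
```

Because the pinned branch never asks which CPU the wakeup came from, the on-home and cross-CPU wakeups take the same route.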

"Everywhere it is machines—real ones, not figurative ones: machines driving other machines, machines being driven by other machines, with all the necessary couplings and connections." — Gilles Deleuze and Félix Guattari, Anti-Oedipus

HARNESS -- HONEST EJECTION DETECTION -- tests/pandemonium-tests.py

The bench phases checked guard.proc.poll() -- the userspace process alive -- never the scx registration, so a watchdog ejection that left the process running read as still-PANDEMONIUM while the kernel was on EEVDF. scheduler_active() reads is_scx_active() + scx_scheduler_name(), and sched_ejected(guard) folds it into every survival check: an ejected scheduler reports CRASHED, never a lie.

HARNESS -- ONE EXIT GUARD (SHARED) -- pandemonium_common.py, tests/prism.py, tests/pandemonium-tests.py

prism's Ctrl+C cleanup becomes the suite's single exit guard, lifted into pandemonium_common: eject_scheduler (systemctl stop, then force-kill by exact comm for a bare loader systemd does not own) + install_exit_guard (SIGINT/SIGTERM/atexit -> registered cleanups, eject, re-online every offlined CPU) + register_exit_cleanup. prism and pandemonium-tests both install it; the hand-rolled _cleanup_on_exit / _fatal_signal_eject is deleted, no other signal/atexit handler remains.
The bug it closes: pandemonium-tests ejected only the schedulers it spawned via a process-group SIGINT, which missed the systemd pandemonium service and left the box stuck on a CPU subset after Ctrl+C. The canonical eject plus the unconditional CPU re-online fix both.

HARNESS -- MULTI-WORKLOAD --dev, EXPLICIT --trace ONLY, PROBE CAPTURE -- tests/prism.py, src/cli/probe.rs, tests/pandemonium-tests.py

prism --dev takes one or more workloads (prism --dev ipc fork-thread runs both in sequence). --trace is respected only when passed: a bare --dev stays pre-elevation and captures nothing, and the trace-capable allow-list that auto-captured is removed -- one flag, one behaviour.
The latency probe sets a distinct comm pand-probe via prctl so montauk can target just the probe; the long-run and mixed scale phases record it gated on --trace -- the interactive-stall tail lands in .events for montauk_analyze, the event the scale numbers summarize but could not show before.

HARNESS -- COMPREHENSIVE prism --dev ipc -- tests/pandemonium-tests.py

prism --dev ipc was one number per cell (RTT p99). It now carries per-primitive throughput (RT/s, implied from the median round-trip) beside the latency, and under --trace folds montauk's own analysis of the IPC capture -- wake2run dispatch latency, cache locality / cross-CCX, the task that held the core under saturation, dispatch fractality -- per scheduler against the EEVDF baseline. Localized to the --ipc path: prism-scale and the IPC engine are byte-identical.
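
For the check described under HARNESS -- HONEST EJECTION DETECTION, a minimal sketch of asking the kernel rather than the process might look like the following. The helper names match the notes, but the bodies are assumptions, not the harness's actual code. The sketch relies on the sched_ext sysfs files state and root/ops under /sys/kernel/sched_ext; the expected name "pandemonium" and the root parameter (added so the functions can be pointed at a test directory) are also assumptions.

```python
from pathlib import Path

SCX_SYSFS = Path("/sys/kernel/sched_ext")


def is_scx_active(root: Path = SCX_SYSFS) -> bool:
    """True when the kernel reports a sched_ext scheduler as enabled."""
    try:
        return (root / "state").read_text().strip() == "enabled"
    except OSError:
        return False


def scx_scheduler_name(root: Path = SCX_SYSFS) -> str:
    """Name of the registered scheduler, or '' when none is loaded."""
    try:
        return (root / "root" / "ops").read_text().strip()
    except OSError:
        return ""


def scheduler_active(expected: str = "pandemonium",
                     root: Path = SCX_SYSFS) -> bool:
    # A live userspace process is not enough: the kernel must still have
    # the expected scheduler registered.
    return is_scx_active(root) and expected in scx_scheduler_name(root)
```

A survival check built on this reports an ejection even while the loader process is still running, since the answer comes from the kernel's registration state and not from poll().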

wllclngn wants to merge 1 commit into sched-ext:main from wllclngn:update-pandemonium-5.15.0
