scx_pandemonium v5.15.0 (#3675)
Open
wllclngn wants to merge 1 commit into
"Everywhere it is machines—real ones, not figurative ones: machines driving
other machines, machines being driven by other machines, with all the
necessary couplings and connections."
— Gilles Deleuze and Félix Guattari, Anti-Oedipus
v5.15.0 is rebuilt from the v5.12.0 base -- the last clean floor -- and what makes it v5.15.0 is
that the latent bug in that floor was finally found and resolved. v5.12.0's per-CPU kthread and
interactive verdict was scattered across six uncoordinated sites that overlapped on the easy
on-home wakeup and left the cross-CPU strand with no net, so the base stranded pinned kthreads
(ksoftirqd/N, kworker/N) worse than EEVDF. Collapsing all six into ONE judgement -- set once in
runnable(), read everywhere else -- clears it: montauk kstrand at 2C, the worst case, reports the
strands gone. Everything else is re-added onto the cleaned base one gated, measured piece at a
time so each earns its place against the storm and latency numbers before the next lands.
ONE-JUDGEMENT PER-CPU KTHREAD UNIFICATION below carries the fix; v5.15.0-CHECKLIST.md carries the
full re-add ledger and the original release notes; this records what has landed and validated.
RESET TO THE v5.12.0 BASE -- src/
- The scheduler crate (BPF, the Rust loader and intf.h) resets to the v5.12.0 commit, the clean
placer the rest stands on: per-CPU DSQs, one flattened cache_domain, the node_dsq spill, no
CCX/Phi machinery yet. The harness under tests/ is kept current so its fixes survive the reset.
IDLE QUIESCENCE ENVELOPE -- src/bpf/main.bpf.c, src/bpf/intf.h, src/adaptive.rs, src/scheduler.rs
- The v5.13.0 envelope re-adds onto the base's existing CoDel oscillator: a decayed disp^2+v^2
energy reservoir parks the oscillator at codel equilibrium when it settles -- pin the target,
freeze the velocity integrator, stop the per-tick recompute -- until a rescue event or an
equilibrium retune disturbs it. A Schmitt trigger on the energy (park below env_park, release at
2x env_park) with a heartbeat valve gates the park. nr_osc_park lands as a stat with adaptive
park: telemetry.
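The park/release hysteresis can be sketched as a small Python model. This is an illustration of the Schmitt-trigger idea only, not the BPF code: the decay factor, the class and its field names other than env_park are hypothetical, "release at 2x" is read as 2x env_park, and the heartbeat valve and rescue/retune events are left out.

```python
from dataclasses import dataclass


@dataclass
class ParkEnvelope:
    """Toy model of the park/release hysteresis (not the BPF implementation)."""
    env_park: float        # park threshold; the name is from the notes, the value is made up
    decay: float = 0.5     # hypothetical per-tick decay of the energy reservoir
    energy: float = 0.0
    parked: bool = False

    def tick(self, disp: float, vel: float) -> bool:
        # Decayed reservoir fed by displacement^2 + velocity^2.
        self.energy = self.decay * self.energy + disp * disp + vel * vel
        if not self.parked and self.energy < self.env_park:
            self.parked = True       # settled: pin the target, stop recomputing
        elif self.parked and self.energy >= 2.0 * self.env_park:
            self.parked = False      # disturbed past the upper threshold: resume
        return self.parked
```

Between env_park and 2x env_park the state does not change, which is what keeps the oscillator from flapping in and out of park on small disturbances.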
INTERACTIVE-WAITING PREEMPT -- src/bpf/main.bpf.c
- The per-CPU tick preempt gated only on pcpu_enqueue_ns -- the per-CPU DSQ waiter -- and was blind
to interactive_enqueue_ns, the shared node_dsq waiter. An interactive task spilled to node_dsq
under load forced no resident off its core, dispatch never ran, and the task starved behind a
batch hog up to the 167ms net: the 116-195ms interactive-stall tail under mixed and long-run
load. The v5.11->v5.12 boundary split one waiter signal in two and wired only the per-CPU half.
- The fix restores v5.11.0's global interactive-waiting path keyed on the existing
interactive_enqueue_ns, placed before the per-CPU early-return so it is reached: a BATCH resident
held past preempt_thresh while a node_dsq interactive waiter is pending is preempted -- dispatch
runs and STEP 2 drains the waiter (sojourn_gate_pass already keys on the same signal). BATCH
residents only; interactive residents keep their protection.
- Measured: long-run worst 153ms -> 2.6ms on the BPF arm at every width (2/4/8/16C), mixed worst
167ms -> 4ms, no preemption storm at 16 threads.
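The ordering the fix depends on (global interactive-waiting check first, per-CPU early-return second) can be modelled in a few lines of Python. This is a sketch under assumptions, not the tick handler: the function name, the Tier enum, the "0 means no waiter" convention and the final per-CPU stand-in are all hypothetical.

```python
from enum import Enum


class Tier(Enum):
    BATCH = 0
    INTERACTIVE = 1


def tick_should_preempt(resident_tier, resident_ran_ns, preempt_thresh_ns,
                        interactive_enqueue_ns, pcpu_enqueue_ns):
    """Toy model of the tick decision order; 0 means "no waiter" (assumed)."""
    over_thresh = resident_ran_ns > preempt_thresh_ns
    # Global interactive-waiting path: checked BEFORE the per-CPU early
    # return, so a node_dsq waiter is seen even with an empty per-CPU DSQ.
    if interactive_enqueue_ns and resident_tier is Tier.BATCH and over_thresh:
        return True
    # Per-CPU early return: nothing queued on this CPU's own DSQ.
    if not pcpu_enqueue_ns:
        return False
    # Stand-in for the existing per-CPU waiter check (details not in the notes).
    return over_thresh
```

With the first check moved below the early return, the BATCH case with only a node_dsq waiter would return False, which is the starvation described above.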
V7.1 REENQUEUE GUARD REMOVED -- src/bpf/main.bpf.c
- The SCX_ENQ_REENQ short-circuit added for the 7.1 cpu_release rework routed a reenqueued task
straight to node_dsq with no kick, on the theory it broke a 7.1.1 reenqueue<->kick livelock. It
moved nothing measurable on 7.0.13 and is removed: pandemonium_enqueue falls from the wakeup
classify straight into Tier 1 as on the v5.12.0 base. cpu_release reenqueues unconditionally.
SOJOURN UNIFICATION -- src/bpf/main.bpf.c, src/bpf/intf.h, src/tuning.rs, src/adaptive.rs
- codel_thresh_ns renames sojourn_thresh_ns (the batch-DSQ rescue deadline, a Rust-set
knob); the pcpu_sojourn_thresh local alias collapses into it.
- codel_starve_ns renames starvation_rescue_ns (the ~100ms last-resort net, KEPT distinct
from the ~6ms target band).
- codel_seed_ns renames codel_target_equilibrium_ns (the R_eff x tau equilibrium the
oscillator parks at; codel_eq_ns stays the feeding knob, the clamp preserved).
- sojourn_stamp_overflow[] as struct { u64 inter, batch; } folds interactive_enqueue_ns +
batch_enqueue_ns onto one cacheline per domain (per-tier x per-domain preserved, zero cells
lost); sojourn_stamp_pcpu[] renames pcpu_enqueue_ns (stays separate, 64-byte padded).
- DELETED overflow_sojourn_rescue_ns. The storm fix had the CPU-0 tick set it = codel_target_ns
every beat while apply_tau_scaling still seeded it = the equilibrium -- a two-writer conflict
that disagreed whenever the oscillator drove the target off equilibrium. Resolved to the LIVE
target (every other sojourn deadline in dispatch keys off codel_target_ns), dropped the
vestigial equilibrium writer; the reads read codel_target_ns directly.
- Effect: behavior-preserving. montauk sojourn-pressure A/B -- 16C p99 1018us against the
v5.14.0 baseline ~2994us, held and under, 4/4 widths survived. The redundancy was real and free.
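The stamp layout can be checked with a ctypes mock-up. This is only a sketch of the sizes involved: the class names and the enqueue_ns field are hypothetical, a 64-byte cacheline is assumed, and the real definitions live in the BPF sources.

```python
import ctypes

CACHELINE = 64  # assumed cacheline size in bytes


class SojournStampOverflow(ctypes.Structure):
    """Per-domain overflow stamps: both tiers in one 16-byte cell."""
    _fields_ = [("inter", ctypes.c_uint64),
                ("batch", ctypes.c_uint64)]


class SojournStampPcpu(ctypes.Structure):
    """Per-CPU stamp, padded so each CPU owns a full cacheline."""
    _fields_ = [("enqueue_ns", ctypes.c_uint64),            # hypothetical field name
                ("_pad", ctypes.c_uint8 * (CACHELINE - 8))]


print(ctypes.sizeof(SojournStampOverflow), ctypes.sizeof(SojournStampPcpu))
```

The folded cell is 16 bytes, so one domain's inter and batch stamps fit in a single cacheline, while the per-CPU stamp keeps its own 64 bytes to avoid false sharing between CPUs.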
ONE-JUDGEMENT PER-CPU KTHREAD UNIFICATION -- src/bpf/main.bpf.c
- The per-CPU kthread (ksoftirqd/N, kworker/N) latency verdict was scattered across six
uncoordinated sites -- the procdb tier vote, the runnable behavioral classify, a three-branch
override cascade, an enqueue fast path, the enqueue affinity skip and the tick preempt -- each
added to patch a different strand, overlapping on the on-home wakeup and leaving the cross-CPU
wakeup with no net. The result stranded pinned kthreads worse than EEVDF: pcpu-burst had 5 strands
of up to 58ms, sojourn-pressure 2 of up to 56ms, and fork-thread 26 of up to 50ms, where EEVDF
stranded none.
- The collapse: runnable() is the sole decider. The kthread flags (PF_KTHREAD, nr_cpus_allowed,
static_prio) are tested once and folded into tctx->pinned_service alongside the tier; enqueue()
and tick() become pure reads of that bit. No struct growth -- the bit reuses existing pad.
- enqueue() routes a pinned-service task straight to its home CPU's per-CPU DSQ (drained first by
dispatch STEP 0, cache-hot) and PREEMPT-kicks it. A pinned task has no placement choice, so this
covers BOTH the on-home and the cross-CPU wakeup -- the old on-home-only fast path left the
cross-CPU strand falling to the ~167ms net.
- tick() restores the pinned-waiter backstop keyed on the per-CPU sojourn signal, placed before the
LAT_CRITICAL tight-band exemption so a welded ksoftirqd/kworker behind a LAT_CRITICAL resident is
rescued -- steal cannot move a pinned task. The separate pcpu_pinned_strand_ns[] stamp array is
gone; the one bit carries it.
- Measured: montauk kstrand at 2C, the worst case. ipc, longrun, mixed-ADAPTIVE and longrun-ADAPTIVE
report no strands over threshold; mixed-BPF shows one 8.1ms kworker/1:1 held 99% by montauk itself
-- the tracer, not the scheduler -- far under the net. dispatch-stall attributes the residual
floored wakes to python3 saturation at 2 CPUs, 0% priority inversion and 0% FIFO violation.
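The "decide once, read everywhere" shape can be sketched in Python. This is a model under assumptions, not the BPF code: the Task class, PRIO_CUTOFF and the exact way the three flags combine are hypothetical (the notes say only that PF_KTHREAD, nr_cpus_allowed and static_prio are tested once in runnable()).

```python
from dataclasses import dataclass

PF_KTHREAD = 0x00200000   # Linux task flag: kernel thread
PRIO_CUTOFF = 120         # hypothetical static_prio cutoff, for illustration only


@dataclass
class Task:
    flags: int
    nr_cpus_allowed: int
    static_prio: int
    home_cpu: int
    pinned_service: bool = False   # the one bit; written only by runnable()


def runnable(t: Task) -> None:
    # Sole decider: the kthread flags are tested here and nowhere else.
    t.pinned_service = (bool(t.flags & PF_KTHREAD)
                        and t.nr_cpus_allowed == 1
                        and t.static_prio <= PRIO_CUTOFF)


def enqueue(t: Task, pcpu_dsq: dict, node_dsq: list, kicks: list) -> None:
    # Pure read of the bit: no re-classification on the enqueue path.
    if t.pinned_service:
        pcpu_dsq.setdefault(t.home_cpu, []).append(t)
        kicks.append(t.home_cpu)   # stands in for the PREEMPT kick
    else:
        node_dsq.append(t)         # simplified: every other task spills to node_dsq
```

Because a pinned task always lands on its home CPU's queue regardless of where the wakeup came from, the same branch covers both the on-home and the cross-CPU case.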
HARNESS -- HONEST EJECTION DETECTION -- tests/pandemonium-tests.py
- The bench phases checked guard.proc.poll() -- the userspace process alive -- never the scx
registration, so a watchdog ejection that left the process running read as still-PANDEMONIUM
while the kernel was on EEVDF. scheduler_active() reads is_scx_active() + scx_scheduler_name(),
and sched_ejected(guard) folds it into every survival check: an ejected scheduler reports
CRASHED, never a lie.
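A minimal sketch of the registration check, assuming the standard sched_ext sysfs files (state and root/ops under /sys/kernel/sched_ext). The function names follow the notes, but the bodies, the root parameter, the expected-name prefix match and the boolean proc_alive argument (the harness passes a guard object) are assumptions for illustration.

```python
from pathlib import Path

SCX_SYSFS = Path("/sys/kernel/sched_ext")


def is_scx_active(root: Path = SCX_SYSFS) -> bool:
    # "enabled" means a sched_ext scheduler is currently registered.
    try:
        return (root / "state").read_text().strip() == "enabled"
    except OSError:
        return False


def scx_scheduler_name(root: Path = SCX_SYSFS) -> str:
    try:
        return (root / "root" / "ops").read_text().strip()
    except OSError:
        return ""


def scheduler_active(expected: str = "pandemonium", root: Path = SCX_SYSFS) -> bool:
    return is_scx_active(root) and scx_scheduler_name(root).startswith(expected)


def sched_ejected(proc_alive: bool, expected: str = "pandemonium",
                  root: Path = SCX_SYSFS) -> bool:
    # Process still running but the kernel no longer runs our scheduler.
    return proc_alive and not scheduler_active(expected, root)
```

The point of the split is that a live process and a live registration are separate facts; only the second says which scheduler the kernel is actually using.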
HARNESS -- ONE EXIT GUARD (SHARED) -- pandemonium_common.py, tests/prism.py, tests/pandemonium-tests.py
- prism's Ctrl+C cleanup becomes the suite's single exit guard, lifted into pandemonium_common:
eject_scheduler (systemctl stop, then force-kill by exact comm for a bare loader systemd does not
own) + install_exit_guard (SIGINT/SIGTERM/atexit -> registered cleanups, eject, re-online every
offlined CPU) + register_exit_cleanup. prism and pandemonium-tests both install it; the
hand-rolled _cleanup_on_exit / _fatal_signal_eject are deleted, and no other signal/atexit handler
remains.
- The bug it closes: pandemonium-tests ejected only the schedulers it spawned via a process-group
SIGINT, which missed the systemd pandemonium service and left the box stuck on a CPU subset after
Ctrl+C. The canonical eject plus the unconditional CPU re-online fix both.
HARNESS -- MULTI-WORKLOAD --dev, EXPLICIT --trace ONLY, PROBE CAPTURE -- tests/prism.py, src/cli/probe.rs, tests/pandemonium-tests.py
- prism --dev takes one or more workloads (prism --dev ipc fork-thread runs both in sequence).
--trace is respected only when passed: a bare --dev stays pre-elevation and captures nothing, and
the trace-capable allow-list that auto-captured is removed -- one flag, one behaviour.
- The latency probe sets a distinct comm pand-probe via prctl so montauk can target just the
probe; the long-run and mixed scale phases record it gated on --trace -- the interactive-stall
tail lands in .events for montauk_analyze, the event the scale numbers summarize but could not
show before.
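For reference, setting a thread's comm through prctl looks like this from Python on Linux. The probe itself lives in src/cli/probe.rs, so this is only an illustration of the PR_SET_NAME call; set_comm is a hypothetical helper, not a harness function.

```python
import ctypes

PR_SET_NAME = 15  # from <linux/prctl.h>


def set_comm(name: str) -> None:
    """Set the calling thread's comm (Linux only; the kernel keeps 15 bytes)."""
    libc = ctypes.CDLL(None, use_errno=True)
    buf = ctypes.create_string_buffer(name.encode()[:15])
    if libc.prctl(PR_SET_NAME, buf, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_NAME) failed")
```

Calling set_comm("pand-probe") makes the thread show up under that name in /proc/<pid>/comm, which is what lets a tracer filter on the probe alone.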
HARNESS -- COMPREHENSIVE prism --dev ipc -- tests/pandemonium-tests.py
- prism --dev ipc was one number per cell (RTT p99). It now carries per-primitive throughput
(RT/s, implied from the median round-trip) beside the latency, and under --trace folds montauk's
own analysis of the IPC capture -- wake2run dispatch latency, cache locality / cross-CCX, the
task that held the core under saturation, dispatch fractality -- per scheduler against the EEVDF
baseline. Localized to the --ipc path: prism-scale and the IPC engine are byte-identical.
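The throughput figure is simple arithmetic on the latency samples. A minimal sketch, assuming round-trip times in microseconds and a nearest-rank p99; the function names are hypothetical and the harness may compute its percentile differently.

```python
import math
from statistics import median


def implied_rt_per_sec(rtt_us):
    """Round trips per second implied by the median round-trip time."""
    return 1_000_000.0 / median(rtt_us)


def p99(samples):
    """Nearest-rank 99th percentile."""
    s = sorted(samples)
    return s[max(0, math.ceil(0.99 * len(s)) - 1)]
```

For example, a median round trip of 20us implies 50,000 RT/s. Deriving throughput from the median rather than from a separate timed run keeps the two columns consistent with each other for the same capture.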