
scripts: add scx_smoke_test.py two-phase release-validation harness #3603

Open
rrnewton wants to merge 2 commits into sched-ext:main from rrnewton:feat/scx-smoke-test


@rrnewton rrnewton commented May 26, 2026

Collaborator

Summary

Adds scripts/scx_smoke_test.py, a two-phase release-validation harness
for the scx scheduler crates published to crates.io.

  • Phase 1 — discover. Walks the workspace Cargo.toml for crates
    under scheds/rust/ and scheds/experimental/ (auto-discovery, no
    hard-coded list), cross-checks each against the crates.io API for the
    latest stable version, layers in per-crate metadata (extra runtime
    args, known-fragile classification, packaging-bug fallback versions),
    and emits a JSON manifest (schema_version: 1) that is the unit of
    input/output for Phase 2 and is committable alongside results for
    full provenance.
  • Phase 2 — run. Reads a manifest, installs each crate via
    cargo install --locked --version <ver> <crate> (with optional
    fallback-version retry), sudo-runs the binary for the configured
    duration, classifies as PASS / FAIL / KNOWN_FRAGILE / ERROR, writes
    per-crate stdout/stderr/install logs + SUMMARY.tsv into a
    timestamped output directory. The effective manifest is copied into
    the run directory.

The intent is a release gate: every new crates.io release of scx
(1.1.1, 1.1.2, 1.2.0, ...) can be smoke-tested on a representative
host before announcement, and users can inspect and hand-edit the
manifest (override versions, drop crates, tweak args) between Phase 1
and Phase 2.

Usage

# Default flow: discover from workspace + crates.io, then run.
./scripts/scx_smoke_test.py run

# Inspect first, hand-edit, then run.
./scripts/scx_smoke_test.py discover -o m.json
$EDITOR m.json
./scripts/scx_smoke_test.py run --manifest m.json --duration 300

# Subset / quick smoke.
./scripts/scx_smoke_test.py list --manifest m.json
./scripts/scx_smoke_test.py run --manifest m.json \
    --schedulers "scx_rusty scx_lavd" --duration 60

Discovered crate set (16 as of 1.1.1)

scx_beerland, scx_bpfland, scx_cake, scx_chaos, scx_cosmos
(1.1.2 — 1.1.1 packaging bug fallback), scx_flash, scx_lavd,
scx_layered (default layer spec under scx_smoke_test_data/),
scx_p2dq, scx_pandemonium (5.9.1 — independent versioning),
scx_rustland, scx_rusty, scx_tickless, scx_flow (2.2.5 —
independent), scx_rlfifo (KNOWN_FRAGILE — userspace FIFO demo,
30s soft-pass window), scx_mitosis (published: false — skipped
by Phase 2 unless explicitly whitelisted via --schedulers).

Mechanics

For each scheduler in the (filtered) manifest:

  1. cargo install --locked --version <ver> <crate>; on failure, retry
    the per-crate fallback version (e.g. scx_cosmos 1.1.1 -> 1.1.2)
    unless --no-fallback is passed. Skip install if already present at
    an acceptable version.
  2. Wait for /sys/kernel/sched_ext/state == 'disabled' between runs;
    only one sched_ext scheduler can attach at a time, so the loop is
    strictly sequential.
  3. sudo timeout --signal=TERM --kill-after=10 <dur> <bin> <extra_args>.
  4. Classify:
    • PASS — ran for the full window (rc=124 SIGTERM-on-timeout
      with runtime >= dur-5).
    • FAIL — exited early, non-zero rc not from timeout, or stderr
      matches BPF program load failed|panicked at|FATAL|libbpf:.*error.
    • KNOWN_FRAGILE — scheduler is documented not-for-production and
      a watchdog trip during the window is expected; FAIL outcomes are
      softened to KNOWN_FRAGILE (does not count toward release-blocking
      totals).
    • ERROR — install failed, binary missing, or sched_ext stuck.
  5. A SIGINT/SIGTERM/EXIT handler detaches any leftover scheduler so
    Ctrl+C never leaves the host degraded.
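
The classification rules in step 4 can be sketched as a small pure function. This is an illustration under stated assumptions, not the script's code: the function name and signature are hypothetical, only rc=124 is treated as a timeout exit, and ERROR is left out because it is decided before the binary runs (install failed, binary missing, sched_ext stuck).

```python
import re

# Stderr pattern quoted in the FAIL rule above.
STDERR_FAIL_RE = re.compile(
    r"BPF program load failed|panicked at|FATAL|libbpf:.*error"
)


def classify(rc: int, runtime_s: float, duration_s: float,
             stderr: str, fragile: bool = False) -> str:
    """Map one scheduler run onto PASS / FAIL / KNOWN_FRAGILE."""
    # PASS requires the timeout exit code and a runtime within 5 s of the window.
    ran_full_window = rc == 124 and runtime_s >= duration_s - 5
    bad_stderr = STDERR_FAIL_RE.search(stderr) is not None
    if ran_full_window and not bad_stderr:
        return "PASS"
    # Any failure of a known-fragile crate is softened rather than counted.
    return "KNOWN_FRAGILE" if fragile else "FAIL"
```

For example, `classify(1, 34, 30, "Error: EXIT: runnable task stall", fragile=True)` yields `KNOWN_FRAGILE`, while the same outcome for a non-fragile crate yields `FAIL`.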

Pre-flight requirements

  • Kernel built with CONFIG_SCHED_CLASS_EXT=y (>=6.12); script checks
    /sys/kernel/sched_ext/state and refuses to start if missing or
    already active.
  • Passwordless sudo (schedulers must run as root).
  • cargo on PATH.
  • Python 3.10+. Standard library only — no requests dependency.
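
A rough sketch of the pre-flight checks, assuming they are collected into a list of human-readable problems. `preflight_problems` and its parameters are hypothetical names; the passwordless-sudo probe is omitted here because it would have to shell out to `sudo`.

```python
import shutil
from pathlib import Path

STATE_PATH = Path("/sys/kernel/sched_ext/state")


def preflight_problems(state_path: Path = STATE_PATH,
                       cargo: str = "cargo") -> list[str]:
    """Return a list of reasons the harness should refuse to start."""
    problems = []
    if not state_path.exists():
        # No sysfs node: the kernel was not built with sched_ext support.
        problems.append(f"{state_path} missing: kernel lacks sched_ext support")
    else:
        state = state_path.read_text().strip()
        if state != "disabled":
            # Another scheduler is attached (or attaching); do not interfere.
            problems.append(f"sched_ext state is {state!r}, expected 'disabled'")
    if shutil.which(cargo) is None:
        problems.append(f"{cargo} not found on PATH")
    return problems
```

An empty list means the host is ready; anything else can be printed verbatim before exiting.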

Manifest schema (v1, abbreviated)

{
  "schema_version": 1,
  "generated": "2026-05-26T20:37:16+00:00",
  "scx_root": "/path/to/scx",
  "default_duration_s": 300,
  "default_fragile_duration_s": 30,
  "crates": [
    {
      "name": "scx_lavd",
      "version": "1.1.1",
      "type": "scheduler",          // or "fragile"
      "published": true,            // false => skipped unless whitelisted
      "fallback_version": null,     // or e.g. "1.1.2"
      "extra_args": [],             // e.g. ["file:layered_default.json"]
      "fragile_reason": null,
      "fragile_duration_s": null,
      "source_path": "scheds/rust/scx_lavd",
      "notes": null
    }
    // ... 15 more
  ]
}

Exit code

0 if every non-KNOWN_FRAGILE scheduler PASSed, 1 otherwise.
KNOWN_FRAGILE is opt-in soft-pass and never blocks the release gate.
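
The exit-code rule reduces to a few lines; `gate_exit_code` is a hypothetical name used only for illustration.

```python
def gate_exit_code(statuses: list[str]) -> int:
    """Return 0 when every non-KNOWN_FRAGILE status is PASS, else 1."""
    blocking = [s for s in statuses if s != "KNOWN_FRAGILE"]
    return 0 if all(s == "PASS" for s in blocking) else 1
```

Note that ERROR blocks the gate just like FAIL: only KNOWN_FRAGILE is exempt.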

@rrnewton

Collaborator Author

scx 1.1.1 reference smoke-test run (2026-05-21)

Full output from ./scripts/scx_smoke_test.sh --version 1.1.1 --duration 300
on a Meta devserver — kernel 6.16.1, 176 CPUs (88 physical + SMT),
CONFIG_SCHED_CLASS_EXT=y. Wall-clock ~45 min; per-scheduler installs
were < 2 min each and each run is a strict 5-min window.

Headline: 8 PASS, 3 non-PASS (2 FAIL, 1 ERROR) out of 11.

Of the three failures, exactly one is a real 1.1.1 release defect
(scx_cosmos); the other two are smoke-test wiring limitations.

Results table

| Scheduler | Result | rc | Runtime | Notes |
|---|---|---|---|---|
| scx_rusty | PASS | 124 | 300 s | clean register, runs, clean unregister |
| scx_lavd | PASS | 124 | 301 s | clean |
| scx_bpfland | PASS | 124 | 301 s | clean |
| scx_layered | FAIL | 1 | 0 s | smoke-test bug: needs --specs JSON (see §1) |
| scx_rustland | PASS | 124 | 300 s | clean |
| scx_rlfifo | FAIL | 1 | 34 s | runnable-task watchdog (known demo limitation, §2) |
| scx_p2dq | PASS | 124 | 301 s | clean |
| scx_tickless | PASS | 124 | 301 s | clean |
| scx_chaos | PASS | 124 | 301 s | clean |
| scx_flash | PASS | 124 | 301 s | clean |
| scx_cosmos | ERROR | | | cargo install fails: packaging bug in 1.1.1 (§3) |

(Crates not on crates.io at 1.1.1: scx_wd40, scx_mitosis — silently
skipped by the smoke test.)

§1. scx_layered FAIL — usage, not regression

scx_layered exits immediately with Error: No layer spec when invoked
without arguments. From scx_layered --help:

scx_layered allows classifying tasks into multiple layers and applying
different scheduling policies to them. The configuration is specified
in json…

This is by design — scx_layered cannot run without a layer
specification file (typically passed via --specs <path/to/layers.json>).
The smoke test currently runs every binary bare. Fix options:

  • A (recommended): ship a minimal default spec at
    scripts/scx_smoke_test/layered_default.json (single catch-all layer)
    and pass it when crate name == scx_layered.
  • B: mark scx_layered as requires_config and skip it
    (smoke test would then only verify install + --help).

Either is a follow-up to the script itself; it does not indicate a
1.1.1 release problem.

§2. scx_rlfifo FAIL — known-fragile demo scheduler

scx_rlfifo ran for 34 s before exiting with:

Error: EXIT: runnable task stall (watchdog failed to check in for 5.001s)

The scheduler itself prints this warning in its --help output:

WARNING: The purpose of scx_rlfifo is to provide a simple scheduler
implementation based on scx_rustland_core, and it is not intended for
use in production environments. … Please do not open GitHub issues in
the event of poor performance, or scheduler eviction due to a
runnable task timeout.

scx_rlfifo is a user-space, FIFO, demo scheduler whose stated purpose
is pedagogical. On a 176-CPU production-load host, hitting the 5-second
runnable-task watchdog is the documented expected failure mode. Not a
1.1.1 regression — the same watchdog trip would happen in 1.1.0 and
earlier. The smoke test should either skip scx_rlfifo or quarantine
it as known_fragile with a shorter expected runtime.

§3. scx_cosmos 1.1.1 — REAL packaging bug

cargo install --locked --version 1.1.1 scx_cosmos fails during the
build.rs step:

thread 'main' panicked at scx_cosmos-1.1.1/build.rs:13:10:
  called `Result::unwrap()` on an `Err` value: failed to build `../../../lib/pmu.bpf.c`

  Caused by:
    clang: error: no such file or directory: '../../../lib/pmu.bpf.c'
    clang: error: no input files

Root cause: scx_cosmos's build.rs references its BPF source via
the relative path ../../../lib/pmu.bpf.c. That path resolves correctly
when building in-tree inside the scx git checkout (where the crate sits
at scheds/rust/scx_cosmos/ and lib/pmu.bpf.c lives at the repo
root). When cargo install extracts the crate to a flat directory under
/tmp/cargo-install*/, the parent directories do not exist and clang
fails to find pmu.bpf.c. The other ten schedulers vendor or copy their
BPF sources correctly via cargo's include directives.

Impact: scx_cosmos 1.1.1 is uninstallable from cargo on any host.

Fix already on crates.io: cargo install --locked --version 1.1.2 scx_cosmos installs cleanly and the resulting binary runs for the full
window — verified by hand (rc=124 after 30 s, clean register +
unregister, no panic). The 1.1.2 release evidently includes the
packaging fix.

Recommendation: consider yanking scx_cosmos 1.1.1 from crates.io
so that cargo install scx_cosmos (no --version pin) doesn't re-pick
the broken release. At minimum, document the 1.1.2 hotfix in the 1.1.1
release notes.
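
One way to catch this class of bug before publishing is to scan a crate's build.rs for string literals that climb out of the crate directory. The sketch below is a hypothetical lint, not part of the harness; it only inspects text, and the sample inputs in the usage note are illustrative rather than the real scx_cosmos build.rs.

```python
import re

# String literals that start with one or more "../" components.
PARENT_PATH_RE = re.compile(r'"((?:\.\./)+[^"]*)"')


def escaping_paths(build_rs_source: str) -> list[str]:
    """Return literals in build.rs that point outside the crate directory."""
    return PARENT_PATH_RE.findall(build_rs_source)
```

For instance, `escaping_paths('.source("../../../lib/pmu.bpf.c")')` returns `["../../../lib/pmu.bpf.c"]`, which is exactly the kind of path that cannot resolve once cargo unpacks the crate on its own.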

Reproducibility

git clone https://github.com/sched-ext/scx.git
cd scx
./scripts/scx_smoke_test.sh --version 1.1.1 --duration 300
# wall-clock: ~45 min (most installs <2 min; runs are 5 min each)
# logs land in ./scx-smoke-1.1.1-<stamp>/

Per-scheduler artifacts captured in the output dir:

  • <crate>.install.log — cargo install output
  • <crate>.stderr.log — scheduler stderr (where the errors live)
  • <crate>.stdout.log — scheduler stdout (usually empty)
  • SUMMARY.tsv — one row per scheduler

Suggested upstream follow-ups

  1. Open issue / fix in scheds/rust/scx_cosmos/build.rs: use
    CARGO_MANIFEST_DIR-relative paths or include = [...] in
    Cargo.toml, or yank 1.1.1.
  2. Run this smoke test on every future cargo release as a release gate.

@rrnewton force-pushed the feat/scx-smoke-test branch from 7e84cc3 to 56d5b8a (May 26, 2026 16:37)
@rrnewton

Collaborator Author

Update — improved script, second smoke run on v1.1.1 (2026-05-26)

Per the follow-ups suggested in the original PR body, force-pushed an
amended commit (7e84cc3c → 56d5b8a9) that adds:

| Improvement | Hook |
|---|---|
| Bundled minimal layer spec for scx_layered | scripts/scx_smoke_test_data/layered_default.json (2-layer catch-all using file: prefix per scx_layered --help) |
| Per-crate VERSION_FALLBACK map | scx_cosmos 1.1.1 → 1.1.2 auto-retry when primary install fails (closes the cosmos 1.1.1 build.rs packaging bug end-to-end without operator intervention) |
| KNOWN_FRAGILE classification | scx_rlfifo runs for reduced FRAGILE_DURATION=30s; a watchdog trip soft-passes with the documented reason rather than counting as FAIL |
| SIGINT/SIGTERM/EXIT cleanup trap | Tracks the live scheduler binary path; on cleanup sudo pkill -TERM then -KILL, then waits up to 10s for /sys/kernel/sched_ext/state == disabled. Ctrl+C now leaves the host clean. |
| Colorized terminal output | tput-detected, --no-color / NO_COLOR=1 honored, post-processed onto column -t-aligned text so ANSI bytes don't break column widths |
| installed_version in SUMMARY.tsv | Surfaces fallback activations + drift |
| Per-crate EXTRA_ARGS map | Generalizable hook for the layered-spec pattern |

Single commit, +529 total (script +245 net, layered_default.json
+26 new file). Updated --help, classification block, and
release-notes excerpt in the commit message reflect the new behavior.
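
The cleanup behavior (wait up to 10 s for sched_ext to report `disabled`, and make sure cleanup also runs on SIGINT/SIGTERM) could look roughly like this in Python. Function names and the polling interval are assumptions; the `pkill` step is left to the caller-supplied `cleanup` callable.

```python
import atexit
import signal
import sys
import time
from pathlib import Path


def wait_until_disabled(state_path: Path, timeout_s: float = 10.0,
                        poll_s: float = 0.2) -> bool:
    """Poll the sched_ext state file until it reads 'disabled' or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if state_path.read_text().strip() == "disabled":
                return True
        except OSError:
            pass  # treat an unreadable state file as "not yet disabled"
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def install_cleanup(cleanup) -> None:
    """Run `cleanup` on normal exit and on SIGINT/SIGTERM."""
    atexit.register(cleanup)
    for sig in (signal.SIGINT, signal.SIGTERM):
        # sys.exit raises SystemExit, which lets atexit handlers run.
        signal.signal(sig, lambda signum, frame: sys.exit(128 + signum))
```

Routing signals through `sys.exit` keeps a single cleanup path instead of duplicating the detach logic in each handler.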

Rerun results — v1.1.1, 5 min each

Host: devbig176.vll5.facebook.com, kernel
6.9.0-0_fbk12_hardened_0_g28f2d09ad102 (NB: same host as the prior
2026-05-21 run but with the kernel rolled back to 6.9 fbk-hardened
since; original ran on 6.16.1-0_fbk2_0_gf40efc324cc8).

scheduler     status         exit_code  runtime_s  installed_version  notes
scx_rusty     PASS           124        300        1.1.1
scx_lavd      FAIL           1          4          1.1.1              exit rc=1 at 4s
scx_bpfland   PASS           124        301        1.1.1
scx_layered   PASS           124        300        1.1.1              ← was FAIL (no spec) in prior run; PASS now with bundled default spec
scx_rustland  PASS           124        300        1.1.1
scx_rlfifo    PASS           124        30         1.1.1              ← KNOWN_FRAGILE classified; no watchdog trip in reduced 30s window
scx_p2dq      FAIL           1          1          1.1.1              exit rc=1 at 1s
scx_tickless  FAIL           1          0          1.1.1              exit rc=1 at 0s
scx_chaos     FAIL           1          1          1.1.1              exit rc=1 at 1s; stderr has panic/load-error pattern
scx_flash     PASS           124        300        1.1.1
scx_cosmos    PASS           124        300        1.1.2              ← VERSION_FALLBACK kicked in: primary 1.1.1 install failed (build.rs), fallback 1.1.2 succeeded

PASS: 7   FAIL: 4   KNOWN_FRAGILE: 0   ERROR: 0

Movement vs the prior 2026-05-21 run

| Scheduler | 2026-05-21 (6.16) | 2026-05-26 (6.9) | Why moved |
|---|---|---|---|
| scx_layered | FAIL (no spec) | PASS | bundled file:layered_default.json |
| scx_rlfifo | FAIL (watchdog) | PASS | KNOWN_FRAGILE reduced window: no trip this run (would have soft-passed regardless) |
| scx_cosmos | ERROR (cargo install) | PASS @ 1.1.2 | VERSION_FALLBACK auto-retry succeeded |
| scx_lavd | PASS | FAIL | BPF E2BIG: 6.9 instruction-buffer limit, kernel-side |
| scx_p2dq | PASS | FAIL | BPF EINVAL: 6.9 verifier rejection, kernel-side |
| scx_tickless | PASS | FAIL | BPF EPERM: 6.9 missing required cap/flag, kernel-side |
| scx_chaos | PASS | FAIL | BPF EINVAL on chaos_dispatch, kernel-side |

The 4 new FAILs are all BPF program load failures from the kernel
verifier/loader, caused by running 1.1.1 binaries (built for a
6.16-era BPF feature surface) on a 6.9 fbk-hardened kernel. The
schedulers' user-space code never enters its main loop. This is not a
regression in the script; it is exactly the kind of "this release
runs on kernel X but not Y" signal a release-gate smoke test should
catch.

Proof of VERSION_FALLBACK working end-to-end

From the runner log:

━━━ [11/11] scx_cosmos ━━━
  installing scx_cosmos 1.1.1  (log: .../scx_cosmos.install.log)
  primary install failed; trying fallback 1.1.2
  running for 300s
  -> PASS (rc=124, 300s)

The cosmos install log confirms the 1.1.1 attempt hit the same
build.rs path bug documented in the original PR comment, and the
script seamlessly retried with 1.1.2.

Follow-ups still open

  • KERNEL_INCOMPATIBLE classification (vs plain FAIL) when stderr
    matches the BPF-load-error pattern AND kernel < per-scheduler
    MIN_KERNEL. Would distinguish "your kernel is too old" from "the
    release is broken" without operator inspection.
  • JSON output alongside TSV for CI consumers.
  • dmesg | grep sched_ext capture per run (where the real
    sched_ext-rejection lines tend to live).
  • GitHub Actions wiring as a nightly release-gate workflow.
  • Eventual Python rewrite for cargo install --list parsing + per-crate
    version detection (see skeptic review). Deferred — the bash form is
    sufficient for "shipping a release gate today."
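
As a sketch of the first follow-up, a FAIL could be refined to KERNEL_INCOMPATIBLE when the stderr shows a load error and the running kernel is older than a per-scheduler minimum. This is a proposal-level illustration: `refine_fail`, the reduced regex, and the minimum-version tuples are assumptions, not existing behavior.

```python
import re

# Load-error subset of the stderr pattern used for FAIL classification.
LOAD_ERROR_RE = re.compile(r"BPF program load failed|libbpf:.*error")


def kernel_tuple(release: str) -> tuple[int, int]:
    """Parse (major, minor) from a `uname -r` style release string."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def refine_fail(status: str, stderr: str, release: str,
                min_kernel: tuple[int, int]) -> str:
    """Turn FAIL into KERNEL_INCOMPATIBLE when the kernel is simply too old."""
    if (status == "FAIL"
            and LOAD_ERROR_RE.search(stderr)
            and kernel_tuple(release) < min_kernel):
        return "KERNEL_INCOMPATIBLE"
    return status
```

Comparing integer tuples matters here: as strings, "6.9" sorts after "6.16", which would invert the check for exactly the two kernels used in these runs.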

rrnewton added a commit to rrnewton/dev-sched-test that referenced this pull request May 26, 2026
Capture from second run of scripts/scx_smoke_test.sh against scx 1.1.1
after applying skeptic-review-driven improvements (PR sched-ext/scx#3603
commit 56d5b8a9). 7 PASS / 4 FAIL / 0 KNOWN_FRAGILE / 0 ERROR.

All 3 issues from the prior 2026-05-21 run resolved (scx_layered with
bundled default spec, scx_cosmos via VERSION_FALLBACK -> 1.1.2,
scx_rlfifo soft-pass via KNOWN_FRAGILE). 4 new FAILs are all BPF
program-load failures from the 6.9 fbk-hardened kernel rejecting
1.1.1-era BPF skeletons - not script regressions.

Reference run: experiments/scx_1_1_1_smoke_test_20260521/ (8/3 on 6.16).
Results posted as PR comment:
sched-ext/scx#3603 (comment)
@hodgesds

Contributor

Doesn't layered have --run-example that could be used so that a spec isn't needed?

@rrnewton force-pushed the feat/scx-smoke-test branch from 56d5b8a to c9da2d9 (May 26, 2026 20:39)
@rrnewton changed the title from "scripts: add scx_smoke_test.sh release-validation harness" to "scripts: add scx_smoke_test.py two-phase release-validation harness" (May 26, 2026)
@rrnewton

Collaborator Author

Force-pushed feat/scx-smoke-test to c9da2d90 — refactored the harness from a single bash script into a two-phase Python tool per the skeptic review.

Two-phase design (scripts/scx_smoke_test.py):

  • Phase 1 — discover. Walks the workspace Cargo.toml for scheduler crates under scheds/rust/ and scheds/experimental/ (auto-discovery, no hard-coded list — picks up scx_beerland, scx_cake, and any future scheduler automatically). Cross-checks each crate against the crates.io API to record the latest stable version actually published. Layers in per-crate metadata: extra runtime args (e.g. scx_layered layer spec), known-fragile classification (e.g. scx_rlfifo), fallback versions (e.g. scx_cosmos 1.1.1 → 1.1.2). Emits a JSON manifest with schema_version: 1 that is the unit of input/output for Phase 2.
  • Phase 2 — run. Reads a manifest, installs each crate via cargo install --locked --version, runs the binary under sudo timeout for the configured duration, classifies as PASS/FAIL/KNOWN_FRAGILE/ERROR, writes per-crate logs + SUMMARY.tsv into a timestamped output directory. The effective manifest is copied into the run directory for full provenance.

CLI:

```
scx_smoke_test.py discover [--out FILE|-] [--version VER] [--scx-root DIR] [--no-network]
scx_smoke_test.py run [--manifest FILE] [--duration SEC] [--out-dir DIR]
[--schedulers "a b c"] [--version VER] [--no-fallback]
scx_smoke_test.py list [--manifest FILE]
```

Typical flows:

```bash
# Default: discover from workspace + crates.io, then run.
./scx_smoke_test.py run

# Inspect first, hand-edit, then run.
./scx_smoke_test.py discover -o m.json
$EDITOR m.json  # drop crates, override versions, tweak args
./scx_smoke_test.py run --manifest m.json --duration 300

# Short smoke (single scheduler).
./scx_smoke_test.py run --manifest m.json --schedulers scx_rusty --duration 30
```

Discovery (16 crates on current workspace):

| crate | version | notes |
|---|---|---|
| scx_beerland, scx_bpfland, scx_cake, scx_chaos, scx_flash, scx_lavd, scx_p2dq, scx_rustland, scx_rusty, scx_tickless | 1.1.1 | standard scheds |
| scx_layered | 1.1.1 | extra_args: [file:layered_default.json] |
| scx_cosmos | 1.1.2 | fallback_version: 1.1.2 (1.1.1 packaging bug) |
| scx_flow | 2.2.5 | independent version stream |
| scx_pandemonium | 5.9.1 | independent version stream |
| scx_rlfifo | 1.1.1 | type: fragile, 30s window |
| scx_mitosis | n/a | published: false; skipped unless whitelisted |

Implementation notes:

  • Standard library only (urllib for crates.io; no requests dependency). Python 3.10+.
  • Manifest discovery is tolerant: a missing scx_mitosis row in crates.io yields published: false rather than aborting the discovery pass.
  • Same host-friendly invariants as the old .sh: sequential runs (only one sched_ext scheduler can attach at a time); pre-flight verifies sched_ext is disabled, passwordless sudo, cargo on PATH; SIGINT/SIGTERM/EXIT handler detaches any leftover scheduler so Ctrl+C never leaves the host degraded.
  • Classification logic mirrors the prior bash version, including KNOWN_FRAGILE soft-pass and BPF panic/load-error regex demotion.

Spot-checked locally: discover produces a 16-crate manifest; run --manifest <m> --schedulers scx_rusty --duration 15 PASSes on the 6.9-fbk-hardened kernel.

The prior .sh was removed as it is fully superseded by the Python harness; data files under scripts/scx_smoke_test_data/ (e.g. layered_default.json) are retained.
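
A minimal sketch of the auto-discovery filter, assuming the workspace `members` list has already been read from Cargo.toml and that each crate is named after its directory (consistent with `source_path` in the manifest, but still an assumption). TOML parsing is deliberately left out, since `tomllib` only exists from Python 3.11 and the harness targets 3.10+.

```python
from pathlib import PurePosixPath

# Directories named in the discovery description.
SCHEDULER_DIRS = ("scheds/rust", "scheds/experimental")


def scheduler_crates(members) -> dict[str, str]:
    """Map crate name -> source path for members directly under the scheduler dirs."""
    found = {}
    for member in members:
        path = PurePosixPath(member)
        if str(path.parent) in SCHEDULER_DIRS:
            found[path.name] = str(path)
    return found
```

Filtering on the parent directory rather than a name list is what lets new schedulers show up in the manifest without touching the script.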

@rrnewton

Collaborator Author

VM mode added + full 6.16.1 VM run

Added --vm (virtme-ng) support to scripts/scx_smoke_test.py so the
harness can boot an alternate kernel and run the full scheduler suite
inside it without rebooting or reprovisioning. The host machine here
runs 6.9, but the validation kernel is 6.16.1.

CLI

scx_smoke_test.py run --vm \
    --kernel /boot/vmlinuz-6.16.1-0_fbk2_0_gf40efc324cc8 \
    --manifest /tmp/manifest.json \
    --duration 60 \
    --out-dir <outdir>

Install runs on the host (cargo is not in the VM image); the host then
re-execs the script with --in-vm under vng --run, sharing the
out-dir via --rwdir. Pre-install errors stay durable across the VM
boundary; in-VM run rows are merged back via SUMMARY.tsv reload after
vng exits.
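
The merge step might be sketched like this, assuming SUMMARY.tsv is tab-separated with a header row using the column names from the earlier run output (`scheduler`, `status`, `exit_code`, `runtime_s`, `installed_version`, `notes`). The rule that host-side ERROR rows win over in-VM rows is an assumption about what "stay durable" means; the function names are hypothetical.

```python
import csv
import io


def read_summary(text: str) -> dict[str, dict]:
    """Parse SUMMARY.tsv text into rows keyed by scheduler name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["scheduler"]: row for row in reader}


def merge_summaries(host_text: str, vm_text: str) -> list[dict]:
    """Overlay in-VM rows on host rows, keeping host-side ERROR rows intact."""
    merged = read_summary(host_text)
    for name, row in read_summary(vm_text).items():
        # A crate that failed to install on the host never ran in the VM.
        if merged.get(name, {}).get("status") != "ERROR":
            merged[name] = row
    return list(merged.values())
```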

Full-suite result (scx 1.1.1, kernel 6.16.1, 60s per scheduler)

| scheduler | status | runtime | notes |
|---|---|---|---|
| scx_beerland | PASS | 60s | |
| scx_bpfland | PASS | 60s | |
| scx_cake | ERROR | - | cargo install failed (build.rs missing ../../../lib/arena.bpf.c when built from crates.io) |
| scx_chaos | PASS | 60s | |
| scx_cosmos | PASS | 60s | installed 1.1.2 |
| scx_flash | PASS | 60s | |
| scx_lavd | PASS | 60s | (regression-tested: fails on host 6.9, passes here) |
| scx_layered | PASS | 60s | |
| scx_p2dq | PASS | 60s | |
| scx_pandemonium | PASS | 60s | installed 5.9.1 |
| scx_rustland | FAIL | 70s | killed (SIGKILL) at 70s; no stderr emitted |
| scx_rusty | PASS | 60s | |
| scx_tickless | PASS | 60s | (regression-tested: fails on host 6.9, passes here) |
| scx_flow | PASS | 60s | installed 2.2.5 |
| scx_rlfifo | KNOWN_FRAGILE | 40s | userspace FIFO demo; runnable-task watchdog under load is documented expected behavior |

Tally: 12 PASS · 1 KNOWN_FRAGILE · 1 FAIL (scx_rustland) · 1 ERROR (scx_cake install)

scx_lavd and scx_tickless (both fail on the host 6.9 kernel) PASS
under the 6.16.1 VM — confirming the kernel-version-rejection theory.

Followups

  • scx_cake install: build.rs references ../../../lib/arena.bpf.c,
    a workspace-relative path that doesn't exist in the cargo-install
    tarball. Pre-existing bug, not specific to this harness.
  • scx_rustland: ran ~10s past --duration 60, then SIGKILL with no
    output. Worth investigating whether it's the same userspace-watchdog
    pattern documented for scx_rlfifo.

Branch: feat/scx-smoke-test @ b7cdc1a3 on rrnewton/scx.

@rrnewton force-pushed the feat/scx-smoke-test branch 2 times, most recently from de03d08 to d1ad12f (May 28, 2026 18:21)
rrnewton added 2 commits June 3, 2026 09:14
Two-phase scx scheduler release-validation harness, replacing the prior
scx_smoke_test.sh single-shot script with a Python tool that separates
manifest discovery from the install+run pipeline so users can inspect or
hand-edit what gets tested.

  Phase 1 (discover)
    - Walks the scx workspace Cargo.toml for scheduler crates under
      scheds/rust/ and scheds/experimental/ (auto-discovers; nothing
      hard-coded). 16 crates as of 1.1.1, including scx_beerland and
      scx_cake which post-date the .sh script's hand-curated list.
    - Cross-checks each crate against the crates.io API to record the
      latest stable version actually published. Crates with no published
      release (e.g. scx_mitosis as of 1.1.1) appear in the manifest with
      "published": false and are skipped by Phase 2 unless explicitly
      whitelisted via --schedulers.
    - Layers in per-crate metadata: extra runtime args (e.g. scx_layered
      --run-example flag), known-fragile classification (e.g.
      scx_rlfifo), and fallback versions (e.g. scx_cosmos 1.1.1 ->
      1.1.2 packaging-bug workaround).
    - Emits a JSON manifest (schema_version: 1). The manifest is the
      unit of input/output for Phase 2 and is committable alongside
      results for full provenance.

  Phase 2 (run)
    - Reads a manifest, installs each crate via cargo install --locked
      --version (with fallback retry), runs the binary under sudo
      timeout for the per-crate duration, classifies as
      PASS / FAIL / KNOWN_FRAGILE / ERROR, and writes per-crate
      stdout/stderr/install logs plus SUMMARY.tsv into a timestamped
      output dir. Copies the effective manifest into the run dir for
      provenance.
    - Same host-friendly invariants as the .sh: sequential runs;
      pre-flight verifies sched_ext is 'disabled', passwordless sudo
      available, cargo present; SIGINT/SIGTERM/EXIT trap detaches any
      leftover scheduler so Ctrl+C never leaves the host degraded.

  CLI
    scx_smoke_test.py discover [--out FILE|-] [--version VER]
                               [--scx-root DIR] [--no-network]
    scx_smoke_test.py run      [--manifest FILE] [--duration SEC]
                               [--out-dir DIR] [--schedulers "a b c"]
                               [--version VER]  [--no-fallback]
    scx_smoke_test.py list     [--manifest FILE]

  Typical flows
    # default flow: discover from workspace + crates.io, then run.
    ./scx_smoke_test.py run

    # inspect first, hand-edit, then run.
    ./scx_smoke_test.py discover -o m.json
    $EDITOR m.json   # drop crates, override versions, tweak args
    ./scx_smoke_test.py run --manifest m.json --duration 300

    # short smoke (single scheduler).
    ./scx_smoke_test.py run --manifest m.json --schedulers scx_rusty \
        --duration 30 --out-dir /tmp/smoke

  Implementation notes
    - Standard library only (urllib for crates.io; no requests
      dependency). Works on Python 3.10+.
    - Manifest discovery is tolerant: a missing scx_mitosis row in
      crates.io produces published=false rather than aborting.
    - Per-crate fallback only fires when the manifest's primary
      version cargo-install fails; --no-fallback disables it.
    - Classification mirrors the .sh: timeout-rc 124/137/143 with
      runtime >= dur-5 = PASS; non-zero exit or early exit = FAIL;
      stderr panic / BPF-load-error regex demotes PASS to FAIL;
      KNOWN_FRAGILE crates soft-pass any FAIL.
…el testing

Adds --vm, --kernel, --vng-arg, --in-vm, --bin-dir flags so that the smoke
test can boot a vng VM with a user-specified kernel image and run the full
scheduler suite inside it. Install still happens on the host (cargo is not
available inside the VM); the host re-execs this script with --in-vm under
vng --run, sharing the out-dir via --rwdir. Pre-install errors written by
the host are preserved across the VM boundary; in-VM run rows are merged
back via SUMMARY.tsv reload after vng exits.

Usage:
  scx_smoke_test.py run --vm --kernel /boot/vmlinuz-6.16.x ...

This lets PR validators boot a kernel that differs from the host (e.g. the
6.16 kernel needed by scx_lavd / scx_tickless when the host is 6.9) without
reboot or hardware reprovisioning.
@rrnewton force-pushed the feat/scx-smoke-test branch from d1ad12f to 612b36b (June 3, 2026 16:14)

2 participants