scheds/experimental/scx_flow: v2.2.4 — priority-aware rt_sensitive and relaxed preempt threshold #3561

Merged
hodgesds merged 7 commits into sched-ext:main from galpt:scx_flow_v2_2_scalability on May 8, 2026

@galpt galpt commented May 8, 2026

Contributor

Summary

This is an incremental update on top of the scx_flow v2.2.0 scheduler that
landed in PR #3525. It carries two changes since that submission, both aimed
at improving worst-case wakeup latency without adding code paths that bypass
the existing lane analysis or containment machinery:

  • Priority-aware rt_sensitive: the rt_sensitive_ready condition now
    checks p->prio so that kernel RT-class tasks (SCHED_FIFO / SCHED_RR at
    any priority) qualify for WAKE_PROFILE_RT_SENSITIVE once they meet the
    80us refill floor, without also having to be pinned to a single CPU.
    Fixed in v2.2.4: the refill threshold is ANDed with the genuine RT
    classification (pinned || prio < 100), so well-refilled SCHED_OTHER
    tasks are no longer falsely classified as RT-sensitive.
  • Relaxed preempt_ready refill check — the original preempt_ready
    condition required both last_refill_ns >= preempt_refill_min_ns (200us)
    and budget_ns >= preempt_budget_min_ns (150us). For short-interval
    wakeups such as cyclictest at 200us the per-wakeup refill is only
    sleep_ns / 100 = 2us, far below the 200us threshold. Dropped the
    refill gate so that accumulated positive budget alone qualifies a task for
    the preempt path.
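As a rough worked example of the arithmetic in the second bullet, the sketch below computes the per-wakeup refill for a given sleep and compares it with the two thresholds quoted above. The helper and macro names are hypothetical and not taken from main.bpf.c; only the sleep_ns / 100 ratio and the 200us / 80us figures come from this description.

```c
#include <stdint.h>
#include <stdbool.h>

/* Values quoted in the PR description; the macro names are illustrative. */
#define PREEMPT_REFILL_MIN_NS    200000LL /* 200us */
#define INTERACTIVE_FLOOR_MIN_NS  80000LL /* 80us */

/* Hypothetical helper: the per-wakeup refill is described as sleep_ns / 100. */
static int64_t refill_from_sleep_ns(int64_t sleep_ns)
{
	return sleep_ns / 100;
}

/* True when a single wakeup's refill alone would clear the given gate. */
static bool refill_passes(int64_t sleep_ns, int64_t min_ns)
{
	return refill_from_sleep_ns(sleep_ns) >= min_ns;
}
```

Under these assumptions a 200us sleep (200000 ns) refills 2000 ns, so it clears neither gate. A task would have to sleep about 20 ms per wakeup to reach the 200us refill threshold on its own, which is why a refill-based gate excludes short-period workloads.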

Changes

The BPF changes are confined to one function in main.bpf.c.

1. Priority-aware rt_sensitive (v2.2.3, fixed in v2.2.4)

// v2.2.0 baseline — pinned + refill >= 80us
p->nr_cpus_allowed == 1 &&
tctx->last_refill_ns > 0 &&
tctx->last_refill_ns >= (s64)FLOW_INTERACTIVE_FLOOR_MIN_NS

// v2.2.3 (buggy) — last_refill_ns >= 80us was a standalone OR
tctx->last_refill_ns > 0 &&
(p->nr_cpus_allowed == 1 ||
 p->prio < 100 ||
 tctx->last_refill_ns >= (s64)FLOW_INTERACTIVE_FLOOR_MIN_NS)

// v2.2.4 (fixed) — refill threshold ANDed with RT classification
tctx->last_refill_ns > 0 &&
(p->nr_cpus_allowed == 1 || p->prio < 100) &&
tctx->last_refill_ns >= (s64)FLOW_INTERACTIVE_FLOOR_MIN_NS

p->prio < 100 identifies kernel SCHED_FIFO and SCHED_RR tasks, whose
kernel-internal p->prio lies in the range 0-99.
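To make the p->prio < 100 test concrete, here is a small userspace sketch of how the kernel-internal priority relates to the values set from userspace. It assumes the usual kernel mapping (MAX_RT_PRIO of 100, RT tasks at 99 - rt_priority, SCHED_OTHER tasks at 120 + nice). The function names are illustrative and are not part of scx_flow.

```c
#include <stdbool.h>

#define MAX_RT_PRIO 100 /* kernel constant; RT priorities sit below it */

/* Kernel-internal prio for a SCHED_FIFO/SCHED_RR task (rt_priority 1..99). */
static int rt_task_prio(int rt_priority)
{
	return MAX_RT_PRIO - 1 - rt_priority;
}

/* Kernel-internal prio for a SCHED_OTHER task (nice -20..19). */
static int normal_task_prio(int nice)
{
	return 120 + nice;
}

/* The classification test used by the predicate: prio below MAX_RT_PRIO. */
static bool is_rt_prio(int prio)
{
	return prio < MAX_RT_PRIO;
}
```

With this mapping, a task started with cyclictest --priority=99 has p->prio 0, while even a nice -20 SCHED_OTHER task sits at 100 and therefore fails the prio < 100 test.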

2. Relaxed preempt_ready refill check

// before
(tctx->last_refill_ns >= (s64)preempt_refill_min_ns &&
 tctx->budget_ns >= (s64)preempt_budget_min_ns)

// after
tctx->budget_ns >= (s64)preempt_budget_min_ns
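The effect of dropping the refill gate can be modelled in plain C. This is a sketch, not the scheduler's code: struct task_ctx_model is a hypothetical stand-in for the two task-context fields involved, and the thresholds are hard-coded to the 200us and 150us values quoted in the summary.

```c
#include <stdint.h>
#include <stdbool.h>

/* Minimal stand-in for the two task-context fields used by the check. */
struct task_ctx_model {
	int64_t last_refill_ns;
	int64_t budget_ns;
};

/* Threshold values quoted in the summary (200us and 150us). */
static const int64_t preempt_refill_min_ns = 200000;
static const int64_t preempt_budget_min_ns = 150000;

/* Before: both the last refill and the accumulated budget must clear a gate. */
static bool preempt_ready_before(const struct task_ctx_model *tctx)
{
	return tctx->last_refill_ns >= preempt_refill_min_ns &&
	       tctx->budget_ns >= preempt_budget_min_ns;
}

/* After: the accumulated budget alone decides. */
static bool preempt_ready_after(const struct task_ctx_model *tctx)
{
	return tctx->budget_ns >= preempt_budget_min_ns;
}
```

A task with a 2us last refill and 160us of accumulated budget fails the old check and passes the new one. A task whose budget is below 150us still fails both.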

The idle-CPU local-reserved path that was present in v2.2.2/v2.2.3 has been
removed — it bypassed the latency lane routing, causing periodic tasks to
lose their priority dispatch slot. The net submission only carries the two
changes above.

Benchmark Results

Full tagged artifacts (PNG, SVG, CSV, report):
https://github.com/galpt/testing-scx_flow/tree/benchmark-archives/20260409_scx_flow_v2.2.0_release/mini/v2.2.4

All runs on CachyOS 7.0.3-1-cachyos, 16-core AMD system, Balanced power profile.

Normal mode (cyclictest -D 30 -t 4 -a 0 -m -v)

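| Scheduler            | Max latency (us) | Spikes >100us |
|----------------------|------------------|---------------|
| baseline (CFS/EEVDF) | 1106             | 304           |
| scx_cosmos           | 852              | 821           |
| scx_bpfland          | 2724             | 222           |
| scx_flow v2.2.4      | 142              | 2             |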

Hard RT mode (cyclictest --priority=99 --smp --interval=200 --histogram=20)

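| Scheduler            | Overflows >20us | Max latency (us) |
|----------------------|-----------------|------------------|
| baseline (CFS/EEVDF) | 344             | 407              |
| scx_cosmos           | 351             | 227              |
| scx_bpfland          | 376             | 375              |
| scx_flow v2.2.4      | 402             | 339              |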

Note: hard RT cyclictest runs at SCHED_FIFO 99 which is dispatched by the
kernel's rt_sched_class, not by ext_sched_class. The hard RT results reflect
system noise under SCX background load rather than the scheduler policy
itself, and are consistent across all tested schedulers.

Files Changed

 Cargo.lock                                      | 2 +-
 scheds/experimental/scx_flow/Cargo.toml         | 2 +-
 scheds/experimental/scx_flow/README.md          | 2 +-
 scheds/experimental/scx_flow/src/bpf/main.bpf.c | 5 ++---
 4 files changed, 4 insertions(+), 5 deletions(-)

Signed-off-by: Galih Tama <galpt@v.recipes>

galpt added 4 commits May 8, 2026 10:11
…eups

The preempt_ready condition required both last_refill_ns >=
preempt_refill_min_ns AND budget_ns >= preempt_budget_min_ns.
For short-interval wakeups such as cyclictest at 200us period, the
per-wakeup refill (sleep_ns / 100) is only ~2us, far below the
200us preempt_refill_min_ns threshold. This caused all short-interval
wakeups to miss the WAKE_PROFILE_PREEMPT_READY profile bit and fall
through to the reserved DSQ path instead of the direct SCX_DSQ_LOCAL_ON
fast path, adding measurable dispatch latency.

Removed the last_refill_ns condition so that preempt_ready depends
only on budget_ns >= preempt_budget_min_ns. A task's accumulated
positive budget is sufficient evidence of responsiveness — the refill
history gate was unnecessarily penalizing high-frequency wakeups.

Version bumped to 2.2.2.

Fixes: hard-RT max latency gap vs scx_cosmos (429us vs 188us)

Adds a dedicated bypass in flow_enqueue for tasks that have been
sleeping for less than FLOW_INTERACTIVE_SLEEP_MIN_NS (750us).
Such tasks are extremely high-frequency wakeups (e.g. cyclictest at
200us, timer-driven periodic work) and should skip all lane analysis,
wake-profile routing, and budget refill gates.

The fast path checks:
  1. is_wakeup && tctx && budget_ns > 0  — positive budget task
  2. !containment_active                 — not a hog
  3. last_sleep_ns <= 750us              — short sleep, clearly responsive
  4. valid target CPU                    — can dispatch locally

When matched, the task is inserted directly to SCX_DSQ_LOCAL_ON with
a minimal slice (FLOW_SLICE_MIN_NS) and returns immediately. This
bypasses all of:
  - Wake profile recomputation and bit checks
  - should_preempt / lane routing
  - Urgent, latency, reserved, contained, shared DSQ arbitration
  - Per-wakeup counter tracking

Combined with the earlier preempt_ready relaxed refill check and the
idle-CPU local-reserved fix, this ensures that high-frequency wakeups
consistently take the fastest possible dispatch path regardless of
load or lane state.
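The four checks listed in this commit message can be sketched as a single predicate. This is only a userspace model of the description: struct wake_state_model and its field layout are hypothetical, and the 750us bound is taken from the message rather than from the source.

```c
#include <stdint.h>
#include <stdbool.h>

#define SHORT_SLEEP_MAX_NS 750000LL /* 750us, per the commit message */

/* Hypothetical snapshot of the state the four checks look at. */
struct wake_state_model {
	bool is_wakeup;
	bool has_tctx;           /* task context lookup succeeded */
	int64_t budget_ns;
	bool containment_active;
	int64_t last_sleep_ns;
	int target_cpu;          /* negative means no valid target CPU */
};

/* Returns true only when all four listed conditions hold. */
static bool short_sleep_fast_path(const struct wake_state_model *w)
{
	if (!w->is_wakeup || !w->has_tctx || w->budget_ns <= 0)
		return false;          /* 1. positive-budget wakeup */
	if (w->containment_active)
		return false;          /* 2. not a contained hog */
	if (w->last_sleep_ns > SHORT_SLEEP_MAX_NS)
		return false;          /* 3. short sleep */
	return w->target_cpu >= 0;     /* 4. valid target CPU */
}
```

Written this way, it is easy to see that the bypass depends only on budget sign, containment state, and sleep length, and never consults the wake profile or lane state.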
…reshold

The rt_sensitive_ready condition now separates the refill threshold
from the priority check.  An RT-priority task (p->prio < 100) only
needs any positive refill to qualify for the WAKE_PROFILE_RT_SENSITIVE
bit, while non-RT tasks still need the full FLOW_INTERACTIVE_FLOOR_MIN_NS
(80 us) threshold.

Without this fix the p->prio check was useless in practice because the
common-path refill threshold also required last_refill_ns >= 80 us.
Cyclictest at 200 us period produces a per-wakeup refill of only 2 us
(sleep_ns / 100), which never satisfied the 80 us gate.  With the
threshold separated, RT-priority tasks get the low-latency preempt path
through the existing lane machinery while non-RT behaviour is unchanged.

Signed-off-by: Galih Tama <galpt@v.recipes>

Bump the benchmark snapshot reference to match the current release line.

Signed-off-by: Galih Tama <galpt@v.recipes>

Copilot AI review requested due to automatic review settings May 8, 2026 06:27

Copilot AI left a comment

Contributor

Pull request overview

Updates the experimental scx_flow sched_ext scheduler to v2.2.3 by refining wakeup classification and local dispatch behavior to improve responsiveness (especially for RT-priority tasks) without reintroducing the prior short-sleep lane-analysis bypass.

Changes:

  • Make rt_sensitive wake profiling priority-aware by marking RT-priority tasks (p->prio < 100) as WAKE_PROFILE_RT_SENSITIVE on positive refill.
  • Relax preempt_ready gating to depend on accumulated budget only (drops the refill-threshold requirement).
  • Extend the local-reserved fast path to cover wakeups targeting an idle CPU.
  • Bump version references to 2.2.3 in Cargo.toml, Cargo.lock, and README text.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| scheds/experimental/scx_flow/src/bpf/main.bpf.c | Adjusts wake-profile logic and enqueue fast-path routing (RT sensitivity, preempt gating, idle-CPU local dispatch). |
| scheds/experimental/scx_flow/README.md | Updates benchmark snapshot label to v2.2.3 (link currently appears inconsistent). |
| scheds/experimental/scx_flow/Cargo.toml | Bumps crate version to 2.2.3. |
| Cargo.lock | Updates locked scx_flow package version to 2.2.3. |

Comment thread scheds/experimental/scx_flow/src/bpf/main.bpf.c
Comment thread scheds/experimental/scx_flow/src/bpf/main.bpf.c Outdated
Comment thread scheds/experimental/scx_flow/README.md Outdated

@sirlucjan sirlucjan left a comment

Collaborator

Tested, everything is OK.

@galpt galpt marked this pull request as draft May 8, 2026 08:43

The v2.2.3 change that relaxed the rt_sensitive_ready condition had a
logical error: last_refill_ns >= FLOW_INTERACTIVE_FLOOR_MIN_NS (80 us)
was placed as a standalone OR branch, causing any task with 80 us or
more of recent refill to qualify as RT-sensitive regardless of priority
or affinity. This falsely classified SCHED_OTHER periodic tasks as
RT-sensitive, blocking them from the latency lane, forcing 50 us
preemption slices, and regressing max latency from 173 us (v2.2.0) to
476 us (v2.2.3) in the 100 us benchmark.

Fix by ANDing the refill threshold with the genuine RT classification
(pinned or prio < 100) instead of ORing it. Non-RT tasks with good
refill retain access to the latency lane and the idle-CPU fast path.
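The logical error is easiest to see by evaluating both predicates on the same inputs. The sketch below is a userspace model, not the BPF code: struct rt_inputs is a hypothetical flattening of the fields the condition reads, and FLOOR_MIN_NS stands in for FLOW_INTERACTIVE_FLOOR_MIN_NS at the 80 us value quoted above.

```c
#include <stdint.h>
#include <stdbool.h>

#define FLOOR_MIN_NS 80000LL /* stand-in for FLOW_INTERACTIVE_FLOOR_MIN_NS */

/* Hypothetical flattened view of the fields the predicate reads. */
struct rt_inputs {
	int nr_cpus_allowed;
	int prio;               /* kernel-internal p->prio */
	int64_t last_refill_ns;
};

/* v2.2.3: the refill floor is one more OR branch. */
static bool rt_sensitive_v223(const struct rt_inputs *in)
{
	return in->last_refill_ns > 0 &&
	       (in->nr_cpus_allowed == 1 ||
		in->prio < 100 ||
		in->last_refill_ns >= FLOOR_MIN_NS);
}

/* v2.2.4: the refill floor is ANDed with the RT classification. */
static bool rt_sensitive_v224(const struct rt_inputs *in)
{
	return in->last_refill_ns > 0 &&
	       (in->nr_cpus_allowed == 1 || in->prio < 100) &&
	       in->last_refill_ns >= FLOOR_MIN_NS;
}
```

For an unpinned SCHED_OTHER task (prio 120) with 100 us of refill, the v2.2.3 form returns true and the v2.2.4 form returns false, which is the misclassification this commit removes. Note that under the v2.2.4 form an RT-priority task with only a 2 us refill no longer qualifies either, since the 80 us floor applies to every task.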

Signed-off-by: Galih Tama <galpt@v.recipes>

@galpt galpt changed the title from "scheds/experimental/scx_flow: v2.2.3 — priority-aware rt_sensitive with relaxed refill threshold" to "scheds/experimental/scx_flow: v2.2.4 — fix rt_sensitive_ready predicate regression, priority-aware wakeup routing, and relaxed preempt threshold" on May 8, 2026
galpt added 2 commits May 8, 2026 15:52
Signed-off-by: Galih Tama <galpt@v.recipes>
…ccess

The idle-CPU local-reserved path added in v2.2.2 caused all wakeups to
idle CPUs to short-circuit flow_enqueue via use_local_reserved, which
bypassed the latency lane insertion code. This meant tasks like cyclictest
that should have been routed through the LATENCY_DSQ (where they get
priority dispatch ahead of reserved/shared DSQ entries) instead landed
directly on SCX_DSQ_LOCAL_ON, losing their scheduling priority on the
next busy-CPU wakeup.

Remove the (tctx->wake_cpu_idle && is_wakeup) condition from
use_local_reserved. Idle-CPU wakeups now flow through the normal lane
routing: the per-CPU reserved DSQ with SCX_KICK_IDLE for non-latency
wakeups, or the LATENCY_DSQ when the task has accumulated latency
allowance. RT-sensitive tasks (SCHED_FIFO, pinned) and IPC tasks still
get the fast local-reserved path via should_preempt and
ipc_confidence_wakeup respectively, which remain correct.
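A minimal model of this change, under a loudly stated assumption: the commit message does not show the full use_local_reserved expression, so the OR of three terms below is a guess at its shape based on the names mentioned here. struct enqueue_inputs is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical inputs; the real flow_enqueue derives these from task state. */
struct enqueue_inputs {
	bool should_preempt;        /* e.g. RT-sensitive wakeup */
	bool ipc_confidence_wakeup; /* e.g. IPC wakeup */
	bool wake_cpu_idle;
	bool is_wakeup;
};

/* Assumed shape before the fix: idle-CPU wakeups also short-circuit. */
static bool use_local_reserved_before(const struct enqueue_inputs *in)
{
	return in->should_preempt || in->ipc_confidence_wakeup ||
	       (in->wake_cpu_idle && in->is_wakeup);
}

/* Assumed shape after the fix: the idle-CPU term is gone. */
static bool use_local_reserved_after(const struct enqueue_inputs *in)
{
	return in->should_preempt || in->ipc_confidence_wakeup;
}
```

In this model, a plain wakeup to an idle CPU took the local-reserved shortcut before the fix and falls through to normal lane routing afterwards, while preempting and IPC wakeups are unaffected.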

Signed-off-by: Galih Tama <galpt@v.recipes>

@galpt galpt changed the title from "scheds/experimental/scx_flow: v2.2.4 — fix rt_sensitive_ready predicate regression, priority-aware wakeup routing, and relaxed preempt threshold" to "scheds/experimental/scx_flow: v2.2.4 — priority-aware rt_sensitive and relaxed preempt threshold" on May 8, 2026
@galpt galpt marked this pull request as ready for review May 8, 2026 11:36
@hodgesds hodgesds added this pull request to the merge queue May 8, 2026
Merged via the queue into sched-ext:main with commit d479fcb May 8, 2026
14 checks passed