
mitosis: fix vtime drift for borrowed and cross-cell-pinned tasks #3593

Open: likewhatevs wants to merge 2 commits into sched-ext:main from likewhatevs:mitosis-init-stall-fix


@likewhatevs likewhatevs commented May 22, 2026


Problem

scx_mitosis charges each task's dsq_vtime on every stop, but it advanced
the owning cell's vtime_now only for in-cell, un-borrowed runs.
Runtime a task spent borrowing an idle CPU from another cell, or
pinned to a CPU outside its home cell, was charged to dsq_vtime
but hidden from every cell's vtime_now, which is both the basis the next
mitosis_enqueue compares against and the value mitosis_dispatch
orders by.

As a result, dsq_vtime drifts away from every basis it is ever checked
against. For a SCHED_IDLE task (weight 1 → charged at 100×) the gap grows
fast and surfaces in two ways:

  • "vtime too far ahead" error: dsq_vtime exceeds the basis by more
    than 8192 * slice_ns (~164s at the 20ms default), so mitosis_enqueue
    calls scx_bpf_error and the scheduler exits.
  • Runnable-task stall: before the gap gets that large, the drifted
    task already loses every vtime-ordered dispatch and the watchdog fires
    (SCX_EXIT_ERROR_STALL).
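
The numbers in the two bullets above can be checked with a little arithmetic. The sketch below is not the scheduler's code: the helper names are hypothetical, and the charge rule `runtime * 100 / weight` is an assumption chosen to match the "weight 1 → charged at 100×" statement in the description.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Assumed charge rule: runtime scaled by 100 / weight (weight 1 gives 100x). */
static uint64_t scaled_charge_ns(uint64_t runtime_ns, uint64_t weight)
{
	assert(weight > 0);
	return runtime_ns * 100 / weight;
}

/* Gap tolerated by the enqueue check, per the description: 8192 * slice_ns. */
static uint64_t vtime_gap_limit_ns(uint64_t slice_ns)
{
	return 8192 * slice_ns;
}

/*
 * Real runtime a task must accumulate, unseen by the basis, before its
 * scaled dsq_vtime reaches the limit. Hypothetical helper for illustration.
 */
static uint64_t hidden_runtime_to_limit_ns(uint64_t slice_ns, uint64_t weight)
{
	assert(weight > 0);
	return vtime_gap_limit_ns(slice_ns) * weight / 100;
}
```

With a 20 ms slice, `vtime_gap_limit_ns` gives 163,840,000,000 ns (about 164 s). Under the assumed charge rule, a weight-1 task reaches that gap after about 1.64 s of real runtime that no basis sees, while a weight-100 task would need the full 163.84 s.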

Fix

Both behaviors are gated behind --vtime-borrow-fixes (default off, so
this is a no-op until the flag is set).

  • Borrowed runtime: mitosis_stopping advances the task's
    charge-cell vtime_now regardless of borrow / CPU retag, so a
    borrowing task keeps its home cell tracking the runtime it spent on loan.
  • Cross-cell pin: a task pinned to a CPU outside its home cell is
    charged to the cell of the CPU it actually runs on: vtime_charge_cell
    is taken from the run CPU in mitosis_select_cpu and mitosis_enqueue,
    not from the home cell. That cell's and that CPU's vtime_now then track
    the task, so its dsq_vtime stays in the same domain as the foreign
    cell's tasks that mitosis_dispatch compares it against, and the task
    stays dispatchable.
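
The borrowed-runtime case can be illustrated with a toy model of the stop path. Everything here is a stand-in: `cell_model`, `task_model`, and `model_stopping` are hypothetical names, the charge rule `runtime * 100 / weight` is assumed, and "advance vtime_now" is modeled as raising it to the task's dsq_vtime with a plain comparison rather than the wraparound-safe comparison a real scheduler would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the scheduler's per-cell and per-task state. */
struct cell_model { uint64_t vtime_now; };
struct task_model { uint64_t dsq_vtime; uint64_t weight; };

/*
 * Model of one stop event. The task is always charged; the charge cell's
 * vtime_now advances for borrowed runs only when the fix flag is set.
 */
static void model_stopping(struct task_model *t, struct cell_model *charge_cell,
			   uint64_t runtime_ns, bool borrowed, bool fixes_enabled)
{
	assert(t->weight > 0);
	t->dsq_vtime += runtime_ns * 100 / t->weight;

	if ((!borrowed || fixes_enabled) && charge_cell->vtime_now < t->dsq_vtime)
		charge_cell->vtime_now = t->dsq_vtime;
}

/* Gap between a task's dsq_vtime and the basis it is compared against. */
static uint64_t model_gap(const struct task_model *t, const struct cell_model *c)
{
	return t->dsq_vtime > c->vtime_now ? t->dsq_vtime - c->vtime_now : 0;
}

/* Run `stops` borrowed slices for a weight-1 task and return the final gap. */
static uint64_t model_borrowed_gap(uint64_t slice_ns, unsigned int stops,
				   bool fixes_enabled)
{
	struct cell_model home = { .vtime_now = 0 };
	struct task_model idle_task = { .dsq_vtime = 0, .weight = 1 };

	for (unsigned int i = 0; i < stops; i++)
		model_stopping(&idle_task, &home, slice_ns, true, fixes_enabled);
	return model_gap(&idle_task, &home);
}
```

In this model, 82 borrowed 20 ms slices leave a gap of 164,000,000,000 ns with the flag off, which is past the 8192 * slice_ns limit, and a gap of 0 with the flag on.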

Commits

  1. add ktstr reproducers for vtime drift errors and stalls — four
    KVM-VM reproducers (error + stall, each via borrowing and via
    cross-cell pin), the --vtime-borrow-fixes flag (declared as a no-op
    here), and the ktstr dev-dependency behind the ktstr-tests feature.
    With the flag a no-op, all four reproduce the bugs (fail) at this commit.
  2. fix vtime drift for borrowed and cross-cell-pinned tasks — adds
    the vtime_borrow_fixes behavior; with the flag set, the four
    reproducers pass.

Test plan

Each reproducer was run against both commits (--kernel ../linux,
-E 'test(mitosis_vtime)'):

  • commit 1 (flag a no-op): all four reproduce the bug — 2/2 runs
  • commit 2 (fix): all four pass — 2/2 runs

Note: stall_via_borrowing is flaky; 1/4 runs passes erroneously without the fix. This bug is hard to reproduce reliably.

@likewhatevs likewhatevs marked this pull request as draft May 22, 2026 22:41
@likewhatevs likewhatevs force-pushed the mitosis-init-stall-fix branch from c36eb8f to 9ea7a3a Compare May 27, 2026 21:11
@likewhatevs likewhatevs changed the title mitosis vtime far ahead stall test + fix mitosis: vtime-too-far-ahead drift — tests and fixes May 27, 2026
@likewhatevs likewhatevs force-pushed the mitosis-init-stall-fix branch 2 times, most recently from d1a6cc1 to 6098d30 Compare May 27, 2026 21:17
@likewhatevs likewhatevs marked this pull request as ready for review May 27, 2026 21:17
@likewhatevs likewhatevs force-pushed the mitosis-init-stall-fix branch from 6098d30 to 618ec74 Compare May 27, 2026 21:21
@likewhatevs likewhatevs changed the title mitosis: vtime-too-far-ahead drift — tests and fixes mitosis: fix vtime drift (errors and runnable-task stalls) behind --vtime-borrow-fixes May 29, 2026
Add ktstr (KVM-backed) reproducers for the scx_mitosis vtime drift
bugs, the --vtime-borrow-fixes flag they gate on, and the ktstr
dev-dependency. The flag is declared here as a no-op; the follow-up
commit gives it the behavior that fixes the bugs, so these tests
reproduce the bugs (fail) at this commit and pass after it.

Two failure modes, each reachable from two paths:

vtime-too-far-ahead error (ktstr_mitosis_init_stall_tests.rs): the
mitosis_enqueue "vtime too far ahead" scx_bpf_error fires when a task's
dsq_vtime exceeds basis_vtime by more than 8192 * slice_ns (~164s at the
default 20ms slice).
  - via_borrowing: SCHED_IDLE workers ride borrowed CPUs; dsq_vtime grows
    at 100x but their cell's vtime_now does not, so the enqueue basis
    check trips.
  - via_cross_cell_pin: a SCHED_IDLE worker pinned to a CPU outside its
    home cell never advances that CPU's vtime, so dsq_vtime drifts past
    the per-CPU basis without bound.

Runnable-task stall (ktstr_mitosis_vtime_runnable_stall_tests.rs): a
SCHED_IDLE drifter loses every vtime-ordered dispatch until the watchdog
fires (SCX_EXIT_ERROR_STALL).
  - via_borrowing: the runaway dsq_vtime poisons the home cell domain, so
    a sibling task can no longer be dispatched once borrowing is cut.
  - via_cross_cell_pin: the drifted dsq_vtime sits far above the foreign
    cell's tasks that mitosis_dispatch compares it against.

ktstr is gated behind the ktstr-tests cargo feature.

Signed-off-by: Pat Somaru <patso@likewhatevs.io>
Fix two related vtime drift bugs. Both fixes are gated behind
--vtime-borrow-fixes (default off, so this is a no-op until the flag is set).

A task's dsq_vtime is charged on every stop, but its cell's vtime_now --
the basis the next enqueue compares against and the value
mitosis_dispatch orders by -- was only advanced for in-cell, un-borrowed
runs. Runtime spent borrowing, or pinned outside the home cell, was
hidden from every cell, so dsq_vtime drifted away from every basis it is
checked against: it tripped the "vtime too far ahead" error, or, once
the gap was large enough, lost every dispatch into a runnable-task stall
(watchdog SCX_EXIT_ERROR_STALL).

Borrowed runtime: mitosis_stopping now advances the task's charge-cell
vtime regardless of borrow / CPU retag, so a borrowing task keeps its
home cell's vtime tracking the runtime it spent on loan.

Cross-cell pin: a task pinned to a CPU outside its home cell is charged
to the cell of the CPU it actually runs on -- vtime_charge_cell is set
from the run CPU in mitosis_select_cpu and mitosis_enqueue, not from the
home cell. That cell's and that CPU's vtime_now then track it through the
existing stopping gates, so cctx->vtime_now is a live enqueue basis and
its dsq_vtime stays in the domain the foreign cell's tasks -- and
mitosis_dispatch -- compare against. It stays dispatchable instead of
drifting until it loses every dispatch.
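
The cross-cell pin rule above amounts to a choice of charge cell. The following is a minimal sketch of that choice only, assuming a flat CPU-to-cell map; `topo_model`, `model_charge_cell`, and the `pinned` parameter are hypothetical and do not reflect how the BPF code stores or derives this state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8

/* Hypothetical CPU-to-cell map; the real scheduler keeps this in BPF state. */
struct topo_model { uint32_t cpu_cell[MODEL_NR_CPUS]; };

/*
 * Pick the cell whose vtime domain a run is charged to. With the fix on, a
 * task pinned to a CPU outside its home cell is charged to that CPU's cell;
 * in every other case the home cell is kept.
 */
static uint32_t model_charge_cell(const struct topo_model *topo,
				  uint32_t home_cell, uint32_t run_cpu,
				  bool pinned, bool fixes_enabled)
{
	assert(run_cpu < MODEL_NR_CPUS);
	uint32_t cpu_cell = topo->cpu_cell[run_cpu];

	if (fixes_enabled && pinned && cpu_cell != home_cell)
		return cpu_cell;
	return home_cell;
}
```

A borrowing task that is not pinned keeps its home cell as the charge cell in this sketch, matching the borrowed-runtime paragraph, where the home cell tracks the runtime spent on loan.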

With the flag set, the four reproducers from the previous commit pass.

Signed-off-by: Pat Somaru <patso@likewhatevs.io>
@likewhatevs likewhatevs force-pushed the mitosis-init-stall-fix branch from 67c22aa to 3574b98 Compare May 29, 2026 02:19
@likewhatevs likewhatevs changed the title mitosis: fix vtime drift (errors and runnable-task stalls) behind --vtime-borrow-fixes mitosis: fix vtime drift for borrowed and cross-cell-pinned tasks May 29, 2026
@likewhatevs likewhatevs requested a review from hodgesds June 2, 2026 15:15