Skip to content

mitosis: rescue cell 0 when child cpusets cover every CPU#3605

Open
likewhatevs wants to merge 3 commits into
sched-ext:mainfrom
likewhatevs:mitosis-cell0-no-cpus-repro
Open

mitosis: rescue cell 0 when child cpusets cover every CPU#3605
likewhatevs wants to merge 3 commits into
sched-ext:mainfrom
likewhatevs:mitosis-cell0-no-cpus-repro

Conversation

@likewhatevs

@likewhatevs likewhatevs commented May 27, 2026

Copy link
Copy Markdown
Contributor

mitosis bails when a cell ends up with zero CPUs. Cell 0 — the
catch-all that receives any CPUs no child cgroup's cpuset.cpus
claims — is squeezed to empty once child cpusets collectively cover
every host CPU, and the scheduler exits mid-recompute:

  running scheduler main loop
  Caused by:
    0: checking cpuset changes
    1: recomputing cell configuration after cpuset change
    2: computing demand-weighted CPU assignments
    3: Cell 0 has no CPUs assigned (nr_cpus=N, num_cells=M)

Add --cell0-min-cpus <N> (default 0): before the cpuset-based
assignment runs, reserve up to N CPUs as a holdout for cell 0 —
preferring CPUs no child cpuset claims and, when those run out,
stealing from the cells round-robin (one CPU per cell in turn,
largest cell first) but never a CPU that would leave a child cell
with none — including when cpusets overlap, since a shared CPU is
assigned to only one of its claimants. So the reservation is spread
across cells and can't starve a child into the same bail. (Within a
donor the pick prefers an exclusive CPU, then the LLC split across
the most cells, then the lowest CPU.) The held-out CPUs are skipped
by the assignment and pre-seeded onto cell 0, so a non-zero value
keeps cell 0 from being starved to zero; if N exceeds what can be
reserved without emptying a child, cell 0 simply gets fewer than N.
Default 0 disables the holdout; the existing bail is unchanged.

First commit adds a ktstr e2e reproducer for the bail; the second
adds the flag and the holdout.

Test plan

  • --cell0-min-cpus 1: ktstr repro
    mitosis_cell0_starved_by_full_coverage_cpusets passes — cell 0
    rescued, scenario runs to a clean shutdown, no "Cell 0 has no CPUs"
    bail and no stall
  • Default (--cell0-min-cpus 0): the bail is preserved —
    test_cpu_assignments_cpusets_cover_all_cpus
  • Holdout spreads the steal across cells (round-robin, largest
    first) — test_cpu_assignments_holdout_takes_from_largest_cell,
    test_cpu_assignments_holdout_steals_evenly_across_cells
  • An over-large --cell0-min-cpus never empties a child cell, with
    disjoint or overlapping cpusets —
    test_cpu_assignments_holdout_never_starves_a_child,
    test_cpu_assignments_holdout_overlapping_cpusets_no_starvation
  • 53 mitosis unit tests pass; cargo fmt --check clean

@likewhatevs likewhatevs requested review from Copilot and removed request for Copilot June 2, 2026 15:08
@likewhatevs likewhatevs requested a review from hodgesds June 2, 2026 15:09
@likewhatevs likewhatevs marked this pull request as draft June 12, 2026 15:27
@likewhatevs likewhatevs force-pushed the mitosis-cell0-no-cpus-repro branch 2 times, most recently from 146ee0e to 475e493 Compare June 17, 2026 03:18
@likewhatevs likewhatevs requested a review from etsal June 22, 2026 16:58
@likewhatevs likewhatevs marked this pull request as ready for review June 22, 2026 17:44
@likewhatevs likewhatevs force-pushed the mitosis-cell0-no-cpus-repro branch from 475e493 to 676f713 Compare June 22, 2026 21:57
@likewhatevs likewhatevs marked this pull request as draft June 22, 2026 23:36
@likewhatevs likewhatevs force-pushed the mitosis-cell0-no-cpus-repro branch 2 times, most recently from a6ee461 to e3c42ed Compare June 23, 2026 14:14
When --cell-parent-cgroup is used, cell 0 is the catch-all that
receives any CPUs not explicitly claimed by a child cgroup's
cpuset.cpus. cell_manager.rs's compute_cpu_assignments (Phase 5)
bails when any cell — including cell 0 — would end up with zero
CPUs. Child cgroup cpusets that collectively cover every host
CPU squeeze cell 0 to empty and the scheduler tears down
mid-recompute:

  running scheduler main loop
  Caused by:
    0: checking cpuset changes
    1: recomputing cell configuration after cpuset change
    2: computing demand-weighted CPU assignments
    3: Cell 0 has no CPUs assigned (nr_cpus=N, num_cells=M)

This adds a ktstr e2e reproducer that drives the full scheduler
over this condition: two cgroups with disjoint cpusets covering
all 16 topology CPUs trigger the bail at scheduler startup.

ktstr gated behind the ktstr-tests cargo feature.

Signed-off-by: Pat Somaru <patso@likewhatevs.io>
@likewhatevs likewhatevs force-pushed the mitosis-cell0-no-cpus-repro branch from e3c42ed to 4caf17c Compare June 23, 2026 23:21
@likewhatevs likewhatevs marked this pull request as ready for review June 23, 2026 23:25
Comment thread scheds/rust/scx_mitosis/src/main.rs Outdated
)
.with_context(|| format!("initializing cell manager for cgroup {}", parent_cgroup))?;
.with_context(|| format!("initializing cell manager for cgroup {}", parent_cgroup))?
.with_cpu_to_llc(cpu_to_llc);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems weird, why not pass it in like everything else?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was just to minimize delta, changed to this instead

@likewhatevs likewhatevs force-pushed the mitosis-cell0-no-cpus-repro branch 2 times, most recently from c11f1fe to 1c9722a Compare June 24, 2026 18:37
compute_cpu_assignments bails when cell 0 -- the catch-all for CPUs
no child cgroup's cpuset claims -- ends up with zero CPUs, which
happens once child cpusets collectively cover every host CPU. The
bail propagates up and exits the scheduler mid-recompute.

Reserve a configurable number of CPUs for cell 0 via --cell0-min-cpus
(default 0, off) before the cpuset-based assignment runs. Prefer CPUs
that no child cpuset claims; when those run out, take from the cells
evenly, largest first, but never a cell's last CPU, so the reservation
can't starve a child into the same bail. This also lets operators set
a minimum size for cell 0.

Signed-off-by: Pat Somaru <patso@likewhatevs.io>
@likewhatevs likewhatevs force-pushed the mitosis-cell0-no-cpus-repro branch from 1c9722a to 0e5b36d Compare June 24, 2026 21:49
Comment on lines +160 to +161
/// cell 0 from being starved to zero when child cgroups' cpusets cover
/// every CPU. The reservation never takes a child cell's last CPU, so cell 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it strictly true that the cousets need to cover every cpu? I thought you could repro this when there were held out cpus.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where I've seen this exact error in the wild, yes. I have not seen this exact issue (the error in the PR) w/ holdout present.

Comment on lines +245 to +252
Self::new_with_path(
path,
max_cells,
all_cpus,
exclude,
cell0_min_cpus,
cpu_to_llc,
)

@tommy-u tommy-u Jun 25, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change this to

        Self::new_with_path_opts(
            path,
            max_cells,
            all_cpus,
            exclude,
            cell0_min_cpus,
            cpu_to_llc,
        )

Which your 4 tests call directly, and all the others are unchanged because they call a wrapper

They call

    fn new_with_path(
        path: PathBuf,
        max_cells: u32,
        all_cpus: Cpumask,
        exclude: HashSet<String>,
    ) -> Result<Self> {
        Self::new_with_path_opts(path, max_cells, all_cpus, exclude, 0, HashMap::new())
    }

Then decorate with

    #[cfg(test)]
    fn new_with_path(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thx!

}
}

// Holdout: reserve up to `cell0_min_cpus` CPUs for cell 0 before the

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think by "up to" you mean the number we reserve in this step can be lower than cell0_min_cpus because there may be unclaimed cpus to start with.

No -- by "up to", I mean that if there are 4 non-root cells (i.e. excluding cell 0 from those 4), 5 CPUs, and 2 CPUs are specified for holdout, holdout being 2 would trigger this error by leaving another cell without CPUs if we respected it ultimately. ATM, this PR would give only 1 CPU to cell 0 in this scenario (i.e. the "up to").

I think in that situation "up to" is less problematic than "must", but that's really a judgement call. Do "must" instead of "up to"?

}
}

// Holdout: reserve up to `cell0_min_cpus` CPUs for cell 0 before the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holdout tries to reserve cell0_min_cpus many cpus, but it may reserve fewer if that would mean taking a workload cell's last cpu.

Adding the cell-0 holdout gave the cell-manager test constructor two extra
arguments, which forced every existing test to pass cell0_min_cpus and
cpu_to_llc even though only the holdout tests use them. Move the holdout
arguments into a new new_with_path_opts that the holdout tests call directly,
and restore the original four-argument new_with_path as a test-only wrapper
that defaults the holdout off -- so the tests that do not exercise the holdout
read as they did before it existed.

Also reword the holdout comment to say it tries to reserve cell0_min_cpus CPUs
but reserves fewer when that would take a workload cell's last CPU.

Signed-off-by: Pat Somaru <patso@likewhatevs.io>

@tommy-u tommy-u left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change I need is to keep a flag in the stats that says "I actually enforced holdout during this run." Just a sticky boolean that you set if, at any time, you took a cpu away from a workload cell. I think it will help us with debugging. Otherwise good to squash and merge.

Comment on lines +651 to +655
let mut taken = 0usize;
for cpu in unclaimed.into_iter().take(self.cell0_min_cpus) {
holdout.set_cpu(cpu).ok();
taken += 1;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A future concern is picking hyper thread pairs.

// grows. Only exclusive CPUs count toward survival: a contested CPU
// may be awarded to a co-claimant in Phase 3, so it cannot be what
// keeps a cell alive -- counting it would let the steal take a
// cell's last exclusive CPU and zero the cell (the Phase-5 bail).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we got into this loop, we are now taking claimed cpus from someone. Can you please add a "enforced_holdout" boolean that gets initialized to False when the scheduler starts up and gets or'ed to true here if we get into the body of this loop?

When we hear of some hosts getting wonky perf, I'd like to be able to query this to know if we've put our hand on the scales.

.filter(|&cpu| !holdout.test_cpu(cpu) && reservable(cpu))
.min_by(|&a, &b| {
let contested =
|cpu: usize| contention.get(&cpu).map_or(0, |v| v.len()) > 1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of binary "pick uncontested before contested" you could pick in ascending number of contesting cgrpus. Just an idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants