mitosis: rescue cell 0 when child cpusets cover every CPU#3605
mitosis: rescue cell 0 when child cpusets cover every CPU#3605likewhatevs wants to merge 3 commits into
Conversation
146ee0e to
475e493
Compare
475e493 to
676f713
Compare
a6ee461 to
e3c42ed
Compare
When --cell-parent-cgroup is used, cell 0 is the catch-all that
receives any CPUs not explicitly claimed by a child cgroup's
cpuset.cpus. cell_manager.rs's compute_cpu_assignments (Phase 5)
bails when any cell — including cell 0 — would end up with zero
CPUs. Child cgroup cpusets that collectively cover every host
CPU squeeze cell 0 to empty and the scheduler tears down
mid-recompute:
running scheduler main loop
Caused by:
0: checking cpuset changes
1: recomputing cell configuration after cpuset change
2: computing demand-weighted CPU assignments
3: Cell 0 has no CPUs assigned (nr_cpus=N, num_cells=M)
This adds a ktstr e2e reproducer that drives the full scheduler
over this condition: two cgroups with disjoint cpusets covering
all 16 topology CPUs trigger the bail at scheduler startup.
ktstr gated behind the ktstr-tests cargo feature.
Signed-off-by: Pat Somaru <patso@likewhatevs.io>
e3c42ed to
4caf17c
Compare
| ) | ||
| .with_context(|| format!("initializing cell manager for cgroup {}", parent_cgroup))?; | ||
| .with_context(|| format!("initializing cell manager for cgroup {}", parent_cgroup))? | ||
| .with_cpu_to_llc(cpu_to_llc); |
There was a problem hiding this comment.
This seems weird, why not pass it in like everything else?
There was a problem hiding this comment.
that was just to minimize delta, changed to this instead
c11f1fe to
1c9722a
Compare
compute_cpu_assignments bails when cell 0 -- the catch-all for CPUs no child cgroup's cpuset claims -- ends up with zero CPUs, which happens once child cpusets collectively cover every host CPU. The bail propagates up and exits the scheduler mid-recompute. Reserve a configurable number of CPUs for cell 0 via --cell0-min-cpus (default 0, off) before the cpuset-based assignment runs. Prefer CPUs that no child cpuset claims; when those run out, take from the cells evenly, largest first, but never a cell's last CPU, so the reservation can't starve a child into the same bail. This also lets operators set a minimum size for cell 0. Signed-off-by: Pat Somaru <patso@likewhatevs.io>
1c9722a to
0e5b36d
Compare
| /// cell 0 from being starved to zero when child cgroups' cpusets cover | ||
| /// every CPU. The reservation never takes a child cell's last CPU, so cell 0 |
There was a problem hiding this comment.
Is it strictly true that the cousets need to cover every cpu? I thought you could repro this when there were held out cpus.
There was a problem hiding this comment.
Where I've seen this exact error in the wild, yes. I have not seen this exact issue (the error in the PR) w/ holdout present.
| Self::new_with_path( | ||
| path, | ||
| max_cells, | ||
| all_cpus, | ||
| exclude, | ||
| cell0_min_cpus, | ||
| cpu_to_llc, | ||
| ) |
There was a problem hiding this comment.
Let's change this to
Self::new_with_path_opts(
path,
max_cells,
all_cpus,
exclude,
cell0_min_cpus,
cpu_to_llc,
)
Which your 4 tests call directly, and all the others are unchanged because they call a wrapper
They call
fn new_with_path(
path: PathBuf,
max_cells: u32,
all_cpus: Cpumask,
exclude: HashSet<String>,
) -> Result<Self> {
Self::new_with_path_opts(path, max_cells, all_cpus, exclude, 0, HashMap::new())
}
Then decorate with
#[cfg(test)]
fn new_with_path(
There was a problem hiding this comment.
Good idea, thx!
| } | ||
| } | ||
|
|
||
| // Holdout: reserve up to `cell0_min_cpus` CPUs for cell 0 before the |
There was a problem hiding this comment.
Also, I think by "up to" you mean the number we reserve in this step can be lower than cell0_min_cpus because there may be unclaimed cpus to start with.
No -- by "up to", I mean that if there are 4 non-root cells (i.e. excluding cell 0 from those 4), 5 CPUs, and 2 CPUs are specified for holdout, holdout being 2 would trigger this error by leaving another cell without CPUs if we respected it ultimately. ATM, this PR would give only 1 CPU to cell 0 in this scenario (i.e. the "up to").
I think in that situation "up to" is less problematic than "must", but that's really a judgement call. Do "must" instead of "up to"?
| } | ||
| } | ||
|
|
||
| // Holdout: reserve up to `cell0_min_cpus` CPUs for cell 0 before the |
There was a problem hiding this comment.
Holdout tries to reserve cell0_min_cpus many cpus, but it may reserve fewer if that would mean taking a workload cell's last cpu.
Adding the cell-0 holdout gave the cell-manager test constructor two extra arguments, which forced every existing test to pass cell0_min_cpus and cpu_to_llc even though only the holdout tests use them. Move the holdout arguments into a new new_with_path_opts that the holdout tests call directly, and restore the original four-argument new_with_path as a test-only wrapper that defaults the holdout off -- so the tests that do not exercise the holdout read as they did before it existed. Also reword the holdout comment to say it tries to reserve cell0_min_cpus CPUs but reserves fewer when that would take a workload cell's last CPU. Signed-off-by: Pat Somaru <patso@likewhatevs.io>
tommy-u
left a comment
There was a problem hiding this comment.
The only change I need is to keep a flag in the stats that says "I actually enforced holdout during this run." Just a sticky boolean that you set if, at any time, you took a cpu away from a workload cell. I think it will help us with debugging. Otherwise good to squash and merge.
| let mut taken = 0usize; | ||
| for cpu in unclaimed.into_iter().take(self.cell0_min_cpus) { | ||
| holdout.set_cpu(cpu).ok(); | ||
| taken += 1; | ||
| } |
There was a problem hiding this comment.
A future concern is picking hyper thread pairs.
| // grows. Only exclusive CPUs count toward survival: a contested CPU | ||
| // may be awarded to a co-claimant in Phase 3, so it cannot be what | ||
| // keeps a cell alive -- counting it would let the steal take a | ||
| // cell's last exclusive CPU and zero the cell (the Phase-5 bail). |
There was a problem hiding this comment.
If we got into this loop, we are now taking claimed cpus from someone. Can you please add a "enforced_holdout" boolean that gets initialized to False when the scheduler starts up and gets or'ed to true here if we get into the body of this loop?
When we hear of some hosts getting wonky perf, I'd like to be able to query this to know if we've put our hand on the scales.
| .filter(|&cpu| !holdout.test_cpu(cpu) && reservable(cpu)) | ||
| .min_by(|&a, &b| { | ||
| let contested = | ||
| |cpu: usize| contention.get(&cpu).map_or(0, |v| v.len()) > 1; |
There was a problem hiding this comment.
Instead of binary "pick uncontested before contested" you could pick in ascending number of contesting cgrpus. Just an idea.
mitosis bails when a cell ends up with zero CPUs. Cell 0 — the
catch-all that receives any CPUs no child cgroup's cpuset.cpus
claims — is squeezed to empty once child cpusets collectively cover
every host CPU, and the scheduler exits mid-recompute:
Add
--cell0-min-cpus <N>(default 0): before the cpuset-basedassignment runs, reserve up to N CPUs as a holdout for cell 0 —
preferring CPUs no child cpuset claims and, when those run out,
stealing from the cells round-robin (one CPU per cell in turn,
largest cell first) but never a CPU that would leave a child cell
with none — including when cpusets overlap, since a shared CPU is
assigned to only one of its claimants. So the reservation is spread
across cells and can't starve a child into the same bail. (Within a
donor the pick prefers an exclusive CPU, then the LLC split across
the most cells, then the lowest CPU.) The held-out CPUs are skipped
by the assignment and pre-seeded onto cell 0, so a non-zero value
keeps cell 0 from being starved to zero; if N exceeds what can be
reserved without emptying a child, cell 0 simply gets fewer than N.
Default 0 disables the holdout; the existing bail is unchanged.
First commit adds a ktstr e2e reproducer for the bail; the second
adds the flag and the holdout.
Test plan
--cell0-min-cpus 1: ktstr repromitosis_cell0_starved_by_full_coverage_cpusets passes — cell 0
rescued, scenario runs to a clean shutdown, no "Cell 0 has no CPUs"
bail and no stall
--cell0-min-cpus 0): the bail is preserved —test_cpu_assignments_cpusets_cover_all_cpus
first) — test_cpu_assignments_holdout_takes_from_largest_cell,
test_cpu_assignments_holdout_steals_evenly_across_cells
--cell0-min-cpusnever empties a child cell, withdisjoint or overlapping cpusets —
test_cpu_assignments_holdout_never_starves_a_child,
test_cpu_assignments_holdout_overlapping_cpusets_no_starvation