scx_lavd: decouple per-cpdom cpumasks from build-time NR_CPUS by jkkm · Pull Request #3633 · sched-ext/scx

jkkm · 2026-06-08T20:21:29Z

Two commits:

scx_lavd: allocate per-cpdom cpumasks dynamically to decouple from NR_CPUS
scx_lavd: inline lookup_cpdom_cpumask() to avoid a hot-path call

Problem

scx_lavd stores its per-compute-domain online CPU masks in a file-scope
array of struct bpf_cpumask embedded by value:

private(LAVD) struct bpf_cpumask cpdom_cpumask[LAVD_CPDOM_MAX_NR];

struct bpf_cpumask embeds struct cpumask, whose size is
BITS_TO_LONGS(NR_CPUS) * sizeof(long) — fixed at the NR_CPUS the scheduler is
compiled against. Because the array lives in global (.data) storage, its
element stride and total size are frozen into the BPF object at build time, while
the verifier checks the program's accesses against the running kernel's
struct cpumask size.
If those disagree, the scheduler won't load. A scx_lavd built against
NR_CPUS=512 (sizeof(struct bpf_cpumask) == 72) and loaded on an
NR_CPUS=1024 kernel (== 136) is rejected by the verifier on every op that
touches the array — lavd_select_cpu, lavd_enqueue, lavd_init,
lavd_cpu_online, lavd_cpu_offline:

... map value, value_size=9248 off=9144 size=128
R2 max value is outside of the allowed memory range

The 128-byte cpumask access to the last element runs past the build-time-sized
.data storage. In effect the prebuilt binary is silently tied to one NR_CPUS
value, which breaks shipping a single scx_lavd across kernels that differ in
NR_CPUS.
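
The size mismatch can be reproduced with plain arithmetic. The sketch below is a user-space model, not scheduler code. It assumes a 64-bit target and that struct bpf_cpumask is a struct cpumask followed by a 4-byte reference count padded to 8 bytes, which is consistent with the 72 and 136 byte figures above. All model_* names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 64-bit target, so one bitmap word covers 64 CPUs. */
#define MODEL_BITS_PER_LONG 64
#define MODEL_SIZEOF_LONG   8

/* Model of BITS_TO_LONGS(NR_CPUS) * sizeof(long). */
static size_t model_cpumask_size(unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return (size_t)((nr_cpus + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)
		* MODEL_SIZEOF_LONG;
}

/* Assumption: cpumask plus a 4-byte refcount, rounded up to 8 bytes. */
static size_t model_bpf_cpumask_size(unsigned int nr_cpus)
{
	size_t raw = model_cpumask_size(nr_cpus) + 4;

	return (raw + 7) & ~(size_t)7;
}

/* Bounds check in the spirit of the verifier message above:
 * the whole access must fit inside the map value. */
static bool model_access_in_bounds(size_t value_size, size_t off, size_t size)
{
	return off + size <= value_size;
}
```

With the numbers from the log, model_access_in_bounds(9248, 9144, 128) is false because 9144 + 128 = 9272 exceeds 9248, while the same offset with a 64-byte access (the build-time cpumask size) would fit.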

Fix

Allocate the per-cpdom masks dynamically (commit 1):

Replace the by-value global array with a BPF_MAP_TYPE_ARRAY of
struct cpdom_cpumask_wrapper { struct bpf_cpumask __kptr *mask; }.
Create one bpf_cpumask per domain at init via bpf_cpumask_create() /
bpf_kptr_xchg() in init_cpdom_cpumasks(), called from lavd_init() before
the per-CPU init populates them.
Reads go through a lookup_cpdom_cpumask() helper; the in-place
set_cpu/clear_cpu sites keep their existing bpf_rcu_read_lock() sections.
The kernel now allocates cpumask storage for the running nr_cpu_ids, and the
map value holds only an 8-byte pointer — independent of build-time NR_CPUS.
This mirrors how scx_lavd's other masks (turbo/big/active/ovrflw/
steady) and scx_layered's layer_cpumasks are already handled.

Commit 2 moves the map (as a __weak map) and lookup_cpdom_cpumask() (now
static __always_inline) into lavd.bpf.h, so the cross-TU readers in
idle.bpf.c/preempt.bpf.c inline the lookup instead of emitting a BPF-to-BPF
call. This matches the map-in-header idiom already used by lib/percpu.h.
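
The indirection can be illustrated outside BPF. The following is a user-space model only, assuming malloc-style allocation in place of bpf_cpumask_create() / bpf_kptr_xchg() and a plain array in place of the BPF_MAP_TYPE_ARRAY. The model_* names and the domain count of 4 are hypothetical. It shows the property the fix relies on: each slot holds just a pointer, and the mask behind it is sized from the CPU count seen at run time.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_CPDOM_MAX_NR 4	/* hypothetical; stands in for LAVD_CPDOM_MAX_NR */

/* Stand-in for struct cpdom_cpumask_wrapper: the slot holds only a pointer. */
struct model_wrapper {
	unsigned long *mask;
};

static struct model_wrapper model_map[MODEL_CPDOM_MAX_NR];

/* Allocate one zeroed mask per domain, sized for the run-time CPU count.
 * Returns 0 on success, -1 on allocation failure. */
static int model_init_cpdom_cpumasks(unsigned int nr_cpu_ids)
{
	size_t bits_per_long = 8 * sizeof(unsigned long);
	size_t words = (nr_cpu_ids + bits_per_long - 1) / bits_per_long;

	assert(nr_cpu_ids > 0);
	for (int i = 0; i < MODEL_CPDOM_MAX_NR; i++) {
		free(model_map[i].mask);	/* free(NULL) is a no-op */
		model_map[i].mask = calloc(words, sizeof(unsigned long));
		if (!model_map[i].mask)
			return -1;
	}
	return 0;
}

/* Bounds-checked lookup; NULL for an out-of-range id or an unpopulated slot. */
static unsigned long *model_lookup_cpdom_cpumask(unsigned int cpdom_id)
{
	if (cpdom_id >= MODEL_CPDOM_MAX_NR)
		return NULL;
	return model_map[cpdom_id].mask;
}
```

Calling model_init_cpdom_cpumasks(512) or model_init_cpdom_cpumasks(8192) changes only the size of the allocations; sizeof(struct model_wrapper) and the array stride stay at one pointer either way.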

Testing

All on an NR_CPUS=1024 kernel (6.19), with objects built from
NR_CPUS-mismatched headers to reproduce the failure.
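Verification (veristat). Before: the 5 ops above fail to load. After: all
programs verify. A 3-way comparison (by-value / kptr+helper / kptr+inline, built
against the same header) shows the verifier cost is neutral-to-mixed: relative
to by-value, kptr+inline is about 0.9% lower for lavd_select_cpu, about 7.5%
higher for lavd_enqueue, and about 0.7% higher for lavd_init. The inline commit
recovers the helper-call overhead fully for lavd_select_cpu and partly for
lavd_enqueue: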

| program | by-value | kptr+helper | kptr+inline |
|---|---|---|---|
| `lavd_select_cpu` | 133366 | 137593 | 132139 |
| `lavd_enqueue` | 123778 | 136876 | 133101 |
| `lavd_init` | 79084 | 79629 | 79629 |

(processed-instruction counts; states track the same trend.)

Runtime soak. Attached on the `NR_CPUS=1024` host for 30 minutes:
clean attach, system-wide scheduling throughout (including a load spike to ~118),
clean detach, no verifier rejection, no stall.
Runtime A/B (virtme-ng, 8 pinned vCPUs, interleaved by-value vs kptr+inline).
No statistically significant regression:

perf bench sched messaging (n=16/arm): +0.34% median (within noise, CV ~2%).
perf bench sched pipe (n=32/arm): +4.2% median but well under 1σ
(σ ≈ 1.3 µs vs a 0.28 µs gap) — not significant. Consistent with the change's
runtime cost being a single inlined pointer load.
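
As a quick check on the pipe numbers, the arithmetic can be written out. This is only a sketch: it assumes the 0.28 µs gap is the difference between the two medians and that +4.2% is that gap relative to the by-value median, which the figures suggest but the results do not state outright. The function names are illustrative.

```c
#include <assert.h>

/* Ratio of the observed gap to the run-to-run standard deviation. */
static double gap_in_sigmas(double gap_us, double sigma_us)
{
	assert(sigma_us > 0.0);
	return gap_us / sigma_us;
}

/* Baseline median implied by an absolute gap and a relative change
 * (relative_change is a fraction, e.g. 0.042 for +4.2%). */
static double implied_baseline_us(double gap_us, double relative_change)
{
	assert(relative_change > 0.0);
	return gap_us / relative_change;
}
```

Under those assumptions, gap_in_sigmas(0.28, 1.3) is roughly 0.2, and implied_baseline_us(0.28, 0.042) puts the by-value median near 6.7 µs per operation.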

Cross-NR_CPUS portability (the core claim). Built test kernels with
NR_CPUS=64, NR_CPUS=512, and MAXSMP (NR_CPUS=8192 +
CONFIG_CPUMASK_OFFSTACK=y), plus the NR_CPUS=1024 host kernel. A single
scheduler binary built against a 512-cpumask header was attached on each, with
no rebuild:

| kernel (`sizeof(struct cpumask)`) | by-value | this patch |
|---|---|---|
| `NR_CPUS=64` (8 B) | attaches | attaches |
| `NR_CPUS=512` (64 B) | attaches | attaches |
| `NR_CPUS=1024` (128 B) | fails to load | attaches |
| `MAXSMP=8192` + offstack (1024 B) | fails to load | attaches |

The by-value build is rejected by the verifier wherever the running kernel's
cpumask is larger than the build-time one (and CONFIG_CPUMASK_OFFSTACK does not
help, since it doesn't change sizeof(struct cpumask)). The two binaries differ
only in cpdom_cpumask handling, so this isolates the fix as both necessary and
sufficient for a single NR_CPUS-independent binary.
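
The pattern in the table follows from one comparison. The sketch below encodes the rule stated above (the by-value build is rejected wherever the running kernel's cpumask is larger than the build-time one) as a small user-space predicate. It assumes a 64-bit target, and the tbl_* and by_value_build_loads names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* sizeof(struct cpumask) for a given NR_CPUS, assuming 64-bit longs. */
static size_t tbl_cpumask_bytes(unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return (size_t)((nr_cpus + 63) / 64) * 8;
}

/* Model of the by-value outcome: the object loads only when the running
 * kernel's cpumask is no larger than the one it was built against. */
static bool by_value_build_loads(unsigned int build_nr_cpus,
				 unsigned int running_nr_cpus)
{
	return tbl_cpumask_bytes(running_nr_cpus) <=
	       tbl_cpumask_bytes(build_nr_cpus);
}
```

For the 512-cpumask build used here, the predicate is true for running kernels with NR_CPUS=64 and NR_CPUS=512 and false for NR_CPUS=1024 and NR_CPUS=8192, matching the by-value column.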

Note

Independent of this change, LAVD_CPU_ID_MAX = 512 (intf.h) still caps the
manual cpdom_ctx.__cpumask[] bitmap at 512 CPUs; systems with more online CPUs
would need that raised separately.
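
To make the remaining cap concrete, here is a minimal user-space model of a fixed 512-bit bitmap. It assumes the bitmap is stored as 64-bit words with one bit per CPU id. The actual layout of cpdom_ctx.__cpumask[] is not shown in this PR, so treat the storage details and the model_* names as assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_CPU_ID_MAX 512	/* mirrors LAVD_CPU_ID_MAX = 512 */

/* Assumption: one bit per CPU id, packed into 64-bit words (8 words total). */
static uint64_t model_bitmap[MODEL_CPU_ID_MAX / 64];

/* Set the bit for cpu; return false if the id does not fit in the bitmap. */
static bool model_set_cpu(unsigned int cpu)
{
	if (cpu >= MODEL_CPU_ID_MAX)
		return false;
	model_bitmap[cpu / 64] |= (uint64_t)1 << (cpu % 64);
	return true;
}
```

In this model CPU 511 lands in the top bit of the last word, and CPU 512 has nowhere to go regardless of how the kptr cpumasks are sized.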

…_CPUS

cpdom_cpumask is a file-scope array of struct bpf_cpumask embedded by value:

    private(LAVD) struct bpf_cpumask cpdom_cpumask[LAVD_CPDOM_MAX_NR];

struct bpf_cpumask embeds struct cpumask, whose size is
BITS_TO_LONGS(NR_CPUS) * sizeof(long) -- fixed at the NR_CPUS the scheduler is
compiled against. Because the array lives in global (.data) storage, its
per-element stride and total size are baked into the BPF object at build time,
while the verifier validates the program's accesses against the running
kernel's struct cpumask size. When the scheduler is built against a kernel
whose NR_CPUS differs from the kernel it runs on, the two no longer agree and
the program fails to load.

Concretely, a scx_lavd built against NR_CPUS=512
(sizeof(struct bpf_cpumask) == 72) and loaded on an NR_CPUS=1024 kernel
(== 136) is rejected by the verifier on every op that touches the array
(lavd_select_cpu, lavd_enqueue, lavd_init, lavd_cpu_online, lavd_cpu_offline):

    ... map value, value_size=9248 off=9144 size=128
    R2 max value is outside of the allowed memory range

i.e. the 128-byte cpumask access to the last element runs past the
build-time-sized .data storage. This makes the scheduler binary silently
non-portable across kernels with different NR_CPUS.

Store the per-cpdom masks as kptr cpumasks allocated at init via
bpf_cpumask_create() and held in a BPF_MAP_TYPE_ARRAY of wrapper structs, so
the cpumask storage is allocated by the kernel for the running nr_cpu_ids and
the map value holds only an 8-byte pointer -- independent of build-time
NR_CPUS. Reads go through a new lookup_cpdom_cpumask() helper; the in-place
set/clear sites keep their existing bpf_rcu_read_lock() sections. This mirrors
how scx_lavd's other masks (turbo/big/active/ovrflw/steady) and scx_layered's
layer_cpumasks are already managed.

Tested on an NR_CPUS=1024 kernel with an object built against NR_CPUS=512
headers: previously failed to load, now verifies, attaches, and schedules
cleanly.

Signed-off-by: Kyle McMartin <jkkm@meta.com>

The previous change exposed the per-cpdom cpumask map via a global
lookup_cpdom_cpumask() helper. Because it is a non-static cross-TU function,
callers in the pick-idle / enqueue / preemption paths (idle.bpf.c,
preempt.bpf.c) emit a BPF-to-BPF call per lookup instead of an inlined index,
as the by-value array did.

Move the array map (as a __weak map) and lookup_cpdom_cpumask() (now
static __always_inline) into lavd.bpf.h so the lookup -- an inlined array-map
lookup plus one kptr load -- is inlined at every call site. This matches the
map-in-header idiom already used by lib/percpu.h.

No functional change intended.

Signed-off-by: Kyle McMartin <jkkm@meta.com>

Kyle McMartin added 2 commits June 5, 2026 20:56

jkkm wants to merge 2 commits into sched-ext:main from jkkm:lavd-cpdom-cpumask-kptr
