scx_lavd: decouple per-cpdom cpumasks from build-time NR_CPUS #3633
Open
jkkm wants to merge 2 commits into
added 2 commits
June 5, 2026 20:56
…_CPUS
cpdom_cpumask is a file-scope array of struct bpf_cpumask embedded by
value:
    private(LAVD) struct bpf_cpumask cpdom_cpumask[LAVD_CPDOM_MAX_NR];
struct bpf_cpumask embeds struct cpumask, whose size is
BITS_TO_LONGS(NR_CPUS) * sizeof(long) -- fixed at the NR_CPUS the
scheduler is compiled against. Because the array lives in global
(.data) storage, its per-element stride and total size are baked into
the BPF object at build time, while the verifier validates the
program's accesses against the running kernel's struct cpumask size.
When the scheduler is built against a kernel whose NR_CPUS differs from
the kernel it runs on, the two no longer agree and the program fails to
load. Concretely, a scx_lavd built against NR_CPUS=512 (sizeof(struct
bpf_cpumask) == 72) and loaded on an NR_CPUS=1024 kernel (== 136) is
rejected by the verifier on every op that touches the array
(lavd_select_cpu, lavd_enqueue, lavd_init, lavd_cpu_online,
lavd_cpu_offline):
    ... map value, value_size=9248 off=9144 size=128
    R2 max value is outside of the allowed memory range
i.e. the 128-byte cpumask access to the last element runs past the
build-time-sized .data storage. This makes the scheduler binary
silently non-portable across kernels with different NR_CPUS.
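The size arithmetic above can be checked with a small userspace model. This is only a sketch, not kernel code: it assumes a 64-bit kernel (8-byte long) and assumes struct bpf_cpumask is a struct cpumask followed by a 4-byte refcount padded to 8-byte alignment, which is consistent with the 72 and 136 byte figures quoted above. All model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-bit kernel, so one "long" holds 64 CPU bits. */
#define MODEL_BITS_PER_LONG 64u
#define MODEL_BITS_TO_LONGS(n) \
	(((n) + MODEL_BITS_PER_LONG - 1u) / MODEL_BITS_PER_LONG)

/* Models sizeof(struct cpumask) for a given NR_CPUS. */
static size_t model_cpumask_size(unsigned int nr_cpus)
{
	return MODEL_BITS_TO_LONGS(nr_cpus) * sizeof(uint64_t);
}

/*
 * Models sizeof(struct bpf_cpumask): the cpumask plus an assumed 4-byte
 * refcount, rounded up to the 8-byte alignment of the embedded longs.
 */
static size_t model_bpf_cpumask_size(unsigned int nr_cpus)
{
	size_t raw = model_cpumask_size(nr_cpus) + sizeof(uint32_t);

	return (raw + 7u) & ~(size_t)7u;
}

/* Models the range check on a map-value access: 1 if in bounds. */
static int model_access_in_bounds(size_t value_size, size_t off, size_t size)
{
	return off + size <= value_size;
}

/* Re-checks the figures from the commit message against the model. */
static void model_check_commit_figures(void)
{
	assert(model_bpf_cpumask_size(512) == 72);
	assert(model_bpf_cpumask_size(1024) == 136);
	/* value_size=9248 off=9144 size=128 ends at 9272, past the value. */
	assert(!model_access_in_bounds(9248, 9144, 128));
}
```

Calling model_check_commit_figures() passes silently when the model agrees with the numbers in the verifier log.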
Store the per-cpdom masks as kptr cpumasks allocated at init via
bpf_cpumask_create() and held in a BPF_MAP_TYPE_ARRAY of wrapper
structs, so the cpumask storage is allocated by the kernel for the
running nr_cpu_ids and the map value holds only an 8-byte pointer --
independent of build-time NR_CPUS. Reads go through a new
lookup_cpdom_cpumask() helper; the in-place set/clear sites keep their
existing bpf_rcu_read_lock() sections. This mirrors how scx_lavd's
other masks (turbo/big/active/ovrflw/steady) and scx_layered's
layer_cpumasks are already managed.
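The effect of the indirection can be illustrated in plain userspace C. This is a rough analogy under stated assumptions, not the BPF implementation: calloc() stands in for bpf_cpumask_create(), the free-and-replace step stands in for swapping a kptr into the map and releasing the old one, and MODEL_CPDOM_MAX_NR and the model_* names are hypothetical. The point it shows is that the array element is one pointer wide regardless of how many CPUs the mask covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_CPDOM_MAX_NR 8	/* hypothetical stand-in for LAVD_CPDOM_MAX_NR */

/* Array element: a single pointer, so its size never depends on NR_CPUS. */
struct model_cpdom_wrapper {
	uint64_t *mask;
};

static struct model_cpdom_wrapper model_cpdom_map[MODEL_CPDOM_MAX_NR];

/*
 * Allocate one zeroed mask per domain, sized for the CPU count seen at
 * run time. Returns 0 on success, -1 on a bad argument or failed allocation.
 */
static int model_init_cpdom_cpumasks(unsigned int nr_cpu_ids)
{
	size_t words = ((size_t)nr_cpu_ids + 63u) / 64u;

	if (words == 0)
		return -1;

	for (size_t i = 0; i < MODEL_CPDOM_MAX_NR; i++) {
		uint64_t *mask = calloc(words, sizeof(*mask));

		if (!mask)
			return -1;
		/* Install the new mask and drop whatever was there before. */
		free(model_cpdom_map[i].mask);
		model_cpdom_map[i].mask = mask;
	}
	return 0;
}

/* Bounds-checked lookup; NULL for an invalid domain or an unset slot. */
static inline uint64_t *model_lookup_cpdom_cpumask(unsigned int cpdom_id)
{
	if (cpdom_id >= MODEL_CPDOM_MAX_NR)
		return NULL;
	return model_cpdom_map[cpdom_id].mask;
}
```

In this model, callers must handle a NULL return from the lookup, much as BPF callers must NULL-check a map lookup result before using it.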
Tested on an NR_CPUS=1024 kernel with an object built against
NR_CPUS=512 headers: previously failed to load, now verifies, attaches,
and schedules cleanly.
Signed-off-by: Kyle McMartin <jkkm@meta.com>
The previous change exposed the per-cpdom cpumask map via a global
lookup_cpdom_cpumask() helper. Because it is a non-static cross-TU
function, callers in the pick-idle / enqueue / preemption paths
(idle.bpf.c, preempt.bpf.c) emit a BPF-to-BPF call per lookup, whereas
the by-value array compiled to an inlined index.
Move the array map (as a __weak map) and lookup_cpdom_cpumask() (now
static __always_inline) into lavd.bpf.h so the lookup -- an array-map
lookup plus one kptr load -- is inlined at every call site. This matches
the map-in-header idiom already used by lib/percpu.h.
No functional change intended.
Signed-off-by: Kyle McMartin <jkkm@meta.com>
Two commits:
1. scx_lavd: allocate per-cpdom cpumasks dynamically to decouple from NR_CPUS
2. scx_lavd: inline lookup_cpdom_cpumask() to avoid a hot-path call

Problem
scx_lavd stores its per-compute-domain online CPU masks in a file-scope array
of struct bpf_cpumask embedded by value. struct bpf_cpumask embeds
struct cpumask, whose size is BITS_TO_LONGS(NR_CPUS) * sizeof(long), fixed at
the NR_CPUS the scheduler is compiled against. Because the array lives in
global (.data) storage, its element stride and total size are frozen into the
BPF object at build time, while the verifier checks the program's accesses
against the running kernel's struct cpumask size.

If those disagree, the scheduler won't load. A scx_lavd built against
NR_CPUS=512 (sizeof(struct bpf_cpumask) == 72) and loaded on an NR_CPUS=1024
kernel (== 136) is rejected by the verifier on every op that touches the
array: lavd_select_cpu, lavd_enqueue, lavd_init, lavd_cpu_online, and
lavd_cpu_offline. The 128-byte cpumask access to the last element runs past
the build-time-sized .data storage. In effect the prebuilt binary is silently
tied to one NR_CPUS value, which breaks shipping a single scx_lavd across
kernels that differ in NR_CPUS.

Fix
Allocate the per-cpdom masks dynamically (commit 1):
- Hold the masks in a BPF_MAP_TYPE_ARRAY of
  struct cpdom_cpumask_wrapper { struct bpf_cpumask __kptr *mask; }.
- Allocate one bpf_cpumask per domain at init via bpf_cpumask_create() /
  bpf_kptr_xchg() in init_cpdom_cpumasks(), called from lavd_init() before
  the per-CPU init populates them.
- Route reads through a new lookup_cpdom_cpumask() helper; the in-place
  set_cpu / clear_cpu sites keep their existing bpf_rcu_read_lock() sections.

The kernel now allocates cpumask storage for the running nr_cpu_ids, and the
map value holds only an 8-byte pointer, independent of build-time NR_CPUS.
This mirrors how scx_lavd's other masks (turbo/big/active/ovrflw/steady) and
scx_layered's layer_cpumasks are already handled.

Commit 2 moves the map (as a __weak map) and lookup_cpdom_cpumask() (now
static __always_inline) into lavd.bpf.h, so the cross-TU readers in
idle.bpf.c / preempt.bpf.c inline the lookup instead of emitting a BPF-to-BPF
call. This matches the map-in-header idiom already used by lib/percpu.h.

Testing
All on an NR_CPUS=1024 kernel (6.19), with objects built from
NR_CPUS-mismatched headers to reproduce the failure.

Verification (veristat). Before: the 5 ops above fail to load. After: all
programs verify. A 3-way comparison (by-value / kptr+helper / kptr+inline, built
against the same header) shows the verifier cost is neutral-to-mixed and the
inline commit recovers the helper-call overhead:
[Table: veristat results for lavd_select_cpu, lavd_enqueue, and lavd_init
across the three builds; the values did not survive extraction.]

... NR_CPUS=1024 host for 30 minutes:
- perf bench sched messaging (n=16/arm): +0.34% median (within noise, CV ~2%).
- perf bench sched pipe (n=32/arm): +4.2% median but well under 1σ
  (σ ≈ 1.3 µs vs a 0.28 µs gap), so not significant.
Both results are consistent with the change's runtime cost being a single
inlined pointer load.

Cross-NR_CPUS portability (the core claim). Built test kernels with NR_CPUS=64,
NR_CPUS=512, and MAXSMP (NR_CPUS=8192 + CONFIG_CPUMASK_OFFSTACK=y), plus the
NR_CPUS=1024 host kernel. A single scheduler binary built against a
512-cpumask header was attached on each, with no rebuild:
[Table: per-kernel load results; only the kernel column survived extraction.]
Kernels tested, with sizeof(struct cpumask):
- NR_CPUS=64 (8 B)
- NR_CPUS=512 (64 B)
- NR_CPUS=1024 (128 B)
- MAXSMP=8192 + offstack (1024 B)

The by-value build is rejected by the verifier wherever the running kernel's
cpumask is larger than the build-time one (and CONFIG_CPUMASK_OFFSTACK does not
help, since it doesn't change sizeof(struct cpumask)). The two binaries differ
only in cpdom_cpumask handling, so this isolates the fix as both necessary and
sufficient for a single NR_CPUS-independent binary.
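The rejection rule stated above (the by-value build fails wherever the running kernel's cpumask is larger than the build-time one) can be written as a small predicate. This is a userspace sketch of that sentence only, assuming 8-byte longs; it does not reproduce the measured per-kernel results, and the model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Models sizeof(struct cpumask) in bytes, assuming 8-byte longs. */
static size_t model_cpumask_bytes(unsigned int nr_cpus)
{
	return (((size_t)nr_cpus + 63u) / 64u) * 8u;
}

/*
 * By-value build: expected to load only if the running kernel's cpumask
 * is no larger than the one the object was built against.
 */
static int model_by_value_loads(unsigned int build_nr_cpus,
				unsigned int running_nr_cpus)
{
	return model_cpumask_bytes(running_nr_cpus) <=
	       model_cpumask_bytes(build_nr_cpus);
}

/* Counts the kernels in running[] that a by-value build would load on. */
static unsigned int model_count_loadable(unsigned int build_nr_cpus,
					 const unsigned int *running, size_t n)
{
	unsigned int ok = 0;

	for (size_t i = 0; i < n; i++) {
		assert(running[i] > 0);
		if (model_by_value_loads(build_nr_cpus, running[i]))
			ok++;
	}
	return ok;
}
```

Under this rule, a by-value object built against NR_CPUS=512 would be expected to load on the NR_CPUS=64 and NR_CPUS=512 kernels and to be rejected on NR_CPUS=1024 and MAXSMP, whereas the kptr build has no such dependency because its map value is a pointer.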
Note
Independent of this change, LAVD_CPU_ID_MAX = 512 (intf.h) still caps the
manual cpdom_ctx.__cpumask[] bitmap at 512 CPUs; systems with more online CPUs
would need that raised separately.