scx_lavd: decouple per-cpdom cpumasks from build-time NR_CPUS #3633

Open: jkkm wants to merge 2 commits into sched-ext:main from jkkm:lavd-cpdom-cpumask-kptr
jkkm commented Jun 8, 2026

Two commits:

  1. scx_lavd: allocate per-cpdom cpumasks dynamically to decouple from NR_CPUS
  2. scx_lavd: inline lookup_cpdom_cpumask() to avoid a hot-path call

Problem

scx_lavd stores its per-compute-domain online CPU masks in a file-scope
array of struct bpf_cpumask embedded by value:

    private(LAVD) struct bpf_cpumask cpdom_cpumask[LAVD_CPDOM_MAX_NR];

struct bpf_cpumask embeds struct cpumask, whose size is
BITS_TO_LONGS(NR_CPUS) * sizeof(long), fixed at the NR_CPUS the scheduler is
compiled against. Because the array lives in global (.data) storage, its
element stride and total size are frozen into the BPF object at build time, while
the verifier checks the program's accesses against the running kernel's
struct cpumask size.

If the running kernel's cpumask is larger than the build-time one, the
scheduler won't load. A scx_lavd built against
NR_CPUS=512 (sizeof(struct bpf_cpumask) == 72) and loaded on an
NR_CPUS=1024 kernel (== 136) is rejected by the verifier on every op that
touches the array (lavd_select_cpu, lavd_enqueue, lavd_init,
lavd_cpu_online, lavd_cpu_offline):

    ... map value, value_size=9248 off=9144 size=128
    R2 max value is outside of the allowed memory range

The 128-byte cpumask access to the last element runs past the build-time-sized
.data storage (9144 + 128 = 9272 > 9248). In effect the prebuilt binary is
silently capped at its build-time NR_CPUS, which breaks shipping a single
scx_lavd across kernels that differ in NR_CPUS.
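
The sizes and the failing bounds check above can be reproduced with a few lines of ordinary userspace C. This is only a sketch of the arithmetic, not kernel or verifier code. It assumes 64-bit longs and models struct bpf_cpumask as a struct cpumask followed by a 4-byte reference count padded to 8-byte alignment, which is an assumption chosen because it matches the 72 and 136 figures quoted above. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BITS_PER_LONG 64 /* assumption: 64-bit kernel */

/* BITS_TO_LONGS(NR_CPUS) * sizeof(long), in bytes. */
static size_t cpumask_bytes(size_t nr_cpus)
{
	assert(nr_cpus > 0);
	return (nr_cpus + BITS_PER_LONG - 1) / BITS_PER_LONG * (BITS_PER_LONG / 8);
}

/* Assumed layout: cpumask + 4-byte refcount, rounded up to 8 bytes. */
static size_t bpf_cpumask_bytes(size_t nr_cpus)
{
	size_t raw = cpumask_bytes(nr_cpus) + 4;

	return (raw + 7) / 8 * 8;
}

/* Bounds rule modelled on the log line: the access must end inside the value. */
static int access_in_bounds(size_t value_size, size_t off, size_t size)
{
	return off + size <= value_size;
}
```

With these definitions, bpf_cpumask_bytes(512) is 72 and bpf_cpumask_bytes(1024) is 136, and access_in_bounds(9248, 9144, 128) is false, matching the rejected access in the log. The same offset with the build-time 64-byte cpumask would have been in bounds.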

Fix

Allocate the per-cpdom masks dynamically (commit 1):

  • Replace the by-value global array with a BPF_MAP_TYPE_ARRAY of
    struct cpdom_cpumask_wrapper { struct bpf_cpumask __kptr *mask; }.
  • Create one bpf_cpumask per domain at init via bpf_cpumask_create() /
    bpf_kptr_xchg() in init_cpdom_cpumasks(), called from lavd_init() before
    the per-CPU init populates them.
  • Reads go through a lookup_cpdom_cpumask() helper; the in-place
    set_cpu/clear_cpu sites keep their existing bpf_rcu_read_lock() sections.

The kernel now allocates cpumask storage for the running nr_cpu_ids, and the
map value holds only an 8-byte pointer, independent of build-time NR_CPUS.
This mirrors how scx_lavd's other masks (turbo/big/active/ovrflw/
steady) and scx_layered's layer_cpumasks are already handled.

Commit 2 moves the map (as a __weak map) and lookup_cpdom_cpumask() (now
static __always_inline) into lavd.bpf.h, so the cross-TU readers in
idle.bpf.c/preempt.bpf.c inline the lookup instead of emitting a BPF-to-BPF
call. This matches the map-in-header idiom already used by lib/percpu.h.
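
The shape of the fix can be illustrated with a userspace analogue. The sketch below is not BPF code and is not the patch itself: calloc() stands in for bpf_cpumask_create(), a plain pointer stands in for the __kptr field, and every name (mask_slot, init_masks, lookup_mask, CPDOM_MAX_NR) is hypothetical. What it shows is the property the fix relies on: the table element is one pointer wide regardless of how many CPUs each mask covers, and the mask size is chosen at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CPDOM_MAX_NR 4 /* hypothetical stand-in for LAVD_CPDOM_MAX_NR */
#define BITS_PER_WORD (8 * sizeof(unsigned long))

/* Models cpdom_cpumask_wrapper: the slot holds only a pointer. */
struct mask_slot {
	unsigned long *mask;
};

static struct mask_slot slots[CPDOM_MAX_NR];

/* Release every mask; safe to call on a partially initialised table. */
static void free_masks(void)
{
	for (size_t i = 0; i < CPDOM_MAX_NR; i++) {
		free(slots[i].mask);
		slots[i].mask = NULL;
	}
}

/* Allocate one zeroed mask per domain, sized for the runtime CPU count. */
static int init_masks(size_t nr_cpu_ids)
{
	size_t words = (nr_cpu_ids + BITS_PER_WORD - 1) / BITS_PER_WORD;

	assert(nr_cpu_ids > 0);
	for (size_t i = 0; i < CPDOM_MAX_NR; i++) {
		slots[i].mask = calloc(words, sizeof(unsigned long));
		if (!slots[i].mask) {
			free_masks();
			return -1;
		}
	}
	return 0;
}

/* Bounds-checked lookup; returns NULL for an out-of-range domain id. */
static inline unsigned long *lookup_mask(size_t cpdom_id)
{
	return cpdom_id < CPDOM_MAX_NR ? slots[cpdom_id].mask : NULL;
}

static void mask_set_cpu(unsigned long *mask, size_t cpu)
{
	mask[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
}

static int mask_test_cpu(const unsigned long *mask, size_t cpu)
{
	return (mask[cpu / BITS_PER_WORD] >> (cpu % BITS_PER_WORD)) & 1UL;
}
```

Calling init_masks(1024) and then mask_set_cpu(lookup_mask(1), 1000) works with the same compiled table layout that init_masks(64) would use, which is the userspace counterpart of one binary loading across kernels with different NR_CPUS.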

Testing

All on an NR_CPUS=1024 kernel (6.19), with objects built from
NR_CPUS-mismatched headers to reproduce the failure.

Verification (veristat). Before: the 5 ops above fail to load. After: all
programs verify. A 3-way comparison (by-value / kptr+helper / kptr+inline, built
against the same header) shows the verifier cost is neutral-to-mixed. The
inline commit recovers all of the helper-call overhead in lavd_select_cpu and
part of it in lavd_enqueue; lavd_init is unchanged between the two kptr variants:

| program         | by-value | kptr+helper | kptr+inline |
|-----------------|----------|-------------|-------------|
| lavd_select_cpu | 133366   | 137593      | 132139      |
| lavd_enqueue    | 123778   | 136876      | 133101      |
| lavd_init       | 79084    | 79629       | 79629       |

(Processed-instruction counts; states track the same trend.)
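
For readers who want the relative cost rather than raw counts, the deltas against the by-value baseline can be computed directly from the veristat numbers above. The snippet below is a small illustrative calculation; the struct and function names are hypothetical, and only the instruction counts come from this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One row of the veristat comparison (processed-instruction counts). */
struct veristat_row {
	const char *prog;
	long by_value, kptr_helper, kptr_inline;
};

static const struct veristat_row rows[] = {
	{ "lavd_select_cpu", 133366, 137593, 132139 },
	{ "lavd_enqueue",    123778, 136876, 133101 },
	{ "lavd_init",        79084,  79629,  79629 },
};

/* Percentage change of val relative to base. */
static double pct_change(long base, long val)
{
	assert(base > 0);
	return 100.0 * (double)(val - base) / (double)base;
}

/* Print the helper and inline deltas for every program in the table. */
static void print_deltas(void)
{
	for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%-16s helper %+.1f%%  inline %+.1f%%\n", rows[i].prog,
		       pct_change(rows[i].by_value, rows[i].kptr_helper),
		       pct_change(rows[i].by_value, rows[i].kptr_inline));
}
```

Relative to by-value, this gives roughly +3.2% (helper) and -0.9% (inline) for lavd_select_cpu, +10.6% and +7.5% for lavd_enqueue, and +0.7% for lavd_init in both kptr variants.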

Runtime soak. Attached on the NR_CPUS=1024 host for 30 minutes:
clean attach, system-wide scheduling throughout (including a load spike to ~118),
clean detach, no verifier rejection, no stall.

Runtime A/B (virtme-ng, 8 pinned vCPUs, interleaved by-value vs kptr+inline).
No statistically significant regression:
  • perf bench sched messaging (n=16/arm): +0.34% median (within noise, CV ~2%).
  • perf bench sched pipe (n=32/arm): +4.2% median, but the 0.28 µs gap is well
    under 1σ (σ ≈ 1.3 µs), so not significant.

Both results are consistent with the change's runtime cost being a single
inlined pointer load.

Cross-NR_CPUS portability (the core claim). Built test kernels with
NR_CPUS=64, NR_CPUS=512, and MAXSMP (NR_CPUS=8192 +
CONFIG_CPUMASK_OFFSTACK=y), plus the NR_CPUS=1024 host kernel. A single
scheduler binary built against a 512-cpumask header was attached on each, with
no rebuild:

| kernel (sizeof(struct cpumask))  | by-value      | this patch |
|----------------------------------|---------------|------------|
| NR_CPUS=64 (8 B)                 | attaches      | attaches   |
| NR_CPUS=512 (64 B)               | attaches      | attaches   |
| NR_CPUS=1024 (128 B)             | fails to load | attaches   |
| MAXSMP=8192 + offstack (1024 B)  | fails to load | attaches   |

The by-value build is rejected by the verifier wherever the running kernel's
cpumask is larger than the build-time one (and CONFIG_CPUMASK_OFFSTACK does not
help, since it doesn't change sizeof(struct cpumask)). The two binaries differ
only in cpdom_cpumask handling, so this isolates the fix as both necessary and
sufficient for a single NR_CPUS-independent binary.

Note

Independent of this change, LAVD_CPU_ID_MAX = 512 (intf.h) still caps the
manual cpdom_ctx.__cpumask[] bitmap at 512 CPUs; systems with more online CPUs
would need that raised separately.
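
To make the remaining cap concrete, here is a userspace model of a fixed-size bitmap sized by a compile-time maximum. It assumes the manual bitmap is an array of 64-bit words with LAVD_CPU_ID_MAX / 64 entries; that layout, and the names cpdom_model, model_set_cpu and model_test_cpu, are assumptions for illustration and are not taken from intf.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPU_ID_MAX 512 /* mirrors the LAVD_CPU_ID_MAX value quoted above */
#define MASK_WORDS (CPU_ID_MAX / 64)

static_assert(CPU_ID_MAX % 64 == 0, "bitmap must be a whole number of words");

/* Hypothetical stand-in for the fixed bitmap embedded in cpdom_ctx. */
struct cpdom_model {
	uint64_t cpumask[MASK_WORDS];
};

/* Returns false when the CPU id cannot be represented in the bitmap. */
static bool model_set_cpu(struct cpdom_model *d, unsigned int cpu)
{
	if (cpu >= CPU_ID_MAX)
		return false;
	d->cpumask[cpu / 64] |= UINT64_C(1) << (cpu % 64);
	return true;
}

static bool model_test_cpu(const struct cpdom_model *d, unsigned int cpu)
{
	if (cpu >= CPU_ID_MAX)
		return false;
	return (d->cpumask[cpu / 64] >> (cpu % 64)) & 1;
}
```

In this model CPU 511 is the last representable id and CPU 512 is rejected, which is why raising the limit is a separate change from the kptr conversion.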

Kyle McMartin added 2 commits June 5, 2026 20:56
…_CPUS

cpdom_cpumask is a file-scope array of struct bpf_cpumask embedded by
value:

    private(LAVD) struct bpf_cpumask cpdom_cpumask[LAVD_CPDOM_MAX_NR];

struct bpf_cpumask embeds struct cpumask, whose size is
BITS_TO_LONGS(NR_CPUS) * sizeof(long) -- fixed at the NR_CPUS the
scheduler is compiled against. Because the array lives in global
(.data) storage, its per-element stride and total size are baked into
the BPF object at build time, while the verifier validates the
program's accesses against the running kernel's struct cpumask size.

When the scheduler is built against a kernel whose NR_CPUS differs from
the kernel it runs on, the two no longer agree and the program fails to
load. Concretely, a scx_lavd built against NR_CPUS=512 (sizeof(struct
bpf_cpumask) == 72) and loaded on an NR_CPUS=1024 kernel (== 136) is
rejected by the verifier on every op that touches the array
(lavd_select_cpu, lavd_enqueue, lavd_init, lavd_cpu_online,
lavd_cpu_offline):

    ... map value, value_size=9248 off=9144 size=128
    R2 max value is outside of the allowed memory range

i.e. the 128-byte cpumask access to the last element runs past the
build-time-sized .data storage. This makes the scheduler binary
silently non-portable across kernels with different NR_CPUS.

Store the per-cpdom masks as kptr cpumasks allocated at init via
bpf_cpumask_create() and held in a BPF_MAP_TYPE_ARRAY of wrapper
structs, so the cpumask storage is allocated by the kernel for the
running nr_cpu_ids and the map value holds only an 8-byte pointer --
independent of build-time NR_CPUS. Reads go through a new
lookup_cpdom_cpumask() helper; the in-place set/clear sites keep their
existing bpf_rcu_read_lock() sections. This mirrors how scx_lavd's
other masks (turbo/big/active/ovrflw/steady) and scx_layered's
layer_cpumasks are already managed.

Tested on an NR_CPUS=1024 kernel with an object built against
NR_CPUS=512 headers: previously failed to load, now verifies, attaches,
and schedules cleanly.

Signed-off-by: Kyle McMartin <jkkm@meta.com>

The previous change exposed the per-cpdom cpumask map via a global
lookup_cpdom_cpumask() helper. Because it is a non-static cross-TU
function, callers in the pick-idle / enqueue / preemption paths
(idle.bpf.c, preempt.bpf.c) emit a BPF-to-BPF call per lookup instead
of an inlined index, as the by-value array did.

Move the array map (as a __weak map) and lookup_cpdom_cpumask() (now
static __always_inline) into lavd.bpf.h so the lookup -- an inlined
array-map lookup plus one kptr load -- is inlined at every call site.
This matches the map-in-header idiom already used by lib/percpu.h.

No functional change intended.

Signed-off-by: Kyle McMartin <jkkm@meta.com>