scx_soft_domain: through topology-aware scheduling to improve perf #3612
CyangCyang wants to merge 1 commit into
sirlucjan
left a comment
Could you move scheduler to: https://github.com/sched-ext/scx/tree/main/scheds/experimental please?
Can you fix these build errors? Thanks.
Still doesn't compile.
Since I'm working with a relatively old kernel version, a few APIs are not consistent with the latest upstream. I've just modified them.
lucjan at cachyos ~/Pobrane/scx/target/release 7:41:10 fd86c33f main
❯ sudo ./scx_soft_domain
[sudo] password for lucjan:
07:41:24 [INFO] libbpf: struct_ops soft_domain_ops: member sub_attach not found in kernel, skipping it as it's set to zero
07:41:24 [INFO] libbpf: struct_ops soft_domain_ops: member sub_detach not found in kernel, skipping it as it's set to zero
07:41:24 [INFO] libbpf: struct_ops soft_domain_ops: member sub_cgroup_id not found in kernel, skipping it as it's set to zero
07:41:24 [WARN] libbpf: map 'soft_domain_ops': BPF map skeleton link is uninitialized
07:41:24 [INFO] scx_soft_domain scheduler initialized
07:41:24 [INFO] scx_soft_domain scheduler started
^CReceived Ctrl+C, exiting...
All schedulers in the repo display the version number (and the Git commit ID), which makes debugging easier.
lucjan at cachyos ~/Pobrane/scx/target/release 7:41:46 fd86c33f main
❯ sudo scx_pandemonium
[07:42:21] [INFO] scx_pandemonium 5.12.0 x86_64-unknown-linux-gnu SMT on
[07:42:21] [INFO] CPUS: 16 (governor: performance)
[07:42:21] [INFO] VERBOSE: false
libbpf: struct_ops pandemonium_ops: member sub_attach not found in kernel, skipping it as it's set to zero
libbpf: struct_ops pandemonium_ops: member sub_detach not found in kernel, skipping it as it's set to zero
libbpf: struct_ops pandemonium_ops: member sub_cgroup_id not found in kernel, skipping it as it's set to zero
[07:42:22] [INFO] L2 GROUP 0: [0,8]
[07:42:22] [INFO] L2 GROUP 1: [1,9]
[07:42:22] [INFO] L2 GROUP 2: [2,10]
[07:42:22] [INFO] L2 GROUP 3: [3,11]
[07:42:22] [INFO] L2 GROUP 4: [4,12]
[07:42:22] [INFO] L2 GROUP 5: [5,13]
[07:42:22] [INFO] L2 GROUP 6: [6,14]
[07:42:22] [INFO] L2 GROUP 7: [7,15]
[07:42:22] [INFO] L2 GROUPS: 8 across 16 CPUs, 1 SOCKETS
[07:42:22] [INFO] LLC DOMAINS: 1 (SAME-CCX RUNG INACTIVE -- MONOLITHIC L3)
[07:42:22] [INFO] TOPOLOGY SPECTRUM: lambda2=16.0000 tau=10ms codel_eq=8000us
[07:42:22] [INFO] RESISTANCE AFFINITY: CPU 0 rank: CPU8(R=0.001), CPU1(R=0.063), CPU2(R=0.063)
[07:42:22] [INFO] RESISTANCE AFFINITY: R_eff L2=0.0010 non-L2=0.0630 ratio=63.4x
[07:42:22] [INFO] PANDEMONIUM IS ACTIVE (CTRL+C TO EXIT)
[07:42:22] [INFO] PROCDB: LOADED 4 PROFILES FROM /var/lib/pandemonium/procdb.bin
^C[07:42:27] [INFO] PROCDB: SAVED 4/33 PROFILES TO /var/lib/pandemonium/procdb.bin
[KNOBS] regime=MIXED slice_ns=1000000 batch_ns=15000000 preempt_ns=749969 mwu=0.996 ticks=L:0/M:5/H:0 frozen=0 l2_hit=B:95%/I:80%/L:0% xccx_scatter_pct=0 xccx_sel_tight=0 xccx_sel_sync=0 xccx_sel_normal=0 xccx_sel_dfl=0 xccx_enq_t1=0 xccx_enq_t2=0 xccx_steal=0 xccx_step5=0
[07:42:27] [INFO] PANDEMONIUM IS SHUTTING DOWN
PANDEMONIUM SUMMARY
TOTAL DISPATCHES: 2750
TOTAL IDLE HITS: 1190
TOTAL SHARED: 185
TOTAL PREEMPT: 0
TOTAL KEEP_RUN: 0
PEAK DISPATCH/S: 634
AVG DISPATCH/S: 687
IDLE HIT RATE: 43.3%
ELAPSED: 4.0s
SAMPLES: 5
[07:42:27] [INFO] Shutdown complete
lucjan at cachyos ~/Pobrane/scx/target/release 7:42:27 fd86c33f main
❯ sudo scx_cake
[2026-06-17T05:42:35Z WARN scx_cake] SCX_CAKE_ENQ_KICK_IDLE=1 but booted kernel lacks SCX_ENQ_KICK_IDLE; using explicit kick
[2026-06-17T05:42:36Z INFO scx_cake] IRQ-avoid: enabled but no IRQ-noisy CPUs detected
[2026-06-17T05:42:36Z INFO scx_cake] Topology Strategy: Per-CPU local-first dispatch
[2026-06-17T05:42:36Z INFO scx_cake] Core spread route table: top CPUs [(1, 232), (5, 232), (9, 232), (13, 232), (7, 226), (15, 226)]
[2026-06-17T05:42:36Z INFO scx_cake] libbpf: struct_ops cake_ops: member sub_attach not found in kernel, skipping it as it's set to zero
[2026-06-17T05:42:36Z INFO scx_cake] libbpf: struct_ops cake_ops: member sub_detach not found in kernel, skipping it as it's set to zero
[2026-06-17T05:42:36Z INFO scx_cake] libbpf: struct_ops cake_ops: member sub_cgroup_id not found in kernel, skipping it as it's set to zero
[2026-06-17T05:42:36Z WARN scx_cake] libbpf: map 'cake_ops': BPF map skeleton link is uninitialized
[2026-06-17T05:42:36Z INFO scx_cake] scx_cake 1.2.0 x86_64-unknown-linux-gnu SMT on
[2026-06-17T05:42:36Z INFO scx_cake] scheduler options: scx_cake
[2026-06-17T05:42:36Z INFO scx_cake] release accelerators: route-pred=off, confidence=off, llc-pending=on, local-waiter=on, domain-drr=off, trust-maps=off, core-steal-dhq=off
[2026-06-17T05:42:36Z INFO scx_cake] 16 CPUs, 1 LLCs, profile: gaming, quantum: 1000us, queue-policy: llc-vtime, storm-guard: shield, busy-wake-kick: policy, learned-locality: off, wake-chain-locality: off, core-steal-dhq: off
^Z
[1] + 64811 suspended sudo scx_cake
Could you implement this?
Thanks for the review and the suggestion. I've now implemented the display of both the version number and the Git commit ID in the scheduler startup message.
sirlucjan
left a comment
Is there a reason you're using older versions of anyhow, clap, ctrlc, libc, and log?
I ran a benchmark (https://github.com/CachyOS/cachyos-benchmarker), and scx_soft_domain didn't perform very well. It crashes a lot. First attempt: Second attempt:
Third attempt:
sirlucjan
left a comment
So, please fix the stalls above.
scx_soft_domain: improve performance through topology-aware scheduling
Background
When performance-testing multi-instance applications such as Redis, cputrace captures show a large number of CPU migrations across clusters and NUMA nodes, along with cross-NUMA memory accesses. Multiple Redis processes on the same CPU also switch back and forth, and CPU contention becomes fierce under high test pressure.
We want to use topology information to reduce Redis migration, contention, and cross-NUMA memory access.
Introduction
scx_soft_domain is a scheduler dedicated to multi-instance services such as Redis. It detects the CPU hardware topology (LLC and NUMA) and task affinity, and runs each task on CPUs in its own domain whenever possible. This reduces data transfer across LLCs and NUMA nodes and improves cache locality and overall system performance.
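As a rough illustration of the topology-detection step, the sketch below groups CPUs into domains from Linux cpulist strings such as `0-3,8-11`, the format exposed by sysfs files like `/sys/devices/system/cpu/cpuN/cache/indexM/shared_cpu_list`. The helper names `parse_cpu_list` and `group_by_domain` are hypothetical and are not taken from this PR; the actual scheduler may detect topology in a different way.

```rust
use std::collections::BTreeMap;

/// Parse a Linux cpulist string such as "0-3,8-11" into CPU ids.
/// Returns None if any element is malformed or a range is reversed.
/// (Hypothetical helper for illustration only.)
fn parse_cpu_list(s: &str) -> Option<Vec<usize>> {
    let mut cpus = Vec::new();
    for part in s.trim().split(',') {
        if part.is_empty() {
            continue;
        }
        match part.split_once('-') {
            // A range "lo-hi" expands to every CPU id in lo..=hi.
            Some((lo, hi)) => {
                let lo: usize = lo.trim().parse().ok()?;
                let hi: usize = hi.trim().parse().ok()?;
                if lo > hi {
                    return None;
                }
                cpus.extend(lo..=hi);
            }
            // A single CPU id.
            None => cpus.push(part.trim().parse().ok()?),
        }
    }
    Some(cpus)
}

/// Group CPUs into domains, given one shared-CPU list per CPU.
/// Each domain is keyed by its lowest CPU id, so CPUs that report
/// the same list collapse into a single entry.
fn group_by_domain(shared_lists: &[&str]) -> Option<BTreeMap<usize, Vec<usize>>> {
    let mut domains = BTreeMap::new();
    for list in shared_lists {
        let cpus = parse_cpu_list(list)?;
        if let Some(&first) = cpus.iter().min() {
            domains.entry(first).or_insert(cpus);
        }
    }
    Some(domains)
}
```

Passing one string per CPU (for example, the contents of each CPU's last-level cache `shared_cpu_list`) yields one entry per LLC. Which cache index corresponds to the LLC depends on the hardware, so that choice is left to the caller.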
Core implementation logic
The CPU selection policy works as follows. Instances are distributed evenly across the machine. Within a scheduling domain (that is, the same LLC or NUMA node), tasks are preferentially scheduled onto CPUs in that domain, and cross-domain migration is avoided as much as possible.
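The selection policy described above might look roughly like the following userspace sketch. It is not the BPF code from this PR: `Domain`, `home_domain`, and `select_cpu` are hypothetical names, round-robin placement is only one possible way to distribute instances evenly, and the fallback order is an assumption.

```rust
/// A scheduling domain (one LLC or NUMA node) and the CPUs it contains.
/// (Hypothetical type for illustration only.)
struct Domain {
    cpus: Vec<usize>,
}

/// Spread instances evenly by assigning them to domains round-robin.
/// `n_domains` must be non-zero.
fn home_domain(instance_idx: usize, n_domains: usize) -> usize {
    instance_idx % n_domains
}

/// Pick a CPU for a task without leaving its home domain.
/// Assumed preference order:
///   1. the previous CPU, if it is in the domain and idle;
///   2. any idle CPU in the domain;
///   3. the previous CPU, if it is in the domain (stay put when busy);
///   4. the first CPU of the domain (pull a stray task back home).
/// Returns None only for an empty domain.
fn select_cpu(domain: &Domain, prev_cpu: usize, idle: &[bool]) -> Option<usize> {
    // Treat CPUs missing from the idle table as busy.
    let is_idle = |cpu: usize| idle.get(cpu).copied().unwrap_or(false);
    let prev_in_domain = domain.cpus.contains(&prev_cpu);

    if prev_in_domain && is_idle(prev_cpu) {
        return Some(prev_cpu);
    }
    if let Some(&cpu) = domain.cpus.iter().find(|&&c| is_idle(c)) {
        return Some(cpu);
    }
    if prev_in_domain {
        return Some(prev_cpu);
    }
    domain.cpus.first().copied()
}
```

Note that an idle CPU outside the home domain is never chosen in this sketch, which is what keeps tasks from migrating across LLC or NUMA boundaries.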
Under high load, load balancing kicks in. First, the scheduler quickly decides whether the task can run directly on the selected CPU, based on that CPU's load and on the task's running status there. If it cannot, the task is inserted into a user-defined queue for the LLC that contains the selected CPU, where it waits for an idle CPU. Then the least-loaded CPU in the LLC domain is chosen and checked for suitability as the new selected CPU. If the conditions are met, the corresponding mapping table is updated.
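The load-balancing flow above can be sketched in plain Rust as follows. This is a simplified model, not the PR's BPF implementation: the types `Llc` and `Placement`, the constant `LOAD_LIMIT`, and the exact conditions in `can_run_directly` and in the remapping check are all assumptions made for illustration, since the description does not spell them out.

```rust
use std::collections::{HashMap, VecDeque};

/// One LLC domain with its per-LLC queue of waiting task ids.
/// (Hypothetical type for illustration only.)
struct Llc {
    cpus: Vec<usize>,
    queue: VecDeque<u32>,
}

#[derive(Debug, PartialEq)]
enum Placement {
    /// Run immediately on this CPU.
    Direct(usize),
    /// Parked in the LLC queue until a CPU in the LLC goes idle.
    Queued,
}

/// Assumed threshold: a task that was already running on the CPU may
/// stay there while fewer than this many tasks are runnable on it.
const LOAD_LIMIT: u32 = 2;

/// Fast-path check based on the CPU's load and the task's running status.
fn can_run_directly(load: u32, was_running_here: bool) -> bool {
    load == 0 || (was_running_here && load < LOAD_LIMIT)
}

/// Place a task. `loads` must be indexable by every CPU id used.
/// `mapping` records the currently selected CPU for each task id.
fn place(
    pid: u32,
    selected: usize,
    was_running_here: bool,
    loads: &[u32],
    llc: &mut Llc,
    mapping: &mut HashMap<u32, usize>,
) -> Placement {
    if can_run_directly(loads[selected], was_running_here) {
        return Placement::Direct(selected);
    }
    // Slow path: wait in the queue of the LLC that owns the selected CPU.
    llc.queue.push_back(pid);

    // Find the least-loaded CPU in the LLC and remap the task to it only
    // if the move is a clear improvement (assumed condition).
    if let Some(&best) = llc.cpus.iter().min_by_key(|&&c| loads[c]) {
        if best != selected && loads[best] + 1 < loads[selected] {
            mapping.insert(pid, best);
        }
    }
    Placement::Queued
}
```

Requiring the candidate CPU to be lighter by more than one task is one way to avoid remapping back and forth between CPUs with nearly equal load; the real scheduler may use different criteria.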
Test
The test environment consists of machines with 384 CPUs and 4 NUMA nodes. Redis set and get performance is tested with 384 instances in total, using the valkey_benchmark tool. The kernel source used in the test is from https://atomgit.com/openeuler/kernel/tree/openEuler-26.09.
./target/release/scx_soft_domain -P redis-server
set
get
cargo build --release -p scx_soft_domain passes
cargo fmt --check passes
Notes
Currently, this scheduler targets only system-wide Redis performance improvement, on either bare metal or in containers. Support for more service scenarios is planned.