linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-18 06:44:00 -04:00

Author	SHA1	Message	Date
Kuba Piecuch	7e311bafb9	tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY() This fixes the following compilation error when using the header from C++ code: error: assigning to 'struct scx_flux__data_uei_dump *' from incompatible type 'void *' Signed-off-by: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-13 06:14:11 -10:00
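As an illustration of the cast described in the commit above, here is a minimal sketch of an allocation macro that compiles as both C and C++. The macro name ALLOC_ARRAY() and the use of calloc() are hypothetical stand-ins, not the actual RESIZE_ARRAY() implementation, and __typeof__ is a GCC/Clang extension.

```c
#include <stdlib.h>

/*
 * Hypothetical stand-in for a macro that assigns a void * result to a typed
 * pointer. C converts void * implicitly, C++ does not, so the explicit
 * __typeof__ cast keeps the header usable from both languages.
 */
#define ALLOC_ARRAY(ptr, n) \
	((ptr) = (__typeof__(ptr))calloc((n), sizeof(*(ptr))))
```

A caller would declare a typed pointer, for example `int *arr = NULL;`, invoke `ALLOC_ARRAY(arr, 4);`, and later release the memory with `free(arr)`.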
Kuba Piecuch	4615361f0b	sched_ext: Make string params of __ENUM_set() const A small change to improve type safety/const correctness. __COMPAT_read_enum() already has const string parameters. It fixes a warning when using the header in C++ code: error: ISO C++11 does not allow conversion from string literal to 'char *' [-Werror,-Wwritable-strings] That's because string literals have type char[N] in C and const char[N] in C++. Signed-off-by: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-13 06:14:05 -10:00
Tejun Heo	3d3667f265	tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap scx_qmap uses global BPF queue maps (BPF_MAP_TYPE_QUEUE) that any CPU's ops.dispatch() can pop from. When a CPU pops a task that can't run on it (e.g. a pinned per-CPU kthread), it inserts the task into SHARED_DSQ. consume_dispatch_q() then skips the task due to affinity mismatch, leaving it stranded until some CPU in its allowed mask calls ops.dispatch(). This doesn't cause indefinite stalls -- the periodic tick keeps firing (can_stop_idle_tick() returns false when softirq is pending) -- but can cause noticeable scheduling delays. After inserting to SHARED_DSQ, kick the task's home CPU if this CPU can't run it. There's a small race window where the home CPU can enter idle before the kick lands -- if a per-CPU kthread like ksoftirqd is the stranded task, this can trigger a "NOHZ tick-stop error" warning. The kick arrives shortly after and the home CPU drains the task. Rather than fully eliminating the warning by routing pinned tasks to local or global DSQs, the current code keeps them going through the normal BPF queue path and documents the race and the resulting warning in detail. scx_qmap is an example scheduler and having tasks go through the usual dispatch path is useful for testing. The detailed comment also serves as a reference for other schedulers that may encounter similar warnings. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-13 06:13:59 -10:00
Cheng-Yang Chou	a3c3fb2f86	tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing scx_alloc_free_idx() zeroes the payload of a freed arena allocation one word at a time. The loop bound was alloc->pool.elem_size / 8, but elem_size includes sizeof(struct sdt_data) (the 8-byte union sdt_id header). This caused the loop to write one extra u64 past the allocation, corrupting the tid field of the adjacent pool element. Fix the loop bound to (elem_size - sizeof(struct sdt_data)) / 8 so only the payload portion is zeroed. Test plan: - Add a temporary sanity check in scx_task_free() before the free call: if (mval->data->tid.idx != mval->tid.idx) scx_bpf_error("tid corruption: arena=%d storage=%d", mval->data->tid.idx, (int)mval->tid.idx); - stress-ng --fork 100 -t 10 & sudo ./build/bin/scx_sdt Without this fix, running scx_sdt under fork-heavy load triggers the corruption error. With the fix applied, the same workload completes without error. Fixes: `36929ebd17` ("tools/sched_ext: add arena based scheduler") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-06 08:06:24 -10:00
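The loop-bound fix described in the commit above can be sketched in isolation. This is a simplified model, not the scx_sdt code: struct elem_hdr and zero_payload() are hypothetical names, and the pool is modeled as a flat array of u64 words in which each element is an 8-byte header followed by its payload.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 8-byte header stored in front of each payload. */
struct elem_hdr {
	uint64_t id;
};

/*
 * Zero only the payload of one pool element, one u64 word at a time.
 * elem_size is the full element size in bytes, header included, so the
 * header size must be subtracted before computing the word count.
 * Using elem_size / 8 as the bound would write one word too many and
 * clobber the header of the next element.
 */
static void zero_payload(uint64_t *elem, size_t elem_size)
{
	uint64_t *payload = elem + sizeof(struct elem_hdr) / 8;
	size_t nr_words = (elem_size - sizeof(struct elem_hdr)) / 8;
	size_t i;

	for (i = 0; i < nr_words; i++)
		payload[i] = 0;
}
```

With two adjacent 24-byte elements, calling `zero_payload()` on the first one clears its two payload words and leaves both headers untouched.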
Zhao Mengmeng	d6edb15ad9	scx_central: Defer timer start to central dispatch to fix init error scx_central currently assumes that ops.init() runs on the selected central CPU and aborts otherwise. This is no longer true, as ops.init() is invoked from the scx_enable_helper thread, which can run on any CPU. As a result, sched_setaffinity() from userspace doesn't work, causing scx_central to fail when loading with: [ 1985.319942] sched_ext: central: scx_central.bpf.c:314: init from non-central CPU [ 1985.320317] scx_exit+0xa3/0xd0 [ 1985.320535] scx_bpf_error_bstr+0xbd/0x220 [ 1985.320840] bpf_prog_3a445a8163fa8149_central_init+0x103/0x1ba [ 1985.321073] bpf__sched_ext_ops_init+0x40/0xa8 [ 1985.321286] scx_root_enable_workfn+0x507/0x1650 [ 1985.321461] kthread_worker_fn+0x260/0x940 [ 1985.321745] kthread+0x303/0x3e0 [ 1985.321901] ret_from_fork+0x589/0x7d0 [ 1985.322065] ret_from_fork_asm+0x1a/0x30 DEBUG DUMP =================================================================== central: root scx_enable_help[134] triggered exit kind 1025: scx_bpf_error (scx_central.bpf.c:314: init from non-central CPU) Fix this by: - Defer bpf_timer_start() to the first dispatch on the central CPU. - Initialize the BPF timer in central_init() and kick the central CPU to guarantee entering the dispatch path on the central CPU immediately. - Remove the unnecessary sched_setaffinity() call in userspace. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-27 07:33:00 -10:00
Tejun Heo	ea70239320	tools/sched_ext: Remove redundant SCX_ENQ_IMMED compat definition compat.bpf.h defined a fallback SCX_ENQ_IMMED macro using __COMPAT_ENUM_OR_ZERO(). After `6bf36c68b0` ("tools/sched_ext: Regenerate autogen enum headers") added SCX_ENQ_IMMED to the autogen headers, including both triggers -Wmacro-redefined warnings. The autogen definition through const volatile __weak already resolves to 0 on older kernels, providing the same backward compatibility. Remove the now-redundant compat fallback. Fixes: `6bf36c68b0` ("tools/sched_ext: Regenerate autogen enum headers") Link: https://lore.kernel.org/r/20260326100313.338388-1-zhaomzhao@126.com Reported-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-26 10:07:42 -10:00
Zhao Mengmeng	f546c77038	tools/sched_ext: scx_pair: fix pair_ctx indexing for CPU pairs scx_pair sizes pair_ctx to nr_cpu_ids / 2, so valid pair_ctx keys are dense pair indexes in the range [0, nr_cpu_ids / 2). However, the userspace setup code stores pair_id as the first CPU number in each pair. On an 8-CPU system with "-S 1", that produces pair IDs 0, 2, 4 and 6 for pairs [0,1], [2,3], [4,5] and [6,7]. CPUs in the latter half then look up pair_ctx with out-of-range keys and the BPF scheduler aborts with: EXIT: scx_bpf_error (scx_pair.bpf.c:328: failed to lookup pairc and in_pair_mask for cpu[5]) Assign pair_id using a dense pair counter instead so that each CPU pair maps to a valid pair_ctx entry. Besides, reject odd CPU configuration, as scx_pair requires all CPUs to be paired. Fixes: `f0262b102c` ("tools/sched_ext: add scx_pair scheduler") Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-25 17:45:23 -10:00
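The dense pair counter described in the commit above might look roughly like the following. This is a sketch under stated assumptions: assign_pair_ids() and its signature are hypothetical, and the pairing loop is a simplified model of a stride-based setup rather than the scx_pair source.

```c
/*
 * Pair up CPUs with the given stride and give each pair a dense index in
 * [0, nr_cpus / 2), so the index is always a valid key for a table sized
 * nr_cpus / 2. Returns the number of pairs, or -1 if the CPUs cannot all
 * be paired (odd CPU count or invalid stride).
 */
static int assign_pair_ids(int nr_cpus, int stride, int *pair_id)
{
	int i, next_pair = 0;

	if (nr_cpus < 2 || nr_cpus % 2 || stride <= 0)
		return -1;

	for (i = 0; i < nr_cpus; i++)
		pair_id[i] = -1;

	for (i = 0; i < nr_cpus; i++) {
		int j;

		if (pair_id[i] >= 0)
			continue;

		/* Find the next unpaired CPU other than i, starting at the stride. */
		j = (i + stride) % nr_cpus;
		while (j == i || pair_id[j] >= 0)
			j = (j + 1) % nr_cpus;

		/* Use the running counter, not the first CPU number, as the ID. */
		pair_id[i] = next_pair;
		pair_id[j] = next_pair;
		next_pair++;
	}

	return next_pair;
}
```

For 8 CPUs and a stride of 1 this yields pair IDs 0, 1, 2 and 3 for pairs [0,1], [2,3], [4,5] and [6,7], where using the first CPU number would have produced 0, 2, 4 and 6.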
Cheng-Yang Chou	6bf36c68b0	tools/sched_ext: Regenerate autogen enum headers Regenerate enum_defs.autogen.h, enums.autogen.h and enums.autogen.bpf.h using the upstream scripts [1][2] to sync with recent kernel enum additions. [1] https://github.com/sched-ext/scx/blob/main/scripts/gen_enum_defs.py [2] https://github.com/sched-ext/scx/blob/main/scripts/gen_enums.py Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-25 05:58:08 -10:00
Cheng-Yang Chou	cb251eae7b	tools/sched_ext: Add scx_bpf_sub_dispatch() compat wrapper Add a transparent compatibility wrapper for the scx_bpf_sub_dispatch() kfunc in compat.bpf.h. This allows BPF schedulers using the sub-sched dispatch feature to build and run on older kernels that lack the kfunc. To avoid requiring code changes in individual schedulers, the transparent wrapper pattern is used instead of a __COMPAT prefix. The kfunc is declared with a ___compat suffix, while the static inline wrapper retains the original scx_bpf_sub_dispatch() name. When the kfunc is unavailable, the wrapper safely falls back to returning false. This is acceptable because the dispatch path cannot do anything useful without underlying sub-sched support anyway. Tested scx_qmap on v6.14 successfully. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-23 07:45:08 -10:00
Andrea Righi	f03ffe53ab	tools/sched_ext: Add compat handling for sub-scheduler ops Extend SCX_OPS_OPEN() with compatibility handling for ops.sub_attach() and ops.sub_detach(), allowing scx C schedulers with sub-scheduler support to run on kernels both with and without its support. Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-22 10:03:05 -10:00
Ke Zhao	068014daad	tools/sched_ext: Update stale scx_ops_error() comment in fcg_cgroup_move() The function scx_ops_error() was dropped, but the comment here is left pointing to the old name. Update to be consistent with current API. Signed-off-by: Ke Zhao <ke.zhao.kernel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-21 08:35:56 -10:00
zhidao su	2e5e5b3738	sched_ext: Fix typos in comments Fix five typos across three files: - kernel/sched/ext.c: 'monotically' -> 'monotonically' (line 55) - kernel/sched/ext.c: 'used by to check' -> 'used to check' (line 56) - kernel/sched/ext.c: 'hardlockdup' -> 'hardlockup' (line 3881) - kernel/sched/ext_idle.c: 'don't perfectly overlaps' -> 'don't perfectly overlap' (line 371) - tools/sched_ext/scx_flatcg.bpf.c: 'shaer' -> 'share' (line 21) Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-17 07:46:36 -10:00
Cheng-Yang Chou	6712c4fefc	sched_ext: Update demo schedulers and selftests to use scx_bpf_task_set_dsq_vtime() Direct writes to p->scx.dsq_vtime are deprecated in favor of scx_bpf_task_set_dsq_vtime(). Update scx_simple, scx_flatcg, and select_cpu_vtime selftest to use the new kfunc with scale_by_task_weight_inverse(). Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-14 22:53:59 -10:00
Tejun Heo	3229ac4a5e	sched_ext: Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag SCX_ENQ_IMMED makes enqueue to local DSQs succeed only if the task can start running immediately. Otherwise, the task is re-enqueued through ops.enqueue(). This provides tighter control but requires specifying the flag on every insertion. Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag. When set, SCX_ENQ_IMMED is automatically applied to all local DSQ enqueues including through scx_bpf_dsq_move_to_local(). scx_qmap is updated with -I option to test the feature and -F option for IMMED stress testing which forces every Nth enqueue to a busy local DSQ. v2: - Cover scx_bpf_dsq_move_to_local() path (now has enq_flags via ___v2). - scx_qmap: Remove sched_switch and cpu_release handlers (superseded by kernel-side wakeup_preempt_scx()). Add -F for IMMED stress testing. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:23 -10:00
Tejun Heo	860683763e	sched_ext: Add enq_flags to scx_bpf_dsq_move_to_local() scx_bpf_dsq_move_to_local() moves a task from a non-local DSQ to the current CPU's local DSQ. This is an indirect way of dispatching to a local DSQ and should support enq_flags like direct dispatches do - e.g. SCX_ENQ_HEAD for head-of-queue insertion and SCX_ENQ_IMMED for immediate execution guarantees. Add scx_bpf_dsq_move_to_local___v2() with an enq_flags parameter. The original becomes a v1 compat wrapper passing 0. The compat macro is updated to a three-level chain: v2 (7.1+) -> v1 (current) -> scx_bpf_consume (pre-rename). All in-tree BPF schedulers are updated to pass 0. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:23 -10:00
Tejun Heo	98d709cba3	sched_ext: Implement SCX_ENQ_IMMED Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is dispatched with IMMED, it either gets on the CPU immediately and stays on it, or gets reenqueued back to the BPF scheduler. It will never linger on a local DSQ behind other tasks or on a CPU taken by a higher-priority class. rq_is_open() uses rq->next_class to determine whether the rq is available, and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task arrives. These capture all higher class preemptions. Combined with reenqueue points in the dispatch path, all cases where an IMMED task would not execute immediately are covered. SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the guarantee survives SAVE/RESTORE cycles. If preempted while running, put_prev_task_scx() reenqueues through ops.enqueue() with SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the local DSQ. This enables tighter scheduling latency control by preventing tasks from piling up on local DSQs. It also enables opportunistic CPU sharing across sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a shared CPU, making it difficult for others to use. v2: - Rewrite is_curr_done() as rq_is_open() using rq->next_class and implement wakeup_preempt_scx() to achieve complete coverage of all cases where IMMED tasks could get stranded. - Track IMMED persistently in p->scx.flags and reenqueue preempted-while-running tasks through ops.enqueue(). - Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT). - Misc renames, documentation. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-13 09:43:22 -10:00
Cheng-Yang Chou	bd377af097	sched_ext: Fix incomplete help text usage strings Several demo schedulers and the selftest runner had usage strings that omitted options which are actually supported: - scx_central: add missing [-v] - scx_pair: add missing [-v] - scx_qmap: add missing [-S] and [-H] - scx_userland: add missing [-v] - scx_sdt: remove [-f] which no longer exists - runner.c: add missing [-s], [-l], [-q]; drop [-h] which none of the other sched_ext tools list in their usage lines Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-11 11:02:57 -10:00
Zhao Mengmeng	bec10581e9	sched_ext: remove SCX_OPS_HAS_CGROUP_WEIGHT While running scx_flatcg, dmesg prints "SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop". In the code, SCX_OPS_HAS_CGROUP_WEIGHT has been marked as DEPRECATED and will be removed in 6.18. Now it's time to do it. Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-09 09:45:18 -10:00
Tejun Heo	0a0d3b8dd0	tools/sched_ext/include: Regenerate enum_defs.autogen.h Regenerate enum_defs.autogen.h from the current vmlinux.h to pick up new SCX enums added in the for-7.1 cycle. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	93ac9b150e	tools/sched_ext/include: Add libbpf version guard for assoc_struct_ops Extract the inline bpf_program__assoc_struct_ops() call in SCX_OPS_LOAD() into a __scx_ops_assoc_prog() helper and wrap it with a libbpf >= 1.7 version guard. bpf_program__assoc_struct_ops() was added in libbpf 1.7; the guard provides a no-op fallback for older versions. Add the <bpf/libbpf.h> include needed by the helper, and fix "assumming" typo in a nearby comment. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	c9c8546cde	tools/sched_ext/include: Add __COMPAT_HAS_scx_bpf_select_cpu_and macro scx_bpf_select_cpu_and() is now an inline wrapper so bpf_ksym_exists(scx_bpf_select_cpu_and) no longer works. Add __COMPAT_HAS_scx_bpf_select_cpu_and macro that checks for either the struct args type (new) or the compat ksym (old) to test availability. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	3691d380d5	tools/sched_ext/include: Add missing helpers to common.bpf.h Sync several helpers from the scx repo: - bpf_cgroup_acquire() ksym declaration - __sink() macro for hiding values from verifier precision tracking - ctzll() count-trailing-zeros implementation - get_prandom_u64() helper - scx_clock_task/pelt/virt/irq() clock helpers with get_current_rq() Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	9c6437f7c2	tools/sched_ext/include: Sync bpf_arena_common.bpf.h with scx repo Sync the following changes from the scx repo: - Guard __arena define with #ifndef to avoid redefinition when the attribute is already defined by another header. - Add bpf_arena_reserve_pages() and bpf_arena_mapping_nr_pages() ksym declarations. - Rename TEST to SCX_BPF_UNITTEST to avoid collision with generic TEST macros in other projects. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	c90af06c80	tools/sched_ext/include: Remove dead sdt_task_defs.h guard from common.h The __has_include guard for sdt_task_defs.h is vestigial — the only remaining content is the bpf_arena_common.h include which is available unconditionally. Remove the dead guard. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 22:45:12 -10:00
Tejun Heo	84b1a0ea0b	sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs scx_bpf_dsq_reenq() currently only supports local DSQs. Extend it to support user-defined DSQs by adding a deferred re-enqueue mechanism similar to the local DSQ handling. Add per-cpu deferred_reenq_user_node/flags to scx_dsq_pcpu and deferred_reenq_users list to scx_rq. When scx_bpf_dsq_reenq() is called on a user DSQ, the DSQ's per-cpu node is added to the current rq's deferred list. process_deferred_reenq_users() then iterates the DSQ using the cursor helpers and re-enqueues each task. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 05:29:50 -10:00
Tejun Heo	9c34c5074d	sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue scx_bpf_reenqueue_local() can only trigger re-enqueue of the current CPU's local DSQ. Introduce scx_bpf_dsq_reenq() which takes a DSQ ID and can target any local DSQ including remote CPUs via SCX_DSQ_LOCAL_ON | cpu. This will be expanded to support user DSQs by future changes. scx_bpf_reenqueue_local() is reimplemented as a simple wrapper around scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future. Update compat.bpf.h with a compatibility shim and scx_qmap to test the new functionality. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-07 05:29:49 -10:00
Tejun Heo	4f8b122848	sched_ext: Add basic building blocks for nested sub-scheduler dispatching This is an early-stage partial implementation that demonstrates the core building blocks for nested sub-scheduler dispatching. While significant work remains in the enqueue path and other areas, this patch establishes the fundamental mechanisms needed for hierarchical scheduler operation. The key building blocks introduced include: - Private stack support for ops.dispatch() to prevent stack overflow when walking down nested schedulers during dispatch operations - scx_bpf_sub_dispatch() kfunc that allows parent schedulers to trigger dispatch operations on their direct child schedulers - Proper parent-child relationship validation to ensure dispatch requests are only made to legitimate child schedulers - Updated scx_dispatch_sched() to handle both nested and non-nested invocations with appropriate kf_mask handling The qmap scheduler is updated to demonstrate the functionality by calling scx_bpf_sub_dispatch() on registered child schedulers when it has no tasks in its own queues. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-06 07:58:04 -10:00
Tejun Heo	105dcd005b	sched_ext: Introduce scx_prog_sched() In preparation for multiple scheduler support, introduce scx_prog_sched() accessor which returns the scx_sched instance associated with a BPF program. The association is determined via the special KF_IMPLICIT_ARGS kfunc parameter, which provides access to bpf_prog_aux. This aux can be used to retrieve the struct_ops (sched_ext_ops) that the program is associated with, and from there, the corresponding scx_sched instance. For compatibility, when ops.sub_attach is not implemented (older schedulers without sub-scheduler support), unassociated programs fall back to scx_root. A warning is logged once per scheduler for such programs. As scx_root is still the only scheduler, this shouldn't introduce user-visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-06 07:58:03 -10:00
Tejun Heo	ebeca1f930	sched_ext: Introduce cgroup sub-sched support A system often runs multiple workloads especially in multi-tenant server environments where a system is split into partitions servicing separate more-or-less independent workloads each requiring an application-specific scheduler. To support such and other use cases, sched_ext is in the process of growing multiple scheduler support. When partitioning a system in terms of CPUs for such use cases, an oft-taken approach is hard partitioning the system using cpuset. While it would be possible to tie sched_ext multiple scheduler support to cpuset partitions, such an approach would have fundamental limitations stemming from the lack of dynamism and flexibility. Users often don't care which specific CPUs are assigned to which workload and want to take advantage of optimizations which are enabled by running workloads on a larger machine - e.g. opportunistic over-commit, improving latency critical workload characteristics while maintaining bandwidth fairness, employing control mechanisms based on different criteria than on-CPU time for e.g. flexible memory bandwidth isolation, packing similar parts from different workloads on same L3s to improve cache efficiency, and so on. As this sort of dynamic behaviors are impossible or difficult to implement with hard partitioning, sched_ext is implementing cgroup sub-sched support where schedulers can be attached to the cgroup hierarchy and a parent scheduler is responsible for controlling the CPUs that each child can use at any given moment. This makes CPU distribution dynamically controlled by BPF allowing high flexibility. This patch adds the skeletal sched_ext cgroup sub-sched support: - sched_ext_ops.sub_cgroup_id and .sub_attach/detach() are added. Non-zero sub_cgroup_id indicates that the scheduler is to be attached to the identified cgroup. 
A sub-sched is attached to the cgroup iff the nearest ancestor scheduler implements .sub_attach() and grants the attachment. Max nesting depth is limited by SCX_SUB_MAX_DEPTH. - When a scheduler exits, all its descendant schedulers are exited together. Also, cgroup.scx_sched added which points to the effective scheduler instance for the cgroup. This is updated on scheduler init/exit and inherited on cgroup online. When a cgroup is offlined, the attached scheduler is automatically exited. - Sub-sched support is gated on CONFIG_EXT_SUB_SCHED which is automatically enabled if both SCX and cgroups are enabled. Sub-sched support is not tied to the CPU controller but rather the cgroup hierarchy itself. This is intentional as the support for cpu.weight and cpu.max based resource control is orthogonal to sub-sched support. Note that CONFIG_CGROUPS around cgroup subtree iteration support for scx_task_iter is replaced with CONFIG_EXT_SUB_SCHED for consistency. - This allows loading sub-scheds and most framework operations such as propagating disable down the hierarchy work. However, sub-scheds are not operational yet and all tasks stay with the root sched. This will serve as the basis for building up full sub-sched support. - DSQs point to the scx_sched they belong to. - scx_qmap is updated to allow attachment of sub-scheds and also serving as sub-scheds. - scx_is_descendant() is added but not yet used in this patch. It is used by later changes in the series and placed here as this is where the function belongs. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-03-06 07:58:03 -10:00
Tejun Heo	32e940f2bd	Merge branch 'for-7.0-fixes' into for-7.1 To prepare for hierarchical scheduling patchset which will cause multiple conflicts otherwise. Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-06 07:46:32 -10:00
Zhao Mengmeng	9af832c0a7	tools/sched_ext: Add -fms-extensions to bpf build flags Similar to commit `835a507535` ("selftests/bpf: Add -fms-extensions to bpf build flags") and commit `639f58a0f4` ("bpftool: Fix build warnings due to MS extensions") The kernel is now built with -fms-extensions, therefore generated vmlinux.h contains types like: struct aes_key { struct aes_enckey; union aes_invkey_arch inv_k; }; struct ns_common { ... union { struct ns_tree; struct callback_head ns_rcu; }; }; Which raise warning like below when building scx scheduler: tools/sched_ext/build/include/vmlinux.h:50533:3: warning: declaration does not declare anything [-Wmissing-declarations] 50533 | struct ns_tree; | ^ Fix it by using -fms-extensions and -Wno-microsoft-anon-tag flags to build bpf programs that #include "vmlinux.h" Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-02 22:00:23 -10:00
David Carlier	032e084f0d	tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq() scx_hotplug_seq() uses strtoul() but validates the result with a negative check (val < 0), which can never be true for an unsigned return value. Use the endptr mechanism to verify the entire string was consumed, and check errno == ERANGE for overflow detection. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-27 09:17:44 -10:00
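The validation approach named in the commit above (endptr plus errno == ERANGE) can be sketched as a small standalone helper. The name parse_ulong() and its return convention are hypothetical and are not taken from scx_hotplug_seq().

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a base-10 unsigned long. Returns 0 and stores the value in *out on
 * success, or -1 if the string is empty, has trailing characters, or
 * overflows. Checking "val < 0" would be useless here because the return
 * type of strtoul() is unsigned.
 */
static int parse_ulong(const char *s, unsigned long *out)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(s, &end, 10);

	if (end == s)		/* no digits were consumed */
		return -1;
	if (*end != '\0')	/* the whole string was not consumed */
		return -1;
	if (errno == ERANGE)	/* value does not fit in unsigned long */
		return -1;

	*out = val;
	return 0;
}
```

Note that strtoul() itself accepts a leading minus sign and negates the result in the unsigned type, so callers that must reject such input would need an additional check.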
Cheng-Yang Chou	ee0ff6690f	tools/sched_ext: Add Kconfig to sync with upstream Add the missing Kconfig file to tools/sched_ext/ as referenced in the README. Ref: https://github.com/sched-ext/scx/blob/main/kernel.config Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-24 07:51:36 -10:00
Cheng-Yang Chou	095f569332	tools/sched_ext: Sync README.md Kconfig with upstream scx Sync the documentation with the upstream scx repository to reflect the current recommended configuration. Ref: https://github.com/sched-ext/scx/blob/main/README.md#build--install Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-24 07:51:29 -10:00
Andrea Righi	ebf1ccff79	sched_ext: Fix ops.dequeue() semantics Currently, ops.dequeue() is only invoked when the sched_ext core knows that a task resides in BPF-managed data structures, which causes it to miss scheduling property change events. In addition, ops.dequeue() callbacks are completely skipped when tasks are dispatched to non-local DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably track task state. Fix this by guaranteeing that each task entering the BPF scheduler's custody triggers exactly one ops.dequeue() call when it leaves that custody, whether the exit is due to a dispatch (regular or via a core scheduling pick) or to a scheduling property change (e.g. sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA balancing, etc.). BPF scheduler custody concept: a task is considered to be in the BPF scheduler's custody when the scheduler is responsible for managing its lifecycle. This includes tasks dispatched to user-created DSQs or stored in the BPF scheduler's internal data structures from ops.enqueue(). Custody ends when the task is dispatched to a terminal DSQ (such as the local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed due to a property change. Tasks directly dispatched to terminal DSQs bypass the BPF scheduler entirely and are never in its custody. Terminal DSQs include: - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues where tasks go directly to execution. - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the BPF scheduler is considered "done" with the task. As a result, ops.dequeue() is not invoked for tasks directly dispatched to terminal DSQs. To identify dequeues triggered by scheduling property changes, introduce the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set, the dequeue was caused by a scheduling property change. 
New ops.dequeue() semantics: - ops.dequeue() is invoked exactly once when the task leaves the BPF scheduler's custody, in one of the following cases: a) regular dispatch: a task dispatched to a user DSQ or stored in internal BPF data structures is moved to a terminal DSQ (ops.dequeue() called without any special flags set), b) core scheduling dispatch: core-sched picks task before dispatch (ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set), c) property change: task properties modified before dispatch, (ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set). This allows BPF schedulers to: - reliably track task ownership and lifecycle, - maintain accurate accounting of managed tasks, - update internal state when tasks change properties. Cc: Tejun Heo <tj@kernel.org> Cc: Emil Tsalapatis <emil@etsalapatis.com> Cc: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 10:01:18 -10:00
Cheng-Yang Chou	cbb297323d	tools/sched_ext: scx_sdt: Remove unused '-f' option The '-f' option is defined in getopt() but not handled in the switch statement or documented in the help text. Providing '-f' currently triggers the default error path. Remove it to sync the optstring with the actual implementation. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 07:45:34 -10:00
Cheng-Yang Chou	0c36a6f6f0	tools/sched_ext: scx_central: Remove unused '-p' option The '-p' option is defined in getopt() but not handled in the switch statement or documented in the help text. Providing '-p' currently triggers the default error path. Remove it to sync the optstring with the actual implementation. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 07:45:30 -10:00
David Carlier	640c9dc72f	tools/sched_ext: fix getopt not re-parsed on restart After goto restart, optind retains its advanced position from the previous getopt loop, causing getopt() to immediately return -1. This silently drops all command-line options on the restarted skeleton. Reset optind to 1 at the restart label so options are re-parsed. Affected schedulers: scx_simple, scx_central, scx_flatcg, scx_pair, scx_sdt, scx_cpu0. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-20 17:17:38 -10:00
David Carlier	f892f9f994	tools/sched_ext: scx_userland: fix data races on shared counters The stats thread reads nr_vruntime_enqueues, nr_vruntime_dispatches, nr_vruntime_failed, and nr_curr_enqueued concurrently with the main thread writing them, with no synchronization. Use __atomic builtins with relaxed ordering for all accesses to these counters to eliminate the data races. Only display accuracy is affected, not scheduling correctness. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-20 17:17:31 -10:00
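Relaxed atomic accesses of the kind mentioned in the commit above can be sketched as follows. The counter nr_events and the two helpers are hypothetical names; __atomic_fetch_add() and __atomic_load_n() are GCC/Clang builtins.

```c
#include <stdint.h>

/* Hypothetical counter shared between a writer thread and a stats reader. */
static uint64_t nr_events;

/* Writer side: the increment is atomic but imposes no ordering on other memory. */
static void count_event(void)
{
	__atomic_fetch_add(&nr_events, 1, __ATOMIC_RELAXED);
}

/* Reader side: returns a value that is never torn, though it may be slightly stale. */
static uint64_t read_events(void)
{
	return __atomic_load_n(&nr_events, __ATOMIC_RELAXED);
}
```

Relaxed ordering is enough when the value is only displayed, since no other data is published through the counter.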
David Carlier	625be3456b	tools/sched_ext: scx_pair: fix stride == 0 crash on single-CPU systems nr_cpu_ids / 2 produces stride 0 on a single-CPU system, which later causes SCX_BUG_ON(i == j) to fire. Validate stride after option parsing to also catch invalid user-supplied values via -S. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-18 07:03:57 -10:00
David Carlier	55a24d9203	tools/sched_ext: scx_central: fix CPU_SET and skeleton leak on early exit Use CPU_SET_S() instead of CPU_SET() on the dynamically allocated cpuset to avoid a potential out-of-bounds write when nr_cpu_ids exceeds CPU_SETSIZE. Also destroy the skeleton before returning on invalid central CPU ID to prevent a resource leak. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-18 07:03:50 -10:00
David Carlier	0767684613	tools/sched_ext: scx_userland: fix stale data on restart Reset all counters, tasks and vruntime_head list on restart. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-16 21:02:16 -10:00
David Carlier	cabd76bbc0	tools/sched_ext: scx_flatcg: fix potential stack overflow from VLA in fcg_read_stats fcg_read_stats() had a VLA allocating 21 * nr_cpus * 8 bytes on the stack, risking stack overflow on large CPU counts (nr_cpus can be up to 512). Fix by using a single heap allocation with the correct size, reusing it across all stat indices, and freeing it at the end. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-16 21:01:18 -10:00
David Carlier	048714d9df	tools/sched_ext: scx_userland: fix restart and stats thread lifecycle bugs Fix three issues in scx_userland's restart path: - exit_req is not reset on restart, causing sched_main_loop() to exit immediately without doing any scheduling work. - stats_printer thread handle is local to spawn_stats_thread(), making it impossible to join from main(). Promote it to file scope. - The stats thread continues reading skel->bss after the skeleton is destroyed on restart, causing a use-after-free. Join the stats thread before destroying the skeleton to ensure it has exited. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-12 11:17:35 -10:00
David Carlier	988369d236	tools/sched_ext: scx_central: fix sched_setaffinity() call with the set size The cpu set is dynamically allocated for nr_cpu_ids using CPU_ALLOC(), so the size passed to sched_setaffinity() should be CPU_ALLOC_SIZE() rather than sizeof(cpu_set_t). Valgrind flagged this as accessing unaddressable bytes past the allocation. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-12 07:30:17 -10:00
David Carlier	11fece49e9	tools/sched_ext: scx_flatcg: zero-initialize stats counter array The local cnts array in read_stats() is not initialized before being accumulated into per-CPU stats, which may lead to reading garbage values. Zero it out with memset alongside the existing stats array initialization. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-12 07:28:02 -10:00
Linus Torvalds	38ef046544	Merge tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Move C example schedulers back from the external scx repo to tools/sched_ext as the authoritative source. scx_userland and scx_pair are returning while scx_sdt (BPF arena-based task data management) is new. These schedulers will be dropped from the external repo. - Improve error reporting by adding scx_bpf_error() calls when DSQ creation fails across all in-tree schedulers - Avoid redundant irq_work_queue() calls in destroy_dsq() by only queueing when llist_add() indicates an empty list - Fix flaky init_enable_count selftest by properly synchronizing pre-forked children using a pipe instead of sleep() * tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: selftests/sched_ext: Fix init_enable_count flakiness tools/sched_ext: Fix data header access during free in scx_sdt tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers tools/sched_ext: add arena based scheduler tools/sched_ext: add scx_pair scheduler tools/sched_ext: add scx_userland scheduler sched_ext: Add error logging for dsq creation failures sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()	2026-02-11 13:35:24 -08:00
Emil Tsalapatis	2e06d54ea9	tools/sched_ext: Fix data header access during free in scx_sdt Fix a pointer arithmetic error in scx_sdt during freeing that causes the allocator to use the wrong memory address for the allocation's data header. Fixes: `36929ebd17` ("tools/sched_ext: add arena based scheduler") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-02 05:50:14 -10:00
zhidao su	bd4f0822f4	tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers Add scx_bpf_error() calls when scx_bpf_create_dsq() fails in the remaining schedulers to improve debuggability: - scx_simple.bpf.c: simple_init() - scx_sdt.bpf.c: sdt_init() - scx_cpu0.bpf.c: cpu0_init() - scx_flatcg.bpf.c: fcg_init() This follows the same pattern established in commit `2f8d489897` ("sched_ext: Add error logging for dsq creation failures") for other schedulers and ensures consistent error reporting across all schedulers. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-01 07:10:18 -10:00
Emil Tsalapatis	36929ebd17	tools/sched_ext: add arena based scheduler Add a scheduler that uses BPF arenas to manage task context data. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-27 15:43:34 -10:00

150 Commits