This fixes the following compilation error when using the header from
C++ code:
error: assigning to 'struct scx_flux__data_uei_dump *' from
incompatible type 'void *'
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A small change to improve type safety/const correctness.
__COMPAT_read_enum() already has const string parameters.
It fixes a warning when using the header in C++ code:
error: ISO C++11 does not allow conversion from string literal
to 'char *' [-Werror,-Wwritable-strings]
That's because string literals have type char[N] in C and
const char[N] in C++.
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_qmap uses global BPF queue maps (BPF_MAP_TYPE_QUEUE) that any CPU's
ops.dispatch() can pop from. When a CPU pops a task that can't run on it
(e.g. a pinned per-CPU kthread), it inserts the task into SHARED_DSQ.
consume_dispatch_q() then skips the task due to affinity mismatch, leaving it
stranded until some CPU in its allowed mask calls ops.dispatch(). This doesn't
cause indefinite stalls -- the periodic tick keeps firing (can_stop_idle_tick()
returns false when softirq is pending) -- but can cause noticeable scheduling
delays.
After inserting to SHARED_DSQ, kick the task's home CPU if this CPU can't run
it. There's a small race window where the home CPU can enter idle before the
kick lands -- if a per-CPU kthread like ksoftirqd is the stranded task, this
can trigger a "NOHZ tick-stop error" warning. The kick arrives shortly after
and the home CPU drains the task.
Rather than fully eliminating the warning by routing pinned tasks to local or
global DSQs, the current code keeps them going through the normal BPF queue
path and documents the race and the resulting warning in detail. scx_qmap is an
example scheduler and having tasks go through the usual dispatch path is useful
for testing. The detailed comment also serves as a reference for other
schedulers that may encounter similar warnings.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_alloc_free_idx() zeroes the payload of a freed arena allocation
one word at a time. The loop bound was alloc->pool.elem_size / 8, but
elem_size includes sizeof(struct sdt_data) (the 8-byte union sdt_id
header). This caused the loop to write one extra u64 past the
allocation, corrupting the tid field of the adjacent pool element.
Fix the loop bound to (elem_size - sizeof(struct sdt_data)) / 8 so
only the payload portion is zeroed.
Test plan:
- Add a temporary sanity check in scx_task_free() before the free call:
if (mval->data->tid.idx != mval->tid.idx)
scx_bpf_error("tid corruption: arena=%d storage=%d",
mval->data->tid.idx, (int)mval->tid.idx);
- stress-ng --fork 100 -t 10 & sudo ./build/bin/scx_sdt
Without this fix, running scx_sdt under fork-heavy load triggers the
corruption error. With the fix applied, the same workload completes
without error.
Fixes: 36929ebd17 ("tools/sched_ext: add arena based scheduler")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_central currently assumes that ops.init() runs on the selected
central CPU and aborts otherwise. This is no longer true, as ops.init()
is invoked from the scx_enable_helper thread, which can run on any
CPU.
As a result, sched_setaffinity() from userspace doesn't work, causing
scx_central to fail when loading with:
[ 1985.319942] sched_ext: central: scx_central.bpf.c:314: init from non-central CPU
[ 1985.320317] scx_exit+0xa3/0xd0
[ 1985.320535] scx_bpf_error_bstr+0xbd/0x220
[ 1985.320840] bpf_prog_3a445a8163fa8149_central_init+0x103/0x1ba
[ 1985.321073] bpf__sched_ext_ops_init+0x40/0xa8
[ 1985.321286] scx_root_enable_workfn+0x507/0x1650
[ 1985.321461] kthread_worker_fn+0x260/0x940
[ 1985.321745] kthread+0x303/0x3e0
[ 1985.321901] ret_from_fork+0x589/0x7d0
[ 1985.322065] ret_from_fork_asm+0x1a/0x30
DEBUG DUMP
===================================================================
central: root
scx_enable_help[134] triggered exit kind 1025:
scx_bpf_error (scx_central.bpf.c:314: init from non-central CPU)
Fix this by:
- Defer bpf_timer_start() to the first dispatch on the central CPU.
- Initialize the BPF timer in central_init() and kick the central CPU
to guarantee entering the dispatch path on the central CPU immediately.
- Remove the unnecessary sched_setaffinity() call in userspace.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
compat.bpf.h defined a fallback SCX_ENQ_IMMED macro using
__COMPAT_ENUM_OR_ZERO(). After 6bf36c68b0 ("tools/sched_ext:
Regenerate autogen enum headers") added SCX_ENQ_IMMED to the autogen
headers, including both triggers -Wmacro-redefined warnings.
The autogen definition through const volatile __weak already resolves to
0 on older kernels, providing the same backward compatibility. Remove
the now-redundant compat fallback.
Fixes: 6bf36c68b0 ("tools/sched_ext: Regenerate autogen enum headers")
Link: https://lore.kernel.org/r/20260326100313.338388-1-zhaomzhao@126.com
Reported-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_pair sizes pair_ctx to nr_cpu_ids / 2, so valid pair_ctx keys are
dense pair indexes in the range [0, nr_cpu_ids / 2).
However, the userspace setup code stores pair_id as the first CPU number
in each pair. On an 8-CPU system with "-S 1", that produces pair IDs
0, 2, 4 and 6 for pairs [0,1], [2,3], [4,5] and [6,7]. CPUs in the
latter half then look up pair_ctx with out-of-range keys and the BPF
scheduler aborts with:
EXIT: scx_bpf_error (scx_pair.bpf.c:328: failed to lookup pairc and
in_pair_mask for cpu[5])
Assign pair_id using a dense pair counter instead so that each CPU pair
maps to a valid pair_ctx entry. Besides, reject odd CPU configuration, as
scx_pair requires all CPUs to be paired.
Fixes: f0262b102c ("tools/sched_ext: add scx_pair scheduler")
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a transparent compatibility wrapper for the scx_bpf_sub_dispatch()
kfunc in compat.bpf.h. This allows BPF schedulers using the sub-sched
dispatch feature to build and run on older kernels that lack the kfunc.
To avoid requiring code changes in individual schedulers, the
transparent wrapper pattern is used instead of a __COMPAT prefix. The
kfunc is declared with a ___compat suffix, while the static inline
wrapper retains the original scx_bpf_sub_dispatch() name.
When the kfunc is unavailable, the wrapper safely falls back to
returning false. This is acceptable because the dispatch path cannot
do anything useful without underlying sub-sched support anyway.
Tested scx_qmap on v6.14 successfully.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Extend SCX_OPS_OPEN() with compatibility handling for ops.sub_attach()
and ops.sub_detach(), allowing scx C schedulers with sub-scheduler
support to run on kernels both with and without its support.
Cc: Cheng-Yang Chou <yphbchou0911@gmail.com>
Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The function scx_ops_error() was dropped, but the
comment here is left pointing to the old name.
Update to be consistent with current API.
Signed-off-by: Ke Zhao <ke.zhao.kernel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Direct writes to p->scx.dsq_vtime are deprecated in favor of
scx_bpf_task_set_dsq_vtime(). Update scx_simple, scx_flatcg, and
select_cpu_vtime selftest to use the new kfunc with
scale_by_task_weight_inverse().
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
SCX_ENQ_IMMED makes enqueue to local DSQs succeed only if the task can start
running immediately. Otherwise, the task is re-enqueued through ops.enqueue().
This provides tighter control but requires specifying the flag on every
insertion.
Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag. When set, SCX_ENQ_IMMED is
automatically applied to all local DSQ enqueues including through
scx_bpf_dsq_move_to_local().
scx_qmap is updated with -I option to test the feature and -F option for
IMMED stress testing which forces every Nth enqueue to a busy local DSQ.
v2: - Cover scx_bpf_dsq_move_to_local() path (now has enq_flags via ___v2).
- scx_qmap: Remove sched_switch and cpu_release handlers (superseded by
kernel-side wakeup_preempt_scx()). Add -F for IMMED stress testing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
scx_bpf_dsq_move_to_local() moves a task from a non-local DSQ to the
current CPU's local DSQ. This is an indirect way of dispatching to a local
DSQ and should support enq_flags like direct dispatches do - e.g.
SCX_ENQ_HEAD for head-of-queue insertion and SCX_ENQ_IMMED for immediate
execution guarantees.
Add scx_bpf_dsq_move_to_local___v2() with an enq_flags parameter. The
original becomes a v1 compat wrapper passing 0. The compat macro is updated
to a three-level chain: v2 (7.1+) -> v1 (current) -> scx_bpf_consume
(pre-rename). All in-tree BPF schedulers are updated to pass 0.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is
dispatched with IMMED, it either gets on the CPU immediately and stays on it,
or gets reenqueued back to the BPF scheduler. It will never linger on a local
DSQ behind other tasks or on a CPU taken by a higher-priority class.
rq_is_open() uses rq->next_class to determine whether the rq is available,
and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task
arrives. These capture all higher class preemptions. Combined with reenqueue
points in the dispatch path, all cases where an IMMED task would not execute
immediately are covered.
SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the
guarantee survives SAVE/RESTORE cycles. If preempted while running,
put_prev_task_scx() reenqueues through ops.enqueue() with
SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the
local DSQ.
This enables tighter scheduling latency control by preventing tasks from
piling up on local DSQs. It also enables opportunistic CPU sharing across
sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a
shared CPU, making it difficult for others to use.
v2: - Rewrite is_curr_done() as rq_is_open() using rq->next_class and
implement wakeup_preempt_scx() to achieve complete coverage of all
cases where IMMED tasks could get stranded.
- Track IMMED persistently in p->scx.flags and reenqueue
preempted-while-running tasks through ops.enqueue().
- Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT).
- Misc renames, documentation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Several demo schedulers and the selftest runner had usage strings
that omitted options which are actually supported:
- scx_central: add missing [-v]
- scx_pair: add missing [-v]
- scx_qmap: add missing [-S] and [-H]
- scx_userland: add missing [-v]
- scx_sdt: remove [-f] which no longer exists
- runner.c: add missing [-s], [-l], [-q]; drop [-h] which none of the
other sched_ext tools list in their usage lines
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
While running scx_flatcg, dmesg prints "SCX_OPS_HAS_CGROUP_WEIGHT is
deprecated and a noop", in code, SCX_OPS_HAS_CGROUP_WEIGHT has been
marked as DEPRECATED, and will be removed on 6.18. Now it's time to do it.
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
Regenerate enum_defs.autogen.h from the current vmlinux.h to pick up
new SCX enums added in the for-7.1 cycle.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Extract the inline bpf_program__assoc_struct_ops() call in SCX_OPS_LOAD()
into a __scx_ops_assoc_prog() helper and wrap it with a libbpf >= 1.7
version guard. bpf_program__assoc_struct_ops() was added in libbpf 1.7;
the guard provides a no-op fallback for older versions. Add the
<bpf/libbpf.h> include needed by the helper, and fix "assumming" typo in
a nearby comment.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
scx_bpf_select_cpu_and() is now an inline wrapper so
bpf_ksym_exists(scx_bpf_select_cpu_and) no longer works. Add
__COMPAT_HAS_scx_bpf_select_cpu_and macro that checks for either the
struct args type (new) or the compat ksym (old) to test availability.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Sync the following changes from the scx repo:
- Guard __arena define with #ifndef to avoid redefinition when the
attribute is already defined by another header.
- Add bpf_arena_reserve_pages() and bpf_arena_mapping_nr_pages() ksym
declarations.
- Rename TEST to SCX_BPF_UNITTEST to avoid collision with generic TEST
macros in other projects.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
The __has_include guard for sdt_task_defs.h is vestigial — the only
remaining content is the bpf_arena_common.h include which is available
unconditionally. Remove the dead guard.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
scx_bpf_dsq_reenq() currently only supports local DSQs. Extend it to support
user-defined DSQs by adding a deferred re-enqueue mechanism similar to the
local DSQ handling.
Add per-cpu deferred_reenq_user_node/flags to scx_dsq_pcpu and
deferred_reenq_users list to scx_rq. When scx_bpf_dsq_reenq() is called on a
user DSQ, the DSQ's per-cpu node is added to the current rq's deferred list.
process_deferred_reenq_users() then iterates the DSQ using the cursor helpers
and re-enqueues each task.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
scx_bpf_reenqueue_local() can only trigger re-enqueue of the current CPU's
local DSQ. Introduce scx_bpf_dsq_reenq() which takes a DSQ ID and can target
any local DSQ including remote CPUs via SCX_DSQ_LOCAL_ON | cpu. This will be
expanded to support user DSQs by future changes.
scx_bpf_reenqueue_local() is reimplemented as a simple wrapper around
scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future.
Update compat.bpf.h with a compatibility shim and scx_qmap to test the new
functionality.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
This is an early-stage partial implementation that demonstrates the core
building blocks for nested sub-scheduler dispatching. While significant
work remains in the enqueue path and other areas, this patch establishes
the fundamental mechanisms needed for hierarchical scheduler operation.
The key building blocks introduced include:
- Private stack support for ops.dispatch() to prevent stack overflow when
walking down nested schedulers during dispatch operations
- scx_bpf_sub_dispatch() kfunc that allows parent schedulers to trigger
dispatch operations on their direct child schedulers
- Proper parent-child relationship validation to ensure dispatch requests
are only made to legitimate child schedulers
- Updated scx_dispatch_sched() to handle both nested and non-nested
invocations with appropriate kf_mask handling
The qmap scheduler is updated to demonstrate the functionality by calling
scx_bpf_sub_dispatch() on registered child schedulers when it has no
tasks in its own queues.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
In preparation for multiple scheduler support, introduce scx_prog_sched()
accessor which returns the scx_sched instance associated with a BPF program.
The association is determined via the special KF_IMPLICIT_ARGS kfunc
parameter, which provides access to bpf_prog_aux. This aux can be used to
retrieve the struct_ops (sched_ext_ops) that the program is associated with,
and from there, the corresponding scx_sched instance.
For compatibility, when ops.sub_attach is not implemented (older schedulers
without sub-scheduler support), unassociated programs fall back to scx_root.
A warning is logged once per scheduler for such programs.
As scx_root is still the only scheduler, this shouldn't introduce
user-visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
A system often runs multiple workloads especially in multi-tenant server
environments where a system is split into partitions servicing separate
more-or-less independent workloads each requiring an application-specific
scheduler. To support such and other use cases, sched_ext is in the process
of growing multiple scheduler support.
When partitioning a system in terms of CPUs for such use cases, an
oft-taken approach is hard partitioning the system using cpuset. While it
would be possible to tie sched_ext multiple scheduler support to cpuset
partitions, such an approach would have fundamental limitations stemming
from the lack of dynamism and flexibility.
Users often don't care which specific CPUs are assigned to which workload
and want to take advantage of optimizations which are enabled by running
workloads on a larger machine - e.g. opportunistic over-commit, improving
latency critical workload characteristics while maintaining bandwidth
fairness, employing control mechanisms based on different criteria than
on-CPU time for e.g. flexible memory bandwidth isolation, packing similar
parts from different workloads on same L3s to improve cache efficiency,
and so on.
As this sort of dynamic behaviors are impossible or difficult to implement
with hard partitioning, sched_ext is implementing cgroup sub-sched support
where schedulers can be attached to the cgroup hierarchy and a parent
scheduler is responsible for controlling the CPUs that each child can use
at any given moment. This makes CPU distribution dynamically controlled by
BPF allowing high flexibility.
This patch adds the skeletal sched_ext cgroup sub-sched support:
- sched_ext_ops.sub_cgroup_id and .sub_attach/detach() are added. Non-zero
sub_cgroup_id indicates that the scheduler is to be attached to the
identified cgroup. A sub-sched is attached to the cgroup iff the nearest
ancestor scheduler implements .sub_attach() and grants the attachment. Max
nesting depth is limited by SCX_SUB_MAX_DEPTH.
- When a scheduler exits, all its descendant schedulers are exited
together. Also, cgroup.scx_sched added which points to the effective
scheduler instance for the cgroup. This is updated on scheduler
init/exit and inherited on cgroup online. When a cgroup is offlined, the
attached scheduler is automatically exited.
- Sub-sched support is gated on CONFIG_EXT_SUB_SCHED which is
automatically enabled if both SCX and cgroups are enabled. Sub-sched
support is not tied to the CPU controller but rather the cgroup
hierarchy itself. This is intentional as the support for cpu.weight and
cpu.max based resource control is orthogonal to sub-sched support. Note
that CONFIG_CGROUPS around cgroup subtree iteration support for
scx_task_iter is replaced with CONFIG_EXT_SUB_SCHED for consistency.
- This allows loading sub-scheds and most framework operations such as
propagating disable down the hierarchy work. However, sub-scheds are not
operational yet and all tasks stay with the root sched. This will serve
as the basis for building up full sub-sched support.
- DSQs point to the scx_sched they belong to.
- scx_qmap is updated to allow attachment of sub-scheds and also serving
as sub-scheds.
- scx_is_descendant() is added but not yet used in this patch. It is used by
later changes in the series and placed here as this is where the function
belongs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Similar to commit 835a507535 ("selftests/bpf: Add -fms-extensions to
bpf build flags") and commit 639f58a0f4 ("bpftool: Fix build warnings
due to MS extensions")
The kernel is now built with -fms-extensions, therefore
generated vmlinux.h contains types like:
struct aes_key {
struct aes_enckey;
union aes_invkey_arch inv_k;
};
struct ns_common {
...
union {
struct ns_tree;
struct callback_head ns_rcu;
};
};
Which raise warning like below when building scx scheduler:
tools/sched_ext/build/include/vmlinux.h:50533:3: warning:
declaration does not declare anything [-Wmissing-declarations]
50533 | struct ns_tree;
| ^
Fix it by using -fms-extensions and -Wno-microsoft-anon-tag flags
to build bpf programs that #include "vmlinux.h"
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_hotplug_seq() uses strtoul() but validates the result with a
negative check (val < 0), which can never be true for an unsigned
return value.
Use the endptr mechanism to verify the entire string was consumed,
and check errno == ERANGE for overflow detection.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.
Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).
BPF scheduler custody concept: a task is considered to be in the BPF
scheduler's custody when the scheduler is responsible for managing its
lifecycle. This includes tasks dispatched to user-created DSQs or stored
in the BPF scheduler's internal data structures from ops.enqueue().
Custody ends when the task is dispatched to a terminal DSQ (such as the
local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed
due to a property change.
Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
entirely and are never in its custody. Terminal DSQs include:
- Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
where tasks go directly to execution.
- Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
BPF scheduler is considered "done" with the task.
As a result, ops.dequeue() is not invoked for tasks directly dispatched
to terminal DSQs.
To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.
New ops.dequeue() semantics:
- ops.dequeue() is invoked exactly once when the task leaves the BPF
scheduler's custody, in one of the following cases:
a) regular dispatch: a task dispatched to a user DSQ or stored in
internal BPF data structures is moved to a terminal DSQ
(ops.dequeue() called without any special flags set),
b) core scheduling dispatch: core-sched picks task before dispatch
(ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set),
c) property change: task properties modified before dispatch,
(ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set).
This allows BPF schedulers to:
- reliably track task ownership and lifecycle,
- maintain accurate accounting of managed tasks,
- update internal state when tasks change properties.
Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The '-f' option is defined in getopt() but not handled in the switch
statement or documented in the help text. Providing '-f' currently
triggers the default error path.
Remove it to sync the optstring with the actual implementation.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The '-p' option is defined in getopt() but not handled in the switch
statement or documented in the help text. Providing '-p' currently
triggers the default error path.
Remove it to sync the optstring with the actual implementation.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
After goto restart, optind retains its advanced position from the
previous getopt loop, causing getopt() to immediately return -1.
This silently drops all command-line options on the restarted skeleton.
Reset optind to 1 at the restart label so options are re-parsed.
Affected schedulers: scx_simple, scx_central, scx_flatcg, scx_pair,
scx_sdt, scx_cpu0.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The stats thread reads nr_vruntime_enqueues, nr_vruntime_dispatches,
nr_vruntime_failed, and nr_curr_enqueued concurrently with the main
thread writing them, with no synchronization.
Use __atomic builtins with relaxed ordering for all accesses to these
counters to eliminate the data races.
Only display accuracy is affected, not scheduling correctness.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
nr_cpu_ids / 2 produces stride 0 on a single-CPU system, which later
causes SCX_BUG_ON(i == j) to fire. Validate stride after option
parsing to also catch invalid user-supplied values via -S.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use CPU_SET_S() instead of CPU_SET() on the dynamically allocated
cpuset to avoid a potential out-of-bounds write when nr_cpu_ids
exceeds CPU_SETSIZE.
Also destroy the skeleton before returning on invalid central CPU ID
to prevent a resource leak.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reset all counters, tasks and vruntime_head list on restart.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
fcg_read_stats() had a VLA allocating 21 * nr_cpus * 8 bytes on the
stack, risking stack overflow on large CPU counts (nr_cpus can be up
to 512).
Fix by using a single heap allocation with the correct size, reusing
it across all stat indices, and freeing it at the end.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix three issues in scx_userland's restart path:
- exit_req is not reset on restart, causing sched_main_loop() to exit
immediately without doing any scheduling work.
- stats_printer thread handle is local to spawn_stats_thread(), making
it impossible to join from main(). Promote it to file scope.
- The stats thread continues reading skel->bss after the skeleton is
destroyed on restart, causing a use-after-free. Join the stats thread
before destroying the skeleton to ensure it has exited.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The cpu set is dynamically allocated for nr_cpu_ids using CPU_ALLOC(),
so the size passed to sched_setaffinity() should be CPU_ALLOC_SIZE()
rather than sizeof(cpu_set_t). Valgrind flagged this as accessing
unaddressable bytes past the allocation.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The local cnts array in read_stats() is not initialized before being
accumulated into per-CPU stats, which may lead to reading garbage
values. Zero it out with memset alongside the existing stats array
initialization.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull sched_ext updates from Tejun Heo:
- Move C example schedulers back from the external scx repo to
tools/sched_ext as the authoritative source. scx_userland and
scx_pair are returning while scx_sdt (BPF arena-based task data
management) is new. These schedulers will be dropped from the
external repo.
- Improve error reporting by adding scx_bpf_error() calls when DSQ
creation fails across all in-tree schedulers
- Avoid redundant irq_work_queue() calls in destroy_dsq() by only
queueing when llist_add() indicates an empty list
- Fix flaky init_enable_count selftest by properly synchronizing
pre-forked children using a pipe instead of sleep()
* tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
selftests/sched_ext: Fix init_enable_count flakiness
tools/sched_ext: Fix data header access during free in scx_sdt
tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers
tools/sched_ext: add arena based scheduler
tools/sched_ext: add scx_pair scheduler
tools/sched_ext: add scx_userland scheduler
sched_ext: Add error logging for dsq creation failures
sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()
Fix a pointer arithmetic error in scx_sdt during freeing that
causes the allocator to use the wrong memory address for the
allocation's data header.
Fixes: 36929ebd17 ("tools/sched_ext: add arena based scheduler")
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add scx_bpf_error() calls when scx_bpf_create_dsq() fails in the remaining
schedulers to improve debuggability:
- scx_simple.bpf.c: simple_init()
- scx_sdt.bpf.c: sdt_init()
- scx_cpu0.bpf.c: cpu0_init()
- scx_flatcg.bpf.c: fcg_init()
This follows the same pattern established in commit 2f8d489897
("sched_ext: Add error logging for dsq creation failures") for other
schedulers and ensures consistent error reporting across all schedulers.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a scheduler that uses BPF arenas to manage task context data.
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>