mirror of
https://github.com/torvalds/linux.git
synced 2026-04-18 06:44:00 -04:00
nohz_full was introduced in v3.10 in 2013, which means this documentation is overdue for 13 years. Fortunately Paul wrote a part of the needed documentation a while ago, especially concerning nohz_full in Documentation/timers/no_hz.rst and also about per-CPU kthreads in Documentation/admin-guide/kernel-per-CPU-kthreads.rst Introduce a new page that gives an overview of CPU isolation in general. Acked-by: Waiman Long <longman@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Message-ID: <20260402094749.18879-1-frederic@kernel.org>
358 lines
11 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0

=============
CPU Isolation
=============

Introduction
============

"CPU Isolation" means leaving a CPU exclusive to a given workload
without any undesired code interference from the kernel.

Such interference, commonly referred to as "noise", can be triggered
by asynchronous events (interrupts, timers, scheduler preemption by
workqueues and kthreads, ...) or synchronous events (syscalls and page
faults).

Such noise usually goes unnoticed. After all, synchronous events are a
component of the requested kernel service. And asynchronous events are
either sufficiently well-distributed by the scheduler when executed
as tasks or reasonably fast when executed as interrupts. The timer
interrupt can even execute 1024 times per second without a significant
and measurable impact most of the time.

However, some rare and extreme workloads can be quite sensitive to
this kind of noise. This is the case, for example, with high
bandwidth network processing that can't afford losing a single packet
or very low latency network processing. Typically those use cases
involve DPDK, bypassing the kernel networking stack and performing
direct access to the networking device from userspace.

In order to run a CPU with little or no kernel noise, the
related housekeeping work needs to be either shut down, migrated or
offloaded.

Housekeeping
============

In the CPU isolation terminology, housekeeping is the work, often
asynchronous, that the kernel needs to process in order to maintain
all its services. It corresponds to the noise and disturbances
enumerated above. When at least one CPU is isolated, housekeeping may
additionally rely on further coping mechanisms if CPU-tied work must
be offloaded.

Housekeeping CPUs are the non-isolated CPUs to which the kernel noise
is moved away from isolated CPUs.

The isolation can be implemented in several ways depending on the
nature of the noise:

- Unbound work, where "unbound" means not tied to any CPU, can be
  simply migrated away from isolated CPUs to housekeeping CPUs.
  This is the case of unbound workqueues, kthreads and timers.

- Bound work, where "bound" means tied to a specific CPU, usually
  can't be moved away as-is by nature. Either:

  - The work must switch to a locked implementation. E.g.: RCU
    with CONFIG_RCU_NOCB_CPU.

  - The related feature must be shut down and considered
    incompatible with isolated CPUs. E.g.: lockup watchdog,
    unreliable clocksources, etc.

  - An elaborate and heavyweight coping mechanism stands as a
    replacement. E.g.: the timer tick is shut down on nohz_full
    CPUs but with the constraint of running a single task on
    them. A significant cost penalty is added on kernel entry/exit
    and a residual 1Hz scheduler tick is offloaded to housekeeping
    CPUs.

In any case, housekeeping work has to be handled, which is why there
must be at least one housekeeping CPU in the system, preferably more
if the machine runs a lot of CPUs, for example one per node on NUMA
systems.

Also, CPU isolation often means a tradeoff between noise-free isolated
CPUs and added overhead on housekeeping CPUs, sometimes even on
isolated CPUs entering the kernel.

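As an illustration, the housekeeping CPUs of a running system can be
listed by subtracting the isolated CPUs from the online ones. The sketch
below assumes that /sys/devices/system/cpu/online and
/sys/devices/system/cpu/isolated are available. The latter is understood
to reflect the "isolcpus=" boot parameter only, so CPUs isolated through
cpuset partitions may not show up there. The function names are
hypothetical helpers, not kernel interfaces.

::

    # Expand a simple CPU list such as "0-2,7" into "0 1 2 7".
    # Only the "N" and "N-M" forms printed by sysfs are handled.
    expand_cpulist() {
        echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
            [ -n "$first" ] || continue
            seq "$first" "${last:-$first}"
        done | tr '\n' ' ' | sed 's/ $//'
    }

    # Print the online CPUs that are not listed in the "isolated" file.
    # The optional argument overrides the sysfs CPU directory.
    housekeeping_cpus() {
        sysfs=${1:-/sys/devices/system/cpu}
        isolated=" $(expand_cpulist "$(cat "$sysfs/isolated" 2>/dev/null)") "
        result=""
        for cpu in $(expand_cpulist "$(cat "$sysfs/online")"); do
            case "$isolated" in
                *" $cpu "*) ;;                    # isolated, skip it
                *) result="$result $cpu" ;;
            esac
        done
        echo "${result# }"
    }

With CPUs 0-7 online and "7" in the isolated file, housekeeping_cpus
prints "0 1 2 3 4 5 6".
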
Isolation features
==================

Different levels of isolation can be configured in the kernel, each of
which has its own drawbacks and tradeoffs.

Scheduler domain isolation
--------------------------

This feature isolates a CPU from the scheduler topology. As a result,
the target isn't part of the load balancing. Tasks won't migrate
either from or to it unless affined explicitly.

As a side effect the CPU is also isolated from unbound workqueues and
unbound kthreads.

Requirements
~~~~~~~~~~~~

- CONFIG_CPUSETS=y for the cpusets-based interface

Tradeoffs
~~~~~~~~~

By nature, the system load is overall less distributed since some CPUs
are extracted from the global load balancing.

Interfaces
~~~~~~~~~~

- cpuset isolated partitions, described in
  Documentation/admin-guide/cgroup-v2.rst, are recommended
  because they are tunable at runtime.

- The 'isolcpus=' kernel boot parameter with the 'domain' flag is a
  less flexible alternative that doesn't allow for runtime
  reconfiguration.

IRQs isolation
--------------

Isolate the IRQs whenever possible, so that they don't fire on the
target CPUs.

Interfaces
~~~~~~~~~~

- The file /proc/irq/\*/smp_affinity, as explained in detail in
  Documentation/core-api/irq/irq-affinity.rst.

- The "irqaffinity=" kernel boot parameter for a default setting.

- The "managed_irq" flag in the "isolcpus=" kernel boot parameter
  makes a best-effort attempt to keep managed IRQs away from
  isolated CPUs.

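The smp_affinity files take a hexadecimal CPU mask rather than a CPU
list. The following sketch converts a list of housekeeping CPUs into
such a mask and writes it to every IRQ that accepts it. It is only an
illustration: the function names are hypothetical, the conversion is
deliberately limited to CPUs 0-31 (a single 32-bit mask word), and some
IRQs are expected to reject the write.

::

    # Convert a CPU list such as "0-6" into a hexadecimal mask ("7f").
    # Assumption: only CPUs 0-31 are handled by this sketch.
    cpulist_to_mask() {
        mask=0
        for range in $(echo "$1" | tr ',' ' '); do
            first=${range%-*}
            last=${range#*-}
            cpu=$first
            while [ "$cpu" -le "$last" ]; do
                if [ "$cpu" -gt 31 ]; then
                    echo "CPU $cpu: only CPUs 0-31 are handled here" >&2
                    return 1
                fi
                mask=$((mask | (1 << cpu)))
                cpu=$((cpu + 1))
            done
        done
        printf '%x\n' "$mask"
    }

    # Affine all IRQs to the given CPU list. The optional second
    # argument overrides /proc/irq. Rejected writes are only reported.
    set_all_irq_affinity() {
        procirq=${2:-/proc/irq}
        mask=$(cpulist_to_mask "$1") || return 1
        for f in "$procirq"/*/smp_affinity; do
            [ -e "$f" ] || continue
            { echo "$mask" > "$f"; } 2>/dev/null || echo "skipped: $f" >&2
        done
    }

For the 8-CPU layout used later in this document,
"set_all_irq_affinity 0-6" writes the mask 7f, which leaves CPU 7 out.
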
Full Dynticks (aka nohz_full)
-----------------------------

Full dynticks extends the dynticks idle mode, which stops the tick when
the CPU is idle, to CPUs running a single task in userspace. That is,
the timer tick is stopped if the environment allows it.

Global timer callbacks are also isolated from the nohz_full CPUs.

Requirements
~~~~~~~~~~~~

- CONFIG_NO_HZ_FULL=y

Constraints
~~~~~~~~~~~

- The isolated CPUs must run a single task only. Multitasking requires
  the tick to maintain preemption. This is usually fine since the
  workload usually can't stand the latency of random context switches.

- No call to the kernel from isolated CPUs, at the risk of triggering
  random noise.

- No use of POSIX CPU timers on isolated CPUs.

- The architecture must have a stable and reliable clocksource (no
  unreliable TSC that requires the watchdog).

Tradeoffs
~~~~~~~~~

In terms of cost, this is the most invasive isolation feature. It is
assumed to be used when the workload spends most of its time in
userspace and doesn't rely on the kernel except for preparatory
work because:

- RCU adds more overhead due to the locked, offloaded and threaded
  callback processing (the same that would be obtained with the
  "rcu_nocbs" boot parameter).

- Kernel entry/exit through syscalls, exceptions and IRQs is more
  costly due to fully ordered RmW operations that maintain userspace
  as an RCU extended quiescent state. Also the CPU time is accounted on
  kernel boundaries instead of periodically from the tick.

- Housekeeping CPUs must run a 1Hz residual remote scheduler tick
  on behalf of the isolated CPUs.

Checklist
=========

You have set up each of the above isolation features but you still
observe jitter that trashes your workload? Make sure to check a few
elements before proceeding.

Some of these checklist items are similar to those of real-time
workloads:

- Use mlock() to prevent your pages from being swapped away. Page
  faults are usually not compatible with jitter sensitive workloads.

- Avoid SMT to prevent your hardware thread from being "preempted"
  by another one.

- CPU frequency changes may induce subtle sorts of jitter in a
  workload. Cpufreq should be used and tuned with caution.

- Deep C-states may result in latency issues upon wake-up. If this
  happens to be a problem, C-states can be limited via kernel boot
  parameters such as processor.max_cstate or intel_idle.max_cstate.
  More fine-grained tunings are described in
  Documentation/admin-guide/pm/cpuidle.rst.

- Your system may be subject to firmware-originating interrupts; x86 has
  System Management Interrupts (SMIs), for example. Check your system BIOS
  to disable such interference, and with some luck your vendor will have
  BIOS tuning guidance for low-latency operations.

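Two of the items above can be checked quickly from a shell. The sketch
below assumes that /sys/devices/system/cpu/smt/active is present (it
reads "1" while SMT is in use) and that the "VmLck" field of
/proc/<pid>/status reports the locked memory of a task in kB. Both
function names are hypothetical.

::

    # Succeed when SMT is currently active. The optional argument
    # overrides the sysfs CPU directory.
    smt_is_active() {
        [ "$(cat "${1:-/sys/devices/system/cpu}/smt/active" 2>/dev/null)" = "1" ]
    }

    # Print the locked memory, in kB, found in a /proc/<pid>/status file.
    locked_kb() {
        awk '/^VmLck:/ { print $2 }' "$1"
    }

    # Example use:
    #   smt_is_active && echo "SMT is still active"
    #   locked_kb /proc/<pid>/status

A locked size of 0 for the workload suggests that its mlock() calls
did not take effect.
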
Full isolation example
======================

In this example, the system has 8 CPUs and the 8th is to be fully
isolated. Since CPUs start from 0, the 8th CPU is CPU 7.

Kernel parameters
-----------------

Set the following kernel boot parameters to disable SMT and set up tick
and IRQ isolation:

- Full dynticks: nohz_full=7

- IRQs isolation: irqaffinity=0-6

- Managed IRQs isolation: isolcpus=managed_irq,7

- Prevent SMT: nosmt

The full command line is then:

  nohz_full=7 irqaffinity=0-6 isolcpus=managed_irq,7 nosmt

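After rebooting, it can be worth confirming that the parameters were
actually taken into account. A minimal sketch, assuming the parameters
appear verbatim and unquoted in /proc/cmdline; the function names are
hypothetical:

::

    # Succeed if the exact parameter (e.g. "nohz_full=7") is on the
    # kernel command line. The second argument overrides the file read.
    cmdline_has() {
        for arg in $(cat "${2:-/proc/cmdline}"); do
            [ "$arg" = "$1" ] && return 0
        done
        return 1
    }

    # Report every parameter of this example that is missing.
    check_isolation_cmdline() {
        status=0
        for param in nohz_full=7 irqaffinity=0-6 isolcpus=managed_irq,7 nosmt; do
            if ! cmdline_has "$param" "$1"; then
                echo "missing: $param"
                status=1
            fi
        done
        return $status
    }

Calling check_isolation_cmdline without argument inspects the running
kernel and prints nothing when all four parameters are present.
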
CPUSET configuration (cgroup v2)
--------------------------------

Assuming cgroup v2 is mounted to /sys/fs/cgroup, the following script
isolates CPU 7 from scheduler domains.

::

    cd /sys/fs/cgroup
    # Activate the cpuset subsystem
    echo +cpuset > cgroup.subtree_control
    # Create partition to be isolated
    mkdir test
    cd test
    echo +cpuset > cgroup.subtree_control
    # Isolate CPU 7
    echo 7 > cpuset.cpus
    echo "isolated" > cpuset.cpus.partition

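Writing "isolated" does not guarantee that the partition is valid. The
state can be read back from the same file, which is expected to report
either "isolated" or an "invalid" state followed by a reason. A small
sketch of such a check, using a hypothetical helper name:

::

    # Check the partition state of a cpuset cgroup. The optional
    # argument overrides the cgroup directory used in this example.
    check_partition() {
        state=$(cat "${1:-/sys/fs/cgroup/test}/cpuset.cpus.partition")
        case "$state" in
            isolated)
                echo "ok: partition is isolated" ;;
            *invalid*)
                echo "invalid partition: $state"
                return 1 ;;
            *)
                echo "not isolated: $state"
                return 1 ;;
        esac
    }

Running check_partition right after the script above is a cheap way to
catch a misconfigured cpuset before starting the workload.
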
The userspace workload
----------------------

To fake a pure userspace workload, the program below runs a dummy
userspace loop on the isolated CPU 7.

::

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        // Move the current task to the isolated cpuset (bind to CPU 7)
        int fd = open("/sys/fs/cgroup/test/cgroup.procs", O_WRONLY);
        if (fd < 0) {
            perror("Can't open cpuset file");
            return 1;
        }

        // Writing PID 0 moves the calling process itself
        if (write(fd, "0\n", 2) < 0) {
            perror("Can't move to the isolated cpuset");
            close(fd);
            return 1;
        }
        close(fd);

        // Run an endless dummy loop until the launcher kills us
        while (1)
            ;

        return 0;
    }

Build it and keep the binary for the next step:

::

    # gcc user_loop.c -o user_loop

The launcher
------------

The launcher below runs the above program for 10 seconds and traces
the noise resulting from preempting tasks and IRQs.

::

    TRACING=/sys/kernel/tracing
    # Make sure tracing is off for now
    echo 0 > $TRACING/tracing_on
    # Flush previous traces
    echo > $TRACING/trace
    # Record disturbance from other tasks
    echo 1 > $TRACING/events/sched/sched_switch/enable
    # Record disturbance from interrupts
    echo 1 > $TRACING/events/irq_vectors/enable
    # Now we can start tracing
    echo 1 > $TRACING/tracing_on
    # Run the dummy user_loop for 10 seconds on CPU 7
    ./user_loop &
    USER_LOOP_PID=$!
    sleep 10
    kill $USER_LOOP_PID
    # Disable tracing and save traces from CPU 7 in a file
    echo 0 > $TRACING/tracing_on
    cat $TRACING/per_cpu/cpu7/trace > trace.7

If no specific problem arose, the output of trace.7 should look like
the following:

::

    <idle>-0 [007] d..2. 1980.976624: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=user_loop next_pid=1553 next_prio=120
    user_loop-1553 [007] d.h.. 1990.946593: reschedule_entry: vector=253
    user_loop-1553 [007] d.h.. 1990.946593: reschedule_exit: vector=253

That is, no noise was recorded between the first trace event and the
second during the 10 seconds that user_loop was running.

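On a noisy system the trace can grow large, in which case a per-event
count is easier to read than the raw file. The sketch below assumes the
default trace line layout shown above, where the fifth
whitespace-separated field is the event name, and therefore miscounts
tasks whose name contains spaces. The function name is hypothetical.

::

    # Print "<count> <event>" for each event found in a trace file,
    # ignoring the header lines that start with "#".
    summarize_trace() {
        awk '!/^#/ && NF >= 5 {
                 name = $5
                 sub(/:$/, "", name)
                 count[name]++
             }
             END { for (n in count) print count[n], n }' "$1" |
            LC_ALL=C sort -k2
    }

    # Example use:
    #   summarize_trace trace.7

For the three-line trace above, each of reschedule_entry,
reschedule_exit and sched_switch is reported once.
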
Debugging
=========

Of course things are never so easy, especially on this matter.
Chances are that actual noise will be observed in the aforementioned
trace.7 file.

The best way to investigate further is to enable finer grained
tracepoints such as those of subsystems producing asynchronous
events: workqueue, timer, irq_vectors, etc. It can also be
interesting to enable the tick_stop event to diagnose why the tick is
retained when that happens.

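These tracepoints can be enabled before the launcher starts tracing.
The sketch below assumes that the workqueue, timer (which is understood
to contain tick_stop), irq and irq_vectors event groups exist under
events/ in tracefs. irq_vectors is x86-specific, so missing groups are
only reported. The function name is hypothetical.

::

    # Enable the event groups most likely to explain residual noise.
    # The optional argument overrides the tracefs mount point.
    enable_noise_events() {
        tracing=${1:-/sys/kernel/tracing}
        for ev in workqueue timer irq irq_vectors; do
            f="$tracing/events/$ev/enable"
            if [ -w "$f" ]; then
                echo 1 > "$f"
            else
                echo "missing event group: $ev" >&2
            fi
        done
    }

Enabling whole groups keeps the script short at the cost of a more
verbose trace; individual events can be enabled instead once the source
of the noise is narrowed down.
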
Some tools may also be useful for higher level analysis:

- Documentation/tools/rtla/rtla.rst provides a suite of tools to analyze
  latency and noise in the system. For example,
  Documentation/tools/rtla/rtla-osnoise.rst runs a kernel tracer that
  analyzes the noise and outputs a summary of it.

- dynticks-testing does something similar to rtla-osnoise but in
  userspace. It is available at
  git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git