Commit Graph

2896 Commits

Author SHA1 Message Date
Zhan Xusheng
e9fb60a780 posix-timers: Fix stale function name in comment
The comment in exit_itimers() still refers to itimer_delete(),
which was replaced by posix_timer_delete(). Update the comment
accordingly.

Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260326142210.98632-1-zhanxusheng@xiaomi.com
2026-03-28 14:18:13 +01:00
Shrikanth Hegde
551e49beb1 timers: Get this_cpu once while clearing the idle state
Calling smp_processor_id():
- With CONFIG_DEBUG_PREEMPT=y, if preemption/irqs are disabled, it does
  not print any warning.
- With CONFIG_DEBUG_PREEMPT=n, it does nothing apart from invoking
  __smp_processor_id()

So with both CONFIG_DEBUG_PREEMPT=y/n, in a preemption disabled section it
is better to cache the value. It saves a few cycles; though tiny, repeated
calls add up.

timer_clear_idle() is called with interrupts disabled. So cache the value
once.
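As a userspace sketch of the caching pattern (helper names are hypothetical stand-ins, not the kernel's):

```c
#include <assert.h>

static int id_reads; /* counts how often the "CPU id" is fetched */

/* Hypothetical stand-in for smp_processor_id(). */
static int cpu_id(void)
{
	id_reads++;
	return 3;
}

/* Interrupts are disabled for the whole section, so the CPU cannot
 * change underneath us and a single cached read suffices. */
static int clear_idle_cached(void)
{
	int cpu = cpu_id();	/* fetched once */
	return cpu + cpu;	/* used as often as needed */
}
```

Repeated direct calls would bump id_reads once per use; the cached variant reads it exactly once.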

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Link: https://patch.msgid.link/20260323193630.640311-5-sshegde@linux.ibm.com
2026-03-24 23:21:30 +01:00
Ingo Molnar
f6472b1793 Merge tag 'v7.0-rc4' into timers/core, to resolve conflict
Resolve conflict between this change in the upstream kernel:

  4c652a4772 ("rseq: Mark rseq_arm_slice_extension_timer() __always_inline")

... and this pending change in timers/core:

  0e98eb1481 ("entry: Prepare for deferred hrtimer rearming")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2026-03-21 08:02:36 +01:00
Thomas Gleixner
763aacf86f clocksource: Rewrite watchdog code completely
The clocksource watchdog code has over time reached the state of an
impenetrable maze of duct tape and staples. The original design, which was
made in the context of systems far smaller than today, is based on the
assumption that the clocksource to be monitored (TSC) can be trivially
compared against a known stable clocksource (HPET/ACPI-PM timer).

Over the years it turned out that this approach has major flaws:

  - Long delays between watchdog invocations can result in wrap arounds
    of the reference clocksource

  - Scalability of the reference clocksource readout can degrade on large
    multi-socket systems due to interconnect congestion

This was addressed with various heuristics which degraded the accuracy of
the watchdog to the point that it fails to detect actual TSC problems on
older hardware which exposes slow inter-CPU drifts due to firmware
manipulating the TSC to hide SMI time.

To address this and bring back sanity to the watchdog, rewrite the code
completely with a different approach:

  1) Restrict the validation against a reference clocksource to the boot
     CPU, which is usually the CPU/Socket closest to the legacy block which
     contains the reference source (HPET/ACPI-PM timer). Validate that the
     reference readout is within a bound latency so that the actual
     comparison against the TSC stays within 500ppm as long as the clocks
     are stable.

  2) Compare the TSCs of the other CPUs in a round robin fashion against
     the boot CPU in the same way the TSC synchronization on CPU hotplug
     works. This still can suffer from delayed reaction of the remote CPU
     to the SMP function call and the latency of the control variable cache
     line. But this latency is not affecting correctness. It only affects
     the accuracy. With low contention the readout latency is in the low
     nanoseconds range, which detects even slight skews between CPUs. Under
     high contention this becomes obviously less accurate, but still
     detects slow skews reliably as it solely relies on subsequent readouts
     being monotonically increasing. It just can take slightly longer to
     detect the issue.

  3) Rewrite the watchdog test so it tests the various mechanisms one by
     one and validates the result against the expectation.
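The monotonicity rule in step 2 can be sketched as follows (a hypothetical helper over plain integers, not the actual watchdog code): a remote CPU is flagged when a later readout fails to advance past the previous one.

```c
#include <stddef.h>
#include <stdint.h>

/* Scan per-round readouts from one remote CPU and return the index of
 * the first round whose readout did not advance; -1 means no skew. */
static int find_skew(const uint64_t *readouts, size_t rounds)
{
	for (size_t i = 1; i < rounds; i++)
		if (readouts[i] <= readouts[i - 1])
			return (int)i;
	return -1;
}
```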

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Daniel J Blueman <daniel@quora.org>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Reviewed-by: Daniel J Blueman <daniel@quora.org>
Link: https://patch.msgid.link/20260123231521.926490888@kernel.org
Link: https://patch.msgid.link/87h5qeomm5.ffs@tglx
2026-03-20 13:36:32 +01:00
Thomas Gleixner
1432f9d4e8 clocksource: Don't use non-continuous clocksources as watchdog
Using a non-continuous aka untrusted clocksource as a watchdog for another
untrusted clocksource is equivalent to putting the fox in charge of the
henhouse.

That's especially true with the jiffies clocksource which depends on
interrupt delivery based on a periodic timer. Neither the frequency of that
timer nor the kernel's ability to react to it in a timely manner and rearm
it (if it is not self-rearming) is trustworthy.

Just don't bother to deal with this. It's not worth the trouble and only
relevant to museum piece hardware.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260123231521.858743259@kernel.org
2026-03-12 12:23:27 +01:00
Thomas Weißschuh (Schneider Electric)
88c316ff76 hrtimer: Add a helper to retrieve a hrtimer from its timerqueue node
The container_of() call is open-coded multiple times.

Add a helper macro.

Use container_of_const() to preserve constness.
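The open-coded pattern being replaced looks like this in a plain userspace sketch (simplified structures; the kernel helper is additionally built on container_of_const() to preserve constness):

```c
#include <assert.h>
#include <stddef.h>

struct timerqueue_node { long expires; };

struct hrtimer {
	int other_fields;
	struct timerqueue_node node;	/* embedded queue node */
};

/* Recover the enclosing hrtimer from a pointer to its embedded node;
 * this is the container_of() idiom the new helper macro wraps. */
#define hrtimer_of_node(n) \
	((struct hrtimer *)((char *)(n) - offsetof(struct hrtimer, node)))
```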

Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-12-095357392669@linutronix.de
2026-03-12 12:15:56 +01:00
Thomas Weißschuh (Schneider Electric)
bd803783df hrtimer: Drop unnecessary pointer indirection in hrtimer_expire_entry event
This pointer indirection is a remnant from when ktime_t was a struct,
today it is pointless.

Drop the pointer indirection.

Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-9-095357392669@linutronix.de
2026-03-12 12:15:55 +01:00
Thomas Weißschuh (Schneider Electric)
194675f16d hrtimer: Don't zero-initialize ret in hrtimer_nanosleep()
The value will be assigned to before any usage.
No other function in hrtimer.c does such a zero-initialization.

Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-7-095357392669@linutronix.de
2026-03-12 12:15:55 +01:00
Thomas Weißschuh (Schneider Electric)
112c685f02 timekeeping: Mark offsets array as const
Neither the array nor the offsets it is pointing to are meant to be
changed through the array.

Mark both the array and the values it points to as const.
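In C terms the change amounts to the double-const form below (offset values are made up for illustration):

```c
#include <stdint.h>

/* Both the pointers and the pointed-to offsets are read-only:
 * 'const int64_t *const' forbids modification through the array. */
static const int64_t off_real = 37, off_boot = 12, off_tai = 74;

static const int64_t *const offsets[] = { &off_real, &off_boot, &off_tai };

static int64_t read_offset(int idx)
{
	return *offsets[idx];
}
```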

Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-5-095357392669@linutronix.de
2026-03-12 12:15:54 +01:00
Thomas Weißschuh (Schneider Electric)
ba546d3d89 timekeeping/auxclock: Consistently use raw timekeeper for tk_setup_internals()
In aux_clock_enable() the clocksource from tkr_raw is used to call
tk_setup_internals(). Do the same in tk_aux_update_clocksource().  While
the clocksources will be the same in any case, this is less confusing.

Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-4-095357392669@linutronix.de
2026-03-12 12:15:54 +01:00
Thomas Weißschuh (Schneider Electric)
bb2705b4e0 timer_list: Print offset as signed integer
The offset of a hrtimer base may be negative.

Print those values correctly.
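A minimal sketch of the fix (hypothetical formatter, not the timer_list code): a negative offset printed through an unsigned conversion shows up as a huge wrapped-around value, while the signed conversion prints it as-is.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a (possibly negative) base offset with a signed conversion. */
static void fmt_offset(char *buf, size_t len, int64_t offset)
{
	snprintf(buf, len, "%" PRId64, offset);
}
```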

Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-3-095357392669@linutronix.de
2026-03-12 12:15:54 +01:00
Thomas Gleixner
1e4a70e0f6 Merge branch 'sched/hrtick' into timers/core
Pick up the hrtick related hrtimer changes so other unrelated changes can
be queued on top.
2026-03-11 21:14:32 +01:00
Peter Zijlstra
92f7ee408c hrtimer: Less aggressive interrupt 'hang' handling
When the hrtimer_interrupt needs to restart more than 3 times and still has
expired timers, the interrupt is considered hung. To give the system a
little time to recover, the hardware timer is programmed a little into the
future.

Prior to commit 2889243848 ("hrtimer: Re-arrange hrtimer_interrupt()"),
this was relative to the amount of time spent serving the interrupt with a
max of 100 msec.

However, in order to simplify, and because this condition 'should' not
happen, the timeout was unconditionally set to 100 msec.

'Obviously' there is a benchmark that hits this hard, by programming a
ton of very short timers :-/

Since reprogramming is decoupled from the interrupt handling, the actual
execution time is lost; however, the code does track max_hang_time. Using
that rather than the 100 ms max restores performance.

  stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --timermix 64

                bogo ops/s
 2889243848^1:  23715979.93
 2889243848:    11550049.77
 patched:       23361116.78

Additionally, Thomas noted that cpu_base->hang_detected should not be
cleared until the next interrupt, such that __hrtimer_reprogram() won't
undo the extra delay.

Fixes: 2889243848 ("hrtimer: Re-arrange hrtimer_interrupt()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311121500.GF652779@noisy.programming.kicks-ass.net
Closes: https://lore.kernel.org/oe-lkp/202603102229.74b9dee4-lkp@intel.com
2026-03-11 21:13:55 +01:00
Steven Rostedt
755a648e78 time/jiffies: Mark jiffies_64_to_clock_t() notrace
The trace_clock_jiffies() function that handles the "uptime" clock for
tracing calls jiffies_64_to_clock_t(). This causes the function tracer to
constantly recurse when the tracing clock is set to "uptime". Mark it
notrace to prevent unnecessary recursion when using the "uptime" clock.

Fixes: 58d4e21e50 ("tracing: Fix wraparound problems in "uptime" trace clock")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260306212403.72270bb2@robin
2026-03-11 10:33:12 +01:00
Linus Torvalds
6ff1020c2f Merge tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Make clock_adjtime() syscall timex validation slightly more permissive
  for auxiliary clocks, to not reject syscalls based on the status field
  that do not try to modify the status field.

  This makes the ABI behavior in clock_adjtime() consistent with
  CLOCK_REALTIME"

* tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Fix timex status validation for auxiliary clocks
2026-03-07 17:09:15 -08:00
Thomas Gleixner
53007d526e clocksource: Update clocksource::freq_khz on registration
Borislav reported a division by zero in the timekeeping code and random
hangs with the new coupled clocksource/clockevent functionality.

It turned out that the TSC clocksource is not always updating the
freq_khz field of the clocksource on registration. The coupled mode
conversion calculation requires the frequency and as it's not
initialized the resulting factor is zero or a random value. As a
consequence this causes a division by zero or random boot hangs.

Instead of chasing down all clocksources which fail to update that
member, fill it in at registration time where the caller has to supply
the frequency anyway, except for special clocksources like jiffies which
can never operate in coupled mode.

To make this more robust put a check into the registration function to
validate that the caller supplied a frequency if the coupled mode
feature bit is set. If not, emit a warning and clear the feature bit.

Fixes: cd38bdb8e6 ("timekeeping: Provide infrastructure for coupled clockevents")
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/87cy1jsa4m.ffs@tglx
Closes: https://lore.kernel.org/20260303213027.GA2168957@ax162
2026-03-05 17:41:06 +01:00
Thomas Gleixner
9d5e25b361 timekeeping: Initialize the coupled clocksource conversion completely
Nathan reported a boot failure after the coupled clocksource/event support
was enabled for the TSC deadline timer. It turns out that on the affected
test systems the TSC frequency is not refined against HPET, so it is
registered with the same frequency as the TSC-early clocksource.

As a consequence the update function which checks for a change of the
shift/mult pair of the clocksource fails to compute the conversion
limit, which is zero initialized. This check is there to avoid pointless
computations on every timekeeping update cycle (tick).

So the actual clockevent conversion function limits the delta expiry to
zero, which means the timer is always programmed to expire in the
past. This obviously results in a spectacular timer interrupt storm,
which goes unnoticed because the per CPU interrupts on x86 are not
exposed to the runaway detection mechanism and the NMI watchdog is not
yet functional. So the machine simply stops booting.

That did not show up in testing. All test machines refine the TSC frequency
so TSC has a different shift/mult pair than TSC-early and the conversion
limit is properly initialized.

Cure that by setting the conversion limit right at the point where the new
clocksource is installed.

Fixes: cd38bdb8e6 ("timekeeping: Provide infrastructure for coupled clockevents")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/87bjh4zies.ffs@tglx
Closes: https://lore.kernel.org/20260303012905.GA978396@ax162
2026-03-05 17:40:46 +01:00
Miroslav Lichvar
e48a869957 timekeeping: Fix timex status validation for auxiliary clocks
The timekeeping_validate_timex() function validates the timex status
of an auxiliary system clock even when the status is not to be changed,
which causes unexpected errors for applications that make read-only
clock_adjtime() calls, or set some other timex fields, but without
clearing the status field.

Do the AUX-specific status validation only when the modes field contains
ADJ_STATUS, i.e. the application is actually trying to change the
status. This makes the AUX-specific clock_adjtime() behavior consistent
with CLOCK_REALTIME.

Fixes: 4eca49d0b6 ("timekeeping: Prepare do_adjtimex() for auxiliary clocks")
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260225085231.276751-1-mlichvar@redhat.com
2026-03-04 20:05:37 +01:00
Linus Torvalds
ecc64d2dc9 Merge tag 'sysctl-7.00-fixes-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl fix from Joel Granados:

 - Fix error when reporting jiffies converted values back to user space

   Return the converted value instead of "Invalid argument" error

* tag 'sysctl-7.00-fixes-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  time/jiffies: Fix sysctl file error on configurations where USER_HZ < HZ
2026-03-04 08:21:11 -08:00
Gerd Rausch
6932256d3a time/jiffies: Fix sysctl file error on configurations where USER_HZ < HZ
Commit 2dc164a48e ("sysctl: Create converter functions with two new
macros") incorrectly returns error to user space when jiffies sysctl
converter is used. The old overflow check got replaced with an
unconditional one:
     +    if (USER_HZ < HZ)
     +        return -EINVAL;
which will always be true on configurations with "USER_HZ < HZ".

Remove the check; it is no longer needed as clock_t_to_jiffies() returns
ULONG_MAX for the overflow case and proc_int_u2k_conv_uop() checks for
"> INT_MAX" after conversion

Fixes: 2dc164a48e ("sysctl: Create converter functions with two new macros")
Reported-by: Colm Harrington <colm.harrington@oracle.com>
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-03-04 13:48:31 +01:00
Linus Torvalds
0031c06807 Merge tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:

 - Fix circular locking dependency in cpuset partition code by
   deferring housekeeping_update() calls to a workqueue instead
   of calling them directly under cpus_read_lock

 - Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when
   generate_sched_domains() returns NULL due to kmalloc failure

 - Fix incorrect cpuset behavior for effective_xcpus in
   partition_xcpus_del() and cpuset_update_tasks_cpumask()
   in update_cpumasks_hier()

 - Fix race between task migration and cgroup iteration

* tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked
  cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
  kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
  cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
  cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
  cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()
  cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
  cgroup: fix race between task migration and iteration
2026-03-03 14:25:18 -08:00
Linus Torvalds
f6542af922 Merge tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for
  the common HZ=100, 250 or 1000 cases. Only use a function call for odd
  HZ values like HZ=300 that generate more code.

  The function call overhead showed up in performance tests of the TCP
  code"

* tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()
2026-03-01 12:15:58 -08:00
Thomas Gleixner
343f2f4dc5 hrtimer: Try to modify timers in place
When modifying the expiry of an armed timer it is first dequeued, then the
expiry value is updated and then it is queued again.

This can be avoided when the new expiry value is within the range of the
previous and the next timer as that does not change the position in the RB
tree.

The linked timerqueue allows to peek ahead to the neighbours and check
whether the new expiry time is within the range of the previous and next
timer. If so, just modify the timer in place and spare the dequeue and
requeue effort, which might end up rotating the RB tree twice for nothing.

This speeds up the handling of frequently rearmed hrtimers, like the hrtick
scheduler timer significantly.
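The neighbour check can be sketched like this (illustrative node layout, not the kernel's timerqueue API):

```c
#include <stdbool.h>
#include <stdint.h>

struct tq_node {
	int64_t expires;
	struct tq_node *prev, *next;	/* sorted-order neighbours */
};

/* The position in the sorted structure only changes when the new
 * expiry crosses a neighbour; otherwise update in place. */
static bool update_in_place(struct tq_node *t, int64_t new_expiry)
{
	if (t->prev && new_expiry < t->prev->expires)
		return false;		/* would have to move left */
	if (t->next && new_expiry > t->next->expires)
		return false;		/* would have to move right */
	t->expires = new_expiry;	/* position unchanged */
	return true;
}
```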

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.873359816@kernel.org
2026-02-27 16:40:17 +01:00
Thomas Gleixner
b7418e6e9b hrtimer: Use linked timerqueue
To prepare for optimizing the rearming of enqueued timers, switch to the
linked timerqueue. That allows to check whether the new expiry time changes
the position of the timer in the RB tree or not, by checking the new expiry
time against the previous and the next timers expiry.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.806643179@kernel.org
2026-02-27 16:40:16 +01:00
Thomas Gleixner
3601a1d850 hrtimer: Optimize for_each_active_base()
Give the compiler some help to emit way better code.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.599804894@kernel.org
2026-02-27 16:40:15 +01:00
Thomas Gleixner
a64ad57e41 hrtimer: Simplify run_hrtimer_queues()
Replace the open coded container_of() orgy with a trivial
clock_base_next_timer() helper.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.532927977@kernel.org
2026-02-27 16:40:15 +01:00
Thomas Gleixner
2bd1cc24fa hrtimer: Rework next event evaluation
The per clock base cached expiry time allows to do a more efficient
evaluation of the next expiry on a CPU.

Separate the reprogramming evaluation from the NOHZ idle evaluation which
needs to exclude the NOHZ timer to keep the reprogramming path lean and
clean.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.468186893@kernel.org
2026-02-27 16:40:15 +01:00
Thomas Gleixner
eddffab828 hrtimer: Keep track of first expiring timer per clock base
Evaluating the next expiry time of all clock bases is cache line expensive
as the expiry time of the first expiring timer is not cached in the base
and requires to access the timer itself, which is definitely in a different
cache line.

It's way more efficient to keep track of the expiry time on enqueue and
dequeue operations as the relevant data is already in the cache at that
point.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.404839710@kernel.org
2026-02-27 16:40:14 +01:00
Thomas Gleixner
b95c4442b0 hrtimer: Avoid re-evaluation when nothing changed
Most times there is no change between hrtimer_interrupt() deferring the rearm
and the invocation of hrtimer_rearm_deferred(). In those cases it's a pointless
exercise to re-evaluate the next expiring timer.

Cache the required data and use it if nothing changed.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.338569372@kernel.org
2026-02-27 16:40:14 +01:00
Peter Zijlstra
15dd3a9488 hrtimer: Push reprogramming timers into the interrupt return path
Currently hrtimer_interrupt() runs expired timers, which can re-arm
themselves, after which it computes the next expiration time and
re-programs the hardware.

However, things like HRTICK, a highres timer driving preemption, cannot
re-arm themselves at the point of running, since the next task has not been
determined yet. The schedule() in the interrupt return path will switch to
the next task, which then causes a new hrtimer to be programmed.

This then results in reprogramming the hardware at least twice, once after
running the timers, and once upon selecting the new task.

Notably, *both* events happen in the interrupt.

By pushing the hrtimer reprogram all the way into the interrupt return
path, it runs after schedule() picks the new task and the double reprogram
can be avoided.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.273488269@kernel.org
2026-02-27 16:40:14 +01:00
Peter Zijlstra
a43b4856bc hrtimer: Prepare stubs for deferred rearming
The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.

That's obviously correct, but in the case that an expired timer set
NEED_RESCHED the return from interrupt ends up in schedule(). If HRTICK is
enabled then schedule() will modify the hrtick timer, which causes another
reprogramming of the hardware.

That can be avoided by deferring the rearming to the return from interrupt
path and if the return results in an immediate schedule() invocation then it
can be deferred until the end of schedule().

To make this correct the affected code parts need to be made aware of this.

Provide empty stubs for the deferred rearming mechanism, so that the
relevant code changes for entry, softirq and scheduler can be split up into
separate changes independent of the actual enablement in the hrtimer code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.000891171@kernel.org
2026-02-27 16:40:13 +01:00
Thomas Gleixner
9e07a9c980 hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm
The upcoming deferred rearming scheme has the same effect as the deferred
rearming when the hrtimer interrupt is executing. So it can reuse the
in_hrtirq flag, but when it gets deferred beyond the hrtimer interrupt
path, then the name does not make sense anymore.

Rename it to deferred_rearm upfront to keep the actual functional change
separate from the mechanical rename churn.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.935623347@kernel.org
2026-02-27 16:40:12 +01:00
Peter Zijlstra
2889243848 hrtimer: Re-arrange hrtimer_interrupt()
Rework hrtimer_interrupt() such that reprogramming is split out into an
independent function at the end of the interrupt.

This prepares for reprogramming getting delayed beyond the end of
hrtimer_interrupt().

Notably, this changes the hang handling to always wait 100ms instead of
trying to keep it proportional to the actual delay. This simplifies the
state; also, this really shouldn't be happening.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.870639266@kernel.org
2026-02-27 16:40:12 +01:00
Thomas Gleixner
85a690d1c1 hrtimer: Separate remove/enqueue handling for local timers
As the base switch can be avoided completely when the base stays the same,
the remove/enqueue handling can be more streamlined.

Split it out into a separate function which handles both in one go which is
way more efficient and makes the code simpler to follow.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.737600486@kernel.org
2026-02-27 16:40:11 +01:00
Thomas Gleixner
c939191457 hrtimer: Use NOHZ information for locality
The decision to keep a timer which is associated to the local CPU on that
CPU does not take NOHZ information into account. As a result there are a
lot of hrtimer base switch invocations which end up not switching the base
and stay on the local CPU. That's just work for nothing and can be further
improved.

If the local CPU is part of the NOISE housekeeping mask, then check:

  1) Whether the local CPU has the tick running, which means it is
     either not idle or already expecting a timer soon.

  2) Whether the tick is stopped and need_resched() is set, which
     means the CPU is about to exit idle.

This reduces the amount of hrtimer base switch attempts, which end up on
the local CPU anyway, significantly and prepares for further optimizations.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.673473029@kernel.org
2026-02-27 16:40:11 +01:00
Thomas Gleixner
3288cd4863 hrtimer: Optimize for local timers
The decision whether to keep timers on the local CPU or on the CPU they are
associated to is suboptimal and causes the expensive switch_hrtimer_base()
mechanism to be invoked more than necessary. This is especially true for
pinned timers.

Rewrite the decision logic so that the current base is kept if:

   1) The callback is running on the base

   2) The timer is associated to the local CPU and is the first expiring
      timer, as that allows to optimize for reprogramming avoidance

   3) The timer is associated to the local CPU and pinned

   4) The timer is associated to the local CPU and timer migration is
      disabled.

Only #2 was covered by the original code, but especially #3 makes a
difference for high frequency rearming timers like the scheduler hrtick
timer. If timer migration is disabled, then #4 avoids most of the base
switches.
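The four conditions boil down to a predicate of this shape (field names are illustrative, not the kernel's):

```c
#include <stdbool.h>

struct timer_ctx {
	bool callback_running;		/* #1 */
	bool on_local_cpu;
	bool first_expiring;		/* #2 */
	bool pinned;			/* #3 */
	bool migration_disabled;	/* #4 */
};

/* Keep the current base and skip switch_hrtimer_base() entirely when
 * any of the four conditions above holds. */
static bool keep_current_base(const struct timer_ctx *t)
{
	if (t->callback_running)
		return true;
	return t->on_local_cpu &&
	       (t->first_expiring || t->pinned || t->migration_disabled);
}
```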

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.607935269@kernel.org
2026-02-27 16:40:11 +01:00
Thomas Gleixner
22f011be7a hrtimer: Convert state and properties to boolean
All 'u8' flags are true booleans, so make it entirely clear that these can
only contain true or false.

This is especially true for hrtimer::state, which has a historical leftover
of using the state with bitwise operations. That was used in the early
hrtimer implementation with several bits, but then converted to a boolean
state. But that conversion failed to replace the bit OR and bit check
operations all over the place, which creates suboptimal code. As of today
'state' is a misnomer because its only purpose is to reflect whether the
timer is enqueued into the RB-tree or not. Rename it to 'is_queued' and
make all operations on it boolean.

This reduces text size from 8926 to 8732 bytes.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.542427240@kernel.org
2026-02-27 16:40:11 +01:00
Thomas Gleixner
7d27eafe54 hrtimer: Replace the bitfield in hrtimer_cpu_base
Use bool for the various flags as that creates better code in the hot path.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.475262618@kernel.org
2026-02-27 16:40:10 +01:00
Thomas Gleixner
8ffc9ea881 hrtimer: Evaluate timer expiry only once
No point in accessing the timer twice.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.409352042@kernel.org
2026-02-27 16:40:10 +01:00
Thomas Gleixner
0c6af0ea51 hrtimer: Cleanup coding style and comments
As this code has some major surgery ahead, clean up coding style and bring
comments up to date.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.342740952@kernel.org
2026-02-27 16:40:10 +01:00
Thomas Gleixner
6abfc2bd5b hrtimer: Use guards where appropriate
Simplify and tidy up the code where possible.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.275551488@kernel.org
2026-02-27 16:40:09 +01:00
Thomas Gleixner
f2e388a019 hrtimer: Reduce trace noise in hrtimer_start()
hrtimer_start() when invoked with an already armed timer traces like:

 <comm>-..   [032] d.h2. 5.002263: hrtimer_cancel: hrtimer= ....
 <comm>-..   [032] d.h1. 5.002263: hrtimer_start: hrtimer= ....

This is incorrect as the timer does not get canceled; just the expiry time
changes. The internal dequeue operation which is required for that is not
really interesting for trace analysis, but it makes it tedious to tell real
cancellations and the above case apart.

Remove the cancel tracing in hrtimer_start() and add a 'was_armed'
indicator to the hrtimer start tracepoint, which clearly indicates what the
state of the hrtimer is when hrtimer_start() is invoked:

 <comm>-..   [032] d.h1. 6.200103: hrtimer_start: hrtimer= .... was_armed=0
 <comm>-..   [032] d.h1. 6.200558: hrtimer_start: hrtimer= .... was_armed=1

Fixes: c6a2a17702 ("hrtimer: Add tracepoint for hrtimers")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.208491877@kernel.org
2026-02-27 16:40:09 +01:00
Thomas Gleixner
513e744a0a hrtimer: Add debug object init assertion
The debug object coverage in hrtimer_start_range_ns() happens too late to
do anything useful. Implement the init assertion part and invoke
that early in hrtimer_start_range_ns().

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.143098153@kernel.org
2026-02-27 16:40:09 +01:00
Thomas Gleixner
89f951a1e8 clockevents: Provide support for clocksource coupled comparators
Some clockevent devices are coupled to the system clocksource by
implementing a less than or equal comparator which compares the programmed
absolute expiry time against the underlying time counter.

The timekeeping core provides a function to convert an absolute
CLOCK_MONOTONIC based expiry time to an absolute clock cycles value which can
be directly fed into the comparator. That spares two time reads in the next
event programming path, one to convert the absolute nanoseconds time to a
delta value and the other to convert the delta value back to an absolute
time value suitable for the comparator.

Provide a new clocksource callback which takes the absolute cycle value and
wire it up in clockevents_program_event(). Similar to clocksources allow
architectures to inline the rearm operation.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.010425428@kernel.org
2026-02-27 16:40:08 +01:00
Thomas Gleixner
cd38bdb8e6 timekeeping: Provide infrastructure for coupled clockevents
Some architectures have clockevent devices which are coupled to the system
clocksource by implementing a less than or equal comparator which compares
the programmed absolute expiry time against the underlying time
counter. Well known examples are TSC/TSC deadline timer and the S390 TOD
clocksource/comparator.

While the concept is nice it has some downsides:

  1) The clockevents core code is strictly based on relative expiry times
     as that's the most common case for clockevent device hardware. That
     requires converting the absolute expiry time provided by the caller
     (hrtimers, NOHZ code) to a relative expiry time by reading and
     subtracting the current time.

     The clockevent::set_next_event() callback must then read the counter
     again to convert the relative expiry back into an absolute one.

  2) The conversion factors from nanoseconds to counter clock cycles are
     set up when the clockevent is registered. When NTP applies corrections
     then the clockevent conversion factors can deviate from the
     clocksource conversion substantially which either results in timers
     firing late or in the worst case early. The early expiry then needs to
     do a reprogram with a short delta.

     In most cases this is papered over by the fact that the read in the
     set_next_event() callback happens after the read which is used to
     calculate the delta. So the tendency is that timers expire mostly
     late.

All of this can be avoided by providing support for these devices in the
core code:

  1) The timekeeping core keeps track of the last update to the clocksource
     by storing the base nanoseconds and the corresponding clocksource
     counter value. That's used to keep the conversion math for reading the
     time within 64-bit in the common case.

     This information can be used to avoid both reads of the underlying
     clocksource in the clockevents reprogramming path:

     delta = expiry - base_ns;
     cycles = base_cycles + ((delta * clockevent::mult) >> clockevent::shift);

     The resulting cycles value can be directly used to program the
     comparator.

  2) As #1 no longer provides the "compensation" through the second
     read, the deviation of the clocksource and clockevent conversions
     caused by NTP becomes more prominent.

     This can be cured by letting the timekeeping core compute and store
     the reverse conversion factors when the clocksource cycles to
     nanoseconds factors are modified by NTP:

         CS::MULT       (1 << NS_TO_CYC_SHIFT)
     ---------------- = ----------------------
     (1 << CS::SHIFT)       NS_TO_CYC_MULT

     Ergo: NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT

     The NS_TO_CYC_SHIFT value is calculated when the clocksource is
     installed so that it aims for a one hour maximum sleep time.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.944763521@kernel.org
2026-02-27 16:40:08 +01:00
Thomas Gleixner
2e27beeb66 timekeeping: Allow inlining clocksource::read()
On some architectures clocksource::read() boils down to a single
instruction, so the indirect function call is just a massive overhead
especially with speculative execution mitigations in effect.

Allow architectures to enable conditional inlining of that read to avoid
that by:

   - providing a static branch to switch to the inlined variant

   - disabling the branch before clocksource changes

   - enabling the branch after a clocksource change, when the clocksource
     indicates in a feature flag that it is the one which provides the
     inlined variant

This is intentionally not a static call as that would only remove the
indirect call, but not the rest of the overhead.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.675151545@kernel.org
2026-02-27 16:40:07 +01:00
Thomas Gleixner
7080280739 clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME
The only real use case for this is the hrtimer based broadcast device.
No point in using two different feature flags for this.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.609049777@kernel.org
2026-02-27 16:40:06 +01:00
Thomas Gleixner
adcec6a7f5 tick/sched: Avoid hrtimer_cancel/start() sequence
The sequence of cancel and start is inefficient. It has to do the timer
lock/unlock twice and in the worst case has to reprogram the underlying
clock event device twice.

The reason why it is done this way is the usage of hrtimer_forward_now(),
which requires the timer to be inactive.

But that can be completely avoided as the forward can be done on a variable
and does not need any of the overrun accounting provided by
hrtimer_forward_now().

Implement a trivial forwarding mechanism and replace the cancel/reprogram
sequence with hrtimer_start(..., new_expiry).

For the non high resolution case the timer is not actually armed, but used
for storage so that code checking for expiry times can unconditionally look
it up in the timer. So it is safe for that case to set the new expiry time
directly.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.542178086@kernel.org
2026-02-27 16:40:06 +01:00
Peter Zijlstra
b7dd64778a hrtimer: Provide LAZY_REARM mode
The hrtick timer is frequently rearmed before expiry and most of the time
the new expiry is past the armed one. As this happens on every context
switch it becomes expensive with scheduling heavy workloads, especially in
virtual machines as the "hardware" reprogramming implies a VM exit.

Add a lazy rearm mode flag which skips the reprogramming if:

    1) The timer was the first expiring timer before the rearm

    2) The new expiry time is farther out than the armed time

This avoids a massive amount of reprogramming operations of the hrtick
timer for the price of eventually taking the already armed interrupt for
nothing.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260224163429.408524456@kernel.org
2026-02-27 16:40:06 +01:00
Thomas Gleixner
0a93d30861 hrtimer: Provide a static branch based hrtimer_hres_enabled()
The scheduler evaluates this via hrtimer_is_hres_active() every time it has
to update HRTICK. This needs to follow three pointers, which is expensive.

Provide a static branch based mechanism to avoid that.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.136503358@kernel.org
2026-02-27 16:40:04 +01:00