Add further netkit queue-lease coverage for netns lifecycle of the guest
and physical halves, channel resize across active leases, single-device
and multi-lessee scenarios, L3 mode operation, lease capacity exhaustion,
and corner cases such as queue-create rejection paths. Also make the tests
more robust by replacing the time.sleep(0.1) calls after netns deletion
with a wait loop.
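For illustration only, a wait loop of this kind could look roughly like the
sketch below. The helper name wait_until() and its timeout/interval
parameters are assumptions made for this example and are not taken from
nk_qlease.py:

```python
import time


def wait_until(cond, timeout=5.0, interval=0.01):
    """Poll cond() until it returns True or the timeout elapses.

    Hypothetical helper: returns True as soon as cond() holds and
    False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if cond():
            return True
        if time.monotonic() >= deadline:
            return False
        # Back off briefly instead of relying on one fixed-length sleep.
        time.sleep(interval)
```

A test would pass a predicate that checks for the expected post-deletion
state, so it proceeds as soon as cleanup finishes rather than after a
fixed delay that may be too short on a loaded machine.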
Full test run:
# ./nk_qlease.py
TAP version 13
1..45
ok 1 nk_qlease.test_remove_phys
ok 2 nk_qlease.test_double_lease
ok 3 nk_qlease.test_virtual_lessor
ok 4 nk_qlease.test_phys_lessee
ok 5 nk_qlease.test_different_lessors
ok 6 nk_qlease.test_queue_out_of_range
ok 7 nk_qlease.test_resize_leased
ok 8 nk_qlease.test_self_lease
ok 9 nk_qlease.test_create_tx_type
ok 10 nk_qlease.test_create_primary
ok 11 nk_qlease.test_create_limit
ok 12 nk_qlease.test_link_flap_phys
ok 13 nk_qlease.test_queue_get_virtual
ok 14 nk_qlease.test_remove_virt_first
ok 15 nk_qlease.test_multiple_leases
ok 16 nk_qlease.test_lease_queue_tx_type
ok 17 nk_qlease.test_invalid_netns
ok 18 nk_qlease.test_invalid_phys_ifindex
ok 19 nk_qlease.test_multi_netkit_remove_phys
ok 20 nk_qlease.test_single_remove_phys
ok 21 nk_qlease.test_link_flap_virt
ok 22 nk_qlease.test_phys_queue_no_lease
ok 23 nk_qlease.test_same_ns_lease
ok 24 nk_qlease.test_resize_after_unlease
ok 25 nk_qlease.test_lease_queue_zero
ok 26 nk_qlease.test_release_and_reuse
ok 27 nk_qlease.test_veth_queue_create
ok 28 nk_qlease.test_two_netkits_same_queue
ok 29 nk_qlease.test_l3_mode_lease
ok 30 nk_qlease.test_single_double_lease
ok 31 nk_qlease.test_single_different_lessors
ok 32 nk_qlease.test_cross_ns_netns_id
ok 33 nk_qlease.test_delete_guest_netns
ok 34 nk_qlease.test_move_guest_netns
ok 35 nk_qlease.test_resize_phys_no_reduction
ok 36 nk_qlease.test_delete_one_netkit_of_two
ok 37 nk_qlease.test_bind_rx_leased_phys_queue
ok 38 nk_qlease.test_resize_phys_shrink_past_leased
ok 39 nk_qlease.test_resize_virt_not_supported
ok 40 nk_qlease.test_lease_devices_down
ok 41 nk_qlease.test_lease_capacity_exhaustion
ok 42 nk_qlease.test_resize_phys_up
ok 43 nk_qlease.test_multi_ns_lease
ok 44 nk_qlease.test_multi_ns_delete_one
ok 45 nk_qlease.test_move_phys_netns
# Totals: pass:45 fail:0 xfail:0 xpass:0 skip:0 error:0
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260413220809.604592-4-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
YnlFamily opens an AF_NETLINK socket in __init__ but has no way
to release it other than leaving it to the GC. YnlFamily holds a
self reference cycle through SpecFamily's self.family = self
in its super().__init__() call, so refcount GC cannot reclaim
it and the socket stays open until the cyclic GC runs.
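The effect of such a self-reference can be reproduced in isolation. The
sketch below uses a hypothetical SelfCycle class as a stand-in (it is not
the real SpecFamily or YnlFamily code) and assumes CPython's
reference-counting behavior:

```python
import gc
import weakref


class SelfCycle:
    """Hypothetical stand-in for an object that references itself."""

    def __init__(self):
        # The self-reference keeps the refcount above zero forever.
        self.family = self


def survives_del():
    """Return (alive after del, alive after gc.collect())."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the cyclic collector from running on its own
    try:
        obj = SelfCycle()
        ref = weakref.ref(obj)
        del obj  # drops the last external reference
        alive_before = ref() is not None
        gc.collect()  # a cyclic GC pass reclaims the cycle
        alive_after = ref() is not None
        return alive_before, alive_after
    finally:
        if was_enabled:
            gc.enable()
```

Under these assumptions the object outlives the del statement and is only
reclaimed once the cyclic collector runs, which is why any resource it
owns stays open for an unpredictable amount of time.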
If a test creates a guest netns, instantiates a YnlFamily inside
it via NetNSEnter(), performs some test case work via Ynl, and
then deletes the netns, the 'ip netns del' only drops the mount
binding. The still-open socket keeps the netns alive, so
cleanup_net in the kernel does not run until a later cyclic GC
run closes the socket. Any test case assertion in between that
objects got cleaned up would therefore fail.
Add an explicit close() that closes the netlink socket and wire
up the __enter__/__exit__ so callers can scope the instance
deterministically via 'with YnlFamily(...) as ynl: ...'.
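The pattern can be sketched as follows. ScopedSock is a hypothetical
stand-in and not the actual YnlFamily code, and it opens a UDP socket
instead of an AF_NETLINK one purely so that the example runs anywhere:

```python
import socket


class ScopedSock:
    """Hypothetical stand-in showing close() plus context manager hooks."""

    def __init__(self):
        # Assumption: any socket type works for demonstrating the pattern.
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def close(self):
        # Idempotent: a second call is a no-op.
        if self.sock is not None:
            self.sock.close()
            self.sock = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions from the with body
```

With this shape, 'with ScopedSock() as s: ...' releases the socket when
the block exits, independent of when the garbage collector runs.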
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260413220809.604592-2-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a selftest that reproduces the null-ptr-deref in
bond_rr_gen_slave_id() when XDP redirect targets a bond device in
round-robin mode that was never brought up. The test verifies the fix
by ensuring no crash occurs.
Test setup:
- bond0: active-backup mode, UP, with native XDP (enables
bpf_master_redirect_enabled_key globally)
- bond1: round-robin mode, never UP
- veth1: slave of bond1, with generic XDP (XDP_TX)
- BPF_PROG_TEST_RUN with live frames triggers the redirect path
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260411005524.201200-3-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Remove the inclusion of ../filesystems/utils.h from listns_efault_test.c.
The test doesn't use any symbols from that header. Including it alongside
../pidfd/pidfd.h causes a build failure because both headers define
wait_for_pid() with conflicting linkage:
../filesystems/utils.h: extern int wait_for_pid(pid_t pid);
../pidfd/pidfd.h: static inline int wait_for_pid(pid_t pid)
All symbols the test actually uses (create_child, read_nointr,
write_nointr, sys_pidfd_send_signal) come from pidfd.h.
Reported-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/all/acPV19IY3Gna6Ira@sirena.org.uk
Fixes: 07d7ad46da ("selftests/namespaces: test for efault")
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add missing top-level kselftest TARGETS entries for empty_mntns and
fsmount_ns so that 'make kselftest' discovers and runs these tests.
Fix requires_cap_sys_admin test which always SKIPped because fsopen()
was called after enter_userns(), where CAP_SYS_ADMIN in the mount
namespace's user_ns is unavailable. Move fsopen/fsconfig before fork so
the configured fs_fd is inherited by the child, which then only needs to
call fsmount() after dropping privileges.
Fixes: 3ac7ea91f3 ("selftests: add FSMOUNT_NAMESPACE tests")
Signed-off-by: Christian Brauner <brauner@kernel.org>
CLONE_EMPTY_MNTNS is (1ULL << 37) = 0x2000000000ULL, not 0x400000000ULL.
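The arithmetic can be double-checked with a few lines of Python (only the
two constants come from this commit message; the rest is illustrative):

```python
CLONE_EMPTY_MNTNS = 1 << 37  # value as stated above
OLD_VALUE = 0x400000000      # the incorrect constant

print(hex(CLONE_EMPTY_MNTNS))        # 0x2000000000
print(OLD_VALUE.bit_length() - 1)    # 34, i.e. the old value set bit 34
```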
Fixes: 5b8ffd63fb ("selftests/filesystems: add clone3 tests for empty mount namespaces")
Signed-off-by: Christian Brauner <brauner@kernel.org>
empty_mntns.h includes ../statmount/statmount.h which provides a
4-argument statmount_alloc(mnt_id, mnt_ns_id, mask, flags), but then
redefines its own 3-argument version without the flags parameter. This
causes a build failure due to conflicting types.
Remove the duplicate definition from empty_mntns.h and update all
callers to pass 0 for the flags argument.
Fixes: 32f54f2bbc ("selftests/filesystems: add tests for empty mount namespaces")
Signed-off-by: Christian Brauner <brauner@kernel.org>
Remove the local static wait_for_pid() definition from
statmount_test_ns.c as it conflicts with the extern declaration in
utils.h. The identical function is already provided by utils.c.
Fixes: 3ac7ea91f3 ("selftests: add FSMOUNT_NAMESPACE tests")
Cc: <stable@kernel.org> # mainline only
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull gpio updates from Bartosz Golaszewski:
"For this merge window we have two new drivers: support for
GPIO-signalled ACPI events on Intel platforms and a generic
GPIO-over-pinctrl driver using the ARM SCMI protocol for
controlling pins.
Several things have been reworked in GPIO core: we unduplicated GPIO
hog handling, reduced the number of SRCU locks and dereferences,
improved support for software-node-based lookup and removed more
legacy code after converting remaining users to modern alternatives.
There are also a number of driver reworks and refactoring, documentation
updates, some bug-fixes and new tests.
GPIO core:
- defer probe on software node lookups when the remote software node
exists but has not been registered as a firmware node yet
- unify GPIO hog handling by moving code duplicated in OF and ACPI
modules into GPIO core and allow setting up hogs with software
nodes
- allow matching GPIO controllers by secondary firmware node if
matching by primary does not succeed
- demote deferral warnings to debug level as they are quite normal
when using software nodes which don't support fw_devlink yet
- disable the legacy GPIO character device uAPI v1 support in Kconfig
by default
- rework several core functions in preparation for the upcoming
Revocable helper library for protecting resources against sudden
removal, this reduces the number of SRCU dereferences in GPIO core
- simplify file descriptor logic in GPIO character device code by
using FD_PREPARE()
- introduce a header defining symbols used by both GPIO consumers and
providers to avoid having to include provider-specific headers from
drivers which only consume GPIOs
- replace snprintf() with strscpy() where formatting is not required
New drivers:
- add the gpio-by-pinctrl generic driver using the ARM SCMI protocol
to control GPIOs (along with SCMI changes pulled from the pinctrl
tree)
- add a driver providing support for handling of platform events via
GPIO-signalled ACPI events (used on Intel Nova Lake and later
platforms)
Driver changes:
- extend the gpio-kempld driver with support for more recent models,
interrupts and setting/getting multiple values at once
- improve interrupt handling in gpio-brcmstb
- add support for multi-SoC systems in gpio-tegra186
- make sure we return correct values from the .get() callbacks in
several GPIO drivers by normalizing any values other than 0, 1 or
negative error numbers
- use flexible arrays in several drivers to reduce the number of
required memory allocations
- simplify synchronous waiting for virtual drivers to probe and
remove the dedicated, somewhat overengineered dev-sync-probe helper
library
- remove unneeded Kconfig dependencies on OF_GPIO in several drivers
and subsystems
- convert the two remaining users of of_get_named_gpio() to using
GPIO descriptors and remove the (no longer used) function along
with the header that declares it
- add missing includes in gpio-mmio
- shrink and simplify code in gpio-max732x by using guard(mutex)
- remove duplicated code handling the 'ngpios' property from
gpio-ts4800, it's already handled in GPIO core
- use correct variable type in gpio-aspeed
- add support for a new model in gpio-realtek-otto
- allow specifying the active-low setting of simulated hogs over the
configfs interface (in addition to existing devicetree support) in
gpio-sim
Bug fixes:
- clear the OF_POPULATED flag on hog nodes in GPIO chip remove path
on OF systems
- fix resource leaks in error path in gpiochip_add_data_with_key()
- drop redundant device reference in gpio-mpsse
Tests:
- add selftests for use-after-free cases in GPIO character device
code
DT bindings:
- add a DT binding document for SCMI based, gpio-over-pinctrl devices
- fix interrupt description in microchip,mpfs-gpio
- add new compatible for gpio-realtek-otto
- describe the resets of the mpfs-gpio controller
- fix maintainer's email in gpio-delay bindings
- remove the binding document for cavium,thunder-8890 as the
corresponding device is bound over PCI and not firmware nodes
Documentation:
- update the recommended way of converting legacy boards to using
software nodes for GPIO description
- describe GPIO line value semantics
- misc updates to kerneldocs
Misc:
- convert OMAP1 ams-delta board to using GPIO hogs described with
software nodes"
* tag 'gpio-updates-for-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: (79 commits)
gpio: swnode: defer probe on references to unregistered software nodes
dt-bindings: gpio: cavium,thunder-8890: Remove DT binding
Documentation: gpio: update the preferred method for using software node lookup
gpio: gpio-by-pinctrl: s/used to do/is used to do/
gpio: aspeed: fix unsigned long int declaration
gpio: rockchip: convert to dynamic GPIO base allocation
gpio: remove dev-sync-probe
gpio: virtuser: stop using dev-sync-probe
gpio: aggregator: stop using dev-sync-probe
gpio: sim: stop using dev-sync-probe
gpio: Add Intel Nova Lake ACPI GPIO events driver
gpiolib: Make deferral warnings debug messages
gpiolib: fix hogs with multiple lines
gpio: fix up CONFIG_OF dependencies
gpio: gpio-by-pinctrl: add pinctrl based generic GPIO driver
gpio: dt-bindings: Add GPIO on top of generic pin control
firmware: arm_scmi: Allow PINCTRL_REQUEST to return EOPNOTSUPP
pinctrl: scmi: ignore PIN_CONFIG_PERSIST_STATE
pinctrl: scmi: Delete PIN_CONFIG_OUTPUT_IMPEDANCE_OHMS support
pinctrl: scmi: Add SCMI_PIN_INPUT_VALUE
...
Pull power management updates from Rafael Wysocki:
"Once again, cpufreq is the most active development area, mostly
because of the new feature additions and documentation updates in the
amd-pstate driver, but there are also changes in the cpufreq core
related to boost support and other assorted updates elsewhere.
Next up are power capping changes due to the major cleanup of the
Intel RAPL driver.
On the cpuidle front, a new C-states table for Intel Panther Lake is
added to the intel_idle driver, the stopped tick handling in the menu
and teo governors is updated, and there are a couple of cleanups.
Apart from the above, support for Tegra114 is added to devfreq and
there are assorted cleanups of that code, there are also two updates
of the operating performance points (OPP) library, two minor updates
related to hibernation, and cpupower utility man pages updates and
cleanups.
Specifics:
- Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa)
- Update cpufreq-dt-platdev blocklist (Faruque Ansari)
- Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
Rosen Penev)
- Add MAINTAINERS entry for CPPC driver (Viresh Kumar)
- Add support for new features: CPPC performance priority, Dynamic
EPP, Raw EPP, and new unit tests for them to amd-pstate (Gautham
Shenoy, Mario Limonciello)
- Fix sysfs files being present when the HW is missing, and fix
broken/outdated documentation in the amd-pstate driver (Ninad Naik,
Gautham Shenoy)
- Pass the policy to cpufreq_driver->adjust_perf() to avoid using
cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate
which leads to a scheduling-while-atomic bug (K Prateek Nayak)
- Clean up dead code in Kconfig for cpufreq (Julian Braha)
- Remove max_freq_req update for pre-existing cpufreq policy and add
a boost_freq_req QoS request to save the boost constraint instead
of overwriting the last scaling_max_freq constraint (Pierre
Gondois)
- Embed cpufreq QoS freq_req objects in cpufreq policy so they all
are allocated in one go along with the policy to simplify lifetime
rules and avoid error handling issues (Viresh Kumar)
- Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
scaling driver (Henry Tseng)
- Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
of cpumask_weight() because the former is more efficient (Yury
Norov)
- Use sysfs_emit() in sysfs show functions for cpufreq governor
attributes (Thorsten Blum)
- Update intel_pstate to stop returning an error when "off" is
written to its status sysfs attribute while the driver is already
off (Fabio De Francesco)
- Include current frequency in the debug message printed by
__cpufreq_driver_target() (Pengjie Zhang)
- Refine stopped tick handling in the menu cpuidle governor and
rearrange stopped tick handling in the teo cpuidle governor (Rafael
Wysocki)
- Add Panther Lake C-states table to the intel_idle driver (Artem
Bityutskiy)
- Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha)
- Simplify cpuidle_register_device() with guard() (Huisong Li)
- Use performance level if available to distinguish between rates in
OPP debugfs (Manivannan Sadhasivam)
- Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar)
- Return -ENODATA if the snapshot image is not loaded (Alberto
Garcia)
- Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric
Biggers)
- Clean up and rearrange the intel_rapl power capping driver to make
the respective interface drivers (TPMI, MSR, and MMIO) hold their
own settings and primitives and consolidate PL4 and PMU support
flags into rapl_defaults (Kuppuswamy Sathyanarayanan)
- Correct kernel-doc function parameter names in the power capping
core code (Randy Dunlap)
- Remove unneeded casting for HZ_PER_KHZ in devfreq (Andy Shevchenko)
- Use _visible attribute to replace create/remove_sysfs_files() in
devfreq (Pengjie Zhang)
- Add Tegra114 support to activity monitor device in tegra30-devfreq
as a preparation to upcoming EMC controller support (Svyatoslav
Ryhel)
- Fix mistakes in cpupower man pages, add the boost and epp options
to the cpupower-frequency-info man page, and add the perf-bias
option to the cpupower-info man page (Roberto Ricci)
- Remove unnecessary extern declarations from getopt.h in arguments
parsing functions in cpufreq-set, cpuidle-info, cpuidle-set,
cpupower-info, and cpupower-set utilities (Kaushlendra Kumar)"
* tag 'pm-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits)
cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP
cpupower: remove extern declarations in cmd functions
cpuidle: Simplify cpuidle_register_device() with guard()
PM / devfreq: tegra30-devfreq: add support for Tegra114
PM / devfreq: use _visible attribute to replace create/remove_sysfs_files()
PM / devfreq: Remove unneeded casting for HZ_PER_KHZ
MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
cpufreq/amd-pstate-ut: Add a unit test for raw EPP
cpufreq/amd-pstate: Add support for raw EPP writes
cpufreq/amd-pstate: Add support for platform profile class
cpufreq/amd-pstate: add kernel command line to override dynamic epp
cpufreq/amd-pstate: Add dynamic energy performance preference
Documentation: amd-pstate: fix dead links in the reference section
cpufreq/amd-pstate: Cache the max frequency in cpudata
Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
...
Pull driver core updates from Danilo Krummrich:
"debugfs:
- Fix NULL pointer dereference in debugfs_create_str()
- Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str()
- Fix soundwire debugfs NULL pointer dereference from uninitialized
firmware_file
device property:
- Make fwnode flags modifications thread safe; widen the field to
unsigned long and use set_bit() / clear_bit() based accessors
- Document how to check for the property presence
devres:
- Separate struct devres_node from its "subclasses" (struct devres,
struct devres_group); give struct devres_node its own release and
free callbacks for per-type dispatch
- Introduce struct devres_action for devres actions, avoiding the
ARCH_DMA_MINALIGN alignment overhead of struct devres
- Export struct devres_node and its init/add/remove/dbginfo
primitives for use by Rust Devres<T>
- Fix missing node debug info in devm_krealloc()
- Use guard(spinlock_irqsave) where applicable; consolidate unlock
paths in devres_release_group()
driver_override:
- Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the
generic driver_override infrastructure, replacing per-bus
driver_override strings, sysfs attributes, and match logic; fixes a
potential UAF from unsynchronized access to driver_override in bus
match() callbacks
- Simplify __device_set_driver_override() logic
kernfs:
- Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs file
and directory removal
- Add corresponding selftests for memcg
platform:
- Allow attaching software nodes when creating platform devices via a
new 'swnode' field in struct platform_device_info
- Add kerneldoc for struct platform_device_info
software node:
- Move software node initialization from postcore_initcall() to
driver_init(), making it available early in the boot process
- Move kernel_kobj initialization (ksysfs_init) earlier to support
the above
- Remove software_node_exit(); dead code in a built-in unit
SoC:
- Introduce of_machine_read_compatible() and of_machine_read_model()
OF helpers and export soc_attr_read_machine() to replace direct
accesses to of_root from SoC drivers; also enables
CONFIG_COMPILE_TEST coverage for these drivers
sysfs:
- Constify attribute group array pointers to
'const struct attribute_group *const *' in sysfs functions,
device_add_groups() / device_remove_groups(), and struct class
Rust:
- Devres:
- Embed struct devres_node directly in Devres<T> instead of going
through devm_add_action(), avoiding the extra allocation and the
unnecessary ARCH_DMA_MINALIGN alignment
- I/O:
- Turn IoCapable from a marker trait into a functional trait
carrying the raw I/O accessor implementation (io_read /
io_write), providing working defaults for the per-type Io
methods
- Add RelaxedMmio wrapper type, making relaxed accessors usable in
code generic over the Io trait
- Remove overloaded per-type Io methods and per-backend macros
from Mmio and PCI ConfigSpace
- I/O (Register):
- Add IoLoc trait and generic read/write/update methods to the Io
trait, making I/O operations parameterizable by typed locations
- Add register! macro for defining hardware register types with
typed bitfield accessors backed by Bounded values; supports
direct, relative, and array register addressing
- Add write_reg() / try_write_reg() and LocatedRegister trait
- Update PCI sample driver to demonstrate the register! macro
Example:
```
register! {
    /// UART control register.
    CTRL(u32) @ 0x18 {
        /// Receiver enable.
        19:19 rx_enable => bool;
        /// Parity configuration.
        14:13 parity ?=> Parity;
    }

    /// FIFO watermark and counter register.
    WATER(u32) @ 0x2c {
        /// Number of datawords in the receive FIFO.
        26:24 rx_count;
        /// RX interrupt threshold.
        17:16 rx_water;
    }
}

impl WATER {
    fn rx_above_watermark(&self) -> bool {
        self.rx_count() > self.rx_water()
    }
}

fn init(bar: &pci::Bar<BAR0_SIZE>) {
    let water = WATER::zeroed()
        .with_const_rx_water::<1>(); // > 3 would not compile
    bar.write_reg(water);

    let ctrl = CTRL::zeroed()
        .with_parity(Parity::Even)
        .with_rx_enable(true);
    bar.write_reg(ctrl);
}

fn handle_rx(bar: &pci::Bar<BAR0_SIZE>) {
    if bar.read(WATER).rx_above_watermark() {
        // drain the FIFO
    }
}

fn set_parity(bar: &pci::Bar<BAR0_SIZE>, parity: Parity) {
    bar.update(CTRL, |r| r.with_parity(parity));
}
```
- IRQ:
- Move 'static bounds from where clauses to trait declarations for
IRQ handler traits
- Misc:
- Enable the generic_arg_infer Rust feature
- Extend Bounded with shift operations, single-bit bool
conversion, and const get()
Misc:
- Make deferred_probe_timeout default a Kconfig option
- Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM
callbacks when no bus type PM ops are set
- Add conditional guard support for device_lock()
- Add ksysfs.c to the DRIVER CORE MAINTAINERS entry
- Fix kernel-doc warnings in base.h
- Fix stale reference to memory_block_add_nid() in documentation"
* tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (67 commits)
bus: fsl-mc: use generic driver_override infrastructure
s390/ap: use generic driver_override infrastructure
s390/cio: use generic driver_override infrastructure
vdpa: use generic driver_override infrastructure
platform/wmi: use generic driver_override infrastructure
PCI: use generic driver_override infrastructure
driver core: make software nodes available earlier
software node: remove software_node_exit()
kernel: ksysfs: initialize kernel_kobj earlier
MAINTAINERS: add ksysfs.c to the DRIVER CORE entry
drivers/base/memory: fix stale reference to memory_block_add_nid()
device property: Document how to check for the property presence
soundwire: debugfs: initialize firmware_file to empty string
debugfs: fix placement of EXPORT_SYMBOL_GPL for debugfs_create_str()
debugfs: check for NULL pointer in debugfs_create_str()
driver core: Make deferred_probe_timeout default a Kconfig option
driver core: simplify __device_set_driver_override() clearing logic
driver core: auxiliary bus: Drop auxiliary_dev_pm_ops
device property: Make modifications of fwnode "flags" thread safe
rust: devres: embed struct devres_node directly
...
Pull hardening updates from Kees Cook:
- randomize_kstack: Improve implementation across arches (Ryan Roberts)
- lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test
- refcount: Remove unused __signed_wrap function annotations
* tag 'hardening-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test
refcount: Remove unused __signed_wrap function annotations
randomize_kstack: Unify random source across arches
randomize_kstack: Maintain kstack_offset per task
Pull CRC updates from Eric Biggers:
- Several improvements related to crc_kunit, to align with the standard
KUnit conventions and make it easier for developers and CI systems to
run this test suite
- Add an arm64-optimized implementation of CRC64-NVME
- Remove unused code for big endian arm64
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crc: arm64: Simplify intrinsics implementation
lib/crc: arm64: Use existing macros for kernel-mode FPU cflags
lib/crc: arm64: Drop unnecessary chunking logic from crc64
lib/crc: arm64: Assume a little-endian kernel
lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation
lib/crc: arm64: Drop check for CONFIG_KERNEL_MODE_NEON
crypto: crc32c - Remove another outdated comment
crypto: crc32c - Remove more outdated usage information
kunit: configs: Enable all CRC tests in all_tests.config
lib/crc: tests: Add a .kunitconfig file
lib/crc: tests: Add CRC_ENABLE_ALL_FOR_KUNIT
lib/crc: tests: Make crc_kunit test only the enabled CRC variants
Pull crypto library updates from Eric Biggers:
- Migrate more hash algorithms from the traditional crypto subsystem to
lib/crypto/
Like the algorithms migrated earlier (e.g. SHA-*), this simplifies
the implementations, improves performance, enables further
simplifications in calling code, and solves various other issues:
- AES CBC-based MACs (AES-CMAC, AES-XCBC-MAC, and AES-CBC-MAC)
- Support these algorithms in lib/crypto/ using the AES library
and the existing arm64 assembly code
- Reimplement the traditional crypto API's "cmac(aes)",
"xcbc(aes)", and "cbcmac(aes)" on top of the library
- Convert mac80211 to use the AES-CMAC library. Note: several
other subsystems can use it too and will be converted later
- Drop the broken, nonstandard, and likely unused support for
"xcbc(aes)" with key lengths other than 128 bits
- Enable optimizations by default
- GHASH
- Migrate the standalone GHASH code into lib/crypto/
- Integrate the GHASH code more closely with the very similar
POLYVAL code, and improve the generic GHASH implementation to
resist cache-timing attacks and use much less memory
- Reimplement the AES-GCM library and the "gcm" crypto_aead
template on top of the GHASH library. Remove "ghash" from the
crypto_shash API, as it's no longer needed
- Enable optimizations by default
- SM3
- Migrate the kernel's existing SM3 code into lib/crypto/, and
reimplement the traditional crypto API's "sm3" on top of it
- I don't recommend using SM3, but this cleanup is worthwhile
to organize the code the same way as other algorithms
- Testing improvements:
- Add a KUnit test suite for each of the new library APIs
- Migrate the existing ChaCha20Poly1305 test to KUnit
- Make the KUnit all_tests.config enable all crypto library tests
- Move the test kconfig options to the Runtime Testing menu
- Other updates to arch-optimized crypto code:
- Optimize SHA-256 for Zhaoxin CPUs using the Padlock Hash Engine
- Remove some MD5 implementations that are no longer worth keeping
- Drop big endian and voluntary preemption support from the arm64
code, as those configurations are no longer supported on arm64
- Make jitterentropy and samples/tsm-mr use the crypto library APIs
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (66 commits)
lib/crypto: arm64: Assume a little-endian kernel
arm64: fpsimd: Remove obsolete cond_yield macro
lib/crypto: arm64/sha3: Remove obsolete chunking logic
lib/crypto: arm64/sha512: Remove obsolete chunking logic
lib/crypto: arm64/sha256: Remove obsolete chunking logic
lib/crypto: arm64/sha1: Remove obsolete chunking logic
lib/crypto: arm64/poly1305: Remove obsolete chunking logic
lib/crypto: arm64/gf128hash: Remove obsolete chunking logic
lib/crypto: arm64/chacha: Remove obsolete chunking logic
lib/crypto: arm64/aes: Remove obsolete chunking logic
lib/crypto: Include <crypto/utils.h> instead of <crypto/algapi.h>
lib/crypto: aesgcm: Don't disable IRQs during AES block encryption
lib/crypto: aescfb: Don't disable IRQs during AES block encryption
lib/crypto: tests: Migrate ChaCha20Poly1305 self-test to KUnit
lib/crypto: sparc: Drop optimized MD5 code
lib/crypto: mips: Drop optimized MD5 code
lib: Move crypto library tests to Runtime Testing menu
crypto: sm3 - Remove 'struct sm3_state'
crypto: sm3 - Remove the original "sm3_block_generic()"
crypto: sm3 - Remove sm3_base.h
...
Pull block updates from Jens Axboe:
- Add shared memory zero-copy I/O support for ublk, bypassing per-I/O
copies between kernel and userspace by matching registered buffer
PFNs at I/O time. Includes selftests.
- Refactor bio integrity to support filesystem initiated integrity
operations and arbitrary buffer alignment.
- Clean up bio allocation, splitting bio_alloc_bioset() into clear fast
and slow paths. Add bio_await() and bio_submit_or_kill() helpers,
unify synchronous bi_end_io callbacks.
- Fix zone write plug refcount handling and plug removal races. Add
support for serializing zone writes at QD=1 for rotational zoned
devices, yielding significant throughput improvements.
- Add SED-OPAL ioctls for Single User Mode management and a STACK_RESET
command.
- Add io_uring passthrough (uring_cmd) support to the BSG layer.
- Replace pp_buf in partition scanning with struct seq_buf.
- zloop improvements and cleanups.
- drbd genl cleanup, switching to pre_doit/post_doit.
- NVMe pull request via Keith:
- Fabrics authentication updates
- Enhanced block queue limits support
- Workqueue usage updates
- A new write zeroes device quirk
- Tagset cleanup fix for loop device
- MD pull requests via Yu Kuai:
- Fix raid5 soft lockup in retry_aligned_read()
- Fix raid10 deadlock with check operation and nowait requests
- Fix raid1 overlapping writes on writemostly disks
- Fix sysfs deadlock on array_state=clear
- Proactive RAID-5 parity building with llbitmap, with
write_zeroes_unmap optimization for initial sync
- Fix llbitmap barrier ordering, rdev skipping, and bitmap_ops
version mismatch fallback
- Fix bcache use-after-free and uninitialized closure
- Validate raid5 journal metadata payload size
- Various cleanups
- Various other fixes, improvements, and cleanups
* tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (146 commits)
ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
scsi: bsg: fix buffer overflow in scsi_bsg_uring_cmd()
block: refactor blkdev_zone_mgmt_ioctl
MAINTAINERS: update ublk driver maintainer email
Documentation: ublk: address review comments for SHMEM_ZC docs
ublk: allow buffer registration before device is started
ublk: replace xarray with IDA for shmem buffer index allocation
ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
ublk: verify all pages in multi-page bvec fall within registered range
ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
xfs: use bio_await in xfs_zone_gc_reset_sync
block: add a bio_submit_or_kill helper
block: factor out a bio_await helper
block: unify the synchronous bi_end_io callbacks
xfs: fix number of GC bvecs
selftests/ublk: add read-only buffer registration test
selftests/ublk: add filesystem fio verify test for shmem_zc
selftests/ublk: add hugetlbfs shmem_zc test for loop target
selftests/ublk: add shared memory zero-copy test
selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
...
Pull Landlock update from Mickaël Salaün:
"This adds a new Landlock access right for pathname UNIX domain sockets
thanks to a new LSM hook, and a few fixes"
* tag 'landlock-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux: (23 commits)
landlock: Document fallocate(2) as another truncation corner case
landlock: Document FS access right for pathname UNIX sockets
selftests/landlock: Simplify ruleset creation and enforcement in fs_test
selftests/landlock: Check that coredump sockets stay unrestricted
selftests/landlock: Audit test for LANDLOCK_ACCESS_FS_RESOLVE_UNIX
selftests/landlock: Test LANDLOCK_ACCESS_FS_RESOLVE_UNIX
selftests/landlock: Replace access_fs_16 with ACCESS_ALL in fs_test
samples/landlock: Add support for named UNIX domain socket restrictions
landlock: Clarify BUILD_BUG_ON check in scoping logic
landlock: Control pathname UNIX domain socket resolution by path
landlock: Use mem_is_zero() in is_layer_masks_allowed()
lsm: Add LSM hook security_unix_find
landlock: Fix kernel-doc warning for pointer-to-array parameters
landlock: Fix formatting in tsync.c
landlock: Improve kernel-doc "Return:" section consistency
landlock: Add missing kernel-doc "Return:" sections
selftests/landlock: Fix format warning for __u64 in net_test
selftests/landlock: Skip stale records in audit_match_record()
selftests/landlock: Drain stale audit records on init
selftests/landlock: Fix socket file descriptor leaks in audit helpers
...
Pull clone and pidfs updates from Christian Brauner:
"Add three new clone3() flags for pidfd-based process lifecycle
management.
CLONE_AUTOREAP:
CLONE_AUTOREAP makes a child process auto-reap on exit without ever
becoming a zombie. This is a per-process property in contrast to
the existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for
SIGCHLD which applies to all children of a given parent.
Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped
property affecting all children which makes it unsuitable for
libraries or applications that need selective auto-reaping of
specific children while still being able to wait() on others.
CLONE_AUTOREAP stores an autoreap flag in the child's
signal_struct. When the child exits do_notify_parent() checks this
flag and causes exit_notify() to transition the task directly to
EXIT_DEAD. Since the flag lives on the child it survives
reparenting: if the original parent exits and the child is
reparented to a subreaper or init the child still auto-reaps when
it eventually exits. This is cleaner than forcing the subreaper to
get SIGCHLD and then reaping it. If the parent doesn't care the
subreaper won't care. If there's a subreaper that would care it
would be easy enough to add a prctl() that either just turns back
on SIGCHLD and turns off auto-reaping or a prctl() that just
notifies the subreaper whenever a child is reparented to it.
CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent
to monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern. No exit signal is delivered so exit_signal must be zero.
CLONE_THREAD and CLONE_PARENT are rejected: CLONE_THREAD because
autoreap is a process-level property, and CLONE_PARENT because an
autoreap child reparented via CLONE_PARENT could become an
invisible zombie under a parent that never calls wait().
The flag is not inherited by the autoreap process's own children.
Each child that should be autoreaped must be explicitly created
with CLONE_AUTOREAP.
CLONE_NNP:
CLONE_NNP sets no_new_privs on the child at clone time. Unlike
prctl(PR_SET_NO_NEW_PRIVS) which a process sets on itself,
CLONE_NNP allows the parent to impose no_new_privs on the child at
creation without affecting the parent's own privileges.
CLONE_THREAD is rejected because threads share credentials.
CLONE_NNP is useful on its own for any spawn-and-sandbox pattern
but was specifically introduced to enable unprivileged usage of
CLONE_PIDFD_AUTOKILL.
CLONE_PIDFD_AUTOKILL:
This flag ties a child's lifetime to the pidfd returned from
clone3(). When the last reference to the struct file created by
clone3() is closed the kernel sends SIGKILL to the child. A pidfd
obtained via pidfd_open() for the same process does not keep the
child alive and does not trigger autokill - only the specific
struct file from clone3() has this property. This is useful for
container runtimes, service managers, and sandboxed subprocess
execution - any scenario where the child must die if the parent
crashes or abandons the pidfd, or where the parent just wants a
throwaway helper process.
CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP.
It requires CLONE_PIDFD because the whole point is tying the
child's lifetime to the pidfd. It requires CLONE_AUTOREAP because a
killed child with no one to reap it would become a zombie - the
primary use case is the parent crashing or abandoning the pidfd so
no one is around to call waitpid(). CLONE_THREAD is rejected
because autokill targets a process not a thread.
If CLONE_NNP is specified together with CLONE_PIDFD_AUTOKILL an
unprivileged user may spawn a process that is autokilled. The child
cannot escalate privileges via setuid/setgid exec after being
spawned. If CLONE_PIDFD_AUTOKILL is specified without CLONE_NNP the
caller must have CAP_SYS_ADMIN in its user namespace"
* tag 'vfs-7.1-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: check pidfd_info->coredump_code correctness
pidfds: add coredump_code field to pidfd_info
kselftest/coredump: reintroduce null pointer dereference
selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests
selftests/pidfd: add CLONE_NNP tests
selftests/pidfd: add CLONE_AUTOREAP tests
pidfd: add CLONE_PIDFD_AUTOKILL
clone: add CLONE_NNP
clone: add CLONE_AUTOREAP
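The flag combination rules described above can be summarized in a small model. This is only a sketch of the rules as stated in the pull message, not the kernel's validation code: the CloneFlag enum, its bit values, and check_clone_flags() are hypothetical names made up for illustration, and the real UAPI constants are not reproduced here.

```python
from enum import Flag, auto


class CloneFlag(Flag):
    # Placeholder bit values for illustration only; NOT the kernel UAPI values.
    THREAD = auto()
    PARENT = auto()
    PIDFD = auto()
    AUTOREAP = auto()
    NNP = auto()
    PIDFD_AUTOKILL = auto()


def check_clone_flags(flags, exit_signal=0, has_cap_sys_admin=False):
    """Return None if the combination is allowed, else a reason string."""
    if CloneFlag.AUTOREAP in flags:
        # No exit signal is delivered for an autoreap child.
        if exit_signal != 0:
            return "CLONE_AUTOREAP requires exit_signal == 0"
        if CloneFlag.THREAD in flags:
            return "CLONE_AUTOREAP rejects CLONE_THREAD"
        if CloneFlag.PARENT in flags:
            return "CLONE_AUTOREAP rejects CLONE_PARENT"
    if CloneFlag.NNP in flags and CloneFlag.THREAD in flags:
        return "CLONE_NNP rejects CLONE_THREAD"
    if CloneFlag.PIDFD_AUTOKILL in flags:
        if CloneFlag.PIDFD not in flags or CloneFlag.AUTOREAP not in flags:
            return "CLONE_PIDFD_AUTOKILL needs CLONE_PIDFD and CLONE_AUTOREAP"
        if CloneFlag.THREAD in flags:
            return "CLONE_PIDFD_AUTOKILL rejects CLONE_THREAD"
        # Without CLONE_NNP the caller needs CAP_SYS_ADMIN in its user namespace.
        if CloneFlag.NNP not in flags and not has_cap_sys_admin:
            return "CLONE_PIDFD_AUTOKILL without CLONE_NNP needs CAP_SYS_ADMIN"
    return None
```

In this model an unprivileged caller passing PIDFD | AUTOREAP | NNP | PIDFD_AUTOKILL is accepted, while dropping NNP from that set is rejected unless has_cap_sys_admin is true.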
Pull vfs xattr updates from Christian Brauner:
"This reworks the simple_xattr infrastructure and adds support for
user.* extended attributes on sockets.
The simple_xattr subsystem currently uses an rbtree protected by a
reader-writer spinlock. This series replaces the rbtree with an
rhashtable giving O(1) average-case lookup with RCU-based lockless
reads. This sped up concurrent access patterns on tmpfs quite a bit
and it's an overall easy enough conversion to do and gets rid of
rwlock_t.
The conversion is done incrementally: a new rhashtable path is added
alongside the existing rbtree, consumers are migrated one at a time
(shmem, kernfs, pidfs), and then the rbtree code is removed. All three
consumers switch from embedded structs to pointer-based lazy
allocation so the rhashtable overhead is only paid for inodes that
actually use xattrs.
With this infrastructure in place the series adds support for user.*
xattrs on sockets. Path-based AF_UNIX sockets inherit xattr support
from the underlying filesystem (e.g. tmpfs) but sockets in sockfs -
that is everything created via socket() including abstract namespace
AF_UNIX sockets - had no xattr support at all.
The xattr_permission() checks are reworked to allow user.* xattrs on
S_IFSOCK inodes. Sockfs sockets get per-inode limits of 128 xattrs and
128KB total value size matching the limits already in use for kernfs.
The practical motivation comes from several directions. systemd and
GNOME are expanding their use of Varlink as an IPC mechanism.
For D-Bus there are tools like dbus-monitor that can observe IPC
traffic across the system but this only works because D-Bus has a
central broker.
For Varlink there is no broker and there is currently no way to
identify which sockets speak Varlink. With user.* xattrs on sockets a
service can label its socket with the IPC protocol it speaks (e.g.,
user.varlink=1) and an eBPF program can then selectively capture
traffic on those sockets. Enumerating bound sockets via netlink
combined with these xattr labels gives a way to discover all Varlink
IPC entrypoints for debugging and introspection.
Similarly, systemd-journald wants to use xattrs on the /dev/log socket
for protocol negotiation to indicate whether RFC 5424 structured
syslog is supported or whether only the legacy RFC 3164 format should
be used.
In containers these labels are particularly useful as high-privilege
or more complicated solutions for socket identification aren't
available.
The series comes with comprehensive selftests covering path-based
AF_UNIX sockets, sockfs socket operations, per-inode limit
enforcement, and xattr operations across multiple address families
(AF_INET, AF_INET6, AF_NETLINK, AF_PACKET)"
* tag 'vfs-7.1-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests/xattr: test xattrs on various socket families
selftests/xattr: sockfs socket xattr tests
selftests/xattr: path-based AF_UNIX socket xattr tests
xattr: support extended attributes on sockets
xattr,net: support limited amount of extended attributes on sockfs sockets
xattr: move user limits for xattrs to generic infra
xattr: switch xattr_permission() to switch statement
xattr: add xattr_permission_error()
xattr: remove rbtree-based simple_xattr infrastructure
pidfs: adapt to rhashtable-based simple_xattrs
kernfs: adapt to rhashtable-based simple_xattrs with lazy allocation
shmem: adapt to rhashtable-based simple_xattrs with lazy allocation
xattr: add rhashtable-based simple_xattr infrastructure
xattr: add rcu_head and rhash_head to struct simple_xattr
check_requires() compares requirement strings that can contain shell
pattern characters such as '[' and ']'. Under /bin/sh, the unquoted
test expressions can emit 'unexpected operator' warnings while parsing
README-backed requirements.
Quote the relevant comparisons and path checks so the helper handles
those patterns without spurious shell warnings.
Validated by rerunning fprobe_syntax_errors.tc and confirming the
previous '/bin/sh: unexpected operator' lines disappear from the
detailed ftracetest log.
Signed-off-by: Cao Ruichuang <create0818@163.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20260408043212.8063-1-create0818@163.com
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Track failures explicitly in the top-level selftests all/install loops.
The current code multiplies `ret` by each sub-make exit status. For
example, with `TARGETS=net`, the implicit `net/lib` dependency runs after
`net`, so a failed `net` build can be followed by a successful `net/lib`
build and reset the final result to success.
Set `ret` to 1 on any non-zero sub-make exit code and keep it sticky, so
the top-level make returns failure when any selected selftest target
fails.
Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Link: https://lore.kernel.org/r/20260320-selftests-fixes-v1-5-79144f76be01@suse.com
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
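To illustrate why multiplying exit statuses loses failures, here is a minimal model of the two accumulation strategies. It is a sketch, not the Makefile code: the function names are hypothetical, and the starting value of 1 for the multiplied variant is an assumption chosen so that an all-success run yields 0.

```python
def multiplied_result(statuses, initial=1):
    """Old behaviour as described: ret is multiplied by each exit status."""
    ret = initial  # assumed starting value
    for status in statuses:
        ret *= status
    return ret


def sticky_result(statuses):
    """New behaviour as described: any non-zero status pins ret to 1."""
    ret = 0
    for status in statuses:
        if status != 0:
            ret = 1
    return ret


# 'net' fails with status 2, then the implicit 'net/lib' build succeeds.
statuses = [2, 0]
print(multiplied_result(statuses), sticky_result(statuses))
```

With the failing build followed by a successful one, the multiplied variant ends at 0 (success) while the sticky variant ends at 1 (failure).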
The --per-test-log option currently hard-codes /tmp. However, the system
under test will most likely have tmpfs mounted there. Since it's not clear
which filenames the log files will have, the user should be able to specify
a persistent directory to store the logs. Keeping those logs is important
because the run_kselftest.sh runner will only yield KTAP output, trimming
information that is otherwise available through running individual tests
directly.
Allow --per-test-log to take an optional directory argument. Keep the
existing behaviour when the option is passed without an argument, but if
a directory is provided, create it if needed, reject non-directory paths
and non-writable directories, canonicalize it, and have runner.sh write
per-test logs there instead of /tmp.
This also makes relative paths safe by resolving them before the runner
changes into a collection directory.
Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Link: https://lore.kernel.org/r/20260320-selftests-fixes-v1-4-79144f76be01@suse.com
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
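The directory handling described above (create if needed, reject non-directories and non-writable directories, canonicalize) can be sketched as follows. The actual change lives in shell scripts; this Python model and the name resolve_log_dir() are illustrative assumptions only.

```python
import os


def resolve_log_dir(path):
    """Validate a per-test log directory and return its canonical path."""
    # Reject anything that exists but is not a directory.
    if os.path.exists(path) and not os.path.isdir(path):
        raise NotADirectoryError(path)
    # Create the directory (and parents) if needed.
    os.makedirs(path, exist_ok=True)
    # Reject directories the caller cannot write to.
    if not os.access(path, os.W_OK):
        raise PermissionError(path)
    # Canonicalize now, so a later chdir cannot change what a relative
    # path refers to.
    return os.path.realpath(path)
```

Resolving the path before any directory change is what makes relative arguments safe in this model.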
run_kselftest.sh only needs to canonicalize the directory containing the
script itself. Use shell-native path resolution for that by changing into
the directory and calling pwd -P.
This avoids depending on either realpath or readlink -f while still
producing a physical absolute path for BASE_DIR.
Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Link: https://lore.kernel.org/r/20260320-selftests-fixes-v1-3-79144f76be01@suse.com
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
KVM SVM changes for 7.1
- Fix and optimize IRQ window inhibit handling for AVIC (the tracking needs to
be per-vCPU, e.g. so that KVM doesn't prematurely re-enable AVIC if multiple
vCPUs have to-be-injected IRQs).
- Fix an undefined behavior warning where a crafty userspace can read the
"avic" module param before it's fully initialized.
- Fix a (likely benign) bug in the "OS-visible workarounds" handling, where
KVM could clobber state when enabling virtualization on multiple CPUs in
parallel, and clean up and optimize the code.
- Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a
"too large" size based purely on user input, and clean up and harden the
related pinning code.
- Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as
doing so for an SNP guest will trigger an RMP violation #PF and crash the
host.
- Protect all of sev_mem_enc_register_region() with kvm->lock to ensure
sev_guest() is stable for the entirety of the function.
- Lock all vCPUs when synchronizing VMSAs for SNP guests to ensure the VMSA
page isn't actively being used.
- Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are
required to hold kvm->lock (KVM has had multiple bugs due to "is SEV?" checks
becoming stale), enforced by lockdep. Add and use vCPU-scoped APIs when
possible/appropriate, as all checks that originate from a vCPU are
guaranteed to be stable.
- Convert a pile of kvm->lock SEV code to guard().
Pull RCU updates from Joel Fernandes:
"NOCB CPU management:
- Consolidate rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() to
reduce code duplication
- Extract nocb_bypass_needs_flush() helper to reduce duplication in
NOCB bypass path
rcutorture/torture infrastructure:
- Add NOCB01 config for RCU_LAZY torture testing
- Add NOCB02 config for NOCB poll mode testing
- Add TRIVIAL-PREEMPT config for textbook-style preemptible RCU
torture
- Test call_srcu() with preemption both disabled and enabled
- Remove kvm-check-branches.sh in favor of kvm-series.sh
- Make hangs more visible in torture.sh output
- Add informative message for tests without a recheck file
- Fix numeric test comparison in srcu_lockdep.sh
- Use torture_shutdown_init() in refscale and rcuscale instead of
open-coded shutdown functions
- Fix modulo-zero error in torture_hrtimeout_ns().
SRCU:
- Fix SRCU read flavor macro comments
- Fix s/they disables/they disable/ typo in srcu_read_unlock_fast()
RCU Tasks:
- Document that RCU Tasks Trace grace periods now imply RCU grace
periods
- Remove unnecessary smp_store_release() in cblist_init_generic()"
* tag 'rcu.2026.03.31a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
rcutorture: Test call_srcu() with preemption disabled and not
rcu: Add BOOTPARAM_RCU_STALL_PANIC Kconfig option
torture: Avoid modulo-zero error in torture_hrtimeout_ns()
rcu/nocb: Extract nocb_bypass_needs_flush() to reduce duplication
rcu/nocb: Consolidate rcu_nocb_cpu_offload/deoffload functions
rcu-tasks: Remove unnecessary smp_store_release() in cblist_init_generic()
rcutorture: Add NOCB02 config for nocb poll mode testing
rcutorture: Add NOCB01 config for RCU_LAZY torture testing
rcu-tasks: Document that RCU Tasks Trace grace periods now imply RCU grace periods
srcu: Fix s/they disables/they disable/ typo in srcu_read_unlock_fast()
srcu: Fix SRCU read flavor macro comments
rcuscale: Ditch rcu_scale_shutdown in favor of torture_shutdown_init()
refscale: Ditch ref_scale_shutdown in favor of torture_shutdown_init()
rcutorture: Fix numeric "test" comparison in srcu_lockdep.sh
torture: Print informative message for test without recheck file
torture: Make hangs more visible in torture.sh output
kvm-check-branches.sh: Remove in favor of kvm-series.sh
rcutorture: Add a textbook-style trivial preemptible RCU
This fixes the following compilation error when using the header from
C++ code:
error: assigning to 'struct scx_flux__data_uei_dump *' from
incompatible type 'void *'
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A small change to improve type safety/const correctness.
__COMPAT_read_enum() already has const string parameters.
It fixes a warning when using the header in C++ code:
error: ISO C++11 does not allow conversion from string literal
to 'char *' [-Werror,-Wwritable-strings]
That's because string literals have type char[N] in C and
const char[N] in C++.
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_qmap uses global BPF queue maps (BPF_MAP_TYPE_QUEUE) that any CPU's
ops.dispatch() can pop from. When a CPU pops a task that can't run on it
(e.g. a pinned per-CPU kthread), it inserts the task into SHARED_DSQ.
consume_dispatch_q() then skips the task due to affinity mismatch, leaving it
stranded until some CPU in its allowed mask calls ops.dispatch(). This doesn't
cause indefinite stalls -- the periodic tick keeps firing (can_stop_idle_tick()
returns false when softirq is pending) -- but can cause noticeable scheduling
delays.
After inserting to SHARED_DSQ, kick the task's home CPU if this CPU can't run
it. There's a small race window where the home CPU can enter idle before the
kick lands -- if a per-CPU kthread like ksoftirqd is the stranded task, this
can trigger a "NOHZ tick-stop error" warning. The kick arrives shortly after
and the home CPU drains the task.
Rather than fully eliminating the warning by routing pinned tasks to local or
global DSQs, the current code keeps them going through the normal BPF queue
path and documents the race and the resulting warning in detail. scx_qmap is an
example scheduler and having tasks go through the usual dispatch path is useful
for testing. The detailed comment also serves as a reference for other
schedulers that may encounter similar warnings.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This series cleans up some of the special user copy functions naming and
semantics. In particular, get rid of the (very traditional) double
underscore names and behavior: the whole "optimize away the range check"
model has been largely excised from the other user accessors because
it's so subtle and can be unsafe, but also because it's just not a
relevant optimization any more.
To do that, a couple of drivers that misused the "user" copies as kernel
copies in order to get non-temporal stores had to be fixed up, but that
kind of code should never have been allowed anyway.
The x86-only "nocache" version was also renamed to more accurately
reflect what it actually does.
This was all done because I looked at this code due to a report by Jann
Horn, and I just couldn't stand the inconsistent naming, the horrible
semantics, and the random misuse of these functions. This code should
probably be cleaned up further, but it's at least slightly closer to
normal semantics.
I had a more intrusive series that went even further in trying to
normalize the semantics, but that ended up hitting so many other
inconsistencies between different architectures in this area (eg
'size_t' vs 'unsigned long' vs 'int' as size arguments, and various
iovec check differences that Vasily Gorbik pointed out) that I ended up
with this more limited version that fixed the worst of the issues.
Reported-by: Jann Horn <jannh@google.com>
Tested-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/all/CAHk-=wgg1QVWNWG-UCFo1hx0zqrPnB3qhPzUTrWNft+MtXQXig@mail.gmail.com/
* nocache-cleanup:
x86-64/arm64/powerpc: clean up and rename __copy_from_user_flushcache
x86: rename and clean up __copy_from_user_inatomic_nocache()
x86-64: rename misleadingly named '__copy_user_nocache()' function
There are no tests that verify enablement and disablement of team driver
ports with teamd. This should work even with changes to the enablement
option, so it is important to test.
This test sets up an active-backup network configuration across two
network namespaces, and tries to send traffic while changing which
link is the active one.
Also increase the team test timeout to 300 seconds, because gracefully
killing teamd can take 30 seconds for each instance.
Signed-off-by: Marc Harvey <marcharvey@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260409-teaming-driver-internal-v7-5-f47e7589685d@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There are currently no kernel tests that verify the effect of setting
the enabled team driver option. In a followup patch, there will be
changes to this option, so it will be important to make sure it still
behaves as it does now.
The test verifies that TCP continues to work across two different team
devices in separate network namespaces, even when member links are
manually disabled.
Signed-off-by: Marc Harvey <marcharvey@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260409-teaming-driver-internal-v7-4-f47e7589685d@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
KVM nested SVM changes for 7.1 (with one common x86 fix)
- To minimize the probability of corrupting guest state, defer KVM's
non-architectural delivery of exception payloads (e.g. CR2 and DR6) until
consumption of the payload is imminent, and force delivery of the payload
in all paths where userspace saves relevant state.
- Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT to fix a
bug where L2's CR2 can get corrupted after a save/restore, e.g. if the VM
is migrated while L2 is faulting in memory.
- Fix a class of nSVM bugs where some fields written by the CPU are not
synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
up-to-date when saved by KVM_GET_NESTED_STATE.
- Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
save+restore.
- Add a variety of missing nSVM consistency checks.
- Fix several bugs where KVM failed to correctly update VMCB fields on nested
#VMEXIT.
- Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
SVM-related instructions.
- Add support for save+restore of virtualized LBRs (on SVM).
- Refactor various helpers and macros to improve clarity and (hopefully) make
the code easier to maintain.
- Aggressively sanitize fields when copying from vmcb12 to guard against
unintentionally allowing L1 to utilize yet-to-be-defined features.
- Fix several bugs where KVM botched rAX legality checks when emulating SVM
instructions. Note, KVM is still flawed in that KVM doesn't address size
prefix overrides for 64-bit guests; this should probably be documented as a
KVM erratum.
- Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
somewhat arbitrarily synthesizing #GP (i.e. don't bastardize AMD's already-
sketchy behavior of generating #GP for "unsupported" addresses).
- Cache all used vmcb12 fields to further harden against TOCTOU bugs.
KVM selftests changes for 7.1
- Add support for Hygon CPUs in KVM selftests.
- Fix a bug in the MSR test where it would get false failures on AMD/Hygon
CPUs with exactly one of RDPID or RDTSCP.
- Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a
bug where the kernel would attempt to collapse guest_memfd folios against
KVM's will.
KVM/arm64 updates for 7.1
* New features:
- Add support for tracing in the standalone EL2 hypervisor code,
which should help both debugging and performance analysis.
This comes with a full infrastructure for 'remote' trace buffers
that can be exposed by non-kernel entities such as firmware.
- Add support for GICv5 Per Processor Interrupts (PPIs), as the
starting point for supporting the new GIC architecture in KVM.
- Finally add support for pKVM protected guests, with anonymous
memory being used as a backing store. About time!
* Improvements and bug fixes:
- Rework the dreaded user_mem_abort() function to make it more
maintainable, reducing the amount of state being exposed to
the various helpers and rendering a substantial amount of
state immutable.
- Expand the Stage-2 page table dumper to support NV shadow
page tables on a per-VM basis.
- Tidy up the pKVM PSCI proxy code to be slightly less hard
to follow.
- Fix both SPE and TRBE in non-VHE configurations so that they
do not generate spurious, out of context table walks that
ultimately lead to very bad HW lockups.
- A small set of patches fixing the Stage-2 MMU freeing in error
cases.
- Tighten-up accepted SMC immediate value to be only #0 for host
SMCCC calls.
- The usual cleanups and other selftest churn.
LoongArch KVM changes for v7.1
1. Use CSR_CRMD_PLV in kvm_arch_vcpu_in_kernel().
2. Make vcpu_is_preempted() a macro & some enhancements.
3. Add DMSINTC irqchip in kernel support.
4. Add KVM PMU test cases for tools/selftests.
KVM/riscv changes for 7.1
- Fix steal time shared memory alignment checks
- Fix vector context allocation leak
- Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()
- Fix double-free of sdata in kvm_pmu_clear_snapshot_area()
- Fix integer overflow in kvm_pmu_validate_counter_mask()
- Fix shift-out-of-bounds in make_xfence_request()
- Fix lost write protection on huge pages during dirty logging
- Split huge pages during fault handling for dirty logging
- Skip CSR restore if VCPU is reloaded on the same core
- Implement kvm_arch_has_default_irqchip() for KVM selftests
- Factored-out ISA checks into separate sources
- Added hideleg to struct kvm_vcpu_config
- Factored-out VCPU config into separate sources
- Support configuration of per-VM HGATP mode from KVM user space
Add a selftest covering ETH_HLEN-sized IPv4/IPv6 EtherType inputs for
bpf_prog_test_run_skb().
Reuse a single zero-initialized struct ethhdr eth_hlen and set
eth_hlen.h_proto from the per-test h_proto field.
Also add a dedicated tc_adjust_room program and route the short
IPv4/IPv6 cases to it, so the selftest actually exercises the
bpf_skb_adjust_room() path from the report.
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://lore.kernel.org/r/20260408034623.180320-3-sun.jian.kdev@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Replace shm_open/shm_unlink with memfd_create in the shmem subtest.
shm_open requires /dev/shm to be mounted, which is not always available
in test environments, causing the test to fail with ENOENT.
memfd_create creates an anonymous shmem-backed fd without any filesystem
dependency while exercising the same shmem accounting path.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20260412210636.47516-1-alexei.starovoitov@gmail.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
xulang <xulang@uniontech.com> says:
====================
Fix OOB read when copying element from a BPF_MAP_TYPE_CGROUP_STORAGE
map to another pcpu map with the same value_size that is not rounded
up to 8 bytes, and add a test case to reproduce the issue.
The root cause is that pcpu_init_value() uses copy_map_value_long() which
rounds up the copy size to 8 bytes, but CGROUP_STORAGE map values are not
8-byte aligned (e.g., 4-byte). This causes a 4-byte OOB read when
the copy is performed.
====================
Link: https://lore.kernel.org/r/7653EEEC2BAB17DF+20260402073948.2185396-1-xulang@uniontech.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a test case to reproduce the out-of-bounds read issue when copying
from a cgroup storage map to a pcpu map with a value_size not rounded
up to 8 bytes.
The test creates:
1. A CGROUP_STORAGE map with 4-byte value (not 8-byte aligned)
2. A LRU_PERCPU_HASH map with 4-byte value (same size)
When a socket is created in the cgroup, the BPF program triggers
bpf_map_update_elem() which calls copy_map_value_long(). This function
rounds up the copy size to 8 bytes, but the cgroup storage buffer is
only 4 bytes, causing an OOB read (before the fix).
Signed-off-by: Lang Xu <xulang@uniontech.com>
Link: https://lore.kernel.org/r/D63BF0DBFF1EA122+20260402074236.2187154-2-xulang@uniontech.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
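The size arithmetic behind the out-of-bounds read can be checked with a short sketch. The helper names are hypothetical; only the 8-byte rounding and the 4-byte value size come from the description above.

```python
def round_up(size, align=8):
    """Round size up to the next multiple of align."""
    return (size + align - 1) // align * align


def oob_read_bytes(value_size, align=8):
    """Bytes read past the end of a value_size buffer by a rounded-up copy."""
    return round_up(value_size, align) - value_size


print(oob_read_bytes(4))  # 4-byte value, copy rounded up to 8 bytes
```

A value size that is already a multiple of 8 gives zero excess bytes, which is why only unaligned sizes such as 4 trigger the issue.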
Commit efc11a6678 ("bpf: Improve bounds when tnum has a single
possible value") improved the bounds refinement to detect when the tnum
and u64 range overlap in a single value (and the bounds can thus be set
to that value).
Eduard then noticed that it broke the slow-mode reg_bounds selftests
because they don't have an equivalent logic and are therefore unable to
refine the bounds as much as the verifier. The following test case
illustrates this.
ACTUAL TRUE1: scalar(u64=0xffffffff00000000,u32=0,s64=0xffffffff00000000,s32=0)
EXPECTED TRUE1: scalar(u64=[0xfffffffe00000001; 0xffffffff00000000],u32=0,s64=[0xfffffffe00000001; 0xffffffff00000000],s32=0)
[...]
#323/1007 reg_bounds_gen_consts_s64_s32/(s64)[0xfffffffe00000001; 0xffffffff00000000] (s32)<op> S64_MIN:FAIL
with the verifier logs:
[...]
19: w0 = w6 ; R0=scalar(smin=0,smax=umax=0xffffffff,
var_off=(0x0; 0xffffffff))
R6=scalar(smin=0xfffffffe00000001,smax=0xffffffff00000000,
umin=0xfffffffe00000001,umax=0xffffffff00000000,
var_off=(0xfffffffe00000000; 0x1ffffffff))
20: w0 = w7 ; R0=0 R7=0x8000000000000000
21: if w6 == w7 goto pc+3
[...]
from 21 to 25: [...]
25: w0 = w6 ; R0=0 R6=0xffffffff00000000
; ^
; unexpected refined value
26: w0 = w7 ; R0=0 R7=0x8000000000000000
27: exit
When w6 == w7 is true, the verifier can deduce that R6's tnum is
equal to (0xfffffffe00000000; 0x100000000) and then use that information
to refine the bounds: the tnum only overlaps with the u64 range in
0xffffffff00000000. The reg_bounds selftest doesn't know about tnums
and therefore fails to perform the same refinement.
This issue happens when the tnum carries information that cannot be
represented in the ranges, as otherwise the selftest could reach the
same refined value using just the ranges. The tnum thus needs to
represent non-contiguous values (e.g., R6's tnum above, after the
condition). The only way this can happen in the reg_bounds selftest is
at the boundary between the 32 and 64bit ranges. We therefore only need
to handle that case.
This patch fixes the selftest refinement logic by checking if the u32
and u64 ranges overlap in a single value. If so, the ranges can be set
to that value. We need to handle two cases: either they overlap in
umin64...
  u64 values
  matching u32 range:     xxx        xxx        xxx        xxx
                      |--------------------------------------|
  u64 range:          0                xxxxx            UMAX64

or in umax64:

  u64 values
  matching u32 range:     xxx        xxx        xxx        xxx
                      |--------------------------------------|
  u64 range:          0                     xxxxx       UMAX64
To detect the first case, we decrease umax64 to the maximum value that
matches the u32 range. If that happens to be umin64, then umin64 is the
only overlap. We proceed similarly for the second case, increasing
umin64 to the minimum value that matches the u32 range.
Note this is similar to how the verifier handles the general case using
tnum, but we don't need to care about a single-value overlap in the
middle of the range. That case is not possible when comparing two
ranges.
This patch also adds two test cases reproducing this bug as part of the
normal test runs (without SLOW_TESTS=1).
Fixes: efc11a6678 ("bpf: Improve bounds when tnum has a single possible value")
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Closes: https://lore.kernel.org/bpf/4e6dd64a162b3cab3635706ae6abfdd0be4db5db.camel@gmail.com/
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/ada9UuSQi2SE2IfB@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
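The single-value overlap check described above can be modelled as below. This is a sketch written from the description, not the selftest's code: the function names are hypothetical, and it assumes a non-wrapping u32 range (umin32 <= umax32).

```python
U32_MAX = 0xFFFFFFFF


def max_matching(umax64, umin32, umax32):
    """Largest v <= umax64 whose low 32 bits lie in [umin32, umax32]."""
    hi, lo = umax64 >> 32, umax64 & U32_MAX
    if lo >= umax32:
        return (hi << 32) | umax32
    if lo >= umin32:
        return umax64
    if hi == 0:
        return None  # nothing at or below umax64 matches
    return ((hi - 1) << 32) | umax32


def min_matching(umin64, umin32, umax32):
    """Smallest v >= umin64 whose low 32 bits lie in [umin32, umax32]."""
    hi, lo = umin64 >> 32, umin64 & U32_MAX
    if lo <= umin32:
        return (hi << 32) | umin32
    if lo <= umax32:
        return umin64
    if hi == U32_MAX:
        return None  # nothing at or above umin64 matches
    return ((hi + 1) << 32) | umin32


def refine_u64(umin64, umax64, umin32, umax32):
    """Collapse the u64 range if it overlaps the u32 range in one value."""
    # Case 1: decreasing umax64 lands on umin64, so umin64 is the only overlap.
    if max_matching(umax64, umin32, umax32) == umin64:
        return umin64, umin64
    # Case 2: increasing umin64 lands on umax64, so umax64 is the only overlap.
    if min_matching(umin64, umin32, umax32) == umax64:
        return umax64, umax64
    return umin64, umax64
```

For the failing case quoted above, refine_u64(0xfffffffe00000001, 0xffffffff00000000, 0, 0) collapses both bounds to 0xffffffff00000000, matching the value the verifier reports for R6.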
Add selftests to verify SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() correctly
return NULL/zero when dst_reg == src_reg and is_fullsock == 0.
Three subtests are included:
- get_sk: ctx->sk with same src/dst register (SOCK_OPS_GET_SK)
- get_field: ctx->snd_cwnd with same src/dst register (SOCK_OPS_GET_FIELD)
- get_sk_diff_reg: ctx->sk with different src/dst register (baseline)
Each BPF program uses inline asm (__naked) to force specific register
allocation, reads is_fullsock first, then loads the field using the same
(or different) register. The test triggers TCP_NEW_SYN_RECV via a TCP
handshake and checks that the result is NULL/zero when is_fullsock == 0.
Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260407022720.162151-3-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>