Pull vfs mount updates from Christian Brauner:
- Add FSMOUNT_NAMESPACE flag to fsmount() that creates a new mount
namespace with the newly created filesystem attached to a copy of the
real rootfs. This returns a namespace file descriptor instead of an
O_PATH mount fd, similar to how OPEN_TREE_NAMESPACE works for
open_tree().
This allows creating a new filesystem and immediately placing it in a
new mount namespace in a single operation, which is useful for
container runtimes and other namespace-based isolation mechanisms.
This accompanies OPEN_TREE_NAMESPACE and avoids a needless detour via
OPEN_TREE_NAMESPACE to get the same effect. Will be especially useful
when you mount an actual filesystem to be used as the container
rootfs.
- Currently, creating a new mount namespace always copies the entire
mount tree from the caller's namespace. For containers and sandboxes
that intend to build their mount table from scratch this is wasteful:
they inherit a potentially large mount tree only to immediately tear
it down.
This series adds support for creating a mount namespace that contains
only a clone of the root mount, with none of the child mounts. Two
new flags are introduced:
- CLONE_EMPTY_MNTNS (0x400000000) for clone3(), using the 64-bit flag space
- UNSHARE_EMPTY_MNTNS (0x00100000) for unshare()
Both flags imply CLONE_NEWNS. The resulting namespace contains a
single nullfs root mount with an immutable empty directory. The
intended workflow is to then mount a real filesystem (e.g., tmpfs)
over the root and build the mount table from there.
- Allow MOVE_MOUNT_BENEATH to target the caller's rootfs, allowing to
switch out the rootfs without pivot_root(2).
The traditional approach to switching the rootfs involves
pivot_root(2) or a chroot_fs_refs()-based mechanism that atomically
updates fs->root for all tasks sharing the same fs_struct. This has
consequences for fork(), unshare(CLONE_FS), and setns().
This series instead decomposes root-switching into individually
atomic, locally-scoped steps:
fd_tree = open_tree(-EBADF, "/newroot", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
fchdir(fd_tree);
move_mount(fd_tree, "", AT_FDCWD, "/", MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH);
chroot(".");
umount2(".", MNT_DETACH);
Since each step only modifies the caller's own state, the
fork/unshare/setns races are eliminated by design.
A key step to making this possible is to remove the locked mount
restriction. Originally MOVE_MOUNT_BENEATH doesn't support mounting
beneath a mount that is locked. The locked mount protects the
underlying mount from being revealed. This is a core mechanism of
unshare(CLONE_NEWUSER | CLONE_NEWNS). The mounts in the new mount
namespace become locked. That effectively makes the new mount table
useless as the caller cannot ever get rid of any of the mounts no
matter how useless they are.
We can lift this restriction though. We simply transfer the locked
property from the top mount to the mount beneath. This works because
what we care about is to protect the underlying mount aka the parent.
The mount mounted between the parent and the top mount takes over the
job of protecting the parent mount from the top mount mount. This
leaves us free to remove the locked property from the top mount which
can consequently be unmounted:
unshare(CLONE_NEWUSER | CLONE_NEWNS)
and we inherit a clone of procfs on /proc then currently we cannot
unmount it as:
umount -l /proc
will fail with EINVAL because the procfs mount is locked.
After this series we can now do:
mount --beneath -t tmpfs tmpfs /proc
umount -l /proc
after which a tmpfs mount has been placed beneath the procfs mount.
The tmpfs mount has become locked and the procfs mount has become
unlocked.
This means you can safely modify an inherited mount table after
unprivileged namespace creation.
Afterwards we simply make it possible to move a mount beneath the
rootfs allowing to upgrade the rootfs.
Removing the locked restriction makes this very useful for containers
created with unshare(CLONE_NEWUSER | CLONE_NEWNS) to reshuffle an
inherited mount table safely and MOVE_MOUNT_BENEATH makes it possible
to switch out the rootfs instead of using the costly pivot_root(2).
* tag 'vfs-7.1-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests/namespaces: remove unused utils.h include from listns_efault_test
selftests/fsmount_ns: add missing TARGETS and fix cap test
selftests/empty_mntns: fix wrong CLONE_EMPTY_MNTNS hex value in comment
selftests/empty_mntns: fix statmount_alloc() signature mismatch
selftests/statmount: remove duplicate wait_for_pid()
mount: always duplicate mount
selftests/filesystems: add MOVE_MOUNT_BENEATH rootfs tests
move_mount: allow MOVE_MOUNT_BENEATH on the rootfs
move_mount: transfer MNT_LOCKED
selftests/filesystems: add clone3 tests for empty mount namespaces
selftests/filesystems: add tests for empty mount namespaces
namespace: allow creating empty mount namespaces
selftests: add FSMOUNT_NAMESPACE tests
selftests/statmount: add statmount_alloc() helper
tools: update mount.h header
mount: add FSMOUNT_NAMESPACE
mount: simplify __do_loopback()
mount: start iterating from start of rbtree
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Support HW queue leasing, allowing containers to be granted access
to HW queues for zero-copy operations and AF_XDP
- Number of code moves to help the compiler with inlining. Avoid
output arguments for returning drop reason where possible
- Rework drop handling within qdiscs to include more metadata about
the reason and dropping qdisc in the tracepoints
- Remove the rtnl_lock use from IP Multicast Routing
- Pack size information into the Rx Flow Steering table pointer
itself. This allows making the table itself a flat array of u32s,
thus making the table allocation size a power of two
- Report TCP delayed ack timer information via socket diag
- Add ip_local_port_step_width sysctl to allow distributing the
randomly selected ports more evenly throughout the allowed space
- Add support for per-route tunsrc in IPv6 segment routing
- Start work of switching sockopt handling to iov_iter
- Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
buffer size drifting up
- Support MSG_EOR in MPTCP
- Add stp_mode attribute to the bridge driver for STP mode selection.
This addresses concerns about call_usermodehelper() usage
- Remove UDP-Lite support (as announced in 2023)
- Remove support for building IPv6 as a module. Remove the now
unnecessary function calling indirection
Cross-tree stuff:
- Move Michael MIC code from generic crypto into wireless, it's
considered insecure but some WiFi networks still need it
Netfilter:
- Switch nft_fib_ipv6 module to no longer need temporary dst_entry
object allocations by using fib6_lookup() + RCU.
Florian W reports this gets us ~13% higher packet rate
- Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
switch the service tables to be per-net. Convert some code that
walks the service lists to use RCU instead of the service_mutex
- Add more opinionated input validation to lower security exposure
- Make IPVS hash tables to be per-netns and resizable
Wireless:
- Finished assoc frame encryption/EPPKE/802.1X-over-auth
- Radar detection improvements
- Add 6 GHz incumbent signal detection APIs
- Multi-link support for FILS, probe response templates and client
probing
- New APIs and mac80211 support for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware
Driver API:
- Add numerical ID for devlink instances (to avoid having to create
fake bus/device pairs just to have an ID). Support shared devlink
instances which span multiple PFs
- Add standard counters for reporting pause storm events (implement
in mlx5 and fbnic)
- Add configuration API for completion writeback buffering (implement
in mana)
- Support driver-initiated change of RSS context sizes
- Support DPLL monitoring input frequency (implement in zl3073x)
- Support per-port resources in devlink (implement in mlx5)
Misc:
- Expand the YAML spec for Netfilter
Drivers
- Software:
- macvlan: support multicast rx for bridge ports with shared
source MAC address
- team: decouple receive and transmit enablement for IEEE 802.3ad
LACP "independent control"
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- support high order pages in zero-copy mode (for payload
coalescing)
- support multiple packets in a page (for systems with 64kB
pages)
- Broadcom 25-400GE (bnxt):
- implement XDP RSS hash metadata extraction
- add software fallback for UDP GSO, lowering the IOMMU cost
- Broadcom 800GE (bnge):
- add link status and configuration handling
- add various HW and SW statistics
- Marvell/Cavium:
- NPC HW block support for cn20k
- Huawei (hinic3):
- add mailbox / control queue
- add rx VLAN offload
- add driver info and link management
- Ethernet NICs:
- Marvell/Aquantia:
- support reading SFP module info on some AQC100 cards
- Realtek PCI (r8169):
- add support for RTL8125cp
- Realtek USB (r8152):
- support for the RTL8157 5Gbit chip
- add 2500baseT EEE status/configuration support
- Ethernet NICs embedded and off-the-shelf IP:
- Synopsys (stmmac):
- cleanup and reorganize SerDes handling and PCS support
- cleanup descriptor handling and per-platform data
- cleanup and consolidate MDIO defines and handling
- shrink driver memory use for internal structures
- improve Tx IRQ coalescing
- improve TCP segmentation handling
- add support for Spacemit K3
- Cadence (macb):
- support PHYs that have inband autoneg disabled with GEM
- support IEEE 802.3az EEE
- rework usrio capabilities and handling
- AMD (xgbe):
- improve power management for S0i3
- improve TX resilience for link-down handling
- Virtual:
- Google cloud vNIC:
- support larger ring sizes in DQO-QPL mode
- improve HW-GRO handling
- support UDP GSO for DQO format
- PCIe NTB:
- support queue count configuration
- Ethernet PHYs:
- automatically disable PHY autonomous EEE if MAC is in charge
- Broadcom:
- add BCM84891/BCM84892 support
- Micrel:
- support for LAN9645X internal PHY
- Realtek:
- add RTL8224 pair order support
- support PHY LEDs on RTL8211F-VD
- support spread spectrum clocking (SSC)
- Maxlinear:
- add PHY-level statistics via ethtool
- Ethernet switches:
- Maxlinear (mxl862xx):
- support for bridge offloading
- support for VLANs
- support driver statistics
- Bluetooth:
- large number of fixes and new device IDs
- Mediatek:
- support MT6639 (MT7927)
- support MT7902 SDIO
- WiFi:
- Intel (iwlwifi):
- UNII-9 and continuing UHR work
- MediaTek (mt76):
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- Qualcomm (ath12k):
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
- support IPQ5424
- Realtek:
- add USB RX aggregation to improve performance
- add USB TX flow control by tracking in-flight URBs
- Cellular:
- IPA v5.2 support"
* tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits)
net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
wireguard: allowedips: remove redundant space
tools: ynl: add sample for wireguard
wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
MAINTAINERS: Add netkit selftest files
selftests/net: Add additional test coverage in nk_qlease
selftests/net: Split netdevsim tests from HW tests in nk_qlease
tools/ynl: Make YnlFamily closeable as a context manager
net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
net: airoha: Fix VIP configuration for AN7583 SoC
net: caif: clear client service pointer on teardown
net: strparser: fix skb_head leak in strp_abort_strp()
net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()
selftests/bpf: add test for xdp_master_redirect with bond not up
net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
sctp: disable BH before calling udp_tunnel_xmit_skb()
sctp: fix missing encap_port propagation for GSO fragments
net: airoha: Rely on net_device pointer in ETS callbacks
...
Pull bpf updates from Alexei Starovoitov:
- Welcome new BPF maintainers: Kumar Kartikeya Dwivedi, Eduard
Zingerman while Martin KaFai Lau reduced his load to Reviwer.
- Lots of fixes everywhere from many first time contributors. Thank you
All.
- Diff stat is dominated by mechanical split of verifier.c into
multiple components:
- backtrack.c: backtracking logic and jump history
- states.c: state equivalence
- cfg.c: control flow graph, postorder, strongly connected
components
- liveness.c: register and stack liveness
- fixups.c: post-verification passes: instruction patching, dead
code removal, bpf_loop inlining, finalize fastcall
8k line were moved. verifier.c still stands at 20k lines.
Further refactoring is planned for the next release.
- Replace dynamic stack liveness with static stack liveness based on
data flow analysis.
This improved the verification time by 2x for some programs and
equally reduced memory consumption. New logic is in liveness.c and
supported by constant folding in const_fold.c (Eduard Zingerman,
Alexei Starovoitov)
- Introduce BTF layout to ease addition of new BTF kinds (Alan Maguire)
- Use kmalloc_nolock() universally in BPF local storage (Amery Hung)
- Fix several bugs in linked registers delta tracking (Daniel Borkmann)
- Improve verifier support of arena pointers (Emil Tsalapatis)
- Improve verifier tracking of register bounds in min/max and tnum
domains (Harishankar Vishwanathan, Paul Chaignon, Hao Sun)
- Further extend support for implicit arguments in the verifier (Ihor
Solodrai)
- Add support for nop,nop5 instruction combo for USDT probes in libbpf
(Jiri Olsa)
- Support merging multiple module BTFs (Josef Bacik)
- Extend applicability of bpf_kptr_xchg (Kaitao Cheng)
- Retire rcu_trace_implies_rcu_gp() (Kumar Kartikeya Dwivedi)
- Support variable offset context access for 'syscall' programs (Kumar
Kartikeya Dwivedi)
- Migrate bpf_task_work and dynptr to kmalloc_nolock() (Mykyta
Yatsenko)
- Fix UAF in in open-coded task_vma iterator (Puranjay Mohan)
* tag 'bpf-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (241 commits)
selftests/bpf: cover short IPv4/IPv6 inputs with adjust_room
bpf: reject short IPv4/IPv6 inputs in bpf_prog_test_run_skb
selftests/bpf: Use memfd_create instead of shm_open in cgroup_iter_memcg
selftests/bpf: Add test for cgroup storage OOB read
bpf: Fix OOB in pcpu_init_value
selftests/bpf: Fix reg_bounds to match new tnum-based refinement
selftests/bpf: Add tests for non-arena/arena operations
bpf: Allow instructions with arena source and non-arena dest registers
bpftool: add missing fsession to the usage and docs of bpftool
docs/bpf: add missing fsession attach type to docs
bpf: add missing fsession to the verifier log
bpf: Move BTF checking logic into check_btf.c
bpf: Move backtracking logic to backtrack.c
bpf: Move state equivalence logic to states.c
bpf: Move check_cfg() into cfg.c
bpf: Move compute_insn_live_regs() into liveness.c
bpf: Move fixup/post-processing logic from verifier.c into fixups.c
bpf: Simplify do_check_insn()
bpf: Move checks for reserved fields out of the main pass
bpf: Delete unused variable
...
Pull module updates from Sami Tolvanen:
"Kernel symbol flags:
- Replace the separate *_gpl symbol sections (__ksymtab_gpl and
__kcrctab_gpl) with a unified symbol table and a new __kflagstab
section.
This section stores symbol flags, such as the GPL-only flag, as an
8-bit bitset for each exported symbol. This is a cleanup that
simplifies symbol lookup in the module loader by avoiding table
fragmentation and will allow a cleaner way to add more flags later
if needed.
Module signature UAPI:
- Move struct module_signature to the UAPI headers to allow reuse by
tools outside the kernel proper, such as kmod and
scripts/sign-file.
This also renames a few constants for clarity and drops unused
signature types as preparation for hash-based module integrity
checking work that's in progress.
Sysfs:
- Add a /sys/module/<module>/import_ns sysfs attribute to show the
symbol namespaces imported by loaded modules.
This makes it easier to verify driver API access at runtime on
systems that care about such things (e.g. Android).
Cleanups and fixes:
- Force sh_addr to 0 for all sections in module.lds. This prevents
non-zero section addresses when linking modules with 'ld.bfd -r',
which confused elfutils.
- Fix a memory leak of charp module parameters on module unload when
the kernel is configured with CONFIG_SYSFS=n.
- Override the -EEXIST error code returned by module_init() to
userspace. This prevents confusion with the errno reserved by the
module loader to indicate that a module is already loaded.
- Simplify the warning message and drop the stack dump on positive
returns from module_init().
- Drop unnecessary extern keywords from function declarations and
synchronize parse_args() arguments with their implementation"
* tag 'modules-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: (23 commits)
module: Simplify warning on positive returns from module_init()
module: Override -EEXIST module return
documentation: remove references to *_gpl sections
module: remove *_gpl sections from vmlinux and modules
module: deprecate usage of *_gpl sections in module loader
module: use kflagstab instead of *_gpl sections
module: populate kflagstab in modpost
module: add kflagstab section to vmlinux and modules
module: define ksym_flags enumeration to represent kernel symbol flags
selftests/bpf: verify_pkcs7_sig: Use 'struct module_signature' from the UAPI headers
sign-file: use 'struct module_signature' from the UAPI headers
tools uapi headers: add linux/module_signature.h
module: Move 'struct module_signature' to UAPI
module: Give MODULE_SIG_STRING a more descriptive name
module: Give 'enum pkey_id_type' a more specific name
module: Drop unused signature types
extract-cert: drop unused definition of PKEY_ID_PKCS7
docs: symbol-namespaces: mention sysfs attribute
module: expose imported namespaces via sysfs
module: Remove extern keyword from param prototypes
...
Pull scheduler updates from Ingo Molnar:
"Fair scheduling updates:
- Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
- Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
- Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
- Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
- Avoid overflow in enqueue_entity() (K Prateek Nayak)
- Update overutilized detection (Vincent Guittot)
- Prevent negative lag increase during delayed dequeue (Vincent Guittot)
- Clear buddies for preempt_short (Vincent Guittot)
- Implement more complex proportional newidle balance (Peter Zijlstra)
- Increase weight bits for avg_vruntime (Peter Zijlstra)
- Use full weight to __calc_delta() (Peter Zijlstra)
RT and DL scheduling updates:
- Fix incorrect schedstats for rt and dl thread (Dengjun Su)
- Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
- Move group schedulability check to sched_rt_global_validate()
(Michal Koutný)
- Add reporting of runtime left & abs deadline to sched_getattr()
for DEADLINE tasks (Tommaso Cucinotta)
Scheduling topology updates by K Prateek Nayak:
- Compute sd_weight considering cpuset partitions
- Extract "imb_numa_nr" calculation into a separate helper
- Allocate per-CPU sched_domain_shared in s_data
- Switch to assigning "sd->shared" from s_data
- Remove sched_domain_shared allocation with sd_data
Energy-aware scheduling updates:
- Filter false overloaded_group case for EAS (Vincent Guittot)
- PM: EM: Switch to rcu_dereference_all() in wakeup path
(Dietmar Eggemann)
Infrastructure updates:
- Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)
Proxy scheduling updates by John Stultz:
- Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
- Minimise repeated sched_proxy_exec() checking
- Fix potentially missing balancing with Proxy Exec
- Fix and improve task::blocked_on et al handling
- Add assert_balance_callbacks_empty() helper
- Add logic to zap balancing callbacks if we pick again
- Move attach_one_task() and attach_task() helpers to sched.h
- Handle blocked-waiter migration (and return migration)
- Add K Prateek Nayak to scheduler reviewers for proxy execution
Misc cleanups and fixes by John Stultz, Joseph Salisbury, Peter
Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap, Shrikanth
Hegde, Vincent Guittot, Zhan Xusheng, Xie Yuanbin and Vincent Guittot"
* tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
sched/eevdf: Clear buddies for preempt_short
sched/rt: Cleanup global RT bandwidth functions
sched/rt: Move group schedulability check to sched_rt_global_validate()
sched/rt: Skip group schedulable check with rt_group_sched=0
sched/fair: Avoid overflow in enqueue_entity()
sched: Use u64 for bandwidth ratio calculations
sched/fair: Prevent negative lag increase during delayed dequeue
sched/fair: Use sched_energy_enabled()
sched: Handle blocked-waiter migration (and return migration)
sched: Move attach_one_task and attach_task helpers to sched.h
sched: Add logic to zap balance callbacks if we pick again
sched: Add assert_balance_callbacks_empty helper
sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
sched: Fix modifying donor->blocked on without proper locking
locking: Add task::blocked_lock to serialize blocked_on state
sched: Fix potentially missing balancing with Proxy Exec
sched: Minimise repeated sched_proxy_exec() checking
sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
MAINTAINERS: Add K Prateek Nayak to scheduler reviewers
sched/core: Get this cpu once in ttwu_queue_cond()
...
Merge in late fixes in preparation for the net-next PR.
Conflicts:
include/net/sch_generic.h
a6bd339dbb ("net_sched: fix skb memory leak in deferred qdisc drops")
ff2998f29f ("net: sched: introduce qdisc-specific drop reason tracing")
https://lore.kernel.org/adz0iX85FHMz0HdO@sirena.org.uk
drivers/net/ethernet/airoha/airoha_eth.c
1acdfbdb51 ("net: airoha: Fix VIP configuration for AN7583 SoC")
bf3471e6e6 ("net: airoha: Make flow control source port mapping dependent on nbq parameter")
Adjacent changes:
drivers/net/ethernet/airoha/airoha_ppe.c
f44218cd5e ("net: airoha: Reset PPE cpu port configuration in airoha_ppe_hw_init()")
7da62262ec ("inet: add ip_local_port_step_width sysctl to improve port usage distribution")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull btrfs updates from David Sterba:
"User visible changes:
- move shutdown ioctl support out of experimental features, a forced
stop of filesystem operation until the next unmount; additionally
there's a super block operation to forcibly remove a device from
under the filesystem that could lead to a shutdown or not if the
redundancy allows that
- report filesystem shutdown using fserror mechanism
- tree-checker updates:
- verify free space info, extent and bitmap items
- verify remap-tree items and related data in block group items
Performance improvements:
- speed up clearing first extent in the tracked range (+10%
throughput on sample workload)
- reduce COW rewrites of extent buffers during the same transaction
- avoid taking big device lock to update device stats during
transaction commit
- fix unnecessary flush on close when truncating empty files
(observed in practice on a backup application)
- prevent direct reclaim during compressed readahead to avoid stalls
under memory pressure
Notable fixes:
- fix chunk allocation strategy on RAID1-like block groups with
disproportionate device sizes, this could lead to ENOSPC due to
skewed reservation estimates
- adjust metadata reservation overcommit ratio to be less aggressive
and also try to flush if possible, this avoids ENOSPC and potential
transaction aborts in some edge cases (that are otherwise hard to
reproduce)
- fix silent IO error in encoded writes and ordered extent split in
zoned mode, the error was not correctly propagated to the address
space and could lead to zeroed ranges
- don't mark inline files NOCOMPRESS unexpectedly, the intent was to
do that for single block writes of regular files
- fix deadlock between reflink and transaction commit when using
flushoncommit
- fix overly strict item check of a running dev-replace operation
Core:
- zoned mode space reservation fixes:
- cap delayed refs metadata reservation to avoid overcommit
- update logic to reclaim partially unusable zones
- add another state to flush and reclaim partially used zone
- limit number of zones reclaimed in one go to avoid blocking
other operations
- don't let log trees consume global reserve on overcommit and fall
back to transaction commit
- revalidate extent buffer when checking its up-to-date status
- add self tests for zoned mode block group specifics
- reduce atomic allocations in some qgroup paths
- avoid unnecessary root node COW during snapshotting
- start new transaction in block group relocation conditionally
- faster check of NOCOW files on currently snapshotted root
- change how compressed bio size is tracked from bio and reduce the
structure size
- new tracepoint for search slot restart tracking
- checksum list manipulation improvements
- type, parameter cleanups, refactoring
- error handling improvements, transaction abort call adjustments"
* tag 'for-7.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (116 commits)
btrfs: btrfs_log_dev_io_error() on all bio errors
btrfs: fix silent IO error loss in encoded writes and zoned split
btrfs: skip clearing EXTENT_DEFRAG for NOCOW ordered extents
btrfs: use BTRFS_FS_UPDATE_UUID_TREE_GEN flag for UUID tree rescan check
btrfs: remove duplicate journal_info reset on failure to commit transaction
btrfs: tag as unlikely if statements that check for fs in error state
btrfs: fix double free in create_space_info() error path
btrfs: fix double free in create_space_info_sub_group() error path
btrfs: do not reject a valid running dev-replace
btrfs: only invalidate btree inode pages after all ebs are released
btrfs: prevent direct reclaim during compressed readahead
btrfs: replace BUG_ON() with error return in cache_save_setup()
btrfs: zstd: don't cache sectorsize in a local variable
btrfs: zlib: don't cache sectorsize in a local variable
btrfs: zlib: drop redundant folio address variable
btrfs: lzo: inline read/write length helpers
btrfs: use common eb range validation in read_extent_buffer_to_user_nofault()
btrfs: read eb folio index right before loops
btrfs: rename local variable for offset in folio
btrfs: unify types for binary search variables
...
Pull io_uring updates from Jens Axboe:
- Add a callback driven main loop for io_uring, and BPF struct_ops
on top to allow implementing custom event loop logic
- Decouple IOPOLL from being a ring-wide all-or-nothing setting,
allowing IOPOLL use cases to also issue certain white listed
non-polled opcodes
- Timeout improvements. Migrate internal timeout storage from
timespec64 to ktime_t for simpler arithmetic and avoid copying of
timespec data
- Zero-copy receive (zcrx) updates:
- Add a device-less mode (ZCRX_REG_NODEV) for testing and
experimentation where data flows through the copy fallback path
- Fix two-step unregistration regression, DMA length calculations,
xarray mark usage, and a potential 32-bit overflow in id
shifting
- Refactoring toward multi-area support: dedicated refill queue
struct, consolidated DMA syncing, netmem array refilling format,
and guard-based locking
- Zero-copy transmit (zctx) cleanup:
- Unify io_send_zc() and io_sendmsg_zc() into a single function
- Add vectorized registered buffer send for IORING_OP_SEND_ZC
- Add separate notification user_data via sqe->addr3 so
notification and completion CQEs can be distinguished without
extra reference counting
- Switch struct io_ring_ctx internal bitfields to explicit flag bits
with atomic-safe accessors, and annotate the known harmless races on
those flags
- Various optimizations caching ctx and other request fields in local
variables to avoid repeated loads, and cleanups for tctx setup, ring
fd registration, and read path early returns
* tag 'for-7.1/io_uring-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (58 commits)
io_uring: unify getting ctx from passed in file descriptor
io_uring/register: don't get a reference to the registered ring fd
io_uring/tctx: clean up __io_uring_add_tctx_node() error handling
io_uring/tctx: have io_uring_alloc_task_context() return tctx
io_uring/timeout: use 'ctx' consistently
io_uring/rw: clean up __io_read() obsolete comment and early returns
io_uring/zcrx: use correct mmap off constants
io_uring/zcrx: use dma_len for chunk size calculation
io_uring/zcrx: don't clear not allocated niovs
io_uring/zcrx: don't use mark0 for allocating xarray
io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring()
io_uring/zcrx: reject REG_NODEV with large rx_buf_size
io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP
io_uring/rsrc: use io_cache_free() to free node
io_uring/zcrx: rename zcrx [un]register functions
io_uring/zcrx: check ctrl op payload struct sizes
io_uring/zcrx: cache fallback availability in zcrx ctx
io_uring/zcrx: warn on a repeated area append
io_uring/zcrx: consolidate dma syncing
io_uring/zcrx: netmem array as refiling format
...
Pull block updates from Jens Axboe:
- Add shared memory zero-copy I/O support for ublk, bypassing per-I/O
copies between kernel and userspace by matching registered buffer
PFNs at I/O time. Includes selftests.
- Refactor bio integrity to support filesystem initiated integrity
operations and arbitrary buffer alignment.
- Clean up bio allocation, splitting bio_alloc_bioset() into clear fast
and slow paths. Add bio_await() and bio_submit_or_kill() helpers,
unify synchronous bi_end_io callbacks.
- Fix zone write plug refcount handling and plug removal races. Add
support for serializing zone writes at QD=1 for rotational zoned
devices, yielding significant throughput improvements.
- Add SED-OPAL ioctls for Single User Mode management and a STACK_RESET
command.
- Add io_uring passthrough (uring_cmd) support to the BSG layer.
- Replace pp_buf in partition scanning with struct seq_buf.
- zloop improvements and cleanups.
- drbd genl cleanup, switching to pre_doit/post_doit.
- NVMe pull request via Keith:
- Fabrics authentication updates
- Enhanced block queue limits support
- Workqueue usage updates
- A new write zeroes device quirk
- Tagset cleanup fix for loop device
- MD pull requests via Yu Kuai:
- Fix raid5 soft lockup in retry_aligned_read()
- Fix raid10 deadlock with check operation and nowait requests
- Fix raid1 overlapping writes on writemostly disks
- Fix sysfs deadlock on array_state=clear
- Proactive RAID-5 parity building with llbitmap, with
write_zeroes_unmap optimization for initial sync
- Fix llbitmap barrier ordering, rdev skipping, and bitmap_ops
version mismatch fallback
- Fix bcache use-after-free and uninitialized closure
- Validate raid5 journal metadata payload size
- Various cleanups
- Various other fixes, improvements, and cleanups
* tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (146 commits)
ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
scsi: bsg: fix buffer overflow in scsi_bsg_uring_cmd()
block: refactor blkdev_zone_mgmt_ioctl
MAINTAINERS: update ublk driver maintainer email
Documentation: ublk: address review comments for SHMEM_ZC docs
ublk: allow buffer registration before device is started
ublk: replace xarray with IDA for shmem buffer index allocation
ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
ublk: verify all pages in multi-page bvec fall within registered range
ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
xfs: use bio_await in xfs_zone_gc_reset_sync
block: add a bio_submit_or_kill helper
block: factor out a bio_await helper
block: unify the synchronous bi_end_io callbacks
xfs: fix number of GC bvecs
selftests/ublk: add read-only buffer registration test
selftests/ublk: add filesystem fio verify test for shmem_zc
selftests/ublk: add hugetlbfs shmem_zc test for loop target
selftests/ublk: add shared memory zero-copy test
selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
...
Pull Landlock update from Mickaël Salaün:
"This adds a new Landlock access right for pathname UNIX domain sockets
thanks to a new LSM hook, and a few fixes"
* tag 'landlock-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux: (23 commits)
landlock: Document fallocate(2) as another truncation corner case
landlock: Document FS access right for pathname UNIX sockets
selftests/landlock: Simplify ruleset creation and enforcement in fs_test
selftests/landlock: Check that coredump sockets stay unrestricted
selftests/landlock: Audit test for LANDLOCK_ACCESS_FS_RESOLVE_UNIX
selftests/landlock: Test LANDLOCK_ACCESS_FS_RESOLVE_UNIX
selftests/landlock: Replace access_fs_16 with ACCESS_ALL in fs_test
samples/landlock: Add support for named UNIX domain socket restrictions
landlock: Clarify BUILD_BUG_ON check in scoping logic
landlock: Control pathname UNIX domain socket resolution by path
landlock: Use mem_is_zero() in is_layer_masks_allowed()
lsm: Add LSM hook security_unix_find
landlock: Fix kernel-doc warning for pointer-to-array parameters
landlock: Fix formatting in tsync.c
landlock: Improve kernel-doc "Return:" section consistency
landlock: Add missing kernel-doc "Return:" sections
selftests/landlock: Fix format warning for __u64 in net_test
selftests/landlock: Skip stale records in audit_match_record()
selftests/landlock: Drain stale audit records on init
selftests/landlock: Fix socket file descriptor leaks in audit helpers
...
Pull audit updates from Paul Moore:
- Improved handling of unknown status requests from userspace
The current kernel code ignores unknown/unused request bits sent from
userspace and returns an error code based on the results of the
request(s) it does understand. The patch from Ricardo fixes this so
that unknown requests return an -EINVAL to userspace, making
compatibility a bit easier moving forward.
- A number of small style and formatting cleanups
* tag 'audit-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: handle unknown status requests in audit_receive_msg()
audit: fix coding style issues
audit: remove redundant initialization of static variables to 0
audit: fix whitespace alignment in include/uapi/linux/audit.h
As noted in the blamed commit, the AR8035 and other PHYs from this
family advertise the Extended Next Page support by default, which may be
understood by some partners as this PHY being multi-gig capable.
The fix is to disable XNP advertising, which is done by setting bit 12
of the Auto-Negotiation Advertisement Register (MII_ADVERTISE).
The blamed commit incorrectly uses MDIO_AN_CTRL1_XNP, which is bit 13 as per
802.3 : 45.2.7.1 AN control register (Register 7.0)
BIT 12 in MII_ADVERTISE is wrapped by ADVERTISE_RESV, used by some
drivers such as the aquantia one. 802.3 Clause 28 defines bit 12 as
Extended Next Page ability, at least in recent versions of the standard.
Let's add a define for it and use it in the at803x driver.
Fixes: 3c51fa5d2a ("net: phy: ar803x: disable extended next page bit")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260410171021.1277138-1-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull clone and pidfs updates from Christian Brauner:
"Add three new clone3() flags for pidfd-based process lifecycle
management.
CLONE_AUTOREAP:
CLONE_AUTOREAP makes a child process auto-reap on exit without ever
becoming a zombie. This is a per-process property in contrast to
the existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for
SIGCHLD which applies to all children of a given parent.
Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped
property affecting all children which makes it unsuitable for
libraries or applications that need selective auto-reaping of
specific children while still being able to wait() on others.
CLONE_AUTOREAP stores an autoreap flag in the child's
signal_struct. When the child exits do_notify_parent() checks this
flag and causes exit_notify() to transition the task directly to
EXIT_DEAD. Since the flag lives on the child it survives
reparenting: if the original parent exits and the child is
reparented to a subreaper or init the child still auto-reaps when
it eventually exits. This is cleaner than forcing the subreaper to
get SIGCHLD and then reaping it. If the parent doesn't care the
subreaper won't care. If there's a subreaper that would care it
would be easy enough to add a prctl() that either just turns back
on SIGCHLD and turns off auto-reaping or a prctl() that just
notifies the subreaper whenever a child is reparented to it.
CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent
to monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern. No exit signal is delivered so exit_signal must be zero.
CLONE_THREAD and CLONE_PARENT are rejected: CLONE_THREAD because
autoreap is a process-level property, and CLONE_PARENT because an
autoreap child reparented via CLONE_PARENT could become an
invisible zombie under a parent that never calls wait().
The flag is not inherited by the autoreap process's own children.
Each child that should be autoreaped must be explicitly created
with CLONE_AUTOREAP.
CLONE_NNP:
CLONE_NNP sets no_new_privs on the child at clone time. Unlike
prctl(PR_SET_NO_NEW_PRIVS) which a process sets on itself,
CLONE_NNP allows the parent to impose no_new_privs on the child at
creation without affecting the parent's own privileges.
CLONE_THREAD is rejected because threads share credentials.
CLONE_NNP is useful on its own for any spawn-and-sandbox pattern
but was specifically introduced to enable unprivileged usage of
CLONE_PIDFD_AUTOKILL.
CLONE_PIDFD_AUTOKILL:
This flag ties a child's lifetime to the pidfd returned from
clone3(). When the last reference to the struct file created by
clone3() is closed the kernel sends SIGKILL to the child. A pidfd
obtained via pidfd_open() for the same process does not keep the
child alive and does not trigger autokill - only the specific
struct file from clone3() has this property. This is useful for
container runtimes, service managers, and sandboxed subprocess
execution - any scenario where the child must die if the parent
crashes or abandons the pidfd or just wants a throwaway helper
process.
CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP.
It requires CLONE_PIDFD because the whole point is tying the
child's lifetime to the pidfd. It requires CLONE_AUTOREAP because a
killed child with no one to reap it would become a zombie - the
primary use case is the parent crashing or abandoning the pidfd so
no one is around to call waitpid(). CLONE_THREAD is rejected
because autokill targets a process not a thread.
If CLONE_NNP is specified together with CLONE_PIDFD_AUTOKILL an
unprivileged user may spawn a process that is autokilled. The child
cannot escalate privileges via setuid/setgid exec after being
spawned. If CLONE_PIDFD_AUTOKILL is specified without CLONE_NNP the
caller must have have CAP_SYS_ADMIN in its user namespace"
* tag 'vfs-7.1-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: check pidfd_info->coredump_code correctness
pidfds: add coredump_code field to pidfd_info
kselftest/coredump: reintroduce null pointer dereference
selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests
selftests/pidfd: add CLONE_NNP tests
selftests/pidfd: add CLONE_AUTOREAP tests
pidfd: add CLONE_PIDFD_AUTOKILL
clone: add CLONE_NNP
clone: add CLONE_AUTOREAP
Pull kvm fixes from Paolo Bonzini:
"s390:
- vsie: Fix races with partial gmap invalidations
x86:
- Use __DECLARE_FLEX_ARRAY() for UAPI structures with VLAs"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: s390: vsie: Fix races with partial gmap invalidations
KVM: x86: Use __DECLARE_FLEX_ARRAY() for UAPI structures with VLAs
KVM x86 fixes for 7.1
Declare flexible arrays in uAPI structures using __DECLARE_FLEX_ARRAY() so
that KVM's uAPI headers can be included in C++ projects.
Pull RISC-V updates from Paul Walmsley:
"Before v7.0 is released, fix a few issues with the CFI patchset,
merged earlier in v7.0-rc, that primarily affect interfaces to
non-kernel code:
- Improve the prctl() interface for per-task indirect branch landing
pad control to expand abbreviations and to resemble the speculation
control prctl() interface
- Expand the "LP" and "SS" abbreviations in the ptrace uapi header
file to "branch landing pad" and "shadow stack", to improve
readability
- Fix a typo in a CFI-related macro name in the ptrace uapi header
file
- Ensure that the indirect branch tracking state and shadow stack
state are unlocked immediately after an exec() on the new task so
that libc subsequently can control it
- While working in this area, clean up the kernel-internal,
cross-architecture prctl() function names by expanding the
abbreviations mentioned above"
* tag 'riscv-for-linus-v7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
prctl: cfi: change the branch landing pad prctl()s to be more descriptive
riscv: ptrace: cfi: expand "SS" references to "shadow stack" in uapi headers
prctl: rename branch landing pad implementation functions to be more explicit
riscv: ptrace: expand "LP" references to "branch landing pads" in uapi headers
riscv: cfi: clear CFI lock status in start_thread()
riscv: ptrace: cfi: fix "PRACE" typo in uapi header
The bridge-stp usermode helper is currently restricted to the initial
network namespace, preventing userspace STP daemons (e.g. mstpd) from
operating on bridges in other network namespaces. Since commit
ff62198553 ("bridge: Only call /sbin/bridge-stp for the initial
network namespace"), bridges in non-init namespaces silently fall
back to kernel STP with no way to use userspace STP.
Add a new bridge attribute IFLA_BR_STP_MODE that allows explicit
per-bridge control over STP mode selection:
BR_STP_MODE_AUTO (default) - Existing behavior: invoke the
/sbin/bridge-stp helper in init_net only; fall back to kernel STP
if it fails or in non-init namespaces.
BR_STP_MODE_USER - Directly enable userspace STP (BR_USER_STP)
without invoking the helper. Works in any network namespace.
Userspace is responsible for ensuring an STP daemon manages the
bridge.
BR_STP_MODE_KERNEL - Directly enable kernel STP (BR_KERNEL_STP)
without invoking the helper.
The mode can only be changed while STP is disabled, or set to the
same value (-EBUSY otherwise). IFLA_BR_STP_MODE is processed before
IFLA_BR_STP_STATE in br_changelink(), so both can be set atomically
in a single netlink message. The mode can also be changed in the
same message that disables STP.
The stp_mode struct field is u8 since all possible values fit, while
NLA_U32 is used for the netlink attribute since it occupies the same
space in the netlink message as NLA_U8.
A new stp_helper_active boolean tracks whether the /sbin/bridge-stp
helper was invoked during br_stp_start(), so that br_stp_stop() only
calls the helper for stop when it was called for start. This avoids
calling the helper asymmetrically when stp_mode changes between
start and stop.
Suggested-by: Ido Schimmel <idosch@nvidia.com>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Andy Roulin <aroulin@nvidia.com>
Link: https://patch.msgid.link/20260405205224.3163000-2-aroulin@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
netkit: Support for io_uring zero-copy and AF_XDP
Containers use virtual netdevs to route traffic from a physical netdev
in the host namespace. They do not have access to the physical netdev
in the host and thus can't use memory providers or AF_XDP that require
reconfiguring/restarting queues in the physical netdev.
This patchset adds the concept of queue leasing to virtual netdevs that
allow containers to use memory providers and AF_XDP at native speed.
Leased queues are bound to a real queue in a physical netdev and act
as a proxy.
Memory providers and AF_XDP operations take an ifindex and queue id,
so containers would pass in an ifindex for a virtual netdev and a queue
id of a leased queue, which then gets proxied to the underlying real
queue.
We have implemented support for this concept in netkit and tested the
latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
(bnxt_en) 100G NICs. For more details see the individual patches.
====================
Link: https://patch.msgid.link/20260402231031.447597-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a single device mode for netkit instead of netkit pairs. The primary
target for the paired devices is to connect network namespaces, of course,
and support has been implemented in projects like Cilium [0]. For the rxq
leasing the plan is to support two main scenarios related to single device
mode:
* For the use-case of io_uring zero-copy, the control plane can either
set up a netkit pair where the peer device can perform rxq leasing which
is then tied to the lifetime of the peer device, or the control plane
can use a regular netkit pair to connect the hostns to a Pod/container
and dynamically add/remove rxq leasing through a single device without
having to interrupt the device pair. In the case of io_uring, the memory
pool is used as skb non-linear pages, and thus the skb will go its way
through the regular stack into netkit. Things like the netkit policy when
no BPF is attached or skb scrubbing etc apply as-is in case the paired
devices are used, or if the backend memory is tied to the single device
and traffic goes through a paired device.
* For the use-case of AF_XDP, the control plane needs to use netkit in the
single device mode. The single device mode currently enforces only a
pass policy when no BPF is attached, and does not yet support BPF link
attachments for AF_XDP. skbs sent to that device get dropped at the
moment. Given AF_XDP operates at a lower layer of the stack tying this
to the netkit pair did not make sense. In future, the plan is to allow
BPF at the XDP layer which can: i) process traffic coming from the AF_XDP
application (e.g. QEMU with AF_XDP backend) to filter egress traffic or
to push selected egress traffic up to the single netkit device to the
local stack (e.g. DHCP requests), and ii) vice-versa skbs sent to the
single netkit into the AF_XDP application (e.g. DHCP replies). Also,
the control-plane can dynamically manage rxq leasing for the single
netkit device without having to interrupt (e.g. down/up cycle) the main
netkit pair for the Pod which has traffic going in and out.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Jordan Rife <jordan@jrife.io>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode [0]
Link: https://patch.msgid.link/20260402231031.447597-11-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a ynl netdev family operation called queue-create that creates a
new queue on a netdevice:
name: queue-create
attribute-set: queue
flags: [admin-perm]
do:
request:
attributes:
- ifindex
- type
- lease
reply: &queue-create-op
attributes:
- id
This is a generic operation such that it can be extended for various
use cases in future. Right now it is mandatory to specify ifindex,
the queue type which is enforced to rx and a lease. The newly created
queue id is returned to the caller.
A queue from a virtual device can have a lease which refers to another
queue from a physical device. This is useful for memory providers
and AF_XDP operations which take an ifindex and queue id to allow
applications to bind against virtual devices in containers. The lease
couples both queues together and allows to proxy the operations from
a virtual device in a container to the physical device.
In future, the nested lease attribute can be lifted and made optional
for other use-cases such as dynamic queue creation for physical
netdevs. The lack of lease and the specification of the physical
device as an ifindex will imply that we need a real queue to be
allocated. Similarly, the queue type enforcement to rx can then be
lifted as well to support tx.
An early implementation had only driver-specific integration [0], but
in order for other virtual devices to reuse, it makes sense to have
this as a generic API in core net.
For leasing queues, the virtual netdev must have real_num_rx_queues
less than num_rx_queues at the time of calling queue-create. The
queue-type must be rx as only rx queues are supported for leasing
for now. We also enforce that the queue-create ifindex must point
to a virtual device, and that the nested lease attribute's ifindex
must point to a physical device. The nested lease attribute set
contains a netns-id attribute which is optional and can specify a
netns-id relative to the caller's netns. It requires cap_net_admin
and if the netns-id attribute is not specified, the lease ifindex
will be retrieved from the current netns. Also, it is modeled as
an s32 type similarly as done elsewhere in the stack.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
Link: https://patch.msgid.link/20260402231031.447597-2-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The __u32 len field cannot represent a 4GB buffer (0x100000000
overflows to 0). Change it to __u64 so buffers up to 4GB can be
registered. Add a reserved field for alignment and validate it
is zero.
The kernel enforces a default max of 4GB (UBLK_SHMEM_BUF_SIZE_MAX)
which may be increased in future.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-2-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow filtering the resource dump to device-level or port-level
resources using the 'scope' option.
Example - dump only device-level resources:
$ devlink resource show scope dev
pci/0000:03:00.0:
name max_local_SFs size 128 unit entry dpipe_tables none
name max_external_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.1:
name max_local_SFs size 128 unit entry dpipe_tables none
name max_external_SFs size 128 unit entry dpipe_tables none
Example - dump only port-level resources:
$ devlink resource show scope port
pci/0000:03:00.0/196608:
name max_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.0/196609:
name max_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.1/196708:
name max_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.1/196709:
name max_SFs size 128 unit entry dpipe_tables none
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull HID fixes from Jiri Kosina:
- handling of new keycodes for contextual AI usages (Akshai Murari)
- fix for UAF in hid-roccat (Benoît Sevens)
- deduplication of error logging in amd_sfh (Maximilian Pezzullo)
- various device-specific quirks and device ID additions (Even Xu, Lode
Willems, Leo Vriska)
* tag 'hid-for-linus-2026040801' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
Input: add keycodes for contextual AI usages (HUTRR119)
HID: Kysona: Add support for VXE Dragonfly R1 Pro
HID: amd_sfh: don't log error when device discovery fails with -EOPNOTSUPP
HID: quirks: add HID_QUIRK_ALWAYS_POLL for 8BitDo Pro 3
HID: roccat: fix use-after-free in roccat_report_event
HID: Intel-thc-hid: Intel-quickspi: Add NVL Device IDs
HID: Intel-thc-hid: Intel-quicki2c: Add NVL Device IDs
Should have no effect in practice; all of these use the
nft_parse_register_load/store apis which is mandatory anyway due
to the need to further validate the register load/store, e.g.
that the size argument doesn't result in out-of-bounds load/store.
OTOH this is a simple method to reject obviously wrong input
at earlier stage.
Signed-off-by: Florian Westphal <fw@strlen.de>
Introduce checks for FREE_SPACE_INFO item, which include:
- Key alignment check
The objectid is the logical bytenr of the chunk/bg, and offset is the
length of the chunk/bg, thus they should all be aligned to the fs
block size.
- Item size check
The FREE_SPACE_INFO should a fix size.
- Flags check
The flags member should have no other flags than
BTRFS_FREE_SPACE_USING_BITMAPS.
For future expansion, introduce a new macro
BTRFS_FREE_SPACE_FLAGS_MASK for such checks.
And since we're here, the BTRFS_FREE_SPACE_USING_BITMAPS should not
use unsigned long long, as the flags is only 32 bits wide.
So fix that to use unsigned long.
- Extent count check
That member shows how many free space bitmap/extent items there are
inside the chunk/bg.
We know the chunk size (from key->offset), thus there should be at
most (key->offset >> sectorsize_bits) blocks inside the chunk.
Use that value as the upper limit and if that counter is larger than
that, there is a high chance it's a bitflip in high bits.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* Add a new access right LANDLOCK_ACCESS_FS_RESOLVE_UNIX, which
controls the lookup operations for named UNIX domain sockets. The
resolution happens during connect() and sendmsg() (depending on
socket type).
* Change access_mask_t from u16 to u32 (see below)
* Hook into the path lookup in unix_find_bsd() in af_unix.c, using a
LSM hook. Make policy decisions based on the new access rights
* Increment the Landlock ABI version.
* Minor test adaptations to keep the tests working.
* Document the design rationale for scoped access rights,
and cross-reference it from the header documentation.
With this access right, access is granted if either of the following
conditions is met:
* The target socket's filesystem path was allow-listed using a
LANDLOCK_RULE_PATH_BENEATH rule, *or*:
* The target socket was created in the same Landlock domain in which
LANDLOCK_ACCESS_FS_RESOLVE_UNIX was restricted.
In case of a denial, connect() and sendmsg() return EACCES, which is
the same error as it is returned if the user does not have the write
bit in the traditional UNIX file system permissions of that file.
The access_mask_t type grows from u16 to u32 to make space for the new
access right. This also doubles the size of struct layer_access_masks
from 32 byte to 64 byte. To avoid memory layout inconsistencies between
architectures (especially m68k), pack and align struct access_masks [2].
Document the (possible future) interaction between scoped flags and
other access rights in struct landlock_ruleset_attr, and summarize the
rationale, as discussed in code review leading up to [3].
This feature was created with substantial discussion and input from
Justin Suess, Tingmao Wang and Mickaël Salaün.
Cc: Tingmao Wang <m@maowtm.org>
Cc: Justin Suess <utilityemal77@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Jann Horn <jannh@google.com>
Link[1]: https://github.com/landlock-lsm/linux/issues/36
Link[2]: https://lore.kernel.org/all/20260401.Re1Eesu1Yaij@digikod.net/
Link[3]: https://lore.kernel.org/all/20260205.8531e4005118@gnoack.org/
Signed-off-by: Günther Noack <gnoack3000@gmail.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20260327164838.38231-5-gnoack3000@gmail.com
[mic: Fix kernel-doc formatting, pack and align access_masks]
Signed-off-by: Mickaël Salaün <mic@digikod.net>
LANDLOCK_RESTRICT_SELF_TSYNC does not allow
LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF with ruleset_fd=-1, preventing
a multithreaded process from atomically propagating subdomain log muting
to all threads without creating a domain layer. Relax the fd=-1
condition to accept TSYNC alongside LOG_SUBDOMAINS_OFF, and update the
documentation accordingly.
Add flag validation tests for all TSYNC combinations with ruleset_fd=-1,
and audit tests verifying both transition directions: muting via TSYNC
(logged to not logged) and override via TSYNC (not logged to logged).
Cc: Günther Noack <gnoack@google.com>
Cc: stable@vger.kernel.org
Fixes: 42fc7e6543 ("landlock: Multithreading support for landlock_restrict_self()")
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260407164107.2012589-2-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Add UBLK_F_SHMEM_ZC (1ULL << 19) to the UAPI header and UBLK_F_ALL.
Switch ublk_support_shmem_zc() and ublk_dev_support_shmem_zc() from
returning false to checking the actual flag, enabling the shared
memory zero-copy feature for devices that request it.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-4-ming.lei@redhat.com
[axboe: ublk_buf_reg -> ublk_shmem_buf_reg errors]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add control commands for registering and unregistering shared memory
buffers for zero-copy I/O:
- UBLK_U_CMD_REG_BUF (0x18): pins pages from userspace, inserts PFN
ranges into a per-device maple tree for O(log n) lookup during I/O.
Buffer pointers are tracked in a per-device xarray. Returns the
assigned buffer index.
- UBLK_U_CMD_UNREG_BUF (0x19): removes PFN entries and unpins pages.
Queue freeze/unfreeze is handled internally so userspace need not
quiesce the device during registration.
Also adds:
- UBLK_IO_F_SHMEM_ZC flag and addr encoding helpers in UAPI header
(16-bit buffer index supporting up to 65536 buffers)
- Data structures (ublk_buf, ublk_buf_range) and xarray/maple tree
- __ublk_ctrl_reg_buf() helper for PFN insertion with error unwinding
- __ublk_ctrl_unreg_buf() helper for cleanup reuse
- ublk_support_shmem_zc() / ublk_dev_support_shmem_zc() stubs
(returning false — feature not enabled yet)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-2-ming.lei@redhat.com
[axboe: fixup ublk_buf_reg -> ublk_shmem_buf_reg errors, comments]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Per Linus' comments requesting the replacement of "INDIR_BR_LP" in the
indirect branch tracking prctl()s with something more readable, and
suggesting the use of the speculation control prctl()s as an exemplar,
reimplement the prctl()s and related constants that control per-task
forward-edge control flow integrity.
This primarily involves two changes. First, the prctls are
restructured to resemble the style of the speculative execution
workaround control prctls PR_{GET,SET}_SPECULATION_CTRL, to make them
easier to extend in the future. Second, the "indir_br_lp" abbrevation
is expanded to "branch_landing_pads" to be less telegraphic. The
kselftest and documentation is adjusted accordingly.
Link: https://lore.kernel.org/linux-riscv/CAHk-=whhSLGZAx3N5jJpb4GLFDqH_QvS07D+6BnkPWmCEzTAgw@mail.gmail.com/
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Add DPLL_A_FREQUENCY_MONITOR device attribute to allow control over
the frequency monitor feature. The attribute uses the existing
dpll_feature_state enum (enable/disable) and is present in both
device-get reply and device-set request.
Add DPLL_A_PIN_MEASURED_FREQUENCY pin attribute to expose the measured
input frequency in millihertz (mHz). The attribute is present in the
pin-get reply. Add DPLL_PIN_MEASURED_FREQUENCY_DIVIDER constant to
allow userspace to extract integer and fractional parts.
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260402184057.1890514-2-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The following fix in sched/urgent:
e08d007f9d ("sched/debug: Fix avg_vruntime() usage")
is in conflict with this pending commit in sched/core:
4823725d9d ("sched/fair: Increase weight bits for avg_vruntime")
Both modify the same variable definition and initialization blocks,
resolve it by merging the two.
Conflicts:
kernel/sched/debug.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allow creating a zcrx instance without attaching it to a net device.
All data will be copied through the fallback path. The user is also
expected to use ZCRX_CTRL_FLUSH_RQ to handle overflows as it normally
should even with a netdev, but it becomes even more relevant as there
will likely be no one to automatically pick up buffers.
Apart from that, it follows the zcrx uapi for the I/O path, and is
useful for testing, experimentation, and potentially for the copy
receive path in the future if improved.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/674f8ad679c5a0bc79d538352b3042cf0999596e.1774261953.git.asml.silence@gmail.com
[axboe: fix spelling error in uapi header and commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Clarify bpf_ringbuf_discard() documentation for BPF_RB_NO_WAKEUP.
Discarded ring buffer records are still left in the ring buffer and are
only skipped when user space consumes them. This can matter when
BPF_RB_NO_WAKEUP is used: a later submit relying on adaptive wakeup
might not wake the consumer, because the discarded record still needs to
be consumed first.
Scenario:
epoll_wait(rb_fd); // blocks
rec = bpf_ringbuf_reserve(&rb, ...);
bpf_ringbuf_discard(rec, BPF_RB_NO_WAKEUP);
rec = bpf_ringbuf_reserve(&rb, ...);
bpf_ringbuf_submit(rec, 0); // valid record, but no wakeup
Document this in bpf_ringbuf_discard() to make the interaction between
discarded records, user-space consumption, and adaptive wakeups explicit.
Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260331130612.3762433-1-eyal.birger@gmail.com
----
v2: adapt wording per feedback from Andrii.
The TCG Opal device could enter a state where no new session can be
created, blocking even Discovery or PSID reset. While a power cycle
or waiting for the timeout should work, there is another possibility
for recovery: using the Stack Reset command.
The Stack Reset command is defined in the TCG Storage Architecture Core
Specification and is mandatory for all Opal devices (see Section 3.3.6
of the Opal SSC specification).
This patch implements the Stack Reset command. Sending it should clear
all active sessions immediately, allowing subsequent commands to run
successfully. While it is a TCG transport layer command, the Linux
kernel implements only Opal ioctls, so it makes sense to use the
IOC_OPAL ioctl interface.
The Stack Reset takes no arguments; the response can be success or pending.
If the command reports a pending state, userspace can try to repeat it;
in this case, the code returns -EBUSY.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Ondrej Kozina <okozina@redhat.com>
Link: https://patch.msgid.link/20260310095349.411287-1-gmazyland@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
HUTRR119 introduces new usages for keys intended to invoke AI agents
based on the current context. These are useful with the increasing
number of operating systems with integrated Large Language Models
Add new key definitions for KEY_ACTION_ON_SELECTION,
KEY_CONTEXTUAL_INSERT and KEY_CONTEXTUAL_QUERY
Signed-off-by: Akshai Murari <akshaim@google.com>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
Johannes Berg says:
====================
A fairly big set of changes all over, notably with:
- cfg80211: new APIs for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware
- mt76:
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- iwlwifi: UNII-9 and continuing UHR work
* tag 'wireless-next-2026-03-26' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (230 commits)
wifi: mac80211: ignore reserved bits in reconfiguration status
wifi: cfg80211: allow protected action frame TX for NAN
wifi: ieee80211: Add some missing NAN definitions
wifi: nl80211: Add a notification to notify NAN channel evacuation
wifi: nl80211: add NL80211_CMD_NAN_ULW_UPDATE notification
wifi: nl80211: allow reporting spurious NAN Data frames
wifi: cfg80211: allow ToDS=0/FromDS=0 data frames on NAN data interfaces
wifi: nl80211: define an API for configuring the NAN peer's schedule
wifi: nl80211: add support for NAN stations
wifi: cfg80211: separately store HT, VHT and HE capabilities for NAN
wifi: cfg80211: add support for NAN data interface
wifi: cfg80211: make sure NAN chandefs are valid
wifi: cfg80211: Add an API to configure local NAN schedule
wifi: mac80211: cleanup error path of ieee80211_do_open
wifi: mac80211: extract channel logic from link logic
wifi: iwlwifi: mld: set RX_FLAG_RADIOTAP_TLV_AT_END generically
wifi: iwlwifi: reduce the number of prints upon firmware crash
wifi: iwlwifi: fix the description of SESSION_PROTECTION_CMD
wifi: iwlwifi: mld: introduce iwl_mld_vif_fw_id_valid
wifi: iwlwifi: mld: block EMLSR during TDLS connections
...
====================
Link: https://patch.msgid.link/20260326152021.305959-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace manual range and mask validations with netlink policy
annotations in ctnetlink code paths, so that the netlink core rejects
invalid values early and can generate extack errors.
- CTA_PROTOINFO_TCP_STATE: reject values > TCP_CONNTRACK_SYN_SENT2 at
policy level, removing the manual >= TCP_CONNTRACK_MAX check.
- CTA_PROTOINFO_TCP_WSCALE_ORIGINAL/REPLY: reject values > TCP_MAX_WSCALE
(14). The normal TCP option parsing path already clamps to this value,
but the ctnetlink path accepted 0-255, causing undefined behavior when
used as a u32 shift count.
- CTA_FILTER_ORIG_FLAGS/REPLY_FLAGS: use NLA_POLICY_MASK with
CTA_FILTER_F_ALL, removing the manual mask checks.
- CTA_EXPECT_FLAGS: use NLA_POLICY_MASK with NF_CT_EXPECT_MASK, adding
a new mask define grouping all valid expect flags.
Extracted from a broader nf-next patch by Florian Westphal, scoped to
ctnetlink for the fixes tree.
Fixes: c8e2078cfe ("[NETFILTER]: ctnetlink: add support for internal tcp connection tracking flags handling")
Signed-off-by: David Carlier <devnexen@gmail.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are 2 types of logical links with a NAN peer:
- management (NMI), which is used for Tx/Rx of NAN management frames.
- data (NDI), which is used for Tx/Rx of data frames, or non-NAN
management frames.
The NMI station has two roles:
- representation of the NAN peer - for example, the peer's schedule
and the HT, VHT, HE capabilities - belong to the NMI station, and not to
the NDI ones.
- Tx/Rx of NAN management frames to/from the peer.
The NDI station is used for Tx/Rx data frames of a specific NDP that was
established with the NAN peer.
Note that a peer can choose to reuse its NMI address as the NDI address.
In that case, it is expected that two stations will be added even though
they will have the same address.
- An NDI station can only be added after the corresponding NMI station
was configured with capabilities.
- All the NDI stations will be removed before the NDI interface is brought
down.
- All NMI stations will be removed before NAN is stopped.
- Before NMI sta removal, all corresponding NDI stations will be removed
Add support for adding, removing, and changing NMI and NDI stations.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.d280936ee832.I6d859eee759bb5824a9ffd2984410faf879ba00e@changeid
Link: https://patch.msgid.link/20260318123926.206536-6-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In NAN, unlike in other modes, there is only one set of (HT, VHT, HE)
capabilities that is used for all channels (and bands) used in the NAN
data path.
This set of capabilities will have to be a special one, for example - have
the minimum of (HT-for-5 GHz, HT-for-2.4 GHz), careful handling of the
bits that have a different meaning for each band, etc.
While we could use the exiting sband/iftype capabilities, and require
identical capabilities for all bands (makes no sense since this means
that we will have VHT capabilities in the 2.4 GHz slot),
or require that only one of the sbands will be set,
or have logic to extract the minimum and handle the conflicting bits -
it seems simpler to add a dedicated set of capabilities which is special
for NAN, and is band agnostic, to be populated by the driver.
That way we also let the driver decide how it wants to handle the
conflicting bits.
Add this special set of these capabilities to wiphy:nan_capabilities, to be
populated by the driver.
Send it to user space.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.4b6f3e4a81b4.I45422adc0df3ad4101d857a92e83f0de5cf241e1@changeid
Link: https://patch.msgid.link/20260318123926.206536-5-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>