Commit Graph

2103 Commits

Author SHA1 Message Date
Linus Torvalds
91a4855d6c Merge tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Support HW queue leasing, allowing containers to be granted access
     to HW queues for zero-copy operations and AF_XDP

   - Number of code moves to help the compiler with inlining. Avoid
     output arguments for returning drop reason where possible

   - Rework drop handling within qdiscs to include more metadata about
     the reason and dropping qdisc in the tracepoints

   - Remove the rtnl_lock use from IP Multicast Routing

   - Pack size information into the Rx Flow Steering table pointer
     itself. This allows making the table itself a flat array of u32s,
     thus making the table allocation size a power of two

   - Report TCP delayed ack timer information via socket diag

   - Add ip_local_port_step_width sysctl to allow distributing the
     randomly selected ports more evenly throughout the allowed space

   - Add support for per-route tunsrc in IPv6 segment routing

   - Start work of switching sockopt handling to iov_iter

   - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
     buffer size drifting up

   - Support MSG_EOR in MPTCP

   - Add stp_mode attribute to the bridge driver for STP mode selection.
     This addresses concerns about call_usermodehelper() usage

   - Remove UDP-Lite support (as announced in 2023)

   - Remove support for building IPv6 as a module. Remove the now
     unnecessary function calling indirection

  Cross-tree stuff:

   - Move Michael MIC code from generic crypto into wireless, it's
     considered insecure but some WiFi networks still need it

  Netfilter:

   - Switch nft_fib_ipv6 module to no longer need temporary dst_entry
     object allocations by using fib6_lookup() + RCU.

     Florian W reports this gets us ~13% higher packet rate

   - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
     switch the service tables to be per-net. Convert some code that
     walks the service lists to use RCU instead of the service_mutex

   - Add more opinionated input validation to lower security exposure

   - Make IPVS hash tables to be per-netns and resizable

  Wireless:

   - Finished assoc frame encryption/EPPKE/802.1X-over-auth

   - Radar detection improvements

   - Add 6 GHz incumbent signal detection APIs

   - Multi-link support for FILS, probe response templates and client
     probing

   - New APIs and mac80211 support for NAN (Neighbor Aware Networking,
     aka Wi-Fi Aware) so less work must be in firmware

  Driver API:

   - Add numerical ID for devlink instances (to avoid having to create
     fake bus/device pairs just to have an ID). Support shared devlink
     instances which span multiple PFs

   - Add standard counters for reporting pause storm events (implement
     in mlx5 and fbnic)

   - Add configuration API for completion writeback buffering (implement
     in mana)

   - Support driver-initiated change of RSS context sizes

   - Support DPLL monitoring input frequency (implement in zl3073x)

   - Support per-port resources in devlink (implement in mlx5)

  Misc:

   - Expand the YAML spec for Netfilter

  Drivers

   - Software:
      - macvlan: support multicast rx for bridge ports with shared
        source MAC address
      - team: decouple receive and transmit enablement for IEEE 802.3ad
        LACP "independent control"

   - Ethernet high-speed NICs:
      - nVidia/Mellanox:
         - support high order pages in zero-copy mode (for payload
           coalescing)
         - support multiple packets in a page (for systems with 64kB
           pages)
      - Broadcom 25-400GE (bnxt):
         - implement XDP RSS hash metadata extraction
         - add software fallback for UDP GSO, lowering the IOMMU cost
      - Broadcom 800GE (bnge):
         - add link status and configuration handling
         - add various HW and SW statistics
      - Marvell/Cavium:
         - NPC HW block support for cn20k
      - Huawei (hinic3):
         - add mailbox / control queue
         - add rx VLAN offload
         - add driver info and link management

   - Ethernet NICs:
      - Marvell/Aquantia:
         - support reading SFP module info on some AQC100 cards
      - Realtek PCI (r8169):
         - add support for RTL8125cp
      - Realtek USB (r8152):
         - support for the RTL8157 5Gbit chip
         - add 2500baseT EEE status/configuration support

   - Ethernet NICs embedded and off-the-shelf IP:
      - Synopsys (stmmac):
         - cleanup and reorganize SerDes handling and PCS support
         - cleanup descriptor handling and per-platform data
         - cleanup and consolidate MDIO defines and handling
         - shrink driver memory use for internal structures
         - improve Tx IRQ coalescing
         - improve TCP segmentation handling
         - add support for Spacemit K3
      - Cadence (macb):
         - support PHYs that have inband autoneg disabled with GEM
         - support IEEE 802.3az EEE
         - rework usrio capabilities and handling
      - AMD (xgbe):
         - improve power management for S0i3
         - improve TX resilience for link-down handling

   - Virtual:
      - Google cloud vNIC:
         - support larger ring sizes in DQO-QPL mode
         - improve HW-GRO handling
         - support UDP GSO for DQO format
      - PCIe NTB:
         - support queue count configuration

   - Ethernet PHYs:
      - automatically disable PHY autonomous EEE if MAC is in charge
      - Broadcom:
         - add BCM84891/BCM84892 support
      - Micrel:
         - support for LAN9645X internal PHY
      - Realtek:
         - add RTL8224 pair order support
         - support PHY LEDs on RTL8211F-VD
         - support spread spectrum clocking (SSC)
      - Maxlinear:
         - add PHY-level statistics via ethtool

   - Ethernet switches:
      - Maxlinear (mxl862xx):
         - support for bridge offloading
         - support for VLANs
         - support driver statistics

   - Bluetooth:
      - large number of fixes and new device IDs
      - Mediatek:
         - support MT6639 (MT7927)
         - support MT7902 SDIO

   - WiFi:
      - Intel (iwlwifi):
         - UNII-9 and continuing UHR work
      - MediaTek (mt76):
         - mt7996/mt7925 MLO fixes/improvements
         - mt7996 NPU support (HW eth/wifi traffic offload)
      - Qualcomm (ath12k):
         - monitor mode support on IPQ5332
         - basic hwmon temperature reporting
         - support IPQ5424
      - Realtek:
         - add USB RX aggregation to improve performance
         - add USB TX flow control by tracking in-flight URBs

   - Cellular:
      - IPA v5.2 support"

* tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits)
  net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
  wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
  wireguard: allowedips: remove redundant space
  tools: ynl: add sample for wireguard
  wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
  MAINTAINERS: Add netkit selftest files
  selftests/net: Add additional test coverage in nk_qlease
  selftests/net: Split netdevsim tests from HW tests in nk_qlease
  tools/ynl: Make YnlFamily closeable as a context manager
  net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
  net: airoha: Fix VIP configuration for AN7583 SoC
  net: caif: clear client service pointer on teardown
  net: strparser: fix skb_head leak in strp_abort_strp()
  net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()
  selftests/bpf: add test for xdp_master_redirect with bond not up
  net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
  net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
  sctp: disable BH before calling udp_tunnel_xmit_skb()
  sctp: fix missing encap_port propagation for GSO fragments
  net: airoha: Rely on net_device pointer in ETS callbacks
  ...
2026-04-14 18:36:10 -07:00
Linus Torvalds
23acda7c22 Merge tag 'for-7.1/io_uring-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:

 - Add a callback driven main loop for io_uring, and BPF struct_ops
   on top to allow implementing custom event loop logic

 - Decouple IOPOLL from being a ring-wide all-or-nothing setting,
   allowing IOPOLL use cases to also issue certain white listed
   non-polled opcodes

 - Timeout improvements. Migrate internal timeout storage from
   timespec64 to ktime_t for simpler arithmetic and avoid copying of
   timespec data

 - Zero-copy receive (zcrx) updates:

      - Add a device-less mode (ZCRX_REG_NODEV) for testing and
        experimentation where data flows through the copy fallback path

      - Fix two-step unregistration regression, DMA length calculations,
        xarray mark usage, and a potential 32-bit overflow in id
        shifting

      - Refactoring toward multi-area support: dedicated refill queue
        struct, consolidated DMA syncing, netmem array refilling format,
        and guard-based locking

 - Zero-copy transmit (zctx) cleanup:

      - Unify io_send_zc() and io_sendmsg_zc() into a single function

      - Add vectorized registered buffer send for IORING_OP_SEND_ZC

      - Add separate notification user_data via sqe->addr3 so
        notification and completion CQEs can be distinguished without
        extra reference counting

 - Switch struct io_ring_ctx internal bitfields to explicit flag bits
   with atomic-safe accessors, and annotate the known harmless races on
   those flags

 - Various optimizations caching ctx and other request fields in local
   variables to avoid repeated loads, and cleanups for tctx setup, ring
   fd registration, and read path early returns

* tag 'for-7.1/io_uring-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (58 commits)
  io_uring: unify getting ctx from passed in file descriptor
  io_uring/register: don't get a reference to the registered ring fd
  io_uring/tctx: clean up __io_uring_add_tctx_node() error handling
  io_uring/tctx: have io_uring_alloc_task_context() return tctx
  io_uring/timeout: use 'ctx' consistently
  io_uring/rw: clean up __io_read() obsolete comment and early returns
  io_uring/zcrx: use correct mmap off constants
  io_uring/zcrx: use dma_len for chunk size calculation
  io_uring/zcrx: don't clear not allocated niovs
  io_uring/zcrx: don't use mark0 for allocating xarray
  io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring()
  io_uring/zcrx: reject REG_NODEV with large rx_buf_size
  io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP
  io_uring/rsrc: use io_cache_free() to free node
  io_uring/zcrx: rename zcrx [un]register functions
  io_uring/zcrx: check ctrl op payload struct sizes
  io_uring/zcrx: cache fallback availability in zcrx ctx
  io_uring/zcrx: warn on a repeated area append
  io_uring/zcrx: consolidate dma syncing
  io_uring/zcrx: netmem array as refiling format
  ...
2026-04-13 16:22:30 -07:00
Linus Torvalds
ef3da345cc Merge tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "Features:
   - coredump: add tracepoint for coredump events
   - fs: hide file and bfile caches behind runtime const machinery

  Fixes:
   - fix architecture-specific compat_ftruncate64 implementations
   - dcache: Limit the minimal number of bucket to two
   - fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START
   - fs/mbcache: cancel shrink work before destroying the cache
   - dcache: permit dynamic_dname()s up to NAME_MAX

  Cleanups:
   - remove or unexport unused fs_context infrastructure
   - trivial ->setattr cleanups
   - selftests/filesystems: Assume that TIOCGPTPEER is defined
   - writeback: fix kernel-doc function name mismatch for wb_put_many()
   - autofs: replace manual symlink buffer allocation in autofs_dir_symlink
   - init/initramfs.c: trivial fix: FSM -> Finite-state machine
   - fs: remove stale and duplicate forward declarations
   - readdir: Introduce dirent_size()
   - fs: Replace user_access_{begin/end} by scoped user access
   - kernel: acct: fix duplicate word in comment
   - fs: write a better comment in step_into() concerning .mnt assignment
   - fs: attr: fix comment formatting and spelling issues"

* tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
  dcache: permit dynamic_dname()s up to NAME_MAX
  fs: attr: fix comment formatting and spelling issues
  fs: hide file and bfile caches behind runtime const machinery
  fs: write a better comment in step_into() concerning .mnt assignment
  proc: rename proc_notify_change to proc_setattr
  proc: rename proc_setattr to proc_nochmod_setattr
  affs: rename affs_notify_change to affs_setattr
  adfs: rename adfs_notify_change to adfs_setattr
  hfs: update comments on hfs_inode_setattr
  kernel: acct: fix duplicate word in comment
  fs: Replace user_access_{begin/end} by scoped user access
  readdir: Introduce dirent_size()
  coredump: add tracepoint for coredump events
  fs: remove do_sys_truncate
  fs: pass on FTRUNCATE_* flags to do_truncate
  fs: fix archiecture-specific compat_ftruncate64
  fs: remove stale and duplicate forward declarations
  init/initramfs.c: trivial fix: FSM -> Finite-state machine
  autofs: replace manual symlink buffer allocation in autofs_dir_symlink
  fs/mbcache: cancel shrink work before destroying the cache
  ...
2026-04-13 14:20:11 -07:00
Jakub Kicinski
1508922588 Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'
Daniel Borkmann says:

====================
netkit: Support for io_uring zero-copy and AF_XDP

Containers use virtual netdevs to route traffic from a physical netdev
in the host namespace. They do not have access to the physical netdev
in the host and thus can't use memory providers or AF_XDP that require
reconfiguring/restarting queues in the physical netdev.

This patchset adds the concept of queue leasing to virtual netdevs that
allow containers to use memory providers and AF_XDP at native speed.
Leased queues are bound to a real queue in a physical netdev and act
as a proxy.

Memory providers and AF_XDP operations take an ifindex and queue id,
so containers would pass in an ifindex for a virtual netdev and a queue
id of a leased queue, which then gets proxied to the underlying real
queue.

We have implemented support for this concept in netkit and tested the
latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
(bnxt_en) 100G NICs. For more details see the individual patches.
====================

Link: https://patch.msgid.link/20260402231031.447597-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09 18:24:35 -07:00
David Wei
222b5566a0 net: Proxy netdev_queue_get_dma_dev for leased queues
Extend netdev_queue_get_dma_dev to return the physical device of the
real rxq for DMA in case the queue was leased. This allows memory
providers like io_uring zero-copy or devmem to bind to the physically
leased rxq via virtual devices such as netkit.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-8-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09 18:21:46 -07:00
Daniel Borkmann
1e91c98bc9 net: Slightly simplify net_mp_{open,close}_rxq
net_mp_open_rxq is currently not used in the tree as all callers are
using __net_mp_open_rxq directly, and net_mp_close_rxq is only used
once while all other locations use __net_mp_close_rxq.

Consolidate into a single API, netif_mp_{open,close}_rxq, using the
netif_ prefix to indicate that the caller is responsible for locking.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-6-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09 18:21:46 -07:00
Jens Axboe
c5e9f6a96b io_uring: unify getting ctx from passed in file descriptor
io_uring_enter() and io_uring_register() end up having duplicated code
for getting a ctx from a passed in file descriptor, for either a
registered ring descriptor or a normal file descriptor. Move the
io_uring_register_get_file() into io_uring.c and name it a bit more
generically, and use it from both callsites rather than have that logic
and handling duplicated.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-08 13:21:35 -06:00
Jens Axboe
b4d893d636 io_uring/register: don't get a reference to the registered ring fd
This isn't necessary and was only done because the register path isn't a
hot path and hence the extra ref/put doesn't matter, and to have the
exit path be able to unconditionally put whatever file was gotten
regardless of the type.

In preparation for sharing this code with the main io_uring_enter(2)
syscall, drop the reference and have the caller conditionally put the
file if it was a normal file descriptor.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-08 13:21:35 -06:00
Jens Axboe
7880174e1e io_uring/tctx: clean up __io_uring_add_tctx_node() error handling
Refactor __io_uring_add_tctx_node() so that on error it never leaves
current->io_uring pointing at a half-setup tctx. This moves the
assignment of current->io_uring to the end of the function post any
failure points.

Separate out the node installation into io_tctx_install_node() to
further clean this up.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-08 13:21:34 -06:00
Jens Axboe
2c453a4281 io_uring/tctx: have io_uring_alloc_task_context() return tctx
Instead of having io_uring_alloc_task_context() return an int and
assign tsk->io_uring, just have it return the task context directly.
This enables cleaner error handling in callers, which may have
failure points post calling io_uring_alloc_task_context().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-08 13:21:30 -06:00
Linus Torvalds
e41255ce7a Merge tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:

 - A previous fix in this release covered the case of the rings being
   RCU protected during resize, but it missed a few spots. This covers
   the rest

 - Fix the cBPF filters when COW'ed, introduced in this merge window

 - Fix for an attempt to import a zero sized buffer

 - Fix for a missing clamp in importing bundle buffers

* tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/bpf_filters: retain COW'ed settings on parse failures
  io_uring: protect remaining lockless ctx->rings accesses with RCU
  io_uring/rsrc: reject zero-length fixed buffer import
  io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs()
2026-04-03 11:58:04 -07:00
Yang Xiuwei
f847bf6d29 io_uring/timeout: use 'ctx' consistently
There's already a local ctx variable, yet cq_timeouts accounting uses
req->ctx. Use ctx consistently.

Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Link: https://patch.msgid.link/20260402014952.260414-1-yangxiuwei@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 07:08:40 -06:00
Joanne Koong
c7f3aaf3e8 io_uring/rw: clean up __io_read() obsolete comment and early returns
After commit a9165b83c1 ("io_uring/rw: always setup io_async_rw for
read/write requests") which moved the iovec allocation into the prep
path and stores it in req->async_data where it now gets freed as part of
the request lifecycle, this comment is now outdated.

Remove it and clean up the goto as well.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260401173511.4052303-1-joannelkoong@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 06:55:50 -06:00
Pavel Begunkov
4c6f93951b io_uring/zcrx: use correct mmap off constants
zcrx was using IORING_OFF_PBUF_SHIFT during first iterations, but there
is now a separate constant it should use. Both are 16 so it doesn't
change anything, but improve it for the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/fe16ebe9ba4048a7e12f9b3b50880bd175b1ce03.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 06:55:48 -06:00
Pavel Begunkov
7120b87bed io_uring/zcrx: use dma_len for chunk size calculation
Buffers are now dma-mapped earlier and we can sg_dma_len(), otherwise,
since it's walking with for_each_sgtable_dma_sg(), it might wrongfully
reject some configurations. As a bonus, it'd now be able to use larger
chunks if dma addresses are coalesced e.g by iommu.

Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/03b219af3f6cfdd1cf64679b8bab7461e47cc123.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 06:55:47 -06:00
Pavel Begunkov
52dcd1776b io_uring/zcrx: don't clear not allocated niovs
Now that area->is_mapped is set earlier before niovs array is allocated,
io_zcrx_free_area -> io_zcrx_unmap_area in an error path can try to
clear dma addresses for unallocated niovs, fix it.

Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/cbcb7749b5a001ecd4d1c303515ce9403215640c.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 06:55:36 -06:00
Pavel Begunkov
8ae2837d5a io_uring/zcrx: don't use mark0 for allocating xarray
XA_MARK_0 is not compatible with xarray allocating entries, use
XA_MARK_1.

Fixes: fda90d43f4fac ("io_uring/zcrx: return back two step unregistration")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/f232cfd3c466047d333b474dd2bddd246b6ebb82.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Anas Iqbal
77d8c8d0f1 io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring()
Smatch warns:
io_uring/zcrx.c:393 io_allocate_rbuf_ring() warn: should 'id << 16' be a 64 bit type?

The expression 'id << IORING_OFF_PBUF_SHIFT' is evaluated using 32-bit
arithmetic because id is a u32. This may overflow before being promoted
to the 64-bit mmap_offset.

Cast id to u64 before shifting to ensure the shift is performed in
64-bit arithmetic.

Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/52400e1b343691416bef3ed3ae287fb1a88d407f.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
a9d008489f io_uring/zcrx: reject REG_NODEV with large rx_buf_size
The copy fallback path doesn't care about the actual niov size and only
uses first PAGE_SIZE bytes, and any additional space will be wasted.
Since ZCRX_REG_NODEV solely relies on the copy path, it doesn't make
sense to support non-standard rx_buf_len. Reject it for now, and
re-enable once improved.

Fixes: c11728021d5cd ("io_uring/zcrx: implement device-less mode for zcrx")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/3e7652d9c27f8ac5d2b141e3af47971f2771fb05.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Amir Mohammad Jahangirzad
85a58309c0 io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP
io_async_cancel_prep() reads the opcode selector from sqe->len and
stores it in cancel->opcode, which is an 8-bit field. Since sqe->len
is a 32-bit value, values larger than U8_MAX are implicitly truncated.

This can cause unintended opcode matches when the truncated value
corresponds to a valid io_uring opcode. For example, submitting a value
such as 0x10b will be truncated to 0x0b (IORING_OP_TIMEOUT), allowing a
cancel request to match operations it did not intend to target.
Validate the opcode value before assigning it to the 8-bit field and
reject values outside the valid io_uring opcode range.

Signed-off-by: Amir Mohammad Jahangirzad <a.jahangirzad@gmail.com>
Link: https://patch.msgid.link/20260331232113.615972-1-a.jahangirzad@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Jackie Liu
19a8cc6cda io_uring/rsrc: use io_cache_free() to free node
Replace kfree(node) with io_cache_free() in io_buffer_register_bvec()
to match all other error paths that free nodes allocated via
io_rsrc_node_alloc(). The node is allocated through io_cache_alloc()
internally, so it should be returned to the cache via io_cache_free()
for proper object reuse.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Link: https://patch.msgid.link/20260331104509.7055-1-liu.yun@linux.dev
[axboe: remove fixes tag, it's not a fix, it's a cleanup]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
7c713dd007 io_uring/zcrx: rename zcrx [un]register functions
Drop "ifqs" from function names, as it refers to an interface queue and
there might be none once a device-less mode is introduced.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/657874acd117ec30fa6f45d9d844471c753b5a0f.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
de6ed1b323 io_uring/zcrx: check ctrl op payload struct sizes
Add a build check that ctrl payloads are of the same size and don't grow
struct zcrx_ctrl.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/af66caf9776d18e9ff880ab828eb159a6a03caf5.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
5c727ce042 io_uring/zcrx: cache fallback availability in zcrx ctx
Store a flag in struct io_zcrx_ifq telling if the backing memory is
normal page or dmabuf based. It was looking it up from the area, however
it logically allocates from the zcrx ctx and not a particular area, and
once we add more than one area it'll become a mess.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/65e75408a7758fe7e60fae89b7a8d5ae4857f515.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
f0b92207a0 io_uring/zcrx: warn on a repeated area append
We only support a single area, no path should be able to call
io_zcrx_append_area() twice. Warn if that happens instead of just
returning an error.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/28eb67fb8c48445584d7c247a36e1ad8800f0c8b.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
61cfadaae6 io_uring/zcrx: consolidate dma syncing
Split refilling into two steps, first allocate niovs, and then do DMA
sync for them. This way dma synchronisation code can be better
optimised. E.g. we don't need to call dma_dev_need_sync() for each every
niov, and maybe we can coalesce sync for adjacent netmems in the future
as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/19f2d50baa62ff2e0c6cd56dd7c394cab728c567.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
c0989138c0 io_uring/zcrx: netmem array as refiling format
Instead of peeking into page pool allocation cache directly or via
net_mp_netmem_place_in_cache(), pass a netmem array around. It's a
better intermediate format, e.g. you can have it on stack and reuse the
refilling code and decouples it from page pools a bit more.

It still points into the page pool directly, there will be no additional
copies. As the next step, we can change the callback prototype to take
the netmem array from page pool.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/9d8549adb7ef6672daf2d8a52858ce5926279a82.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
48f253d65d io_uring/zcrx: warn on alloc with non-empty pp cache
Page pool ensures the cache is empty before asking to refill it. Warn if
the assumption is violated.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/9c9792d6e65f3780d57ff83b6334d341ed9a5f29.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
7df542a665 io_uring/zcrx: move count check into zcrx_get_free_niov
Instead of relying on the caller of __io_zcrx_get_free_niov() to check
that there are free niovs available (i.e. free_count > 0), move the
check into the function and return NULL if can't allocate. It
consolidates the free count checks, and it'll be easier to extend the
niov free list allocator in the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/6df04a6b3a6170f86d4345da9864f238311163f9.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
898ad80d12 io_uring/zcrx: use guards for locking
Convert last several places using manual locking to guards to simplify
the code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/eb4667cfaf88c559700f6399da9e434889f5b04a.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
6a55a0a7eb io_uring/zcrx: add a struct for refill queue
Add a new structure that keeps the refill queue state. It's cleaner and
will be useful once we introduce multiple refill queues.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/4ce200da1ff0309c377293b949200f95f80be9ae.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
ebae09bce4 io_uring/zcrx: use better name for RQ region
Rename "region" to "rq_region" to highlight that it's a refill queue
region.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/ac815790d2477a15826aecaa3d94f2a94ef507e6.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
825f276491 io_uring/zcrx: implement device-less mode for zcrx
Allow creating a zcrx instance without attaching it to a net device.
All data will be copied through the fallback path. The user is also
expected to use ZCRX_CTRL_FLUSH_RQ to handle overflows as it normally
should even with a netdev, but it becomes even more relevant as there
will likely be no one to automatically pick up buffers.

Apart from that, it follows the zcrx uapi for the I/O path, and is
useful for testing, experimentation, and potentially for the copy
receive path in the future if improved.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/674f8ad679c5a0bc79d538352b3042cf0999596e.1774261953.git.asml.silence@gmail.com
[axboe: fix spelling error in uapi header and commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
06fc3b6d38 io_uring/zcrx: extract netdev+area init into a helper
In preparation to following patches, add a function that is responsibly
for looking up a netdev, creating an area, DMA mapping it and opening a
queue.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/88cb6f746ecb496a9030756125419df273d0b003.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
b8d6eb6c1c io_uring/zcrx: always dma map in advance
zcrx was originally establisihing dma mappings at a late stage when it
was being bound to a page pool. Dma-buf couldn't work this way, so it's
initialised during area creation.

It's messy having them do it at different spots, just move everything to
the area creation time.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/334092a2cbdd4aabd7c025050aa99f05ace89bb5.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
41041562a7 io_uring/zcrx: fully clean area on error in io_import_umem()
When accounting fails, io_import_umem() sets the page array, etc. and
returns an error expecting that the error handling code will take care
of the rest. To make the next patch simpler, only return a fully
initialised areas from the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/3a602b7fb347dbd4da6797ac49b52ea5dedb856d.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
e5361d25e2 io_uring/zcrx: return back two step unregistration
There are reports where io_uring instance removal takes too long and an
ifq reallocation by another zcrx instance fails. Split zcrx destruction
into two steps similarly how it was before, first close the queue early
but maintain zcrx alive, and then when all inflight requests are
completed, drop the main zcrx reference. For extra protection, mark
terminated zcrx instances in xarray and warn if we double put them.

Cc: stable@vger.kernel.org # 6.19+
Link: https://github.com/axboe/liburing/issues/1550
Reported-by: Youngmin Choi <youngminchoi94@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/0ce21f0565ab4358668922a28a8a36922dfebf76.1774261953.git.asml.silence@gmail.com
[axboe: NULL ifq before break inside scoped guard]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Jens Axboe
aa35dd6bdd io_uring/bpf_filters: retain COW'ed settings on parse failures
If io_parse_restrictions() fails, it ends up clearing any restrictions
currently set. The intent is only to clear whatever it already applied,
but it ends up clearing everything, including whatever settings may have
been applied in a copy-on-write fashion already. Ensure that those are
retained.

Link: https://lore.kernel.org/io-uring/CAK8a0jzF-zaO5ZmdOrmfuxrhXuKg5m5+RDuO7tNvtj=kUYbW7Q@mail.gmail.com/
Reported-by: antonius <bluedragonsec2023@gmail.com>
Fixes: ed82f35b92 ("io_uring: allow registration of per-task restrictions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 08:34:14 -06:00
Jens Axboe
61a11cf481 io_uring: protect remaining lockless ctx->rings accesses with RCU
Commit 9618908026 addressed one case of ctx->rings being potentially
accessed while a resize is happening on the ring, but there are still
a few others that need handling. Add a helper for retrieving the
rings associated with an io_uring context, and add some sanity checking
to that to catch bad uses. ->rings_rcu is always valid, as long as it's
used within RCU read lock. Any use of ->rings_rcu or ->rings inside
either ->uring_lock or ->completion_lock is sane as well.

Do the minimum fix for the current kernel, but set it up such that this
basic infra can be extended for later kernels to make this harder to
mess up in the future.

Thanks to Junxi Qian for finding and debugging this issue.

Cc: stable@vger.kernel.org
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reviewed-by: Junxi Qian <qjx1298677004@gmail.com>
Tested-by: Junxi Qian <qjx1298677004@gmail.com>
Link: https://lore.kernel.org/io-uring/20260330172348.89416-1-qjx1298677004@gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 08:34:11 -06:00
Qi Tang
111a12b422 io_uring/rsrc: reject zero-length fixed buffer import
validate_fixed_range() admits buf_addr at the exact end of the
registered region when len is zero, because the check uses strict
greater-than (buf_end > imu->ubuf + imu->len).  io_import_fixed()
then computes offset == imu->len, which causes the bvec skip logic
to advance past the last bio_vec entry and read bv_offset from
out-of-bounds slab memory.

Return early from io_import_fixed() when len is zero.  A zero-length
import has no data to transfer and should not walk the bvec array
at all.

  BUG: KASAN: slab-out-of-bounds in io_import_reg_buf+0x697/0x7f0
  Read of size 4 at addr ffff888002bcc254 by task poc/103
  Call Trace:
   io_import_reg_buf+0x697/0x7f0
   io_write_fixed+0xd9/0x250
   __io_issue_sqe+0xad/0x710
   io_issue_sqe+0x7d/0x1100
   io_submit_sqes+0x86a/0x23c0
   __do_sys_io_uring_enter+0xa98/0x1590
  Allocated by task 103:
  The buggy address is located 12 bytes to the right of
   allocated 584-byte region [ffff888002bcc000, ffff888002bcc248)

Fixes: 8622b20f23 ("io_uring: add validate_fixed_range() for validate fixed buffer")
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Link: https://patch.msgid.link/20260329164936.240871-1-tpluszz77@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-29 14:03:55 -06:00
Junxi Qian
b948f9d5d3 io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs()
sqe->len is __u32 but gets stored into sr->len which is int. When
userspace passes sqe->len values exceeding INT_MAX (e.g. 0xFFFFFFFF),
sr->len overflows to a negative value. This negative value propagates
through the bundle recv/send path:

  1. io_recv(): sel.val = sr->len (ssize_t gets -1)
  2. io_recv_buf_select(): arg.max_len = sel->val (size_t gets
     0xFFFFFFFFFFFFFFFF)
  3. io_ring_buffers_peek(): buf->len is not clamped because max_len
     is astronomically large
  4. iov[].iov_len = 0xFFFFFFFF flows into io_bundle_nbufs()
  5. io_bundle_nbufs(): min_t(int, 0xFFFFFFFF, ret) yields -1,
     causing ret to increase instead of decrease, creating an
     infinite loop that reads past the allocated iov[] array

This results in a slab-out-of-bounds read in io_bundle_nbufs() from
the kmalloc-64 slab, as nbufs increments past the allocated iovec
entries.

  BUG: KASAN: slab-out-of-bounds in io_bundle_nbufs+0x128/0x160
  Read of size 8 at addr ffff888100ae05c8 by task exp/145
  Call Trace:
   io_bundle_nbufs+0x128/0x160
   io_recv_finish+0x117/0xe20
   io_recv+0x2db/0x1160

Fix this by rejecting negative sr->len values early in both
io_sendmsg_prep() and io_recvmsg_prep(). Since sqe->len is __u32,
any value > INT_MAX indicates overflow and is not a valid length.

Fixes: a05d1f625c ("io_uring/net: support bundles for send")
Cc: stable@vger.kernel.org
Signed-off-by: Junxi Qian <qjx1298677004@gmail.com>
Link: https://patch.msgid.link/20260329153909.279046-1-qjx1298677004@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-29 14:03:14 -06:00
Linus Torvalds
196ef74abd Merge tag 'io_uring-7.0-20260327' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
 "Just two small fixes, both fixing regressions added in the fdinfo code
  in 6.19 with the SQE mixed size support"

* tag 'io_uring-7.0-20260327' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/fdinfo: fix OOB read in SQE_MIXED wrap check
  io_uring/fdinfo: fix SQE_MIXED SQE displaying
2026-03-27 15:35:38 -07:00
Nicholas Carlini
5170efd9c3 io_uring/fdinfo: fix OOB read in SQE_MIXED wrap check
__io_uring_show_fdinfo() iterates over pending SQEs and, for 128-byte
SQEs on an IORING_SETUP_SQE_MIXED ring, needs to detect when the second
half of the SQE would be past the end of the sq_sqes array. The current
check tests (++sq_head & sq_mask) == 0, but sq_head is only incremented
when a 128-byte SQE is encountered, not on every iteration. The actual
array index is sq_idx = (i + sq_head) & sq_mask, which can be sq_mask
(the last slot) while the wrap check passes.

Fix by checking sq_idx directly. Keep the sq_head increment so the loop
still skips the second half of the 128-byte SQE on the next iteration.

Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Nicholas Carlini <nicholas@carlini.com>
Link: https://patch.msgid.link/20260327021823.3138396-1-nicholas@carlini.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-26 20:28:28 -06:00
Jens Axboe
b59efde9e6 io_uring/fdinfo: fix SQE_MIXED SQE displaying
When displaying pending SQEs for a MIXED ring, each 128-byte SQE
increments sq_head to skip the second slot, but the loop counter is not
adjusted. This can cause the loop to read past sq_tail by one entry for
each 128-byte SQE encountered, displaying SQEs that haven't been made
consumable yet by the application.

Match the kernel's own consumption logic in io_init_req() which
decrements what's left when consuming the extra slot.

Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-26 07:15:55 -06:00
Christoph Hellwig
0924f6b80d fs: pass on FTRUNCATE_* flags to do_truncate
Pass the flags one level down to replace the somewhat confusing small
argument, and clean up do_truncate as a result.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260323070205.2939118-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-23 12:41:57 +01:00
Linus Torvalds
c612261bed Merge tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:

 - A bit of a work-around for AF_UNIX recv multishot, as the in-kernel
   implementation doesn't properly signal EOF. We'll likely rework this
   one going forward, but the fix is sufficient for now

 - Two fixes for incrementally consumed buffers, for non-pollable files
   and for 0 byte reads

* tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/kbuf: propagate BUF_MORE through early buffer commit path
  io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF
  io_uring/poll: fix multishot recv missing EOF on wakeup race
2026-03-20 09:58:56 -07:00
Jens Axboe
418eab7a6f io_uring/kbuf: propagate BUF_MORE through early buffer commit path
When io_should_commit() returns true (eg for non-pollable files), buffer
commit happens at buffer selection time and sel->buf_list is set to
NULL. When __io_put_kbufs() generates CQE flags at completion time, it
calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence
cannot determine whether the buffer was consumed or not. This means that
IORING_CQE_F_BUF_MORE is never set for non-pollable input with
incrementally consumed buffers.

Likewise for io_buffers_select(), which always commits upfront and
discards the return value of io_kbuf_commit().

Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early
commit. Then __io_put_kbuf_ring() can check this flag and set
IORING_F_BUF_MORE accordingy.

Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-19 15:09:48 -06:00
Jens Axboe
3ecd3e0314 io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF
For a zero length transfer, io_kbuf_inc_commit() is called with !len.
Since we never enter the while loop to consume the buffers,
io_kbuf_inc_commit() ends up returning true, consuming the buffer. But
if no data was consumed, by definition it cannot have consumed the
buffer. Return false for that case.

Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-19 15:09:40 -06:00
Jens Axboe
f41b075492 io_uring: avoid req->ctx reload in io_req_put_rsrc_nodes()
Cache 'ctx' to avoid it needing to get potentially reloaded.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-17 14:35:00 -06:00
Jens Axboe
3e97c2582f io_uring/rw: use cached file rather than req->file
In io_rw_init_file(), req->file is cached in file, yet the former is
still being used when checking for O_DIRECT. As this is post setting
the kiocb flags, the compiler has to reload req->file. Just use the
locally cached file instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-17 14:35:00 -06:00