139 Commits

Author SHA1 Message Date
Linus Torvalds
91a4855d6c Merge tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Support HW queue leasing, allowing containers to be granted access
     to HW queues for zero-copy operations and AF_XDP

   - Number of code moves to help the compiler with inlining. Avoid
     output arguments for returning drop reason where possible

   - Rework drop handling within qdiscs to include more metadata about
     the reason and dropping qdisc in the tracepoints

   - Remove the rtnl_lock use from IP Multicast Routing

   - Pack size information into the Rx Flow Steering table pointer
     itself. This allows making the table itself a flat array of u32s,
     thus making the table allocation size a power of two

   - Report TCP delayed ack timer information via socket diag

   - Add ip_local_port_step_width sysctl to allow distributing the
     randomly selected ports more evenly throughout the allowed space

   - Add support for per-route tunsrc in IPv6 segment routing

   - Start work of switching sockopt handling to iov_iter

   - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
     buffer size drifting up

   - Support MSG_EOR in MPTCP

   - Add stp_mode attribute to the bridge driver for STP mode selection.
     This addresses concerns about call_usermodehelper() usage

   - Remove UDP-Lite support (as announced in 2023)

   - Remove support for building IPv6 as a module. Remove the now
     unnecessary function calling indirection

  Cross-tree stuff:

   - Move Michael MIC code from generic crypto into wireless, it's
     considered insecure but some WiFi networks still need it

  Netfilter:

   - Switch nft_fib_ipv6 module to no longer need temporary dst_entry
     object allocations by using fib6_lookup() + RCU.

     Florian W reports this gets us ~13% higher packet rate

   - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
     switch the service tables to be per-net. Convert some code that
     walks the service lists to use RCU instead of the service_mutex

   - Add more opinionated input validation to lower security exposure

   - Make IPVS hash tables to be per-netns and resizable

  Wireless:

   - Finished assoc frame encryption/EPPKE/802.1X-over-auth

   - Radar detection improvements

   - Add 6 GHz incumbent signal detection APIs

   - Multi-link support for FILS, probe response templates and client
     probing

   - New APIs and mac80211 support for NAN (Neighbor Aware Networking,
     aka Wi-Fi Aware) so less work must be in firmware

  Driver API:

   - Add numerical ID for devlink instances (to avoid having to create
     fake bus/device pairs just to have an ID). Support shared devlink
     instances which span multiple PFs

   - Add standard counters for reporting pause storm events (implement
     in mlx5 and fbnic)

   - Add configuration API for completion writeback buffering (implement
     in mana)

   - Support driver-initiated change of RSS context sizes

   - Support DPLL monitoring input frequency (implement in zl3073x)

   - Support per-port resources in devlink (implement in mlx5)

  Misc:

   - Expand the YAML spec for Netfilter

  Drivers

   - Software:
      - macvlan: support multicast rx for bridge ports with shared
        source MAC address
      - team: decouple receive and transmit enablement for IEEE 802.3ad
        LACP "independent control"

   - Ethernet high-speed NICs:
      - nVidia/Mellanox:
         - support high order pages in zero-copy mode (for payload
           coalescing)
         - support multiple packets in a page (for systems with 64kB
           pages)
      - Broadcom 25-400GE (bnxt):
         - implement XDP RSS hash metadata extraction
         - add software fallback for UDP GSO, lowering the IOMMU cost
      - Broadcom 800GE (bnge):
         - add link status and configuration handling
         - add various HW and SW statistics
      - Marvell/Cavium:
         - NPC HW block support for cn20k
      - Huawei (hinic3):
         - add mailbox / control queue
         - add rx VLAN offload
         - add driver info and link management

   - Ethernet NICs:
      - Marvell/Aquantia:
         - support reading SFP module info on some AQC100 cards
      - Realtek PCI (r8169):
         - add support for RTL8125cp
      - Realtek USB (r8152):
         - support for the RTL8157 5Gbit chip
         - add 2500baseT EEE status/configuration support

   - Ethernet NICs embedded and off-the-shelf IP:
      - Synopsys (stmmac):
         - cleanup and reorganize SerDes handling and PCS support
         - cleanup descriptor handling and per-platform data
         - cleanup and consolidate MDIO defines and handling
         - shrink driver memory use for internal structures
         - improve Tx IRQ coalescing
         - improve TCP segmentation handling
         - add support for Spacemit K3
      - Cadence (macb):
         - support PHYs that have inband autoneg disabled with GEM
         - support IEEE 802.3az EEE
         - rework usrio capabilities and handling
      - AMD (xgbe):
         - improve power management for S0i3
         - improve TX resilience for link-down handling

   - Virtual:
      - Google cloud vNIC:
         - support larger ring sizes in DQO-QPL mode
         - improve HW-GRO handling
         - support UDP GSO for DQO format
      - PCIe NTB:
         - support queue count configuration

   - Ethernet PHYs:
      - automatically disable PHY autonomous EEE if MAC is in charge
      - Broadcom:
         - add BCM84891/BCM84892 support
      - Micrel:
         - support for LAN9645X internal PHY
      - Realtek:
         - add RTL8224 pair order support
         - support PHY LEDs on RTL8211F-VD
         - support spread spectrum clocking (SSC)
      - Maxlinear:
         - add PHY-level statistics via ethtool

   - Ethernet switches:
      - Maxlinear (mxl862xx):
         - support for bridge offloading
         - support for VLANs
         - support driver statistics

   - Bluetooth:
      - large number of fixes and new device IDs
      - Mediatek:
         - support MT6639 (MT7927)
         - support MT7902 SDIO

   - WiFi:
      - Intel (iwlwifi):
         - UNII-9 and continuing UHR work
      - MediaTek (mt76):
         - mt7996/mt7925 MLO fixes/improvements
         - mt7996 NPU support (HW eth/wifi traffic offload)
      - Qualcomm (ath12k):
         - monitor mode support on IPQ5332
         - basic hwmon temperature reporting
         - support IPQ5424
      - Realtek:
         - add USB RX aggregation to improve performance
         - add USB TX flow control by tracking in-flight URBs

   - Cellular:
      - IPA v5.2 support"

* tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits)
  net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
  wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
  wireguard: allowedips: remove redundant space
  tools: ynl: add sample for wireguard
  wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
  MAINTAINERS: Add netkit selftest files
  selftests/net: Add additional test coverage in nk_qlease
  selftests/net: Split netdevsim tests from HW tests in nk_qlease
  tools/ynl: Make YnlFamily closeable as a context manager
  net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
  net: airoha: Fix VIP configuration for AN7583 SoC
  net: caif: clear client service pointer on teardown
  net: strparser: fix skb_head leak in strp_abort_strp()
  net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()
  selftests/bpf: add test for xdp_master_redirect with bond not up
  net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
  net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
  sctp: disable BH before calling udp_tunnel_xmit_skb()
  sctp: fix missing encap_port propagation for GSO fragments
  net: airoha: Rely on net_device pointer in ETS callbacks
  ...
2026-04-14 18:36:10 -07:00
David Wei
222b5566a0 net: Proxy netdev_queue_get_dma_dev for leased queues
Extend netdev_queue_get_dma_dev to return the physical device of the
real rxq for DMA in case the queue was leased. This allows memory
providers like io_uring zero-copy or devmem to bind to the physically
leased rxq via virtual devices such as netkit.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-8-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09 18:21:46 -07:00
Daniel Borkmann
1e91c98bc9 net: Slightly simplify net_mp_{open,close}_rxq
net_mp_open_rxq is currently not used in the tree as all callers are
using __net_mp_open_rxq directly, and net_mp_close_rxq is only used
once while all other locations use __net_mp_close_rxq.

Consolidate into a single API, netif_mp_{open,close}_rxq, using the
netif_ prefix to indicate that the caller is responsible for locking.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-6-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09 18:21:46 -07:00
Pavel Begunkov
4c6f93951b io_uring/zcrx: use correct mmap off constants
zcrx was using IORING_OFF_PBUF_SHIFT during first iterations, but there
is now a separate constant it should use. Both are 16 so it doesn't
change anything, but improve it for the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/fe16ebe9ba4048a7e12f9b3b50880bd175b1ce03.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 06:55:48 -06:00
Pavel Begunkov
7120b87bed io_uring/zcrx: use dma_len for chunk size calculation
Buffers are now dma-mapped earlier and we can sg_dma_len(), otherwise,
since it's walking with for_each_sgtable_dma_sg(), it might wrongfully
reject some configurations. As a bonus, it'd now be able to use larger
chunks if dma addresses are coalesced e.g by iommu.

Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/03b219af3f6cfdd1cf64679b8bab7461e47cc123.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 06:55:47 -06:00
Pavel Begunkov
52dcd1776b io_uring/zcrx: don't clear not allocated niovs
Now that area->is_mapped is set earlier before niovs array is allocated,
io_zcrx_free_area -> io_zcrx_unmap_area in an error path can try to
clear dma addresses for unallocated niovs, fix it.

Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/cbcb7749b5a001ecd4d1c303515ce9403215640c.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 06:55:36 -06:00
Pavel Begunkov
8ae2837d5a io_uring/zcrx: don't use mark0 for allocating xarray
XA_MARK_0 is not compatible with xarray allocating entries, use
XA_MARK_1.

Fixes: fda90d43f4fac ("io_uring/zcrx: return back two step unregistration")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/f232cfd3c466047d333b474dd2bddd246b6ebb82.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Anas Iqbal
77d8c8d0f1 io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring()
Smatch warns:
io_uring/zcrx.c:393 io_allocate_rbuf_ring() warn: should 'id << 16' be a 64 bit type?

The expression 'id << IORING_OFF_PBUF_SHIFT' is evaluated using 32-bit
arithmetic because id is a u32. This may overflow before being promoted
to the 64-bit mmap_offset.

Cast id to u64 before shifting to ensure the shift is performed in
64-bit arithmetic.

Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/52400e1b343691416bef3ed3ae287fb1a88d407f.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
a9d008489f io_uring/zcrx: reject REG_NODEV with large rx_buf_size
The copy fallback path doesn't care about the actual niov size and only
uses first PAGE_SIZE bytes, and any additional space will be wasted.
Since ZCRX_REG_NODEV solely relies on the copy path, it doesn't make
sense to support non-standard rx_buf_len. Reject it for now, and
re-enable once improved.

Fixes: c11728021d5cd ("io_uring/zcrx: implement device-less mode for zcrx")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/3e7652d9c27f8ac5d2b141e3af47971f2771fb05.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
7c713dd007 io_uring/zcrx: rename zcrx [un]register functions
Drop "ifqs" from function names, as it refers to an interface queue and
there might be none once a device-less mode is introduced.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/657874acd117ec30fa6f45d9d844471c753b5a0f.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
de6ed1b323 io_uring/zcrx: check ctrl op payload struct sizes
Add a build check that ctrl payloads are of the same size and don't grow
struct zcrx_ctrl.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/af66caf9776d18e9ff880ab828eb159a6a03caf5.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
5c727ce042 io_uring/zcrx: cache fallback availability in zcrx ctx
Store a flag in struct io_zcrx_ifq telling if the backing memory is
normal page or dmabuf based. It was looking it up from the area, however
it logically allocates from the zcrx ctx and not a particular area, and
once we add more than one area it'll become a mess.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/65e75408a7758fe7e60fae89b7a8d5ae4857f515.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
f0b92207a0 io_uring/zcrx: warn on a repeated area append
We only support a single area, no path should be able to call
io_zcrx_append_area() twice. Warn if that happens instead of just
returning an error.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/28eb67fb8c48445584d7c247a36e1ad8800f0c8b.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
61cfadaae6 io_uring/zcrx: consolidate dma syncing
Split refilling into two steps, first allocate niovs, and then do DMA
sync for them. This way dma synchronisation code can be better
optimised. E.g. we don't need to call dma_dev_need_sync() for each every
niov, and maybe we can coalesce sync for adjacent netmems in the future
as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/19f2d50baa62ff2e0c6cd56dd7c394cab728c567.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
c0989138c0 io_uring/zcrx: netmem array as refiling format
Instead of peeking into page pool allocation cache directly or via
net_mp_netmem_place_in_cache(), pass a netmem array around. It's a
better intermediate format, e.g. you can have it on stack and reuse the
refilling code and decouples it from page pools a bit more.

It still points into the page pool directly, there will be no additional
copies. As the next step, we can change the callback prototype to take
the netmem array from page pool.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/9d8549adb7ef6672daf2d8a52858ce5926279a82.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
48f253d65d io_uring/zcrx: warn on alloc with non-empty pp cache
Page pool ensures the cache is empty before asking to refill it. Warn if
the assumption is violated.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/9c9792d6e65f3780d57ff83b6334d341ed9a5f29.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
7df542a665 io_uring/zcrx: move count check into zcrx_get_free_niov
Instead of relying on the caller of __io_zcrx_get_free_niov() to check
that there are free niovs available (i.e. free_count > 0), move the
check into the function and return NULL if can't allocate. It
consolidates the free count checks, and it'll be easier to extend the
niov free list allocator in the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/6df04a6b3a6170f86d4345da9864f238311163f9.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
898ad80d12 io_uring/zcrx: use guards for locking
Convert last several places using manual locking to guards to simplify
the code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/eb4667cfaf88c559700f6399da9e434889f5b04a.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
6a55a0a7eb io_uring/zcrx: add a struct for refill queue
Add a new structure that keeps the refill queue state. It's cleaner and
will be useful once we introduce multiple refill queues.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/4ce200da1ff0309c377293b949200f95f80be9ae.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
ebae09bce4 io_uring/zcrx: use better name for RQ region
Rename "region" to "rq_region" to highlight that it's a refill queue
region.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/ac815790d2477a15826aecaa3d94f2a94ef507e6.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
825f276491 io_uring/zcrx: implement device-less mode for zcrx
Allow creating a zcrx instance without attaching it to a net device.
All data will be copied through the fallback path. The user is also
expected to use ZCRX_CTRL_FLUSH_RQ to handle overflows as it normally
should even with a netdev, but it becomes even more relevant as there
will likely be no one to automatically pick up buffers.

Apart from that, it follows the zcrx uapi for the I/O path, and is
useful for testing, experimentation, and potentially for the copy
receive path in the future if improved.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/674f8ad679c5a0bc79d538352b3042cf0999596e.1774261953.git.asml.silence@gmail.com
[axboe: fix spelling error in uapi header and commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
06fc3b6d38 io_uring/zcrx: extract netdev+area init into a helper
In preparation to following patches, add a function that is responsibly
for looking up a netdev, creating an area, DMA mapping it and opening a
queue.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/88cb6f746ecb496a9030756125419df273d0b003.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
b8d6eb6c1c io_uring/zcrx: always dma map in advance
zcrx was originally establisihing dma mappings at a late stage when it
was being bound to a page pool. Dma-buf couldn't work this way, so it's
initialised during area creation.

It's messy having them do it at different spots, just move everything to
the area creation time.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/334092a2cbdd4aabd7c025050aa99f05ace89bb5.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
41041562a7 io_uring/zcrx: fully clean area on error in io_import_umem()
When accounting fails, io_import_umem() sets the page array, etc. and
returns an error expecting that the error handling code will take care
of the rest. To make the next patch simpler, only return a fully
initialised areas from the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/3a602b7fb347dbd4da6797ac49b52ea5dedb856d.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
e5361d25e2 io_uring/zcrx: return back two step unregistration
There are reports where io_uring instance removal takes too long and an
ifq reallocation by another zcrx instance fails. Split zcrx destruction
into two steps similarly how it was before, first close the queue early
but maintain zcrx alive, and then when all inflight requests are
completed, drop the main zcrx reference. For extra protection, mark
terminated zcrx instances in xarray and warn if we double put them.

Cc: stable@vger.kernel.org # 6.19+
Link: https://github.com/axboe/liburing/issues/1550
Reported-by: Youngmin Choi <youngminchoi94@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/0ce21f0565ab4358668922a28a8a36922dfebf76.1774261953.git.asml.silence@gmail.com
[axboe: NULL ifq before break inside scoped guard]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:12 -06:00
Pavel Begunkov
dc156e0f1a io_uring/zcrx: declare some constants for query
Add constants for zcrx features and supported registration flags that
can be reused by the query code. I was going to add another registration
flag, and this patch helps to avoid duplication and keeps changes
specific to zcrx files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-09 07:21:54 -06:00
Linus Torvalds
3ad66a34cc Merge tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:

 - Fix a typo in the mock_file help text

 - Fix a comment regarding IORING_SETUP_TASKRUN_FLAG in the
   io_uring.h UAPI header

 - Use READ_ONCE() for reading refill queue entries

 - Reject SEND_VECTORIZED for fixed buffer sends, as it isn't
   implemented. Currently this flag is silently ignored

   This is in preparation for making these work, but first we
   need a fixup so that older kernels will correctly reject them

 - Ensure "0" means default for the rx page size

* tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/zcrx: use READ_ONCE with user shared RQEs
  io_uring/mock: Fix typo in help text
  io_uring/net: reject SEND_VECTORIZED when unsupported
  io_uring: correct comment for IORING_SETUP_TASKRUN_FLAG
  io_uring/zcrx: don't set rx_page_size when not requested
2026-03-06 08:31:36 -08:00
Pavel Begunkov
531bb98a03 io_uring/zcrx: use READ_ONCE with user shared RQEs
Refill queue entries are shared with the user space, use READ_ONCE when
reading them.

Fixes: 34a3e60821 ("io_uring/zcrx: implement zerocopy receive pp memory provider");
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-04 06:30:39 -07:00
Jakub Kicinski
3d17d76d1f io_uring/zcrx: don't set rx_page_size when not requested
The rx_buf_len parameter was recently added to the Rx zero-copy
implementation. The expectation is that when not set system will
maintain previous behavior and use the default buffer size (PAGE_SIZE).

This works correctly at the iouring level, but we don't preserve
the same "zero means default" semantics when registering the memory
provider on the netdev. mp_param.rx_page_size is unconditionally
set to PAGE_SIZE. This causes __net_mp_open_rxq() to check for
QCFG_RX_PAGE_SIZE support in the driver, and return -EOPNOTSUPP
for drivers that don't advertise it -- even though the user never
asked for large buffers.

Only set mp_param.rx_page_size when rx_buf_len was explicitly provided,
so that the default page size path works on all zcrx-capable drivers.
mlx5 and fbnic only support 4kB pages in the current release.

Fixes: 795663b4d1 ("io_uring/zcrx: implement large rx buffer support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-27 12:32:49 -07:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Linus Torvalds
8934827db5 Merge tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kmalloc_obj conversion from Kees Cook:
 "This does the tree-wide conversion to kmalloc_obj() and friends using
  coccinelle, with a subsequent small manual cleanup of whitespace
  alignment that coccinelle does not handle.

  This uncovered a clang bug in __builtin_counted_by_ref(), so the
  conversion is preceded by disabling that for current versions of
  clang.  The imminent clang 22.1 release has the fix.

  I've done allmodconfig build tests for x86_64, arm64, i386, and arm. I
  did defconfig builds for alpha, m68k, mips, parisc, powerpc, riscv,
  s390, sparc, sh, arc, csky, xtensa, hexagon, and openrisc"

* tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  kmalloc_obj: Clean up after treewide replacements
  treewide: Replace kmalloc with kmalloc_obj for non-scalar types
  compiler_types: Disable __builtin_counted_by_ref for Clang
2026-02-21 11:02:58 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Kai Aizen
003049b1c4 io_uring/zcrx: fix user_ref race between scrub and refill paths
The io_zcrx_put_niov_uref() function uses a non-atomic
check-then-decrement pattern (atomic_read followed by separate
atomic_dec) to manipulate user_refs. This is serialized against other
callers by rq_lock, but io_zcrx_scrub() modifies the same counter with
atomic_xchg() WITHOUT holding rq_lock.

On SMP systems, the following race exists:

  CPU0 (refill, holds rq_lock)          CPU1 (scrub, no rq_lock)
  put_niov_uref:
    atomic_read(uref) - 1
    // window opens
                                        atomic_xchg(uref, 0) - 1
                                        return_niov_freelist(niov) [PUSH #1]
    // window closes
    atomic_dec(uref) - wraps to -1
    returns true
    return_niov(niov)
    return_niov_freelist(niov)           [PUSH #2: DOUBLE-FREE]

The same niov is pushed to the freelist twice, causing free_count to
exceed nr_iovs. Subsequent freelist pushes then perform an out-of-bounds
write (a u32 value) past the kvmalloc'd freelist array into the adjacent
slab object.

Fix this by replacing the non-atomic read-then-dec in
io_zcrx_put_niov_uref() with an atomic_try_cmpxchg loop that atomically
tests and decrements user_refs. This makes the operation safe against
concurrent atomic_xchg from scrub without requiring scrub to acquire
rq_lock.

Fixes: 34a3e60821 ("io_uring/zcrx: implement zerocopy receive pp memory provider")
Cc: stable@vger.kernel.org
Signed-off-by: Kai Aizen <kai@snailsploit.com>
[pavel: removed a warning and a comment]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-18 10:39:48 -07:00
Linus Torvalds
7b751b01ad Merge tag 'io_uring-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull more io_uring updates from Jens Axboe:
 "This is a mix of cleanups and fixes. No major fixes in here, just a
  bunch of little fixes. Some of them marked for stable as it fixes
  behavioral issues

   - Fix an issue with SOCKET_URING_OP_SETSOCKOPT for netlink sockets,
     due to a too restrictive check on it having an ioctl handler

   - Remove a redundant SQPOLL check in ring creation

   - Kill dead accounting for zero-copy send, which doesn't use ->buf
     or ->len post the initial setup

   - Fix missing clamp of the allocation hint, which could cause
     allocations to fall outside of the range the application asked
     for. Still within the allowed limits.

   - Fix for IORING_OP_PIPE's handling of direct descriptors

   - Tweak to the API for the newly added BPF filters, making them
     more future proof in terms of how applications deal with them

   - A few fixes for zcrx, fixing a few error handling conditions

   - Fix for zcrx request flag checking

   - Add support for querying the zcrx page size

   - Improve the NO_SQARRAY static branch inc/dec, avoiding busy
     conditions causing too much traffic

   - Various little cleanups"

* tag 'io_uring-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/bpf_filter: pass in expected filter payload size
  io_uring/bpf_filter: move filter size and populate helper into struct
  io_uring/cancel: de-unionize file and user_data in struct io_cancel_data
  io_uring/rsrc: improve regbuf iov validation
  io_uring: remove unneeded io_send_zc accounting
  io_uring/cmd_net: fix too strict requirement on ioctl
  io_uring: delay sqarray static branch disablement
  io_uring/query: add query.h copyright notice
  io_uring/query: return support for custom rx page size
  io_uring/zcrx: check unsupported flags on import
  io_uring/zcrx: fix post open error handling
  io_uring/zcrx: fix sgtable leak on mapping failures
  io_uring: use the right type for creds iteration
  io_uring/openclose: fix io_pipe_fixed() slot tracking for specific slots
  io_uring/filetable: clamp alloc_hint to the configured alloc range
  io_uring/rsrc: replace reg buffer bit field with flags
  io_uring/zcrx: improve types for size calculation
  io_uring/tctx: avoid modifying loop variable in io_ring_add_registered_file
  io_uring: simplify IORING_SETUP_DEFER_TASKRUN && !SQPOLL check
2026-02-17 08:33:49 -08:00
Pavel Begunkov
7496e658a7 io_uring/zcrx: check unsupported flags on import
The imoorted zcrx registration path checks for ZCRX_REG_IMPORT, as it
should, but doesn't reject any unsupported flags. Fix that.

Cc: stable@vger.kernel.org
Fixes: 00d9148127 ("io_uring/zcrx: share an ifq between rings")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-15 14:55:29 -07:00
Pavel Begunkov
5d540e4508 io_uring/zcrx: fix post open error handling
Closing a queue doesn't guarantee that all associated page pools are
terminated right away, let the refcounting do the work instead of
releasing the zcrx ctx directly.

Cc: stable@vger.kernel.org
Fixes: e0793de24a ("io_uring/zcrx: set pp memory provider for an rx queue")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-14 18:05:08 -07:00
Pavel Begunkov
a983aae397 io_uring/zcrx: fix sgtable leak on mapping failures
In an unlikely case when io_populate_area_dma() fails, which could only
happen on a PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA machine,
io_zcrx_map_area() will have an initialised and not freed table. It was
supposed to be cleaned up in the error path, but !is_mapped prevents
that.

Fixes: 439a98b972 ("io_uring/zcrx: deduplicate area mapping")
Cc: stable@vger.kernel.org
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-14 18:05:00 -07:00
Linus Torvalds
041c16acba Merge tag 'for-7.0/io_uring-zcrx-large-buffers-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring large rx buffer support from Jens Axboe:
 "Now that the networking updates are upstream, here's the support for
  large buffers for zcrx.

  Using larger (bigger than 4K) rx buffers can increase the effiency of
  zcrx. For example, it's been shown that using 32K buffers can decrease
  CPU usage by ~30% compared to 4K buffers"

* tag 'for-7.0/io_uring-zcrx-large-buffers-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/zcrx: implement large rx buffer support
2026-02-12 15:07:50 -08:00
Pavel Begunkov
417d029dc4 io_uring/zcrx: improve types for size calculation
Make sure io_import_umem() promotes the type to long before calculating
the area size. While the area size is capped at 1GB by
io_validate_user_buf_range() and fits into an "int", it's still too
error prone.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-10 05:26:12 -07:00
Pavel Begunkov
af07330e28 io_uring/zcrx: fix rq flush locking
zcrx needs to keep the rq lock for uref manipulations, for now move all
zcrx_return_buffers() under the lock.

Fixes: 475eb39b00 ("io_uring/zcrx: add sync refill queue flushing")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-02 08:19:43 -07:00
Pavel Begunkov
0ae91d8ab7 io_uring/zcrx: fix page array leak
d9f595b9a6 ("io_uring/zcrx: fix leaking pages on sg init fail") fixed
a page leakage but didn't free the page array, release it as well.

Fixes: b84621d96e ("io_uring/zcrx: allocate sgtable for umem areas")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-02 08:19:35 -07:00
Pavel Begunkov
795663b4d1 io_uring/zcrx: implement large rx buffer support
There are network cards that support receive buffers larger than 4K, and
that can be vastly beneficial for performance, and benchmarks for this
patch showed up to 30% CPU util improvement for 32K vs 4K buffers.

Allows zcrx users to specify the size in struct
io_uring_zcrx_ifq_reg::rx_buf_len. If set to zero, zcrx will use a
default value. zcrx will check and fail if the memory backing the area
can't be split into physically contiguous chunks of the required size.
It's more restrictive as it only needs dma addresses to be contig, but
that's beyond this series.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: kill duplicate netdev_queues.h include]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-24 08:33:03 -07:00
David Wei
00d9148127 io_uring/zcrx: share an ifq between rings
Add a way to share an ifq from a src ring that is real (i.e. bound to a
HW RX queue) with other rings. This is done by passing a new flag
IORING_ZCRX_IFQ_REG_IMPORT in the registration struct
io_uring_zcrx_ifq_reg, alongside the fd of an exported zcrx ifq.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
David Wei
0926f94ab3 io_uring/zcrx: add io_fill_zcrx_offsets()
Add a helper io_fill_zcrx_offsets() that sets the constant offsets in
struct io_uring_zcrx_offsets returned to userspace.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pavel Begunkov
d7af80b213 io_uring/zcrx: export zcrx via a file
Add an option to wrap a zcrx instance into a file and expose it to the
user space. Currently, users can't do anything meaningful with the file,
but it'll be used in a next patch to import it into another io_uring
instance. It's implemented as a new op called ZCRX_CTRL_EXPORT for the
IORING_REGISTER_ZCRX_CTRL registration opcode.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
David Wei
742cb2e14e io_uring/zcrx: move io_zcrx_scrub() and dependencies up
In preparation for adding zcrx ifq exporting and importing, move
io_zcrx_scrub() and its dependencies up the file to be closer to
io_close_queue().

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pavel Begunkov
39c9676f78 io_uring/zcrx: count zcrx users
zcrx tries to detach ifq / terminate page pools when the io_uring ctx
owning it is being destroyed. There will be multiple io_uring instances
attached to it in the future, so add a separate counter to track the
users. Note, refs can't be reused for this purpose as it only used to
prevent zcrx and rings destruction, and also used by page pools to keep
it alive.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pavel Begunkov
475eb39b00 io_uring/zcrx: add sync refill queue flushing
Add an zcrx interface via IORING_REGISTER_ZCRX_CTRL that forces the
kernel to flush / consume entries from the refill queue. Just as with
the IORING_REGISTER_ZCRX_REFILL attempt, the motivation is to address
cases where the refill queue becomes full, and the user can't return
buffers and needs to stash them. It's still a slow path, and the user
should size refill queue appropriately, but it should be helpful for
handling temporary traffic spikes and other unpredictable conditions.

The interface is simpler comparing to ZCRX_REFILL as it doesn't need
temporary refill entry arrays and gives natural batching, whereas
ZCRX_REFILL requires even more user logic to be somewhat efficient.

Also, add a structure for the operation. It's not currently used but
can serve for future improvements like limiting the number of buffers to
process, etc.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pavel Begunkov
d663976dad io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
It'll be annoying and take enough of boilerplate code to implement
new zcrx features as separate io_uring register opcode. Introduce
IORING_REGISTER_ZCRX_CTRL that will multiplex such calls to zcrx.
Note, there are no real users of the opcode in this patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pedro Demarchi Gomes
a0169c3a62 io_uring/zcrx: use folio_nr_pages() instead of shift operation
folio_nr_pages() is a faster helper function to get the number of pages when
NR_PAGES_IN_LARGE_FOLIO is enabled.

Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00