This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
Use the new TRAILING_OVERLAP() helper to fix a misalignment bug
along with the following warning:
drivers/net/virtio_net.c:429:46: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
This helper creates a union between a flexible-array member (FAM)
and a set of members that would otherwise follow it (in this case
`u8 rss_hash_key_data[VIRTIO_NET_RSS_MAX_KEY_SIZE];`). This
overlays the trailing members (rss_hash_key_data) onto the FAM
(hash_key_data) while keeping the FAM and the start of MEMBERS aligned.
The static_assert() ensures this alignment remains.
Notice that due to tail padding in flexible `struct
virtio_net_rss_config_trailer`, `rss_trailer.hash_key_data`
(at offset 83 in struct virtnet_info) and `rss_hash_key_data` (at
offset 84 in struct virtnet_info) are misaligned by one byte. See
below:
struct virtio_net_rss_config_trailer {
__le16 max_tx_vq; /* 0 2 */
__u8 hash_key_length; /* 2 1 */
__u8 hash_key_data[]; /* 3 0 */
/* size: 4, cachelines: 1, members: 3 */
/* padding: 1 */
/* last cacheline: 4 bytes */
};
struct virtnet_info {
...
struct virtio_net_rss_config_trailer rss_trailer; /* 80 4 */
/* XXX last struct has 1 byte of padding */
u8 rss_hash_key_data[40]; /* 84 40 */
...
/* size: 832, cachelines: 13, members: 48 */
/* sum members: 801, holes: 8, sum holes: 31 */
/* paddings: 2, sum paddings: 5 */
};
After changes, those members are correctly aligned at offset 795:
struct virtnet_info {
...
union {
struct virtio_net_rss_config_trailer rss_trailer; /* 792 4 */
struct {
unsigned char __offset_to_hash_key_data[3]; /* 792 3 */
u8 rss_hash_key_data[40]; /* 795 40 */
}; /* 792 43 */
}; /* 792 44 */
...
/* size: 840, cachelines: 14, members: 47 */
/* sum members: 801, holes: 8, sum holes: 35 */
/* padding: 4 */
/* paddings: 1, sum paddings: 4 */
/* last cacheline: 8 bytes */
};
As a result, the RSS key passed to the device is shifted by 1
byte: the last byte is cut off, and instead a (possibly
uninitialized) byte is added at the beginning.
As a last note `struct virtio_net_rss_config_hdr *rss_hdr;` is also
moved to the end, since it seems those three members should stick
around together. :)
Cc: stable@vger.kernel.org
Fixes: ed3100e90d ("virtio_net: Use new RSS config structs")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/aWIItWq5dV9XTTCJ@kspp
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When we fail to refill the receive buffers, we schedule a delayed worker
to retry later. However, this worker creates some concurrency issues.
For example, when the worker runs concurrently with virtnet_xdp_set,
both need to temporarily disable queue's NAPI before enabling again.
Without proper synchronization, a deadlock can happen when
napi_disable() is called on an already disabled NAPI. That
napi_disable() call will be stuck and so will the subsequent
napi_enable() call.
To simplify the logic and avoid further problems, we will instead retry
refilling in the next NAPI poll.
Fixes: 4bc12818b3 ("virtio-net: disable delayed refill when pausing rx")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Closes: https://lore.kernel.org/526b5396-459d-4d02-8635-a222d07b46d7@redhat.com
Cc: stable@vger.kernel.org
Suggested-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260106150438.7425-2-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit a2fb4bc4e2 ("net: implement virtio helpers to handle UDP
GSO tunneling.") inadvertently altered checksum offload behavior
for guests not using UDP GSO tunneling.
Before, tun_put_user called tun_vnet_hdr_from_skb, which passed
has_data_valid = true to virtio_net_hdr_from_skb.
After, tun_put_user began calling tun_vnet_hdr_tnl_from_skb instead,
which passes has_data_valid = false into both call sites.
This caused virtio hdr flags to not include VIRTIO_NET_HDR_F_DATA_VALID
for SKBs where skb->ip_summed == CHECKSUM_UNNECESSARY. As a result,
guests are forced to recalculate checksums unnecessarily.
Restore the previous behavior by ensuring has_data_valid = true is
passed in the !tnl_gso_type case, but only from tun side, as
virtio_net_hdr_tnl_from_skb() is used also by the virtio_net driver,
which in turn must not use VIRTIO_NET_HDR_F_DATA_VALID on tx.
cc: stable@vger.kernel.org
Fixes: a2fb4bc4e2 ("net: implement virtio helpers to handle UDP GSO tunneling.")
Signed-off-by: Jon Kohler <jon@nutanix.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20251125222754.1737443-1-jon@nutanix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch refines and strengthens the statistics collection of TX queue
wake/stop events introduced by commit c39add9b24 ("virtio_net: Add TX
stopped and wake counters").
Previously, the driver only recorded partial wake/stop statistics
for TX queues. Some wake events triggered by 'skb_xmit_done()' or resume
operations were not counted, which made the per-queue metrics incomplete.
Signed-off-by: Liming Wu <liming.wu@jaguarmicro.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20251120015320.1418-1-liming.wu@jaguarmicro.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.18-rc6).
No conflicts, adjacent changes in:
drivers/net/phy/micrel.c
96a9178a29 ("net: phy: micrel: lan8814 fix reset of the QSGMII interface")
61b7ade9ba ("net: phy: micrel: Add support for non PTP SKUs for lan8814")
and a trivial one in tools/testing/selftests/drivers/net/Makefile.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The purpose of commit 703eec1b24 ("virtio_net: fixing XDP for fully
checksummed packets handling") is to record the flags in advance, as
their value may be overwritten in the XDP case. However, the flags
recorded under big mode are incorrect, because in big mode, the passed
buf does not point to the rx buffer, but rather to the page of the
submitted buffer. This commit fixes this issue.
For the small mode, the commit c11a49d58a ("virtio_net: Fix mismatched
buf address when unmapping for small packets") fixed it.
Tested-by: Alyssa Ross <hi@alyssa.is>
Fixes: 703eec1b24 ("virtio_net: fixing XDP for fully checksummed packets handling")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20251111090828.23186-1-xuanzhuo@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cross-merge networking fixes after downstream PR (net-6.18-rc5).
Conflicts:
drivers/net/wireless/ath/ath12k/mac.c
9222582ec5 ("Revert "wifi: ath12k: Fix missing station power save configuration"")
6917e268c4 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon")
https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net
Adjacent changes:
drivers/net/ethernet/intel/Kconfig
b1d16f7c00 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG")
93f53db9f9 ("ice: switch to Page Pool")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 4959aebba8 ("virtio-net: use mtu size as buffer length
for big packets"), when guest gso is off, the allocated size for big
packets is not MAX_SKB_FRAGS * PAGE_SIZE anymore but depends on
negotiated MTU. The number of allocated frags for big packets is stored
in vi->big_packets_num_skbfrags.
Because the host announced buffer length can be malicious (e.g. the host
vhost_net driver's get_rx_bufs is modified to announce incorrect
length), we need a check in virtio_net receive path. Currently, the
check is not adapted to the new change which can lead to NULL page
pointer dereference in the below while loop when receiving length that
is larger than the allocated one.
This commit fixes the received length check corresponding to the new
change.
Fixes: 4959aebba8 ("virtio-net: use mtu size as buffer length for big packets")
Cc: stable@vger.kernel.org
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Link: https://patch.msgid.link/20251030144438.7582-1-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Changing alignment of header would mean it's no longer safe to cast a
2 byte aligned pointer between formats. Use two 16 bit fields to make
it 2 byte aligned as previously.
This fixes the performance regression since
commit ("virtio_net: enable gso over UDP tunnel support.") as it uses
virtio_net_hdr_v1_hash_tunnel which embeds
virtio_net_hdr_v1_hash. Pktgen in guest + XDP_DROP on TAP + vhost_net
shows the TX PPS is recovered from 2.4Mpps to 4.45Mpps.
Fixes: 56a06bd40f ("virtio_net: enable gso over UDP tunnel support.")
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Link: https://patch.msgid.link/20251031060551.126-1-jasowang@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In virtio-net, we have not yet supported multi-buffer XDP packet in
zerocopy mode when there is a binding XDP program. However, in that
case, when receiving multi-buffer XDP packet, we skip the XDP program
and return XDP_PASS. As a result, the packet is passed to normal network
stack which is an incorrect behavior (e.g. a XDP program for packet
count is installed, multi-buffer XDP packet arrives and does go through
XDP program. As a result, the packet count does not increase but the
packet is still received from network stack).This commit instead returns
XDP_ABORTED in that case.
Fixes: 99c861b44e ("virtio_net: xsk: rx: support recv merge mode")
Cc: stable@vger.kernel.org
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20251022155630.49272-1-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull virtio updates from Michael Tsirkin:
"Just fixes and cleanups this time around. The mapping cleanups are
preparing the ground for new features, though"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio-vdpa: Drop redundant conversion to bool
vduse: Use fixed 4KB bounce pages for non-4KB page size
vduse: switch to use virtio map API instead of DMA API
vdpa: introduce map ops
vdpa: support virtio_map
virtio: introduce map ops in virtio core
virtio_ring: rename dma_handle to map_handle
virtio: introduce virtio_map container union
virtio: rename dma helpers
virtio_ring: switch to use dma_{map|unmap}_page()
virtio_ring: constify virtqueue pointer for DMA helpers
virtio_balloon: Remove redundant __GFP_NOWARN
vhost: vringh: Fix copy_to_iter return value check
vhost: vringh: Modify the return value check
Replace the existing virtnet_get_rxnfc callback with a dedicated
virtnet_get_rxrings implementation to provide the number of RX rings
directly via the new ethtool_ops get_rx_ring_count pointer.
This simplifies the RX ring count retrieval and aligns virtio_net with
the new ethtool API for querying RX ring parameters.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250917-gxrings-v4-8-dae520e2e1cb@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xdp_update_skb_shared_info() needs to update skb state which
was maintained in xdp_buff / frame. Pass full flags into it,
instead of breaking it out bit by bit. We will need to add
a bit for unreadable frags (even tho XDP doesn't support
those the driver paths may be common), at which point almost
all call sites would become:
xdp_update_skb_shared_info(skb, num_frags,
sinfo->xdp_frags_size,
MY_PAGE_SIZE * num_frags,
xdp_buff_is_frag_pfmemalloc(xdp),
xdp_buff_is_frag_unreadable(xdp));
Keep a helper for accessing the flags, in case we need to
transform them somehow in the future (e.g. to cover up xdp_buff
vs xdp_frame differences).
While we are touching call callers - rename the helper to
xdp_update_skb_frags_info(), previous name may have implied that
it's shinfo that's updated. We are updating flags in struct sk_buff
based on frags that got attched.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/20250905221539.2930285-2-kuba@kernel.org
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
"Use after free" issue appears in suspend once race occurs when
napi poll scheduls after `netif_device_detach` and before napi disables.
For details, during suspend flow of virtio-net,
the tx queue state is set to "__QUEUE_STATE_DRV_XOFF" by CPU-A.
And at some coincidental times, if a TCP connection is still working,
CPU-B does `virtnet_poll` before napi disable.
In this flow, the state "__QUEUE_STATE_DRV_XOFF"
of tx queue will be cleared. This is not the normal process it expects.
After that, CPU-A continues to close driver then virtqueue is removed.
Sequence likes below:
--------------------------------------------------------------------------
CPU-A CPU-B
----- -----
suspend is called A TCP based on
virtio-net still work
virtnet_freeze
|- virtnet_freeze_down
| |- netif_device_detach
| | |- netif_tx_stop_all_queues
| | |- netif_tx_stop_queue
| | |- set_bit
| | (__QUEUE_STATE_DRV_XOFF,...)
| | softirq rasied
| | |- net_rx_action
| | |- napi_poll
| | |- virtnet_poll
| | |- virtnet_poll_cleantx
| | |- netif_tx_wake_queue
| | |- test_and_clear_bit
| | (__QUEUE_STATE_DRV_XOFF,...)
| |- virtnet_close
| |- virtnet_disable_queue_pair
| |- virtnet_napi_tx_disable
|- remove_vq_common
--------------------------------------------------------------------------
When TCP delayack timer is up, a cpu gets softirq and irq handler
`tcp_delack_timer_handler` will be called, which will finally call
`start_xmit` in virtio net driver.
Then the access to tx virtq will cause panic.
The root cause of this issue is that napi tx
is not disable before `netif_tx_stop_queue`,
once `virnet_poll` schedules in such coincidental time,
the tx queue state will be cleared.
To solve this issue, adjusts the order of
function `virtnet_close` in `virtnet_freeze_down`.
Co-developed-by: Ying Xu <ying123.xu@samsung.com>
Signed-off-by: Ying Xu <ying123.xu@samsung.com>
Signed-off-by: Junnan Wu <junnan01.wu@samsung.com>
Message-Id: <20250812090817.3463403-1-junnan01.wu@samsung.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The deadlock appears in a stack trace like:
virtnet_probe()
rtnl_lock()
virtio_config_changed_work()
netdev_notify_peers()
rtnl_lock()
It happens if the VMM sends a VIRTIO_NET_S_ANNOUNCE request while the
virtio-net driver is still probing.
The config_work in probe() will get scheduled until virtnet_open() enables
the config change notification via virtio_config_driver_enable().
Fixes: df28de7b00 ("virtio-net: synchronize operstate with admin state on up/down")
Signed-off-by: Zigit Zo <zuozhijie@bytedance.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250716115717.1472430-1-zuozhijie@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni says:
====================
virtio: introduce GSO over UDP tunnel
Some virtualized deployments use UDP tunnel pervasively and are impacted
negatively by the lack of GSO support for such kind of traffic in the
virtual NIC driver.
The virtio_net specification recently introduced support for GSO over
UDP tunnel, this series updates the virtio implementation to support
such a feature.
Currently the kernel virtio support limits the feature space to 64,
while the virtio specification allows for a larger number of features.
Specifically the GSO-over-UDP-tunnel-related virtio features use bits
65-69.
The first four patches in this series rework the virtio and vhost
feature support to cope with up to 128 bits. The limit is set by
a define and could be easily raised in future, as needed.
This implementation choice is aimed at keeping the code churn as
limited as possible. For the same reason, only the virtio_net driver is
reworked to leverage the extended feature space; all other
virtio/vhost drivers are unaffected, but could be upgraded to support
the extended features space in a later time.
The last four patches bring in the actual GSO over UDP tunnel support.
As per specification, some additional fields are introduced into the
virtio net header to support the new offload. The presence of such
fields depends on the negotiated features.
New helpers are introduced to convert the UDP-tunneled skb metadata to
an extended virtio net header and vice versa. Such helpers are used by
the tun and virtio_net driver to cope with the newly supported offloads.
Tested with basic stream transfer with all the possible permutations of
host kernel/qemu/guest kernel with/without GSO over UDP tunnel support.
====================
Link: https://patch.msgid.link/cover.1751874094.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the related virtio feature is set, enable transmission and reception
of gso over UDP tunnel packets.
Most of the work is done by the previously introduced helper, just need
to determine the UDP tunnel features inside the virtio_net_hdr and
update accordingly the virtio net hdr size.
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The virtio_net driver needs it to implement GSO over UDP tunnel
offload.
The only missing piece is mapping them to/from the extended
features.
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The `tx_may_stop()` logic stops TX queues if free descriptors
(`sq->vq->num_free`) fall below the threshold of (`MAX_SKB_FRAGS` + 2).
If the total ring size (`ring_num`) is not strictly greater than this
value, queues can become persistently stopped or stop after minimal
use, severely degrading performance.
A single sk_buff transmission typically requires descriptors for:
- The virtio_net_hdr (1 descriptor)
- The sk_buff's linear data (head) (1 descriptor)
- Paged fragments (up to MAX_SKB_FRAGS descriptors)
This patch enforces that the TX ring size ('ring_num') must be strictly
greater than (MAX_SKB_FRAGS + 2). This ensures that the ring is
always large enough to hold at least one maximally-fragmented packet
plus at least one additional slot.
Reported-by: Lei Yang <leiyang@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250521092236.661410-4-lvivier@redhat.com
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When calling buf_to_xdp, the len argument is the frame data's length
without virtio header's length (vi->hdr_len). We check that len with
xsk_pool_get_rx_frame_size() + vi->hdr_len
to ensure the provided len does not larger than the allocated chunk
size. The additional vi->hdr_len is because in virtnet_add_recvbuf_xsk,
we use part of XDP_PACKET_HEADROOM for virtio header and ask the vhost
to start placing data from
hard_start + XDP_PACKET_HEADROOM - vi->hdr_len
not
hard_start + XDP_PACKET_HEADROOM
But the first buffer has virtio_header, so the maximum frame's length in
the first buffer can only be
xsk_pool_get_rx_frame_size()
not
xsk_pool_get_rx_frame_size() + vi->hdr_len
like in the current check.
This commit adds an additional argument to buf_to_xdp differentiate
between the first buffer and other ones to correctly calculate the maximum
frame's length.
Cc: stable@vger.kernel.org
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Fixes: a4e7ba7027 ("virtio_net: xsk: rx: support recv small mode")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250630151315.86722-2-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
NIPA tests report that the interface statistics reported
via qstat are lower than those reported via ip link.
Looks like this is because some tests flip the queue
count up and down, and we end up with some of the traffic
accounted on disabled queues.
Add up counters from disabled queues.
Fixes: d888f04c09 ("virtio-net: support queue stat")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250507003221.823267-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit 4bc12818b3 ("virtio-net: disable delayed refill when pausing rx")
fixed a deadlock between reconfig paths and refill work trying to disable
the same NAPI instance. The refill work can't run in parallel with reconfig
because trying to double-disable a NAPI instance causes a stall under the
instance lock, which the reconfig path needs to re-enable the NAPI and
therefore unblock the stalled thread.
There are two cases where we re-enable refill too early. One is in the
virtnet_set_queues() handler. We call it when installing XDP:
virtnet_rx_pause_all(vi);
...
virtnet_napi_tx_disable(..);
...
virtnet_set_queues(..);
...
virtnet_rx_resume_all(..);
We want the work to be disabled until we call virtnet_rx_resume_all(),
but virtnet_set_queues() kicks it before NAPIs were re-enabled.
The other case is a more trivial case of mis-ordering in
__virtnet_rx_resume() found by code inspection.
Taking the spin lock in virtnet_set_queues() (requested during review)
may be unnecessary as we are under rtnl_lock and so are all paths writing
to ->refill_enabled.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Bui Quang Minh <minhquangbui99@gmail.com>
Fixes: 4bc12818b3 ("virtio-net: disable delayed refill when pausing rx")
Fixes: 413f0271f3 ("net: protect NAPI enablement with netdev_lock()")
Link: https://patch.msgid.link/20250430163758.3029367-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When pausing rx (e.g. set up xdp, xsk pool, rx resize), we call
napi_disable() on the receive queue's napi. In delayed refill_work, it
also calls napi_disable() on the receive queue's napi. When
napi_disable() is called on an already disabled napi, it will sleep in
napi_disable_locked while still holding the netdev_lock. As a result,
later napi_enable gets stuck too as it cannot acquire the netdev_lock.
This leads to refill_work and the pause-then-resume tx are stuck
altogether.
This scenario can be reproducible by binding a XDP socket to virtio-net
interface without setting up the fill ring. As a result, try_fill_recv
will fail until the fill ring is set up and refill_work is scheduled.
This commit adds virtnet_rx_(pause/resume)_all helpers and fixes up the
virtnet_rx_resume to disable future and cancel all inflights delayed
refill_work before calling napi_disable() to pause the rx.
Fixes: 413f0271f3 ("net: protect NAPI enablement with netdev_lock()")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250417072806.18660-2-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Continue Netlink conversions to per-namespace RTNL lock
(IPv4 routing, routing rules, routing next hops, ARP ioctls)
- Continue extending the use of netdev instance locks. As a driver
opt-in protect queue operations and (in due course) ethtool
operations with the instance lock and not RTNL lock.
- Support collecting TCP timestamps (data submitted, sent, acked) in
BPF, allowing for transparent (to the application) and lower
overhead tracking of TCP RPC performance.
- Tweak existing networking Rx zero-copy infra to support zero-copy
Rx via io_uring.
- Optimize MPTCP performance in single subflow mode by 29%.
- Enable GRO on packets which went thru XDP CPU redirect (were queued
for processing on a different CPU). Improving TCP stream
performance up to 2x.
- Improve performance of contended connect() by 200% by searching for
an available 4-tuple under RCU rather than a spin lock. Bring an
additional 229% improvement by tweaking hash distribution.
- Avoid unconditionally touching sk_tsflags on RX, improving
performance under UDP flood by as much as 10%.
- Avoid skb_clone() dance in ping_rcv() to improve performance under
ping flood.
- Avoid FIB lookup in netfilter if socket is available, 20% perf win.
- Rework network device creation (in-kernel) API to more clearly
identify network namespaces and their roles. There are up to 4
namespace roles but we used to have just 2 netns pointer arguments,
interpreted differently based on context.
- Use sysfs_break_active_protection() instead of trylock to avoid
deadlocks between unregistering objects and sysfs access.
- Add a new sysctl and sockopt for capping max retransmit timeout in
TCP.
- Support masking port and DSCP in routing rule matches.
- Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.
- Support specifying at what time packet should be sent on AF_XDP
sockets.
- Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin
users.
- Add Netlink YAML spec for WiFi (nl80211) and conntrack.
- Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
which only need to be exported when IPv6 support is built as a
module.
- Age FDB entries based on Rx not Tx traffic in VxLAN, similar to
normal bridging.
- Allow users to specify source port range for GENEVE tunnels.
- netconsole: allow attaching kernel release, CPU ID and task name to
messages as metadata
Driver API:
- Continue rework / fixing of Energy Efficient Ethernet (EEE) across
the SW layers. Delegate the responsibilities to phylink where
possible. Improve its handling in phylib.
- Support symmetric OR-XOR RSS hashing algorithm.
- Support tracking and preserving IRQ affinity by NAPI itself.
- Support loopback mode speed selection for interface selftests.
Device drivers:
- Remove the IBM LCS driver for s390
- Remove the sb1000 cable modem driver
- Add support for SFP module access over SMBus
- Add MCTP transport driver for MCTP-over-USB
- Enable XDP metadata support in multiple drivers
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- add PCIe TLP Processing Hints (TPH) support for new AMD
platforms
- support dumping RoCE queue state for debug
- opt into instance locking
- Intel (100G, ice, idpf):
- ice: rework MSI-X IRQ management and distribution
- ice: support for E830 devices
- iavf: add support for Rx timestamping
- iavf: opt into instance locking
- nVidia/Mellanox:
- mlx4: use page pool memory allocator for Rx
- mlx5: support for one PTP device per hardware clock
- mlx5: support for 200Gbps per-lane link modes
- mlx5: move IPSec policy check after decryption
- AMD/Solarflare:
- support FW flashing via devlink
- Cisco (enic):
- use page pool memory allocator for Rx
- enable 32, 64 byte CQEs
- get max rx/tx ring size from the device
- Meta (fbnic):
- support flow steering and RSS configuration
- report queue stats
- support TCP segmentation
- support IRQ coalescing
- support ring size configuration
- Marvell/Cavium:
- support AF_XDP
- Wangxun:
- support for PTP clock and timestamping
- Huawei (hibmcge):
- checksum offload
- add more statistics
- Ethernet virtual:
- VirtIO net:
- aggressively suppress Tx completions, improve perf by 96%
with 1 CPU and 55% with 2 CPUs
- expose NAPI to IRQ mapping and persist NAPI settings
- Google (gve):
- support XDP in DQO RDA Queue Format
- opt into instance locking
- Microsoft vNIC:
- support BIG TCP
- Ethernet NICs consumer, and embedded:
- Synopsys (stmmac):
- cleanup Tx and Tx clock setting and other link-focused
cleanups
- enable SGMII and 2500BASEX mode switching for Intel platforms
- support Sophgo SG2044
- Broadcom switches (b53):
- support for BCM53101
- TI:
- iep: add perout configuration support
- icssg: support XDP
- Cadence (macb):
- implement BQL
- Xilinx (axinet):
- support dynamic IRQ moderation and changing coalescing at
runtime
- implement BQL
- report standard stats
- MediaTek:
- support phylink managed EEE
- Intel:
- igc: don't restart the interface on every XDP program change
- RealTek (r8169):
- support reading registers of internal PHYs directly
- increase max jumbo packet size on RTL8125/RTL8126
- Airoha:
- support for RISC-V NPU packet processing unit
- enable scatter-gather and support MTU up to 9kB
- Tehuti (tn40xx):
- support cards with TN4010 MAC and an Aquantia AQR105 PHY
- Ethernet PHYs:
- support for TJA1102S, TJA1121
- dp83tg720: add randomized polling intervals for link detection
- dp83822: support changing the transmit amplitude voltage
- support for LEDs on 88q2xxx
- CAN:
- canxl: support Remote Request Substitution bit access
- flexcan: add S32G2/S32G3 SoC
- WiFi:
- remove cooked monitor support
- strict mode for better AP testing
- basic EPCS support
- OMI RX bandwidth reduction support
- batman-adv: add support for jumbo frames
- WiFi drivers:
- RealTek (rtw88):
- support RTL8814AE and RTL8814AU
- RealTek (rtw89):
- switch using wiphy_lock and wiphy_work
- add BB context to manipulate two PHY as preparation of MLO
- improve BT-coexistence mechanism to play A2DP smoothly
- Intel (iwlwifi):
- add new iwlmld sub-driver for latest HW/FW combinations
- MediaTek (mt76):
- preparation for mt7996 Multi-Link Operation (MLO) support
- Qualcomm/Atheros (ath12k):
- continued work on MLO
- Silabs (wfx):
- Wake-on-WLAN support
- Bluetooth:
- add support for skb TX SND/COMPLETION timestamping
- hci_core: enable buffer flow control for SCO/eSCO
- coredump: log devcd dumps into the monitor
- Bluetooth drivers:
- intel: add support to configure TX power
- nxp: handle bootloader error during cmd5 and cmd7"
* tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits)
unix: fix up for "apparmor: add fine grained af_unix mediation"
mctp: Fix incorrect tx flow invalidation condition in mctp-i2c
net: usb: asix: ax88772: Increase phy_name size
net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string
net: libwx: fix Tx L4 checksum
net: libwx: fix Tx descriptor content for some tunnel packets
atm: Fix NULL pointer dereference
net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards
net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card
net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus
net: phy: aquantia: add essential functions to aqr105 driver
net: phy: aquantia: search for firmware-name in fwnode
net: phy: aquantia: add probe function to aqr105 for firmware loading
net: phy: Add swnode support to mdiobus_scan
gve: add XDP DROP and PASS support for DQ
gve: update XDP allocation path support RX buffer posting
gve: merge packet buffer size fields
gve: update GQ RX to use buf_size
gve: introduce config-based allocation for XDP
gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics
...
Use netif_queue_set_napi to map NAPIs to queue IDs so that the mapping
can be accessed by user apps. Note that the netif_queue_set_napi
currently requires RTNL, so care must be taken to ensure RTNL is held on
paths where this API might be reached.
The paths in the driver where this API can be reached appear to be:
- ndo_open, ndo_close, which hold RTNL so no driver change is needed.
- rx_pause, rx_resume, tx_pause, tx_resume are reached either via
an ethtool ioctl or via XSK - neither path requires a driver change.
- power management paths (which call open and close), which have been
updated to hold/release RTNL.
$ ethtool -i ens4 | grep driver
driver: virtio_net
$ sudo ethtool -L ens4 combined 4
$ ./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--dump queue-get --json='{"ifindex": 2}'
[{'id': 0, 'ifindex': 2, 'napi-id': 8289, 'type': 'rx'},
{'id': 1, 'ifindex': 2, 'napi-id': 8290, 'type': 'rx'},
{'id': 2, 'ifindex': 2, 'napi-id': 8291, 'type': 'rx'},
{'id': 3, 'ifindex': 2, 'napi-id': 8292, 'type': 'rx'},
{'id': 0, 'ifindex': 2, 'type': 'tx'},
{'id': 1, 'ifindex': 2, 'type': 'tx'},
{'id': 2, 'ifindex': 2, 'type': 'tx'},
{'id': 3, 'ifindex': 2, 'type': 'tx'}]
Note that virtio_net has TX-only NAPIs which do not have NAPI IDs, so
the lack of 'napi-id' in the above output is expected.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://patch.msgid.link/20250307011215.266806-4-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The inner loop may be replaced with the dedicated for_each_online_cpu_wrap.
Use it as it improves readability and simplifies maintenance.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Nick Child <nnac123@linux.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>