linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-18 06:44:00 -04:00

Author	SHA1	Message	Date
Jakub Kicinski	35c2c39832	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Merge in late fixes in preparation for the net-next PR. Conflicts: include/net/sch_generic.h `a6bd339dbb` ("net_sched: fix skb memory leak in deferred qdisc drops") `ff2998f29f` ("net: sched: introduce qdisc-specific drop reason tracing") https://lore.kernel.org/adz0iX85FHMz0HdO@sirena.org.uk drivers/net/ethernet/airoha/airoha_eth.c `1acdfbdb51` ("net: airoha: Fix VIP configuration for AN7583 SoC") `bf3471e6e6` ("net: airoha: Make flow control source port mapping dependent on nbq parameter") Adjacent changes: drivers/net/ethernet/airoha/airoha_ppe.c `f44218cd5e` ("net: airoha: Reset PPE cpu port configuration in airoha_ppe_hw_init()") `7da62262ec` ("inet: add ip_local_port_step_width sysctl to improve port usage distribution") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-14 12:04:00 -07:00
Norbert Szetei	d114bfdc9b	vsock: fix buffer size clamping order In vsock_update_buffer_size(), the buffer size was being clamped to the maximum first, and then to the minimum. If a user sets a minimum buffer size larger than the maximum, the minimum check overrides the maximum check, inverting the constraint. This breaks the intended socket memory boundaries by allowing the vsk->buffer_size to grow beyond the configured vsk->buffer_max_size. Fix this by checking the minimum first, and then the maximum. This ensures the buffer size never exceeds the buffer_max_size. Fixes: `b9f2b0ffde` ("vsock: handle buffer_size sockopts in the core") Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Norbert Szetei <norbert@doyensec.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/180118C5-8BCF-4A63-A305-4EE53A34AB9C@doyensec.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:31:50 -07:00
Luigi Leonardi	006679268a	vsock/virtio: remove unnecessary call to `virtio_transport_get_ops` `virtio_transport_send_pkt_info` gets all the transport information from the parameter `t_ops`. There is no need to call `virtio_transport_get_ops()`. Remove it. Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260408-remove_parameter-v2-1-e00f31cf7a17@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:57:01 -07:00
Laurence Rowe	98f28d8d6e	vsock: avoid timeout for non-blocking accept() with empty backlog A common pattern in epoll network servers is to eagerly accept all pending connections from the non-blocking listening socket after epoll_wait indicates the socket is ready by calling accept in a loop until EAGAIN is returned indicating that the backlog is empty. Scheduling a timeout for a non-blocking accept with an empty backlog meant AF_VSOCK sockets used by epoll network servers incurred hundreds of microseconds of additional latency per accept loop compared to AF_INET or AF_UNIX sockets. Signed-off-by: Laurence Rowe <laurencerowe@gmail.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260402204918.130395-1-laurencerowe@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-06 18:29:01 -07:00
Jakub Kicinski	8ffb33d770	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc7). Conflicts: net/vmw_vsock/af_vsock.c `b18c833888` ("vsock: initialize child_ns_mode_locked in vsock_net_init()") `0de607dc4f` ("vsock: add G2H fallback for CIDs not owned by H2G transport") Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c `ceee35e567` ("bnxt_en: Refactor some basic ring setup and adjustment logic") `57cdfe0dc7` ("bnxt_en: Resize RSS contexts on channel count change") drivers/net/wireless/intel/iwlwifi/mld/mac80211.c `4d56037a02` ("wifi: iwlwifi: mld: block EMLSR during TDLS connections") `687a95d204` ("wifi: iwlwifi: mld: correctly set wifi generation data") drivers/net/wireless/intel/iwlwifi/mld/scan.h `b6045c899e` ("wifi: iwlwifi: mld: Refactor scan command handling") `ec66ec6a5a` ("wifi: iwlwifi: mld: Fix MLO scan timing") drivers/net/wireless/intel/iwlwifi/mvm/fw.c `078df640ef` ("wifi: iwlwifi: mld: add support for iwl_mcc_allowed_ap_type_cmd v 2") `323156c354` ("wifi: iwlwifi: mvm: don't send a 6E related command when not supported") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-02 11:03:13 -07:00
Stefano Garzarella	b18c833888	vsock: initialize child_ns_mode_locked in vsock_net_init() The `child_ns_mode_locked` field lives in `struct net`, which persists across vsock module reloads. When the module is unloaded and reloaded, `vsock_net_init()` resets `mode` and `child_ns_mode` back to their default values, but does not reset `child_ns_mode_locked`. The stale lock from the previous module load causes subsequent writes to `child_ns_mode` to silently fail: `vsock_net_set_child_mode()` sees the old lock, skips updating the actual value, and returns success when the requested mode matches the stale lock. The sysctl handler reports no error, but `child_ns_mode` remains unchanged. Steps to reproduce: $ modprobe vsock $ echo local > /proc/sys/net/vsock/child_ns_mode $ cat /proc/sys/net/vsock/child_ns_mode local $ modprobe -r vsock $ modprobe vsock $ echo local > /proc/sys/net/vsock/child_ns_mode $ cat /proc/sys/net/vsock/child_ns_mode global <--- expected "local" Fix this by initializing `child_ns_mode_locked` to 0 (unlocked) in `vsock_net_init()`, so the write-once mechanism works correctly after module reload. Fixes: `102eab95f0` ("vsock: lock down child_ns_mode as write-once") Reported-by: Jin Liu <jinl@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260401092153.28462-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-02 08:18:56 -07:00
Kexin Sun	88c07dff9f	hv_sock: update outdated comment for renamed vsock_stream_recvmsg() The function vsock_stream_recvmsg() was renamed to vsock_connectible_recvmsg() by commit `a9e29e5511` ("af_vsock: update functions for connectible socket"). Update the comment accordingly. Assisted-by: unnamed:deepseek-v3.2 coccinelle Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260321105753.6751-1-kexinsun@smail.nju.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-24 13:16:58 +01:00
Alexander Graf	0de607dc4f	vsock: add G2H fallback for CIDs not owned by H2G transport When no H2G transport is loaded, vsock currently routes all CIDs to the G2H transport (commit `65b422d9b6` ("vsock: forward all packets to the host when no H2G is registered"). Extend that existing behavior: when an H2G transport is loaded but does not claim a given CID, the connection falls back to G2H in the same way. This matters in environments like Nitro Enclaves, where an instance may run nested VMs via vhost-vsock (H2G) while also needing to reach sibling enclaves at higher CIDs through virtio-vsock-pci (G2H). With the old code, any CID > 2 was unconditionally routed to H2G when vhost was loaded, making those enclaves unreachable without setting VMADDR_FLAG_TO_HOST explicitly on every connect. Requiring every application to set VMADDR_FLAG_TO_HOST creates friction: tools like socat, iperf, and others would all need to learn about it. The flag was introduced 6 years ago and I am still not aware of any tool that supports it. Even if there was support, it would be cumbersome to use. The most natural experience is a single CID address space where H2G only wins for CIDs it actually owns, and everything else falls through to G2H, extending the behavior that already exists when H2G is absent. To give user space at least a hint that the kernel applied this logic, automatically set the VMADDR_FLAG_TO_HOST on the remote address so it can determine the path taken via getpeername(). Add a per-network namespace sysctl net.vsock.g2h_fallback (default 1). At 0 it forces strict routing: H2G always wins for CID > VMADDR_CID_HOST, or ENODEV if H2G is not loaded. Signed-off-by: Alexander Graf <graf@amazon.com> Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260304230027.59857-1-graf@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-12 10:59:36 +01:00
Eric Dumazet	8341c989ac	net: remove addr_len argument of recvmsg() handlers Use msg->msg_namelen as a place holder instead of a temporary variable, notably in inet[6]_recvmsg(). This removes stack canaries and allows tail-calls. $ scripts/bloat-o-meter -t vmlinux.old vmlinux add/remove: 0/0 grow/shrink: 2/19 up/down: 26/-532 (-506) Function old new delta rawv6_recvmsg 744 767 +23 vsock_dgram_recvmsg 55 58 +3 vsock_connectible_recvmsg 50 47 -3 unix_stream_recvmsg 161 158 -3 unix_seqpacket_recvmsg 62 59 -3 unix_dgram_recvmsg 42 39 -3 tcp_recvmsg 546 543 -3 mptcp_recvmsg 1568 1565 -3 ping_recvmsg 806 800 -6 tcp_bpf_recvmsg_parser 983 974 -9 ip_recv_error 588 576 -12 ipv6_recv_rxpmtu 442 428 -14 udp_recvmsg 1243 1224 -19 ipv6_recv_error 1046 1024 -22 udpv6_recvmsg 1487 1461 -26 raw_recvmsg 465 437 -28 udp_bpf_recvmsg 1027 984 -43 sock_common_recvmsg 103 27 -76 inet_recvmsg 257 175 -82 inet6_recvmsg 257 175 -82 tcp_bpf_recvmsg 663 568 -95 Total: Before=25143834, After=25143328, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260227151120.1346573-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:17:17 -08:00
Linus Torvalds	b9c8fc2cae	Merge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: - bnxt_en: fix deleting of Ntuple filters - wan: farsync: fix use-after-free bugs caused by unfinished tasklets - xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: - mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - gve: fix incorrect buffer cleanup for QPL - team: avoid NETDEV_CHANGEMTU event when unregistering slave - usb: validate USB endpoints" * tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) netfilter: nf_conntrack_h323: fix OOB read in decode_choice() dpaa2-switch: validate num_ifs to prevent out-of-bounds write net: consume xmit errors of GSO frames vsock: document write-once behavior of the child_ns_mode sysctl vsock: lock down child_ns_mode as write-once selftests/vsock: change tests to respect write-once child ns mode net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query net/mlx5: Fix missing devlink lock in SRIOV enable error path net/mlx5: E-switch, Clear legacy flag when moving to switchdev net/mlx5: LAG, disable MPESW in lag_disable_change() net/mlx5: DR, Fix circular locking dependency in dump selftests: team: Add a reference count leak test team: avoid NETDEV_CHANGEMTU event when unregistering slave net: mana: Fix double destroy_workqueue on service rescan PCI path MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER dpll: zl3073x: Remove redundant cleanup in devm_dpll_init() selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0 tcp: re-enable acceptance of FIN packets when RWIN is 0 vsock: Use container_of() to get net namespace in sysctl handlers net: usb: kaweth: validate USB endpoints ...	2026-02-26 08:00:13 -08:00
Bobby Eshleman	102eab95f0	vsock: lock down child_ns_mode as write-once Two administrator processes may race when setting child_ns_mode as one process sets child_ns_mode to "local" and then creates a namespace, but another process changes child_ns_mode to "global" between the write and the namespace creation. The first process ends up with a namespace in "global" mode instead of "local". While this can be detected after the fact by reading ns_mode and retrying, it is fragile and error-prone. Make child_ns_mode write-once so that a namespace manager can set it once and be sure it won't change. Writing a different value after the first write returns -EBUSY. This applies to all namespaces, including init_net, where an init process can write "local" to lock all future namespaces into local mode. Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Co-developed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-2-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:03 +01:00
Greg Kroah-Hartman	5cc619583c	vsock: Use container_of() to get net namespace in sysctl handlers current->nsproxy is should not be accessed directly as syzbot has found that it could be NULL at times, causing crashes. Fix up the af_vsock sysctl handlers to use container_of() to deal with the current net namespace instead of attempting to rely on current. This is the same type of change done in commit `7f5611cbc4` ("rds: sysctl: rds_tcp_{rcv,snd}buf: avoid using current->nsproxy") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Link: https://patch.msgid.link/2026022318-rearview-gallery-ae13@gregkh Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 18:59:18 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/$alloc_objs(.*$, GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Linus Torvalds	8bf22c33e7	Merge tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Netfilter. Current release - new code bugs: - net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT - eth: mlx5e: XSK, Fix unintended ICOSQ change - phy_port: correctly recompute the port's linkmodes - vsock: prevent child netns mode switch from local to global - couple of kconfig fixes for new symbols Previous releases - regressions: - nfc: nci: fix false-positive parameter validation for packet data - net: do not delay zero-copy skbs in skb_attempt_defer_free() Previous releases - always broken: - mctp: ensure our nlmsg responses to user space are zero-initialised - ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data() - fixes for ICMP rate limiting Misc: - intel: fix PCI device ID conflict between i40e and ipw2200" * tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits) net: nfc: nci: Fix parameter validation for packet data net/mlx5e: Use unsigned for mlx5e_get_max_num_channels net/mlx5e: Fix deadlocks between devlink and netdev instance locks net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event net/mlx5: Fix misidentification of write combining CQE during poll loop net/mlx5e: Fix misidentification of ASO CQE during poll loop net/mlx5: Fix multiport device check over light SFs bonding: alb: fix UAF in rlb_arp_recv during bond up/down bnge: fix reserving resources from FW eth: fbnic: Advertise supported XDP features. rds: tcp: fix uninit-value in __inet_bind net/rds: Fix NULL pointer dereference in rds_tcp_accept_one octeontx2-af: Fix default entries mcam entry action net/mlx5e: XSK, Fix unintended ICOSQ change ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zero ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow() inet: move icmp_global_{credit,stamp} to a separate cache line icmp: prevent possible overflow in icmp_global_allow() selftests/net: packetdrill: add ipv4-mapped-ipv6 tests ...	2026-02-19 10:39:08 -08:00
Stefano Garzarella	6a997f38bd	vsock: prevent child netns mode switch from local to global A "local" namespace can change its `child_ns_mode` sysctl to "global", allowing nested namespaces to access global CIDs. This can be exploited by an unprivileged user who gained CAP_NET_ADMIN through a user namespace. Prevent this by rejecting writes that attempt to set `child_ns_mode` to "global" when the current namespace's mode is "local". Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Cc: bobbyeshleman@meta.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260212205916.97533-3-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-13 12:28:38 -08:00
Stefano Garzarella	9dd391493a	vsock: fix child netns mode initialization When a new network namespace is created, vsock_net_init() correctly initializes the namespace's mode by reading the parent's `child_ns_mode` via vsock_net_child_mode(). However, the `child_ns_mode` of the new namespace was always hardcoded to VSOCK_NET_MODE_GLOBAL, regardless of its own mode. This means that if a parent namespace has `child_ns_mode` set to "local", the child namespace correctly gets mode "local", but its `child_ns_mode` is reset to "global". As a result, further nested namespaces will incorrectly get mode "global" instead of inheriting "local", breaking the expected propagation of the mode through nested namespaces. Fix this by initializing `child_ns_mode` to the namespace's own mode, so the setting propagates correctly through all levels of nesting. Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Cc: bobbyeshleman@meta.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260212205916.97533-2-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-13 12:28:38 -08:00
Linus Torvalds	a353e7260b	Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - in-order support in virtio core - multiple address space support in vduse - fixes, cleanups all over the place, notably dma alignment fixes for non-cache-coherent systems * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (59 commits) vduse: avoid adding implicit padding vhost: fix caching attributes of MMIO regions by setting them explicitly vdpa/mlx5: update MAC address handling in mlx5_vdpa_set_attr() vdpa/mlx5: reuse common function for MAC address updates vdpa/mlx5: update mlx_features with driver state check crypto: virtio: Replace package id with numa node id crypto: virtio: Remove duplicated virtqueue_kick in virtio_crypto_skcipher_crypt_req crypto: virtio: Add spinlock protection with virtqueue notification Documentation: Add documentation for VDUSE Address Space IDs vduse: bump version number vduse: add vq group asid support vduse: merge tree search logic of IOTLB_GET_FD and IOTLB_GET_INFO ioctls vduse: take out allocations from vduse_dev_alloc_coherent vduse: remove unused vaddr parameter of vduse_domain_free_coherent vduse: refactor vdpa_dev_add for goto err handling vhost: forbid change vq groups ASID if DRIVER_OK is set vdpa: document set_group_asid thread safety vduse: return internal vq group struct as map token vduse: add vq group support vduse: add v1 API definition ...	2026-02-13 12:02:18 -08:00
Arnd Bergmann	e25dbf561e	vmw_vsock: bypass false-positive Wnonnull warning with gcc-16 The gcc-16.0.1 snapshot produces a false-positive warning that turns into a build failure with CONFIG_WERROR: In file included from arch/x86/include/asm/string.h:6, from net/vmw_vsock/vmci_transport.c:10: In function 'vmci_transport_packet_init', inlined from '__vmci_transport_send_control_pkt.constprop' at net/vmw_vsock/vmci_transport.c:198:2: arch/x86/include/asm/string_32.h:150:25: error: argument 2 null where non-null expected because argument 3 is nonzero [-Werror=nonnull] 150 \| #define memcpy(t, f, n) __builtin_memcpy(t, f, n) \| ^~~~~~~~~~~~~~~~~~~~~~~~~ net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy' 164 \| memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait)); \| ^~~~~~ arch/x86/include/asm/string_32.h:150:25: note: in a call to built-in function '__builtin_memcpy' net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy' 164 \| memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait)); \| ^~~~~~ This seems relatively harmless, and it so far the only instance of this warning I have found. The __vmci_transport_send_control_pkt function is called either with wait=NULL or with one of the type values that pass 'wait' into memcpy() here, but not from the same caller. Replacing the memcpy with a struct assignment is otherwise the same but avoids the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bryan Tan <bryan-bt.tan@broadcom.com> Link: https://patch.msgid.link/20260203163406.2636463-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:40:31 -08:00
Michael S. Tsirkin	f9108dee78	vsock/virtio: reorder fields to reduce padding Reorder struct virtio_vsock fields to place the DMA buffer (event_list) last. This eliminates the padding from aligning the struct size on ARCH_DMA_MINALIGN. Message-ID: <ce44f61af415521e00ab7492aa16d3d19f00bd5e.1769632071.git.mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2026-01-28 15:32:08 -05:00
Bobby Eshleman	a69686327e	vsock: add netns support to virtio transports Add netns support to loopback and vhost. Keep netns disabled for virtio-vsock, but add necessary changes to comply with common API updates. This is the patch in the series when vhost-vsock namespaces actually come online. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-3-2859a7512097@meta.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-27 10:45:38 +01:00
Bobby Eshleman	a6ae12a599	virtio: set skb owner of virtio_transport_reset_no_sock() reply Associate reply packets with the sending socket. When vsock must reply with an RST packet and there exists a sending socket (e.g., for loopback), setting the skb owner to the socket correctly handles reference counting between the skb and sk (i.e., the sk stays alive until the skb is freed). This allows the net namespace to be used for socket lookups for the duration of the reply skb's lifetime, preventing race conditions between the namespace lifecycle and vsock socket search using the namespace pointer. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-2-2859a7512097@meta.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-27 10:45:38 +01:00
Bobby Eshleman	eafb64f40c	vsock: add netns to vsock core Add netns logic to vsock core. Additionally, modify transport hook prototypes to be used by later transport-specific patches (e.g., *_seqpacket_allow()). Namespaces are supported primarily by changing socket lookup functions (e.g., vsock_find_connected_socket()) to take into account the socket namespace and the namespace mode before considering a candidate socket a "match". This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode to report the mode and /proc/sys/net/vsock/child_ns_mode to set the mode for new namespaces. Add netns functionality (initialization, passing to transports, procfs, etc...) to the af_vsock socket layer. Later patches that add netns support to transports depend on this patch. This patch changes the allocation of random ports for connectible vsocks in order to avoid leaking the random port range starting point to other namespaces. dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are modified to take a vsk in order to perform logic on namespace modes. In future patches, the net will also be used for socket lookups in these functions. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-1-2859a7512097@meta.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-27 10:45:38 +01:00
Melbin K Mathew	8ee784fdf0	vsock/virtio: cap TX credit to local buffer size The virtio transports derives its TX credit directly from peer_buf_alloc, which is set from the remote endpoint's SO_VM_SOCKETS_BUFFER_SIZE value. On the host side this means that the amount of data we are willing to queue for a connection is scaled by a guest-chosen buffer size, rather than the host's own vsock configuration. A malicious guest can advertise a large buffer and read slowly, causing the host to allocate a correspondingly large amount of sk_buff memory. The same thing would happen in the guest with a malicious host, since virtio transports share the same code base. Introduce a small helper, virtio_transport_tx_buf_size(), that returns min(peer_buf_alloc, buf_alloc), and use it wherever we consume peer_buf_alloc. This ensures the effective TX window is bounded by both the peer's advertised buffer and our own buf_alloc (already clamped to buffer_max_size via SO_VM_SOCKETS_BUFFER_MAX_SIZE), so a remote peer cannot force the other to queue more data than allowed by its own vsock settings. On an unpatched Ubuntu 22.04 host (~64 GiB RAM), running a PoC with 32 guest vsock connections advertising 2 GiB each and reading slowly drove Slab/SUnreclaim from ~0.5 GiB to ~57 GiB; the system only recovered after killing the QEMU process. That said, if QEMU memory is limited with cgroups, the maximum memory used will be limited. With this patch applied: Before: MemFree: ~61.6 GiB Slab: ~142 MiB SUnreclaim: ~117 MiB After 32 high-credit connections: MemFree: ~61.5 GiB Slab: ~178 MiB SUnreclaim: ~152 MiB Only ~35 MiB increase in Slab/SUnreclaim, no host OOM, and the guest remains responsive. Compatibility with non-virtio transports: - VMCI uses the AF_VSOCK buffer knobs to size its queue pairs per socket based on the local vsk->buffer_* values; the remote side cannot enlarge those queues beyond what the local endpoint configured. - Hyper-V's vsock transport uses fixed-size VMBus ring buffers and an MTU bound; there is no peer-controlled credit field comparable to peer_buf_alloc, and the remote endpoint cannot drive in-flight kernel memory above those ring sizes. - The loopback path reuses virtio_transport_common.c, so it naturally follows the same semantics as the virtio transport. This change is limited to virtio_transport_common.c and thus affects virtio-vsock, vhost-vsock, and loopback, bringing them in line with the "remote window intersected with local policy" behaviour that VMCI and Hyper-V already effectively have. Fixes: `06a8fc7836` ("VSOCK: Introduce virtio_vsock_common.ko") Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Melbin K Mathew <mlbnkm1@gmail.com> [Stefano: small adjustments after changing the previous patch] [Stefano: tweak the commit message] Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20260121093628.9941-4-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-22 15:41:33 +01:00
Melbin K Mathew	3ef3d52a1a	vsock/virtio: fix potential underflow in virtio_transport_get_credit() The credit calculation in virtio_transport_get_credit() uses unsigned arithmetic: ret = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->peer_fwd_cnt); If the peer shrinks its advertised buffer (peer_buf_alloc) while bytes are in flight, the subtraction can underflow and produce a large positive value, potentially allowing more data to be queued than the peer can handle. Reuse virtio_transport_has_space() which already handles this case and add a comment to make it clear why we are doing that. Fixes: `06a8fc7836` ("VSOCK: Introduce virtio_vsock_common.ko") Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Melbin K Mathew <mlbnkm1@gmail.com> [Stefano: use virtio_transport_has_space() instead of duplicating the code] [Stefano: tweak the commit message] Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20260121093628.9941-2-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-22 15:41:33 +01:00
Michal Luczaj	0386bd321d	vsock/virtio: Coalesce only linear skb vsock/virtio common tries to coalesce buffers in rx queue: if a linear skb (with a spare tail room) is followed by a small skb (length limited by GOOD_COPY_LEN = 128), an attempt is made to join them. Since the introduction of MSG_ZEROCOPY support, assumption that a small skb will always be linear is incorrect. In the zerocopy case, data is lost and the linear skb is appended with uninitialized kernel memory. Of all 3 supported virtio-based transports, only loopback-transport is affected. G2H virtio-transport rx queue operates on explicitly linear skbs; see virtio_vsock_alloc_linear_skb() in virtio_vsock_rx_fill(). H2G vhost-transport may allocate non-linear skbs, but only for sizes that are not considered for coalescence; see PAGE_ALLOC_COSTLY_ORDER in virtio_vsock_alloc_skb(). Ensure only linear skbs are coalesced. Note that skb_tailroom(last_skb) > 0 guarantees last_skb is linear. Fixes: `581512a6dc` ("vsock/virtio: MSG_ZEROCOPY flag support") Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260113-vsock-recv-coalescence-v2-1-552b17837cf4@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-15 19:44:44 -08:00
Michael S. Tsirkin	db191ba0c8	vsock/virtio: use virtqueue_add_inbuf_cache_clean for events The event_list array contains 8 small (4-byte) events that share cachelines with each other. When CONFIG_DMA_API_DEBUG is enabled, this can trigger warnings about overlapping DMA mappings within the same cacheline. The previous patch isolated event_list in its own cache lines so the warnings are spurious. Use virtqueue_add_inbuf_cache_clean() to indicate that the CPU does not write into these fields, suppressing the warnings. Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Message-ID: <4b5bf63a7ebb782d87f643466b3669df567c9fe1.1767601130.git.mst@redhat.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2026-01-08 09:54:26 -05:00
Michael S. Tsirkin	63dfad0517	vsock/virtio: fix DMA alignment for event_list On non-cache-coherent platforms, when a structure contains a buffer used for DMA alongside fields that the CPU writes to, cacheline sharing can cause data corruption. The event_list array is used for DMA_FROM_DEVICE operations via virtqueue_add_inbuf(). The adjacent event_run and guest_cid fields are written by the CPU while the buffer is available, so mapped for the device. If these share cachelines with event_list, CPU writes can corrupt DMA data. Add __dma_from_device_group_begin()/end() annotations to ensure event_list is isolated in its own cachelines. Message-ID: <f19ebd74f70c91cab4b0178df78cf6a6e107a96b.1767601130.git.mst@redhat.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2026-01-08 09:54:19 -05:00
Michal Luczaj	ce5e612dd4	vsock: Make accept()ed sockets use custom setsockopt() SO_ZEROCOPY handling in vsock_connectible_setsockopt() does not get called on accept()ed sockets due to a missing flag. Flip it. Fixes: `e0718bd82e` ("vsock: enable setting SO_ZEROCOPY") Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20251229-vsock-child-sock-custom-sockopt-v2-1-64778d6c4f88@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-05 16:14:50 -08:00
Jakub Kicinski	9e203721ec	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.18-rc7). No conflicts, adjacent changes: tools/testing/selftests/net/af_unix/Makefile `e1bb28bf13` ("selftest: af_unix: Add test for SO_PEEK_OFF.") `45a1cd8346` ("selftests: af_unix: Add tests for ECONNRESET and EOF semantics") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-20 09:13:26 -08:00
Michal Luczaj	002541ef65	vsock: Ignore signal/timeout on connect() if already established During connect(), acting on a signal/timeout by disconnecting an already established socket leads to several issues: 1. connect() invoking vsock_transport_cancel_pkt() -> virtio_transport_purge_skbs() may race with sendmsg() invoking virtio_transport_get_credit(). This results in a permanently elevated `vvs->bytes_unsent`. Which, in turn, confuses the SOCK_LINGER handling. 2. connect() resetting a connected socket's state may race with socket being placed in a sockmap. A disconnected socket remaining in a sockmap breaks sockmap's assumptions. And gives rise to WARNs. 3. connect() transitioning SS_CONNECTED -> SS_UNCONNECTED allows for a transport change/drop after TCP_ESTABLISHED. Which poses a problem for any simultaneous sendmsg() or connect() and may result in a use-after-free/null-ptr-deref. Do not disconnect socket on signal/timeout. Keep the logic for unconnected sockets: they don't linger, can't be placed in a sockmap, are rejected by sendmsg(). [1]: https://lore.kernel.org/netdev/e07fd95c-9a38-4eea-9638-133e38c2ec9b@rbox.co/ [2]: https://lore.kernel.org/netdev/20250317-vsock-trans-signal-race-v4-0-fc8837f3f1d4@rbox.co/ [3]: https://lore.kernel.org/netdev/60f1b7db-3099-4f6a-875e-af9f6ef194f6@rbox.co/ Fixes: `d021c34405` ("VSOCK: Introduce VM Sockets") Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20251119-vsock-interrupted-connect-v2-1-70734cf1233f@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-20 07:40:06 -08:00
Kees Cook	85cb0757d7	net: Convert proto_ops connect() callbacks to use sockaddr_unsized Update all struct proto_ops connect() callback function prototypes from "struct sockaddr " to "struct sockaddr_unsized " to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-04 19:10:32 -08:00
Kees Cook	0e50474fa5	net: Convert proto_ops bind() callbacks to use sockaddr_unsized Update all struct proto_ops bind() callback function prototypes from "struct sockaddr " to "struct sockaddr_unsized " to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-04 19:10:32 -08:00
Stefano Garzarella	f7c877e753	vsock: fix lock inversion in vsock_assign_transport() Syzbot reported a potential lock inversion deadlock between vsock_register_mutex and sk_lock-AF_VSOCK when vsock_linger() is called. The issue was introduced by commit `687aa0c558` ("vsock: Fix transport_* TOCTOU") which added vsock_register_mutex locking in vsock_assign_transport() around the transport->release() call, that can call vsock_linger(). vsock_assign_transport() can be called with sk_lock held. vsock_linger() calls sk_wait_event() that temporarily releases and re-acquires sk_lock. During this window, if another thread hold vsock_register_mutex while trying to acquire sk_lock, a circular dependency is created. Fix this by releasing vsock_register_mutex before calling transport->release() and vsock_deassign_transport(). This is safe because we don't need to hold vsock_register_mutex while releasing the old transport, and we ensure the new transport won't disappear by obtaining a module reference first via try_module_get(). Reported-by: syzbot+10e35716f8e4929681fa@syzkaller.appspotmail.com Tested-by: syzbot+10e35716f8e4929681fa@syzkaller.appspotmail.com Fixes: `687aa0c558` ("vsock: Fix transport_* TOCTOU") Cc: mhal@rbox.co Cc: stable@vger.kernel.org Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20251021121718.137668-1-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-23 16:07:58 +02:00
Linus Torvalds	2215336295	Merge tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Unify guest entry code for KVM and MSHV (Sean Christopherson) - Switch Hyper-V MSI domain to use msi_create_parent_irq_domain() (Nam Cao) - Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV (Mukesh Rathor) - Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov) - Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna Kumar T S M) - Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari, Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley) * tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: hyperv: Remove the spurious null directive line MAINTAINERS: Mark hyperv_fb driver Obsolete fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver Drivers: hv: Make CONFIG_HYPERV bool Drivers: hv: Add CONFIG_HYPERV_VMBUS option Drivers: hv: vmbus: Fix typos in vmbus_drv.c Drivers: hv: vmbus: Fix sysfs output format for ring buffer index Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store() x86/hyperv: Switch to msi_create_parent_irq_domain() mshv: Use common "entry virt" APIs to do work in root before running guest entry: Rename "kvm" entry code assets to "virt" to genericize APIs entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper mshv: Handle NEED_RESCHED_LAZY before transferring to guest x86/hyperv: Add kexec/kdump support on Azure CVMs Drivers: hv: Simplify data structures for VMBus channel close message Drivers: hv: util: Cosmetic changes for hv_utils_transport.c mshv: Add support for a new parent partition configuration clocksource: hyper-v: Skip unnecessary checks for the root partition hyperv: Add missing field to hv_output_map_device_interrupt	2025-10-07 08:40:15 -07:00
Mukesh Rathor	94b04355e6	Drivers: hv: Add CONFIG_HYPERV_VMBUS option At present VMBus driver is hinged off of CONFIG_HYPERV which entails lot of builtin code and encompasses too much. It's not always clear what depends on builtin hv code and what depends on VMBus. Setting CONFIG_HYPERV as a module and fudging the Makefile to switch to builtin adds even more confusion. VMBus is an independent module and should have its own config option. Also, there are scenarios like baremetal dom0/root where support is built in with CONFIG_HYPERV but without VMBus. Lastly, there are more features coming down that use CONFIG_HYPERV and add more dependencies on it. So, create a fine grained HYPERV_VMBUS option and update Kconfigs for dependency on VMBus. Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> # drivers/pci Signed-off-by: Wei Liu <wei.liu@kernel.org>	2025-10-01 00:00:42 +00:00
Marco Crivellari	27ce71e1ce	net: WQ_PERCPU added to alloc_workqueue users Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This change adds a new WQ_PERCPU flag at the network subsystem, to explicitly request the use of the per-CPU behavior. Both flags coexist for one release cycle to allow callers to transition their calls. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. All existing users have been updated accordingly. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20250918142427.309519-4-marco.crivellari@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-22 17:40:30 -07:00
Marco Crivellari	5fd8bb982e	net: replace use of system_wq with system_percpu_wq Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20250918142427.309519-3-marco.crivellari@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-22 17:40:30 -07:00
Jakub Kicinski	d23ad54de7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc4). No conflicts. Adjacent changes: drivers/net/ethernet/intel/idpf/idpf_txrx.c `02614eee26` ("idpf: do not linearize big TSO packets") `6c4e684802` ("idpf: remove obsolete stashing code") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 11:48:01 -07:00
Will Deacon	7fb1291257	vsock/virtio: Fix message iterator handling on transmit path Commit `6693731487` ("vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers") converted the virtio vsock transmit path to utilise nonlinear SKBs when handling large buffers. As part of this change, virtio_transport_fill_skb() was updated to call skb_copy_datagram_from_iter() instead of memcpy_from_msg() as the latter expects a single destination buffer and cannot handle nonlinear SKBs correctly. Unfortunately, during this conversion, I overlooked the error case when the copying function returns -EFAULT due to a fault on the input buffer in userspace. In this case, memcpy_from_msg() reverts the iterator to its initial state thanks to copy_from_iter_full() whereas skb_copy_datagram_from_iter() leaves the iterator partially advanced. This results in a WARN_ONCE() from the vsock code, which expects the iterator to stay in sync with the number of bytes transmitted so that virtio_transport_send_pkt_info() can return -EFAULT when it is called again: ------------[ cut here ]------------ 'send_pkt()' returns 0, but 65536 expected WARNING: CPU: 0 PID: 5503 at net/vmw_vsock/virtio_transport_common.c:428 virtio_transport_send_pkt_info+0xd11/0xf00 net/vmw_vsock/virtio_transport_common.c:426 Modules linked in: CPU: 0 UID: 0 PID: 5503 Comm: syz.0.17 Not tainted 6.16.0-syzkaller-12063-g37816488247d #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call virtio_transport_fill_skb_full() to restore the previous iterator behaviour. Cc: Jason Wang <jasowang@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Fixes: `6693731487` ("vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers") Reported-by: syzbot+b4d960daf7a3c7c2b7b1@syzkaller.appspotmail.com Signed-off-by: Will Deacon <will@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Link: https://patch.msgid.link/20250818180355.29275-3-will@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-21 17:49:19 -07:00
Jakub Kicinski	f24775c325	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc2). No conflicts. Adjacent changes: drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c `d7a276a576` ("net: stmmac: rk: convert to suspend()/resume() methods") `de1e963ad0` ("net: stmmac: rk: put the PHY clock on remove") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-14 12:13:00 -07:00
Wang Liang	4d18083d6b	vsock: use sizeof(struct sockaddr_storage) instead of magic value Previous commit `230b183921` ("net: Use standard structures for generic socket address structures.") use 'struct sockaddr_storage address;' to replace 'char address[MAX_SOCK_ADDR];'. The macro MAX_SOCK_ADDR is removed by commit `01893c82b4` ("net: Remove MAX_SOCK_ADDR constant"). The comment in vsock_getname() is outdated, use sizeof(struct sockaddr_storage) instead of magic value 128. Signed-off-by: Wang Liang <wangliang74@huawei.com> Link: https://patch.msgid.link/20250812015929.1419896-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-13 17:05:21 -07:00
Budimir Markovic	aba0c94f61	vsock: Do not allow binding to VMADDR_PORT_ANY It is possible for a vsock to autobind to VMADDR_PORT_ANY. This can cause a use-after-free when a connection is made to the bound socket. The socket returned by accept() also has port VMADDR_PORT_ANY but is not on the list of unbound sockets. Binding it will result in an extra refcount decrement similar to the one fixed in `fcdd2242c0` (vsock: Keep the binding until socket destruction). Modify the check in __vsock_bind_connectible() to also prevent binding to VMADDR_PORT_ANY. Fixes: `d021c34405` ("VSOCK: Introduce VM Sockets") Reported-by: Budimir Markovic <markovicbudimir@gmail.com> Signed-off-by: Budimir Markovic <markovicbudimir@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250807041811.678-1-markovicbudimir@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-08 12:55:00 -07:00
Linus Torvalds	821c9e515d	Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - vhost can now support legacy threading if enabled in Kconfig - vsock memory allocation strategies for large buffers have been improved, reducing pressure on kmalloc - vhost now supports the in-order feature. guest bits missed the merge window. - fixes, cleanups all over the place * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (30 commits) vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers vsock/virtio: Rename virtio_vsock_skb_rx_put() vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers vsock/virtio: Move SKB allocation lower-bound check to callers vsock/virtio: Rename virtio_vsock_alloc_skb() vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put() vsock/virtio: Validate length in packet header before skb_put() vhost/vsock: Avoid allocating arbitrarily-sized SKBs vhost_net: basic in_order support vhost: basic in order support vhost: fail early when __vhost_add_used() fails vhost: Reintroduce kthread API and add mode selection vdpa: Fix IDR memory leak in VDUSE module exit vdpa/mlx5: Fix release of uninitialized resources on error path vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit virtio: virtio_dma_buf: fix missing parameter documentation vhost: Fix typos vhost: vringh: Remove unused functions vhost: vringh: Remove unused iotlb functions ...	2025-08-01 14:17:48 -07:00
Will Deacon	6693731487	vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers When transmitting a vsock packet, virtio_transport_send_pkt_info() calls virtio_transport_alloc_linear_skb() to allocate and fill SKBs with the transmit data. Unfortunately, these are always linear allocations and can therefore result in significant pressure on kmalloc() considering that the maximum packet size (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + VIRTIO_VSOCK_SKB_HEADROOM) is a little over 64KiB, resulting in a 128KiB allocation for each packet. Rework the vsock SKB allocation so that, for sizes with page order greater than PAGE_ALLOC_COSTLY_ORDER, a nonlinear SKB is allocated instead with the packet header in the SKB and the transmit data in the fragments. Note that this affects both the vhost and virtio transports. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-10-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	8ca76151d2	vsock/virtio: Rename virtio_vsock_skb_rx_put() In preparation for using virtio_vsock_skb_rx_put() when populating SKBs on the vsock TX path, rename virtio_vsock_skb_rx_put() to virtio_vsock_skb_put(). No functional change. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-9-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	2304c64a28	vsock/virtio: Rename virtio_vsock_alloc_skb() In preparation for nonlinear allocations for large SKBs, rename virtio_vsock_alloc_skb() to virtio_vsock_alloc_linear_skb() to indicate that it returns linear SKBs unconditionally and switch all callers over to this new interface for now. No functional change. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-6-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	03a92f036a	vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page When allocating receive buffers for the vsock virtio RX virtqueue, an SKB is allocated with a 4140 data payload (the 44-byte packet header + VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE). Even when factoring in the SKB overhead, the resulting 8KiB allocation thanks to the rounding in kmalloc_reserve() is wasteful (~3700 unusable bytes) and results in a higher-order page allocation on systems with 4KiB pages just for the sake of a few hundred bytes of packet data. Limit the vsock virtio RX buffers to 4KiB per SKB, resulting in much better memory utilisation and removing the need to allocate higher-order pages entirely. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-5-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	87dbae5e36	vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put() virtio_vsock_skb_rx_put() only calls skb_put() if the length in the packet header is not zero even though skb_put() handles this case gracefully. Remove the functionally redundant check from virtio_vsock_skb_rx_put() and, on the assumption that this is a worthwhile optimisation for handling credit messages, augment the existing length checks in virtio_transport_rx_work() to elide the call for zero-length payloads. Since the callers all have the length, extend virtio_vsock_skb_rx_put() to take it as an additional parameter rather than fish it back out of the packet header. Note that the vhost code already has similar logic in vhost_vsock_alloc_skb(). Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-4-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	0dab924844	vsock/virtio: Validate length in packet header before skb_put() When receiving a vsock packet in the guest, only the virtqueue buffer size is validated prior to virtio_vsock_skb_rx_put(). Unfortunately, virtio_vsock_skb_rx_put() uses the length from the packet header as the length argument to skb_put(), potentially resulting in SKB overflow if the host has gone wonky. Validate the length as advertised by the packet header before calling virtio_vsock_skb_rx_put(). Cc: <stable@vger.kernel.org> Fixes: `71dc9ec9ac` ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-3-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>	2025-08-01 09:11:09 -04:00

1 2 3 4 5 ...

449 Commits