mirror of
https://github.com/torvalds/linux.git
synced 2026-04-18 06:44:00 -04:00
fcd11ff8bd0e526bdd5f43f534ccf7c4e67245ad
83911 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
e5f635edd3 |
bpf: Fix precedence bug in convert_bpf_ld_abs alignment check
Fix an operator precedence issue in convert_bpf_ld_abs() where the
expression offset + ip_align % size evaluates as offset + (ip_align % size)
due to % having higher precedence than +. That latter evaluation does
not make any sense. The intended check is (offset + ip_align) % size == 0
to verify that the packet load offset is properly aligned for direct
access.
With NET_IP_ALIGN == 2, the bug causes the inline fast-path for direct
packet loads to almost never be taken on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
platforms. This forces nearly all cBPF BPF_LD_ABS packet loads through
the bpf_skb_load_helper slow path on the affected archs.
Fixes:
|
||
|
|
64c2f93fc3 |
bpf, sockmap: Take state lock for af_unix iter
When a BPF iterator program updates a sockmap, there is a race condition in
unix_stream_bpf_update_proto() where the `peer` pointer can become stale[1]
during a state transition TCP_ESTABLISHED -> TCP_CLOSE.
CPU0 bpf CPU1 close
-------- ----------
// unix_stream_bpf_update_proto()
sk_pair = unix_peer(sk)
if (unlikely(!sk_pair))
return -EINVAL;
// unix_release_sock()
skpair = unix_peer(sk);
unix_peer(sk) = NULL;
sock_put(skpair)
sock_hold(sk_pair) // UaF
More practically, this fix guarantees that the iterator program is
consistently provided with a unix socket that remains stable during
iterator execution.
[1]:
BUG: KASAN: slab-use-after-free in unix_stream_bpf_update_proto+0x155/0x490
Write of size 4 at addr ffff8881178c9a00 by task test_progs/2231
Call Trace:
dump_stack_lvl+0x5d/0x80
print_report+0x170/0x4f3
kasan_report+0xe4/0x1c0
kasan_check_range+0x125/0x200
unix_stream_bpf_update_proto+0x155/0x490
sock_map_link+0x71c/0xec0
sock_map_update_common+0xbc/0x600
sock_map_update_elem+0x19a/0x1f0
bpf_prog_bbbf56096cdd4f01_selective_dump_unix+0x20c/0x217
bpf_iter_run_prog+0x21e/0xae0
bpf_iter_unix_seq_show+0x1e0/0x2a0
bpf_seq_read+0x42c/0x10d0
vfs_read+0x171/0xb20
ksys_read+0xff/0x200
do_syscall_64+0xf7/0x5e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Allocated by task 2236:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
__kasan_slab_alloc+0x63/0x80
kmem_cache_alloc_noprof+0x1d5/0x680
sk_prot_alloc+0x59/0x210
sk_alloc+0x34/0x470
unix_create1+0x86/0x8a0
unix_stream_connect+0x318/0x15b0
__sys_connect+0xfd/0x130
__x64_sys_connect+0x72/0xd0
do_syscall_64+0xf7/0x5e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 2236:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x70
__kasan_slab_free+0x47/0x70
kmem_cache_free+0x11c/0x590
__sk_destruct+0x432/0x6e0
unix_release_sock+0x9b3/0xf60
unix_release+0x8a/0xf0
__sock_release+0xb0/0x270
sock_close+0x18/0x20
__fput+0x36e/0xac0
fput_close_sync+0xe5/0x1a0
__x64_sys_close+0x7d/0xd0
do_syscall_64+0xf7/0x5e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes:
|
||
|
|
dca38b7734 |
bpf, sockmap: Fix af_unix null-ptr-deref in proto update
unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
socket is properly set up, which would include having a defined peer. IOW,
there's a window when unix_stream_bpf_update_proto() can be called on
socket which still has unix_peer(sk) == NULL.
CPU0 bpf CPU1 connect
-------- ------------
WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
sock_hold(newsk)
smp_mb__after_atomic()
unix_peer(sk) = newsk
BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
sock_map_link+0x564/0x8b0
sock_map_update_common+0x6e/0x340
sock_map_update_elem_sys+0x17d/0x240
__sys_bpf+0x26db/0x3250
__x64_sys_bpf+0x21/0x30
do_syscall_64+0x6b/0x3a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Initial idea was to move peer assignment _before_ the sk_state update[1],
but that involved an additional memory barrier, and changing the hot path
was rejected.
Then a NULL check during proto update in unix_stream_bpf_update_proto() was
considered[2], but the follow-up discussion[3] focused on the root cause,
i.e. sockmap update taking a wrong lock. Or, more specifically, missing
unix_state_lock()[4].
In the end it was concluded that teaching sockmap about the af_unix locking
would be unnecessarily complex[5].
Complexity aside, since BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT
are allowed to update sockmaps, sock_map_update_elem() taking the unix
lock, as it is currently implemented in unix_state_lock():
spin_lock(&unix_sk(s)->lock), would be problematic. unix_state_lock() taken
in a process context, followed by a softirq-context TC BPF program
attempting to take the same spinlock -- deadlock[6].
This way we circled back to the peer check idea[2].
[1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
[2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
[3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
[4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
[5]: https://lore.kernel.org/bpf/CAAVpQUAHijOMext28Gi10dSLuMzGYh+jK61Ujn+fZ-wvcODR2A@mail.gmail.com/
[6]: https://lore.kernel.org/bpf/dd043c69-4d03-46fe-8325-8f97101435cf@linux.dev/
Summary of scenarios where af_unix/stream connect() may race a sockmap
update:
1. connect() vs. bpf(BPF_MAP_UPDATE_ELEM), i.e. sock_map_update_elem_sys()
Implemented NULL check is sufficient. Once assigned, socket peer won't
be released until socket fd is released. And that's not an issue because
sock_map_update_elem_sys() bumps fd refcnf.
2. connect() vs BPF program doing update
Update restricted per verifier.c:may_update_sockmap() to
BPF_PROG_TYPE_TRACING/BPF_TRACE_ITER
BPF_PROG_TYPE_SOCK_OPS (bpf_sock_map_update() only)
BPF_PROG_TYPE_SOCKET_FILTER
BPF_PROG_TYPE_SCHED_CLS
BPF_PROG_TYPE_SCHED_ACT
BPF_PROG_TYPE_XDP
BPF_PROG_TYPE_SK_REUSEPORT
BPF_PROG_TYPE_FLOW_DISSECTOR
BPF_PROG_TYPE_SK_LOOKUP
Plus one more race to consider:
CPU0 bpf CPU1 connect
-------- ------------
WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
sock_hold(newsk)
smp_mb__after_atomic()
unix_peer(sk) = newsk
sk_pair = unix_peer(sk)
if (unlikely(!sk_pair))
return -EINVAL;
CPU1 close
----------
skpair = unix_peer(sk);
unix_peer(sk) = NULL;
sock_put(skpair)
// use after free?
sock_hold(sk_pair)
2.1 BPF program invoking helper function bpf_sock_map_update() ->
BPF_CALL_4(bpf_sock_map_update(), ...)
Helper limited to BPF_PROG_TYPE_SOCK_OPS. Nevertheless, a unix sock
might be accessible via bpf_map_lookup_elem(). Which implies sk
already having psock, which in turn implies sk already having
sk_pair. Since sk_psock_destroy() is queued as RCU work, sk_pair
won't go away while BPF executes the update.
2.2 BPF program invoking helper function bpf_map_update_elem() ->
sock_map_update_elem()
2.2.1 Unix sock accessible to BPF prog only via sockmap lookup in
BPF_PROG_TYPE_SOCKET_FILTER, BPF_PROG_TYPE_SCHED_CLS,
BPF_PROG_TYPE_SCHED_ACT, BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_SK_REUSEPORT, BPF_PROG_TYPE_FLOW_DISSECTOR,
BPF_PROG_TYPE_SK_LOOKUP.
Pretty much the same as case 2.1.
2.2.2 Unix sock accessible to BPF program directly:
BPF_PROG_TYPE_TRACING, narrowed down to BPF_TRACE_ITER.
Sockmap iterator (sock_map_seq_ops) is safe: unix sock
residing in a sockmap means that the sock already went through
the proto update step.
Unix sock iterator (bpf_iter_unix_seq_ops), on the other hand,
gives access to socks that may still be unconnected. Which
means iterator prog can race sockmap/proto update against
connect().
BUG: KASAN: null-ptr-deref in unix_stream_bpf_update_proto+0x253/0x4d0
Write of size 4 at addr 0000000000000080 by task test_progs/3140
Call Trace:
dump_stack_lvl+0x5d/0x80
kasan_report+0xe4/0x1c0
kasan_check_range+0x125/0x200
unix_stream_bpf_update_proto+0x253/0x4d0
sock_map_link+0x71c/0xec0
sock_map_update_common+0xbc/0x600
sock_map_update_elem+0x19a/0x1f0
bpf_prog_bbbf56096cdd4f01_selective_dump_unix+0x20c/0x217
bpf_iter_run_prog+0x21e/0xae0
bpf_iter_unix_seq_show+0x1e0/0x2a0
bpf_seq_read+0x42c/0x10d0
vfs_read+0x171/0xb20
ksys_read+0xff/0x200
do_syscall_64+0xf7/0x5e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
While the introduced NULL check prevents null-ptr-deref in the
BPF program path as well, it is insufficient to guard against
a poorly timed close() leading to a use-after-free. This will
be addressed in a subsequent patch.
Fixes:
|
||
|
|
4d328dd695 |
bpf, sockmap: Fix af_unix iter deadlock
bpf_iter_unix_seq_show() may deadlock when lock_sock_fast() takes the fast
path and the iter prog attempts to update a sockmap. Which ends up spinning
at sock_map_update_elem()'s bh_lock_sock():
WARNING: possible recursive locking detected
test_progs/1393 is trying to acquire lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0
but task is already holding lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(slock-AF_UNIX);
lock(slock-AF_UNIX);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by test_progs/1393:
#0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
#1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
#2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
#3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00
Call Trace:
dump_stack_lvl+0x5d/0x80
print_deadlock_bug.cold+0xc0/0xce
__lock_acquire+0x130f/0x2590
lock_acquire+0x14e/0x2b0
_raw_spin_lock+0x30/0x40
sock_map_update_elem+0xdb/0x1f0
bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
bpf_iter_run_prog+0x5b9/0xb00
bpf_iter_unix_seq_show+0x1f7/0x2e0
bpf_seq_read+0x42c/0x10d0
vfs_read+0x171/0xb20
ksys_read+0xff/0x200
do_syscall_64+0x6b/0x3a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes:
|
||
|
|
a25566084e |
bpf, sockmap: Annotate af_unix sock:: Sk_state data-races
sock_map_sk_state_allowed() and sock_map_redirect_allowed() read af_unix socket sk_state locklessly. Use READ_ONCE(). Note that for sock_map_redirect_allowed() change affects not only af_unix, but all non-TCP sockets (UDP, af_vsock). Suggested-by: Kuniyuki Iwashima <kuniyu@google.com> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260414-unix-proto-update-null-ptr-deref-v4-1-2af6fe97918e@rbox.co |
||
|
|
91a4855d6c |
Merge tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Support HW queue leasing, allowing containers to be granted access
to HW queues for zero-copy operations and AF_XDP
- Number of code moves to help the compiler with inlining. Avoid
output arguments for returning drop reason where possible
- Rework drop handling within qdiscs to include more metadata about
the reason and dropping qdisc in the tracepoints
- Remove the rtnl_lock use from IP Multicast Routing
- Pack size information into the Rx Flow Steering table pointer
itself. This allows making the table itself a flat array of u32s,
thus making the table allocation size a power of two
- Report TCP delayed ack timer information via socket diag
- Add ip_local_port_step_width sysctl to allow distributing the
randomly selected ports more evenly throughout the allowed space
- Add support for per-route tunsrc in IPv6 segment routing
- Start work of switching sockopt handling to iov_iter
- Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
buffer size drifting up
- Support MSG_EOR in MPTCP
- Add stp_mode attribute to the bridge driver for STP mode selection.
This addresses concerns about call_usermodehelper() usage
- Remove UDP-Lite support (as announced in 2023)
- Remove support for building IPv6 as a module. Remove the now
unnecessary function calling indirection
Cross-tree stuff:
- Move Michael MIC code from generic crypto into wireless, it's
considered insecure but some WiFi networks still need it
Netfilter:
- Switch nft_fib_ipv6 module to no longer need temporary dst_entry
object allocations by using fib6_lookup() + RCU.
Florian W reports this gets us ~13% higher packet rate
- Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
switch the service tables to be per-net. Convert some code that
walks the service lists to use RCU instead of the service_mutex
- Add more opinionated input validation to lower security exposure
- Make IPVS hash tables to be per-netns and resizable
Wireless:
- Finished assoc frame encryption/EPPKE/802.1X-over-auth
- Radar detection improvements
- Add 6 GHz incumbent signal detection APIs
- Multi-link support for FILS, probe response templates and client
probing
- New APIs and mac80211 support for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware
Driver API:
- Add numerical ID for devlink instances (to avoid having to create
fake bus/device pairs just to have an ID). Support shared devlink
instances which span multiple PFs
- Add standard counters for reporting pause storm events (implement
in mlx5 and fbnic)
- Add configuration API for completion writeback buffering (implement
in mana)
- Support driver-initiated change of RSS context sizes
- Support DPLL monitoring input frequency (implement in zl3073x)
- Support per-port resources in devlink (implement in mlx5)
Misc:
- Expand the YAML spec for Netfilter
Drivers
- Software:
- macvlan: support multicast rx for bridge ports with shared
source MAC address
- team: decouple receive and transmit enablement for IEEE 802.3ad
LACP "independent control"
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- support high order pages in zero-copy mode (for payload
coalescing)
- support multiple packets in a page (for systems with 64kB
pages)
- Broadcom 25-400GE (bnxt):
- implement XDP RSS hash metadata extraction
- add software fallback for UDP GSO, lowering the IOMMU cost
- Broadcom 800GE (bnge):
- add link status and configuration handling
- add various HW and SW statistics
- Marvell/Cavium:
- NPC HW block support for cn20k
- Huawei (hinic3):
- add mailbox / control queue
- add rx VLAN offload
- add driver info and link management
- Ethernet NICs:
- Marvell/Aquantia:
- support reading SFP module info on some AQC100 cards
- Realtek PCI (r8169):
- add support for RTL8125cp
- Realtek USB (r8152):
- support for the RTL8157 5Gbit chip
- add 2500baseT EEE status/configuration support
- Ethernet NICs embedded and off-the-shelf IP:
- Synopsys (stmmac):
- cleanup and reorganize SerDes handling and PCS support
- cleanup descriptor handling and per-platform data
- cleanup and consolidate MDIO defines and handling
- shrink driver memory use for internal structures
- improve Tx IRQ coalescing
- improve TCP segmentation handling
- add support for Spacemit K3
- Cadence (macb):
- support PHYs that have inband autoneg disabled with GEM
- support IEEE 802.3az EEE
- rework usrio capabilities and handling
- AMD (xgbe):
- improve power management for S0i3
- improve TX resilience for link-down handling
- Virtual:
- Google cloud vNIC:
- support larger ring sizes in DQO-QPL mode
- improve HW-GRO handling
- support UDP GSO for DQO format
- PCIe NTB:
- support queue count configuration
- Ethernet PHYs:
- automatically disable PHY autonomous EEE if MAC is in charge
- Broadcom:
- add BCM84891/BCM84892 support
- Micrel:
- support for LAN9645X internal PHY
- Realtek:
- add RTL8224 pair order support
- support PHY LEDs on RTL8211F-VD
- support spread spectrum clocking (SSC)
- Maxlinear:
- add PHY-level statistics via ethtool
- Ethernet switches:
- Maxlinear (mxl862xx):
- support for bridge offloading
- support for VLANs
- support driver statistics
- Bluetooth:
- large number of fixes and new device IDs
- Mediatek:
- support MT6639 (MT7927)
- support MT7902 SDIO
- WiFi:
- Intel (iwlwifi):
- UNII-9 and continuing UHR work
- MediaTek (mt76):
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- Qualcomm (ath12k):
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
- support IPQ5424
- Realtek:
- add USB RX aggregation to improve performance
- add USB TX flow control by tracking in-flight URBs
- Cellular:
- IPA v5.2 support"
* tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits)
net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
wireguard: allowedips: remove redundant space
tools: ynl: add sample for wireguard
wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
MAINTAINERS: Add netkit selftest files
selftests/net: Add additional test coverage in nk_qlease
selftests/net: Split netdevsim tests from HW tests in nk_qlease
tools/ynl: Make YnlFamily closeable as a context manager
net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
net: airoha: Fix VIP configuration for AN7583 SoC
net: caif: clear client service pointer on teardown
net: strparser: fix skb_head leak in strp_abort_strp()
net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()
selftests/bpf: add test for xdp_master_redirect with bond not up
net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
sctp: disable BH before calling udp_tunnel_xmit_skb()
sctp: fix missing encap_port propagation for GSO fragments
net: airoha: Rely on net_device pointer in ETS callbacks
...
|
||
|
|
f5ad410100 |
Merge tag 'bpf-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Welcome new BPF maintainers: Kumar Kartikeya Dwivedi, Eduard
Zingerman while Martin KaFai Lau reduced his load to Reviwer.
- Lots of fixes everywhere from many first time contributors. Thank you
All.
- Diff stat is dominated by mechanical split of verifier.c into
multiple components:
- backtrack.c: backtracking logic and jump history
- states.c: state equivalence
- cfg.c: control flow graph, postorder, strongly connected
components
- liveness.c: register and stack liveness
- fixups.c: post-verification passes: instruction patching, dead
code removal, bpf_loop inlining, finalize fastcall
8k line were moved. verifier.c still stands at 20k lines.
Further refactoring is planned for the next release.
- Replace dynamic stack liveness with static stack liveness based on
data flow analysis.
This improved the verification time by 2x for some programs and
equally reduced memory consumption. New logic is in liveness.c and
supported by constant folding in const_fold.c (Eduard Zingerman,
Alexei Starovoitov)
- Introduce BTF layout to ease addition of new BTF kinds (Alan Maguire)
- Use kmalloc_nolock() universally in BPF local storage (Amery Hung)
- Fix several bugs in linked registers delta tracking (Daniel Borkmann)
- Improve verifier support of arena pointers (Emil Tsalapatis)
- Improve verifier tracking of register bounds in min/max and tnum
domains (Harishankar Vishwanathan, Paul Chaignon, Hao Sun)
- Further extend support for implicit arguments in the verifier (Ihor
Solodrai)
- Add support for nop,nop5 instruction combo for USDT probes in libbpf
(Jiri Olsa)
- Support merging multiple module BTFs (Josef Bacik)
- Extend applicability of bpf_kptr_xchg (Kaitao Cheng)
- Retire rcu_trace_implies_rcu_gp() (Kumar Kartikeya Dwivedi)
- Support variable offset context access for 'syscall' programs (Kumar
Kartikeya Dwivedi)
- Migrate bpf_task_work and dynptr to kmalloc_nolock() (Mykyta
Yatsenko)
- Fix UAF in in open-coded task_vma iterator (Puranjay Mohan)
* tag 'bpf-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (241 commits)
selftests/bpf: cover short IPv4/IPv6 inputs with adjust_room
bpf: reject short IPv4/IPv6 inputs in bpf_prog_test_run_skb
selftests/bpf: Use memfd_create instead of shm_open in cgroup_iter_memcg
selftests/bpf: Add test for cgroup storage OOB read
bpf: Fix OOB in pcpu_init_value
selftests/bpf: Fix reg_bounds to match new tnum-based refinement
selftests/bpf: Add tests for non-arena/arena operations
bpf: Allow instructions with arena source and non-arena dest registers
bpftool: add missing fsession to the usage and docs of bpftool
docs/bpf: add missing fsession attach type to docs
bpf: add missing fsession to the verifier log
bpf: Move BTF checking logic into check_btf.c
bpf: Move backtracking logic to backtrack.c
bpf: Move state equivalence logic to states.c
bpf: Move check_cfg() into cfg.c
bpf: Move compute_insn_live_regs() into liveness.c
bpf: Move fixup/post-processing logic from verifier.c into fixups.c
bpf: Simplify do_check_insn()
bpf: Move checks for reserved fields out of the main pass
bpf: Delete unused variable
...
|
||
|
|
35c2c39832 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes in preparation for the net-next PR. Conflicts: include/net/sch_generic.h |
||
|
|
f7cf8ece8c |
net: caif: clear client service pointer on teardown
`caif_connect()` can tear down an existing client after remote shutdown by
calling `caif_disconnect_client()` followed by `caif_free_client()`.
`caif_free_client()` releases the service layer referenced by
`adap_layer->dn`, but leaves that pointer stale.
When the socket is later destroyed, `caif_sock_destructor()` calls
`caif_free_client()` again and dereferences the freed service pointer.
Clear the client/service links before releasing the service object so
repeated teardown becomes harmless.
Fixes:
|
||
|
|
fe72340daa |
net: strparser: fix skb_head leak in strp_abort_strp()
When the stream parser is aborted, for example after a message assembly timeout,
it can still hold a reference to a partially assembled message in
strp->skb_head.
That skb is not released in strp_abort_strp(), which leaks the partially
assembled message and can be triggered repeatedly to exhaust memory.
Fix this by freeing strp->skb_head and resetting the parser state in the
abort path. Leave strp_stop() unchanged so final cleanup still happens in
strp_done() after the work and timer have been synchronized.
Fixes:
|
||
|
|
1921f91298 |
net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
syzkaller reported a kernel panic in bond_rr_gen_slave_id() reached via
xdp_master_redirect(). Full decoded trace:
https://syzkaller.appspot.com/bug?extid=80e046b8da2820b6ba73
bond_rr_gen_slave_id() dereferences bond->rr_tx_counter, a per-CPU
counter that bonding only allocates in bond_open() when the mode is
round-robin. If the bond device was never brought up, rr_tx_counter
stays NULL.
The XDP redirect path can still reach that code on a bond that was
never opened: bpf_master_redirect_enabled_key is a global static key,
so as soon as any bond device has native XDP attached, the
XDP_TX -> xdp_master_redirect() interception is enabled for every
slave system-wide. The path xdp_master_redirect() ->
bond_xdp_get_xmit_slave() -> bond_xdp_xmit_roundrobin_slave_get() ->
bond_rr_gen_slave_id() then runs against a bond that has no
rr_tx_counter and crashes.
Fix this in the generic xdp_master_redirect() by refusing to call into
the master's ->ndo_xdp_get_xmit_slave() when the master device is not
up. IFF_UP is only set after ->ndo_open() has successfully returned,
so this reliably excludes masters whose XDP state has not been fully
initialized. Drop the frame with XDP_ABORTED so the exception is
visible via trace_xdp_exception() rather than silently falling through.
This is not specific to bonding: any current or future master that
defers XDP state allocation to ->ndo_open() is protected.
Fixes:
|
||
|
|
370c388319 |
Merge tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull crypto library updates from Eric Biggers:
- Migrate more hash algorithms from the traditional crypto subsystem to
lib/crypto/
Like the algorithms migrated earlier (e.g. SHA-*), this simplifies
the implementations, improves performance, enables further
simplifications in calling code, and solves various other issues:
- AES CBC-based MACs (AES-CMAC, AES-XCBC-MAC, and AES-CBC-MAC)
- Support these algorithms in lib/crypto/ using the AES library
and the existing arm64 assembly code
- Reimplement the traditional crypto API's "cmac(aes)",
"xcbc(aes)", and "cbcmac(aes)" on top of the library
- Convert mac80211 to use the AES-CMAC library. Note: several
other subsystems can use it too and will be converted later
- Drop the broken, nonstandard, and likely unused support for
"xcbc(aes)" with key lengths other than 128 bits
- Enable optimizations by default
- GHASH
- Migrate the standalone GHASH code into lib/crypto/
- Integrate the GHASH code more closely with the very similar
POLYVAL code, and improve the generic GHASH implementation to
resist cache-timing attacks and use much less memory
- Reimplement the AES-GCM library and the "gcm" crypto_aead
template on top of the GHASH library. Remove "ghash" from the
crypto_shash API, as it's no longer needed
- Enable optimizations by default
- SM3
- Migrate the kernel's existing SM3 code into lib/crypto/, and
reimplement the traditional crypto API's "sm3" on top of it
- I don't recommend using SM3, but this cleanup is worthwhile
to organize the code the same way as other algorithms
- Testing improvements:
- Add a KUnit test suite for each of the new library APIs
- Migrate the existing ChaCha20Poly1305 test to KUnit
- Make the KUnit all_tests.config enable all crypto library tests
- Move the test kconfig options to the Runtime Testing menu
- Other updates to arch-optimized crypto code:
- Optimize SHA-256 for Zhaoxin CPUs using the Padlock Hash Engine
- Remove some MD5 implementations that are no longer worth keeping
- Drop big endian and voluntary preemption support from the arm64
code, as those configurations are no longer supported on arm64
- Make jitterentropy and samples/tsm-mr use the crypto library APIs
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (66 commits)
lib/crypto: arm64: Assume a little-endian kernel
arm64: fpsimd: Remove obsolete cond_yield macro
lib/crypto: arm64/sha3: Remove obsolete chunking logic
lib/crypto: arm64/sha512: Remove obsolete chunking logic
lib/crypto: arm64/sha256: Remove obsolete chunking logic
lib/crypto: arm64/sha1: Remove obsolete chunking logic
lib/crypto: arm64/poly1305: Remove obsolete chunking logic
lib/crypto: arm64/gf128hash: Remove obsolete chunking logic
lib/crypto: arm64/chacha: Remove obsolete chunking logic
lib/crypto: arm64/aes: Remove obsolete chunking logic
lib/crypto: Include <crypto/utils.h> instead of <crypto/algapi.h>
lib/crypto: aesgcm: Don't disable IRQs during AES block encryption
lib/crypto: aescfb: Don't disable IRQs during AES block encryption
lib/crypto: tests: Migrate ChaCha20Poly1305 self-test to KUnit
lib/crypto: sparc: Drop optimized MD5 code
lib/crypto: mips: Drop optimized MD5 code
lib: Move crypto library tests to Runtime Testing menu
crypto: sm3 - Remove 'struct sm3_state'
crypto: sm3 - Remove the original "sm3_block_generic()"
crypto: sm3 - Remove sm3_base.h
...
|
||
|
|
2cd7e6971f |
sctp: disable BH before calling udp_tunnel_xmit_skb()
udp_tunnel_xmit_skb() / udp_tunnel6_xmit_skb() are expected to run with BH disabled. After commit |
||
|
|
bf6f95ae3b |
sctp: fix missing encap_port propagation for GSO fragments
encap_port in SCTP_INPUT_CB(skb) is used by sctp_vtag_verify() for
SCTP-over-UDP processing. In the GSO case, it is only set on the head
skb, while fragment skbs leave it 0.
This results in fragment skbs seeing encap_port == 0, breaking
SCTP-over-UDP connections.
Fix it by propagating encap_port from the head skb cb when initializing
fragment skbs in sctp_inq_pop().
Fixes:
|
||
|
|
c058bbf05b |
tcp: Don't set treq->req_usec_ts in cookie_tcp_reqsk_init().
Commit |
||
|
|
b80a95ccf1 |
udp: Force compute_score to always inline
Back in 2024 I reported a 7-12% regression on an iperf3 UDP loopback thoughput test that we traced to the extra overhead of calling compute_score on two places, introduced by commit |
||
|
|
b8f82cb0d8 |
Merge tag 'landlock-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux
Pull Landlock update from Mickaël Salaün: "This adds a new Landlock access right for pathname UNIX domain sockets thanks to a new LSM hook, and a few fixes" * tag 'landlock-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux: (23 commits) landlock: Document fallocate(2) as another truncation corner case landlock: Document FS access right for pathname UNIX sockets selftests/landlock: Simplify ruleset creation and enforcement in fs_test selftests/landlock: Check that coredump sockets stay unrestricted selftests/landlock: Audit test for LANDLOCK_ACCESS_FS_RESOLVE_UNIX selftests/landlock: Test LANDLOCK_ACCESS_FS_RESOLVE_UNIX selftests/landlock: Replace access_fs_16 with ACCESS_ALL in fs_test samples/landlock: Add support for named UNIX domain socket restrictions landlock: Clarify BUILD_BUG_ON check in scoping logic landlock: Control pathname UNIX domain socket resolution by path landlock: Use mem_is_zero() in is_layer_masks_allowed() lsm: Add LSM hook security_unix_find landlock: Fix kernel-doc warning for pointer-to-array parameters landlock: Fix formatting in tsync.c landlock: Improve kernel-doc "Return:" section consistency landlock: Add missing kernel-doc "Return:" sections selftests/landlock: Fix format warning for __u64 in net_test selftests/landlock: Skip stale records in audit_match_record() selftests/landlock: Drain stale audit records on init selftests/landlock: Fix socket file descriptor leaks in audit helpers ... |
||
|
|
7809fea20c |
net: qrtr: ns: Fix use-after-free in driver remove()
In the remove callback, if a packet arrives after destroy_workqueue() is
called, but before sock_release(), the qrtr_ns_data_ready() callback will
try to queue the work, causing use-after-free issue.
Fix this issue by saving the default 'sk_data_ready' callback during
qrtr_ns_init() and use it to replace the qrtr_ns_data_ready() callback at
the start of remove(). This ensures that even if a packet arrives after
destroy_workqueue(), the work struct will not be dereferenced.
Note that it is also required to ensure that the RX threads are completed
before destroying the workqueue, because the threads could be using the
qrtr_ns_data_ready() callback.
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
27d5e84e81 |
net: qrtr: ns: Limit the total number of nodes
Currently, the nameserver doesn't limit the number of nodes it handles.
This can be an attack vector if a malicious client starts registering
random nodes, leading to memory exhaustion.
Hence, limit the maximum number of nodes to 64. Note that, limit of 64 is
chosen based on the current platform requirements. If requirement changes
in the future, this limit can be increased.
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
68efba3644 |
net: qrtr: ns: Free the node during ctrl_cmd_bye()
A node sends the BYE packet when it is about to go down. So the nameserver
should advertise the removal of the node to all remote and local observers
and free the node finally. But currently, the nameserver doesn't free the
node memory even after processing the BYE packet. This causes the node
memory to leak.
Hence, remove the node from Xarray list and free the node memory during
both success and failure case of ctrl_cmd_bye().
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
5640227d9a |
net: qrtr: ns: Limit the maximum number of lookups
Current code does no bound checking on the number of lookups a client can
perform. Though the code restricts the lookups to local clients, there is
still a possibility of a malicious local client sending a flood of
NEW_LOOKUP messages over the same socket.
Fix this issue by limiting the maximum number of lookups to 64 globally.
Since the nameserver allows only atmost one local observer, this global
lookup count will ensure that the lookups stay within the limit.
Note that, limit of 64 is chosen based on the current platform
requirements. If requirement changes in the future, this limit can be
increased.
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
d5ee2ff983 |
net: qrtr: ns: Limit the maximum server registration per node
Current code does no bound checking on the number of servers added per
node. A malicious client can flood NEW_SERVER messages and exhaust memory.
Fix this issue by limiting the maximum number of server registrations to
256 per node. If the NEW_SERVER message is received for an old port, then
don't restrict it as it will get replaced. While at it, also rate limit
the error messages in the failure path of qrtr_ns_worker().
Note that the limit of 256 is chosen based on the current platform
requirements. If requirement changes in the future, this limit can be
increased.
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
5b75e7d676 |
can: raw: convert to getsockopt_iter
Convert CAN raw socket's getsockopt implementation to use the new getsockopt_iter callback with sockopt_t. Key changes: - Replace (char __user *optval, int __user *optlen) with sockopt_t *opt - Use opt->optlen for buffer length (input) and returned size (output) - Use copy_to_iter() instead of copy_to_user() - For CAN_RAW_FILTER and CAN_RAW_XL_VCID_OPTS: on -ERANGE, set opt->optlen to the required buffer size. The wrapper writes this back to userspace even on error, preserving the existing API that lets userspace discover the needed allocation size. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260408-getsockopt-v3-4-061bb9cb355d@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
9c99d62705 |
af_packet: convert to getsockopt_iter
Convert AF_PACKET's getsockopt implementation to use the new getsockopt_iter callback with sockopt_t. Key changes: - Replace (char __user *optval, int __user *optlen) with sockopt_t *opt - Use opt->optlen for buffer length (input) and returned size (output) - Use copy_to_iter() instead of put_user()/copy_to_user() - For PACKET_HDRLEN which reads from optval: use opt->iter_in with copy_from_iter() for the input read, then the common opt->iter_out copy_to_iter() epilogue handles the output Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260408-getsockopt-v3-3-061bb9cb355d@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
5bd0dec150 |
net: call getsockopt_iter if available
Update do_sock_getsockopt() to use the new getsockopt_iter callback when available. Add do_sock_getsockopt_iter() helper that: 1. Reads optlen from user/kernel space 2. Initializes a sockopt_t with the appropriate iov_iter (kvec for kernel, ubuf for user buffers) and sets opt.optlen 3. Calls the protocol's getsockopt_iter callback 4. Writes opt.optlen back to user/kernel space The optlen is always written back, even on failure. Some protocols (e.g. CAN raw) return -ERANGE and set optlen to the required buffer size so userspace knows how much to allocate. The callback is responsible for setting opt.optlen to indicate the returned data size. Important to say that iov_out does not need to be copied back in do_sock_getsockopt(). When optval is not kernel (the userspace path), sockptr_to_sockopt() sets up opt->iter_out as a ITER_DEST ubuf iterator pointing directly at the userspace buffer (optval.user). So when getsockopt_iter implementations call copy_to_iter(..., &opt->iter_out), the data is written directly to userspace — no intermediate kernel buffer is involved. When optval.is_kernel is true (the in-kernel path, e.g. from io_uring), the kvec points at the already-provided kernel buffer (optval.kernel), so the data lands in the caller's buffer directly via the kvec-backed iterator. In both cases the iterator writes to the final destination in-place at protocol callback. There's nothing to copy back — only optlen needs to be written back. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260408-getsockopt-v3-2-061bb9cb355d@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
e9dc62f25b |
Merge tag 'for-net-next-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
core:
- hci_core: Rate limit the logging of invalid ISO handle
- hci_sync: make hci_cmd_sync_run_once return -EEXIST if exists
- hci_event: fix locking in hci_conn_request_evt() with HCI_PROTO_DEFER
- hci_event: fix potential UAF in SSP passkey handlers
- HCI: Avoid a couple -Wflex-array-member-not-at-end warnings
- L2CAP: CoC: Disconnect if received packet size exceeds MPS
- L2CAP: Add missing chan lock in l2cap_ecred_reconf_rsp
- L2CAP: Fix printing wrong information if SDU length exceeds MTU
- SCO: check for codecs->num_codecs == 1 before assigning to sco_pi(sk)->codec
drivers:
- btusb: MT7922: Add VID/PID 0489/e174
- btusb: Add Lite-On 04ca:3807 for MediaTek MT7921
- btusb: Add MT7927 IDs ASUS ROG Crosshair X870E Hero, Lenovo Legion Pro 7
16ARX9, Gigabyte Z790 AORUS MASTER X, MSI X870E Ace Max, TP-Link
Archer TBE550E, ASUS X870E / ProArt X870E-Creator.
- btusb: Add MT7902 IDs 13d3/3579, 13d3/3580, 13d3/3594, 13d3/3596, 0e8d/1ede
- btusb: Add MT7902 IDs 13d3/3579, 13d3/3580, 13d3/3594, 13d3/3596, 0e8d/1ede
- btusb: MediaTek MT7922: Add VID 0489 & PID e11d
- btintel: Add support for Scorpious Peak2 support
- btintel: Add support for Scorpious Peak2F support
- btintel_pcie: Add device id of Scorpius Peak2, Nova Lake-PCD-H
- btintel_pcie: Add device id of Scorpious2, Nova Lake-PCD-S
- btmtk: Add reset mechanism if downloading firmware failed
- btmtk: Add MT6639 (MT7927) Bluetooth support
- btmtk: fix ISO interface setup for single alt setting
- btmtk: add MT7902 SDIO support
- Bluetooth: btmtk: add MT7902 MCU support
- btbcm: Add entry for BCM4343A2 UART Bluetooth
- qca: enable pwrseq support for wcn39xx devices
- hci_qca: Fix BT not getting powered-off on rmmod
- hci_qca: disable power control for WCN7850 when bt_en is not defined
- hci_qca: Fix missing wakeup during SSR memdump handling
- hci_ldisc: Clear HCI_UART_PROTO_INIT on error
- mmc: sdio: add MediaTek MT7902 SDIO device ID
- hci_ll: Enable BROKEN_ENHANCED_SETUP_SYNC_CONN for WL183x
* tag 'for-net-next-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (59 commits)
Bluetooth: hci_qca: Fix missing wakeup during SSR memdump handling
Bluetooth: btintel_pcie: use strscpy to copy plain strings
Bluetooth: hci_event: fix potential UAF in SSP passkey handlers
Bluetooth: hci.h: Avoid a couple -Wflex-array-member-not-at-end warnings
Bluetooth: SCO: check for codecs->num_codecs == 1 before assigning to sco_pi(sk)->codec
Bluetooth: btintel_pcie: Align shared DMA memory to 128 bytes
Bluetooth: l2cap: Add missing chan lock in l2cap_ecred_reconf_rsp
Bluetooth: hci_ll: Enable BROKEN_ENHANCED_SETUP_SYNC_CONN for WL183x
Bluetooth: btusb: MediaTek MT7922: Add VID 0489 & PID e11d
Bluetooth: btmtk: hide unused btmtk_mt6639_devs[] array
Bluetooth: btusb: Add MT7927 ID for ASUS X870E / ProArt X870E-Creator
Bluetooth: btusb: Add MT7927 ID for TP-Link Archer TBE550E
Bluetooth: btusb: Add MT7927 ID for MSI X870E Ace Max
Bluetooth: btusb: Add MT7927 ID for Gigabyte Z790 AORUS MASTER X
Bluetooth: btusb: Add MT7927 ID for Lenovo Legion Pro 7 16ARX9
Bluetooth: btusb: Add MT7927 ID for ASUS ROG Crosshair X870E Hero
Bluetooth: btmtk: fix ISO interface setup for single alt setting
Bluetooth: btmtk: Add MT6639 (MT7927) Bluetooth support
Bluetooth: fix locking in hci_conn_request_evt() with HCI_PROTO_DEFER
Bluetooth: btmtk: refactor endpoint lookup
...
====================
Link: https://patch.msgid.link/20260413132247.320961-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
||
|
|
b7d74ea0fd |
Merge tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs i_ino updates from Christian Brauner: "For historical reasons, the inode->i_ino field is an unsigned long, which means that it's 32 bits on 32 bit architectures. This has caused a number of filesystems to implement hacks to hash a 64-bit identifier into a 32-bit field, and deprives us of a universal identifier field for an inode. This changes the inode->i_ino field from an unsigned long to a u64. This shouldn't make any material difference on 64-bit hosts, but 32-bit hosts will see struct inode grow by at least 4 bytes. This could have effects on slabcache sizes and field alignment. The bulk of the changes are to format strings and tracepoints, since the kernel itself doesn't care that much about the i_ino field. The first patch changes some vfs function arguments, so check that one out carefully. With this change, we may be able to shrink some inode structures. For instance, struct nfs_inode has a fileid field that holds the 64-bit inode number. With this set of changes, that field could be eliminated. I'd rather leave that sort of cleanups for later just to keep this simple" * tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group() EVM: add comment describing why ino field is still unsigned long vfs: remove externs from fs.h on functions modified by i_ino widening treewide: fix missed i_ino format specifier conversions ext4: fix signed format specifier in ext4_load_inode trace event treewide: change inode->i_ino from unsigned long to u64 nilfs2: widen trace event i_ino fields to u64 f2fs: widen trace event i_ino fields to u64 ext4: widen trace event i_ino fields to u64 zonefs: widen trace event i_ino fields to u64 hugetlbfs: widen trace event i_ino fields to u64 ext2: widen trace event i_ino fields to u64 cachefiles: widen trace event i_ino fields to u64 vfs: widen trace event i_ino fields to u64 net: change sock.sk_ino and sock_i_ino() to u64 audit: widen ino fields to u64 vfs: widen inode hash/lookup functions to u64 |
||
|
|
c8db08110c |
Merge tag 'vfs-7.1-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs xattr updates from Christian Brauner: "This reworks the simple_xattr infrastructure and adds support for user.* extended attributes on sockets. The simple_xattr subsystem currently uses an rbtree protected by a reader-writer spinlock. This series replaces the rbtree with an rhashtable giving O(1) average-case lookup with RCU-based lockless reads. This sped up concurrent access patterns on tmpfs quite a bit and it's an overall easy enough conversion to do and gets rid or rwlock_t. The conversion is done incrementally: a new rhashtable path is added alongside the existing rbtree, consumers are migrated one at a time (shmem, kernfs, pidfs), and then the rbtree code is removed. All three consumers switch from embedded structs to pointer-based lazy allocation so the rhashtable overhead is only paid for inodes that actually use xattrs. With this infrastructure in place the series adds support for user.* xattrs on sockets. Path-based AF_UNIX sockets inherit xattr support from the underlying filesystem (e.g. tmpfs) but sockets in sockfs - that is everything created via socket() including abstract namespace AF_UNIX sockets - had no xattr support at all. The xattr_permission() checks are reworked to allow user.* xattrs on S_IFSOCK inodes. Sockfs sockets get per-inode limits of 128 xattrs and 128KB total value size matching the limits already in use for kernfs. The practical motivation comes from several directions. systemd and GNOME are expanding their use of Varlink as an IPC mechanism. For D-Bus there are tools like dbus-monitor that can observe IPC traffic across the system but this only works because D-Bus has a central broker. For Varlink there is no broker and there is currently no way to identify which sockets speak Varlink. With user.* xattrs on sockets a service can label its socket with the IPC protocol it speaks (e.g., user.varlink=1) and an eBPF program can then selectively capture traffic on those sockets. Enumerating bound sockets via netlink combined with these xattr labels gives a way to discover all Varlink IPC entrypoints for debugging and introspection. Similarly, systemd-journald wants to use xattrs on the /dev/log socket for protocol negotiation to indicate whether RFC 5424 structured syslog is supported or whether only the legacy RFC 3164 format should be used. In containers these labels are particularly useful as high-privilege or more complicated solutions for socket identification aren't available. The series comes with comprehensive selftests covering path-based AF_UNIX sockets, sockfs socket operations, per-inode limit enforcement, and xattr operations across multiple address families (AF_INET, AF_INET6, AF_NETLINK, AF_PACKET)" * tag 'vfs-7.1-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/xattr: test xattrs on various socket families selftests/xattr: sockfs socket xattr tests selftests/xattr: path-based AF_UNIX socket xattr tests xattr: support extended attributes on sockets xattr,net: support limited amount of extended attributes on sockfs sockets xattr: move user limits for xattrs to generic infra xattr: switch xattr_permission() to switch statement xattr: add xattr_permission_error() xattr: remove rbtree-based simple_xattr infrastructure pidfs: adapt to rhashtable-based simple_xattrs kernfs: adapt to rhashtable-based simple_xattrs with lazy allocation shmem: adapt to rhashtable-based simple_xattrs with lazy allocation xattr: add rhashtable-based simple_xattr infrastructure xattr: add rcu_head and rhash_head to struct simple_xattr |
||
|
|
b025461303 |
tcp: update window_clamp when SO_RCVBUF is set
Commit under Fixes moved recomputing the window clamp to tcp_measure_rcv_mss() (when scaling_ratio changes). I suspect it missed the fact that we don't recompute the clamp when rcvbuf is set. Until scaling_ratio changes we are stuck with the old window clamp which may be based on the small initial buffer. scaling_ratio may never change. Inspired by Eric's recent commit |
||
|
|
85fa351204 |
Bluetooth: hci_event: fix potential UAF in SSP passkey handlers
hci_conn lookup and field access must be covered by hdev lock in
hci_user_passkey_notify_evt() and hci_keypress_notify_evt(), otherwise
the connection can be freed concurrently.
Extend the hci_dev_lock critical section to cover all conn usage in both
handlers.
Keep the existing keypress notification behavior unchanged by routing
the early exits through a common unlock path.
Fixes:
|
||
|
|
4e10a9ebbf |
Bluetooth: SCO: check for codecs->num_codecs == 1 before assigning to sco_pi(sk)->codec
copy_struct_from_sockptr() fill 'buffer' in
sco_sock_setsockopt() with zeros, so there's no
real problem.
But it actually looks strange to do this,
without checking all of codecs->codecs[0]
really comes from userspace:
sco_pi(sk)->codec = codecs->codecs[0];
As only optlen < sizeof(struct bt_codecs) is checked
and codecs->num_codecs is not checked against != 1,
but only <= 1, and the space for the additional struct bt_codec
is not checked.
Note I don't understand bluetooth and I didn't do any runtime
tests with this! I just found it when debugging a problem
in copy_struct_from_sockptr().
I just added this to check the size is as expected:
BUILD_BUG_ON(struct_size(codecs, codecs, 0) != 1);
BUILD_BUG_ON(struct_size(codecs, codecs, 1) != 8);
And made sure it still compiles using this:
make CF=-D__CHECK_ENDIAN__ W=1ce C=1 net/bluetooth/sco.o
Fixes:
|
||
|
|
42776497cd |
Bluetooth: l2cap: Add missing chan lock in l2cap_ecred_reconf_rsp
l2cap_ecred_reconf_rsp() calls l2cap_chan_del() without holding
l2cap_chan_lock(). Every other l2cap_chan_del() caller in the file
acquires the lock first. A remote BLE device can send a crafted
L2CAP ECRED reconfiguration response to corrupt the channel list
while another thread is iterating it.
Add l2cap_chan_hold() and l2cap_chan_lock() before l2cap_chan_del(),
and l2cap_chan_unlock() and l2cap_chan_put() after, matching the
pattern used in l2cap_ecred_conn_rsp() and l2cap_conn_del().
Fixes:
|
||
|
|
5c7209a341 |
Bluetooth: fix locking in hci_conn_request_evt() with HCI_PROTO_DEFER
When protocol sets HCI_PROTO_DEFER, hci_conn_request_evt() calls
hci_connect_cfm(conn) without hdev->lock. Generally hci_connect_cfm()
assumes it is held, and if conn is deleted concurrently -> UAF.
Only SCO and ISO set HCI_PROTO_DEFER and only for defer setup listen,
and HCI_EV_CONN_REQUEST is not generated for ISO. In the non-deferred
listening socket code paths, hci_connect_cfm(conn) is called with
hdev->lock held.
Fix by holding the lock.
Fixes:
|
||
|
|
d288f4db09 |
Bluetooth: hci_sync: make hci_cmd_sync_run_once return -EEXIST if exists
hci_cmd_sync_run_once() needs to indicate whether a queue item was added, so caller can know if callbacks are called, so it can avoid leaking resources. Change the function to return -EEXIST if queue item already exists. Modify all callsites vs. the changes. The only callsite is hci_abort_conn(). Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
15bf35a660 |
Bluetooth: L2CAP: Fix printing wrong information if SDU length exceeds MTU
The code was printing skb->len and sdu_len in the places where it should
be sdu_len and chan->imtu respectively to match the if conditions.
Link: https://lore.kernel.org/linux-bluetooth/20260315132013.75ab40c5@kernel.org/T/#m1418f9c82eeff8510c1beaa21cf53af20db96c06
Fixes:
|
||
|
|
12bec2bd4b |
bpf: reject short IPv4/IPv6 inputs in bpf_prog_test_run_skb
bpf_prog_test_run_skb() calls eth_type_trans() first and then uses
skb->protocol to initialize sk family and address fields for the test
run.
For IPv4 and IPv6 packets, it may access ip_hdr(skb) or ipv6_hdr(skb)
even when the provided test input only contains an Ethernet header.
Reject the input earlier if the Ethernet frame carries IPv4/IPv6
EtherType but the L3 header is too short.
Fold the IPv4/IPv6 header length checks into the existing protocol
switch and return -EINVAL before accessing the network headers.
Fixes:
|
||
|
|
29b1ee8788 |
net: add noinline __init __no_profile to skb_extensions_init() for GCOV compatibility
With -fprofile-update=atomic in global CFLAGS_GCOV, GCC still cannot constant-fold the skb_ext_total_length() loop when it is inlined into a profiled caller. The existing __no_profile on skb_ext_total_length() itself is insufficient because after __always_inline expansion the code resides in the caller's body, which still carries GCOV instrumentation. Mark skb_extensions_init() with __no_profile so the BUILD_BUG_ON checks can be evaluated at compile time. Also mark it noinline to prevent the compiler from inlining it into skb_init() (which lacks __no_profile), which would re-expose the function body to GCOV instrumentation. Add __init since skb_extensions_init() is only called from __init skb_init(). Previously it was implicitly inlined into the .init.text section; with noinline it would otherwise remain in permanent .text, wasting memory after boot. Build-tested with both CONFIG_GCOV_PROFILE_ALL=y and CONFIG_KCOV_INSTRUMENT_ALL=y. Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Link: https://patch.msgid.link/20260410162150.3105738-3-khorenko@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
c0b4382c86 |
net: fix skb_ext_total_length() BUILD_BUG_ON with CONFIG_GCOV_PROFILE_ALL
When CONFIG_GCOV_PROFILE_ALL=y is enabled, the kernel fails to build:
In file included from <command-line>:
In function 'skb_extensions_init',
inlined from 'skb_init' at net/core/skbuff.c:5214:2:
././include/linux/compiler_types.h:706:45: error: call to
'__compiletime_assert_1490' declared with attribute error:
BUILD_BUG_ON failed: skb_ext_total_length() > 255
CONFIG_GCOV_PROFILE_ALL adds -fprofile-arcs -ftest-coverage
-fno-tree-loop-im to CFLAGS globally. GCC inserts branch profiling
counters into the skb_ext_total_length() loop and, combined with
-fno-tree-loop-im (which disables loop invariant motion), cannot
constant-fold the result.
BUILD_BUG_ON requires a compile-time constant and fails.
The issue manifests in kernels with 5+ SKB extension types enabled
(e.g., after addition of SKB_EXT_CAN, SKB_EXT_PSP). With 4 extensions
GCC can still unroll and fold the loop despite GCOV instrumentation;
with 5+ it gives up.
Mark skb_ext_total_length() with __no_profile to prevent GCOV from
inserting counters into this function. Without counters the loop is
"clean" and GCC can constant-fold it even with -fno-tree-loop-im active.
This allows BUILD_BUG_ON to work correctly while keeping GCOV profiling
for the rest of the kernel.
This also removes the CONFIG_KCOV_INSTRUMENT_ALL preprocessor guard
introduced by
|
||
|
|
d114bfdc9b |
vsock: fix buffer size clamping order
In vsock_update_buffer_size(), the buffer size was being clamped to the
maximum first, and then to the minimum. If a user sets a minimum buffer
size larger than the maximum, the minimum check overrides the maximum
check, inverting the constraint.
This breaks the intended socket memory boundaries by allowing the
vsk->buffer_size to grow beyond the configured vsk->buffer_max_size.
Fix this by checking the minimum first, and then the maximum. This
ensures the buffer size never exceeds the buffer_max_size.
Fixes:
|
||
|
|
9336854a59 |
Merge branch 'net-reduce-sk_filter-and-friends-bloat'
Eric Dumazet says: ==================== net: reduce sk_filter() (and friends) bloat Some functions return an error by value, and a drop_reason by an output parameter. This extra parameter can force stack canaries. A drop_reason is enough and more efficient. This series reduces bloat by 678 bytes on x86_64: $ scripts/bloat-o-meter -t vmlinux.old vmlinux.final add/remove: 0/0 grow/shrink: 3/18 up/down: 79/-757 (-678) Function old new delta vsock_queue_rcv_skb 50 79 +29 ipmr_cache_report 1290 1315 +25 ip6mr_cache_report 1322 1347 +25 tcp_v6_rcv 3169 3167 -2 packet_rcv_spkt 329 327 -2 unix_dgram_sendmsg 1731 1726 -5 netlink_unicast 957 945 -12 netlink_dump 1372 1359 -13 sk_filter_trim_cap 889 858 -31 netlink_broadcast_filtered 1633 1595 -38 tcp_v4_rcv 3152 3111 -41 raw_rcv_skb 122 80 -42 ping_queue_rcv_skb 109 61 -48 ping_rcv 215 162 -53 rawv6_rcv_skb 278 224 -54 __sk_receive_skb 690 632 -58 raw_rcv 591 527 -64 udpv6_queue_rcv_one_skb 935 869 -66 udp_queue_rcv_one_skb 919 853 -66 tun_net_xmit 1146 1074 -72 sock_queue_rcv_skb_reason 166 76 -90 Total: Before=29722890, After=29722212, chg -0.00% Future conversions from sock_queue_rcv_skb() to sock_queue_rcv_skb_reason() can be done later. ==================== Link: https://patch.msgid.link/20260409145625.2306224-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
fb37aea2a0 |
net: change sk_filter_trim_cap() to return a drop_reason by value
Current return value can be replaced with the drop_reason, reducing kernel bloat: $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/11 up/down: 32/-603 (-571) Function old new delta tcp_v6_rcv 3135 3167 +32 unix_dgram_sendmsg 1731 1726 -5 netlink_unicast 957 945 -12 netlink_dump 1372 1359 -13 sk_filter_trim_cap 882 858 -24 tcp_v4_rcv 3143 3111 -32 __pfx_tcp_filter 32 - -32 netlink_broadcast_filtered 1633 1595 -38 sock_queue_rcv_skb_reason 126 76 -50 tun_net_xmit 1127 1074 -53 __sk_receive_skb 690 632 -58 udpv6_queue_rcv_one_skb 935 869 -66 udp_queue_rcv_one_skb 919 853 -66 tcp_filter 154 - -154 Total: Before=29722783, After=29722212, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
97449a5f1a |
tcp: change tcp_filter() to return the reason by value
sk_filter_trim_cap() will soon return the reason by value, do the same for tcp_filter(). Note: tcp_filter() is no longer inlined. Following patch will inline it again. $ scripts/bloat-o-meter -t vmlinux.4 vmlinux.5 add/remove: 2/0 grow/shrink: 0/2 up/down: 186/-43 (143) Function old new delta tcp_filter - 154 +154 __pfx_tcp_filter - 32 +32 tcp_v4_rcv 3152 3143 -9 tcp_v6_rcv 3169 3135 -34 Total: Before=29722640, After=29722783, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
c78bcbd519 |
net: change sk_filter_reason() to return the reason by value
sk_filter_trim_cap will soon return the reason by value, do the same for sk_filter_reason(). $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-21 (-21) Function old new delta sock_queue_rcv_skb_reason 128 126 -2 tun_net_xmit 1146 1127 -19 Total: Before=29722661, After=29722640, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
734ea7e324 |
net: always set reason in sk_filter_trim_cap()
sk_filter_trim_cap() will soon return the drop reason by value. Make sure *reason is cleared when no error is returned, to ease this conversion. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-7 (-7) Function old new delta sk_filter_trim_cap 889 882 -7 Total: Before=29722668, After=29722661, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
900f27fb79 |
net: change sock_queue_rcv_skb_reason() to return a drop_reason
Change sock_queue_rcv_skb_reason() to return the drop_reason directly instead of using a reference. This is part of an effort to remove stack canaries and reduce bloat. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 3/7 up/down: 79/-301 (-222) Function old new delta vsock_queue_rcv_skb 50 79 +29 ipmr_cache_report 1290 1315 +25 ip6mr_cache_report 1322 1347 +25 packet_rcv_spkt 329 327 -2 sock_queue_rcv_skb_reason 166 128 -38 raw_rcv_skb 122 80 -42 ping_queue_rcv_skb 109 61 -48 ping_rcv 215 162 -53 rawv6_rcv_skb 278 224 -54 raw_rcv 591 527 -64 Total: Before=29722890, After=29722668, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260409145625.2306224-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
ebf71dd4af |
net/rds: Restrict use of RDS/IB to the initial network namespace
Prevent using RDS/IB in network namespaces other than the initial one.
The existing RDS/IB code will not work properly in non-initial network
namespaces.
Fixes:
|
||
|
|
236f718ac8 |
net/rds: Optimize rds_ib_laddr_check
rds_ib_laddr_check() creates a CM_ID and attempts to bind the address
in question to it. This in order to qualify the allegedly local
address as a usable IB/RoCE address.
In the field, ExaWatcher runs rds-ping to all ports in the fabric from
all local ports. This using all active ToS'es. In a full rack system,
we have 14 cell servers and eight db servers. Typically, 6 ToS'es are
used. This implies 528 rds-ping invocations per ExaWatcher's "RDSinfo"
interval.
Adding to this, each rds-ping invocation creates eight sockets and
binds the local address to them:
socket(AF_RDS, SOCK_SEQPACKET, 0) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 4
bind(4, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 5
bind(5, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 6
bind(6, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 7
bind(7, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 8
bind(8, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 9
bind(9, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 10
bind(10, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.36.2")}, 16) = 0
So, at every interval ExaWatcher executes rds-ping's, 4224 CM_IDs are
allocated, considering this full-rack system. After the a CM_ID has
been allocated, rdma_bind_addr() is called, with the port number being
zero. This implies that the CMA will attempt to search for an un-used
ephemeral port. Simplified, the algorithm is to start at a random
position in the available port space, and then if needed, iterate
until an un-used port is found.
The book-keeping of used ports uses the idr system, which again uses
slab to allocate new struct idr_layer's. The size is 2092 bytes and
slab tries to reduce the wasted space. Hence, it chooses an order:3
allocation, for which 15 idr_layer structs will fit and only 1388
bytes are wasted per the 32KiB order:3 chunk.
Although this order:3 allocation seems like a good space/speed
trade-off, it does not resonate well with how it used by the CMA. The
combination of the randomized starting point in the port space (which
has close to zero spatial locality) and the close proximity in time of
the 4224 invocations of the rds-ping's, creates a memory hog for
order:3 allocations.
These costly allocations may need reclaims and/or compaction. At
worst, they may fail and produce a stack trace such as (from uek4):
[<ffffffff811a72d5>] __inc_zone_page_state+0x35/0x40
[<ffffffff811c2e97>] page_add_file_rmap+0x57/0x60
[<ffffffffa37ca1df>] remove_migration_pte+0x3f/0x3c0 [ksplice_6cn872bt_vmlinux_new]
[<ffffffff811c3de8>] rmap_walk+0xd8/0x340
[<ffffffff811e8860>] remove_migration_ptes+0x40/0x50
[<ffffffff811ea83c>] migrate_pages+0x3ec/0x890
[<ffffffff811afa0d>] compact_zone+0x32d/0x9a0
[<ffffffff811b00ed>] compact_zone_order+0x6d/0x90
[<ffffffff811b03b2>] try_to_compact_pages+0x102/0x270
[<ffffffff81190e56>] __alloc_pages_direct_compact+0x46/0x100
[<ffffffff8119165b>] __alloc_pages_nodemask+0x74b/0xaa0
[<ffffffff811d8411>] alloc_pages_current+0x91/0x110
[<ffffffff811e3b0b>] new_slab+0x38b/0x480
[<ffffffffa41323c7>] __slab_alloc+0x3b7/0x4a0 [ksplice_s0dk66a8_vmlinux_new]
[<ffffffff811e42ab>] kmem_cache_alloc+0x1fb/0x250
[<ffffffff8131fdd6>] idr_layer_alloc+0x36/0x90
[<ffffffff8132029c>] idr_get_empty_slot+0x28c/0x3d0
[<ffffffff813204ad>] idr_alloc+0x4d/0xf0
[<ffffffffa051727d>] cma_alloc_port+0x4d/0xa0 [rdma_cm]
[<ffffffffa0517cbe>] rdma_bind_addr+0x2ae/0x5b0 [rdma_cm]
[<ffffffffa09d8083>] rds_ib_laddr_check+0x83/0x2c0 [ksplice_6l2xst5i_rds_rdma_new]
[<ffffffffa05f892b>] rds_trans_get_preferred+0x5b/0xa0 [rds]
[<ffffffffa05f09f2>] rds_bind+0x212/0x280 [rds]
[<ffffffff815b4016>] SYSC_bind+0xe6/0x120
[<ffffffff815b4d3e>] SyS_bind+0xe/0x10
[<ffffffff816b031a>] system_call_fastpath+0x18/0xd4
To avoid these excessive calls to rdma_bind_addr(), we optimize
rds_ib_laddr_check() by simply checking if the address in question has
been used before. The rds_rdma module keeps track of addresses
associated with IB devices, and the function rds_ib_get_device() is
used to determine if the address already has been qualified as a valid
local address. If not found, we call the legacy rds_ib_laddr_check(),
now renamed to rds_ib_laddr_check_cm().
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260408080420.540032-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
||
|
|
2835750dd6 |
net: rose: reject truncated CLEAR_REQUEST frames in state machines
All five ROSE state machines (states 1-5) handle ROSE_CLEAR_REQUEST by reading the cause and diagnostic bytes directly from skb->data[3] and skb->data[4] without verifying that the frame is long enough: rose_disconnect(sk, ..., skb->data[3], skb->data[4]); The entry-point check in rose_route_frame() only enforces ROSE_MIN_LEN (3 bytes), so a remote peer on a ROSE network can send a syntactically valid but truncated CLEAR_REQUEST (3 or 4 bytes) while a connection is open in any state. Processing such a frame causes a one- or two-byte out-of-bounds read past the skb data, leaking uninitialized heap content as the cause/diagnostic values returned to user space via getsockopt(ROSE_GETCAUSE). Add a single length check at the rose_process_rx_frame() dispatch point, before any state machine is entered, to drop frames that carry the CLEAR_REQUEST type code but are too short to contain the required cause and diagnostic fields. Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org> Link: https://patch.msgid.link/20260408172551.281486-1-mashiro.chen@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
8632175ccb |
gre: Count GRE packet drops
GRE is silently dropping packets without updating statistics. In case of drop, increment rx_dropped counter to provide visibility into packet loss. For the case where no GRE protocol handler is registered, use rx_nohandler. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20260409090945.1542440-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
10f86a2a5c |
bpf: Fix same-register dst/src OOB read and pointer leak in sock_ops
When a BPF sock_ops program accesses ctx fields with dst_reg == src_reg, the SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() macros fail to zero the destination register in the !fullsock / !locked_tcp_sock path. Both macros borrow a temporary register to check is_fullsock / is_locked_tcp_sock when dst_reg == src_reg, because dst_reg holds the ctx pointer. When the check is false (e.g., TCP_NEW_SYN_RECV state with a request_sock), dst_reg should be zeroed but is not, leaving the stale ctx pointer: - SOCK_OPS_GET_SK: dst_reg retains the ctx pointer, passes NULL checks as PTR_TO_SOCKET_OR_NULL, and can be used as a bogus socket pointer, leading to stack-out-of-bounds access in helpers like bpf_skc_to_tcp6_sock(). - SOCK_OPS_GET_FIELD: dst_reg retains the ctx pointer which the verifier believes is a SCALAR_VALUE, leaking a kernel pointer. Fix both macros by: - Changing JMP_A(1) to JMP_A(2) in the fullsock path to skip the added instruction. - Adding BPF_MOV64_IMM(si->dst_reg, 0) after the temp register restore in the !fullsock path, placed after the restore because dst_reg == src_reg means we need src_reg intact to read ctx->temp. Fixes: |