Commit Graph

1219 Commits

Author SHA1 Message Date
Linus Torvalds
323bbfcf1e Convert 'alloc_flex' family to use the new default GFP_KERNEL argument
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.

As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Linus Torvalds
311aa68319 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
 "Usual smallish cycle. The NFS biovec work to push it down into RDMA
  instead of indirecting through a scatterlist is pretty nice to see,
  been talked about for a long time now.

   - Various code improvements in irdma, rtrs, qedr, ocrdma, irdma, rxe

   - Small driver improvements and minor bug fixes to hns, mlx5, rxe,
     mana, mlx5, irdma

   - Robusness improvements in completion processing for EFA

   - New query_port_speed() verb to move past limited IBA defined speed
     steps

   - Support for SG_GAPS in rts and many other small improvements

   - Rare list corruption fix in iwcm

   - Better support different page sizes in rxe

   - Device memory support for mana

   - Direct bio vec to kernel MR for use by NFS-RDMA

   - QP rate limiting for bnxt_re

   - Remote triggerable NULL pointer crash in siw

   - DMA-buf exporter support for RDMA mmaps like doorbells"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits)
  RDMA/mlx5: Implement DMABUF export ops
  RDMA/uverbs: Add DMABUF object type and operations
  RDMA/uverbs: Support external FD uobjects
  RDMA/siw: Fix potential NULL pointer dereference in header processing
  RDMA/umad: Reject negative data_len in ib_umad_write
  IB/core: Extend rate limit support for RC QPs
  RDMA/mlx5: Support rate limit only for Raw Packet QP
  RDMA/bnxt_re: Report QP rate limit in debugfs
  RDMA/bnxt_re: Report packet pacing capabilities when querying device
  RDMA/bnxt_re: Add support for QP rate limiting
  MAINTAINERS: Drop RDMA files from Hyper-V section
  RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc
  svcrdma: use bvec-based RDMA read/write API
  RDMA/core: add rdma_rw_max_sge() helper for SQ sizing
  RDMA/core: add MR support for bvec-based RDMA operations
  RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
  RDMA/core: add bio_vec based RDMA read/write API
  RDMA/irdma: Use kvzalloc for paged memory DMA address array
  RDMA/rxe: Fix race condition in QP timer handlers
  RDMA/mana_ib: Add device‑memory support
  ...
2026-02-12 17:05:20 -08:00
YunJe Shin
14ab3da122 RDMA/siw: Fix potential NULL pointer dereference in header processing
If siw_get_hdr() returns -EINVAL before set_rx_fpdu_context(),
qp->rx_fpdu can be NULL. The error path in siw_tcp_rx_data()
dereferences qp->rx_fpdu->more_ddp_segs without checking, which
may lead to a NULL pointer deref. Only check more_ddp_segs when
rx_fpdu is present.

KASAN splat:
[  101.384271] KASAN: null-ptr-deref in range [0x00000000000000c0-0x00000000000000c7]
[  101.385869] RIP: 0010:siw_tcp_rx_data+0x13ad/0x1e50

Fixes: 8b6a361b8c ("rdma/siw: receive path")
Signed-off-by: YunJe Shin <ioerts@kookmin.ac.kr>
Link: https://patch.msgid.link/20260204092546.489842-1-ioerts@kookmin.ac.kr
Acked-by: Bernard Metzler <bernard.metzler@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-05 07:46:52 -05:00
Li Zhijian
87bf646921 RDMA/rxe: Fix race condition in QP timer handlers
I encontered the following warning:
 WARNING: drivers/infiniband/sw/rxe/rxe_task.c:249 at rxe_sched_task+0x1c8/0x238 [rdma_rxe], CPU#0: swapper/0/0
...
  libsha1 [last unloaded: ip6_udp_tunnel]
 CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G         C          6.19.0-rc5-64k-v8+ #37 PREEMPT
 Tainted: [C]=CRAP
 Hardware name: Raspberry Pi 4 Model B Rev 1.2
 Call trace:
  rxe_sched_task+0x1c8/0x238 [rdma_rxe] (P)
  retransmit_timer+0x130/0x188 [rdma_rxe]
  call_timer_fn+0x68/0x4d0
  __run_timers+0x630/0x888
...
 WARNING: drivers/infiniband/sw/rxe/rxe_task.c:38 at rxe_sched_task+0x1c0/0x238 [rdma_rxe], CPU#0: swapper/0/0
...
 WARNING: drivers/infiniband/sw/rxe/rxe_task.c:111 at do_work+0x488/0x5c8 [rdma_rxe], CPU#3: kworker/u17:4/93400
...
 refcount_t: underflow; use-after-free.
 WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x138/0x1a0, CPU#3: kworker/u17:4/93400

The issue is caused by a race condition between retransmit_timer() and
rxe_destroy_qp, leading to the Queue Pair's (QP) reference count dropping
to zero during timer handler execution.

It seems this warning is harmless because rxe_qp_do_cleanup() will flush
all pending timers and requests.

Example of flow causing the issue:

CPU0                                   CPU1
retransmit_timer() {
    spin_lock_irqsave
                           rxe_destroy_qp()
                            __rxe_cleanup()
                              __rxe_put() // qp->ref_count decrease to 0
                            rxe_qp_do_cleanup() {
    if (qp->valid) {
        rxe_sched_task() {
            WARN_ON(rxe_read(task->qp) <= 0);
        }
    }
    spin_unlock_irqrestore
}
                              spin_lock_irqsave
                              qp->valid = 0
                              spin_unlock_irqrestore
                            }

Ensure the QP's reference count is maintained and its validity is checked
within the timer callbacks by adding calls to rxe_get(qp) and corresponding
rxe_put(qp) after use.

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Fixes: d946716325 ("RDMA/rxe: Rewrite rxe_task.c")
Link: https://patch.msgid.link/20260120074437.623018-1-lizhijian@fujitsu.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-28 05:02:30 -05:00
Li Zhijian
12985e5915 RDMA/rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE
The current implementation incorrectly handles memory regions (MRs) with
page sizes different from the system PAGE_SIZE. The core issue is that
rxe_set_page() is called with mr->page_size step increments, but the
page_list stores individual struct page pointers, each representing
PAGE_SIZE of memory.

ib_sg_to_page() has ensured that when i>=1 either
a) SG[i-1].dma_end and SG[i].dma_addr are contiguous
or
b) SG[i-1].dma_end and SG[i].dma_addr are mr->page_size aligned.

This leads to incorrect iova-to-va conversion in scenarios:

1) page_size < PAGE_SIZE (e.g., MR: 4K, system: 64K):
   ibmr->iova = 0x181800
   sg[0]: dma_addr=0x181800, len=0x800
   sg[1]: dma_addr=0x173000, len=0x1000

   Access iova = 0x181800 + 0x810 = 0x182010
   Expected VA: 0x173010 (second SG, offset 0x10)
   Before fix:
     - index = (0x182010 >> 12) - (0x181800 >> 12) = 1
     - page_offset = 0x182010 & 0xFFF = 0x10
     - xarray[1] stores system page base 0x170000
     - Resulting VA: 0x170000 + 0x10 = 0x170010 (wrong)

2) page_size > PAGE_SIZE (e.g., MR: 64K, system: 4K):
   ibmr->iova = 0x18f800
   sg[0]: dma_addr=0x18f800, len=0x800
   sg[1]: dma_addr=0x170000, len=0x1000

   Access iova = 0x18f800 + 0x810 = 0x190010
   Expected VA: 0x170010 (second SG, offset 0x10)
   Before fix:
     - index = (0x190010 >> 16) - (0x18f800 >> 16) = 1
     - page_offset = 0x190010 & 0xFFFF = 0x10
     - xarray[1] stores system page for dma_addr 0x170000
     - Resulting VA: system page of 0x170000 + 0x10 = 0x170010 (wrong)

Yi Zhang reported a kernel panic[1] years ago related to this defect.

Solution:
1. Replace xarray with pre-allocated rxe_mr_page array for sequential
   indexing (all MR page indices are contiguous)
2. Each rxe_mr_page stores both struct page* and offset within the
   system page
3. Handle MR page_size != PAGE_SIZE relationships:
   - page_size > PAGE_SIZE: Split MR pages into multiple system pages
   - page_size <= PAGE_SIZE: Store offset within system page
4. Add boundary checks and compatibility validation

This ensures correct iova-to-va conversion regardless of MR page size
and system PAGE_SIZE relationship, while improving performance through
array-based sequential access.

Tests on 4K and 64K PAGE_SIZE hosts:
- rdma-core/pytests
  $ ./build/bin/run_tests.py  --dev eth0_rxe
- blktest:
  $ TIMEOUT=30 QUICK_RUN=1 USE_RXE=1 NVMET_TRTYPES=rdma ./check nvme srp rnbd

[1] https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/

Fixes: 592627ccbd ("RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Link: https://patch.msgid.link/20260116032753.2574363-1-lizhijian@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-25 08:33:19 -05:00
Li Zhijian
d3922f6dad RDMA/rxe: Remove unused page_offset member
In rxe_map_mr_sg(), the `page_offset` member of the `rxe_mr` struct
was initialized based on `ibmr.iova`, which will be updated inside
ib_sg_to_pages() later.

Consequently, the value assigned to `page_offset` was incorrect. However,
since `page_offset` was never utilized throughout the code, it can be safely
removed to clean up the codebase and avoid future confusion.

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Link: https://patch.msgid.link/20260116032833.2574627-1-lizhijian@fujitsu.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-18 11:46:15 -05:00
Jiasheng Jiang
0beefd0e15 RDMA/rxe: Fix double free in rxe_srq_from_init
In rxe_srq_from_init(), the queue pointer 'q' is assigned to
'srq->rq.queue' before copying the SRQ number to user space.
If copy_to_user() fails, the function calls rxe_queue_cleanup()
to free the queue, but leaves the now-invalid pointer in
'srq->rq.queue'.

The caller of rxe_srq_from_init() (rxe_create_srq) eventually
calls rxe_srq_cleanup() upon receiving the error, which triggers
a second rxe_queue_cleanup() on the same memory, leading to a
double free.

The call trace looks like this:
   kmem_cache_free+0x.../0x...
   rxe_queue_cleanup+0x1a/0x30 [rdma_rxe]
   rxe_srq_cleanup+0x42/0x60 [rdma_rxe]
   rxe_elem_release+0x31/0x70 [rdma_rxe]
   rxe_create_srq+0x12b/0x1a0 [rdma_rxe]
   ib_create_srq_user+0x9a/0x150 [ib_core]

Fix this by moving 'srq->rq.queue = q' after copy_to_user.

Fixes: aae0484e15 ("IB/rxe: avoid srq memory leak")
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Link: https://patch.msgid.link/20260112015412.29458-1-jiashengjiangcool@gmail.com
Reviewed-by: Zhu Yanjun <yanjun.Zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-15 04:59:53 -05:00
Li Zhijian
3c68cf6823 IB/rxe: Fix missing umem_odp->umem_mutex unlock on error path
rxe_odp_map_range_and_lock() must release umem_odp->umem_mutex when an
error occurs, including cases where rxe_check_pagefault() fails.

Fixes: 2fae67ab63 ("RDMA/rxe: Add support for Send/Recv/Write/Read with ODP")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Link: https://patch.msgid.link/20251226094112.3042583-1-lizhijian@fujitsu.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-30 04:22:59 -05:00
Stefan Metzmacher
de41cbc64d RDMA/rxe: let rxe_reclassify_recv_socket() call sk_owner_put()
On kernels build with CONFIG_PROVE_LOCKING, CONFIG_MODULES
and CONFIG_DEBUG_LOCK_ALLOC 'rmmod rdma_rxe' is no longer
possible.

For the global recv sockets rxe_net_exit() is where we
call rxe_release_udp_tunnel-> udp_tunnel_sock_release(),
which means the sockets are destroyed before 'rmmod rdma_rxe'
finishes, so there's no need to protect against
rxe_recv_slock_key and rxe_recv_sk_key disappearing
while the sockets are still alive.

Fixes: 80a85a771d ("RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep")
Cc: Zhu Yanjun <zyjzyj2000@gmail.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://patch.msgid.link/20251219140408.2300163-1-metze@samba.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-21 05:29:11 -05:00
Linus Torvalds
55aa394a5e Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
 "This has another new RDMA driver 'bng_en' for latest generation
  Broadcom NICs. There might be one more new driver still to come.

  Otherwise it is a fairly quite cycle. Summary:

   - Minor driver bug fixes and updates to cxgb4, rxe, rdmavt, bnxt_re,
     mlx5

   - Many bug fix patches for irdma

   - WQ_PERCPU annotations and system_dfl_wq changes

   - Improved mlx5 support for "other eswitches" and multiple PFs

   - 1600Gbps link speed reporting support. Four Digits Now!

   - New driver bng_en for latest generation Broadcom NICs

   - Bonding support for hns

   - Adjust mlx5's hmm based ODP to work with the very large address
     space created by the new 5 level paging default on x86

   - Lockdep fixups in rxe and siw"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits)
  RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep
  RDMA/siw: reclassify sockets in order to avoid false positives from lockdep
  RDMA/bng_re: Remove prefetch instruction
  RDMA/core: Reduce cond_resched() frequency in __ib_umem_release
  RDMA/irdma: Fix SRQ shadow area address initialization
  RDMA/irdma: Remove doorbell elision logic
  RDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+
  RDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEY
  RDMA/irdma: Add missing mutex destroy
  RDMA/irdma: Fix SIGBUS in AEQ destroy
  RDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2
  RDMA/irdma: Fix data race in irdma_free_pble
  RDMA/irdma: Fix data race in irdma_sc_ccq_arm
  RDMA/mlx5: Add support for 1600_8x lane speed
  RDMA/core: Add new IB rate for XDR (8x) support
  IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled
  RDMA/bnxt_re: Pass correct flag for dma mr creation
  RDMA/bnxt_re: Fix the inline size for GenP7 devices
  RDMA/hns: Support reset recovery for bond
  RDMA/hns: Support link state reporting for bond
  ...
2025-12-04 18:54:37 -08:00
Stefan Metzmacher
80a85a771d RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep
While developing IPPROTO_SMBDIRECT support for the code
under fs/smb/common/smbdirect [1], I noticed false positives like this:

[+0,003927] ============================================
[+0,000532] WARNING: possible recursive locking detected
[+0,000611] 6.18.0-rc5-metze-kasan-lockdep.02+ #1 Tainted: G           OE
[+0,000835] --------------------------------------------
[+0,000729] ksmbd:r5445/3609 is trying to acquire lock:
[+0,000709] ffff88800b9570f8 (k-sk_lock-AF_INET){+.+.}-{0:0},
                              at: inet_shutdown+0x52/0x360
[+0,000831]
            but task is already holding lock:
[+0,000684] ffff88800654af78 (k-sk_lock-AF_INET){+.+.}-{0:0},
                           at: smbdirect_sk_close+0x122/0x790 [smbdirect]
[+0,000928]
            other info that might help us debug this:
[+0,005552]  Possible unsafe locking scenario:

[+0,000723]        CPU0
[+0,000359]        ----
[+0,000377]   lock(k-sk_lock-AF_INET);
[+0,000478]   lock(k-sk_lock-AF_INET);
[+0,000498]
             *** DEADLOCK ***

[+0,001012]  May be due to missing lock nesting notation

[+0,000831] 3 locks held by ksmbd:r5445/3609:
[+0,000484]  #0: ffff88800654af78 (k-sk_lock-AF_INET){+.+.}-{0:0},
                           at: smbdirect_sk_close+0x122/0x790 [smbdirect]
[+0,001000]  #1: ffff888020a40458 (&id_priv->handler_mutex){+.+.}-{4:4},
                           at: rdma_lock_handler+0x17/0x30 [rdma_cm]
[+0,000982]  #2: ffff888020a40350 (&id_priv->qp_mutex){+.+.}-{4:4},
                           at: rdma_destroy_qp+0x5d/0x1f0 [rdma_cm]
[+0,000934]
            stack backtrace:
[+0,000589] CPU: 0 UID: 0 PID: 3609 Comm: ksmbd:r5445 Kdump: loaded
             Tainted: G           OE
             6.18.0-rc5-metze-kasan-lockdep.02+ #1 PREEMPT(voluntary)
[+0,000023] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[+0,000004] Hardware name: innotek GmbH VirtualBox/VirtualBox,
            BIOS VirtualBox 12/01/2006
...
[+0,000010] print_deadlock_bug+0x245/0x330
[+0,000014] validate_chain+0x32a/0x590
[+0,000012] __lock_acquire+0x535/0xc30
[+0,000013] lock_acquire.part.0+0xb3/0x240
[+0,000017] ? inet_shutdown+0x52/0x360
[+0,000013] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000007] ? mark_held_locks+0x46/0x90
[+0,000012] lock_acquire+0x60/0x140
[+0,000006] ? inet_shutdown+0x52/0x360
[+0,000028] lock_sock_nested+0x3b/0xf0
[+0,000009] ? inet_shutdown+0x52/0x360
[+0,000008] inet_shutdown+0x52/0x360
[+0,000010] kernel_sock_shutdown+0x5b/0x90
[+0,000011] rxe_qp_do_cleanup+0x4ef/0x810 [rdma_rxe]
[+0,000043] ? __pfx_rxe_qp_do_cleanup+0x10/0x10 [rdma_rxe]
[+0,000030] execute_in_process_context+0x2b/0x170
[+0,000013] rxe_qp_cleanup+0x1c/0x30 [rdma_rxe]
[+0,000021] __rxe_cleanup+0x1cf/0x2e0 [rdma_rxe]
[+0,000036] ? __pfx___rxe_cleanup+0x10/0x10 [rdma_rxe]
[+0,000020] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000006] ? __kasan_check_read+0x11/0x20
[+0,000012] rxe_destroy_qp+0xe1/0x230 [rdma_rxe]
[+0,000035] ib_destroy_qp_user+0x217/0x450 [ib_core]
[+0,000074] rdma_destroy_qp+0x83/0x1f0 [rdma_cm]
[+0,000034] smbdirect_connection_destroy_qp+0x98/0x2e0 [smbdirect]
[+0,000017] ? __pfx_smb_direct_logging_needed+0x10/0x10 [ksmbd]
[+0,000044] smbdirect_connection_destroy+0x698/0xed0 [smbdirect]
[+0,000023] ? __pfx_smbdirect_connection_destroy+0x10/0x10 [smbdirect]
[+0,000033] ? __pfx_smb_direct_logging_needed+0x10/0x10 [ksmbd]
[+0,000031] smbdirect_connection_destroy_sync+0x42b/0x9f0 [smbdirect]
[+0,000029] ? mark_held_locks+0x46/0x90
[+0,000012] ? __pfx_smbdirect_connection_destroy_sync+0x10/0x10 [smbdirect]
[+0,000019] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000007] ? trace_hardirqs_on+0x64/0x70
[+0,000029] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000010] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000006] ? __smbdirect_connection_schedule_disconnect+0x339/0x4b0
[+0,000021] smbdirect_sk_destroy+0xb0/0x680 [smbdirect]
[+0,000024] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000006] ? trace_hardirqs_on+0x64/0x70
[+0,000006] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000005] ? __local_bh_enable_ip+0xba/0x150
[+0,000011] sk_common_release+0x66/0x340
[+0,000010] smbdirect_sk_close+0x12a/0x790 [smbdirect]
[+0,000023] ? ip_mc_drop_socket+0x1e/0x240
[+0,000013] inet_release+0x10a/0x240
[+0,000011] smbdirect_sock_release+0x502/0xe80 [smbdirect]
[+0,000015] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000024] sock_release+0x91/0x1c0
[+0,000010] smb_direct_free_transport+0x31/0x50 [ksmbd]
[+0,000025] ksmbd_conn_free+0x1d0/0x240 [ksmbd]
[+0,000040] smb_direct_disconnect+0xb2/0x120 [ksmbd]
[+0,000023] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000018] ksmbd_conn_handler_loop+0x94e/0xf10 [ksmbd]
...

I'll also add reclassify to the smbdirect socket code [1],
but I think it's better to have it in both direction
(below and above the RDMA layer).

[1]
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/master-ipproto-smbdirect

Cc: Zhu Yanjun <zyjzyj2000@gmail.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://patch.msgid.link/20251127105614.2040922-1-metze@samba.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-27 07:10:02 -05:00
Stefan Metzmacher
3a2c32d357 RDMA/siw: reclassify sockets in order to avoid false positives from lockdep
While developing IPPROTO_SMBDIRECT support for the code
under fs/smb/common/smbdirect [1], I noticed false positives like this:

[T79] ======================================================
[T79] WARNING: possible circular locking dependency detected
[T79] 6.18.0-rc4-metze-kasan-lockdep.01+ #1 Tainted: G           OE
[T79] ------------------------------------------------------
[T79] kworker/2:0/79 is trying to acquire lock:
[T79] ffff88801f968278 (sk_lock-AF_INET){+.+.}-{0:0},
                        at: sock_set_reuseaddr+0x14/0x70
[T79]
        but task is already holding lock:
[T79] ffffffffc10f7230 (lock#9){+.+.}-{4:4},
                        at: rdma_listen+0x3d2/0x740 [rdma_cm]
[T79]
        which lock already depends on the new lock.

[T79]
        the existing dependency chain (in reverse order) is:
[T79]
        -> #1 (lock#9){+.+.}-{4:4}:
[T79]        __lock_acquire+0x535/0xc30
[T79]        lock_acquire.part.0+0xb3/0x240
[T79]        lock_acquire+0x60/0x140
[T79]        __mutex_lock+0x1af/0x1c10
[T79]        mutex_lock_nested+0x1b/0x30
[T79]        cma_get_port+0xba/0x7d0 [rdma_cm]
[T79]        rdma_bind_addr_dst+0x598/0x9a0 [rdma_cm]
[T79]        cma_bind_addr+0x107/0x320 [rdma_cm]
[T79]        rdma_resolve_addr+0xa3/0x830 [rdma_cm]
[T79]        destroy_lease_table+0x12b/0x420 [ksmbd]
[T79]        ksmbd_NTtimeToUnix+0x3e/0x80 [ksmbd]
[T79]        ndr_encode_posix_acl+0x6e9/0xab0 [ksmbd]
[T79]        ndr_encode_v4_ntacl+0x53/0x870 [ksmbd]
[T79]        __sys_connect_file+0x131/0x1c0
[T79]        __sys_connect+0x111/0x140
[T79]        __x64_sys_connect+0x72/0xc0
[T79]        x64_sys_call+0xe7d/0x26a0
[T79]        do_syscall_64+0x93/0xff0
[T79]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[T79]
        -> #0 (sk_lock-AF_INET){+.+.}-{0:0}:
[T79]        check_prev_add+0xf3/0xcd0
[T79]        validate_chain+0x466/0x590
[T79]        __lock_acquire+0x535/0xc30
[T79]        lock_acquire.part.0+0xb3/0x240
[T79]        lock_acquire+0x60/0x140
[T79]        lock_sock_nested+0x3b/0xf0
[T79]        sock_set_reuseaddr+0x14/0x70
[T79]        siw_create_listen+0x145/0x1540 [siw]
[T79]        iw_cm_listen+0x313/0x5b0 [iw_cm]
[T79]        cma_iw_listen+0x271/0x3c0 [rdma_cm]
[T79]        rdma_listen+0x3b1/0x740 [rdma_cm]
[T79]        cma_listen_on_dev+0x46a/0x750 [rdma_cm]
[T79]        rdma_listen+0x4b0/0x740 [rdma_cm]
[T79]        ksmbd_rdma_init+0x12b/0x270 [ksmbd]
[T79]        ksmbd_conn_transport_init+0x26/0x70 [ksmbd]
[T79]        server_ctrl_handle_work+0x1e5/0x280 [ksmbd]
[T79]        process_one_work+0x86c/0x1930
[T79]        worker_thread+0x6f0/0x11f0
[T79]        kthread+0x3ec/0x8b0
[T79]        ret_from_fork+0x314/0x400
[T79]        ret_from_fork_asm+0x1a/0x30
[T79]
        other info that might help us debug this:

[T79]  Possible unsafe locking scenario:

[T79]        CPU0                    CPU1
[T79]        ----                    ----
[T79]   lock(lock#9);
[T79]                                lock(sk_lock-AF_INET);
[T79]                                lock(lock#9);
[T79]   lock(sk_lock-AF_INET);
[T79]
         *** DEADLOCK ***

[T79] 5 locks held by kworker/2:0/79:
[T79] #0: ffff88800120b158 ((wq_completion)events_long){+.+.}-{0:0},
                           at: process_one_work+0xfca/0x1930
[T79] #1: ffffc9000474fd00 ((work_completion)(&ctrl->ctrl_work))
                           {+.+.}-{0:0},
                           at: process_one_work+0x804/0x1930
[T79] #2: ffffffffc11307d0 (ctrl_lock){+.+.}-{4:4},
                           at: server_ctrl_handle_work+0x21/0x280 [ksmbd]
[T79] #3: ffffffffc11347b0 (init_lock){+.+.}-{4:4},
                           at: ksmbd_conn_transport_init+0x18/0x70 [ksmbd]
[T79] #4: ffffffffc10f7230 (lock#9){+.+.}-{4:4},
                            at: rdma_listen+0x3d2/0x740 [rdma_cm]
[T79]
        stack backtrace:
[T79] CPU: 2 UID: 0 PID: 79 Comm: kworker/2:0 Kdump: loaded
      Tainted: G           OE
      6.18.0-rc4-metze-kasan-lockdep.01+ #1 PREEMPT(voluntary)
[T79] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[T79] Hardware name: innotek GmbH VirtualBox/VirtualBox,
      BIOS VirtualBox 12/01/2006
[T79] Workqueue: events_long server_ctrl_handle_work [ksmbd]
...
[T79]  print_circular_bug+0xfd/0x130
[T79]  check_noncircular+0x150/0x170
[T79]  check_prev_add+0xf3/0xcd0
[T79]  validate_chain+0x466/0x590
[T79]  __lock_acquire+0x535/0xc30
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  lock_acquire.part.0+0xb3/0x240
[T79]  ? sock_set_reuseaddr+0x14/0x70
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? __kasan_check_write+0x14/0x30
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? apparmor_socket_post_create+0x180/0x700
[T79]  lock_acquire+0x60/0x140
[T79]  ? sock_set_reuseaddr+0x14/0x70
[T79]  lock_sock_nested+0x3b/0xf0
[T79]  ? sock_set_reuseaddr+0x14/0x70
[T79]  sock_set_reuseaddr+0x14/0x70
[T79]  siw_create_listen+0x145/0x1540 [siw]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? local_clock_noinstr+0xe/0xd0
[T79]  ? __pfx_siw_create_listen+0x10/0x10 [siw]
[T79]  ? trace_preempt_on+0x4c/0x130
[T79]  ? __raw_spin_unlock_irqrestore+0x4a/0x90
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? preempt_count_sub+0x52/0x80
[T79]  iw_cm_listen+0x313/0x5b0 [iw_cm]
[T79]  cma_iw_listen+0x271/0x3c0 [rdma_cm]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  rdma_listen+0x3b1/0x740 [rdma_cm]
[T79]  ? _raw_spin_unlock+0x2c/0x60
[T79]  ? __pfx_rdma_listen+0x10/0x10 [rdma_cm]
[T79]  ? rdma_restrack_add+0x12c/0x630 [ib_core]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  cma_listen_on_dev+0x46a/0x750 [rdma_cm]
[T79]  rdma_listen+0x4b0/0x740 [rdma_cm]
[T79]  ? __pfx_rdma_listen+0x10/0x10 [rdma_cm]
[T79]  ? cma_get_port+0x30d/0x7d0 [rdma_cm]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? rdma_bind_addr_dst+0x598/0x9a0 [rdma_cm]
[T79]  ksmbd_rdma_init+0x12b/0x270 [ksmbd]
[T79]  ? __pfx_ksmbd_rdma_init+0x10/0x10 [ksmbd]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? register_netdevice_notifier+0x1dc/0x240
[T79]  ksmbd_conn_transport_init+0x26/0x70 [ksmbd]
[T79]  server_ctrl_handle_work+0x1e5/0x280 [ksmbd]
[T79]  process_one_work+0x86c/0x1930
[T79]  ? __pfx_process_one_work+0x10/0x10
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? assign_work+0x16f/0x280
[T79]  worker_thread+0x6f0/0x11f0

I was not able to reproduce this as I was testing with various
runs switching siw and rxe as well as IPPROTO_SMBDIRECT sockets,
while the above stack used siw with the non IPPROTO_SMBDIRECT
patches [1].

Even if this patch doesn't solve the above I think it's
a good idea to reclassify the sockets used by siw,
I also send patches for rxe to reclassify, as well
as my IPPROTO_SMBDIRECT socket patches [1] will do it,
this should minimize potential false positives.

[1]
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/master-ipproto-smbdirect

Cc: Bernard Metzler <bernard.metzler@linux.dev>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://patch.msgid.link/20251126150842.1837072-1-metze@samba.org
Acked-by: Bernard Metzler <bernard.metzler@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-27 05:07:24 -05:00
Marco Crivellari
7196156b0c IB/rdmavt: WQ_PERCPU added to alloc_workqueue users
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

CC: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-6-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Kees Cook
85cb0757d7 net: Convert proto_ops connect() callbacks to use sockaddr_unsized
Update all struct proto_ops connect() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:32 -08:00
Kees Cook
0e50474fa5 net: Convert proto_ops bind() callbacks to use sockaddr_unsized
Update all struct proto_ops bind() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:32 -08:00
Zhu Yanjun
503a5e4690 RDMA/rxe: Fix null deref on srq->rq.queue after resize failure
A NULL pointer dereference can occur in rxe_srq_chk_attr() when
ibv_modify_srq() is invoked twice in succession under certain error
conditions. The first call may fail in rxe_queue_resize(), which leads
rxe_srq_from_attr() to set srq->rq.queue = NULL. The second call then
triggers a crash (null deref) when accessing
srq->rq.queue->buf->index_mask.

Call Trace:
<TASK>
rxe_modify_srq+0x170/0x480 [rdma_rxe]
? __pfx_rxe_modify_srq+0x10/0x10 [rdma_rxe]
? uverbs_try_lock_object+0x4f/0xa0 [ib_uverbs]
? rdma_lookup_get_uobject+0x1f0/0x380 [ib_uverbs]
ib_uverbs_modify_srq+0x204/0x290 [ib_uverbs]
? __pfx_ib_uverbs_modify_srq+0x10/0x10 [ib_uverbs]
? tryinc_node_nr_active+0xe6/0x150
? uverbs_fill_udata+0xed/0x4f0 [ib_uverbs]
ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x2c0/0x470 [ib_uverbs]
? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs]
? uverbs_fill_udata+0xed/0x4f0 [ib_uverbs]
ib_uverbs_run_method+0x55a/0x6e0 [ib_uverbs]
? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs]
ib_uverbs_cmd_verbs+0x54d/0x800 [ib_uverbs]
? __pfx_ib_uverbs_cmd_verbs+0x10/0x10 [ib_uverbs]
? __pfx___raw_spin_lock_irqsave+0x10/0x10
? __pfx_do_vfs_ioctl+0x10/0x10
? ioctl_has_perm.constprop.0.isra.0+0x2c7/0x4c0
? __pfx_ioctl_has_perm.constprop.0.isra.0+0x10/0x10
ib_uverbs_ioctl+0x13e/0x220 [ib_uverbs]
? __pfx_ib_uverbs_ioctl+0x10/0x10 [ib_uverbs]
__x64_sys_ioctl+0x138/0x1c0
do_syscall_64+0x82/0x250
? fdget_pos+0x58/0x4c0
? ksys_write+0xf3/0x1c0
? __pfx_ksys_write+0x10/0x10
? do_syscall_64+0xc8/0x250
? __pfx_vm_mmap_pgoff+0x10/0x10
? fget+0x173/0x230
? fput+0x2a/0x80
? ksys_mmap_pgoff+0x224/0x4c0
? do_syscall_64+0xc8/0x250
? do_user_addr_fault+0x37b/0xfe0
? clear_bhb_loop+0x50/0xa0
? clear_bhb_loop+0x50/0xa0
? clear_bhb_loop+0x50/0xa0
entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Tested-by: Liu Yi <asatsuyu.liu@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20251027215203.1321-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-28 04:04:07 -04:00
Colin Ian King
1511efaca0 RDMA/rxe: Remove redundant assignment to variable page_offset
The variable page_offset is being assigned a value at the start of
a loop and being redundantly zero'd at the end of the loop, there
is no code that reads the zero'd value. The assignment is redundant
and can be removed.

Signed-off-by: Colin Ian King <coking@nvidia.com>
Link: https://patch.msgid.link/20251014120343.2528608-1-coking@nvidia.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-19 08:06:21 -04:00
Linus Torvalds
2ccb4d203f Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
 "A new Pensando ionic driver, a new Gen 3 HW support for Intel irdma,
  and lots of small bnxt_re improvements.

   - Small bug fixes and improves to hfi1, efa, mlx5, erdma, rdmarvt,
     siw

   - Allow userspace access to IB service records through the rdmacm

   - Optimize dma mapping for erdma

   - Fix shutdown of the GSI QP in mana

   - Support relaxed ordering MR and fix a corruption bug with mlx5 DMA
     Data Direct

   - Many improvement to bnxt_re:
       - Debugging features and counters
       - Improve performance of some commands
       - Change flow_label reporting in completions
       - Mirror vnic
       - RDMA flow support

   - New RDMA driver for Pensando Ethernet devices: ionic

   - Gen 3 hardware support for the Intel irdma driver

   - Fix rdma routing resolution with VRFs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (85 commits)
  RDMA/ionic: Fix memory leak of admin q_wr
  RDMA/siw: Always report immediate post SQ errors
  RDMA/bnxt_re: improve clarity in ALLOC_PAGE handler
  RDMA/irdma: Remove unused struct irdma_cq fields
  RDMA/irdma: Fix positive vs negative error codes in irdma_post_send()
  RDMA/bnxt_re: Remove non-statistics counters from hw_counters
  RDMA/bnxt_re: Add debugfs info entry for device and resource information
  RDMA/bnxt_re: Fix incorrect errno used in function comments
  RDMA: Use %pe format specifier for error pointers
  RDMA/ionic: Use ether_addr_copy instead of memcpy
  RDMA/ionic: Fix build failure on SPARC due to xchg() operand size
  RDMA/rxe: Fix race in do_task() when draining
  IB/sa: Fix sa_local_svc_timeout_ms read race
  IB/ipoib: Ignore L3 master device
  RDMA/core: Use route entry flag to decide on loopback traffic
  RDMA/core: Resolve MAC of next-hop device without ARP support
  RDMA/core: Squash a single user static function
  RDMA/irdma: Update Kconfig
  RDMA/irdma: Extend CQE Error and Flush Handling for GEN3 Devices
  RDMA/irdma: Add Atomic Operations support
  ...
2025-10-03 18:35:22 -07:00
Bernard Metzler
fdd0fe94d6 RDMA/siw: Always report immediate post SQ errors
In siw_post_send(), any immediate error encountered during processing of
the work request list must be reported to the caller, even if previous
work requests in that list were just accepted and added to the send queue.

Not reporting those errors confuses the caller, which would wait
indefinitely for the failing and potentially subsequently aborted work
requests completion.

This fixes a case where immediate errors were overwritten by subsequent
code in siw_post_send().

Fixes: 303ae1cdfd ("rdma/siw: application interface")
Link: https://patch.msgid.link/r/20250923144536.103825-1-bernard.metzler@linux.dev
Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Bernard Metzler <bernard.metzler@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-09-26 13:12:19 -03:00
Gui-Dong Han
8ca7eada62 RDMA/rxe: Fix race in do_task() when draining
When do_task() exhausts its iteration budget (!ret), it sets the state
to TASK_STATE_IDLE to reschedule, without a secondary check on the
current task->state. This can overwrite the TASK_STATE_DRAINING state
set by a concurrent call to rxe_cleanup_task() or rxe_disable_task().

While state changes are protected by a spinlock, both rxe_cleanup_task()
and rxe_disable_task() release the lock while waiting for the task to
finish draining in the while(!is_done(task)) loop. The race occurs if
do_task() hits its iteration limit and acquires the lock in this window.
The cleanup logic may then proceed while the task incorrectly
reschedules itself, leading to a potential use-after-free.

This bug was introduced during the migration from tasklets to workqueues,
where the special handling for the draining case was lost.

Fix this by restoring the original pre-migration behavior. If the state is
TASK_STATE_DRAINING when iterations are exhausted, set cont to 1 to
force a new loop iteration. This allows the task to finish its work, so
that a subsequent iteration can reach the switch statement and correctly
transition the state to TASK_STATE_DRAINED, stopping the task as intended.

Fixes: 9b4b7c1f9f ("RDMA/rxe: Add workqueue support for rxe tasks")
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>
Link: https://patch.msgid.link/20250919025212.1682087-1-hanguidong02@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-21 07:22:28 -04:00
Qianfeng Rong
490a253cb4 RDMA/rdmavt: Use int type to store negative error codes
Change 'ret' from u32 to int in alloc_qpn() to store -EINVAL, and remove
the 'bail' label as it simply returns 'ret'.

Storing negative error codes in an u32 causes no runtime issues, but it's
ugly as pants,  Change 'ret' from u32 to int type - this change has no
runtime impact.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://patch.msgid.link/20250826150556.541440-1-rongqianfeng@vivo.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:18:35 -04:00
Zhu Yanjun
3c3e9a9f29 RDMA/rxe: Flush delayed SKBs while releasing RXE resources
When skb packets are sent out, these skb packets still depends on
the rxe resources, for example, QP, sk, when these packets are
destroyed.

If these rxe resources are released when the skb packets are destroyed,
the call traces will appear.

To avoid skb packets hang too long time in some network devices,
a timestamp is added when these skb packets are created. If these
skb packets hang too long time in network devices, these network
devices can free these skb packets to release rxe resources.

Reported-by: syzbot+8425ccfb599521edb153@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=8425ccfb599521edb153
Tested-by: syzbot+8425ccfb599521edb153@syzkaller.appspotmail.com
Fixes: 1a633bdc8f ("RDMA/rxe: Let destroy qp succeed with stuck packet")
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20250726013104.463570-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13 06:20:00 -04:00
Linus Torvalds
2095cf558f Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fix from Jason Gunthorpe:
 "Single fix to correct the iov_iter construction in soft iwarp. This
  avoids blktest crashes with recent changes to the allocators"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/siw: Fix the sendmsg byte count in siw_tcp_sendpages
2025-08-07 07:36:23 +03:00
Pedro Falcato
c18646248f RDMA/siw: Fix the sendmsg byte count in siw_tcp_sendpages
Ever since commit c2ff29e99a ("siw: Inline do_tcp_sendpages()"),
we have been doing this:

static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
                             size_t size)
[...]
        /* Calculate the number of bytes we need to push, for this page
         * specifically */
        size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
        /* If we can't splice it, then copy it in, as normal */
        if (!sendpage_ok(page[i]))
                msg.msg_flags &= ~MSG_SPLICE_PAGES;
        /* Set the bvec pointing to the page, with len $bytes */
        bvec_set_page(&bvec, page[i], bytes, offset);
        /* Set the iter to $size, aka the size of the whole sendpages (!!!) */
        iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
try_page_again:
        lock_sock(sk);
        /* Sendmsg with $size size (!!!) */
        rv = tcp_sendmsg_locked(sk, &msg, size);

This means we've been sending oversized iov_iters and tcp_sendmsg calls
for a while. This has a been a benign bug because sendpage_ok() always
returned true. With the recent slab allocator changes being slowly
introduced into next (that disallow sendpage on large kmalloc
allocations), we have recently hit out-of-bounds crashes, due to slight
differences in iov_iter behavior between the MSG_SPLICE_PAGES and
"regular" copy paths:

(MSG_SPLICE_PAGES)
skb_splice_from_iter
  iov_iter_extract_pages
    iov_iter_extract_bvec_pages
      uses i->nr_segs to correctly stop in its tracks before OoB'ing everywhere
  skb_splice_from_iter gets a "short" read

(!MSG_SPLICE_PAGES)
skb_copy_to_page_nocache copy=iov_iter_count
 [...]
   copy_from_iter
        /* this doesn't help */
        if (unlikely(iter->count < len))
                len = iter->count;
          iterate_bvec
            ... and we run off the bvecs

Fix this by properly setting the iov_iter's byte count, plus sending the
correct byte count to tcp_sendmsg_locked.

Link: https://patch.msgid.link/r/20250729120348.495568-1-pfalcato@suse.de
Cc: stable@vger.kernel.org
Fixes: c2ff29e99a ("siw: Inline do_tcp_sendpages()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202507220801.50a7210-lkp@intel.com
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Bernard Metzler <bernard.metzler@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-08-05 11:33:10 -03:00
Linus Torvalds
7ce4de1cda Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:

 - Various minor code cleanups and fixes for hns, iser, cxgb4, hfi1,
   rxe, erdma, mana_ib

 - Prefetch supprot for rxe ODP

 - Remove memory window support from hns as new device FW is no longer
   support it

 - Remove qib, it is very old and obsolete now, Cornelis wishes to
   restructure the hfi1/qib shared layer

 - Fix a race in destroying CQs where we can still end up with work
   running because the work is cancled before the driver stops
   triggering it

 - Improve interaction with namespaces:
     * Follow the devlink namespace for newly spawned RDMA devices
     * Create iopoib net devces in the parent IB device's namespace
     * Allow CAP_NET_RAW checks to pass in user namespaces

 - A new flow control scheme for IB MADs to try and avoid queue
   overflows in the network

 - Fix 2G message sizes in bnxt_re

 - Optimize mkey layout for mlx5 DMABUF

 - New "DMA Handle" concept to allow controlling PCI TPH and steering
   tags

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
  RDMA/siw: Change maintainer email address
  RDMA/mana_ib: add support of multiple ports
  RDMA/mlx5: Refactor optional counters steering code
  RDMA/mlx5: Add DMAH support for reg_user_mr/reg_user_dmabuf_mr
  IB: Extend UVERBS_METHOD_REG_MR to get DMAH
  RDMA/mlx5: Add DMAH object support
  RDMA/core: Introduce a DMAH object and its alloc/free APIs
  IB/core: Add UVERBS_METHOD_REG_MR on the MR object
  net/mlx5: Add support for device steering tag
  net/mlx5: Expose IFC bits for TPH
  PCI/TPH: Expose pcie_tph_get_st_table_size()
  RDMA/mlx5: Fix incorrect MKEY masking
  RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey()
  RDMA/mlx5: remove redundant check on err on return expression
  RDMA/mana_ib: add additional port counters
  RDMA/mana_ib: Fix DSCP value in modify QP
  RDMA/efa: Add CQ with external memory support
  RDMA/core: Add umem "is_contiguous" and "start_dma_addr" helpers
  RDMA/uverbs: Add a common way to create CQ with umem
  RDMA/mlx5: Optimize DMABUF mkey page size
  ...
2025-07-31 12:19:55 -07:00
Yishai Hadas
a272019a46 IB: Extend UVERBS_METHOD_REG_MR to get DMAH
Extend UVERBS_METHOD_REG_MR to get DMAH and pass it to all drivers.

It will be used in mlx5 driver as part of the next patch from the
series.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/2ae1e628c0675db81f092cc00d3ad6fbf6139405.1752752567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23 01:42:11 -04:00
Stanislav Fomichev
93893a57ef net: s/dev_get_flags/netif_get_flags/
Commit cc34acd577 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18 17:27:47 -07:00
Mark Bloch
8cffca866b RDMA/core: Extend RDMA device registration to be net namespace aware
Presently, RDMA devices are always registered within the init network
namespace, even if the associated devlink device's namespace was
changed via a devlink reload. This mismatch leads to discrepancies
between the network namespace of the devlink device and that of the
RDMA device.

Therefore, extend the RDMA device allocation API to optionally take
the net namespace. This isn't limited to devices that support devlink
but allows all users to provide the network namespace if they need to
do so.

If a network namespace is provided during device allocation, it's up
to the caller to make sure the namespace stays valid until
ib_register_device() is called.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-26 08:10:07 -04:00
Dan Carpenter
19564a8576 RDMA/rxe: Fix a couple IS_ERR() vs NULL bugs
The lookup_mr() function returns NULL on error.  It never returns error
pointers.

Fixes: 9284bc34c7 ("RDMA/rxe: Enable asynchronous prefetch for ODP MRs")
Fixes: 3576b0df15 ("RDMA/rxe: Implement synchronous prefetch for ODP MRs")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/685c1430.050a0220.18b0ef.da83@mx.google.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26 05:19:56 -04:00
Arnd Bergmann
12423d8e18 RDMA/siw: work around clang stack size warning
clang inlines a lot of functions into siw_qp_sq_process(), with the
aggregate stack frame blowing the warning limit in some configurations:

drivers/infiniband/sw/siw/siw_qp_tx.c:1014:5: error: stack frame size (1544) exceeds limit (1280) in 'siw_qp_sq_process' [-Werror,-Wframe-larger-than]

The real problem here is the array of kvec structures in siw_tx_hdt that
makes up the majority of the consumed stack space.

Ideally there would be a way to avoid allocating the array on the
stack, but that would require a larger rework. Add a noinline_for_stack
annotation to avoid the warning for now, and make clang behave the same
way as gcc here. The combined stack usage is still similar, but is spread
over multiple functions now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20250620114332.4072051-1-arnd@kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26 05:19:56 -04:00
Daisuke Matsuda
c81fef2202 RDMA/rxe: Remove redundant page presence check
hmm_pfn_to_page() does not return NULL. ib_umem_odp_map_dma_and_lock()
should return an error in case the target pages cannot be mapped until
timeout, so these checks can safely be removed.

Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com>
Link: https://patch.msgid.link/20250611162758.10000-1-dskmtsd@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12 07:07:40 -04:00
Daisuke Matsuda
9284bc34c7 RDMA/rxe: Enable asynchronous prefetch for ODP MRs
Calling ibv_advise_mr(3) with flags other than IBV_ADVISE_MR_FLAG_FLUSH
invokes an asynchronous request. It is best-effort, and thus can safely be
deferred to the system-wide workqueue.

The reference counter in rxe_mr is used to ensure that the MRs persist and
that rxe is not terminated until the queued work is done.

Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com>
Link: https://patch.msgid.link/20250522111955.3227-3-dskmtsd@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12 04:09:42 -04:00
Daisuke Matsuda
3576b0df15 RDMA/rxe: Implement synchronous prefetch for ODP MRs
Minimal implementation of ibv_advise_mr(3) requires synchronous calls being
successful with the IBV_ADVISE_MR_FLAG_FLUSH flag. Asynchronous requests,
which are best-effort, will be supported subsequently.

Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com>
Link: https://patch.msgid.link/20250522111955.3227-2-dskmtsd@gmail.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12 04:07:04 -04:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds
dd91b5e1d6 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
 "Usual collection of driver fixes:

   - Small bug fixes and cleansup in hfi, hns, rxe, mlx5, mana siw

   - Further ODP functionality in rxe

   - Remote access MRs in mana, along with more page sizes

   - Improve CM scalability with a rwlock around the agent

   - More trace points for hns

   - ODP hmm conversion to the new two step dma API

   - Support the ethernet HW device in mana as well as the RNIC

   - Cleanups:
       - Use secs_to_jiffies() when appropriate
       - Use ERR_CAST() instead of naked casts
       - Don't use %pK in printk
       - Unusued functions removed
       - Allocation type matching"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (57 commits)
  RDMA/cma: Fix hang when cma_netevent_callback fails to queue_work
  RDMA/bnxt_re: Support extended stats for Thor2 VF
  RDMA/hns: Fix endian issue in trace events
  RDMA/mlx5: Avoid flexible array warning
  IB/cm: Remove dead code and adjust naming
  RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices
  RDMA/rxe: Break endless pagefault loop for RO pages
  RDMA/bnxt_re: Fix return code of bnxt_re_configure_cc
  RDMA/bnxt_re: Fix missing error handling for tx_queue
  RDMA/bnxt_re: Fix incorrect display of inactivity_cp in debugfs output
  RDMA/mlx5: Add support for 200Gbps per lane speeds
  RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stage
  RDMA/iwcm: Fix use-after-free of work objects after cm_id destruction
  net: mana: Add support for auxiliary device servicing events
  RDMA/mana_ib: unify mana_ib functions to support any gdma device
  RDMA/mana_ib: Add support of mana_ib for RNIC and ETH nic
  net: mana: Probe rdma device in mana driver
  RDMA/siw: replace redundant ternary operator with just rv
  RDMA/umem: Separate implicit ODP initialization from explicit ODP
  RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage
  ...
2025-05-30 10:18:56 -07:00
Jason Gunthorpe
ef2233850e Merge tag 'v6.15' into rdma.git for-next
Following patches need the RDMA rc branch since we are past the RC cycle
now.

Merge conflicts resolved based on Linux-next:

- For RXE odp changes keep for-next version and fixup new places that
  need to call is_odp_mr()
  https://lore.kernel.org/r/20250422143019.500201bd@canb.auug.org.au
  https://lore.kernel.org/r/20250514122455.3593b083@canb.auug.org.au

- irdma is keeping the while/kfree bugfix from -rc and the pf/cdev_info
  change from for-next
  https://lore.kernel.org/r/20250513130630.280ee6c5@canb.auug.org.au

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-26 15:33:52 -03:00
Jakub Kicinski
33e1b1b399 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc8).

Conflicts:
  80f2ab46c2 ("irdma: free iwdev->rf after removing MSI-X")
  4bcc063939 ("ice, irdma: fix an off by one in error handling code")
  c24a65b6a2 ("iidc/ice/irdma: Update IDC to support multiple consumers")
https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au

No extra adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22 09:42:41 -07:00
Leon Romanovsky
0b261d7c1c RDMA/rxe: Break endless pagefault loop for RO pages
RO pages has "perm" equal to 0, that caused to the situation
where such pages were marked as needed to have fault and caused
to infinite loop.

Fixes: eedd5b1276 ("RDMA/umem: Store ODP access mask information in PFN")
Reported-by: Daisuke Matsuda <dskmtsd@gmail.com>
Closes: https://lore.kernel.org/all/3016329a-4edd-4550-862f-b298a1b79a39@gmail.com/
Link: https://patch.msgid.link/096fab178d48ed86942ee22eafe9be98e29092aa.1747913377.git.leonro@nvidia.com
Tested-by: Daisuke Matsuda <dskmtsd@gmail.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-22 12:05:21 -04:00
Eric Biggers
62673b7df9 RDMA/siw: use skb_crc32c() instead of __skb_checksum()
Instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C, just call the new function skb_crc32c().  This is faster
and simpler.

Acked-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-5-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 15:40:05 -07:00
Colin Ian King
8536666a52 RDMA/siw: replace redundant ternary operator with just rv
The use of the ternary operator on rv is redundant, rv is
either the initialized value of 0 or a negative error return
code, so it can never be greater than zero, and hence the
zero assignment in ternary operator is redundant. Just return
rv instead.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://patch.msgid.link/20250507131834.253823-1-colin.i.king@gmail.com
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-12 06:20:24 -04:00
Leon Romanovsky
1efe8c0670 RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage
Reuse newly added DMA API to cache IOVA and only link/unlink pages
in fast path for UMEM ODP flow.

Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-12 06:06:51 -04:00
Leon Romanovsky
eedd5b1276 RDMA/umem: Store ODP access mask information in PFN
As a preparation to remove dma_list, store access mask in PFN pointer
and not in dma_addr_t.

Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-12 06:06:46 -04:00
Dr. David Alan Gilbert
e56b4eab9c RDMA/siw: Remove unused siw_mem_add
siw_mem_add() was added in 2019 by commit 2251334dca ("rdma/siw:
application buffer management") but has remained unused.

Remove it.

Link: https://patch.msgid.link/r/20250505210226.88994-1-linux@treblig.org
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-06 14:30:13 -03:00
Daisuke Matsuda
23ea3c70ee RDMA/rxe: Remove 32-bit architecture support
Major linux distibutions have phased out support for 32-bit machines. Since
rxe is primarily used for development and testing, the benefit of
maintaining 32-bit support is minimal. This change simplifies ATOMIC WRITE
implementations and improves maintainability of the driver.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://patch.msgid.link/20250421025101.3588139-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-21 04:16:29 -04:00
Dr. David Alan Gilbert
d85080df12 RDMA/rxe: Remove unused rxe_run_task
rxe_run_task() has been unused since 2024's
commit 23bc06af54 ("RDMA/rxe: Don't call direct between tasks")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250419132725.199785-1-linux@treblig.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-20 11:27:39 -04:00
Zhu Yanjun
1c7eec4d5f RDMA/rxe: Fix "trying to register non-static key in rxe_qp_do_cleanup" bug
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
 assign_lock_key kernel/locking/lockdep.c:986 [inline]
 register_lock_class+0x4a3/0x4c0 kernel/locking/lockdep.c:1300
 __lock_acquire+0x99/0x1ba0 kernel/locking/lockdep.c:5110
 lock_acquire kernel/locking/lockdep.c:5866 [inline]
 lock_acquire+0x179/0x350 kernel/locking/lockdep.c:5823
 __timer_delete_sync+0x152/0x1b0 kernel/time/timer.c:1644
 rxe_qp_do_cleanup+0x5c3/0x7e0 drivers/infiniband/sw/rxe/rxe_qp.c:815
 execute_in_process_context+0x3a/0x160 kernel/workqueue.c:4596
 __rxe_cleanup+0x267/0x3c0 drivers/infiniband/sw/rxe/rxe_pool.c:232
 rxe_create_qp+0x3f7/0x5f0 drivers/infiniband/sw/rxe/rxe_verbs.c:604
 create_qp+0x62d/0xa80 drivers/infiniband/core/verbs.c:1250
 ib_create_qp_kernel+0x9f/0x310 drivers/infiniband/core/verbs.c:1361
 ib_create_qp include/rdma/ib_verbs.h:3803 [inline]
 rdma_create_qp+0x10c/0x340 drivers/infiniband/core/cma.c:1144
 rds_ib_setup_qp+0xc86/0x19a0 net/rds/ib_cm.c:600
 rds_ib_cm_initiate_connect+0x1e8/0x3d0 net/rds/ib_cm.c:944
 rds_rdma_cm_event_handler_cmn+0x61f/0x8c0 net/rds/rdma_transport.c:109
 cma_cm_event_handler+0x94/0x300 drivers/infiniband/core/cma.c:2184
 cma_work_handler+0x15b/0x230 drivers/infiniband/core/cma.c:3042
 process_one_work+0x9cc/0x1b70 kernel/workqueue.c:3238
 process_scheduled_works kernel/workqueue.c:3319 [inline]
 worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400
 kthread+0x3c2/0x780 kernel/kthread.c:464
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

The root cause is as below:

In the function rxe_create_qp, the function rxe_qp_from_init is called
to create qp, if this function rxe_qp_from_init fails, rxe_cleanup will
be called to handle all the allocated resources, including the timers:
retrans_timer and rnr_nak_timer.

The function rxe_qp_from_init calls the function rxe_qp_init_req to
initialize the timers: retrans_timer and rnr_nak_timer.

But these timers are initialized in the end of rxe_qp_init_req.
If some errors occur before the initialization of these timers, this
problem will occur.

The solution is to check whether these timers are initialized or not.
If these timers are not initialized, ignore these timers.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Reported-by: syzbot+4edb496c3cad6e953a31@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=4edb496c3cad6e953a31
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20250419080741.1515231-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-20 11:25:37 -04:00
Zhu Yanjun
f81b33582f RDMA/rxe: Fix slab-use-after-free Read in rxe_queue_cleanup bug
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x7d/0xa0 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xcf/0x610 mm/kasan/report.c:489
 kasan_report+0xb5/0xe0 mm/kasan/report.c:602
 rxe_queue_cleanup+0xd0/0xe0 drivers/infiniband/sw/rxe/rxe_queue.c:195
 rxe_cq_cleanup+0x3f/0x50 drivers/infiniband/sw/rxe/rxe_cq.c:132
 __rxe_cleanup+0x168/0x300 drivers/infiniband/sw/rxe/rxe_pool.c:232
 rxe_create_cq+0x22e/0x3a0 drivers/infiniband/sw/rxe/rxe_verbs.c:1109
 create_cq+0x658/0xb90 drivers/infiniband/core/uverbs_cmd.c:1052
 ib_uverbs_create_cq+0xc7/0x120 drivers/infiniband/core/uverbs_cmd.c:1095
 ib_uverbs_write+0x969/0xc90 drivers/infiniband/core/uverbs_main.c:679
 vfs_write fs/read_write.c:677 [inline]
 vfs_write+0x26a/0xcc0 fs/read_write.c:659
 ksys_write+0x1b8/0x200 fs/read_write.c:731
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xaa/0x1b0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

In the function rxe_create_cq, when rxe_cq_from_init fails, the function
rxe_cleanup will be called to handle the allocated resources. In fact,
some memory resources have already been freed in the function
rxe_cq_from_init. Thus, this problem will occur.

The solution is to let rxe_cleanup do all the work.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://paste.ubuntu.com/p/tJgC42wDf6/
Tested-by: liuyi <liuy22@mails.tsinghua.edu.cn>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20250412075714.3257358-1-yanjun.zhu@linux.dev
Reviewed-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-20 06:14:49 -04:00
Daisuke Matsuda
29610226c3 RDMA/rxe: Fix mismatched type declarations
Some functions return int values while they are defined as enum resp_states
variables. This patch resolves the mismatches in rxe.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://patch.msgid.link/20250409102701.1275265-1-matsuda-daisuke@fujitsu.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-11 13:45:07 -04:00