Commit Graph

9403 Commits

Author SHA1 Message Date
Jason Gunthorpe
faa72102b1 RDMA/ionic: Fix kernel stack leak in ionic_create_cq()
struct ionic_cq_resp resp {
    __u32 cqid[2];         // offset 0 - PARTIALLY SET (see below)
    __u8  udma_mask;       // offset 8 - SET (resp.udma_mask = vcq->udma_mask)
    __u8  rsvd[7];         // offset 9 - NEVER SET <- LEAK
};

rsvd[7]: 7 bytes of stack memory leaked unconditionally.

cqid[2]: The loop at line 1256 iterates over udma_idx but skips indices
where !(vcq->udma_mask & BIT(udma_idx)). The array has 2 entries but
udma_count could be 1, meaning cqid[1] might never be written via
ionic_create_cq_common(). If udma_mask only has bit 0 set, cqid[1] (4
bytes) is also leaked. So potentially 11 bytes leaked.
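
The leak pattern described above can be reproduced in a small user-space sketch. It is only an illustration, not the driver code: `fill_cq_resp()`, the made-up CQ ids, and the use of `memcpy()` in place of the copy to user space are assumptions; the struct layout is the one quoted in this message. Clearing the whole response before filling it means both the reserved bytes and any skipped `cqid[]` slot read back as zero.

```c
#include <stdint.h>
#include <string.h>

/* Same layout as quoted in the commit message; the real uAPI header may differ. */
struct cq_resp {
	uint32_t cqid[2];
	uint8_t  udma_mask;
	uint8_t  rsvd[7];
};

/*
 * Hypothetical stand-in for the create-CQ response path. "out" plays the
 * role of the user buffer that receives the response.
 */
static void fill_cq_resp(struct cq_resp *out, uint8_t udma_mask)
{
	struct cq_resp resp;
	unsigned int idx;

	/* Clear every byte first, so nothing from the stack can escape. */
	memset(&resp, 0, sizeof(resp));

	for (idx = 0; idx < 2; idx++) {
		if (!(udma_mask & (1u << idx)))
			continue;		/* skipped slot stays 0 */
		resp.cqid[idx] = 100 + idx;	/* made-up CQ id */
	}
	resp.udma_mask = udma_mask;

	memcpy(out, &resp, sizeof(resp));
}
```

With only bit 0 set in the mask, `cqid[1]` and `rsvd[]` come back as zero instead of whatever happened to be on the stack.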

Cc: stable@vger.kernel.org
Fixes: e8521822c7 ("RDMA/ionic: Register device ops for control path")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://patch.msgid.link/4-v1-83e918d69e73+a9-rdma_udata_rc_jgg@nvidia.com
Acked-by: Abhijit Gangurde <abhijit.gangurde@amd.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-24 05:03:15 -05:00
Jason Gunthorpe
74586c6da9 RDMA/irdma: Fix kernel stack leak in irdma_create_user_ah()
struct irdma_create_ah_resp {  // 8 bytes, no padding
    __u32 ah_id;               // offset 0 - SET (uresp.ah_id = ah->sc_ah.ah_info.ah_idx)
    __u8  rsvd[4];             // offset 4 - NEVER SET <- LEAK
};

rsvd[4]: 4 bytes of stack memory leaked unconditionally. Only ah_id is assigned before ib_respond_udata().

The reserved members of the structure were not zeroed.

Cc: stable@vger.kernel.org
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://patch.msgid.link/3-v1-83e918d69e73+a9-rdma_udata_rc_jgg@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-24 05:03:15 -05:00
Jason Gunthorpe
117942ca43 IB/mthca: Add missed mthca_unmap_user_db() for mthca_create_srq()
Fix a user triggerable leak on the system call failure path.

Cc: stable@vger.kernel.org
Fixes: ec34a922d2 ("[PATCH] IB/mthca: Add SRQ implementation")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://patch.msgid.link/2-v1-83e918d69e73+a9-rdma_udata_rc_jgg@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-24 05:03:15 -05:00
Jason Gunthorpe
f22c77ce49 RDMA/efa: Fix typo in efa_alloc_mr()
The pattern is to check the entire driver request space, not just
sizeof something unrelated.

Fixes: 40909f664d ("RDMA/efa: Add EFA verbs implementation")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://patch.msgid.link/1-v1-83e918d69e73+a9-rdma_udata_rc_jgg@nvidia.com
Acked-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-24 05:01:32 -05:00
Kamal Heib
fd80bd7105 RDMA/ionic: Fix potential NULL pointer dereference in ionic_query_port
The function ionic_query_port() calls ib_device_get_netdev() without
checking the return value, which could lead to a NULL pointer dereference.
Fix it by checking the return value and returning -ENODEV if 'ndev' is
NULL.

Fixes: 2075bbe8ef ("RDMA/ionic: Register device ops for miscellaneous functionality")
Signed-off-by: Kamal Heib <kheib@redhat.com>
Link: https://patch.msgid.link/20260220222125.16973-2-kheib@redhat.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-24 04:54:10 -05:00
Siva Reddy Kallam
3d2e5d12a2 RDMA/bng_re: Unwind bng_re_dev_init properly
Fix below smatch warning:
drivers/infiniband/hw/bng_re/bng_dev.c:270
bng_re_dev_init() warn: missing unwind goto?

The current bng_re_dev_init() function lacks clear unwinding code.
Add proper unwinding with a goto ladder.
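
A goto unwind ladder of the kind mentioned here generally looks like the sketch below. This is a generic user-space illustration, not the bng_re code: the three steps, `acquire()`, `release()`, and the `live` bitmask are all hypothetical. Each failure jumps to a label that releases only what was already acquired, in reverse order.

```c
enum { STEP_A = 1, STEP_B = 2, STEP_C = 4 };

static unsigned int live;	/* bitmask of currently held resources */

/* Hypothetical resource setup; fails if "step" is set in fail_mask. */
static int acquire(unsigned int step, unsigned int fail_mask)
{
	if (fail_mask & step)
		return -1;
	live |= step;
	return 0;
}

static void release(unsigned int step)
{
	live &= ~step;
}

static int dev_init(unsigned int fail_mask)
{
	int rc;

	rc = acquire(STEP_A, fail_mask);
	if (rc)
		return rc;		/* nothing to undo yet */

	rc = acquire(STEP_B, fail_mask);
	if (rc)
		goto err_a;

	rc = acquire(STEP_C, fail_mask);
	if (rc)
		goto err_b;

	return 0;

err_b:
	release(STEP_B);		/* undo in reverse order of setup */
err_a:
	release(STEP_A);
	return rc;
}
```

Whichever step fails, `live` ends up at zero, which is the property the smatch "missing unwind goto" warning is checking for.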

Fixes: 4f830cd8d7 ("RDMA/bng_re: Add infrastructure for enabling Firmware channel")
Reported-by: Simon Horman <horms@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/r/202601010413.sWadrQel-lkp@intel.com/
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20260218091246.1764808-3-siva.kallam@broadcom.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2026-02-24 03:56:28 -05:00
Siva Reddy Kallam
7a23af417d RDMA/bng_re: Remove unnessary validity checks
Fix below smatch warning:
drivers/infiniband/hw/bng_re/bng_dev.c:113
bng_re_net_ring_free() warn: variable dereferenced before check 'rdev'
(see line 107)

The current driver has unnecessary validity checks. Remove them.

Fixes: 4f830cd8d7 ("RDMA/bng_re: Add infrastructure for enabling Firmware channel")
Fixes: 745065770c ("RDMA/bng_re: Register and get the resources from bnge driver")
Fixes: 04e031ff6e ("RDMA/bng_re: Initialize the Firmware and Hardware")
Fixes: d0da769c19 ("RDMA/bng_re: Add Auxiliary interface")
Reported-by: Simon Horman <horms@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/r/202601010413.sWadrQel-lkp@intel.com/
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20260218091246.1764808-2-siva.kallam@broadcom.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2026-02-24 03:53:24 -05:00
Kees Cook
189f164e57 Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses
Conversion performed via this Coccinelle script:

  // SPDX-License-Identifier: GPL-2.0-only
  // Options: --include-headers-for-types --all-includes --include-headers --keep-comments
  virtual patch

  @gfp depends on patch && !(file in "tools") && !(file in "samples")@
  identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
 		    kzalloc_obj,kzalloc_objs,kzalloc_flex,
		    kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
		    kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
  @@

  	ALLOC(...
  -		, GFP_KERNEL
  	)

  $ make coccicheck MODE=patch COCCI=gfp.cocci

Build and boot tested x86_64 with Fedora 42's GCC and Clang:

Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01

Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22 08:26:33 -08:00
Linus Torvalds
32a92f8c89 Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split
over multiple lines.  I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.

Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script.  I probably had made it a bit _too_ trivial.

So after fighting that for a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.

The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 20:03:00 -08:00
Linus Torvalds
323bbfcf1e Convert 'alloc_flex' family to use the new default GFP_KERNEL argument
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.

As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
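
For readers wondering how a C macro can accept an optional trailing flags argument at all, one common preprocessor technique is argument-count dispatch. The sketch below shows that technique only; it is not claimed to be how the kernel's alloc_obj() family is implemented, and every name in it (`demo_alloc_obj`, `FLAG_DEFAULT`, `FLAG_ATOMIC`, `last_flags`) is hypothetical.

```c
#include <stdlib.h>

#define FLAG_DEFAULT 0x1u	/* stand-in for GFP_KERNEL */
#define FLAG_ATOMIC  0x2u	/* stand-in for some non-default flag */

static unsigned int last_flags;	/* flags seen by the most recent call */

static void *demo_alloc(size_t size, unsigned int flags)
{
	last_flags = flags;
	return calloc(1, size);
}

/* One-argument and two-argument forms of the same allocation. */
#define DEMO_ALLOC_1(T)        ((T *)demo_alloc(sizeof(T), FLAG_DEFAULT))
#define DEMO_ALLOC_2(T, flags) ((T *)demo_alloc(sizeof(T), (flags)))

/* Selects the third argument, which shifts with the number of caller args. */
#define DEMO_PICK(_1, _2, NAME, ...) NAME
#define demo_alloc_obj(...) \
	DEMO_PICK(__VA_ARGS__, DEMO_ALLOC_2, DEMO_ALLOC_1, 0)(__VA_ARGS__)

struct widget {
	int id;
	char name[16];
};
```

Here `demo_alloc_obj(struct widget)` and `demo_alloc_obj(struct widget, FLAG_DEFAULT)` expand to the same call, which is why dropping the explicit default flag is a purely mechanical change.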

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)
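
The single and array forms above can be pictured with a user-space sketch. It assumes GNU C (or C23) `__typeof__`, which accepts either a type name or an expression such as `*VAR`; the macro names `demo_obj` and `demo_objs` are hypothetical, and `calloc()` stands in for the kernel allocators, so the memory is zeroed as with the kzalloc variants.

```c
#include <stdlib.h>

/* Allocate one object; the result is already a pointer to the right type. */
#define demo_obj(T_OR_EXPR) \
	((__typeof__(T_OR_EXPR) *)calloc(1, sizeof(T_OR_EXPR)))

/* Allocate an array; calloc() rejects count * size overflow for us. */
#define demo_objs(T_OR_EXPR, count) \
	((__typeof__(T_OR_EXPR) *)calloc((count), sizeof(T_OR_EXPR)))

struct point {
	int x, y;
};

static struct point *make_points(size_t n)
{
	struct point *pts;

	/* Element type is taken from the variable; *pts is never evaluated. */
	pts = demo_objs(*pts, n);
	return pts;
}
```

Because the macros yield `struct point *` rather than `void *`, assigning the result to a pointer of a different type draws a compiler diagnostic instead of passing silently.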

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Linus Torvalds
311aa68319 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
 "Usual smallish cycle. The NFS biovec work to push it down into RDMA
  instead of indirecting through a scatterlist is pretty nice to see,
  been talked about for a long time now.

   - Various code improvements in irdma, rtrs, qedr, ocrdma, rxe

   - Small driver improvements and minor bug fixes to hns, mlx5, rxe,
     mana, irdma

   - Robustness improvements in completion processing for EFA

   - New query_port_speed() verb to move past limited IBA defined speed
     steps

   - Support for SG_GAPS in rtrs and many other small improvements

   - Rare list corruption fix in iwcm

   - Better support for different page sizes in rxe

   - Device memory support for mana

   - Direct bio vec to kernel MR for use by NFS-RDMA

   - QP rate limiting for bnxt_re

   - Remote triggerable NULL pointer crash in siw

   - DMA-buf exporter support for RDMA mmaps like doorbells"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits)
  RDMA/mlx5: Implement DMABUF export ops
  RDMA/uverbs: Add DMABUF object type and operations
  RDMA/uverbs: Support external FD uobjects
  RDMA/siw: Fix potential NULL pointer dereference in header processing
  RDMA/umad: Reject negative data_len in ib_umad_write
  IB/core: Extend rate limit support for RC QPs
  RDMA/mlx5: Support rate limit only for Raw Packet QP
  RDMA/bnxt_re: Report QP rate limit in debugfs
  RDMA/bnxt_re: Report packet pacing capabilities when querying device
  RDMA/bnxt_re: Add support for QP rate limiting
  MAINTAINERS: Drop RDMA files from Hyper-V section
  RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc
  svcrdma: use bvec-based RDMA read/write API
  RDMA/core: add rdma_rw_max_sge() helper for SQ sizing
  RDMA/core: add MR support for bvec-based RDMA operations
  RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
  RDMA/core: add bio_vec based RDMA read/write API
  RDMA/irdma: Use kvzalloc for paged memory DMA address array
  RDMA/rxe: Fix race condition in QP timer handlers
  RDMA/mana_ib: Add device‑memory support
  ...
2026-02-12 17:05:20 -08:00
Vikas Gupta
42d1c54d62 bnge/bng_re: Add a new HSI
The HSI is shared between the firmware and the driver and is
automatically generated.
Add a new HSI for the BNGE driver. The current HSI refers to BNXT,
which will become incompatible with ThorUltra devices as the
BNGE driver adds more features. The BNGE driver will not use the HSI
located in the bnxt folder.
Also, add an HSI for ThorUltra RoCE driver.

Changes in v3:
- Fix in bng_roce_hsi.h reported by Jakub (AI review)
  https://lore.kernel.org/netdev/20260207051422.4181717-1-kuba@kernel.org/
- Add an entry in MAINTAINERS

Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Reviewed-by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Link: https://patch.msgid.link/20260208172925.1861255-1-vikas.gupta@broadcom.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-11 13:44:47 +01:00
Yishai Hadas
d6c58f4eb3 RDMA/mlx5: Implement DMABUF export ops
Enable p2pdma on the mlx5 PCI device to allow DMABUF-based peer-to-peer
DMA mappings.

Add implementation of the mmap_get_pfns and pgoff_to_mmap_entry device
operations required for DMABUF support in the mlx5 RDMA driver.

The pgoff_to_mmap_entry operation converts a page offset to the
corresponding rdma_user_mmap_entry by extracting the command and index
from the offset and looking it up in the ucontext's mmap_xa.

The mmap_get_pfns operation retrieves the physical address and length
from the mmap entry and obtains the p2pdma provider for the underlying
PCI device, which is needed for peer-to-peer DMA operations with
DMABUFs.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260201-dmabuf-export-v3-3-da238b614fe3@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-08 23:50:41 -05:00
Yael Chemla
215b53099b net/mlx5: Fix 1600G link mode enum naming
Rename TAUI/TBASE to GAUI/GBASE in 1600G link mode identifier and its
usage in ethtool and link-info tables.

Reported-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://patch.msgid.link/20260204194324.1723534-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05 18:29:04 -08:00
Kalesh AP
cae42d97d9 RDMA/mlx5: Support rate limit only for Raw Packet QP
mlx5 based hardware supports rate limiting only on Raw ethernet QPs.
Added an explicit check to fail the operation on any other QP types.
The rate limit support has been enhanced in the stack for RC QPs too.

Compile tested only.

CC: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/20260202133413.3182578-5-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-02 08:37:59 -05:00
Kalesh AP
949e7c062d RDMA/bnxt_re: Report QP rate limit in debugfs
Update QP info debugfs hook to report the rate limit applied
on the QP. 0 means unlimited.

Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20260202133413.3182578-4-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-02 08:37:59 -05:00
Kalesh AP
13edc7d4e0 RDMA/bnxt_re: Report packet pacing capabilities when querying device
Enable reporting of packet pacing capabilities from the
kernel to user space. Packet pacing allows limiting the
rate to any value between the minimum and maximum.

The capabilities are exposed to user space through query_device.
The following capabilities are reported:

1. The maximum and minimum rate limit in kbps.
2. Bitmap showing which QP types support rate limit.

Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20260202133413.3182578-3-kalesh-anakkur.purayil@broadcom.com
Reviewed-by: Anantha Prabhu <anantha.prabhu@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-02 08:37:59 -05:00
Kalesh AP
e72d45d274 RDMA/bnxt_re: Add support for QP rate limiting
Broadcom P7 chips support applying a rate limit to RC QPs.
This allows adjusting shaper rate values during the INIT -> RTR,
RTR -> RTS, RTS -> RTS state changes or after the QP transitions
to RTR or RTS.

Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20260202133413.3182578-2-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-02 08:37:59 -05:00
Carlos Bilbao
959d2c356e RDMA/irdma: Use kvzalloc for paged memory DMA address array
Allocate array chunk->dmainfo.dmaaddrs using kvzalloc() to allow the
allocation to fall back to vmalloc when contiguous memory is unavailable
(instead of failing and logging page allocation warnings).

Acked-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Carlos Bilbao (Lambda) <carlos.bilbao@kernel.org>
Link: https://patch.msgid.link/20260128014446.405247-1-carlos.bilbao@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-28 05:44:04 -05:00
Konstantin Taranov
a01745ccf7 RDMA/mana_ib: Add device‑memory support
Introduce a basic DM implementation that enables creating and
registering device memory, and using the associated memory keys
for networking operations.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/20260127082649.429018-1-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-27 09:16:11 -05:00
Zilin Guan
9b9d253908 RDMA/mlx5: Fix memory leak in GET_DATA_DIRECT_SYSFS_PATH handler
The UVERBS_HANDLER(MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH) function
allocates memory for the device path using kobject_get_path(). If the
length of the device path exceeds the output buffer length, the function
returns -ENOSPC but does not free the allocated memory, resulting in a
memory leak.

Add a kfree() call to the error path to ensure the allocated memory is
properly freed.
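
The shape of this bug and its fix can be shown in a small user-space sketch. It is an illustration under assumptions, not the mlx5 handler: `get_path()` stands in for kobject_get_path(), `put_path()` for kfree(), and the `live_allocs` counter exists only to make the leak observable. The sketch routes the error through a shared exit label, which has the same effect as adding a free to the error branch.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static int live_allocs;		/* outstanding allocations, for the demo */

/* Hypothetical stand-in for kobject_get_path(): returns a heap string. */
static char *get_path(void)
{
	const char *src = "/devices/pci0000:00/demo";
	char *p = malloc(strlen(src) + 1);

	if (p) {
		strcpy(p, src);
		live_allocs++;
	}
	return p;
}

static void put_path(char *p)
{
	free(p);
	live_allocs--;
}

static int copy_path(char *out, size_t out_len)
{
	char *path = get_path();
	size_t len;
	int ret = 0;

	if (!path)
		return -ENOMEM;

	len = strlen(path) + 1;
	if (len > out_len) {
		ret = -ENOSPC;
		goto out;	/* a bare "return -ENOSPC" here would leak path */
	}
	memcpy(out, path, len);
out:
	put_path(path);
	return ret;
}
```

After both the failing and the succeeding call, `live_allocs` is back at zero.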

Compile tested only. Issue found using a prototype static analysis tool
and code review.

Fixes: ec7ad65309 ("RDMA/mlx5: Introduce GET_DATA_DIRECT_SYSFS_PATH ioctl")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Link: https://patch.msgid.link/20260126074801.627898-1-zilin@seu.edu.cn
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-27 07:04:18 -05:00
Jacob Moroni
2529aead51 RDMA/irdma: Use CQ ID for CEQE context
The hardware allows for an opaque CQ context field to be carried
over into CEQEs for the CQ. Previously, a pointer to the CQ was used
for this context. In the normal CQ destroy flow, the CEQ ring is
scrubbed to remove any preexisting CEQEs for the CQ that may not have
been processed yet so that the CQ structure is not dereferenced in the
CEQ ISR after the CQ has been freed.

However, in some cases, it is possible for a CEQE to be in flight in
HW even after the CQ destroy command completion is received, so it
could be missed during the scrub.

To protect against this, we can take advantage of the CQ table that
already exists and use the CQ ID for this context rather than a CQ
pointer.
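
The idea of carrying an ID instead of a pointer can be sketched as follows. This is a simplified user-space model, not the irdma code: the table size, `struct demo_cq`, and the function names are hypothetical, and a real driver would also need locking or reference counting around the lookup. A late event for a destroyed CQ resolves to an empty table slot and is dropped, whereas a stale pointer would have been dereferenced.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CQS 8

struct demo_cq {
	uint32_t id;
	unsigned int events;
};

static struct demo_cq *cq_table[MAX_CQS];	/* NULL once a CQ is destroyed */

static void cq_register(struct demo_cq *cq)
{
	cq_table[cq->id] = cq;
}

static void cq_destroy(struct demo_cq *cq)
{
	cq_table[cq->id] = NULL;
}

/* Event handler: the event carries the CQ ID as context, not a pointer. */
static int handle_ceqe(uint32_t cq_id)
{
	struct demo_cq *cq;

	if (cq_id >= MAX_CQS)
		return -1;

	cq = cq_table[cq_id];
	if (!cq)
		return -1;	/* stale event for a destroyed CQ: drop it */

	cq->events++;
	return 0;
}
```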

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260120212546.1893076-2-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-25 08:54:20 -05:00
Jacob Moroni
2b7c2ba130 RDMA/irdma: Add enum defs for reserved CQs/QPs
Added definitions for the special reserved CQs and QPs.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260120212546.1893076-1-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-25 08:54:20 -05:00
Or Har-Toov
18ea78e2ae IB/mlx5: Fix port speed query for representors
When querying speed information for a representor in switchdev mode,
the code previously used the first device in the eswitch, which may not
match the device that actually owns the representor. In setups such as
multi-port eswitch or LAG, this led to incorrect port attributes being
reported.

Fix this by retrieving the correct core device from the representor's
eswitch before querying its port attributes.

Fixes: 27f9e0ccb6 ("net/mlx5: Lag, Add single RDMA device in multiport mode")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260115-port-speed-query-fix-v2-1-3bde6a3c78e7@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-18 11:36:59 -05:00
Chiara Meiohas
ebc2164a4c RDMA/mlx5: Fix UMR hang in LAG error state unload
During firmware reset in LAG mode, a race condition causes the driver
to hang indefinitely while waiting for UMR completion during device
unload. See [1].

In LAG mode the bond device is only registered on the master, so it
never sees sys_error events from the slave.
During firmware reset this causes UMR waits to hang forever on unload
as the slave is dead but the master hasn't entered error state yet, so
UMR posts succeed but completions never arrive.

Fix this by adding a sys_error notifier that gets registered before
MLX5_IB_STAGE_IB_REG and stays alive until after ib_unregister_device().
This ensures error events reach the bond device throughout teardown.

[1]
Call Trace:
 __schedule+0x2bd/0x760
 schedule+0x37/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.6+0x2b5/0x4a0
 __mlx5_ib_dereg_mr+0x606/0x870 [mlx5_ib]
 ? __xa_erase+0x4a/0xa0
 ? _cond_resched+0x15/0x30
 ? wait_for_completion+0x31/0x100
 ib_dereg_mr_user+0x48/0xc0 [ib_core]
 ? rdmacg_uncharge_hierarchy+0xa0/0x100
 destroy_hw_idr_uobject+0x20/0x50 [ib_uverbs]
 uverbs_destroy_uobject+0x37/0x150 [ib_uverbs]
 __uverbs_cleanup_ufile+0xda/0x140 [ib_uverbs]
 uverbs_destroy_ufile_hw+0x3a/0xf0 [ib_uverbs]
 ib_uverbs_remove_one+0xc3/0x140 [ib_uverbs]
 remove_client_context+0x8b/0xd0 [ib_core]
 disable_device+0x8c/0x130 [ib_core]
 __ib_unregister_device+0x10d/0x180 [ib_core]
 ib_unregister_device+0x21/0x30 [ib_core]
 __mlx5_ib_remove+0x1e4/0x1f0 [mlx5_ib]
 auxiliary_bus_remove+0x1e/0x30
 device_release_driver_internal+0x103/0x1f0
 bus_remove_device+0xf7/0x170
 device_del+0x181/0x410
 mlx5_rescan_drivers_locked.part.10+0xa9/0x1d0 [mlx5_core]
 mlx5_disable_lag+0x253/0x260 [mlx5_core]
 mlx5_lag_disable_change+0x89/0xc0 [mlx5_core]
 mlx5_eswitch_disable+0x67/0xa0 [mlx5_core]
 mlx5_unload+0x15/0xd0 [mlx5_core]
 mlx5_unload_one+0x71/0xc0 [mlx5_core]
 mlx5_sync_reset_reload_work+0x83/0x100 [mlx5_core]
 process_one_work+0x1a7/0x360
 worker_thread+0x30/0x390
 ? create_worker+0x1a0/0x1a0
 kthread+0x116/0x130
 ? kthread_flush_work_fn+0x10/0x10
 ret_from_fork+0x22/0x40

Fixes: ede132a5cf ("RDMA/mlx5: Move events notifier registration to be after device registration")
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260113-umr-hand-lag-fix-v1-1-3dc476e00cd9@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-18 11:04:07 -05:00
Konstantin Taranov
f972bde732 RDMA/mana_ib: Take CQ type from the device type
Get CQ type from the used gdma device. The MANA_IB_CREATE_RNIC_CQ
flag is ignored. It was used in older kernel versions where
the mana_ib was shared between ethernet and rnic.

Fixes: d4293f96ce ("RDMA/mana_ib: unify mana_ib functions to support any gdma device")
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/20260115093625.177306-1-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-15 06:03:22 -05:00
Chengchang Tang
354e7a6d44 RDMA/hns: Support drain SQ and RQ
Some ULPs, e.g. rpcrdma, rely on drain_qp() to ensure all outstanding
requests are completed before releasing related memory. If drain_qp()
fails, ULPs may release memory directly, and in-flight WRs may later be
flushed after the memory is freed, potentially leading to UAF.

drain_qp() failures can happen when HW enters an error state or is
reset. Add support to drain SQ and RQ in such cases by posting a
fake WR during reset, so the driver can process all remaining WRs in
sequence and generate corresponding completions.

Always invoke comp_handler() in drain process to ensure completions
are not lost under concurrency (e.g. concurrent post_send() and
reset, or QPs created during reset). If the CQ is already processed,
cancel any already scheduled comp_handler() to avoid concurrency
issues.

Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20260108113032.856306-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-15 04:59:53 -05:00
Jacob Moroni
5c3f795d17 RDMA/irdma: Remove fixed 1 ms delay during AH wait loop
The AH CQP command wait loop executes in an atomic context and was
using a fixed 1 ms delay. Since many AH create commands can complete
much faster than 1 ms, use poll_timeout_us_atomic with a 1 us delay.

Also, use the timeout value indicated during the capability exchange
rather than a hard-coded value.
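
The effect of the polling interval can be seen in a toy model with a simulated clock. This sketch does not use the kernel's poll_timeout_us_atomic() and is not the irdma wait loop; `poll_done()`, the fake clock, and the 30 us completion time are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t fake_now_us;	/* simulated clock */
static uint64_t done_at_us;	/* when the simulated command completes */

static void fake_delay_us(uint64_t us)
{
	fake_now_us += us;
}

static bool cmd_done(void)
{
	return fake_now_us >= done_at_us;
}

/* Returns the elapsed microseconds on success, or -1 on timeout. */
static int64_t poll_done(uint64_t delay_us, uint64_t timeout_us)
{
	uint64_t start = fake_now_us;

	for (;;) {
		if (cmd_done())
			return (int64_t)(fake_now_us - start);
		if (fake_now_us - start >= timeout_us)
			return -1;
		fake_delay_us(delay_us);
	}
}
```

In this model a command that finishes after 30 us is noticed after 1000 us with a 1 ms step, but after 30 us with a 1 us step, which is the latency the change is after.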

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260105180550.2907858-1-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-13 08:19:11 -05:00
Jacob Moroni
52f3d34c29 RDMA/irdma: Remove redundant dma_wmb() before writel()
A dma_wmb() is not necessary before a writel() because writel()
already has an even stronger store barrier. A dma_wmb() is only
required to order writes to consistent/DMA memory whereas the
barrier in writel() is specified to order writes to DMA memory as
well as MMIO.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260103172517.2088895-1-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-13 08:01:37 -05:00
Michael Chan
fdb573d675 bnxt_en: Update FW interface to 1.10.3.151
The main changes are the new HWRM_PORT_PHY_FDRSTAT command to collect
FEC histogram bins and the new HWRM_NVM_DEFRAG command to defragment the
NVRAM.  There is also a minor name change in struct hwrm_vnic_cfg_input
that requires updating the bnxt_re driver's main.c.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260108183521.215610-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-10 15:19:50 -08:00
Leon Romanovsky
325e3b5431 RDMA/ocrdma: Remove unused OCRDMA_UVERBS definition
The OCRDMA_UVERBS() macro is unused, so remove it to clean up the code.

Link: https://patch.msgid.link/20260104-ib-core-misc-v1-6-00367f77f3a8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2026-01-05 04:04:01 -05:00
Leon Romanovsky
cc016ebeb1 RDMA/qedr: Remove unused defines
Perform basic cleanup by removing unused defines from qedr.h.

Link: https://patch.msgid.link/20260104-ib-core-misc-v1-5-00367f77f3a8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2026-01-05 04:03:46 -05:00
Leon Romanovsky
522a5c1c56 RDMA/mlx5: Avoid direct access to DMA device pointer
The dma_device field is marked as internal and must not be accessed by
drivers or ULPs. Remove all direct mlx5 references to this field.

Link: https://patch.msgid.link/20260104-ib-core-misc-v1-4-00367f77f3a8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2026-01-05 04:02:51 -05:00
Maher Sanalla
6dc78c53de RDMA/mlx5: Fix ucaps init error flow
In mlx5_ib_stage_caps_init(), if mlx5_ib_init_ucaps() fails after
mlx5_ib_init_var_table() succeeds, the VAR bitmap is leaked since
the function returns without cleanup.

Clean up the VAR table bitmap before returning when ucaps
initialization fails, preventing the leak described above.

Fixes: cf7174e898 ("RDMA/mlx5: Create UCAP char devices for supported device capabilities")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/20260104-ib-core-misc-v1-3-00367f77f3a8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-05 04:02:40 -05:00
Or Har-Toov
aaecff5e13 RDMA/mlx5: Implement query_port_speed callback
Implement the query_port_speed callback for mlx5 driver to support
querying effective port bandwidth.

For LAG configurations, query the aggregated speed from the LAG layer
or from the modified vport max_tx_speed.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-05 02:43:17 -05:00
Or Har-Toov
3fd984d5cd RDMA/mlx5: Raise async event on device speed change
Raise IB_EVENT_DEVICE_SPEED_CHANGE whenever the speed of one of the
device's ports changes. Usually all ports of the device change
together.

This ensures user applications and upper-layer software are immediately
notified when bandwidth changes, improving traffic management in dynamic
environments. This is especially useful for vports which are part of a
LAG configuration, to know if the effective speed of the LAG was
changed.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-05 02:43:12 -05:00
Chengchang Tang
0789f92990 RDMA/hns: Notify ULP of remaining soft-WCs during reset
During a reset, software-generated WCs cannot be reported via
interrupts. This may cause the ULP to miss some WCs.

To avoid this, add a check in the CQ arm process: if a hardware reset
has occurred and there are still unreported soft-WCs, notify the ULP
to handle the remaining WCs, thereby preventing any loss of completions.

Fixes: 626903e935 ("RDMA/hns: Add support for reporting wc as software mode")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20260104064057.1582216-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-04 10:09:51 -05:00
Junxian Huang
84bd5d60f0 RDMA/hns: Fix RoCEv1 failure due to DSCP
DSCP is not supported in RoCEv1, but get_dscp() is still called. If
get_dscp() returns an error, it'll eventually cause create_ah to fail
even when using RoCEv1.

Correct the return value and avoid calling get_dscp() when using
RoCEv1.

Fixes: ee20cc17e9 ("RDMA/hns: Support DSCP")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20260104064057.1582216-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-04 10:09:51 -05:00
Junxian Huang
8cda8acbb1 RDMA/hns: Return actual error code instead of fixed EINVAL
query_cqc() and query_mpt() may return various error codes in
different cases. Return actual error code instead of fixed EINVAL.

Fixes: f2b070f36d ("RDMA/hns: Support CQ's restrack raw ops for hns driver")
Fixes: 3d67e7e236 ("RDMA/hns: Support MR's restrack raw ops for hns driver")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20260104064057.1582216-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-04 10:09:51 -05:00
Chengchang Tang
c0a26bbd3f RDMA/hns: Fix WQ_MEM_RECLAIM warning
When sunrpc is used and a reset is triggered, our wq may produce the
following trace:

workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma]
is flushing !WQ_MEM_RECLAIM hns_roce_irq_workq:flush_work_handle
[hns_roce_hw_v2]
WARNING: CPU: 0 PID: 8250 at kernel/workqueue.c:2644 check_flush_dependency+0xe0/0x144
Call trace:
  check_flush_dependency+0xe0/0x144
  start_flush_work.constprop.0+0x1d0/0x2f0
  __flush_work.isra.0+0x40/0xb0
  flush_work+0x14/0x30
  hns_roce_v2_destroy_qp+0xac/0x1e0 [hns_roce_hw_v2]
  ib_destroy_qp_user+0x9c/0x2b4
  rdma_destroy_qp+0x34/0xb0
  rpcrdma_ep_destroy+0x28/0xcc [rpcrdma]
  rpcrdma_ep_put+0x74/0xb4 [rpcrdma]
  rpcrdma_xprt_disconnect+0x1d8/0x260 [rpcrdma]
  xprt_rdma_connect_worker+0xc0/0x120 [rpcrdma]
  process_one_work+0x1cc/0x4d0
  worker_thread+0x154/0x414
  kthread+0x104/0x144
  ret_from_fork+0x10/0x18

Since QP destruction frees memory, this wq should have the WQ_MEM_RECLAIM flag.

Fixes: ffd541d457 ("RDMA/hns: Add the workqueue framework for flush cqe handler")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20260104064057.1582216-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-04 10:09:51 -05:00
Lianfa Weng
8818ffb04b RDMA/hns: Introduce limit_bank mode with better performance
In limit_bank mode, QPs/CQs are restricted to using half of the banks.
HW concentrates resources on these banks, thereby improving performance
compared to the default mode.

Switch between limit_bank mode and default mode by setting the cap
flag in FW. Since the number of QPs and CQs will be halved, this mode
is suitable for scenarios where fewer QPs and CQs are required.

Signed-off-by: Lianfa Weng <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251230154911.3397584-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-31 05:33:34 -05:00
Thomas Fourier
fcd431a962 RDMA/bnxt_re: fix dma_free_coherent() pointer
dma_alloc_coherent() allocates a DMA-mapped buffer, pbl->pg_arr[i].
The matching dma_free_coherent() call should be passed that same buffer
address, not a page-aligned version of it.

Fixes: 1ac5a40479 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Link: https://patch.msgid.link/20251230085121.8023-2-fourier.thomas@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-30 06:45:51 -05:00
Kalesh AP
3d70e0fb0f RDMA/bnxt_re: Fix to use correct page size for PDE table
In bnxt_qplib_alloc_init_hwq(), while allocating memory for the PDE table,
the driver incorrectly uses the "pg_size" value passed to the function.
Use the correct value, 4K, instead. Also fix the allocation size for the
PBL table.

Fixes: 0c4dcd6028 ("RDMA/bnxt_re: Refactor hardware queue memory allocation")
Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251223131855.145955-1-kalesh-anakkur.purayil@broadcom.com
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-23 09:23:22 -05:00
Ding Hui
9b68a1cc96 RDMA/bnxt_re: Fix OOB write in bnxt_re_copy_err_stats()
Commit ef56081d18 ("RDMA/bnxt_re: RoCE related hardware counters
update") added three new counters and placed them after
BNXT_RE_OUT_OF_SEQ_ERR.

BNXT_RE_OUT_OF_SEQ_ERR acts as a boundary marker for allocating hardware
statistics with different num_counters values on chip_gen_p5_p7 devices.

As a result, BNXT_RE_NUM_STD_COUNTERS is used when allocating
hw_stats, which leads to an out-of-bounds write in
bnxt_re_copy_err_stats().

The counters BNXT_RE_REQ_CQE_ERROR, BNXT_RE_RESP_CQE_ERROR, and
BNXT_RE_RESP_REMOTE_ACCESS_ERRS are applicable to generic hardware, not
only p5/p7 devices.

Fix this by moving these counters before BNXT_RE_OUT_OF_SEQ_ERR so they
are included in the generic counter set.

Fixes: ef56081d18 ("RDMA/bnxt_re: RoCE related hardware counters update")
Reported-by: Yingying Zheng <zhengyingying@sangfor.com.cn>
Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
Link: https://patch.msgid.link/20251208072110.28874-1-dinghui@sangfor.com.cn
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Tested-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-22 03:56:35 -05:00
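The boundary-marker problem described in 9b68a1cc96 can be illustrated with a small C sketch. All names below are hypothetical, and the sketch assumes the "standard" counter count is derived as the boundary marker plus one. That derivation is an assumption made for illustration, not something stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Buggy layout: a new counter is appended after the boundary marker. */
enum demo_counter_buggy {
    DEMO_B_GENERIC_A,
    DEMO_B_BOUNDARY,       /* last counter of the generic set */
    DEMO_B_REQ_CQE_ERROR,  /* new counter, lands past the boundary */
};

/* Fixed layout: the new counter is moved before the boundary marker. */
enum demo_counter_fixed {
    DEMO_F_GENERIC_A,
    DEMO_F_REQ_CQE_ERROR,
    DEMO_F_BOUNDARY,
};

/* Assumed sizing rule: the generic set ends at the boundary marker. */
#define DEMO_B_NUM_STD (DEMO_B_BOUNDARY + 1)
#define DEMO_F_NUM_STD (DEMO_F_BOUNDARY + 1)

/* True if writing stats[idx] stays inside an array of num_allocated slots. */
bool index_in_bounds(int idx, int num_allocated)
{
    return idx >= 0 && idx < num_allocated;
}

void example_usage(void)
{
    /* Buggy layout: the new counter indexes past the generic allocation. */
    assert(!index_in_bounds(DEMO_B_REQ_CQE_ERROR, DEMO_B_NUM_STD));
    /* Fixed layout: the same counter now fits in the generic allocation. */
    assert(index_in_bounds(DEMO_F_REQ_CQE_ERROR, DEMO_F_NUM_STD));
}
```

Moving the enumerator rather than enlarging the allocation keeps the counter in the generic set, which matches the commit's statement that these counters apply to generic hardware.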
Alok Tiwari
f01765a236 RDMA/bnxt_re: Fix IB_SEND_IP_CSUM handling in post_send
The bnxt_re SEND path checks wr->send_flags to enable features such as
IP checksum offload. However, send_flags is a bitmask and may contain
multiple flags (e.g. IB_SEND_SIGNALED | IB_SEND_IP_CSUM), while the
existing code uses a switch() statement that only matches when
send_flags is exactly IB_SEND_IP_CSUM.

As a result, checksum offload is not enabled when additional SEND
flags are present.

Replace the switch() with a bitmask test:

    if (wr->send_flags & IB_SEND_IP_CSUM)

This ensures IP checksum offload is enabled correctly when multiple
SEND flags are used.

Fixes: 1ac5a40479 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20251219093308.2415620-1-alok.a.tiwari@oracle.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-22 03:02:26 -05:00
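The difference between the switch() match and the bitmask test in f01765a236 can be reproduced in a few lines of C. This is a sketch only: the DEMO_* flag values and the function names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the send-flag bits; the values are illustrative. */
#define DEMO_SEND_SIGNALED (1u << 1)
#define DEMO_SEND_IP_CSUM  (1u << 4)

/* Old pattern: switch() matches only when the whole mask equals the flag. */
bool csum_enabled_switch(unsigned int send_flags)
{
    switch (send_flags) {
    case DEMO_SEND_IP_CSUM:
        return true;
    default:
        return false;
    }
}

/* Fixed pattern: test the single bit and ignore any other flags that are set. */
bool csum_enabled_bitmask(unsigned int send_flags)
{
    return (send_flags & DEMO_SEND_IP_CSUM) != 0;
}

void example_usage(void)
{
    unsigned int flags = DEMO_SEND_SIGNALED | DEMO_SEND_IP_CSUM;

    assert(!csum_enabled_switch(flags)); /* offload silently skipped */
    assert(csum_enabled_bitmask(flags)); /* offload enabled as intended */
}
```

Both functions agree when the checksum flag is the only flag set. They diverge as soon as a second flag is present.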
Alok Tiwari
145a417a39 RDMA/bnxt_re: Fix incorrect BAR check in bnxt_qplib_map_creq_db()
RCFW_COMM_CONS_PCI_BAR_REGION is defined as BAR 2, so checking
!creq_db->reg.bar_id is incorrect and always false.

pci_resource_start() returns the BAR base address, and a value of 0
indicates that the BAR is unassigned. Update the condition to test
bar_base == 0 instead.

This ensures the driver detects and logs an error for an unassigned
RCFW communication BAR.

Fixes: cee0c7bba4 ("RDMA/bnxt_re: Refactor command queue management code")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20251217100158.752504-1-alok.a.tiwari@oracle.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-21 04:32:47 -05:00
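The reasoning in 145a417a39 can be checked with a small user-space sketch. The struct and function names are hypothetical, and DEMO_BAR_REGION only mirrors the commit's statement that the region is BAR 2. A bar_base of 0 stands in for what the commit describes as an unassigned BAR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors "defined as BAR 2" from the commit message. */
#define DEMO_BAR_REGION 2

/* Hypothetical stand-in for the register descriptor. */
struct demo_reg {
    int bar_id;         /* which BAR the region lives in */
    uint64_t bar_base;  /* base address; 0 models an unassigned BAR */
};

/* Old check: bar_id is the constant 2, so !bar_id can never be true. */
bool bar_missing_old(const struct demo_reg *reg)
{
    return !reg->bar_id;
}

/* New check: an unassigned BAR shows up as a zero base address. */
bool bar_missing_new(const struct demo_reg *reg)
{
    return reg->bar_base == 0;
}

void example_usage(void)
{
    struct demo_reg unassigned = { DEMO_BAR_REGION, 0 };

    assert(!bar_missing_old(&unassigned)); /* error goes undetected */
    assert(bar_missing_new(&unassigned));  /* error is detected */
}
```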
Yonatan Nachum
dab5825491 RDMA/efa: Improve admin completion context state machine
Add a new unused state to the admin completion context state machine
instead of the occupied field. This improves the completion validity
check because it now enforces that the context is in the submitted state
before it is completed. Also add an allocated state as an intermediate
state between unused and submitted.

Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
Link: https://patch.msgid.link/20251210130614.36460-3-ynachum@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-18 10:12:38 -05:00
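The state machine described in dab5825491 might look roughly like the following C sketch. The names are hypothetical, and the model is simplified: it covers only the unused, allocated and submitted states named in the commit message, and it assumes a completed context goes straight back to unused. The real driver may use additional states.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_ctx_state {
    DEMO_CTX_UNUSED,     /* replaces a separate "occupied" flag */
    DEMO_CTX_ALLOCATED,  /* reserved, command not yet posted */
    DEMO_CTX_SUBMITTED,  /* command posted, completion expected */
};

struct demo_comp_ctx {
    enum demo_ctx_state state;
};

/* unused -> allocated */
bool demo_ctx_alloc(struct demo_comp_ctx *ctx)
{
    if (ctx->state != DEMO_CTX_UNUSED)
        return false;
    ctx->state = DEMO_CTX_ALLOCATED;
    return true;
}

/* allocated -> submitted */
bool demo_ctx_submit(struct demo_comp_ctx *ctx)
{
    if (ctx->state != DEMO_CTX_ALLOCATED)
        return false;
    ctx->state = DEMO_CTX_SUBMITTED;
    return true;
}

/* submitted -> unused; a completion in any other state is rejected. */
bool demo_ctx_complete(struct demo_comp_ctx *ctx)
{
    if (ctx->state != DEMO_CTX_SUBMITTED)
        return false;
    ctx->state = DEMO_CTX_UNUSED;
    return true;
}

void example_usage(void)
{
    struct demo_comp_ctx ctx = { DEMO_CTX_UNUSED };

    assert(!demo_ctx_complete(&ctx)); /* spurious completion rejected */
    assert(demo_ctx_alloc(&ctx));
    assert(!demo_ctx_complete(&ctx)); /* allocated but never submitted */
    assert(demo_ctx_submit(&ctx));
    assert(demo_ctx_complete(&ctx));  /* only now is completion valid */
}
```

A boolean occupied flag cannot distinguish "reserved" from "posted", which is why the completion check gets stricter with explicit states.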
Yonatan Nachum
4b01ec0f13 RDMA/efa: Check stored completion CTX command ID with received one
In admin command completion, we receive a CQE with the command ID, which
is constructed from the context index and entropy bits from the admin queue
producer counter. To try to detect memory corruption in the received
CQE, validate the full command ID of the fetched context against the CQE
command ID. If there is a mismatch, complete the CQE with an error.
Also use the LSBs of the admin queue producer counter to better detect
entropy mismatches across a smaller number of commands.

Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
Link: https://patch.msgid.link/20251210130614.36460-2-ynachum@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-18 10:12:38 -05:00
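The command-ID validation described in 4b01ec0f13 can be sketched as follows. The bit layout is an assumption for illustration (a 16-bit ID, a queue depth of 32, index in the low bits, entropy above it), and every name is hypothetical. The commit message itself only says the ID combines the context index with entropy bits taken from the producer counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: low 5 bits hold the context index, upper 11 bits hold entropy. */
#define DEMO_IDX_BITS 5u
#define DEMO_DEPTH    (1u << DEMO_IDX_BITS)
#define DEMO_IDX_MASK (DEMO_DEPTH - 1u)
#define DEMO_ENT_MASK ((1u << (16u - DEMO_IDX_BITS)) - 1u)

struct demo_ctx {
    uint16_t cmd_id; /* full command ID stored at submission time */
};

/* Build the ID from the context index and the LSBs of the producer counter. */
uint16_t demo_make_cmd_id(uint16_t ctx_idx, uint16_t producer_counter)
{
    uint32_t entropy = producer_counter & DEMO_ENT_MASK;

    return (uint16_t)((entropy << DEMO_IDX_BITS) | (ctx_idx & DEMO_IDX_MASK));
}

/* Fetch the context by index, then compare the full stored ID with the CQE's. */
bool demo_validate_cqe(const struct demo_ctx *ctxs, uint16_t cqe_cmd_id)
{
    uint16_t idx = cqe_cmd_id & DEMO_IDX_MASK;

    return ctxs[idx].cmd_id == cqe_cmd_id;
}

void example_usage(void)
{
    struct demo_ctx ctxs[DEMO_DEPTH] = { { 0 } };
    uint16_t id = demo_make_cmd_id(3, 0x1234);

    ctxs[3].cmd_id = id;
    assert(demo_validate_cqe(ctxs, id));
    /* Same index, one entropy bit flipped: treated as a corrupted CQE. */
    assert(!demo_validate_cqe(ctxs, (uint16_t)(id ^ 0x0100)));
}
```

Comparing only the index would accept the corrupted ID in the last assertion. Comparing the full ID is what catches it.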