Pull rdma updates from Jason Gunthorpe:
"Many small changes across the subystem, some highlights:
- Usual driver cleanups in qedr, siw, erdma, hfi1, mlx4/5, irdma,
mthca, hns, and bnxt_re
- siw now works over tunnel and other netdevs with a MAC address by
removing assumptions about a MAC/GID from the connection manager
- "Doorbell Pacing" for bnxt_re - this is a best effort scheme to
allow userspace to slow down the doorbell rings if the HW gets full
- irdma egress VLAN priority, better QP/WQ sizing
- rxe bug fixes in queue draining and srq resizing
- Support more ethernet speed options in the core layer
- DMABUF support for bnxt_re
- Multi-stage MTT support for erdma to allow much bigger MR
registrations
- A irdma fix with a CVE that came in too late to go to -rc, missing
bounds checking for 0 length MRs"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (87 commits)
IB/hfi1: Reduce printing of errors during driver shut down
RDMA/hfi1: Move user SDMA system memory pinning code to its own file
RDMA/hfi1: Use list_for_each_entry() helper
RDMA/mlx5: Fix trailing */ formatting in block comment
RDMA/rxe: Fix redundant break statement in switch-case.
RDMA/efa: Fix wrong resources deallocation order
RDMA/siw: Call llist_reverse_order in siw_run_sq
RDMA/siw: Correct wrong debug message
RDMA/siw: Balance the reference of cep->kref in the error path
Revert "IB/isert: Fix incorrect release of isert connection"
RDMA/bnxt_re: Fix kernel doc errors
RDMA/irdma: Prevent zero-length STAG registration
RDMA/erdma: Implement hierarchical MTT
RDMA/erdma: Refactor the storage structure of MTT entries
RDMA/erdma: Renaming variable names and field names of struct erdma_mem
RDMA/hns: Support hns HW stats
RDMA/hns: Dump whole QP/CQ/MR resource in raw
RDMA/irdma: Add missing kernel-doc in irdma_setup_umode_qp()
RDMA/mlx4: Copy union directly
RDMA/irdma: Drop unused kernel push code
...
If a send packet is dropped by the IP layer in rxe_requester()
the call to rxe_xmit_packet() can fail with err == -EAGAIN.
To recover, the state of the wqe is restored to the state before
the packet was sent so it can be resent. However, the routines
that save and restore the state miss a significnt part of the
variable state in the wqe, the dma struct which is used to process
through the sge table. And, the state is not saved before the packet
is built which modifies the dma struct.
Under heavy stress testing with many QPs on a fast node sending
large messages to a slow node dropped packets are observed and
the resent packets are corrupted because the dma struct was not
restored. This patch fixes this behavior and allows the test cases
to succeed.
Fixes: 3050b99850 ("IB/rxe: Fix race condition between requester and completer")
Link: https://lore.kernel.org/r/20230721200748.4604-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This patch corrects an error in rxe_modify_srq where if the
caller changes the srq size the actual new value is not returned
to the caller since it may be larger than what is requested.
Additionally it open codes the subroutine rcv_wqe_size() which
adds very little value, and makes some whitespace changes.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20230620140142.9452-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This patch:
- Moves code to initialize a qp send work queue to a
subroutine named rxe_init_sq.
- Moves code to initialize a qp recv work queue to a
subroutine named rxe_init_rq.
- Moves initialization of qp request and response packet
queues ahead of work queue initialization so that cleanup
of a qp if it is not fully completed can successfully
attempt to drain the packet queues without a seg fault.
- Makes minor whitespace cleanups.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20230620135519.9365-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Pull rdma updates from Jason Gunthorpe:
"This cycle saw a focus on rxe and bnxt_re drivers:
- Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma
- rxe uses workqueues instead of tasklets
- rxe has better compliance around access checks for MRs and rereg_mr
- mana supportst he 'v2' FW interface for RX coalescing
- hfi1 bug fix for stale cache entries in its MR cache
- mlx5 buf fix to handle FW failures when destroying QPs
- erdma HW has a new doorbell allocation mechanism for uverbs that is
secure
- Lots of small cleanups and rework in bnxt_re:
- Use the common mmap functions
- Support disassociation
- Improve FW command flow
- support for 'low latency push'"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
RDMA/bnxt_re: Fix an IS_ERR() vs NULL check
RDMA/bnxt_re: Fix spelling mistake "priviledged" -> "privileged"
RDMA/bnxt_re: Remove duplicated include in bnxt_re/main.c
RDMA/bnxt_re: Refactor code around bnxt_qplib_map_rc()
RDMA/bnxt_re: Remove incorrect return check from slow path
RDMA/bnxt_re: Enable low latency push
RDMA/bnxt_re: Reorg the bar mapping
RDMA/bnxt_re: Move the interface version to chip context structure
RDMA/bnxt_re: Query function capabilities from firmware
RDMA/bnxt_re: Optimize the bnxt_re_init_hwrm_hdr usage
RDMA/bnxt_re: Add disassociate ucontext support
RDMA/bnxt_re: Use the common mmap helper functions
RDMA/bnxt_re: Initialize opcode while sending message
RDMA/cma: Remove NULL check before dev_{put, hold}
RDMA/rxe: Simplify cq->notify code
RDMA/rxe: Fixes mr access supported list
RDMA/bnxt_re: optimize the parameters passed to helper functions
RDMA/bnxt_re: remove redundant cmdq_bitmap
RDMA/bnxt_re: use firmware provided max request timeout
RDMA/bnxt_re: cancel all control path command waiters upon error
...
Pull RCU updates from Paul McKenney:
"Documentation updates
Miscellaneous fixes, perhaps most notably:
- Remove RCU_NONIDLE(). The new visibility of most of the idle loop
to RCU has obsoleted this API.
- Make the RCU_SOFTIRQ callback-invocation time limit also apply to
the rcuc kthreads that invoke callbacks for CONFIG_PREEMPT_RT.
- Add a jiffies-based callback-invocation time limit to handle
long-running callbacks. (The local_clock() function is only invoked
once per 32 callbacks due to its high overhead.)
- Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs, which
fixes a bug that can occur on systems with non-contiguous CPU
numbering.
kvfree_rcu updates:
- Eliminate the single-argument variant of k[v]free_rcu() now that
all uses have been converted to k[v]free_rcu_mightsleep().
- Add WARN_ON_ONCE() checks for k[v]free_rcu*() freeing callbacks too
soon. Yes, this is closing the barn door after the horse has
escaped, but Murphy says that there will be more horses.
Callback-offloading updates:
- Fix a number of bugs involving the shrinker and lazy callbacks.
Tasks RCU updates
Torture-test updates"
* tag 'rcu.2023.06.22a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (32 commits)
torture: Remove duplicated argument -enable-kvm for ppc64
doc/rcutorture: Add description of rcutorture.stall_cpu_block
rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
rcutorture: Correct name of use_softirq module parameter
locktorture: Add long_hold to adjust lock-hold delays
rcu/nocb: Make shrinker iterate only over NOCB CPUs
rcu-tasks: Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs
rcu: Make rcu_cpu_starting() rely on interrupts being disabled
rcu: Mark rcu_cpu_kthread() accesses to ->rcu_cpu_has_work
rcu: Mark additional concurrent load from ->cpu_no_qs.b.exp
rcu: Employ jiffies-based backstop to callback time limit
rcu: Check callback-invocation time limit for rcuc kthreads
rcu: Remove RCU_NONIDLE()
rcu: Add more RCU files to kernel-api.rst
rcu-tasks: Clarify the cblist_init_generic() function's pr_info() output
rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()
rcu/nocb: Recheck lazy callbacks under the ->nocb_lock from shrinker
rcu/nocb: Fix shrinker race against callback enqueuer
rcu/nocb: Protect lazy shrinker against concurrent (de-)offloading
...
The flags parameter to the request notify verb is a bitmask. But, rxe
driver treats cq->notify as an int. If someone ever set both the
IB_CQ_SOLICITED and the IB_CQ_NEXT_COMP bits rxe_cq_post could fail to
generate a completion event. This patch treats the notify flags as a bit
mask consistently and can handle the above case correctly.
Link: https://lore.kernel.org/r/20230612162244.20038-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A recent patch incorrectly did not include IB_ACCESS_RELAXED_ORDERING in
the list of supported access flags for the rxe driver. The driver actually
does nothing related to relaxed ordering but it causes no problems to
include it as supported but with no effect. This change caused ib_send_bw
and friends to not run correctly.
The correct approach is for the driver to allow any of the optional access
flags and otherwise ignore them. This patch adds IB_ACCESS_OPTIONAL to the
list of rxe supported flags.
Fixes: 02ed253770 ("RDMA/rxe: Introduce rxe access supported flags")
Link: https://lore.kernel.org/r/20230613171654.19334-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A recent patch replaced a tasklet execution of cq->comp_handler by a
direct call. While this made sense it let changes to cq->notify state be
unprotected and assumed that the cq completion machinery and the ulp done
callbacks were reentrant. The result is that in some cases completion
events can be lost. This patch moves the cq->comp_handler call inside of
the spinlock in rxe_cq_post which solves both issues. This is compatible
with the matching code in the request notify verb.
Fixes: 78b26a3353 ("RDMA/rxe: Remove tasklet call from rxe_cq.c")
Link: https://lore.kernel.org/r/20230612155032.17036-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The IBA requires:
o11-5.2.5: If the HCA supports SRQ, for RC and UD service,
the CI shall generate a Last WQE Reached Affiliated Asynchronous
Event on a QP that is in the Error State and is associated with
an SRQ when either:
• a CQE is generated for the last WQE, or
• the QP gets in the Error State and there are no more
WQEs on the RQ.
This patch implements this behavior in flush_recv_queue() which is called
as a result of rxe_qp_error() being called whenever the qp is put into the
error state. The rxe responder executes SRQ WQEs directly from the SRQ so
there are never more WQES on the RQ.
Link: https://lore.kernel.org/r/20230602164229.9277-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In the following:
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
assign_lock_key kernel/locking/lockdep.c:982 [inline]
register_lock_class+0xdb6/0x1120 kernel/locking/lockdep.c:1295
__lock_acquire+0x10a/0x5df0 kernel/locking/lockdep.c:4951
lock_acquire kernel/locking/lockdep.c:5691 [inline]
lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5656
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
skb_dequeue+0x20/0x180 net/core/skbuff.c:3639
drain_resp_pkts drivers/infiniband/sw/rxe/rxe_comp.c:555 [inline]
rxe_completer+0x250d/0x3cc0 drivers/infiniband/sw/rxe/rxe_comp.c:652
rxe_qp_do_cleanup+0x1be/0x820 drivers/infiniband/sw/rxe/rxe_qp.c:761
execute_in_process_context+0x3b/0x150 kernel/workqueue.c:3473
__rxe_cleanup+0x21e/0x370 drivers/infiniband/sw/rxe/rxe_pool.c:233
rxe_create_qp+0x3f6/0x5f0 drivers/infiniband/sw/rxe/rxe_verbs.c:583
This is a use-before-initialization problem.
It happens because rxe_qp_do_cleanup is called during error unwind before
the struct has been fully initialized.
Move the initialization of the skb earlier.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20230602035408.741534-1-yanjun.zhu@intel.com
Reported-by: syzbot+eba589d8f49c73d356da@syzkaller.appspotmail.com
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is a reference count error in error path code and a potential race
in check_rkey() in rxe_resp.c. When looking up the rkey for a memory
window the reference to the mw from rxe_lookup_mw() is dropped before a
reference is taken on the mr referenced by the mw. If the mr is destroyed
immediately after the call to rxe_put(mw) the mr pointer is unprotected
and may end up pointing at freed memory. The rxe_get(mr) call should take
place before the rxe_put(mw) call.
All errors in check_rkey() call rxe_put(mw) if mw is not NULL but it was
already called after the above. The mw pointer should be set to NULL after
the rxe_put(mw) call to prevent this from happening.
Fixes: cdd0b85675 ("RDMA/rxe: Implement memory access through MWs")
Link: https://lore.kernel.org/r/20230517211509.1819998-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In rxe_net.c a received packet, from udp or loopback, is passed to
rxe_rcv() in rxe_recv.c as a udp packet. I.e. skb->data is pointing at the
udp header. But rxe_rcv() makes length checks to verify the packet is long
enough to hold the roce headers as if it were a roce
packet. I.e. skb->data pointing at the bth header. A runt packet would
appear to have 8 more bytes than it actually does which may lead to
incorrect behavior.
This patch calls skb_pull() to adjust the skb to point at the bth header
before calling rxe_rcv() which fixes this error.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20230517172242.1806340-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently the rxe driver makes little effort to make the changes to qp
state (which includes qp->attr.qp_state, qp->attr.sq_draining and
qp->valid) atomic between different client threads and IO threads. In
particular a common template is for an RDMA application to call
ib_modify_qp() to move a qp to ERR state and then wait until all the
packet and work queues have drained before calling ib_destroy_qp(). None
of these state changes are protected by locks to assure that the changes
are executed atomically and that memory barriers are included. This has
been observed to lead to incorrect behavior around qp cleanup.
This patch continues the work of the previous patches in this series and
adds locking code around qp state changes and lookups.
Link: https://lore.kernel.org/r/20230405042611.6467-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The rxe driver has four different QP state variables,
qp->attr.qp_state,
qp->req.state,
qp->comp.state, and
qp->resp.state.
All of these basically carry the same information.
This patch replaces uses of qp->req.state by qp->attr.qp_state and enum
rxe_qp_state. This is the third of three patches which will remove all
but the qp->attr.qp_state variable. This will bring the driver closer to
the IBA description.
Link: https://lore.kernel.org/r/20230405042611.6467-3-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The rxe driver has four different QP state variables,
qp->attr.qp_state,
qp->req.state,
qp->comp.state, and
qp->resp.state.
All of these basically carry the same information.
This patch replaces uses of qp->comp.state by qp->attr.qp_state. This is
the second of three patches which will remove all but the
qp->attr.qp_state variable. This will bring the driver closer to the IBA
description.
Link: https://lore.kernel.org/r/20230405042611.6467-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The rxe driver has four different QP state variables,
qp->attr.qp_state,
qp->req.state,
qp->comp.state, and
qp->resp.state.
All of these basically carry the same information.
This patch replaces uses of qp->resp.state by qp->attr.qp_state. This is
the first of three patches which will remove all but the qp->attr.qp_state
variable. This will bring the driver closer to the IBA description.
Link: https://lore.kernel.org/r/20230405042611.6467-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Make it clearer what is going on by adding a function to go back from the
"virtual" dma_addr to a kva and another to a struct page. This is used in the
ib_uses_virt_dma() style drivers (siw, rxe, hfi, qib).
Call them instead of a naked casting and virt_to_page() when working with dma_addr
values encoded by the various ib_map functions.
This also fixes the virt_to_page() casting problem Linus Walleij has been
chasing.
Cc: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v2-05ea785520ed+10-ib_virt_page_jgg@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Remove the tasklet call in rxe_cq.c and also the is_dying in the
cq struct. There is no reason for the rxe driver to defer the call
to the cq completion handler by scheduling a tasklet. rxe_cq_post()
is not called in a hard irq context.
The rxe driver currently is incorrect because the tasklet call is
made without protecting the cq pointer with a reference from having
the underlying memory freed before the deferred routine is called.
Executing the comp_handler inline fixes this problem.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Link: https://lore.kernel.org/r/20230327215643.10410-1-rpearsonhpe@gmail.com
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This patch is a major rewrite of the tasklet routines in rxe_task.c. The
main motivation for this is the realization that the code violates the
safety of the qp pointer by correct reference counting. When a tasklet is
scheduled from a verbs API the calling thread has a valid reference to the
qp and schedules the tasklet to run at a later time carrying a pointer to
the qp. Once the calling code returns however the qp can be destroyed at
any time. In order to correct this a reference to the qp must be taken
when the task is scheduled and held until it finishes running. This is
complicated by the tasklet library not alwys running a task that is
scheduled depending on whether someone else has scheduled it.
This patch moves the logic for deciding whether to run or schedule a task
outside of do_task() and guarantees that there is only one copy of the
task scheduled or running at a time.
Secondly the separate flags controlling teardown and draining of the task
are included in the task state machine and all references to the state are
protected by spinlocks to avoid consistency and memory barrier issues.
Link: https://lore.kernel.org/r/20230304174533.11296-9-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The subroutine __rxe_do_task is not thread safe and it has no way to
guarantee that the tasks, which are designed with the assumption that they
are non-reentrant, are not reentered. All of its uses are non-performance
critical.
This patch replaces calls to __rxe_do_task with calls to
rxe_sched_task. It also removes irrelevant or unneeded if tests.
Instead of calling the task machinery a single call to the tasklet
function (rxe_requester, etc.) is sufficient to draing the queues if task
execution has been disabled or stopped.
Together these changes allow the removal of __rxe_do_task.
Link: https://lore.kernel.org/r/20230304174533.11296-7-rpearsonhpe@gmail.com
Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently each of the three tasklets requester, completer and responder in
the rxe driver take and release a reference to the qp argument at the
beginning and end of the subroutines. The caller passing in the qp
argument should be responsible for holding a reference to qp so these are
not required. Further doing so breaks the qp cleanup code in
rxe_qp_do_cleanup which calls these routines after all the references have
been dropped so they cannot drain the packet and work request queues as
intended.
In fact if these routines are deferred by calling tasklet_schedule there
is no guarantee that the calling code does have a qp reference. That is a
bug in rxe_task.c which will be fixed later in this series.
Link: https://lore.kernel.org/r/20230304174533.11296-6-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Originally is was thought that the tasklet machinery in rxe_task.c would
be used in other applications but that has not happened for years. This
patch replaces the 'void *arg' by struct 'rxe_qp *qp' in the parameters to
the tasklet calls. This change will have no affect on performance but may
make the code a little clearer.
Link: https://lore.kernel.org/r/20230304174533.11296-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>