linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-18 06:44:00 -04:00

Author	SHA1	Message	Date
Nigel Croxon	43806c3d5b	raid10: cleanup memleak at raid10_make_request If raid10_read_request or raid10_write_request registers a new request and the REQ_NOWAIT flag is set, the code does not free the malloc from the mempool. unreferenced object 0xffff8884802c3200 (size 192): comm "fio", pid 9197, jiffies 4298078271 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 88 41 02 00 00 00 00 00 .........A...... 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc c1a049a2): __kmalloc+0x2bb/0x450 mempool_alloc+0x11b/0x320 raid10_make_request+0x19e/0x650 [raid10] md_handle_request+0x3b3/0x9e0 __submit_bio+0x394/0x560 __submit_bio_noacct+0x145/0x530 submit_bio_noacct_nocheck+0x682/0x830 __blkdev_direct_IO_async+0x4dc/0x6b0 blkdev_read_iter+0x1e5/0x3b0 __io_read+0x230/0x1110 io_read+0x13/0x30 io_issue_sqe+0x134/0x1180 io_submit_sqes+0x48c/0xe90 __do_sys_io_uring_enter+0x574/0x8b0 do_syscall_64+0x5c/0xe0 entry_SYSCALL_64_after_hwframe+0x76/0x7e V4: changing backing tree to see if CKI tests will pass. The patch code has not changed between any versions. Fixes: `c9aa889b03` ("md: raid10 add nowait support") Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Link: https://lore.kernel.org/linux-raid/c0787379-9caa-42f3-b5fc-369aed784400@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-05 19:30:41 +08:00
Wang Jinchao	d67ed2ccd2	md/raid1: Fix stack memory use after return in raid1_reshape In the raid1_reshape function, newpool is allocated on the stack and assigned to conf->r1bio_pool. This results in conf->r1bio_pool.wait.head pointing to a stack address. Accessing this address later can lead to a kernel panic. Example access path: raid1_reshape() { // newpool is on the stack mempool_t newpool, oldpool; // initialize newpool.wait.head to stack address mempool_init(&newpool, ...); conf->r1bio_pool = newpool; } raid1_read_request() or raid1_write_request() { alloc_r1bio() { mempool_alloc() { // if pool->alloc fails remove_element() { --pool->curr_nr; } } } } mempool_free() { if (pool->curr_nr < pool->min_nr) { // pool->wait.head is a stack address // wake_up() will try to access this invalid address // which leads to a kernel panic return; wake_up(&pool->wait); } } Fix: reinit conf->r1bio_pool.wait after assigning newpool. Fixes: `afeee514ce` ("md: convert to bioset_init()/mempool_init()") Signed-off-by: Wang Jinchao <wangjinchao600@gmail.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/linux-raid/20250612112901.3023950-1-wangjinchao600@gmail.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-05 19:17:37 +08:00
Jens Axboe	75ef7b8d44	Merge tag 'nvme-6.16-2025-07-03' of git://git.infradead.org/nvme into block-6.16 Pull NVMe fixes from Christoph: "- fix incorrect cdw15 value in passthru error logging (Alok Tiwari) - fix memory leak of bio integrity in nvmet (Dmitry Bogdanov) - refresh visible attrs after being checked (Eugen Hristev) - fix suspicious RCU usage warning in the multipath code (Geliang Tang) - correctly account for namespace head reference counter (Nilay Shroff)" * tag 'nvme-6.16-2025-07-03' of git://git.infradead.org/nvme: nvme-multipath: fix suspicious RCU usage warning nvme-pci: refresh visible attrs after being checked nvmet: fix memory leak of bio integrity nvme: correctly account for namespace head reference counter nvme: Fix incorrect cdw15 value in passthru error logging	2025-07-03 09:42:07 -06:00
Yu Kuai	0d519bb0de	brd: fix sleeping function called from invalid context in brd_insert_page() __xa_cmpxchg() is called with rcu_read_lock(), and it will allocate memory if necessary. Fix the problem by moving rcu_read_lock() after __xa_cmpxchg(), meanwhile, it still should be held before xa_unlock(), prevent returned page to be freed by concurrent discard. Fixes: `bbcacab2e8` ("brd: avoid extra xarray lookups on first write") Reported-by: syzbot+ea4c8fd177a47338881a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/685ec4c9.a00a0220.129264.000c.GAE@google.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250630112828.421219-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-01 08:14:01 -06:00
Ming Lei	01ed88aea5	ublk: don't queue request if the associated uring_cmd is canceled Commit `524346e9d7` ("ublk: build batch from IOs in same io_ring_ctx and io task") need to dereference `io->cmd` for checking if the IO can be added to current batch, see ublk_belong_to_same_batch() and io_uring_cmd_ctx_handle(). However, `io->cmd` may become invalid after the uring_cmd is canceled. Fixes it by only allowing to queue this IO in case that ublk_prep_req() returns `BLK_STS_OK`, when 'io->cmd' is guaranteed to be valid. Reported-by: Changhui Zhong <czhong@redhat.com> Fixes: `524346e9d7` ("ublk: build batch from IOs in same io_ring_ctx and io task") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250701072325.1458109-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-01 07:54:35 -06:00
Geliang Tang	d681107420	nvme-multipath: fix suspicious RCU usage warning When I run the NVME over TCP test in virtme-ng, I get the following "suspicious RCU usage" warning in nvme_mpath_add_sysfs_link(): ''' [ 5.024557][ T44] nvmet: Created nvm controller 1 for subsystem nqn.2025-06.org.nvmexpress.mptcp for NQN nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77. [ 5.027401][ T183] nvme nvme0: creating 2 I/O queues. [ 5.029017][ T183] nvme nvme0: mapped 2/0/0 default/read/poll queues. [ 5.032587][ T183] nvme nvme0: new ctrl: NQN "nqn.2025-06.org.nvmexpress.mptcp", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77 [ 5.042214][ T25] [ 5.042440][ T25] ============================= [ 5.042579][ T25] WARNING: suspicious RCU usage [ 5.042705][ T25] 6.16.0-rc3+ #23 Not tainted [ 5.042812][ T25] ----------------------------- [ 5.042934][ T25] drivers/nvme/host/multipath.c:1203 RCU-list traversed in non-reader section!! [ 5.043111][ T25] [ 5.043111][ T25] other info that might help us debug this: [ 5.043111][ T25] [ 5.043341][ T25] [ 5.043341][ T25] rcu_scheduler_active = 2, debug_locks = 1 [ 5.043502][ T25] 3 locks held by kworker/u9:0/25: [ 5.043615][ T25] #0: ffff888008730948 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work+0x7ed/0x1350 [ 5.043830][ T25] #1: ffffc900001afd40 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0xcf3/0x1350 [ 5.044084][ T25] #2: ffff888013ee0020 (&head->srcu){.+.+}-{0:0}, at: nvme_mpath_add_sysfs_link.part.0+0xb4/0x3a0 [ 5.044300][ T25] [ 5.044300][ T25] stack backtrace: [ 5.044439][ T25] CPU: 0 UID: 0 PID: 25 Comm: kworker/u9:0 Not tainted 6.16.0-rc3+ #23 PREEMPT(full) [ 5.044441][ T25] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 5.044442][ T25] Workqueue: async async_run_entry_fn [ 5.044445][ T25] Call Trace: [ 5.044446][ T25] <TASK> [ 5.044449][ T25] dump_stack_lvl+0x6f/0xb0 [ 5.044453][ T25] lockdep_rcu_suspicious.cold+0x4f/0xb1 [ 5.044457][ T25] nvme_mpath_add_sysfs_link.part.0+0x2fb/0x3a0 [ 5.044459][ T25] ? queue_work_on+0x90/0xf0 [ 5.044461][ T25] ? lockdep_hardirqs_on+0x78/0x110 [ 5.044466][ T25] nvme_mpath_set_live+0x1e9/0x4f0 [ 5.044470][ T25] nvme_mpath_add_disk+0x240/0x2f0 [ 5.044472][ T25] ? __pfx_nvme_mpath_add_disk+0x10/0x10 [ 5.044475][ T25] ? add_disk_fwnode+0x361/0x580 [ 5.044480][ T25] nvme_alloc_ns+0x81c/0x17c0 [ 5.044483][ T25] ? kasan_quarantine_put+0x104/0x240 [ 5.044487][ T25] ? __pfx_nvme_alloc_ns+0x10/0x10 [ 5.044495][ T25] ? __pfx_nvme_find_get_ns+0x10/0x10 [ 5.044496][ T25] ? rcu_read_lock_any_held+0x45/0xa0 [ 5.044498][ T25] ? validate_chain+0x232/0x4f0 [ 5.044503][ T25] nvme_scan_ns+0x4c8/0x810 [ 5.044506][ T25] ? __pfx_nvme_scan_ns+0x10/0x10 [ 5.044508][ T25] ? find_held_lock+0x2b/0x80 [ 5.044512][ T25] ? ktime_get+0x16d/0x220 [ 5.044517][ T25] ? kvm_clock_get_cycles+0x18/0x30 [ 5.044520][ T25] ? __pfx_nvme_scan_ns_async+0x10/0x10 [ 5.044522][ T25] async_run_entry_fn+0x97/0x560 [ 5.044523][ T25] ? rcu_is_watching+0x12/0xc0 [ 5.044526][ T25] process_one_work+0xd3c/0x1350 [ 5.044532][ T25] ? __pfx_process_one_work+0x10/0x10 [ 5.044536][ T25] ? assign_work+0x16c/0x240 [ 5.044539][ T25] worker_thread+0x4da/0xd50 [ 5.044545][ T25] ? __pfx_worker_thread+0x10/0x10 [ 5.044546][ T25] kthread+0x356/0x5c0 [ 5.044548][ T25] ? __pfx_kthread+0x10/0x10 [ 5.044549][ T25] ? ret_from_fork+0x1b/0x2e0 [ 5.044552][ T25] ? __lock_release.isra.0+0x5d/0x180 [ 5.044553][ T25] ? ret_from_fork+0x1b/0x2e0 [ 5.044555][ T25] ? rcu_is_watching+0x12/0xc0 [ 5.044557][ T25] ? __pfx_kthread+0x10/0x10 [ 5.044559][ T25] ret_from_fork+0x218/0x2e0 [ 5.044561][ T25] ? __pfx_kthread+0x10/0x10 [ 5.044562][ T25] ret_from_fork_asm+0x1a/0x30 [ 5.044570][ T25] </TASK> ''' This patch uses sleepable RCU version of helper list_for_each_entry_srcu() instead of list_for_each_entry_rcu() to fix it. Fixes: `4dbd2b2ebe` ("nvme-multipath: Add visibility for round-robin io-policy") Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-01 08:17:02 +02:00
Eugen Hristev	14005c96d6	nvme-pci: refresh visible attrs after being checked The sysfs attributes are registered early, but the driver does not know whether they are needed or not at that moment. For the CMB attributes, commit `e917a849c3` ("nvme-pci: refresh visible attrs for cmb attributes") solved this problem by calling nvme_update_attrs after mapping the CMB. However the issue persists for the HMB attributes. To solve the problem, moved the call to nvme_update_attrs after nvme_setup_host_mem, which sets up the HMB. Fixes: `e917a849c3` ("nvme-pci: refresh visible attrs for cmb attributes") Fixes: `86adbf0cdb` ("nvme: simplify transport specific device attribute handling") Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-30 08:42:47 +02:00
Dmitry Bogdanov	190f4c2c86	nvmet: fix memory leak of bio integrity If nvmet receives commands with metadata there is a continuous memory leak of kmalloc-128 slab or more precisely bio->bi_integrity. Since commit `bf4c89fc87` ("block: don't call bio_uninit from bio_endio") each user of bio_init has to use bio_uninit as well. Otherwise the bio integrity is not getting free. Nvmet uses bio_init for inline bios. Uninit the inline bio to complete deallocation of integrity in bio. Fixes: `bf4c89fc87` ("block: don't call bio_uninit from bio_endio") Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-30 08:32:16 +02:00
Nilay Shroff	ba806c9003	nvme: correctly account for namespace head reference counter The blktests nvme/058 manifests an issue where the NVMe subsystem kobject entry remains stale in sysfs, causing a failure during subsequent NVMe module reloads[1]. Specifically, when attempting to register a new NVMe subsystem, the driver encounters a kobejct name collision because a stale kobject still exists. Though, please note that nvme/058 doesn't report any failure and test case passes and it's only during subsequent NVMe module reloads, the stale nvme sub- system kobject entry in sysfs causes the observed symptom[1]. This issue stems from an imbalance in the get/put usage of the namespace head (nshead) reference counter. The nshead holds a reference to the associated NVMe subsystem. If the nshead reference is not properly released, it prevents the cleanup of the subsystem's kobject, leaving nvme subsystem stale entry behind in sysfs. During the failure case, the last namespace path referencing a nshead is removed, but the nshead reference was not released. This occurs because the release logic currently only puts the nshead reference when its state is LIVE. However, in configurations where ANA (Asymmetric Namespace Access) is enabled, a namespace may be associated with an ANA state that is neither optimized nor non-optimized. In this case, the nshead may never transition to LIVE, and the corresponding nshead reference is then never dropped. In fact nvme/058 associates some of nvme namespaces to an inaccessible ANA state and with that nshead is created but it's state is not transitioned to LIVE. So the current logic would then causes nshead reference to be leaked for non-LIVE states. Another scenario, during namespace allocation, the driver first allocates a nshead and then issues an Identify Namespace command. If this command fails — which can happen in tests like nvme/058 that rapidly enables and disables namespaces — we must release the reference to the newly allocated nshead. However this reference release is currently missing in the failure, causing a nshead reference leak. To fix this, we now unconditionally release the nshead reference when the last nvme path referencing to the nshead is removed, regardless of the head’s state. Also during identify namespace failure case we now properly release the nshead reference. So this ensures proper cleanup of the nshead, and consequently, the NVMe subsystem and its associated kobject. This change prevents stale kobject entries from lingering in sysfs and eliminates the module reload failures observed just after running nvme/058. [1] https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/ Reported-by: yi.zhang@redhat.com Closes: https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/ Fixes: `62188639ec` ("nvme-multipath: introduce delayed removal of the multipath head node") Tested-by: yi.zhang@redhat.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-30 08:31:49 +02:00
Alok Tiwari	2e96d2d8c2	nvme: Fix incorrect cdw15 value in passthru error logging Fix an error in nvme_log_err_passthru() where cdw14 was incorrectly printed twice instead of cdw15. This fix ensures accurate logging of the full passthrough command payload. Fixes: `9f079dda14` ("nvme: allow passthru cmd error logging") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-30 08:31:45 +02:00
Yu Kuai	c007062188	block: fix false warning in bdev_count_inflight_rw() While bdev_count_inflight is interating all cpus, if some IOs are issued from traversed cpu and then completed from the cpu that is not traversed yet: cpu0 cpu1 bdev_count_inflight //for_each_possible_cpu // cpu0 is 0 infliht += 0 // issue a io blk_account_io_start // cpu0 inflight ++ cpu2 // the io is done blk_account_io_done // cpu2 inflight -- // cpu 1 is 0 inflight += 0 // cpu2 is -1 inflight += -1 ... In this case, the total inflight will be -1, causing lots of false warning. Fix the problem by removing the warning. Noted there is still a valid warning for nvme-mpath(From Yi) that is not fixed yet. Fixes: `f5482ee5ed` ("block: WARN if bdev inflight counter is negative") Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#mae89155a5006463d0a21a4a2c35ae0034b26a339 Reported-and-tested-by: Calvin Owens <calvin@wbinvd.org> Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#m1d935a00070bf95055d0ac84e6075158b08acaef Reported-by: Dave Chinner <david@fromorbit.com> Closes: https://lore.kernel.org/linux-block/aFuypjqCXo9-5_En@dread.disaster.area/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250626115743.1641443-1-yukuai3@huawei.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-26 07:34:11 -06:00
Jens Axboe	5990b776fb	Merge tag 'nvme-6.16-2025-06-26' of git://git.infradead.org/nvme into block-6.16 Pull NVMe fixes from Christoph: " - reset delayed remove_work after reconnect (Keith Busch) - fix atomic write size validation (Christoph Hellwig)" * tag 'nvme-6.16-2025-06-26' of git://git.infradead.org/nvme: nvme: fix atomic write size validation nvme: refactor the atomic write unit detection nvme: reset delayed remove_work after reconnect	2025-06-26 07:31:52 -06:00
Ronnie Sahlberg	969127bf07	ublk: sanity check add_dev input for underflow Add additional checks that queue depth and number of queues are non-zero. Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250626022046.235018-1-ronniesahlberg@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-26 07:31:24 -06:00
Christoph Hellwig	f46d273449	nvme: fix atomic write size validation Don't mix the namespace and controller values, and validate the per-controller limit when probing the controller. This avoid spurious failures for controllers with namespaces that have different namespaces with different logical block sizes, or report the per-namespace values only for some namespaces. It also fixes a missing queue_limits_cancel_update in an error path by removing that error path. Fixes: `8695f060a0` ("nvme: all namespaces in a subsystem must adhere to a common atomic write size") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Tested-by: Yi Zhang <yi.zhang@redhat.com>	2025-06-26 13:04:37 +02:00
Christoph Hellwig	b2e607feca	nvme: refactor the atomic write unit detection Move all the code out of nvme_update_disk_info into the helper, and rename the helper to have a somewhat less clumsy name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com>	2025-06-26 13:04:37 +02:00
Keith Busch	dd2c185489	nvme: reset delayed remove_work after reconnect The remove_work will proceed with permanently disconnecting on the initial final path failure if the head shows no paths after the delay. If a new path connects while the remove_work is pending, and if that new path happens to disconnect before that remove_work executes, the delayed removal should reset based on the most recent path disconnect time, but queue_delayed_work() won't do anything if the work is already pending. Attempt to cancel the delayed work when a new path connects, and use mod_delayed_work() in case the remove_work remains pending anyway. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-26 13:04:35 +02:00
Ming Lei	4c8a951787	ublk: setup ublk_io correctly in case of ublk_get_data() failure If ublk_get_data() fails, -EIOCBQUEUED is returned and the current command becomes ASYNC. And the only reason is that mapping data can't move on, because of no enough pages or pending signal, then the current ublk request has to be requeued. Once the request need to be requeued, we have to setup `ublk_io` correctly, including io->cmd and flags, otherwise the request may not be forwarded to ublk server successfully. Fixes: `9810362a57` ("ublk: don't call ublk_dispatch_req() for NEED_GET_DATA") Reported-by: Changhui Zhong <czhong@redhat.com> Closes: https://lore.kernel.org/linux-block/CAGVVp+VN9QcpHUz_0nasFf5q9i1gi8H8j-G-6mkBoqa3TyjRHA@mail.gmail.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Changhui Zhong <czhong@redhat.com> Link: https://lore.kernel.org/r/20250624104121.859519-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-24 20:45:31 -06:00
Caleb Sander Mateos	81b4d1a1d0	ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header UBLK_F_SUPPORT_ZERO_COPY has a very old comment describing the initial idea for how zero-copy would be implemented. The actual implementation added in commit `1f6540e2aa` ("ublk: zc register/unregister bvec") uses io_uring registered buffers rather than shared memory mapping. Remove the inaccurate remarks about mapping ublk request memory into the ublk server's address space and requiring 4K block size. Replace them with a description of the current zero-copy mechanism. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250621171015.354932-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-24 20:45:31 -06:00
Caleb Sander Mateos	67caa528ae	ublk: fix narrowing warnings in UAPI header When a C++ file compiled with -Wc++11-narrowing includes the UAPI header linux/ublk_cmd.h, ublk_sqe_addr_to_auto_buf_reg()'s assignments of u64 values to u8, u16, and u32 fields result in compiler warnings. Add explicit casts to the intended types to avoid these warnings. Drop the unnecessary bitmasks. Reported-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: `99c1e4eb6a` ("ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250621162842.337452-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-24 20:45:31 -06:00
Ming Lei	5223372e67	selftests: ublk: don't take same backing file for more than one ublk devices Don't use same backing file for more than one ublk devices, and avoid concurrent write on same file from more ublk disks. Fixes: `8ccebc19ee` ("selftests: ublk: support UBLK_F_AUTO_BUF_REG") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250623011934.741788-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-24 20:45:31 -06:00
Ming Lei	524346e9d7	ublk: build batch from IOs in same io_ring_ctx and io task ublk_queue_cmd_list() dispatches the whole batch list by scheduling task work via the tail request's io_uring_cmd, this way is fine even though more than one io_ring_ctx are involved for this batch since it is just one running context. However, the task work handler ublk_cmd_list_tw_cb() takes `issue_flags` of tail uring_cmd's io_ring_ctx for completing all commands. This way is wrong if any uring_cmd is issued from different io_ring_ctx. Fixes it by always building batch IOs from same io_ring_ctx and io task because ublk_dispatch_req() does validate task context, and IO needs to be aborted in case of running from fallback task work context. For typical per-queue or per-io daemon implementation, this way shouldn't make difference from performance viewpoint, because single io_ring_ctx is taken in each daemon for normal use case. Fixes: `d796cea7b9` ("ublk: implement ->queue_rqs()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250625022554.883571-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-24 20:44:52 -06:00
Ronnie Sahlberg	8c84728558	ublk: santizize the arguments from userspace when adding a device Sanity check the values for queue depth and number of queues we get from userspace when adding a device. Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Fixes: `71f28f3136` ("ublk_drv: add io_uring based userspace block driver") Fixes: `62fe99cef9` ("ublk: add read()/write() support for ublk char device") Link: https://lore.kernel.org/r/20250619021031.181340-1-ronniesahlberg@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-19 07:53:24 -06:00
Justin Sanders	cffc873d68	aoe: defer rexmit timer downdev work to workqueue When aoe's rexmit_timer() notices that an aoe target fails to respond to commands for more than aoe_deadsecs, it calls aoedev_downdev() which cleans the outstanding aoe and block queues. This can involve sleeping, such as in blk_mq_freeze_queue(), which should not occur in irq context. This patch defers that aoedev_downdev() call to the aoe device's workqueue. Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665 Signed-off-by: Justin Sanders <jsanders.devel@gmail.com> Link: https://lore.kernel.org/r/20250610170600.869-2-jsanders.devel@gmail.com Tested-By: Valentin Kleibel <valentin@vrvis.at> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-17 06:12:35 -06:00
Justin Sanders	7f90d45e57	aoe: clean device rq_list in aoedev_downdev() An aoe device's rq_list contains accepted block requests that are waiting to be transmitted to the aoe target. This queue was added as part of the conversion to blk_mq. However, the queue was not cleaned out when an aoe device is downed which caused blk_mq_freeze_queue() to sleep indefinitely waiting for those requests to complete, causing a hang. This fix cleans out the queue before calling blk_mq_freeze_queue(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665 Fixes: `3582dd2917` ("aoe: convert aoeblk to blk-mq") Signed-off-by: Justin Sanders <jsanders.devel@gmail.com> Link: https://lore.kernel.org/r/20250610170600.869-1-jsanders.devel@gmail.com Tested-By: Valentin Kleibel <valentin@vrvis.at> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-17 06:12:30 -06:00
Jens Axboe	9ce6c9875f	nvme: always punt polled uring_cmd end_io work to task_work Currently NVMe uring_cmd completions will complete locally, if they are polled. This is done because those completions are always invoked from task context. And while that is true, there's no guarantee that it's invoked under the right ring context, or even task. If someone does NVMe passthrough via multiple threads and with a limited number of poll queues, then ringA may find completions from ringB. For that case, completing the request may not be sound. Always just punt the passthrough completions via task_work, which will redirect the completion, if needed. Cc: stable@vger.kernel.org Fixes: `585079b6e4` ("nvme: wire up async polling for io passthrough commands") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-13 15:18:34 -06:00
Bagas Sanjaya	db3dfae1a2	Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists Stephen Rothwell reports htmldocs warning on ublk docs: Documentation/block/ublk.rst:414: ERROR: Unexpected indentation. [docutils] Fix the warning by separating sublists of auto buffer registration fallback behavior from their appropriate parent list item. Fixes: `ff20c51648` ("ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/linux-next/20250612132638.193de386@canb.auug.org.au/ Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20250613023857.15971-1-bagasdotme@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-13 09:25:42 -06:00
Matthew Wilcox (Oracle)	5e223e06ee	block: Fix bvec_set_folio() for very large folios Similarly to `26064d3e2b` ("block: fix adding folio to bio"), if we attempt to add a folio that is larger than 4GB, we'll silently truncate the offset and len. Widen the parameters to size_t, assert that the length is less than 4GB and set the first page that contains the interesting data rather than the first page of the folio. Fixes: `26db5ee158` (block: add a bvec_set_folio helper) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20250612144255.2850278-1-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-13 06:20:17 -06:00
Matthew Wilcox (Oracle)	f826ec7966	bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP It is possible for physically contiguous folios to have discontiguous struct pages if SPARSEMEM is enabled and SPARSEMEM_VMEMMAP is not. This is correctly handled by folio_page_idx(), so remove this open-coded implementation. Fixes: `640d1930be` (block: Add bio_for_each_folio_all()) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20250612144126.2849931-1-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-13 06:19:34 -06:00
Jens Axboe	961296e89d	block: use plug request list tail for one-shot backmerge attempt Previously, the block layer stored the requests in the plug list in LIFO order. For this reason, blk_attempt_plug_merge() would check just the head entry for a back merge attempt, and abort after that unless requests for multiple queues existed in the plug list. If more than one request is present in the plug list, this makes the one-shot back merging less useful than before, as it'll always fail to find a quick merge candidate. Use the tail entry for the one-shot merge attempt, which is the last added request in the list. If that fails, abort immediately unless there are multiple queues available. If multiple queues are available, then scan the list. Ideally the latter scan would be a backwards scan of the list, but as it currently stands, the plug list is singly linked and hence this isn't easily feasible. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/ Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> Fixes: `e70c301fae` ("block: don't reorder requests in blk_add_rq_to_plug") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-11 08:48:46 -06:00
Christoph Hellwig	cf625013d8	block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work Bios queued up in the zone write plug have already gone through all all preparation in the submit_bio path, including the freeze protection. Submitting them through submit_bio_noacct_nocheck duplicates the work and can can cause deadlocks when freezing a queue with pending bio write plugs. Go straight to ->submit_bio or blk_mq_submit_bio to bypass the superfluous extra freeze protection and checks. Fixes: `9b1ce7f0c6` ("block: Implement zone append emulation") Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250611044416.2351850-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-11 06:42:27 -06:00
Damien Le Moal	f705d33c2f	block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion When blk_zone_write_plug_bio_endio() is called for a regular write BIO used to emulate a zone append operation, that is, a BIO flagged with BIO_EMULATES_ZONE_APPEND, the BIO operation code is restored to the original REQ_OP_ZONE_APPEND but the BIO_EMULATES_ZONE_APPEND flag is not cleared. Clear it to fully return the BIO to its orginal definition. Fixes: `9b1ce7f0c6` ("block: Implement zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250611005915.89843-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-11 06:42:07 -06:00
Ming Lei	ff20c51648	ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG) Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG). Cc: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250611085632.109719-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-11 06:41:55 -06:00
Ming Lei	4bb08cf974	loop: move lo_set_size() out of queue freeze It isn't necessary to freeze queue for updating disk size given submit_bio() doesn't grab queue usage counter for checking eod. Also many driver won't freeze queue for calling set_capacity_and_notify(). Move lo_set_size() out of queue freeze for fixing many lockdep warning report. Link: https://lore.kernel.org/linux-block/67ea99e0.050a0220.3c3d88.0042.GAE@google.com/ Reported-by: syzbot+9dd7dbb1a4b915dee638@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250611084938.108829-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-11 06:41:45 -06:00
Jens Axboe	6f65947a1e	Merge tag 'nvme-6.16-2025-06-05' of git://git.infradead.org/nvme into block-6.16 Pull NVMe updates and fixes from Christoph: "nvme updates for Linux 6.16 - TCP error handling fix (Shin'ichiro Kawasaki) - TCP I/O stall handling fixes (Hannes Reinecke) - fix command limits status code (Keith Busch) - support vectored buffers also for passthrough (Pavel Begunkov) - spelling fixes (Yi Zhang)" * tag 'nvme-6.16-2025-06-05' of git://git.infradead.org/nvme: nvme: spelling fixes nvme-tcp: fix I/O stalls on congested sockets nvme-tcp: sanitize request list handling nvme-tcp: remove tag set when second admin queue config fails nvme: enable vectored registered bufs for passthrough cmds nvme: fix implicit bool to flags conversion nvme: fix command limits status code	2025-06-05 07:40:38 -06:00
Yi Zhang	44e479d720	nvme: spelling fixes Fix various spelling errors in comments. Signed-off-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-04 10:23:28 +02:00
Hannes Reinecke	f42d4796ee	nvme-tcp: fix I/O stalls on congested sockets When the socket is busy processing nvme_tcp_try_recv() might return -EAGAIN, but this doesn't automatically imply that the sending side is blocked, too. So check if there are pending requests once nvme_tcp_try_recv() returns -EAGAIN and continue with the sending loop to avoid I/O stalls. Signed-off-by: Hannes Reinecke <hare@kernel.org> Acked-by: Chris Leech <cleech@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-04 10:02:23 +02:00
Hannes Reinecke	0bf04c874f	nvme-tcp: sanitize request list handling Validate the request in nvme_tcp_handle_r2t() to ensure it's not part of any list, otherwise a malicious R2T PDU might inject a loop in request list processing. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-04 10:01:51 +02:00
Shin'ichiro Kawasaki	e714370670	nvme-tcp: remove tag set when second admin queue config fails Commit `104d0e2f62` ("nvme-fabrics: reset admin connection for secure concatenation") modified nvme_tcp_setup_ctrl() to call nvme_tcp_configure_admin_queue() twice. The first call prepares for DH-CHAP negotitation, and the second call is required for secure concatenation. However, this change triggered BUG KASAN slab-use-after- free in blk_mq_queue_tag_busy_iter(). This BUG can be recreated by repeating the blktests test case nvme/063 a few times [1]. When the BUG happens, nvme_tcp_create_ctrl() fails in the call chain below: nvme_tcp_create_ctrl() nvme_tcp_alloc_ctrl() new=true ... Alloc nvme_tcp_ctrl and admin_tag_set nvme_tcp_setup_ctrl() new=true nvme_tcp_configure_admin_queue() new=true ... Succeed nvme_alloc_admin_tag_set() ... Alloc the tag set for admin_tag_set nvme_stop_keep_alive() nvme_tcp_teardown_admin_queue() remove=false nvme_tcp_configure_admin_queue() new=false nvme_tcp_alloc_admin_queue() ... Fail, but do not call nvme_remove_admin_tag_set() nvme_uninit_ctrl() nvme_put_ctrl() ... Free up the nvme_tcp_ctrl and admin_tag_set The first call of nvme_tcp_configure_admin_queue() succeeds with new=true argument. The second call fails with new=false argument. This second call does not call nvme_remove_admin_tag_set() on failure, due to the new=false argument. Then the admin tag set is not removed. However, nvme_tcp_create_ctrl() assumes that nvme_tcp_setup_ctrl() would call nvme_remove_admin_tag_set(). Then it frees up struct nvme_tcp_ctrl which has admin_tag_set field. Later on, the timeout handler accesses the admin_tag_set field and causes the BUG KASAN slab-use-after-free. To not leave the admin tag set, call nvme_remove_admin_tag_set() when the second nvme_tcp_configure_admin_queue() call fails. Do not return from nvme_tcp_setup_ctrl() on failure. Instead, jump to "destroy_admin" go-to label to call nvme_tcp_teardown_admin_queue() which calls nvme_remove_admin_tag_set(). Fixes: `104d0e2f62` ("nvme-fabrics: reset admin connection for secure concatenation") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1] Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-04 09:52:39 +02:00
Pavel Begunkov	3c12a8939e	nvme: enable vectored registered bufs for passthrough cmds nvme already supports registered buffers for non-vectored io_uring passthrough commands, enable it for the vectored mode as well. It takes an iovec, each entry of which should contain a range within the same registered buffer specificied in sqe->buf_index. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-04 09:51:10 +02:00
Pavel Begunkov	c4b680ac28	nvme: fix implicit bool to flags conversion nvme_map_user_request() takes flags as the last argument, but nvme_uring_cmd_io() shoves a bool "vec" into it. It behaves as expected because bool is converted to 0/1 and NVME_IOCTL_VEC is defined as 1, but it's better to pass flags explicitly. Fixes: `7b7fdb8e2d` ("nvme: replace the "bool vec" arguments with flags in the ioctl path") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-04 09:51:10 +02:00
Keith Busch	10f4a7cd72	nvme: fix command limits status code The command specific status code, 0x183, was introduced in the NVMe 2.0 specification defined to "Command Size Limits Exceeded" and only ever applied to DSM and Copy commands. Fix the name and, remove the incorrect translation to error codes and special treatment in the target code for it. Fixes: `3b7c33b28a` ("nvme.h: add Write Zeroes definitions") Cc: Chaitanya Kulkarni <chaitanyak@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-04 09:51:10 +02:00
Uday Shankar	a2f4c1ae16	selftests: ublk: kublk: improve behavior on init failure Some failure modes are handled poorly by kublk. For example, if ublk_drv is built as a module but not currently loaded into the kernel, ./kublk add ... just hangs forever. This happens because in this case (and a few others), the worker process does not notify its parent (via a write to the shared eventfd) that it has tried and failed to initialize, so the parent hangs forever. Fix this by ensuring that we always notify the parent process of any initialization failure, and have the parent print a (not very descriptive) log line when this happens. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250603-ublk_init_fail-v1-1-87c91486230e@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-03 20:19:44 -06:00
Caleb Sander Mateos	43a67dd812	block: flip iter directions in blk_rq_integrity_map_user() blk_rq_integrity_map_user() creates the ubuf iter with ITER_DEST for write-direction operations and ITER_SOURCE for read-direction ones. This is backwards; writes use the user buffer as a source for metadata and reads use it as a destination. Switch to the rq_data_dir() helper, which maps writes to ITER_SOURCE (WRITE) and reads to ITER_DEST(READ). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: `fe8f4ca710` ("block: modify bio_integrity_map_user to accept iov_iter as argument") Link: https://lore.kernel.org/r/20250603184752.1185676-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-03 17:24:59 -06:00
Caleb Sander Mateos	c09a8b00f8	block: drop direction param from bio_integrity_copy_user() direction is determined from bio, which is already passed in. Compute op_is_write(bio_op(bio)) directly instead of converting it to an iter direction and back to a bool. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20250603183133.1178062-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-03 12:45:45 -06:00
Ming Lei	da12597a1d	selftests: ublk: cover PER_IO_DAEMON in more stress tests We have stress_03, stress_04 and stress_05 for checking new feature vs. stress IO & device removal & ublk server crash & recovery, so let the three existing stress tests cover PER_IO_DAEMON. Then stress_06 can be removed, since the same test function is included in stress_03. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250602132113.1398645-1-ming.lei@redhat.com Reviewed-by: Uday Shankar <ushankar@purestorage.com> [axboe: remove test_stress_06.sh from Makefile too] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-02 12:00:31 -06:00
Uday Shankar	08652bd86e	Documentation: ublk: document UBLK_F_PER_IO_DAEMON Explain the restrictions imposed on ublk servers in two cases: 1. When UBLK_F_PER_IO_DAEMON is set (current ublk_drv) 2. When UBLK_F_PER_IO_DAEMON is not set (legacy) Remove most references to per-queue daemons, as the new UBLK_F_PER_IO_DAEMON feature renders that concept obsolete. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-9-e9d3b119336a@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-31 14:38:57 -06:00
Uday Shankar	17574aa2a0	selftests: ublk: add stress test for per io daemons Add a new test_stress_06 for the per io daemons feature. This is just a copy of test_stress_01 with the per_io_tasks flag added, with varying amounts of nthreads. This test is able to reproduce a panic which was caught manually during development [1]; in the current version of this patch set, it passes. Note that this commit also makes all stress tests using the run_io_and_remove helper more stressful by additionally exercising the batch submit (queue_rqs) path. [1] https://lore.kernel.org/linux-block/aDgwGoGCEpwd1mFY@fedora/ Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Uday Shankar <ushankar@purestorage.com> Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-8-e9d3b119336a@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-31 14:38:53 -06:00
Uday Shankar	236918d3e9	selftests: ublk: add functional test for per io daemons Add a new test test_generic_12 which: - sets up a ublk server with per_io_tasks and a different number of ublk server threads and ublk_queues. This is possible now that these objects are decoupled - runs some I/O load from a single CPU - verifies that all the ublk server threads handle some I/O Before this changeset, this test fails, since I/O issued from one CPU is always handled by the one ublk server thread. After this changeset, the test passes. In the future, the last check above may be strengthened to "verify that all ublk server threads handle the same amount of I/O." However, this requires some adjustments/bugfixes to tag allocation, so this work is postponed to a followup. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-7-e9d3b119336a@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-31 14:38:48 -06:00
Uday Shankar	abe54c1603	selftests: ublk: kublk: decouple ublk_queues from ublk server threads Add support in kublk for decoupled ublk_queues and ublk server threads. kublk now has two modes of operation: - (preexisting mode) threads and queues are paired 1:1, and each thread services all the I/Os of one queue - (new mode) thread and queue counts are independently configurable. threads service I/Os in a way that balances load across threads even if load is not balanced over queues. The default is the preexisting mode. The new mode is activated by passing the --per_io_tasks flag. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-6-e9d3b119336a@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-31 14:38:43 -06:00
Uday Shankar	b9848ca7a7	selftests: ublk: kublk: move per-thread data out of ublk_queue Towards the goal of decoupling ublk_queues from ublk server threads, move resources/data that should be per-thread rather than per-queue out of ublk_queue and into a new struct ublk_thread. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-5-e9d3b119336a@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-31 14:38:39 -06:00

1 2 3 4 5 ...

1355516 Commits