Pull block updates from Jens Axboe:
- Add shared memory zero-copy I/O support for ublk, bypassing per-I/O
copies between kernel and userspace by matching registered buffer
PFNs at I/O time. Includes selftests.
- Refactor bio integrity to support filesystem initiated integrity
operations and arbitrary buffer alignment.
- Clean up bio allocation, splitting bio_alloc_bioset() into clear fast
and slow paths. Add bio_await() and bio_submit_or_kill() helpers,
unify synchronous bi_end_io callbacks.
- Fix zone write plug refcount handling and plug removal races. Add
support for serializing zone writes at QD=1 for rotational zoned
devices, yielding significant throughput improvements.
- Add SED-OPAL ioctls for Single User Mode management and a STACK_RESET
command.
- Add io_uring passthrough (uring_cmd) support to the BSG layer.
- Replace pp_buf in partition scanning with struct seq_buf.
- zloop improvements and cleanups.
- drbd genl cleanup, switching to pre_doit/post_doit.
- NVMe pull request via Keith:
- Fabrics authentication updates
- Enhanced block queue limits support
- Workqueue usage updates
- A new write zeroes device quirk
- Tagset cleanup fix for loop device
- MD pull requests via Yu Kuai:
- Fix raid5 soft lockup in retry_aligned_read()
- Fix raid10 deadlock with check operation and nowait requests
- Fix raid1 overlapping writes on writemostly disks
- Fix sysfs deadlock on array_state=clear
- Proactive RAID-5 parity building with llbitmap, with
write_zeroes_unmap optimization for initial sync
- Fix llbitmap barrier ordering, rdev skipping, and bitmap_ops
version mismatch fallback
- Fix bcache use-after-free and uninitialized closure
- Validate raid5 journal metadata payload size
- Various cleanups
- Various other fixes, improvements, and cleanups
* tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (146 commits)
ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
scsi: bsg: fix buffer overflow in scsi_bsg_uring_cmd()
block: refactor blkdev_zone_mgmt_ioctl
MAINTAINERS: update ublk driver maintainer email
Documentation: ublk: address review comments for SHMEM_ZC docs
ublk: allow buffer registration before device is started
ublk: replace xarray with IDA for shmem buffer index allocation
ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
ublk: verify all pages in multi-page bvec fall within registered range
ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
xfs: use bio_await in xfs_zone_gc_reset_sync
block: add a bio_submit_or_kill helper
block: factor out a bio_await helper
block: unify the synchronous bi_end_io callbacks
xfs: fix number of GC bvecs
selftests/ublk: add read-only buffer registration test
selftests/ublk: add filesystem fio verify test for shmem_zc
selftests/ublk: add hugetlbfs shmem_zc test for loop target
selftests/ublk: add shared memory zero-copy test
selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
...
Pull vfs buffer_head updates from Christian Brauner:
"This cleans up the mess that has accumulated over the years in
metadata buffer_head tracking for inodes.
It moves the tracking into dedicated structure in filesystem-private
part of the inode (so that we don't use private_list, private_data,
and private_lock in struct address_space), and also moves couple other
users of private_data and private_list so these are removed from
struct address_space saving 3 longs in struct inode for 99% of inodes"
* tag 'vfs-7.1-rc1.bh.metadata' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
fs: Drop i_private_list from address_space
fs: Drop mapping_metadata_bhs from address space
ext4: Track metadata bhs in fs-private inode part
minix: Track metadata bhs in fs-private inode part
udf: Track metadata bhs in fs-private inode part
fat: Track metadata bhs in fs-private inode part
bfs: Track metadata bhs in fs-private inode part
affs: Track metadata bhs in fs-private inode part
ext2: Track metadata bhs in fs-private inode part
fs: Provide functions for handling mapping_metadata_bhs directly
fs: Switch inode_has_buffers() to take mapping_metadata_bhs
fs: Make bhs point to mapping_metadata_bhs
fs: Move metadata bhs tracking to a separate struct
fs: Fold fsync_buffers_list() into sync_mapping_buffers()
fs: Drop osync_buffers_list()
kvm: Use private inode list instead of i_private_list
fs: Remove i_private_data
aio: Stop using i_private_data and i_private_lock
hugetlbfs: Stop using i_private_data
fs: Stop using i_private_data for metadata bh tracking
...
Add a new helper to wait for a bio and anything chained off it to
complete synchronously after submitting it. This factors common code out
of submit_bio_wait and bio_await_chain and will also be useful for
file system code and thus is exported.
Note that this will now set REQ_SYNC also for the bio_await case for
consistency. Nothing should look at the flag in the end_io handler,
but if something does having the flag set makes more sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260407140538.633364-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
wbt_init_enable_default() uses WARN_ON_ONCE to check for failures from
wbt_alloc() and wbt_init(). However, both are expected failure paths:
- wbt_alloc() can return NULL under memory pressure (-ENOMEM)
- wbt_init() can fail with -EBUSY if wbt is already registered
syzbot triggers this by injecting memory allocation failures during MTD
partition creation via ioctl(BLKPG), causing a spurious warning.
wbt_init_enable_default() is a best-effort initialization called from
blk_register_queue() with a void return type. Failure simply means the
disk operates without writeback throttling, which is harmless.
Replace WARN_ON_ONCE with plain if-checks, consistent with how
wbt_set_lat() in the same file already handles these failures. Add a
pr_warn() for the wbt_init() failure to retain diagnostic information
without triggering a full stack trace.
Reported-by: syzbot+71fcf20f7c1e5043d78c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=71fcf20f7c1e5043d78c
Fixes: 41afaeeda5 ("blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter")
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://patch.msgid.link/20260316070358.65225-2-ytohnuki@amazon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a bio is allocated from the mempool with REQ_ALLOC_CACHE set and
later completed, bio_put() places it into the per-cpu bio_alloc_cache
via bio_put_percpu_cache() instead of freeing it back to the
mempool/slab. The slab allocation remains tracked by kmemleak, but the
only reference to the bio is through the percpu cache's free_list,
which kmemleak fails to trace through percpu memory. This causes
kmemleak to report the cached bios as unreferenced objects.
Use symmetric kmemleak_free()/kmemleak_alloc() calls to properly track
bios across percpu cache transitions:
- bio_put_percpu_cache: call kmemleak_free() when a bio enters the
cache, unregistering it from kmemleak tracking.
- bio_alloc_percpu_cache: call kmemleak_alloc() when a bio is taken
from the cache for reuse, re-registering it so that genuine leaks
of reused bios remain detectable.
- __bio_alloc_cache_prune: call kmemleak_alloc() before bio_free() so
that kmem_cache_free()'s internal kmemleak_free() has a matching
allocation to pair with.
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260326144058.2392319-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the missing put_disk() on the error path in
blkcg_maybe_throttle_current(). When blkcg lookup, blkg lookup, or
blkg_tryget() fails, the function jumps to the out label which only
calls rcu_read_unlock() but does not release the disk reference acquired
by blkcg_schedule_throttle() via get_device(). Since current->throttle_disk
is already set to NULL before the lookup, blkcg_exit() cannot release
this reference either, causing the disk to never be freed.
Restore the reference release that was present as blk_put_queue() in the
original code but was inadvertently dropped during the conversion from
request_queue to gendisk.
Fixes: f05837ed73 ("blk-cgroup: store a gendisk to throttle in struct task_struct")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260331085054.46857-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When LOOP_CONFIGURE is called with LO_FLAGS_PARTSCAN, the following
sequence occurs:
1. disk_force_media_change() sets GD_NEED_PART_SCAN
2. Uevent suppression is lifted and a KOBJ_CHANGE uevent is sent
3. loop_global_unlock() releases the lock
4. loop_reread_partitions() calls bdev_disk_changed() to scan
There is a race between steps 2 and 4: when udev receives the uevent
and opens the device before loop_reread_partitions() runs,
blkdev_get_whole() in bdev.c sees GD_NEED_PART_SCAN set and calls
bdev_disk_changed() for a first scan. Then loop_reread_partitions()
does a second scan. The open_mutex serializes these two scans, but
does not prevent both from running.
The second scan in bdev_disk_changed() drops all partition devices
from the first scan (via blk_drop_partitions()) before re-adding
them, causing partition block devices to briefly disappear. This
breaks any systemd unit with BindsTo= on the partition device: systemd
observes the device going dead, fails the dependent units, and does
not retry them when the device reappears.
Fix this by removing the GD_NEED_PART_SCAN set from
disk_force_media_change() entirely. None of the current callers need
the lazy on-open partition scan triggered by this flag:
- floppy: sets GENHD_FL_NO_PART, so disk_has_partscan() is always
false and GD_NEED_PART_SCAN has no effect.
- loop (loop_configure, loop_change_fd): when LO_FLAGS_PARTSCAN is
set, loop_reread_partitions() performs an explicit scan. When not
set, GD_SUPPRESS_PART_SCAN prevents the lazy scan path.
- loop (__loop_clr_fd): calls bdev_disk_changed() explicitly if
LO_FLAGS_PARTSCAN is set.
- nbd (nbd_clear_sock_ioctl): capacity is set to zero immediately
after; nbd manages GD_NEED_PART_SCAN explicitly elsewhere.
With GD_NEED_PART_SCAN no longer set by disk_force_media_change(),
udev opening the loop device after the uevent no longer triggers a
redundant scan in blkdev_get_whole(), and only the single explicit
scan from loop_reread_partitions() runs.
A regression test for this bug has been submitted to blktests:
https://github.com/linux-blktests/blktests/pull/240.
Fixes: 9f65c489b6 ("loop: raise media_change event")
Signed-off-by: Daan De Meyer <daan@amutable.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://patch.msgid.link/20260331105130.1077599-1-daan@amutable.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The TCG Opal device could enter a state where no new session can be
created, blocking even Discovery or PSID reset. While a power cycle
or waiting for the timeout should work, there is another possibility
for recovery: using the Stack Reset command.
The Stack Reset command is defined in the TCG Storage Architecture Core
Specification and is mandatory for all Opal devices (see Section 3.3.6
of the Opal SSC specification).
This patch implements the Stack Reset command. Sending it should clear
all active sessions immediately, allowing subsequent commands to run
successfully. While it is a TCG transport layer command, the Linux
kernel implements only Opal ioctls, so it makes sense to use the
IOC_OPAL ioctl interface.
The Stack Reset takes no arguments; the response can be success or pending.
If the command reports a pending state, userspace can try to repeat it;
in this case, the code returns -EBUSY.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Ondrej Kozina <okozina@redhat.com>
Link: https://patch.msgid.link/20260310095349.411287-1-gmazyland@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nobody is calling mark_buffer_dirty_inode() with internal bdev inode and
it doesn't make sense for internal bdev inode to have any metadata
buffer heads. Just drop the pointless invalidate_inode_buffers() call
and consequently the whole bdev_evict_inode() because generic code takes
care of the rest.
CC: linux-block@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260326095354.16340-47-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
bio_alloc_bioset() first strips __GFP_DIRECT_RECLAIM from the optimistic
fast allocation attempt with try_alloc_gfp(). If that fast path fails,
the slowpath checks saved_gfp to decide whether blocking allocation is
allowed, but then still calls mempool_alloc() with the stripped gfp mask.
That can lead to a NULL bio pointer being passed into bio_init().
Fix the slowpath by using saved_gfp for the bio and bvec mempool
allocations.
Fixes: b520c4eef8 ("block: split bio_alloc_bioset more clearly into a fast and slowpath")
Reported-by: syzbot+09ddb593eea76a158f42@syzkaller.appspotmail.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/p01.gc6e9ad5845ad.ttca29g@ub.hpns
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for removing the strlcat API[1], replace the char *pp_buf
with a struct seq_buf, which tracks the current write position and
remaining space internally. This allows for:
- Direct use of seq_buf_printf() in place of snprintf()+strlcat()
pairs, eliminating local tmp buffers throughout.
- Adjacent strlcat() calls that build strings piece-by-piece
(e.g., strlcat("["); strlcat(name); strlcat("]")) to be collapsed
into single seq_buf_printf() calls.
- Simpler call sites: seq_buf_puts() takes only the buffer and string,
with no need to pass PAGE_SIZE at every call.
The backing buffer allocation is unchanged (__get_free_page), and the
output path uses seq_buf_str() to NUL-terminate before passing to
printk().
Link: https://github.com/KSPP/linux/issues/370 [1]
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Josh Law <objecting@objecting.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Josh Law <objecting@objecting.org>
Link: https://patch.msgid.link/20260321004840.work.670-kees@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add an io_uring command handler to the generic BSG layer. The new
.uring_cmd file operation validates io_uring features and delegates
handling to a per-queue bsg_uring_cmd_fn callback.
Extend bsg_register_queue() so transport drivers can register both
sg_io and io_uring command handlers.
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260317072226.2598233-3-yangxiuwei@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function bio_add_page() returns the number of bytes added to the
bio, and if that failed it should return 0.
However there is a special quirk, if a caller is passing a page with
length 0, that function will always return 0 but with different results:
- The page is added to the bio
If there is enough bvec slot or the folio can be merged with the last
bvec.
The return value 0 is just the length passed in, which is also 0.
- The page is not added to the bio
If the page is not mergeable with the last bvec, or there is no bvec
slot available.
The return value 0 means page is not added into the bio.
Unfortunately the caller is not able to distinguish the above two cases,
and will treat the 0 return value as page addition failure.
In that case, this can lead to the double releasing of the last page:
- By the bio cleanup
Which normally goes through every page of the bio, including the last
page which is added into the bio.
- By the caller
Which believes the page is not added into the bio, thus would manually
release the page.
I do not think anyone should call bio_add_folio()/bio_add_page() with zero
length, but idiots like me can still show up.
So add an extra WARN_ON_ONCE() check for zero length and rejects it
early to avoid double freeing.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Link: https://patch.msgid.link/bc2223c080f38d0b63f968f605c918181c840f40.1773734749.git.wqu@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_alloc_bioset tries non-waiting slab allocations first for the bio and
bvec array, but does so in a somewhat convoluted way.
Restructure the function so that it first open codes these slab
allocations, and then falls back to the mempools with the original
gfp mask.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260316161144.1607877-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A bio segment may have partial interval block data with the rest
continuing into the next segments because direct-io data payloads only
need to align in memory to the device's DMA limits.
At the same time, the protection information may also be split in
multiple segments. The most likely way that may happen is if two
requests merge, or if we're directly using the io_uring user metadata.
The generate/verify, however, only ever accessed the first bip_vec.
Further, it may be possible to unalign the protection fields from the
user space buffer, or if there are odd additional opaque bytes in front
or in back of the protection information metadata region.
Change up the iteration to allow spanning multiple segments. This patch
is mostly a re-write of the protection information handling to allow any
arbitrary alignments, so it's probably easier to review the end result
rather than the diff.
Many controllers are not able to handle interval data composed of
multiple segments when PI is used, so this patch introduces a new
integrity limit that a low level driver can set to notify that it is
capable, default to false. The nvme driver is the first one to enable it
in this patch. Everyone else will force DMA alignment to the logical
block size as before to ensure interval data is always aligned within a
single segment.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260313144701.1221652-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a queue is shared across disk rebind (e.g., SCSI unbind/bind), the
previous disk's blkcg state is cleaned up asynchronously via
disk_release() -> blkcg_exit_disk(). If the new disk's blkcg_init_disk()
runs before that cleanup finishes, we may overwrite q->root_blkg while
the old one is still alive, and radix_tree_insert() in blkg_create()
fails with -EEXIST because the old blkg entries still occupy the same
queue id slot in blkcg->blkg_tree. This causes the sd probe to fail
with -ENOMEM.
Fix it by waiting in blkcg_init_disk() for root_blkg to become NULL,
which indicates the previous disk's blkcg cleanup has completed.
Fixes: 1059699f87 ("block: move blkcg initialization/destroy into disk allocation/release handler")
Cc: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260311032837.2368714-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260226031243.87200-3-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_steal_bios() transfers bios from a request to a bio_list when the
request is requeued to a different queue. The NVMe multipath failover
path (nvme_failover_req) currently open-codes clearing of REQ_POLLED,
bi_cookie, and REQ_NOWAIT on each bio before calling blk_steal_bios().
Move these fixups into blk_steal_bios() itself so that any caller
automatically gets correct flag state when bios cross queue boundaries.
Simplify nvme_failover_req() accordingly.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260226031243.87200-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bring in the shared branch with the block layer.
* 'for-7.1/block-integrity' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block: pass a maxlen argument to bio_iov_iter_bounce
block: add fs_bio_integrity helpers
block: make max_integrity_io_size public
block: prepare generation / verification helpers for fs usage
block: add a bdev_has_integrity_csum helper
block: factor out a bio_integrity_setup_default helper
block: factor out a bio_integrity_action helper
Signed-off-by: Christian Brauner <brauner@kernel.org>
Merge in integrity changes which are also landing in the VFS tree as
dependencies for fs related changes.
* for-7.1/block-integrity:
block: pass a maxlen argument to bio_iov_iter_bounce
block: add fs_bio_integrity helpers
block: make max_integrity_io_size public
block: prepare generation / verification helpers for fs usage
block: add a bdev_has_integrity_csum helper
block: factor out a bio_integrity_setup_default helper
block: factor out a bio_integrity_action helper
Correct the comments that the cloned bio must be freed before the memory
pointed to by @bio_src->bi_io_vecs (is freed).
Christoph Hellwig contributed most the of the update wording.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For blk-mq rotational zoned block devices (e.g. SMR HDDs), default to
having zone write plugging limit write operations to a maximum queue
depth of 1 for all zones. This significantly reduce write seek overhead
and improves SMR HDD write throughput.
For remotely connected disks with a very high network latency this
features might not be useful. However, remotely connected zoned devices
are rare at the moment, and we cannot know the round trip latency to
pick a good default for network attached devices. System administrators
can however disable this feature in that case.
For BIO based (non blk-mq) rotational zoned block devices, the device
driver (e.g. a DM target driver) can directly set an appropriate
default.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In order to maintain sequential write patterns per zone with zoned block
devices, zone write plugging issues only a single write BIO per zone at
any time. This works well but has the side effect that when large
sequential write streams are issued by the user and these streams cross
zone boundaries, the device ends up receiving a discontiguous set of
write commands for different zones. The same also happens when a user
writes simultaneously at high queue depth multiple zones: the device
does not see all sequential writes per zone and receives discontiguous
writes to different zones. While this does not affect the performance of
solid state zoned block devices, when using an SMR HDD, this pattern
change from sequential writes to discontiguous writes to different zones
significantly increases head seek which results in degraded write
throughput.
In order to reduce this seek overhead for rotational media devices,
introduce a per disk zone write plugs kernel thread to issue all write
BIOs to zones. This single zone write issuing context is enabled for
any zoned block device that has a request queue flagged with the new
QUEUE_ZONED_QD1_WRITES flag.
The flag QUEUE_ZONED_QD1_WRITES is visible as the sysfs queue attribute
zoned_qd1_writes for zoned devices. For regular block devices, this
attribute is not visible. For zoned block devices, a user can override
the default value set to force the global write maximum queue depth of
1 for a zoned block device, or clear this attribute to fallback to the
default behavior of zone write plugging which limits writes to QD=1 per
sequential zone.
Writing to a zoned block device flagged with QUEUE_ZONED_QD1_WRITES is
implemented using a list of zone write plugs that have a non-empty BIO
list. Listed zone write plugs are processed by the disk zone write plugs
worker kthread in FIFO order, and all BIOs of a zone write plug are all
processed before switching to the next listed zone write plug. A newly
submitted BIO for a non-FULL zone write plug that is not yet listed
causes the addition of the zone write plug at the end of the disk list
of zone write plugs.
Since the write BIOs queued in a zone write plug BIO list are
necessarilly sequential, for rotational media, using the single zone
write plugs kthread to issue all BIOs maintains a sequential write
pattern and thus reduces seek overhead and improves write throughput.
This processing essentially result in always writing to HDDs at QD=1,
which is not an issue for HDDs operating with write caching enabled.
Performance with write cache disabled is also not degraded thanks to
the efficient write handling of modern SMR HDDs.
A disk list of zone write plugs is defined using the new struct gendisk
zone_wplugs_list, and accesses to this list is protected using the
zone_wplugs_list_lock spinlock. The per disk kthread
(zone_wplugs_worker) code is implemented by the function
disk_zone_wplugs_worker(). A reference on listed zone write plugs is
always held until all BIOs of the zone write plug are processed by the
worker kthread. BIO issuing at QD=1 is driven using a completion
structure (zone_wplugs_worker_bio_done) and calls to blk_io_wait().
With this change, performance when sequentially writing the zones of a
30 TB SMR SATA HDD connected to an AHCI adapter changes as follows
(1MiB direct I/Os, results in MB/s unit):
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Sequential write | Baseline | Patched |
| Queue Depth | 6.19-rc8 | |
+------------------+----------+---------+
| 1 | 244 | 245 |
| 2 | 244 | 245 |
| 4 | 245 | 245 |
| 8 | 242 | 245 |
| 16 | 222 | 246 |
| 32 | 211 | 245 |
| 64 | 193 | 244 |
| 128 | 112 | 246 |
+------------------+----------+---------+
With the current code (baseline), as the sequential write stream crosses
a zone boundary, higher queue depth creates a gap between the
last IO to the previous zone and the first IOs to the following zones,
causing head seeks and degrading performance. Using the disk zone
write plugs worker thread, this pattern disappears and the maximum
throughput of the drive is maintained, leading to over 100%
improvements in throughput for high queue depth write.
Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1
MiB direct IOs, write throughput also increases significantly.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Random write | Baseline | Patched |
| Number of zones | 6.19-rc7 | |
+------------------+----------+---------+
| 1 | 191 | 192 |
| 2 | 101 | 128 |
| 4 | 115 | 123 |
| 8 | 90 | 120 |
| 16 | 64 | 115 |
| 32 | 58 | 105 |
| 64 | 56 | 101 |
| 128 | 55 | 99 |
+------------------+----------+---------+
Tests using XFS shows that buffered write speed with 8 jobs writing
files increases by 12% to 35% depending on the workload.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Workload | Baseline | Patched |
| | 6.19-rc7 | |
+------------------+----------+---------+
| 256MiB file size | 212 | 238 |
+------------------+----------+---------+
| 4MiB .. 128 MiB | 213 | 243 |
| random file size | | |
+------------------+----------+---------+
| 2MiB .. 8 MiB | 179 | 242 |
| random file size | | |
+------------------+----------+---------+
Performance gains are even more significant when using an HBA that
limits the maximum size of commands to a small value, e.g. HBAs
controlled with the mpi3mr driver limit commands to a maximum of 1 MiB.
In such case, the write throughput gains are over 40%.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Workload | Baseline | Patched |
| | 6.19-rc7 | |
+------------------+----------+---------+
| 256MiB file size | 175 | 245 |
+------------------+----------+---------+
| 4MiB .. 128 MiB | 174 | 244 |
| random file size | | |
+------------------+----------+---------+
| 2MiB .. 8 MiB | 171 | 243 |
| random file size | | |
+------------------+----------+---------+
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename struct gendisk zone_wplugs_lock field to zone_wplugs_hash_lock to
clearly indicates that this is the spinlock used for manipulating the
hash table of zone write plugs.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The helper function disk_zone_is_full() is only used in
disk_zone_wplug_is_full(). So remove it and open code it directly in
this single caller.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
disk_get_and_lock_zone_wplug() always returns a zone write plug with the
plug lock held. This is unnecessary since this function does not look at
the fields of existing plugs, and new plugs need to be locked only after
their insertion in the disk hash table, when they are being used.
Remove the zone write plug locking from disk_get_and_lock_zone_wplug()
and rename this function disk_get_or_alloc_zone_wplug().
blk_zone_wplug_handle_write() is modified to add locking of the zone
write plug after calling disk_get_or_alloc_zone_wplug() and before
starting to use the plug. This change also simplifies
blk_revalidate_seq_zone() as unlocking the plug becomes unnecessary.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function disk_zone_wplug_schedule_bio_work() always takes a
reference on the zone write plug of the BIO work being scheduled. This
ensures that the zone write plug cannot be freed while the BIO work is
being scheduled but has not run yet. However, this unconditional
reference taking is fragile since the reference taken is released by the
BIO work blk_zone_wplug_bio_work() function, which implies that there
always must be a 1:1 relation between the work being scheduled and the
work running.
Make sure to drop the reference taken when scheduling the BIO work if
the work is already scheduled, that is, when queue_work() returns false.
Fixes: 9e78c38ab3 ("block: Hold a reference on zone write plugs to schedule submission")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 7b29518728 ("block: Do not remove zone write plugs still in
use") modified disk_should_remove_zone_wplug() to add a check on the
reference count of a zone write plug to prevent removing zone write
plugs from a disk hash table when the plugs are still being referenced
by BIOs or requests in-flight. However, this check does not take into
account that a BIO completion may happen right after its submission by
a zone write plug BIO work, and before the zone write plug BIO work
releases the zone write plug reference count. This situation leads to
disk_should_remove_zone_wplug() returning false as in this case the zone
write plug reference count is at least equal to 3. If the BIO that
completes in such manner transitioned the zone to the FULL condition,
the zone write plug for the FULL zone will remain in the disk hash
table.
Furthermore, relying on a particular value of a zone write plug
reference count to set the BLK_ZONE_WPLUG_UNHASHED flag is fragile as
reading the atomic reference count and doing a comparison with some
value is not overall atomic at all.
Address these issues by reworking the reference counting of zone write
plugs so that removing plugs from a disk hash table can be done
directly from disk_put_zone_wplug() when the last reference on a plug
is dropped.
To do so, replace the function disk_remove_zone_wplug() with
disk_mark_zone_wplug_dead(). This new function sets the zone write plug
flag BLK_ZONE_WPLUG_DEAD (which replaces BLK_ZONE_WPLUG_UNHASHED) and
drops the initial reference on the zone write plug taken when the plug
was added to the disk hash table. This function is called either for
zones that are empty or full, or directly in the case of a forced plug
removal (e.g. when the disk hash table is being destroyed on disk
removal). With this change, disk_should_remove_zone_wplug() is also
removed.
disk_put_zone_wplug() is modified to call the function
disk_free_zone_wplug() to remove a zone write plug from a disk hash
table and free the plug structure (with a call_rcu()), when the last
reference on a zone write plug is dropped. disk_free_zone_wplug()
always checks that the BLK_ZONE_WPLUG_DEAD flag is set.
In order to avoid having multiple zone write plugs for the same zone in
the disk hash table, disk_get_and_lock_zone_wplug() checked for the
BLK_ZONE_WPLUG_UNHASHED flag. This check is removed and a check for
the new BLK_ZONE_WPLUG_DEAD flag is added to
blk_zone_wplug_handle_write(). With this change, we continue preventing
adding multiple zone write plugs for the same zone and at the same time
re-inforce checks on the user behavior by failing new incoming write
BIOs targeting a zone that is marked as dead. This case can happen only
if the user erroneously issues write BIOs to zones that are full, or to
zones that are currently being reset or finished.
Fixes: 7b29518728 ("block: Do not remove zone write plugs still in use")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds a function for retrieving the set of Locking objects enabled
for Single User Mode (SUM) and the value of the
RangeStartRangeLengthPolicy parameter.
It retrieves data from the LockingInfo table, specifically the
columns SingleUserModeRanges and RangeStartLengthPolicy, which
were added according to the TCG Opal Feature Set: Single User Mode,
as described in chapters 4.4.3.1 and 4.4.3.2.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change the column parameter in response_get_column() from u8 to u64
to support the full range of column identifiers.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This ioctl is used to set up RLE (read lock enabled) and WLE (write
lock enabled) parameters of the Locking object.
In Single User Mode (SUM), if the RangeStartRangeLengthPolicy parameter
is set in the 'Reactivate' method, only Admin authority maintains the
locking range length and start (offset) attributes of Locking objects
set up for SUM. All other attributes from struct opal_user_lr_setup
(RLE - read locking enabled, WLE - write locking enabled) shall
remain in possession of the User authority associated with the Locking
object set for SUM.
With the IOC_OPAL_ENABLE_DISABLE_LR ioctl, the opal_user_lr_setup
members 'range_start' and 'range_length' of the ioctl argument are
ignored.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This ioctl is used to set up locking range start (offset)
and locking range length attributes only.
In Single User Mode (SUM), if the RangeStartRangeLengthPolicy parameter
is set in the 'Reactivate' method, only Admin authority maintains the
locking range length and start (offset) attributes of Locking objects
set up for SUM. All other attributes from struct opal_user_lr_setup
(RLE - read locking enabled, WLE - write locking enabled) shall
remain in possession of the User authority associated with the Locking
object set for SUM.
Therefore, we need a separate function for setting up locking range
start and locking range length because it may require two different
authorities (and sessions) if the RangeStartRangeLengthPolicy attribute
is set.
With the IOC_OPAL_LR_SET_START_LEN ioctl, the opal_user_lr_setup
members 'RLE' and 'WLE' of the ioctl argument are ignored.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IOC_OPAL_LR_SETUP is used to set up a locking range entirely under a
single authority (usually Admin1), but for Single User Mode (SUM),
the permissions for attributes (RangeStart, RangeLength)
and (ReadLockEnable, WriteLockEnable, ReadLocked, WriteLocked)
may be split between two different authorities. Typically, it is Admin1
for the former and the User associated with the LockingRange in SUM
for the latter.
This commit only splits the internals in preparation for the introduction
of separate ioctls for setting RangeStart, RangeLength and the rest
using new ioctl calls.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds the 'Reactivate' method as described in the
"TCG Storage Opal SSC Feature Set: Single User Mode"
document (ch. 3.1.1.1).
The method enables switching an already active SED OPAL2 device,
with appropriate firmware support for Single User Mode (SUM),
to or from SUM.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As desribed in ch. 3.1.1.1.1.3 of TCG Storage Opal SSC Feature Set:
Single User Mode document.
To be used later in Reactivate method implementation.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As desribed in ch. 3.1.1.1.1.2 of TCG Storage Opal SSC Feature Set:
Single User Mode document.
To be used later in Reactivate method implementation and in function
for retrieving SUM device status.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As described in ch. 6.3, Table 240 in TCG Storage
Architecture Core Specification document.
It's also referenced in TCG Storage Opal SSC Feature Set:
Single User Mode document, ch. 3.1.1.1 Reactivate method.
It will be used later in Reactivate method implemetation
for sed-opal interface.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-and-tested-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow the file system to limit the size processed in a single
bounce operation. This is needed when generating integrity data
so that the size of a single integrity segment can't overflow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>