When retry_aligned_read() encounters an overlapped stripe, it releases
the stripe via raid5_release_stripe(), which puts it on the lockless
released_stripes llist. In the next raid5d loop iteration,
release_stripe_list() drains the stripe onto handle_list (since
STRIPE_HANDLE is set by the original IO), but retry_aligned_read()
runs before handle_active_stripes() and removes the stripe from
handle_list via find_get_stripe() -> list_del_init(). This prevents
handle_stripe() from ever processing the stripe to resolve the
overlap, causing an infinite loop and soft lockup.
Fix this by using __release_stripe() with temp_inactive_list instead
of raid5_release_stripe() in the failure path, so the stripe does not
go through the released_stripes llist. This allows raid5d to break out
of its loop, and the overlap will be resolved when the stripe is
eventually processed by handle_stripe().
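For illustration, the failure path then looks roughly like this (a
condensed sketch, assuming a local temp_inactive_list like the one raid5d
uses; not the exact upstream hunk):
```
/*
 * Put the stripe back under conf->device_lock instead of deferring it to
 * the lockless released_stripes llist, so the next raid5d iteration can
 * pass it to handle_active_stripes(). Draining temp_inactive_list is
 * omitted here; with STRIPE_HANDLE set the stripe stays on handle_list.
 */
LIST_HEAD(temp_inactive_list);

spin_lock_irq(&conf->device_lock);
__release_stripe(conf, sh, &temp_inactive_list);
spin_unlock_irq(&conf->device_lock);
```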
Fixes: 773ca82fa1 ("raid5: make release_stripe lockless")
Cc: stable@vger.kernel.org
Signed-off-by: FengWei Shih <dannyshih@synology.com>
Signed-off-by: Chia-Ming Chang <chiamingc@synology.com>
Link: https://lore.kernel.org/linux-raid/20260402061406.455755-1-chiamingc@synology.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
When skip_copy is enabled on a doubly-degraded RAID6, a device that is
being written to will be in R5_LOCKED state with R5_UPTODATE cleared.
If a new read triggers fetch_block() while the write is still in
flight, the 2-failure compute path may select this locked device as a
compute target because it is not R5_UPTODATE.
Because skip_copy makes the device page point directly to the bio page,
reconstructing data into it might be risky. Also, since the compute
marks the device R5_UPTODATE, it triggers the WARN_ON() in ops_run_io()
that checks that R5_SkipCopy and R5_UPTODATE are not both set.
This can be reproduced by running small-range concurrent read/write on
a doubly-degraded RAID6 with skip_copy enabled, for example:
mdadm -C /dev/md0 -l6 -n6 -R -f /dev/loop[0-3] missing missing
echo 1 > /sys/block/md0/md/skip_copy
fio --filename=/dev/md0 --rw=randrw --bs=4k --numjobs=8 \
--iodepth=32 --size=4M --runtime=30 --time_based --direct=1
Fix by checking R5_LOCKED before proceeding with the compute. The
compute will be retried once the lock is cleared on IO completion.
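A minimal sketch of the guard (illustrative; the exact placement in the
compute path may differ):
```
/*
 * A device with an in-flight write (R5_LOCKED set, R5_UPTODATE clear)
 * must not be picked as a compute target: with skip_copy its page is the
 * bio page. Bail out and retry after the write completes.
 */
if (test_bit(R5_LOCKED, &sh->dev[i].flags))
	return 0;
```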
Signed-off-by: FengWei Shih <dannyshih@synology.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260319053351.3676794-1-dannyshih@synology.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Set chunk_sectors to the full stripe width (io_opt) so that the block
layer splits I/O at full stripe boundaries. This ensures that large
writes are aligned to full stripes, avoiding the read-modify-write
overhead that occurs with partial stripe writes in RAID-5/6.
When chunk_sectors is set, the block layer's bio splitting logic in
get_max_io_size() uses blk_boundary_sectors_left() to limit I/O size
to the boundary. This naturally aligns split bios to full stripe
boundaries, enabling more efficient full stripe writes.
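For illustration, the change amounts to roughly the following in
raid5_set_limits() (a sketch, not the exact hunk):
```
/*
 * io_opt already holds the full stripe width in bytes; export it as the
 * split boundary. chunk_sectors is in 512-byte sectors.
 */
lim.chunk_sectors = lim.io_opt >> SECTOR_SHIFT;
```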
Test results with 24-disk RAID5 (chunk_size=64k):
dd if=/dev/zero of=/dev/md0 bs=10M oflag=direct
Before: 461 MB/s
After: 520 MB/s (+12.8%)
Link: https://lore.kernel.org/linux-raid/20260223035834.3132498-1-yukuai@fnnas.com
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
bdev_nonrot() is simply the negation of bdev_rot().
So replace all call sites of bdev_nonrot() with (negated) calls to
bdev_rot() and remove bdev_nonrot().
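The call-site conversion is mechanical, e.g.:
```
/*
 * before:	if (bdev_nonrot(rdev->bdev))
 * after:	if (!bdev_rot(rdev->bdev))
 */
```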
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones where the resulting diff is easy
to verify because just that final GFP_KERNEL argument sits on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that for a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
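For example, at a hypothetical call site the rewrite amounts to:
```
/* before */
ptr = kmalloc_obj(struct foo, GFP_KERNEL);
/* after */
ptr = kmalloc_obj(struct foo);
```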
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations:      kmalloc(sizeof(TYPE), ...)
  are replaced with:     kmalloc_obj(TYPE, ...)
Array allocations:       kmalloc_array(COUNT, sizeof(TYPE), ...)
  are replaced with:     kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations:  kmalloc(struct_size(PTR, FAM, COUNT), ...)
  are replaced with:     kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
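For a hypothetical struct the replacements look like:
```
/* before */
p = kmalloc(sizeof(struct foo), GFP_KERNEL);
q = kmalloc_array(count, sizeof(struct foo), GFP_KERNEL);
/* after */
p = kmalloc_obj(struct foo, GFP_KERNEL);
q = kmalloc_objs(struct foo, count, GFP_KERNEL);
```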
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
When the llbitmap bit state is still unwritten, any new write should force
rcw; that is why bitmap_ops->blocks_synced() is checked in
handle_stripe_dirtying(). However, the same check is missing in
need_this_block(), causing the stripe to loop forever during handling:
handle_stripe() keeps deciding to go to handle_stripe_fill(), while
need_this_block() always returns 0 and nothing is handled.
Link: https://lore.kernel.org/linux-raid/20260123182623.3718551-2-yukuai@fnnas.com
Fixes: 5ab829f197 ("md/md-llbitmap: introduce new lockless bitmap")
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
The 'recovery_disabled' logic is complex and confusing; it was originally
intended to preserve the raid in extreme scenarios. It was used in the
following cases:
- When sync fails and setting badblocks also fails, kick out non-In_sync
rdev and block spare rdev from joining to preserve raid [1]
- When last backup is unavailable, prevent repeated add-remove of spares
triggering recovery [2]
The original issues are now resolved:
- Error handlers in all raid types prevent last rdev from being kicked out
- Disks with failed recovery are marked Faulty and can't re-join
Therefore, remove 'recovery_disabled' as it's no longer needed.
[1] 5389042ffa ("md: change managed of recovery_disabled.")
[2] 4044ba58dd ("md: don't retry recovery of raid1 that fails due to error on source drive.")
Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-13-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Currently, when a sync read fails and setting badblocks also fails
(exceeding the 512-entry limit), the rdev isn't immediately marked Faulty.
Instead 'recovery_disabled' is set and non-In_sync rdevs are removed later.
This preserves array availability if the bad regions aren't read, but bad
sectors might be read by users before rdev removal. This occurs due
to incorrect resync/recovery_offset updates that include these bad
sectors.
When badblocks exceed 512, keeping the disk provides little benefit
while adding complexity. Prompt disk replacement is more important.
Therefore, when setting badblocks fails, call md_error() directly to mark
the rdev Faulty immediately, preventing potential data access issues.
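For illustration, the error paths end up using the usual pattern (a
sketch; the exact call sites are in the raid1/raid10 recovery code):
```
/*
 * If the bad range can't be recorded, fail the device right away instead
 * of deferring the decision via recovery_disabled.
 */
if (!rdev_set_badblocks(rdev, sector, sectors, 0))
	md_error(mddev, rdev);
```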
After this change, cleanup of offset update logic and 'recovery_disabled'
handling will follow.
Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-6-linan666@huaweicloud.com
Fixes: 5e5702898e ("md/raid10: Handle read errors during recovery better.")
Fixes: 3a9f28a511 ("md/raid1: improve handling of read failure during recovery.")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
On the one hand, stripe_request_ctx is 72 bytes, which is a bit large for
a stack variable.
On the other hand, the sectors_to_do bitmap has a fixed size, which caps
max_hw_sectors_kb of a raid5 array at 256 * 4k = 1MB, and this makes
full stripe IO impossible for arrays where chunk_size * data_disks is
bigger. Allocating ctx at runtime makes it possible to get rid of this
limit.
Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-6-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
The variable mddev->private is first assigned to conf and then checked:
conf = mddev->private;
if (!conf) ...
If conf is NULL, then mddev->private is also NULL. In this case,
null-pointer dereferences can occur when calling raid5_quiesce():
raid5_quiesce(mddev, true);
raid5_quiesce(mddev, false);
since mddev->private is assigned to conf again in raid5_quiesce(), and conf
is dereferenced in several places, for example:
conf->quiesce = 0;
wake_up(&conf->wait_for_quiescent);
To fix this issue, the function should unlock mddev and return before
invoking raid5_quiesce() when conf is NULL, following the existing pattern
in raid5_change_consistency_policy().
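A sketch of the early return (illustrative; the error value is an
assumption):
```
conf = mddev->private;
if (!conf) {
	/* nothing to quiesce: unlock and bail out before raid5_quiesce() */
	mddev_unlock(mddev);
	return -ENODEV;
}
raid5_quiesce(mddev, true);
```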
Fixes: fa1944bbe6 ("md/raid5: Wait sync io to finish before changing group cnt")
Signed-off-by: Tuo Li <islituo@gmail.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20251225130326.67780-1-islituo@gmail.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
The following test can cause an IO hang:
mdadm -CvR /dev/md0 -l10 -n4 /dev/sd[abcd] --assume-clean --chunk=64K --bitmap=none
sleep 5
echo 1 > /sys/block/sda/device/delete
echo 1 > /sys/block/sdb/device/delete
echo 1 > /sys/block/sdc/device/delete
echo 1 > /sys/block/sdd/device/delete
dd if=/dev/md0 of=/dev/null bs=8k count=1 iflag=direct
Root cause:
1) all disks are removed, however all rdevs in the array are still in
sync, so IO will be issued normally.
2) IO fails on sda and setting badblocks fails; sda becomes Faulty
and MD_SB_CHANGE_PENDING is set.
3) error recovery tries to recover this IO from the other disks; IO is
issued to sdb, sdc, and sdd.
4) IO fails on sdb and setting badblocks fails again; now the array is
broken and becomes read-only.
5) IO fails on sdc and sdd; however, the stripe can't be handled anymore
because MD_SB_CHANGE_PENDING is set:
handle_stripe
 if (test_bit MD_SB_CHANGE_PENDING)
  set_bit STRIPE_HANDLE
  goto finish
  // skip handling failed stripe
release_stripe
 if (test_bit STRIPE_HANDLE)
  list_add_tail conf->handle_list
6) later raid5d can't handle the failed stripe either:
raid5d
 md_check_recovery
  md_update_sb
   if (!md_is_rdwr())
    // can't clear pending bit
    return
 if (test_bit MD_SB_CHANGE_PENDING)
  break;
  // can't handle failed stripe
Since MD_SB_CHANGE_PENDING can never be cleared for a read-only array,
fix this problem by skipping this check for read-only arrays.
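For illustration, the check in handle_stripe() then becomes something
like (a sketch, not the exact hunk):
```
/*
 * Only defer the stripe for a pending superblock write when the array is
 * read-write; for a read-only array the pending bit can never clear.
 */
if (md_is_rdwr(conf->mddev) &&
    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
	set_bit(STRIPE_HANDLE, &sh->state);
	goto finish;
}
```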
Link: https://lore.kernel.org/linux-raid/20251117085557.770572-3-yukuai@fnnas.com
Fixes: d87f064f58 ("md: never update metadata when array is read-only.")
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Previously, a raid array used the maximum logical block size (LBS)
of all member disks. Adding a disk with a larger LBS at runtime could
unexpectedly increase the RAID's LBS, risking corruption of existing
partitions. This can be reproduced by:
```
# LBS of sd[de] is 512 bytes, sdf is 4096 bytes.
mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean
# LBS is 512
cat /sys/block/md0/queue/logical_block_size
# create partition md0p1
parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100%
lsblk | grep md0p1
# LBS becomes 4096 after adding sdf
mdadm --add -q /dev/md0 /dev/sdf
cat /sys/block/md0/queue/logical_block_size
# partition lost
partprobe /dev/md0
lsblk | grep md0p1
```
Simply restricting larger-LBS disks is inflexible. In some scenarios,
only disks with a 512-byte LBS are available at creation time, but disks
with a 4KB LBS may be added to the array later.
Making LBS configurable is the best way to handle this.
After this patch, the raid will:
- store LBS in disk metadata
- add a read-write sysfs 'mdX/logical_block_size'
Future mdadm should support setting LBS via metadata field during RAID
creation and the new sysfs. Though the kernel allows runtime LBS changes,
users should avoid modifying it after creating partitions or filesystems
to prevent compatibility issues.
Only 1.x metadata supports configurable LBS; 0.90 metadata initializes all
fields to default values at auto-detect. Supporting 0.90 would require
more extensive changes and no such use case has been observed.
Note that many RAID paths rely on PAGE_SIZE alignment, including for
metadata I/O. An LBS larger than PAGE_SIZE would result in metadata
read/write failures, so such a configuration must be rejected.
Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-6-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- FC target fixes (Daniel)
- Authentication fixes and updates (Martin, Chris)
- Admin controller handling (Kamaljit)
- Target lockdep assertions (Max)
- Keep-alive updates for discovery (Alastair)
- Suspend quirk (Georg)
- MD pull request via Yu:
- Add support for a lockless bitmap.
A key feature of the new bitmap is that the IO fastpath is
lockless. If a user issues lots of write IO to the same bitmap
bit in a short time, only the first write has the additional
overhead of updating the bitmap bit; the following writes have no
additional overhead.
By only resyncing or recovering written data, there is no need to
do a full disk resync/recovery when creating a new array or
replacing a disk.
- Switch ->getgeo() and ->bios_param() to using struct gendisk rather
than struct block_device.
- Rust block changes via Andreas. This series adds configuration via
configfs and remote completion to the rnull driver. The series also
includes a set of changes to the rust block device driver API: a few
cleanup patches, and a few features supporting the rnull changes.
The series removes the raw buffer formatting logic from
`kernel::block` and improves the logic available in `kernel::string`
to support the same use as the removed logic.
- floppy arch cleanups
- Reduce the number of dereferences needed for ublk commands
- Restrict supported sockets for nbd. Mostly done to eliminate a class
of issues perpetually reported by syzbot using nonsensical socket
setups.
- A few s390 dasd block fixes
- Fix a few issues around atomic writes
- Improve DMA interaction for integrity requests
- Improve how iovecs are treated with regard to O_DIRECT alignment
constraints.
We used to require each segment to adhere to the constraints, now
only the request as a whole needs to.
- Clean up and improve p2p support, enabling use of p2p for metadata
payloads
- Improve locking of request lookup, using SRCU where appropriate
- Use page references properly for brd, avoiding very long RCU sections
- Fix ordering of recursively submitted IOs
- Clean up and improve updating nr_requests for a live device
- Various fixes and cleanups
* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
s390/dasd: enforce dma_alignment to ensure proper buffer validation
s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
ublk: remove redundant zone op check in ublk_setup_iod()
nvme: Use non zero KATO for persistent discovery connections
nvmet: add safety check for subsys lock
nvme-core: use nvme_is_io_ctrl() for I/O controller check
nvme-core: do ioccsz/iorcsz validation only for I/O controllers
nvme-core: add method to check for an I/O controller
blk-cgroup: fix possible deadlock while configuring policy
blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
blk-mq: Fix more tag iteration function documentation
selftests: ublk: fix behavior when fio is not installed
ublk: don't access ublk_queue in ublk_unmap_io()
ublk: pass ublk_io to __ublk_complete_rq()
ublk: don't access ublk_queue in ublk_need_complete_req()
ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
ublk: don't pass ublk_queue to ublk_fetch()
ublk: don't access ublk_queue in ublk_config_io_buf()
ublk: don't access ublk_queue in ublk_check_fetch_buf()
ublk: pass q_id and tag to __ublk_check_and_get_req()
...
The parameter max_hw_wzeroes_unmap_sectors in queue_limits should be
equal to max_write_zeroes_sectors if it is set to a non-zero value.
However, the stacked md drivers call md_init_stacking_limits() to
initialize this parameter to UINT_MAX but only adjust
max_write_zeroes_sectors when setting limits. Therefore, this
discrepancy triggers a value check failure in blk_validate_limits().
$ modprobe scsi_debug num_parts=2 dev_size_mb=8 lbprz=1 lbpws=1
$ mdadm --create /dev/md0 --level=0 --raid-device=2 /dev/sda1 /dev/sda2
mdadm: Defaulting to version 1.2 metadata
mdadm: RUN_ARRAY failed: Invalid argument
Fix this failure by explicitly setting max_hw_wzeroes_unmap_sectors to
max_write_zeroes_sectors. Since the linear and raid0 drivers support
write zeroes, they can support the unmap write zeroes operation if all of
the backend devices support it. However, the raid1/10/5 drivers don't
support write zeroes, so we have to set it to zero.
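The fix itself is essentially a one-liner per personality when stacking
limits (sketch):
```
/*
 * Keep the unmap variant in step with the write-zeroes limit; for
 * raid1/10/5 both end up being zero.
 */
lim.max_hw_wzeroes_unmap_sectors = lim.max_write_zeroes_sectors;
```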
Fixes: 0c40d7cb5e ("block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits")
Reported-by: John Garry <john.g.garry@oracle.com>
Closes: https://lore.kernel.org/linux-block/803a2183-a0bb-4b7a-92f1-afc5097630d2@oracle.com/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/linux-raid/20250910111107.3247530-2-yi.zhang@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
If a bio is split by internal handling such as chunk size or badblocks,
the corresponding trace_block_split() is missing, so blktrace cannot
catch BIO split events, making it harder to analyze the BIO sequence.
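A sketch of the pattern at an md split site (illustrative; it mirrors
what __bio_split_to_limits() does in the block layer):
```
struct bio *split = bio_split(bio, sectors, GFP_NOIO, &conf->bio_split);

bio_chain(split, bio);
trace_block_split(split, bio->bi_iter.bi_sector);	/* the missing event */
submit_bio_noacct(bio);
bio = split;
```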
Cc: stable@vger.kernel.org
Fixes: 4b1faf9316 ("block: Kill bio_pair_split()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, raid456 must perform an initial recovery of the whole array to
build the initial xor data; after that, IO to the array doesn't have to
read all the blocks on the underlying disks.
This behavior affects IO performance a lot, and with today's huge disks
the initial recovery can take a long time. Hence llbitmap will support
lazy initial recovery in the following patches. This method is used to
check whether the data blocks are synced; if not, IO will still have to
read all blocks for raid456.
Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
When WQ_CPU_INTENSIVE is specified, the workqueue doesn't
participate in concurrency management. This behaviour is already
the case for WQ_UNBOUND workqueues, given that they are assigned
to their own worker threads.
Unset WQ_CPU_INTENSIVE, as the flag has no effect when
used with WQ_UNBOUND.
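Illustratively, for an unbound workqueue the change is just dropping the
flag (the flag set shown is an example, not necessarily the driver's
exact one):
```
/* before */
wq = alloc_workqueue("raid5wq", WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 0);
/* after: WQ_CPU_INTENSIVE is meaningless for unbound workqueues */
wq = alloc_workqueue("raid5wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
```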
Signed-off-by: Ryo Takakura <ryotkkr98@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/linux-raid/20250601013702.64640-1-ryotkkr98@gmail.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
During code review, it was found that except for raid5_bitmap_sector(),
reshape_progress is always checked before calling get_reshape_loc();
raid5_bitmap_sector() should check it as well, to avoid taking
'conf->device_lock' unnecessarily. Hence move that check inside
get_reshape_loc().
Link: https://lore.kernel.org/linux-raid/20250227120452.808503-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
There are two BUG reports that raid5 will hang at
bitmap_startwrite() ([1],[2]); the root cause is that bitmap start write and
end write are unbalanced, though it's not quite clear where. While reviewing
the raid5 code, it was found that the bitmap operations can be optimized.
For example, for a 4-disk raid5 with chunksize=8k, if a user issues an
IO (0 + 48k) to the array:
┌────────────────────────────────────────────────────────────┐
│chunk 0 │
│ ┌────────────┬─────────────┬─────────────┬────────────┼
│ sh0 │A0: 0 + 4k │A1: 8k + 4k │A2: 16k + 4k │A3: P │
│ ┼────────────┼─────────────┼─────────────┼────────────┼
│ sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P │
┼──────┴────────────┴─────────────┴─────────────┴────────────┼
│chunk 1 │
│ ┌────────────┬─────────────┬─────────────┬────────────┤
│ sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P │C3: 40k + 4k│
│ ┼────────────┼─────────────┼─────────────┼────────────┼
│ sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P │D3: 44k + 4k│
└──────┴────────────┴─────────────┴─────────────┴────────────┘
Before this patch, 4 stripe heads will be used, and each sh will attach a
bio for 3 disks, and each attached bio will trigger
bitmap_startwrite() once, which means 12 times in total.
- 3 times (0 + 4k), for (A0, A1 and A2)
- 3 times (4 + 4k), for (B0, B1 and B2)
- 3 times (8 + 4k), for (C0, C1 and C3)
- 3 times (12 + 4k), for (D0, D1 and D3)
After this patch, the md upper layer will calculate that the IO range
(0 + 48k) corresponds to the bitmap range (0 + 16k), and call
bitmap_startwrite() just once.
Note that this patch will align bitmap ranges to the chunks; for example,
if a user issues an IO (0 + 4k) to the array:
- Before this patch, 1 time (0 + 4k), for A0;
- After this patch, 1 time (0 + 8k) for chunk 0;
Usually, one bitmap bit will represent more than one disk chunk, so this
makes no difference. And even if the user really created an array where
one chunk contains multiple bits, the overhead is just that more data
will be recovered after a power failure.
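The calculation can be modelled with a small helper (a userspace model
with a hypothetical name, not the kernel code; it reproduces both examples
above):
```
#include <stdint.h>

/*
 * Map an array IO range onto the chunk-aligned per-disk range the bitmap
 * tracks. offset/len/chunk share one unit (e.g. KiB).
 */
static void array_to_bitmap_range(uint64_t offset, uint64_t len,
				  uint64_t chunk, unsigned int data_disks,
				  uint64_t *bm_offset, uint64_t *bm_len)
{
	uint64_t stripe = chunk * data_disks;			/* full stripe width */
	uint64_t first = offset / stripe;			/* first full stripe */
	uint64_t last = (offset + len + stripe - 1) / stripe;	/* one past the last */

	*bm_offset = first * chunk;
	*bm_len = (last - first) * chunk;
}

/*
 * array_to_bitmap_range(0, 48, 8, 3, ...) -> (0 + 16k): one
 * bitmap_startwrite() call instead of 12.
 * array_to_bitmap_range(0, 4, 8, 3, ...)  -> (0 + 8k): aligned out to chunk 0.
 */
```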
Also remove STRIPE_BITMAP_PENDING since it's not used anymore.
[1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/
[2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250109015145.158868-6-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
The bitmap is used for the whole array for raid1/raid10, hence IO ranges
for the array can be used directly for the bitmap. However, for raid5 the
bitmap is used for the underlying disks, hence IO ranges for the array
can't be used directly for the bitmap.
Implement pers->bitmap_sector() for raid5 to convert IO ranges from the
array to the underlying disks.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250109015145.158868-5-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
For the case where IO failed for one rdev, the bit will be marked as NEEDED
in the following cases:
1) If badblocks are set and the rdev is not faulty;
2) If the rdev is faulty.
Case 1) is useless because synchronizing data onto badblocks makes no sense.
Case 2) can be replaced with checking mddev->degraded.
Also remove R1BIO_Degraded, R10BIO_Degraded and STRIPE_DEGRADED since
case 2) no longer uses them.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250109015145.158868-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
One customer reports a bug: raid5 hangs when changing the thread count
while resync is running. The stripes are all in conf->handle_list
and the new threads can't handle them.
Commit b39f35ebe8 ("md: don't quiesce in mddev_suspend()") removes
pers->quiesce from mddev_suspend/resume. Before that patch, mddev_suspend
needed to wait for all IO, including sync IO, to finish. Now it only
waits for normal IO.
Fix this by calling raid5_quiesce() from raid5_store_group_thread_cnt()
directly, to wait for all sync requests to finish before changing the group
count.
Fixes: b39f35ebe8 ("md: don't quiesce in mddev_suspend()")
Cc: stable@vger.kernel.org
Signed-off-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20241106095124.74577-1-xni@redhat.com
Signed-off-by: Song Liu <song@kernel.org>
Now that actual overlaps are not handled on the wait_for_overlap wq
anymore, the remaining cases when we wait on this wq are limited to
reshape. If reshape is not in progress, don't add to the wq in
raid5_make_request() because add_wait_queue() / remove_wait_queue()
operations take a spinlock and cause noticeable contention when multiple
threads are submitting requests to the mddev.
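A sketch of the resulting logic in raid5_make_request() (illustrative;
the reshape test shown is the usual raid5 idiom, the actual patch may
differ in detail):
```
bool on_wq = conf->reshape_progress != MaxSector;	/* reshape in flight? */

if (on_wq)
	add_wait_queue(&conf->wait_for_overlap, &wait);
/* ... build and submit the stripes for this bio ... */
if (on_wq)
	remove_wait_queue(&conf->wait_for_overlap, &wait);
```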
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Link: https://lore.kernel.org/r/20240827153536.6743-3-artur.paszkiewicz@intel.com
Signed-off-by: Song Liu <song@kernel.org>