Commit Graph

1547 Commits

Author SHA1 Message Date
Yu Kuai
cf86bb53b9 md: wake raid456 reshape waiters before suspend
During raid456 reshape, direct IO across the reshape position can sleep
in raid5_make_request() waiting for reshape progress while still
holding an active_io reference. If userspace then freezes reshape and
writes md/suspend_lo or md/suspend_hi, mddev_suspend() kills active_io
and waits for all in-flight IO to drain.

This can deadlock: the IO needs reshape progress to continue, but the
reshape thread is already frozen, so the active_io reference is never
dropped and suspend never completes.

raid5_prepare_suspend() already wakes wait_for_reshape for dm-raid. Do
the same for normal md suspend when reshape is already interrupted, so
waiting raid456 IO can abort, drop its reference, and let suspend
finish.

The mdadm test tests/25raid456-reshape-deadlock reproduces the hang.

Fixes: 714d20150e ("md: add new helpers to suspend/resume array")
Link: https://lore.kernel.org/linux-raid/20260327140729.2030564-1-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 13:09:22 +08:00
Xiao Ni
808cec7460 md/raid1: serialize overlap io for writemostly disk
Previously, using wait_event() would wake up all waiters simultaneously,
and they would compete for the tree lock. The bio which gets the lock
first will be handled, so the write sequence cannot be guaranteed.

For example:
bio1(100,200)
bio2(150,200)
bio3(150,300)

The write sequence of fast device is bio1,bio2,bio3. But the write sequence
of slow device could be bio1,bio3,bio2 due to lock competition. This causes
data corruption.

Replace waitqueue with a fifo list to guarantee the write sequence. And it
also needs to iterate the list when removing one entry. If not, it may miss
the opportunity to wake up the waiting io.

For example:
bio1(1,3), bio2(2,4)
bio3(5,7), bio4(6,8)
These four bios are in the same bucket. bio1 and bio3 are inserted into
the rbtree. bio2 and bio4 are added to the waiting list and bio2 is the
first one. bio3 returns from slow disk and tries to wake up the waiting
bios. bio2 is removed from the list and will be handled. But bio1 hasn't
finished. So bio2 will be added into waiting list again. Then bio1 returns
from slow disk and wakes up waiting bios. bio4 is removed from the list
and will be handled. Now bio1, bio3 and bio4 all finish and bio2 is left
on the waiting list. So it needs to iterate the waiting list to wake up
the right bio.

Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260324072501.59865-1-xni@redhat.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 13:09:22 +08:00
Yu Kuai
09af773650 md: add fallback to correct bitmap_ops on version mismatch
If default bitmap version and on-disk version doesn't match, and mdadm
is not the latest version to set bitmap_type, set bitmap_ops based on
the disk version.

Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-2-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 13:09:22 +08:00
Abd-Alrhman Masalkhi
e4979f4fac md: remove unused static md_wq workqueue
The md_wq workqueue is defined as static and initialized in md_init(),
but it is not used anywhere within md.c.

All asynchronous and deferred work in this file is handled via
md_misc_wq or dedicated md threads.

Fixes: b75197e86e ("md: Remove flush handling")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Link: https://lore.kernel.org/linux-raid/20260328193522.3624-1-abd.masalkhi@gmail.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 13:09:21 +08:00
Yu Kuai
2aa72276fa md: fix array_state=clear sysfs deadlock
When "clear" is written to array_state, md_attr_store() breaks sysfs
active protection so the array can delete itself from its own sysfs
store method.

However, md_attr_store() currently drops the mddev reference before
calling sysfs_unbreak_active_protection(). Once do_md_stop(..., 0)
has made the mddev eligible for delayed deletion, the temporary
kobject reference taken by sysfs_break_active_protection() can become
the last kobject reference protecting the md kobject.

That allows sysfs_unbreak_active_protection() to drop the last
kobject reference from the current sysfs writer context. kobject
teardown then recurses into kernfs removal while the current sysfs
node is still being unwound, and lockdep reports recursive locking on
kn->active with kernfs_drain() in the call chain.

Reproducer on an existing level:
1. Create an md0 linear array and activate it:
   mknod /dev/md0 b 9 0
   echo none > /sys/block/md0/md/metadata_version
   echo linear > /sys/block/md0/md/level
   echo 1 > /sys/block/md0/md/raid_disks
   echo "$(cat /sys/class/block/sdb/dev)" > /sys/block/md0/md/new_dev
   echo "$(($(cat /sys/class/block/sdb/size) / 2))" > \
	/sys/block/md0/md/dev-sdb/size
   echo 0 > /sys/block/md0/md/dev-sdb/slot
   echo active > /sys/block/md0/md/array_state
2. Wait briefly for the array to settle, then clear it:
   sleep 2
   echo clear > /sys/block/md0/md/array_state

The warning looks like:

  WARNING: possible recursive locking detected
  bash/588 is trying to acquire lock:
  (kn->active#65) at __kernfs_remove+0x157/0x1d0
  but task is already holding lock:
  (kn->active#65) at sysfs_unbreak_active_protection+0x1f/0x40
  ...
  Call Trace:
   kernfs_drain
   __kernfs_remove
   kernfs_remove_by_name_ns
   sysfs_remove_group
   sysfs_remove_groups
   __kobject_del
   kobject_put
   md_attr_store
   kernfs_fop_write_iter
   vfs_write
   ksys_write

Restore active protection before mddev_put() so the extra sysfs
kobject reference is dropped while the mddev is still held alive. The
actual md kobject deletion is then deferred until after the sysfs
write path has fully returned.

Fixes: 9e59d60976 ("md: call del_gendisk in control path")
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260330055213.3976052-1-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-06 11:40:14 +08:00
Chen Cheng
6f507eb2bb md: remove unused mddev argument from export_rdev
The mddev argument in export_rdev() is never used. Remove it to
simplify callers.

Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20260304111417.20777-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2026-03-23 02:15:11 +08:00
Chen Cheng
eff0d74c6c md: suppress spurious superblock update error message for dm-raid
dm-raid has external metadata management (mddev->external = 1) and
no persistent superblock (mddev->persistent = 0). For these arrays,
there's no superblock to update, so the error message is spurious.

The error appears as:
md_update_sb: can't update sb for read-only array md0

Fixes: 8c9e376b9d ("md: warn about updating super block failure")
Reported-by: Tj <tj.iam.tj@proton.me>
Closes: https://lore.kernel.org/all/20260128082430.96788-1-tj.iam.tj@proton.me/
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20260210133847.269986-1-chencheng@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-03-16 01:24:59 +08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Linus Torvalds
99dfe2d4da Merge tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull more block updates from Jens Axboe:

 - Fix partial IOVA mapping cleanup in error handling

 - Minor prep series ignoring discard return value, as
   the inline value is always known

 - Ensure BLK_FEAT_STABLE_WRITES is set for drbd

 - Fix leak of folio in bio_iov_iter_bounce_read()

 - Allow IOC_PR_READ_* for read-only open

 - Another debugfs deadlock fix

 - A few doc updates

* tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  blk-mq: use NOIO context to prevent deadlock during debugfs creation
  blk-stat: convert struct blk_stat_callback to kernel-doc
  block: fix enum descriptions kernel-doc
  block: update docs for bio and bvec_iter
  block: change return type to void
  nvmet: ignore discard return value
  md: ignore discard return value
  block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova
  block: fix folio leak in bio_iov_iter_bounce_read()
  block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ
  drbd: always set BLK_FEAT_STABLE_WRITES
2026-02-17 08:48:45 -08:00
Chaitanya Kulkarni
699fcfb6cb md: ignore discard return value
__blkdev_issue_discard() always returns 0, making all error checking at
call sites dead code.

Simplify md to only check !discard_bio by ignoring the
__blkdev_issue_discard() value.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-12 04:23:53 -07:00
Linus Torvalds
61e629596f Merge tag 'for-7.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:

 - dm-verity:
     - various optimizations and fixes related to forward error correction
     - add a .dm-verity keyring

 - dm-integrity: fix bugs with growing a device in bitmap mode

 - dm-mpath:
     - fix leaking fake timeout requests
     - fix UAF bug caused by stale rq->bio
     - fix minor bugs in device creation

 - dm-core:
     - fix a bug related to blkg association
     - avoid unnecessary blk-crypto work on invalid keys

 - dm-bufio:
     - dm-bufio cleanup and optimization (reducing hash table lookups)

 - various other minor fixes and cleanups

* tag 'for-7.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits)
  dm mpath: make pg_init_delay_msecs settable
  Revert "dm: fix a race condition in retrieve_deps"
  dm mpath: Add missing dm_put_device when failing to get scsi dh name
  dm vdo encodings: clean up header and version functions
  dm: use bio_clone_blkg_association
  dm: fix excessive blk-crypto operations for invalid keys
  dm-verity: fix section mismatch error
  dm-unstripe: fix mapping bug when there are multiple targets in a table
  dm-integrity: fix recalculation in bitmap mode
  dm-bufio: avoid redundant buffer_tree lookups
  dm-bufio: merge cache_put() into cache_put_and_wake()
  selftests: add dm-verity keyring selftests
  dm-verity: add dm-verity keyring
  dm: clear cloned request bio pointer when last clone bio completes
  dm-verity: fix up various workqueue-related comments
  dm-verity: switch to bio_advance_iter_single()
  dm-verity: consolidate the BH and normal work structs
  dm: add WQ_PERCPU to alloc_workqueue users
  dm-integrity: fix a typo in the code for write/discard race
  dm: use READ_ONCE in dm_blk_report_zones
  ...
2026-02-11 17:04:21 -08:00
Heinz Mauelshagen
cefcb9297f md raid: fix hang when stopping arrays with metadata through dm-raid
When using device-mapper's dm-raid target, stopping a RAID array can cause
the system to hang under specific conditions.

This occurs when:

- A dm-raid managed device tree is suspended from top to bottom
   (the top-level RAID device is suspended first, followed by its
    underlying metadata and data devices)

- The top-level RAID device is then removed

Removing the top-level device triggers a hang in the following sequence:
the dm-raid destructor calls md_stop(), which tries to flush the
write-intent bitmap by writing to the metadata sub-devices. However, these
devices are already suspended, making them unable to complete the write-intent
operations and causing an indefinite block.

Fix:

- Prevent bitmap flushing when md_stop() is called from dm-raid
destructor context
  and avoid a quiescing/unquescing cycle which could also cause I/O

- Still allow write-intent bitmap flushing when called from dm-raid
suspend context

This ensures that RAID array teardown can complete successfully even when the
underlying devices are in a suspended state.

This second patch uses md_is_rdwr() to distinguish between suspend and
destructor paths as elaborated on above.

Link: https://lore.kernel.org/linux-raid/CAM23VxqYrwkhKEBeQrZeZwQudbiNey2_8B_SEOLqug=pXxaFrA@mail.gmail.com
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26 13:46:40 +08:00
Li Nan
5d1dd57929 md: remove recovery_disabled
'recovery_disabled' logic is complex and confusing, originally intended to
preserve raid in extreme scenarios. It was used in following cases:
- When sync fails and setting badblocks also fails, kick out non-In_sync
  rdev and block spare rdev from joining to preserve raid [1]
- When last backup is unavailable, prevent repeated add-remove of spares
  triggering recovery [2]

The original issues are now resolved:
- Error handlers in all raid types prevent last rdev from being kicked out
- Disks with failed recovery are marked Faulty and can't re-join

Therefore, remove 'recovery_disabled' as it's no longer needed.

[1] 5389042ffa ("md: change managed of recovery_disabled.")
[2] 4044ba58dd ("md: don't retry recovery of raid1 that fails due to error on source drive.")

Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-13-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26 13:17:38 +08:00
Li Nan
8ff59a7247 md: move finish_reshape to md_finish_sync()
finish_reshape implementations of raid10 and raid5 only update mddev
and rdev configurations. Move these operations to md_finish_sync() as
it is more appropriate.

No functional changes.

Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-10-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26 13:16:23 +08:00
Li Nan
6dd3aa08e8 md: factor out sync completion update into helper
Repeatedly reading 'mddev->recovery' flags in md_do_sync() may introduce
potential risk if this flag is modified during sync, leading to incorrect
offset updates. Therefore, replace direct 'mddev->recovery' checks with
'action'.

Move sync completion update logic into helper md_finish_sync(), which
improves readability and maintainability.

The reshape completion update remains safe as it only updated after
successful reshape when MD_RECOVERY_INTR is not set and 'curr_resync'
equals 'max_sectors'.

Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-9-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26 13:16:21 +08:00
Li Nan
af9c40ff5a md: remove MD_RECOVERY_ERROR handling and simplify resync_offset update
Following previous patch "md: update curr_resync_completed even when
MD_RECOVERY_INTR is set", 'curr_resync_completed' always equals
'curr_resync' for resync, so MD_RECOVERY_ERROR can be removed.

Also, simplify resync_offset update logic.

Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-8-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26 13:16:19 +08:00
Li Nan
cc0dab317a md: update curr_resync_completed even when MD_RECOVERY_INTR is set
An error sync IO may be done and sub 'recovery_active' while its
error handling work is pending. This work sets 'recovery_disabled'
and MD_RECOVERY_INTR, then later removes the bad disk without Faulty
flag. If 'curr_resync_completed' is updated before the disk is removed,
it could lead to reading from sync-failed regions.

With the previous patch, error IO will set badblocks or mark rdev as
Faulty, sync-failed regions are no longer readable. After waiting for
'recovery_active' to reach 0 (in the previous line), all sync IO has
*completed*, regardless of whether MD_RECOVERY_INTR is set. Thus, the
MD_RECOVERY_INTR check can be removed.

Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-7-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26 13:16:17 +08:00
Li Nan
fd4d44c14f md: mark rdev Faulty when badblocks setting fails
Currently when sync read fails and badblocks set fails (exceeding
512 limit), rdev isn't immediately marked Faulty. Instead
'recovery_disabled' is set and non-In_sync rdevs are removed later.
This preserves array availability if bad regions aren't read, but bad
sectors might be read by users before rdev removal. This occurs due
to incorrect resync/recovery_offset updates that include these bad
sectors.

When badblocks exceed 512, keeping the disk provides little benefit
while adding complexity. Prompt disk replacement is more important.
Therefore when badblocks set fails, directly call md_error to mark rdev
Faulty immediately, preventing potential data access issues.

After this change, cleanup of offset update logic and 'recovery_disabled'
handling will follow.

Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-6-linan666@huaweicloud.com
Fixes: 5e5702898e ("md/raid10: Handle read errors during recovery better.")
Fixes: 3a9f28a511 ("md/raid1: improve handling of read failure during recovery.")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26 13:16:15 +08:00
Li Nan
2a5d4549a2 md: factor error handling out of md_done_sync into helper
The 'ok' parameter in md_done_sync() is redundant for most callers that
always pass 'true'. Factor error handling logic into a separate helper
function md_sync_error() to eliminate unnecessary parameter passing and
improve code clarity.

No functional changes introduced.

Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-3-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26 13:16:07 +08:00
Yu Kuai
10787568cc md: merge mddev serialize_policy into mddev_flags
There is not need to use a separate field in struct mddev, there are no
functional changes.

Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-5-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-26 13:10:51 +08:00
Yu Kuai
4f6d2e648c md: merge mddev faillast_dev into mddev_flags
There is not need to use a separate field in struct mddev, there are no
functional changes.

Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-4-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-26 13:10:24 +08:00
Yu Kuai
fba4a98040 md: merge mddev has_superblock into mddev_flags
There is not need to use a separate field in struct mddev, there are no
functional changes.

Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-3-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-26 13:09:55 +08:00
Marco Crivellari
d488086867 dm: add WQ_PERCPU to alloc_workqueue users
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

   commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.

In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-01-14 13:16:00 +01:00
Li Nan
a4166f1c48 md: Fix forward incompatibility from configurable logical block size
Commit 62ed1b5822 ("md: allow configuring logical block size") used
reserved pad to add 'logical_block_size' to metadata. RAID rejects
non-zero reserved pad, so arrays fail when rolling back to old kernels
after booting new ones.

Set 'logical_block_size' only for newly created arrays to support rollback
to old kernels. Importantly new arrays still won't work on old kernels to
prevent data loss issue from LBS changes.

For arrays created on old kernels which confirmed not to rollback,
configure LBS by echo current LBS (queue/logical_block_size) to
md/logical_block_size.

Fixes: 62ed1b5822 ("md: allow configuring logical block size")
Reported-by: BugReports <bugreports61@gmail.com>
Closes: https://lore.kernel.org/linux-raid/825e532d-d1e1-44bb-5581-692b7c091796@huaweicloud.com/T/#t
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20251226024221.724201-2-linan666@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-27 10:14:07 +08:00
Li Nan
864466c38c md: Fix logical_block_size configuration being overwritten
In super_1_validate(), mddev->logical_block_size is directly overwritten
with the value from metadata. This causes the previously configured lbs
to be lost, making the configuration ineffective. Fix it.

Fixes: 62ed1b5822 ("md: allow configuring logical block size")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20251226024221.724201-1-linan666@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-27 10:13:40 +08:00
FengWei Shih
2cc583653b md: suspend array while updating raid_disks via sysfs
In raid1_reshape(), freeze_array() is called before modifying the r1bio
memory pool (conf->r1bio_pool) and conf->raid_disks, and
unfreeze_array() is called after the update is completed.

However, freeze_array() only waits until nr_sync_pending and
(nr_pending - nr_queued) of all buckets reaches zero. When an I/O error
occurs, nr_queued is increased and the corresponding r1bio is queued to
either retry_list or bio_end_io_list. As a result, freeze_array() may
unblock before these r1bios are released.

This can lead to a situation where conf->raid_disks and the mempool have
already been updated while queued r1bios, allocated with the old
raid_disks value, are later released. Consequently, free_r1bio() may
access memory out of bounds in put_all_bios() and release r1bios of the
wrong size to the new mempool, potentially causing issues with the
mempool as well.

Since only normal I/O might increase nr_queued while an I/O error occurs,
suspending the array avoids this issue.

Note: Updating raid_disks via ioctl SET_ARRAY_INFO already suspends
the array. Therefore, we suspend the array when updating raid_disks
via sysfs to avoid this issue too.

Signed-off-by: FengWei Shih <dannyshih@synology.com>
Link: https://lore.kernel.org/linux-raid/20251226101816.4506-1-dannyshih@synology.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-27 09:54:50 +08:00
Li Nan
00f6c1b4d1 md: Fix static checker warning in analyze_sbs
The following warn is reported:

 drivers/md/md.c:3912 analyze_sbs()
 warn: iterator 'i' not incremented

Fixes: d8730f0cf4 ("md: Remove deprecated CONFIG_MD_MULTIPATH")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/linux-raid/7e2e95ce-3740-09d8-a561-af6bfb767f18@huaweicloud.com/T/#t
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20251215124412.4015572-1-linan666@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-25 15:45:21 +08:00
Tarun Sahu
fdd0c6a649 md: remove legacy 1s delay in md_notify_reboot
During system shutdown, the md driver registered notifier function
(md_notify_reboot) currently imposes a hardcoded one-second delay.

This delay was introduced approximately 23 years ago and was likely
necessary for the hardware generation of that time. Proposing this
patch to make sure there are no known devices that need this delay.

Link: https://lore.kernel.org/linux-raid/20251121191422.2758555-1-tarunsahu@google.com
Signed-off-by: Tarun Sahu <tarunsahu@google.com>
Reviewed-by: Yu Kuai<yukuai@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-30 09:42:28 +08:00
Yu Kuai
8c9e376b9d md: warn about updating super block failure
Many personalities will handle IO error from daemon thread(like raid1d,
raid10d, raid5d), and sb will require to be clean before hanlding these
failed IO. However update sb can fail, for example array is broken by
IO failure, or user config sysfs api array_state.

This patch adds warning if updating sb failed first, in case this will
be related to IO hang.

Link: https://lore.kernel.org/linux-raid/20251117085557.770572-2-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-11-30 09:38:22 +08:00
Li Nan
62ed1b5822 md: allow configuring logical block size
Previously, raid array used the maximum logical block size (LBS)
of all member disks. Adding a larger LBS disk at runtime could
unexpectedly increase RAID's LBS, risking corruption of existing
partitions. This can be reproduced by:

```
  # LBS of sd[de] is 512 bytes, sdf is 4096 bytes.
  mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean

  # LBS is 512
  cat /sys/block/md0/queue/logical_block_size

  # create partition md0p1
  parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100%
  lsblk | grep md0p1

  # LBS becomes 4096 after adding sdf
  mdadm --add -q /dev/md0 /dev/sdf
  cat /sys/block/md0/queue/logical_block_size

  # partition lost
  partprobe /dev/md0
  lsblk | grep md0p1
```

Simply restricting larger-LBS disks is inflexible. In some scenarios,
only disks with 512 bytes LBS are available currently, but later, disks
with 4KB LBS may be added to the array.

Making LBS configurable is the best way to solve this scenario.
After this patch, the raid will:
  - store LBS in disk metadata
  - add a read-write sysfs 'mdX/logical_block_size'

Future mdadm should support setting LBS via metadata field during RAID
creation and the new sysfs. Though the kernel allows runtime LBS changes,
users should avoid modifying it after creating partitions or filesystems
to prevent compatibility issues.

Only 1.x metadata supports configurable LBS. 0.90 metadata inits all
fields to default values at auto-detect. Supporting 0.90 would require
more extensive changes and no such use case has been observed.

Note that many RAID paths rely on PAGE_SIZE alignment, including for
metadata I/O. A larger LBS than PAGE_SIZE will result in metadata
read/write failures. So this config should be prevented.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-6-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11 11:20:15 +08:00
Li Nan
9c47127a80 md: add check_new_feature module parameter
Raid checks if pad3 is zero when loading superblock from disk. Arrays
created with new features may fail to assemble on old kernels as pad3
is used.

Add module parameter check_new_feature to bypass this check.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-5-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11 11:19:54 +08:00
Li Nan
381a3ce1c0 md: init bioset in mddev_init
IO operations may be needed before md_run(), such as updating metadata
after writing sysfs. Without bioset, this triggers a NULL pointer
dereference as below:

 BUG: kernel NULL pointer dereference, address: 0000000000000020
 Call Trace:
  md_update_sb+0x658/0xe00
  new_level_store+0xc5/0x120
  md_attr_store+0xc9/0x1e0
  sysfs_kf_write+0x6f/0xa0
  kernfs_fop_write_iter+0x141/0x2a0
  vfs_write+0x1fc/0x5a0
  ksys_write+0x79/0x180
  __x64_sys_write+0x1d/0x30
  x64_sys_call+0x2818/0x2880
  do_syscall_64+0xa9/0x580
  entry_SYSCALL_64_after_hwframe+0x4b/0x53

Reproducer
```
  mdadm -CR /dev/md0 -l1 -n2 /dev/sd[cd]
  echo inactive > /sys/block/md0/md/array_state
  echo 10 > /sys/block/md0/md/new_level
```

mddev_init() can only be called once per mddev, no need to test if bioset
has been initialized anymore.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-3-linan666@huaweicloud.com
Fixes: d981ed8419 ("md: Add new_level sysfs interface")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11 11:19:10 +08:00
Li Nan
0ce112d917 md: delete md_redundancy_group when array is becoming inactive
'md_redundancy_group' are created in md_run() and deleted in del_gendisk(),
but these are not paired. Writing inactive/active to sysfs array_state can
trigger md_run() multiple times without del_gendisk(), leading to
duplicate creation as below:

 sysfs: cannot create duplicate filename '/devices/virtual/block/md0/md/sync_action'
 Call Trace:
  dump_stack_lvl+0x9f/0x120
  dump_stack+0x14/0x20
  sysfs_warn_dup+0x96/0xc0
  sysfs_add_file_mode_ns+0x19c/0x1b0
  internal_create_group+0x213/0x830
  sysfs_create_group+0x17/0x20
  md_run+0x856/0xe60
  ? __x64_sys_openat+0x23/0x30
  do_md_run+0x26/0x1d0
  array_state_store+0x559/0x760
  md_attr_store+0xc9/0x1e0
  sysfs_kf_write+0x6f/0xa0
  kernfs_fop_write_iter+0x141/0x2a0
  vfs_write+0x1fc/0x5a0
  ksys_write+0x79/0x180
  __x64_sys_write+0x1d/0x30
  x64_sys_call+0x2818/0x2880
  do_syscall_64+0xa9/0x580
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
 md: cannot register extra attributes for md0

Creation of it depends on 'pers', its lifecycle cannot be aligned with
gendisk. So fix this issue by triggering 'md_redundancy_group' deletion
when the array is becoming inactive.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-2-linan666@huaweicloud.com
Fixes: 790abe4d77 ("md: remove/add redundancy group only in level change")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11 11:18:51 +08:00
Li Nan
6c6b66f65e md: prevent adding disks with larger logical_block_size to active arrays
When adding a disk to a md array, avoid updating the array's
logical_block_size to match the new disk. This prevents accidental
partition table loss that renders the array unusable.

The later patch will introduce a way to configure the array's
logical_block_size.

The issue was introduced before Linux 2.6.12-rc2.

Link: https://lore.kernel.org/linux-raid/20250918115759.334067-2-linan666@huaweicloud.com/
Fixes: d2e45eace8 ("[PATCH] Fix raid "bio too big" failures")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11 11:17:33 +08:00
Xiao Ni
90e3bb44c0 md: avoid repeated calls to del_gendisk
There is a uaf problem which is found by case 23rdev-lifetime:

Oops: general protection fault, probably for non-canonical address 0xdead000000000122
RIP: 0010:bdi_unregister+0x4b/0x170
Call Trace:
 <TASK>
 __del_gendisk+0x356/0x3e0
 mddev_unlock+0x351/0x360
 rdev_attr_store+0x217/0x280
 kernfs_fop_write_iter+0x14a/0x210
 vfs_write+0x29e/0x550
 ksys_write+0x74/0xf0
 do_syscall_64+0xbb/0x380
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff5250a177e

The sequence is:
1. rdev remove path gets reconfig_mutex
2. rdev remove path release reconfig_mutex in mddev_unlock
3. md stop calls do_md_stop and sets MD_DELETED
4. rdev remove path calls del_gendisk because MD_DELETED is set
5. md stop path release reconfig_mutex and calls del_gendisk again

So there is a race condition we should resolve. This patch adds a
flag MD_DO_DELETE to avoid the race condition.

Link: https://lore.kernel.org/linux-raid/20251029063419.21700-1-xni@redhat.com
Fixes: 9e59d60976 ("md: call del_gendisk in control path")
Signed-off-by: Xiao Ni <xni@redhat.com>
Suggested-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08 17:49:22 +08:00
Wu Guanghao
082d680faf Factor out code into md_should_do_recovery()
In md_check_recovery(), use new helper to make code cleaner.

Link: https://lore.kernel.org/linux-raid/e62894c8-d916-94bc-ef48-3c60e6e1fc5d@huawei.com
Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08 16:54:54 +08:00
Yun Zhou
0dc7620554 md: fix rcu protection in md_wakeup_thread
We attempted to use RCU to protect the pointer 'thread', but directly
passed the value when calling md_wakeup_thread(). This means that the
RCU pointer has been acquired before rcu_read_lock(), which renders
rcu_read_lock() ineffective and could lead to a use-after-free.

Link: https://lore.kernel.org/linux-raid/20251015083227.1079009-1-yun.zhou@windriver.com
Fixes: 4469315439 ("md: protect md_thread with rcu")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08 16:54:36 +08:00
Xiao Ni
cc394b94dc md: delete mddev kobj before deleting gendisk kobj
In sync del gendisk path, it deletes gendisk first and the directory
/sys/block/md is removed. Then it releases mddev kobj in a delayed work.
If we enable debug log in sysfs_remove_group, we can see the debug log
'sysfs group bitmap not found for kobject md'. It's the reason that the
parent kobj has been deleted, so it can't find parent directory.

In creating path, it allocs gendisk first, then adds mddev kobj. So it
should delete mddev kobj before deleting gendisk.

Before commit 9e59d60976 ("md: call del_gendisk in control path"), it
releases mddev kobj first. If the kobj hasn't been deleted, it does clean
job and deletes the kobj. Then it calls del_gendisk and releases gendisk
kobj. So it doesn't need to call kobject_del to delete mddev kobj. After
this patch, in sync del gendisk path, the sequence changes. So it needs
to call kobject_del to delete mddev kobj.

After this patch, the sequence is:
1. kobject del mddev kobj
2. del_gendisk deletes gendisk kobj
3. mddev_delayed_delete releases mddev kobj
4. md_kobj_release releases gendisk kobj

Link: https://lore.kernel.org/linux-raid/20250928012424.61370-1-xni@redhat.com
Fixes: 9e59d60976 ("md: call del_gendisk in control path")
Signed-off-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08 16:53:55 +08:00
Linus Torvalds
e1b1d03cee Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
     - FC target fixes (Daniel)
     - Authentication fixes and updates (Martin, Chris)
     - Admin controller handling (Kamaljit)
     - Target lockdep assertions (Max)
     - Keep-alive updates for discovery (Alastair)
     - Suspend quirk (Georg)

 - MD pull request via Yu:
     - Add support for a lockless bitmap.

       A key feature for the new bitmap are that the IO fastpath is
       lockless. If a user issues lots of write IO to the same bitmap
       bit in a short time, only the first write has additional overhead
       to update bitmap bit, no additional overhead for the following
       writes.

       By supporting only resync or recover written data, means in the
       case creating new array or replacing with a new disk, there is no
       need to do a full disk resync/recovery.

 - Switch ->getgeo() and ->bios_param() to using struct gendisk rather
   than struct block_device.

 - Rust block changes via Andreas. This series adds configuration via
   configfs and remote completion to the rnull driver. The series also
   includes a set of changes to the rust block device driver API: a few
   cleanup patches, and a few features supporting the rnull changes.

   The series removes the raw buffer formatting logic from
   `kernel::block` and improves the logic available in `kernel::string`
   to support the same use as the removed logic.

 - floppy arch cleanups

 - Reduce the number of dereferencing needed for ublk commands

 - Restrict supported sockets for nbd. Mostly done to eliminate a class
   of issues perpetually reported by syzbot, by using nonsensical socket
   setups.

 - A few s390 dasd block fixes

 - Fix a few issues around atomic writes

 - Improve DMA interation for integrity requests

 - Improve how iovecs are treated with regards to O_DIRECT aligment
   constraints.

   We used to require each segment to adhere to the constraints, now
   only the request as a whole needs to.

 - Clean up and improve p2p support, enabling use of p2p for metadata
   payloads

 - Improve locking of request lookup, using SRCU where appropriate

 - Use page references properly for brd, avoiding very long RCU sections

 - Fix ordering of recursively submitted IOs

 - Clean up and improve updating nr_requests for a live device

 - Various fixes and cleanups

* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
  s390/dasd: enforce dma_alignment to ensure proper buffer validation
  s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
  ublk: remove redundant zone op check in ublk_setup_iod()
  nvme: Use non zero KATO for persistent discovery connections
  nvmet: add safety check for subsys lock
  nvme-core: use nvme_is_io_ctrl() for I/O controller check
  nvme-core: do ioccsz/iorcsz validation only for I/O controllers
  nvme-core: add method to check for an I/O controller
  blk-cgroup: fix possible deadlock while configuring policy
  blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
  blk-mq: Fix more tag iteration function documentation
  selftests: ublk: fix behavior when fio is not installed
  ublk: don't access ublk_queue in ublk_unmap_io()
  ublk: pass ublk_io to __ublk_complete_rq()
  ublk: don't access ublk_queue in ublk_need_complete_req()
  ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
  ublk: don't pass ublk_queue to ublk_fetch()
  ublk: don't access ublk_queue in ublk_config_io_buf()
  ublk: don't access ublk_queue in ublk_check_fetch_buf()
  ublk: pass q_id and tag to __ublk_check_and_get_req()
  ...
2025-10-02 10:16:56 -07:00
Yu Kuai
5ab829f197 md/md-llbitmap: introduce new lockless bitmap
Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data vary depending on the RAID levels. And it's
important to maintain the consistency of redundant data.

Bitmap is used to record which data blocks have been synchronized and which
ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the multiple redundant copies of that data segment may not be
consistent. Data synchronization can be performed based on the bitmap after
power failure or readding a disk. If there is no bitmap, a full disk
synchronization is required.

Due to known performance issues with md-bitmap and the unreasonable
implementations:

 - self-managed IO submitting like filemap_write_page();
 - global spin_lock

I have decided not to continue optimizing based on the current bitmap
implementation, this new bitmap is invented without locking from IO fast
path and can be used with fast disks.

For designs and details, see the comments in drivers/md-llbitmap.c.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-12-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:27:51 +08:00
Yu Kuai
66be318e66 md/md-bitmap: make method bitmap_ops->daemon_work optional
daemon_work() will be called by daemon thread, on the one hand, daemon
thread doesn't have strict wake-up time; on the other hand, too much
work are put to daemon thread, like handle sync IO, handle failed
or specail normal IO, handle recovery, and so on. Hence daemon thread
may be too busy to clear dirty bits in time.

Make bitmap_ops->daemon_work() optional and following patches will use
separate async work to clear dirty bits for the new bitmap.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-11-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:20:49 +08:00
Yu Kuai
c951ccf0bf md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
This flag is used by llbitmap in later patches to skip raid456 initial
recover and delay building initial xor data to first write.

https://lore.kernel.org/linux-raid/20250829080426.1441678-10-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-09-06 17:20:32 +08:00
Yu Kuai
f196d72888 md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
This method is used to check if blocks can be skipped before calling
into pers->sync_request(), llbitmap will use this method to skip
resync for unwritten/clean data blocks, and recovery/check/repair for
unwritten data blocks;

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:19:45 +08:00
Yu Kuai
fb8cc3b0d9 md/md-bitmap: delay registration of bitmap_ops until creating bitmap
Currently bitmap_ops is registered while allocating mddev, this is fine
when there is only one bitmap_ops.

Delay setting bitmap_ops until creating bitmap, so that user can choose
which bitmap to use before running the array.

Link: https://lore.kernel.org/linux-raid/20250721171557.34587-7-yukuai@kernel.org
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:19:19 +08:00
Yu Kuai
180b82c1c7 md/md-bitmap: add a new sysfs api bitmap_type
The api will be used by mdadm to set bitmap_type while creating new array
or assembling array, prepare to add a new bitmap.

Currently available options are:

cat /sys/block/md0/md/bitmap_type
none [bitmap]

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:19:04 +08:00
Yu Kuai
300bffa870 md: add a new mddev field 'bitmap_id'
Prepare to store the bitmap id selected by user, also refactor
mddev_set_bitmap_ops a bit in case the value is invalid.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-5-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:18:48 +08:00
Yu Kuai
ac9dad8faa md/md-bitmap: support discard for bitmap ops
Use two new methods {start, end}_discard in bitmap_ops and a new field 'rw'
in struct md_io_clone to handle discard IO, prepare to support new md
bitmap.

Since all bitmap functions to hanlde write IO are the same, also add
typedef to make code cleaner.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-4-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:18:19 +08:00
Yu Kuai
7797da149d md: factor out a helper raid_is_456()
There are no functional changes, the helper will be used by llbitmap in
following patches.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-3-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:17:58 +08:00
Yu Kuai
d01acbce39 md: add a new parameter 'offset' to md_super_write()
The parameter is always set to 0 for now, following patches will use
this helper to write llbitmap to underlying disks, allow writing
dirty sectors instead of the whole page.

Also rename md_super_write to md_write_metadata since there is nothing
super-block specific.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-2-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:17:26 +08:00