linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-18 06:44:00 -04:00

Author	SHA1	Message	Date
Xiao Ni	808cec7460	md/raid1: serialize overlap io for writemostly disk Previously, using wait_event() would wake up all waiters simultaneously, and they would compete for the tree lock. The bio which gets the lock first will be handled, so the write sequence cannot be guaranteed. For example: bio1(100,200) bio2(150,200) bio3(150,300) The write sequence of fast device is bio1,bio2,bio3. But the write sequence of slow device could be bio1,bio3,bio2 due to lock competition. This causes data corruption. Replace waitqueue with a fifo list to guarantee the write sequence. And it also needs to iterate the list when removing one entry. If not, it may miss the opportunity to wake up the waiting io. For example: bio1(1,3), bio2(2,4) bio3(5,7), bio4(6,8) These four bios are in the same bucket. bio1 and bio3 are inserted into the rbtree. bio2 and bio4 are added to the waiting list and bio2 is the first one. bio3 returns from slow disk and tries to wake up the waiting bios. bio2 is removed from the list and will be handled. But bio1 hasn't finished. So bio2 will be added into waiting list again. Then bio1 returns from slow disk and wakes up waiting bios. bio4 is removed from the list and will be handled. Now bio1, bio3 and bio4 all finish and bio2 is left on the waiting list. So it needs to iterate the waiting list to wake up the right bio. Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20260324072501.59865-1-xni@redhat.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-04-07 13:09:22 +08:00
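The second example in the commit above (bio1 to bio4 in one bucket) can be traced with a small userspace model. This is only a sketch under stated assumptions: `struct io`, `struct bucket`, `submit_io()` and `complete_io()` are hypothetical names, a plain array stands in for the rbtree, ranges are treated as inclusive `[lo, hi]`, and the rule that a waiter may not pass an earlier overlapping waiter is an assumption made here to keep overlapping writes in order. It is not the raid1 code.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_IO 16   /* capacity is not checked in this sketch */

struct io { int id; long lo, hi; };   /* assumed inclusive range [lo, hi] */

struct bucket {
	struct io active[MAX_IO];   size_t nr_active;   /* stands in for the rbtree */
	struct io waiting[MAX_IO];  size_t nr_waiting;  /* FIFO, index 0 is oldest */
};

static bool overlaps(const struct io *a, const struct io *b)
{
	return a->lo <= b->hi && b->lo <= a->hi;
}

static bool conflicts(const struct io *set, size_t n, const struct io *io)
{
	for (size_t i = 0; i < n; i++)
		if (overlaps(&set[i], io))
			return true;
	return false;
}

/* Returns true if the io starts immediately, false if it is queued. */
static bool submit_io(struct bucket *b, struct io io)
{
	if (!conflicts(b->active, b->nr_active, &io) &&
	    !conflicts(b->waiting, b->nr_waiting, &io)) {
		b->active[b->nr_active++] = io;
		return true;
	}
	b->waiting[b->nr_waiting++] = io;
	return false;
}

/* Remove a finished io, then walk the WHOLE waiting list in FIFO order. */
static void complete_io(struct bucket *b, int id)
{
	size_t kept = 0;

	for (size_t i = 0; i < b->nr_active; i++) {
		if (b->active[i].id == id) {
			b->active[i] = b->active[--b->nr_active];
			break;
		}
	}

	for (size_t i = 0; i < b->nr_waiting; i++) {
		struct io w = b->waiting[i];

		if (conflicts(b->active, b->nr_active, &w) ||
		    conflicts(b->waiting, kept, &w))
			b->waiting[kept++] = w;          /* still blocked, keeps its place */
		else
			b->active[b->nr_active++] = w;   /* woken and started */
	}
	b->nr_waiting = kept;
}
```

With bio1(1,3), bio2(2,4), bio3(5,7) and bio4(6,8) submitted in that order, completing bio3 leaves bio2 queued but starts bio4, and completing bio1 then starts bio2, so no waiter is stranded on the list.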
Xiao Ni	05c8de4f09	md: fix return value of mddev_trylock A return value of 0 is treated as successful lock acquisition. In fact, a return value of 1 means getting the lock successfully. Link: https://lore.kernel.org/linux-raid/20260127073951.17248-1-xni@redhat.com Fixes: `9e59d60976` ("md: call del_gendisk in control path") Reported-by: Bart Van Assche <bvanassche@acm.org> Closes: https://lore.kernel.org/linux-raid/20250611073108.25463-1-xni@redhat.com/T/#mfa369ef5faa4aa58e13e6d9fdb88aecd862b8f2f Signed-off-by: Xiao Ni <xni@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-02-02 15:39:55 +08:00
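The return-value mix-up described in the commit above can be illustrated with a toy lock. This is a minimal sketch, assuming a trylock that returns 1 on success and 0 on failure as the commit message states; `toy_lock`, `toy_trylock()`, `enter_buggy()` and `enter_fixed()` are hypothetical names and not md functions.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_lock { atomic_flag held; };

/* Returns 1 if the lock was acquired, 0 if it was already held. */
static int toy_trylock(struct toy_lock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/* Buggy caller: treats 0 as "got the lock", so it proceeds without it. */
static bool enter_buggy(struct toy_lock *l)
{
	return toy_trylock(l) == 0;
}

/* Fixed caller: only proceeds when trylock reports 1. */
static bool enter_fixed(struct toy_lock *l)
{
	return toy_trylock(l) == 1;
}
```

With the lock already held, `enter_buggy()` reports success even though nothing was acquired, which is the kind of inverted check the fix addresses.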
Li Nan	5d1dd57929	md: remove recovery_disabled 'recovery_disabled' logic is complex and confusing, originally intended to preserve raid in extreme scenarios. It was used in following cases: - When sync fails and setting badblocks also fails, kick out non-In_sync rdev and block spare rdev from joining to preserve raid [1] - When last backup is unavailable, prevent repeated add-remove of spares triggering recovery [2] The original issues are now resolved: - Error handlers in all raid types prevent last rdev from being kicked out - Disks with failed recovery are marked Faulty and can't re-join Therefore, remove 'recovery_disabled' as it's no longer needed. [1] `5389042ffa` ("md: change managed of recovery_disabled.") [2] `4044ba58dd` ("md: don't retry recovery of raid1 that fails due to error on source drive.") Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-13-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-01-26 13:17:38 +08:00
Li Nan	af9c40ff5a	md: remove MD_RECOVERY_ERROR handling and simplify resync_offset update Following previous patch "md: update curr_resync_completed even when MD_RECOVERY_INTR is set", 'curr_resync_completed' always equals 'curr_resync' for resync, so MD_RECOVERY_ERROR can be removed. Also, simplify resync_offset update logic. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-8-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-01-26 13:16:19 +08:00
Li Nan	2a5d4549a2	md: factor error handling out of md_done_sync into helper The 'ok' parameter in md_done_sync() is redundant for most callers that always pass 'true'. Factor error handling logic into a separate helper function md_sync_error() to eliminate unnecessary parameter passing and improve code clarity. No functional changes introduced. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-3-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2026-01-26 13:16:07 +08:00
Yu Kuai	9340a95d48	md/raid5: use mempool to allocate stripe_request_ctx On the one hand, stripe_request_ctx is 72 bytes, which is rather large for a stack variable. On the other hand, the bitmap sectors_to_do has a fixed size, so max_hw_sector_kb of a raid5 array is at most 256 * 4k = 1MB, which makes full stripe IO impossible for arrays whose chunk_size * data_disks is bigger. Allocating ctx at runtime makes it possible to get rid of this limit. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-6-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>	2026-01-26 13:11:29 +08:00
Yu Kuai	10787568cc	md: merge mddev serialize_policy into mddev_flags There is no need to use a separate field in struct mddev; there are no functional changes. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-5-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>	2026-01-26 13:10:51 +08:00
Yu Kuai	4f6d2e648c	md: merge mddev faillast_dev into mddev_flags There is no need to use a separate field in struct mddev; there are no functional changes. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-4-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>	2026-01-26 13:10:24 +08:00
Yu Kuai	fba4a98040	md: merge mddev has_superblock into mddev_flags There is no need to use a separate field in struct mddev; there are no functional changes. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-3-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>	2026-01-26 13:09:55 +08:00
Li Nan	62ed1b5822	md: allow configuring logical block size Previously, raid array used the maximum logical block size (LBS) of all member disks. Adding a larger LBS disk at runtime could unexpectedly increase RAID's LBS, risking corruption of existing partitions. This can be reproduced by: ``` # LBS of sd[de] is 512 bytes, sdf is 4096 bytes. mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean # LBS is 512 cat /sys/block/md0/queue/logical_block_size # create partition md0p1 parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100% lsblk \| grep md0p1 # LBS becomes 4096 after adding sdf mdadm --add -q /dev/md0 /dev/sdf cat /sys/block/md0/queue/logical_block_size # partition lost partprobe /dev/md0 lsblk \| grep md0p1 ``` Simply restricting larger-LBS disks is inflexible. In some scenarios, only disks with 512 bytes LBS are available currently, but later, disks with 4KB LBS may be added to the array. Making LBS configurable is the best way to solve this scenario. After this patch, the raid will: - store LBS in disk metadata - add a read-write sysfs 'mdX/logical_block_size' Future mdadm should support setting LBS via metadata field during RAID creation and the new sysfs. Though the kernel allows runtime LBS changes, users should avoid modifying it after creating partitions or filesystems to prevent compatibility issues. Only 1.x metadata supports configurable LBS. 0.90 metadata inits all fields to default values at auto-detect. Supporting 0.90 would require more extensive changes and no such use case has been observed. Note that many RAID paths rely on PAGE_SIZE alignment, including for metadata I/O. A larger LBS than PAGE_SIZE will result in metadata read/write failures. So this config should be prevented. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-6-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2025-11-11 11:20:15 +08:00
Xiao Ni	90e3bb44c0	md: avoid repeated calls to del_gendisk There is a uaf problem which is found by case 23rdev-lifetime: Oops: general protection fault, probably for non-canonical address 0xdead000000000122 RIP: 0010:bdi_unregister+0x4b/0x170 Call Trace: <TASK> __del_gendisk+0x356/0x3e0 mddev_unlock+0x351/0x360 rdev_attr_store+0x217/0x280 kernfs_fop_write_iter+0x14a/0x210 vfs_write+0x29e/0x550 ksys_write+0x74/0xf0 do_syscall_64+0xbb/0x380 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff5250a177e The sequence is: 1. rdev remove path gets reconfig_mutex 2. rdev remove path release reconfig_mutex in mddev_unlock 3. md stop calls do_md_stop and sets MD_DELETED 4. rdev remove path calls del_gendisk because MD_DELETED is set 5. md stop path release reconfig_mutex and calls del_gendisk again So there is a race condition we should resolve. This patch adds a flag MD_DO_DELETE to avoid the race condition. Link: https://lore.kernel.org/linux-raid/20251029063419.21700-1-xni@redhat.com Fixes: `9e59d60976` ("md: call del_gendisk in control path") Signed-off-by: Xiao Ni <xni@redhat.com> Suggested-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2025-11-08 17:49:22 +08:00
Yun Zhou	0dc7620554	md: fix rcu protection in md_wakeup_thread We attempted to use RCU to protect the pointer 'thread', but directly passed the value when calling md_wakeup_thread(). This means that the RCU pointer has been acquired before rcu_read_lock(), which renders rcu_read_lock() ineffective and could lead to a use-after-free. Link: https://lore.kernel.org/linux-raid/20251015083227.1079009-1-yun.zhou@windriver.com Fixes: `4469315439` ("md: protect md_thread with rcu") Signed-off-by: Yun Zhou <yun.zhou@windriver.com> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>	2025-11-08 16:54:36 +08:00
Yu Kuai	5ab829f197	md/md-llbitmap: introduce new lockless bitmap Redundant data is used to enhance data fault tolerance, and the storage method for redundant data vary depending on the RAID levels. And it's important to maintain the consistency of redundant data. Bitmap is used to record which data blocks have been synchronized and which ones need to be resynchronized or recovered. Each bit in the bitmap represents a segment of data in the array. When a bit is set, it indicates that the multiple redundant copies of that data segment may not be consistent. Data synchronization can be performed based on the bitmap after power failure or readding a disk. If there is no bitmap, a full disk synchronization is required. Due to known performance issues with md-bitmap and the unreasonable implementations: - self-managed IO submitting like filemap_write_page(); - global spin_lock I have decided not to continue optimizing based on the current bitmap implementation, this new bitmap is invented without locking from IO fast path and can be used with fast disks. For designs and details, see the comments in drivers/md-llbitmap.c. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-12-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Li Nan <linan122@huawei.com>	2025-09-06 17:27:51 +08:00
Yu Kuai	c951ccf0bf	md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER This flag is used by llbitmap in later patches to skip raid456 initial recover and delay building initial xor data to first write. https://lore.kernel.org/linux-raid/20250829080426.1441678-10-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-09-06 17:20:32 +08:00
Yu Kuai	300bffa870	md: add a new mddev field 'bitmap_id' Prepare to store the bitmap id selected by user, also refactor mddev_set_bitmap_ops a bit in case the value is invalid. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-5-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-09-06 17:18:48 +08:00
Yu Kuai	ac9dad8faa	md/md-bitmap: support discard for bitmap ops Use two new methods {start, end}_discard in bitmap_ops and a new field 'rw' in struct md_io_clone to handle discard IO, preparing to support the new md bitmap. Since all bitmap functions to handle write IO are the same, also add a typedef to make the code cleaner. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-4-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com>	2025-09-06 17:18:19 +08:00
Yu Kuai	7797da149d	md: factor out a helper raid_is_456() There are no functional changes, the helper will be used by llbitmap in following patches. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-3-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com>	2025-09-06 17:17:58 +08:00
Yu Kuai	d01acbce39	md: add a new parameter 'offset' to md_super_write() The parameter is always set to 0 for now, following patches will use this helper to write llbitmap to underlying disks, allow writing dirty sectors instead of the whole page. Also rename md_super_write to md_write_metadata since there is nothing super-block specific. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-2-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com>	2025-09-06 17:17:26 +08:00
Yu Kuai	c27474ac1d	md/md-bitmap: introduce CONFIG_MD_BITMAP Now that all implementations are internal, it's sensible to add a config option for md-bitmap, and it's a good way for isolation. Link: https://lore.kernel.org/linux-raid/20250707012711.376844-16-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-09-06 17:12:22 +08:00
Yu Kuai	9307dbac0e	md/md-bitmap: merge md_bitmap_group into bitmap_operations Now that all bitmap implementations are internal, it doesn't make sense to export md_bitmap_group anymore. Link: https://lore.kernel.org/linux-raid/20250707012711.376844-5-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-09-06 17:11:40 +08:00
Li Nan	907a99c314	md: rename recovery_cp to resync_offset 'recovery_cp' was used to represent the progress of sync, but its name contains recovery, which can cause confusion. Replaces 'recovery_cp' with 'resync_offset' for clarity. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20250722033340.1933388-1-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-31 01:26:04 +08:00
Xiao Ni	9e59d60976	md: call del_gendisk in control path Now del_gendisk and put_disk are called asynchronously in workqueue work. The asynchronous way has a problem that the device node can still exist after mdadm --stop command returns in a short window. So udev rule can open this device node and create the struct mddev in kernel again. So put del_gendisk in control path and still leave put_disk in md_kobj_release to avoid uaf of gendisk. Function del_gendisk can't be called with reconfig_mutex. If it's called with reconfig mutex, a deadlock can happen. del_gendisk waits all sysfs files access to finish and sysfs file access waits reconfig mutex. So put del_gendisk after releasing reconfig mutex. But there is still a window that sysfs can be accessed between mddev_unlock and del_gendisk. So some actions (add disk, change level, .e.g) can happen which lead unexpected results. MD_DELETED is used to resolve this problem. MD_DELETED is set before releasing reconfig mutex and it should be checked for these sysfs access which need reconfig mutex. For sysfs access which don't need reconfig mutex, del_gendisk will wait them to finish. But it doesn't need to do this in function mddev_lock_nointr. There are ten places that call it. * Five of them are in dm raid which we don't need to care. MD_DELETED is only used for md raid. * stop_sync_thread, md_do_sync and md_start_sync are related sync request, and it needs to wait sync thread to finish before stopping an array. * md_ioctl: md_open is called before md_ioctl, so ->openers is added. It will fail to stop the array. So it doesn't need to check MD_DELETED here * md_set_readonly: It needs to call mddev_set_closing_and_sync_blockdev when setting readonly or read_auto. So it will fail to stop the array too because MD_CLOSING is already set. 
Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20250611073108.25463-2-xni@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-12 17:51:54 +08:00
Yu Kuai	752d0464b7	md: clean up accounting for issued sync IO It's no longer used and can be removed, also remove the field 'gendisk->sync_io'. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-10-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-05-10 16:14:22 +08:00
Yu Kuai	e5797ae703	md: fix is_mddev_idle() If sync_speed is above speed_min, then is_mddev_idle() will be called for each sync IO to check if the array is idle, and inflight sync_io will be limited if the array is not idle. However, while mkfs.ext4 for a large raid5 array while recovery is in progress, it's found that sync_speed is already above speed_min while lots of stripes are used for sync IO, causing long delay for mkfs.ext4. Root cause is the following checking from is_mddev_idle(): t1: submit sync IO: events1 = completed IO - issued sync IO t2: submit next sync IO: events2 = completed IO - issued sync IO if (events2 - events1 > 64) For consequence, the more sync IO issued, the less likely checking will pass. And when completed normal IO is more than issued sync IO, the condition will finally pass and is_mddev_idle() will return false, however, last_events will be updated hence is_mddev_idle() can only return false once in a while. Fix this problem by changing the checking as following: 1) mddev doesn't have normal IO completed; 2) mddev doesn't have normal IO inflight; 3) if any member disks is partition, and all other partitions doesn't have IO completed. Also change rdev->last_events to unsigned long to cleanup type casting. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-9-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-05-10 16:13:31 +08:00
Yu Kuai	03720d82d7	md: add a new api sync_io_depth Currently if sync speed is above speed_min and below speed_max, md_do_sync() will wait for all sync IOs to be done before issuing new sync IO, means sync IO depth is limited to just 1. This limit is too low, in order to prevent sync speed drop conspicuously after fixing is_mddev_idle() in the next patch, add a new api for limiting sync IO depth, the default value is 32. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-8-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-05-10 16:12:52 +08:00
Yu Kuai	7168be3c8a	md: record dm-raid gendisk in mddev Following patch will use gendisk to check if there are normal IO completed or inflight, to fix a problem in mdraid that foreground IO can be starved by background sync IO in later patches. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-7-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-05-10 16:12:19 +08:00
Jens Axboe	017ff379b6	Merge tag 'md-6.15-20250312' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.15/block Merge MD changes from Yu: "- fix recovery can preempt resync (Li Nan) - fix md-bitmap IO limit (Su Yue) - fix raid10 discard with REQ_NOWAIT (Xiao Ni) - fix raid1 memory leak (Zheng Qixing) - fix mddev uaf (Yu Kuai) - fix raid1,raid10 IO flags (Yu Kuai) - some refactor and cleanup (Yu Kuai)" * tag 'md-6.15-20250312' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md/raid10: wait barrier before returning discard request with REQ_NOWAIT md/md-bitmap: fix wrong bitmap_limit for clustermd when write sb md/raid1,raid10: don't ignore IO flags md/raid5: merge reshape_progress checking inside get_reshape_loc() md: fix mddev uaf while iterating all_mddevs list md: switch md-cluster to use md_submodle_head md: don't export md_cluster_ops md/md-cluster: cleanup md_cluster_ops reference md: switch personalities to use md_submodule_head md: introduce struct md_submodule_head and APIs md: only include md-cluster.h if necessary md: merge common code into find_pers() md/raid1: fix memory leak in raid1_run() if no active rdev md: ensure resync is prioritized over recovery	2025-03-13 05:34:51 -06:00
Zheng Qixing	d301f164c3	badblocks: use sector_t instead of int to avoid truncation of badblocks length There is a truncation of badblocks length issue when set badblocks as follow: echo "2055 4294967299" > bad_blocks cat bad_blocks 2055 3 Change 'sectors' argument type from 'int' to 'sector_t'. This change avoids truncation of badblocks length for large sectors by replacing 'int' with 'sector_t' (u64), enabling proper handling of larger disk sizes and ensuring compatibility with 64-bit sector addressing. Fixes: `9e0e252a04` ("badblocks: Add core badblock management code") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:04:52 -07:00
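The truncation in the commit above follows directly from the arithmetic: 4294967299 is 2^32 + 3, so only the low 32 bits (3) survive when the length passes through a 32-bit `int`. A minimal sketch, assuming a 32-bit `int`; `len_as_int()` and `len_as_sector_t()` are hypothetical helpers, and `uint64_t` stands in for `sector_t`:

```c
#include <stdint.h>

/*
 * Old behaviour (sketch): the length passes through a 32-bit 'int'.
 * Going through uint32_t first makes the modulo-2^32 reduction explicit.
 */
static int len_as_int(uint64_t sectors)
{
	return (int)(uint32_t)sectors;
}

/* New behaviour (sketch): the full 64-bit length is preserved. */
static uint64_t len_as_sector_t(uint64_t sectors)
{
	return sectors;
}
```

Calling `len_as_int(4294967299ULL)` yields 3, matching the `2055 3` shown by `cat bad_blocks` in the report, while `len_as_sector_t()` keeps the original value.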
Zheng Qixing	7e5102dd99	md: improve return types of badblocks handling functions rdev_set_badblocks() only indicates success/failure, so convert its return type from int to boolean for better semantic clarity. rdev_clear_badblocks() return value is never used by any caller, convert it to void. This removes unnecessary value returns. Also update narrow_write_error() in both raid1 and raid10 to use boolean return type to match rdev_set_badblocks(). Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-12-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:03:28 -07:00
Yu Kuai	87a86277c9	md: switch md-cluster to use md_submodle_head To make code cleaner, and prepare to add kconfig for bitmap. Also remove the unused global variables pers_lock, md_cluster_ops and md_cluster_mod, and exported symbols register_md_cluster_operations(), unregister_md_cluster_operations() and md_cluster_ops. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-8-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>	2025-03-05 00:28:39 +08:00
Yu Kuai	c594de0455	md: don't export md_cluster_ops Add a new field 'cluster_ops' and initialize it in md_setup_cluster(), so that the global variable 'md_cluster_ops' doesn't need to be exported. Also prepare to switch md-cluster to use md_submodule_head. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-7-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>	2025-03-05 00:28:17 +08:00
Yu Kuai	3d44e1d157	md: switch personalities to use md_submodule_head Remove the global list 'pers_list', and switch to use md_submodule_head, which is managed by xarray. Prepare to unify registration and unregistration for all sub modules. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-5-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-03-05 00:27:20 +08:00
Yu Kuai	d3beb7c9c6	md: introduce struct md_submodule_head and APIs Prepare to unify registration and unregistration of md personalities and md-cluster, also prepare for add kconfig for md-bitmap. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-4-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-03-05 00:26:56 +08:00
Yu Kuai	bf0a73264f	md: only include md-cluster.h if necessary md-cluster is only supported by raid1 and raid10, there is no need to include md-cluster.h for other personalities. Also move APIs that are only used in md-cluster.c from md.h to md-cluster.h. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-3-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>	2025-03-05 00:26:21 +08:00
Yu Kuai	cd5fc65338	md/md-bitmap: move bitmap_{start, end}write to md upper layer There are two BUG reports that raid5 will hang at bitmap_startwrite([1],[2]), root cause is that bitmap start write and end write is unbalanced, it's not quite clear where, and while reviewing raid5 code, it's found that bitmap operations can be optimized. For example, for a 4 disks raid5, with chunksize=8k, if user issue a IO (0 + 48k) to the array: ┌────────────────────────────────────────────────────────────┐ │chunk 0 │ │ ┌────────────┬─────────────┬─────────────┬────────────┼ │ sh0 │A0: 0 + 4k │A1: 8k + 4k │A2: 16k + 4k │A3: P │ │ ┼────────────┼─────────────┼─────────────┼────────────┼ │ sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P │ ┼──────┴────────────┴─────────────┴─────────────┴────────────┼ │chunk 1 │ │ ┌────────────┬─────────────┬─────────────┬────────────┤ │ sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P │C3: 40k + 4k│ │ ┼────────────┼─────────────┼─────────────┼────────────┼ │ sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P │D3: 44k + 4k│ └──────┴────────────┴─────────────┴─────────────┴────────────┘ Before this patch, 4 stripe head will be used, and each sh will attach bio for 3 disks, and each attached bio will trigger bitmap_startwrite() once, which means total 12 times. - 3 times (0 + 4k), for (A0, A1 and A2) - 3 times (4 + 4k), for (B0, B1 and B2) - 3 times (8 + 4k), for (C0, C1 and C3) - 3 times (12 + 4k), for (D0, D1 and D3) After this patch, md upper layer will calculate that IO range (0 + 48k) is corresponding to the bitmap (0 + 16k), and call bitmap_startwrite() just once. Noted that this patch will align bitmap ranges to the chunks, for example, if user issue a IO (0 + 4k) to array: - Before this patch, 1 time (0 + 4k), for A0; - After this patch, 1 time (0 + 8k) for chunk 0; Usually, one bitmap bit will represent more than one disk chunk, and this doesn't have any difference. 
And even if user really created a array that one chunk contain multiple bits, the overhead is that more data will be recovered after power failure. Also remove STRIPE_BITMAP_PENDING since it's not used anymore. [1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/ [2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250109015145.158868-6-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2025-01-13 08:56:11 -08:00
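The chunk-aligned mapping from an array IO range to a bitmap range, as described in the commit above, can be reproduced with a little arithmetic. This is a simplified sketch, assuming no reshape is in progress and a non-zero length; `struct range` and `bitmap_range()` are hypothetical names and this is not the raid5 implementation. All values are in KiB.

```c
struct range { unsigned long start, len; };

/*
 * Map an array IO (offset, len) to the per-disk range covered in the bitmap,
 * rounded out to whole chunks. One row of data chunks spans
 * chunk * data_disks of array space but only 'chunk' on each member disk.
 */
static struct range bitmap_range(unsigned long offset, unsigned long len,
				 unsigned long chunk, unsigned int data_disks)
{
	unsigned long row_size  = chunk * data_disks;
	unsigned long first_row = offset / row_size;
	unsigned long last_row  = (offset + len - 1) / row_size;  /* len > 0 */
	struct range r = {
		.start = first_row * chunk,
		.len   = (last_row - first_row + 1) * chunk,
	};

	return r;
}
```

For the 4-disk raid5 in the commit (3 data disks, 8k chunks), `bitmap_range(0, 48, 8, 3)` gives (0 + 16k) and `bitmap_range(0, 4, 8, 3)` gives (0 + 8k), the two cases the message walks through.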
Yu Kuai	0c984a283a	md: add a new callback pers->bitmap_sector() This callback will be used in raid5 to convert io ranges from array to bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/20250109015145.158868-4-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2025-01-13 08:56:11 -08:00
Yu Kuai	4abfce19c7	md: add a new helper rdev_blocked() The helper will be used in later patches for raid1/raid10/raid5, the difference is that Faulty rdev with unacknowledged bad block will not be considered blocked. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-05 16:08:38 -08:00
Song Liu	7f67fdae33	Merge branch 'md-6.12-bitmap' into md-6.12 From Yu Kuai (with minor changes by Song Liu): The background is that currently bitmap is using a global spin_lock, causing lock contention and huge IO performance degradation for all raid levels. However, it's impossible to implement a new lock free bitmap with current situation that md-bitmap exposes the internal implementation with lots of exported apis. Hence bitmap_operations is invented, to describe bitmap core implementation, and a new bitmap can be introduced with a new bitmap_operations, we only need to switch to the new one during initialization. And with this we can build bitmap as kernel module, but that's not our concern for now. This version was tested with mdadm tests and lvm2 tests. This set does not introduce new errors in these tests. * md-6.12-bitmap: (42 commits) md/md-bitmap: make in memory structure internal md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations md/md-bitmap: merge md_bitmap_free() into bitmap_operations md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation. 
md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations md/md-bitmap: merge md_bitmap_resize() into bitmap_operations md/md-bitmap: pass in mddev directly for md_bitmap_resize() md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations md/md-bitmap: merge bitmap_unplug() into bitmap_operations md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug() md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations md/md-bitmap: remove the parameter 'aborted' for md_bitmap_end_sync() md/md-bitmap: merge md_bitmap_start_sync() into bitmap_operations md/md-bitmap: merge md_bitmap_endwrite() into bitmap_operations md/md-bitmap: merge md_bitmap_startwrite() into bitmap_operations ... Signed-off-by: Song Liu <song@kernel.org>	2024-08-28 14:55:57 -07:00
Yu Kuai	b75197e86e	md: Remove flush handling For flush request, md has a special flush handling to merge concurrent flush request into single one, however, the whole mechanism is based on a disk level spin_lock 'mddev->lock'. And fsync can be called quite often in some user cases, for consequence, spin lock from IO fast path can cause performance degradation. Fortunately, the block layer already has flush handling to merge concurrent flush request, and it only acquires hctx level spin lock. (see details in blk-flush.c) This patch removes the flush handling in md, and converts to use general block layer flush handling in underlying disks. Flush test for 4 nvme raid10: start 128 threads to do fsync 100000 times, on arm64, see how long it takes. Test script: void* thread_func(void* arg) { int fd = (int)arg; for (int i = 0; i < FSYNC_COUNT; i++) { fsync(fd); } return NULL; } int main() { int fd = open("/dev/md0", O_RDWR); if (fd < 0) { perror("open"); exit(1); } pthread_t threads[THREADS]; struct timeval start, end; gettimeofday(&start, NULL); for (int i = 0; i < THREADS; i++) { pthread_create(&threads[i], NULL, thread_func, &fd); } for (int i = 0; i < THREADS; i++) { pthread_join(threads[i], NULL); } gettimeofday(&end, NULL); close(fd); long long elapsed = (end.tv_sec - start.tv_sec) * 1000000LL + (end.tv_usec - start.tv_usec); printf("Elapsed time: %lld microseconds\n", elapsed); return 0; } Test result: about 10 times faster: Before this patch: 50943374 microseconds After this patch: `5096347` microseconds Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240827110616.3860190-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 17:19:55 -07:00
Yu Kuai	59fdd43304	md/md-bitmap: make in memory structure internal Now that struct bitmap_page and bitmap are not used externally anymore, move them from md-bitmap.h to md-bitmap.c (except that dm-raid is still using the macro 'COUNTER_MAX'). Also fix some checkpatch warnings. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-43-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:16 -07:00
Yu Kuai	7add9db6ba	md/md-bitmap: introduce struct bitmap_operations The structure is empty for now, and will be used in later patches to merge in bitmap operations, so that bitmap implementation won't be exposed. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-12-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:15 -07:00
Christophe JAILLET	1f4a72ff00	md-cluster: Constify struct md_cluster_operations 'struct md_cluster_operations' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 51941 1442 80 53463 d0d7 drivers/md/md-cluster.o After: ===== text data bss dec hex filename 52133 1246 80 53459 d0d3 drivers/md/md-cluster.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/3727f3ce9693cae4e62ae6778ea13971df805479.1719173852.git.christophe.jaillet@wanadoo.fr	2024-07-04 06:20:27 +00:00
Christoph Hellwig	573d5abf3d	md: set md-specific flags for all queue limits The md driver wants to enforce a number of flags for all devices, even when not inheriting them from the underlying devices. To make sure these flags survive the queue_limits_set calls that md uses to update the queue limits without deriving them from the previous limits, add a new md_init_stacking_limits helper that calls blk_set_stacking_limits and sets these flags. Fixes: `1122c0c1cc` ("block: move cache control settings out of queue->flags") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240626142637.300624-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-26 09:37:35 -06:00
Jens Axboe	e3e72fe4cb	Merge branch 'for-6.11/block-limits' into for-6.11/block Pull in block limits branch, which exists as a shared branch for both the block and SCSI tree. * for-6.11/block-limits: (26 commits) block: move integrity information into queue_limits block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags block: bypass the STABLE_WRITES flag for protection information block: don't require stable pages for non-PI metadata block: use kstrtoul in flag_store block: factor out flag_{store,show} helper for integrity block: remove the blk_flush_integrity call in blk_integrity_unregister block: remove the blk_integrity_profile structure dm-integrity: use the nop integrity profile md/raid1: don't free conf on raid0_run failure md/raid0: don't free conf on raid0_run failure block: initialize integrity buffer to zero before writing it to media block: add special APIs for run-time disabling of discard and friends block: remove unused queue limits API sr: convert to the atomic queue limits API sd: convert to the atomic queue limits API sd: cleanup zoned queue limits initialization sd: factor out a sd_discard_mode helper sd: simplify the disable case in sd_config_discard sd: add a sd_disable_write_same helper ...	2024-06-14 10:22:08 -06:00
Christoph Hellwig	c6e56cf6b2	block: move integrity information into queue_limits Move the integrity information into the queue limits so that it can be set atomically with other queue limits, and that the sysfs changes to the read_verify and write_generate flags are properly synchronized. This also allows providing a more useful helper to stack the integrity fields, although it still is separate from the main stacking function as not all stackable devices want to inherit the integrity settings. Even with that it greatly simplifies the code in md and dm. Note that the integrity field is moved as-is into the queue limits. While there are good arguments for removing the separate blk_integrity structure, this would cause a lot of churn and might better be done at a later time if desired. However the integrity field in the queue_limits structure is now unconditional so that various ifdefs can be avoided or replaced with IS_ENABLED(). Given the tiny size of it, that seems like a worthwhile trade-off. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-14 10:20:07 -06:00
Yu Kuai	bc49694a9e	md: pass in max_sectors for pers->sync_request() For different sync_action, sync_thread will use different max_sectors (see details in md_sync_max_sectors()). Currently both md_do_sync() and pers->sync_request() have to get the same max_sectors in each iteration. Hence pass in max_sectors for pers->sync_request() to prevent redundant code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-12-yukuai1@huaweicloud.com	2024-06-12 16:32:57 +00:00
Yu Kuai	d249e54188	md: replace last_sync_action with new enum type The only difference is that "none" is removed and initial last_sync_action will be idle. On the one hand, this value is introduced by commit `c4a3955145` ("MD: Remember the last sync operation that was performed"), and the usage described in commit message is not affected. On the other hand, last_sync_action is not used in mdadm or mdmon, and none of the tests that I can find. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-10-yukuai1@huaweicloud.com	2024-06-12 16:32:57 +00:00
Yu Kuai	7d9f107a4e	md: use new helpers in md_do_sync() Make code cleaner, and also use the action_name directly in kernel log: - "check" instead of "data-check" - "repair" instead of "requested-resync" Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-9-yukuai1@huaweicloud.com	2024-06-12 16:32:37 +00:00
Yu Kuai	5ce10a3859	md: don't fail action_store() if sync_thread is not registered MD_RECOVERY_RUNNING will always be set when trying to register a new sync_thread, however, if md_start_sync() turns out to do nothing, MD_RECOVERY_RUNNING will be cleared in this case. And during the race window, action_store() will return -EBUSY, which will cause some mdadm tests to fail. For example: The test 07reshape5intr will add a new disk to the array, then start reshape: mdadm /dev/md0 --add /dev/xxx mdadm --grow /dev/md0 -n 3 And add_bound_rdev() from mdadm --add will set MD_RECOVERY_NEEDED, then during the race window, mdadm --grow will fail. Fix the problem by waiting in action_store() during the race window, fail only if sync_thread is registered. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-8-yukuai1@huaweicloud.com	2024-06-12 16:27:50 +00:00
Yu Kuai	e792a4c215	md: add new helpers for sync_action The new helpers will get current sync_action of the array, will be used in later patches to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-4-yukuai1@huaweicloud.com	2024-06-12 16:27:49 +00:00


340 Commits