block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits

Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeros to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.

For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
the write zeroes operation by placing disk blocks into a deallocated
state, which opportunistically avoids writing zeroes to media while
still guaranteeing that subsequent reads from the specified block range
will return zeroed data. This is a best-effort optimization, not a
mandatory requirement: some devices may partially fall back to writing
physical zeroes due to factors such as misalignment or being asked to
clear a block range smaller than the device's internal allocation unit.
Therefore, the speed of this operation is not guaranteed.

It is difficult to determine whether a storage device supports the unmap
write zeroes operation. We cannot determine this by only querying
bdev_limits(bdev)->max_write_zeroes_sectors. Therefore, first, add a new
hardware queue limit parameter, max_hw_wzeroes_unmap_sectors, to
indicate whether a device supports this unmap write zeroes operation.
Then, add two new counterpart software queue limits,
max_wzeroes_unmap_sectors and max_user_wzeroes_unmap_sectors, which
allow users to disable this operation if the speed is very slow on some
special devices.
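
As a rough illustration of how the three limits could relate, here is a
minimal userspace sketch. It is not the code from this patch: struct
wz_limits, wz_min() and wz_validate_limits() are hypothetical names, and
the sketch assumes that the hardware limit is either 0 or equal to
max_write_zeroes_sectors (as the sysfs description suggests) and that
the effective limit is the smaller of the hardware and user limits.

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical, trimmed-down stand-in for the kernel's queue limits. */
struct wz_limits {
	unsigned int max_write_zeroes_sectors;
	unsigned int max_hw_wzeroes_unmap_sectors;   /* reported by the driver */
	unsigned int max_user_wzeroes_unmap_sectors; /* user cap, UINT_MAX = none */
	unsigned int max_wzeroes_unmap_sectors;      /* effective value */
};

static unsigned int wz_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Assumed rule: a non-zero hardware limit must match the plain write
 * zeroes limit. The effective limit is then min(hardware, user), so a
 * user value of 0 disables the unmap write zeroes operation.
 */
static int wz_validate_limits(struct wz_limits *lim)
{
	if (lim->max_hw_wzeroes_unmap_sectors &&
	    lim->max_hw_wzeroes_unmap_sectors != lim->max_write_zeroes_sectors)
		return -EINVAL;

	lim->max_wzeroes_unmap_sectors =
		wz_min(lim->max_hw_wzeroes_unmap_sectors,
		       lim->max_user_wzeroes_unmap_sectors);
	return 0;
}
```

With this model, a device that reports no hardware support always ends
up with an effective limit of 0, regardless of the user setting.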

Finally, for the stacked device case, initialize these two parameters
to UINT_MAX. This operation should be enabled by both the stacking
driver and all underlying devices.
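
The stacking behaviour can be sketched in the same spirit. This is an
assumption-laden model rather than the patch itself: struct
wz_stack_limits, wz_stack_init() and wz_stack_one() are hypothetical
names, and the sketch assumes that the two parameters in question are
the hardware and user limits and that each underlying device is folded
in by taking the minimum.

```c
#include <limits.h>

/* Hypothetical pair of limits tracked for a stacked device. */
struct wz_stack_limits {
	unsigned int max_hw_wzeroes_unmap_sectors;
	unsigned int max_user_wzeroes_unmap_sectors;
};

static unsigned int wz_stack_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Start from "no restriction" so that the first bottom device wins. */
static void wz_stack_init(struct wz_stack_limits *t)
{
	t->max_hw_wzeroes_unmap_sectors = UINT_MAX;
	t->max_user_wzeroes_unmap_sectors = UINT_MAX;
}

/* Fold one underlying device (b) into the stacked limits (t). */
static void wz_stack_one(struct wz_stack_limits *t,
			 const struct wz_stack_limits *b)
{
	t->max_hw_wzeroes_unmap_sectors =
		wz_stack_min(t->max_hw_wzeroes_unmap_sectors,
			     b->max_hw_wzeroes_unmap_sectors);
	t->max_user_wzeroes_unmap_sectors =
		wz_stack_min(t->max_user_wzeroes_unmap_sectors,
			     b->max_user_wzeroes_unmap_sectors);
}
```

Starting at UINT_MAX means a single underlying device that reports 0
is enough to turn the operation off for the whole stack.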

Thanks to Martin K. Petersen for optimizing the documentation of the
write_zeroes_unmap sysfs interface.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-2-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Zhang Yi
2025-06-19 19:17:58 +08:00
committed by Christian Brauner
parent e04c78d86a
commit 0c40d7cb5e
4 changed files with 87 additions and 2 deletions

@@ -778,6 +778,39 @@ Description:
0, write zeroes is not supported by the device.
What: /sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes
Date: January 2025
Contact: Zhang Yi <yi.zhang@huawei.com>
Description:
[RO] This file indicates whether a device supports zeroing data
in a specified block range without incurring the cost of
physically writing zeroes to the media for each individual
block. If this parameter is set to write_zeroes_max_bytes, the
device implements a zeroing operation which opportunistically
avoids writing zeroes to media while still guaranteeing that
subsequent reads from the specified block range will return
zeroed data. This operation is a best-effort optimization; a
device may fall back to physically writing zeroes to the media
due to other factors such as misalignment or being asked to
clear a block range smaller than the device's internal
allocation unit. If this parameter is set to 0, the device may
have to write each logical block to media during a zeroing
operation.
What: /sys/block/<disk>/queue/write_zeroes_unmap_max_bytes
Date: January 2025
Contact: Zhang Yi <yi.zhang@huawei.com>
Description:
[RW] While write_zeroes_unmap_max_hw_bytes is the hardware limit
for the device, this setting is the software limit. Since the
unmap write zeroes operation is a best-effort optimization, some
devices may still physically write zeroes to media. So the
speed of this operation is not guaranteed. Writing a value of
'0' to this file disables this operation. Otherwise, this
parameter should be equal to write_zeroes_unmap_max_hw_bytes.
What: /sys/block/<disk>/queue/zone_append_max_bytes
Date: May 2020
Contact: linux-block@vger.kernel.org