linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-18 06:44:00 -04:00

Author	SHA1	Message	Date
Jeff Layton	0b2600f81c	treewide: change inode->i_ino from unsigned long to u64 On 32-bit architectures, unsigned long is only 32 bits wide, which causes 64-bit inode numbers to be silently truncated. Several filesystems (NFS, XFS, BTRFS, etc.) can generate inode numbers that exceed 32 bits, and this truncation can lead to inode number collisions and other subtle bugs on 32-bit systems. Change the type of inode->i_ino from unsigned long to u64 to ensure that inode numbers are always represented as 64-bit values regardless of architecture. Update all format specifiers treewide from %lu/%lx to %llu/%llx to match the new type, along with corresponding local variable types. This is the bulk treewide conversion. Earlier patches in this series handled trace events separately to allow trace field reordering for better struct packing on 32-bit. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260304-iino-u64-v3-12-2257ad83d372@kernel.org Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-06 14:31:28 +01:00
Linus Torvalds	3e48a11675	Merge tag 'f2fs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this development cycle, we focused on several key performance optimizations: - introducing large folio support to enhance read speeds for immutable files - reducing checkpoint=enable latency by flushing only committed dirty pages - implementing tracepoints to diagnose and resolve lock priority inversion. Additionally, we introduced the packed_ssa feature to optimize the SSA footprint when utilizing large block sizes. Detail summary: Enhancements: - support large folio for immutable non-compressed case - support non-4KB block size without packed_ssa feature - optimize f2fs_enable_checkpoint() to avoid long delay - optimize f2fs_overwrite_io() for f2fs_iomap_begin - optimize NAT block loading during checkpoint write - add write latency stats for NAT and SIT blocks in f2fs_write_checkpoint - pin files do not require sbi->writepages lock for ordering - avoid f2fs_map_blocks() for consecutive holes in readpages - flush plug periodically during GC to maximize readahead effect - add tracepoints to catch lock overheads - add several sysfs entries to tune internal lock priorities Fixes: - fix lock priority inversion issue - fix incomplete block usage in compact SSA summaries - fix to show simulate_lock_timeout correctly - fix to avoid mapping wrong physical block for swapfile - fix IS_CHECKPOINTED flag inconsistency issue caused by concurrent atomic commit and checkpoint writes - fix to avoid UAF in f2fs_write_end_io()" * tag 'f2fs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (61 commits) f2fs: sysfs: introduce critical_task_priority f2fs: introduce trace_f2fs_priority_update f2fs: fix lock priority inversion issue f2fs: optimize f2fs_overwrite_io() for f2fs_iomap_begin f2fs: fix incomplete block usage in compact SSA summaries f2fs: decrease maximum flush retry count in f2fs_enable_checkpoint() f2fs: optimize NAT block loading during checkpoint write f2fs: change size parameter of __has_cursum_space() to unsigned int f2fs: add write latency stats for NAT and SIT blocks in f2fs_write_checkpoint f2fs: pin files do not require sbi->writepages lock for ordering f2fs: fix to show simulate_lock_timeout correctly f2fs: introduce FAULT_SKIP_WRITE f2fs: check skipped write in f2fs_enable_checkpoint() Revert "f2fs: add timeout in f2fs_enable_checkpoint()" f2fs: fix to unlock folio in f2fs_read_data_large_folio() f2fs: fix error path handling in f2fs_read_data_large_folio() f2fs: use folio_end_read f2fs: fix to avoid mapping wrong physical block for swapfile f2fs: avoid f2fs_map_blocks() for consecutive holes in readpages f2fs: advance index and offset after zeroing in large folio read ...	2026-02-14 09:48:10 -08:00
Linus Torvalds	997f9640c9	Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux Pull fsverity updates from Eric Biggers: "fsverity cleanups, speedup, and memory usage optimization from Christoph Hellwig: - Move some logic into common code - Fix btrfs to reject truncates of fsverity files - Improve the readahead implementation - Store each inode's fsverity_info in a hash table instead of using a pointer in the filesystem-specific part of the inode. This optimizes for memory usage in the usual case where most files don't have fsverity enabled. - Look up the fsverity_info fewer times during verification, to amortize the hash table overhead" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: remove inode from fsverity_verification_ctx fsverity: use a hashtable to find the fsverity_info btrfs: consolidate fsverity_info lookup f2fs: consolidate fsverity_info lookup ext4: consolidate fsverity_info lookup fs: consolidate fsverity_info lookup in buffer.c fsverity: push out fsverity_info lookup fsverity: deconstify the inode pointer in struct fsverity_info fsverity: kick off hash readahead at data I/O submission time ext4: move ->read_folio and ->readahead to readpage.c readahead: push invalidate_lock out of page_cache_ra_unbounded fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio fsverity: start consolidating pagecache code fsverity: pass struct file to ->write_merkle_tree_block f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY fs,fsverity: clear out fsverity_info from common code fs,fsverity: reject size changes on fsverity files in setattr_prepare	2026-02-12 10:41:34 -08:00
Chao Yu	52190933c3	f2fs: sysfs: introduce critical_task_priority This patch introduces /sys/fs/f2fs/<disk>/critical_task_priority, w/ this new sysfs interface, we can tune priority of f2fs_ckpt thread and f2fs_gc thread. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-02-10 20:53:21 +00:00
Christoph Hellwig	f77f281b61	fsverity: use a hashtable to find the fsverity_info Use the kernel's resizable hash table (rhashtable) to find the fsverity_info. This way file systems that want to support fsverity don't have to bloat every inode in the system with an extra pointer. The trade-off is that looking up the fsverity_info is a bit more expensive now, but the main operations are still dominated by I/O and hashing overhead. The rhashtable implementations requires no external synchronization, and the _fast versions of the APIs provide the RCU critical sections required by the implementation. Because struct fsverity_info is only removed on inode eviction and does not contain a reference count, there is no need for an extended critical section to grab a reference or validate the object state. The file open path uses rhashtable_lookup_get_insert_fast, which can either find an existing object for the hash key or insert a new one in a single atomic operation, so that concurrent opens never instantiate duplicate fsverity_info structure. FS_IOC_ENABLE_VERITY must already be synchronized by a combination of i_rwsem and file system flags and uses rhashtable_lookup_insert_fast, which errors out on an existing object for the hash key as an additional safety check. Because insertion into the hash table now happens before S_VERITY is set, fsverity just becomes a barrier and a flag check and doesn't have to look up the fsverity_info at all, so there is only a single lookup per ->read_folio or ->readahead invocation. For btrfs there is an additional one for each bio completion, while for ext4 and f2fs the fsverity_info is stored in the per-I/O context and reused for the completion workqueue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Link: https://lore.kernel.org/r/20260202060754.270269-12-hch@lst.de [EB: folded in fix for missing fsverity_free_info()] Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2026-02-04 11:31:54 -08:00
Christoph Hellwig	45dcb3ac98	f2fs: consolidate fsverity_info lookup Look up the fsverity_info once in f2fs_mpage_readpages, and then use it for the readahead, local verification of holes and pass it along to the I/O completion workqueue in struct bio_post_read_ctx. Do the same thing in f2fs_get_read_data_folio for reads that come from garbage collection and other background activities. This amortizes the lookup better once it becomes less efficient. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20260202060754.270269-10-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2026-02-04 11:31:54 -08:00
Chao Yu	07de55cbf5	f2fs: fix lock priority inversion issue If userspace thread has held f2fs rw semaphore, due to its low priority, it could be runnable or preempted state for long time, during the time, it will block high priority thread which is trying to grab the same rw semaphore, e.g. cp_rwsem, io_rwsem... To fix such issue, let's detect thread's priority when it tries to grab f2fs_rwsem lock, if the priority is lower than a priority threshold, let's uplift the priority before it enters into critical region of lock, and restore the priority after it leaves from critical region. Meanwhile, introducing two new sysfs nodes: - /sys/fs/f2fs/<disk>/adjust_lock_priority, it is used to control whether the functionality is enable or not. ========== ================== Flag_Value Flag_Description ========== ================== 0x00000000 Disabled (default) 0x00000001 cp_rwsem 0x00000002 node_change 0x00000004 node_write 0x00000008 gc_lock 0x00000010 cp_global 0x00000020 io_rwsem ========== ================== - /sys/fs/f2fs/<disk>/lock_duration_priority, it is used to control priority threshold. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-31 03:24:39 +00:00
Chao Yu	6bb9010f78	f2fs: decrease maximum flush retry count in f2fs_enable_checkpoint() It's rare case that sync_inodes_sb() always skips to flush some drity datas, so it's enough to give extra three more chances to flush data. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-27 02:45:59 +00:00
Yongpeng Yang	7c9ee0ed2b	f2fs: change size parameter of __has_cursum_space() to unsigned int All callers of __has_cursum_space() pass an unsigned int value as the size parameter. Change the parameter type to unsigned int accordingly. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-27 02:45:58 +00:00
Yongpeng Yang	401a3034d3	f2fs: add write latency stats for NAT and SIT blocks in f2fs_write_checkpoint This patch adds separate write latency accounting for NAT and SIT blocks in f2fs_write_checkpoint(). Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-27 02:45:58 +00:00
Chao Yu	1120764691	f2fs: introduce FAULT_SKIP_WRITE In order to simulate skipped write during enable_checkpoint(). Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-27 02:45:58 +00:00
Chao Yu	ab59919c8a	f2fs: check skipped write in f2fs_enable_checkpoint() This patch introduces sbi->nr_pages[F2FS_SKIPPED_WRITE] to record any skipped write during data flush in f2fs_enable_checkpoint(). So in the loop of data flush, if there is any skipped write in previous flush, let's retry sync_inode_sb(), otherwise, all dirty data written before f2fs_enable_checkpoint() should have been persisted, then break the retry loop. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-27 02:45:44 +00:00
Jaegeuk Kim	993663874b	Revert "f2fs: add timeout in f2fs_enable_checkpoint()" This reverts commit `4bc3477796`. Let's apply a better approach to flush the only dirty pages committed by user to avoid the delay caused by unncessary incoming ones. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-20 20:54:14 +00:00
Chao Yu	50ac3ecd8e	f2fs: fix to do sanity check on node footer in {read,write}_end_io -----------[ cut here ]------------ kernel BUG at fs/f2fs/data.c:358! Call Trace: <IRQ> blk_update_request+0x5eb/0xe70 block/blk-mq.c:987 blk_mq_end_request+0x3e/0x70 block/blk-mq.c:1149 blk_complete_reqs block/blk-mq.c:1224 [inline] blk_done_softirq+0x107/0x160 block/blk-mq.c:1229 handle_softirqs+0x283/0x870 kernel/softirq.c:579 __do_softirq kernel/softirq.c:613 [inline] invoke_softirq kernel/softirq.c:453 [inline] __irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680 irq_exit_rcu+0x9/0x30 kernel/softirq.c:696 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline] sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050 </IRQ> In f2fs_write_end_io(), it detects there is inconsistency in between node page index (nid) and footer.nid of node page. If footer of node page is corrupted in fuzzed image, then we load corrupted node page w/ async method, e.g. f2fs_ra_node_pages() or f2fs_ra_node_page(), in where we won't do sanity check on node footer, once node page becomes dirty, we will encounter this bug after node page writeback. Cc: stable@kernel.org Reported-by: syzbot+803dd716c4310d16ff3a@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=803dd716c4310d16ff3a Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-17 00:00:34 +00:00
Yangyang Zang	f7b929eda1	f2fs: clean up the type parameter in f2fs_sync_meta_pages() Clean up code to improve readability, no logic changes. Signed-off-by: Yangyang Zang <zangyangyang1@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-17 00:00:34 +00:00
Daeho Jeong	e48e16f3e3	f2fs: support non-4KB block size without packed_ssa feature Currently, F2FS requires the packed_ssa feature to be enabled when utilizing non-4KB block sizes (e.g., 16KB). This restriction limits the flexibility of filesystem formatting options. This patch allows F2FS to support non-4KB block sizes even when the packed_ssa feature is disabled. It adjusts the SSA calculation logic to correctly handle summary entries in larger blocks without the packed layout. Cc: stable@kernel.org Fixes: `7ee8bc3942` ("f2fs: revert summary entry count from 2048 to 512 in 16kb block support") Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-17 00:00:34 +00:00
Chao Yu	1dd3b437d4	f2fs: make FAULT_DISCARD obsolete __blkdev_issue_discard() in __submit_discard_cmd() will never fail, so let's make FAULT_DISCARD fault injection obsolete. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-17 00:00:34 +00:00
Chao Yu	3996b70209	Revert "f2fs: block cache/dio write during f2fs_enable_checkpoint()" This reverts commit `196c81fdd4`. Original patch may cause below deadlock, revert it. write remount - write_begin - lock_page --- lock A - prepare_write_begin - f2fs_map_lock - f2fs_enable_checkpoint - down_write(cp_enable_rwsem) --- lock B - sync_inode_sb - writepages - lock_page --- lock A - down_read(cp_enable_rwsem) --- lock A Cc: stable@kernel.org Fixes: `196c81fdd4` ("f2fs: block cache/dio write during f2fs_enable_checkpoint()") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-16 03:49:31 +00:00
Darrick J. Wong	6025447737	uapi: promote EFSCORRUPTED and EUCLEAN to errno.h Stop definining these privately and instead move them to the uapi errno.h so that they become canonical instead of copy pasta. Cc: linux-api@vger.kernel.org Signed-off-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/176826402587.3490369.17659117524205214600.stgit@frogsfrogsfrogs Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-13 09:58:01 +01:00
Yongpeng Yang	071e50d61c	f2fs: change seq_file_ra_mul and max_io_bytes to unsigned int {struct file_ra_state}->ra_pages and {struct bio}->bi_iter.bi_size is defined as unsigned int, so values of seq_file_ra_mul and max_io_bytes exceeding UINT_MAX are meaningless. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:09 +00:00
Chao Yu	d36de29f4b	f2fs: sysfs: introduce inject_lock_timeout This patch adds a new sysfs node in /sys/fs/f2fs/<disk>/inject_lock_timeout, it relies on CONFIG_F2FS_FAULT_INJECTION kernel config. It can be used to simulate different type of timeout in lock duration. ========== =============================== Flag_Value Flag_Description ========== =============================== 0x00000000 No timeout (default) 0x00000001 Simulate running time 0x00000002 Simulate IO type sleep time 0x00000003 Simulate Non-IO type sleep time 0x00000004 Simulate runnable time ========== =============================== Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:08 +00:00
Chao Yu	c56254e2e0	f2fs: introduce FAULT_LOCK_TIMEOUT This patch introduce a new fault type FAULT_LOCK_TIMEOUT, it can be used to inject timeout into lock duration. Timeout type can be set via /sys/fs/f2fs/<disk>/inject_timeout_type Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:08 +00:00
Chao Yu	7a127c80b0	f2fs: rename FAULT_TIMEOUT to FAULT_ATOMIC_TIMEOUT No logic changes. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:08 +00:00
Chao Yu	6fa1160539	f2fs: fix timeout precision of f2fs_io_schedule_timeout_killable() Sometimes, f2fs_io_schedule_timeout_killable(HZ) may delay for about 2 seconds, this is because __f2fs_schedule_timeout(DEFAULT_SCHEDULE_TIMEOUT) may delay for about 2 * DEFAULT_SCHEDULE_TIMEOUT due to its precision, but we only account the delay as DEFAULT_SCHEDULE_TIMEOUT as below, fix it. f2fs_io_schedule_timeout_killable() .. timeout -= DEFAULT_SCHEDULE_TIMEOUT; Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:07 +00:00
Chao Yu	da90b67155	f2fs: fix to use jiffies based precision for DEFAULT_SCHEDULE_TIMEOUT Due to timeout parameter in {io,}_schedule_timeout() is based on jiffies unit precision. It will lose precision when using msecs_to_jiffies(x) for conversion. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:07 +00:00
Chao Yu	b5da276ae6	f2fs: clean up w/ __f2fs_schedule_timeout() No logic changes. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:07 +00:00
Chao Yu	67972c2b89	f2fs: trace elapsed time for io_rwsem lock Use f2fs_{down,up}_{read,write}_trace for io_rwsem to trace lock elapsed time. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:07 +00:00
Chao Yu	ce9fe67c9c	f2fs: trace elapsed time for cp_global_sem lock Use f2fs_{down,up}_write_trace for cp_global_sem to trace lock elapsed time. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:07 +00:00
Chao Yu	e605302c14	f2fs: trace elapsed time for gc_lock lock Use f2fs_{down,up}_write_trace for gc_lock to trace lock elapsed time. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:07 +00:00
Chao Yu	bb28b66875	f2fs: trace elapsed time for node_write lock Use f2fs_{down,up}_read_trace for node_write to trace lock elapsed time. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:06 +00:00
Chao Yu	f9f9360251	f2fs: trace elapsed time for node_change lock Use f2fs_{down,up}_read_trace for node_change to trace lock elapsed time. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:06 +00:00
Chao Yu	66e9e0d55d	f2fs: trace elapsed time for cp_rwsem lock Use f2fs_{down,up}_read_trace for cp_rwsem to trace lock elapsed time. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:06 +00:00
Chao Yu	e4b75621fc	f2fs: sysfs: introduce max_lock_elapsed_time This patch add a new sysfs node in /sys/fs/f2fs/<device>/max_lock_elapsed_time. This is a threshold, once a thread enters critical region that lock covers, total elapsed time exceeds this threshold, f2fs will print tracepoint to dump information of related context. This sysfs entry can be used to control the value of threshold, by default, the value is 500 ms. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:06 +00:00
Chao Yu	79b3cebc70	f2fs: add lock elapsed time trace facility for f2fs rwsemphore This patch adds lock elapsed time trace facility for f2fs rwsemphore. If total elapsed time of critical region covered by lock exceeds a threshold, it will print tracepoint to dump information of lock related context, including: - thread information - CPU/IO priority - lock information - elapsed time - total time - running time (depend on CONFIG_64BIT) - runnable time (depend on CONFIG_SCHED_INFO and CONFIG_SCHEDSTATS) - io sleep time (depend on CONFIG_TASK_DELAY_ACCT and /proc/sys/kernel/task_delayacct) - other time (by default other time will account nonio sleep time, but, if above kconfig is not defined, other time will include runnable time and/or io sleep time as wll) output: f2fs_lock_elapsed_time: dev = (254,52), comm: sh, pid: 13855, prio: 120, ioprio_class: 2, ioprio_data: 4, lock_name: cp_rwsem, lock_type: rlock, total: 1000, running: 993, runnable: 2, io_sleep: 0, other: 5 Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-07 03:17:06 +00:00
Yongpeng Yang	db1a8a7813	f2fs: return immediately after submitting the specified folio in __submit_merged_write_cond f2fs_folio_wait_writeback ensures the folio write is submitted to the block layer via __submit_merged_write_cond, then waits for the folio to complete. Other I/O submissions are irrelevant to f2fs_folio_wait_writeback. Thus, if the folio write bio is already submitted, the function can return immediately. This patch adds a writeback parameter to __submit_merged_write_cond(), which signals an immediate return after submitting the target folio, and waitting writeback can use this parameter. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-01 03:29:40 +00:00
Chao Yu	3cb396a2c7	f2fs: fix to do sanity check on nat entry of quota inode As Zhiguo reported, nat entry of quota inode could be corrupted: "ino/block_addr=NULL_ADDR in nid=4 entry" We'd better to do sanity check on quota inode to detect and record nat.blk_addr inconsistency, so that we can have a chance to repair it w/ later fsck. Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2026-01-01 03:26:56 +00:00
Jaegeuk Kim	05e65c14ea	f2fs: support large folio for immutable non-compressed case This patch enables large folio for limited case where we can get the high-order memory allocation. It supports the encrypted and fsverity files, which are essential for Android environment. How to test: - dd if=/dev/zero of=/mnt/test/test bs=1G count=4 - f2fs_io setflags immutable /mnt/test/test - echo 3 > /proc/sys/vm/drop_caches : to reload inode with large folio - f2fs_io read 32 0 1024 mmap 0 0 /mnt/test/test Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-16 00:46:49 +00:00
YH Lin	8d1cb17aca	f2fs: optimize trace_f2fs_write_checkpoint with enums This patch optimizes the tracepoint by replacing these hardcoded strings with a new enumeration f2fs_cp_phase. 1.Defines enum f2fs_cp_phase with values for each checkpoint phase. 2.Updates trace_f2fs_write_checkpoint to accept a u16 phase argument instead of a string pointer. 3.Uses __print_symbolic in TP_printk to convert the enum values back to their corresponding strings for human-readable trace output. This change reduces the storage overhead for each trace event by replacing a variable-length string with a 2-byte integer, while maintaining the same readable output in ftrace. Signed-off-by: YH Lin <yhli@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:06 +00:00
Chao Yu	8f11fe52fc	f2fs: support to show curseg.next_blkoff in debugfs cat /sys/kernel/debug/f2fs/status Main area: 17 segs, 17 secs 17 zones TYPE blkoff segno secno zoneno dirty_seg full_seg valid_blk - COLD data: 0 4 4 4 0 0 0 - WARM data: 0 7 7 7 0 0 0 - HOT data: 1 5 5 5 2 0 512 - Dir dnode: 3 0 0 0 1 0 2 - File dnode: 0 1 1 1 0 0 0 - Indir nodes: 0 2 2 2 0 0 0 - Pinned file: 0 -1 -1 -1 - ATGC data: 0 -1 -1 -1 Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:05 +00:00
Chao Yu	1627a303bc	f2fs: expand scalability of f2fs mount option opt field in structure f2fs_mount_info and opt_mask field in structure f2fs_fs_context is 32-bits variable, now we're running out of available bits in them, let's expand them to 64-bits for better scalability. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:05 +00:00
Chao Yu	d31e0de8b8	f2fs: change default schedule timeout value This patch changes default schedule timeout value from 20ms to 1ms, in order to give caller more chances to check whether IO or non-IO congestion condition has already been mitigable. In addition, default interval of periodical discard submission is kept to 20ms. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:05 +00:00
Chao Yu	76e780d88c	f2fs: introduce f2fs_schedule_timeout() In f2fs retry logic, we will call f2fs_io_schedule_timeout() to sleep as uninterruptible state (waiting for IO) for a while, however, in several paths below, we are not blocked by IO: - f2fs_write_single_data_page() return -EAGAIN due to racing on cp_rwsem. - f2fs_flush_device_cache() failed to submit preflush command. - __issue_discard_cmd_range() sleeps periodically in between two in batch discard submissions. So, in order to reveal state of task more accurate, let's introduce f2fs_schedule_timeout() and call it in above paths in where we are waiting for non-IO reasons. Then we can get real reason of uninterruptible sleep for a thread in tracepoint, perfetto, etc. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:05 +00:00
Yongpeng Yang	581251e030	f2fs: wrap all unusable_blocks_per_sec code in CONFIG_BLK_DEV_ZONED The usage of unusable_blocks_per_sec is already wrapped by CONFIG_BLK_DEV_ZONED, except for its declaration and the definitions of CAP_BLKS_PER_SEC and CAP_SEGS_PER_SEC. This patch ensures that all code related to unusable_blocks_per_sec is properly wrapped under the CONFIG_BLK_DEV_ZONED option. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:04 +00:00
Daeho Jeong	7ee8bc3942	f2fs: revert summary entry count from 2048 to 512 in 16kb block support The recent increase in the number of Segment Summary Area (SSA) entries from 512 to 2048 was an unintentional change in logic of 16kb block support. This commit corrects the issue. To better utilize the space available from the erroneous 2048-entry calculation, we are implementing a solution to share the currently unused SSA space with neighboring segments. This enhances overall SSA utilization without impacting the established 8MB segment size. Fixes: `d7e9a9037d` ("f2fs: Support Block Size == Page Size") Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:04 +00:00
Xiaole He	27bf6a637b	f2fs: fix age extent cache insertion skip on counter overflow The age extent cache uses last_blocks (derived from allocated_data_blocks) to determine data age. However, there's a conflict between the deletion marker (last_blocks=0) and legitimate last_blocks=0 cases when allocated_data_blocks overflows to 0 after reaching ULLONG_MAX. In this case, valid extents are incorrectly skipped due to the "if (!tei->last_blocks)" check in __update_extent_tree_range(). This patch fixes the issue by: 1. Reserving ULLONG_MAX as an invalid/deletion marker 2. Limiting allocated_data_blocks to range [0, ULLONG_MAX-1] 3. Using F2FS_EXTENT_AGE_INVALID for deletion scenarios 4. Adjusting overflow age calculation from ULLONG_MAX to (ULLONG_MAX-1) Reproducer (using a patched kernel with allocated_data_blocks initialized to ULLONG_MAX - 3 for quick testing): Step 1: Mount and check initial state # dd if=/dev/zero of=/tmp/test.img bs=1M count=100 # mkfs.f2fs -f /tmp/test.img # mkdir -p /mnt/f2fs_test # mount -t f2fs -o loop,age_extent_cache /tmp/test.img /mnt/f2fs_test # cat /sys/kernel/debug/f2fs/status \| grep -A 4 "Block Age" Allocated Data Blocks: 18446744073709551612 # ULLONG_MAX - 3 Inner Struct Count: tree: 1(0), node: 0 Step 2: Create files and write data to trigger overflow # touch /mnt/f2fs_test/{1,2,3,4}.txt; sync # cat /sys/kernel/debug/f2fs/status \| grep -A 4 "Block Age" Allocated Data Blocks: 18446744073709551613 # ULLONG_MAX - 2 Inner Struct Count: tree: 5(0), node: 1 # dd if=/dev/urandom of=/mnt/f2fs_test/1.txt bs=4K count=1; sync # cat /sys/kernel/debug/f2fs/status \| grep -A 4 "Block Age" Allocated Data Blocks: 18446744073709551614 # ULLONG_MAX - 1 Inner Struct Count: tree: 5(0), node: 2 # dd if=/dev/urandom of=/mnt/f2fs_test/2.txt bs=4K count=1; sync # cat /sys/kernel/debug/f2fs/status \| grep -A 4 "Block Age" Allocated Data Blocks: 18446744073709551615 # ULLONG_MAX Inner Struct Count: tree: 5(0), node: 3 # dd if=/dev/urandom of=/mnt/f2fs_test/3.txt bs=4K count=1; sync # cat /sys/kernel/debug/f2fs/status \| grep -A 4 "Block Age" Allocated Data Blocks: 0 # Counter overflowed! Inner Struct Count: tree: 5(0), node: 4 Step 3: Trigger the bug - next write should create node but gets skipped # dd if=/dev/urandom of=/mnt/f2fs_test/4.txt bs=4K count=1; sync # cat /sys/kernel/debug/f2fs/status \| grep -A 4 "Block Age" Allocated Data Blocks: 1 Inner Struct Count: tree: 5(0), node: 4 Expected: node: 5 (new extent node for 4.txt) Actual: node: 4 (extent insertion was incorrectly skipped due to last_blocks = allocated_data_blocks = 0 in __get_new_block_age) After this fix, the extent node is correctly inserted and node count becomes 5 as expected. Fixes: `71644dff48` ("f2fs: add block_age-based extent cache") Cc: stable@kernel.org Signed-off-by: Xiaole He <hexiaole1994@126.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:03 +00:00
Yongpeng Yang	d8bdf7856e	f2fs: ensure minimum trim granularity accounts for all devices When F2FS uses multiple block devices, each device may have a different discard granularity. The minimum trim granularity must be at least the maximum discard granularity of all devices, excluding zoned devices. Use max_t instead of the max() macro to compute the maximum value. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:03 +00:00
Chao Yu	196c81fdd4	f2fs: block cache/dio write during f2fs_enable_checkpoint() If there are too many background IOs during f2fs_enable_checkpoint(), sync_inodes_sb() may be blocked for long time due to it will loop to write dirty datas which are generated by in parallel write() continuously. Let's change as below to resolve this issue: - hold cp_enable_rwsem write lock to block any cache/dio write - decrease DEF_ENABLE_INTERVAL from 16 to 5 In addition, dump more logs during f2fs_enable_checkpoint(). Testcase: 1. fill data into filesystem until 90% usage. 2. mount -o remount,checkpoint=disable:10% /data 3. fio --rw=randwrite --bs=4kb --size=1GB --numjobs=10 \ --iodepth=64 --ioengine=psync --time_based --runtime=600 \ --directory=/data/fio_dir/ & 4. mount -o remount,checkpoint=enable /data Before: F2FS-fs (dm-51): f2fs_enable_checkpoint() finishes, writeback:7232, sync:39793, cp:457 After: F2FS-fs (dm-51): f2fs_enable_checkpoint end, writeback:5032, lock:0, sync_inode:5552, sync_fs:84 Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:02 +00:00
Yongpeng Yang	89c16629e3	f2fs: change the unlock parameter of f2fs_put_page to bool Change the type of the unlock parameter of f2fs_put_page to bool. All callers should consistently pass true or false. No logical change. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:02 +00:00
Chao Yu	1f27ef42bb	f2fs: use global inline_xattr_slab instead of per-sb slab cache As Hong Yun reported in mailing list: loop7: detected capacity change from 0 to 131072 ------------[ cut here ]------------ kmem_cache of name 'f2fs_xattr_entry-7:7' already exists WARNING: CPU: 0 PID: 24426 at mm/slab_common.c:110 kmem_cache_sanity_check mm/slab_common.c:109 [inline] WARNING: CPU: 0 PID: 24426 at mm/slab_common.c:110 __kmem_cache_create_args+0xa6/0x320 mm/slab_common.c:307 CPU: 0 UID: 0 PID: 24426 Comm: syz.7.1370 Not tainted 6.17.0-rc4 #1 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:kmem_cache_sanity_check mm/slab_common.c:109 [inline] RIP: 0010:__kmem_cache_create_args+0xa6/0x320 mm/slab_common.c:307 Call Trace: __kmem_cache_create include/linux/slab.h:353 [inline] f2fs_kmem_cache_create fs/f2fs/f2fs.h:2943 [inline] f2fs_init_xattr_caches+0xa5/0xe0 fs/f2fs/xattr.c:843 f2fs_fill_super+0x1645/0x2620 fs/f2fs/super.c:4918 get_tree_bdev_flags+0x1fb/0x260 fs/super.c:1692 vfs_get_tree+0x43/0x140 fs/super.c:1815 do_new_mount+0x201/0x550 fs/namespace.c:3808 do_mount fs/namespace.c:4136 [inline] __do_sys_mount fs/namespace.c:4347 [inline] __se_sys_mount+0x298/0x2f0 fs/namespace.c:4324 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x8e/0x3a0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e The bug can be reproduced w/ below scripts: - mount /dev/vdb /mnt1 - mount /dev/vdc /mnt2 - umount /mnt1 - mounnt /dev/vdb /mnt1 The reason is if we created two slab caches, named f2fs_xattr_entry-7:3 and f2fs_xattr_entry-7:7, and they have the same slab size. Actually, slab system will only create one slab cache core structure which has slab name of "f2fs_xattr_entry-7:3", and two slab caches share the same structure and cache address. So, if we destroy f2fs_xattr_entry-7:3 cache w/ cache address, it will decrease reference count of slab cache, rather than release slab cache entirely, since there is one more user has referenced the cache. Then, if we try to create slab cache w/ name "f2fs_xattr_entry-7:3" again, slab system will find that there is existed cache which has the same name and trigger the warning. Let's changes to use global inline_xattr_slab instead of per-sb slab cache for fixing. Fixes: `a999150f4f` ("f2fs: use kmem_cache pool during inline xattr lookups") Cc: stable@kernel.org Reported-by: Hong Yun <yhong@link.cuhk.edu.hk> Tested-by: Hong Yun <yhong@link.cuhk.edu.hk> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:02 +00:00
Chao Yu	10b591e7fb	f2fs: fix to avoid updating compression context during writeback Bai, Shuangpeng <sjb7183@psu.edu> reported a bug as below: Oops: divide error: 0000 [#1] SMP KASAN PTI CPU: 0 UID: 0 PID: 11441 Comm: syz.0.46 Not tainted 6.17.0 #1 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:f2fs_all_cluster_page_ready+0x106/0x550 fs/f2fs/compress.c:857 Call Trace: <TASK> f2fs_write_cache_pages fs/f2fs/data.c:3078 [inline] __f2fs_write_data_pages fs/f2fs/data.c:3290 [inline] f2fs_write_data_pages+0x1c19/0x3600 fs/f2fs/data.c:3317 do_writepages+0x38e/0x640 mm/page-writeback.c:2634 filemap_fdatawrite_wbc mm/filemap.c:386 [inline] __filemap_fdatawrite_range mm/filemap.c:419 [inline] file_write_and_wait_range+0x2ba/0x3e0 mm/filemap.c:794 f2fs_do_sync_file+0x6e6/0x1b00 fs/f2fs/file.c:294 generic_write_sync include/linux/fs.h:3043 [inline] f2fs_file_write_iter+0x76e/0x2700 fs/f2fs/file.c:5259 new_sync_write fs/read_write.c:593 [inline] vfs_write+0x7e9/0xe00 fs/read_write.c:686 ksys_write+0x19d/0x2d0 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf7/0x470 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f The bug was triggered w/ below race condition: fsync setattr ioctl - f2fs_do_sync_file - file_write_and_wait_range - f2fs_write_cache_pages : inode is non-compressed : cc.cluster_size = F2FS_I(inode)->i_cluster_size = 0 - tag_pages_for_writeback - f2fs_setattr - truncate_setsize - f2fs_truncate - f2fs_fileattr_set - f2fs_setflags_common - set_compress_context : F2FS_I(inode)->i_cluster_size = 4 : set_inode_flag(inode, FI_COMPRESSED_FILE) - f2fs_compressed_file : return true - f2fs_all_cluster_page_ready : "pgidx % cc->cluster_size" trigger dividing 0 issue Let's change as below to fix this issue: - introduce a new atomic type variable .writeback in structure f2fs_inode_info to track the number of threads which calling f2fs_write_cache_pages(). - use .i_sem lock to protect .writeback update. - check .writeback before update compression context in f2fs_setflags_common() to avoid race w/ ->writepages. Fixes: `4c8ff7095b` ("f2fs: support data compression") Cc: stable@kernel.org Reported-by: Bai, Shuangpeng <sjb7183@psu.edu> Tested-by: Bai, Shuangpeng <sjb7183@psu.edu> Closes: https://lore.kernel.org/lkml/44D8F7B3-68AD-425F-9915-65D27591F93F@psu.edu Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2025-12-04 02:00:02 +00:00

1 2 3 4 5 ...

1447 Commits