Commit Graph

1447 Commits

Author SHA1 Message Date
Jeff Layton
0b2600f81c treewide: change inode->i_ino from unsigned long to u64
On 32-bit architectures, unsigned long is only 32 bits wide, which
causes 64-bit inode numbers to be silently truncated. Several
filesystems (NFS, XFS, BTRFS, etc.) can generate inode numbers that
exceed 32 bits, and this truncation can lead to inode number collisions
and other subtle bugs on 32-bit systems.

Change the type of inode->i_ino from unsigned long to u64 to ensure that
inode numbers are always represented as 64-bit values regardless of
architecture. Update all format specifiers treewide from %lu/%lx to
%llu/%llx to match the new type, along with corresponding local variable
types.

This is the bulk treewide conversion. Earlier patches in this series
handled trace events separately to allow trace field reordering for
better struct packing on 32-bit.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260304-iino-u64-v3-12-2257ad83d372@kernel.org
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-06 14:31:28 +01:00
Linus Torvalds
3e48a11675 Merge tag 'f2fs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
 "In this development cycle, we focused on several key performance
  optimizations:

   - introducing large folio support to enhance read speeds for
     immutable files

   - reducing checkpoint=enable latency by flushing only committed dirty
     pages

   - implementing tracepoints to diagnose and resolve lock priority
     inversion.

  Additionally, we introduced the packed_ssa feature to optimize the SSA
  footprint when utilizing large block sizes.

  Detail summary:

  Enhancements:
   - support large folio for immutable non-compressed case
   - support non-4KB block size without packed_ssa feature
   - optimize f2fs_enable_checkpoint() to avoid long delay
   - optimize f2fs_overwrite_io() for f2fs_iomap_begin
   - optimize NAT block loading during checkpoint write
   - add write latency stats for NAT and SIT blocks in
     f2fs_write_checkpoint
   - pin files do not require sbi->writepages lock for ordering
   - avoid f2fs_map_blocks() for consecutive holes in readpages
   - flush plug periodically during GC to maximize readahead effect
   - add tracepoints to catch lock overheads
   - add several sysfs entries to tune internal lock priorities

  Fixes:
   - fix lock priority inversion issue
   - fix incomplete block usage in compact SSA summaries
   - fix to show simulate_lock_timeout correctly
   - fix to avoid mapping wrong physical block for swapfile
   - fix IS_CHECKPOINTED flag inconsistency issue caused by
     concurrent atomic commit and checkpoint writes
   - fix to avoid UAF in f2fs_write_end_io()"

* tag 'f2fs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (61 commits)
  f2fs: sysfs: introduce critical_task_priority
  f2fs: introduce trace_f2fs_priority_update
  f2fs: fix lock priority inversion issue
  f2fs: optimize f2fs_overwrite_io() for f2fs_iomap_begin
  f2fs: fix incomplete block usage in compact SSA summaries
  f2fs: decrease maximum flush retry count in f2fs_enable_checkpoint()
  f2fs: optimize NAT block loading during checkpoint write
  f2fs: change size parameter of __has_cursum_space() to unsigned int
  f2fs: add write latency stats for NAT and SIT blocks in f2fs_write_checkpoint
  f2fs: pin files do not require sbi->writepages lock for ordering
  f2fs: fix to show simulate_lock_timeout correctly
  f2fs: introduce FAULT_SKIP_WRITE
  f2fs: check skipped write in f2fs_enable_checkpoint()
  Revert "f2fs: add timeout in f2fs_enable_checkpoint()"
  f2fs: fix to unlock folio in f2fs_read_data_large_folio()
  f2fs: fix error path handling in f2fs_read_data_large_folio()
  f2fs: use folio_end_read
  f2fs: fix to avoid mapping wrong physical block for swapfile
  f2fs: avoid f2fs_map_blocks() for consecutive holes in readpages
  f2fs: advance index and offset after zeroing in large folio read
  ...
2026-02-14 09:48:10 -08:00
Linus Torvalds
997f9640c9 Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux
Pull fsverity updates from Eric Biggers:
 "fsverity cleanups, speedup, and memory usage optimization from
  Christoph Hellwig:

   - Move some logic into common code

   - Fix btrfs to reject truncates of fsverity files

   - Improve the readahead implementation

   - Store each inode's fsverity_info in a hash table instead of using a
     pointer in the filesystem-specific part of the inode.

     This optimizes for memory usage in the usual case where most files
     don't have fsverity enabled.

   - Look up the fsverity_info fewer times during verification, to
     amortize the hash table overhead"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fsverity: remove inode from fsverity_verification_ctx
  fsverity: use a hashtable to find the fsverity_info
  btrfs: consolidate fsverity_info lookup
  f2fs: consolidate fsverity_info lookup
  ext4: consolidate fsverity_info lookup
  fs: consolidate fsverity_info lookup in buffer.c
  fsverity: push out fsverity_info lookup
  fsverity: deconstify the inode pointer in struct fsverity_info
  fsverity: kick off hash readahead at data I/O submission time
  ext4: move ->read_folio and ->readahead to readpage.c
  readahead: push invalidate_lock out of page_cache_ra_unbounded
  fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio
  fsverity: start consolidating pagecache code
  fsverity: pass struct file to ->write_merkle_tree_block
  f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY
  ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY
  fs,fsverity: clear out fsverity_info from common code
  fs,fsverity: reject size changes on fsverity files in setattr_prepare
2026-02-12 10:41:34 -08:00
Chao Yu
52190933c3 f2fs: sysfs: introduce critical_task_priority
This patch introduces /sys/fs/f2fs/<disk>/critical_task_priority, w/
this new sysfs interface, we can tune priority of f2fs_ckpt thread and
f2fs_gc thread.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-02-10 20:53:21 +00:00
Christoph Hellwig
f77f281b61 fsverity: use a hashtable to find the fsverity_info
Use the kernel's resizable hash table (rhashtable) to find the
fsverity_info.  This way file systems that want to support fsverity don't
have to bloat every inode in the system with an extra pointer.  The
trade-off is that looking up the fsverity_info is a bit more expensive
now, but the main operations are still dominated by I/O and hashing
overhead.

The rhashtable implementations requires no external synchronization, and
the _fast versions of the APIs provide the RCU critical sections required
by the implementation.  Because struct fsverity_info is only removed on
inode eviction and does not contain a reference count, there is no need
for an extended critical section to grab a reference or validate the
object state.  The file open path uses rhashtable_lookup_get_insert_fast,
which can either find an existing object for the hash key or insert a
new one in a single atomic operation, so that concurrent opens never
instantiate duplicate fsverity_info structure.  FS_IOC_ENABLE_VERITY must
already be synchronized by a combination of i_rwsem and file system flags
and uses rhashtable_lookup_insert_fast, which errors out on an existing
object for the hash key as an additional safety check.

Because insertion into the hash table now happens before S_VERITY is set,
fsverity just becomes a barrier and a flag check and doesn't have to look
up the fsverity_info at all, so there is only a single lookup per
->read_folio or ->readahead invocation.  For btrfs there is an additional
one for each bio completion, while for ext4 and f2fs the fsverity_info
is stored in the per-I/O context and reused for the completion workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260202060754.270269-12-hch@lst.de
[EB: folded in fix for missing fsverity_free_info()]
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04 11:31:54 -08:00
Christoph Hellwig
45dcb3ac98 f2fs: consolidate fsverity_info lookup
Look up the fsverity_info once in f2fs_mpage_readpages, and then use it
for the readahead, local verification of holes and pass it along to the
I/O completion workqueue in struct bio_post_read_ctx.  Do the same
thing in f2fs_get_read_data_folio for reads that come from garbage
collection and other background activities.

This amortizes the lookup better once it becomes less efficient.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260202060754.270269-10-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04 11:31:54 -08:00
Chao Yu
07de55cbf5 f2fs: fix lock priority inversion issue
If userspace thread has held f2fs rw semaphore, due to its low priority,
it could be runnable or preempted state for long time, during the time,
it will block high priority thread which is trying to grab the same rw
semaphore, e.g. cp_rwsem, io_rwsem...

To fix such issue, let's detect thread's priority when it tries to grab
f2fs_rwsem lock, if the priority is lower than a priority threshold, let's
uplift the priority before it enters into critical region of lock, and
restore the priority after it leaves from critical region.

Meanwhile, introducing two new sysfs nodes:
- /sys/fs/f2fs/<disk>/adjust_lock_priority, it is used to control whether
the functionality is enable or not.
==========     ==================
Flag_Value     Flag_Description
==========     ==================
0x00000000     Disabled (default)
0x00000001     cp_rwsem
0x00000002     node_change
0x00000004     node_write
0x00000008     gc_lock
0x00000010     cp_global
0x00000020     io_rwsem
==========     ==================
- /sys/fs/f2fs/<disk>/lock_duration_priority, it is used to control
priority threshold.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-31 03:24:39 +00:00
Chao Yu
6bb9010f78 f2fs: decrease maximum flush retry count in f2fs_enable_checkpoint()
It's rare case that sync_inodes_sb() always skips to flush some drity
datas, so it's enough to give extra three more chances to flush data.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27 02:45:59 +00:00
Yongpeng Yang
7c9ee0ed2b f2fs: change size parameter of __has_cursum_space() to unsigned int
All callers of __has_cursum_space() pass an unsigned int value as the
size parameter. Change the parameter type to unsigned int accordingly.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27 02:45:58 +00:00
Yongpeng Yang
401a3034d3 f2fs: add write latency stats for NAT and SIT blocks in f2fs_write_checkpoint
This patch adds separate write latency accounting for NAT and SIT blocks
in f2fs_write_checkpoint().

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27 02:45:58 +00:00
Chao Yu
1120764691 f2fs: introduce FAULT_SKIP_WRITE
In order to simulate skipped write during enable_checkpoint().

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27 02:45:58 +00:00
Chao Yu
ab59919c8a f2fs: check skipped write in f2fs_enable_checkpoint()
This patch introduces sbi->nr_pages[F2FS_SKIPPED_WRITE] to record any
skipped write during data flush in f2fs_enable_checkpoint().

So in the loop of data flush, if there is any skipped write in previous
flush, let's retry sync_inode_sb(), otherwise, all dirty data written
before f2fs_enable_checkpoint() should have been persisted, then break
the retry loop.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27 02:45:44 +00:00
Jaegeuk Kim
993663874b Revert "f2fs: add timeout in f2fs_enable_checkpoint()"
This reverts commit 4bc3477796.

Let's apply a better approach to flush the only dirty pages committed by user
to avoid the delay caused by unncessary incoming ones.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-20 20:54:14 +00:00
Chao Yu
50ac3ecd8e f2fs: fix to do sanity check on node footer in {read,write}_end_io
-----------[ cut here ]------------
kernel BUG at fs/f2fs/data.c:358!
Call Trace:
 <IRQ>
 blk_update_request+0x5eb/0xe70 block/blk-mq.c:987
 blk_mq_end_request+0x3e/0x70 block/blk-mq.c:1149
 blk_complete_reqs block/blk-mq.c:1224 [inline]
 blk_done_softirq+0x107/0x160 block/blk-mq.c:1229
 handle_softirqs+0x283/0x870 kernel/softirq.c:579
 __do_softirq kernel/softirq.c:613 [inline]
 invoke_softirq kernel/softirq.c:453 [inline]
 __irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680
 irq_exit_rcu+0x9/0x30 kernel/softirq.c:696
 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline]
 sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050
 </IRQ>

In f2fs_write_end_io(), it detects there is inconsistency in between
node page index (nid) and footer.nid of node page.

If footer of node page is corrupted in fuzzed image, then we load corrupted
node page w/ async method, e.g. f2fs_ra_node_pages() or f2fs_ra_node_page(),
in where we won't do sanity check on node footer, once node page becomes
dirty, we will encounter this bug after node page writeback.

Cc: stable@kernel.org
Reported-by: syzbot+803dd716c4310d16ff3a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=803dd716c4310d16ff3a
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Yangyang Zang
f7b929eda1 f2fs: clean up the type parameter in f2fs_sync_meta_pages()
Clean up code to improve readability, no logic changes.

Signed-off-by: Yangyang Zang <zangyangyang1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Daeho Jeong
e48e16f3e3 f2fs: support non-4KB block size without packed_ssa feature
Currently, F2FS requires the packed_ssa feature to be enabled when
utilizing non-4KB block sizes (e.g., 16KB). This restriction limits
the flexibility of filesystem formatting options.

This patch allows F2FS to support non-4KB block sizes even when the
packed_ssa feature is disabled. It adjusts the SSA calculation logic to
correctly handle summary entries in larger blocks without the packed
layout.

Cc: stable@kernel.org
Fixes: 7ee8bc3942 ("f2fs: revert summary entry count from 2048 to 512 in 16kb block support")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Chao Yu
1dd3b437d4 f2fs: make FAULT_DISCARD obsolete
__blkdev_issue_discard() in __submit_discard_cmd() will never fail, so
let's make FAULT_DISCARD fault injection obsolete.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Chao Yu
3996b70209 Revert "f2fs: block cache/dio write during f2fs_enable_checkpoint()"
This reverts commit 196c81fdd4.

Original patch may cause below deadlock, revert it.

write				remount
- write_begin
 - lock_page  --- lock A
 - prepare_write_begin
  - f2fs_map_lock
				- f2fs_enable_checkpoint
				 - down_write(cp_enable_rwsem)  --- lock B
				 - sync_inode_sb
				  - writepages
				   - lock_page			--- lock A
   - down_read(cp_enable_rwsem)  --- lock A

Cc: stable@kernel.org
Fixes: 196c81fdd4 ("f2fs: block cache/dio write during f2fs_enable_checkpoint()")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-16 03:49:31 +00:00
Darrick J. Wong
6025447737 uapi: promote EFSCORRUPTED and EUCLEAN to errno.h
Stop definining these privately and instead move them to the uapi
errno.h so that they become canonical instead of copy pasta.

Cc: linux-api@vger.kernel.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/176826402587.3490369.17659117524205214600.stgit@frogsfrogsfrogs
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-13 09:58:01 +01:00
Yongpeng Yang
071e50d61c f2fs: change seq_file_ra_mul and max_io_bytes to unsigned int
{struct file_ra_state}->ra_pages and {struct bio}->bi_iter.bi_size is
defined as unsigned int, so values of seq_file_ra_mul and max_io_bytes
exceeding UINT_MAX are meaningless.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:09 +00:00
Chao Yu
d36de29f4b f2fs: sysfs: introduce inject_lock_timeout
This patch adds a new sysfs node in /sys/fs/f2fs/<disk>/inject_lock_timeout,
it relies on CONFIG_F2FS_FAULT_INJECTION kernel config.

It can be used to simulate different type of timeout in lock duration.

==========     ===============================
Flag_Value     Flag_Description
==========     ===============================
0x00000000     No timeout (default)
0x00000001     Simulate running time
0x00000002     Simulate IO type sleep time
0x00000003     Simulate Non-IO type sleep time
0x00000004     Simulate runnable time
==========     ===============================

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:08 +00:00
Chao Yu
c56254e2e0 f2fs: introduce FAULT_LOCK_TIMEOUT
This patch introduce a new fault type FAULT_LOCK_TIMEOUT, it can
be used to inject timeout into lock duration.

Timeout type can be set via /sys/fs/f2fs/<disk>/inject_timeout_type

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:08 +00:00
Chao Yu
7a127c80b0 f2fs: rename FAULT_TIMEOUT to FAULT_ATOMIC_TIMEOUT
No logic changes.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:08 +00:00
Chao Yu
6fa1160539 f2fs: fix timeout precision of f2fs_io_schedule_timeout_killable()
Sometimes, f2fs_io_schedule_timeout_killable(HZ) may delay for about 2
seconds, this is because __f2fs_schedule_timeout(DEFAULT_SCHEDULE_TIMEOUT)
may delay for about 2 * DEFAULT_SCHEDULE_TIMEOUT due to its precision, but
we only account the delay as DEFAULT_SCHEDULE_TIMEOUT as below, fix it.

f2fs_io_schedule_timeout_killable()
..
	timeout -= DEFAULT_SCHEDULE_TIMEOUT;

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
da90b67155 f2fs: fix to use jiffies based precision for DEFAULT_SCHEDULE_TIMEOUT
Due to timeout parameter in {io,}_schedule_timeout() is based on jiffies
unit precision. It will lose precision when using msecs_to_jiffies(x)
for conversion.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
b5da276ae6 f2fs: clean up w/ __f2fs_schedule_timeout()
No logic changes.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
67972c2b89 f2fs: trace elapsed time for io_rwsem lock
Use f2fs_{down,up}_{read,write}_trace for io_rwsem to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
ce9fe67c9c f2fs: trace elapsed time for cp_global_sem lock
Use f2fs_{down,up}_write_trace for cp_global_sem to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
e605302c14 f2fs: trace elapsed time for gc_lock lock
Use f2fs_{down,up}_write_trace for gc_lock to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
bb28b66875 f2fs: trace elapsed time for node_write lock
Use f2fs_{down,up}_read_trace for node_write to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Chao Yu
f9f9360251 f2fs: trace elapsed time for node_change lock
Use f2fs_{down,up}_read_trace for node_change to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Chao Yu
66e9e0d55d f2fs: trace elapsed time for cp_rwsem lock
Use f2fs_{down,up}_read_trace for cp_rwsem to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Chao Yu
e4b75621fc f2fs: sysfs: introduce max_lock_elapsed_time
This patch add a new sysfs node in /sys/fs/f2fs/<device>/max_lock_elapsed_time.

This is a threshold, once a thread enters critical region that lock covers,
total elapsed time exceeds this threshold, f2fs will print tracepoint to dump
information of related context. This sysfs entry can be used to control the
value of threshold, by default, the value is 500 ms.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Chao Yu
79b3cebc70 f2fs: add lock elapsed time trace facility for f2fs rwsemphore
This patch adds lock elapsed time trace facility for f2fs rwsemphore.

If total elapsed time of critical region covered by lock exceeds a
threshold, it will print tracepoint to dump information of lock related
context, including:
- thread information
- CPU/IO priority
- lock information
- elapsed time
 - total time
 - running time (depend on CONFIG_64BIT)
 - runnable time (depend on CONFIG_SCHED_INFO and CONFIG_SCHEDSTATS)
 - io sleep time (depend on CONFIG_TASK_DELAY_ACCT and
		  /proc/sys/kernel/task_delayacct)
 - other time    (by default other time will account nonio sleep time,
                  but, if above kconfig is not defined, other time will
                  include runnable time and/or io sleep time as wll)

output:
    f2fs_lock_elapsed_time: dev = (254,52), comm: sh, pid: 13855, prio: 120, ioprio_class: 2, ioprio_data: 4, lock_name: cp_rwsem, lock_type: rlock, total: 1000, running: 993, runnable: 2, io_sleep: 0, other: 5

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Yongpeng Yang
db1a8a7813 f2fs: return immediately after submitting the specified folio in __submit_merged_write_cond
f2fs_folio_wait_writeback ensures the folio write is submitted to the
block layer via __submit_merged_write_cond, then waits for the folio to
complete. Other I/O submissions are irrelevant to
f2fs_folio_wait_writeback. Thus, if the folio write bio is already
submitted, the function can return immediately. This patch adds a
writeback parameter to __submit_merged_write_cond(), which signals an
immediate return after submitting the target folio, and waitting
writeback can use this parameter.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01 03:29:40 +00:00
Chao Yu
3cb396a2c7 f2fs: fix to do sanity check on nat entry of quota inode
As Zhiguo reported, nat entry of quota inode could be corrupted:

"ino/block_addr=NULL_ADDR in nid=4 entry"

We'd better to do sanity check on quota inode to detect and record
nat.blk_addr inconsistency, so that we can have a chance to repair
it w/ later fsck.

Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01 03:26:56 +00:00
Jaegeuk Kim
05e65c14ea f2fs: support large folio for immutable non-compressed case
This patch enables large folio for limited case where we can get the high-order
memory allocation. It supports the encrypted and fsverity files, which are
essential for Android environment.

How to test:
- dd if=/dev/zero of=/mnt/test/test bs=1G count=4
- f2fs_io setflags immutable /mnt/test/test
- echo 3 > /proc/sys/vm/drop_caches
 : to reload inode with large folio
- f2fs_io read 32 0 1024 mmap 0 0 /mnt/test/test

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-16 00:46:49 +00:00
YH Lin
8d1cb17aca f2fs: optimize trace_f2fs_write_checkpoint with enums
This patch optimizes the tracepoint by replacing these hardcoded strings
with a new enumeration f2fs_cp_phase.

1.Defines enum f2fs_cp_phase with values for each checkpoint phase.
2.Updates trace_f2fs_write_checkpoint to accept a u16 phase argument
instead of a string pointer.
3.Uses __print_symbolic in TP_printk to convert the enum values
back to their corresponding strings for human-readable trace output.

This change reduces the storage overhead for each trace event
by replacing a variable-length string with a 2-byte integer,
while maintaining the same readable output in ftrace.

Signed-off-by: YH Lin <yhli@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:06 +00:00
Chao Yu
8f11fe52fc f2fs: support to show curseg.next_blkoff in debugfs
cat /sys/kernel/debug/f2fs/status

Main area: 17 segs, 17 secs 17 zones
    TYPE           blkoff    segno    secno   zoneno  dirty_seg   full_seg  valid_blk
  - COLD   data:        0        4        4        4          0          0          0
  - WARM   data:        0        7        7        7          0          0          0
  - HOT    data:        1        5        5        5          2          0        512
  - Dir   dnode:        3        0        0        0          1          0          2
  - File  dnode:        0        1        1        1          0          0          0
  - Indir nodes:        0        2        2        2          0          0          0
  - Pinned file:        0       -1       -1       -1
  - ATGC   data:        0       -1       -1       -1

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:05 +00:00
Chao Yu
1627a303bc f2fs: expand scalability of f2fs mount option
opt field in structure f2fs_mount_info and opt_mask field in structure
f2fs_fs_context is 32-bits variable, now we're running out of available
bits in them, let's expand them to 64-bits for better scalability.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:05 +00:00
Chao Yu
d31e0de8b8 f2fs: change default schedule timeout value
This patch changes default schedule timeout value from 20ms to 1ms,
in order to give caller more chances to check whether IO or non-IO
congestion condition has already been mitigable.

In addition, default interval of periodical discard submission is
kept to 20ms.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:05 +00:00
Chao Yu
76e780d88c f2fs: introduce f2fs_schedule_timeout()
In f2fs retry logic, we will call f2fs_io_schedule_timeout() to sleep as
uninterruptible state (waiting for IO) for a while, however, in several
paths below, we are not blocked by IO:
- f2fs_write_single_data_page() return -EAGAIN due to racing on cp_rwsem.
- f2fs_flush_device_cache() failed to submit preflush command.
- __issue_discard_cmd_range() sleeps periodically in between two in batch
discard submissions.

So, in order to reveal state of task more accurate, let's introduce
f2fs_schedule_timeout() and call it in above paths in where we are waiting
for non-IO reasons.

Then we can get real reason of uninterruptible sleep for a thread in
tracepoint, perfetto, etc.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:05 +00:00
Yongpeng Yang
581251e030 f2fs: wrap all unusable_blocks_per_sec code in CONFIG_BLK_DEV_ZONED
The usage of unusable_blocks_per_sec is already wrapped by
CONFIG_BLK_DEV_ZONED, except for its declaration and the definitions of
CAP_BLKS_PER_SEC and CAP_SEGS_PER_SEC. This patch ensures that all code
related to unusable_blocks_per_sec is properly wrapped under the
CONFIG_BLK_DEV_ZONED option.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:04 +00:00
Daeho Jeong
7ee8bc3942 f2fs: revert summary entry count from 2048 to 512 in 16kb block support
The recent increase in the number of Segment Summary Area (SSA) entries
from 512 to 2048 was an unintentional change in logic of 16kb block
support. This commit corrects the issue.

To better utilize the space available from the erroneous 2048-entry
calculation, we are implementing a solution to share the currently
unused SSA space with neighboring segments. This enhances overall
SSA utilization without impacting the established 8MB segment size.

Fixes: d7e9a9037d ("f2fs: Support Block Size == Page Size")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:04 +00:00
Xiaole He
27bf6a637b f2fs: fix age extent cache insertion skip on counter overflow
The age extent cache uses last_blocks (derived from
allocated_data_blocks) to determine data age. However, there's a
conflict between the deletion
marker (last_blocks=0) and legitimate last_blocks=0 cases when
allocated_data_blocks overflows to 0 after reaching ULLONG_MAX.

In this case, valid extents are incorrectly skipped due to the
"if (!tei->last_blocks)" check in __update_extent_tree_range().

This patch fixes the issue by:
1. Reserving ULLONG_MAX as an invalid/deletion marker
2. Limiting allocated_data_blocks to range [0, ULLONG_MAX-1]
3. Using F2FS_EXTENT_AGE_INVALID for deletion scenarios
4. Adjusting overflow age calculation from ULLONG_MAX to (ULLONG_MAX-1)

Reproducer (using a patched kernel with allocated_data_blocks
initialized to ULLONG_MAX - 3 for quick testing):

Step 1: Mount and check initial state
  # dd if=/dev/zero of=/tmp/test.img bs=1M count=100
  # mkfs.f2fs -f /tmp/test.img
  # mkdir -p /mnt/f2fs_test
  # mount -t f2fs -o loop,age_extent_cache /tmp/test.img /mnt/f2fs_test
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 18446744073709551612 # ULLONG_MAX - 3
  Inner Struct Count: tree: 1(0), node: 0

Step 2: Create files and write data to trigger overflow
  # touch /mnt/f2fs_test/{1,2,3,4}.txt; sync
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 18446744073709551613 # ULLONG_MAX - 2
  Inner Struct Count: tree: 5(0), node: 1

  # dd if=/dev/urandom of=/mnt/f2fs_test/1.txt bs=4K count=1; sync
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 18446744073709551614 # ULLONG_MAX - 1
  Inner Struct Count: tree: 5(0), node: 2

  # dd if=/dev/urandom of=/mnt/f2fs_test/2.txt bs=4K count=1; sync
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 18446744073709551615 # ULLONG_MAX
  Inner Struct Count: tree: 5(0), node: 3

  # dd if=/dev/urandom of=/mnt/f2fs_test/3.txt bs=4K count=1; sync
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 0 # Counter overflowed!
  Inner Struct Count: tree: 5(0), node: 4

Step 3: Trigger the bug - next write should create node but gets skipped
  # dd if=/dev/urandom of=/mnt/f2fs_test/4.txt bs=4K count=1; sync
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 1
  Inner Struct Count: tree: 5(0), node: 4

  Expected: node: 5 (new extent node for 4.txt)
  Actual: node: 4 (extent insertion was incorrectly skipped due to
  last_blocks = allocated_data_blocks = 0 in __get_new_block_age)

After this fix, the extent node is correctly inserted and node count
becomes 5 as expected.

Fixes: 71644dff48 ("f2fs: add block_age-based extent cache")
Cc: stable@kernel.org
Signed-off-by: Xiaole He <hexiaole1994@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:03 +00:00
Yongpeng Yang
d8bdf7856e f2fs: ensure minimum trim granularity accounts for all devices
When F2FS uses multiple block devices, each device may have a
different discard granularity. The minimum trim granularity must be
at least the maximum discard granularity of all devices, excluding
zoned devices. Use max_t instead of the max() macro to compute the
maximum value.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:03 +00:00
Chao Yu
196c81fdd4 f2fs: block cache/dio write during f2fs_enable_checkpoint()
If there are too many background IOs during f2fs_enable_checkpoint(),
sync_inodes_sb() may be blocked for long time due to it will loop to
write dirty datas which are generated by in parallel write()
continuously.

Let's change as below to resolve this issue:
- hold cp_enable_rwsem write lock to block any cache/dio write
- decrease DEF_ENABLE_INTERVAL from 16 to 5

In addition, dump more logs during f2fs_enable_checkpoint().

Testcase:
1. fill data into filesystem until 90% usage.
2. mount -o remount,checkpoint=disable:10% /data
3. fio --rw=randwrite  --bs=4kb  --size=1GB  --numjobs=10  \
--iodepth=64  --ioengine=psync  --time_based  --runtime=600 \
--directory=/data/fio_dir/ &
4. mount -o remount,checkpoint=enable /data

Before:
F2FS-fs (dm-51): f2fs_enable_checkpoint() finishes, writeback:7232, sync:39793, cp:457

After:
F2FS-fs (dm-51): f2fs_enable_checkpoint end, writeback:5032, lock:0, sync_inode:5552, sync_fs:84

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:02 +00:00
Yongpeng Yang
89c16629e3 f2fs: change the unlock parameter of f2fs_put_page to bool
Change the type of the unlock parameter of f2fs_put_page to bool.
All callers should consistently pass true or false. No logical change.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:02 +00:00
Chao Yu
1f27ef42bb f2fs: use global inline_xattr_slab instead of per-sb slab cache
As Hong Yun reported in mailing list:

loop7: detected capacity change from 0 to 131072
------------[ cut here ]------------
kmem_cache of name 'f2fs_xattr_entry-7:7' already exists
WARNING: CPU: 0 PID: 24426 at mm/slab_common.c:110 kmem_cache_sanity_check mm/slab_common.c:109 [inline]
WARNING: CPU: 0 PID: 24426 at mm/slab_common.c:110 __kmem_cache_create_args+0xa6/0x320 mm/slab_common.c:307
CPU: 0 UID: 0 PID: 24426 Comm: syz.7.1370 Not tainted 6.17.0-rc4 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:kmem_cache_sanity_check mm/slab_common.c:109 [inline]
RIP: 0010:__kmem_cache_create_args+0xa6/0x320 mm/slab_common.c:307
Call Trace:
 __kmem_cache_create include/linux/slab.h:353 [inline]
 f2fs_kmem_cache_create fs/f2fs/f2fs.h:2943 [inline]
 f2fs_init_xattr_caches+0xa5/0xe0 fs/f2fs/xattr.c:843
 f2fs_fill_super+0x1645/0x2620 fs/f2fs/super.c:4918
 get_tree_bdev_flags+0x1fb/0x260 fs/super.c:1692
 vfs_get_tree+0x43/0x140 fs/super.c:1815
 do_new_mount+0x201/0x550 fs/namespace.c:3808
 do_mount fs/namespace.c:4136 [inline]
 __do_sys_mount fs/namespace.c:4347 [inline]
 __se_sys_mount+0x298/0x2f0 fs/namespace.c:4324
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x8e/0x3a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

The bug can be reproduced w/ below scripts:
- mount /dev/vdb /mnt1
- mount /dev/vdc /mnt2
- umount /mnt1
- mounnt /dev/vdb /mnt1

The reason is if we created two slab caches, named f2fs_xattr_entry-7:3
and f2fs_xattr_entry-7:7, and they have the same slab size. Actually,
slab system will only create one slab cache core structure which has
slab name of "f2fs_xattr_entry-7:3", and two slab caches share the same
structure and cache address.

So, if we destroy f2fs_xattr_entry-7:3 cache w/ cache address, it will
decrease reference count of slab cache, rather than release slab cache
entirely, since there is one more user has referenced the cache.

Then, if we try to create slab cache w/ name "f2fs_xattr_entry-7:3" again,
slab system will find that there is existed cache which has the same name
and trigger the warning.

Let's changes to use global inline_xattr_slab instead of per-sb slab cache
for fixing.

Fixes: a999150f4f ("f2fs: use kmem_cache pool during inline xattr lookups")
Cc: stable@kernel.org
Reported-by: Hong Yun <yhong@link.cuhk.edu.hk>
Tested-by: Hong Yun <yhong@link.cuhk.edu.hk>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:02 +00:00
Chao Yu
10b591e7fb f2fs: fix to avoid updating compression context during writeback
Bai, Shuangpeng <sjb7183@psu.edu> reported a bug as below:

Oops: divide error: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 11441 Comm: syz.0.46 Not tainted 6.17.0 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:f2fs_all_cluster_page_ready+0x106/0x550 fs/f2fs/compress.c:857
Call Trace:
 <TASK>
 f2fs_write_cache_pages fs/f2fs/data.c:3078 [inline]
 __f2fs_write_data_pages fs/f2fs/data.c:3290 [inline]
 f2fs_write_data_pages+0x1c19/0x3600 fs/f2fs/data.c:3317
 do_writepages+0x38e/0x640 mm/page-writeback.c:2634
 filemap_fdatawrite_wbc mm/filemap.c:386 [inline]
 __filemap_fdatawrite_range mm/filemap.c:419 [inline]
 file_write_and_wait_range+0x2ba/0x3e0 mm/filemap.c:794
 f2fs_do_sync_file+0x6e6/0x1b00 fs/f2fs/file.c:294
 generic_write_sync include/linux/fs.h:3043 [inline]
 f2fs_file_write_iter+0x76e/0x2700 fs/f2fs/file.c:5259
 new_sync_write fs/read_write.c:593 [inline]
 vfs_write+0x7e9/0xe00 fs/read_write.c:686
 ksys_write+0x19d/0x2d0 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xf7/0x470 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The bug was triggered w/ below race condition:

fsync				setattr			ioctl
- f2fs_do_sync_file
 - file_write_and_wait_range
  - f2fs_write_cache_pages
  : inode is non-compressed
  : cc.cluster_size =
    F2FS_I(inode)->i_cluster_size = 0
   - tag_pages_for_writeback
				- f2fs_setattr
				 - truncate_setsize
				 - f2fs_truncate
							- f2fs_fileattr_set
							 - f2fs_setflags_common
							  - set_compress_context
							  : F2FS_I(inode)->i_cluster_size = 4
							  : set_inode_flag(inode, FI_COMPRESSED_FILE)
   - f2fs_compressed_file
   : return true
   - f2fs_all_cluster_page_ready
   : "pgidx % cc->cluster_size" trigger dividing 0 issue

Let's change as below to fix this issue:
- introduce a new atomic type variable .writeback in structure f2fs_inode_info
to track the number of threads which calling f2fs_write_cache_pages().
- use .i_sem lock to protect .writeback update.
- check .writeback before update compression context in f2fs_setflags_common()
to avoid race w/ ->writepages.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Cc: stable@kernel.org
Reported-by: Bai, Shuangpeng <sjb7183@psu.edu>
Tested-by: Bai, Shuangpeng <sjb7183@psu.edu>
Closes: https://lore.kernel.org/lkml/44D8F7B3-68AD-425F-9915-65D27591F93F@psu.edu
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:02 +00:00