188 Commits

Author SHA1 Message Date
Andreas Gruenbacher
bb504b4d64 lockref: remove count argument of lockref_init
All users of lockref_init() now initialize the count to 1, so hardcode
that and remove the count argument.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/r/20250130135624.1899988-4-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07 10:27:25 +01:00
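The lockref_init() change above can be pictured with a small userspace model. This is only a sketch in Python, not kernel code: the `Lockref` class and the `lockref_init` function below are hypothetical stand-ins for the kernel's `struct lockref` and its C helper, and a `threading.Lock` stands in for the spinlock.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Lockref:
    """Toy stand-in for the kernel's struct lockref (a lock plus a count)."""
    lock: threading.Lock = field(default_factory=threading.Lock)
    count: int = 0


def lockref_init(ref: Lockref) -> None:
    """Model of the helper after this commit: there is no count argument.

    Every caller used to pass 1, so the initial count is hardcoded here.
    """
    ref.lock = threading.Lock()
    ref.count = 1


ref = Lockref()
lockref_init(ref)
print(ref.count)
```

Dropping the argument removes a value that every call site had to repeat and could get wrong.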
Linus Torvalds
c2da8b3f91 Merge tag 'erofs-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
 "Still no new features for this cycle, as some ongoing improvements
  remain premature for now.

  This includes a micro-optimization for the superblock checksum, along
  with minor bugfixes and code cleanups, as usual:

   - Micro-optimize superblock checksum

   - Avoid overly large bvecs[] for file-backed mounts

   - Some leftover folio conversion in z_erofs_bind_cache()

   - Minor bugfixes and cleanups"

* tag 'erofs-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: refine z_erofs_get_extent_compressedlen()
  erofs: remove dead code in erofs_fc_parse_param
  erofs: return SHRINK_EMPTY if no objects to free
  erofs: convert z_erofs_bind_cache() to folios
  erofs: tidy up zdata.c
  erofs: get rid of `z_erofs_next_pcluster_t`
  erofs: simplify z_erofs_load_compact_lcluster()
  erofs: fix potential return value overflow of z_erofs_shrink_scan()
  erofs: shorten bvecs[] for file-backed mounts
  erofs: micro-optimize superblock checksum
  fs: erofs: xattr.c change kzalloc to kcalloc
2025-01-25 20:03:04 -08:00
Linus Torvalds
1d6d399223 Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
 "Kthread affinity follows one of 4 existing patterns:

   1) Per-CPU kthreads must stay affine to a single CPU and never
      execute relevant code on any other CPU. This is currently handled
      by smpboot code which takes care of CPU-hotplug operations.
      Affinity here is a correctness constraint.

   2) Some kthreads _have_ to be affine to a specific set of CPUs and
      can't run anywhere else. The affinity is set through
      kthread_bind_mask() and the subsystem takes care by itself to
      handle CPU-hotplug operations. Affinity here is assumed to be a
      correctness constraint.

   3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
      This is not a correctness constraint but merely a preference in
      terms of memory locality. kswapd and kcompactd both fall into this
      category. The affinity is set manually like for any other task and
      CPU-hotplug is supposed to be handled by the relevant subsystem so
      that the task is properly reaffined whenever a given CPU from the
      node comes up. Also care should be taken so that the node affinity
      doesn't cross isolated (nohz_full) cpumask boundaries.

   4) Similar to the previous point except kthreads have a _preferred_
      affinity different than a node. Both RCU boost kthreads and RCU
      exp kworkers fall into this category as they refer to "RCU nodes"
      from a distinctly distributed tree.

  Currently the preferred affinity patterns (3 and 4) have at least 4
  identified users, with more or less success when it comes to handling
  CPU-hotplug operations and CPU isolation. Each of them does it in its
  own ad-hoc way.

  This is an infrastructure proposal to handle this with the following
  API changes:

   - kthread_create_on_node() automatically affines the created kthread
     to its target node unless it has been set as per-cpu or bound with
     kthread_bind[_mask]() before the first wake-up.

   - kthread_affine_preferred() is a new function that can be called
     right after kthread_create_on_node() to specify a preferred
     affinity different than the specified node.

  When the preferred affinity can't be applied because the possible
  targets are offline or isolated (nohz_full), the kthread is affine to
  the housekeeping CPUs (which means to all online CPUs most of the time
  or only the non-nohz_full CPUs when nohz_full= is set).

  kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
  converted, along with a few old drivers.

  Summary of the changes:

   - Consolidate a bunch of ad-hoc implementations of
     kthread_run_on_cpu()

   - Introduce task_cpu_fallback_mask() that defines the default last
     resort affinity of a task to become nohz_full aware

   - Add a correctness check to ensure kthread_bind() is always
     called before the first kthread wake up.

   - Default affine kthread to its preferred node.

   - Convert kswapd / kcompactd and remove their halfway working ad-hoc
     affinity implementation

   - Implement kthreads preferred affinity

   - Unify kthread worker and kthread API's style

   - Convert RCU kthreads to the new API and remove the ad-hoc affinity
     implementation"

* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
  kthread: modify kernel-doc function name to match code
  rcu: Use kthread preferred affinity for RCU exp kworkers
  treewide: Introduce kthread_run_worker[_on_cpu]()
  kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
  rcu: Use kthread preferred affinity for RCU boost
  kthread: Implement preferred affinity
  mm: Create/affine kswapd to its preferred node
  mm: Create/affine kcompactd to its preferred node
  kthread: Default affine kthread to its preferred NUMA node
  kthread: Make sure kthread hasn't started while binding it
  sched,arm64: Handle CPU isolation on last resort fallback rq selection
  arm64: Exclude nohz_full CPUs from 32bits el0 support
  lib: test_objpool: Use kthread_run_on_cpu()
  kallsyms: Use kthread_run_on_cpu()
  soc/qman: test: Use kthread_run_on_cpu()
  arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-21 17:10:05 -08:00
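The fallback rule described in the pull message above (a preferred affinity that cannot be applied falls back to the housekeeping CPUs) can be sketched with plain sets. This is a userspace Python model under stated assumptions, not the kernel implementation: `effective_affinity` is a hypothetical name, CPU masks are modeled as Python sets, and the housekeeping set is taken as given by the caller.

```python
def effective_affinity(preferred, online, housekeeping):
    """Return the set of CPUs a kthread with a preferred affinity may use.

    Assumed model: the preference only counts for CPUs that are online and
    not isolated. If no such CPU exists, fall back to the housekeeping set.
    """
    usable = set(preferred) & set(online) & set(housekeeping)
    return usable if usable else set(housekeeping)


online = {0, 1, 2, 3}
housekeeping = {0, 1}          # e.g. CPUs 2-3 isolated with nohz_full=
print(effective_affinity({1, 2}, online, housekeeping))
print(effective_affinity({2, 3}, online, housekeeping))
```

In the second call every preferred CPU is isolated, so the model returns the housekeeping CPUs instead of an empty set.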
Linus Torvalds
4b84a4c8d4 Merge tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "Features:

   - Support caching symlink lengths in inodes

     The size is stored in a new union utilizing the same space as
     i_devices, thus avoiding growing the struct or taking up any more
     space

     When utilized it dodges strlen() in vfs_readlink(), giving about
     1.5% speed up when issuing readlink on /initrd.img on ext4

   - Add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag

     If a file system supports uncached buffered IO, it may set
     FOP_DONTCACHE and enable support for RWF_DONTCACHE.

     If RWF_DONTCACHE is attempted without the file system supporting
     it, the request fails with -EOPNOTSUPP

   - Enable VBOXGUEST and VBOXSF_FS on ARM64

     Now that VirtualBox is able to run as a host on arm64 (e.g. the
     Apple M3 processors) we can enable VBOXSF_FS (and in turn
     VBOXGUEST) for this architecture.

     Tested with various runs of bonnie++ and dbench on an Apple MacBook
     Pro with the latest VirtualBox 7.1.4 r165100 installed

  Cleanups:

   - Delay sysctl_nr_open check in expand_files()

   - Use kernel-doc includes in fiemap docbook

   - Use page->private instead of page->index in watch_queue

   - Use a consume fence in mnt_idmap() as it's heavily used in
     link_path_walk()

   - Replace magic number 7 with ARRAY_SIZE() in fc_log

   - Sort out a stale comment about races between fd alloc and dup2()

   - Fix return type of do_mount() from long to int

   - Various cosmetic cleanups for the lockref code

  Fixes:

   - Annotate spinning as unlikely() in __read_seqcount_begin

     The annotation already used to be there, but got lost in commit
     52ac39e5db ("seqlock: seqcount_t: Implement all read APIs as
     statement expressions")

   - Fix proc_handler for sysctl_nr_open

   - Flush delayed work in delayed fput()

   - Fix grammar and spelling in propagate_umount()

   - Fix ESP not readable during coredump

     In /proc/PID/stat, there is the kstkesp field which is the stack
     pointer of a thread. While the thread is active, this field reads
     zero. But during a coredump, it should have a valid value

     However, at the moment, kstkesp is zero even during coredump

   - Don't wake up the writer if the pipe is still full

   - Fix unbalanced user_access_end() in select code"

* tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
  gfs2: use lockref_init for qd_lockref
  erofs: use lockref_init for pcl->lockref
  dcache: use lockref_init for d_lockref
  lockref: add a lockref_init helper
  lockref: drop superfluous externs
  lockref: use bool for false/true returns
  lockref: improve the lockref_get_not_zero description
  lockref: remove lockref_put_not_zero
  fs: Fix return type of do_mount() from long to int
  select: Fix unbalanced user_access_end()
  vbox: Enable VBOXGUEST and VBOXSF_FS on ARM64
  pipe_read: don't wake up the writer if the pipe is still full
  selftests: coredump: Add stackdump test
  fs/proc: do_task_stat: Fix ESP not readable during coredump
  fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
  fs: sort out a stale comment about races between fd alloc and dup2
  fs: Fix grammar and spelling in propagate_umount()
  fs: fc_log replace magic number 7 with ARRAY_SIZE()
  fs: use a consume fence in mnt_idmap()
  file: flush delayed work in delayed fput()
  ...
2025-01-20 09:40:49 -08:00
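The RWF_DONTCACHE rule from the pull message above (the request is rejected with -EOPNOTSUPP unless the file system advertises FOP_DONTCACHE) amounts to a flag check. The following Python sketch only models that check. The function name `check_dontcache` is hypothetical, and the bit values are placeholders chosen for the example rather than values taken from the kernel headers.

```python
import errno

# Placeholder bit values for this sketch; the kernel headers define the
# real ones.
RWF_DONTCACHE = 0x80
FOP_DONTCACHE = 0x80


def check_dontcache(fop_flags: int, rwf_flags: int) -> int:
    """Return 0 if the request is acceptable, else a negative errno."""
    wants_dontcache = bool(rwf_flags & RWF_DONTCACHE)
    fs_supports_it = bool(fop_flags & FOP_DONTCACHE)
    if wants_dontcache and not fs_supports_it:
        return -errno.EOPNOTSUPP
    return 0


print(check_dontcache(fop_flags=0, rwf_flags=RWF_DONTCACHE))
print(check_dontcache(fop_flags=FOP_DONTCACHE, rwf_flags=RWF_DONTCACHE))
```

A request that does not set RWF_DONTCACHE is unaffected by the check either way.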
Gao Xiang
e180b8c4c2 erofs: convert z_erofs_bind_cache() to folios
The managed cache uses a pseudo inode to keep (necessary) compressed
data.

Currently, it still uses zero-order folios, so this is just a trivial
conversion, except that the use of the pagepool is temporarily dropped.

Drop some obsoleted comments too.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250114034429.431408-4-hsiangkao@linux.alibaba.com
2025-01-17 03:22:59 +08:00
Gao Xiang
6f435e94a1 erofs: tidy up zdata.c
All small code style adjustments, no logic changes:

 - z_erofs_decompress_frontend => z_erofs_frontend;
 - z_erofs_decompress_backend => z_erofs_backend;
 - Use Z_EROFS_DEFINE_FRONTEND() to replace DECOMPRESS_FRONTEND_INIT();
 - `nr_folios` should be `nrpages` in z_erofs_readahead();
 - Refine in-line comments.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250114034429.431408-3-hsiangkao@linux.alibaba.com
2025-01-17 03:22:43 +08:00
Gao Xiang
5514d8478b erofs: get rid of z_erofs_next_pcluster_t
It was originally intended for tagged pointer reservation.

Now all encoded data can be represented uniformly with
`struct z_erofs_pcluster` as described in commit bf1aa03980
("erofs: sunset `struct erofs_workgroup`"), let's drop it too.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250114034429.431408-2-hsiangkao@linux.alibaba.com
2025-01-17 03:21:21 +08:00
Gao Xiang
db902986de erofs: fix potential return value overflow of z_erofs_shrink_scan()
z_erofs_shrink_scan() could return small numbers due to the mistyped
`freed`.

Although I don't think it has any visible impact.

Fixes: 3883a79abd ("staging: erofs: introduce VLE decompression support")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250114040058.459981-1-hsiangkao@linux.alibaba.com
2025-01-17 03:19:39 +08:00
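The kind of problem fixed above, a counter declared with a narrower type than the value it has to carry, can be shown with a small model. This Python sketch assumes a 32-bit unsigned accumulator and a wider return type purely for illustration. The commit message does not state the actual types, and both function names are hypothetical.

```python
U32_MASK = (1 << 32) - 1


def sum_freed_u32(batches):
    """Accumulate freed-object counts in a 32-bit unsigned variable (wraps)."""
    freed = 0
    for n in batches:
        freed = (freed + n) & U32_MASK  # emulate unsigned 32-bit wraparound
    return freed


def sum_freed_wide(batches):
    """Accumulate in a variable as wide as the return type (no wraparound)."""
    return sum(batches)


batches = [1 << 31, 1 << 31, 5]
print(sum_freed_u32(batches), sum_freed_wide(batches))
```

With the narrow accumulator the total wraps around and a small number is reported, which matches the "could return small numbers" symptom described in the message.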
Christoph Hellwig
6f86f1465b erofs: use lockref_init for pcl->lockref
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250115094702.504610-8-hch@lst.de
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-16 11:48:12 +01:00
Frederic Weisbecker
b04e317b52 treewide: Introduce kthread_run_worker[_on_cpu]()
kthread_create() creates a kthread without running it yet. kthread_run()
creates a kthread and runs it.

On the other hand, kthread_create_worker() creates a kthread worker and
runs it.

This difference in behaviours is confusing. Also there is no way to
create a kthread worker and affine it using kthread_bind_mask() or
kthread_affine_preferred() before starting it.

Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]()
that behaves just like kthread_run(). kthread_create_worker[_on_cpu]()
will now only create a kthread worker without starting it.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
2025-01-08 18:15:03 +01:00
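The create/run split described in the commit above can be modeled in a few lines. This is a userspace Python sketch, not the kernel API: `Worker`, `create_worker` and `run_worker` are hypothetical names that mirror the semantics the message describes for kthread_create_worker() and kthread_run_worker().

```python
class Worker:
    """Toy model of a kthread worker; not the kernel API."""

    def __init__(self, name):
        self.name = name
        self.started = False
        self.cpus = None  # None means "no explicit affinity"

    def bind_mask(self, cpus):
        # Binding is only allowed before the first wake-up.
        if self.started:
            raise RuntimeError("cannot bind a worker that already started")
        self.cpus = frozenset(cpus)

    def wake_up(self):
        self.started = True


def create_worker(name):
    """Create only; the caller may set affinity, then wake the worker."""
    return Worker(name)


def run_worker(name):
    """Create and start in one step, like kthread_run() for plain kthreads."""
    worker = create_worker(name)
    worker.wake_up()
    return worker


w = create_worker("demo")
w.bind_mask({0, 1})   # possible because the worker has not started yet
w.wake_up()
```

Separating creation from the first wake-up is what leaves a window in which the affinity can still be set.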
Frederic Weisbecker
41f70d8e16 kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
kthread_create_on_cpu() uses the CPU argument as an implicit and unique
printf argument to add to the format whereas
kthread_create_worker_on_cpu() still relies on explicitly passing the
printf arguments. This difference in behaviour is error prone and
doesn't help standardizing per-CPU kthread names.

Unify the behaviours and convert kthread_create_worker_on_cpu() to
use the printf behaviour of kthread_create_on_cpu().

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-01-08 18:15:03 +01:00
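The naming convention described above, where the CPU number is the single implicit printf argument, can be illustrated with Python's printf-style formatting. The helper `name_on_cpu` and the format string `demo_worker/%u` are made up for this sketch.

```python
def name_on_cpu(namefmt: str, cpu: int) -> str:
    """Format a per-CPU thread name; the CPU number is the only argument."""
    return namefmt % cpu


print(name_on_cpu("demo_worker/%u", 3))
```

Because the caller passes only the format and the CPU, per-CPU names cannot drift apart through mismatched extra arguments.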
Gao Xiang
1a2180f685 erofs: fix PSI memstall accounting
Max Kellermann recently reported psi_group_cpu.tasks[NR_MEMSTALL] is
incorrect in the 6.11.9 kernel.

The root cause appears to be that, since the problematic commit, bio
can be NULL, causing psi_memstall_leave() to be skipped in
z_erofs_submit_queue().

Reported-by: Max Kellermann <max.kellermann@ionos.com>
Closes: https://lore.kernel.org/r/CAKPOu+8tvSowiJADW2RuKyofL_CSkm_SuyZA7ME5vMLWmL6pqw@mail.gmail.com
Fixes: 9e2f9d34dd ("erofs: handle overlapped pclusters out of crafted images properly")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241127085236.3538334-1-hsiangkao@linux.alibaba.com
2024-12-13 00:24:40 +08:00
Chunhai Guo
db80b98305 erofs: add sysfs node to drop internal caches
Add a sysfs node to drop compression-related caches, currently used to
drop in-memory pclusters and cached compressed folios.

Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241113041148.749129-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18 18:50:13 +08:00
Chunhai Guo
f5ad9f9a60 erofs: free pclusters if no cached folio is attached
Once a pcluster is fully decompressed and there are no attached cached
folios, its corresponding `struct z_erofs_pcluster` will be freed. This
will significantly reduce the frequency of calls to erofs_shrink_scan()
and the memory allocated for `struct z_erofs_pcluster`.

The tables below show approximately a 96% reduction in the calls to
erofs_shrink_scan() and in the memory allocated for `struct
z_erofs_pcluster` after applying this patch. The results were obtained
by performing a test to copy a 4.1GB partition on ARM64 Android devices
running the 6.6 kernel with an 8-core CPU and 12GB of memory.

1. The reduction in calls to erofs_shrink_scan():
+-----------------+-----------+----------+---------+
|                 | w/o patch | w/ patch |  diff   |
+-----------------+-----------+----------+---------+
| Average (times) |   11390   |   390    | -96.57% |
+-----------------+-----------+----------+---------+

2. The reduction in memory released by erofs_shrink_scan():
+-----------------+-----------+----------+---------+
|                 | w/o patch | w/ patch |  diff   |
+-----------------+-----------+----------+---------+
| Average (Byte)  | 133612656 | 4434552  | -96.68% |
+-----------------+-----------+----------+---------+

Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241112043235.546164-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18 18:50:13 +08:00
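The percentages in the two tables above follow from the usual relative-change formula, which the short Python sketch below recomputes from the printed averages. The helper name `percent_change` is hypothetical. Recomputed from the rounded averages, the first figure comes out at about -96.58% when rounded to two decimals, so the table's -96.57% may reflect truncation or unrounded inputs.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


calls = percent_change(11390, 390)              # calls to erofs_shrink_scan()
memory = percent_change(133612656, 4434552)     # bytes released
print(f"calls: {calls:.3f}%  memory: {memory:.3f}%")
```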
Gao Xiang
bf1aa03980 erofs: sunset struct erofs_workgroup
`struct erofs_workgroup` was introduced to provide a unique header
for all physically indexed objects.  However, after big pclusters and
shared pclusters are implemented upstream, it seems that all EROFS
encoded data (which requires transformation) can be represented with
`struct z_erofs_pcluster` directly.

Move all members into `struct z_erofs_pcluster` for simplicity.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241021035323.3280682-3-hsiangkao@linux.alibaba.com
2024-11-18 18:50:12 +08:00
Gao Xiang
9c91f95962 erofs: move erofs_workgroup operations into zdata.c
Move related helpers into zdata.c as an intermediate step of getting
rid of `struct erofs_workgroup`, and rename:

 erofs_workgroup_put => z_erofs_put_pcluster
 erofs_workgroup_get => z_erofs_get_pcluster
 erofs_try_to_release_workgroup => erofs_try_to_release_pcluster
 erofs_shrink_workstation => z_erofs_shrink_scan

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241021035323.3280682-2-hsiangkao@linux.alibaba.com
2024-11-18 18:50:12 +08:00
Gao Xiang
b091e8ed24 erofs: get rid of erofs_{find,insert}_workgroup
Just fold them into the only two callers since
they are simple enough.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241021035323.3280682-1-hsiangkao@linux.alibaba.com
2024-11-18 18:50:03 +08:00
Gao Xiang
2402082e53 erofs: get rid of z_erofs_try_to_claim_pcluster()
Just fold it into the caller for simplicity.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241010090420.405871-1-hsiangkao@linux.alibaba.com
2024-10-11 13:36:58 +08:00
Chunhai Guo
79f504a2cd erofs: allocate more short-lived pages from reserved pool first
This patch aims to allocate bvpages and short-lived compressed pages
from the reserved pool first.

After applying this patch, there are three benefits.

1. It reduces the page allocation time.
 The bvpages and short-lived compressed pages account for about 4% of
the pages allocated from the system in the multi-app launch benchmarks
[1]. It reduces the page allocation time accordingly and lowers the
likelihood of blockage by page allocation in low memory scenarios.

2. The pages in the reserved pool will be allocated on demand.
 Currently, bvpages and short-lived compressed pages are short-lived
pages allocated from the system, and the pages in the reserved pool all
originate from short-lived pages. Consequently, the number of reserved
pool pages will increase to z_erofs_rsv_nrpages over time.
 With this patch, all short-lived pages are allocated from the reserved
pool first, so the number of reserved pool pages will only increase when
there are not enough pages. Thus, even if z_erofs_rsv_nrpages is set to
a large number for specific reasons, the actual number of reserved pool
pages may remain low as per demand. In the multi-app launch benchmarks
[1], z_erofs_rsv_nrpages is set at 256, while the number of reserved
pool pages remains below 64.

3. When erofs cache decompression is disabled
   (EROFS_ZIP_CACHE_DISABLED), all pages will *only* be allocated from
the reserved pool for erofs. This will significantly reduce the memory
pressure from erofs.

[1] For additional details on the multi-app launch benchmarks, please
refer to commit 0f6273ab46 ("erofs: add a reserved buffer pool for lz4
decompression").

Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240906121110.3701889-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-09-12 22:59:49 +08:00
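The allocation policy described above (take short-lived pages from the reserved pool first, so the pool only grows as far as demand requires) can be sketched as a small free-list. This is a Python model under stated assumptions, not the erofs code: `ReservedPool` and its methods are hypothetical, and a plain `object()` stands in for a page.

```python
class ReservedPool:
    """Toy page pool: take from the reserve first, then from the system."""

    def __init__(self, rsv_nrpages: int):
        self.rsv_nrpages = rsv_nrpages  # upper bound on cached pages
        self.pages = []                 # pages currently held in reserve
        self.system_allocs = 0          # how often we fell back to the system

    def alloc(self):
        if self.pages:
            return self.pages.pop()
        self.system_allocs += 1
        return object()  # stands in for a freshly allocated page

    def free(self, page):
        # Keep the page for reuse unless the reserve is already full.
        if len(self.pages) < self.rsv_nrpages:
            self.pages.append(page)


pool = ReservedPool(rsv_nrpages=256)
for _ in range(1000):
    in_flight = [pool.alloc() for _ in range(3)]  # peak demand of 3 pages
    for page in in_flight:
        pool.free(page)
print(len(pool.pages), pool.system_allocs)
```

In this model the reserve never holds more pages than the peak number in flight, even though the limit is set to 256.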
Gao Xiang
2349d2fa02 erofs: sunset unneeded NOFAILs
With iterative development, our codebase can now deal with compressed
buffer misses properly if both in-place I/O and compressed buffer
allocation fail.

Note that if readahead fails (with non-uptodate folios), the original
request will then fall back to synchronous read, and `.read_folio()`
should return appropriate errnos; otherwise -EIO will be passed to
user space, which is unexpected.

To simplify rarely encountered failure paths, a mimic decompression
will simply be used.  Before that, failure reasons are recorded in
compressed_bvecs[] and they also act as placeholders to avoid in-place
pages.  They will be parsed just before decompression and then passed
back to `.read_folio()`.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240905084732.2684515-1-hsiangkao@linux.alibaba.com
2024-09-12 20:26:43 +08:00
Gao Xiang
283213718f erofs: support compressed inodes for fileio
Use pseudo bios just like the previous fscache approach since
merged bio_vecs can be filled properly with unique interfaces.

Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240830032840.3783206-3-hsiangkao@linux.alibaba.com
2024-09-10 15:27:09 +08:00
Gao Xiang
ce63cb62d7 erofs: support unencoded inodes for fileio
Since EROFS only needs to handle read requests in simple contexts,
just directly use vfs_iocb_iter_read() for data I/Os.

Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240905093031.2745929-1-hsiangkao@linux.alibaba.com
2024-09-10 15:26:36 +08:00
Gao Xiang
9e2f9d34dd erofs: handle overlapped pclusters out of crafted images properly
syzbot reported a task hang issue due to a deadlock case where it is
waiting for the folio lock of a cached folio that will be used for
cache I/Os.

After looking into the crafted fuzzed image, I found it's formed with
several overlapped big pclusters as below:

 Ext:   logical offset   |  length :     physical offset    |  length
   0:        0..   16384 |   16384 :     151552..    167936 |   16384
   1:    16384..   32768 |   16384 :     155648..    172032 |   16384
   2:    32768..   49152 |   16384 :  537223168.. 537239552 |   16384
...

Here, extent 0/1 are physically overlapped although it's entirely
_impossible_ for normal filesystem images generated by mkfs.

First, managed folios containing compressed data will be marked as
up-to-date and then unlocked immediately (unlike in-place folios) when
compressed I/Os are complete.  If physical blocks are not submitted in
the incremental order, there should be separate BIOs to avoid dependency
issues.  However, the current code mis-arranges z_erofs_fill_bio_vec()
and BIO submission which causes unexpected BIO waits.

Second, managed folios will be connected to their own pclusters for
efficient inter-queries.  However, this is somewhat hard to implement
easily if overlapped big pclusters exist.  Again, these only appear in
fuzzed images so let's simply fall back to temporary short-lived pages
for correctness.

Additionally, it justifies that referenced managed folios cannot be
truncated for now and reverts part of commit 2080ca1ed3 ("erofs: tidy
up `struct z_erofs_bvec`") for simplicity although it shouldn't make any
difference.

Reported-by: syzbot+4fc98ed414ae63d1ada2@syzkaller.appspotmail.com
Reported-by: syzbot+de04e06b28cfecf2281c@syzkaller.appspotmail.com
Reported-by: syzbot+c8c8238b394be4a1087d@syzkaller.appspotmail.com
Tested-by: syzbot+4fc98ed414ae63d1ada2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000002fda01061e334873@google.com
Fixes: 8e6c8fa9f2 ("erofs: enable big pcluster feature")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240910070847.3356592-1-hsiangkao@linux.alibaba.com
2024-09-10 15:26:15 +08:00
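The physical overlap between extents 0 and 1 in the table above can be checked mechanically. The Python sketch below uses the physical ranges exactly as printed in the commit message and treats each end offset as exclusive, which is consistent with the stated lengths of 16384. The function name `physically_overlapping` is hypothetical and this is not how the kernel detects the condition.

```python
def physically_overlapping(extents):
    """Return index pairs of extents whose physical ranges intersect.

    Each extent is (physical_start, physical_end) with an exclusive end.
    """
    pairs = []
    for i, (s1, e1) in enumerate(extents):
        for j in range(i + 1, len(extents)):
            s2, e2 = extents[j]
            if s1 < e2 and s2 < e1:  # standard half-open interval test
                pairs.append((i, j))
    return pairs


extents = [
    (151552, 167936),        # extent 0
    (155648, 172032),        # extent 1
    (537223168, 537239552),  # extent 2
]
print(physically_overlapping(extents))
```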
Dan Carpenter
a3c10bed33 erofs: silence uninitialized variable warning in z_erofs_scan_folio()
Smatch complains that:

    fs/erofs/zdata.c:1047 z_erofs_scan_folio()
    error: uninitialized symbol 'err'.

The issue is that if we hit the `!(map->m_flags & EROFS_MAP_MAPPED)`
condition, then "err" isn't set.  It's inside a loop, so we would have to
hit that condition on every iteration.  Initialize "err" to zero to
solve this.

Fixes: 5b9654efb6 ("erofs: teach z_erofs_scan_folios() to handle multi-page folios")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/f78ab50e-ed6d-4275-8dd4-a4159fa565a2@stanley.mountain
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-07-13 12:47:34 +08:00
Gao Xiang
1001042e54 erofs: avoid refcounting short-lived pages
LZ4 always reuses the decompressed buffer as its LZ77 sliding window
(dynamic dictionary) for optimal performance.  However, in specific
cases, the output buffer may not fully contain valid page cache pages,
resulting in the use of short-lived pages for temporary purposes.

Due to the limited sliding window size, LZ4 short-lived bounce pages can
also be reused in a sliding manner, so each bounce page can be vmapped
multiple times in different relative positions by design.  In order to
avoid double frees, currently, reuse counts are recorded via page
refcount, but it will no longer be used as-is in the future world of
Memdescs.

Just maintain a lookup table to check if a short-lived page is reused.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240711053659.1364989-1-hsiangkao@linux.alibaba.com
2024-07-11 15:14:26 +08:00
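The idea of replacing per-page reuse counts with a lookup table can be sketched as follows. This is a Python model only: `release_bounce_pages` is a hypothetical name, a Python set keyed by object identity stands in for whatever table the kernel code maintains, and `None` marks positions that are not backed by a bounce page.

```python
def release_bounce_pages(slots):
    """Free each distinct bounce page once, however many slots reference it.

    `slots` lists the page placed at each output position; None marks a
    position backed by a regular page-cache page that must not be freed here.
    """
    seen = set()
    freed = []
    for page in slots:
        if page is None or id(page) in seen:
            continue  # not a bounce page, or already released
        seen.add(id(page))
        freed.append(page)
    return freed


a, b = object(), object()
freed = release_bounce_pages([a, None, b, a, b])
print(len(freed))
```

The table answers "has this page already been handled?" without touching any per-page reference count.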
Gao Xiang
5a7cce827e erofs: refine z_erofs_{init,exit}_subsystem()
Introduce z_erofs_{init,exit}_decompressor() to unexport
z_erofs_{deflate,lzma,zstd}_{init,exit}().

Besides, call them in z_erofs_{init,exit}_subsystem()
for simplicity.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240709094106.3018109-2-hsiangkao@linux.alibaba.com
2024-07-09 19:04:40 +08:00
Gao Xiang
392d20ccef erofs: move each decompressor to its own source file
Thus *_config() function declarations can be avoided.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240709094106.3018109-1-hsiangkao@linux.alibaba.com
2024-07-09 19:04:40 +08:00
Gao Xiang
2080ca1ed3 erofs: tidy up struct z_erofs_bvec
After revisiting the design, I believe `struct z_erofs_bvec` should
be page-based instead of folio-based due to the reasons below:

 - The smallest memory mapping unit is a page;

 - Under certain circumstances, only temporary pages need to be
   used instead of folios since refcount, mapcount for such pages are
   unnecessary;

 - Decompressors handle all types of pages including temporary pages,
   not only folios.

When handling `struct z_erofs_bvec`, all folio-related information
is now accessed using the page_folio() helper.

The final goal of this round of adaptation is to eliminate direct
accesses to `struct page` in the EROFS codebase, except for some
exceptions like `z_erofs_is_shortlived_page()` and
`z_erofs_page_is_invalidated()`, which require a new helper to
determine the memdesc type of an arbitrary page.

Actually large folios of compressed files seem to work now, yet I tend
to conduct more tests before officially enabling this for all scenarios.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240703120051.3653452-4-hsiangkao@linux.alibaba.com
2024-07-08 22:09:42 +08:00
Gao Xiang
5b9654efb6 erofs: teach z_erofs_scan_folios() to handle multi-page folios
Previously, a folio just contains one page.  In order to enable large
folios, z_erofs_scan_folios() needs to handle multi-page folios.

First, this patch eliminates all gotos.  Instead, the new loop deals
with multiple parts in each folio.  It's simple to handle the parts
which belong to unmapped extents or fragment extents; but for encoded
extents, the page boundaries need to be considered for `tight` and
`split` to keep in-place I/Os working correctly: when a part crosses the
page boundary, they need to be reset properly.

Besides, simplify `tight` derivation since Z_EROFS_PCLUSTER_HOOKED
has been removed for quite a while.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240703120051.3653452-3-hsiangkao@linux.alibaba.com
2024-07-08 22:09:42 +08:00
Gao Xiang
90cd33d793 erofs: convert z_erofs_read_fragment() to folios
Just a straight-forward conversion.  No logic changes.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240703120051.3653452-2-hsiangkao@linux.alibaba.com
2024-07-08 22:09:42 +08:00
Gao Xiang
1a4821a0a0 erofs: convert z_erofs_pcluster_readmore() to folios
Unlike `pagecache_get_page()`, `__filemap_get_folio()` returns error
pointers instead of NULL, thus switching to `IS_ERR_OR_NULL`.

Apart from that, it's just a straightforward conversion.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240703120051.3653452-1-hsiangkao@linux.alibaba.com
2024-07-08 22:09:41 +08:00
Al Viro
5587a8172e z_erofs_pcluster_begin(): don't bother with rounding position down
... and be more idiomatic when calculating ->pageofs_in.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20240425200017.GF1031757@ZenIV
[ Gao Xiang: don't use `offset_in_page(mptr)` due to EROFS_NO_KMAP. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-05-18 01:53:04 +08:00
Al Viro
e09815446d erofs: mechanically convert erofs_read_metabuf() to offsets
just lift the call of erofs_pos() into the callers; it will
collapse in most of them, but that's better done caller-by-caller.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20240425195846.GC1031757@ZenIV
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-05-18 01:46:18 +08:00
Al Viro
958b9f85f8 erofs_buf: store address_space instead of inode
... seeing that ->i_mapping is the only thing we want from the inode.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-04-25 00:57:14 -04:00
Al Viro
469ad583c1 erofs: switch erofs_bread() to passing offset instead of block number
Callers are happier that way, especially since we no longer need to
play with splitting offset into block number and offset within block,
passing the former to erofs_bread(), then adding the latter...

erofs_bread() always reads entire pages, anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-04-07 03:04:50 -04:00
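The calling-convention change above can be illustrated by comparing the two styles on a byte buffer. This Python sketch is only a model: `BLKSZ`, `bread_by_block` and `bread_by_offset` are hypothetical, and a `bytes` object stands in for the on-disk image.

```python
BLKSZ = 4096  # assumed block size for this sketch


def bread_by_block(image: bytes, blknr: int) -> bytes:
    """Old style: return the data starting at a block boundary."""
    return image[blknr * BLKSZ:]


def bread_by_offset(image: bytes, offset: int) -> bytes:
    """New style: return the data starting at the byte offset itself."""
    return image[offset:]


image = bytes(range(256)) * 64          # 16384 bytes of sample data
offset = 5000

# Old callers split the offset, read the block, then add the remainder back.
blknr, in_block = divmod(offset, BLKSZ)
old = bread_by_block(image, blknr)[in_block:in_block + 4]

# New callers pass the offset straight through.
new = bread_by_offset(image, offset)[:4]
print(old == new)
```

Both paths reach the same bytes, but the second one needs no split-and-recombine step in the caller.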
Jingbo Xu
a1bafc3109 erofs: support compressed inodes over fscache
Since fscache can utilize iov_iter to write dest buffers, bio_vec can
be used in this way too.

To simplify this, pseudo bios are prepared and bio_vec will be filled
with bio_add_page().  And a common .bi_end_io will be called directly
to handle I/O completions.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240308094159.40547-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-03-10 18:41:32 +08:00
Gao Xiang
706fd68fce erofs: refine managed cache operations to folios
Convert erofs_try_to_free_all_cached_pages() and
z_erofs_cache_release_folio().

Besides, erofs_page_is_managed() is moved to zdata.c and renamed
as erofs_folio_is_managed().

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240305091448.1384242-6-hsiangkao@linux.alibaba.com
2024-03-10 18:41:25 +08:00
Gao Xiang
9266f2dc5e erofs: convert z_erofs_submissionqueue_endio() to folios
Use bio_for_each_folio() to iterate over each folio in the bio; there
are no large folios for now.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240305091448.1384242-5-hsiangkao@linux.alibaba.com
2024-03-10 18:41:16 +08:00
Gao Xiang
92cc38e02a erofs: convert z_erofs_fill_bio_vec() to folios
Introduce a folio member to `struct z_erofs_bvec` and convert most
of z_erofs_fill_bio_vec() to folios, which is still straight-forward.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240305091448.1384242-4-hsiangkao@linux.alibaba.com
2024-03-10 18:41:00 +08:00
Gao Xiang
19fb9070c2 erofs: get rid of justfound debugging tag
`justfound` was introduced to identify cached folios that are just added
to compressed bvecs so that more checks can be applied in the I/O
submission path.

EROFS is now quite stable compared to the codebase at that stage.
`justfound` has become a burden for upcoming features.  Drop it.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240305091448.1384242-3-hsiangkao@linux.alibaba.com
2024-03-10 18:40:49 +08:00
Gao Xiang
0e25a788ea erofs: convert z_erofs_do_read_page() to folios
It is a straight-forward conversion. Besides, it's renamed as
z_erofs_scan_folio().

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240305091448.1384242-2-hsiangkao@linux.alibaba.com
2024-03-10 18:40:22 +08:00
Gao Xiang
d136d33586 erofs: convert z_erofs_onlinepage_.* to folios
Online folios are locked file-backed folios which will eventually
keep decoded (e.g. decompressed) data of each inode for end users to
utilize.  It may belong to a few pclusters and contain other data (e.g.
compressed data for inplace I/Os) temporarily in a time-sharing manner
to reduce memory footprints for low-end storage devices with high
latencies under heavy I/O pressure.

Apart from folio_end_read() usage, it's a straight-forward conversion.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240305091448.1384242-1-hsiangkao@linux.alibaba.com
2024-03-10 18:39:37 +08:00
Chunhai Guo
d9281660ff erofs: relaxed temporary buffers allocation on readahead
Even with inplace decompression, a few temporary buffers may still be
needed for a single decompression shot (e.g. 16 pages for a 64k sliding
window or 4 pages for a 16k sliding window).  In low-memory scenarios,
it would be better to try to allocate with GFP_NOWAIT on readahead
first.  That can help reduce the time spent on page allocation under
sustained memory pressure.

Here are detailed performance numbers under multi-app launch benchmark
workload [1] on ARM64 Android devices (8-core CPU and 8GB of memory)
running a 5.15 LTS kernel with EROFS of 4k pclusters:

+----------------------------------------------+
|      LZ4       | vanilla | patched |  diff   |
|----------------+---------+---------+---------|
|  Average (ms)  |  3364   |  2684   | -20.21% | [64k sliding window]
|----------------+---------+---------+---------|
|  Average (ms)  |  2079   |  1610   | -22.56% | [16k sliding window]
+----------------------------------------------+

The total size of system images for 4k pclusters is almost unchanged:
(64k sliding window)  9,117,044 KB
(16k sliding window)  9,113,096 KB

Therefore, in addition to switching the sliding window from 64k to 16k,
after applying this patch, it can eventually save 52.14% (3364 -> 1610)
on average with no memory reservation.  That is particularly useful for
embedded devices with limited resources.
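
The percentages quoted above follow from the averages in the table.  A
small sketch of the arithmetic; demo_reduction_pct() is a hypothetical
name used only for illustration:

```c
/* Relative reduction from 'before' to 'after', in percent. */
static double demo_reduction_pct(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

Feeding it the table values (3364 -> 2684, 2079 -> 1610) and the combined
case (3364 -> 1610) gives roughly 20.21, 22.56 and 52.14 respectively.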

[1] https://lore.kernel.org/r/20240109074143.4138783-1-guochunhai@vivo.com

Suggested-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20240126140142.201718-1-hsiangkao@linux.alibaba.com
2024-01-27 12:28:08 +08:00
Gao Xiang
cc4b2dd95f erofs: fix infinite loop due to a race of filling compressed_bvecs
I encountered a race issue after lengthy (~594647 secs) stress tests on
a 64k-page arm64 VM with several 4k-block EROFS images.  The timing
is like below:

z_erofs_try_inplace_io                  z_erofs_fill_bio_vec
  cmpxchg(&compressed_bvecs[].page,
          NULL, ..)
                                        [access bufvec]
  compressed_bvecs[] = *bvec;

Previously, z_erofs_submit_queue() accessed only bufvec->page, so the
other fields in bufvec didn't matter.  Since subpage block support
landed, .offset and .end can be used too, but filling bufvec isn't an
atomic operation, which can cause inconsistency.

Let's use a spinlock to keep the atomicity of each bufvec.  More
specifically, just reuse the existing spinlock `pcl->obj.lockref.lock`
since it's rarely used (and held only briefly even when it is) as long
as the pcluster has a reference.
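
A userspace sketch of the idea, not the kernel change itself: a
multi-field slot is filled and read only under a lock, so a reader never
sees .page set while .offset and .end are stale.  The demo_* types and
the C11 atomic_flag spin loop are assumptions made for illustration; the
actual fix reuses `pcl->obj.lockref.lock`.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a compressed bvec with several fields. */
struct demo_bvec {
	void *page;
	unsigned int offset;
	unsigned int end;
};

struct demo_slot {
	atomic_flag lock;		/* toy spinlock, assumed for this sketch */
	struct demo_bvec bvec;
};

static void demo_lock(struct demo_slot *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock,
						 memory_order_acquire))
		;			/* spin until the flag is released */
}

static void demo_unlock(struct demo_slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Claim an empty slot and fill every field under the lock; 1 on success. */
static int demo_try_fill(struct demo_slot *s, const struct demo_bvec *src)
{
	int filled = 0;

	demo_lock(s);
	if (!s->bvec.page) {
		s->bvec = *src;
		filled = 1;
	}
	demo_unlock(s);
	return filled;
}

/* Take a consistent copy of all fields under the same lock. */
static struct demo_bvec demo_snapshot(struct demo_slot *s)
{
	struct demo_bvec copy;

	demo_lock(s);
	copy = s->bvec;
	demo_unlock(s);
	return copy;
}
```

The earlier pattern, a cmpxchg on .page followed by a plain struct
assignment, only made the claim of the slot atomic, not the fill.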

Fixes: 192351616a ("erofs: support I/O submission for sub-page compressed blocks")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Link: https://lore.kernel.org/r/20240125120039.3228103-1-hsiangkao@linux.alibaba.com
2024-01-26 18:07:36 +08:00
Jingbo Xu
97cf5d53b4 erofs: get rid of unneeded GFP_NOFS
Clean up some leftovers since there is no way for EROFS to be called
again from a reclaim context.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240124031945.130782-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-01-25 11:24:19 +08:00
Yue Hu
652cdaa886 erofs: allow partially filled compressed bvecs
In order to reduce memory footprints even further, let's allow
partially filled compressed bvecs for readahead to bail out later.

Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231221062341.23901-1-zbestahu@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-12-21 22:58:21 +08:00
Gao Xiang
0ee3a0d59e erofs: enable sub-page compressed block support
Let's just disable cached decompression and inplace I/Os for partial
pages as the first step in order to enable initial sub-page block
support.  In other words, currently it works primarily based on
temporary short-lived pages.  Don't expect too much in terms of
performance.

Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-6-hsiangkao@linux.alibaba.com
2023-12-18 15:49:39 +08:00
Gao Xiang
e5aba911de erofs: fix ztailpacking for subpage compressed blocks
`pageofs_in` should be the compressed data offset of the page rather
than of the block.

Acked-by: Chao Yu <chao@kernel.org>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231214161337.753049-1-hsiangkao@linux.alibaba.com
2023-12-18 15:49:07 +08:00
Gao Xiang
54ed3fdd66 erofs: record pclustersize in bytes instead of pages
Currently, compressed sizes are recorded in pages using `pclusterpages`.
However, for tailpacking pclusters, `tailpacking_size` is used instead.

This approach doesn't work when dealing with sub-page blocks. To address
this, let's switch them to the unified `pclustersize` in bytes.
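
A small sketch of why a page-granular size loses information for
sub-page blocks.  DEMO_PAGE_SIZE and demo_size_in_pages() are
hypothetical and assume 4k pages:

```c
#define DEMO_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Round a byte count up to whole pages, as a page-based field stores it. */
static unsigned int demo_size_in_pages(unsigned int bytes)
{
	return (bytes + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
}
```

Under these assumptions a 512-byte pcluster and a 4096-byte pcluster
both round to one page, so only a byte count can tell them apart.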

Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-3-hsiangkao@linux.alibaba.com
2023-12-15 01:47:06 +08:00
Gao Xiang
192351616a erofs: support I/O submission for sub-page compressed blocks
Add a basic I/O submission path first to support sub-page blocks:

 - Temporary short-lived pages will be used entirely;

 - In-place I/O pages can be used partially, but compressed pages need
   to be able to be mapped in contiguous virtual memory.

As a start, cached decompression is currently explicitly disabled for
sub-page blocks; it will be supported in the future.

Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-2-hsiangkao@linux.alibaba.com
2023-12-15 01:46:53 +08:00