Pull persistent dentry infrastructure and conversion from Al Viro:
"Some filesystems use a kinda-sorta controlled dentry refcount leak to
pin dentries of created objects in dcache (and undo it when removing
those). A reference is grabbed and not released, but it's not actually
_stored_ anywhere.
That works, but it's hard to follow and verify; among other things, we
have no way to tell _which_ of the increments is intended to be an
unpaired one. Worse, on removal we need to decide whether the
reference had already been dropped, which can be non-trivial if that
removal is on umount and we need to figure out if this dentry is
pinned due to e.g. unlink() not done. Usually that is handled by using
kill_litter_super() as ->kill_sb(), but there are open-coded special
cases of the same (consider e.g. /proc/self).
Things get simpler if we introduce a new dentry flag
(DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set
claims responsibility for +1 in refcount.
The end result this series is aiming for:
- get these unbalanced dget() and dput() replaced with new primitives
that would, in addition to adjusting refcount, set and clear
persistency flag.
- instead of having kill_litter_super() mess with removing the
remaining "leaked" references (e.g. for all tmpfs files that hadn't
been removed prior to umount), have the regular
shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries,
dropping the corresponding reference if it had been set. After that
kill_litter_super() becomes an equivalent of kill_anon_super().
Doing that in a single step is not feasible - it would affect too many
places in too many filesystems. It has to be split into a series.
This work has really started early in 2024; quite a few preliminary
pieces have already gone into mainline. This chunk is finally getting
to the meat of that stuff - infrastructure and most of the conversions
to it.
Some pieces are still sitting in the local branches, but the bulk of
that stuff is here"
* tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
d_make_discardable(): warn if given a non-persistent dentry
kill securityfs_recursive_remove()
convert securityfs
get rid of kill_litter_super()
convert rust_binderfs
convert nfsctl
convert rpc_pipefs
convert hypfs
hypfs: swich hypfs_create_u64() to returning int
hypfs: switch hypfs_create_str() to returning int
hypfs: don't pin dentries twice
convert gadgetfs
gadgetfs: switch to simple_remove_by_name()
convert functionfs
functionfs: switch to simple_remove_by_name()
functionfs: fix the open/removal races
functionfs: need to cancel ->reset_work in ->kill_sb()
functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}()
functionfs: don't abuse ffs_data_closed() on fs shutdown
convert selinuxfs
...
Pull MM updates from Andrew Morton:
"__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
Rework the vmalloc() code to support non-blocking allocations
(GFP_ATOIC, GFP_NOWAIT)
"ksm: fix exec/fork inheritance" (xu xin)
Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
inherited across fork/exec
"mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
Some light maintenance work on the zswap code
"mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
Enhance the /sys/kernel/debug/page_owner debug feature by adding
unique identifiers to differentiate the various stack traces so
that userspace monitoring tools can better match stack traces over
time
"mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
Minor alterations to the page allocator's per-cpu-pages feature
"Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
Address a scalability issue in userfaultfd's UFFDIO_MOVE operation
"kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)
"drivers/base/node: fold node register and unregister functions" (Donet Tom)
Clean up the NUMA node handling code a little
"mm: some optimizations for prot numa" (Kefeng Wang)
Cleanups and small optimizations to the NUMA allocation hinting
code
"mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
Address long lock hold times at boot on large machines. These were
causing (harmless) softlockup warnings
"optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
Remove some now-unnecessary work from page reclaim
"mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
Enhance the DAMOS auto-tuning feature
"mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
configuration
"expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
Enhance the new(ish) file_operations.mmap_prepare() method and port
additional callsites from the old ->mmap() over to ->mmap_prepare()
"Fix stale IOTLB entries for kernel address space" (Lu Baolu)
Fix a bug (and possible security issue on non-x86) in the IOMMU
code. In some situations the IOMMU could be left hanging onto a
stale kernel pagetable entry
"mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
Clean up and optimize the folio splitting code
"mm, swap: misc cleanup and bugfix" (Kairui Song)
Some cleanups and a minor fix in the swap discard code
"mm/damon: misc documentation fixups" (SeongJae Park)
"mm/damon: support pin-point targets removal" (SeongJae Park)
Permit userspace to remove a specific monitoring target in the
middle of the current targets list
"mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
A couple of cleanups related to mm header file inclusion
"mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
improve the selection of swap devices for NUMA machines
"mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
Change the memory block labels from macros to enums so they will
appear in kernel debug info
"ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
Address an inefficiency when KSM unmerges an address range
"mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
Fix leaks and unhandled malloc() failures in DAMON userspace unit
tests
"some cleanups for pageout()" (Baolin Wang)
Clean up a couple of minor things in the page scanner's
writeback-for-eviction code
"mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
Move hugetlb's sysfs/sysctl handling code into a new file
"introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
Make the VMA guard regions available in /proc/pid/smaps and
improves the mergeability of guarded VMAs
"mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
Reduce mmap lock contention for callers performing VMA guard region
operations
"vma_start_write_killable" (Matthew Wilcox)
Start work on permitting applications to be killed when they are
waiting on a read_lock on the VMA lock
"mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
Add additional userspace testing of DAMON's "commit" feature
"mm/damon: misc cleanups" (SeongJae Park)
"make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
VMA is merged with another
"mm: support device-private THP" (Balbir Singh)
Introduce support for Transparent Huge Page (THP) migration in zone
device-private memory
"Optimize folio split in memory failure" (Zi Yan)
"mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
Some more cleanups in the folio splitting code
"mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
Clean up our handling of pagetable leaf entries by introducing the
concept of 'software leaf entries', of type softleaf_t
"reparent the THP split queue" (Muchun Song)
Reparent the THP split queue to its parent memcg. This is in
preparation for addressing the long-standing "dying memcg" problem,
wherein dead memcg's linger for too long, consuming memory
resources
"unify PMD scan results and remove redundant cleanup" (Wei Yang)
A little cleanup in the hugepage collapse code
"zram: introduce writeback bio batching" (Sergey Senozhatsky)
Improve zram writeback efficiency by introducing batched bio
writeback support
"memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
Clean up our handling of the interrupt safety of some memcg stats
"make vmalloc gfp flags usage more apparent" (Vishal Moola)
Clean up vmalloc's handling of incoming GFP flags
"mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
Teach soft dirty and userfaultfd write protect tracking to use
RISC-V's Svrsw60t59b extension
"mm: swap: small fixes and comment cleanups" (Youngjun Park)
Fix a small bug and clean up some of the swap code
"initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
Start work on converting the vma struct's flags to a bitmap, so we
stop running out of them, especially on 32-bit
"mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
Address a possible bug in the swap discard code and clean things
up a little
[ This merge also reverts commit ebb9aeb980 ("vfio/nvgrace-gpu:
register device memory for poison handling") because it looks
broken to me, I've asked for clarification - Linus ]
* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm: fix vma_start_write_killable() signal handling
mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
mm/swapfile: fix list iteration when next node is removed during discard
fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
mm/kfence: add reboot notifier to disable KFENCE on shutdown
memcg: remove inc/dec_lruvec_kmem_state helpers
selftests/mm/uffd: initialize char variable to Null
mm: fix DEBUG_RODATA_TEST indentation in Kconfig
mm: introduce VMA flags bitmap type
tools/testing/vma: eliminate dependency on vma->__vm_flags
mm: simplify and rename mm flags function for clarity
mm: declare VMA flags by bit
zram: fix a spelling mistake
mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
pagemap: update BUDDY flag documentation
mm: swap: remove scan_swap_map_slots() references from comments
mm: swap: change swap_alloc_slow() to void
mm, swap: remove redundant comment for read_swap_cache_async
mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
...
Pull directory locking updates from Christian Brauner:
"This contains the work to add centralized APIs for directory locking
operations.
This series is part of a larger effort to change directory operation
locking to allow multiple concurrent operations in a directory. The
ultimate goal is to lock the target dentry(s) rather than the whole
parent directory.
To help with changing the locking protocol, this series centralizes
locking and lookup in new helper functions. The helpers establish a
pattern where it is the dentry that is being locked and unlocked
(currently the lock is held on dentry->d_parent->d_inode, but that can
change in the future).
This also changes vfs_mkdir() to unlock the parent on failure, as well
as dput()ing the dentry. This allows end_creating() to only require
the target dentry (which may be IS_ERR() after vfs_mkdir()), not the
parent"
* tag 'vfs-6.19-rc1.directory.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nfsd: fix end_creating() conversion
VFS: introduce end_creating_keep()
VFS: change vfs_mkdir() to unlock on failure.
ecryptfs: use new start_creating/start_removing APIs
Add start_renaming_two_dentries()
VFS/ovl/smb: introduce start_renaming_dentry()
VFS/nfsd/ovl: introduce start_renaming() and end_renaming()
VFS: add start_creating_killable() and start_removing_killable()
VFS: introduce start_removing_dentry()
smb/server: use end_removing_noperm for for target of smb2_create_link()
VFS: introduce start_creating_noperm() and start_removing_noperm()
VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()
VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating()
VFS: tidy up do_unlinkat()
VFS: introduce start_dirop() and end_dirop()
debugfs: rename end_creating() to debugfs_end_creating()
vfs_mkdir() already drops the reference to the dentry on failure but it
leaves the parent locked.
This complicates end_creating() which needs to unlock the parent even
though the dentry is no longer available.
If we change vfs_mkdir() to unlock on failure as well as releasing the
dentry, we can remove the "parent" arg from end_creating() and simplify
the rules for calling it.
Note that cachefiles_get_directory() can choose to substitute an error
instead of actually calling vfs_mkdir(), for fault injection. In that
case it needs to call end_creating(), just as vfs_mkdir() now does on
error.
ovl_create_real() will now unlock on error. So the conditional
end_creating() after the call is removed, and end_creating() is called
internally on error.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-15-neilb@ownmail.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull fs_context updates from Al Viro:
"Change vfs_parse_fs_string() calling conventions
Get rid of the length argument (almost all callers pass strlen() of
the string argument there), add vfs_parse_fs_qstr() for the cases that
do want separate length"
* tag 'pull-fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
do_nfs4_mount(): switch to vfs_parse_fs_string()
change the calling conventions for vfs_parse_fs_string()
Pull vfs async directory updates from Christian Brauner:
"This contains further preparatory changes for the asynchronous directory
locking scheme:
- Add lookup_one_positive_killable() which allows overlayfs to
perform lookup that won't block on a fatal signal
- Unify the mount idmap handling in struct renamedata as a rename can
only happen within a single mount
- Introduce kern_path_parent() for audit which sets the path to the
parent and returns a dentry for the target without holding any
locks on return
- Rename kern_path_locked() as it is only used to prepare for the
removal of an object from the filesystem:
kern_path_locked() => start_removing_path()
kern_path_create() => start_creating_path()
user_path_create() => start_creating_user_path()
user_path_locked_at() => start_removing_user_path_at()
done_path_create() => end_creating_path()
NA => end_removing_path()"
* tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
debugfs: rename start_creating() to debugfs_start_creating()
VFS: rename kern_path_locked() and related functions.
VFS/audit: introduce kern_path_parent() for audit
VFS: unify old_mnt_idmap and new_mnt_idmap in renamedata
VFS: discard err2 in filename_create()
VFS/ovl: add lookup_one_positive_killable()
kern_path_locked() is now only used to prepare for removing an object
from the filesystem (and that is the only credible reason for wanting a
positive locked dentry). Thus it corresponds to kern_path_create() and
so should have a corresponding name.
Unfortunately the name "kern_path_create" is somewhat misleading as it
doesn't actually create anything. The recently added
simple_start_creating() provides a better pattern I believe. The
"start" can be matched with "end" to bracket the creating or removing.
So this patch changes names:
kern_path_locked -> start_removing_path
kern_path_create -> start_creating_path
user_path_create -> start_creating_user_path
user_path_locked_at -> start_removing_user_path_at
done_path_create -> end_creating_path
and also introduces end_removing_path() which is identical to
end_creating_path().
__start_removing_path (which was __kern_path_locked) is enhanced to
call mnt_want_write() for consistency with the start_creating_path().
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Christian Brauner <brauner@kernel.org>
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.
The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the suffix.
No functional changes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Absolute majority of callers are passing the 4th argument equal to
strlen() of the 3rd one.
Drop the v_size argument, add vfs_parse_fs_qstr() for the cases that
want independent length.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull mmap_prepare updates from Christian Brauner:
"Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b ("mm:
introduce new .mmap_prepare() file callback").
This is preferred to the existing f_op->mmap() hook as it does require
a VMA to be established yet, thus allowing the mmap logic to invoke
this hook far, far earlier, prior to inserting a VMA into the virtual
address space, or performing any other heavy handed operations.
This allows for much simpler unwinding on error, and for there to be a
single attempt at merging a VMA rather than having to possibly
reattempt a merge based on potentially altered VMA state.
Far more importantly, it prevents inappropriate manipulation of
incompletely initialised VMA state, which is something that has been
the cause of bugs and complexity in the past.
The intent is to gradually deprecate f_op->mmap, and in that vein this
series coverts the majority of file systems to using f_op->mmap_prepare.
Prerequisite steps are taken - firstly ensuring all checks for mmap
capabilities use the file_has_valid_mmap_hooks() helper rather than
directly checking for f_op->mmap (which is now not a valid check) and
secondly updating daxdev_mapping_supported() to not require a VMA
parameter to allow ext4 and xfs to be converted.
Commit bb666b7c27 ("mm: add mmap_prepare() compatibility layer for
nested file systems") handles the nasty edge-case of nested file
systems like overlayfs, which introduces a compatibility shim to allow
f_op->mmap_prepare() to be invoked from an f_op->mmap() callback.
This allows for nested filesystems to continue to function correctly
with all file systems regardless of which callback is used. Once we
finally convert all file systems, this shim can be removed.
As a result, ecryptfs, fuse, and overlayfs remain unaltered so they
can nest all other file systems.
We additionally do not update resctl - as this requires an update to
remap_pfn_range() (or an alternative to it) which we defer to a later
series, equally we do not update cramfs which needs a mixed mapping
insertion with the same issue, nor do we update procfs, hugetlbfs,
syfs or kernfs all of which require VMAs for internal state and hooks.
We shall return to all of these later"
* tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: update porting, vfs documentation to describe mmap_prepare()
fs: replace mmap hook with .mmap_prepare for simple mappings
fs: convert most other generic_file_*mmap() users to .mmap_prepare()
fs: convert simple use of generic_file_*_mmap() to .mmap_prepare()
mm/filemap: introduce generic_file_*_mmap_prepare() helpers
fs/xfs: transition from deprecated .mmap hook to .mmap_prepare
fs/ext4: transition from deprecated .mmap hook to .mmap_prepare
fs/dax: make it possible to check dev dax support without a VMA
fs: consistently use can_mmap_file() helper
mm/nommu: use file_has_valid_mmap_hooks() helper
mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
Pull async directory updates from Christian Brauner:
"This contains preparatory changes for the asynchronous directory
locking scheme.
While the locking scheme is still very much controversial and we're
still far away from landing any actual changes in that area the
preparatory work that we've been upstreaming for a while now has been
very useful. This is another set of minor changes and cleanups"
* tag 'vfs-6.17-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
exportfs: use lookup_one_unlocked()
coda: use iterate_dir() in coda_readdir()
VFS: Minor fixes for porting.rst
VFS: merge lookup_one_qstr_excl_raw() back into lookup_one_qstr_excl()
Pull dentry d_flags updates from Al Viro:
"The current exclusion rules for dentry->d_flags stores are rather
unpleasant. The basic rules are simple:
- stores to dentry->d_flags are OK under dentry->d_lock
- stores to dentry->d_flags are OK in the dentry constructor, before
becomes potentially visible to other threads
Unfortunately, there's a couple of exceptions to that, and that's
where the headache comes from.
The main PITA comes from d_set_d_op(); that primitive sets ->d_op of
dentry and adjusts the flags that correspond to presence of individual
methods. It's very easy to misuse; existing uses _are_ safe, but proof
of correctness is brittle.
Use in __d_alloc() is safe (we are within a constructor), but we might
as well precalculate the initial value of 'd_flags' when we set the
default ->d_op for given superblock and set 'd_flags' directly instead
of messing with that helper.
The reasons why other uses are safe are bloody convoluted; I'm not
going to reproduce it here. See [1] for gory details, if you care. The
critical part is using d_set_d_op() only just prior to
d_splice_alias(), which makes a combination of d_splice_alias() with
setting ->d_op, etc a natural replacement primitive.
Better yet, if we go that way, it's easy to take setting ->d_op and
modifying 'd_flags' under ->d_lock, which eliminates the headache as
far as 'd_flags' exclusion rules are concerned. Other exceptions are
minor and easy to deal with.
What this series does:
- d_set_d_op() is no longer available; instead a new primitive
(d_splice_alias_ops()) is provided, equivalent to combination of
d_set_d_op() and d_splice_alias().
- new field of struct super_block - 's_d_flags'. This sets the
default value of 'd_flags' to be used when allocating dentries on
this filesystem.
- new primitive for setting 's_d_op': set_default_d_op(). This
replaces stores to 's_d_op' at mount time.
All in-tree filesystems converted; out-of-tree ones will get caught
by the compiler ('s_d_op' is renamed, so stores to it will be
caught). 's_d_flags' is set by the same primitive to match the
's_d_op'.
- a lot of filesystems had sb->s_d_op->d_delete equal to
always_delete_dentry; that is equivalent to setting
DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well
set that bit in 's_d_flags' and drop 'd_delete()' from
dentry_operations.
In quite a few cases that results in empty dentry_operations, which
means that we can get rid of those.
- kill simple_dentry_operations - not needed anymore
- massage d_alloc_parallel() to get rid of the other exception wrt
'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we
allocate the new dentry; no need to delay that until we commit to
using the sucker.
As the result, 'd_flags' stores are all either under ->d_lock or done
before the dentry becomes visible in any shared data structures"
Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1]
* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
configfs: use DCACHE_DONTCACHE
debugfs: use DCACHE_DONTCACHE
efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry()
9p: don't bother with always_delete_dentry
ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE
kill simple_dentry_operations
devpts, sunrpc, hostfs: don't bother with ->d_op
shmem: no dentry retention past the refcount reaching zero
d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier
make d_set_d_op() static
simple_lookup(): just set DCACHE_DONTCACHE
tracefs: Add d_delete to remove negative dentries
set_default_d_op(): calculate the matching value for ->d_flags
correct the set of flags forbidden at d_set_d_op() time
split d_flags calculation out of d_set_d_op()
new helper: set_default_d_op()
fuse: no need for special dentry_operations for root dentry
switch procfs from d_set_d_op() to d_splice_alias_ops()
new helper: d_splice_alias_ops()
procfs: kill ->proc_dops
...
Now that we have established .mmap_prepare() as the preferred means by
which filesystems establish state upon memory mapping of a file, update the
VFS and porting documentation to reflect this.
As part of this change, additionally update the VFS documentation to
contain the current state of the file_operations struct.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/20250723123036.35472-1-lorenzo.stoakes@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
collect_mounts() has several problems - one can't iterate over the results
directly, so it has to be done with callback passed to iterate_mounts();
it has an oopsable race with d_invalidate(); it creates temporary clones
of mounts invisibly for sync umount (IOW, you can have non-lazy umount
succeed leaving filesystem not mounted anywhere and yet still busy).
A saner approach is to give caller an array of struct path that would pin
every mount in a subtree, without cloning any mounts.
* collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone
* collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or
a pointer to array of struct path, one for each chunk of tree visible under
'where' (i.e. the first element is a copy of where, followed by (mount,root)
for everything mounted under it - the same set collect_mounts() would give).
Unlike collect_mounts(), the mounts are *not* cloned - we just get pinning
references to the roots of subtrees in the caller's namespace.
Array is terminated by {NULL, NULL} struct path. If it fits into
preallocated array (on-stack, normally), that's where it goes; otherwise
it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated'
is ignored (and expected to be NULL).
* drop_collected_paths(paths, preallocated) is given the array returned
by an earlier call of collect_paths() and the preallocated array passed to that
call. All mount/dentry references are dropped and array is kfree'd if it's not
equal to 'preallocated'.
* instead of iterate_mounts(), users should just iterate over array
of struct path - nothing exotic is needed for that. Existing users (all in
audit_tree.c) are converted.
[folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>]
Fixes: 80b5dce8c5 ("vfs: Add a function to lazily unmount all mounts from any dentry")
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert the last user (d_alloc_pseudo()) and be done with that.
Any out-of-tree filesystem using it should switch to d_splice_alias_ops()
or, better yet, check whether it really needs to have ->d_op vary among
its dentries.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... to be used instead of manually assigning to ->s_d_op.
All in-tree filesystem converted (and field itself is renamed,
so any out-of-tree ones in need of conversion will be caught
by compiler).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull automount updates from Al Viro:
"Automount wart removal
A bunch of odd boilerplate gone from instances - the reason for
those was the need to protect the yet-to-be-attched mount from
mark_mounts_for_expiry() deciding to take it out.
But that's easy to detect and take care of in mark_mounts_for_expiry()
itself; no need to have every instance simulate mount being busy by
grabbing an extra reference to it, with finish_automount() undoing
that once it attaches that mount.
Should've done it that way from the very beginning... This is a
flagday change, thankfully there are very few instances.
vfs_submount() is gone - its sole remaining user (trace_automount)
had been switched to saner primitives"
* tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kill vfs_submount()
saner calling conventions for ->d_automount()
Currently the calling conventions for ->d_automount() instances have
an odd wart - returned new mount to be attached is expected to have
refcount 2.
That kludge is intended to make sure that mark_mounts_for_expiry() called
before we get around to attaching that new mount to the tree won't decide
to take it out. finish_automount() drops the extra reference after it's
done with attaching mount to the tree - or drops the reference twice in
case of error. ->d_automount() instances have rather counterintuitive
boilerplate in them.
There's a much simpler approach: have mark_mounts_for_expiry() skip the
mounts that are yet to be mounted. And to hell with grabbing/dropping
those extra references. Makes for simpler correctness analysis, at that...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Acked-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
try_lookup_noperm() and d_hash_and_lookup() are nearly identical. The
former does some validation of the name where the latter doesn't.
Outside of the VFS that validation is likely valuable, and having only
one exported function for this task is certainly a good idea.
So make d_hash_and_lookup() local to VFS files and change all other
callers to try_lookup_noperm(). Note that the arguments are swapped.
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250319031545.2999807-6-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
The lookup_one_len family of functions is (now) only used internally by
a filesystem on itself either
- in a context where permission checking is irrelevant such as by a
virtual filesystem populating itself, or xfs accessing its ORPHANAGE
or dquota accessing the quota file; or
- in a context where a permission check (MAY_EXEC on the parent) has just
been performed such as a network filesystem finding in "silly-rename"
file in the same directory. This is also the context after the
_parentat() functions where currently lookup_one_qstr_excl() is used.
So the permission check is pointless.
The name "one_len" is unhelpful in understanding the purpose of these
functions and should be changed. Most of the callers pass the len as
"strlen()" so using a qstr and QSTR() can simplify the code.
This patch renames these functions (include lookup_positive_unlocked()
which is part of the family despite the name) to have a name based on
"lookup_noperm". They are changed to receive a 'struct qstr' instead
of separate name and len. In a few cases the use of QSTR() results in a
new call to strlen().
try_lookup_noperm() takes a pointer to a qstr instead of the whole
qstr. This is consistent with d_hash_and_lookup() (which is nearly
identical) and useful for lookup_noperm_unlocked().
The new lookup_noperm_common() doesn't take a qstr yet. That will be
tidied up in a subsequent patch.
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/20250319031545.2999807-5-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
The family of functions:
lookup_one()
lookup_one_unlocked()
lookup_one_positive_unlocked()
appear designed to be used by external clients of the filesystem rather
than by filesystems acting on themselves as the lookup_one_len family
are used.
They are used by:
btrfs/ioctl - which is a user-space interface rather than an internal
activity
exportfs - i.e. from nfsd or the open_by_handle_at interface
overlayfs - at access the underlying filesystems
smb/server - for file service
They should be used by nfsd (more than just the exportfs path) and
cachefs but aren't.
It would help if the documentation didn't claim they should "not be
called by generic code".
Also the path component name is passed as "name" and "len" which are
(confusingly?) separate by the "base". In some cases the len in simply
"strlen" and so passing a qstr using QSTR() would make the calling
clearer.
Other callers do pass separate name and len which are stored in a
struct. Sometimes these are already stored in a qstr, other times it
easily could be.
So this patch changes these three functions to receive a 'struct qstr *',
and improves the documentation.
QSTR_LEN() is added to make it easy to pass a QSTR containing a known
len.
[brauner@kernel.org: take a struct qstr pointer]
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/20250319031545.2999807-2-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull vfs async dir updates from Christian Brauner:
"This contains cleanups that fell out of the work from async directory
handling:
- Change kern_path_locked() and user_path_locked_at() to never return
a negative dentry. This simplifies the usability of these helpers
in various places
- Drop d_exact_alias() from the remaining place in NFS where it is
still used. This also allows us to drop the d_exact_alias() helper
completely
- Drop an unnecessary call to fh_update() from nfsd_create_locked()
- Change i_op->mkdir() to return a struct dentry
Change vfs_mkdir() to return a dentry provided by the filesystems
which is hashed and positive. This allows us to reduce the number
of cases where the resulting dentry is not positive to very few
cases. The code in these places becomes simpler and easier to
understand.
- Repack DENTRY_* and LOOKUP_* flags"
* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: fix inline emphasis warning
VFS: Change vfs_mkdir() to return the dentry.
nfs: change mkdir inode_operation to return alternate dentry if needed.
fuse: return correct dentry for ->mkdir
ceph: return the correct dentry on mkdir
hostfs: store inode in dentry after mkdir if possible.
Change inode_operations.mkdir to return struct dentry *
nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
nfs/vfs: discard d_exact_alias()
VFS: add common error checks to lookup_one_qstr_excl()
VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
VFS: repack LOOKUP_ bit flags.
VFS: repack DENTRY_ flags.
The function can be replaced by evict_inodes. The only difference is
that evict_inodes() skips the inodes with positive refcount without
touching ->i_lock, but they are equivalent as evict_inodes() repeats the
refcount check after having grabbed ->i_lock.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20250307144318.28120-2-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g. on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns. For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.
This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes. Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.
This lookup-after-mkdir requires that the directory remains locked to
avoid races. Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.
To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
NULL - the directory was created and no other dentry was used
ERR_PTR() - an error occurred
non-NULL - this other dentry was spliced in
This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations. Subsequent patches will make
further changes to some file-systems to return a correct dentry.
Not all filesystems reliably result in a positive hashed dentry:
- NFS, cifs, hostfs will sometimes need to perform a lookup of
the name to get inode information. Races could result in this
returning something different. Note that this lookup is
non-atomic which is what we are trying to avoid. Placing the
lookup in filesystem code means it only happens when the filesystem
has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
operation ensures that lookup will be called to correctly populate
the dentry. This could be fixed but I don't think it is important
to any of the users of vfs_mkdir() which look at the dentry.
The recommendation to use
d_drop();d_splice_alias()
is ugly but fits with current practice. A planned future patch will
change this.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Callers of lookup_one_qstr_excl() often check if the result is negative or
positive.
These changes can easily be moved into lookup_one_qstr_excl() by checking the
lookup flags:
LOOKUP_CREATE means it is NOT an error if the name doesn't exist.
LOOKUP_EXCL means it IS an error if the name DOES exist.
This patch adds these checks, then removes error checks from callers,
and ensures that appropriate flags are passed.
This subtly changes the meaning of LOOKUP_EXCL. Previously it could
only accompany LOOKUP_CREATE. Now it can accompany LOOKUP_RENAME_TARGET
as well. A couple of small changes are needed to accommodate this. The
NFS change is functionally a no-op but ensures nfs_is_exclusive_create() does
exactly what the name says.
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250217003020.3170652-3-neilb@suse.de
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
No callers of kern_path_locked() or user_path_locked_at() want a
negative dentry. So change them to return -ENOENT instead. This
simplifies callers.
This results in a subtle change to bcachefs in that an ioctl will now
return -ENOENT in preference to -EXDEV. I believe this restores the
behaviour to what it was prior to
Commit bbe6a7c899 ("bch2_ioctl_subvolume_destroy(): fix locking")
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250217003020.3170652-2-neilb@suse.de
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull vfs d_revalidate updates from Al Viro:
"Provide stable parent and name to ->d_revalidate() instances
Most of the filesystem methods where we care about dentry name and
parent have their stability guaranteed by the callers;
->d_revalidate() is the major exception.
It's easy enough for callers to supply stable values for expected name
and expected parent of the dentry being validated. That kills quite a
bit of boilerplate in ->d_revalidate() instances, along with a bunch
of races where they used to access ->d_name without sufficient
precautions"
* tag 'pull-revalidate' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
9p: fix ->rename_sem exclusion
orangefs_d_revalidate(): use stable parent inode and name passed by caller
ocfs2_dentry_revalidate(): use stable parent inode and name passed by caller
nfs: fix ->d_revalidate() UAF on ->d_name accesses
nfs{,4}_lookup_validate(): use stable parent inode passed by caller
gfs2_drevalidate(): use stable parent inode and name passed by caller
fuse_dentry_revalidate(): use stable parent inode and name passed by caller
vfat_revalidate{,_ci}(): use stable parent inode passed by caller
exfat_d_revalidate(): use stable parent inode passed by caller
fscrypt_d_revalidate(): use stable parent inode passed by caller
ceph_d_revalidate(): propagate stable name down into request encoding
ceph_d_revalidate(): use stable parent inode passed by caller
afs_d_revalidate(): use stable name and parent inode passed by caller
Pass parent directory inode and expected name to ->d_revalidate()
generic_ci_d_compare(): use shortname_storage
ext4 fast_commit: make use of name_snapshot primitives
dissolve external_name.u into separate members
make take_dentry_name_snapshot() lockless
dcache: back inline names with a struct-wrapped array of unsigned long
make sure that DNAME_INLINE_LEN is a multiple of word size
->d_revalidate() often needs to access dentry parent and name; that has
to be done carefully, since the locking environment varies from caller
to caller. We are not guaranteed that dentry in question will not be
moved right under us - not unless the filesystem is such that nothing
on it ever gets renamed.
It can be dealt with, but that results in boilerplate code that isn't
even needed - the callers normally have just found the dentry via dcache
lookup and want to verify that it's in the right place; they already
have the values of ->d_parent and ->d_name stable. There is a couple
of exceptions (overlayfs and, to less extent, ecryptfs), but for the
majority of calls that song and dance is not needed at all.
It's easier to make ecryptfs and overlayfs find and pass those values if
there's a ->d_revalidate() instance to be called, rather than doing that
in the instances.
This commit only changes the calling conventions; making use of supplied
values is left to followups.
NOTE: some instances need more than just the parent - things like CIFS
may need to build an entire path from filesystem root, so they need
more precautions than the usual boilerplate. This series doesn't
do anything to that need - these filesystems have to keep their locking
mechanisms (rename_lock loops, use of dentry_path_raw(), private rwsem
a-la v9fs).
One thing to keep in mind when using name is that name->name will normally
point into the pathname being resolved; the filename in question occupies
name->len bytes starting at name->name, and there is NUL somewhere after it,
but it the next byte might very well be '/' rather than '\0'. Do not
ignore name->len.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Deprecation period of reiserfs ends with the end of this year so it is
time to remove it from the kernel.
Acked-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Pull vfs blocksize updates from Al Viro:
"This gets rid of bogus set_blocksize() uses, switches it over
to be based on a 'struct file *' and verifies that the caller
has the device opened exclusively"
* tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make set_blocksize() fail unless block device is opened exclusive
set_blocksize(): switch to passing struct file *
btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens
swsusp: don't bother with setting block size
zram: don't bother with reopening - just use O_EXCL for open
swapon(2): open swap with O_EXCL
swapon(2)/swapoff(2): don't bother with block size
pktcdvd: sort set_blocksize() calls out
bcache_register(): don't bother with set_blocksize()
Pull dcache updates from Al Viro:
"Change of locking rules for __dentry_kill(), regularized refcounting
rules in that area, assorted cleanups and removal of weird corner
cases (e.g. now ->d_iput() on child is always called before the parent
might hit __dentry_kill(), etc)"
* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
dcache: remove unnecessary NULL check in dget_dlock()
kill DCACHE_MAY_FREE
__d_unalias() doesn't use inode argument
d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant
get rid of DCACHE_GENOCIDE
d_genocide(): move the extern into fs/internal.h
simple_fill_super(): don't bother with d_genocide() on failure
nsfs: use d_make_root()
d_alloc_pseudo(): move setting ->d_op there from the (sole) caller
kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller
retain_dentry(): introduce a trimmed-down lockless variant
__dentry_kill(): new locking scheme
d_prune_aliases(): use a shrink list
switch select_collect{,2}() to use of to_shrink_list()
to_shrink_list(): call only if refcount is 0
fold dentry_kill() into dput()
don't try to cut corners in shrink_lock_dentry()
fold the call of retain_dentry() into fast_dput()
Call retain_dentry() with refcount 0
dentry_kill(): don't bother with retain_dentry() on slow path
...
Pull rename updates from Al Viro:
"Fix directory locking scheme on rename
This was broken in 6.5; we really can't lock two unrelated directories
without holding ->s_vfs_rename_mutex first and in case of same-parent
rename of a subdirectory 6.5 ends up doing just that"
* tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
rename(): avoid a deadlock in the case of parents having no common ancestor
kill lock_two_inodes()
rename(): fix the locking of subdirectories
f2fs: Avoid reading renamed directory if parent does not change
ext4: don't access the source subdirectory content on same-directory rename
ext2: Avoid reading renamed directory if parent does not change
udf_rename(): only access the child content on cross-directory rename
ocfs2: Avoid touching renamed directory if parent does not change
reiserfs: Avoid touching renamed directory if parent does not change
... and fix the directory locking documentation and proof of correctness.
Holding ->s_vfs_rename_mutex *almost* prevents ->d_parent changes; the
case where we really don't want it is splicing the root of disconnected
tree to somewhere.
In other words, ->s_vfs_rename_mutex is sufficient to stabilize "X is an
ancestor of Y" only if X and Y are already in the same tree. Otherwise
it can go from false to true, and one can construct a deadlock on that.
Make lock_two_directories() report an error in such case and update the
callers of lock_rename()/lock_rename_child() to handle such errors.
And yes, such conditions are not impossible to create ;-/
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We should never lock two subdirectories without having taken
->s_vfs_rename_mutex; inode pointer order or not, the "order" proposed
in 28eceeda13 "fs: Lock moved directories" is not transitive, with
the usual consequences.
The rationale for locking renamed subdirectory in all cases was
the possibility of race between rename modifying .. in a subdirectory to
reflect the new parent and another thread modifying the same subdirectory.
For a lot of filesystems that's not a problem, but for some it can lead
to trouble (e.g. the case when short directory contents is kept in the
inode, but creating a file in it might push it across the size limit
and copy its contents into separate data block(s)).
However, we need that only in case when the parent does change -
otherwise ->rename() doesn't need to do anything with .. entry in the
first place. Some instances are lazy and do a tautological update anyway,
but it's really not hard to avoid.
Amended locking rules for rename():
find the parent(s) of source and target
if source and target have the same parent
lock the common parent
else
lock ->s_vfs_rename_mutex
lock both parents, in ancestor-first order; if neither
is an ancestor of another, lock the parent of source
first.
find the source and target.
if source and target have the same parent
if operation is an overwriting rename of a subdirectory
lock the target subdirectory
else
if source is a subdirectory
lock the source
if target is a subdirectory
lock the target
lock non-directories involved, in inode pointer order if both
source and target are such.
That way we are guaranteed that parents are locked (for obvious reasons),
that any renamed non-directory is locked (nfsd relies upon that),
that any victim is locked (emptiness check needs that, among other things)
and subdirectory that changes parent is locked (needed to protect the update
of .. entries). We are also guaranteed that any operation locking more
than one directory either takes ->s_vfs_rename_mutex or locks a parent
followed by its child.
Cc: stable@vger.kernel.org
Fixes: 28eceeda13 "fs: Lock moved directories"
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently we enter __dentry_kill() with parent (along with the victim
dentry and victim's inode) held locked. Then we
mark dentry refcount as dead
call ->d_prune()
remove dentry from hash
remove it from the parent's list of children
unlock the parent, don't need it from that point on
detach dentry from inode, unlock dentry and drop the inode
(via ->d_iput())
call ->d_release()
regain the lock on dentry
check if it's on a shrink list (in which case freeing its empty husk
has to be left to shrink_dentry_list()) or not (in which case we can free it
ourselves). In the former case, mark it as an empty husk, so that
shrink_dentry_list() would know it can free the sucker.
drop the lock on dentry
... and usually the caller proceeds to drop a reference on the parent,
possibly retaking the lock on it.
That is painful for a bunch of reasons, starting with the need to take locks
out of order, but not limited to that - the parent of positive dentry can
change if we drop its ->d_lock, so getting these locks has to be done with
care. Moreover, as soon as dentry is out of the parent's list of children,
shrink_dcache_for_umount() won't see it anymore, making it appear as if
the parent is inexplicably busy. We do work around that by having
shrink_dentry_list() decrement the parent's refcount first and put it on
shrink list to be evicted once we are done with __dentry_kill() of child,
but that may in some cases lead to ->d_iput() on child called after the
parent got killed. That doesn't happen in cases where in-tree ->d_iput()
instances might want to look at the parent, but that's brittle as hell.
Solution: do removal from the parent's list of children in the very
end of __dentry_kill(). As the result, the callers do not need to
lock the parent and by the time we really need the parent locked,
dentry is negative and is guaranteed not to be moved around.
It does mean that ->d_prune() will be called with parent not locked.
It also means that we might see dentries in process of being torn
down while going through the parent's list of children; those dentries
will be unhashed, negative and with refcount marked dead. In practice,
that's enough for in-tree code that looks through the list of children
to do the right thing as-is. Out-of-tree code might need to be adjusted.
Calling conventions: __dentry_kill(dentry) is called with dentry->d_lock
held, along with ->i_lock of its inode (if any). It either returns
the parent (locked, with refcount decremented to 0) or NULL (if there'd
been no parent or if refcount decrement for parent hadn't reached 0).
lock_for_kill() is adjusted for new requirements - it doesn't touch
the parent's ->d_lock at all.
Callers adjusted. Note that for dput() we don't need to bother with
fast_dput() for the parent - we just need to check retain_dentry()
for it, since its ->d_lock is still held since the moment when
__dentry_kill() had taken it to remove the victim from the list of
children.
The kludge with early decrement of parent's refcount in
shrink_dentry_list() is no longer needed - shrink_dcache_for_umount()
sees the half-killed dentries in the list of children for as long
as they are pinning the parent. They are easily recognized and
accounted for by select_collect(), so we know we are not done yet.
As the result, we always have the expected ordering for ->d_iput()/->d_release()
vs. __dentry_kill() of the parent, no exceptions. Moreover, the current
rules for shrink lists (one must make sure that shrink_dcache_for_umount()
won't happen while any dentries from the superblock in question are on
any shrink lists) are gone - shrink_dcache_for_umount() will do the
right thing in all cases, taking such dentries out. Their empty
husks (memory occupied by struct dentry itself + its external name,
if any) will remain on the shrink lists, but they are no obstacles
to filesystem shutdown. And such husks will get freed as soon as
shrink_dentry_list() of the list they are on gets to them.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of bumping it from 0 to 1, calling retain_dentry(), then
decrementing it back to 0 (with ->d_lock held all the way through),
just leave refcount at 0 through all of that.
It will have a visible effect for ->d_delete() - now it can be
called with refcount 0 instead of 1 and it can no longer play
silly buggers with dropping/regaining ->d_lock. Not that any
in-tree instances tried to (it's pretty hard to get right).
Any out-of-tree ones will have to adjust (assuming they need any
changes).
Note that we do not need to extend rcu-critical area here - we have
verified that refcount is non-negative after having grabbed ->d_lock,
so nobody will be able to free dentry until they get into __dentry_kill(),
which won't happen until they manage to grab ->d_lock.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Saves a pointer per struct dentry and actually makes the things less
clumsy. Cleaned the d_walk() and dcache_readdir() a bit by use
of hlist_for_... iterators.
A couple of new helpers - d_first_child() and d_next_sibling(),
to make the expressions less awful.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs fanotify fsid updates from Christian Brauner:
"This work is part of the plan to enable fanotify to serve as a drop-in
replacement for inotify. While inotify is availabe on all filesystems,
fanotify currently isn't.
In order to support fanotify on all filesystems two things are needed:
(1) all filesystems need to support AT_HANDLE_FID
(2) all filesystems need to report a non-zero f_fsid
This contains (1) and allows filesystems to encode non-decodable file
handlers for fanotify without implementing any exportfs operations by
encoding a file id of type FILEID_INO64_GEN from i_ino and
i_generation.
Filesystems that want to opt out of encoding non-decodable file ids
for fanotify that don't support NFS export can do so by providing an
empty export_operations struct.
This also partially addresses (2) by generating f_fsid for simple
filesystems as well as freevxfs. Remaining filesystems will be dealt
with by separate patches.
Finally, this contains the patch from the current exportfs maintainers
which moves exportfs under vfs with Chuck, Jeff, and Amir as
maintainers and vfs.git as tree"
* tag 'vfs-6.7.fsid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
MAINTAINERS: create an entry for exportfs
fs: fix build error with CONFIG_EXPORTFS=m or not defined
freevxfs: derive f_fsid from bdev->bd_dev
fs: report f_fsid from s_dev for "simple" filesystems
exportfs: support encoding non-decodeable file handles by default
exportfs: define FILEID_INO64_GEN* file handle types
exportfs: make ->encode_fh() a mandatory method for NFS export
exportfs: add helpers to check if filesystem can encode/decode file handles
Rename the default helper for encoding FILEID_INO32_GEN* file handles to
generic_encode_ino32_fh() and convert the filesystems that used the
default implementation to use the generic helper explicitly.
After this change, exportfs_encode_inode_fh() no longer has a default
implementation to encode FILEID_INO32_GEN* file handles.
This is a step towards allowing filesystems to encode non-decodeable
file handles for fanotify without having to implement any
export_operations.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023180801.2953446-3-amir73il@gmail.com
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
We've changed the holder of the block device which has consequences.
Document this clearly and in detail so filesystem and vfs developers
have a proper digital paper trail.
Signed-off-by: Christian Brauner <brauner@kernel.org>