linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-18 06:44:00 -04:00

Author	SHA1	Message	Date
Christian Brauner	ad4999496e	mount: always duplicate mount In the OPEN_TREE_NAMESPACE path vfs_open_tree() resolves a path via filename_lookup() without holding namespace_lock. Between the lookup and create_new_namespace() acquiring namespace_lock via LOCK_MOUNT_EXACT_COPY() another thread can unmount the mount, setting mnt->mnt_ns to NULL. When create_new_namespace() then checks !mnt->mnt_ns it incorrectly takes the swap-and-mntget path that was designed for fsmount()'s detached mounts. This reuses a mount whose mnt_mp_list is in an inconsistent state from the concurrent unmount, causing a general protection fault in __umount_mnt() -> hlist_del_init(&mnt->mnt_mp_list) during namespace teardown. Remove the !mnt->mnt_ns special case entirely. Instead, always duplicate the mount: - For OPEN_TREE_NAMESPACE use __do_loopback() which will properly clone the mount or reject it via may_copy_tree() if it was unmounted in the race window. - For fsmount() use clone_mnt() directly (via the new MOUNT_COPY_NEW flag) since the mount is freshly created by vfs_create_mount() and not in any namespace so __do_loopback()'s IS_MNT_UNBINDABLE, may_copy_tree, and __has_locked_children checks don't apply. Reported-by: syzbot+e4470cc28308f2081ec8@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-04-14 09:30:15 +02:00
Christian Brauner	ccfac16e0b	move_mount: allow MOVE_MOUNT_BENEATH on the rootfs Allow MOVE_MOUNT_BENEATH to target the caller's rootfs. When the target of a mount-beneath operation is the caller's root mount, verify that: (1) The caller is located at the root of the mount, as enforced by path_mounted() in do_lock_mount(). (2) Propagation from the parent mount would not overmount the target, to avoid propagating beneath the rootfs of other mount namespaces. The root-switching is decomposed into individually atomic, locally-scoped steps: mount-beneath inserts the new root under the old one, chroot(".") switches the caller's root, and umount2(".", MNT_DETACH) removes the old root. Since each step only modifies the caller's own state, this avoids cross-namespace vulnerabilities and inherent fork/unshare/setns races that a chroot_fs_refs()-based approach would have. Userspace can use the following workflow to switch roots: fd_tree = open_tree(-EBADF, "/newroot", OPEN_TREE_CLONE \| OPEN_TREE_CLOEXEC); fchdir(fd_tree); move_mount(fd_tree, "", AT_FDCWD, "/", MOVE_MOUNT_BENEATH \| MOVE_MOUNT_F_EMPTY_PATH); chroot("."); umount2(".", MNT_DETACH); Link: https://patch.msgid.link/20260224-work-mount-beneath-rootfs-v1-2-8c58bf08488f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-12 13:34:59 +01:00
Christian Brauner	c62a476693	move_mount: transfer MNT_LOCKED When performing a mount-beneath operation the target mount can often be locked: unshare(CLONE_NEWUSER \| CLONE_NEWNS); mount --beneath -t tmpfs tmpfs /proc will fail because the procfs mount on /proc became locked when the mount namespace was created from the parent mount namespace. Same logic for: unshare(CLONE_NEWUSER \| CLONE_NEWNS); mount --beneath -t tmpfs tmpfs / MNT_LOCKED is raised to prevent an unprivileged mount namespace from revealing whatever is under a given mount. To replace the rootfs we need to handle that case though. We can simply transfer the locked mount property from the top mount to the mount beneath. The new mount we mounted beneath the top mount takes over the job of the top mount in protecting the parent mount from being revealed. This leaves us free to allow the top mount to be unmounted. This also works during mount propagation and also works for the non-MOVE_MOUNT_BENEATH case: (1) move_mount(MOVE_MOUNT_BENEATH): @source_mnt->overmount always NULL (2) move_mount(): @source_mnt->overmount maybe !NULL For (1) can_move_mount_beneath() rejects overmounted @source_mnt (We could allow this but whatever it's not really a use-case and it's fugly to move an overmounted mount stack around. What are you even doing? So let's keep that restriction. For (2) we can have @source_mnt overmounted (Someone overmounted us while we locked the target mount.). Both are fine. @source_mnt will be mounted on whatever @q was mounted on and @q will be mounted on the top of the @source_mnt mount stack. Even in such cases we can unlock @q and lock @source_mnt if @q was locked. This effectively makes mount propagation useful in cases where a mount namespace has a locked mount somewhere and we propagate a new mount beneath it but the mount namespace could never get at it because the old top mount remains locked. Again, we just let the newly propagated mount take over the protection and unlock the top mount. Link: https://patch.msgid.link/20260224-work-mount-beneath-rootfs-v1-1-8c58bf08488f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-12 13:34:58 +01:00
Christian Brauner	9d4e752a24	namespace: allow creating empty mount namespaces Add support for creating a mount namespace that contains only a copy of the root mount from the caller's mount namespace, with none of the child mounts. This is useful for containers and sandboxes that want to start with a minimal mount table and populate it from scratch rather than inheriting and then tearing down the full mount tree. Two new flags are introduced: - CLONE_EMPTY_MNTNS for clone3(), using the 64-bit flag space. - UNSHARE_EMPTY_MNTNS for unshare(), reusing the CLONE_PARENT_SETTID bit which has no meaning for unshare. Both flags imply CLONE_NEWNS. For the unshare path, UNSHARE_EMPTY_MNTNS is converted to CLONE_EMPTY_MNTNS in unshare_nsproxy_namespaces() before it reaches copy_mnt_ns(), so the mount namespace code only needs to handle a single flag. In copy_mnt_ns(), when CLONE_EMPTY_MNTNS is set, clone_mnt() is used instead of copy_tree() to clone only the root mount. The caller's root and working directory are both reset to the root dentry of the new mount. The cleanup variables are changed from vfsmount pointers with __free(mntput) to struct path with __free(path_put) because the empty mount namespace path needs to release both mount and dentry references when replacing the caller's root and pwd. In the normal (non-empty) path only the mount component is set, and dput(NULL) is a no-op so path_put remains correct there as well. Link: https://patch.msgid.link/20260306-work-empty-mntns-consolidated-v1-1-6eb30529bbb0@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-12 13:33:55 +01:00
Christian Brauner	5e8969bd19	mount: add FSMOUNT_NAMESPACE Add FSMOUNT_NAMESPACE flag to fsmount() that creates a new mount namespace with the newly created filesystem attached to a copy of the real rootfs. This returns a namespace file descriptor instead of an O_PATH mount fd, similar to how OPEN_TREE_NAMESPACE works for open_tree(). This allows creating a new filesystem and immediately placing it in a new mount namespace in a single operation, which is useful for container runtimes and other namespace-based isolation mechanisms. The rootfs mount is created before copying the real rootfs for the new namespace meaning that the mount namespace id for the mount of the root of the namespace is bigger than the child mounted on top of it. We've never explicitly given the guarantee for such ordering and I doubt anyone relies on it. Accepting that lets us avoid copying the mount again and also avoids having to massage may_copy_tree() to grant an exception for fsmount->mnt->mnt_ns being NULL. Link: https://patch.msgid.link/20260122-work-fsmount-namespace-v1-3-5ef0a886e646@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-12 13:33:54 +01:00
Christian Brauner	ad4a3599e5	mount: simplify __do_loopback() Remove the OPEN_TREE_NAMESPACE flag checking from __do_loopback() and instead have callers pass CL_COPY_MNT_NS_FILE directly in copy_flags. Link: https://patch.msgid.link/20260122-work-fsmount-namespace-v1-2-5ef0a886e646@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-12 13:33:53 +01:00
Christian Brauner	0c0b046392	mount: start iterating from start of rbtree If the root of the namespace has an id that's greater than the child we'd not find it. Handle that case. Link: https://patch.msgid.link/20260122-work-fsmount-namespace-v1-1-5ef0a886e646@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-12 13:33:50 +01:00
Linus Torvalds	0e335a7745	Merge tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix an uninitialized variable in file_getattr(). The flags_valid field wasn't initialized before calling vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse - Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is not enabled. sysctl_hung_task_timeout_secs is 0 in that case causing spurious "waiting for writeback completion for more than 1 seconds" warnings - Fix a null-ptr-deref in do_statmount() when the mount is internal - Add missing kernel-doc description for the @private parameter in iomap_readahead() - Fix mount namespace creation to hold namespace_sem across the mount copy in create_new_namespace(). The previous drop-and-reacquire pattern was fragile and failed to clean up mount propagation links if the real rootfs was a shared or dependent mount - Fix /proc mount iteration where m->index wasn't updated when m->show() overflows, causing a restart to repeatedly show the same mount entry in a rapidly expanding mount table - Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the inode number is out of range - Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared. copy_mnt_ns() received the live fs_struct so if a subsequent namespace creation failed the rollback would leave pwd and root pointing to detached mounts. Always allocate a new fs_struct when CLONE_NEWNS is requested - fserror bug fixes: - Remove the unused fsnotify_sb_error() helper now that all callers have been converted to fserror_report_metadata - Fix a lockdep splat in fserror_report() where igrab() takes inode::i_lock which can be held in IRQ context. Replace igrab() with a direct i_count bump since filesystems should not report inodes that are about to be freed or not yet exposed - Handle error pointer in procfs for try_lookup_noperm() - Fix an integer overflow in ep_loop_check_proc() where recursive calls returning INT_MAX would overflow when +1 is added, breaking the recursion depth check - Fix a misleading break in pidfs * tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: avoid misleading break eventpoll: Fix integer overflow in ep_loop_check_proc() proc: Fix pointer error dereference fserror: fix lockdep complaint when igrabbing inode fsnotify: drop unused helper unshare: fix unshare_fs() handling minix: Correct errno in minix_new_inode namespace: fix proc mount iteration mount: hold namespace_sem across copy in create_new_namespace() iomap: Describe @private in iomap_readahead() statmount: Fix the null-ptr-deref in do_statmount() writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK fs: init flags_valid before calling vfs_fileattr_get	2026-02-25 10:34:23 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/\(alloc_objs(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Christian Brauner	4a403d7aa9	namespace: fix proc mount iteration The m->index isn't updated when m->show() overflows and retains its value before the current mount causing a restart to start at the same value. If that happens in short order to due a quickly expanding mount table this would cause the same mount to be shown again and again. Ensure that pos always equals the mount id of the mount that was returned by start/next. On restart after overflow mnt_find_id_at(pos) finds the exact mount. This should avoid duplicates, avoid skips and should handle concurrent modification just fine. Cc: <stable@vger.kernel.org> Fixed: `2eea9ce431` ("mounts: keep list of mounts in an rbtree") Link: https://patch.msgid.link/20260129-geleckt-treuhand-4bb940acacd9@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-02-18 14:04:42 +01:00
Christian Brauner	a41dbf5e00	mount: hold namespace_sem across copy in create_new_namespace() Fix an oversight when creating a new mount namespace. If someone had the bright idea to make the real rootfs a shared or dependent mount and it is later copied the copy will become a peer of the old real rootfs mount or a dependent mount of it. The namespace semaphore is dropped and we use mount lock exact to lock the new real root mount. If that fails or the subsequent do_loopback() fails we rely on the copy of the real root mount to be cleaned up by path_put(). The problem is that this doesn't deal with mount propagation and will leave the mounts linked in the propagation lists. When creating a new mount namespace create_new_namespace() first acquires namespace_sem to clone the nullfs root, drops it, then reacquires it via LOCK_MOUNT_EXACT which takes inode_lock first to respect the inode_lock -> namespace_sem lock ordering. This drop-and-reacquire pattern is fragile and was the source of the propagation cleanup bug fixed in the preceding commit. Extend lock_mount_exact() with a copy_mount mode that clones the mount under the locks atomically. When copy_mount is true, path_overmounted() is skipped since we're copying the mount, not mounting on top of it - the nullfs root always has rootfs mounted on top so the check would always fail. If clone_mnt() fails after get_mountpoint() has pinned the mountpoint, __unlock_mount() is used to properly unpin the mountpoint and release both locks. This allows create_new_namespace() to use LOCK_MOUNT_EXACT_COPY which takes inode_lock and namespace_sem once and holds them throughout the clone and subsequent mount operations, eliminating the drop-and-reacquire pattern entirely. Reported-by: syzbot+a89f9434fb5a001ccd58@syzkaller.appspotmail.com Fixes: `9b8a0ba682` ("mount: add OPEN_TREE_NAMESPACE") # mainline only Link: https://lore.kernel.org/699047f6.050a0220.2757fb.0024.GAE@google.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-02-18 14:04:42 +01:00
Qing Wang	81f16c9778	statmount: Fix the null-ptr-deref in do_statmount() If the mount is internal, it's mnt_ns will be MNT_NS_INTERNAL, which is defined as ERR_PTR(-EINVAL). So, in the do_statmount(), need to check ns of mount by IS_ERR() and return. Fixes: `0e5032237e` ("statmount: accept fd as a parameter") Reported-by: syzbot+9e03a9535ea65f687a44@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/698e287a.a70a0220.2c38d7.009e.GAE@google.com/ Signed-off-by: Qing Wang <wangqing7171@gmail.com> Link: https://patch.msgid.link/20260213103006.2472569-1-wangqing7171@gmail.com Reviewed-by: Bhavik Sachdev <b.sachdev1904@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-02-14 13:17:50 +01:00
Linus Torvalds	26c9342bb7	Merge tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs 'struct filename' updates from Al Viro: "[Mostly] sanitize struct filename handling" * tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits) sysfs(2): fs_index() argument is _not_ a pathname alpha: switch osf_mount() to strndup_user() ksmbd: use CLASS(filename_kernel) mqueue: switch to CLASS(filename) user_statfs(): switch to CLASS(filename) statx: switch to CLASS(filename_maybe_null) quotactl_block(): switch to CLASS(filename) chroot(2): switch to CLASS(filename) move_mount(2): switch to CLASS(filename_maybe_null) namei.c: switch user pathname imports to CLASS(filename{,_flags}) namei.c: convert getname_kernel() callers to CLASS(filename_kernel) do_f{chmod,chown,access}at(): use CLASS(filename_uflags) do_readlinkat(): switch to CLASS(filename_flags) do_sys_truncate(): switch to CLASS(filename) do_utimes_path(): switch to CLASS(filename_uflags) chdir(2): unspaghettify a bit... do_fchownat(): unspaghettify a bit... fspick(2): use CLASS(filename_flags) name_to_handle_at(): use CLASS(filename_uflags) vfs_open_tree(): use CLASS(filename_uflags) ...	2026-02-09 16:58:28 -08:00
Linus Torvalds	9e355113f0	Merge tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains a mix of VFS cleanups, performance improvements, API fixes, documentation, and a deprecation notice. Scalability and performance: - Rework pid allocation to only take pidmap_lock once instead of twice during alloc_pid(), improving thread creation/teardown throughput by 10-16% depending on false-sharing luck. Pad the namespace refcount to reduce false-sharing - Track file lock presence via a flag in ->i_opflags instead of reading ->i_flctx, avoiding false-sharing with ->i_readcount on open/close hot paths. Measured 4-16% improvement on 24-core open-in-a-loop benchmarks - Use a consume fence in locks_inode_context() to match the store-release/load-consume idiom, eliminating a hardware fence on some architectures - Annotate cdev_lock with __cacheline_aligned_in_smp to prevent false-sharing - Remove a redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu() that never fires since the caller already verifies it, eliminating a 100% mispredicted branch - Fix a 100% mispredicted likely() in devcgroup_inode_permission() that became wrong after a prior code reorder Bug fixes and correctness: - Make insert_inode_locked() wait for inode destruction instead of skipping, fixing a corner case where two matching inodes could exist in the hash - Move f_mode initialization before file_ref_init() in alloc_file() to respect the SLAB_TYPESAFE_BY_RCU ordering contract - Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with no buffers attached, preventing a null pointer dereference when AS_RELEASE_ALWAYS is set but no release_folio op exists - Fix select restart_block to store end_time as timespec64, avoiding truncation of tv_sec on 32-bit architectures - Make dump_inode() use get_kernel_nofault() to safely access inode and superblock fields, matching the dump_mapping() pattern API modernization: - Make posix_acl_to_xattr() allocate the buffer internally since every single caller was doing it anyway. Reduces boilerplate and unnecessary error checking across ~15 filesystems - Replace deprecated simple_strtoul() with kstrtoul() for the ihash_entries, dhash_entries, mhash_entries, and mphash_entries boot parameters, adding proper error handling - Convert chardev code to use guard(mutex) and __free(kfree) cleanup patterns - Replace min_t() with min() or umin() in VFS code to avoid silently truncating unsigned long to unsigned int - Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers already check the flag Deprecation: - Begin deprecating legacy BSD process accounting (acct(2)). The interface has numerous footguns and better alternatives exist (eBPF) Documentation: - Fix and complete kernel-doc for struct export_operations, removing duplicated documentation between ReST and source - Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait() Testing: - Add a kunit test for initramfs cpio handling of entries with filesize > PATH_MAX Misc: - Add missing <linux/init_task.h> include in fs_struct.c" * tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) posix_acl: make posix_acl_to_xattr() alloc the buffer fs: make insert_inode_locked() wait for inode destruction initramfs_test: kunit test for cpio.filesize > PATH_MAX fs: improve dump_inode() to safely access inode fields fs: add <linux/init_task.h> for 'init_fs' docs: exportfs: Use source code struct documentation fs: move initializing f_mode before file_ref_init() exportfs: Complete kernel-doc for struct export_operations exportfs: Mark struct export_operations functions at kernel-doc exportfs: Fix kernel-doc output for get_name() acct(2): begin the deprecation of legacy BSD process accounting device_cgroup: remove branch hint after code refactor VFS: fix __start_dirop() kernel-doc warnings fs: Describe @isnew parameter in ilookup5_nowait() fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS select: store end_time as timespec64 in restart block chardev: Switch to guard(mutex) and __free(kfree) namespace: Replace simple_strtoul with kstrtoul to parse boot params dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries ...	2026-02-09 15:13:05 -08:00
Linus Torvalds	157d3d6efd	Merge tag 'vfs-7.0-rc1.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: - statmount: accept fd as a parameter Extend struct mnt_id_req with a file descriptor field and a new STATMOUNT_BY_FD flag. When set, statmount() returns mount information for the mount the fd resides on — including detached mounts (unmounted via umount2(MNT_DETACH)). For detached mounts the STATMOUNT_MNT_POINT and STATMOUNT_MNT_NS_ID mask bits are cleared since neither is meaningful. The capability check is skipped for STATMOUNT_BY_FD since holding an fd already implies prior access to the mount and equivalent information is available through fstatfs() and /proc/pid/mountinfo without privilege. Includes comprehensive selftests covering both attached and detached mount cases. - fs: Remove internal old mount API code (1 patch) Now that every in-tree filesystem has been converted to the new mount API, remove all the legacy shim code in fs_context.c that handled unconverted filesystems. This deletes ~280 lines including legacy_init_fs_context(), the legacy_fs_context struct, and associated wrappers. The mount(2) syscall path for userspace remains untouched. Documentation references to the legacy callbacks are cleaned up. - mount: add OPEN_TREE_NAMESPACE to open_tree() Container runtimes currently use CLONE_NEWNS to copy the caller's entire mount namespace — only to then pivot_root() and recursively unmount everything they just copied. With large mount tables and thousands of parallel container launches this creates significant contention on the namespace semaphore. OPEN_TREE_NAMESPACE copies only the specified mount tree (like OPEN_TREE_CLONE) but returns a mount namespace fd instead of a detached mount fd. The new namespace contains the copied tree mounted on top of a clone of the real rootfs. This functions as a combined unshare(CLONE_NEWNS) + pivot_root() in a single syscall. Works with user namespaces: an unshare(CLONE_NEWUSER) followed by OPEN_TREE_NAMESPACE creates a mount namespace owned by the new user namespace. Mount namespace file mounts are excluded from the copy to prevent cycles. Includes ~1000 lines of selftests" * tag 'vfs-7.0-rc1.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/open_tree: add OPEN_TREE_NAMESPACE tests mount: add OPEN_TREE_NAMESPACE fs: Remove internal old mount API code selftests: statmount: tests for STATMOUNT_BY_FD statmount: accept fd as a parameter statmount: permission check should return EPERM	2026-02-09 14:43:47 -08:00
Christian Brauner	9b8a0ba682	mount: add OPEN_TREE_NAMESPACE When creating containers the setup usually involves using CLONE_NEWNS via clone3() or unshare(). This copies the caller's complete mount namespace. The runtime will also assemble a new rootfs and then use pivot_root() to switch the old mount tree with the new rootfs. Afterward it will recursively umount the old mount tree thereby getting rid of all mounts. On a basic system here where the mount table isn't particularly large this still copies about 30 mounts. Copying all of these mounts only to get rid of them later is pretty wasteful. This is exacerbated if intermediary mount namespaces are used that only exist for a very short amount of time and are immediately destroyed again causing a ton of mounts to be copied and destroyed needlessly. With a large mount table and a system where thousands or ten-thousands of containers are spawned in parallel this quickly becomes a bottleneck increasing contention on the semaphore. Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of returning a file descriptor referring to that mount tree OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor to a new mount namespace. In that new mount namespace the copied mount tree has been mounted on top of a copy of the real rootfs. The caller can setns() into that mount namespace and perform any additionally required setup such as move_mount() detached mounts in there. This allows OPEN_TREE_NAMESPACE to function as a combined unshare(CLONE_NEWNS) and pivot_root(). A caller may for example choose to create an extremely minimal rootfs: fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE); This will create a mount namespace where "wootwoot" has become the rootfs mounted on top of the real rootfs. The caller can now setns() into this new mount namespace and assemble additional mounts. This also works with user namespaces: unshare(CLONE_NEWUSER); fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE); which creates a new mount namespace owned by the earlier created user namespace with "wootwoot" as the rootfs mounted on top of the real rootfs. Link: https://patch.msgid.link/20251229-work-empty-namespace-v1-1-bfb24c7b061f@kernel.org Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Suggested-by: Christian Brauner <brauner@kernel.org> Suggested-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-16 19:21:40 +01:00
Al Viro	cbe9e300a2	move_mount(2): switch to CLASS(filename_maybe_null) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:52:04 -05:00
Al Viro	b79b3c1f66	vfs_open_tree(): use CLASS(filename_uflags) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:52:03 -05:00
Al Viro	57cd2072bd	mount_setattr(2): don't mess with LOOKUP_EMPTY just use CLASS(filename_uflags) + filename_lookup() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:52:03 -05:00
Christian Brauner	313c47f4fe	fs: use nullfs unconditionally as the real rootfs Remove the "nullfs_rootfs" boot parameter and simply always use nullfs. The mutable rootfs will be mounted on top of it. Systems that don't use pivot_root() to pivot away from the real rootfs will have an additional mount stick around but that shouldn't be a problem at all. If it is we'll rever this commit. This also simplifies the boot process and removes the need for the traditional switch_root workarounds. Suggested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-14 11:23:39 +01:00
Al Viro	1c38f1f9b0	move_mount(): filename_lookup() accepts ERR_PTR() as filename no need to check it in the caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:18:07 -05:00
Christian Brauner	576ee5dfd4	fs: add immutable rootfs Currently pivot_root() doesn't work on the real rootfs because it cannot be unmounted. Userspace has to do a recursive removal of the initramfs contents manually before continuing the boot. Really all we want from the real rootfs is to serve as the parent mount for anything that is actually useful such as the tmpfs or ramfs for initramfs unpacking or the rootfs itself. There's no need for the real rootfs to actually be anything meaningful or useful. Add a immutable rootfs called "nullfs" that can be selected via the "nullfs_rootfs" kernel command line option. The kernel will mount a tmpfs/ramfs on top of it, unpack the initramfs and fire up userspace which mounts the rootfs and can then just do: chdir(rootfs); pivot_root(".", "."); umount2(".", MNT_DETACH); and be done with it. (Ofc, userspace can also choose to retain the initramfs contents by using something like pivot_root(".", "/initramfs") without unmounting it.) Technically this also means that the rootfs mount in unprivileged namespaces doesn't need to become MNT_LOCKED anymore as it's guaranteed that the immutable rootfs remains permanently empty so there cannot be anything revealed by unmounting the covering mount. In the future this will also allow us to create completely empty mount namespaces without risking to leak anything. systemd already handles this all correctly as it tries to pivot_root() first and falls back to MS_MOVE only when that fails. This goes back to various discussion in previous years and a LPC 2024 presentation about this very topic. Link: https://patch.msgid.link/20260112-work-immutable-rootfs-v2-3-88dd1c34a204@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-12 16:52:09 +01:00
Christian Brauner	3c1b73fc6a	fs: add init_pivot_root() We will soon be able to pivot_root() with the introduction of the immutable rootfs. Add a wrapper for kernel internal usage. Link: https://patch.msgid.link/20260112-work-immutable-rootfs-v2-2-88dd1c34a204@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-12 16:52:09 +01:00
Christian Brauner	a2062463e8	fs: ensure that internal tmpfs mount gets mount id zero and the rootfs get mount id one as it always has. Before we actually mount the rootfs we create an internal tmpfs mount which has mount id zero but is never exposed anywhere. Continue that "tradition". Link: https://patch.msgid.link/20260112-work-immutable-rootfs-v2-1-88dd1c34a204@kernel.org Fixes: `7f9bfafc5f` ("fs: use xarray for old mount id") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-12 16:52:08 +01:00
Thorsten Blum	3f320e5c2e	namespace: Replace simple_strtoul with kstrtoul to parse boot params Replace simple_strtoul() with the recommended kstrtoul() for parsing the 'mhash_entries=' and 'mphash_entries=' boot parameters. Check the return value of kstrtoul() and reject invalid values. This adds error handling while preserving behavior for existing values, and removes use of the deprecated simple_strtoul() helper. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20251214153141.218953-2-thorsten.blum@linux.dev Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-12-24 13:51:12 +01:00
Bhavik Sachdev	0e5032237e	statmount: accept fd as a parameter Extend `struct mnt_id_req` to take in a fd and introduce STATMOUNT_BY_FD flag. When a valid fd is provided and STATMOUNT_BY_FD is set, statmount will return mountinfo about the mount the fd is on. This even works for "unmounted" mounts (mounts that have been umounted using umount2(mnt, MNT_DETACH)), if you have access to a file descriptor on that mount. These "umounted" mounts will have no mountpoint and no valid mount namespace. Hence, we unset the STATMOUNT_MNT_POINT and STATMOUNT_MNT_NS_ID in statmount.mask for "unmounted" mounts. In case of STATMOUNT_BY_FD, given that we already have access to an fd on the mount, accessing mount information without a capability check seems fine because of the following reasons: - All fs related information is available via fstatfs() without any capability check. - Mount information is also available via /proc/pid/mountinfo (without any capability check). - Given that we have access to a fd on the mount which tells us that we had access to the mount at some point (or someone that had access gave us the fd). So, we should be able to access mount info. Co-developed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com> Link: https://patch.msgid.link/20251129091455.757724-3-b.sachdev1904@gmail.com Acked-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-12-15 14:13:14 +01:00
Bhavik Sachdev	fccbe38a5d	statmount: permission check should return EPERM Currently, statmount() returns ENOENT when caller is not CAP_SYS_ADMIN in the user namespace owner of target mount namespace. This should be EPERM instead. Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com> Link: https://patch.msgid.link/20251129091455.757724-2-b.sachdev1904@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-12-15 14:13:13 +01:00
Linus Torvalds	1b5dd29869	Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fd prepare updates from Christian Brauner: "This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the common pattern of get_unused_fd_flags() + create file + fd_install() that is used extensively throughout the kernel and currently requires cumbersome cleanup paths. FD_ADD() - For simple cases where a file is installed immediately: fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device)); if (fd < 0) vfio_device_put_registration(device); return fd; FD_PREPARE() - For cases requiring access to the fd or file, or additional work before publishing: FD_PREPARE(fdf, O_CLOEXEC, sync_file->file); if (fdf.err) { fput(sync_file->file); return fdf.err; } data.fence = fd_prepare_fd(fdf); if (copy_to_user((void __user )arg, &data, sizeof(data))) return -EFAULT; return fd_publish(fdf); The primitives are centered around struct fd_prepare. FD_PREPARE() encapsulates all allocation and cleanup logic and must be followed by a call to fd_publish() which associates the fd with the file and installs it into the caller's fdtable. If fd_publish() isn't called, both are deallocated automatically. FD_ADD() is a shorthand that does fd_publish() immediately and never exposes the struct to the caller. I've implemented this in a way that it's compatible with the cleanup infrastructure while also being usable separately. IOW, it's centered around struct fd_prepare which is aliased to class_fd_prepare_t and so we can make use of all the basica guard infrastructure" tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits) io_uring: convert io_create_mock_file() to FD_PREPARE() file: convert replace_fd() to FD_PREPARE() vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD() tty: convert ptm_open_peer() to FD_ADD() ntsync: convert ntsync_obj_get_fd() to FD_PREPARE() media: convert media_request_alloc() to FD_PREPARE() hv: convert mshv_ioctl_create_partition() to FD_ADD() gpio: convert linehandle_create() to FD_PREPARE() pseries: port papr_rtas_setup_file_interface() to FD_ADD() pseries: convert papr_platform_dump_create_handle() to FD_ADD() spufs: convert spufs_gang_open() to FD_PREPARE() papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE() spufs: convert spufs_context_open() to FD_PREPARE() net/socket: convert __sys_accept4_file() to FD_ADD() net/socket: convert sock_map_fd() to FD_ADD() net/kcm: convert kcm_ioctl() to FD_PREPARE() net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE() secretmem: convert memfd_secret() to FD_ADD() memfd: convert memfd_create() to FD_ADD() bpf: convert bpf_token_create() to FD_PREPARE() ...	2025-12-01 17:32:07 -08:00
Linus Torvalds	ffbf700df2	Merge tag 'vfs-6.19-rc1.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull autofs update from Christian Brauner: "Prevent futile mount triggers in private mount namespaces. Fix a problematic loop in autofs when a mount namespace contains autofs mounts that are propagation private and there is no namespace-specific automount daemon to handle possible automounting. Previously, attempted path resolution would loop until MAXSYMLINKS was reached before failing, causing significant noise in the log. The fix adds a check in autofs ->d_automount() so that the VFS can immediately return EPERM in this case. Since the mount is propagation private, EPERM is the most appropriate error code" * tag 'vfs-6.19-rc1.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: autofs: dont trigger mount if it cant succeed	2025-12-01 16:38:21 -08:00
Linus Torvalds	415d34b92c	Merge tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains substantial namespace infrastructure changes including a new system call, active reference counting, and extensive header cleanups. The branch depends on the shared kbuild branch for -fms-extensions support. Features: - listns() system call Add a new listns() system call that allows userspace to iterate through namespaces in the system. This provides a programmatic interface to discover and inspect namespaces, addressing longstanding limitations: Currently, there is no direct way for userspace to enumerate namespaces. Applications must resort to scanning /proc//ns/ across all processes, which is: - Inefficient - requires iterating over all processes - Incomplete - misses namespaces not attached to any running process but kept alive by file descriptors, bind mounts, or parent references - Permission-heavy - requires access to /proc for many processes - No ordering or ownership information - No filtering per namespace type The listns() system call solves these problems: ssize_t listns(const struct ns_id_req req, u64 ns_ids, size_t nr_ns_ids, unsigned int flags); struct ns_id_req { __u32 size; __u32 spare; __u64 ns_id; struct / listns / { __u32 ns_type; __u32 spare2; __u64 user_ns_id; }; }; Features include: - Pagination support for large namespace sets - Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.) - Filtering by owning user namespace - Permission checks respecting namespace isolation - Active Reference Counting Introduce an active reference count that tracks namespace visibility to userspace. A namespace is visible in the following cases: - The namespace is in use by a task - The namespace is persisted through a VFS object (namespace file descriptor or bind-mount) - The namespace is a hierarchical type and is the parent of child namespaces The active reference count does not regulate lifetime (that's still done by the normal reference count) - it only regulates visibility to namespace file handles and listns(). This prevents resurrection of namespaces that are pinned only for internal kernel reasons (e.g., user namespaces held by file->f_cred, lazy TLB references on idle CPUs, etc.) which should not be accessible via (1)-(3). - Unified Namespace Tree Introduce a unified tree structure for all namespaces with: - Fixed IDs assigned to initial namespaces - Lookup based solely on inode number - Maintained list of owned namespaces per user namespace - Simplified rbtree comparison helpers Cleanups - Header Reorganization: - Move namespace types into separate header (ns_common_types.h) - Decouple nstree from ns_common header - Move nstree types into separate header - Switch to new ns_tree_{node,root} structures with helper functions - Use guards for ns_tree_lock - Initial Namespace Reference Count Optimization - Make all reference counts on initial namespaces a nop to avoid pointless cacheline ping-pong for namespaces that can never go away - Drop custom reference count initialization for initial namespaces - Add NS_COMMON_INIT() macro and use it for all namespaces - pid: rely on common reference count behavior - Miscellaneous Cleanups - Rename exit_task_namespaces() to exit_nsproxy_namespaces() - Rename is_initial_namespace() and make argument const - Use boolean to indicate anonymous mount namespace - Simplify owner list iteration in nstree - nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly - nsfs: use inode_just_drop() - pidfs: raise DCACHE_DONTCACHE explicitly - pidfs: simplify PIDFD_GET__NAMESPACE ioctls - libfs: allow to specify s_d_flags - cgroup: add cgroup namespace to tree after owner is set - nsproxy: fix free_nsproxy() and simplify create_new_namespaces() Fixes: - setns(pidfd, ...) race condition Fix a subtle race when using pidfds with setns(). When the target task exits after prepare_nsset() but before commit_nsset(), the namespace's active reference count might have been dropped. If setns() then installs the namespaces, it would bump the active reference count from zero without taking the required reference on the owner namespace, leading to underflow when later decremented. The fix resurrects the ownership chain if necessary - if the caller succeeded in grabbing passive references, the setns() should succeed even if the target task exits or gets reaped. - Return EFAULT on put_user() error instead of success - Make sure references are dropped outside of RCU lock (some namespaces like mount namespace sleep when putting the last reference) - Don't skip active reference count initialization for network namespace - Add asserts for active refcount underflow - Add asserts for initial namespace reference counts (both passive and active) - ipc: enable is_ns_init_id() assertions - Fix kernel-doc comments for internal nstree functions - Selftests - 15 active reference count tests - 9 listns() functionality tests - 7 listns() permission tests - 12 inactive namespace resurrection tests - 3 threaded active reference count tests - commit_creds() active reference tests - Pagination and stress tests - EFAULT handling test - nsid tests fixes" tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits) pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls nstree: fix kernel-doc comments for internal functions nsproxy: fix free_nsproxy() and simplify create_new_namespaces() selftests/namespaces: fix nsid tests ns: drop custom reference count initialization for initial namespaces pid: rely on common reference count behavior ns: add asserts for initial namespace active reference counts ns: add asserts for initial namespace reference counts ns: make all reference counts on initial namespace a nop ipc: enable is_ns_init_id() assertions fs: use boolean to indicate anonymous mount namespace ns: rename is_initial_namespace() ns: make is_initial_namespace() argument const nstree: use guards for ns_tree_lock nstree: simplify owner list iteration nstree: switch to new structures nstree: add helper to operate on struct ns_tree_{node,root} nstree: move nstree types into separate header nstree: decouple from ns_common header ns: move namespace types into separate header ...	2025-12-01 09:47:41 -08:00
Linus Torvalds	b04b2e7a61	Merge tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE permission checks during path lookup and adds the IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid expensive permission work. - Hide dentry_cache behind runtime const machinery. - Add German Maglione as virtiofs co-maintainer. Cleanups: - Tidy up and inline step_into() and walk_component() for improved code generation. - Re-enable IOCB_NOWAIT writes to files. This refactors file timestamp update logic, fixing a layering bypass in btrfs when updating timestamps on device files and improving FMODE_NOCMTIME handling in VFS now that nfsd started using it. - Path lookup optimizations extracting slowpaths into dedicated routines and adding branch prediction hints for mntput_no_expire(), fd_install(), lookup_slow(), and various other hot paths. - Enable clang's -fms-extensions flag, requiring a JFS rename to avoid conflicts. - Remove spurious exports in fs/file_attr.c. - Stop duplicating union pipe_index declaration. This depends on the shared kbuild branch that brings in -fms-extensions support which is merged into this branch. - Use MD5 library instead of crypto_shash in ecryptfs. - Use largest_zero_folio() in iomap_dio_zero(). - Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and initrd code. - Various typo fixes. Fixes: - Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs() call with wait == 1 to commit super blocks. The emergency sync path never passed this, leaving btrfs data uncommitted during emergency sync. - Use local kmap in watch_queue's post_one_notification(). - Add hint prints in sb_set_blocksize() for LBS dependency on THP" * tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) MAINTAINERS: add German Maglione as virtiofs co-maintainer fs: inline step_into() and walk_component() fs: tidy up step_into() & friends before inlining orangefs: use inode_update_timestamps directly btrfs: fix the comment on btrfs_update_time btrfs: use vfs_utimes to update file timestamps fs: export vfs_utimes fs: lift the FMODE_NOCMTIME check into file_update_time_flags fs: refactor file timestamp update logic include/linux/fs.h: trivial fix: regualr -> regular fs/splice.c: trivial fix: pipes -> pipe's fs: mark lookup_slow() as noinline fs: add predicts based on nd->depth fs: move mntput_no_expire() slowpath into a dedicated routine fs: remove spurious exports in fs/file_attr.c watch_queue: Use local kmap in post_one_notification() fs: touch up predicts in path lookup fs: move fd_install() slowpath into a dedicated routine and provide commentary fs: hide dentry_cache behind runtime const machinery fs: touch predicts in do_dentry_open() ...	2025-12-01 08:44:26 -08:00
Christian Brauner	05885f4165	namespace: convert fsmount() to FD_PREPARE() Christian Brauner <brauner@kernel.org> says: A variant of the fix sent in [1] was squashed into this commit. Link: https://lore.kernel.org/20251128035149.392402-1-kartikey406@gmail.com [1] Reported-by: Deepanshu Kartikey <kartikey406@gmail.com> Reported-by: syzbot+94048264da5715c251f9@syzkaller.appspotmail.com Tested-by: syzbot+94048264da5715c251f9@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=94048264da5715c251f9 Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-7-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-28 12:42:31 +01:00
Christian Brauner	416b0d1659	namespace: convert open_tree_attr() to FD_PREPARE() Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-6-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-28 12:42:31 +01:00
Christian Brauner	542a406543	namespace: convert open_tree() to FD_ADD() Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-5-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-28 12:42:31 +01:00
Andrei Vagin	7b6dcd9bfd	fs/namespace: fix reference leak in grab_requested_mnt_ns lookup_mnt_ns() already takes a reference on mnt_ns. grab_requested_mnt_ns() doesn't need to take an extra reference. Fixes: `78f0e33cd6` ("fs/namespace: correctly handle errors returned by grab_requested_mnt_ns") Signed-off-by: Andrei Vagin <avagin@google.com> Link: https://patch.msgid.link/20251122071953.3053755-1-avagin@google.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-25 09:34:56 +01:00
Mateusz Guzik	bfef6e1f34	fs: move mntput_no_expire() slowpath into a dedicated routine In the stock variant the compiler spills several registers on the stack and employs stack smashing protection, adding even more code + a branch on exit.. The actual fast path is small enough that the compiler inlines it for all callers -- the symbol is no longer emitted. Forcing noinline on it just for code-measurement purposes shows the fast path dropping from 111 to 39 bytes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251114201803.2183505-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-19 14:49:28 +01:00
Ian Kent	922a6f34c1	autofs: dont trigger mount if it cant succeed If a mount namespace contains autofs mounts, and they are propagation private, and there is no namespace specific automount daemon to handle possible automounting then attempted path resolution will loop until MAXSYMLINKS is reached before failing causing quite a bit of noise in the log. Add a check for this in autofs ->d_automount() so that the VFS can immediately return an error in this case. Since the mount is propagation private an EPERM return seems most appropriate. Suggested by: Christian Brauner <brauner@kernel.org> Signed-off-by: Ian Kent <raven@themaw.net> Link: https://patch.msgid.link/20251118024631.10854-2-raven@themaw.net Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-19 11:14:02 +01:00
Andrei Vagin	78f0e33cd6	fs/namespace: correctly handle errors returned by grab_requested_mnt_ns grab_requested_mnt_ns was changed to return error codes on failure, but its callers were not updated to check for error pointers, still checking only for a NULL return value. This commit updates the callers to use IS_ERR() or IS_ERR_OR_NULL() and PTR_ERR() to correctly check for and propagate errors. This also makes sure that the logic actually works and mount namespace file descriptors can be used to refere to mounts. Christian Brauner <brauner@kernel.org> says: Rework the patch to be more ergonomic and in line with our overall error handling patterns. Fixes: `7b9d14af87` ("fs: allow mount namespace fd") Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrei Vagin <avagin@google.com> Link: https://patch.msgid.link/20251111062815.2546189-1-avagin@google.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-12 10:42:49 +01:00
Christian Brauner	c2bbd2db52	ns: drop custom reference count initialization for initial namespaces Initial namespaces don't modify their reference count anymore. They remain fixed at one so drop the custom refcount initializations. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-16-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-11 10:01:32 +01:00
Christian Brauner	d9a44089ac	fs: use boolean to indicate anonymous mount namespace Stop playing games with the namespace id and use a boolean instead: * This will remove the special-casing we need to do everywhere for mount namespaces. * It will allow us to use asserts on the namespace id for initial namespaces everywhere. * It will allow us to put anonymous mount namespaces on the namespaces trees in the future and thus make them available to statmount() and listmount(). Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-10-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-11 10:01:31 +01:00
Christian Brauner	a657bc8a75	nstree: switch to new structures Switch the nstree management to the new combined structures. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-5-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-11 10:01:30 +01:00
Christian Brauner	3760342fd6	nstree: assign fixed ids to the initial namespaces The initial set of namespace comes with fixed inode numbers making it easy for userspace to identify them solely based on that information. This has long preceeded anything here. Similarly, let's assign fixed namespace ids for the initial namespaces. Kill the cookie and use a sequentially increasing number. This has the nice side-effect that the owning user namespace will always have a namespace id that is smaller than any of it's descendant namespaces. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-15-2e6f823ebdc0@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-03 17:41:17 +01:00
Christian Brauner	0b1765830c	ns: use NS_COMMON_INIT() for all namespaces Now that we have a common initializer use it for all static namespaces. Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-03 17:41:16 +01:00
Andy Shevchenko	9db8d46712	mnt: Remove dead code which might prevent from building Clang, in particular, is not happy about dead code: fs/namespace.c:135:37: error: unused function 'node_to_mnt_ns' [-Werror,-Wunused-function] 135 \| static inline struct mnt_namespace node_to_mnt_ns(const struct rb_node node) \| ^~~~~~~~~~~~~~ 1 error generated. Remove a leftover from the previous cleanup. Fixes: `7d7d164989` ("mnt: support ns lookup") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20251024132336.1666382-1-andriy.shevchenko@linux.intel.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-10-29 15:57:24 +01:00
Zhen Ni	0778ac7df5	fs: Fix uninitialized 'offp' in statmount_string() In statmount_string(), most flags assign an output offset pointer (offp) which is later updated with the string offset. However, the STATMOUNT_MNT_UIDMAP and STATMOUNT_MNT_GIDMAP cases directly set the struct fields instead of using offp. This leaves offp uninitialized, leading to a possible uninitialized dereference when *offp is updated. Fix it by assigning offp for UIDMAP and GIDMAP as well, keeping the code path consistent. Fixes: `37c4a9590e` ("statmount: allow to retrieve idmappings") Fixes: `e52e97f09f` ("statmount: let unset strings be empty") Cc: stable@vger.kernel.org Signed-off-by: Zhen Ni <zhen.ni@easystack.cn> Link: https://patch.msgid.link/20251013114151.664341-1-zhen.ni@easystack.cn Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-10-21 14:21:46 +02:00
Linus Torvalds	51e9889ab1	Merge tag 'pull-fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fs_context updates from Al Viro: "Change vfs_parse_fs_string() calling conventions Get rid of the length argument (almost all callers pass strlen() of the string argument there), add vfs_parse_fs_qstr() for the cases that do want separate length" * tag 'pull-fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: do_nfs4_mount(): switch to vfs_parse_fs_string() change the calling conventions for vfs_parse_fs_string()	2025-10-03 10:51:44 -07:00
Linus Torvalds	e64aeecbbb	Merge tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount updates from Al Viro: "Several piles this cycle, this mount-related one being the largest and trickiest: - saner handling of guards in fs/namespace.c, getting rid of needlessly strong locking in some of the users - lock_mount() calling conventions change - have it set the environment for attaching to given location, storing the results in caller-supplied object, without altering the passed struct path. Make unlock_mount() called as __cleanup for those objects. It's not exactly guard(), but similar to it - MNT_WRITE_HOLD done right. mnt_hold_writers() does not mess with ->mnt_flags anymore, so insertion of a new mount into ->s_mounts of underlying superblock does not, in itself, expose ->mnt_flags of that mount to concurrent modifications - getting rid of pathological cases when umount() spends quadratic time removing the victims from propagation graph - part of that had been dealt with last cycle, this should finish it - a bunch of stuff constified - assorted cleanups * tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) constify {__,}mnt_is_readonly() WRITE_HOLD machinery: no need for to bump mount_lock seqcount struct mount: relocate MNT_WRITE_HOLD bit preparations to taking MNT_WRITE_HOLD out of ->mnt_flags setup_mnt(): primitive for connecting a mount to filesystem simplify the callers of mnt_unhold_writers() copy_mnt_ns(): use guards copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure open_detached_copy(): separate creation of namespace into helper open_detached_copy(): don't bother with mount_lock_hash() path_has_submounts(): use guard(mount_locked_reader) fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() ecryptfs: get rid of pointless mount references in ecryptfs dentries umount_tree(): take all victims out of propagation graph at once do_mount(): use __free(path_put) do_move_mount_old(): use __free(path_put) constify can_move_mount_beneath() arguments path_umount(): constify struct path argument may_copy_tree(), __do_loopback(): constify struct path argument path_mount(): constify struct path argument ...	2025-10-03 10:19:44 -07:00
Linus Torvalds	18b19abc37	Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains a larger set of changes around the generic namespace infrastructure of the kernel. Each specific namespace type (net, cgroup, mnt, ...) embedds a struct ns_common which carries the reference count of the namespace and so on. We open-coded and cargo-culted so many quirks for each namespace type that it just wasn't scalable anymore. So given there's a bunch of new changes coming in that area I've started cleaning all of this up. The core change is to make it possible to correctly initialize every namespace uniformly and derive the correct initialization settings from the type of the namespace such as namespace operations, namespace type and so on. This leaves the new ns_common_init() function with a single parameter which is the specific namespace type which derives the correct parameters statically. This also means the compiler will yell as soon as someone does something remotely fishy. The ns_common_init() addition also allows us to remove ns_alloc_inum() and drops any special-casing of the initial network namespace in the network namespace initialization code that Linus complained about. Another part is reworking the reference counting. The reference counting was open-coded and copy-pasted for each namespace type even though they all followed the same rules. This also removes all open accesses to the reference count and makes it private and only uses a very small set of dedicated helpers to manipulate them just like we do for e.g., files. In addition this generalizes the mount namespace iteration infrastructure introduced a few cycles ago. As reminder, the vfs makes it possible to iterate sequentially and bidirectionally through all mount namespaces on the system or all mount namespaces that the caller holds privilege over. This allow userspace to iterate over all mounts in all mount namespaces using the listmount() and statmount() system call. Each mount namespace has a unique identifier for the lifetime of the systems that is exposed to userspace. The network namespace also has a unique identifier working exactly the same way. This extends the concept to all other namespace types. The new nstree type makes it possible to lookup namespaces purely by their identifier and to walk the namespace list sequentially and bidirectionally for all namespace types, allowing userspace to iterate through all namespaces. Looking up namespaces in the namespace tree works completely locklessly. This also means we can move the mount namespace onto the generic infrastructure and remove a bunch of code and members from struct mnt_namespace itself. There's a bunch of stuff coming on top of this in the future but for now this uses the generic namespace tree to extend a concept introduced first for pidfs a few cycles ago. For a while now we have supported pidfs file handles for pidfds. This has proven to be very useful. This extends the concept to cover namespaces as well. It is possible to encode and decode namespace file handles using the common name_to_handle_at() and open_by_handle_at() apis. As with pidfs file handles, namespace file handles are exhaustive, meaning it is not required to actually hold a reference to nsfs in able to decode aka open_by_handle_at() a namespace file handle. Instead the FD_NSFS_ROOT constant can be passed which will let the kernel grab a reference to the root of nsfs internally and thus decode the file handle. Namespaces file descriptors can already be derived from pidfds which means they aren't subject to overmount protection bugs. IOW, it's irrelevant if the caller would not have access to an appropriate /proc/<pid>/ns/ directory as they could always just derive the namespace based on a pidfd already. It has the same advantage as pidfds. It's possible to reliably and for the lifetime of the system refer to a namespace without pinning any resources and to compare them trivially. Permission checking is kept simple. If the caller is located in the namespace the file handle refers to they are able to open it otherwise they must hold privilege over the owning namespace of the relevant namespace. The namespace file handle layout is exposed as uapi and has a stable and extensible format. For now it simply contains the namespace identifier, the namespace type, and the inode number. The stable format means that userspace may construct its own namespace file handles without going through name_to_handle_at() as they are already allowed for pidfs and cgroup file handles" * tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits) ns: drop assert ns: move ns type into struct ns_common nstree: make struct ns_tree private ns: add ns_debug() ns: simplify ns_common_init() further cgroup: add missing ns_common include ns: use inode initializer for initial namespaces selftests/namespaces: verify initial namespace inode numbers ns: rename to __ns_ref nsfs: port to ns_ref_() helpers net: port to ns_ref_() helpers uts: port to ns_ref_() helpers ipv4: use check_net() net: use check_net() net-sysfs: use check_net() user: port to ns_ref_() helpers time: port to ns_ref_() helpers pid: port to ns_ref_() helpers ipc: port to ns_ref_() helpers cgroup: port to ns_ref_() helpers ...	2025-09-29 11:20:29 -07:00
Linus Torvalds	722df25ddf	Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull copy_process updates from Christian Brauner: "This contains the changes to enable support for clone3() on nios2 which apparently is still a thing. The more exciting part of this is that it cleans up the inconsistency in how the 64-bit flag argument is passed from copy_process() into the various other copy_() helpers" [ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ] tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nios2: implement architecture-specific portion of sys_clone3 arch: copy_thread: pass clone_flags as u64 copy_process: pass clone_flags as u64 across calltree copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)	2025-09-29 10:36:50 -07:00

1 2 3 4 5 ...

970 Commits