mirror of
https://github.com/torvalds/linux.git
synced 2026-04-18 06:44:00 -04:00
2845989f2ebaf7848e4eccf9a779daf3156ea0a5
104865 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
7c8a4671dc |
Merge tag 'vfs-7.1-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
- Add FSMOUNT_NAMESPACE flag to fsmount() that creates a new mount
namespace with the newly created filesystem attached to a copy of the
real rootfs. This returns a namespace file descriptor instead of an
O_PATH mount fd, similar to how OPEN_TREE_NAMESPACE works for
open_tree().
This allows creating a new filesystem and immediately placing it in a
new mount namespace in a single operation, which is useful for
container runtimes and other namespace-based isolation mechanisms.
This accompanies OPEN_TREE_NAMESPACE and avoids a needless detour via
OPEN_TREE_NAMESPACE to get the same effect. Will be especially useful
when you mount an actual filesystem to be used as the container
rootfs.
- Currently, creating a new mount namespace always copies the entire
mount tree from the caller's namespace. For containers and sandboxes
that intend to build their mount table from scratch this is wasteful:
they inherit a potentially large mount tree only to immediately tear
it down.
This series adds support for creating a mount namespace that contains
only a clone of the root mount, with none of the child mounts. Two
new flags are introduced:
- CLONE_EMPTY_MNTNS (0x400000000) for clone3(), using the 64-bit flag space
- UNSHARE_EMPTY_MNTNS (0x00100000) for unshare()
Both flags imply CLONE_NEWNS. The resulting namespace contains a
single nullfs root mount with an immutable empty directory. The
intended workflow is to then mount a real filesystem (e.g., tmpfs)
over the root and build the mount table from there.
- Allow MOVE_MOUNT_BENEATH to target the caller's rootfs, allowing to
switch out the rootfs without pivot_root(2).
The traditional approach to switching the rootfs involves
pivot_root(2) or a chroot_fs_refs()-based mechanism that atomically
updates fs->root for all tasks sharing the same fs_struct. This has
consequences for fork(), unshare(CLONE_FS), and setns().
This series instead decomposes root-switching into individually
atomic, locally-scoped steps:
fd_tree = open_tree(-EBADF, "/newroot", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
fchdir(fd_tree);
move_mount(fd_tree, "", AT_FDCWD, "/", MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH);
chroot(".");
umount2(".", MNT_DETACH);
Since each step only modifies the caller's own state, the
fork/unshare/setns races are eliminated by design.
A key step to making this possible is to remove the locked mount
restriction. Originally MOVE_MOUNT_BENEATH doesn't support mounting
beneath a mount that is locked. The locked mount protects the
underlying mount from being revealed. This is a core mechanism of
unshare(CLONE_NEWUSER | CLONE_NEWNS). The mounts in the new mount
namespace become locked. That effectively makes the new mount table
useless as the caller cannot ever get rid of any of the mounts no
matter how useless they are.
We can lift this restriction though. We simply transfer the locked
property from the top mount to the mount beneath. This works because
what we care about is to protect the underlying mount aka the parent.
The mount mounted between the parent and the top mount takes over the
job of protecting the parent mount from the top mount mount. This
leaves us free to remove the locked property from the top mount which
can consequently be unmounted:
unshare(CLONE_NEWUSER | CLONE_NEWNS)
and we inherit a clone of procfs on /proc then currently we cannot
unmount it as:
umount -l /proc
will fail with EINVAL because the procfs mount is locked.
After this series we can now do:
mount --beneath -t tmpfs tmpfs /proc
umount -l /proc
after which a tmpfs mount has been placed beneath the procfs mount.
The tmpfs mount has become locked and the procfs mount has become
unlocked.
This means you can safely modify an inherited mount table after
unprivileged namespace creation.
Afterwards we simply make it possible to move a mount beneath the
rootfs allowing to upgrade the rootfs.
Removing the locked restriction makes this very useful for containers
created with unshare(CLONE_NEWUSER | CLONE_NEWNS) to reshuffle an
inherited mount table safely and MOVE_MOUNT_BENEATH makes it possible
to switch out the rootfs instead of using the costly pivot_root(2).
* tag 'vfs-7.1-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests/namespaces: remove unused utils.h include from listns_efault_test
selftests/fsmount_ns: add missing TARGETS and fix cap test
selftests/empty_mntns: fix wrong CLONE_EMPTY_MNTNS hex value in comment
selftests/empty_mntns: fix statmount_alloc() signature mismatch
selftests/statmount: remove duplicate wait_for_pid()
mount: always duplicate mount
selftests/filesystems: add MOVE_MOUNT_BENEATH rootfs tests
move_mount: allow MOVE_MOUNT_BENEATH on the rootfs
move_mount: transfer MNT_LOCKED
selftests/filesystems: add clone3 tests for empty mount namespaces
selftests/filesystems: add tests for empty mount namespaces
namespace: allow creating empty mount namespaces
selftests: add FSMOUNT_NAMESPACE tests
selftests/statmount: add statmount_alloc() helper
tools: update mount.h header
mount: add FSMOUNT_NAMESPACE
mount: simplify __do_loopback()
mount: start iterating from start of rbtree
|
||
|
|
91a4855d6c |
Merge tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Support HW queue leasing, allowing containers to be granted access
to HW queues for zero-copy operations and AF_XDP
- Number of code moves to help the compiler with inlining. Avoid
output arguments for returning drop reason where possible
- Rework drop handling within qdiscs to include more metadata about
the reason and dropping qdisc in the tracepoints
- Remove the rtnl_lock use from IP Multicast Routing
- Pack size information into the Rx Flow Steering table pointer
itself. This allows making the table itself a flat array of u32s,
thus making the table allocation size a power of two
- Report TCP delayed ack timer information via socket diag
- Add ip_local_port_step_width sysctl to allow distributing the
randomly selected ports more evenly throughout the allowed space
- Add support for per-route tunsrc in IPv6 segment routing
- Start work of switching sockopt handling to iov_iter
- Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
buffer size drifting up
- Support MSG_EOR in MPTCP
- Add stp_mode attribute to the bridge driver for STP mode selection.
This addresses concerns about call_usermodehelper() usage
- Remove UDP-Lite support (as announced in 2023)
- Remove support for building IPv6 as a module. Remove the now
unnecessary function calling indirection
Cross-tree stuff:
- Move Michael MIC code from generic crypto into wireless, it's
considered insecure but some WiFi networks still need it
Netfilter:
- Switch nft_fib_ipv6 module to no longer need temporary dst_entry
object allocations by using fib6_lookup() + RCU.
Florian W reports this gets us ~13% higher packet rate
- Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
switch the service tables to be per-net. Convert some code that
walks the service lists to use RCU instead of the service_mutex
- Add more opinionated input validation to lower security exposure
- Make IPVS hash tables to be per-netns and resizable
Wireless:
- Finished assoc frame encryption/EPPKE/802.1X-over-auth
- Radar detection improvements
- Add 6 GHz incumbent signal detection APIs
- Multi-link support for FILS, probe response templates and client
probing
- New APIs and mac80211 support for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware
Driver API:
- Add numerical ID for devlink instances (to avoid having to create
fake bus/device pairs just to have an ID). Support shared devlink
instances which span multiple PFs
- Add standard counters for reporting pause storm events (implement
in mlx5 and fbnic)
- Add configuration API for completion writeback buffering (implement
in mana)
- Support driver-initiated change of RSS context sizes
- Support DPLL monitoring input frequency (implement in zl3073x)
- Support per-port resources in devlink (implement in mlx5)
Misc:
- Expand the YAML spec for Netfilter
Drivers
- Software:
- macvlan: support multicast rx for bridge ports with shared
source MAC address
- team: decouple receive and transmit enablement for IEEE 802.3ad
LACP "independent control"
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- support high order pages in zero-copy mode (for payload
coalescing)
- support multiple packets in a page (for systems with 64kB
pages)
- Broadcom 25-400GE (bnxt):
- implement XDP RSS hash metadata extraction
- add software fallback for UDP GSO, lowering the IOMMU cost
- Broadcom 800GE (bnge):
- add link status and configuration handling
- add various HW and SW statistics
- Marvell/Cavium:
- NPC HW block support for cn20k
- Huawei (hinic3):
- add mailbox / control queue
- add rx VLAN offload
- add driver info and link management
- Ethernet NICs:
- Marvell/Aquantia:
- support reading SFP module info on some AQC100 cards
- Realtek PCI (r8169):
- add support for RTL8125cp
- Realtek USB (r8152):
- support for the RTL8157 5Gbit chip
- add 2500baseT EEE status/configuration support
- Ethernet NICs embedded and off-the-shelf IP:
- Synopsys (stmmac):
- cleanup and reorganize SerDes handling and PCS support
- cleanup descriptor handling and per-platform data
- cleanup and consolidate MDIO defines and handling
- shrink driver memory use for internal structures
- improve Tx IRQ coalescing
- improve TCP segmentation handling
- add support for Spacemit K3
- Cadence (macb):
- support PHYs that have inband autoneg disabled with GEM
- support IEEE 802.3az EEE
- rework usrio capabilities and handling
- AMD (xgbe):
- improve power management for S0i3
- improve TX resilience for link-down handling
- Virtual:
- Google cloud vNIC:
- support larger ring sizes in DQO-QPL mode
- improve HW-GRO handling
- support UDP GSO for DQO format
- PCIe NTB:
- support queue count configuration
- Ethernet PHYs:
- automatically disable PHY autonomous EEE if MAC is in charge
- Broadcom:
- add BCM84891/BCM84892 support
- Micrel:
- support for LAN9645X internal PHY
- Realtek:
- add RTL8224 pair order support
- support PHY LEDs on RTL8211F-VD
- support spread spectrum clocking (SSC)
- Maxlinear:
- add PHY-level statistics via ethtool
- Ethernet switches:
- Maxlinear (mxl862xx):
- support for bridge offloading
- support for VLANs
- support driver statistics
- Bluetooth:
- large number of fixes and new device IDs
- Mediatek:
- support MT6639 (MT7927)
- support MT7902 SDIO
- WiFi:
- Intel (iwlwifi):
- UNII-9 and continuing UHR work
- MediaTek (mt76):
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- Qualcomm (ath12k):
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
- support IPQ5424
- Realtek:
- add USB RX aggregation to improve performance
- add USB TX flow control by tracking in-flight URBs
- Cellular:
- IPA v5.2 support"
* tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits)
net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
wireguard: allowedips: remove redundant space
tools: ynl: add sample for wireguard
wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
MAINTAINERS: Add netkit selftest files
selftests/net: Add additional test coverage in nk_qlease
selftests/net: Split netdevsim tests from HW tests in nk_qlease
tools/ynl: Make YnlFamily closeable as a context manager
net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
net: airoha: Fix VIP configuration for AN7583 SoC
net: caif: clear client service pointer on teardown
net: strparser: fix skb_head leak in strp_abort_strp()
net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()
selftests/bpf: add test for xdp_master_redirect with bond not up
net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
sctp: disable BH before calling udp_tunnel_xmit_skb()
sctp: fix missing encap_port propagation for GSO fragments
net: airoha: Rely on net_device pointer in ETS callbacks
...
|
||
|
|
fabd5a8d24 |
Merge tag 'x86_cache_for_v7.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov: - Add return value descriptions to several internal functions, addressing kernel-doc complaints - Add the x86 maintainer mailing list to the resctrl section so they are automatically included in patch submissions, and reference the applicable contribution rules document - Allow users to apply a single Capacity Bitmask to all cache domains at once using '*' as a shorthand, instead of having to specify each domain individually. This is particularly user-friendly on high core-count systems with many cache clusters - When a user provides a non-existent domain ID while configuring cache allocation, ensure the failure reason is properly reported to the user rather than silently returning an error with a misleading "ok" status * tag 'x86_cache_for_v7.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fs/resctrl: Add missing return value descriptions MAINTAINERS: Update resctrl entry fs/resctrl: Add "*" shorthand to set io_alloc CBM for all domains fs/resctrl: Report invalid domain ID when parsing io_alloc_cbm |
||
|
|
ad4999496e |
mount: always duplicate mount
In the OPEN_TREE_NAMESPACE path vfs_open_tree() resolves a path via filename_lookup() without holding namespace_lock. Between the lookup and create_new_namespace() acquiring namespace_lock via LOCK_MOUNT_EXACT_COPY() another thread can unmount the mount, setting mnt->mnt_ns to NULL. When create_new_namespace() then checks !mnt->mnt_ns it incorrectly takes the swap-and-mntget path that was designed for fsmount()'s detached mounts. This reuses a mount whose mnt_mp_list is in an inconsistent state from the concurrent unmount, causing a general protection fault in __umount_mnt() -> hlist_del_init(&mnt->mnt_mp_list) during namespace teardown. Remove the !mnt->mnt_ns special case entirely. Instead, always duplicate the mount: - For OPEN_TREE_NAMESPACE use __do_loopback() which will properly clone the mount or reject it via may_copy_tree() if it was unmounted in the race window. - For fsmount() use clone_mnt() directly (via the new MOUNT_COPY_NEW flag) since the mount is freshly created by vfs_create_mount() and not in any namespace so __do_loopback()'s IS_MNT_UNBINDABLE, may_copy_tree, and __has_locked_children checks don't apply. Reported-by: syzbot+e4470cc28308f2081ec8@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
4793dae01f |
Merge tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core updates from Danilo Krummrich:
"debugfs:
- Fix NULL pointer dereference in debugfs_create_str()
- Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str()
- Fix soundwire debugfs NULL pointer dereference from uninitialized
firmware_file
device property:
- Make fwnode flags modifications thread safe; widen the field to
unsigned long and use set_bit() / clear_bit() based accessors
- Document how to check for the property presence
devres:
- Separate struct devres_node from its "subclasses" (struct devres,
struct devres_group); give struct devres_node its own release and
free callbacks for per-type dispatch
- Introduce struct devres_action for devres actions, avoiding the
ARCH_DMA_MINALIGN alignment overhead of struct devres
- Export struct devres_node and its init/add/remove/dbginfo
primitives for use by Rust Devres<T>
- Fix missing node debug info in devm_krealloc()
- Use guard(spinlock_irqsave) where applicable; consolidate unlock
paths in devres_release_group()
driver_override:
- Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the
generic driver_override infrastructure, replacing per-bus
driver_override strings, sysfs attributes, and match logic; fixes a
potential UAF from unsynchronized access to driver_override in bus
match() callbacks
- Simplify __device_set_driver_override() logic
kernfs:
- Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs file
and directory removal
- Add corresponding selftests for memcg
platform:
- Allow attaching software nodes when creating platform devices via a
new 'swnode' field in struct platform_device_info
- Add kerneldoc for struct platform_device_info
software node:
- Move software node initialization from postcore_initcall() to
driver_init(), making it available early in the boot process
- Move kernel_kobj initialization (ksysfs_init) earlier to support
the above
- Remove software_node_exit(); dead code in a built-in unit
SoC:
- Introduce of_machine_read_compatible() and of_machine_read_model()
OF helpers and export soc_attr_read_machine() to replace direct
accesses to of_root from SoC drivers; also enables
CONFIG_COMPILE_TEST coverage for these drivers
sysfs:
- Constify attribute group array pointers to
'const struct attribute_group *const *' in sysfs functions,
device_add_groups() / device_remove_groups(), and struct class
Rust:
- Devres:
- Embed struct devres_node directly in Devres<T> instead of going
through devm_add_action(), avoiding the extra allocation and the
unnecessary ARCH_DMA_MINALIGN alignment
- I/O:
- Turn IoCapable from a marker trait into a functional trait
carrying the raw I/O accessor implementation (io_read /
io_write), providing working defaults for the per-type Io
methods
- Add RelaxedMmio wrapper type, making relaxed accessors usable in
code generic over the Io trait
- Remove overloaded per-type Io methods and per-backend macros
from Mmio and PCI ConfigSpace
- I/O (Register):
- Add IoLoc trait and generic read/write/update methods to the Io
trait, making I/O operations parameterizable by typed locations
- Add register! macro for defining hardware register types with
typed bitfield accessors backed by Bounded values; supports
direct, relative, and array register addressing
- Add write_reg() / try_write_reg() and LocatedRegister trait
- Update PCI sample driver to demonstrate the register! macro
Example:
```
register! {
/// UART control register.
CTRL(u32) @ 0x18 {
/// Receiver enable.
19:19 rx_enable => bool;
/// Parity configuration.
14:13 parity ?=> Parity;
}
/// FIFO watermark and counter register.
WATER(u32) @ 0x2c {
/// Number of datawords in the receive FIFO.
26:24 rx_count;
/// RX interrupt threshold.
17:16 rx_water;
}
}
impl WATER {
fn rx_above_watermark(&self) -> bool {
self.rx_count() > self.rx_water()
}
}
fn init(bar: &pci::Bar<BAR0_SIZE>) {
let water = WATER::zeroed()
.with_const_rx_water::<1>(); // > 3 would not compile
bar.write_reg(water);
let ctrl = CTRL::zeroed()
.with_parity(Parity::Even)
.with_rx_enable(true);
bar.write_reg(ctrl);
}
fn handle_rx(bar: &pci::Bar<BAR0_SIZE>) {
if bar.read(WATER).rx_above_watermark() {
// drain the FIFO
}
}
fn set_parity(bar: &pci::Bar<BAR0_SIZE>, parity: Parity) {
bar.update(CTRL, |r| r.with_parity(parity));
}
```
- IRQ:
- Move 'static bounds from where clauses to trait declarations for
IRQ handler traits
- Misc:
- Enable the generic_arg_infer Rust feature
- Extend Bounded with shift operations, single-bit bool
conversion, and const get()
Misc:
- Make deferred_probe_timeout default a Kconfig option
- Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM
callbacks when no bus type PM ops are set
- Add conditional guard support for device_lock()
- Add ksysfs.c to the DRIVER CORE MAINTAINERS entry
- Fix kernel-doc warnings in base.h
- Fix stale reference to memory_block_add_nid() in documentation"
* tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (67 commits)
bus: fsl-mc: use generic driver_override infrastructure
s390/ap: use generic driver_override infrastructure
s390/cio: use generic driver_override infrastructure
vdpa: use generic driver_override infrastructure
platform/wmi: use generic driver_override infrastructure
PCI: use generic driver_override infrastructure
driver core: make software nodes available earlier
software node: remove software_node_exit()
kernel: ksysfs: initialize kernel_kobj earlier
MAINTAINERS: add ksysfs.c to the DRIVER CORE entry
drivers/base/memory: fix stale reference to memory_block_add_nid()
device property: Document how to check for the property presence
soundwire: debugfs: initialize firmware_file to empty string
debugfs: fix placement of EXPORT_SYMBOL_GPL for debugfs_create_str()
debugfs: check for NULL pointer in debugfs_create_str()
driver core: Make deferred_probe_timeout default a Kconfig option
driver core: simplify __device_set_driver_override() clearing logic
driver core: auxiliary bus: Drop auxiliary_dev_pm_ops
device property: Make modifications of fwnode "flags" thread safe
rust: devres: embed struct devres_node directly
...
|
||
|
|
613b48bbd4 |
Merge tag 'execve-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook: - use strnlen() in __set_task_comm (Thorsten Blum) - update task_struct->comm comment (Thorsten Blum) * tag 'execve-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: sched: update task_struct->comm comment exec: use strnlen() in __set_task_comm |
||
|
|
cae0d23288 |
Merge tag 'pstore-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook: - fix ftrace dump when ECC is enabled (Andrey Skvortsov) - fix resource leak when ioremap() fails (Cole Leavitt) - Remove useless memblock header (Guilherme G. Piccoli) - Fix ECC parameter help text (Guilherme G. Piccoli) - Keep ftrace module parameter and debugfs switch in sync (Guilherme G. Piccoli) - Factor KASLR offset in the core kernel instruction addresses (Guilherme G. Piccoli) * tag 'pstore-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ftrace: Factor KASLR offset in the core kernel instruction addresses pstore/ftrace: Keep ftrace module parameter and debugfs switch in sync pstore/ram: fix resource leak when ioremap() fails pstore/ramoops: Fix ECC parameter help text pstore/ramoops: Remove useless memblock header pstore: fix ftrace dump, when ECC is enabled |
||
|
|
9932f00bf4 |
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux
Pull fscrypt updates from Eric Biggers: - Various cleanups for the interface between fs/crypto/ and filesystems, from Christoph Hellwig - Simplify and optimize the implementation of v1 key derivation by using the AES library instead of the crypto_skcipher API * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux: fscrypt: use AES library for v1 key derivation ext4: use a byte granularity cursor in ext4_mpage_readpages fscrypt: pass a real sector_t to fscrypt_zeroout_range fscrypt: pass a byte length to fscrypt_zeroout_range fscrypt: pass a byte offset to fscrypt_zeroout_range fscrypt: pass a byte length to fscrypt_zeroout_range_inline_crypt fscrypt: pass a byte offset to fscrypt_zeroout_range_inline_crypt fscrypt: pass a byte offset to fscrypt_set_bio_crypt_ctx fscrypt: pass a byte offset to fscrypt_mergeable_bio fscrypt: pass a byte offset to fscrypt_generate_dun fscrypt: move fscrypt_set_bio_crypt_ctx_bh to buffer.c ext4, fscrypt: merge fscrypt_mergeable_bio_bh into io_submit_need_new_bio ext4: factor out a io_submit_need_new_bio helper ext4: open code fscrypt_set_bio_crypt_ctx_bh ext4: initialize the write hint in io_submit_init_bio |
||
|
|
81dc1e4d32 |
Merge tag 'v7.1-rc1-part1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client updates from Steve French: - Fix EAs bounds check - Fix OOB read in symlink response parsing - Add support for creating tmpfiles - Minor debug improvement for mount failure - Minor crypto cleanup - Add missing module description - mount fix for lease vs. nolease - Add Metze as maintainer for smbdirect - Minor error mapping header cleanup - Improve search speed of SMB1 maperror - Fix potential null ptr ref in smb2 map error tests * tag 'v7.1-rc1-part1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: (26 commits) smb: client: allow both 'lease' and 'nolease' mount options smb: client: get rid of d_drop()+d_add() smb: client: set ATTR_TEMPORARY with O_TMPFILE | O_EXCL smb: client: add support for O_TMPFILE vfs: introduce d_mark_tmpfile_name() MAINTAINERS: create entry for smbdirect smb: client: add missing MODULE_DESCRIPTION() to smb1maperror_test smb: client: fix OOB reads parsing symlink error response smb: client: fix off-by-8 bounds check in check_wsl_eas() smb: client: Remove unnecessary selection of CRYPTO_ECB smb/client: move smb2maperror declarations to smb2proto.h smb/client: introduce KUnit tests to check DOS/SRV err mapping search smb/client: check if SMB1 DOS/SRV error mapping arrays are sorted smb/client: use binary search for SMB1 DOS/SRV error mapping smb/client: autogenerate SMB1 DOS/SRV to POSIX error mapping smb/client: annotate smberr.h with POSIX error codes smb/client: move ERRnetlogonNotStarted to DOS error class smb/client: introduce KUnit test to check ntstatus_to_dos_map search smb/client: check if ntstatus_to_dos_map is sorted smb/client: use binary search for NT status to DOS mapping ... |
||
|
|
0b0128e64a |
Merge tag 'xfs-merge-7.1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Carlos Maiolino: "There aren't any new features. The whole series is just a collection of bug fixes and code refactoring. There is some new information added a couple new tracepoints, new data added to mountstats, but no big changes" * tag 'xfs-merge-7.1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (41 commits) xfs: fix number of GC bvecs xfs: untangle the open zones reporting in mountinfo xfs: expose the number of open zones in sysfs xfs: reduce special casing for the open GC zone xfs: streamline GC zone selection xfs: refactor GC zone selection helpers xfs: rename xfs_zone_gc_iter_next to xfs_zone_gc_iter_irec xfs: put the open zone later xfs_open_zone_put xfs: add a separate tracepoint for stealing an open zone for GC xfs: delay initial open of the GC zone xfs: fix a resource leak in xfs_alloc_buftarg() xfs: handle too many open zones when mounting xfs: refactor xfs_mount_zones xfs: fix integer overflow in busy extent sort comparator xfs: fix integer overflow in deferred intent sort comparators xfs: fold xfs_setattr_size into xfs_vn_setattr_size xfs: remove a duplicate assert in xfs_setattr_size xfs: return default quota limits for IDs without a dquot xfs: start gc on zonegc_low_space attribute updates xfs: don't decrement the buffer LRU count for in-use buffers ... |
||
|
|
230fb3a33e |
Merge tag 'erofs-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang: - Validate xattr h_shared_count to report -EFSCORRUPTED explicitly for crafted images - Verify metadata accesses for file-backed mounts via rw_verify_area() - Fix FS_IOC_GETFSLABEL to include the trailing NUL byte, consistent with ext4 and xfs - Properly handle 48-bit on-disk blocks/uniaddr for extra devices - Fix an index underflow in the LZ4 in-place decompression that can cause out-of-bounds accesses with crafted images - Minor fixes and cleanups * tag 'erofs-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: error out obviously illegal extents in advance erofs: clean up encoded map flags erofs: fix unsigned underflow in z_erofs_lz4_handle_overlap() erofs: handle 48-bit blocks/uniaddr for extra devices erofs: include the trailing NUL in FS_IOC_GETFSLABEL erofs: ensure all folios are managed in erofs_try_to_free_all_cached_folios() erofs: verify metadata accesses for file-backed mounts erofs: harden h_shared_count in erofs_init_inode_xattrs() |
||
|
|
a62fe21079 |
Merge tag 'exfat-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat updates from Namjae Jeon: - Implement FALLOC_FL_ALLOCATE_RANGE to add support for preallocating clusters without zeroing, helping to reduce file fragmentation - Add a unified block readahead helper for FAT chain conversion, bitmap allocation, and directory entry lookups - Optimize exfat_chain_cont_cluster() by caching buffer heads to minimize mark_buffer_dirty() and mirroring overhead during NO_FAT_CHAIN to FAT_CHAIN conversion - Switch to truncate_inode_pages_final() in evict_inode() to prevent BUG_ON caused by shadow entries during reclaim - Fix a 32-bit truncation bug in directory entry calculations by ensuring proper bitwise coercion - Fix sb->s_maxbytes calculation to correctly reflect the maximum possible volume size for a given cluster size, resolving xfstests generic/213 - Introduced exfat_cluster_walk() helper to traverse FAT chains by a specified step, handling both ALLOC_NO_FAT_CHAIN and ALLOC_FAT_CHAIN modes - Introduced exfat_chain_advance() helper to advance an exfat_chain structure, updating both the current cluster and remaining size - Remove dead assignments and fix Smatch warnings * tag 'exfat-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: use exfat_chain_advance helper exfat: introduce exfat_chain_advance helper exfat: remove NULL cache pointer case in exfat_ent_get exfat: use exfat_cluster_walk helper exfat: introduce exfat_cluster_walk helper exfat: fix incorrect directory checksum after rename to shorter name exfat: fix s_maxbytes exfat: fix passing zero to ERR_PTR() in exfat_mkdir() exfat: fix error handling for FAT table operations exfat: optimize exfat_chain_cont_cluster with cached buffer heads exfat: drop redundant sec parameter from exfat_mirror_bh exfat: use readahead helper in exfat_get_dentry exfat: use readahead helper in exfat_allocate_bitmap exfat: add block readahead in exfat_chain_cont_cluster exfat: add fallocate FALLOC_FL_ALLOCATE_RANGE support exfat: Fix bitwise operation having different size exfat: Drop dead assignment of num_clusters exfat: use truncate_inode_pages_final() at evict_inode() |
||
|
|
f2729827ae |
Merge tag 'nilfs2-v7.1-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/nilfs2
Pull nilfs2 updates from Viacheslav Dubeyko:
"This contains fixes of syzbot reported issues in NILFS2 functionality:
- The DAT inode's btree node cache (i_assoc_inode) is initialized
lazily during btree operations.
However, nilfs_mdt_save_to_shadow_map() assumes i_assoc_inode is
already initialized when copying dirty pages to the shadow map
during GC. If NILFS_IOCTL_CLEAN_SEGMENTS is called immediately
after mount before any btree operation has occurred on the DAT
inode, i_assoc_inode is NULL leading to a general protection fault.
Fix this by calling nilfs_attach_btree_node_cache() on the DAT
inode in nilfs_dat_read() at mount time, ensuring i_assoc_inode is
always initialized before any GC operation can use it (Deepanshu
Kartikey)
- nilfs_ioctl_mark_blocks_dirty() uses bd_oblocknr to detect dead
blocks by comparing it with the current block number bd_blocknr. If
they differ, the block is considered dead and skipped.
A corrupted ioctl request with bd_oblocknr set to 0 causes the
comparison to incorrectly match when the lookup returns -ENOENT and
sets bd_blocknr to 0, bypassing the dead block check and calling
nilfs_bmap_mark() on a non- existent block. This causes
nilfs_btree_do_lookup() to return -ENOENT, triggering the
WARN_ON(ret == -ENOENT).
Fix this by rejecting ioctl requests with bd_oblocknr set to 0 at
the beginning of each iteration (Deepanshu Kartikey)"
* tag 'nilfs2-v7.1-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/nilfs2:
nilfs2: reject zero bd_oblocknr in nilfs_ioctl_mark_blocks_dirty()
nilfs2: fix NULL i_assoc_inode dereference in nilfs_mdt_save_to_shadow_map
|
||
|
|
4d9981429a |
Merge tag 'hfs-v7.1-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/hfs
Pull hfsplus updates from Viacheslav Dubeyko:
"This contains several fixes of syzbot reported issues and HFS+ fixes
of xfstests failures.
- Fix a syzbot reported issue of a KMSAN uninit-value in
hfsplus_strcasecmp().
The root cause was that hfs_brec_read() doesn't validate that the
on-disk record size matches the expected size for the record type
being read. The fix introduced hfsplus_brec_read_cat() wrapper that
validates the record size based on the type field and returns -EIO
if size doesn't match (Deepanshu Kartikey)
- Fix a syzbot reported issue of processing corrupted HFS+ images
where the b-tree allocation bitmap indicates that the header node
(Node 0) is free. Node 0 must always be allocated. Violating this
invariant leads to allocator corruption, which cascades into kernel
panics or undefined behavior.
Prevent trusting a corrupted allocator state by adding a validation
check during hfs_btree_open(). If corruption is detected, print a
warning identifying the specific corrupted tree and force the
filesystem to mount read-only (SB_RDONLY).
This prevents kernel panics from corrupted images while enabling
data recovery (Shardul Bankar)
- Fix a potential deadlock in hfsplus_fill_super().
hfsplus_fill_super() calls hfs_find_init() to initialize a search
structure, which acquires tree->tree_lock. If the subsequent call
to hfsplus_cat_build_key() fails, the function jumps to the
out_put_root error label without releasing the lock.
Fix this by adding the missing hfs_find_exit(&fd) call before
jumping to the out_put_root error label. This ensures that
tree->tree_lock is properly released on the error path (Zilin Guan)
- Update a files ctime after rename in hfsplus_rename() (Yangtao Li)
The rest of the patches introduce the HFS+ fixes for the case of
generic/348, generic/728, generic/533, generic/523, and generic/642
test-cases of xfstests suite"
* tag 'hfs-v7.1-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/hfs:
hfsplus: fix generic/642 failure
hfsplus: rework logic of map nodes creation in xattr b-tree
hfsplus: fix logic of alloc/free b-tree node
hfsplus: fix error processing issue in hfs_bmap_free()
hfsplus: fix potential race conditions in b-tree functionality
hfsplus: extract hidden directory search into a helper function
hfsplus: fix held lock freed on hfsplus_fill_super()
hfsplus: fix generic/523 test-case failure
hfsplus: validate b-tree node 0 bitmap at mount time
hfsplus: refactor b-tree map page access and add node-type validation
hfsplus: fix to update ctime after rename
hfsplus: fix generic/533 test-case failure
hfsplus: set ctime after setxattr and removexattr
hfsplus: fix uninit-value by validating catalog record size
hfsplus: fix potential Allocation File corruption after fsync
|
||
|
|
f3756afb6f |
Merge tag 'affs-for-7.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull AFFS fix from David Sterba: "There's a potential out-of-bounds read in the directory hash table during readdir" * tag 'affs-for-7.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: affs: bound hash_pos before table lookup in affs_readdir |
||
|
|
c92b4d3dd5 |
Merge tag 'for-7.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"User visible changes:
- move shutdown ioctl support out of experimental features, a forced
stop of filesystem operation until the next unmount; additionally
there's a super block operation to forcibly remove a device from
under the filesystem that could lead to a shutdown or not if the
redundancy allows that
- report filesystem shutdown using fserror mechanism
- tree-checker updates:
- verify free space info, extent and bitmap items
- verify remap-tree items and related data in block group items
Performance improvements:
- speed up clearing first extent in the tracked range (+10%
throughput on sample workload)
- reduce COW rewrites of extent buffers during the same transaction
- avoid taking big device lock to update device stats during
transaction commit
- fix unnecessary flush on close when truncating empty files
(observed in practice on a backup application)
- prevent direct reclaim during compressed readahead to avoid stalls
under memory pressure
Notable fixes:
- fix chunk allocation strategy on RAID1-like block groups with
disproportionate device sizes, this could lead to ENOSPC due to
skewed reservation estimates
- adjust metadata reservation overcommit ratio to be less aggressive
and also try to flush if possible, this avoids ENOSPC and potential
transaction aborts in some edge cases (that are otherwise hard to
reproduce)
- fix silent IO error in encoded writes and ordered extent split in
zoned mode, the error was not correctly propagated to the address
space and could lead to zeroed ranges
- don't mark inline files NOCOMPRESS unexpectedly, the intent was to
do that for single block writes of regular files
- fix deadlock between reflink and transaction commit when using
flushoncommit
- fix overly strict item check of a running dev-replace operation
Core:
- zoned mode space reservation fixes:
- cap delayed refs metadata reservation to avoid overcommit
- update logic to reclaim partially unusable zones
- add another state to flush and reclaim partially used zone
- limit number of zones reclaimed in one go to avoid blocking
other operations
- don't let log trees consume global reserve on overcommit and fall
back to transaction commit
- revalidate extent buffer when checking its up-to-date status
- add self tests for zoned mode block group specifics
- reduce atomic allocations in some qgroup paths
- avoid unnecessary root node COW during snapshotting
- start new transaction in block group relocation conditionally
- faster check of NOCOW files on currently snapshotted root
- change how compressed bio size is tracked from bio and reduce the
structure size
- new tracepoint for search slot restart tracking
- checksum list manipulation improvements
- type, parameter cleanups, refactoring
- error handling improvements, transaction abort call adjustments"
* tag 'for-7.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (116 commits)
btrfs: btrfs_log_dev_io_error() on all bio errors
btrfs: fix silent IO error loss in encoded writes and zoned split
btrfs: skip clearing EXTENT_DEFRAG for NOCOW ordered extents
btrfs: use BTRFS_FS_UPDATE_UUID_TREE_GEN flag for UUID tree rescan check
btrfs: remove duplicate journal_info reset on failure to commit transaction
btrfs: tag as unlikely if statements that check for fs in error state
btrfs: fix double free in create_space_info() error path
btrfs: fix double free in create_space_info_sub_group() error path
btrfs: do not reject a valid running dev-replace
btrfs: only invalidate btree inode pages after all ebs are released
btrfs: prevent direct reclaim during compressed readahead
btrfs: replace BUG_ON() with error return in cache_save_setup()
btrfs: zstd: don't cache sectorsize in a local variable
btrfs: zlib: don't cache sectorsize in a local variable
btrfs: zlib: drop redundant folio address variable
btrfs: lzo: inline read/write length helpers
btrfs: use common eb range validation in read_extent_buffer_to_user_nofault()
btrfs: read eb folio index right before loops
btrfs: rename local variable for offset in folio
btrfs: unify types for binary search variables
...
|
||
|
|
7fe6ac157b |
Merge tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Add shared memory zero-copy I/O support for ublk, bypassing per-I/O
copies between kernel and userspace by matching registered buffer
PFNs at I/O time. Includes selftests.
- Refactor bio integrity to support filesystem initiated integrity
operations and arbitrary buffer alignment.
- Clean up bio allocation, splitting bio_alloc_bioset() into clear fast
and slow paths. Add bio_await() and bio_submit_or_kill() helpers,
unify synchronous bi_end_io callbacks.
- Fix zone write plug refcount handling and plug removal races. Add
support for serializing zone writes at QD=1 for rotational zoned
devices, yielding significant throughput improvements.
- Add SED-OPAL ioctls for Single User Mode management and a STACK_RESET
command.
- Add io_uring passthrough (uring_cmd) support to the BSG layer.
- Replace pp_buf in partition scanning with struct seq_buf.
- zloop improvements and cleanups.
- drbd genl cleanup, switching to pre_doit/post_doit.
- NVMe pull request via Keith:
- Fabrics authentication updates
- Enhanced block queue limits support
- Workqueue usage updates
- A new write zeroes device quirk
- Tagset cleanup fix for loop device
- MD pull requests via Yu Kuai:
- Fix raid5 soft lockup in retry_aligned_read()
- Fix raid10 deadlock with check operation and nowait requests
- Fix raid1 overlapping writes on writemostly disks
- Fix sysfs deadlock on array_state=clear
- Proactive RAID-5 parity building with llbitmap, with
write_zeroes_unmap optimization for initial sync
- Fix llbitmap barrier ordering, rdev skipping, and bitmap_ops
version mismatch fallback
- Fix bcache use-after-free and uninitialized closure
- Validate raid5 journal metadata payload size
- Various cleanups
- Various other fixes, improvements, and cleanups
* tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (146 commits)
ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
scsi: bsg: fix buffer overflow in scsi_bsg_uring_cmd()
block: refactor blkdev_zone_mgmt_ioctl
MAINTAINERS: update ublk driver maintainer email
Documentation: ublk: address review comments for SHMEM_ZC docs
ublk: allow buffer registration before device is started
ublk: replace xarray with IDA for shmem buffer index allocation
ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
ublk: verify all pages in multi-page bvec fall within registered range
ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
xfs: use bio_await in xfs_zone_gc_reset_sync
block: add a bio_submit_or_kill helper
block: factor out a bio_await helper
block: unify the synchronous bi_end_io callbacks
xfs: fix number of GC bvecs
selftests/ublk: add read-only buffer registration test
selftests/ublk: add filesystem fio verify test for shmem_zc
selftests/ublk: add hugetlbfs shmem_zc test for loop target
selftests/ublk: add shared memory zero-copy test
selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
...
|
||
|
|
3ba310f2a3 |
Merge tag 'lsm-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull LSM updates from Paul Moore:
"We only have five patches in the LSM tree, but three of the five are
for an important bugfix relating to overlayfs and the mmap() and
mprotect() access controls for LSMs. Highlights below:
- Fix problems with the mmap() and mprotect() LSM hooks on overlayfs
As we are dealing with problems both in mmap() and mprotect() there
are essentially two components to this fix, spread across three
patches with all marked for stable.
The simplest portion of the fix is the creation of a new LSM hook,
security_mmap_backing_file(), that is used to enforce LSM mmap()
access controls on backing files in the stacked/overlayfs case. The
existing security_mmap_file() does not have visibility past the
user file. You can see from the associated SELinux hook callback
the code is fairly straightforward.
The mprotect() fix is a bit more complicated as there is no way in
the mprotect() code path to inspect both the user and backing
files, and bolting on a second file reference to vm_area_struct
wasn't really an option.
The solution taken here adds a LSM security blob and associated
hooks to the backing_file struct that LSMs can use to capture and
store relevant information from the user file. While the necessary
SELinux information is relatively small, a single u32, I expect
other LSMs to require more than that, and a dedicated backing_file
LSM blob provides a storage mechanism without negatively impacting
other filesystems.
I want to note that other LSMs beyond SELinux have been involved in
the discussion of the fixes presented here and they are working on
their own related changes using these new hooks, but due to other
issues those patches will be coming at a later date.
- Use kstrdup_const()/kfree_const() for securityfs symlink targets
- Resolve a handful of kernel-doc warnings in cred.h"
* tag 'lsm-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
selinux: fix overlayfs mmap() and mprotect() access checks
lsm: add backing_file LSM hooks
fs: prepare for adding LSM blob to backing_file
securityfs: use kstrdup_const() to manage symlink targets
cred: fix kernel-doc warnings in cred.h
|
||
|
|
ef3da345cc |
Merge tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- coredump: add tracepoint for coredump events
- fs: hide file and bfile caches behind runtime const machinery
Fixes:
- fix architecture-specific compat_ftruncate64 implementations
- dcache: Limit the minimal number of bucket to two
- fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START
- fs/mbcache: cancel shrink work before destroying the cache
- dcache: permit dynamic_dname()s up to NAME_MAX
Cleanups:
- remove or unexport unused fs_context infrastructure
- trivial ->setattr cleanups
- selftests/filesystems: Assume that TIOCGPTPEER is defined
- writeback: fix kernel-doc function name mismatch for wb_put_many()
- autofs: replace manual symlink buffer allocation in autofs_dir_symlink
- init/initramfs.c: trivial fix: FSM -> Finite-state machine
- fs: remove stale and duplicate forward declarations
- readdir: Introduce dirent_size()
- fs: Replace user_access_{begin/end} by scoped user access
- kernel: acct: fix duplicate word in comment
- fs: write a better comment in step_into() concerning .mnt assignment
- fs: attr: fix comment formatting and spelling issues"
* tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
dcache: permit dynamic_dname()s up to NAME_MAX
fs: attr: fix comment formatting and spelling issues
fs: hide file and bfile caches behind runtime const machinery
fs: write a better comment in step_into() concerning .mnt assignment
proc: rename proc_notify_change to proc_setattr
proc: rename proc_setattr to proc_nochmod_setattr
affs: rename affs_notify_change to affs_setattr
adfs: rename adfs_notify_change to adfs_setattr
hfs: update comments on hfs_inode_setattr
kernel: acct: fix duplicate word in comment
fs: Replace user_access_{begin/end} by scoped user access
readdir: Introduce dirent_size()
coredump: add tracepoint for coredump events
fs: remove do_sys_truncate
fs: pass on FTRUNCATE_* flags to do_truncate
fs: fix archiecture-specific compat_ftruncate64
fs: remove stale and duplicate forward declarations
init/initramfs.c: trivial fix: FSM -> Finite-state machine
autofs: replace manual symlink buffer allocation in autofs_dir_symlink
fs/mbcache: cancel shrink work before destroying the cache
...
|
||
|
|
07c3ef5822 |
Merge tag 'vfs-7.1-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull clone and pidfs updates from Christian Brauner:
"Add three new clone3() flags for pidfd-based process lifecycle
management.
CLONE_AUTOREAP:
CLONE_AUTOREAP makes a child process auto-reap on exit without ever
becoming a zombie. This is a per-process property in contrast to
the existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for
SIGCHLD which applies to all children of a given parent.
Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped
property affecting all children which makes it unsuitable for
libraries or applications that need selective auto-reaping of
specific children while still being able to wait() on others.
CLONE_AUTOREAP stores an autoreap flag in the child's
signal_struct. When the child exits do_notify_parent() checks this
flag and causes exit_notify() to transition the task directly to
EXIT_DEAD. Since the flag lives on the child it survives
reparenting: if the original parent exits and the child is
reparented to a subreaper or init the child still auto-reaps when
it eventually exits. This is cleaner than forcing the subreaper to
get SIGCHLD and then reaping it. If the parent doesn't care the
subreaper won't care. If there's a subreaper that would care it
would be easy enough to add a prctl() that either just turns back
on SIGCHLD and turns off auto-reaping or a prctl() that just
notifies the subreaper whenever a child is reparented to it.
CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent
to monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern. No exit signal is delivered so exit_signal must be zero.
CLONE_THREAD and CLONE_PARENT are rejected: CLONE_THREAD because
autoreap is a process-level property, and CLONE_PARENT because an
autoreap child reparented via CLONE_PARENT could become an
invisible zombie under a parent that never calls wait().
The flag is not inherited by the autoreap process's own children.
Each child that should be autoreaped must be explicitly created
with CLONE_AUTOREAP.
CLONE_NNP:
CLONE_NNP sets no_new_privs on the child at clone time. Unlike
prctl(PR_SET_NO_NEW_PRIVS) which a process sets on itself,
CLONE_NNP allows the parent to impose no_new_privs on the child at
creation without affecting the parent's own privileges.
CLONE_THREAD is rejected because threads share credentials.
CLONE_NNP is useful on its own for any spawn-and-sandbox pattern
but was specifically introduced to enable unprivileged usage of
CLONE_PIDFD_AUTOKILL.
CLONE_PIDFD_AUTOKILL:
This flag ties a child's lifetime to the pidfd returned from
clone3(). When the last reference to the struct file created by
clone3() is closed the kernel sends SIGKILL to the child. A pidfd
obtained via pidfd_open() for the same process does not keep the
child alive and does not trigger autokill - only the specific
struct file from clone3() has this property. This is useful for
container runtimes, service managers, and sandboxed subprocess
execution - any scenario where the child must die if the parent
crashes or abandons the pidfd or just wants a throwaway helper
process.
CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP.
It requires CLONE_PIDFD because the whole point is tying the
child's lifetime to the pidfd. It requires CLONE_AUTOREAP because a
killed child with no one to reap it would become a zombie - the
primary use case is the parent crashing or abandoning the pidfd so
no one is around to call waitpid(). CLONE_THREAD is rejected
because autokill targets a process not a thread.
If CLONE_NNP is specified together with CLONE_PIDFD_AUTOKILL an
unprivileged user may spawn a process that is autokilled. The child
cannot escalate privileges via setuid/setgid exec after being
spawned. If CLONE_PIDFD_AUTOKILL is specified without CLONE_NNP the
caller must have have CAP_SYS_ADMIN in its user namespace"
* tag 'vfs-7.1-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: check pidfd_info->coredump_code correctness
pidfds: add coredump_code field to pidfd_info
kselftest/coredump: reintroduce null pointer dereference
selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests
selftests/pidfd: add CLONE_NNP tests
selftests/pidfd: add CLONE_AUTOREAP tests
pidfd: add CLONE_PIDFD_AUTOKILL
clone: add CLONE_NNP
clone: add CLONE_AUTOREAP
|
||
|
|
fc825e513c |
Merge tag 'vfs-7.1-rc1.bh.metadata' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs buffer_head updates from Christian Brauner: "This cleans up the mess that has accumulated over the years in metadata buffer_head tracking for inodes. It moves the tracking into dedicated structure in filesystem-private part of the inode (so that we don't use private_list, private_data, and private_lock in struct address_space), and also moves couple other users of private_data and private_list so these are removed from struct address_space saving 3 longs in struct inode for 99% of inodes" * tag 'vfs-7.1-rc1.bh.metadata' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits) fs: Drop i_private_list from address_space fs: Drop mapping_metadata_bhs from address space ext4: Track metadata bhs in fs-private inode part minix: Track metadata bhs in fs-private inode part udf: Track metadata bhs in fs-private inode part fat: Track metadata bhs in fs-private inode part bfs: Track metadata bhs in fs-private inode part affs: Track metadata bhs in fs-private inode part ext2: Track metadata bhs in fs-private inode part fs: Provide functions for handling mapping_metadata_bhs directly fs: Switch inode_has_buffers() to take mapping_metadata_bhs fs: Make bhs point to mapping_metadata_bhs fs: Move metadata bhs tracking to a separate struct fs: Fold fsync_buffers_list() into sync_mapping_buffers() fs: Drop osync_buffers_list() kvm: Use private inode list instead of i_private_list fs: Remove i_private_data aio: Stop using i_private_data and i_private_lock hugetlbfs: Stop using i_private_data fs: Stop using i_private_data for metadata bh tracking ... |
||
|
|
2802f94072 |
Merge tag 'vfs-7.1-rc1.fat' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull FAT updates from Christian Brauner: "Minor fixes for the fat filesystem" * tag 'vfs-7.1-rc1.fat' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fat: fix stack frame size warnings in KUnit tests fat: add KUnit tests for timestamp conversion helpers |
||
|
|
b7d74ea0fd |
Merge tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs i_ino updates from Christian Brauner: "For historical reasons, the inode->i_ino field is an unsigned long, which means that it's 32 bits on 32 bit architectures. This has caused a number of filesystems to implement hacks to hash a 64-bit identifier into a 32-bit field, and deprives us of a universal identifier field for an inode. This changes the inode->i_ino field from an unsigned long to a u64. This shouldn't make any material difference on 64-bit hosts, but 32-bit hosts will see struct inode grow by at least 4 bytes. This could have effects on slabcache sizes and field alignment. The bulk of the changes are to format strings and tracepoints, since the kernel itself doesn't care that much about the i_ino field. The first patch changes some vfs function arguments, so check that one out carefully. With this change, we may be able to shrink some inode structures. For instance, struct nfs_inode has a fileid field that holds the 64-bit inode number. With this set of changes, that field could be eliminated. I'd rather leave that sort of cleanups for later just to keep this simple" * tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group() EVM: add comment describing why ino field is still unsigned long vfs: remove externs from fs.h on functions modified by i_ino widening treewide: fix missed i_ino format specifier conversions ext4: fix signed format specifier in ext4_load_inode trace event treewide: change inode->i_ino from unsigned long to u64 nilfs2: widen trace event i_ino fields to u64 f2fs: widen trace event i_ino fields to u64 ext4: widen trace event i_ino fields to u64 zonefs: widen trace event i_ino fields to u64 hugetlbfs: widen trace event i_ino fields to u64 ext2: widen trace event i_ino fields to u64 cachefiles: widen trace event i_ino fields to u64 vfs: widen trace event i_ino fields to u64 net: change sock.sk_ino and sock_i_ino() to u64 audit: widen ino fields to u64 vfs: widen inode hash/lookup functions to u64 |
||
|
|
0f00132132 |
Merge tag 'vfs-7.1-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs integrity updates from Christian Brauner: "This adds support to generate and verify integrity information (aka T10 PI) in the file system, instead of the automatic below the covers support that is currently used. The implementation is based on refactoring the existing block layer PI code to be reusable for this use case, and then adding relatively small wrappers for the file system use case. These are then used in iomap to implement the semantics, and wired up in XFS with a small amount of glue code. Compared to the baseline this does not change performance for writes, but increases read performance up to 15% for 4k I/O, with the benefit decreasing with larger I/O sizes as even the baseline maxes out the device quickly on my older enterprise SSD" * tag 'vfs-7.1-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: support T10 protection information iomap: support T10 protection information iomap: support ioends for buffered reads iomap: add a bioset pointer to iomap_read_folio_ops ntfs3: remove copy and pasted iomap code iomap: allow file systems to hook into buffered read bio submission iomap: only call into ->submit_read when there is a read_ctx iomap: pass the iomap_iter to ->submit_read iomap: refactor iomap_bio_read_folio_range block: pass a maxlen argument to bio_iov_iter_bounce block: add fs_bio_integrity helpers block: make max_integrity_io_size public block: prepare generation / verification helpers for fs usage block: add a bdev_has_integrity_csum helper block: factor out a bio_integrity_setup_default helper block: factor out a bio_integrity_action helper |
||
|
|
3383589700 |
Merge tag 'vfs-7.1-rc1.directory' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs directory updates from Christian Brauner: "Recently 'start_creating', 'start_removing', 'start_renaming' and related interfaces were added which combine the locking and the lookup. At that time many callers were changed to use the new interfaces. However there are still an assortment of places out side of the core vfs where the directory is locked explictly, whether with inode_lock() or lock_rename() or similar. These were missed in the first pass for an assortment of uninteresting reasons. This addresses the remaining places where explicit locking is used, and changes them to use the new interfaces, or otherwise removes the explicit locking. The biggest changes are in overlayfs. The other changes are quite simple, though maybe the cachefiles changes is the least simple of those" * tag 'vfs-7.1-rc1.directory' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: VFS: unexport lock_rename(), lock_rename_child(), unlock_rename() ovl: remove ovl_lock_rename_workdir() ovl: use is_subdir() for testing if one thing is a subdir of another ovl: change ovl_create_real() to get a new lock when re-opening created file. ovl: pass name buffer to ovl_start_creating_temp() cachefiles: change cachefiles_bury_object to use start_renaming_dentry() ovl: Simplify ovl_lookup_real_one() VFS: make lookup_one_qstr_excl() static. nfsd: switch purge_old() to use start_removing_noperm() selinux: Use simple_start_creating() / simple_done_creating() Apparmor: Use simple_start_creating() / simple_done_creating() libfs: change simple_done_creating() to use end_creating() VFS: move the start_dirop() kerndoc comment to before start_dirop() fs/proc: Don't lock root inode when creating "self" and "thread-self" VFS: note error returns in documentation for various lookup functions |
||
|
|
c8db08110c |
Merge tag 'vfs-7.1-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs xattr updates from Christian Brauner: "This reworks the simple_xattr infrastructure and adds support for user.* extended attributes on sockets. The simple_xattr subsystem currently uses an rbtree protected by a reader-writer spinlock. This series replaces the rbtree with an rhashtable giving O(1) average-case lookup with RCU-based lockless reads. This sped up concurrent access patterns on tmpfs quite a bit and it's an overall easy enough conversion to do and gets rid or rwlock_t. The conversion is done incrementally: a new rhashtable path is added alongside the existing rbtree, consumers are migrated one at a time (shmem, kernfs, pidfs), and then the rbtree code is removed. All three consumers switch from embedded structs to pointer-based lazy allocation so the rhashtable overhead is only paid for inodes that actually use xattrs. With this infrastructure in place the series adds support for user.* xattrs on sockets. Path-based AF_UNIX sockets inherit xattr support from the underlying filesystem (e.g. tmpfs) but sockets in sockfs - that is everything created via socket() including abstract namespace AF_UNIX sockets - had no xattr support at all. The xattr_permission() checks are reworked to allow user.* xattrs on S_IFSOCK inodes. Sockfs sockets get per-inode limits of 128 xattrs and 128KB total value size matching the limits already in use for kernfs. The practical motivation comes from several directions. systemd and GNOME are expanding their use of Varlink as an IPC mechanism. For D-Bus there are tools like dbus-monitor that can observe IPC traffic across the system but this only works because D-Bus has a central broker. For Varlink there is no broker and there is currently no way to identify which sockets speak Varlink. With user.* xattrs on sockets a service can label its socket with the IPC protocol it speaks (e.g., user.varlink=1) and an eBPF program can then selectively capture traffic on those sockets. Enumerating bound sockets via netlink combined with these xattr labels gives a way to discover all Varlink IPC entrypoints for debugging and introspection. Similarly, systemd-journald wants to use xattrs on the /dev/log socket for protocol negotiation to indicate whether RFC 5424 structured syslog is supported or whether only the legacy RFC 3164 format should be used. In containers these labels are particularly useful as high-privilege or more complicated solutions for socket identification aren't available. The series comes with comprehensive selftests covering path-based AF_UNIX sockets, sockfs socket operations, per-inode limit enforcement, and xattr operations across multiple address families (AF_INET, AF_INET6, AF_NETLINK, AF_PACKET)" * tag 'vfs-7.1-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/xattr: test xattrs on various socket families selftests/xattr: sockfs socket xattr tests selftests/xattr: path-based AF_UNIX socket xattr tests xattr: support extended attributes on sockets xattr,net: support limited amount of extended attributes on sockfs sockets xattr: move user limits for xattrs to generic infra xattr: switch xattr_permission() to switch statement xattr: add xattr_permission_error() xattr: remove rbtree-based simple_xattr infrastructure pidfs: adapt to rhashtable-based simple_xattrs kernfs: adapt to rhashtable-based simple_xattrs with lazy allocation shmem: adapt to rhashtable-based simple_xattrs with lazy allocation xattr: add rhashtable-based simple_xattr infrastructure xattr: add rcu_head and rhash_head to struct simple_xattr |
||
|
|
0e58e3f1c5 |
Merge tag 'vfs-7.1-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs writeback updates from Christian Brauner: "This introduces writeback helper APIs and converts f2fs, gfs2 and nfs to stop accessing writeback internals directly" * tag 'vfs-7.1-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nfs: stop using writeback internals for WB_WRITEBACK accounting gfs2: stop using writeback internals for dirty_exceeded check f2fs: stop using writeback internals for dirty_exceeded checks writeback: prep helpers for dirty-limit and writeback accounting |
||
|
|
4248ed1013 |
smb: client: allow both 'lease' and 'nolease' mount options
Change the nolease mount option from fsparam_flag() to fsparam_flag_no() so that both 'lease' and 'nolease' are accepted as valid mount options. Previously, only 'nolease' was recognized. Passing 'lease' would fail with an unknown parameter error (or be silently ignored with 'sloppy'). With this change: - 'nolease' disables lease requests (same behavior as before) - 'lease' explicitly enables lease requests This also renames the enum value from Opt_nolease to Opt_lease and uses result.negated to set ctx->no_lease, which is the standard pattern used by other flag_no options in the cifs mount option parser. Signed-off-by: Rajasi Mandal <rajasimandal@microsoft.com> Reviewed-by: Meetakshi Setiya <msetiya@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> |
||
|
|
24b8f8dcb9 |
pstore/ftrace: Factor KASLR offset in the core kernel instruction addresses
The pstore ftrace frontend works by purely collecting the instruction address, saving it on the persistent area through the backend and when the log is read, on next boot for example, the address is then resolved by using the regular printk symbol lookup (%pS for example). Problem: if we are running a relocatable kernel with KASLR enabled, this is a recipe for failure in the symbol resolution on next boots, since the addresses are offset'ed by the KASLR address. So, naturally the way to go is factor the KASLR address out of instruction address collection, and adding the fresh offset when resolving the symbol on future boots. Problem #2: modules also have varying addresses that float based on module base address and potentially the module ordering in memory, meaning factoring KASLR offset for them is useless. So, let's hereby only take KASLR offset into account for core kernel addresses, leaving module ones as is. And we have yet a 3rd complexity: not necessarily the check range for core kernel addresses holds true on future boots, since the module base address will vary. With that, the choice was to mark the addresses as being core vs module based on its MSB. And with that... ...we have the 4th challenge here: for some "simple" architectures, the CPU number is saved bit-encoded on the instruction pointer, to allow bigger timestamps - this is set through the PSTORE_CPU_IN_IP define for such architectures. Hence, the approach here is to skip such architectures (at least in a first moment). Finished? No. On top of all previous complexities, we have one extra pain point: kaslr_offset() is inlined and fully "resolved" at boot-time, after kernel decompression, through ELF relocation mechanism. Once the offset is known, it's patched to the kernel text area, wherever it is used. The mechanism, and its users, are only built-in - incompatible with module usage. Though there are possibly some hacks (as computing the offset using some kallsym lookup), the choice here is to restrict this optimization to the (hopefully common) case of CONFIG_PSTORE=y. TL;DR: let's factor KASLR offsets on pstore/ftrace for core kernel addresses, only when PSTORE is built-in and leaving module addresses out, as well as architectures that define PSTORE_CPU_IN_IP. Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Link: https://patch.msgid.link/20260410205848.2607169-1-gpiccoli@igalia.com Signed-off-by: Kees Cook <kees@kernel.org> |
||
|
|
dc0325b0aa |
smb: client: get rid of d_drop()+d_add()
Replace d_drop()+d_add() in cifs_tmpfile() and cifs_create() with d_instantiate(), and in cifs_atomic_open() with d_splice_alias() if in-lookup, otherwise d_instantiate(). Reported-by: Al Viro <viro@zeniv.linux.org.uk> Closes: https://lore.kernel.org/r/20260408065719.GF3836593@ZenIV Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Cc: David Howells <dhowells@redhat.com> Cc: NeilBrown <neilb@ownmail.net> Cc: linux-fsdevel@vger.kernel.org Cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> |
||
|
|
62e02084ab |
smb: client: set ATTR_TEMPORARY with O_TMPFILE | O_EXCL
Set ATTR_TEMPORARY attribute on temporary delete-on-close files when O_EXCL is specified in conjunction with O_TMPFILE to let some servers cache as much data as possible and possibly never persist them into storage, thereby improving performance. Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Cc: David Howells <dhowells@redhat.com> Cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> |
||
|
|
3e7d63037a |
smb: client: add support for O_TMPFILE
Implement O_TMPFILE support for SMB2+ in the CIFS client. Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> |
||
|
|
30a59dddd6 |
vfs: introduce d_mark_tmpfile_name()
CIFS requires O_TMPFILE dentries to have names of newly created delete-on-close files in the server so it can build full pathnames from the root of the share when performing operations on them. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> |
||
|
|
bc1a64d236 |
smb: client: add missing MODULE_DESCRIPTION() to smb1maperror_test
On the latest linux-next following modpost warning is reported: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/smb/client/smb1maperror_test.o Add MODULE_DESCRIPTION() to the test module to fix the warning. Reviewed-by: Saket Kumar Bhaskar <skb99@linux.ibm.com> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Steve French <stfrench@microsoft.com> |
||
|
|
7c6c4ed80b |
Merge tag 'vfs-7.0-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner: "The kernfs rbtree is keyed by (hash, ns, name) where the hash is seeded with the raw namespace pointer via init_name_hash(ns). The resulting hash values are exposed to userspace through readdir seek positions, and the pointer-based ordering in kernfs_name_compare() is observable through entry order. Switch from raw pointers to ns_common::ns_id for both hashing and comparison. A preparatory commit first replaces all const void * namespace parameters with const struct ns_common * throughout kernfs, sysfs, and kobject so the code can access ns->ns_id. Also compare the ns_id when hashes match in the rbtree to handle crafted collisions. Also fix eventpoll RCU grace period issue and a cachefiles refcount problem" * tag 'vfs-7.0-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: kernfs: make directory seek namespace-aware kernfs: use namespace id instead of pointer for hashing and comparison kernfs: pass struct ns_common instead of const void * for namespace tags eventpoll: defer struct eventpoll free to RCU grace period cachefiles: fix incorrect dentry refcount in cachefiles_cull() |
||
|
|
a5242d37c8 |
erofs: error out obviously illegal extents in advance
Detect some corrupted extent cases during metadata parsing rather than letting them result in harmless decompression failures later: - For full-reference compressed extents, the compressed size must not exceed the decompressed size, which is a strict on-disk layout constraint; - For plain (shifted/interlaced) extents, the decoded size must not exceed the encoded size, even accounting for partial decoding. Both ways work but it should be better to report illegal extents as metadata layout violations rather than deferring as decompression failure. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> |
||
|
|
5c40d2e9e3 |
erofs: clean up encoded map flags
- Remove EROFS_MAP_ENCODED since it was always set together with EROFS_MAP_MAPPED for compressed extents and checked redundantly; - Replace the EROFS_MAP_FULL_MAPPED flag with the opposite EROFS_MAP_PARTIAL_MAPPED flag so that extents are implicitly fully mapped initially to simplify the logic; - Make fragment extents independent of EROFS_MAP_MAPPED since they are not directly allocated on disk; thus fragment extents are no longer twisted with mapped extents. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> |
||
|
|
6fa253b38b |
affs: bound hash_pos before table lookup in affs_readdir
affs_readdir() decodes ctx->pos into hash_pos and chain_pos and then dereferences AFFS_HEAD(dir_bh)->table[hash_pos] before validating that hash_pos is within the runtime table bound. Treat out-of-range positions as end-of-directory before the first table lookup. Signed-off-by: Hyungjung Joo <jhj140711@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
b6e39e4846 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-7.0-rc8). Conflicts: net/ipv6/seg6_iptunnel.c |
||
|
|
21e161de2d |
erofs: fix unsigned underflow in z_erofs_lz4_handle_overlap()
Some crafted images can have illegal (!partial_decoding &&
m_llen < m_plen) extents, and the LZ4 inplace decompression path
can be wrongly hit, but it cannot handle (outpages < inpages)
properly: "outpages - inpages" wraps to a large value and
the subsequent rq->out[] access reads past the decompressed_pages
array.
However, such crafted cases can correctly result in a corruption
report in the normal LZ4 non-inplace path.
Let's add an additional check to fix this for backporting.
Reproducible image (base64-encoded gzipped blob):
H4sIAJGR12kCA+3SPUoDQRgG4MkmkkZk8QRbRFIIi9hbpEjrHQI5ghfwCN5BLCzTGtLbBI+g
dilSJo1CnIm7GEXFxhT6PDDwfrs73/ywIQD/1ePD4r7Ou6ETsrq4mu7XcWfj++Pb58nJU/9i
PNtbjhan04/9GtX4qVYc814WDqt6FaX5s+ZwXXeq52lndT6IuVvlblytLMvh4Gzwaf90nsvz
2DF/21+20T/ldgp5s1jXRaN4t/8izsy/OUB6e/Qa79r+JwAAAAAAAL52vQVuGQAAAP6+my1w
ywAAAAAAAADwu14ATsEYtgBQAAA=
$ mount -t erofs -o cache_strategy=disabled foo.erofs /mnt
$ dd if=/mnt/data of=/dev/null bs=4096 count=1
Fixes:
|
||
|
|
cb76a81c7c |
kernfs: make directory seek namespace-aware
The rbtree backing kernfs directories is ordered by (hash, ns_id, name) but kernfs_dir_pos() only searches by hash when seeking to a position during readdir. When two nodes from different namespaces share the same hash value, the binary search can land on a node in the wrong namespace. The subsequent skip-forward loop walks rb_next() and may overshoot the correct node, silently dropping an entry from the readdir results. With the recent switch from raw namespace pointers to public namespace ids as hash seeds, computing hash collisions became an offline operation. An unprivileged user could unshare into a new network namespace, create a single interface whose name-hash collides with a target entry in init_net, and cause a victim's seekdir/readdir on /sys/class/net to miss that entry. Fix this by extending the rbtree search in kernfs_dir_pos() to also compare namespace ids when hashes match. Since the rbtree is already ordered by (hash, ns_id, name), this makes the seek land directly in the correct namespace's range, eliminating the wrong-namespace overshoot. Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
1fe989e1c4 |
kernfs: use namespace id instead of pointer for hashing and comparison
kernfs uses the namespace tag as both a hash seed (via init_name_hash()) and a comparison key in the rbtree. The resulting hash values are exposed to userspace through directory seek positions (ctx->pos), and the raw pointer comparisons in kernfs_name_compare() encode kernel pointer ordering into the rbtree layout. This constitutes a KASLR information leak since the hash and ordering derived from kernel pointers can be observed from userspace. Fix this by using the 64-bit namespace id (ns_common::ns_id) instead of the raw pointer value for both hashing and comparison. The namespace id is a stable, non-secret identifier that is already exposed to userspace through other interfaces (e.g., /proc/pid/ns/, ioctl NS_GET_NSID). Introduce kernfs_ns_id() as a helper that extracts the namespace id from a potentially-NULL ns_common pointer, returning 0 for the no-namespace case. All namespace equality checks in the directory iteration and dentry revalidation paths are also switched from pointer comparison to ns_id comparison for consistency. Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
e3b2cf6e5d |
kernfs: pass struct ns_common instead of const void * for namespace tags
kernfs has historically used const void * to pass around namespace tags used for directory-level namespace filtering. The only current user of this is sysfs network namespace tagging where struct net pointers are cast to void *. Replace all const void * namespace parameters with const struct ns_common * throughout the kernfs, sysfs, and kobject namespace layers. This includes the kobj_ns_type_operations callbacks, kobject_namespace(), and all sysfs/kernfs APIs that accept or return namespace tags. Passing struct ns_common is needed because various codepaths require access to the underlying namespace. A struct ns_common can always be converted back to the concrete namespace type (e.g., struct net) via container_of() or to_ns_common() in the reverse direction. This is a preparatory change for switching to ns_id-based directory iteration to prevent a KASLR pointer leak through the current use of raw namespace pointers as hash seeds and comparison keys. Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
c1307d18ca |
hfsplus: fix generic/642 failure
The xfstests' test-case generic/642 finishes with corrupted HFS+ volume: sudo ./check generic/642 [sudo] password for slavad: FSTYP -- hfsplus PLATFORM -- Linux/x86_64 hfsplus-testing-0001 7.0.0-rc1+ #26 SMP PREEMPT_DYNAMIC Mon Mar 23 17:24:32 PDT 2026 MKFS_OPTIONS -- /dev/loop51 MOUNT_OPTIONS -- /dev/loop51 /mnt/scratch generic/642 6s ... _check_generic_filesystem: filesystem on /dev/loop51 is inconsistent (see xfstests-dev/results//generic/642.full for details) Ran: generic/642 Failures: generic/642 Failed 1 of 1 tests sudo fsck.hfs -d /dev/loop51 ** /dev/loop51 Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K. Executing fsck_hfs (version 540.1-Linux). ** Checking non-journaled HFS Plus Volume. The volume name is untitled ** Checking extents overflow file. ** Checking catalog file. ** Checking multi-linked files. ** Checking catalog hierarchy. ** Checking extended attributes file. invalid free nodes - calculated 1637 header 1260 Invalid B-tree header Invalid map node (8, 0) ** Checking volume bitmap. ** Checking volume information. Verify Status: VIStat = 0x0000, ABTStat = 0xc000 EBTStat = 0x0000 CBTStat = 0x0000 CatStat = 0x00000000 ** Repairing volume. ** Rechecking volume. ** Checking non-journaled HFS Plus Volume. The volume name is untitled ** Checking extents overflow file. ** Checking catalog file. ** Checking multi-linked files. ** Checking catalog hierarchy. ** Checking extended attributes file. ** Checking volume bitmap. ** Checking volume information. ** The volume untitled was repaired successfully. The fsck tool detected that Extended Attributes b-tree is corrupted. Namely, the free nodes number is incorrect and map node bitmap has inconsistent state. Analysis has shown that during b-tree closing there are still some lost b-tree's nodes in the hash out of b-tree structure. But this orphaned b-tree nodes are still accounted as used in map node bitmap: tree_cnid 8, nidx 0, node_count 1408, free_nodes 1403 tree_cnid 8, nidx 1, node_count 1408, free_nodes 1403 tree_cnid 8, nidx 3, node_count 1408, free_nodes 1403 tree_cnid 8, nidx 54, node_count 1408, free_nodes 1403 tree_cnid 8, nidx 67, node_count 1408, free_nodes 1403 tree_cnid 8, nidx 0, prev 0, next 0, parent 0, num_recs 3, type 0x1, height 0 tree_cnid 8, nidx 1, prev 0, next 0, parent 3, num_recs 1, type 0xff, height 1 tree_cnid 8, nidx 3, prev 0, next 0, parent 0, num_recs 1, type 0x0, height 2 tree_cnid 8, nidx 54, prev 29, next 46, parent 3, num_recs 0, type 0xff, height 1 tree_cnid 8, nidx 67, prev 8, next 14, parent 3, num_recs 0, type 0xff, height 1 This issue happens in hfs_bnode_split() logic during detection the possibility of moving half ot the records out of the node. The hfs_bnode_split() contains a loop that implements a roughly 50/50 split of the B-tree node's records by scanning the offset table to find where the data crosses the node's midpoint. If this logic detects the incapability of spliting the node, then it simply calls hfs_bnode_put() for newly created node. However, node is not set as HFS_BNODE_DELETED and real deletion of node doesn't happen. As a result, the empty node becomes orphaned but it is still accounted as used. Finally, fsck tool detects this inconsistency of HFS+ volume. This patch adds call of hfs_bnode_unlink() before hfs_bnode_put() for the case if new node cannot be used for spliting the existing node. sudo ./check generic/642 FSTYP -- hfsplus PLATFORM -- Linux/x86_64 hfsplus-testing-0001 7.0.0-rc1+ #26 SMP PREEMPT_DYNAMIC Fri Apr 3 12:39:13 PDT 2026 MKFS_OPTIONS -- /dev/loop51 MOUNT_OPTIONS -- /dev/loop51 /mnt/scratch generic/642 40s ... 39s Ran: generic/642 Passed all 1 tests Closes: https://github.com/hfs-linux-kernel/hfs-linux-kernel/issues/242 cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> cc: Yangtao Li <frank.li@vivo.com> cc: linux-fsdevel@vger.kernel.org Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Link: https://lore.kernel.org/r/20260403230556.614171-6-slava@dubeyko.com Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> |
||
|
|
732af3aa63 |
hfsplus: rework logic of map nodes creation in xattr b-tree
In hfsplus_init_header_node() when node_count > 63488 (header bitmap capacity), the code calculates map_nodes, subtracts them from free_nodes, and marks their positions used in the bitmap. However, it doesn't write the actual map node structure (type, record offsets, bitmap) for those physical positions, only node 0 is written. This patch reworks hfsplus_create_attributes_file() logic by introducing a specialized method of hfsplus_init_map_node() and writing the allocated map b-tree's nodes by means of hfsplus_write_attributes_file_node() method. cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> cc: Yangtao Li <frank.li@vivo.com> cc: linux-fsdevel@vger.kernel.org Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Link: https://lore.kernel.org/r/20260403230556.614171-5-slava@dubeyko.com Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> |
||
|
|
63584d7676 |
hfsplus: fix logic of alloc/free b-tree node
The hfs_bmap_alloc() and hfs_bmap_free() modify the b-tree's counters and nodes' bitmap of b-tree. However, hfs_btree_write() synchronizes the state of in-core b-tree's counters and node's bitmap with b-tree's descriptor in header node. Postponing this synchronization could result in inconsistent state of file system volume. This patch adds calling of hfs_btree_write() in hfs_bmap_alloc() and hfs_bmap_free() methods. cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> cc: Yangtao Li <frank.li@vivo.com> cc: linux-fsdevel@vger.kernel.org Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Link: https://lore.kernel.org/r/20260403230556.614171-4-slava@dubeyko.com Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> |
||
|
|
cd3901f4c0 |
hfsplus: fix error processing issue in hfs_bmap_free()
Currently, we check only -EINVAL error code in hfs_bmap_free() after calling the hfs_bmap_clear_bit(). It means that other error codes will be silently ignored. This patch adds the checking of all other error codes. cc: Shardul Bankar <shardul.b@mpiricsoftware.com> cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> cc: Yangtao Li <frank.li@vivo.com> cc: linux-fsdevel@vger.kernel.org Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Link: https://lore.kernel.org/r/20260403230556.614171-3-slava@dubeyko.com Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> |
||
|
|
6dca66d7ba |
hfsplus: fix potential race conditions in b-tree functionality
The HFS_BNODE_DELETED flag is checked in hfs_bnode_put() under locked tree->hash_lock. This patch adds locking for the case of setting the HFS_BNODE_DELETED flag in hfs_bnode_unlink() with the goal to avoid potential race conditions. The hfs_btree_write() method should be called under tree->tree_lock. This patch reworks logic by adding locking the tree->tree_lock for the calls of hfs_btree_write() in hfsplus_cat_write_inode() and hfsplus_system_write_inode(). This patch adds also the lockdep_assert_held() in hfs_bmap_reserve(), hfs_bmap_alloc(), and hfs_bmap_free(). cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> cc: Yangtao Li <frank.li@vivo.com> cc: linux-fsdevel@vger.kernel.org Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Link: https://lore.kernel.org/r/20260403230556.614171-2-slava@dubeyko.com Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> |
||
|
|
3df690bba2 |
smb: client: fix OOB reads parsing symlink error response
When a CREATE returns STATUS_STOPPED_ON_SYMLINK, smb2_check_message() returns success without any length validation, leaving the symlink parsers as the only defense against an untrusted server. symlink_data() walks SMB 3.1.1 error contexts with the loop test "p < end", but reads p->ErrorId at offset 4 and p->ErrorDataLength at offset 0. When the server-controlled ErrorDataLength advances p to within 1-7 bytes of end, the next iteration will read past it. When the matching context is found, sym->SymLinkErrorTag is read at offset 4 from p->ErrorContextData with no check that the symlink header itself fits. smb2_parse_symlink_response() then bounds-checks the substitute name using SMB2_SYMLINK_STRUCT_SIZE as the offset of PathBuffer from iov_base. That value is computed as sizeof(smb2_err_rsp) + sizeof(smb2_symlink_err_rsp), which is correct only when ErrorContextCount == 0. With at least one error context the symlink data sits 8 bytes deeper, and each skipped non-matching context shifts it further by 8 + ALIGN(ErrorDataLength, 8). The check is too short, allowing the substitute name read to run past iov_len. The out-of-bound heap bytes are UTF-16-decoded into the symlink target and returned to userspace via readlink(2). Fix this all up by making the loops test require the full context header to fit, rejecting sym if its header runs past end, and bound the substitute name against the actual position of sym->PathBuffer rather than a fixed offset. Because sub_offs and sub_len are 16bits, the pointer math will not overflow here with the new greater-than. Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> Cc: Shyam Prasad N <sprasad@microsoft.com> Cc: Tom Talpey <tom@talpey.com> Cc: Bharath SM <bharathsm@microsoft.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: stable <stable@kernel.org> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Assisted-by: gregkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steve French <stfrench@microsoft.com> |
||
|
|
3d8b9d06bd |
smb: client: fix off-by-8 bounds check in check_wsl_eas()
The bounds check uses (u8 *)ea + nlen + 1 + vlen as the end of the EA name and value, but ea_data sits at offset sizeof(struct smb2_file_full_ea_info) = 8 from ea, not at offset 0. The strncmp() later reads ea->ea_data[0..nlen-1] and the value bytes follow at ea_data[nlen+1..nlen+vlen], so the actual end is ea->ea_data + nlen + 1 + vlen. Isn't pointer math fun? The earlier check (u8 *)ea > end - sizeof(*ea) only guarantees the 8-byte header is in bounds, but since the last EA is placed within 8 bytes of the end of the response, the name and value bytes are read past the end of iov. Fix this mess all up by using ea->ea_data as the base for the bounds check. An "untrusted" server can use this to leak up to 8 bytes of kernel heap into the EA name comparison and influence which WSL xattr the data is interpreted as. Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> Cc: Shyam Prasad N <sprasad@microsoft.com> Cc: Tom Talpey <tom@talpey.com> Cc: Bharath SM <bharathsm@microsoft.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: stable <stable@kernel.org> Assisted-by: gregkh_clanker_t1000 Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steve French <stfrench@microsoft.com> |