linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-18 06:44:00 -04:00

Author	SHA1	Message	Date
Christian Brauner	e3b2cf6e5d	kernfs: pass struct ns_common instead of const void * for namespace tags kernfs has historically used const void * to pass around namespace tags used for directory-level namespace filtering. The only current user of this is sysfs network namespace tagging where struct net pointers are cast to void . Replace all const void namespace parameters with const struct ns_common * throughout the kernfs, sysfs, and kobject namespace layers. This includes the kobj_ns_type_operations callbacks, kobject_namespace(), and all sysfs/kernfs APIs that accept or return namespace tags. Passing struct ns_common is needed because various codepaths require access to the underlying namespace. A struct ns_common can always be converted back to the concrete namespace type (e.g., struct net) via container_of() or to_ns_common() in the reverse direction. This is a preparatory change for switching to ns_id-based directory iteration to prevent a KASLR pointer leak through the current use of raw namespace pointers as hash seeds and comparison keys. Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-04-09 14:36:52 +02:00
Nicholas Carlini	07712db808	eventpoll: defer struct eventpoll free to RCU grace period In certain situations, ep_free() in eventpoll.c will kfree the epi->ep eventpoll struct while it still being used by another concurrent thread. Defer the kfree() to an RCU callback to prevent UAF. Fixes: `f2e467a482` ("eventpoll: Fix semi-unbounded recursion") Signed-off-by: Nicholas Carlini <nicholas@carlini.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-04-02 21:45:02 +02:00
NeilBrown	1635c2acdd	cachefiles: fix incorrect dentry refcount in cachefiles_cull() The patch mentioned below changed cachefiles_bury_object() to expect 2 references to the 'rep' dentry. Three of the callers were changed to use start_removing_dentry() which takes an extra reference so in those cases the call gets the expected references. However there is another call to cachefiles_bury_object() in cachefiles_cull() which did not need to be changed to use start_removing_dentry() and so was not properly considered. It still passed the dentry with just one reference so the net result is that a reference is lost. To meet the expectations of cachefiles_bury_object(), cachefiles_cull() must take an extra reference before the call. It will be dropped by cachefiles_bury_object(). Reported-by: Marc Dionne <marc.dionne@auristor.com> Fixes: `7bb1eb45e4` ("VFS: introduce start_removing_dentry()") Signed-off-by: NeilBrown <neil@brown.name> Link: https://patch.msgid.link/177456350181.1851489.16359967086642190170@noble.neil.brown.name Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-31 11:52:08 +02:00
Linus Torvalds	d0c3bcd5b8	Merge tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull crypto library fix from Eric Biggers: "Fix missing zeroization of the ChaCha state" * tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: lib/crypto: chacha: Zeroize permuted_state before it leaves scope	2026-03-30 13:40:48 -07:00
Linus Torvalds	f1b24d8bdd	Merge tag 'trace-rtla-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull rtla build fix from Steven Rostedt: - Fix build failure when libbpf does not exist RTLA supports building without BPF libraries, but a recent change added a libbpf.h include outside of the BPF protection which caused build failures when libbpf was not installed. * tag 'trace-rtla-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rtla: Fix build without libbpf header	2026-03-30 13:12:00 -07:00
Tomas Glozar	2e8b1a1d12	rtla: Fix build without libbpf header rtla supports building without libbpf. However, BPF actions patchset [1] adds an include of bpf/libbpf.h into timerlat_bpf.h, which breaks build on systems that don't have libbpf headers installed. This is a leftover from a draft version of the patchset where timerlat_bpf_set_action() (which takes a struct bpf_program * argument) was defined in the header. timerlat_bpf.c already includes bpf/libbpf.h via timerlat.skel.h when libbpf is present. Remove the redundant include to fix build on systems without libbpf headers. [1] https://lore.kernel.org/linux-trace-kernel/20251126144205.331954-1-tglozar@redhat.com/T/ Cc: John Kacur <jkacur@redhat.com> Cc: Luis Goncalves <lgoncalv@redhat.com> Cc: Crystal Wood <crwood@redhat.com> Cc: Costa Shulyupin <costa.shul@redhat.com> Link: https://patch.msgid.link/20260330091207.16184-1-tglozar@redhat.com Reported-by: Steven Rostedt (Google) <rostedt@goodmis.org> Closes: https://lore.kernel.org/linux-trace-kernel/20260329122202.65a8b575@robin/ Fixes: `8cd0f08ac7` ("rtla/timerlat: Support tail call from BPF program") Signed-off-by: Tomas Glozar <tglozar@redhat.com> Reviewed-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-30 12:44:48 -04:00
Linus Torvalds	7aaa8047ea	Linux 7.0-rc6 v7.0-rc6	2026-03-29 15:40:00 -07:00
Linus Torvalds	d1384f70b2	Merge tag 'vfs-7.0-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix netfs_limit_iter() hitting BUG() when an ITER_KVEC iterator reaches it via core dump writes to 9P filesystems. Add ITER_KVEC handling following the same pattern as the existing ITER_BVEC code. - Fix a NULL pointer dereference in the netfs unbuffered write retry path when the filesystem (e.g., 9P) doesn't set the prepare_write operation. - Clear I_DIRTY_TIME in sync_lazytime for filesystems implementing ->sync_lazytime. Without this the flag stays set and may cause additional unnecessary calls during inode deactivation. - Increase tmpfs size in mount_setattr selftests. A recent commit bumped the ext4 image size to 2 GB but didn't adjust the tmpfs backing store, so mkfs.ext4 fails with ENOSPC writing metadata. - Fix an invalid folio access in iomap when i_blkbits matches the folio size but differs from the I/O granularity. The cur_folio pointer would not get invalidated and iomap_read_end() would still be called on it despite the IO helper owning it. - Fix hash_name() docstring. - Fix read abandonment during netfs retry where the subreq variable used for abandonment could be uninitialized on the first pass or point to a deleted subrequest on later passes. - Don't block sync for filesystems with no data integrity guarantees. Add a SB_I_NO_DATA_INTEGRITY superblock flag replacing the per-inode AS_NO_DATA_INTEGRITY mapping flag so sync kicks off writeback but doesn't wait for flusher threads. This fixes a suspend-to-RAM hang on fuse-overlayfs where the flusher thread blocks when the fuse daemon is frozen. - Fix a lockdep splat in iomap when reads fail. iomap_read_end_io() invokes fserror_report() which calls igrab() taking i_lock in hardirq context while i_lock is normally held with interrupts enabled. Kick failed read handling to a workqueue. - Remove the redundant netfs_io_stream::front member and use stream->subrequests.next instead, fixing a potential issue in the direct write code path. * tag 'vfs-7.0-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix the handling of stream->front by removing it iomap: fix lockdep complaint when reads fail writeback: don't block sync for filesystems with no data integrity guarantees netfs: Fix read abandonment during retry vfs: fix docstring of hash_name() iomap: fix invalid folio access when i_blkbits differs from I/O granularity selftests/mount_setattr: increase tmpfs size for idmapped mount tests fs: clear I_DIRTY_TIME in sync_lazytime netfs: Fix NULL pointer dereference in netfs_unbuffered_write() on retry netfs: Fix kernel BUG in netfs_limit_iter() for ITER_KVEC iterators	2026-03-29 15:24:28 -07:00
Linus Torvalds	fc9eae25ec	Merge tag 'phy-fixes-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy Pull phy fixes from Vinod Koul: - Qualcomm PCS table fix for ufs phy - TI device node reference fix - Common prop kconfig fix - lynx CDR lock workaround for lanes disabled - usb disconnect function fix of k1 driver * tag 'phy-fixes-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy: phy: qcom: qmp-ufs: Fix SM8650 PCS table for Gear 4 phy: ti: j721e-wiz: Fix device node reference leak in wiz_get_lane_phy_types() phy: k1-usb: add disconnect function support phy: lynx-28g: skip CDR lock workaround for lanes disabled in the device tree phy: make PHY_COMMON_PROPS Kconfig symbol conditionally user-selectable	2026-03-29 12:48:52 -07:00
Linus Torvalds	a516c618a6	Merge tag 'dmaengine-fix-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine Pull dmaengine fixes from Vinod Koul: "A bunch of driver fixes with idxd ones being the biggest: - Xilinx regmap init error handling, dma_device directions, residue calculation, and reset related timeout fixes - Renesas CHCTRL updates and driver list fixes - DW HDMA cycle bits and MSI data programming fix - IDXD pile of fixes for memeory leak and FLR fixes" * tag 'dmaengine-fix-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (21 commits) dmaengine: xilinx_dma: Fix reset related timeout with two-channel AXIDMA dmaengine: xilinx: xilinx_dma: Fix unmasked residue subtraction dmaengine: xilinx: xilinx_dma: Fix residue calculation for cyclic DMA dmaengine: xilinx: xilinx_dma: Fix dma_device directions dmaengine: sh: rz-dmac: Move CHCTRL updates under spinlock dmaengine: sh: rz-dmac: Protect the driver specific lists dmaengine: idxd: fix possible wrong descriptor completion in llist_abort_desc() dmaengine: xilinx: xdma: Fix regmap init error handling dmaengine: dw-edma: Fix multiple times setting of the CYCLE_STATE and CYCLE_BIT bits for HDMA. dmaengine: idxd: Fix leaking event log memory dmaengine: idxd: Fix freeing the allocated ida too late dmaengine: idxd: Fix memory leak when a wq is reset dmaengine: idxd: Fix not releasing workqueue on .release() dmaengine: idxd: Wait for submitted operations on .device_synchronize() dmaengine: idxd: Flush all pending descriptors dmaengine: idxd: Flush kernel workqueues on Function Level Reset dmaengine: idxd: Fix possible invalid memory access after FLR dmaengine: idxd: Fix crash when the event log is disabled dmaengine: idxd: Fix lockdep warnings when calling idxd_device_config() dmaengine: dw-edma: fix MSI data programming for multi-IRQ case ...	2026-03-29 12:42:31 -07:00
Linus Torvalds	32ee88daf7	Merge tag 'i2c-for-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: - designware: fix resume-probe race causing NULL-deref in amdisp - imx: fix timeout on repeated reads and extra clock at end - MAINTAINERS: drop outdated I2C website * tag 'i2c-for-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: MAINTAINERS: drop outdated I2C website i2c: designware: amdisp: Fix resume-probe race condition issue i2c: imx: ensure no clock is generated after last read i2c: imx: fix i2c issue when reading multiple messages	2026-03-29 12:27:13 -07:00
Linus Torvalds	ac354b5cb0	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "s390: - Lots of small and not-so-small fixes for the newly rewritten gmap, mostly affecting the handling of nested guests. x86: - Fix an issue with shadow paging, which causes KVM to install an MMIO PTE in the shadow page tables without first zapping a non-MMIO SPTE if KVM didn't see the write that modified the shadowed guest PTE. While commit `a54aa15c6b` ("KVM: x86/mmu: Handle MMIO SPTEs directly in mmu_set_spte()") was right about it being impossible to miss such a write if it was coming from the guest, it failed to account for writes to guest memory that are outside the scope of KVM: if userspace modifies the guest PTE, and then the guest hits a relevant page fault, KVM will get confused" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86/mmu: Only WARN in direct MMUs when overwriting shadow-present SPTE KVM: x86/mmu: Drop/zap existing present SPTE even when creating an MMIO SPTE KVM: s390: Fix KVM_S390_VCPU_FAULT ioctl KVM: s390: vsie: Fix guest page tables protection KVM: s390: vsie: Fix unshadowing while shadowing KVM: s390: vsie: Fix refcount overflow for shadow gmaps KVM: s390: vsie: Fix nested guest memory shadowing KVM: s390: Correctly handle guest mappings without struct page KVM: s390: Fix gmap_link() KVM: s390: vsie: Fix check for pre-existing shadow mapping KVM: s390: Remove non-atomic dat_crstep_xchg() KVM: s390: vsie: Fix dat_split_ste()	2026-03-29 11:58:47 -07:00
Linus Torvalds	b8a3bc8567	Merge tag 'for-linus-7.0a-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fix from Juergen Gross: "A single fix for a very rare bug introduced in rc5" * tag 'for-linus-7.0a-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/privcmd: unregister xenstore notifier on module exit	2026-03-29 11:51:37 -07:00
Linus Torvalds	f242ac4a09	Merge tag 'x86-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Fix an early boot crash in AMD SEV-SNP guests, caused by incorrect FSGSBASE init ordering (Nikunj A Dadhania) - Remove X86_CR4_FRED from the CR4 pinned bits mask, to fix a race window during the bootup of SEV-{ES,SNP} or TDX guests, which can crash them if they trigger exceptions in that window (Borislav Petkov) - Fix early boot failures on SEV-ES/SNP guests, due to incorrect early GHCB access (Nikunj A Dadhania) - Add clarifying comment to the CRn pinning logic, to avoid future confusion & bugs (Peter Zijlstra) * tag 'x86-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Add comment clarifying CRn pinning x86/fred: Fix early boot failures on SEV-ES/SNP guests x86/cpu: Remove X86_CR4_FRED from the CR4 pinned bits mask x86/cpu: Enable FSGSBASE early in cpu_init_exception_handling()	2026-03-29 10:04:37 -07:00
Linus Torvalds	47e3f23f0e	Merge tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix an argument order bug in the alarm timer forwarding logic, which may cause missed expirations or incorrect overrun accounting" * tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: alarmtimer: Fix argument order in alarm_timer_forward()	2026-03-29 10:02:38 -07:00
Linus Torvalds	f087b0bad4	Merge tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex fixes from Ingo Molnar: - Tighten up the sys_futex_requeue() ABI a bit, to disallow dissimilar futex flags and potential UaF access (Peter Zijlstra) - Fix UaF between futex_key_to_node_opt() and vma_replace_policy() (Hao-Yu Yang) - Clear stale exiting pointer in futex_lock_pi() retry path, which triggered a warning (and potential misbehavior) in stress-testing (Davidlohr Bueso) * tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Clear stale exiting pointer in futex_lock_pi() retry path futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy() futex: Require sys_futex_requeue() to have identical flags	2026-03-29 09:59:46 -07:00
Linus Torvalds	21047b17b3	Merge tag 'irq-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: - Fix TX completion signaling bug in the Qualcomm MPM irqchip driver - Fix probe error handling in the Renesas RZ/V2H(P) irqchip driver * tag 'irq-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/renesas-rzv2h: Fix error path in rzv2h_icu_probe_common() irqchip/qcom-mpm: Add missing mailbox TX done acknowledgment	2026-03-29 09:53:01 -07:00
Linus Torvalds	a3d97d1d3f	Merge tag 'ovl-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs fixes from Amir Goldstein: - Fix regression in 'xino' feature detection I clumsily introduced this regression myself when working on another subsystem (fsnotify). Both the regression and the fix have almost no visible impact on users except for some kmsg prints. - Fix to performance regression in v6.12. This regression was reported by Google COS developers. It is not uncommon these days for the year-old mature LTS to get adopted by distros and get exposed to many new workloads. We made a sub-smart move of making a behavior change in v6.12 which could impact performance, without making it opt-in. Fixing this mistake retroactively, to be picked by LTS. * tag 'ovl-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: make fsync after metadata copy-up opt-in mount option ovl: fix wrong detection of 32bit inode numbers	2026-03-29 09:34:50 -07:00
Linus Torvalds	241d4ca15d	Merge tag 'ext4_for_linus-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: - Update the MAINTAINERS file to add reviewers for the ext4 file system - Add a test issue an ext4 warning (not a WARN_ON) if there are still dirty pages attached to an evicted inode. - Fix a number of Syzkaller issues - Fix memory leaks on error paths - Replace some BUG and WARN with EFSCORRUPTED reporting - Fix a potential crash when disabling discard via remount followed by an immediate unmount. (Found by Sashiko) - Fix a corner case which could lead to allocating blocks for an indirect-mapped inode block numbers > 2*32 - Fix a race when reallocating a freed inode that could result in a deadlock - Fix a user-after-free in update_super_work when racing with umount - Fix build issues when trying to build ext4's kunit tests as a module - Fix a bug where ext4_split_extent_zeroout() could fail to pass back an error from ext4_ext_dirty() - Avoid allocating blocks from a corrupted block group in ext4_mb_find_by_goal() - Fix a percpu_counters list corruption BUG triggered by an ext4 extents kunit - Fix a potetial crash caused by the fast commit flush path potentially accessing the jinode structure before it is fully initialized - Fix fsync(2) in no-journal mode to make sure the dirtied inode is write to storage - Fix a bug when in no-journal mode, when ext4 tries to avoid using recently deleted inodes, if lazy itable initialization is enabled, can lead to an unitialized inode getting skipped and triggering an e2fsck complaint - Fix journal credit calculation when setting an xattr when both the encryption and ea_inode feeatures are enabled - Fix corner cases which could result in stale xarray tags after writeback - Fix generic/475 failures caused by ENOSPC errors while creating a symlink when the system crashes resulting to a file system inconsistency when replaying the fast commit journal tag 'ext4_for_linus-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (27 commits) ext4: always drain queued discard work in ext4_mb_release() ext4: handle wraparound when searching for blocks for indirect mapped blocks ext4: skip split extent recovery on corruption ext4: fix iloc.bh leak in ext4_fc_replay_inode() error paths ext4: fix deadlock on inode reallocation ext4: fix use-after-free in update_super_work when racing with umount ext4: fix the might_sleep() warnings in kvfree() ext4: reject mount if bigalloc with s_first_data_block != 0 ext4: fix extents-test.c is not compiled when EXT4_KUNIT_TESTS=M ext4: fix mballoc-test.c is not compiled when EXT4_KUNIT_TESTS=M ext4: introduce EXPORT_SYMBOL_FOR_EXT4_TEST() helper jbd2: gracefully abort on checkpointing state corruptions ext4: avoid infinite loops caused by residual data ext4: validate p_idx bounds in ext4_ext_correct_indexes ext4: test if inode's all dirty pages are submitted to disk ext4: minor fix for ext4_split_extent_zeroout() ext4: avoid allocate block from corrupted group in ext4_mb_find_by_goal() ext4: kunit: extents-test: lix percpu_counters list corruption ext4: publish jinode after initialization ext4: replace BUG_ON with proper error handling in ext4_read_inline_folio ...	2026-03-29 09:30:06 -07:00
Linus Torvalds	b51ad67773	Merge tag 'for-7.0-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes. There's one that stands out in size as it fixes an edge case in fsync. - fix issue on fsync where file with zero size appears as a non-zero after log replay - in zlib compression, handle a crash when data alignment causes folio reference issues - fix possible crash with enabled tracepoints on a overlayfs mount - handle device stats update error - on zoned filesystems, fix kobject leak on sub-block groups - fix super block offset in an error message in validation" * tag 'for-7.0-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix lost error when running device stats on multiple devices fs btrfs: tracepoints: get correct superblock from dentry in event btrfs_sync_file() btrfs: zlib: handle page aligned compressed size correctly btrfs: fix leak of kobject name for sub-group space_info btrfs: fix zero size inode with non-zero size after log replay btrfs: fix super block offset in error message in btrfs_validate_super()	2026-03-28 15:23:03 -07:00
Linus Torvalds	0bcb517f0a	Merge tag 'mm-hotfixes-stable-2026-03-28-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "10 hotfixes. 8 are cc:stable. 9 are for MM. There's a 3-patch series of DAMON fixes from Josh Law and SeongJae Park. The rest are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-03-28-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/mseal: update VMA end correctly on merge bug: avoid format attribute warning for clang as well mm/pagewalk: fix race between concurrent split and refault mm/memory: fix PMD/PUD checks in follow_pfnmap_start() mm/damon/sysfs: check contexts->nr in repeat_call_fn mm/damon/sysfs: check contexts->nr before accessing contexts_arr[0] mm/damon/sysfs: fix param_ctx leak on damon_sysfs_new_test_ctx() failure mm/swap: fix swap cache memcg accounting MAINTAINERS, mailmap: update email address for Harry Yoo mm/huge_memory: fix folio isn't locked in softleaf_to_folio()	2026-03-28 14:19:55 -07:00
Wolfram Sang	b0faf733fc	MAINTAINERS: drop outdated I2C website As stated on the website: "This wiki has been archived and the content is no longer updated." No need to reference it. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>	2026-03-28 20:31:33 +01:00
Linus Torvalds	cbfffcca2b	Merge tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix potential deadlock in osnoise and hotplug The interface_lock can be called by a osnoise thread and the CPU shutdown logic of osnoise can wait for this thread to finish. But cpus_read_lock() can also be taken while holding the interface_lock. This produces a circular lock dependency and can cause a deadlock. Swap the ordering of cpus_read_lock() and the interface_lock to have interface_lock taken within the cpus_read_lock() context to prevent this circular dependency. - Fix freeing of event triggers in early boot up If the same trigger is added on the kernel command line, the second one will fail to be applied and the trigger created will be freed. This calls into the deferred logic and creates a kernel thread to do the freeing. But the command line logic is called before kernel threads can be created and this leads to a NULL pointer dereference. Delay freeing event triggers until late init. * tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Drain deferred trigger frees if kthread creation fails tracing: Fix potential deadlock in cpu hotplug with osnoise	2026-03-28 09:59:09 -07:00
Linus Torvalds	e522b75c44	Merge tag 's390-7.0-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Vasily Gorbik: - Add array_index_nospec() to syscall dispatch table lookup to prevent limited speculative out-of-bounds access with user-controlled syscall number - Mark array_index_mask_nospec() __always_inline since GCC may emit an out-of-line call instead of the inline data dependency sequence the mitigation relies on - Clear r12 on kernel entry to prevent potential speculative use of user value in system_call, ext/io/mcck interrupt handlers * tag 's390-7.0-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/entry: Scrub r12 register on kernel entry s390/syscalls: Add spectre boundary for syscall dispatch table s390/barrier: Make array_index_mask_nospec() __always_inline	2026-03-28 09:50:11 -07:00
Davidlohr Bueso	210d36d892	futex: Clear stale exiting pointer in futex_lock_pi() retry path Fuzzying/stressing futexes triggered: WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524 When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY and stores a refcounted task pointer in 'exiting'. After wait_for_owner_exiting() consumes that reference, the local pointer is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a different error, the bogus pointer is passed to wait_for_owner_exiting(). CPU0 CPU1 CPU2 futex_lock_pi(uaddr) // acquires the PI futex exit() futex_cleanup_begin() futex_state = EXITING; futex_lock_pi(uaddr) futex_lock_pi_atomic() attach_to_pi_owner() // observes EXITING exiting = owner; // takes ref return -EBUSY wait_for_owner_exiting(-EBUSY, owner) put_task_struct(); // drops ref // exiting still points to owner goto retry; futex_lock_pi_atomic() lock_pi_update_atomic() cmpxchg(uaddr) uaddr ^= WAITERS // whatever // value changed return -EAGAIN; wait_for_owner_exiting(-EAGAIN, exiting) // stale WARN_ON_ONCE(exiting) Fix this by resetting upon retry, essentially aligning it with requeue_pi. Fixes: `3ef240eaff` ("futex: Prevent exit livelock") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net	2026-03-28 13:54:02 +01:00
Wesley Atwell	250ab25391	tracing: Drain deferred trigger frees if kthread creation fails Boot-time trigger registration can fail before the trigger-data cleanup kthread exists. Deferring those frees until late init is fine, but the post-boot fallback must still drain the deferred list if kthread creation never succeeds. Otherwise, boot-deferred nodes can accumulate on trigger_data_free_list, later frees fall back to synchronously freeing only the current object, and the older queued entries are leaked forever. To trigger this, add the following to the kernel command line: trace_event=sched_switch trace_trigger=sched_switch.traceon,sched_switch.traceon The second traceon trigger will fail and be freed. This triggers a NULL pointer dereference and crashes the kernel. Keep the deferred boot-time behavior, but when kthread creation fails, drain the whole queued list synchronously. Do the same in the late-init drain path so queued entries are not stranded there either. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260324221326.1395799-3-atwellwea@gmail.com Fixes: `61d445af0a` ("tracing: Add bulk garbage collection of freeing event_trigger_data") Signed-off-by: Wesley Atwell <atwellwea@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-28 08:32:44 -04:00
Lorenzo Stoakes (Oracle)	2697dd8ae7	mm/mseal: update VMA end correctly on merge Previously we stored the end of the current VMA in curr_end, and then upon iterating to the next VMA updated curr_start to curr_end to advance to the next VMA. However, this doesn't take into account the fact that a VMA might be updated due to a merge by vma_modify_flags(), which can result in curr_end being stale and thus, upon setting curr_start to curr_end, ending up with an incorrect curr_start on the next iteration. Resolve the issue by setting curr_end to vma->vm_end unconditionally to ensure this value remains updated should this occur. While we're here, eliminate this entire class of bug by simply setting const curr_[start/end] to be clamped to the input range and VMAs, which also happens to simplify the logic. Link: https://lkml.kernel.org/r/20260327173104.322405-1-ljs@kernel.org Fixes: `6c2da14ae1` ("mm/mseal: rework mseal apply logic") Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reported-by: Antonius <antonius@bluedragonsec.com> Closes: https://lore.kernel.org/linux-mm/CAK8a0jwWGj9-SgFk0yKFh7i8jMkwKm5b0ao9=kmXWjO54veX2g@mail.gmail.com/ Suggested-by: David Hildenbrand (ARM) <david@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 20:48:38 -07:00
Arnd Bergmann	2598ab9d63	bug: avoid format attribute warning for clang as well Like gcc, clang-22 now also warns about a function that it incorrectly identifies as a printf-style format: lib/bug.c:190:22: error: diagnostic behavior may be improved by adding the 'format(printf, 1, 0)' attribute to the declaration of '__warn_printf' [-Werror,-Wmissing-format-attribute] 179 \| static void __warn_printf(const char fmt, struct pt_regs regs) \| __attribute__((format(printf, 1, 0))) 180 \| { 181 \| if (!fmt) 182 \| return; 183 \| 184 \| #ifdef HAVE_ARCH_BUG_FORMAT_ARGS 185 \| if (regs) { 186 \| struct arch_va_list _args; 187 \| va_list args = __warn_args(&_args, regs); 188 \| 189 \| if (args) { 190 \| vprintk(fmt, args); \| ^ Revert the change that added a gcc-specific workaround, and instead add the generic annotation that avoid the warning. Link: https://lkml.kernel.org/r/20260323205534.1284284-1-arnd@kernel.org Fixes: `d36067d6ea` ("bug: Hush suggest-attribute=format for __warn_printf()") Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Suggested-by: Brendan Jackman <jackmanb@google.com> Link: https://lore.kernel.org/all/20251208141618.2805983-1-andriy.shevchenko@linux.intel.com/T/#u Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bill Wendling <morbo@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Justin Stitt <justinstitt@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 20:48:38 -07:00
Max Boone	3b89863c3f	mm/pagewalk: fix race between concurrent split and refault The splitting of a PUD entry in walk_pud_range() can race with a concurrent thread refaulting the PUD leaf entry causing it to try walking a PMD range that has disappeared. An example and reproduction of this is to try reading numa_maps of a process while VFIO-PCI is setting up DMA (specifically the vfio_pin_pages_remote call) on a large BAR for that process. This will trigger a kernel BUG: vfio-pci 0000:03:00.0: enabling device (0000 -> 0002) BUG: unable to handle page fault for address: ffffa23980000000 PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP NOPTI ... RIP: 0010:walk_pgd_range+0x3b5/0x7a0 Code: 8d 43 ff 48 89 44 24 28 4d 89 ce 4d 8d a7 00 00 20 00 48 8b 4c 24 28 49 81 e4 00 00 e0 ff 49 8d 44 24 ff 48 39 c8 4c 0f 43 e3 <49> f7 06 9f ff ff ff 75 3b 48 8b 44 24 20 48 8b 40 28 48 85 c0 74 RSP: 0018:ffffac23e1ecf808 EFLAGS: 00010287 RAX: 00007f44c01fffff RBX: 00007f4500000000 RCX: 00007f44ffffffff RDX: 0000000000000000 RSI: 000ffffffffff000 RDI: ffffffff93378fe0 RBP: ffffac23e1ecf918 R08: 0000000000000004 R09: ffffa23980000000 R10: 0000000000000020 R11: 0000000000000004 R12: 00007f44c0200000 R13: 00007f44c0000000 R14: ffffa23980000000 R15: 00007f44c0000000 FS: 00007fe884739580(0000) GS:ffff9b7d7a9c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffa23980000000 CR3: 000000c0650e2005 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <TASK> __walk_page_range+0x195/0x1b0 walk_page_vma+0x62/0xc0 show_numa_map+0x12b/0x3b0 seq_read_iter+0x297/0x440 seq_read+0x11d/0x140 vfs_read+0xc2/0x340 ksys_read+0x5f/0xe0 do_syscall_64+0x68/0x130 ? get_page_from_freelist+0x5c2/0x17e0 ? mas_store_prealloc+0x17e/0x360 ? vma_set_page_prot+0x4c/0xa0 ? __alloc_pages_noprof+0x14e/0x2d0 ? __mod_memcg_lruvec_state+0x8d/0x140 ? __lruvec_stat_mod_folio+0x76/0xb0 ? __folio_mod_stat+0x26/0x80 ? do_anonymous_page+0x705/0x900 ? __handle_mm_fault+0xa8d/0x1000 ? __count_memcg_events+0x53/0xf0 ? handle_mm_fault+0xa5/0x360 ? do_user_addr_fault+0x342/0x640 ? arch_exit_to_user_mode_prepare.constprop.0+0x16/0xa0 ? irqentry_exit_to_user_mode+0x24/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fe88464f47e Code: c0 e9 b6 fe ff ff 50 48 8d 3d be 07 0b 00 e8 69 01 02 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28 RSP: 002b:00007ffe6cd9a9b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fe88464f47e RDX: 0000000000020000 RSI: 00007fe884543000 RDI: 0000000000000003 RBP: 00007fe884543000 R08: 00007fe884542010 R09: 0000000000000000 R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 </TASK> Fix this by validating the PUD entry in walk_pmd_range() using a stable snapshot (pudp_get()). If the PUD is not present or is a leaf, retry the walk via ACTION_AGAIN instead of descending further. This mirrors the retry logic in walk_pte_range(), which lets walk_pmd_range() retry if the PTE is not being got by pte_offset_map_lock(). Link: https://lkml.kernel.org/r/20260325-pagewalk-check-pmd-refault-v2-1-707bff33bc60@akamai.com Fixes: `f9e54c3a2f` ("vfio/pci: implement huge_fault support") Co-developed-by: David Hildenbrand (Arm) <david@kernel.org> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Signed-off-by: Max Boone <mboone@akamai.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 20:48:38 -07:00
David Hildenbrand (Arm)	ffef67b93a	mm/memory: fix PMD/PUD checks in follow_pfnmap_start() follow_pfnmap_start() suffers from two problems: (1) We are not re-fetching the pmd/pud after taking the PTL Therefore, we are not properly stabilizing what the lock actually protects. If there is concurrent zapping, we would indicate to the caller that we found an entry, however, that entry might already have been invalidated, or contain a different PFN after taking the lock. Properly use pmdp_get() / pudp_get() after taking the lock. (2) pmd_leaf() / pud_leaf() are not well defined on non-present entries pmd_leaf()/pud_leaf() could wrongly trigger on non-present entries. There is no real guarantee that pmd_leaf()/pud_leaf() returns something reasonable on non-present entries. Most architectures indeed either perform a present check or make it work by smart use of flags. However, for example loongarch checks the _PAGE_HUGE flag in pmd_leaf(), and always sets the _PAGE_HUGE flag in __swp_entry_to_pmd(). Whereby pmd_trans_huge() explicitly checks pmd_present(), pmd_leaf() does not do that. Let's check pmd_present()/pud_present() before assuming "the is a present PMD leaf" when spotting pmd_leaf()/pud_leaf(), like other page table handling code that traverses user page tables does. Given that non-present PMD entries are likely rare in VM_IO\|VM_PFNMAP, (1) is likely more relevant than (2). It is questionable how often (1) would actually trigger, but let's CC stable to be sure. This was found by code inspection. Link: https://lkml.kernel.org/r/20260323-follow_pfnmap_fix-v1-1-5b0ec10872b3@kernel.org Fixes: `6da8e9634b` ("mm: new follow_pfnmap API") Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 20:48:38 -07:00
Josh Law	6557004a8b	mm/damon/sysfs: check contexts->nr in repeat_call_fn damon_sysfs_repeat_call_fn() calls damon_sysfs_upd_tuned_intervals(), damon_sysfs_upd_schemes_stats(), and damon_sysfs_upd_schemes_effective_quotas() without checking contexts->nr. If nr_contexts is set to 0 via sysfs while DAMON is running, these functions dereference contexts_arr[0] and cause a NULL pointer dereference. Add the missing check. For example, the issue can be reproduced using DAMON sysfs interface and DAMON user-space tool (damo) [1] like below. $ sudo damo start --refresh_interval 1s $ echo 0 \| sudo tee \ /sys/kernel/mm/damon/admin/kdamonds/0/contexts/nr_contexts Link: https://patch.msgid.link/20260320163559.178101-3-objecting@objecting.org Link: https://lkml.kernel.org/r/20260321175427.86000-4-sj@kernel.org Link: https://github.com/damonitor/damo [1] Fixes: `d809a7c64b` ("mm/damon/sysfs: implement refresh_ms file internal work") Signed-off-by: Josh Law <objecting@objecting.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 20:48:38 -07:00
Josh Law	1bfe9fb5ed	mm/damon/sysfs: check contexts->nr before accessing contexts_arr[0] Multiple sysfs command paths dereference contexts_arr[0] without first verifying that kdamond->contexts->nr == 1. A user can set nr_contexts to 0 via sysfs while DAMON is running, causing NULL pointer dereferences. In more detail, the issue can be triggered by privileged users like below. First, start DAMON and make contexts directory empty (kdamond->contexts->nr == 0). # damo start # cd /sys/kernel/mm/damon/admin/kdamonds/0 # echo 0 > contexts/nr_contexts Then, each of below commands will cause the NULL pointer dereference. # echo update_schemes_stats > state # echo update_schemes_tried_regions > state # echo update_schemes_tried_bytes > state # echo update_schemes_effective_quotas > state # echo update_tuned_intervals > state Guard all commands (except OFF) at the entry point of damon_sysfs_handle_cmd(). Link: https://lkml.kernel.org/r/20260321175427.86000-3-sj@kernel.org Fixes: `0ac32b8aff` ("mm/damon/sysfs: support DAMOS stats") Signed-off-by: Josh Law <objecting@objecting.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [5.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 20:48:38 -07:00
Josh Law	7fe000eb32	mm/damon/sysfs: fix param_ctx leak on damon_sysfs_new_test_ctx() failure Patch series "mm/damon/sysfs: fix memory leak and NULL dereference issues", v4. DAMON_SYSFS can leak memory under allocation failure, and do NULL pointer dereference when a privileged user make wrong sequences of control. Fix those. This patch (of 3): When damon_sysfs_new_test_ctx() fails in damon_sysfs_commit_input(), param_ctx is leaked because the early return skips the cleanup at the out label. Destroy param_ctx before returning. Link: https://lkml.kernel.org/r/20260321175427.86000-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260321175427.86000-2-sj@kernel.org Fixes: `f0c5118ebb` ("mm/damon/sysfs: catch commit test ctx alloc failure") Signed-off-by: Josh Law <objecting@objecting.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 20:48:37 -07:00
Alexandre Ghiti	9e0d0ddfbc	mm/swap: fix swap cache memcg accounting The swap readahead path was recently refactored and while doing this, the order between the charging of the folio in the memcg and the addition of the folio in the swap cache was inverted. Since the accounting of the folio is done while adding the folio to the swap cache and the folio is not charged in the memcg yet, the accounting is then done at the node level, which is wrong. Fix this by charging the folio in the memcg before adding it to the swap cache. Link: https://lkml.kernel.org/r/20260320050601.1833108-1-alex@ghiti.fr Fixes: `2732acda82` ("mm, swap: use swap cache as the swap in synchronize layer") Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: Kairui Song <kasong@tencent.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 20:48:37 -07:00
Harry Yoo (Oracle)	26d3dca201	MAINTAINERS, mailmap: update email address for Harry Yoo Update my email address to harry@kernel.org. Link: https://lkml.kernel.org/r/20260320125925.2259998-1-harry@kernel.org Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 20:48:37 -07:00
Jinjiang Tu	4c5e7f0fcd	mm/huge_memory: fix folio isn't locked in softleaf_to_folio() On arm64 server, we found folio that get from migration entry isn't locked in softleaf_to_folio(). This issue triggers when mTHP splitting and zap_nonpresent_ptes() races, and the root cause is lack of memory barrier in softleaf_to_folio(). The race is as follows: CPU0 CPU1 deferred_split_scan() zap_nonpresent_ptes() lock folio split_folio() unmap_folio() change ptes to migration entries __split_folio_to_order() softleaf_to_folio() set flags(including PG_locked) for tail pages folio = pfn_folio(softleaf_to_pfn(entry)) smp_wmb() VM_WARN_ON_ONCE(!folio_test_locked(folio)) prep_compound_page() for tail pages In __split_folio_to_order(), smp_wmb() guarantees page flags of tail pages are visible before the tail page becomes non-compound. smp_wmb() should be paired with smp_rmb() in softleaf_to_folio(), which is missed. As a result, if zap_nonpresent_ptes() accesses migration entry that stores tail pfn, softleaf_to_folio() may see the updated compound_head of tail page before page->flags. This issue will trigger VM_WARN_ON_ONCE() in pfn_swap_entry_folio() because of the race between folio split and zap_nonpresent_ptes() leading to a folio incorrectly undergoing modification without a folio lock being held. This is a BUG_ON() before commit `93976a2034` ("mm: eliminate further swapops predicates"), which in merged in v6.19-rc1. To fix it, add missing smp_rmb() if the softleaf entry is migration entry in softleaf_to_folio() and softleaf_to_page(). [tujinjiang@huawei.com: update function name and comments] Link: https://lkml.kernel.org/r/20260321075214.3305564-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20260319012541.4158561-1-tujinjiang@huawei.com Fixes: `e9b61f1985` ("thp: reintroduce split_huge_page()") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-27 20:48:37 -07:00
Theodore Ts'o	9ee29d20aa	ext4: always drain queued discard work in ext4_mb_release() While reviewing recent ext4 patch[1], Sashiko raised the following concern[2]: > If the filesystem is initially mounted with the discard option, > deleting files will populate sbi->s_discard_list and queue > s_discard_work. If it is then remounted with nodiscard, the > EXT4_MOUNT_DISCARD flag is cleared, but the pending s_discard_work is > neither cancelled nor flushed. [1] https://lore.kernel.org/r/20260319094545.19291-1-qiang.zhang@linux.dev/ [2] https://sashiko.dev/#/patchset/20260319094545.19291-1-qiang.zhang%40linux.dev The concern was valid, but it had nothing to do with the patch[1]. One of the problems with Sashiko in its current (early) form is that it will detect pre-existing issues and report it as a problem with the patch that it is reviewing. In practice, it would be hard to hit deliberately (unless you are a malicious syzkaller fuzzer), since it would involve mounting the file system with -o discard, and then deleting a large number of files, remounting the file system with -o nodiscard, and then immediately unmounting the file system before the queued discard work has a change to drain on its own. Fix it because it's a real bug, and to avoid Sashiko from raising this concern when analyzing future patches to mballoc.c. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Fixes: `55cdd0af2b` ("ext4: get discard out of jbd2 commit kthread contex") Cc: stable@kernel.org	2026-03-27 23:39:10 -04:00
Theodore Ts'o	bb81702370	ext4: handle wraparound when searching for blocks for indirect mapped blocks Commit `4865c768b5` ("ext4: always allocate blocks only from groups inode can use") restricts what blocks will be allocated for indirect block based files to block numbers that fit within 32-bit block numbers. However, when using a review bot running on the latest Gemini LLM to check this commit when backporting into an LTS based kernel, it raised this concern: If ac->ac_g_ex.fe_group is >= ngroups (for instance, if the goal group was populated via stream allocation from s_mb_last_groups), then start will be >= ngroups. Does this allow allocating blocks beyond the 32-bit limit for indirect block mapped files? The commit message mentions that ext4_mb_scan_groups_linear() takes care to not select unsupported groups. However, its loop uses group = *start, and the very first iteration will call ext4_mb_scan_group() with this unsupported group because next_linear_group() is only called at the end of the iteration. After reviewing the code paths involved and considering the LLM review, I determined that this can happen when there is a file system where some files/directories are extent-mapped and others are indirect-block mapped. To address this, add a safety clamp in ext4_mb_scan_groups(). Fixes: `4865c768b5` ("ext4: always allocate blocks only from groups inode can use") Cc: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20260326045834.1175822-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:39:02 -04:00
hongao	3ceda17325	ext4: skip split extent recovery on corruption ext4_split_extent_at() retries after ext4_ext_insert_extent() fails by refinding the original extent and restoring its length. That recovery is only safe for transient resource failures such as -ENOSPC, -EDQUOT, and -ENOMEM. When ext4_ext_insert_extent() fails because the extent tree is already corrupted, ext4_find_extent() can return a leaf path without p_ext. ext4_split_extent_at() then dereferences path[depth].p_ext while trying to fix up the original extent length, causing a NULL pointer dereference while handling a pre-existing filesystem corruption. Do not enter the recovery path for corruption errors, and validate p_ext after refinding the extent before touching it. This keeps the recovery path limited to cases it can actually repair and turns the syzbot-triggered crash into a proper corruption report. Fixes: `716b9c23b8` ("ext4: refactor split and convert extents") Reported-by: syzbot+1ffa5d865557e51cb604@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1ffa5d865557e51cb604 Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: hongao <hongao@uniontech.com> Link: https://patch.msgid.link/EF77870F23FF9C90+20260324015815.35248-1-hongao@uniontech.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:38:52 -04:00
Baokun Li	ec0a7500d8	ext4: fix iloc.bh leak in ext4_fc_replay_inode() error paths During code review, Joseph found that ext4_fc_replay_inode() calls ext4_get_fc_inode_loc() to get the inode location, which holds a reference to iloc.bh that must be released via brelse(). However, several error paths jump to the 'out' label without releasing iloc.bh: - ext4_handle_dirty_metadata() failure - sync_dirty_buffer() failure - ext4_mark_inode_used() failure - ext4_iget() failure Fix this by introducing an 'out_brelse' label placed just before the existing 'out' label to ensure iloc.bh is always released. Additionally, make ext4_fc_replay_inode() propagate errors properly instead of always returning 0. Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Fixes: `8016e29f43` ("ext4: fast commit recovery path") Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260323060836.3452660-1-libaokun@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:38:01 -04:00
Jan Kara	0c90eed1b9	ext4: fix deadlock on inode reallocation Currently there is a race in ext4 when reallocating freed inode resulting in a deadlock: Task1 Task2 ext4_evict_inode() handle = ext4_journal_start(); ... if (IS_SYNC(inode)) handle->h_sync = 1; ext4_free_inode() ext4_new_inode() handle = ext4_journal_start() finds the bit in inode bitmap already clear insert_inode_locked() waits for inode to be removed from the hash. ext4_journal_stop(handle) jbd2_journal_stop(handle) jbd2_log_wait_commit(journal, tid); - deadlocks waiting for transaction handle Task2 holds Fix the problem by removing inode from the hash already in ext4_clear_inode() by which time all IO for the inode is done so reuse is already fine but we are still before possibly blocking on transaction commit. Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com> Link: https://lore.kernel.org/all/abNvb2PcrKj1FBeC@ly-workstation Fixes: `88ec797c46` ("fs: make insert_inode_locked() wait for inode destruction") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260320090428.24899-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:37:48 -04:00
Jiayuan Chen	d15e4b0a41	ext4: fix use-after-free in update_super_work when racing with umount Commit `b98535d091` ("ext4: fix bug_on in start_this_handle during umount filesystem") moved ext4_unregister_sysfs() before flushing s_sb_upd_work to prevent new error work from being queued via /proc/fs/ext4/xx/mb_groups reads during unmount. However, this introduced a use-after-free because update_super_work calls ext4_notify_error_sysfs() -> sysfs_notify() which accesses the kobject's kernfs_node after it has been freed by kobject_del() in ext4_unregister_sysfs(): update_super_work ext4_put_super ----------------- -------------- ext4_unregister_sysfs(sb) kobject_del(&sbi->s_kobj) __kobject_del() sysfs_remove_dir() kobj->sd = NULL sysfs_put(sd) kernfs_put() // RCU free ext4_notify_error_sysfs(sbi) sysfs_notify(&sbi->s_kobj) kn = kobj->sd // stale pointer kernfs_get(kn) // UAF on freed kernfs_node ext4_journal_destroy() flush_work(&sbi->s_sb_upd_work) Instead of reordering the teardown sequence, fix this by making ext4_notify_error_sysfs() detect that sysfs has already been torn down by checking s_kobj.state_in_sysfs, and skipping the sysfs_notify() call in that case. A dedicated mutex (s_error_notify_mutex) serializes ext4_notify_error_sysfs() against kobject_del() in ext4_unregister_sysfs() to prevent TOCTOU races where the kobject could be deleted between the state_in_sysfs check and the sysfs_notify() call. Fixes: `b98535d091` ("ext4: fix bug_on in start_this_handle during umount filesystem") Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260319120336.157873-1-jiayuan.chen@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:37:39 -04:00
Zqiang	496bb99b7e	ext4: fix the might_sleep() warnings in kvfree() Use the kvfree() in the RCU read critical section can trigger the following warnings: EXT4-fs (vdb): unmounting filesystem cd983e5b-3c83-4f5a-a136-17b00eb9d018. WARNING: suspicious RCU usage ./include/linux/rcupdate.h:409 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 Call Trace: <TASK> dump_stack_lvl+0xbb/0xd0 dump_stack+0x14/0x20 lockdep_rcu_suspicious+0x15a/0x1b0 __might_resched+0x375/0x4d0 ? put_object.part.0+0x2c/0x50 __might_sleep+0x108/0x160 vfree+0x58/0x910 ? ext4_group_desc_free+0x27/0x270 kvfree+0x23/0x40 ext4_group_desc_free+0x111/0x270 ext4_put_super+0x3c8/0xd40 generic_shutdown_super+0x14c/0x4a0 ? __pfx_shrinker_free+0x10/0x10 kill_block_super+0x40/0x90 ext4_kill_sb+0x6d/0xb0 deactivate_locked_super+0xb4/0x180 deactivate_super+0x7e/0xa0 cleanup_mnt+0x296/0x3e0 __cleanup_mnt+0x16/0x20 task_work_run+0x157/0x250 ? __pfx_task_work_run+0x10/0x10 ? exit_to_user_mode_loop+0x6a/0x550 exit_to_user_mode_loop+0x102/0x550 do_syscall_64+0x44a/0x500 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> BUG: sleeping function called from invalid context at mm/vmalloc.c:3441 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 556, name: umount preempt_count: 1, expected: 0 CPU: 3 UID: 0 PID: 556 Comm: umount Call Trace: <TASK> dump_stack_lvl+0xbb/0xd0 dump_stack+0x14/0x20 __might_resched+0x275/0x4d0 ? put_object.part.0+0x2c/0x50 __might_sleep+0x108/0x160 vfree+0x58/0x910 ? ext4_group_desc_free+0x27/0x270 kvfree+0x23/0x40 ext4_group_desc_free+0x111/0x270 ext4_put_super+0x3c8/0xd40 generic_shutdown_super+0x14c/0x4a0 ? __pfx_shrinker_free+0x10/0x10 kill_block_super+0x40/0x90 ext4_kill_sb+0x6d/0xb0 deactivate_locked_super+0xb4/0x180 deactivate_super+0x7e/0xa0 cleanup_mnt+0x296/0x3e0 __cleanup_mnt+0x16/0x20 task_work_run+0x157/0x250 ? __pfx_task_work_run+0x10/0x10 ? exit_to_user_mode_loop+0x6a/0x550 exit_to_user_mode_loop+0x102/0x550 do_syscall_64+0x44a/0x500 entry_SYSCALL_64_after_hwframe+0x77/0x7f The above scenarios occur in initialization failures and teardown paths, there are no parallel operations on the resources released by kvfree(), this commit therefore remove rcu_read_lock/unlock() and use rcu_access_pointer() instead of rcu_dereference() operations. Fixes: `7c990728b9` ("ext4: fix potential race between s_flex_groups online resizing and access") Fixes: `df3da4ea5a` ("ext4: fix potential race between s_group_info online resizing and access") Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Link: https://patch.msgid.link/20260319094545.19291-1-qiang.zhang@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:36:20 -04:00
Helen Koike	3822743dc2	ext4: reject mount if bigalloc with s_first_data_block != 0 bigalloc with s_first_data_block != 0 is not supported, reject mounting it. Signed-off-by: Helen Koike <koike@igalia.com> Suggested-by: Theodore Ts'o <tytso@mit.edu> Reported-by: syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b73703b873a33d8eb8f6 Link: https://patch.msgid.link/20260317142325.135074-1-koike@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:36:11 -04:00
Ye Bin	9e1b14320b	ext4: fix extents-test.c is not compiled when EXT4_KUNIT_TESTS=M Now, only EXT4_KUNIT_TESTS=Y testcase will be compiled in 'extents.c'. To solve this issue, the ext4 test code needs to be decoupled. The 'extents-test' module is compiled into 'ext4-test' module. Fixes: `cb1e0c1d1f` ("ext4: kunit tests for extent splitting and conversion") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260314075258.1317579-4-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-03-27 23:36:06 -04:00
Ye Bin	519b76ac0b	ext4: fix mballoc-test.c is not compiled when EXT4_KUNIT_TESTS=M Now, only EXT4_KUNIT_TESTS=Y testcase will be compiled in 'mballoc.c'. To solve this issue, the ext4 test code needs to be decoupled. The ext4 test module is compiled into a separate module. Reported-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Closes: https://patchwork.kernel.org/project/cifs-client/patch/20260118091313.1988168-2-chenxiaosong.chenxiaosong@linux.dev/ Fixes: `7c9fa399a3` ("ext4: add first unit test for ext4_mb_new_blocks_simple in mballoc") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260314075258.1317579-3-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-03-27 23:36:02 -04:00
Ye Bin	49504a5125	ext4: introduce EXPORT_SYMBOL_FOR_EXT4_TEST() helper Introduce EXPORT_SYMBOL_FOR_EXT4_TEST() helper for kuint test. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260314075258.1317579-2-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-03-27 23:34:19 -04:00
Milos Nikic	bac3190a8e	jbd2: gracefully abort on checkpointing state corruptions This patch targets two internal state machine invariants in checkpoint.c residing inside functions that natively return integer error codes. - In jbd2_cleanup_journal_tail(): A blocknr of 0 indicates a severely corrupted journal superblock. Replaced the J_ASSERT with a WARN_ON_ONCE and a graceful journal abort, returning -EFSCORRUPTED. - In jbd2_log_do_checkpoint(): Replaced the J_ASSERT_BH checking for an unexpected buffer_jwrite state. If the warning triggers, we explicitly drop the just-taken get_bh() reference and call __flush_batch() to safely clean up any previously queued buffers in the j_chkpt_bhs array, preventing a memory leak before returning -EFSCORRUPTED. Signed-off-by: Milos Nikic <nikic.milos@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260311041548.159424-1-nikic.milos@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:34:09 -04:00
Edward Adam Davis	5422fe71d2	ext4: avoid infinite loops caused by residual data On the mkdir/mknod path, when mapping logical blocks to physical blocks, if inserting a new extent into the extent tree fails (in this example, because the file system disabled the huge file feature when marking the inode as dirty), ext4_ext_map_blocks() only calls ext4_free_blocks() to reclaim the physical block without deleting the corresponding data in the extent tree. This causes subsequent mkdir operations to reference the previously reclaimed physical block number again, even though this physical block is already being used by the xattr block. Therefore, a situation arises where both the directory and xattr are using the same buffer head block in memory simultaneously. The above causes ext4_xattr_block_set() to enter an infinite loop about "inserted" and cannot release the inode lock, ultimately leading to the 143s blocking problem mentioned in [1]. If the metadata is corrupted, then trying to remove some extent space can do even more harm. Also in case EXT4_GET_BLOCKS_DELALLOC_RESERVE was passed, remove space wrongly update quota information. Jan Kara suggests distinguishing between two cases: 1) The error is ENOSPC or EDQUOT - in this case the filesystem is fully consistent and we must maintain its consistency including all the accounting. However these errors can happen only early before we've inserted the extent into the extent tree. So current code works correctly for this case. 2) Some other error - this means metadata is corrupted. We should strive to do as few modifications as possible to limit damage. So I'd just skip freeing of allocated blocks. [1] INFO: task syz.0.17:5995 blocked for more than 143 seconds. Call Trace: inode_lock_nested include/linux/fs.h:1073 [inline] __start_dirop fs/namei.c:2923 [inline] start_dirop fs/namei.c:2934 [inline] Reported-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1659aaaaa8d9d11265d7 Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Reported-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=512459401510e2a9a39f Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com Link: https://patch.msgid.link/tencent_43696283A68450B761D76866C6F360E36705@qq.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:33:56 -04:00
Tejas Bharambe	2acb5c12eb	ext4: validate p_idx bounds in ext4_ext_correct_indexes ext4_ext_correct_indexes() walks up the extent tree correcting index entries when the first extent in a leaf is modified. Before accessing path[k].p_idx->ei_block, there is no validation that p_idx falls within the valid range of index entries for that level. If the on-disk extent header contains a corrupted or crafted eh_entries value, p_idx can point past the end of the allocated buffer, causing a slab-out-of-bounds read. Fix this by validating path[k].p_idx against EXT_LAST_INDEX() at both access sites: before the while loop and inside it. Return -EFSCORRUPTED if the index pointer is out of range, consistent with how other bounds violations are handled in the ext4 extent tree code. Reported-by: syzbot+04c4e65cab786a2e5b7e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=04c4e65cab786a2e5b7e Signed-off-by: Tejas Bharambe <tejas.bharambe@outlook.com> Link: https://patch.msgid.link/JH0PR06MB66326016F9B6AD24097D232B897CA@JH0PR06MB6632.apcprd06.prod.outlook.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:33:46 -04:00

1 2 3 4 5 ...

1429327 Commits