zs_page_migrate() uses copy_page() to copy the contents of a zspage page
during migration. However, copy_page() is not instrumented by KMSAN, so
the shadow and origin metadata of the destination page are not updated.
As a result, subsequent accesses to the migrated page are reported as
use-after-free by KMSAN, despite the data being correctly copied.
Add a kmsan_copy_page_meta() call after copy_page() to propagate the KMSAN
metadata to the new page, matching what copy_highpage() does internally.
Link: https://lkml.kernel.org/r/20260321132912.93434-1-syoshida@redhat.com
Fixes: afb2d666d0 ("zsmalloc: use copy_page for full page copy")
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When a block of memory is requested from the execmem manager, it tries to
find a suitable fragment by traversing the free_areas. If there is no
such block, a new memory area is added to the free_areas and then
allocated to the caller by traversing the free_area tree again.
These allocation and tree-traversal operations are not atomic, so another
request may consume the newly added memory block, resulting in an
allocation failure for the original request. Such failures can be
observed on devices running the 6.18 kernel during parallel module
loading.
To mitigate such resource races, perform the cache population and
allocation operations under a single mutex lock.
Link: https://lkml.kernel.org/r/20260320075723.779985-1-hmazur@google.com
Signed-off-by: Hubert Mazur <hmazur@google.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Stanislaw Kardach <skardach@google.com>
Cc: Michal Krawczyk <mikrawczyk@google.com>
Cc: Slawomir Rosek <srosek@google.com>
Cc: Hubert Mazur <hmazur@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current DAMON implementation requires 'min_nr_regions' to be at least
3. However, this constraint is not explicitly documented in the
admin-guide documents, nor is its design rationale explained in the design
document.
Add a section in design.rst to explain the rationale: the virtual address
space monitoring design needs to handle at least three regions to
accommodate two large unmapped areas. While this is specific to 'vaddr',
DAMON currently enforces it across all operation sets for consistency.
Also update reclaim.rst and lru_sort.rst by adding cross-references to
this constraint within their respective 'min_nr_regions' parameter
description sections, ensuring users are aware of the lower bound.
This change is motivated by a recent discussion [1].
Link: https://lkml.kernel.org/r/20260320052428.213230-1-aethernet65535@gmail.com
Link: https://lore.kernel.org/damon/20260319151528.86490-1-sj@kernel.org/T/#t [1]
Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
condition exists between the state switching and the memory reclaim path.
This can lead to unexpected cgroup OOM kills, even when plenty of
reclaimable memory is available.
Problem Description
===================
The issue arises from a "reclaim vacuum" during the transition.
1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
false before the pages are drained from MGLRU lists back to traditional
LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
yet, or the changes are not yet visible to all CPUs due to a lack
of synchronization.
4. get_scan_count() subsequently finds traditional LRU lists empty,
concludes there is no reclaimable memory, and triggers an OOM kill.
A similar race can occur during enablement, where the reclaimer sees the
new state but the MGLRU lists haven't been populated via fill_evictable()
yet.
Solution
========
Introduce a 'switching' state (`lru_switch`) to bridge the transition.
When transitioning, the system enters this intermediate state where
the reclaimer is forced to attempt both MGLRU and traditional reclaim
paths sequentially. This ensures that folios remain visible to at least
one reclaim mechanism until the transition is fully materialized across
all CPUs.
Race & Mitigation
=================
A race window exists between checking the 'draining' state and performing
the actual list operations. For instance, a reclaimer might observe the
draining state as false just before it changes, leading to a suboptimal
reclaim path decision.
However, this impact is effectively mitigated by the kernel's reclaim
retry mechanism (e.g., in do_try_to_free_pages). If a reclaimer pass fails
to find eligible folios due to a state transition race, subsequent retries
in the loop will observe the updated state and correctly direct the scan
to the appropriate LRU lists. This ensures the transient inconsistency
does not escalate into a terminal OOM kill.
This effectively reduces the race window that previously triggered OOMs
under high memory pressure.
This fix has been verified on v7.0.0-rc1; dynamic toggling of MGLRU
functions correctly without triggering unexpected OOM kills.
Link: https://lkml.kernel.org/r/20260319-b4-switch-mglru-v2-v5-1-8898491e5f17@gmail.com
Signed-off-by: Leno Hou <lenohou@gmail.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Jialing Wang <wjl.linux@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Bingfang Guo <bfguo@icloud.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
movable_operations::migrate_page() should return an appropriate error code
for temporary migration failures so the migration core can handle them
correctly.
zs_page_migrate() currently returns -EINVAL when zspage_write_trylock()
fails. That path reflects transient lock contention, not invalid input,
so -EINVAL is clearly wrong.
However, -EAGAIN is also inappropriate here: the zspage's reader-lock
owner may hold the lock for an unbounded duration due to slow
decompression or reader-lock owner preemption. Since migration retries
are bounded by NR_MAX_MIGRATE_PAGES_RETRY and performed with virtually no
delay between attempts, there is no guarantee the lock will be released in
time for a retry to succeed. -EAGAIN implies "try again soon", which does
not hold in this case.
Return -EBUSY instead, which more accurately conveys that the resource is
occupied and migration cannot proceed at this time.
Link: https://lkml.kernel.org/r/20260319065924.69337-1-hui.zhu@linux.dev
Signed-off-by: teawater <zhuhui@kylinos.cn>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
CONFIG_MEMORY_HOTREMOVE, CONFIG_COMPACTION and CONFIG_CMA all select
CONFIG_MIGRATION, because they require it to work (users).
Only CONFIG_NUMA_BALANCING and CONFIG_BALLOON_MIGRATION depend on
CONFIG_MIGRATION. CONFIG_BALLOON_MIGRATION is not an actual user, but an
implementation of migration support, so the dependency is correct
(CONFIG_BALLOON_MIGRATION does not make any sense without
CONFIG_MIGRATION).
However, kconfig-language.rst clearly states "In general use select only
for non-visible symbols". So far, CONFIG_MIGRATION is user-visible, and
the dependencies are rather confusing.
The whole reason why CONFIG_MIGRATION is user-visible is because of
CONFIG_NUMA: some users might want CONFIG_NUMA but not page migration
support.
Let's clean all that up by introducing a dedicated CONFIG_NUMA_MIGRATION
config option for that purpose only. Make CONFIG_NUMA_BALANCING, which so
far depended on CONFIG_NUMA && CONFIG_MIGRATION, depend on
CONFIG_NUMA_MIGRATION instead. CONFIG_NUMA_MIGRATION will depend on
CONFIG_NUMA && CONFIG_MMU.
CONFIG_NUMA_MIGRATION is user-visible and will default to "y". We use
that default so new configs will automatically enable it, just like it was
the case with CONFIG_MIGRATION. The downside is that some configs that
used to have CONFIG_MIGRATION=n might get it re-enabled by
CONFIG_NUMA_MIGRATION=y, which shouldn't be a problem.
CONFIG_MIGRATION is now a non-visible config option. Any code that
selects CONFIG_MIGRATION (as before) must depend directly or indirectly
on CONFIG_MMU.
CONFIG_NUMA_MIGRATION is responsible for any NUMA migration code, which is
mempolicy migration code, memory-tiering code, and move_pages() code in
migrate.c. CONFIG_NUMA_BALANCING uses its functionality.
Note that this implies that with CONFIG_NUMA_MIGRATION=n, move_pages()
will not be available even though CONFIG_MIGRATION=y, which is an expected
change.
In migrate.c, we can remove the CONFIG_NUMA check as both
CONFIG_NUMA_MIGRATION and CONFIG_NUMA_BALANCING depend on it.
With this change, CONFIG_MIGRATION is an internal config option: all
users of migration select CONFIG_MIGRATION, and only
CONFIG_BALLOON_MIGRATION depends on it.
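The described structure could be sketched roughly like this (simplified Kconfig; prompts and option bodies are illustrative, not the exact mm/Kconfig text):

```
config MIGRATION
	bool	# non-visible; users such as COMPACTION and CMA select it

config NUMA_MIGRATION
	bool "NUMA page migration support"
	default y
	depends on NUMA && MMU
	select MIGRATION

config BALLOON_MIGRATION
	bool
	depends on MIGRATION	# implements migration support
```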
Link: https://lkml.kernel.org/r/20260319-config_migration-v1-2-42270124966f@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In 2008, we added through commit 48c906823f ("memory hotplug: allocate
usemap on the section with pgdat") quite some complexity to try allocating
memory for the "usemap" (storing pageblock information per memory section)
for a memory section close to the memory of the "pgdat" of the node.
The goal was to make memory hotunplug of boot memory more likely to
succeed. That commit also added some checks for circular dependencies
between two memory sections, whereby two memory sections would contain
each other's usemaps, turning both boot memory sections un-removable.
However, in 2010, commit a4322e1bad ("sparsemem: Put usemap for one node
together") started allocating the usemap for multiple memory sections on
the same node in one chunk, effectively grouping all usemap allocations of
the same node in a single memblock allocation.
We don't really give guarantees about memory hotunplug of boot memory, and
with the change in 2010, it is impossible in practice to get any circular
dependencies.
So let's simply remove this complexity.
Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-10-096addc8800d@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ever since commit f8f03eb5f0 ("mm: stop making SPARSEMEM_VMEMMAP
user-selectable"), an architecture that supports CONFIG_SPARSEMEM_VMEMMAP
(by selecting SPARSEMEM_VMEMMAP_ENABLE) can no longer enable
CONFIG_SPARSEMEM without CONFIG_SPARSEMEM_VMEMMAP.
Right now, CONFIG_MEMORY_HOTPLUG is guarded by CONFIG_SPARSEMEM.
However, CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG is only enabled by
* arm64: which selects SPARSEMEM_VMEMMAP_ENABLE
* loongarch: which selects SPARSEMEM_VMEMMAP_ENABLE
* powerpc (64bit): which selects SPARSEMEM_VMEMMAP_ENABLE
* riscv (64bit): which selects SPARSEMEM_VMEMMAP_ENABLE
* s390 with SPARSEMEM: which selects SPARSEMEM_VMEMMAP_ENABLE
* x86 (64bit): which selects SPARSEMEM_VMEMMAP_ENABLE
So, we can make CONFIG_MEMORY_HOTPLUG depend on CONFIG_SPARSEMEM_VMEMMAP
without affecting any setups.
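In sketch form (simplified; not the exact mm/Kconfig text), the guard becomes:

```
menuconfig MEMORY_HOTPLUG
	bool "Memory hotplug"
	depends on SPARSEMEM_VMEMMAP
```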
Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-4-096addc8800d@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: memory hot(un)plug and SPARSEMEM cleanups", v2.
Some cleanups around memory hot(un)plug and SPARSEMEM. In essence, we can
limit CONFIG_MEMORY_HOTPLUG to CONFIG_SPARSEMEM_VMEMMAP, remove some dead
code, and move all the hotplug bits over to mm/sparse-vmemmap.c.
Some further/related cleanups around other unnecessary code (memory hole
handling and complicated usemap allocation).
I have some further sparse.c cleanups lying around, and I'm planning on
getting rid of bootmem_info.c entirely.
This patch (of 15):
If a hugetlb folio gets freed while we are in scan_movable_pages(),
folio_nr_pages() could return 0, and or'ing "0 - 1 = -1" into the PFN
results in PFN = -1. We're not holding any locks or references that would
prevent that.
for_each_valid_pfn() would then search for the next valid PFN, and could
return a PFN that is outside of the originally requested range.
do_migrate_range() would then try to migrate quite a big range, which is
certainly undesirable.
To fix it, simply test for valid folio_nr_pages() values. While at it, as
PageHuge() really just does a page_folio() internally, we can just use
folio_test_hugetlb() on the folio directly.
scan_movable_pages() is expected to be fast, and we try to avoid taking
locks or grabbing references. We cannot use folio_try_get() as that does
not work for free hugetlb folios. We could grab the hugetlb_lock, but
that just adds complexity.
The race is unlikely to trigger in practice, so we won't be CCing stable.
Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-0-096addc8800d@kernel.org
Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-1-096addc8800d@kernel.org
Fixes: 16540dae95 ("mm/hugetlb: mm/memory_hotplug: use a folio in scan_movable_pages()")
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
scan_slots_for_writeback() and scan_slots_for_recompress() work in a "best
effort" fashion: if they cannot allocate memory for a new pp-slot
candidate, they just return, and post-processing selects the slots that
were successfully scanned thus far. The scan_slots functions never return
errors and their callers never check the return status, so convert them to
return void.
Link: https://lkml.kernel.org/r/20260317032349.753645-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Calling `LZ4_loadDict()` repeatedly in Zram causes significant overhead
due to its internal dictionary pre-processing. This commit introduces a
template stream mechanism to pre-process the dictionary only once when the
dictionary is initially set or modified. It then efficiently copies this
state for subsequent compressions.
Verification Test Items:
Test Platform: android16-6.12
1. Collect Anonymous Page Dataset
1) Apply the following patch:
static bool zram_meta_alloc(struct zram *zram, u64 disksize)
if (!huge_class_size)
- huge_class_size = zs_huge_class_size(zram->mem_pool);
+ huge_class_size = 0;
2) Install multiple apps and run monkey testing until SwapFree is close to 0.
3) Execute the following command to export the data:
dd if=/dev/block/zram0 of=/data/samples/zram_dump.img bs=4K
2. Train Dictionary
Since LZ4 does not have a dedicated dictionary training tool, the zstd
tool can be used for training[1]. The command is as follows:
zstd --train /data/samples/* --split=4096 --maxdict=64KB -o /vendor/etc/dict_data
3. Test Code
adb shell "dd if=/data/samples/zram_dump.img of=/dev/test_pattern bs=4096 count=131072 conv=fsync"
adb shell "swapoff /dev/block/zram0"
adb shell "echo 1 > /sys/block/zram0/reset"
adb shell "echo lz4 > /sys/block/zram0/comp_algorithm"
adb shell "echo dict=/vendor/etc/dict_data > /sys/block/zram0/algorithm_params"
adb shell "echo 6G > /sys/block/zram0/disksize"
echo "Start Compression"
adb shell "taskset 80 dd if=/dev/test_pattern of=/dev/block/zram0 bs=4096 count=131072 conv=fsync"
echo.
echo "Start Decompression"
adb shell "taskset 80 dd if=/dev/block/zram0 of=/dev/output_result bs=4096 count=131072 conv=fsync"
echo "mm_stat:"
adb shell "cat /sys/block/zram0/mm_stat"
echo.
Note: To ensure stable test results, it is best to lock the CPU frequency
before executing the test.
LZ4 supports dictionaries up to 64KB. Below are the test results for
compression speed at various dictionary sizes:
dict_size    base      patch
4KB          156M/s    219M/s
8KB          136M/s    217M/s
16KB         98M/s     214M/s
32KB         66M/s     225M/s
64KB         38M/s     224M/s
When an LZ4 compression dictionary is enabled, compression speed is
negatively impacted by the dictionary's size; larger dictionaries result
in slower compression. This patch eliminates the influence of dictionary
size on compression speed, ensuring consistent performance regardless of
dictionary scale.
Link: https://lkml.kernel.org/r/698181478c9c4b10aa21b4a847bdc706@honor.com
Link: https://github.com/lz4/lz4?tab=readme-ov-file [1]
Signed-off-by: gao xu <gaoxu2@honor.com>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: khugepaged cleanups and mTHP prerequisites", v4.
The following series contains cleanups and prerequisites for my work on
khugepaged mTHP support [1]. These have been separated out to ease
review.
The first patch in the series refactors the page fault folio to pte
mapping and follows a similar convention as defined by
map_anon_folio_pmd_(no)pf(). This not only cleans up the current
implementation of do_anonymous_page(), but will allow for reuse later in
the khugepaged mTHP implementation.
The second patch adds a small is_pmd_order() helper to check if an order
is the PMD order. This check is open-coded in a number of places. This
patch aims to clean this up and will be used more in the khugepaged mTHP
work. The third patch also adds a small DEFINE for (HPAGE_PMD_NR - 1)
which is used often across the khugepaged code.
The fourth and fifth patch come from the khugepaged mTHP patchset [1].
These two patches include the rename of function prefixes, and the
unification of khugepaged and madvise_collapse via a new
collapse_single_pmd function.
Patch 1: refactor do_anonymous_page into map_anon_folio_pte_(no)pf
Patch 2: add is_pmd_order helper
Patch 3: Add define for (HPAGE_PMD_NR - 1)
Patch 4: Refactor/rename hpage_collapse
Patch 5: Refactoring to combine madvise_collapse and khugepaged
A big thanks to everyone that has reviewed, tested, and participated in
the development process.
This patch (of 5):
The anonymous page fault handler in do_anonymous_page() open-codes the
sequence to map a newly allocated anonymous folio at the PTE level:
- construct the PTE entry
- add rmap
- add to LRU
- set the PTEs
- update the MMU cache.
Introduce two helpers to consolidate this duplicated logic, mirroring the
existing map_anon_folio_pmd_nopf() pattern for PMD-level mappings:
map_anon_folio_pte_nopf(): constructs the PTE entry, takes folio
references, adds anon rmap and LRU. This function also handles the
uffd_wp that can occur in the pf variant. The future khugepaged mTHP code
calls this to handle mapping the new collapsed mTHP to its folio.
map_anon_folio_pte_pf(): extends the nopf variant to handle MM_ANONPAGES
counter updates, and mTHP fault allocation statistics for the page fault
path.
The zero-page read path in do_anonymous_page() is also untangled from the
shared setpte label, since it does not allocate a folio and should not
share the same mapping sequence as the write path. We can now leave
nr_pages undeclared at function initialization, and use the single-page
update_mmu_cache() to handle the zero-page update.
This refactoring will also help reduce code duplication between
mm/memory.c and mm/khugepaged.c, and provides a clean API for PTE-level
anonymous folio mapping that can be reused by future callers (like
khugepaged mTHP support).
Link: https://lkml.kernel.org/r/20260325114022.444081-1-npache@redhat.com
Link: https://lkml.kernel.org/r/20260325114022.444081-2-npache@redhat.com
Link: https://lore.kernel.org/all/20260122192841.128719-1-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Suggested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rafael Aquini <raquini@redhat.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shivank Garg <shivankg@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Takashi Iwai (SUSE) <tiwai@suse.de>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the past, damon_set_region_biggest_system_ram_default(), which is the
core function for setting the default monitoring target region of
DAMON_LRU_SORT, didn't support addr_unit. Hence DAMON_LRU_SORT was
silently ignoring the user input for addr_unit when the user didn't
explicitly set the monitoring target regions and the default target region
was therefore used. No real problem from that ignorance has been reported
so far, but the implicit rule only makes things confusing.
Also, the default target region setup function has been updated to support
addr_unit, so there is no reason to keep ignoring it. Respect the
user-passed addr_unit for the default target monitoring region use case.
Link: https://lkml.kernel.org/r/20260311052927.93921-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the past, damon_set_region_biggest_system_ram_default(), which is the
core function for setting the default monitoring target region of
DAMON_RECLAIM, didn't support addr_unit. Hence DAMON_RECLAIM was silently
ignoring the user input for addr_unit when the user didn't explicitly set
the monitoring target regions and the default target region was therefore
used. No real problem from that ignorance has been reported so far, but
the implicit rule only makes things confusing. Also, the default target
region setup function has been updated to support addr_unit, so there is
no reason to keep ignoring it. Respect the user-passed addr_unit for the
default target monitoring region use case.
Link: https://lkml.kernel.org/r/20260311052927.93921-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_find_biggest_system_ram() did not support addr_unit in the past.
Hence its caller, damon_set_region_biggest_system_ram_default(), did not
support it either. The previous commit updated the inner function to
support addr_unit, so there is no longer a reason for
damon_set_region_biggest_system_ram_default() not to support it; the gap
only creates unnecessary inconsistency in addr_unit support. Update it to
receive addr_unit and handle it internally.
Link: https://lkml.kernel.org/r/20260311052927.93921-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_find_biggest_system_ram() assigns a 'resource_size_t' value to an
'unsigned long' variable. This is fundamentally wrong. On environments
such as ARM 32-bit machines with LPAE (Large Physical Address Extension),
which DAMON supports, 'unsigned long' may be smaller than
'resource_size_t'. It is safe in practice, though, since we restrict the
walk to be done only up to ULONG_MAX.
DAMON bridges this address size gap using 'addr_unit'. We didn't add that
support to this function, just to keep the initial support change small.
Now that the support has reasonably settled, this kind of gap only makes
the code inconsistent and confusing. Add 'addr_unit' support to the
function, by letting callers pass 'addr_unit' and handling it inside. All
callers pass 'addr_unit' of 1 for now, to keep the old behavior.
[sj@kernel.org: verify found biggest system ram]
Link: https://lkml.kernel.org/r/20260317144725.88524-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260311052927.93921-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: support addr_unit on default monitoring targets
for modules".
DAMON_RECLAIM and DAMON_LRU_SORT support the 'addr_unit' parameter only
when the monitoring target address range is explicitly set. This was
intentional, to keep the initial 'addr_unit' support change small. Now
'addr_unit' support has become quite stable, and keeping this corner case
only makes the code inconsistent, with implicit rules that easily confuse
[1] readers. After all, there is no real reason to keep 'addr_unit'
support incomplete. Add the support for this case to improve readability
and support 'addr_unit' more completely.
This series consists of five patches. The first one (patch 1) fixes a
small bug that mistakenly assigns an inclusive end address to an open end
address, which was found during this work. The second and third ones
(patches 2 and 3) extend the default monitoring target setting functions
in the core layer one by one, to support the 'addr_unit' while making no
visible changes. The final two patches (patches 4 and 5) update
DAMON_RECLAIM and DAMON_LRU_SORT to support 'addr_unit' for the default
monitoring target address ranges, by passing the user input to the core
functions.
This patch (of 5):
'struct damon_addr_range' and 'struct resource' represent different types
of address ranges. 'damon_addr_range' is for end-open ranges ([start,
end)). 'resource' is for fully-closed ranges ([start, end]). But
walk_system_ram() is assigning resource->end to damon_addr_range->end
without the inclusiveness adjustment. As a result, the function returns
an address range that is missing its last byte.
The function is being used to find and set the biggest system ram as the
default monitoring target for DAMON_RECLAIM and DAMON_LRU_SORT. Missing
the last byte of the big range shouldn't be a real problem for the real
use cases. That said, the loss is definitely an unintended behavior. Do
the correct adjustment.
Link: https://lkml.kernel.org/r/20260311052927.93921-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260311052927.93921-2-sj@kernel.org
Link: https://lore.kernel.org/20260131015643.79158-1-sj@kernel.org [1]
Fixes: 43b0536cb4 ("mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: improve map count checks".
Firstly, in mremap(), it appears that our map count checks have been overly
conservative - there is simply no reason to require that we have headroom
of 4 mappings prior to moving the VMA, we only need headroom of 2 VMAs
since commit 659ace584e ("mmap: don't return ENOMEM when mapcount is
temporarily exceeded in munmap()").
Likely the original headroom of 4 mappings was a mistake, and 3 was
actually intended.
Next, we access sysctl_max_map_count in a number of places without being
all that careful about how we do so.
We introduce a simple helper that READ_ONCE()'s the field
(get_sysctl_max_map_count()) to ensure that the field is accessed
correctly. The WRITE_ONCE() side is already handled by the sysctl procfs
code in proc_int_conv().
We also move this field to internal.h as there's no reason for anybody
else to access it outside of mm. Unfortunately we have to maintain the
extern variable, as mmap.c implements the procfs code.
Finally, we are accessing current->mm->map_count without holding the mmap
write lock, which is also not correct, so this series ensures the lock is
held before we access it.
We also abstract the check to a helper function, and add ASCII diagrams to
explain why we're doing what we're doing.
This patch (of 3):
We currently check, when moving a VMA during mremap(), whether doing so
might violate the vm.max_map_count sysctl limit.
This was introduced in the mists of time prior to 2.6.12.
At this point in time, as now, the move_vma() operation would copy the VMA
(+1 mapping if not merged), then potentially split the source VMA upon
unmap.
Prior to commit 659ace584e ("mmap: don't return ENOMEM when mapcount is
temporarily exceeded in munmap()"), a VMA split would check whether
mm->map_count >= sysctl_max_map_count prior to a split before it ran.
On unmap of the source VMA, if we are moving a partial VMA, we might split
the VMA twice.
This would mean, on invocation of split_vma() (as was), we'd check whether
mm->map_count >= sysctl_max_map_count with a map count elevated by one,
then again with a map count elevated by two, ending up with a map count
elevated by three.
At this point we'd reduce the map count on unmap.
At the start of move_vma(), there was a check that has remained throughout
mremap()'s history of mm->map_count >= sysctl_max_map_count - 3 (which
implies mm->map_count + 4 > sysctl_max_map_count - that is, we must have
headroom for 4 additional mappings).
After mm->map_count is elevated by 3, it is decremented by one once the
unmap completes. The mmap write lock is held, so nothing else will observe
mm->map_count > sysctl_max_map_count.
It appears this check was always incorrect - it should have been either
'mm->map_count > sysctl_max_map_count - 3' or 'mm->map_count >=
sysctl_max_map_count - 2'.
After commit 659ace584e ("mmap: don't return ENOMEM when mapcount is
temporarily exceeded in munmap()"), the map count check on split is
eliminated in the newly introduced __split_vma(), which the unmap path
uses, and has that path check whether mm->map_count >=
sysctl_max_map_count.
This is valid since, net, an unmap can only cause an increase in map count
of 1 (split both sides, unmap middle).
Since we only copy a VMA and (if MREMAP_DONTUNMAP is not set) unmap
afterwards, the maximum number of additional mappings that will actually be
subject to any check will be 2.
Therefore, update the check to assert this corrected value. Additionally,
update the check introduced by commit ea2c3f6f55 ("mm,mremap: bail out
earlier in mremap_to under map pressure") to account for this.
While we're here, clean up the comment prior to that.
Link: https://lkml.kernel.org/r/cover.1773249037.git.ljs@kernel.org
Link: https://lkml.kernel.org/r/73e218c67dcd197c5331840fb011e2c17155bfb0.1773249037.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Jianzhou Zhao <luckd0g@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kmalloc_large() was renamed kmalloc_large_noprof() by commit 7bd230a266
("mm/slab: enable slab allocation tagging for kmalloc and friends"), and
subsequently renamed __kmalloc_large_noprof() by commit a0a44d9175 ("mm,
slab: don't wrap internal functions with alloc_hooks()"), making it an
internal implementation detail.
Large kmalloc allocations are now performed through the public kmalloc()
interface directly, making the reference to KMALLOC_MAX_SIZE also stale
(KMALLOC_MAX_CACHE_SIZE would be more accurate). Remove the references to
kmalloc_large() and KMALLOC_MAX_SIZE, and rephrase the description for
large kmalloc allocations.
Link: https://lkml.kernel.org/r/20260312053812.1365-1-kexinsun@smail.nju.edu.cn
Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Assisted-by: unnamed:deepseek-v3.2 coccinelle
Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During folio migration, __folio_migrate_mapping() removes the source folio
from the deferred split queue, but the destination folio is never
re-queued. This causes underutilized THPs to escape the shrinker after
NUMA migration, since they silently drop off the deferred split list.
Fix this by recording whether the source folio was on the deferred split
queue, along with its partially mapped state, before move_to_new_folio()
unqueues it, and re-queuing the destination folio after a successful
migration if the source was queued.
By the time migrate_folio_move() runs, partially mapped folios without a
pin have already been split by migrate_pages_batch(). So only two cases
remain on the deferred list at this point:
1. Partially mapped folios with a pin (split failed).
2. Fully mapped but potentially underused folios.
The recorded partially_mapped state is forwarded to
deferred_split_folio() so that the destination folio is correctly
re-queued in both cases.
Because the THPs are removed from the deferred_list, the THP shrinker
cannot split the underutilized THPs in time. As a result, affected
systems show less free memory than before.
Link: https://lkml.kernel.org/r/20260312104723.1351321-1-usama.arif@linux.dev
Fixes: dafff3f4c8 ("mm: split underused THPs")
Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The test_memcg_sock test in memcontrol.c sets up an IPv6 socket and sends
data over it to consume memory, then verifies that the memory.stat.sock
and memory.current values are close.
On systems where IPv6 isn't enabled or isn't configured to support
SOCK_STREAM, the test_memcg_sock test always fails. When the socket()
call fails, there is no way we can test the memory consumption and verify
the above claim. I believe it is better to just skip the test in this
case instead of reporting a test failure hinting that there may be
something wrong with the memcg code.
Link: https://lkml.kernel.org/r/20260311200526.885899-1-longman@redhat.com
Fixes: 5f8f019380 ("selftests: cgroup/memcontrol: add basic test for socket accounting")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>