This reintroduces a concept removed by commit d6cb41cc44 ("mm, hugetlb:
remove hugepages_treat_as_movable sysctl").
This sysctl provides flexibility between two ZONE_MOVABLE use cases:
1) onlining memory in ZONE_MOVABLE to maintain hotplug compatibility
2) onlining memory in ZONE_MOVABLE to make hugepage allocation more reliable
When ZONE_MOVABLE is used to make huge page allocation more reliable,
disallowing gigantic pages in this region is pointless. If hotplug is not
a requirement, we can loosen the restrictions to allow 1GB gigantic pages
in ZONE_MOVABLE.
Since 1GB pages can be difficult to migrate and have impacts on
compaction / defragmentation, we don't enable this by default. Notably, a
1GB page can only be migrated if another 1GB page is available - so
hot-unplug will fail if such a page cannot be found.
However, since there are scenarios where gigantic pages are migratable, we
should allow use of these on movable regions.
When no valid 1GB page is available for migration, hot-unplug will retry
indefinitely (or until interrupted). For example:
echo 0 > node0/hugepages/..-1GB/nr_hugepages # clear node0 1GB pages
echo 1 > node1/hugepages/..-1GB/nr_hugepages # reserve node1 1GB page
./alloc_huge_node1 & # Allocate a 1GB page on node1
./node1_offline & # attempt to offline all node1 memory
echo 1 > node0/hugepages/..-1GB/nr_hugepages # reserve node0 1GB page
In this example, node1_offline will block indefinitely until the final
step, when a node0 1GB page is made available.
Note: Boot-time CMA is not possible for driver-managed hotplug memory, as
CMA requires the memory to be registered as SystemRAM at boot time.
Additionally, 1GB huge pages are not supported by THP.
Link: https://lkml.kernel.org/r/20251221125603.2364174-1-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Suggested-by: David Rientjes <rientjes@google.com>
Link: https://lore.kernel.org/all/20180201193132.Hk7vI_xaU%25akpm@linux-foundation.org/
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit cc638f329e ("mm, thp: tweak reclaim/compaction effort of
local-only and all-node allocations"), THP page fault allocations have
settled on the following scheme (from the commit log):
1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP
allocation from any node with effort determined by global defrag setting
and VMA madvise
3. fallback to base pages on any node
Recent customer reports however revealed we have a gap in step 1 above.
What we have seen is excessive reclaim due to THP page faults on a NUMA
node that's close to its high watermark, while other nodes have plenty of
free memory.
The problem with step 1 is that it promises no reclaim after the
compaction attempt, however reclaim is only avoided for certain compaction
outcomes (deferred, or skipped due to insufficient free base pages), and
not e.g. when compaction is actually performed but fails (we did see
compact_fail vmstat counter increasing).
THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
which is not the intention.
Thus add a check for __GFP_THISNODE that corresponds to this exact
situation and prevents continuing with reclaim/compaction once the initial
compaction attempt isn't successful in allocating the page.
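A minimal sketch of the decision, assuming the usual costly-order /
__GFP_NORETRY handling in the allocator slowpath (illustrative only, not
the literal hunk):

  #include <linux/gfp.h>
  #include <linux/compaction.h>

  /*
   * Sketch: should a costly-order, __GFP_NORETRY allocation (the THP page
   * fault case) give up right after its first compaction attempt?
   */
  static bool give_up_after_first_compaction(gfp_t gfp_mask, bool costly_order,
                                             enum compact_result result)
  {
          if (!costly_order || !(gfp_mask & __GFP_NORETRY))
                  return false;

          /*
           * Existing bail-outs: compaction deferred, or skipped because
           * there aren't enough free base pages to work with.
           */
          if (result == COMPACT_SKIPPED || result == COMPACT_DEFERRED)
                  return true;

          /*
           * The new check: a local-only (__GFP_THISNODE) request should
           * not fall through to reclaim/compaction retries merely because
           * the first compaction attempt ran and failed - that
           * fall-through produced the zone_reclaim_mode-like over-reclaim.
           */
          if (gfp_mask & __GFP_THISNODE)
                  return true;

          return false;
  }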
Note that commit cc638f329e has not introduced this over-reclaim
possibility; it appears to exist in some form since commit 2f0799a0ff
("mm, thp: restore node-local hugepage allocations"). Followup commits
b39d0ee263 ("mm, page_alloc: avoid expensive reclaim when compaction may
not succeed") and cc638f329e have moved in the right direction, but left
the abovementioned gap.
Link: https://lkml.kernel.org/r/20251219-costly-noretry-thisnode-fix-v1-1-e1085a4a0c34@suse.cz
Fixes: 2f0799a0ff ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The KMSG_COMPONENT macro is a leftover of the s390 specific "kernel
message catalog" from 2008 [1] which never made it upstream.
The macro was added to s390 code to allow for an out-of-tree patch which
used it to generate unique message ids. This out-of-tree patch doesn't
exist anymore.
The KMSG_COMPONENT usage pattern was also partially copied into non-s390
specific code, for whatever reason.
Remove the macro in order to get rid of a pointless indirection.
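For reference, the indirection being removed typically looked like the
snippet below (shown here in zram-like terms; the replacement simply
spells the prefix out in pr_fmt() directly):

  /* Before: message prefix defined via the KMSG_COMPONENT indirection */
  #define KMSG_COMPONENT "zram"
  #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt

  /* After: the same prefix, without the extra macro */
  #define pr_fmt(fmt) "zram: " fmt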
Link: https://lkml.kernel.org/r/20251126143602.2207435-1-hca@linux.ibm.com
Link: https://lwn.net/Articles/292650/ [1]
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use bool for 'hashdist' and replace simple_strtoul() with kstrtobool() for
parsing the 'hashdist=' boot parameter. Unlike simple_strtoul(), which
returns an unsigned long, kstrtobool() converts the string directly to
bool and avoids implicit casting.
Check the return value of kstrtobool() and reject invalid values. This
adds error handling while preserving behavior for existing values, and
removes use of the deprecated simple_strtoul() helper. The current code
silently sets 'hashdist = 0' if parsing fails, instead of leaving the
default value (HASHDIST_DEFAULT) unchanged.
Additionally, kstrtobool() accepts common boolean strings such as "on" and
"off".
Link: https://lkml.kernel.org/r/20251217110214.50807-1-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Laptop mode was introduced to save battery, by delaying and consolidating
writes and thereby maximize the time rotating hard drives wouldn't have to
spin.
Luckily, rotating hard drives, with their high spin-up times and power
draw, are a thing of the past for battery-powered devices. Reclaim has
also since changed to not write single filesystem pages anymore, and
regular filesystem writeback is lumpy by design.
The juice doesn't appear worth the squeeze anymore. The footprint of the
feature is small, but nevertheless it's a complicating factor in mm,
block, filesystems. Developers don't think about it, and it likely hasn't
been tested with new reclaim and writeback changes in years.
Let's sunset it. Keep the sysctl with a deprecation warning around for a
few more cycles, but remove all functionality behind it.
[akpm@linux-foundation.org: fix Documentation/admin-guide/laptops/index.rst]
Link: https://lkml.kernel.org/r/20251216185201.GH905277@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
pp_in_progress makes sure that only one post-processing operation
(writeback or recompression) is active at any given time.
Functionality-wise it basically shadows zram's init_lock when init_lock is
acquired in writer mode.
Switch recompress_store() and writeback_store() to take zram init_lock in
writer mode, like all store() sysfs handlers should do, so that we can
drop pp_in_progress. Recompression and writeback can be somewhat slow, so
holding init_lock in writer mode can block zram attrs reads, but in
reality the only zram attrs reads that take place are mm_stat reads, and
usually it's the same process that reads mm_stat and does recompression or
writeback.
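Schematically, the store handlers end up with the usual pattern shown in
the sketch below (simplified, not the literal diff; the writeback work
itself is elided):

  /*
   * Sketch: hold init_lock in writer mode for the duration of the
   * post-processing, as other store() handlers already do, instead of
   * guarding against concurrent post-processing with pp_in_progress.
   */
  static ssize_t writeback_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t len)
  {
          struct zram *zram = dev_to_zram(dev);
          ssize_t ret = len;

          down_write(&zram->init_lock);   /* excludes other post-processing */
          if (!init_done(zram)) {
                  up_write(&zram->init_lock);
                  return -EINVAL;
          }

          /* ... parse the request and perform the actual writeback ... */

          up_write(&zram->init_lock);
          return ret;
  }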
Link: https://lkml.kernel.org/r/20251216071342.687993-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON users can read DAMOS stats via the DAMON sysfs interface. It
enables efficient, simple and flexible usage of the stats. Especially for
systems not having advanced tools like perf or bpftrace, that can be
useful. But if the advanced tools are available, exposing the stats via a
tracepoint can avoid unnecessary reinvention of the wheel. Add a new
tracepoint for DAMOS stats, namely damos_stat_after_apply_interval. The
tracepoint is triggered for each scheme's apply interval and exposes the
whole stat values. If the user needs sub-apply-interval information for
any reason, the damos_before_apply tracepoint can be used.
Link: https://lkml.kernel.org/r/20251216080128.42991-13-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are DAMOS use cases that require user-space centric control of its
activation and deactivation. Having the control plane in user space, or
using DAMOS as a way to collect monitoring results, are such examples.
DAMON parameters online commit, DAMOS quotas and watermarks can be useful
for this purpose. However, those features work only at the
sub-DAMON-snapshot level. In some use cases, DAMON-snapshot level control
is required. For example, in the DAMOS-based monitoring results
collection use case, the user online-installs a DAMOS scheme with the
DAMOS_STAT action, waits for it to be applied to all regions of a single
DAMON snapshot, retrieves the stats and tried-regions information, and
online-uninstalls the scheme. It is efficient to ensure the lifetime of
the scheme covers no more and no less than one snapshot's consumption.
To support such use cases, introduce a new DAMOS core API per-scheme
parameter, namely max_nr_snapshots. As the name implies, it is the upper
limit of nr_snapshots, which is a DAMOS stat that represents the number of
DAMON-snapshots that the scheme has fully applied. If the limit is set
with a non-zero value and nr_snapshots reaches or exceeds the limit, the
scheme is deactivated.
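Conceptually, the deactivation check amounts to the sketch below; the
field names follow the changelog (max_nr_snapshots and the nr_snapshots
stat), while the exact placement in the DAMOS core is abbreviated:

  #include <linux/damon.h>

  /*
   * Sketch only: decide whether a scheme has consumed its snapshot budget
   * and should therefore be deactivated.
   */
  static bool damos_snapshot_limit_reached(struct damos *scheme)
  {
          /* 0 means "no limit": keep the scheme active indefinitely */
          if (!scheme->max_nr_snapshots)
                  return false;

          /* deactivate once enough whole snapshots have been applied */
          return scheme->stat.nr_snapshots >= scheme->max_nr_snapshots;
  }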
Link: https://lkml.kernel.org/r/20251216080128.42991-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: introduce {,max_}nr_snapshots and tracepoint for
damos stats".
Introduce three changes that improve the information DAMOS stats provide,
their deterministic control, and their reading usability.
DAMOS provides stats that are important for understanding its behavior.
It lacks information about how many DAMON-generated monitoring output
snapshots it has worked on. Add a new stat, nr_snapshots, to show the
information.
Users can control DAMOS schemes in multiple ways. Using the online
parameters commit feature, they can install and uninstall DAMOS schemes
whenever they want while keeping DAMON runs. DAMOS quotas and watermarks
can be used for manually or automatically turning on/off or adjusting the
aggressiveness of the scheme. DAMOS filters can be used for applying the
scheme to specific memory entities based on their types and locations.
Some users want their DAMOS scheme to be applied to only a specific
number of DAMON snapshots, for more deterministic control. One example
use case is tracepoint-based snapshot reading. Add a new knob,
max_nr_snapshots, to support this. If the nr_snapshots stat becomes equal
to or greater than the value of this parameter, the scheme is deactivated.
Users can read DAMOS stats via DAMON's sysfs interface. For deep level
investigations on environments having advanced tools like perf and
bpftrace, exposing the stats via a tracepoint can be useful. Implement a
new tracepoint, namely damon:damos_stat_after_apply_interval.
First five patches (patches 1-5) of this series implement the new stat,
nr_snapshots, on the core layer (patch 1), expose on DAMON sysfs user
interface (patch 2), and update documents (patches 3-5).
Following six patches (patches 6-11) are for the new stat based DAMOS
deactivation (max_nr_snapshots). The first one (patch 6) of this group
updates a kernel-doc comment before making further changes. Then an
implementation of it on the core layer (patch 7), an introduction of a new
DAMON sysfs interface file for users of the feature (patch 8), and three
updates of the documents (patches 9-11) follow.
The final one (patch 12) introduces the new tracepoint that exposes the
DAMOS stat values for each scheme apply interval.
This patch (of 12):
DAMON generates monitoring results snapshots for every sampling interval.
DAMOS applies given schemes on the regions of the snapshots, for every
apply interval of the scheme.
DAMOS stats report how many memory entities a given scheme has tried to
apply to, and how many it has actually applied to, at the region and byte
level. In some use cases, including user-space oriented tuning and
investigations, it is useful to know that at the DAMON-snapshot level.
Introduce a new stat, namely nr_snapshots, for DAMON core API callers.
[sj@kernel.org: fix wrong list_is_last() call in damons_is_last_region()]
Link: https://lkml.kernel.org/r/20260114152049.99727-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20251216080128.42991-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20251216080128.42991-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use ALIGNMENT_SMALLFOLIO instead of ALIGNMENT_MTHP when allocating small
folios to ensure correct memory alignment for the test case.
Before: test allocates small folios with 64KB alignment
(ALIGNMENT_MTHP) when only 4KB alignment (ALIGNMENT_SMALLFOLIO) is
needed. This wastes address space and may cause allocation failures on
systems with fragmented memory.
Worst-case impact: this only affects thp_swap_allocator_test tool
behavior.
Link: https://lkml.kernel.org/r/20251209031745.2723120-1-kaushlendra.kumar@intel.com
Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, DAMON does not proactively clean up invalid monitoring targets
during its runtime. When some monitored processes exit, DAMON continues
to make the following unnecessary function calls,
damon_for_each_target()
  damon_for_each_region()
    damon_do_apply_schemes()
      damos_apply_scheme()
        damon_va_apply_scheme()
          damos_madvise()
            damon_get_mm()
It is only in the damon_get_mm() function that DAMON may finally discover
that the target no longer exists, which wastes CPU resources. A simple
idea is to check for the existence of monitoring targets within the
kdamond_need_stop() function and promptly clean up non-existent targets.
However, SJ pointed out that this approach is problematic because the
online commit logic incorrectly uses list indices to update the monitoring
state. This can lead to data loss if the target list is changed
concurrently. Instead, SJ suggested checking for target existence at the
damon_for_each_target level and, if a target does not exist, simply
skipping it and proceeding to the next one.
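The resulting loop has roughly the shape of the sketch below;
target_still_exists() is a placeholder for whatever liveness check the
real code uses (e.g. resolving the target's pid to a live mm):

  #include <linux/damon.h>

  /*
   * Sketch: skip exited targets at the damon_for_each_target level instead
   * of descending all the way to damon_get_mm() for every region.
   */
  static void apply_schemes_to_live_targets(struct damon_ctx *ctx)
  {
          struct damon_target *t;
          struct damon_region *r;

          damon_for_each_target(t, ctx) {
                  if (!target_still_exists(ctx, t))   /* placeholder helper */
                          continue;

                  damon_for_each_region(r, t)
                          damon_do_apply_schemes(ctx, t, r);
          }
  }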
Link: https://lkml.kernel.org/r/20251210052508.264433-1-lienze@kylinos.cn
Signed-off-by: Enze Li <lienze@kylinos.cn>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The TODO comment in target_nid_store() suggested adding range validation
for target_nid. As discussed in [1], the current behavior of accepting
any integer value is intentional. DAMON sysfs aims to remain flexible,
including supporting users who prepare node IDs before future NUMA hotplug
events.
Because this behavior matches the broader design philosophy of the DAMON
sysfs interface, the TODO comment is now misleading. This patch removes
the comment without introducing any behavioral change.
No functional changes.
Link: https://lkml.kernel.org/r/20251211032722.4928-2-swarajgaikwad1925@gmail.com
Link: https://lore.kernel.org/lkml/20251210150930.57679-1-sj@kernel.org/ [1]
Signed-off-by: Swaraj Gaikwad <swarajgaikwad1925@gmail.com>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
folio_zero_user() does straight zeroing without caring about temporal
locality for caches.
This replaced the approach of commit c6ddfb6c58 ("mm, clear_huge_page:
move order algorithm into a separate function"), where we cleared a page
at a time, converging on the faulting page from the left and the right.
To retain limited temporal locality, split the clearing in three parts:
the faulting page and its immediate neighbourhood, and the regions on its
left and right. We clear the local neighbourhood last to maximize chances
of it sticking around in the cache.
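In outline, the split looks like the sketch below; clear_range() is a
placeholder for the contiguous clearing primitive and the neighbourhood
size is illustrative:

  /*
   * Sketch: clear a huge folio in three contiguous parts, leaving the
   * faulting page's immediate neighbourhood for last so it is most likely
   * to still be cache-hot when the faulting thread touches it.
   */
  static void zero_folio_neighbourhood_last(struct folio *folio,
                                            unsigned int fault_idx)
  {
          const unsigned int nbhd = 8;    /* hypothetical size, in pages */
          unsigned int npages = folio_nr_pages(folio);
          unsigned int lo = fault_idx > nbhd / 2 ? fault_idx - nbhd / 2 : 0;
          unsigned int hi = min(lo + nbhd, npages);

          clear_range(folio, 0, lo);              /* left of neighbourhood */
          clear_range(folio, hi, npages - hi);    /* right of neighbourhood */
          clear_range(folio, lo, hi - lo);        /* neighbourhood, last */
  }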
Performance
===
AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
memory=2.2 TB, L1d=16K/thread, L2=512K/thread, L3=2MB/thread)
vm-scalability/anon-w-seq-hugetlb: this workload runs with 384 processes
(one for each CPU) each zeroing anonymously mapped hugetlb memory which is
then accessed sequentially.

                        stime                   utime
 discontiguous-page  1739.93 ( +- 6.15% )   1016.61 ( +- 4.75% )
 contiguous-page     1853.70 ( +- 2.51% )   1187.13 ( +- 3.50% )
 batched-pages       1756.75 ( +- 2.98% )   1133.32 ( +- 4.89% )
 neighbourhood-last  1725.18 ( +- 4.59% )   1123.78 ( +- 7.38% )
Both stime and utime respond more or less as expected. There is a fair
amount of run-to-run variation, but the general trend is that the stime
drops and the utime increases. There are a few oddities, like
contiguous-page performing very differently from batched-pages.
As such this is likely an uncommon pattern where we saturate the memory
bandwidth (since all CPUs are running the test) and at the same time are
cache constrained because we access the entire region.
Kernel make (make -j 12 bzImage):

                        stime                  utime
 discontiguous-page   199.29 ( +- 0.63% )   1431.67 ( +- .04% )
 contiguous-page      193.76 ( +- 0.58% )   1433.60 ( +- .05% )
 batched-pages        193.92 ( +- 0.76% )   1431.04 ( +- .08% )
 neighbourhood-last   194.46 ( +- 0.68% )   1431.51 ( +- .06% )
For make the utime stays relatively flat with a fairly small (-2.4%)
improvement in the stime.
Link: https://lkml.kernel.org/r/20260107072009.1615991-9-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use batch clearing in clear_contig_highpages() instead of clearing a
single page at a time. Exposing larger ranges enables the processor to
optimize based on extent.
To do this we just switch to using clear_user_highpages() which would in
turn use clear_user_pages() or clear_pages().
Batched clearing, when running under non-preemptible models, however, has
latency considerations. In particular, we need periodic invocations of
cond_resched() to keep to reasonable preemption latencies. This is a
problem because the clearing primitives do not, or might not be able to,
call cond_resched() to check if preemption is needed.
So, limit the worst case preemption latency by doing the clearing in units
of no more than PROCESS_PAGES_NON_PREEMPT_BATCH pages. (Preemptible
models already define away most of cond_resched(), so the batch size is
ignored when running under those.)
PROCESS_PAGES_NON_PREEMPT_BATCH: for architectures with "fast" clear-pages
(ones that define clear_pages()), we define it as 32MB worth of pages.
This is meant to be large enough to allow the processor to optimize the
operation and yet small enough that we see reasonable preemption latency
for when this optimization is not possible (ex. slow microarchitectures,
memory bandwidth saturation.)
This specific value also allows for a cacheline allocation elision
optimization (which might help unrelated applications by not evicting
potentially useful cache lines) that kicks in on recent generations of AMD
Zen processors at around LLC-size (32MB is a typical size).
At the same time 32MB is small enough that even with poor clearing
bandwidth (say ~10GBps), time to clear 32MB should be well below the
scheduler's default warning threshold
(sysctl_resched_latency_warn_ms=100).
"Slow" architectures (don't have clear_pages()) will continue to use the
base value (single page).
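The batching then looks roughly like the sketch below (signatures are
approximate; clear_user_highpages() is the multi-page helper mentioned
above, and preemptible kernels effectively ignore the batch size since
cond_resched() is compiled away there):

  /*
   * Sketch: clear npages contiguous pages in bounded batches so that
   * non-preemptible kernels hit a cond_resched() point at least every
   * PROCESS_PAGES_NON_PREEMPT_BATCH pages.
   */
  static void clear_contig_highpages(struct page *page, unsigned long addr,
                                     unsigned int npages)
  {
          unsigned int i, batch;

          for (i = 0; i < npages; i += batch) {
                  batch = min_t(unsigned int, npages - i,
                                PROCESS_PAGES_NON_PREEMPT_BATCH);
                  clear_user_highpages(page + i, addr + i * PAGE_SIZE, batch);
                  cond_resched();
          }
  }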
Performance
==
Testing a demand fault workload shows a decent improvement in bandwidth
with pg-sz=1GB. Bandwidth with pg-sz=2MB stays flat.
$ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
contiguous-pages batched-pages
(GBps +- %stdev) (GBps +- %stdev)
pg-sz=2MB 23.58 +- 1.95% 25.34 +- 1.18% + 7.50% preempt=*
pg-sz=1GB 25.09 +- 0.79% 39.22 +- 2.32% + 56.31% preempt=none|voluntary
pg-sz=1GB 25.71 +- 0.03% 52.73 +- 0.20% [#] +110.16% preempt=full|lazy
[#] We perform much better with preempt=full|lazy because, not
needing explicit invocations of cond_resched(), we can clear the
full extent (pg-sz=1GB) as a single unit which the processor
can optimize for.
(Unless otherwise noted, all numbers are on AMD Genoa (EPYC 9J13);
region-size=64GB, local node; 2.56 GHz, boost=0.)
Analysis
==
pg-sz=1GB: the improvement we see falls in two buckets depending on the
batch size in use.
For batch-size=32MB the number of cachelines allocated (L1-dcache-loads)
-- which stay relatively flat for smaller batches, start to drop off
because cacheline allocation elision kicks in. And as can be seen below,
at batch-size=1GB, we stop allocating cachelines almost entirely. (Not
visible here but from testing with intermediate sizes, the allocation
change kicks in only at batch-size=32MB and ramps up from there.)
contiguous-pages 6,949,417,798 L1-dcache-loads # 883.599 M/sec ( +- 0.01% ) (35.75%)
3,226,709,573 L1-dcache-load-misses # 46.43% of all L1-dcache accesses ( +- 0.05% ) (35.75%)
batched,32MB 2,290,365,772 L1-dcache-loads # 471.171 M/sec ( +- 0.36% ) (35.72%)
1,144,426,272 L1-dcache-load-misses # 49.97% of all L1-dcache accesses ( +- 0.58% ) (35.70%)
batched,1GB 63,914,157 L1-dcache-loads # 17.464 M/sec ( +- 8.08% ) (35.73%)
22,074,367 L1-dcache-load-misses # 34.54% of all L1-dcache accesses ( +- 16.70% ) (35.70%)
The dropoff is also visible in L2 prefetch hits (miss numbers are
on similar lines):
contiguous-pages 3,464,861,312 l2_pf_hit_l2.all # 437.722 M/sec ( +- 0.74% ) (15.69%)
batched,32MB 883,750,087 l2_pf_hit_l2.all # 181.223 M/sec ( +- 1.18% ) (15.71%)
batched,1GB 8,967,943 l2_pf_hit_l2.all # 2.450 M/sec ( +- 17.92% ) (15.77%)
This largely decouples the frontend from the backend since the clearing
operation does not need to wait on loads from memory (we still need
cacheline ownership but that's a shorter path). This is most visible if
we rerun the test above with (boost=1, 3.66 GHz).
$ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
contiguous-pages batched-pages
(GBps +- %stdev) (GBps +- %stdev)
pg-sz=2MB 26.08 +- 1.72% 26.13 +- 0.92% - preempt=*
pg-sz=1GB 26.99 +- 0.62% 48.85 +- 2.19% + 80.99% preempt=none|voluntary
pg-sz=1GB 27.69 +- 0.18% 75.18 +- 0.25% +171.50% preempt=full|lazy
Comparing the batched-pages numbers between the boost=0 run and this one:
for a clock-speed gain of 42%, we gain 24.5% for batch-size=32MB and 42.5%
for batch-size=1GB. In comparison, the baseline contiguous-pages case and
both pg-sz=2MB cases are largely backend bound, so they gain no more than
~10%.
Other platforms tested, Intel Icelakex (Oracle X9) and ARM64 Neoverse-N1
(Ampere Altra), both show an improvement of ~35% for pg-sz=2MB|1GB. The
first goes from around 8GBps to 11GBps and the second from 32GBps to
44GBps.
[ankur.a.arora@oracle.com: move the unit computation and make it a const]
Link: https://lkml.kernel.org/r/20260108060406.1693853-1-ankur.a.arora@oracle.com
Link: https://lkml.kernel.org/r/20260107072009.1615991-8-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
process_huge_pages(), used to clear hugepages, is optimized for cache
locality. In particular it processes a hugepage in 4KB page units and in
a difficult to predict order: clearing pages in the periphery in a
backwards or forwards direction, then converging inwards to the faulting
page (or page specified via base_addr.)
This helps maximize temporal locality at time of access. However, while
it keeps stores inside a 4KB page sequential, pages are ordered
semi-randomly in a way that is not easy for the processor to predict.
This limits the clearing bandwidth to what's available in a 4KB page.
Consider the baseline bandwidth:
$ perf bench mem mmap -p 2MB -f populate -s 64GB -l 3
# Running 'mem/mmap' benchmark:
# function 'populate' (Eagerly populated mmap())
# Copying 64GB bytes ...
11.791097 GB/sec
(Unless otherwise noted, all numbers are on AMD Genoa (EPYC 9J13);
region-size=64GB, local node; 2.56 GHz, boost=0.)
11.79 GBps amounts to around 323ns/4KB. With a memory access latency of
~100ns, that doesn't leave much time to get help from, say, hardware
prefetchers.
(Note that since this is a purely write workload, it's reasonable
to assume that the processor does not need to prefetch any cachelines.
However, for a processor to skip the prefetch, it would need to look
at the access pattern and see that full cachelines were being written.
This might be easily visible if clear_page() were using, say, x86 string
instructions; less so if it were using a store loop. In any case, the
existence of these kinds of predictors or of appropriately helpful
threshold values is implementation specific.
Additionally, even when the processor can skip the prefetch, coherence
protocols will still need to establish exclusive ownership,
necessitating communication with remote caches.)
With that, the change is quite straightforward. Instead of clearing
pages discontiguously, clear contiguously: switch to a loop around
clear_user_highpage().
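The loop itself is essentially the sketch below (signatures approximate);
this is the starting point that the batched variant covered earlier then
builds on:

  /*
   * Sketch: clear a contiguous range of pages front to back, one
   * clear_user_highpage() call per page, instead of the semi-random page
   * order used by process_huge_pages().
   */
  static void clear_contig_highpages(struct page *page, unsigned long addr,
                                     unsigned int npages)
  {
          unsigned int i;

          for (i = 0; i < npages; i++) {
                  clear_user_highpage(page + i, addr + i * PAGE_SIZE);
                  cond_resched();         /* bound preemption latency */
          }
  }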
Performance
==
Testing a demand fault workload shows a decent improvement in bandwidth
with pg-sz=2MB. Performance of pg-sz=1GB does not change because it has
always used straight clearing.
$ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
discontiguous-pages contiguous-pages
(baseline)
(GBps +- %stdev) (GBps +- %stdev)
pg-sz=2MB 11.76 +- 1.10% 23.58 +- 1.95% +100.51%
pg-sz=1GB 24.85 +- 2.41% 25.40 +- 1.33% -
Analysis (pg-sz=2MB)
==
At L1 data cache level, nothing changes. The processor continues to
access the same number of cachelines, allocating and missing them as it
writes to them.
discontiguous-pages 7,394,341,051 L1-dcache-loads # 445.172 M/sec ( +- 0.04% ) (35.73%)
3,292,247,227 L1-dcache-load-misses # 44.52% of all L1-dcache accesses ( +- 0.01% ) (35.73%)
contiguous-pages 7,205,105,282 L1-dcache-loads # 861.895 M/sec ( +- 0.02% ) (35.75%)
3,241,584,535 L1-dcache-load-misses # 44.99% of all L1-dcache accesses ( +- 0.00% ) (35.74%)
The L2 prefetcher, however, is now able to prefetch ~22% more cachelines
(L2 prefetch miss rate also goes up significantly showing that we are
backend limited):
discontiguous-pages 2,835,860,245 l2_pf_hit_l2.all # 170.242 M/sec ( +- 0.12% ) (15.65%)
contiguous-pages 3,472,055,269 l2_pf_hit_l2.all # 411.319 M/sec ( +- 0.62% ) (15.67%)
That still leaves a large gap between the ~22% improvement in prefetch
and the ~100% improvement in bandwidth, but better prefetching seems to
streamline the traffic well enough that most of the data starts coming
from the L2, leading to substantially fewer cache-misses at the LLC:
discontiguous-pages 8,493,499,137 cache-references # 511.416 M/sec ( +- 0.15% ) (50.01%)
930,501,344 cache-misses # 10.96% of all cache refs ( +- 0.52% ) (50.01%)
contiguous-pages 9,421,926,416 cache-references # 1.120 G/sec ( +- 0.09% ) (50.02%)
68,787,247 cache-misses # 0.73% of all cache refs ( +- 0.15% ) (50.03%)
In addition, there are a few minor frontend optimizations: clear_pages()
on x86 is now fully inlined, so we don't have a CALL/RET pair (which isn't
free when using RETHUNK speculative execution mitigation as we do on my
test system.) The loop in clear_contig_highpages() is also easier to
predict (especially when handling faults) as compared to that in
process_huge_pages().
discontiguous-pages 980,014,411 branches # 59.005 M/sec (31.26%)
discontiguous-pages 180,897,177 branch-misses # 18.46% of all branches (31.26%)
contiguous-pages 515,630,550 branches # 62.654 M/sec (31.27%)
contiguous-pages 78,039,496 branch-misses # 15.13% of all branches (31.28%)
Note that although clearing contiguously is easier to optimize for the
processor, it does not, sadly, mean that the processor will necessarily
take advantage of it. For instance this change does not result in any
improvement in my tests on Intel Icelakex (Oracle X9), or on ARM64
Neoverse-N1 (Ampere Altra).
Link: https://lkml.kernel.org/r/20260107072009.1615991-7-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Performance when clearing with string instructions (x86-64-stosq and
similar) can vary significantly based on the chunk-size used.
$ perf bench mem memset -k 4KB -s 4GB -f x86-64-stosq
# Running 'mem/memset' benchmark:
# function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 4GB bytes ...
13.748208 GB/sec
$ perf bench mem memset -k 2MB -s 4GB -f x86-64-stosq
# Running 'mem/memset' benchmark:
# function 'x86-64-stosq' (movsq-based memset() in
# arch/x86/lib/memset_64.S)
# Copying 4GB bytes ...
15.067900 GB/sec
$ perf bench mem memset -k 1GB -s 4GB -f x86-64-stosq
# Running 'mem/memset' benchmark:
# function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 4GB bytes ...
38.104311 GB/sec
(Both on AMD Milan.)
With a change in chunk-size from 4KB to 1GB, we see the performance go
from 13.7 GB/sec to 38.1 GB/sec. For the chunk-size of 2MB the change
isn't quite as drastic but it is worth adding a clear_page() variant that
can handle contiguous page-extents.
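A generic fallback for such a multi-page primitive could be as simple as
the sketch below; architectures with suitable string instructions would
provide their own clear_pages() so the hardware sees the full extent:

  #include <linux/mm.h>
  #include <linux/string.h>

  /*
   * Sketch of a generic clear_pages() fallback operating on a contiguous
   * kernel mapping.  Architectures that can do better (e.g. x86 rep stos*
   * over the whole extent) define their own variant.
   */
  #ifndef clear_pages
  static inline void clear_pages(void *addr, unsigned int npages)
  {
          memset(addr, 0, (size_t)npages << PAGE_SHIFT);
  }
  #endif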
Link: https://lkml.kernel.org/r/20260107072009.1615991-6-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: folio_zero_user: clear page ranges", v11.
This series adds clearing of contiguous page ranges for hugepages.
The series improves on the current discontiguous clearing approach in two
ways:
- clear pages in a contiguous fashion.
- use batched clearing via clear_pages() wherever exposed.
The first is useful because it allows us to make much better use of
hardware prefetchers.
The second enables advertising the real extent to the processor. Where
specific instructions support it (ex. string instructions on x86; "mops"
on arm64 etc.), a processor can optimize based on this because, instead of
seeing a sequence of 8-byte stores or a sequence of 4KB pages, it sees a
larger unit being operated on.
For instance, AMD Zen uarchs (for extents larger than LLC-size) switch to
a mode where they start eliding cacheline allocation. This is helpful not
just because it results in higher bandwidth, but also because now the
cache is not evicting useful cachelines and replacing them with zeroes.
Demand faulting a 64GB region shows performance improvement:
$ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
baseline +series
(GBps +- %stdev) (GBps +- %stdev)
pg-sz=2MB 11.76 +- 1.10% 25.34 +- 1.18% [*] +115.47% preempt=*
pg-sz=1GB 24.85 +- 2.41% 39.22 +- 2.32% + 57.82% preempt=none|voluntary
pg-sz=1GB (similar) 52.73 +- 0.20% [#] +112.19% preempt=full|lazy
[*] This improvement is because switching to sequential clearing
allows the hardware prefetchers to do a much better job.
[#] For pg-sz=1GB a large part of the improvement is because of the
cacheline elision mentioned above. preempt=full|lazy improves upon
that because, not needing explicit invocations of cond_resched() to
ensure reasonable preemption latency, it can clear the full extent
as a single unit. In comparison the maximum extent used for
preempt=none|voluntary is PROCESS_PAGES_NON_PREEMPT_BATCH (32MB).
When provided the full extent the processor forgoes allocating
cachelines on this path almost entirely.
(The hope is that eventually, in the fullness of time, the lazy
preemption model will be able to do the same job that none or
voluntary models are used for, allowing us to do away with
cond_resched().)
Raghavendra also tested a previous version of the series on AMD Genoa and
sees a similar improvement [1] with preempt=lazy.
$ perf bench mem mmap -p $page-size -f populate -s 64GB -l 10
base patched change
pg-sz=2MB 12.731939 GB/sec 26.304263 GB/sec 106.6%
pg-sz=1GB 26.232423 GB/sec 61.174836 GB/sec 133.2%
This patch (of 8):
Let's drop all variants that effectively map to clear_page() and provide
it in a generic variant instead.
We'll use the macro clear_user_page to indicate whether an architecture
provides its own variant.
Also, clear_user_page() is only called from the generic variant of
clear_user_highpage(), so define it only if the architecture does not
provide a clear_user_highpage(). And, for simplicity define it in
linux/highmem.h.
Note that for parisc, clear_page() and clear_user_page() map to
clear_page_asm(), so we can just get rid of the custom clear_user_page()
implementation. There is a clear_user_page_asm() function on parisc that
seems to be unused. Not sure what's up with that.
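The generic fallbacks described above end up along these lines (a sketch
in linux/highmem.h terms; guard-macro details may differ from the actual
patch):

  #ifndef clear_user_highpage
  #ifndef clear_user_page
  /* default: clearing a user page is just clearing a page */
  static inline void clear_user_page(void *kaddr, unsigned long vaddr,
                                     struct page *page)
  {
          clear_page(kaddr);
  }
  #endif

  /* generic clear_user_highpage(), the only caller of clear_user_page() */
  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
  {
          void *kaddr = kmap_local_page(page);

          clear_user_page(kaddr, vaddr, page);
          kunmap_local(kaddr);
  }
  #endif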
Link: https://lkml.kernel.org/r/20260107072009.1615991-1-ankur.a.arora@oracle.com
Link: https://lkml.kernel.org/r/20260107072009.1615991-2-ankur.a.arora@oracle.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>