.. SPDX-License-Identifier: GPL-2.0

==================
The Page Allocator
==================

The kernel page allocator services all general page allocation requests, such
as :code:`kmalloc`. CXL configuration steps affect the behavior of the page
allocator based on the selected `Memory Zone` and `NUMA node` the capacity is
placed in.

This section mostly focuses on how these configurations affect the page
allocator (as of Linux v6.15) rather than the overall page allocator behavior.


NUMA nodes and mempolicy
========================
Unless a task explicitly registers a mempolicy, the default memory policy
of the Linux kernel is to allocate memory from the `local NUMA node` first,
and fall back to other nodes only if the local node is under memory pressure.

Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes,
with the CXL memory being non-local. Technically, however, it is possible
for a compute node to have no local DRAM, and for CXL memory to be the
`local` capacity for that compute node.

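As an illustration of a task explicitly registering a mempolicy, the sketch
below uses the :code:`set_mempolicy` system call with :code:`MPOL_BIND` to
restrict the calling task's future allocations to a single node. The helper
names :code:`node_to_mask` and :code:`bind_task_to_node` are hypothetical and
exist only for this example. Which node id corresponds to DRAM or to CXL
memory depends on the platform and is an assumption left to the caller.

.. code-block:: c

    #define _GNU_SOURCE
    #include <assert.h>
    #include <errno.h>
    #include <linux/mempolicy.h>    /* MPOL_BIND */
    #include <sys/syscall.h>        /* SYS_set_mempolicy */
    #include <unistd.h>             /* syscall() */

    /* Build a single-word nodemask with only `node` set (hypothetical helper). */
    unsigned long node_to_mask(unsigned int node)
    {
        /* Keep the node id safely inside one unsigned long for this sketch. */
        assert(node < 8 * sizeof(unsigned long) - 1);
        return 1UL << node;
    }

    /*
     * Restrict future allocations of the calling task to `node`.
     * Returns 0 on success or a negative errno value.
     */
    int bind_task_to_node(unsigned int node)
    {
        unsigned long mask = node_to_mask(node);
        unsigned long maxnode = 8 * sizeof(mask);

        if (syscall(SYS_set_mempolicy, MPOL_BIND, &mask, maxnode) != 0)
            return -errno;
        return 0;
    }

A task that never makes such a call keeps the default local-first behavior
described above. The call may fail, for example in sandboxed environments
that filter this system call, so the return value should be checked.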

Memory Zones
============
CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`.

As of v6.15, the page allocator attempts to allocate from the highest
available and compatible zone on the local node first.

An example of a `zone incompatibility` is attempting to service an allocation
marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`. Kernel allocations are
typically not migratable, and as a result can only be serviced from
:code:`ZONE_NORMAL` or lower.

Put simply, for allocations that are compatible with both zones, the page
allocator will prefer :code:`ZONE_MOVABLE` over :code:`ZONE_NORMAL` by
default, but if :code:`ZONE_MOVABLE` is depleted, it will fall back to
allocating from :code:`ZONE_NORMAL`.

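The zone selection rules above can be sketched as a small userspace model.
This is not kernel code: the :code:`model_zone` enum, the
:code:`highest_allowed_zone` and :code:`pick_zone` helpers, and the plain
free-page counters are all hypothetical simplifications, and the model ignores
watermarks, reclaim, and cross-node fallback.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Zones ordered from lowest to highest (simplified model). */
    enum model_zone {
        MODEL_ZONE_DMA,
        MODEL_ZONE_DMA32,
        MODEL_ZONE_NORMAL,
        MODEL_ZONE_MOVABLE,
        MODEL_NR_ZONES
    };

    /* Only movable allocations are compatible with the movable zone. */
    enum model_zone highest_allowed_zone(bool movable)
    {
        return movable ? MODEL_ZONE_MOVABLE : MODEL_ZONE_NORMAL;
    }

    /*
     * Walk from the highest compatible zone downward and return the first
     * zone with at least `want` free pages, or -1 if none qualifies.
     */
    int pick_zone(bool movable, const unsigned long *free_pages,
                  unsigned long want)
    {
        assert(free_pages != NULL);
        for (int z = highest_allowed_zone(movable); z >= MODEL_ZONE_DMA; z--) {
            if (free_pages[z] >= want)
                return z;
        }
        return -1;
    }

In this model a movable request lands in the movable zone while it has free
pages and falls back to the normal zone once it is depleted, whereas a
non-movable request never considers the movable zone at all.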

CGroups and CPUSets
===================
Finally, assuming CXL memory is reachable via the page allocator (i.e. onlined
in :code:`ZONE_NORMAL`), the :code:`cpusets.mems_allowed` may be used by
containers to limit the accessibility of certain NUMA nodes for tasks in that
container. Users may wish to utilize this in multi-tenant systems where some
tasks prefer not to use slower memory.

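To check which nodes a task may allocate from, the allowed node list can be
read and parsed from userspace. The sketch below assumes the list has already
been read as a string in the kernel's usual list format (for example
:code:`0-1,3`, as reported in the :code:`Mems_allowed_list` field of
:code:`/proc/self/status`). The :code:`node_in_list` helper is hypothetical.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Return true if `node` appears in a list such as "0-1,3". */
    bool node_in_list(const char *list, long node)
    {
        const char *p = list;

        assert(list != NULL);
        while (*p != '\0' && *p != '\n') {
            char *end;
            long lo = strtol(p, &end, 10);
            long hi = lo;

            if (end == p)
                return false;       /* malformed entry */
            p = end;
            if (*p == '-') {        /* range "lo-hi" */
                hi = strtol(p + 1, &end, 10);
                if (end == p + 1)
                    return false;   /* malformed range */
                p = end;
            }
            if (node >= lo && node <= hi)
                return true;
            if (*p == ',')
                p++;
            else if (*p != '\0' && *p != '\n')
                return false;       /* unexpected character */
        }
        return false;
    }

A task confined to nodes :code:`0-1` would see a CXL node numbered 2 reported
as not allowed by this check. The node numbering here is only an example.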

In the reclaim section we'll discuss some limitations of this interface in
preventing demotions of shared data to CXL memory (if demotions are enabled).