Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invokation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleaups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecesary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramaticelly
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible. because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range. By skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned a single NUMA mode.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
Pull Intel software guard extension (SGX) updates from Dave Hansen:
"A couple of x86/sgx changes.
The first one is a no-brainer to use the (simple) SHA-256 library.
For the second one, some folks doing testing noticed that SGX systems
under memory pressure were inducing fatal machine checks at pretty
unnerving rates, despite the SGX code having _some_ awareness of
memory poison.
It turns out that the SGX reclaim path was not checking for poison
_and_ it always accesses memory to copy it around. Make sure that
poisoned pages are not reclaimed"
* tag 'x86_sgx_for_6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Prevent attempts to reclaim poisoned pages
x86/sgx: Use SHA-256 library API instead of crypto_shash API
Pull ACPI updates from Rafael Wysocki:
"The most significant part of these changes is an ACPICA update
covering two upstream ACPICA releases, 20241212 and 20250404, that
have not been included into the kernel code base yet.
Among other things, it adds definitions needed to address GCC 15's
-Wunterminated-string-initialization warnings, adds support for three
new tables (MRRM, ERDT, RIMT), extends support for two tables (RAS2,
DMAR), and fixes some issues.
On top of the above, there is a new parser for the MRRM table, more
changes related to GCC 15's -Wunterminated-string-initialization
warnings, a CPPC library update including functions related to
autonomous CPU performance state selection, a couple of new quirks,
some assorted fixes and some code cleanups.
Specifics:
- Fix two ACPICA SLAB cache leaks (Seunghun Han)
- Add EINJv2 get error type action and define Error Injection Actions
in hex values to avoid inconsistencies between the specification
and the code (Zaid Alali)
- Fix typo in comments for SRAT structures (Adam Lackorzynski)
- Prevent possible loss of data in ACPICA because of u32 to u8
conversions (Saket Dumbre)
- Fix reading FFixedHW operation regions in ACPICA (Daniil Tatianin)
- Add support for printing AML arguments when the ACPICA debug level
is ACPI_LV_TRACE_POINT (Mario Limonciello)
- Drop a stale comment about the file content from actbl2.h (Sudeep
Holla)
- Apply pack(1) to union aml_resource (Tamir Duberstein)
- Fix overflow check in the ACPICA version of vsnprintf() (gldrk)
- Interpret SIDP structures in DMAR added revision 3.4 of the VT-d
specification (Alexey Neyman)
- Add typedef and other definitions related to MRRM to ACPICA (Tony
Luck)
- Add definitions for RIMT to ACPICA (Sunil V L)
- Fix spelling mistake "Incremement" -> "Increment" in the ACPICA
utilities code (Colin Ian King)
- Add typedef and other definitions for ERDT to ACPICA (Tony Luck)
- Introduce ACPI_NONSTRING and use it (Kees Cook, Ahmed Salem)
- Rename structure and field names of the RAS2 table in actbl2.h
(Shiju Jose)
- Fix up whitespace in acpica/utcache.c (Zhe Qiao)
- Avoid sequence overread in a call to strncmp() in
ap_get_table_length() and replace strncpy() with memcpy() in ACPICA
in some places (Ahmed Salem)
- Update copyright year in all ACPICA files (Saket Dumbre)
- Add __nonstring annotations for unterminated strings in the static
ACPI tables parsing code (Kees Cook)
- Add support for parsing the MRRM ACPI table and sysfs files to
describe memory regions listed in it (Tony Luck, Anil
Keshavamurthy)
- Remove an (explicitly) unused header file include from the VIOT
ACPI table parser file (Andy Shevchenko)
- Improve logging around acpi_initialize_tables() (Bartosz
Szczepanek)
- Clean up the initialization of CPU data structures in the ACPI
processor driver (Zhang Rui)
- Remove an obsolete comment regarding the C-states handling in the
ACPI processor driver (Giovanni Gherdovich)
- Simplify PCC shared memory region handling (Sudeep Holla)
- Rework and extend functions for reading CPPC register values and
for updating CPPC registers (Lifeng Zheng)
- Add three functions related to autonomous CPU performance state
selection to the CPPC library (Lifeng Zheng)
- Turn the acpi_pci_root_remap_iospace() fwnode_handle parameter into
a const pointer (Pei Xiao)
- Round battery capacity percengate in the ACPI battery driver to the
closest integer to avoid user confusion (shitao)
- Make the ACPI battery driver report the current as a negative
number to the power supply framework when the battery is
discharging as documented (Peter Marheine)
- Add TUXEDO InfinityBook Pro AMD Gen9 to the acpi_ec_no_wakeup[]
list to prevent spurious wakeups from suspend-to-idle (Werner
Sembach)
- Convert the APEI EINJ driver to a faux device one (Sudeep Holla,
Jon Hunter)
- Remove redundant calls to einj_get_available_error_type() from the
APEI EINJ driver (Zaid Alali)
- Fix a typo for MECHREVO in irq1_edge_low_force_override[] (Mingcong
Bai)
- Add an LPS0 check() callback to the AMD pinctrl driver and fix up
config symbol dependencies in it (Mario Limonciello, Rafael
Wysocki)
- Avoid initializing the ACPI platform profile driver on non-ACPI
platforms (Alexandre Ghiti)
- Document that references to ACPI data (non-device) nodes should use
string-only references in hierarchical data node packages (Sakari
Ailus)
- Fail the ACPI bus registration if acpi_kobj registration fails
(Armin Wolf)"
* tag 'acpi-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
ACPI: MRRM: Fix default max memory region
ACPI: bus: Bail out if acpi_kobj registration fails
ACPI: platform_profile: Avoid initializing on non-ACPI platforms
pinctrl: amd: Fix hibernation support with CONFIG_SUSPEND unset
ACPI: tables: Improve logging around acpi_initialize_tables()
ACPI: VIOT: Remove (explicitly) unused header
ACPI: Add documentation for exposing MRRM data
ACPI: MRRM: Add /sys files to describe memory ranges
ACPI: MRRM: Minimal parse of ACPI MRRM table
ACPICA: Update copyright year
ACPICA: Logfile: Changes for version 20250404
ACPICA: Replace strncpy() with memcpy()
ACPICA: Apply ACPI_NONSTRING in more places
ACPICA: Avoid sequence overread in call to strncmp()
ACPICA: Adjust the position of code lines
ACPICA: actbl2.h: ACPI 6.5: RAS2: Rename structure and field names of the RAS2 table
ACPICA: Apply ACPI_NONSTRING
ACPICA: Introduce ACPI_NONSTRING
ACPICA: actbl2.h: ERDT: Add typedef and other definitions
ACPICA: infrastructure: Add new DMT_BUF types and shorten a long name
...
Pull x86 resource control updates from Borislav Petkov:
"Carve out the resctrl filesystem-related code into fs/resctrl/ so that
multiple architectures can share the fs API for manipulating their
respective hw resource control implementation.
This is the second step in the work towards sharing the resctrl
filesystem interface, the next one being plugging ARM's MPAM into the
aforementioned fs API"
* tag 'x86_cache_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
MAINTAINERS: Add reviewers for fs/resctrl
x86,fs/resctrl: Move the resctrl filesystem code to live in /fs/resctrl
x86/resctrl: Always initialise rid field in rdt_resources_all[]
x86/resctrl: Relax some asm #includes
x86/resctrl: Prefer alloc(sizeof(*foo)) idiom in rdt_init_fs_context()
x86/resctrl: Squelch whitespace anomalies in resctrl core code
x86/resctrl: Move pseudo lock prototypes to include/linux/resctrl.h
x86/resctrl: Fix types in resctrl_arch_mon_ctx_{alloc,free}() stubs
x86/resctrl: Move enum resctrl_event_id to resctrl.h
x86/resctrl: Move the filesystem bits to headers visible to fs/resctrl
fs/resctrl: Add boiler plate for external resctrl code
x86/resctrl: Add 'resctrl' to the title of the resctrl documentation
x86/resctrl: Split trace.h
x86/resctrl: Expand the width of domid by replacing mon_data_bits
x86/resctrl: Add end-marker to the resctrl_event_id enum
x86/resctrl: Move is_mba_sc() out of core.c
x86/resctrl: Drop __init/__exit on assorted symbols
x86/resctrl: Resctrl_exit() teardown resctrl but leave the mount point
x86/resctrl: Check all domains are offline in resctrl_exit()
x86/resctrl: Rename resctrl_sched_in() to begin with "resctrl_arch_"
...
Merge updates related to the handling of static (data-only) ACPI tables
for 6.16-rc1:
- Add __nonstring annotations for unterminated strings in the static
ACPI tables parsing code (Kees Cook).
- Add support for parsing the MRRM ACPI table and sysfs files to
describe memory regions listed in it (Tony Luck, Anil Keshavamurthy).
- Remove an (explicitly) unused header file include from the VIOT ACPI
table parser file (Andy Shevchenko).
- Improve logging around acpi_initialize_tables() (Bartosz Szczepanek).
* acpi-tables:
ACPI: MRRM: Fix default max memory region
ACPI: tables: Improve logging around acpi_initialize_tables()
ACPI: VIOT: Remove (explicitly) unused header
ACPI: Add documentation for exposing MRRM data
ACPI: MRRM: Add /sys files to describe memory ranges
ACPI: MRRM: Minimal parse of ACPI MRRM table
ACPI: tables: Add __nonstring annotations for unterminated strings
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 ITS mitigation from Dave Hansen:
"Mitigate Indirect Target Selection (ITS) issue.
I'd describe this one as a good old CPU bug where the behavior is
_obviously_ wrong, but since it just results in bad predictions it
wasn't wrong enough to notice. Well, the researchers noticed and also
realized that thus bug undermined a bunch of existing indirect branch
mitigations.
Thus the unusually wide impact on this one. Details:
ITS is a bug in some Intel CPUs that affects indirect branches
including RETs in the first half of a cacheline. Due to ITS such
branches may get wrongly predicted to a target of (direct or indirect)
branch that is located in the second half of a cacheline. Researchers
at VUSec found this behavior and reported to Intel.
Affected processors:
- Cascade Lake, Cooper Lake, Whiskey Lake V, Coffee Lake R, Comet
Lake, Ice Lake, Tiger Lake and Rocket Lake.
Scope of impact:
- Guest/host isolation:
When eIBRS is used for guest/host isolation, the indirect branches
in the VMM may still be predicted with targets corresponding to
direct branches in the guest.
- Intra-mode using cBPF:
cBPF can be used to poison the branch history to exploit ITS.
Realigning the indirect branches and RETs mitigates this attack
vector.
- User/kernel:
With eIBRS enabled user/kernel isolation is *not* impacted by ITS.
- Indirect Branch Prediction Barrier (IBPB):
Due to this bug indirect branches may be predicted with targets
corresponding to direct branches which were executed prior to IBPB.
This will be fixed in the microcode.
Mitigation:
As indirect branches in the first half of cacheline are affected, the
mitigation is to replace those indirect branches with a call to thunk that
is aligned to the second half of the cacheline.
RETs that take prediction from RSB are not affected, but they may be
affected by RSB-underflow condition. So, RETs in the first half of
cacheline are also patched to a return thunk that executes the RET aligned
to second half of cacheline"
* tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftest/x86/bugs: Add selftests for ITS
x86/its: FineIBT-paranoid vs ITS
x86/its: Use dynamic thunks for indirect branches
x86/ibt: Keep IBT disabled during alternative patching
mm/execmem: Unify early execmem_cache behaviour
x86/its: Align RETs in BHB clear sequence to avoid thunking
x86/its: Add support for RSB stuffing mitigation
x86/its: Add "vmexit" option to skip mitigation on some CPUs
x86/its: Enable Indirect Target Selection mitigation
x86/its: Add support for ITS-safe return thunk
x86/its: Add support for ITS-safe indirect thunk
x86/its: Enumerate Indirect Target Selection (ITS) bug
Documentation: x86/bugs/its: Add ITS documentation
ITS mitigation moves the unsafe indirect branches to a safe thunk. This
could degrade the prediction accuracy as the source address of indirect
branches becomes same for different execution paths.
To improve the predictions, and hence the performance, assign a separate
thunk for each indirect callsite. This is also a defense-in-depth measure
to avoid indirect branches aliasing with each other.
As an example, 5000 dynamic thunks would utilize around 16 bits of the
address space, thereby gaining entropy. For a BTB that uses
32 bits for indexing, dynamic thunks could provide better prediction
accuracy over fixed thunks.
Have ITS thunks be variable sized and use EXECMEM_MODULE_TEXT such that
they are both more flexible (got to extend them later) and live in 2M TLBs,
just like kernel code, avoiding undue TLB pressure.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Due to ITS, indirect branches in the lower half of a cacheline may be
vulnerable to branch target injection attack.
Introduce ITS-safe thunks to patch indirect branches in the lower half of
cacheline with the thunk. Also thunk any eBPF generated indirect branches
in emit_indirect_jump().
Below category of indirect branches are not mitigated:
- Indirect branches in the .init section are not mitigated because they are
discarded after boot.
- Indirect branches that are explicitly marked retpoline-safe.
Note that retpoline also mitigates the indirect branches against ITS. This
is because the retpoline sequence fills an RSB entry before RET, and it
does not suffer from RSB-underflow part of the ITS.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Calling core::fmt::write() from rust code while FineIBT is enabled
results in a kernel panic:
[ 4614.199779] kernel BUG at arch/x86/kernel/cet.c:132!
[ 4614.205343] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 4614.211781] CPU: 2 UID: 0 PID: 6057 Comm: dmabuf_dump Tainted: G U O 6.12.17-android16-0-g6ab38c534a43 #1 9da040f27673ec3945e23b998a0f8bd64c846599
[ 4614.227832] Tainted: [U]=USER, [O]=OOT_MODULE
[ 4614.241247] RIP: 0010:do_kernel_cp_fault+0xea/0xf0
...
[ 4614.398144] RIP: 0010:_RNvXs5_NtNtNtCs3o2tGsuHyou_4core3fmt3num3impyNtB9_7Display3fmt+0x0/0x20
[ 4614.407792] Code: 48 f7 df 48 0f 48 f9 48 89 f2 89 c6 5d e9 18 fd ff ff 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 41 81 ea 14 61 af 2c 74 03 0f 0b 90 <66> 0f 1f 00 55 48 89 e5 48 89 f2 48 8b 3f be 01 00 00 00 5d e9 e7
[ 4614.428775] RSP: 0018:ffffb95acfa4ba68 EFLAGS: 00010246
[ 4614.434609] RAX: 0000000000000000 RBX: 0000000000000010 RCX: 0000000000000000
[ 4614.442587] RDX: 0000000000000007 RSI: ffffb95acfa4ba70 RDI: ffffb95acfa4bc88
[ 4614.450557] RBP: ffffb95acfa4bae0 R08: ffff0a00ffffff05 R09: 0000000000000070
[ 4614.458527] R10: 0000000000000000 R11: ffffffffab67eaf0 R12: ffffb95acfa4bcc8
[ 4614.466493] R13: ffffffffac5d50f0 R14: 0000000000000000 R15: 0000000000000000
[ 4614.474473] ? __cfi__RNvXs5_NtNtNtCs3o2tGsuHyou_4core3fmt3num3impyNtB9_7Display3fmt+0x10/0x10
[ 4614.484118] ? _RNvNtCs3o2tGsuHyou_4core3fmt5write+0x1d2/0x250
This happens because core::fmt::write() calls
core::fmt::rt::Argument::fmt(), which currently has CFI disabled:
library/core/src/fmt/rt.rs:
171 // FIXME: Transmuting formatter in new and indirectly branching to/calling
172 // it here is an explicit CFI violation.
173 #[allow(inline_no_sanitize)]
174 #[no_sanitize(cfi, kcfi)]
175 #[inline]
176 pub(super) unsafe fn fmt(&self, f: &mut Formatter<'_>) -> Result {
This causes a Control Protection exception, because FineIBT has sealed
off the original function's endbr64.
This makes rust currently incompatible with FineIBT. Add a Kconfig
dependency that prevents FineIBT from getting turned on by default
if rust is enabled.
[ Rust 1.88.0 (scheduled for 2025-06-26) should have this fixed [1],
and thus we relaxed the condition with Rust >= 1.88.
When `objtool` lands checking for this with e.g. [2], the plan is
to ideally run that in upstream Rust's CI to prevent regressions
early [3], since we do not control `core`'s source code.
Alice tested the Rust PR backported to an older compiler.
Peter would like that Rust provides a stable `core` which can be
pulled into the kernel: "Relying on that much out of tree code is
'unfortunate'".
- Miguel ]
Signed-off-by: Paweł Anikiel <panikiel@google.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://github.com/rust-lang/rust/pull/139632 [1]
Link: https://lore.kernel.org/rust-for-linux/20250410154556.GB9003@noisy.programming.kicks-ass.net/ [2]
Link: https://github.com/rust-lang/rust/pull/139632#issuecomment-2801950873 [3]
Link: https://lore.kernel.org/r/20250410115420.366349-1-panikiel@google.com
Link: https://lore.kernel.org/r/att0-CANiq72kjDM0cKALVy4POEzhfdT4nO7tqz0Pm7xM+3=_0+L1t=A@mail.gmail.com
[ Reduced splat. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Pull Kbuild updates from Masahiro Yamada:
- Improve performance in gendwarfksyms
- Remove deprecated EXTRA_*FLAGS and KBUILD_ENABLE_EXTRA_GCC_CHECKS
- Support CONFIG_HEADERS_INSTALL for ARCH=um
- Use more relative paths to sources files for better reproducibility
- Support the loong64 Debian architecture
- Add Kbuild bash completion
- Introduce intermediate vmlinux.unstripped for architectures that need
static relocations to be stripped from the final vmlinux
- Fix versioning in Debian packages for -rc releases
- Treat missing MODULE_DESCRIPTION() as an error
- Convert Nios2 Makefiles to use the generic rule for built-in DTB
- Add debuginfo support to the RPM package
* tag 'kbuild-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (40 commits)
kbuild: rpm-pkg: build a debuginfo RPM
kconfig: merge_config: use an empty file as initfile
nios2: migrate to the generic rule for built-in DTB
rust: kbuild: skip `--remap-path-prefix` for `rustdoc`
kbuild: pacman-pkg: hardcode module installation path
kbuild: deb-pkg: don't set KBUILD_BUILD_VERSION unconditionally
modpost: require a MODULE_DESCRIPTION()
kbuild: make all file references relative to source root
x86: drop unnecessary prefix map configuration
kbuild: deb-pkg: add comment about future removal of KDEB_COMPRESS
kbuild: Add a help message for "headers"
kbuild: deb-pkg: remove "version" variable in mkdebian
kbuild: deb-pkg: fix versioning for -rc releases
Documentation/kbuild: Fix indentation in modules.rst example
x86: Get rid of Makefile.postlink
kbuild: Create intermediate vmlinux build with relocations preserved
kbuild: Introduce Kconfig symbol for linking vmlinux with relocations
kbuild: link-vmlinux.sh: Make output file name configurable
kbuild: do not generate .tmp_vmlinux*.map when CONFIG_VMLINUX_MAP=y
Revert "kheaders: Ignore silly-rename files"
...
Pull more MM updates from Andrew Morton:
- The series "mm: fixes for fallouts from mem_init() cleanup" from Mike
Rapoport fixes a couple of issues with the just-merged "arch, mm:
reduce code duplication in mem_init()" series
- The series "MAINTAINERS: add my isub-entries to MM part." from Mike
Rapoport does some maintenance on MAINTAINERS
- The series "remove tlb_remove_page_ptdesc()" from Qi Zheng does some
cleanup work to the page mapping code
- The series "mseal system mappings" from Jeff Xu permits sealing of
"system mappings", such as vdso, vvar, vvar_vclock, vectors (arm
compat-mode), sigpage (arm compat-mode)
- Plus the usual shower of singleton patches
* tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
mseal sysmap: add arch-support txt
mseal sysmap: enable s390
selftest: test system mappings are sealed
mseal sysmap: update mseal.rst
mseal sysmap: uprobe mapping
mseal sysmap: enable arm64
mseal sysmap: enable x86-64
mseal sysmap: generic vdso vvar mapping
selftests: x86: test_mremap_vdso: skip if vdso is msealed
mseal sysmap: kernel config and header change
mm: pgtable: remove tlb_remove_page_ptdesc()
x86: pgtable: convert to use tlb_remove_ptdesc()
riscv: pgtable: unconditionally use tlb_remove_ptdesc()
mm: pgtable: convert some architectures to use tlb_remove_ptdesc()
mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc*
mm: pgtable: make generic tlb_remove_table() use struct ptdesc
microblaze/mm: put mm_cmdline_setup() in .init.text section
mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range
MAINTAINERS: mm: add entry for secretmem
MAINTAINERS: mm: add entry for numa memblocks and numa emulation
...
Pull x86 TDX updates from Dave Hansen:
"Avoid direct HLT instruction execution in TDX guests.
TDX guests aren't expected to use the HLT instruction directly. It
causes a virtualization exception (#VE). While the #VE _can_ be
handled, the current handling is slow and buggy and the easiest thing
is just to avoid HLT in the first place. Plus, the kernel already has
paravirt infrastructure that makes it relatively painless.
Make TDX guests require paravirt and add some TDX-specific paravirt
handlers which avoid HLT in the normal halt routines. Also add a
warning in case another HLT sneaks in.
There was a report that this leads to a "major performance
improvement" on specjbb2015, probably because of the extra #VE
overhead or missed wakeups from the buggy HLT handling"
* tag 'x86_tdx_for_6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling
x86/tdx: Fix arch_safe_halt() execution for TDX VMs
x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively monir cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found then using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
remove the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while runnimg our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatoty cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documention
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatoty refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
Pull misc x86 fixes and updates from Ingo Molnar:
- Fix a large number of x86 Kconfig dependency and help text accuracy
bugs/problems, by Mateusz Jończyk and David Heideberg
- Fix a VM_PAT interaction with fork() crash. This also touches core
kernel code
- Fix an ORC unwinder bug for interrupt entries
- Fixes and cleanups
- Fix an AMD microcode loader bug that can promote verification
failures into success
- Add early-printk support for MMIO based UARTs on an x86 board that
had no other serial debugging facility and also experienced early
boot crashes
* tag 'x86-urgent-2025-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Fix __apply_microcode_amd()'s return value
x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
x86/fpu: Update the outdated comment above fpstate_init_user()
x86/early_printk: Add support for MMIO-based UARTs
x86/dumpstack: Fix inaccurate unwinding from exception stacks due to misplaced assignment
x86/entry: Fix ORC unwinder for PUSH_REGS with save_ret=1
x86/Kconfig: Fix lists in X86_EXTENDED_PLATFORM help text
x86/Kconfig: Correct X86_X2APIC help text
x86/speculation: Remove the extra #ifdef around CALL_NOSPEC
x86/Kconfig: Document release year of glibc 2.3.3
x86/Kconfig: Make CONFIG_PCI_CNB20LE_QUIRK depend on X86_32
x86/Kconfig: Document CONFIG_PCI_MMCONFIG
x86/Kconfig: Update lists in X86_EXTENDED_PLATFORM
x86/Kconfig: Move all X86_EXTENDED_PLATFORM options together
x86/Kconfig: Always enable ARCH_SPARSEMEM_ENABLE
x86/Kconfig: Enable X86_X2APIC by default and improve help text
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. If HLT is executed in STI-shadow, resulting #VE
handler will enable interrupts before TDCALL is routed to hypervisor
leading to missed wakeup events, as current TDX spec doesn't expose
interruptibility state information to allow #VE handler to selectively
enable interrupts.
Commit bfe6ed0c67 ("x86/tdx: Add HLT support for TDX guests")
prevented the idle routines from executing HLT instruction in STI-shadow.
But it missed the paravirt routine which can be reached via this path
as an example:
kvm_wait() =>
safe_halt() =>
raw_safe_halt() =>
arch_safe_halt() =>
irq.safe_halt() =>
pv_native_safe_halt()
To reliably handle arch_safe_halt() for TDX VMs, introduce explicit
dependency on CONFIG_PARAVIRT and override paravirt halt()/safe_halt()
routines with TDX-safe versions that execute direct TDCALL and needed
interrupt flag updates. Executing direct TDCALL brings in additional
benefit of avoiding HLT related #VEs altogether.
As tested by Ryan Afranji:
"Tested with the specjbb2015 benchmark. It has heavy lock contention which leads
to many halt calls. TDX VMs suffered a poor score before this patchset.
Verified the major performance improvement with this patchset applied."
Fixes: bfe6ed0c67 ("x86/tdx: Add HLT support for TDX guests")
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Ryan Afranji <afranji@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250228014416.3925664-3-vannapurve@google.com
Pull CRC updates from Eric Biggers:
"Another set of improvements to the kernel's CRC (cyclic redundancy
check) code:
- Rework the CRC64 library functions to be directly optimized, like
what I did last cycle for the CRC32 and CRC-T10DIF library
functions
- Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
support and acceleration for crc64_be and crc64_nvme
- Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
crc_t10dif, crc64_be, and crc64_nvme
- Remove crc_t10dif and crc64_rocksoft from the crypto API, since
they are no longer needed there
- Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect
- Add kunit test cases for crc64_nvme and crc7
- Eliminate redundant functions for calculating the Castagnoli CRC32,
settling on just crc32c()
- Remove unnecessary prompts from some of the CRC kconfig options
- Further optimize the x86 crc32c code"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits)
x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512
lib/crc: remove unnecessary prompt for CONFIG_CRC64
lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C
lib/crc: remove unnecessary prompt for CONFIG_CRC8
lib/crc: remove unnecessary prompt for CONFIG_CRC7
lib/crc: remove unnecessary prompt for CONFIG_CRC4
lib/crc7: unexport crc7_be_syndrome_table
lib/crc_kunit.c: update comment in crc_benchmark()
lib/crc_kunit.c: add test and benchmark for crc7_be()
x86/crc32: optimize tail handling for crc32c short inputs
riscv/crc64: add Zbc optimized CRC64 functions
riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
riscv/crc32: reimplement the CRC32 functions using new template
riscv/crc: add "template" for Zbc optimized CRC functions
x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings
x86/crc32: improve crc32c_arch() code generation with clang
x86/crc64: implement crc64_be and crc64_nvme using new template
x86/crc-t10dif: implement crc_t10dif using new template
x86/crc32: implement crc32_le using new template
x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
...
Pull x86 resource control updates from Borislav Petkov:
- First part of the MPAM work: split the architectural part of resctrl
from the filesystem part so that ARM's MPAM varian of resource
control can be added later while sharing the user interface with x86
(James Morse)
* tag 'x86_cache_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
x86/resctrl: Move get_{mon,ctrl}_domain_from_cpu() to live with their callers
x86/resctrl: Move get_config_index() to a header
x86/resctrl: Handle throttle_mode for SMBA resources
x86/resctrl: Move RFTYPE flags to be managed by resctrl
x86/resctrl: Make resctrl_arch_pseudo_lock_fn() take a plr
x86/resctrl: Make prefetch_disable_bits belong to the arch code
x86/resctrl: Allow an architecture to disable pseudo lock
x86/resctrl: Add resctrl_arch_ prefix to pseudo lock functions
x86/resctrl: Move mbm_cfg_mask to struct rdt_resource
x86/resctrl: Move mba_mbps_default_event init to filesystem code
x86/resctrl: Change mon_event_config_{read,write}() to be arch helpers
x86/resctrl: Add resctrl_arch_is_evt_configurable() to abstract BMEC
x86/resctrl: Move the is_mbm_*_enabled() helpers to asm/resctrl.h
x86/resctrl: Rewrite and move the for_each_*_rdt_resource() walkers
x86/resctrl: Move monitor init work to a resctrl init call
x86/resctrl: Move monitor exit work to a resctrl exit call
x86/resctrl: Add an arch helper to reset one resource
x86/resctrl: Move resctrl types to a separate header
x86/resctrl: Move rdt_find_domain() to be visible to arch and fs code
x86/resctrl: Expose resctrl fs's init function to the rest of the kernel
...
Pull VDSO infrastructure updates from Thomas Gleixner:
- Consolidate the VDSO storage
The VDSO data storage and data layout has been largely architecture
specific for historical reasons. That increases the maintenance
effort and causes inconsistencies over and over.
There is no real technical reason for architecture specific layouts
and implementations. The architecture specific details can easily be
integrated into a generic layout, which also reduces the amount of
duplicated code for managing the mappings.
Convert all architectures over to a unified layout and common mapping
infrastructure. This splits the VDSO data layout into subsystem
specific blocks, timekeeping, random and architecture parts, which
provides a better structure and allows to improve and update the
functionalities without conflict and interaction.
- Rework the timekeeping data storage
The current implementation is designed for exposing system
timekeeping accessors, which was good enough at the time when it was
designed.
PTP and Time Sensitive Networking (TSN) change that as there are
requirements to expose independent PTP clocks, which are not related
to system timekeeping.
Replace the monolithic data storage by a structured layout, which
allows to add support for independent PTP clocks on top while reusing
both the data structures and the time accessor implementations.
* tag 'timers-vdso-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
sparc/vdso: Always reject undefined references during linking
x86/vdso: Always reject undefined references during linking
vdso: Rework struct vdso_time_data and introduce struct vdso_clock
vdso: Move architecture related data before basetime data
powerpc/vdso: Prepare introduction of struct vdso_clock
arm64/vdso: Prepare introduction of struct vdso_clock
x86/vdso: Prepare introduction of struct vdso_clock
time/namespace: Prepare introduction of struct vdso_clock
vdso/namespace: Rename timens_setup_vdso_data() to reflect new vdso_clock struct
vdso/vsyscall: Prepare introduction of struct vdso_clock
vdso/gettimeofday: Prepare helper functions for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_coarse_timens() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_coarse() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_hres_timens() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_hres() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare introduction of struct vdso_clock
vdso/helpers: Prepare introduction of struct vdso_clock
vdso/datapage: Define vdso_clock to prepare for multiple PTP clocks
vdso: Make vdso_time_data cacheline aligned
arm64: Make asm/cache.h compatible with vDSO
...
Support for STA2X11-based systems was removed in February in:
dcbb01fbb7 ("x86/pci: Remove old STA2x11 support")
Intel MID for 32-bit platforms was removed from this list also in
February in:
ca5955dd5f ("x86/cpu: Document CONFIG_X86_INTEL_MID as 64-bit-only")
Intel MID for 64-bit platforms is a duplicate for "Merrifield/Moorefield
MID devices".
Fixes: 4047e8773f ("x86/Kconfig: Update lists in X86_EXTENDED_PLATFORM")
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Link: https://lore.kernel.org/r/20250322175052.43611-1-mat.jonczyk@o2.pl
Currently, it is not true that the kernel will panic with CONFIG_X86_X2APIC=n
on systems that require it; it will try to disable the APIC and run without
it to at least give the user a clear warning message. See the second
variant of check_x2apic() in arch/x86/kernel/apic/apic.c .
Also massage some other parts of the help text.
Fixes: 9232c49ff3 ("x86/Kconfig: Enable X86_X2APIC by default and improve help text")
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250322154541.40325-1-mat.jonczyk@o2.pl
I was unable to find a good description of the ServerWorks CNB20LE
chipset. However, it was probably exclusively used with the Pentium III
processor (this CPU model was used in all references to it that I
found where the CPU model was provided: dmesgs in [1] and [2];
[3] page 2; [4]-[7]).
As is widely known, the Pentium III processor did not support the 64-bit
mode, support for which was introduced by Intel a couple of years later.
So it is safe to assume that no systems with the CNB20LE chipset have
amd64 and the CONFIG_PCI_CNB20LE_QUIRK may now depend on X86_32.
Additionally, I have determined that most computers with the CNB20LE
chipset did have ACPI support and this driver was inactive on them.
I have submitted a patch to remove this driver, but it was met with
resistance [8].
[1] Jim Studt, Re: Problem with ServerWorks CNB20LE and lost interrupts
Linux Kernel Mailing List, https://lkml.org/lkml/2002/1/11/111
[2] RedHat Bug 665109 - e100 problems on old Compaq Proliant DL320
https://bugzilla.redhat.com/show_bug.cgi?id=665109
[3] R. Hughes-Jones, S. Dallison, G. Fairey, Performance Measurements on
Gigabit Ethernet NICs and Server Quality Motherboards,
http://datatag.web.cern.ch/papers/pfldnet2003-rhj.doc
[4] "Hardware for Linux",
Probe #d6b5151873 of Intel STL2-bd A28808-302 Desktop Computer (STL2)
https://linux-hardware.org/?probe=d6b5151873
[5] "Hardware for Linux", Probe #0b5d843f10 of Compaq ProLiant DL380
https://linux-hardware.org/?probe=0b5d843f10
[6] Ubuntu Forums, Dell Poweredge 2400 - Adaptec SCSI Bus AIC-7880
https://ubuntuforums.org/showthread.php?t=1689552
[7] Ira W. Snyder, "BISECTED: 2.6.35 (and -git) fail to boot: APIC problems"
https://lkml.org/lkml/2010/8/13/220
[8] Bjorn Helgaas, "Re: [PATCH] x86/pci: drop ServerWorks / Broadcom
CNB20LE PCI host bridge driver"
https://lore.kernel.org/lkml/20220318165535.GA840063@bhelgaas/T/
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heideberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-6-b0cbaa6fa338@ixit.cz
The order of the entries matches the order they appear in Kconfig.
In 2011, AMD Elan was moved to Kconfig.cpu and the dependency on
X86_EXTENDED_PLATFORM was dropped in:
ce9c99af8d ("x86, cpu: Move AMD Elan Kconfig under "Processor family"")
Support for Moorestown MID devices was removed in 2012 in:
1a8359e411 ("x86/mid: Remove Intel Moorestown")
SGI 320/540 (Visual Workstation) was removed in 2014 in:
c5f9ee3d66 ("x86, platforms: Remove SGI Visual Workstation")
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heideberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-4-b0cbaa6fa338@ixit.cz
It appears that (X86_64 || X86_32) is always true on x86.
This logical OR directive was introduced in:
6ea3038648 ("arch/x86: remove depends on CONFIG_EXPERIMENTAL")
By (EXPERIMENTAL && X86_32) turning into (X86_32). Since
this change was an identity transformation, nobody noticed
that the condition turned into 'true'.
[ mingo: Updated changelog ]
Fixes: 6ea3038648 ("arch/x86: remove depends on CONFIG_EXPERIMENTAL")
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heideberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-2-b0cbaa6fa338@ixit.cz
As many current platforms (most modern Intel CPUs and QEMU) have x2APIC
present, enable CONFIG_X86_X2APIC by default as it gives performance
and functionality benefits. Additionally, if the BIOS has already
switched APIC to x2APIC mode, but CONFIG_X86_X2APIC is disabled, the
kernel will panic in arch/x86/kernel/apic/apic.c .
Also improve the help text, which was confusing and really did not
describe what the feature is about.
Help text references and discussion:
Both Intel [1] and AMD [3] spell the name as "x2APIC", not "x2apic".
"It allows faster access to the local APIC"
[2], chapter 2.1, page 15:
"More efficient MSR interface to access APIC registers."
"x2APIC was introduced in Intel CPUs around 2008":
I was unable to find specific information which Intel CPUs
support x2APIC. Wikipedia claims it was "introduced with the
Nehalem microarchitecture in November 2008", but I was not able
to confirm this independently. At least some Nehalem CPUs do not
support x2APIC [1].
The documentation [2] is dated June 2008. Linux kernel also
introduced x2APIC support in 2008, so the year seems to be
right.
"and in AMD EPYC CPUs in 2019":
[3], page 15:
"AMD introduced an x2APIC in our EPYC 7002 Series processors for
the first time."
"It is also frequently emulated in virtual machines, even when the host
CPU does not support it."
[1]
"If this configuration option is disabled, the kernel will not boot on
some platforms that have x2APIC enabled."
According to some BIOS documentation [4], the x2APIC may be
"disabled", "enabled", or "force enabled" on this system.
I think that "enabled" means "made available to the operating
system, but not already turned on" and "force enabled" means
"already switched to x2APIC mode when the OS boots". Only in the
latter mode a kernel without CONFIG_X86_X2APIC will panic in
validate_x2apic() in arch/x86/kernel/apic/apic.c .
QEMU 4.2.1 and my Intel HP laptop (bought in 2019) use the
"enabled" mode and the kernel does not panic.
[1] "Re: [Qemu-devel] [Question] why x2apic's set by default without host sup"
https://lists.gnu.org/archive/html/qemu-devel/2013-07/msg03527.html
[2] Intel® 64 Architecture x2APIC Specification,
( https://www.naic.edu/~phil/software/intel/318148.pdf )
[3] Workload Tuning Guide for AMD EPYC ™ 7002 Series Processor Based
Servers Application Note,
https://developer.amd.com/wp-content/resources/56745_0.80.pdf
[4] UEFI System Utilities and Shell Command Mobile Help for HPE ProLiant
Gen10, ProLiant Gen10 Plus Servers and HPE Synergy:
Enabling or disabling Processor x2APIC Support
https://techlibrary.hpe.com/docs/iss/proliant-gen10-uefi/s_enable_disable_x2APIC_support.html
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-1-b0cbaa6fa338@ixit.cz
Required and disabled feature masks completely rely on build configs,
i.e., once a build config is fixed, so are the feature masks.
To prepare for auto-generating the <asm/cpufeaturemasks.h> header
with required and disabled feature masks based on a build config,
add feature Kconfig items:
- X86_REQUIRED_FEATURE_x
- X86_DISABLED_FEATURE_x
each of which may be set to "y" if and only if its preconditions from
current build config are met.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250228082338.73859-3-xin@zytor.com
Some architectures build vmlinux with static relocations preserved, but
strip them again from the final vmlinux image. Arch specific tools
consume these static relocations in order to construct relocation tables
for KASLR.
The fact that vmlinux is created, consumed and subsequently updated goes
against the typical, declarative paradigm used by Make, which is based
on rules and dependencies. So as a first step towards cleaning this up,
introduce a Kconfig symbol to declare that the arch wants to consume the
static relocations emitted into vmlinux. This will be wired up further
in subsequent patches.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Pseudo-lock relies on knowledge of the micro-architecture to disable
prefetchers etc.
On arm64 these controls are typically secure only, meaning Linux can't access
them. Arm's cache-lockdown feature works in a very different way. Resctrl's
pseudo-lock isn't going to be used on arm64 platforms.
Add a Kconfig symbol that can be selected by the architecture. This enables or
disables building of the pseudo_lock.c file, and replaces the functions with
stubs. An additional IS_ENABLED() check is needed in rdtgroup_mode_write() so
that attempting to enable pseudo-lock reports an "Unknown or unsupported mode"
to user-space via the last_cmd_status file.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-25-james.morse@arm.com
The CONFIG_EISA menu was cleaned up in 2018, but this inadvertently
brought the option back on 64-bit machines: ISA remains guarded by
a CONFIG_X86_32 check, but EISA no longer depends on ISA.
The last Intel machines ith EISA support used a 82375EB PCI/EISA bridge
from 1993 that could be paired with the 440FX chipset on early Pentium-II
CPUs, long before the first x86-64 products.
Fixes: 6630a8e501 ("eisa: consolidate EISA Kconfig entry in drivers/eisa")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-11-arnd@kernel.org