From 5c247d08bc81bbad4c662dcf5654137a2f8483ec Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Feb 2026 20:10:10 +0000 Subject: [PATCH 001/373] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on nested #VMEXIT. This value is incorrect in some cases, causing L1 to run L2 with a corrupted CR2. This could lead to segfaults or data corruption if L2 is in the middle of handling a #PF and reads a corrupted CR2. Use the correct value in vcpu->arch.cr2 instead. The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN of L2, and sync'd back to vcpu->arch.cr2 shortly after. The values are only out of sync in two cases: after save+restore, and after a #PF is injected into L2. In either case, if a #VMEXIT to L1 is synthesized before L2 runs, using the value in vmcb02 would be incorrect. After save+restore, the value of CR2 is restored by KVM_SET_SREGS into vcpu->arch.cr2. It is not reflected in vmcb02 until a VMRUN of L2. Before that, it holds whatever was in vmcb02 before restore, which would be zero on a new vCPU that never ran nested. If a #VMEXIT to L1 is synthesized before L2 ever runs, using vcpu->arch.cr2 to update vmcb12 is the right thing to do. The #PF injection case is more nuanced. Although the APM is a bit unclear about when CR2 is written during a #PF, the SDM is clearer: Processors update CR2 whenever a page fault is detected. If a second page fault occurs while an earlier page fault is being delivered, the faulting linear address of the second fault will overwrite the contents of CR2 (replacing the previous address). These updates to CR2 occur even if the page fault results in a double fault or occurs during the delivery of a double fault. KVM injecting the exception surely counts as the #PF being "detected". 
More importantly, when an exception is injected into L2 at the time of a synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly, such that an L1 hypervisor can re-inject the exception. If CR2 is not written at that point, the L1 hypervisor has no way of correctly re-injecting the #PF. Hence, if a #VMEXIT to L1 is synthesized after the #PF is injected into L2 but before it actually runs, using vcpu->arch.cr2 to update vmcb12 is also the right thing to do. Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for L2, only when it is injected. The distinction is important, because only injected (but not intercepted) exceptions are propagated to L1 through exit_int_info. It would be incorrect to update CR2 in vmcb12 for a pending #PF, as L1 would perceive an updated CR2 value with no #PF. Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260203201010.1871056-1-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 53ab6ce3cc26..99f8b8de8159 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1156,7 +1156,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm) vmcb12->save.efer = svm->vcpu.arch.efer; vmcb12->save.cr0 = kvm_read_cr0(vcpu); vmcb12->save.cr3 = kvm_read_cr3(vcpu); - vmcb12->save.cr2 = vmcb02->save.cr2; + vmcb12->save.cr2 = vcpu->arch.cr2; vmcb12->save.cr4 = svm->vcpu.arch.cr4; vmcb12->save.rflags = kvm_get_rflags(vcpu); vmcb12->save.rip = kvm_rip_read(vcpu); From d0ad1b05bbe6f8da159a4dfb6692b3b7ce30ccc8 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 17 Feb 2026 16:54:38 -0800 Subject: [PATCH 002/373] KVM: x86: Defer non-architectural delivery of exception payload to userspace read When attempting to play nice with userspace that hasn't enabled KVM_CAP_EXCEPTION_PAYLOAD, defer KVM's non-architectural 
delivery of the payload until userspace actually reads relevant vCPU state, and more importantly, force delivery of the payload in *all* paths where userspace saves relevant vCPU state, not just KVM_GET_VCPU_EVENTS. Ignoring userspace save/restore for the moment, delivering the payload before the exception is injected is wrong regardless of whether L1 or L2 is running. To make matters even more confusing, the flaw *currently* being papered over by the !is_guest_mode() check isn't even the same bug that commit da998b46d244 ("kvm: x86: Defer setting of CR2 until #PF delivery") was trying to avoid. At the time of commit da998b46d244, KVM didn't correctly handle exception intercepts, as KVM would wait until VM-Entry into L2 was imminent to check if the queued exception should morph to a nested VM-Exit. I.e. KVM would deliver the payload to L2 and then synthesize a VM-Exit into L1. But the payload was only the most blatant issue, e.g. waiting to check exception intercepts would also lead to KVM incorrectly escalating a should-be-intercepted #PF into a #DF. That underlying bug was eventually fixed by commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending VM-Exits at queue time"), but in the interim, commit a06230b62b89 ("KVM: x86: Deliver exception payload on KVM_GET_VCPU_EVENTS") came along and subtly added another dependency on the !is_guest_mode() check. While not recorded in the changelog, the motivation for deferring the !exception_payload_enabled delivery was to fix a flaw where a synthesized MTF (Monitor Trap Flag) VM-Exit would drop a pending #DB and clobber DR6. On a VM-Exit, VMX CPUs save pending #DB information into the VMCS, which is emulated by KVM in nested_vmx_update_pending_dbg() by grabbing the payload from the queue/pending exception. I.e. prematurely delivering the payload would cause the pending #DB to not be recorded in the VMCS, and of course, clobber L2's DR6 as seen by L1. 
Jumping back to save+restore, the quirked behavior of forcing delivery of the payload only works if userspace does KVM_GET_VCPU_EVENTS *before* CR2 or DR6 is saved, i.e. before KVM_GET_SREGS{,2} and KVM_GET_DEBUGREGS. E.g. if userspace does KVM_GET_SREGS before KVM_GET_VCPU_EVENTS, then the CR2 saved by userspace won't contain the payload for the exception saved by KVM_GET_VCPU_EVENTS. Deliberately deliver the payload in the store_regs() path, as it's the least awful option even though userspace may not be doing save+restore. Because if userspace _is_ doing save+restore, it could elide KVM_GET_SREGS knowing that SREGS were already saved when the vCPU exited. Link: https://lore.kernel.org/all/20200207103608.110305-1-oupton@google.com Cc: Yosry Ahmed Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed Tested-by: Yosry Ahmed Link: https://patch.msgid.link/20260218005438.2619063-1-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++----------------- 1 file changed, 39 insertions(+), 23 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a03530795707..6e87ec52fa06 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, vcpu->arch.exception.error_code = error_code; vcpu->arch.exception.has_payload = has_payload; vcpu->arch.exception.payload = payload; - if (!is_guest_mode(vcpu)) - kvm_deliver_exception_payload(vcpu, - &vcpu->arch.exception); return; } @@ -5531,18 +5528,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu, return 0; } -static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, - struct kvm_vcpu_events *events) +static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu) { - struct kvm_queued_exception *ex; - - process_nmi(vcpu); - -#ifdef CONFIG_KVM_SMM - if (kvm_check_request(KVM_REQ_SMI, vcpu)) - process_smi(vcpu); -#endif - - /* * KVM's 
ABI only allows for one exception to be migrated. Luckily, * the only time there can be two queued exceptions is if there's a @@ -5553,21 +5540,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, if (vcpu->arch.exception_vmexit.pending && !vcpu->arch.exception.pending && !vcpu->arch.exception.injected) - ex = &vcpu->arch.exception_vmexit; - else - ex = &vcpu->arch.exception; + return &vcpu->arch.exception_vmexit; + + return &vcpu->arch.exception; +} + +static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu) +{ + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); /* - * In guest mode, payload delivery should be deferred if the exception - * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1 - * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability, - * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not - * propagate the payload and so it cannot be safely deferred. Deliver - * the payload if the capability hasn't been requested. + * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver + * the pending exception payload when userspace saves *any* vCPU state + * that interacts with exception payloads to avoid breaking userspace. + * + * Architecturally, KVM must not deliver an exception payload until the + * exception is actually injected, e.g. to avoid losing pending #DB + * information (which VMX tracks in the VMCS), and to avoid clobbering + * state if the exception is never injected for whatever reason. But + * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or + * may not propagate the payload across save+restore, and so KVM can't + * safely defer delivery of the payload. 
*/ if (!vcpu->kvm->arch.exception_payload_enabled && ex->pending && ex->has_payload) kvm_deliver_exception_payload(vcpu, ex); +} + +static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, + struct kvm_vcpu_events *events) +{ + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); + + process_nmi(vcpu); + +#ifdef CONFIG_KVM_SMM + if (kvm_check_request(KVM_REQ_SMI, vcpu)) + process_smi(vcpu); +#endif + + kvm_handle_exception_payload_quirk(vcpu); memset(events, 0, sizeof(*events)); @@ -5746,6 +5758,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, vcpu->arch.guest_state_protected) return -EINVAL; + kvm_handle_exception_payload_quirk(vcpu); + memset(dbgregs, 0, sizeof(*dbgregs)); BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db)); @@ -12136,6 +12150,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) if (vcpu->arch.guest_state_protected) goto skip_protected_regs; + kvm_handle_exception_payload_quirk(vcpu); + kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS); kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS); kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES); From 0c96c47d4345084f543f2fe60ab031507b9a1b2f Mon Sep 17 00:00:00 2001 From: Zhiquan Li Date: Thu, 12 Feb 2026 18:38:38 +0800 Subject: [PATCH 003/373] KVM: selftests: Add CPU vendor detection for Hygon Currently some KVM selftests fail on Hygon CPUs due to missing vendor detection and edge-case handling specific to Hygon's architecture. Add CPU vendor detection for Hygon and add a global variable "host_cpu_is_hygon" as the basic facility for the following fixes. 
Signed-off-by: Zhiquan Li Link: https://patch.msgid.link/20260212103841.171459-2-zhiquan_li@163.com Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/include/x86/processor.h | 6 ++++++ tools/testing/selftests/kvm/lib/x86/processor.c | 3 +++ 2 files changed, 9 insertions(+) diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h index 4ebae4269e68..1338de7111e7 100644 --- a/tools/testing/selftests/kvm/include/x86/processor.h +++ b/tools/testing/selftests/kvm/include/x86/processor.h @@ -21,6 +21,7 @@ extern bool host_cpu_is_intel; extern bool host_cpu_is_amd; +extern bool host_cpu_is_hygon; extern uint64_t guest_tsc_khz; #ifndef MAX_NR_CPUID_ENTRIES @@ -694,6 +695,11 @@ static inline bool this_cpu_is_amd(void) return this_cpu_vendor_string_is("AuthenticAMD"); } +static inline bool this_cpu_is_hygon(void) +{ + return this_cpu_vendor_string_is("HygonGenuine"); +} + static inline uint32_t __this_cpu_has(uint32_t function, uint32_t index, uint8_t reg, uint8_t lo, uint8_t hi) { diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c index fab18e9be66c..f6b1c5324931 100644 --- a/tools/testing/selftests/kvm/lib/x86/processor.c +++ b/tools/testing/selftests/kvm/lib/x86/processor.c @@ -23,6 +23,7 @@ vm_vaddr_t exception_handlers; bool host_cpu_is_amd; bool host_cpu_is_intel; +bool host_cpu_is_hygon; bool is_forced_emulation_enabled; uint64_t guest_tsc_khz; @@ -792,6 +793,7 @@ void kvm_arch_vm_post_create(struct kvm_vm *vm, unsigned int nr_vcpus) sync_global_to_guest(vm, host_cpu_is_intel); sync_global_to_guest(vm, host_cpu_is_amd); + sync_global_to_guest(vm, host_cpu_is_hygon); sync_global_to_guest(vm, is_forced_emulation_enabled); sync_global_to_guest(vm, pmu_errata_mask); @@ -1424,6 +1426,7 @@ void kvm_selftest_arch_init(void) { host_cpu_is_intel = this_cpu_is_intel(); host_cpu_is_amd = this_cpu_is_amd(); + host_cpu_is_hygon = 
this_cpu_is_hygon(); is_forced_emulation_enabled = kvm_is_forced_emulation_enabled(); kvm_init_pmu_errata(); From 53b2869231d3211ed638295e8873215d0ad10507 Mon Sep 17 00:00:00 2001 From: Zhiquan Li Date: Thu, 12 Feb 2026 18:38:39 +0800 Subject: [PATCH 004/373] KVM: selftests: Add a flag to identify AMD compatible test cases Most KVM x86 selftests for AMD are compatible with the Hygon architecture (but not all), so add a flag "host_cpu_is_amd_compatible" to identify these cases. The following test failures on Hygon platforms can be fixed: * Fix the hypercall test: the Hygon architecture also uses VMMCALL as the guest hypercall instruction. * The following tests fail due to accessing reserved memory address regions: - access_tracking_perf_test - demand_paging_test - dirty_log_perf_test - dirty_log_test - kvm_page_table_test - memslot_modification_stress_test - pre_fault_memory_test - x86/dirty_log_page_splitting_test Hygon CSV also applies the "physical address space width reduction"; the reduced physical address bits are reported by bits 11:6 of CPUID[0x8000001f].EBX as well, so the existing logic is fully applicable to Hygon processors. Mapping memory into these regions and accessing it results in a #PF. 
Signed-off-by: Zhiquan Li Link: https://patch.msgid.link/20260212103841.171459-3-zhiquan_li@163.com Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/include/x86/processor.h | 1 + tools/testing/selftests/kvm/lib/x86/processor.c | 12 ++++++++---- tools/testing/selftests/kvm/x86/fix_hypercall_test.c | 2 +- tools/testing/selftests/kvm/x86/msrs_test.c | 2 +- tools/testing/selftests/kvm/x86/xapic_state_test.c | 2 +- 5 files changed, 12 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h index 1338de7111e7..40e3deb64812 100644 --- a/tools/testing/selftests/kvm/include/x86/processor.h +++ b/tools/testing/selftests/kvm/include/x86/processor.h @@ -22,6 +22,7 @@ extern bool host_cpu_is_intel; extern bool host_cpu_is_amd; extern bool host_cpu_is_hygon; +extern bool host_cpu_is_amd_compatible; extern uint64_t guest_tsc_khz; #ifndef MAX_NR_CPUID_ENTRIES diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c index f6b1c5324931..f4e8649071b6 100644 --- a/tools/testing/selftests/kvm/lib/x86/processor.c +++ b/tools/testing/selftests/kvm/lib/x86/processor.c @@ -24,6 +24,7 @@ vm_vaddr_t exception_handlers; bool host_cpu_is_amd; bool host_cpu_is_intel; bool host_cpu_is_hygon; +bool host_cpu_is_amd_compatible; bool is_forced_emulation_enabled; uint64_t guest_tsc_khz; @@ -794,6 +795,7 @@ void kvm_arch_vm_post_create(struct kvm_vm *vm, unsigned int nr_vcpus) sync_global_to_guest(vm, host_cpu_is_intel); sync_global_to_guest(vm, host_cpu_is_amd); sync_global_to_guest(vm, host_cpu_is_hygon); + sync_global_to_guest(vm, host_cpu_is_amd_compatible); sync_global_to_guest(vm, is_forced_emulation_enabled); sync_global_to_guest(vm, pmu_errata_mask); @@ -1350,7 +1352,8 @@ const struct kvm_cpuid_entry2 *get_cpuid_entry(const struct kvm_cpuid2 *cpuid, "1: vmmcall\n\t" \ "2:" \ : "=a"(r) \ - : [use_vmmcall] "r" (host_cpu_is_amd), 
inputs); \ + : [use_vmmcall] "r" (host_cpu_is_amd_compatible), \ + inputs); \ \ r; \ }) @@ -1390,8 +1393,8 @@ unsigned long vm_compute_max_gfn(struct kvm_vm *vm) max_gfn = (1ULL << (guest_maxphyaddr - vm->page_shift)) - 1; - /* Avoid reserved HyperTransport region on AMD processors. */ - if (!host_cpu_is_amd) + /* Avoid reserved HyperTransport region on AMD or Hygon processors. */ + if (!host_cpu_is_amd_compatible) return max_gfn; /* On parts with <40 physical address bits, the area is fully hidden */ @@ -1405,7 +1408,7 @@ unsigned long vm_compute_max_gfn(struct kvm_vm *vm) /* * Otherwise it's at the top of the physical address space, possibly - * reduced due to SME by bits 11:6 of CPUID[0x8000001f].EBX. Use + * reduced due to SME or CSV by bits 11:6 of CPUID[0x8000001f].EBX. Use * the old conservative value if MAXPHYADDR is not enumerated. */ if (!this_cpu_has_p(X86_PROPERTY_MAX_PHY_ADDR)) @@ -1427,6 +1430,7 @@ void kvm_selftest_arch_init(void) host_cpu_is_intel = this_cpu_is_intel(); host_cpu_is_amd = this_cpu_is_amd(); host_cpu_is_hygon = this_cpu_is_hygon(); + host_cpu_is_amd_compatible = host_cpu_is_amd || host_cpu_is_hygon; is_forced_emulation_enabled = kvm_is_forced_emulation_enabled(); kvm_init_pmu_errata(); diff --git a/tools/testing/selftests/kvm/x86/fix_hypercall_test.c b/tools/testing/selftests/kvm/x86/fix_hypercall_test.c index 762628f7d4ba..00b6e85735dd 100644 --- a/tools/testing/selftests/kvm/x86/fix_hypercall_test.c +++ b/tools/testing/selftests/kvm/x86/fix_hypercall_test.c @@ -52,7 +52,7 @@ static void guest_main(void) if (host_cpu_is_intel) { native_hypercall_insn = vmx_vmcall; other_hypercall_insn = svm_vmmcall; - } else if (host_cpu_is_amd) { + } else if (host_cpu_is_amd_compatible) { native_hypercall_insn = svm_vmmcall; other_hypercall_insn = vmx_vmcall; } else { diff --git a/tools/testing/selftests/kvm/x86/msrs_test.c b/tools/testing/selftests/kvm/x86/msrs_test.c index 40d918aedce6..4c97444fdefe 100644 --- 
a/tools/testing/selftests/kvm/x86/msrs_test.c +++ b/tools/testing/selftests/kvm/x86/msrs_test.c @@ -81,7 +81,7 @@ static u64 fixup_rdmsr_val(u32 msr, u64 want) * is supposed to emulate that behavior based on guest vendor model * (which is the same as the host vendor model for this test). */ - if (!host_cpu_is_amd) + if (!host_cpu_is_amd_compatible) return want; switch (msr) { diff --git a/tools/testing/selftests/kvm/x86/xapic_state_test.c b/tools/testing/selftests/kvm/x86/xapic_state_test.c index 3b4814c55722..0c5e12f5f14e 100644 --- a/tools/testing/selftests/kvm/x86/xapic_state_test.c +++ b/tools/testing/selftests/kvm/x86/xapic_state_test.c @@ -248,7 +248,7 @@ int main(int argc, char *argv[]) * drops writes, AMD does not). Account for the errata when checking * that KVM reads back what was written. */ - x.has_xavic_errata = host_cpu_is_amd && + x.has_xavic_errata = host_cpu_is_amd_compatible && get_kvm_amd_param_bool("avic"); vcpu_clear_cpuid_feature(x.vcpu, X86_FEATURE_X2APIC); From 6b8b11ba47159f33d766b274001529da9d5b0913 Mon Sep 17 00:00:00 2001 From: Zhiquan Li Date: Thu, 12 Feb 2026 18:38:40 +0800 Subject: [PATCH 005/373] KVM: selftests: Allow the PMU event filter test for Hygon At present, the PMU event filter test for the AMD architecture is applicable to the Hygon architecture as well. Since all known Hygon processors can re-use the test cases, it isn't necessary to create a new wrapper. 
Signed-off-by: Zhiquan Li Link: https://patch.msgid.link/20260212103841.171459-4-zhiquan_li@163.com Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/x86/pmu_event_filter_test.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/x86/pmu_event_filter_test.c b/tools/testing/selftests/kvm/x86/pmu_event_filter_test.c index 1c5b7611db24..93b61c077991 100644 --- a/tools/testing/selftests/kvm/x86/pmu_event_filter_test.c +++ b/tools/testing/selftests/kvm/x86/pmu_event_filter_test.c @@ -361,7 +361,8 @@ static bool use_intel_pmu(void) */ static bool use_amd_pmu(void) { - return host_cpu_is_amd && kvm_cpu_family() >= 0x17; + return (host_cpu_is_amd && kvm_cpu_family() >= 0x17) || + host_cpu_is_hygon; } /* From 9396cc1e282a280bcba2e932e03994e0aada4cd8 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 12 Feb 2026 18:38:41 +0800 Subject: [PATCH 006/373] KVM: selftests: Fix reserved value WRMSR testcase for multi-feature MSRs When determining whether or not a WRMSR with reserved bits will #GP or succeed due to the WRMSR not existing per the guest virtual CPU model, expect failure if and only if _all_ features associated with the MSR are unsupported. Checking only the primary feature results in false failures when running on AMD and Hygon CPUs with only one of RDPID or RDTSCP, as AMD/Hygon CPUs ignore MSR_TSC_AUX[63:32], i.e. don't treat the bits as reserved, and so #GP only if the MSR is unsupported. 
Fixes: 9c38ddb3df94 ("KVM: selftests: Add an MSR test to exercise guest/host and read/write") Reported-by: Zhiquan Li Closes: https://lore.kernel.org/all/20260209041305.64906-6-zhiquan_li@163.com Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260212103841.171459-5-zhiquan_li@163.com Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/x86/msrs_test.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/x86/msrs_test.c b/tools/testing/selftests/kvm/x86/msrs_test.c index 4c97444fdefe..f7e39bf887ad 100644 --- a/tools/testing/selftests/kvm/x86/msrs_test.c +++ b/tools/testing/selftests/kvm/x86/msrs_test.c @@ -175,7 +175,7 @@ void guest_test_reserved_val(const struct kvm_msr *msr) * If the CPU will truncate the written value (e.g. SYSENTER on AMD), * expect success and a truncated value, not #GP. */ - if (!this_cpu_has(msr->feature) || + if ((!this_cpu_has(msr->feature) && !this_cpu_has(msr->feature2)) || msr->rsvd_val == fixup_rdmsr_val(msr->index, msr->rsvd_val)) { u8 vec = wrmsr_safe(msr->index, msr->rsvd_val); From e1df128dc00beaa53b0be4e751b7f2f0192dc146 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Thu, 12 Feb 2026 22:24:04 +0100 Subject: [PATCH 007/373] KVM: x86: Zero-initialize temporary fxregs_state buffers in FXSAVE emulation Explicitly zero-initialize stack-allocated struct fxregs_state variables in em_fxsave() and fxregs_fixup() to ensure all padding and unused fields are cleared before use. Both functions declare temporary fxregs_state buffers that may be partially written by fxsave. Although the emulator copies only the architecturally defined portion of the state to userspace, any padding or otherwise untouched bytes in the structure can remain uninitialized. This can lead to the use of uninitialized stack data and may trigger KMSAN reports. In the worst case, it could result in leaking stack contents if such bytes are ever exposed. No functional change intended. 
Suggested-by: Sean Christopherson Signed-off-by: Uros Bizjak Cc: Sean Christopherson Cc: Paolo Bonzini Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Link: https://patch.msgid.link/20260212212457.24483-1-ubizjak@gmail.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/emulate.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c8e292e9a24d..20ed588015f1 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3708,7 +3708,7 @@ static inline size_t fxstate_size(struct x86_emulate_ctxt *ctxt) */ static int em_fxsave(struct x86_emulate_ctxt *ctxt) { - struct fxregs_state fx_state; + struct fxregs_state fx_state = {}; int rc; rc = check_fxsr(ctxt); @@ -3738,7 +3738,7 @@ static int em_fxsave(struct x86_emulate_ctxt *ctxt) static noinline int fxregs_fixup(struct fxregs_state *fx_state, const size_t used_size) { - struct fxregs_state fx_tmp; + struct fxregs_state fx_tmp = {}; int rc; rc = asm_safe("fxsave %[fx]", , [fx] "+m"(fx_tmp)); From c522ac04ba9d7ec6003633aa1501c7392cdf8b2d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Carlos=20L=C3=B3pez?= Date: Thu, 12 Feb 2026 15:05:56 +0100 Subject: [PATCH 008/373] KVM: x86/pmu: annotate struct kvm_x86_pmu_event_filter with __counted_by() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit struct kvm_x86_pmu_event_filter has a flexible array member, so annotate it with the field that describes the amount of entries in such array. Opportunistically replace the open-coded array size calculation with flex_array_size() when copying the array portion of the struct from userspace. 
Signed-off-by: Carlos López Link: https://patch.msgid.link/20260212140556.3883030-2-clopez@suse.de Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/pmu.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ff07c45e3c73..d9159b969bd9 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1261,7 +1261,7 @@ struct kvm_x86_pmu_event_filter { __u32 nr_excludes; __u64 *includes; __u64 *excludes; - __u64 events[]; + __u64 events[] __counted_by(nevents); }; enum kvm_apicv_inhibit { diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index bd6b785cf261..e218352e3423 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -1256,7 +1256,7 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp) r = -EFAULT; if (copy_from_user(filter->events, user_filter->events, - sizeof(filter->events[0]) * filter->nevents)) + flex_array_size(filter, events, filter->nevents))) goto cleanup; r = prepare_filter_lists(filter); From 46ee9d718b9b67a8be067a39e21da6634107ed0e Mon Sep 17 00:00:00 2001 From: Li RongQing Date: Tue, 10 Feb 2026 01:21:43 -0500 Subject: [PATCH 009/373] KVM: Mark halt poll and other module parameters with appropriate memory attributes Add '__read_mostly' to the halt polling parameters (halt_poll_ns, halt_poll_ns_grow, halt_poll_ns_grow_start, halt_poll_ns_shrink) since they are frequently read in hot paths (e.g., vCPU halt handling) but only occasionally updated via sysfs. This improves cache locality on SMP systems. Conversely, mark 'allow_unsafe_mappings' and 'enable_virt_at_load' with '__ro_after_init', as they are set only during module initialization via kernel command line or early sysfs writes and remain constant thereafter. This enhances security by preventing runtime modification and enables compiler optimizations. 
Signed-off-by: Li RongQing Link: https://patch.msgid.link/20260210062143.1739-1-lirongqing@baidu.com Signed-off-by: Sean Christopherson --- virt/kvm/kvm_main.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1bc1da66b4b0..66371d8139d8 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -76,22 +76,22 @@ MODULE_DESCRIPTION("Kernel-based Virtual Machine (KVM) Hypervisor"); MODULE_LICENSE("GPL"); /* Architectures should define their poll value according to the halt latency */ -unsigned int halt_poll_ns = KVM_HALT_POLL_NS_DEFAULT; +unsigned int __read_mostly halt_poll_ns = KVM_HALT_POLL_NS_DEFAULT; module_param(halt_poll_ns, uint, 0644); EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns); /* Default doubles per-vcpu halt_poll_ns. */ -unsigned int halt_poll_ns_grow = 2; +unsigned int __read_mostly halt_poll_ns_grow = 2; module_param(halt_poll_ns_grow, uint, 0644); EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_grow); /* The start value to grow halt_poll_ns from */ -unsigned int halt_poll_ns_grow_start = 10000; /* 10us */ +unsigned int __read_mostly halt_poll_ns_grow_start = 10000; /* 10us */ module_param(halt_poll_ns_grow_start, uint, 0644); EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_grow_start); /* Default halves per-vcpu halt_poll_ns. */ -unsigned int halt_poll_ns_shrink = 2; +unsigned int __read_mostly halt_poll_ns_shrink = 2; module_param(halt_poll_ns_shrink, uint, 0644); EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink); @@ -99,7 +99,7 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink); * Allow direct access (from KVM or the CPU) without MMU notifier protection * to unpinned pages. 
*/ -static bool allow_unsafe_mappings; +static bool __ro_after_init allow_unsafe_mappings; module_param(allow_unsafe_mappings, bool, 0444); /* @@ -5574,7 +5574,7 @@ static struct miscdevice kvm_dev = { }; #ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING -bool enable_virt_at_load = true; +bool __ro_after_init enable_virt_at_load = true; module_param(enable_virt_at_load, bool, 0444); EXPORT_SYMBOL_FOR_KVM_INTERNAL(enable_virt_at_load); From 6dad59124e1536a38e0f177d45418ebe1e0eea1f Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Wed, 11 Feb 2026 11:28:49 +0100 Subject: [PATCH 010/373] KVM: VMX: Drop obsolete branch hint prefixes from inline asm Remove explicit branch hint prefixes (.byte 0x2e / 0x3e) from VMX inline assembly sequences. These prefixes (CS/DS segment overrides used as branch hints on very old x86 CPUs) have been ignored by modern processors for a long time. Keeping them provides no measurable benefit and only enlarges the generated code. No functional change intended. Signed-off-by: Uros Bizjak Cc: Sean Christopherson Cc: Paolo Bonzini Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Link: https://patch.msgid.link/20260211102928.100944-1-ubizjak@gmail.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx_ops.h | 3 --- 1 file changed, 3 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx_ops.h b/arch/x86/kvm/vmx/vmx_ops.h index 96677576c836..1000d37f5b0c 100644 --- a/arch/x86/kvm/vmx/vmx_ops.h +++ b/arch/x86/kvm/vmx/vmx_ops.h @@ -119,7 +119,6 @@ do_exception: #else /* !CONFIG_CC_HAS_ASM_GOTO_OUTPUT */ asm volatile("1: vmread %[field], %[output]\n\t" - ".byte 0x3e\n\t" /* branch taken hint */ "ja 3f\n\t" /* @@ -191,7 +190,6 @@ static __always_inline unsigned long vmcs_readl(unsigned long field) #define vmx_asm1(insn, op1, error_args...) 
\ do { \ asm goto("1: " __stringify(insn) " %0\n\t" \ - ".byte 0x2e\n\t" /* branch not taken hint */ \ "jna %l[error]\n\t" \ _ASM_EXTABLE(1b, %l[fault]) \ : : op1 : "cc" : error, fault); \ @@ -208,7 +206,6 @@ fault: \ #define vmx_asm2(insn, op1, op2, error_args...) \ do { \ asm goto("1: " __stringify(insn) " %1, %0\n\t" \ - ".byte 0x2e\n\t" /* branch not taken hint */ \ "jna %l[error]\n\t" \ _ASM_EXTABLE(1b, %l[fault]) \ : : op1, op2 : "cc" : error, fault); \ From 192f777b3af084d2073037b13ed0c2457e563d39 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Wed, 11 Feb 2026 11:28:50 +0100 Subject: [PATCH 011/373] KVM: VMX: Use ASM_INPUT_RM in __vmcs_writel Use the ASM_INPUT_RM macro for the VMCS write operation in vmx_ops.h to work around clang problems with the "rm" asm constraint. clang seems to always choose the memory input, while it is almost always the worst choice. Signed-off-by: Uros Bizjak Cc: Sean Christopherson Cc: Paolo Bonzini Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. 
Peter Anvin" Acked-by: Nathan Chancellor Link: https://patch.msgid.link/20260211102928.100944-2-ubizjak@gmail.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx_ops.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx/vmx_ops.h b/arch/x86/kvm/vmx/vmx_ops.h index 1000d37f5b0c..81784befaaf4 100644 --- a/arch/x86/kvm/vmx/vmx_ops.h +++ b/arch/x86/kvm/vmx/vmx_ops.h @@ -221,7 +221,7 @@ fault: \ static __always_inline void __vmcs_writel(unsigned long field, unsigned long value) { - vmx_asm2(vmwrite, "r"(field), "rm"(value), field, value); + vmx_asm2(vmwrite, "r" (field), ASM_INPUT_RM (value), field, value); } static __always_inline void vmcs_write16(unsigned long field, u16 value) From e63fb1379f4b9300a44739964e69549bebbcdca4 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 10 Feb 2026 01:08:06 +0000 Subject: [PATCH 012/373] KVM: nSVM: Mark all of vmcb02 dirty when restoring nested state When restoring a vCPU in guest mode, any state restored before KVM_SET_NESTED_STATE (e.g. KVM_SET_SREGS) will mark the corresponding dirty bits in vmcb01, as it is the active VMCB before switching to vmcb02 in svm_set_nested_state(). Hence, mark all fields in vmcb02 dirty in svm_set_nested_state() to capture any previously restored fields. 
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260210010806.3204289-1-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 99f8b8de8159..d5a8f5608f2d 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1909,6 +1909,12 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, svm_switch_vmcb(svm, &svm->nested.vmcb02); nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip, svm->vmcb->save.cs.base); + /* + * Any previously restored state (e.g. KVM_SET_SREGS) would mark fields + * dirty in vmcb01 instead of vmcb02, so mark all of vmcb02 dirty here. + */ + vmcb_mark_all_dirty(svm->vmcb); + /* * While the nested guest CR3 is already checked and set by * KVM_SET_SREGS, it was set when nested state was yet loaded, From 5a6b189317501169b0510f2f1256cfc0c6ca81c7 Mon Sep 17 00:00:00 2001 From: Li RongQing Date: Mon, 2 Feb 2026 04:50:04 -0500 Subject: [PATCH 013/373] KVM: SVM: Mark module parameters as __ro_after_init for security and performance SVM module parameters such as avic, sev_enabled, npt_enabled, and pause_filter_thresh are configured exclusively during initialization (via kernel command line) and remain constant throughout runtime. Additionally, sev_supported_vmsa_features and svm_gp_erratum_intercept, while not exposed as module parameters, share the same initialization pattern and runtime constancy. Mark these variables with '__ro_after_init' to: - Harden against accidental or malicious runtime modification - Enable compiler and CPU optimizations (improved caching, branch prediction) - Align with kernel security best practices for init-only configuration The exception is 'iopm_base', which retains '__read_mostly' as it requires updates during module unloading. 
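__ro_after_init places a variable in a section the kernel write-protects once boot-time init completes. A rough userspace analogue of the same idea, sealing a "parameter" read-only after setup with mprotect() (illustration only; this is not how the kernel implements the section):

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate a page-aligned "module parameter" so it can be protected. */
static int *alloc_param(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p = NULL;

	if (posix_memalign(&p, (size_t)page, (size_t)page) != 0)
		return NULL;
	return p;
}

/* After this, any store to *param faults, much like __ro_after_init. */
static int seal_param(int *param)
{
	long page = sysconf(_SC_PAGESIZE);

	return mprotect(param, (size_t)page, PROT_READ);
}
```

Reads keep working after sealing; a stray write takes a SIGSEGV instead of silently corrupting configuration, which is the hardening property the commit message cites.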
Suggested-by: Sean Christopherson Signed-off-by: Li RongQing Link: https://patch.msgid.link/20260202095004.1765-1-lirongqing@baidu.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/avic.c | 4 ++-- arch/x86/kvm/svm/sev.c | 8 ++++---- arch/x86/kvm/svm/svm.c | 32 ++++++++++++++++---------------- 3 files changed, 22 insertions(+), 22 deletions(-) diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index f92214b1a938..8c2bc98fed2b 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -86,13 +86,13 @@ static const struct kernel_param_ops avic_ops = { * Enable / disable AVIC. In "auto" mode (default behavior), AVIC is enabled * for Zen4+ CPUs with x2AVIC (and all other criteria for enablement are met). */ -static int avic = AVIC_AUTO_MODE; +static int __ro_after_init avic = AVIC_AUTO_MODE; module_param_cb(avic, &avic_ops, &avic, 0444); __MODULE_PARM_TYPE(avic, "bool"); module_param(enable_ipiv, bool, 0444); -static bool force_avic; +static bool __ro_after_init force_avic; module_param_unsafe(force_avic, bool, 0444); /* Note: diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 3f9c1aa39a0a..77ebc166abfd 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -52,18 +52,18 @@ #define SNP_GUEST_VMM_ERR_GENERIC (~0U) /* enable/disable SEV support */ -static bool sev_enabled = true; +static bool __ro_after_init sev_enabled = true; module_param_named(sev, sev_enabled, bool, 0444); /* enable/disable SEV-ES support */ -static bool sev_es_enabled = true; +static bool __ro_after_init sev_es_enabled = true; module_param_named(sev_es, sev_es_enabled, bool, 0444); /* enable/disable SEV-SNP support */ -static bool sev_snp_enabled = true; +static bool __ro_after_init sev_snp_enabled = true; module_param_named(sev_snp, sev_snp_enabled, bool, 0444); -static unsigned int nr_ciphertext_hiding_asids; +static unsigned int __ro_after_init nr_ciphertext_hiding_asids; module_param_named(ciphertext_hiding_asids, 
nr_ciphertext_hiding_asids, uint, 0444); #define AP_RESET_HOLD_NONE 0 diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 8f8bc863e214..936f7652d1e4 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -110,52 +110,52 @@ static DEFINE_PER_CPU(u64, current_tsc_ratio); * count only mode. */ -static unsigned short pause_filter_thresh = KVM_DEFAULT_PLE_GAP; +static unsigned short __ro_after_init pause_filter_thresh = KVM_DEFAULT_PLE_GAP; module_param(pause_filter_thresh, ushort, 0444); -static unsigned short pause_filter_count = KVM_SVM_DEFAULT_PLE_WINDOW; +static unsigned short __ro_after_init pause_filter_count = KVM_SVM_DEFAULT_PLE_WINDOW; module_param(pause_filter_count, ushort, 0444); /* Default doubles per-vcpu window every exit. */ -static unsigned short pause_filter_count_grow = KVM_DEFAULT_PLE_WINDOW_GROW; +static unsigned short __ro_after_init pause_filter_count_grow = KVM_DEFAULT_PLE_WINDOW_GROW; module_param(pause_filter_count_grow, ushort, 0444); /* Default resets per-vcpu window every exit to pause_filter_count. */ -static unsigned short pause_filter_count_shrink = KVM_DEFAULT_PLE_WINDOW_SHRINK; +static unsigned short __ro_after_init pause_filter_count_shrink = KVM_DEFAULT_PLE_WINDOW_SHRINK; module_param(pause_filter_count_shrink, ushort, 0444); /* Default is to compute the maximum so we can never overflow. */ -static unsigned short pause_filter_count_max = KVM_SVM_DEFAULT_PLE_WINDOW_MAX; +static unsigned short __ro_after_init pause_filter_count_max = KVM_SVM_DEFAULT_PLE_WINDOW_MAX; module_param(pause_filter_count_max, ushort, 0444); /* * Use nested page tables by default. Note, NPT may get forced off by * svm_hardware_setup() if it's unsupported by hardware or the host kernel. 
*/ -bool npt_enabled = true; +bool __ro_after_init npt_enabled = true; module_param_named(npt, npt_enabled, bool, 0444); /* allow nested virtualization in KVM/SVM */ -static int nested = true; +static int __ro_after_init nested = true; module_param(nested, int, 0444); /* enable/disable Next RIP Save */ -int nrips = true; +int __ro_after_init nrips = true; module_param(nrips, int, 0444); /* enable/disable Virtual VMLOAD VMSAVE */ -static int vls = true; +static int __ro_after_init vls = true; module_param(vls, int, 0444); /* enable/disable Virtual GIF */ -int vgif = true; +int __ro_after_init vgif = true; module_param(vgif, int, 0444); /* enable/disable LBR virtualization */ -int lbrv = true; +int __ro_after_init lbrv = true; module_param(lbrv, int, 0444); -static int tsc_scaling = true; +static int __ro_after_init tsc_scaling = true; module_param(tsc_scaling, int, 0444); module_param(enable_device_posted_irqs, bool, 0444); @@ -164,19 +164,19 @@ bool __read_mostly dump_invalid_vmcb; module_param(dump_invalid_vmcb, bool, 0644); -bool intercept_smi = true; +bool __ro_after_init intercept_smi = true; module_param(intercept_smi, bool, 0444); -bool vnmi = true; +bool __ro_after_init vnmi = true; module_param(vnmi, bool, 0444); module_param(enable_mediated_pmu, bool, 0444); -static bool svm_gp_erratum_intercept = true; +static bool __ro_after_init svm_gp_erratum_intercept = true; static u8 rsm_ins_bytes[] = "\x0f\xaa"; -static unsigned long iopm_base; +static unsigned long __read_mostly iopm_base; DEFINE_PER_CPU(struct svm_cpu_data, svm_data);
From 6dad5447c7bfca26b5d72604b5378dca6dc58bbc Mon Sep 17 00:00:00 2001 From: Ackerley Tng Date: Thu, 29 Jan 2026 09:26:46 -0800 Subject: [PATCH 015/373] KVM: guest_memfd: Don't set FGP_ACCESSED when getting folios guest_memfd folios don't care about accessed flags since the memory is unevictable and there is no storage to write back to; hence, clean up the allocation path by not setting FGP_ACCESSED.
Signed-off-by: Ackerley Tng [sean: split to separate patch] Acked-by: Vlastimil Babka Acked-by: David Hildenbrand (arm) Link: https://patch.msgid.link/20260129172646.2361462-1-ackerleytng@google.com Signed-off-by: Sean Christopherson --- virt/kvm/guest_memfd.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 017d84a7adf3..462c5c5cb602 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -126,14 +126,13 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index) * Fast-path: See if folio is already present in mapping to avoid * policy_lookup. */ - folio = __filemap_get_folio(inode->i_mapping, index, - FGP_LOCK | FGP_ACCESSED, 0); + folio = filemap_lock_folio(inode->i_mapping, index); if (!IS_ERR(folio)) return folio; policy = mpol_shared_policy_lookup(&GMEM_I(inode)->policy, index); folio = __filemap_get_folio_mpol(inode->i_mapping, index, - FGP_LOCK | FGP_ACCESSED | FGP_CREAT, + FGP_LOCK | FGP_CREAT, mapping_gfp_mask(inode->i_mapping), policy); mpol_cond_put(policy); From 7b402ec851cb66e73ee35913c7d802bba820086b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 23 Jan 2026 14:45:11 -0800 Subject: [PATCH 016/373] KVM: SVM: Fix clearing IRQ window inhibit with nested guests Clearing IRQ window inhibit today relies on interrupt window interception, but that is not always reachable when nested guests are involved. If L1 is intercepting IRQs, then interrupt_window_interception() will never be reached while L2 is active, because the only reason KVM would set the V_IRQ intercept in vmcb02 would be on behalf of L1, i.e. because of vmcb12. svm_clear_vintr() always operates on (at least) vmcb01, and VMRUN unconditionally sets GIF=1, which means that enter_svm_guest_mode() will always do svm_clear_vintr() via svm_set_gif(svm, true). I.e. KVM will keep the VM-wide inhibit set until control transfers back to L1 *and* an interrupt window is triggered. 
If L1 is not intercepting IRQs, KVM may immediately inject L1's ExtINT into L2 if IRQs are enabled in L2 without taking an interrupt window interception. Address this by clearing the IRQ window inhibit when KVM actually injects an interrupt and there are no further injectable interrupts. That way, if L1 isn't intercepting IRQs, KVM will drop the inhibit as soon as an interrupt is injected into L2. And if L1 is intercepting IRQs, KVM will keep the inhibit until the IRQ is injected into L2. So, AVIC won't be left inhibited. Note, somewhat blindly invoking kvm_clear_apicv_inhibit() is both wrong and suboptimal. If the IRQWIN inhibit isn't set, then the vCPU will unnecessarily take apicv_update_lock for write. And if a _different_ vCPU has an injectable IRQ, clearing IRQWIN may block that vCPU's ability to inject its IRQ. Defer fixing both issues to a future commit, as fixing one problem without also fixing the other would also leave KVM in a temporarily bad state, as would fixing both issues without fixing _this_ bug. I.e. it's not feasible to fix each bug independently without there being some remaining flaw in KVM. Co-developed-by: Naveen N Rao (AMD) Signed-off-by: Naveen N Rao (AMD) Tested-by: Naveen N Rao (AMD) Link: https://patch.msgid.link/20260123224514.2509129-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 936f7652d1e4..8766fd5f6d2b 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3130,20 +3130,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu) kvm_make_request(KVM_REQ_EVENT, vcpu); svm_clear_vintr(to_svm(vcpu)); - /* - * If not running nested, for AVIC, the only reason to end up here is ExtINTs. - * In this case AVIC was temporarily disabled for - * requesting the IRQ window and we have to re-enable it. 
- * - * If running nested, still remove the VM wide AVIC inhibit to - * support case in which the interrupt window was requested when the - * vCPU was not running nested. - - * All vCPUs which run still run nested, will remain to have their - * AVIC still inhibited due to per-cpu AVIC inhibition. - */ - kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN); - ++vcpu->stat.irq_window_exits; return 1; } @@ -3732,6 +3718,20 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected) type = SVM_EVTINJ_TYPE_INTR; } + /* + * If AVIC was inhibited in order to detect an IRQ window, and there's + * no other injectable interrupts pending or L2 is active (see below), + * then drop the inhibit as the window has served its purpose. + * + * If L2 is active, this path is reachable if L1 is not intercepting + * IRQs, i.e. if KVM is injecting L1 IRQs into L2. AVIC is locally + * inhibited while L2 is active; drop the VM-wide inhibit to optimize + * the case in which the interrupt window was requested while L1 was + * active (the vCPU was not running nested). + */ + if (!kvm_cpu_has_injectable_intr(vcpu) || is_guest_mode(vcpu)) + kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN); + trace_kvm_inj_virq(intr->nr, intr->soft, reinjected); ++vcpu->stat.irq_injections; From 6563ddadd169cc6f509a75b3ff8354309dcb9080 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 23 Jan 2026 14:45:12 -0800 Subject: [PATCH 017/373] KVM: SVM: Fix IRQ window inhibit handling across multiple vCPUs IRQ window inhibits can be requested by multiple vCPUs at the same time for injecting interrupts meant for different vCPUs. However, AVIC inhibition is VM-wide and hence it is possible for the inhibition to be cleared prematurely by the first vCPU that obtains the IRQ window even though a second vCPU is still waiting for its IRQ window. 
This is likely not a functional issue since the other vCPU will again see that interrupts are pending to be injected (due to KVM_REQ_EVENT), and will again request IRQ window inhibition. However, this can result in AVIC being rapidly toggled, resulting in high contention on apicv_update_lock and degraded performance of the guest. Address this by maintaining a VM-wide count of the number of vCPUs that have requested an IRQ window. Set/clear the inhibit reason when the count transitions between 0 and 1. This ensures that the inhibit reason is not cleared as long as there are some vCPUs still waiting for an IRQ window. Co-developed-by: Paolo Bonzini Signed-off-by: Paolo Bonzini Co-developed-by: Naveen N Rao (AMD) Signed-off-by: Naveen N Rao (AMD) Tested-by: Naveen N Rao (AMD) Link: https://patch.msgid.link/20260123224514.2509129-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 19 ++++++++++++++++- arch/x86/kvm/svm/svm.c | 36 +++++++++++++++++++++++---------- arch/x86/kvm/svm/svm.h | 1 + arch/x86/kvm/x86.c | 19 +++++++++++++++++ 4 files changed, 63 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ff07c45e3c73..68db00dc09a0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1433,6 +1433,7 @@ struct kvm_arch { struct kvm_pit *vpit; #endif atomic_t vapics_in_nmi_mode; + struct mutex apic_map_lock; struct kvm_apic_map __rcu *apic_map; atomic_t apic_map_dirty; @@ -1440,9 +1441,13 @@ struct kvm_arch { bool apic_access_memslot_enabled; bool apic_access_memslot_inhibited; - /* Protects apicv_inhibit_reasons */ + /* + * Protects apicv_inhibit_reasons and apicv_nr_irq_window_req (with an + * asterisk, see kvm_inc_or_dec_irq_window_inhibit() for details).
+ */ struct rw_semaphore apicv_update_lock; unsigned long apicv_inhibit_reasons; + atomic_t apicv_nr_irq_window_req; gpa_t wall_clock; @@ -2316,6 +2321,18 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm, kvm_set_or_clear_apicv_inhibit(kvm, reason, false); } +void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc); + +static inline void kvm_inc_apicv_irq_window_req(struct kvm *kvm) +{ + kvm_inc_or_dec_irq_window_inhibit(kvm, true); +} + +static inline void kvm_dec_apicv_irq_window_req(struct kvm *kvm) +{ + kvm_inc_or_dec_irq_window_inhibit(kvm, false); +} + int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, void *insn, int insn_len); void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 8766fd5f6d2b..e0da247ee594 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3729,8 +3729,11 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected) * the case in which the interrupt window was requested while L1 was * active (the vCPU was not running nested). */ - if (!kvm_cpu_has_injectable_intr(vcpu) || is_guest_mode(vcpu)) - kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN); + if (svm->avic_irq_window && + (!kvm_cpu_has_injectable_intr(vcpu) || is_guest_mode(vcpu))) { + svm->avic_irq_window = false; + kvm_dec_apicv_irq_window_req(svm->vcpu.kvm); + } trace_kvm_inj_virq(intr->nr, intr->soft, reinjected); ++vcpu->stat.irq_injections; @@ -3932,17 +3935,28 @@ static void svm_enable_irq_window(struct kvm_vcpu *vcpu) */ if (vgif || gif_set(svm)) { /* - * IRQ window is not needed when AVIC is enabled, - * unless we have pending ExtINT since it cannot be injected - * via AVIC. In such case, KVM needs to temporarily disable AVIC, - * and fallback to injecting IRQ via V_IRQ. 
+ * KVM only enables IRQ windows when AVIC is enabled if there's + * pending ExtINT since it cannot be injected via AVIC (ExtINT + * bypasses the local APIC). V_IRQ is ignored by hardware when + * AVIC is enabled, and so KVM needs to temporarily disable + * AVIC in order to detect when it's ok to inject the ExtINT. * - * If running nested, AVIC is already locally inhibited - * on this vCPU, therefore there is no need to request - * the VM wide AVIC inhibition. + * If running nested, AVIC is already locally inhibited on this + * vCPU (L2 vCPUs use a different MMU that never maps the AVIC + * backing page), therefore there is no need to increment the + * VM-wide AVIC inhibit. KVM will re-evaluate events when the + * vCPU exits to L1 and enable an IRQ window if the ExtINT is + * still pending. + * + * Note, the IRQ window inhibit needs to be updated even if + * AVIC is inhibited for a different reason, as KVM needs to + * keep AVIC inhibited if the other reason is cleared and there + * is still an injectable interrupt pending. */ - if (!is_guest_mode(vcpu)) - kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN); + if (enable_apicv && !svm->avic_irq_window && !is_guest_mode(vcpu)) { + svm->avic_irq_window = true; + kvm_inc_apicv_irq_window_req(vcpu->kvm); + } svm_set_vintr(svm); } diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index ebd7b36b1ceb..68675b25ef8e 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -333,6 +333,7 @@ struct vcpu_svm { bool guest_state_loaded; + bool avic_irq_window; bool x2avic_msrs_intercepted; bool lbr_msrs_intercepted; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a03530795707..db25938b6b50 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -11014,6 +11014,25 @@ void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm, } EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_set_or_clear_apicv_inhibit); +void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc) +{ + int add = inc ? 
1 : -1; + + if (!enable_apicv) + return; + + /* + * Strictly speaking, the lock is only needed if going 0->1 or 1->0, + * a la atomic_dec_and_mutex_lock. However, ExtINTs are rare and + * only target a single CPU, so that is the common case; do not + * bother eliding the down_write()/up_write() pair. + */ + guard(rwsem_write)(&kvm->arch.apicv_update_lock); + if (atomic_add_return(add, &kvm->arch.apicv_nr_irq_window_req) == inc) + __kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN, inc); +} +EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_inc_or_dec_irq_window_inhibit); + static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu) { if (!kvm_apic_present(vcpu)) From 5617dddcfa30129562d7028ec766797d8c345f36 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 23 Jan 2026 14:45:13 -0800 Subject: [PATCH 018/373] KVM: SVM: Optimize IRQ window inhibit handling IRQ windows represent times during which an IRQ can be injected into a vCPU, and thus represent times when a vCPU is running with RFLAGS.IF=1 and GIF enabled (TPR/PPR don't matter since KVM controls interrupt injection and it only injects one interrupt at a time). On SVM, when emulating the local APIC (i.e., AVIC disabled), KVM detects IRQ windows by injecting a dummy virtual interrupt through VMCB.V_IRQ and intercepting virtual interrupts (INTERCEPT_VINTR). This intercept triggers as soon as the guest enables interrupts and is about to take the dummy interrupt, at which point the actual interrupt can be injected through VMCB.EVENTINJ. When AVIC is enabled, VMCB.V_IRQ is ignored by the hardware and so detecting IRQ windows requires AVIC to be inhibited. However, this is only necessary for ExtINTs since all other interrupts can be injected either by directly setting IRR in the APIC backing page and letting the AVIC hardware inject the interrupt into the guest, or via VMCB.V_NMI for NMIs. 
If AVIC is enabled but inhibited for some other reason, KVM has to request IRQ window inhibits every time it has to inject an interrupt into the guest. This is because APICv inhibits are dynamic in nature, so KVM has to be sure that AVIC is inhibited for purposes of discovering an IRQ window even if the other inhibit is cleared in the meantime. This is particularly problematic with APICV_INHIBIT_REASON_PIT_REINJ, which stays set throughout the life of the guest and results in KVM rapidly toggling the IRQ window inhibit, causing contention on apicv_update_lock. Address this by setting and clearing APICV_INHIBIT_REASON_IRQWIN lazily: if some other inhibit reason is already set, just increment the IRQ window request count and do not update apicv_inhibit_reasons immediately. If any other inhibit reason is set/cleared in the meantime, re-evaluate APICV_INHIBIT_REASON_IRQWIN by checking the IRQ window request count and update apicv_inhibit_reasons appropriately. Otherwise, just the IRQ window request count is incremented/decremented each time an IRQ window is requested. This reduces much of the contention on the apicv_update_lock semaphore and does away with much of the performance degradation.
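The count-based scheme from this and the previous patch can be sketched in plain C: a shared counter tracks outstanding IRQ-window requests, and the inhibit bit itself is only toggled on the 0->1 and 1->0 transitions. A single-threaded sketch of the idea (names are illustrative; in KVM this runs under apicv_update_lock taken for write, which is elided here):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int nr_irq_window_req;
static bool irqwin_inhibited;	/* stands in for the IRQWIN inhibit bit */

/*
 * Mirrors the shape of kvm_inc_or_dec_irq_window_inhibit(): only the
 * 0->1 and 1->0 transitions of the request count change the inhibit,
 * so intermediate requesters never touch the contended state.
 */
static void irq_window_inc_or_dec(bool inc)
{
	int add = inc ? 1 : -1;

	/* atomic_fetch_add() returns the old value; old + add is the new one. */
	if (atomic_fetch_add(&nr_irq_window_req, add) + add == (inc ? 1 : 0))
		irqwin_inhibited = inc;
}
```

With two vCPUs requesting windows, the first request sets the inhibit and the last release clears it; the inner inc/dec pairs are counter-only, which is exactly what keeps the inhibit from being dropped while another vCPU is still waiting.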
Co-developed-by: Paolo Bonzini Signed-off-by: Paolo Bonzini Co-developed-by: Naveen N Rao (AMD) Signed-off-by: Naveen N Rao (AMD) Tested-by: Naveen N Rao (AMD) Link: https://patch.msgid.link/20260123224514.2509129-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index db25938b6b50..193e724e7f48 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10973,7 +10973,11 @@ void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm, old = new = kvm->arch.apicv_inhibit_reasons; - set_or_clear_apicv_inhibit(&new, reason, set); + if (reason != APICV_INHIBIT_REASON_IRQWIN) + set_or_clear_apicv_inhibit(&new, reason, set); + + set_or_clear_apicv_inhibit(&new, APICV_INHIBIT_REASON_IRQWIN, + atomic_read(&kvm->arch.apicv_nr_irq_window_req)); if (!!old != !!new) { /* @@ -11021,6 +11025,26 @@ void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc) if (!enable_apicv) return; + /* + * IRQ windows are requested either because of ExtINT injections, or + * because APICv is already disabled/inhibited for another reason. + * While ExtINT injections are rare and should not happen while the + * vCPU is running its actual workload, it's worth avoiding thrashing + * if the IRQ window is being requested because APICv is already + * inhibited. So, toggle the actual inhibit (which requires taking + * the lock for write) if and only if there's no other inhibit. + * kvm_set_or_clear_apicv_inhibit() always evaluates the IRQ window + * count; thus the IRQ window inhibit call _will_ be lazily updated on + * the next call, if it ever happens. 
+ */ + if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) { + guard(rwsem_read)(&kvm->arch.apicv_update_lock); + if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) { + atomic_add(add, &kvm->arch.apicv_nr_irq_window_req); + return; + } + } + /* * Strictly speaking, the lock is only needed if going 0->1 or 1->0, * a la atomic_dec_and_mutex_lock. However, ExtINTs are rare and From fa78a514d632ed2428b7c573108d9658c00d536e Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 23 Jan 2026 14:45:14 -0800 Subject: [PATCH 019/373] KVM: Isolate apicv_update_lock and apicv_nr_irq_window_req in a cacheline Force apicv_update_lock and apicv_nr_irq_window_req to reside in their own cacheline to avoid generating significant contention due to false sharing when KVM is constantly creating IRQ windows. E.g. apicv_inhibit_reasons is read on every VM-Enter; disabled_exits is read on page faults, on PAUSE exits, if a vCPU is scheduled out, etc.; kvmclock_offset is read every time a vCPU needs to refresh kvmclock, and so on and so forth. Isolating the write-mostly fields from all other (read-mostly) fields improves performance by 7-8% when running netperf TCP_RR between two guests on the same physical host when using an in-kernel PIT in re-inject mode.
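The cacheline split can be expressed directly in C: over-aligning the write-mostly members pushes them onto their own 64-byte line, so contended stores to them no longer evict the line holding the read-mostly fields. A standalone sketch under the assumption of a 64-byte x86 cacheline (field names borrowed from the description; the struct and placeholder types are illustrative):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed x86 cacheline size */

struct kvm_arch_sketch {
	/* Read-mostly: checked on every VM-Enter, page fault, etc. */
	unsigned long disabled_exits;
	unsigned long kvmclock_offset;

	/* Write-mostly: isolated so contended stores stay on one line. */
	alignas(CACHELINE) int apicv_update_lock;	/* placeholder for the rwsem */
	int apicv_nr_irq_window_req;

	/* Read-mostly again, starting on a fresh line. */
	alignas(CACHELINE) unsigned long apicv_inhibit_reasons;
};
```

The alignment directives are the portable C11 spelling of what the kernel's ____cacheline_aligned annotation does in the diff above.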
Reported-by: Naveen N Rao (AMD) Closes: https://lore.kernel.org/all/yrxhngndj37edud6tj5y3vunaf7nirwor4n63yf4275wdocnd3@c77ujgialc6r Tested-by: Naveen N Rao (AMD) Link: https://patch.msgid.link/20260123224514.2509129-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 68db00dc09a0..883b85b3e1de 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1441,13 +1441,23 @@ struct kvm_arch { bool apic_access_memslot_enabled; bool apic_access_memslot_inhibited; + /* + * Force apicv_update_lock and apicv_nr_irq_window_req to reside in a + * dedicated cacheline. They are write-mostly, whereas most everything + * else in kvm_arch is read-mostly. Note that apicv_inhibit_reasons is + * read-mostly: toggling VM-wide inhibits is rare; _checking_ for + * inhibits is common. + */ + ____cacheline_aligned /* * Protects apicv_inhibit_reasons and apicv_nr_irq_window_req (with an * asterisk, see kvm_inc_or_dec_irq_window_inhibit() for details). */ struct rw_semaphore apicv_update_lock; - unsigned long apicv_inhibit_reasons; atomic_t apicv_nr_irq_window_req; + ____cacheline_aligned + + unsigned long apicv_inhibit_reasons; gpa_t wall_clock; From e907b4e72488f1df878e7e8acf88d23e49cb3ca7 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Fri, 27 Feb 2026 01:13:06 +0000 Subject: [PATCH 020/373] KVM: x86: Check for injected exceptions before queuing a debug exception On KVM_SET_GUEST_DEBUG, if a #DB or #BP is injected with KVM_GUESTDBG_INJECT_DB or KVM_GUESTDBG_INJECT_BP, KVM fails with -EBUSY if there is an existing pending exception. This was introduced in commit 4f926bf29186 ("KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG") to avoid a warning in kvm_queue_exception(), presumably to avoid overriding a pending exception. 
This added another (arguably nice) property: if there's a pending exception, KVM_SET_GUEST_DEBUG cannot cause a #DF or triple fault. However, if an exception is injected, KVM_SET_GUEST_DEBUG will cause a #DF or triple fault in the guest, as kvm_multiple_exception() combines them. Check for both pending and injected exceptions for KVM_GUESTDBG_INJECT_DB and KVM_GUESTDBG_INJECT_BP, to avoid accidentally causing a #DF or triple fault. Signed-off-by: Yosry Ahmed base-commit: a68a4bbc5b9ce5b722473399f05cb05217abaee8 Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a03530795707..658476815b6a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12529,7 +12529,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) { r = -EBUSY; - if (kvm_is_exception_pending(vcpu)) + if (kvm_is_exception_pending(vcpu) || vcpu->arch.exception.injected) goto out; if (dbg->control & KVM_GUESTDBG_INJECT_DB) kvm_queue_exception(vcpu, DB_VECTOR); From 24f7d36b824b65cf1a2db3db478059187b2a37b0 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 24 Feb 2026 22:50:17 +0000 Subject: [PATCH 021/373] KVM: nSVM: Ensure AVIC is inhibited when restoring a vCPU to guest mode On nested VMRUN, KVM ensures AVIC is inhibited by requesting KVM_REQ_APICV_UPDATE, triggering a check of inhibit reasons, finding APICV_INHIBIT_REASON_NESTED, and disabling AVIC. However, when KVM_SET_NESTED_STATE is performed on a vCPU not in guest mode with AVIC enabled, KVM_REQ_APICV_UPDATE is not requested, and AVIC is not inhibited. Request KVM_REQ_APICV_UPDATE in the KVM_SET_NESTED_STATE path if AVIC is active, similar to the nested VMRUN path.
Fixes: f44509f849fe ("KVM: x86: SVM: allow AVIC to co-exist with a nested guest running") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260224225017.3303870-1-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index d5a8f5608f2d..3667f8ba5268 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1928,6 +1928,9 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, svm->nested.force_msr_bitmap_recalc = true; + if (kvm_vcpu_apicv_active(vcpu)) + kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu); + kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); ret = 0; out_free: From 778d8c1b2a6ffe622ddcd3bb35b620e6e41f4da0 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:43 +0000 Subject: [PATCH 022/373] KVM: nSVM: Sync NextRIP to cached vmcb12 after VMRUN of L2 After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs fields written by the CPU from vmcb02 to the cached vmcb12. This is because the cached vmcb12 is used as the authoritative copy of some of the controls, and is the payload when saving/restoring nested state. NextRIP is also written by the CPU (in some cases) after VMRUN, but is not sync'd to the cached vmcb12. As a result, it is corrupted after save/restore (replaced by the original value written by L1 on nested VMRUN). This could cause problems for both KVM (e.g. when injecting a soft IRQ) and L1 (e.g. when using NextRIP to advance RIP after emulating an instruction). Fix this by sync'ing NextRIP to the cache after VMRUN of L2, but only after completing interrupts (not in nested_sync_control_from_vmcb02()), as KVM may update NextRIP (e.g. when re-injecting a soft IRQ).
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-2-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 8f8bc863e214..07f096758f34 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4435,6 +4435,16 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) svm_complete_interrupts(vcpu); + /* + * Update the cache after completing interrupts to get an accurate + * NextRIP, e.g. when re-injecting a soft interrupt. + * + * FIXME: Rework svm_get_nested_state() to not pull data from the + * cache (except for maybe int_ctl). + */ + if (is_guest_mode(vcpu)) + svm->nested.ctl.next_rip = svm->vmcb->control.next_rip; + return svm_exit_handlers_fastpath(vcpu); } From 03bee264f8ebfd39e0254c98e112d033a7aa9055 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:44 +0000 Subject: [PATCH 023/373] KVM: nSVM: Sync interrupt shadow to cached vmcb12 after VMRUN of L2 After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs fields written by the CPU from vmcb02 to the cached vmcb12. This is because the cached vmcb12 is used as the authoritative copy of some of the controls, and is the payload when saving/restoring nested state. int_state is also written by the CPU, specifically bit 0 (i.e. SVM_INTERRUPT_SHADOW_MASK) for nested VMs, but it is not sync'd to the cached vmcb12. This does not cause a problem if KVM_SET_NESTED_STATE precedes KVM_SET_VCPU_EVENTS in the restore path, as an interrupt shadow would be correctly restored to vmcb02 (KVM_SET_VCPU_EVENTS overwrites what KVM_SET_NESTED_STATE restored in int_state).
However, if KVM_SET_VCPU_EVENTS precedes KVM_SET_NESTED_STATE, an interrupt shadow would be restored into vmcb01 instead of vmcb02. This would mostly be benign for L1 (delays an interrupt), but not for L2. For L2, the vCPU could hang (e.g. if a wakeup interrupt is delivered before a HLT that should have been in an interrupt shadow). Sync int_state to the cached vmcb12 in nested_sync_control_from_vmcb02() to avoid this problem. With that, KVM_SET_NESTED_STATE restores the correct interrupt shadow state, and if KVM_SET_VCPU_EVENTS follows, it would overwrite it with the same value. Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-3-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 3667f8ba5268..2308e40691c4 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -521,6 +521,7 @@ void nested_sync_control_from_vmcb02(struct vcpu_svm *svm) u32 mask; svm->nested.ctl.event_inj = svm->vmcb->control.event_inj; svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err; + svm->nested.ctl.int_state = svm->vmcb->control.int_state; /* Only a few fields of int_ctl are written by the processor. */ mask = V_IRQ_MASK | V_TPR_MASK; From 2303ca26fbb005a45aaf5a547465f978df906cb7 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:45 +0000 Subject: [PATCH 024/373] KVM: selftests: Extend state_test to check vGIF V_GIF_MASK is one of the fields written by the CPU after VMRUN, and sync'd by KVM from vmcb02 to the cached vmcb12 after running L2. Part of the reason is to make sure V_GIF_MASK is saved/restored correctly, as the cached vmcb12 is the payload of nested state.
Verify that V_GIF_MASK is saved/restored correctly in state_test by enabling vGIF in vmcb12, toggling GIF in L2 at different GUEST_SYNC() points, and verifying that V_GIF_MASK is correctly propagated to the nested state. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-4-yosry@kernel.org Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/x86/state_test.c | 24 ++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/tools/testing/selftests/kvm/x86/state_test.c b/tools/testing/selftests/kvm/x86/state_test.c index f2c7a1c297e3..57c7546f3d7c 100644 --- a/tools/testing/selftests/kvm/x86/state_test.c +++ b/tools/testing/selftests/kvm/x86/state_test.c @@ -26,7 +26,9 @@ void svm_l2_guest_code(void) GUEST_SYNC(4); /* Exit to L1 */ vmcall(); + clgi(); GUEST_SYNC(6); + stgi(); /* Done, exit to L1 and never come back. */ vmcall(); } @@ -41,6 +43,8 @@ static void svm_l1_guest_code(struct svm_test_data *svm) generic_svm_setup(svm, svm_l2_guest_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); + vmcb->control.int_ctl |= (V_GIF_ENABLE_MASK | V_GIF_MASK); + GUEST_SYNC(3); run_guest(vmcb, svm->vmcb_gpa); GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_VMMCALL); @@ -222,6 +226,24 @@ static void __attribute__((__flatten__)) guest_code(void *arg) GUEST_DONE(); } +void svm_check_nested_state(int stage, struct kvm_x86_state *state) +{ + struct vmcb *vmcb = (struct vmcb *)state->nested.data.svm; + + if (kvm_cpu_has(X86_FEATURE_VGIF)) { + if (stage == 4) + TEST_ASSERT_EQ(!!(vmcb->control.int_ctl & V_GIF_MASK), 1); + if (stage == 6) + TEST_ASSERT_EQ(!!(vmcb->control.int_ctl & V_GIF_MASK), 0); + } +} + +void check_nested_state(int stage, struct kvm_x86_state *state) +{ + if (kvm_has_cap(KVM_CAP_NESTED_STATE) && kvm_cpu_has(X86_FEATURE_SVM)) + svm_check_nested_state(stage, state); +} + int main(int argc, char *argv[]) { uint64_t *xstate_bv, saved_xstate_bv; @@ -278,6 +300,8 @@ int main(int argc, char *argv[]) kvm_vm_release(vm); + 
check_nested_state(stage, state); + /* Restore state in a new VM. */ vcpu = vm_recreate_with_one_vcpu(vm); vcpu_load_state(vcpu, state); From e5cdd34b5f74c4a0c72fe43092192f347d999e77 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:46 +0000 Subject: [PATCH 025/373] KVM: selftests: Extend state_test to check next_rip Similar to vGIF, extend state_test to make sure that next_rip is saved correctly in nested state. GUEST_SYNC() in L2 causes IO emulation by KVM, which advances the RIP to the value of next_rip. Hence, if next_rip is saved correctly, its value should match the saved RIP value. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-5-yosry@kernel.org Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/x86/state_test.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/tools/testing/selftests/kvm/x86/state_test.c b/tools/testing/selftests/kvm/x86/state_test.c index 57c7546f3d7c..992a52504a4a 100644 --- a/tools/testing/selftests/kvm/x86/state_test.c +++ b/tools/testing/selftests/kvm/x86/state_test.c @@ -236,6 +236,17 @@ void svm_check_nested_state(int stage, struct kvm_x86_state *state) if (stage == 6) TEST_ASSERT_EQ(!!(vmcb->control.int_ctl & V_GIF_MASK), 0); } + + if (kvm_cpu_has(X86_FEATURE_NRIPS)) { + /* + * GUEST_SYNC() causes IO emulation in KVM, in which case the + * RIP is advanced before exiting to userspace. Hence, the RIP + * in the saved state should be the same as nRIP saved by the + * CPU in the VMCB. 
+ */ + if (stage == 6) + TEST_ASSERT_EQ(vmcb->control.next_rip, state->regs.rip); + } } void check_nested_state(int stage, struct kvm_x86_state *state) From 690dc03859e7907bc995f389618c748619559477 Mon Sep 17 00:00:00 2001 From: Jim Mattson Date: Tue, 10 Feb 2026 15:45:42 -0800 Subject: [PATCH 026/373] KVM: x86: Ignore cpuid faulting in SMM The Intel Virtualization Technology FlexMigration Application Note says, "When CPUID faulting is enabled, all executions of the CPUID instruction outside system-management mode (SMM) cause a general-protection exception (#GP(0)) if the current privilege level (CPL) is greater than 0." Always allow the execution of CPUID in SMM. Fixes: db2336a80489 ("KVM: x86: virtualize cpuid faulting") Signed-off-by: Jim Mattson Link: https://patch.msgid.link/20260210234613.1383279-1-jmattson@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/cpuid.c | 3 ++- arch/x86/kvm/emulate.c | 6 +++--- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index d2486506a808..baf9a2860d98 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -2157,7 +2157,8 @@ int kvm_emulate_cpuid(struct kvm_vcpu *vcpu) { u32 eax, ebx, ecx, edx; - if (cpuid_fault_enabled(vcpu) && !kvm_require_cpl(vcpu, 0)) + if (!is_smm(vcpu) && cpuid_fault_enabled(vcpu) && + !kvm_require_cpl(vcpu, 0)) return 1; eax = kvm_rax_read(vcpu); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 20ed588015f1..500711c6f069 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3583,10 +3583,10 @@ static int em_cpuid(struct x86_emulate_ctxt *ctxt) u64 msr = 0; ctxt->ops->get_msr(ctxt, MSR_MISC_FEATURES_ENABLES, &msr); - if (msr & MSR_MISC_FEATURES_ENABLES_CPUID_FAULT && - ctxt->ops->cpl(ctxt)) { + if (!ctxt->ops->is_smm(ctxt) && + (msr & MSR_MISC_FEATURES_ENABLES_CPUID_FAULT) && + ctxt->ops->cpl(ctxt)) return emulate_gp(ctxt, 0); - } eax = reg_read(ctxt, VCPU_REGS_RAX); ecx = reg_read(ctxt, 
VCPU_REGS_RCX); From 0b16e69d17d8c35c5c9d5918bf596c75a44655d3 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:36 -0800 Subject: [PATCH 027/373] KVM: x86: Use scratch field in MMIO fragment to hold small write values When exiting to userspace to service an emulated MMIO write, copy the to-be-written value to a scratch field in the MMIO fragment if the size of the data payload is 8 bytes or less, i.e. can fit in a single chunk, instead of pointing the fragment directly at the source value. This fixes a class of use-after-free bugs that occur when the emulator initiates a write using an on-stack, local variable as the source, the write splits a page boundary, *and* both pages are MMIO pages. Because KVM's ABI only allows for physically contiguous MMIO requests, accesses that split MMIO pages are separated into two fragments, and are sent to userspace one at a time. When KVM attempts to complete userspace MMIO in response to KVM_RUN after the first fragment, KVM will detect the second fragment and generate a second userspace exit, and reference the on-stack variable. The issue is most visible if the second KVM_RUN is performed by a separate task, in which case the stack of the initiating task can show up as truly freed data. 
================================================================== BUG: KASAN: use-after-free in complete_emulated_mmio+0x305/0x420 Read of size 1 at addr ffff888009c378d1 by task syz-executor417/984 CPU: 1 PID: 984 Comm: syz-executor417 Not tainted 5.10.0-182.0.0.95.h2627.eulerosv2r13.x86_64 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0xbe/0xfd print_address_description.constprop.0+0x19/0x170 __kasan_report.cold+0x6c/0x84 kasan_report+0x3a/0x50 check_memory_region+0xfd/0x1f0 memcpy+0x20/0x60 complete_emulated_mmio+0x305/0x420 kvm_arch_vcpu_ioctl_run+0x63f/0x6d0 kvm_vcpu_ioctl+0x413/0xb20 __se_sys_ioctl+0x111/0x160 do_syscall_64+0x30/0x40 entry_SYSCALL_64_after_hwframe+0x67/0xd1 RIP: 0033:0x42477d Code: <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007faa8e6890e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00000000004d7338 RCX: 000000000042477d RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000005 RBP: 00000000004d7330 R08: 00007fff28d546df R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004d733c R13: 0000000000000000 R14: 000000000040a200 R15: 00007fff28d54720 The buggy address belongs to the page: page:0000000029f6a428 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x9c37 flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0000000 0000000000000000 ffffea0000270dc8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888009c37780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888009c37800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >ffff888009c37880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff888009c37900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888009c37980: ff ff ff ff ff 
ff ff ff ff ff ff ff ff ff ff ff ================================================================== The bug can also be reproduced with a targeted KVM-Unit-Test by hacking KVM to fill a large on-stack variable in complete_emulated_mmio(), i.e. by overwrite the data value with garbage. Limit the use of the scratch fields to 8-byte or smaller accesses, and to just writes, as larger accesses and reads are not affected thanks to implementation details in the emulator, but add a sanity check to ensure those details don't change in the future. Specifically, KVM never uses on-stack variables for accesses larger that 8 bytes, e.g. uses an operand in the emulator context, and *all* reads are buffered through the mem_read cache. Note! Using the scratch field for reads is not only unnecessary, it's also extremely difficult to handle correctly. As above, KVM buffers all reads through the mem_read cache, and heavily relies on that behavior when re-emulating the instruction after a userspace MMIO read exit. If a read splits a page, the first page is NOT an MMIO page, and the second page IS an MMIO page, then the MMIO fragment needs to point at _just_ the second chunk of the destination, i.e. its position in the mem_read cache. Taking the "obvious" approach of copying the fragment value into the destination when re-emulating the instruction would clobber the first chunk of the destination, i.e. would clobber the data that was read from guest memory. 
Fixes: f78146b0f923 ("KVM: Fix page-crossing MMIO") Suggested-by: Yashu Zhang Reported-by: Yashu Zhang Closes: https://lore.kernel.org/all/369eaaa2b3c1425c85e8477066391bc7@huawei.com Cc: stable@vger.kernel.org Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 14 +++++++++++++- include/linux/kvm_host.h | 3 ++- 2 files changed, 15 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a03530795707..9bb9d7f078fc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8225,7 +8225,13 @@ static int emulator_read_write_onepage(unsigned long addr, void *val, WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS); frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++]; frag->gpa = gpa; - frag->data = val; + if (write && bytes <= 8u) { + frag->val = 0; + frag->data = &frag->val; + memcpy(&frag->val, val, bytes); + } else { + frag->data = val; + } frag->len = bytes; return X86EMUL_CONTINUE; } @@ -8240,6 +8246,9 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt, gpa_t gpa; int rc; + if (WARN_ON_ONCE((bytes > 8u || !ops->write) && object_is_on_stack(val))) + return X86EMUL_UNHANDLEABLE; + if (ops->read_write_prepare && ops->read_write_prepare(vcpu, val, bytes)) return X86EMUL_CONTINUE; @@ -11846,6 +11855,9 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu) frag++; vcpu->mmio_cur_fragment++; } else { + if (WARN_ON_ONCE(frag->data == &frag->val)) + return -EIO; + /* Go forward to the next mmio piece. 
*/ frag->data += len; frag->gpa += len; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 34759a262b28..abb309372035 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -318,7 +318,8 @@ static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop) struct kvm_mmio_fragment { gpa_t gpa; void *data; - unsigned len; + u64 val; + unsigned int len; }; struct kvm_vcpu { From 4046823e78b0c9abd25d23bffd7f1c773532dbfd Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:37 -0800 Subject: [PATCH 028/373] KVM: x86: Open code handling of completed MMIO reads in emulator_read_write() Open code the handling of completed MMIO reads instead of using an ops hook, as burying the logic behind a (likely RETPOLINE'd) indirect call, and with an unintuitive name, makes relatively straightforward code hard to comprehend. Opportunistically add comments to explain the dependencies between the emulator's mem_read cache and the MMIO read completion logic, as it's very easy to overlook the cache's role in getting the read data into the correct destination. No functional change intended. Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/emulate.c | 13 +++++++++++++ arch/x86/kvm/x86.c | 33 ++++++++++++++++----------------- 2 files changed, 29 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c8e292e9a24d..70850e591350 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1297,12 +1297,25 @@ static int read_emulated(struct x86_emulate_ctxt *ctxt, int rc; struct read_cache *mc = &ctxt->mem_read; + /* + * If the read gets a cache hit, simply copy the value from the cache. + * A "hit" here means that there is unused data in the cache, i.e. 
when + * re-emulating an instruction to complete a userspace exit, KVM relies + * on "no decode" to ensure the instruction is re-emulated in the same + * sequence, so that multiple reads are fulfilled in the correct order. + */ if (mc->pos < mc->end) goto read_cached; if (KVM_EMULATOR_BUG_ON((mc->end + size) >= sizeof(mc->data), ctxt)) return X86EMUL_UNHANDLEABLE; + /* + * Route all reads to the cache. This allows @dest to be an on-stack + * variable without triggering use-after-free if KVM needs to exit to + * userspace to handle an MMIO read (the MMIO fragment will point at + * the current location in the cache). + */ rc = ctxt->ops->read_emulated(ctxt, addr, mc->data + mc->end, size, &ctxt->exception); if (rc != X86EMUL_CONTINUE) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9bb9d7f078fc..b0bb9cd9df8a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8108,8 +8108,6 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, } struct read_write_emulator_ops { - int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val, - int bytes); int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa, void *val, int bytes); int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa, @@ -8119,18 +8117,6 @@ struct read_write_emulator_ops { bool write; }; -static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes) -{ - if (vcpu->mmio_read_completed) { - trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes, - vcpu->mmio_fragments[0].gpa, val); - vcpu->mmio_read_completed = 0; - return 1; - } - - return 0; -} - static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, void *val, int bytes) { @@ -8166,7 +8152,6 @@ static int write_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, } static const struct read_write_emulator_ops read_emultor = { - .read_write_prepare = read_prepare, .read_write_emulate = read_emulate, .read_write_mmio = vcpu_mmio_read, .read_write_exit_mmio = read_exit_mmio, @@ -8249,9 +8234,23 @@ static int emulator_read_write(struct 
x86_emulate_ctxt *ctxt, if (WARN_ON_ONCE((bytes > 8u || !ops->write) && object_is_on_stack(val))) return X86EMUL_UNHANDLEABLE; - if (ops->read_write_prepare && - ops->read_write_prepare(vcpu, val, bytes)) + /* + * If the read was already completed via a userspace MMIO exit, there's + * nothing left to do except trace the MMIO read. When completing MMIO + * reads, KVM re-emulates the instruction to propagate the value into + * the correct destination, e.g. into the correct register, but the + * value itself has already been copied to the read cache. + * + * Note! This is *tightly* coupled to read_emulated() satisfying reads + * from the emulator's mem_read cache, so that the MMIO fragment data + * is copied to the correct chunk of the correct operand. + */ + if (!ops->write && vcpu->mmio_read_completed) { + trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes, + vcpu->mmio_fragments[0].gpa, val); + vcpu->mmio_read_completed = 0; return X86EMUL_CONTINUE; + } vcpu->mmio_nr_fragments = 0; From 4f11fded5381eb32ca2e0f1e280c6eb97eff92c8 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:38 -0800 Subject: [PATCH 029/373] KVM: x86: Trace unsatisfied MMIO reads on a per-page basis Invoke the "unsatisfied MMIO reads" tracepoint when KVM first detects that a particular access "chunk" requires an exit to userspace instead of tracing the entire access at the time KVM initiates the exit to userspace. I.e. precisely trace the first and/or second fragments of a page split instead of tracing the entire access, as the GPA could be wrong on a page split case. Leave the completion tracepoint alone, at least for now, as fixing the completion path would incur significant complexity to track exactly which fragment(s) of the overall access actually triggered MMIO, but add a comment that the tracing for completed reads is technically wrong.
Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b0bb9cd9df8a..e2cf5349d593 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7807,6 +7807,9 @@ static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v) v += n; } while (len); + if (len) + trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, len, addr, NULL); + return handled; } @@ -8138,7 +8141,6 @@ static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val) static int read_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, void *val, int bytes) { - trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, bytes, gpa, NULL); return X86EMUL_IO_NEEDED; } @@ -8246,6 +8248,11 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt, * is copied to the correct chunk of the correct operand. */ if (!ops->write && vcpu->mmio_read_completed) { + /* + * For simplicity, trace the entire MMIO read in one shot, even + * though the GPA might be incorrect if there are two fragments + * that aren't contiguous in the GPA space. + */ trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes, vcpu->mmio_fragments[0].gpa, val); vcpu->mmio_read_completed = 0; From 523b6269f700373eba65ad8a0bfaac284a12c167 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:39 -0800 Subject: [PATCH 030/373] KVM: x86: Use local MMIO fragment variable to clean up emulator_read_write() Grab the MMIO fragment used by emulator_read_write() to initiate an exit to userspace in a local variable to make the code easier to read. No functional change intended. 
Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e2cf5349d593..d04f0a0383f3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8230,7 +8230,7 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt, const struct read_write_emulator_ops *ops) { struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); - gpa_t gpa; + struct kvm_mmio_fragment *frag; int rc; if (WARN_ON_ONCE((bytes > 8u || !ops->write) && object_is_on_stack(val))) @@ -8286,17 +8286,16 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt, if (!vcpu->mmio_nr_fragments) return X86EMUL_CONTINUE; - gpa = vcpu->mmio_fragments[0].gpa; - vcpu->mmio_needed = 1; vcpu->mmio_cur_fragment = 0; - vcpu->run->mmio.len = min(8u, vcpu->mmio_fragments[0].len); + frag = &vcpu->mmio_fragments[0]; + vcpu->run->mmio.len = min(8u, frag->len); vcpu->run->mmio.is_write = vcpu->mmio_is_write = ops->write; vcpu->run->exit_reason = KVM_EXIT_MMIO; - vcpu->run->mmio.phys_addr = gpa; + vcpu->run->mmio.phys_addr = frag->gpa; - return ops->read_write_exit_mmio(vcpu, gpa, val, bytes); + return ops->read_write_exit_mmio(vcpu, frag->gpa, val, bytes); } static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt, From cbbf8228c0716d8f96f354f378efd1cbadb428e0 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:40 -0800 Subject: [PATCH 031/373] KVM: x86: Open code read vs. write userspace MMIO exits in emulator_read_write() Open code the differences in read vs. write userspace MMIO exits instead of burying three lines of code behind indirect callbacks, as splitting the logic makes it extremely hard to track that KVM's handling of reads vs. writes is _significantly_ different.
Add a comment to explain why the semantics are different, and how on earth an MMIO write ends up triggering an exit to userspace. No functional change intended. Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-6-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 33 +++++++++++++-------------------- 1 file changed, 13 insertions(+), 20 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d04f0a0383f3..fa2eb4fdc4b4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8115,8 +8115,6 @@ struct read_write_emulator_ops { void *val, int bytes); int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val); - int (*read_write_exit_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa, - void *val, int bytes); bool write; }; @@ -8138,31 +8136,14 @@ static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val) return vcpu_mmio_write(vcpu, gpa, bytes, val); } -static int read_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, - void *val, int bytes) -{ - return X86EMUL_IO_NEEDED; -} - -static int write_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, - void *val, int bytes) -{ - struct kvm_mmio_fragment *frag = &vcpu->mmio_fragments[0]; - - memcpy(vcpu->run->mmio.data, frag->data, min(8u, frag->len)); - return X86EMUL_CONTINUE; -} - static const struct read_write_emulator_ops read_emultor = { .read_write_emulate = read_emulate, .read_write_mmio = vcpu_mmio_read, - .read_write_exit_mmio = read_exit_mmio, }; static const struct read_write_emulator_ops write_emultor = { .read_write_emulate = write_emulate, .read_write_mmio = write_mmio, - .read_write_exit_mmio = write_exit_mmio, .write = true, }; @@ -8295,7 +8276,19 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt, vcpu->run->exit_reason = KVM_EXIT_MMIO; vcpu->run->mmio.phys_addr = frag->gpa; - return ops->read_write_exit_mmio(vcpu, frag->gpa, val, bytes); + /* + * For MMIO reads, stop emulating and 
immediately exit to userspace, as + KVM needs the value to correctly emulate the instruction. For MMIO + writes, continue emulating as the write to MMIO is a side effect for + all intents and purposes. KVM will still exit to userspace, but + after completing emulation (see the check on vcpu->mmio_needed in + x86_emulate_instruction()). + */ + if (!ops->write) + return X86EMUL_IO_NEEDED; + + memcpy(vcpu->run->mmio.data, frag->data, min(8u, frag->len)); + return X86EMUL_CONTINUE; } static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt, From 72f36f99072c3b79451af38274d59ac30cc064c6 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:42 -0800 Subject: [PATCH 032/373] KVM: x86: Move MMIO write tracing into vcpu_mmio_write() Move the invocation of the MMIO write tracepoint into vcpu_mmio_write() and drop its largely-useless wrapper to cull pointless code and to make the code symmetrical with respect to vcpu_mmio_read(). No functional change intended.
Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-7-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa2eb4fdc4b4..0f4cfc3374a6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7768,11 +7768,14 @@ static void kvm_init_msr_lists(void) } static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len, - const void *v) + void *__v) { + const void *v = __v; int handled = 0; int n; + trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, len, addr, __v); + do { n = min(len, 8); if (!(lapic_in_kernel(vcpu) && @@ -8130,12 +8133,6 @@ static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, return emulator_write_phys(vcpu, gpa, val, bytes); } -static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val) -{ - trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, val); - return vcpu_mmio_write(vcpu, gpa, bytes, val); -} - static const struct read_write_emulator_ops read_emultor = { .read_write_emulate = read_emulate, .read_write_mmio = vcpu_mmio_read, @@ -8143,7 +8140,7 @@ static const struct read_write_emulator_ops read_emultor = { static const struct read_write_emulator_ops write_emultor = { .read_write_emulate = write_emulate, - .read_write_mmio = write_mmio, + .read_write_mmio = vcpu_mmio_write, .write = true, }; From 144089f5c3944cf6383d53ab5d941b74924a0989 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:42 -0800 Subject: [PATCH 033/373] KVM: x86: Harden SEV-ES MMIO against on-stack use-after-free Add a sanity check to ensure KVM doesn't use an on-stack variable when handling an MMIO request for an SEV-ES guest. The source/destination for SEV-ES MMIO should _always_ be the #VMGEXIT scratch area. 
Opportunistically update the comment on the completion side of things to clarify that frag->data doesn't need to be copied anywhere, and that VMGEXIT is trap-like (the current comment doesn't clarify *how* RIP is advanced). Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-8-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0f4cfc3374a6..5752ec3fc8f2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -14272,8 +14272,10 @@ static int complete_sev_es_emulated_mmio(struct kvm_vcpu *vcpu) if (vcpu->mmio_cur_fragment >= vcpu->mmio_nr_fragments) { vcpu->mmio_needed = 0; - // VMG change, at this point, we're always done - // RIP has already been advanced + /* + * All done, as frag->data always points at the GHCB scratch + * area and VMGEXIT is trap-like (RIP is advanced by hardware). + */ return 1; } @@ -14296,7 +14298,7 @@ int kvm_sev_es_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes, int handled; struct kvm_mmio_fragment *frag; - if (!data) + if (!data || WARN_ON_ONCE(object_is_on_stack(data))) return -EINVAL; handled = write_emultor.read_write_mmio(vcpu, gpa, bytes, data); @@ -14335,7 +14337,7 @@ int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes, int handled; struct kvm_mmio_fragment *frag; - if (!data) + if (!data || WARN_ON_ONCE(object_is_on_stack(data))) return -EINVAL; handled = read_emultor.read_write_mmio(vcpu, gpa, bytes, data); From 33e09e2f9735fef7255aa96d1fe00782777bc44b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:43 -0800 Subject: [PATCH 034/373] KVM: x86: Dedup kvm_sev_es_mmio_{read,write}() Dedup the SEV-ES emulated MMIO code by using the read vs. write emulator ops to handle the few differences between reads and writes.
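The dedup technique the patch applies — one common helper parameterized by a small ops table that captures the read/write differences — can be modeled in plain C. All `demo_*` names below are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of KVM's read_write_emulator_ops dedup pattern. */
struct demo_rw_ops {
	int (*do_mmio)(unsigned long gpa, int bytes, void *val);
	int write;	/* bool in the kernel */
};

static unsigned char demo_device[16];	/* pretend MMIO-backed device */

static int demo_mmio_read(unsigned long gpa, int bytes, void *val)
{
	memcpy(val, &demo_device[gpa], bytes);
	return bytes;
}

static int demo_mmio_write(unsigned long gpa, int bytes, void *val)
{
	memcpy(&demo_device[gpa], val, bytes);
	return bytes;
}

static const struct demo_rw_ops demo_read_ops  = { demo_mmio_read, 0 };
static const struct demo_rw_ops demo_write_ops = { demo_mmio_write, 1 };

/* One common helper replaces two near-identical read/write paths. */
static int demo_do_mmio(unsigned long gpa, int bytes, void *data,
			const struct demo_rw_ops *ops)
{
	return ops->do_mmio(gpa, bytes, data);
}

static unsigned char demo_read_byte(unsigned long gpa)
{
	unsigned char v = 0;

	demo_do_mmio(gpa, 1, &v, &demo_read_ops);
	return v;
}
```

The payoff is the same as in the patch: all the shared bookkeeping (fragment setup, exit preparation) lives in one place, and only the genuinely direction-specific bits hang off the ops table.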
Opportunistically tweak the comment about fragments to call out that KVM should verify that userspace can actually handle MMIO requests that cross page boundaries. Unlike emulated MMIO, the request is made in the GPA space, not the GVA space, i.e. emulation across page boundaries can work generically, at least in theory. No functional change intended. Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-9-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 58 +++++++++++++++------------------------------- 1 file changed, 19 insertions(+), 39 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5752ec3fc8f2..81e683914072 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -14292,16 +14292,17 @@ static int complete_sev_es_emulated_mmio(struct kvm_vcpu *vcpu) return 0; } -int kvm_sev_es_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes, - void *data) +static int kvm_sev_es_do_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, + unsigned int bytes, void *data, + const struct read_write_emulator_ops *ops) { - int handled; struct kvm_mmio_fragment *frag; + int handled; if (!data || WARN_ON_ONCE(object_is_on_stack(data))) return -EINVAL; - handled = write_emultor.read_write_mmio(vcpu, gpa, bytes, data); + handled = ops->read_write_mmio(vcpu, gpa, bytes, data); if (handled == bytes) return 1; @@ -14309,7 +14310,10 @@ int kvm_sev_es_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes, gpa += handled; data += handled; - /*TODO: Check if need to increment number of frags */ + /* + * TODO: Determine whether or not userspace plays nice with MMIO + * requests that split a page boundary. 
+ */ frag = vcpu->mmio_fragments; vcpu->mmio_nr_fragments = 1; frag->len = bytes; @@ -14321,51 +14325,27 @@ int kvm_sev_es_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes, vcpu->run->mmio.phys_addr = gpa; vcpu->run->mmio.len = min(8u, frag->len); - vcpu->run->mmio.is_write = 1; - memcpy(vcpu->run->mmio.data, frag->data, min(8u, frag->len)); + vcpu->run->mmio.is_write = ops->write; + if (ops->write) + memcpy(vcpu->run->mmio.data, frag->data, min(8u, frag->len)); vcpu->run->exit_reason = KVM_EXIT_MMIO; vcpu->arch.complete_userspace_io = complete_sev_es_emulated_mmio; return 0; } + +int kvm_sev_es_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes, + void *data) +{ + return kvm_sev_es_do_mmio(vcpu, gpa, bytes, data, &write_emultor); +} EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_sev_es_mmio_write); int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes, void *data) { - int handled; - struct kvm_mmio_fragment *frag; - - if (!data || WARN_ON_ONCE(object_is_on_stack(data))) - return -EINVAL; - - handled = read_emultor.read_write_mmio(vcpu, gpa, bytes, data); - if (handled == bytes) - return 1; - - bytes -= handled; - gpa += handled; - data += handled; - - /*TODO: Check if need to increment number of frags */ - frag = vcpu->mmio_fragments; - vcpu->mmio_nr_fragments = 1; - frag->len = bytes; - frag->gpa = gpa; - frag->data = data; - - vcpu->mmio_needed = 1; - vcpu->mmio_cur_fragment = 0; - - vcpu->run->mmio.phys_addr = gpa; - vcpu->run->mmio.len = min(8u, frag->len); - vcpu->run->mmio.is_write = 0; - vcpu->run->exit_reason = KVM_EXIT_MMIO; - - vcpu->arch.complete_userspace_io = complete_sev_es_emulated_mmio; - - return 0; + return kvm_sev_es_do_mmio(vcpu, gpa, bytes, data, &read_emultor); } EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_sev_es_mmio_read); From 326e810eaaa53ae38c21855da064bfed26a44045 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:44 -0800 Subject: [PATCH 035/373] KVM: x86: Consolidate SEV-ES 
MMIO emulation into a single public API Dedup kvm_sev_es_mmio_{read,write}() into a single API, as the "cost" of plumbing in a boolean is largely negligible since KVM pulls out a boolean for ops->write anyways, and consolidating the APIs will allow for additional cleanups. No functional change intended. Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-10-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 22 +++++++--------------- arch/x86/kvm/x86.c | 29 +++++++++-------------------- arch/x86/kvm/x86.h | 6 ++---- 3 files changed, 18 insertions(+), 39 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 3f9c1aa39a0a..f3478fb9461d 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -4434,25 +4434,17 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) switch (control->exit_code) { case SVM_VMGEXIT_MMIO_READ: - ret = setup_vmgexit_scratch(svm, true, control->exit_info_2); + case SVM_VMGEXIT_MMIO_WRITE: { + bool is_write = control->exit_code == SVM_VMGEXIT_MMIO_WRITE; + + ret = setup_vmgexit_scratch(svm, !is_write, control->exit_info_2); if (ret) break; - ret = kvm_sev_es_mmio_read(vcpu, - control->exit_info_1, - control->exit_info_2, - svm->sev_es.ghcb_sa); - break; - case SVM_VMGEXIT_MMIO_WRITE: - ret = setup_vmgexit_scratch(svm, false, control->exit_info_2); - if (ret) - break; - - ret = kvm_sev_es_mmio_write(vcpu, - control->exit_info_1, - control->exit_info_2, - svm->sev_es.ghcb_sa); + ret = kvm_sev_es_mmio(vcpu, is_write, control->exit_info_1, + control->exit_info_2, svm->sev_es.ghcb_sa); break; + } case SVM_VMGEXIT_NMI_COMPLETE: ++vcpu->stat.nmi_window_exits; svm->nmi_masked = false; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 81e683914072..6ecc3cf972ae 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -14292,9 +14292,8 @@ static int complete_sev_es_emulated_mmio(struct kvm_vcpu *vcpu) return 0; } -static int 
kvm_sev_es_do_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, - unsigned int bytes, void *data, - const struct read_write_emulator_ops *ops) +int kvm_sev_es_mmio(struct kvm_vcpu *vcpu, bool is_write, gpa_t gpa, + unsigned int bytes, void *data) { struct kvm_mmio_fragment *frag; int handled; @@ -14302,7 +14301,10 @@ static int kvm_sev_es_do_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, if (!data || WARN_ON_ONCE(object_is_on_stack(data))) return -EINVAL; - handled = ops->read_write_mmio(vcpu, gpa, bytes, data); + if (is_write) + handled = vcpu_mmio_write(vcpu, gpa, bytes, data); + else + handled = vcpu_mmio_read(vcpu, gpa, bytes, data); if (handled == bytes) return 1; @@ -14325,8 +14327,8 @@ static int kvm_sev_es_do_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, vcpu->run->mmio.phys_addr = gpa; vcpu->run->mmio.len = min(8u, frag->len); - vcpu->run->mmio.is_write = ops->write; - if (ops->write) + vcpu->run->mmio.is_write = is_write; + if (is_write) memcpy(vcpu->run->mmio.data, frag->data, min(8u, frag->len)); vcpu->run->exit_reason = KVM_EXIT_MMIO; @@ -14334,20 +14336,7 @@ static int kvm_sev_es_do_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, return 0; } - -int kvm_sev_es_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes, - void *data) -{ - return kvm_sev_es_do_mmio(vcpu, gpa, bytes, data, &write_emultor); -} -EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_sev_es_mmio_write); - -int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes, - void *data) -{ - return kvm_sev_es_do_mmio(vcpu, gpa, bytes, data, &read_emultor); -} -EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_sev_es_mmio_read); +EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_sev_es_mmio); static void advance_sev_es_emulated_pio(struct kvm_vcpu *vcpu, unsigned count, int size) { diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 94d4f07aaaa0..1d0f0edd31b3 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -712,10 +712,8 @@ static inline bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) __reserved_bits; \ }) 
-int kvm_sev_es_mmio_write(struct kvm_vcpu *vcpu, gpa_t src, unsigned int bytes, - void *dst); -int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t src, unsigned int bytes, - void *dst); +int kvm_sev_es_mmio(struct kvm_vcpu *vcpu, bool is_write, gpa_t gpa, + unsigned int bytes, void *data); int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, unsigned int port, void *data, unsigned int count, int in); From 3517193ef9c260e4a2677fd4e7dc09efd0f628bb Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:45 -0800 Subject: [PATCH 036/373] KVM: x86: Bury emulator read/write ops in emulator_{read,write}_emulated() Now that SEV-ES invokes vcpu_mmio_{read,write}() directly, bury the read vs. write emulator ops in the dedicated emulator callbacks so that they are colocated with their usage, and to make it harder for non-emulator code to use hooks that are intended for the emulator. Opportunistically rename the structures to get rid of the long-standing "emultor" typo. No functional change intended. 
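The "bury the ops in the callback" idea from the patch above — a `static const` table scoped inside the only function that uses it — is a general C pattern; here is a minimal, hypothetical sketch (none of these names come from KVM):

```c
#include <assert.h>

struct demo_ops {
	int (*op)(int);
	int write;
};

static int demo_double(int x)
{
	return 2 * x;
}

/*
 * A function-local "static const" table keeps one read-only copy (it is
 * not rebuilt per call) while hiding it from all other code, making it
 * harder to misuse hooks intended for a single caller.
 */
static int demo_emulated_op(int x)
{
	static const struct demo_ops ops = {
		.op = demo_double,
		.write = 0,
	};

	return ops.op(x);
}
```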
Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-11-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 29 ++++++++++++++--------------- 1 file changed, 14 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6ecc3cf972ae..abc4ec06c548 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8133,17 +8133,6 @@ static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, return emulator_write_phys(vcpu, gpa, val, bytes); } -static const struct read_write_emulator_ops read_emultor = { - .read_write_emulate = read_emulate, - .read_write_mmio = vcpu_mmio_read, -}; - -static const struct read_write_emulator_ops write_emultor = { - .read_write_emulate = write_emulate, - .read_write_mmio = vcpu_mmio_write, - .write = true, -}; - static int emulator_read_write_onepage(unsigned long addr, void *val, unsigned int bytes, struct x86_exception *exception, @@ -8294,8 +8283,13 @@ static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt, unsigned int bytes, struct x86_exception *exception) { - return emulator_read_write(ctxt, addr, val, bytes, - exception, &read_emultor); + static const struct read_write_emulator_ops ops = { + .read_write_emulate = read_emulate, + .read_write_mmio = vcpu_mmio_read, + .write = false, + }; + + return emulator_read_write(ctxt, addr, val, bytes, exception, &ops); } static int emulator_write_emulated(struct x86_emulate_ctxt *ctxt, @@ -8304,8 +8298,13 @@ static int emulator_write_emulated(struct x86_emulate_ctxt *ctxt, unsigned int bytes, struct x86_exception *exception) { - return emulator_read_write(ctxt, addr, (void *)val, bytes, - exception, &write_emultor); + static const struct read_write_emulator_ops ops = { + .read_write_emulate = write_emulate, + .read_write_mmio = vcpu_mmio_write, + .write = true, + }; + + return emulator_read_write(ctxt, addr, (void *)val, bytes, exception, &ops); } #define emulator_try_cmpxchg_user(t, 
ptr, old, new) \ From 929613b3cd1a97bc6e17100c0cab5668cd4eff90 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:46 -0800 Subject: [PATCH 037/373] KVM: x86: Fold emulator_write_phys() into write_emulate() Fold emulator_write_phys() into write_emulate() to drop a superfluous wrapper, and to provide more symmetry between the read and write paths. No functional change intended. Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-12-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 3 --- arch/x86/kvm/x86.c | 20 +++++++------------- 2 files changed, 7 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ff07c45e3c73..aa030fbd669d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -2097,9 +2097,6 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end); int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3); -int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, - const void *val, int bytes); - extern bool tdp_enabled; u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index abc4ec06c548..0195b77710e6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8101,18 +8101,6 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, return vcpu_is_mmio_gpa(vcpu, gva, *gpa, write); } -int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, - const void *val, int bytes) -{ - int ret; - - ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes); - if (ret < 0) - return 0; - kvm_page_track_write(vcpu, gpa, val, bytes); - return 1; -} - struct read_write_emulator_ops { int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa, void *val, int bytes); @@ -8130,7 +8118,13 @@ static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, static int write_emulate(struct kvm_vcpu *vcpu, 
gpa_t gpa, void *val, int bytes) { - return emulator_write_phys(vcpu, gpa, val, bytes); + int ret; + + ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes); + if (ret < 0) + return 0; + kvm_page_track_write(vcpu, gpa, val, bytes); + return 1; } static int emulator_read_write_onepage(unsigned long addr, void *val, From 216729846603d5be668a84d75c88ae097c27ae61 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:47 -0800 Subject: [PATCH 038/373] KVM: x86: Rename .read_write_emulate() to .read_write_guest() Rename the ops and helpers to read/write guest memory to clarify that they do exactly that, i.e. aren't generic emulation flows and don't do anything related to emulated MMIO. Opportunistically add comments to explain the flow, e.g. it's not exactly obvious that KVM deliberately treats "failed" accesses to guest memory as emulated MMIO. No functional change intended. Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-13-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 38 ++++++++++++++++++++++++++++---------- 1 file changed, 28 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0195b77710e6..cbd377bf71ba 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8102,21 +8102,21 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, } struct read_write_emulator_ops { - int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa, - void *val, int bytes); + int (*read_write_guest)(struct kvm_vcpu *vcpu, gpa_t gpa, + void *val, int bytes); int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val); bool write; }; -static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, - void *val, int bytes) +static int emulator_read_guest(struct kvm_vcpu *vcpu, gpa_t gpa, + void *val, int bytes) { return !kvm_vcpu_read_guest(vcpu, gpa, val, bytes); } -static int write_emulate(struct kvm_vcpu 
*vcpu, gpa_t gpa, - void *val, int bytes) +static int emulator_write_guest(struct kvm_vcpu *vcpu, gpa_t gpa, + void *val, int bytes) { int ret; @@ -8156,11 +8156,22 @@ static int emulator_read_write_onepage(unsigned long addr, void *val, return X86EMUL_PROPAGATE_FAULT; } - if (!ret && ops->read_write_emulate(vcpu, gpa, val, bytes)) + /* + * If the memory is not _known_ to be emulated MMIO, attempt to access + * guest memory. If accessing guest memory fails, e.g. because there's + * no memslot, then handle the access as MMIO. Note, treating the + * access as emulated MMIO is technically wrong if there is a memslot, + * i.e. if accessing host user memory failed, but this has been KVM's + * historical ABI for decades. + */ + if (!ret && ops->read_write_guest(vcpu, gpa, val, bytes)) return X86EMUL_CONTINUE; /* - * Is this MMIO handled locally? + * Attempt to handle emulated MMIO within the kernel, e.g. for accesses + * to an in-kernel local or I/O APIC, or to an ioeventfd range attached + * to MMIO bus. If the access isn't fully resolved, insert an MMIO + * fragment with the relevant details. */ handled = ops->read_write_mmio(vcpu, gpa, bytes, val); if (handled == bytes) @@ -8181,6 +8192,13 @@ static int emulator_read_write_onepage(unsigned long addr, void *val, frag->data = val; } frag->len = bytes; + + /* + * Continue emulating, even though KVM needs to (eventually) do an MMIO + * exit to userspace. If the access splits multiple pages, then KVM + * needs to exit to userspace only after emulating both parts of the + * access. 
+ */ return X86EMUL_CONTINUE; } @@ -8278,7 +8296,7 @@ static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt, struct x86_exception *exception) { static const struct read_write_emulator_ops ops = { - .read_write_emulate = read_emulate, + .read_write_guest = emulator_read_guest, .read_write_mmio = vcpu_mmio_read, .write = false, }; @@ -8293,7 +8311,7 @@ static int emulator_write_emulated(struct x86_emulate_ctxt *ctxt, struct x86_exception *exception) { static const struct read_write_emulator_ops ops = { - .read_write_emulate = write_emulate, + .read_write_guest = emulator_write_guest, .read_write_mmio = vcpu_mmio_write, .write = true, }; From 4f09e62afcd6c7a2c3428a3453ced7e56475dc70 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:48 -0800 Subject: [PATCH 039/373] KVM: x86: Don't panic the kernel if completing userspace I/O / MMIO goes sideways Kill the VM instead of the host kernel if KVM botches I/O and/or MMIO handling. There is zero danger to the host or guest, i.e. panicking the host isn't remotely justified. 
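The BUG_ON()-to-KVM_BUG_ON() conversion in the patch above amounts to "mark the VM dead and fail the ioctl instead of panicking the host". A toy model (not KVM's actual macro, which also signals userspace) of that control flow:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical VM state: a flag instead of a full struct kvm. */
static struct {
	int bugged;
} demo_vm;

/* If the invariant is violated, flag the VM as dead and report it so
 * the caller can bail with -EIO rather than BUG()ing the host. */
#define DEMO_VM_BUG_ON(cond)	((cond) ? (demo_vm.bugged = 1) : 0)

static int demo_complete_emulated_pio(int pio_count)
{
	if (DEMO_VM_BUG_ON(!pio_count))
		return -EIO;
	return 1;	/* continue emulation / resume the guest */
}
```

The key property is that a violated invariant is contained to the offending VM: the host keeps running, and only the buggy VM is taken down.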
Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-14-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cbd377bf71ba..1467652ceabc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9709,7 +9709,8 @@ static int complete_fast_pio_in(struct kvm_vcpu *vcpu) unsigned long val; /* We should only ever be called with arch.pio.count equal to 1 */ - BUG_ON(vcpu->arch.pio.count != 1); + if (KVM_BUG_ON(vcpu->arch.pio.count != 1, vcpu->kvm)) + return -EIO; if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.cui_linear_rip))) { vcpu->arch.pio.count = 0; @@ -11819,7 +11820,8 @@ static inline int complete_emulated_io(struct kvm_vcpu *vcpu) static int complete_emulated_pio(struct kvm_vcpu *vcpu) { - BUG_ON(!vcpu->arch.pio.count); + if (KVM_BUG_ON(!vcpu->arch.pio.count, vcpu->kvm)) + return -EIO; return complete_emulated_io(vcpu); } @@ -11848,7 +11850,8 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu) struct kvm_mmio_fragment *frag; unsigned len; - BUG_ON(!vcpu->mmio_needed); + if (KVM_BUG_ON(!vcpu->mmio_needed, vcpu->kvm)) + return -EIO; /* Complete previous fragment */ frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment]; @@ -14261,7 +14264,8 @@ static int complete_sev_es_emulated_mmio(struct kvm_vcpu *vcpu) struct kvm_mmio_fragment *frag; unsigned int len; - BUG_ON(!vcpu->mmio_needed); + if (KVM_BUG_ON(!vcpu->mmio_needed, vcpu->kvm)) + return -EIO; /* Complete previous fragment */ frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment]; From e2138c4a5be1e50d75281136bdc3e709cb07ec5e Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 24 Feb 2026 17:20:49 -0800 Subject: [PATCH 040/373] KVM: x86: Add helpers to prepare kvm_run for userspace MMIO exit Add helpers to fill kvm_run for userspace MMIO exits to deduplicate a variety of code, and to allow for a 
cleaner return path in emulator_read_write(). Opportunistically add a KVM_BUG_ON() to ensure the caller is limiting the length of a single MMIO access to 8 bytes (the largest size userspace is prepared to handle, as the ABI was baked before things like MOVDQ came along). No functional change intended. Cc: Rick Edgecombe Cc: Binbin Wu Cc: Xiaoyao Li Cc: Tom Lendacky Cc: Michael Roth Tested-by: Tom Lendacky Tested-by: Rick Edgecombe Link: https://patch.msgid.link/20260225012049.920665-15-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/tdx.c | 14 ++++---------- arch/x86/kvm/x86.c | 42 ++++++++---------------------------------- arch/x86/kvm/x86.h | 26 ++++++++++++++++++++++++++ 3 files changed, 38 insertions(+), 44 deletions(-) diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index c5065f84b78b..5e9b0c4d9af6 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -1467,17 +1467,11 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu) /* Request the device emulation to userspace device model.
*/ vcpu->mmio_is_write = write; - if (!write) + + __kvm_prepare_emulated_mmio_exit(vcpu, gpa, size, &val, write); + + if (!write) { vcpu->arch.complete_userspace_io = tdx_complete_mmio_read; - - vcpu->run->mmio.phys_addr = gpa; - vcpu->run->mmio.len = size; - vcpu->run->mmio.is_write = write; - vcpu->run->exit_reason = KVM_EXIT_MMIO; - - if (write) { - memcpy(vcpu->run->mmio.data, &val, size); - } else { vcpu->mmio_fragments[0].gpa = gpa; vcpu->mmio_fragments[0].len = size; trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1467652ceabc..8cb6b1f1916e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8209,7 +8209,6 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt, const struct read_write_emulator_ops *ops) { struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); - struct kvm_mmio_fragment *frag; int rc; if (WARN_ON_ONCE((bytes > 8u || !ops->write) && object_is_on_stack(val))) @@ -8267,12 +8266,9 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt, vcpu->mmio_needed = 1; vcpu->mmio_cur_fragment = 0; + vcpu->mmio_is_write = ops->write; - frag = &vcpu->mmio_fragments[0]; - vcpu->run->mmio.len = min(8u, frag->len); - vcpu->run->mmio.is_write = vcpu->mmio_is_write = ops->write; - vcpu->run->exit_reason = KVM_EXIT_MMIO; - vcpu->run->mmio.phys_addr = frag->gpa; + kvm_prepare_emulated_mmio_exit(vcpu, &vcpu->mmio_fragments[0]); /* * For MMIO reads, stop emulating and immediately exit to userspace, as @@ -8282,11 +8278,7 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt, * after completing emulation (see the check on vcpu->mmio_needed in * x86_emulate_instruction()). */ - if (!ops->write) - return X86EMUL_IO_NEEDED; - - memcpy(vcpu->run->mmio.data, frag->data, min(8u, frag->len)); - return X86EMUL_CONTINUE; + return ops->write ? 
X86EMUL_CONTINUE : X86EMUL_IO_NEEDED; } static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt, @@ -11883,12 +11875,7 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu) return complete_emulated_io(vcpu); } - run->exit_reason = KVM_EXIT_MMIO; - run->mmio.phys_addr = frag->gpa; - if (vcpu->mmio_is_write) - memcpy(run->mmio.data, frag->data, min(8u, frag->len)); - run->mmio.len = min(8u, frag->len); - run->mmio.is_write = vcpu->mmio_is_write; + kvm_prepare_emulated_mmio_exit(vcpu, frag); vcpu->arch.complete_userspace_io = complete_emulated_mmio; return 0; } @@ -14295,15 +14282,8 @@ static int complete_sev_es_emulated_mmio(struct kvm_vcpu *vcpu) } // More MMIO is needed - run->mmio.phys_addr = frag->gpa; - run->mmio.len = min(8u, frag->len); - run->mmio.is_write = vcpu->mmio_is_write; - if (run->mmio.is_write) - memcpy(run->mmio.data, frag->data, min(8u, frag->len)); - run->exit_reason = KVM_EXIT_MMIO; - + kvm_prepare_emulated_mmio_exit(vcpu, frag); vcpu->arch.complete_userspace_io = complete_sev_es_emulated_mmio; - return 0; } @@ -14332,23 +14312,17 @@ int kvm_sev_es_mmio(struct kvm_vcpu *vcpu, bool is_write, gpa_t gpa, * requests that split a page boundary. 
*/ frag = vcpu->mmio_fragments; - vcpu->mmio_nr_fragments = 1; frag->len = bytes; frag->gpa = gpa; frag->data = data; vcpu->mmio_needed = 1; vcpu->mmio_cur_fragment = 0; + vcpu->mmio_nr_fragments = 1; + vcpu->mmio_is_write = is_write; - vcpu->run->mmio.phys_addr = gpa; - vcpu->run->mmio.len = min(8u, frag->len); - vcpu->run->mmio.is_write = is_write; - if (is_write) - memcpy(vcpu->run->mmio.data, frag->data, min(8u, frag->len)); - vcpu->run->exit_reason = KVM_EXIT_MMIO; - + kvm_prepare_emulated_mmio_exit(vcpu, frag); vcpu->arch.complete_userspace_io = complete_sev_es_emulated_mmio; - return 0; } EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_sev_es_mmio); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 1d0f0edd31b3..44a28d343d40 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -718,6 +718,32 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, unsigned int port, void *data, unsigned int count, int in); +static inline void __kvm_prepare_emulated_mmio_exit(struct kvm_vcpu *vcpu, + gpa_t gpa, unsigned int len, + const void *data, + bool is_write) +{ + struct kvm_run *run = vcpu->run; + + KVM_BUG_ON(len > 8, vcpu->kvm); + + run->mmio.len = len; + run->mmio.is_write = is_write; + run->exit_reason = KVM_EXIT_MMIO; + run->mmio.phys_addr = gpa; + if (is_write) + memcpy(run->mmio.data, data, len); +} + +static inline void kvm_prepare_emulated_mmio_exit(struct kvm_vcpu *vcpu, + struct kvm_mmio_fragment *frag) +{ + WARN_ON_ONCE(!vcpu->mmio_needed || !vcpu->mmio_nr_fragments); + + __kvm_prepare_emulated_mmio_exit(vcpu, frag->gpa, min(8u, frag->len), + frag->data, vcpu->mmio_is_write); +} + static inline bool user_exit_on_hypercall(struct kvm *kvm, unsigned long hc_nr) { return kvm->arch.hypercall_exit_enabled & BIT(hc_nr); From 2b1a59f7ef96c3f29f0ada1a63f4699c35687e33 Mon Sep 17 00:00:00 2001 From: Gal Pressman Date: Wed, 25 Feb 2026 16:50:49 +0200 Subject: [PATCH 041/373] KVM: SVM: Fix UBSAN warning when reading avic parameter The avic parameter is 
stored as an int to support the special value -1 (AVIC_AUTO_MODE), but the cited commit changed it from bool to int while keeping param_get_bool() as the getter function. This causes UBSAN to report "load of value 255 is not a valid value for type '_Bool'" when the parameter is read via sysfs. The issue happens in two scenarios: 1. During module load: There's a time window between when module parameters are registered, and when avic_hardware_setup() runs to resolve the value, where the value is -1. 2. On non-AMD systems: On non-AMD hardware, the kvm_is_svm_supported() check returns early. The avic_hardware_setup() function never runs, so avic remains -1. Fix that by implementing a getter function that properly reads and converts the -1 value into a string. Triggered by sos report: UBSAN: invalid-load in kernel/params.c:323:33 load of value 255 is not a valid value for type '_Bool' CPU: 0 UID: 0 PID: 4667 Comm: sos Not tainted 6.19.0-rc5net_mlx5_1e86836 #1 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack_lvl+0x69/0xa0 ubsan_epilogue+0x5/0x2b __ubsan_handle_load_invalid_value.cold+0x47/0x4c ? lock_acquire+0x219/0x2c0 param_get_bool.cold+0xf/0x14 param_attr_show+0x51/0x80 module_attr_show+0x19/0x30 sysfs_kf_seq_show+0xac/0xf0 seq_read_iter+0x100/0x410 copy_splice_read+0x1b4/0x360 splice_direct_to_actor+0xbd/0x270 ? wait_for_space+0xb0/0xb0 do_splice_direct+0x72/0xb0 ? 
propagate_umount+0x870/0x870 do_sendfile+0x3a3/0x470 __x64_sys_sendfile64+0x5e/0xe0 do_syscall_64+0x70/0x8c0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: ca2967de5a5b ("KVM: SVM: Enable AVIC by default for Zen4+ if x2AVIC is support") Reviewed-by: Dragos Tatulea Signed-off-by: Gal Pressman Reviewed-by: Naveen N Rao (AMD) Link: https://patch.msgid.link/20260225145050.2350278-2-gal@nvidia.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/avic.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index 8c2bc98fed2b..7056c4891f93 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -76,10 +77,20 @@ static int avic_param_set(const char *val, const struct kernel_param *kp) return param_set_bint(val, kp); } +static int avic_param_get(char *buffer, const struct kernel_param *kp) +{ + int val = *(int *)kp->arg; + + if (val == AVIC_AUTO_MODE) + return sysfs_emit(buffer, "N\n"); + + return param_get_bool(buffer, kp); +} + static const struct kernel_param_ops avic_ops = { .flags = KERNEL_PARAM_OPS_FL_NOARG, .set = avic_param_set, - .get = param_get_bool, + .get = avic_param_get, }; /* From 1450ab08108ccd825c8f9362475fadfc187942fc Mon Sep 17 00:00:00 2001 From: Gal Pressman Date: Wed, 25 Feb 2026 16:50:50 +0200 Subject: [PATCH 042/373] KVM: x86/mmu: Fix UBSAN warning when reading nx_huge_pages parameter The nx_huge_pages parameter is stored as an int (initialized to -1 to indicate auto mode), but get_nx_huge_pages() calls param_get_bool() which expects a bool pointer. This causes UBSAN to report "load of value 255 is not a valid value for type '_Bool'" when the parameter is read via sysfs during a narrow time window. The issue occurs during module load: the module parameter is registered and its sysfs file becomes readable before the kvm_mmu_x86_module_init() function runs: 1. 
Module load begins, static variable initialized to -1 2. mod_sysfs_setup() creates /sys/module/kvm/parameters/nx_huge_pages 3. (Parameter readable, value = -1) 4. do_init_module() runs kvm_x86_init() 5. kvm_mmu_x86_module_init() resolves -1 to bool If userspace (e.g., sos report) reads the parameter during step 3, param_get_bool() dereferences the int as a bool, triggering the UBSAN warning. Fix that by properly reading and converting the -1 value into an 'auto' string. Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Reviewed-by: Dragos Tatulea Signed-off-by: Gal Pressman Link: https://patch.msgid.link/20260225145050.2350278-3-gal@nvidia.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index b922a8b00057..733c1d5671cd 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7487,9 +7487,14 @@ static void kvm_wake_nx_recovery_thread(struct kvm *kvm) static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp) { + int val = *(int *)kp->arg; + if (nx_hugepage_mitigation_hard_disabled) return sysfs_emit(buffer, "never\n"); + if (val == -1) + return sysfs_emit(buffer, "auto\n"); + return param_get_bool(buffer, kp); } From ecb80629321306547f7ad13b0ca5ef9cf8cdbb77 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 13:08:20 -0800 Subject: [PATCH 043/373] KVM: x86/mmu: Don't zero-allocate page table used for splitting a hugepage When splitting hugepages in the TDP MMU, don't zero the new page table on allocation since tdp_mmu_split_huge_page() is guaranteed to write every entry and thus every byte. Unless someone peeks at the memory between allocating the page table and writing the child SPTEs, no functional change intended. 
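The hugepage-split allocation change above rests on a simple rule: skip zero-initialization when the caller is guaranteed to overwrite every byte. A userspace sketch of the same trade-off, with made-up entry values (a real SPTE encodes permissions, memtype, and a PFN):

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_ENTRIES 512	/* a 4KiB page table holds 512 8-byte SPTEs */

/*
 * Plain malloc() (analogous to __get_free_page()) is enough because the
 * split path, like tdp_mmu_split_huge_page(), writes every entry before
 * anyone can observe the table; calloc()/get_zeroed_page() would only
 * add overhead.
 */
static unsigned long *demo_alloc_split_table(void)
{
	unsigned long *spt = malloc(DEMO_ENTRIES * sizeof(*spt));
	size_t i;

	if (!spt)
		return NULL;

	/* Mirrors the split path initializing every child entry. */
	for (i = 0; i < DEMO_ENTRIES; i++)
		spt[i] = 0xBB7UL | ((unsigned long)i << 12);	/* made-up bits */

	return spt;
}
```

As the commit message notes, the optimization is only safe because no one can peek at the table between allocation and full initialization.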
Cc: Rick Edgecombe Cc: Kai Huang Reviewed-by: Kai Huang Reviewed-by: Rick Edgecombe Link: https://patch.msgid.link/20260218210820.2828896-1-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 9c26038f6b77..7b1102d26f9c 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1507,7 +1507,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void) if (!sp) return NULL; - sp->spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); + sp->spt = (void *)__get_free_page(GFP_KERNEL_ACCOUNT); if (!sp->spt) { kmem_cache_free(mmu_page_header_cache, sp); return NULL; From f35043d0f973504e5f199be6287159dc5b373deb Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 13 Nov 2025 15:14:16 -0800 Subject: [PATCH 044/373] KVM: SVM: Serialize updates to global OS-Visible Workarounds variables Guard writes to the global osvw_status and osvw_len variables with a spinlock to ensure enabling virtualization on multiple CPUs in parallel doesn't effectively drop any writes due to writing back stale data. Don't bother taking the lock when the boot CPU doesn't support the feature, as that check is constant for all CPUs, i.e. racing writes will always write the same value (zero). Note, the bug was inadvertently "fixed" by commit 9a798b1337af ("KVM: Register cpuhp and syscore callbacks when enabling hardware"), which effectively serialized calls to enable virtualization due to how the cpuhp framework "brings up" CPUs. But KVM shouldn't rely on the mechanics of cpuhp to provide serialization. 
Link: https://patch.msgid.link/20251113231420.1695919-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index e0da247ee594..9d611bfbcc8c 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -77,6 +77,7 @@ static bool erratum_383_found __read_mostly; * are published and we know what the new status bits are */ static uint64_t osvw_len = 4, osvw_status; +static DEFINE_SPINLOCK(osvw_lock); static DEFINE_PER_CPU(u64, current_tsc_ratio); @@ -558,16 +559,19 @@ static int svm_enable_virtualization_cpu(void) if (!err) err = native_read_msr_safe(MSR_AMD64_OSVW_STATUS, &status); - if (err) + guard(spinlock)(&osvw_lock); + + if (err) { osvw_status = osvw_len = 0; - else { + } else { if (len < osvw_len) osvw_len = len; osvw_status |= status; osvw_status &= (1ULL << osvw_len) - 1; } - } else + } else { osvw_status = osvw_len = 0; + } svm_init_erratum_383(); From 089af84641b574990da97d4674706a0303abca34 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 13 Nov 2025 15:14:17 -0800 Subject: [PATCH 045/373] KVM: SVM: Skip OSVW MSR reads if KVM is treating all errata as present Don't bother reading the OSVW MSRs if osvw_len is already zero, i.e. if KVM is already treating all errata as present, in which case the positive path of the if-statement is one giant nop. Opportunistically update the comment to more thoroughly explain how the MSRs work and why the code does what it does. 
Link: https://patch.msgid.link/20251113231420.1695919-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 9d611bfbcc8c..695cebf8d724 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -543,15 +543,25 @@ static int svm_enable_virtualization_cpu(void) /* - * Get OSVW bits. + * Get OS-Visible Workarounds (OSVW) bits. * * Note that it is possible to have a system with mixed processor * revisions and therefore different OSVW bits. If bits are not the same * on different processors then choose the worst case (i.e. if erratum * is present on one processor and not on another then assume that the * erratum is present everywhere). + * + * Note #2! The OSVW MSRs are used to communciate that an erratum is + * NOT present! Software must assume erratum as present if its bit is + * set in OSVW_STATUS *or* the bit number exceeds OSVW_ID_LENGTH. If + * either RDMSR fails, simply zero out the length to treat all errata + * as being present. Similarly, use the *minimum* length across all + * CPUs, not the maximum length. + * + * If the length is zero, then is KVM already treating all errata as + * being present and there's nothing left to do. */ - if (cpu_has(&boot_cpu_data, X86_FEATURE_OSVW)) { + if (osvw_len && cpu_has(&boot_cpu_data, X86_FEATURE_OSVW)) { u64 len, status = 0; int err; From c65106af8393fe45524b256d7836317a8b3f2c09 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 13 Nov 2025 15:14:18 -0800 Subject: [PATCH 046/373] KVM: SVM: Extract OS-visible workarounds setup to helper function Move the initialization of the global OSVW variables to a helper function so that svm_enable_virtualization_cpu() isn't polluted with a pile of what is effectively legacy code. No functional change intended. 
Link: https://patch.msgid.link/20251113231420.1695919-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 90 +++++++++++++++++++++++------------------- 1 file changed, 49 insertions(+), 41 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 695cebf8d724..7fccb3d72e18 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -421,6 +421,54 @@ static void svm_init_osvw(struct kvm_vcpu *vcpu) vcpu->arch.osvw.status |= 1; } +static void svm_init_os_visible_workarounds(void) +{ + u64 len, status; + int err; + + /* + * Get OS-Visible Workarounds (OSVW) bits. + * + * Note that it is possible to have a system with mixed processor + * revisions and therefore different OSVW bits. If bits are not the same + * on different processors then choose the worst case (i.e. if erratum + * is present on one processor and not on another then assume that the + * erratum is present everywhere). + * + * Note #2! The OSVW MSRs are used to communciate that an erratum is + * NOT present! Software must assume erratum as present if its bit is + * set in OSVW_STATUS *or* the bit number exceeds OSVW_ID_LENGTH. If + * either RDMSR fails, simply zero out the length to treat all errata + * as being present. Similarly, use the *minimum* length across all + * CPUs, not the maximum length. + * + * If the length is zero, then is KVM already treating all errata as + * being present and there's nothing left to do. 
+ */ + if (!osvw_len) + return; + + if (!boot_cpu_has(X86_FEATURE_OSVW)) { + osvw_status = osvw_len = 0; + return; + } + + err = native_read_msr_safe(MSR_AMD64_OSVW_ID_LENGTH, &len); + if (!err) + err = native_read_msr_safe(MSR_AMD64_OSVW_STATUS, &status); + + guard(spinlock)(&osvw_lock); + + if (err) { + osvw_status = osvw_len = 0; + } else { + if (len < osvw_len) + osvw_len = len; + osvw_status |= status; + osvw_status &= (1ULL << osvw_len) - 1; + } +} + static bool __kvm_is_svm_supported(void) { int cpu = smp_processor_id(); @@ -541,47 +589,7 @@ static int svm_enable_virtualization_cpu(void) __svm_write_tsc_multiplier(SVM_TSC_RATIO_DEFAULT); } - - /* - * Get OS-Visible Workarounds (OSVW) bits. - * - * Note that it is possible to have a system with mixed processor - * revisions and therefore different OSVW bits. If bits are not the same - * on different processors then choose the worst case (i.e. if erratum - * is present on one processor and not on another then assume that the - * erratum is present everywhere). - * - * Note #2! The OSVW MSRs are used to communciate that an erratum is - * NOT present! Software must assume erratum as present if its bit is - * set in OSVW_STATUS *or* the bit number exceeds OSVW_ID_LENGTH. If - * either RDMSR fails, simply zero out the length to treat all errata - * as being present. Similarly, use the *minimum* length across all - * CPUs, not the maximum length. - * - * If the length is zero, then is KVM already treating all errata as - * being present and there's nothing left to do. 
- */ - if (osvw_len && cpu_has(&boot_cpu_data, X86_FEATURE_OSVW)) { - u64 len, status = 0; - int err; - - err = native_read_msr_safe(MSR_AMD64_OSVW_ID_LENGTH, &len); - if (!err) - err = native_read_msr_safe(MSR_AMD64_OSVW_STATUS, &status); - - guard(spinlock)(&osvw_lock); - - if (err) { - osvw_status = osvw_len = 0; - } else { - if (len < osvw_len) - osvw_len = len; - osvw_status |= status; - osvw_status &= (1ULL << osvw_len) - 1; - } - } else { - osvw_status = osvw_len = 0; - } + svm_init_os_visible_workarounds(); svm_init_erratum_383(); From 3b7a320e491c87c6d25928f6798c2efeef2be0e8 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 13 Nov 2025 15:14:19 -0800 Subject: [PATCH 047/373] KVM: SVM: Skip OSVW variable updates if current CPU's errata are a subset Elide the OSVW variable updates if the current CPU's set of errata is a subset of the errata tracked in the global values, i.e. if no update is needed. There's no danger of under-reporting errata due to bailing early as KVM is purely reducing the set of "known fixed" errata. I.e. a racing update on a different CPU with _more_ errata doesn't change anything if the current CPU has the same or fewer errata relative to the status quo. If another CPU is writing osvw_len, then "len" is guaranteed to be larger than the new osvw_len and so the osvw_len update would be skipped anyways. If another CPU is setting new bits in osvw_status, then "status" is guaranteed to be a subset of the new osvw_status and the bitwise-OR would be an effective nop anyways. 
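The OSVW update logic from the patches above (worst-case merge plus the lock-free subset early-out) condenses to a small C sketch; `osvw_update()` is a hypothetical name and the spinlock is elided:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Globals mirroring osvw_len/osvw_status: start with 4 tracked errata,
 * all assumed fixed (status bits clear). */
static uint64_t osvw_len = 4, osvw_status;

/* Merge one CPU's OSVW values into the worst-case globals: minimum
 * length, union of "erratum present" status bits, masked to the valid
 * length. Returns false when this CPU's errata are already a subset,
 * i.e. when no update (and, in the real code, no lock) is needed. */
static bool osvw_update(uint64_t len, uint64_t status)
{
	if (status == osvw_status && len >= osvw_len)
		return false;

	/* Held under osvw_lock in the kernel. */
	if (len < osvw_len)
		osvw_len = len;
	osvw_status |= status;
	osvw_status &= (1ULL << osvw_len) - 1;
	return true;
}
```

The early-out is safe precisely because every update only shrinks the set of "known fixed" errata: a racing writer can only make the globals more conservative, never less.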
Link: https://patch.msgid.link/20251113231420.1695919-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 22 ++++++++++------------ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7fccb3d72e18..5c4328f42604 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -424,7 +424,6 @@ static void svm_init_osvw(struct kvm_vcpu *vcpu) static void svm_init_os_visible_workarounds(void) { u64 len, status; - int err; /* * Get OS-Visible Workarounds (OSVW) bits. @@ -453,20 +452,19 @@ static void svm_init_os_visible_workarounds(void) return; } - err = native_read_msr_safe(MSR_AMD64_OSVW_ID_LENGTH, &len); - if (!err) - err = native_read_msr_safe(MSR_AMD64_OSVW_STATUS, &status); + if (native_read_msr_safe(MSR_AMD64_OSVW_ID_LENGTH, &len) || + native_read_msr_safe(MSR_AMD64_OSVW_STATUS, &status)) + len = status = 0; + + if (status == READ_ONCE(osvw_status) && len >= READ_ONCE(osvw_len)) + return; guard(spinlock)(&osvw_lock); - if (err) { - osvw_status = osvw_len = 0; - } else { - if (len < osvw_len) - osvw_len = len; - osvw_status |= status; - osvw_status &= (1ULL << osvw_len) - 1; - } + if (len < osvw_len) + osvw_len = len; + osvw_status |= status; + osvw_status &= (1ULL << osvw_len) - 1; } static bool __kvm_is_svm_supported(void) From a56444d5e7387effbc61d6b98fe5d68897017fc9 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 13 Nov 2025 15:14:20 -0800 Subject: [PATCH 048/373] KVM: SVM: Skip OSVW MSR reads if current CPU doesn't support the feature Skip the OSVW RDMSRs if the current CPU doesn't enumerate support for the MSRs. In practice, checking only the boot CPU's capabilities is sufficient, as the RDMSRs should fault when unsupported, but there's no downside to being more precise, and checking only the boot CPU _looks_ wrong given the rather odd semantics of the MSRs. E.g. if a CPU doesn't support OSVW, then KVM must assume all errata are present. 
Link: https://patch.msgid.link/20251113231420.1695919-6-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 5c4328f42604..3f3290d5a0a6 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -447,12 +447,8 @@ static void svm_init_os_visible_workarounds(void) if (!osvw_len) return; - if (!boot_cpu_has(X86_FEATURE_OSVW)) { - osvw_status = osvw_len = 0; - return; - } - - if (native_read_msr_safe(MSR_AMD64_OSVW_ID_LENGTH, &len) || + if (!this_cpu_has(X86_FEATURE_OSVW) || + native_read_msr_safe(MSR_AMD64_OSVW_ID_LENGTH, &len) || native_read_msr_safe(MSR_AMD64_OSVW_STATUS, &status)) len = status = 0; From 43e41846ac7ebee529c3684b5726d71224f4fbdd Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 2 Mar 2026 15:42:49 +0000 Subject: [PATCH 049/373] KVM: x86: Drop redundant call to kvm_deliver_exception_payload() In kvm_check_and_inject_events(), kvm_deliver_exception_payload() is called for pending #DB exceptions. However, shortly after, the per-vendor inject_exception callbacks are made. Both vmx_inject_exception() and svm_inject_exception() unconditionally call kvm_deliver_exception_payload(), so the call in kvm_check_and_inject_events() is redundant. Note that the extra call for pending #DB exceptions is harmless, as kvm_deliver_exception_payload() clears exception.has_payload after the first call. The call in kvm_check_and_inject_events() was added in commit f10c729ff965 ("kvm: vmx: Defer setting of DR6 until #DB delivery"). At that point, the call was likely needed because svm_queue_exception() checked whether an exception for L2 is intercepted by L1 before calling kvm_deliver_exception_payload(), as SVM did not have a check_nested_events callback. Since DR6 is updated before the #DB intercept in SVM (unlike VMX), it was necessary to deliver the DR6 payload before calling svm_queue_exception(). 
After that, commit 7c86663b68ba ("KVM: nSVM: inject exceptions via svm_check_nested_events") added a check_nested_events callback for SVM, which checked for L1 intercepts for L2's exceptions, and delivered the payload appropriately before the intercept. At that point, svm_queue_exception() started calling kvm_deliver_exception_payload() unconditionally, and the call to kvm_deliver_exception_payload() from its caller became redundant. No functional change intended. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260302154249.784529-1-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 658476815b6a..d5731499f4c2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10736,12 +10736,10 @@ static int kvm_check_and_inject_events(struct kvm_vcpu *vcpu, __kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) | X86_EFLAGS_RF); - if (vcpu->arch.exception.vector == DB_VECTOR) { - kvm_deliver_exception_payload(vcpu, &vcpu->arch.exception); - if (vcpu->arch.dr7 & DR7_GD) { - vcpu->arch.dr7 &= ~DR7_GD; - kvm_update_dr7(vcpu); - } + if (vcpu->arch.exception.vector == DB_VECTOR && + vcpu->arch.dr7 & DR7_GD) { + vcpu->arch.dr7 &= ~DR7_GD; + kvm_update_dr7(vcpu); } kvm_inject_exception(vcpu); From 4059172b2a78a71d15d8fcd8d3fd8ea1ba65d25b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:47 -0800 Subject: [PATCH 050/373] KVM: x86: Move kvm_rebooting to x86 Move kvm_rebooting, which is only read by x86, to KVM x86 so that it can be moved again to core x86 code. Add a "shutdown" arch hook to facilitate setting the flag in KVM x86, along with a pile of comments to provide more context around what KVM x86 is doing and why. 
Reviewed-by: Chao Gao Acked-by: Dave Hansen Tested-by: Chao Gao Reviewed-by: Dan Williams Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 22 ++++++++++++++++++++++ arch/x86/kvm/x86.h | 1 + include/linux/kvm_host.h | 8 +++++++- virt/kvm/kvm_main.c | 14 +++++++------- 4 files changed, 37 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a03530795707..7ac3578e6ec0 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -700,6 +700,9 @@ static void drop_user_return_notifiers(void) kvm_on_user_return(&msrs->urn); } +__visible bool kvm_rebooting; +EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_rebooting); + /* * Handle a fault on a hardware virtualization (VMX or SVM) instruction. * @@ -13177,6 +13180,25 @@ int kvm_arch_enable_virtualization_cpu(void) return 0; } +void kvm_arch_shutdown(void) +{ + /* + * Set kvm_rebooting to indicate that KVM has asynchronously disabled + * hardware virtualization, i.e. that errors and/or exceptions on SVM + * and VMX instructions are expected and should be ignored. + */ + kvm_rebooting = true; + + /* + * Ensure kvm_rebooting is visible before IPIs are sent to other CPUs + * to disable virtualization. Effectively pairs with the reception of + * the IPI (kvm_rebooting is read in task/exception context, but only + * _needs_ to be read as %true after the IPI function callback disables + * virtualization). 
+ */ + smp_wmb(); +} + void kvm_arch_disable_virtualization_cpu(void) { kvm_x86_call(disable_virtualization_cpu)(); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 94d4f07aaaa0..b314649e5c02 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -54,6 +54,7 @@ struct kvm_host_values { u64 arch_capabilities; }; +extern bool kvm_rebooting; void kvm_spurious_fault(void); #define SIZE_OF_MEMSLOTS_HASHTABLE \ diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 34759a262b28..7c4ebd5210ec 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1627,6 +1627,13 @@ static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {} #endif #ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING +/* + * kvm_arch_shutdown() is invoked immediately prior to forcefully disabling + * hardware virtualization on all CPUs via IPI function calls (in preparation + * for shutdown or reboot), e.g. to allow arch code to prepare for disabling + * virtualization while KVM may be actively running vCPUs. 
+ */ +void kvm_arch_shutdown(void); /* * kvm_arch_{enable,disable}_virtualization() are called on one CPU, under * kvm_usage_lock, immediately after/before 0=>1 and 1=>0 transitions of @@ -2313,7 +2320,6 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu) #ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING extern bool enable_virt_at_load; -extern bool kvm_rebooting; #endif extern unsigned int halt_poll_ns; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1bc1da66b4b0..d27bf2488b12 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -5578,13 +5578,15 @@ bool enable_virt_at_load = true; module_param(enable_virt_at_load, bool, 0444); EXPORT_SYMBOL_FOR_KVM_INTERNAL(enable_virt_at_load); -__visible bool kvm_rebooting; -EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_rebooting); - static DEFINE_PER_CPU(bool, virtualization_enabled); static DEFINE_MUTEX(kvm_usage_lock); static int kvm_usage_count; +__weak void kvm_arch_shutdown(void) +{ + +} + __weak void kvm_arch_enable_virtualization(void) { @@ -5638,10 +5640,9 @@ static int kvm_offline_cpu(unsigned int cpu) static void kvm_shutdown(void *data) { + kvm_arch_shutdown(); + /* - * Disable hardware virtualization and set kvm_rebooting to indicate - * that KVM has asynchronously disabled hardware virtualization, i.e. - * that relevant errors and exceptions aren't entirely unexpected. * Some flavors of hardware virtualization need to be disabled before * transferring control to firmware (to perform shutdown/reboot), e.g. * on x86, virtualization can block INIT interrupts, which are used by @@ -5650,7 +5651,6 @@ static void kvm_shutdown(void *data) * 100% comprehensive. 
*/ pr_info("kvm: exiting hardware virtualization\n"); - kvm_rebooting = true; on_each_cpu(kvm_disable_virtualization_cpu, NULL, 1); } From 3c75e6a5da3c0dfcd34e5f9df5390804179f2aeb Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:48 -0800 Subject: [PATCH 051/373] KVM: VMX: Move architectural "vmcs" and "vmcs_hdr" structures to public vmx.h Move "struct vmcs" and "struct vmcs_hdr" to asm/vmx.h in anticipation of moving VMXON/VMXOFF to the core kernel (VMXON requires a "root" VMCS with the appropriate revision ID in its header). No functional change intended. Tested-by: Chao Gao Reviewed-by: Dan Williams Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/vmx.h | 11 +++++++++++ arch/x86/kvm/vmx/vmcs.h | 11 ----------- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index b92ff87e3560..37080382df54 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -20,6 +20,17 @@ #include #include +struct vmcs_hdr { + u32 revision_id:31; + u32 shadow_vmcs:1; +}; + +struct vmcs { + struct vmcs_hdr hdr; + u32 abort; + char data[]; +}; + #define VMCS_CONTROL_BIT(x) BIT(VMX_FEATURE_##x & 0x1f) /* diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h index 66d747e265b1..1f16ddeae9cb 100644 --- a/arch/x86/kvm/vmx/vmcs.h +++ b/arch/x86/kvm/vmx/vmcs.h @@ -22,17 +22,6 @@ #define VMCS12_IDX_TO_ENC(idx) ROL16(idx, 10) #define ENC_TO_VMCS12_IDX(enc) ROL16(enc, 6) -struct vmcs_hdr { - u32 revision_id:31; - u32 shadow_vmcs:1; -}; - -struct vmcs { - struct vmcs_hdr hdr; - u32 abort; - char data[]; -}; - DECLARE_PER_CPU(struct vmcs *, current_vmcs); /* From a1450a8156c65d9fe6111627094c26359d2e2274 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:49 -0800 Subject: [PATCH 052/373] KVM: x86: Move "kvm_rebooting" to kernel as 
"virt_rebooting" Move "kvm_rebooting" to the kernel, exported for KVM, as one of many steps towards extracting the innermost VMXON and EFER.SVME management logic out of KVM and into core x86. For lack of a better name, call the new file "hw.c", to yield "virt hardware" when combined with its parent directory. No functional change intended. Tested-by: Chao Gao Reviewed-by: Dan Williams Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/virt.h | 11 +++++++++++ arch/x86/kvm/svm/svm.c | 3 ++- arch/x86/kvm/svm/vmenter.S | 10 +++++----- arch/x86/kvm/vmx/tdx.c | 3 ++- arch/x86/kvm/vmx/vmenter.S | 2 +- arch/x86/kvm/vmx/vmx.c | 5 +++-- arch/x86/kvm/x86.c | 17 ++++++++--------- arch/x86/kvm/x86.h | 1 - arch/x86/virt/Makefile | 2 ++ arch/x86/virt/hw.c | 7 +++++++ 10 files changed, 41 insertions(+), 20 deletions(-) create mode 100644 arch/x86/include/asm/virt.h create mode 100644 arch/x86/virt/hw.c diff --git a/arch/x86/include/asm/virt.h b/arch/x86/include/asm/virt.h new file mode 100644 index 000000000000..131b9bf9ef3c --- /dev/null +++ b/arch/x86/include/asm/virt.h @@ -0,0 +1,11 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef _ASM_X86_VIRT_H +#define _ASM_X86_VIRT_H + +#include + +#if IS_ENABLED(CONFIG_KVM_X86) +extern bool virt_rebooting; +#endif + +#endif /* _ASM_X86_VIRT_H */ diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 8f8bc863e214..0ae66c770ebc 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -44,6 +44,7 @@ #include #include #include +#include #include @@ -495,7 +496,7 @@ static inline void kvm_cpu_svm_disable(void) static void svm_emergency_disable_virtualization_cpu(void) { - kvm_rebooting = true; + virt_rebooting = true; kvm_cpu_svm_disable(); } diff --git a/arch/x86/kvm/svm/vmenter.S b/arch/x86/kvm/svm/vmenter.S index 3392bcadfb89..d47c5c93c991 100644 --- a/arch/x86/kvm/svm/vmenter.S +++ 
b/arch/x86/kvm/svm/vmenter.S @@ -298,16 +298,16 @@ SYM_FUNC_START(__svm_vcpu_run) RESTORE_GUEST_SPEC_CTRL_BODY RESTORE_HOST_SPEC_CTRL_BODY (%_ASM_SP) -10: cmpb $0, _ASM_RIP(kvm_rebooting) +10: cmpb $0, _ASM_RIP(virt_rebooting) jne 2b ud2 -30: cmpb $0, _ASM_RIP(kvm_rebooting) +30: cmpb $0, _ASM_RIP(virt_rebooting) jne 4b ud2 -50: cmpb $0, _ASM_RIP(kvm_rebooting) +50: cmpb $0, _ASM_RIP(virt_rebooting) jne 6b ud2 -70: cmpb $0, _ASM_RIP(kvm_rebooting) +70: cmpb $0, _ASM_RIP(virt_rebooting) jne 8b ud2 @@ -394,7 +394,7 @@ SYM_FUNC_START(__svm_sev_es_vcpu_run) RESTORE_GUEST_SPEC_CTRL_BODY RESTORE_HOST_SPEC_CTRL_BODY %sil -3: cmpb $0, kvm_rebooting(%rip) +3: cmpb $0, virt_rebooting(%rip) jne 2b ud2 diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index c5065f84b78b..f81b562733ef 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -6,6 +6,7 @@ #include #include #include +#include #include "capabilities.h" #include "mmu.h" #include "x86_ops.h" @@ -1994,7 +1995,7 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath) * TDX_SEAMCALL_VMFAILINVALID. 
*/ if (unlikely((vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR)) { - KVM_BUG_ON(!kvm_rebooting, vcpu->kvm); + KVM_BUG_ON(!virt_rebooting, vcpu->kvm); goto unhandled_exit; } diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index 4426d34811fc..8a481dae9cae 100644 --- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -310,7 +310,7 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL) RET .Lfixup: - cmpb $0, _ASM_RIP(kvm_rebooting) + cmpb $0, _ASM_RIP(virt_rebooting) jne .Lvmfail ud2 .Lvmfail: diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 967b58a8ab9d..fc6e3b620866 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -48,6 +48,7 @@ #include #include #include +#include #include #include @@ -814,13 +815,13 @@ void vmx_emergency_disable_virtualization_cpu(void) int cpu = raw_smp_processor_id(); struct loaded_vmcs *v; - kvm_rebooting = true; + virt_rebooting = true; /* * Note, CR4.VMXE can be _cleared_ in NMI context, but it can only be * set in task context. If this races with VMX is disabled by an NMI, * VMCLEAR and VMXOFF may #UD, but KVM will eat those faults due to - * kvm_rebooting set. + * virt_rebooting set. */ if (!(__read_cr4() & X86_CR4_VMXE)) return; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7ac3578e6ec0..91a20fffedc3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -83,6 +83,8 @@ #include #include #include +#include + #include #define CREATE_TRACE_POINTS @@ -700,9 +702,6 @@ static void drop_user_return_notifiers(void) kvm_on_user_return(&msrs->urn); } -__visible bool kvm_rebooting; -EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_rebooting); - /* * Handle a fault on a hardware virtualization (VMX or SVM) instruction. * @@ -713,7 +712,7 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_rebooting); noinstr void kvm_spurious_fault(void) { /* Fault while not rebooting. We want the trace. 
*/ - BUG_ON(!kvm_rebooting); + BUG_ON(!virt_rebooting); } EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_spurious_fault); @@ -13183,16 +13182,16 @@ int kvm_arch_enable_virtualization_cpu(void) void kvm_arch_shutdown(void) { /* - * Set kvm_rebooting to indicate that KVM has asynchronously disabled + * Set virt_rebooting to indicate that KVM has asynchronously disabled * hardware virtualization, i.e. that errors and/or exceptions on SVM * and VMX instructions are expected and should be ignored. */ - kvm_rebooting = true; + virt_rebooting = true; /* - * Ensure kvm_rebooting is visible before IPIs are sent to other CPUs + * Ensure virt_rebooting is visible before IPIs are sent to other CPUs * to disable virtualization. Effectively pairs with the reception of - * the IPI (kvm_rebooting is read in task/exception context, but only + * the IPI (virt_rebooting is read in task/exception context, but only * _needs_ to be read as %true after the IPI function callback disables * virtualization). */ @@ -13213,7 +13212,7 @@ void kvm_arch_disable_virtualization_cpu(void) * disable virtualization arrives. Handle the extreme edge case here * instead of trying to account for it in the normal flows. 
if (in_task() || WARN_ON_ONCE(!virt_rebooting)) drop_user_return_notifiers(); else __module_get(THIS_MODULE); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index b314649e5c02..94d4f07aaaa0 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -54,7 +54,6 @@ struct kvm_host_values { u64 arch_capabilities; }; -extern bool kvm_rebooting; void kvm_spurious_fault(void); #define SIZE_OF_MEMSLOTS_HASHTABLE \ diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile index ea343fc392dc..6e485751650c 100644 --- a/arch/x86/virt/Makefile +++ b/arch/x86/virt/Makefile @@ -1,2 +1,4 @@ # SPDX-License-Identifier: GPL-2.0-only obj-y += svm/ vmx/ + +obj-$(subst m,y,$(CONFIG_KVM_X86)) += hw.o \ No newline at end of file diff --git a/arch/x86/virt/hw.c b/arch/x86/virt/hw.c new file mode 100644 index 000000000000..df3dc18d19b4 --- /dev/null +++ b/arch/x86/virt/hw.c @@ -0,0 +1,7 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include + +#include + +__visible bool virt_rebooting; +EXPORT_SYMBOL_FOR_KVM(virt_rebooting); From 405b7c27934eaabbcc52ccfbaeb22ef966b6b5f0 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:50 -0800 Subject: [PATCH 053/373] KVM: VMX: Unconditionally allocate root VMCSes during boot CPU bringup Allocate the root VMCS (misleadingly called "vmxarea" and "kvm_area" in KVM) for each possible CPU during early boot CPU bringup, before early TDX initialization, so that TDX can eventually do VMXON on-demand (to make SEAMCALLs) without needing to load kvm-intel.ko. Allocate the pages early on, e.g. instead of trying to do so on-demand, to avoid having to juggle allocation failures at runtime. Opportunistically rename the per-CPU pointers to better reflect the role of the VMCS. Use Intel's "root VMCS" terminology, e.g. from various VMCS patents[1][2] and older SDMs, not the more opaque "VMXON region" used in recent versions of the SDM. 
While it's possible the VMCS passed to VMXON no longer serves as _the_ root VMCS on modern CPUs, it is still in effect a "root mode VMCS", as described in the patents. Link: https://patentimages.storage.googleapis.com/c7/e4/32/d7a7def5580667/WO2013101191A1.pdf [1] Link: https://patentimages.storage.googleapis.com/13/f6/8d/1361fab8c33373/US20080163205A1.pdf [2] Tested-by: Chao Gao Reviewed-by: Dan Williams Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/virt.h | 13 ++++++- arch/x86/kernel/cpu/common.c | 2 + arch/x86/kvm/vmx/vmx.c | 58 ++--------------------------- arch/x86/virt/hw.c | 71 ++++++++++++++++++++++++++++++++++++ 4 files changed, 89 insertions(+), 55 deletions(-) diff --git a/arch/x86/include/asm/virt.h b/arch/x86/include/asm/virt.h index 131b9bf9ef3c..0da6db4f5b0c 100644 --- a/arch/x86/include/asm/virt.h +++ b/arch/x86/include/asm/virt.h @@ -2,10 +2,21 @@ #ifndef _ASM_X86_VIRT_H #define _ASM_X86_VIRT_H -#include +#include + +#include #if IS_ENABLED(CONFIG_KVM_X86) extern bool virt_rebooting; + +void __init x86_virt_init(void); + +#if IS_ENABLED(CONFIG_KVM_INTEL) +DECLARE_PER_CPU(struct vmcs *, root_vmcs); +#endif + +#else +static __always_inline void x86_virt_init(void) {} #endif #endif /* _ASM_X86_VIRT_H */ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 1c3261cae40c..314ceb100527 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -71,6 +71,7 @@ #include #include #include +#include #include #include @@ -2151,6 +2152,7 @@ static __init void identify_boot_cpu(void) cpu_detect_tlb(&boot_cpu_data); setup_cr_pinning(); + x86_virt_init(); tsx_init(); tdx_init(); lkgs_init(); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index fc6e3b620866..abd4830f71d8 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -580,7 +580,6 @@ noinline void invept_error(unsigned long 
ext, u64 eptp) vmx_insn_failed("invept failed: ext=0x%lx eptp=%llx\n", ext, eptp); } -static DEFINE_PER_CPU(struct vmcs *, vmxarea); DEFINE_PER_CPU(struct vmcs *, current_vmcs); /* * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is needed @@ -2934,6 +2933,9 @@ static bool __kvm_is_vmx_supported(void) return false; } + if (!per_cpu(root_vmcs, cpu)) + return false; + return true; } @@ -3008,7 +3010,7 @@ fault: int vmx_enable_virtualization_cpu(void) { int cpu = raw_smp_processor_id(); - u64 phys_addr = __pa(per_cpu(vmxarea, cpu)); + u64 phys_addr = __pa(per_cpu(root_vmcs, cpu)); int r; if (cr4_read_shadow() & X86_CR4_VMXE) @@ -3129,47 +3131,6 @@ out_vmcs: return -ENOMEM; } -static void free_kvm_area(void) -{ - int cpu; - - for_each_possible_cpu(cpu) { - free_vmcs(per_cpu(vmxarea, cpu)); - per_cpu(vmxarea, cpu) = NULL; - } -} - -static __init int alloc_kvm_area(void) -{ - int cpu; - - for_each_possible_cpu(cpu) { - struct vmcs *vmcs; - - vmcs = alloc_vmcs_cpu(false, cpu, GFP_KERNEL); - if (!vmcs) { - free_kvm_area(); - return -ENOMEM; - } - - /* - * When eVMCS is enabled, alloc_vmcs_cpu() sets - * vmcs->revision_id to KVM_EVMCS_VERSION instead of - * revision_id reported by MSR_IA32_VMX_BASIC. - * - * However, even though not explicitly documented by - * TLFS, VMXArea passed as VMXON argument should - * still be marked with revision_id reported by - * physical CPU. 
- */ - if (kvm_is_using_evmcs()) - vmcs->hdr.revision_id = vmx_basic_vmcs_revision_id(vmcs_config.basic); - - per_cpu(vmxarea, cpu) = vmcs; - } - return 0; -} - static void fix_pmode_seg(struct kvm_vcpu *vcpu, int seg, struct kvm_segment *save) { @@ -8566,8 +8527,6 @@ void vmx_hardware_unsetup(void) if (nested) nested_vmx_hardware_unsetup(); - - free_kvm_area(); } void vmx_vm_destroy(struct kvm *kvm) @@ -8870,10 +8829,6 @@ __init int vmx_hardware_setup(void) return r; } - r = alloc_kvm_area(); - if (r) - goto err_kvm_area; - kvm_set_posted_intr_wakeup_handler(pi_wakeup_handler); /* @@ -8900,11 +8855,6 @@ __init int vmx_hardware_setup(void) kvm_caps.inapplicable_quirks &= ~KVM_X86_QUIRK_IGNORE_GUEST_PAT; return 0; - -err_kvm_area: - if (nested) - nested_vmx_hardware_unsetup(); - return r; } void vmx_exit(void) diff --git a/arch/x86/virt/hw.c b/arch/x86/virt/hw.c index df3dc18d19b4..56972f594d90 100644 --- a/arch/x86/virt/hw.c +++ b/arch/x86/virt/hw.c @@ -1,7 +1,78 @@ // SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include #include +#include +#include +#include +#include #include +#include __visible bool virt_rebooting; EXPORT_SYMBOL_FOR_KVM(virt_rebooting); + +#if IS_ENABLED(CONFIG_KVM_INTEL) +DEFINE_PER_CPU(struct vmcs *, root_vmcs); +EXPORT_PER_CPU_SYMBOL(root_vmcs); + +static __init void x86_vmx_exit(void) +{ + int cpu; + + for_each_possible_cpu(cpu) { + free_page((unsigned long)per_cpu(root_vmcs, cpu)); + per_cpu(root_vmcs, cpu) = NULL; + } +} + +static __init int x86_vmx_init(void) +{ + u64 basic_msr; + u32 rev_id; + int cpu; + + if (!cpu_feature_enabled(X86_FEATURE_VMX)) + return -EOPNOTSUPP; + + rdmsrq(MSR_IA32_VMX_BASIC, basic_msr); + + /* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. 
*/ + if (WARN_ON_ONCE(vmx_basic_vmcs_size(basic_msr) > PAGE_SIZE)) + return -EIO; + + /* + * Even if eVMCS is enabled (or will be enabled?), and even though not + * explicitly documented by TLFS, the root VMCS passed to VMXON should + * still be marked with the revision_id reported by the physical CPU. + */ + rev_id = vmx_basic_vmcs_revision_id(basic_msr); + + for_each_possible_cpu(cpu) { + int node = cpu_to_node(cpu); + struct page *page; + struct vmcs *vmcs; + + page = __alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0); + if (!page) { + x86_vmx_exit(); + return -ENOMEM; + } + + vmcs = page_address(page); + vmcs->hdr.revision_id = rev_id; + per_cpu(root_vmcs, cpu) = vmcs; + } + + return 0; +} +#else +static __init int x86_vmx_init(void) { return -EOPNOTSUPP; } +#endif + +void __init x86_virt_init(void) +{ + x86_vmx_init(); +} From 95e4adb24ff6e876f6d2b960cf922d3c969d4dc4 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:51 -0800 Subject: [PATCH 054/373] x86/virt: Force-clear X86_FEATURE_VMX if configuring root VMCS fails If allocating and configuring a root VMCS fails, clear X86_FEATURE_VMX on all CPUs so that KVM doesn't need to manually check root_vmcs. As added bonuses, clearing VMX will reflect that VMX is unusable in /proc/cpuinfo, and will avoid a futile auto-probe of kvm-intel.ko. WARN if allocating a root VMCS page fails, e.g. to help users figure out why VMX is broken in the unlikely scenario something goes sideways during boot (and because the allocation should succeed unless there's a kernel bug). Tweak KVM's error message to suggest checking kernel logs if VMX is unsupported (in addition to checking BIOS).
Tested-by: Chao Gao Reviewed-by: Dan Williams Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-6-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx.c | 7 ++++--- arch/x86/virt/hw.c | 14 ++++++++++++-- 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index abd4830f71d8..e767835a4f3a 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2927,14 +2927,15 @@ static bool __kvm_is_vmx_supported(void) return false; } - if (!this_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) || - !this_cpu_has(X86_FEATURE_VMX)) { + if (!this_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL)) { pr_err("VMX not enabled (by BIOS) in MSR_IA32_FEAT_CTL on CPU %d\n", cpu); return false; } - if (!per_cpu(root_vmcs, cpu)) + if (!this_cpu_has(X86_FEATURE_VMX)) { + pr_err("VMX not fully enabled on CPU %d. Check kernel logs and/or BIOS\n", cpu); return false; + } return true; } diff --git a/arch/x86/virt/hw.c b/arch/x86/virt/hw.c index 56972f594d90..40495872fdfb 100644 --- a/arch/x86/virt/hw.c +++ b/arch/x86/virt/hw.c @@ -28,7 +28,7 @@ static __init void x86_vmx_exit(void) } } -static __init int x86_vmx_init(void) +static __init int __x86_vmx_init(void) { u64 basic_msr; u32 rev_id; @@ -56,7 +56,7 @@ static __init int x86_vmx_init(void) struct vmcs *vmcs; page = __alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0); - if (!page) { + if (WARN_ON_ONCE(!page)) { x86_vmx_exit(); return -ENOMEM; } @@ -68,6 +68,16 @@ static __init int x86_vmx_init(void) return 0; } + +static __init int x86_vmx_init(void) +{ + int r; + + r = __x86_vmx_init(); + if (r) + setup_clear_cpu_cap(X86_FEATURE_VMX); + return r; +} #else static __init int x86_vmx_init(void) { return -EOPNOTSUPP; } #endif From 920da4f75519a3fa3fe2fc25458445b561653610 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:52 -0800 Subject: [PATCH 055/373] KVM: VMX: Move core VMXON enablement to kernel Move the 
innermost VMXON+VMXOFF logic out of KVM and into core x86 so that TDX can (eventually) force VMXON without having to rely on KVM being loaded, e.g. to do SEAMCALLs during initialization. Opportunistically update the comment regarding emergency disabling via NMI to clarify that virt_rebooting will be set by _another_ emergency callback, i.e. that virt_rebooting doesn't need to be set before VMCLEAR, only before _this_ invocation does VMXOFF. Acked-by: Dave Hansen Tested-by: Chao Gao Reviewed-by: Dan Williams Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-7-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/events/intel/pt.c | 1 - arch/x86/include/asm/virt.h | 6 +-- arch/x86/kvm/vmx/vmx.c | 73 +++---------------------------- arch/x86/virt/hw.c | 85 ++++++++++++++++++++++++++++++++++++- 4 files changed, 92 insertions(+), 73 deletions(-) diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c index 44524a387c58..b5726b50e77d 100644 --- a/arch/x86/events/intel/pt.c +++ b/arch/x86/events/intel/pt.c @@ -1591,7 +1591,6 @@ void intel_pt_handle_vmx(int on) local_irq_restore(flags); } -EXPORT_SYMBOL_FOR_KVM(intel_pt_handle_vmx); /* * PMU callbacks diff --git a/arch/x86/include/asm/virt.h b/arch/x86/include/asm/virt.h index 0da6db4f5b0c..cca0210a5c16 100644 --- a/arch/x86/include/asm/virt.h +++ b/arch/x86/include/asm/virt.h @@ -2,8 +2,6 @@ #ifndef _ASM_X86_VIRT_H #define _ASM_X86_VIRT_H -#include - #include #if IS_ENABLED(CONFIG_KVM_X86) @@ -12,7 +10,9 @@ extern bool virt_rebooting; void __init x86_virt_init(void); #if IS_ENABLED(CONFIG_KVM_INTEL) -DECLARE_PER_CPU(struct vmcs *, root_vmcs); +int x86_vmx_enable_virtualization_cpu(void); +int x86_vmx_disable_virtualization_cpu(void); +void x86_vmx_emergency_disable_virtualization_cpu(void); #endif #else diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e767835a4f3a..36238cc694fd 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@
-786,41 +786,16 @@ static int vmx_set_guest_uret_msr(struct vcpu_vmx *vmx, return ret; } -/* - * Disable VMX and clear CR4.VMXE (even if VMXOFF faults) - * - * Note, VMXOFF causes a #UD if the CPU is !post-VMXON, but it's impossible to - * atomically track post-VMXON state, e.g. this may be called in NMI context. - * Eat all faults as all other faults on VMXOFF faults are mode related, i.e. - * faults are guaranteed to be due to the !post-VMXON check unless the CPU is - * magically in RM, VM86, compat mode, or at CPL>0. - */ -static int kvm_cpu_vmxoff(void) -{ - asm goto("1: vmxoff\n\t" - _ASM_EXTABLE(1b, %l[fault]) - ::: "cc", "memory" : fault); - - cr4_clear_bits(X86_CR4_VMXE); - return 0; - -fault: - cr4_clear_bits(X86_CR4_VMXE); - return -EIO; -} - void vmx_emergency_disable_virtualization_cpu(void) { int cpu = raw_smp_processor_id(); struct loaded_vmcs *v; - virt_rebooting = true; - /* * Note, CR4.VMXE can be _cleared_ in NMI context, but it can only be - * set in task context. If this races with VMX is disabled by an NMI, - * VMCLEAR and VMXOFF may #UD, but KVM will eat those faults due to - * virt_rebooting set. + * set in task context. If this races with _another_ emergency call + * from NMI context, VMCLEAR may #UD, but KVM will eat those faults due + * to virt_rebooting being set by the interrupting NMI callback. 
*/ if (!(__read_cr4() & X86_CR4_VMXE)) return; @@ -832,7 +807,7 @@ void vmx_emergency_disable_virtualization_cpu(void) vmcs_clear(v->shadow_vmcs); } - kvm_cpu_vmxoff(); + x86_vmx_emergency_disable_virtualization_cpu(); } static void __loaded_vmcs_clear(void *arg) @@ -2988,34 +2963,9 @@ int vmx_check_processor_compat(void) return 0; } -static int kvm_cpu_vmxon(u64 vmxon_pointer) -{ - u64 msr; - - cr4_set_bits(X86_CR4_VMXE); - - asm goto("1: vmxon %[vmxon_pointer]\n\t" - _ASM_EXTABLE(1b, %l[fault]) - : : [vmxon_pointer] "m"(vmxon_pointer) - : : fault); - return 0; - -fault: - WARN_ONCE(1, "VMXON faulted, MSR_IA32_FEAT_CTL (0x3a) = 0x%llx\n", - rdmsrq_safe(MSR_IA32_FEAT_CTL, &msr) ? 0xdeadbeef : msr); - cr4_clear_bits(X86_CR4_VMXE); - - return -EFAULT; -} - int vmx_enable_virtualization_cpu(void) { int cpu = raw_smp_processor_id(); - u64 phys_addr = __pa(per_cpu(root_vmcs, cpu)); - int r; - - if (cr4_read_shadow() & X86_CR4_VMXE) - return -EBUSY; /* * This can happen if we hot-added a CPU but failed to allocate @@ -3024,15 +2974,7 @@ int vmx_enable_virtualization_cpu(void) if (kvm_is_using_evmcs() && !hv_get_vp_assist_page(cpu)) return -EFAULT; - intel_pt_handle_vmx(1); - - r = kvm_cpu_vmxon(phys_addr); - if (r) { - intel_pt_handle_vmx(0); - return r; - } - - return 0; + return x86_vmx_enable_virtualization_cpu(); } static void vmclear_local_loaded_vmcss(void) @@ -3049,12 +2991,9 @@ void vmx_disable_virtualization_cpu(void) { vmclear_local_loaded_vmcss(); - if (kvm_cpu_vmxoff()) - kvm_spurious_fault(); + x86_vmx_disable_virtualization_cpu(); hv_reset_evmcs(); - - intel_pt_handle_vmx(0); } struct vmcs *alloc_vmcs_cpu(bool shadow, int cpu, gfp_t flags) diff --git a/arch/x86/virt/hw.c b/arch/x86/virt/hw.c index 40495872fdfb..dc426c2bc24a 100644 --- a/arch/x86/virt/hw.c +++ b/arch/x86/virt/hw.c @@ -15,8 +15,89 @@ __visible bool virt_rebooting; EXPORT_SYMBOL_FOR_KVM(virt_rebooting); #if IS_ENABLED(CONFIG_KVM_INTEL) -DEFINE_PER_CPU(struct vmcs *, root_vmcs); 
-EXPORT_PER_CPU_SYMBOL(root_vmcs); +static DEFINE_PER_CPU(struct vmcs *, root_vmcs); + +static int x86_virt_cpu_vmxon(void) +{ + u64 vmxon_pointer = __pa(per_cpu(root_vmcs, raw_smp_processor_id())); + u64 msr; + + cr4_set_bits(X86_CR4_VMXE); + + asm goto("1: vmxon %[vmxon_pointer]\n\t" + _ASM_EXTABLE(1b, %l[fault]) + : : [vmxon_pointer] "m"(vmxon_pointer) + : : fault); + return 0; + +fault: + WARN_ONCE(1, "VMXON faulted, MSR_IA32_FEAT_CTL (0x3a) = 0x%llx\n", + rdmsrq_safe(MSR_IA32_FEAT_CTL, &msr) ? 0xdeadbeef : msr); + cr4_clear_bits(X86_CR4_VMXE); + + return -EFAULT; +} + +int x86_vmx_enable_virtualization_cpu(void) +{ + int r; + + if (cr4_read_shadow() & X86_CR4_VMXE) + return -EBUSY; + + intel_pt_handle_vmx(1); + + r = x86_virt_cpu_vmxon(); + if (r) { + intel_pt_handle_vmx(0); + return r; + } + + return 0; +} +EXPORT_SYMBOL_FOR_KVM(x86_vmx_enable_virtualization_cpu); + +/* + * Disable VMX and clear CR4.VMXE (even if VMXOFF faults) + * + * Note, VMXOFF causes a #UD if the CPU is !post-VMXON, but it's impossible to + * atomically track post-VMXON state, e.g. this may be called in NMI context. + * Eat all faults as all other faults on VMXOFF faults are mode related, i.e. + * faults are guaranteed to be due to the !post-VMXON check unless the CPU is + * magically in RM, VM86, compat mode, or at CPL>0. + */ +int x86_vmx_disable_virtualization_cpu(void) +{ + int r = -EIO; + + asm goto("1: vmxoff\n\t" + _ASM_EXTABLE(1b, %l[fault]) + ::: "cc", "memory" : fault); + r = 0; + +fault: + cr4_clear_bits(X86_CR4_VMXE); + intel_pt_handle_vmx(0); + return r; +} +EXPORT_SYMBOL_FOR_KVM(x86_vmx_disable_virtualization_cpu); + +void x86_vmx_emergency_disable_virtualization_cpu(void) +{ + virt_rebooting = true; + + /* + * Note, CR4.VMXE can be _cleared_ in NMI context, but it can only be + * set in task context. 
If this races with _another_ emergency call + from NMI context, VMXOFF may #UD, but kernel will eat those faults + due to virt_rebooting being set by the interrupting NMI callback. + */ + if (!(__read_cr4() & X86_CR4_VMXE)) + return; + + x86_vmx_disable_virtualization_cpu(); +} +EXPORT_SYMBOL_FOR_KVM(x86_vmx_emergency_disable_virtualization_cpu); static __init void x86_vmx_exit(void) { From 32d76cdfa1222c88262da5b12e0b2bba444c96fa Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:53 -0800 Subject: [PATCH 056/373] KVM: SVM: Move core EFER.SVME enablement to kernel Move the innermost EFER.SVME logic out of KVM and into core x86 to land the SVM support alongside VMX support. This will allow providing a more unified API from the kernel to KVM, and will allow moving the bulk of the emergency disabling insanity out of KVM without having a weird split between kernel and KVM for SVM vs. VMX. No functional change intended. Tested-by: Chao Gao Reviewed-by: Dan Williams Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-8-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/virt.h | 6 +++++ arch/x86/kvm/svm/svm.c | 34 +++++------------------- arch/x86/virt/hw.c | 53 +++++++++++++++++++++++++++++++++++++ 3 files changed, 66 insertions(+), 27 deletions(-) diff --git a/arch/x86/include/asm/virt.h b/arch/x86/include/asm/virt.h index cca0210a5c16..9a0753eaa20c 100644 --- a/arch/x86/include/asm/virt.h +++ b/arch/x86/include/asm/virt.h @@ -15,6 +15,12 @@ int x86_vmx_disable_virtualization_cpu(void); void x86_vmx_emergency_disable_virtualization_cpu(void); #endif +#if IS_ENABLED(CONFIG_KVM_AMD) +int x86_svm_enable_virtualization_cpu(void); +int x86_svm_disable_virtualization_cpu(void); +void x86_svm_emergency_disable_virtualization_cpu(void); +#endif + #else static __always_inline void x86_virt_init(void) {} #endif diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index
0ae66c770ebc..fc08450cb4b7 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -478,27 +478,9 @@ static __always_inline struct sev_es_save_area *sev_es_host_save_area(struct svm return &sd->save_area->host_sev_es_save; } -static inline void kvm_cpu_svm_disable(void) -{ - uint64_t efer; - - wrmsrq(MSR_VM_HSAVE_PA, 0); - rdmsrq(MSR_EFER, efer); - if (efer & EFER_SVME) { - /* - * Force GIF=1 prior to disabling SVM, e.g. to ensure INIT and - * NMI aren't blocked. - */ - stgi(); - wrmsrq(MSR_EFER, efer & ~EFER_SVME); - } -} - static void svm_emergency_disable_virtualization_cpu(void) { - virt_rebooting = true; - - kvm_cpu_svm_disable(); + wrmsrq(MSR_VM_HSAVE_PA, 0); } static void svm_disable_virtualization_cpu(void) @@ -507,7 +489,8 @@ static void svm_disable_virtualization_cpu(void) if (tsc_scaling) __svm_write_tsc_multiplier(SVM_TSC_RATIO_DEFAULT); - kvm_cpu_svm_disable(); + x86_svm_disable_virtualization_cpu(); + wrmsrq(MSR_VM_HSAVE_PA, 0); amd_pmu_disable_virt(); } @@ -516,12 +499,12 @@ static int svm_enable_virtualization_cpu(void) { struct svm_cpu_data *sd; - uint64_t efer; int me = raw_smp_processor_id(); + int r; - rdmsrq(MSR_EFER, efer); - if (efer & EFER_SVME) - return -EBUSY; + r = x86_svm_enable_virtualization_cpu(); + if (r) + return r; sd = per_cpu_ptr(&svm_data, me); sd->asid_generation = 1; @@ -529,8 +512,6 @@ static int svm_enable_virtualization_cpu(void) sd->next_asid = sd->max_asid + 1; sd->min_asid = max_sev_asid + 1; - wrmsrq(MSR_EFER, efer | EFER_SVME); - wrmsrq(MSR_VM_HSAVE_PA, sd->save_area_pa); if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) { @@ -541,7 +522,6 @@ static int svm_enable_virtualization_cpu(void) __svm_write_tsc_multiplier(SVM_TSC_RATIO_DEFAULT); } - /* * Get OSVW bits. 
* diff --git a/arch/x86/virt/hw.c b/arch/x86/virt/hw.c index dc426c2bc24a..014e9dfab805 100644 --- a/arch/x86/virt/hw.c +++ b/arch/x86/virt/hw.c @@ -163,6 +163,59 @@ static __init int x86_vmx_init(void) static __init int x86_vmx_init(void) { return -EOPNOTSUPP; } #endif +#if IS_ENABLED(CONFIG_KVM_AMD) +int x86_svm_enable_virtualization_cpu(void) +{ + u64 efer; + + if (!cpu_feature_enabled(X86_FEATURE_SVM)) + return -EOPNOTSUPP; + + rdmsrq(MSR_EFER, efer); + if (efer & EFER_SVME) + return -EBUSY; + + wrmsrq(MSR_EFER, efer | EFER_SVME); + return 0; +} +EXPORT_SYMBOL_FOR_KVM(x86_svm_enable_virtualization_cpu); + +int x86_svm_disable_virtualization_cpu(void) +{ + int r = -EIO; + u64 efer; + + /* + * Force GIF=1 prior to disabling SVM, e.g. to ensure INIT and + * NMI aren't blocked. + */ + asm goto("1: stgi\n\t" + _ASM_EXTABLE(1b, %l[fault]) + ::: "memory" : fault); + r = 0; + +fault: + rdmsrq(MSR_EFER, efer); + wrmsrq(MSR_EFER, efer & ~EFER_SVME); + return r; +} +EXPORT_SYMBOL_FOR_KVM(x86_svm_disable_virtualization_cpu); + +void x86_svm_emergency_disable_virtualization_cpu(void) +{ + u64 efer; + + virt_rebooting = true; + + rdmsrq(MSR_EFER, efer); + if (!(efer & EFER_SVME)) + return; + + x86_svm_disable_virtualization_cpu(); +} +EXPORT_SYMBOL_FOR_KVM(x86_svm_emergency_disable_virtualization_cpu); +#endif + void __init x86_virt_init(void) { x86_vmx_init(); From 428afac5a8ea9c55bb8408e02dc92b8f85bf5f30 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:54 -0800 Subject: [PATCH 057/373] KVM: x86: Move bulk of emergency virtualization logic to virt subsystem Move the majority of the code related to disabling hardware virtualization in emergency from KVM into the virt subsystem so that virt can take full ownership of the state of SVM/VMX. This will allow refcounting usage of SVM/VMX so that KVM and the TDX subsystem can enable VMX without stomping on each other.
To route the emergency callback to the "right" vendor code, and to avoid mixing vendor and generic code, implement an x86_virt_ops structure to track the emergency callback, along with the SVM vs. VMX (vs. "none") feature that is active. To avoid having to choose between SVM and VMX, simply refuse to enable either if both are somehow supported. No known CPU supports both SVM and VMX, and it's comically unlikely such a CPU will ever exist. Leave KVM's clearing of loaded VMCSes and MSR_VM_HSAVE_PA in KVM, via a callback explicitly scoped to KVM. Loading VMCSes and saving/restoring host state are firmly tied to running VMs, and thus are (a) KVM's responsibility and (b) operations that are still exclusively reserved for KVM (as far as in-tree code is concerned). I.e. the contract being established is that non-KVM subsystems can utilize virtualization, but for all intents and purposes cannot act as full-blown hypervisors. Reviewed-by: Chao Gao Tested-by: Chao Gao Reviewed-by: Dan Williams Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-9-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 3 +- arch/x86/include/asm/reboot.h | 11 --- arch/x86/include/asm/virt.h | 9 ++- arch/x86/kernel/crash.c | 3 +- arch/x86/kernel/reboot.c | 63 ++-------------- arch/x86/kernel/smp.c | 5 +- arch/x86/kvm/vmx/vmx.c | 11 --- arch/x86/kvm/x86.c | 4 +- arch/x86/virt/hw.c | 123 +++++++++++++++++++++++++++++--- 9 files changed, 138 insertions(+), 94 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ff07c45e3c73..0bda52fbcae5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -40,7 +40,8 @@ #include #include #include -#include +#include + #include #define __KVM_HAVE_ARCH_VCPU_DEBUGFS diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h index ecd58ea9a837..a671a1145906 100644 --- a/arch/x86/include/asm/reboot.h +++
b/arch/x86/include/asm/reboot.h @@ -25,17 +25,6 @@ void __noreturn machine_real_restart(unsigned int type); #define MRR_BIOS 0 #define MRR_APM 1 -typedef void (cpu_emergency_virt_cb)(void); -#if IS_ENABLED(CONFIG_KVM_X86) -void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback); -void cpu_emergency_unregister_virt_callback(cpu_emergency_virt_cb *callback); -void cpu_emergency_disable_virtualization(void); -#else -static inline void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback) {} -static inline void cpu_emergency_unregister_virt_callback(cpu_emergency_virt_cb *callback) {} -static inline void cpu_emergency_disable_virtualization(void) {} -#endif /* CONFIG_KVM_X86 */ - typedef void (*nmi_shootdown_cb)(int, struct pt_regs*); void nmi_shootdown_cpus(nmi_shootdown_cb callback); void run_crash_ipi_callback(struct pt_regs *regs); diff --git a/arch/x86/include/asm/virt.h b/arch/x86/include/asm/virt.h index 9a0753eaa20c..2c35534437e0 100644 --- a/arch/x86/include/asm/virt.h +++ b/arch/x86/include/asm/virt.h @@ -4,6 +4,8 @@ #include +typedef void (cpu_emergency_virt_cb)(void); + #if IS_ENABLED(CONFIG_KVM_X86) extern bool virt_rebooting; @@ -12,17 +14,20 @@ void __init x86_virt_init(void); #if IS_ENABLED(CONFIG_KVM_INTEL) int x86_vmx_enable_virtualization_cpu(void); int x86_vmx_disable_virtualization_cpu(void); -void x86_vmx_emergency_disable_virtualization_cpu(void); #endif #if IS_ENABLED(CONFIG_KVM_AMD) int x86_svm_enable_virtualization_cpu(void); int x86_svm_disable_virtualization_cpu(void); -void x86_svm_emergency_disable_virtualization_cpu(void); #endif +int x86_virt_emergency_disable_virtualization_cpu(void); + +void x86_virt_register_emergency_callback(cpu_emergency_virt_cb *callback); +void x86_virt_unregister_emergency_callback(cpu_emergency_virt_cb *callback); #else static __always_inline void x86_virt_init(void) {} +static inline int x86_virt_emergency_disable_virtualization_cpu(void) { return -ENOENT; } #endif #endif /* 
_ASM_X86_VIRT_H */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index 335fd2ee9766..cd796818d94d 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -42,6 +42,7 @@ #include #include #include +#include /* Used while preparing memory map entries for second kernel */ struct crash_memmap_data { @@ -111,7 +112,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs) crash_smp_send_stop(); - cpu_emergency_disable_virtualization(); + x86_virt_emergency_disable_virtualization_cpu(); /* * Disable Intel PT to stop its logging diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index 6032fa9ec753..0bab8863375a 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include @@ -532,51 +533,6 @@ static inline void kb_wait(void) static inline void nmi_shootdown_cpus_on_restart(void); #if IS_ENABLED(CONFIG_KVM_X86) -/* RCU-protected callback to disable virtualization prior to reboot. */ -static cpu_emergency_virt_cb __rcu *cpu_emergency_virt_callback; - -void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback) -{ - if (WARN_ON_ONCE(rcu_access_pointer(cpu_emergency_virt_callback))) - return; - - rcu_assign_pointer(cpu_emergency_virt_callback, callback); -} -EXPORT_SYMBOL_FOR_KVM(cpu_emergency_register_virt_callback); - -void cpu_emergency_unregister_virt_callback(cpu_emergency_virt_cb *callback) -{ - if (WARN_ON_ONCE(rcu_access_pointer(cpu_emergency_virt_callback) != callback)) - return; - - rcu_assign_pointer(cpu_emergency_virt_callback, NULL); - synchronize_rcu(); -} -EXPORT_SYMBOL_FOR_KVM(cpu_emergency_unregister_virt_callback); - -/* - * Disable virtualization, i.e. VMX or SVM, to ensure INIT is recognized during - * reboot. VMX blocks INIT if the CPU is post-VMXON, and SVM blocks INIT if - * GIF=0, i.e. if the crash occurred between CLGI and STGI. 
- */ -void cpu_emergency_disable_virtualization(void) -{ - cpu_emergency_virt_cb *callback; - - /* - * IRQs must be disabled as KVM enables virtualization in hardware via - * function call IPIs, i.e. IRQs need to be disabled to guarantee - * virtualization stays disabled. - */ - lockdep_assert_irqs_disabled(); - - rcu_read_lock(); - callback = rcu_dereference(cpu_emergency_virt_callback); - if (callback) - callback(); - rcu_read_unlock(); -} - static void emergency_reboot_disable_virtualization(void) { local_irq_disable(); @@ -588,16 +544,11 @@ static void emergency_reboot_disable_virtualization(void) * We can't take any locks and we may be on an inconsistent state, so * use NMIs as IPIs to tell the other CPUs to disable VMX/SVM and halt. * - * Do the NMI shootdown even if virtualization is off on _this_ CPU, as - * other CPUs may have virtualization enabled. + * Safely force _this_ CPU out of VMX/SVM operation, and if necessary, + * blast NMIs to force other CPUs out of VMX/SVM as well. */ - if (rcu_access_pointer(cpu_emergency_virt_callback)) { - /* Safely force _this_ CPU out of VMX/SVM operation. */ - cpu_emergency_disable_virtualization(); - - /* Disable VMX/SVM and halt on other CPUs. */ + if (!x86_virt_emergency_disable_virtualization_cpu()) nmi_shootdown_cpus_on_restart(); - } } #else static void emergency_reboot_disable_virtualization(void) { } @@ -875,10 +826,10 @@ static int crash_nmi_callback(unsigned int val, struct pt_regs *regs) shootdown_callback(cpu, regs); /* - * Prepare the CPU for reboot _after_ invoking the callback so that the - * callback can safely use virtualization instructions, e.g. VMCLEAR. + * Disable virtualization, as both VMX and SVM can block INIT and thus + * prevent AP bringup, e.g. in a kdump kernel or in firmware.
*/ - cpu_emergency_disable_virtualization(); + x86_virt_emergency_disable_virtualization_cpu(); atomic_dec(&waiting_for_crash_ipi); diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c index b014e6d229f9..cbf95fe2b207 100644 --- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -35,6 +35,7 @@ #include #include #include +#include /* * Some notes on x86 processor bugs affecting SMP operation: @@ -124,7 +125,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs) if (raw_smp_processor_id() == atomic_read(&stopping_cpu)) return NMI_HANDLED; - cpu_emergency_disable_virtualization(); + x86_virt_emergency_disable_virtualization_cpu(); stop_this_cpu(NULL); return NMI_HANDLED; @@ -136,7 +137,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs) DEFINE_IDTENTRY_SYSVEC(sysvec_reboot) { apic_eoi(); - cpu_emergency_disable_virtualization(); + x86_virt_emergency_disable_virtualization_cpu(); stop_this_cpu(NULL); } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 36238cc694fd..c02fd7e91809 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -791,23 +791,12 @@ void vmx_emergency_disable_virtualization_cpu(void) int cpu = raw_smp_processor_id(); struct loaded_vmcs *v; - /* - * Note, CR4.VMXE can be _cleared_ in NMI context, but it can only be - * set in task context. If this races with _another_ emergency call - * from NMI context, VMCLEAR may #UD, but KVM will eat those faults due - * to virt_rebooting being set by the interrupting NMI callback. 
- */ - if (!(__read_cr4() & X86_CR4_VMXE)) - return; - list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu), loaded_vmcss_on_cpu_link) { vmcs_clear(v->vmcs); if (v->shadow_vmcs) vmcs_clear(v->shadow_vmcs); } - - x86_vmx_emergency_disable_virtualization_cpu(); } static void __loaded_vmcs_clear(void *arg) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 91a20fffedc3..93896099417d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -13075,12 +13075,12 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_vcpu_deliver_sipi_vector); void kvm_arch_enable_virtualization(void) { - cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_disable_virtualization_cpu); + x86_virt_register_emergency_callback(kvm_x86_ops.emergency_disable_virtualization_cpu); } void kvm_arch_disable_virtualization(void) { - cpu_emergency_unregister_virt_callback(kvm_x86_ops.emergency_disable_virtualization_cpu); + x86_virt_unregister_emergency_callback(kvm_x86_ops.emergency_disable_virtualization_cpu); } int kvm_arch_enable_virtualization_cpu(void) diff --git a/arch/x86/virt/hw.c b/arch/x86/virt/hw.c index 014e9dfab805..73c8309ba3fb 100644 --- a/arch/x86/virt/hw.c +++ b/arch/x86/virt/hw.c @@ -11,9 +11,45 @@ #include #include +struct x86_virt_ops { + int feature; + void (*emergency_disable_virtualization_cpu)(void); +}; +static struct x86_virt_ops virt_ops __ro_after_init; + __visible bool virt_rebooting; EXPORT_SYMBOL_FOR_KVM(virt_rebooting); +static cpu_emergency_virt_cb __rcu *kvm_emergency_callback; + +void x86_virt_register_emergency_callback(cpu_emergency_virt_cb *callback) +{ + if (WARN_ON_ONCE(rcu_access_pointer(kvm_emergency_callback))) + return; + + rcu_assign_pointer(kvm_emergency_callback, callback); +} +EXPORT_SYMBOL_FOR_KVM(x86_virt_register_emergency_callback); + +void x86_virt_unregister_emergency_callback(cpu_emergency_virt_cb *callback) +{ + if (WARN_ON_ONCE(rcu_access_pointer(kvm_emergency_callback) != callback)) + return; + + rcu_assign_pointer(kvm_emergency_callback, 
NULL); + synchronize_rcu(); +} +EXPORT_SYMBOL_FOR_KVM(x86_virt_unregister_emergency_callback); + +static void x86_virt_invoke_kvm_emergency_callback(void) +{ + cpu_emergency_virt_cb *kvm_callback; + + kvm_callback = rcu_dereference(kvm_emergency_callback); + if (kvm_callback) + kvm_callback(); +} + #if IS_ENABLED(CONFIG_KVM_INTEL) static DEFINE_PER_CPU(struct vmcs *, root_vmcs); @@ -42,6 +78,9 @@ int x86_vmx_enable_virtualization_cpu(void) { int r; + if (virt_ops.feature != X86_FEATURE_VMX) + return -EOPNOTSUPP; + if (cr4_read_shadow() & X86_CR4_VMXE) return -EBUSY; @@ -82,22 +121,24 @@ fault: } EXPORT_SYMBOL_FOR_KVM(x86_vmx_disable_virtualization_cpu); -void x86_vmx_emergency_disable_virtualization_cpu(void) +static void x86_vmx_emergency_disable_virtualization_cpu(void) { virt_rebooting = true; /* * Note, CR4.VMXE can be _cleared_ in NMI context, but it can only be * set in task context. If this races with _another_ emergency call - * from NMI context, VMXOFF may #UD, but kernel will eat those faults - * due to virt_rebooting being set by the interrupting NMI callback. + * from NMI context, VMCLEAR (in KVM) and VMXOFF may #UD, but KVM and + * the kernel will eat those faults due to virt_rebooting being set by + * the interrupting NMI callback. 
*/ if (!(__read_cr4() & X86_CR4_VMXE)) return; + x86_virt_invoke_kvm_emergency_callback(); + x86_vmx_disable_virtualization_cpu(); } -EXPORT_SYMBOL_FOR_KVM(x86_vmx_emergency_disable_virtualization_cpu); static __init void x86_vmx_exit(void) { @@ -111,6 +152,11 @@ static __init void x86_vmx_exit(void) static __init int __x86_vmx_init(void) { + const struct x86_virt_ops vmx_ops = { + .feature = X86_FEATURE_VMX, + .emergency_disable_virtualization_cpu = x86_vmx_emergency_disable_virtualization_cpu, + }; + u64 basic_msr; u32 rev_id; int cpu; @@ -147,6 +193,7 @@ static __init int __x86_vmx_init(void) per_cpu(root_vmcs, cpu) = vmcs; } + memcpy(&virt_ops, &vmx_ops, sizeof(virt_ops)); return 0; } @@ -161,6 +208,7 @@ static __init int x86_vmx_init(void) } #else static __init int x86_vmx_init(void) { return -EOPNOTSUPP; } +static __init void x86_vmx_exit(void) { } #endif #if IS_ENABLED(CONFIG_KVM_AMD) @@ -168,7 +216,7 @@ int x86_svm_enable_virtualization_cpu(void) { u64 efer; - if (!cpu_feature_enabled(X86_FEATURE_SVM)) + if (virt_ops.feature != X86_FEATURE_SVM) return -EOPNOTSUPP; rdmsrq(MSR_EFER, efer); @@ -201,7 +249,7 @@ fault: } EXPORT_SYMBOL_FOR_KVM(x86_svm_disable_virtualization_cpu); -void x86_svm_emergency_disable_virtualization_cpu(void) +static void x86_svm_emergency_disable_virtualization_cpu(void) { u64 efer; @@ -211,12 +259,71 @@ void x86_svm_emergency_disable_virtualization_cpu(void) if (!(efer & EFER_SVME)) return; + x86_virt_invoke_kvm_emergency_callback(); + x86_svm_disable_virtualization_cpu(); } -EXPORT_SYMBOL_FOR_KVM(x86_svm_emergency_disable_virtualization_cpu); + +static __init int x86_svm_init(void) +{ + const struct x86_virt_ops svm_ops = { + .feature = X86_FEATURE_SVM, + .emergency_disable_virtualization_cpu = x86_svm_emergency_disable_virtualization_cpu, + }; + + if (!cpu_feature_enabled(X86_FEATURE_SVM)) + return -EOPNOTSUPP; + + memcpy(&virt_ops, &svm_ops, sizeof(virt_ops)); + return 0; +} +#else +static __init int x86_svm_init(void) { return 
-EOPNOTSUPP; } #endif +/* + * Disable virtualization, i.e. VMX or SVM, to ensure INIT is recognized during + * reboot. VMX blocks INIT if the CPU is post-VMXON, and SVM blocks INIT if + * GIF=0, i.e. if the crash occurred between CLGI and STGI. + */ +int x86_virt_emergency_disable_virtualization_cpu(void) +{ + /* Ensure the !feature check can't get false positives. */ + BUILD_BUG_ON(!X86_FEATURE_SVM || !X86_FEATURE_VMX); + + if (!virt_ops.feature) + return -EOPNOTSUPP; + + /* + * IRQs must be disabled as virtualization is enabled in hardware via + * function call IPIs, i.e. IRQs need to be disabled to guarantee + * virtualization stays disabled. + */ + lockdep_assert_irqs_disabled(); + + /* + * Do the NMI shootdown even if virtualization is off on _this_ CPU, as + * other CPUs may have virtualization enabled. + * + * TODO: Track whether or not virtualization might be enabled on other + * CPUs? May not be worth avoiding the NMI shootdown... + */ + virt_ops.emergency_disable_virtualization_cpu(); + return 0; +} + void __init x86_virt_init(void) { - x86_vmx_init(); + /* + * Attempt to initialize both SVM and VMX, and simply use whichever one + * is present. Refuse to enable/use SVM or VMX if both are somehow + * supported. No known CPU supports both SVM and VMX. + */ + bool has_vmx = !x86_vmx_init(); + bool has_svm = !x86_svm_init(); + + if (WARN_ON_ONCE(has_vmx && has_svm)) { + x86_vmx_exit(); + memset(&virt_ops, 0, sizeof(virt_ops)); + } } From 8528a7f9c91d917ad2b3b6a71f1cb7e00b1fb1bf Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:55 -0800 Subject: [PATCH 058/373] x86/virt: Add refcounting of VMX/SVM usage to support multiple in-kernel users Implement a per-CPU refcounting scheme so that "users" of hardware virtualization, e.g. KVM and the future TDX code, can co-exist without pulling the rug out from under each other. E.g.
if KVM were to disable VMX on module unload or when the last KVM VM was destroyed, SEAMCALLs from the TDX subsystem would #UD and panic the kernel. Disable preemption in the get/put APIs to ensure virtualization is fully enabled/disabled before returning to the caller. E.g. if the task were preempted after a 0=>1 transition, the new task would see a 1=>2 and thus return without enabling virtualization. Explicitly disable preemption instead of requiring the caller to do so, because the need to disable preemption is an artifact of the implementation. E.g. from KVM's perspective there is no _need_ to disable preemption as KVM guarantees the pCPU on which it is running is stable (but preemption is enabled). Opportunistically abstract away SVM vs. VMX in the public APIs by using X86_FEATURE_{SVM,VMX} to communicate what technology the caller wants to enable and use. Cc: Xu Yilun Reviewed-by: Chao Gao Tested-by: Chao Gao Reviewed-by: Dan Williams Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-10-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/virt.h | 11 ++----- arch/x86/kvm/svm/svm.c | 4 +-- arch/x86/kvm/vmx/vmx.c | 4 +-- arch/x86/virt/hw.c | 64 +++++++++++++++++++++++++++---------- 4 files changed, 53 insertions(+), 30 deletions(-) diff --git a/arch/x86/include/asm/virt.h b/arch/x86/include/asm/virt.h index 2c35534437e0..1558a0673d06 100644 --- a/arch/x86/include/asm/virt.h +++ b/arch/x86/include/asm/virt.h @@ -11,15 +11,8 @@ extern bool virt_rebooting; void __init x86_virt_init(void); -#if IS_ENABLED(CONFIG_KVM_INTEL) -int x86_vmx_enable_virtualization_cpu(void); -int x86_vmx_disable_virtualization_cpu(void); -#endif - -#if IS_ENABLED(CONFIG_KVM_AMD) -int x86_svm_enable_virtualization_cpu(void); -int x86_svm_disable_virtualization_cpu(void); -#endif +int x86_virt_get_ref(int feat); +void x86_virt_put_ref(int feat); int x86_virt_emergency_disable_virtualization_cpu(void); diff --git a/arch/x86/kvm/svm/svm.c 
b/arch/x86/kvm/svm/svm.c index fc08450cb4b7..e4be0caa09b3 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -489,7 +489,7 @@ static void svm_disable_virtualization_cpu(void) if (tsc_scaling) __svm_write_tsc_multiplier(SVM_TSC_RATIO_DEFAULT); - x86_svm_disable_virtualization_cpu(); + x86_virt_put_ref(X86_FEATURE_SVM); wrmsrq(MSR_VM_HSAVE_PA, 0); amd_pmu_disable_virt(); @@ -502,7 +502,7 @@ static int svm_enable_virtualization_cpu(void) int me = raw_smp_processor_id(); int r; - r = x86_svm_enable_virtualization_cpu(); + r = x86_virt_get_ref(X86_FEATURE_SVM); if (r) return r; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index c02fd7e91809..6200cf4dbd26 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2963,7 +2963,7 @@ int vmx_enable_virtualization_cpu(void) if (kvm_is_using_evmcs() && !hv_get_vp_assist_page(cpu)) return -EFAULT; - return x86_vmx_enable_virtualization_cpu(); + return x86_virt_get_ref(X86_FEATURE_VMX); } static void vmclear_local_loaded_vmcss(void) @@ -2980,7 +2980,7 @@ void vmx_disable_virtualization_cpu(void) { vmclear_local_loaded_vmcss(); - x86_vmx_disable_virtualization_cpu(); + x86_virt_put_ref(X86_FEATURE_VMX); hv_reset_evmcs(); } diff --git a/arch/x86/virt/hw.c b/arch/x86/virt/hw.c index 73c8309ba3fb..c898f16fe612 100644 --- a/arch/x86/virt/hw.c +++ b/arch/x86/virt/hw.c @@ -13,6 +13,8 @@ struct x86_virt_ops { int feature; + int (*enable_virtualization_cpu)(void); + int (*disable_virtualization_cpu)(void); void (*emergency_disable_virtualization_cpu)(void); }; static struct x86_virt_ops virt_ops __ro_after_init; @@ -20,6 +22,8 @@ static struct x86_virt_ops virt_ops __ro_after_init; __visible bool virt_rebooting; EXPORT_SYMBOL_FOR_KVM(virt_rebooting); +static DEFINE_PER_CPU(int, virtualization_nr_users); + static cpu_emergency_virt_cb __rcu *kvm_emergency_callback; void x86_virt_register_emergency_callback(cpu_emergency_virt_cb *callback) @@ -74,13 +78,10 @@ fault: return -EFAULT; } -int 
x86_vmx_enable_virtualization_cpu(void) +static int x86_vmx_enable_virtualization_cpu(void) { int r; - if (virt_ops.feature != X86_FEATURE_VMX) - return -EOPNOTSUPP; - if (cr4_read_shadow() & X86_CR4_VMXE) return -EBUSY; @@ -94,7 +95,6 @@ int x86_vmx_enable_virtualization_cpu(void) return 0; } -EXPORT_SYMBOL_FOR_KVM(x86_vmx_enable_virtualization_cpu); /* * Disable VMX and clear CR4.VMXE (even if VMXOFF faults) @@ -105,7 +105,7 @@ EXPORT_SYMBOL_FOR_KVM(x86_vmx_enable_virtualization_cpu); * faults are guaranteed to be due to the !post-VMXON check unless the CPU is * magically in RM, VM86, compat mode, or at CPL>0. */ -int x86_vmx_disable_virtualization_cpu(void) +static int x86_vmx_disable_virtualization_cpu(void) { int r = -EIO; @@ -119,7 +119,6 @@ fault: intel_pt_handle_vmx(0); return r; } -EXPORT_SYMBOL_FOR_KVM(x86_vmx_disable_virtualization_cpu); static void x86_vmx_emergency_disable_virtualization_cpu(void) { @@ -154,6 +153,8 @@ static __init int __x86_vmx_init(void) { const struct x86_virt_ops vmx_ops = { .feature = X86_FEATURE_VMX, + .enable_virtualization_cpu = x86_vmx_enable_virtualization_cpu, + .disable_virtualization_cpu = x86_vmx_disable_virtualization_cpu, .emergency_disable_virtualization_cpu = x86_vmx_emergency_disable_virtualization_cpu, }; @@ -212,13 +213,10 @@ static __init void x86_vmx_exit(void) { } #endif #if IS_ENABLED(CONFIG_KVM_AMD) -int x86_svm_enable_virtualization_cpu(void) +static int x86_svm_enable_virtualization_cpu(void) { u64 efer; - if (virt_ops.feature != X86_FEATURE_SVM) - return -EOPNOTSUPP; - rdmsrq(MSR_EFER, efer); if (efer & EFER_SVME) return -EBUSY; @@ -226,9 +224,8 @@ int x86_svm_enable_virtualization_cpu(void) wrmsrq(MSR_EFER, efer | EFER_SVME); return 0; } -EXPORT_SYMBOL_FOR_KVM(x86_svm_enable_virtualization_cpu); -int x86_svm_disable_virtualization_cpu(void) +static int x86_svm_disable_virtualization_cpu(void) { int r = -EIO; u64 efer; @@ -247,7 +244,6 @@ fault: wrmsrq(MSR_EFER, efer & ~EFER_SVME); return r; } 
-EXPORT_SYMBOL_FOR_KVM(x86_svm_disable_virtualization_cpu); static void x86_svm_emergency_disable_virtualization_cpu(void) { @@ -268,6 +264,8 @@ static __init int x86_svm_init(void) { const struct x86_virt_ops svm_ops = { .feature = X86_FEATURE_SVM, + .enable_virtualization_cpu = x86_svm_enable_virtualization_cpu, + .disable_virtualization_cpu = x86_svm_disable_virtualization_cpu, .emergency_disable_virtualization_cpu = x86_svm_emergency_disable_virtualization_cpu, }; @@ -281,6 +279,41 @@ static __init int x86_svm_init(void) static __init int x86_svm_init(void) { return -EOPNOTSUPP; } #endif +int x86_virt_get_ref(int feat) +{ + int r; + + /* Ensure the !feature check can't get false positives. */ + BUILD_BUG_ON(!X86_FEATURE_SVM || !X86_FEATURE_VMX); + + if (!virt_ops.feature || virt_ops.feature != feat) + return -EOPNOTSUPP; + + guard(preempt)(); + + if (this_cpu_inc_return(virtualization_nr_users) > 1) + return 0; + + r = virt_ops.enable_virtualization_cpu(); + if (r) + WARN_ON_ONCE(this_cpu_dec_return(virtualization_nr_users)); + + return r; +} +EXPORT_SYMBOL_FOR_KVM(x86_virt_get_ref); + +void x86_virt_put_ref(int feat) +{ + guard(preempt)(); + + if (WARN_ON_ONCE(!this_cpu_read(virtualization_nr_users)) || + this_cpu_dec_return(virtualization_nr_users)) + return; + + BUG_ON(virt_ops.disable_virtualization_cpu() && !virt_rebooting); +} +EXPORT_SYMBOL_FOR_KVM(x86_virt_put_ref); + /* * Disable virtualization, i.e. VMX or SVM, to ensure INIT is recognized during * reboot. VMX blocks INIT if the CPU is post-VMXON, and SVM blocks INIT if @@ -288,9 +321,6 @@ static __init int x86_svm_init(void) { return -EOPNOTSUPP; } */ int x86_virt_emergency_disable_virtualization_cpu(void) { - /* Ensure the !feature check can't get false positives. 
*/ - BUILD_BUG_ON(!X86_FEATURE_SVM || !X86_FEATURE_VMX); - if (!virt_ops.feature) return -EOPNOTSUPP; From 0efe5dc16169b0c7d47cbb495225065c67712fbc Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:56 -0800 Subject: [PATCH 059/373] x86/virt/tdx: Drop the outdated requirement that TDX be enabled in IRQ context Remove TDX's outdated requirement that per-CPU enabling be done via IPI function call, which was a stale artifact leftover from early versions of the TDX enablement series. The requirement that IRQs be disabled should have been dropped as part of the revamped series that relied on the KVM rework to enable VMX at module load. In other words, the kernel's "requirement" was never a requirement at all, but instead a reflection of how KVM enabled VMX (via IPI callback) when the TDX subsystem code was merged. Note, accessing per-CPU information is safe even without disabling IRQs, as tdx_online_cpu() is invoked via a cpuhp callback, i.e. from a per-CPU thread.
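As an aside, the per-CPU refcounting pattern introduced earlier in this series (x86_virt_get_ref()/x86_virt_put_ref()) can be sketched in plain user-space C. The names below are illustrative stand-ins, not kernel code, and the real kernel version additionally disables preemption around the transitions and keeps one counter per CPU:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the per-CPU counter and hardware enable state. */
static int nr_users;
static bool hw_enabled;

/* Only the 0=>1 transition enables the hardware; later users just bump the count. */
static int virt_get_ref(void)
{
	if (++nr_users > 1)
		return 0;
	hw_enabled = true;	/* stands in for enable_virtualization_cpu() */
	return 0;
}

/* Only the 1=>0 transition disables the hardware. */
static void virt_put_ref(void)
{
	assert(nr_users > 0);	/* underflow would mean an unbalanced put */
	if (--nr_users)
		return;
	hw_enabled = false;	/* stands in for disable_virtualization_cpu() */
}
```

With two users holding references (e.g. KVM and TDX), a put from one leaves the hardware enabled for the other, which is exactly why a single user can no longer blindly do VMXOFF when it is done.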
Link: https://lore.kernel.org/all/ZyJOiPQnBz31qLZ7@google.com Tested-by: Chao Gao Reviewed-by: Dan Williams Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-11-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/tdx.c | 9 +-------- arch/x86/virt/vmx/tdx/tdx.c | 9 +-------- 2 files changed, 2 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index f81b562733ef..60e7ba883675 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -3293,17 +3293,10 @@ int tdx_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private) static int tdx_online_cpu(unsigned int cpu) { - unsigned long flags; - int r; - /* Sanity check CPU is already in post-VMXON */ WARN_ON_ONCE(!(cr4_read_shadow() & X86_CR4_VMXE)); - local_irq_save(flags); - r = tdx_cpu_enable(); - local_irq_restore(flags); - - return r; + return tdx_cpu_enable(); } static int tdx_offline_cpu(unsigned int cpu) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 8b8e165a2001..61cece496bdb 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -106,8 +106,7 @@ static __always_inline int sc_retry_prerr(sc_func_t func, /* * Do the module global initialization once and return its result. - * It can be done on any cpu. It's always called with interrupts - * disabled. + * It can be done on any cpu, and from task or IRQ context. */ static int try_init_module_global(void) { @@ -116,8 +115,6 @@ static int try_init_module_global(void) static bool sysinit_done; static int sysinit_ret; - lockdep_assert_irqs_disabled(); - raw_spin_lock(&sysinit_lock); if (sysinit_done) @@ -148,8 +145,6 @@ out: * global initialization SEAMCALL if not done) on local cpu to make this * cpu be ready to run any other SEAMCALLs. * - * Always call this function via IPI function calls. - * * Return 0 on success, otherwise errors. 
*/ int tdx_cpu_enable(void) @@ -160,8 +155,6 @@ int tdx_cpu_enable(void) if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM)) return -ENODEV; - lockdep_assert_irqs_disabled(); - if (__this_cpu_read(tdx_lp_initialized)) return 0; From 165e77353831a85caa0444d16f29bd6b111dd2c5 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:57 -0800 Subject: [PATCH 060/373] KVM: x86/tdx: Do VMXON and TDX-Module initialization during subsys init Now that VMXON can be done without bouncing through KVM, do TDX-Module initialization during subsys init (specifically before module_init() so that it runs before KVM when both are built-in). Aside from the obvious benefits of separating core TDX code from KVM, this will allow tagging a pile of TDX functions and globals as being __init and __ro_after_init. Reviewed-by: Dan Williams Reviewed-by: Chao Gao Acked-by: Dave Hansen Tested-by: Chao Gao Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-12-seanjc@google.com Signed-off-by: Sean Christopherson --- Documentation/arch/x86/tdx.rst | 36 +------ arch/x86/include/asm/tdx.h | 4 - arch/x86/kvm/vmx/tdx.c | 148 ++++++----------------------- arch/x86/virt/vmx/tdx/tdx.c | 168 +++++++++++++++++++-------------- arch/x86/virt/vmx/tdx/tdx.h | 8 -- 5 files changed, 130 insertions(+), 234 deletions(-) diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst index 61670e7df2f7..ff6b110291bc 100644 --- a/Documentation/arch/x86/tdx.rst +++ b/Documentation/arch/x86/tdx.rst @@ -60,44 +60,18 @@ Besides initializing the TDX module, a per-cpu initialization SEAMCALL must be done on one cpu before any other SEAMCALLs can be made on that cpu. -The kernel provides two functions, tdx_enable() and tdx_cpu_enable() to -allow the user of TDX to enable the TDX module and enable TDX on local -cpu respectively. - -Making SEAMCALL requires VMXON has been done on that CPU. Currently only -KVM implements VMXON. 
For now both tdx_enable() and tdx_cpu_enable() -don't do VMXON internally (not trivial), but depends on the caller to -guarantee that. - -To enable TDX, the caller of TDX should: 1) temporarily disable CPU -hotplug; 2) do VMXON and tdx_enable_cpu() on all online cpus; 3) call -tdx_enable(). For example:: - - cpus_read_lock(); - on_each_cpu(vmxon_and_tdx_cpu_enable()); - ret = tdx_enable(); - cpus_read_unlock(); - if (ret) - goto no_tdx; - // TDX is ready to use - -And the caller of TDX must guarantee the tdx_cpu_enable() has been -successfully done on any cpu before it wants to run any other SEAMCALL. -A typical usage is do both VMXON and tdx_cpu_enable() in CPU hotplug -online callback, and refuse to online if tdx_cpu_enable() fails. - User can consult dmesg to see whether the TDX module has been initialized. If the TDX module is initialized successfully, dmesg shows something like below:: [..] virt/tdx: 262668 KBs allocated for PAMT - [..] virt/tdx: module initialized + [..] virt/tdx: TDX-Module initialized If the TDX module failed to initialize, dmesg also shows it failed to initialize:: - [..] virt/tdx: module initialization failed ... + [..] virt/tdx: TDX-Module initialization failed ... TDX Interaction to Other Kernel Components ------------------------------------------ @@ -129,9 +103,9 @@ CPU Hotplug ~~~~~~~~~~~ TDX module requires the per-cpu initialization SEAMCALL must be done on -one cpu before any other SEAMCALLs can be made on that cpu. The kernel -provides tdx_cpu_enable() to let the user of TDX to do it when the user -wants to use a new cpu for TDX task. +one cpu before any other SEAMCALLs can be made on that cpu. The kernel, +via the CPU hotplug framework, performs the necessary initialization when +a CPU is first brought online. TDX doesn't support physical (ACPI) CPU hotplug. 
During machine boot, TDX verifies all boot-time present logical CPUs are TDX compatible before diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 6b338d7f01b7..a149740b24e8 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -145,8 +145,6 @@ static __always_inline u64 sc_retry(sc_func_t func, u64 fn, #define seamcall(_fn, _args) sc_retry(__seamcall, (_fn), (_args)) #define seamcall_ret(_fn, _args) sc_retry(__seamcall_ret, (_fn), (_args)) #define seamcall_saved_ret(_fn, _args) sc_retry(__seamcall_saved_ret, (_fn), (_args)) -int tdx_cpu_enable(void); -int tdx_enable(void); const char *tdx_dump_mce_info(struct mce *m); const struct tdx_sys_info *tdx_get_sysinfo(void); @@ -223,8 +221,6 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td); u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page); #else static inline void tdx_init(void) { } -static inline int tdx_cpu_enable(void) { return -ENODEV; } -static inline int tdx_enable(void) { return -ENODEV; } static inline u32 tdx_get_nr_guest_keyids(void) { return 0; } static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; } static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; } diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index 60e7ba883675..c0cd0d73015f 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -59,7 +59,7 @@ module_param_named(tdx, enable_tdx, bool, 0444); #define TDX_SHARED_BIT_PWL_5 gpa_to_gfn(BIT_ULL(51)) #define TDX_SHARED_BIT_PWL_4 gpa_to_gfn(BIT_ULL(47)) -static enum cpuhp_state tdx_cpuhp_state; +static enum cpuhp_state tdx_cpuhp_state __ro_after_init; static const struct tdx_sys_info *tdx_sysinfo; @@ -3293,10 +3293,7 @@ int tdx_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private) static int tdx_online_cpu(unsigned int cpu) { - /* Sanity check CPU is already in post-VMXON */ - WARN_ON_ONCE(!(cr4_read_shadow() & X86_CR4_VMXE)); - - return tdx_cpu_enable(); + 
return 0; } static int tdx_offline_cpu(unsigned int cpu) @@ -3335,51 +3332,6 @@ static int tdx_offline_cpu(unsigned int cpu) return -EBUSY; } -static void __do_tdx_cleanup(void) -{ - /* - * Once TDX module is initialized, it cannot be disabled and - * re-initialized again w/o runtime update (which isn't - * supported by kernel). Only need to remove the cpuhp here. - * The TDX host core code tracks TDX status and can handle - * 'multiple enabling' scenario. - */ - WARN_ON_ONCE(!tdx_cpuhp_state); - cpuhp_remove_state_nocalls_cpuslocked(tdx_cpuhp_state); - tdx_cpuhp_state = 0; -} - -static void __tdx_cleanup(void) -{ - cpus_read_lock(); - __do_tdx_cleanup(); - cpus_read_unlock(); -} - -static int __init __do_tdx_bringup(void) -{ - int r; - - /* - * TDX-specific cpuhp callback to call tdx_cpu_enable() on all - * online CPUs before calling tdx_enable(), and on any new - * going-online CPU to make sure it is ready for TDX guest. - */ - r = cpuhp_setup_state_cpuslocked(CPUHP_AP_ONLINE_DYN, - "kvm/cpu/tdx:online", - tdx_online_cpu, tdx_offline_cpu); - if (r < 0) - return r; - - tdx_cpuhp_state = r; - - r = tdx_enable(); - if (r) - __do_tdx_cleanup(); - - return r; -} - static int __init __tdx_bringup(void) { const struct tdx_sys_info_td_conf *td_conf; @@ -3399,34 +3351,18 @@ static int __init __tdx_bringup(void) } } - /* - * Enabling TDX requires enabling hardware virtualization first, - * as making SEAMCALLs requires CPU being in post-VMXON state. 
- */ - r = kvm_enable_virtualization(); - if (r) - return r; - - cpus_read_lock(); - r = __do_tdx_bringup(); - cpus_read_unlock(); - - if (r) - goto tdx_bringup_err; - - r = -EINVAL; /* Get TDX global information for later use */ tdx_sysinfo = tdx_get_sysinfo(); - if (WARN_ON_ONCE(!tdx_sysinfo)) - goto get_sysinfo_err; + if (!tdx_sysinfo) + return -ENODEV; /* Check TDX module and KVM capabilities */ if (!tdx_get_supported_attrs(&tdx_sysinfo->td_conf) || !tdx_get_supported_xfam(&tdx_sysinfo->td_conf)) - goto get_sysinfo_err; + return -EINVAL; if (!(tdx_sysinfo->features.tdx_features0 & MD_FIELD_ID_FEATURES0_TOPOLOGY_ENUM)) - goto get_sysinfo_err; + return -EINVAL; /* * TDX has its own limit of maximum vCPUs it can support for all @@ -3461,34 +3397,31 @@ static int __init __tdx_bringup(void) if (td_conf->max_vcpus_per_td < num_present_cpus()) { pr_err("Disable TDX: MAX_VCPU_PER_TD (%u) smaller than number of logical CPUs (%u).\n", td_conf->max_vcpus_per_td, num_present_cpus()); - goto get_sysinfo_err; + return -EINVAL; } if (misc_cg_set_capacity(MISC_CG_RES_TDX, tdx_get_nr_guest_keyids())) - goto get_sysinfo_err; + return -EINVAL; /* - * Leave hardware virtualization enabled after TDX is enabled - * successfully. TDX CPU hotplug depends on this. + * TDX-specific cpuhp callback to disallow offlining the last CPU in a + * package while KVM is running one or more TDs. Reclaiming HKIDs + * requires doing PAGE.WBINVD on every package, i.e. offlining all CPUs + * of a package would prevent reclaiming the HKID.
*/ + r = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "kvm/cpu/tdx:online", + tdx_online_cpu, tdx_offline_cpu); + if (r < 0) + goto err_cpuhup; + + tdx_cpuhp_state = r; return 0; -get_sysinfo_err: - __tdx_cleanup(); -tdx_bringup_err: - kvm_disable_virtualization(); +err_cpuhup: + misc_cg_set_capacity(MISC_CG_RES_TDX, 0); return r; } -void tdx_cleanup(void) -{ - if (enable_tdx) { - misc_cg_set_capacity(MISC_CG_RES_TDX, 0); - __tdx_cleanup(); - kvm_disable_virtualization(); - } -} - int __init tdx_bringup(void) { int r, i; @@ -3520,39 +3453,11 @@ int __init tdx_bringup(void) goto success_disable_tdx; } - if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) { - pr_err("tdx: MOVDIR64B is required for TDX\n"); - goto success_disable_tdx; - } - - if (!cpu_feature_enabled(X86_FEATURE_SELFSNOOP)) { - pr_err("Self-snoop is required for TDX\n"); - goto success_disable_tdx; - } - if (!cpu_feature_enabled(X86_FEATURE_TDX_HOST_PLATFORM)) { - pr_err("tdx: no TDX private KeyIDs available\n"); + pr_err("TDX not supported by the host platform\n"); goto success_disable_tdx; } - if (!enable_virt_at_load) { - pr_err("tdx: tdx requires kvm.enable_virt_at_load=1\n"); - goto success_disable_tdx; - } - - /* - * Ideally KVM should probe whether TDX module has been loaded - * first and then try to bring it up. But TDX needs to use SEAMCALL - * to probe whether the module is loaded (there is no CPUID or MSR - * for that), and making SEAMCALL requires enabling virtualization - * first, just like the rest steps of bringing up TDX module. - * - * So, for simplicity do everything in __tdx_bringup(); the first - * SEAMCALL will return -ENODEV when the module is not loaded. The - * only complication is having to make sure that initialization - * SEAMCALLs don't return TDX_SEAMCALL_VMFAILINVALID in other - * cases. 
- */ r = __tdx_bringup(); if (r) { /* @@ -3567,8 +3472,6 @@ int __init tdx_bringup(void) */ if (r == -ENODEV) goto success_disable_tdx; - - enable_tdx = 0; } return r; @@ -3578,6 +3481,15 @@ success_disable_tdx: return 0; } +void tdx_cleanup(void) +{ + if (!enable_tdx) + return; + + misc_cg_set_capacity(MISC_CG_RES_TDX, 0); + cpuhp_remove_state(tdx_cpuhp_state); +} + void __init tdx_hardware_setup(void) { KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_tdx); diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 61cece496bdb..5baed9f8e5b8 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -39,6 +40,7 @@ #include #include #include +#include #include "tdx.h" static u32 tdx_global_keyid __ro_after_init; @@ -51,13 +53,11 @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized); static struct tdmr_info_list tdx_tdmr_list; -static enum tdx_module_status_t tdx_module_status; -static DEFINE_MUTEX(tdx_module_lock); - /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */ static LIST_HEAD(tdx_memlist); static struct tdx_sys_info tdx_sysinfo; +static bool tdx_module_initialized; typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); @@ -139,22 +139,15 @@ out: } /** - * tdx_cpu_enable - Enable TDX on local cpu - * - * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module - * global initialization SEAMCALL if not done) on local cpu to make this - * cpu be ready to run any other SEAMCALLs. - * - * Return 0 on success, otherwise errors. + * Enable VMXON and then do one-time TDX module per-cpu initialization SEAMCALL + * (and TDX module global initialization SEAMCALL if not done) on local cpu to + * make this cpu be ready to run any other SEAMCALLs. 
*/ -int tdx_cpu_enable(void) +static int tdx_cpu_enable(void) { struct tdx_module_args args = {}; int ret; - if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM)) - return -ENODEV; - if (__this_cpu_read(tdx_lp_initialized)) return 0; @@ -175,7 +168,58 @@ int tdx_cpu_enable(void) return 0; } -EXPORT_SYMBOL_FOR_KVM(tdx_cpu_enable); + +static int tdx_online_cpu(unsigned int cpu) +{ + int ret; + + ret = x86_virt_get_ref(X86_FEATURE_VMX); + if (ret) + return ret; + + ret = tdx_cpu_enable(); + if (ret) + x86_virt_put_ref(X86_FEATURE_VMX); + + return ret; +} + +static int tdx_offline_cpu(unsigned int cpu) +{ + x86_virt_put_ref(X86_FEATURE_VMX); + return 0; +} + +static void tdx_shutdown_cpu(void *ign) +{ + x86_virt_put_ref(X86_FEATURE_VMX); +} + +static void tdx_shutdown(void *ign) +{ + on_each_cpu(tdx_shutdown_cpu, NULL, 1); +} + +static int tdx_suspend(void *ign) +{ + x86_virt_put_ref(X86_FEATURE_VMX); + return 0; +} + +static void tdx_resume(void *ign) +{ + WARN_ON_ONCE(x86_virt_get_ref(X86_FEATURE_VMX)); +} + +static const struct syscore_ops tdx_syscore_ops = { + .suspend = tdx_suspend, + .resume = tdx_resume, + .shutdown = tdx_shutdown, +}; + +static struct syscore tdx_syscore = { + .ops = &tdx_syscore_ops, +}; /* * Add a memory region as a TDX memory block. 
The caller must make sure @@ -1150,67 +1194,50 @@ err_free_tdxmem: goto out_put_tdxmem; } -static int __tdx_enable(void) +static int tdx_enable(void) { + enum cpuhp_state state; int ret; + if (!cpu_feature_enabled(X86_FEATURE_TDX_HOST_PLATFORM)) { + pr_err("TDX not supported by the host platform\n"); + return -ENODEV; + } + + if (!cpu_feature_enabled(X86_FEATURE_XSAVE)) { + pr_err("XSAVE is required for TDX\n"); + return -EINVAL; + } + + if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) { + pr_err("MOVDIR64B is required for TDX\n"); + return -EINVAL; + } + + if (!cpu_feature_enabled(X86_FEATURE_SELFSNOOP)) { + pr_err("Self-snoop is required for TDX\n"); + return -ENODEV; + } + + state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "virt/tdx:online", + tdx_online_cpu, tdx_offline_cpu); + if (state < 0) + return state; + ret = init_tdx_module(); if (ret) { - pr_err("module initialization failed (%d)\n", ret); - tdx_module_status = TDX_MODULE_ERROR; + pr_err("TDX-Module initialization failed (%d)\n", ret); + cpuhp_remove_state(state); return ret; } - pr_info("module initialized\n"); - tdx_module_status = TDX_MODULE_INITIALIZED; + register_syscore(&tdx_syscore); + tdx_module_initialized = true; + pr_info("TDX-Module initialized\n"); return 0; } - -/** - * tdx_enable - Enable TDX module to make it ready to run TDX guests - * - * This function assumes the caller has: 1) held read lock of CPU hotplug - * lock to prevent any new cpu from becoming online; 2) done both VMXON - * and tdx_cpu_enable() on all online cpus. - * - * This function requires there's at least one online cpu for each CPU - * package to succeed. - * - * This function can be called in parallel by multiple callers. - * - * Return 0 if TDX is enabled successfully, otherwise error. 
- */ -int tdx_enable(void) -{ - int ret; - - if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM)) - return -ENODEV; - - lockdep_assert_cpus_held(); - - mutex_lock(&tdx_module_lock); - - switch (tdx_module_status) { - case TDX_MODULE_UNINITIALIZED: - ret = __tdx_enable(); - break; - case TDX_MODULE_INITIALIZED: - /* Already initialized, great, tell the caller. */ - ret = 0; - break; - default: - /* Failed to initialize in the previous attempts */ - ret = -EINVAL; - break; - } - - mutex_unlock(&tdx_module_lock); - - return ret; -} -EXPORT_SYMBOL_FOR_KVM(tdx_enable); +subsys_initcall(tdx_enable); static bool is_pamt_page(unsigned long phys) { @@ -1461,15 +1488,10 @@ void __init tdx_init(void) const struct tdx_sys_info *tdx_get_sysinfo(void) { - const struct tdx_sys_info *p = NULL; + if (!tdx_module_initialized) + return NULL; - /* Make sure all fields in @tdx_sysinfo have been populated */ - mutex_lock(&tdx_module_lock); - if (tdx_module_status == TDX_MODULE_INITIALIZED) - p = (const struct tdx_sys_info *)&tdx_sysinfo; - mutex_unlock(&tdx_module_lock); - - return p; + return (const struct tdx_sys_info *)&tdx_sysinfo; } EXPORT_SYMBOL_FOR_KVM(tdx_get_sysinfo); diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 82bb82be8567..dde219c823b4 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -91,14 +91,6 @@ struct tdmr_info { * Do not put any hardware-defined TDX structure representations below * this comment! */ - -/* Kernel defined TDX module status during module initialization. 
*/ -enum tdx_module_status_t { - TDX_MODULE_UNINITIALIZED, - TDX_MODULE_INITIALIZED, - TDX_MODULE_ERROR -}; - struct tdx_memblock { struct list_head list; unsigned long start_pfn; From 9900400e20c0289bf0c82231169c33b43e38c6e8 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:26:58 -0800 Subject: [PATCH 061/373] x86/virt/tdx: Tag a pile of functions as __init, and globals as __ro_after_init Now that TDX-Module initialization is done during subsys init, tag all related functions as __init, and relevant data as __ro_after_init. Reviewed-by: Dan Williams Reviewed-by: Chao Gao Tested-by: Chao Gao Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-13-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/virt/vmx/tdx/tdx.c | 119 ++++++++++---------- arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 10 +- 2 files changed, 66 insertions(+), 63 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 5baed9f8e5b8..a5937b0b76b2 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -56,8 +56,8 @@ static struct tdmr_info_list tdx_tdmr_list; /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */ static LIST_HEAD(tdx_memlist); -static struct tdx_sys_info tdx_sysinfo; -static bool tdx_module_initialized; +static struct tdx_sys_info tdx_sysinfo __ro_after_init; +static bool tdx_module_initialized __ro_after_init; typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); @@ -226,8 +226,9 @@ static struct syscore tdx_syscore = { * all memory regions are added in address ascending order and don't * overlap. 
*/ -static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, - unsigned long end_pfn, int nid) +static __init int add_tdx_memblock(struct list_head *tmb_list, + unsigned long start_pfn, + unsigned long end_pfn, int nid) { struct tdx_memblock *tmb; @@ -245,7 +246,7 @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, return 0; } -static void free_tdx_memlist(struct list_head *tmb_list) +static __init void free_tdx_memlist(struct list_head *tmb_list) { /* @tmb_list is protected by mem_hotplug_lock */ while (!list_empty(tmb_list)) { @@ -263,7 +264,7 @@ static void free_tdx_memlist(struct list_head *tmb_list) * ranges off in a secondary structure because memblock is modified * in memory hotplug while TDX memory regions are fixed. */ -static int build_tdx_memlist(struct list_head *tmb_list) +static __init int build_tdx_memlist(struct list_head *tmb_list) { unsigned long start_pfn, end_pfn; int i, nid, ret; @@ -295,7 +296,7 @@ err: return ret; } -static int read_sys_metadata_field(u64 field_id, u64 *data) +static __init int read_sys_metadata_field(u64 field_id, u64 *data) { struct tdx_module_args args = {}; int ret; @@ -317,7 +318,7 @@ static int read_sys_metadata_field(u64 field_id, u64 *data) #include "tdx_global_metadata.c" -static int check_features(struct tdx_sys_info *sysinfo) +static __init int check_features(struct tdx_sys_info *sysinfo) { u64 tdx_features0 = sysinfo->features.tdx_features0; @@ -330,7 +331,7 @@ static int check_features(struct tdx_sys_info *sysinfo) } /* Calculate the actual TDMR size */ -static int tdmr_size_single(u16 max_reserved_per_tdmr) +static __init int tdmr_size_single(u16 max_reserved_per_tdmr) { int tdmr_sz; @@ -344,8 +345,8 @@ static int tdmr_size_single(u16 max_reserved_per_tdmr) return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT); } -static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list, - struct tdx_sys_info_tdmr *sysinfo_tdmr) +static __init int alloc_tdmr_list(struct 
tdmr_info_list *tdmr_list, + struct tdx_sys_info_tdmr *sysinfo_tdmr) { size_t tdmr_sz, tdmr_array_sz; void *tdmr_array; @@ -376,7 +377,7 @@ static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list, return 0; } -static void free_tdmr_list(struct tdmr_info_list *tdmr_list) +static __init void free_tdmr_list(struct tdmr_info_list *tdmr_list) { free_pages_exact(tdmr_list->tdmrs, tdmr_list->max_tdmrs * tdmr_list->tdmr_sz); @@ -405,8 +406,8 @@ static inline u64 tdmr_end(struct tdmr_info *tdmr) * preallocated @tdmr_list, following all the special alignment * and size rules for TDMR. */ -static int fill_out_tdmrs(struct list_head *tmb_list, - struct tdmr_info_list *tdmr_list) +static __init int fill_out_tdmrs(struct list_head *tmb_list, + struct tdmr_info_list *tdmr_list) { struct tdx_memblock *tmb; int tdmr_idx = 0; @@ -482,8 +483,8 @@ static int fill_out_tdmrs(struct list_head *tmb_list, * Calculate PAMT size given a TDMR and a page size. The returned * PAMT size is always aligned up to 4K page boundary. */ -static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz, - u16 pamt_entry_size) +static __init unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz, + u16 pamt_entry_size) { unsigned long pamt_sz, nr_pamt_entries; @@ -514,7 +515,7 @@ static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz, * PAMT. This node will have some memory covered by the TDMR. The * relative amount of memory covered is not considered. */ -static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list) +static __init int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list) { struct tdx_memblock *tmb; @@ -543,9 +544,9 @@ static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list) * Allocate PAMTs from the local NUMA node of some memory in @tmb_list * within @tdmr, and set up PAMTs for @tdmr. 
*/ -static int tdmr_set_up_pamt(struct tdmr_info *tdmr, - struct list_head *tmb_list, - u16 pamt_entry_size[]) +static __init int tdmr_set_up_pamt(struct tdmr_info *tdmr, + struct list_head *tmb_list, + u16 pamt_entry_size[]) { unsigned long pamt_base[TDX_PS_NR]; unsigned long pamt_size[TDX_PS_NR]; @@ -615,7 +616,7 @@ static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base, *pamt_size = pamt_sz; } -static void tdmr_do_pamt_func(struct tdmr_info *tdmr, +static __init void tdmr_do_pamt_func(struct tdmr_info *tdmr, void (*pamt_func)(unsigned long base, unsigned long size)) { unsigned long pamt_base, pamt_size; @@ -632,17 +633,17 @@ static void tdmr_do_pamt_func(struct tdmr_info *tdmr, pamt_func(pamt_base, pamt_size); } -static void free_pamt(unsigned long pamt_base, unsigned long pamt_size) +static __init void free_pamt(unsigned long pamt_base, unsigned long pamt_size) { free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT); } -static void tdmr_free_pamt(struct tdmr_info *tdmr) +static __init void tdmr_free_pamt(struct tdmr_info *tdmr) { tdmr_do_pamt_func(tdmr, free_pamt); } -static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list) +static __init void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list) { int i; @@ -651,9 +652,9 @@ static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list) } /* Allocate and set up PAMTs for all TDMRs */ -static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list, - struct list_head *tmb_list, - u16 pamt_entry_size[]) +static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list, + struct list_head *tmb_list, + u16 pamt_entry_size[]) { int i, ret = 0; @@ -702,12 +703,13 @@ void tdx_quirk_reset_page(struct page *page) } EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_page); -static void tdmr_quirk_reset_pamt(struct tdmr_info *tdmr) +static __init void tdmr_quirk_reset_pamt(struct tdmr_info *tdmr) + { tdmr_do_pamt_func(tdmr, tdx_quirk_reset_paddr); } -static void 
tdmrs_quirk_reset_pamt_all(struct tdmr_info_list *tdmr_list) +static __init void tdmrs_quirk_reset_pamt_all(struct tdmr_info_list *tdmr_list) { int i; @@ -715,7 +717,7 @@ static void tdmrs_quirk_reset_pamt_all(struct tdmr_info_list *tdmr_list) tdmr_quirk_reset_pamt(tdmr_entry(tdmr_list, i)); } -static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) +static __init unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) { unsigned long pamt_size = 0; int i; @@ -730,8 +732,8 @@ static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) return pamt_size / 1024; } -static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr, - u64 size, u16 max_reserved_per_tdmr) +static __init int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, + u64 addr, u64 size, u16 max_reserved_per_tdmr) { struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas; int idx = *p_idx; @@ -764,10 +766,10 @@ static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr, * those holes fall within @tdmr, set up a TDMR reserved area to cover * the hole. */ -static int tdmr_populate_rsvd_holes(struct list_head *tmb_list, - struct tdmr_info *tdmr, - int *rsvd_idx, - u16 max_reserved_per_tdmr) +static __init int tdmr_populate_rsvd_holes(struct list_head *tmb_list, + struct tdmr_info *tdmr, + int *rsvd_idx, + u16 max_reserved_per_tdmr) { struct tdx_memblock *tmb; u64 prev_end; @@ -828,10 +830,10 @@ static int tdmr_populate_rsvd_holes(struct list_head *tmb_list, * overlaps with @tdmr, set up a TDMR reserved area to cover the * overlapping part. 
*/ -static int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list, - struct tdmr_info *tdmr, - int *rsvd_idx, - u16 max_reserved_per_tdmr) +static __init int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list, + struct tdmr_info *tdmr, + int *rsvd_idx, + u16 max_reserved_per_tdmr) { int i, ret; @@ -866,7 +868,7 @@ static int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list, } /* Compare function called by sort() for TDMR reserved areas */ -static int rsvd_area_cmp_func(const void *a, const void *b) +static __init int rsvd_area_cmp_func(const void *a, const void *b) { struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a; struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b; @@ -885,10 +887,10 @@ static int rsvd_area_cmp_func(const void *a, const void *b) * Populate reserved areas for the given @tdmr, including memory holes * (via @tmb_list) and PAMTs (via @tdmr_list). */ -static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr, - struct list_head *tmb_list, - struct tdmr_info_list *tdmr_list, - u16 max_reserved_per_tdmr) +static __init int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr, + struct list_head *tmb_list, + struct tdmr_info_list *tdmr_list, + u16 max_reserved_per_tdmr) { int ret, rsvd_idx = 0; @@ -913,9 +915,9 @@ static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr, * Populate reserved areas for all TDMRs in @tdmr_list, including memory * holes (via @tmb_list) and PAMTs. */ -static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list, - struct list_head *tmb_list, - u16 max_reserved_per_tdmr) +static __init int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list, + struct list_head *tmb_list, + u16 max_reserved_per_tdmr) { int i; @@ -936,9 +938,9 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list, * to cover all TDX memory regions in @tmb_list based on the TDX module * TDMR global information in @sysinfo_tdmr. 
*/ -static int construct_tdmrs(struct list_head *tmb_list, - struct tdmr_info_list *tdmr_list, - struct tdx_sys_info_tdmr *sysinfo_tdmr) +static __init int construct_tdmrs(struct list_head *tmb_list, + struct tdmr_info_list *tdmr_list, + struct tdx_sys_info_tdmr *sysinfo_tdmr) { u16 pamt_entry_size[TDX_PS_NR] = { sysinfo_tdmr->pamt_4k_entry_size, @@ -970,7 +972,8 @@ static int construct_tdmrs(struct list_head *tmb_list, return ret; } -static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid) +static __init int config_tdx_module(struct tdmr_info_list *tdmr_list, + u64 global_keyid) { struct tdx_module_args args = {}; u64 *tdmr_pa_array; @@ -1005,7 +1008,7 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid) return ret; } -static int do_global_key_config(void *unused) +static __init int do_global_key_config(void *unused) { struct tdx_module_args args = {}; @@ -1023,7 +1026,7 @@ static int do_global_key_config(void *unused) * KVM) can ensure success by ensuring sufficient CPUs are online and * can run SEAMCALLs. 
*/ -static int config_global_keyid(void) +static __init int config_global_keyid(void) { cpumask_var_t packages; int cpu, ret = -EINVAL; @@ -1063,7 +1066,7 @@ static int config_global_keyid(void) return ret; } -static int init_tdmr(struct tdmr_info *tdmr) +static __init int init_tdmr(struct tdmr_info *tdmr) { u64 next; @@ -1094,7 +1097,7 @@ static int init_tdmr(struct tdmr_info *tdmr) return 0; } -static int init_tdmrs(struct tdmr_info_list *tdmr_list) +static __init int init_tdmrs(struct tdmr_info_list *tdmr_list) { int i; @@ -1113,7 +1116,7 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list) return 0; } -static int init_tdx_module(void) +static __init int init_tdx_module(void) { int ret; @@ -1194,7 +1197,7 @@ err_free_tdxmem: goto out_put_tdxmem; } -static int tdx_enable(void) +static __init int tdx_enable(void) { enum cpuhp_state state; int ret; diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c index 13ad2663488b..360963bc9328 100644 --- a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c +++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c @@ -7,7 +7,7 @@ * Include this file to other C file instead. 
*/ -static int get_tdx_sys_info_features(struct tdx_sys_info_features *sysinfo_features) +static __init int get_tdx_sys_info_features(struct tdx_sys_info_features *sysinfo_features) { int ret = 0; u64 val; @@ -18,7 +18,7 @@ static int get_tdx_sys_info_features(struct tdx_sys_info_features *sysinfo_featu return ret; } -static int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr) +static __init int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr) { int ret = 0; u64 val; @@ -37,7 +37,7 @@ static int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr) return ret; } -static int get_tdx_sys_info_td_ctrl(struct tdx_sys_info_td_ctrl *sysinfo_td_ctrl) +static __init int get_tdx_sys_info_td_ctrl(struct tdx_sys_info_td_ctrl *sysinfo_td_ctrl) { int ret = 0; u64 val; @@ -52,7 +52,7 @@ static int get_tdx_sys_info_td_ctrl(struct tdx_sys_info_td_ctrl *sysinfo_td_ctrl return ret; } -static int get_tdx_sys_info_td_conf(struct tdx_sys_info_td_conf *sysinfo_td_conf) +static __init int get_tdx_sys_info_td_conf(struct tdx_sys_info_td_conf *sysinfo_td_conf) { int ret = 0; u64 val; @@ -85,7 +85,7 @@ static int get_tdx_sys_info_td_conf(struct tdx_sys_info_td_conf *sysinfo_td_conf return ret; } -static int get_tdx_sys_info(struct tdx_sys_info *sysinfo) +static __init int get_tdx_sys_info(struct tdx_sys_info *sysinfo) { int ret = 0; From eac90a5ba0aa40f6d81def241bd78d2a5cc5e08b Mon Sep 17 00:00:00 2001 From: Chao Gao Date: Fri, 13 Feb 2026 17:26:59 -0800 Subject: [PATCH 062/373] x86/virt/tdx: KVM: Consolidate TDX CPU hotplug handling The core kernel registers a CPU hotplug callback to do VMX and TDX init and deinit while KVM registers a separate CPU offline callback to block offlining the last online CPU in a socket. Splitting TDX-related CPU hotplug handling across two components is odd and adds unnecessary complexity. Consolidate TDX-related CPU hotplug handling by integrating KVM's tdx_offline_cpu() to the one in the core kernel. 
Also move nr_configured_hkid to the core kernel because tdx_offline_cpu() references it. Since HKID allocation and free are handled in the core kernel, it's more natural to track used HKIDs there. Reviewed-by: Dan Williams Signed-off-by: Chao Gao Tested-by: Chao Gao Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-14-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/tdx.c | 67 +------------------------------------ arch/x86/virt/vmx/tdx/tdx.c | 49 +++++++++++++++++++++++++-- 2 files changed, 47 insertions(+), 69 deletions(-) diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index c0cd0d73015f..520d85a2974a 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -59,8 +59,6 @@ module_param_named(tdx, enable_tdx, bool, 0444); #define TDX_SHARED_BIT_PWL_5 gpa_to_gfn(BIT_ULL(51)) #define TDX_SHARED_BIT_PWL_4 gpa_to_gfn(BIT_ULL(47)) -static enum cpuhp_state tdx_cpuhp_state __ro_after_init; - static const struct tdx_sys_info *tdx_sysinfo; void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err) @@ -219,8 +217,6 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf, */ static DEFINE_MUTEX(tdx_lock); -static atomic_t nr_configured_hkid; - static bool tdx_operand_busy(u64 err) { return (err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY; @@ -268,7 +264,6 @@ static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx) { tdx_guest_keyid_free(kvm_tdx->hkid); kvm_tdx->hkid = -1; - atomic_dec(&nr_configured_hkid); misc_cg_uncharge(MISC_CG_RES_TDX, kvm_tdx->misc_cg, 1); put_misc_cg(kvm_tdx->misc_cg); kvm_tdx->misc_cg = NULL; @@ -2398,8 +2393,6 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params, ret = -ENOMEM; - atomic_inc(&nr_configured_hkid); - tdr_page = alloc_page(GFP_KERNEL); if (!tdr_page) goto free_hkid; @@ -3291,51 +3284,10 @@ int tdx_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private) return PG_LEVEL_4K; } -static int 
tdx_online_cpu(unsigned int cpu) -{ - return 0; -} - -static int tdx_offline_cpu(unsigned int cpu) -{ - int i; - - /* No TD is running. Allow any cpu to be offline. */ - if (!atomic_read(&nr_configured_hkid)) - return 0; - - /* - * In order to reclaim TDX HKID, (i.e. when deleting guest TD), need to - * call TDH.PHYMEM.PAGE.WBINVD on all packages to program all memory - * controller with pconfig. If we have active TDX HKID, refuse to - * offline the last online cpu. - */ - for_each_online_cpu(i) { - /* - * Found another online cpu on the same package. - * Allow to offline. - */ - if (i != cpu && topology_physical_package_id(i) == - topology_physical_package_id(cpu)) - return 0; - } - - /* - * This is the last cpu of this package. Don't offline it. - * - * Because it's hard for human operator to understand the - * reason, warn it. - */ -#define MSG_ALLPKG_ONLINE \ - "TDX requires all packages to have an online CPU. Delete all TDs in order to offline all CPUs of a package.\n" - pr_warn_ratelimited(MSG_ALLPKG_ONLINE); - return -EBUSY; -} - static int __init __tdx_bringup(void) { const struct tdx_sys_info_td_conf *td_conf; - int r, i; + int i; for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) { /* @@ -3403,23 +3355,7 @@ static int __init __tdx_bringup(void) if (misc_cg_set_capacity(MISC_CG_RES_TDX, tdx_get_nr_guest_keyids())) return -EINVAL; - /* - * TDX-specific cpuhp callback to disallow offlining the last CPU in a - * packing while KVM is running one or more TDs. Reclaiming HKIDs - * requires doing PAGE.WBINVD on every package, i.e. offlining all CPUs - * of a package would prevent reclaiming the HKID. 
- */ - r = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "kvm/cpu/tdx:online", - tdx_online_cpu, tdx_offline_cpu); - if (r < 0) - goto err_cpuhup; - - tdx_cpuhp_state = r; return 0; - -err_cpuhup: - misc_cg_set_capacity(MISC_CG_RES_TDX, 0); - return r; } int __init tdx_bringup(void) @@ -3487,7 +3423,6 @@ void tdx_cleanup(void) return; misc_cg_set_capacity(MISC_CG_RES_TDX, 0); - cpuhp_remove_state(tdx_cpuhp_state); } void __init tdx_hardware_setup(void) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index a5937b0b76b2..3a1dddec6843 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -59,6 +59,8 @@ static LIST_HEAD(tdx_memlist); static struct tdx_sys_info tdx_sysinfo __ro_after_init; static bool tdx_module_initialized __ro_after_init; +static atomic_t nr_configured_hkid; + typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) @@ -186,6 +188,40 @@ static int tdx_online_cpu(unsigned int cpu) static int tdx_offline_cpu(unsigned int cpu) { + int i; + + /* No TD is running. Allow any cpu to be offline. */ + if (!atomic_read(&nr_configured_hkid)) + goto done; + + /* + * In order to reclaim TDX HKID, (i.e. when deleting guest TD), need to + * call TDH.PHYMEM.PAGE.WBINVD on all packages to program all memory + * controller with pconfig. If we have active TDX HKID, refuse to + * offline the last online cpu. + */ + for_each_online_cpu(i) { + /* + * Found another online cpu on the same package. + * Allow to offline. + */ + if (i != cpu && topology_physical_package_id(i) == + topology_physical_package_id(cpu)) + goto done; + } + + /* + * This is the last cpu of this package. Don't offline it. + * + * Because it's hard for human operator to understand the + * reason, warn it. + */ +#define MSG_ALLPKG_ONLINE \ + "TDX requires all packages to have an online CPU. 
Delete all TDs in order to offline all CPUs of a package.\n" + pr_warn_ratelimited(MSG_ALLPKG_ONLINE); + return -EBUSY; + +done: x86_virt_put_ref(X86_FEATURE_VMX); return 0; } @@ -1506,15 +1542,22 @@ EXPORT_SYMBOL_FOR_KVM(tdx_get_nr_guest_keyids); int tdx_guest_keyid_alloc(void) { - return ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start, - tdx_guest_keyid_start + tdx_nr_guest_keyids - 1, - GFP_KERNEL); + int ret; + + ret = ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start, + tdx_guest_keyid_start + tdx_nr_guest_keyids - 1, + GFP_KERNEL); + if (ret >= 0) + atomic_inc(&nr_configured_hkid); + + return ret; } EXPORT_SYMBOL_FOR_KVM(tdx_guest_keyid_alloc); void tdx_guest_keyid_free(unsigned int keyid) { ida_free(&tdx_guest_keyid_pool, keyid); + atomic_dec(&nr_configured_hkid); } EXPORT_SYMBOL_FOR_KVM(tdx_guest_keyid_free); From afe31de159bf218d6e92db6a4495f715f0a4e38c Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:27:00 -0800 Subject: [PATCH 063/373] x86/virt/tdx: Use ida_is_empty() to detect if any TDs may be running Drop nr_configured_hkid and instead use ida_is_empty() to detect if any HKIDs have been allocated/configured. 
Suggested-by: Dan Williams Reviewed-by: Dan Williams Reviewed-by: Chao Gao Tested-by: Chao Gao Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-15-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/virt/vmx/tdx/tdx.c | 17 ++++------------- 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 3a1dddec6843..cb9b3210ab71 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -59,8 +59,6 @@ static LIST_HEAD(tdx_memlist); static struct tdx_sys_info tdx_sysinfo __ro_after_init; static bool tdx_module_initialized __ro_after_init; -static atomic_t nr_configured_hkid; - typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) @@ -191,7 +189,7 @@ static int tdx_offline_cpu(unsigned int cpu) int i; /* No TD is running. Allow any cpu to be offline. */ - if (!atomic_read(&nr_configured_hkid)) + if (ida_is_empty(&tdx_guest_keyid_pool)) goto done; /* @@ -1542,22 +1540,15 @@ EXPORT_SYMBOL_FOR_KVM(tdx_get_nr_guest_keyids); int tdx_guest_keyid_alloc(void) { - int ret; - - ret = ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start, - tdx_guest_keyid_start + tdx_nr_guest_keyids - 1, - GFP_KERNEL); - if (ret >= 0) - atomic_inc(&nr_configured_hkid); - - return ret; + return ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start, + tdx_guest_keyid_start + tdx_nr_guest_keyids - 1, + GFP_KERNEL); } EXPORT_SYMBOL_FOR_KVM(tdx_guest_keyid_alloc); void tdx_guest_keyid_free(unsigned int keyid) { ida_free(&tdx_guest_keyid_pool, keyid); - atomic_dec(&nr_configured_hkid); } EXPORT_SYMBOL_FOR_KVM(tdx_guest_keyid_free); From d30372d0b7e637475c79a785d055f4eb8c863656 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:27:01 -0800 Subject: [PATCH 064/373] KVM: Bury kvm_{en,dis}able_virtualization() in kvm_main.c once more Now 
that TDX handles doing VMXON without KVM's involvement, bury the top-level APIs to enable and disable virtualization back in kvm_main.c. No functional change intended. Reviewed-by: Dan Williams Reviewed-by: Chao Gao Tested-by: Chao Gao Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-16-seanjc@google.com Signed-off-by: Sean Christopherson --- include/linux/kvm_host.h | 8 -------- virt/kvm/kvm_main.c | 17 +++++++++++++---- 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 7c4ebd5210ec..fbd549bdf052 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2613,12 +2613,4 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu, struct kvm_pre_fault_memory *range); #endif -#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING -int kvm_enable_virtualization(void); -void kvm_disable_virtualization(void); -#else -static inline int kvm_enable_virtualization(void) { return 0; } -static inline void kvm_disable_virtualization(void) { } -#endif - #endif diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d27bf2488b12..a9ccf9a1c41e 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1102,6 +1102,9 @@ static inline struct kvm_io_bus *kvm_get_bus_for_destruction(struct kvm *kvm, !refcount_read(&kvm->users_count)); } +static int kvm_enable_virtualization(void); +static void kvm_disable_virtualization(void); + static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) { struct kvm *kvm = kvm_arch_alloc_vm(); @@ -5689,7 +5692,7 @@ static struct syscore kvm_syscore = { .ops = &kvm_syscore_ops, }; -int kvm_enable_virtualization(void) +static int kvm_enable_virtualization(void) { int r; @@ -5734,9 +5737,8 @@ err_cpuhp: --kvm_usage_count; return r; } -EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_enable_virtualization); -void kvm_disable_virtualization(void) +static void kvm_disable_virtualization(void) { guard(mutex)(&kvm_usage_lock); @@ -5747,7 +5749,6 
@@ void kvm_disable_virtualization(void) cpuhp_remove_state(CPUHP_AP_KVM_ONLINE); kvm_arch_disable_virtualization(); } -EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_disable_virtualization); static int kvm_init_virtualization(void) { @@ -5763,6 +5764,14 @@ static void kvm_uninit_virtualization(void) kvm_disable_virtualization(); } #else /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */ +static int kvm_enable_virtualization(void) +{ + return 0; +} +static void kvm_disable_virtualization(void) +{ + +} static int kvm_init_virtualization(void) { return 0; From f630de1f8d70d7e29e12bc25dc63f9c5f771dc59 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 13 Feb 2026 17:27:02 -0800 Subject: [PATCH 065/373] KVM: TDX: Fold tdx_bringup() into tdx_hardware_setup() Now that TDX doesn't need to manually enable virtualization through _KVM_ APIs during setup, fold tdx_bringup() into tdx_hardware_setup() where the code belongs, e.g. so that KVM doesn't leave the S-EPT kvm_x86_ops wired up when TDX is disabled. The weird ordering (and naming) was necessary to allow KVM TDX to use kvm_enable_virtualization(), which in turn had a hard dependency on kvm_x86_ops.enable_virtualization_cpu and thus kvm_x86_vendor_init(). Tested-by: Chao Gao Reviewed-by: Dan Williams Tested-by: Sagi Shahar Link: https://patch.msgid.link/20260214012702.2368778-17-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/main.c | 21 +++++++++------------ arch/x86/kvm/vmx/tdx.c | 39 +++++++++++++++------------------------ arch/x86/kvm/vmx/tdx.h | 8 ++------ 3 files changed, 26 insertions(+), 42 deletions(-) diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c index a46ccd670785..dbebddf648be 100644 --- a/arch/x86/kvm/vmx/main.c +++ b/arch/x86/kvm/vmx/main.c @@ -29,10 +29,15 @@ static __init int vt_hardware_setup(void) if (ret) return ret; - if (enable_tdx) - tdx_hardware_setup(); + return enable_tdx ? 
tdx_hardware_setup() : 0; +} - return 0; +static void vt_hardware_unsetup(void) +{ + if (enable_tdx) + tdx_hardware_unsetup(); + + vmx_hardware_unsetup(); } static int vt_vm_init(struct kvm *kvm) @@ -869,7 +874,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = { .check_processor_compatibility = vmx_check_processor_compat, - .hardware_unsetup = vmx_hardware_unsetup, + .hardware_unsetup = vt_op(hardware_unsetup), .enable_virtualization_cpu = vmx_enable_virtualization_cpu, .disable_virtualization_cpu = vt_op(disable_virtualization_cpu), @@ -1029,7 +1034,6 @@ struct kvm_x86_init_ops vt_init_ops __initdata = { static void __exit vt_exit(void) { kvm_exit(); - tdx_cleanup(); vmx_exit(); } module_exit(vt_exit); @@ -1043,11 +1047,6 @@ static int __init vt_init(void) if (r) return r; - /* tdx_init() has been taken */ - r = tdx_bringup(); - if (r) - goto err_tdx_bringup; - /* * TDX and VMX have different vCPU structures. Calculate the * maximum size/align so that kvm_init() can use the larger @@ -1074,8 +1073,6 @@ static int __init vt_init(void) return 0; err_kvm_init: - tdx_cleanup(); -err_tdx_bringup: vmx_exit(); return r; } diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index 520d85a2974a..b7264b533feb 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -3284,7 +3284,12 @@ int tdx_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private) return PG_LEVEL_4K; } -static int __init __tdx_bringup(void) +void tdx_hardware_unsetup(void) +{ + misc_cg_set_capacity(MISC_CG_RES_TDX, 0); +} + +static int __init __tdx_hardware_setup(void) { const struct tdx_sys_info_td_conf *td_conf; int i; @@ -3358,7 +3363,7 @@ static int __init __tdx_bringup(void) return 0; } -int __init tdx_bringup(void) +int __init tdx_hardware_setup(void) { int r, i; @@ -3394,7 +3399,7 @@ int __init tdx_bringup(void) goto success_disable_tdx; } - r = __tdx_bringup(); + r = __tdx_hardware_setup(); if (r) { /* * Disable TDX only but don't fail to load module if the TDX @@ 
-3408,31 +3413,12 @@ int __init tdx_bringup(void) */ if (r == -ENODEV) goto success_disable_tdx; + + return r; } - return r; - -success_disable_tdx: - enable_tdx = 0; - return 0; -} - -void tdx_cleanup(void) -{ - if (!enable_tdx) - return; - - misc_cg_set_capacity(MISC_CG_RES_TDX, 0); -} - -void __init tdx_hardware_setup(void) -{ KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_tdx); - /* - * Note, if the TDX module can't be loaded, KVM TDX support will be - * disabled but KVM will continue loading (see tdx_bringup()). - */ vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size, sizeof(struct kvm_tdx)); vt_x86_ops.link_external_spt = tdx_sept_link_private_spt; @@ -3440,4 +3426,9 @@ void __init tdx_hardware_setup(void) vt_x86_ops.free_external_spt = tdx_sept_free_private_spt; vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte; vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt; + return 0; + +success_disable_tdx: + enable_tdx = 0; + return 0; } diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h index 45b5183ccb36..b5cd2ffb303e 100644 --- a/arch/x86/kvm/vmx/tdx.h +++ b/arch/x86/kvm/vmx/tdx.h @@ -8,9 +8,8 @@ #ifdef CONFIG_KVM_INTEL_TDX #include "common.h" -void tdx_hardware_setup(void); -int tdx_bringup(void); -void tdx_cleanup(void); +int tdx_hardware_setup(void); +void tdx_hardware_unsetup(void); extern bool enable_tdx; @@ -187,9 +186,6 @@ TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management); TDX_BUILD_TDVPS_ACCESSORS(64, STATE_NON_ARCH, state_non_arch); #else -static inline int tdx_bringup(void) { return 0; } -static inline void tdx_cleanup(void) {} - #define enable_tdx 0 struct kvm_tdx { From 8d397582f6b5e9fbcf09781c7c934b4910e94a50 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:47 +0000 Subject: [PATCH 066/373] KVM: nSVM: Always use NextRIP as vmcb02's NextRIP after first L2 VMRUN For guests with NRIPS disabled, L1 does not provide NextRIP when running an L2 with an injected soft interrupt, instead it 
advances the current RIP before running it. KVM uses the current RIP as the NextRIP in vmcb02 to emulate a CPU without NRIPS. However, after L2 runs the first time, NextRIP will be updated by the CPU and/or KVM, and the current RIP is no longer the correct value to use in vmcb02. Hence, after save/restore, use the current RIP if and only if a nested run is pending, otherwise use NextRIP. Give soft_int_next_rip the same treatment, as it's the same logic, just for a narrower use case. Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-6-yosry@kernel.org [sean: give soft_int_next_rip the same treatment] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 28 ++++++++++++++++++---------- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 2308e40691c4..1cc083f95e6a 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -845,24 +845,32 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm, vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err; /* - * next_rip is consumed on VMRUN as the return address pushed on the + * NextRIP is consumed on VMRUN as the return address pushed on the * stack for injected soft exceptions/interrupts. If nrips is exposed - * to L1, take it verbatim from vmcb12. If nrips is supported in - * hardware but not exposed to L1, stuff the actual L2 RIP to emulate - * what a nrips=0 CPU would do (L1 is responsible for advancing RIP - * prior to injecting the event). + * to L1, take it verbatim from vmcb12. + * + * If nrips is supported in hardware but not exposed to L1, stuff the + * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is + * responsible for advancing RIP prior to injecting the event). This is + * only the case for the first L2 run after VMRUN. After that (e.g. 
+ * during save/restore), NextRIP is updated by the CPU and/or KVM, and + * the value of the L2 RIP from vmcb12 should not be used. */ - if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) - vmcb02->control.next_rip = svm->nested.ctl.next_rip; - else if (boot_cpu_has(X86_FEATURE_NRIPS)) - vmcb02->control.next_rip = vmcb12_rip; + if (boot_cpu_has(X86_FEATURE_NRIPS)) { + if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || + !svm->nested.nested_run_pending) + vmcb02->control.next_rip = svm->nested.ctl.next_rip; + else + vmcb02->control.next_rip = vmcb12_rip; + } svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj); if (is_evtinj_soft(vmcb02->control.event_inj)) { svm->soft_int_injected = true; svm->soft_int_csbase = vmcb12_csbase; svm->soft_int_old_rip = vmcb12_rip; - if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) + if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || + !svm->nested.nested_run_pending) svm->soft_int_next_rip = svm->nested.ctl.next_rip; else svm->soft_int_next_rip = vmcb12_rip; From 58f5d8eebd5c6b0c9377391d6b7bf9d321e014cc Mon Sep 17 00:00:00 2001 From: Ackerley Tng Date: Fri, 20 Feb 2026 23:54:35 +0000 Subject: [PATCH 067/373] KVM: selftests: Wrap madvise() to assert success Extend kvm_syscalls.h to wrap madvise() to assert success. This will be used in the next patch. 
Signed-off-by: Ackerley Tng Reviewed-by: David Hildenbrand (Arm) Link: https://patch.msgid.link/455483ca29a3a3042efee0cf3bbd0e2548cbeb1c.1771630983.git.ackerleytng@google.com Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/include/kvm_syscalls.h | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/kvm/include/kvm_syscalls.h b/tools/testing/selftests/kvm/include/kvm_syscalls.h index d4e613162bba..843c9904c46f 100644 --- a/tools/testing/selftests/kvm/include/kvm_syscalls.h +++ b/tools/testing/selftests/kvm/include/kvm_syscalls.h @@ -77,5 +77,6 @@ __KVM_SYSCALL_DEFINE(munmap, 2, void *, mem, size_t, size); __KVM_SYSCALL_DEFINE(close, 1, int, fd); __KVM_SYSCALL_DEFINE(fallocate, 4, int, fd, int, mode, loff_t, offset, loff_t, len); __KVM_SYSCALL_DEFINE(ftruncate, 2, unsigned int, fd, off_t, length); +__KVM_SYSCALL_DEFINE(madvise, 3, void *, addr, size_t, length, int, advice); #endif /* SELFTEST_KVM_SYSCALLS_H */ From 9830209b4ae8c8eecae7e6af271cebf1e1285142 Mon Sep 17 00:00:00 2001 From: Ackerley Tng Date: Fri, 20 Feb 2026 23:54:36 +0000 Subject: [PATCH 068/373] KVM: selftests: Test MADV_COLLAPSE on guest_memfd guest_memfd only supports PAGE_SIZE pages, and khugepaged or MADV_COLLAPSE collapsing pages may result in private memory regions being mapped into host page tables. Add test to verify that MADV_COLLAPSE fails on guest_memfd folios, and any subsequent usage of guest_memfd memory faults in PAGE_SIZE folios. Running this test should not result in any memory failure logs or kernel WARNings. This selftest was added as a result of a syzbot-reported issue where khugepaged operating on guest_memfd memory with MADV_HUGEPAGE caused the collapse of folios, which then subsequently resulted in a WARNing. 
Link: https://syzkaller.appspot.com/bug?extid=33a04338019ac7e43a44 Suggested-by: David Hildenbrand Signed-off-by: Ackerley Tng Link: https://patch.msgid.link/8048d04f150326d1e2231318aa9f1b3fce3e2e2c.1771630983.git.ackerleytng@google.com Signed-off-by: Sean Christopherson --- .../testing/selftests/kvm/guest_memfd_test.c | 70 ++++++++++++++++++- 1 file changed, 67 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c index 618c937f3c90..24d20372ab2d 100644 --- a/tools/testing/selftests/kvm/guest_memfd_test.c +++ b/tools/testing/selftests/kvm/guest_memfd_test.c @@ -171,6 +171,64 @@ static void test_numa_allocation(int fd, size_t total_size) kvm_munmap(mem, total_size); } +static void test_collapse(int fd, uint64_t flags) +{ + const size_t pmd_size = get_trans_hugepagesz(); + void *reserved_addr; + void *aligned_addr; + char *mem; + off_t i; + + /* + * To even reach the point where the guest_memfd folios will + * get collapsed, both the userspace address and the offset + * within the guest_memfd have to be aligned to pmd_size. + * + * To achieve that alignment, reserve virtual address space + * with regular mmap, then use MAP_FIXED to allocate memory + * from a pmd_size-aligned offset (0) at a known, available + * virtual address. + */ + reserved_addr = kvm_mmap(pmd_size * 2, PROT_NONE, + MAP_PRIVATE | MAP_ANONYMOUS, -1); + aligned_addr = align_ptr_up(reserved_addr, pmd_size); + + mem = mmap(aligned_addr, pmd_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_SHARED, fd, 0); + TEST_ASSERT(IS_ALIGNED((u64)mem, pmd_size), + "Userspace address must be aligned to PMD size."); + + /* + * Use reads to populate page table to avoid setting dirty + * flag on page. + */ + for (i = 0; i < pmd_size; i += getpagesize()) + READ_ONCE(mem[i]); + + /* + * Advising the use of huge pages in guest_memfd should be + * fine... + */ + kvm_madvise(mem, pmd_size, MADV_HUGEPAGE); + + /* + * ... 
but collapsing folios must not be supported to avoid + * mapping beyond shared ranges into host userspace page + * tables. + */ + TEST_ASSERT_EQ(madvise(mem, pmd_size, MADV_COLLAPSE), -1); + TEST_ASSERT_EQ(errno, EINVAL); + + /* + * Removing from host page tables and re-faulting should be + * fine; should not end up faulting in a collapsed/huge folio. + */ + kvm_madvise(mem, pmd_size, MADV_DONTNEED); + READ_ONCE(mem[0]); + + kvm_munmap(reserved_addr, pmd_size * 2); +} + static void test_fault_sigbus(int fd, size_t accessible_size, size_t map_size) { const char val = 0xaa; @@ -350,14 +408,17 @@ static void test_guest_memfd_flags(struct kvm_vm *vm) } } -#define gmem_test(__test, __vm, __flags) \ +#define __gmem_test(__test, __vm, __flags, __gmem_size) \ do { \ - int fd = vm_create_guest_memfd(__vm, page_size * 4, __flags); \ + int fd = vm_create_guest_memfd(__vm, __gmem_size, __flags); \ \ - test_##__test(fd, page_size * 4); \ + test_##__test(fd, __gmem_size); \ close(fd); \ } while (0) +#define gmem_test(__test, __vm, __flags) \ + __gmem_test(__test, __vm, __flags, page_size * 4) + static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags) { test_create_guest_memfd_multiple(vm); @@ -367,9 +428,12 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags) if (flags & GUEST_MEMFD_FLAG_MMAP) { if (flags & GUEST_MEMFD_FLAG_INIT_SHARED) { + size_t pmd_size = get_trans_hugepagesz(); + gmem_test(mmap_supported, vm, flags); gmem_test(fault_overflow, vm, flags); gmem_test(numa_allocation, vm, flags); + __gmem_test(collapse, vm, flags, pmd_size); } else { gmem_test(fault_private, vm, flags); } From a0592461f39c00b28f552fe842a063a00043eaa8 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:48 +0000 Subject: [PATCH 069/373] KVM: nSVM: Delay stuffing L2's current RIP into NextRIP until vCPU run For guests with NRIPS disabled, L1 does not provide NextRIP when running an L2 with an injected soft interrupt, instead it advances L2's RIP before 
running it. KVM uses L2's current RIP as the NextRIP in vmcb02 to emulate a CPU without NRIPS. However, in svm_set_nested_state(), the value used for L2's current RIP comes from vmcb02, which is just whatever the vCPU had in vmcb02 before restoring nested state (zero on a freshly created vCPU). Passing the cached RIP value instead (i.e. kvm_rip_read()) would only fix the issue if registers are restored before nested state. Instead, split the logic of setting NextRIP in vmcb02. Handle the 'normal' case of initializing vmcb02's NextRIP using NextRIP from vmcb12 (or KVM_GET_NESTED_STATE's payload) in nested_vmcb02_prepare_control(). Delay the special case of stuffing L2's current RIP into vmcb02's NextRIP until shortly before the vCPU is run, to make sure the most up-to-date value of RIP is used regardless of KVM_SET_REGS and KVM_SET_NESTED_STATE's relative ordering. Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-7-yosry@kernel.org [sean: use new helper, svm_fixup_nested_rips()] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 25 ++++++++----------------- arch/x86/kvm/svm/svm.c | 25 +++++++++++++++++++++++++ 2 files changed, 33 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 1cc083f95e6a..76d959d15e14 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -845,24 +845,15 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm, vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err; /* - * NextRIP is consumed on VMRUN as the return address pushed on the - * stack for injected soft exceptions/interrupts. If nrips is exposed - * to L1, take it verbatim from vmcb12. 
- * - * If nrips is supported in hardware but not exposed to L1, stuff the - * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is - * responsible for advancing RIP prior to injecting the event). This is - * only the case for the first L2 run after VMRUN. After that (e.g. - * during save/restore), NextRIP is updated by the CPU and/or KVM, and - * the value of the L2 RIP from vmcb12 should not be used. + * If nrips is exposed to L1, take NextRIP as-is. Otherwise, L1 + * advances L2's RIP before VMRUN instead of using NextRIP. KVM will + * stuff the current RIP as vmcb02's NextRIP before L2 is run. After + * the first run of L2 (e.g. after save+restore), NextRIP is updated by + * the CPU and/or KVM and should be used regardless of L1's support. */ - if (boot_cpu_has(X86_FEATURE_NRIPS)) { - if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || - !svm->nested.nested_run_pending) - vmcb02->control.next_rip = svm->nested.ctl.next_rip; - else - vmcb02->control.next_rip = vmcb12_rip; - } + if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || + !svm->nested.nested_run_pending) + vmcb02->control.next_rip = svm->nested.ctl.next_rip; svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj); if (is_evtinj_soft(vmcb02->control.event_inj)) { diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 07f096758f34..f862bafc381a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3738,6 +3738,29 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected) svm->vmcb->control.event_inj = intr->nr | SVM_EVTINJ_VALID | type; } +static void svm_fixup_nested_rips(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (!is_guest_mode(vcpu) || !svm->nested.nested_run_pending) + return; + + /* + * If nrips is supported in hardware but not exposed to L1, stuff the + * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is + * responsible for advancing RIP prior to injecting the event). 
Once L2 + * runs after L1 executes VMRUN, NextRIP is updated by the CPU and/or + * KVM, and this is no longer needed. + * + * This is done here (as opposed to when preparing vmcb02) to use the + * most up-to-date value of RIP regardless of the order of restoring + * registers and nested state in the vCPU save+restore path. + */ + if (boot_cpu_has(X86_FEATURE_NRIPS) && + !guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) + svm->vmcb->control.next_rip = kvm_rip_read(vcpu); +} + void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode, int trig_mode, int vector) { @@ -4334,6 +4357,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) kvm_register_is_dirty(vcpu, VCPU_EXREG_ERAPS)) svm->vmcb->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP; + svm_fixup_nested_rips(vcpu); + svm_hv_update_vp_id(svm->vmcb, vcpu); /* From c64bc6ed1764c1b7e3c0017019f743196074092f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 4 Mar 2026 16:06:56 -0800 Subject: [PATCH 070/373] KVM: nSVM: Delay setting soft IRQ RIP tracking fields until vCPU run In the save+restore path, when restoring nested state, the values of RIP and CS base passed into nested_vmcb02_prepare_control() are mostly incorrect. They are both pulled from the vmcb02. For CS base, the value is only correct if system regs are restored before nested state. The value of RIP is whatever the vCPU had in vmcb02 before restoring nested state (zero on a freshly created vCPU). Instead, take a similar approach to NextRIP, and delay initializing the RIP tracking fields until shortly before the vCPU is run, to make sure the most up-to-date values of RIP and CS base are used regardless of KVM_SET_SREGS, KVM_SET_REGS, and KVM_SET_NESTED_STATE's relative ordering. 
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-8-yosry@kernel.org [sean: deal with the svm_cancel_injection() madness] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 17 ++++++++--------- arch/x86/kvm/svm/svm.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 37 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 76d959d15e14..3e2841598a36 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -742,9 +742,7 @@ static bool is_evtinj_nmi(u32 evtinj) return type == SVM_EVTINJ_TYPE_NMI; } -static void nested_vmcb02_prepare_control(struct vcpu_svm *svm, - unsigned long vmcb12_rip, - unsigned long vmcb12_csbase) +static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) { u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK; u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK; @@ -856,15 +854,16 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm, vmcb02->control.next_rip = svm->nested.ctl.next_rip; svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj); + + /* + * soft_int_csbase, soft_int_old_rip, and soft_int_next_rip (if L1 + * doesn't have NRIPS) are initialized later, before the vCPU is run. 
+ */ if (is_evtinj_soft(vmcb02->control.event_inj)) { svm->soft_int_injected = true; - svm->soft_int_csbase = vmcb12_csbase; - svm->soft_int_old_rip = vmcb12_rip; if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || !svm->nested.nested_run_pending) svm->soft_int_next_rip = svm->nested.ctl.next_rip; - else - svm->soft_int_next_rip = vmcb12_rip; } /* LBR_CTL_ENABLE_MASK is controlled by svm_update_lbrv() */ @@ -962,7 +961,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr); svm_switch_vmcb(svm, &svm->nested.vmcb02); - nested_vmcb02_prepare_control(svm, vmcb12->save.rip, vmcb12->save.cs.base); + nested_vmcb02_prepare_control(svm); nested_vmcb02_prepare_save(svm, vmcb12); ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3, @@ -1907,7 +1906,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, nested_copy_vmcb_control_to_cache(svm, ctl); svm_switch_vmcb(svm, &svm->nested.vmcb02); - nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip, svm->vmcb->save.cs.base); + nested_vmcb02_prepare_control(svm); /* * Any previously restored state (e.g. 
KVM_SET_SREGS) would mark fields diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index f862bafc381a..d82e30c40eaa 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3637,6 +3637,16 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) return svm_invoke_exit_handler(vcpu, svm->vmcb->control.exit_code); } +static void svm_set_nested_run_soft_int_state(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm->soft_int_csbase = svm->vmcb->save.cs.base; + svm->soft_int_old_rip = kvm_rip_read(vcpu); + if (!guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) + svm->soft_int_next_rip = kvm_rip_read(vcpu); +} + static int pre_svm_run(struct kvm_vcpu *vcpu) { struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, vcpu->cpu); @@ -3759,6 +3769,13 @@ static void svm_fixup_nested_rips(struct kvm_vcpu *vcpu) if (boot_cpu_has(X86_FEATURE_NRIPS) && !guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) svm->vmcb->control.next_rip = kvm_rip_read(vcpu); + + /* + * Similarly, initialize the soft int metadata here to use the most + * up-to-date values of RIP and CS base, regardless of restore order. + */ + if (svm->soft_int_injected) + svm_set_nested_run_soft_int_state(vcpu); } void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode, @@ -4128,6 +4145,18 @@ static void svm_complete_soft_interrupt(struct kvm_vcpu *vcpu, u8 vector, bool is_soft = (type == SVM_EXITINTINFO_TYPE_SOFT); struct vcpu_svm *svm = to_svm(vcpu); + /* + * Initialize the soft int fields *before* reading them below if KVM + * aborted entry to the guest with a nested VMRUN pending. To ensure + * KVM uses up-to-date values for RIP and CS base across save/restore, + * regardless of restore order, KVM waits to set the soft int fields + * until VMRUN is imminent. But when canceling injection, KVM requeues + * the soft int and will reinject it via the standard injection flow, + * and so KVM needs to grab the state from the pending nested VMRUN. 
+ */ + if (is_guest_mode(vcpu) && svm->nested.nested_run_pending) + svm_set_nested_run_soft_int_state(vcpu); + /* * If NRIPS is enabled, KVM must snapshot the pre-VMRUN next_rip that's * associated with the original soft exception/interrupt. next_rip is From d99df02ff427f461102230f9c5b90a6c64ee8e23 Mon Sep 17 00:00:00 2001 From: Kevin Cheng Date: Sat, 28 Feb 2026 03:33:26 +0000 Subject: [PATCH 071/373] KVM: SVM: Inject #UD for INVLPGA if EFER.SVME=0 INVLPGA should cause a #UD when EFER.SVME is not set. Add a check to properly inject #UD when EFER.SVME=0. Fixes: ff092385e828 ("KVM: SVM: Implement INVLPGA") Cc: stable@vger.kernel.org Signed-off-by: Kevin Cheng Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260228033328.2285047-3-chengkev@google.com [sean: tag for stable@] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index d82e30c40eaa..543f9f3f966e 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2367,6 +2367,9 @@ static int invlpga_interception(struct kvm_vcpu *vcpu) gva_t gva = kvm_rax_read(vcpu); u32 asid = kvm_rcx_read(vcpu); + if (nested_svm_check_permissions(vcpu)) + return 1; + /* FIXME: Handle an address size prefix. */ if (!is_long_mode(vcpu)) gva = (u32)gva; From b53ab5167a81537777ac780bbd93d32613aa3bda Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:33:55 +0000 Subject: [PATCH 072/373] KVM: nSVM: Avoid clearing VMCB_LBR in vmcb12 svm_copy_lbrs() always marks VMCB_LBR dirty in the destination VMCB. However, nested_svm_vmexit() uses it to copy LBRs to vmcb12, and clearing clean bits in vmcb12 is not architecturally defined. Move vmcb_mark_dirty() to callers and drop it for vmcb12. This also facilitates incoming refactoring that does not pass the entire VMCB to svm_copy_lbrs(). 
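For context, VMCB clean bits can be modeled with a toy sketch (the bit index and struct layout here are illustrative, not the architectural encoding): each set bit tells hardware it may reuse cached state for that field group, and marking a field dirty clears the corresponding bit. This is why KVM must not clear bits in L1's vmcb12, whose clean-bit semantics belong to L1:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of VMCB clean bits (bit index is illustrative). */
#define TOY_VMCB_LBR 10

struct toy_vmcb {
	uint32_t clean; /* set bit => hardware may use cached state */
};

/* Clear the clean bit so hardware reloads that field group. */
static void toy_mark_dirty(struct toy_vmcb *vmcb, int bit)
{
	vmcb->clean &= ~(1u << bit);
}
```

Marking dirty only ever clears bits; it never touches the other field groups, which is what makes hoisting the call to the callers (and omitting it for vmcb12) a safe, local change.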
Fixes: d20c796ca370 ("KVM: x86: nSVM: implement nested LBR virtualization") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-2-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 7 +++++-- arch/x86/kvm/svm/svm.c | 2 -- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 3e2841598a36..0a35c815f4d2 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -715,6 +715,7 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 } else { svm_copy_lbrs(vmcb02, vmcb01); } + vmcb_mark_dirty(vmcb02, VMCB_LBR); svm_update_lbrv(&svm->vcpu); } @@ -1231,10 +1232,12 @@ int nested_svm_vmexit(struct vcpu_svm *svm) kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && - (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) + (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) { svm_copy_lbrs(vmcb12, vmcb02); - else + } else { svm_copy_lbrs(vmcb01, vmcb02); + vmcb_mark_dirty(vmcb01, VMCB_LBR); + } svm_update_lbrv(vcpu); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 543f9f3f966e..9b4f5a46d550 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -848,8 +848,6 @@ void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb) to_vmcb->save.br_to = from_vmcb->save.br_to; to_vmcb->save.last_excp_from = from_vmcb->save.last_excp_from; to_vmcb->save.last_excp_to = from_vmcb->save.last_excp_to; - - vmcb_mark_dirty(to_vmcb, VMCB_LBR); } static void __svm_enable_lbrv(struct kvm_vcpu *vcpu) From 361dbe8173c460a2bf8aee23920f6c2dbdcabb94 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:33:56 +0000 Subject: [PATCH 073/373] KVM: SVM: Switch svm_copy_lbrs() to a macro In preparation for using svm_copy_lbrs() with 'struct vmcb_save_area' without a containing 'struct vmcb', and later even 'struct 
vmcb_save_area_cached', make it a macro. Macros are generally not preferred compared to functions, mainly due to type-safety. However, in this case it seems like having a simple macro copying a few fields is better than copy-pasting the same 5 lines of code in different places. Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-3-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 8 ++++---- arch/x86/kvm/svm/svm.c | 9 --------- arch/x86/kvm/svm/svm.h | 10 +++++++++- 3 files changed, 13 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 0a35c815f4d2..9c64d036e30b 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -710,10 +710,10 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 * Reserved bits of DEBUGCTL are ignored. Be consistent with * svm_set_msr's definition of reserved bits. */ - svm_copy_lbrs(vmcb02, vmcb12); + svm_copy_lbrs(&vmcb02->save, &vmcb12->save); vmcb02->save.dbgctl &= ~DEBUGCTL_RESERVED_BITS; } else { - svm_copy_lbrs(vmcb02, vmcb01); + svm_copy_lbrs(&vmcb02->save, &vmcb01->save); } vmcb_mark_dirty(vmcb02, VMCB_LBR); svm_update_lbrv(&svm->vcpu); @@ -1233,9 +1233,9 @@ int nested_svm_vmexit(struct vcpu_svm *svm) if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) { - svm_copy_lbrs(vmcb12, vmcb02); + svm_copy_lbrs(&vmcb12->save, &vmcb02->save); } else { - svm_copy_lbrs(vmcb01, vmcb02); + svm_copy_lbrs(&vmcb01->save, &vmcb02->save); vmcb_mark_dirty(vmcb01, VMCB_LBR); } diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 9b4f5a46d550..7170f2f623af 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -841,15 +841,6 @@ static void svm_recalc_msr_intercepts(struct kvm_vcpu *vcpu) */ } -void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb) -{ - to_vmcb->save.dbgctl = 
from_vmcb->save.dbgctl; - to_vmcb->save.br_from = from_vmcb->save.br_from; - to_vmcb->save.br_to = from_vmcb->save.br_to; - to_vmcb->save.last_excp_from = from_vmcb->save.last_excp_from; - to_vmcb->save.last_excp_to = from_vmcb->save.last_excp_to; -} - static void __svm_enable_lbrv(struct kvm_vcpu *vcpu) { to_svm(vcpu)->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK; diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index ebd7b36b1ceb..44d767cd1d25 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -713,8 +713,16 @@ static inline void *svm_vcpu_alloc_msrpm(void) return svm_alloc_permissions_map(MSRPM_SIZE, GFP_KERNEL_ACCOUNT); } +#define svm_copy_lbrs(to, from) \ +do { \ + (to)->dbgctl = (from)->dbgctl; \ + (to)->br_from = (from)->br_from; \ + (to)->br_to = (from)->br_to; \ + (to)->last_excp_from = (from)->last_excp_from; \ + (to)->last_excp_to = (from)->last_excp_to; \ +} while (0) + void svm_vcpu_free_msrpm(void *msrpm); -void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb); void svm_enable_lbrv(struct kvm_vcpu *vcpu); void svm_update_lbrv(struct kvm_vcpu *vcpu); From 3700f0788da6acf73b2df56690f4b201aa4aefd2 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:33:57 +0000 Subject: [PATCH 074/373] KVM: SVM: Add missing save/restore handling of LBR MSRs MSR_IA32_DEBUGCTLMSR and LBR MSRs are currently not enumerated by KVM_GET_MSR_INDEX_LIST, and LBR MSRs cannot be set with KVM_SET_MSRS. So save/restore is completely broken. Fix it by adding the MSRs to msrs_to_save_base, and allowing writes to LBR MSRs from userspace only (as they are read-only MSRs) if LBR virtualization is enabled. Additionally, to correctly restore L1's LBRs while L2 is running, make sure the LBRs are copied from the captured VMCB01 save area in svm_copy_vmrun_state(). Note, for VMX, this also fixes a flaw where MSR_IA32_DEBUGCTLMSR isn't reported as an MSR to save/restore. 
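The write-gating the patch applies to the read-only LBR MSRs can be sketched as a toy model (the names and return codes are illustrative, not KVM's actual plumbing): writes are rejected as unsupported when LBR virtualization is off, #GP for guest-initiated writes, and accepted only for host-initiated (userspace restore) writes:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of gating writes to a guest-read-only MSR (not KVM code). */
enum { TOY_OK = 0, TOY_GP = 1, TOY_UNSUPPORTED = 2 };

static int toy_set_lbr_msr(int lbrv_enabled, int host_initiated,
			   uint64_t *field, uint64_t data)
{
	if (!lbrv_enabled)
		return TOY_UNSUPPORTED;	/* MSR not exposed at all */
	if (!host_initiated)
		return TOY_GP;		/* guest writes fault: MSR is read-only */
	*field = data;			/* userspace restore succeeds */
	return TOY_OK;
}
```

Read-side handling mirrors this: reads return the VMCB value only when LBR virtualization is enabled, and 0 otherwise, as in the `lbrv ? ... : 0` pattern in the diff below.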
Note #2, over-reporting MSR_IA32_LASTxxx on Intel is ok, as KVM already handles unsupported reads and writes thanks to commit b5e2fec0ebc3 ("KVM: Ignore DEBUGCTL MSRs with no effect") (kvm_do_msr_access() will morph the unsupported userspace write into a nop). Fixes: 24e09cbf480a ("KVM: SVM: enable LBR virtualization") Cc: stable@vger.kernel.org Reported-by: Jim Mattson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-4-yosry@kernel.org [sean: guard with lbrv checks, massage changelog] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 5 +++++ arch/x86/kvm/svm/svm.c | 42 ++++++++++++++++++++++++++++++++++----- arch/x86/kvm/x86.c | 3 +++ 3 files changed, 45 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 9c64d036e30b..2b1066ce23f5 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1099,6 +1099,11 @@ void svm_copy_vmrun_state(struct vmcb_save_area *to_save, to_save->isst_addr = from_save->isst_addr; to_save->ssp = from_save->ssp; } + + if (kvm_cpu_cap_has(X86_FEATURE_LBRV)) { + svm_copy_lbrs(to_save, from_save); + to_save->dbgctl &= ~DEBUGCTL_RESERVED_BITS; + } } void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7170f2f623af..e97c56df41f6 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2787,19 +2787,19 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) msr_info->data = svm->tsc_aux; break; case MSR_IA32_DEBUGCTLMSR: - msr_info->data = svm->vmcb->save.dbgctl; + msr_info->data = lbrv ? svm->vmcb->save.dbgctl : 0; break; case MSR_IA32_LASTBRANCHFROMIP: - msr_info->data = svm->vmcb->save.br_from; + msr_info->data = lbrv ? svm->vmcb->save.br_from : 0; break; case MSR_IA32_LASTBRANCHTOIP: - msr_info->data = svm->vmcb->save.br_to; + msr_info->data = lbrv ? 
svm->vmcb->save.br_to : 0; break; case MSR_IA32_LASTINTFROMIP: - msr_info->data = svm->vmcb->save.last_excp_from; + msr_info->data = lbrv ? svm->vmcb->save.last_excp_from : 0; break; case MSR_IA32_LASTINTTOIP: - msr_info->data = svm->vmcb->save.last_excp_to; + msr_info->data = lbrv ? svm->vmcb->save.last_excp_to : 0; break; case MSR_VM_HSAVE_PA: msr_info->data = svm->nested.hsave_msr; @@ -3074,6 +3074,38 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) vmcb_mark_dirty(svm->vmcb, VMCB_LBR); svm_update_lbrv(vcpu); break; + case MSR_IA32_LASTBRANCHFROMIP: + if (!lbrv) + return KVM_MSR_RET_UNSUPPORTED; + if (!msr->host_initiated) + return 1; + svm->vmcb->save.br_from = data; + vmcb_mark_dirty(svm->vmcb, VMCB_LBR); + break; + case MSR_IA32_LASTBRANCHTOIP: + if (!lbrv) + return KVM_MSR_RET_UNSUPPORTED; + if (!msr->host_initiated) + return 1; + svm->vmcb->save.br_to = data; + vmcb_mark_dirty(svm->vmcb, VMCB_LBR); + break; + case MSR_IA32_LASTINTFROMIP: + if (!lbrv) + return KVM_MSR_RET_UNSUPPORTED; + if (!msr->host_initiated) + return 1; + svm->vmcb->save.last_excp_from = data; + vmcb_mark_dirty(svm->vmcb, VMCB_LBR); + break; + case MSR_IA32_LASTINTTOIP: + if (!lbrv) + return KVM_MSR_RET_UNSUPPORTED; + if (!msr->host_initiated) + return 1; + svm->vmcb->save.last_excp_to = data; + vmcb_mark_dirty(svm->vmcb, VMCB_LBR); + break; case MSR_VM_HSAVE_PA: /* * Old kernels did not validate the value written to diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6e87ec52fa06..64da02d1ee00 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -351,6 +351,9 @@ static const u32 msrs_to_save_base[] = { MSR_IA32_U_CET, MSR_IA32_S_CET, MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP, MSR_IA32_PL2_SSP, MSR_IA32_PL3_SSP, MSR_IA32_INT_SSP_TAB, + MSR_IA32_DEBUGCTLMSR, + MSR_IA32_LASTBRANCHFROMIP, MSR_IA32_LASTBRANCHTOIP, + MSR_IA32_LASTINTFROMIP, MSR_IA32_LASTINTTOIP, }; static const u32 msrs_to_save_pmu[] = { From ac17892e51525ccea892b7e3171e2d1e9bb6fa61 Mon Sep 17 
00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:33:58 +0000 Subject: [PATCH 075/373] KVM: selftests: Add a test for LBR save/restore (ft. nested) Add a selftest exercising save/restore with usage of LBRs in both L1 and L2, and making sure all LBRs remain intact. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-5-yosry@kernel.org Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile.kvm | 1 + .../selftests/kvm/include/x86/processor.h | 5 + .../selftests/kvm/x86/svm_lbr_nested_state.c | 145 ++++++++++++++++++ 3 files changed, 151 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index fdec90e85467..36b48e766e49 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -112,6 +112,7 @@ TEST_GEN_PROGS_x86 += x86/svm_vmcall_test TEST_GEN_PROGS_x86 += x86/svm_int_ctl_test TEST_GEN_PROGS_x86 += x86/svm_nested_shutdown_test TEST_GEN_PROGS_x86 += x86/svm_nested_soft_inject_test +TEST_GEN_PROGS_x86 += x86/svm_lbr_nested_state TEST_GEN_PROGS_x86 += x86/tsc_scaling_sync TEST_GEN_PROGS_x86 += x86/sync_regs_test TEST_GEN_PROGS_x86 += x86/ucna_injection_test diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h index 4ebae4269e68..db0171935197 100644 --- a/tools/testing/selftests/kvm/include/x86/processor.h +++ b/tools/testing/selftests/kvm/include/x86/processor.h @@ -1360,6 +1360,11 @@ static inline bool kvm_is_ignore_msrs(void) return get_kvm_param_bool("ignore_msrs"); } +static inline bool kvm_is_lbrv_enabled(void) +{ + return !!get_kvm_amd_param_integer("lbrv"); +} + uint64_t *vm_get_pte(struct kvm_vm *vm, uint64_t vaddr); uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, uint64_t a1, uint64_t a2, diff --git a/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c 
b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c new file mode 100644 index 000000000000..bf16abb1152e --- /dev/null +++ b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c @@ -0,0 +1,145 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2026, Google, Inc. + */ + +#include "test_util.h" +#include "kvm_util.h" +#include "processor.h" +#include "svm_util.h" + + +#define L2_GUEST_STACK_SIZE 64 + +#define DO_BRANCH() do { asm volatile("jmp 1f\n 1: nop"); } while (0) + +struct lbr_branch { + u64 from, to; +}; + +volatile struct lbr_branch l2_branch; + +#define RECORD_AND_CHECK_BRANCH(b) \ +do { \ + wrmsr(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR); \ + DO_BRANCH(); \ + (b)->from = rdmsr(MSR_IA32_LASTBRANCHFROMIP); \ + (b)->to = rdmsr(MSR_IA32_LASTBRANCHTOIP); \ + /* Disable LBR right after to avoid overriding the IPs */ \ + wrmsr(MSR_IA32_DEBUGCTLMSR, 0); \ + \ + GUEST_ASSERT_NE((b)->from, 0); \ + GUEST_ASSERT_NE((b)->to, 0); \ +} while (0) + +#define CHECK_BRANCH_MSRS(b) \ +do { \ + GUEST_ASSERT_EQ((b)->from, rdmsr(MSR_IA32_LASTBRANCHFROMIP)); \ + GUEST_ASSERT_EQ((b)->to, rdmsr(MSR_IA32_LASTBRANCHTOIP)); \ +} while (0) + +#define CHECK_BRANCH_VMCB(b, vmcb) \ +do { \ + GUEST_ASSERT_EQ((b)->from, vmcb->save.br_from); \ + GUEST_ASSERT_EQ((b)->to, vmcb->save.br_to); \ +} while (0) + +static void l2_guest_code(struct svm_test_data *svm) +{ + /* Record a branch, trigger save/restore, and make sure LBRs are intact */ + RECORD_AND_CHECK_BRANCH(&l2_branch); + GUEST_SYNC(true); + CHECK_BRANCH_MSRS(&l2_branch); + vmmcall(); +} + +static void l1_guest_code(struct svm_test_data *svm, bool nested_lbrv) +{ + unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + struct vmcb *vmcb = svm->vmcb; + struct lbr_branch l1_branch; + + /* Record a branch, trigger save/restore, and make sure LBRs are intact */ + RECORD_AND_CHECK_BRANCH(&l1_branch); + GUEST_SYNC(true); + CHECK_BRANCH_MSRS(&l1_branch); + + /* Run L2, which will also do the same */ + generic_svm_setup(svm, 
l2_guest_code, + &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + if (nested_lbrv) + vmcb->control.virt_ext = LBR_CTL_ENABLE_MASK; + else + vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK; + + run_guest(vmcb, svm->vmcb_gpa); + GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL); + + /* Trigger save/restore one more time before checking, just for kicks */ + GUEST_SYNC(true); + + /* + * If LBR_CTL_ENABLE is set, L1 and L2 should have separate LBR MSRs, so + * expect L1's LBRs to remain intact and L2 LBRs to be in the VMCB. + * Otherwise, the MSRs are shared between L1 & L2 so expect L2's LBRs. + */ + if (nested_lbrv) { + CHECK_BRANCH_MSRS(&l1_branch); + CHECK_BRANCH_VMCB(&l2_branch, vmcb); + } else { + CHECK_BRANCH_MSRS(&l2_branch); + } + GUEST_DONE(); +} + +void test_lbrv_nested_state(bool nested_lbrv) +{ + struct kvm_x86_state *state = NULL; + struct kvm_vcpu *vcpu; + vm_vaddr_t svm_gva; + struct kvm_vm *vm; + struct ucall uc; + + pr_info("Testing with nested LBRV %s\n", nested_lbrv ? 
"enabled" : "disabled"); + + vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code); + vcpu_alloc_svm(vm, &svm_gva); + vcpu_args_set(vcpu, 2, svm_gva, nested_lbrv); + + for (;;) { + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); + switch (get_ucall(vcpu, &uc)) { + case UCALL_SYNC: + /* Save the vCPU state and restore it in a new VM on sync */ + pr_info("Guest triggered save/restore.\n"); + state = vcpu_save_state(vcpu); + kvm_vm_release(vm); + vcpu = vm_recreate_with_one_vcpu(vm); + vcpu_load_state(vcpu, state); + kvm_x86_state_cleanup(state); + break; + case UCALL_ABORT: + REPORT_GUEST_ASSERT(uc); + /* NOT REACHED */ + case UCALL_DONE: + goto done; + default: + TEST_FAIL("Unknown ucall %lu", uc.cmd); + } + } +done: + kvm_vm_free(vm); +} + +int main(int argc, char *argv[]) +{ + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); + TEST_REQUIRE(kvm_is_lbrv_enabled()); + + test_lbrv_nested_state(/*nested_lbrv=*/false); + test_lbrv_nested_state(/*nested_lbrv=*/true); + + return 0; +} From 01ddcdc55e097ca38c28ae656711b8e6d1df71f8 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:33:59 +0000 Subject: [PATCH 076/373] KVM: nSVM: Always inject a #GP if mapping VMCB12 fails on nested VMRUN nested_svm_vmrun() currently only injects a #GP if kvm_vcpu_map() fails with -EINVAL. But it could also fail with -EFAULT if creating a host mapping failed. Inject a #GP in all cases, no reason to treat failure modes differently. 
Fixes: 8c5fbf1a7231 ("KVM/nSVM: Use the new mapping API for mapping guest memory") CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-6-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 2b1066ce23f5..7a472d7c6e98 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1010,12 +1010,9 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) } vmcb12_gpa = svm->vmcb->save.rax; - ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map); - if (ret == -EINVAL) { + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) { kvm_inject_gp(vcpu, 0); return 1; - } else if (ret) { - return kvm_skip_emulated_instruction(vcpu); } ret = kvm_skip_emulated_instruction(vcpu); From 290c8d82023ab0e1d2782d37136541e017174d7c Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:00 +0000 Subject: [PATCH 077/373] KVM: nSVM: Refactor checking LBRV enablement in vmcb12 into a helper Refactor the vCPU cap and vmcb12 flag checks into a helper. The unlikely() annotation is dropped, it's unlikely (huh) to make a difference and the CPU will probably predict it better on its own. 
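The helper's condition can be modeled in userspace as follows; `struct fake_vcpu` and its fields are illustrative stand-ins for guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) and svm->nested.ctl.virt_ext (bit 0 of virt_ext is the LBR virtualization enable, matching LBR_CTL_ENABLE_MASK):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LBR_CTL_ENABLE_MASK (1ULL << 0)

struct fake_vcpu {
	bool has_lbrv_cap;	/* guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) */
	uint64_t virt_ext;	/* svm->nested.ctl.virt_ext */
};

/*
 * Mirrors nested_vmcb12_has_lbrv(): nested LBR virtualization is in effect
 * only when the vCPU capability and the vmcb12 enable bit are both set.
 */
static bool nested_vmcb12_has_lbrv(const struct fake_vcpu *v)
{
	return v->has_lbrv_cap && (v->virt_ext & LBR_CTL_ENABLE_MASK);
}
```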
CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-7-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 7a472d7c6e98..d419fd516fa9 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -640,6 +640,12 @@ void nested_vmcb02_compute_g_pat(struct vcpu_svm *svm) svm->nested.vmcb02.ptr->save.g_pat = svm->vmcb01.ptr->save.g_pat; } +static bool nested_vmcb12_has_lbrv(struct kvm_vcpu *vcpu) +{ + return guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && + (to_svm(vcpu)->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK); +} + static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12) { bool new_vmcb12 = false; @@ -704,8 +710,7 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 vmcb_mark_dirty(vmcb02, VMCB_DR); } - if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && - (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) { + if (nested_vmcb12_has_lbrv(vcpu)) { /* * Reserved bits of DEBUGCTL are ignored. Be consistent with * svm_set_msr's definition of reserved bits. 
@@ -1233,8 +1238,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm) if (!nested_exit_on_intr(svm)) kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); - if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && - (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) { + if (nested_vmcb12_has_lbrv(vcpu)) { svm_copy_lbrs(&vmcb12->save, &vmcb02->save); } else { svm_copy_lbrs(&vmcb01->save, &vmcb02->save); From dcf3648ab71437b504abbfdc4e74622a0f1a56e3 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:01 +0000 Subject: [PATCH 078/373] KVM: nSVM: Refactor writing vmcb12 on nested #VMEXIT as a helper Move mapping vmcb12 and updating it out of nested_svm_vmexit() into a helper, no functional change intended. CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-8-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 77 ++++++++++++++++++++++----------------- 1 file changed, 44 insertions(+), 33 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index d419fd516fa9..8c01916cb154 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1124,36 +1124,20 @@ void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb) to_vmcb->save.sysenter_eip = from_vmcb->save.sysenter_eip; } -int nested_svm_vmexit(struct vcpu_svm *svm) +static int nested_svm_vmexit_update_vmcb12(struct kvm_vcpu *vcpu) { - struct kvm_vcpu *vcpu = &svm->vcpu; - struct vmcb *vmcb01 = svm->vmcb01.ptr; + struct vcpu_svm *svm = to_svm(vcpu); struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; - struct vmcb *vmcb12; struct kvm_host_map map; + struct vmcb *vmcb12; int rc; rc = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.vmcb12_gpa), &map); - if (rc) { - if (rc == -EINVAL) - kvm_inject_gp(vcpu, 0); - return 1; - } + if (rc) + return rc; vmcb12 = map.hva; - /* Exit Guest-Mode */ - leave_guest_mode(vcpu); - svm->nested.vmcb12_gpa = 0; - 
WARN_ON_ONCE(svm->nested.nested_run_pending); - - kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); - - /* in case we halted in L2 */ - kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE); - - /* Give the current vmcb to the guest */ - vmcb12->save.es = vmcb02->save.es; vmcb12->save.cs = vmcb02->save.cs; vmcb12->save.ss = vmcb02->save.ss; @@ -1190,10 +1174,48 @@ int nested_svm_vmexit(struct vcpu_svm *svm) if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) vmcb12->control.next_rip = vmcb02->control.next_rip; + if (nested_vmcb12_has_lbrv(vcpu)) + svm_copy_lbrs(&vmcb12->save, &vmcb02->save); + vmcb12->control.int_ctl = svm->nested.ctl.int_ctl; vmcb12->control.event_inj = svm->nested.ctl.event_inj; vmcb12->control.event_inj_err = svm->nested.ctl.event_inj_err; + trace_kvm_nested_vmexit_inject(vmcb12->control.exit_code, + vmcb12->control.exit_info_1, + vmcb12->control.exit_info_2, + vmcb12->control.exit_int_info, + vmcb12->control.exit_int_info_err, + KVM_ISA_SVM); + + kvm_vcpu_unmap(vcpu, &map); + return 0; +} + +int nested_svm_vmexit(struct vcpu_svm *svm) +{ + struct kvm_vcpu *vcpu = &svm->vcpu; + struct vmcb *vmcb01 = svm->vmcb01.ptr; + struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; + int rc; + + rc = nested_svm_vmexit_update_vmcb12(vcpu); + if (rc) { + if (rc == -EINVAL) + kvm_inject_gp(vcpu, 0); + return 1; + } + + /* Exit Guest-Mode */ + leave_guest_mode(vcpu); + svm->nested.vmcb12_gpa = 0; + WARN_ON_ONCE(svm->nested.nested_run_pending); + + kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); + + /* in case we halted in L2 */ + kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE); + if (!kvm_pause_in_guest(vcpu->kvm)) { vmcb01->control.pause_filter_count = vmcb02->control.pause_filter_count; vmcb_mark_dirty(vmcb01, VMCB_INTERCEPTS); @@ -1238,9 +1260,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm) if (!nested_exit_on_intr(svm)) kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); - if (nested_vmcb12_has_lbrv(vcpu)) { - svm_copy_lbrs(&vmcb12->save, &vmcb02->save); - } else { + 
if (!nested_vmcb12_has_lbrv(vcpu)) { svm_copy_lbrs(&vmcb01->save, &vmcb02->save); vmcb_mark_dirty(vmcb01, VMCB_LBR); } @@ -1296,15 +1316,6 @@ int nested_svm_vmexit(struct vcpu_svm *svm) svm->vcpu.arch.dr7 = DR7_FIXED_1; kvm_update_dr7(&svm->vcpu); - trace_kvm_nested_vmexit_inject(vmcb12->control.exit_code, - vmcb12->control.exit_info_1, - vmcb12->control.exit_info_2, - vmcb12->control.exit_int_info, - vmcb12->control.exit_int_info_err, - KVM_ISA_SVM); - - kvm_vcpu_unmap(vcpu, &map); - nested_svm_transition_tlb_flush(vcpu); nested_svm_uninit_mmu_context(vcpu); From 1b30e7551767cb95b3e49bb169c72bbd76b56e05 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:02 +0000 Subject: [PATCH 079/373] KVM: nSVM: Triple fault if mapping VMCB12 fails on nested #VMEXIT KVM currently injects a #GP and hopes for the best if mapping VMCB12 fails on nested #VMEXIT, and only if the failure mode is -EINVAL. Mapping the VMCB12 could also fail if creating host mappings fails. After the #GP is injected, nested_svm_vmexit() bails early, without cleaning up (e.g. KVM_REQ_GET_NESTED_STATE_PAGES is set, is_guest_mode() is true, etc). Instead of optionally injecting a #GP, triple fault the guest if mapping VMCB12 fails since KVM cannot make a sane recovery. The APM states that a #VMEXIT will triple fault if host state is illegal or an exception occurs while loading host state, so the behavior is not entirely made up. Do not return early from nested_svm_vmexit(), continue cleaning up the vCPU state (e.g. switch back to vmcb01), to handle the failure as gracefully as possible. 
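The control-flow change described above — flag the fatal condition, then keep cleaning up rather than bailing early — can be sketched in userspace; `triple_fault_requested` and the step flags are illustrative stand-ins for KVM_REQ_TRIPLE_FAULT and the real vCPU teardown:

```c
#include <assert.h>
#include <stdbool.h>

struct vmexit_state {
	bool triple_fault_requested;	/* stands in for KVM_REQ_TRIPLE_FAULT */
	bool left_guest_mode;
	bool switched_to_vmcb01;
};

/*
 * update_ok models whether writing vmcb12 succeeded. Failure no longer
 * short-circuits the teardown; it only requests a triple fault.
 */
static void nested_vmexit(struct vmexit_state *s, bool update_ok)
{
	if (!update_ok)
		s->triple_fault_requested = true;

	/* Cleanup runs unconditionally so the vCPU state stays coherent. */
	s->left_guest_mode = true;
	s->switched_to_vmcb01 = true;
}
```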
Fixes: cf74a78b229d ("KVM: SVM: Add VMEXIT handler and intercepts") CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-9-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 8c01916cb154..30c99bbe9927 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1199,12 +1199,8 @@ int nested_svm_vmexit(struct vcpu_svm *svm) struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; int rc; - rc = nested_svm_vmexit_update_vmcb12(vcpu); - if (rc) { - if (rc == -EINVAL) - kvm_inject_gp(vcpu, 0); - return 1; - } + if (nested_svm_vmexit_update_vmcb12(vcpu)) + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); /* Exit Guest-Mode */ leave_guest_mode(vcpu); From 5d291ef0585ed880ed4dd71ea1a5965e0a65fb53 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:03 +0000 Subject: [PATCH 080/373] KVM: nSVM: Triple fault if restore host CR3 fails on nested #VMEXIT If loading L1's CR3 fails on a nested #VMEXIT, nested_svm_vmexit() returns an error code that is ignored by most callers, and continues to run L1 with corrupted state. A sane recovery is not possible in this case, and HW behavior is to cause a shutdown. Inject a triple fault instead, and do not return early from nested_svm_vmexit(). Continue cleaning up the vCPU state (e.g. clear pending exceptions), to handle the failure as gracefully as possible. From the APM: Upon #VMEXIT, the processor performs the following actions in order to return to the host execution context: ... if (illegal host state loaded, or exception while loading host state) shutdown else execute first host instruction following the VMRUN Remove the return value of nested_svm_vmexit(), which is mostly unchecked anyway. 
Fixes: d82aaef9c88a ("KVM: nSVM: use nested_svm_load_cr3() on guest->host switch") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-10-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 10 +++------- arch/x86/kvm/svm/svm.c | 11 ++--------- arch/x86/kvm/svm/svm.h | 6 +++--- 3 files changed, 8 insertions(+), 19 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 30c99bbe9927..5e0feeb50ba3 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1192,12 +1192,11 @@ static int nested_svm_vmexit_update_vmcb12(struct kvm_vcpu *vcpu) return 0; } -int nested_svm_vmexit(struct vcpu_svm *svm) +void nested_svm_vmexit(struct vcpu_svm *svm) { struct kvm_vcpu *vcpu = &svm->vcpu; struct vmcb *vmcb01 = svm->vmcb01.ptr; struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; - int rc; if (nested_svm_vmexit_update_vmcb12(vcpu)) kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); @@ -1316,9 +1315,8 @@ int nested_svm_vmexit(struct vcpu_svm *svm) nested_svm_uninit_mmu_context(vcpu); - rc = nested_svm_load_cr3(vcpu, vmcb01->save.cr3, false, true); - if (rc) - return 1; + if (nested_svm_load_cr3(vcpu, vmcb01->save.cr3, false, true)) + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); /* * Drop what we picked up for L2 via svm_complete_interrupts() so it @@ -1343,8 +1341,6 @@ int nested_svm_vmexit(struct vcpu_svm *svm) */ if (kvm_apicv_activated(vcpu->kvm)) __kvm_vcpu_update_apicv(vcpu); - - return 0; } static void nested_svm_triple_fault(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index e97c56df41f6..7efa71709292 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2234,13 +2234,9 @@ static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode) [SVM_INSTR_VMSAVE] = vmsave_interception, }; struct vcpu_svm *svm = to_svm(vcpu); - int ret; if (is_guest_mode(vcpu)) { - /* Returns '1' or -errno on failure, '0' on 
success. */ - ret = nested_svm_simple_vmexit(svm, guest_mode_exit_codes[opcode]); - if (ret) - return ret; + nested_svm_simple_vmexit(svm, guest_mode_exit_codes[opcode]); return 1; } return svm_instr_handlers[opcode](vcpu); @@ -4871,7 +4867,6 @@ static int svm_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram) { struct vcpu_svm *svm = to_svm(vcpu); struct kvm_host_map map_save; - int ret; if (!is_guest_mode(vcpu)) return 0; @@ -4891,9 +4886,7 @@ static int svm_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram) svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP]; svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP]; - ret = nested_svm_simple_vmexit(svm, SVM_EXIT_SW); - if (ret) - return ret; + nested_svm_simple_vmexit(svm, SVM_EXIT_SW); /* * KVM uses VMCB01 to store L1 host state while L2 runs but diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 44d767cd1d25..7629cb37c930 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -793,14 +793,14 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu); void svm_copy_vmrun_state(struct vmcb_save_area *to_save, struct vmcb_save_area *from_save); void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb); -int nested_svm_vmexit(struct vcpu_svm *svm); +void nested_svm_vmexit(struct vcpu_svm *svm); -static inline int nested_svm_simple_vmexit(struct vcpu_svm *svm, u32 exit_code) +static inline void nested_svm_simple_vmexit(struct vcpu_svm *svm, u32 exit_code) { svm->vmcb->control.exit_code = exit_code; svm->vmcb->control.exit_info_1 = 0; svm->vmcb->control.exit_info_2 = 0; - return nested_svm_vmexit(svm); + nested_svm_vmexit(svm); } int nested_svm_exit_handled(struct vcpu_svm *svm); From f85a6ce06e4a0d49652f57967a649ab09e06287c Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:04 +0000 Subject: [PATCH 081/373] KVM: nSVM: Clear GIF on nested #VMEXIT(INVALID) According to the APM, GIF is set to 0 on any #VMEXIT, including an #VMEXIT(INVALID) due to failed 
consistency checks. Clear GIF on consistency check failures. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-11-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 5e0feeb50ba3..ac7d7f82c82b 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1035,6 +1035,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) vmcb12->control.exit_code = SVM_EXIT_ERR; vmcb12->control.exit_info_1 = 0; vmcb12->control.exit_info_2 = 0; + svm_set_gif(svm, false); goto out; } From 69b721a86d0dcb026f6db7d111dcde7550442d2e Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:05 +0000 Subject: [PATCH 082/373] KVM: nSVM: Clear EVENTINJ fields in vmcb12 on nested #VMEXIT According to the APM, from the reference of the VMRUN instruction: Upon #VMEXIT, the processor performs the following actions in order to return to the host execution context: ... clear EVENTINJ field in VMCB KVM already syncs EVENTINJ fields from vmcb02 to cached vmcb12 on every L2->L0 #VMEXIT. Since these fields are zeroed by the CPU on #VMEXIT, they will mostly be zeroed in vmcb12 on nested #VMEXIT by nested_svm_vmexit(). However, this is not the case when: 1. Consistency checks fail, as nested_svm_vmexit() is not called. 2. Entering guest mode fails before L2 runs (e.g. due to failed load of CR3). (2) was broken by commit 2d8a42be0e2b ("KVM: nSVM: synchronize VMCB controls updated by the processor on every vmexit"), as prior to that nested_svm_vmexit() always zeroed EVENTINJ fields. Explicitly clear the fields in all nested #VMEXIT code paths. 
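The fix reduces to the sketch below; `struct fake_vmcb12_ctrl` is an illustrative stand-in for the two EVENTINJ-related vmcb12 control fields:

```c
#include <assert.h>
#include <stdint.h>

struct fake_vmcb12_ctrl {
	uint32_t event_inj;
	uint32_t event_inj_err;
};

/*
 * Mirrors the fixed behavior: every nested #VMEXIT path zeroes EVENTINJ in
 * vmcb12 instead of copying the cached L1 values, matching the CPU, which
 * clears EVENTINJ on #VMEXIT.
 */
static void nested_vmexit_clear_event_inj(struct fake_vmcb12_ctrl *c)
{
	c->event_inj = 0;
	c->event_inj_err = 0;
}
```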
Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Fixes: 2d8a42be0e2b ("KVM: nSVM: synchronize VMCB controls updated by the processor on every vmexit") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-12-yosry@kernel.org [sean: massage changelog formatting] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index ac7d7f82c82b..90c8bc641bf3 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1035,6 +1035,8 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) vmcb12->control.exit_code = SVM_EXIT_ERR; vmcb12->control.exit_info_1 = 0; vmcb12->control.exit_info_2 = 0; + vmcb12->control.event_inj = 0; + vmcb12->control.event_inj_err = 0; svm_set_gif(svm, false); goto out; } @@ -1178,9 +1180,9 @@ static int nested_svm_vmexit_update_vmcb12(struct kvm_vcpu *vcpu) if (nested_vmcb12_has_lbrv(vcpu)) svm_copy_lbrs(&vmcb12->save, &vmcb02->save); + vmcb12->control.event_inj = 0; + vmcb12->control.event_inj_err = 0; vmcb12->control.int_ctl = svm->nested.ctl.int_ctl; - vmcb12->control.event_inj = svm->nested.ctl.event_inj; - vmcb12->control.event_inj_err = svm->nested.ctl.event_inj_err; trace_kvm_nested_vmexit_inject(vmcb12->control.exit_code, vmcb12->control.exit_info_1, From 8998e1d012f3f45d0456f16706682cef04c3c436 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:06 +0000 Subject: [PATCH 083/373] KVM: nSVM: Clear tracking of L1->L2 NMI and soft IRQ on nested #VMEXIT KVM clears tracking of L1->L2 injected NMIs (i.e. nmi_l1_to_l2) and soft IRQs (i.e. soft_int_injected) on a synthesized #VMEXIT(INVALID) due to failed VMRUN. However, they are not explicitly cleared in other synthesized #VMEXITs. 
soft_int_injected is always cleared after the first VMRUN of L2 when completing interrupts, as any re-injection is then tracked by KVM (instead of purely in vmcb02). nmi_l1_to_l2 is not cleared after the first VMRUN if NMI injection failed, as KVM still needs to keep track that the NMI originated from L1 to avoid blocking NMIs for L1. It is only cleared when the NMI injection succeeds. KVM could synthesize a #VMEXIT to L1 before successfully injecting the NMI into L2 (e.g. due to a #NPF on L2's NMI handler in L1's NPTs). In this case, nmi_l1_to_l2 will remain true, and KVM may not correctly mask NMIs and intercept IRET when injecting an NMI into L1. Clear both nmi_l1_to_l2 and soft_int_injected in nested_svm_vmexit(), i.e. for all #VMEXITs except those that occur due to failed consistency checks, as those happen before nmi_l1_to_l2 or soft_int_injected are set. Fixes: 159fc6fa3b7d ("KVM: nSVM: Transparently handle L1 -> L2 NMI re-injection") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-13-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 90c8bc641bf3..d0037f01fb98 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1064,8 +1064,6 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) out_exit_err: svm->nested.nested_run_pending = 0; - svm->nmi_l1_to_l2 = false; - svm->soft_int_injected = false; svm->vmcb->control.exit_code = SVM_EXIT_ERR; svm->vmcb->control.exit_info_1 = 0; @@ -1321,6 +1319,10 @@ void nested_svm_vmexit(struct vcpu_svm *svm) if (nested_svm_load_cr3(vcpu, vmcb01->save.cr3, false, true)) kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + /* Drop tracking for L1->L2 injected NMIs and soft IRQs */ + svm->nmi_l1_to_l2 = false; + svm->soft_int_injected = false; + /* * Drop what we picked up for L2 via 
svm_complete_interrupts() so it * doesn't end up in L1. From b786e34cde42922dace620e6f56f0858edae2311 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:07 +0000 Subject: [PATCH 084/373] KVM: nSVM: Drop nested_vmcb_check_{save/control}() wrappers The wrappers provide little value and make it harder to see what KVM is checking in the normal flow. Drop them. Opportunistically fixup comments referring to the functions, adding '()' to make it clear it's a reference to a function. No functional change intended. Co-developed-by: Sean Christopherson Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-14-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 36 ++++++++++-------------------------- 1 file changed, 10 insertions(+), 26 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index d0037f01fb98..0d447d044101 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -339,8 +339,8 @@ static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size) kvm_vcpu_is_legal_gpa(vcpu, addr + size - 1); } -static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu, - struct vmcb_ctrl_area_cached *control) +static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, + struct vmcb_ctrl_area_cached *control) { if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN))) return false; @@ -367,8 +367,8 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu, } /* Common checks that apply to both L1 and L2 state. 
*/ -static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu, - struct vmcb_save_area_cached *save) +static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu, + struct vmcb_save_area_cached *save) { if (CC(!(save->efer & EFER_SVME))) return false; @@ -402,22 +402,6 @@ static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu, return true; } -static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb_save_area_cached *save = &svm->nested.save; - - return __nested_vmcb_check_save(vcpu, save); -} - -static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl; - - return __nested_vmcb_check_controls(vcpu, ctl); -} - /* * If a feature is not advertised to L1, clear the corresponding vmcb12 * intercept. @@ -469,7 +453,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->pause_filter_count = from->pause_filter_count; to->pause_filter_thresh = from->pause_filter_thresh; - /* Copy asid here because nested_vmcb_check_controls will check it. 
*/ + /* Copy asid here because nested_vmcb_check_controls() will check it */ to->asid = from->asid; to->msrpm_base_pa &= ~0x0fffULL; to->iopm_base_pa &= ~0x0fffULL; @@ -1030,8 +1014,8 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) nested_copy_vmcb_control_to_cache(svm, &vmcb12->control); nested_copy_vmcb_save_to_cache(svm, &vmcb12->save); - if (!nested_vmcb_check_save(vcpu) || - !nested_vmcb_check_controls(vcpu)) { + if (!nested_vmcb_check_save(vcpu, &svm->nested.save) || + !nested_vmcb_check_controls(vcpu, &svm->nested.ctl)) { vmcb12->control.exit_code = SVM_EXIT_ERR; vmcb12->control.exit_info_1 = 0; vmcb12->control.exit_info_2 = 0; @@ -1877,12 +1861,12 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, ret = -EINVAL; __nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl); - if (!__nested_vmcb_check_controls(vcpu, &ctl_cached)) + if (!nested_vmcb_check_controls(vcpu, &ctl_cached)) goto out_free; /* * Processor state contains L2 state. Check that it is - * valid for guest mode (see nested_vmcb_check_save). + * valid for guest mode (see nested_vmcb_check_save()). */ cr0 = kvm_read_cr0(vcpu); if (((cr0 & X86_CR0_CD) == 0) && (cr0 & X86_CR0_NW)) @@ -1896,7 +1880,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, if (!(save->cr0 & X86_CR0_PG) || !(save->cr0 & X86_CR0_PE) || (save->rflags & X86_EFLAGS_VM) || - !__nested_vmcb_check_save(vcpu, &save_cached)) + !nested_vmcb_check_save(vcpu, &save_cached)) goto out_free; From e0b6f031d64c086edd563e7af9c0c0a2261dd2a4 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:08 +0000 Subject: [PATCH 085/373] KVM: nSVM: Drop the non-architectural consistency check for NP_ENABLE KVM currenty fails a nested VMRUN and injects VMEXIT_INVALID (aka SVM_EXIT_ERR) if L1 sets NP_ENABLE and the host does not support NPTs. 
On first glance, it seems like the check should actually be for guest_cpu_cap_has(X86_FEATURE_NPT) instead, as it is possible for the host to support NPTs but the guest CPUID to not advertise it. However, the consistency check is not architectural to begin with. The APM does not mention VMEXIT_INVALID if NP_ENABLE is set on a processor that does not have X86_FEATURE_NPT. Hence, NP_ENABLE should be ignored if X86_FEATURE_NPT is not available for L1, so sanitize it when copying from the VMCB12 to KVM's cache. Apart from the consistency check, NP_ENABLE in VMCB12 is currently ignored because the bit is actually copied from VMCB01 to VMCB02, not from VMCB12. Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-15-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 0d447d044101..2ed6530e7bd1 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -348,9 +348,6 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, if (CC(control->asid == 0)) return false; - if (CC((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && !npt_enabled)) - return false; - if (CC(!nested_svm_check_bitmap_pa(vcpu, control->msrpm_base_pa, MSRPM_SIZE))) return false; @@ -431,6 +428,11 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, nested_svm_sanitize_intercept(vcpu, to, SKINIT); nested_svm_sanitize_intercept(vcpu, to, RDPRU); + /* Always clear SVM_NESTED_CTL_NP_ENABLE if the guest cannot use NPTs */ + to->nested_ctl = from->nested_ctl; + if (!guest_cpu_cap_has(vcpu, X86_FEATURE_NPT)) + to->nested_ctl &= ~SVM_NESTED_CTL_NP_ENABLE; + to->iopm_base_pa = from->iopm_base_pa; to->msrpm_base_pa = from->msrpm_base_pa; to->tsc_offset = from->tsc_offset; @@ -444,7 
+446,6 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->exit_info_2 = from->exit_info_2; to->exit_int_info = from->exit_int_info; to->exit_int_info_err = from->exit_int_info_err; - to->nested_ctl = from->nested_ctl; to->event_inj = from->event_inj; to->event_inj_err = from->event_inj_err; to->next_rip = from->next_rip; From b71138fcc362c67ebe66747bb22cb4e6b4d6a651 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:09 +0000 Subject: [PATCH 086/373] KVM: nSVM: Add missing consistency check for nCR3 validity MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit From the APM Volume #2, 15.25.4 (24593—Rev. 3.42—March 2024): When VMRUN is executed with nested paging enabled (NP_ENABLE = 1), the following conditions are considered illegal state combinations, in addition to those mentioned in “Canonicalization and Consistency Checks”: • Any MBZ bit of nCR3 is set. • Any G_PAT.PA field has an unsupported type encoding or any reserved field in G_PAT has a nonzero value. Add the consistency check for nCR3 being a legal GPA with no MBZ bits set. Note, the G_PAT.PA check is being handled separately[*]. 
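The check being added amounts to the following userspace model; `gpa_is_legal()` stands in for kvm_vcpu_is_legal_gpa() (a GPA is legal when no bits at or above the vCPU's MAXPHYADDR are set):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool gpa_is_legal(uint64_t gpa, unsigned int maxphyaddr)
{
	/* bits [63:maxphyaddr] are MBZ (assumes maxphyaddr < 64) */
	uint64_t mbz_mask = ~0ULL << maxphyaddr;

	return (gpa & mbz_mask) == 0;
}

/*
 * With NP_ENABLE set, an nCR3 with any MBZ bit set makes the VMRUN
 * illegal; without NP_ENABLE, nCR3 is not checked.
 */
static bool ncr3_check_passes(bool np_enable, uint64_t ncr3,
			      unsigned int maxphyaddr)
{
	return !np_enable || gpa_is_legal(ncr3, maxphyaddr);
}
```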
Link: https://lore.kernel.org/kvm/20260205214326.1029278-3-jmattson@google.com [*] Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-16-yosry@kernel.org [sean: capture everything in CC(), massage changelog formatting] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 2ed6530e7bd1..a59b976c16db 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -348,6 +348,10 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, if (CC(control->asid == 0)) return false; + if (CC((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && + !kvm_vcpu_is_legal_gpa(vcpu, control->nested_cr3))) + return false; + if (CC(!nested_svm_check_bitmap_pa(vcpu, control->msrpm_base_pa, MSRPM_SIZE))) return false; From 96bd3e76a171a8e21a6387e54e4c420a81968492 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:10 +0000 Subject: [PATCH 087/373] KVM: nSVM: Add missing consistency check for EFER, CR0, CR4, and CS MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit According to the APM Volume #2, 15.5, Canonicalization and Consistency Checks (24593—Rev. 3.42—March 2024), the following condition (among others) results in a #VMEXIT with VMEXIT_INVALID (aka SVM_EXIT_ERR): EFER.LME, CR0.PG, CR4.PAE, CS.L, and CS.D are all non-zero. In the list of consistency checks done when EFER.LME and CR0.PG are set, add a check that CS.L and CS.D are not both set, after the existing check that CR4.PAE is set. This is functionally a nop because the nested VMRUN results in SVM_EXIT_ERR in HW, which is forwarded to L1, but KVM makes all consistency checks before a VMRUN is actually attempted. 
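The new long-mode check can be modeled in userspace; the attribute masks match KVM's SVM_SELECTOR_* layout (L in bit 9, D/B in bit 10 of the VMCB segment attrib field):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SVM_SELECTOR_L_MASK  (1u << 9)	/* CS.L: 64-bit code segment */
#define SVM_SELECTOR_DB_MASK (1u << 10)	/* CS.D: default operand size */

/*
 * Models the added consistency check: when EFER.LME and CR0.PG are set
 * (and CR4.PAE has already been validated), CS.L and CS.D must not both
 * be set, otherwise VMRUN fails with VMEXIT_INVALID.
 */
static bool long_mode_cs_is_valid(uint16_t cs_attrib)
{
	return !((cs_attrib & SVM_SELECTOR_L_MASK) &&
		 (cs_attrib & SVM_SELECTOR_DB_MASK));
}
```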
Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-17-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++++++ arch/x86/kvm/svm/svm.h | 1 + 2 files changed, 7 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index a59b976c16db..50180565bcfc 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -391,6 +391,10 @@ static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu, CC(!(save->cr0 & X86_CR0_PE)) || CC(!kvm_vcpu_is_legal_cr3(vcpu, save->cr3))) return false; + + if (CC((save->cs.attrib & SVM_SELECTOR_L_MASK) && + (save->cs.attrib & SVM_SELECTOR_DB_MASK))) + return false; } /* Note, SVM doesn't have any additional restrictions on CR4. */ @@ -486,6 +490,8 @@ static void __nested_copy_vmcb_save_to_cache(struct vmcb_save_area_cached *to, * Copy only fields that are validated, as we need them * to avoid TOC/TOU races. */ + to->cs = from->cs; + to->efer = from->efer; to->cr0 = from->cr0; to->cr3 = from->cr3; diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 7629cb37c930..0a5d5a4453b7 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -140,6 +140,7 @@ struct kvm_vmcb_info { }; struct vmcb_save_area_cached { + struct vmcb_seg cs; u64 efer; u64 cr4; u64 cr3; From 7e79f71bca5cf536f92effc7227bd044c2722c11 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:11 +0000 Subject: [PATCH 088/373] KVM: nSVM: Add missing consistency check for EVENTINJ MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit According to the APM Volume #2, 15.20 (24593—Rev. 
3.42—March 2024): VMRUN exits with VMEXIT_INVALID error code if either: • Reserved values of TYPE have been specified, or • TYPE = 3 (exception) has been specified with a vector that does not correspond to an exception (this includes vector 2, which is an NMI, not an exception). Add the missing consistency checks to KVM. For the second point, inject VMEXIT_INVALID if the vector is anything but the vectors defined by the APM for exceptions. Reserved vectors are also considered invalid, which matches the HW behavior. Vector 9 (i.e. #CSO) is considered invalid because it is reserved on modern CPUs, and according to LLMs no CPUs exist supporting SVM and producing #CSOs. Defined exceptions could be different between virtual CPUs as new CPUs define new vectors. In a best effort to dynamically define the valid vectors, make all currently defined vectors as valid except those obviously tied to a CPU feature: SHSTK -> #CP and SEV-ES -> #VC. As new vectors are defined, they can similarly be tied to corresponding CPU features. Invalid vectors on specific (e.g. old) CPUs that are missed by KVM should be rejected by HW anyway. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-18-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 51 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 50180565bcfc..1c5f0f08bb8c 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -339,6 +339,54 @@ static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size) kvm_vcpu_is_legal_gpa(vcpu, addr + size - 1); } +static bool nested_svm_event_inj_valid_exept(struct kvm_vcpu *vcpu, u8 vector) +{ + /* + * Vectors that do not correspond to a defined exception are invalid + * (including #NMI and reserved vectors). 
In a best effort to define + * valid exceptions based on the virtual CPU, make all exceptions always + * valid except those obviously tied to a CPU feature. + */ + switch (vector) { + case DE_VECTOR: case DB_VECTOR: case BP_VECTOR: case OF_VECTOR: + case BR_VECTOR: case UD_VECTOR: case NM_VECTOR: case DF_VECTOR: + case TS_VECTOR: case NP_VECTOR: case SS_VECTOR: case GP_VECTOR: + case PF_VECTOR: case MF_VECTOR: case AC_VECTOR: case MC_VECTOR: + case XM_VECTOR: case HV_VECTOR: case SX_VECTOR: + return true; + case CP_VECTOR: + return guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK); + case VC_VECTOR: + return guest_cpu_cap_has(vcpu, X86_FEATURE_SEV_ES); + } + return false; +} + +/* + * According to the APM, VMRUN exits with SVM_EXIT_ERR if SVM_EVTINJ_VALID is + * set and: + * - The type of event_inj is not one of the defined values. + * - The type is SVM_EVTINJ_TYPE_EXEPT, but the vector is not a valid exception. + */ +static bool nested_svm_check_event_inj(struct kvm_vcpu *vcpu, u32 event_inj) +{ + u32 type = event_inj & SVM_EVTINJ_TYPE_MASK; + u8 vector = event_inj & SVM_EVTINJ_VEC_MASK; + + if (!(event_inj & SVM_EVTINJ_VALID)) + return true; + + if (type != SVM_EVTINJ_TYPE_INTR && type != SVM_EVTINJ_TYPE_NMI && + type != SVM_EVTINJ_TYPE_EXEPT && type != SVM_EVTINJ_TYPE_SOFT) + return false; + + if (type == SVM_EVTINJ_TYPE_EXEPT && + !nested_svm_event_inj_valid_exept(vcpu, vector)) + return false; + + return true; +} + static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, struct vmcb_ctrl_area_cached *control) { @@ -364,6 +412,9 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, return false; } + if (CC(!nested_svm_check_event_inj(vcpu, control->event_inj))) + return false; + return true; } From d5bde6113aed8315a2bfe708730b721be9c2f48b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:51 -0800 Subject: [PATCH 089/373] KVM: SVM: Explicitly mark vmcb01 dirty after modifying VMCB intercepts When reacting to an intercept 
update, explicitly mark vmcb01's intercepts dirty, as KVM always initially operates on vmcb01, and nested_svm_vmexit() isn't guaranteed to mark VMCB_INTERCEPTS as dirty. I.e. if L2 is active, KVM will modify the intercepts for L1, but might not mark them as dirty before the next VMRUN of L1. Fixes: 116a0a23676e ("KVM: SVM: Add clean-bit for intercetps, tsc-offset and pause filter count") Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 1c5f0f08bb8c..5b639d98bf09 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -128,11 +128,13 @@ void recalc_intercepts(struct vcpu_svm *svm) struct vmcb_ctrl_area_cached *g; unsigned int i; - vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); + vmcb_mark_dirty(svm->vmcb01.ptr, VMCB_INTERCEPTS); if (!is_guest_mode(&svm->vcpu)) return; + vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); + c = &svm->vmcb->control; h = &svm->vmcb01.ptr->control; g = &svm->nested.ctl; From c36991c6f8d2ab56ee67aff04e3c357f45cfc76c Mon Sep 17 00:00:00 2001 From: Kevin Cheng Date: Tue, 3 Mar 2026 16:22:22 -0800 Subject: [PATCH 090/373] KVM: nSVM: Raise #UD if unhandled VMMCALL isn't intercepted by L1 Explicitly synthesize a #UD for VMMCALL if L2 is active, L1 does NOT want to intercept VMMCALL, nested_svm_l2_tlb_flush_enabled() is true, and the hypercall is something other than one of the supported Hyper-V hypercalls. When all of the above conditions are met, KVM will intercept VMMCALL but never forward it to L1, i.e. will let L2 make hypercalls as if it were L1. The TLFS says a whole lot of nothing about this scenario, so go with the architectural behavior, which says that VMMCALL #UDs if it's not intercepted. 
Opportunistically do a 2-for-1 stub trade by stub-ifying the new API instead of the helpers it uses. The last remaining "single" stub will soon be dropped as well. Suggested-by: Sean Christopherson Fixes: 3f4a812edf5c ("KVM: nSVM: hyper-v: Enable L2 TLB flush") Cc: Vitaly Kuznetsov Cc: stable@vger.kernel.org Signed-off-by: Kevin Cheng Link: https://patch.msgid.link/20260228033328.2285047-5-chengkev@google.com [sean: rewrite changelog and comment, tag for stable, remove defunct stubs] Reviewed-by: Yosry Ahmed Reviewed-by: Vitaly Kuznetsov Link: https://patch.msgid.link/20260304002223.1105129-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/hyperv.h | 8 -------- arch/x86/kvm/svm/hyperv.h | 11 +++++++++++ arch/x86/kvm/svm/nested.c | 4 +--- arch/x86/kvm/svm/svm.c | 19 ++++++++++++++++++- 4 files changed, 30 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index 6ce160ffa678..6301f79fcbae 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -305,14 +305,6 @@ static inline bool kvm_hv_has_stimer_pending(struct kvm_vcpu *vcpu) { return false; } -static inline bool kvm_hv_is_tlb_flush_hcall(struct kvm_vcpu *vcpu) -{ - return false; -} -static inline bool guest_hv_cpuid_has_l2_tlb_flush(struct kvm_vcpu *vcpu) -{ - return false; -} static inline int kvm_hv_verify_vp_assist(struct kvm_vcpu *vcpu) { return 0; diff --git a/arch/x86/kvm/svm/hyperv.h b/arch/x86/kvm/svm/hyperv.h index d3f8bfc05832..9af03970d40c 100644 --- a/arch/x86/kvm/svm/hyperv.h +++ b/arch/x86/kvm/svm/hyperv.h @@ -41,6 +41,13 @@ static inline bool nested_svm_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu) return hv_vcpu->vp_assist_page.nested_control.features.directhypercall; } +static inline bool nested_svm_is_l2_tlb_flush_hcall(struct kvm_vcpu *vcpu) +{ + return guest_hv_cpuid_has_l2_tlb_flush(vcpu) && + nested_svm_l2_tlb_flush_enabled(vcpu) && + kvm_hv_is_tlb_flush_hcall(vcpu); +} + void 
svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu); #else /* CONFIG_KVM_HYPERV */ static inline void nested_svm_hv_update_vm_vp_ids(struct kvm_vcpu *vcpu) {} @@ -48,6 +55,10 @@ static inline bool nested_svm_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu) { return false; } +static inline bool nested_svm_is_l2_tlb_flush_hcall(struct kvm_vcpu *vcpu) +{ + return false; +} static inline void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu) {} #endif /* CONFIG_KVM_HYPERV */ diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 5b639d98bf09..0f7893a7cb04 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1738,9 +1738,7 @@ int nested_svm_exit_special(struct vcpu_svm *svm) } case SVM_EXIT_VMMCALL: /* Hyper-V L2 TLB flush hypercall is handled by L0 */ - if (guest_hv_cpuid_has_l2_tlb_flush(vcpu) && - nested_svm_l2_tlb_flush_enabled(vcpu) && - kvm_hv_is_tlb_flush_hcall(vcpu)) + if (nested_svm_is_l2_tlb_flush_hcall(vcpu)) return NESTED_EXIT_HOST; break; default: diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7efa71709292..9e6864cf58d3 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -52,6 +52,7 @@ #include "svm.h" #include "svm_ops.h" +#include "hyperv.h" #include "kvm_onhyperv.h" #include "svm_onhyperv.h" @@ -3248,6 +3249,22 @@ static int bus_lock_exit(struct kvm_vcpu *vcpu) return 0; } +static int vmmcall_interception(struct kvm_vcpu *vcpu) +{ + /* + * Inject a #UD if L2 is active and the VMMCALL isn't a Hyper-V TLB + * hypercall, as VMMCALL #UDs if it's not intercepted, and this path is + * reachable if and only if L1 doesn't want to intercept VMMCALL or has + * enabled L0 (KVM) handling of Hyper-V L2 TLB flush hypercalls. 
+ */ + if (is_guest_mode(vcpu) && !nested_svm_is_l2_tlb_flush_hcall(vcpu)) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 1; + } + + return kvm_emulate_hypercall(vcpu); +} + static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -3298,7 +3315,7 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = { [SVM_EXIT_TASK_SWITCH] = task_switch_interception, [SVM_EXIT_SHUTDOWN] = shutdown_interception, [SVM_EXIT_VMRUN] = vmrun_interception, - [SVM_EXIT_VMMCALL] = kvm_emulate_hypercall, + [SVM_EXIT_VMMCALL] = vmmcall_interception, [SVM_EXIT_VMLOAD] = vmload_interception, [SVM_EXIT_VMSAVE] = vmsave_interception, [SVM_EXIT_STGI] = stgi_interception, From 33d3617a52f9930d22b2af59f813c2fbdefa6dd5 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 3 Mar 2026 16:22:23 -0800 Subject: [PATCH 091/373] KVM: nSVM: Always intercept VMMCALL when L2 is active Always intercept VMMCALL now that KVM properly synthesizes a #UD as appropriate, i.e. when L1 doesn't want to intercept VMMCALL, to avoid putting L2 into an infinite #UD loop if KVM_X86_QUIRK_FIX_HYPERCALL_INSN is enabled. By letting L2 execute VMMCALL natively and thus #UD, for all intents and purposes KVM morphs the VMMCALL intercept into a #UD intercept (KVM always intercepts #UD). When the hypercall quirk is enabled, KVM "emulates" VMMCALL in response to the #UD by trying to fixup the opcode to the "right" vendor, then restarts the guest, without skipping the VMMCALL. As a result, the guest sees an endless stream of #UDs since it's already executing the correct vendor hypercall instruction, i.e. the emulator doesn't anticipate that the #UD could be due to lack of interception, as opposed to a truly undefined opcode. 
Fixes: 0d945bd93511 ("KVM: SVM: Don't allow nested guest to VMMCALL into host") Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed Reviewed-by: Vitaly Kuznetsov Link: https://patch.msgid.link/20260304002223.1105129-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/hyperv.h | 4 ---- arch/x86/kvm/svm/nested.c | 7 ------- 2 files changed, 11 deletions(-) diff --git a/arch/x86/kvm/svm/hyperv.h b/arch/x86/kvm/svm/hyperv.h index 9af03970d40c..f70d076911a6 100644 --- a/arch/x86/kvm/svm/hyperv.h +++ b/arch/x86/kvm/svm/hyperv.h @@ -51,10 +51,6 @@ static inline bool nested_svm_is_l2_tlb_flush_hcall(struct kvm_vcpu *vcpu) void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu); #else /* CONFIG_KVM_HYPERV */ static inline void nested_svm_hv_update_vm_vp_ids(struct kvm_vcpu *vcpu) {} -static inline bool nested_svm_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu) -{ - return false; -} static inline bool nested_svm_is_l2_tlb_flush_hcall(struct kvm_vcpu *vcpu) { return false; diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 0f7893a7cb04..fb86f09985e7 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -158,13 +158,6 @@ void recalc_intercepts(struct vcpu_svm *svm) vmcb_clr_intercept(c, INTERCEPT_VINTR); } - /* - * We want to see VMMCALLs from a nested guest only when Hyper-V L2 TLB - * flush feature is enabled. - */ - if (!nested_svm_l2_tlb_flush_enabled(&svm->vcpu)) - vmcb_clr_intercept(c, INTERCEPT_VMMCALL); - for (i = 0; i < MAX_INTERCEPT; i++) c->intercepts[i] |= g->intercepts[i]; From 69f779f79e0d1ff321a89ab56cdcab34613104c0 Mon Sep 17 00:00:00 2001 From: Kevin Cheng Date: Tue, 3 Mar 2026 16:30:09 -0800 Subject: [PATCH 092/373] KVM: SVM: Move STGI and CLGI intercept handling Move STGI/CLGI intercept handling to svm_recalc_instruction_intercepts() in preparation for making the function EFER.SVME-aware. 
This will allow configuring STGI/CLGI intercepts along with the intercepts for the other SVM instructions when EFER.SVME is toggled (KVM needs to intercept SVM instructions when EFER.SVME=0 to inject #UD). When clearing the STGI intercept in particular, request KVM_REQ_EVENT if there is at least one pending GIF-controlled event. This avoids breaking NMI/SMI window tracking, as enable_{nmi,smi}_window() sets INTERCEPT_STGI to detect when NMIs become unblocked. KVM_REQ_EVENT forces kvm_check_and_inject_events() to re-evaluate pending events and re-enable the intercept if needed. Extract the pending GIF event check into a helper function svm_has_pending_gif_event() to deduplicate the logic between svm_recalc_instruction_intercepts() and svm_set_gif(). Signed-off-by: Kevin Cheng [sean: keep vgif handling out of the "Intel CPU model" path] Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260304003010.1108257-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 32 ++++++++++++++++++++++++-------- 1 file changed, 24 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 9e6864cf58d3..30d3291e4738 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -999,6 +999,14 @@ void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu) preempt_enable(); } +static bool svm_has_pending_gif_event(struct vcpu_svm *svm) +{ + return svm->vcpu.arch.smi_pending || + svm->vcpu.arch.nmi_pending || + kvm_cpu_has_injectable_intr(&svm->vcpu) || + kvm_apic_has_pending_init_or_sipi(&svm->vcpu); +} + /* Evaluate instruction intercepts that depend on guest CPUID features. 
*/ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu) { @@ -1042,6 +1050,20 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu) } } + if (vgif) { + svm_clr_intercept(svm, INTERCEPT_STGI); + svm_clr_intercept(svm, INTERCEPT_CLGI); + + /* + * Process pending events when clearing STGI/CLGI intercepts if + * there's at least one pending event that is masked by GIF, so + * that KVM re-evaluates if the intercept needs to be set again + * to track when GIF is re-enabled (e.g. for NMI injection). + */ + if (svm_has_pending_gif_event(svm)) + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + } + if (kvm_need_rdpmc_intercept(vcpu)) svm_set_intercept(svm, INTERCEPT_RDPMC); else @@ -1185,11 +1207,8 @@ static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event) if (vnmi) svm->vmcb->control.int_ctl |= V_NMI_ENABLE_MASK; - if (vgif) { - svm_clr_intercept(svm, INTERCEPT_STGI); - svm_clr_intercept(svm, INTERCEPT_CLGI); + if (vgif) svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK; - } if (vls) svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; @@ -2306,10 +2325,7 @@ void svm_set_gif(struct vcpu_svm *svm, bool value) svm_clear_vintr(svm); enable_gif(svm); - if (svm->vcpu.arch.smi_pending || - svm->vcpu.arch.nmi_pending || - kvm_cpu_has_injectable_intr(&svm->vcpu) || - kvm_apic_has_pending_init_or_sipi(&svm->vcpu)) + if (svm_has_pending_gif_event(svm)) kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); } else { disable_gif(svm); From 460c7eb2e7594319abcb2066c737cb8b5eb78213 Mon Sep 17 00:00:00 2001 From: Kevin Cheng Date: Tue, 3 Mar 2026 16:30:10 -0800 Subject: [PATCH 093/373] KVM: SVM: Recalc instructions intercepts when EFER.SVME is toggled The AMD APM states that VMRUN, VMLOAD, VMSAVE, CLGI, VMMCALL, and INVLPGA instructions should generate a #UD when EFER.SVME is cleared. Currently, when VMLOAD, VMSAVE, or CLGI are executed in L1 with EFER.SVME cleared, no #UD is generated in certain cases. 
This is because the intercepts for these instructions are cleared based on whether or not vls or vgif is enabled. The #UD fails to be generated when the intercepts are absent. Fix the missing #UD generation by ensuring that all relevant instructions have intercepts set when EFER.SVME is disabled. VMMCALL is special because KVM's ABI is that VMCALL/VMMCALL are always supported for L1 and never fault. Signed-off-by: Kevin Cheng [sean: isolate Intel CPU "compatibility" in EFER.SVME=1 path] Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260304003010.1108257-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 35 +++++++++++++++++++++++------------ 1 file changed, 23 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 30d3291e4738..5fbd87450f4f 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -245,6 +245,8 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer) if (svm_gp_erratum_intercept && !sev_guest(vcpu->kvm)) set_exception_intercept(svm, GP_VECTOR); } + + kvm_make_request(KVM_REQ_RECALC_INTERCEPTS, vcpu); } svm->vmcb->save.efer = efer | EFER_SVME; @@ -1032,27 +1034,31 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu) } /* - * No need to toggle VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK here, it is - * always set if vls is enabled. If the intercepts are set, the bit is - * meaningless anyway. + * Intercept instructions that #UD if EFER.SVME=0, as SVME must be set + * even when running the guest, i.e. hardware will only ever see + * EFER.SVME=1. + * + * No need to toggle any of the vgif/vls/etc. enable bits here, as they + * are set when the VMCB is initialized and never cleared (if the + * relevant intercepts are set, the enablements are meaningless anyway). 
*/ - if (guest_cpuid_is_intel_compatible(vcpu)) { + if (!(vcpu->arch.efer & EFER_SVME)) { svm_set_intercept(svm, INTERCEPT_VMLOAD); svm_set_intercept(svm, INTERCEPT_VMSAVE); + svm_set_intercept(svm, INTERCEPT_CLGI); + svm_set_intercept(svm, INTERCEPT_STGI); } else { /* * If hardware supports Virtual VMLOAD VMSAVE then enable it * in VMCB and clear intercepts to avoid #VMEXIT. */ - if (vls) { + if (guest_cpuid_is_intel_compatible(vcpu)) { + svm_set_intercept(svm, INTERCEPT_VMLOAD); + svm_set_intercept(svm, INTERCEPT_VMSAVE); + } else if (vls) { svm_clr_intercept(svm, INTERCEPT_VMLOAD); svm_clr_intercept(svm, INTERCEPT_VMSAVE); } - } - - if (vgif) { - svm_clr_intercept(svm, INTERCEPT_STGI); - svm_clr_intercept(svm, INTERCEPT_CLGI); /* * Process pending events when clearing STGI/CLGI intercepts if @@ -1060,8 +1066,13 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu) * that KVM re-evaluates if the intercept needs to be set again * to track when GIF is re-enabled (e.g. for NMI injection). */ - if (svm_has_pending_gif_event(svm)) - kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + if (vgif) { + svm_clr_intercept(svm, INTERCEPT_CLGI); + svm_clr_intercept(svm, INTERCEPT_STGI); + + if (svm_has_pending_gif_event(svm)) + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + } } if (kvm_need_rdpmc_intercept(vcpu)) From 0b97f929831a70e7ad6d9dbd30ae1f65dd43526d Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:52 -0800 Subject: [PATCH 094/373] KVM: SVM: Separate recalc_intercepts() into nested vs. non-nested parts Extract the non-nested aspects of recalc_intercepts() into a separate helper, svm_mark_intercepts_dirty(), to make it clear that the call isn't *just* recalculating (vmcb02's) intercepts, and to not bury non-nested code in nested.c. As suggested by Yosry, opportunistically prepend "nested_vmcb02_" to recalc_intercepts() so that it's obvious the function specifically deals with recomputing intercepts for L2. 
No functional change intended. Cc: Yosry Ahmed Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 9 ++------- arch/x86/kvm/svm/sev.c | 2 +- arch/x86/kvm/svm/svm.c | 4 ++-- arch/x86/kvm/svm/svm.h | 26 ++++++++++++++++++++------ 4 files changed, 25 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index fb86f09985e7..21ee75d6cdff 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -122,17 +122,12 @@ static bool nested_vmcb_needs_vls_intercept(struct vcpu_svm *svm) return false; } -void recalc_intercepts(struct vcpu_svm *svm) +void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) { struct vmcb_control_area *c, *h; struct vmcb_ctrl_area_cached *g; unsigned int i; - vmcb_mark_dirty(svm->vmcb01.ptr, VMCB_INTERCEPTS); - - if (!is_guest_mode(&svm->vcpu)) - return; - vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); c = &svm->vmcb->control; @@ -962,7 +957,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) * Merge guest and host intercepts - must be called with vcpu in * guest-mode to take effect. */ - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to_vmcb) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 3f9c1aa39a0a..fea4a65758ad 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -4639,7 +4639,7 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm, bool init_event) if (!sev_vcpu_has_debug_swap(svm)) { vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_READ); vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_WRITE); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } else { /* * Disable #DB intercept iff DebugSwap is enabled. 
KVM doesn't diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 5fbd87450f4f..1901e9feff51 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -638,7 +638,7 @@ static void set_dr_intercepts(struct vcpu_svm *svm) vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_READ); vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_WRITE); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static void clr_dr_intercepts(struct vcpu_svm *svm) @@ -647,7 +647,7 @@ static void clr_dr_intercepts(struct vcpu_svm *svm) vmcb->control.intercepts[INTERCEPT_DR] = 0; - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static bool msr_write_intercepted(struct kvm_vcpu *vcpu, u32 msr) diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 0a5d5a4453b7..267ef8a3359b 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -358,8 +358,6 @@ struct svm_cpu_data { DECLARE_PER_CPU(struct svm_cpu_data, svm_data); -void recalc_intercepts(struct vcpu_svm *svm); - static __always_inline struct kvm_svm *to_kvm_svm(struct kvm *kvm) { return container_of(kvm, struct kvm_svm, kvm); @@ -487,6 +485,22 @@ static inline bool vmcb12_is_intercept(struct vmcb_ctrl_area_cached *control, u3 return __vmcb_is_intercept((unsigned long *)&control->intercepts, bit); } +void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm); + +static inline void svm_mark_intercepts_dirty(struct vcpu_svm *svm) +{ + vmcb_mark_dirty(svm->vmcb01.ptr, VMCB_INTERCEPTS); + + /* + * If L2 is active, recalculate the intercepts for vmcb02 to account + * for the changes made to vmcb01. All intercept configuration is done + * for vmcb01 and then propagated to vmcb02 to combine KVM's intercepts + * with L1's intercepts (from the vmcb12 snapshot). 
+ */ + if (is_guest_mode(&svm->vcpu)) + nested_vmcb02_recalc_intercepts(svm); +} + static inline void set_exception_intercept(struct vcpu_svm *svm, u32 bit) { struct vmcb *vmcb = svm->vmcb01.ptr; @@ -494,7 +508,7 @@ static inline void set_exception_intercept(struct vcpu_svm *svm, u32 bit) WARN_ON_ONCE(bit >= 32); vmcb_set_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static inline void clr_exception_intercept(struct vcpu_svm *svm, u32 bit) @@ -504,7 +518,7 @@ static inline void clr_exception_intercept(struct vcpu_svm *svm, u32 bit) WARN_ON_ONCE(bit >= 32); vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static inline void svm_set_intercept(struct vcpu_svm *svm, int bit) @@ -513,7 +527,7 @@ static inline void svm_set_intercept(struct vcpu_svm *svm, int bit) vmcb_set_intercept(&vmcb->control, bit); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static inline void svm_clr_intercept(struct vcpu_svm *svm, int bit) @@ -522,7 +536,7 @@ static inline void svm_clr_intercept(struct vcpu_svm *svm, int bit) vmcb_clr_intercept(&vmcb->control, bit); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static inline bool svm_is_intercept(struct vcpu_svm *svm, int bit) From a367b6e10372b46fa10debd889e89aa65ca65aee Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 18 Feb 2026 15:09:53 -0800 Subject: [PATCH 095/373] KVM: nSVM: WARN and abort vmcb02 intercepts recalc if vmcb02 isn't active WARN and bail early from nested_vmcb02_recalc_intercepts() if vmcb02 isn't the active/current VMCB, as recalculating intercepts for vmcb01 using logic intended for merging vmcb12 and vmcb01 intercepts can yield unexpected and unwanted results. 
In addition to hardening against general bugs, this will provide additional safeguards "if" nested_vmcb02_recalc_intercepts() is invoked directly from nested_vmcb02_prepare_control(). Signed-off-by: Yosry Ahmed [sean: split to separate patch, bail early on "failure"] Link: https://patch.msgid.link/20260218230958.2877682-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 21ee75d6cdff..75e7deef51a5 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -128,6 +128,9 @@ void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) struct vmcb_ctrl_area_cached *g; unsigned int i; + if (WARN_ON_ONCE(svm->vmcb != svm->nested.vmcb02.ptr)) + return; + vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); c = &svm->vmcb->control; From 4a80c4bc1f10645fe3fc51d4c116f69096340683 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:54 -0800 Subject: [PATCH 096/373] KVM: nSVM: Directly (re)calc vmcb02 intercepts from nested_vmcb02_prepare_control() Now that nested_vmcb02_recalc_intercepts() provides guardrails against it being incorrectly called without vmcb02 active, invoke it directly from nested_vmcb02_recalc_intercepts() instead of bouncing through svm_mark_intercepts_dirty(), which unnecessarily marks vmcb01 as dirty. Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 75e7deef51a5..5ee77a5130d3 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -960,7 +960,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) * Merge guest and host intercepts - must be called with vcpu in * guest-mode to take effect. 
*/ - svm_mark_intercepts_dirty(svm); + nested_vmcb02_recalc_intercepts(svm); } static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to_vmcb) From 586160b750914d5bd636f395a2ba9248c6f346e5 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:55 -0800 Subject: [PATCH 097/373] KVM: nSVM: Use intuitive local variables in nested_vmcb02_recalc_intercepts() Now that nested_vmcb02_recalc_intercepts() is explicitly scoped to deal with *only* recalculating vmcb02 intercepts, rename its local variables to use more intuitive names. The current "c", "h", and "g" local variables, for the current VMCB, vmcb01, and (cached) vmcb12 respectively, are short and sweet, but don't do much to help unfamiliar readers understand what the code is doing. Use vmcb02/vmcb01/vmcb12_ctrl in lieu of c/h/g to make it clear the function is updating intercepts in vmcb02 based on the intercepts in vmcb01 and (cached) vmcb12. Opportunistically change the existing WARN_ON to a WARN_ON_ONCE so that a KVM bug doesn't unintentionally DoS the host. No functional change intended. 
Signed-off-by: Yosry Ahmed [sean: use WARN_ON_ONCE, keep local vmcb12 cache as vmcb12_ctrl] Link: https://patch.msgid.link/20260218230958.2877682-6-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 33 +++++++++++++++------------------ 1 file changed, 15 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 5ee77a5130d3..46804b54200d 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -124,23 +124,20 @@ static bool nested_vmcb_needs_vls_intercept(struct vcpu_svm *svm) void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) { - struct vmcb_control_area *c, *h; - struct vmcb_ctrl_area_cached *g; + struct vmcb_ctrl_area_cached *vmcb12_ctrl = &svm->nested.ctl; + struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; + struct vmcb *vmcb01 = svm->vmcb01.ptr; unsigned int i; - if (WARN_ON_ONCE(svm->vmcb != svm->nested.vmcb02.ptr)) + if (WARN_ON_ONCE(svm->vmcb != vmcb02)) return; - vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); - - c = &svm->vmcb->control; - h = &svm->vmcb01.ptr->control; - g = &svm->nested.ctl; + vmcb_mark_dirty(vmcb02, VMCB_INTERCEPTS); for (i = 0; i < MAX_INTERCEPT; i++) - c->intercepts[i] = h->intercepts[i]; + vmcb02->control.intercepts[i] = vmcb01->control.intercepts[i]; - if (g->int_ctl & V_INTR_MASKING_MASK) { + if (vmcb12_ctrl->int_ctl & V_INTR_MASKING_MASK) { /* * If L2 is active and V_INTR_MASKING is enabled in vmcb12, * disable intercept of CR8 writes as L2's CR8 does not affect @@ -151,17 +148,17 @@ void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) * the effective RFLAGS.IF for L1 interrupts will never be set * while L2 is running (L2's RFLAGS.IF doesn't affect L1 IRQs). 
*/ - vmcb_clr_intercept(c, INTERCEPT_CR8_WRITE); - if (!(svm->vmcb01.ptr->save.rflags & X86_EFLAGS_IF)) - vmcb_clr_intercept(c, INTERCEPT_VINTR); + vmcb_clr_intercept(&vmcb02->control, INTERCEPT_CR8_WRITE); + if (!(vmcb01->save.rflags & X86_EFLAGS_IF)) + vmcb_clr_intercept(&vmcb02->control, INTERCEPT_VINTR); } for (i = 0; i < MAX_INTERCEPT; i++) - c->intercepts[i] |= g->intercepts[i]; + vmcb02->control.intercepts[i] |= vmcb12_ctrl->intercepts[i]; /* If SMI is not intercepted, ignore guest SMI intercept as well */ if (!intercept_smi) - vmcb_clr_intercept(c, INTERCEPT_SMI); + vmcb_clr_intercept(&vmcb02->control, INTERCEPT_SMI); if (nested_vmcb_needs_vls_intercept(svm)) { /* @@ -169,10 +166,10 @@ void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) * we must intercept these instructions to correctly * emulate them in case L1 doesn't intercept them. */ - vmcb_set_intercept(c, INTERCEPT_VMLOAD); - vmcb_set_intercept(c, INTERCEPT_VMSAVE); + vmcb_set_intercept(&vmcb02->control, INTERCEPT_VMLOAD); + vmcb_set_intercept(&vmcb02->control, INTERCEPT_VMSAVE); } else { - WARN_ON(!(c->virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK)); + WARN_ON_ONCE(!(vmcb02->control.virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK)); } } From ef09eebc5736add3415b6efb009fdb7c47a504c7 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 18 Feb 2026 15:09:56 -0800 Subject: [PATCH 098/373] KVM: nSVM: Use vmcb12_is_intercept() in nested_sync_control_from_vmcb02() Use vmcb12_is_intercept() instead of open-coding the intercept check. No functional change intended. 
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-7-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 46804b54200d..c965d10f3187 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -570,7 +570,7 @@ void nested_sync_control_from_vmcb02(struct vcpu_svm *svm) * int_ctl (because it was never recognized while L2 was running). */ if (svm_is_intercept(svm, INTERCEPT_VINTR) && - !test_bit(INTERCEPT_VINTR, (unsigned long *)svm->nested.ctl.intercepts)) + !vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_VINTR)) mask &= ~V_IRQ_MASK; if (nested_vgif_enabled(svm)) From af75470944f4c978956001cd6034f67469957c1b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:57 -0800 Subject: [PATCH 099/373] KVM: nSVM: Move vmcb_ctrl_area_cached.bus_lock_rip to svm_nested_state Move "bus_lock_rip" from "vmcb_ctrl_area_cached" to "svm_nested_state" as "last_bus_lock_rip" to more accurately reflect what it tracks, and because it is NOT a cached vmcb12 control field. The misplaced field isn't all that apparent in the current code base, as KVM uses "svm->nested.ctl" broadly, but the bad placement becomes glaringly obvious if "svm->nested.ctl" is captured as a local "vmcb12_ctrl" variable. No functional change intended. 
Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-8-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 8 ++++---- arch/x86/kvm/svm/svm.c | 2 +- arch/x86/kvm/svm/svm.h | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index c965d10f3187..dc4cca7df47e 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -850,7 +850,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) * L1 re-enters L2, the same instruction will trigger a VM-Exit and the * entire cycle start over. */ - if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == vmcb02->save.rip)) + if (vmcb02->save.rip && (svm->nested.last_bus_lock_rip == vmcb02->save.rip)) vmcb02->control.bus_lock_counter = 1; else vmcb02->control.bus_lock_counter = 0; @@ -1255,11 +1255,11 @@ void nested_svm_vmexit(struct vcpu_svm *svm) } /* - * Invalidate bus_lock_rip unless KVM is still waiting for the guest - * to make forward progress before re-enabling bus lock detection. + * Invalidate last_bus_lock_rip unless KVM is still waiting for the + * guest to make forward progress before re-enabling bus lock detection. 
*/ if (!vmcb02->control.bus_lock_counter) - svm->nested.ctl.bus_lock_rip = INVALID_GPA; + svm->nested.last_bus_lock_rip = INVALID_GPA; nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 1901e9feff51..62501c120112 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3271,7 +3271,7 @@ static int bus_lock_exit(struct kvm_vcpu *vcpu) vcpu->arch.complete_userspace_io = complete_userspace_buslock; if (is_guest_mode(vcpu)) - svm->nested.ctl.bus_lock_rip = vcpu->arch.cui_linear_rip; + svm->nested.last_bus_lock_rip = vcpu->arch.cui_linear_rip; return 0; } diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 267ef8a3359b..6c3b3fae91ec 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -174,7 +174,6 @@ struct vmcb_ctrl_area_cached { u64 nested_cr3; u64 virt_ext; u32 clean; - u64 bus_lock_rip; union { #if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(CONFIG_KVM_HYPERV) struct hv_vmcb_enlightenments hv_enlightenments; @@ -189,6 +188,7 @@ struct svm_nested_state { u64 vm_cr_msr; u64 vmcb12_gpa; u64 last_vmcb12_gpa; + u64 last_bus_lock_rip; /* * The MSR permissions map used for vmcb02, which is the merge result From 56bfbe68f78ece2ea9b15f31ec8f7543d8942e3b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:58 -0800 Subject: [PATCH 100/373] KVM: nSVM: Capture svm->nested.ctl as vmcb12_ctrl when preparing vmcb02 Grab svm->nested.ctl as vmcb12_ctrl when preparing the vmcb02 controls to make it more obvious that much of the data is coming from vmcb12 (or rather, a snapshot of vmcb12 at the time of L1's VMRUN). Opportunistically reorder the variable definitions to create a pretty reverse fir tree. No functional change intended. 
Cc: Yosry Ahmed Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-9-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 39 +++++++++++++++++++-------------------- 1 file changed, 19 insertions(+), 20 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index dc4cca7df47e..146faa7584a1 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -789,11 +789,11 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK; u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK; - struct kvm_vcpu *vcpu = &svm->vcpu; - struct vmcb *vmcb01 = svm->vmcb01.ptr; + struct vmcb_ctrl_area_cached *vmcb12_ctrl = &svm->nested.ctl; struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; - u32 pause_count12; - u32 pause_thresh12; + struct vmcb *vmcb01 = svm->vmcb01.ptr; + struct kvm_vcpu *vcpu = &svm->vcpu; + u32 pause_count12, pause_thresh12; nested_svm_transition_tlb_flush(vcpu); @@ -806,7 +806,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) */ if (guest_cpu_cap_has(vcpu, X86_FEATURE_VGIF) && - (svm->nested.ctl.int_ctl & V_GIF_ENABLE_MASK)) + (vmcb12_ctrl->int_ctl & V_GIF_ENABLE_MASK)) int_ctl_vmcb12_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK); else int_ctl_vmcb01_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK); @@ -864,10 +864,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) if (nested_npt_enabled(svm)) nested_svm_init_mmu_context(vcpu); - vcpu->arch.tsc_offset = kvm_calc_nested_tsc_offset( - vcpu->arch.l1_tsc_offset, - svm->nested.ctl.tsc_offset, - svm->tsc_ratio_msr); + vcpu->arch.tsc_offset = kvm_calc_nested_tsc_offset(vcpu->arch.l1_tsc_offset, + vmcb12_ctrl->tsc_offset, + svm->tsc_ratio_msr); vmcb02->control.tsc_offset = vcpu->arch.tsc_offset; @@ -876,13 +875,13 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) nested_svm_update_tsc_ratio_msr(vcpu); vmcb02->control.int_ctl = - 
(svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) | + (vmcb12_ctrl->int_ctl & int_ctl_vmcb12_bits) | (vmcb01->control.int_ctl & int_ctl_vmcb01_bits); - vmcb02->control.int_vector = svm->nested.ctl.int_vector; - vmcb02->control.int_state = svm->nested.ctl.int_state; - vmcb02->control.event_inj = svm->nested.ctl.event_inj; - vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err; + vmcb02->control.int_vector = vmcb12_ctrl->int_vector; + vmcb02->control.int_state = vmcb12_ctrl->int_state; + vmcb02->control.event_inj = vmcb12_ctrl->event_inj; + vmcb02->control.event_inj_err = vmcb12_ctrl->event_inj_err; /* * If nrips is exposed to L1, take NextRIP as-is. Otherwise, L1 @@ -893,7 +892,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) */ if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || !svm->nested.nested_run_pending) - vmcb02->control.next_rip = svm->nested.ctl.next_rip; + vmcb02->control.next_rip = vmcb12_ctrl->next_rip; svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj); @@ -905,7 +904,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) svm->soft_int_injected = true; if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || !svm->nested.nested_run_pending) - svm->soft_int_next_rip = svm->nested.ctl.next_rip; + svm->soft_int_next_rip = vmcb12_ctrl->next_rip; } /* LBR_CTL_ENABLE_MASK is controlled by svm_update_lbrv() */ @@ -914,11 +913,11 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) vmcb02->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; if (guest_cpu_cap_has(vcpu, X86_FEATURE_PAUSEFILTER)) - pause_count12 = svm->nested.ctl.pause_filter_count; + pause_count12 = vmcb12_ctrl->pause_filter_count; else pause_count12 = 0; if (guest_cpu_cap_has(vcpu, X86_FEATURE_PFTHRESHOLD)) - pause_thresh12 = svm->nested.ctl.pause_filter_thresh; + pause_thresh12 = vmcb12_ctrl->pause_filter_thresh; else pause_thresh12 = 0; if (kvm_pause_in_guest(svm->vcpu.kvm)) { @@ -932,7 +931,7 @@ static void 
nested_vmcb02_prepare_control(struct vcpu_svm *svm) vmcb02->control.pause_filter_thresh = vmcb01->control.pause_filter_thresh; /* ... but ensure filtering is disabled if so requested. */ - if (vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_PAUSE)) { + if (vmcb12_is_intercept(vmcb12_ctrl, INTERCEPT_PAUSE)) { if (!pause_count12) vmcb02->control.pause_filter_count = 0; if (!pause_thresh12) @@ -949,7 +948,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) * L2 is the "guest"). */ if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS)) - vmcb02->control.erap_ctl = (svm->nested.ctl.erap_ctl & + vmcb02->control.erap_ctl = (vmcb12_ctrl->erap_ctl & ERAP_CONTROL_ALLOW_LARGER_RAP) | ERAP_CONTROL_CLEAR_RAP; From 1aea80dd42cf46d11af5ff7874a4f4dae77efd6a Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 3 Mar 2026 08:58:06 -0800 Subject: [PATCH 101/373] KVM: SVM: Rename vmcb->nested_ctl to vmcb->misc_ctl The 'nested_ctl' field is misnamed. Although the first bit is for nested paging, the other defined bits are for SEV/SEV-ES. Other bits in the same field according to the APM (but not defined by KVM) include "Guest Mode Execution Trap", "Enable INVLPGB/TLBSYNC", and other control bits unrelated to 'nested'. There is nothing common among these bits, so just name the field misc_ctl. Also rename the flags accordingly. 
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-19-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/include/asm/svm.h | 8 ++++---- arch/x86/kvm/svm/nested.c | 14 +++++++------- arch/x86/kvm/svm/sev.c | 4 ++-- arch/x86/kvm/svm/svm.c | 4 ++-- arch/x86/kvm/svm/svm.h | 4 ++-- tools/testing/selftests/kvm/include/x86/svm.h | 6 +++--- tools/testing/selftests/kvm/lib/x86/svm.c | 2 +- 7 files changed, 21 insertions(+), 21 deletions(-) diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index edde36097ddc..983db6575141 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -142,7 +142,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area { u64 exit_info_2; u32 exit_int_info; u32 exit_int_info_err; - u64 nested_ctl; + u64 misc_ctl; u64 avic_vapic_bar; u64 ghcb_gpa; u32 event_inj; @@ -239,9 +239,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define SVM_IOIO_SIZE_MASK (7 << SVM_IOIO_SIZE_SHIFT) #define SVM_IOIO_ASIZE_MASK (7 << SVM_IOIO_ASIZE_SHIFT) -#define SVM_NESTED_CTL_NP_ENABLE BIT(0) -#define SVM_NESTED_CTL_SEV_ENABLE BIT(1) -#define SVM_NESTED_CTL_SEV_ES_ENABLE BIT(2) +#define SVM_MISC_ENABLE_NP BIT(0) +#define SVM_MISC_ENABLE_SEV BIT(1) +#define SVM_MISC_ENABLE_SEV_ES BIT(2) #define SVM_TSC_RATIO_RSVD 0xffffff0000000000ULL diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 146faa7584a1..789f38c55541 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -386,7 +386,7 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, if (CC(control->asid == 0)) return false; - if (CC((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && + if (CC((control->misc_ctl & SVM_MISC_ENABLE_NP) && !kvm_vcpu_is_legal_gpa(vcpu, control->nested_cr3))) return false; @@ -477,10 +477,10 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, nested_svm_sanitize_intercept(vcpu, to, SKINIT); nested_svm_sanitize_intercept(vcpu, to, 
RDPRU); - /* Always clear SVM_NESTED_CTL_NP_ENABLE if the guest cannot use NPTs */ - to->nested_ctl = from->nested_ctl; + /* Always clear SVM_MISC_ENABLE_NP if the guest cannot use NPTs */ + to->misc_ctl = from->misc_ctl; if (!guest_cpu_cap_has(vcpu, X86_FEATURE_NPT)) - to->nested_ctl &= ~SVM_NESTED_CTL_NP_ENABLE; + to->misc_ctl &= ~SVM_MISC_ENABLE_NP; to->iopm_base_pa = from->iopm_base_pa; to->msrpm_base_pa = from->msrpm_base_pa; @@ -823,7 +823,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) } /* Copied from vmcb01. msrpm_base can be overwritten later. */ - vmcb02->control.nested_ctl = vmcb01->control.nested_ctl; + vmcb02->control.misc_ctl = vmcb01->control.misc_ctl; vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa; vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa; vmcb_mark_dirty(vmcb02, VMCB_PERM_MAP); @@ -982,7 +982,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, vmcb12->save.rip, vmcb12->control.int_ctl, vmcb12->control.event_inj, - vmcb12->control.nested_ctl, + vmcb12->control.misc_ctl, vmcb12->control.nested_cr3, vmcb12->save.cr3, KVM_ISA_SVM); @@ -1770,7 +1770,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst, dst->exit_info_2 = from->exit_info_2; dst->exit_int_info = from->exit_int_info; dst->exit_int_info_err = from->exit_int_info_err; - dst->nested_ctl = from->nested_ctl; + dst->misc_ctl = from->misc_ctl; dst->event_inj = from->event_inj; dst->event_inj_err = from->event_inj_err; dst->next_rip = from->next_rip; diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index fea4a65758ad..a0f1ce2b9a7b 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -4599,7 +4599,7 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm, bool init_event) struct kvm_sev_info *sev = to_kvm_sev_info(svm->vcpu.kvm); struct vmcb *vmcb = svm->vmcb01.ptr; - svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ES_ENABLE; + svm->vmcb->control.misc_ctl |= 
SVM_MISC_ENABLE_SEV_ES; /* * An SEV-ES guest requires a VMSA area that is a separate from the @@ -4670,7 +4670,7 @@ void sev_init_vmcb(struct vcpu_svm *svm, bool init_event) { struct kvm_vcpu *vcpu = &svm->vcpu; - svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ENABLE; + svm->vmcb->control.misc_ctl |= SVM_MISC_ENABLE_SEV; clr_exception_intercept(svm, UD_VECTOR); /* diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 62501c120112..c626cbacaf4a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1186,7 +1186,7 @@ static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event) if (npt_enabled) { /* Setup VMCB for Nested Paging */ - control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE; + control->misc_ctl |= SVM_MISC_ENABLE_NP; svm_clr_intercept(svm, INTERCEPT_INVLPG); clr_exception_intercept(svm, PF_VECTOR); svm_clr_intercept(svm, INTERCEPT_CR3_READ); @@ -3417,7 +3417,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu) pr_err("%-20s%016llx\n", "exit_info2:", control->exit_info_2); pr_err("%-20s%08x\n", "exit_int_info:", control->exit_int_info); pr_err("%-20s%08x\n", "exit_int_info_err:", control->exit_int_info_err); - pr_err("%-20s%lld\n", "nested_ctl:", control->nested_ctl); + pr_err("%-20s%lld\n", "misc_ctl:", control->misc_ctl); pr_err("%-20s%016llx\n", "nested_cr3:", control->nested_cr3); pr_err("%-20s%016llx\n", "avic_vapic_bar:", control->avic_vapic_bar); pr_err("%-20s%016llx\n", "ghcb:", control->ghcb_gpa); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 6c3b3fae91ec..ab7eebd3fcff 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -167,7 +167,7 @@ struct vmcb_ctrl_area_cached { u64 exit_info_2; u32 exit_int_info; u32 exit_int_info_err; - u64 nested_ctl; + u64 misc_ctl; u32 event_inj; u32 event_inj_err; u64 next_rip; @@ -593,7 +593,7 @@ static inline bool gif_set(struct vcpu_svm *svm) static inline bool nested_npt_enabled(struct vcpu_svm *svm) { - return svm->nested.ctl.nested_ctl & 
SVM_NESTED_CTL_NP_ENABLE; + return svm->nested.ctl.misc_ctl & SVM_MISC_ENABLE_NP; } static inline bool nested_vnmi_enabled(struct vcpu_svm *svm) diff --git a/tools/testing/selftests/kvm/include/x86/svm.h b/tools/testing/selftests/kvm/include/x86/svm.h index 10b30b38bb3f..d81d8a9f5bfb 100644 --- a/tools/testing/selftests/kvm/include/x86/svm.h +++ b/tools/testing/selftests/kvm/include/x86/svm.h @@ -97,7 +97,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area { u64 exit_info_2; u32 exit_int_info; u32 exit_int_info_err; - u64 nested_ctl; + u64 misc_ctl; u64 avic_vapic_bar; u8 reserved_4[8]; u32 event_inj; @@ -175,8 +175,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define SVM_VM_CR_SVM_LOCK_MASK 0x0008ULL #define SVM_VM_CR_SVM_DIS_MASK 0x0010ULL -#define SVM_NESTED_CTL_NP_ENABLE BIT(0) -#define SVM_NESTED_CTL_SEV_ENABLE BIT(1) +#define SVM_MISC_ENABLE_NP BIT(0) +#define SVM_MISC_ENABLE_SEV BIT(1) struct __attribute__ ((__packed__)) vmcb_seg { u16 selector; diff --git a/tools/testing/selftests/kvm/lib/x86/svm.c b/tools/testing/selftests/kvm/lib/x86/svm.c index 2e5c480c9afd..eb20b00112c7 100644 --- a/tools/testing/selftests/kvm/lib/x86/svm.c +++ b/tools/testing/selftests/kvm/lib/x86/svm.c @@ -126,7 +126,7 @@ void generic_svm_setup(struct svm_test_data *svm, void *guest_rip, void *guest_r guest_regs.rdi = (u64)svm; if (svm->ncr3_gpa) { - ctrl->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE; + ctrl->misc_ctl |= SVM_MISC_ENABLE_NP; ctrl->nested_cr3 = svm->ncr3_gpa; } } From 7e6eab9be2200f83ab03ab2b921ea7ca47a6c3b4 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:13 +0000 Subject: [PATCH 102/373] KVM: SVM: Rename vmcb->virt_ext to vmcb->misc_ctl2 'virt' is confusing in the VMCB because it is relative and ambiguous. The 'virt_ext' field includes bits for LBR virtualization and VMSAVE/VMLOAD virtualization, so it's just another miscellaneous control field. Name it as such. 
While at it, move the definitions of the bits below those for 'misc_ctl' and rename them for consistency. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-20-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/include/asm/svm.h | 7 +++---- arch/x86/kvm/svm/nested.c | 16 ++++++++-------- arch/x86/kvm/svm/svm.c | 18 +++++++++--------- arch/x86/kvm/svm/svm.h | 2 +- tools/testing/selftests/kvm/include/x86/svm.h | 8 ++++---- .../kvm/x86/nested_vmsave_vmload_test.c | 16 ++++++++-------- .../selftests/kvm/x86/svm_lbr_nested_state.c | 4 ++-- 7 files changed, 35 insertions(+), 36 deletions(-) diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index 983db6575141..c169256c415f 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -148,7 +148,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area { u32 event_inj; u32 event_inj_err; u64 nested_cr3; - u64 virt_ext; + u64 misc_ctl2; u32 clean; u32 reserved_5; u64 next_rip; @@ -222,9 +222,6 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define X2APIC_MODE_SHIFT 30 #define X2APIC_MODE_MASK (1 << X2APIC_MODE_SHIFT) -#define LBR_CTL_ENABLE_MASK BIT_ULL(0) -#define VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK BIT_ULL(1) - #define SVM_INTERRUPT_SHADOW_MASK BIT_ULL(0) #define SVM_GUEST_INTERRUPT_MASK BIT_ULL(1) @@ -243,6 +240,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define SVM_MISC_ENABLE_SEV BIT(1) #define SVM_MISC_ENABLE_SEV_ES BIT(2) +#define SVM_MISC2_ENABLE_V_LBR BIT_ULL(0) +#define SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE BIT_ULL(1) #define SVM_TSC_RATIO_RSVD 0xffffff0000000000ULL #define SVM_TSC_RATIO_MIN 0x0000000000000001ULL diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 789f38c55541..d3e3721fa223 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -116,7 +116,7 @@ static bool nested_vmcb_needs_vls_intercept(struct vcpu_svm *svm) if (!nested_npt_enabled(svm)) return true; - if 
(!(svm->nested.ctl.virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK)) + if (!(svm->nested.ctl.misc_ctl2 & SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE)) return true; return false; @@ -169,7 +169,7 @@ void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) vmcb_set_intercept(&vmcb02->control, INTERCEPT_VMLOAD); vmcb_set_intercept(&vmcb02->control, INTERCEPT_VMSAVE); } else { - WARN_ON_ONCE(!(vmcb02->control.virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK)); + WARN_ON_ONCE(!(vmcb02->control.misc_ctl2 & SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE)); } } @@ -499,7 +499,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->event_inj_err = from->event_inj_err; to->next_rip = from->next_rip; to->nested_cr3 = from->nested_cr3; - to->virt_ext = from->virt_ext; + to->misc_ctl2 = from->misc_ctl2; to->pause_filter_count = from->pause_filter_count; to->pause_filter_thresh = from->pause_filter_thresh; @@ -679,7 +679,7 @@ void nested_vmcb02_compute_g_pat(struct vcpu_svm *svm) static bool nested_vmcb12_has_lbrv(struct kvm_vcpu *vcpu) { return guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && - (to_svm(vcpu)->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK); + (to_svm(vcpu)->nested.ctl.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR); } static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12) @@ -907,10 +907,10 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) svm->soft_int_next_rip = vmcb12_ctrl->next_rip; } - /* LBR_CTL_ENABLE_MASK is controlled by svm_update_lbrv() */ + /* SVM_MISC2_ENABLE_V_LBR is controlled by svm_update_lbrv() */ if (!nested_vmcb_needs_vls_intercept(svm)) - vmcb02->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + vmcb02->control.misc_ctl2 |= SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; if (guest_cpu_cap_has(vcpu, X86_FEATURE_PAUSEFILTER)) pause_count12 = vmcb12_ctrl->pause_filter_count; @@ -1774,8 +1774,8 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst, dst->event_inj = from->event_inj; dst->event_inj_err = 
from->event_inj_err; dst->next_rip = from->next_rip; - dst->nested_cr3 = from->nested_cr3; - dst->virt_ext = from->virt_ext; + dst->nested_cr3 = from->nested_cr3; + dst->misc_ctl2 = from->misc_ctl2; dst->pause_filter_count = from->pause_filter_count; dst->pause_filter_thresh = from->pause_filter_thresh; /* 'clean' and 'hv_enlightenments' are not changed by KVM */ diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index c626cbacaf4a..7decb68f38f6 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -713,7 +713,7 @@ void *svm_alloc_permissions_map(unsigned long size, gfp_t gfp_mask) static void svm_recalc_lbr_msr_intercepts(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - bool intercept = !(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK); + bool intercept = !(svm->vmcb->control.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR); if (intercept == svm->lbr_msrs_intercepted) return; @@ -846,7 +846,7 @@ static void svm_recalc_msr_intercepts(struct kvm_vcpu *vcpu) static void __svm_enable_lbrv(struct kvm_vcpu *vcpu) { - to_svm(vcpu)->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK; + to_svm(vcpu)->vmcb->control.misc_ctl2 |= SVM_MISC2_ENABLE_V_LBR; } void svm_enable_lbrv(struct kvm_vcpu *vcpu) @@ -858,16 +858,16 @@ void svm_enable_lbrv(struct kvm_vcpu *vcpu) static void __svm_disable_lbrv(struct kvm_vcpu *vcpu) { KVM_BUG_ON(sev_es_guest(vcpu->kvm), vcpu->kvm); - to_svm(vcpu)->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK; + to_svm(vcpu)->vmcb->control.misc_ctl2 &= ~SVM_MISC2_ENABLE_V_LBR; } void svm_update_lbrv(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - bool current_enable_lbrv = svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK; + bool current_enable_lbrv = svm->vmcb->control.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR; bool enable_lbrv = (svm->vmcb->save.dbgctl & DEBUGCTLMSR_LBR) || (is_guest_mode(vcpu) && guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && - (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK)); + (svm->nested.ctl.misc_ctl2 & 
SVM_MISC2_ENABLE_V_LBR)); if (enable_lbrv && !current_enable_lbrv) __svm_enable_lbrv(vcpu); @@ -1222,7 +1222,7 @@ static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event) svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK; if (vls) - svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + svm->vmcb->control.misc_ctl2 |= SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; if (vcpu->kvm->arch.bus_lock_detection_enabled) svm_set_intercept(svm, INTERCEPT_BUSLOCK); @@ -3423,7 +3423,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu) pr_err("%-20s%016llx\n", "ghcb:", control->ghcb_gpa); pr_err("%-20s%08x\n", "event_inj:", control->event_inj); pr_err("%-20s%08x\n", "event_inj_err:", control->event_inj_err); - pr_err("%-20s%lld\n", "virt_ext:", control->virt_ext); + pr_err("%-20s%lld\n", "misc_ctl2:", control->misc_ctl2); pr_err("%-20s%016llx\n", "next_rip:", control->next_rip); pr_err("%-20s%016llx\n", "avic_backing_page:", control->avic_backing_page); pr_err("%-20s%016llx\n", "avic_logical_id:", control->avic_logical_id); @@ -4472,7 +4472,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) * VM-Exit), as running with the host's DEBUGCTL can negatively affect * guest state and can even be fatal, e.g. due to Bus Lock Detect. 
*/ - if (!(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK) && + if (!(svm->vmcb->control.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR) && vcpu->arch.host_debugctl != svm->vmcb->save.dbgctl) update_debugctlmsr(svm->vmcb->save.dbgctl); @@ -4503,7 +4503,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI)) kvm_before_interrupt(vcpu, KVM_HANDLING_NMI); - if (!(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK) && + if (!(svm->vmcb->control.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR) && vcpu->arch.host_debugctl != svm->vmcb->save.dbgctl) update_debugctlmsr(vcpu->arch.host_debugctl); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index ab7eebd3fcff..760a8a6d45cd 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -172,7 +172,7 @@ struct vmcb_ctrl_area_cached { u32 event_inj_err; u64 next_rip; u64 nested_cr3; - u64 virt_ext; + u64 misc_ctl2; u32 clean; union { #if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(CONFIG_KVM_HYPERV) diff --git a/tools/testing/selftests/kvm/include/x86/svm.h b/tools/testing/selftests/kvm/include/x86/svm.h index d81d8a9f5bfb..c8539166270e 100644 --- a/tools/testing/selftests/kvm/include/x86/svm.h +++ b/tools/testing/selftests/kvm/include/x86/svm.h @@ -103,7 +103,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area { u32 event_inj; u32 event_inj_err; u64 nested_cr3; - u64 virt_ext; + u64 misc_ctl2; u32 clean; u32 reserved_5; u64 next_rip; @@ -155,9 +155,6 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define AVIC_ENABLE_SHIFT 31 #define AVIC_ENABLE_MASK (1 << AVIC_ENABLE_SHIFT) -#define LBR_CTL_ENABLE_MASK BIT_ULL(0) -#define VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK BIT_ULL(1) - #define SVM_INTERRUPT_SHADOW_MASK 1 #define SVM_IOIO_STR_SHIFT 2 @@ -178,6 +175,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define SVM_MISC_ENABLE_NP BIT(0) #define SVM_MISC_ENABLE_SEV BIT(1) +#define SVM_MISC2_ENABLE_V_LBR BIT_ULL(0) +#define 
SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE BIT_ULL(1) + struct __attribute__ ((__packed__)) vmcb_seg { u16 selector; u16 attrib; diff --git a/tools/testing/selftests/kvm/x86/nested_vmsave_vmload_test.c b/tools/testing/selftests/kvm/x86/nested_vmsave_vmload_test.c index 6764a48f9d4d..71717118d692 100644 --- a/tools/testing/selftests/kvm/x86/nested_vmsave_vmload_test.c +++ b/tools/testing/selftests/kvm/x86/nested_vmsave_vmload_test.c @@ -79,8 +79,8 @@ static void l1_guest_code(struct svm_test_data *svm) svm->vmcb->control.intercept |= (BIT_ULL(INTERCEPT_VMSAVE) | BIT_ULL(INTERCEPT_VMLOAD)); - /* ..VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK cleared.. */ - svm->vmcb->control.virt_ext &= ~VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + /* ..SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE cleared.. */ + svm->vmcb->control.misc_ctl2 &= ~SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; svm->vmcb->save.rip = (u64)l2_guest_code_vmsave; run_guest(svm->vmcb, svm->vmcb_gpa); @@ -90,8 +90,8 @@ static void l1_guest_code(struct svm_test_data *svm) run_guest(svm->vmcb, svm->vmcb_gpa); GUEST_ASSERT_EQ(svm->vmcb->control.exit_code, SVM_EXIT_VMLOAD); - /* ..and VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK set */ - svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + /* ..and SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE set */ + svm->vmcb->control.misc_ctl2 |= SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; svm->vmcb->save.rip = (u64)l2_guest_code_vmsave; run_guest(svm->vmcb, svm->vmcb_gpa); @@ -106,20 +106,20 @@ static void l1_guest_code(struct svm_test_data *svm) BIT_ULL(INTERCEPT_VMLOAD)); /* - * Without VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK, the GPA will be + * Without SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE, the GPA will be * interpreted as an L1 GPA, so VMCB0 should be used. 
*/ svm->vmcb->save.rip = (u64)l2_guest_code_vmcb0; - svm->vmcb->control.virt_ext &= ~VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + svm->vmcb->control.misc_ctl2 &= ~SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; run_guest(svm->vmcb, svm->vmcb_gpa); GUEST_ASSERT_EQ(svm->vmcb->control.exit_code, SVM_EXIT_VMMCALL); /* - * With VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK, the GPA will be interpeted as + * With SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE, the GPA will be interpreted as * an L2 GPA, and translated through the NPT to VMCB1. */ svm->vmcb->save.rip = (u64)l2_guest_code_vmcb1; - svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + svm->vmcb->control.misc_ctl2 |= SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; run_guest(svm->vmcb, svm->vmcb_gpa); GUEST_ASSERT_EQ(svm->vmcb->control.exit_code, SVM_EXIT_VMMCALL); diff --git a/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c index bf16abb1152e..ff99438824d3 100644 --- a/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c +++ b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c @@ -69,9 +69,9 @@ static void l1_guest_code(struct svm_test_data *svm, bool nested_lbrv) &l2_guest_stack[L2_GUEST_STACK_SIZE]); if (nested_lbrv) - vmcb->control.virt_ext = LBR_CTL_ENABLE_MASK; + vmcb->control.misc_ctl2 = SVM_MISC2_ENABLE_V_LBR; else - vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK; + vmcb->control.misc_ctl2 &= ~SVM_MISC2_ENABLE_V_LBR; run_guest(vmcb, svm->vmcb_gpa); GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL); From 84dc9fd0354d3d0e02faf2f7b3f4d1228c2571ea Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:14 +0000 Subject: [PATCH 103/373] KVM: nSVM: Cache all used fields from VMCB12 Currently, most fields used from VMCB12 are cached in svm->nested.{ctl/save}. This is mainly to avoid TOC-TOU bugs. However, for the save area, only the fields used in the consistency checks (i.e. nested_vmcb_check_save()) were being cached. 
Other fields are read directly from guest memory in nested_vmcb02_prepare_save(). While probably benign, this still makes it possible for TOC-TOU bugs to happen. For example, RAX, RSP, and RIP are read twice, once to store in VMCB02, and once to store in vcpu->arch.regs. It is possible for the guest to modify the value between both reads, potentially causing nasty bugs. Harden against such bugs by caching everything in svm->nested.save. Cache all the needed fields, and keep all accesses to the VMCB12 strictly in nested_svm_vmrun() for caching and early error injection. Following changes will further limit the access to the VMCB12 in the nested VMRUN path. Introduce vmcb12_is_dirty() to use with the cached control fields instead of vmcb_is_dirty(), similar to vmcb12_is_intercept(). Opportunistically order the copies in __nested_copy_vmcb_save_to_cache() by the order in which the fields are defined in struct vmcb_save_area. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-21-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 116 ++++++++++++++++++++++---------------- arch/x86/kvm/svm/svm.c | 2 +- arch/x86/kvm/svm/svm.h | 27 ++++++++- 3 files changed, 93 insertions(+), 52 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index d3e3721fa223..0c3f2db6ac0b 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -507,11 +507,11 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->asid = from->asid; to->msrpm_base_pa &= ~0x0fffULL; to->iopm_base_pa &= ~0x0fffULL; + to->clean = from->clean; #ifdef CONFIG_KVM_HYPERV /* Hyper-V extensions (Enlightened VMCB) */ if (kvm_hv_hypercall_enabled(vcpu)) { - to->clean = from->clean; memcpy(&to->hv_enlightenments, &from->hv_enlightenments, sizeof(to->hv_enlightenments)); } @@ -527,19 +527,34 @@ void nested_copy_vmcb_control_to_cache(struct vcpu_svm *svm, static void __nested_copy_vmcb_save_to_cache(struct 
vmcb_save_area_cached *to, struct vmcb_save_area *from) { - /* - * Copy only fields that are validated, as we need them - * to avoid TOC/TOU races. - */ + to->es = from->es; to->cs = from->cs; + to->ss = from->ss; + to->ds = from->ds; + to->gdtr = from->gdtr; + to->idtr = from->idtr; + + to->cpl = from->cpl; to->efer = from->efer; - to->cr0 = from->cr0; - to->cr3 = from->cr3; to->cr4 = from->cr4; - - to->dr6 = from->dr6; + to->cr3 = from->cr3; + to->cr0 = from->cr0; to->dr7 = from->dr7; + to->dr6 = from->dr6; + + to->rflags = from->rflags; + to->rip = from->rip; + to->rsp = from->rsp; + + to->s_cet = from->s_cet; + to->ssp = from->ssp; + to->isst_addr = from->isst_addr; + + to->rax = from->rax; + to->cr2 = from->cr2; + + svm_copy_lbrs(to, from); } void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm, @@ -682,8 +697,10 @@ static bool nested_vmcb12_has_lbrv(struct kvm_vcpu *vcpu) (to_svm(vcpu)->nested.ctl.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR); } -static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12) +static void nested_vmcb02_prepare_save(struct vcpu_svm *svm) { + struct vmcb_ctrl_area_cached *control = &svm->nested.ctl; + struct vmcb_save_area_cached *save = &svm->nested.save; bool new_vmcb12 = false; struct vmcb *vmcb01 = svm->vmcb01.ptr; struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; @@ -699,48 +716,48 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 svm->nested.force_msr_bitmap_recalc = true; } - if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_SEG))) { - vmcb02->save.es = vmcb12->save.es; - vmcb02->save.cs = vmcb12->save.cs; - vmcb02->save.ss = vmcb12->save.ss; - vmcb02->save.ds = vmcb12->save.ds; - vmcb02->save.cpl = vmcb12->save.cpl; + if (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_SEG))) { + vmcb02->save.es = save->es; + vmcb02->save.cs = save->cs; + vmcb02->save.ss = save->ss; + vmcb02->save.ds = save->ds; + vmcb02->save.cpl = save->cpl; vmcb_mark_dirty(vmcb02, VMCB_SEG); } - if 
(unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_DT))) { - vmcb02->save.gdtr = vmcb12->save.gdtr; - vmcb02->save.idtr = vmcb12->save.idtr; + if (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_DT))) { + vmcb02->save.gdtr = save->gdtr; + vmcb02->save.idtr = save->idtr; vmcb_mark_dirty(vmcb02, VMCB_DT); } if (guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK) && - (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_CET)))) { - vmcb02->save.s_cet = vmcb12->save.s_cet; - vmcb02->save.isst_addr = vmcb12->save.isst_addr; - vmcb02->save.ssp = vmcb12->save.ssp; + (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_CET)))) { + vmcb02->save.s_cet = save->s_cet; + vmcb02->save.isst_addr = save->isst_addr; + vmcb02->save.ssp = save->ssp; vmcb_mark_dirty(vmcb02, VMCB_CET); } - kvm_set_rflags(vcpu, vmcb12->save.rflags | X86_EFLAGS_FIXED); + kvm_set_rflags(vcpu, save->rflags | X86_EFLAGS_FIXED); svm_set_efer(vcpu, svm->nested.save.efer); svm_set_cr0(vcpu, svm->nested.save.cr0); svm_set_cr4(vcpu, svm->nested.save.cr4); - svm->vcpu.arch.cr2 = vmcb12->save.cr2; + svm->vcpu.arch.cr2 = save->cr2; - kvm_rax_write(vcpu, vmcb12->save.rax); - kvm_rsp_write(vcpu, vmcb12->save.rsp); - kvm_rip_write(vcpu, vmcb12->save.rip); + kvm_rax_write(vcpu, save->rax); + kvm_rsp_write(vcpu, save->rsp); + kvm_rip_write(vcpu, save->rip); /* In case we don't even reach vcpu_run, the fields are not updated */ - vmcb02->save.rax = vmcb12->save.rax; - vmcb02->save.rsp = vmcb12->save.rsp; - vmcb02->save.rip = vmcb12->save.rip; + vmcb02->save.rax = save->rax; + vmcb02->save.rsp = save->rsp; + vmcb02->save.rip = save->rip; - if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_DR))) { + if (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_DR))) { vmcb02->save.dr7 = svm->nested.save.dr7 | DR7_FIXED_1; svm->vcpu.arch.dr6 = svm->nested.save.dr6 | DR6_ACTIVE_LOW; vmcb_mark_dirty(vmcb02, VMCB_DR); @@ -751,7 +768,7 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 * Reserved bits 
of DEBUGCTL are ignored. Be consistent with * svm_set_msr's definition of reserved bits. */ - svm_copy_lbrs(&vmcb02->save, &vmcb12->save); + svm_copy_lbrs(&vmcb02->save, save); vmcb02->save.dbgctl &= ~DEBUGCTL_RESERVED_BITS; } else { svm_copy_lbrs(&vmcb02->save, &vmcb01->save); @@ -971,28 +988,29 @@ static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl; } -int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, - struct vmcb *vmcb12, bool from_vmrun) +int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun) { struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb_ctrl_area_cached *control = &svm->nested.ctl; + struct vmcb_save_area_cached *save = &svm->nested.save; int ret; trace_kvm_nested_vmenter(svm->vmcb->save.rip, vmcb12_gpa, - vmcb12->save.rip, - vmcb12->control.int_ctl, - vmcb12->control.event_inj, - vmcb12->control.misc_ctl, - vmcb12->control.nested_cr3, - vmcb12->save.cr3, + save->rip, + control->int_ctl, + control->event_inj, + control->misc_ctl, + control->nested_cr3, + save->cr3, KVM_ISA_SVM); - trace_kvm_nested_intercepts(vmcb12->control.intercepts[INTERCEPT_CR] & 0xffff, - vmcb12->control.intercepts[INTERCEPT_CR] >> 16, - vmcb12->control.intercepts[INTERCEPT_EXCEPTION], - vmcb12->control.intercepts[INTERCEPT_WORD3], - vmcb12->control.intercepts[INTERCEPT_WORD4], - vmcb12->control.intercepts[INTERCEPT_WORD5]); + trace_kvm_nested_intercepts(control->intercepts[INTERCEPT_CR] & 0xffff, + control->intercepts[INTERCEPT_CR] >> 16, + control->intercepts[INTERCEPT_EXCEPTION], + control->intercepts[INTERCEPT_WORD3], + control->intercepts[INTERCEPT_WORD4], + control->intercepts[INTERCEPT_WORD5]); svm->nested.vmcb12_gpa = vmcb12_gpa; @@ -1003,7 +1021,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, svm_switch_vmcb(svm, &svm->nested.vmcb02); nested_vmcb02_prepare_control(svm); - nested_vmcb02_prepare_save(svm, vmcb12); + 
nested_vmcb02_prepare_save(svm); ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3, nested_npt_enabled(svm), from_vmrun); @@ -1091,7 +1109,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) svm->nested.nested_run_pending = 1; - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12, true)) + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) goto out_exit_err; if (nested_svm_merge_msrpm(vcpu)) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7decb68f38f6..2c511f86b79d 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -5004,7 +5004,7 @@ static int svm_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) vmcb12 = map.hva; nested_copy_vmcb_control_to_cache(svm, &vmcb12->control); nested_copy_vmcb_save_to_cache(svm, &vmcb12->save); - ret = enter_svm_guest_mode(vcpu, smram64->svm_guest_vmcb_gpa, vmcb12, false); + ret = enter_svm_guest_mode(vcpu, smram64->svm_guest_vmcb_gpa, false); if (ret) goto unmap_save; diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 760a8a6d45cd..995c8de3f660 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -140,13 +140,32 @@ struct kvm_vmcb_info { }; struct vmcb_save_area_cached { + struct vmcb_seg es; struct vmcb_seg cs; + struct vmcb_seg ss; + struct vmcb_seg ds; + struct vmcb_seg gdtr; + struct vmcb_seg idtr; + u8 cpl; u64 efer; u64 cr4; u64 cr3; u64 cr0; u64 dr7; u64 dr6; + u64 rflags; + u64 rip; + u64 rsp; + u64 s_cet; + u64 ssp; + u64 isst_addr; + u64 rax; + u64 cr2; + u64 dbgctl; + u64 br_from; + u64 br_to; + u64 last_excp_from; + u64 last_excp_to; }; struct vmcb_ctrl_area_cached { @@ -419,6 +438,11 @@ static inline bool vmcb_is_dirty(struct vmcb *vmcb, int bit) return !test_bit(bit, (unsigned long *)&vmcb->control.clean); } +static inline bool vmcb12_is_dirty(struct vmcb_ctrl_area_cached *control, int bit) +{ + return !test_bit(bit, (unsigned long *)&control->clean); +} + static __always_inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu) { return 
container_of(vcpu, struct vcpu_svm, vcpu); @@ -799,8 +823,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm) int __init nested_svm_init_msrpm_merge_offsets(void); -int enter_svm_guest_mode(struct kvm_vcpu *vcpu, - u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun); +int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb_gpa, bool from_vmrun); void svm_leave_nested(struct kvm_vcpu *vcpu); void svm_free_nested(struct vcpu_svm *svm); int svm_allocate_nested(struct vcpu_svm *svm); From b709087e9e544259d1d075ced91cc4ab769a8ae2 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:15 +0000 Subject: [PATCH 104/373] KVM: nSVM: Restrict mapping vmcb12 on nested VMRUN All accesses to the vmcb12 in the guest memory on nested VMRUN are limited to nested_svm_vmrun() copying vmcb12 fields and writing them on failed consistency checks. However, vmcb12 remains mapped throughout nested_svm_vmrun(). Mapping and unmapping around usages is possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is used after being unmapped. Move reading the vmcb12, copying to cache, and consistency checks from nested_svm_vmrun() into a new helper, nested_svm_copy_vmcb12_to_cache() to limit the scope of the mapping. 
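The caching pattern described above — snapshot guest-writable state exactly once, then validate and consume only the snapshot — can be illustrated with a minimal userspace sketch. The `shared_ctl` struct and both helpers below are hypothetical stand-ins for illustration, not KVM code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a guest-writable control area (not KVM code). */
struct shared_ctl {
	uint64_t asid;
	uint64_t rip;
};

/* TOC/TOU-prone: checks the shared copy, then reads it again at use time. */
static uint64_t use_unsafe(volatile struct shared_ctl *shared)
{
	if (shared->asid == 0)		/* time-of-check */
		return 0;
	return shared->asid;		/* time-of-use: the guest may have changed it */
}

/* Hardened: snapshot each field exactly once, then check and use the snapshot. */
static uint64_t use_cached(volatile struct shared_ctl *shared)
{
	struct shared_ctl cache;

	cache.asid = shared->asid;	/* single read per field */
	cache.rip = shared->rip;
	if (cache.asid == 0)
		return 0;
	return cache.asid;		/* guaranteed consistent with the check */
}
```

In the hardened variant a concurrent writer can never make the value that was checked differ from the value that is used, which is exactly what caching vmcb12 in svm->nested.save/ctl buys.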
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-22-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 89 ++++++++++++++++++++++----------------- 1 file changed, 51 insertions(+), 38 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 0c3f2db6ac0b..c61b4923963e 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1041,12 +1041,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun) return 0; } +static int nested_svm_copy_vmcb12_to_cache(struct kvm_vcpu *vcpu, u64 vmcb12_gpa) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct kvm_host_map map; + struct vmcb *vmcb12; + int r = 0; + + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) + return -EFAULT; + + vmcb12 = map.hva; + nested_copy_vmcb_control_to_cache(svm, &vmcb12->control); + nested_copy_vmcb_save_to_cache(svm, &vmcb12->save); + + if (!nested_vmcb_check_save(vcpu, &svm->nested.save) || + !nested_vmcb_check_controls(vcpu, &svm->nested.ctl)) { + vmcb12->control.exit_code = SVM_EXIT_ERR; + vmcb12->control.exit_info_1 = 0; + vmcb12->control.exit_info_2 = 0; + vmcb12->control.event_inj = 0; + vmcb12->control.event_inj_err = 0; + svm_set_gif(svm, false); + r = -EINVAL; + } + + kvm_vcpu_unmap(vcpu, &map); + return r; +} + int nested_svm_vmrun(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - int ret; - struct vmcb *vmcb12; - struct kvm_host_map map; + int ret, err; u64 vmcb12_gpa; struct vmcb *vmcb01 = svm->vmcb01.ptr; @@ -1067,32 +1094,23 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) return ret; } + if (WARN_ON_ONCE(!svm->nested.initialized)) + return -EINVAL; + vmcb12_gpa = svm->vmcb->save.rax; - if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) { + err = nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa); + if (err == -EFAULT) { kvm_inject_gp(vcpu, 0); return 1; } + /* + * Advance RIP if #GP or #UD are not injected, but otherwise stop if + * 
copying and checking vmcb12 failed. + */ ret = kvm_skip_emulated_instruction(vcpu); - - vmcb12 = map.hva; - - if (WARN_ON_ONCE(!svm->nested.initialized)) - return -EINVAL; - - nested_copy_vmcb_control_to_cache(svm, &vmcb12->control); - nested_copy_vmcb_save_to_cache(svm, &vmcb12->save); - - if (!nested_vmcb_check_save(vcpu, &svm->nested.save) || - !nested_vmcb_check_controls(vcpu, &svm->nested.ctl)) { - vmcb12->control.exit_code = SVM_EXIT_ERR; - vmcb12->control.exit_info_1 = 0; - vmcb12->control.exit_info_2 = 0; - vmcb12->control.event_inj = 0; - vmcb12->control.event_inj_err = 0; - svm_set_gif(svm, false); - goto out; - } + if (err) + return ret; /* * Since vmcb01 is not in use, we can use it to store some of the L1 @@ -1109,23 +1127,18 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) svm->nested.nested_run_pending = 1; - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) - goto out_exit_err; + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) || + !nested_svm_merge_msrpm(vcpu)) { + svm->nested.nested_run_pending = 0; + svm->nmi_l1_to_l2 = false; + svm->soft_int_injected = false; - if (nested_svm_merge_msrpm(vcpu)) - goto out; + svm->vmcb->control.exit_code = SVM_EXIT_ERR; + svm->vmcb->control.exit_info_1 = 0; + svm->vmcb->control.exit_info_2 = 0; -out_exit_err: - svm->nested.nested_run_pending = 0; - - svm->vmcb->control.exit_code = SVM_EXIT_ERR; - svm->vmcb->control.exit_info_1 = 0; - svm->vmcb->control.exit_info_2 = 0; - - nested_svm_vmexit(svm); - -out: - kvm_vcpu_unmap(vcpu, &map); + nested_svm_vmexit(svm); + } return ret; } From a2b858051cf03d4f0abca014cddd424675be5316 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:16 +0000 Subject: [PATCH 105/373] KVM: nSVM: Use PAGE_MASK to drop lower bits of bitmap GPAs from vmcb12 Use PAGE_MASK to drop the lower bits from IOPM_BASE_PA and MSRPM_BASE_PA while copying them instead of dropping the bits afterward with a hardcoded mask. No functional change intended. 
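The PAGE_MASK change above is a pure equivalence for 4 KiB pages. A standalone sketch, with `PAGE_SHIFT`/`PAGE_MASK` redefined locally rather than taken from kernel headers:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's definitions (4 KiB pages assumed). */
#define PAGE_SHIFT	12
#define PAGE_MASK	(~((1ULL << PAGE_SHIFT) - 1))

/* Drop the offset bits of a bitmap GPA, as done while copying to the cache. */
static uint64_t align_bitmap_gpa(uint64_t gpa)
{
	return gpa & PAGE_MASK;
}
```

Masking at copy time means the cached iopm_base_pa/msrpm_base_pa are page-aligned from the moment they exist, instead of being fixed up afterward with the hardcoded `~0x0fffULL`.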
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-23-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index c61b4923963e..fd7045904948 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -482,8 +482,8 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, if (!guest_cpu_cap_has(vcpu, X86_FEATURE_NPT)) to->misc_ctl &= ~SVM_MISC_ENABLE_NP; - to->iopm_base_pa = from->iopm_base_pa; - to->msrpm_base_pa = from->msrpm_base_pa; + to->iopm_base_pa = from->iopm_base_pa & PAGE_MASK; + to->msrpm_base_pa = from->msrpm_base_pa & PAGE_MASK; to->tsc_offset = from->tsc_offset; to->tlb_ctl = from->tlb_ctl; to->erap_ctl = from->erap_ctl; @@ -505,8 +505,6 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, /* Copy asid here because nested_vmcb_check_controls() will check it */ to->asid = from->asid; - to->msrpm_base_pa &= ~0x0fffULL; - to->iopm_base_pa &= ~0x0fffULL; to->clean = from->clean; #ifdef CONFIG_KVM_HYPERV From 30a1d2fa819039e06bc6242669f6fd45df039a41 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:17 +0000 Subject: [PATCH 106/373] KVM: nSVM: Sanitize TLB_CONTROL field when copying from vmcb12 The APM defines possible values for TLB_CONTROL as 0, 1, 3, and 7 -- all of which are always allowed for KVM guests as KVM always supports X86_FEATURE_FLUSHBYASID. Only copy bits 0 to 2 from vmcb12's TLB_CONTROL, such that no unhandled or reserved bits end up in vmcb02. Note that TLB_CONTROL in vmcb12 is currently ignored by KVM, as it nukes the TLB on nested transitions anyway (see nested_svm_transition_tlb_flush()). However, such sanitization will be needed once the TODOs there are addressed, and it's minimal churn to add it now. 
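The TLB_CONTROL sanitization above can be sketched standalone; the values from `<asm/svm.h>` are copied locally and `GENMASK(2, 0)` is expanded to `0x7`. All four architecturally defined values pass through unchanged, while reserved bits are dropped:

```c
#include <assert.h>
#include <stdint.h>

/* Values from <asm/svm.h>, copied locally; GENMASK(2, 0) expands to 0x7. */
#define TLB_CONTROL_DO_NOTHING		0
#define TLB_CONTROL_FLUSH_ALL_ASID	1
#define TLB_CONTROL_FLUSH_ASID		3
#define TLB_CONTROL_FLUSH_ASID_LOCAL	7
#define TLB_CONTROL_MASK		0x7

/* Keep only bits 0-2 of vmcb12's TLB_CONTROL, dropping reserved bits. */
static uint8_t sanitize_tlb_ctl(uint8_t tlb_ctl)
{
	return tlb_ctl & TLB_CONTROL_MASK;
}
```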
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-24-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/include/asm/svm.h | 2 ++ arch/x86/kvm/svm/nested.c | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index c169256c415f..16cf4f435aeb 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -182,6 +182,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define TLB_CONTROL_FLUSH_ASID 3 #define TLB_CONTROL_FLUSH_ASID_LOCAL 7 +#define TLB_CONTROL_MASK GENMASK(2, 0) + #define ERAP_CONTROL_ALLOW_LARGER_RAP BIT(0) #define ERAP_CONTROL_CLEAR_RAP BIT(1) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index fd7045904948..c4680270e54f 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -485,7 +485,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->iopm_base_pa = from->iopm_base_pa & PAGE_MASK; to->msrpm_base_pa = from->msrpm_base_pa & PAGE_MASK; to->tsc_offset = from->tsc_offset; - to->tlb_ctl = from->tlb_ctl; + to->tlb_ctl = from->tlb_ctl & TLB_CONTROL_MASK; to->erap_ctl = from->erap_ctl; to->int_ctl = from->int_ctl; to->int_vector = from->int_vector; From c8123e82725648b1b13103ce3d8066ce13ab81b7 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:18 +0000 Subject: [PATCH 107/373] KVM: nSVM: Sanitize INT/EVENTINJ fields when copying from vmcb12 Make sure all fields used from vmcb12 in creating the vmcb02 are sanitized, such that no unhandled or reserved bits end up in the vmcb02. The following control fields are read from vmcb12 and have bits that are either reserved or not handled/advertised by KVM: tlb_ctl, int_ctl, int_state, int_vector, event_inj, misc_ctl, and misc_ctl2. The following fields do not require any extra sanitizing: - tlb_ctl: already being sanitized. - int_ctl: bits from vmcb12 are copied bit-by-bit as needed. 
- misc_ctl: only used in consistency checks (particularly NP_ENABLE). - misc_ctl2: bits from vmcb12 are copied bit-by-bit as needed. For the remaining fields (int_vector, int_state, and event_inj), make sure only defined bits are copied from L1's vmcb12 into KVM's cache by defining appropriate masks where needed. Suggested-by: Jim Mattson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-25-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/include/asm/svm.h | 5 +++++ arch/x86/kvm/svm/nested.c | 8 ++++---- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index 16cf4f435aeb..bcfeb5e7c0ed 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -224,6 +224,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define X2APIC_MODE_SHIFT 30 #define X2APIC_MODE_MASK (1 << X2APIC_MODE_SHIFT) +#define SVM_INT_VECTOR_MASK GENMASK(7, 0) + #define SVM_INTERRUPT_SHADOW_MASK BIT_ULL(0) #define SVM_GUEST_INTERRUPT_MASK BIT_ULL(1) @@ -637,6 +639,9 @@ static inline void __unused_size_checks(void) #define SVM_EVTINJ_VALID (1 << 31) #define SVM_EVTINJ_VALID_ERR (1 << 11) +#define SVM_EVTINJ_RESERVED_BITS ~(SVM_EVTINJ_VEC_MASK | SVM_EVTINJ_TYPE_MASK | \ + SVM_EVTINJ_VALID_ERR | SVM_EVTINJ_VALID) + #define SVM_EXITINTINFO_VEC_MASK SVM_EVTINJ_VEC_MASK #define SVM_EXITINTINFO_TYPE_MASK SVM_EVTINJ_TYPE_MASK diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index c4680270e54f..bf1e1ca22d9c 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -488,18 +488,18 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->tlb_ctl = from->tlb_ctl & TLB_CONTROL_MASK; to->erap_ctl = from->erap_ctl; to->int_ctl = from->int_ctl; - to->int_vector = from->int_vector; - to->int_state = from->int_state; + to->int_vector = from->int_vector & SVM_INT_VECTOR_MASK; + to->int_state = from->int_state &
SVM_INTERRUPT_SHADOW_MASK; to->exit_code = from->exit_code; to->exit_info_1 = from->exit_info_1; to->exit_info_2 = from->exit_info_2; to->exit_int_info = from->exit_int_info; to->exit_int_info_err = from->exit_int_info_err; - to->event_inj = from->event_inj; + to->event_inj = from->event_inj & ~SVM_EVTINJ_RESERVED_BITS; to->event_inj_err = from->event_inj_err; to->next_rip = from->next_rip; to->nested_cr3 = from->nested_cr3; - to->misc_ctl2 = from->misc_ctl2; + to->misc_ctl2 = from->misc_ctl2; to->pause_filter_count = from->pause_filter_count; to->pause_filter_thresh = from->pause_filter_thresh; From b6dc21d896a02b5fd305f505a4ec4dad50ecd8fb Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:19 +0000 Subject: [PATCH 108/373] KVM: nSVM: Only copy SVM_MISC_ENABLE_NP from VMCB01's misc_ctl The 'misc_ctl' field in VMCB02 is taken as-is from VMCB01. However, the only bit that needs to be copied is SVM_MISC_ENABLE_NP, as all other known bits in misc_ctl are related to SEV guests, and KVM doesn't support nested virtualization for SEV guests. Only copy SVM_MISC_ENABLE_NP to harden against future bugs if/when other bits are set for L1 but should not be set for L2. Opportunistically add a comment explaining why SVM_MISC_ENABLE_NP is taken from VMCB01 and not VMCB02. Suggested-by: Jim Mattson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-26-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index bf1e1ca22d9c..b191c6cab57d 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -837,8 +837,16 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) V_NMI_BLOCKING_MASK); } - /* Copied from vmcb01. msrpm_base can be overwritten later. */ - vmcb02->control.misc_ctl = vmcb01->control.misc_ctl; + /* + * Copied from vmcb01.
msrpm_base can be overwritten later. + * + * SVM_MISC_ENABLE_NP in vmcb12 is only used for consistency checks. If + * L1 enables NPTs, KVM shadows L1's NPTs and uses those to run L2. If + * L1 disables NPT, KVM runs L2 with the same NPTs used to run L1. For + * the latter, L1 runs L2 with shadow page tables that translate L2 GVAs + * to L1 GPAs, so the same NPTs can be used for L1 and L2. + */ + vmcb02->control.misc_ctl = vmcb01->control.misc_ctl & SVM_MISC_ENABLE_NP; vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa; vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa; vmcb_mark_dirty(vmcb02, VMCB_PERM_MAP); From 5e4c6da0bb925bc91a6020511e85bd9574f8474a Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:20 +0000 Subject: [PATCH 109/373] KVM: selftest: Add a selftest for VMRUN/#VMEXIT with unmappable vmcb12 Add a test that verifies that KVM correctly injects a #GP for nested VMRUN and a shutdown for nested #VMEXIT, if the GPA of vmcb12 cannot be mapped. 
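The "max legal GPA" trick the selftest relies on — the highest page-aligned GPA below `1 << MAXPHYADDR`, which CPUID says is a legal address but which KVM cannot map absent a backing memslot — can be sketched standalone:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	4096ULL

/*
 * Highest page-aligned GPA below 1 << MAXPHYADDR: a legal address per
 * CPUID, but unmappable by KVM if no memslot backs it.
 */
static uint64_t max_legal_gpa(unsigned int maxphyaddr)
{
	return (1ULL << maxphyaddr) - PAGE_SIZE;
}
```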
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-27-yosry@kernel.org Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile.kvm | 1 + .../kvm/x86/svm_nested_invalid_vmcb12_gpa.c | 98 +++++++++++++++++++ 2 files changed, 99 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index 36b48e766e49..f12e7c17d379 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -110,6 +110,7 @@ TEST_GEN_PROGS_x86 += x86/state_test TEST_GEN_PROGS_x86 += x86/vmx_preemption_timer_test TEST_GEN_PROGS_x86 += x86/svm_vmcall_test TEST_GEN_PROGS_x86 += x86/svm_int_ctl_test +TEST_GEN_PROGS_x86 += x86/svm_nested_invalid_vmcb12_gpa TEST_GEN_PROGS_x86 += x86/svm_nested_shutdown_test TEST_GEN_PROGS_x86 += x86/svm_nested_soft_inject_test TEST_GEN_PROGS_x86 += x86/svm_lbr_nested_state diff --git a/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c b/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c new file mode 100644 index 000000000000..c6d5f712120d --- /dev/null +++ b/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2026, Google LLC. 
+ */ +#include "kvm_util.h" +#include "vmx.h" +#include "svm_util.h" +#include "kselftest.h" + + +#define L2_GUEST_STACK_SIZE 64 + +#define SYNC_GP 101 +#define SYNC_L2_STARTED 102 + +u64 valid_vmcb12_gpa; +int gp_triggered; + +static void guest_gp_handler(struct ex_regs *regs) +{ + GUEST_ASSERT(!gp_triggered); + GUEST_SYNC(SYNC_GP); + gp_triggered = 1; + regs->rax = valid_vmcb12_gpa; +} + +static void l2_guest_code(void) +{ + GUEST_SYNC(SYNC_L2_STARTED); + vmcall(); +} + +static void l1_guest_code(struct svm_test_data *svm, u64 invalid_vmcb12_gpa) +{ + unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + + generic_svm_setup(svm, l2_guest_code, + &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + valid_vmcb12_gpa = svm->vmcb_gpa; + + run_guest(svm->vmcb, invalid_vmcb12_gpa); /* #GP */ + + /* GP handler should jump here */ + GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL); + GUEST_DONE(); +} + +int main(int argc, char *argv[]) +{ + struct kvm_x86_state *state; + vm_vaddr_t nested_gva = 0; + struct kvm_vcpu *vcpu; + uint32_t maxphyaddr; + u64 max_legal_gpa; + struct kvm_vm *vm; + struct ucall uc; + + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); + + vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code); + vm_install_exception_handler(vcpu->vm, GP_VECTOR, guest_gp_handler); + + /* + * Find the max legal GPA that is not backed by a memslot (i.e. cannot + * be mapped by KVM). + */ + maxphyaddr = kvm_cpuid_property(vcpu->cpuid, X86_PROPERTY_MAX_PHY_ADDR); + max_legal_gpa = BIT_ULL(maxphyaddr) - PAGE_SIZE; + vcpu_alloc_svm(vm, &nested_gva); + vcpu_args_set(vcpu, 2, nested_gva, max_legal_gpa); + + /* VMRUN with max_legal_gpa, KVM injects a #GP */ + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); + TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); + TEST_ASSERT_EQ(uc.args[1], SYNC_GP); + + /* + * Enter L2 (with a legit vmcb12 GPA), then overwrite vmcb12 GPA with + * max_legal_gpa. KVM will fail to map vmcb12 on nested VM-Exit and + * cause a shutdown. 
+ */ + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); + TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); + TEST_ASSERT_EQ(uc.args[1], SYNC_L2_STARTED); + + state = vcpu_save_state(vcpu); + state->nested.hdr.svm.vmcb_pa = max_legal_gpa; + vcpu_load_state(vcpu, state); + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SHUTDOWN); + + kvm_x86_state_cleanup(state); + kvm_vm_free(vm); + return 0; +} From 66b207f175f1cd52b083c4d90d03cc1c15b8ae6a Mon Sep 17 00:00:00 2001 From: Jim Mattson Date: Mon, 23 Feb 2026 16:54:39 -0800 Subject: [PATCH 110/373] KVM: x86: SVM: Remove vmcb_is_dirty() After commit dd26d1b5d6ed ("KVM: nSVM: Cache all used fields from VMCB12"), vmcb_is_dirty() has no callers. Remove the function. Signed-off-by: Jim Mattson Link: https://patch.msgid.link/20260224005500.1471972-2-jmattson@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.h | 5 ----- 1 file changed, 5 deletions(-) diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 995c8de3f660..c53068848628 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -433,11 +433,6 @@ static inline void vmcb_mark_dirty(struct vmcb *vmcb, int bit) vmcb->control.clean &= ~(1 << bit); } -static inline bool vmcb_is_dirty(struct vmcb *vmcb, int bit) -{ - return !test_bit(bit, (unsigned long *)&vmcb->control.clean); -} - static inline bool vmcb12_is_dirty(struct vmcb_ctrl_area_cached *control, int bit) { return !test_bit(bit, (unsigned long *)&control->clean); From cdc69269b18a19cb76eaf7bf4fa47fe270dcaf11 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 9 Feb 2026 19:51:41 +0000 Subject: [PATCH 111/373] KVM: SVM: Triple fault L1 on unintercepted EFER.SVME clear by L2 KVM tracks when EFER.SVME is set and cleared to initialize and tear down nested state. However, it doesn't differentiate if EFER.SVME is getting toggled in L1 or L2+. 
If L2 clears EFER.SVME, and L1 does not intercept the EFER write, KVM exits guest mode and tears down nested state while L2 is running, executing L1 without injecting a proper #VMEXIT. According to the APM: The effect of turning off EFER.SVME while a guest is running is undefined; therefore, the VMM should always prevent guests from writing EFER. Since the behavior is architecturally undefined, KVM gets to choose what to do. Inject a triple fault into L1 as a more graceful option than running L1 with corrupted state. Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed base-commit: 95deaec3557dced322e2540bfa426e60e5373d46 Link: https://patch.msgid.link/20260209195142.2554532-2-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 2c511f86b79d..4bf0f5d7167f 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -217,6 +217,19 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer) if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) { if (!(efer & EFER_SVME)) { + /* + * Architecturally, clearing EFER.SVME while a guest is + * running yields undefined behavior, i.e. KVM can do + * literally anything. Force the vCPU back into L1 as + * that is the safest option for KVM, but synthesize a + * triple fault (for L1!) so that KVM at least doesn't + * run random L2 code in the context of L1. Do so if + * and only if the vCPU is actively running, e.g. to + * avoid false positives if userspace is stuffing state.
+ */ + if (is_guest_mode(vcpu) && vcpu->wants_to_run) + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + svm_leave_nested(vcpu); /* #GP intercept is still needed for vmware backdoor */ if (!enable_vmware_backdoor) From 3900e56eb184abcc8a16ab52af24ea255589acc2 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 9 Feb 2026 19:51:42 +0000 Subject: [PATCH 112/373] KVM: selftests: Add a test for L2 clearing EFER.SVME without intercept Add a test that verifies KVM's newly introduced behavior of synthesizing a triple fault in L1 if L2 clears EFER.SVME without an L1 interception (which is architecturally undefined). Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260209195142.2554532-3-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile.kvm | 1 + .../kvm/x86/svm_nested_clear_efer_svme.c | 55 +++++++++++++++++++ 2 files changed, 56 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86/svm_nested_clear_efer_svme.c diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index f12e7c17d379..ba87cd31872b 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -110,6 +110,7 @@ TEST_GEN_PROGS_x86 += x86/state_test TEST_GEN_PROGS_x86 += x86/vmx_preemption_timer_test TEST_GEN_PROGS_x86 += x86/svm_vmcall_test TEST_GEN_PROGS_x86 += x86/svm_int_ctl_test +TEST_GEN_PROGS_x86 += x86/svm_nested_clear_efer_svme TEST_GEN_PROGS_x86 += x86/svm_nested_invalid_vmcb12_gpa TEST_GEN_PROGS_x86 += x86/svm_nested_shutdown_test TEST_GEN_PROGS_x86 += x86/svm_nested_soft_inject_test diff --git a/tools/testing/selftests/kvm/x86/svm_nested_clear_efer_svme.c b/tools/testing/selftests/kvm/x86/svm_nested_clear_efer_svme.c new file mode 100644 index 000000000000..a521a9eed061 --- /dev/null +++ b/tools/testing/selftests/kvm/x86/svm_nested_clear_efer_svme.c @@ -0,0 +1,55 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2026, Google LLC. 
+ */ +#include "kvm_util.h" +#include "vmx.h" +#include "svm_util.h" +#include "kselftest.h" + + +#define L2_GUEST_STACK_SIZE 64 + +static void l2_guest_code(void) +{ + unsigned long efer = rdmsr(MSR_EFER); + + /* generic_svm_setup() initializes EFER_SVME set for L2 */ + GUEST_ASSERT(efer & EFER_SVME); + wrmsr(MSR_EFER, efer & ~EFER_SVME); + + /* Unreachable, L1 should be shutdown */ + GUEST_ASSERT(0); +} + +static void l1_guest_code(struct svm_test_data *svm) +{ + unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + + generic_svm_setup(svm, l2_guest_code, + &l2_guest_stack[L2_GUEST_STACK_SIZE]); + run_guest(svm->vmcb, svm->vmcb_gpa); + + /* Unreachable, L1 should be shutdown */ + GUEST_ASSERT(0); +} + +int main(int argc, char *argv[]) +{ + struct kvm_vcpu *vcpu; + struct kvm_vm *vm; + vm_vaddr_t nested_gva = 0; + + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); + + vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code); + + vcpu_alloc_svm(vm, &nested_gva); + vcpu_args_set(vcpu, 1, nested_gva); + + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SHUTDOWN); + + kvm_vm_free(vm); + return 0; +} From 9019e82c7e46c03c37e8b108473d02b543222d9f Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:05 +0000 Subject: [PATCH 113/373] KVM: arm64: Add PKVM_DISABLE_STAGE2_ON_PANIC On NVHE_EL2_DEBUG, when using pKVM, the host stage-2 is relaxed to grant the kernel access to the stacktrace, hypervisor bug table and text to symbolize addresses. This is unsafe for production. In preparation for adding more debug options to NVHE_EL2_DEBUG, decouple the stage-2 relaxation into a separate option. While at it, rename PROTECTED_NVHE_STACKTRACE into PKVM_STACKTRACE, following the same naming scheme as PKVM_DISABLE_STAGE2_ON_PANIC. 
Reviewed-by: Kalesh Singh Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-20-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/Kconfig | 62 +++++++++++++++++----------- arch/arm64/kvm/handle_exit.c | 2 +- arch/arm64/kvm/hyp/nvhe/host.S | 2 +- arch/arm64/kvm/hyp/nvhe/stacktrace.c | 6 +-- arch/arm64/kvm/stacktrace.c | 8 ++-- 5 files changed, 48 insertions(+), 32 deletions(-) diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 7d1f22fd490b..14b2d0b0b831 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -42,32 +42,10 @@ menuconfig KVM If unsure, say N. -config NVHE_EL2_DEBUG - bool "Debug mode for non-VHE EL2 object" - depends on KVM - help - Say Y here to enable the debug mode for the non-VHE KVM EL2 object. - Failure reports will BUG() in the hypervisor. This is intended for - local EL2 hypervisor development. - - If unsure, say N. - -config PROTECTED_NVHE_STACKTRACE - bool "Protected KVM hypervisor stacktraces" - depends on NVHE_EL2_DEBUG - default n - help - Say Y here to enable pKVM hypervisor stacktraces on hyp_panic() - - If using protected nVHE mode, but cannot afford the associated - memory cost (less than 0.75 page per CPU) of pKVM stacktraces, - say N. - - If unsure, or not using protected nVHE (pKVM), say N. +if KVM config PTDUMP_STAGE2_DEBUGFS bool "Present the stage-2 pagetables to debugfs" - depends on KVM depends on DEBUG_KERNEL depends on DEBUG_FS depends on ARCH_HAS_PTDUMP @@ -82,4 +60,42 @@ config PTDUMP_STAGE2_DEBUGFS If in doubt, say N. +config NVHE_EL2_DEBUG + bool "Debug mode for non-VHE EL2 object" + default n + help + Say Y here to enable the debug mode for the non-VHE KVM EL2 object. + Failure reports will BUG() in the hypervisor. This is intended for + local EL2 hypervisor development. + + If unsure, say N. 
+ +if NVHE_EL2_DEBUG + +config PKVM_DISABLE_STAGE2_ON_PANIC + bool "Disable the host stage-2 on panic" + default n + help + Relax the host stage-2 on hypervisor panic to allow the kernel to + unwind and symbolize the hypervisor stacktrace. This however tampers + with the system security. This is intended for local EL2 hypervisor + development. + + If unsure, say N. + +config PKVM_STACKTRACE + bool "Protected KVM hypervisor stacktraces" + depends on PKVM_DISABLE_STAGE2_ON_PANIC + default y + help + Say Y here to enable pKVM hypervisor stacktraces on hyp_panic() + + If using protected nVHE mode, but cannot afford the associated + memory cost (less than 0.75 page per CPU) of pKVM stacktraces, + say N. + + If unsure, or not using protected nVHE (pKVM), say N. + +endif # NVHE_EL2_DEBUG +endif # KVM endif # VIRTUALIZATION diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c index cc7d5d1709cb..54aedf93c78b 100644 --- a/arch/arm64/kvm/handle_exit.c +++ b/arch/arm64/kvm/handle_exit.c @@ -539,7 +539,7 @@ void __noreturn __cold nvhe_hyp_panic_handler(u64 esr, u64 spsr, /* All hyp bugs, including warnings, are treated as fatal.
*/ if (!is_protected_kvm_enabled() || - IS_ENABLED(CONFIG_NVHE_EL2_DEBUG)) { + IS_ENABLED(CONFIG_PKVM_DISABLE_STAGE2_ON_PANIC)) { struct bug_entry *bug = find_bug(elr_in_kimg); if (bug) diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S index eef15b374abb..3092befcd97c 100644 --- a/arch/arm64/kvm/hyp/nvhe/host.S +++ b/arch/arm64/kvm/hyp/nvhe/host.S @@ -120,7 +120,7 @@ SYM_FUNC_START(__hyp_do_panic) mov x29, x0 -#ifdef CONFIG_NVHE_EL2_DEBUG +#ifdef CONFIG_PKVM_DISABLE_STAGE2_ON_PANIC /* Ensure host stage-2 is disabled */ mrs x0, hcr_el2 bic x0, x0, #HCR_VM diff --git a/arch/arm64/kvm/hyp/nvhe/stacktrace.c b/arch/arm64/kvm/hyp/nvhe/stacktrace.c index 5b6eeab1a774..7c832d60d22b 100644 --- a/arch/arm64/kvm/hyp/nvhe/stacktrace.c +++ b/arch/arm64/kvm/hyp/nvhe/stacktrace.c @@ -34,7 +34,7 @@ static void hyp_prepare_backtrace(unsigned long fp, unsigned long pc) stacktrace_info->pc = pc; } -#ifdef CONFIG_PROTECTED_NVHE_STACKTRACE +#ifdef CONFIG_PKVM_STACKTRACE #include DEFINE_PER_CPU(unsigned long [NVHE_STACKTRACE_SIZE/sizeof(long)], pkvm_stacktrace); @@ -134,11 +134,11 @@ static void pkvm_save_backtrace(unsigned long fp, unsigned long pc) unwind(&state, pkvm_save_backtrace_entry, &idx); } -#else /* !CONFIG_PROTECTED_NVHE_STACKTRACE */ +#else /* !CONFIG_PKVM_STACKTRACE */ static void pkvm_save_backtrace(unsigned long fp, unsigned long pc) { } -#endif /* CONFIG_PROTECTED_NVHE_STACKTRACE */ +#endif /* CONFIG_PKVM_STACKTRACE */ /* * kvm_nvhe_prepare_backtrace - prepare to dump the nVHE backtrace diff --git a/arch/arm64/kvm/stacktrace.c b/arch/arm64/kvm/stacktrace.c index af5eec681127..9724c320126b 100644 --- a/arch/arm64/kvm/stacktrace.c +++ b/arch/arm64/kvm/stacktrace.c @@ -197,7 +197,7 @@ static void hyp_dump_backtrace(unsigned long hyp_offset) kvm_nvhe_dump_backtrace_end(); } -#ifdef CONFIG_PROTECTED_NVHE_STACKTRACE +#ifdef CONFIG_PKVM_STACKTRACE DECLARE_KVM_NVHE_PER_CPU(unsigned long [NVHE_STACKTRACE_SIZE/sizeof(long)], pkvm_stacktrace); @@ -225,12
+225,12 @@ static void pkvm_dump_backtrace(unsigned long hyp_offset) kvm_nvhe_dump_backtrace_entry((void *)hyp_offset, stacktrace[i]); kvm_nvhe_dump_backtrace_end(); } -#else /* !CONFIG_PROTECTED_NVHE_STACKTRACE */ +#else /* !CONFIG_PKVM_STACKTRACE */ static void pkvm_dump_backtrace(unsigned long hyp_offset) { - kvm_err("Cannot dump pKVM nVHE stacktrace: !CONFIG_PROTECTED_NVHE_STACKTRACE\n"); + kvm_err("Cannot dump pKVM nVHE stacktrace: !CONFIG_PKVM_STACKTRACE\n"); } -#endif /* CONFIG_PROTECTED_NVHE_STACKTRACE */ +#endif /* CONFIG_PKVM_STACKTRACE */ /* * kvm_nvhe_dump_backtrace - Dump KVM nVHE hypervisor backtrace. From 405df5b55748d93d44666fd6005c60981094a077 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:06 +0000 Subject: [PATCH 114/373] KVM: arm64: Add clock support to nVHE/pKVM hyp In preparation for supporting tracing from the nVHE hyp, add support to generate timestamps with a clock fed by the CNTVCT counter. The clock can be kept in sync with the kernel's by updating the slope values. This will be done later. As we currently only create a trace clock, make the whole support dependent on the upcoming CONFIG_NVHE_EL2_TRACING.
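The slope scheme the commit message describes can be sketched in a few lines. The following is a hypothetical, self-contained illustration (not the hypervisor's actual clock.c, which follows in the patch): timestamps are derived as ns = epoch_ns + ((cyc - epoch_cyc) * mult) >> shift, and the slope is re-synced by writing the bank readers are not using before flipping the index. All names here are illustrative.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of a banked mult/shift trace clock. The real
 * code publishes 'cur' with smp_store_release() and reads it with
 * smp_load_acquire(); plain accesses are used here for brevity.
 */
struct trace_clock {
	struct {
		uint32_t mult, shift;
		uint64_t epoch_ns, epoch_cyc;
	} bank[2];
	uint64_t cur;
};

static void clock_update(struct trace_clock *c, uint32_t mult, uint32_t shift,
			 uint64_t epoch_ns, uint64_t epoch_cyc)
{
	uint64_t idx = c->cur ^ 1;	/* write the bank readers don't use */

	c->bank[idx].mult = mult;
	c->bank[idx].shift = shift;
	c->bank[idx].epoch_ns = epoch_ns;
	c->bank[idx].epoch_cyc = epoch_cyc;
	c->cur = idx;			/* publish the new slope */
}

static uint64_t clock_read(const struct trace_clock *c, uint64_t cyc)
{
	uint64_t idx = c->cur;
	uint64_t delta = cyc - c->bank[idx].epoch_cyc;

	/* Take a 128-bit path when delta * mult would overflow 64 bits. */
	if (delta >= UINT64_MAX / c->bank[idx].mult)
		return c->bank[idx].epoch_ns +
		       (uint64_t)(((unsigned __int128)delta * c->bank[idx].mult) >>
				  c->bank[idx].shift);

	return c->bank[idx].epoch_ns +
	       ((delta * c->bank[idx].mult) >> c->bank[idx].shift);
}
```

With mult = 1 << shift the clock ticks one nanosecond per cycle; a real slope for, say, a 24 MHz arch counter would be computed with the kernel's clocks_calc_mult_shift() helper.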
Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-21-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/include/nvhe/clock.h | 16 ++++++ arch/arm64/kvm/hyp/nvhe/Makefile | 3 +- arch/arm64/kvm/hyp/nvhe/clock.c | 65 +++++++++++++++++++++++++ 3 files changed, 83 insertions(+), 1 deletion(-) create mode 100644 arch/arm64/kvm/hyp/include/nvhe/clock.h create mode 100644 arch/arm64/kvm/hyp/nvhe/clock.c diff --git a/arch/arm64/kvm/hyp/include/nvhe/clock.h b/arch/arm64/kvm/hyp/include/nvhe/clock.h new file mode 100644 index 000000000000..9f429f5c0664 --- /dev/null +++ b/arch/arm64/kvm/hyp/include/nvhe/clock.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ARM64_KVM_HYP_NVHE_CLOCK_H +#define __ARM64_KVM_HYP_NVHE_CLOCK_H +#include + +#include + +#ifdef CONFIG_NVHE_EL2_TRACING +void trace_clock_update(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc); +u64 trace_clock(void); +#else +static inline void +trace_clock_update(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) { } +static inline u64 trace_clock(void) { return 0; } +#endif +#endif diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile index a244ec25f8c5..8dc95257c291 100644 --- a/arch/arm64/kvm/hyp/nvhe/Makefile +++ b/arch/arm64/kvm/hyp/nvhe/Makefile @@ -17,7 +17,7 @@ ccflags-y += -fno-stack-protector \ hostprogs := gen-hyprel HOST_EXTRACFLAGS += -I$(objtree)/include -lib-objs := clear_page.o copy_page.o memcpy.o memset.o +lib-objs := clear_page.o copy_page.o memcpy.o memset.o tishift.o lib-objs := $(addprefix ../../../lib/, $(lib-objs)) CFLAGS_switch.nvhe.o += -Wno-override-init @@ -29,6 +29,7 @@ hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \ ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o hyp-obj-y += ../../../kernel/smccc-call.o hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o +hyp-obj-$(CONFIG_NVHE_EL2_TRACING) += clock.o hyp-obj-y += $(lib-objs) ## diff --git 
a/arch/arm64/kvm/hyp/nvhe/clock.c b/arch/arm64/kvm/hyp/nvhe/clock.c new file mode 100644 index 000000000000..32fc4313fe43 --- /dev/null +++ b/arch/arm64/kvm/hyp/nvhe/clock.c @@ -0,0 +1,65 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2025 Google LLC + * Author: Vincent Donnefort + */ + +#include + +#include +#include + +static struct clock_data { + struct { + u32 mult; + u32 shift; + u64 epoch_ns; + u64 epoch_cyc; + u64 cyc_overflow64; + } data[2]; + u64 cur; +} trace_clock_data; + +static u64 __clock_mult_uint128(u64 cyc, u32 mult, u32 shift) +{ + __uint128_t ns = (__uint128_t)cyc * mult; + + ns >>= shift; + + return (u64)ns; +} + +/* Does not guarantee no reader on the modified bank. */ +void trace_clock_update(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) +{ + struct clock_data *clock = &trace_clock_data; + u64 bank = clock->cur ^ 1; + + clock->data[bank].mult = mult; + clock->data[bank].shift = shift; + clock->data[bank].epoch_ns = epoch_ns; + clock->data[bank].epoch_cyc = epoch_cyc; + clock->data[bank].cyc_overflow64 = ULONG_MAX / mult; + + smp_store_release(&clock->cur, bank); +} + +/* Use untrusted host data */ +u64 trace_clock(void) +{ + struct clock_data *clock = &trace_clock_data; + u64 bank = smp_load_acquire(&clock->cur); + u64 cyc, ns; + + cyc = __arch_counter_get_cntvct() - clock->data[bank].epoch_cyc; + + if (likely(cyc < clock->data[bank].cyc_overflow64)) { + ns = cyc * clock->data[bank].mult; + ns >>= clock->data[bank].shift; + } else { + ns = __clock_mult_uint128(cyc, clock->data[bank].mult, + clock->data[bank].shift); + } + + return (u64)ns + clock->data[bank].epoch_ns; +} From 8bbeb4d1698fb0945107c69019c15207bbe605c9 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:07 +0000 Subject: [PATCH 115/373] KVM: arm64: Initialise hyp_nr_cpus for nVHE hyp Knowing the number of CPUs is necessary for determining the boundaries of per-cpu variables, which will be used for upcoming hypervisor tracing. 
hyp_nr_cpus, which stores this value, is only initialised for the pKVM hypervisor. Make it accessible for the nVHE hypervisor as well. With the kernel now responsible for initialising hyp_nr_cpus, the nr_cpus parameter is no longer needed in __pkvm_init. Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-22-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_hyp.h | 4 ++-- arch/arm64/kvm/arm.c | 5 ++++- arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 -- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 8 +++----- arch/arm64/kvm/hyp/nvhe/setup.c | 4 +--- 5 files changed, 10 insertions(+), 13 deletions(-) diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h index 76ce2b94bd97..4bf63025061e 100644 --- a/arch/arm64/include/asm/kvm_hyp.h +++ b/arch/arm64/include/asm/kvm_hyp.h @@ -129,8 +129,7 @@ void __noreturn __hyp_do_panic(struct kvm_cpu_context *host_ctxt, u64 spsr, #ifdef __KVM_NVHE_HYPERVISOR__ void __pkvm_init_switch_pgd(phys_addr_t pgd, unsigned long sp, void (*fn)(void)); -int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus, - unsigned long *per_cpu_base, u32 hyp_va_bits); +int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long *per_cpu_base, u32 hyp_va_bits); void __noreturn __host_enter(struct kvm_cpu_context *host_ctxt); #endif @@ -147,5 +146,6 @@ extern u64 kvm_nvhe_sym(id_aa64smfr0_el1_sys_val); extern unsigned long kvm_nvhe_sym(__icache_flags); extern unsigned int kvm_nvhe_sym(kvm_arm_vmid_bits); extern unsigned int kvm_nvhe_sym(kvm_host_sve_max_vl); +extern unsigned long kvm_nvhe_sym(hyp_nr_cpus); #endif /* __ARM64_KVM_HYP_H__ */ diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 410ffd41fd73..1371f9b3ecea 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -2465,7 +2466,7 @@ static int __init do_pkvm_init(u32 hyp_va_bits)
preempt_disable(); cpu_hyp_init_context(); ret = kvm_call_hyp_nvhe(__pkvm_init, hyp_mem_base, hyp_mem_size, - num_possible_cpus(), kern_hyp_va(per_cpu_base), + kern_hyp_va(per_cpu_base), hyp_va_bits); cpu_hyp_init_features(); @@ -2674,6 +2675,8 @@ static int __init init_hyp_mode(void) kvm_nvhe_sym(kvm_arm_hyp_percpu_base)[cpu] = (unsigned long)page_addr; } + kvm_nvhe_sym(hyp_nr_cpus) = num_possible_cpus(); + /* * Map the Hyp-code called directly from the host */ diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h index 5f9d56754e39..f8a7b8c04c49 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h @@ -30,8 +30,6 @@ enum pkvm_component_id { PKVM_ID_FFA, }; -extern unsigned long hyp_nr_cpus; - int __pkvm_prot_finalize(void); int __pkvm_host_share_hyp(u64 pfn); int __pkvm_host_unshare_hyp(u64 pfn); diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index e7790097db93..8ae3c348ed81 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -486,17 +486,15 @@ static void handle___pkvm_init(struct kvm_cpu_context *host_ctxt) { DECLARE_REG(phys_addr_t, phys, host_ctxt, 1); DECLARE_REG(unsigned long, size, host_ctxt, 2); - DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3); - DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4); - DECLARE_REG(u32, hyp_va_bits, host_ctxt, 5); + DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 3); + DECLARE_REG(u32, hyp_va_bits, host_ctxt, 4); /* * __pkvm_init() will return only if an error occurred, otherwise it * will tail-call in __pkvm_init_finalise() which will have to deal * with the host context directly. 
*/ - cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base, - hyp_va_bits); + cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, per_cpu_base, hyp_va_bits); } static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt) diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c index 90bd014e952f..d8e5b563fd3d 100644 --- a/arch/arm64/kvm/hyp/nvhe/setup.c +++ b/arch/arm64/kvm/hyp/nvhe/setup.c @@ -341,8 +341,7 @@ out: __host_enter(host_ctxt); } -int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus, - unsigned long *per_cpu_base, u32 hyp_va_bits) +int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long *per_cpu_base, u32 hyp_va_bits) { struct kvm_nvhe_init_params *params; void *virt = hyp_phys_to_virt(phys); @@ -355,7 +354,6 @@ int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus, return -EINVAL; hyp_spin_lock_init(&pkvm_pgd_lock); - hyp_nr_cpus = nr_cpus; ret = divide_memory_pool(virt, size); if (ret) From 4cdf8dec8c115d9531f80519db69846c32c580a5 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:08 +0000 Subject: [PATCH 116/373] KVM: arm64: Support unaligned fixmap in the pKVM hyp Return the fixmap VA with the page offset, instead of the page base address. This allows hyp_fixmap_map() to be used seamlessly regardless of the address alignment. While at it, do the same for hyp_fixblock_map().
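The effect of returning the page offset alongside the slot VA can be modelled in a few lines. This is a hypothetical stand-in (fixmap_map() and the fixed PAGE_SIZE are illustrative, not the hypervisor's actual code): by adding offset_in_page(phys) to the slot's base VA, a caller passing an unaligned physical address gets back a pointer to the exact byte rather than to the page base.

```c
#include <stdint.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(PAGE_SIZE - 1))

/* Stand-in for the kernel's offset_in_page() helper. */
static unsigned long offset_in_page(uint64_t addr)
{
	return addr & ~PAGE_MASK;
}

/*
 * Hypothetical model of the change: a fixmap slot maps the page
 * containing @phys at @slot_va. The real code installs a PTE for
 * (phys & PAGE_MASK); adding the page offset to the returned VA is
 * what lets callers skip aligning @phys themselves.
 */
static uint64_t fixmap_map(uint64_t slot_va, uint64_t phys)
{
	return slot_va + offset_in_page(phys);
}
```

A page-aligned physical address still maps to the slot base, so existing callers are unaffected.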
Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-23-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/mm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c index 218976287d3f..c98cec0c150f 100644 --- a/arch/arm64/kvm/hyp/nvhe/mm.c +++ b/arch/arm64/kvm/hyp/nvhe/mm.c @@ -244,7 +244,7 @@ static void *fixmap_map_slot(struct hyp_fixmap_slot *slot, phys_addr_t phys) void *hyp_fixmap_map(phys_addr_t phys) { - return fixmap_map_slot(this_cpu_ptr(&fixmap_slots), phys); + return fixmap_map_slot(this_cpu_ptr(&fixmap_slots), phys) + offset_in_page(phys); } static void fixmap_clear_slot(struct hyp_fixmap_slot *slot) @@ -366,7 +366,7 @@ void *hyp_fixblock_map(phys_addr_t phys, size_t *size) #ifdef HAS_FIXBLOCK *size = PMD_SIZE; hyp_spin_lock(&hyp_fixblock_lock); - return fixmap_map_slot(&hyp_fixblock_slot, phys); + return fixmap_map_slot(&hyp_fixblock_slot, phys) + offset_in_page(phys); #else *size = PAGE_SIZE; return hyp_fixmap_map(phys); From 680a04c333fa29bf92007efe11431be005e8c4bb Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:09 +0000 Subject: [PATCH 117/373] KVM: arm64: Add tracing capability for the nVHE/pKVM hyp There is currently no way to inspect or log what's happening at EL2 when the nVHE or pKVM hypervisor is used. With the growing set of features for pKVM, the need for tooling is more pressing. And tracefs, with its reliability, versatility and support for user-space, is fit for purpose. Add support to write into a tracefs-compatible ring-buffer. There's no way the hypervisor could log events directly into the host tracefs ring-buffers. So instead let's use our own, where the hypervisor is the writer and the host the reader.
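The writer-side API this patch introduces, tracing_reserve_entry() followed by tracing_commit_entry(), follows the classic reserve/commit pattern: space is claimed first, the entry is filled in place, and only the commit makes it visible to the reader. A minimal single-producer sketch of the idea (not the actual simple_ring_buffer implementation; names and sizes are illustrative):

```c
#include <stddef.h>

/*
 * Minimal single-producer sketch of the reserve/commit pattern: the
 * writer reserves room, fills the entry in place, then commits to make
 * it visible to the reader. No wrapping or per-CPU handling here;
 * illustration only.
 */
#define RB_SIZE 256

struct rb {
	unsigned char data[RB_SIZE];
	size_t committed;	/* bytes the reader may consume */
	size_t reserved;	/* bytes handed out, not yet visible */
};

static void *rb_reserve(struct rb *rb, size_t len)
{
	if (rb->committed + rb->reserved + len > RB_SIZE)
		return NULL;	/* full: a real buffer wraps or overwrites */

	rb->reserved += len;
	return &rb->data[rb->committed + rb->reserved - len];
}

static void rb_commit(struct rb *rb)
{
	/* The real code publishes with appropriate memory ordering. */
	rb->committed += rb->reserved;
	rb->reserved = 0;
}
```

The value of the split is visible in the hypervisor code later in this patch: an event can be assembled directly in the buffer with no intermediate copy, and a reader never observes a half-written entry.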
Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-24-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 4 + arch/arm64/include/asm/kvm_hyptrace.h | 13 ++ arch/arm64/kvm/Kconfig | 5 + arch/arm64/kvm/hyp/include/nvhe/trace.h | 23 ++ arch/arm64/kvm/hyp/nvhe/Makefile | 5 +- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 32 +++ arch/arm64/kvm/hyp/nvhe/trace.c | 273 ++++++++++++++++++++++++ 7 files changed, 354 insertions(+), 1 deletion(-) create mode 100644 arch/arm64/include/asm/kvm_hyptrace.h create mode 100644 arch/arm64/kvm/hyp/include/nvhe/trace.h create mode 100644 arch/arm64/kvm/hyp/nvhe/trace.c diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index a1ad12c72ebf..d46d9196d661 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -89,6 +89,10 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_load, __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_put, __KVM_HOST_SMCCC_FUNC___pkvm_tlb_flush_vmid, + __KVM_HOST_SMCCC_FUNC___tracing_load, + __KVM_HOST_SMCCC_FUNC___tracing_unload, + __KVM_HOST_SMCCC_FUNC___tracing_enable, + __KVM_HOST_SMCCC_FUNC___tracing_swap_reader, }; #define DECLARE_KVM_VHE_SYM(sym) extern char sym[] diff --git a/arch/arm64/include/asm/kvm_hyptrace.h b/arch/arm64/include/asm/kvm_hyptrace.h new file mode 100644 index 000000000000..9c30a479bc36 --- /dev/null +++ b/arch/arm64/include/asm/kvm_hyptrace.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef __ARM64_KVM_HYPTRACE_H_ +#define __ARM64_KVM_HYPTRACE_H_ + +#include + +struct hyp_trace_desc { + unsigned long bpages_backing_start; + size_t bpages_backing_size; + struct trace_buffer_desc trace_buffer_desc; + +}; +#endif diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 14b2d0b0b831..d215348370fa 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -72,6 +72,11 @@ config NVHE_EL2_DEBUG if NVHE_EL2_DEBUG +config 
NVHE_EL2_TRACING + bool + depends on TRACING + default y + config PKVM_DISABLE_STAGE2_ON_PANIC bool "Disable the host stage-2 on panic" default n diff --git a/arch/arm64/kvm/hyp/include/nvhe/trace.h b/arch/arm64/kvm/hyp/include/nvhe/trace.h new file mode 100644 index 000000000000..7da8788ce527 --- /dev/null +++ b/arch/arm64/kvm/hyp/include/nvhe/trace.h @@ -0,0 +1,23 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef __ARM64_KVM_HYP_NVHE_TRACE_H +#define __ARM64_KVM_HYP_NVHE_TRACE_H +#include + +#ifdef CONFIG_NVHE_EL2_TRACING +void *tracing_reserve_entry(unsigned long length); +void tracing_commit_entry(void); + +int __tracing_load(unsigned long desc_va, size_t desc_size); +void __tracing_unload(void); +int __tracing_enable(bool enable); +int __tracing_swap_reader(unsigned int cpu); +#else +static inline void *tracing_reserve_entry(unsigned long length) { return NULL; } +static inline void tracing_commit_entry(void) { } + +static inline int __tracing_load(unsigned long desc_va, size_t desc_size) { return -ENODEV; } +static inline void __tracing_unload(void) { } +static inline int __tracing_enable(bool enable) { return -ENODEV; } +static inline int __tracing_swap_reader(unsigned int cpu) { return -ENODEV; } +#endif +#endif diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile index 8dc95257c291..f1840628d2d6 100644 --- a/arch/arm64/kvm/hyp/nvhe/Makefile +++ b/arch/arm64/kvm/hyp/nvhe/Makefile @@ -29,9 +29,12 @@ hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \ ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o hyp-obj-y += ../../../kernel/smccc-call.o hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o -hyp-obj-$(CONFIG_NVHE_EL2_TRACING) += clock.o +hyp-obj-$(CONFIG_NVHE_EL2_TRACING) += clock.o trace.o hyp-obj-y += $(lib-objs) +# Path to simple_ring_buffer.c +CFLAGS_trace.nvhe.o += -I$(objtree)/kernel/trace/ + ## ## Build rules for compiling nVHE hyp code ## Output of this folder is `kvm_nvhe.o`, a 
partially linked object diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index 8ae3c348ed81..ae04204ea81f 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -18,6 +18,7 @@ #include #include #include +#include #include DEFINE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params); @@ -587,6 +588,33 @@ static void handle___pkvm_teardown_vm(struct kvm_cpu_context *host_ctxt) cpu_reg(host_ctxt, 1) = __pkvm_teardown_vm(handle); } +static void handle___tracing_load(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(unsigned long, desc_hva, host_ctxt, 1); + DECLARE_REG(size_t, desc_size, host_ctxt, 2); + + cpu_reg(host_ctxt, 1) = __tracing_load(desc_hva, desc_size); +} + +static void handle___tracing_unload(struct kvm_cpu_context *host_ctxt) +{ + __tracing_unload(); +} + +static void handle___tracing_enable(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(bool, enable, host_ctxt, 1); + + cpu_reg(host_ctxt, 1) = __tracing_enable(enable); +} + +static void handle___tracing_swap_reader(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(unsigned int, cpu, host_ctxt, 1); + + cpu_reg(host_ctxt, 1) = __tracing_swap_reader(cpu); +} + typedef void (*hcall_t)(struct kvm_cpu_context *); #define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x @@ -628,6 +656,10 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__pkvm_vcpu_load), HANDLE_FUNC(__pkvm_vcpu_put), HANDLE_FUNC(__pkvm_tlb_flush_vmid), + HANDLE_FUNC(__tracing_load), + HANDLE_FUNC(__tracing_unload), + HANDLE_FUNC(__tracing_enable), + HANDLE_FUNC(__tracing_swap_reader), }; static void handle_host_hcall(struct kvm_cpu_context *host_ctxt) diff --git a/arch/arm64/kvm/hyp/nvhe/trace.c b/arch/arm64/kvm/hyp/nvhe/trace.c new file mode 100644 index 000000000000..282cba70ce9b --- /dev/null +++ b/arch/arm64/kvm/hyp/nvhe/trace.c @@ -0,0 +1,273 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025 Google LLC + * Author: 
Vincent Donnefort + */ + +#include +#include +#include +#include + +#include +#include +#include + +#include "simple_ring_buffer.c" + +static DEFINE_PER_CPU(struct simple_rb_per_cpu, __simple_rbs); + +static struct hyp_trace_buffer { + struct simple_rb_per_cpu __percpu *simple_rbs; + void *bpages_backing_start; + size_t bpages_backing_size; + hyp_spinlock_t lock; +} trace_buffer = { + .simple_rbs = &__simple_rbs, + .lock = __HYP_SPIN_LOCK_UNLOCKED, +}; + +static bool hyp_trace_buffer_loaded(struct hyp_trace_buffer *trace_buffer) +{ + return trace_buffer->bpages_backing_size > 0; +} + +void *tracing_reserve_entry(unsigned long length) +{ + return simple_ring_buffer_reserve(this_cpu_ptr(trace_buffer.simple_rbs), length, + trace_clock()); +} + +void tracing_commit_entry(void) +{ + simple_ring_buffer_commit(this_cpu_ptr(trace_buffer.simple_rbs)); +} + +static int __admit_host_mem(void *start, u64 size) +{ + if (!PAGE_ALIGNED(start) || !PAGE_ALIGNED(size) || !size) + return -EINVAL; + + if (!is_protected_kvm_enabled()) + return 0; + + return __pkvm_host_donate_hyp(hyp_virt_to_pfn(start), size >> PAGE_SHIFT); +} + +static void __release_host_mem(void *start, u64 size) +{ + if (!is_protected_kvm_enabled()) + return; + + WARN_ON(__pkvm_hyp_donate_host(hyp_virt_to_pfn(start), size >> PAGE_SHIFT)); +} + +static int hyp_trace_buffer_load_bpage_backing(struct hyp_trace_buffer *trace_buffer, + struct hyp_trace_desc *desc) +{ + void *start = (void *)kern_hyp_va(desc->bpages_backing_start); + size_t size = desc->bpages_backing_size; + int ret; + + ret = __admit_host_mem(start, size); + if (ret) + return ret; + + memset(start, 0, size); + + trace_buffer->bpages_backing_start = start; + trace_buffer->bpages_backing_size = size; + + return 0; +} + +static void hyp_trace_buffer_unload_bpage_backing(struct hyp_trace_buffer *trace_buffer) +{ + void *start = trace_buffer->bpages_backing_start; + size_t size = trace_buffer->bpages_backing_size; + + if (!size) + return; + + memset(start, 
0, size); + + __release_host_mem(start, size); + + trace_buffer->bpages_backing_start = 0; + trace_buffer->bpages_backing_size = 0; +} + +static void *__pin_shared_page(unsigned long kern_va) +{ + void *va = kern_hyp_va((void *)kern_va); + + if (!is_protected_kvm_enabled()) + return va; + + return hyp_pin_shared_mem(va, va + PAGE_SIZE) ? NULL : va; +} + +static void __unpin_shared_page(void *va) +{ + if (!is_protected_kvm_enabled()) + return; + + hyp_unpin_shared_mem(va, va + PAGE_SIZE); +} + +static void hyp_trace_buffer_unload(struct hyp_trace_buffer *trace_buffer) +{ + int cpu; + + hyp_assert_lock_held(&trace_buffer->lock); + + if (!hyp_trace_buffer_loaded(trace_buffer)) + return; + + for (cpu = 0; cpu < hyp_nr_cpus; cpu++) + simple_ring_buffer_unload_mm(per_cpu_ptr(trace_buffer->simple_rbs, cpu), + __unpin_shared_page); + + hyp_trace_buffer_unload_bpage_backing(trace_buffer); +} + +static int hyp_trace_buffer_load(struct hyp_trace_buffer *trace_buffer, + struct hyp_trace_desc *desc) +{ + struct simple_buffer_page *bpages; + struct ring_buffer_desc *rb_desc; + int ret, cpu; + + hyp_assert_lock_held(&trace_buffer->lock); + + if (hyp_trace_buffer_loaded(trace_buffer)) + return -EINVAL; + + ret = hyp_trace_buffer_load_bpage_backing(trace_buffer, desc); + if (ret) + return ret; + + bpages = trace_buffer->bpages_backing_start; + for_each_ring_buffer_desc(rb_desc, cpu, &desc->trace_buffer_desc) { + ret = simple_ring_buffer_init_mm(per_cpu_ptr(trace_buffer->simple_rbs, cpu), + bpages, rb_desc, __pin_shared_page, + __unpin_shared_page); + if (ret) + break; + + bpages += rb_desc->nr_page_va; + } + + if (ret) + hyp_trace_buffer_unload(trace_buffer); + + return ret; +} + +static bool hyp_trace_desc_validate(struct hyp_trace_desc *desc, size_t desc_size) +{ + struct ring_buffer_desc *rb_desc; + unsigned int cpu; + size_t nr_bpages; + void *desc_end; + + /* + * Both desc_size and bpages_backing_size are untrusted host-provided + * values. 
We rely on __pkvm_host_donate_hyp() to enforce their validity. + */ + desc_end = (void *)desc + desc_size; + nr_bpages = desc->bpages_backing_size / sizeof(struct simple_buffer_page); + + for_each_ring_buffer_desc(rb_desc, cpu, &desc->trace_buffer_desc) { + /* Can we read nr_page_va? */ + if ((void *)rb_desc + struct_size(rb_desc, page_va, 0) > desc_end) + return false; + + /* Overflow desc? */ + if ((void *)rb_desc + struct_size(rb_desc, page_va, rb_desc->nr_page_va) > desc_end) + return false; + + /* Overflow bpages backing memory? */ + if (nr_bpages < rb_desc->nr_page_va) + return false; + + if (cpu >= hyp_nr_cpus) + return false; + + if (cpu != rb_desc->cpu) + return false; + + nr_bpages -= rb_desc->nr_page_va; + } + + return true; +} + +int __tracing_load(unsigned long desc_hva, size_t desc_size) +{ + struct hyp_trace_desc *desc = (struct hyp_trace_desc *)kern_hyp_va(desc_hva); + int ret; + + ret = __admit_host_mem(desc, desc_size); + if (ret) + return ret; + + if (!hyp_trace_desc_validate(desc, desc_size)) + goto err_release_desc; + + hyp_spin_lock(&trace_buffer.lock); + + ret = hyp_trace_buffer_load(&trace_buffer, desc); + + hyp_spin_unlock(&trace_buffer.lock); + +err_release_desc: + __release_host_mem(desc, desc_size); + return ret; +} + +void __tracing_unload(void) +{ + hyp_spin_lock(&trace_buffer.lock); + hyp_trace_buffer_unload(&trace_buffer); + hyp_spin_unlock(&trace_buffer.lock); +} + +int __tracing_enable(bool enable) +{ + int cpu, ret = enable ? 
-EINVAL : 0; + + hyp_spin_lock(&trace_buffer.lock); + + if (!hyp_trace_buffer_loaded(&trace_buffer)) + goto unlock; + + for (cpu = 0; cpu < hyp_nr_cpus; cpu++) + simple_ring_buffer_enable_tracing(per_cpu_ptr(trace_buffer.simple_rbs, cpu), + enable); + + ret = 0; + +unlock: + hyp_spin_unlock(&trace_buffer.lock); + + return ret; +} + +int __tracing_swap_reader(unsigned int cpu) +{ + int ret = -ENODEV; + + if (cpu >= hyp_nr_cpus) + return -EINVAL; + + hyp_spin_lock(&trace_buffer.lock); + + if (hyp_trace_buffer_loaded(&trace_buffer)) + ret = simple_ring_buffer_swap_reader_page( + per_cpu_ptr(trace_buffer.simple_rbs, cpu)); + + hyp_spin_unlock(&trace_buffer.lock); + + return ret; +} From 3aed038aac8d897016a2b6e1935f16c7640918d4 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:10 +0000 Subject: [PATCH 118/373] KVM: arm64: Add trace remote for the nVHE/pKVM hyp In both protected and nVHE mode, the hypervisor is capable of writing events into tracefs compatible ring-buffers. Create a trace remote so the kernel can read those buffers. This currently doesn't provide any event support; that will come later.
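The hyp_trace_desc_validate() function shown earlier in this series illustrates a general pattern: never trust a length field inside a host-supplied buffer until the header containing it has itself been bounds-checked. A hypothetical, self-contained sketch of that two-step check on a flexible-array record (struct rec and rec_valid() are illustrative names, not the hypervisor's actual types):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of validating an untrusted descriptor: first verify the fixed
 * header fits in the buffer (so the 'nr' field is readable at all),
 * then verify the payload 'nr' claims does not overflow the buffer.
 */
struct rec {
	uint32_t nr;		/* untrusted element count */
	uint64_t elem[];	/* flexible array of nr elements */
};

static int rec_valid(const void *buf, size_t buf_size)
{
	const struct rec *r = buf;
	const char *end = (const char *)buf + buf_size;

	/* Can we read the header (and thus 'nr') at all? */
	if ((const char *)r + sizeof(*r) > end)
		return 0;

	/* Does the claimed payload overflow the buffer? */
	if ((const char *)r + sizeof(*r) + (size_t)r->nr * sizeof(r->elem[0]) > end)
		return 0;

	return 1;
}
```

The real validation in trace.c additionally cross-checks each per-CPU descriptor against hyp_nr_cpus and against the backing-page budget donated by the host.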
Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-25-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/Kconfig | 1 + arch/arm64/kvm/Makefile | 2 + arch/arm64/kvm/arm.c | 5 + arch/arm64/kvm/hyp_trace.c | 219 +++++++++++++++++++++++++++++++++++++ arch/arm64/kvm/hyp_trace.h | 11 ++ 5 files changed, 238 insertions(+) create mode 100644 arch/arm64/kvm/hyp_trace.c create mode 100644 arch/arm64/kvm/hyp_trace.h diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index d215348370fa..17edfe3ae615 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -75,6 +75,7 @@ if NVHE_EL2_DEBUG config NVHE_EL2_TRACING bool depends on TRACING + select TRACE_REMOTE default y config PKVM_DISABLE_STAGE2_ON_PANIC diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile index 3ebc0570345c..59612d2f277c 100644 --- a/arch/arm64/kvm/Makefile +++ b/arch/arm64/kvm/Makefile @@ -30,6 +30,8 @@ kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o kvm-$(CONFIG_PTDUMP_STAGE2_DEBUGFS) += ptdump.o +kvm-$(CONFIG_NVHE_EL2_TRACING) += hyp_trace.o + always-y := hyp_constants.h hyp-constants.s define rule_gen_hyp_constants diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 1371f9b3ecea..87a3d28d5f0f 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -24,6 +24,7 @@ #define CREATE_TRACE_POINTS #include "trace_arm.h" +#include "hyp_trace.h" #include #include @@ -2415,6 +2416,10 @@ static int __init init_subsystems(void) kvm_register_perf_callbacks(); + err = kvm_hyp_trace_init(); + if (err) + kvm_err("Failed to initialize Hyp tracing\n"); + out: if (err) hyp_cpu_pm_exit(); diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c new file mode 100644 index 000000000000..2866effe28ec --- /dev/null +++ b/arch/arm64/kvm/hyp_trace.c @@ -0,0 +1,219 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025 Google LLC + * Author: Vincent Donnefort + */ 
+ +#include +#include + +#include +#include +#include + +#include "hyp_trace.h" + +/* Access to this struct within the trace_remote_callbacks is protected by the trace_remote lock */ +static struct hyp_trace_buffer { + struct hyp_trace_desc *desc; + size_t desc_size; +} trace_buffer; + +static int __map_hyp(void *start, size_t size) +{ + if (is_protected_kvm_enabled()) + return 0; + + return create_hyp_mappings(start, start + size, PAGE_HYP); +} + +static int __share_page(unsigned long va) +{ + return kvm_share_hyp((void *)va, (void *)va + 1); +} + +static void __unshare_page(unsigned long va) +{ + kvm_unshare_hyp((void *)va, (void *)va + 1); +} + +static int hyp_trace_buffer_alloc_bpages_backing(struct hyp_trace_buffer *trace_buffer, size_t size) +{ + int nr_bpages = (PAGE_ALIGN(size) / PAGE_SIZE) + 1; + size_t backing_size; + void *start; + + backing_size = PAGE_ALIGN(sizeof(struct simple_buffer_page) * nr_bpages * + num_possible_cpus()); + + start = alloc_pages_exact(backing_size, GFP_KERNEL_ACCOUNT); + if (!start) + return -ENOMEM; + + trace_buffer->desc->bpages_backing_start = (unsigned long)start; + trace_buffer->desc->bpages_backing_size = backing_size; + + return __map_hyp(start, backing_size); +} + +static void hyp_trace_buffer_free_bpages_backing(struct hyp_trace_buffer *trace_buffer) +{ + free_pages_exact((void *)trace_buffer->desc->bpages_backing_start, + trace_buffer->desc->bpages_backing_size); +} + +static void hyp_trace_buffer_unshare_hyp(struct hyp_trace_buffer *trace_buffer, int last_cpu) +{ + struct ring_buffer_desc *rb_desc; + int cpu, p; + + for_each_ring_buffer_desc(rb_desc, cpu, &trace_buffer->desc->trace_buffer_desc) { + if (cpu > last_cpu) + break; + + __unshare_page(rb_desc->meta_va); + for (p = 0; p < rb_desc->nr_page_va; p++) + __unshare_page(rb_desc->page_va[p]); + } +} + +static int hyp_trace_buffer_share_hyp(struct hyp_trace_buffer *trace_buffer) +{ + struct ring_buffer_desc *rb_desc; + int cpu, p, ret = 0; +
for_each_ring_buffer_desc(rb_desc, cpu, &trace_buffer->desc->trace_buffer_desc) { + ret = __share_page(rb_desc->meta_va); + if (ret) + break; + + for (p = 0; p < rb_desc->nr_page_va; p++) { + ret = __share_page(rb_desc->page_va[p]); + if (ret) + break; + } + + if (ret) { + for (p--; p >= 0; p--) + __unshare_page(rb_desc->page_va[p]); + break; + } + } + + if (ret) + hyp_trace_buffer_unshare_hyp(trace_buffer, cpu--); + + return ret; +} + +static struct trace_buffer_desc *hyp_trace_load(unsigned long size, void *priv) +{ + struct hyp_trace_buffer *trace_buffer = priv; + struct hyp_trace_desc *desc; + size_t desc_size; + int ret; + + if (WARN_ON(trace_buffer->desc)) + return ERR_PTR(-EINVAL); + + desc_size = trace_buffer_desc_size(size, num_possible_cpus()); + if (desc_size == SIZE_MAX) + return ERR_PTR(-E2BIG); + + desc_size = PAGE_ALIGN(desc_size); + desc = (struct hyp_trace_desc *)alloc_pages_exact(desc_size, GFP_KERNEL); + if (!desc) + return ERR_PTR(-ENOMEM); + + ret = __map_hyp(desc, desc_size); + if (ret) + goto err_free_desc; + + trace_buffer->desc = desc; + + ret = hyp_trace_buffer_alloc_bpages_backing(trace_buffer, size); + if (ret) + goto err_free_desc; + + ret = trace_remote_alloc_buffer(&desc->trace_buffer_desc, desc_size, size, + cpu_possible_mask); + if (ret) + goto err_free_backing; + + ret = hyp_trace_buffer_share_hyp(trace_buffer); + if (ret) + goto err_free_buffer; + + ret = kvm_call_hyp_nvhe(__tracing_load, (unsigned long)desc, desc_size); + if (ret) + goto err_unload_pages; + + return &desc->trace_buffer_desc; + +err_unload_pages: + hyp_trace_buffer_unshare_hyp(trace_buffer, INT_MAX); + +err_free_buffer: + trace_remote_free_buffer(&desc->trace_buffer_desc); + +err_free_backing: + hyp_trace_buffer_free_bpages_backing(trace_buffer); + +err_free_desc: + free_pages_exact(desc, desc_size); + trace_buffer->desc = NULL; + + return ERR_PTR(ret); +} + +static void hyp_trace_unload(struct trace_buffer_desc *desc, void *priv) +{ + struct hyp_trace_buffer 
*trace_buffer = priv; + + if (WARN_ON(desc != &trace_buffer->desc->trace_buffer_desc)) + return; + + kvm_call_hyp_nvhe(__tracing_unload); + hyp_trace_buffer_unshare_hyp(trace_buffer, INT_MAX); + trace_remote_free_buffer(desc); + hyp_trace_buffer_free_bpages_backing(trace_buffer); + free_pages_exact(trace_buffer->desc, trace_buffer->desc_size); + trace_buffer->desc = NULL; +} + +static int hyp_trace_enable_tracing(bool enable, void *priv) +{ + return kvm_call_hyp_nvhe(__tracing_enable, enable); +} + +static int hyp_trace_swap_reader_page(unsigned int cpu, void *priv) +{ + return kvm_call_hyp_nvhe(__tracing_swap_reader, cpu); +} + +static int hyp_trace_reset(unsigned int cpu, void *priv) +{ + return 0; +} + +static int hyp_trace_enable_event(unsigned short id, bool enable, void *priv) +{ + return 0; +} + +static struct trace_remote_callbacks trace_remote_callbacks = { + .load_trace_buffer = hyp_trace_load, + .unload_trace_buffer = hyp_trace_unload, + .enable_tracing = hyp_trace_enable_tracing, + .swap_reader_page = hyp_trace_swap_reader_page, + .reset = hyp_trace_reset, + .enable_event = hyp_trace_enable_event, +}; + +int __init kvm_hyp_trace_init(void) +{ + if (is_kernel_in_hyp_mode()) + return 0; + + return trace_remote_register("hypervisor", &trace_remote_callbacks, &trace_buffer, NULL, 0); +} diff --git a/arch/arm64/kvm/hyp_trace.h b/arch/arm64/kvm/hyp_trace.h new file mode 100644 index 000000000000..c991b1ec65f1 --- /dev/null +++ b/arch/arm64/kvm/hyp_trace.h @@ -0,0 +1,11 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef __ARM64_KVM_HYP_TRACE_H__ +#define __ARM64_KVM_HYP_TRACE_H__ + +#ifdef CONFIG_NVHE_EL2_TRACING +int kvm_hyp_trace_init(void); +#else +static inline int kvm_hyp_trace_init(void) { return 0; } +#endif +#endif From b22888917fa411d100339ef9d418b4eb86aba962 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:11 +0000 Subject: [PATCH 119/373] KVM: arm64: Sync boot clock with the nVHE/pKVM hyp Configure the hypervisor 
tracing clock with the kernel boot clock. For tracing purposes, the boot clock is interesting: it doesn't stop on suspend. However, it is corrected on a regular basis, which implies the need to re-evaluate it every once in a while. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Christopher S. Hall Cc: Richard Cochran Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-26-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/kvm/hyp/include/nvhe/trace.h | 2 + arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 ++ arch/arm64/kvm/hyp/nvhe/trace.c | 16 +++ arch/arm64/kvm/hyp_trace.c | 149 ++++++++++++++++++++++++ 5 files changed, 179 insertions(+) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index d46d9196d661..9c4fe3bfbfff 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -93,6 +93,7 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___tracing_unload, __KVM_HOST_SMCCC_FUNC___tracing_enable, __KVM_HOST_SMCCC_FUNC___tracing_swap_reader, + __KVM_HOST_SMCCC_FUNC___tracing_update_clock, }; #define DECLARE_KVM_VHE_SYM(sym) extern char sym[] diff --git a/arch/arm64/kvm/hyp/include/nvhe/trace.h b/arch/arm64/kvm/hyp/include/nvhe/trace.h index 7da8788ce527..fd641e1b1c23 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/trace.h +++ b/arch/arm64/kvm/hyp/include/nvhe/trace.h @@ -11,6 +11,7 @@ int __tracing_load(unsigned long desc_va, size_t desc_size); void __tracing_unload(void); int __tracing_enable(bool enable); int __tracing_swap_reader(unsigned int cpu); +void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc); #else static inline void *tracing_reserve_entry(unsigned long length) { return NULL; } static inline void tracing_commit_entry(void) { } @@ -19,5 +20,6 @@ static inline int __tracing_load(unsigned long desc_va, size_t desc_size) { retu static inline void __tracing_unload(void) { } static inline 
int __tracing_enable(bool enable) { return -ENODEV; } static inline int __tracing_swap_reader(unsigned int cpu) { return -ENODEV; } +static inline void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) { } #endif #endif diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index ae04204ea81f..02d5271199d5 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -615,6 +615,16 @@ static void handle___tracing_swap_reader(struct kvm_cpu_context *host_ctxt) cpu_reg(host_ctxt, 1) = __tracing_swap_reader(cpu); } +static void handle___tracing_update_clock(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(u32, mult, host_ctxt, 1); + DECLARE_REG(u32, shift, host_ctxt, 2); + DECLARE_REG(u64, epoch_ns, host_ctxt, 3); + DECLARE_REG(u64, epoch_cyc, host_ctxt, 4); + + __tracing_update_clock(mult, shift, epoch_ns, epoch_cyc); +} + typedef void (*hcall_t)(struct kvm_cpu_context *); #define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x @@ -660,6 +670,7 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__tracing_unload), HANDLE_FUNC(__tracing_enable), HANDLE_FUNC(__tracing_swap_reader), + HANDLE_FUNC(__tracing_update_clock), }; static void handle_host_hcall(struct kvm_cpu_context *host_ctxt) diff --git a/arch/arm64/kvm/hyp/nvhe/trace.c b/arch/arm64/kvm/hyp/nvhe/trace.c index 282cba70ce9b..2c8e6f49d7de 100644 --- a/arch/arm64/kvm/hyp/nvhe/trace.c +++ b/arch/arm64/kvm/hyp/nvhe/trace.c @@ -271,3 +271,19 @@ int __tracing_swap_reader(unsigned int cpu) return ret; } + +void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) +{ + int cpu; + + /* After this loop, all CPUs are observing the new bank... */ + for (cpu = 0; cpu < hyp_nr_cpus; cpu++) { + struct simple_rb_per_cpu *simple_rb = per_cpu_ptr(trace_buffer.simple_rbs, cpu); + + while (READ_ONCE(simple_rb->status) == SIMPLE_RB_WRITING) + ; + } + + /* ...we can now override the old one and swap. 
*/ + trace_clock_update(mult, shift, epoch_ns, epoch_cyc); +} diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c index 2866effe28ec..1e5fc27f0e9d 100644 --- a/arch/arm64/kvm/hyp_trace.c +++ b/arch/arm64/kvm/hyp_trace.c @@ -4,15 +4,133 @@ * Author: Vincent Donnefort */ +#include #include +#include #include +#include #include #include #include #include "hyp_trace.h" +/* Same 10min used by clocksource when width is more than 32-bits */ +#define CLOCK_MAX_CONVERSION_S 600 +/* + * Time to give for the clock init. Long enough to get a good mult/shift + * estimation. Short enough to not delay the tracing start too much. + */ +#define CLOCK_INIT_MS 100 +/* + * Time between clock checks. Must be small enough to catch clock deviation when + * it is still tiny. + */ +#define CLOCK_UPDATE_MS 500 + +static struct hyp_trace_clock { + u64 cycles; + u64 cyc_overflow64; + u64 boot; + u32 mult; + u32 shift; + struct delayed_work work; + struct completion ready; + struct mutex lock; + bool running; +} hyp_clock; + +static void __hyp_clock_work(struct work_struct *work) +{ + struct delayed_work *dwork = to_delayed_work(work); + struct hyp_trace_clock *hyp_clock; + struct system_time_snapshot snap; + u64 rate, delta_cycles; + u64 boot, delta_boot; + + hyp_clock = container_of(dwork, struct hyp_trace_clock, work); + + ktime_get_snapshot(&snap); + boot = ktime_to_ns(snap.boot); + + delta_boot = boot - hyp_clock->boot; + delta_cycles = snap.cycles - hyp_clock->cycles; + + /* Compare hyp clock with the kernel boot clock */ + if (hyp_clock->mult) { + u64 err, cur = delta_cycles; + + if (WARN_ON_ONCE(cur >= hyp_clock->cyc_overflow64)) { + __uint128_t tmp = (__uint128_t)cur * hyp_clock->mult; + + cur = tmp >> hyp_clock->shift; + } else { + cur *= hyp_clock->mult; + cur >>= hyp_clock->shift; + } + cur += hyp_clock->boot; + + err = abs_diff(cur, boot); + /* No deviation, only update epoch if necessary */ + if (!err) { + if (delta_cycles >= (hyp_clock->cyc_overflow64 >> 1)) + 
goto fast_forward; + + goto resched; + } + + /* Warn if the error is above tracing precision (1us) */ + if (err > NSEC_PER_USEC) + pr_warn_ratelimited("hyp trace clock off by %lluus\n", + err / NSEC_PER_USEC); + } + + rate = div64_u64(delta_cycles * NSEC_PER_SEC, delta_boot); + + clocks_calc_mult_shift(&hyp_clock->mult, &hyp_clock->shift, + rate, NSEC_PER_SEC, CLOCK_MAX_CONVERSION_S); + + /* Add a comfortable 50% margin */ + hyp_clock->cyc_overflow64 = (U64_MAX / hyp_clock->mult) >> 1; + +fast_forward: + hyp_clock->cycles = snap.cycles; + hyp_clock->boot = boot; + kvm_call_hyp_nvhe(__tracing_update_clock, hyp_clock->mult, + hyp_clock->shift, hyp_clock->boot, hyp_clock->cycles); + complete(&hyp_clock->ready); + +resched: + schedule_delayed_work(&hyp_clock->work, + msecs_to_jiffies(CLOCK_UPDATE_MS)); +} + +static void hyp_trace_clock_enable(struct hyp_trace_clock *hyp_clock, bool enable) +{ + struct system_time_snapshot snap; + + if (hyp_clock->running == enable) + return; + + if (!enable) { + cancel_delayed_work_sync(&hyp_clock->work); + hyp_clock->running = false; + return; + } + + ktime_get_snapshot(&snap); + + hyp_clock->boot = ktime_to_ns(snap.boot); + hyp_clock->cycles = snap.cycles; + hyp_clock->mult = 0; + + init_completion(&hyp_clock->ready); + INIT_DELAYED_WORK(&hyp_clock->work, __hyp_clock_work); + schedule_delayed_work(&hyp_clock->work, msecs_to_jiffies(CLOCK_INIT_MS)); + wait_for_completion(&hyp_clock->ready); + hyp_clock->running = true; +} + /* Access to this struct within the trace_remote_callbacks are protected by the trace_remote lock */ static struct hyp_trace_buffer { struct hyp_trace_desc *desc; @@ -183,6 +301,8 @@ static void hyp_trace_unload(struct trace_buffer_desc *desc, void *priv) static int hyp_trace_enable_tracing(bool enable, void *priv) { + hyp_trace_clock_enable(&hyp_clock, enable); + return kvm_call_hyp_nvhe(__tracing_enable, enable); } @@ -201,7 +321,22 @@ static int hyp_trace_enable_event(unsigned short id, bool enable, void *priv) return 0;
} +static int hyp_trace_clock_show(struct seq_file *m, void *v) +{ + seq_puts(m, "[boot]\n"); + + return 0; +} +DEFINE_SHOW_ATTRIBUTE(hyp_trace_clock); + +static int hyp_trace_init_tracefs(struct dentry *d, void *priv) +{ + return tracefs_create_file("trace_clock", 0440, d, NULL, &hyp_trace_clock_fops) ? + 0 : -ENOMEM; +} + static struct trace_remote_callbacks trace_remote_callbacks = { + .init = hyp_trace_init_tracefs, .load_trace_buffer = hyp_trace_load, .unload_trace_buffer = hyp_trace_unload, .enable_tracing = hyp_trace_enable_tracing, @@ -212,8 +347,22 @@ static struct trace_remote_callbacks trace_remote_callbacks = { int __init kvm_hyp_trace_init(void) { + int cpu; + if (is_kernel_in_hyp_mode()) return 0; +#ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND + for_each_possible_cpu(cpu) { + const struct arch_timer_erratum_workaround *wa = + per_cpu(timer_unstable_counter_workaround, cpu); + + if (wa && wa->read_cntvct_el0) { + pr_warn("hyp trace can't handle CNTVCT workaround '%s'\n", wa->desc); + return -EOPNOTSUPP; + } + } +#endif + return trace_remote_register("hypervisor", &trace_remote_callbacks, &trace_buffer, NULL, 0); } From 2194d317e07d169efc113344c531621ce31afe64 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:12 +0000 Subject: [PATCH 120/373] KVM: arm64: Add trace reset to the nVHE/pKVM hyp Make the hypervisor reset either the whole tracing buffer or a specific ring-buffer, on remotes/hypervisor/trace or per_cpu//trace write access. 
Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-27-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/kvm/hyp/include/nvhe/trace.h | 2 ++ arch/arm64/kvm/hyp/nvhe/hyp-main.c | 8 ++++++++ arch/arm64/kvm/hyp/nvhe/trace.c | 17 +++++++++++++++++ arch/arm64/kvm/hyp_trace.c | 2 +- 5 files changed, 29 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index 9c4fe3bfbfff..66abf77cf371 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -94,6 +94,7 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___tracing_enable, __KVM_HOST_SMCCC_FUNC___tracing_swap_reader, __KVM_HOST_SMCCC_FUNC___tracing_update_clock, + __KVM_HOST_SMCCC_FUNC___tracing_reset, }; #define DECLARE_KVM_VHE_SYM(sym) extern char sym[] diff --git a/arch/arm64/kvm/hyp/include/nvhe/trace.h b/arch/arm64/kvm/hyp/include/nvhe/trace.h index fd641e1b1c23..44912869d184 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/trace.h +++ b/arch/arm64/kvm/hyp/include/nvhe/trace.h @@ -12,6 +12,7 @@ void __tracing_unload(void); int __tracing_enable(bool enable); int __tracing_swap_reader(unsigned int cpu); void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc); +int __tracing_reset(unsigned int cpu); #else static inline void *tracing_reserve_entry(unsigned long length) { return NULL; } static inline void tracing_commit_entry(void) { } @@ -21,5 +22,6 @@ static inline void __tracing_unload(void) { } static inline int __tracing_enable(bool enable) { return -ENODEV; } static inline int __tracing_swap_reader(unsigned int cpu) { return -ENODEV; } static inline void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) { } +static inline int __tracing_reset(unsigned int cpu) { return -ENODEV; } #endif #endif diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index 
02d5271199d5..9b05f0c87586 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -625,6 +625,13 @@ static void handle___tracing_update_clock(struct kvm_cpu_context *host_ctxt) __tracing_update_clock(mult, shift, epoch_ns, epoch_cyc); } +static void handle___tracing_reset(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(unsigned int, cpu, host_ctxt, 1); + + cpu_reg(host_ctxt, 1) = __tracing_reset(cpu); +} + typedef void (*hcall_t)(struct kvm_cpu_context *); #define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x @@ -671,6 +678,7 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__tracing_enable), HANDLE_FUNC(__tracing_swap_reader), HANDLE_FUNC(__tracing_update_clock), + HANDLE_FUNC(__tracing_reset), }; static void handle_host_hcall(struct kvm_cpu_context *host_ctxt) diff --git a/arch/arm64/kvm/hyp/nvhe/trace.c b/arch/arm64/kvm/hyp/nvhe/trace.c index 2c8e6f49d7de..a6ca27b18e15 100644 --- a/arch/arm64/kvm/hyp/nvhe/trace.c +++ b/arch/arm64/kvm/hyp/nvhe/trace.c @@ -287,3 +287,20 @@ void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) /* ...we can now override the old one and swap. 
 */ + trace_clock_update(mult, shift, epoch_ns, epoch_cyc); +} + +int __tracing_reset(unsigned int cpu) +{ + int ret = -ENODEV; + + if (cpu >= hyp_nr_cpus) + return -EINVAL; + + hyp_spin_lock(&trace_buffer.lock); + + if (hyp_trace_buffer_loaded(&trace_buffer)) + ret = simple_ring_buffer_reset(per_cpu_ptr(trace_buffer.simple_rbs, cpu)); + + hyp_spin_unlock(&trace_buffer.lock); + + return ret; +} diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c index 1e5fc27f0e9d..09bc192e3514 100644 --- a/arch/arm64/kvm/hyp_trace.c +++ b/arch/arm64/kvm/hyp_trace.c @@ -313,7 +313,7 @@ static int hyp_trace_swap_reader_page(unsigned int cpu, void *priv) static int hyp_trace_reset(unsigned int cpu, void *priv) { - return 0; + return kvm_call_hyp_nvhe(__tracing_reset, cpu); } static int hyp_trace_enable_event(unsigned short id, bool enable, void *priv) From 0a90fbc8a1709f682e0196c2632027cdedae5e94 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:13 +0000 Subject: [PATCH 121/373] KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote Allow the creation of hypervisor and trace remote events with a single macro HYP_EVENT(). That macro expands on the kernel side to add all the required declarations (based on REMOTE_EVENT()) and on the hypervisor side to create the trace_() function.
Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-28-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/include/asm/kvm_define_hypevents.h | 16 ++++++++ arch/arm64/include/asm/kvm_hypevents.h | 10 +++++ arch/arm64/include/asm/kvm_hyptrace.h | 13 +++++++ arch/arm64/kernel/image-vars.h | 4 ++ arch/arm64/kernel/vmlinux.lds.S | 18 +++++++++ .../kvm/hyp/include/nvhe/define_events.h | 14 +++++++ arch/arm64/kvm/hyp/include/nvhe/trace.h | 31 ++++++++++++++++ arch/arm64/kvm/hyp/nvhe/Makefile | 2 +- arch/arm64/kvm/hyp/nvhe/events.c | 25 +++++++++++++ arch/arm64/kvm/hyp/nvhe/hyp-main.c | 9 +++++ arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 6 +++ arch/arm64/kvm/hyp_trace.c | 37 ++++++++++++++++++- 13 files changed, 184 insertions(+), 2 deletions(-) create mode 100644 arch/arm64/include/asm/kvm_define_hypevents.h create mode 100644 arch/arm64/include/asm/kvm_hypevents.h create mode 100644 arch/arm64/kvm/hyp/include/nvhe/define_events.h create mode 100644 arch/arm64/kvm/hyp/nvhe/events.c diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index 66abf77cf371..47d250436f8c 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -95,6 +95,7 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___tracing_swap_reader, __KVM_HOST_SMCCC_FUNC___tracing_update_clock, __KVM_HOST_SMCCC_FUNC___tracing_reset, + __KVM_HOST_SMCCC_FUNC___tracing_enable_event, }; #define DECLARE_KVM_VHE_SYM(sym) extern char sym[] diff --git a/arch/arm64/include/asm/kvm_define_hypevents.h b/arch/arm64/include/asm/kvm_define_hypevents.h new file mode 100644 index 000000000000..77d6790252a6 --- /dev/null +++ b/arch/arm64/include/asm/kvm_define_hypevents.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#define REMOTE_EVENT_INCLUDE_FILE arch/arm64/include/asm/kvm_hypevents.h + +#define REMOTE_EVENT_SECTION "_hyp_events" + +#define HE_STRUCT(__args) __args +#define 
HE_PRINTK(__args...) __args +#define he_field re_field + +#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \ + REMOTE_EVENT(__name, 0, RE_STRUCT(__struct), RE_PRINTK(__printk)) + +#define HYP_EVENT_MULTI_READ +#include +#undef HYP_EVENT_MULTI_READ diff --git a/arch/arm64/include/asm/kvm_hypevents.h b/arch/arm64/include/asm/kvm_hypevents.h new file mode 100644 index 000000000000..d6e033c96c52 --- /dev/null +++ b/arch/arm64/include/asm/kvm_hypevents.h @@ -0,0 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#if !defined(__ARM64_KVM_HYPEVENTS_H_) || defined(HYP_EVENT_MULTI_READ) +#define __ARM64_KVM_HYPEVENTS_H_ + +#ifdef __KVM_NVHE_HYPERVISOR__ +#include +#endif + +#endif diff --git a/arch/arm64/include/asm/kvm_hyptrace.h b/arch/arm64/include/asm/kvm_hyptrace.h index 9c30a479bc36..de133b735f72 100644 --- a/arch/arm64/include/asm/kvm_hyptrace.h +++ b/arch/arm64/include/asm/kvm_hyptrace.h @@ -10,4 +10,17 @@ struct hyp_trace_desc { struct trace_buffer_desc trace_buffer_desc; }; + +struct hyp_event_id { + unsigned short id; + atomic_t enabled; +}; + +extern struct remote_event __hyp_events_start[]; +extern struct remote_event __hyp_events_end[]; + +/* hyp_event section used by the hypervisor */ +extern struct hyp_event_id __hyp_event_ids_start[]; +extern struct hyp_event_id __hyp_event_ids_end[]; + #endif diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h index d7b0d12b1015..d4c7d45ae6bc 100644 --- a/arch/arm64/kernel/image-vars.h +++ b/arch/arm64/kernel/image-vars.h @@ -138,6 +138,10 @@ KVM_NVHE_ALIAS(__hyp_data_start); KVM_NVHE_ALIAS(__hyp_data_end); KVM_NVHE_ALIAS(__hyp_rodata_start); KVM_NVHE_ALIAS(__hyp_rodata_end); +#ifdef CONFIG_NVHE_EL2_TRACING +KVM_NVHE_ALIAS(__hyp_event_ids_start); +KVM_NVHE_ALIAS(__hyp_event_ids_end); +#endif /* pKVM static key */ KVM_NVHE_ALIAS(kvm_protected_mode_initialized); diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S index 2964aad0362e..4bf2a83c3448 100644 --- 
a/arch/arm64/kernel/vmlinux.lds.S +++ b/arch/arm64/kernel/vmlinux.lds.S @@ -13,12 +13,23 @@ *(__kvm_ex_table) \ __stop___kvm_ex_table = .; +#ifdef CONFIG_NVHE_EL2_TRACING +#define HYPERVISOR_EVENT_IDS \ + . = ALIGN(PAGE_SIZE); \ + __hyp_event_ids_start = .; \ + *(HYP_SECTION_NAME(.event_ids)) \ + __hyp_event_ids_end = .; +#else +#define HYPERVISOR_EVENT_IDS +#endif + #define HYPERVISOR_RODATA_SECTIONS \ HYP_SECTION_NAME(.rodata) : { \ . = ALIGN(PAGE_SIZE); \ __hyp_rodata_start = .; \ *(HYP_SECTION_NAME(.data..ro_after_init)) \ *(HYP_SECTION_NAME(.rodata)) \ + HYPERVISOR_EVENT_IDS \ . = ALIGN(PAGE_SIZE); \ __hyp_rodata_end = .; \ } @@ -307,6 +318,13 @@ SECTIONS HYPERVISOR_DATA_SECTION +#ifdef CONFIG_NVHE_EL2_TRACING + .data.hyp_events : { + __hyp_events_start = .; + *(SORT(_hyp_events.*)) + __hyp_events_end = .; + } +#endif /* * Data written with the MMU off but read with the MMU on requires * cache lines to be invalidated, discarding up to a Cache Writeback diff --git a/arch/arm64/kvm/hyp/include/nvhe/define_events.h b/arch/arm64/kvm/hyp/include/nvhe/define_events.h new file mode 100644 index 000000000000..776d4c6cb702 --- /dev/null +++ b/arch/arm64/kvm/hyp/include/nvhe/define_events.h @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#undef HYP_EVENT +#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \ + struct hyp_event_id hyp_event_id_##__name \ + __section(".hyp.event_ids."#__name) = { \ + .enabled = ATOMIC_INIT(0), \ + } + +#define HYP_EVENT_MULTI_READ +#include +#undef HYP_EVENT_MULTI_READ + +#undef HYP_EVENT diff --git a/arch/arm64/kvm/hyp/include/nvhe/trace.h b/arch/arm64/kvm/hyp/include/nvhe/trace.h index 44912869d184..802a18b77c56 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/trace.h +++ b/arch/arm64/kvm/hyp/include/nvhe/trace.h @@ -1,9 +1,36 @@ /* SPDX-License-Identifier: GPL-2.0-only */ #ifndef __ARM64_KVM_HYP_NVHE_TRACE_H #define __ARM64_KVM_HYP_NVHE_TRACE_H + +#include + #include +#define HE_PROTO(__args...) 
__args +#define HE_ASSIGN(__args...) __args +#define HE_STRUCT RE_STRUCT +#define he_field re_field + #ifdef CONFIG_NVHE_EL2_TRACING + +#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \ + REMOTE_EVENT_FORMAT(__name, __struct); \ + extern struct hyp_event_id hyp_event_id_##__name; \ + static __always_inline void trace_##__name(__proto) \ + { \ + struct remote_event_format_##__name *__entry; \ + size_t length = sizeof(*__entry); \ + \ + if (!atomic_read(&hyp_event_id_##__name.enabled)) \ + return; \ + __entry = tracing_reserve_entry(length); \ + if (!__entry) \ + return; \ + __entry->hdr.id = hyp_event_id_##__name.id; \ + __assign \ + tracing_commit_entry(); \ + } + void *tracing_reserve_entry(unsigned long length); void tracing_commit_entry(void); @@ -13,9 +40,12 @@ int __tracing_enable(bool enable); int __tracing_swap_reader(unsigned int cpu); void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc); int __tracing_reset(unsigned int cpu); +int __tracing_enable_event(unsigned short id, bool enable); #else static inline void *tracing_reserve_entry(unsigned long length) { return NULL; } static inline void tracing_commit_entry(void) { } +#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \ + static inline void trace_##__name(__proto) {} static inline int __tracing_load(unsigned long desc_va, size_t desc_size) { return -ENODEV; } static inline void __tracing_unload(void) { } @@ -23,5 +53,6 @@ static inline int __tracing_enable(bool enable) { return -ENODEV; } static inline int __tracing_swap_reader(unsigned int cpu) { return -ENODEV; } static inline void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) { } static inline int __tracing_reset(unsigned int cpu) { return -ENODEV; } +static inline int __tracing_enable_event(unsigned short id, bool enable) { return -ENODEV; } #endif #endif diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile index f1840628d2d6..143d55ec7298 
100644 --- a/arch/arm64/kvm/hyp/nvhe/Makefile +++ b/arch/arm64/kvm/hyp/nvhe/Makefile @@ -29,7 +29,7 @@ hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \ ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o hyp-obj-y += ../../../kernel/smccc-call.o hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o -hyp-obj-$(CONFIG_NVHE_EL2_TRACING) += clock.o trace.o +hyp-obj-$(CONFIG_NVHE_EL2_TRACING) += clock.o trace.o events.o hyp-obj-y += $(lib-objs) # Path to simple_ring_buffer.c diff --git a/arch/arm64/kvm/hyp/nvhe/events.c b/arch/arm64/kvm/hyp/nvhe/events.c new file mode 100644 index 000000000000..add9383aadb5 --- /dev/null +++ b/arch/arm64/kvm/hyp/nvhe/events.c @@ -0,0 +1,25 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025 Google LLC + * Author: Vincent Donnefort + */ + +#include +#include + +#include + +int __tracing_enable_event(unsigned short id, bool enable) +{ + struct hyp_event_id *event_id = &__hyp_event_ids_start[id]; + atomic_t *enabled; + + if (event_id >= __hyp_event_ids_end) + return -EINVAL; + + enabled = hyp_fixmap_map(__hyp_pa(&event_id->enabled)); + atomic_set(enabled, enable); + hyp_fixmap_unmap(); + + return 0; +} diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index 9b05f0c87586..fc5953f31b4b 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -632,6 +632,14 @@ static void handle___tracing_reset(struct kvm_cpu_context *host_ctxt) cpu_reg(host_ctxt, 1) = __tracing_reset(cpu); } +static void handle___tracing_enable_event(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(unsigned short, id, host_ctxt, 1); + DECLARE_REG(bool, enable, host_ctxt, 2); + + cpu_reg(host_ctxt, 1) = __tracing_enable_event(id, enable); +} + typedef void (*hcall_t)(struct kvm_cpu_context *); #define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x @@ -679,6 +687,7 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__tracing_swap_reader), 
HANDLE_FUNC(__tracing_update_clock), HANDLE_FUNC(__tracing_reset), + HANDLE_FUNC(__tracing_enable_event), }; static void handle_host_hcall(struct kvm_cpu_context *host_ctxt) diff --git a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S index d724f6d69302..7a02837203d1 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S +++ b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S @@ -16,6 +16,12 @@ SECTIONS { HYP_SECTION(.text) HYP_SECTION(.data..ro_after_init) HYP_SECTION(.rodata) +#ifdef CONFIG_NVHE_EL2_TRACING + . = ALIGN(PAGE_SIZE); + BEGIN_HYP_SECTION(.event_ids) + *(SORT(.hyp.event_ids.*)) + END_HYP_SECTION +#endif /* * .hyp..data..percpu needs to be page aligned to maintain the same diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c index 09bc192e3514..0144cd26703e 100644 --- a/arch/arm64/kvm/hyp_trace.c +++ b/arch/arm64/kvm/hyp_trace.c @@ -318,6 +318,25 @@ static int hyp_trace_reset(unsigned int cpu, void *priv) static int hyp_trace_enable_event(unsigned short id, bool enable, void *priv) { + struct hyp_event_id *event_id = lm_alias(&__hyp_event_ids_start[id]); + struct page *page; + atomic_t *enabled; + void *map; + + if (is_protected_kvm_enabled()) + return kvm_call_hyp_nvhe(__tracing_enable_event, id, enable); + + enabled = &event_id->enabled; + page = virt_to_page(enabled); + map = vmap(&page, 1, VM_MAP, PAGE_KERNEL); + if (!map) + return -ENOMEM; + + enabled = map + offset_in_page(enabled); + atomic_set(enabled, enable); + + vunmap(map); + return 0; } @@ -345,6 +364,19 @@ static struct trace_remote_callbacks trace_remote_callbacks = { .enable_event = hyp_trace_enable_event, }; +#include + +static void __init hyp_trace_init_events(void) +{ + struct hyp_event_id *hyp_event_id = __hyp_event_ids_start; + struct remote_event *event = __hyp_events_start; + int id = 0; + + /* Events on both the kernel and hypervisor sides are sorted identically */ + for (; event < __hyp_events_end; event++, hyp_event_id++, id++) + event->id = hyp_event_id->id = id; +} + int __init
kvm_hyp_trace_init(void) { int cpu; @@ -364,5 +396,8 @@ int __init kvm_hyp_trace_init(void) } #endif - return trace_remote_register("hypervisor", &trace_remote_callbacks, &trace_buffer, NULL, 0); + hyp_trace_init_events(); + + return trace_remote_register("hypervisor", &trace_remote_callbacks, &trace_buffer, + __hyp_events_start, __hyp_events_end - __hyp_events_start); } From 696dfec22b8e4ac57cc90558af2b6be2b37f4b95 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:14 +0000 Subject: [PATCH 122/373] KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hyp The hyp_enter and hyp_exit events are logged by the hypervisor any time it is entered and exited. Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-29-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_host.h | 3 ++ arch/arm64/include/asm/kvm_hypevents.h | 39 +++++++++++++++++++++ arch/arm64/kvm/arm.c | 2 ++ arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h | 23 ++++++++++++ arch/arm64/kvm/hyp/include/nvhe/trace.h | 12 +++++++ arch/arm64/kvm/hyp/nvhe/ffa.c | 28 +++++++-------- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 ++++++ arch/arm64/kvm/hyp/nvhe/psci-relay.c | 7 ++-- arch/arm64/kvm/hyp/nvhe/switch.c | 5 ++- arch/arm64/kvm/hyp_trace.c | 18 ++++++++++ 10 files changed, 131 insertions(+), 17 deletions(-) create mode 100644 arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 2ca264b3db5f..b50ac3bb4bc9 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -920,6 +920,9 @@ struct kvm_vcpu_arch { /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */ struct vncr_tlb *vncr_tlb; + + /* Hyp-readable copy of kvm_vcpu::pid */ + pid_t pid; }; /* diff --git a/arch/arm64/include/asm/kvm_hypevents.h b/arch/arm64/include/asm/kvm_hypevents.h index d6e033c96c52..221a1dacb2f0 100644 --- a/arch/arm64/include/asm/kvm_hypevents.h +++ 
b/arch/arm64/include/asm/kvm_hypevents.h @@ -7,4 +7,43 @@ #include #endif +#ifndef __HYP_ENTER_EXIT_REASON +#define __HYP_ENTER_EXIT_REASON +enum hyp_enter_exit_reason { + HYP_REASON_SMC, + HYP_REASON_HVC, + HYP_REASON_PSCI, + HYP_REASON_HOST_ABORT, + HYP_REASON_GUEST_EXIT, + HYP_REASON_ERET_HOST, + HYP_REASON_ERET_GUEST, + HYP_REASON_UNKNOWN /* Must be last */ +}; +#endif + +HYP_EVENT(hyp_enter, + HE_PROTO(struct kvm_cpu_context *host_ctxt, u8 reason), + HE_STRUCT( + he_field(u8, reason) + he_field(pid_t, vcpu) + ), + HE_ASSIGN( + __entry->reason = reason; + __entry->vcpu = __tracing_get_vcpu_pid(host_ctxt); + ), + HE_PRINTK("reason=%s vcpu=%d", __hyp_enter_exit_reason_str(__entry->reason), __entry->vcpu) +); + +HYP_EVENT(hyp_exit, + HE_PROTO(struct kvm_cpu_context *host_ctxt, u8 reason), + HE_STRUCT( + he_field(u8, reason) + he_field(pid_t, vcpu) + ), + HE_ASSIGN( + __entry->reason = reason; + __entry->vcpu = __tracing_get_vcpu_pid(host_ctxt); + ), + HE_PRINTK("reason=%s vcpu=%d", __hyp_enter_exit_reason_str(__entry->reason), __entry->vcpu) +); #endif diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 87a3d28d5f0f..04c43c9eb764 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -707,6 +707,8 @@ nommu: if (!cpumask_test_cpu(cpu, vcpu->kvm->arch.supported_cpus)) vcpu_set_on_unsupported_cpu(vcpu); + + vcpu->arch.pid = pid_nr(vcpu->pid); } void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) diff --git a/arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h b/arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h new file mode 100644 index 000000000000..1258bc84477f --- /dev/null +++ b/arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h @@ -0,0 +1,23 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef __ARM64_KVM_HYP_NVHE_ARM_SMCCC_H__ +#define __ARM64_KVM_HYP_NVHE_ARM_SMCCC_H__ + +#include + +#include + +#define hyp_smccc_1_1_smc(...) 
\ + do { \ + trace_hyp_exit(NULL, HYP_REASON_SMC); \ + arm_smccc_1_1_smc(__VA_ARGS__); \ + trace_hyp_enter(NULL, HYP_REASON_SMC); \ + } while (0) + +#define hyp_smccc_1_2_smc(...) \ + do { \ + trace_hyp_exit(NULL, HYP_REASON_SMC); \ + arm_smccc_1_2_smc(__VA_ARGS__); \ + trace_hyp_enter(NULL, HYP_REASON_SMC); \ + } while (0) + +#endif /* __ARM64_KVM_HYP_NVHE_ARM_SMCCC_H__ */ diff --git a/arch/arm64/kvm/hyp/include/nvhe/trace.h b/arch/arm64/kvm/hyp/include/nvhe/trace.h index 802a18b77c56..8813ff250f8e 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/trace.h +++ b/arch/arm64/kvm/hyp/include/nvhe/trace.h @@ -6,6 +6,18 @@ #include +static inline pid_t __tracing_get_vcpu_pid(struct kvm_cpu_context *host_ctxt) +{ + struct kvm_vcpu *vcpu; + + if (!host_ctxt) + host_ctxt = host_data_ptr(host_ctxt); + + vcpu = host_ctxt->__hyp_running_vcpu; + + return vcpu ? vcpu->arch.pid : 0; +} + #define HE_PROTO(__args...) __args #define HE_ASSIGN(__args...) __args #define HE_STRUCT RE_STRUCT diff --git a/arch/arm64/kvm/hyp/nvhe/ffa.c b/arch/arm64/kvm/hyp/nvhe/ffa.c index 94161ea1cd60..1af722771178 100644 --- a/arch/arm64/kvm/hyp/nvhe/ffa.c +++ b/arch/arm64/kvm/hyp/nvhe/ffa.c @@ -26,10 +26,10 @@ * the duration and are therefore serialised. 
*/ -#include #include #include +#include #include #include #include @@ -147,7 +147,7 @@ static int ffa_map_hyp_buffers(u64 ffa_page_count) { struct arm_smccc_1_2_regs res; - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_FN64_RXTX_MAP, .a1 = hyp_virt_to_phys(hyp_buffers.tx), .a2 = hyp_virt_to_phys(hyp_buffers.rx), @@ -161,7 +161,7 @@ static int ffa_unmap_hyp_buffers(void) { struct arm_smccc_1_2_regs res; - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_RXTX_UNMAP, .a1 = HOST_FFA_ID, }, &res); @@ -172,7 +172,7 @@ static int ffa_unmap_hyp_buffers(void) static void ffa_mem_frag_tx(struct arm_smccc_1_2_regs *res, u32 handle_lo, u32 handle_hi, u32 fraglen, u32 endpoint_id) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_MEM_FRAG_TX, .a1 = handle_lo, .a2 = handle_hi, @@ -184,7 +184,7 @@ static void ffa_mem_frag_tx(struct arm_smccc_1_2_regs *res, u32 handle_lo, static void ffa_mem_frag_rx(struct arm_smccc_1_2_regs *res, u32 handle_lo, u32 handle_hi, u32 fragoff) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_MEM_FRAG_RX, .a1 = handle_lo, .a2 = handle_hi, @@ -196,7 +196,7 @@ static void ffa_mem_frag_rx(struct arm_smccc_1_2_regs *res, u32 handle_lo, static void ffa_mem_xfer(struct arm_smccc_1_2_regs *res, u64 func_id, u32 len, u32 fraglen) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = func_id, .a1 = len, .a2 = fraglen, @@ -206,7 +206,7 @@ static void ffa_mem_xfer(struct arm_smccc_1_2_regs *res, u64 func_id, u32 len, static void ffa_mem_reclaim(struct arm_smccc_1_2_regs *res, u32 handle_lo, u32 handle_hi, u32 flags) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_MEM_RECLAIM, .a1 = handle_lo, 
.a2 = handle_hi, @@ -216,7 +216,7 @@ static void ffa_mem_reclaim(struct arm_smccc_1_2_regs *res, u32 handle_lo, static void ffa_retrieve_req(struct arm_smccc_1_2_regs *res, u32 len) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_FN64_MEM_RETRIEVE_REQ, .a1 = len, .a2 = len, @@ -225,7 +225,7 @@ static void ffa_retrieve_req(struct arm_smccc_1_2_regs *res, u32 len) static void ffa_rx_release(struct arm_smccc_1_2_regs *res) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_RX_RELEASE, }, res); } @@ -728,7 +728,7 @@ static int hyp_ffa_post_init(void) size_t min_rxtx_sz; struct arm_smccc_1_2_regs res; - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ .a0 = FFA_ID_GET, }, &res); if (res.a0 != FFA_SUCCESS) @@ -737,7 +737,7 @@ static int hyp_ffa_post_init(void) if (res.a2 != HOST_FFA_ID) return -EINVAL; - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ .a0 = FFA_FEATURES, .a1 = FFA_FN64_RXTX_MAP, }, &res); @@ -788,7 +788,7 @@ static void do_ffa_version(struct arm_smccc_1_2_regs *res, * first if TEE supports it. 
*/ if (FFA_MINOR_VERSION(ffa_req_version) < FFA_MINOR_VERSION(hyp_ffa_version)) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_VERSION, .a1 = ffa_req_version, }, res); @@ -824,7 +824,7 @@ static void do_ffa_part_get(struct arm_smccc_1_2_regs *res, goto out_unlock; } - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_PARTITION_INFO_GET, .a1 = uuid0, .a2 = uuid1, @@ -939,7 +939,7 @@ int hyp_ffa_init(void *pages) if (kvm_host_psci_config.smccc_version < ARM_SMCCC_VERSION_1_2) return 0; - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_VERSION, .a1 = FFA_VERSION_1_2, }, &res); diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index fc5953f31b4b..547d63679022 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -137,6 +138,8 @@ static void flush_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu) hyp_vcpu->vcpu.arch.vsesr_el2 = host_vcpu->arch.vsesr_el2; hyp_vcpu->vcpu.arch.vgic_cpu.vgic_v3 = host_vcpu->arch.vgic_cpu.vgic_v3; + + hyp_vcpu->vcpu.arch.pid = host_vcpu->arch.pid; } static void sync_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu) @@ -728,7 +731,9 @@ inval: static void default_host_smc_handler(struct kvm_cpu_context *host_ctxt) { + trace_hyp_exit(host_ctxt, HYP_REASON_SMC); __kvm_hyp_host_forward_smc(host_ctxt); + trace_hyp_enter(host_ctxt, HYP_REASON_SMC); } static void handle_host_smc(struct kvm_cpu_context *host_ctxt) @@ -815,15 +820,19 @@ void handle_trap(struct kvm_cpu_context *host_ctxt) { u64 esr = read_sysreg_el2(SYS_ESR); + switch (ESR_ELx_EC(esr)) { case ESR_ELx_EC_HVC64: + trace_hyp_enter(host_ctxt, HYP_REASON_HVC); handle_host_hcall(host_ctxt); break; case ESR_ELx_EC_SMC64: + trace_hyp_enter(host_ctxt, HYP_REASON_SMC); 
handle_host_smc(host_ctxt); break; case ESR_ELx_EC_IABT_LOW: case ESR_ELx_EC_DABT_LOW: + trace_hyp_enter(host_ctxt, HYP_REASON_HOST_ABORT); handle_host_mem_abort(host_ctxt); break; case ESR_ELx_EC_SYS64: @@ -833,4 +842,6 @@ void handle_trap(struct kvm_cpu_context *host_ctxt) default: BUG(); } + + trace_hyp_exit(host_ctxt, HYP_REASON_ERET_HOST); } diff --git a/arch/arm64/kvm/hyp/nvhe/psci-relay.c b/arch/arm64/kvm/hyp/nvhe/psci-relay.c index c3e196fb8b18..ab4c7bddb163 100644 --- a/arch/arm64/kvm/hyp/nvhe/psci-relay.c +++ b/arch/arm64/kvm/hyp/nvhe/psci-relay.c @@ -6,11 +6,12 @@ #include #include +#include #include -#include #include #include +#include #include #include @@ -65,7 +66,7 @@ static unsigned long psci_call(unsigned long fn, unsigned long arg0, { struct arm_smccc_res res; - arm_smccc_1_1_smc(fn, arg0, arg1, arg2, &res); + hyp_smccc_1_1_smc(fn, arg0, arg1, arg2, &res); return res.a0; } @@ -206,6 +207,7 @@ asmlinkage void __noreturn __kvm_host_psci_cpu_entry(bool is_cpu_on) struct kvm_cpu_context *host_ctxt; host_ctxt = host_data_ptr(host_ctxt); + trace_hyp_enter(host_ctxt, HYP_REASON_PSCI); if (is_cpu_on) boot_args = this_cpu_ptr(&cpu_on_args); @@ -221,6 +223,7 @@ asmlinkage void __noreturn __kvm_host_psci_cpu_entry(bool is_cpu_on) write_sysreg_el1(INIT_SCTLR_EL1_MMU_OFF, SYS_SCTLR); write_sysreg(INIT_PSTATE_EL1, SPSR_EL2); + trace_hyp_exit(host_ctxt, HYP_REASON_PSCI); __host_enter(host_ctxt); } diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c index 779089e42681..ca60721501d1 100644 --- a/arch/arm64/kvm/hyp/nvhe/switch.c +++ b/arch/arm64/kvm/hyp/nvhe/switch.c @@ -7,7 +7,6 @@ #include #include -#include #include #include #include @@ -21,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -308,10 +308,13 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) __debug_switch_to_guest(vcpu); do { + trace_hyp_exit(host_ctxt, HYP_REASON_ERET_GUEST); + /* Jump in the fire! 
*/ exit_code = __guest_enter(vcpu); /* And we're baaack! */ + trace_hyp_enter(host_ctxt, HYP_REASON_GUEST_EXIT); } while (fixup_guest_exit(vcpu, &exit_code)); __sysreg_save_state_nvhe(guest_ctxt); diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c index 0144cd26703e..1ad6a55ba95c 100644 --- a/arch/arm64/kvm/hyp_trace.c +++ b/arch/arm64/kvm/hyp_trace.c @@ -364,8 +364,26 @@ static struct trace_remote_callbacks trace_remote_callbacks = { .enable_event = hyp_trace_enable_event, }; +static const char *__hyp_enter_exit_reason_str(u8 reason); + #include +static const char *__hyp_enter_exit_reason_str(u8 reason) +{ + static const char strs[][12] = { + "smc", + "hvc", + "psci", + "host_abort", + "guest_exit", + "eret_host", + "eret_guest", + "unknown", + }; + + return strs[min(reason, HYP_REASON_UNKNOWN)]; +} + static void __init hyp_trace_init_events(void) { struct hyp_event_id *hyp_event_id = __hyp_event_ids_start; From 5bbbed42f71f771a70e974e3726d283690ecd589 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:15 +0000 Subject: [PATCH 123/373] KVM: arm64: Add selftest event support to nVHE/pKVM hyp Add a selftest event that can be triggered from a `write_event` tracefs file. It is intended to be used by trace remote selftests. 
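As a userspace-side illustration of driving this event, the selftests below effectively do `echo <id> > remotes/hypervisor/write_event`, which the hypervisor turns into a trace_selftest(id) record. The following is a minimal sketch of that write; `write_event_id()` is a hypothetical helper name, and the tracefs path is passed in by the caller rather than hard-coded:

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/*
 * Write a decimal event id to a file, mirroring what the trace remote
 * selftests do with the "write_event" tracefs file. The kernel side
 * parses the value with kstrtoul_from_user(..., 10, ...), i.e. base 10.
 * write_event_id() is an illustrative helper, not a kernel API.
 */
int write_event_id(const char *path, unsigned long id)
{
	FILE *f = fopen(path, "w");
	int ret;

	if (!f)
		return -1;

	ret = fprintf(f, "%lu", id) < 0 ? -1 : 0;
	if (fclose(f))
		ret = -1;
	return ret;
}
```

In the actual selftests the path would live under the tracefs mount, e.g. the `remotes/hypervisor/write_event` file created by hyp_trace_init_tracefs() above.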
Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-30-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/include/asm/kvm_hypevents.h | 11 +++++++++++ arch/arm64/kvm/hyp/nvhe/hyp-main.c | 8 ++++++++ arch/arm64/kvm/hyp_trace.c | 22 ++++++++++++++++++++++ 4 files changed, 42 insertions(+) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index 47d250436f8c..c8eb992d3ac8 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -96,6 +96,7 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___tracing_update_clock, __KVM_HOST_SMCCC_FUNC___tracing_reset, __KVM_HOST_SMCCC_FUNC___tracing_enable_event, + __KVM_HOST_SMCCC_FUNC___tracing_write_event, }; #define DECLARE_KVM_VHE_SYM(sym) extern char sym[] diff --git a/arch/arm64/include/asm/kvm_hypevents.h b/arch/arm64/include/asm/kvm_hypevents.h index 221a1dacb2f0..743c49bd878f 100644 --- a/arch/arm64/include/asm/kvm_hypevents.h +++ b/arch/arm64/include/asm/kvm_hypevents.h @@ -46,4 +46,15 @@ HYP_EVENT(hyp_exit, ), HE_PRINTK("reason=%s vcpu=%d", __hyp_enter_exit_reason_str(__entry->reason), __entry->vcpu) ); + +HYP_EVENT(selftest, + HE_PROTO(u64 id), + HE_STRUCT( + he_field(u64, id) + ), + HE_ASSIGN( + __entry->id = id; + ), + RE_PRINTK("id=%llu", __entry->id) +); #endif diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index 547d63679022..eff9cb208627 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -643,6 +643,13 @@ static void handle___tracing_enable_event(struct kvm_cpu_context *host_ctxt) cpu_reg(host_ctxt, 1) = __tracing_enable_event(id, enable); } +static void handle___tracing_write_event(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(u64, id, host_ctxt, 1); + + trace_selftest(id); +} + typedef void (*hcall_t)(struct kvm_cpu_context *); #define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = 
(hcall_t)handle_##x @@ -691,6 +698,7 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__tracing_update_clock), HANDLE_FUNC(__tracing_reset), HANDLE_FUNC(__tracing_enable_event), + HANDLE_FUNC(__tracing_write_event), }; static void handle_host_hcall(struct kvm_cpu_context *host_ctxt) diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c index 1ad6a55ba95c..c1e28f6581ab 100644 --- a/arch/arm64/kvm/hyp_trace.c +++ b/arch/arm64/kvm/hyp_trace.c @@ -348,8 +348,30 @@ static int hyp_trace_clock_show(struct seq_file *m, void *v) } DEFINE_SHOW_ATTRIBUTE(hyp_trace_clock); +static ssize_t hyp_trace_write_event_write(struct file *f, const char __user *ubuf, + size_t cnt, loff_t *pos) +{ + unsigned long val; + int ret; + + ret = kstrtoul_from_user(ubuf, cnt, 10, &val); + if (ret) + return ret; + + kvm_call_hyp_nvhe(__tracing_write_event, val); + + return cnt; +} + +static const struct file_operations hyp_trace_write_event_fops = { + .write = hyp_trace_write_event_write, +}; + static int hyp_trace_init_tracefs(struct dentry *d, void *priv) { + if (!tracefs_create_file("write_event", 0200, d, NULL, &hyp_trace_write_event_fops)) + return -ENOMEM; + return tracefs_create_file("trace_clock", 0440, d, NULL, &hyp_trace_clock_fops) ? 0 : -ENOMEM; } From 39d5ca62a3dab7d162d49eb60b14cdd46138590f Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 9 Mar 2026 16:25:16 +0000 Subject: [PATCH 124/373] tracing: selftests: Add hypervisor trace remote tests Run the trace remote selftests with the trace remote 'hypervisor'. This trace remote is most likely created when the arm64 KVM nVHE/pKVM hypervisor is in use. 
Cc: Shuah Khan Cc: linux-kselftest@vger.kernel.org Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260309162516.2623589-31-vdonnefort@google.com Signed-off-by: Marc Zyngier --- .../ftrace/test.d/remotes/hypervisor/buffer_size.tc | 11 +++++++++++ .../ftrace/test.d/remotes/hypervisor/reset.tc | 11 +++++++++++ .../ftrace/test.d/remotes/hypervisor/trace.tc | 11 +++++++++++ .../ftrace/test.d/remotes/hypervisor/trace_pipe.tc | 11 +++++++++++ .../ftrace/test.d/remotes/hypervisor/unloading.tc | 11 +++++++++++ 5 files changed, 55 insertions(+) create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/buffer_size.tc create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/reset.tc create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace.tc create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace_pipe.tc create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/unloading.tc diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/buffer_size.tc b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/buffer_size.tc new file mode 100644 index 000000000000..64bf859d6406 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/buffer_size.tc @@ -0,0 +1,11 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: Test hypervisor trace buffer size +# requires: remotes/hypervisor/write_event + +SOURCE_REMOTE_TEST=1 +. 
$TEST_DIR/remotes/buffer_size.tc + +set -e +setup_remote "hypervisor" +test_buffer_size diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/reset.tc b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/reset.tc new file mode 100644 index 000000000000..7fe3b09b34e3 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/reset.tc @@ -0,0 +1,11 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: Test hypervisor trace buffer reset +# requires: remotes/hypervisor/write_event + +SOURCE_REMOTE_TEST=1 +. $TEST_DIR/remotes/reset.tc + +set -e +setup_remote "hypervisor" +test_reset diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace.tc b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace.tc new file mode 100644 index 000000000000..b937c19ca7f9 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace.tc @@ -0,0 +1,11 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: Test hypervisor non-consuming trace read +# requires: remotes/hypervisor/write_event + +SOURCE_REMOTE_TEST=1 +. $TEST_DIR/remotes/trace.tc + +set -e +setup_remote "hypervisor" +test_trace diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace_pipe.tc b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace_pipe.tc new file mode 100644 index 000000000000..66aa1b76c147 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace_pipe.tc @@ -0,0 +1,11 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: Test hypervisor consuming trace read +# requires: remotes/hypervisor/write_event + +SOURCE_REMOTE_TEST=1 +. 
$TEST_DIR/remotes/trace_pipe.tc + +set -e +setup_remote "hypervisor" +test_trace_pipe diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/unloading.tc b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/unloading.tc new file mode 100644 index 000000000000..1dafde3414ab --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/unloading.tc @@ -0,0 +1,11 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: Test hypervisor trace buffer unloading +# requires: remotes/hypervisor/write_event + +SOURCE_REMOTE_TEST=1 +. $TEST_DIR/remotes/unloading.tc + +set -e +setup_remote "hypervisor" +test_unloading From ce6a2badf58170bcf73489cd73981bb5775c1e22 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Wed, 11 Mar 2026 16:49:56 +0000 Subject: [PATCH 125/373] KVM: arm64: Fix out-of-tree build for nVHE/pKVM tracing simple_ring_buffer.c is located in the source tree and isn't duplicated to objtree. Fix its include path. Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260311164956.1424119-1-vdonnefort@google.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile index 143d55ec7298..3d33fbefdfc1 100644 --- a/arch/arm64/kvm/hyp/nvhe/Makefile +++ b/arch/arm64/kvm/hyp/nvhe/Makefile @@ -33,7 +33,7 @@ hyp-obj-$(CONFIG_NVHE_EL2_TRACING) += clock.o trace.o events.o hyp-obj-y += $(lib-objs) # Path to simple_ring_buffer.c -CFLAGS_trace.nvhe.o += -I$(objtree)/kernel/trace/ +CFLAGS_trace.nvhe.o += -I$(srctree)/kernel/trace/ ## ## Build rules for compiling nVHE hyp code From 5f2f83047126f1cb2986d142d2e76e1fa3cef3f0 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Thu, 12 Mar 2026 11:35:35 +0000 Subject: [PATCH 126/373] tracing: Update undefined symbols allow list for simple_ring_buffer Undefined symbols are not allowed for simple_ring_buffer.c. 
But some compiler emitted symbols are missing in the allowlist. Update it. Reported-by: Nathan Chancellor Signed-off-by: Vincent Donnefort Fixes: a717943d8ecc ("tracing: Check for undefined symbols in simple_ring_buffer") Closes: https://lore.kernel.org/all/20260311221816.GA316631@ax162/ Acked-by: Steven Rostedt (Google) Link: https://patch.msgid.link/20260312113535.2213350-1-vdonnefort@google.com Signed-off-by: Marc Zyngier --- kernel/trace/Makefile | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index 3182e1bc1cf7..3cc490aadc99 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -137,7 +137,8 @@ obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o # to all kernel symbols. Fail the build if forbidden symbols are found. # UNDEFINED_ALLOWLIST := memset alt_cb_patch_nops __x86 __ubsan __asan __kasan __gcov __aeabi_unwind -UNDEFINED_ALLOWLIST += __stack_chk_fail stackleak_track_stack __ref_stack __sanitizer +UNDEFINED_ALLOWLIST += __stack_chk_fail stackleak_track_stack __ref_stack __sanitizer llvm_gcda llvm_gcov +UNDEFINED_ALLOWLIST += .TOC\. __clear_pages_unrolled __memmove copy_page warn_slowpath_fmt UNDEFINED_ALLOWLIST := $(addprefix -e , $(UNDEFINED_ALLOWLIST)) quiet_cmd_check_undefined = NM $< From 7e4b6c94300e355a72670c5b896ccc26ac312c63 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Thu, 12 Mar 2026 13:35:43 +0100 Subject: [PATCH 127/373] tracing: add more symbols to whitelist Randconfig builds show a number of cryptic build errors from hitting undefined symbols in simple_ring_buffer.o: make[7]: *** [/home/arnd/arm-soc/kernel/trace/Makefile:147: kernel/trace/simple_ring_buffer.o.checked] Error 1 These happen with CONFIG_TRACE_BRANCH_PROFILING, CONFIG_KASAN_HW_TAGS, CONFIG_STACKPROTECTOR, CONFIG_DEBUG_IRQFLAGS and indirectly from WARN_ON(). Add exceptions for each one that I have hit so far on arm64, x86_64 and arm randconfig builds. 
Other architectures likely hit additional ones, so it would be nice to produce a little more verbose output that include the name of the missing symbols directly. Fixes: a717943d8ecc ("tracing: Check for undefined symbols in simple_ring_buffer") Signed-off-by: Arnd Bergmann Link: https://patch.msgid.link/20260312123601.625063-2-arnd@kernel.org Signed-off-by: Marc Zyngier --- kernel/trace/Makefile | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index 3cc490aadc99..beb15936829d 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -1,4 +1,4 @@ -# SPDX-License-Identifier: GPL-2.0 + # Do not instrument the tracer itself: @@ -139,6 +139,8 @@ obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o UNDEFINED_ALLOWLIST := memset alt_cb_patch_nops __x86 __ubsan __asan __kasan __gcov __aeabi_unwind UNDEFINED_ALLOWLIST += __stack_chk_fail stackleak_track_stack __ref_stack __sanitizer llvm_gcda llvm_gcov UNDEFINED_ALLOWLIST += .TOC\. __clear_pages_unrolled __memmove copy_page warn_slowpath_fmt +UNDEFINED_ALLOWLIST += ftrace_likely_update __hwasan_load __hwasan_store __hwasan_tag_memory +UNDEFINED_ALLOWLIST += warn_bogus_irq_restore __stack_chk_guard UNDEFINED_ALLOWLIST := $(addprefix -e , $(UNDEFINED_ALLOWLIST)) quiet_cmd_check_undefined = NM $< From 9e5dd49de5d82401e92098c03e0a0e978ddd515a Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Thu, 12 Mar 2026 13:35:44 +0100 Subject: [PATCH 128/373] KVM: arm64: tracing: add ftrace dependency Selecting CONFIG_TRACE_REMOTE causes a build time warning when FTRACE is disabled: WARNING: unmet direct dependencies detected for TRACE_REMOTE Depends on [n]: FTRACE [=n] Selected by [y]: - NVHE_EL2_TRACING [=y] && VIRTUALIZATION [=y] && KVM [=y] && NVHE_EL2_DEBUG [=y] && TRACING [=y] Add this as another dependency to ensure a clean build. 
Fixes: 3aed038aac8d ("KVM: arm64: Add trace remote for the nVHE/pKVM hyp") Signed-off-by: Arnd Bergmann Reviewed-by: Vincent Donnefort Link: https://patch.msgid.link/20260312123601.625063-3-arnd@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 17edfe3ae615..449154f9a485 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -74,7 +74,7 @@ if NVHE_EL2_DEBUG config NVHE_EL2_TRACING bool - depends on TRACING + depends on TRACING && FTRACE select TRACE_REMOTE default y From 3b27c82ba2f3dcf8075e3df74dbf7294d2955d1a Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Sat, 7 Mar 2026 01:16:17 +0000 Subject: [PATCH 129/373] KVM: x86: Move some EFER bits enablement to common code Move EFER bits enablement that only depends on CPU support to common code, as there is no reason to do it in vendor code. Leave EFER.SVME and EFER.LMSLE enablement in SVM code as they depend on vendor module parameters. Having the enablement in common code ensures that if a vendor starts supporting an existing feature, KVM doesn't end up advertising the feature to userspace while not allowing the EFER bit to be set. No functional change intended. 
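The mechanism behind kvm_enable_efer_bits() can be sketched in plain C: KVM keeps a mask of EFER bits a guest may not set, and "enabling" a bit simply clears it from that mask, so moving the calls to common code changes where the mask is adjusted, not how. The variable and helpers below are simplified stand-ins for KVM's internals, not its actual definitions; the bit positions are the architectural EFER bit numbers.

```c
#include <stdint.h>
#include <assert.h>

/* Architectural EFER bit positions used by this series */
#define EFER_NX       (1ULL << 11)
#define EFER_FFXSR    (1ULL << 14)
#define EFER_AUTOIBRS (1ULL << 21)

/* Start with every bit reserved; enabling a bit clears it from the mask. */
static uint64_t efer_reserved_bits = ~0ULL;

void enable_efer_bits(uint64_t mask)
{
	efer_reserved_bits &= ~mask;
}

/* A guest-written EFER value is acceptable iff it sets no reserved bits. */
int efer_value_is_valid(uint64_t efer)
{
	return (efer & efer_reserved_bits) == 0;
}
```

With this model, the bug the commit guards against is clear: if the common CPUID code advertises a feature but no one ever calls enable_efer_bits() for it, the corresponding EFER bit stays reserved and guest writes to it fail.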
Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260307011619.2324234-2-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 7 ------- arch/x86/kvm/vmx/vmx.c | 4 ---- arch/x86/kvm/x86.c | 14 ++++++++++++++ 3 files changed, 14 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 936f7652d1e4..424ed50e6bfa 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -5405,14 +5405,10 @@ static __init int svm_hardware_setup(void) pr_err_ratelimited("NX (Execute Disable) not supported\n"); return -EOPNOTSUPP; } - kvm_enable_efer_bits(EFER_NX); kvm_caps.supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR); - if (boot_cpu_has(X86_FEATURE_FXSR_OPT)) - kvm_enable_efer_bits(EFER_FFXSR); - if (tsc_scaling) { if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) { tsc_scaling = false; @@ -5426,9 +5422,6 @@ static __init int svm_hardware_setup(void) tsc_aux_uret_slot = kvm_add_user_return_msr(MSR_TSC_AUX); - if (boot_cpu_has(X86_FEATURE_AUTOIBRS)) - kvm_enable_efer_bits(EFER_AUTOIBRS); - /* Check for pause filtering support */ if (!boot_cpu_has(X86_FEATURE_PAUSEFILTER)) { pause_filter_count = 0; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 967b58a8ab9d..bc28da49f283 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -8694,10 +8694,6 @@ __init int vmx_hardware_setup(void) vmx_setup_user_return_msrs(); - - if (boot_cpu_has(X86_FEATURE_NX)) - kvm_enable_efer_bits(EFER_NX); - if (boot_cpu_has(X86_FEATURE_MPX)) { rdmsrq(MSR_IA32_BNDCFGS, host_bndcfgs); WARN_ONCE(host_bndcfgs, "BNDCFGS in host will be lost"); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d5731499f4c2..7e8c1816cffd 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9998,6 +9998,18 @@ void kvm_setup_xss_caps(void) } EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_setup_xss_caps); +static void kvm_setup_efer_caps(void) +{ + if 
(boot_cpu_has(X86_FEATURE_NX)) + kvm_enable_efer_bits(EFER_NX); + + if (boot_cpu_has(X86_FEATURE_FXSR_OPT)) + kvm_enable_efer_bits(EFER_FFXSR); + + if (boot_cpu_has(X86_FEATURE_AUTOIBRS)) + kvm_enable_efer_bits(EFER_AUTOIBRS); +} + static inline void kvm_ops_update(struct kvm_x86_init_ops *ops) { memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops)); @@ -10134,6 +10146,8 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops) if (r != 0) goto out_mmu_exit; + kvm_setup_efer_caps(); + enable_device_posted_irqs &= enable_apicv && irq_remapping_cap(IRQ_POSTING_CAP); From d216449f253c7039c3e6a0276279c117a5198ce0 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Sat, 7 Mar 2026 01:16:18 +0000 Subject: [PATCH 130/373] KVM: x86: Use kvm_cpu_cap_has() for EFER bits enablement checks Instead of checking that the hardware supports underlying features for EFER bits, check if KVM supports them. It is practically the same, but this removes a subtle dependency on kvm_set_cpu_caps() enabling the relevant CPUID features. No functional change intended. 
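The distinction between boot_cpu_has() and kvm_cpu_cap_has() can be made concrete with a small sketch: the host-feature mask says what the CPU supports, while KVM's capability mask says what KVM has chosen to expose, which is at most a subset. The names and bitmask layout below are simplified stand-ins for illustration only.

```c
#include <stdint.h>
#include <assert.h>

static uint32_t host_features;	/* what the host CPU supports */
static uint32_t kvm_cpu_caps;	/* what KVM decided to expose */

/* KVM derives its caps from the host features, possibly masking some off. */
void set_cpu_caps(uint32_t host, uint32_t kvm_policy_mask)
{
	host_features = host;
	kvm_cpu_caps = host & kvm_policy_mask;
}

int boot_cpu_has_feat(uint32_t bit)
{
	return !!(host_features & bit);
}

int kvm_cpu_cap_has_feat(uint32_t bit)
{
	return !!(kvm_cpu_caps & bit);
}
```

Gating EFER bit enablement on the KVM-side check means a feature the host has but KVM masked off never gets its EFER bit enabled, removing the implicit ordering dependency on the cap-setup code.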
Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260307011619.2324234-3-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7e8c1816cffd..3753d0b62ded 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10000,13 +10000,13 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_setup_xss_caps); static void kvm_setup_efer_caps(void) { - if (boot_cpu_has(X86_FEATURE_NX)) + if (kvm_cpu_cap_has(X86_FEATURE_NX)) kvm_enable_efer_bits(EFER_NX); - if (boot_cpu_has(X86_FEATURE_FXSR_OPT)) + if (kvm_cpu_cap_has(X86_FEATURE_FXSR_OPT)) kvm_enable_efer_bits(EFER_FFXSR); - if (boot_cpu_has(X86_FEATURE_AUTOIBRS)) + if (kvm_cpu_cap_has(X86_FEATURE_AUTOIBRS)) kvm_enable_efer_bits(EFER_AUTOIBRS); } From 577da677aa7cbc13040e4951170d39ec7663ad8a Mon Sep 17 00:00:00 2001 From: Xin Li Date: Fri, 6 Mar 2026 15:12:53 -0800 Subject: [PATCH 131/373] KVM: VMX: Remove unnecessary parentheses Drop redundant parentheses; `return` already takes a full expression, so the extra grouping around the bitwise AND adds nothing. 
Signed-off-by: Xin Li Link: https://patch.msgid.link/20260306231253.2177246-1-xin@zytor.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/capabilities.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index 4e371c93ae16..56cacc06225e 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -107,7 +107,7 @@ static inline bool cpu_has_load_perf_global_ctrl(void) static inline bool cpu_has_load_cet_ctrl(void) { - return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_CET_STATE); + return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_CET_STATE; } static inline bool cpu_has_save_perf_global_ctrl(void) From 26c9bfc0fac240540581cfbe58031b412f98aaf8 Mon Sep 17 00:00:00 2001 From: xuanqingshi <1356292400@qq.com> Date: Fri, 6 Mar 2026 17:12:32 +0800 Subject: [PATCH 132/373] KVM: x86: Add LAPIC guard in kvm_apic_write_nodecode() kvm_apic_write_nodecode() dereferences vcpu->arch.apic without first checking whether the in-kernel LAPIC has been initialized. If it has not (e.g. the vCPU was created without an in-kernel LAPIC), the dereference results in a NULL pointer access. While APIC-write VM-Exits are not expected to occur on a vCPU without an in-kernel LAPIC, kvm_apic_write_nodecode() should be robust against such a scenario as a defense-in-depth measure, e.g. to guard against KVM bugs or CPU errata that could generate a spurious APIC-write VM-Exit. Use KVM_BUG_ON() with lapic_in_kernel() instead of a simple WARN_ON_ONCE(), as suggested by Sean Christopherson, so that KVM kills the VM outright rather than letting it continue in a broken state. Found by a VMCS-targeted fuzzer based on syzkaller. 
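The defense-in-depth pattern in this fix can be sketched in plain userspace C: when an "impossible" condition is observed, mark the whole VM as bugged and bail out before dereferencing possibly-NULL state, instead of oopsing or limping along. `struct vm`, `struct vcpu` and `vm_bug_on()` below are simplified stand-ins for KVM's structures and KVM_BUG_ON(), not the real definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

struct vm { bool bugged; };
struct vcpu { struct vm *kvm; void *apic; };

/* Sketch of KVM_BUG_ON(): record the condition, kill the VM, report it. */
static bool vm_bug_on(bool cond, struct vm *vm)
{
	if (cond)
		vm->bugged = true;	/* VM is dead from here on */
	return cond;
}

int apic_write_nodecode(struct vcpu *vcpu)
{
	/* Mirrors the lapic_in_kernel() guard: bail before dereferencing. */
	if (vm_bug_on(vcpu->apic == NULL, vcpu->kvm))
		return -1;
	return 0;
}
```

The key property is that the guard both prevents the NULL dereference on this path and poisons the VM so later entries fail fast, matching the commit's rationale for KVM_BUG_ON() over a plain WARN_ON_ONCE().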
Signed-off-by: xuanqingshi <1356292400@qq.com> Link: https://patch.msgid.link/tencent_7A9F1B4D75468C0CF5DE1B6902038C948B07@qq.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/lapic.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 9381c58d4c85..02f2039d5f99 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2657,6 +2657,9 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset) { struct kvm_lapic *apic = vcpu->arch.apic; + if (KVM_BUG_ON(!lapic_in_kernel(vcpu), vcpu->kvm)) + return; + /* * ICR is a single 64-bit register when x2APIC is enabled, all others * registers hold 32-bit values. For legacy xAPIC, ICR writes need to From 00d572d4cd7d23f9a7a498d2d824b68ba3ea5b88 Mon Sep 17 00:00:00 2001 From: Anel Orazgaliyeva Date: Fri, 6 Mar 2026 08:59:52 +0100 Subject: [PATCH 133/373] KVM: X86: Fix array_index_nospec protection in __pv_send_ipi The __pv_send_ipi() function iterates over up to BITS_PER_LONG vCPUs starting from the APIC ID specified in its 'min' argument, which is provided by the guest. Commit c87bd4dd43a6 used array_index_nospec() to clamp the value of 'min' but then the for_each_set_bit() loop dereferences higher indices without further protection. Theoretically, a guest can trigger speculative access to up to BITS_PER_LONG elements off the end of the phys_map[] array. (In practice it would probably need aggressive loop unrolling by the compiler to go more than one element off the end, and even that seems unlikely, but the theoretical possibility exists.) Move the array_index_nospec() inside the loop to protect the [map + i] index which is actually being used each time. 
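The fix's per-iteration clamping can be illustrated with a userspace sketch of the generic array_index_mask_nospec() fallback: compute, without a branch the CPU could speculate past, a mask that is all-ones when index < size and zero otherwise, then apply it to every index actually dereferenced inside the loop. Clamping only the starting offset does nothing for `min + i`. The loop below is a simplified stand-in for the __pv_send_ipi() iteration, not KVM code; the mask formula mirrors the kernel's generic C fallback (arch-specific versions exist) and relies on arithmetic right shift of negative values, as the kernel does.

```c
#include <assert.h>

#define BITS_PER_LONG ((int)(8 * sizeof(unsigned long)))

/* ~0UL if index < size, 0 otherwise, computed branchlessly. */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	return ~(long)(index | (size - 1UL - index)) >> (BITS_PER_LONG - 1);
}

unsigned long clamp_index(unsigned long index, unsigned long size)
{
	return index & index_mask_nospec(index, size);
}

/* Simplified analogue of the IPI loop: clamp each dereferenced index. */
int count_set_entries(const int *present, unsigned long size,
		      unsigned long min, unsigned long bitmap)
{
	int i, count = 0;

	for (i = 0; i < BITS_PER_LONG; i++) {
		unsigned long idx;

		if (!(bitmap & (1UL << i)))
			continue;
		idx = clamp_index(min + i, size);	/* per-iteration clamp */
		if (present[idx])
			count++;
	}
	return count;
}
```

An out-of-bounds `min + i` is forced to index 0 even under speculation, which is exactly the protection the one-time clamp of `min` failed to provide for the higher indices.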
Fixes: c87bd4dd43a6 ("KVM: x86: use array_index_nospec with indices that come from guest") Fixes: bdf7ffc89922 ("KVM: LAPIC: Fix pv ipis out-of-bounds access") Fixes: 4180bf1b655a ("KVM: X86: Implement "send IPI" hypercall") Signed-off-by: Anel Orazgaliyeva Signed-off-by: David Woodhouse Reviewed-by: Jim Mattson Link: https://patch.msgid.link/9d50fc3ca9e8e58f551d015f95d51a3c29ce6ccc.camel@infradead.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/lapic.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 02f2039d5f99..e3ec4d8607c1 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -840,16 +840,16 @@ static int __pv_send_ipi(unsigned long *ipi_bitmap, struct kvm_apic_map *map, { int i, count = 0; struct kvm_vcpu *vcpu; + size_t map_index; if (min > map->max_apic_id) return 0; - min = array_index_nospec(min, map->max_apic_id + 1); - for_each_set_bit(i, ipi_bitmap, - min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) { - if (map->phys_map[min + i]) { - vcpu = map->phys_map[min + i]->vcpu; + min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) { + map_index = array_index_nospec(min + i, map->max_apic_id + 1); + if (map->phys_map[map_index]) { + vcpu = map->phys_map[map_index]->vcpu; count += kvm_apic_set_irq(vcpu, irq, NULL); } } From b3ae3ceb556945724d0c046ddb4ea0cf492a0ce6 Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Fri, 23 Jan 2026 17:03:03 +0800 Subject: [PATCH 134/373] KVM: x86/mmu: KVM: x86/mmu: Skip unsync when large pages are allowed Use the large-page metadata to avoid pointless attempts to search SP. If the target GFN falls within a range where a large page is allowed, then there cannot be a shadow page for that GFN; a shadow page in the range would itself disallow using a large page. In that case, there is nothing to unsync and mmu_try_to_unsync_pages() can return immediately. 
This is always true for TDP MMU without nested TDP, and holds for a significant fraction of cases with shadow paging even when all SPs are 4K. For shadow paging, this optimization theoretically avoids work for about 1/e ~= 37% of GFNs, assuming one guest page table per 2M of memory and that each GPT falls randomly into the 2M memory buckets. In a simple test setup, it skipped unsync in a much higher percentage of cases, mainly because the guest buddy allocator clusters GPTs into fewer buckets. Signed-off-by: Lai Jiangshan Link: https://patch.msgid.link/20260123090304.32286-2-jiangshanlai@gmail.com [sean: check for hugepage after write-tracking, update comment] Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 733c1d5671cd..3dce38ffee76 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2940,6 +2940,15 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot, if (kvm_gfn_is_write_tracked(kvm, slot, gfn)) return -EPERM; + /* + * Only 4KiB mappings can become unsync, and KVM disallows hugepages + * when accounting 4KiB shadow pages. Upper-level gPTEs are always + * write-protected (see above), thus if the gfn can be mapped with a + * hugepage and isn't write-tracked, it can't have a shadow page. + */ + if (!lpage_info_slot(gfn, slot, PG_LEVEL_2M)->disallow_lpage) + return 0; + /* * The page is not write-tracked, mark existing shadow pages unsync * unless KVM is synchronizing an unsync SP. 
In that case, KVM must From 55be358e17af4aa218f173cd6eb17a0dc423cd70 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2026 13:26:19 -0800 Subject: [PATCH 135/373] KVM: x86: Immediately fail the build when possible if required #define is missing Guard usage of the must-be-defined macros in KVM's multi-include headers with the existing #ifdefs that attempt to alert the developer to a missing macro, and spit out an explicit #error message if a macro is missing, as referencing the missing macro completely defeats the purpose of the #ifdef (the compiler spews a ton of error messages and buries the targeted error message). Suggested-by: Alexey Dobriyan Reviewed-by: Yuan Yao Link: https://patch.msgid.link/20260302212619.710873-1-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm-x86-ops.h | 10 ++++++---- arch/x86/include/asm/kvm-x86-pmu-ops.h | 8 +++++--- arch/x86/kvm/vmx/vmcs_shadow_fields.h | 5 +++-- 3 files changed, 14 insertions(+), 9 deletions(-) diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index de709fb5bd76..3776cf5382a2 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -1,8 +1,9 @@ /* SPDX-License-Identifier: GPL-2.0 */ -#if !defined(KVM_X86_OP) || !defined(KVM_X86_OP_OPTIONAL) -BUILD_BUG_ON(1) -#endif - +#if !defined(KVM_X86_OP) || \ + !defined(KVM_X86_OP_OPTIONAL) || \ + !defined(KVM_X86_OP_OPTIONAL_RET0) +#error Missing one or more KVM_X86_OP #defines +#else /* * KVM_X86_OP() and KVM_X86_OP_OPTIONAL() are used to help generate * both DECLARE/DEFINE_STATIC_CALL() invocations and @@ -148,6 +149,7 @@ KVM_X86_OP_OPTIONAL(alloc_apic_backing_page) KVM_X86_OP_OPTIONAL_RET0(gmem_prepare) KVM_X86_OP_OPTIONAL_RET0(gmem_max_mapping_level) KVM_X86_OP_OPTIONAL(gmem_invalidate) +#endif #undef KVM_X86_OP #undef KVM_X86_OP_OPTIONAL diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h index 
f0aa6996811f..d5452b3433b7 100644 --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h @@ -1,7 +1,8 @@ /* SPDX-License-Identifier: GPL-2.0 */ -#if !defined(KVM_X86_PMU_OP) || !defined(KVM_X86_PMU_OP_OPTIONAL) -BUILD_BUG_ON(1) -#endif +#if !defined(KVM_X86_PMU_OP) || \ + !defined(KVM_X86_PMU_OP_OPTIONAL) +#error Missing one or more KVM_X86_PMU_OP #defines +#else /* * KVM_X86_PMU_OP() and KVM_X86_PMU_OP_OPTIONAL() are used to help generate @@ -26,6 +27,7 @@ KVM_X86_PMU_OP_OPTIONAL(cleanup) KVM_X86_PMU_OP_OPTIONAL(write_global_ctrl) KVM_X86_PMU_OP(mediated_load) KVM_X86_PMU_OP(mediated_put) +#endif #undef KVM_X86_PMU_OP #undef KVM_X86_PMU_OP_OPTIONAL diff --git a/arch/x86/kvm/vmx/vmcs_shadow_fields.h b/arch/x86/kvm/vmx/vmcs_shadow_fields.h index cad128d1657b..67e821c2be6d 100644 --- a/arch/x86/kvm/vmx/vmcs_shadow_fields.h +++ b/arch/x86/kvm/vmx/vmcs_shadow_fields.h @@ -1,6 +1,6 @@ #if !defined(SHADOW_FIELD_RO) && !defined(SHADOW_FIELD_RW) -BUILD_BUG_ON(1) -#endif +#error Must #define at least one of SHADOW_FIELD_RO or SHADOW_FIELD_RW +#else #ifndef SHADOW_FIELD_RO #define SHADOW_FIELD_RO(x, y) @@ -74,6 +74,7 @@ SHADOW_FIELD_RW(HOST_GS_BASE, host_gs_base) /* 64-bit */ SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS, guest_physical_address) SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH, guest_physical_address) +#endif #undef SHADOW_FIELD_RO #undef SHADOW_FIELD_RW From de0bfdc7137d5132b71dd1fe7aa3ca3df4d68241 Mon Sep 17 00:00:00 2001 From: Nikunj A Dadhania Date: Tue, 10 Feb 2026 05:35:11 +0000 Subject: [PATCH 136/373] KVM: x86: Advertise AVX512 Bit Matrix Multiply (BMM) to userspace Advertise AVX512 Bit Matrix Multiply (BMM) and Bit Reversal instructions to userspace via CPUID leaf 0x80000021_EAX[23]. This feature enables bit matrix multiply operations and bit reversal. 
Like most AVX instructions, there are no intercept controls for individual instructions, and no extra work is needed in KVM to enable correct execution of the instructions in the guest. The instructions and CPUID feature are first described in: AMD64 Bit Matrix Multiply and Bit Reversal Instructions Publication #69192 Revision: 1.00 Issue Date: January 2026 While at it, reorder PREFETCHI in KVM's initialization sequence to match the CPUID bit position order for better organization. Signed-off-by: Nikunj A Dadhania Link: https://patch.msgid.link/20260210053511.1612505-1-nikunj@amd.com [sean: massage changelog] Signed-off-by: Sean Christopherson --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/kvm/cpuid.c | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index dbe104df339b..de7bd88e539d 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -473,6 +473,7 @@ #define X86_FEATURE_GP_ON_USER_CPUID (20*32+17) /* User CPUID faulting */ #define X86_FEATURE_PREFETCHI (20*32+20) /* Prefetch Data/Instruction to Cache Level */ +#define X86_FEATURE_AVX512_BMM (20*32+23) /* AVX512 Bit Matrix Multiply instructions */ #define X86_FEATURE_ERAPS (20*32+24) /* Enhanced Return Address Predictor Security */ #define X86_FEATURE_SBPB (20*32+27) /* Selective Branch Prediction Barrier */ #define X86_FEATURE_IBPB_BRTYPE (20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */ diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index baf9a2860d98..d740c45039c9 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -1243,11 +1243,12 @@ void kvm_initialize_cpu_caps(void) F(NULL_SEL_CLR_BASE), /* UpperAddressIgnore */ F(AUTOIBRS), - F(PREFETCHI), EMULATED_F(NO_SMM_CTL_MSR), /* PrefetchCtlMsr */ /* GpOnUserCpuid */ /* EPSF */ + F(PREFETCHI), + F(AVX512_BMM), F(ERAPS), SYNTHESIZED_F(SBPB), SYNTHESIZED_F(IBPB_BRTYPE), From 
0b4a043a54144aef3e5a2597c29c6adb5e6c47dc Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 15:04:14 -0700 Subject: [PATCH 137/373] KVM: SVM: Add a helper to get LBR field pointer to dedup MSR accesses Add a helper to get a pointer to the corresponding VMCB field given an LBR MSR index, and use it to dedup the handling in svm_{g,s}et_msr(). No functional change intended. Suggested-by: Yosry Ahmed Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260310220414.2569208-1-seanjc@google.com [sean: use KVM_BUG_ON() instead of BUILD_BUG(), clang ain't smart enough] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 49 +++++++++++++++++------------------------- 1 file changed, 20 insertions(+), 29 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 4bf0f5d7167f..5e6bd7fca298 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2751,6 +2751,24 @@ static int svm_get_feature_msr(u32 msr, u64 *data) return 0; } +static u64 *svm_vmcb_lbr(struct vcpu_svm *svm, u32 msr) +{ + switch (msr) { + case MSR_IA32_LASTBRANCHFROMIP: + return &svm->vmcb->save.br_from; + case MSR_IA32_LASTBRANCHTOIP: + return &svm->vmcb->save.br_to; + case MSR_IA32_LASTINTFROMIP: + return &svm->vmcb->save.last_excp_from; + case MSR_IA32_LASTINTTOIP: + return &svm->vmcb->save.last_excp_to; + default: + break; + } + KVM_BUG_ON(1, svm->vcpu.kvm); + return &svm->vmcb->save.br_from; +} + static bool sev_es_prevent_msr_access(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { @@ -2827,16 +2845,10 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) msr_info->data = lbrv ? svm->vmcb->save.dbgctl : 0; break; case MSR_IA32_LASTBRANCHFROMIP: - msr_info->data = lbrv ? svm->vmcb->save.br_from : 0; - break; case MSR_IA32_LASTBRANCHTOIP: - msr_info->data = lbrv ? svm->vmcb->save.br_to : 0; - break; case MSR_IA32_LASTINTFROMIP: - msr_info->data = lbrv ? 
svm->vmcb->save.last_excp_from : 0; - break; case MSR_IA32_LASTINTTOIP: - msr_info->data = lbrv ? svm->vmcb->save.last_excp_to : 0; + msr_info->data = lbrv ? *svm_vmcb_lbr(svm, msr_info->index) : 0; break; case MSR_VM_HSAVE_PA: msr_info->data = svm->nested.hsave_msr; @@ -3112,35 +3124,14 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) svm_update_lbrv(vcpu); break; case MSR_IA32_LASTBRANCHFROMIP: - if (!lbrv) - return KVM_MSR_RET_UNSUPPORTED; - if (!msr->host_initiated) - return 1; - svm->vmcb->save.br_from = data; - vmcb_mark_dirty(svm->vmcb, VMCB_LBR); - break; case MSR_IA32_LASTBRANCHTOIP: - if (!lbrv) - return KVM_MSR_RET_UNSUPPORTED; - if (!msr->host_initiated) - return 1; - svm->vmcb->save.br_to = data; - vmcb_mark_dirty(svm->vmcb, VMCB_LBR); - break; case MSR_IA32_LASTINTFROMIP: - if (!lbrv) - return KVM_MSR_RET_UNSUPPORTED; - if (!msr->host_initiated) - return 1; - svm->vmcb->save.last_excp_from = data; - vmcb_mark_dirty(svm->vmcb, VMCB_LBR); - break; case MSR_IA32_LASTINTTOIP: if (!lbrv) return KVM_MSR_RET_UNSUPPORTED; if (!msr->host_initiated) return 1; - svm->vmcb->save.last_excp_to = data; + *svm_vmcb_lbr(svm, ecx) = data; vmcb_mark_dirty(svm->vmcb, VMCB_LBR); break; case MSR_VM_HSAVE_PA: From 520a1347faf46c2c00c3499de05fdecc6d254c2e Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Fri, 6 Mar 2026 21:08:56 +0000 Subject: [PATCH 138/373] KVM: nSVM: Simplify error handling of nested_svm_copy_vmcb12_to_cache() nested_svm_vmrun() currently stores the return value of nested_svm_copy_vmcb12_to_cache() in a local variable 'err', separate from the generally used 'ret' variable. This is done to have a single call to kvm_skip_emulated_instruction(), such that we can store the return value of kvm_skip_emulated_instruction() in 'ret', and then re-check the return value of nested_svm_copy_vmcb12_to_cache() in 'err'. The code is unnecessarily confusing. 
Instead, call kvm_skip_emulated_instruction() in the failure path of nested_svm_copy_vmcb12_to_cache() if the return value is not -EFAULT, and drop 'err'. Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260306210900.1933788-3-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index b191c6cab57d..3ffde1ff719b 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1079,7 +1079,7 @@ static int nested_svm_copy_vmcb12_to_cache(struct kvm_vcpu *vcpu, u64 vmcb12_gpa int nested_svm_vmrun(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - int ret, err; + int ret; u64 vmcb12_gpa; struct vmcb *vmcb01 = svm->vmcb01.ptr; @@ -1104,19 +1104,20 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) return -EINVAL; vmcb12_gpa = svm->vmcb->save.rax; - err = nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa); - if (err == -EFAULT) { - kvm_inject_gp(vcpu, 0); - return 1; + + ret = nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa); + if (ret) { + if (ret == -EFAULT) { + kvm_inject_gp(vcpu, 0); + return 1; + } + + /* Advance RIP past VMRUN as part of the nested #VMEXIT. */ + return kvm_skip_emulated_instruction(vcpu); } - /* - * Advance RIP if #GP or #UD are not injected, but otherwise stop if - * copying and checking vmcb12 failed. - */ + /* At this point, VMRUN is guaranteed to not fault; advance RIP. 
*/ ret = kvm_skip_emulated_instruction(vcpu); - if (err) - return ret; /* * Since vmcb01 is not in use, we can use it to store some of the L1 From 1211907ac0b5f35e5720620c50b7ca3c72d81f7e Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Mon, 16 Mar 2026 09:28:45 +0000 Subject: [PATCH 139/373] tracing: Generate undef symbols allowlist for simple_ring_buffer Compiler and tooling-generated symbols are difficult to maintain across all supported architectures. Make the allowlist more robust by replacing the hardcoded list with a mechanism that automatically detects these symbols. This mechanism generates a C function designed to trigger common compiler-inserted symbols. Signed-off-by: Vincent Donnefort Reviewed-by: Nathan Chancellor Tested-by: Nathan Chancellor Acked-by: Steven Rostedt (Google) Tested-by: Arnd Bergmann Link: https://patch.msgid.link/20260316092845.3367411-1-vdonnefort@google.com [maz: added __msan prefix to allowlist as pointed out by Arnd] Signed-off-by: Marc Zyngier --- kernel/trace/Makefile | 46 +++++++++++++++++++++++++++++++++++-------- 1 file changed, 38 insertions(+), 8 deletions(-) diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index beb15936829d..26f3d61e900d 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -136,17 +136,47 @@ obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o # simple_ring_buffer is used by the pKVM hypervisor which does not have access # to all kernel symbols. Fail the build if forbidden symbols are found. # -UNDEFINED_ALLOWLIST := memset alt_cb_patch_nops __x86 __ubsan __asan __kasan __gcov __aeabi_unwind -UNDEFINED_ALLOWLIST += __stack_chk_fail stackleak_track_stack __ref_stack __sanitizer llvm_gcda llvm_gcov -UNDEFINED_ALLOWLIST += .TOC\.
__clear_pages_unrolled __memmove copy_page warn_slowpath_fmt -UNDEFINED_ALLOWLIST += ftrace_likely_update __hwasan_load __hwasan_store __hwasan_tag_memory -UNDEFINED_ALLOWLIST += warn_bogus_irq_restore __stack_chk_guard -UNDEFINED_ALLOWLIST := $(addprefix -e , $(UNDEFINED_ALLOWLIST)) +# undefsyms_base generates a set of compiler and tooling-generated symbols that can +# safely be ignored for simple_ring_buffer. +# +filechk_undefsyms_base = \ + echo '$(pound)include '; \ + echo '$(pound)include '; \ + echo '$(pound)include '; \ + echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);'; \ + echo 'void undefsyms_base(void *p, int n);'; \ + echo 'void undefsyms_base(void *p, int n) {'; \ + echo ' char buffer[256] = { 0 };'; \ + echo ' u32 u = 0;'; \ + echo ' memset((char * volatile)page, 8, PAGE_SIZE);'; \ + echo ' memset((char * volatile)buffer, 8, sizeof(buffer));'; \ + echo ' memcpy((void * volatile)p, buffer, sizeof(buffer));'; \ + echo ' cmpxchg((u32 * volatile)&u, 0, 8);'; \ + echo ' WARN_ON(n == 0xdeadbeef);'; \ + echo '}' + +$(obj)/undefsyms_base.c: FORCE + $(call filechk,undefsyms_base) + +clean-files += undefsyms_base.c + +$(obj)/undefsyms_base.o: $(obj)/undefsyms_base.c + +targets += undefsyms_base.o + +# Ensure KASAN is enabled to avoid logic that may disable FORTIFY_SOURCE when +# KASAN is not enabled. undefsyms_base.o does not automatically get KASAN flags +# because it is not linked into vmlinux. 
+KASAN_SANITIZE_undefsyms_base.o := y + +UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \ + __msan simple_ring_buffer \ + $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}') quiet_cmd_check_undefined = NM $< - cmd_check_undefined = test -z "`$(NM) -u $< | grep -v $(UNDEFINED_ALLOWLIST)`" + cmd_check_undefined = test -z "`$(NM) -u $< | grep -v $(addprefix -e , $(UNDEFINED_ALLOWLIST))`" -$(obj)/%.o.checked: $(obj)/%.o FORCE +$(obj)/%.o.checked: $(obj)/%.o $(obj)/undefsyms_base.o FORCE $(call if_changed,check_undefined) always-$(CONFIG_SIMPLE_RING_BUFFER) += simple_ring_buffer.o.checked From 8510d054b7e086a71a9191ac3399231290dfd272 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Fri, 13 Mar 2026 10:49:16 +0100 Subject: [PATCH 140/373] KVM: arm64: avoid unused-variable warning The 'cpu' variable is only used inside of an #ifdef block and causes a warning if there is no user: arch/arm64/kvm/hyp_trace.c: In function 'kvm_hyp_trace_init': arch/arm64/kvm/hyp_trace.c:422:13: error: unused variable 'cpu' [-Werror=unused-variable] 422 | int cpu; | ^~~ Change the #ifdef to an equivalent IS_ENABLED() check to avoid the warning. 
Fixes: b22888917fa4 ("KVM: arm64: Sync boot clock with the nVHE/pKVM hyp") Signed-off-by: Arnd Bergmann Reviewed-by: Vincent Donnefort Link: https://patch.msgid.link/20260313094925.3749287-1-arnd@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp_trace.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c index c1e28f6581ab..8b7f2bf2fba8 100644 --- a/arch/arm64/kvm/hyp_trace.c +++ b/arch/arm64/kvm/hyp_trace.c @@ -424,17 +424,16 @@ int __init kvm_hyp_trace_init(void) if (is_kernel_in_hyp_mode()) return 0; -#ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND for_each_possible_cpu(cpu) { const struct arch_timer_erratum_workaround *wa = per_cpu(timer_unstable_counter_workaround, cpu); - if (wa && wa->read_cntvct_el0) { + if (IS_ENABLED(CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND) && + wa && wa->read_cntvct_el0) { pr_warn("hyp trace can't handle CNTVCT workaround '%s'\n", wa->desc); return -EOPNOTSUPP; } } -#endif hyp_trace_init_events(); From d7729643942325933508274f0392b749ca74f7cc Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Tue, 17 Mar 2026 19:42:52 +0000 Subject: [PATCH 141/373] tracing: Restore accidentally removed SPDX tag Restore the SPDX tag that was accidentally dropped. 
Fixes: 7e4b6c94300e3 ("tracing: add more symbols to whitelist") Reported-by: Nathan Chancellor Cc: Arnd Bergmann Cc: Vincent Donnefort Cc: Steven Rostedt Reviewed-by: Steven Rostedt (Google) Link: https://patch.msgid.link/20260317194252.1890568-1-maz@kernel.org Signed-off-by: Marc Zyngier --- kernel/trace/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index 26f3d61e900d..c5e14ffd36ee 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -1,4 +1,4 @@ - +# SPDX-License-Identifier: GPL-2.0 # Do not instrument the tracer itself: From 204f7c018d76c2488a90fc6681519b8eb6eebb1d Mon Sep 17 00:00:00 2001 From: Wei-Lin Chang Date: Tue, 17 Mar 2026 18:26:37 +0000 Subject: [PATCH 142/373] KVM: arm64: ptdump: Make KVM ptdump code s2 mmu aware To reuse the ptdump code for shadow page table dumping later, let's pass s2 mmu as the private data, so we have the freedom to select which page table to print. Signed-off-by: Wei-Lin Chang Reviewed-by: Joey Gouly Link: https://patch.msgid.link/20260317182638.1592507-2-weilin.chang@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/ptdump.c | 33 +++++++++++++++++---------------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/arch/arm64/kvm/ptdump.c b/arch/arm64/kvm/ptdump.c index 6a8836207a79..8713001e992d 100644 --- a/arch/arm64/kvm/ptdump.c +++ b/arch/arm64/kvm/ptdump.c @@ -18,7 +18,7 @@ #define KVM_PGTABLE_MAX_LEVELS (KVM_PGTABLE_LAST_LEVEL + 1) struct kvm_ptdump_guest_state { - struct kvm *kvm; + struct kvm_s2_mmu *mmu; struct ptdump_pg_state parser_state; struct addr_marker ipa_marker[MARKERS_LEN]; struct ptdump_pg_level level[KVM_PGTABLE_MAX_LEVELS]; @@ -112,10 +112,9 @@ static int kvm_ptdump_build_levels(struct ptdump_pg_level *level, u32 start_lvl) return 0; } -static struct kvm_ptdump_guest_state *kvm_ptdump_parser_create(struct kvm *kvm) +static struct kvm_ptdump_guest_state *kvm_ptdump_parser_create(struct kvm_s2_mmu *mmu) { 
struct kvm_ptdump_guest_state *st; - struct kvm_s2_mmu *mmu = &kvm->arch.mmu; struct kvm_pgtable *pgtable = mmu->pgt; int ret; @@ -133,7 +132,7 @@ static struct kvm_ptdump_guest_state *kvm_ptdump_parser_create(struct kvm *kvm) st->ipa_marker[1].start_address = BIT(pgtable->ia_bits); st->range[0].end = BIT(pgtable->ia_bits); - st->kvm = kvm; + st->mmu = mmu; st->parser_state = (struct ptdump_pg_state) { .marker = &st->ipa_marker[0], .level = -1, @@ -149,8 +148,8 @@ static int kvm_ptdump_guest_show(struct seq_file *m, void *unused) { int ret; struct kvm_ptdump_guest_state *st = m->private; - struct kvm *kvm = st->kvm; - struct kvm_s2_mmu *mmu = &kvm->arch.mmu; + struct kvm_s2_mmu *mmu = st->mmu; + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); struct ptdump_pg_state *parser_state = &st->parser_state; struct kvm_pgtable_walker walker = (struct kvm_pgtable_walker) { .cb = kvm_ptdump_visitor, @@ -169,14 +168,15 @@ static int kvm_ptdump_guest_show(struct seq_file *m, void *unused) static int kvm_ptdump_guest_open(struct inode *m, struct file *file) { - struct kvm *kvm = m->i_private; + struct kvm_s2_mmu *mmu = m->i_private; + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); struct kvm_ptdump_guest_state *st; int ret; if (!kvm_get_kvm_safe(kvm)) return -ENOENT; - st = kvm_ptdump_parser_create(kvm); + st = kvm_ptdump_parser_create(mmu); if (IS_ERR(st)) { ret = PTR_ERR(st); goto err_with_kvm_ref; @@ -194,7 +194,7 @@ err_with_kvm_ref: static int kvm_ptdump_guest_close(struct inode *m, struct file *file) { - struct kvm *kvm = m->i_private; + struct kvm *kvm = kvm_s2_mmu_to_kvm(m->i_private); void *st = ((struct seq_file *)file->private_data)->private; kfree(st); @@ -229,14 +229,15 @@ static int kvm_pgtable_levels_show(struct seq_file *m, void *unused) static int kvm_pgtable_debugfs_open(struct inode *m, struct file *file, int (*show)(struct seq_file *, void *)) { - struct kvm *kvm = m->i_private; + struct kvm_s2_mmu *mmu = m->i_private; + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); struct 
kvm_pgtable *pgtable; int ret; if (!kvm_get_kvm_safe(kvm)) return -ENOENT; - pgtable = kvm->arch.mmu.pgt; + pgtable = mmu->pgt; ret = single_open(file, show, pgtable); if (ret < 0) @@ -256,7 +257,7 @@ static int kvm_pgtable_levels_open(struct inode *m, struct file *file) static int kvm_pgtable_debugfs_close(struct inode *m, struct file *file) { - struct kvm *kvm = m->i_private; + struct kvm *kvm = kvm_s2_mmu_to_kvm(m->i_private); kvm_put_kvm(kvm); return single_release(m, file); @@ -279,9 +280,9 @@ static const struct file_operations kvm_pgtable_levels_fops = { void kvm_s2_ptdump_create_debugfs(struct kvm *kvm) { debugfs_create_file("stage2_page_tables", 0400, kvm->debugfs_dentry, - kvm, &kvm_ptdump_guest_fops); - debugfs_create_file("ipa_range", 0400, kvm->debugfs_dentry, kvm, - &kvm_pgtable_range_fops); + &kvm->arch.mmu, &kvm_ptdump_guest_fops); + debugfs_create_file("ipa_range", 0400, kvm->debugfs_dentry, + &kvm->arch.mmu, &kvm_pgtable_range_fops); debugfs_create_file("stage2_levels", 0400, kvm->debugfs_dentry, - kvm, &kvm_pgtable_levels_fops); + &kvm->arch.mmu, &kvm_pgtable_levels_fops); } From 90f0155f8754e75fa29fce02e40d690fb733852d Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:49:57 +0000 Subject: [PATCH 143/373] KVM: arm64: vgic-v3: Drop userspace write sanitization for ID_AA64PFR0.GIC on GICv5 Drop a check that blocked userspace writes to ID_AA64PFR0_EL1 for writes that set the GIC field to 0 (NI) on GICv5 hosts. There is no such check for GICv3 native systems, and having inconsistent behaviour both complicates the logic and risks breaking existing userspace software that expects to be able to write the register. This means that userspace is now able to create a GICv3 guest on GICv5 hosts, and disable the guest from seeing that it has a GICv3. This matches the already existing behaviour for GICv3-native VMs, allowing for fewer issues when migrating from GICv3 hosts to compatible GICv5 hosts. 
Additionally, this allows the trap and FGU infrastructure to kick in as these rely on the state of the feature bits that have been set. Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260319154937.3619520-2-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/sys_regs.c | 8 -------- 1 file changed, 8 deletions(-) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 1b4cacb6e918..4b9f4e5d946b 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -2177,14 +2177,6 @@ static int set_id_aa64pfr0_el1(struct kvm_vcpu *vcpu, (vcpu_has_nv(vcpu) && !FIELD_GET(ID_AA64PFR0_EL1_EL2, user_val))) return -EINVAL; - /* - * If we are running on a GICv5 host and support FEAT_GCIE_LEGACY, then - * we support GICv3. Fail attempts to do anything but set that to IMP. - */ - if (vgic_is_v3_compat(vcpu->kvm) && - FIELD_GET(ID_AA64PFR0_EL1_GIC_MASK, user_val) != ID_AA64PFR0_EL1_GIC_IMP) - return -EINVAL; - return set_id_reg(vcpu, rd, user_val); } From 3a2857da94d4783c076b15035c578892f1817dce Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:50:13 +0000 Subject: [PATCH 144/373] KVM: arm64: vgic: Rework vgic_is_v3() and add vgic_host_has_gicvX() The GIC version checks used to determine host capabilities and guest configuration have become somewhat conflated (in part due to the addition of GICv5 support). vgic_is_v3() is a prime example, which prior to this change has been a combination of guest configuration and host capability. Split out the host capability check from vgic_is_v3(), which now only checks if the vgic model itself is GICv3. Add two new functions: vgic_host_has_gicv3() and vgic_host_has_gicv5(). These explicitly check the host capabilities, i.e., can the host system run a GICvX guest or not.
The vgic_is_v3() check in vcpu_set_ich_hcr() has been replaced with vgic_host_has_gicv3() as this only applies on GICv3-capable hardware, and isn't strictly only applicable for a GICv3 guest (it is actually vital for vGICv2 on GICv3 hosts). Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260319154937.3619520-3-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/sys_regs.c | 2 +- arch/arm64/kvm/vgic/vgic-v3.c | 2 +- arch/arm64/kvm/vgic/vgic.h | 17 +++++++++++++---- 3 files changed, 15 insertions(+), 6 deletions(-) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 4b9f4e5d946b..0acd10e50aab 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -1985,7 +1985,7 @@ static u64 sanitise_id_aa64pfr0_el1(const struct kvm_vcpu *vcpu, u64 val) val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, CSV3, IMP); } - if (vgic_is_v3(vcpu->kvm)) { + if (vgic_host_has_gicv3()) { val &= ~ID_AA64PFR0_EL1_GIC_MASK; val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP); } diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c index 6a355eca1934..9e841e7afd4a 100644 --- a/arch/arm64/kvm/vgic/vgic-v3.c +++ b/arch/arm64/kvm/vgic/vgic-v3.c @@ -499,7 +499,7 @@ void vcpu_set_ich_hcr(struct kvm_vcpu *vcpu) { struct vgic_v3_cpu_if *vgic_v3 = &vcpu->arch.vgic_cpu.vgic_v3; - if (!vgic_is_v3(vcpu->kvm)) + if (!vgic_host_has_gicv3()) return; /* Hide GICv3 sysreg if necessary */ diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h index c9b3bb07e483..0bb8fa10bb4e 100644 --- a/arch/arm64/kvm/vgic/vgic.h +++ b/arch/arm64/kvm/vgic/vgic.h @@ -454,15 +454,24 @@ void vgic_v3_put_nested(struct kvm_vcpu *vcpu); void vgic_v3_handle_nested_maint_irq(struct kvm_vcpu *vcpu); void vgic_v3_nested_update_mi(struct kvm_vcpu *vcpu); -static inline bool vgic_is_v3_compat(struct kvm *kvm) +static inline bool vgic_is_v3(struct kvm *kvm) { - return cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF) && + return 
kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3; +} + +static inline bool vgic_host_has_gicv3(void) +{ + /* + * Either the host is a native GICv3, or it is GICv5 with + * FEAT_GCIE_LEGACY. + */ + return kvm_vgic_global_state.type == VGIC_V3 || kvm_vgic_global_state.has_gcie_v3_compat; } -static inline bool vgic_is_v3(struct kvm *kvm) +static inline bool vgic_host_has_gicv5(void) { - return kvm_vgic_global_state.type == VGIC_V3 || vgic_is_v3_compat(kvm); + return kvm_vgic_global_state.type == VGIC_V5; } int vgic_its_debug_init(struct kvm_device *dev); From cbd8c958be54abdf2c0f9b9c3eac971428b9d4b1 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:50:28 +0000 Subject: [PATCH 145/373] KVM: arm64: Return early from kvm_finalize_sys_regs() if guest has run If the guest has already run, we have no business finalizing the system register state - it is too late. Therefore, check early and bail if the VM has already run. This change also stops kvm_init_nv_sysregs() from being called after the VM has run once. Although this looks like a behavioural change, the function returns early once it has been called the first time.
*/ - if (!irqchip_in_kernel(kvm) && !kvm_vm_has_ran_once(kvm)) { + if (!irqchip_in_kernel(kvm)) { u64 val; val = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1) & ~ID_AA64PFR0_EL1_GIC; From 663594aafb438f8c4e51d4bf2dbf48b9f68aedb7 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:51:01 +0000 Subject: [PATCH 146/373] KVM: arm64: vgic: Split out mapping IRQs and setting irq_ops Prior to this change, the act of mapping a virtual IRQ to a physical one also set the irq_ops. Unmapping then reset the irq_ops to NULL. So far, this has been fine and hasn't caused any major issues. Now, however, as GICv5 support is being added to KVM, it has become apparent that conflating mapping/unmapping IRQs and setting/clearing irq_ops can cause issues. The reason is that the upcoming GICv5 support introduces a set of default irq_ops for PPIs, and removing this when unmapping will cause things to break rather horribly. Split out the mapping/unmapping of IRQs from the setting/clearing of irq_ops. The arch timer code is updated to set the irq_ops following a successful map. The irq_ops are intentionally not removed again on an unmap as the only irq_op introduced by the arch timer only takes effect if the hw bit in struct vgic_irq is set. Therefore, it is safe to leave this in place, and it avoids additional complexity when GICv5 support is introduced. 
Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260319154937.3619520-6-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/arch_timer.c | 22 +++++++++++----------- arch/arm64/kvm/vgic/vgic.c | 27 +++++++++++++++++++++------ include/kvm/arm_vgic.h | 5 ++++- 3 files changed, 36 insertions(+), 18 deletions(-) diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c index 600f250753b4..d31bc1e7a13c 100644 --- a/arch/arm64/kvm/arch_timer.c +++ b/arch/arm64/kvm/arch_timer.c @@ -740,13 +740,11 @@ static void kvm_timer_vcpu_load_nested_switch(struct kvm_vcpu *vcpu, ret = kvm_vgic_map_phys_irq(vcpu, map->direct_vtimer->host_timer_irq, - timer_irq(map->direct_vtimer), - &arch_timer_irq_ops); + timer_irq(map->direct_vtimer)); WARN_ON_ONCE(ret); ret = kvm_vgic_map_phys_irq(vcpu, map->direct_ptimer->host_timer_irq, - timer_irq(map->direct_ptimer), - &arch_timer_irq_ops); + timer_irq(map->direct_ptimer)); WARN_ON_ONCE(ret); } } @@ -1543,6 +1541,7 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu) { struct arch_timer_cpu *timer = vcpu_timer(vcpu); struct timer_map map; + struct irq_ops *ops; int ret; if (timer->enabled) @@ -1563,20 +1562,21 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu) get_timer_map(vcpu, &map); + ops = &arch_timer_irq_ops; + + for (int i = 0; i < nr_timers(vcpu); i++) + kvm_vgic_set_irq_ops(vcpu, timer_irq(vcpu_get_timer(vcpu, i)), ops); + ret = kvm_vgic_map_phys_irq(vcpu, map.direct_vtimer->host_timer_irq, - timer_irq(map.direct_vtimer), - &arch_timer_irq_ops); + timer_irq(map.direct_vtimer)); if (ret) return ret; - if (map.direct_ptimer) { + if (map.direct_ptimer) ret = kvm_vgic_map_phys_irq(vcpu, map.direct_ptimer->host_timer_irq, - timer_irq(map.direct_ptimer), - &arch_timer_irq_ops); - } - + timer_irq(map.direct_ptimer)); if (ret) return ret; diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c index e22b79cfff96..9e0d26348a2a 100644 --- a/arch/arm64/kvm/vgic/vgic.c +++ 
b/arch/arm64/kvm/vgic/vgic.c @@ -553,10 +553,27 @@ int kvm_vgic_inject_irq(struct kvm *kvm, struct kvm_vcpu *vcpu, return 0; } +void kvm_vgic_set_irq_ops(struct kvm_vcpu *vcpu, u32 vintid, + struct irq_ops *ops) +{ + struct vgic_irq *irq = vgic_get_vcpu_irq(vcpu, vintid); + + BUG_ON(!irq); + + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) + irq->ops = ops; + + vgic_put_irq(vcpu->kvm, irq); +} + +void kvm_vgic_clear_irq_ops(struct kvm_vcpu *vcpu, u32 vintid) +{ + kvm_vgic_set_irq_ops(vcpu, vintid, NULL); +} + /* @irq->irq_lock must be held */ static int kvm_vgic_map_irq(struct kvm_vcpu *vcpu, struct vgic_irq *irq, - unsigned int host_irq, - struct irq_ops *ops) + unsigned int host_irq) { struct irq_desc *desc; struct irq_data *data; @@ -576,7 +593,6 @@ static int kvm_vgic_map_irq(struct kvm_vcpu *vcpu, struct vgic_irq *irq, irq->hw = true; irq->host_irq = host_irq; irq->hwintid = data->hwirq; - irq->ops = ops; return 0; } @@ -585,11 +601,10 @@ static inline void kvm_vgic_unmap_irq(struct vgic_irq *irq) { irq->hw = false; irq->hwintid = 0; - irq->ops = NULL; } int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, unsigned int host_irq, - u32 vintid, struct irq_ops *ops) + u32 vintid) { struct vgic_irq *irq = vgic_get_vcpu_irq(vcpu, vintid); unsigned long flags; @@ -598,7 +613,7 @@ int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, unsigned int host_irq, BUG_ON(!irq); raw_spin_lock_irqsave(&irq->irq_lock, flags); - ret = kvm_vgic_map_irq(vcpu, irq, host_irq, ops); + ret = kvm_vgic_map_irq(vcpu, irq, host_irq); raw_spin_unlock_irqrestore(&irq->irq_lock, flags); vgic_put_irq(vcpu->kvm, irq); diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index f2eafc65bbf4..46262d1433bc 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -397,8 +397,11 @@ void kvm_vgic_init_cpu_hardware(void); int kvm_vgic_inject_irq(struct kvm *kvm, struct kvm_vcpu *vcpu, unsigned int intid, bool level, void *owner); +void kvm_vgic_set_irq_ops(struct kvm_vcpu *vcpu, u32 
vintid, + struct irq_ops *ops); +void kvm_vgic_clear_irq_ops(struct kvm_vcpu *vcpu, u32 vintid); int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, unsigned int host_irq, - u32 vintid, struct irq_ops *ops); + u32 vintid); int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, unsigned int vintid); int kvm_vgic_get_map(struct kvm_vcpu *vcpu, unsigned int vintid); bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, unsigned int vintid); From 2808a8337078f2a65f1f1176880e1491a3e88fa8 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:51:16 +0000 Subject: [PATCH 147/373] arm64/sysreg: Add remaining GICv5 ICC_ & ICH_ sysregs for KVM support Add the GICv5 system registers required to support native GICv5 guests with KVM. Many of the GICv5 sysregs have already been added as part of the host GICv5 driver, keeping this set relatively small. The registers added in this change complete the set by adding those required by KVM either directly (ICH_) or indirectly (FGTs for the ICC_ sysregs). 
The following system registers and their fields are added: ICC_APR_EL1 ICC_HPPIR_EL1 ICC_IAFFIDR_EL1 ICH_APR_EL2 ICH_CONTEXTR_EL2 ICH_PPI_ACTIVER_EL2 ICH_PPI_DVI_EL2 ICH_PPI_ENABLER_EL2 ICH_PPI_PENDR_EL2 ICH_PPI_PRIORITYR_EL2 Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-7-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/tools/sysreg | 480 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 480 insertions(+) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index 9d1c21108057..51dcca5b2fa6 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -3243,6 +3243,14 @@ UnsignedEnum 3:0 ID_BITS EndEnum EndSysreg +Sysreg ICC_HPPIR_EL1 3 0 12 10 3 +Res0 63:33 +Field 32 HPPIV +Field 31:29 TYPE +Res0 28:24 +Field 23:0 ID +EndSysreg + Sysreg ICC_ICSR_EL1 3 0 12 10 4 Res0 63:48 Field 47:32 IAFFID @@ -3257,6 +3265,11 @@ Field 1 Enabled Field 0 F EndSysreg +Sysreg ICC_IAFFIDR_EL1 3 0 12 10 5 +Res0 63:16 +Field 15:0 IAFFID +EndSysreg + SysregFields ICC_PPI_ENABLERx_EL1 Field 63 EN63 Field 62 EN62 @@ -3663,6 +3676,42 @@ Res0 14:12 Field 11:0 AFFINITY EndSysreg +Sysreg ICC_APR_EL1 3 1 12 0 0 +Res0 63:32 +Field 31 P31 +Field 30 P30 +Field 29 P29 +Field 28 P28 +Field 27 P27 +Field 26 P26 +Field 25 P25 +Field 24 P24 +Field 23 P23 +Field 22 P22 +Field 21 P21 +Field 20 P20 +Field 19 P19 +Field 18 P18 +Field 17 P17 +Field 16 P16 +Field 15 P15 +Field 14 P14 +Field 13 P13 +Field 12 P12 +Field 11 P11 +Field 10 P10 +Field 9 P9 +Field 8 P8 +Field 7 P7 +Field 6 P6 +Field 5 P5 +Field 4 P4 +Field 3 P3 +Field 2 P2 +Field 1 P1 +Field 0 P0 +EndSysreg + Sysreg ICC_CR0_EL1 3 1 12 0 1 Res0 63:39 Field 38 PID @@ -4687,6 +4736,42 @@ Field 31:16 PhyPARTID29 Field 15:0 PhyPARTID28 EndSysreg +Sysreg ICH_APR_EL2 3 4 12 8 4 +Res0 63:32 +Field 31 P31 +Field 30 P30 +Field 29 P29 +Field 28 P28 +Field 27 P27 +Field 26 P26 +Field 25 P25 +Field 24 P24 +Field 23 P23 +Field 22 P22 +Field 21 P21 
+Field 20 P20 +Field 19 P19 +Field 18 P18 +Field 17 P17 +Field 16 P16 +Field 15 P15 +Field 14 P14 +Field 13 P13 +Field 12 P12 +Field 11 P11 +Field 10 P10 +Field 9 P9 +Field 8 P8 +Field 7 P7 +Field 6 P6 +Field 5 P5 +Field 4 P4 +Field 3 P3 +Field 2 P2 +Field 1 P1 +Field 0 P0 +EndSysreg + Sysreg ICH_HFGRTR_EL2 3 4 12 9 4 Res0 63:21 Field 20 ICC_PPI_ACTIVERn_EL1 @@ -4735,6 +4820,306 @@ Field 1 GICCDDIS Field 0 GICCDEN EndSysreg +SysregFields ICH_PPI_DVIRx_EL2 +Field 63 DVI63 +Field 62 DVI62 +Field 61 DVI61 +Field 60 DVI60 +Field 59 DVI59 +Field 58 DVI58 +Field 57 DVI57 +Field 56 DVI56 +Field 55 DVI55 +Field 54 DVI54 +Field 53 DVI53 +Field 52 DVI52 +Field 51 DVI51 +Field 50 DVI50 +Field 49 DVI49 +Field 48 DVI48 +Field 47 DVI47 +Field 46 DVI46 +Field 45 DVI45 +Field 44 DVI44 +Field 43 DVI43 +Field 42 DVI42 +Field 41 DVI41 +Field 40 DVI40 +Field 39 DVI39 +Field 38 DVI38 +Field 37 DVI37 +Field 36 DVI36 +Field 35 DVI35 +Field 34 DVI34 +Field 33 DVI33 +Field 32 DVI32 +Field 31 DVI31 +Field 30 DVI30 +Field 29 DVI29 +Field 28 DVI28 +Field 27 DVI27 +Field 26 DVI26 +Field 25 DVI25 +Field 24 DVI24 +Field 23 DVI23 +Field 22 DVI22 +Field 21 DVI21 +Field 20 DVI20 +Field 19 DVI19 +Field 18 DVI18 +Field 17 DVI17 +Field 16 DVI16 +Field 15 DVI15 +Field 14 DVI14 +Field 13 DVI13 +Field 12 DVI12 +Field 11 DVI11 +Field 10 DVI10 +Field 9 DVI9 +Field 8 DVI8 +Field 7 DVI7 +Field 6 DVI6 +Field 5 DVI5 +Field 4 DVI4 +Field 3 DVI3 +Field 2 DVI2 +Field 1 DVI1 +Field 0 DVI0 +EndSysregFields + +Sysreg ICH_PPI_DVIR0_EL2 3 4 12 10 0 +Fields ICH_PPI_DVIRx_EL2 +EndSysreg + +Sysreg ICH_PPI_DVIR1_EL2 3 4 12 10 1 +Fields ICH_PPI_DVIRx_EL2 +EndSysreg + +SysregFields ICH_PPI_ENABLERx_EL2 +Field 63 EN63 +Field 62 EN62 +Field 61 EN61 +Field 60 EN60 +Field 59 EN59 +Field 58 EN58 +Field 57 EN57 +Field 56 EN56 +Field 55 EN55 +Field 54 EN54 +Field 53 EN53 +Field 52 EN52 +Field 51 EN51 +Field 50 EN50 +Field 49 EN49 +Field 48 EN48 +Field 47 EN47 +Field 46 EN46 +Field 45 EN45 +Field 44 EN44 +Field 43 EN43 +Field 42 EN42
+Field 41 EN41 +Field 40 EN40 +Field 39 EN39 +Field 38 EN38 +Field 37 EN37 +Field 36 EN36 +Field 35 EN35 +Field 34 EN34 +Field 33 EN33 +Field 32 EN32 +Field 31 EN31 +Field 30 EN30 +Field 29 EN29 +Field 28 EN28 +Field 27 EN27 +Field 26 EN26 +Field 25 EN25 +Field 24 EN24 +Field 23 EN23 +Field 22 EN22 +Field 21 EN21 +Field 20 EN20 +Field 19 EN19 +Field 18 EN18 +Field 17 EN17 +Field 16 EN16 +Field 15 EN15 +Field 14 EN14 +Field 13 EN13 +Field 12 EN12 +Field 11 EN11 +Field 10 EN10 +Field 9 EN9 +Field 8 EN8 +Field 7 EN7 +Field 6 EN6 +Field 5 EN5 +Field 4 EN4 +Field 3 EN3 +Field 2 EN2 +Field 1 EN1 +Field 0 EN0 +EndSysregFields + +Sysreg ICH_PPI_ENABLER0_EL2 3 4 12 10 2 +Fields ICH_PPI_ENABLERx_EL2 +EndSysreg + +Sysreg ICH_PPI_ENABLER1_EL2 3 4 12 10 3 +Fields ICH_PPI_ENABLERx_EL2 +EndSysreg + +SysregFields ICH_PPI_PENDRx_EL2 +Field 63 PEND63 +Field 62 PEND62 +Field 61 PEND61 +Field 60 PEND60 +Field 59 PEND59 +Field 58 PEND58 +Field 57 PEND57 +Field 56 PEND56 +Field 55 PEND55 +Field 54 PEND54 +Field 53 PEND53 +Field 52 PEND52 +Field 51 PEND51 +Field 50 PEND50 +Field 49 PEND49 +Field 48 PEND48 +Field 47 PEND47 +Field 46 PEND46 +Field 45 PEND45 +Field 44 PEND44 +Field 43 PEND43 +Field 42 PEND42 +Field 41 PEND41 +Field 40 PEND40 +Field 39 PEND39 +Field 38 PEND38 +Field 37 PEND37 +Field 36 PEND36 +Field 35 PEND35 +Field 34 PEND34 +Field 33 PEND33 +Field 32 PEND32 +Field 31 PEND31 +Field 30 PEND30 +Field 29 PEND29 +Field 28 PEND28 +Field 27 PEND27 +Field 26 PEND26 +Field 25 PEND25 +Field 24 PEND24 +Field 23 PEND23 +Field 22 PEND22 +Field 21 PEND21 +Field 20 PEND20 +Field 19 PEND19 +Field 18 PEND18 +Field 17 PEND17 +Field 16 PEND16 +Field 15 PEND15 +Field 14 PEND14 +Field 13 PEND13 +Field 12 PEND12 +Field 11 PEND11 +Field 10 PEND10 +Field 9 PEND9 +Field 8 PEND8 +Field 7 PEND7 +Field 6 PEND6 +Field 5 PEND5 +Field 4 PEND4 +Field 3 PEND3 +Field 2 PEND2 +Field 1 PEND1 +Field 0 PEND0 +EndSysregFields + +Sysreg ICH_PPI_PENDR0_EL2 3 4 12 10 4 +Fields ICH_PPI_PENDRx_EL2 +EndSysreg + 
+Sysreg ICH_PPI_PENDR1_EL2 3 4 12 10 5 +Fields ICH_PPI_PENDRx_EL2 +EndSysreg + +SysregFields ICH_PPI_ACTIVERx_EL2 +Field 63 ACTIVE63 +Field 62 ACTIVE62 +Field 61 ACTIVE61 +Field 60 ACTIVE60 +Field 59 ACTIVE59 +Field 58 ACTIVE58 +Field 57 ACTIVE57 +Field 56 ACTIVE56 +Field 55 ACTIVE55 +Field 54 ACTIVE54 +Field 53 ACTIVE53 +Field 52 ACTIVE52 +Field 51 ACTIVE51 +Field 50 ACTIVE50 +Field 49 ACTIVE49 +Field 48 ACTIVE48 +Field 47 ACTIVE47 +Field 46 ACTIVE46 +Field 45 ACTIVE45 +Field 44 ACTIVE44 +Field 43 ACTIVE43 +Field 42 ACTIVE42 +Field 41 ACTIVE41 +Field 40 ACTIVE40 +Field 39 ACTIVE39 +Field 38 ACTIVE38 +Field 37 ACTIVE37 +Field 36 ACTIVE36 +Field 35 ACTIVE35 +Field 34 ACTIVE34 +Field 33 ACTIVE33 +Field 32 ACTIVE32 +Field 31 ACTIVE31 +Field 30 ACTIVE30 +Field 29 ACTIVE29 +Field 28 ACTIVE28 +Field 27 ACTIVE27 +Field 26 ACTIVE26 +Field 25 ACTIVE25 +Field 24 ACTIVE24 +Field 23 ACTIVE23 +Field 22 ACTIVE22 +Field 21 ACTIVE21 +Field 20 ACTIVE20 +Field 19 ACTIVE19 +Field 18 ACTIVE18 +Field 17 ACTIVE17 +Field 16 ACTIVE16 +Field 15 ACTIVE15 +Field 14 ACTIVE14 +Field 13 ACTIVE13 +Field 12 ACTIVE12 +Field 11 ACTIVE11 +Field 10 ACTIVE10 +Field 9 ACTIVE9 +Field 8 ACTIVE8 +Field 7 ACTIVE7 +Field 6 ACTIVE6 +Field 5 ACTIVE5 +Field 4 ACTIVE4 +Field 3 ACTIVE3 +Field 2 ACTIVE2 +Field 1 ACTIVE1 +Field 0 ACTIVE0 +EndSysregFields + +Sysreg ICH_PPI_ACTIVER0_EL2 3 4 12 10 6 +Fields ICH_PPI_ACTIVERx_EL2 +EndSysreg + +Sysreg ICH_PPI_ACTIVER1_EL2 3 4 12 10 7 +Fields ICH_PPI_ACTIVERx_EL2 +EndSysreg + Sysreg ICH_HCR_EL2 3 4 12 11 0 Res0 63:32 Field 31:27 EOIcount @@ -4789,6 +5174,18 @@ Field 1 V3 Field 0 En EndSysreg +Sysreg ICH_CONTEXTR_EL2 3 4 12 11 6 +Field 63 V +Field 62 F +Field 61 IRICHPPIDIS +Field 60 DB +Field 59:55 DBPM +Res0 54:48 +Field 47:32 VPE +Res0 31:16 +Field 15:0 VM +EndSysreg + Sysreg ICH_VMCR_EL2 3 4 12 11 7 Prefix FEAT_GCIE Res0 63:32 @@ -4810,6 +5207,89 @@ Field 1 VENG1 Field 0 VENG0 EndSysreg +SysregFields ICH_PPI_PRIORITYRx_EL2 +Res0 63:61 +Field 60:56 Priority7 +Res0 
55:53 +Field 52:48 Priority6 +Res0 47:45 +Field 44:40 Priority5 +Res0 39:37 +Field 36:32 Priority4 +Res0 31:29 +Field 28:24 Priority3 +Res0 23:21 +Field 20:16 Priority2 +Res0 15:13 +Field 12:8 Priority1 +Res0 7:5 +Field 4:0 Priority0 +EndSysregFields + +Sysreg ICH_PPI_PRIORITYR0_EL2 3 4 12 14 0 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR1_EL2 3 4 12 14 1 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR2_EL2 3 4 12 14 2 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR3_EL2 3 4 12 14 3 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR4_EL2 3 4 12 14 4 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR5_EL2 3 4 12 14 5 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR6_EL2 3 4 12 14 6 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR7_EL2 3 4 12 14 7 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR8_EL2 3 4 12 15 0 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR9_EL2 3 4 12 15 1 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR10_EL2 3 4 12 15 2 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR11_EL2 3 4 12 15 3 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR12_EL2 3 4 12 15 4 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR13_EL2 3 4 12 15 5 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR14_EL2 3 4 12 15 6 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + +Sysreg ICH_PPI_PRIORITYR15_EL2 3 4 12 15 7 +Fields ICH_PPI_PRIORITYRx_EL2 +EndSysreg + Sysreg CONTEXTIDR_EL2 3 4 13 0 1 Fields CONTEXTIDR_ELx EndSysreg From 59991153f026766447bea14d85439555b6bf9164 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:51:32 +0000 Subject: [PATCH 148/373] arm64/sysreg: Add GICR CDNMIA encoding The encoding for the GICR CDNMIA system instruction is thus far unused (and shall remain 
unused for the time being). However, in order to plumb the FGTs into KVM correctly, KVM needs to be made aware of the encoding of this system instruction. Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-8-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/sysreg.h | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h index f4436ecc630c..938cdb248f83 100644 --- a/arch/arm64/include/asm/sysreg.h +++ b/arch/arm64/include/asm/sysreg.h @@ -1052,6 +1052,7 @@ #define GICV5_OP_GIC_CDPRI sys_insn(1, 0, 12, 1, 2) #define GICV5_OP_GIC_CDRCFG sys_insn(1, 0, 12, 1, 5) #define GICV5_OP_GICR_CDIA sys_insn(1, 0, 12, 3, 0) +#define GICV5_OP_GICR_CDNMIA sys_insn(1, 0, 12, 3, 1) /* Definitions for GIC CDAFF */ #define GICV5_GIC_CDAFF_IAFFID_MASK GENMASK_ULL(47, 32) @@ -1098,6 +1099,12 @@ #define GICV5_GIC_CDIA_TYPE_MASK GENMASK_ULL(31, 29) #define GICV5_GIC_CDIA_ID_MASK GENMASK_ULL(23, 0) +/* Definitions for GICR CDNMIA */ +#define GICV5_GICR_CDNMIA_VALID_MASK BIT_ULL(32) +#define GICV5_GICR_CDNMIA_VALID(r) FIELD_GET(GICV5_GICR_CDNMIA_VALID_MASK, r) +#define GICV5_GICR_CDNMIA_TYPE_MASK GENMASK_ULL(31, 29) +#define GICV5_GICR_CDNMIA_ID_MASK GENMASK_ULL(23, 0) + #define gicr_insn(insn) read_sysreg_s(GICV5_OP_GICR_##insn) #define gic_insn(v, insn) write_sysreg_s(v, GICV5_OP_GIC_##insn) From c547c51ff4d44c787330506737c5ce7808e536cc Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:51:47 +0000 Subject: [PATCH 149/373] KVM: arm64: gic-v5: Add ARM_VGIC_V5 device to KVM headers This is the base GICv5 device which is to be used with the KVM_CREATE_DEVICE ioctl to create a GICv5-based vgic. 
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-9-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- include/uapi/linux/kvm.h | 2 ++ tools/include/uapi/linux/kvm.h | 2 ++ 2 files changed, 4 insertions(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 80364d4dbebb..d0c0c8605976 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1224,6 +1224,8 @@ enum kvm_device_type { #define KVM_DEV_TYPE_LOONGARCH_EIOINTC KVM_DEV_TYPE_LOONGARCH_EIOINTC KVM_DEV_TYPE_LOONGARCH_PCHPIC, #define KVM_DEV_TYPE_LOONGARCH_PCHPIC KVM_DEV_TYPE_LOONGARCH_PCHPIC + KVM_DEV_TYPE_ARM_VGIC_V5, +#define KVM_DEV_TYPE_ARM_VGIC_V5 KVM_DEV_TYPE_ARM_VGIC_V5 KVM_DEV_TYPE_MAX, diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h index 65500f5db379..713e4360eca0 100644 --- a/tools/include/uapi/linux/kvm.h +++ b/tools/include/uapi/linux/kvm.h @@ -1220,6 +1220,8 @@ enum kvm_device_type { #define KVM_DEV_TYPE_LOONGARCH_EIOINTC KVM_DEV_TYPE_LOONGARCH_EIOINTC KVM_DEV_TYPE_LOONGARCH_PCHPIC, #define KVM_DEV_TYPE_LOONGARCH_PCHPIC KVM_DEV_TYPE_LOONGARCH_PCHPIC + KVM_DEV_TYPE_ARM_VGIC_V5, +#define KVM_DEV_TYPE_ARM_VGIC_V5 KVM_DEV_TYPE_ARM_VGIC_V5 KVM_DEV_TYPE_MAX, From eb8bce08ecb12fa0e76af23432f1adb162248ca6 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:52:03 +0000 Subject: [PATCH 150/373] KVM: arm64: gic: Introduce interrupt type helpers GICv5 has moved from using interrupt ranges for different interrupt types to using some of the upper bits of the interrupt ID to denote the interrupt type. This is not compatible with older GICs (which rely on ranges of interrupts to determine the type), and hence a set of helpers is introduced. These helpers take a struct kvm*, and use the vgic model to determine how to interpret the interrupt ID. Helpers are introduced for PPIs, SPIs, and LPIs. 
Additionally, a helper is introduced to determine if an interrupt is private - SGIs and PPIs for older GICs, and PPIs only for GICv5. In addition, vgic_is_v5() is introduced (which unsurprisingly returns true when running a GICv5 guest), and the existing vgic_is_v3() check is moved from vgic.h to arm_vgic.h (to live alongside the vgic_is_v5() one), and has been converted into a macro. The helpers are plumbed into the core vgic code, as well as the Arch Timer and PMU code. There should be no functional changes as part of this change. Signed-off-by: Sascha Bischoff Reviewed-by: Joey Gouly Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-10-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/arch_timer.c | 2 +- arch/arm64/kvm/pmu-emul.c | 7 +- arch/arm64/kvm/vgic/vgic-kvm-device.c | 2 +- arch/arm64/kvm/vgic/vgic.c | 14 ++-- arch/arm64/kvm/vgic/vgic.h | 5 -- include/kvm/arm_vgic.h | 102 ++++++++++++++++++++++++-- 6 files changed, 110 insertions(+), 22 deletions(-) diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c index d31bc1e7a13c..92870ee6dacd 100644 --- a/arch/arm64/kvm/arch_timer.c +++ b/arch/arm64/kvm/arch_timer.c @@ -1603,7 +1603,7 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) if (get_user(irq, uaddr)) return -EFAULT; - if (!(irq_is_ppi(irq))) + if (!(irq_is_ppi(vcpu->kvm, irq))) return -EINVAL; mutex_lock(&vcpu->kvm->arch.config_lock); diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c index 93cc9bbb5cec..41a3c5dc2bca 100644 --- a/arch/arm64/kvm/pmu-emul.c +++ b/arch/arm64/kvm/pmu-emul.c @@ -939,7 +939,8 @@ int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu) * number against the dimensions of the vgic and make sure * it's valid.
*/ - if (!irq_is_ppi(irq) && !vgic_valid_spi(vcpu->kvm, irq)) + if (!irq_is_ppi(vcpu->kvm, irq) && + !vgic_valid_spi(vcpu->kvm, irq)) return -EINVAL; } else if (kvm_arm_pmu_irq_initialized(vcpu)) { return -EINVAL; @@ -991,7 +992,7 @@ static bool pmu_irq_is_valid(struct kvm *kvm, int irq) if (!kvm_arm_pmu_irq_initialized(vcpu)) continue; - if (irq_is_ppi(irq)) { + if (irq_is_ppi(vcpu->kvm, irq)) { if (vcpu->arch.pmu.irq_num != irq) return false; } else { @@ -1142,7 +1143,7 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) return -EFAULT; /* The PMU overflow interrupt can be a PPI or a valid SPI. */ - if (!(irq_is_ppi(irq) || irq_is_spi(irq))) + if (!(irq_is_ppi(vcpu->kvm, irq) || irq_is_spi(vcpu->kvm, irq))) return -EINVAL; if (!pmu_irq_is_valid(kvm, irq)) diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c b/arch/arm64/kvm/vgic/vgic-kvm-device.c index 3d1a776b716d..b12ba99a423e 100644 --- a/arch/arm64/kvm/vgic/vgic-kvm-device.c +++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c @@ -639,7 +639,7 @@ static int vgic_v3_set_attr(struct kvm_device *dev, if (vgic_initialized(dev->kvm)) return -EBUSY; - if (!irq_is_ppi(val)) + if (!irq_is_ppi(dev->kvm, val)) return -EINVAL; dev->kvm->arch.vgic.mi_intid = val; diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c index 9e0d26348a2a..2f3f892cbddc 100644 --- a/arch/arm64/kvm/vgic/vgic.c +++ b/arch/arm64/kvm/vgic/vgic.c @@ -94,7 +94,7 @@ struct vgic_irq *vgic_get_irq(struct kvm *kvm, u32 intid) } /* LPIs */ - if (intid >= VGIC_MIN_LPI) + if (irq_is_lpi(kvm, intid)) return vgic_get_lpi(kvm, intid); return NULL; @@ -123,7 +123,7 @@ static void vgic_release_lpi_locked(struct vgic_dist *dist, struct vgic_irq *irq static __must_check bool __vgic_put_irq(struct kvm *kvm, struct vgic_irq *irq) { - if (irq->intid < VGIC_MIN_LPI) + if (!irq_is_lpi(kvm, irq->intid)) return false; return refcount_dec_and_test(&irq->refcount); @@ -148,7 +148,7 @@ void vgic_put_irq(struct kvm *kvm, struct vgic_irq 
*irq) * Acquire/release it early on lockdep kernels to make locking issues * in rare release paths a bit more obvious. */ - if (IS_ENABLED(CONFIG_LOCKDEP) && irq->intid >= VGIC_MIN_LPI) { + if (IS_ENABLED(CONFIG_LOCKDEP) && irq_is_lpi(kvm, irq->intid)) { guard(spinlock_irqsave)(&dist->lpi_xa.xa_lock); } @@ -186,7 +186,7 @@ void vgic_flush_pending_lpis(struct kvm_vcpu *vcpu) raw_spin_lock_irqsave(&vgic_cpu->ap_list_lock, flags); list_for_each_entry_safe(irq, tmp, &vgic_cpu->ap_list_head, ap_list) { - if (irq->intid >= VGIC_MIN_LPI) { + if (irq_is_lpi(vcpu->kvm, irq->intid)) { raw_spin_lock(&irq->irq_lock); list_del(&irq->ap_list); irq->vcpu = NULL; @@ -521,12 +521,12 @@ int kvm_vgic_inject_irq(struct kvm *kvm, struct kvm_vcpu *vcpu, if (ret) return ret; - if (!vcpu && intid < VGIC_NR_PRIVATE_IRQS) + if (!vcpu && irq_is_private(kvm, intid)) return -EINVAL; trace_vgic_update_irq_pending(vcpu ? vcpu->vcpu_idx : 0, intid, level); - if (intid < VGIC_NR_PRIVATE_IRQS) + if (irq_is_private(kvm, intid)) irq = vgic_get_vcpu_irq(vcpu, intid); else irq = vgic_get_irq(kvm, intid); @@ -700,7 +700,7 @@ int kvm_vgic_set_owner(struct kvm_vcpu *vcpu, unsigned int intid, void *owner) return -EAGAIN; /* SGIs and LPIs cannot be wired up to any device */ - if (!irq_is_ppi(intid) && !vgic_valid_spi(vcpu->kvm, intid)) + if (!irq_is_ppi(vcpu->kvm, intid) && !vgic_valid_spi(vcpu->kvm, intid)) return -EINVAL; irq = vgic_get_vcpu_irq(vcpu, intid); diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h index 0bb8fa10bb4e..f2924f821197 100644 --- a/arch/arm64/kvm/vgic/vgic.h +++ b/arch/arm64/kvm/vgic/vgic.h @@ -454,11 +454,6 @@ void vgic_v3_put_nested(struct kvm_vcpu *vcpu); void vgic_v3_handle_nested_maint_irq(struct kvm_vcpu *vcpu); void vgic_v3_nested_update_mi(struct kvm_vcpu *vcpu); -static inline bool vgic_is_v3(struct kvm *kvm) -{ - return kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3; -} - static inline bool vgic_host_has_gicv3(void) { /* diff --git 
a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index 46262d1433bc..b8011b395796 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -19,6 +19,7 @@ #include #include +#include #define VGIC_V3_MAX_CPUS 512 #define VGIC_V2_MAX_CPUS 8 @@ -31,9 +32,88 @@ #define VGIC_MIN_LPI 8192 #define KVM_IRQCHIP_NUM_PINS (1020 - 32) -#define irq_is_ppi(irq) ((irq) >= VGIC_NR_SGIS && (irq) < VGIC_NR_PRIVATE_IRQS) -#define irq_is_spi(irq) ((irq) >= VGIC_NR_PRIVATE_IRQS && \ - (irq) <= VGIC_MAX_SPI) +#define is_v5_type(t, i) (FIELD_GET(GICV5_HWIRQ_TYPE, (i)) == (t)) + +#define __irq_is_sgi(t, i) \ + ({ \ + bool __ret; \ + \ + switch (t) { \ + case KVM_DEV_TYPE_ARM_VGIC_V5: \ + __ret = false; \ + break; \ + default: \ + __ret = (i) < VGIC_NR_SGIS; \ + } \ + \ + __ret; \ + }) + +#define __irq_is_ppi(t, i) \ + ({ \ + bool __ret; \ + \ + switch (t) { \ + case KVM_DEV_TYPE_ARM_VGIC_V5: \ + __ret = is_v5_type(GICV5_HWIRQ_TYPE_PPI, (i)); \ + break; \ + default: \ + __ret = (i) >= VGIC_NR_SGIS; \ + __ret &= (i) < VGIC_NR_PRIVATE_IRQS; \ + } \ + \ + __ret; \ + }) + +#define __irq_is_spi(t, i) \ + ({ \ + bool __ret; \ + \ + switch (t) { \ + case KVM_DEV_TYPE_ARM_VGIC_V5: \ + __ret = is_v5_type(GICV5_HWIRQ_TYPE_SPI, (i)); \ + break; \ + default: \ + __ret = (i) <= VGIC_MAX_SPI; \ + __ret &= (i) >= VGIC_NR_PRIVATE_IRQS; \ + } \ + \ + __ret; \ + }) + +#define __irq_is_lpi(t, i) \ + ({ \ + bool __ret; \ + \ + switch (t) { \ + case KVM_DEV_TYPE_ARM_VGIC_V5: \ + __ret = is_v5_type(GICV5_HWIRQ_TYPE_LPI, (i)); \ + break; \ + default: \ + __ret = (i) >= 8192; \ + } \ + \ + __ret; \ + }) + +#define irq_is_sgi(k, i) __irq_is_sgi((k)->arch.vgic.vgic_model, i) +#define irq_is_ppi(k, i) __irq_is_ppi((k)->arch.vgic.vgic_model, i) +#define irq_is_spi(k, i) __irq_is_spi((k)->arch.vgic.vgic_model, i) +#define irq_is_lpi(k, i) __irq_is_lpi((k)->arch.vgic.vgic_model, i) + +#define irq_is_private(k, i) (irq_is_ppi(k, i) || irq_is_sgi(k, i)) + +#define vgic_v5_get_hwirq_id(x) 
FIELD_GET(GICV5_HWIRQ_ID, (x)) +#define vgic_v5_set_hwirq_id(x) FIELD_PREP(GICV5_HWIRQ_ID, (x)) + +#define __vgic_v5_set_type(t) (FIELD_PREP(GICV5_HWIRQ_TYPE, GICV5_HWIRQ_TYPE_##t)) +#define vgic_v5_make_ppi(x) (__vgic_v5_set_type(PPI) | vgic_v5_set_hwirq_id(x)) +#define vgic_v5_make_spi(x) (__vgic_v5_set_type(SPI) | vgic_v5_set_hwirq_id(x)) +#define vgic_v5_make_lpi(x) (__vgic_v5_set_type(LPI) | vgic_v5_set_hwirq_id(x)) + +#define __vgic_is_v(k, v) ((k)->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V##v) +#define vgic_is_v3(k) (__vgic_is_v(k, 3)) +#define vgic_is_v5(k) (__vgic_is_v(k, 5)) enum vgic_type { VGIC_V2, /* Good ol' GICv2 */ @@ -417,8 +497,20 @@ u64 vgic_v3_get_misr(struct kvm_vcpu *vcpu); #define irqchip_in_kernel(k) (!!((k)->arch.vgic.in_kernel)) #define vgic_initialized(k) ((k)->arch.vgic.initialized) -#define vgic_valid_spi(k, i) (((i) >= VGIC_NR_PRIVATE_IRQS) && \ - ((i) < (k)->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS)) +#define vgic_valid_spi(k, i) \ + ({ \ + bool __ret = irq_is_spi(k, i); \ + \ + switch ((k)->arch.vgic.vgic_model) { \ + case KVM_DEV_TYPE_ARM_VGIC_V5: \ + __ret &= FIELD_GET(GICV5_HWIRQ_ID, i) < (k)->arch.vgic.nr_spis; \ + break; \ + default: \ + __ret &= (i) < ((k)->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS); \ + } \ + \ + __ret; \ + }) bool kvm_vcpu_has_pending_irqs(struct kvm_vcpu *vcpu); void kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu); From da92ff15ca4c5b0f75ec1cb3d2e275db2ff2c810 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:52:19 +0000 Subject: [PATCH 151/373] KVM: arm64: gic-v5: Add Arm copyright header This header was mistakenly omitted during the creation of this file. Add it now. Better late than never. 
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-11-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-v5.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 331651087e2c..9d9aa5774e63 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -1,4 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025, 2026 Arm Ltd. + */ #include #include From f656807150e3e1c6f76cab918e5adfad6d881d58 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:52:34 +0000 Subject: [PATCH 152/373] KVM: arm64: gic-v5: Detect implemented PPIs on boot As part of booting the system and initialising KVM, create and populate a mask of the implemented PPIs. This mask allows future PPI operations (such as save/restore of state, or syncing back into the shadow state) to only consider PPIs that are actually implemented on the host. The set of implemented virtual PPIs matches the set of implemented physical PPIs for a GICv5 host. Therefore, this mask represents all PPIs that could ever be used by a GICv5-based guest on a specific host, albeit pre-filtered by what we support in KVM (see next paragraph). Only architected PPIs are currently supported in KVM with GICv5. Moreover, as KVM only supports a subset of all possible PPIs (Timers, PMU, GICv5 SW_PPI), the PPI mask only includes these PPIs, if present. The timers are always assumed to be present; if we have KVM we have EL2, which means that we have the EL1 & EL2 Timer PPIs. If we have a PMU (v3), then the PMUIRQ is present. The GICv5 SW_PPI is always assumed to be present.
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-12-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-v5.c | 31 ++++++++++++++++++++++++++++++ include/kvm/arm_vgic.h | 13 +++++++++++++ include/linux/irqchip/arm-gic-v5.h | 22 +++++++++++++++++++++ 3 files changed, 66 insertions(+) diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 9d9aa5774e63..cf8382a954bb 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -4,10 +4,39 @@ */ #include + +#include #include #include "vgic.h" +static struct vgic_v5_ppi_caps ppi_caps; + +/* + * Not all PPIs are guaranteed to be implemented for GICv5. Determine which + * ones are, and generate a mask. + */ +static void vgic_v5_get_implemented_ppis(void) +{ + if (!cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF)) + return; + + /* + * If we have KVM, we have EL2, which means that we have support for the + * EL1 and EL2 Physical & Virtual timers. + */ + __assign_bit(GICV5_ARCH_PPI_CNTHP, ppi_caps.impl_ppi_mask, 1); + __assign_bit(GICV5_ARCH_PPI_CNTV, ppi_caps.impl_ppi_mask, 1); + __assign_bit(GICV5_ARCH_PPI_CNTHV, ppi_caps.impl_ppi_mask, 1); + __assign_bit(GICV5_ARCH_PPI_CNTP, ppi_caps.impl_ppi_mask, 1); + + /* The SW_PPI should be available */ + __assign_bit(GICV5_ARCH_PPI_SW_PPI, ppi_caps.impl_ppi_mask, 1); + + /* The PMUIRQ is available if we have the PMU */ + __assign_bit(GICV5_ARCH_PPI_PMUIRQ, ppi_caps.impl_ppi_mask, system_supports_pmuv3()); +} + /* * Probe for a vGICv5 compatible interrupt controller, returning 0 on success.
* Currently only supports GICv3-based VMs on a GICv5 host, and hence only @@ -18,6 +47,8 @@ int vgic_v5_probe(const struct gic_kvm_info *info) u64 ich_vtr_el2; int ret; + vgic_v5_get_implemented_ppis(); + if (!cpus_have_final_cap(ARM64_HAS_GICV5_LEGACY)) return -ENODEV; diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index b8011b395796..0fabeabedd6d 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -32,6 +32,14 @@ #define VGIC_MIN_LPI 8192 #define KVM_IRQCHIP_NUM_PINS (1020 - 32) +/* + * GICv5 supports 128 PPIs, but only the first 64 are architected. We only + * support the timers and PMU in KVM, both of which are architected. Rather than + * handling twice the state, we instead opt to only support the architected set + * in KVM for now. At a future stage, this can be bumped up to 128, if required. + */ +#define VGIC_V5_NR_PRIVATE_IRQS 64 + #define is_v5_type(t, i) (FIELD_GET(GICV5_HWIRQ_TYPE, (i)) == (t)) #define __irq_is_sgi(t, i) \ @@ -420,6 +428,11 @@ struct vgic_v3_cpu_if { unsigned int used_lrs; }; +/* What PPI capabilities does a GICv5 host have */ +struct vgic_v5_ppi_caps { + DECLARE_BITMAP(impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS); +}; + struct vgic_cpu { /* CPU vif control registers for world switch */ union { diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h index b78488df6c98..b1566a7c93ec 100644 --- a/include/linux/irqchip/arm-gic-v5.h +++ b/include/linux/irqchip/arm-gic-v5.h @@ -24,6 +24,28 @@ #define GICV5_HWIRQ_TYPE_LPI UL(0x2) #define GICV5_HWIRQ_TYPE_SPI UL(0x3) +/* + * Architected PPIs + */ +#define GICV5_ARCH_PPI_S_DB_PPI 0x0 +#define GICV5_ARCH_PPI_RL_DB_PPI 0x1 +#define GICV5_ARCH_PPI_NS_DB_PPI 0x2 +#define GICV5_ARCH_PPI_SW_PPI 0x3 +#define GICV5_ARCH_PPI_HACDBSIRQ 0xf +#define GICV5_ARCH_PPI_CNTHVS 0x13 +#define GICV5_ARCH_PPI_CNTHPS 0x14 +#define GICV5_ARCH_PPI_PMBIRQ 0x15 +#define GICV5_ARCH_PPI_COMMIRQ 0x16 +#define GICV5_ARCH_PPI_PMUIRQ 0x17 +#define GICV5_ARCH_PPI_CTIIRQ 
0x18 +#define GICV5_ARCH_PPI_GICMNT 0x19 +#define GICV5_ARCH_PPI_CNTHP 0x1a +#define GICV5_ARCH_PPI_CNTV 0x1b +#define GICV5_ARCH_PPI_CNTHV 0x1c +#define GICV5_ARCH_PPI_CNTPS 0x1d +#define GICV5_ARCH_PPI_CNTP 0x1e +#define GICV5_ARCH_PPI_TRBIRQ 0x1f + /* * Tables attributes */ From a258a383b91774ac646517ec1003a442964d8946 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:52:50 +0000 Subject: [PATCH 153/373] KVM: arm64: gic-v5: Sanitize ID_AA64PFR2_EL1.GCIE Add a sanitisation function for ID_AA64PFR2_EL1, preserving the already-present behaviour for the FPMR, MTEFAR, and MTESTOREONLY fields. Add sanitisation for the GCIE field, which is set to IMP if the host supports GICv5 guests, and to NI otherwise. Extend the sanitisation that takes place in kvm_vgic_create() to zero the ID_AA64PFR2.GCIE field when a non-GICv5 GIC is created. More importantly, move this sanitisation to a separate function, kvm_vgic_finalize_sysregs(), and call it from kvm_finalize_sys_regs(). We are required to finalize the GIC and GCIE fields a second time in kvm_finalize_sys_regs() due to how QEMU blindly reads out then verbatim restores the system register state. This avoids the issue where both the GCIE and GIC features are marked as present (an architecturally invalid combination), and hence guests fall over. See the comment in kvm_finalize_sys_regs() for more details. Overall, the following happens: * Before an irqchip is created, FEAT_GCIE is presented if the host supports GICv5-based guests. * Once an irqchip is created, all other supported irqchips are hidden from the guest; system register state reflects the guest's irqchip. * Userspace is allowed to set invalid irqchip feature combinations in the system registers, but... * ...invalid combinations are removed a second time prior to the first run of the guest, and things hopefully just work.
All of this extra work is required to make sure that "legacy" GICv3 guests based on QEMU transparently work on compatible GICv5 hosts without modification. Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-13-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/sys_regs.c | 70 +++++++++++++++++++++++++++++---- arch/arm64/kvm/vgic/vgic-init.c | 49 ++++++++++++++++------- include/kvm/arm_vgic.h | 1 + 3 files changed, 98 insertions(+), 22 deletions(-) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 42c84b7900ff..140cf35f4eeb 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -1758,6 +1758,7 @@ static u8 pmuver_to_perfmon(u8 pmuver) static u64 sanitise_id_aa64pfr0_el1(const struct kvm_vcpu *vcpu, u64 val); static u64 sanitise_id_aa64pfr1_el1(const struct kvm_vcpu *vcpu, u64 val); +static u64 sanitise_id_aa64pfr2_el1(const struct kvm_vcpu *vcpu, u64 val); static u64 sanitise_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val); /* Read a sanitised cpufeature ID register by sys_reg_desc */ @@ -1783,10 +1784,7 @@ static u64 __kvm_read_sanitised_id_reg(const struct kvm_vcpu *vcpu, val = sanitise_id_aa64pfr1_el1(vcpu, val); break; case SYS_ID_AA64PFR2_EL1: - val &= ID_AA64PFR2_EL1_FPMR | - (kvm_has_mte(vcpu->kvm) ? 
- ID_AA64PFR2_EL1_MTEFAR | ID_AA64PFR2_EL1_MTESTOREONLY : - 0); + val = sanitise_id_aa64pfr2_el1(vcpu, val); break; case SYS_ID_AA64ISAR1_EL1: if (!vcpu_has_ptrauth(vcpu)) @@ -2027,6 +2025,23 @@ static u64 sanitise_id_aa64pfr1_el1(const struct kvm_vcpu *vcpu, u64 val) return val; } +static u64 sanitise_id_aa64pfr2_el1(const struct kvm_vcpu *vcpu, u64 val) +{ + val &= ID_AA64PFR2_EL1_FPMR | + ID_AA64PFR2_EL1_MTEFAR | + ID_AA64PFR2_EL1_MTESTOREONLY; + + if (!kvm_has_mte(vcpu->kvm)) { + val &= ~ID_AA64PFR2_EL1_MTEFAR; + val &= ~ID_AA64PFR2_EL1_MTESTOREONLY; + } + + if (vgic_host_has_gicv5()) + val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR2_EL1, GCIE, IMP); + + return val; +} + static u64 sanitise_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val) { val = ID_REG_LIMIT_FIELD_ENUM(val, ID_AA64DFR0_EL1, DebugVer, V8P8); @@ -2216,6 +2231,12 @@ static int set_id_aa64pfr1_el1(struct kvm_vcpu *vcpu, return set_id_reg(vcpu, rd, user_val); } +static int set_id_aa64pfr2_el1(struct kvm_vcpu *vcpu, + const struct sys_reg_desc *rd, u64 user_val) +{ + return set_id_reg(vcpu, rd, user_val); +} + /* * Allow userspace to de-feature a stage-2 translation granule but prevent it * from claiming the impossible. 
@@ -3197,10 +3218,11 @@ static const struct sys_reg_desc sys_reg_descs[] = { ID_AA64PFR1_EL1_RES0 | ID_AA64PFR1_EL1_MPAM_frac | ID_AA64PFR1_EL1_MTE)), - ID_WRITABLE(ID_AA64PFR2_EL1, - ID_AA64PFR2_EL1_FPMR | - ID_AA64PFR2_EL1_MTEFAR | - ID_AA64PFR2_EL1_MTESTOREONLY), + ID_FILTERED(ID_AA64PFR2_EL1, id_aa64pfr2_el1, + ~(ID_AA64PFR2_EL1_FPMR | + ID_AA64PFR2_EL1_MTEFAR | + ID_AA64PFR2_EL1_MTESTOREONLY | + ID_AA64PFR2_EL1_GCIE)), ID_UNALLOCATED(4,3), ID_WRITABLE(ID_AA64ZFR0_EL1, ~ID_AA64ZFR0_EL1_RES0), ID_HIDDEN(ID_AA64SMFR0_EL1), @@ -5671,8 +5693,40 @@ int kvm_finalize_sys_regs(struct kvm_vcpu *vcpu) val = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1) & ~ID_AA64PFR0_EL1_GIC; kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, val); + val = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1) & ~ID_AA64PFR2_EL1_GCIE; + kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1, val); val = kvm_read_vm_id_reg(kvm, SYS_ID_PFR1_EL1) & ~ID_PFR1_EL1_GIC; kvm_set_vm_id_reg(kvm, SYS_ID_PFR1_EL1, val); + } else { + /* + * Certain userspace software - QEMU - samples the system + * register state without creating an irqchip, then blindly + * restores the state prior to running the final guest. This + * means that it restores the virtualization & emulation + * capabilities of the host system, rather than something that + * reflects the final guest state. Moreover, it checks that the + * state was "correctly" restored (i.e., verbatim), bailing if + * it isn't, so masking off invalid state isn't an option. + * + * On GICv5 hardware that supports FEAT_GCIE_LEGACY we can run + * both GICv3- and GICv5-based guests. Therefore, we initially + * present both ID_AA64PFR0.GIC and ID_AA64PFR2.GCIE as IMP to + * reflect that userspace can create EITHER a vGICv3 or a + * vGICv5. This is an architecturally invalid combination, of + * course. Once an in-kernel GIC is created, the sysreg state is + * updated to reflect the actual, valid configuration. 
+ * + * Setting both the GIC and GCIE features to IMP unsurprisingly + * results in guests falling over, and hence we need to fix up + * this mess in KVM. Before running for the first time we yet + * again ensure that the GIC and GCIE fields accurately reflect + * the actual hardware the guest should see. + * + * This hack allows legacy QEMU-based GICv3 guests to run + * unmodified on compatible GICv5 hosts, and avoids the inverse + * problem for GICv5-based guests in the future. + */ + kvm_vgic_finalize_idregs(kvm); } if (vcpu_has_nv(vcpu)) { diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c index e9b8b5fc480c..e1be9c5ada7b 100644 --- a/arch/arm64/kvm/vgic/vgic-init.c +++ b/arch/arm64/kvm/vgic/vgic-init.c @@ -71,7 +71,6 @@ static int vgic_allocate_private_irqs_locked(struct kvm_vcpu *vcpu, u32 type); int kvm_vgic_create(struct kvm *kvm, u32 type) { struct kvm_vcpu *vcpu; - u64 aa64pfr0, pfr1; unsigned long i; int ret; @@ -145,19 +144,11 @@ int kvm_vgic_create(struct kvm *kvm, u32 type) kvm->arch.vgic.implementation_rev = KVM_VGIC_IMP_REV_LATEST; kvm->arch.vgic.vgic_dist_base = VGIC_ADDR_UNDEF; - aa64pfr0 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1) & ~ID_AA64PFR0_EL1_GIC; - pfr1 = kvm_read_vm_id_reg(kvm, SYS_ID_PFR1_EL1) & ~ID_PFR1_EL1_GIC; - - if (type == KVM_DEV_TYPE_ARM_VGIC_V2) { - kvm->arch.vgic.vgic_cpu_base = VGIC_ADDR_UNDEF; - } else { - INIT_LIST_HEAD(&kvm->arch.vgic.rd_regions); - aa64pfr0 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP); - pfr1 |= SYS_FIELD_PREP_ENUM(ID_PFR1_EL1, GIC, GICv3); - } - - kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, aa64pfr0); - kvm_set_vm_id_reg(kvm, SYS_ID_PFR1_EL1, pfr1); + /* + * We've now created the GIC. Update the system register state + * to accurately reflect what we've created. 
+ */ + kvm_vgic_finalize_idregs(kvm); kvm_for_each_vcpu(i, vcpu, kvm) { ret = vgic_allocate_private_irqs_locked(vcpu, type); @@ -617,6 +608,36 @@ out_slots: return ret; } +void kvm_vgic_finalize_idregs(struct kvm *kvm) +{ + u32 type = kvm->arch.vgic.vgic_model; + u64 aa64pfr0, aa64pfr2, pfr1; + + aa64pfr0 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1) & ~ID_AA64PFR0_EL1_GIC; + aa64pfr2 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1) & ~ID_AA64PFR2_EL1_GCIE; + pfr1 = kvm_read_vm_id_reg(kvm, SYS_ID_PFR1_EL1) & ~ID_PFR1_EL1_GIC; + + switch (type) { + case KVM_DEV_TYPE_ARM_VGIC_V2: + kvm->arch.vgic.vgic_cpu_base = VGIC_ADDR_UNDEF; + break; + case KVM_DEV_TYPE_ARM_VGIC_V3: + INIT_LIST_HEAD(&kvm->arch.vgic.rd_regions); + aa64pfr0 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP); + pfr1 |= SYS_FIELD_PREP_ENUM(ID_PFR1_EL1, GIC, GICv3); + break; + case KVM_DEV_TYPE_ARM_VGIC_V5: + aa64pfr2 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR2_EL1, GCIE, IMP); + break; + default: + WARN_ONCE(1, "Unknown VGIC type!!!\n"); + } + + kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, aa64pfr0); + kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1, aa64pfr2); + kvm_set_vm_id_reg(kvm, SYS_ID_PFR1_EL1, pfr1); +} + /* GENERIC PROBE */ void kvm_vgic_cpu_up(void) diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index 0fabeabedd6d..24969fa8d02d 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -485,6 +485,7 @@ int kvm_vgic_create(struct kvm *kvm, u32 type); void kvm_vgic_destroy(struct kvm *kvm); void kvm_vgic_vcpu_destroy(struct kvm_vcpu *vcpu); int kvm_vgic_map_resources(struct kvm *kvm); +void kvm_vgic_finalize_idregs(struct kvm *kvm); int kvm_vgic_hyp_init(void); void kvm_vgic_init_cpu_hardware(void); From 9d6d9514c08f462d162040b48526bda60def9de1 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:53:05 +0000 Subject: [PATCH 154/373] KVM: arm64: gic-v5: Support GICv5 FGTs & FGUs Extend the existing FGT/FGU infrastructure to include the GICv5 trap registers 
(ICH_HFGRTR_EL2, ICH_HFGWTR_EL2, ICH_HFGITR_EL2). This involves mapping the trap registers and their bits to the corresponding feature that introduces them (FEAT_GCIE for all, in this case), and mapping each trap bit to the system register/instruction controlled by it. As of this change, none of the GICv5 instructions or register accesses are being trapped. Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-14-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_host.h | 19 +++++ arch/arm64/include/asm/vncr_mapping.h | 3 + arch/arm64/kvm/arm.c | 3 + arch/arm64/kvm/config.c | 97 +++++++++++++++++++++++-- arch/arm64/kvm/emulate-nested.c | 68 +++++++++++++++++ arch/arm64/kvm/hyp/include/hyp/switch.h | 27 +++++++ arch/arm64/kvm/hyp/nvhe/switch.c | 3 + arch/arm64/kvm/sys_regs.c | 2 + 8 files changed, 215 insertions(+), 7 deletions(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 70cb9cfd760a..64a1ee6c442f 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -287,6 +287,9 @@ enum fgt_group_id { HDFGRTR2_GROUP, HDFGWTR2_GROUP = HDFGRTR2_GROUP, HFGITR2_GROUP, + ICH_HFGRTR_GROUP, + ICH_HFGWTR_GROUP = ICH_HFGRTR_GROUP, + ICH_HFGITR_GROUP, /* Must be last */ __NR_FGT_GROUP_IDS__ @@ -620,6 +623,10 @@ enum vcpu_sysreg { VNCR(ICH_HCR_EL2), VNCR(ICH_VMCR_EL2), + VNCR(ICH_HFGRTR_EL2), + VNCR(ICH_HFGWTR_EL2), + VNCR(ICH_HFGITR_EL2), + NR_SYS_REGS /* Nothing after this line! 
*/ }; @@ -675,6 +682,9 @@ extern struct fgt_masks hfgwtr2_masks; extern struct fgt_masks hfgitr2_masks; extern struct fgt_masks hdfgrtr2_masks; extern struct fgt_masks hdfgwtr2_masks; +extern struct fgt_masks ich_hfgrtr_masks; +extern struct fgt_masks ich_hfgwtr_masks; +extern struct fgt_masks ich_hfgitr_masks; extern struct fgt_masks kvm_nvhe_sym(hfgrtr_masks); extern struct fgt_masks kvm_nvhe_sym(hfgwtr_masks); @@ -687,6 +697,9 @@ extern struct fgt_masks kvm_nvhe_sym(hfgwtr2_masks); extern struct fgt_masks kvm_nvhe_sym(hfgitr2_masks); extern struct fgt_masks kvm_nvhe_sym(hdfgrtr2_masks); extern struct fgt_masks kvm_nvhe_sym(hdfgwtr2_masks); +extern struct fgt_masks kvm_nvhe_sym(ich_hfgrtr_masks); +extern struct fgt_masks kvm_nvhe_sym(ich_hfgwtr_masks); +extern struct fgt_masks kvm_nvhe_sym(ich_hfgitr_masks); struct kvm_cpu_context { struct user_pt_regs regs; /* sp = sp_el0 */ @@ -1659,6 +1672,11 @@ static __always_inline enum fgt_group_id __fgt_reg_to_group_id(enum vcpu_sysreg case HDFGRTR2_EL2: case HDFGWTR2_EL2: return HDFGRTR2_GROUP; + case ICH_HFGRTR_EL2: + case ICH_HFGWTR_EL2: + return ICH_HFGRTR_GROUP; + case ICH_HFGITR_EL2: + return ICH_HFGITR_GROUP; default: BUILD_BUG_ON(1); } @@ -1673,6 +1691,7 @@ static __always_inline enum fgt_group_id __fgt_reg_to_group_id(enum vcpu_sysreg case HDFGWTR_EL2: \ case HFGWTR2_EL2: \ case HDFGWTR2_EL2: \ + case ICH_HFGWTR_EL2: \ p = &(vcpu)->arch.fgt[id].w; \ break; \ default: \ diff --git a/arch/arm64/include/asm/vncr_mapping.h b/arch/arm64/include/asm/vncr_mapping.h index c2485a862e69..14366d35ce82 100644 --- a/arch/arm64/include/asm/vncr_mapping.h +++ b/arch/arm64/include/asm/vncr_mapping.h @@ -108,5 +108,8 @@ #define VNCR_MPAMVPM5_EL2 0x968 #define VNCR_MPAMVPM6_EL2 0x970 #define VNCR_MPAMVPM7_EL2 0x978 +#define VNCR_ICH_HFGITR_EL2 0xB10 +#define VNCR_ICH_HFGRTR_EL2 0xB18 +#define VNCR_ICH_HFGWTR_EL2 0xB20 #endif /* __ARM64_VNCR_MAPPING_H__ */ diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 
410ffd41fd73..aa69fd5b372f 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -2529,6 +2529,9 @@ static void kvm_hyp_init_symbols(void) kvm_nvhe_sym(hfgitr2_masks) = hfgitr2_masks; kvm_nvhe_sym(hdfgrtr2_masks)= hdfgrtr2_masks; kvm_nvhe_sym(hdfgwtr2_masks)= hdfgwtr2_masks; + kvm_nvhe_sym(ich_hfgrtr_masks) = ich_hfgrtr_masks; + kvm_nvhe_sym(ich_hfgwtr_masks) = ich_hfgwtr_masks; + kvm_nvhe_sym(ich_hfgitr_masks) = ich_hfgitr_masks; /* * Flush entire BSS since part of its data containing init symbols is read diff --git a/arch/arm64/kvm/config.c b/arch/arm64/kvm/config.c index d9f553cbf9df..e4ec1bda8dfc 100644 --- a/arch/arm64/kvm/config.c +++ b/arch/arm64/kvm/config.c @@ -225,6 +225,7 @@ struct reg_feat_map_desc { #define FEAT_MTPMU ID_AA64DFR0_EL1, MTPMU, IMP #define FEAT_HCX ID_AA64MMFR1_EL1, HCX, IMP #define FEAT_S2PIE ID_AA64MMFR3_EL1, S2PIE, IMP +#define FEAT_GCIE ID_AA64PFR2_EL1, GCIE, IMP static bool not_feat_aa64el3(struct kvm *kvm) { @@ -1277,6 +1278,58 @@ static const struct reg_bits_to_feat_map vtcr_el2_feat_map[] = { static const DECLARE_FEAT_MAP(vtcr_el2_desc, VTCR_EL2, vtcr_el2_feat_map, FEAT_AA64EL2); +static const struct reg_bits_to_feat_map ich_hfgrtr_feat_map[] = { + NEEDS_FEAT(ICH_HFGRTR_EL2_ICC_APR_EL1 | + ICH_HFGRTR_EL2_ICC_IDRn_EL1 | + ICH_HFGRTR_EL2_ICC_CR0_EL1 | + ICH_HFGRTR_EL2_ICC_HPPIR_EL1 | + ICH_HFGRTR_EL2_ICC_PCR_EL1 | + ICH_HFGRTR_EL2_ICC_ICSR_EL1 | + ICH_HFGRTR_EL2_ICC_IAFFIDR_EL1 | + ICH_HFGRTR_EL2_ICC_PPI_HMRn_EL1 | + ICH_HFGRTR_EL2_ICC_PPI_ENABLERn_EL1 | + ICH_HFGRTR_EL2_ICC_PPI_PENDRn_EL1 | + ICH_HFGRTR_EL2_ICC_PPI_PRIORITYRn_EL1 | + ICH_HFGRTR_EL2_ICC_PPI_ACTIVERn_EL1, + FEAT_GCIE), +}; + +static const DECLARE_FEAT_MAP_FGT(ich_hfgrtr_desc, ich_hfgrtr_masks, + ich_hfgrtr_feat_map, FEAT_GCIE); + +static const struct reg_bits_to_feat_map ich_hfgwtr_feat_map[] = { + NEEDS_FEAT(ICH_HFGWTR_EL2_ICC_APR_EL1 | + ICH_HFGWTR_EL2_ICC_CR0_EL1 | + ICH_HFGWTR_EL2_ICC_PCR_EL1 | + ICH_HFGWTR_EL2_ICC_ICSR_EL1 | + 
ICH_HFGWTR_EL2_ICC_PPI_ENABLERn_EL1 | + ICH_HFGWTR_EL2_ICC_PPI_PENDRn_EL1 | + ICH_HFGWTR_EL2_ICC_PPI_PRIORITYRn_EL1 | + ICH_HFGWTR_EL2_ICC_PPI_ACTIVERn_EL1, + FEAT_GCIE), +}; + +static const DECLARE_FEAT_MAP_FGT(ich_hfgwtr_desc, ich_hfgwtr_masks, + ich_hfgwtr_feat_map, FEAT_GCIE); + +static const struct reg_bits_to_feat_map ich_hfgitr_feat_map[] = { + NEEDS_FEAT(ICH_HFGITR_EL2_GICCDEN | + ICH_HFGITR_EL2_GICCDDIS | + ICH_HFGITR_EL2_GICCDPRI | + ICH_HFGITR_EL2_GICCDAFF | + ICH_HFGITR_EL2_GICCDPEND | + ICH_HFGITR_EL2_GICCDRCFG | + ICH_HFGITR_EL2_GICCDHM | + ICH_HFGITR_EL2_GICCDEOI | + ICH_HFGITR_EL2_GICCDDI | + ICH_HFGITR_EL2_GICRCDIA | + ICH_HFGITR_EL2_GICRCDNMIA, + FEAT_GCIE), +}; + +static const DECLARE_FEAT_MAP_FGT(ich_hfgitr_desc, ich_hfgitr_masks, + ich_hfgitr_feat_map, FEAT_GCIE); + static void __init check_feat_map(const struct reg_bits_to_feat_map *map, int map_size, u64 resx, const char *str) { @@ -1328,6 +1381,9 @@ void __init check_feature_map(void) check_reg_desc(&sctlr_el2_desc); check_reg_desc(&mdcr_el2_desc); check_reg_desc(&vtcr_el2_desc); + check_reg_desc(&ich_hfgrtr_desc); + check_reg_desc(&ich_hfgwtr_desc); + check_reg_desc(&ich_hfgitr_desc); } static bool idreg_feat_match(struct kvm *kvm, const struct reg_bits_to_feat_map *map) @@ -1460,6 +1516,13 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt) val |= compute_fgu_bits(kvm, &hdfgrtr2_desc); val |= compute_fgu_bits(kvm, &hdfgwtr2_desc); break; + case ICH_HFGRTR_GROUP: + val |= compute_fgu_bits(kvm, &ich_hfgrtr_desc); + val |= compute_fgu_bits(kvm, &ich_hfgwtr_desc); + break; + case ICH_HFGITR_GROUP: + val |= compute_fgu_bits(kvm, &ich_hfgitr_desc); + break; default: BUG(); } @@ -1531,6 +1594,15 @@ struct resx get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg) case VTCR_EL2: resx = compute_reg_resx_bits(kvm, &vtcr_el2_desc, 0, 0); break; + case ICH_HFGRTR_EL2: + resx = compute_reg_resx_bits(kvm, &ich_hfgrtr_desc, 0, 0); + break; + case ICH_HFGWTR_EL2: + resx = 
compute_reg_resx_bits(kvm, &ich_hfgwtr_desc, 0, 0); + break; + case ICH_HFGITR_EL2: + resx = compute_reg_resx_bits(kvm, &ich_hfgitr_desc, 0, 0); + break; default: WARN_ON_ONCE(1); resx = (typeof(resx)){}; @@ -1565,6 +1637,12 @@ static __always_inline struct fgt_masks *__fgt_reg_to_masks(enum vcpu_sysreg reg return &hdfgrtr2_masks; case HDFGWTR2_EL2: return &hdfgwtr2_masks; + case ICH_HFGRTR_EL2: + return &ich_hfgrtr_masks; + case ICH_HFGWTR_EL2: + return &ich_hfgwtr_masks; + case ICH_HFGITR_EL2: + return &ich_hfgitr_masks; default: BUILD_BUG_ON(1); } @@ -1618,12 +1696,17 @@ void kvm_vcpu_load_fgt(struct kvm_vcpu *vcpu) __compute_hdfgwtr(vcpu); __compute_fgt(vcpu, HAFGRTR_EL2); - if (!cpus_have_final_cap(ARM64_HAS_FGT2)) - return; + if (cpus_have_final_cap(ARM64_HAS_FGT2)) { + __compute_fgt(vcpu, HFGRTR2_EL2); + __compute_fgt(vcpu, HFGWTR2_EL2); + __compute_fgt(vcpu, HFGITR2_EL2); + __compute_fgt(vcpu, HDFGRTR2_EL2); + __compute_fgt(vcpu, HDFGWTR2_EL2); + } - __compute_fgt(vcpu, HFGRTR2_EL2); - __compute_fgt(vcpu, HFGWTR2_EL2); - __compute_fgt(vcpu, HFGITR2_EL2); - __compute_fgt(vcpu, HDFGRTR2_EL2); - __compute_fgt(vcpu, HDFGWTR2_EL2); + if (cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF)) { + __compute_fgt(vcpu, ICH_HFGRTR_EL2); + __compute_fgt(vcpu, ICH_HFGWTR_EL2); + __compute_fgt(vcpu, ICH_HFGITR_EL2); + } } diff --git a/arch/arm64/kvm/emulate-nested.c b/arch/arm64/kvm/emulate-nested.c index 22d497554c94..dba7ced74ca5 100644 --- a/arch/arm64/kvm/emulate-nested.c +++ b/arch/arm64/kvm/emulate-nested.c @@ -2053,6 +2053,60 @@ static const struct encoding_to_trap_config encoding_to_fgt[] __initconst = { SR_FGT(SYS_AMEVCNTR0_EL0(2), HAFGRTR, AMEVCNTR02_EL0, 1), SR_FGT(SYS_AMEVCNTR0_EL0(1), HAFGRTR, AMEVCNTR01_EL0, 1), SR_FGT(SYS_AMEVCNTR0_EL0(0), HAFGRTR, AMEVCNTR00_EL0, 1), + + /* + * ICH_HFGRTR_EL2 & ICH_HFGWTR_EL2 + */ + SR_FGT(SYS_ICC_APR_EL1, ICH_HFGRTR, ICC_APR_EL1, 0), + SR_FGT(SYS_ICC_IDR0_EL1, ICH_HFGRTR, ICC_IDRn_EL1, 0), + SR_FGT(SYS_ICC_CR0_EL1, ICH_HFGRTR, 
ICC_CR0_EL1, 0), + SR_FGT(SYS_ICC_HPPIR_EL1, ICH_HFGRTR, ICC_HPPIR_EL1, 0), + SR_FGT(SYS_ICC_PCR_EL1, ICH_HFGRTR, ICC_PCR_EL1, 0), + SR_FGT(SYS_ICC_ICSR_EL1, ICH_HFGRTR, ICC_ICSR_EL1, 0), + SR_FGT(SYS_ICC_IAFFIDR_EL1, ICH_HFGRTR, ICC_IAFFIDR_EL1, 0), + SR_FGT(SYS_ICC_PPI_HMR0_EL1, ICH_HFGRTR, ICC_PPI_HMRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_HMR1_EL1, ICH_HFGRTR, ICC_PPI_HMRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_ENABLER0_EL1, ICH_HFGRTR, ICC_PPI_ENABLERn_EL1, 0), + SR_FGT(SYS_ICC_PPI_ENABLER1_EL1, ICH_HFGRTR, ICC_PPI_ENABLERn_EL1, 0), + SR_FGT(SYS_ICC_PPI_CPENDR0_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_CPENDR1_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_SPENDR0_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_SPENDR1_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR0_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR1_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR2_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR3_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR4_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR5_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR6_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR7_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR8_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR9_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR10_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR11_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR12_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR13_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR14_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + 
SR_FGT(SYS_ICC_PPI_PRIORITYR15_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_CACTIVER0_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), + SR_FGT(SYS_ICC_PPI_CACTIVER1_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), + SR_FGT(SYS_ICC_PPI_SACTIVER0_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), + SR_FGT(SYS_ICC_PPI_SACTIVER1_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), + + /* + * ICH_HFGITR_EL2 + */ + SR_FGT(GICV5_OP_GIC_CDEN, ICH_HFGITR, GICCDEN, 0), + SR_FGT(GICV5_OP_GIC_CDDIS, ICH_HFGITR, GICCDDIS, 0), + SR_FGT(GICV5_OP_GIC_CDPRI, ICH_HFGITR, GICCDPRI, 0), + SR_FGT(GICV5_OP_GIC_CDAFF, ICH_HFGITR, GICCDAFF, 0), + SR_FGT(GICV5_OP_GIC_CDPEND, ICH_HFGITR, GICCDPEND, 0), + SR_FGT(GICV5_OP_GIC_CDRCFG, ICH_HFGITR, GICCDRCFG, 0), + SR_FGT(GICV5_OP_GIC_CDHM, ICH_HFGITR, GICCDHM, 0), + SR_FGT(GICV5_OP_GIC_CDEOI, ICH_HFGITR, GICCDEOI, 0), + SR_FGT(GICV5_OP_GIC_CDDI, ICH_HFGITR, GICCDDI, 0), + SR_FGT(GICV5_OP_GICR_CDIA, ICH_HFGITR, GICRCDIA, 0), + SR_FGT(GICV5_OP_GICR_CDNMIA, ICH_HFGITR, GICRCDNMIA, 0), }; /* @@ -2127,6 +2181,9 @@ FGT_MASKS(hfgwtr2_masks, HFGWTR2_EL2); FGT_MASKS(hfgitr2_masks, HFGITR2_EL2); FGT_MASKS(hdfgrtr2_masks, HDFGRTR2_EL2); FGT_MASKS(hdfgwtr2_masks, HDFGWTR2_EL2); +FGT_MASKS(ich_hfgrtr_masks, ICH_HFGRTR_EL2); +FGT_MASKS(ich_hfgwtr_masks, ICH_HFGWTR_EL2); +FGT_MASKS(ich_hfgitr_masks, ICH_HFGITR_EL2); static __init bool aggregate_fgt(union trap_config tc) { @@ -2162,6 +2219,14 @@ static __init bool aggregate_fgt(union trap_config tc) rmasks = &hfgitr2_masks; wmasks = NULL; break; + case ICH_HFGRTR_GROUP: + rmasks = &ich_hfgrtr_masks; + wmasks = &ich_hfgwtr_masks; + break; + case ICH_HFGITR_GROUP: + rmasks = &ich_hfgitr_masks; + wmasks = NULL; + break; } rresx = rmasks->res0 | rmasks->res1; @@ -2232,6 +2297,9 @@ static __init int check_all_fgt_masks(int ret) &hfgitr2_masks, &hdfgrtr2_masks, &hdfgwtr2_masks, + &ich_hfgrtr_masks, + &ich_hfgwtr_masks, + &ich_hfgitr_masks, }; int err = 0; diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h 
b/arch/arm64/kvm/hyp/include/hyp/switch.h index 2597e8bda867..ae04fd680d1e 100644 --- a/arch/arm64/kvm/hyp/include/hyp/switch.h +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h @@ -233,6 +233,18 @@ static inline void __activate_traps_hfgxtr(struct kvm_vcpu *vcpu) __activate_fgt(hctxt, vcpu, HDFGWTR2_EL2); } +static inline void __activate_traps_ich_hfgxtr(struct kvm_vcpu *vcpu) +{ + struct kvm_cpu_context *hctxt = host_data_ptr(host_ctxt); + + if (!cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF)) + return; + + __activate_fgt(hctxt, vcpu, ICH_HFGRTR_EL2); + __activate_fgt(hctxt, vcpu, ICH_HFGWTR_EL2); + __activate_fgt(hctxt, vcpu, ICH_HFGITR_EL2); +} + #define __deactivate_fgt(htcxt, vcpu, reg) \ do { \ write_sysreg_s(ctxt_sys_reg(hctxt, reg), \ @@ -265,6 +277,19 @@ static inline void __deactivate_traps_hfgxtr(struct kvm_vcpu *vcpu) __deactivate_fgt(hctxt, vcpu, HDFGWTR2_EL2); } +static inline void __deactivate_traps_ich_hfgxtr(struct kvm_vcpu *vcpu) +{ + struct kvm_cpu_context *hctxt = host_data_ptr(host_ctxt); + + if (!cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF)) + return; + + __deactivate_fgt(hctxt, vcpu, ICH_HFGRTR_EL2); + __deactivate_fgt(hctxt, vcpu, ICH_HFGWTR_EL2); + __deactivate_fgt(hctxt, vcpu, ICH_HFGITR_EL2); + +} + static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu) { u64 r = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1; @@ -328,6 +353,7 @@ static inline void __activate_traps_common(struct kvm_vcpu *vcpu) } __activate_traps_hfgxtr(vcpu); + __activate_traps_ich_hfgxtr(vcpu); __activate_traps_mpam(vcpu); } @@ -345,6 +371,7 @@ static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu) write_sysreg_s(ctxt_sys_reg(hctxt, HCRX_EL2), SYS_HCRX_EL2); __deactivate_traps_hfgxtr(vcpu); + __deactivate_traps_ich_hfgxtr(vcpu); __deactivate_traps_mpam(); } diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c index 779089e42681..b41485ce295a 100644 --- a/arch/arm64/kvm/hyp/nvhe/switch.c +++ b/arch/arm64/kvm/hyp/nvhe/switch.c @@ 
-44,6 +44,9 @@ struct fgt_masks hfgwtr2_masks; struct fgt_masks hfgitr2_masks; struct fgt_masks hdfgrtr2_masks; struct fgt_masks hdfgwtr2_masks; +struct fgt_masks ich_hfgrtr_masks; +struct fgt_masks ich_hfgwtr_masks; +struct fgt_masks ich_hfgitr_masks; extern void kvm_nvhe_prepare_backtrace(unsigned long fp, unsigned long pc); diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 140cf35f4eeb..cd6deaf47315 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -5661,6 +5661,8 @@ void kvm_calculate_traps(struct kvm_vcpu *vcpu) compute_fgu(kvm, HFGRTR2_GROUP); compute_fgu(kvm, HFGITR2_GROUP); compute_fgu(kvm, HDFGRTR2_GROUP); + compute_fgu(kvm, ICH_HFGRTR_GROUP); + compute_fgu(kvm, ICH_HFGITR_GROUP); set_bit(KVM_ARCH_FLAG_FGU_INITIALIZED, &kvm->arch.flags); out: From 607871ce633d3e0ca0eb375a04371f1130fc2c5a Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:53:21 +0000 Subject: [PATCH 155/373] KVM: arm64: gic-v5: Add emulation for ICC_IAFFIDR_EL1 accesses GICv5 doesn't provide an ICV_IAFFIDR_EL1 or ICH_IAFFIDR_EL2 through which the IAFFID could be exposed to the guest. A guest access to ICC_IAFFIDR_EL1 must therefore be trapped and emulated to avoid the guest accessing the host's ICC_IAFFIDR_EL1. The virtual IAFFID is provided to the guest when it reads ICC_IAFFIDR_EL1 (which always traps back to the hypervisor). Writes are treated as UNDEF. KVM treats the GICv5 VPEID, the virtual IAFFID, and the vcpu_id as the same, and so the vcpu_id is returned. Trapping of ICC_IAFFIDR_EL1 is always enabled when running a guest.
Co-authored-by: Timothy Hayes Signed-off-by: Timothy Hayes Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-15-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/config.c | 10 +++++++++- arch/arm64/kvm/sys_regs.c | 16 ++++++++++++++++ arch/arm64/kvm/vgic/vgic.h | 5 +++++ 3 files changed, 30 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/config.c b/arch/arm64/kvm/config.c index e4ec1bda8dfc..bac5f49fdbde 100644 --- a/arch/arm64/kvm/config.c +++ b/arch/arm64/kvm/config.c @@ -1684,6 +1684,14 @@ static void __compute_hdfgwtr(struct kvm_vcpu *vcpu) *vcpu_fgt(vcpu, HDFGWTR_EL2) |= HDFGWTR_EL2_MDSCR_EL1; } +static void __compute_ich_hfgrtr(struct kvm_vcpu *vcpu) +{ + __compute_fgt(vcpu, ICH_HFGRTR_EL2); + + /* ICC_IAFFIDR_EL1 *always* needs to be trapped when running a guest */ + *vcpu_fgt(vcpu, ICH_HFGRTR_EL2) &= ~ICH_HFGRTR_EL2_ICC_IAFFIDR_EL1; +} + void kvm_vcpu_load_fgt(struct kvm_vcpu *vcpu) { if (!cpus_have_final_cap(ARM64_HAS_FGT)) @@ -1705,7 +1713,7 @@ void kvm_vcpu_load_fgt(struct kvm_vcpu *vcpu) } if (cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF)) { - __compute_fgt(vcpu, ICH_HFGRTR_EL2); + __compute_ich_hfgrtr(vcpu); __compute_fgt(vcpu, ICH_HFGWTR_EL2); __compute_fgt(vcpu, ICH_HFGITR_EL2); } diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index cd6deaf47315..d4531457ea02 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -681,6 +681,21 @@ static bool access_gic_dir(struct kvm_vcpu *vcpu, return true; } +static bool access_gicv5_iaffid(struct kvm_vcpu *vcpu, struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + if (p->is_write) + return undef_access(vcpu, p, r); + + /* + * For GICv5 VMs, the IAFFID value is the same as the VPE ID. The VPE ID + * is the same as the VCPU's ID. 
+ */ + p->regval = FIELD_PREP(ICC_IAFFIDR_EL1_IAFFID, vcpu->vcpu_id); + + return true; +} + static bool trap_raz_wi(struct kvm_vcpu *vcpu, struct sys_reg_params *p, const struct sys_reg_desc *r) @@ -3405,6 +3420,7 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_ICC_AP1R1_EL1), undef_access }, { SYS_DESC(SYS_ICC_AP1R2_EL1), undef_access }, { SYS_DESC(SYS_ICC_AP1R3_EL1), undef_access }, + { SYS_DESC(SYS_ICC_IAFFIDR_EL1), access_gicv5_iaffid }, { SYS_DESC(SYS_ICC_DIR_EL1), access_gic_dir }, { SYS_DESC(SYS_ICC_RPR_EL1), undef_access }, { SYS_DESC(SYS_ICC_SGI1R_EL1), access_gic_sgi }, diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h index f2924f821197..7b7eed69d797 100644 --- a/arch/arm64/kvm/vgic/vgic.h +++ b/arch/arm64/kvm/vgic/vgic.h @@ -447,6 +447,11 @@ static inline bool kvm_has_gicv3(struct kvm *kvm) return kvm_has_feat(kvm, ID_AA64PFR0_EL1, GIC, IMP); } +static inline bool kvm_has_gicv5(struct kvm *kvm) +{ + return kvm_has_feat(kvm, ID_AA64PFR2_EL1, GCIE, IMP); +} + void vgic_v3_flush_nested(struct kvm_vcpu *vcpu); void vgic_v3_sync_nested(struct kvm_vcpu *vcpu); void vgic_v3_load_nested(struct kvm_vcpu *vcpu); From 070543a85adce329672012a1fe35fa48c76e02d5 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:53:37 +0000 Subject: [PATCH 156/373] KVM: arm64: gic-v5: Trap and emulate ICC_IDR0_EL1 accesses Unless accesses to ICC_IDR0_EL1 are trapped by KVM, the guest reads the same state as the host. This isn't desirable as it limits the migratability of VMs and means that KVM can't hide hardware features such as FEAT_GCIE_LEGACY. Trap and emulate accesses to the register, and present KVM's chosen ID bits and Priority bits (the latter being 5, as GICv5 only supports 5 bits of priority in the CPU interface). FEAT_GCIE_LEGACY is never presented to the guest as it is only relevant for nested guests doing mixed GICv5 and GICv3 support.
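A small C model of the guest-visible ICC_IDR0 encoding convention used by the patch — the priority-bits field stores one less than the number of implemented priority bits, while the ID-bits field stores a raw selector (0b0000 for 16 bits, 0b0001 for 24 bits). The bit positions below are placeholders, not the architected ICC_IDR0_EL1 layout:

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder field positions; illustrative only. */
#define IDR0_PRI_BITS_SHIFT	4
#define IDR0_PRI_BITS_MASK	(0xfull << IDR0_PRI_BITS_SHIFT)
#define IDR0_ID_BITS_SHIFT	0
#define IDR0_ID_BITS_MASK	(0xfull << IDR0_ID_BITS_SHIFT)

/*
 * PRI_BITS holds num_pri_bits - 1 (so 5 priority bits encodes as 4);
 * ID_BITS holds the raw selector value unchanged.
 */
static uint64_t encode_idr0(unsigned int num_pri_bits, unsigned int id_bits_raw)
{
	return (((uint64_t)(num_pri_bits - 1) << IDR0_PRI_BITS_SHIFT) & IDR0_PRI_BITS_MASK) |
	       (((uint64_t)id_bits_raw << IDR0_ID_BITS_SHIFT) & IDR0_ID_BITS_MASK);
}
```

With 5 priority bits and the 16-bit ID selector, the priority field reads back as 4 and the ID field as 0.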
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-16-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/config.c | 11 +++++++++-- arch/arm64/kvm/sys_regs.c | 23 +++++++++++++++++++++++ 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kvm/config.c b/arch/arm64/kvm/config.c index bac5f49fdbde..5663f25905e8 100644 --- a/arch/arm64/kvm/config.c +++ b/arch/arm64/kvm/config.c @@ -1688,8 +1688,15 @@ static void __compute_ich_hfgrtr(struct kvm_vcpu *vcpu) { __compute_fgt(vcpu, ICH_HFGRTR_EL2); - /* ICC_IAFFIDR_EL1 *always* needs to be trapped when running a guest */ - *vcpu_fgt(vcpu, ICH_HFGRTR_EL2) &= ~ICH_HFGRTR_EL2_ICC_IAFFIDR_EL1; + /* + * ICC_IAFFIDR_EL1 *always* needs to be trapped when running a guest. + * + * We also trap accesses to ICC_IDR0_EL1 to allow us to completely hide + * FEAT_GCIE_LEGACY from the guest, and to (potentially) present fewer + * ID bits than the host supports. + */ + *vcpu_fgt(vcpu, ICH_HFGRTR_EL2) &= ~(ICH_HFGRTR_EL2_ICC_IAFFIDR_EL1 | + ICH_HFGRTR_EL2_ICC_IDRn_EL1); } void kvm_vcpu_load_fgt(struct kvm_vcpu *vcpu) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index d4531457ea02..85300e76bbe4 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -681,6 +681,28 @@ static bool access_gic_dir(struct kvm_vcpu *vcpu, return true; } +static bool access_gicv5_idr0(struct kvm_vcpu *vcpu, struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + if (p->is_write) + return undef_access(vcpu, p, r); + + /* + * Expose KVM's priority- and ID-bits to the guest, but not GCIE_LEGACY. + * + * Note: for GICv5 we mimic the way that the num_pri_bits and + * num_id_bits fields are used with GICv3: + * - num_pri_bits stores the actual number of priority bits, whereas the + * register field stores num_pri_bits - 1.
+ * - num_id_bits stores the raw field value, which is 0b0000 for 16 bits + * and 0b0001 for 24 bits. + */ + p->regval = FIELD_PREP(ICC_IDR0_EL1_PRI_BITS, vcpu->arch.vgic_cpu.num_pri_bits - 1) | + FIELD_PREP(ICC_IDR0_EL1_ID_BITS, vcpu->arch.vgic_cpu.num_id_bits); + + return true; +} + static bool access_gicv5_iaffid(struct kvm_vcpu *vcpu, struct sys_reg_params *p, const struct sys_reg_desc *r) { @@ -3420,6 +3442,7 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_ICC_AP1R1_EL1), undef_access }, { SYS_DESC(SYS_ICC_AP1R2_EL1), undef_access }, { SYS_DESC(SYS_ICC_AP1R3_EL1), undef_access }, + { SYS_DESC(SYS_ICC_IDR0_EL1), access_gicv5_idr0 }, { SYS_DESC(SYS_ICC_IAFFIDR_EL1), access_gicv5_iaffid }, { SYS_DESC(SYS_ICC_DIR_EL1), access_gic_dir }, { SYS_DESC(SYS_ICC_RPR_EL1), undef_access }, From af325e87af5da2f686d1ad547edc96f731418f2a Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:53:52 +0000 Subject: [PATCH 157/373] KVM: arm64: gic-v5: Add vgic-v5 save/restore hyp interface Introduce the following hyp functions to save/restore GICv5 state: * __vgic_v5_save_apr() * __vgic_v5_restore_vmcr_apr() * __vgic_v5_save_ppi_state() - no hypercall required * __vgic_v5_restore_ppi_state() - no hypercall required * __vgic_v5_save_state() - no hypercall required * __vgic_v5_restore_state() - no hypercall required Note that the functions tagged as not requiring hypercalls are always called directly from the same context. They are either called via the vgic_save_state()/vgic_restore_state() path when running with VHE, or via __hyp_vgic_save_state()/__hyp_vgic_restore_state() otherwise. This mimics how vgic_v3_save_state()/vgic_v3_restore_state() are implemented. 
Overall, the state of the following registers is saved/restored: * ICC_ICSR_EL1 * ICH_APR_EL2 * ICH_PPI_ACTIVERx_EL2 * ICH_PPI_DVIRx_EL2 * ICH_PPI_ENABLERx_EL2 * ICH_PPI_PENDRx_EL2 * ICH_PPI_PRIORITYRx_EL2 * ICH_VMCR_EL2 All of these are saved/restored to/from the KVM vgic_v5 CPUIF shadow state, with the exception of the PPI active, pending, and enable state. The pending state is saved and restored from kvm_host_data as any changes here need to be tracked and propagated back to the vgic_irq shadow structures (coming in a future commit). Therefore, an entry and an exit copy are required. The active and enable state is restored from the vgic_v5 CPUIF, but is saved to kvm_host_data. Again, this needs to be synced back into the shadow data structures. The ICSR must be saved/restored as this register is shared between host and guest. Therefore, to avoid leaking host state to the guest, it must be saved and restored. Moreover, as it can be used by the host at any time, it must be saved/restored eagerly. Note: the host state is not preserved as the host should only use this register when preemption is disabled. As with GICv3, the VMCR is eagerly saved as it is required when checking whether interrupts can be injected, and therefore impacts things such as WFI. As part of restoring the ICH_VMCR_EL2 and ICH_APR_EL2, GICv3-compat mode is also disabled by setting the ICH_VCTLR_EL2.V3 bit to 0. The corresponding GICv3-compat mode enable is part of the VMCR & APR restore for a GICv3 guest as it only takes effect when actually running a guest.
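The reason both an entry and an exit snapshot of the PPI pending bitmap are needed can be sketched as follows (an illustrative model with assumed names, not the patch's code): a bit set at exit only counts as a new pending edge if it differs from the entry snapshot, since a bit already set at entry may have been injected in software before the guest ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy edge detector over a 64-bit pending bitmap; illustrative only. */
static bool ppi_became_pending(uint64_t pendr_entry, uint64_t pendr_exit,
			       unsigned int intid)
{
	uint64_t bit = 1ull << intid;

	/* Edge = state changed across the guest run AND is now pending. */
	return ((pendr_entry ^ pendr_exit) & bit) && (pendr_exit & bit);
}
```

A single exit-time snapshot could not distinguish "set by the guest" from "already set at entry", which is exactly the ambiguity the entry copy removes.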
Co-authored-by: Timothy Hayes Signed-off-by: Timothy Hayes Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260319154937.3619520-17-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 2 + arch/arm64/include/asm/kvm_host.h | 16 +++ arch/arm64/include/asm/kvm_hyp.h | 9 ++ arch/arm64/kvm/hyp/nvhe/Makefile | 2 +- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 16 +++ arch/arm64/kvm/hyp/vgic-v5-sr.c | 170 +++++++++++++++++++++++++++++ arch/arm64/kvm/hyp/vhe/Makefile | 2 +- include/kvm/arm_vgic.h | 22 ++++ 8 files changed, 237 insertions(+), 2 deletions(-) create mode 100644 arch/arm64/kvm/hyp/vgic-v5-sr.c diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index a1ad12c72ebf..44e4696ca113 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -89,6 +89,8 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_load, __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_put, __KVM_HOST_SMCCC_FUNC___pkvm_tlb_flush_vmid, + __KVM_HOST_SMCCC_FUNC___vgic_v5_save_apr, + __KVM_HOST_SMCCC_FUNC___vgic_v5_restore_vmcr_apr, }; #define DECLARE_KVM_VHE_SYM(sym) extern char sym[] diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 64a1ee6c442f..c4a172b70206 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -800,6 +800,22 @@ struct kvm_host_data { /* Last vgic_irq part of the AP list recorded in an LR */ struct vgic_irq *last_lr_irq; + + /* PPI state tracking for GICv5-based guests */ + struct { + /* + * For tracking the PPI pending state, we need both the entry + * state and exit state to correctly detect edges as it is + * possible that an interrupt has been injected in software in + * the interim. 
+ */ + DECLARE_BITMAP(pendr_entry, VGIC_V5_NR_PRIVATE_IRQS); + DECLARE_BITMAP(pendr_exit, VGIC_V5_NR_PRIVATE_IRQS); + + /* The saved state of the regs when leaving the guest */ + DECLARE_BITMAP(activer_exit, VGIC_V5_NR_PRIVATE_IRQS); + DECLARE_BITMAP(enabler_exit, VGIC_V5_NR_PRIVATE_IRQS); + } vgic_v5_ppi_state; }; struct kvm_host_psci_config { diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h index 76ce2b94bd97..2d8dfd534bd9 100644 --- a/arch/arm64/include/asm/kvm_hyp.h +++ b/arch/arm64/include/asm/kvm_hyp.h @@ -87,6 +87,15 @@ void __vgic_v3_save_aprs(struct vgic_v3_cpu_if *cpu_if); void __vgic_v3_restore_vmcr_aprs(struct vgic_v3_cpu_if *cpu_if); int __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu); +/* GICv5 */ +void __vgic_v5_save_apr(struct vgic_v5_cpu_if *cpu_if); +void __vgic_v5_restore_vmcr_apr(struct vgic_v5_cpu_if *cpu_if); +/* No hypercalls for the following */ +void __vgic_v5_save_ppi_state(struct vgic_v5_cpu_if *cpu_if); +void __vgic_v5_restore_ppi_state(struct vgic_v5_cpu_if *cpu_if); +void __vgic_v5_save_state(struct vgic_v5_cpu_if *cpu_if); +void __vgic_v5_restore_state(struct vgic_v5_cpu_if *cpu_if); + #ifdef __KVM_NVHE_HYPERVISOR__ void __timer_enable_traps(struct kvm_vcpu *vcpu); void __timer_disable_traps(struct kvm_vcpu *vcpu); diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile index a244ec25f8c5..84a3bf96def6 100644 --- a/arch/arm64/kvm/hyp/nvhe/Makefile +++ b/arch/arm64/kvm/hyp/nvhe/Makefile @@ -26,7 +26,7 @@ hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o hyp-main.o hyp-smp.o psci-relay.o early_alloc.o page_alloc.o \ cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \ - ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o + ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o ../vgic-v5-sr.o hyp-obj-y += ../../../kernel/smccc-call.o 
hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o hyp-obj-y += $(lib-objs) diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index e7790097db93..007fc993f231 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -589,6 +589,20 @@ static void handle___pkvm_teardown_vm(struct kvm_cpu_context *host_ctxt) cpu_reg(host_ctxt, 1) = __pkvm_teardown_vm(handle); } +static void handle___vgic_v5_save_apr(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1); + + __vgic_v5_save_apr(kern_hyp_va(cpu_if)); +} + +static void handle___vgic_v5_restore_vmcr_apr(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1); + + __vgic_v5_restore_vmcr_apr(kern_hyp_va(cpu_if)); +} + typedef void (*hcall_t)(struct kvm_cpu_context *); #define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x @@ -630,6 +644,8 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__pkvm_vcpu_load), HANDLE_FUNC(__pkvm_vcpu_put), HANDLE_FUNC(__pkvm_tlb_flush_vmid), + HANDLE_FUNC(__vgic_v5_save_apr), + HANDLE_FUNC(__vgic_v5_restore_vmcr_apr), }; static void handle_host_hcall(struct kvm_cpu_context *host_ctxt) diff --git a/arch/arm64/kvm/hyp/vgic-v5-sr.c b/arch/arm64/kvm/hyp/vgic-v5-sr.c new file mode 100644 index 000000000000..f34ea219cc4e --- /dev/null +++ b/arch/arm64/kvm/hyp/vgic-v5-sr.c @@ -0,0 +1,170 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025, 2026 - Arm Ltd + */ + +#include + +#include + +void __vgic_v5_save_apr(struct vgic_v5_cpu_if *cpu_if) +{ + cpu_if->vgic_apr = read_sysreg_s(SYS_ICH_APR_EL2); +} + +static void __vgic_v5_compat_mode_disable(void) +{ + sysreg_clear_set_s(SYS_ICH_VCTLR_EL2, ICH_VCTLR_EL2_V3, 0); + isb(); +} + +void __vgic_v5_restore_vmcr_apr(struct vgic_v5_cpu_if *cpu_if) +{ + __vgic_v5_compat_mode_disable(); + + write_sysreg_s(cpu_if->vgic_vmcr, SYS_ICH_VMCR_EL2); + 
write_sysreg_s(cpu_if->vgic_apr, SYS_ICH_APR_EL2); +} + +void __vgic_v5_save_ppi_state(struct vgic_v5_cpu_if *cpu_if) +{ + /* + * The following code assumes that the bitmap storage that we have for + * PPIs is either 64 (architected PPIs, only) or 128 bits (architected & + * impdef PPIs). + */ + BUILD_BUG_ON(VGIC_V5_NR_PRIVATE_IRQS % 64); + + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->activer_exit, + read_sysreg_s(SYS_ICH_PPI_ACTIVER0_EL2), 0, 64); + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->enabler_exit, + read_sysreg_s(SYS_ICH_PPI_ENABLER0_EL2), 0, 64); + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr_exit, + read_sysreg_s(SYS_ICH_PPI_PENDR0_EL2), 0, 64); + + cpu_if->vgic_ppi_priorityr[0] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR0_EL2); + cpu_if->vgic_ppi_priorityr[1] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR1_EL2); + cpu_if->vgic_ppi_priorityr[2] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR2_EL2); + cpu_if->vgic_ppi_priorityr[3] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR3_EL2); + cpu_if->vgic_ppi_priorityr[4] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR4_EL2); + cpu_if->vgic_ppi_priorityr[5] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR5_EL2); + cpu_if->vgic_ppi_priorityr[6] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR6_EL2); + cpu_if->vgic_ppi_priorityr[7] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR7_EL2); + + if (VGIC_V5_NR_PRIVATE_IRQS == 128) { + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->activer_exit, + read_sysreg_s(SYS_ICH_PPI_ACTIVER1_EL2), 64, 64); + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->enabler_exit, + read_sysreg_s(SYS_ICH_PPI_ENABLER1_EL2), 64, 64); + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr_exit, + read_sysreg_s(SYS_ICH_PPI_PENDR1_EL2), 64, 64); + + cpu_if->vgic_ppi_priorityr[8] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR8_EL2); + cpu_if->vgic_ppi_priorityr[9] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR9_EL2); + cpu_if->vgic_ppi_priorityr[10] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR10_EL2); + cpu_if->vgic_ppi_priorityr[11] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR11_EL2); + 
cpu_if->vgic_ppi_priorityr[12] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR12_EL2); + cpu_if->vgic_ppi_priorityr[13] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR13_EL2); + cpu_if->vgic_ppi_priorityr[14] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR14_EL2); + cpu_if->vgic_ppi_priorityr[15] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR15_EL2); + } + + /* Now that we are done, disable DVI */ + write_sysreg_s(0, SYS_ICH_PPI_DVIR0_EL2); + write_sysreg_s(0, SYS_ICH_PPI_DVIR1_EL2); +} + +void __vgic_v5_restore_ppi_state(struct vgic_v5_cpu_if *cpu_if) +{ + DECLARE_BITMAP(pendr, VGIC_V5_NR_PRIVATE_IRQS); + + /* We assume 64 or 128 PPIs - see above comment */ + BUILD_BUG_ON(VGIC_V5_NR_PRIVATE_IRQS % 64); + + /* Enable DVI so that the guest's interrupt config takes over */ + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_dvir, 0, 64), + SYS_ICH_PPI_DVIR0_EL2); + + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_activer, 0, 64), + SYS_ICH_PPI_ACTIVER0_EL2); + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_enabler, 0, 64), + SYS_ICH_PPI_ENABLER0_EL2); + + /* Update the pending state of the NON-DVI'd PPIs, only */ + bitmap_andnot(pendr, host_data_ptr(vgic_v5_ppi_state)->pendr_entry, + cpu_if->vgic_ppi_dvir, VGIC_V5_NR_PRIVATE_IRQS); + write_sysreg_s(bitmap_read(pendr, 0, 64), SYS_ICH_PPI_PENDR0_EL2); + + write_sysreg_s(cpu_if->vgic_ppi_priorityr[0], + SYS_ICH_PPI_PRIORITYR0_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[1], + SYS_ICH_PPI_PRIORITYR1_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[2], + SYS_ICH_PPI_PRIORITYR2_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[3], + SYS_ICH_PPI_PRIORITYR3_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[4], + SYS_ICH_PPI_PRIORITYR4_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[5], + SYS_ICH_PPI_PRIORITYR5_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[6], + SYS_ICH_PPI_PRIORITYR6_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[7], + SYS_ICH_PPI_PRIORITYR7_EL2); + + if (VGIC_V5_NR_PRIVATE_IRQS == 128) { + /* Enable DVI so that the guest's interrupt config 
takes over */ + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_dvir, 64, 64), + SYS_ICH_PPI_DVIR1_EL2); + + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_activer, 64, 64), + SYS_ICH_PPI_ACTIVER1_EL2); + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_enabler, 64, 64), + SYS_ICH_PPI_ENABLER1_EL2); + write_sysreg_s(bitmap_read(pendr, 64, 64), + SYS_ICH_PPI_PENDR1_EL2); + + write_sysreg_s(cpu_if->vgic_ppi_priorityr[8], + SYS_ICH_PPI_PRIORITYR8_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[9], + SYS_ICH_PPI_PRIORITYR9_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[10], + SYS_ICH_PPI_PRIORITYR10_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[11], + SYS_ICH_PPI_PRIORITYR11_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[12], + SYS_ICH_PPI_PRIORITYR12_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[13], + SYS_ICH_PPI_PRIORITYR13_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[14], + SYS_ICH_PPI_PRIORITYR14_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[15], + SYS_ICH_PPI_PRIORITYR15_EL2); + } else { + write_sysreg_s(0, SYS_ICH_PPI_DVIR1_EL2); + + write_sysreg_s(0, SYS_ICH_PPI_ACTIVER1_EL2); + write_sysreg_s(0, SYS_ICH_PPI_ENABLER1_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PENDR1_EL2); + + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR8_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR9_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR10_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR11_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR12_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR13_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR14_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR15_EL2); + } +} + +void __vgic_v5_save_state(struct vgic_v5_cpu_if *cpu_if) +{ + cpu_if->vgic_vmcr = read_sysreg_s(SYS_ICH_VMCR_EL2); + cpu_if->vgic_icsr = read_sysreg_s(SYS_ICC_ICSR_EL1); +} + +void __vgic_v5_restore_state(struct vgic_v5_cpu_if *cpu_if) +{ + write_sysreg_s(cpu_if->vgic_icsr, SYS_ICC_ICSR_EL1); +} diff --git a/arch/arm64/kvm/hyp/vhe/Makefile b/arch/arm64/kvm/hyp/vhe/Makefile index 
afc4aed9231a..9695328bbd96 100644 --- a/arch/arm64/kvm/hyp/vhe/Makefile +++ b/arch/arm64/kvm/hyp/vhe/Makefile @@ -10,4 +10,4 @@ CFLAGS_switch.o += -Wno-override-init obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \ - ../fpsimd.o ../hyp-entry.o ../exception.o + ../fpsimd.o ../hyp-entry.o ../exception.o ../vgic-v5-sr.o diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index 24969fa8d02d..07e394690dcc 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -428,6 +428,27 @@ struct vgic_v3_cpu_if { unsigned int used_lrs; }; +struct vgic_v5_cpu_if { + u64 vgic_apr; + u64 vgic_vmcr; + + /* PPI register state */ + DECLARE_BITMAP(vgic_ppi_dvir, VGIC_V5_NR_PRIVATE_IRQS); + DECLARE_BITMAP(vgic_ppi_activer, VGIC_V5_NR_PRIVATE_IRQS); + DECLARE_BITMAP(vgic_ppi_enabler, VGIC_V5_NR_PRIVATE_IRQS); + /* We have one byte (of which 5 bits are used) per PPI for priority */ + u64 vgic_ppi_priorityr[VGIC_V5_NR_PRIVATE_IRQS / 8]; + + /* + * The ICSR is re-used across host and guest, and hence it needs to be + * saved/restored. Only one copy is required as the host should block + * preemption between executing GIC CDRCFG and accessing the + * ICC_ICSR_EL1. A guest, of course, can never guarantee this, and hence + * it is the hyp's responsibility to keep the state consistent. + */ + u64 vgic_icsr; +}; + /* What PPI capabilities does a GICv5 host have */ struct vgic_v5_ppi_caps { DECLARE_BITMAP(impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS); @@ -438,6 +459,7 @@ struct vgic_cpu { union { struct vgic_v2_cpu_if vgic_v2; struct vgic_v3_cpu_if vgic_v3; + struct vgic_v5_cpu_if vgic_v5; }; struct vgic_irq *private_irqs; From 9b8e3d4ca0e734dd13dc261c5f888b359f8f5983 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:54:08 +0000 Subject: [PATCH 158/373] KVM: arm64: gic-v5: Implement GICv5 load/put and save/restore This change introduces GICv5 load/put. 
Additionally, it plumbs in save/restore for: * PPIs (ICH_PPI_x_EL2 regs) * ICH_VMCR_EL2 * ICH_APR_EL2 * ICC_ICSR_EL1 A GICv5-specific enable bit is added to struct vgic_vmcr as this differs from previous GICs. On GICv5-native systems, the VMCR only contains the enable bit (driven by the guest via ICC_CR0_EL1.EN) and the priority mask (PCR). A struct gicv5_vpe is also introduced. This currently only contains a single field - bool resident - which is used to track if a VPE is currently running or not, and is used to avoid a case of double load or double put on the WFI path for a vCPU. This struct will be extended as additional GICv5 support is merged, specifically for VPE doorbells. Co-authored-by: Timothy Hayes Signed-off-by: Timothy Hayes Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-18-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/switch.c | 12 +++++ arch/arm64/kvm/vgic/vgic-mmio.c | 40 +++++++++++++--- arch/arm64/kvm/vgic/vgic-v5.c | 74 ++++++++++++++++++++++++++++++ arch/arm64/kvm/vgic/vgic.c | 74 ++++++++++++++++++++++++------ arch/arm64/kvm/vgic/vgic.h | 7 +++ include/kvm/arm_vgic.h | 2 + include/linux/irqchip/arm-gic-v5.h | 5 ++ 7 files changed, 193 insertions(+), 21 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c index b41485ce295a..a88da302b6d0 100644 --- a/arch/arm64/kvm/hyp/nvhe/switch.c +++ b/arch/arm64/kvm/hyp/nvhe/switch.c @@ -113,6 +113,12 @@ static void __deactivate_traps(struct kvm_vcpu *vcpu) /* Save VGICv3 state on non-VHE systems */ static void __hyp_vgic_save_state(struct kvm_vcpu *vcpu) { + if (vgic_is_v5(kern_hyp_va(vcpu->kvm))) { + __vgic_v5_save_state(&vcpu->arch.vgic_cpu.vgic_v5); + __vgic_v5_save_ppi_state(&vcpu->arch.vgic_cpu.vgic_v5); + return; + } + if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) { __vgic_v3_save_state(&vcpu->arch.vgic_cpu.vgic_v3); 
__vgic_v3_deactivate_traps(&vcpu->arch.vgic_cpu.vgic_v3); @@ -122,6 +128,12 @@ static void __hyp_vgic_save_state(struct kvm_vcpu *vcpu) /* Restore VGICv3 state on non-VHE systems */ static void __hyp_vgic_restore_state(struct kvm_vcpu *vcpu) { + if (vgic_is_v5(kern_hyp_va(vcpu->kvm))) { + __vgic_v5_restore_state(&vcpu->arch.vgic_cpu.vgic_v5); + __vgic_v5_restore_ppi_state(&vcpu->arch.vgic_cpu.vgic_v5); + return; + } + if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) { __vgic_v3_activate_traps(&vcpu->arch.vgic_cpu.vgic_v3); __vgic_v3_restore_state(&vcpu->arch.vgic_cpu.vgic_v3); diff --git a/arch/arm64/kvm/vgic/vgic-mmio.c b/arch/arm64/kvm/vgic/vgic-mmio.c index a573b1f0c6cb..74d76dec9730 100644 --- a/arch/arm64/kvm/vgic/vgic-mmio.c +++ b/arch/arm64/kvm/vgic/vgic-mmio.c @@ -842,18 +842,46 @@ vgic_find_mmio_region(const struct vgic_register_region *regions, void vgic_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr) { - if (kvm_vgic_global_state.type == VGIC_V2) - vgic_v2_set_vmcr(vcpu, vmcr); - else + const struct vgic_dist *dist = &vcpu->kvm->arch.vgic; + + switch (dist->vgic_model) { + case KVM_DEV_TYPE_ARM_VGIC_V5: + vgic_v5_set_vmcr(vcpu, vmcr); + break; + case KVM_DEV_TYPE_ARM_VGIC_V3: vgic_v3_set_vmcr(vcpu, vmcr); + break; + case KVM_DEV_TYPE_ARM_VGIC_V2: + if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) + vgic_v3_set_vmcr(vcpu, vmcr); + else + vgic_v2_set_vmcr(vcpu, vmcr); + break; + default: + BUG(); + } } void vgic_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr) { - if (kvm_vgic_global_state.type == VGIC_V2) - vgic_v2_get_vmcr(vcpu, vmcr); - else + const struct vgic_dist *dist = &vcpu->kvm->arch.vgic; + + switch (dist->vgic_model) { + case KVM_DEV_TYPE_ARM_VGIC_V5: + vgic_v5_get_vmcr(vcpu, vmcr); + break; + case KVM_DEV_TYPE_ARM_VGIC_V3: vgic_v3_get_vmcr(vcpu, vmcr); + break; + case KVM_DEV_TYPE_ARM_VGIC_V2: + if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) + vgic_v3_get_vmcr(vcpu, vmcr); + else + 
vgic_v2_get_vmcr(vcpu, vmcr); + break; + default: + BUG(); + } } /* diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index cf8382a954bb..41317e1d94a2 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -86,3 +86,77 @@ int vgic_v5_probe(const struct gic_kvm_info *info) return 0; } + +void vgic_v5_load(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + + /* + * On the WFI path, vgic_load is called a second time. The first is when + * scheduling in the vcpu thread again, and the second is when leaving + * WFI. Skip the second instance as it serves no purpose and just + * restores the same state again. + */ + if (cpu_if->gicv5_vpe.resident) + return; + + kvm_call_hyp(__vgic_v5_restore_vmcr_apr, cpu_if); + + cpu_if->gicv5_vpe.resident = true; +} + +void vgic_v5_put(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + + /* + * Do nothing if we're not resident. This can happen in the WFI path + * where we do a vgic_put in the WFI path and again later when + * descheduling the thread. We risk losing VMCR state if we sync it + * twice, so instead return early in this case. 
+ */ + if (!cpu_if->gicv5_vpe.resident) + return; + + kvm_call_hyp(__vgic_v5_save_apr, cpu_if); + + cpu_if->gicv5_vpe.resident = false; +} + +void vgic_v5_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcrp) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + u64 vmcr = cpu_if->vgic_vmcr; + + vmcrp->en = FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_EN, vmcr); + vmcrp->pmr = FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_VPMR, vmcr); +} + +void vgic_v5_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcrp) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + u64 vmcr; + + vmcr = FIELD_PREP(FEAT_GCIE_ICH_VMCR_EL2_VPMR, vmcrp->pmr) | + FIELD_PREP(FEAT_GCIE_ICH_VMCR_EL2_EN, vmcrp->en); + + cpu_if->vgic_vmcr = vmcr; +} + +void vgic_v5_restore_state(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + + __vgic_v5_restore_state(cpu_if); + __vgic_v5_restore_ppi_state(cpu_if); + dsb(sy); +} + +void vgic_v5_save_state(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + + __vgic_v5_save_state(cpu_if); + __vgic_v5_save_ppi_state(cpu_if); + dsb(sy); +} diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c index 2f3f892cbddc..84199d2df80a 100644 --- a/arch/arm64/kvm/vgic/vgic.c +++ b/arch/arm64/kvm/vgic/vgic.c @@ -1017,7 +1017,10 @@ static inline bool can_access_vgic_from_kernel(void) static inline void vgic_save_state(struct kvm_vcpu *vcpu) { - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) + /* No switch statement here. See comment in vgic_restore_state() */ + if (vgic_is_v5(vcpu->kvm)) + vgic_v5_save_state(vcpu); + else if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) vgic_v2_save_state(vcpu); else __vgic_v3_save_state(&vcpu->arch.vgic_cpu.vgic_v3); @@ -1026,14 +1029,16 @@ static inline void vgic_save_state(struct kvm_vcpu *vcpu) /* Sync back the hardware VGIC state into our emulation after a guest's run. 
*/ void kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu) { - /* If nesting, emulate the HW effect from L0 to L1 */ - if (vgic_state_is_nested(vcpu)) { - vgic_v3_sync_nested(vcpu); - return; - } + if (vgic_is_v3(vcpu->kvm)) { + /* If nesting, emulate the HW effect from L0 to L1 */ + if (vgic_state_is_nested(vcpu)) { + vgic_v3_sync_nested(vcpu); + return; + } - if (vcpu_has_nv(vcpu)) - vgic_v3_nested_update_mi(vcpu); + if (vcpu_has_nv(vcpu)) + vgic_v3_nested_update_mi(vcpu); + } if (can_access_vgic_from_kernel()) vgic_save_state(vcpu); @@ -1055,7 +1060,18 @@ void kvm_vgic_process_async_update(struct kvm_vcpu *vcpu) static inline void vgic_restore_state(struct kvm_vcpu *vcpu) { - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) + /* + * As nice as it would be to restructure this code into a switch + * statement as can be found elsewhere, the logic quickly gets ugly. + * + * __vgic_v3_restore_state() is doing a lot of heavy lifting here. It is + * required for GICv3-on-GICv3, GICv2-on-GICv3, GICv3-on-GICv5, and the + * no-in-kernel-irqchip case on GICv3 hardware. Hence, adding a switch + * here results in much more complex code. 
+ */ + if (vgic_is_v5(vcpu->kvm)) + vgic_v5_restore_state(vcpu); + else if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) vgic_v2_restore_state(vcpu); else __vgic_v3_restore_state(&vcpu->arch.vgic_cpu.vgic_v3); @@ -1109,30 +1125,58 @@ void kvm_vgic_flush_hwstate(struct kvm_vcpu *vcpu) void kvm_vgic_load(struct kvm_vcpu *vcpu) { + const struct vgic_dist *dist = &vcpu->kvm->arch.vgic; + if (unlikely(!irqchip_in_kernel(vcpu->kvm) || !vgic_initialized(vcpu->kvm))) { if (has_vhe() && static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) __vgic_v3_activate_traps(&vcpu->arch.vgic_cpu.vgic_v3); return; } - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) - vgic_v2_load(vcpu); - else + switch (dist->vgic_model) { + case KVM_DEV_TYPE_ARM_VGIC_V5: + vgic_v5_load(vcpu); + break; + case KVM_DEV_TYPE_ARM_VGIC_V3: vgic_v3_load(vcpu); + break; + case KVM_DEV_TYPE_ARM_VGIC_V2: + if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) + vgic_v3_load(vcpu); + else + vgic_v2_load(vcpu); + break; + default: + BUG(); + } } void kvm_vgic_put(struct kvm_vcpu *vcpu) { + const struct vgic_dist *dist = &vcpu->kvm->arch.vgic; + if (unlikely(!irqchip_in_kernel(vcpu->kvm) || !vgic_initialized(vcpu->kvm))) { if (has_vhe() && static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) __vgic_v3_deactivate_traps(&vcpu->arch.vgic_cpu.vgic_v3); return; } - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) - vgic_v2_put(vcpu); - else + switch (dist->vgic_model) { + case KVM_DEV_TYPE_ARM_VGIC_V5: + vgic_v5_put(vcpu); + break; + case KVM_DEV_TYPE_ARM_VGIC_V3: vgic_v3_put(vcpu); + break; + case KVM_DEV_TYPE_ARM_VGIC_V2: + if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) + vgic_v3_put(vcpu); + else + vgic_v2_put(vcpu); + break; + default: + BUG(); + } } int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu) diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h index 7b7eed69d797..cc487a69d038 100644 --- 
a/arch/arm64/kvm/vgic/vgic.h +++ b/arch/arm64/kvm/vgic/vgic.h @@ -187,6 +187,7 @@ static inline u64 vgic_ich_hcr_trap_bits(void) * registers regardless of the hardware backed GIC used. */ struct vgic_vmcr { + u32 en; /* GICv5-specific */ u32 grpen0; u32 grpen1; @@ -363,6 +364,12 @@ void vgic_debug_init(struct kvm *kvm); void vgic_debug_destroy(struct kvm *kvm); int vgic_v5_probe(const struct gic_kvm_info *info); +void vgic_v5_load(struct kvm_vcpu *vcpu); +void vgic_v5_put(struct kvm_vcpu *vcpu); +void vgic_v5_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); +void vgic_v5_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); +void vgic_v5_restore_state(struct kvm_vcpu *vcpu); +void vgic_v5_save_state(struct kvm_vcpu *vcpu); static inline int vgic_v3_max_apr_idx(struct kvm_vcpu *vcpu) { diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index 07e394690dcc..b27bfc463a31 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -447,6 +447,8 @@ struct vgic_v5_cpu_if { * it is the hyp's responsibility to keep the state constistent. */ u64 vgic_icsr; + + struct gicv5_vpe gicv5_vpe; }; /* What PPI capabilities does a GICv5 host have */ diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h index b1566a7c93ec..40d2fce68294 100644 --- a/include/linux/irqchip/arm-gic-v5.h +++ b/include/linux/irqchip/arm-gic-v5.h @@ -387,6 +387,11 @@ int gicv5_spi_irq_set_type(struct irq_data *d, unsigned int type); int gicv5_irs_iste_alloc(u32 lpi); void gicv5_irs_syncr(void); +/* Embedded in kvm.arch */ +struct gicv5_vpe { + bool resident; +}; + struct gicv5_its_devtab_cfg { union { struct { From 8f1fbe2fd279240d6999e3a975d0a51d816e080a Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:54:23 +0000 Subject: [PATCH 159/373] KVM: arm64: gic-v5: Finalize GICv5 PPIs and generate mask We only want to expose a subset of the PPIs to a guest. 
If a PPI does not have an owner, it is not being actively driven by a device. The SW_PPI is a special case, as it is likely for userspace to wish to inject that. Therefore, just prior to running the guest for the first time, we need to finalize the PPIs. A mask is generated which, when combined with trapping a guest's PPI accesses, allows for the guest's view of the PPI to be filtered. This mask is global to the VM as all VCPUs' PPI configurations must match. In addition, the PPI HMR is calculated. Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-19-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/arm.c | 4 ++++ arch/arm64/kvm/vgic/vgic-v5.c | 35 +++++++++++++++++++++++++++++++++++ include/kvm/arm_vgic.h | 24 ++++++++++++++++++++++++ 3 files changed, 63 insertions(+) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index aa69fd5b372f..5bbc1adb705e 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -934,6 +934,10 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) return ret; } + ret = vgic_v5_finalize_ppi_state(kvm); + if (ret) + return ret; + if (is_protected_kvm_enabled()) { ret = pkvm_create_hyp_vm(kvm); if (ret) diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 41317e1d94a2..07f416fbc4bc 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -87,6 +87,41 @@ int vgic_v5_probe(const struct gic_kvm_info *info) return 0; } +int vgic_v5_finalize_ppi_state(struct kvm *kvm) +{ + struct kvm_vcpu *vcpu0; + int i; + + if (!vgic_is_v5(kvm)) + return 0; + + /* The PPI state for all VCPUs should be the same. Pick the first. 
*/ + vcpu0 = kvm_get_vcpu(kvm, 0); + + bitmap_zero(kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS); + bitmap_zero(kvm->arch.vgic.gicv5_vm.vgic_ppi_hmr, VGIC_V5_NR_PRIVATE_IRQS); + + for_each_set_bit(i, ppi_caps.impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS) { + const u32 intid = vgic_v5_make_ppi(i); + struct vgic_irq *irq; + + irq = vgic_get_vcpu_irq(vcpu0, intid); + + /* Expose PPIs with an owner or the SW_PPI, only */ + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) { + if (irq->owner || i == GICV5_ARCH_PPI_SW_PPI) { + __assign_bit(i, kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, 1); + __assign_bit(i, kvm->arch.vgic.gicv5_vm.vgic_ppi_hmr, + irq->config == VGIC_CONFIG_LEVEL); + } + } + + vgic_put_irq(vcpu0->kvm, irq); + } + + return 0; +} + void vgic_v5_load(struct kvm_vcpu *vcpu) { struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index b27bfc463a31..fdad0263499b 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -326,6 +326,23 @@ struct vgic_redist_region { struct list_head list; }; +struct vgic_v5_vm { + /* + * We only expose a subset of PPIs to the guest. This subset is a + * combination of the PPIs that are actually implemented and what we + * actually choose to expose. + */ + DECLARE_BITMAP(vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS); + + /* + * The HMR itself is handled by the hardware, but we still need to have + * a mask that we can use when merging in pending state (only the state + * of Edge PPIs is merged back in from the guest and the HMR provides a + * convenient way to do that). + */ + DECLARE_BITMAP(vgic_ppi_hmr, VGIC_V5_NR_PRIVATE_IRQS); +}; + struct vgic_dist { bool in_kernel; bool ready; @@ -398,6 +415,11 @@ struct vgic_dist { * else. */ struct its_vm its_vm; + + /* + * GICv5 per-VM data. 
+ */ + struct vgic_v5_vm gicv5_vm; }; struct vgic_v2_cpu_if { @@ -588,6 +610,8 @@ int vgic_v4_load(struct kvm_vcpu *vcpu); void vgic_v4_commit(struct kvm_vcpu *vcpu); int vgic_v4_put(struct kvm_vcpu *vcpu); +int vgic_v5_finalize_ppi_state(struct kvm *kvm); + bool vgic_state_is_nested(struct kvm_vcpu *vcpu); /* CPU HP callbacks */ From 4a9a32d3538a9d800067be113b0196271a478c6a Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:54:39 +0000 Subject: [PATCH 160/373] KVM: arm64: gic: Introduce queue_irq_unlock to irq_ops There are times when the default behaviour of vgic_queue_irq_unlock() is undesirable. This is because some GICs, such as GICv5 which is the main driver for this change, handle the majority of the interrupt lifecycle in hardware. In this case, there is no need for a per-VCPU AP list as the interrupt can be made pending directly. This is done either via the ICH_PPI_x_EL2 registers for PPIs, or with the VDPEND system instruction for SPIs and LPIs. The vgic_queue_irq_unlock() function is made overridable using a new function pointer in struct irq_ops. vgic_queue_irq_unlock() is overridden if the function pointer is non-null. This new irq_op is unused in this change - it is purely providing the infrastructure itself. The subsequent PPI injection changes provide a demonstration of the usage of the queue_irq_unlock irq_op. 
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-20-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic.c | 3 +++ include/kvm/arm_vgic.h | 8 ++++++++ 2 files changed, 11 insertions(+) diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c index 84199d2df80a..c46c0e1db436 100644 --- a/arch/arm64/kvm/vgic/vgic.c +++ b/arch/arm64/kvm/vgic/vgic.c @@ -404,6 +404,9 @@ bool vgic_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq, lockdep_assert_held(&irq->irq_lock); + if (irq->ops && irq->ops->queue_irq_unlock) + return irq->ops->queue_irq_unlock(kvm, irq, flags); + retry: vcpu = vgic_target_oracle(irq); if (irq->vcpu || !vcpu) { diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index fdad0263499b..e9797c5dbbf0 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -189,6 +189,8 @@ enum vgic_irq_config { VGIC_CONFIG_LEVEL }; +struct vgic_irq; + /* * Per-irq ops overriding some common behaviours. * @@ -207,6 +209,12 @@ struct irq_ops { * peeking into the physical GIC. */ bool (*get_input_level)(int vintid); + + /* + * Function pointer to override the queuing of an IRQ. + */ + bool (*queue_irq_unlock)(struct kvm *kvm, struct vgic_irq *irq, + unsigned long flags) __releases(&irq->irq_lock); }; struct vgic_irq { From 4d591252bacb2d004b7c7f5db439bfa23b552ee7 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:54:55 +0000 Subject: [PATCH 161/373] KVM: arm64: gic-v5: Implement PPI interrupt injection This change introduces interrupt injection for PPIs for GICv5-based guests. The lifecycle of PPIs is largely managed by the hardware for a GICv5 system. The hypervisor injects pending state into the guest by using the ICH_PPI_PENDRx_EL2 registers. These are used by the hardware to pick a Highest Priority Pending Interrupt (HPPI) for the guest based on the enable state of each individual interrupt. 
The enable state and priority for each interrupt are provided by the guest itself (through writes to the PPI registers). When Direct Virtual Interrupt (DVI) is set for a particular PPI, the hypervisor is even able to skip the injection of the pending state altogether - it all happens in hardware. The result of the above is that no AP lists are required for GICv5, unlike for older GICs. Instead, for PPIs the ICH_PPI_* registers fulfil the same purpose for all 128 PPIs. Hence, as long as the ICH_PPI_* registers are populated prior to guest entry, and merged back into the KVM shadow state on exit, the PPI state is preserved, and interrupts can be injected. When injecting the state of a PPI, the state is merged into the PPI-specific vgic_irq structure. The PPIs are made pending via the ICH_PPI_PENDRx_EL2 registers, the value of which is generated from the vgic_irq structures for each PPI exposed on guest entry. The queue_irq_unlock() irq_op is required to kick the vCPU to ensure that it sees the new state. The result is that no AP lists are used for private interrupts on GICv5. Prior to entering the guest, vgic_v5_flush_ppi_state() is called from kvm_vgic_flush_hwstate(). This generates the pending state to inject into the guest, and snapshots it (twice - an entry and an exit copy) in order to track any changes. These changes can come from a guest consuming an interrupt or from a guest making an Edge-triggered interrupt pending. When returning from running a guest, the guest's PPI state is merged back into KVM's vgic_irq state in vgic_v5_merge_ppi_state() from kvm_vgic_sync_hwstate(). The Enable and Active state is synced back for all PPIs, and the pending state is synced back for Edge PPIs (Level is driven directly by the devices generating said levels). The incoming pending state from the guest is merged with KVM's shadow state to avoid losing any incoming interrupts.
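The entry/exit snapshot bookkeeping described above can be modelled in isolation. A hedged sketch, using plain 64-bit words rather than the kernel's bitmap API (struct and function names are invented): only bits that changed while the guest ran are folded back, and the exit state is OR-merged for Edge interrupts so that a pending bit raised in the shadow state meanwhile is never lost:

```c
#include <assert.h>
#include <stdint.h>

/* Pending-state snapshots taken around a guest run (one bit per PPI). */
struct ppi_snapshot {
    uint64_t pendr_entry; /* state injected on guest entry */
    uint64_t pendr_exit;  /* state read back on guest exit */
};

/*
 * Fold the guest's exit state back into the hypervisor's shadow
 * pending bitmap. XOR of the two snapshots isolates the bits that
 * changed while the guest ran; OR-merging (rather than assigning)
 * the exit state for Edge interrupts avoids losing an edge that
 * became pending in the shadow state concurrently.
 */
static uint64_t fold_ppi_pending(const struct ppi_snapshot *snap,
                                 uint64_t shadow_pending,
                                 uint64_t edge_mask)
{
    uint64_t changed = snap->pendr_entry ^ snap->pendr_exit;

    return shadow_pending | (changed & snap->pendr_exit & edge_mask);
}
```

For example, if the guest consumed the PPI injected at entry (bit 0) and raised a new edge (bit 2) while a third edge (bit 1) landed in the shadow state, all of bit 1 and bit 2 survive the fold.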
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-21-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-v5.c | 137 ++++++++++++++++++++++++++++++++++ arch/arm64/kvm/vgic/vgic.c | 41 ++++++++-- arch/arm64/kvm/vgic/vgic.h | 25 ++++--- 3 files changed, 188 insertions(+), 15 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 07f416fbc4bc..014d6fc1d44a 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -122,6 +122,143 @@ int vgic_v5_finalize_ppi_state(struct kvm *kvm) return 0; } +/* + * For GICv5, the PPIs are mostly directly managed by the hardware. We (the + * hypervisor) handle the pending, active, enable state save/restore, but don't + * need the PPIs to be queued on a per-VCPU AP list. Therefore, sanity check the + * state, unlock, and return. + */ +static bool vgic_v5_ppi_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq, + unsigned long flags) + __releases(&irq->irq_lock) +{ + struct kvm_vcpu *vcpu; + + lockdep_assert_held(&irq->irq_lock); + + if (WARN_ON_ONCE(!__irq_is_ppi(KVM_DEV_TYPE_ARM_VGIC_V5, irq->intid))) + goto out_unlock_fail; + + vcpu = irq->target_vcpu; + if (WARN_ON_ONCE(!vcpu)) + goto out_unlock_fail; + + raw_spin_unlock_irqrestore(&irq->irq_lock, flags); + + /* Directly kick the target VCPU to make sure it sees the IRQ */ + kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu); + kvm_vcpu_kick(vcpu); + + return true; + +out_unlock_fail: + raw_spin_unlock_irqrestore(&irq->irq_lock, flags); + + return false; +} + +static struct irq_ops vgic_v5_ppi_irq_ops = { + .queue_irq_unlock = vgic_v5_ppi_queue_irq_unlock, +}; + +void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid) +{ + kvm_vgic_set_irq_ops(vcpu, vintid, &vgic_v5_ppi_irq_ops); +} + +/* + * Detect any PPIs state changes, and propagate the state with KVM's + * shadow structures. 
+ */ +void vgic_v5_fold_ppi_state(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + DECLARE_BITMAP(changed_active, VGIC_V5_NR_PRIVATE_IRQS); + DECLARE_BITMAP(changed_pending, VGIC_V5_NR_PRIVATE_IRQS); + DECLARE_BITMAP(changed_bits, VGIC_V5_NR_PRIVATE_IRQS); + unsigned long *activer, *pendr_entry, *pendr; + int i; + + activer = host_data_ptr(vgic_v5_ppi_state)->activer_exit; + pendr_entry = host_data_ptr(vgic_v5_ppi_state)->pendr_entry; + pendr = host_data_ptr(vgic_v5_ppi_state)->pendr_exit; + + bitmap_xor(changed_active, cpu_if->vgic_ppi_activer, activer, + VGIC_V5_NR_PRIVATE_IRQS); + bitmap_xor(changed_pending, pendr_entry, pendr, + VGIC_V5_NR_PRIVATE_IRQS); + bitmap_or(changed_bits, changed_active, changed_pending, + VGIC_V5_NR_PRIVATE_IRQS); + + for_each_set_bit(i, changed_bits, VGIC_V5_NR_PRIVATE_IRQS) { + u32 intid = vgic_v5_make_ppi(i); + struct vgic_irq *irq; + + irq = vgic_get_vcpu_irq(vcpu, intid); + + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) { + irq->active = test_bit(i, activer); + + /* This is an OR to avoid losing incoming edges! */ + if (irq->config == VGIC_CONFIG_EDGE) + irq->pending_latch |= test_bit(i, pendr); + } + + vgic_put_irq(vcpu->kvm, irq); + } + + /* + * Re-inject the exit state as entry state next time! + * + * Note that the write of the Enable state is trapped, and hence there + * is nothing to explicitly sync back here as we already have the latest + * copy by definition. + */ + bitmap_copy(cpu_if->vgic_ppi_activer, activer, VGIC_V5_NR_PRIVATE_IRQS); +} + +void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu) +{ + DECLARE_BITMAP(pendr, VGIC_V5_NR_PRIVATE_IRQS); + int i; + + /* + * Time to enter the guest - we first need to build the guest's + * ICC_PPI_PENDRx_EL1, however.
+ */ + bitmap_zero(pendr, VGIC_V5_NR_PRIVATE_IRQS); + for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, + VGIC_V5_NR_PRIVATE_IRQS) { + u32 intid = vgic_v5_make_ppi(i); + struct vgic_irq *irq; + + irq = vgic_get_vcpu_irq(vcpu, intid); + + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) + __assign_bit(i, pendr, irq_is_pending(irq)); + + vgic_put_irq(vcpu->kvm, irq); + } + + /* + * Copy the shadow state to the pending reg that will be written to the + * ICH_PPI_PENDRx_EL2 regs. While the guest is running we track any + * incoming changes to the pending state in the vgic_irq structures. The + * incoming changes are merged with the outgoing changes on the return + * path. + */ + bitmap_copy(host_data_ptr(vgic_v5_ppi_state)->pendr_entry, pendr, + VGIC_V5_NR_PRIVATE_IRQS); + + /* + * Make sure that we can correctly detect "edges" in the PPI + * state. There's a path where we never actually enter the guest, and + * failure to do this risks losing pending state + */ + bitmap_copy(host_data_ptr(vgic_v5_ppi_state)->pendr_exit, pendr, + VGIC_V5_NR_PRIVATE_IRQS); +} + void vgic_v5_load(struct kvm_vcpu *vcpu) { struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c index c46c0e1db436..485a9a3fab8d 100644 --- a/arch/arm64/kvm/vgic/vgic.c +++ b/arch/arm64/kvm/vgic/vgic.c @@ -105,6 +105,18 @@ struct vgic_irq *vgic_get_vcpu_irq(struct kvm_vcpu *vcpu, u32 intid) if (WARN_ON(!vcpu)) return NULL; + if (vgic_is_v5(vcpu->kvm)) { + u32 int_num, hwirq_id; + + if (!__irq_is_ppi(KVM_DEV_TYPE_ARM_VGIC_V5, intid)) + return NULL; + + hwirq_id = FIELD_GET(GICV5_HWIRQ_ID, intid); + int_num = array_index_nospec(hwirq_id, VGIC_V5_NR_PRIVATE_IRQS); + + return &vcpu->arch.vgic_cpu.private_irqs[int_num]; + } + /* SGIs and PPIs */ if (intid < VGIC_NR_PRIVATE_IRQS) { intid = array_index_nospec(intid, VGIC_NR_PRIVATE_IRQS); @@ -830,8 +842,13 @@ retry: vgic_release_deleted_lpis(vcpu->kvm); } -static inline void 
vgic_fold_lr_state(struct kvm_vcpu *vcpu) +static void vgic_fold_state(struct kvm_vcpu *vcpu) { + if (vgic_is_v5(vcpu->kvm)) { + vgic_v5_fold_ppi_state(vcpu); + return; + } + if (!*host_data_ptr(last_lr_irq)) return; @@ -1046,8 +1063,10 @@ void kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu) if (can_access_vgic_from_kernel()) vgic_save_state(vcpu); - vgic_fold_lr_state(vcpu); - vgic_prune_ap_list(vcpu); + vgic_fold_state(vcpu); + + if (!vgic_is_v5(vcpu->kvm)) + vgic_prune_ap_list(vcpu); } /* Sync interrupts that were deactivated through a DIR trap */ @@ -1080,6 +1099,17 @@ static inline void vgic_restore_state(struct kvm_vcpu *vcpu) __vgic_v3_restore_state(&vcpu->arch.vgic_cpu.vgic_v3); } +static void vgic_flush_state(struct kvm_vcpu *vcpu) +{ + if (vgic_is_v5(vcpu->kvm)) { + vgic_v5_flush_ppi_state(vcpu); + return; + } + + scoped_guard(raw_spinlock, &vcpu->arch.vgic_cpu.ap_list_lock) + vgic_flush_lr_state(vcpu); +} + /* Flush our emulation state into the GIC hardware before entering the guest. 
*/ void kvm_vgic_flush_hwstate(struct kvm_vcpu *vcpu) { @@ -1116,13 +1146,12 @@ void kvm_vgic_flush_hwstate(struct kvm_vcpu *vcpu) DEBUG_SPINLOCK_BUG_ON(!irqs_disabled()); - scoped_guard(raw_spinlock, &vcpu->arch.vgic_cpu.ap_list_lock) - vgic_flush_lr_state(vcpu); + vgic_flush_state(vcpu); if (can_access_vgic_from_kernel()) vgic_restore_state(vcpu); - if (vgic_supports_direct_irqs(vcpu->kvm)) + if (vgic_supports_direct_irqs(vcpu->kvm) && kvm_vgic_global_state.has_gicv4) vgic_v4_commit(vcpu); } diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h index cc487a69d038..d90af676d5d0 100644 --- a/arch/arm64/kvm/vgic/vgic.h +++ b/arch/arm64/kvm/vgic/vgic.h @@ -364,6 +364,9 @@ void vgic_debug_init(struct kvm *kvm); void vgic_debug_destroy(struct kvm *kvm); int vgic_v5_probe(const struct gic_kvm_info *info); +void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid); +void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu); +void vgic_v5_fold_ppi_state(struct kvm_vcpu *vcpu); void vgic_v5_load(struct kvm_vcpu *vcpu); void vgic_v5_put(struct kvm_vcpu *vcpu); void vgic_v5_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); @@ -432,15 +435,6 @@ void vgic_its_invalidate_all_caches(struct kvm *kvm); int vgic_its_inv_lpi(struct kvm *kvm, struct vgic_irq *irq); int vgic_its_invall(struct kvm_vcpu *vcpu); -bool system_supports_direct_sgis(void); -bool vgic_supports_direct_msis(struct kvm *kvm); -bool vgic_supports_direct_sgis(struct kvm *kvm); - -static inline bool vgic_supports_direct_irqs(struct kvm *kvm) -{ - return vgic_supports_direct_msis(kvm) || vgic_supports_direct_sgis(kvm); -} - int vgic_v4_init(struct kvm *kvm); void vgic_v4_teardown(struct kvm *kvm); void vgic_v4_configure_vsgis(struct kvm *kvm); @@ -481,6 +475,19 @@ static inline bool vgic_host_has_gicv5(void) return kvm_vgic_global_state.type == VGIC_V5; } +bool system_supports_direct_sgis(void); +bool vgic_supports_direct_msis(struct kvm *kvm); +bool vgic_supports_direct_sgis(struct kvm *kvm); + 
+static inline bool vgic_supports_direct_irqs(struct kvm *kvm) +{ + /* GICv5 always supports direct IRQs */ + if (vgic_is_v5(kvm)) + return true; + + return vgic_supports_direct_msis(kvm) || vgic_supports_direct_sgis(kvm); +} + int vgic_its_debug_init(struct kvm_device *dev); void vgic_its_debug_destroy(struct kvm_device *dev); From da8d9636be7e0761f69c3dadf747c725732312ff Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:55:10 +0000 Subject: [PATCH 162/373] KVM: arm64: gic-v5: Init Private IRQs (PPIs) for GICv5 Initialise the private interrupts (PPIs, only) for GICv5. This means that a GICv5-style intid is generated (which encodes the PPI type in the top bits) instead of the 0-based index that is used for older GICs. Additionally, set all of the GICv5 PPIs to use Level for the handling mode, with the exception of the SW_PPI which uses Edge. This matches the architecturally-defined set in the GICv5 specification (the CTIIRQ handling mode is IMPDEF, so Level has been picked for that). 
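The commit message below notes that GICv5 intids encode the interrupt type in the top bits instead of using a 0-based index. A standalone sketch of that idea follows; the field positions and the PPI type value used here are illustrative assumptions, not taken from the GICv5 specification:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative intid layout only: a 3-bit type field in the top bits
 * and the hardware IRQ ID in the low bits. The exact widths and the
 * PPI type encoding are assumptions for this sketch.
 */
#define HWIRQ_TYPE_SHIFT 29
#define HWIRQ_TYPE_MASK  (0x7u << HWIRQ_TYPE_SHIFT)
#define HWIRQ_ID_MASK    0x00ffffffu
#define HWIRQ_TYPE_PPI   0x1u

/* Build a GICv5-style PPI intid from a 0-based PPI index. */
static uint32_t make_ppi_intid(uint32_t id)
{
    return (HWIRQ_TYPE_PPI << HWIRQ_TYPE_SHIFT) | (id & HWIRQ_ID_MASK);
}

/* Recover the raw hardware IRQ ID from an intid. */
static uint32_t hwirq_id(uint32_t intid)
{
    return intid & HWIRQ_ID_MASK;
}

/* Check whether an intid carries the PPI type encoding. */
static int is_ppi(uint32_t intid)
{
    return ((intid & HWIRQ_TYPE_MASK) >> HWIRQ_TYPE_SHIFT) == HWIRQ_TYPE_PPI;
}
```

The round trip (index to intid and back) is what lets common vgic code index the per-VCPU private IRQ array while the guest-visible intid stays typed.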
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-22-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-init.c | 96 +++++++++++++++++++++++---------- 1 file changed, 67 insertions(+), 29 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c index e1be9c5ada7b..e0366e8c144d 100644 --- a/arch/arm64/kvm/vgic/vgic-init.c +++ b/arch/arm64/kvm/vgic/vgic-init.c @@ -250,9 +250,65 @@ int kvm_vgic_vcpu_nv_init(struct kvm_vcpu *vcpu) return ret; } +static void vgic_allocate_private_irq(struct kvm_vcpu *vcpu, int i, u32 type) +{ + struct vgic_irq *irq = &vcpu->arch.vgic_cpu.private_irqs[i]; + + INIT_LIST_HEAD(&irq->ap_list); + raw_spin_lock_init(&irq->irq_lock); + irq->vcpu = NULL; + irq->target_vcpu = vcpu; + refcount_set(&irq->refcount, 0); + + irq->intid = i; + if (vgic_irq_is_sgi(i)) { + /* SGIs */ + irq->enabled = 1; + irq->config = VGIC_CONFIG_EDGE; + } else { + /* PPIs */ + irq->config = VGIC_CONFIG_LEVEL; + } + + switch (type) { + case KVM_DEV_TYPE_ARM_VGIC_V3: + irq->group = 1; + irq->mpidr = kvm_vcpu_get_mpidr_aff(vcpu); + break; + case KVM_DEV_TYPE_ARM_VGIC_V2: + irq->group = 0; + irq->targets = BIT(vcpu->vcpu_id); + break; + } +} + +static void vgic_v5_allocate_private_irq(struct kvm_vcpu *vcpu, int i, u32 type) +{ + struct vgic_irq *irq = &vcpu->arch.vgic_cpu.private_irqs[i]; + u32 intid = vgic_v5_make_ppi(i); + + INIT_LIST_HEAD(&irq->ap_list); + raw_spin_lock_init(&irq->irq_lock); + irq->vcpu = NULL; + irq->target_vcpu = vcpu; + refcount_set(&irq->refcount, 0); + + irq->intid = intid; + + /* The only Edge architected PPI is the SW_PPI */ + if (i == GICV5_ARCH_PPI_SW_PPI) + irq->config = VGIC_CONFIG_EDGE; + else + irq->config = VGIC_CONFIG_LEVEL; + + /* Register the GICv5-specific PPI ops */ + vgic_v5_set_ppi_ops(vcpu, intid); +} + static int vgic_allocate_private_irqs_locked(struct kvm_vcpu *vcpu, u32 type) { struct vgic_cpu 
*vgic_cpu = &vcpu->arch.vgic_cpu; + u32 num_private_irqs; int i; lockdep_assert_held(&vcpu->kvm->arch.config_lock); @@ -260,8 +316,13 @@ static int vgic_allocate_private_irqs_locked(struct kvm_vcpu *vcpu, u32 type) if (vgic_cpu->private_irqs) return 0; + if (vgic_is_v5(vcpu->kvm)) + num_private_irqs = VGIC_V5_NR_PRIVATE_IRQS; + else + num_private_irqs = VGIC_NR_PRIVATE_IRQS; + vgic_cpu->private_irqs = kzalloc_objs(struct vgic_irq, - VGIC_NR_PRIVATE_IRQS, + num_private_irqs, GFP_KERNEL_ACCOUNT); if (!vgic_cpu->private_irqs) @@ -271,34 +332,11 @@ static int vgic_allocate_private_irqs_locked(struct kvm_vcpu *vcpu, u32 type) * Enable and configure all SGIs to be edge-triggered and * configure all PPIs as level-triggered. */ - for (i = 0; i < VGIC_NR_PRIVATE_IRQS; i++) { - struct vgic_irq *irq = &vgic_cpu->private_irqs[i]; - - INIT_LIST_HEAD(&irq->ap_list); - raw_spin_lock_init(&irq->irq_lock); - irq->intid = i; - irq->vcpu = NULL; - irq->target_vcpu = vcpu; - refcount_set(&irq->refcount, 0); - if (vgic_irq_is_sgi(i)) { - /* SGIs */ - irq->enabled = 1; - irq->config = VGIC_CONFIG_EDGE; - } else { - /* PPIs */ - irq->config = VGIC_CONFIG_LEVEL; - } - - switch (type) { - case KVM_DEV_TYPE_ARM_VGIC_V3: - irq->group = 1; - irq->mpidr = kvm_vcpu_get_mpidr_aff(vcpu); - break; - case KVM_DEV_TYPE_ARM_VGIC_V2: - irq->group = 0; - irq->targets = BIT(vcpu->vcpu_id); - break; - } + for (i = 0; i < num_private_irqs; i++) { + if (vgic_is_v5(vcpu->kvm)) + vgic_v5_allocate_private_irq(vcpu, i, type); + else + vgic_allocate_private_irq(vcpu, i, type); } return 0; From f20554ad3ccd42397f863f6c41b43b831cf9b328 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:55:26 +0000 Subject: [PATCH 163/373] KVM: arm64: gic-v5: Clear TWI if single task running Handle GICv5 in kvm_vcpu_should_clear_twi(). Clear TWI if there is a single task running, and enable it otherwise. This is a sane default for GICv5 given the current level of support. 
Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260319154937.3619520-23-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/arm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 5bbc1adb705e..f68c4036afeb 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -613,6 +613,9 @@ static bool kvm_vcpu_should_clear_twi(struct kvm_vcpu *vcpu) if (unlikely(kvm_wfi_trap_policy != KVM_WFX_NOTRAP_SINGLE_TASK)) return kvm_wfi_trap_policy == KVM_WFX_NOTRAP; + if (vgic_is_v5(vcpu->kvm)) + return single_task_running(); + return single_task_running() && vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3 && (atomic_read(&vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count) || From 933e5288fa9714085e384a3d6ad6dcce8089a6b9 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:55:41 +0000 Subject: [PATCH 164/373] KVM: arm64: gic-v5: Check for pending PPIs This change allows KVM to check for pending PPI interrupts. This has two main components: First of all, the effective priority mask is calculated. This is a combination of the priority mask in the VPE's ICC_PCR_EL1.PRIORITY and the currently running priority as determined from the VPE's ICH_APR_EL1. If an interrupt's priority is greater than or equal to the effective priority mask, it can be signalled. Otherwise, it cannot. Secondly, any Enabled and Pending PPIs must be checked against this compound priority mask. This requires the PPI priorities to be synced back to the KVM shadow state on WFI entry - this is skipped in general operation as it isn't required and is rather expensive. If any Enabled and Pending PPIs are of sufficient priority to be signalled, then there are pending PPIs. Else, there are not. This ensures that a VPE is not woken when it cannot actually process the pending interrupts.
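The two-component mask computation can be sketched standalone. Register field extraction is elided and the priority semantics used here (0 is the highest priority, 32 levels, one active-priority bit per level in APR) are assumptions of this sketch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the effective priority mask: the min of the currently
 * running priority (trailing-zero count of the active-priority
 * register) and the VPE's priority mask register plus one.
 */
static unsigned int effective_priority_mask(int delivery_enabled,
                                            uint32_t apr,
                                            unsigned int vpmr)
{
    unsigned int highest_ap, masked;

    /* Delivery disabled: return 0 (the highest priority), masking all. */
    if (!delivery_enabled)
        return 0;

    /* Trailing-zero count gives the running priority; 32 == none active. */
    highest_ap = apr ? (unsigned int)__builtin_ctz(apr) : 32;

    /* Lowest priority that can be signalled: min(running, PMR + 1). */
    masked = vpmr + 1;
    return highest_ap < masked ? highest_ap : masked;
}

/*
 * An enabled, pending interrupt can be signalled if the mask is
 * non-zero and the interrupt's priority value is numerically no
 * greater than the mask (lower value == higher priority).
 */
static int ppi_can_signal(unsigned int irq_priority, unsigned int mask)
{
    return mask && irq_priority <= mask;
}
```

With nothing active and the mask fully open, everything is signallable; an active priority of 3 blocks anything at priority 4 or below.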
As the PPI priorities are not synced back to the KVM shadow state on every guest exit, they must be synced prior to checking if there are pending interrupts for the guest. The sync itself happens in vgic_v5_put() if, and only if, the vcpu is entering WFI as this is the only case where it is not planned to run the vcpu thread again. If the vcpu enters WFI, the vcpu thread will be descheduled and won't be rescheduled again until it has a pending interrupt, which is checked from kvm_arch_vcpu_runnable(). Signed-off-by: Sascha Bischoff Reviewed-by: Joey Gouly Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-24-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-v5.c | 113 ++++++++++++++++++++++++++++++++++ arch/arm64/kvm/vgic/vgic.c | 3 + arch/arm64/kvm/vgic/vgic.h | 1 + 3 files changed, 117 insertions(+) diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 014d6fc1d44a..f0600be619b1 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -122,6 +122,37 @@ int vgic_v5_finalize_ppi_state(struct kvm *kvm) return 0; } +static u32 vgic_v5_get_effective_priority_mask(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + u32 highest_ap, priority_mask; + + /* + * If the guest's CPU has not opted to receive interrupts, then the + * effective running priority is the highest priority. Just return 0 + * (the highest priority). + */ + if (!FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_EN, cpu_if->vgic_vmcr)) + return 0; + + /* + * Counting the number of trailing zeros gives the current active + * priority. Explicitly use the 32-bit version here as we have 32 + * priorities. 32 then means that there are no active priorities. + */ + highest_ap = cpu_if->vgic_apr ? __builtin_ctz(cpu_if->vgic_apr) : 32; + + /* + * An interrupt is of sufficient priority if it is equal to or + * greater than the priority mask.
Add 1 to the priority mask + * (i.e., lower priority) to match the APR logic before taking + * the min. This gives us the lowest priority that is masked. + */ + priority_mask = FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_VPMR, cpu_if->vgic_vmcr); + + return min(highest_ap, priority_mask + 1); +} + /* * For GICv5, the PPIs are mostly directly managed by the hardware. We (the * hypervisor) handle the pending, active, enable state save/restore, but don't @@ -166,6 +197,84 @@ void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid) kvm_vgic_set_irq_ops(vcpu, vintid, &vgic_v5_ppi_irq_ops); } +/* + * Sync back the PPI priorities to the vgic_irq shadow state for any interrupts + * exposed to the guest (skipping all others). + */ +static void vgic_v5_sync_ppi_priorities(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + u64 priorityr; + int i; + + /* + * We have up to 16 PPI Priority regs, but only have a few interrupts + * that the guest is allowed to use. Limit our sync of PPI priorities to + * those actually exposed to the guest by first iterating over the mask + * of exposed PPIs. + */ + for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS) { + u32 intid = vgic_v5_make_ppi(i); + struct vgic_irq *irq; + int pri_idx, pri_reg, pri_bit; + u8 priority; + + /* + * Determine which priority register and the field within it to + * extract. 
+ */ + pri_reg = i / 8; + pri_idx = i % 8; + pri_bit = pri_idx * 8; + + priorityr = cpu_if->vgic_ppi_priorityr[pri_reg]; + priority = field_get(GENMASK(pri_bit + 4, pri_bit), priorityr); + + irq = vgic_get_vcpu_irq(vcpu, intid); + + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) + irq->priority = priority; + + vgic_put_irq(vcpu->kvm, irq); + } +} + +bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu) +{ + unsigned int priority_mask; + int i; + + priority_mask = vgic_v5_get_effective_priority_mask(vcpu); + + /* + * If the combined priority mask is 0, nothing can be signalled! In the + * case where the guest has disabled interrupt delivery for the vcpu + * (via ICV_CR0_EL1.EN->ICH_VMCR_EL2.EN), we calculate the priority mask + * as 0 too (the highest possible priority). + */ + if (!priority_mask) + return false; + + for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS) { + u32 intid = vgic_v5_make_ppi(i); + bool has_pending = false; + struct vgic_irq *irq; + + irq = vgic_get_vcpu_irq(vcpu, intid); + + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) + has_pending = (irq->enabled && irq_is_pending(irq) && + irq->priority <= priority_mask); + + vgic_put_irq(vcpu->kvm, irq); + + if (has_pending) + return true; + } + + return false; +} + /* * Detect any PPIs state changes, and propagate the state with KVM's * shadow structures. 
@@ -293,6 +402,10 @@ void vgic_v5_put(struct kvm_vcpu *vcpu) kvm_call_hyp(__vgic_v5_save_apr, cpu_if); cpu_if->gicv5_vpe.resident = false; + + /* The shadow priority is only updated on entering WFI */ + if (vcpu_get_flag(vcpu, IN_WFI)) + vgic_v5_sync_ppi_priorities(vcpu); } void vgic_v5_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcrp) diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c index 485a9a3fab8d..d9ca5509147a 100644 --- a/arch/arm64/kvm/vgic/vgic.c +++ b/arch/arm64/kvm/vgic/vgic.c @@ -1219,6 +1219,9 @@ int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu) unsigned long flags; struct vgic_vmcr vmcr; + if (vgic_is_v5(vcpu->kvm)) + return vgic_v5_has_pending_ppi(vcpu); + if (!vcpu->kvm->arch.vgic.enabled) return false; diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h index d90af676d5d0..8f15f7472458 100644 --- a/arch/arm64/kvm/vgic/vgic.h +++ b/arch/arm64/kvm/vgic/vgic.h @@ -365,6 +365,7 @@ void vgic_debug_destroy(struct kvm *kvm); int vgic_v5_probe(const struct gic_kvm_info *info); void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid); +bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu); void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu); void vgic_v5_fold_ppi_state(struct kvm_vcpu *vcpu); void vgic_v5_load(struct kvm_vcpu *vcpu); From d1328c61511f6a2aeda48b8b9096e67d2443ec71 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:55:57 +0000 Subject: [PATCH 165/373] KVM: arm64: gic-v5: Trap and mask guest ICC_PPI_ENABLERx_EL1 writes A guest should not be able to detect if a PPI that is not exposed to the guest is implemented or not. Avoid the guest enabling any PPIs that are not implemented as far as the guest is concerned by trapping and masking writes to the two ICC_PPI_ENABLERx_EL1 registers. When a guest writes these registers, the write is masked with the set of PPIs actually exposed to the guest, and the state is written back to KVM's shadow state. 
As there is now no way for the guest to change the PPI enable state without it being trapped, saving of the PPI Enable state is dropped from guest exit. Reads for the above registers are not masked. When the guest is running and reads from the above registers, it is presented with what KVM provides in the ICH_PPI_ENABLERx_EL2 registers, which is the masked version of what the guest last wrote. Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-25-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_host.h | 1 - arch/arm64/kvm/config.c | 13 +++++++- arch/arm64/kvm/hyp/vgic-v5-sr.c | 4 --- arch/arm64/kvm/sys_regs.c | 50 +++++++++++++++++++++++++++++++ 4 files changed, 62 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index c4a172b70206..a7dc0aac3b93 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -814,7 +814,6 @@ struct kvm_host_data { /* The saved state of the regs when leaving the guest */ DECLARE_BITMAP(activer_exit, VGIC_V5_NR_PRIVATE_IRQS); - DECLARE_BITMAP(enabler_exit, VGIC_V5_NR_PRIVATE_IRQS); } vgic_v5_ppi_state; }; diff --git a/arch/arm64/kvm/config.c b/arch/arm64/kvm/config.c index 5663f25905e8..e14685343191 100644 --- a/arch/arm64/kvm/config.c +++ b/arch/arm64/kvm/config.c @@ -1699,6 +1699,17 @@ static void __compute_ich_hfgrtr(struct kvm_vcpu *vcpu) ICH_HFGRTR_EL2_ICC_IDRn_EL1); } +static void __compute_ich_hfgwtr(struct kvm_vcpu *vcpu) +{ + __compute_fgt(vcpu, ICH_HFGWTR_EL2); + + /* + * We present a different subset of PPIs to the guest from what + * exists in real hardware. We only trap writes, not reads.
+ */ + *vcpu_fgt(vcpu, ICH_HFGWTR_EL2) &= ~(ICH_HFGWTR_EL2_ICC_PPI_ENABLERn_EL1); +} + void kvm_vcpu_load_fgt(struct kvm_vcpu *vcpu) { if (!cpus_have_final_cap(ARM64_HAS_FGT)) @@ -1721,7 +1732,7 @@ void kvm_vcpu_load_fgt(struct kvm_vcpu *vcpu) if (cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF)) { __compute_ich_hfgrtr(vcpu); - __compute_fgt(vcpu, ICH_HFGWTR_EL2); + __compute_ich_hfgwtr(vcpu); __compute_fgt(vcpu, ICH_HFGITR_EL2); } } diff --git a/arch/arm64/kvm/hyp/vgic-v5-sr.c b/arch/arm64/kvm/hyp/vgic-v5-sr.c index f34ea219cc4e..2c4304ffa9f3 100644 --- a/arch/arm64/kvm/hyp/vgic-v5-sr.c +++ b/arch/arm64/kvm/hyp/vgic-v5-sr.c @@ -37,8 +37,6 @@ void __vgic_v5_save_ppi_state(struct vgic_v5_cpu_if *cpu_if) bitmap_write(host_data_ptr(vgic_v5_ppi_state)->activer_exit, read_sysreg_s(SYS_ICH_PPI_ACTIVER0_EL2), 0, 64); - bitmap_write(host_data_ptr(vgic_v5_ppi_state)->enabler_exit, - read_sysreg_s(SYS_ICH_PPI_ENABLER0_EL2), 0, 64); bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr_exit, read_sysreg_s(SYS_ICH_PPI_PENDR0_EL2), 0, 64); @@ -54,8 +52,6 @@ void __vgic_v5_save_ppi_state(struct vgic_v5_cpu_if *cpu_if) if (VGIC_V5_NR_PRIVATE_IRQS == 128) { bitmap_write(host_data_ptr(vgic_v5_ppi_state)->activer_exit, read_sysreg_s(SYS_ICH_PPI_ACTIVER1_EL2), 64, 64); - bitmap_write(host_data_ptr(vgic_v5_ppi_state)->enabler_exit, - read_sysreg_s(SYS_ICH_PPI_ENABLER1_EL2), 64, 64); bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr_exit, read_sysreg_s(SYS_ICH_PPI_PENDR1_EL2), 64, 64); diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 85300e76bbe4..e1001544d4f4 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -718,6 +718,54 @@ static bool access_gicv5_iaffid(struct kvm_vcpu *vcpu, struct sys_reg_params *p, return true; } +static bool access_gicv5_ppi_enabler(struct kvm_vcpu *vcpu, + struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + unsigned long *mask = vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask; + struct vgic_v5_cpu_if *cpu_if 
= &vcpu->arch.vgic_cpu.vgic_v5; + int i; + + /* We never expect to get here with a read! */ + if (WARN_ON_ONCE(!p->is_write)) + return undef_access(vcpu, p, r); + + /* + * If we're only handling architected PPIs and the guest writes to the + * enable for the non-architected PPIs, we just return as there's + * nothing to do at all. We don't even allocate the storage for them in + * this case. + */ + if (VGIC_V5_NR_PRIVATE_IRQS == 64 && p->Op2 % 2) + return true; + + /* + * Merge the raw guest write into our bitmap at an offset of either 0 or + * 64, then AND it with our PPI mask. + */ + bitmap_write(cpu_if->vgic_ppi_enabler, p->regval, 64 * (p->Op2 % 2), 64); + bitmap_and(cpu_if->vgic_ppi_enabler, cpu_if->vgic_ppi_enabler, mask, + VGIC_V5_NR_PRIVATE_IRQS); + + /* + * Sync the change in enable states to the vgic_irqs. We consider all + * PPIs as we don't expose many to the guest. + */ + for_each_set_bit(i, mask, VGIC_V5_NR_PRIVATE_IRQS) { + u32 intid = vgic_v5_make_ppi(i); + struct vgic_irq *irq; + + irq = vgic_get_vcpu_irq(vcpu, intid); + + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) + irq->enabled = test_bit(i, cpu_if->vgic_ppi_enabler); + + vgic_put_irq(vcpu->kvm, irq); + } + + return true; +} + static bool trap_raz_wi(struct kvm_vcpu *vcpu, struct sys_reg_params *p, const struct sys_reg_desc *r) @@ -3444,6 +3492,8 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_ICC_AP1R3_EL1), undef_access }, { SYS_DESC(SYS_ICC_IDR0_EL1), access_gicv5_idr0 }, { SYS_DESC(SYS_ICC_IAFFIDR_EL1), access_gicv5_iaffid }, + { SYS_DESC(SYS_ICC_PPI_ENABLER0_EL1), access_gicv5_ppi_enabler }, + { SYS_DESC(SYS_ICC_PPI_ENABLER1_EL1), access_gicv5_ppi_enabler }, { SYS_DESC(SYS_ICC_DIR_EL1), access_gic_dir }, { SYS_DESC(SYS_ICC_RPR_EL1), undef_access }, { SYS_DESC(SYS_ICC_SGI1R_EL1), access_gic_sgi }, From 4a5444d23979b69e466f8080477112c264f194f2 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:56:12 +0000 Subject: [PATCH 166/373] KVM: arm64:
Introduce set_direct_injection irq_op GICv5 adds support for directly injected PPIs. The mechanism for setting this up is GICv5 specific, so rather than adding GICv5-specific code to the common vgic code, we introduce a new irq_op. This new irq_op is intended to be used to enable or disable direct injection for interrupts that support it. As it is an irq_op, it has no effect unless explicitly populated in the irq_ops structure for a particular interrupt. The usage is demonstrated in the subsequent change. Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260319154937.3619520-26-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic.c | 7 +++++++ include/kvm/arm_vgic.h | 7 +++++++ 2 files changed, 14 insertions(+) diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c index d9ca5509147a..9ac0ff60aa8a 100644 --- a/arch/arm64/kvm/vgic/vgic.c +++ b/arch/arm64/kvm/vgic/vgic.c @@ -608,12 +608,19 @@ static int kvm_vgic_map_irq(struct kvm_vcpu *vcpu, struct vgic_irq *irq, irq->hw = true; irq->host_irq = host_irq; irq->hwintid = data->hwirq; + + if (irq->ops && irq->ops->set_direct_injection) + irq->ops->set_direct_injection(vcpu, irq, true); + return 0; } /* @irq->irq_lock must be held */ static inline void kvm_vgic_unmap_irq(struct vgic_irq *irq) { + if (irq->ops && irq->ops->set_direct_injection) + irq->ops->set_direct_injection(irq->target_vcpu, irq, false); + irq->hw = false; irq->hwintid = 0; } diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index e9797c5dbbf0..a28cf765f3eb 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -215,6 +215,13 @@ struct irq_ops { */ bool (*queue_irq_unlock)(struct kvm *kvm, struct vgic_irq *irq, unsigned long flags) __releases(&irq->irq_lock); + + /* + * Callback function pointer to either enable or disable direct + * injection for a mapped interrupt.
+ */ + void (*set_direct_injection)(struct kvm_vcpu *vcpu, + struct vgic_irq *irq, bool direct); }; struct vgic_irq { From 5a98d0e17e59210b400734f2359c4453aab3af21 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:56:28 +0000 Subject: [PATCH 167/373] KVM: arm64: gic-v5: Implement direct injection of PPIs GICv5 is able to directly inject PPI pending state into a guest using a mechanism called DVI whereby the pending bit for a particular PPI is driven directly by the physically-connected hardware. This mechanism itself doesn't allow for any ID translation, so the host interrupt is directly mapped into a guest with the same interrupt ID. When mapping a virtual interrupt to a physical interrupt via kvm_vgic_map_irq for a GICv5 guest, check if the interrupt itself is a PPI or not. If it is, and the host's interrupt ID matches that used for the guest, DVI is enabled, and the interrupt itself is marked as directly_injected. When the interrupt is unmapped again, this process is reversed, and DVI is disabled for the interrupt again. Note: the expectation is that a directly injected PPI is disabled on the host while the guest state is loaded. The reason is that although DVI is enabled to drive the guest's pending state directly, the host pending state also remains driven. In order to avoid the same PPI firing on both the host and the guest, the host's interrupt must be disabled (masked). This is left up to the code that owns the device generating the PPI as this needs to be handled on a per-VM basis. One VM might use DVI, while another might not, in which case the physical PPI should be enabled for the latter.
Co-authored-by: Timothy Hayes Signed-off-by: Timothy Hayes Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-27-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-v5.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index f0600be619b1..b84324f0a311 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -188,8 +188,24 @@ out_unlock_fail: return false; } +/* + * Sets/clears the corresponding bit in the ICH_PPI_DVIR register. + */ +static void vgic_v5_set_ppi_dvi(struct kvm_vcpu *vcpu, struct vgic_irq *irq, + bool dvi) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + u32 ppi; + + lockdep_assert_held(&irq->irq_lock); + + ppi = vgic_v5_get_hwirq_id(irq->intid); + __assign_bit(ppi, cpu_if->vgic_ppi_dvir, dvi); +} + static struct irq_ops vgic_v5_ppi_irq_ops = { .queue_irq_unlock = vgic_v5_ppi_queue_irq_unlock, + .set_direct_injection = vgic_v5_set_ppi_dvi, }; void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid) From b88d05a893cb7c8a48d03ff93d4aca95a6165377 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:56:43 +0000 Subject: [PATCH 168/373] KVM: arm64: gic-v5: Support GICv5 interrupts with KVM_IRQ_LINE Interrupts under GICv5 look quite different to those from older Arm GICs. Specifically, the type is encoded in the top bits of the interrupt ID. Extend KVM_IRQ_LINE to cope with GICv5 PPIs and SPIs. This requires subtly changing the KVM_IRQ_LINE API for GICv5 guests. For older Arm GICs, PPIs had to be in the range of 16-31, and SPIs had to be 32-1019, but this no longer holds true for GICv5. Instead, for a GICv5 guest, support PPIs in the range of 0-127, and SPIs in the range 0-65535. The documentation is updated accordingly. 
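The "type in the top bits" encoding can be sketched in a few lines of userspace C. This is a hedged illustration only: the field positions below (type in bits [31:29], ID in the low 24 bits) and the type values (1 for PPI, 3 for SPI) are assumptions chosen to be consistent with the 0x20000017 architected PMU PPI constant seen later in this series, not the kernel's authoritative GICV5_HWIRQ_* layout.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical GICv5-style IntID layout: the interrupt type lives in
 * the top bits, the interrupt number in the low bits. The exact shift,
 * mask and type values are illustrative assumptions.
 */
#define HWIRQ_TYPE_SHIFT 29
#define HWIRQ_TYPE_PPI   1u
#define HWIRQ_TYPE_SPI   3u
#define HWIRQ_ID_MASK    0x00ffffffu

/* Build a full IntID from a bare PPI number, as KVM_IRQ_LINE would. */
static inline uint32_t make_ppi(uint32_t irq_num)
{
	return (HWIRQ_TYPE_PPI << HWIRQ_TYPE_SHIFT) | (irq_num & HWIRQ_ID_MASK);
}

/* Likewise for SPIs; the 16-bit KVM_IRQ_LINE field fits in the ID mask. */
static inline uint32_t make_spi(uint32_t irq_num)
{
	return (HWIRQ_TYPE_SPI << HWIRQ_TYPE_SHIFT) | (irq_num & HWIRQ_ID_MASK);
}
```

With this layout, PPI 23 encodes to 0x20000017, matching the architected PMU interrupt constant used elsewhere in the series.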
The SPI range doesn't cover the full SPI range that a GICv5 system can potentially cope with (GICv5 provides up to 24-bits of SPI ID space, and we only have 16 bits to work with in KVM_IRQ_LINE). However, 65k SPIs is more than would be reasonably expected on systems for years to come. In order to use vgic_is_v5(), the kvm/arm_vgic.h header is added to kvm/arm.c. Note: As the GICv5 KVM implementation currently doesn't support injecting SPIs, attempts to do so will fail. This restriction will be lifted as the GICv5 KVM support evolves. Co-authored-by: Timothy Hayes Signed-off-by: Timothy Hayes Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-28-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- Documentation/virt/kvm/api.rst | 6 ++++-- arch/arm64/kvm/arm.c | 22 +++++++++++++++++++--- arch/arm64/kvm/vgic/vgic.c | 4 ++++ 3 files changed, 27 insertions(+), 5 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 032516783e96..03d87d9b97d9 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -907,10 +907,12 @@ The irq_type field has the following values: - KVM_ARM_IRQ_TYPE_CPU: out-of-kernel GIC: irq_id 0 is IRQ, irq_id 1 is FIQ - KVM_ARM_IRQ_TYPE_SPI: - in-kernel GIC: SPI, irq_id between 32 and 1019 (incl.) + in-kernel GICv2/GICv3: SPI, irq_id between 32 and 1019 (incl.) (the vcpu_index field is ignored) + in-kernel GICv5: SPI, irq_id between 0 and 65535 (incl.) - KVM_ARM_IRQ_TYPE_PPI: - in-kernel GIC: PPI, irq_id between 16 and 31 (incl.) + in-kernel GICv2/GICv3: PPI, irq_id between 16 and 31 (incl.) + in-kernel GICv5: PPI, irq_id between 0 and 127 (incl.) 
(The irq_id field thus corresponds nicely to the IRQ ID in the ARM GIC specs) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index f68c4036afeb..8577d7dd4d1e 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -45,6 +45,9 @@ #include #include #include +#include + +#include #include "sys_regs.h" @@ -1479,16 +1482,29 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level, if (!vcpu) return -EINVAL; - if (irq_num < VGIC_NR_SGIS || irq_num >= VGIC_NR_PRIVATE_IRQS) + if (vgic_is_v5(kvm)) { + if (irq_num >= VGIC_V5_NR_PRIVATE_IRQS) + return -EINVAL; + + /* Build a GICv5-style IntID here */ + irq_num = vgic_v5_make_ppi(irq_num); + } else if (irq_num < VGIC_NR_SGIS || + irq_num >= VGIC_NR_PRIVATE_IRQS) { return -EINVAL; + } return kvm_vgic_inject_irq(kvm, vcpu, irq_num, level, NULL); case KVM_ARM_IRQ_TYPE_SPI: if (!irqchip_in_kernel(kvm)) return -ENXIO; - if (irq_num < VGIC_NR_PRIVATE_IRQS) - return -EINVAL; + if (vgic_is_v5(kvm)) { + /* Build a GICv5-style IntID here */ + irq_num = vgic_v5_make_spi(irq_num); + } else { + if (irq_num < VGIC_NR_PRIVATE_IRQS) + return -EINVAL; + } return kvm_vgic_inject_irq(kvm, NULL, irq_num, level, NULL); } diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c index 9ac0ff60aa8a..1e9fe8764584 100644 --- a/arch/arm64/kvm/vgic/vgic.c +++ b/arch/arm64/kvm/vgic/vgic.c @@ -86,6 +86,10 @@ static struct vgic_irq *vgic_get_lpi(struct kvm *kvm, u32 intid) */ struct vgic_irq *vgic_get_irq(struct kvm *kvm, u32 intid) { + /* Non-private IRQs are not yet implemented for GICv5 */ + if (vgic_is_v5(kvm)) + return NULL; + /* SPIs */ if (intid >= VGIC_NR_PRIVATE_IRQS && intid < (kvm->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS)) { From f4d37c7c35769579c51aa5fe00161c690b89811d Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:56:59 +0000 Subject: [PATCH 169/373] KVM: arm64: gic-v5: Create and initialise vgic_v5 Update kvm_vgic_create to create a vgic_v5 device. 
When creating a vgic, FEAT_GCIE in ID_AA64PFR2 is only exposed to vgic_v5-based guests, and is hidden otherwise. GIC in ID_AA64PFR0_EL1 is never exposed for a vgic_v5 guest. When initialising a vgic_v5, skip kvm_vgic_dist_init as GICv5 doesn't support one. The current vgic_v5 implementation only supports PPIs, so no SPIs are initialised either. The current vgic_v5 support doesn't extend to nested guests. Therefore, initialising a vgic_v5 for a nested guest fails in vgic_v5_init. As the current vgic_v5 doesn't require any resources to be mapped, vgic_v5_map_resources is simply used to check that the vgic has indeed been initialised. Again, this will change as more GICv5 support is merged in. Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-29-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-init.c | 56 ++++++++++++++++++++------------- arch/arm64/kvm/vgic/vgic-v5.c | 26 +++++++++++++++ arch/arm64/kvm/vgic/vgic.h | 2 ++ include/kvm/arm_vgic.h | 1 + 4 files changed, 64 insertions(+), 21 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c index e0366e8c144d..75185651ff64 100644 --- a/arch/arm64/kvm/vgic/vgic-init.c +++ b/arch/arm64/kvm/vgic/vgic-init.c @@ -66,7 +66,7 @@ static int vgic_allocate_private_irqs_locked(struct kvm_vcpu *vcpu, u32 type); * or through the generic KVM_CREATE_DEVICE API ioctl. * irqchip_in_kernel() tells you if this function succeeded or not. 
* @kvm: kvm struct pointer - * @type: KVM_DEV_TYPE_ARM_VGIC_V[23] + * @type: KVM_DEV_TYPE_ARM_VGIC_V[235] */ int kvm_vgic_create(struct kvm *kvm, u32 type) { @@ -131,8 +131,11 @@ int kvm_vgic_create(struct kvm *kvm, u32 type) if (type == KVM_DEV_TYPE_ARM_VGIC_V2) kvm->max_vcpus = VGIC_V2_MAX_CPUS; - else + else if (type == KVM_DEV_TYPE_ARM_VGIC_V3) kvm->max_vcpus = VGIC_V3_MAX_CPUS; + else if (type == KVM_DEV_TYPE_ARM_VGIC_V5) + kvm->max_vcpus = min(VGIC_V5_MAX_CPUS, + kvm_vgic_global_state.max_gic_vcpus); if (atomic_read(&kvm->online_vcpus) > kvm->max_vcpus) { ret = -E2BIG; @@ -426,22 +429,28 @@ int vgic_init(struct kvm *kvm) if (kvm->created_vcpus != atomic_read(&kvm->online_vcpus)) return -EBUSY; - /* freeze the number of spis */ - if (!dist->nr_spis) - dist->nr_spis = VGIC_NR_IRQS_LEGACY - VGIC_NR_PRIVATE_IRQS; + if (!vgic_is_v5(kvm)) { + /* freeze the number of spis */ + if (!dist->nr_spis) + dist->nr_spis = VGIC_NR_IRQS_LEGACY - VGIC_NR_PRIVATE_IRQS; - ret = kvm_vgic_dist_init(kvm, dist->nr_spis); - if (ret) - goto out; - - /* - * Ensure vPEs are allocated if direct IRQ injection (e.g. vSGIs, - * vLPIs) is supported. - */ - if (vgic_supports_direct_irqs(kvm)) { - ret = vgic_v4_init(kvm); + ret = kvm_vgic_dist_init(kvm, dist->nr_spis); if (ret) - goto out; + return ret; + + /* + * Ensure vPEs are allocated if direct IRQ injection (e.g. vSGIs, + * vLPIs) is supported. 
+ */ + if (vgic_supports_direct_irqs(kvm)) { + ret = vgic_v4_init(kvm); + if (ret) + return ret; + } + } else { + ret = vgic_v5_init(kvm); + if (ret) + return ret; } kvm_for_each_vcpu(idx, vcpu, kvm) @@ -449,12 +458,12 @@ int vgic_init(struct kvm *kvm) ret = kvm_vgic_setup_default_irq_routing(kvm); if (ret) - goto out; + return ret; vgic_debug_init(kvm); dist->initialized = true; -out: - return ret; + + return 0; } static void kvm_vgic_dist_destroy(struct kvm *kvm) @@ -598,6 +607,7 @@ int vgic_lazy_init(struct kvm *kvm) int kvm_vgic_map_resources(struct kvm *kvm) { struct vgic_dist *dist = &kvm->arch.vgic; + bool needs_dist = true; enum vgic_type type; gpa_t dist_base; int ret = 0; @@ -616,12 +626,16 @@ int kvm_vgic_map_resources(struct kvm *kvm) if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V2) { ret = vgic_v2_map_resources(kvm); type = VGIC_V2; - } else { + } else if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) { ret = vgic_v3_map_resources(kvm); type = VGIC_V3; + } else { + ret = vgic_v5_map_resources(kvm); + type = VGIC_V5; + needs_dist = false; } - if (ret) + if (ret || !needs_dist) goto out; dist_base = dist->vgic_dist_base; diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index b84324f0a311..14e1fad913f0 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -87,6 +87,32 @@ int vgic_v5_probe(const struct gic_kvm_info *info) return 0; } +int vgic_v5_init(struct kvm *kvm) +{ + struct kvm_vcpu *vcpu; + unsigned long idx; + + if (vgic_initialized(kvm)) + return 0; + + kvm_for_each_vcpu(idx, vcpu, kvm) { + if (vcpu_has_nv(vcpu)) { + kvm_err("Nested GICv5 VMs are currently unsupported\n"); + return -EINVAL; + } + } + + return 0; +} + +int vgic_v5_map_resources(struct kvm *kvm) +{ + if (!vgic_initialized(kvm)) + return -EBUSY; + + return 0; +} + int vgic_v5_finalize_ppi_state(struct kvm *kvm) { struct kvm_vcpu *vcpu0; diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h index 
8f15f7472458..0f1986fcd7d0 100644 --- a/arch/arm64/kvm/vgic/vgic.h +++ b/arch/arm64/kvm/vgic/vgic.h @@ -364,6 +364,8 @@ void vgic_debug_init(struct kvm *kvm); void vgic_debug_destroy(struct kvm *kvm); int vgic_v5_probe(const struct gic_kvm_info *info); +int vgic_v5_init(struct kvm *kvm); +int vgic_v5_map_resources(struct kvm *kvm); void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid); bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu); void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu); diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index a28cf765f3eb..a5ddccf7ef3b 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -21,6 +21,7 @@ #include #include +#define VGIC_V5_MAX_CPUS 512 #define VGIC_V3_MAX_CPUS 512 #define VGIC_V2_MAX_CPUS 8 #define VGIC_NR_IRQS_LEGACY 256 From a3ca7cf9b31715a63c4dd32f3b6209c3bd744988 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:57:14 +0000 Subject: [PATCH 170/373] KVM: arm64: gic-v5: Initialise ID and priority bits when resetting vcpu Determine the number of priority bits and ID bits exposed to the guest as part of resetting the vcpu state. These values are presented to the guest by trapping and emulating reads from ICC_IDR0_EL1. GICv5 supports either 16- or 24-bits of ID space (for SPIs and LPIs). It is expected that 2^16 IDs is more than enough, and therefore this value is chosen irrespective of the hardware supporting more or not. The GICv5 architecture only supports 5 bits of priority in the CPU interface (but potentially fewer in the IRS). Therefore, this is the default value chosen for the number of priority bits in the CPU IF. Note: We replicate the way that GICv3 uses the num_id_bits and num_pri_bits variables. That is, num_id_bits stores the value of the hardware field verbatim (0 means 16-bits, 1 would mean 24-bits for GICv5), and num_pri_bits stores the actual number of priority bits; the field value + 1. 
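The num_id_bits / num_pri_bits convention described above can be sketched as plain decode helpers. This is an illustrative userspace sketch, not kernel API: the helper names and the ID_BITS field values (0 meaning 16 bits, 1 meaning 24 bits, per the commit message) are assumptions for demonstration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ICC_IDR0_EL1-style ID_BITS field encodings (illustrative). */
#define ID_BITS_16BITS 0u
#define ID_BITS_24BITS 1u

/* num_id_bits stores the raw field value; decode it on demand. */
static inline unsigned int decode_id_bits(uint32_t field)
{
	return field == ID_BITS_24BITS ? 24 : 16;
}

/*
 * num_pri_bits stores the actual number of priority bits, i.e. the
 * hardware field value plus one.
 */
static inline unsigned int decode_pri_bits(uint32_t field)
{
	return field + 1;
}
```

So a raw priority field of 4 corresponds to the 5 priority bits that the GICv5 CPU interface architecturally supports.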
Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260319154937.3619520-30-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-init.c | 6 +++++- arch/arm64/kvm/vgic/vgic-v5.c | 15 +++++++++++++++ arch/arm64/kvm/vgic/vgic.h | 1 + 3 files changed, 21 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c index 75185651ff64..fe854cac5272 100644 --- a/arch/arm64/kvm/vgic/vgic-init.c +++ b/arch/arm64/kvm/vgic/vgic-init.c @@ -398,7 +398,11 @@ int kvm_vgic_vcpu_init(struct kvm_vcpu *vcpu) static void kvm_vgic_vcpu_reset(struct kvm_vcpu *vcpu) { - if (kvm_vgic_global_state.type == VGIC_V2) + const struct vgic_dist *dist = &vcpu->kvm->arch.vgic; + + if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V5) + vgic_v5_reset(vcpu); + else if (kvm_vgic_global_state.type == VGIC_V2) vgic_v2_reset(vcpu); else vgic_v3_reset(vcpu); diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 14e1fad913f0..c263e097786f 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -87,6 +87,21 @@ int vgic_v5_probe(const struct gic_kvm_info *info) return 0; } +void vgic_v5_reset(struct kvm_vcpu *vcpu) +{ + /* + * We always present 16-bits of ID space to the guest, irrespective of + * the host allowing more. + */ + vcpu->arch.vgic_cpu.num_id_bits = ICC_IDR0_EL1_ID_BITS_16BITS; + + /* + * The GICv5 architecture only supports 5-bits of priority in the + * CPUIF (but potentially fewer in the IRS). 
+ */ + vcpu->arch.vgic_cpu.num_pri_bits = 5; +} + int vgic_v5_init(struct kvm *kvm) { struct kvm_vcpu *vcpu; diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h index 0f1986fcd7d0..9d941241c8a2 100644 --- a/arch/arm64/kvm/vgic/vgic.h +++ b/arch/arm64/kvm/vgic/vgic.h @@ -364,6 +364,7 @@ void vgic_debug_init(struct kvm *kvm); void vgic_debug_destroy(struct kvm *kvm); int vgic_v5_probe(const struct gic_kvm_info *info); +void vgic_v5_reset(struct kvm_vcpu *vcpu); int vgic_v5_init(struct kvm *kvm); int vgic_v5_map_resources(struct kvm *kvm); void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid); From 91d940cd678d3c394c845cd64081113167d700d2 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:57:30 +0000 Subject: [PATCH 171/373] irqchip/gic-v5: Introduce minimal irq_set_type() for PPIs GICv5 does not support configuring the handling mode or trigger mode of PPIs at runtime - these choices are made at implementation time, and most of the architected PPIs have an architected handling mode (as reported in the ICH_PPI_HMRn_EL1 registers). As chip->set_irq_type() is optional, this has not been implemented for GICv5 PPIs as it served no real purpose. However, although the set_irq_type() function is marked as optional, the lack of it breaks attempts to create a domain hierarchy on top of GICv5's PPI domain. This is due to __irq_set_trigger() calling chip->set_irq_type(), which returns -ENOSYS if the parent domain doesn't implement the set_irq_type() call. In order to make things work, this change introduces a set_irq_type() call for GICv5 PPIs. This performs a basic sanity check (that the hardware's handling mode (Level/Edge) matches what is being set as the type), and does nothing else. This is sufficient to get hierarchical domains working for GICv5 PPIs (such as the one KVM introduces for the arch timer). 
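The sanity check boils down to comparing the requested trigger flags against the fixed handling mode the hardware reports. The userspace sketch below mirrors that logic; the IRQ_TYPE_* values follow include/linux/irq.h, and ppi_is_level stands in for reading the ICH_PPI_HMRn_EL1 handling-mode bit (the kernel's gicv5_ppi_irq_is_level() helper).

```c
#include <assert.h>
#include <stdbool.h>

/* Trigger flags as defined in include/linux/irq.h. */
#define IRQ_TYPE_EDGE_RISING  0x1
#define IRQ_TYPE_EDGE_FALLING 0x2
#define IRQ_TYPE_EDGE_BOTH    (IRQ_TYPE_EDGE_RISING | IRQ_TYPE_EDGE_FALLING)
#define IRQ_TYPE_LEVEL_HIGH   0x4
#define IRQ_TYPE_LEVEL_LOW    0x8
#define IRQ_TYPE_LEVEL_MASK   (IRQ_TYPE_LEVEL_HIGH | IRQ_TYPE_LEVEL_LOW)
#define EINVAL 22

/* ppi_is_level stands in for the hardware's fixed handling mode. */
static int ppi_set_type_check(bool ppi_is_level, unsigned int type)
{
	/* Reject an edge type when the hardware reports Level handling... */
	if ((type & IRQ_TYPE_EDGE_BOTH) && ppi_is_level)
		return -EINVAL;

	/* ...and a level type when the hardware reports Edge handling. */
	if ((type & IRQ_TYPE_LEVEL_MASK) && !ppi_is_level)
		return -EINVAL;

	return 0;
}
```

A matching type succeeds without touching any state; only a firmware/hardware mismatch is flagged.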
It has the side benefit (or drawback) that it will catch cases where the firmware description doesn't match what the hardware reports. Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260319154937.3619520-31-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- drivers/irqchip/irq-gic-v5.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/drivers/irqchip/irq-gic-v5.c b/drivers/irqchip/irq-gic-v5.c index 405a5eee847b..6b0903be8ebf 100644 --- a/drivers/irqchip/irq-gic-v5.c +++ b/drivers/irqchip/irq-gic-v5.c @@ -511,6 +511,23 @@ static bool gicv5_ppi_irq_is_level(irq_hw_number_t hwirq) return !!(read_ppi_sysreg_s(hwirq, PPI_HM) & bit); } +static int gicv5_ppi_irq_set_type(struct irq_data *d, unsigned int type) +{ + /* + * GICv5's PPIs do not have a configurable trigger or handling + * mode. Check that the attempt to set a type matches what the + * hardware reports in the HMR, and error on a mismatch. + */ + + if (type & IRQ_TYPE_EDGE_BOTH && gicv5_ppi_irq_is_level(d->hwirq)) + return -EINVAL; + + if (type & IRQ_TYPE_LEVEL_MASK && !gicv5_ppi_irq_is_level(d->hwirq)) + return -EINVAL; + + return 0; +} + static int gicv5_ppi_irq_set_vcpu_affinity(struct irq_data *d, void *vcpu) { if (vcpu) @@ -526,6 +543,7 @@ static const struct irq_chip gicv5_ppi_irq_chip = { .irq_mask = gicv5_ppi_irq_mask, .irq_unmask = gicv5_ppi_irq_unmask, .irq_eoi = gicv5_ppi_irq_eoi, + .irq_set_type = gicv5_ppi_irq_set_type, .irq_get_irqchip_state = gicv5_ppi_irq_get_irqchip_state, .irq_set_irqchip_state = gicv5_ppi_irq_set_irqchip_state, .irq_set_vcpu_affinity = gicv5_ppi_irq_set_vcpu_affinity, From 9491c63b6cd7bdae97cd29c7c6bf400adbd3578f Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:57:45 +0000 Subject: [PATCH 172/373] KVM: arm64: gic-v5: Enlighten arch timer for GICv5 Now that GICv5 has arrived, the arch timer requires some TLC to address some of the key differences introduced with GICv5. 
For PPIs on GICv5, the queue_irq_unlock irq_op is used as AP lists are not required at all for GICv5. The arch timer also introduces an irq_op - get_input_level. Extend the arch-timer-provided irq_ops to include the PPI op for vgic_v5 guests. When possible, DVI (Direct Virtual Interrupt) is set for PPIs when using a vgic_v5, which directly injects the pending state into the guest. This means that the host never sees the interrupt for the guest for these interrupts. This has three impacts. * First of all, the kvm_cpu_has_pending_timer check is updated to explicitly check if the timers are expected to fire. * Secondly, mapped timers (which use DVI) must be masked on the host prior to entering a GICv5 guest, and unmasked on the return path. This is handled in set_timer_irq_phys_masked. * Thirdly, it makes zero sense to attempt to inject state for a DVI'd interrupt. Track which timers are direct, and skip the call to kvm_vgic_inject_irq() for these. The final, but rather important, change is that the architected PPIs for the timers are made mandatory for a GICv5 guest. Attempts to set them to anything else are actively rejected. Once a vgic_v5 is initialised, the arch timer PPIs are also explicitly reinitialised to ensure the correct GICv5-compatible PPIs are used - this also adds in the GICv5 PPI type to the intid. 
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-32-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/arch_timer.c | 86 ++++++++++++++++++++++++++------- arch/arm64/kvm/vgic/vgic-init.c | 9 ++++ arch/arm64/kvm/vgic/vgic-v5.c | 7 ++- include/kvm/arm_arch_timer.h | 11 ++++- include/kvm/arm_vgic.h | 3 ++ 5 files changed, 94 insertions(+), 22 deletions(-) diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c index 92870ee6dacd..67b989671b41 100644 --- a/arch/arm64/kvm/arch_timer.c +++ b/arch/arm64/kvm/arch_timer.c @@ -56,6 +56,12 @@ static struct irq_ops arch_timer_irq_ops = { .get_input_level = kvm_arch_timer_get_input_level, }; +static struct irq_ops arch_timer_irq_ops_vgic_v5 = { + .get_input_level = kvm_arch_timer_get_input_level, + .queue_irq_unlock = vgic_v5_ppi_queue_irq_unlock, + .set_direct_injection = vgic_v5_set_ppi_dvi, +}; + static int nr_timers(struct kvm_vcpu *vcpu) { if (!vcpu_has_nv(vcpu)) @@ -177,6 +183,10 @@ void get_timer_map(struct kvm_vcpu *vcpu, struct timer_map *map) map->emul_ptimer = vcpu_ptimer(vcpu); } + map->direct_vtimer->direct = true; + if (map->direct_ptimer) + map->direct_ptimer->direct = true; + trace_kvm_get_timer_map(vcpu->vcpu_id, map); } @@ -396,7 +406,11 @@ static bool kvm_timer_should_fire(struct arch_timer_context *timer_ctx) int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu) { - return vcpu_has_wfit_active(vcpu) && wfit_delay_ns(vcpu) == 0; + struct arch_timer_context *vtimer = vcpu_vtimer(vcpu); + struct arch_timer_context *ptimer = vcpu_ptimer(vcpu); + + return kvm_timer_should_fire(vtimer) || kvm_timer_should_fire(ptimer) || + (vcpu_has_wfit_active(vcpu) && wfit_delay_ns(vcpu) == 0); } /* @@ -447,6 +461,10 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level, if (userspace_irqchip(vcpu->kvm)) return; + /* Skip injecting on GICv5 for directly injected (DVI'd) timers */ + if (vgic_is_v5(vcpu->kvm) && 
timer_ctx->direct) + return; + kvm_vgic_inject_irq(vcpu->kvm, vcpu, timer_irq(timer_ctx), timer_ctx->irq.level, @@ -674,6 +692,7 @@ static void kvm_timer_vcpu_load_gic(struct arch_timer_context *ctx) phys_active = kvm_vgic_map_is_active(vcpu, timer_irq(ctx)); phys_active |= ctx->irq.level; + phys_active |= vgic_is_v5(vcpu->kvm); set_timer_irq_phys_active(ctx, phys_active); } @@ -862,7 +881,8 @@ void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu) get_timer_map(vcpu, &map); if (static_branch_likely(&has_gic_active_state)) { - if (vcpu_has_nv(vcpu)) + /* We don't do NV on GICv5, yet */ + if (vcpu_has_nv(vcpu) && !vgic_is_v5(vcpu->kvm)) kvm_timer_vcpu_load_nested_switch(vcpu, &map); kvm_timer_vcpu_load_gic(map.direct_vtimer); @@ -932,6 +952,12 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu) if (kvm_vcpu_is_blocking(vcpu)) kvm_timer_blocking(vcpu); + + if (vgic_is_v5(vcpu->kvm)) { + set_timer_irq_phys_active(map.direct_vtimer, false); + if (map.direct_ptimer) + set_timer_irq_phys_active(map.direct_ptimer, false); + } } void kvm_timer_sync_nested(struct kvm_vcpu *vcpu) @@ -1095,10 +1121,19 @@ void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu) HRTIMER_MODE_ABS_HARD); } +/* + * This is always called during kvm_arch_init_vm, but will also be + * called from kvm_vgic_create if we have a vGICv5. + */ void kvm_timer_init_vm(struct kvm *kvm) { + /* + * Set up the default PPIs - note that we adjust them based on + * the model of the GIC as GICv5 uses a different way to + * describing interrupts. + */ for (int i = 0; i < NR_KVM_TIMERS; i++) - kvm->arch.timer_data.ppi[i] = default_ppi[i]; + kvm->arch.timer_data.ppi[i] = get_vgic_ppi(kvm, default_ppi[i]); } void kvm_timer_cpu_up(void) @@ -1267,7 +1302,15 @@ static int timer_irq_set_irqchip_state(struct irq_data *d, static void timer_irq_eoi(struct irq_data *d) { - if (!irqd_is_forwarded_to_vcpu(d)) + /* + * On a GICv5 host, we still need to call EOI on the parent for + * PPIs. 
The host driver already handles irqs which are forwarded to + * vcpus, and skips the GIC CDDI while still doing the GIC CDEOI. This + * is required to emulate the EOIMode=1 on GICv5 hardware. Failure to + * call EOI unsurprisingly results in *BAD* lock-ups. + */ + if (!irqd_is_forwarded_to_vcpu(d) || + kvm_vgic_global_state.type == VGIC_V5) irq_chip_eoi_parent(d); } @@ -1331,7 +1374,8 @@ static int kvm_irq_init(struct arch_timer_kvm_info *info) host_vtimer_irq = info->virtual_irq; kvm_irq_fixup_flags(host_vtimer_irq, &host_vtimer_irq_flags); - if (kvm_vgic_global_state.no_hw_deactivation) { + if (kvm_vgic_global_state.no_hw_deactivation || + kvm_vgic_global_state.type == VGIC_V5) { struct fwnode_handle *fwnode; struct irq_data *data; @@ -1349,7 +1393,8 @@ static int kvm_irq_init(struct arch_timer_kvm_info *info) return -ENOMEM; } - arch_timer_irq_ops.flags |= VGIC_IRQ_SW_RESAMPLE; + if (kvm_vgic_global_state.no_hw_deactivation) + arch_timer_irq_ops.flags |= VGIC_IRQ_SW_RESAMPLE; WARN_ON(irq_domain_push_irq(domain, host_vtimer_irq, (void *)TIMER_VTIMER)); } @@ -1500,10 +1545,13 @@ static bool timer_irqs_are_valid(struct kvm_vcpu *vcpu) break; /* - * We know by construction that we only have PPIs, so - * all values are less than 32. + * We know by construction that we only have PPIs, so all values + * are less than 32 for non-GICv5 VGICs. On GICv5, they are + * architecturally defined to be under 32 too. However, we mask + * off most of the bits as we might be presented with a GICv5 + * style PPI where the type is encoded in the top-bits. */ - ppis |= BIT(irq); + ppis |= BIT(irq & 0x1f); } valid = hweight32(ppis) == nr_timers(vcpu); @@ -1562,7 +1610,8 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu) get_timer_map(vcpu, &map); - ops = &arch_timer_irq_ops; + ops = vgic_is_v5(vcpu->kvm) ? 
&arch_timer_irq_ops_vgic_v5 : + &arch_timer_irq_ops; for (int i = 0; i < nr_timers(vcpu); i++) kvm_vgic_set_irq_ops(vcpu, timer_irq(vcpu_get_timer(vcpu, i)), ops); @@ -1606,12 +1655,11 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) if (!(irq_is_ppi(vcpu->kvm, irq))) return -EINVAL; - mutex_lock(&vcpu->kvm->arch.config_lock); + guard(mutex)(&vcpu->kvm->arch.config_lock); if (test_bit(KVM_ARCH_FLAG_TIMER_PPIS_IMMUTABLE, &vcpu->kvm->arch.flags)) { - ret = -EBUSY; - goto out; + return -EBUSY; } switch (attr->attr) { @@ -1628,10 +1676,16 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) idx = TIMER_HPTIMER; break; default: - ret = -ENXIO; - goto out; + return -ENXIO; } + /* + * The PPIs for the Arch Timers are architecturally defined for + * GICv5. Reject anything that changes them from the specified value. + */ + if (vgic_is_v5(vcpu->kvm) && vcpu->kvm->arch.timer_data.ppi[idx] != irq) + return -EINVAL; + /* * We cannot validate the IRQ unicity before we run, so take it at * face value. The verdict will be given on first vcpu run, for each @@ -1639,8 +1693,6 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) */ vcpu->kvm->arch.timer_data.ppi[idx] = irq; -out: - mutex_unlock(&vcpu->kvm->arch.config_lock); return ret; } diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c index fe854cac5272..47169604100f 100644 --- a/arch/arm64/kvm/vgic/vgic-init.c +++ b/arch/arm64/kvm/vgic/vgic-init.c @@ -173,6 +173,15 @@ int kvm_vgic_create(struct kvm *kvm, u32 type) if (type == KVM_DEV_TYPE_ARM_VGIC_V3) kvm->arch.vgic.nassgicap = system_supports_direct_sgis(); + /* + * We now know that we have a GICv5. The Arch Timer PPI interrupts may + * have been initialised at this stage, but will have done so assuming + * that we have an older GIC, meaning that the IntIDs won't be + * correct. We init them again, and this time they will be correct. 
+ */ + if (type == KVM_DEV_TYPE_ARM_VGIC_V5) + kvm_timer_init_vm(kvm); + out_unlock: mutex_unlock(&kvm->arch.config_lock); kvm_unlock_all_vcpus(kvm); diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index c263e097786f..9384c7fcb1aa 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -200,8 +200,8 @@ static u32 vgic_v5_get_effective_priority_mask(struct kvm_vcpu *vcpu) * need the PPIs to be queued on a per-VCPU AP list. Therefore, sanity check the * state, unlock, and return. */ -static bool vgic_v5_ppi_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq, - unsigned long flags) +bool vgic_v5_ppi_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq, + unsigned long flags) __releases(&irq->irq_lock) { struct kvm_vcpu *vcpu; @@ -232,8 +232,7 @@ out_unlock_fail: /* * Sets/clears the corresponding bit in the ICH_PPI_DVIR register. */ -static void vgic_v5_set_ppi_dvi(struct kvm_vcpu *vcpu, struct vgic_irq *irq, - bool dvi) +void vgic_v5_set_ppi_dvi(struct kvm_vcpu *vcpu, struct vgic_irq *irq, bool dvi) { struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; u32 ppi; diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h index 7310841f4512..a7754e0a2ef4 100644 --- a/include/kvm/arm_arch_timer.h +++ b/include/kvm/arm_arch_timer.h @@ -10,6 +10,8 @@ #include #include +#include + enum kvm_arch_timers { TIMER_PTIMER, TIMER_VTIMER, @@ -47,7 +49,7 @@ struct arch_timer_vm_data { u64 poffset; /* The PPI for each timer, global to the VM */ - u8 ppi[NR_KVM_TIMERS]; + u32 ppi[NR_KVM_TIMERS]; }; struct arch_timer_context { @@ -74,6 +76,9 @@ struct arch_timer_context { /* Duplicated state from arch_timer.c for convenience */ u32 host_timer_irq; + + /* Is this a direct timer? 
*/ + bool direct; }; struct timer_map { @@ -130,6 +135,10 @@ void kvm_timer_init_vhe(void); #define timer_vm_data(ctx) (&(timer_context_to_vcpu(ctx)->kvm->arch.timer_data)) #define timer_irq(ctx) (timer_vm_data(ctx)->ppi[arch_timer_ctx_index(ctx)]) +#define get_vgic_ppi(k, i) (((k)->arch.vgic.vgic_model != KVM_DEV_TYPE_ARM_VGIC_V5) ? \ + (i) : (FIELD_PREP(GICV5_HWIRQ_ID, i) | \ + FIELD_PREP(GICV5_HWIRQ_TYPE, GICV5_HWIRQ_TYPE_PPI))) + u64 kvm_arm_timer_read_sysreg(struct kvm_vcpu *vcpu, enum kvm_arch_timers tmr, enum kvm_arch_timer_regs treg); diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index a5ddccf7ef3b..8cc3a7b4d815 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -627,6 +627,9 @@ void vgic_v4_commit(struct kvm_vcpu *vcpu); int vgic_v4_put(struct kvm_vcpu *vcpu); int vgic_v5_finalize_ppi_state(struct kvm *kvm); +bool vgic_v5_ppi_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq, + unsigned long flags); +void vgic_v5_set_ppi_dvi(struct kvm_vcpu *vcpu, struct vgic_irq *irq, bool dvi); bool vgic_state_is_nested(struct kvm_vcpu *vcpu); From 7c31c06e2d2d75859d773ba940e56d1db2bd1fcd Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:58:01 +0000 Subject: [PATCH 173/373] KVM: arm64: gic-v5: Mandate architected PPI for PMU emulation on GICv5 Make it mandatory to use the architected PPI when running a GICv5 guest. Attempts to set anything other than the architected PPI (23) are rejected. Additionally, KVM_ARM_VCPU_PMU_V3_INIT is relaxed to no longer require KVM_ARM_VCPU_PMU_V3_IRQ to be called for GICv5-based guests. In this case, the architected PPI is automatically used. Documentation is bumped accordingly. 
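The architected PPI 23 corresponds to the KVM_ARMV8_PMU_GICV5_IRQ value 0x20000017 defined in this patch. The sketch below decomposes that constant; the shift value (type in the top bits, ID in the low 24 bits) is an assumption consistent with the constant itself, not the kernel's GICV5_HWIRQ_* field definitions verbatim.

```c
#include <assert.h>
#include <stdint.h>

/* The architected GICv5 PMU overflow interrupt, as in the patch. */
#define KVM_ARMV8_PMU_GICV5_IRQ 0x20000017u

/* Extract the bare interrupt number from a GICv5-style IntID. */
static inline uint32_t intid_to_id(uint32_t intid)
{
	return intid & 0x00ffffffu;
}

/* Extract the interrupt-type field from the top bits (assumed shift). */
static inline uint32_t intid_to_type(uint32_t intid)
{
	return intid >> 29;
}
```

This makes explicit why a plain comparison of irq_num against KVM_ARMV8_PMU_GICV5_IRQ rejects anything other than PPI 23.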
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Reviewed-by: Joey Gouly Link: https://patch.msgid.link/20260319154937.3619520-33-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- Documentation/virt/kvm/devices/vcpu.rst | 5 +++-- arch/arm64/kvm/pmu-emul.c | 13 +++++++++++-- include/kvm/arm_pmu.h | 5 ++++- 3 files changed, 18 insertions(+), 5 deletions(-) diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst index 60bf205cb373..5e3805820010 100644 --- a/Documentation/virt/kvm/devices/vcpu.rst +++ b/Documentation/virt/kvm/devices/vcpu.rst @@ -37,7 +37,8 @@ Returns: A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt type must be same for each vcpu. As a PPI, the interrupt number is the same for -all vcpus, while as an SPI it must be a separate number per vcpu. +all vcpus, while as an SPI it must be a separate number per vcpu. For +GICv5-based guests, the architected PPI (23) must be used. 1.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT --------------------------------------- @@ -50,7 +51,7 @@ Returns: -EEXIST Interrupt number already used -ENODEV PMUv3 not supported or GIC not initialized -ENXIO PMUv3 not supported, missing VCPU feature or interrupt - number not set + number not set (non-GICv5 guests, only) -EBUSY PMUv3 already initialized ======= ====================================================== diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c index 41a3c5dc2bca..e1860acae641 100644 --- a/arch/arm64/kvm/pmu-emul.c +++ b/arch/arm64/kvm/pmu-emul.c @@ -962,8 +962,13 @@ static int kvm_arm_pmu_v3_init(struct kvm_vcpu *vcpu) if (!vgic_initialized(vcpu->kvm)) return -ENODEV; - if (!kvm_arm_pmu_irq_initialized(vcpu)) - return -ENXIO; + if (!kvm_arm_pmu_irq_initialized(vcpu)) { + if (!vgic_is_v5(vcpu->kvm)) + return -ENXIO; + + /* Use the architected irq number for GICv5. 
*/ + vcpu->arch.pmu.irq_num = KVM_ARMV8_PMU_GICV5_IRQ; + } ret = kvm_vgic_set_owner(vcpu, vcpu->arch.pmu.irq_num, &vcpu->arch.pmu); @@ -988,6 +993,10 @@ static bool pmu_irq_is_valid(struct kvm *kvm, int irq) unsigned long i; struct kvm_vcpu *vcpu; + /* On GICv5, the PMUIRQ is architecturally mandated to be PPI 23 */ + if (vgic_is_v5(kvm) && irq != KVM_ARMV8_PMU_GICV5_IRQ) + return false; + kvm_for_each_vcpu(i, vcpu, kvm) { if (!kvm_arm_pmu_irq_initialized(vcpu)) continue; diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h index 96754b51b411..0a36a3d5c894 100644 --- a/include/kvm/arm_pmu.h +++ b/include/kvm/arm_pmu.h @@ -12,6 +12,9 @@ #define KVM_ARMV8_PMU_MAX_COUNTERS 32 +/* PPI #23 - architecturally specified for GICv5 */ +#define KVM_ARMV8_PMU_GICV5_IRQ 0x20000017 + #if IS_ENABLED(CONFIG_HW_PERF_EVENTS) && IS_ENABLED(CONFIG_KVM) struct kvm_pmc { u8 idx; /* index into the pmu->pmc array */ @@ -38,7 +41,7 @@ struct arm_pmu_entry { }; bool kvm_supports_guest_pmuv3(void); -#define kvm_arm_pmu_irq_initialized(v) ((v)->arch.pmu.irq_num >= VGIC_NR_SGIS) +#define kvm_arm_pmu_irq_initialized(v) ((v)->arch.pmu.irq_num != 0) u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx); void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val); void kvm_pmu_set_counter_value_user(struct kvm_vcpu *vcpu, u64 select_idx, u64 val); From 5aefaf11f9af5d58257ad3d0c71c447a41963069 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:58:17 +0000 Subject: [PATCH 174/373] KVM: arm64: gic: Hide GICv5 for protected guests We don't support running protected guests with GICv5 at the moment. Therefore, make sure that we don't expose it to the guest at all by actively hiding it when running a protected guest.
Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260319154937.3619520-34-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_hyp.h | 1 + arch/arm64/kvm/arm.c | 1 + arch/arm64/kvm/hyp/nvhe/sys_regs.c | 8 ++++++++ 3 files changed, 10 insertions(+) diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h index 2d8dfd534bd9..5648e8d9ff62 100644 --- a/arch/arm64/include/asm/kvm_hyp.h +++ b/arch/arm64/include/asm/kvm_hyp.h @@ -145,6 +145,7 @@ void __noreturn __host_enter(struct kvm_cpu_context *host_ctxt); extern u64 kvm_nvhe_sym(id_aa64pfr0_el1_sys_val); extern u64 kvm_nvhe_sym(id_aa64pfr1_el1_sys_val); +extern u64 kvm_nvhe_sym(id_aa64pfr2_el1_sys_val); extern u64 kvm_nvhe_sym(id_aa64isar0_el1_sys_val); extern u64 kvm_nvhe_sym(id_aa64isar1_el1_sys_val); extern u64 kvm_nvhe_sym(id_aa64isar2_el1_sys_val); diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 8577d7dd4d1e..cb22bed9c85d 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -2530,6 +2530,7 @@ static void kvm_hyp_init_symbols(void) { kvm_nvhe_sym(id_aa64pfr0_el1_sys_val) = get_hyp_id_aa64pfr0_el1(); kvm_nvhe_sym(id_aa64pfr1_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1); + kvm_nvhe_sym(id_aa64pfr2_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64PFR2_EL1); kvm_nvhe_sym(id_aa64isar0_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64ISAR0_EL1); kvm_nvhe_sym(id_aa64isar1_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64ISAR1_EL1); kvm_nvhe_sym(id_aa64isar2_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64ISAR2_EL1); diff --git a/arch/arm64/kvm/hyp/nvhe/sys_regs.c b/arch/arm64/kvm/hyp/nvhe/sys_regs.c index 06d28621722e..b40fd01ebf32 100644 --- a/arch/arm64/kvm/hyp/nvhe/sys_regs.c +++ b/arch/arm64/kvm/hyp/nvhe/sys_regs.c @@ -20,6 +20,7 @@ */ u64 id_aa64pfr0_el1_sys_val; u64 id_aa64pfr1_el1_sys_val; +u64 id_aa64pfr2_el1_sys_val; u64 id_aa64isar0_el1_sys_val; u64 id_aa64isar1_el1_sys_val; u64 id_aa64isar2_el1_sys_val; 
@@ -108,6 +109,11 @@ static const struct pvm_ftr_bits pvmid_aa64pfr1[] = { FEAT_END }; +static const struct pvm_ftr_bits pvmid_aa64pfr2[] = { + MAX_FEAT(ID_AA64PFR2_EL1, GCIE, NI), + FEAT_END +}; + static const struct pvm_ftr_bits pvmid_aa64mmfr0[] = { MAX_FEAT_ENUM(ID_AA64MMFR0_EL1, PARANGE, 40), MAX_FEAT_ENUM(ID_AA64MMFR0_EL1, ASIDBITS, 16), @@ -221,6 +227,8 @@ static u64 pvm_calc_id_reg(const struct kvm_vcpu *vcpu, u32 id) return get_restricted_features(vcpu, id_aa64pfr0_el1_sys_val, pvmid_aa64pfr0); case SYS_ID_AA64PFR1_EL1: return get_restricted_features(vcpu, id_aa64pfr1_el1_sys_val, pvmid_aa64pfr1); + case SYS_ID_AA64PFR2_EL1: + return get_restricted_features(vcpu, id_aa64pfr2_el1_sys_val, pvmid_aa64pfr2); case SYS_ID_AA64ISAR0_EL1: return id_aa64isar0_el1_sys_val; case SYS_ID_AA64ISAR1_EL1: From 61d4ad518312ecddef2331ea3d22902b4eac0e0a Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:58:33 +0000 Subject: [PATCH 175/373] KVM: arm64: gic-v5: Hide FEAT_GCIE from NV GICv5 guests Currently, NV guests are not supported with GICv5. Therefore, make sure that FEAT_GCIE is always hidden from such guests. 
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-35-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/nested.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c index 2c43097248b2..efd5d21c7ac7 100644 --- a/arch/arm64/kvm/nested.c +++ b/arch/arm64/kvm/nested.c @@ -1558,6 +1558,11 @@ u64 limit_nv_id_reg(struct kvm *kvm, u32 reg, u64 val) ID_AA64PFR1_EL1_MTE); break; + case SYS_ID_AA64PFR2_EL1: + /* GICv5 is not yet supported for NV */ + val &= ~ID_AA64PFR2_EL1_GCIE; + break; + case SYS_ID_AA64MMFR0_EL1: /* Hide ExS, Secure Memory */ val &= ~(ID_AA64MMFR0_EL1_EXS | From 37a25294682d28ef3bd131566602450a72c4d839 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:58:48 +0000 Subject: [PATCH 176/373] KVM: arm64: gic-v5: Introduce kvm_arm_vgic_v5_ops and register them Only the KVM_DEV_ARM_VGIC_GRP_CTRL->KVM_DEV_ARM_VGIC_CTRL_INIT op is currently supported. All other ops are stubbed out. 
Co-authored-by: Timothy Hayes Signed-off-by: Timothy Hayes Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-36-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-kvm-device.c | 74 +++++++++++++++++++++++++++ include/linux/kvm_host.h | 1 + 2 files changed, 75 insertions(+) diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c b/arch/arm64/kvm/vgic/vgic-kvm-device.c index b12ba99a423e..772da54c1518 100644 --- a/arch/arm64/kvm/vgic/vgic-kvm-device.c +++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c @@ -336,6 +336,10 @@ int kvm_register_vgic_device(unsigned long type) break; ret = kvm_vgic_register_its_device(); break; + case KVM_DEV_TYPE_ARM_VGIC_V5: + ret = kvm_register_device_ops(&kvm_arm_vgic_v5_ops, + KVM_DEV_TYPE_ARM_VGIC_V5); + break; } return ret; @@ -715,3 +719,73 @@ struct kvm_device_ops kvm_arm_vgic_v3_ops = { .get_attr = vgic_v3_get_attr, .has_attr = vgic_v3_has_attr, }; + +static int vgic_v5_set_attr(struct kvm_device *dev, + struct kvm_device_attr *attr) +{ + switch (attr->group) { + case KVM_DEV_ARM_VGIC_GRP_ADDR: + case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS: + case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: + return -ENXIO; + case KVM_DEV_ARM_VGIC_GRP_CTRL: + switch (attr->attr) { + case KVM_DEV_ARM_VGIC_CTRL_INIT: + return vgic_set_common_attr(dev, attr); + default: + return -ENXIO; + } + default: + return -ENXIO; + } + +} + +static int vgic_v5_get_attr(struct kvm_device *dev, + struct kvm_device_attr *attr) +{ + switch (attr->group) { + case KVM_DEV_ARM_VGIC_GRP_ADDR: + case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS: + case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: + return -ENXIO; + case KVM_DEV_ARM_VGIC_GRP_CTRL: + switch (attr->attr) { + case KVM_DEV_ARM_VGIC_CTRL_INIT: + return vgic_get_common_attr(dev, attr); + default: + return -ENXIO; + } + default: + return -ENXIO; + } +} + +static int vgic_v5_has_attr(struct kvm_device *dev, + struct kvm_device_attr *attr) +{ + switch (attr->group) { + case 
KVM_DEV_ARM_VGIC_GRP_ADDR: + case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS: + case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: + return -ENXIO; + case KVM_DEV_ARM_VGIC_GRP_CTRL: + switch (attr->attr) { + case KVM_DEV_ARM_VGIC_CTRL_INIT: + return 0; + default: + return -ENXIO; + } + default: + return -ENXIO; + } +} + +struct kvm_device_ops kvm_arm_vgic_v5_ops = { + .name = "kvm-arm-vgic-v5", + .create = vgic_create, + .destroy = vgic_destroy, + .set_attr = vgic_v5_set_attr, + .get_attr = vgic_v5_get_attr, + .has_attr = vgic_v5_has_attr, +}; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 6b76e7a6f4c2..779d9ed85cbf 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2366,6 +2366,7 @@ void kvm_unregister_device_ops(u32 type); extern struct kvm_device_ops kvm_mpic_ops; extern struct kvm_device_ops kvm_arm_vgic_v2_ops; extern struct kvm_device_ops kvm_arm_vgic_v3_ops; +extern struct kvm_device_ops kvm_arm_vgic_v5_ops; #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT From a8946fde86f860c3a94dca4ee71fe04a7a519da1 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:59:04 +0000 Subject: [PATCH 177/373] KVM: arm64: gic-v5: Set ICH_VCTLR_EL2.En on boot This control enables virtual HPPI selection, i.e., selection and delivery of interrupts for a guest (assuming that the guest itself has opted to receive interrupts). This is enabled on boot, as there is no reason to disable it in normal operation; virtual interrupt signalling itself is still controlled via the HCR_EL2.
Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260319154937.3619520-37-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/el2_setup.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h index 85f4c1615472..998b2a3f615a 100644 --- a/arch/arm64/include/asm/el2_setup.h +++ b/arch/arm64/include/asm/el2_setup.h @@ -248,6 +248,8 @@ ICH_HFGWTR_EL2_ICC_CR0_EL1 | \ ICH_HFGWTR_EL2_ICC_APR_EL1) msr_s SYS_ICH_HFGWTR_EL2, x0 // Disable reg write traps + mov x0, #(ICH_VCTLR_EL2_En) + msr_s SYS_ICH_VCTLR_EL2, x0 // Enable vHPPI selection .Lskip_gicv5_\@: .endm From 9b7aa05533f1bd170211fb6ee5812d9e736492ef Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:59:19 +0000 Subject: [PATCH 178/373] KVM: arm64: gic-v5: Probe for GICv5 device The basic GICv5 PPI support is now complete. Allow probing for a native GICv5 rather than just the legacy support. The implementation doesn't support protected VMs with GICv5 at this time. Therefore, if KVM has protected mode enabled the native GICv5 init is skipped, but legacy VMs are allowed if the hardware supports it. At this stage the GICv5 KVM implementation only supports PPIs, and doesn't interact with the host IRS at all. This means that there is no need to check how many concurrent VMs or vCPUs per VM are supported by the IRS - the PPI support only requires the CPUIF. The support is artificially limited to VGIC_V5_MAX_CPUS, i.e. 512, vCPUs per VM. With this change it becomes possible to run basic GICv5-based VMs, provided that they only use PPIs. 
Co-authored-by: Timothy Hayes Signed-off-by: Timothy Hayes Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Reviewed-by: Joey Gouly Link: https://patch.msgid.link/20260319154937.3619520-38-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-v5.c | 48 +++++++++++++++++++++++++++-------- 1 file changed, 37 insertions(+), 11 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 9384c7fcb1aa..f7a24ea6ad78 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -39,24 +39,14 @@ static void vgic_v5_get_implemented_ppis(void) /* * Probe for a vGICv5 compatible interrupt controller, returning 0 on success. - * Currently only supports GICv3-based VMs on a GICv5 host, and hence only - * registers a VGIC_V3 device. */ int vgic_v5_probe(const struct gic_kvm_info *info) { + bool v5_registered = false; u64 ich_vtr_el2; int ret; - vgic_v5_get_implemented_ppis(); - - if (!cpus_have_final_cap(ARM64_HAS_GICV5_LEGACY)) - return -ENODEV; - kvm_vgic_global_state.type = VGIC_V5; - kvm_vgic_global_state.has_gcie_v3_compat = true; - - /* We only support v3 compat mode - use vGICv3 limits */ - kvm_vgic_global_state.max_gic_vcpus = VGIC_V3_MAX_CPUS; kvm_vgic_global_state.vcpu_base = 0; kvm_vgic_global_state.vctrl_base = NULL; @@ -64,6 +54,38 @@ int vgic_v5_probe(const struct gic_kvm_info *info) kvm_vgic_global_state.has_gicv4 = false; kvm_vgic_global_state.has_gicv4_1 = false; + /* + * GICv5 is currently not supported in Protected mode. Skip the + * registration of GICv5 completely to make sure no guests can create a + * GICv5-based guest. 
+ */ + if (is_protected_kvm_enabled()) { + kvm_info("GICv5-based guests are not supported with pKVM\n"); + goto skip_v5; + } + + kvm_vgic_global_state.max_gic_vcpus = VGIC_V5_MAX_CPUS; + + vgic_v5_get_implemented_ppis(); + + ret = kvm_register_vgic_device(KVM_DEV_TYPE_ARM_VGIC_V5); + if (ret) { + kvm_err("Cannot register GICv5 KVM device.\n"); + goto skip_v5; + } + + v5_registered = true; + kvm_info("GCIE system register CPU interface\n"); + +skip_v5: + /* If we don't support the GICv3 compat mode we're done. */ + if (!cpus_have_final_cap(ARM64_HAS_GICV5_LEGACY)) { + if (!v5_registered) + return -ENODEV; + return 0; + } + + kvm_vgic_global_state.has_gcie_v3_compat = true; ich_vtr_el2 = kvm_call_hyp_ret(__vgic_v3_get_gic_config); kvm_vgic_global_state.ich_vtr_el2 = (u32)ich_vtr_el2; @@ -79,6 +101,10 @@ int vgic_v5_probe(const struct gic_kvm_info *info) return ret; } + /* We potentially limit the max VCPUs further than we need to here */ + kvm_vgic_global_state.max_gic_vcpus = min(VGIC_V3_MAX_CPUS, + VGIC_V5_MAX_CPUS); + static_branch_enable(&kvm_vgic_global_state.gicv3_cpuif); kvm_info("GCIE legacy system register CPU interface\n"); From eb3c4d2c9a4d76b775a9dbd5ac056d1abf0083a1 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:59:35 +0000 Subject: [PATCH 179/373] Documentation: KVM: Introduce documentation for VGICv5 Now that it is possible to create a VGICv5 device, provide initial documentation for it. At this stage, there is little to document. 
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-39-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- .../virt/kvm/devices/arm-vgic-v5.rst | 37 +++++++++++++++++++ Documentation/virt/kvm/devices/index.rst | 1 + 2 files changed, 38 insertions(+) create mode 100644 Documentation/virt/kvm/devices/arm-vgic-v5.rst diff --git a/Documentation/virt/kvm/devices/arm-vgic-v5.rst b/Documentation/virt/kvm/devices/arm-vgic-v5.rst new file mode 100644 index 000000000000..9904cb888277 --- /dev/null +++ b/Documentation/virt/kvm/devices/arm-vgic-v5.rst @@ -0,0 +1,37 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================================== +ARM Virtual Generic Interrupt Controller v5 (VGICv5) +==================================================== + + +Device types supported: + - KVM_DEV_TYPE_ARM_VGIC_V5 ARM Generic Interrupt Controller v5.0 + +Only one VGIC instance may be instantiated through this API. The created VGIC +will act as the VM interrupt controller, requiring emulated user-space devices +to inject interrupts to the VGIC instead of directly to CPUs. + +Creating a guest GICv5 device requires a GICv5 host. The current VGICv5 +device only supports PPI interrupts. These can either be injected from emulated +in-kernel devices (such as the Arch Timer or PMU), or via the KVM_IRQ_LINE +ioctl. + +Groups: + KVM_DEV_ARM_VGIC_GRP_CTRL + Attributes: + + KVM_DEV_ARM_VGIC_CTRL_INIT + request the initialization of the VGIC, no additional parameter in + kvm_device_attr.addr. Must be called after all VCPUs have been created.
+ + Errors: + + ======= ======================================================== + -ENXIO VGIC not properly configured as required prior to calling + this attribute + -ENODEV no online VCPU + -ENOMEM memory shortage when allocating vgic internal data + -EFAULT Invalid guest ram access + -EBUSY One or more VCPUS are running + ======= ======================================================== diff --git a/Documentation/virt/kvm/devices/index.rst b/Documentation/virt/kvm/devices/index.rst index 192cda7405c8..70845aba38f4 100644 --- a/Documentation/virt/kvm/devices/index.rst +++ b/Documentation/virt/kvm/devices/index.rst @@ -10,6 +10,7 @@ Devices arm-vgic-its arm-vgic arm-vgic-v3 + arm-vgic-v5 mpic s390_flic vcpu From d51c978b7d3e143381f871d28d8a0437d446b51b Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 15:59:50 +0000 Subject: [PATCH 180/373] KVM: arm64: gic-v5: Communicate userspace-drivable PPIs via a UAPI GICv5 systems will likely not support the full set of PPIs. The presence of any virtual PPI is tied to the presence of the physical PPI. Therefore, the available PPIs will be limited by the physical host. Userspace cannot drive any PPIs that are not implemented. Moreover, it is not desirable to expose all PPIs to the guest in the first place, even if they are supported in hardware. Some devices, such as the arch timer, are implemented in KVM, and hence those PPIs shouldn't be driven by userspace, either. Provide a new UAPI: KVM_DEV_ARM_VGIC_GRP_CTRL => KVM_DEV_ARM_VGIC_USERSPACE_PPIS This allows userspace to query which PPIs it is able to drive via KVM_IRQ_LINE. Additionally, introduce a check in kvm_vm_ioctl_irq_line() to reject any PPIs not in the userspace mask.
Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-40-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- .../virt/kvm/devices/arm-vgic-v5.rst | 13 ++++++++ arch/arm64/include/uapi/asm/kvm.h | 1 + arch/arm64/kvm/arm.c | 11 ++++++- arch/arm64/kvm/vgic/vgic-kvm-device.c | 31 +++++++++++++++++++ arch/arm64/kvm/vgic/vgic-v5.c | 10 ++++++ include/kvm/arm_vgic.h | 3 ++ tools/arch/arm64/include/uapi/asm/kvm.h | 1 + 7 files changed, 69 insertions(+), 1 deletion(-) diff --git a/Documentation/virt/kvm/devices/arm-vgic-v5.rst b/Documentation/virt/kvm/devices/arm-vgic-v5.rst index 9904cb888277..29335ea823fc 100644 --- a/Documentation/virt/kvm/devices/arm-vgic-v5.rst +++ b/Documentation/virt/kvm/devices/arm-vgic-v5.rst @@ -25,6 +25,19 @@ Groups: request the initialization of the VGIC, no additional parameter in kvm_device_attr.addr. Must be called after all VCPUs have been created. + KVM_DEV_ARM_VGIC_USERSPACE_PPIS + request the mask of userspace-drivable PPIs. Only a subset of the PPIs can + be directly driven from userspace with GICv5, and the returned mask + informs userspace of which it is allowed to drive via KVM_IRQ_LINE. + + Userspace must allocate and point to __u64[2] of data in + kvm_device_attr.addr. When this call returns, the provided memory will be + populated with the userspace PPI mask. The lower __u64 contains the mask + for the lower 64 PPIs, with the remaining 64 being in the second __u64. + + This is a read-only attribute, and cannot be set. Attempts to set it are + rejected.
+ Errors: ======= ======================================================== diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index a792a599b9d6..1c13bfa2d38a 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -428,6 +428,7 @@ enum { #define KVM_DEV_ARM_ITS_RESTORE_TABLES 2 #define KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3 #define KVM_DEV_ARM_ITS_CTRL_RESET 4 +#define KVM_DEV_ARM_VGIC_USERSPACE_PPIS 5 /* Device Control API on vcpu fd */ #define KVM_ARM_VCPU_PMU_V3_CTRL 0 diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index cb22bed9c85d..36410f7cd2ad 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -1449,10 +1449,11 @@ static int vcpu_interrupt_line(struct kvm_vcpu *vcpu, int number, bool level) int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level, bool line_status) { - u32 irq = irq_level->irq; unsigned int irq_type, vcpu_id, irq_num; struct kvm_vcpu *vcpu = NULL; bool level = irq_level->level; + u32 irq = irq_level->irq; + unsigned long *mask; irq_type = (irq >> KVM_ARM_IRQ_TYPE_SHIFT) & KVM_ARM_IRQ_TYPE_MASK; vcpu_id = (irq >> KVM_ARM_IRQ_VCPU_SHIFT) & KVM_ARM_IRQ_VCPU_MASK; @@ -1486,6 +1487,14 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level, if (irq_num >= VGIC_V5_NR_PRIVATE_IRQS) return -EINVAL; + /* + * Only allow PPIs that are explicitly exposed to + * userspace to be driven via KVM_IRQ_LINE + */ + mask = kvm->arch.vgic.gicv5_vm.userspace_ppis; + if (!test_bit(irq_num, mask)) + return -EINVAL; + /* Build a GICv5-style IntID here */ irq_num = vgic_v5_make_ppi(irq_num); } else if (irq_num < VGIC_NR_SGIS || diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c b/arch/arm64/kvm/vgic/vgic-kvm-device.c index 772da54c1518..a96c77dccf35 100644 --- a/arch/arm64/kvm/vgic/vgic-kvm-device.c +++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c @@ -720,6 +720,32 @@ struct kvm_device_ops kvm_arm_vgic_v3_ops = { .has_attr = vgic_v3_has_attr, };
+static int vgic_v5_get_userspace_ppis(struct kvm_device *dev, + struct kvm_device_attr *attr) +{ + struct vgic_v5_vm *gicv5_vm = &dev->kvm->arch.vgic.gicv5_vm; + u64 __user *uaddr = (u64 __user *)(long)attr->addr; + int ret; + + guard(mutex)(&dev->kvm->arch.config_lock); + + /* + * We either support 64 or 128 PPIs. In the former case, we need to + * return 0s for the second 64 bits as we have no storage backing those. + */ + ret = put_user(bitmap_read(gicv5_vm->userspace_ppis, 0, 64), uaddr); + if (ret) + return ret; + uaddr++; + + if (VGIC_V5_NR_PRIVATE_IRQS == 128) + ret = put_user(bitmap_read(gicv5_vm->userspace_ppis, 64, 128), uaddr); + else + ret = put_user(0, uaddr); + + return ret; +} + static int vgic_v5_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { @@ -732,6 +758,7 @@ static int vgic_v5_set_attr(struct kvm_device *dev, switch (attr->attr) { case KVM_DEV_ARM_VGIC_CTRL_INIT: return vgic_set_common_attr(dev, attr); + case KVM_DEV_ARM_VGIC_USERSPACE_PPIS: default: return -ENXIO; } @@ -753,6 +780,8 @@ static int vgic_v5_get_attr(struct kvm_device *dev, switch (attr->attr) { case KVM_DEV_ARM_VGIC_CTRL_INIT: return vgic_get_common_attr(dev, attr); + case KVM_DEV_ARM_VGIC_USERSPACE_PPIS: + return vgic_v5_get_userspace_ppis(dev, attr); default: return -ENXIO; } @@ -773,6 +802,8 @@ static int vgic_v5_has_attr(struct kvm_device *dev, switch (attr->attr) { case KVM_DEV_ARM_VGIC_CTRL_INIT: return 0; + case KVM_DEV_ARM_VGIC_USERSPACE_PPIS: + return 0; default: return -ENXIO; } diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index f7a24ea6ad78..2b6cd5c3f9c2 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -143,6 +143,16 @@ int vgic_v5_init(struct kvm *kvm) } } + /* We only allow userspace to drive the SW_PPI, if it is implemented. 
*/ + bitmap_zero(kvm->arch.vgic.gicv5_vm.userspace_ppis, + VGIC_V5_NR_PRIVATE_IRQS); + __assign_bit(GICV5_ARCH_PPI_SW_PPI, + kvm->arch.vgic.gicv5_vm.userspace_ppis, + VGIC_V5_NR_PRIVATE_IRQS); + bitmap_and(kvm->arch.vgic.gicv5_vm.userspace_ppis, + kvm->arch.vgic.gicv5_vm.userspace_ppis, + ppi_caps.impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS); + return 0; } diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index 8cc3a7b4d815..1388dc6028a9 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -350,6 +350,9 @@ struct vgic_v5_vm { */ DECLARE_BITMAP(vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS); + /* A mask of the PPIs that are exposed for userspace to drive. */ + DECLARE_BITMAP(userspace_ppis, VGIC_V5_NR_PRIVATE_IRQS); + /* * The HMR itself is handled by the hardware, but we still need to have * a mask that we can use when merging in pending state (only the state diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h index a792a599b9d6..1c13bfa2d38a 100644 --- a/tools/arch/arm64/include/uapi/asm/kvm.h +++ b/tools/arch/arm64/include/uapi/asm/kvm.h @@ -428,6 +428,7 @@ enum { #define KVM_DEV_ARM_ITS_RESTORE_TABLES 2 #define KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3 #define KVM_DEV_ARM_ITS_CTRL_RESET 4 +#define KVM_DEV_ARM_VGIC_USERSPACE_PPIS 5 /* Device Control API on vcpu fd */ #define KVM_ARM_VCPU_PMU_V3_CTRL 0 From 0a9f38bf612b195e04236d366ed9f769ce14cc27 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 16:00:06 +0000 Subject: [PATCH 181/373] KVM: arm64: selftests: Introduce a minimal GICv5 PPI selftest This basic selftest creates a vgic_v5 device (if supported), and tests that one of the PPI interrupts works as expected with a basic single-vCPU guest. Upon starting, the guest enables interrupts. That means that it is initialising all PPIs to have reasonable priorities, but marking them as disabled. Then the priority mask in the ICC_PCR_EL1 is set, and interrupts are enabled in ICC_CR0_EL1.
At this stage the guest is able to receive interrupts. The architected SW_PPI (64) is enabled and KVM_IRQ_LINE ioctl is used to inject the state into the guest. The guest's interrupt handler has an explicit WFI in order to ensure that the guest skips WFI when there are pending and enabled PPI interrupts. Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260319154937.3619520-41-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- tools/testing/selftests/kvm/Makefile.kvm | 1 + tools/testing/selftests/kvm/arm64/vgic_v5.c | 228 ++++++++++++++++++ .../selftests/kvm/include/arm64/gic_v5.h | 150 ++++++++++++ 3 files changed, 379 insertions(+) create mode 100644 tools/testing/selftests/kvm/arm64/vgic_v5.c create mode 100644 tools/testing/selftests/kvm/include/arm64/gic_v5.h diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index dc68371f76a3..98282acd9040 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -177,6 +177,7 @@ TEST_GEN_PROGS_arm64 += arm64/vcpu_width_config TEST_GEN_PROGS_arm64 += arm64/vgic_init TEST_GEN_PROGS_arm64 += arm64/vgic_irq TEST_GEN_PROGS_arm64 += arm64/vgic_lpi_stress +TEST_GEN_PROGS_arm64 += arm64/vgic_v5 TEST_GEN_PROGS_arm64 += arm64/vpmu_counter_access TEST_GEN_PROGS_arm64 += arm64/no-vgic-v3 TEST_GEN_PROGS_arm64 += arm64/idreg-idst diff --git a/tools/testing/selftests/kvm/arm64/vgic_v5.c b/tools/testing/selftests/kvm/arm64/vgic_v5.c new file mode 100644 index 000000000000..3ce6cf37a629 --- /dev/null +++ b/tools/testing/selftests/kvm/arm64/vgic_v5.c @@ -0,0 +1,228 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include + +#include + +#include "test_util.h" +#include "kvm_util.h" +#include "processor.h" +#include "vgic.h" + +#define NR_VCPUS 1 + +struct vm_gic { + struct kvm_vm *vm; + int gic_fd; + uint32_t gic_dev_type; +}; + +static uint64_t max_phys_size; + +#define 
GUEST_CMD_IRQ_CDIA 10 +#define GUEST_CMD_IRQ_DIEOI 11 +#define GUEST_CMD_IS_AWAKE 12 +#define GUEST_CMD_IS_READY 13 + +static void guest_irq_handler(struct ex_regs *regs) +{ + bool valid; + u32 hwirq; + u64 ia; + static int count; + + /* + * We have pending interrupts. Should never actually enter WFI + * here! + */ + wfi(); + GUEST_SYNC(GUEST_CMD_IS_AWAKE); + + ia = gicr_insn(CDIA); + valid = GICV5_GICR_CDIA_VALID(ia); + + GUEST_SYNC(GUEST_CMD_IRQ_CDIA); + + if (!valid) + return; + + gsb_ack(); + isb(); + + hwirq = FIELD_GET(GICV5_GICR_CDIA_INTID, ia); + + gic_insn(hwirq, CDDI); + gic_insn(0, CDEOI); + + GUEST_SYNC(GUEST_CMD_IRQ_DIEOI); + + if (++count >= 2) + GUEST_DONE(); + + /* Ask for the next interrupt to be injected */ + GUEST_SYNC(GUEST_CMD_IS_READY); +} + +static void guest_code(void) +{ + local_irq_disable(); + + gicv5_cpu_enable_interrupts(); + local_irq_enable(); + + /* Enable the SW_PPI (3) */ + write_sysreg_s(BIT_ULL(3), SYS_ICC_PPI_ENABLER0_EL1); + + /* Ask for the first interrupt to be injected */ + GUEST_SYNC(GUEST_CMD_IS_READY); + + /* Loop forever waiting for interrupts */ + while (1); +} + + +/* we don't want to assert on run execution, hence that helper */ +static int run_vcpu(struct kvm_vcpu *vcpu) +{ + return __vcpu_run(vcpu) ? 
-errno : 0; +} + +static void vm_gic_destroy(struct vm_gic *v) +{ + close(v->gic_fd); + kvm_vm_free(v->vm); +} + +static void test_vgic_v5_ppis(uint32_t gic_dev_type) +{ + struct kvm_vcpu *vcpus[NR_VCPUS]; + struct ucall uc; + u64 user_ppis[2]; + struct vm_gic v; + int ret, i; + + v.gic_dev_type = gic_dev_type; + v.vm = __vm_create(VM_SHAPE_DEFAULT, NR_VCPUS, 0); + + v.gic_fd = kvm_create_device(v.vm, gic_dev_type); + + for (i = 0; i < NR_VCPUS; i++) + vcpus[i] = vm_vcpu_add(v.vm, i, guest_code); + + vm_init_descriptor_tables(v.vm); + vm_install_exception_handler(v.vm, VECTOR_IRQ_CURRENT, guest_irq_handler); + + for (i = 0; i < NR_VCPUS; i++) + vcpu_init_descriptor_tables(vcpus[i]); + + kvm_device_attr_set(v.gic_fd, KVM_DEV_ARM_VGIC_GRP_CTRL, + KVM_DEV_ARM_VGIC_CTRL_INIT, NULL); + + /* Read out the PPIs that user space is allowed to drive. */ + kvm_device_attr_get(v.gic_fd, KVM_DEV_ARM_VGIC_GRP_CTRL, + KVM_DEV_ARM_VGIC_USERSPACE_PPIS, &user_ppis); + + /* We should always be able to drive the SW_PPI. */ + TEST_ASSERT(user_ppis[0] & BIT(GICV5_ARCH_PPI_SW_PPI), + "SW_PPI is not drivable by userspace"); + + while (1) { + ret = run_vcpu(vcpus[0]); + + switch (get_ucall(vcpus[0], &uc)) { + case UCALL_SYNC: + /* + * The guest is ready for the next level change. Set + * high if ready, and lower if it has been consumed. + */ + if (uc.args[1] == GUEST_CMD_IS_READY || + uc.args[1] == GUEST_CMD_IRQ_DIEOI) { + u64 irq; + bool level = uc.args[1] == GUEST_CMD_IRQ_DIEOI ? 
0 : 1; + + irq = FIELD_PREP(KVM_ARM_IRQ_NUM_MASK, 3); + irq |= KVM_ARM_IRQ_TYPE_PPI << KVM_ARM_IRQ_TYPE_SHIFT; + + _kvm_irq_line(v.vm, irq, level); + } else if (uc.args[1] == GUEST_CMD_IS_AWAKE) { + pr_info("Guest skipping WFI due to pending IRQ\n"); + } else if (uc.args[1] == GUEST_CMD_IRQ_CDIA) { + pr_info("Guest acknowledged IRQ\n"); + } + + continue; + case UCALL_ABORT: + REPORT_GUEST_ASSERT(uc); + break; + case UCALL_DONE: + goto done; + default: + TEST_FAIL("Unknown ucall %lu", uc.cmd); + } + } + +done: + TEST_ASSERT(ret == 0, "Failed to test GICv5 PPIs"); + + vm_gic_destroy(&v); +} + +/* + * Returns 0 if it's possible to create GIC device of a given type (V5). + */ +int test_kvm_device(uint32_t gic_dev_type) +{ + struct kvm_vcpu *vcpus[NR_VCPUS]; + struct vm_gic v; + int ret; + + v.vm = vm_create_with_vcpus(NR_VCPUS, guest_code, vcpus); + + /* try to create a non existing KVM device */ + ret = __kvm_test_create_device(v.vm, 0); + TEST_ASSERT(ret && errno == ENODEV, "unsupported device"); + + /* trial mode */ + ret = __kvm_test_create_device(v.vm, gic_dev_type); + if (ret) + return ret; + v.gic_fd = kvm_create_device(v.vm, gic_dev_type); + + ret = __kvm_create_device(v.vm, gic_dev_type); + TEST_ASSERT(ret < 0 && errno == EEXIST, "create GIC device twice"); + + vm_gic_destroy(&v); + + return 0; +} + +void run_tests(uint32_t gic_dev_type) +{ + pr_info("Test VGICv5 PPIs\n"); + test_vgic_v5_ppis(gic_dev_type); +} + +int main(int ac, char **av) +{ + int ret; + int pa_bits; + + test_disable_default_vgic(); + + pa_bits = vm_guest_mode_params[VM_MODE_DEFAULT].pa_bits; + max_phys_size = 1ULL << pa_bits; + + ret = test_kvm_device(KVM_DEV_TYPE_ARM_VGIC_V5); + if (ret) { + pr_info("No GICv5 support; Not running GIC_v5 tests.\n"); + exit(KSFT_SKIP); + } + + pr_info("Running VGIC_V5 tests.\n"); + run_tests(KVM_DEV_TYPE_ARM_VGIC_V5); + + return 0; +} diff --git a/tools/testing/selftests/kvm/include/arm64/gic_v5.h b/tools/testing/selftests/kvm/include/arm64/gic_v5.h new file 
mode 100644 index 000000000000..eb523d9277cf --- /dev/null +++ b/tools/testing/selftests/kvm/include/arm64/gic_v5.h @@ -0,0 +1,150 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +#ifndef __SELFTESTS_GIC_V5_H +#define __SELFTESTS_GIC_V5_H + +#include +#include + +#include + +#include "processor.h" + +/* + * Definitions for GICv5 instructions for the Current Domain + */ +#define GICV5_OP_GIC_CDAFF sys_insn(1, 0, 12, 1, 3) +#define GICV5_OP_GIC_CDDI sys_insn(1, 0, 12, 2, 0) +#define GICV5_OP_GIC_CDDIS sys_insn(1, 0, 12, 1, 0) +#define GICV5_OP_GIC_CDHM sys_insn(1, 0, 12, 2, 1) +#define GICV5_OP_GIC_CDEN sys_insn(1, 0, 12, 1, 1) +#define GICV5_OP_GIC_CDEOI sys_insn(1, 0, 12, 1, 7) +#define GICV5_OP_GIC_CDPEND sys_insn(1, 0, 12, 1, 4) +#define GICV5_OP_GIC_CDPRI sys_insn(1, 0, 12, 1, 2) +#define GICV5_OP_GIC_CDRCFG sys_insn(1, 0, 12, 1, 5) +#define GICV5_OP_GICR_CDIA sys_insn(1, 0, 12, 3, 0) +#define GICV5_OP_GICR_CDNMIA sys_insn(1, 0, 12, 3, 1) + +/* Definitions for GIC CDAFF */ +#define GICV5_GIC_CDAFF_IAFFID_MASK GENMASK_ULL(47, 32) +#define GICV5_GIC_CDAFF_TYPE_MASK GENMASK_ULL(31, 29) +#define GICV5_GIC_CDAFF_IRM_MASK BIT_ULL(28) +#define GICV5_GIC_CDAFF_ID_MASK GENMASK_ULL(23, 0) + +/* Definitions for GIC CDDI */ +#define GICV5_GIC_CDDI_TYPE_MASK GENMASK_ULL(31, 29) +#define GICV5_GIC_CDDI_ID_MASK GENMASK_ULL(23, 0) + +/* Definitions for GIC CDDIS */ +#define GICV5_GIC_CDDIS_TYPE_MASK GENMASK_ULL(31, 29) +#define GICV5_GIC_CDDIS_TYPE(r) FIELD_GET(GICV5_GIC_CDDIS_TYPE_MASK, r) +#define GICV5_GIC_CDDIS_ID_MASK GENMASK_ULL(23, 0) +#define GICV5_GIC_CDDIS_ID(r) FIELD_GET(GICV5_GIC_CDDIS_ID_MASK, r) + +/* Definitions for GIC CDEN */ +#define GICV5_GIC_CDEN_TYPE_MASK GENMASK_ULL(31, 29) +#define GICV5_GIC_CDEN_ID_MASK GENMASK_ULL(23, 0) + +/* Definitions for GIC CDHM */ +#define GICV5_GIC_CDHM_HM_MASK BIT_ULL(32) +#define GICV5_GIC_CDHM_TYPE_MASK GENMASK_ULL(31, 29) +#define GICV5_GIC_CDHM_ID_MASK GENMASK_ULL(23, 0) + +/* Definitions for GIC CDPEND */ +#define 
GICV5_GIC_CDPEND_PENDING_MASK BIT_ULL(32) +#define GICV5_GIC_CDPEND_TYPE_MASK GENMASK_ULL(31, 29) +#define GICV5_GIC_CDPEND_ID_MASK GENMASK_ULL(23, 0) + +/* Definitions for GIC CDPRI */ +#define GICV5_GIC_CDPRI_PRIORITY_MASK GENMASK_ULL(39, 35) +#define GICV5_GIC_CDPRI_TYPE_MASK GENMASK_ULL(31, 29) +#define GICV5_GIC_CDPRI_ID_MASK GENMASK_ULL(23, 0) + +/* Definitions for GIC CDRCFG */ +#define GICV5_GIC_CDRCFG_TYPE_MASK GENMASK_ULL(31, 29) +#define GICV5_GIC_CDRCFG_ID_MASK GENMASK_ULL(23, 0) + +/* Definitions for GICR CDIA */ +#define GICV5_GICR_CDIA_VALID_MASK BIT_ULL(32) +#define GICV5_GICR_CDIA_VALID(r) FIELD_GET(GICV5_GICR_CDIA_VALID_MASK, r) +#define GICV5_GICR_CDIA_TYPE_MASK GENMASK_ULL(31, 29) +#define GICV5_GICR_CDIA_ID_MASK GENMASK_ULL(23, 0) +#define GICV5_GICR_CDIA_INTID GENMASK_ULL(31, 0) + +/* Definitions for GICR CDNMIA */ +#define GICV5_GICR_CDNMIA_VALID_MASK BIT_ULL(32) +#define GICV5_GICR_CDNMIA_VALID(r) FIELD_GET(GICV5_GICR_CDNMIA_VALID_MASK, r) +#define GICV5_GICR_CDNMIA_TYPE_MASK GENMASK_ULL(31, 29) +#define GICV5_GICR_CDNMIA_ID_MASK GENMASK_ULL(23, 0) + +#define gicr_insn(insn) read_sysreg_s(GICV5_OP_GICR_##insn) +#define gic_insn(v, insn) write_sysreg_s(v, GICV5_OP_GIC_##insn) + +#define __GIC_BARRIER_INSN(op0, op1, CRn, CRm, op2, Rt) \ + __emit_inst(0xd5000000 | \ + sys_insn((op0), (op1), (CRn), (CRm), (op2)) | \ + ((Rt) & 0x1f)) + +#define GSB_SYS_BARRIER_INSN __GIC_BARRIER_INSN(1, 0, 12, 0, 0, 31) +#define GSB_ACK_BARRIER_INSN __GIC_BARRIER_INSN(1, 0, 12, 0, 1, 31) + +#define gsb_ack() asm volatile(GSB_ACK_BARRIER_INSN : : : "memory") +#define gsb_sys() asm volatile(GSB_SYS_BARRIER_INSN : : : "memory") + +#define REPEAT_BYTE(x) ((~0ul / 0xff) * (x)) + +#define GICV5_IRQ_DEFAULT_PRI 0b10000 + +#define GICV5_ARCH_PPI_SW_PPI 0x3 + +void gicv5_ppi_priority_init(void) +{ + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR0_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR1_EL1); + 
write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR2_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR3_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR4_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR5_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR6_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR7_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR8_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR9_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR10_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR11_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR12_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR13_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR14_EL1); + write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR15_EL1); + + /* + * Context synchronization is required to make sure the system register + * write effects are synchronised. 
+ */ + isb(); +} + +void gicv5_cpu_disable_interrupts(void) +{ + u64 cr0; + + cr0 = FIELD_PREP(ICC_CR0_EL1_EN, 0); + write_sysreg_s(cr0, SYS_ICC_CR0_EL1); +} + +void gicv5_cpu_enable_interrupts(void) +{ + u64 cr0, pcr; + + write_sysreg_s(0, SYS_ICC_PPI_ENABLER0_EL1); + write_sysreg_s(0, SYS_ICC_PPI_ENABLER1_EL1); + + gicv5_ppi_priority_init(); + + pcr = FIELD_PREP(ICC_PCR_EL1_PRIORITY, GICV5_IRQ_DEFAULT_PRI); + write_sysreg_s(pcr, SYS_ICC_PCR_EL1); + + cr0 = FIELD_PREP(ICC_CR0_EL1_EN, 1); + write_sysreg_s(cr0, SYS_ICC_CR0_EL1); +} + +#endif From ce29261ec6482de54320c03398eb30e9615aee40 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Thu, 19 Mar 2026 16:00:21 +0000 Subject: [PATCH 182/373] KVM: arm64: selftests: Add no-vgic-v5 selftest Now that GICv5 is supported, it is important to check that all of the GICv5 register state is hidden from a guest that doesn't create a vGICv5. Rename the no-vgic-v3 selftest to no-vgic, and extend it to check GICv5 system registers too. Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260319154937.3619520-42-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- tools/testing/selftests/kvm/Makefile.kvm | 2 +- .../testing/selftests/kvm/arm64/no-vgic-v3.c | 177 ----------- tools/testing/selftests/kvm/arm64/no-vgic.c | 297 ++++++++++++++++++ 3 files changed, 298 insertions(+), 178 deletions(-) delete mode 100644 tools/testing/selftests/kvm/arm64/no-vgic-v3.c create mode 100644 tools/testing/selftests/kvm/arm64/no-vgic.c diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index 98282acd9040..98da9fa4b8b7 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -179,7 +179,7 @@ TEST_GEN_PROGS_arm64 += arm64/vgic_irq TEST_GEN_PROGS_arm64 += arm64/vgic_lpi_stress TEST_GEN_PROGS_arm64 += arm64/vgic_v5 TEST_GEN_PROGS_arm64 += arm64/vpmu_counter_access -TEST_GEN_PROGS_arm64 += arm64/no-vgic-v3 +TEST_GEN_PROGS_arm64 += arm64/no-vgic 
TEST_GEN_PROGS_arm64 += arm64/idreg-idst TEST_GEN_PROGS_arm64 += arm64/kvm-uuid TEST_GEN_PROGS_arm64 += access_tracking_perf_test diff --git a/tools/testing/selftests/kvm/arm64/no-vgic-v3.c b/tools/testing/selftests/kvm/arm64/no-vgic-v3.c deleted file mode 100644 index 152c34776981..000000000000 --- a/tools/testing/selftests/kvm/arm64/no-vgic-v3.c +++ /dev/null @@ -1,177 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 - -// Check that, on a GICv3 system, not configuring GICv3 correctly -// results in all of the sysregs generating an UNDEF exception. - -#include -#include -#include - -static volatile bool handled; - -#define __check_sr_read(r) \ - ({ \ - uint64_t val; \ - \ - handled = false; \ - dsb(sy); \ - val = read_sysreg_s(SYS_ ## r); \ - val; \ - }) - -#define __check_sr_write(r) \ - do { \ - handled = false; \ - dsb(sy); \ - write_sysreg_s(0, SYS_ ## r); \ - isb(); \ - } while(0) - -/* Fatal checks */ -#define check_sr_read(r) \ - do { \ - __check_sr_read(r); \ - __GUEST_ASSERT(handled, #r " no read trap"); \ - } while(0) - -#define check_sr_write(r) \ - do { \ - __check_sr_write(r); \ - __GUEST_ASSERT(handled, #r " no write trap"); \ - } while(0) - -#define check_sr_rw(r) \ - do { \ - check_sr_read(r); \ - check_sr_write(r); \ - } while(0) - -static void guest_code(void) -{ - uint64_t val; - - /* - * Check that we advertise that ID_AA64PFR0_EL1.GIC == 0, having - * hidden the feature at runtime without any other userspace action. - */ - __GUEST_ASSERT(FIELD_GET(ID_AA64PFR0_EL1_GIC, - read_sysreg(id_aa64pfr0_el1)) == 0, - "GICv3 wrongly advertised"); - - /* - * Access all GICv3 registers, and fail if we don't get an UNDEF. - * Note that we happily access all the APxRn registers without - * checking their existance, as all we want to see is a failure. 
- */ - check_sr_rw(ICC_PMR_EL1); - check_sr_read(ICC_IAR0_EL1); - check_sr_write(ICC_EOIR0_EL1); - check_sr_rw(ICC_HPPIR0_EL1); - check_sr_rw(ICC_BPR0_EL1); - check_sr_rw(ICC_AP0R0_EL1); - check_sr_rw(ICC_AP0R1_EL1); - check_sr_rw(ICC_AP0R2_EL1); - check_sr_rw(ICC_AP0R3_EL1); - check_sr_rw(ICC_AP1R0_EL1); - check_sr_rw(ICC_AP1R1_EL1); - check_sr_rw(ICC_AP1R2_EL1); - check_sr_rw(ICC_AP1R3_EL1); - check_sr_write(ICC_DIR_EL1); - check_sr_read(ICC_RPR_EL1); - check_sr_write(ICC_SGI1R_EL1); - check_sr_write(ICC_ASGI1R_EL1); - check_sr_write(ICC_SGI0R_EL1); - check_sr_read(ICC_IAR1_EL1); - check_sr_write(ICC_EOIR1_EL1); - check_sr_rw(ICC_HPPIR1_EL1); - check_sr_rw(ICC_BPR1_EL1); - check_sr_rw(ICC_CTLR_EL1); - check_sr_rw(ICC_IGRPEN0_EL1); - check_sr_rw(ICC_IGRPEN1_EL1); - - /* - * ICC_SRE_EL1 may not be trappable, as ICC_SRE_EL2.Enable can - * be RAO/WI. Engage in non-fatal accesses, starting with a - * write of 0 to try and disable SRE, and let's see if it - * sticks. - */ - __check_sr_write(ICC_SRE_EL1); - if (!handled) - GUEST_PRINTF("ICC_SRE_EL1 write not trapping (OK)\n"); - - val = __check_sr_read(ICC_SRE_EL1); - if (!handled) { - __GUEST_ASSERT((val & BIT(0)), - "ICC_SRE_EL1 not trapped but ICC_SRE_EL1.SRE not set\n"); - GUEST_PRINTF("ICC_SRE_EL1 read not trapping (OK)\n"); - } - - GUEST_DONE(); -} - -static void guest_undef_handler(struct ex_regs *regs) -{ - /* Success, we've gracefully exploded! 
*/ - handled = true; - regs->pc += 4; -} - -static void test_run_vcpu(struct kvm_vcpu *vcpu) -{ - struct ucall uc; - - do { - vcpu_run(vcpu); - - switch (get_ucall(vcpu, &uc)) { - case UCALL_ABORT: - REPORT_GUEST_ASSERT(uc); - break; - case UCALL_PRINTF: - printf("%s", uc.buffer); - break; - case UCALL_DONE: - break; - default: - TEST_FAIL("Unknown ucall %lu", uc.cmd); - } - } while (uc.cmd != UCALL_DONE); -} - -static void test_guest_no_gicv3(void) -{ - struct kvm_vcpu *vcpu; - struct kvm_vm *vm; - - /* Create a VM without a GICv3 */ - vm = vm_create_with_one_vcpu(&vcpu, guest_code); - - vm_init_descriptor_tables(vm); - vcpu_init_descriptor_tables(vcpu); - - vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT, - ESR_ELx_EC_UNKNOWN, guest_undef_handler); - - test_run_vcpu(vcpu); - - kvm_vm_free(vm); -} - -int main(int argc, char *argv[]) -{ - struct kvm_vcpu *vcpu; - struct kvm_vm *vm; - uint64_t pfr0; - - test_disable_default_vgic(); - - vm = vm_create_with_one_vcpu(&vcpu, NULL); - pfr0 = vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR0_EL1)); - __TEST_REQUIRE(FIELD_GET(ID_AA64PFR0_EL1_GIC, pfr0), - "GICv3 not supported."); - kvm_vm_free(vm); - - test_guest_no_gicv3(); - - return 0; -} diff --git a/tools/testing/selftests/kvm/arm64/no-vgic.c b/tools/testing/selftests/kvm/arm64/no-vgic.c new file mode 100644 index 000000000000..b14686ef17d1 --- /dev/null +++ b/tools/testing/selftests/kvm/arm64/no-vgic.c @@ -0,0 +1,297 @@ +// SPDX-License-Identifier: GPL-2.0 + +// Check that, on a GICv3-capable system (GICv3 native, or GICv5 with +// FEAT_GCIE_LEGACY), not configuring GICv3 correctly results in all +// of the sysregs generating an UNDEF exception. Do the same for GICv5 +// on a GICv5 host. 
+ +#include +#include +#include + +#include + +static volatile bool handled; + +#define __check_sr_read(r) \ + ({ \ + uint64_t val; \ + \ + handled = false; \ + dsb(sy); \ + val = read_sysreg_s(SYS_ ## r); \ + val; \ + }) + +#define __check_sr_write(r) \ + do { \ + handled = false; \ + dsb(sy); \ + write_sysreg_s(0, SYS_ ## r); \ + isb(); \ + } while (0) + +#define __check_gicv5_gicr_op(r) \ + ({ \ + uint64_t val; \ + \ + handled = false; \ + dsb(sy); \ + val = read_sysreg_s(GICV5_OP_GICR_ ## r); \ + val; \ + }) + +#define __check_gicv5_gic_op(r) \ + do { \ + handled = false; \ + dsb(sy); \ + write_sysreg_s(0, GICV5_OP_GIC_ ## r); \ + isb(); \ + } while (0) + +/* Fatal checks */ +#define check_sr_read(r) \ + do { \ + __check_sr_read(r); \ + __GUEST_ASSERT(handled, #r " no read trap"); \ + } while (0) + +#define check_sr_write(r) \ + do { \ + __check_sr_write(r); \ + __GUEST_ASSERT(handled, #r " no write trap"); \ + } while (0) + +#define check_sr_rw(r) \ + do { \ + check_sr_read(r); \ + check_sr_write(r); \ + } while (0) + +#define check_gicv5_gicr_op(r) \ + do { \ + __check_gicv5_gicr_op(r); \ + __GUEST_ASSERT(handled, #r " no read trap"); \ + } while (0) + +#define check_gicv5_gic_op(r) \ + do { \ + __check_gicv5_gic_op(r); \ + __GUEST_ASSERT(handled, #r " no write trap"); \ + } while (0) + +static void guest_code_gicv3(void) +{ + uint64_t val; + + /* + * Check that we advertise that ID_AA64PFR0_EL1.GIC == 0, having + * hidden the feature at runtime without any other userspace action. + */ + __GUEST_ASSERT(FIELD_GET(ID_AA64PFR0_EL1_GIC, + read_sysreg(id_aa64pfr0_el1)) == 0, + "GICv3 wrongly advertised"); + + /* + * Access all GICv3 registers, and fail if we don't get an UNDEF. + * Note that we happily access all the APxRn registers without + * checking their existence, as all we want to see is a failure. 
+ */ + check_sr_rw(ICC_PMR_EL1); + check_sr_read(ICC_IAR0_EL1); + check_sr_write(ICC_EOIR0_EL1); + check_sr_rw(ICC_HPPIR0_EL1); + check_sr_rw(ICC_BPR0_EL1); + check_sr_rw(ICC_AP0R0_EL1); + check_sr_rw(ICC_AP0R1_EL1); + check_sr_rw(ICC_AP0R2_EL1); + check_sr_rw(ICC_AP0R3_EL1); + check_sr_rw(ICC_AP1R0_EL1); + check_sr_rw(ICC_AP1R1_EL1); + check_sr_rw(ICC_AP1R2_EL1); + check_sr_rw(ICC_AP1R3_EL1); + check_sr_write(ICC_DIR_EL1); + check_sr_read(ICC_RPR_EL1); + check_sr_write(ICC_SGI1R_EL1); + check_sr_write(ICC_ASGI1R_EL1); + check_sr_write(ICC_SGI0R_EL1); + check_sr_read(ICC_IAR1_EL1); + check_sr_write(ICC_EOIR1_EL1); + check_sr_rw(ICC_HPPIR1_EL1); + check_sr_rw(ICC_BPR1_EL1); + check_sr_rw(ICC_CTLR_EL1); + check_sr_rw(ICC_IGRPEN0_EL1); + check_sr_rw(ICC_IGRPEN1_EL1); + + /* + * ICC_SRE_EL1 may not be trappable, as ICC_SRE_EL2.Enable can + * be RAO/WI. Engage in non-fatal accesses, starting with a + * write of 0 to try and disable SRE, and let's see if it + * sticks. + */ + __check_sr_write(ICC_SRE_EL1); + if (!handled) + GUEST_PRINTF("ICC_SRE_EL1 write not trapping (OK)\n"); + + val = __check_sr_read(ICC_SRE_EL1); + if (!handled) { + __GUEST_ASSERT((val & BIT(0)), + "ICC_SRE_EL1 not trapped but ICC_SRE_EL1.SRE not set\n"); + GUEST_PRINTF("ICC_SRE_EL1 read not trapping (OK)\n"); + } + + GUEST_DONE(); +} + +static void guest_code_gicv5(void) +{ + /* + * Check that we advertise that ID_AA64PFR2_EL1.GCIE == 0, having + * hidden the feature at runtime without any other userspace action. + */ + __GUEST_ASSERT(FIELD_GET(ID_AA64PFR2_EL1_GCIE, + read_sysreg_s(SYS_ID_AA64PFR2_EL1)) == 0, + "GICv5 wrongly advertised"); + + /* + * Try all GICv5 instructions, and fail if we don't get an UNDEF. 
+ */ + check_gicv5_gic_op(CDAFF); + check_gicv5_gic_op(CDDI); + check_gicv5_gic_op(CDDIS); + check_gicv5_gic_op(CDEOI); + check_gicv5_gic_op(CDHM); + check_gicv5_gic_op(CDPEND); + check_gicv5_gic_op(CDPRI); + check_gicv5_gic_op(CDRCFG); + check_gicv5_gicr_op(CDIA); + check_gicv5_gicr_op(CDNMIA); + + /* Check General System Register accesses */ + check_sr_rw(ICC_APR_EL1); + check_sr_rw(ICC_CR0_EL1); + check_sr_read(ICC_HPPIR_EL1); + check_sr_read(ICC_IAFFIDR_EL1); + check_sr_rw(ICC_ICSR_EL1); + check_sr_read(ICC_IDR0_EL1); + check_sr_rw(ICC_PCR_EL1); + + /* Check PPI System Register accesses */ + check_sr_rw(ICC_PPI_CACTIVER0_EL1); + check_sr_rw(ICC_PPI_CACTIVER1_EL1); + check_sr_rw(ICC_PPI_SACTIVER0_EL1); + check_sr_rw(ICC_PPI_SACTIVER1_EL1); + check_sr_rw(ICC_PPI_CPENDR0_EL1); + check_sr_rw(ICC_PPI_CPENDR1_EL1); + check_sr_rw(ICC_PPI_SPENDR0_EL1); + check_sr_rw(ICC_PPI_SPENDR1_EL1); + check_sr_rw(ICC_PPI_ENABLER0_EL1); + check_sr_rw(ICC_PPI_ENABLER1_EL1); + check_sr_read(ICC_PPI_HMR0_EL1); + check_sr_read(ICC_PPI_HMR1_EL1); + check_sr_rw(ICC_PPI_PRIORITYR0_EL1); + check_sr_rw(ICC_PPI_PRIORITYR1_EL1); + check_sr_rw(ICC_PPI_PRIORITYR2_EL1); + check_sr_rw(ICC_PPI_PRIORITYR3_EL1); + check_sr_rw(ICC_PPI_PRIORITYR4_EL1); + check_sr_rw(ICC_PPI_PRIORITYR5_EL1); + check_sr_rw(ICC_PPI_PRIORITYR6_EL1); + check_sr_rw(ICC_PPI_PRIORITYR7_EL1); + check_sr_rw(ICC_PPI_PRIORITYR8_EL1); + check_sr_rw(ICC_PPI_PRIORITYR9_EL1); + check_sr_rw(ICC_PPI_PRIORITYR10_EL1); + check_sr_rw(ICC_PPI_PRIORITYR11_EL1); + check_sr_rw(ICC_PPI_PRIORITYR12_EL1); + check_sr_rw(ICC_PPI_PRIORITYR13_EL1); + check_sr_rw(ICC_PPI_PRIORITYR14_EL1); + check_sr_rw(ICC_PPI_PRIORITYR15_EL1); + + GUEST_DONE(); +} + +static void guest_undef_handler(struct ex_regs *regs) +{ + /* Success, we've gracefully exploded! 
*/ + handled = true; + regs->pc += 4; +} + +static void test_run_vcpu(struct kvm_vcpu *vcpu) +{ + struct ucall uc; + + do { + vcpu_run(vcpu); + + switch (get_ucall(vcpu, &uc)) { + case UCALL_ABORT: + REPORT_GUEST_ASSERT(uc); + break; + case UCALL_PRINTF: + printf("%s", uc.buffer); + break; + case UCALL_DONE: + break; + default: + TEST_FAIL("Unknown ucall %lu", uc.cmd); + } + } while (uc.cmd != UCALL_DONE); +} + +static void test_guest_no_vgic(void *guest_code) +{ + struct kvm_vcpu *vcpu; + struct kvm_vm *vm; + + /* Create a VM without a GIC */ + vm = vm_create_with_one_vcpu(&vcpu, guest_code); + + vm_init_descriptor_tables(vm); + vcpu_init_descriptor_tables(vcpu); + + vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT, + ESR_ELx_EC_UNKNOWN, guest_undef_handler); + + test_run_vcpu(vcpu); + + kvm_vm_free(vm); +} + +int main(int argc, char *argv[]) +{ + struct kvm_vcpu *vcpu; + struct kvm_vm *vm; + bool has_v3, has_v5; + uint64_t pfr; + + test_disable_default_vgic(); + + vm = vm_create_with_one_vcpu(&vcpu, NULL); + + pfr = vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR0_EL1)); + has_v3 = !!FIELD_GET(ID_AA64PFR0_EL1_GIC, pfr); + + pfr = vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR2_EL1)); + has_v5 = !!FIELD_GET(ID_AA64PFR2_EL1_GCIE, pfr); + + kvm_vm_free(vm); + + __TEST_REQUIRE(has_v3 || has_v5, + "Neither GICv3 nor GICv5 supported."); + + if (has_v3) { + pr_info("Testing no-vgic-v3\n"); + test_guest_no_vgic(guest_code_gicv3); + } else { + pr_info("No GICv3 support: skipping no-vgic-v3 test\n"); + } + + if (has_v5) { + pr_info("Testing no-vgic-v5\n"); + test_guest_no_vgic(guest_code_gicv5); + } else { + pr_info("No GICv5 support: skipping no-vgic-v5 test\n"); + } + + return 0; +} From 58b4bd18390ec3118d8577e19bdee0d01d40c31e Mon Sep 17 00:00:00 2001 From: Nathan Chancellor Date: Fri, 20 Mar 2026 14:29:33 -0700 Subject: [PATCH 183/373] tracing: Adjust cmd_check_undefined to show unexpected undefined symbols When the check_undefined command in 
kernel/trace/Makefile fails, there is no output, making it hard to understand why the build failed. Capture the output of the $(NM) + grep command and print it when failing to make it clearer what the problem is. Fixes: a717943d8ecc ("tracing: Check for undefined symbols in simple_ring_buffer") Signed-off-by: Nathan Chancellor Reviewed-by: Vincent Donnefort Acked-by: Arnd Bergmann Link: https://patch.msgid.link/20260320-cmd_check_undefined-verbose-v1-1-54fc5b061f94@kernel.org Signed-off-by: Marc Zyngier --- kernel/trace/Makefile | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index c5e14ffd36ee..d662c1a64cd5 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -174,7 +174,13 @@ UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitize $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}') quiet_cmd_check_undefined = NM $< - cmd_check_undefined = test -z "`$(NM) -u $< | grep -v $(addprefix -e , $(UNDEFINED_ALLOWLIST))`" + cmd_check_undefined = \ + undefsyms=$$($(NM) -u $< | grep -v $(addprefix -e , $(UNDEFINED_ALLOWLIST)) || true); \ + if [ -n "$$undefsyms" ]; then \ + echo "Unexpected symbols in $<:" >&2; \ + echo "$$undefsyms" >&2; \ + false; \ + fi $(obj)/%.o.checked: $(obj)/%.o $(obj)/undefsyms_base.o FORCE $(call if_changed,check_undefined) From 19e15dc73f0fc74eaf63ad9b3a50648450269b4d Mon Sep 17 00:00:00 2001 From: Wei-Lin Chang Date: Tue, 17 Mar 2026 18:26:38 +0000 Subject: [PATCH 184/373] KVM: arm64: nv: Expose shadow page tables in debugfs Exposing shadow page tables in debugfs improves the debuggability and testability of NV. With this patch, a new "nested" directory is created for each VM if the host is NV-capable. Within the directory, each valid s2 mmu will have its shadow page table exposed as a readable file with the file name formatted as 0x-0x-s2-{en,dis}abled. 
The creation and removal of the files happen at the points when an s2 mmu becomes valid, or the context it represents changes. In the future the "nested" directory can also hold other NV-related information. This is gated behind CONFIG_PTDUMP_STAGE2_DEBUGFS. Suggested-by: Marc Zyngier Reviewed-by: Sebastian Ene Signed-off-by: Wei-Lin Chang Reviewed-by: Joey Gouly Link: https://patch.msgid.link/20260317182638.1592507-3-weilin.chang@arm.com [maz: minor refactor, full 16 chars addresses] Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_host.h | 9 +++++++++ arch/arm64/include/asm/kvm_mmu.h | 4 ++++ arch/arm64/kvm/nested.c | 6 +++++- arch/arm64/kvm/ptdump.c | 26 ++++++++++++++++++++++++++ 4 files changed, 44 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 2ca264b3db5f..70d9f6855cce 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -217,6 +217,10 @@ struct kvm_s2_mmu { */ bool nested_stage2_enabled; +#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS + struct dentry *shadow_pt_debugfs_dentry; +#endif + /* * true when this MMU needs to be unmapped before being used for a new * purpose. @@ -405,6 +409,11 @@ struct kvm_arch { * the associated pKVM instance in the hypervisor. 
*/ struct kvm_protected_vm pkvm; + +#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS + /* Nested virtualization info */ + struct dentry *debugfs_nv_dentry; +#endif }; struct kvm_vcpu_fault_info { diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h index d968aca0461a..01e9c72d6aa7 100644 --- a/arch/arm64/include/asm/kvm_mmu.h +++ b/arch/arm64/include/asm/kvm_mmu.h @@ -393,8 +393,12 @@ static inline bool kvm_supports_cacheable_pfnmap(void) #ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS void kvm_s2_ptdump_create_debugfs(struct kvm *kvm); +void kvm_nested_s2_ptdump_create_debugfs(struct kvm_s2_mmu *mmu); +void kvm_nested_s2_ptdump_remove_debugfs(struct kvm_s2_mmu *mmu); #else static inline void kvm_s2_ptdump_create_debugfs(struct kvm *kvm) {} +static inline void kvm_nested_s2_ptdump_create_debugfs(struct kvm_s2_mmu *mmu) {} +static inline void kvm_nested_s2_ptdump_remove_debugfs(struct kvm_s2_mmu *mmu) {} #endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */ #endif /* __ASSEMBLER__ */ diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c index 12c9f6e8dfda..34bab84b18fe 100644 --- a/arch/arm64/kvm/nested.c +++ b/arch/arm64/kvm/nested.c @@ -730,8 +730,10 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu) kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size; /* Make sure we don't forget to do the laundry */ - if (kvm_s2_mmu_valid(s2_mmu)) + if (kvm_s2_mmu_valid(s2_mmu)) { + kvm_nested_s2_ptdump_remove_debugfs(s2_mmu); s2_mmu->pending_unmap = true; + } /* * The virtual VMID (modulo CnP) will be used as a key when matching @@ -745,6 +747,8 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu) s2_mmu->tlb_vtcr = vcpu_read_sys_reg(vcpu, VTCR_EL2); s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, HCR_EL2) & HCR_VM; + kvm_nested_s2_ptdump_create_debugfs(s2_mmu); + out: atomic_inc(&s2_mmu->refcnt); diff --git a/arch/arm64/kvm/ptdump.c b/arch/arm64/kvm/ptdump.c index 8713001e992d..f5753ac37a05 100644 --- 
a/arch/arm64/kvm/ptdump.c +++ b/arch/arm64/kvm/ptdump.c @@ -10,12 +10,14 @@ #include #include +#include #include #include #include #define MARKERS_LEN 2 #define KVM_PGTABLE_MAX_LEVELS (KVM_PGTABLE_LAST_LEVEL + 1) +#define S2FNAMESZ sizeof("0x0123456789abcdef-0x0123456789abcdef-s2-disabled") struct kvm_ptdump_guest_state { struct kvm_s2_mmu *mmu; @@ -277,6 +279,28 @@ static const struct file_operations kvm_pgtable_levels_fops = { .release = kvm_pgtable_debugfs_close, }; +void kvm_nested_s2_ptdump_create_debugfs(struct kvm_s2_mmu *mmu) +{ + struct dentry *dent; + char file_name[S2FNAMESZ]; + + snprintf(file_name, sizeof(file_name), "0x%016llx-0x%016llx-s2-%sabled", + mmu->tlb_vttbr, + mmu->tlb_vtcr, + mmu->nested_stage2_enabled ? "en" : "dis"); + + dent = debugfs_create_file(file_name, 0400, + mmu->arch->debugfs_nv_dentry, mmu, + &kvm_ptdump_guest_fops); + + mmu->shadow_pt_debugfs_dentry = dent; +} + +void kvm_nested_s2_ptdump_remove_debugfs(struct kvm_s2_mmu *mmu) +{ + debugfs_remove(mmu->shadow_pt_debugfs_dentry); +} + void kvm_s2_ptdump_create_debugfs(struct kvm *kvm) { debugfs_create_file("stage2_page_tables", 0400, kvm->debugfs_dentry, @@ -285,4 +309,6 @@ void kvm_s2_ptdump_create_debugfs(struct kvm *kvm) &kvm->arch.mmu, &kvm_pgtable_range_fops); debugfs_create_file("stage2_levels", 0400, kvm->debugfs_dentry, &kvm->arch.mmu, &kvm_pgtable_levels_fops); + if (cpus_have_final_cap(ARM64_HAS_NESTED_VIRT)) + kvm->arch.debugfs_nv_dentry = debugfs_create_dir("nested", kvm->debugfs_dentry); } From 4ebfa3230b40728638a6acceb709f900f920f921 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sat, 21 Mar 2026 21:24:15 +0000 Subject: [PATCH 185/373] KVM: arm64: pkvm: Move error handling to the end of kvm_hyp_cpu_entry We currently handle CPUs having booted at EL1 in the middle of the kvm_hyp_cpu_entry function. Not only does this adversely affect readability, it also sits at a bizarre spot should more error handling be added (which we're about to do). 
Move the WFE/WFI loop to the end of the function and fix a comment. Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Link: https://patch.msgid.link/20260321212419.2803972-2-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/hyp-init.S | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S index 0d42eedc7167..5d00bde09201 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S +++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S @@ -201,14 +201,9 @@ SYM_CODE_START_LOCAL(__kvm_hyp_init_cpu) /* Check that the core was booted in EL2. */ mrs x0, CurrentEL cmp x0, #CurrentEL_EL2 - b.eq 2f + b.ne 1f - /* The core booted in EL1. KVM cannot be initialized on it. */ -1: wfe - wfi - b 1b - -2: msr SPsel, #1 // We want to use SP_EL{1,2} + msr SPsel, #1 // We want to use SP_EL2 init_el2_hcr 0 @@ -222,6 +217,11 @@ SYM_CODE_START_LOCAL(__kvm_hyp_init_cpu) mov x0, x29 ldr x1, =kvm_host_psci_cpu_entry br x1 + + // The core booted in EL1. KVM cannot be initialized on it. +1: wfe + wfi + b 1b SYM_CODE_END(__kvm_hyp_init_cpu) SYM_CODE_START(__kvm_handle_stub_hvc) From 1536a0b1386850b67a9ea840e57b7b475e895fed Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sat, 21 Mar 2026 21:24:16 +0000 Subject: [PATCH 186/373] KVM: arm64: pkvm: Simplify BTI handling on CPU boot In order to perform an indirect branch to kvm_host_psci_cpu_entry() on a BTI-aware system, we first branch to a 'BTI j' landing pad, and from there branch again to the target. While this works, this is really not required: - BLR works with 'BTI c' and 'PACIASP' as the landing pad - Even if LR gets clobbered by BLR, we are going to restore the host's registers, so it is pointless to try and avoid touching LR Given the above, drop the veneer and directly call into C code. If we were to come back from it, we'd directly enter the error handler. 
Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Link: https://patch.msgid.link/20260321212419.2803972-3-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/host.S | 10 ---------- arch/arm64/kvm/hyp/nvhe/hyp-init.S | 9 +++++---- 2 files changed, 5 insertions(+), 14 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S index eef15b374abb..465f6f1dd641 100644 --- a/arch/arm64/kvm/hyp/nvhe/host.S +++ b/arch/arm64/kvm/hyp/nvhe/host.S @@ -291,13 +291,3 @@ SYM_CODE_START(__kvm_hyp_host_forward_smc) ret SYM_CODE_END(__kvm_hyp_host_forward_smc) - -/* - * kvm_host_psci_cpu_entry is called through br instruction, which requires - * bti j instruction as compilers (gcc and llvm) doesn't insert bti j for external - * functions, but bti c instead. - */ -SYM_CODE_START(kvm_host_psci_cpu_entry) - bti j - b __kvm_host_psci_cpu_entry -SYM_CODE_END(kvm_host_psci_cpu_entry) diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S index 5d00bde09201..55e0dce65dc5 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S +++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S @@ -213,12 +213,13 @@ SYM_CODE_START_LOCAL(__kvm_hyp_init_cpu) mov x0, x28 bl ___kvm_hyp_init // Clobbers x0..x2 - /* Leave idmap. */ + /* Leave idmap -- using BLR is OK, LR is restored from host context */ mov x0, x29 - ldr x1, =kvm_host_psci_cpu_entry - br x1 + ldr x1, =__kvm_host_psci_cpu_entry + blr x1 - // The core booted in EL1. KVM cannot be initialized on it. + // The core booted in EL1, or the C code unexpectedly returned. + // Either way, KVM cannot be initialized on it. 1: wfe wfi b 1b From ba64e273eac3d7ec4a2b621b3620c4d3b0399858 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sat, 21 Mar 2026 21:24:17 +0000 Subject: [PATCH 187/373] KVM: arm64: pkvm: Turn __kvm_hyp_init_cpu into an inner label __kvm_hyp_init_cpu really is an internal label for kvm_hyp_cpu_entry and kvm_hyp_cpu_resume. 
Make it clear that this is what it is, and drop a pointless branch in kvm_hyp_cpu_resume. Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Link: https://patch.msgid.link/20260321212419.2803972-4-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/hyp-init.S | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S index 55e0dce65dc5..2e80fcbff2df 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S +++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S @@ -175,7 +175,6 @@ SYM_CODE_END(___kvm_hyp_init) SYM_CODE_START(kvm_hyp_cpu_entry) mov x1, #1 // is_cpu_on = true b __kvm_hyp_init_cpu -SYM_CODE_END(kvm_hyp_cpu_entry) /* * PSCI CPU_SUSPEND / SYSTEM_SUSPEND entry point @@ -184,17 +183,8 @@ SYM_CODE_END(kvm_hyp_cpu_entry) */ SYM_CODE_START(kvm_hyp_cpu_resume) mov x1, #0 // is_cpu_on = false - b __kvm_hyp_init_cpu -SYM_CODE_END(kvm_hyp_cpu_resume) -/* - * Common code for CPU entry points. Initializes EL2 state and - * installs the hypervisor before handing over to a C handler. - * - * x0: struct kvm_nvhe_init_params PA - * x1: bool is_cpu_on - */ -SYM_CODE_START_LOCAL(__kvm_hyp_init_cpu) +SYM_INNER_LABEL(__kvm_hyp_init_cpu, SYM_L_LOCAL) mov x28, x0 // Stash arguments mov x29, x1 @@ -223,7 +213,8 @@ SYM_CODE_START_LOCAL(__kvm_hyp_init_cpu) 1: wfe wfi b 1b -SYM_CODE_END(__kvm_hyp_init_cpu) +SYM_CODE_END(kvm_hyp_cpu_resume) +SYM_CODE_END(kvm_hyp_cpu_entry) SYM_CODE_START(__kvm_handle_stub_hvc) /* From 59c6e12d40a5b05038b68bcdb4690456fee68e8a Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sat, 21 Mar 2026 21:24:18 +0000 Subject: [PATCH 188/373] KVM: arm64: pkvm: Use direct function pointers for cpu_{on,resume} Instead of using a boolean to decide whether a CPU is booting or resuming, just pass an actual function pointer around. This makes the code a bit more straightforward to understand. 
Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Link: https://patch.msgid.link/20260321212419.2803972-5-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 3 ++- arch/arm64/kvm/hyp/nvhe/hyp-init.S | 9 +++---- arch/arm64/kvm/hyp/nvhe/psci-relay.c | 39 +++++++++++++++++----------- 3 files changed, 29 insertions(+), 22 deletions(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index a1ad12c72ebf..f4c769857fdf 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -291,7 +291,8 @@ asmlinkage void __noreturn hyp_panic_bad_stack(void); asmlinkage void kvm_unexpected_el2_exception(void); struct kvm_cpu_context; void handle_trap(struct kvm_cpu_context *host_ctxt); -asmlinkage void __noreturn __kvm_host_psci_cpu_entry(bool is_cpu_on); +asmlinkage void __noreturn __kvm_host_psci_cpu_on_entry(void); +asmlinkage void __noreturn __kvm_host_psci_cpu_resume_entry(void); void __noreturn __pkvm_init_finalise(void); void kvm_nvhe_prepare_backtrace(unsigned long fp, unsigned long pc); void kvm_patch_vector_branch(struct alt_instr *alt, diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S index 2e80fcbff2df..64296b31da73 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S +++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S @@ -173,7 +173,7 @@ SYM_CODE_END(___kvm_hyp_init) * x0: struct kvm_nvhe_init_params PA */ SYM_CODE_START(kvm_hyp_cpu_entry) - mov x1, #1 // is_cpu_on = true + ldr x29, =__kvm_host_psci_cpu_on_entry b __kvm_hyp_init_cpu /* @@ -182,11 +182,10 @@ SYM_CODE_START(kvm_hyp_cpu_entry) * x0: struct kvm_nvhe_init_params PA */ SYM_CODE_START(kvm_hyp_cpu_resume) - mov x1, #0 // is_cpu_on = false + ldr x29, =__kvm_host_psci_cpu_resume_entry SYM_INNER_LABEL(__kvm_hyp_init_cpu, SYM_L_LOCAL) mov x28, x0 // Stash arguments - mov x29, x1 /* Check that the core was booted in EL2. 
*/ mrs x0, CurrentEL @@ -204,9 +203,7 @@ SYM_INNER_LABEL(__kvm_hyp_init_cpu, SYM_L_LOCAL) bl ___kvm_hyp_init // Clobbers x0..x2 /* Leave idmap -- using BLR is OK, LR is restored from host context */ - mov x0, x29 - ldr x1, =__kvm_host_psci_cpu_entry - blr x1 + blr x29 // The core booted in EL1, or the C code unexpectedly returned. // Either way, KVM cannot be initialized on it. diff --git a/arch/arm64/kvm/hyp/nvhe/psci-relay.c b/arch/arm64/kvm/hyp/nvhe/psci-relay.c index c3e196fb8b18..5aeb5b453a59 100644 --- a/arch/arm64/kvm/hyp/nvhe/psci-relay.c +++ b/arch/arm64/kvm/hyp/nvhe/psci-relay.c @@ -200,23 +200,12 @@ static int psci_system_suspend(u64 func_id, struct kvm_cpu_context *host_ctxt) __hyp_pa(init_params), 0); } -asmlinkage void __noreturn __kvm_host_psci_cpu_entry(bool is_cpu_on) +static void __noreturn __kvm_host_psci_cpu_entry(unsigned long pc, unsigned long r0) { - struct psci_boot_args *boot_args; - struct kvm_cpu_context *host_ctxt; + struct kvm_cpu_context *host_ctxt = host_data_ptr(host_ctxt); - host_ctxt = host_data_ptr(host_ctxt); - - if (is_cpu_on) - boot_args = this_cpu_ptr(&cpu_on_args); - else - boot_args = this_cpu_ptr(&suspend_args); - - cpu_reg(host_ctxt, 0) = boot_args->r0; - write_sysreg_el2(boot_args->pc, SYS_ELR); - - if (is_cpu_on) - release_boot_args(boot_args); + cpu_reg(host_ctxt, 0) = r0; + write_sysreg_el2(pc, SYS_ELR); write_sysreg_el1(INIT_SCTLR_EL1_MMU_OFF, SYS_SCTLR); write_sysreg(INIT_PSTATE_EL1, SPSR_EL2); @@ -224,6 +213,26 @@ asmlinkage void __noreturn __kvm_host_psci_cpu_entry(bool is_cpu_on) __host_enter(host_ctxt); } +asmlinkage void __noreturn __kvm_host_psci_cpu_on_entry(void) +{ + struct psci_boot_args *boot_args = this_cpu_ptr(&cpu_on_args); + unsigned long pc, r0; + + pc = READ_ONCE(boot_args->pc); + r0 = READ_ONCE(boot_args->r0); + + release_boot_args(boot_args); + + __kvm_host_psci_cpu_entry(pc, r0); +} + +asmlinkage void __noreturn __kvm_host_psci_cpu_resume_entry(void) +{ + struct psci_boot_args *boot_args = 
this_cpu_ptr(&suspend_args); + + __kvm_host_psci_cpu_entry(boot_args->pc, boot_args->r0); +} + static unsigned long psci_0_1_handler(u64 func_id, struct kvm_cpu_context *host_ctxt) { if (is_psci_0_1(cpu_off, func_id) || is_psci_0_1(migrate, func_id)) From 54a3cc145456272b10c1452fe89e1dcf933d5c39 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sat, 21 Mar 2026 21:24:19 +0000 Subject: [PATCH 189/373] KVM: arm64: Remove extra ISBs when using msr_hcr_el2 The msr_hcr_el2 macro is slightly awkward, as it provides an ISB when CONFIG_AMPERE_ERRATUM_AC04_CPU_23 is present, and none otherwise. Note that this option is 'default y', meaning that it is likely to be selected. Most instances of msr_hcr_el2 are also immediately followed by an ISB, meaning that in most cases, you end up with two back-to-back ISBs. This isn't a big deal, but once you have seen that, you can't unsee it. Rework the msr_hcr_el2 macro to always provide the ISB, and drop the superfluous ISBs everywhere else. Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Link: https://patch.msgid.link/20260321212419.2803972-6-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/el2_setup.h | 2 -- arch/arm64/include/asm/sysreg.h | 6 ++---- arch/arm64/kernel/hyp-stub.S | 1 - arch/arm64/kvm/hyp/nvhe/host.S | 1 - 4 files changed, 2 insertions(+), 8 deletions(-) diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h index 85f4c1615472..3e58d6264581 100644 --- a/arch/arm64/include/asm/el2_setup.h +++ b/arch/arm64/include/asm/el2_setup.h @@ -50,7 +50,6 @@ * effectively VHE-only or not. 
*/ msr_hcr_el2 x0 // Setup HCR_EL2 as nVHE - isb mov x1, #1 // Write something to FAR_EL1 msr far_el1, x1 isb @@ -64,7 +63,6 @@ .LnE2H0_\@: orr x0, x0, #HCR_E2H msr_hcr_el2 x0 - isb .LnVHE_\@: .endm diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h index f4436ecc630c..ca66b8017fa8 100644 --- a/arch/arm64/include/asm/sysreg.h +++ b/arch/arm64/include/asm/sysreg.h @@ -1114,11 +1114,9 @@ .macro msr_hcr_el2, reg #if IS_ENABLED(CONFIG_AMPERE_ERRATUM_AC04_CPU_23) dsb nsh - msr hcr_el2, \reg - isb -#else - msr hcr_el2, \reg #endif + msr hcr_el2, \reg + isb // Required by AMPERE_ERRATUM_AC04_CPU_23 .endm #else diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S index 085bc9972f6b..634ddc904244 100644 --- a/arch/arm64/kernel/hyp-stub.S +++ b/arch/arm64/kernel/hyp-stub.S @@ -103,7 +103,6 @@ SYM_CODE_START_LOCAL(__finalise_el2) // Engage the VHE magic! mov_q x0, HCR_HOST_VHE_FLAGS msr_hcr_el2 x0 - isb // Use the EL1 allocated stack, per-cpu offset mrs x0, sp_el1 diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S index 465f6f1dd641..ff10cafa0ca8 100644 --- a/arch/arm64/kvm/hyp/nvhe/host.S +++ b/arch/arm64/kvm/hyp/nvhe/host.S @@ -125,7 +125,6 @@ SYM_FUNC_START(__hyp_do_panic) mrs x0, hcr_el2 bic x0, x0, #HCR_VM msr_hcr_el2 x0 - isb tlbi vmalls12e1 dsb nsh #endif From fa9681ed5c6ae980af18545690e31d8f9c088c33 Mon Sep 17 00:00:00 2001 From: Jiakai Xu Date: Tue, 3 Mar 2026 01:08:57 +0000 Subject: [PATCH 190/373] RISC-V: KVM: Validate SBI STA shmem alignment in kvm_sbi_ext_sta_set_reg() The RISC-V SBI Steal-Time Accounting (STA) extension requires the shared memory physical address to be 64-byte aligned, or set to all-ones to explicitly disable steal-time accounting. KVM exposes the SBI STA shared memory configuration to userspace via KVM_SET_ONE_REG. However, the current implementation of kvm_sbi_ext_sta_set_reg() does not validate the alignment of the configured shared memory address. 
As a result, userspace can install a misaligned shared memory address that violates the SBI specification. Such an invalid configuration may later reach runtime code paths that assume a valid and properly aligned shared memory region. In particular, KVM_RUN can trigger the following WARN_ON in kvm_riscv_vcpu_record_steal_time(): WARNING: arch/riscv/kvm/vcpu_sbi_sta.c:49 at kvm_riscv_vcpu_record_steal_time WARN_ON paths are not expected to be reachable during normal runtime execution, and may result in a kernel panic when panic_on_warn is enabled. Fix this by validating the computed shared memory GPA at the KVM_SET_ONE_REG boundary. A temporary GPA is constructed and checked before committing it to vcpu->arch.sta.shmem. The validation allows either a 64-byte aligned GPA or INVALID_GPA (all-ones), which disables STA as defined by the SBI specification. This prevents invalid userspace state from reaching runtime code paths that assume SBI STA invariants and avoids unexpected WARN_ON behavior. 
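The rule enforced at the KVM_SET_ONE_REG boundary can be sketched as a standalone check. This is a simplified model under the stated assumptions (a 64-byte alignment requirement, all-ones meaning "disable STA"), not the kernel's actual IS_ALIGNED/errno machinery:

```c
#include <stdint.h>

typedef uint64_t gpa_t;

#define INVALID_GPA     ((gpa_t)~0ULL)
#define STA_SHMEM_ALIGN 64

/*
 * Sketch of the validation: a candidate shmem GPA is committed only if
 * it is 64-byte aligned, or all-ones (INVALID_GPA), which the SBI spec
 * defines as "disable steal-time accounting". Anything else is rejected
 * before it can reach the vCPU's committed state.
 */
static int sta_validate_shmem(gpa_t new_shmem)
{
	if (new_shmem != INVALID_GPA && (new_shmem & (STA_SHMEM_ALIGN - 1)))
		return -22; /* modelling -EINVAL */
	return 0;
}
```

Validating a scratch copy and committing it only on success is what keeps a failed ioctl from leaving a half-written shmem address behind.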
Fixes: f61ce890b1f074 ("RISC-V: KVM: Add support for SBI STA registers") Signed-off-by: Jiakai Xu Signed-off-by: Jiakai Xu Reviewed-by: Andrew Jones Link: https://lore.kernel.org/r/20260303010859.1763177-2-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel --- arch/riscv/kvm/vcpu_sbi_sta.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/arch/riscv/kvm/vcpu_sbi_sta.c b/arch/riscv/kvm/vcpu_sbi_sta.c index afa0545c3bcf..3b834709b429 100644 --- a/arch/riscv/kvm/vcpu_sbi_sta.c +++ b/arch/riscv/kvm/vcpu_sbi_sta.c @@ -181,6 +181,7 @@ static int kvm_sbi_ext_sta_set_reg(struct kvm_vcpu *vcpu, unsigned long reg_num, unsigned long reg_size, const void *reg_val) { unsigned long value; + gpa_t new_shmem = INVALID_GPA; if (reg_size != sizeof(unsigned long)) return -EINVAL; @@ -191,18 +192,18 @@ static int kvm_sbi_ext_sta_set_reg(struct kvm_vcpu *vcpu, unsigned long reg_num, if (IS_ENABLED(CONFIG_32BIT)) { gpa_t hi = upper_32_bits(vcpu->arch.sta.shmem); - vcpu->arch.sta.shmem = value; - vcpu->arch.sta.shmem |= hi << 32; + new_shmem = value; + new_shmem |= hi << 32; } else { - vcpu->arch.sta.shmem = value; + new_shmem = value; } break; case KVM_REG_RISCV_SBI_STA_REG(shmem_hi): if (IS_ENABLED(CONFIG_32BIT)) { gpa_t lo = lower_32_bits(vcpu->arch.sta.shmem); - vcpu->arch.sta.shmem = ((gpa_t)value << 32); - vcpu->arch.sta.shmem |= lo; + new_shmem = ((gpa_t)value << 32); + new_shmem |= lo; } else if (value != 0) { return -EINVAL; } @@ -211,6 +212,11 @@ static int kvm_sbi_ext_sta_set_reg(struct kvm_vcpu *vcpu, unsigned long reg_num, return -ENOENT; } + if (new_shmem != INVALID_GPA && !IS_ALIGNED(new_shmem, 64)) + return -EINVAL; + + vcpu->arch.sta.shmem = new_shmem; + return 0; } From 40351ed924dd30ded1b43c7333ce695a4a835f7b Mon Sep 17 00:00:00 2001 From: Jiakai Xu Date: Tue, 3 Mar 2026 01:08:58 +0000 Subject: [PATCH 191/373] KVM: selftests: Refactor UAPI tests into dedicated function Move steal time UAPI tests from steal_time_init() into a separate 
check_steal_time_uapi() function for better code organization and maintainability. Previously, x86 and ARM64 architectures performed UAPI validation tests within steal_time_init(), mixing initialization logic with uapi tests. Changes by architecture: x86_64: - Extract MSR reserved bits test from steal_time_init() - Move to check_steal_time_uapi() which tests that setting MSR_KVM_STEAL_TIME with KVM_STEAL_RESERVED_MASK fails ARM64: - Extract three UAPI tests from steal_time_init(): Device attribute support check Misaligned IPA rejection (EINVAL) Duplicate IPA setting rejection (EEXIST) - Move all tests to check_steal_time_uapi() RISC-V: - Add empty check_steal_time_uapi() stub for future use - No changes to steal_time_init() (had no tests to extract) The new check_steal_time_uapi() function: - Is called once before the per-VCPU test loop No functional change intended. Suggested-by: Andrew Jones Signed-off-by: Jiakai Xu Signed-off-by: Jiakai Xu Reviewed-by: Andrew Jones Link: https://lore.kernel.org/r/20260303010859.1763177-3-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel --- tools/testing/selftests/kvm/steal_time.c | 67 ++++++++++++++++++------ 1 file changed, 51 insertions(+), 16 deletions(-) diff --git a/tools/testing/selftests/kvm/steal_time.c b/tools/testing/selftests/kvm/steal_time.c index 7be8adfe5dd3..8e4d7c13b598 100644 --- a/tools/testing/selftests/kvm/steal_time.c +++ b/tools/testing/selftests/kvm/steal_time.c @@ -69,16 +69,10 @@ static bool is_steal_time_supported(struct kvm_vcpu *vcpu) static void steal_time_init(struct kvm_vcpu *vcpu, uint32_t i) { - int ret; - /* ST_GPA_BASE is identity mapped */ st_gva[i] = (void *)(ST_GPA_BASE + i * STEAL_TIME_SIZE); sync_global_to_guest(vcpu->vm, st_gva[i]); - ret = _vcpu_set_msr(vcpu, MSR_KVM_STEAL_TIME, - (ulong)st_gva[i] | KVM_STEAL_RESERVED_MASK); - TEST_ASSERT(ret == 0, "Bad GPA didn't fail"); - vcpu_set_msr(vcpu, MSR_KVM_STEAL_TIME, (ulong)st_gva[i] | KVM_MSR_ENABLED); } @@ -99,6 +93,21 @@ static void 
steal_time_dump(struct kvm_vm *vm, uint32_t vcpu_idx) st->pad[8], st->pad[9], st->pad[10]); } +static void check_steal_time_uapi(void) +{ + struct kvm_vm *vm; + struct kvm_vcpu *vcpu; + int ret; + + vm = vm_create_with_one_vcpu(&vcpu, NULL); + + ret = _vcpu_set_msr(vcpu, MSR_KVM_STEAL_TIME, + (ulong)ST_GPA_BASE | KVM_STEAL_RESERVED_MASK); + TEST_ASSERT(ret == 0, "Bad GPA didn't fail"); + + kvm_vm_free(vm); +} + #elif defined(__aarch64__) /* PV_TIME_ST must have 64-byte alignment */ @@ -170,7 +179,6 @@ static void steal_time_init(struct kvm_vcpu *vcpu, uint32_t i) { struct kvm_vm *vm = vcpu->vm; uint64_t st_ipa; - int ret; struct kvm_device_attr dev = { .group = KVM_ARM_VCPU_PVTIME_CTRL, @@ -178,21 +186,12 @@ static void steal_time_init(struct kvm_vcpu *vcpu, uint32_t i) .addr = (uint64_t)&st_ipa, }; - vcpu_ioctl(vcpu, KVM_HAS_DEVICE_ATTR, &dev); - /* ST_GPA_BASE is identity mapped */ st_gva[i] = (void *)(ST_GPA_BASE + i * STEAL_TIME_SIZE); sync_global_to_guest(vm, st_gva[i]); - st_ipa = (ulong)st_gva[i] | 1; - ret = __vcpu_ioctl(vcpu, KVM_SET_DEVICE_ATTR, &dev); - TEST_ASSERT(ret == -1 && errno == EINVAL, "Bad IPA didn't report EINVAL"); - st_ipa = (ulong)st_gva[i]; vcpu_ioctl(vcpu, KVM_SET_DEVICE_ATTR, &dev); - - ret = __vcpu_ioctl(vcpu, KVM_SET_DEVICE_ATTR, &dev); - TEST_ASSERT(ret == -1 && errno == EEXIST, "Set IPA twice without EEXIST"); } static void steal_time_dump(struct kvm_vm *vm, uint32_t vcpu_idx) @@ -205,6 +204,36 @@ static void steal_time_dump(struct kvm_vm *vm, uint32_t vcpu_idx) ksft_print_msg(" st_time: %ld\n", st->st_time); } +static void check_steal_time_uapi(void) +{ + struct kvm_vm *vm; + struct kvm_vcpu *vcpu; + uint64_t st_ipa; + int ret; + + vm = vm_create_with_one_vcpu(&vcpu, NULL); + + struct kvm_device_attr dev = { + .group = KVM_ARM_VCPU_PVTIME_CTRL, + .attr = KVM_ARM_VCPU_PVTIME_IPA, + .addr = (uint64_t)&st_ipa, + }; + + vcpu_ioctl(vcpu, KVM_HAS_DEVICE_ATTR, &dev); + + st_ipa = (ulong)ST_GPA_BASE | 1; + ret = __vcpu_ioctl(vcpu, 
KVM_SET_DEVICE_ATTR, &dev); + TEST_ASSERT(ret == -1 && errno == EINVAL, "Bad IPA didn't report EINVAL"); + + st_ipa = (ulong)ST_GPA_BASE; + vcpu_ioctl(vcpu, KVM_SET_DEVICE_ATTR, &dev); + + ret = __vcpu_ioctl(vcpu, KVM_SET_DEVICE_ATTR, &dev); + TEST_ASSERT(ret == -1 && errno == EEXIST, "Set IPA twice without EEXIST"); + + kvm_vm_free(vm); +} + #elif defined(__riscv) /* SBI STA shmem must have 64-byte alignment */ @@ -301,6 +330,10 @@ static void steal_time_dump(struct kvm_vm *vm, uint32_t vcpu_idx) pr_info("\n"); } +static void check_steal_time_uapi(void) +{ +} + #elif defined(__loongarch__) /* steal_time must have 64-byte alignment */ @@ -465,6 +498,8 @@ int main(int ac, char **av) TEST_REQUIRE(is_steal_time_supported(vcpus[0])); ksft_set_plan(NR_VCPUS); + check_steal_time_uapi(); + /* Run test on each VCPU */ for (i = 0; i < NR_VCPUS; ++i) { steal_time_init(vcpus[i], i); From 7c61e7433b49ca948dc8cc2b70a20b3dbc36363d Mon Sep 17 00:00:00 2001 From: Jiakai Xu Date: Tue, 3 Mar 2026 01:08:59 +0000 Subject: [PATCH 192/373] RISC-V: KVM: selftests: Add RISC-V SBI STA shmem alignment tests Add RISC-V KVM selftests to verify the SBI Steal-Time Accounting (STA) shared memory alignment requirements. The SBI specification requires the STA shared memory GPA to be 64-byte aligned, or set to all-ones to explicitly disable steal-time accounting. This test verifies that KVM enforces the expected behavior when configuring the SBI STA shared memory via KVM_SET_ONE_REG. 
Specifically, the test checks that: - misaligned GPAs are rejected with -EINVAL - 64-byte aligned GPAs are accepted - all-ones GPA is accepted Signed-off-by: Jiakai Xu Signed-off-by: Jiakai Xu Reviewed-by: Andrew Jones Link: https://lore.kernel.org/r/20260303010859.1763177-4-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel --- .../selftests/kvm/include/kvm_util_types.h | 2 ++ tools/testing/selftests/kvm/steal_time.c | 31 +++++++++++++++++++ 2 files changed, 33 insertions(+) diff --git a/tools/testing/selftests/kvm/include/kvm_util_types.h b/tools/testing/selftests/kvm/include/kvm_util_types.h index ec787b97cf18..0366e9bce7f9 100644 --- a/tools/testing/selftests/kvm/include/kvm_util_types.h +++ b/tools/testing/selftests/kvm/include/kvm_util_types.h @@ -17,4 +17,6 @@ typedef uint64_t vm_paddr_t; /* Virtual Machine (Guest) physical address */ typedef uint64_t vm_vaddr_t; /* Virtual Machine (Guest) virtual address */ +#define INVALID_GPA (~(uint64_t)0) + #endif /* SELFTEST_KVM_UTIL_TYPES_H */ diff --git a/tools/testing/selftests/kvm/steal_time.c b/tools/testing/selftests/kvm/steal_time.c index 8e4d7c13b598..efe56a10d13e 100644 --- a/tools/testing/selftests/kvm/steal_time.c +++ b/tools/testing/selftests/kvm/steal_time.c @@ -332,6 +332,37 @@ static void steal_time_dump(struct kvm_vm *vm, uint32_t vcpu_idx) static void check_steal_time_uapi(void) { + struct kvm_vm *vm; + struct kvm_vcpu *vcpu; + struct kvm_one_reg reg; + uint64_t shmem; + int ret; + + vm = vm_create_with_one_vcpu(&vcpu, NULL); + + reg.id = KVM_REG_RISCV | + KVM_REG_SIZE_ULONG | + KVM_REG_RISCV_SBI_STATE | + KVM_REG_RISCV_SBI_STA | + KVM_REG_RISCV_SBI_STA_REG(shmem_lo); + reg.addr = (uint64_t)&shmem; + + shmem = ST_GPA_BASE + 1; + ret = __vcpu_ioctl(vcpu, KVM_SET_ONE_REG, ®); + TEST_ASSERT(ret == -1 && errno == EINVAL, + "misaligned STA shmem returns -EINVAL"); + + shmem = ST_GPA_BASE; + ret = __vcpu_ioctl(vcpu, KVM_SET_ONE_REG, ®); + TEST_ASSERT(ret == 0, + "aligned STA shmem succeeds"); + + shmem = 
INVALID_GPA; + ret = __vcpu_ioctl(vcpu, KVM_SET_ONE_REG, ®); + TEST_ASSERT(ret == 0, + "all-ones for STA shmem succeeds"); + + kvm_vm_free(vm); } #elif defined(__loongarch__) From 66fcf492008db407e8d600ceaefd2c8a8070c5ae Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:20 +0000 Subject: [PATCH 193/373] KVM: arm64: Extract VMA size resolution in user_mem_abort() As part of an effort to refactor user_mem_abort() into smaller, more focused helper functions, extract the logic responsible for determining the VMA shift and page size into a new static helper, kvm_s2_resolve_vma_size(). Reviewed-by: Joey Gouly Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 130 ++++++++++++++++++++++++------------------- 1 file changed, 73 insertions(+), 57 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 17d64a1e11e5..f8064b2d3204 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1639,6 +1639,77 @@ out_unlock: return ret != -EAGAIN ? 
ret : 0; } +static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, + unsigned long hva, + struct kvm_memory_slot *memslot, + struct kvm_s2_trans *nested, + bool *force_pte, phys_addr_t *ipa) +{ + short vma_shift; + long vma_pagesize; + + if (*force_pte) + vma_shift = PAGE_SHIFT; + else + vma_shift = get_vma_page_shift(vma, hva); + + switch (vma_shift) { +#ifndef __PAGETABLE_PMD_FOLDED + case PUD_SHIFT: + if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE)) + break; + fallthrough; +#endif + case CONT_PMD_SHIFT: + vma_shift = PMD_SHIFT; + fallthrough; + case PMD_SHIFT: + if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) + break; + fallthrough; + case CONT_PTE_SHIFT: + vma_shift = PAGE_SHIFT; + *force_pte = true; + fallthrough; + case PAGE_SHIFT: + break; + default: + WARN_ONCE(1, "Unknown vma_shift %d", vma_shift); + } + + vma_pagesize = 1UL << vma_shift; + + if (nested) { + unsigned long max_map_size; + + max_map_size = *force_pte ? PAGE_SIZE : PUD_SIZE; + + *ipa = kvm_s2_trans_output(nested); + + /* + * If we're about to create a shadow stage 2 entry, then we + * can only create a block mapping if the guest stage 2 page + * table uses at least as big a mapping. + */ + max_map_size = min(kvm_s2_trans_size(nested), max_map_size); + + /* + * Be careful that if the mapping size falls between + * two host sizes, take the smallest of the two. 
+ */ + if (max_map_size >= PMD_SIZE && max_map_size < PUD_SIZE) + max_map_size = PMD_SIZE; + else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE) + max_map_size = PAGE_SIZE; + + *force_pte = (max_map_size == PAGE_SIZE); + vma_pagesize = min_t(long, vma_pagesize, max_map_size); + vma_shift = __ffs(vma_pagesize); + } + + return vma_shift; +} + static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_s2_trans *nested, struct kvm_memory_slot *memslot, unsigned long hva, @@ -1695,65 +1766,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return -EFAULT; } - if (force_pte) - vma_shift = PAGE_SHIFT; - else - vma_shift = get_vma_page_shift(vma, hva); - - switch (vma_shift) { -#ifndef __PAGETABLE_PMD_FOLDED - case PUD_SHIFT: - if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE)) - break; - fallthrough; -#endif - case CONT_PMD_SHIFT: - vma_shift = PMD_SHIFT; - fallthrough; - case PMD_SHIFT: - if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) - break; - fallthrough; - case CONT_PTE_SHIFT: - vma_shift = PAGE_SHIFT; - force_pte = true; - fallthrough; - case PAGE_SHIFT: - break; - default: - WARN_ONCE(1, "Unknown vma_shift %d", vma_shift); - } - + vma_shift = kvm_s2_resolve_vma_size(vma, hva, memslot, nested, + &force_pte, &ipa); vma_pagesize = 1UL << vma_shift; - if (nested) { - unsigned long max_map_size; - - max_map_size = force_pte ? PAGE_SIZE : PUD_SIZE; - - ipa = kvm_s2_trans_output(nested); - - /* - * If we're about to create a shadow stage 2 entry, then we - * can only create a block mapping if the guest stage 2 page - * table uses at least as big a mapping. - */ - max_map_size = min(kvm_s2_trans_size(nested), max_map_size); - - /* - * Be careful that if the mapping size falls between - * two host sizes, take the smallest of the two. 
- */ - if (max_map_size >= PMD_SIZE && max_map_size < PUD_SIZE) - max_map_size = PMD_SIZE; - else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE) - max_map_size = PAGE_SIZE; - - force_pte = (max_map_size == PAGE_SIZE); - vma_pagesize = min_t(long, vma_pagesize, max_map_size); - vma_shift = __ffs(vma_pagesize); - } - /* * Both the canonical IPA and fault IPA must be aligned to the * mapping size to ensure we find the right PFN and lay down the From aa35bcf2e76234fef7bbca9bf364039692a27661 Mon Sep 17 00:00:00 2001 From: Osama Abdelkader Date: Thu, 12 Mar 2026 00:18:32 +0100 Subject: [PATCH 194/373] RISC-V: KVM: fix PMU snapshot_set_shmem on 32-bit hosts When saddr_high != 0 on RV32, the goto out was unconditional, causing valid 64-bit addresses to be rejected. Only goto out when the address is invalid (64-bit host with saddr_high != 0). Fixes: c2f41ddbcdd7 ("RISC-V: KVM: Implement SBI PMU Snapshot feature") Signed-off-by: Osama Abdelkader Reviewed-by: Andrew Jones Link: https://lore.kernel.org/r/20260311231833.13189-1-osama.abdelkader@gmail.com Signed-off-by: Anup Patel --- arch/riscv/kvm/vcpu_pmu.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c index e873430e596b..f3bf985dcf43 100644 --- a/arch/riscv/kvm/vcpu_pmu.c +++ b/arch/riscv/kvm/vcpu_pmu.c @@ -427,11 +427,12 @@ int kvm_riscv_vcpu_pmu_snapshot_set_shmem(struct kvm_vcpu *vcpu, unsigned long s saddr = saddr_low; if (saddr_high != 0) { - if (IS_ENABLED(CONFIG_32BIT)) + if (IS_ENABLED(CONFIG_32BIT)) { saddr |= ((gpa_t)saddr_high << 32); - else + } else { sbiret = SBI_ERR_INVALID_ADDRESS; - goto out; + goto out; + } } kvpmu->sdata = kzalloc(snapshot_area_size, GFP_ATOMIC); From b7c958d7c1eb1cb9b2be7b5ee4129fcd66cec978 Mon Sep 17 00:00:00 2001 From: Osama Abdelkader Date: Mon, 16 Mar 2026 16:16:11 +0100 Subject: [PATCH 195/373] riscv: kvm: fix vector context allocation leak When the second kzalloc (host_context.vector.datap) 
fails in kvm_riscv_vcpu_alloc_vector_context, the first allocation (guest_context.vector.datap) is leaked. Free it before returning. Fixes: 0f4b82579716 ("riscv: KVM: Add vector lazy save/restore support") Cc: stable@vger.kernel.org Signed-off-by: Osama Abdelkader Reviewed-by: Andy Chiu Link: https://lore.kernel.org/r/20260316151612.13305-1-osama.abdelkader@gmail.com Signed-off-by: Anup Patel --- arch/riscv/kvm/vcpu_vector.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/riscv/kvm/vcpu_vector.c b/arch/riscv/kvm/vcpu_vector.c index 05f3cc2d8e31..5b6ad82d47be 100644 --- a/arch/riscv/kvm/vcpu_vector.c +++ b/arch/riscv/kvm/vcpu_vector.c @@ -80,8 +80,11 @@ int kvm_riscv_vcpu_alloc_vector_context(struct kvm_vcpu *vcpu) return -ENOMEM; vcpu->arch.host_context.vector.datap = kzalloc(riscv_v_vsize, GFP_KERNEL); - if (!vcpu->arch.host_context.vector.datap) + if (!vcpu->arch.host_context.vector.datap) { + kfree(vcpu->arch.guest_context.vector.datap); + vcpu->arch.guest_context.vector.datap = NULL; return -ENOMEM; + } return 0; } From 99594f75b49edc7046057bca06d892c16967a9b3 Mon Sep 17 00:00:00 2001 From: Jiakai Xu Date: Mon, 16 Mar 2026 01:45:32 +0000 Subject: [PATCH 196/373] RISC-V: KVM: Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi() When a guest invokes SBI_EXT_PMU_COUNTER_FW_READ or SBI_EXT_PMU_COUNTER_FW_READ_HI on a firmware counter that has not been configured via SBI_EXT_PMU_COUNTER_CFG_MATCH, the pmc->event_idx remains SBI_PMU_EVENT_IDX_INVALID (0xFFFFFFFF). get_event_code() extracts the lower 16 bits, yielding 0xFFFF (65535), which is then used to index into kvpmu->fw_event[]. 
Since fw_event is only RISCV_KVM_MAX_FW_CTRS (32) entries, this triggers an array-index-out-of-bounds: UBSAN: array-index-out-of-bounds in arch/riscv/kvm/vcpu_pmu.c:255:37 index 65535 is out of range for type 'kvm_fw_event [32]' Add a check for the known unconfigured case (SBI_PMU_EVENT_IDX_INVALID) and a WARN_ONCE guard for any unexpected out-of-bounds event codes, returning -EINVAL in both cases. Fixes: badc386869e2c ("RISC-V: KVM: Support firmware events") Fixes: 08fb07d6dcf71 ("RISC-V: KVM: Support 64 bit firmware counters on RV32") Signed-off-by: Jiakai Xu Signed-off-by: Jiakai Xu Reviewed-by: Andrew Jones Link: https://lore.kernel.org/r/20260316014533.2312254-2-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel --- arch/riscv/kvm/vcpu_pmu.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c index f3bf985dcf43..9e9f3302cef8 100644 --- a/arch/riscv/kvm/vcpu_pmu.c +++ b/arch/riscv/kvm/vcpu_pmu.c @@ -226,7 +226,14 @@ static int pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx, if (pmc->cinfo.type != SBI_PMU_CTR_TYPE_FW) return -EINVAL; + if (pmc->event_idx == SBI_PMU_EVENT_IDX_INVALID) + return -EINVAL; + fevent_code = get_event_code(pmc->event_idx); + if (WARN_ONCE(fevent_code >= SBI_PMU_FW_MAX, + "Invalid firmware event code: %d\n", fevent_code)) + return -EINVAL; + pmc->counter_val = kvpmu->fw_event[fevent_code].value; *out_val = pmc->counter_val >> 32; @@ -251,7 +258,14 @@ static int pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx, pmc = &kvpmu->pmc[cidx]; if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW) { + if (pmc->event_idx == SBI_PMU_EVENT_IDX_INVALID) + return -EINVAL; + fevent_code = get_event_code(pmc->event_idx); + if (WARN_ONCE(fevent_code >= SBI_PMU_FW_MAX, + "Invalid firmware event code: %d\n", fevent_code)) + return -EINVAL; + pmc->counter_val = kvpmu->fw_event[fevent_code].value; } else if (pmc->perf_event) { pmc->counter_val += 
perf_event_read_value(pmc->perf_event, &enabled, &running); From 198c7ce9801abd63c44176c3f034577b887c0070 Mon Sep 17 00:00:00 2001 From: Jiakai Xu Date: Mon, 16 Mar 2026 01:45:33 +0000 Subject: [PATCH 197/373] RISC-V: KVM: selftests: Fix firmware counter read in sbi_pmu_test The current sbi_pmu_test attempts to read firmware counters without configuring them first with SBI_EXT_PMU_COUNTER_CFG_MATCH. Previously this did not fail because KVM incorrectly allowed the read and accessed fw_event[] with an out-of-bounds index when the counter was unconfigured. After fixing that bug, the read now correctly returns SBI_ERR_INVALID_PARAM, causing the selftest to fail. Update the test to configure a firmware event before reading the counter. Also add a negative test to ensure that attempting to read an unconfigured firmware counter fails gracefully. Signed-off-by: Jiakai Xu Signed-off-by: Jiakai Xu Reviewed-by: Andrew Jones Reviewed-by: Nutty Liu Link: https://lore.kernel.org/r/20260316014533.2312254-3-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel --- .../testing/selftests/kvm/include/riscv/sbi.h | 37 +++++++++++++++++++ .../selftests/kvm/riscv/sbi_pmu_test.c | 20 +++++++++- 2 files changed, 56 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/include/riscv/sbi.h b/tools/testing/selftests/kvm/include/riscv/sbi.h index 046b432ae896..16f1815ac48f 100644 --- a/tools/testing/selftests/kvm/include/riscv/sbi.h +++ b/tools/testing/selftests/kvm/include/riscv/sbi.h @@ -97,6 +97,43 @@ enum sbi_pmu_hw_generic_events_t { SBI_PMU_HW_GENERAL_MAX, }; +enum sbi_pmu_fw_generic_events_t { + SBI_PMU_FW_MISALIGNED_LOAD = 0, + SBI_PMU_FW_MISALIGNED_STORE = 1, + SBI_PMU_FW_ACCESS_LOAD = 2, + SBI_PMU_FW_ACCESS_STORE = 3, + SBI_PMU_FW_ILLEGAL_INSN = 4, + SBI_PMU_FW_SET_TIMER = 5, + SBI_PMU_FW_IPI_SENT = 6, + SBI_PMU_FW_IPI_RCVD = 7, + SBI_PMU_FW_FENCE_I_SENT = 8, + SBI_PMU_FW_FENCE_I_RCVD = 9, + SBI_PMU_FW_SFENCE_VMA_SENT = 10, + SBI_PMU_FW_SFENCE_VMA_RCVD = 11, + 
SBI_PMU_FW_SFENCE_VMA_ASID_SENT = 12, + SBI_PMU_FW_SFENCE_VMA_ASID_RCVD = 13, + + SBI_PMU_FW_HFENCE_GVMA_SENT = 14, + SBI_PMU_FW_HFENCE_GVMA_RCVD = 15, + SBI_PMU_FW_HFENCE_GVMA_VMID_SENT = 16, + SBI_PMU_FW_HFENCE_GVMA_VMID_RCVD = 17, + + SBI_PMU_FW_HFENCE_VVMA_SENT = 18, + SBI_PMU_FW_HFENCE_VVMA_RCVD = 19, + SBI_PMU_FW_HFENCE_VVMA_ASID_SENT = 20, + SBI_PMU_FW_HFENCE_VVMA_ASID_RCVD = 21, + SBI_PMU_FW_MAX, +}; + +/* SBI PMU event types */ +enum sbi_pmu_event_type { + SBI_PMU_EVENT_TYPE_HW = 0x0, + SBI_PMU_EVENT_TYPE_CACHE = 0x1, + SBI_PMU_EVENT_TYPE_RAW = 0x2, + SBI_PMU_EVENT_TYPE_RAW_V2 = 0x3, + SBI_PMU_EVENT_TYPE_FW = 0xf, +}; + /* SBI PMU counter types */ enum sbi_pmu_ctr_type { SBI_PMU_CTR_TYPE_HW = 0x0, diff --git a/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c b/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c index 924a335d2262..cec1621ace23 100644 --- a/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c +++ b/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c @@ -436,6 +436,7 @@ static void test_pmu_basic_sanity(void) struct sbiret ret; int num_counters = 0, i; union sbi_pmu_ctr_info ctrinfo; + unsigned long fw_eidx; probe = guest_sbi_probe_extension(SBI_EXT_PMU, &out_val); GUEST_ASSERT(probe && out_val == 1); @@ -461,7 +462,24 @@ static void test_pmu_basic_sanity(void) pmu_csr_read_num(ctrinfo.csr); GUEST_ASSERT(illegal_handler_invoked); } else if (ctrinfo.type == SBI_PMU_CTR_TYPE_FW) { - read_fw_counter(i, ctrinfo); + /* Read without configure should fail */ + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ, + i, 0, 0, 0, 0, 0); + GUEST_ASSERT(ret.error == SBI_ERR_INVALID_PARAM); + + /* + * Try to configure with a common firmware event. + * If configuration succeeds, verify we can read it. 
+ */ + fw_eidx = ((unsigned long)SBI_PMU_EVENT_TYPE_FW << 16) | + SBI_PMU_FW_ACCESS_LOAD; + + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, + i, 1, 0, fw_eidx, 0, 0); + if (ret.error == 0) { + GUEST_ASSERT(ret.value == i); + read_fw_counter(i, ctrinfo); + } } } From 570428601ba506e76c265a65626524ef3c5cbc04 Mon Sep 17 00:00:00 2001 From: "Zenghui Yu (Huawei)" Date: Sat, 28 Mar 2026 13:31:55 +0800 Subject: [PATCH 198/373] KVM: arm64: ptdump: Initialize parser_state before pgtable walk If we go through the "need a bigger buffer" path in seq_read_iter(), which is likely to happen as we're dumping page tables, we will pass the populated-by-last-run st::parser_state to kvm_pgtable_walk()/kvm_ptdump_visitor(). As a result, the output of stage2_page_tables on my box looks like 0x0000000240000000-0x0000000000000000 17179869175G 1 0x0000000000000000-0x0000000000200000 2M 2 R px ux AF BLK 0x0000000000200000-0x0000000040000000 1022M 2 0x0000000040000000-0x0000000040200000 2M 2 R W PXNUXN AF BLK [...] Fix it by always initializing st::parser_state before starting a new pgtable walk. Besides, remove st::range as it's not used by note_page(); remove the explicit initialization of parser_state::start_address as it will be initialized in note_page() anyway. 
Signed-off-by: Zenghui Yu (Huawei) Link: https://patch.msgid.link/20260328053155.12219-1-zenghui.yu@linux.dev [maz: rebased on top of NV support] Signed-off-by: Marc Zyngier --- arch/arm64/kvm/ptdump.c | 20 +++++++------------- 1 file changed, 7 insertions(+), 13 deletions(-) diff --git a/arch/arm64/kvm/ptdump.c b/arch/arm64/kvm/ptdump.c index f5753ac37a05..c9140e22abcf 100644 --- a/arch/arm64/kvm/ptdump.c +++ b/arch/arm64/kvm/ptdump.c @@ -24,7 +24,6 @@ struct kvm_ptdump_guest_state { struct ptdump_pg_state parser_state; struct addr_marker ipa_marker[MARKERS_LEN]; struct ptdump_pg_level level[KVM_PGTABLE_MAX_LEVELS]; - struct ptdump_range range[MARKERS_LEN]; }; static const struct ptdump_prot_bits stage2_pte_bits[] = { @@ -132,17 +131,8 @@ static struct kvm_ptdump_guest_state *kvm_ptdump_parser_create(struct kvm_s2_mmu st->ipa_marker[0].name = "Guest IPA"; st->ipa_marker[1].start_address = BIT(pgtable->ia_bits); - st->range[0].end = BIT(pgtable->ia_bits); st->mmu = mmu; - st->parser_state = (struct ptdump_pg_state) { - .marker = &st->ipa_marker[0], - .level = -1, - .pg_level = &st->level[0], - .ptdump.range = &st->range[0], - .start_address = 0, - }; - return st; } @@ -152,14 +142,18 @@ static int kvm_ptdump_guest_show(struct seq_file *m, void *unused) struct kvm_ptdump_guest_state *st = m->private; struct kvm_s2_mmu *mmu = st->mmu; struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); - struct ptdump_pg_state *parser_state = &st->parser_state; struct kvm_pgtable_walker walker = (struct kvm_pgtable_walker) { .cb = kvm_ptdump_visitor, - .arg = parser_state, + .arg = &st->parser_state, .flags = KVM_PGTABLE_WALK_LEAF, }; - parser_state->seq = m; + st->parser_state = (struct ptdump_pg_state) { + .marker = &st->ipa_marker[0], + .level = -1, + .pg_level = &st->level[0], + .seq = m, + }; write_lock(&kvm->mmu_lock); ret = kvm_pgtable_walk(mmu->pgt, 0, BIT(mmu->pgt->ia_bits), &walker); From 0b236c72c02df36992b7d6ff29cbe5abef973250 Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 
Mar 2026 14:02:21 +0000 Subject: [PATCH 199/373] KVM: arm64: Introduce struct kvm_s2_fault to user_mem_abort() The user_mem_abort() function takes many arguments and defines a large number of local variables. Passing all these variables around to helper functions would result in functions with too many arguments. Introduce struct kvm_s2_fault to encapsulate the stage-2 fault state. This structure holds both the input parameters and the intermediate state required during the fault handling process. Update user_mem_abort() to initialize this structure and replace the usage of local variables with fields from the new structure. This prepares the ground for further extracting parts of user_mem_abort() into smaller helper functions that can simply take a pointer to the fault state structure. Reviewed-by: Joey Gouly Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 202 +++++++++++++++++++++++++------------------ 1 file changed, 118 insertions(+), 84 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index f8064b2d3204..ad97da3037fb 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1710,38 +1710,68 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, return vma_shift; } +struct kvm_s2_fault { + struct kvm_vcpu *vcpu; + phys_addr_t fault_ipa; + struct kvm_s2_trans *nested; + struct kvm_memory_slot *memslot; + unsigned long hva; + bool fault_is_perm; + + bool write_fault; + bool exec_fault; + bool writable; + bool topup_memcache; + bool mte_allowed; + bool is_vma_cacheable; + bool s2_force_noncacheable; + bool vfio_allow_any_uc; + unsigned long mmu_seq; + phys_addr_t ipa; + short vma_shift; + gfn_t gfn; + kvm_pfn_t pfn; + bool logging_active; + bool force_pte; + long vma_pagesize; + long fault_granule; + enum kvm_pgtable_prot prot; + struct page *page; + vm_flags_t vm_flags; +}; + static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_s2_trans *nested, struct 
kvm_memory_slot *memslot, unsigned long hva, bool fault_is_perm) { int ret = 0; - bool topup_memcache; - bool write_fault, writable; - bool exec_fault, mte_allowed, is_vma_cacheable; - bool s2_force_noncacheable = false, vfio_allow_any_uc = false; - unsigned long mmu_seq; - phys_addr_t ipa = fault_ipa; + struct kvm_s2_fault fault_data = { + .vcpu = vcpu, + .fault_ipa = fault_ipa, + .nested = nested, + .memslot = memslot, + .hva = hva, + .fault_is_perm = fault_is_perm, + .ipa = fault_ipa, + .logging_active = memslot_is_logging(memslot), + .force_pte = memslot_is_logging(memslot), + .s2_force_noncacheable = false, + .vfio_allow_any_uc = false, + .prot = KVM_PGTABLE_PROT_R, + }; + struct kvm_s2_fault *fault = &fault_data; struct kvm *kvm = vcpu->kvm; struct vm_area_struct *vma; - short vma_shift; void *memcache; - gfn_t gfn; - kvm_pfn_t pfn; - bool logging_active = memslot_is_logging(memslot); - bool force_pte = logging_active; - long vma_pagesize, fault_granule; - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R; struct kvm_pgtable *pgt; - struct page *page; - vm_flags_t vm_flags; enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; - if (fault_is_perm) - fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu); - write_fault = kvm_is_write_fault(vcpu); - exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu); - VM_WARN_ON_ONCE(write_fault && exec_fault); + if (fault->fault_is_perm) + fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu); + fault->write_fault = kvm_is_write_fault(fault->vcpu); + fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu); + VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault); /* * Permission faults just need to update the existing leaf entry, @@ -1749,8 +1779,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * only exception to this is when dirty logging is enabled at runtime * and a write fault needs to collapse a block entry into a table. 
*/ - topup_memcache = !fault_is_perm || (logging_active && write_fault); - ret = prepare_mmu_memcache(vcpu, topup_memcache, &memcache); + fault->topup_memcache = !fault->fault_is_perm || + (fault->logging_active && fault->write_fault); + ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache); if (ret) return ret; @@ -1759,33 +1790,33 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * get block mapping for device MMIO region. */ mmap_read_lock(current->mm); - vma = vma_lookup(current->mm, hva); + vma = vma_lookup(current->mm, fault->hva); if (unlikely(!vma)) { - kvm_err("Failed to find VMA for hva 0x%lx\n", hva); + kvm_err("Failed to find VMA for hva 0x%lx\n", fault->hva); mmap_read_unlock(current->mm); return -EFAULT; } - vma_shift = kvm_s2_resolve_vma_size(vma, hva, memslot, nested, - &force_pte, &ipa); - vma_pagesize = 1UL << vma_shift; + fault->vma_shift = kvm_s2_resolve_vma_size(vma, fault->hva, fault->memslot, fault->nested, + &fault->force_pte, &fault->ipa); + fault->vma_pagesize = 1UL << fault->vma_shift; /* * Both the canonical IPA and fault IPA must be aligned to the * mapping size to ensure we find the right PFN and lay down the * mapping in the right place.
*/ - fault_ipa = ALIGN_DOWN(fault_ipa, vma_pagesize); - ipa = ALIGN_DOWN(ipa, vma_pagesize); + fault->fault_ipa = ALIGN_DOWN(fault->fault_ipa, fault->vma_pagesize); + fault->ipa = ALIGN_DOWN(fault->ipa, fault->vma_pagesize); - gfn = ipa >> PAGE_SHIFT; - mte_allowed = kvm_vma_mte_allowed(vma); + fault->gfn = fault->ipa >> PAGE_SHIFT; + fault->mte_allowed = kvm_vma_mte_allowed(vma); - vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED; + fault->vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED; - vm_flags = vma->vm_flags; + fault->vm_flags = vma->vm_flags; - is_vma_cacheable = kvm_vma_is_cacheable(vma); + fault->is_vma_cacheable = kvm_vma_is_cacheable(vma); /* Don't use the VMA after the unlock -- it may have vanished */ vma = NULL; @@ -1798,24 +1829,25 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs * with the smp_wmb() in kvm_mmu_invalidate_end(). */ - mmu_seq = kvm->mmu_invalidate_seq; + fault->mmu_seq = kvm->mmu_invalidate_seq; mmap_read_unlock(current->mm); - pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0, - &writable, &page); - if (pfn == KVM_PFN_ERR_HWPOISON) { - kvm_send_hwpoison_signal(hva, vma_shift); + fault->pfn = __kvm_faultin_pfn(fault->memslot, fault->gfn, + fault->write_fault ? FOLL_WRITE : 0, + &fault->writable, &fault->page); + if (fault->pfn == KVM_PFN_ERR_HWPOISON) { + kvm_send_hwpoison_signal(fault->hva, fault->vma_shift); return 0; } - if (is_error_noslot_pfn(pfn)) + if (is_error_noslot_pfn(fault->pfn)) return -EFAULT; /* * Check if this is non-struct page memory PFN, and cannot support * CMOs. It could potentially be unsafe to access as cacheable. 
*/ - if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) { - if (is_vma_cacheable) { + if (fault->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(fault->pfn)) { + if (fault->is_vma_cacheable) { /* * Whilst the VMA owner expects cacheable mapping to this * PFN, hardware also has to support the FWB and CACHE DIC @@ -1841,17 +1873,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * In both cases, we don't let transparent_hugepage_adjust() * change things at the last minute. */ - s2_force_noncacheable = true; + fault->s2_force_noncacheable = true; } - } else if (logging_active && !write_fault) { + } else if (fault->logging_active && !fault->write_fault) { /* * Only actually map the page as writable if this was a write * fault. */ - writable = false; + fault->writable = false; } - if (exec_fault && s2_force_noncacheable) + if (fault->exec_fault && fault->s2_force_noncacheable) ret = -ENOEXEC; if (ret) @@ -1863,18 +1895,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * and trigger the exception here. Since the memslot is valid, inject * the fault back to the guest. 
*/ - if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(vcpu))) { - kvm_inject_dabt_excl_atomic(vcpu, kvm_vcpu_get_hfar(vcpu)); + if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(fault->vcpu))) { + kvm_inject_dabt_excl_atomic(fault->vcpu, kvm_vcpu_get_hfar(fault->vcpu)); ret = 1; goto out_put_page; } - if (nested) - adjust_nested_fault_perms(nested, &prot, &writable); + if (fault->nested) + adjust_nested_fault_perms(fault->nested, &fault->prot, &fault->writable); kvm_fault_lock(kvm); - pgt = vcpu->arch.hw_mmu->pgt; - if (mmu_invalidate_retry(kvm, mmu_seq)) { + pgt = fault->vcpu->arch.hw_mmu->pgt; + if (mmu_invalidate_retry(kvm, fault->mmu_seq)) { ret = -EAGAIN; goto out_unlock; } @@ -1883,78 +1915,80 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * If we are not forced to use page mapping, check if we are * backed by a THP and thus use block mapping if possible. */ - if (vma_pagesize == PAGE_SIZE && !(force_pte || s2_force_noncacheable)) { - if (fault_is_perm && fault_granule > PAGE_SIZE) - vma_pagesize = fault_granule; + if (fault->vma_pagesize == PAGE_SIZE && + !(fault->force_pte || fault->s2_force_noncacheable)) { + if (fault->fault_is_perm && fault->fault_granule > PAGE_SIZE) + fault->vma_pagesize = fault->fault_granule; else - vma_pagesize = transparent_hugepage_adjust(kvm, memslot, - hva, &pfn, - &fault_ipa); + fault->vma_pagesize = transparent_hugepage_adjust(kvm, fault->memslot, + fault->hva, &fault->pfn, + &fault->fault_ipa); - if (vma_pagesize < 0) { - ret = vma_pagesize; + if (fault->vma_pagesize < 0) { + ret = fault->vma_pagesize; goto out_unlock; } } - if (!fault_is_perm && !s2_force_noncacheable && kvm_has_mte(kvm)) { + if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) { /* Check the VMM hasn't introduced a new disallowed VMA */ - if (mte_allowed) { - sanitise_mte_tags(kvm, pfn, vma_pagesize); + if (fault->mte_allowed) { + sanitise_mte_tags(kvm, fault->pfn, fault->vma_pagesize); } else { ret = 
-EFAULT; goto out_unlock; } } - if (writable) - prot |= KVM_PGTABLE_PROT_W; + if (fault->writable) + fault->prot |= KVM_PGTABLE_PROT_W; - if (exec_fault) - prot |= KVM_PGTABLE_PROT_X; + if (fault->exec_fault) + fault->prot |= KVM_PGTABLE_PROT_X; - if (s2_force_noncacheable) { - if (vfio_allow_any_uc) - prot |= KVM_PGTABLE_PROT_NORMAL_NC; + if (fault->s2_force_noncacheable) { + if (fault->vfio_allow_any_uc) + fault->prot |= KVM_PGTABLE_PROT_NORMAL_NC; else - prot |= KVM_PGTABLE_PROT_DEVICE; + fault->prot |= KVM_PGTABLE_PROT_DEVICE; } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) { - prot |= KVM_PGTABLE_PROT_X; + fault->prot |= KVM_PGTABLE_PROT_X; } - if (nested) - adjust_nested_exec_perms(kvm, nested, &prot); + if (fault->nested) + adjust_nested_exec_perms(kvm, fault->nested, &fault->prot); /* * Under the premise of getting a FSC_PERM fault, we just need to relax * permissions only if vma_pagesize equals fault_granule. Otherwise, * kvm_pgtable_stage2_map() should be called to change block size. */ - if (fault_is_perm && vma_pagesize == fault_granule) { + if (fault->fault_is_perm && fault->vma_pagesize == fault->fault_granule) { /* * Drop the SW bits in favour of those stored in the * PTE, which will be preserved. 
*/ - prot &= ~KVM_NV_GUEST_MAP_SZ; - ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault_ipa, prot, flags); + fault->prot &= ~KVM_NV_GUEST_MAP_SZ; + ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault->fault_ipa, fault->prot, + flags); } else { - ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, vma_pagesize, - __pfn_to_phys(pfn), prot, - memcache, flags); + ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault->fault_ipa, fault->vma_pagesize, + __pfn_to_phys(fault->pfn), fault->prot, + memcache, flags); } out_unlock: - kvm_release_faultin_page(kvm, page, !!ret, writable); + kvm_release_faultin_page(kvm, fault->page, !!ret, fault->writable); kvm_fault_unlock(kvm); /* Mark the page dirty only if the fault is handled successfully */ - if (writable && !ret) - mark_page_dirty_in_slot(kvm, memslot, gfn); + if (fault->writable && !ret) + mark_page_dirty_in_slot(kvm, fault->memslot, fault->gfn); return ret != -EAGAIN ? ret : 0; out_put_page: - kvm_release_page_unused(page); + kvm_release_page_unused(fault->page); return ret; } From bae99813c6a9ce474cbb7b6553dc6e379b2f4375 Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:22 +0000 Subject: [PATCH 200/373] KVM: arm64: Extract PFN resolution in user_mem_abort() Extract the section of code responsible for pinning the physical page frame number (PFN) backing the faulting IPA into a new helper, kvm_s2_fault_pin_pfn(). This helper encapsulates the critical section where the mmap_read_lock is held, the VMA is looked up, the mmu invalidate sequence is sampled, and the PFN is ultimately resolved via __kvm_faultin_pfn(). It also handles the early exits for hardware poisoned pages and noslot PFNs. By isolating this region, we can begin to organize the state variables required for PFN resolution into the kvm_s2_fault struct, clearing out a significant amount of local variable clutter from user_mem_abort(). 
Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 105 ++++++++++++++++++++++++------------------- 1 file changed, 59 insertions(+), 46 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index ad97da3037fb..7ca704a62d62 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1740,55 +1740,11 @@ struct kvm_s2_fault { vm_flags_t vm_flags; }; -static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, - struct kvm_s2_trans *nested, - struct kvm_memory_slot *memslot, unsigned long hva, - bool fault_is_perm) +static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) { - int ret = 0; - struct kvm_s2_fault fault_data = { - .vcpu = vcpu, - .fault_ipa = fault_ipa, - .nested = nested, - .memslot = memslot, - .hva = hva, - .fault_is_perm = fault_is_perm, - .ipa = fault_ipa, - .logging_active = memslot_is_logging(memslot), - .force_pte = memslot_is_logging(memslot), - .s2_force_noncacheable = false, - .vfio_allow_any_uc = false, - .prot = KVM_PGTABLE_PROT_R, - }; - struct kvm_s2_fault *fault = &fault_data; - struct kvm *kvm = vcpu->kvm; struct vm_area_struct *vma; - void *memcache; - struct kvm_pgtable *pgt; - enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; + struct kvm *kvm = fault->vcpu->kvm; - if (fault->fault_is_perm) - fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu); - fault->write_fault = kvm_is_write_fault(fault->vcpu); - fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu); - VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault); - - /* - * Permission faults just need to update the existing leaf entry, - * and so normally don't require allocations from the memcache. The - * only exception to this is when dirty logging is enabled at runtime - * and a write fault needs to collapse a block entry into a table. 
- */ - fault->topup_memcache = !fault->fault_is_perm || - (fault->logging_active && fault->write_fault); - ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache); - if (ret) - return ret; - - /* - * Let's check if we will get back a huge page backed by hugetlbfs, or - * get block mapping for device MMIO region. - */ mmap_read_lock(current->mm); vma = vma_lookup(current->mm, fault->hva); if (unlikely(!vma)) { @@ -1842,6 +1798,63 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (is_error_noslot_pfn(fault->pfn)) return -EFAULT; + return 1; +} + +static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, + struct kvm_s2_trans *nested, + struct kvm_memory_slot *memslot, unsigned long hva, + bool fault_is_perm) +{ + int ret = 0; + struct kvm_s2_fault fault_data = { + .vcpu = vcpu, + .fault_ipa = fault_ipa, + .nested = nested, + .memslot = memslot, + .hva = hva, + .fault_is_perm = fault_is_perm, + .ipa = fault_ipa, + .logging_active = memslot_is_logging(memslot), + .force_pte = memslot_is_logging(memslot), + .s2_force_noncacheable = false, + .vfio_allow_any_uc = false, + .prot = KVM_PGTABLE_PROT_R, + }; + struct kvm_s2_fault *fault = &fault_data; + struct kvm *kvm = vcpu->kvm; + void *memcache; + struct kvm_pgtable *pgt; + enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; + + if (fault->fault_is_perm) + fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu); + fault->write_fault = kvm_is_write_fault(fault->vcpu); + fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu); + VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault); + + /* + * Permission faults just need to update the existing leaf entry, + * and so normally don't require allocations from the memcache. The + * only exception to this is when dirty logging is enabled at runtime + * and a write fault needs to collapse a block entry into a table. 
+ */ + fault->topup_memcache = !fault->fault_is_perm || + (fault->logging_active && fault->write_fault); + ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache); + if (ret) + return ret; + + /* + * Let's check if we will get back a huge page backed by hugetlbfs, or + * get block mapping for device MMIO region. + */ + ret = kvm_s2_fault_pin_pfn(fault); + if (ret != 1) + return ret; + + ret = 0; + /* * Check if this is non-struct page memory PFN, and cannot support * CMOs. It could potentially be unsafe to access as cacheable. From f5a5bb8de11863bd92f4188b7e823e3fca4d68e6 Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:23 +0000 Subject: [PATCH 201/373] KVM: arm64: Isolate mmap_read_lock inside new kvm_s2_fault_get_vma_info() helper Extract the VMA lookup and metadata snapshotting logic from kvm_s2_fault_pin_pfn() into a tightly-scoped sub-helper. This refactoring structurally eliminates a potential TOCTOU (time-of-check to time-of-use) and use-after-free hazard involving the vma pointer. In the previous layout, the mmap_read_lock is taken, the vma is looked up, and then the lock is dropped before the function continues to map the PFN. While an explicit vma = NULL safeguard was present, the vma variable was still lexically in scope for the remainder of the function. By isolating the locked region into kvm_s2_fault_get_vma_info(), the vma pointer becomes a local variable strictly confined to that sub-helper. Because the pointer's scope ends when the sub-helper returns, it is not possible for the subsequent page fault logic in kvm_s2_fault_pin_pfn() to accidentally access the vanished VMA, eliminating this bug class by design.
Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 7ca704a62d62..9fe2e31a8601 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1740,7 +1740,7 @@ struct kvm_s2_fault { vm_flags_t vm_flags; }; -static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) +static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault) { struct vm_area_struct *vma; struct kvm *kvm = fault->vcpu->kvm; @@ -1774,9 +1774,6 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) fault->is_vma_cacheable = kvm_vma_is_cacheable(vma); - /* Don't use the VMA after the unlock -- it may have vanished */ - vma = NULL; - /* * Read mmu_invalidate_seq so that KVM can detect if the results of * vma_lookup() or __kvm_faultin_pfn() become stale prior to @@ -1788,6 +1785,17 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) fault->mmu_seq = kvm->mmu_invalidate_seq; mmap_read_unlock(current->mm); + return 0; +} + +static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) +{ + int ret; + + ret = kvm_s2_fault_get_vma_info(fault); + if (ret) + return ret; + fault->pfn = __kvm_faultin_pfn(fault->memslot, fault->gfn, fault->write_fault ? FOLL_WRITE : 0, &fault->writable, &fault->page); From a6e11bd6e1bd9ea9a42738c5a6ac12881b5fcb36 Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:24 +0000 Subject: [PATCH 202/373] KVM: arm64: Extract stage-2 permission logic in user_mem_abort() Extract the logic that computes the stage-2 protections and checks for various permission faults (e.g., execution faults on non-cacheable memory) into a new helper function, kvm_s2_fault_compute_prot(). This helper also handles injecting atomic/exclusive faults back into the guest when necessary. This refactoring step separates the permission computation from the mapping logic, making the main fault handler flow clearer. 
Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 163 +++++++++++++++++++++++-------------------- 1 file changed, 87 insertions(+), 76 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 9fe2e31a8601..8a606103a44b 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1809,6 +1809,89 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) return 1; } +static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault) +{ + struct kvm *kvm = fault->vcpu->kvm; + + /* + * Check if this is non-struct page memory PFN, and cannot support + * CMOs. It could potentially be unsafe to access as cacheable. + */ + if (fault->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(fault->pfn)) { + if (fault->is_vma_cacheable) { + /* + * Whilst the VMA owner expects cacheable mapping to this + * PFN, hardware also has to support the FWB and CACHE DIC + * features. + * + * ARM64 KVM relies on kernel VA mapping to the PFN to + * perform cache maintenance as the CMO instructions work on + * virtual addresses. VM_PFNMAP region are not necessarily + * mapped to a KVA and hence the presence of hardware features + * S2FWB and CACHE DIC are mandatory to avoid the need for + * cache maintenance. + */ + if (!kvm_supports_cacheable_pfnmap()) + return -EFAULT; + } else { + /* + * If the page was identified as device early by looking at + * the VMA flags, vma_pagesize is already representing the + * largest quantity we can map. If instead it was mapped + * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE + * and must not be upgraded. + * + * In both cases, we don't let transparent_hugepage_adjust() + * change things at the last minute. + */ + fault->s2_force_noncacheable = true; + } + } else if (fault->logging_active && !fault->write_fault) { + /* + * Only actually map the page as writable if this was a write + * fault. 
+ */ + fault->writable = false; + } + + if (fault->exec_fault && fault->s2_force_noncacheable) + return -ENOEXEC; + + /* + * Guest performs atomic/exclusive operations on memory with unsupported + * attributes (e.g. ld64b/st64b on normal memory when no FEAT_LS64WB) + * and trigger the exception here. Since the memslot is valid, inject + * the fault back to the guest. + */ + if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(fault->vcpu))) { + kvm_inject_dabt_excl_atomic(fault->vcpu, kvm_vcpu_get_hfar(fault->vcpu)); + return 1; + } + + if (fault->nested) + adjust_nested_fault_perms(fault->nested, &fault->prot, &fault->writable); + + if (fault->writable) + fault->prot |= KVM_PGTABLE_PROT_W; + + if (fault->exec_fault) + fault->prot |= KVM_PGTABLE_PROT_X; + + if (fault->s2_force_noncacheable) { + if (fault->vfio_allow_any_uc) + fault->prot |= KVM_PGTABLE_PROT_NORMAL_NC; + else + fault->prot |= KVM_PGTABLE_PROT_DEVICE; + } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) { + fault->prot |= KVM_PGTABLE_PROT_X; + } + + if (fault->nested) + adjust_nested_exec_perms(kvm, fault->nested, &fault->prot); + + return 0; +} + static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_s2_trans *nested, struct kvm_memory_slot *memslot, unsigned long hva, @@ -1863,68 +1946,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, ret = 0; - /* - * Check if this is non-struct page memory PFN, and cannot support - * CMOs. It could potentially be unsafe to access as cacheable. - */ - if (fault->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(fault->pfn)) { - if (fault->is_vma_cacheable) { - /* - * Whilst the VMA owner expects cacheable mapping to this - * PFN, hardware also has to support the FWB and CACHE DIC - * features. - * - * ARM64 KVM relies on kernel VA mapping to the PFN to - * perform cache maintenance as the CMO instructions work on - * virtual addresses. 
VM_PFNMAP region are not necessarily - * mapped to a KVA and hence the presence of hardware features - * S2FWB and CACHE DIC are mandatory to avoid the need for - * cache maintenance. - */ - if (!kvm_supports_cacheable_pfnmap()) - ret = -EFAULT; - } else { - /* - * If the page was identified as device early by looking at - * the VMA flags, vma_pagesize is already representing the - * largest quantity we can map. If instead it was mapped - * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE - * and must not be upgraded. - * - * In both cases, we don't let transparent_hugepage_adjust() - * change things at the last minute. - */ - fault->s2_force_noncacheable = true; - } - } else if (fault->logging_active && !fault->write_fault) { - /* - * Only actually map the page as writable if this was a write - * fault. - */ - fault->writable = false; + ret = kvm_s2_fault_compute_prot(fault); + if (ret == 1) { + ret = 1; /* fault injected */ + goto out_put_page; } - - if (fault->exec_fault && fault->s2_force_noncacheable) - ret = -ENOEXEC; - if (ret) goto out_put_page; - /* - * Guest performs atomic/exclusive operations on memory with unsupported - * attributes (e.g. ld64b/st64b on normal memory when no FEAT_LS64WB) - * and trigger the exception here. Since the memslot is valid, inject - * the fault back to the guest. 
- */ - if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(fault->vcpu))) { - kvm_inject_dabt_excl_atomic(fault->vcpu, kvm_vcpu_get_hfar(fault->vcpu)); - ret = 1; - goto out_put_page; - } - - if (fault->nested) - adjust_nested_fault_perms(fault->nested, &fault->prot, &fault->writable); - kvm_fault_lock(kvm); pgt = fault->vcpu->arch.hw_mmu->pgt; if (mmu_invalidate_retry(kvm, fault->mmu_seq)) { @@ -1961,24 +1990,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, } } - if (fault->writable) - fault->prot |= KVM_PGTABLE_PROT_W; - - if (fault->exec_fault) - fault->prot |= KVM_PGTABLE_PROT_X; - - if (fault->s2_force_noncacheable) { - if (fault->vfio_allow_any_uc) - fault->prot |= KVM_PGTABLE_PROT_NORMAL_NC; - else - fault->prot |= KVM_PGTABLE_PROT_DEVICE; - } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) { - fault->prot |= KVM_PGTABLE_PROT_X; - } - - if (fault->nested) - adjust_nested_exec_perms(kvm, fault->nested, &fault->prot); - /* * Under the premise of getting a FSC_PERM fault, we just need to relax * permissions only if vma_pagesize equals fault_granule. Otherwise, From 5557a3f843bcef3de9a1237020348b2859812170 Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:25 +0000 Subject: [PATCH 203/373] KVM: arm64: Extract page table mapping in user_mem_abort() Extract the code responsible for locking the KVM MMU and mapping the PFN into the stage-2 page tables into a new helper, kvm_s2_fault_map(). This helper manages the kvm_fault_lock, checks for MMU invalidation retries, attempts to adjust for transparent huge pages (THP), handles MTE sanitization if needed, and finally maps or relaxes permissions on the stage-2 entries. With this change, the main user_mem_abort() function is now a sequential dispatcher that delegates to specialized helper functions. 
Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 128 +++++++++++++++++++++++-------------------- 1 file changed, 68 insertions(+), 60 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 8a606103a44b..478fceeeb44b 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1892,68 +1892,13 @@ static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault) return 0; } -static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, - struct kvm_s2_trans *nested, - struct kvm_memory_slot *memslot, unsigned long hva, - bool fault_is_perm) +static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache) { - int ret = 0; - struct kvm_s2_fault fault_data = { - .vcpu = vcpu, - .fault_ipa = fault_ipa, - .nested = nested, - .memslot = memslot, - .hva = hva, - .fault_is_perm = fault_is_perm, - .ipa = fault_ipa, - .logging_active = memslot_is_logging(memslot), - .force_pte = memslot_is_logging(memslot), - .s2_force_noncacheable = false, - .vfio_allow_any_uc = false, - .prot = KVM_PGTABLE_PROT_R, - }; - struct kvm_s2_fault *fault = &fault_data; - struct kvm *kvm = vcpu->kvm; - void *memcache; + struct kvm *kvm = fault->vcpu->kvm; struct kvm_pgtable *pgt; + int ret; enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; - if (fault->fault_is_perm) - fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu); - fault->write_fault = kvm_is_write_fault(fault->vcpu); - fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu); - VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault); - - /* - * Permission faults just need to update the existing leaf entry, - * and so normally don't require allocations from the memcache. The - * only exception to this is when dirty logging is enabled at runtime - * and a write fault needs to collapse a block entry into a table. 
- */ - fault->topup_memcache = !fault->fault_is_perm || - (fault->logging_active && fault->write_fault); - ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache); - if (ret) - return ret; - - /* - * Let's check if we will get back a huge page backed by hugetlbfs, or - * get block mapping for device MMIO region. - */ - ret = kvm_s2_fault_pin_pfn(fault); - if (ret != 1) - return ret; - - ret = 0; - - ret = kvm_s2_fault_compute_prot(fault); - if (ret == 1) { - ret = 1; /* fault injected */ - goto out_put_page; - } - if (ret) - goto out_put_page; - kvm_fault_lock(kvm); pgt = fault->vcpu->arch.hw_mmu->pgt; if (mmu_invalidate_retry(kvm, fault->mmu_seq)) { @@ -2001,8 +1946,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * PTE, which will be preserved. */ fault->prot &= ~KVM_NV_GUEST_MAP_SZ; - ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault->fault_ipa, fault->prot, - flags); + ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault->fault_ipa, + fault->prot, flags); } else { ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault->fault_ipa, fault->vma_pagesize, __pfn_to_phys(fault->pfn), fault->prot, @@ -2018,6 +1963,69 @@ out_unlock: mark_page_dirty_in_slot(kvm, fault->memslot, fault->gfn); return ret != -EAGAIN ? 
ret : 0; +} + +static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, + struct kvm_s2_trans *nested, + struct kvm_memory_slot *memslot, unsigned long hva, + bool fault_is_perm) +{ + int ret = 0; + struct kvm_s2_fault fault_data = { + .vcpu = vcpu, + .fault_ipa = fault_ipa, + .nested = nested, + .memslot = memslot, + .hva = hva, + .fault_is_perm = fault_is_perm, + .ipa = fault_ipa, + .logging_active = memslot_is_logging(memslot), + .force_pte = memslot_is_logging(memslot), + .s2_force_noncacheable = false, + .vfio_allow_any_uc = false, + .prot = KVM_PGTABLE_PROT_R, + }; + struct kvm_s2_fault *fault = &fault_data; + void *memcache; + + if (fault->fault_is_perm) + fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu); + fault->write_fault = kvm_is_write_fault(fault->vcpu); + fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu); + VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault); + + /* + * Permission faults just need to update the existing leaf entry, + * and so normally don't require allocations from the memcache. The + * only exception to this is when dirty logging is enabled at runtime + * and a write fault needs to collapse a block entry into a table. + */ + fault->topup_memcache = !fault->fault_is_perm || + (fault->logging_active && fault->write_fault); + ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache); + if (ret) + return ret; + + /* + * Let's check if we will get back a huge page backed by hugetlbfs, or + * get block mapping for device MMIO region. 
+ */ + ret = kvm_s2_fault_pin_pfn(fault); + if (ret != 1) + return ret; + + ret = 0; + + ret = kvm_s2_fault_compute_prot(fault); + if (ret == 1) { + ret = 1; /* fault injected */ + goto out_put_page; + } + if (ret) + goto out_put_page; + + ret = kvm_s2_fault_map(fault, memcache); + return ret; out_put_page: kvm_release_page_unused(fault->page); From 2175ca5384ba9f3d1f45745522cdeb5865488400 Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:26 +0000 Subject: [PATCH 204/373] KVM: arm64: Simplify nested VMA shift calculation In the kvm_s2_resolve_vma_size() helper, the local variable vma_pagesize is calculated from vma_shift, only to be used to bound the vma_pagesize by max_map_size and subsequently convert it back to a shift via __ffs(). Because vma_pagesize and max_map_size are both powers of two, we can simplify the logic by omitting vma_pagesize entirely and bounding the vma_shift directly using the shift of max_map_size. This achieves the same result while keeping the size-to-shift conversion out of the helper logic. 
Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 478fceeeb44b..1d423bc29e6c 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1646,7 +1646,6 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, bool *force_pte, phys_addr_t *ipa) { short vma_shift; - long vma_pagesize; if (*force_pte) vma_shift = PAGE_SHIFT; @@ -1677,8 +1676,6 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, WARN_ONCE(1, "Unknown vma_shift %d", vma_shift); } - vma_pagesize = 1UL << vma_shift; - if (nested) { unsigned long max_map_size; @@ -1703,8 +1700,7 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, max_map_size = PAGE_SIZE; *force_pte = (max_map_size == PAGE_SIZE); - vma_pagesize = min_t(long, vma_pagesize, max_map_size); - vma_shift = __ffs(vma_pagesize); + vma_shift = min_t(short, vma_shift, __ffs(max_map_size)); } return vma_shift; From 9a57bc1b3c6b1f583f43acf7719d66e6b30ef2a9 Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:27 +0000 Subject: [PATCH 205/373] KVM: arm64: Remove redundant state variables from struct kvm_s2_fault Remove redundant variables vma_shift and vfio_allow_any_uc from struct kvm_s2_fault as they are easily derived or checked when needed. 
Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 1d423bc29e6c..ee653dd4eb27 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1721,10 +1721,8 @@ struct kvm_s2_fault { bool mte_allowed; bool is_vma_cacheable; bool s2_force_noncacheable; - bool vfio_allow_any_uc; unsigned long mmu_seq; phys_addr_t ipa; - short vma_shift; gfn_t gfn; kvm_pfn_t pfn; bool logging_active; @@ -1749,9 +1747,9 @@ static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault) return -EFAULT; } - fault->vma_shift = kvm_s2_resolve_vma_size(vma, fault->hva, fault->memslot, fault->nested, - &fault->force_pte, &fault->ipa); - fault->vma_pagesize = 1UL << fault->vma_shift; + fault->vma_pagesize = 1UL << kvm_s2_resolve_vma_size(vma, fault->hva, fault->memslot, + fault->nested, &fault->force_pte, + &fault->ipa); /* * Both the canonical IPA and fault IPA must be aligned to the @@ -1764,8 +1762,6 @@ static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault) fault->gfn = fault->ipa >> PAGE_SHIFT; fault->mte_allowed = kvm_vma_mte_allowed(vma); - fault->vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED; - fault->vm_flags = vma->vm_flags; fault->is_vma_cacheable = kvm_vma_is_cacheable(vma); @@ -1796,7 +1792,7 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) fault->write_fault ? 
FOLL_WRITE : 0, &fault->writable, &fault->page); if (fault->pfn == KVM_PFN_ERR_HWPOISON) { - kvm_send_hwpoison_signal(fault->hva, fault->vma_shift); + kvm_send_hwpoison_signal(fault->hva, __ffs(fault->vma_pagesize)); return 0; } if (is_error_noslot_pfn(fault->pfn)) @@ -1874,7 +1870,7 @@ static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault) fault->prot |= KVM_PGTABLE_PROT_X; if (fault->s2_force_noncacheable) { - if (fault->vfio_allow_any_uc) + if (fault->vm_flags & VM_ALLOW_ANY_UNCACHED) fault->prot |= KVM_PGTABLE_PROT_NORMAL_NC; else fault->prot |= KVM_PGTABLE_PROT_DEVICE; @@ -1978,7 +1974,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, .logging_active = memslot_is_logging(memslot), .force_pte = memslot_is_logging(memslot), .s2_force_noncacheable = false, - .vfio_allow_any_uc = false, .prot = KVM_PGTABLE_PROT_R, }; struct kvm_s2_fault *fault = &fault_data; From 3825373b91b0fbedc65205a59f95379aaf596aad Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:28 +0000 Subject: [PATCH 206/373] KVM: arm64: Simplify return logic in user_mem_abort() With the refactoring done, the final return block of user_mem_abort() can be tidied up a bit more. Clean up the trailing edge by dropping the unnecessary assignment, collapsing the return evaluation for kvm_s2_fault_compute_prot(), and tail calling kvm_s2_fault_map() directly. 
Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 17 ++++------------- 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index ee653dd4eb27..c5ac9bd87b99 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -2005,22 +2005,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (ret != 1) return ret; - ret = 0; - ret = kvm_s2_fault_compute_prot(fault); - if (ret == 1) { - ret = 1; /* fault injected */ - goto out_put_page; + if (ret) { + kvm_release_page_unused(fault->page); + return ret; } - if (ret) - goto out_put_page; - ret = kvm_s2_fault_map(fault, memcache); - return ret; - -out_put_page: - kvm_release_page_unused(fault->page); - return ret; + return kvm_s2_fault_map(fault, memcache); } /* Resolve the access fault by making the page young again. */ From 975bad4bb21e4eea7ecc0f32f4cbbb91a9c8d48f Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:29 +0000 Subject: [PATCH 207/373] KVM: arm64: Initialize struct kvm_s2_fault completely at declaration Simplify the initialization of struct kvm_s2_fault in user_mem_abort(). Instead of partially initializing the struct via designated initializers and then sequentially assigning the remaining fields (like write_fault and topup_memcache) further down the function, evaluate those dependencies upfront. This allows the entire struct to be fully initialized at declaration. It also eliminates the need for the intermediate fault_data variable and its associated fault pointer, reducing boilerplate. 
Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 34 ++++++++++++++++------------------ 1 file changed, 16 insertions(+), 18 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index c5ac9bd87b99..fe942be061cb 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1962,8 +1962,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_memory_slot *memslot, unsigned long hva, bool fault_is_perm) { - int ret = 0; - struct kvm_s2_fault fault_data = { + bool write_fault = kvm_is_write_fault(vcpu); + bool logging_active = memslot_is_logging(memslot); + struct kvm_s2_fault fault = { .vcpu = vcpu, .fault_ipa = fault_ipa, .nested = nested, @@ -1971,19 +1972,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, .hva = hva, .fault_is_perm = fault_is_perm, .ipa = fault_ipa, - .logging_active = memslot_is_logging(memslot), - .force_pte = memslot_is_logging(memslot), - .s2_force_noncacheable = false, + .logging_active = logging_active, + .force_pte = logging_active, .prot = KVM_PGTABLE_PROT_R, + .fault_granule = fault_is_perm ? 
kvm_vcpu_trap_get_perm_fault_granule(vcpu) : 0, + .write_fault = write_fault, + .exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu), + .topup_memcache = !fault_is_perm || (logging_active && write_fault), }; - struct kvm_s2_fault *fault = &fault_data; void *memcache; + int ret; - if (fault->fault_is_perm) - fault->fault_granule = kvm_vcpu_trap_get_perm_fault_granule(fault->vcpu); - fault->write_fault = kvm_is_write_fault(fault->vcpu); - fault->exec_fault = kvm_vcpu_trap_is_exec_fault(fault->vcpu); - VM_WARN_ON_ONCE(fault->write_fault && fault->exec_fault); + VM_WARN_ON_ONCE(fault.write_fault && fault.exec_fault); /* * Permission faults just need to update the existing leaf entry, @@ -1991,9 +1991,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * only exception to this is when dirty logging is enabled at runtime * and a write fault needs to collapse a block entry into a table. */ - fault->topup_memcache = !fault->fault_is_perm || - (fault->logging_active && fault->write_fault); - ret = prepare_mmu_memcache(fault->vcpu, fault->topup_memcache, &memcache); + ret = prepare_mmu_memcache(vcpu, fault.topup_memcache, &memcache); if (ret) return ret; @@ -2001,17 +1999,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * Let's check if we will get back a huge page backed by hugetlbfs, or * get block mapping for device MMIO region. */ - ret = kvm_s2_fault_pin_pfn(fault); + ret = kvm_s2_fault_pin_pfn(&fault); if (ret != 1) return ret; - ret = kvm_s2_fault_compute_prot(fault); + ret = kvm_s2_fault_compute_prot(&fault); if (ret) { - kvm_release_page_unused(fault->page); + kvm_release_page_unused(fault.page); return ret; } - return kvm_s2_fault_map(fault, memcache); + return kvm_s2_fault_map(&fault, memcache); } /* Resolve the access fault by making the page young again. 
*/ From 3456943e8786d56aa2bd3f99e9ad2d735a9879c8 Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:30 +0000 Subject: [PATCH 208/373] KVM: arm64: Optimize early exit checks in kvm_s2_fault_pin_pfn() Optimize the early exit checks in kvm_s2_fault_pin_pfn by grouping all error responses under the generic is_error_noslot_pfn check first, avoiding unnecessary branches in the hot path. Reviewed-by: Joey Gouly Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index fe942be061cb..74c512839e81 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1791,12 +1791,13 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) fault->pfn = __kvm_faultin_pfn(fault->memslot, fault->gfn, fault->write_fault ? FOLL_WRITE : 0, &fault->writable, &fault->page); - if (fault->pfn == KVM_PFN_ERR_HWPOISON) { - kvm_send_hwpoison_signal(fault->hva, __ffs(fault->vma_pagesize)); - return 0; - } - if (is_error_noslot_pfn(fault->pfn)) + if (unlikely(is_error_noslot_pfn(fault->pfn))) { + if (fault->pfn == KVM_PFN_ERR_HWPOISON) { + kvm_send_hwpoison_signal(fault->hva, __ffs(fault->vma_pagesize)); + return 0; + } return -EFAULT; + } return 1; } From 84699747aa554197f6d5b4f4c9d1bcb6cb28e21f Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:31 +0000 Subject: [PATCH 209/373] KVM: arm64: Hoist MTE validation check out of MMU lock path Simplify the non-cacheable attributes assignment by using a ternary operator. Additionally, hoist the MTE validation check (mte_allowed) out of kvm_s2_fault_map() and into kvm_s2_fault_compute_prot(). This allows us to fail faster and avoid acquiring the KVM MMU lock unnecessarily when the VMM introduces a disallowed VMA for an MTE-enabled guest. 
Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 28 ++++++++++++---------------- 1 file changed, 12 insertions(+), 16 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 74c512839e81..871143a2b2f7 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1870,18 +1870,21 @@ static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault) if (fault->exec_fault) fault->prot |= KVM_PGTABLE_PROT_X; - if (fault->s2_force_noncacheable) { - if (fault->vm_flags & VM_ALLOW_ANY_UNCACHED) - fault->prot |= KVM_PGTABLE_PROT_NORMAL_NC; - else - fault->prot |= KVM_PGTABLE_PROT_DEVICE; - } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) { + if (fault->s2_force_noncacheable) + fault->prot |= (fault->vm_flags & VM_ALLOW_ANY_UNCACHED) ? + KVM_PGTABLE_PROT_NORMAL_NC : KVM_PGTABLE_PROT_DEVICE; + else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) fault->prot |= KVM_PGTABLE_PROT_X; - } if (fault->nested) adjust_nested_exec_perms(kvm, fault->nested, &fault->prot); + if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) { + /* Check the VMM hasn't introduced a new disallowed VMA */ + if (!fault->mte_allowed) + return -EFAULT; + } + return 0; } @@ -1918,15 +1921,8 @@ static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache) } } - if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) { - /* Check the VMM hasn't introduced a new disallowed VMA */ - if (fault->mte_allowed) { - sanitise_mte_tags(kvm, fault->pfn, fault->vma_pagesize); - } else { - ret = -EFAULT; - goto out_unlock; - } - } + if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) + sanitise_mte_tags(kvm, fault->pfn, fault->vma_pagesize); /* * Under the premise of getting a FSC_PERM fault, we just need to relax From 11f8f1b8a97b63b180f7aa021ec9abdead283025 Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Fri, 6 Mar 2026 14:02:32 +0000 Subject: [PATCH 210/373] KVM: arm64: Clean 
up control flow in kvm_s2_fault_map() Clean up the KVM MMU lock retry loop by pre-assigning the error code. Add clear braces to the THP adjustment integration for readability, and safely unnest the transparent hugepage logic branches. Signed-off-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 871143a2b2f7..719521904ef5 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1897,10 +1897,9 @@ static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache) kvm_fault_lock(kvm); pgt = fault->vcpu->arch.hw_mmu->pgt; - if (mmu_invalidate_retry(kvm, fault->mmu_seq)) { - ret = -EAGAIN; + ret = -EAGAIN; + if (mmu_invalidate_retry(kvm, fault->mmu_seq)) goto out_unlock; - } /* * If we are not forced to use page mapping, check if we are @@ -1908,16 +1907,17 @@ static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache) */ if (fault->vma_pagesize == PAGE_SIZE && !(fault->force_pte || fault->s2_force_noncacheable)) { - if (fault->fault_is_perm && fault->fault_granule > PAGE_SIZE) + if (fault->fault_is_perm && fault->fault_granule > PAGE_SIZE) { fault->vma_pagesize = fault->fault_granule; - else + } else { fault->vma_pagesize = transparent_hugepage_adjust(kvm, fault->memslot, fault->hva, &fault->pfn, &fault->fault_ipa); - if (fault->vma_pagesize < 0) { - ret = fault->vma_pagesize; - goto out_unlock; + if (fault->vma_pagesize < 0) { + ret = fault->vma_pagesize; + goto out_unlock; + } } } @@ -1951,7 +1951,9 @@ out_unlock: if (fault->writable && !ret) mark_page_dirty_in_slot(kvm, fault->memslot, fault->gfn); - return ret != -EAGAIN ? 
ret : 0; + if (ret != -EAGAIN) + return ret; + return 0; } static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, From 5729560e2c3cb67a22a1d72688e0bb8e96798313 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sat, 7 Mar 2026 11:39:49 +0000 Subject: [PATCH 211/373] KVM: arm64: Kill fault->ipa fault->ipa, in a nested context, represents the output of the guest's S2 translation for the fault->fault_ipa input, and is equal to fault->fault_ipa otherwise. Given that this is readily available from kvm_s2_trans_output(), drop fault->ipa and directly compute fault->gfn instead, which is really what we want. Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 719521904ef5..371ee0a836cf 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1643,7 +1643,7 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, unsigned long hva, struct kvm_memory_slot *memslot, struct kvm_s2_trans *nested, - bool *force_pte, phys_addr_t *ipa) + bool *force_pte) { short vma_shift; @@ -1681,8 +1681,6 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, max_map_size = *force_pte ?
PAGE_SIZE : PUD_SIZE; - *ipa = kvm_s2_trans_output(nested); - /* * If we're about to create a shadow stage 2 entry, then we * can only create a block mapping if the guest stage 2 page @@ -1722,7 +1720,6 @@ struct kvm_s2_fault { bool is_vma_cacheable; bool s2_force_noncacheable; unsigned long mmu_seq; - phys_addr_t ipa; gfn_t gfn; kvm_pfn_t pfn; bool logging_active; @@ -1738,6 +1735,7 @@ static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault) { struct vm_area_struct *vma; struct kvm *kvm = fault->vcpu->kvm; + phys_addr_t ipa; mmap_read_lock(current->mm); vma = vma_lookup(current->mm, fault->hva); @@ -1748,8 +1746,7 @@ static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault) } fault->vma_pagesize = 1UL << kvm_s2_resolve_vma_size(vma, fault->hva, fault->memslot, - fault->nested, &fault->force_pte, - &fault->ipa); + fault->nested, &fault->force_pte); /* * Both the canonical IPA and fault IPA must be aligned to the @@ -1757,9 +1754,9 @@ static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault) * mapping in the right place. */ fault->fault_ipa = ALIGN_DOWN(fault->fault_ipa, fault->vma_pagesize); - fault->ipa = ALIGN_DOWN(fault->ipa, fault->vma_pagesize); + ipa = fault->nested ? 
kvm_s2_trans_output(fault->nested) : fault->fault_ipa; + fault->gfn = ALIGN_DOWN(ipa, fault->vma_pagesize) >> PAGE_SHIFT; - fault->gfn = fault->ipa >> PAGE_SHIFT; fault->mte_allowed = kvm_vma_mte_allowed(vma); fault->vm_flags = vma->vm_flags; @@ -1970,7 +1967,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, .memslot = memslot, .hva = hva, .fault_is_perm = fault_is_perm, - .ipa = fault_ipa, .logging_active = logging_active, .force_pte = logging_active, .prot = KVM_PGTABLE_PROT_R, From f583a53c2b8a4bd77f090bb76512eb87bc80f2c4 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sun, 8 Mar 2026 12:18:29 +0000 Subject: [PATCH 212/373] KVM: arm64: Make fault_ipa immutable Updating fault_ipa is conceptually annoying, as it changes something that is a property of the fault itself. Stop doing so and instead use fault->gfn as the sole piece of state that can be used to represent the faulting IPA. At the same time, introduce get_canonical_gfn() for the couple of case we're we are concerned with the memslot-related IPA and not the faulting one. Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 38 ++++++++++++++++++++++++++------------ 1 file changed, 26 insertions(+), 12 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 371ee0a836cf..9ceecba992d8 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1400,10 +1400,10 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot, */ static long transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot, - unsigned long hva, kvm_pfn_t *pfnp, - phys_addr_t *ipap) + unsigned long hva, kvm_pfn_t *pfnp, gfn_t *gfnp) { kvm_pfn_t pfn = *pfnp; + gfn_t gfn = *gfnp; /* * Make sure the adjustment is done only for THP pages. 
Also make @@ -1419,7 +1419,8 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot, if (sz < PMD_SIZE) return PAGE_SIZE; - *ipap &= PMD_MASK; + gfn &= ~(PTRS_PER_PMD - 1); + *gfnp = gfn; pfn &= ~(PTRS_PER_PMD - 1); *pfnp = pfn; @@ -1735,7 +1736,6 @@ static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault) { struct vm_area_struct *vma; struct kvm *kvm = fault->vcpu->kvm; - phys_addr_t ipa; mmap_read_lock(current->mm); vma = vma_lookup(current->mm, fault->hva); @@ -1753,9 +1753,7 @@ static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault) * mapping size to ensure we find the right PFN and lay down the * mapping in the right place. */ - fault->fault_ipa = ALIGN_DOWN(fault->fault_ipa, fault->vma_pagesize); - ipa = fault->nested ? kvm_s2_trans_output(fault->nested) : fault->fault_ipa; - fault->gfn = ALIGN_DOWN(ipa, fault->vma_pagesize) >> PAGE_SHIFT; + fault->gfn = ALIGN_DOWN(fault->fault_ipa, fault->vma_pagesize) >> PAGE_SHIFT; fault->mte_allowed = kvm_vma_mte_allowed(vma); @@ -1777,6 +1775,17 @@ static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault) return 0; } +static gfn_t get_canonical_gfn(struct kvm_s2_fault *fault) +{ + phys_addr_t ipa; + + if (!fault->nested) + return fault->gfn; + + ipa = kvm_s2_trans_output(fault->nested); + return ALIGN_DOWN(ipa, fault->vma_pagesize) >> PAGE_SHIFT; +} + static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) { int ret; @@ -1785,7 +1794,7 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) if (ret) return ret; - fault->pfn = __kvm_faultin_pfn(fault->memslot, fault->gfn, + fault->pfn = __kvm_faultin_pfn(fault->memslot, get_canonical_gfn(fault), fault->write_fault ? 
FOLL_WRITE : 0, &fault->writable, &fault->page); if (unlikely(is_error_noslot_pfn(fault->pfn))) { @@ -1885,6 +1894,11 @@ static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault) return 0; } +static phys_addr_t get_ipa(const struct kvm_s2_fault *fault) +{ + return gfn_to_gpa(fault->gfn); +} + static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache) { struct kvm *kvm = fault->vcpu->kvm; @@ -1909,7 +1923,7 @@ static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache) } else { fault->vma_pagesize = transparent_hugepage_adjust(kvm, fault->memslot, fault->hva, &fault->pfn, - &fault->fault_ipa); + &fault->gfn); if (fault->vma_pagesize < 0) { ret = fault->vma_pagesize; @@ -1932,10 +1946,10 @@ static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache) * PTE, which will be preserved. */ fault->prot &= ~KVM_NV_GUEST_MAP_SZ; - ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault->fault_ipa, + ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, get_ipa(fault), fault->prot, flags); } else { - ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault->fault_ipa, fault->vma_pagesize, + ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, get_ipa(fault), fault->vma_pagesize, __pfn_to_phys(fault->pfn), fault->prot, memcache, flags); } @@ -1946,7 +1960,7 @@ out_unlock: /* Mark the page dirty only if the fault is handled successfully */ if (fault->writable && !ret) - mark_page_dirty_in_slot(kvm, fault->memslot, fault->gfn); + mark_page_dirty_in_slot(kvm, fault->memslot, get_canonical_gfn(fault)); if (ret != -EAGAIN) return ret; return 0; } From c6f4d84643498bc855e2f6719a3233ded0e2dc63 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sun, 8 Mar 2026 15:03:49 +0000 Subject: [PATCH 213/373] KVM: arm64: Move fault context to const structure In order to make it clearer what gets updated or not during fault handling, move a set of information that loosely represents the fault context.
This gets populated early, from handle_mem_abort(), and gets passed along as a const pointer. user_mem_abort()'s signature is majorly improved in doing so, and kvm_s2_fault loses a bunch of fields. gmem_abort() will get a similar treatment down the line. Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 133 ++++++++++++++++++++++--------------------- 1 file changed, 69 insertions(+), 64 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 9ceecba992d8..3c319cfc0357 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1565,6 +1565,14 @@ static void adjust_nested_exec_perms(struct kvm *kvm, *prot &= ~KVM_PGTABLE_PROT_PX; } +struct kvm_s2_fault_desc { + struct kvm_vcpu *vcpu; + phys_addr_t fault_ipa; + struct kvm_s2_trans *nested; + struct kvm_memory_slot *memslot; + unsigned long hva; +}; + static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_s2_trans *nested, struct kvm_memory_slot *memslot, bool is_perm) @@ -1640,23 +1648,20 @@ out_unlock: return ret != -EAGAIN ? 
ret : 0; } -static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, - unsigned long hva, - struct kvm_memory_slot *memslot, - struct kvm_s2_trans *nested, - bool *force_pte) +static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, + struct vm_area_struct *vma, bool *force_pte) { short vma_shift; if (*force_pte) vma_shift = PAGE_SHIFT; else - vma_shift = get_vma_page_shift(vma, hva); + vma_shift = get_vma_page_shift(vma, s2fd->hva); switch (vma_shift) { #ifndef __PAGETABLE_PMD_FOLDED case PUD_SHIFT: - if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE)) + if (fault_supports_stage2_huge_mapping(s2fd->memslot, s2fd->hva, PUD_SIZE)) break; fallthrough; #endif @@ -1664,7 +1669,7 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, vma_shift = PMD_SHIFT; fallthrough; case PMD_SHIFT: - if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) + if (fault_supports_stage2_huge_mapping(s2fd->memslot, s2fd->hva, PMD_SIZE)) break; fallthrough; case CONT_PTE_SHIFT: @@ -1677,7 +1682,7 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, WARN_ONCE(1, "Unknown vma_shift %d", vma_shift); } - if (nested) { + if (s2fd->nested) { unsigned long max_map_size; max_map_size = *force_pte ? PAGE_SIZE : PUD_SIZE; @@ -1687,7 +1692,7 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, * can only create a block mapping if the guest stage 2 page * table uses at least as big a mapping. 
*/ - max_map_size = min(kvm_s2_trans_size(nested), max_map_size); + max_map_size = min(kvm_s2_trans_size(s2fd->nested), max_map_size); /* * Be careful that if the mapping size falls between @@ -1706,11 +1711,6 @@ static short kvm_s2_resolve_vma_size(struct vm_area_struct *vma, } struct kvm_s2_fault { - struct kvm_vcpu *vcpu; - phys_addr_t fault_ipa; - struct kvm_s2_trans *nested; - struct kvm_memory_slot *memslot; - unsigned long hva; bool fault_is_perm; bool write_fault; @@ -1732,28 +1732,28 @@ struct kvm_s2_fault { vm_flags_t vm_flags; }; -static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault) +static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd, + struct kvm_s2_fault *fault) { struct vm_area_struct *vma; - struct kvm *kvm = fault->vcpu->kvm; + struct kvm *kvm = s2fd->vcpu->kvm; mmap_read_lock(current->mm); - vma = vma_lookup(current->mm, fault->hva); + vma = vma_lookup(current->mm, s2fd->hva); if (unlikely(!vma)) { - kvm_err("Failed to find VMA for fault->hva 0x%lx\n", fault->hva); + kvm_err("Failed to find VMA for hva 0x%lx\n", s2fd->hva); mmap_read_unlock(current->mm); return -EFAULT; } - fault->vma_pagesize = 1UL << kvm_s2_resolve_vma_size(vma, fault->hva, fault->memslot, - fault->nested, &fault->force_pte); + fault->vma_pagesize = BIT(kvm_s2_resolve_vma_size(s2fd, vma, &fault->force_pte)); /* * Both the canonical IPA and fault IPA must be aligned to the * mapping size to ensure we find the right PFN and lay down the * mapping in the right place. 
*/ - fault->gfn = ALIGN_DOWN(fault->fault_ipa, fault->vma_pagesize) >> PAGE_SHIFT; + fault->gfn = ALIGN_DOWN(s2fd->fault_ipa, fault->vma_pagesize) >> PAGE_SHIFT; fault->mte_allowed = kvm_vma_mte_allowed(vma); @@ -1775,31 +1775,33 @@ static int kvm_s2_fault_get_vma_info(struct kvm_s2_fault *fault) return 0; } -static gfn_t get_canonical_gfn(struct kvm_s2_fault *fault) +static gfn_t get_canonical_gfn(const struct kvm_s2_fault_desc *s2fd, + const struct kvm_s2_fault *fault) { phys_addr_t ipa; - if (!fault->nested) + if (!s2fd->nested) return fault->gfn; - ipa = kvm_s2_trans_output(fault->nested); + ipa = kvm_s2_trans_output(s2fd->nested); return ALIGN_DOWN(ipa, fault->vma_pagesize) >> PAGE_SHIFT; } -static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) +static int kvm_s2_fault_pin_pfn(const struct kvm_s2_fault_desc *s2fd, + struct kvm_s2_fault *fault) { int ret; - ret = kvm_s2_fault_get_vma_info(fault); + ret = kvm_s2_fault_get_vma_info(s2fd, fault); if (ret) return ret; - fault->pfn = __kvm_faultin_pfn(fault->memslot, get_canonical_gfn(fault), + fault->pfn = __kvm_faultin_pfn(s2fd->memslot, get_canonical_gfn(s2fd, fault), fault->write_fault ? 
FOLL_WRITE : 0, &fault->writable, &fault->page); if (unlikely(is_error_noslot_pfn(fault->pfn))) { if (fault->pfn == KVM_PFN_ERR_HWPOISON) { - kvm_send_hwpoison_signal(fault->hva, __ffs(fault->vma_pagesize)); + kvm_send_hwpoison_signal(s2fd->hva, __ffs(fault->vma_pagesize)); return 0; } return -EFAULT; @@ -1808,9 +1810,10 @@ static int kvm_s2_fault_pin_pfn(struct kvm_s2_fault *fault) return 1; } -static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault) +static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, + struct kvm_s2_fault *fault) { - struct kvm *kvm = fault->vcpu->kvm; + struct kvm *kvm = s2fd->vcpu->kvm; /* * Check if this is non-struct page memory PFN, and cannot support @@ -1862,13 +1865,13 @@ static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault) * and trigger the exception here. Since the memslot is valid, inject * the fault back to the guest. */ - if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(fault->vcpu))) { - kvm_inject_dabt_excl_atomic(fault->vcpu, kvm_vcpu_get_hfar(fault->vcpu)); + if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(s2fd->vcpu))) { + kvm_inject_dabt_excl_atomic(s2fd->vcpu, kvm_vcpu_get_hfar(s2fd->vcpu)); return 1; } - if (fault->nested) - adjust_nested_fault_perms(fault->nested, &fault->prot, &fault->writable); + if (s2fd->nested) + adjust_nested_fault_perms(s2fd->nested, &fault->prot, &fault->writable); if (fault->writable) fault->prot |= KVM_PGTABLE_PROT_W; @@ -1882,8 +1885,8 @@ static int kvm_s2_fault_compute_prot(struct kvm_s2_fault *fault) else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) fault->prot |= KVM_PGTABLE_PROT_X; - if (fault->nested) - adjust_nested_exec_perms(kvm, fault->nested, &fault->prot); + if (s2fd->nested) + adjust_nested_exec_perms(kvm, s2fd->nested, &fault->prot); if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) { /* Check the VMM hasn't introduced a new disallowed VMA */ @@ -1899,15 +1902,16 @@ static phys_addr_t get_ipa(const struct 
kvm_s2_fault *fault) return gfn_to_gpa(fault->gfn); } -static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache) +static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, + struct kvm_s2_fault *fault, void *memcache) { - struct kvm *kvm = fault->vcpu->kvm; + struct kvm *kvm = s2fd->vcpu->kvm; struct kvm_pgtable *pgt; int ret; enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; kvm_fault_lock(kvm); - pgt = fault->vcpu->arch.hw_mmu->pgt; + pgt = s2fd->vcpu->arch.hw_mmu->pgt; ret = -EAGAIN; if (mmu_invalidate_retry(kvm, fault->mmu_seq)) goto out_unlock; @@ -1921,8 +1925,8 @@ static int kvm_s2_fault_map(struct kvm_s2_fault *fault, void *memcache) if (fault->fault_is_perm && fault->fault_granule > PAGE_SIZE) { fault->vma_pagesize = fault->fault_granule; } else { - fault->vma_pagesize = transparent_hugepage_adjust(kvm, fault->memslot, - fault->hva, &fault->pfn, + fault->vma_pagesize = transparent_hugepage_adjust(kvm, s2fd->memslot, + s2fd->hva, &fault->pfn, &fault->gfn); if (fault->vma_pagesize < 0) { @@ -1960,34 +1964,27 @@ out_unlock: /* Mark the page dirty only if the fault is handled successfully */ if (fault->writable && !ret) - mark_page_dirty_in_slot(kvm, fault->memslot, get_canonical_gfn(fault)); + mark_page_dirty_in_slot(kvm, s2fd->memslot, get_canonical_gfn(s2fd, fault)); if (ret != -EAGAIN) return ret; return 0; } -static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, - struct kvm_s2_trans *nested, - struct kvm_memory_slot *memslot, unsigned long hva, - bool fault_is_perm) +static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) { - bool write_fault = kvm_is_write_fault(vcpu); - bool logging_active = memslot_is_logging(memslot); + bool perm_fault = kvm_vcpu_trap_is_permission_fault(s2fd->vcpu); + bool write_fault = kvm_is_write_fault(s2fd->vcpu); + bool logging_active = memslot_is_logging(s2fd->memslot); struct kvm_s2_fault fault = { - .vcpu = vcpu, - .fault_ipa = fault_ipa, - .nested = nested, - 
.memslot = memslot, - .hva = hva, - .fault_is_perm = fault_is_perm, + .fault_is_perm = perm_fault, .logging_active = logging_active, .force_pte = logging_active, .prot = KVM_PGTABLE_PROT_R, - .fault_granule = fault_is_perm ? kvm_vcpu_trap_get_perm_fault_granule(vcpu) : 0, + .fault_granule = perm_fault ? kvm_vcpu_trap_get_perm_fault_granule(s2fd->vcpu) : 0, .write_fault = write_fault, - .exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu), - .topup_memcache = !fault_is_perm || (logging_active && write_fault), + .exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu), + .topup_memcache = !perm_fault || (logging_active && write_fault), }; void *memcache; int ret; @@ -2000,7 +1997,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * only exception to this is when dirty logging is enabled at runtime * and a write fault needs to collapse a block entry into a table. */ - ret = prepare_mmu_memcache(vcpu, fault.topup_memcache, &memcache); + ret = prepare_mmu_memcache(s2fd->vcpu, fault.topup_memcache, &memcache); if (ret) return ret; @@ -2008,17 +2005,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * Let's check if we will get back a huge page backed by hugetlbfs, or * get block mapping for device MMIO region. */ - ret = kvm_s2_fault_pin_pfn(&fault); + ret = kvm_s2_fault_pin_pfn(s2fd, &fault); if (ret != 1) return ret; - ret = kvm_s2_fault_compute_prot(&fault); + ret = kvm_s2_fault_compute_prot(s2fd, &fault); if (ret) { kvm_release_page_unused(fault.page); return ret; } - return kvm_s2_fault_map(&fault, memcache); + return kvm_s2_fault_map(s2fd, &fault, memcache); } /* Resolve the access fault by making the page young again. 
*/ @@ -2284,12 +2281,20 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) && !write_fault && !kvm_vcpu_trap_is_exec_fault(vcpu)); + const struct kvm_s2_fault_desc s2fd = { + .vcpu = vcpu, + .fault_ipa = fault_ipa, + .nested = nested, + .memslot = memslot, + .hva = hva, + }; + if (kvm_slot_has_gmem(memslot)) ret = gmem_abort(vcpu, fault_ipa, nested, memslot, esr_fsc_is_permission_fault(esr)); else - ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva, - esr_fsc_is_permission_fault(esr)); + ret = user_mem_abort(&s2fd); + if (ret == 0) ret = 1; out: From fe4b6f824f267739ef2c5f2f047c317eceb587c3 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Mon, 9 Mar 2026 09:59:38 +0000 Subject: [PATCH 214/373] KVM: arm64: Replace fault_is_perm with a helper Carrying a boolean to indicate that a given fault is a permission fault is slightly odd, as this is a property of the fault itself, and we'd better avoid duplicating state. For this purpose, introduce a kvm_s2_fault_is_perm() predicate that can take a fault descriptor as a parameter. fault_is_perm is therefore dropped from kvm_s2_fault. 
Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Reviewed-by: Joey Gouly Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 3c319cfc0357..2bb4e974886a 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1711,8 +1711,6 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, } struct kvm_s2_fault { - bool fault_is_perm; - bool write_fault; bool exec_fault; bool writable; @@ -1732,6 +1730,11 @@ struct kvm_s2_fault { vm_flags_t vm_flags; }; +static bool kvm_s2_fault_is_perm(const struct kvm_s2_fault_desc *s2fd) +{ + return kvm_vcpu_trap_is_permission_fault(s2fd->vcpu); +} + static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd, struct kvm_s2_fault *fault) { @@ -1888,7 +1891,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, if (s2fd->nested) adjust_nested_exec_perms(kvm, s2fd->nested, &fault->prot); - if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) { + if (!kvm_s2_fault_is_perm(s2fd) && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) { /* Check the VMM hasn't introduced a new disallowed VMA */ if (!fault->mte_allowed) return -EFAULT; @@ -1905,6 +1908,7 @@ static phys_addr_t get_ipa(const struct kvm_s2_fault *fault) static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, struct kvm_s2_fault *fault, void *memcache) { + bool fault_is_perm = kvm_s2_fault_is_perm(s2fd); struct kvm *kvm = s2fd->vcpu->kvm; struct kvm_pgtable *pgt; int ret; @@ -1922,7 +1926,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, */ if (fault->vma_pagesize == PAGE_SIZE && !(fault->force_pte || fault->s2_force_noncacheable)) { - if (fault->fault_is_perm && fault->fault_granule > PAGE_SIZE) { + if (fault_is_perm && fault->fault_granule > PAGE_SIZE) { fault->vma_pagesize = fault->fault_granule; 
} else { fault->vma_pagesize = transparent_hugepage_adjust(kvm, s2fd->memslot, @@ -1936,7 +1940,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, } } - if (!fault->fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) + if (!fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) sanitise_mte_tags(kvm, fault->pfn, fault->vma_pagesize); /* @@ -1944,7 +1948,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, * permissions only if vma_pagesize equals fault_granule. Otherwise, * kvm_pgtable_stage2_map() should be called to change block size. */ - if (fault->fault_is_perm && fault->vma_pagesize == fault->fault_granule) { + if (fault_is_perm && fault->vma_pagesize == fault->fault_granule) { /* * Drop the SW bits in favour of those stored in the * PTE, which will be preserved. @@ -1977,7 +1981,6 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) bool write_fault = kvm_is_write_fault(s2fd->vcpu); bool logging_active = memslot_is_logging(s2fd->memslot); struct kvm_s2_fault fault = { - .fault_is_perm = perm_fault, .logging_active = logging_active, .force_pte = logging_active, .prot = KVM_PGTABLE_PROT_R, From 31571929e8a894d06291a9501b2be22cffe14443 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Mon, 9 Mar 2026 10:57:04 +0000 Subject: [PATCH 215/373] KVM: arm64: Constrain fault_granule to kvm_s2_fault_map() The notion of fault_granule is specific to kvm_s2_fault_map(), and is unused anywhere else. Move this variable locally, removing it from kvm_s2_fault. 
Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 2bb4e974886a..981c04a74ab7 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1724,7 +1724,6 @@ struct kvm_s2_fault { bool logging_active; bool force_pte; long vma_pagesize; - long fault_granule; enum kvm_pgtable_prot prot; struct page *page; vm_flags_t vm_flags; @@ -1908,9 +1907,9 @@ static phys_addr_t get_ipa(const struct kvm_s2_fault *fault) static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, struct kvm_s2_fault *fault, void *memcache) { - bool fault_is_perm = kvm_s2_fault_is_perm(s2fd); struct kvm *kvm = s2fd->vcpu->kvm; struct kvm_pgtable *pgt; + long perm_fault_granule; int ret; enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; @@ -1920,14 +1919,17 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, if (mmu_invalidate_retry(kvm, fault->mmu_seq)) goto out_unlock; + perm_fault_granule = (kvm_s2_fault_is_perm(s2fd) ? + kvm_vcpu_trap_get_perm_fault_granule(s2fd->vcpu) : 0); + /* * If we are not forced to use page mapping, check if we are * backed by a THP and thus use block mapping if possible. 
*/ if (fault->vma_pagesize == PAGE_SIZE && !(fault->force_pte || fault->s2_force_noncacheable)) { - if (fault_is_perm && fault->fault_granule > PAGE_SIZE) { - fault->vma_pagesize = fault->fault_granule; + if (perm_fault_granule > PAGE_SIZE) { + fault->vma_pagesize = perm_fault_granule; } else { fault->vma_pagesize = transparent_hugepage_adjust(kvm, s2fd->memslot, s2fd->hva, &fault->pfn, @@ -1940,15 +1942,15 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, } } - if (!fault_is_perm && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) + if (!perm_fault_granule && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) sanitise_mte_tags(kvm, fault->pfn, fault->vma_pagesize); /* * Under the premise of getting a FSC_PERM fault, we just need to relax - * permissions only if vma_pagesize equals fault_granule. Otherwise, + * permissions only if vma_pagesize equals perm_fault_granule. Otherwise, * kvm_pgtable_stage2_map() should be called to change block size. */ - if (fault_is_perm && fault->vma_pagesize == fault->fault_granule) { + if (fault->vma_pagesize == perm_fault_granule) { /* * Drop the SW bits in favour of those stored in the * PTE, which will be preserved. @@ -1984,7 +1986,6 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) .logging_active = logging_active, .force_pte = logging_active, .prot = KVM_PGTABLE_PROT_R, - .fault_granule = perm_fault ? kvm_vcpu_trap_get_perm_fault_granule(s2fd->vcpu) : 0, .write_fault = write_fault, .exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu), .topup_memcache = !perm_fault || (logging_active && write_fault), From 49902d7e010614888253a90252f4704edb2fe3e3 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Mon, 9 Mar 2026 11:22:59 +0000 Subject: [PATCH 216/373] KVM: arm64: Kill write_fault from kvm_s2_fault We already have kvm_is_write_fault() as a predicate indicating a S2 fault on a write, and we're better off just using that instead of duplicating the state. 
Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 981c04a74ab7..7dab0c3faa5b 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1711,7 +1711,6 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, } struct kvm_s2_fault { - bool write_fault; bool exec_fault; bool writable; bool topup_memcache; @@ -1799,7 +1798,7 @@ static int kvm_s2_fault_pin_pfn(const struct kvm_s2_fault_desc *s2fd, return ret; fault->pfn = __kvm_faultin_pfn(s2fd->memslot, get_canonical_gfn(s2fd, fault), - fault->write_fault ? FOLL_WRITE : 0, + kvm_is_write_fault(s2fd->vcpu) ? FOLL_WRITE : 0, &fault->writable, &fault->page); if (unlikely(is_error_noslot_pfn(fault->pfn))) { if (fault->pfn == KVM_PFN_ERR_HWPOISON) { @@ -1850,7 +1849,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, */ fault->s2_force_noncacheable = true; } - } else if (fault->logging_active && !fault->write_fault) { + } else if (fault->logging_active && !kvm_is_write_fault(s2fd->vcpu)) { /* * Only actually map the page as writable if this was a write * fault. 
@@ -1980,21 +1979,17 @@ out_unlock: static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) { bool perm_fault = kvm_vcpu_trap_is_permission_fault(s2fd->vcpu); - bool write_fault = kvm_is_write_fault(s2fd->vcpu); bool logging_active = memslot_is_logging(s2fd->memslot); struct kvm_s2_fault fault = { .logging_active = logging_active, .force_pte = logging_active, .prot = KVM_PGTABLE_PROT_R, - .write_fault = write_fault, .exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu), - .topup_memcache = !perm_fault || (logging_active && write_fault), + .topup_memcache = !perm_fault || (logging_active && kvm_is_write_fault(s2fd->vcpu)), }; void *memcache; int ret; - VM_WARN_ON_ONCE(fault.write_fault && fault.exec_fault); - /* * Permission faults just need to update the existing leaf entry, * and so normally don't require allocations from the memcache. The From c0d699915a835364b1c8d1eca11e11e7b82f0705 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Mon, 9 Mar 2026 11:28:11 +0000 Subject: [PATCH 217/373] KVM: arm64: Kill exec_fault from kvm_s2_fault Similarly to write_fault, exec_fault can be advantageously replaced by the kvm_vcpu_trap_is_exec_fault() predicate where needed. Another one bites the dust... 
Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 7dab0c3faa5b..e8bda71e862b 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1711,7 +1711,6 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, } struct kvm_s2_fault { - bool exec_fault; bool writable; bool topup_memcache; bool mte_allowed; @@ -1857,7 +1856,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, fault->writable = false; } - if (fault->exec_fault && fault->s2_force_noncacheable) + if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu) && fault->s2_force_noncacheable) return -ENOEXEC; /* @@ -1877,7 +1876,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, if (fault->writable) fault->prot |= KVM_PGTABLE_PROT_W; - if (fault->exec_fault) + if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu)) fault->prot |= KVM_PGTABLE_PROT_X; if (fault->s2_force_noncacheable) @@ -1984,7 +1983,6 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) .logging_active = logging_active, .force_pte = logging_active, .prot = KVM_PGTABLE_PROT_R, - .exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu), .topup_memcache = !perm_fault || (logging_active && kvm_is_write_fault(s2fd->vcpu)), }; void *memcache; From 1a3cd7cb5547979f96ec2a0920b8e39939c58db3 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Mon, 9 Mar 2026 11:52:40 +0000 Subject: [PATCH 218/373] KVM: arm64: Kill topup_memcache from kvm_s2_fault The topup_memcache field can be easily replaced by the equivalent conditions, and the resulting code is not much worse. While at it, split prepare_mmu_memcache() into get/topup helpers, which makes the code more readable. 
Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 35 +++++++++++++++++------------------ 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index e8bda71e862b..1aedc066ba65 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1513,25 +1513,22 @@ static bool kvm_vma_is_cacheable(struct vm_area_struct *vma) } } -static int prepare_mmu_memcache(struct kvm_vcpu *vcpu, bool topup_memcache, - void **memcache) +static void *get_mmu_memcache(struct kvm_vcpu *vcpu) { - int min_pages; - if (!is_protected_kvm_enabled()) - *memcache = &vcpu->arch.mmu_page_cache; + return &vcpu->arch.mmu_page_cache; else - *memcache = &vcpu->arch.pkvm_memcache; + return &vcpu->arch.pkvm_memcache; +} - if (!topup_memcache) - return 0; - - min_pages = kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu); +static int topup_mmu_memcache(struct kvm_vcpu *vcpu, void *memcache) +{ + int min_pages = kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu); if (!is_protected_kvm_enabled()) - return kvm_mmu_topup_memory_cache(*memcache, min_pages); + return kvm_mmu_topup_memory_cache(memcache, min_pages); - return topup_hyp_memcache(*memcache, min_pages); + return topup_hyp_memcache(memcache, min_pages); } /* @@ -1589,7 +1586,8 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, gfn_t gfn; int ret; - ret = prepare_mmu_memcache(vcpu, true, &memcache); + memcache = get_mmu_memcache(vcpu); + ret = topup_mmu_memcache(vcpu, memcache); if (ret) return ret; @@ -1712,7 +1710,6 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, struct kvm_s2_fault { bool writable; - bool topup_memcache; bool mte_allowed; bool is_vma_cacheable; bool s2_force_noncacheable; @@ -1983,7 +1980,6 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) .logging_active = logging_active, .force_pte = logging_active, .prot = KVM_PGTABLE_PROT_R, - .topup_memcache 
= !perm_fault || (logging_active && kvm_is_write_fault(s2fd->vcpu)), }; void *memcache; int ret; @@ -1994,9 +1990,12 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) * only exception to this is when dirty logging is enabled at runtime * and a write fault needs to collapse a block entry into a table. */ - ret = prepare_mmu_memcache(s2fd->vcpu, fault.topup_memcache, &memcache); - if (ret) - return ret; + memcache = get_mmu_memcache(s2fd->vcpu); + if (!perm_fault || (logging_active && kvm_is_write_fault(s2fd->vcpu))) { + ret = topup_mmu_memcache(s2fd->vcpu, memcache); + if (ret) + return ret; + } /* * Let's check if we will get back a huge page backed by hugetlbfs, or From 4ff2ce512063ae79976c23b6fe2130611bcbae62 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Mon, 9 Mar 2026 14:31:06 +0000 Subject: [PATCH 219/373] KVM: arm64: Move VMA-related information to kvm_s2_fault_vma_info Mechanically extract a bunch of VMA-related fields from kvm_s2_fault and move them to a new kvm_s2_fault_vma_info structure. This is not much, but it already allows us to define which functions can update this structure, and which ones are pure consumers of the data. Those in the latter camp are updated to take a const pointer to that structure. Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 117 ++++++++++++++++++++++++------------------- 1 file changed, 65 insertions(+), 52 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 1aedc066ba65..9f92892b27a4 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1646,6 +1646,15 @@ out_unlock: return ret != -EAGAIN ?
ret : 0; } +struct kvm_s2_fault_vma_info { + unsigned long mmu_seq; + long vma_pagesize; + vm_flags_t vm_flags; + gfn_t gfn; + bool mte_allowed; + bool is_vma_cacheable; +}; + static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, struct vm_area_struct *vma, bool *force_pte) { @@ -1710,18 +1719,12 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, struct kvm_s2_fault { bool writable; - bool mte_allowed; - bool is_vma_cacheable; bool s2_force_noncacheable; - unsigned long mmu_seq; - gfn_t gfn; kvm_pfn_t pfn; bool logging_active; bool force_pte; - long vma_pagesize; enum kvm_pgtable_prot prot; struct page *page; - vm_flags_t vm_flags; }; static bool kvm_s2_fault_is_perm(const struct kvm_s2_fault_desc *s2fd) @@ -1730,7 +1733,8 @@ static bool kvm_s2_fault_is_perm(const struct kvm_s2_fault_desc *s2fd) } static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd, - struct kvm_s2_fault *fault) + struct kvm_s2_fault *fault, + struct kvm_s2_fault_vma_info *s2vi) { struct vm_area_struct *vma; struct kvm *kvm = s2fd->vcpu->kvm; @@ -1743,20 +1747,20 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd, return -EFAULT; } - fault->vma_pagesize = BIT(kvm_s2_resolve_vma_size(s2fd, vma, &fault->force_pte)); + s2vi->vma_pagesize = BIT(kvm_s2_resolve_vma_size(s2fd, vma, &fault->force_pte)); /* * Both the canonical IPA and fault IPA must be aligned to the * mapping size to ensure we find the right PFN and lay down the * mapping in the right place. 
*/ - fault->gfn = ALIGN_DOWN(s2fd->fault_ipa, fault->vma_pagesize) >> PAGE_SHIFT; + s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT; - fault->mte_allowed = kvm_vma_mte_allowed(vma); + s2vi->mte_allowed = kvm_vma_mte_allowed(vma); - fault->vm_flags = vma->vm_flags; + s2vi->vm_flags = vma->vm_flags; - fault->is_vma_cacheable = kvm_vma_is_cacheable(vma); + s2vi->is_vma_cacheable = kvm_vma_is_cacheable(vma); /* * Read mmu_invalidate_seq so that KVM can detect if the results of @@ -1766,39 +1770,40 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd, * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs * with the smp_wmb() in kvm_mmu_invalidate_end(). */ - fault->mmu_seq = kvm->mmu_invalidate_seq; + s2vi->mmu_seq = kvm->mmu_invalidate_seq; mmap_read_unlock(current->mm); return 0; } static gfn_t get_canonical_gfn(const struct kvm_s2_fault_desc *s2fd, - const struct kvm_s2_fault *fault) + const struct kvm_s2_fault_vma_info *s2vi) { phys_addr_t ipa; if (!s2fd->nested) - return fault->gfn; + return s2vi->gfn; ipa = kvm_s2_trans_output(s2fd->nested); - return ALIGN_DOWN(ipa, fault->vma_pagesize) >> PAGE_SHIFT; + return ALIGN_DOWN(ipa, s2vi->vma_pagesize) >> PAGE_SHIFT; } static int kvm_s2_fault_pin_pfn(const struct kvm_s2_fault_desc *s2fd, - struct kvm_s2_fault *fault) + struct kvm_s2_fault *fault, + struct kvm_s2_fault_vma_info *s2vi) { int ret; - ret = kvm_s2_fault_get_vma_info(s2fd, fault); + ret = kvm_s2_fault_get_vma_info(s2fd, fault, s2vi); if (ret) return ret; - fault->pfn = __kvm_faultin_pfn(s2fd->memslot, get_canonical_gfn(s2fd, fault), + fault->pfn = __kvm_faultin_pfn(s2fd->memslot, get_canonical_gfn(s2fd, s2vi), kvm_is_write_fault(s2fd->vcpu) ? 
FOLL_WRITE : 0, &fault->writable, &fault->page); if (unlikely(is_error_noslot_pfn(fault->pfn))) { if (fault->pfn == KVM_PFN_ERR_HWPOISON) { - kvm_send_hwpoison_signal(s2fd->hva, __ffs(fault->vma_pagesize)); + kvm_send_hwpoison_signal(s2fd->hva, __ffs(s2vi->vma_pagesize)); return 0; } return -EFAULT; @@ -1808,7 +1813,8 @@ static int kvm_s2_fault_pin_pfn(const struct kvm_s2_fault_desc *s2fd, } static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, - struct kvm_s2_fault *fault) + struct kvm_s2_fault *fault, + const struct kvm_s2_fault_vma_info *s2vi) { struct kvm *kvm = s2fd->vcpu->kvm; @@ -1816,8 +1822,8 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, * Check if this is non-struct page memory PFN, and cannot support * CMOs. It could potentially be unsafe to access as cacheable. */ - if (fault->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(fault->pfn)) { - if (fault->is_vma_cacheable) { + if (s2vi->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(fault->pfn)) { + if (s2vi->is_vma_cacheable) { /* * Whilst the VMA owner expects cacheable mapping to this * PFN, hardware also has to support the FWB and CACHE DIC @@ -1877,7 +1883,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, fault->prot |= KVM_PGTABLE_PROT_X; if (fault->s2_force_noncacheable) - fault->prot |= (fault->vm_flags & VM_ALLOW_ANY_UNCACHED) ? + fault->prot |= (s2vi->vm_flags & VM_ALLOW_ANY_UNCACHED) ? 
KVM_PGTABLE_PROT_NORMAL_NC : KVM_PGTABLE_PROT_DEVICE; else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) fault->prot |= KVM_PGTABLE_PROT_X; @@ -1887,74 +1893,73 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, if (!kvm_s2_fault_is_perm(s2fd) && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) { /* Check the VMM hasn't introduced a new disallowed VMA */ - if (!fault->mte_allowed) + if (!s2vi->mte_allowed) return -EFAULT; } return 0; } -static phys_addr_t get_ipa(const struct kvm_s2_fault *fault) -{ - return gfn_to_gpa(fault->gfn); -} - static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, - struct kvm_s2_fault *fault, void *memcache) + struct kvm_s2_fault *fault, + const struct kvm_s2_fault_vma_info *s2vi, void *memcache) { + enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; struct kvm *kvm = s2fd->vcpu->kvm; struct kvm_pgtable *pgt; long perm_fault_granule; + long mapping_size; + gfn_t gfn; int ret; - enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; kvm_fault_lock(kvm); pgt = s2fd->vcpu->arch.hw_mmu->pgt; ret = -EAGAIN; - if (mmu_invalidate_retry(kvm, fault->mmu_seq)) + if (mmu_invalidate_retry(kvm, s2vi->mmu_seq)) goto out_unlock; perm_fault_granule = (kvm_s2_fault_is_perm(s2fd) ? kvm_vcpu_trap_get_perm_fault_granule(s2fd->vcpu) : 0); + mapping_size = s2vi->vma_pagesize; + gfn = s2vi->gfn; /* * If we are not forced to use page mapping, check if we are * backed by a THP and thus use block mapping if possible. 
*/ - if (fault->vma_pagesize == PAGE_SIZE && + if (mapping_size == PAGE_SIZE && !(fault->force_pte || fault->s2_force_noncacheable)) { if (perm_fault_granule > PAGE_SIZE) { - fault->vma_pagesize = perm_fault_granule; + mapping_size = perm_fault_granule; } else { - fault->vma_pagesize = transparent_hugepage_adjust(kvm, s2fd->memslot, - s2fd->hva, &fault->pfn, - &fault->gfn); - - if (fault->vma_pagesize < 0) { - ret = fault->vma_pagesize; + mapping_size = transparent_hugepage_adjust(kvm, s2fd->memslot, + s2fd->hva, &fault->pfn, + &gfn); + if (mapping_size < 0) { + ret = mapping_size; goto out_unlock; } } } if (!perm_fault_granule && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) - sanitise_mte_tags(kvm, fault->pfn, fault->vma_pagesize); + sanitise_mte_tags(kvm, fault->pfn, mapping_size); /* * Under the premise of getting a FSC_PERM fault, we just need to relax - * permissions only if vma_pagesize equals perm_fault_granule. Otherwise, + * permissions only if mapping_size equals perm_fault_granule. Otherwise, * kvm_pgtable_stage2_map() should be called to change block size. */ - if (fault->vma_pagesize == perm_fault_granule) { + if (mapping_size == perm_fault_granule) { /* * Drop the SW bits in favour of those stored in the * PTE, which will be preserved. 
*/ fault->prot &= ~KVM_NV_GUEST_MAP_SZ; - ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, get_ipa(fault), + ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn), fault->prot, flags); } else { - ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, get_ipa(fault), fault->vma_pagesize, + ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size, __pfn_to_phys(fault->pfn), fault->prot, memcache, flags); } @@ -1963,9 +1968,16 @@ out_unlock: kvm_release_faultin_page(kvm, fault->page, !!ret, fault->writable); kvm_fault_unlock(kvm); - /* Mark the page dirty only if the fault is handled successfully */ - if (fault->writable && !ret) - mark_page_dirty_in_slot(kvm, s2fd->memslot, get_canonical_gfn(s2fd, fault)); + /* + * Mark the page dirty only if the fault is handled successfully, + * making sure we adjust the canonical IPA if the mapping size has + * been updated (via a THP upgrade, for example). + */ + if (fault->writable && !ret) { + phys_addr_t ipa = gfn_to_gpa(get_canonical_gfn(s2fd, s2vi)); + ipa &= ~(mapping_size - 1); + mark_page_dirty_in_slot(kvm, s2fd->memslot, gpa_to_gfn(ipa)); + } if (ret != -EAGAIN) return ret; @@ -1976,6 +1988,7 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) { bool perm_fault = kvm_vcpu_trap_is_permission_fault(s2fd->vcpu); bool logging_active = memslot_is_logging(s2fd->memslot); + struct kvm_s2_fault_vma_info s2vi = {}; struct kvm_s2_fault fault = { .logging_active = logging_active, .force_pte = logging_active, @@ -2001,17 +2014,17 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) * Let's check if we will get back a huge page backed by hugetlbfs, or * get block mapping for device MMIO region. 
*/ - ret = kvm_s2_fault_pin_pfn(s2fd, &fault); + ret = kvm_s2_fault_pin_pfn(s2fd, &fault, &s2vi); if (ret != 1) return ret; - ret = kvm_s2_fault_compute_prot(s2fd, &fault); + ret = kvm_s2_fault_compute_prot(s2fd, &fault, &s2vi); if (ret) { kvm_release_page_unused(fault.page); return ret; } - return kvm_s2_fault_map(s2fd, &fault, memcache); + return kvm_s2_fault_map(s2fd, &fault, &s2vi, memcache); } /* Resolve the access fault by making the page young again. */ From 18e1312879db0af5bfd1b1a8b6771d314ccd3ca9 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Mon, 9 Mar 2026 14:32:25 +0000 Subject: [PATCH 220/373] KVM: arm64: Kill logging_active from kvm_s2_fault There are only two spots where we evaluate whether logging is active. Replace the boolean with calls to the relevant helper. Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 9f92892b27a4..227bc34bc535 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1721,7 +1721,6 @@ struct kvm_s2_fault { bool writable; bool s2_force_noncacheable; kvm_pfn_t pfn; - bool logging_active; bool force_pte; enum kvm_pgtable_prot prot; struct page *page; @@ -1851,7 +1850,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, */ fault->s2_force_noncacheable = true; } - } else if (fault->logging_active && !kvm_is_write_fault(s2fd->vcpu)) { + } else if (memslot_is_logging(s2fd->memslot) && !kvm_is_write_fault(s2fd->vcpu)) { /* * Only actually map the page as writable if this was a write * fault. 
@@ -1987,11 +1986,9 @@ out_unlock: static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) { bool perm_fault = kvm_vcpu_trap_is_permission_fault(s2fd->vcpu); - bool logging_active = memslot_is_logging(s2fd->memslot); struct kvm_s2_fault_vma_info s2vi = {}; struct kvm_s2_fault fault = { - .logging_active = logging_active, - .force_pte = logging_active, + .force_pte = memslot_is_logging(s2fd->memslot), .prot = KVM_PGTABLE_PROT_R, }; void *memcache; @@ -2004,7 +2001,8 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) * and a write fault needs to collapse a block entry into a table. */ memcache = get_mmu_memcache(s2fd->vcpu); - if (!perm_fault || (logging_active && kvm_is_write_fault(s2fd->vcpu))) { + if (!perm_fault || (memslot_is_logging(s2fd->memslot) && + kvm_is_write_fault(s2fd->vcpu))) { ret = topup_mmu_memcache(s2fd->vcpu, memcache); if (ret) return ret; From 08abf09f4d763565ee9ed084401b7ac92f8b4952 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Mon, 9 Mar 2026 15:35:27 +0000 Subject: [PATCH 221/373] KVM: arm64: Restrict the scope of the 'writable' attribute The 'writable' field is ambiguous, and indicates multiple things: - whether the underlying memslot is writable - whether we are resolving the fault with writable attributes Add a new field to kvm_s2_fault_vma_info (map_writable) to indicate the former condition, and have local writable variables to track the latter. 
Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 227bc34bc535..f8e88539988c 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1653,6 +1653,7 @@ struct kvm_s2_fault_vma_info { gfn_t gfn; bool mte_allowed; bool is_vma_cacheable; + bool map_writable; }; static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, @@ -1718,7 +1719,6 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, } struct kvm_s2_fault { - bool writable; bool s2_force_noncacheable; kvm_pfn_t pfn; bool force_pte; @@ -1799,7 +1799,7 @@ static int kvm_s2_fault_pin_pfn(const struct kvm_s2_fault_desc *s2fd, fault->pfn = __kvm_faultin_pfn(s2fd->memslot, get_canonical_gfn(s2fd, s2vi), kvm_is_write_fault(s2fd->vcpu) ? FOLL_WRITE : 0, - &fault->writable, &fault->page); + &s2vi->map_writable, &fault->page); if (unlikely(is_error_noslot_pfn(fault->pfn))) { if (fault->pfn == KVM_PFN_ERR_HWPOISON) { kvm_send_hwpoison_signal(s2fd->hva, __ffs(s2vi->vma_pagesize)); @@ -1816,6 +1816,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, const struct kvm_s2_fault_vma_info *s2vi) { struct kvm *kvm = s2fd->vcpu->kvm; + bool writable = s2vi->map_writable; /* * Check if this is non-struct page memory PFN, and cannot support @@ -1855,7 +1856,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, * Only actually map the page as writable if this was a write * fault. 
*/ - fault->writable = false; + writable = false; } if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu) && fault->s2_force_noncacheable) @@ -1873,9 +1874,9 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, } if (s2fd->nested) - adjust_nested_fault_perms(s2fd->nested, &fault->prot, &fault->writable); + adjust_nested_fault_perms(s2fd->nested, &fault->prot, &writable); - if (fault->writable) + if (writable) fault->prot |= KVM_PGTABLE_PROT_W; if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu)) @@ -1904,6 +1905,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, const struct kvm_s2_fault_vma_info *s2vi, void *memcache) { enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; + bool writable = fault->prot & KVM_PGTABLE_PROT_W; struct kvm *kvm = s2fd->vcpu->kvm; struct kvm_pgtable *pgt; long perm_fault_granule; @@ -1964,7 +1966,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, } out_unlock: - kvm_release_faultin_page(kvm, fault->page, !!ret, fault->writable); + kvm_release_faultin_page(kvm, fault->page, !!ret, writable); kvm_fault_unlock(kvm); /* @@ -1972,7 +1974,7 @@ out_unlock: * making sure we adjust the canonical IPA if the mapping size has * been updated (via a THP upgrade, for example). */ - if (fault->writable && !ret) { + if (writable && !ret) { phys_addr_t ipa = gfn_to_gpa(get_canonical_gfn(s2fd, s2vi)); ipa &= ~(mapping_size - 1); mark_page_dirty_in_slot(kvm, s2fd->memslot, gpa_to_gfn(ipa)); From f8dad9602ff305b879fa23688becd578102ba547 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Mon, 9 Mar 2026 15:55:38 +0000 Subject: [PATCH 222/373] KVM: arm64: Move kvm_s2_fault.{pfn,page} to kvm_s2_vma_info Continue restricting the visibility/mutability of some attributes by moving kvm_s2_fault.{pfn,page} to kvm_s2_vma_info. This is a pretty mechanical change. 
Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 28 +++++++++++++++------------- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index f8e88539988c..f749b5a2f893 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1650,6 +1650,8 @@ struct kvm_s2_fault_vma_info { unsigned long mmu_seq; long vma_pagesize; vm_flags_t vm_flags; + struct page *page; + kvm_pfn_t pfn; gfn_t gfn; bool mte_allowed; bool is_vma_cacheable; @@ -1720,10 +1722,8 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, struct kvm_s2_fault { bool s2_force_noncacheable; - kvm_pfn_t pfn; bool force_pte; enum kvm_pgtable_prot prot; - struct page *page; }; static bool kvm_s2_fault_is_perm(const struct kvm_s2_fault_desc *s2fd) @@ -1797,11 +1797,11 @@ static int kvm_s2_fault_pin_pfn(const struct kvm_s2_fault_desc *s2fd, if (ret) return ret; - fault->pfn = __kvm_faultin_pfn(s2fd->memslot, get_canonical_gfn(s2fd, s2vi), - kvm_is_write_fault(s2fd->vcpu) ? FOLL_WRITE : 0, - &s2vi->map_writable, &fault->page); - if (unlikely(is_error_noslot_pfn(fault->pfn))) { - if (fault->pfn == KVM_PFN_ERR_HWPOISON) { + s2vi->pfn = __kvm_faultin_pfn(s2fd->memslot, get_canonical_gfn(s2fd, s2vi), + kvm_is_write_fault(s2fd->vcpu) ? FOLL_WRITE : 0, + &s2vi->map_writable, &s2vi->page); + if (unlikely(is_error_noslot_pfn(s2vi->pfn))) { + if (s2vi->pfn == KVM_PFN_ERR_HWPOISON) { kvm_send_hwpoison_signal(s2fd->hva, __ffs(s2vi->vma_pagesize)); return 0; } @@ -1822,7 +1822,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, * Check if this is non-struct page memory PFN, and cannot support * CMOs. It could potentially be unsafe to access as cacheable. 
*/ - if (s2vi->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(fault->pfn)) { + if (s2vi->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(s2vi->pfn)) { if (s2vi->is_vma_cacheable) { /* * Whilst the VMA owner expects cacheable mapping to this @@ -1910,6 +1910,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, struct kvm_pgtable *pgt; long perm_fault_granule; long mapping_size; + kvm_pfn_t pfn; gfn_t gfn; int ret; @@ -1922,6 +1923,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, perm_fault_granule = (kvm_s2_fault_is_perm(s2fd) ? kvm_vcpu_trap_get_perm_fault_granule(s2fd->vcpu) : 0); mapping_size = s2vi->vma_pagesize; + pfn = s2vi->pfn; gfn = s2vi->gfn; /* @@ -1934,7 +1936,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, mapping_size = perm_fault_granule; } else { mapping_size = transparent_hugepage_adjust(kvm, s2fd->memslot, - s2fd->hva, &fault->pfn, + s2fd->hva, &pfn, &gfn); if (mapping_size < 0) { ret = mapping_size; @@ -1944,7 +1946,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, } if (!perm_fault_granule && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) - sanitise_mte_tags(kvm, fault->pfn, mapping_size); + sanitise_mte_tags(kvm, pfn, mapping_size); /* * Under the premise of getting a FSC_PERM fault, we just need to relax @@ -1961,12 +1963,12 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, fault->prot, flags); } else { ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size, - __pfn_to_phys(fault->pfn), fault->prot, + __pfn_to_phys(pfn), fault->prot, memcache, flags); } out_unlock: - kvm_release_faultin_page(kvm, fault->page, !!ret, writable); + kvm_release_faultin_page(kvm, s2vi->page, !!ret, writable); kvm_fault_unlock(kvm); /* @@ -2020,7 +2022,7 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) ret = kvm_s2_fault_compute_prot(s2fd, &fault, &s2vi); if (ret) { - 
kvm_release_page_unused(fault.page); + kvm_release_page_unused(s2vi.page); return ret; } From 29a5681843a52570ca9597bf355be33fe8753eb0 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Fri, 13 Mar 2026 09:22:04 +0000 Subject: [PATCH 223/373] KVM: arm64: Replace force_pte with a max_map_size attribute force_pte is annoyingly limited in what it expresses, and we'd be better off with a more generic primitive. Introduce max_map_size instead, which does the trick and can be moved into the vma_info structure. This further allows reducing the scopes in which it is mutable. Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index f749b5a2f893..422b1b4f662d 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1650,6 +1650,7 @@ struct kvm_s2_fault_vma_info { unsigned long mmu_seq; long vma_pagesize; vm_flags_t vm_flags; + unsigned long max_map_size; struct page *page; kvm_pfn_t pfn; gfn_t gfn; @@ -1659,14 +1660,18 @@ struct kvm_s2_fault_vma_info { }; static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, - struct vm_area_struct *vma, bool *force_pte) + struct kvm_s2_fault_vma_info *s2vi, + struct vm_area_struct *vma) { short vma_shift; - if (*force_pte) + if (memslot_is_logging(s2fd->memslot)) { + s2vi->max_map_size = PAGE_SIZE; vma_shift = PAGE_SHIFT; - else + } else { + s2vi->max_map_size = PUD_SIZE; vma_shift = get_vma_page_shift(vma, s2fd->hva); + } switch (vma_shift) { #ifndef __PAGETABLE_PMD_FOLDED @@ -1684,7 +1689,7 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, fallthrough; case CONT_PTE_SHIFT: vma_shift = PAGE_SHIFT; - *force_pte = true; + s2vi->max_map_size = PAGE_SIZE; fallthrough; case PAGE_SHIFT: break; @@ -1695,7 +1700,7 @@ static short kvm_s2_resolve_vma_size(const struct 
kvm_s2_fault_desc *s2fd, if (s2fd->nested) { unsigned long max_map_size; - max_map_size = *force_pte ? PAGE_SIZE : PUD_SIZE; + max_map_size = min(s2vi->max_map_size, PUD_SIZE); /* * If we're about to create a shadow stage 2 entry, then we @@ -1713,7 +1718,7 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE) max_map_size = PAGE_SIZE; - *force_pte = (max_map_size == PAGE_SIZE); + s2vi->max_map_size = max_map_size; vma_shift = min_t(short, vma_shift, __ffs(max_map_size)); } @@ -1722,7 +1727,6 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, struct kvm_s2_fault { bool s2_force_noncacheable; - bool force_pte; enum kvm_pgtable_prot prot; }; @@ -1746,7 +1750,7 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd, return -EFAULT; } - s2vi->vma_pagesize = BIT(kvm_s2_resolve_vma_size(s2fd, vma, &fault->force_pte)); + s2vi->vma_pagesize = BIT(kvm_s2_resolve_vma_size(s2fd, s2vi, vma)); /* * Both the canonical IPA and fault IPA must be aligned to the @@ -1931,7 +1935,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, * backed by a THP and thus use block mapping if possible. 
*/ if (mapping_size == PAGE_SIZE && - !(fault->force_pte || fault->s2_force_noncacheable)) { + !(s2vi->max_map_size == PAGE_SIZE || fault->s2_force_noncacheable)) { if (perm_fault_granule > PAGE_SIZE) { mapping_size = perm_fault_granule; } else { @@ -1992,7 +1996,6 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) bool perm_fault = kvm_vcpu_trap_is_permission_fault(s2fd->vcpu); struct kvm_s2_fault_vma_info s2vi = {}; struct kvm_s2_fault fault = { - .force_pte = memslot_is_logging(s2fd->memslot), .prot = KVM_PGTABLE_PROT_R, }; void *memcache; From e314a4dbdb8b29b7e9a69afb8831a15a6d15c1ed Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sat, 14 Mar 2026 09:44:36 +0000 Subject: [PATCH 224/373] KVM: arm64: Move device mapping management into kvm_s2_fault_pin_pfn() Attributes for devices are computed very late in the fault handling process, meaning they stay mutable for that long. Introduce both 'device' and 'map_non_cacheable' attributes to the vma_info structure, allowing that information to be set in stone earlier, in kvm_s2_fault_pin_pfn(). 
Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 52 ++++++++++++++++++++++++-------------------- 1 file changed, 29 insertions(+), 23 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 422b1b4f662d..b2977d113b12 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1654,9 +1654,11 @@ struct kvm_s2_fault_vma_info { struct page *page; kvm_pfn_t pfn; gfn_t gfn; + bool device; bool mte_allowed; bool is_vma_cacheable; bool map_writable; + bool map_non_cacheable; }; static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, @@ -1726,7 +1728,6 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, } struct kvm_s2_fault { - bool s2_force_noncacheable; enum kvm_pgtable_prot prot; }; @@ -1736,7 +1737,6 @@ static bool kvm_s2_fault_is_perm(const struct kvm_s2_fault_desc *s2fd) } static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd, - struct kvm_s2_fault *fault, struct kvm_s2_fault_vma_info *s2vi) { struct vm_area_struct *vma; @@ -1792,12 +1792,11 @@ static gfn_t get_canonical_gfn(const struct kvm_s2_fault_desc *s2fd, } static int kvm_s2_fault_pin_pfn(const struct kvm_s2_fault_desc *s2fd, - struct kvm_s2_fault *fault, struct kvm_s2_fault_vma_info *s2vi) { int ret; - ret = kvm_s2_fault_get_vma_info(s2fd, fault, s2vi); + ret = kvm_s2_fault_get_vma_info(s2fd, s2vi); if (ret) return ret; @@ -1812,16 +1811,6 @@ static int kvm_s2_fault_pin_pfn(const struct kvm_s2_fault_desc *s2fd, return -EFAULT; } - return 1; -} - -static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, - struct kvm_s2_fault *fault, - const struct kvm_s2_fault_vma_info *s2vi) -{ - struct kvm *kvm = s2fd->vcpu->kvm; - bool writable = s2vi->map_writable; - /* * Check if this is non-struct page memory PFN, and cannot support * CMOs. It could potentially be unsafe to access as cacheable. 
@@ -1840,8 +1829,10 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, * S2FWB and CACHE DIC are mandatory to avoid the need for * cache maintenance. */ - if (!kvm_supports_cacheable_pfnmap()) + if (!kvm_supports_cacheable_pfnmap()) { + kvm_release_faultin_page(s2fd->vcpu->kvm, s2vi->page, true, false); return -EFAULT; + } } else { /* * If the page was identified as device early by looking at @@ -1853,9 +1844,24 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, * In both cases, we don't let transparent_hugepage_adjust() * change things at the last minute. */ - fault->s2_force_noncacheable = true; + s2vi->map_non_cacheable = true; } - } else if (memslot_is_logging(s2fd->memslot) && !kvm_is_write_fault(s2fd->vcpu)) { + + s2vi->device = true; + } + + return 1; +} + +static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, + struct kvm_s2_fault *fault, + const struct kvm_s2_fault_vma_info *s2vi) +{ + struct kvm *kvm = s2fd->vcpu->kvm; + bool writable = s2vi->map_writable; + + if (!s2vi->device && memslot_is_logging(s2fd->memslot) && + !kvm_is_write_fault(s2fd->vcpu)) { /* * Only actually map the page as writable if this was a write * fault. @@ -1863,7 +1869,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, writable = false; } - if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu) && fault->s2_force_noncacheable) + if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu) && s2vi->map_non_cacheable) return -ENOEXEC; /* @@ -1886,7 +1892,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu)) fault->prot |= KVM_PGTABLE_PROT_X; - if (fault->s2_force_noncacheable) + if (s2vi->map_non_cacheable) fault->prot |= (s2vi->vm_flags & VM_ALLOW_ANY_UNCACHED) ? 
KVM_PGTABLE_PROT_NORMAL_NC : KVM_PGTABLE_PROT_DEVICE; else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) @@ -1895,7 +1901,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, if (s2fd->nested) adjust_nested_exec_perms(kvm, s2fd->nested, &fault->prot); - if (!kvm_s2_fault_is_perm(s2fd) && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) { + if (!kvm_s2_fault_is_perm(s2fd) && !s2vi->map_non_cacheable && kvm_has_mte(kvm)) { /* Check the VMM hasn't introduced a new disallowed VMA */ if (!s2vi->mte_allowed) return -EFAULT; @@ -1935,7 +1941,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, * backed by a THP and thus use block mapping if possible. */ if (mapping_size == PAGE_SIZE && - !(s2vi->max_map_size == PAGE_SIZE || fault->s2_force_noncacheable)) { + !(s2vi->max_map_size == PAGE_SIZE || s2vi->map_non_cacheable)) { if (perm_fault_granule > PAGE_SIZE) { mapping_size = perm_fault_granule; } else { @@ -1949,7 +1955,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, } } - if (!perm_fault_granule && !fault->s2_force_noncacheable && kvm_has_mte(kvm)) + if (!perm_fault_granule && !s2vi->map_non_cacheable && kvm_has_mte(kvm)) sanitise_mte_tags(kvm, pfn, mapping_size); /* @@ -2019,7 +2025,7 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) * Let's check if we will get back a huge page backed by hugetlbfs, or * get block mapping for device MMIO region. */ - ret = kvm_s2_fault_pin_pfn(s2fd, &fault, &s2vi); + ret = kvm_s2_fault_pin_pfn(s2fd, &s2vi); if (ret != 1) return ret; From adb70b3a8b31e9eb05f2ec3c76d85f9a7a8c8cbc Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sun, 15 Mar 2026 14:26:27 +0000 Subject: [PATCH 225/373] KVM: arm64: Directly expose mapping prot and kill kvm_s2_fault The 'prot' field is the only one left in kvm_s2_fault. Expose it directly to the functions needing it, and get rid of kvm_s2_fault. It has served us well during this refactoring, but it is now no longer needed. 
Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 45 +++++++++++++++++++++----------------------- 1 file changed, 21 insertions(+), 24 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index b2977d113b12..49ddfbb78148 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1727,10 +1727,6 @@ static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, return vma_shift; } -struct kvm_s2_fault { - enum kvm_pgtable_prot prot; -}; - static bool kvm_s2_fault_is_perm(const struct kvm_s2_fault_desc *s2fd) { return kvm_vcpu_trap_is_permission_fault(s2fd->vcpu); @@ -1854,8 +1850,8 @@ static int kvm_s2_fault_pin_pfn(const struct kvm_s2_fault_desc *s2fd, } static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, - struct kvm_s2_fault *fault, - const struct kvm_s2_fault_vma_info *s2vi) + const struct kvm_s2_fault_vma_info *s2vi, + enum kvm_pgtable_prot *prot) { struct kvm *kvm = s2fd->vcpu->kvm; bool writable = s2vi->map_writable; @@ -1883,23 +1879,25 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, return 1; } + *prot = KVM_PGTABLE_PROT_R; + if (s2fd->nested) - adjust_nested_fault_perms(s2fd->nested, &fault->prot, &writable); + adjust_nested_fault_perms(s2fd->nested, prot, &writable); if (writable) - fault->prot |= KVM_PGTABLE_PROT_W; + *prot |= KVM_PGTABLE_PROT_W; if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu)) - fault->prot |= KVM_PGTABLE_PROT_X; + *prot |= KVM_PGTABLE_PROT_X; if (s2vi->map_non_cacheable) - fault->prot |= (s2vi->vm_flags & VM_ALLOW_ANY_UNCACHED) ? - KVM_PGTABLE_PROT_NORMAL_NC : KVM_PGTABLE_PROT_DEVICE; + *prot |= (s2vi->vm_flags & VM_ALLOW_ANY_UNCACHED) ? 
+ KVM_PGTABLE_PROT_NORMAL_NC : KVM_PGTABLE_PROT_DEVICE; else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) - fault->prot |= KVM_PGTABLE_PROT_X; + *prot |= KVM_PGTABLE_PROT_X; if (s2fd->nested) - adjust_nested_exec_perms(kvm, s2fd->nested, &fault->prot); + adjust_nested_exec_perms(kvm, s2fd->nested, prot); if (!kvm_s2_fault_is_perm(s2fd) && !s2vi->map_non_cacheable && kvm_has_mte(kvm)) { /* Check the VMM hasn't introduced a new disallowed VMA */ @@ -1911,11 +1909,12 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, } static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, - struct kvm_s2_fault *fault, - const struct kvm_s2_fault_vma_info *s2vi, void *memcache) + const struct kvm_s2_fault_vma_info *s2vi, + enum kvm_pgtable_prot prot, + void *memcache) { enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; - bool writable = fault->prot & KVM_PGTABLE_PROT_W; + bool writable = prot & KVM_PGTABLE_PROT_W; struct kvm *kvm = s2fd->vcpu->kvm; struct kvm_pgtable *pgt; long perm_fault_granule; @@ -1968,12 +1967,12 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, * Drop the SW bits in favour of those stored in the * PTE, which will be preserved. 
*/ - fault->prot &= ~KVM_NV_GUEST_MAP_SZ; + prot &= ~KVM_NV_GUEST_MAP_SZ; ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn), - fault->prot, flags); + prot, flags); } else { ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size, - __pfn_to_phys(pfn), fault->prot, + __pfn_to_phys(pfn), prot, memcache, flags); } @@ -2001,9 +2000,7 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) { bool perm_fault = kvm_vcpu_trap_is_permission_fault(s2fd->vcpu); struct kvm_s2_fault_vma_info s2vi = {}; - struct kvm_s2_fault fault = { - .prot = KVM_PGTABLE_PROT_R, - }; + enum kvm_pgtable_prot prot; void *memcache; int ret; @@ -2029,13 +2026,13 @@ static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) if (ret != 1) return ret; - ret = kvm_s2_fault_compute_prot(s2fd, &fault, &s2vi); + ret = kvm_s2_fault_compute_prot(s2fd, &s2vi, &prot); if (ret) { kvm_release_page_unused(s2vi.page); return ret; } - return kvm_s2_fault_map(s2fd, &fault, &s2vi, memcache); + return kvm_s2_fault_map(s2fd, &s2vi, prot, memcache); } /* Resolve the access fault by making the page young again. */ From fb9888fdfada0b5ad977f08f7550432a08aacbb1 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sun, 15 Mar 2026 14:58:44 +0000 Subject: [PATCH 226/373] KVM: arm64: Simplify integration of adjust_nested_*_perms() Instead of passing pointers to adjust_nested_*_perms(), allow them to return a new set of permissions. With some careful moving around so that the canonical permissions are computed before the nested ones are applied, we end up with a bit less code, and something a bit more readable. 
Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 62 +++++++++++++++++++------------------------- 1 file changed, 27 insertions(+), 35 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 49ddfbb78148..7863e428920a 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1541,25 +1541,27 @@ static int topup_mmu_memcache(struct kvm_vcpu *vcpu, void *memcache) * TLB invalidation from the guest and used to limit the invalidation scope if a * TTL hint or a range isn't provided. */ -static void adjust_nested_fault_perms(struct kvm_s2_trans *nested, - enum kvm_pgtable_prot *prot, - bool *writable) +static enum kvm_pgtable_prot adjust_nested_fault_perms(struct kvm_s2_trans *nested, + enum kvm_pgtable_prot prot) { - *writable &= kvm_s2_trans_writable(nested); + if (!kvm_s2_trans_writable(nested)) + prot &= ~KVM_PGTABLE_PROT_W; if (!kvm_s2_trans_readable(nested)) - *prot &= ~KVM_PGTABLE_PROT_R; + prot &= ~KVM_PGTABLE_PROT_R; - *prot |= kvm_encode_nested_level(nested); + return prot | kvm_encode_nested_level(nested); } -static void adjust_nested_exec_perms(struct kvm *kvm, - struct kvm_s2_trans *nested, - enum kvm_pgtable_prot *prot) +static enum kvm_pgtable_prot adjust_nested_exec_perms(struct kvm *kvm, + struct kvm_s2_trans *nested, + enum kvm_pgtable_prot prot) { if (!kvm_s2_trans_exec_el0(kvm, nested)) - *prot &= ~KVM_PGTABLE_PROT_UX; + prot &= ~KVM_PGTABLE_PROT_UX; if (!kvm_s2_trans_exec_el1(kvm, nested)) - *prot &= ~KVM_PGTABLE_PROT_PX; + prot &= ~KVM_PGTABLE_PROT_PX; + + return prot; } struct kvm_s2_fault_desc { @@ -1574,7 +1576,7 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_s2_trans *nested, struct kvm_memory_slot *memslot, bool is_perm) { - bool write_fault, exec_fault, writable; + bool write_fault, exec_fault; enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R; 
struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt; @@ -1612,19 +1614,17 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return ret; } - writable = !(memslot->flags & KVM_MEM_READONLY); + if (!(memslot->flags & KVM_MEM_READONLY)) + prot |= KVM_PGTABLE_PROT_W; if (nested) - adjust_nested_fault_perms(nested, &prot, &writable); - - if (writable) - prot |= KVM_PGTABLE_PROT_W; + prot = adjust_nested_fault_perms(nested, prot); if (exec_fault || cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) prot |= KVM_PGTABLE_PROT_X; if (nested) - adjust_nested_exec_perms(kvm, nested, &prot); + prot = adjust_nested_exec_perms(kvm, nested, prot); kvm_fault_lock(kvm); if (mmu_invalidate_retry(kvm, mmu_seq)) { @@ -1637,10 +1637,10 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, memcache, flags); out_unlock: - kvm_release_faultin_page(kvm, page, !!ret, writable); + kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W); kvm_fault_unlock(kvm); - if (writable && !ret) + if ((prot & KVM_PGTABLE_PROT_W) && !ret) mark_page_dirty_in_slot(kvm, memslot, gfn); return ret != -EAGAIN ? ret : 0; @@ -1854,16 +1854,6 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, enum kvm_pgtable_prot *prot) { struct kvm *kvm = s2fd->vcpu->kvm; - bool writable = s2vi->map_writable; - - if (!s2vi->device && memslot_is_logging(s2fd->memslot) && - !kvm_is_write_fault(s2fd->vcpu)) { - /* - * Only actually map the page as writable if this was a write - * fault. 
- */ - writable = false; - } if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu) && s2vi->map_non_cacheable) return -ENOEXEC; @@ -1881,12 +1871,14 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, *prot = KVM_PGTABLE_PROT_R; - if (s2fd->nested) - adjust_nested_fault_perms(s2fd->nested, prot, &writable); - - if (writable) + if (s2vi->map_writable && (s2vi->device || + !memslot_is_logging(s2fd->memslot) || + kvm_is_write_fault(s2fd->vcpu))) *prot |= KVM_PGTABLE_PROT_W; + if (s2fd->nested) + *prot = adjust_nested_fault_perms(s2fd->nested, *prot); + if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu)) *prot |= KVM_PGTABLE_PROT_X; @@ -1897,7 +1889,7 @@ static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, *prot |= KVM_PGTABLE_PROT_X; if (s2fd->nested) - adjust_nested_exec_perms(kvm, s2fd->nested, prot); + *prot = adjust_nested_exec_perms(kvm, s2fd->nested, *prot); if (!kvm_s2_fault_is_perm(s2fd) && !s2vi->map_non_cacheable && kvm_has_mte(kvm)) { /* Check the VMM hasn't introduced a new disallowed VMA */ From e9550374d13a4bfd0b8a711733f5d423c2e56b96 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sun, 15 Mar 2026 15:45:22 +0000 Subject: [PATCH 227/373] KVM: arm64: Convert gmem_abort() to struct kvm_s2_fault_desc Having fully converted user_mem_abort() to kvm_s2_fault_desc and co, convert gmem_abort() to it as well. The change is obviously much simpler. 
Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 43 ++++++++++++++++++++----------------------- 1 file changed, 20 insertions(+), 23 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 7863e428920a..03e1f389339c 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1572,34 +1572,32 @@ struct kvm_s2_fault_desc { unsigned long hva; }; -static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, - struct kvm_s2_trans *nested, - struct kvm_memory_slot *memslot, bool is_perm) +static int gmem_abort(const struct kvm_s2_fault_desc *s2fd) { bool write_fault, exec_fault; enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R; - struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt; + struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt; unsigned long mmu_seq; struct page *page; - struct kvm *kvm = vcpu->kvm; + struct kvm *kvm = s2fd->vcpu->kvm; void *memcache; kvm_pfn_t pfn; gfn_t gfn; int ret; - memcache = get_mmu_memcache(vcpu); - ret = topup_mmu_memcache(vcpu, memcache); + memcache = get_mmu_memcache(s2fd->vcpu); + ret = topup_mmu_memcache(s2fd->vcpu, memcache); if (ret) return ret; - if (nested) - gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT; + if (s2fd->nested) + gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT; else - gfn = fault_ipa >> PAGE_SHIFT; + gfn = s2fd->fault_ipa >> PAGE_SHIFT; - write_fault = kvm_is_write_fault(vcpu); - exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu); + write_fault = kvm_is_write_fault(s2fd->vcpu); + exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu); VM_WARN_ON_ONCE(write_fault && exec_fault); @@ -1607,24 +1605,24 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, /* Pairs with the smp_wmb() in kvm_mmu_invalidate_end(). 
*/ smp_rmb(); - ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL); + ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL); if (ret) { - kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE, + kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE, write_fault, exec_fault, false); return ret; } - if (!(memslot->flags & KVM_MEM_READONLY)) + if (!(s2fd->memslot->flags & KVM_MEM_READONLY)) prot |= KVM_PGTABLE_PROT_W; - if (nested) - prot = adjust_nested_fault_perms(nested, prot); + if (s2fd->nested) + prot = adjust_nested_fault_perms(s2fd->nested, prot); if (exec_fault || cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) prot |= KVM_PGTABLE_PROT_X; - if (nested) - prot = adjust_nested_exec_perms(kvm, nested, prot); + if (s2fd->nested) + prot = adjust_nested_exec_perms(kvm, s2fd->nested, prot); kvm_fault_lock(kvm); if (mmu_invalidate_retry(kvm, mmu_seq)) { @@ -1632,7 +1630,7 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, goto out_unlock; } - ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, PAGE_SIZE, + ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE, __pfn_to_phys(pfn), prot, memcache, flags); @@ -1641,7 +1639,7 @@ out_unlock: kvm_fault_unlock(kvm); if ((prot & KVM_PGTABLE_PROT_W) && !ret) - mark_page_dirty_in_slot(kvm, memslot, gfn); + mark_page_dirty_in_slot(kvm, s2fd->memslot, gfn); return ret != -EAGAIN ? 
ret : 0; } @@ -2299,8 +2297,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) }; if (kvm_slot_has_gmem(memslot)) - ret = gmem_abort(vcpu, fault_ipa, nested, memslot, - esr_fsc_is_permission_fault(esr)); + ret = gmem_abort(&s2fd); else ret = user_mem_abort(&s2fd); From d133aa75e39dd72e0b8577ab1f5fc17c72246536 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Fri, 27 Mar 2026 13:00:44 +0000 Subject: [PATCH 228/373] KVM: arm64: Disable TRBE Trace Buffer Unit when running in guest context The nVHE world-switch code relies on zeroing TRFCR_EL1 to disable trace generation in guest context when self-hosted TRBE is in use by the host. Per D3.2.1 ("Controls to prohibit trace at Exception levels"), clearing TRFCR_EL1 means that trace generation is prohibited at EL1 and EL0 but per R_YCHKJ the Trace Buffer Unit will still be enabled if TRBLIMITR_EL1.E is set. R_SJFRQ goes on to state that, when enabled, the Trace Buffer Unit can perform address translation for the "owning exception level" even when it is out of context. Consequently, we can end up in a state where TRBE performs speculative page-table walks for a host VA/IPA in guest/hypervisor context depending on the value of MDCR_EL2.E2TB, which changes over world-switch. The potential result appears to be a heady mixture of SErrors, data corruption and hardware lockups. Extend the TRBE world-switch code to clear TRBLIMITR_EL1.E after draining the buffer, restoring the register on return to the host. This unfortunately means we need to tackle CPU errata #2064142 and #2038923 which add additional synchronisation requirements around manipulations of the limit register. Hopefully this doesn't need to be fast. 
Cc: Marc Zyngier Cc: Oliver Upton Cc: James Clark Cc: Leo Yan Cc: Suzuki K Poulose Cc: Fuad Tabba Cc: Alexandru Elisei Tested-by: Leo Yan Tested-by: Fuad Tabba Reviewed-by: Suzuki K Poulose Reviewed-by: Fuad Tabba Fixes: a1319260bf62 ("arm64: KVM: Enable access to TRBE support for host") Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260327130047.21065-2-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_host.h | 1 + arch/arm64/kvm/hyp/nvhe/debug-sr.c | 71 ++++++++++++++++++++++++++---- arch/arm64/kvm/hyp/nvhe/switch.c | 2 +- 3 files changed, 64 insertions(+), 10 deletions(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 70cb9cfd760a..b1335c55dbef 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -770,6 +770,7 @@ struct kvm_host_data { u64 pmscr_el1; /* Self-hosted trace */ u64 trfcr_el1; + u64 trblimitr_el1; /* Values of trap registers for the host before guest entry. */ u64 mdcr_el2; u64 brbcr_el1; diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c index 2a1c0f49792b..0955af771ad1 100644 --- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c +++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c @@ -57,12 +57,54 @@ static void __trace_do_switch(u64 *saved_trfcr, u64 new_trfcr) write_sysreg_el1(new_trfcr, SYS_TRFCR); } -static bool __trace_needs_drain(void) +static void __trace_drain_and_disable(void) { - if (is_protected_kvm_enabled() && host_data_test_flag(HAS_TRBE)) - return read_sysreg_s(SYS_TRBLIMITR_EL1) & TRBLIMITR_EL1_E; + u64 *trblimitr_el1 = host_data_ptr(host_debug_state.trblimitr_el1); + bool needs_drain = is_protected_kvm_enabled() ? 
+ host_data_test_flag(HAS_TRBE) : + host_data_test_flag(TRBE_ENABLED); - return host_data_test_flag(TRBE_ENABLED); + if (!needs_drain) { + *trblimitr_el1 = 0; + return; + } + + *trblimitr_el1 = read_sysreg_s(SYS_TRBLIMITR_EL1); + if (*trblimitr_el1 & TRBLIMITR_EL1_E) { + /* + * The host has enabled the Trace Buffer Unit so we have + * to beat the CPU with a stick until it stops accessing + * memory. + */ + + /* First, ensure that our prior write to TRFCR has stuck. */ + isb(); + + /* Now synchronise with the trace and drain the buffer. */ + tsb_csync(); + dsb(nsh); + + /* + * With no more trace being generated, we can disable the + * Trace Buffer Unit. + */ + write_sysreg_s(0, SYS_TRBLIMITR_EL1); + if (cpus_have_final_cap(ARM64_WORKAROUND_2064142)) { + /* + * Some CPUs are so good, we have to drain 'em + * twice. + */ + tsb_csync(); + dsb(nsh); + } + + /* + * Ensure that the Trace Buffer Unit is disabled before + * we start mucking with the stage-2 and trap + * configuration. + */ + isb(); + } } static bool __trace_needs_switch(void) @@ -79,15 +121,26 @@ static void __trace_switch_to_guest(void) __trace_do_switch(host_data_ptr(host_debug_state.trfcr_el1), *host_data_ptr(trfcr_while_in_guest)); - - if (__trace_needs_drain()) { - isb(); - tsb_csync(); - } + __trace_drain_and_disable(); } static void __trace_switch_to_host(void) { + u64 trblimitr_el1 = *host_data_ptr(host_debug_state.trblimitr_el1); + + if (trblimitr_el1 & TRBLIMITR_EL1_E) { + /* Re-enable the Trace Buffer Unit for the host. */ + write_sysreg_s(trblimitr_el1, SYS_TRBLIMITR_EL1); + isb(); + if (cpus_have_final_cap(ARM64_WORKAROUND_2038923)) { + /* + * Make sure the unit is re-enabled before we + * poke TRFCR. 
+ */ + isb(); + } + } + __trace_do_switch(host_data_ptr(trfcr_while_in_guest), *host_data_ptr(host_debug_state.trfcr_el1)); } diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c index 779089e42681..f00688e69d88 100644 --- a/arch/arm64/kvm/hyp/nvhe/switch.c +++ b/arch/arm64/kvm/hyp/nvhe/switch.c @@ -278,7 +278,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) * We're about to restore some new MMU state. Make sure * ongoing page-table walks that have started before we * trapped to EL2 have completed. This also synchronises the - * above disabling of BRBE, SPE and TRBE. + * above disabling of BRBE and SPE. * * See DDI0487I.a D8.1.5 "Out-of-context translation regimes", * rule R_LFHQG and subsequent information statements. From 07695f7dc1e141601254057a00bf4e23301eb0b2 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Fri, 27 Mar 2026 13:00:45 +0000 Subject: [PATCH 229/373] KVM: arm64: Disable SPE Profiling Buffer when running in guest context The nVHE world-switch code relies on zeroing PMSCR_EL1 to disable profiling data generation in guest context when SPE is in use by the host. Unfortunately, this may leave PMBLIMITR_EL1.E set and consequently we can end up running in guest/hypervisor context with the Profiling Buffer enabled. The current "known issues" document for Rev M.a of the Arm ARM states that this can lead to speculative, out-of-context translations: | 2.18 D23136: | | When the Profiling Buffer is enabled, profiling is not stopped, and | Discard mode is not enabled, the Statistical Profiling Unit might | cause speculative translations for the owning translation regime, | including when the owning translation regime is out-of-context. In a similar fashion to TRBE, ensure that the Profiling Buffer is disabled during the nVHE world switch before we start messing with the stage-2 MMU and trap configuration. 
Cc: Marc Zyngier Cc: Oliver Upton Cc: James Clark Cc: Leo Yan Cc: Suzuki K Poulose Cc: Fuad Tabba Cc: Alexandru Elisei Reviewed-by: Alexandru Elisei Reviewed-by: Fuad Tabba Tested-by: Alexandru Elisei Tested-by: Fuad Tabba Fixes: f85279b4bd48 ("arm64: KVM: Save/restore the host SPE state when entering/leaving a VM") Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260327130047.21065-3-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_host.h | 1 + arch/arm64/kvm/hyp/nvhe/debug-sr.c | 33 ++++++++++++++++++++---------- arch/arm64/kvm/hyp/nvhe/switch.c | 2 +- 3 files changed, 24 insertions(+), 12 deletions(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index b1335c55dbef..fe588760fe62 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -768,6 +768,7 @@ struct kvm_host_data { struct kvm_guest_debug_arch regs; /* Statistical profiling extension */ u64 pmscr_el1; + u64 pmblimitr_el1; /* Self-hosted trace */ u64 trfcr_el1; u64 trblimitr_el1; diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c index 0955af771ad1..84bc80f4e36b 100644 --- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c +++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c @@ -14,20 +14,20 @@ #include #include -static void __debug_save_spe(u64 *pmscr_el1) +static void __debug_save_spe(void) { - u64 reg; + u64 *pmscr_el1, *pmblimitr_el1; - /* Clear pmscr in case of early return */ - *pmscr_el1 = 0; + pmscr_el1 = host_data_ptr(host_debug_state.pmscr_el1); + pmblimitr_el1 = host_data_ptr(host_debug_state.pmblimitr_el1); /* * At this point, we know that this CPU implements * SPE and is available to the host. * Check if the host is actually using it ? 
*/ - reg = read_sysreg_s(SYS_PMBLIMITR_EL1); - if (!(reg & BIT(PMBLIMITR_EL1_E_SHIFT))) + *pmblimitr_el1 = read_sysreg_s(SYS_PMBLIMITR_EL1); + if (!(*pmblimitr_el1 & BIT(PMBLIMITR_EL1_E_SHIFT))) return; /* Yes; save the control register and disable data generation */ @@ -37,18 +37,29 @@ static void __debug_save_spe(u64 *pmscr_el1) /* Now drain all buffered data to memory */ psb_csync(); + dsb(nsh); + + /* And disable the profiling buffer */ + write_sysreg_s(0, SYS_PMBLIMITR_EL1); + isb(); } -static void __debug_restore_spe(u64 pmscr_el1) +static void __debug_restore_spe(void) { - if (!pmscr_el1) + u64 pmblimitr_el1 = *host_data_ptr(host_debug_state.pmblimitr_el1); + + if (!(pmblimitr_el1 & BIT(PMBLIMITR_EL1_E_SHIFT))) return; /* The host page table is installed, but not yet synchronised */ isb(); + /* Re-enable the profiling buffer. */ + write_sysreg_s(pmblimitr_el1, SYS_PMBLIMITR_EL1); + isb(); + /* Re-enable data generation */ - write_sysreg_el1(pmscr_el1, SYS_PMSCR); + write_sysreg_el1(*host_data_ptr(host_debug_state.pmscr_el1), SYS_PMSCR); } static void __trace_do_switch(u64 *saved_trfcr, u64 new_trfcr) @@ -175,7 +186,7 @@ void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu) { /* Disable and flush SPE data generation */ if (host_data_test_flag(HAS_SPE)) - __debug_save_spe(host_data_ptr(host_debug_state.pmscr_el1)); + __debug_save_spe(); /* Disable BRBE branch records */ if (host_data_test_flag(HAS_BRBE)) @@ -193,7 +204,7 @@ void __debug_switch_to_guest(struct kvm_vcpu *vcpu) void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu) { if (host_data_test_flag(HAS_SPE)) - __debug_restore_spe(*host_data_ptr(host_debug_state.pmscr_el1)); + __debug_restore_spe(); if (host_data_test_flag(HAS_BRBE)) __debug_restore_brbe(*host_data_ptr(host_debug_state.brbcr_el1)); if (__trace_needs_switch()) diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c index f00688e69d88..9b6e87dac3b9 100644 --- a/arch/arm64/kvm/hyp/nvhe/switch.c +++ 
b/arch/arm64/kvm/hyp/nvhe/switch.c @@ -278,7 +278,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) * We're about to restore some new MMU state. Make sure * ongoing page-table walks that have started before we * trapped to EL2 have completed. This also synchronises the - * above disabling of BRBE and SPE. + * above disabling of BRBE. * * See DDI0487I.a D8.1.5 "Out-of-context translation regimes", * rule R_LFHQG and subsequent information statements. From 7aba10efef1d972fc82b00b84911f07f6afbdb78 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Fri, 27 Mar 2026 13:00:46 +0000 Subject: [PATCH 230/373] KVM: arm64: Don't pass host_debug_state to BRBE world-switch routines Now that the SPE and BRBE nVHE world-switch routines operate on the host_debug_state directly, tweak the BRBE code to do the same for consistency. This is purely cosmetic. Cc: Marc Zyngier Cc: Oliver Upton Cc: James Clark Cc: Leo Yan Cc: Suzuki K Poulose Cc: Fuad Tabba Cc: Alexandru Elisei Signed-off-by: Will Deacon Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Link: https://patch.msgid.link/20260327130047.21065-4-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/debug-sr.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c index 84bc80f4e36b..f8904391c125 100644 --- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c +++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c @@ -156,8 +156,10 @@ static void __trace_switch_to_host(void) *host_data_ptr(host_debug_state.trfcr_el1)); } -static void __debug_save_brbe(u64 *brbcr_el1) +static void __debug_save_brbe(void) { + u64 *brbcr_el1 = host_data_ptr(host_debug_state.brbcr_el1); + *brbcr_el1 = 0; /* Check if the BRBE is enabled */ @@ -173,8 +175,10 @@ static void __debug_save_brbe(u64 *brbcr_el1) write_sysreg_el1(0, SYS_BRBCR); } -static void __debug_restore_brbe(u64 brbcr_el1) +static void __debug_restore_brbe(void) { + u64 brbcr_el1 = 
*host_data_ptr(host_debug_state.brbcr_el1); + if (!brbcr_el1) return; @@ -190,7 +194,7 @@ void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu) /* Disable BRBE branch records */ if (host_data_test_flag(HAS_BRBE)) - __debug_save_brbe(host_data_ptr(host_debug_state.brbcr_el1)); + __debug_save_brbe(); if (__trace_needs_switch()) __trace_switch_to_guest(); @@ -206,7 +210,7 @@ void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu) if (host_data_test_flag(HAS_SPE)) __debug_restore_spe(); if (host_data_test_flag(HAS_BRBE)) - __debug_restore_brbe(*host_data_ptr(host_debug_state.brbcr_el1)); + __debug_restore_brbe(); if (__trace_needs_switch()) __trace_switch_to_host(); } From 9b33ab1e8c189bd0aced0d07545b0f31a459d40e Mon Sep 17 00:00:00 2001 From: Yufeng Wang Date: Tue, 17 Mar 2026 19:47:59 +0800 Subject: [PATCH 231/373] riscv: kvm: add null pointer check for vector datap Add WARN_ON check before accessing cntx->vector.datap in kvm_riscv_vcpu_vreg_addr() to detect potential null pointer dereferences early, consistent with the pattern used in kvm_riscv_vcpu_vector_reset(). This helps catch initialization issues where vector context allocation may have failed. 
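The guarded address computation can be sketched in userspace. This is a simplified model with hypothetical names, and unlike the kernel's `WARN_ON()` (which only logs), the sketch returns an error so the guard is observable in a test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the check added to kvm_riscv_vcpu_vreg_addr():
 * catch a NULL datap (left behind by a failed vector-context allocation)
 * before it is used in pointer arithmetic. Names are stand-ins. */
struct vector_ctx { void *datap; };

static int vreg_addr_model(struct vector_ctx *cntx, unsigned long reg_num,
			   unsigned long vlenb, void **reg_addr)
{
	if (!cntx->datap)	/* models WARN_ON(!cntx->vector.datap) */
		return -1;	/* flagged early instead of a NULL deref */
	*reg_addr = (uint8_t *)cntx->datap + reg_num * vlenb;
	return 0;
}
```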
Signed-off-by: Yufeng Wang Reviewed-by: Anup Patel Link: https://lore.kernel.org/r/20260317114759.53165-1-r4o5m6e8o@163.com Signed-off-by: Anup Patel --- arch/riscv/kvm/vcpu_vector.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/riscv/kvm/vcpu_vector.c b/arch/riscv/kvm/vcpu_vector.c index 5b6ad82d47be..f3f5fb665cf6 100644 --- a/arch/riscv/kvm/vcpu_vector.c +++ b/arch/riscv/kvm/vcpu_vector.c @@ -130,6 +130,7 @@ static int kvm_riscv_vcpu_vreg_addr(struct kvm_vcpu *vcpu, } else if (reg_num <= KVM_REG_RISCV_VECTOR_REG(31)) { if (reg_size != vlenb) return -EINVAL; + WARN_ON(!cntx->vector.datap); *reg_addr = cntx->vector.datap + (reg_num - KVM_REG_RISCV_VECTOR_REG(0)) * vlenb; } else { From 310e2096b082ae0010e7afef666073b966ac2fa7 Mon Sep 17 00:00:00 2001 From: Jiakai Xu Date: Wed, 18 Mar 2026 09:29:56 +0000 Subject: [PATCH 232/373] RISC-V: KVM: Fix double-free of sdata in kvm_pmu_clear_snapshot_area() In kvm_riscv_vcpu_pmu_snapshot_set_shmem(), when kvm_vcpu_write_guest() fails, kvpmu->sdata is freed but not set to NULL. This leaves a dangling pointer that will be freed again when kvm_pmu_clear_snapshot_area() is called during vcpu teardown, triggering a KASAN double-free report. 
First free occurs in kvm_riscv_vcpu_pmu_snapshot_set_shmem(): kvm_riscv_vcpu_pmu_snapshot_set_shmem arch/riscv/kvm/vcpu_pmu.c:443 kvm_sbi_ext_pmu_handler arch/riscv/kvm/vcpu_sbi_pmu.c:74 kvm_riscv_vcpu_sbi_ecall arch/riscv/kvm/vcpu_sbi.c:608 kvm_riscv_vcpu_exit arch/riscv/kvm/vcpu_exit.c:240 kvm_arch_vcpu_ioctl_run arch/riscv/kvm/vcpu.c:1008 kvm_vcpu_ioctl virt/kvm/kvm_main.c:4476 Second free (double-free) occurs in kvm_pmu_clear_snapshot_area(): kvm_pmu_clear_snapshot_area arch/riscv/kvm/vcpu_pmu.c:403 [inline] kvm_riscv_vcpu_pmu_deinit.part arch/riscv/kvm/vcpu_pmu.c:905 kvm_riscv_vcpu_pmu_deinit arch/riscv/kvm/vcpu_pmu.c:893 kvm_arch_vcpu_destroy arch/riscv/kvm/vcpu.c:199 kvm_vcpu_destroy virt/kvm/kvm_main.c:469 [inline] kvm_destroy_vcpus virt/kvm/kvm_main.c:489 kvm_arch_destroy_vm arch/riscv/kvm/vm.c:54 kvm_destroy_vm virt/kvm/kvm_main.c:1301 [inline] kvm_put_kvm virt/kvm/kvm_main.c:1338 kvm_vm_release virt/kvm/kvm_main.c:1361 Fix it by setting kvpmu->sdata to NULL after kfree() in kvm_riscv_vcpu_pmu_snapshot_set_shmem(), so that the subsequent kfree(NULL) in kvm_pmu_clear_snapshot_area() becomes a safe no-op. This bug was found by fuzzing the KVM RISC-V PMU interface. 
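The one-line fix follows the standard free-and-NULL idiom: like `free()`, `kfree(NULL)` is a no-op, so clearing the pointer turns the second teardown path into a harmless call. A minimal userspace model (with `free()` standing in for `kfree()`):

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model of the fix: clearing sdata after freeing it makes the
 * later teardown free(NULL), which is defined to do nothing. */
struct pmu_model { void *sdata; };

static void snapshot_set_shmem_fail(struct pmu_model *kvpmu)
{
	free(kvpmu->sdata);
	kvpmu->sdata = NULL;	/* the one-line fix */
}

static void clear_snapshot_area(struct pmu_model *kvpmu)
{
	free(kvpmu->sdata);	/* now free(NULL): safe no-op */
	kvpmu->sdata = NULL;
}
```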
Fixes: c2f41ddbcdd756 ("RISC-V: KVM: Implement SBI PMU Snapshot feature") Signed-off-by: Jiakai Xu Signed-off-by: Jiakai Xu Reviewed-by: Nutty Liu Reviewed-by: Andrew Jones Link: https://lore.kernel.org/r/20260318092956.708246-1-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel --- arch/riscv/kvm/vcpu_pmu.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c index 9e9f3302cef8..bb6380ec7fc4 100644 --- a/arch/riscv/kvm/vcpu_pmu.c +++ b/arch/riscv/kvm/vcpu_pmu.c @@ -456,6 +456,7 @@ int kvm_riscv_vcpu_pmu_snapshot_set_shmem(struct kvm_vcpu *vcpu, unsigned long s /* No need to check writable slot explicitly as kvm_vcpu_write_guest does it internally */ if (kvm_vcpu_write_guest(vcpu, saddr, kvpmu->sdata, snapshot_area_size)) { kfree(kvpmu->sdata); + kvpmu->sdata = NULL; sbiret = SBI_ERR_INVALID_ADDRESS; goto out; } From 1762ac42eed653557d2feb9e37f45995ac238ce6 Mon Sep 17 00:00:00 2001 From: Jiakai Xu Date: Thu, 19 Mar 2026 03:59:02 +0000 Subject: [PATCH 233/373] RISC-V: KVM: Fix integer overflow in kvm_pmu_validate_counter_mask() When a guest initiates an SBI_EXT_PMU_COUNTER_CFG_MATCH call with ctr_base=0xfffffffffffffffe, ctr_mask=0xeb5f and flags=0x1 (SBI_PMU_CFG_FLAG_SKIP_MATCH), kvm_riscv_vcpu_pmu_ctr_cfg_match() first invokes kvm_pmu_validate_counter_mask() to verify whether ctr_base and ctr_mask are valid, by evaluating: !ctr_mask || (ctr_base + __fls(ctr_mask) >= kvm_pmu_num_counters(kvpmu)) With the above inputs, __fls(0xeb5f) equals 15, and adding 15 to 0xfffffffffffffffe causes an integer overflow, wrapping around to 13. Since 13 is less than kvm_pmu_num_counters(), the validation wrongly succeeds. 
Thereafter, since flags & SBI_PMU_CFG_FLAG_SKIP_MATCH is satisfied, the code evaluates: !test_bit(ctr_base + __ffs(ctr_mask), kvpmu->pmc_in_use) Here __ffs(0xeb5f) equals 0, so test_bit() receives 0xfffffffffffffffe as the bit index and attempts to access the corresponding element of the kvpmu->pmc_in_use, which results in an invalid memory access. This triggers the following Oops: Unable to handle kernel paging request at virtual address e3ebffff12abba89 generic_test_bit include/asm-generic/bitops/generic-non-atomic.h:128 kvm_riscv_vcpu_pmu_ctr_cfg_match arch/riscv/kvm/vcpu_pmu.c:758 kvm_sbi_ext_pmu_handler arch/riscv/kvm/vcpu_sbi_pmu.c:49 kvm_riscv_vcpu_sbi_ecall arch/riscv/kvm/vcpu_sbi.c:608 kvm_riscv_vcpu_exit arch/riscv/kvm/vcpu_exit.c:240 The root cause is that kvm_pmu_validate_counter_mask() does not account for the case where ctr_base itself is out of range, allowing the subsequent addition to silently overflow and bypass the check. Fix this by explicitly validating ctr_base against kvm_pmu_num_counters() before performing the addition. This bug was found by fuzzing the KVM RISC-V PMU interface. 
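The wraparound from the report is easy to reproduce. The sketch below uses a local `fls_model()` in place of the kernel's `__fls()` and assumes a 64-bit `unsigned long` and an illustrative counter count of 64; it contrasts the old check with the fixed one.

```c
#include <assert.h>

/* Local stand-in for __fls(): index of the highest set bit. */
static int fls_model(unsigned long x)
{
	int r = -1;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

/* Old check: returns 1 ("valid") if base + __fls(mask) < nr_counters. */
static int validate_old(unsigned long base, unsigned long mask, unsigned long nr)
{
	return mask && (base + fls_model(mask) < nr);
}

/* Fixed check: additionally rejects an out-of-range base before adding. */
static int validate_new(unsigned long base, unsigned long mask, unsigned long nr)
{
	return mask && base < nr && (base + fls_model(mask) < nr);
}
```

With the fuzzer's inputs, `0xfffffffffffffffe + 15` wraps to 13, which slips under the counter count; validating `ctr_base` first closes the hole.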
Fixes: 0cb74b65d2e5e6 ("RISC-V: KVM: Implement perf support without sampling") Signed-off-by: Jiakai Xu Signed-off-by: Jiakai Xu Reviewed-by: Nutty Liu Reviewed-by: Atish Patra Link: https://lore.kernel.org/r/20260319035902.924661-1-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel --- arch/riscv/kvm/vcpu_pmu.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c index bb6380ec7fc4..49427094a079 100644 --- a/arch/riscv/kvm/vcpu_pmu.c +++ b/arch/riscv/kvm/vcpu_pmu.c @@ -280,8 +280,10 @@ static int pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx, static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ctr_base, unsigned long ctr_mask) { - /* Make sure the we have a valid counter mask requested from the caller */ - if (!ctr_mask || (ctr_base + __fls(ctr_mask) >= kvm_pmu_num_counters(kvpmu))) + unsigned long num_ctrs = kvm_pmu_num_counters(kvpmu); + + /* Make sure we have a valid counter mask requested from the caller */ + if (!ctr_mask || ctr_base >= num_ctrs || (ctr_base + __fls(ctr_mask) >= num_ctrs)) return -EINVAL; return 0; From a216e24fc947573bfbd56471bd7c1f1d8c7a2b19 Mon Sep 17 00:00:00 2001 From: Wang Yechao Date: Mon, 30 Mar 2026 16:10:52 +0800 Subject: [PATCH 234/373] RISC-V: KVM: Fix lost write protection on huge pages during dirty logging When enabling dirty log in small chunks (e.g., QEMU default chunk size of 256K), the chunk size is always smaller than the page size of huge pages (1G or 2M) used in the gstage page tables. This caused the write protection to be incorrectly skipped for huge PTEs because the condition `(end - addr) >= page_size` was not satisfied. Remove the size check in `kvm_riscv_gstage_wp_range()` to ensure huge PTEs are always write-protected regardless of the chunk size. 
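The removed condition can be demonstrated directly: with a 2M huge PTE and a 256K dirty-log chunk, `(end - addr) >= page_size` never holds, so write protection was silently skipped. Sizes below are illustrative.

```c
#include <assert.h>

#define SZ_2M	(2UL << 20)
#define SZ_256K	(256UL << 10)

/* Models the old gate in kvm_riscv_gstage_wp_range(): a huge PTE was only
 * write-protected when the walk range covered the entire huge page. */
static int old_would_wp(unsigned long addr, unsigned long end,
			unsigned long page_size)
{
	return !(addr & (page_size - 1)) && (end - addr) >= page_size;
}
```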
Additionally, explicitly align the address down to the page size before invoking `kvm_riscv_gstage_op_pte()` to guarantee that the address passed to the operation function is page-aligned. This fixes the issue where dirty pages might not be tracked correctly when using huge pages. Fixes: 9d05c1fee837 ("RISC-V: KVM: Implement stage2 page table programming") Signed-off-by: Wang Yechao Reviewed-by: Nutty Liu Reviewed-by: Anup Patel Link: https://lore.kernel.org/r/202603301610527120YZ-pAJY6x9SBpSRo1Wg4@zte.com.cn Signed-off-by: Anup Patel --- arch/riscv/kvm/gstage.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/riscv/kvm/gstage.c b/arch/riscv/kvm/gstage.c index b67d60d722c2..d2001d508046 100644 --- a/arch/riscv/kvm/gstage.c +++ b/arch/riscv/kvm/gstage.c @@ -304,10 +304,9 @@ void kvm_riscv_gstage_wp_range(struct kvm_gstage *gstage, gpa_t start, gpa_t end if (!found_leaf) goto next; - if (!(addr & (page_size - 1)) && ((end - addr) >= page_size)) - kvm_riscv_gstage_op_pte(gstage, addr, ptep, - ptep_level, GSTAGE_OP_WP); - + addr = ALIGN_DOWN(addr, page_size); + kvm_riscv_gstage_op_pte(gstage, addr, ptep, + ptep_level, GSTAGE_OP_WP); next: addr += page_size; } From 6ad36f39a7691bb59d2486efd467710fcbebee62 Mon Sep 17 00:00:00 2001 From: Wang Yechao Date: Mon, 30 Mar 2026 16:12:58 +0800 Subject: [PATCH 235/373] RISC-V: KVM: Split huge pages during fault handling for dirty logging During dirty logging, all huge pages are write-protected. When the guest writes to a write-protected huge page, a page fault is triggered. Before recovering the write permission, the huge page must be split into smaller pages (e.g., 4K). After splitting, the normal mapping process proceeds, allowing write permission to be restored at the smaller page granularity. 
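The split programs each child entry by offsetting the huge page's frame number. A minimal model of that arithmetic, with the PTE simplified to a bare PFN and page sizes chosen for illustration:

```c
#include <assert.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Simplified model of make_child_pte(): child i of a split mapping starts
 * at huge_pfn + i * (child_page_size / PAGE_SIZE). Splitting 2M into 4K
 * children advances by 1 PFN per index; 1G into 2M advances by 512. */
static unsigned long child_pfn(unsigned long huge_pfn, int index,
			       unsigned long child_page_size)
{
	return huge_pfn + index * (child_page_size / PAGE_SIZE);
}
```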
If dirty logging is disabled because migration failed or was cancelled, only recover the write permission at the 4K level, and skip recovering the huge page mapping at this time to avoid the overhead of freeing page tables. The huge page mapping can be recovered in the ioctl context, similar to x86, in a later patch. Signed-off-by: Wang Yechao Reviewed-by: Anup Patel Link: https://lore.kernel.org/r/202603301612587174XZ6QMCrymBqv30S6BN50@zte.com.cn Signed-off-by: Anup Patel --- arch/riscv/include/asm/kvm_gstage.h | 4 + arch/riscv/kvm/gstage.c | 126 ++++++++++++++++++++++++++++ 2 files changed, 130 insertions(+) diff --git a/arch/riscv/include/asm/kvm_gstage.h b/arch/riscv/include/asm/kvm_gstage.h index 595e2183173e..a89d1422cc84 100644 --- a/arch/riscv/include/asm/kvm_gstage.h +++ b/arch/riscv/include/asm/kvm_gstage.h @@ -53,6 +53,10 @@ int kvm_riscv_gstage_map_page(struct kvm_gstage *gstage, bool page_rdonly, bool page_exec, struct kvm_gstage_mapping *out_map); +int kvm_riscv_gstage_split_huge(struct kvm_gstage *gstage, + struct kvm_mmu_memory_cache *pcache, + gpa_t addr, u32 target_level, bool flush); + enum kvm_riscv_gstage_op { GSTAGE_OP_NOP = 0, /* Nothing */ GSTAGE_OP_CLEAR, /* Clear/Unmap */ diff --git a/arch/riscv/kvm/gstage.c b/arch/riscv/kvm/gstage.c index d2001d508046..ffec3e5ddcaf 100644 --- a/arch/riscv/kvm/gstage.c +++ b/arch/riscv/kvm/gstage.c @@ -163,13 +163,32 @@ int kvm_riscv_gstage_set_pte(struct kvm_gstage *gstage, return 0; } +static void kvm_riscv_gstage_update_pte_prot(struct kvm_gstage *gstage, u32 level, + gpa_t addr, pte_t *ptep, pgprot_t prot) +{ + pte_t new_pte; + + if (pgprot_val(pte_pgprot(ptep_get(ptep))) == pgprot_val(prot)) + return; + + new_pte = pfn_pte(pte_pfn(ptep_get(ptep)), prot); + new_pte = pte_mkdirty(new_pte); + + set_pte(ptep, new_pte); + + gstage_tlb_flush(gstage, level, addr); +} + int kvm_riscv_gstage_map_page(struct kvm_gstage *gstage, struct kvm_mmu_memory_cache *pcache, gpa_t gpa, phys_addr_t hpa, unsigned long 
page_size, bool page_rdonly, bool page_exec, struct kvm_gstage_mapping *out_map) { + bool found_leaf; + u32 ptep_level; pgprot_t prot; + pte_t *ptep; int ret; out_map->addr = gpa; @@ -203,12 +222,119 @@ int kvm_riscv_gstage_map_page(struct kvm_gstage *gstage, else prot = PAGE_WRITE; } + + found_leaf = kvm_riscv_gstage_get_leaf(gstage, gpa, &ptep, &ptep_level); + if (found_leaf) { + /* + * ptep_level is the current gstage mapping level of addr, out_map->level + * is the required mapping level during fault handling. + * + * 1) ptep_level > out_map->level + * This happens when dirty logging is enabled and huge pages are used. + * KVM must track the pages at 4K level, and split the huge mapping + * into 4K mappings. + * + * 2) ptep_level < out_map->level + * This happens when dirty logging is disabled and huge pages are used. + * The gstage is split into 4K mappings, but the out_map level is now + * back to the huge page level. Ignore the out_map level this time, and + * just update the pte prot here. Otherwise, we would fall back to mapping + * the gstage at huge page level in `kvm_riscv_gstage_set_pte`, with the + * overhead of freeing the page tables(not support now), which would slow + * down the vCPUs' performance. + * + * It is better to recover the huge page mapping in the ioctl context when + * disabling dirty logging. + * + * 3) ptep_level == out_map->level + * We already have the ptep, just update the pte prot if the pfn not change. + * There is no need to invoke `kvm_riscv_gstage_set_pte` again. 
+ */ + if (ptep_level > out_map->level) { + kvm_riscv_gstage_split_huge(gstage, pcache, gpa, + out_map->level, true); + } else if (ALIGN_DOWN(PFN_PHYS(pte_pfn(ptep_get(ptep))), page_size) == hpa) { + kvm_riscv_gstage_update_pte_prot(gstage, ptep_level, gpa, ptep, prot); + return 0; + } + } + out_map->pte = pfn_pte(PFN_DOWN(hpa), prot); out_map->pte = pte_mkdirty(out_map->pte); return kvm_riscv_gstage_set_pte(gstage, pcache, out_map); } +static inline unsigned long make_child_pte(unsigned long huge_pte, int index, + unsigned long child_page_size) +{ + unsigned long child_pte = huge_pte; + unsigned long child_pfn_offset; + + /* + * The child_pte already has the base address of the huge page being + * split. So we just have to OR in the offset to the page at the next + * lower level for the given index. + */ + child_pfn_offset = index * (child_page_size / PAGE_SIZE); + child_pte |= pte_val(pfn_pte(child_pfn_offset, __pgprot(0))); + + return child_pte; +} + +int kvm_riscv_gstage_split_huge(struct kvm_gstage *gstage, + struct kvm_mmu_memory_cache *pcache, + gpa_t addr, u32 target_level, bool flush) +{ + u32 current_level = kvm_riscv_gstage_pgd_levels - 1; + pte_t *next_ptep = (pte_t *)gstage->pgd; + unsigned long huge_pte, child_pte; + unsigned long child_page_size; + pte_t *ptep; + int i, ret; + + if (!pcache) + return -ENOMEM; + + while(current_level > target_level) { + ptep = (pte_t *)&next_ptep[gstage_pte_index(addr, current_level)]; + + if (!pte_val(ptep_get(ptep))) + break; + + if (!gstage_pte_leaf(ptep)) { + next_ptep = (pte_t *)gstage_pte_page_vaddr(ptep_get(ptep)); + current_level--; + continue; + } + + huge_pte = pte_val(ptep_get(ptep)); + + ret = gstage_level_to_page_size(current_level - 1, &child_page_size); + if (ret) + return ret; + + next_ptep = kvm_mmu_memory_cache_alloc(pcache); + if (!next_ptep) + return -ENOMEM; + + for (i = 0; i < PTRS_PER_PTE; i++) { + child_pte = make_child_pte(huge_pte, i, child_page_size); + set_pte((pte_t *)&next_ptep[i], 
__pte(child_pte)); + } + + set_pte(ptep, pfn_pte(PFN_DOWN(__pa(next_ptep)), + __pgprot(_PAGE_TABLE))); + + if (flush) + gstage_tlb_flush(gstage, current_level, addr); + + current_level--; + } + + return 0; +} + void kvm_riscv_gstage_op_pte(struct kvm_gstage *gstage, gpa_t addr, pte_t *ptep, u32 ptep_level, enum kvm_riscv_gstage_op op) { From 660b208e8b5ea0f5a68a8333e18960d89d484a27 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:02 +0100 Subject: [PATCH 236/373] KVM: arm64: Remove unused PKVM_ID_FFA definition Commit 7cbf7c37718e ("KVM: arm64: Drop pkvm_mem_transition for host/hyp sharing") removed the last users of PKVM_ID_FFA, so drop the definition altogether. Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-2-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h index 5f9d56754e39..7f25f2bca90c 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h @@ -27,7 +27,6 @@ extern struct host_mmu host_mmu; enum pkvm_component_id { PKVM_ID_HOST, PKVM_ID_HYP, - PKVM_ID_FFA, }; extern unsigned long hyp_nr_cpus; From 5e66f723d4de432a5acb481293d81dc88c7f61a4 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:03 +0100 Subject: [PATCH 237/373] KVM: arm64: Don't leak stage-2 page-table if VM fails to init under pKVM If pkvm_init_host_vm() fails, we should free the stage-2 page-table previously allocated by kvm_init_stage2_mmu(). 
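The fix restores the usual kernel unwind shape: every successful setup step gets a label, and a later failure jumps to the label that tears down everything allocated so far. A hedged sketch with stand-in functions instrumented to count the cleanups:

```c
#include <assert.h>

/* Illustrative model of the kvm_arch_init_vm() error path: after the fix,
 * a pkvm_init_host_vm() failure unwinds the stage-2 MMU *and* the cpumask
 * instead of leaking the page-table. All names here are stand-ins. */
static int mmu_freed, cpumask_freed;

static void uninit_stage2_mmu(void) { mmu_freed++; }
static void free_cpumask(void)      { cpumask_freed++; }

static int init_vm_model(int pkvm_init_fails)
{
	/* ...stage-2 MMU init and cpumask allocation succeed... */
	if (pkvm_init_fails)
		goto err_uninit_mmu;	/* was err_free_cpumask: leaked the MMU */
	return 0;

err_uninit_mmu:
	uninit_stage2_mmu();
	free_cpumask();			/* falls through to the older cleanup */
	return -1;
}
```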
Cc: Fuad Tabba Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Fixes: 07aeb70707b1 ("KVM: arm64: Reserve pKVM handle during pkvm_init_host_vm()") Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-3-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/arm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 410ffd41fd73..3589fc08266c 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -236,7 +236,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) */ ret = pkvm_init_host_vm(kvm); if (ret) - goto err_free_cpumask; + goto err_uninit_mmu; } kvm_vgic_early_init(kvm); @@ -252,6 +252,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) return 0; +err_uninit_mmu: + kvm_uninit_stage2_mmu(kvm); err_free_cpumask: free_cpumask_var(kvm->arch.supported_cpus); err_unshare_kvm: From 9f02deef471e1b6637a6641ce6bf9e2a1dd7d2c1 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:04 +0100 Subject: [PATCH 238/373] KVM: arm64: Move handle check into pkvm_pgtable_stage2_destroy_range() When pKVM is enabled, a VM has a 'handle' allocated by the hypervisor in kvm_arch_init_vm() and released later by kvm_arch_destroy_vm(). Consequently, the only time __pkvm_pgtable_stage2_unmap() can run into an uninitialised 'handle' is on the kvm_arch_init_vm() failure path, where we destroy the empty stage-2 page-table if we fail to allocate a handle. Move the handle check into pkvm_pgtable_stage2_destroy_range(), which will additionally handle protected VMs in subsequent patches. 
Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-4-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/pkvm.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index d7a0f69a9982..7797813f4dbe 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -329,9 +329,6 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e struct pkvm_mapping *mapping; int ret; - if (!handle) - return 0; - for_each_mapping_in_range_safe(pgt, start, end, mapping) { ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_guest, handle, mapping->gfn, mapping->nr_pages); @@ -347,6 +344,12 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, u64 addr, u64 size) { + struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); + pkvm_handle_t handle = kvm->arch.pkvm.handle; + + if (!handle) + return; + __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size); } From ff1e7f9897947ea79ece7eb587c8d93957af5ee8 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:05 +0100 Subject: [PATCH 239/373] KVM: arm64: Rename __pkvm_pgtable_stage2_unmap() In preparation for adding support for protected VMs, where pages are donated rather than shared, rename __pkvm_pgtable_stage2_unmap() to __pkvm_pgtable_stage2_unshare() to make it clearer about what is going on. 
Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-5-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/pkvm.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index 7797813f4dbe..42f6e50825ac 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -322,7 +322,7 @@ int pkvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu, return 0; } -static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 end) +static int __pkvm_pgtable_stage2_unshare(struct kvm_pgtable *pgt, u64 start, u64 end) { struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); pkvm_handle_t handle = kvm->arch.pkvm.handle; @@ -350,7 +350,7 @@ void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, if (!handle) return; - __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size); + __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); } void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt) @@ -386,7 +386,7 @@ int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, return -EAGAIN; /* Remove _any_ pkvm_mapping overlapping with the range, bigger or smaller. 
*/ - ret = __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size); + ret = __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); if (ret) return ret; mapping = NULL; @@ -409,7 +409,7 @@ int pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size) { lockdep_assert_held_write(&kvm_s2_mmu_to_kvm(pgt->mmu)->mmu_lock); - return __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size); + return __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); } int pkvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size) From be3473c338d23c87485d7728818005f32b58009f Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:06 +0100 Subject: [PATCH 240/373] KVM: arm64: Don't advertise unsupported features for protected guests Both SVE and PMUv3 are treated as "restricted" features for protected guests and attempts to access their corresponding architectural state from a protected guest result in an undefined exception being injected by the hypervisor. Since these exceptions are unexpected and typically fatal for the guest, don't advertise these features for protected guests. 
Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-6-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_pkvm.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h index 757076ad4ec9..7041e398fb4c 100644 --- a/arch/arm64/include/asm/kvm_pkvm.h +++ b/arch/arm64/include/asm/kvm_pkvm.h @@ -40,8 +40,6 @@ static inline bool kvm_pkvm_ext_allowed(struct kvm *kvm, long ext) case KVM_CAP_MAX_VCPU_ID: case KVM_CAP_MSI_DEVID: case KVM_CAP_ARM_VM_IPA_SIZE: - case KVM_CAP_ARM_PMU_V3: - case KVM_CAP_ARM_SVE: case KVM_CAP_ARM_PTRAUTH_ADDRESS: case KVM_CAP_ARM_PTRAUTH_GENERIC: return true; From 3a81a814437d187ac893b3d937985d6503ff3841 Mon Sep 17 00:00:00 2001 From: Fuad Tabba Date: Mon, 30 Mar 2026 15:48:07 +0100 Subject: [PATCH 241/373] KVM: arm64: Expose self-hosted debug regs as RAZ/WI for protected guests Debug and trace are not currently supported for protected guests, so trap accesses to the related registers and emulate them as RAZ/WI for now. Although this isn't strictly compatible with the architecture, it's sufficient for Linux guests and means that debug support can be added later on. Tested-by: Mostafa Saleh Signed-off-by: Fuad Tabba Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-7-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/sys_regs.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/arch/arm64/kvm/hyp/nvhe/sys_regs.c b/arch/arm64/kvm/hyp/nvhe/sys_regs.c index 06d28621722e..0a84140afa28 100644 --- a/arch/arm64/kvm/hyp/nvhe/sys_regs.c +++ b/arch/arm64/kvm/hyp/nvhe/sys_regs.c @@ -392,6 +392,14 @@ static const struct sys_reg_desc pvm_sys_reg_descs[] = { /* Cache maintenance by set/way operations are restricted. */ /* Debug and Trace Registers are restricted. 
*/ + RAZ_WI(SYS_DBGBVRn_EL1(0)), + RAZ_WI(SYS_DBGBCRn_EL1(0)), + RAZ_WI(SYS_DBGWVRn_EL1(0)), + RAZ_WI(SYS_DBGWCRn_EL1(0)), + RAZ_WI(SYS_MDSCR_EL1), + RAZ_WI(SYS_OSLAR_EL1), + RAZ_WI(SYS_OSLSR_EL1), + RAZ_WI(SYS_OSDLR_EL1), /* Group 1 ID registers */ HOST_HANDLED(SYS_REVIDR_EL1), From 733774b5204553ab5524c0330f262184d9d573f6 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:08 +0100 Subject: [PATCH 242/373] KVM: arm64: Remove is_protected_kvm_enabled() checks from hypercalls When pKVM is not enabled, the host shouldn't issue pKVM-specific hypercalls and so there's no point checking for this in the pKVM hypercall handlers. Remove the redundant is_protected_kvm_enabled() checks from each hypercall and instead rejig the hypercall table so that the pKVM-specific hypercalls are unreachable when pKVM is not being used. Reviewed-by: Quentin Perret Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-8-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 24 +++++++----- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 63 ++++++++++-------------------- 2 files changed, 34 insertions(+), 53 deletions(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index a1ad12c72ebf..7b72aac4730d 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -51,7 +51,7 @@ #include enum __kvm_host_smccc_func { - /* Hypercalls available only prior to pKVM finalisation */ + /* Hypercalls that are unavailable once pKVM has finalised. 
*/ /* __KVM_HOST_SMCCC_FUNC___kvm_hyp_init */ __KVM_HOST_SMCCC_FUNC___pkvm_init = __KVM_HOST_SMCCC_FUNC___kvm_hyp_init + 1, __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping, @@ -60,16 +60,9 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___vgic_v3_init_lrs, __KVM_HOST_SMCCC_FUNC___vgic_v3_get_gic_config, __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize, + __KVM_HOST_SMCCC_FUNC_MIN_PKVM = __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize, - /* Hypercalls available after pKVM finalisation */ - __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp, - __KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_hyp, - __KVM_HOST_SMCCC_FUNC___pkvm_host_share_guest, - __KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_guest, - __KVM_HOST_SMCCC_FUNC___pkvm_host_relax_perms_guest, - __KVM_HOST_SMCCC_FUNC___pkvm_host_wrprotect_guest, - __KVM_HOST_SMCCC_FUNC___pkvm_host_test_clear_young_guest, - __KVM_HOST_SMCCC_FUNC___pkvm_host_mkyoung_guest, + /* Hypercalls that are always available and common to [nh]VHE/pKVM. */ __KVM_HOST_SMCCC_FUNC___kvm_adjust_pc, __KVM_HOST_SMCCC_FUNC___kvm_vcpu_run, __KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context, @@ -81,6 +74,17 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff, __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs, __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_vmcr_aprs, + __KVM_HOST_SMCCC_FUNC_MAX_NO_PKVM = __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_vmcr_aprs, + + /* Hypercalls that are available only when pKVM has finalised. 
*/ + __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp, + __KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_hyp, + __KVM_HOST_SMCCC_FUNC___pkvm_host_share_guest, + __KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_guest, + __KVM_HOST_SMCCC_FUNC___pkvm_host_relax_perms_guest, + __KVM_HOST_SMCCC_FUNC___pkvm_host_wrprotect_guest, + __KVM_HOST_SMCCC_FUNC___pkvm_host_test_clear_young_guest, + __KVM_HOST_SMCCC_FUNC___pkvm_host_mkyoung_guest, __KVM_HOST_SMCCC_FUNC___pkvm_reserve_vm, __KVM_HOST_SMCCC_FUNC___pkvm_unreserve_vm, __KVM_HOST_SMCCC_FUNC___pkvm_init_vm, diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index e7790097db93..127decc2dd2b 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -169,9 +169,6 @@ static void handle___pkvm_vcpu_load(struct kvm_cpu_context *host_ctxt) DECLARE_REG(u64, hcr_el2, host_ctxt, 3); struct pkvm_hyp_vcpu *hyp_vcpu; - if (!is_protected_kvm_enabled()) - return; - hyp_vcpu = pkvm_load_hyp_vcpu(handle, vcpu_idx); if (!hyp_vcpu) return; @@ -188,12 +185,8 @@ static void handle___pkvm_vcpu_load(struct kvm_cpu_context *host_ctxt) static void handle___pkvm_vcpu_put(struct kvm_cpu_context *host_ctxt) { - struct pkvm_hyp_vcpu *hyp_vcpu; + struct pkvm_hyp_vcpu *hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); - if (!is_protected_kvm_enabled()) - return; - - hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); if (hyp_vcpu) pkvm_put_hyp_vcpu(hyp_vcpu); } @@ -257,9 +250,6 @@ static void handle___pkvm_host_share_guest(struct kvm_cpu_context *host_ctxt) struct pkvm_hyp_vcpu *hyp_vcpu; int ret = -EINVAL; - if (!is_protected_kvm_enabled()) - goto out; - hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu)) goto out; @@ -281,9 +271,6 @@ static void handle___pkvm_host_unshare_guest(struct kvm_cpu_context *host_ctxt) struct pkvm_hyp_vm *hyp_vm; int ret = -EINVAL; - if (!is_protected_kvm_enabled()) - goto out; - hyp_vm = get_np_pkvm_hyp_vm(handle); if (!hyp_vm) goto out; @@ -301,9 +288,6 
@@ static void handle___pkvm_host_relax_perms_guest(struct kvm_cpu_context *host_ct struct pkvm_hyp_vcpu *hyp_vcpu; int ret = -EINVAL; - if (!is_protected_kvm_enabled()) - goto out; - hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu)) goto out; @@ -321,9 +305,6 @@ static void handle___pkvm_host_wrprotect_guest(struct kvm_cpu_context *host_ctxt struct pkvm_hyp_vm *hyp_vm; int ret = -EINVAL; - if (!is_protected_kvm_enabled()) - goto out; - hyp_vm = get_np_pkvm_hyp_vm(handle); if (!hyp_vm) goto out; @@ -343,9 +324,6 @@ static void handle___pkvm_host_test_clear_young_guest(struct kvm_cpu_context *ho struct pkvm_hyp_vm *hyp_vm; int ret = -EINVAL; - if (!is_protected_kvm_enabled()) - goto out; - hyp_vm = get_np_pkvm_hyp_vm(handle); if (!hyp_vm) goto out; @@ -362,9 +340,6 @@ static void handle___pkvm_host_mkyoung_guest(struct kvm_cpu_context *host_ctxt) struct pkvm_hyp_vcpu *hyp_vcpu; int ret = -EINVAL; - if (!is_protected_kvm_enabled()) - goto out; - hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu)) goto out; @@ -424,12 +399,8 @@ static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt) static void handle___pkvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt) { DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); - struct pkvm_hyp_vm *hyp_vm; + struct pkvm_hyp_vm *hyp_vm = get_np_pkvm_hyp_vm(handle); - if (!is_protected_kvm_enabled()) - return; - - hyp_vm = get_np_pkvm_hyp_vm(handle); if (!hyp_vm) return; @@ -603,14 +574,6 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__vgic_v3_get_gic_config), HANDLE_FUNC(__pkvm_prot_finalize), - HANDLE_FUNC(__pkvm_host_share_hyp), - HANDLE_FUNC(__pkvm_host_unshare_hyp), - HANDLE_FUNC(__pkvm_host_share_guest), - HANDLE_FUNC(__pkvm_host_unshare_guest), - HANDLE_FUNC(__pkvm_host_relax_perms_guest), - HANDLE_FUNC(__pkvm_host_wrprotect_guest), - HANDLE_FUNC(__pkvm_host_test_clear_young_guest), - 
HANDLE_FUNC(__pkvm_host_mkyoung_guest), HANDLE_FUNC(__kvm_adjust_pc), HANDLE_FUNC(__kvm_vcpu_run), HANDLE_FUNC(__kvm_flush_vm_context), @@ -622,6 +585,15 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__kvm_timer_set_cntvoff), HANDLE_FUNC(__vgic_v3_save_aprs), HANDLE_FUNC(__vgic_v3_restore_vmcr_aprs), + + HANDLE_FUNC(__pkvm_host_share_hyp), + HANDLE_FUNC(__pkvm_host_unshare_hyp), + HANDLE_FUNC(__pkvm_host_share_guest), + HANDLE_FUNC(__pkvm_host_unshare_guest), + HANDLE_FUNC(__pkvm_host_relax_perms_guest), + HANDLE_FUNC(__pkvm_host_wrprotect_guest), + HANDLE_FUNC(__pkvm_host_test_clear_young_guest), + HANDLE_FUNC(__pkvm_host_mkyoung_guest), HANDLE_FUNC(__pkvm_reserve_vm), HANDLE_FUNC(__pkvm_unreserve_vm), HANDLE_FUNC(__pkvm_init_vm), @@ -635,7 +607,7 @@ static const hcall_t host_hcall[] = { static void handle_host_hcall(struct kvm_cpu_context *host_ctxt) { DECLARE_REG(unsigned long, id, host_ctxt, 0); - unsigned long hcall_min = 0; + unsigned long hcall_min = 0, hcall_max = -1; hcall_t hfn; /* @@ -647,14 +619,19 @@ static void handle_host_hcall(struct kvm_cpu_context *host_ctxt) * basis. This is all fine, however, since __pkvm_prot_finalize * returns -EPERM after the first call for a given CPU. 
*/ - if (static_branch_unlikely(&kvm_protected_mode_initialized)) - hcall_min = __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize; + if (static_branch_unlikely(&kvm_protected_mode_initialized)) { + hcall_min = __KVM_HOST_SMCCC_FUNC_MIN_PKVM; + } else { + hcall_max = __KVM_HOST_SMCCC_FUNC_MAX_NO_PKVM; + } id &= ~ARM_SMCCC_CALL_HINTS; id -= KVM_HOST_SMCCC_ID(0); - if (unlikely(id < hcall_min || id >= ARRAY_SIZE(host_hcall))) + if (unlikely(id < hcall_min || id > hcall_max || + id >= ARRAY_SIZE(host_hcall))) { goto inval; + } hfn = host_hcall[id]; if (unlikely(!hfn)) From 7250533ad2c1f0e49567077e9b0f66b0349b357e Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:09 +0100 Subject: [PATCH 243/373] KVM: arm64: Ignore MMU notifier callbacks for protected VMs In preparation for supporting the donation of pinned pages to protected VMs, return early from the MMU notifiers when called for a protected VM, as the necessary hypercalls are exposed only for non-protected guests. Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-9-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 9 ++++++--- arch/arm64/kvm/pkvm.c | 19 ++++++++++++++++++- 2 files changed, 24 insertions(+), 4 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 17d64a1e11e5..5e7821fe0fc4 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -340,6 +340,9 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size, bool may_block) { + if (kvm_vm_is_protected(kvm_s2_mmu_to_kvm(mmu))) + return; + __unmap_stage2_range(mmu, start, size, may_block); } @@ -2223,7 +2226,7 @@ out_unlock: bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) { - if (!kvm->arch.mmu.pgt) + if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm)) return false; 
__unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT, @@ -2238,7 +2241,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { u64 size = (range->end - range->start) << PAGE_SHIFT; - if (!kvm->arch.mmu.pgt) + if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm)) return false; return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt, @@ -2254,7 +2257,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { u64 size = (range->end - range->start) << PAGE_SHIFT; - if (!kvm->arch.mmu.pgt) + if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm)) return false; return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt, diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index 42f6e50825ac..dd93dfdfe52d 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -407,7 +407,12 @@ int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, int pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size) { - lockdep_assert_held_write(&kvm_s2_mmu_to_kvm(pgt->mmu)->mmu_lock); + struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); + + if (WARN_ON(kvm_vm_is_protected(kvm))) + return -EPERM; + + lockdep_assert_held_write(&kvm->mmu_lock); return __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); } @@ -419,6 +424,9 @@ int pkvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size) struct pkvm_mapping *mapping; int ret = 0; + if (WARN_ON(kvm_vm_is_protected(kvm))) + return -EPERM; + lockdep_assert_held(&kvm->mmu_lock); for_each_mapping_in_range_safe(pgt, addr, addr + size, mapping) { ret = kvm_call_hyp_nvhe(__pkvm_host_wrprotect_guest, handle, mapping->gfn, @@ -450,6 +458,9 @@ bool pkvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr, u64 struct pkvm_mapping *mapping; bool young = false; + if (WARN_ON(kvm_vm_is_protected(kvm))) + return false; + lockdep_assert_held(&kvm->mmu_lock); for_each_mapping_in_range_safe(pgt, addr, addr + size, mapping) young 
|= kvm_call_hyp_nvhe(__pkvm_host_test_clear_young_guest, handle, mapping->gfn, @@ -461,12 +472,18 @@ bool pkvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr, u64 int pkvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr, enum kvm_pgtable_prot prot, enum kvm_pgtable_walk_flags flags) { + if (WARN_ON(kvm_vm_is_protected(kvm_s2_mmu_to_kvm(pgt->mmu)))) + return -EPERM; + return kvm_call_hyp_nvhe(__pkvm_host_relax_perms_guest, addr >> PAGE_SHIFT, prot); } void pkvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr, enum kvm_pgtable_walk_flags flags) { + if (WARN_ON(kvm_vm_is_protected(kvm_s2_mmu_to_kvm(pgt->mmu)))) + return; + WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_mkyoung_guest, addr >> PAGE_SHIFT)); } From f0877a1455cc6a93be14e4da741ce26ac0d6ca6d Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:10 +0100 Subject: [PATCH 244/373] KVM: arm64: Prevent unsupported memslot operations on protected VMs Protected VMs do not support deleting or moving memslots after first run, nor do they support read-only memslots or dirty logging. Return -EPERM to userspace if such an operation is attempted. Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-10-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 13 +++++++++++++ arch/arm64/kvm/pkvm.c | 6 ++++++ 2 files changed, 19 insertions(+) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 5e7821fe0fc4..b3cc5dfe5723 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -2414,6 +2414,19 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, hva_t hva, reg_end; int ret = 0; + if (kvm_vm_is_protected(kvm)) { + /* Cannot modify memslots once a pVM has run. 
*/ + if (pkvm_hyp_vm_is_created(kvm) && + (change == KVM_MR_DELETE || change == KVM_MR_MOVE)) { + return -EPERM; + } + + if (new && + new->flags & (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)) { + return -EPERM; + } + } + if (change != KVM_MR_CREATE && change != KVM_MR_MOVE && change != KVM_MR_FLAGS_ONLY) return 0; diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index dd93dfdfe52d..94b70eb80aa8 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -192,10 +192,16 @@ int pkvm_create_hyp_vm(struct kvm *kvm) { int ret = 0; + /* + * Synchronise with kvm_arch_prepare_memory_region(), as we + * prevent memslot modifications on a pVM that has been run. + */ + mutex_lock(&kvm->slots_lock); mutex_lock(&kvm->arch.config_lock); if (!pkvm_hyp_vm_is_created(kvm)) ret = __pkvm_create_hyp_vm(kvm); mutex_unlock(&kvm->arch.config_lock); + mutex_unlock(&kvm->slots_lock); return ret; } From 73c55be08932a348f8b0a44f561c33eaa2cf1ad2 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:11 +0100 Subject: [PATCH 245/373] KVM: arm64: Ignore -EAGAIN when mapping in pages for the pKVM host If the host takes a stage-2 translation fault on two CPUs at the same time, one of them will get back -EAGAIN from the page-table mapping code when it runs into the mapping installed by the other. Rather than handle this explicitly in handle_host_mem_abort(), pass the new KVM_PGTABLE_WALK_IGNORE_EAGAIN flag to kvm_pgtable_stage2_map() from __host_stage2_idmap() and return -EEXIST if host_stage2_adjust_range() finds a valid pte. This will avoid having to test for -EAGAIN on the reclaim path in subsequent patches. 
Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-11-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 21 ++++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index d815265bd374..7d22893ab1dc 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -461,8 +461,15 @@ static bool range_is_memory(u64 start, u64 end) static inline int __host_stage2_idmap(u64 start, u64 end, enum kvm_pgtable_prot prot) { + /* + * We don't make permission changes to the host idmap after + * initialisation, so we can squash -EAGAIN to save callers + * having to treat it like success in the case that they try to + * map something that is already mapped. + */ return kvm_pgtable_stage2_map(&host_mmu.pgt, start, end - start, start, - prot, &host_s2_pool, 0); + prot, &host_s2_pool, + KVM_PGTABLE_WALK_IGNORE_EAGAIN); } /* @@ -504,7 +511,7 @@ static int host_stage2_adjust_range(u64 addr, struct kvm_mem_range *range) return ret; if (kvm_pte_valid(pte)) - return -EAGAIN; + return -EEXIST; if (pte) { WARN_ON(addr_is_memory(addr) && @@ -609,7 +616,6 @@ void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt) { struct kvm_vcpu_fault_info fault; u64 esr, addr; - int ret = 0; esr = read_sysreg_el2(SYS_ESR); if (!__get_fault_info(esr, &fault)) { @@ -628,8 +634,13 @@ void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt) BUG_ON(!(fault.hpfar_el2 & HPFAR_EL2_NS)); addr = FIELD_GET(HPFAR_EL2_FIPA, fault.hpfar_el2) << 12; - ret = host_stage2_idmap(addr); - BUG_ON(ret && ret != -EAGAIN); + switch (host_stage2_idmap(addr)) { + case -EEXIST: + case 0: + break; + default: + BUG(); + } } struct check_walk_data { From 6c58f914eb9c4ce5225d03183ae1290d72b5f639 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 
2026 15:48:12 +0100 Subject: [PATCH 246/373] KVM: arm64: Split teardown hypercall into two phases In preparation for reclaiming protected guest VM pages from the host during teardown, split the current 'pkvm_teardown_vm' hypercall into separate 'start' and 'finalise' calls. The 'pkvm_start_teardown_vm' hypercall puts the VM into a new 'is_dying' state, which is a point of no return past which no vCPU of the pVM is allowed to run any more. Once in this new state, 'pkvm_finalize_teardown_vm' can be used to reclaim meta-data and page-table pages from the VM. A subsequent patch will add support for reclaiming the individual guest memory pages. Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Co-developed-by: Quentin Perret Signed-off-by: Quentin Perret Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-12-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 3 +- arch/arm64/include/asm/kvm_host.h | 7 ++++ arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 4 ++- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 14 ++++++-- arch/arm64/kvm/hyp/nvhe/pkvm.c | 44 ++++++++++++++++++++++---- arch/arm64/kvm/pkvm.c | 7 +++- 6 files changed, 67 insertions(+), 12 deletions(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index 7b72aac4730d..df6b661701b6 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -89,7 +89,8 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___pkvm_unreserve_vm, __KVM_HOST_SMCCC_FUNC___pkvm_init_vm, __KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu, - __KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm, + __KVM_HOST_SMCCC_FUNC___pkvm_start_teardown_vm, + __KVM_HOST_SMCCC_FUNC___pkvm_finalize_teardown_vm, __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_load, __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_put, __KVM_HOST_SMCCC_FUNC___pkvm_tlb_flush_vmid, diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 70cb9cfd760a..31b9454bb74d 100644 --- 
a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -255,6 +255,13 @@ struct kvm_protected_vm { struct kvm_hyp_memcache stage2_teardown_mc; bool is_protected; bool is_created; + + /* + * True when the guest is being torn down. When in this state, the + * guest's vCPUs can't be loaded anymore, but its pages can be + * reclaimed by the host. + */ + bool is_dying; }; struct kvm_mpidr_data { diff --git a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h index 184ad7a39950..04c7ca703014 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h +++ b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h @@ -73,7 +73,9 @@ int __pkvm_init_vm(struct kvm *host_kvm, unsigned long vm_hva, unsigned long pgd_hva); int __pkvm_init_vcpu(pkvm_handle_t handle, struct kvm_vcpu *host_vcpu, unsigned long vcpu_hva); -int __pkvm_teardown_vm(pkvm_handle_t handle); + +int __pkvm_start_teardown_vm(pkvm_handle_t handle); +int __pkvm_finalize_teardown_vm(pkvm_handle_t handle); struct pkvm_hyp_vcpu *pkvm_load_hyp_vcpu(pkvm_handle_t handle, unsigned int vcpu_idx); diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index 127decc2dd2b..634ea2766240 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -553,11 +553,18 @@ static void handle___pkvm_init_vcpu(struct kvm_cpu_context *host_ctxt) cpu_reg(host_ctxt, 1) = __pkvm_init_vcpu(handle, host_vcpu, vcpu_hva); } -static void handle___pkvm_teardown_vm(struct kvm_cpu_context *host_ctxt) +static void handle___pkvm_start_teardown_vm(struct kvm_cpu_context *host_ctxt) { DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); - cpu_reg(host_ctxt, 1) = __pkvm_teardown_vm(handle); + cpu_reg(host_ctxt, 1) = __pkvm_start_teardown_vm(handle); +} + +static void handle___pkvm_finalize_teardown_vm(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); + + cpu_reg(host_ctxt, 1) = __pkvm_finalize_teardown_vm(handle); } typedef 
void (*hcall_t)(struct kvm_cpu_context *); @@ -598,7 +605,8 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__pkvm_unreserve_vm), HANDLE_FUNC(__pkvm_init_vm), HANDLE_FUNC(__pkvm_init_vcpu), - HANDLE_FUNC(__pkvm_teardown_vm), + HANDLE_FUNC(__pkvm_start_teardown_vm), + HANDLE_FUNC(__pkvm_finalize_teardown_vm), HANDLE_FUNC(__pkvm_vcpu_load), HANDLE_FUNC(__pkvm_vcpu_put), HANDLE_FUNC(__pkvm_tlb_flush_vmid), diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index 2f029bfe4755..61e69e24656a 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -255,7 +255,10 @@ struct pkvm_hyp_vcpu *pkvm_load_hyp_vcpu(pkvm_handle_t handle, hyp_spin_lock(&vm_table_lock); hyp_vm = get_vm_by_handle(handle); - if (!hyp_vm || hyp_vm->kvm.created_vcpus <= vcpu_idx) + if (!hyp_vm || hyp_vm->kvm.arch.pkvm.is_dying) + goto unlock; + + if (hyp_vm->kvm.created_vcpus <= vcpu_idx) goto unlock; hyp_vcpu = hyp_vm->vcpus[vcpu_idx]; @@ -301,8 +304,14 @@ struct pkvm_hyp_vm *get_pkvm_hyp_vm(pkvm_handle_t handle) hyp_spin_lock(&vm_table_lock); hyp_vm = get_vm_by_handle(handle); - if (hyp_vm) + if (!hyp_vm) + goto unlock; + + if (hyp_vm->kvm.arch.pkvm.is_dying) + hyp_vm = NULL; + else hyp_page_ref_inc(hyp_virt_to_page(hyp_vm)); +unlock: hyp_spin_unlock(&vm_table_lock); return hyp_vm; @@ -859,7 +868,32 @@ teardown_donated_memory(struct kvm_hyp_memcache *mc, void *addr, size_t size) unmap_donated_memory_noclear(addr, size); } -int __pkvm_teardown_vm(pkvm_handle_t handle) +int __pkvm_start_teardown_vm(pkvm_handle_t handle) +{ + struct pkvm_hyp_vm *hyp_vm; + int ret = 0; + + hyp_spin_lock(&vm_table_lock); + hyp_vm = get_vm_by_handle(handle); + if (!hyp_vm) { + ret = -ENOENT; + goto unlock; + } else if (WARN_ON(hyp_page_count(hyp_vm))) { + ret = -EBUSY; + goto unlock; + } else if (hyp_vm->kvm.arch.pkvm.is_dying) { + ret = -EINVAL; + goto unlock; + } + + hyp_vm->kvm.arch.pkvm.is_dying = true; +unlock: + hyp_spin_unlock(&vm_table_lock); + + return ret; 
+} + +int __pkvm_finalize_teardown_vm(pkvm_handle_t handle) { struct kvm_hyp_memcache *mc, *stage2_mc; struct pkvm_hyp_vm *hyp_vm; @@ -873,9 +907,7 @@ int __pkvm_teardown_vm(pkvm_handle_t handle) if (!hyp_vm) { err = -ENOENT; goto err_unlock; - } - - if (WARN_ON(hyp_page_count(hyp_vm))) { + } else if (!hyp_vm->kvm.arch.pkvm.is_dying) { err = -EBUSY; goto err_unlock; } diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index 94b70eb80aa8..ea7f267ee7ad 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -88,7 +88,7 @@ void __init kvm_hyp_reserve(void) static void __pkvm_destroy_hyp_vm(struct kvm *kvm) { if (pkvm_hyp_vm_is_created(kvm)) { - WARN_ON(kvm_call_hyp_nvhe(__pkvm_teardown_vm, + WARN_ON(kvm_call_hyp_nvhe(__pkvm_finalize_teardown_vm, kvm->arch.pkvm.handle)); } else if (kvm->arch.pkvm.handle) { /* @@ -356,6 +356,11 @@ void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, if (!handle) return; + if (pkvm_hyp_vm_is_created(kvm) && !kvm->arch.pkvm.is_dying) { + WARN_ON(kvm_call_hyp_nvhe(__pkvm_start_teardown_vm, handle)); + kvm->arch.pkvm.is_dying = true; + } + __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); } From 1e579adca1774b3713d1efa67d92a88ec86c04fa Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:13 +0100 Subject: [PATCH 247/373] KVM: arm64: Introduce __pkvm_host_donate_guest() In preparation for supporting protected VMs, whose memory pages are isolated from the host, introduce a new pKVM hypercall to allow the donation of pages to a guest. 
Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-13-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/include/asm/kvm_pgtable.h | 2 +- arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 ++ arch/arm64/kvm/hyp/nvhe/hyp-main.c | 21 +++++++++++++ arch/arm64/kvm/hyp/nvhe/mem_protect.c | 30 +++++++++++++++++++ 5 files changed, 55 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index df6b661701b6..dfc6625c8269 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -79,6 +79,7 @@ enum __kvm_host_smccc_func { /* Hypercalls that are available only when pKVM has finalised. */ __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp, __KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_hyp, + __KVM_HOST_SMCCC_FUNC___pkvm_host_donate_guest, __KVM_HOST_SMCCC_FUNC___pkvm_host_share_guest, __KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_guest, __KVM_HOST_SMCCC_FUNC___pkvm_host_relax_perms_guest, diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h index c201168f2857..50caca311ef5 100644 --- a/arch/arm64/include/asm/kvm_pgtable.h +++ b/arch/arm64/include/asm/kvm_pgtable.h @@ -100,7 +100,7 @@ typedef u64 kvm_pte_t; KVM_PTE_LEAF_ATTR_HI_S2_XN) #define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2) -#define KVM_MAX_OWNER_ID 1 +#define KVM_MAX_OWNER_ID 2 /* * Used to indicate a pte for which a 'break-before-make' sequence is in diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h index 7f25f2bca90c..7061b0be340a 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h @@ -27,6 +27,7 @@ extern struct host_mmu host_mmu; enum pkvm_component_id { PKVM_ID_HOST, PKVM_ID_HYP, + PKVM_ID_GUEST, }; extern unsigned long hyp_nr_cpus; @@ -38,6 +39,7 @@ int 
__pkvm_host_donate_hyp(u64 pfn, u64 nr_pages); int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages); int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages); int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages); +int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu); int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu *vcpu, enum kvm_pgtable_prot prot); int __pkvm_host_unshare_guest(u64 gfn, u64 nr_pages, struct pkvm_hyp_vm *hyp_vm); diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index 634ea2766240..970656318cf2 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -241,6 +241,26 @@ static int pkvm_refill_memcache(struct pkvm_hyp_vcpu *hyp_vcpu) &host_vcpu->arch.pkvm_memcache); } +static void handle___pkvm_host_donate_guest(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(u64, pfn, host_ctxt, 1); + DECLARE_REG(u64, gfn, host_ctxt, 2); + struct pkvm_hyp_vcpu *hyp_vcpu; + int ret = -EINVAL; + + hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); + if (!hyp_vcpu || !pkvm_hyp_vcpu_is_protected(hyp_vcpu)) + goto out; + + ret = pkvm_refill_memcache(hyp_vcpu); + if (ret) + goto out; + + ret = __pkvm_host_donate_guest(pfn, gfn, hyp_vcpu); +out: + cpu_reg(host_ctxt, 1) = ret; +} + static void handle___pkvm_host_share_guest(struct kvm_cpu_context *host_ctxt) { DECLARE_REG(u64, pfn, host_ctxt, 1); @@ -595,6 +615,7 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__pkvm_host_share_hyp), HANDLE_FUNC(__pkvm_host_unshare_hyp), + HANDLE_FUNC(__pkvm_host_donate_guest), HANDLE_FUNC(__pkvm_host_share_guest), HANDLE_FUNC(__pkvm_host_unshare_guest), HANDLE_FUNC(__pkvm_host_relax_perms_guest), diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 7d22893ab1dc..03e6fa124253 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -971,6 +971,36 @@ static int __guest_check_transition_size(u64 phys, 
u64 ipa, u64 nr_pages, u64 *s return 0; } +int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu) +{ + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); + u64 phys = hyp_pfn_to_phys(pfn); + u64 ipa = hyp_pfn_to_phys(gfn); + int ret; + + host_lock_component(); + guest_lock_component(vm); + + ret = __host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_OWNED); + if (ret) + goto unlock; + + ret = __guest_check_page_state_range(vm, ipa, PAGE_SIZE, PKVM_NOPAGE); + if (ret) + goto unlock; + + WARN_ON(host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_GUEST)); + WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys, + pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED), + &vcpu->vcpu.arch.pkvm_memcache, 0)); + +unlock: + guest_unlock_component(vm); + host_unlock_component(); + + return ret; +} + int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu *vcpu, enum kvm_pgtable_prot prot) { From 5fef16ef49126b0f71fb3e401aae4dca1865e6f9 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:14 +0100 Subject: [PATCH 248/373] KVM: arm64: Hook up donation hypercall to pkvm_pgtable_stage2_map() Mapping pages into a protected guest requires the donation of memory from the host. Extend pkvm_pgtable_stage2_map() to issue a donate hypercall when the target VM is protected. Since the hypercall only handles a single page, the splitting logic used for the share path is not required. 
Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-14-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/pkvm.c | 58 ++++++++++++++++++++++++++++++------------- 1 file changed, 41 insertions(+), 17 deletions(-) diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index ea7f267ee7ad..7d0fe36fd8dc 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -379,31 +379,55 @@ int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, struct kvm_hyp_memcache *cache = mc; u64 gfn = addr >> PAGE_SHIFT; u64 pfn = phys >> PAGE_SHIFT; + u64 end = addr + size; int ret; - if (size != PAGE_SIZE && size != PMD_SIZE) - return -EINVAL; - lockdep_assert_held_write(&kvm->mmu_lock); + mapping = pkvm_mapping_iter_first(&pgt->pkvm_mappings, addr, end - 1); - /* - * Calling stage2_map() on top of existing mappings is either happening because of a race - * with another vCPU, or because we're changing between page and block mappings. As per - * user_mem_abort(), same-size permission faults are handled in the relax_perms() path. - */ - mapping = pkvm_mapping_iter_first(&pgt->pkvm_mappings, addr, addr + size - 1); - if (mapping) { - if (size == (mapping->nr_pages * PAGE_SIZE)) + if (kvm_vm_is_protected(kvm)) { + /* Protected VMs are mapped using RWX page-granular mappings */ + if (WARN_ON_ONCE(size != PAGE_SIZE)) + return -EINVAL; + + if (WARN_ON_ONCE(prot != KVM_PGTABLE_PROT_RWX)) + return -EINVAL; + + /* + * We raced with another vCPU. + */ + if (mapping) return -EAGAIN; - /* Remove _any_ pkvm_mapping overlapping with the range, bigger or smaller. 
*/ - ret = __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); - if (ret) - return ret; - mapping = NULL; + ret = kvm_call_hyp_nvhe(__pkvm_host_donate_guest, pfn, gfn); + } else { + if (WARN_ON_ONCE(size != PAGE_SIZE && size != PMD_SIZE)) + return -EINVAL; + + /* + * We either raced with another vCPU or we're changing between + * page and block mappings. As per user_mem_abort(), same-size + * permission faults are handled in the relax_perms() path. + */ + if (mapping) { + if (size == (mapping->nr_pages * PAGE_SIZE)) + return -EAGAIN; + + /* + * Remove _any_ pkvm_mapping overlapping with the range, + * bigger or smaller. + */ + ret = __pkvm_pgtable_stage2_unshare(pgt, addr, end); + if (ret) + return ret; + + mapping = NULL; + } + + ret = kvm_call_hyp_nvhe(__pkvm_host_share_guest, pfn, gfn, + size / PAGE_SIZE, prot); } - ret = kvm_call_hyp_nvhe(__pkvm_host_share_guest, pfn, gfn, size / PAGE_SIZE, prot); if (WARN_ON(ret)) return ret; From ea03466e806fea942841a41cfaab8db8c851aa71 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:15 +0100 Subject: [PATCH 249/373] KVM: arm64: Handle aborts from protected VMs Introduce a new abort handler for resolving stage-2 page faults from protected VMs by pinning and donating anonymous memory. This is considerably simpler than the infamous user_mem_abort() as we only have to deal with translation faults at the pte level. Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-15-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 89 ++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 81 insertions(+), 8 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index b3cc5dfe5723..6a4151e3e4a3 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1642,6 +1642,74 @@ out_unlock: return ret != -EAGAIN ? 
ret : 0; } +static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, + struct kvm_memory_slot *memslot, unsigned long hva) +{ + unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE; + struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt; + struct mm_struct *mm = current->mm; + struct kvm *kvm = vcpu->kvm; + void *hyp_memcache; + struct page *page; + int ret; + + ret = prepare_mmu_memcache(vcpu, true, &hyp_memcache); + if (ret) + return -ENOMEM; + + ret = account_locked_vm(mm, 1, true); + if (ret) + return ret; + + mmap_read_lock(mm); + ret = pin_user_pages(hva, 1, flags, &page); + mmap_read_unlock(mm); + + if (ret == -EHWPOISON) { + kvm_send_hwpoison_signal(hva, PAGE_SHIFT); + ret = 0; + goto dec_account; + } else if (ret != 1) { + ret = -EFAULT; + goto dec_account; + } else if (!folio_test_swapbacked(page_folio(page))) { + /* + * We really can't deal with page-cache pages returned by GUP + * because (a) we may trigger writeback of a page for which we + * no longer have access and (b) page_mkclean() won't find the + * stage-2 mapping in the rmap so we can get out-of-whack with + * the filesystem when marking the page dirty during unpinning + * (see cc5095747edf ("ext4: don't BUG if someone dirty pages + * without asking ext4 first")). + * + * Ideally we'd just restrict ourselves to anonymous pages, but + * we also want to allow memfd (i.e. shmem) pages, so check for + * pages backed by swap in the knowledge that the GUP pin will + * prevent try_to_unmap() from succeeding. 
+ */ + ret = -EIO; + goto unpin; + } + + write_lock(&kvm->mmu_lock); + ret = pkvm_pgtable_stage2_map(pgt, fault_ipa, PAGE_SIZE, + page_to_phys(page), KVM_PGTABLE_PROT_RWX, + hyp_memcache, 0); + write_unlock(&kvm->mmu_lock); + if (ret) { + if (ret == -EAGAIN) + ret = 0; + goto unpin; + } + + return 0; +unpin: + unpin_user_pages(&page, 1); +dec_account: + account_locked_vm(mm, 1, false); + return ret; +} + static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_s2_trans *nested, struct kvm_memory_slot *memslot, unsigned long hva, @@ -2205,15 +2273,20 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) goto out_unlock; } - VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) && - !write_fault && !kvm_vcpu_trap_is_exec_fault(vcpu)); + if (kvm_vm_is_protected(vcpu->kvm)) { + ret = pkvm_mem_abort(vcpu, fault_ipa, memslot, hva); + } else { + VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) && + !write_fault && + !kvm_vcpu_trap_is_exec_fault(vcpu)); - if (kvm_slot_has_gmem(memslot)) - ret = gmem_abort(vcpu, fault_ipa, nested, memslot, - esr_fsc_is_permission_fault(esr)); - else - ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva, - esr_fsc_is_permission_fault(esr)); + if (kvm_slot_has_gmem(memslot)) + ret = gmem_abort(vcpu, fault_ipa, nested, memslot, + esr_fsc_is_permission_fault(esr)); + else + ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva, + esr_fsc_is_permission_fault(esr)); + } if (ret == 0) ret = 1; out: From 0bf5f4d400cd11ab86b25a56b101726e35f3e7cb Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:16 +0100 Subject: [PATCH 250/373] KVM: arm64: Introduce __pkvm_reclaim_dying_guest_page() To enable reclaim of pages from a protected VM during teardown, introduce a new hypercall to reclaim a single page from a protected guest that is in the dying state. 
Since the EL2 code is non-preemptible, the new hypercall deliberately acts on a single page at a time so as to allow EL1 to reschedule frequently during the teardown operation. Reviewed-by: Vincent Donnefort Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Co-developed-by: Quentin Perret Signed-off-by: Quentin Perret Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-16-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 + arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 1 + arch/arm64/kvm/hyp/nvhe/hyp-main.c | 9 +++ arch/arm64/kvm/hyp/nvhe/mem_protect.c | 79 +++++++++++++++++++ arch/arm64/kvm/hyp/nvhe/pkvm.c | 14 ++++ 6 files changed, 105 insertions(+) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index dfc6625c8269..b6df8f64d573 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -90,6 +90,7 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___pkvm_unreserve_vm, __KVM_HOST_SMCCC_FUNC___pkvm_init_vm, __KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu, + __KVM_HOST_SMCCC_FUNC___pkvm_reclaim_dying_guest_page, __KVM_HOST_SMCCC_FUNC___pkvm_start_teardown_vm, __KVM_HOST_SMCCC_FUNC___pkvm_finalize_teardown_vm, __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_load, diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h index 7061b0be340a..29f81a1d9e1f 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h @@ -40,6 +40,7 @@ int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages); int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages); int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages); int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu); +int __pkvm_host_reclaim_page_guest(u64 gfn, struct pkvm_hyp_vm *vm); int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu *vcpu, enum 
kvm_pgtable_prot prot); int __pkvm_host_unshare_guest(u64 gfn, u64 nr_pages, struct pkvm_hyp_vm *hyp_vm); diff --git a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h index 04c7ca703014..506831804f64 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h +++ b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h @@ -74,6 +74,7 @@ int __pkvm_init_vm(struct kvm *host_kvm, unsigned long vm_hva, int __pkvm_init_vcpu(pkvm_handle_t handle, struct kvm_vcpu *host_vcpu, unsigned long vcpu_hva); +int __pkvm_reclaim_dying_guest_page(pkvm_handle_t handle, u64 gfn); int __pkvm_start_teardown_vm(pkvm_handle_t handle); int __pkvm_finalize_teardown_vm(pkvm_handle_t handle); diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index 970656318cf2..7294c94f9296 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -573,6 +573,14 @@ static void handle___pkvm_init_vcpu(struct kvm_cpu_context *host_ctxt) cpu_reg(host_ctxt, 1) = __pkvm_init_vcpu(handle, host_vcpu, vcpu_hva); } +static void handle___pkvm_reclaim_dying_guest_page(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); + DECLARE_REG(u64, gfn, host_ctxt, 2); + + cpu_reg(host_ctxt, 1) = __pkvm_reclaim_dying_guest_page(handle, gfn); +} + static void handle___pkvm_start_teardown_vm(struct kvm_cpu_context *host_ctxt) { DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); @@ -626,6 +634,7 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__pkvm_unreserve_vm), HANDLE_FUNC(__pkvm_init_vm), HANDLE_FUNC(__pkvm_init_vcpu), + HANDLE_FUNC(__pkvm_reclaim_dying_guest_page), HANDLE_FUNC(__pkvm_start_teardown_vm), HANDLE_FUNC(__pkvm_finalize_teardown_vm), HANDLE_FUNC(__pkvm_vcpu_load), diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 03e6fa124253..ca266a4d9d50 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -738,6 +738,32 @@ 
static int __guest_check_page_state_range(struct pkvm_hyp_vm *vm, u64 addr, return check_page_state_range(&vm->pgt, addr, size, &d); } +static int get_valid_guest_pte(struct pkvm_hyp_vm *vm, u64 ipa, kvm_pte_t *ptep, u64 *physp) +{ + kvm_pte_t pte; + u64 phys; + s8 level; + int ret; + + ret = kvm_pgtable_get_leaf(&vm->pgt, ipa, &pte, &level); + if (ret) + return ret; + if (!kvm_pte_valid(pte)) + return -ENOENT; + if (level != KVM_PGTABLE_LAST_LEVEL) + return -E2BIG; + + phys = kvm_pte_to_phys(pte); + ret = check_range_allowed_memory(phys, phys + PAGE_SIZE); + if (WARN_ON(ret)) + return ret; + + *ptep = pte; + *physp = phys; + + return 0; +} + int __pkvm_host_share_hyp(u64 pfn) { u64 phys = hyp_pfn_to_phys(pfn); @@ -971,6 +997,59 @@ static int __guest_check_transition_size(u64 phys, u64 ipa, u64 nr_pages, u64 *s return 0; } +static void hyp_poison_page(phys_addr_t phys) +{ + void *addr = hyp_fixmap_map(phys); + + memset(addr, 0, PAGE_SIZE); + /* + * Prefer kvm_flush_dcache_to_poc() over __clean_dcache_guest_page() + * here as the latter may elide the CMO under the assumption that FWB + * will be enabled on CPUs that support it. This is incorrect for the + * host stage-2 and would otherwise lead to a malicious host potentially + * being able to read the contents of newly reclaimed guest pages. 
+ */ + kvm_flush_dcache_to_poc(addr, PAGE_SIZE); + hyp_fixmap_unmap(); +} + +int __pkvm_host_reclaim_page_guest(u64 gfn, struct pkvm_hyp_vm *vm) +{ + u64 ipa = hyp_pfn_to_phys(gfn); + kvm_pte_t pte; + u64 phys; + int ret; + + host_lock_component(); + guest_lock_component(vm); + + ret = get_valid_guest_pte(vm, ipa, &pte, &phys); + if (ret) + goto unlock; + + switch (guest_get_page_state(pte, ipa)) { + case PKVM_PAGE_OWNED: + WARN_ON(__host_check_page_state_range(phys, PAGE_SIZE, PKVM_NOPAGE)); + hyp_poison_page(phys); + break; + case PKVM_PAGE_SHARED_OWNED: + WARN_ON(__host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED)); + break; + default: + ret = -EPERM; + goto unlock; + } + + WARN_ON(kvm_pgtable_stage2_unmap(&vm->pgt, ipa, PAGE_SIZE)); + WARN_ON(host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_HOST)); + +unlock: + guest_unlock_component(vm); + host_unlock_component(); + + return ret; +} + int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu) { struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index 61e69e24656a..092e9d0e55ac 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -868,6 +868,20 @@ teardown_donated_memory(struct kvm_hyp_memcache *mc, void *addr, size_t size) unmap_donated_memory_noclear(addr, size); } +int __pkvm_reclaim_dying_guest_page(pkvm_handle_t handle, u64 gfn) +{ + struct pkvm_hyp_vm *hyp_vm; + int ret = -EINVAL; + + hyp_spin_lock(&vm_table_lock); + hyp_vm = get_vm_by_handle(handle); + if (hyp_vm && hyp_vm->kvm.arch.pkvm.is_dying) + ret = __pkvm_host_reclaim_page_guest(gfn, hyp_vm); + hyp_spin_unlock(&vm_table_lock); + + return ret; +} + int __pkvm_start_teardown_vm(pkvm_handle_t handle) { struct pkvm_hyp_vm *hyp_vm; From 4e6e03f9eaddb6be5ca8477dc5642e94ddece47e Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:17 +0100 Subject: [PATCH 251/373] KVM: arm64: Hook 
up reclaim hypercall to pkvm_pgtable_stage2_destroy() During teardown of a protected guest, its memory pages must be reclaimed from the hypervisor by issuing the '__pkvm_reclaim_dying_guest_page' hypercall. Add a new helper, __pkvm_pgtable_stage2_reclaim(), which is called during the VM teardown operation to reclaim pages from the hypervisor and drop the GUP pin on the host. Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-17-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/pkvm.c | 31 ++++++++++++++++++++++++++++++- 1 file changed, 30 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index 7d0fe36fd8dc..3cf23496f225 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -328,6 +328,32 @@ int pkvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu, return 0; } +static int __pkvm_pgtable_stage2_reclaim(struct kvm_pgtable *pgt, u64 start, u64 end) +{ + struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); + pkvm_handle_t handle = kvm->arch.pkvm.handle; + struct pkvm_mapping *mapping; + int ret; + + for_each_mapping_in_range_safe(pgt, start, end, mapping) { + struct page *page; + + ret = kvm_call_hyp_nvhe(__pkvm_reclaim_dying_guest_page, + handle, mapping->gfn); + if (WARN_ON(ret)) + continue; + + page = pfn_to_page(mapping->pfn); + WARN_ON_ONCE(mapping->nr_pages != 1); + unpin_user_pages_dirty_lock(&page, 1, true); + account_locked_vm(current->mm, 1, false); + pkvm_mapping_remove(mapping, &pgt->pkvm_mappings); + kfree(mapping); + } + + return 0; +} + static int __pkvm_pgtable_stage2_unshare(struct kvm_pgtable *pgt, u64 start, u64 end) { struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); @@ -361,7 +387,10 @@ void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, kvm->arch.pkvm.is_dying = true; } - __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); + if (kvm_vm_is_protected(kvm)) + 
__pkvm_pgtable_stage2_reclaim(pgt, addr, addr + size); + else + __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); } void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt) From 90c59c4d920f3ea6e7b4dd8702b87b38c7162755 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:18 +0100 Subject: [PATCH 252/373] KVM: arm64: Factor out pKVM host exception injection logic inject_undef64() open-codes the logic to inject an exception into the pKVM host. In preparation for reusing this logic to inject a data abort on an unhandled stage-2 fault from the host, factor out the meat and potatoes of the function into a new inject_host_exception() function which takes the ESR as a parameter. Cc: Fuad Tabba Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-18-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 49 ++++++++++++++---------------- 1 file changed, 23 insertions(+), 26 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index 7294c94f9296..adfc0bc15398 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -705,43 +705,40 @@ static void handle_host_smc(struct kvm_cpu_context *host_ctxt) kvm_skip_host_instr(); } -/* - * Inject an Undefined Instruction exception into the host. - * - * This is open-coded to allow control over PSTATE construction without - * complicating the generic exception entry helpers. 
- */ -static void inject_undef64(void) +static void inject_host_exception(u64 esr) { - u64 spsr_mask, vbar, sctlr, old_spsr, new_spsr, esr, offset; + u64 sctlr, spsr_el1, spsr_el2, exc_offset = except_type_sync; + const u64 spsr_mask = PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT | + PSR_V_BIT | PSR_DIT_BIT | PSR_PAN_BIT; - spsr_mask = PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT | PSR_V_BIT | PSR_DIT_BIT | PSR_PAN_BIT; + exc_offset += CURRENT_EL_SP_ELx_VECTOR; + + spsr_el1 = spsr_el2 = read_sysreg_el2(SYS_SPSR); + spsr_el2 &= spsr_mask; + spsr_el2 |= PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT | + PSR_MODE_EL1h; - vbar = read_sysreg_el1(SYS_VBAR); sctlr = read_sysreg_el1(SYS_SCTLR); - old_spsr = read_sysreg_el2(SYS_SPSR); - - new_spsr = old_spsr & spsr_mask; - new_spsr |= PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT; - new_spsr |= PSR_MODE_EL1h; - if (!(sctlr & SCTLR_EL1_SPAN)) - new_spsr |= PSR_PAN_BIT; + spsr_el2 |= PSR_PAN_BIT; if (sctlr & SCTLR_ELx_DSSBS) - new_spsr |= PSR_SSBS_BIT; + spsr_el2 |= PSR_SSBS_BIT; if (system_supports_mte()) - new_spsr |= PSR_TCO_BIT; - - esr = (ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT) | ESR_ELx_IL; - offset = CURRENT_EL_SP_ELx_VECTOR + except_type_sync; + spsr_el2 |= PSR_TCO_BIT; write_sysreg_el1(esr, SYS_ESR); write_sysreg_el1(read_sysreg_el2(SYS_ELR), SYS_ELR); - write_sysreg_el1(old_spsr, SYS_SPSR); - write_sysreg_el2(vbar + offset, SYS_ELR); - write_sysreg_el2(new_spsr, SYS_SPSR); + write_sysreg_el1(spsr_el1, SYS_SPSR); + write_sysreg_el2(read_sysreg_el1(SYS_VBAR) + exc_offset, SYS_ELR); + write_sysreg_el2(spsr_el2, SYS_SPSR); +} + +static void inject_host_undef64(void) +{ + inject_host_exception((ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT) | + ESR_ELx_IL); } static bool handle_host_mte(u64 esr) @@ -764,7 +761,7 @@ static bool handle_host_mte(u64 esr) return false; } - inject_undef64(); + inject_host_undef64(); return true; } From be9ed3529e0599f036a425d83ecc6dd4a085c9d2 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:19 +0100 
Subject: [PATCH 253/373] KVM: arm64: Support translation faults in inject_host_exception() Extend inject_host_exception() to support the injection of translation faults on both the data and instruction side to 32-bit and 64-bit EL0 as well as 64-bit EL1. This will be used in a subsequent patch when resolving an unhandled host stage-2 abort. Cc: Fuad Tabba Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-19-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/include/nvhe/trap_handler.h | 2 ++ arch/arm64/kvm/hyp/nvhe/hyp-main.c | 18 +++++++++++++++--- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h b/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h index ba5382c12787..32d7b7746e8e 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h +++ b/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h @@ -16,4 +16,6 @@ __always_unused int ___check_reg_ ## reg; \ type name = (type)cpu_reg(ctxt, (reg)) +void inject_host_exception(u64 esr); + #endif /* __ARM64_KVM_NVHE_TRAP_HANDLER_H__ */ diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index adfc0bc15398..6db5aebd92dc 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -705,15 +705,24 @@ static void handle_host_smc(struct kvm_cpu_context *host_ctxt) kvm_skip_host_instr(); } -static void inject_host_exception(u64 esr) +void inject_host_exception(u64 esr) { u64 sctlr, spsr_el1, spsr_el2, exc_offset = except_type_sync; const u64 spsr_mask = PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT | PSR_V_BIT | PSR_DIT_BIT | PSR_PAN_BIT; - exc_offset += CURRENT_EL_SP_ELx_VECTOR; - spsr_el1 = spsr_el2 = read_sysreg_el2(SYS_SPSR); + switch (spsr_el1 & (PSR_MODE_MASK | PSR_MODE32_BIT)) { + case PSR_MODE_EL0t: + exc_offset += LOWER_EL_AArch64_VECTOR; + break; + case PSR_MODE_EL0t | PSR_MODE32_BIT: + exc_offset 
+= LOWER_EL_AArch32_VECTOR; + break; + default: + exc_offset += CURRENT_EL_SP_ELx_VECTOR; + } + spsr_el2 &= spsr_mask; spsr_el2 |= PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT | PSR_MODE_EL1h; @@ -728,6 +737,9 @@ static void inject_host_exception(u64 esr) if (system_supports_mte()) spsr_el2 |= PSR_TCO_BIT; + if (esr_fsc_is_translation_fault(esr)) + write_sysreg_el1(read_sysreg_el2(SYS_FAR), SYS_FAR); + write_sysreg_el1(esr, SYS_ESR); write_sysreg_el1(read_sysreg_el2(SYS_ELR), SYS_ELR); write_sysreg_el1(spsr_el1, SYS_SPSR); From 9ff714a09222128da16900fc7c15dea65692fc26 Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Mon, 30 Mar 2026 15:48:20 +0100 Subject: [PATCH 254/373] KVM: arm64: Inject SIGSEGV on illegal accesses The pKVM hypervisor will currently panic if the host tries to access memory that it doesn't own (e.g. protected guest memory). Sadly, as guest memory can still be mapped into the VMM's address space, userspace can trivially crash the kernel/hypervisor by poking into guest memory. To prevent this, inject the abort back in the host with S1PTW set in the ESR, hence allowing the host to differentiate this abort from normal userspace faults and inject a SIGSEGV cleanly. 
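The repaint-and-tag scheme described above can be modeled outside the kernel with a small standalone sketch. The bit encodings below mirror the values in arch/arm64/include/asm/esr.h, but `repaint_esr()` is a hypothetical helper invented for illustration, not a function in the patch: it promotes a lower-EL abort to a same-level one when the host was running at EL1, and sets S1PTW so the EL1 fault handler can tell the hypervisor-injected abort apart from an ordinary fault.

```c
#include <assert.h>
#include <stdint.h>

/* Bit encodings mirroring arch/arm64/include/asm/esr.h */
#define ESR_ELx_EC_SHIFT	26
#define ESR_ELx_EC_MASK		(0x3FULL << ESR_ELx_EC_SHIFT)
#define ESR_ELx_EC_IABT_LOW	0x20ULL
#define ESR_ELx_EC_IABT_CUR	0x21ULL
#define ESR_ELx_EC_DABT_LOW	0x24ULL
#define ESR_ELx_EC_DABT_CUR	0x25ULL
#define ESR_ELx_S1PTW		(1ULL << 7)

#define PSR_MODE_MASK		0x0000000fULL
#define PSR_MODE_EL0t		0x00000000ULL

/*
 * Hypothetical model of host_inject_mem_abort(): repaint a lower-EL
 * data/instruction abort as a same-level one if SPSR says the host was
 * not in EL0, then tag the ESR with S1PTW (RES0 at EL1) so the host
 * abort handler can identify the injected stage-2 fault.
 */
static uint64_t repaint_esr(uint64_t esr, uint64_t spsr)
{
	uint64_t ec = (esr & ESR_ELx_EC_MASK) >> ESR_ELx_EC_SHIFT;

	if ((spsr & PSR_MODE_MASK) != PSR_MODE_EL0t) {
		if (ec == ESR_ELx_EC_DABT_LOW)
			ec = ESR_ELx_EC_DABT_CUR;
		else if (ec == ESR_ELx_EC_IABT_LOW)
			ec = ESR_ELx_EC_IABT_CUR;
		esr &= ~ESR_ELx_EC_MASK;
		esr |= ec << ESR_ELx_EC_SHIFT;
	}

	return esr | ESR_ELx_S1PTW;
}
```

A host-side check then reduces to testing S1PTW, as do_page_fault() does in the diff below via is_pkvm_stage2_abort().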
Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Quentin Perret Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-20-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 37 +++++++++++++++++++++++++++ arch/arm64/mm/fault.c | 22 ++++++++++++++++ 2 files changed, 59 insertions(+) diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index ca266a4d9d50..0e57dc1881e0 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -18,6 +18,7 @@ #include #include #include +#include #define KVM_HOST_S2_FLAGS (KVM_PGTABLE_S2_AS_S1 | KVM_PGTABLE_S2_IDMAP) @@ -612,6 +613,39 @@ unlock: return ret; } +static void host_inject_mem_abort(struct kvm_cpu_context *host_ctxt) +{ + u64 ec, esr, spsr; + + esr = read_sysreg_el2(SYS_ESR); + spsr = read_sysreg_el2(SYS_SPSR); + + /* Repaint the ESR to report a same-level fault if taken from EL1 */ + if ((spsr & PSR_MODE_MASK) != PSR_MODE_EL0t) { + ec = ESR_ELx_EC(esr); + if (ec == ESR_ELx_EC_DABT_LOW) + ec = ESR_ELx_EC_DABT_CUR; + else if (ec == ESR_ELx_EC_IABT_LOW) + ec = ESR_ELx_EC_IABT_CUR; + else + WARN_ON(1); + esr &= ~ESR_ELx_EC_MASK; + esr |= ec << ESR_ELx_EC_SHIFT; + } + + /* + * Since S1PTW should only ever be set for stage-2 faults, we're pretty + * much guaranteed that it won't be set in ESR_EL1 by the hardware. So, + * let's use that bit to allow the host abort handler to differentiate + * this abort from normal userspace faults. + * + * Note: although S1PTW is RES0 at EL1, it is guaranteed by the + * architecture to be backed by flops, so it should be safe to use. 
+ */ + esr |= ESR_ELx_S1PTW; + inject_host_exception(esr); +} + void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt) { struct kvm_vcpu_fault_info fault; @@ -635,6 +669,9 @@ void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt) addr = FIELD_GET(HPFAR_EL2_FIPA, fault.hpfar_el2) << 12; switch (host_stage2_idmap(addr)) { + case -EPERM: + host_inject_mem_abort(host_ctxt); + fallthrough; case -EEXIST: case 0: break; diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index be9dab2c7d6a..3abfc7272d63 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -43,6 +43,7 @@ #include #include #include +#include struct fault_info { int (*fn)(unsigned long far, unsigned long esr, @@ -269,6 +270,15 @@ static inline bool is_el1_permission_fault(unsigned long addr, unsigned long esr return false; } +static bool is_pkvm_stage2_abort(unsigned int esr) +{ + /* + * S1PTW should only ever be set in ESR_EL1 if the pkvm hypervisor + * injected a stage-2 abort -- see host_inject_mem_abort(). 
+ */ + return is_pkvm_initialized() && (esr & ESR_ELx_S1PTW); +} + static bool __kprobes is_spurious_el1_translation_fault(unsigned long addr, unsigned long esr, struct pt_regs *regs) @@ -279,6 +289,9 @@ static bool __kprobes is_spurious_el1_translation_fault(unsigned long addr, if (!is_el1_data_abort(esr) || !esr_fsc_is_translation_fault(esr)) return false; + if (is_pkvm_stage2_abort(esr)) + return false; + local_irq_save(flags); asm volatile("at s1e1r, %0" :: "r" (addr)); isb(); @@ -395,6 +408,8 @@ static void __do_kernel_fault(unsigned long addr, unsigned long esr, msg = "read from unreadable memory"; } else if (addr < PAGE_SIZE) { msg = "NULL pointer dereference"; + } else if (is_pkvm_stage2_abort(esr)) { + msg = "access to hypervisor-protected memory"; } else { if (esr_fsc_is_translation_fault(esr) && kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs)) @@ -621,6 +636,13 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, addr, esr, regs); } + if (is_pkvm_stage2_abort(esr)) { + if (!user_mode(regs)) + goto no_context; + arm64_force_sig_fault(SIGSEGV, SEGV_ACCERR, far, "stage-2 fault"); + return 0; + } + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); if (!(mm_flags & FAULT_FLAG_USER)) From 664d61690357ac2154cd01d859d97455aa49a81d Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:21 +0100 Subject: [PATCH 255/373] KVM: arm64: Avoid pointless annotation when mapping host-owned pages When a page is transitioned to host ownership, we can eagerly map it into the host stage-2 page-table rather than going via the convoluted step of a faulting annotation to trigger the mapping. Call host_stage2_idmap_locked() directly when transitioning a page to be owned by the host. 
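The dispatch this commit introduces can be sketched as a tiny standalone model. Everything here is illustrative: the enum values and the `set_owner()` helper are invented for the sketch (the real code is host_stage2_set_owner_locked() in the diff below), but it captures the shape of the change — a transition back to host ownership now installs the identity mapping eagerly, while donations to the hypervisor or a guest still record the new owner in an invalid-PTE annotation.

```c
#include <assert.h>

/* Owner IDs and page states, loosely mirroring pKVM's (values illustrative). */
enum owner  { PKVM_ID_HOST, PKVM_ID_HYP, PKVM_ID_GUEST };
enum action { ACT_MAP_IDMAP, ACT_ANNOTATE };
enum pstate { PKVM_PAGE_OWNED, PKVM_NOPAGE };

struct result {
	enum action act;	/* what is written to the host stage-2 */
	enum pstate st;		/* resulting vmemmap page-state tracking */
};

/*
 * Hypothetical model of the reworked ownership transition: returning a
 * page to the host maps it directly (no faulting annotation to resolve
 * later); donating it elsewhere leaves an owner annotation behind.
 */
static struct result set_owner(enum owner owner_id)
{
	if (owner_id == PKVM_ID_HOST)
		return (struct result){ ACT_MAP_IDMAP, PKVM_PAGE_OWNED };

	return (struct result){ ACT_ANNOTATE, PKVM_NOPAGE };
}
```

The benefit is that a host access to a reclaimed page no longer has to trap to EL2 just to fault in a mapping the hypervisor already knows is valid.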
Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-21-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 28 +++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 0e57dc1881e0..bf5102594fc8 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -551,23 +551,27 @@ static void __host_update_page_state(phys_addr_t addr, u64 size, enum pkvm_page_ int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) { - int ret; + int ret = -EINVAL; if (!range_is_memory(addr, addr + size)) return -EPERM; - ret = host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, - addr, size, &host_s2_pool, owner_id); - if (ret) - return ret; + switch (owner_id) { + case PKVM_ID_HOST: + ret = host_stage2_idmap_locked(addr, size, PKVM_HOST_MEM_PROT); + if (!ret) + __host_update_page_state(addr, size, PKVM_PAGE_OWNED); + break; + case PKVM_ID_GUEST: + case PKVM_ID_HYP: + ret = host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, + addr, size, &host_s2_pool, owner_id); + if (!ret) + __host_update_page_state(addr, size, PKVM_NOPAGE); + break; + } - /* Don't forget to update the vmemmap tracking for the host */ - if (owner_id == PKVM_ID_HOST) - __host_update_page_state(addr, size, PKVM_PAGE_OWNED); - else - __host_update_page_state(addr, size, PKVM_NOPAGE); - - return 0; + return ret; } static bool host_stage2_force_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot) From c6ba94640cf7b6da902d1d8a1383c7cf2303ca1f Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:22 +0100 Subject: [PATCH 256/373] KVM: arm64: Generalise kvm_pgtable_stage2_set_owner() kvm_pgtable_stage2_set_owner() can be generalised into a way to store up to 59 bits in the page tables alongside a 4-bit 'type' 
identifier specific to the format of the 59-bit payload. Introduce kvm_pgtable_stage2_annotate() and move the existing invalid ptes (for locked ptes and donated pages) over to the new scheme. Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-22-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_pgtable.h | 39 +++++++++++++++++++-------- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 16 +++++++++-- arch/arm64/kvm/hyp/pgtable.c | 33 ++++++++++++++--------- 3 files changed, 62 insertions(+), 26 deletions(-) diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h index 50caca311ef5..e36c2908bdb2 100644 --- a/arch/arm64/include/asm/kvm_pgtable.h +++ b/arch/arm64/include/asm/kvm_pgtable.h @@ -100,13 +100,25 @@ typedef u64 kvm_pte_t; KVM_PTE_LEAF_ATTR_HI_S2_XN) #define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2) -#define KVM_MAX_OWNER_ID 2 -/* - * Used to indicate a pte for which a 'break-before-make' sequence is in - * progress. - */ -#define KVM_INVALID_PTE_LOCKED BIT(10) +/* pKVM invalid pte encodings */ +#define KVM_INVALID_PTE_TYPE_MASK GENMASK(63, 60) +#define KVM_INVALID_PTE_ANNOT_MASK ~(KVM_PTE_VALID | \ + KVM_INVALID_PTE_TYPE_MASK) + +enum kvm_invalid_pte_type { + /* + * Used to indicate a pte for which a 'break-before-make' + * sequence is in progress. + */ + KVM_INVALID_PTE_TYPE_LOCKED = 1, + + /* + * pKVM has unmapped the page from the host due to a change of + * ownership. + */ + KVM_HOST_INVALID_PTE_TYPE_DONATION, +}; static inline bool kvm_pte_valid(kvm_pte_t pte) { @@ -658,14 +670,18 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc, enum kvm_pgtable_walk_flags flags); /** - * kvm_pgtable_stage2_set_owner() - Unmap and annotate pages in the IPA space to - * track ownership. + * kvm_pgtable_stage2_annotate() - Unmap and annotate pages in the IPA space + * to track ownership (and more). 
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*(). * @addr: Base intermediate physical address to annotate. * @size: Size of the annotated range. * @mc: Cache of pre-allocated and zeroed memory from which to allocate * page-table pages. - * @owner_id: Unique identifier for the owner of the page. + * @type: The type of the annotation, determining its meaning and format. + * @annotation: A 59-bit value that will be stored in the page tables. + * @annotation[0] and @annotation[63:60] must be 0. + * @annotation[59:1] is stored in the page tables, along + * with @type. * * By default, all page-tables are owned by identifier 0. This function can be * used to mark portions of the IPA space as owned by other entities. When a @@ -674,8 +690,9 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, * * Return: 0 on success, negative error code on failure. */ -int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size, - void *mc, u8 owner_id); +int kvm_pgtable_stage2_annotate(struct kvm_pgtable *pgt, u64 addr, u64 size, + void *mc, enum kvm_invalid_pte_type type, + kvm_pte_t annotation); /** * kvm_pgtable_stage2_unmap() - Remove a mapping from a guest stage-2 page-table. 
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index bf5102594fc8..aea6ec981801 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -549,10 +549,19 @@ static void __host_update_page_state(phys_addr_t addr, u64 size, enum pkvm_page_ set_host_state(page, state); } +static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id) +{ + return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id); +} + int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) { + kvm_pte_t annotation; int ret = -EINVAL; + if (!FIELD_FIT(KVM_INVALID_PTE_OWNER_MASK, owner_id)) + return -EINVAL; + if (!range_is_memory(addr, addr + size)) return -EPERM; @@ -564,8 +573,11 @@ int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) break; case PKVM_ID_GUEST: case PKVM_ID_HYP: - ret = host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, - addr, size, &host_s2_pool, owner_id); + annotation = kvm_init_invalid_leaf_owner(owner_id); + ret = host_stage2_try(kvm_pgtable_stage2_annotate, &host_mmu.pgt, + addr, size, &host_s2_pool, + KVM_HOST_INVALID_PTE_TYPE_DONATION, + annotation); if (!ret) __host_update_page_state(addr, size, PKVM_NOPAGE); break; diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c index 9b480f947da2..84c7a1df845d 100644 --- a/arch/arm64/kvm/hyp/pgtable.c +++ b/arch/arm64/kvm/hyp/pgtable.c @@ -114,11 +114,6 @@ static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, s8 level) return pte; } -static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id) -{ - return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id); -} - static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, const struct kvm_pgtable_visit_ctx *ctx, enum kvm_pgtable_walk_flags visit) @@ -581,7 +576,7 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt) struct stage2_map_data { const u64 phys; kvm_pte_t attr; - u8 owner_id; + kvm_pte_t pte_annot; kvm_pte_t 
*anchor; kvm_pte_t *childp; @@ -798,7 +793,11 @@ static bool stage2_pte_is_counted(kvm_pte_t pte) static bool stage2_pte_is_locked(kvm_pte_t pte) { - return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED); + if (kvm_pte_valid(pte)) + return false; + + return FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) == + KVM_INVALID_PTE_TYPE_LOCKED; } static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new) @@ -829,6 +828,7 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s2_mmu *mmu) { struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; + kvm_pte_t locked_pte; if (stage2_pte_is_locked(ctx->old)) { /* @@ -839,7 +839,9 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx, return false; } - if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED)) + locked_pte = FIELD_PREP(KVM_INVALID_PTE_TYPE_MASK, + KVM_INVALID_PTE_TYPE_LOCKED); + if (!stage2_try_set_pte(ctx, locked_pte)) return false; if (!kvm_pgtable_walk_skip_bbm_tlbi(ctx)) { @@ -964,7 +966,7 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx, if (!data->annotation) new = kvm_init_valid_leaf_pte(phys, data->attr, ctx->level); else - new = kvm_init_invalid_leaf_owner(data->owner_id); + new = data->pte_annot; /* * Skip updating the PTE if we are trying to recreate the exact @@ -1118,16 +1120,18 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, return ret; } -int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size, - void *mc, u8 owner_id) +int kvm_pgtable_stage2_annotate(struct kvm_pgtable *pgt, u64 addr, u64 size, + void *mc, enum kvm_invalid_pte_type type, + kvm_pte_t pte_annot) { int ret; struct stage2_map_data map_data = { .mmu = pgt->mmu, .memcache = mc, - .owner_id = owner_id, .force_pte = true, .annotation = true, + .pte_annot = pte_annot | + FIELD_PREP(KVM_INVALID_PTE_TYPE_MASK, type), }; struct kvm_pgtable_walker walker = { .cb = stage2_map_walker, @@ 
-1136,7 +1140,10 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size, .arg = &map_data, }; - if (owner_id > KVM_MAX_OWNER_ID) + if (pte_annot & ~KVM_INVALID_PTE_ANNOT_MASK) + return -EINVAL; + + if (!type || type == KVM_INVALID_PTE_TYPE_LOCKED) return -EINVAL; ret = kvm_pgtable_walk(pgt, addr, size, &walker); From afa72d207e6b5d49ac597fcd04f0865af63cf589 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:23 +0100 Subject: [PATCH 257/373] KVM: arm64: Introduce host_stage2_set_owner_metadata_locked() Rework host_stage2_set_owner_locked() to add a new helper function, host_stage2_set_owner_metadata_locked(), which will allow us to store additional metadata alongside a 3-bit owner ID for invalid host stage-2 entries. Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-23-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_pgtable.h | 2 -- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 47 ++++++++++++++++++--------- 2 files changed, 32 insertions(+), 17 deletions(-) diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h index e36c2908bdb2..2df22640833c 100644 --- a/arch/arm64/include/asm/kvm_pgtable.h +++ b/arch/arm64/include/asm/kvm_pgtable.h @@ -99,8 +99,6 @@ typedef u64 kvm_pte_t; KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \ KVM_PTE_LEAF_ATTR_HI_S2_XN) -#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2) - /* pKVM invalid pte encodings */ #define KVM_INVALID_PTE_TYPE_MASK GENMASK(63, 60) #define KVM_INVALID_PTE_ANNOT_MASK ~(KVM_PTE_VALID | \ diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index aea6ec981801..90003cbf5603 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -549,37 +549,54 @@ static void __host_update_page_state(phys_addr_t addr, u64 size, enum pkvm_page_ set_host_state(page, state); } -static kvm_pte_t 
kvm_init_invalid_leaf_owner(u8 owner_id) -{ - return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id); -} - -int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) +#define KVM_HOST_DONATION_PTE_OWNER_MASK GENMASK(3, 1) +#define KVM_HOST_DONATION_PTE_EXTRA_MASK GENMASK(59, 4) +static int host_stage2_set_owner_metadata_locked(phys_addr_t addr, u64 size, + u8 owner_id, u64 meta) { kvm_pte_t annotation; - int ret = -EINVAL; + int ret; - if (!FIELD_FIT(KVM_INVALID_PTE_OWNER_MASK, owner_id)) + if (owner_id == PKVM_ID_HOST) return -EINVAL; if (!range_is_memory(addr, addr + size)) return -EPERM; + if (!FIELD_FIT(KVM_HOST_DONATION_PTE_OWNER_MASK, owner_id)) + return -EINVAL; + + if (!FIELD_FIT(KVM_HOST_DONATION_PTE_EXTRA_MASK, meta)) + return -EINVAL; + + annotation = FIELD_PREP(KVM_HOST_DONATION_PTE_OWNER_MASK, owner_id) | + FIELD_PREP(KVM_HOST_DONATION_PTE_EXTRA_MASK, meta); + ret = host_stage2_try(kvm_pgtable_stage2_annotate, &host_mmu.pgt, + addr, size, &host_s2_pool, + KVM_HOST_INVALID_PTE_TYPE_DONATION, annotation); + if (!ret) + __host_update_page_state(addr, size, PKVM_NOPAGE); + + return ret; +} + +int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) +{ + int ret = -EINVAL; + switch (owner_id) { case PKVM_ID_HOST: + if (!range_is_memory(addr, addr + size)) + return -EPERM; + ret = host_stage2_idmap_locked(addr, size, PKVM_HOST_MEM_PROT); if (!ret) __host_update_page_state(addr, size, PKVM_PAGE_OWNED); break; case PKVM_ID_GUEST: case PKVM_ID_HYP: - annotation = kvm_init_invalid_leaf_owner(owner_id); - ret = host_stage2_try(kvm_pgtable_stage2_annotate, &host_mmu.pgt, - addr, size, &host_s2_pool, - KVM_HOST_INVALID_PTE_TYPE_DONATION, - annotation); - if (!ret) - __host_update_page_state(addr, size, PKVM_NOPAGE); + ret = host_stage2_set_owner_metadata_locked(addr, size, + owner_id, 0); break; } From 44887977ab0fdaa0af4d1cc97cda413884c0ef86 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:24 +0100 Subject: 
[PATCH 258/373] KVM: arm64: Change 'pkvm_handle_t' to u16 'pkvm_handle_t' doesn't need to be a 32-bit type and subsequent patches will rely on it being no more than 16 bits so that it can be encoded into a pte annotation. Change 'pkvm_handle_t' to a u16 and add a compile-time check that the maximum handle fits into the reduced type. Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-24-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_host.h | 2 +- arch/arm64/kvm/hyp/nvhe/pkvm.c | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 31b9454bb74d..0c5e7ce5f187 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -247,7 +247,7 @@ struct kvm_smccc_features { unsigned long vendor_hyp_bmap_2; /* Function numbers 64-127 */ }; -typedef unsigned int pkvm_handle_t; +typedef u16 pkvm_handle_t; struct kvm_protected_vm { pkvm_handle_t handle; diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index 092e9d0e55ac..0ba6423cd0d5 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -222,6 +222,7 @@ static struct pkvm_hyp_vm **vm_table; void pkvm_hyp_vm_table_init(void *tbl) { + BUILD_BUG_ON((u64)HANDLE_OFFSET + KVM_MAX_PVMS > (pkvm_handle_t)-1); WARN_ON(vm_table); vm_table = tbl; } From 70346d632b4d98dd33391fa263ab8bed7d9d934f Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:25 +0100 Subject: [PATCH 259/373] KVM: arm64: Annotate guest donations with handle and gfn in host stage-2 Handling host kernel faults arising from accesses to donated guest memory will require an rmap-like mechanism to identify the guest mapping of the faulting page. Extend the page donation logic to encode the guest handle and gfn alongside the owner information in the host stage-2 pte.
Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-25-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 90003cbf5603..51cb5c89fd20 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -593,7 +593,6 @@ int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) if (!ret) __host_update_page_state(addr, size, PKVM_PAGE_OWNED); break; - case PKVM_ID_GUEST: case PKVM_ID_HYP: ret = host_stage2_set_owner_metadata_locked(addr, size, owner_id, 0); @@ -603,6 +602,20 @@ int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) return ret; } +#define KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK GENMASK(15, 0) +/* We need 40 bits for the GFN to cover a 52-bit IPA with 4k pages and LPA2 */ +#define KVM_HOST_PTE_OWNER_GUEST_GFN_MASK GENMASK(55, 16) +static u64 host_stage2_encode_gfn_meta(struct pkvm_hyp_vm *vm, u64 gfn) +{ + pkvm_handle_t handle = vm->kvm.arch.pkvm.handle; + + BUILD_BUG_ON((pkvm_handle_t)-1 > KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK); + WARN_ON(!FIELD_FIT(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, gfn)); + + return FIELD_PREP(KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK, handle) | + FIELD_PREP(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, gfn); +} + static bool host_stage2_force_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot) { /* @@ -1125,6 +1138,7 @@ int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu) struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); u64 phys = hyp_pfn_to_phys(pfn); u64 ipa = hyp_pfn_to_phys(gfn); + u64 meta; int ret; host_lock_component(); @@ -1138,7 +1152,9 @@ int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu) if (ret) goto unlock; - 
WARN_ON(host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_GUEST)); + meta = host_stage2_encode_gfn_meta(vm, gfn); + WARN_ON(host_stage2_set_owner_metadata_locked(phys, PAGE_SIZE, + PKVM_ID_GUEST, meta)); WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys, pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED), &vcpu->vcpu.arch.pkvm_memcache, 0)); From 56080f53a6ad779b971eb7f4f7a232498805d867 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:26 +0100 Subject: [PATCH 260/373] KVM: arm64: Introduce hypercall to force reclaim of a protected page Introduce a new hypercall, __pkvm_force_reclaim_guest_page(), to allow the host to forcefully reclaim a physical page that was previously donated to a protected guest. This results in the page being zeroed and the previous guest mapping being poisoned so that new pages cannot be subsequently donated at the same IPA. Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-26-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/include/asm/kvm_pgtable.h | 6 + arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 + arch/arm64/kvm/hyp/include/nvhe/memory.h | 6 + arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 1 + arch/arm64/kvm/hyp/nvhe/hyp-main.c | 8 ++ arch/arm64/kvm/hyp/nvhe/mem_protect.c | 129 +++++++++++++++++- arch/arm64/kvm/hyp/nvhe/pkvm.c | 4 +- 8 files changed, 154 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index b6df8f64d573..04a230e906a7 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -90,6 +90,7 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___pkvm_unreserve_vm, __KVM_HOST_SMCCC_FUNC___pkvm_init_vm, __KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu, + __KVM_HOST_SMCCC_FUNC___pkvm_force_reclaim_guest_page, __KVM_HOST_SMCCC_FUNC___pkvm_reclaim_dying_guest_page,
__KVM_HOST_SMCCC_FUNC___pkvm_start_teardown_vm, __KVM_HOST_SMCCC_FUNC___pkvm_finalize_teardown_vm, diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h index 2df22640833c..41a8687938eb 100644 --- a/arch/arm64/include/asm/kvm_pgtable.h +++ b/arch/arm64/include/asm/kvm_pgtable.h @@ -116,6 +116,12 @@ enum kvm_invalid_pte_type { * ownership. */ KVM_HOST_INVALID_PTE_TYPE_DONATION, + + /* + * The page has been forcefully reclaimed from the guest by the + * host. + */ + KVM_GUEST_INVALID_PTE_TYPE_POISONED, }; static inline bool kvm_pte_valid(kvm_pte_t pte) diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h index 29f81a1d9e1f..acc031103600 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h @@ -40,6 +40,7 @@ int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages); int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages); int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages); int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu); +int __pkvm_host_force_reclaim_page_guest(phys_addr_t phys); int __pkvm_host_reclaim_page_guest(u64 gfn, struct pkvm_hyp_vm *vm); int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu *vcpu, enum kvm_pgtable_prot prot); diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h index dee1a406b0c2..4cedb720c75d 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/memory.h +++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h @@ -30,6 +30,12 @@ enum pkvm_page_state { * struct hyp_page. 
*/ PKVM_NOPAGE = BIT(0) | BIT(1), + + /* + * 'Meta-states' which aren't encoded directly in the PTE's SW bits (or + * the hyp_vmemmap entry for the host) + */ + PKVM_POISON = BIT(2), }; #define PKVM_PAGE_STATE_MASK (BIT(0) | BIT(1)) diff --git a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h index 506831804f64..a5a7bb453f3e 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h +++ b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h @@ -78,6 +78,7 @@ int __pkvm_reclaim_dying_guest_page(pkvm_handle_t handle, u64 gfn); int __pkvm_start_teardown_vm(pkvm_handle_t handle); int __pkvm_finalize_teardown_vm(pkvm_handle_t handle); +struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle); struct pkvm_hyp_vcpu *pkvm_load_hyp_vcpu(pkvm_handle_t handle, unsigned int vcpu_idx); void pkvm_put_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu); diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index 6db5aebd92dc..456c83207717 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -573,6 +573,13 @@ static void handle___pkvm_init_vcpu(struct kvm_cpu_context *host_ctxt) cpu_reg(host_ctxt, 1) = __pkvm_init_vcpu(handle, host_vcpu, vcpu_hva); } +static void handle___pkvm_force_reclaim_guest_page(struct kvm_cpu_context *host_ctxt) +{ + DECLARE_REG(phys_addr_t, phys, host_ctxt, 1); + + cpu_reg(host_ctxt, 1) = __pkvm_host_force_reclaim_page_guest(phys); +} + static void handle___pkvm_reclaim_dying_guest_page(struct kvm_cpu_context *host_ctxt) { DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); @@ -634,6 +641,7 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__pkvm_unreserve_vm), HANDLE_FUNC(__pkvm_init_vm), HANDLE_FUNC(__pkvm_init_vcpu), + HANDLE_FUNC(__pkvm_force_reclaim_guest_page), HANDLE_FUNC(__pkvm_reclaim_dying_guest_page), HANDLE_FUNC(__pkvm_start_teardown_vm), HANDLE_FUNC(__pkvm_finalize_teardown_vm), diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c 
index 51cb5c89fd20..73bdbd4a508e 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -616,6 +616,35 @@ static u64 host_stage2_encode_gfn_meta(struct pkvm_hyp_vm *vm, u64 gfn) FIELD_PREP(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, gfn); } +static int host_stage2_decode_gfn_meta(kvm_pte_t pte, struct pkvm_hyp_vm **vm, + u64 *gfn) +{ + pkvm_handle_t handle; + u64 meta; + + if (WARN_ON(kvm_pte_valid(pte))) + return -EINVAL; + + if (FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) != + KVM_HOST_INVALID_PTE_TYPE_DONATION) { + return -EINVAL; + } + + if (FIELD_GET(KVM_HOST_DONATION_PTE_OWNER_MASK, pte) != PKVM_ID_GUEST) + return -EPERM; + + meta = FIELD_GET(KVM_HOST_DONATION_PTE_EXTRA_MASK, pte); + handle = FIELD_GET(KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK, meta); + *vm = get_vm_by_handle(handle); + if (!*vm) { + /* We probably raced with teardown; try again */ + return -EAGAIN; + } + + *gfn = FIELD_GET(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, meta); + return 0; +} + static bool host_stage2_force_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot) { /* @@ -801,8 +830,20 @@ static int __hyp_check_page_state_range(phys_addr_t phys, u64 size, enum pkvm_pa return 0; } +static bool guest_pte_is_poisoned(kvm_pte_t pte) +{ + if (kvm_pte_valid(pte)) + return false; + + return FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) == + KVM_GUEST_INVALID_PTE_TYPE_POISONED; +} + static enum pkvm_page_state guest_get_page_state(kvm_pte_t pte, u64 addr) { + if (guest_pte_is_poisoned(pte)) + return PKVM_POISON; + if (!kvm_pte_valid(pte)) return PKVM_NOPAGE; @@ -831,6 +872,8 @@ static int get_valid_guest_pte(struct pkvm_hyp_vm *vm, u64 ipa, kvm_pte_t *ptep, ret = kvm_pgtable_get_leaf(&vm->pgt, ipa, &pte, &level); if (ret) return ret; + if (guest_pte_is_poisoned(pte)) + return -EHWPOISON; if (!kvm_pte_valid(pte)) return -ENOENT; if (level != KVM_PGTABLE_LAST_LEVEL) @@ -1096,6 +1139,86 @@ static void hyp_poison_page(phys_addr_t phys) hyp_fixmap_unmap(); } +static int 
host_stage2_get_guest_info(phys_addr_t phys, struct pkvm_hyp_vm **vm, + u64 *gfn) +{ + enum pkvm_page_state state; + kvm_pte_t pte; + s8 level; + int ret; + + if (!addr_is_memory(phys)) + return -EFAULT; + + state = get_host_state(hyp_phys_to_page(phys)); + switch (state) { + case PKVM_PAGE_OWNED: + case PKVM_PAGE_SHARED_OWNED: + case PKVM_PAGE_SHARED_BORROWED: + /* The access should no longer fault; try again. */ + return -EAGAIN; + case PKVM_NOPAGE: + break; + default: + return -EPERM; + } + + ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, &level); + if (ret) + return ret; + + if (WARN_ON(level != KVM_PGTABLE_LAST_LEVEL)) + return -EINVAL; + + return host_stage2_decode_gfn_meta(pte, vm, gfn); +} + +int __pkvm_host_force_reclaim_page_guest(phys_addr_t phys) +{ + struct pkvm_hyp_vm *vm; + u64 gfn, ipa, pa; + kvm_pte_t pte; + int ret; + + phys &= PAGE_MASK; + + hyp_spin_lock(&vm_table_lock); + host_lock_component(); + + ret = host_stage2_get_guest_info(phys, &vm, &gfn); + if (ret) + goto unlock_host; + + ipa = hyp_pfn_to_phys(gfn); + guest_lock_component(vm); + ret = get_valid_guest_pte(vm, ipa, &pte, &pa); + if (ret) + goto unlock_guest; + + WARN_ON(pa != phys); + if (guest_get_page_state(pte, ipa) != PKVM_PAGE_OWNED) { + ret = -EPERM; + goto unlock_guest; + } + + /* We really shouldn't be allocating, so don't pass a memcache */ + ret = kvm_pgtable_stage2_annotate(&vm->pgt, ipa, PAGE_SIZE, NULL, + KVM_GUEST_INVALID_PTE_TYPE_POISONED, + 0); + if (ret) + goto unlock_guest; + + hyp_poison_page(phys); + WARN_ON(host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_HOST)); +unlock_guest: + guest_unlock_component(vm); +unlock_host: + host_unlock_component(); + hyp_spin_unlock(&vm_table_lock); + + return ret; +} + int __pkvm_host_reclaim_page_guest(u64 gfn, struct pkvm_hyp_vm *vm) { u64 ipa = hyp_pfn_to_phys(gfn); @@ -1130,7 +1253,11 @@ unlock: guest_unlock_component(vm); host_unlock_component(); - return ret; + /* + * -EHWPOISON implies that the page was 
forcefully reclaimed already + * so return success for the GUP pin to be dropped. + */ + return ret && ret != -EHWPOISON ? ret : 0; } int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu) diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index 0ba6423cd0d5..cdeefe3d74ff 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -230,10 +230,12 @@ void pkvm_hyp_vm_table_init(void *tbl) /* * Return the hyp vm structure corresponding to the handle. */ -static struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle) +struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle) { unsigned int idx = vm_handle_to_idx(handle); + hyp_assert_lock_held(&vm_table_lock); + if (unlikely(idx >= KVM_MAX_PVMS)) return NULL; From 281a38ad2920b5ccfbbc2a0ca0caeee110ad5d6b Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:27 +0100 Subject: [PATCH 261/373] KVM: arm64: Reclaim faulting page from pKVM in spurious fault handler Host kernel accesses to pages that are inaccessible at stage-2 result in the injection of a translation fault, which is fatal unless an exception table fixup is registered for the faulting PC (e.g. for user access routines). This is undesirable, since a get_user_pages() call could be used to obtain a reference to a donated page and then a subsequent access via a kernel mapping would lead to a panic(). Rework the spurious fault handler so that stage-2 faults injected back into the host result in the target page being forcefully reclaimed when no exception table fixup handler is registered. 
Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-27-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/virt.h | 9 +++++++++ arch/arm64/kvm/pkvm.c | 12 ++++++++++++ arch/arm64/mm/fault.c | 17 +++++++++++------ 3 files changed, 32 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h index b51ab6840f9c..b546703c3ab9 100644 --- a/arch/arm64/include/asm/virt.h +++ b/arch/arm64/include/asm/virt.h @@ -94,6 +94,15 @@ static inline bool is_pkvm_initialized(void) static_branch_likely(&kvm_protected_mode_initialized); } +#ifdef CONFIG_KVM +bool pkvm_force_reclaim_guest_page(phys_addr_t phys); +#else +static inline bool pkvm_force_reclaim_guest_page(phys_addr_t phys) +{ + return false; +} +#endif + /* Reports the availability of HYP mode */ static inline bool is_hyp_mode_available(void) { diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index 3cf23496f225..10edd4965936 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -569,3 +569,15 @@ int pkvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, WARN_ON_ONCE(1); return -EINVAL; } + +/* + * Forcefully reclaim a page from the guest, zeroing its contents and + * poisoning the stage-2 pte so that pages can no longer be mapped at + * the same IPA. The page remains pinned until the guest is destroyed. 
+ */ +bool pkvm_force_reclaim_guest_page(phys_addr_t phys) +{ + int ret = kvm_call_hyp_nvhe(__pkvm_force_reclaim_guest_page, phys); + + return !ret || ret == -EAGAIN; +} diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 3abfc7272d63..7eacc7b45c1f 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -289,9 +289,6 @@ static bool __kprobes is_spurious_el1_translation_fault(unsigned long addr, if (!is_el1_data_abort(esr) || !esr_fsc_is_translation_fault(esr)) return false; - if (is_pkvm_stage2_abort(esr)) - return false; - local_irq_save(flags); asm volatile("at s1e1r, %0" :: "r" (addr)); isb(); @@ -302,8 +299,14 @@ static bool __kprobes is_spurious_el1_translation_fault(unsigned long addr, * If we now have a valid translation, treat the translation fault as * spurious. */ - if (!(par & SYS_PAR_EL1_F)) + if (!(par & SYS_PAR_EL1_F)) { + if (is_pkvm_stage2_abort(esr)) { + par &= SYS_PAR_EL1_PA; + return pkvm_force_reclaim_guest_page(par); + } + return true; + } /* * If we got a different type of fault from the AT instruction, @@ -389,9 +392,11 @@ static void __do_kernel_fault(unsigned long addr, unsigned long esr, if (!is_el1_instruction_abort(esr) && fixup_exception(regs, esr)) return; - if (WARN_RATELIMIT(is_spurious_el1_translation_fault(addr, esr, regs), - "Ignoring spurious kernel translation fault at virtual address %016lx\n", addr)) + if (is_spurious_el1_translation_fault(addr, esr, regs)) { + WARN_RATELIMIT(!is_pkvm_stage2_abort(esr), + "Ignoring spurious kernel translation fault at virtual address %016lx\n", addr); return; + } if (is_el1_mte_sync_tag_check_fault(esr)) { do_tag_recovery(addr, esr, regs); From 5991916392d844ba6ed6c0d320ac6578f52e39b6 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:28 +0100 Subject: [PATCH 262/373] KVM: arm64: Return -EFAULT from VCPU_RUN on access to a poisoned pte If a protected vCPU faults on an IPA which appears to be mapped, query the hypervisor to determine whether or not 
the faulting pte has been poisoned by a forceful reclaim. If the pte has been poisoned, return -EFAULT back to userspace rather than retrying the instruction forever. Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-28-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_asm.h | 1 + arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 + arch/arm64/kvm/hyp/nvhe/hyp-main.c | 10 +++++ arch/arm64/kvm/hyp/nvhe/mem_protect.c | 43 +++++++++++++++++++ arch/arm64/kvm/pkvm.c | 9 ++-- 5 files changed, 61 insertions(+), 3 deletions(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index 04a230e906a7..6c79f7504d80 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -90,6 +90,7 @@ enum __kvm_host_smccc_func { __KVM_HOST_SMCCC_FUNC___pkvm_unreserve_vm, __KVM_HOST_SMCCC_FUNC___pkvm_init_vm, __KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu, + __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_in_poison_fault, __KVM_HOST_SMCCC_FUNC___pkvm_force_reclaim_guest_page, __KVM_HOST_SMCCC_FUNC___pkvm_reclaim_dying_guest_page, __KVM_HOST_SMCCC_FUNC___pkvm_start_teardown_vm, diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h index acc031103600..8bc9a2489298 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h @@ -40,6 +40,7 @@ int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages); int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages); int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages); int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu); +int __pkvm_vcpu_in_poison_fault(struct pkvm_hyp_vcpu *hyp_vcpu); int __pkvm_host_force_reclaim_page_guest(phys_addr_t phys); int __pkvm_host_reclaim_page_guest(u64 gfn, struct pkvm_hyp_vm *vm); int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu *vcpu, diff --git 
a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index 456c83207717..90e3b14fe287 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -573,6 +573,15 @@ static void handle___pkvm_init_vcpu(struct kvm_cpu_context *host_ctxt) cpu_reg(host_ctxt, 1) = __pkvm_init_vcpu(handle, host_vcpu, vcpu_hva); } +static void handle___pkvm_vcpu_in_poison_fault(struct kvm_cpu_context *host_ctxt) +{ + int ret; + struct pkvm_hyp_vcpu *hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); + + ret = hyp_vcpu ? __pkvm_vcpu_in_poison_fault(hyp_vcpu) : -EINVAL; + cpu_reg(host_ctxt, 1) = ret; +} + static void handle___pkvm_force_reclaim_guest_page(struct kvm_cpu_context *host_ctxt) { DECLARE_REG(phys_addr_t, phys, host_ctxt, 1); @@ -641,6 +650,7 @@ static const hcall_t host_hcall[] = { HANDLE_FUNC(__pkvm_unreserve_vm), HANDLE_FUNC(__pkvm_init_vm), HANDLE_FUNC(__pkvm_init_vcpu), + HANDLE_FUNC(__pkvm_vcpu_in_poison_fault), HANDLE_FUNC(__pkvm_force_reclaim_guest_page), HANDLE_FUNC(__pkvm_reclaim_dying_guest_page), HANDLE_FUNC(__pkvm_start_teardown_vm), diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 73bdbd4a508e..31ca3db26fb5 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -890,6 +890,49 @@ static int get_valid_guest_pte(struct pkvm_hyp_vm *vm, u64 ipa, kvm_pte_t *ptep, return 0; } +int __pkvm_vcpu_in_poison_fault(struct pkvm_hyp_vcpu *hyp_vcpu) +{ + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(hyp_vcpu); + kvm_pte_t pte; + s8 level; + u64 ipa; + int ret; + + switch (kvm_vcpu_trap_get_class(&hyp_vcpu->vcpu)) { + case ESR_ELx_EC_DABT_LOW: + case ESR_ELx_EC_IABT_LOW: + if (kvm_vcpu_trap_is_translation_fault(&hyp_vcpu->vcpu)) + break; + fallthrough; + default: + return -EINVAL; + } + + /* + * The host has the faulting IPA when it calls us from the guest + * fault handler but we retrieve it ourselves from the FAR so as + * to avoid exposing an 
"oracle" that could reveal data access + * patterns of the guest after initial donation of its pages. + */ + ipa = kvm_vcpu_get_fault_ipa(&hyp_vcpu->vcpu); + ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(&hyp_vcpu->vcpu)); + + guest_lock_component(vm); + ret = kvm_pgtable_get_leaf(&vm->pgt, ipa, &pte, &level); + if (ret) + goto unlock; + + if (level != KVM_PGTABLE_LAST_LEVEL) { + ret = -EINVAL; + goto unlock; + } + + ret = guest_pte_is_poisoned(pte); +unlock: + guest_unlock_component(vm); + return ret; +} + int __pkvm_host_share_hyp(u64 pfn) { u64 phys = hyp_pfn_to_phys(pfn); diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index 10edd4965936..7f35df29e984 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -423,10 +423,13 @@ int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, return -EINVAL; /* - * We raced with another vCPU. + * We either raced with another vCPU or the guest PTE + * has been poisoned by an erroneous host access. */ - if (mapping) - return -EAGAIN; + if (mapping) { + ret = kvm_call_hyp_nvhe(__pkvm_vcpu_in_poison_fault); + return ret ? -EFAULT : -EAGAIN; + } ret = kvm_call_hyp_nvhe(__pkvm_host_donate_guest, pfn, gfn); } else { From 94c525051542c54907e2d3e9d2b008575829cdc8 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:29 +0100 Subject: [PATCH 263/373] KVM: arm64: Add hvc handler at EL2 for hypercalls from protected VMs Add a hypercall handler at EL2 for hypercalls originating from protected VMs. For now, this implements only the FEATURES and MEMINFO calls, but subsequent patches will implement the SHARE and UNSHARE functions necessary for virtio. Unhandled hypercalls (including PSCI) are passed back to the host. 
Reviewed-by: Vincent Donnefort Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-29-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 1 + arch/arm64/kvm/hyp/nvhe/pkvm.c | 37 ++++++++++++++++++++++++++ arch/arm64/kvm/hyp/nvhe/switch.c | 1 + 3 files changed, 39 insertions(+) diff --git a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h index a5a7bb453f3e..c904647d2f76 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h +++ b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h @@ -88,6 +88,7 @@ struct pkvm_hyp_vm *get_pkvm_hyp_vm(pkvm_handle_t handle); struct pkvm_hyp_vm *get_np_pkvm_hyp_vm(pkvm_handle_t handle); void put_pkvm_hyp_vm(struct pkvm_hyp_vm *hyp_vm); +bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code); bool kvm_handle_pvm_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code); bool kvm_handle_pvm_restricted(struct kvm_vcpu *vcpu, u64 *exit_code); void kvm_init_pvm_id_regs(struct kvm_vcpu *vcpu); diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index cdeefe3d74ff..1f184c4994fa 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -4,6 +4,8 @@ * Author: Fuad Tabba */ +#include + #include #include @@ -971,3 +973,38 @@ err_unlock: hyp_spin_unlock(&vm_table_lock); return err; } +/* + * Handler for protected VM HVC calls. + * + * Returns true if the hypervisor has handled the exit (and control + * should return to the guest) or false if it hasn't (and the handling + * should be performed by the host). 
+ */ +bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code) +{ + u64 val[4] = { SMCCC_RET_INVALID_PARAMETER }; + bool handled = true; + + switch (smccc_get_function(vcpu)) { + case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID: + val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES); + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_HYP_MEMINFO); + break; + case ARM_SMCCC_VENDOR_HYP_KVM_HYP_MEMINFO_FUNC_ID: + if (smccc_get_arg1(vcpu) || + smccc_get_arg2(vcpu) || + smccc_get_arg3(vcpu)) { + break; + } + + val[0] = PAGE_SIZE; + break; + default: + /* Punt everything else back to the host, for now. */ + handled = false; + } + + if (handled) + smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]); + return handled; +} diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c index 779089e42681..51bd88dc6012 100644 --- a/arch/arm64/kvm/hyp/nvhe/switch.c +++ b/arch/arm64/kvm/hyp/nvhe/switch.c @@ -190,6 +190,7 @@ static const exit_handler_fn hyp_exit_handlers[] = { static const exit_handler_fn pvm_exit_handlers[] = { [0 ... ESR_ELx_EC_MAX] = NULL, + [ESR_ELx_EC_HVC64] = kvm_handle_pvm_hvc64, [ESR_ELx_EC_SYS64] = kvm_handle_pvm_sys64, [ESR_ELx_EC_SVE] = kvm_handle_pvm_restricted, [ESR_ELx_EC_FP_ASIMD] = kvm_hyp_handle_fpsimd, From 03313efed5e2ca55e862bf514b907a431ebf642a Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:30 +0100 Subject: [PATCH 264/373] KVM: arm64: Implement the MEM_SHARE hypercall for protected VMs Implement the ARM_SMCCC_KVM_FUNC_MEM_SHARE hypercall to allow protected VMs to share memory (e.g. the swiotlb bounce buffers) back to the host. 
Reviewed-by: Vincent Donnefort Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-30-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 + arch/arm64/kvm/hyp/nvhe/mem_protect.c | 32 ++++++++++ arch/arm64/kvm/hyp/nvhe/pkvm.c | 61 +++++++++++++++++++ 3 files changed, 94 insertions(+) diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h index 8bc9a2489298..fea8aecae5ef 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h @@ -34,6 +34,7 @@ extern unsigned long hyp_nr_cpus; int __pkvm_prot_finalize(void); int __pkvm_host_share_hyp(u64 pfn); +int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn); int __pkvm_host_unshare_hyp(u64 pfn); int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages); int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages); diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 31ca3db26fb5..593eca37f863 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -959,6 +959,38 @@ unlock: return ret; } +int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn) +{ + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); + u64 phys, ipa = hyp_pfn_to_phys(gfn); + kvm_pte_t pte; + int ret; + + host_lock_component(); + guest_lock_component(vm); + + ret = get_valid_guest_pte(vm, ipa, &pte, &phys); + if (ret) + goto unlock; + + ret = -EPERM; + if (pkvm_getstate(kvm_pgtable_stage2_pte_prot(pte)) != PKVM_PAGE_OWNED) + goto unlock; + if (__host_check_page_state_range(phys, PAGE_SIZE, PKVM_NOPAGE)) + goto unlock; + + ret = 0; + WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys, + pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_OWNED), + &vcpu->vcpu.arch.pkvm_memcache, 0)); + WARN_ON(__host_set_page_state_range(phys, PAGE_SIZE, 
PKVM_PAGE_SHARED_BORROWED)); +unlock: + guest_unlock_component(vm); + host_unlock_component(); + + return ret; +} + int __pkvm_host_unshare_hyp(u64 pfn) { u64 phys = hyp_pfn_to_phys(pfn); diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index 1f184c4994fa..408307603863 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -973,6 +973,58 @@ err_unlock: hyp_spin_unlock(&vm_table_lock); return err; } + +static u64 __pkvm_memshare_page_req(struct kvm_vcpu *vcpu, u64 ipa) +{ + u64 elr; + + /* Fake up a data abort (level 3 translation fault on write) */ + vcpu->arch.fault.esr_el2 = (ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT) | + ESR_ELx_WNR | ESR_ELx_FSC_FAULT | + FIELD_PREP(ESR_ELx_FSC_LEVEL, 3); + + /* Shuffle the IPA around into the HPFAR */ + vcpu->arch.fault.hpfar_el2 = (HPFAR_EL2_NS | (ipa >> 8)) & HPFAR_MASK; + + /* This is a virtual address. 0's good. Let's go with 0. */ + vcpu->arch.fault.far_el2 = 0; + + /* Rewind the ELR so we return to the HVC once the IPA is mapped */ + elr = read_sysreg(elr_el2); + elr -= 4; + write_sysreg(elr, elr_el2); + + return ARM_EXCEPTION_TRAP; +} + +static bool pkvm_memshare_call(u64 *ret, struct kvm_vcpu *vcpu, u64 *exit_code) +{ + struct pkvm_hyp_vcpu *hyp_vcpu; + u64 ipa = smccc_get_arg1(vcpu); + + if (!PAGE_ALIGNED(ipa)) + goto out_guest; + + hyp_vcpu = container_of(vcpu, struct pkvm_hyp_vcpu, vcpu); + switch (__pkvm_guest_share_host(hyp_vcpu, hyp_phys_to_pfn(ipa))) { + case 0: + ret[0] = SMCCC_RET_SUCCESS; + goto out_guest; + case -ENOENT: + /* + * Convert the exception into a data abort so that the page + * being shared is mapped into the guest next time. + */ + *exit_code = __pkvm_memshare_page_req(vcpu, ipa); + goto out_host; + } + +out_guest: + return true; +out_host: + return false; +} + /* * Handler for protected VM HVC calls. 
* @@ -989,6 +1041,7 @@ bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code) case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID: val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES); val[0] |= BIT(ARM_SMCCC_KVM_FUNC_HYP_MEMINFO); + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_SHARE); break; case ARM_SMCCC_VENDOR_HYP_KVM_HYP_MEMINFO_FUNC_ID: if (smccc_get_arg1(vcpu) || @@ -999,6 +1052,14 @@ bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code) val[0] = PAGE_SIZE; break; + case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID: + if (smccc_get_arg2(vcpu) || + smccc_get_arg3(vcpu)) { + break; + } + + handled = pkvm_memshare_call(val, vcpu, exit_code); + break; default: /* Punt everything else back to the host, for now. */ handled = false; From 246c976c370de9380660e2bb641758dc0aae8c5c Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:31 +0100 Subject: [PATCH 265/373] KVM: arm64: Implement the MEM_UNSHARE hypercall for protected VMs Implement the ARM_SMCCC_KVM_FUNC_MEM_UNSHARE hypercall to allow protected VMs to unshare memory that was previously shared with the host using the ARM_SMCCC_KVM_FUNC_MEM_SHARE hypercall. 
Reviewed-by: Vincent Donnefort Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-31-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 + arch/arm64/kvm/hyp/nvhe/mem_protect.c | 34 +++++++++++++++++++ arch/arm64/kvm/hyp/nvhe/pkvm.c | 22 ++++++++++++ 3 files changed, 57 insertions(+) diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h index fea8aecae5ef..99d8398afe20 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h @@ -35,6 +35,7 @@ extern unsigned long hyp_nr_cpus; int __pkvm_prot_finalize(void); int __pkvm_host_share_hyp(u64 pfn); int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn); +int __pkvm_guest_unshare_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn); int __pkvm_host_unshare_hyp(u64 pfn); int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages); int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages); diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 593eca37f863..db94323b430c 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -991,6 +991,40 @@ unlock: return ret; } +int __pkvm_guest_unshare_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn) +{ + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); + u64 meta, phys, ipa = hyp_pfn_to_phys(gfn); + kvm_pte_t pte; + int ret; + + host_lock_component(); + guest_lock_component(vm); + + ret = get_valid_guest_pte(vm, ipa, &pte, &phys); + if (ret) + goto unlock; + + ret = -EPERM; + if (pkvm_getstate(kvm_pgtable_stage2_pte_prot(pte)) != PKVM_PAGE_SHARED_OWNED) + goto unlock; + if (__host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED)) + goto unlock; + + ret = 0; + meta = host_stage2_encode_gfn_meta(vm, gfn); + WARN_ON(host_stage2_set_owner_metadata_locked(phys, PAGE_SIZE, + 
PKVM_ID_GUEST, meta)); + WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys, + pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED), + &vcpu->vcpu.arch.pkvm_memcache, 0)); +unlock: + guest_unlock_component(vm); + host_unlock_component(); + + return ret; +} + int __pkvm_host_unshare_hyp(u64 pfn) { u64 phys = hyp_pfn_to_phys(pfn); diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index 408307603863..6f3b94a37fe3 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -1025,6 +1025,19 @@ out_host: return false; } +static void pkvm_memunshare_call(u64 *ret, struct kvm_vcpu *vcpu) +{ + struct pkvm_hyp_vcpu *hyp_vcpu; + u64 ipa = smccc_get_arg1(vcpu); + + if (!PAGE_ALIGNED(ipa)) + return; + + hyp_vcpu = container_of(vcpu, struct pkvm_hyp_vcpu, vcpu); + if (!__pkvm_guest_unshare_host(hyp_vcpu, hyp_phys_to_pfn(ipa))) + ret[0] = SMCCC_RET_SUCCESS; +} + /* * Handler for protected VM HVC calls. * @@ -1042,6 +1055,7 @@ bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code) val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES); val[0] |= BIT(ARM_SMCCC_KVM_FUNC_HYP_MEMINFO); val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_SHARE); + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_UNSHARE); break; case ARM_SMCCC_VENDOR_HYP_KVM_HYP_MEMINFO_FUNC_ID: if (smccc_get_arg1(vcpu) || @@ -1060,6 +1074,14 @@ bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code) handled = pkvm_memshare_call(val, vcpu, exit_code); break; + case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID: + if (smccc_get_arg2(vcpu) || + smccc_get_arg3(vcpu)) { + break; + } + + pkvm_memunshare_call(val, vcpu); + break; default: /* Punt everything else back to the host, for now. 
*/ handled = false; From 8800dbf6614aad1013ea5f348520a2ce5ba4b6c8 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:32 +0100 Subject: [PATCH 266/373] KVM: arm64: Allow userspace to create protected VMs when pKVM is enabled Introduce a new VM type for KVM/arm64 to allow userspace to request the creation of a "protected VM" when the host has booted with pKVM enabled. For now, this feature results in a taint on first use as many aspects of a protected VM are not yet protected! Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-32-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_pkvm.h | 2 +- arch/arm64/kvm/arm.c | 8 +++++++- arch/arm64/kvm/mmu.c | 3 --- arch/arm64/kvm/pkvm.c | 8 +++++++- include/uapi/linux/kvm.h | 5 +++++ 5 files changed, 20 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h index 7041e398fb4c..2954b311128c 100644 --- a/arch/arm64/include/asm/kvm_pkvm.h +++ b/arch/arm64/include/asm/kvm_pkvm.h @@ -17,7 +17,7 @@ #define HYP_MEMBLOCK_REGIONS 128 -int pkvm_init_host_vm(struct kvm *kvm); +int pkvm_init_host_vm(struct kvm *kvm, unsigned long type); int pkvm_create_hyp_vm(struct kvm *kvm); bool pkvm_hyp_vm_is_created(struct kvm *kvm); void pkvm_destroy_hyp_vm(struct kvm *kvm); diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 3589fc08266c..c2b666a46893 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -203,6 +203,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) { int ret; + if (type & ~KVM_VM_TYPE_ARM_MASK) + return -EINVAL; + mutex_init(&kvm->arch.config_lock); #ifdef CONFIG_LOCKDEP @@ -234,9 +237,12 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) * If any failures occur after this is successful, make sure to * call __pkvm_unreserve_vm to unreserve the VM in hyp. 
*/ - ret = pkvm_init_host_vm(kvm); + ret = pkvm_init_host_vm(kvm, type); if (ret) goto err_uninit_mmu; + } else if (type & KVM_VM_TYPE_ARM_PROTECTED) { + ret = -EINVAL; + goto err_uninit_mmu; } kvm_vgic_early_init(kvm); diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 6a4151e3e4a3..45358ae8a300 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -881,9 +881,6 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type) u64 mmfr0, mmfr1; u32 phys_shift; - if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK) - return -EINVAL; - phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type); if (is_protected_kvm_enabled()) { phys_shift = kvm_ipa_limit; diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index 7f35df29e984..053e4f733e4b 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -225,9 +225,10 @@ void pkvm_destroy_hyp_vm(struct kvm *kvm) mutex_unlock(&kvm->arch.config_lock); } -int pkvm_init_host_vm(struct kvm *kvm) +int pkvm_init_host_vm(struct kvm *kvm, unsigned long type) { int ret; + bool protected = type & KVM_VM_TYPE_ARM_PROTECTED; if (pkvm_hyp_vm_is_created(kvm)) return -EINVAL; @@ -242,6 +243,11 @@ int pkvm_init_host_vm(struct kvm *kvm) return ret; kvm->arch.pkvm.handle = ret; + kvm->arch.pkvm.is_protected = protected; + if (protected) { + pr_warn_once("kvm: protected VMs are experimental and for development only, tainting kernel\n"); + add_taint(TAINT_USER, LOCKDEP_STILL_OK); + } return 0; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 80364d4dbebb..073b2bcaf560 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -703,6 +703,11 @@ struct kvm_enable_cap { #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL #define KVM_VM_TYPE_ARM_IPA_SIZE(x) \ ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK) + +#define KVM_VM_TYPE_ARM_PROTECTED (1UL << 31) +#define KVM_VM_TYPE_ARM_MASK (KVM_VM_TYPE_ARM_IPA_SIZE_MASK | \ + KVM_VM_TYPE_ARM_PROTECTED) + /* * ioctls for /dev/kvm fds: */ From 
287c6981f12a008bafc46f18a3e48540a1172a52 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:33 +0100 Subject: [PATCH 267/373] KVM: arm64: Add some initial documentation for pKVM Add some initial documentation for pKVM to help people understand what is supported, the limitations of protected VMs when compared to non-protected VMs and also what is left to do. Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-33-will@kernel.org Signed-off-by: Marc Zyngier --- .../admin-guide/kernel-parameters.txt | 4 +- Documentation/virt/kvm/arm/index.rst | 1 + Documentation/virt/kvm/arm/pkvm.rst | 106 ++++++++++++++++++ 3 files changed, 109 insertions(+), 2 deletions(-) create mode 100644 Documentation/virt/kvm/arm/pkvm.rst diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 03a550630644..44854a67bc63 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3247,8 +3247,8 @@ Kernel parameters for the host. To force nVHE on VHE hardware, add "arm64_sw.hvhe=0 id_aa64mmfr1.vh=0" to the command-line. - "nested" is experimental and should be used with - extreme caution. + "nested" and "protected" are experimental and should be + used with extreme caution. 
kvm-arm.vgic_v3_group0_trap= [KVM,ARM,EARLY] Trap guest accesses to GICv3 group-0 diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst index ec09881de4cf..0856b4942e05 100644 --- a/Documentation/virt/kvm/arm/index.rst +++ b/Documentation/virt/kvm/arm/index.rst @@ -10,6 +10,7 @@ ARM fw-pseudo-registers hyp-abi hypercalls + pkvm pvtime ptp_kvm vcpu-features diff --git a/Documentation/virt/kvm/arm/pkvm.rst b/Documentation/virt/kvm/arm/pkvm.rst new file mode 100644 index 000000000000..514992a79a83 --- /dev/null +++ b/Documentation/virt/kvm/arm/pkvm.rst @@ -0,0 +1,106 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Protected KVM (pKVM) +==================== + +**NOTE**: pKVM is currently an experimental, development feature and +subject to breaking changes as new isolation features are implemented. +Please reach out to the developers at kvmarm@lists.linux.dev if you have +any questions. + +Overview +======== + +Booting a host kernel with '``kvm-arm.mode=protected``' enables +"Protected KVM" (pKVM). During boot, pKVM installs a stage-2 identity +map page-table for the host and uses it to isolate the hypervisor +running at EL2 from the rest of the host running at EL1/0. + +pKVM permits creation of protected virtual machines (pVMs) by passing +the ``KVM_VM_TYPE_ARM_PROTECTED`` machine type identifier to the +``KVM_CREATE_VM`` ioctl(). The hypervisor isolates pVMs from the host by +unmapping pages from the stage-2 identity map as they are accessed by a +pVM. Hypercalls are provided for a pVM to share specific regions of its +IPA space back with the host, allowing for communication with the VMM. +A Linux guest must be configured with ``CONFIG_ARM_PKVM_GUEST=y`` in +order to issue these hypercalls. + +See hypercalls.rst for more details. 
+ +Isolation mechanisms +==================== + +pKVM relies on a number of mechanisms to isolate pVMs from the host: + +CPU memory isolation +-------------------- + +Status: Isolation of anonymous memory and metadata pages. + +Metadata pages (e.g. page-table pages and '``struct kvm_vcpu``' pages) +are donated from the host to the hypervisor during pVM creation and +are consequently unmapped from the stage-2 identity map until the pVM is +destroyed. + +Similarly to regular KVM, pages are lazily mapped into the guest in +response to stage-2 page faults handled by the host. However, when +running a pVM, these pages are first pinned and then unmapped from the +stage-2 identity map as part of the donation procedure. This gives rise +to some user-visible differences when compared to non-protected VMs, +largely due to the lack of MMU notifiers: + +* Memslots cannot be moved or deleted once the pVM has started running. +* Read-only memslots and dirty logging are not supported. +* With the exception of swap, file-backed pages cannot be mapped into a + pVM. +* Donated pages are accounted against ``RLIMIT_MLOCK`` and so the VMM + must have a sufficient resource limit or be granted ``CAP_IPC_LOCK``. + The lack of a runtime reclaim mechanism means that memory locked for + a pVM will remain locked until the pVM is destroyed. +* Changes to the VMM address space (e.g. a ``MAP_FIXED`` mmap() over a + mapping associated with a memslot) are not reflected in the guest and + may lead to loss of coherency. +* Accessing pVM memory that has not been shared back will result in the + delivery of a SIGSEGV. +* If a system call accesses pVM memory that has not been shared back + then it will either return ``-EFAULT`` or forcefully reclaim the + memory pages. Reclaimed memory is zeroed by the hypervisor and a + subsequent attempt to access it in the pVM will return ``-EFAULT`` + from the ``KVM_RUN`` ioctl().
+ +CPU state isolation +------------------- + +Status: **Unimplemented.** + +DMA isolation using an IOMMU +---------------------------- + +Status: **Unimplemented.** + +Proxying of Trustzone services +------------------------------ + +Status: FF-A and PSCI calls from the host are proxied by the pKVM +hypervisor. + +The FF-A proxy ensures that the host cannot share pVM or hypervisor +memory with Trustzone as part of a "confused deputy" attack. + +The PSCI proxy ensures that CPUs always have the stage-2 identity map +installed when they are executing in the host. + +Protected VM firmware (pvmfw) +----------------------------- + +Status: **Unimplemented.** + +Resources +========= + +Quentin Perret's KVM Forum 2022 talk entitled "Protected KVM on arm64: A +technical deep dive" remains a good resource for learning more about +pKVM, despite some of the details having changed in the meantime: + +https://www.youtube.com/watch?v=9npebeVFbFw From c290df5278fe8c9844b93751620664c3ca3b6e00 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:34 +0100 Subject: [PATCH 268/373] KVM: arm64: Extend pKVM page ownership selftests to cover guest donation Extend the pKVM page ownership selftests to donate and reclaim a page to/from a guest. 
Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-34-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index db94323b430c..13d2cb2f5fab 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -1751,6 +1751,7 @@ void pkvm_ownership_selftest(void *base) assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size); assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); selftest_state.host = PKVM_PAGE_OWNED; selftest_state.hyp = PKVM_NOPAGE; @@ -1770,6 +1771,7 @@ void pkvm_ownership_selftest(void *base) assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); assert_transition_res(0, hyp_pin_shared_mem, virt, virt + size); assert_transition_res(0, hyp_pin_shared_mem, virt, virt + size); @@ -1782,6 +1784,7 @@ void pkvm_ownership_selftest(void *base) assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); hyp_unpin_shared_mem(virt, virt + size); assert_page_state(); @@ -1801,6 +1804,7 @@ void pkvm_ownership_selftest(void *base) assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); assert_transition_res(-EPERM, 
__pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size); selftest_state.host = PKVM_PAGE_OWNED; @@ -1817,6 +1821,7 @@ void pkvm_ownership_selftest(void *base) assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size); selftest_state.guest[1] = PKVM_PAGE_SHARED_BORROWED; @@ -1830,6 +1835,23 @@ void pkvm_ownership_selftest(void *base) selftest_state.host = PKVM_PAGE_OWNED; assert_transition_res(0, __pkvm_host_unshare_guest, gfn + 1, 1, vm); + selftest_state.host = PKVM_NOPAGE; + selftest_state.guest[0] = PKVM_PAGE_OWNED; + assert_transition_res(0, __pkvm_host_donate_guest, pfn, gfn, vcpu); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot); + assert_transition_res(-EPERM, __pkvm_host_share_ffa, pfn, 1); + assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1); + assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); + assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); + assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); + + selftest_state.host = PKVM_PAGE_OWNED; + selftest_state.guest[0] = PKVM_NOPAGE; + assert_transition_res(0, __pkvm_host_reclaim_page_guest, gfn, vm); + selftest_state.host = PKVM_NOPAGE; selftest_state.hyp = PKVM_PAGE_OWNED; assert_transition_res(0, __pkvm_host_donate_hyp, pfn, 
1); From 8972a991606bc021249ae53a95c131a79fdeda4a Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:35 +0100 Subject: [PATCH 269/373] KVM: arm64: Register 'selftest_vm' in the VM table In preparation for extending the pKVM page ownership selftests to cover forceful reclaim of donated pages, rework the creation of the 'selftest_vm' so that it is registered in the VM table while the tests are running. Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-35-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 + arch/arm64/kvm/hyp/nvhe/mem_protect.c | 53 ++++--------------- arch/arm64/kvm/hyp/nvhe/pkvm.c | 49 +++++++++++++++++ 3 files changed, 61 insertions(+), 43 deletions(-) diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h index 99d8398afe20..5031879ccb87 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h @@ -76,6 +76,8 @@ static __always_inline void __load_host_stage2(void) #ifdef CONFIG_NVHE_EL2_DEBUG void pkvm_ownership_selftest(void *base); +struct pkvm_hyp_vcpu *init_selftest_vm(void *virt); +void teardown_selftest_vm(void); #else static inline void pkvm_ownership_selftest(void *base) { } #endif diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 13d2cb2f5fab..d8f8ebe59129 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -1648,53 +1648,18 @@ struct pkvm_expected_state { static struct pkvm_expected_state selftest_state; static struct hyp_page *selftest_page; - -static struct pkvm_hyp_vm selftest_vm = { - .kvm = { - .arch = { - .mmu = { - .arch = &selftest_vm.kvm.arch, - .pgt = &selftest_vm.pgt, - }, - }, - }, -}; - -static struct pkvm_hyp_vcpu selftest_vcpu = { - .vcpu = { - .arch = { - .hw_mmu = 
&selftest_vm.kvm.arch.mmu, - }, - .kvm = &selftest_vm.kvm, - }, -}; - -static void init_selftest_vm(void *virt) -{ - struct hyp_page *p = hyp_virt_to_page(virt); - int i; - - selftest_vm.kvm.arch.mmu.vtcr = host_mmu.arch.mmu.vtcr; - WARN_ON(kvm_guest_prepare_stage2(&selftest_vm, virt)); - - for (i = 0; i < pkvm_selftest_pages(); i++) { - if (p[i].refcount) - continue; - p[i].refcount = 1; - hyp_put_page(&selftest_vm.pool, hyp_page_to_virt(&p[i])); - } -} +static struct pkvm_hyp_vcpu *selftest_vcpu; static u64 selftest_ipa(void) { - return BIT(selftest_vm.pgt.ia_bits - 1); + return BIT(selftest_vcpu->vcpu.arch.hw_mmu->pgt->ia_bits - 1); } static void assert_page_state(void) { void *virt = hyp_page_to_virt(selftest_page); u64 size = PAGE_SIZE << selftest_page->order; - struct pkvm_hyp_vcpu *vcpu = &selftest_vcpu; + struct pkvm_hyp_vcpu *vcpu = selftest_vcpu; u64 phys = hyp_virt_to_phys(virt); u64 ipa[2] = { selftest_ipa(), selftest_ipa() + PAGE_SIZE }; struct pkvm_hyp_vm *vm; @@ -1709,10 +1674,10 @@ static void assert_page_state(void) WARN_ON(__hyp_check_page_state_range(phys, size, selftest_state.hyp)); hyp_unlock_component(); - guest_lock_component(&selftest_vm); + guest_lock_component(vm); WARN_ON(__guest_check_page_state_range(vm, ipa[0], size, selftest_state.guest[0])); WARN_ON(__guest_check_page_state_range(vm, ipa[1], size, selftest_state.guest[1])); - guest_unlock_component(&selftest_vm); + guest_unlock_component(vm); } #define assert_transition_res(res, fn, ...) 
\ @@ -1725,14 +1690,15 @@ void pkvm_ownership_selftest(void *base) { enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RWX; void *virt = hyp_alloc_pages(&host_s2_pool, 0); - struct pkvm_hyp_vcpu *vcpu = &selftest_vcpu; - struct pkvm_hyp_vm *vm = &selftest_vm; + struct pkvm_hyp_vcpu *vcpu; u64 phys, size, pfn, gfn; + struct pkvm_hyp_vm *vm; WARN_ON(!virt); selftest_page = hyp_virt_to_page(virt); selftest_page->refcount = 0; - init_selftest_vm(base); + selftest_vcpu = vcpu = init_selftest_vm(base); + vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); size = PAGE_SIZE << selftest_page->order; phys = hyp_virt_to_phys(virt); @@ -1856,6 +1822,7 @@ void pkvm_ownership_selftest(void *base) selftest_state.hyp = PKVM_PAGE_OWNED; assert_transition_res(0, __pkvm_host_donate_hyp, pfn, 1); + teardown_selftest_vm(); selftest_page->refcount = 1; hyp_put_page(&host_s2_pool, virt); } diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index 6f3b94a37fe3..8b906217c4c3 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -733,6 +733,55 @@ void __pkvm_unreserve_vm(pkvm_handle_t handle) hyp_spin_unlock(&vm_table_lock); } +#ifdef CONFIG_NVHE_EL2_DEBUG +static struct pkvm_hyp_vm selftest_vm = { + .kvm = { + .arch = { + .mmu = { + .arch = &selftest_vm.kvm.arch, + .pgt = &selftest_vm.pgt, + }, + }, + }, +}; + +static struct pkvm_hyp_vcpu selftest_vcpu = { + .vcpu = { + .arch = { + .hw_mmu = &selftest_vm.kvm.arch.mmu, + }, + .kvm = &selftest_vm.kvm, + }, +}; + +struct pkvm_hyp_vcpu *init_selftest_vm(void *virt) +{ + struct hyp_page *p = hyp_virt_to_page(virt); + int i; + + selftest_vm.kvm.arch.mmu.vtcr = host_mmu.arch.mmu.vtcr; + WARN_ON(kvm_guest_prepare_stage2(&selftest_vm, virt)); + + for (i = 0; i < pkvm_selftest_pages(); i++) { + if (p[i].refcount) + continue; + p[i].refcount = 1; + hyp_put_page(&selftest_vm.pool, hyp_page_to_virt(&p[i])); + } + + selftest_vm.kvm.arch.pkvm.handle = __pkvm_reserve_vm(); + 
insert_vm_table_entry(selftest_vm.kvm.arch.pkvm.handle, &selftest_vm); + return &selftest_vcpu; +} + +void teardown_selftest_vm(void) +{ + hyp_spin_lock(&vm_table_lock); + remove_vm_table_entry(selftest_vm.kvm.arch.pkvm.handle); + hyp_spin_unlock(&vm_table_lock); +} +#endif /* CONFIG_NVHE_EL2_DEBUG */ + /* * Initialize the hypervisor copy of the VM state using host-donated memory. * From f4a5a6770af9b87a8e87717e1b97af052979f4ec Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:36 +0100 Subject: [PATCH 270/373] KVM: arm64: Extend pKVM page ownership selftests to cover forced reclaim Extend the pKVM page ownership selftests to forcefully reclaim a donated page and check that it cannot be re-donated at the same IPA. Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-36-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index d8f8ebe59129..0775a82ee4fe 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -1815,8 +1815,20 @@ void pkvm_ownership_selftest(void *base) assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); selftest_state.host = PKVM_PAGE_OWNED; - selftest_state.guest[0] = PKVM_NOPAGE; - assert_transition_res(0, __pkvm_host_reclaim_page_guest, gfn, vm); + selftest_state.guest[0] = PKVM_POISON; + assert_transition_res(0, __pkvm_host_force_reclaim_page_guest, phys); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); + + selftest_state.host = PKVM_NOPAGE; + selftest_state.guest[1] = PKVM_PAGE_OWNED; + assert_transition_res(0, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); + + selftest_state.host = PKVM_PAGE_OWNED; + 
selftest_state.guest[1] = PKVM_NOPAGE; + assert_transition_res(0, __pkvm_host_reclaim_page_guest, gfn + 1, vm); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); selftest_state.host = PKVM_NOPAGE; selftest_state.hyp = PKVM_PAGE_OWNED; From feae58b6ee45096b1d8c29076f0d8098d9788e57 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:37 +0100 Subject: [PATCH 271/373] KVM: arm64: Extend pKVM page ownership selftests to cover guest hvcs Now that the guest can share and unshare memory with the host using hypercalls, extend the pKVM page ownership selftest to exercise these new transitions. Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-37-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 30 +++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 0775a82ee4fe..28a471d1927c 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -1814,11 +1814,41 @@ void pkvm_ownership_selftest(void *base) assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); + selftest_state.host = PKVM_PAGE_SHARED_BORROWED; + selftest_state.guest[0] = PKVM_PAGE_SHARED_OWNED; + assert_transition_res(0, __pkvm_guest_share_host, vcpu, gfn); + assert_transition_res(-EPERM, __pkvm_guest_share_host, vcpu, gfn); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot); + assert_transition_res(-EPERM, 
__pkvm_host_share_ffa, pfn, 1); + assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1); + assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); + assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); + assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); + + selftest_state.host = PKVM_NOPAGE; + selftest_state.guest[0] = PKVM_PAGE_OWNED; + assert_transition_res(0, __pkvm_guest_unshare_host, vcpu, gfn); + assert_transition_res(-EPERM, __pkvm_guest_unshare_host, vcpu, gfn); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot); + assert_transition_res(-EPERM, __pkvm_host_share_ffa, pfn, 1); + assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1); + assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); + assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); + assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); + selftest_state.host = PKVM_PAGE_OWNED; selftest_state.guest[0] = PKVM_POISON; assert_transition_res(0, __pkvm_host_force_reclaim_page_guest, phys); assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); + assert_transition_res(-EHWPOISON, __pkvm_guest_share_host, vcpu, gfn); + assert_transition_res(-EHWPOISON, __pkvm_guest_unshare_host, vcpu, gfn); selftest_state.host = PKVM_NOPAGE; selftest_state.guest[1] = PKVM_PAGE_OWNED; From 5bae7bc6360a7297e0be2c37017fe863b965646d Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:38 +0100 Subject: [PATCH 272/373] KVM: arm64: Rename PKVM_PAGE_STATE_MASK Rename PKVM_PAGE_STATE_MASK to PKVM_PAGE_STATE_VMEMMAP_MASK to make it clear that the mask applies to the page state recorded in 
the entries of the 'hyp_vmemmap', rather than page states stored elsewhere (e.g. in the ptes). Suggested-by: Alexandru Elisei Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Tested-by: Mostafa Saleh Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-38-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/include/nvhe/memory.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h index 4cedb720c75d..b50712d47f6d 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/memory.h +++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h @@ -37,7 +37,7 @@ enum pkvm_page_state { */ PKVM_POISON = BIT(2), }; -#define PKVM_PAGE_STATE_MASK (BIT(0) | BIT(1)) +#define PKVM_PAGE_STATE_VMEMMAP_MASK (BIT(0) | BIT(1)) #define PKVM_PAGE_STATE_PROT_MASK (KVM_PGTABLE_PROT_SW0 | KVM_PGTABLE_PROT_SW1) static inline enum kvm_pgtable_prot pkvm_mkstate(enum kvm_pgtable_prot prot, @@ -114,12 +114,12 @@ static inline void set_host_state(struct hyp_page *p, enum pkvm_page_state state static inline enum pkvm_page_state get_hyp_state(struct hyp_page *p) { - return p->__hyp_state_comp ^ PKVM_PAGE_STATE_MASK; + return p->__hyp_state_comp ^ PKVM_PAGE_STATE_VMEMMAP_MASK; } static inline void set_hyp_state(struct hyp_page *p, enum pkvm_page_state state) { - p->__hyp_state_comp = state ^ PKVM_PAGE_STATE_MASK; + p->__hyp_state_comp = state ^ PKVM_PAGE_STATE_VMEMMAP_MASK; } /* From 61135967fa76d37883d90ccccc5a1cb73e90b94d Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 30 Mar 2026 15:48:39 +0100 Subject: [PATCH 273/373] drivers/virt: pkvm: Add Kconfig dependency on DMA_RESTRICTED_POOL pKVM guests practically rely on CONFIG_DMA_RESTRICTED_POOL=y in order to establish shared memory regions with the host for virtio buffers. 
Make CONFIG_ARM_PKVM_GUEST depend on CONFIG_DMA_RESTRICTED_POOL to avoid the inevitable segmentation faults experienced if you have the former but not the latter. Reported-by: Marc Zyngier Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260330144841.26181-39-will@kernel.org Signed-off-by: Marc Zyngier --- drivers/virt/coco/pkvm-guest/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/virt/coco/pkvm-guest/Kconfig b/drivers/virt/coco/pkvm-guest/Kconfig index d2f344f1f98f..928b8e1668cc 100644 --- a/drivers/virt/coco/pkvm-guest/Kconfig +++ b/drivers/virt/coco/pkvm-guest/Kconfig @@ -1,6 +1,6 @@ config ARM_PKVM_GUEST bool "Arm pKVM protected guest driver" - depends on ARM64 + depends on ARM64 && DMA_RESTRICTED_POOL help Protected guests running under the pKVM hypervisor on arm64 are isolated from the host and must issue hypercalls to enable From 2623c96f1172ae249b67de1dfc4eacebc8673876 Mon Sep 17 00:00:00 2001 From: Eric Farman Date: Wed, 25 Feb 2026 16:20:13 +0100 Subject: [PATCH 274/373] KVM: s390: only deliver service interrupt with payload Routine __inject_service() may set both the SERVICE and SERVICE_EV pending bits, and in the case of a pure service event the corresponding trip through __deliver_service_ev() will clear the SERVICE_EV bit only. This necessitates an additional trip through __deliver_service() for the other pending interrupt bit, however it is possible that the external interrupt parameters are zero and there is nothing to be delivered to the guest. To avoid sending empty data to the guest, let's only write out the SCLP data when there is something for the guest to do, otherwise bail out.
Signed-off-by: Eric Farman Acked-by: Christian Borntraeger Signed-off-by: Christian Borntraeger Signed-off-by: Janosch Frank --- arch/s390/kvm/interrupt.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c index 18932a65ca68..dd0413387a9e 100644 --- a/arch/s390/kvm/interrupt.c +++ b/arch/s390/kvm/interrupt.c @@ -956,6 +956,9 @@ static int __must_check __deliver_service(struct kvm_vcpu *vcpu) set_bit(IRQ_PEND_EXT_SERVICE, &fi->masked_irqs); spin_unlock(&fi->lock); + if (!ext.ext_params) + return 0; + VCPU_EVENT(vcpu, 4, "deliver: sclp parameter 0x%x", ext.ext_params); vcpu->stat.deliver_service_signal++; From 1653545abc6835ab723c02697a5e2964e98e2c53 Mon Sep 17 00:00:00 2001 From: Janosch Frank Date: Mon, 23 Mar 2026 15:35:22 +0000 Subject: [PATCH 275/373] KVM: s390: Fix lpsw/e breaking event handling LPSW and LPSWE need to set the gbea on completion but currently don't. Time to fix this up. LPSWEY was designed to not set the bear. Fixes: 48a3e950f4cee ("KVM: s390: Add support for machine checks.") Reported-by: Christian Borntraeger Reviewed-by: Claudio Imbrenda Reviewed-by: Christian Borntraeger Signed-off-by: Janosch Frank --- arch/s390/kvm/priv.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c index a3250ad83a8e..cc0553da14cb 100644 --- a/arch/s390/kvm/priv.c +++ b/arch/s390/kvm/priv.c @@ -714,12 +714,13 @@ int kvm_s390_handle_lpsw(struct kvm_vcpu *vcpu) { psw_t *gpsw = &vcpu->arch.sie_block->gpsw; psw32_t new_psw; - u64 addr; + u64 addr, iaddr; int rc; u8 ar; vcpu->stat.instruction_lpsw++; + iaddr = gpsw->addr - kvm_s390_get_ilen(vcpu); if (gpsw->mask & PSW_MASK_PSTATE) return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); @@ -737,18 +738,20 @@ int kvm_s390_handle_lpsw(struct kvm_vcpu *vcpu) gpsw->addr = new_psw.addr & ~PSW32_ADDR_AMODE; if (!is_valid_psw(gpsw)) return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION); + 
vcpu->arch.sie_block->gbea = iaddr; return 0; } static int handle_lpswe(struct kvm_vcpu *vcpu) { psw_t new_psw; - u64 addr; + u64 addr, iaddr; int rc; u8 ar; vcpu->stat.instruction_lpswe++; + iaddr = vcpu->arch.sie_block->gpsw.addr - kvm_s390_get_ilen(vcpu); if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE) return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); @@ -761,6 +764,7 @@ static int handle_lpswe(struct kvm_vcpu *vcpu) vcpu->arch.sie_block->gpsw = new_psw; if (!is_valid_psw(&vcpu->arch.sie_block->gpsw)) return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION); + vcpu->arch.sie_block->gbea = iaddr; return 0; } From 760299a1d8102b36bed5c25c5a3f94b5a0eee081 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Tue, 31 Mar 2026 16:50:53 +0100 Subject: [PATCH 276/373] KVM: arm64: Prevent teardown finalisation of referenced 'hyp_vm' Destroying a 'hyp_vm' with an elevated reference count in __pkvm_finalize_teardown_vm() is only going to lead to tears. In preparation for allowing limited references to be acquired on dying VMs during the teardown process, factor out the handle-to-vm logic for the teardown path and reuse it for both the 'start' and 'finalise' stages of the teardown process.
Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260331155056.28220-2-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/pkvm.c | 32 ++++++++++++++++++-------------- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index 8b906217c4c3..3fd3b930beeb 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -936,20 +936,27 @@ int __pkvm_reclaim_dying_guest_page(pkvm_handle_t handle, u64 gfn) return ret; } +static struct pkvm_hyp_vm *get_pkvm_unref_hyp_vm_locked(pkvm_handle_t handle) +{ + struct pkvm_hyp_vm *hyp_vm; + + hyp_assert_lock_held(&vm_table_lock); + + hyp_vm = get_vm_by_handle(handle); + if (!hyp_vm || hyp_page_count(hyp_vm)) + return NULL; + + return hyp_vm; +} + int __pkvm_start_teardown_vm(pkvm_handle_t handle) { struct pkvm_hyp_vm *hyp_vm; int ret = 0; hyp_spin_lock(&vm_table_lock); - hyp_vm = get_vm_by_handle(handle); - if (!hyp_vm) { - ret = -ENOENT; - goto unlock; - } else if (WARN_ON(hyp_page_count(hyp_vm))) { - ret = -EBUSY; - goto unlock; - } else if (hyp_vm->kvm.arch.pkvm.is_dying) { + hyp_vm = get_pkvm_unref_hyp_vm_locked(handle); + if (!hyp_vm || hyp_vm->kvm.arch.pkvm.is_dying) { ret = -EINVAL; goto unlock; } @@ -971,12 +978,9 @@ int __pkvm_finalize_teardown_vm(pkvm_handle_t handle) int err; hyp_spin_lock(&vm_table_lock); - hyp_vm = get_vm_by_handle(handle); - if (!hyp_vm) { - err = -ENOENT; - goto err_unlock; - } else if (!hyp_vm->kvm.arch.pkvm.is_dying) { - err = -EBUSY; + hyp_vm = get_pkvm_unref_hyp_vm_locked(handle); + if (!hyp_vm || !hyp_vm->kvm.arch.pkvm.is_dying) { + err = -EINVAL; goto err_unlock; } From 2400696883870ec3fb0fb9925426c62a3383ca36 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Tue, 31 Mar 2026 16:50:54 +0100 Subject: [PATCH 277/373] KVM: arm64: Allow get_pkvm_hyp_vm() to take a reference to a dying VM Now that completion of the teardown path requires a refcount of zero for the target VM, 
we can allow get_pkvm_hyp_vm() to take a reference on a dying VM, which is necessary to unshare pages with a non-protected VM during the teardown process itself. Note that vCPUs belonging to a dying VM cannot be loaded and pages can only be reclaimed from a protected VM (via __pkvm_reclaim_dying_guest_page()) if the target VM is in the dying state. Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260331155056.28220-3-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/pkvm.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index 3fd3b930beeb..b955da0e50bc 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -309,14 +309,8 @@ struct pkvm_hyp_vm *get_pkvm_hyp_vm(pkvm_handle_t handle) hyp_spin_lock(&vm_table_lock); hyp_vm = get_vm_by_handle(handle); - if (!hyp_vm) - goto unlock; - - if (hyp_vm->kvm.arch.pkvm.is_dying) - hyp_vm = NULL; - else + if (hyp_vm) hyp_page_ref_inc(hyp_virt_to_page(hyp_vm)); -unlock: hyp_spin_unlock(&vm_table_lock); return hyp_vm; From bc20692f528b2ac8226bafe5b1db9a1f8be96dbf Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Tue, 31 Mar 2026 16:50:55 +0100 Subject: [PATCH 278/373] KVM: arm64: Don't hold 'vm_table_lock' across guest page reclaim Now that the teardown of a VM cannot be finalised as long as a reference is held on the VM, rework __pkvm_reclaim_dying_guest_page() to hold a reference to the dying VM rather than take the global 'vm_table_lock' during the reclaim operation. 
Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260331155056.28220-4-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/pkvm.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index b955da0e50bc..7ed96d64d611 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -918,15 +918,16 @@ teardown_donated_memory(struct kvm_hyp_memcache *mc, void *addr, size_t size) int __pkvm_reclaim_dying_guest_page(pkvm_handle_t handle, u64 gfn) { - struct pkvm_hyp_vm *hyp_vm; + struct pkvm_hyp_vm *hyp_vm = get_pkvm_hyp_vm(handle); int ret = -EINVAL; - hyp_spin_lock(&vm_table_lock); - hyp_vm = get_vm_by_handle(handle); - if (hyp_vm && hyp_vm->kvm.arch.pkvm.is_dying) - ret = __pkvm_host_reclaim_page_guest(gfn, hyp_vm); - hyp_spin_unlock(&vm_table_lock); + if (!hyp_vm) + return ret; + if (hyp_vm->kvm.arch.pkvm.is_dying) + ret = __pkvm_host_reclaim_page_guest(gfn, hyp_vm); + + put_pkvm_hyp_vm(hyp_vm); return ret; } From ecc7f02499544ae879716be837af78260a6a10f7 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:35:56 +0100 Subject: [PATCH 279/373] KVM: arm64: vgic: Don't reset cpuif/redist addresses at finalize time Although we are OK with rewriting idregs at finalize time, resetting the guest's cpuif (GICv2) or redistributor (GICv3) addresses once we start running the guest is a pretty bad idea. Move back this initialisation to vgic creation time.
Reviewed-by: Sascha Bischoff Fixes: a258a383b9177 ("KVM: arm64: gic-v5: Sanitize ID_AA64PFR2_EL1.GCIE") Link: https://patch.msgid.link/20260323174713.3183111-1-maz@kernel.org Link: https://patch.msgid.link/20260401103611.357092-2-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-init.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c index 47169604100f..34460179fb8a 100644 --- a/arch/arm64/kvm/vgic/vgic-init.c +++ b/arch/arm64/kvm/vgic/vgic-init.c @@ -147,6 +147,15 @@ int kvm_vgic_create(struct kvm *kvm, u32 type) kvm->arch.vgic.implementation_rev = KVM_VGIC_IMP_REV_LATEST; kvm->arch.vgic.vgic_dist_base = VGIC_ADDR_UNDEF; + switch (type) { + case KVM_DEV_TYPE_ARM_VGIC_V2: + kvm->arch.vgic.vgic_cpu_base = VGIC_ADDR_UNDEF; + break; + case KVM_DEV_TYPE_ARM_VGIC_V3: + INIT_LIST_HEAD(&kvm->arch.vgic.rd_regions); + break; + } + /* * We've now created the GIC. Update the system register state * to accurately reflect what we've created. @@ -684,10 +693,8 @@ void kvm_vgic_finalize_idregs(struct kvm *kvm) switch (type) { case KVM_DEV_TYPE_ARM_VGIC_V2: - kvm->arch.vgic.vgic_cpu_base = VGIC_ADDR_UNDEF; break; case KVM_DEV_TYPE_ARM_VGIC_V3: - INIT_LIST_HEAD(&kvm->arch.vgic.rd_regions); aa64pfr0 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP); pfr1 |= SYS_FIELD_PREP_ENUM(ID_PFR1_EL1, GIC, GICv3); break; From d82d09d5ba4be0b5eb053b2ba2bc0e82c49cf2c8 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:35:57 +0100 Subject: [PATCH 280/373] KVM: arm64: Don't skip per-vcpu NV initialisation Some GICv5-related rework has resulted in the NV sanitisation of registers being skipped for secondary vcpus, which is a pretty bad idea. Hoist the NV init early so that it is always executed.
Reviewed-by: Sascha Bischoff Fixes: cbd8c958be54a ("KVM: arm64: Return early from kvm_finalize_sys_regs() if guest has run") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-3-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/sys_regs.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index e1001544d4f4..18e2d2fccedb 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -5772,6 +5772,12 @@ int kvm_finalize_sys_regs(struct kvm_vcpu *vcpu) guard(mutex)(&kvm->arch.config_lock); + if (vcpu_has_nv(vcpu)) { + int ret = kvm_init_nv_sysregs(vcpu); + if (ret) + return ret; + } + if (kvm_vm_has_ran_once(kvm)) return 0; @@ -5820,12 +5826,6 @@ int kvm_finalize_sys_regs(struct kvm_vcpu *vcpu) kvm_vgic_finalize_idregs(kvm); } - if (vcpu_has_nv(vcpu)) { - int ret = kvm_init_nv_sysregs(vcpu); - if (ret) - return ret; - } - return 0; } From 77acae60be60adddf33e4c7e9cf73291f64fb9e8 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:35:58 +0100 Subject: [PATCH 281/373] arm64: Fix field references for ICH_PPI_DVIR[01]_EL2 The ICH_PPI_DVIR[01]_EL2 registers should refer to the ICH_PPI_DVIRx_EL2 fields, instead of ICH_PPI_DVIx_EL2. 
Reviewed-by: Sascha Bischoff Fixes: 2808a8337078f ("arm64/sysreg: Add remaining GICv5 ICC_ & ICH_ sysregs for KVM support") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-4-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/tools/sysreg | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index 51dcca5b2fa6..3b57cb692c5b 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -4888,11 +4888,11 @@ Field 0 DVI0 EndSysregFields Sysreg ICH_PPI_DVIR0_EL2 3 4 12 10 0 -Fields ICH_PPI_DVIx_EL2 +Fields ICH_PPI_DVIRx_EL2 EndSysreg Sysreg ICH_PPI_DVIR1_EL2 3 4 12 10 1 -Fields ICH_PPI_DVIx_EL2 +Fields ICH_PPI_DVIRx_EL2 EndSysreg SysregFields ICH_PPI_ENABLERx_EL2 From 76efe94b1c5cc9b5fac7c5c1096d03f1596c7267 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:35:59 +0100 Subject: [PATCH 282/373] KVM: arm64: Fix writeable mask for ID_AA64PFR2_EL1 The writeable mask for fields in ID_AA64PFR2_EL1 has been accidentally inverted, which isn't a very good idea. Restore the expected polarity. 
Reviewed-by: Sascha Bischoff Fixes: a258a383b9177 ("KVM: arm64: gic-v5: Sanitize ID_AA64PFR2_EL1.GCIE") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-5-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/sys_regs.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 18e2d2fccedb..6a96cb7ba9a3 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -3304,10 +3304,10 @@ static const struct sys_reg_desc sys_reg_descs[] = { ID_AA64PFR1_EL1_MPAM_frac | ID_AA64PFR1_EL1_MTE)), ID_FILTERED(ID_AA64PFR2_EL1, id_aa64pfr2_el1, - ~(ID_AA64PFR2_EL1_FPMR | - ID_AA64PFR2_EL1_MTEFAR | - ID_AA64PFR2_EL1_MTESTOREONLY | - ID_AA64PFR2_EL1_GCIE)), + (ID_AA64PFR2_EL1_FPMR | + ID_AA64PFR2_EL1_MTEFAR | + ID_AA64PFR2_EL1_MTESTOREONLY | + ID_AA64PFR2_EL1_GCIE)), ID_UNALLOCATED(4,3), ID_WRITABLE(ID_AA64ZFR0_EL1, ~ID_AA64ZFR0_EL1_RES0), ID_HIDDEN(ID_AA64SMFR0_EL1), From d70d4323dd9636e35696639f6b4c2b2735291516 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:00 +0100 Subject: [PATCH 283/373] KVM: arm64: Account for RESx bits in __compute_fgt() When computing Fine Grained Traps, it is preferable to account for the reserved bits. The HW will most probably ignore them, unless the bits have been repurposed to do something else. 
Use caution, and fold our view of the reserved bits in, Reviewed-by: Sascha Bischoff Fixes: c259d763e6b09 ("KVM: arm64: Account for RES1 bits in DECLARE_FEAT_MAP() and co") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260401103611.357092-6-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/config.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kvm/config.c b/arch/arm64/kvm/config.c index e14685343191..f35b8dddd7c1 100644 --- a/arch/arm64/kvm/config.c +++ b/arch/arm64/kvm/config.c @@ -1663,8 +1663,8 @@ static __always_inline void __compute_fgt(struct kvm_vcpu *vcpu, enum vcpu_sysre clear |= ~nested & m->nmask; } - val |= set; - val &= ~clear; + val |= set | m->res1; + val &= ~(clear | m->res0); *vcpu_fgt(vcpu, reg) = val; } From e63d0a32e7368f3eb935755db87add1bf000ea90 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:01 +0100 Subject: [PATCH 284/373] KVM: arm64: vgic-v5: Hold config_lock while finalizing GICv5 PPIs Finalizing the PPI state is done without holding any lock, which means that two vcpus can race against each other and have one zeroing the state while another one is setting it, or even maybe using it. 
Fixing this is done by: - holding the config lock while performing the initialisation - checking if SW_PPI has already been advertised, meaning that we have already completed the initialisation once Reviewed-by: Sascha Bischoff Fixes: 8f1fbe2fd2792 ("KVM: arm64: gic-v5: Finalize GICv5 PPIs and generate mask") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-7-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-v5.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 2b6cd5c3f9c2..119d7d01d0e7 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -172,6 +172,16 @@ int vgic_v5_finalize_ppi_state(struct kvm *kvm) if (!vgic_is_v5(kvm)) return 0; + guard(mutex)(&kvm->arch.config_lock); + + /* + * If SW_PPI has been advertised, then we know we already + * initialised the whole thing, and we can return early. Yes, + * this is pretty hackish as far as state tracking goes... + */ + if (test_bit(GICV5_ARCH_PPI_SW_PPI, kvm->arch.vgic.gicv5_vm.vgic_ppi_mask)) + return 0; + /* The PPI state for all VCPUs should be the same. Pick the first. */ vcpu0 = kvm_get_vcpu(kvm, 0); From 170a77b4185a87cc7e02e404d22b9bf3f9923884 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:02 +0100 Subject: [PATCH 285/373] KVM: arm64: vgic-v5: Transfer edge pending state to ICH_PPI_PENDRx_EL2 While it is perfectly correct to leave the pending state of a level interrupt as is when queuing it (it is, after all, only driven by the line), edge pending state must be transferred, as nothing will lower it.
Reviewed-by: Sascha Bischoff Fixes: 4d591252bacb2 ("KVM: arm64: gic-v5: Implement PPI interrupt injection") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-8-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-v5.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 119d7d01d0e7..422741c86c6a 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -445,8 +445,11 @@ void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu) irq = vgic_get_vcpu_irq(vcpu, intid); - scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) { __assign_bit(i, pendr, irq_is_pending(irq)); + if (irq->config == VGIC_CONFIG_EDGE) + irq->pending_latch = false; + } vgic_put_irq(vcpu->kvm, irq); } From 42d7eac8291d2724b3897141ab2f226c69b7923e Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:03 +0100 Subject: [PATCH 286/373] KVM: arm64: vgic-v5: Cast vgic_apr to u32 to avoid undefined behaviours Passing a u64 to __builtin_ctz() is odd, and requires some digging to figure out why this construct is indeed safe as long as the HW is correct. But it is much easier to make it clear to the compiler by casting the u64 into an intermediate u32, and be done with the UD. 
Reviewed-by: Sascha Bischoff Fixes: 933e5288fa971 ("KVM: arm64: gic-v5: Check for pending PPIs") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-9-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-v5.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 422741c86c6a..0f269321ece4 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -212,7 +212,7 @@ int vgic_v5_finalize_ppi_state(struct kvm *kvm) static u32 vgic_v5_get_effective_priority_mask(struct kvm_vcpu *vcpu) { struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; - u32 highest_ap, priority_mask; + u32 highest_ap, priority_mask, apr; /* * If the guest's CPU has not opted to receive interrupts, then the @@ -227,7 +227,8 @@ static u32 vgic_v5_get_effective_priority_mask(struct kvm_vcpu *vcpu) * priority. Explicitly use the 32-bit version here as we have 32 * priorities. 32 then means that there are no active priorities. */ - highest_ap = cpu_if->vgic_apr ? __builtin_ctz(cpu_if->vgic_apr) : 32; + apr = cpu_if->vgic_apr; + highest_ap = apr ? __builtin_ctz(apr) : 32; /* * An interrupt is of sufficient priority if it is equal to or From a4a645584793dbbb4e5a1a876800654a8883326e Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:04 +0100 Subject: [PATCH 287/373] KVM: arm64: vgic-v5: Make the effective priority mask a strict limit The way the effective priority mask is compared to the priority of an interrupt to decide whether to wake up or not is slightly odd, and breaks at the limits. This could result in spurious wake-ups that are undesirable. Make the computed priority mask comparison a strict inequality, so that interrupts that have the same priority as the mask are not signalled.
Fixes: 933e5288fa971 ("KVM: arm64: gic-v5: Check for pending PPIs") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-10-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-v5.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 0f269321ece4..31040cfb61fc 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -367,7 +367,7 @@ bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu) scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) has_pending = (irq->enabled && irq_is_pending(irq) && - irq->priority <= priority_mask); + irq->priority < priority_mask); vgic_put_irq(vcpu->kvm, irq); From 848fa8373a53b0e5d871560743e13278da56fabc Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:05 +0100 Subject: [PATCH 288/373] KVM: arm64: vgic-v5: Correctly set dist->ready once initialised kvm_vgic_map_resources() targeting a v5 model results in vgic->dist_ready never being set. This doesn't result in anything really bad, only some more heavy locking as we go and re-init something for no good reason. Rejig the code to correctly set the ready flag in all non-failing cases.
Reviewed-by: Sascha Bischoff Fixes: f4d37c7c35769 ("KVM: arm64: gic-v5: Create and initialise vgic_v5") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-11-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-init.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c index 34460179fb8a..2859dad46a93 100644 --- a/arch/arm64/kvm/vgic/vgic-init.c +++ b/arch/arm64/kvm/vgic/vgic-init.c @@ -657,16 +657,20 @@ int kvm_vgic_map_resources(struct kvm *kvm) needs_dist = false; } - if (ret || !needs_dist) + if (ret) goto out; - dist_base = dist->vgic_dist_base; - mutex_unlock(&kvm->arch.config_lock); + if (needs_dist) { + dist_base = dist->vgic_dist_base; + mutex_unlock(&kvm->arch.config_lock); - ret = vgic_register_dist_iodev(kvm, dist_base, type); - if (ret) { - kvm_err("Unable to register VGIC dist MMIO regions\n"); - goto out_slots; + ret = vgic_register_dist_iodev(kvm, dist_base, type); + if (ret) { + kvm_err("Unable to register VGIC dist MMIO regions\n"); + goto out_slots; + } + } else { + mutex_unlock(&kvm->arch.config_lock); } smp_store_release(&dist->ready, true); From 8fe30434a81d36715ab83fdb4a5e6c967d2e3ecf Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:06 +0100 Subject: [PATCH 289/373] KVM: arm64: Kill arch_timer_context::direct field The newly introduced arch_timer_context::direct field is a bit pointless, as it is always set on timers that are... err... direct, while we already have a way to get to that by doing a get_map() operation. Additionally, this field is: - only set when get_map() is called - never cleared and the single point where it is actually checked doesn't call get_map() at all. At this stage, it is probably better to just kill it, and rely on get_map() to give us the correct information. 
Reviewed-by: Sascha Bischoff Fixes: 9491c63b6cd7b ("KVM: arm64: gic-v5: Enlighten arch timer for GICv5") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-12-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/arch_timer.c | 15 +++++++++------ include/kvm/arm_arch_timer.h | 3 --- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c index 67b989671b41..37279f874869 100644 --- a/arch/arm64/kvm/arch_timer.c +++ b/arch/arm64/kvm/arch_timer.c @@ -183,10 +183,6 @@ void get_timer_map(struct kvm_vcpu *vcpu, struct timer_map *map) map->emul_ptimer = vcpu_ptimer(vcpu); } - map->direct_vtimer->direct = true; - if (map->direct_ptimer) - map->direct_ptimer->direct = true; - trace_kvm_get_timer_map(vcpu->vcpu_id, map); } @@ -462,8 +458,15 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level, return; /* Skip injecting on GICv5 for directly injected (DVI'd) timers */ - if (vgic_is_v5(vcpu->kvm) && timer_ctx->direct) - return; + if (vgic_is_v5(vcpu->kvm)) { + struct timer_map map; + + get_timer_map(vcpu, &map); + + if (map.direct_ptimer == timer_ctx || + map.direct_vtimer == timer_ctx) + return; + } kvm_vgic_inject_irq(vcpu->kvm, vcpu, timer_irq(timer_ctx), diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h index a7754e0a2ef4..bf8cc9589bd0 100644 --- a/include/kvm/arm_arch_timer.h +++ b/include/kvm/arm_arch_timer.h @@ -76,9 +76,6 @@ struct arch_timer_context { /* Duplicated state from arch_timer.c for convenience */ u32 host_timer_irq; - - /* Is this a direct timer? 
*/ - bool direct; }; struct timer_map { From fbcbf259d97d340376a176de20bdc04687356949 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:07 +0100 Subject: [PATCH 290/373] KVM: arm64: Remove evaluation of timer state in kvm_cpu_has_pending_timer() The vgic-v5 code added some evaluations of the timers in a helper function (kvm_cpu_has_pending_timer()) that is called to determine whether the vcpu can wake up. But looking at the timer there is wrong: - we want to see timers that are signalling an interrupt to the vcpu, and not just that have a pending interrupt - we already have kvm_arch_vcpu_runnable() that evaluates the state of interrupts - kvm_cpu_has_pending_timer() really is about WFIT, as the timeout does not generate an interrupt, and is therefore distinct from the point above As a consequence, revert these changes and teach vgic_v5_has_pending_ppi() about checking for pending HW interrupts instead. Fixes: 9491c63b6cd7b ("KVM: arm64: gic-v5: Enlighten arch timer for GICv5") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-13-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/arch_timer.c | 6 +----- arch/arm64/kvm/vgic/vgic-v5.c | 4 ++-- 2 files changed, 3 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c index 37279f874869..6608c47d1f62 100644 --- a/arch/arm64/kvm/arch_timer.c +++ b/arch/arm64/kvm/arch_timer.c @@ -402,11 +402,7 @@ static bool kvm_timer_should_fire(struct arch_timer_context *timer_ctx) int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu) { - struct arch_timer_context *vtimer = vcpu_vtimer(vcpu); - struct arch_timer_context *ptimer = vcpu_ptimer(vcpu); - - return kvm_timer_should_fire(vtimer) || kvm_timer_should_fire(ptimer) || - (vcpu_has_wfit_active(vcpu) && wfit_delay_ns(vcpu) == 0); + return vcpu_has_wfit_active(vcpu) && wfit_delay_ns(vcpu) == 0; } /* diff --git
a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 31040cfb61fc..8680a8354db9 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -366,8 +366,8 @@ bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu) irq = vgic_get_vcpu_irq(vcpu, intid); scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) - has_pending = (irq->enabled && irq_is_pending(irq) && - irq->priority < priority_mask); + if (irq->enabled && irq->priority < priority_mask) + has_pending = irq->hw ? vgic_get_phys_line_level(irq) : irq_is_pending(irq); vgic_put_irq(vcpu->kvm, irq); From 06c85b58e0b13e67f4e56cbba346201bfe95ad00 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:08 +0100 Subject: [PATCH 291/373] KVM: arm64: Move GICv5 timer PPI validation into timer_irqs_are_valid() Userspace can set the timer PPI numbers way before a GIC has been created, leading to odd behaviours on GICv5 as we'd accept non architectural PPI numbers. Move the v5 check into timer_irqs_are_valid(), which aligns the behaviour with the pre-v5 GICs, and is also guaranteed to run only once a GIC has been configured. 
Reviewed-by: Sascha Bischoff Fixes: 9491c63b6cd7b ("KVM: arm64: gic-v5: Enlighten arch timer for GICv5") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-14-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/arch_timer.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c index 6608c47d1f62..cbea4d9ee955 100644 --- a/arch/arm64/kvm/arch_timer.c +++ b/arch/arm64/kvm/arch_timer.c @@ -1543,6 +1543,10 @@ static bool timer_irqs_are_valid(struct kvm_vcpu *vcpu) if (kvm_vgic_set_owner(vcpu, irq, ctx)) break; + /* With GICv5, the default PPI is what you get -- nothing else */ + if (vgic_is_v5(vcpu->kvm) && irq != get_vgic_ppi(vcpu->kvm, default_ppi[i])) + break; + /* * We know by construction that we only have PPIs, so all values * are less than 32 for non-GICv5 VGICs. On GICv5, they are @@ -1678,13 +1682,6 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) return -ENXIO; } - /* - * The PPIs for the Arch Timers are architecturally defined for - * GICv5. Reject anything that changes them from the specified value. - */ - if (vgic_is_v5(vcpu->kvm) && vcpu->kvm->arch.timer_data.ppi[idx] != irq) - return -EINVAL; - /* * We cannot validate the IRQ unicity before we run, so take it at * face value. The verdict will be given on first vcpu run, for each From be46a408f376df31762e8a9914dc6d082755e686 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:09 +0100 Subject: [PATCH 292/373] KVM: arm64: Correctly plumb ID_AA64PFR2_EL1 into pkvm idreg handling While we now compute ID_AA64PFR2_EL1 to a glorious 0, we never use that data and instead return the 0 that corresponds to an allocated idreg. Not a big deal, but we might as well be consistent. 
Reviewed-by: Sascha Bischoff Fixes: 5aefaf11f9af5 ("KVM: arm64: gic: Hide GICv5 for protected guests") Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com Link: https://patch.msgid.link/20260401103611.357092-15-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/sys_regs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/kvm/hyp/nvhe/sys_regs.c b/arch/arm64/kvm/hyp/nvhe/sys_regs.c index b40fd01ebf32..be6f420388a1 100644 --- a/arch/arm64/kvm/hyp/nvhe/sys_regs.c +++ b/arch/arm64/kvm/hyp/nvhe/sys_regs.c @@ -439,7 +439,7 @@ static const struct sys_reg_desc pvm_sys_reg_descs[] = { /* CRm=4 */ AARCH64(SYS_ID_AA64PFR0_EL1), AARCH64(SYS_ID_AA64PFR1_EL1), - ID_UNALLOCATED(4,2), + AARCH64(SYS_ID_AA64PFR2_EL1), ID_UNALLOCATED(4,3), AARCH64(SYS_ID_AA64ZFR0_EL1), ID_UNALLOCATED(4,5), From f4626281c6bb563ef5ad9d3a59a1449b45a3dc30 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:10 +0100 Subject: [PATCH 293/373] KVM: arm64: Don't advertise GICv3 in ID_PFR1_EL1 if AArch32 isn't supported Although the AArch32 ID regs are architecturally UNKNOWN when AArch32 isn't supported at any EL, KVM makes a point in making them RAZ. Therefore, advertising GICv3 in ID_PFR1_EL1 must be gated on AArch32 being supported at least at EL0.
Reviewed-by: Sascha Bischoff Fixes: a258a383b9177 ("KVM: arm64: gic-v5: Sanitize ID_AA64PFR2_EL1.GCIE") Reported-by: Mark Brown Tested-by: Mark Brown Link: https://patch.msgid.link/20260401103611.357092-16-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-init.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c index 2859dad46a93..933983bb2005 100644 --- a/arch/arm64/kvm/vgic/vgic-init.c +++ b/arch/arm64/kvm/vgic/vgic-init.c @@ -700,7 +700,8 @@ void kvm_vgic_finalize_idregs(struct kvm *kvm) break; case KVM_DEV_TYPE_ARM_VGIC_V3: aa64pfr0 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP); - pfr1 |= SYS_FIELD_PREP_ENUM(ID_PFR1_EL1, GIC, GICv3); + if (kvm_supports_32bit_el0()) + pfr1 |= SYS_FIELD_PREP_ENUM(ID_PFR1_EL1, GIC, GICv3); break; case KVM_DEV_TYPE_ARM_VGIC_V5: aa64pfr2 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR2_EL1, GCIE, IMP); From b3265a1b2bd00335308f27477cecb7702f4bb615 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 11:36:11 +0100 Subject: [PATCH 294/373] KVM: arm64: set_id_regs: Allow GICv3 support to be set at runtime set_id_regs creates a GICv3 guest when possible, and then proceeds to write the ID registers as if they were not affected by the presence of a GIC. As it turns out, ID_AA64PFR0_EL1 is proof of the contrary. KVM now makes a point in exposing the GIC support to the guest, no matter what userspace says (userspace such as QEMU is known to write silly things at times). Accommodate this level of nonsense by teaching set_id_regs about fields that are mutable, and only compare registers that have been re-sanitised first.
Reported-by: Mark Brown Link: https://patch.msgid.link/20260401103611.357092-17-maz@kernel.org Signed-off-by: Marc Zyngier --- .../testing/selftests/kvm/arm64/set_id_regs.c | 52 ++++++++++++++++--- 1 file changed, 45 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/kvm/arm64/set_id_regs.c b/tools/testing/selftests/kvm/arm64/set_id_regs.c index 73de5be58bab..7899d557c70b 100644 --- a/tools/testing/selftests/kvm/arm64/set_id_regs.c +++ b/tools/testing/selftests/kvm/arm64/set_id_regs.c @@ -37,6 +37,9 @@ struct reg_ftr_bits { * For FTR_LOWER_SAFE, safe_val is used as the minimal safe value. */ int64_t safe_val; + + /* Allowed to be changed by the host after run */ + bool mutable; }; struct test_feature_reg { @@ -44,7 +47,7 @@ struct test_feature_reg { const struct reg_ftr_bits *ftr_bits; }; -#define __REG_FTR_BITS(NAME, SIGNED, TYPE, SHIFT, MASK, SAFE_VAL) \ +#define __REG_FTR_BITS(NAME, SIGNED, TYPE, SHIFT, MASK, SAFE_VAL, MUT) \ { \ .name = #NAME, \ .sign = SIGNED, \ @@ -52,15 +55,20 @@ struct test_feature_reg { .shift = SHIFT, \ .mask = MASK, \ .safe_val = SAFE_VAL, \ + .mutable = MUT, \ } #define REG_FTR_BITS(type, reg, field, safe_val) \ __REG_FTR_BITS(reg##_##field, FTR_UNSIGNED, type, reg##_##field##_SHIFT, \ - reg##_##field##_MASK, safe_val) + reg##_##field##_MASK, safe_val, false) + +#define REG_FTR_BITS_MUTABLE(type, reg, field, safe_val) \ + __REG_FTR_BITS(reg##_##field, FTR_UNSIGNED, type, reg##_##field##_SHIFT, \ + reg##_##field##_MASK, safe_val, true) #define S_REG_FTR_BITS(type, reg, field, safe_val) \ __REG_FTR_BITS(reg##_##field, FTR_SIGNED, type, reg##_##field##_SHIFT, \ - reg##_##field##_MASK, safe_val) + reg##_##field##_MASK, safe_val, false) #define REG_FTR_END \ { \ @@ -134,7 +142,8 @@ static const struct reg_ftr_bits ftr_id_aa64pfr0_el1[] = { REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, CSV2, 0), REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, DIT, 0), REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, SEL2, 0), - 
REG_FTR_BITS(FTR_EXACT, ID_AA64PFR0_EL1, GIC, 0), + /* GICv3 support will be forced at run time if available */ + REG_FTR_BITS_MUTABLE(FTR_EXACT, ID_AA64PFR0_EL1, GIC, 0), REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, EL3, 1), REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, EL2, 1), REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, EL1, 1), @@ -634,12 +643,38 @@ static void test_user_set_mte_reg(struct kvm_vcpu *vcpu) ksft_test_result_pass("ID_AA64PFR1_EL1.MTE_frac no longer 0xF\n"); } +static uint64_t reset_mutable_bits(uint32_t id, uint64_t val) +{ + struct test_feature_reg *reg = NULL; + + for (int i = 0; i < ARRAY_SIZE(test_regs); i++) { + if (test_regs[i].reg == id) { + reg = &test_regs[i]; + break; + } + } + + if (!reg) + return val; + + for (const struct reg_ftr_bits *bits = reg->ftr_bits; bits->type != FTR_END; bits++) { + if (bits->mutable) { + val &= ~bits->mask; + val |= bits->safe_val << bits->shift; + } + } + + return val; +} + static void test_guest_reg_read(struct kvm_vcpu *vcpu) { bool done = false; struct ucall uc; while (!done) { + uint64_t val; + vcpu_run(vcpu); switch (get_ucall(vcpu, &uc)) { @@ -647,9 +682,11 @@ static void test_guest_reg_read(struct kvm_vcpu *vcpu) REPORT_GUEST_ASSERT(uc); break; case UCALL_SYNC: + val = test_reg_vals[encoding_to_range_idx(uc.args[2])]; + val = reset_mutable_bits(uc.args[2], val); + /* Make sure the written values are seen by guest */ - TEST_ASSERT_EQ(test_reg_vals[encoding_to_range_idx(uc.args[2])], - uc.args[3]); + TEST_ASSERT_EQ(val, reset_mutable_bits(uc.args[2], uc.args[3])); break; case UCALL_DONE: done = true; @@ -740,7 +777,8 @@ static void test_assert_id_reg_unchanged(struct kvm_vcpu *vcpu, uint32_t encodin uint64_t observed; observed = vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(encoding)); - TEST_ASSERT_EQ(test_reg_vals[idx], observed); + TEST_ASSERT_EQ(reset_mutable_bits(encoding, test_reg_vals[idx]), + reset_mutable_bits(encoding, observed)); } static void test_reset_preserves_id_regs(struct kvm_vcpu *vcpu) 
From cf6348af645bd8e38758114e6afcc406c5bb515f Mon Sep 17 00:00:00 2001 From: Sebastian Ene Date: Mon, 30 Mar 2026 10:54:41 +0000 Subject: [PATCH 295/373] KVM: arm64: Prevent the host from using an SMC with imm16 != 0 The Arm SMC Calling Convention (SMCCC) specifies that the function identifier and parameters should be passed in registers, leaving the 16-bit immediate field unhandled in pKVM when an SMC instruction is trapped. Since the HVC is a private interface between EL2 and the host, require the host kernel running under pKVM to use an immediate value of 0 when issuing SMCs, making it clear to non-compliant software talking to TrustZone that we only use SMCCC. Signed-off-by: Sebastian Ene Reviewed-by: Vincent Donnefort Link: https://patch.msgid.link/20260330105441.3226904-1-sebastianene@google.com Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c index e7790097db93..461cf5cb5ac7 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c @@ -676,8 +676,14 @@ static void default_host_smc_handler(struct kvm_cpu_context *host_ctxt) static void handle_host_smc(struct kvm_cpu_context *host_ctxt) { DECLARE_REG(u64, func_id, host_ctxt, 0); + u64 esr = read_sysreg_el2(SYS_ESR); bool handled; + if (esr & ESR_ELx_xVC_IMM_MASK) { + cpu_reg(host_ctxt, 0) = SMCCC_RET_NOT_SUPPORTED; + goto exit_skip_instr; + } + func_id &= ~ARM_SMCCC_CALL_HINTS; handled = kvm_host_psci_handler(host_ctxt, func_id); @@ -686,6 +692,7 @@ static void handle_host_smc(struct kvm_cpu_context *host_ctxt) if (!handled) default_host_smc_handler(host_ctxt); +exit_skip_instr: /* SMC was trapped, move ELR past the current PC.
*/ kvm_skip_host_instr(); } From 2fc0f3e2b9a9f397554ffe86e8f6eb0e2507ec6e Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Fri, 27 Mar 2026 19:27:56 +0000 Subject: [PATCH 296/373] KVM: arm64: Don't leave mmu->pgt dangling on kvm_init_stage2_mmu() error If kvm_init_stage2_mmu() fails to allocate 'mmu->last_vcpu_ran', it destroys the newly allocated stage-2 page-table before returning ENOMEM. Unfortunately, it also leaves a dangling pointer in 'mmu->pgt' which points at the freed 'kvm_pgtable' structure. This is likely to confuse the kvm_vcpu_init_nested() failure path which can double-free the structure if it finds it via kvm_free_stage2_pgd(). Ensure that the dangling 'mmu->pgt' pointer is cleared when returning an error from kvm_init_stage2_mmu(). Link: https://sashiko.dev/#/patchset/20260327140039.21228-1-will%40kernel.org?patch=12265 Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260327192758.21739-2-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/mmu.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 17d64a1e11e5..34e9d897d08b 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1013,6 +1013,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t out_destroy_pgtable: kvm_stage2_destroy(pgt); + mmu->pgt = NULL; out_free_pgtable: kfree(pgt); return err; From a3ca3bfd01b7ee9f54ed85718a6d553cdd87050e Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Fri, 27 Mar 2026 19:27:57 +0000 Subject: [PATCH 297/373] KVM: arm64: Destroy stage-2 page-table in kvm_arch_destroy_vm() kvm_arch_destroy_vm() can be called on the kvm_create_vm() error path after we have failed to register the MMU notifiers for the new VM. In this case, we cannot rely on the MMU ->release() notifier to call kvm_arch_flush_shadow_all() and so the stage-2 page-table allocated in kvm_arch_init_vm() will be leaked. 
Explicitly destroy the stage-2 page-table in kvm_arch_destroy_vm(), so that teardown cleans up after itself without relying on the MMU notifiers. Link: https://sashiko.dev/#/patchset/20260327140039.21228-1-will%40kernel.org?patch=12265 Signed-off-by: Will Deacon Link: https://patch.msgid.link/20260327192758.21739-3-will@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kvm/arm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 410ffd41fd73..29bfa79555b2 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -301,6 +301,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm) if (is_protected_kvm_enabled()) pkvm_destroy_hyp_vm(kvm); + kvm_uninit_stage2_mmu(kvm); kvm_destroy_mpidr_data(kvm); kfree(kvm->arch.sysreg_masks); From 03db5f05d4c76d76b32a9d26001e2ec6252f74f8 Mon Sep 17 00:00:00 2001 From: "Zenghui Yu (Huawei)" Date: Tue, 17 Mar 2026 21:15:58 +0800 Subject: [PATCH 298/373] KVM: arm64: selftests: Avoid testing the IMPDEF behavior It turned out that we can't really force KVM to use the "slow" path when emulating AT instructions [1]. We should therefore avoid testing the IMPDEF behavior (i.e., TEST_ACCESS_FLAG - address translation instructions are permitted to update AF but not required). Remove it and improve the comment a bit.
[1] https://lore.kernel.org/r/b951dcfb-0ad1-4d7b-b6ce-d54b272dd9be@linux.dev Signed-off-by: Zenghui Yu (Huawei) Link: https://patch.msgid.link/20260317131558.52751-1-zenghui.yu@linux.dev Signed-off-by: Marc Zyngier --- tools/testing/selftests/kvm/arm64/at.c | 14 ++------------ 1 file changed, 2 insertions(+), 12 deletions(-) diff --git a/tools/testing/selftests/kvm/arm64/at.c b/tools/testing/selftests/kvm/arm64/at.c index c8ee6f520734..ce5d312ef6ba 100644 --- a/tools/testing/selftests/kvm/arm64/at.c +++ b/tools/testing/selftests/kvm/arm64/at.c @@ -13,7 +13,6 @@ enum { CLEAR_ACCESS_FLAG, - TEST_ACCESS_FLAG, }; static u64 *ptep_hva; @@ -49,7 +48,6 @@ do { \ GUEST_ASSERT_EQ(FIELD_GET(SYS_PAR_EL1_ATTR, par), MAIR_ATTR_NORMAL); \ GUEST_ASSERT_EQ(FIELD_GET(SYS_PAR_EL1_SH, par), PTE_SHARED >> 8); \ GUEST_ASSERT_EQ(par & SYS_PAR_EL1_PA, TEST_ADDR); \ - GUEST_SYNC(TEST_ACCESS_FLAG); \ } \ } while (0) @@ -85,10 +83,6 @@ static void guest_code(void) if (!SYS_FIELD_GET(ID_AA64MMFR1_EL1, HAFDBS, read_sysreg(id_aa64mmfr1_el1))) GUEST_DONE(); - /* - * KVM's software PTW makes the implementation choice that the AT - * instruction sets the access flag. - */ sysreg_clear_set(tcr_el1, 0, TCR_HA); isb(); test_at(false); @@ -102,8 +96,8 @@ static void handle_sync(struct kvm_vcpu *vcpu, struct ucall *uc) case CLEAR_ACCESS_FLAG: /* * Delete + reinstall the memslot to invalidate stage-2 - * mappings of the stage-1 page tables, forcing KVM to - * use the 'slow' AT emulation path. + * mappings of the stage-1 page tables, allowing KVM to + * potentially use the 'slow' AT emulation path. 
* * This and clearing the access flag from host userspace * ensures that the access flag cannot be set speculatively @@ -112,10 +106,6 @@ static void handle_sync(struct kvm_vcpu *vcpu, struct ucall *uc) clear_bit(__ffs(PTE_AF), ptep_hva); vm_mem_region_reload(vcpu->vm, vcpu->vm->memslots[MEM_REGION_PT]); break; - case TEST_ACCESS_FLAG: - TEST_ASSERT(test_bit(__ffs(PTE_AF), ptep_hva), - "Expected access flag to be set (desc: %lu)", *ptep_hva); - break; default: TEST_FAIL("Unexpected SYNC arg: %lu", uc->args[1]); } From 9c1ac77ddfc90b6292ef63a4fa5ab6f9e4b29981 Mon Sep 17 00:00:00 2001 From: Sascha Bischoff Date: Wed, 1 Apr 2026 16:21:57 +0000 Subject: [PATCH 299/373] KVM: arm64: vgic-v5: Fold PPI state for all exposed PPIs GICv5 supports up to 128 PPIs, which would introduce a large amount of overhead if all of them were actively tracked. Rather than keeping track of all 128 potential PPIs, we instead only consider the set of architected PPIs (the first 64). Moreover, we further reduce that set by only exposing a subset of the PPIs to a guest. In practice, this means that only 4 PPIs are typically exposed to a guest - the SW_PPI, PMUIRQ, and the timers. When folding the PPI state, changed bits in the active or pending were used to choose which state to sync back. However, this breaks badly for Edge interrupts when exiting the guest before it has consumed the edge. There is no change in pending state detected, and the edge is lost forever. Given the reduced set of PPIs exposed to the guest, and the issues around tracking the edges, drop the tracking of changed state, and instead iterate over the limited subset of PPIs exposed to the guest directly. This change drops the second copy of the PPI pending state used for detecting edges in the pending state, and reworks vgic_v5_fold_ppi_state() to iterate over the VM's PPI mask instead. 
Signed-off-by: Sascha Bischoff Link: https://patch.msgid.link/20260401162152.932243-1-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_host.h | 9 +-------- arch/arm64/kvm/hyp/vgic-v5-sr.c | 6 +++--- arch/arm64/kvm/vgic/vgic-v5.c | 28 +++++----------------------- 3 files changed, 9 insertions(+), 34 deletions(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index a7dc0aac3b93..729bd32207fa 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -803,14 +803,7 @@ struct kvm_host_data { /* PPI state tracking for GICv5-based guests */ struct { - /* - * For tracking the PPI pending state, we need both the entry - * state and exit state to correctly detect edges as it is - * possible that an interrupt has been injected in software in - * the interim. - */ - DECLARE_BITMAP(pendr_entry, VGIC_V5_NR_PRIVATE_IRQS); - DECLARE_BITMAP(pendr_exit, VGIC_V5_NR_PRIVATE_IRQS); + DECLARE_BITMAP(pendr, VGIC_V5_NR_PRIVATE_IRQS); /* The saved state of the regs when leaving the guest */ DECLARE_BITMAP(activer_exit, VGIC_V5_NR_PRIVATE_IRQS); diff --git a/arch/arm64/kvm/hyp/vgic-v5-sr.c b/arch/arm64/kvm/hyp/vgic-v5-sr.c index 2c4304ffa9f3..47e6bcd43702 100644 --- a/arch/arm64/kvm/hyp/vgic-v5-sr.c +++ b/arch/arm64/kvm/hyp/vgic-v5-sr.c @@ -37,7 +37,7 @@ void __vgic_v5_save_ppi_state(struct vgic_v5_cpu_if *cpu_if) bitmap_write(host_data_ptr(vgic_v5_ppi_state)->activer_exit, read_sysreg_s(SYS_ICH_PPI_ACTIVER0_EL2), 0, 64); - bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr_exit, + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr, read_sysreg_s(SYS_ICH_PPI_PENDR0_EL2), 0, 64); cpu_if->vgic_ppi_priorityr[0] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR0_EL2); @@ -52,7 +52,7 @@ void __vgic_v5_save_ppi_state(struct vgic_v5_cpu_if *cpu_if) if (VGIC_V5_NR_PRIVATE_IRQS == 128) { bitmap_write(host_data_ptr(vgic_v5_ppi_state)->activer_exit, read_sysreg_s(SYS_ICH_PPI_ACTIVER1_EL2), 64, 64); - 
bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr_exit, + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr, read_sysreg_s(SYS_ICH_PPI_PENDR1_EL2), 64, 64); cpu_if->vgic_ppi_priorityr[8] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR8_EL2); @@ -87,7 +87,7 @@ void __vgic_v5_restore_ppi_state(struct vgic_v5_cpu_if *cpu_if) SYS_ICH_PPI_ENABLER0_EL2); /* Update the pending state of the NON-DVI'd PPIs, only */ - bitmap_andnot(pendr, host_data_ptr(vgic_v5_ppi_state)->pendr_entry, + bitmap_andnot(pendr, host_data_ptr(vgic_v5_ppi_state)->pendr, cpu_if->vgic_ppi_dvir, VGIC_V5_NR_PRIVATE_IRQS); write_sysreg_s(bitmap_read(pendr, 0, 64), SYS_ICH_PPI_PENDR0_EL2); diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c index 8680a8354db9..fdd39ea7f83e 100644 --- a/arch/arm64/kvm/vgic/vgic-v5.c +++ b/arch/arm64/kvm/vgic/vgic-v5.c @@ -385,24 +385,14 @@ bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu) void vgic_v5_fold_ppi_state(struct kvm_vcpu *vcpu) { struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; - DECLARE_BITMAP(changed_active, VGIC_V5_NR_PRIVATE_IRQS); - DECLARE_BITMAP(changed_pending, VGIC_V5_NR_PRIVATE_IRQS); - DECLARE_BITMAP(changed_bits, VGIC_V5_NR_PRIVATE_IRQS); - unsigned long *activer, *pendr_entry, *pendr; + unsigned long *activer, *pendr; int i; activer = host_data_ptr(vgic_v5_ppi_state)->activer_exit; - pendr_entry = host_data_ptr(vgic_v5_ppi_state)->pendr_entry; - pendr = host_data_ptr(vgic_v5_ppi_state)->pendr_exit; + pendr = host_data_ptr(vgic_v5_ppi_state)->pendr; - bitmap_xor(changed_active, cpu_if->vgic_ppi_activer, activer, - VGIC_V5_NR_PRIVATE_IRQS); - bitmap_xor(changed_pending, pendr_entry, pendr, - VGIC_V5_NR_PRIVATE_IRQS); - bitmap_or(changed_bits, changed_active, changed_pending, - VGIC_V5_NR_PRIVATE_IRQS); - - for_each_set_bit(i, changed_bits, VGIC_V5_NR_PRIVATE_IRQS) { + for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, + VGIC_V5_NR_PRIVATE_IRQS) { u32 intid = vgic_v5_make_ppi(i); struct vgic_irq *irq; 
@@ -462,15 +452,7 @@ void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu) * incoming changes are merged with the outgoing changes on the return * path. */ - bitmap_copy(host_data_ptr(vgic_v5_ppi_state)->pendr_entry, pendr, - VGIC_V5_NR_PRIVATE_IRQS); - - /* - * Make sure that we can correctly detect "edges" in the PPI - * state. There's a path where we never actually enter the guest, and - * failure to do this risks losing pending state - */ - bitmap_copy(host_data_ptr(vgic_v5_ppi_state)->pendr_exit, pendr, + bitmap_copy(host_data_ptr(vgic_v5_ppi_state)->pendr, pendr, VGIC_V5_NR_PRIVATE_IRQS); } From 1323a5cfe52c7937ea3cd7a75e0355cacd805da4 Mon Sep 17 00:00:00 2001 From: Jinyu Tang Date: Fri, 27 Feb 2026 20:10:08 +0800 Subject: [PATCH 300/373] KVM: riscv: Skip CSR restore if VCPU is reloaded on the same core MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, kvm_arch_vcpu_load() unconditionally restores guest CSRs, HGATP, and AIA state. However, when a VCPU is loaded back on the same physical CPU, and no other KVM VCPU has run on this CPU since it was last put, the hardware CSRs and AIA registers are still valid. This patch optimizes the vcpu_load path by skipping the expensive CSR and AIA writes if all the following conditions are met: 1. It is being reloaded on the same CPU (vcpu->arch.last_exit_cpu == cpu). 2. The CSRs are not dirty (!vcpu->arch.csr_dirty). 3. No other VCPU used this CPU (vcpu == __this_cpu_read(kvm_former_vcpu)). To ensure this fast-path doesn't break corner cases: - Live migration and VCPU reset are naturally safe. KVM initializes last_exit_cpu to -1, which guarantees the fast-path won't trigger. - The 'csr_dirty' flag tracks runtime userspace interventions. If userspace modifies guest configurations (e.g., hedeleg via KVM_SET_GUEST_DEBUG, or CSRs including AIA via KVM_SET_ONE_REG), the flag is set to skip the fast path. 
With the 'csr_dirty' safeguard proven effective, it is safe to include kvm_riscv_vcpu_aia_load() inside the skip logic now. Signed-off-by: Jinyu Tang Reviewed-by: Nutty Liu Reviewed-by: Andrew Jones Reviewed-by: Radim Krčmář Link: https://lore.kernel.org/r/20260227121008.442241-1-tjytimi@163.com Signed-off-by: Anup Patel --- arch/riscv/include/asm/kvm_host.h | 3 +++ arch/riscv/kvm/vcpu.c | 24 ++++++++++++++++++++++-- arch/riscv/kvm/vcpu_onereg.c | 2 ++ 3 files changed, 27 insertions(+), 2 deletions(-) diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h index 24585304c02b..7ee47b83c80d 100644 --- a/arch/riscv/include/asm/kvm_host.h +++ b/arch/riscv/include/asm/kvm_host.h @@ -273,6 +273,9 @@ struct kvm_vcpu_arch { /* 'static' configurations which are set only once */ struct kvm_vcpu_config cfg; + /* Indicates modified guest CSRs */ + bool csr_dirty; + /* SBI steal-time accounting */ struct { gpa_t shmem; diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c index fdd99ac1e714..1d5c777eba80 100644 --- a/arch/riscv/kvm/vcpu.c +++ b/arch/riscv/kvm/vcpu.c @@ -24,6 +24,8 @@ #define CREATE_TRACE_POINTS #include "trace.h" +static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_former_vcpu); + const struct kvm_stats_desc kvm_vcpu_stats_desc[] = { KVM_GENERIC_VCPU_STATS(), STATS_DESC_COUNTER(VCPU, ecall_exit_stat), @@ -537,6 +539,8 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, vcpu->arch.cfg.hedeleg |= BIT(EXC_BREAKPOINT); } + vcpu->arch.csr_dirty = true; + return 0; } @@ -581,6 +585,21 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr; struct kvm_vcpu_config *cfg = &vcpu->arch.cfg; + /* + * If VCPU is being reloaded on the same physical CPU and no + * other KVM VCPU has run on this CPU since it was last put, + * we can skip the expensive CSR and HGATP writes. 
+ * + * Note: If a new CSR is added to this fast-path skip block, + * make sure that 'csr_dirty' is set to true in any + * ioctl (e.g., KVM_SET_ONE_REG) that modifies it. + */ + if (vcpu != __this_cpu_read(kvm_former_vcpu)) + __this_cpu_write(kvm_former_vcpu, vcpu); + else if (vcpu->arch.last_exit_cpu == cpu && !vcpu->arch.csr_dirty) + goto csr_restore_done; + + vcpu->arch.csr_dirty = false; if (kvm_riscv_nacl_sync_csr_available()) { nsh = nacl_shmem(); nacl_csr_write(nsh, CSR_VSSTATUS, csr->vsstatus); @@ -624,6 +643,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) kvm_riscv_mmu_update_hgatp(vcpu); + kvm_riscv_vcpu_aia_load(vcpu, cpu); + +csr_restore_done: kvm_riscv_vcpu_timer_restore(vcpu); kvm_riscv_vcpu_host_fp_save(&vcpu->arch.host_context); @@ -633,8 +655,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) kvm_riscv_vcpu_guest_vector_restore(&vcpu->arch.guest_context, vcpu->arch.isa); - kvm_riscv_vcpu_aia_load(vcpu, cpu); - kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); vcpu->cpu = cpu; diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c index 45ecc0082e90..97fa72ba47c1 100644 --- a/arch/riscv/kvm/vcpu_onereg.c +++ b/arch/riscv/kvm/vcpu_onereg.c @@ -670,6 +670,8 @@ static int kvm_riscv_vcpu_set_reg_csr(struct kvm_vcpu *vcpu, if (rc) return rc; + vcpu->arch.csr_dirty = true; + return 0; } From ce47b798ed1e44a6ae2c2966cdf7cba6b428083e Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Wed, 1 Apr 2026 05:50:59 +0100 Subject: [PATCH 301/373] tracing: Non-consuming read for trace remotes with an offline CPU When a trace_buffer is created while a CPU is offline, this CPU is cleared from the trace_buffer CPU mask, preventing the creation of a non-consuming iterator (ring_buffer_iter). For trace remotes, it means the iterator fails to be allocated (-ENOMEM) even though there are available ring buffers in the trace_buffer. 
For non-consuming reads of trace remotes, skip missing ring_buffer_iter to allow reading the available ring buffers. Acked-by: Steven Rostedt (Google) Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260401045100.3394299-2-vdonnefort@google.com Signed-off-by: Marc Zyngier --- kernel/trace/trace_remote.c | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c index 0d78e5f5fe98..d6c3f94d67cd 100644 --- a/kernel/trace/trace_remote.c +++ b/kernel/trace/trace_remote.c @@ -282,6 +282,14 @@ static void trace_remote_put(struct trace_remote *remote) trace_remote_try_unload(remote); } +static bool trace_remote_has_cpu(struct trace_remote *remote, int cpu) +{ + if (cpu == RING_BUFFER_ALL_CPUS) + return true; + + return ring_buffer_poll_remote(remote->trace_buffer, cpu) == 0; +} + static void __poll_remote(struct work_struct *work) { struct delayed_work *dwork = to_delayed_work(work); @@ -324,6 +332,10 @@ static int __alloc_ring_buffer_iter(struct trace_remote_iterator *iter, int cpu) iter->rb_iters[cpu] = ring_buffer_read_start(iter->remote->trace_buffer, cpu, GFP_KERNEL); if (!iter->rb_iters[cpu]) { + /* This CPU isn't part of trace_buffer. 
Skip it */ + if (!trace_remote_has_cpu(iter->remote, cpu)) + continue; + __free_ring_buffer_iter(iter, RING_BUFFER_ALL_CPUS); return -ENOMEM; } @@ -347,10 +359,10 @@ static struct trace_remote_iterator if (ret) return ERR_PTR(ret); - /* Test the CPU */ - ret = ring_buffer_poll_remote(remote->trace_buffer, cpu); - if (ret) + if (!trace_remote_has_cpu(remote, cpu)) { + ret = -ENODEV; goto err; + } iter = kzalloc_obj(*iter); if (iter) { @@ -361,6 +373,7 @@ static struct trace_remote_iterator switch (type) { case TRI_CONSUMING: + ring_buffer_poll_remote(remote->trace_buffer, cpu); INIT_DELAYED_WORK(&iter->poll_work, __poll_remote); schedule_delayed_work(&iter->poll_work, msecs_to_jiffies(remote->poll_ms)); break; @@ -476,6 +489,9 @@ __peek_event(struct trace_remote_iterator *iter, int cpu, u64 *ts, unsigned long return ring_buffer_peek(iter->remote->trace_buffer, cpu, ts, lost_events); case TRI_NONCONSUMING: rb_iter = __get_rb_iter(iter, cpu); + if (!rb_iter) + return NULL; + rb_evt = ring_buffer_iter_peek(rb_iter, ts); if (!rb_evt) return NULL; From ec07906bdc52848bd7dc93d1d44e642dcdc7a15a Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Wed, 1 Apr 2026 05:51:00 +0100 Subject: [PATCH 302/373] tracing: selftests: Extend hotplug testing for trace remotes The hotplug testing only tries reading a trace remote buffer, loaded before a CPU is offline. Extend this testing to cover: * A trace remote buffer loaded after a CPU is offline. * A trace remote buffer loaded before a CPU is online. Because of these added test cases, move the hotplug testing into a separate hotplug.tc file. 
Acked-by: Steven Rostedt (Google) Signed-off-by: Vincent Donnefort Link: https://patch.msgid.link/20260401045100.3394299-3-vdonnefort@google.com Signed-off-by: Marc Zyngier --- .../selftests/ftrace/test.d/remotes/functions | 19 +++- .../ftrace/test.d/remotes/hotplug.tc | 88 +++++++++++++++++++ .../test.d/remotes/hypervisor/hotplug.tc | 11 +++ .../selftests/ftrace/test.d/remotes/trace.tc | 27 +----- .../ftrace/test.d/remotes/trace_pipe.tc | 25 ------ 5 files changed, 115 insertions(+), 55 deletions(-) create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc diff --git a/tools/testing/selftests/ftrace/test.d/remotes/functions b/tools/testing/selftests/ftrace/test.d/remotes/functions index 97a09d564a34..05224fac3653 100644 --- a/tools/testing/selftests/ftrace/test.d/remotes/functions +++ b/tools/testing/selftests/ftrace/test.d/remotes/functions @@ -24,12 +24,21 @@ setup_remote_test() assert_loaded() { - grep -q "(loaded)" buffer_size_kb + grep -q "(loaded)" buffer_size_kb || return 1 } assert_unloaded() { - grep -q "(unloaded)" buffer_size_kb + grep -q "(unloaded)" buffer_size_kb || return 1 +} + +reload_remote() +{ + echo 0 > tracing_on + clear_trace + assert_unloaded + echo 1 > tracing_on + assert_loaded } dump_trace_pipe() @@ -79,10 +88,12 @@ get_cpu_ids() sed -n 's/^processor\s*:\s*\([0-9]\+\).*/\1/p' /proc/cpuinfo } -get_page_size() { +get_page_size() +{ sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page } -get_selftest_event_size() { +get_selftest_event_size() +{ sed -ne 's/^.*field:.*;.*size:\([0-9][0-9]*\);.*/\1/p' events/*/selftest/format | awk '{s+=$1} END {print s}' } diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc b/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc new file mode 100644 index 000000000000..145617eb8061 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc @@ -0,0 +1,88 
@@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: Test trace remote read with an offline CPU +# requires: remotes/test + +. $TEST_DIR/remotes/functions + +hotunplug_one_cpu() +{ + [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 1 + + for cpu in $(get_cpu_ids); do + echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 1 + break + done + + echo $cpu +} + +# Check non-consuming and consuming read +check_read() +{ + for i in $(seq 1 8); do + echo $i > write_event + done + + check_trace 1 8 trace + + output=$(dump_trace_pipe) + check_trace 1 8 $output + rm $output +} + +test_hotplug() +{ + echo 0 > trace + assert_loaded + + # + # Test a trace buffer containing an offline CPU + # + + cpu=$(hotunplug_one_cpu) || exit_unsupported + trap "echo 1 > /sys/devices/system/cpu/cpu$cpu/online" EXIT + + check_read + + # + # Test a trace buffer with a missing CPU + # + + reload_remote + + check_read + + # + # Test a trace buffer with a CPU added later + # + + echo 1 > /sys/devices/system/cpu/cpu$cpu/online + trap "" EXIT + assert_loaded + + check_read + + # Test if the ring-buffer for the newly added CPU is both writable and + # readable + for i in $(seq 1 8); do + taskset -c $cpu echo $i > write_event + done + + cd per_cpu/cpu$cpu/ + + check_trace 1 8 trace + + output=$(dump_trace_pipe) + check_trace 1 8 $output + rm $output + + cd - +} + +if [ -z "$SOURCE_REMOTE_TEST" ]; then + set -e + + setup_remote_test + test_hotplug +fi diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc new file mode 100644 index 000000000000..580ec32c8f81 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc @@ -0,0 +1,11 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: Test hypervisor trace read with an offline CPU +# requires: remotes/hypervisor/write_event + +SOURCE_REMOTE_TEST=1 +. 
$TEST_DIR/remotes/hotplug.tc + +set -e +setup_remote "hypervisor" +test_hotplug diff --git a/tools/testing/selftests/ftrace/test.d/remotes/trace.tc b/tools/testing/selftests/ftrace/test.d/remotes/trace.tc index 170f7648732a..bc9377a70e8d 100644 --- a/tools/testing/selftests/ftrace/test.d/remotes/trace.tc +++ b/tools/testing/selftests/ftrace/test.d/remotes/trace.tc @@ -58,11 +58,7 @@ test_trace() # # Ensure the writer is not on the reader page by reloading the buffer - echo 0 > tracing_on - echo 0 > trace - assert_unloaded - echo 1 > tracing_on - assert_loaded + reload_remote # Ensure ring-buffer overflow by emitting events from the same CPU for cpu in $(get_cpu_ids); do @@ -96,27 +92,6 @@ test_trace() cd - > /dev/null done - - # - # Test with hotplug - # - - [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 0 - - echo 0 > trace - - for cpu in $(get_cpu_ids); do - echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 0 - break - done - - for i in $(seq 1 8); do - echo $i > write_event - done - - check_trace 1 8 trace - - echo 1 > /sys/devices/system/cpu/cpu$cpu/online } if [ -z "$SOURCE_REMOTE_TEST" ]; then diff --git a/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc b/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc index 669a7288ed7c..7f7b7b79c490 100644 --- a/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc +++ b/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc @@ -92,31 +92,6 @@ test_trace_pipe() rm $output cd - > /dev/null done - - # - # Test interaction with hotplug - # - - [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 0 - - echo 0 > trace - - for cpu in $(get_cpu_ids); do - echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 0 - break - done - - for i in $(seq 1 8); do - echo $i > write_event - done - - output=$(dump_trace_pipe) - - check_trace 1 8 $output - - rm $output - - echo 1 > /sys/devices/system/cpu/cpu$cpu/online } if [ -z "$SOURCE_REMOTE_TEST" ]; then From b0ad874d9852967dafdb94b1632e0732e01e6cd8 Mon 
Sep 17 00:00:00 2001 From: Eric Farman Date: Wed, 1 Apr 2026 17:12:18 +0200 Subject: [PATCH 303/373] KVM: s390: vsie: Allow non-zarch guests Linux/KVM runs in z/Architecture-only mode. Although z/Architecture is built upon a long history of hardware refinements, no other CPU mode is permitted. Allow userspace to explicitly enable the use of ESA mode for nested guests; otherwise, usage will be rejected. Reviewed-by: Janosch Frank Signed-off-by: Eric Farman Reviewed-by: Hendrik Brueckner Reviewed-by: Christian Borntraeger Signed-off-by: Janosch Frank --- arch/s390/include/asm/kvm_host.h | 1 + arch/s390/kvm/vsie.c | 8 +++++--- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h index 64a50f0862aa..b58faad8a7ce 100644 --- a/arch/s390/include/asm/kvm_host.h +++ b/arch/s390/include/asm/kvm_host.h @@ -656,6 +656,7 @@ struct kvm_arch { int user_stsi; int user_instr0; int user_operexec; + int allow_vsie_esamode; struct s390_io_adapter *adapters[MAX_S390_IO_ADAPTERS]; wait_queue_head_t ipte_wq; int ipte_lock_count; diff --git a/arch/s390/kvm/vsie.c b/arch/s390/kvm/vsie.c index d249b10044eb..aa43c6848217 100644 --- a/arch/s390/kvm/vsie.c +++ b/arch/s390/kvm/vsie.c @@ -125,8 +125,8 @@ static int prepare_cpuflags(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page) struct kvm_s390_sie_block *scb_o = vsie_page->scb_o; int newflags, cpuflags = atomic_read(&scb_o->cpuflags); - /* we don't allow ESA/390 guests */ - if (!(cpuflags & CPUSTAT_ZARCH)) + /* we don't allow ESA/390 guests unless explicitly enabled */ + if (!(cpuflags & CPUSTAT_ZARCH) && !vcpu->kvm->arch.allow_vsie_esamode) return set_validity_icpt(scb_s, 0x0001U); if (cpuflags & (CPUSTAT_RRF | CPUSTAT_MCDS)) @@ -135,7 +135,9 @@ static int prepare_cpuflags(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page) return set_validity_icpt(scb_s, 0x0007U); /* intervention requests will be set later */ - newflags = CPUSTAT_ZARCH; + newflags = 0; + if
(cpuflags & CPUSTAT_ZARCH) + newflags = CPUSTAT_ZARCH; if (cpuflags & CPUSTAT_GED && test_kvm_facility(vcpu->kvm, 8)) newflags |= CPUSTAT_GED; if (cpuflags & CPUSTAT_GED2 && test_kvm_facility(vcpu->kvm, 78)) { From a9640e2eb7110f0aafda8905acbf5b4ae8db07a4 Mon Sep 17 00:00:00 2001 From: Eric Farman Date: Wed, 1 Apr 2026 17:12:19 +0200 Subject: [PATCH 304/373] KVM: s390: vsie: Disable some bits when in ESA mode In the event that a nested guest is put in ESA mode, ensure that some bits are scrubbed from the shadow SCB. Reviewed-by: Christian Borntraeger Signed-off-by: Eric Farman Reviewed-by: Hendrik Brueckner Signed-off-by: Janosch Frank --- arch/s390/kvm/vsie.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/arch/s390/kvm/vsie.c b/arch/s390/kvm/vsie.c index aa43c6848217..2ce406861d22 100644 --- a/arch/s390/kvm/vsie.c +++ b/arch/s390/kvm/vsie.c @@ -387,6 +387,17 @@ end: return 0; } +static void shadow_esa(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page) +{ + struct kvm_s390_sie_block *scb_s = &vsie_page->scb_s; + + /* Ensure these bits are indeed turned off */ + scb_s->eca &= ~ECA_VX; + scb_s->ecb &= ~(ECB_GS | ECB_TE); + scb_s->ecb3 &= ~ECB3_RI; + scb_s->ecd &= ~ECD_HOSTREGMGMT; +} + /* shadow (round up/down) the ibc to avoid validity icpt */ static void prepare_ibc(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page) { @@ -590,6 +601,9 @@ static int shadow_scb(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page) scb_s->hpid = HPID_VSIE; scb_s->cpnc = scb_o->cpnc; + if (!(atomic_read(&scb_s->cpuflags) & CPUSTAT_ZARCH)) + shadow_esa(vcpu, vsie_page); + prepare_ibc(vcpu, vsie_page); rc = shadow_crycb(vcpu, vsie_page); out: From c0dcada088ffb5bbac3fc17a416ed2c225f49b9c Mon Sep 17 00:00:00 2001 From: Eric Farman Date: Wed, 1 Apr 2026 17:12:20 +0200 Subject: [PATCH 305/373] KVM: s390: vsie: Accommodate ESA prefix pages The prefix page address occupies a different number of bits for z/Architecture versus ESA mode. 
Adjust the definition to cover both, and permit an ESA mode address within the nested codepath. Signed-off-by: Eric Farman Reviewed-by: Hendrik Brueckner Signed-off-by: Janosch Frank --- arch/s390/include/asm/kvm_host_types.h | 3 +-- arch/s390/kvm/kvm-s390.h | 5 ++++- arch/s390/kvm/vsie.c | 7 ++++++- 3 files changed, 11 insertions(+), 4 deletions(-) diff --git a/arch/s390/include/asm/kvm_host_types.h b/arch/s390/include/asm/kvm_host_types.h index 1394d3fb648f..3f50942bdfe6 100644 --- a/arch/s390/include/asm/kvm_host_types.h +++ b/arch/s390/include/asm/kvm_host_types.h @@ -137,8 +137,7 @@ struct mcck_volatile_info { struct kvm_s390_sie_block { atomic_t cpuflags; /* 0x0000 */ __u32 : 1; /* 0x0004 */ - __u32 prefix : 18; - __u32 : 1; + __u32 prefix : 19; __u32 ibc : 12; __u8 reserved08[4]; /* 0x0008 */ #define PROG_IN_SIE (1<<0) diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h index bf1d7798c1af..dc0573b7aa4b 100644 --- a/arch/s390/kvm/kvm-s390.h +++ b/arch/s390/kvm/kvm-s390.h @@ -122,7 +122,9 @@ static inline int kvm_is_ucontrol(struct kvm *kvm) #endif } -#define GUEST_PREFIX_SHIFT 13 +#define GUEST_PREFIX_SHIFT 12 +#define GUEST_PREFIX_MASK_ZARCH 0x7fffe +#define GUEST_PREFIX_MASK_ESA 0x7ffff static inline u32 kvm_s390_get_prefix(struct kvm_vcpu *vcpu) { return vcpu->arch.sie_block->prefix << GUEST_PREFIX_SHIFT; @@ -133,6 +135,7 @@ static inline void kvm_s390_set_prefix(struct kvm_vcpu *vcpu, u32 prefix) VCPU_EVENT(vcpu, 3, "set prefix of cpu %03u to 0x%x", vcpu->vcpu_id, prefix); vcpu->arch.sie_block->prefix = prefix >> GUEST_PREFIX_SHIFT; + vcpu->arch.sie_block->prefix &= GUEST_PREFIX_MASK_ZARCH; kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); kvm_make_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu); } diff --git a/arch/s390/kvm/vsie.c b/arch/s390/kvm/vsie.c index 2ce406861d22..3f43fe05afd3 100644 --- a/arch/s390/kvm/vsie.c +++ b/arch/s390/kvm/vsie.c @@ -479,7 +479,7 @@ static int shadow_scb(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page) struct 
kvm_s390_sie_block *scb_s = &vsie_page->scb_s; /* READ_ONCE does not work on bitfields - use a temporary variable */ const uint32_t __new_prefix = scb_o->prefix; - const uint32_t new_prefix = READ_ONCE(__new_prefix); + uint32_t new_prefix = READ_ONCE(__new_prefix); const bool wants_tx = READ_ONCE(scb_o->ecb) & ECB_TE; bool had_tx = scb_s->ecb & ECB_TE; unsigned long new_mso = 0; @@ -527,6 +527,11 @@ static int shadow_scb(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page) scb_s->icpua = scb_o->icpua; + if (!(atomic_read(&scb_s->cpuflags) & CPUSTAT_ZARCH)) + new_prefix &= GUEST_PREFIX_MASK_ESA; + else + new_prefix &= GUEST_PREFIX_MASK_ZARCH; + if (!(atomic_read(&scb_s->cpuflags) & CPUSTAT_SM)) new_mso = READ_ONCE(scb_o->mso) & 0xfffffffffff00000UL; /* if the hva of the prefix changes, we have to remap the prefix */ From 4aebd7d5c72f805ef59985958ad76b8dbce60d8f Mon Sep 17 00:00:00 2001 From: Hendrik Brueckner Date: Wed, 1 Apr 2026 17:12:21 +0200 Subject: [PATCH 306/373] KVM: s390: Add KVM capability for ESA mode guests Now that all the bits are properly addressed, provide a mechanism for testing ESA mode guests in nested configurations. Signed-off-by: Hendrik Brueckner [farman@us.ibm.com: Updated commit message] Reviewed-by: Janosch Frank Signed-off-by: Eric Farman Signed-off-by: Janosch Frank --- Documentation/virt/kvm/api.rst | 8 ++++++++ arch/s390/kvm/kvm-s390.c | 6 ++++++ include/uapi/linux/kvm.h | 1 + 3 files changed, 15 insertions(+) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 6f85e1b321dd..682ae9278943 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -9428,6 +9428,14 @@ KVM exits with the register state of either the L1 or L2 guest depending on which executed at the time of an exit. Userspace must take care to differentiate between these cases. 
+8.47 KVM_CAP_S390_VSIE_ESAMODE +------------------------------ + +:Architectures: s390 + +The presence of this capability indicates that the nested KVM guest can +start in ESA mode. + 9. Known KVM API problems ========================= diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index bc7d6fa66eaf..a583c0a00efd 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -629,6 +629,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_IRQFD_RESAMPLE: case KVM_CAP_S390_USER_OPEREXEC: case KVM_CAP_S390_KEYOP: + case KVM_CAP_S390_VSIE_ESAMODE: r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: @@ -926,6 +927,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap) icpt_operexc_on_all_vcpus(kvm); r = 0; break; + case KVM_CAP_S390_VSIE_ESAMODE: + VM_EVENT(kvm, 3, "%s", "ENABLE: CAP_S390_VSIE_ESAMODE"); + kvm->arch.allow_vsie_esamode = 1; + r = 0; + break; default: r = -EINVAL; break; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 65500f5db379..e658f89d5d3e 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -985,6 +985,7 @@ struct kvm_enable_cap { #define KVM_CAP_ARM_SEA_TO_USER 245 #define KVM_CAP_S390_USER_OPEREXEC 246 #define KVM_CAP_S390_KEYOP 247 +#define KVM_CAP_S390_VSIE_ESAMODE 248 struct kvm_irq_routing_irqchip { __u32 irqchip; From 1ec8bea903f439acca927144570e1e419d2f9c9a Mon Sep 17 00:00:00 2001 From: Mayuresh Chitale Date: Thu, 2 Apr 2026 15:48:14 +0530 Subject: [PATCH 307/373] KVM: riscv: selftests: Implement kvm_arch_has_default_irqchip kvm_arch_has_default_irqchip is required for irqfd_test and returns true if an in-kernel interrupt controller is supported. 
Fixes: a133052666bed ("KVM: selftests: Fix irqfd_test for non-x86 architectures") Signed-off-by: Mayuresh Chitale Link: https://lore.kernel.org/r/20260402101818.2982071-1-mayuresh.chitale@oss.qualcomm.com Signed-off-by: Anup Patel --- tools/testing/selftests/kvm/lib/riscv/processor.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/tools/testing/selftests/kvm/lib/riscv/processor.c b/tools/testing/selftests/kvm/lib/riscv/processor.c index 51dd455ff52c..067c6b2c15b0 100644 --- a/tools/testing/selftests/kvm/lib/riscv/processor.c +++ b/tools/testing/selftests/kvm/lib/riscv/processor.c @@ -566,3 +566,8 @@ unsigned long riscv64_get_satp_mode(void) return val; } + +bool kvm_arch_has_default_irqchip(void) +{ + return kvm_check_cap(KVM_CAP_IRQCHIP); +} From cf05b059d59fff0207d93d7d21b34dad9b460b20 Mon Sep 17 00:00:00 2001 From: Anup Patel Date: Tue, 20 Jan 2026 13:29:50 +0530 Subject: [PATCH 308/373] RISC-V: KVM: Introduce common kvm_riscv_isa_check_host() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rename kvm_riscv_vcpu_isa_check_host() to kvm_riscv_isa_check_host() and use it as a common function within KVM RISC-V to check ISA extensions supported by the host.
Signed-off-by: Anup Patel Reviewed-by: Radim Krčmář Link: https://lore.kernel.org/r/20260120080013.2153519-5-anup.patel@oss.qualcomm.com Signed-off-by: Anup Patel --- arch/riscv/include/asm/kvm_host.h | 4 ++++ arch/riscv/kvm/aia_device.c | 2 +- arch/riscv/kvm/vcpu_fp.c | 8 +++---- arch/riscv/kvm/vcpu_onereg.c | 38 ++++++++++++++++--------------- arch/riscv/kvm/vcpu_pmu.c | 2 +- arch/riscv/kvm/vcpu_timer.c | 2 +- arch/riscv/kvm/vcpu_vector.c | 4 ++-- 7 files changed, 33 insertions(+), 27 deletions(-) diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h index 7ee47b83c80d..cf8da3ec6392 100644 --- a/arch/riscv/include/asm/kvm_host.h +++ b/arch/riscv/include/asm/kvm_host.h @@ -311,6 +311,10 @@ int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run, void __kvm_riscv_switch_to(struct kvm_vcpu_arch *vcpu_arch); +int __kvm_riscv_isa_check_host(unsigned long kvm_ext, unsigned long *guest_ext); +#define kvm_riscv_isa_check_host(ext) \ + __kvm_riscv_isa_check_host(KVM_RISCV_ISA_EXT_##ext, NULL) + void kvm_riscv_vcpu_setup_isa(struct kvm_vcpu *vcpu); unsigned long kvm_riscv_vcpu_num_regs(struct kvm_vcpu *vcpu); int kvm_riscv_vcpu_copy_reg_indices(struct kvm_vcpu *vcpu, diff --git a/arch/riscv/kvm/aia_device.c b/arch/riscv/kvm/aia_device.c index 49c71d3cdb00..f3010dd2030a 100644 --- a/arch/riscv/kvm/aia_device.c +++ b/arch/riscv/kvm/aia_device.c @@ -23,7 +23,7 @@ static int aia_create(struct kvm_device *dev, u32 type) if (irqchip_in_kernel(kvm)) return -EEXIST; - if (!riscv_isa_extension_available(NULL, SSAIA)) + if (kvm_riscv_isa_check_host(SSAIA)) return -ENODEV; ret = -EBUSY; diff --git a/arch/riscv/kvm/vcpu_fp.c b/arch/riscv/kvm/vcpu_fp.c index bd5a9e7e7165..2faa0cd37b69 100644 --- a/arch/riscv/kvm/vcpu_fp.c +++ b/arch/riscv/kvm/vcpu_fp.c @@ -60,17 +60,17 @@ void kvm_riscv_vcpu_guest_fp_restore(struct kvm_cpu_context *cntx, void kvm_riscv_vcpu_host_fp_save(struct kvm_cpu_context *cntx) { /* No need to check host sstatus as it can be 
modified outside */ - if (riscv_isa_extension_available(NULL, d)) + if (!kvm_riscv_isa_check_host(D)) __kvm_riscv_fp_d_save(cntx); - else if (riscv_isa_extension_available(NULL, f)) + else if (!kvm_riscv_isa_check_host(F)) __kvm_riscv_fp_f_save(cntx); } void kvm_riscv_vcpu_host_fp_restore(struct kvm_cpu_context *cntx) { - if (riscv_isa_extension_available(NULL, d)) + if (!kvm_riscv_isa_check_host(D)) __kvm_riscv_fp_d_restore(cntx); - else if (riscv_isa_extension_available(NULL, f)) + else if (!kvm_riscv_isa_check_host(F)) __kvm_riscv_fp_f_restore(cntx); } #endif diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c index 97fa72ba47c1..a92351b78bf8 100644 --- a/arch/riscv/kvm/vcpu_onereg.c +++ b/arch/riscv/kvm/vcpu_onereg.c @@ -120,7 +120,7 @@ static unsigned long kvm_riscv_vcpu_base2isa_ext(unsigned long base_ext) return KVM_RISCV_ISA_EXT_MAX; } -static int kvm_riscv_vcpu_isa_check_host(unsigned long kvm_ext, unsigned long *guest_ext) +int __kvm_riscv_isa_check_host(unsigned long kvm_ext, unsigned long *base_ext) { unsigned long host_ext; @@ -129,8 +129,7 @@ static int kvm_riscv_vcpu_isa_check_host(unsigned long kvm_ext, unsigned long *g return -ENOENT; kvm_ext = array_index_nospec(kvm_ext, ARRAY_SIZE(kvm_isa_ext_arr)); - *guest_ext = kvm_isa_ext_arr[kvm_ext]; - switch (*guest_ext) { + switch (kvm_isa_ext_arr[kvm_ext]) { case RISCV_ISA_EXT_SMNPM: /* * Pointer masking effective in (H)S-mode is provided by the @@ -141,13 +140,16 @@ static int kvm_riscv_vcpu_isa_check_host(unsigned long kvm_ext, unsigned long *g host_ext = RISCV_ISA_EXT_SSNPM; break; default: - host_ext = *guest_ext; + host_ext = kvm_isa_ext_arr[kvm_ext]; break; } if (!__riscv_isa_extension_available(NULL, host_ext)) return -ENOENT; + if (base_ext) + *base_ext = kvm_isa_ext_arr[kvm_ext]; + return 0; } @@ -158,7 +160,7 @@ static bool kvm_riscv_vcpu_isa_enable_allowed(unsigned long ext) return false; case KVM_RISCV_ISA_EXT_SSCOFPMF: /* Sscofpmf depends on interrupt filtering defined 
in ssaia */ - return __riscv_isa_extension_available(NULL, RISCV_ISA_EXT_SSAIA); + return !kvm_riscv_isa_check_host(SSAIA); case KVM_RISCV_ISA_EXT_SVADU: /* * The henvcfg.ADUE is read-only zero if menvcfg.ADUE is zero. @@ -265,7 +267,7 @@ void kvm_riscv_vcpu_setup_isa(struct kvm_vcpu *vcpu) unsigned long guest_ext, i; for (i = 0; i < ARRAY_SIZE(kvm_isa_ext_arr); i++) { - if (kvm_riscv_vcpu_isa_check_host(i, &guest_ext)) + if (__kvm_riscv_isa_check_host(i, &guest_ext)) continue; if (kvm_riscv_vcpu_isa_enable_allowed(i)) set_bit(guest_ext, vcpu->arch.isa); @@ -290,17 +292,17 @@ static int kvm_riscv_vcpu_get_reg_config(struct kvm_vcpu *vcpu, reg_val = vcpu->arch.isa[0] & KVM_RISCV_BASE_ISA_MASK; break; case KVM_REG_RISCV_CONFIG_REG(zicbom_block_size): - if (!riscv_isa_extension_available(NULL, ZICBOM)) + if (kvm_riscv_isa_check_host(ZICBOM)) return -ENOENT; reg_val = riscv_cbom_block_size; break; case KVM_REG_RISCV_CONFIG_REG(zicboz_block_size): - if (!riscv_isa_extension_available(NULL, ZICBOZ)) + if (kvm_riscv_isa_check_host(ZICBOZ)) return -ENOENT; reg_val = riscv_cboz_block_size; break; case KVM_REG_RISCV_CONFIG_REG(zicbop_block_size): - if (!riscv_isa_extension_available(NULL, ZICBOP)) + if (kvm_riscv_isa_check_host(ZICBOP)) return -ENOENT; reg_val = riscv_cbop_block_size; break; @@ -384,19 +386,19 @@ static int kvm_riscv_vcpu_set_reg_config(struct kvm_vcpu *vcpu, } break; case KVM_REG_RISCV_CONFIG_REG(zicbom_block_size): - if (!riscv_isa_extension_available(NULL, ZICBOM)) + if (kvm_riscv_isa_check_host(ZICBOM)) return -ENOENT; if (reg_val != riscv_cbom_block_size) return -EINVAL; break; case KVM_REG_RISCV_CONFIG_REG(zicboz_block_size): - if (!riscv_isa_extension_available(NULL, ZICBOZ)) + if (kvm_riscv_isa_check_host(ZICBOZ)) return -ENOENT; if (reg_val != riscv_cboz_block_size) return -EINVAL; break; case KVM_REG_RISCV_CONFIG_REG(zicbop_block_size): - if (!riscv_isa_extension_available(NULL, ZICBOP)) + if (kvm_riscv_isa_check_host(ZICBOP)) return -ENOENT; if 
(reg_val != riscv_cbop_block_size) return -EINVAL; @@ -682,7 +684,7 @@ static int riscv_vcpu_get_isa_ext_single(struct kvm_vcpu *vcpu, unsigned long guest_ext; int ret; - ret = kvm_riscv_vcpu_isa_check_host(reg_num, &guest_ext); + ret = __kvm_riscv_isa_check_host(reg_num, &guest_ext); if (ret) return ret; @@ -700,7 +702,7 @@ static int riscv_vcpu_set_isa_ext_single(struct kvm_vcpu *vcpu, unsigned long guest_ext; int ret; - ret = kvm_riscv_vcpu_isa_check_host(reg_num, &guest_ext); + ret = __kvm_riscv_isa_check_host(reg_num, &guest_ext); if (ret) return ret; @@ -859,13 +861,13 @@ static int copy_config_reg_indices(const struct kvm_vcpu *vcpu, * was not available. */ if (i == KVM_REG_RISCV_CONFIG_REG(zicbom_block_size) && - !riscv_isa_extension_available(NULL, ZICBOM)) + kvm_riscv_isa_check_host(ZICBOM)) continue; else if (i == KVM_REG_RISCV_CONFIG_REG(zicboz_block_size) && - !riscv_isa_extension_available(NULL, ZICBOZ)) + kvm_riscv_isa_check_host(ZICBOZ)) continue; else if (i == KVM_REG_RISCV_CONFIG_REG(zicbop_block_size) && - !riscv_isa_extension_available(NULL, ZICBOP)) + kvm_riscv_isa_check_host(ZICBOP)) continue; size = IS_ENABLED(CONFIG_32BIT) ? KVM_REG_SIZE_U32 : KVM_REG_SIZE_U64; @@ -1086,7 +1088,7 @@ static int copy_isa_ext_reg_indices(const struct kvm_vcpu *vcpu, KVM_REG_SIZE_U32 : KVM_REG_SIZE_U64; u64 reg = KVM_REG_RISCV | size | KVM_REG_RISCV_ISA_EXT | i; - if (kvm_riscv_vcpu_isa_check_host(i, &guest_ext)) + if (__kvm_riscv_isa_check_host(i, &guest_ext)) continue; if (uindices) { diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c index 49427094a079..a8ca1e65734a 100644 --- a/arch/riscv/kvm/vcpu_pmu.c +++ b/arch/riscv/kvm/vcpu_pmu.c @@ -845,7 +845,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu) * filtering is available in the host. Otherwise, guest will always count * events while the execution is in hypervisor mode. 
*/ - if (!riscv_isa_extension_available(NULL, SSCOFPMF)) + if (kvm_riscv_isa_check_host(SSCOFPMF)) return; ret = riscv_pmu_get_hpm_info(&hpm_width, &num_hw_ctrs); diff --git a/arch/riscv/kvm/vcpu_timer.c b/arch/riscv/kvm/vcpu_timer.c index f36247e4c783..cac4f3a5f213 100644 --- a/arch/riscv/kvm/vcpu_timer.c +++ b/arch/riscv/kvm/vcpu_timer.c @@ -253,7 +253,7 @@ int kvm_riscv_vcpu_timer_init(struct kvm_vcpu *vcpu) t->next_set = false; /* Enable sstc for every vcpu if available in hardware */ - if (riscv_isa_extension_available(NULL, SSTC)) { + if (!kvm_riscv_isa_check_host(SSTC)) { t->sstc_enabled = true; hrtimer_setup(&t->hrt, kvm_riscv_vcpu_vstimer_expired, CLOCK_MONOTONIC, HRTIMER_MODE_REL); diff --git a/arch/riscv/kvm/vcpu_vector.c b/arch/riscv/kvm/vcpu_vector.c index f3f5fb665cf6..42ffb489c447 100644 --- a/arch/riscv/kvm/vcpu_vector.c +++ b/arch/riscv/kvm/vcpu_vector.c @@ -63,13 +63,13 @@ void kvm_riscv_vcpu_guest_vector_restore(struct kvm_cpu_context *cntx, void kvm_riscv_vcpu_host_vector_save(struct kvm_cpu_context *cntx) { /* No need to check host sstatus as it can be modified outside */ - if (riscv_isa_extension_available(NULL, v)) + if (!kvm_riscv_isa_check_host(V)) __kvm_riscv_vector_save(cntx); } void kvm_riscv_vcpu_host_vector_restore(struct kvm_cpu_context *cntx) { - if (riscv_isa_extension_available(NULL, v)) + if (!kvm_riscv_isa_check_host(V)) __kvm_riscv_vector_restore(cntx); } From e0b5cfc316f1d36e22b0745d20360b4719b89209 Mon Sep 17 00:00:00 2001 From: Anup Patel Date: Tue, 20 Jan 2026 13:29:51 +0530 Subject: [PATCH 309/373] RISC-V: KVM: Factor-out ISA checks into separate sources MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The KVM ISA extension related checks are not VCPU specific and should be factored out of vcpu_onereg.c into separate sources. 
Signed-off-by: Anup Patel Reviewed-by: Radim Krčmář Link: https://lore.kernel.org/r/20260120080013.2153519-6-anup.patel@oss.qualcomm.com Signed-off-by: Anup Patel --- arch/riscv/include/asm/kvm_host.h | 4 - arch/riscv/include/asm/kvm_isa.h | 20 +++ arch/riscv/kvm/Makefile | 1 + arch/riscv/kvm/aia_device.c | 2 +- arch/riscv/kvm/isa.c | 253 +++++++++++++++++++++++++++++ arch/riscv/kvm/vcpu_fp.c | 1 + arch/riscv/kvm/vcpu_onereg.c | 258 +----------------------------- arch/riscv/kvm/vcpu_pmu.c | 3 +- arch/riscv/kvm/vcpu_timer.c | 1 + arch/riscv/kvm/vcpu_vector.c | 1 + 10 files changed, 288 insertions(+), 256 deletions(-) create mode 100644 arch/riscv/include/asm/kvm_isa.h create mode 100644 arch/riscv/kvm/isa.c diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h index cf8da3ec6392..7ee47b83c80d 100644 --- a/arch/riscv/include/asm/kvm_host.h +++ b/arch/riscv/include/asm/kvm_host.h @@ -311,10 +311,6 @@ int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run, void __kvm_riscv_switch_to(struct kvm_vcpu_arch *vcpu_arch); -int __kvm_riscv_isa_check_host(unsigned long kvm_ext, unsigned long *guest_ext); -#define kvm_riscv_isa_check_host(ext) \ - __kvm_riscv_isa_check_host(KVM_RISCV_ISA_EXT_##ext, NULL) - void kvm_riscv_vcpu_setup_isa(struct kvm_vcpu *vcpu); unsigned long kvm_riscv_vcpu_num_regs(struct kvm_vcpu *vcpu); int kvm_riscv_vcpu_copy_reg_indices(struct kvm_vcpu *vcpu, diff --git a/arch/riscv/include/asm/kvm_isa.h b/arch/riscv/include/asm/kvm_isa.h new file mode 100644 index 000000000000..bc4b956d5f17 --- /dev/null +++ b/arch/riscv/include/asm/kvm_isa.h @@ -0,0 +1,20 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2026 Qualcomm Technologies, Inc. 
+ */ + +#ifndef __KVM_RISCV_ISA_H +#define __KVM_RISCV_ISA_H + +#include + +unsigned long kvm_riscv_base2isa_ext(unsigned long base_ext); + +int __kvm_riscv_isa_check_host(unsigned long ext, unsigned long *base_ext); +#define kvm_riscv_isa_check_host(ext) \ + __kvm_riscv_isa_check_host(KVM_RISCV_ISA_EXT_##ext, NULL) + +bool kvm_riscv_isa_enable_allowed(unsigned long ext); +bool kvm_riscv_isa_disable_allowed(unsigned long ext); + +#endif diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile index 3b8afb038b35..07eab96189e7 100644 --- a/arch/riscv/kvm/Makefile +++ b/arch/riscv/kvm/Makefile @@ -15,6 +15,7 @@ kvm-y += aia_aplic.o kvm-y += aia_device.o kvm-y += aia_imsic.o kvm-y += gstage.o +kvm-y += isa.o kvm-y += main.o kvm-y += mmu.o kvm-y += nacl.o diff --git a/arch/riscv/kvm/aia_device.c b/arch/riscv/kvm/aia_device.c index f3010dd2030a..3d1e81e2a36b 100644 --- a/arch/riscv/kvm/aia_device.c +++ b/arch/riscv/kvm/aia_device.c @@ -11,7 +11,7 @@ #include #include #include -#include +#include static int aia_create(struct kvm_device *dev, u32 type) { diff --git a/arch/riscv/kvm/isa.c b/arch/riscv/kvm/isa.c new file mode 100644 index 000000000000..1132d909cc25 --- /dev/null +++ b/arch/riscv/kvm/isa.c @@ -0,0 +1,253 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2026 Qualcomm Technologies, Inc. 
+ */ + +#include +#include +#include +#include +#include +#include +#include + +#define KVM_ISA_EXT_ARR(ext) \ +[KVM_RISCV_ISA_EXT_##ext] = RISCV_ISA_EXT_##ext + +/* Mapping between KVM ISA Extension ID & guest ISA extension ID */ +static const unsigned long kvm_isa_ext_arr[] = { + /* Single letter extensions (alphabetically sorted) */ + [KVM_RISCV_ISA_EXT_A] = RISCV_ISA_EXT_a, + [KVM_RISCV_ISA_EXT_C] = RISCV_ISA_EXT_c, + [KVM_RISCV_ISA_EXT_D] = RISCV_ISA_EXT_d, + [KVM_RISCV_ISA_EXT_F] = RISCV_ISA_EXT_f, + [KVM_RISCV_ISA_EXT_H] = RISCV_ISA_EXT_h, + [KVM_RISCV_ISA_EXT_I] = RISCV_ISA_EXT_i, + [KVM_RISCV_ISA_EXT_M] = RISCV_ISA_EXT_m, + [KVM_RISCV_ISA_EXT_V] = RISCV_ISA_EXT_v, + /* Multi letter extensions (alphabetically sorted) */ + KVM_ISA_EXT_ARR(SMNPM), + KVM_ISA_EXT_ARR(SMSTATEEN), + KVM_ISA_EXT_ARR(SSAIA), + KVM_ISA_EXT_ARR(SSCOFPMF), + KVM_ISA_EXT_ARR(SSNPM), + KVM_ISA_EXT_ARR(SSTC), + KVM_ISA_EXT_ARR(SVADE), + KVM_ISA_EXT_ARR(SVADU), + KVM_ISA_EXT_ARR(SVINVAL), + KVM_ISA_EXT_ARR(SVNAPOT), + KVM_ISA_EXT_ARR(SVPBMT), + KVM_ISA_EXT_ARR(SVVPTC), + KVM_ISA_EXT_ARR(ZAAMO), + KVM_ISA_EXT_ARR(ZABHA), + KVM_ISA_EXT_ARR(ZACAS), + KVM_ISA_EXT_ARR(ZALASR), + KVM_ISA_EXT_ARR(ZALRSC), + KVM_ISA_EXT_ARR(ZAWRS), + KVM_ISA_EXT_ARR(ZBA), + KVM_ISA_EXT_ARR(ZBB), + KVM_ISA_EXT_ARR(ZBC), + KVM_ISA_EXT_ARR(ZBKB), + KVM_ISA_EXT_ARR(ZBKC), + KVM_ISA_EXT_ARR(ZBKX), + KVM_ISA_EXT_ARR(ZBS), + KVM_ISA_EXT_ARR(ZCA), + KVM_ISA_EXT_ARR(ZCB), + KVM_ISA_EXT_ARR(ZCD), + KVM_ISA_EXT_ARR(ZCF), + KVM_ISA_EXT_ARR(ZCLSD), + KVM_ISA_EXT_ARR(ZCMOP), + KVM_ISA_EXT_ARR(ZFA), + KVM_ISA_EXT_ARR(ZFBFMIN), + KVM_ISA_EXT_ARR(ZFH), + KVM_ISA_EXT_ARR(ZFHMIN), + KVM_ISA_EXT_ARR(ZICBOM), + KVM_ISA_EXT_ARR(ZICBOP), + KVM_ISA_EXT_ARR(ZICBOZ), + KVM_ISA_EXT_ARR(ZICCRSE), + KVM_ISA_EXT_ARR(ZICNTR), + KVM_ISA_EXT_ARR(ZICOND), + KVM_ISA_EXT_ARR(ZICSR), + KVM_ISA_EXT_ARR(ZIFENCEI), + KVM_ISA_EXT_ARR(ZIHINTNTL), + KVM_ISA_EXT_ARR(ZIHINTPAUSE), + KVM_ISA_EXT_ARR(ZIHPM), + KVM_ISA_EXT_ARR(ZILSD), + KVM_ISA_EXT_ARR(ZIMOP), 
+ KVM_ISA_EXT_ARR(ZKND), + KVM_ISA_EXT_ARR(ZKNE), + KVM_ISA_EXT_ARR(ZKNH), + KVM_ISA_EXT_ARR(ZKR), + KVM_ISA_EXT_ARR(ZKSED), + KVM_ISA_EXT_ARR(ZKSH), + KVM_ISA_EXT_ARR(ZKT), + KVM_ISA_EXT_ARR(ZTSO), + KVM_ISA_EXT_ARR(ZVBB), + KVM_ISA_EXT_ARR(ZVBC), + KVM_ISA_EXT_ARR(ZVFBFMIN), + KVM_ISA_EXT_ARR(ZVFBFWMA), + KVM_ISA_EXT_ARR(ZVFH), + KVM_ISA_EXT_ARR(ZVFHMIN), + KVM_ISA_EXT_ARR(ZVKB), + KVM_ISA_EXT_ARR(ZVKG), + KVM_ISA_EXT_ARR(ZVKNED), + KVM_ISA_EXT_ARR(ZVKNHA), + KVM_ISA_EXT_ARR(ZVKNHB), + KVM_ISA_EXT_ARR(ZVKSED), + KVM_ISA_EXT_ARR(ZVKSH), + KVM_ISA_EXT_ARR(ZVKT), +}; + +unsigned long kvm_riscv_base2isa_ext(unsigned long base_ext) +{ + unsigned long i; + + for (i = 0; i < KVM_RISCV_ISA_EXT_MAX; i++) { + if (kvm_isa_ext_arr[i] == base_ext) + return i; + } + + return KVM_RISCV_ISA_EXT_MAX; +} + +int __kvm_riscv_isa_check_host(unsigned long kvm_ext, unsigned long *base_ext) +{ + unsigned long host_ext; + + if (kvm_ext >= KVM_RISCV_ISA_EXT_MAX || + kvm_ext >= ARRAY_SIZE(kvm_isa_ext_arr)) + return -ENOENT; + + kvm_ext = array_index_nospec(kvm_ext, ARRAY_SIZE(kvm_isa_ext_arr)); + switch (kvm_isa_ext_arr[kvm_ext]) { + case RISCV_ISA_EXT_SMNPM: + /* + * Pointer masking effective in (H)S-mode is provided by the + * Smnpm extension, so that extension is reported to the guest, + * even though the CSR bits for configuring VS-mode pointer + * masking on the host side are part of the Ssnpm extension. 
+ */ + host_ext = RISCV_ISA_EXT_SSNPM; + break; + default: + host_ext = kvm_isa_ext_arr[kvm_ext]; + break; + } + + if (!__riscv_isa_extension_available(NULL, host_ext)) + return -ENOENT; + + if (base_ext) + *base_ext = kvm_isa_ext_arr[kvm_ext]; + + return 0; +} + +bool kvm_riscv_isa_enable_allowed(unsigned long ext) +{ + switch (ext) { + case KVM_RISCV_ISA_EXT_H: + return false; + case KVM_RISCV_ISA_EXT_SSCOFPMF: + /* Sscofpmf depends on interrupt filtering defined in ssaia */ + return !kvm_riscv_isa_check_host(SSAIA); + case KVM_RISCV_ISA_EXT_SVADU: + /* + * The henvcfg.ADUE is read-only zero if menvcfg.ADUE is zero. + * Guest OS can use Svadu only when host OS enable Svadu. + */ + return arch_has_hw_pte_young(); + case KVM_RISCV_ISA_EXT_V: + return riscv_v_vstate_ctrl_user_allowed(); + default: + break; + } + + return true; +} + +bool kvm_riscv_isa_disable_allowed(unsigned long ext) +{ + switch (ext) { + /* Extensions which don't have any mechanism to disable */ + case KVM_RISCV_ISA_EXT_A: + case KVM_RISCV_ISA_EXT_C: + case KVM_RISCV_ISA_EXT_I: + case KVM_RISCV_ISA_EXT_M: + /* There is not architectural config bit to disable sscofpmf completely */ + case KVM_RISCV_ISA_EXT_SSCOFPMF: + case KVM_RISCV_ISA_EXT_SSNPM: + case KVM_RISCV_ISA_EXT_SSTC: + case KVM_RISCV_ISA_EXT_SVINVAL: + case KVM_RISCV_ISA_EXT_SVNAPOT: + case KVM_RISCV_ISA_EXT_SVVPTC: + case KVM_RISCV_ISA_EXT_ZAAMO: + case KVM_RISCV_ISA_EXT_ZABHA: + case KVM_RISCV_ISA_EXT_ZACAS: + case KVM_RISCV_ISA_EXT_ZALASR: + case KVM_RISCV_ISA_EXT_ZALRSC: + case KVM_RISCV_ISA_EXT_ZAWRS: + case KVM_RISCV_ISA_EXT_ZBA: + case KVM_RISCV_ISA_EXT_ZBB: + case KVM_RISCV_ISA_EXT_ZBC: + case KVM_RISCV_ISA_EXT_ZBKB: + case KVM_RISCV_ISA_EXT_ZBKC: + case KVM_RISCV_ISA_EXT_ZBKX: + case KVM_RISCV_ISA_EXT_ZBS: + case KVM_RISCV_ISA_EXT_ZCA: + case KVM_RISCV_ISA_EXT_ZCB: + case KVM_RISCV_ISA_EXT_ZCD: + case KVM_RISCV_ISA_EXT_ZCF: + case KVM_RISCV_ISA_EXT_ZCMOP: + case KVM_RISCV_ISA_EXT_ZFA: + case KVM_RISCV_ISA_EXT_ZFBFMIN: + case 
KVM_RISCV_ISA_EXT_ZFH: + case KVM_RISCV_ISA_EXT_ZFHMIN: + case KVM_RISCV_ISA_EXT_ZICBOP: + case KVM_RISCV_ISA_EXT_ZICCRSE: + case KVM_RISCV_ISA_EXT_ZICNTR: + case KVM_RISCV_ISA_EXT_ZICOND: + case KVM_RISCV_ISA_EXT_ZICSR: + case KVM_RISCV_ISA_EXT_ZIFENCEI: + case KVM_RISCV_ISA_EXT_ZIHINTNTL: + case KVM_RISCV_ISA_EXT_ZIHINTPAUSE: + case KVM_RISCV_ISA_EXT_ZIHPM: + case KVM_RISCV_ISA_EXT_ZIMOP: + case KVM_RISCV_ISA_EXT_ZKND: + case KVM_RISCV_ISA_EXT_ZKNE: + case KVM_RISCV_ISA_EXT_ZKNH: + case KVM_RISCV_ISA_EXT_ZKR: + case KVM_RISCV_ISA_EXT_ZKSED: + case KVM_RISCV_ISA_EXT_ZKSH: + case KVM_RISCV_ISA_EXT_ZKT: + case KVM_RISCV_ISA_EXT_ZTSO: + case KVM_RISCV_ISA_EXT_ZVBB: + case KVM_RISCV_ISA_EXT_ZVBC: + case KVM_RISCV_ISA_EXT_ZVFBFMIN: + case KVM_RISCV_ISA_EXT_ZVFBFWMA: + case KVM_RISCV_ISA_EXT_ZVFH: + case KVM_RISCV_ISA_EXT_ZVFHMIN: + case KVM_RISCV_ISA_EXT_ZVKB: + case KVM_RISCV_ISA_EXT_ZVKG: + case KVM_RISCV_ISA_EXT_ZVKNED: + case KVM_RISCV_ISA_EXT_ZVKNHA: + case KVM_RISCV_ISA_EXT_ZVKNHB: + case KVM_RISCV_ISA_EXT_ZVKSED: + case KVM_RISCV_ISA_EXT_ZVKSH: + case KVM_RISCV_ISA_EXT_ZVKT: + return false; + /* Extensions which can be disabled using Smstateen */ + case KVM_RISCV_ISA_EXT_SSAIA: + return riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN); + case KVM_RISCV_ISA_EXT_SVADE: + /* + * The henvcfg.ADUE is read-only zero if menvcfg.ADUE is zero. + * Svade can't be disabled unless we support Svadu. 
+ */ + return arch_has_hw_pte_young(); + default: + break; + } + + return true; +} diff --git a/arch/riscv/kvm/vcpu_fp.c b/arch/riscv/kvm/vcpu_fp.c index 2faa0cd37b69..6ad6df26a2fd 100644 --- a/arch/riscv/kvm/vcpu_fp.c +++ b/arch/riscv/kvm/vcpu_fp.c @@ -13,6 +13,7 @@ #include #include #include +#include #ifdef CONFIG_FPU void kvm_riscv_vcpu_fp_reset(struct kvm_vcpu *vcpu) diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c index a92351b78bf8..bb920e8923c9 100644 --- a/arch/riscv/kvm/vcpu_onereg.c +++ b/arch/riscv/kvm/vcpu_onereg.c @@ -15,261 +15,19 @@ #include #include #include +#include #include -#include -#include #define KVM_RISCV_BASE_ISA_MASK GENMASK(25, 0) -#define KVM_ISA_EXT_ARR(ext) \ -[KVM_RISCV_ISA_EXT_##ext] = RISCV_ISA_EXT_##ext - -/* Mapping between KVM ISA Extension ID & guest ISA extension ID */ -static const unsigned long kvm_isa_ext_arr[] = { - /* Single letter extensions (alphabetically sorted) */ - [KVM_RISCV_ISA_EXT_A] = RISCV_ISA_EXT_a, - [KVM_RISCV_ISA_EXT_C] = RISCV_ISA_EXT_c, - [KVM_RISCV_ISA_EXT_D] = RISCV_ISA_EXT_d, - [KVM_RISCV_ISA_EXT_F] = RISCV_ISA_EXT_f, - [KVM_RISCV_ISA_EXT_H] = RISCV_ISA_EXT_h, - [KVM_RISCV_ISA_EXT_I] = RISCV_ISA_EXT_i, - [KVM_RISCV_ISA_EXT_M] = RISCV_ISA_EXT_m, - [KVM_RISCV_ISA_EXT_V] = RISCV_ISA_EXT_v, - /* Multi letter extensions (alphabetically sorted) */ - KVM_ISA_EXT_ARR(SMNPM), - KVM_ISA_EXT_ARR(SMSTATEEN), - KVM_ISA_EXT_ARR(SSAIA), - KVM_ISA_EXT_ARR(SSCOFPMF), - KVM_ISA_EXT_ARR(SSNPM), - KVM_ISA_EXT_ARR(SSTC), - KVM_ISA_EXT_ARR(SVADE), - KVM_ISA_EXT_ARR(SVADU), - KVM_ISA_EXT_ARR(SVINVAL), - KVM_ISA_EXT_ARR(SVNAPOT), - KVM_ISA_EXT_ARR(SVPBMT), - KVM_ISA_EXT_ARR(SVVPTC), - KVM_ISA_EXT_ARR(ZAAMO), - KVM_ISA_EXT_ARR(ZABHA), - KVM_ISA_EXT_ARR(ZACAS), - KVM_ISA_EXT_ARR(ZALASR), - KVM_ISA_EXT_ARR(ZALRSC), - KVM_ISA_EXT_ARR(ZAWRS), - KVM_ISA_EXT_ARR(ZBA), - KVM_ISA_EXT_ARR(ZBB), - KVM_ISA_EXT_ARR(ZBC), - KVM_ISA_EXT_ARR(ZBKB), - KVM_ISA_EXT_ARR(ZBKC), - KVM_ISA_EXT_ARR(ZBKX), - 
KVM_ISA_EXT_ARR(ZBS), - KVM_ISA_EXT_ARR(ZCA), - KVM_ISA_EXT_ARR(ZCB), - KVM_ISA_EXT_ARR(ZCD), - KVM_ISA_EXT_ARR(ZCF), - KVM_ISA_EXT_ARR(ZCLSD), - KVM_ISA_EXT_ARR(ZCMOP), - KVM_ISA_EXT_ARR(ZFA), - KVM_ISA_EXT_ARR(ZFBFMIN), - KVM_ISA_EXT_ARR(ZFH), - KVM_ISA_EXT_ARR(ZFHMIN), - KVM_ISA_EXT_ARR(ZICBOM), - KVM_ISA_EXT_ARR(ZICBOP), - KVM_ISA_EXT_ARR(ZICBOZ), - KVM_ISA_EXT_ARR(ZICCRSE), - KVM_ISA_EXT_ARR(ZICNTR), - KVM_ISA_EXT_ARR(ZICOND), - KVM_ISA_EXT_ARR(ZICSR), - KVM_ISA_EXT_ARR(ZIFENCEI), - KVM_ISA_EXT_ARR(ZIHINTNTL), - KVM_ISA_EXT_ARR(ZIHINTPAUSE), - KVM_ISA_EXT_ARR(ZIHPM), - KVM_ISA_EXT_ARR(ZILSD), - KVM_ISA_EXT_ARR(ZIMOP), - KVM_ISA_EXT_ARR(ZKND), - KVM_ISA_EXT_ARR(ZKNE), - KVM_ISA_EXT_ARR(ZKNH), - KVM_ISA_EXT_ARR(ZKR), - KVM_ISA_EXT_ARR(ZKSED), - KVM_ISA_EXT_ARR(ZKSH), - KVM_ISA_EXT_ARR(ZKT), - KVM_ISA_EXT_ARR(ZTSO), - KVM_ISA_EXT_ARR(ZVBB), - KVM_ISA_EXT_ARR(ZVBC), - KVM_ISA_EXT_ARR(ZVFBFMIN), - KVM_ISA_EXT_ARR(ZVFBFWMA), - KVM_ISA_EXT_ARR(ZVFH), - KVM_ISA_EXT_ARR(ZVFHMIN), - KVM_ISA_EXT_ARR(ZVKB), - KVM_ISA_EXT_ARR(ZVKG), - KVM_ISA_EXT_ARR(ZVKNED), - KVM_ISA_EXT_ARR(ZVKNHA), - KVM_ISA_EXT_ARR(ZVKNHB), - KVM_ISA_EXT_ARR(ZVKSED), - KVM_ISA_EXT_ARR(ZVKSH), - KVM_ISA_EXT_ARR(ZVKT), -}; - -static unsigned long kvm_riscv_vcpu_base2isa_ext(unsigned long base_ext) -{ - unsigned long i; - - for (i = 0; i < KVM_RISCV_ISA_EXT_MAX; i++) { - if (kvm_isa_ext_arr[i] == base_ext) - return i; - } - - return KVM_RISCV_ISA_EXT_MAX; -} - -int __kvm_riscv_isa_check_host(unsigned long kvm_ext, unsigned long *base_ext) -{ - unsigned long host_ext; - - if (kvm_ext >= KVM_RISCV_ISA_EXT_MAX || - kvm_ext >= ARRAY_SIZE(kvm_isa_ext_arr)) - return -ENOENT; - - kvm_ext = array_index_nospec(kvm_ext, ARRAY_SIZE(kvm_isa_ext_arr)); - switch (kvm_isa_ext_arr[kvm_ext]) { - case RISCV_ISA_EXT_SMNPM: - /* - * Pointer masking effective in (H)S-mode is provided by the - * Smnpm extension, so that extension is reported to the guest, - * even though the CSR bits for configuring VS-mode pointer - * 
masking on the host side are part of the Ssnpm extension. - */ - host_ext = RISCV_ISA_EXT_SSNPM; - break; - default: - host_ext = kvm_isa_ext_arr[kvm_ext]; - break; - } - - if (!__riscv_isa_extension_available(NULL, host_ext)) - return -ENOENT; - - if (base_ext) - *base_ext = kvm_isa_ext_arr[kvm_ext]; - - return 0; -} - -static bool kvm_riscv_vcpu_isa_enable_allowed(unsigned long ext) -{ - switch (ext) { - case KVM_RISCV_ISA_EXT_H: - return false; - case KVM_RISCV_ISA_EXT_SSCOFPMF: - /* Sscofpmf depends on interrupt filtering defined in ssaia */ - return !kvm_riscv_isa_check_host(SSAIA); - case KVM_RISCV_ISA_EXT_SVADU: - /* - * The henvcfg.ADUE is read-only zero if menvcfg.ADUE is zero. - * Guest OS can use Svadu only when host OS enable Svadu. - */ - return arch_has_hw_pte_young(); - case KVM_RISCV_ISA_EXT_V: - return riscv_v_vstate_ctrl_user_allowed(); - default: - break; - } - - return true; -} - -static bool kvm_riscv_vcpu_isa_disable_allowed(unsigned long ext) -{ - switch (ext) { - /* Extensions which don't have any mechanism to disable */ - case KVM_RISCV_ISA_EXT_A: - case KVM_RISCV_ISA_EXT_C: - case KVM_RISCV_ISA_EXT_I: - case KVM_RISCV_ISA_EXT_M: - /* There is not architectural config bit to disable sscofpmf completely */ - case KVM_RISCV_ISA_EXT_SSCOFPMF: - case KVM_RISCV_ISA_EXT_SSNPM: - case KVM_RISCV_ISA_EXT_SSTC: - case KVM_RISCV_ISA_EXT_SVINVAL: - case KVM_RISCV_ISA_EXT_SVNAPOT: - case KVM_RISCV_ISA_EXT_SVVPTC: - case KVM_RISCV_ISA_EXT_ZAAMO: - case KVM_RISCV_ISA_EXT_ZABHA: - case KVM_RISCV_ISA_EXT_ZACAS: - case KVM_RISCV_ISA_EXT_ZALASR: - case KVM_RISCV_ISA_EXT_ZALRSC: - case KVM_RISCV_ISA_EXT_ZAWRS: - case KVM_RISCV_ISA_EXT_ZBA: - case KVM_RISCV_ISA_EXT_ZBB: - case KVM_RISCV_ISA_EXT_ZBC: - case KVM_RISCV_ISA_EXT_ZBKB: - case KVM_RISCV_ISA_EXT_ZBKC: - case KVM_RISCV_ISA_EXT_ZBKX: - case KVM_RISCV_ISA_EXT_ZBS: - case KVM_RISCV_ISA_EXT_ZCA: - case KVM_RISCV_ISA_EXT_ZCB: - case KVM_RISCV_ISA_EXT_ZCD: - case KVM_RISCV_ISA_EXT_ZCF: - case 
KVM_RISCV_ISA_EXT_ZCMOP: - case KVM_RISCV_ISA_EXT_ZFA: - case KVM_RISCV_ISA_EXT_ZFBFMIN: - case KVM_RISCV_ISA_EXT_ZFH: - case KVM_RISCV_ISA_EXT_ZFHMIN: - case KVM_RISCV_ISA_EXT_ZICBOP: - case KVM_RISCV_ISA_EXT_ZICCRSE: - case KVM_RISCV_ISA_EXT_ZICNTR: - case KVM_RISCV_ISA_EXT_ZICOND: - case KVM_RISCV_ISA_EXT_ZICSR: - case KVM_RISCV_ISA_EXT_ZIFENCEI: - case KVM_RISCV_ISA_EXT_ZIHINTNTL: - case KVM_RISCV_ISA_EXT_ZIHINTPAUSE: - case KVM_RISCV_ISA_EXT_ZIHPM: - case KVM_RISCV_ISA_EXT_ZIMOP: - case KVM_RISCV_ISA_EXT_ZKND: - case KVM_RISCV_ISA_EXT_ZKNE: - case KVM_RISCV_ISA_EXT_ZKNH: - case KVM_RISCV_ISA_EXT_ZKR: - case KVM_RISCV_ISA_EXT_ZKSED: - case KVM_RISCV_ISA_EXT_ZKSH: - case KVM_RISCV_ISA_EXT_ZKT: - case KVM_RISCV_ISA_EXT_ZTSO: - case KVM_RISCV_ISA_EXT_ZVBB: - case KVM_RISCV_ISA_EXT_ZVBC: - case KVM_RISCV_ISA_EXT_ZVFBFMIN: - case KVM_RISCV_ISA_EXT_ZVFBFWMA: - case KVM_RISCV_ISA_EXT_ZVFH: - case KVM_RISCV_ISA_EXT_ZVFHMIN: - case KVM_RISCV_ISA_EXT_ZVKB: - case KVM_RISCV_ISA_EXT_ZVKG: - case KVM_RISCV_ISA_EXT_ZVKNED: - case KVM_RISCV_ISA_EXT_ZVKNHA: - case KVM_RISCV_ISA_EXT_ZVKNHB: - case KVM_RISCV_ISA_EXT_ZVKSED: - case KVM_RISCV_ISA_EXT_ZVKSH: - case KVM_RISCV_ISA_EXT_ZVKT: - return false; - /* Extensions which can be disabled using Smstateen */ - case KVM_RISCV_ISA_EXT_SSAIA: - return riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN); - case KVM_RISCV_ISA_EXT_SVADE: - /* - * The henvcfg.ADUE is read-only zero if menvcfg.ADUE is zero. - * Svade can't be disabled unless we support Svadu. 
- */ - return arch_has_hw_pte_young(); - default: - break; - } - - return true; -} - void kvm_riscv_vcpu_setup_isa(struct kvm_vcpu *vcpu) { unsigned long guest_ext, i; - for (i = 0; i < ARRAY_SIZE(kvm_isa_ext_arr); i++) { + for (i = 0; i < KVM_RISCV_ISA_EXT_MAX; i++) { if (__kvm_riscv_isa_check_host(i, &guest_ext)) continue; - if (kvm_riscv_vcpu_isa_enable_allowed(i)) + if (kvm_riscv_isa_enable_allowed(i)) set_bit(guest_ext, vcpu->arch.isa); } } @@ -363,15 +121,15 @@ static int kvm_riscv_vcpu_set_reg_config(struct kvm_vcpu *vcpu, if (!vcpu->arch.ran_atleast_once) { /* Ignore the enable/disable request for certain extensions */ for (i = 0; i < RISCV_ISA_EXT_BASE; i++) { - isa_ext = kvm_riscv_vcpu_base2isa_ext(i); + isa_ext = kvm_riscv_base2isa_ext(i); if (isa_ext >= KVM_RISCV_ISA_EXT_MAX) { reg_val &= ~BIT(i); continue; } - if (!kvm_riscv_vcpu_isa_enable_allowed(isa_ext)) + if (!kvm_riscv_isa_enable_allowed(isa_ext)) if (reg_val & BIT(i)) reg_val &= ~BIT(i); - if (!kvm_riscv_vcpu_isa_disable_allowed(isa_ext)) + if (!kvm_riscv_isa_disable_allowed(isa_ext)) if (!(reg_val & BIT(i))) reg_val |= BIT(i); } @@ -715,10 +473,10 @@ static int riscv_vcpu_set_isa_ext_single(struct kvm_vcpu *vcpu, * extension can be disabled */ if (reg_val == 1 && - kvm_riscv_vcpu_isa_enable_allowed(reg_num)) + kvm_riscv_isa_enable_allowed(reg_num)) set_bit(guest_ext, vcpu->arch.isa); else if (!reg_val && - kvm_riscv_vcpu_isa_disable_allowed(reg_num)) + kvm_riscv_isa_disable_allowed(reg_num)) clear_bit(guest_ext, vcpu->arch.isa); else return -EINVAL; diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c index a8ca1e65734a..18ef07b3f13a 100644 --- a/arch/riscv/kvm/vcpu_pmu.c +++ b/arch/riscv/kvm/vcpu_pmu.c @@ -7,16 +7,17 @@ */ #define pr_fmt(fmt) "riscv-kvm-pmu: " fmt +#include #include #include #include #include #include #include +#include #include #include #include -#include #define kvm_pmu_num_counters(pmu) ((pmu)->num_hw_ctrs + (pmu)->num_fw_ctrs) #define get_event_type(x) (((x) 
& SBI_PMU_EVENT_IDX_TYPE_MASK) >> 16) diff --git a/arch/riscv/kvm/vcpu_timer.c b/arch/riscv/kvm/vcpu_timer.c index cac4f3a5f213..9817ff802821 100644 --- a/arch/riscv/kvm/vcpu_timer.c +++ b/arch/riscv/kvm/vcpu_timer.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include diff --git a/arch/riscv/kvm/vcpu_vector.c b/arch/riscv/kvm/vcpu_vector.c index 42ffb489c447..62d2fb77bb9b 100644 --- a/arch/riscv/kvm/vcpu_vector.c +++ b/arch/riscv/kvm/vcpu_vector.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include From b1834e5afbcd463c83f99dc12e1863803d5a1803 Mon Sep 17 00:00:00 2001 From: Anup Patel Date: Tue, 20 Jan 2026 13:29:52 +0530 Subject: [PATCH 310/373] RISC-V: KVM: Move timer state defines closer to struct in UAPI header The KVM_RISCV_TIMER_STATE_xyz defines specify possible values of the "state" member in struct kvm_riscv_timer so move these defines closer to struct kvm_riscv_timer in uapi/asm/kvm.h. Signed-off-by: Anup Patel Link: https://lore.kernel.org/r/20260120080013.2153519-7-anup.patel@oss.qualcomm.com Signed-off-by: Anup Patel --- arch/riscv/include/uapi/asm/kvm.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h index 6a89c1d00a72..504e73305343 100644 --- a/arch/riscv/include/uapi/asm/kvm.h +++ b/arch/riscv/include/uapi/asm/kvm.h @@ -110,6 +110,10 @@ struct kvm_riscv_timer { __u64 state; }; +/* Possible states for kvm_riscv_timer */ +#define KVM_RISCV_TIMER_STATE_OFF 0 +#define KVM_RISCV_TIMER_STATE_ON 1 + /* * ISA extension IDs specific to KVM. 
This is not the same as the host ISA * extension IDs as that is internal to the host and should not be exposed @@ -238,10 +242,6 @@ struct kvm_riscv_sbi_fwft { struct kvm_riscv_sbi_fwft_feature pointer_masking; }; -/* Possible states for kvm_riscv_timer */ -#define KVM_RISCV_TIMER_STATE_OFF 0 -#define KVM_RISCV_TIMER_STATE_ON 1 - /* If you need to interpret the index values, here is the key: */ #define KVM_REG_RISCV_TYPE_MASK 0x00000000FF000000 #define KVM_REG_RISCV_TYPE_SHIFT 24 From 0ee5501441cc7536271073063f13390164148e25 Mon Sep 17 00:00:00 2001 From: Anup Patel Date: Tue, 20 Jan 2026 13:29:53 +0530 Subject: [PATCH 311/373] RISC-V: KVM: Add hideleg to struct kvm_vcpu_config MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The hideleg CSR state when VCPU is running in guest VS/VU-mode will be different from when it is running in guest HS-mode. To achieve this, add hideleg to struct kvm_vcpu_config and re-program hideleg CSR upon every kvm_arch_vcpu_load(). 
Signed-off-by: Anup Patel Reviewed-by: Radim Krčmář Link: https://lore.kernel.org/r/20260120080013.2153519-8-anup.patel@oss.qualcomm.com Signed-off-by: Anup Patel --- arch/riscv/include/asm/kvm_host.h | 1 + arch/riscv/kvm/vcpu.c | 3 +++ 2 files changed, 4 insertions(+) diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h index 7ee47b83c80d..abf4f388eac5 100644 --- a/arch/riscv/include/asm/kvm_host.h +++ b/arch/riscv/include/asm/kvm_host.h @@ -171,6 +171,7 @@ struct kvm_vcpu_config { u64 henvcfg; u64 hstateen0; unsigned long hedeleg; + unsigned long hideleg; }; struct kvm_vcpu_smstateen_csr { diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c index 1d5c777eba80..77a9400d7293 100644 --- a/arch/riscv/kvm/vcpu.c +++ b/arch/riscv/kvm/vcpu.c @@ -136,6 +136,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) vcpu->arch.ran_atleast_once = false; vcpu->arch.cfg.hedeleg = KVM_HEDELEG_DEFAULT; + vcpu->arch.cfg.hideleg = KVM_HIDELEG_DEFAULT; vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO; bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX); @@ -610,6 +611,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) nacl_csr_write(nsh, CSR_VSCAUSE, csr->vscause); nacl_csr_write(nsh, CSR_VSTVAL, csr->vstval); nacl_csr_write(nsh, CSR_HEDELEG, cfg->hedeleg); + nacl_csr_write(nsh, CSR_HIDELEG, cfg->hideleg); nacl_csr_write(nsh, CSR_HVIP, csr->hvip); nacl_csr_write(nsh, CSR_VSATP, csr->vsatp); nacl_csr_write(nsh, CSR_HENVCFG, cfg->henvcfg); @@ -629,6 +631,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) csr_write(CSR_VSCAUSE, csr->vscause); csr_write(CSR_VSTVAL, csr->vstval); csr_write(CSR_HEDELEG, cfg->hedeleg); + csr_write(CSR_HIDELEG, cfg->hideleg); csr_write(CSR_HVIP, csr->hvip); csr_write(CSR_VSATP, csr->vsatp); csr_write(CSR_HENVCFG, cfg->henvcfg); From 6ed523e2b6129d7dd4aab801a610930d85b2d3f8 Mon Sep 17 00:00:00 2001 From: Anup Patel Date: Tue, 20 Jan 2026 13:29:54 +0530 Subject: [PATCH 312/373] RISC-V: KVM: Factor-out VCPU 
config into separate sources MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The VCPU config deals with hideleg, hedeleg, henvcfg, and hstateenX CSR configuration for each VCPU. Factor-out VCPU config into separate sources so that VCPU config can do things differently for guest HS-mode and guest VS/VU-mode. Signed-off-by: Anup Patel Reviewed-by: Radim Krčmář Link: https://lore.kernel.org/r/20260120080013.2153519-9-anup.patel@oss.qualcomm.com Signed-off-by: Anup Patel --- arch/riscv/include/asm/kvm_host.h | 20 +---- arch/riscv/include/asm/kvm_vcpu_config.h | 25 ++++++ arch/riscv/kvm/Makefile | 1 + arch/riscv/kvm/main.c | 4 +- arch/riscv/kvm/vcpu.c | 80 +++--------------- arch/riscv/kvm/vcpu_config.c | 103 +++++++++++++++++++++++ 6 files changed, 146 insertions(+), 87 deletions(-) create mode 100644 arch/riscv/include/asm/kvm_vcpu_config.h create mode 100644 arch/riscv/kvm/vcpu_config.c diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h index abf4f388eac5..85e1bb5b4d7e 100644 --- a/arch/riscv/include/asm/kvm_host.h +++ b/arch/riscv/include/asm/kvm_host.h @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -47,18 +48,6 @@ #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE -#define KVM_HEDELEG_DEFAULT (BIT(EXC_INST_MISALIGNED) | \ - BIT(EXC_INST_ILLEGAL) | \ - BIT(EXC_BREAKPOINT) | \ - BIT(EXC_SYSCALL) | \ - BIT(EXC_INST_PAGE_FAULT) | \ - BIT(EXC_LOAD_PAGE_FAULT) | \ - BIT(EXC_STORE_PAGE_FAULT)) - -#define KVM_HIDELEG_DEFAULT (BIT(IRQ_VS_SOFT) | \ - BIT(IRQ_VS_TIMER) | \ - BIT(IRQ_VS_EXT)) - #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \ KVM_DIRTY_LOG_INITIALLY_SET) @@ -167,13 +156,6 @@ struct kvm_vcpu_csr { unsigned long senvcfg; }; -struct kvm_vcpu_config { - u64 henvcfg; - u64 hstateen0; - unsigned long hedeleg; - unsigned long hideleg; -}; - struct kvm_vcpu_smstateen_csr { unsigned long sstateen0; }; diff --git 
a/arch/riscv/include/asm/kvm_vcpu_config.h b/arch/riscv/include/asm/kvm_vcpu_config.h new file mode 100644 index 000000000000..fcc15a0296b3 --- /dev/null +++ b/arch/riscv/include/asm/kvm_vcpu_config.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2026 Qualcomm Technologies, Inc. + */ + +#ifndef __KVM_VCPU_RISCV_CONFIG_H +#define __KVM_VCPU_RISCV_CONFIG_H + +#include + +struct kvm_vcpu; + +struct kvm_vcpu_config { + u64 henvcfg; + u64 hstateen0; + unsigned long hedeleg; + unsigned long hideleg; +}; + +void kvm_riscv_vcpu_config_init(struct kvm_vcpu *vcpu); +void kvm_riscv_vcpu_config_guest_debug(struct kvm_vcpu *vcpu); +void kvm_riscv_vcpu_config_ran_once(struct kvm_vcpu *vcpu); +void kvm_riscv_vcpu_config_load(struct kvm_vcpu *vcpu); + +#endif diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile index 07eab96189e7..296c2ba05089 100644 --- a/arch/riscv/kvm/Makefile +++ b/arch/riscv/kvm/Makefile @@ -21,6 +21,7 @@ kvm-y += mmu.o kvm-y += nacl.o kvm-y += tlb.o kvm-y += vcpu.o +kvm-y += vcpu_config.o kvm-y += vcpu_exit.o kvm-y += vcpu_fp.o kvm-y += vcpu_insn.o diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c index 0f3fe3986fc0..5399c3b4071d 100644 --- a/arch/riscv/kvm/main.c +++ b/arch/riscv/kvm/main.c @@ -41,8 +41,8 @@ int kvm_arch_enable_virtualization_cpu(void) if (rc) return rc; - csr_write(CSR_HEDELEG, KVM_HEDELEG_DEFAULT); - csr_write(CSR_HIDELEG, KVM_HIDELEG_DEFAULT); + csr_write(CSR_HEDELEG, 0); + csr_write(CSR_HIDELEG, 0); /* VS should access only the time counter directly. 
Everything else should trap */ csr_write(CSR_HCOUNTEREN, 0x02); diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c index 77a9400d7293..6929f7ce5948 100644 --- a/arch/riscv/kvm/vcpu.c +++ b/arch/riscv/kvm/vcpu.c @@ -135,11 +135,12 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) /* Mark this VCPU never ran */ vcpu->arch.ran_atleast_once = false; - vcpu->arch.cfg.hedeleg = KVM_HEDELEG_DEFAULT; - vcpu->arch.cfg.hideleg = KVM_HIDELEG_DEFAULT; vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO; bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX); + /* Setup VCPU config */ + kvm_riscv_vcpu_config_init(vcpu); + /* Setup ISA features available to VCPU */ kvm_riscv_vcpu_setup_isa(vcpu); @@ -532,59 +533,18 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg) { - if (dbg->control & KVM_GUESTDBG_ENABLE) { + if (dbg->control & KVM_GUESTDBG_ENABLE) vcpu->guest_debug = dbg->control; - vcpu->arch.cfg.hedeleg &= ~BIT(EXC_BREAKPOINT); - } else { + else vcpu->guest_debug = 0; - vcpu->arch.cfg.hedeleg |= BIT(EXC_BREAKPOINT); - } - - vcpu->arch.csr_dirty = true; return 0; } -static void kvm_riscv_vcpu_setup_config(struct kvm_vcpu *vcpu) -{ - const unsigned long *isa = vcpu->arch.isa; - struct kvm_vcpu_config *cfg = &vcpu->arch.cfg; - - if (riscv_isa_extension_available(isa, SVPBMT)) - cfg->henvcfg |= ENVCFG_PBMTE; - - if (riscv_isa_extension_available(isa, SSTC)) - cfg->henvcfg |= ENVCFG_STCE; - - if (riscv_isa_extension_available(isa, ZICBOM)) - cfg->henvcfg |= (ENVCFG_CBIE | ENVCFG_CBCFE); - - if (riscv_isa_extension_available(isa, ZICBOZ)) - cfg->henvcfg |= ENVCFG_CBZE; - - if (riscv_isa_extension_available(isa, SVADU) && - !riscv_isa_extension_available(isa, SVADE)) - cfg->henvcfg |= ENVCFG_ADUE; - - if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN)) { - cfg->hstateen0 |= SMSTATEEN0_HSENVCFG; - if (riscv_isa_extension_available(isa, SSAIA)) - cfg->hstateen0 |= 
SMSTATEEN0_AIA_IMSIC | - SMSTATEEN0_AIA | - SMSTATEEN0_AIA_ISEL; - if (riscv_isa_extension_available(isa, SMSTATEEN)) - cfg->hstateen0 |= SMSTATEEN0_SSTATEEN0; - } - - if (vcpu->guest_debug) - cfg->hedeleg &= ~BIT(EXC_BREAKPOINT); -} - void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { void *nsh; struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr; - struct kvm_vcpu_config *cfg = &vcpu->arch.cfg; /* * If VCPU is being reloaded on the same physical CPU and no @@ -601,6 +561,14 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) goto csr_restore_done; vcpu->arch.csr_dirty = false; + + /* + * Load VCPU config CSRs before other CSRs because + * the read/write behaviour of certain CSRs change + * based on VCPU config CSRs. + */ + kvm_riscv_vcpu_config_load(vcpu); + if (kvm_riscv_nacl_sync_csr_available()) { nsh = nacl_shmem(); nacl_csr_write(nsh, CSR_VSSTATUS, csr->vsstatus); @@ -610,18 +578,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) nacl_csr_write(nsh, CSR_VSEPC, csr->vsepc); nacl_csr_write(nsh, CSR_VSCAUSE, csr->vscause); nacl_csr_write(nsh, CSR_VSTVAL, csr->vstval); - nacl_csr_write(nsh, CSR_HEDELEG, cfg->hedeleg); - nacl_csr_write(nsh, CSR_HIDELEG, cfg->hideleg); nacl_csr_write(nsh, CSR_HVIP, csr->hvip); nacl_csr_write(nsh, CSR_VSATP, csr->vsatp); - nacl_csr_write(nsh, CSR_HENVCFG, cfg->henvcfg); - if (IS_ENABLED(CONFIG_32BIT)) - nacl_csr_write(nsh, CSR_HENVCFGH, cfg->henvcfg >> 32); - if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN)) { - nacl_csr_write(nsh, CSR_HSTATEEN0, cfg->hstateen0); - if (IS_ENABLED(CONFIG_32BIT)) - nacl_csr_write(nsh, CSR_HSTATEEN0H, cfg->hstateen0 >> 32); - } } else { csr_write(CSR_VSSTATUS, csr->vsstatus); csr_write(CSR_VSIE, csr->vsie); @@ -630,18 +588,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) csr_write(CSR_VSEPC, csr->vsepc); csr_write(CSR_VSCAUSE, csr->vscause); csr_write(CSR_VSTVAL, csr->vstval); - csr_write(CSR_HEDELEG, cfg->hedeleg); - csr_write(CSR_HIDELEG, cfg->hideleg); 
csr_write(CSR_HVIP, csr->hvip); csr_write(CSR_VSATP, csr->vsatp); - csr_write(CSR_HENVCFG, cfg->henvcfg); - if (IS_ENABLED(CONFIG_32BIT)) - csr_write(CSR_HENVCFGH, cfg->henvcfg >> 32); - if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN)) { - csr_write(CSR_HSTATEEN0, cfg->hstateen0); - if (IS_ENABLED(CONFIG_32BIT)) - csr_write(CSR_HSTATEEN0H, cfg->hstateen0 >> 32); - } } kvm_riscv_mmu_update_hgatp(vcpu); @@ -891,7 +839,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) struct kvm_run *run = vcpu->run; if (!vcpu->arch.ran_atleast_once) - kvm_riscv_vcpu_setup_config(vcpu); + kvm_riscv_vcpu_config_ran_once(vcpu); /* Mark this VCPU ran at least once */ vcpu->arch.ran_atleast_once = true; diff --git a/arch/riscv/kvm/vcpu_config.c b/arch/riscv/kvm/vcpu_config.c new file mode 100644 index 000000000000..238418fed2b9 --- /dev/null +++ b/arch/riscv/kvm/vcpu_config.c @@ -0,0 +1,103 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2026 Qualcomm Technologies, Inc. + */ + +#include +#include + +#define KVM_HEDELEG_DEFAULT (BIT(EXC_INST_MISALIGNED) | \ + BIT(EXC_INST_ILLEGAL) | \ + BIT(EXC_BREAKPOINT) | \ + BIT(EXC_SYSCALL) | \ + BIT(EXC_INST_PAGE_FAULT) | \ + BIT(EXC_LOAD_PAGE_FAULT) | \ + BIT(EXC_STORE_PAGE_FAULT)) + +#define KVM_HIDELEG_DEFAULT (BIT(IRQ_VS_SOFT) | \ + BIT(IRQ_VS_TIMER) | \ + BIT(IRQ_VS_EXT)) + +void kvm_riscv_vcpu_config_init(struct kvm_vcpu *vcpu) +{ + vcpu->arch.cfg.hedeleg = KVM_HEDELEG_DEFAULT; + vcpu->arch.cfg.hideleg = KVM_HIDELEG_DEFAULT; +} + +void kvm_riscv_vcpu_config_guest_debug(struct kvm_vcpu *vcpu) +{ + struct kvm_vcpu_config *cfg = &vcpu->arch.cfg; + + if (vcpu->guest_debug) + cfg->hedeleg &= ~BIT(EXC_BREAKPOINT); + else + cfg->hedeleg |= BIT(EXC_BREAKPOINT); + + vcpu->arch.csr_dirty = true; +} + +void kvm_riscv_vcpu_config_ran_once(struct kvm_vcpu *vcpu) +{ + const unsigned long *isa = vcpu->arch.isa; + struct kvm_vcpu_config *cfg = &vcpu->arch.cfg; + + if (riscv_isa_extension_available(isa, SVPBMT)) + cfg->henvcfg |= 
ENVCFG_PBMTE; + + if (riscv_isa_extension_available(isa, SSTC)) + cfg->henvcfg |= ENVCFG_STCE; + + if (riscv_isa_extension_available(isa, ZICBOM)) + cfg->henvcfg |= (ENVCFG_CBIE | ENVCFG_CBCFE); + + if (riscv_isa_extension_available(isa, ZICBOZ)) + cfg->henvcfg |= ENVCFG_CBZE; + + if (riscv_isa_extension_available(isa, SVADU) && + !riscv_isa_extension_available(isa, SVADE)) + cfg->henvcfg |= ENVCFG_ADUE; + + if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN)) { + cfg->hstateen0 |= SMSTATEEN0_HSENVCFG; + if (riscv_isa_extension_available(isa, SSAIA)) + cfg->hstateen0 |= SMSTATEEN0_AIA_IMSIC | + SMSTATEEN0_AIA | + SMSTATEEN0_AIA_ISEL; + if (riscv_isa_extension_available(isa, SMSTATEEN)) + cfg->hstateen0 |= SMSTATEEN0_SSTATEEN0; + } + + if (vcpu->guest_debug) + cfg->hedeleg &= ~BIT(EXC_BREAKPOINT); +} + +void kvm_riscv_vcpu_config_load(struct kvm_vcpu *vcpu) +{ + struct kvm_vcpu_config *cfg = &vcpu->arch.cfg; + void *nsh; + + if (kvm_riscv_nacl_sync_csr_available()) { + nsh = nacl_shmem(); + nacl_csr_write(nsh, CSR_HEDELEG, cfg->hedeleg); + nacl_csr_write(nsh, CSR_HIDELEG, cfg->hideleg); + nacl_csr_write(nsh, CSR_HENVCFG, cfg->henvcfg); + if (IS_ENABLED(CONFIG_32BIT)) + nacl_csr_write(nsh, CSR_HENVCFGH, cfg->henvcfg >> 32); + if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN)) { + nacl_csr_write(nsh, CSR_HSTATEEN0, cfg->hstateen0); + if (IS_ENABLED(CONFIG_32BIT)) + nacl_csr_write(nsh, CSR_HSTATEEN0H, cfg->hstateen0 >> 32); + } + } else { + csr_write(CSR_HEDELEG, cfg->hedeleg); + csr_write(CSR_HIDELEG, cfg->hideleg); + csr_write(CSR_HENVCFG, cfg->henvcfg); + if (IS_ENABLED(CONFIG_32BIT)) + csr_write(CSR_HENVCFGH, cfg->henvcfg >> 32); + if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN)) { + csr_write(CSR_HSTATEEN0, cfg->hstateen0); + if (IS_ENABLED(CONFIG_32BIT)) + csr_write(CSR_HSTATEEN0H, cfg->hstateen0 >> 32); + } + } +} From e2494f83f9d717a7f607cfaefb4e69e55b8e024d Mon Sep 17 00:00:00 2001 From: Anup Patel Date: Tue, 20 Jan 2026 13:29:55 +0530 
Subject: [PATCH 313/373] RISC-V: KVM: Don't check hstateen0 when updating sstateen0 CSR MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The hstateen0 will be programmed differently for guest HS-mode and guest VS/VU-mode so don't check hstateen0.SSTATEEN0 bit when updating sstateen0 CSR in kvm_riscv_vcpu_swap_in_guest_state() and kvm_riscv_vcpu_swap_in_host_state(). Signed-off-by: Anup Patel Reviewed-by: Radim Krčmář Link: https://lore.kernel.org/r/20260120080013.2153519-10-anup.patel@oss.qualcomm.com Signed-off-by: Anup Patel --- arch/riscv/kvm/vcpu.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c index 6929f7ce5948..a73690eda84b 100644 --- a/arch/riscv/kvm/vcpu.c +++ b/arch/riscv/kvm/vcpu.c @@ -721,28 +721,22 @@ static __always_inline void kvm_riscv_vcpu_swap_in_guest_state(struct kvm_vcpu * { struct kvm_vcpu_smstateen_csr *smcsr = &vcpu->arch.smstateen_csr; struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr; - struct kvm_vcpu_config *cfg = &vcpu->arch.cfg; vcpu->arch.host_scounteren = csr_swap(CSR_SCOUNTEREN, csr->scounteren); vcpu->arch.host_senvcfg = csr_swap(CSR_SENVCFG, csr->senvcfg); - if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN) && - (cfg->hstateen0 & SMSTATEEN0_SSTATEEN0)) - vcpu->arch.host_sstateen0 = csr_swap(CSR_SSTATEEN0, - smcsr->sstateen0); + if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN)) + vcpu->arch.host_sstateen0 = csr_swap(CSR_SSTATEEN0, smcsr->sstateen0); } static __always_inline void kvm_riscv_vcpu_swap_in_host_state(struct kvm_vcpu *vcpu) { struct kvm_vcpu_smstateen_csr *smcsr = &vcpu->arch.smstateen_csr; struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr; - struct kvm_vcpu_config *cfg = &vcpu->arch.cfg; csr->scounteren = csr_swap(CSR_SCOUNTEREN, vcpu->arch.host_scounteren); csr->senvcfg = csr_swap(CSR_SENVCFG, vcpu->arch.host_senvcfg); - if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN) && - 
(cfg->hstateen0 & SMSTATEEN0_SSTATEEN0)) - smcsr->sstateen0 = csr_swap(CSR_SSTATEEN0, - vcpu->arch.host_sstateen0); + if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN)) + smcsr->sstateen0 = csr_swap(CSR_SSTATEEN0, vcpu->arch.host_sstateen0); } /* From 3d4470d71fbf70576636947aba1ae51adbad5225 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Thu, 12 Mar 2026 16:48:22 -0700 Subject: [PATCH 314/373] KVM: x86: Move nested_run_pending to kvm_vcpu_arch Move the nested_run_pending field present in both svm_nested_state and nested_vmx to the common kvm_vcpu_arch. This allows common code to use it without plumbing it through per-vendor helpers. nested_run_pending remains zero-initialized, as the entire kvm_vcpu struct is, and all further accesses are done through vcpu->arch instead of svm->nested or vmx->nested. No functional change intended. Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed [sean: expand the comment in the field declaration] Link: https://patch.msgid.link/20260312234823.3120658-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 9 +++++++ arch/x86/kvm/svm/nested.c | 18 ++++++------- arch/x86/kvm/svm/svm.c | 16 ++++++------ arch/x86/kvm/svm/svm.h | 4 --- arch/x86/kvm/vmx/nested.c | 46 ++++++++++++++++----------------- arch/x86/kvm/vmx/vmx.c | 16 ++++++------ arch/x86/kvm/vmx/vmx.h | 3 --- 7 files changed, 57 insertions(+), 55 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ff07c45e3c73..19b3790e5e99 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1098,6 +1098,15 @@ struct kvm_vcpu_arch { */ bool pdptrs_from_userspace; + /* + * Set if an emulated nested VM-Enter to L2 is pending completion. KVM + * must not synthesize a VM-Exit to L1 before entering L2, as VM-Exits + * can only occur at instruction boundaries.
The only exception is + * VMX's "notify" exits, which exist in large part to break the CPU out + * of infinite ucode loops, but can corrupt vCPU state in the process! + */ + bool nested_run_pending; + #if IS_ENABLED(CONFIG_HYPERV) hpa_t hv_root_tdp; #endif diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 3ffde1ff719b..e24f5450f121 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -914,7 +914,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) * the CPU and/or KVM and should be used regardless of L1's support. */ if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || - !svm->nested.nested_run_pending) + !vcpu->arch.nested_run_pending) vmcb02->control.next_rip = vmcb12_ctrl->next_rip; svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj); @@ -926,7 +926,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) if (is_evtinj_soft(vmcb02->control.event_inj)) { svm->soft_int_injected = true; if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || - !svm->nested.nested_run_pending) + !vcpu->arch.nested_run_pending) svm->soft_int_next_rip = vmcb12_ctrl->next_rip; } @@ -1132,11 +1132,11 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) if (!npt_enabled) vmcb01->save.cr3 = kvm_read_cr3(vcpu); - svm->nested.nested_run_pending = 1; + vcpu->arch.nested_run_pending = 1; if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) || !nested_svm_merge_msrpm(vcpu)) { - svm->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; svm->nmi_l1_to_l2 = false; svm->soft_int_injected = false; @@ -1278,7 +1278,7 @@ void nested_svm_vmexit(struct vcpu_svm *svm) /* Exit Guest-Mode */ leave_guest_mode(vcpu); svm->nested.vmcb12_gpa = 0; - WARN_ON_ONCE(svm->nested.nested_run_pending); + WARN_ON_ONCE(vcpu->arch.nested_run_pending); kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); @@ -1488,7 +1488,7 @@ void svm_leave_nested(struct kvm_vcpu *vcpu) struct vcpu_svm *svm = to_svm(vcpu); if (is_guest_mode(vcpu)) { - 
svm->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; svm->nested.vmcb12_gpa = INVALID_GPA; leave_guest_mode(vcpu); @@ -1673,7 +1673,7 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu) * previously injected event, the pending exception occurred while said * event was being delivered and thus needs to be handled. */ - bool block_nested_exceptions = svm->nested.nested_run_pending; + bool block_nested_exceptions = vcpu->arch.nested_run_pending; /* * New events (not exceptions) are only recognized at instruction * boundaries. If an event needs reinjection, then KVM is handling a @@ -1848,7 +1848,7 @@ static int svm_get_nested_state(struct kvm_vcpu *vcpu, kvm_state.size += KVM_STATE_NESTED_SVM_VMCB_SIZE; kvm_state.flags |= KVM_STATE_NESTED_GUEST_MODE; - if (svm->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) kvm_state.flags |= KVM_STATE_NESTED_RUN_PENDING; } @@ -1985,7 +1985,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, svm_set_gif(svm, !!(kvm_state->flags & KVM_STATE_NESTED_GIF_SET)); - svm->nested.nested_run_pending = + vcpu->arch.nested_run_pending = !!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING); svm->nested.vmcb12_gpa = kvm_state->hdr.svm.vmcb_pa; diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 5e6bd7fca298..dbd35340e7b0 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3820,7 +3820,7 @@ static void svm_fixup_nested_rips(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - if (!is_guest_mode(vcpu) || !svm->nested.nested_run_pending) + if (!is_guest_mode(vcpu) || !vcpu->arch.nested_run_pending) return; /* @@ -3968,7 +3968,7 @@ bool svm_nmi_blocked(struct kvm_vcpu *vcpu) static int svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { struct vcpu_svm *svm = to_svm(vcpu); - if (svm->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; if (svm_nmi_blocked(vcpu)) @@ -4010,7 +4010,7 @@ static int svm_interrupt_allowed(struct 
kvm_vcpu *vcpu, bool for_injection) { struct vcpu_svm *svm = to_svm(vcpu); - if (svm->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; if (svm_interrupt_blocked(vcpu)) @@ -4222,7 +4222,7 @@ static void svm_complete_soft_interrupt(struct kvm_vcpu *vcpu, u8 vector, * the soft int and will reinject it via the standard injection flow, * and so KVM needs to grab the state from the pending nested VMRUN. */ - if (is_guest_mode(vcpu) && svm->nested.nested_run_pending) + if (is_guest_mode(vcpu) && vcpu->arch.nested_run_pending) svm_set_nested_run_soft_int_state(vcpu); /* @@ -4525,11 +4525,11 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) nested_sync_control_from_vmcb02(svm); /* Track VMRUNs that have made past consistency checking */ - if (svm->nested.nested_run_pending && + if (vcpu->arch.nested_run_pending && !svm_is_vmrun_failure(svm->vmcb->control.exit_code)) ++vcpu->stat.nested_run; - svm->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; } svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING; @@ -4898,7 +4898,7 @@ bool svm_smi_blocked(struct kvm_vcpu *vcpu) static int svm_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { struct vcpu_svm *svm = to_svm(vcpu); - if (svm->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; if (svm_smi_blocked(vcpu)) @@ -5013,7 +5013,7 @@ static int svm_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) if (ret) goto unmap_save; - svm->nested.nested_run_pending = 1; + vcpu->arch.nested_run_pending = 1; unmap_save: kvm_vcpu_unmap(vcpu, &map_save); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index c53068848628..5b287ad83b69 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -215,10 +215,6 @@ struct svm_nested_state { */ void *msrpm; - /* A VMRUN has started but has not yet been performed, so - * we cannot inject a nested vmexit yet. 
*/ - bool nested_run_pending; - /* cache for control fields of the guest */ struct vmcb_ctrl_area_cached ctl; diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 248635da6766..031075467a6d 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2273,7 +2273,7 @@ static void vmx_start_preemption_timer(struct kvm_vcpu *vcpu, static u64 nested_vmx_calc_efer(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) { - if (vmx->nested.nested_run_pending && + if (vmx->vcpu.arch.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)) return vmcs12->guest_ia32_efer; else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) @@ -2513,7 +2513,7 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0 /* * Interrupt/Exception Fields */ - if (vmx->nested.nested_run_pending) { + if (vmx->vcpu.arch.nested_run_pending) { vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, vmcs12->vm_entry_intr_info_field); vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, @@ -2621,7 +2621,7 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3); } - if (kvm_mpx_supported() && vmx->nested.nested_run_pending && + if (kvm_mpx_supported() && vmx->vcpu.arch.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)) vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs); } @@ -2718,7 +2718,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, !(evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1); } - if (vmx->nested.nested_run_pending && + if (vcpu->arch.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) { kvm_set_dr(vcpu, 7, vmcs12->guest_dr7); vmx_guest_debugctl_write(vcpu, vmcs12->guest_ia32_debugctl & @@ -2728,13 +2728,13 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, vmx_guest_debugctl_write(vcpu, vmx->nested.pre_vmenter_debugctl); } - if 
(!vmx->nested.nested_run_pending || + if (!vcpu->arch.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) vmcs_write_cet_state(vcpu, vmx->nested.pre_vmenter_s_cet, vmx->nested.pre_vmenter_ssp, vmx->nested.pre_vmenter_ssp_tbl); - if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending || + if (kvm_mpx_supported() && (!vcpu->arch.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))) vmcs_write64(GUEST_BNDCFGS, vmx->nested.pre_vmenter_bndcfgs); vmx_set_rflags(vcpu, vmcs12->guest_rflags); @@ -2747,7 +2747,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask; vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits); - if (vmx->nested.nested_run_pending && + if (vcpu->arch.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)) { vmcs_write64(GUEST_IA32_PAT, vmcs12->guest_ia32_pat); vcpu->arch.pat = vmcs12->guest_ia32_pat; @@ -3335,7 +3335,7 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu, * to bit 8 (LME) if bit 31 in the CR0 field (corresponding to * CR0.PG) is 1. 
*/ - if (to_vmx(vcpu)->nested.nested_run_pending && + if (vcpu->arch.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)) { if (CC(!kvm_valid_efer(vcpu, vmcs12->guest_ia32_efer)) || CC(ia32e != !!(vmcs12->guest_ia32_efer & EFER_LMA)) || @@ -3613,15 +3613,15 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu, kvm_service_local_tlb_flush_requests(vcpu); - if (!vmx->nested.nested_run_pending || + if (!vcpu->arch.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) vmx->nested.pre_vmenter_debugctl = vmx_guest_debugctl_read(); if (kvm_mpx_supported() && - (!vmx->nested.nested_run_pending || + (!vcpu->arch.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))) vmx->nested.pre_vmenter_bndcfgs = vmcs_read64(GUEST_BNDCFGS); - if (!vmx->nested.nested_run_pending || + if (!vcpu->arch.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) vmcs_read_cet_state(vcpu, &vmx->nested.pre_vmenter_s_cet, &vmx->nested.pre_vmenter_ssp, @@ -3830,7 +3830,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) * We're finally done with prerequisite checking, and can start with * the nested entry. 
*/ - vmx->nested.nested_run_pending = 1; + vcpu->arch.nested_run_pending = 1; vmx->nested.has_preemption_timer_deadline = false; status = nested_vmx_enter_non_root_mode(vcpu, true); if (unlikely(status != NVMX_VMENTRY_SUCCESS)) @@ -3862,12 +3862,12 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) !nested_cpu_has(vmcs12, CPU_BASED_NMI_WINDOW_EXITING) && !(nested_cpu_has(vmcs12, CPU_BASED_INTR_WINDOW_EXITING) && (vmcs12->guest_rflags & X86_EFLAGS_IF))) { - vmx->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; return kvm_emulate_halt_noskip(vcpu); } break; case GUEST_ACTIVITY_WAIT_SIPI: - vmx->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; kvm_set_mp_state(vcpu, KVM_MP_STATE_INIT_RECEIVED); break; default: @@ -3877,7 +3877,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) return 1; vmentry_failed: - vmx->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; if (status == NVMX_VMENTRY_KVM_INTERNAL_ERROR) return 0; if (status == NVMX_VMENTRY_VMEXIT) @@ -4274,7 +4274,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu) * previously injected event, the pending exception occurred while said * event was being delivered and thus needs to be handled. */ - bool block_nested_exceptions = vmx->nested.nested_run_pending; + bool block_nested_exceptions = vcpu->arch.nested_run_pending; /* * Events that don't require injection, i.e. 
that are virtualized by * hardware, aren't blocked by a pending VM-Enter as KVM doesn't need @@ -4643,7 +4643,7 @@ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) if (nested_cpu_has_preemption_timer(vmcs12) && vmcs12->vm_exit_controls & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER && - !vmx->nested.nested_run_pending) + !vcpu->arch.nested_run_pending) vmcs12->vmx_preemption_timer_value = vmx_get_preemption_timer_value(vcpu); @@ -5042,7 +5042,7 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason, vmx->nested.mtf_pending = false; /* trying to cancel vmlaunch/vmresume is a bug */ - WARN_ON_ONCE(vmx->nested.nested_run_pending); + WARN_ON_ONCE(vcpu->arch.nested_run_pending); #ifdef CONFIG_KVM_HYPERV if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) { @@ -6665,7 +6665,7 @@ bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu) unsigned long exit_qual; u32 exit_intr_info; - WARN_ON_ONCE(vmx->nested.nested_run_pending); + WARN_ON_ONCE(vcpu->arch.nested_run_pending); /* * Late nested VM-Fail shares the same flow as nested VM-Exit since KVM @@ -6761,7 +6761,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu, if (is_guest_mode(vcpu)) { kvm_state.flags |= KVM_STATE_NESTED_GUEST_MODE; - if (vmx->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) kvm_state.flags |= KVM_STATE_NESTED_RUN_PENDING; if (vmx->nested.mtf_pending) @@ -6836,7 +6836,7 @@ out: void vmx_leave_nested(struct kvm_vcpu *vcpu) { if (is_guest_mode(vcpu)) { - to_vmx(vcpu)->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; nested_vmx_vmexit(vcpu, -1, 0, 0); } free_nested(vcpu); @@ -6973,7 +6973,7 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu, if (!(kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)) return 0; - vmx->nested.nested_run_pending = + vcpu->arch.nested_run_pending = !!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING); vmx->nested.mtf_pending = @@ -7025,7 +7025,7 @@ static int vmx_set_nested_state(struct 
kvm_vcpu *vcpu, return 0; error_guest_mode: - vmx->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; return ret; } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 967b58a8ab9d..9ef3fb04403d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5279,7 +5279,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu) int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { - if (to_vmx(vcpu)->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; /* An NMI must not be injected into L2 if it's supposed to VM-Exit. */ @@ -5306,7 +5306,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu) int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection) { - if (to_vmx(vcpu)->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; /* @@ -6118,7 +6118,7 @@ static bool vmx_unhandleable_emulation_required(struct kvm_vcpu *vcpu) * only reachable if userspace modifies L2 guest state after KVM has * performed the nested VM-Enter consistency checks. */ - if (vmx->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return true; /* @@ -6802,7 +6802,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) * invalid guest state should never happen as that means KVM knowingly * allowed a nested VM-Enter with an invalid vmcs12. More below. */ - if (KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm)) + if (KVM_BUG_ON(vcpu->arch.nested_run_pending, vcpu->kvm)) return -EIO; if (is_guest_mode(vcpu)) { @@ -7730,11 +7730,11 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) * Track VMLAUNCH/VMRESUME that have made past guest state * checking. 
*/ - if (vmx->nested.nested_run_pending && + if (vcpu->arch.nested_run_pending && !vmx_get_exit_reason(vcpu).failed_vmentry) ++vcpu->stat.nested_run; - vmx->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; } if (unlikely(vmx->fail)) @@ -8491,7 +8491,7 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu) int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { /* we need a nested vmexit to enter SMM, postpone if run is pending */ - if (to_vmx(vcpu)->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; return !is_smm(vcpu); } @@ -8532,7 +8532,7 @@ int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) if (ret) return ret; - vmx->nested.nested_run_pending = 1; + vcpu->arch.nested_run_pending = 1; vmx->nested.smm.guest_mode = false; } return 0; diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 70bfe81dea54..db84e8001da5 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -138,9 +138,6 @@ struct nested_vmx { */ bool enlightened_vmcs_enabled; - /* L2 must run next, and mustn't decide to exit to L1. */ - bool nested_run_pending; - /* Pending MTF VM-exit into L1. */ bool mtf_pending; From 7212094baef5acabef1969d77781a6527c09d743 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 12 Mar 2026 16:48:23 -0700 Subject: [PATCH 315/373] KVM: x86: Suppress WARNs on nested_run_pending after userspace exit To end an ongoing game of whack-a-mole between KVM and syzkaller, WARN on illegally cancelling a pending nested VM-Enter if and only if userspace has NOT gained control of the vCPU since the nested run was initiated. As proven time and time again by syzkaller, userspace can clobber vCPU state so as to force a VM-Exit that violates KVM's architectural modelling of VMRUN/VMLAUNCH/VMRESUME. 
To detect that userspace has gained control, while minimizing the risk of operating on stale data, convert nested_run_pending from a pure boolean to a tri-state of sorts, where '0' is still "not pending", '1' is "pending", and '2' is "pending but untrusted". Then on KVM_RUN, if the flag is in the "trusted pending" state, move it to "untrusted pending". Note, moving the state to "untrusted" even if KVM_RUN is ultimately rejected is a-ok, because for the "untrusted" state to matter, KVM must get past kvm_x86_vcpu_pre_run() at some point for the vCPU. Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260312234823.3120658-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 8 +++++++- arch/x86/kvm/svm/nested.c | 11 +++++++---- arch/x86/kvm/svm/svm.c | 2 +- arch/x86/kvm/vmx/nested.c | 12 +++++++----- arch/x86/kvm/vmx/vmx.c | 2 +- arch/x86/kvm/x86.c | 7 +++++++ arch/x86/kvm/x86.h | 10 ++++++++++ 7 files changed, 40 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 19b3790e5e99..c54c969c88ee 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1104,8 +1104,14 @@ struct kvm_vcpu_arch { * can only occur at instruction boundaries. The only exception is * VMX's "notify" exits, which exist in large part to break the CPU out * of infinite ucode loops, but can corrupt vCPU state in the process! + * + * For all intents and purposes, this is a boolean, but it's tracked as + * a u8 so that KVM can detect when userspace may have stuffed vCPU + * state and generated an architecturally-impossible VM-Exit. 
*/ - bool nested_run_pending; +#define KVM_NESTED_RUN_PENDING 1 +#define KVM_NESTED_RUN_PENDING_UNTRUSTED 2 + u8 nested_run_pending; #if IS_ENABLED(CONFIG_HYPERV) hpa_t hv_root_tdp; diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index e24f5450f121..88e878160229 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1132,7 +1132,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) if (!npt_enabled) vmcb01->save.cr3 = kvm_read_cr3(vcpu); - vcpu->arch.nested_run_pending = 1; + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING; if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) || !nested_svm_merge_msrpm(vcpu)) { @@ -1278,7 +1278,8 @@ void nested_svm_vmexit(struct vcpu_svm *svm) /* Exit Guest-Mode */ leave_guest_mode(vcpu); svm->nested.vmcb12_gpa = 0; - WARN_ON_ONCE(vcpu->arch.nested_run_pending); + + kvm_warn_on_nested_run_pending(vcpu); kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); @@ -1985,8 +1986,10 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, svm_set_gif(svm, !!(kvm_state->flags & KVM_STATE_NESTED_GIF_SET)); - vcpu->arch.nested_run_pending = - !!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING); + if (kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING) + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING_UNTRUSTED; + else + vcpu->arch.nested_run_pending = 0; svm->nested.vmcb12_gpa = kvm_state->hdr.svm.vmcb_pa; diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index dbd35340e7b0..f4b0aeba948f 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -5013,7 +5013,7 @@ static int svm_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) if (ret) goto unmap_save; - vcpu->arch.nested_run_pending = 1; + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING; unmap_save: kvm_vcpu_unmap(vcpu, &map_save); diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 031075467a6d..48d2991886cb 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ 
-3830,7 +3830,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) * We're finally done with prerequisite checking, and can start with * the nested entry. */ - vcpu->arch.nested_run_pending = 1; + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING; vmx->nested.has_preemption_timer_deadline = false; status = nested_vmx_enter_non_root_mode(vcpu, true); if (unlikely(status != NVMX_VMENTRY_SUCCESS)) @@ -5042,7 +5042,7 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason, vmx->nested.mtf_pending = false; /* trying to cancel vmlaunch/vmresume is a bug */ - WARN_ON_ONCE(vcpu->arch.nested_run_pending); + kvm_warn_on_nested_run_pending(vcpu); #ifdef CONFIG_KVM_HYPERV if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) { @@ -6665,7 +6665,7 @@ bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu) unsigned long exit_qual; u32 exit_intr_info; - WARN_ON_ONCE(vcpu->arch.nested_run_pending); + kvm_warn_on_nested_run_pending(vcpu); /* * Late nested VM-Fail shares the same flow as nested VM-Exit since KVM @@ -6973,8 +6973,10 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu, if (!(kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)) return 0; - vcpu->arch.nested_run_pending = - !!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING); + if (kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING) + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING_UNTRUSTED; + else + vcpu->arch.nested_run_pending = 0; vmx->nested.mtf_pending = !!(kvm_state->flags & KVM_STATE_NESTED_MTF_PENDING); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9ef3fb04403d..d75f6b22d74c 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -8532,7 +8532,7 @@ int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) if (ret) return ret; - vcpu->arch.nested_run_pending = 1; + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING; vmx->nested.smm.guest_mode = false; } return 0; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 
64da02d1ee00..aa29f90c6e96 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -11913,6 +11913,13 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) static int kvm_x86_vcpu_pre_run(struct kvm_vcpu *vcpu) { + /* + * Userspace may have modified vCPU state, mark nested_run_pending as + * "untrusted" to avoid triggering false-positive WARNs. + */ + if (vcpu->arch.nested_run_pending == KVM_NESTED_RUN_PENDING) + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING_UNTRUSTED; + /* * SIPI_RECEIVED is obsolete; KVM leaves the vCPU in Wait-For-SIPI and * tracks the pending SIPI separately. SIPI_RECEIVED is still accepted diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 94d4f07aaaa0..9fe3a53fd8be 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -188,6 +188,16 @@ static inline bool kvm_can_set_cpuid_and_feature_msrs(struct kvm_vcpu *vcpu) return vcpu->arch.last_vmentry_cpu == -1 && !is_guest_mode(vcpu); } +/* + * WARN if a nested VM-Enter is pending completion, and userspace hasn't gained + * control since the nested VM-Enter was initiated (in which case, userspace + * may have modified vCPU state to induce an architecturally invalid VM-Exit). + */ +static inline void kvm_warn_on_nested_run_pending(struct kvm_vcpu *vcpu) +{ + WARN_ON_ONCE(vcpu->arch.nested_run_pending == KVM_NESTED_RUN_PENDING); +} + static inline void kvm_set_mp_state(struct kvm_vcpu *vcpu, int mp_state) { vcpu->arch.mp_state = mp_state; From 8acffeef5ef720c35e513e322ab08e32683f32f2 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 12 Mar 2026 17:32:58 -0700 Subject: [PATCH 316/373] KVM: SEV: Drop WARN on large size for KVM_MEMORY_ENCRYPT_REG_REGION Drop the WARN in sev_pin_memory() on npages overflowing an int, as the WARN is comically trivial to trigger from userspace, e.g.
by doing: struct kvm_enc_region range = { .addr = 0, .size = -1ul, }; __vm_ioctl(vm, KVM_MEMORY_ENCRYPT_REG_REGION, &range); Note, the checks in sev_mem_enc_register_region() that presumably exist to verify the incoming address+size are completely worthless, as both "addr" and "size" are u64s and SEV is 64-bit only, i.e. they _can't_ be greater than ULONG_MAX. That wart will be cleaned up in the near future. if (range->addr > ULONG_MAX || range->size > ULONG_MAX) return -EINVAL; Opportunistically add a comment to explain why the code calculates the number of pages the "hard" way, e.g. instead of just shifting @ulen. Fixes: 78824fabc72e ("KVM: SVM: fix svn_pin_memory()'s use of get_user_pages_fast()") Cc: stable@vger.kernel.org Reviewed-by: Liam Merwick Tested-by: Liam Merwick Link: https://patch.msgid.link/20260313003302.3136111-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 77ebc166abfd..2c216726718d 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -690,10 +690,16 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, if (ulen == 0 || uaddr + ulen < uaddr) return ERR_PTR(-EINVAL); - /* Calculate number of pages. */ + /* + * Calculate the number of pages that need to be pinned to cover the + * entire range. Note! This isn't simply ulen >> PAGE_SHIFT, as KVM + * doesn't require the incoming address+size to be page aligned! 
+ */ first = (uaddr & PAGE_MASK) >> PAGE_SHIFT; last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT; npages = (last - first + 1); + if (npages > INT_MAX) + return ERR_PTR(-EINVAL); locked = sev->pages_locked + npages; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; @@ -702,9 +708,6 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, return ERR_PTR(-ENOMEM); } - if (WARN_ON_ONCE(npages > INT_MAX)) - return ERR_PTR(-EINVAL); - /* Avoid using vmalloc for smaller buffers. */ size = npages * sizeof(struct page *); if (size > PAGE_SIZE) From 12a8ff869ddc284f95fe111ababab166b05e1c57 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 12 Mar 2026 17:32:59 -0700 Subject: [PATCH 317/373] KVM: SEV: Drop useless sanity checks in sev_mem_enc_register_region() Drop sev_mem_enc_register_region()'s sanity checks on the incoming address and size, as SEV is 64-bit only, making ULONG_MAX a 64-bit, all-ones value, and thus making it impossible for kvm_enc_region.{addr,size} to be greater than ULONG_MAX. Note, sev_pin_memory() verifies the incoming address is non-NULL (which isn't strictly required, but whatever), and that addr+size don't wrap to zero (which _is_ needed and what really needs to be guarded against). Note #2, pin_user_pages_fast() guards against the end address walking into kernel address space, so lack of an access_ok() check is also safe (maybe not ideal, but safe). No functional change intended (the generated code is literally the same, i.e. the compiler was smart enough to know the checks were useless). 
Reviewed-by: Liam Merwick Tested-by: Liam Merwick Link: https://patch.msgid.link/20260313003302.3136111-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 2c216726718d..aa4499a235f2 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2711,9 +2711,6 @@ int sev_mem_enc_register_region(struct kvm *kvm, if (is_mirroring_enc_context(kvm)) return -EINVAL; - if (range->addr > ULONG_MAX || range->size > ULONG_MAX) - return -EINVAL; - region = kzalloc_obj(*region, GFP_KERNEL_ACCOUNT); if (!region) return -ENOMEM; From 6d71f9349d9bf09bf82309bdc46704b4b6f6b314 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 12 Mar 2026 17:33:00 -0700 Subject: [PATCH 318/373] KVM: SEV: Disallow pinning more pages than exist in the system Explicitly disallow pinning more pages for an SEV VM than exist in the system to defend against absurd userspace requests without relying on somewhat arbitrary kernel functionality to prevent truly stupid KVM behavior. E.g. even with the INT_MAX check, userspace can request that KVM pin nearly 8TiB of memory, regardless of how much RAM exists in the system. Opportunistically rename "locked" to a more descriptive "total_npages". 
Reviewed-by: Liam Merwick Tested-by: Liam Merwick Link: https://patch.msgid.link/20260313003302.3136111-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index aa4499a235f2..f37e23496b64 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -680,7 +680,7 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, struct kvm_sev_info *sev = to_kvm_sev_info(kvm); unsigned long npages, size; int npinned; - unsigned long locked, lock_limit; + unsigned long total_npages, lock_limit; struct page **pages; unsigned long first, last; int ret; @@ -701,10 +701,14 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, if (npages > INT_MAX) return ERR_PTR(-EINVAL); - locked = sev->pages_locked + npages; + total_npages = sev->pages_locked + npages; + if (total_npages > totalram_pages()) + return ERR_PTR(-EINVAL); + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) { - pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit); + if (total_npages > lock_limit && !capable(CAP_IPC_LOCK)) { + pr_err("SEV: %lu total pages would exceed the lock limit of %lu.\n", + total_npages, lock_limit); return ERR_PTR(-ENOMEM); } @@ -727,7 +731,7 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, } *n = npages; - sev->pages_locked = locked; + sev->pages_locked = total_npages; return pages; From 7ad02ff1e4a4d1a06483ec839cff26ea232db70f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 12 Mar 2026 17:33:01 -0700 Subject: [PATCH 319/373] KVM: SEV: Use PFN_DOWN() to simplify "number of pages" math when pinning memory Use PFN_DOWN() instead of open coded equivalents in sev_pin_memory() to simplify the code and make it easier to read. 
No functional change intended (verified before and after versions of the generated code are identical). Reviewed-by: Liam Merwick Tested-by: Liam Merwick Link: https://patch.msgid.link/20260313003302.3136111-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index f37e23496b64..15ac2b907260 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -682,7 +682,6 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, int npinned; unsigned long total_npages, lock_limit; struct page **pages; - unsigned long first, last; int ret; lockdep_assert_held(&kvm->lock); @@ -692,12 +691,10 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, /* * Calculate the number of pages that need to be pinned to cover the - * entire range. Note! This isn't simply ulen >> PAGE_SHIFT, as KVM + * entire range. Note! This isn't simply PFN_DOWN(ulen), as KVM * doesn't require the incoming address+size to be page aligned! */ - first = (uaddr & PAGE_MASK) >> PAGE_SHIFT; - last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT; - npages = (last - first + 1); + npages = PFN_DOWN(uaddr + ulen - 1) - PFN_DOWN(uaddr) + 1; if (npages > INT_MAX) return ERR_PTR(-EINVAL); From a7f53694d591675fba26ef24b9ac3c2748e5499b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 12 Mar 2026 17:33:02 -0700 Subject: [PATCH 320/373] KVM: SEV: Use kvzalloc_objs() when pinning userpages Use kvzalloc_objs() instead of sev_pin_memory()'s open coded (rough) equivalent to harden the code. Note! This sanity check in __kvmalloc_node_noprof() /* Don't even allow crazy sizes */ if (unlikely(size > INT_MAX)) { WARN_ON_ONCE(!(flags & __GFP_NOWARN)); return NULL; } will artificially limit the maximum size of any single pinned region to just under 1TiB.
While there do appear to be providers that support SEV VMs with more than 1TiB of _total_ memory, it's unlikely any KVM-based providers pin 1TiB in a single request. Allocate with NOWARN so that fuzzers can't trip the WARN_ON_ONCE() when they inevitably run on systems with copious amounts of RAM, i.e. when they can get by KVM's "total_npages > totalram_pages()" restriction. Note #2, KVM's usage of vmalloc()+kmalloc() instead of kvmalloc() predates commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls") by 4+ years (see commit 89c505809052 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_UPDATE_DATA command"). I.e. the open coded behavior wasn't intended to avoid the aforementioned sanity check. The implementation appears to be pure oversight at the time the code was written, as it showed up in v3[1] of the early RFCs, whereas as v2[2] simply used kmalloc(). Cc: Liam Merwick Link: https://lore.kernel.org/all/20170724200303.12197-17-brijesh.singh@amd.com [1] Link: https://lore.kernel.org/all/148846786714.2349.17724971671841396908.stgit__25299.4950431914$1488470940$gmane$org@brijesh-build-machine [2] Reviewed-by: Liam Merwick Tested-by: Liam Merwick Link: https://patch.msgid.link/20260313003302.3136111-6-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 15ac2b907260..6a8de8ad880d 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -678,11 +678,9 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, unsigned int flags) { struct kvm_sev_info *sev = to_kvm_sev_info(kvm); - unsigned long npages, size; - int npinned; - unsigned long total_npages, lock_limit; + unsigned long npages, total_npages, lock_limit; struct page **pages; - int ret; + int npinned, ret; lockdep_assert_held(&kvm->lock); @@ -709,13 +707,13 @@ static struct page **sev_pin_memory(struct kvm 
*kvm, unsigned long uaddr, return ERR_PTR(-ENOMEM); } - /* Avoid using vmalloc for smaller buffers. */ - size = npages * sizeof(struct page *); - if (size > PAGE_SIZE) - pages = __vmalloc(size, GFP_KERNEL_ACCOUNT); - else - pages = kmalloc(size, GFP_KERNEL_ACCOUNT); - + /* + * Don't WARN if the kernel (rightly) thinks the total size is absurd, + * i.e. rely on the kernel to reject outrageous range sizes. The above + * check on the number of pages is purely to avoid truncation as + * pin_user_pages_fast() takes the number of pages as a 32-bit int. + */ + pages = kvzalloc_objs(*pages, npages, GFP_KERNEL_ACCOUNT | __GFP_NOWARN); if (!pages) return ERR_PTR(-ENOMEM); From 25a642b6abc98bbbabbf2baef9fc498bbea6aee6 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:09 -0700 Subject: [PATCH 321/373] KVM: selftests: Remove duplicate LAUNCH_UPDATE_VMSA call in SEV-ES migrate test Drop the explicit KVM_SEV_LAUNCH_UPDATE_VMSA call when creating an SEV-ES VM in the SEV migration test, as sev_vm_create() automatically updates the VMSA pages for SEV-ES guests. The only reason the duplicate call doesn't cause visible problems is because the test doesn't actually try to run the vCPUs. That will change when KVM adds a check to prevent userspace from re-launching a VMSA (which corrupts the VMSA page due to KVM writing encrypted private memory). 
Fixes: 69f8e15ab61f ("KVM: selftests: Use the SEV library APIs in the intra-host migration test") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260310234829.2608037-2-seanjc@google.com Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/x86/sev_migrate_tests.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/tools/testing/selftests/kvm/x86/sev_migrate_tests.c b/tools/testing/selftests/kvm/x86/sev_migrate_tests.c index 0a6dfba3905b..6b0928e69051 100644 --- a/tools/testing/selftests/kvm/x86/sev_migrate_tests.c +++ b/tools/testing/selftests/kvm/x86/sev_migrate_tests.c @@ -36,8 +36,6 @@ static struct kvm_vm *sev_vm_create(bool es) sev_vm_launch(vm, es ? SEV_POLICY_ES : 0); - if (es) - vm_sev_ioctl(vm, KVM_SEV_LAUNCH_UPDATE_VMSA, NULL); return vm; } From 9b9f7962e3e879d12da2bf47e02a24ec51690e3d Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:10 -0700 Subject: [PATCH 322/373] KVM: SEV: Reject attempts to sync VMSA of an already-launched/encrypted vCPU Reject synchronizing vCPU state to its associated VMSA if the vCPU has already been launched, i.e. if the VMSA has already been encrypted. On a host with SNP enabled, accessing guest-private memory generates an RMP #PF and panics the host. BUG: unable to handle page fault for address: ff1276cbfdf36000 #PF: supervisor write access in kernel mode #PF: error_code(0x80000003) - RMP violation PGD 5a31801067 P4D 5a31802067 PUD 40ccfb5063 PMD 40e5954063 PTE 80000040fdf36163 SEV-SNP: PFN 0x40fdf36, RMP entry: [0x6010fffffffff001 - 0x000000000000001f] Oops: Oops: 0003 [#1] SMP NOPTI CPU: 33 UID: 0 PID: 996180 Comm: qemu-system-x86 Tainted: G OE Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: Dell Inc. 
PowerEdge R7625/0H1TJT, BIOS 1.5.8 07/21/2023 RIP: 0010:sev_es_sync_vmsa+0x54/0x4c0 [kvm_amd] Call Trace: snp_launch_update_vmsa+0x19d/0x290 [kvm_amd] snp_launch_finish+0xb6/0x380 [kvm_amd] sev_mem_enc_ioctl+0x14e/0x720 [kvm_amd] kvm_arch_vm_ioctl+0x837/0xcf0 [kvm] kvm_vm_ioctl+0x3fd/0xcc0 [kvm] __x64_sys_ioctl+0xa3/0x100 x64_sys_call+0xfe0/0x2350 do_syscall_64+0x81/0x10f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7ffff673287d Note, the KVM flaw has been present since commit ad73109ae7ec ("KVM: SVM: Provide support to launch and run an SEV-ES guest"), but has only been actively dangerous for the host since SNP support was added. With SEV-ES, KVM would "just" clobber guest state, which is totally fine from a host kernel perspective since userspace can clobber guest state any time before sev_launch_update_vmsa(). Fixes: ad27ce155566 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command") Reported-by: Jethro Beekman Closes: https://lore.kernel.org/all/d98692e2-d96b-4c36-8089-4bc1e5cc3d57@fortanix.com Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260310234829.2608037-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 6a8de8ad880d..d29783c3075a 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -884,6 +884,9 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm) u8 *d; int i; + if (vcpu->arch.guest_state_protected) + return -EINVAL; + /* Check some debug related fields before encrypting the VMSA */ if (svm->vcpu.guest_debug || (svm->vmcb->save.dr7 & ~DR7_FIXED_1)) return -EINVAL; From b6408b6cec5df76a165575777800ef2aba12b109 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:11 -0700 Subject: [PATCH 323/373] KVM: SEV: Protect *all* of sev_mem_enc_register_region() with kvm->lock Take and hold kvm->lock before checking sev_guest() in sev_mem_enc_register_region(), as 
sev_guest() isn't stable unless kvm->lock is held (or KVM can guarantee KVM_SEV_INIT{2} has completed and can't rollback state). If KVM_SEV_INIT{2} fails, KVM can end up trying to add to a not-yet-initialized sev->regions_list, e.g. triggering a #GP Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 110 UID: 0 PID: 72717 Comm: syz.15.11462 Tainted: G U W O 6.16.0-smp-DEV #1 NONE Tainted: [U]=USER, [W]=WARN, [O]=OOT_MODULE Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 12.52.0-0 10/28/2024 RIP: 0010:sev_mem_enc_register_region+0x3f0/0x4f0 ../include/linux/list.h:83 Code: <41> 80 3c 04 00 74 08 4c 89 ff e8 f1 c7 a2 00 49 39 ed 0f 84 c6 00 RSP: 0018:ffff88838647fbb8 EFLAGS: 00010256 RAX: dffffc0000000000 RBX: 1ffff92015cf1e0b RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffff888367870000 RBP: ffffc900ae78f050 R08: ffffea000d9e0007 R09: 1ffffd4001b3c000 R10: dffffc0000000000 R11: fffff94001b3c001 R12: 0000000000000000 R13: ffff8982ab0bde00 R14: ffffc900ae78f058 R15: 0000000000000000 FS: 00007f34e9dc66c0(0000) GS:ffff89ee64d33000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe180adef98 CR3: 000000047210e000 CR4: 0000000000350ef0 Call Trace: kvm_arch_vm_ioctl+0xa72/0x1240 ../arch/x86/kvm/x86.c:7371 kvm_vm_ioctl+0x649/0x990 ../virt/kvm/kvm_main.c:5363 __se_sys_ioctl+0x101/0x170 ../fs/ioctl.c:51 do_syscall_x64 ../arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x6f/0x1f0 ../arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f34e9f7e9a9 Code: <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f34e9dc6038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f34ea1a6080 RCX: 00007f34e9f7e9a9 RDX: 0000200000000280 RSI: 000000008010aebb RDI: 0000000000000007 RBP: 
00007f34ea000d69 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00007f34ea1a6080 R15: 00007ffce77197a8 with a syzlang reproducer that looks like: syz_kvm_add_vcpu$x86(0x0, &(0x7f0000000040)={0x0, &(0x7f0000000180)=ANY=[], 0x70}) (async) syz_kvm_add_vcpu$x86(0x0, &(0x7f0000000080)={0x0, &(0x7f0000000180)=ANY=[@ANYBLOB="..."], 0x4f}) (async) r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000200), 0x0, 0x0) r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0) r2 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000240), 0x0, 0x0) r3 = ioctl$KVM_CREATE_VM(r2, 0xae01, 0x0) ioctl$KVM_SET_CLOCK(r3, 0xc008aeba, &(0x7f0000000040)={0x1, 0x8, 0x0, 0x5625e9b0}) (async) ioctl$KVM_SET_PIT2(r3, 0x8010aebb, &(0x7f0000000280)={[...], 0x5}) (async) ioctl$KVM_SET_PIT2(r1, 0x4070aea0, 0x0) (async) r4 = ioctl$KVM_CREATE_VM(0xffffffffffffffff, 0xae01, 0x0) openat$kvm(0xffffffffffffff9c, 0x0, 0x0, 0x0) (async) ioctl$KVM_SET_USER_MEMORY_REGION(r4, 0x4020ae46, &(0x7f0000000400)={0x0, 0x0, 0x0, 0x2000, &(0x7f0000001000/0x2000)=nil}) (async) r5 = ioctl$KVM_CREATE_VCPU(r4, 0xae41, 0x2) close(r0) (async) openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x8000, 0x0) (async) ioctl$KVM_SET_GUEST_DEBUG(r5, 0x4048ae9b, &(0x7f0000000300)={0x4376ea830d46549b, 0x0, [0x46, 0x0, 0x0, 0x0, 0x0, 0x1000]}) (async) ioctl$KVM_RUN(r5, 0xae80, 0x0) Opportunistically use guard() to avoid having to define a new error label and goto usage. 
Fixes: 1e80fdc09d12 ("KVM: SVM: Pin guest memory when SEV is active") Cc: stable@vger.kernel.org Reported-by: Alexander Potapenko Tested-by: Alexander Potapenko Link: https://patch.msgid.link/20260310234829.2608037-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index d29783c3075a..9265ebd9aa18 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2706,6 +2706,8 @@ int sev_mem_enc_register_region(struct kvm *kvm, struct enc_region *region; int ret = 0; + guard(mutex)(&kvm->lock); + if (!sev_guest(kvm)) return -ENOTTY; @@ -2717,12 +2719,10 @@ int sev_mem_enc_register_region(struct kvm *kvm, if (!region) return -ENOMEM; - mutex_lock(&kvm->lock); region->pages = sev_pin_memory(kvm, range->addr, range->size, ®ion->npages, FOLL_WRITE | FOLL_LONGTERM); if (IS_ERR(region->pages)) { ret = PTR_ERR(region->pages); - mutex_unlock(&kvm->lock); goto e_free; } @@ -2740,8 +2740,6 @@ int sev_mem_enc_register_region(struct kvm *kvm, region->size = range->size; list_add_tail(®ion->list, &sev->regions_list); - mutex_unlock(&kvm->lock); - return ret; e_free: From 624bf3440d7214b62c22d698a0a294323f331d5d Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:12 -0700 Subject: [PATCH 324/373] KVM: SEV: Disallow LAUNCH_FINISH if vCPUs are actively being created Reject LAUNCH_FINISH for SEV-ES and SNP VMs if KVM is actively creating one or more vCPUs, as KVM needs to process and encrypt each vCPU's VMSA. Letting userspace create vCPUs while LAUNCH_FINISH is in-progress is "fine", at least in the current code base, as kvm_for_each_vcpu() operates on online_vcpus, LAUNCH_FINISH (all SEV+ sub-ioctls) holds kvm->mutex, and fully onlining a vCPU in kvm_vm_ioctl_create_vcpu() is done under kvm->mutex. I.e. 
there's no difference between an in-progress vCPU and a vCPU that is created entirely after LAUNCH_FINISH. However, given that concurrent LAUNCH_FINISH and vCPU creation can't possibly work (for any reasonable definition of "work"), since userspace can't guarantee whether a particular vCPU will be encrypted or not, disallow the combination as a hardening measure, to reduce the probability of introducing bugs in the future, and to avoid having to reason about the safety of future changes related to LAUNCH_FINISH. Cc: Jethro Beekman Closes: https://lore.kernel.org/all/b31f7c6e-2807-4662-bcdd-eea2c1e132fa@fortanix.com Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260310234829.2608037-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 10 ++++++++-- include/linux/kvm_host.h | 7 +++++++ 2 files changed, 15 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 9265ebd9aa18..10b12db7f902 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -1032,6 +1032,9 @@ static int sev_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp) if (!sev_es_guest(kvm)) return -ENOTTY; + if (kvm_is_vcpu_creation_in_progress(kvm)) + return -EBUSY; + kvm_for_each_vcpu(i, vcpu, kvm) { ret = mutex_lock_killable(&vcpu->mutex); if (ret) @@ -2052,8 +2055,8 @@ static int sev_check_source_vcpus(struct kvm *dst, struct kvm *src) struct kvm_vcpu *src_vcpu; unsigned long i; - if (src->created_vcpus != atomic_read(&src->online_vcpus) || - dst->created_vcpus != atomic_read(&dst->online_vcpus)) + if (kvm_is_vcpu_creation_in_progress(src) || + kvm_is_vcpu_creation_in_progress(dst)) return -EBUSY; if (!sev_es_guest(src)) @@ -2452,6 +2455,9 @@ static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp) unsigned long i; int ret; + if (kvm_is_vcpu_creation_in_progress(kvm)) + return -EBUSY; + data.gctx_paddr = __psp_pa(sev->snp_context); data.page_type = SNP_PAGE_TYPE_VMSA; diff --git 
a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 34759a262b28..3c7f8557f7af 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1029,6 +1029,13 @@ static inline struct kvm_vcpu *kvm_get_vcpu_by_id(struct kvm *kvm, int id) return NULL; } +static inline bool kvm_is_vcpu_creation_in_progress(struct kvm *kvm) +{ + lockdep_assert_held(&kvm->lock); + + return kvm->created_vcpus != atomic_read(&kvm->online_vcpus); +} + void kvm_destroy_vcpus(struct kvm *kvm); int kvm_trylock_all_vcpus(struct kvm *kvm); From c85aaff26d55920d783adac431a59ec738a35aef Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:24 +0000 Subject: [PATCH 325/373] KVM: SVM: Properly check RAX in the emulator for SVM instructions Architecturally, VMRUN/VMLOAD/VMSAVE should generate a #GP if the physical address in RAX is not supported. check_svme_pa() hardcodes this to checking that bits 63-48 are not set. This is incorrect on HW supporting 52 bits of physical address space. Additionally, the emulator does not check if the address is not aligned, which should also result in #GP. Use page_address_valid() which properly checks alignment and the address legality based on the guest's MAXPHYADDR. Plumb it through x86_emulate_ops, similar to is_canonical_addr(), to avoid directly accessing the vCPU object in emulator code. 
Fixes: 01de8b09e606 ("KVM: SVM: Add intercept checks for SVM instructions") Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-2-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/emulate.c | 3 +-- arch/x86/kvm/kvm_emulate.h | 2 ++ arch/x86/kvm/x86.c | 6 ++++++ 3 files changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c8e292e9a24d..202c376ff501 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3874,8 +3874,7 @@ static int check_svme_pa(struct x86_emulate_ctxt *ctxt) { u64 rax = reg_read(ctxt, VCPU_REGS_RAX); - /* Valid physical address? */ - if (rax & 0xffff000000000000ULL) + if (!ctxt->ops->page_address_valid(ctxt, rax)) return emulate_gp(ctxt, 0); return check_svme(ctxt); diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index fb3dab4b5a53..0abff36d0994 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -245,6 +245,8 @@ struct x86_emulate_ops { bool (*is_canonical_addr)(struct x86_emulate_ctxt *ctxt, gva_t addr, unsigned int flags); + + bool (*page_address_valid)(struct x86_emulate_ctxt *ctxt, gpa_t gpa); }; /* Type, address-of, and value of an instruction's operand. 
*/ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index aa29f90c6e96..2410401c57d8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8907,6 +8907,11 @@ static bool emulator_is_canonical_addr(struct x86_emulate_ctxt *ctxt, return !is_noncanonical_address(addr, emul_to_vcpu(ctxt), flags); } +static bool emulator_page_address_valid(struct x86_emulate_ctxt *ctxt, gpa_t gpa) +{ + return page_address_valid(emul_to_vcpu(ctxt), gpa); +} + static const struct x86_emulate_ops emulate_ops = { .vm_bugged = emulator_vm_bugged, .read_gpr = emulator_read_gpr, @@ -8954,6 +8959,7 @@ static const struct x86_emulate_ops emulate_ops = { .set_xcr = emulator_set_xcr, .get_untagged_addr = emulator_get_untagged_addr, .is_canonical_addr = emulator_is_canonical_addr, + .page_address_valid = emulator_page_address_valid, }; static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask) From 27f70eaa8661c031f6c5efa4d72c7c4544cc41fc Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:25 +0000 Subject: [PATCH 326/373] KVM: SVM: Refactor SVM instruction handling on #GP intercept Instead of returning an opcode from svm_instr_opcode() and then passing it to emulate_svm_instr(), which uses it to find the corresponding exit code and intercept handler, return the exit code directly from svm_instr_opcode(), and rename it to svm_get_decoded_instr_exit_code(). emulate_svm_instr() boils down to synthesizing a #VMEXIT or calling the intercept handler, so open-code it in gp_interception(), and use svm_invoke_exit_handler() to call the intercept handler based on the exit code. This allows for dropping the SVM_INSTR_* enum, and the const array mapping its values to exit codes and intercept handlers. In gp_interception(), handle SVM instructions first with an early return, and invert the is_guest_mode() checks, un-indenting the rest of the code. No functional change intended. 
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-3-yosry@kernel.org [sean: add BUILD_BUG_ON(), tweak formatting/naming] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 77 ++++++++++++++---------------------------- 1 file changed, 26 insertions(+), 51 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index f4b0aeba948f..927764894b89 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2236,54 +2236,28 @@ static int vmrun_interception(struct kvm_vcpu *vcpu) return nested_svm_vmrun(vcpu); } -enum { - NONE_SVM_INSTR, - SVM_INSTR_VMRUN, - SVM_INSTR_VMLOAD, - SVM_INSTR_VMSAVE, -}; - -/* Return NONE_SVM_INSTR if not SVM instrs, otherwise return decode result */ -static int svm_instr_opcode(struct kvm_vcpu *vcpu) +/* Return 0 if not SVM instr, otherwise return associated exit_code */ +static u64 svm_get_decoded_instr_exit_code(struct kvm_vcpu *vcpu) { struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; if (ctxt->b != 0x1 || ctxt->opcode_len != 2) - return NONE_SVM_INSTR; + return 0; + + BUILD_BUG_ON(!SVM_EXIT_VMRUN || !SVM_EXIT_VMLOAD || !SVM_EXIT_VMSAVE); switch (ctxt->modrm) { case 0xd8: /* VMRUN */ - return SVM_INSTR_VMRUN; + return SVM_EXIT_VMRUN; case 0xda: /* VMLOAD */ - return SVM_INSTR_VMLOAD; + return SVM_EXIT_VMLOAD; case 0xdb: /* VMSAVE */ - return SVM_INSTR_VMSAVE; + return SVM_EXIT_VMSAVE; default: break; } - return NONE_SVM_INSTR; -} - -static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode) -{ - const int guest_mode_exit_codes[] = { - [SVM_INSTR_VMRUN] = SVM_EXIT_VMRUN, - [SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD, - [SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE, - }; - int (*const svm_instr_handlers[])(struct kvm_vcpu *vcpu) = { - [SVM_INSTR_VMRUN] = vmrun_interception, - [SVM_INSTR_VMLOAD] = vmload_interception, - [SVM_INSTR_VMSAVE] = vmsave_interception, - }; - struct vcpu_svm *svm = to_svm(vcpu); - - if (is_guest_mode(vcpu)) { - nested_svm_simple_vmexit(svm, 
guest_mode_exit_codes[opcode]); - return 1; - } - return svm_instr_handlers[opcode](vcpu); + return 0; } /* @@ -2298,7 +2272,7 @@ static int gp_interception(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); u32 error_code = svm->vmcb->control.exit_info_1; - int opcode; + u64 svm_exit_code; /* Both #GP cases have zero error_code */ if (error_code) @@ -2308,27 +2282,28 @@ static int gp_interception(struct kvm_vcpu *vcpu) if (x86_decode_emulated_instruction(vcpu, 0, NULL, 0) != EMULATION_OK) goto reinject; - opcode = svm_instr_opcode(vcpu); - - if (opcode == NONE_SVM_INSTR) { - if (!enable_vmware_backdoor) - goto reinject; - - /* - * VMware backdoor emulation on #GP interception only handles - * IN{S}, OUT{S}, and RDPMC. - */ - if (!is_guest_mode(vcpu)) - return kvm_emulate_instruction(vcpu, - EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE); - } else { + svm_exit_code = svm_get_decoded_instr_exit_code(vcpu); + if (svm_exit_code) { /* All SVM instructions expect page aligned RAX */ if (svm->vmcb->save.rax & ~PAGE_MASK) goto reinject; - return emulate_svm_instr(vcpu, opcode); + if (!is_guest_mode(vcpu)) + return svm_invoke_exit_handler(vcpu, svm_exit_code); + + nested_svm_simple_vmexit(svm, svm_exit_code); + return 1; } + /* + * VMware backdoor emulation on #GP interception only handles + * IN{S}, OUT{S}, and RDPMC, and only for L1. + */ + if (!enable_vmware_backdoor || is_guest_mode(vcpu)) + goto reinject; + + return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE); + reinject: kvm_queue_exception_e(vcpu, GP_VECTOR, error_code); return 1; From 435741a4e766e3704af03c9ac634a73b9e75fc4c Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:26 +0000 Subject: [PATCH 327/373] KVM: SVM: Properly check RAX on #GP intercept of SVM instructions When KVM intercepts #GP on an SVM instruction, it re-injects the #GP if the instruction was executed with a misaligned RAX. 
However, a #GP should also be reinjected if RAX contains an illegal GPA; according to the APM, one of the #GP conditions is: rAX referenced a physical address above the maximum supported physical address. Replace the PAGE_MASK check with page_address_valid(), which checks both page alignment and the legality of the GPA based on the vCPU's MAXPHYADDR. Use kvm_register_read() to read RAX so that bits 63:32 are dropped when the vCPU is in 32-bit mode, i.e. to avoid a false positive when checking the validity of the address. Note that this is currently only a problem if KVM is running an L2 guest and ends up synthesizing a #VMEXIT to L1, as the RAX check takes precedence over the intercept. Otherwise, if KVM emulates the instruction, kvm_vcpu_map() should fail on illegal GPAs and inject a #GP anyway. However, following patches will change the failure behavior of kvm_vcpu_map(), so make sure the #GP interception handler does this appropriately. Opportunistically add a FIXME about the SVM instruction handling on #GP belonging in the emulator. 
Fixes: 82a11e9c6fa2 ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions") Fixes: d1cba6c92237 ("KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround") Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-4-yosry@kernel.org [sean: massage wording with respect to kvm_register_read()] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 927764894b89..f68958447e58 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2282,10 +2282,10 @@ static int gp_interception(struct kvm_vcpu *vcpu) if (x86_decode_emulated_instruction(vcpu, 0, NULL, 0) != EMULATION_OK) goto reinject; + /* FIXME: Handle SVM instructions through the emulator */ svm_exit_code = svm_get_decoded_instr_exit_code(vcpu); if (svm_exit_code) { - /* All SVM instructions expect page aligned RAX */ - if (svm->vmcb->save.rax & ~PAGE_MASK) + if (!page_address_valid(vcpu, kvm_register_read(vcpu, VCPU_REGS_RAX))) goto reinject; if (!is_guest_mode(vcpu)) From d2fbeb61e1451eba09eb3249aaf1f01d4c5c1f8b Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:27 +0000 Subject: [PATCH 328/373] KVM: SVM: Move RAX legality check to SVM insn interception handlers When #GP is intercepted by KVM, the #GP interception handler checks whether the GPA in RAX is legal and reinjects the #GP accordingly. Otherwise, it calls into the appropriate interception handler for VMRUN/VMLOAD/VMSAVE. The intercept handlers do not check RAX. However, the intercept handlers need to do the RAX check, because if the guest has a smaller MAXPHYADDR, RAX could be legal from the hardware perspective (i.e. CPU does not inject #GP), but not from the vCPU's perspective. 
Note that with allow_smaller_maxphyaddr, both NPT and VLS cannot be used, so VMLOAD/VMSAVE have to be intercepted, and RAX can always be checked against the vCPU's MAXPHYADDR. Move the check into the interception handlers for VMRUN/VMLOAD/VMSAVE as the CPU does not check RAX before the interception. Read RAX using kvm_register_read() to avoid a false negative on page_address_valid() on 32-bit due to garbage in the higher bits. Keep the check in the #GP intercept handler in the nested case where a #VMEXIT is synthesized into L1, as the RAX check is still needed there and takes precedence over the intercept. Opportunistically add a FIXME about the #VMEXIT being synthesized into L1, as it needs to be conditional. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-5-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 +++++- arch/x86/kvm/svm/svm.c | 20 ++++++++++++++++---- 2 files changed, 21 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 88e878160229..16f4bc4f48f5 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1103,7 +1103,11 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) if (WARN_ON_ONCE(!svm->nested.initialized)) return -EINVAL; - vmcb12_gpa = svm->vmcb->save.rax; + vmcb12_gpa = kvm_register_read(vcpu, VCPU_REGS_RAX); + if (!page_address_valid(vcpu, vmcb12_gpa)) { + kvm_inject_gp(vcpu, 0); + return 1; + } ret = nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa); if (ret) { diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index f68958447e58..3472916657e1 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2185,6 +2185,7 @@ static int intr_interception(struct kvm_vcpu *vcpu) static int vmload_vmsave_interception(struct kvm_vcpu *vcpu, bool vmload) { + u64 vmcb12_gpa = kvm_register_read(vcpu, VCPU_REGS_RAX); struct vcpu_svm *svm = to_svm(vcpu); struct vmcb *vmcb12; struct kvm_host_map map; @@ 
-2193,7 +2194,12 @@ static int vmload_vmsave_interception(struct kvm_vcpu *vcpu, bool vmload) if (nested_svm_check_permissions(vcpu)) return 1; - ret = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->vmcb->save.rax), &map); + if (!page_address_valid(vcpu, vmcb12_gpa)) { + kvm_inject_gp(vcpu, 0); + return 1; + } + + ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map); if (ret) { if (ret == -EINVAL) kvm_inject_gp(vcpu, 0); @@ -2285,12 +2291,18 @@ static int gp_interception(struct kvm_vcpu *vcpu) /* FIXME: Handle SVM instructions through the emulator */ svm_exit_code = svm_get_decoded_instr_exit_code(vcpu); if (svm_exit_code) { - if (!page_address_valid(vcpu, kvm_register_read(vcpu, VCPU_REGS_RAX))) - goto reinject; - if (!is_guest_mode(vcpu)) return svm_invoke_exit_handler(vcpu, svm_exit_code); + if (!page_address_valid(vcpu, kvm_register_read(vcpu, VCPU_REGS_RAX))) + goto reinject; + + /* + * FIXME: Only synthesize a #VMEXIT if L1 sets the intercept, + * but only after the VMLOAD/VMSAVE exit handlers can properly + * handle VMLOAD/VMSAVE from L2 with VLS enabled in L1 (i.e. + * RAX is an L2 GPA that needs translation through L1's NPT). + */ nested_svm_simple_vmexit(svm, svm_exit_code); return 1; } From 783cf7d01fb8788f37735c0a6c3955024189287c Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:28 +0000 Subject: [PATCH 329/373] KVM: SVM: Check EFER.SVME and CPL on #GP intercept of SVM instructions When KVM intercepts #GP on an SVM instruction from L2, it checks the legality of RAX, and injects a #GP if RAX is illegal, or otherwise synthesizes a #VMEXIT to L1. However, checking EFER.SVME and CPL takes precedence over both the RAX check and the intercept. Call nested_svm_check_permissions() first to cover both. Note that if #GP is intercepted on SVM instruction in L1, the intercept handlers of VMRUN/VMLOAD/VMSAVE already perform these checks. 
Note #2, if KVM does not intercept #GP, the check for EFER.SVME is not done in the correct order, because KVM handles it by intercepting the instructions when EFER.SVME=0 and injecting #UD. However, a #GP injected by hardware would happen before the instruction intercept, leading to #GP taking precedence over #UD from the guest's perspective. Opportunistically add a FIXME for this. Fixes: 82a11e9c6fa2 ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions") Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-6-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 3472916657e1..7d0d95f40cd2 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1054,6 +1054,11 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu) * No need to toggle any of the vgif/vls/etc. enable bits here, as they * are set when the VMCB is initialized and never cleared (if the * relevant intercepts are set, the enablements are meaningless anyway). + * + * FIXME: When #GP is not intercepted, a #GP on these instructions (e.g. + * due to CPL > 0) could be injected by hardware before the instruction + * is intercepted, leading to #GP taking precedence over #UD from the + * guest's perspective. 
*/ if (!(vcpu->arch.efer & EFER_SVME)) { svm_set_intercept(svm, INTERCEPT_VMLOAD); @@ -2294,6 +2299,9 @@ static int gp_interception(struct kvm_vcpu *vcpu) if (!is_guest_mode(vcpu)) return svm_invoke_exit_handler(vcpu, svm_exit_code); + if (nested_svm_check_permissions(vcpu)) + return 1; + if (!page_address_valid(vcpu, kvm_register_read(vcpu, VCPU_REGS_RAX))) goto reinject; From 878b8efa2adbbfffc97f68cbba243cdf18d943c0 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:29 +0000 Subject: [PATCH 330/373] KVM: SVM: Treat mapping failures equally in VMLOAD/VMSAVE emulation Currently, a #GP is only injected if kvm_vcpu_map() fails with -EINVAL. But it could also fail with -EFAULT if creating a host mapping failed. Inject a #GP in all cases, no reason to treat failure modes differently. Similar to commit 01ddcdc55e09 ("KVM: nSVM: Always inject a #GP if mapping VMCB12 fails on nested VMRUN"), treat all failures equally. Fixes: 8c5fbf1a7231 ("KVM/nSVM: Use the new mapping API for mapping guest memory") Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-7-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7d0d95f40cd2..b83d524a6e78 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2204,10 +2204,8 @@ static int vmload_vmsave_interception(struct kvm_vcpu *vcpu, bool vmload) return 1; } - ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map); - if (ret) { - if (ret == -EINVAL) - kvm_inject_gp(vcpu, 0); + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) { + kvm_inject_gp(vcpu, 0); return 1; } From 2daf71bfd77d0b7ba7b81d1a6ac872ebb338ff31 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:30 +0000 Subject: [PATCH 331/373] KVM: nSVM: Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails KVM currently injects a #GP if mapping 
vmcb12 fails when emulating VMRUN/VMLOAD/VMSAVE. This is not architectural behavior, as #GP should only be injected if the physical address is not supported or not aligned. Instead, handle it as an emulation failure, similar to how nVMX handles failures to read/write guest memory in several emulation paths. When virtual VMLOAD/VMSAVE is enabled, if vmcb12's GPA is not mapped in the NPTs a VMEXIT(#NPF) will be generated, and KVM will install an MMIO SPTE and emulate the instruction if there is no corresponding memslot. x86_emulate_insn() will return EMULATION_FAILED as VMLOAD/VMSAVE are not handled as part of the twobyte_insn cases. Even though this will also result in an emulation failure, it will only result in a straight return to userspace if KVM_CAP_EXIT_ON_EMULATION_FAILURE is set. Otherwise, it would inject #UD and only exit to userspace if not in guest mode. So the behavior is slightly different if virtual VMLOAD/VMSAVE is enabled. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Reported-by: Jim Mattson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-8-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++---- arch/x86/kvm/svm/svm.c | 6 ++---- 2 files changed, 4 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 16f4bc4f48f5..b42d95fc8499 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1111,10 +1111,8 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) ret = nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa); if (ret) { - if (ret == -EFAULT) { - kvm_inject_gp(vcpu, 0); - return 1; - } + if (ret == -EFAULT) + return kvm_handle_memory_failure(vcpu, X86EMUL_IO_NEEDED, NULL); /* Advance RIP past VMRUN as part of the nested #VMEXIT. 
*/ return kvm_skip_emulated_instruction(vcpu); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index b83d524a6e78..1e51cbb80e86 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2204,10 +2204,8 @@ static int vmload_vmsave_interception(struct kvm_vcpu *vcpu, bool vmload) return 1; } - if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) { - kvm_inject_gp(vcpu, 0); - return 1; - } + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) + return kvm_handle_memory_failure(vcpu, X86EMUL_IO_NEEDED, NULL); vmcb12 = map.hva; From 428543fbf06c498d9835d549920c2206befc1589 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:31 +0000 Subject: [PATCH 332/373] KVM: selftests: Rework svm_nested_invalid_vmcb12_gpa The test currently allegedly makes sure that VMRUN causes a #GP if the vmcb12 GPA is valid but unmappable. However, it calls run_guest() with the invalid vmcb12 GPA under test, and the #GP is produced from VMLOAD, not VMRUN. Additionally, the underlying logic just changed to match architectural behavior, and all of VMRUN/VMLOAD/VMSAVE fail emulation if vmcb12 cannot be mapped. The CPU still injects a #GP if the vmcb12 GPA exceeds maxphyaddr. Rework the test to use the KVM_ONE_VCPU_TEST[_SUITE] harness, and test all of VMRUN/VMLOAD/VMSAVE with both an invalid GPA (-1ULL) causing a #GP, and a valid but unmappable GPA causing emulation failure. Execute the instructions directly from L1 instead of run_guest() to make sure the #GP or emulation failure is produced by the right instruction. Leave the #VMEXIT with unmappable GPA test case as-is, but wrap it with a test harness as well. Opportunistically drop gp_triggered, as the test already checks that a #GP was injected through a SYNC. Also, use the first unmapped GPA instead of the maximum legal GPA, as some CPUs inject a #GP for the maximum legal GPA (likely in a reserved area). 
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-9-yosry@kernel.org Signed-off-by: Sean Christopherson --- .../kvm/x86/svm_nested_invalid_vmcb12_gpa.c | 152 +++++++++++++----- 1 file changed, 115 insertions(+), 37 deletions(-) diff --git a/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c b/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c index c6d5f712120d..569869bed20b 100644 --- a/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c +++ b/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c @@ -6,6 +6,8 @@ #include "vmx.h" #include "svm_util.h" #include "kselftest.h" +#include "kvm_test_harness.h" +#include "test_util.h" #define L2_GUEST_STACK_SIZE 64 @@ -13,86 +15,162 @@ #define SYNC_GP 101 #define SYNC_L2_STARTED 102 -u64 valid_vmcb12_gpa; -int gp_triggered; +static unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; static void guest_gp_handler(struct ex_regs *regs) { - GUEST_ASSERT(!gp_triggered); GUEST_SYNC(SYNC_GP); - gp_triggered = 1; - regs->rax = valid_vmcb12_gpa; } -static void l2_guest_code(void) +static void l2_code(void) { GUEST_SYNC(SYNC_L2_STARTED); vmcall(); } -static void l1_guest_code(struct svm_test_data *svm, u64 invalid_vmcb12_gpa) +static void l1_vmrun(struct svm_test_data *svm, u64 gpa) { - unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); - generic_svm_setup(svm, l2_guest_code, - &l2_guest_stack[L2_GUEST_STACK_SIZE]); + asm volatile ("vmrun %[gpa]" : : [gpa] "a" (gpa) : "memory"); +} - valid_vmcb12_gpa = svm->vmcb_gpa; +static void l1_vmload(struct svm_test_data *svm, u64 gpa) +{ + generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); - run_guest(svm->vmcb, invalid_vmcb12_gpa); /* #GP */ + asm volatile ("vmload %[gpa]" : : [gpa] "a" (gpa) : "memory"); +} - /* GP handler should jump here */ +static void l1_vmsave(struct svm_test_data *svm, u64 gpa) +{ + 
generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + asm volatile ("vmsave %[gpa]" : : [gpa] "a" (gpa) : "memory"); +} + +static void l1_vmexit(struct svm_test_data *svm, u64 gpa) +{ + generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + run_guest(svm->vmcb, svm->vmcb_gpa); GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL); GUEST_DONE(); } -int main(int argc, char *argv[]) +static u64 unmappable_gpa(struct kvm_vcpu *vcpu) +{ + struct userspace_mem_region *region; + u64 region_gpa_end, vm_gpa_end = 0; + int i; + + hash_for_each(vcpu->vm->regions.slot_hash, i, region, slot_node) { + region_gpa_end = region->region.guest_phys_addr + region->region.memory_size; + vm_gpa_end = max(vm_gpa_end, region_gpa_end); + } + + return vm_gpa_end; +} + +static void test_invalid_vmcb12(struct kvm_vcpu *vcpu) { - struct kvm_x86_state *state; vm_vaddr_t nested_gva = 0; - struct kvm_vcpu *vcpu; - uint32_t maxphyaddr; - u64 max_legal_gpa; - struct kvm_vm *vm; struct ucall uc; - TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); - vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code); vm_install_exception_handler(vcpu->vm, GP_VECTOR, guest_gp_handler); - - /* - * Find the max legal GPA that is not backed by a memslot (i.e. cannot - * be mapped by KVM). 
- */ - maxphyaddr = kvm_cpuid_property(vcpu->cpuid, X86_PROPERTY_MAX_PHY_ADDR); - max_legal_gpa = BIT_ULL(maxphyaddr) - PAGE_SIZE; - vcpu_alloc_svm(vm, &nested_gva); - vcpu_args_set(vcpu, 2, nested_gva, max_legal_gpa); - - /* VMRUN with max_legal_gpa, KVM injects a #GP */ + vcpu_alloc_svm(vcpu->vm, &nested_gva); + vcpu_args_set(vcpu, 2, nested_gva, -1ULL); vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); TEST_ASSERT_EQ(uc.args[1], SYNC_GP); +} + +static void test_unmappable_vmcb12(struct kvm_vcpu *vcpu) +{ + vm_vaddr_t nested_gva = 0; + + vcpu_alloc_svm(vcpu->vm, &nested_gva); + vcpu_args_set(vcpu, 2, nested_gva, unmappable_gpa(vcpu)); + vcpu_run(vcpu); + + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_INTERNAL_ERROR); + TEST_ASSERT_EQ(vcpu->run->emulation_failure.suberror, KVM_INTERNAL_ERROR_EMULATION); +} + +static void test_unmappable_vmcb12_vmexit(struct kvm_vcpu *vcpu) +{ + struct kvm_x86_state *state; + vm_vaddr_t nested_gva = 0; + struct ucall uc; /* - * Enter L2 (with a legit vmcb12 GPA), then overwrite vmcb12 GPA with - * max_legal_gpa. KVM will fail to map vmcb12 on nested VM-Exit and + * Enter L2 (with a legit vmcb12 GPA), then overwrite vmcb12 GPA with an + * unmappable GPA. KVM will fail to map vmcb12 on nested VM-Exit and * cause a shutdown. 
*/ + vcpu_alloc_svm(vcpu->vm, &nested_gva); + vcpu_args_set(vcpu, 2, nested_gva, unmappable_gpa(vcpu)); vcpu_run(vcpu); TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); TEST_ASSERT_EQ(uc.args[1], SYNC_L2_STARTED); state = vcpu_save_state(vcpu); - state->nested.hdr.svm.vmcb_pa = max_legal_gpa; + state->nested.hdr.svm.vmcb_pa = unmappable_gpa(vcpu); vcpu_load_state(vcpu, state); vcpu_run(vcpu); TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SHUTDOWN); kvm_x86_state_cleanup(state); - kvm_vm_free(vm); - return 0; +} + +KVM_ONE_VCPU_TEST_SUITE(vmcb12_gpa); + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmrun_invalid, l1_vmrun) +{ + test_invalid_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmload_invalid, l1_vmload) +{ + test_invalid_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmsave_invalid, l1_vmsave) +{ + test_invalid_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmrun_unmappable, l1_vmrun) +{ + test_unmappable_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmload_unmappable, l1_vmload) +{ + test_unmappable_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmsave_unmappable, l1_vmsave) +{ + test_unmappable_vmcb12(vcpu); +} + +/* + * Invalid vmcb12_gpa cannot be tested for #VMEXIT as KVM_SET_NESTED_STATE will + * reject it. + */ +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmexit_unmappable, l1_vmexit) +{ + test_unmappable_vmcb12_vmexit(vcpu); +} + +int main(int argc, char *argv[]) +{ + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); + + return test_harness_run(argc, argv); } From 052ca584bd7c51de0de96e684631570459d46cda Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:32 +0000 Subject: [PATCH 333/373] KVM: selftests: Drop 'invalid' from svm_nested_invalid_vmcb12_gpa's name The test checks both invalid and unmappable GPAs, so drop 'invalid' from its name.
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-10-yosry@kernel.org Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile.kvm | 2 +- ...{svm_nested_invalid_vmcb12_gpa.c => svm_nested_vmcb12_gpa.c} | 0 2 files changed, 1 insertion(+), 1 deletion(-) rename tools/testing/selftests/kvm/x86/{svm_nested_invalid_vmcb12_gpa.c => svm_nested_vmcb12_gpa.c} (100%) diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index ba87cd31872b..83792d136ac3 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -111,9 +111,9 @@ TEST_GEN_PROGS_x86 += x86/vmx_preemption_timer_test TEST_GEN_PROGS_x86 += x86/svm_vmcall_test TEST_GEN_PROGS_x86 += x86/svm_int_ctl_test TEST_GEN_PROGS_x86 += x86/svm_nested_clear_efer_svme -TEST_GEN_PROGS_x86 += x86/svm_nested_invalid_vmcb12_gpa TEST_GEN_PROGS_x86 += x86/svm_nested_shutdown_test TEST_GEN_PROGS_x86 += x86/svm_nested_soft_inject_test +TEST_GEN_PROGS_x86 += x86/svm_nested_vmcb12_gpa TEST_GEN_PROGS_x86 += x86/svm_lbr_nested_state TEST_GEN_PROGS_x86 += x86/tsc_scaling_sync TEST_GEN_PROGS_x86 += x86/sync_regs_test diff --git a/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c b/tools/testing/selftests/kvm/x86/svm_nested_vmcb12_gpa.c similarity index 100% rename from tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c rename to tools/testing/selftests/kvm/x86/svm_nested_vmcb12_gpa.c From efcac8424ba6ab75f2e16be9b0ccfdf60b13b294 Mon Sep 17 00:00:00 2001 From: Fangyu Yu Date: Fri, 3 Apr 2026 23:30:16 +0800 Subject: [PATCH 334/373] RISC-V: KVM: Support runtime configuration for per-VM's HGATP mode Introduce a per-VM architecture-specific field to support runtime configuration of the G-stage page table format: - kvm->arch.pgd_levels: the number of page table levels for the selected G-stage mode.
This field replaces the previous global variables kvm_riscv_gstage_mode and kvm_riscv_gstage_pgd_levels, enabling different virtual machines to independently select their G-stage page table format instead of being forced to share the maximum mode detected by the kernel at boot time. Signed-off-by: Fangyu Yu Reviewed-by: Andrew Jones Reviewed-by: Anup Patel Reviewed-by: Guo Ren Reviewed-by: Nutty Liu Link: https://lore.kernel.org/r/20260403153019.9916-2-fangyu.yu@linux.alibaba.com Signed-off-by: Anup Patel --- arch/riscv/include/asm/kvm_gstage.h | 37 ++++++++++++--- arch/riscv/include/asm/kvm_host.h | 1 + arch/riscv/kvm/gstage.c | 71 ++++++++++++++--------------- arch/riscv/kvm/main.c | 12 ++--- arch/riscv/kvm/mmu.c | 20 ++++---- arch/riscv/kvm/vm.c | 5 +- arch/riscv/kvm/vmid.c | 3 +- 7 files changed, 89 insertions(+), 60 deletions(-) diff --git a/arch/riscv/include/asm/kvm_gstage.h b/arch/riscv/include/asm/kvm_gstage.h index a89d1422cc84..c35874768641 100644 --- a/arch/riscv/include/asm/kvm_gstage.h +++ b/arch/riscv/include/asm/kvm_gstage.h @@ -29,16 +29,22 @@ struct kvm_gstage_mapping { #define kvm_riscv_gstage_index_bits 10 #endif -extern unsigned long kvm_riscv_gstage_mode; -extern unsigned long kvm_riscv_gstage_pgd_levels; +extern unsigned long kvm_riscv_gstage_max_pgd_levels; #define kvm_riscv_gstage_pgd_xbits 2 #define kvm_riscv_gstage_pgd_size (1UL << (HGATP_PAGE_SHIFT + kvm_riscv_gstage_pgd_xbits)) -#define kvm_riscv_gstage_gpa_bits (HGATP_PAGE_SHIFT + \ - (kvm_riscv_gstage_pgd_levels * \ - kvm_riscv_gstage_index_bits) + \ - kvm_riscv_gstage_pgd_xbits) -#define kvm_riscv_gstage_gpa_size ((gpa_t)(1ULL << kvm_riscv_gstage_gpa_bits)) + +static inline unsigned long kvm_riscv_gstage_gpa_bits(unsigned long pgd_levels) +{ + return (HGATP_PAGE_SHIFT + + pgd_levels * kvm_riscv_gstage_index_bits + + kvm_riscv_gstage_pgd_xbits); +} + +static inline gpa_t kvm_riscv_gstage_gpa_size(unsigned long pgd_levels) +{ + return BIT_ULL(kvm_riscv_gstage_gpa_bits(pgd_levels)); +}
bool kvm_riscv_gstage_get_leaf(struct kvm_gstage *gstage, gpa_t addr, pte_t **ptepp, u32 *ptep_level); @@ -73,4 +79,21 @@ void kvm_riscv_gstage_wp_range(struct kvm_gstage *gstage, gpa_t start, gpa_t end void kvm_riscv_gstage_mode_detect(void); +static inline unsigned long kvm_riscv_gstage_mode(unsigned long pgd_levels) +{ + switch (pgd_levels) { + case 2: + return HGATP_MODE_SV32X4; + case 3: + return HGATP_MODE_SV39X4; + case 4: + return HGATP_MODE_SV48X4; + case 5: + return HGATP_MODE_SV57X4; + default: + WARN_ON_ONCE(1); + return HGATP_MODE_OFF; + } +} + #endif diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h index 85e1bb5b4d7e..75b0a951c1bc 100644 --- a/arch/riscv/include/asm/kvm_host.h +++ b/arch/riscv/include/asm/kvm_host.h @@ -83,6 +83,7 @@ struct kvm_arch { /* G-stage page table */ pgd_t *pgd; phys_addr_t pgd_phys; + unsigned long pgd_levels; /* Guest Timer */ struct kvm_guest_timer timer; diff --git a/arch/riscv/kvm/gstage.c b/arch/riscv/kvm/gstage.c index ffec3e5ddcaf..bf7e54af2c7c 100644 --- a/arch/riscv/kvm/gstage.c +++ b/arch/riscv/kvm/gstage.c @@ -12,22 +12,21 @@ #include #ifdef CONFIG_64BIT -unsigned long kvm_riscv_gstage_mode __ro_after_init = HGATP_MODE_SV39X4; -unsigned long kvm_riscv_gstage_pgd_levels __ro_after_init = 3; +unsigned long kvm_riscv_gstage_max_pgd_levels __ro_after_init = 3; #else -unsigned long kvm_riscv_gstage_mode __ro_after_init = HGATP_MODE_SV32X4; -unsigned long kvm_riscv_gstage_pgd_levels __ro_after_init = 2; +unsigned long kvm_riscv_gstage_max_pgd_levels __ro_after_init = 2; #endif #define gstage_pte_leaf(__ptep) \ (pte_val(*(__ptep)) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC)) -static inline unsigned long gstage_pte_index(gpa_t addr, u32 level) +static inline unsigned long gstage_pte_index(struct kvm_gstage *gstage, + gpa_t addr, u32 level) { unsigned long mask; unsigned long shift = HGATP_PAGE_SHIFT + (kvm_riscv_gstage_index_bits * level); - if (level == (kvm_riscv_gstage_pgd_levels - 1)) + 
if (level == gstage->kvm->arch.pgd_levels - 1) mask = (PTRS_PER_PTE * (1UL << kvm_riscv_gstage_pgd_xbits)) - 1; else mask = PTRS_PER_PTE - 1; @@ -40,12 +39,13 @@ static inline unsigned long gstage_pte_page_vaddr(pte_t pte) return (unsigned long)pfn_to_virt(__page_val_to_pfn(pte_val(pte))); } -static int gstage_page_size_to_level(unsigned long page_size, u32 *out_level) +static int gstage_page_size_to_level(struct kvm_gstage *gstage, unsigned long page_size, + u32 *out_level) { u32 i; unsigned long psz = 1UL << 12; - for (i = 0; i < kvm_riscv_gstage_pgd_levels; i++) { + for (i = 0; i < gstage->kvm->arch.pgd_levels; i++) { if (page_size == (psz << (i * kvm_riscv_gstage_index_bits))) { *out_level = i; return 0; @@ -55,21 +55,23 @@ static int gstage_page_size_to_level(unsigned long page_size, u32 *out_level) return -EINVAL; } -static int gstage_level_to_page_order(u32 level, unsigned long *out_pgorder) +static int gstage_level_to_page_order(struct kvm_gstage *gstage, u32 level, + unsigned long *out_pgorder) { - if (kvm_riscv_gstage_pgd_levels < level) + if (gstage->kvm->arch.pgd_levels < level) return -EINVAL; *out_pgorder = 12 + (level * kvm_riscv_gstage_index_bits); return 0; } -static int gstage_level_to_page_size(u32 level, unsigned long *out_pgsize) +static int gstage_level_to_page_size(struct kvm_gstage *gstage, u32 level, + unsigned long *out_pgsize) { int rc; unsigned long page_order = PAGE_SHIFT; - rc = gstage_level_to_page_order(level, &page_order); + rc = gstage_level_to_page_order(gstage, level, &page_order); if (rc) return rc; @@ -81,11 +83,11 @@ bool kvm_riscv_gstage_get_leaf(struct kvm_gstage *gstage, gpa_t addr, pte_t **ptepp, u32 *ptep_level) { pte_t *ptep; - u32 current_level = kvm_riscv_gstage_pgd_levels - 1; + u32 current_level = gstage->kvm->arch.pgd_levels - 1; *ptep_level = current_level; ptep = (pte_t *)gstage->pgd; - ptep = &ptep[gstage_pte_index(addr, current_level)]; + ptep = &ptep[gstage_pte_index(gstage, addr, current_level)]; while (ptep 
&& pte_val(ptep_get(ptep))) { if (gstage_pte_leaf(ptep)) { *ptep_level = current_level; @@ -97,7 +99,7 @@ bool kvm_riscv_gstage_get_leaf(struct kvm_gstage *gstage, gpa_t addr, current_level--; *ptep_level = current_level; ptep = (pte_t *)gstage_pte_page_vaddr(ptep_get(ptep)); - ptep = &ptep[gstage_pte_index(addr, current_level)]; + ptep = &ptep[gstage_pte_index(gstage, addr, current_level)]; } else { ptep = NULL; } @@ -110,7 +112,7 @@ static void gstage_tlb_flush(struct kvm_gstage *gstage, u32 level, gpa_t addr) { unsigned long order = PAGE_SHIFT; - if (gstage_level_to_page_order(level, &order)) + if (gstage_level_to_page_order(gstage, level, &order)) return; addr &= ~(BIT(order) - 1); @@ -125,9 +127,9 @@ int kvm_riscv_gstage_set_pte(struct kvm_gstage *gstage, struct kvm_mmu_memory_cache *pcache, const struct kvm_gstage_mapping *map) { - u32 current_level = kvm_riscv_gstage_pgd_levels - 1; + u32 current_level = gstage->kvm->arch.pgd_levels - 1; pte_t *next_ptep = (pte_t *)gstage->pgd; - pte_t *ptep = &next_ptep[gstage_pte_index(map->addr, current_level)]; + pte_t *ptep = &next_ptep[gstage_pte_index(gstage, map->addr, current_level)]; if (current_level < map->level) return -EINVAL; @@ -151,7 +153,7 @@ int kvm_riscv_gstage_set_pte(struct kvm_gstage *gstage, } current_level--; - ptep = &next_ptep[gstage_pte_index(map->addr, current_level)]; + ptep = &next_ptep[gstage_pte_index(gstage, map->addr, current_level)]; } if (pte_val(*ptep) != pte_val(map->pte)) { @@ -194,7 +196,7 @@ int kvm_riscv_gstage_map_page(struct kvm_gstage *gstage, out_map->addr = gpa; out_map->level = 0; - ret = gstage_page_size_to_level(page_size, &out_map->level); + ret = gstage_page_size_to_level(gstage, page_size, &out_map->level); if (ret) return ret; @@ -286,7 +288,7 @@ int kvm_riscv_gstage_split_huge(struct kvm_gstage *gstage, struct kvm_mmu_memory_cache *pcache, gpa_t addr, u32 target_level, bool flush) { - u32 current_level = kvm_riscv_gstage_pgd_levels - 1; + u32 current_level = 
gstage->kvm->arch.pgd_levels - 1; pte_t *next_ptep = (pte_t *)gstage->pgd; unsigned long huge_pte, child_pte; unsigned long child_page_size; @@ -297,7 +299,7 @@ int kvm_riscv_gstage_split_huge(struct kvm_gstage *gstage, return -ENOMEM; while(current_level > target_level) { - ptep = (pte_t *)&next_ptep[gstage_pte_index(addr, current_level)]; + ptep = (pte_t *)&next_ptep[gstage_pte_index(gstage, addr, current_level)]; if (!pte_val(ptep_get(ptep))) break; @@ -310,7 +312,7 @@ int kvm_riscv_gstage_split_huge(struct kvm_gstage *gstage, huge_pte = pte_val(ptep_get(ptep)); - ret = gstage_level_to_page_size(current_level - 1, &child_page_size); + ret = gstage_level_to_page_size(gstage, current_level - 1, &child_page_size); if (ret) return ret; @@ -343,7 +345,7 @@ void kvm_riscv_gstage_op_pte(struct kvm_gstage *gstage, gpa_t addr, u32 next_ptep_level; unsigned long next_page_size, page_size; - ret = gstage_level_to_page_size(ptep_level, &page_size); + ret = gstage_level_to_page_size(gstage, ptep_level, &page_size); if (ret) return; @@ -355,7 +357,7 @@ void kvm_riscv_gstage_op_pte(struct kvm_gstage *gstage, gpa_t addr, if (ptep_level && !gstage_pte_leaf(ptep)) { next_ptep = (pte_t *)gstage_pte_page_vaddr(ptep_get(ptep)); next_ptep_level = ptep_level - 1; - ret = gstage_level_to_page_size(next_ptep_level, &next_page_size); + ret = gstage_level_to_page_size(gstage, next_ptep_level, &next_page_size); if (ret) return; @@ -389,7 +391,7 @@ void kvm_riscv_gstage_unmap_range(struct kvm_gstage *gstage, while (addr < end) { found_leaf = kvm_riscv_gstage_get_leaf(gstage, addr, &ptep, &ptep_level); - ret = gstage_level_to_page_size(ptep_level, &page_size); + ret = gstage_level_to_page_size(gstage, ptep_level, &page_size); if (ret) break; @@ -423,7 +425,7 @@ void kvm_riscv_gstage_wp_range(struct kvm_gstage *gstage, gpa_t start, gpa_t end while (addr < end) { found_leaf = kvm_riscv_gstage_get_leaf(gstage, addr, &ptep, &ptep_level); - ret = gstage_level_to_page_size(ptep_level, &page_size); 
+ ret = gstage_level_to_page_size(gstage, ptep_level, &page_size); if (ret) break; @@ -444,39 +446,34 @@ void __init kvm_riscv_gstage_mode_detect(void) /* Try Sv57x4 G-stage mode */ csr_write(CSR_HGATP, HGATP_MODE_SV57X4 << HGATP_MODE_SHIFT); if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV57X4) { - kvm_riscv_gstage_mode = HGATP_MODE_SV57X4; - kvm_riscv_gstage_pgd_levels = 5; + kvm_riscv_gstage_max_pgd_levels = 5; goto done; } /* Try Sv48x4 G-stage mode */ csr_write(CSR_HGATP, HGATP_MODE_SV48X4 << HGATP_MODE_SHIFT); if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV48X4) { - kvm_riscv_gstage_mode = HGATP_MODE_SV48X4; - kvm_riscv_gstage_pgd_levels = 4; + kvm_riscv_gstage_max_pgd_levels = 4; goto done; } /* Try Sv39x4 G-stage mode */ csr_write(CSR_HGATP, HGATP_MODE_SV39X4 << HGATP_MODE_SHIFT); if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV39X4) { - kvm_riscv_gstage_mode = HGATP_MODE_SV39X4; - kvm_riscv_gstage_pgd_levels = 3; + kvm_riscv_gstage_max_pgd_levels = 3; goto done; } #else /* CONFIG_32BIT */ /* Try Sv32x4 G-stage mode */ csr_write(CSR_HGATP, HGATP_MODE_SV32X4 << HGATP_MODE_SHIFT); if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV32X4) { - kvm_riscv_gstage_mode = HGATP_MODE_SV32X4; - kvm_riscv_gstage_pgd_levels = 2; + kvm_riscv_gstage_max_pgd_levels = 2; goto done; } #endif /* KVM depends on !HGATP_MODE_OFF */ - kvm_riscv_gstage_mode = HGATP_MODE_OFF; - kvm_riscv_gstage_pgd_levels = 0; + kvm_riscv_gstage_max_pgd_levels = 0; done: csr_write(CSR_HGATP, 0); diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c index 5399c3b4071d..cb8a65273c1f 100644 --- a/arch/riscv/kvm/main.c +++ b/arch/riscv/kvm/main.c @@ -105,17 +105,17 @@ static int __init riscv_kvm_init(void) return rc; kvm_riscv_gstage_mode_detect(); - switch (kvm_riscv_gstage_mode) { - case HGATP_MODE_SV32X4: + switch (kvm_riscv_gstage_max_pgd_levels) { + case 2: str = "Sv32x4"; break; - case HGATP_MODE_SV39X4: + case 3: str = "Sv39x4"; break; - 
case HGATP_MODE_SV48X4: + case 4: str = "Sv48x4"; break; - case HGATP_MODE_SV57X4: + case 5: str = "Sv57x4"; break; default: @@ -164,7 +164,7 @@ static int __init riscv_kvm_init(void) (rc) ? slist : "no features"); } - kvm_info("using %s G-stage page table format\n", str); + kvm_info("highest G-stage page table mode is %s\n", str); kvm_info("VMID %ld bits available\n", kvm_riscv_gstage_vmid_bits()); diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c index 088d33ba90ed..fbcdd75cb9af 100644 --- a/arch/riscv/kvm/mmu.c +++ b/arch/riscv/kvm/mmu.c @@ -67,7 +67,7 @@ int kvm_riscv_mmu_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa, if (!writable) map.pte = pte_wrprotect(map.pte); - ret = kvm_mmu_topup_memory_cache(&pcache, kvm_riscv_gstage_pgd_levels); + ret = kvm_mmu_topup_memory_cache(&pcache, kvm->arch.pgd_levels); if (ret) goto out; @@ -186,7 +186,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, * space addressable by the KVM guest GPA space. */ if ((new->base_gfn + new->npages) >= - (kvm_riscv_gstage_gpa_size >> PAGE_SHIFT)) + kvm_riscv_gstage_gpa_size(kvm->arch.pgd_levels) >> PAGE_SHIFT) return -EFAULT; hva = new->userspace_addr; @@ -472,7 +472,7 @@ int kvm_riscv_mmu_map(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot, memset(out_map, 0, sizeof(*out_map)); /* We need minimum second+third level pages */ - ret = kvm_mmu_topup_memory_cache(pcache, kvm_riscv_gstage_pgd_levels); + ret = kvm_mmu_topup_memory_cache(pcache, kvm->arch.pgd_levels); if (ret) { kvm_err("Failed to topup G-stage cache\n"); return ret; @@ -575,6 +575,7 @@ int kvm_riscv_mmu_alloc_pgd(struct kvm *kvm) return -ENOMEM; kvm->arch.pgd = page_to_virt(pgd_page); kvm->arch.pgd_phys = page_to_phys(pgd_page); + kvm->arch.pgd_levels = kvm_riscv_gstage_max_pgd_levels; return 0; } @@ -590,10 +591,12 @@ void kvm_riscv_mmu_free_pgd(struct kvm *kvm) gstage.flags = 0; gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid); gstage.pgd = kvm->arch.pgd; - kvm_riscv_gstage_unmap_range(&gstage, 0UL, 
kvm_riscv_gstage_gpa_size, false); + kvm_riscv_gstage_unmap_range(&gstage, 0UL, + kvm_riscv_gstage_gpa_size(kvm->arch.pgd_levels), false); pgd = READ_ONCE(kvm->arch.pgd); kvm->arch.pgd = NULL; kvm->arch.pgd_phys = 0; + kvm->arch.pgd_levels = 0; } spin_unlock(&kvm->mmu_lock); @@ -603,11 +606,12 @@ void kvm_riscv_mmu_free_pgd(struct kvm *kvm) void kvm_riscv_mmu_update_hgatp(struct kvm_vcpu *vcpu) { - unsigned long hgatp = kvm_riscv_gstage_mode << HGATP_MODE_SHIFT; - struct kvm_arch *k = &vcpu->kvm->arch; + struct kvm_arch *ka = &vcpu->kvm->arch; + unsigned long hgatp = kvm_riscv_gstage_mode(ka->pgd_levels) + << HGATP_MODE_SHIFT; - hgatp |= (READ_ONCE(k->vmid.vmid) << HGATP_VMID_SHIFT) & HGATP_VMID; - hgatp |= (k->pgd_phys >> PAGE_SHIFT) & HGATP_PPN; + hgatp |= (READ_ONCE(ka->vmid.vmid) << HGATP_VMID_SHIFT) & HGATP_VMID; + hgatp |= (ka->pgd_phys >> PAGE_SHIFT) & HGATP_PPN; ncsr_write(CSR_HGATP, hgatp); diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c index 13c63ae1a78b..fb7c4e07961f 100644 --- a/arch/riscv/kvm/vm.c +++ b/arch/riscv/kvm/vm.c @@ -199,7 +199,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) r = KVM_USER_MEM_SLOTS; break; case KVM_CAP_VM_GPA_BITS: - r = kvm_riscv_gstage_gpa_bits; + if (!kvm) + r = kvm_riscv_gstage_gpa_bits(kvm_riscv_gstage_max_pgd_levels); + else + r = kvm_riscv_gstage_gpa_bits(kvm->arch.pgd_levels); break; default: r = 0; diff --git a/arch/riscv/kvm/vmid.c b/arch/riscv/kvm/vmid.c index cf34d448289d..c15bdb1dd8be 100644 --- a/arch/riscv/kvm/vmid.c +++ b/arch/riscv/kvm/vmid.c @@ -26,7 +26,8 @@ static DEFINE_SPINLOCK(vmid_lock); void __init kvm_riscv_gstage_vmid_detect(void) { /* Figure-out number of VMID bits in HW */ - csr_write(CSR_HGATP, (kvm_riscv_gstage_mode << HGATP_MODE_SHIFT) | HGATP_VMID); + csr_write(CSR_HGATP, (kvm_riscv_gstage_mode(kvm_riscv_gstage_max_pgd_levels) << + HGATP_MODE_SHIFT) | HGATP_VMID); vmid_bits = csr_read(CSR_HGATP); vmid_bits = (vmid_bits & HGATP_VMID) >> HGATP_VMID_SHIFT; vmid_bits = 
fls_long(vmid_bits); From ec92248431be7ad08742e0d1dff5109cec5ef905 Mon Sep 17 00:00:00 2001 From: Fangyu Yu Date: Fri, 3 Apr 2026 23:30:17 +0800 Subject: [PATCH 335/373] RISC-V: KVM: Cache gstage pgd_levels in struct kvm_gstage Gstage page-table helpers frequently chase gstage->kvm->arch to fetch pgd_levels. This adds noise and repeats the same dereference chain in hot paths. Add pgd_levels to struct kvm_gstage and initialize it from kvm->arch when setting up a gstage instance. Introduce kvm_riscv_gstage_init() to centralize initialization and switch gstage code to use gstage->pgd_levels. Suggested-by: Anup Patel Signed-off-by: Fangyu Yu Reviewed-by: Anup Patel Reviewed-by: Nutty Liu Link: https://lore.kernel.org/r/20260403153019.9916-3-fangyu.yu@linux.alibaba.com Signed-off-by: Anup Patel --- arch/riscv/include/asm/kvm_gstage.h | 10 ++++++ arch/riscv/kvm/gstage.c | 12 +++---- arch/riscv/kvm/mmu.c | 50 ++++++----------------------- 3 files changed, 26 insertions(+), 46 deletions(-) diff --git a/arch/riscv/include/asm/kvm_gstage.h b/arch/riscv/include/asm/kvm_gstage.h index c35874768641..9c908432bc17 100644 --- a/arch/riscv/include/asm/kvm_gstage.h +++ b/arch/riscv/include/asm/kvm_gstage.h @@ -15,6 +15,7 @@ struct kvm_gstage { #define KVM_GSTAGE_FLAGS_LOCAL BIT(0) unsigned long vmid; pgd_t *pgd; + unsigned long pgd_levels; }; struct kvm_gstage_mapping { @@ -96,4 +97,13 @@ static inline unsigned long kvm_riscv_gstage_mode(unsigned long pgd_levels) } } +static inline void kvm_riscv_gstage_init(struct kvm_gstage *gstage, struct kvm *kvm) +{ + gstage->kvm = kvm; + gstage->flags = 0; + gstage->vmid = READ_ONCE(kvm->arch.vmid.vmid); + gstage->pgd = kvm->arch.pgd; + gstage->pgd_levels = kvm->arch.pgd_levels; +} + #endif diff --git a/arch/riscv/kvm/gstage.c b/arch/riscv/kvm/gstage.c index bf7e54af2c7c..d9fe8be2a151 100644 --- a/arch/riscv/kvm/gstage.c +++ b/arch/riscv/kvm/gstage.c @@ -26,7 +26,7 @@ static inline unsigned long gstage_pte_index(struct kvm_gstage *gstage, 
unsigned long mask; unsigned long shift = HGATP_PAGE_SHIFT + (kvm_riscv_gstage_index_bits * level); - if (level == gstage->kvm->arch.pgd_levels - 1) + if (level == gstage->pgd_levels - 1) mask = (PTRS_PER_PTE * (1UL << kvm_riscv_gstage_pgd_xbits)) - 1; else mask = PTRS_PER_PTE - 1; @@ -45,7 +45,7 @@ static int gstage_page_size_to_level(struct kvm_gstage *gstage, unsigned long pa u32 i; unsigned long psz = 1UL << 12; - for (i = 0; i < gstage->kvm->arch.pgd_levels; i++) { + for (i = 0; i < gstage->pgd_levels; i++) { if (page_size == (psz << (i * kvm_riscv_gstage_index_bits))) { *out_level = i; return 0; @@ -58,7 +58,7 @@ static int gstage_page_size_to_level(struct kvm_gstage *gstage, unsigned long pa static int gstage_level_to_page_order(struct kvm_gstage *gstage, u32 level, unsigned long *out_pgorder) { - if (gstage->kvm->arch.pgd_levels < level) + if (gstage->pgd_levels < level) return -EINVAL; *out_pgorder = 12 + (level * kvm_riscv_gstage_index_bits); @@ -83,7 +83,7 @@ bool kvm_riscv_gstage_get_leaf(struct kvm_gstage *gstage, gpa_t addr, pte_t **ptepp, u32 *ptep_level) { pte_t *ptep; - u32 current_level = gstage->kvm->arch.pgd_levels - 1; + u32 current_level = gstage->pgd_levels - 1; *ptep_level = current_level; ptep = (pte_t *)gstage->pgd; @@ -127,7 +127,7 @@ int kvm_riscv_gstage_set_pte(struct kvm_gstage *gstage, struct kvm_mmu_memory_cache *pcache, const struct kvm_gstage_mapping *map) { - u32 current_level = gstage->kvm->arch.pgd_levels - 1; + u32 current_level = gstage->pgd_levels - 1; pte_t *next_ptep = (pte_t *)gstage->pgd; pte_t *ptep = &next_ptep[gstage_pte_index(gstage, map->addr, current_level)]; @@ -288,7 +288,7 @@ int kvm_riscv_gstage_split_huge(struct kvm_gstage *gstage, struct kvm_mmu_memory_cache *pcache, gpa_t addr, u32 target_level, bool flush) { - u32 current_level = gstage->kvm->arch.pgd_levels - 1; + u32 current_level = gstage->pgd_levels - 1; pte_t *next_ptep = (pte_t *)gstage->pgd; unsigned long huge_pte, child_pte; unsigned long 
child_page_size; diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c index fbcdd75cb9af..2d3def024270 100644 --- a/arch/riscv/kvm/mmu.c +++ b/arch/riscv/kvm/mmu.c @@ -24,10 +24,7 @@ static void mmu_wp_memory_region(struct kvm *kvm, int slot) phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT; struct kvm_gstage gstage; - gstage.kvm = kvm; - gstage.flags = 0; - gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid); - gstage.pgd = kvm->arch.pgd; + kvm_riscv_gstage_init(&gstage, kvm); spin_lock(&kvm->mmu_lock); kvm_riscv_gstage_wp_range(&gstage, start, end); @@ -49,10 +46,7 @@ int kvm_riscv_mmu_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa, struct kvm_gstage_mapping map; struct kvm_gstage gstage; - gstage.kvm = kvm; - gstage.flags = 0; - gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid); - gstage.pgd = kvm->arch.pgd; + kvm_riscv_gstage_init(&gstage, kvm); end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK; pfn = __phys_to_pfn(hpa); @@ -89,10 +83,7 @@ void kvm_riscv_mmu_iounmap(struct kvm *kvm, gpa_t gpa, unsigned long size) { struct kvm_gstage gstage; - gstage.kvm = kvm; - gstage.flags = 0; - gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid); - gstage.pgd = kvm->arch.pgd; + kvm_riscv_gstage_init(&gstage, kvm); spin_lock(&kvm->mmu_lock); kvm_riscv_gstage_unmap_range(&gstage, gpa, size, false); @@ -109,10 +100,7 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm, phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT; struct kvm_gstage gstage; - gstage.kvm = kvm; - gstage.flags = 0; - gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid); - gstage.pgd = kvm->arch.pgd; + kvm_riscv_gstage_init(&gstage, kvm); kvm_riscv_gstage_wp_range(&gstage, start, end); } @@ -141,10 +129,7 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm, phys_addr_t size = slot->npages << PAGE_SHIFT; struct kvm_gstage gstage; - gstage.kvm = kvm; - gstage.flags = 0; - gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid); - gstage.pgd = kvm->arch.pgd; + kvm_riscv_gstage_init(&gstage, kvm); 
spin_lock(&kvm->mmu_lock); kvm_riscv_gstage_unmap_range(&gstage, gpa, size, false); @@ -250,10 +235,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) if (!kvm->arch.pgd) return false; - gstage.kvm = kvm; - gstage.flags = 0; - gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid); - gstage.pgd = kvm->arch.pgd; + kvm_riscv_gstage_init(&gstage, kvm); mmu_locked = spin_trylock(&kvm->mmu_lock); kvm_riscv_gstage_unmap_range(&gstage, range->start << PAGE_SHIFT, (range->end - range->start) << PAGE_SHIFT, @@ -275,10 +257,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE); - gstage.kvm = kvm; - gstage.flags = 0; - gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid); - gstage.pgd = kvm->arch.pgd; + kvm_riscv_gstage_init(&gstage, kvm); if (!kvm_riscv_gstage_get_leaf(&gstage, range->start << PAGE_SHIFT, &ptep, &ptep_level)) return false; @@ -298,10 +277,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE); - gstage.kvm = kvm; - gstage.flags = 0; - gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid); - gstage.pgd = kvm->arch.pgd; + kvm_riscv_gstage_init(&gstage, kvm); if (!kvm_riscv_gstage_get_leaf(&gstage, range->start << PAGE_SHIFT, &ptep, &ptep_level)) return false; @@ -463,10 +439,7 @@ int kvm_riscv_mmu_map(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot, struct kvm_gstage gstage; struct page *page; - gstage.kvm = kvm; - gstage.flags = 0; - gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid); - gstage.pgd = kvm->arch.pgd; + kvm_riscv_gstage_init(&gstage, kvm); /* Setup initial state of output mapping */ memset(out_map, 0, sizeof(*out_map)); @@ -587,10 +560,7 @@ void kvm_riscv_mmu_free_pgd(struct kvm *kvm) spin_lock(&kvm->mmu_lock); if (kvm->arch.pgd) { - gstage.kvm = kvm; - gstage.flags = 0; - gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid); - gstage.pgd = kvm->arch.pgd; + kvm_riscv_gstage_init(&gstage, 
kvm); kvm_riscv_gstage_unmap_range(&gstage, 0UL, kvm_riscv_gstage_gpa_size(kvm->arch.pgd_levels), false); pgd = READ_ONCE(kvm->arch.pgd); From 7263b4fdb0b240e67e3ebd802e0df761d35a7fdf Mon Sep 17 00:00:00 2001 From: Fangyu Yu Date: Fri, 3 Apr 2026 23:30:18 +0800 Subject: [PATCH 336/373] RISC-V: KVM: Reuse KVM_CAP_VM_GPA_BITS to select HGATP.MODE Reuse KVM_CAP_VM_GPA_BITS to advertise and select the effective G-stage GPA width for a VM. KVM_CHECK_EXTENSION(KVM_CAP_VM_GPA_BITS) returns the effective GPA bits for a VM, KVM_ENABLE_CAP(KVM_CAP_VM_GPA_BITS) allows userspace to downsize the effective GPA width by selecting a smaller G-stage page table format: - gpa_bits <= 41 selects Sv39x4 (pgd_levels=3) - gpa_bits <= 50 selects Sv48x4 (pgd_levels=4) - gpa_bits <= 59 selects Sv57x4 (pgd_levels=5) Reject the request with -EINVAL for unsupported values and with -EBUSY if vCPUs have been created or any memslot is populated. Signed-off-by: Fangyu Yu Reviewed-by: Andrew Jones Reviewed-by: Guo Ren Reviewed-by: Nutty Liu Reviewed-by: Anup Patel Link: https://lore.kernel.org/r/20260403153019.9916-4-fangyu.yu@linux.alibaba.com Signed-off-by: Anup Patel --- arch/riscv/kvm/vm.c | 44 ++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 42 insertions(+), 2 deletions(-) diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c index fb7c4e07961f..a9f083feeb76 100644 --- a/arch/riscv/kvm/vm.c +++ b/arch/riscv/kvm/vm.c @@ -214,12 +214,52 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap) { + if (cap->flags) + return -EINVAL; + switch (cap->cap) { case KVM_CAP_RISCV_MP_STATE_RESET: - if (cap->flags) - return -EINVAL; kvm->arch.mp_state_reset = true; return 0; + case KVM_CAP_VM_GPA_BITS: { + unsigned long gpa_bits = cap->args[0]; + unsigned long new_levels; + int r = 0; + + /* Decide target pgd levels from requested gpa_bits */ +#ifdef CONFIG_64BIT + if (gpa_bits <= 41) + new_levels = 3; /* Sv39x4 */ 
+ else if (gpa_bits <= 50) + new_levels = 4; /* Sv48x4 */ + else if (gpa_bits <= 59) + new_levels = 5; /* Sv57x4 */ + else + return -EINVAL; +#else + /* 32-bit: only Sv32x4 */ + if (gpa_bits <= 34) + new_levels = 2; + else + return -EINVAL; +#endif + if (new_levels > kvm_riscv_gstage_max_pgd_levels) + return -EINVAL; + + /* Follow KVM's lock ordering: kvm->lock -> kvm->slots_lock. */ + mutex_lock(&kvm->lock); + mutex_lock(&kvm->slots_lock); + + if (kvm->created_vcpus || !kvm_are_all_memslots_empty(kvm)) + r = -EBUSY; + else + kvm->arch.pgd_levels = new_levels; + + mutex_unlock(&kvm->slots_lock); + mutex_unlock(&kvm->lock); + + return r; + } default: return -EINVAL; } From 7e629348df81b339dbc233313f0f36ff5a25fc3d Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 1 Apr 2026 18:00:17 +0100 Subject: [PATCH 337/373] KVM: arm64: Advertise ID_AA64PFR2_EL1.GCIE As we are missing ID_AA64PFR2_EL1.GCIE from the kernel feature set, userspace cannot write ID_AA64PFR2_EL1 with GCIE set, even if we are on a GICv5 host. Add the required field description.
Acked-by: Catalin Marinas Link: https://patch.msgid.link/20260401170017.369529-1-maz@kernel.org Signed-off-by: Marc Zyngier --- arch/arm64/kernel/cpufeature.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 32c2dbcc0c64..1bfaa96881da 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -325,6 +325,7 @@ static const struct arm64_ftr_bits ftr_id_aa64pfr1[] = { static const struct arm64_ftr_bits ftr_id_aa64pfr2[] = { ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR2_EL1_FPMR_SHIFT, 4, 0), + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR2_EL1_GCIE_SHIFT, 4, ID_AA64PFR2_EL1_GCIE_NI), ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR2_EL1_MTEFAR_SHIFT, 4, ID_AA64PFR2_EL1_MTEFAR_NI), ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR2_EL1_MTESTOREONLY_SHIFT, 4, ID_AA64PFR2_EL1_MTESTOREONLY_NI), ARM64_FTR_END, From ddbf9c76c4020bf63a0799b00faad40caa3de6c2 Mon Sep 17 00:00:00 2001 From: Jiakai Xu Date: Fri, 3 Apr 2026 23:20:11 +0000 Subject: [PATCH 338/373] RISC-V: KVM: Fix shift-out-of-bounds in make_xfence_request() The make_xfence_request() function uses a shift operation to check if a vCPU is in the hart mask: if (!(hmask & (1UL << (vcpu->vcpu_id - hbase)))) However, when the difference between vcpu_id and hbase is >= BITS_PER_LONG, the shift operation causes undefined behavior. This was detected by UBSAN: UBSAN: shift-out-of-bounds in arch/riscv/kvm/tlb.c:343:23 shift exponent 256 is too large for 64-bit type 'long unsigned int' Fix this by adding a bounds check before the shift operation. This bug was found by fuzzing the KVM RISC-V interface. 
Fixes: 13acfec2dbcc ("RISC-V: KVM: Add remote HFENCE functions based on VCPU requests") Signed-off-by: Jiakai Xu Reviewed-by: Andrew Jones Link: https://lore.kernel.org/r/20260403232011.2394966-1-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel --- arch/riscv/kvm/tlb.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/riscv/kvm/tlb.c b/arch/riscv/kvm/tlb.c index ff1aeac4eb8e..439c20c2775a 100644 --- a/arch/riscv/kvm/tlb.c +++ b/arch/riscv/kvm/tlb.c @@ -338,7 +338,8 @@ static void make_xfence_request(struct kvm *kvm, bitmap_zero(vcpu_mask, KVM_MAX_VCPUS); kvm_for_each_vcpu(i, vcpu, kvm) { if (hbase != -1UL) { - if (vcpu->vcpu_id < hbase) + if (vcpu->vcpu_id < hbase || + vcpu->vcpu_id >= hbase + BITS_PER_LONG) continue; if (!(hmask & (1UL << (vcpu->vcpu_id - hbase)))) continue; From 6da4b1a4359b3ed5e7ee5e9a9751a9e483906409 Mon Sep 17 00:00:00 2001 From: Claudio Imbrenda Date: Thu, 2 Apr 2026 17:01:30 +0200 Subject: [PATCH 339/373] KVM: s390: Add some useful mask macros Add _{SEGMENT,REGION3}_FR_MASK, similar to _{SEGMENT,REGION3}_MASK, but working on gfn/pfn instead of addresses. Use them in gaccess.c instead of using the normal masks plus gpa_to_gfn(). Also add _PAGES_PER_{SEGMENT,REGION3} to make future code more readable.
Reviewed-by: Steffen Eiden Signed-off-by: Claudio Imbrenda --- arch/s390/kvm/dat.h | 5 +++++ arch/s390/kvm/gaccess.c | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/s390/kvm/dat.h b/arch/s390/kvm/dat.h index 123e11dcd70d..809cd7a8adb7 100644 --- a/arch/s390/kvm/dat.h +++ b/arch/s390/kvm/dat.h @@ -104,6 +104,11 @@ union pte { } tok; }; +#define _SEGMENT_FR_MASK (_SEGMENT_MASK >> PAGE_SHIFT) +#define _REGION3_FR_MASK (_REGION3_MASK >> PAGE_SHIFT) +#define _PAGES_PER_SEGMENT _PAGE_ENTRIES +#define _PAGES_PER_REGION3 (_PAGES_PER_SEGMENT * _CRST_ENTRIES) + /* Soft dirty, needed as macro for atomic operations on ptes */ #define _PAGE_SD 0x002 diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c index 4630b2a067ea..a2ad11e2bf61 100644 --- a/arch/s390/kvm/gaccess.c +++ b/arch/s390/kvm/gaccess.c @@ -1461,7 +1461,7 @@ static int _do_shadow_crste(struct gmap *sg, gpa_t raddr, union crste *host, uni lockdep_assert_held(&sg->kvm->mmu_lock); lockdep_assert_held(&sg->parent->children_lock); - gfn = f->gfn & gpa_to_gfn(is_pmd(*table) ? _SEGMENT_MASK : _REGION3_MASK); + gfn = f->gfn & (is_pmd(*table) ? _SEGMENT_FR_MASK : _REGION3_FR_MASK); scoped_guard(spinlock, &sg->host_to_rmap_lock) rc = gmap_insert_rmap(sg, gfn, gpa_to_gfn(raddr), host->h.tt); if (rc) From 4204067f99820eda590ab99ae068463b4f930a33 Mon Sep 17 00:00:00 2001 From: Claudio Imbrenda Date: Thu, 2 Apr 2026 17:01:31 +0200 Subject: [PATCH 340/373] KVM: s390: Add alignment checks for hugepages When backing a guest page with a large page, check that the alignment of the guest page matches the alignment of the host physical page backing it within the large page. Also check that the memslot is large enough to fit the large page. Those checks are currently not needed, because memslots are guaranteed to be 1M-aligned, but this will change.
Reviewed-by: Steffen Eiden Signed-off-by: Claudio Imbrenda --- arch/s390/kvm/faultin.c | 2 +- arch/s390/kvm/gmap.c | 32 ++++++++++++++++++++++++++------ arch/s390/kvm/gmap.h | 3 ++- 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/arch/s390/kvm/faultin.c b/arch/s390/kvm/faultin.c index e37cd18200f5..ddf0ca71f374 100644 --- a/arch/s390/kvm/faultin.c +++ b/arch/s390/kvm/faultin.c @@ -109,7 +109,7 @@ int kvm_s390_faultin_gfn(struct kvm_vcpu *vcpu, struct kvm *kvm, struct guest_fa scoped_guard(read_lock, &kvm->mmu_lock) { if (!mmu_invalidate_retry_gfn(kvm, inv_seq, f->gfn)) { f->valid = true; - rc = gmap_link(mc, kvm->arch.gmap, f); + rc = gmap_link(mc, kvm->arch.gmap, f, slot); kvm_release_faultin_page(kvm, f->page, !!rc, f->write_attempt); f->page = NULL; } diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c index ef0c6ebfdde2..e3c1b070a11d 100644 --- a/arch/s390/kvm/gmap.c +++ b/arch/s390/kvm/gmap.c @@ -613,17 +613,37 @@ int gmap_try_fixup_minor(struct gmap *gmap, struct guest_fault *fault) return rc; } -static inline bool gmap_2g_allowed(struct gmap *gmap, gfn_t gfn) +static inline bool gmap_2g_allowed(struct gmap *gmap, struct guest_fault *f, + struct kvm_memory_slot *slot) { return false; } -static inline bool gmap_1m_allowed(struct gmap *gmap, gfn_t gfn) +/** + * gmap_1m_allowed() - Check whether a 1M hugepage is allowed. + * @gmap: The gmap of the guest. + * @f: Describes the fault that is being resolved. + * @slot: The memslot the faulting address belongs to. + * + * The function checks whether the GMAP_FLAG_ALLOW_HPAGE_1M flag is set for + * @gmap, whether the offset of the address in the 1M virtual frame is the + * same as the offset in the physical 1M frame, and finally whether the whole + * 1M page would fit in the given memslot. + * + * Return: true if a 1M hugepage is allowed to back the faulting address, false + * otherwise. 
+ */ +static inline bool gmap_1m_allowed(struct gmap *gmap, struct guest_fault *f, + struct kvm_memory_slot *slot) { - return test_bit(GMAP_FLAG_ALLOW_HPAGE_1M, &gmap->flags); + return test_bit(GMAP_FLAG_ALLOW_HPAGE_1M, &gmap->flags) && + !((f->gfn ^ f->pfn) & ~_SEGMENT_FR_MASK) && + slot->base_gfn <= ALIGN_DOWN(f->gfn, _PAGES_PER_SEGMENT) && + slot->base_gfn + slot->npages >= ALIGN(f->gfn + 1, _PAGES_PER_SEGMENT); } -int gmap_link(struct kvm_s390_mmu_cache *mc, struct gmap *gmap, struct guest_fault *f) +int gmap_link(struct kvm_s390_mmu_cache *mc, struct gmap *gmap, struct guest_fault *f, + struct kvm_memory_slot *slot) { unsigned int order; int rc, level; @@ -633,9 +653,9 @@ int gmap_link(struct kvm_s390_mmu_cache *mc, struct gmap *gmap, struct guest_fau level = TABLE_TYPE_PAGE_TABLE; if (f->page) { order = folio_order(page_folio(f->page)); - if (order >= get_order(_REGION3_SIZE) && gmap_2g_allowed(gmap, f->gfn)) + if (order >= get_order(_REGION3_SIZE) && gmap_2g_allowed(gmap, f, slot)) level = TABLE_TYPE_REGION3; - else if (order >= get_order(_SEGMENT_SIZE) && gmap_1m_allowed(gmap, f->gfn)) + else if (order >= get_order(_SEGMENT_SIZE) && gmap_1m_allowed(gmap, f, slot)) level = TABLE_TYPE_SEGMENT; } rc = dat_link(mc, gmap->asce, level, uses_skeys(gmap), f); diff --git a/arch/s390/kvm/gmap.h b/arch/s390/kvm/gmap.h index ccb5cd751e31..a2f74587ddbf 100644 --- a/arch/s390/kvm/gmap.h +++ b/arch/s390/kvm/gmap.h @@ -90,7 +90,8 @@ struct gmap *gmap_new(struct kvm *kvm, gfn_t limit); struct gmap *gmap_new_child(struct gmap *parent, gfn_t limit); void gmap_remove_child(struct gmap *child); void gmap_dispose(struct gmap *gmap); -int gmap_link(struct kvm_s390_mmu_cache *mc, struct gmap *gmap, struct guest_fault *fault); +int gmap_link(struct kvm_s390_mmu_cache *mc, struct gmap *gmap, struct guest_fault *fault, + struct kvm_memory_slot *slot); void gmap_sync_dirty_log(struct gmap *gmap, gfn_t start, gfn_t end); int gmap_set_limit(struct gmap *gmap, gfn_t limit); int 
gmap_ucas_translate(struct kvm_s390_mmu_cache *mc, struct gmap *gmap, gpa_t *gaddr); From 06a20c3ab6042ea7f9927fbeb50aa4e79894c136 Mon Sep 17 00:00:00 2001 From: Claudio Imbrenda Date: Thu, 2 Apr 2026 17:01:32 +0200 Subject: [PATCH 341/373] KVM: s390: Allow 4k granularity for memslots Until now memslots on s390 needed to have 1M granularity and be 1M aligned. Since the new gmap code can handle memslots with 4k granularity and alignment, remove the restrictions. Reviewed-by: Christian Borntraeger Reviewed-by: Steffen Eiden Signed-off-by: Claudio Imbrenda --- arch/s390/kvm/kvm-s390.c | 20 ++++++-------------- 1 file changed, 6 insertions(+), 14 deletions(-) diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index a583c0a00efd..156878c95e06 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -5642,8 +5642,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, struct kvm_memory_slot *new, enum kvm_mr_change change) { - gpa_t size; - if (kvm_is_ucontrol(kvm) && new->id < KVM_USER_MEM_SLOTS) return -EINVAL; @@ -5653,20 +5651,14 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, if (change != KVM_MR_DELETE && change != KVM_MR_FLAGS_ONLY) { /* - * A few sanity checks. We can have memory slots which have to be - * located/ended at a segment boundary (1MB). The memory in userland is - * ok to be fragmented into various different vmas. It is okay to mmap() - * and munmap() stuff in this slot after doing this call at any time + * A few sanity checks. The memory in userland is ok to be + * fragmented into various different vmas. It is okay to mmap() + * and munmap() stuff in this slot after doing this call at any + * time. 
*/ - - if (new->userspace_addr & 0xffffful) + if (new->userspace_addr & ~PAGE_MASK) return -EINVAL; - - size = new->npages * PAGE_SIZE; - if (size & 0xffffful) - return -EINVAL; - - if ((new->base_gfn * PAGE_SIZE) + size > kvm->arch.mem_limit) + if ((new->base_gfn + new->npages) * PAGE_SIZE > kvm->arch.mem_limit) return -EINVAL; } From c10e2771c745a206a2642cb03eec40ace2e0a7b5 Mon Sep 17 00:00:00 2001 From: Claudio Imbrenda Date: Thu, 2 Apr 2026 17:01:33 +0200 Subject: [PATCH 342/373] KVM: selftests: Remove 1M alignment requirement for s390 Remove the 1M memslot alignment requirement for s390, since it is not needed anymore. Reviewed-by: Steffen Eiden Signed-off-by: Claudio Imbrenda --- tools/testing/selftests/kvm/dirty_log_test.c | 3 --- tools/testing/selftests/kvm/include/kvm_util.h | 4 ---- tools/testing/selftests/kvm/kvm_page_table_test.c | 3 --- tools/testing/selftests/kvm/lib/kvm_util.c | 9 +-------- tools/testing/selftests/kvm/lib/memstress.c | 4 ---- tools/testing/selftests/kvm/pre_fault_memory_test.c | 4 ---- tools/testing/selftests/kvm/set_memory_region_test.c | 9 +-------- 7 files changed, 2 insertions(+), 34 deletions(-) diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c index d58a641b0e6a..7627b328f18a 100644 --- a/tools/testing/selftests/kvm/dirty_log_test.c +++ b/tools/testing/selftests/kvm/dirty_log_test.c @@ -641,9 +641,6 @@ static void run_test(enum vm_guest_mode mode, void *arg) } #ifdef __s390x__ - /* Align to 1M (segment size) */ - guest_test_phys_mem = align_down(guest_test_phys_mem, 1 << 20); - /* * The workaround in guest_code() to write all pages prior to the first * iteration isn't compatible with the dirty ring, as the dirty ring diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h index 8b39cb919f4f..f861242b4ae8 100644 --- a/tools/testing/selftests/kvm/include/kvm_util.h +++ b/tools/testing/selftests/kvm/include/kvm_util.h @@ 
-1127,10 +1127,6 @@ vm_adjust_num_guest_pages(enum vm_guest_mode mode, unsigned int num_guest_pages) { unsigned int n; n = vm_num_guest_pages(mode, vm_num_host_pages(mode, num_guest_pages)); -#ifdef __s390x__ - /* s390 requires 1M aligned guest sizes */ - n = (n + 255) & ~255; -#endif return n; } diff --git a/tools/testing/selftests/kvm/kvm_page_table_test.c b/tools/testing/selftests/kvm/kvm_page_table_test.c index dd8b12f626d3..c60a24a92829 100644 --- a/tools/testing/selftests/kvm/kvm_page_table_test.c +++ b/tools/testing/selftests/kvm/kvm_page_table_test.c @@ -261,9 +261,6 @@ static struct kvm_vm *pre_init_before_test(enum vm_guest_mode mode, void *arg) guest_page_size; else guest_test_phys_mem = p->phys_offset; -#ifdef __s390x__ - alignment = max(0x100000UL, alignment); -#endif guest_test_phys_mem = align_down(guest_test_phys_mem, alignment); /* Set up the shared data structure test_args */ diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c index 1959bf556e88..f5e076591c64 100644 --- a/tools/testing/selftests/kvm/lib/kvm_util.c +++ b/tools/testing/selftests/kvm/lib/kvm_util.c @@ -985,7 +985,7 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type, struct userspace_mem_region *region; size_t backing_src_pagesz = get_backing_src_pagesz(src_type); size_t mem_size = npages * vm->page_size; - size_t alignment; + size_t alignment = 1; TEST_REQUIRE_SET_USER_MEMORY_REGION2(); @@ -1039,13 +1039,6 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type, TEST_ASSERT(region != NULL, "Insufficient Memory"); region->mmap_size = mem_size; -#ifdef __s390x__ - /* On s390x, the host address must be aligned to 1M (due to PGSTEs) */ - alignment = 0x100000; -#else - alignment = 1; -#endif - /* * When using THP mmap is not guaranteed to returned a hugepage aligned * address so we have to pad the mmap. 
Padding is not needed for HugeTLB diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c index 557c0a0a5658..1ea735d66e15 100644 --- a/tools/testing/selftests/kvm/lib/memstress.c +++ b/tools/testing/selftests/kvm/lib/memstress.c @@ -196,10 +196,6 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus, args->gpa = (region_end_gfn - guest_num_pages - 1) * args->guest_page_size; args->gpa = align_down(args->gpa, backing_src_pagesz); -#ifdef __s390x__ - /* Align to 1M (segment size) */ - args->gpa = align_down(args->gpa, 1 << 20); -#endif args->size = guest_num_pages * args->guest_page_size; pr_info("guest physical test memory: [0x%lx, 0x%lx)\n", args->gpa, args->gpa + args->size); diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c index 93e603d91311..f3de0386ba7b 100644 --- a/tools/testing/selftests/kvm/pre_fault_memory_test.c +++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c @@ -175,11 +175,7 @@ static void __test_pre_fault_memory(unsigned long vm_type, bool private) alignment = guest_page_size = vm_guest_mode_params[VM_MODE_DEFAULT].page_size; gpa = (vm->max_gfn - TEST_NPAGES) * guest_page_size; -#ifdef __s390x__ - alignment = max(0x100000UL, guest_page_size); -#else alignment = SZ_2M; -#endif gpa = align_down(gpa, alignment); gva = gpa & ((1ULL << (vm->va_bits - 1)) - 1); diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c index 7fe427ff9b38..a398dc3a8c4b 100644 --- a/tools/testing/selftests/kvm/set_memory_region_test.c +++ b/tools/testing/selftests/kvm/set_memory_region_test.c @@ -413,14 +413,7 @@ static void test_add_max_memory_regions(void) uint32_t max_mem_slots; uint32_t slot; void *mem, *mem_aligned, *mem_extra; - size_t alignment; - -#ifdef __s390x__ - /* On s390x, the host address must be aligned to 1M (due to PGSTEs) */ - alignment = 0x100000; 
-#else - alignment = 1; -#endif + size_t alignment = 1; max_mem_slots = kvm_check_cap(KVM_CAP_NR_MEMSLOTS); TEST_ASSERT(max_mem_slots > 0, From 857e92662c07543887dafdb14b127519f4c0ac93 Mon Sep 17 00:00:00 2001 From: Claudio Imbrenda Date: Thu, 2 Apr 2026 17:01:34 +0200 Subject: [PATCH 343/373] KVM: s390: selftests: enable some common memory-related tests Enable the following tests on s390: * memslot_modification_stress_test * memslot_perf_test * mmu_stress_test Since the first two tests are now supported on all architectures, move them into TEST_GEN_PROGS_COMMON and out of the individual architectures. Reviewed-by: Steffen Eiden Signed-off-by: Claudio Imbrenda --- tools/testing/selftests/kvm/Makefile.kvm | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index fdec90e85467..057f17d6b896 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -64,6 +64,8 @@ TEST_GEN_PROGS_COMMON += kvm_binary_stats_test TEST_GEN_PROGS_COMMON += kvm_create_max_vcpus TEST_GEN_PROGS_COMMON += kvm_page_table_test TEST_GEN_PROGS_COMMON += set_memory_region_test +TEST_GEN_PROGS_COMMON += memslot_modification_stress_test +TEST_GEN_PROGS_COMMON += memslot_perf_test # Compiled test targets TEST_GEN_PROGS_x86 = $(TEST_GEN_PROGS_COMMON) @@ -147,8 +149,6 @@ TEST_GEN_PROGS_x86 += coalesced_io_test TEST_GEN_PROGS_x86 += dirty_log_perf_test TEST_GEN_PROGS_x86 += guest_memfd_test TEST_GEN_PROGS_x86 += hardware_disable_test -TEST_GEN_PROGS_x86 += memslot_modification_stress_test -TEST_GEN_PROGS_x86 += memslot_perf_test TEST_GEN_PROGS_x86 += mmu_stress_test TEST_GEN_PROGS_x86 += rseq_test TEST_GEN_PROGS_x86 += steal_time @@ -186,8 +186,6 @@ TEST_GEN_PROGS_arm64 += coalesced_io_test TEST_GEN_PROGS_arm64 += dirty_log_perf_test TEST_GEN_PROGS_arm64 += get-reg-list TEST_GEN_PROGS_arm64 += guest_memfd_test -TEST_GEN_PROGS_arm64 +=
memslot_modification_stress_test -TEST_GEN_PROGS_arm64 += memslot_perf_test TEST_GEN_PROGS_arm64 += mmu_stress_test TEST_GEN_PROGS_arm64 += rseq_test TEST_GEN_PROGS_arm64 += steal_time @@ -205,6 +203,7 @@ TEST_GEN_PROGS_s390 += s390/ucontrol_test TEST_GEN_PROGS_s390 += s390/user_operexec TEST_GEN_PROGS_s390 += s390/keyop TEST_GEN_PROGS_s390 += rseq_test +TEST_GEN_PROGS_s390 += mmu_stress_test TEST_GEN_PROGS_riscv = $(TEST_GEN_PROGS_COMMON) TEST_GEN_PROGS_riscv += riscv/sbi_pmu_test @@ -214,8 +213,6 @@ TEST_GEN_PROGS_riscv += arch_timer TEST_GEN_PROGS_riscv += coalesced_io_test TEST_GEN_PROGS_riscv += dirty_log_perf_test TEST_GEN_PROGS_riscv += get-reg-list -TEST_GEN_PROGS_riscv += memslot_modification_stress_test -TEST_GEN_PROGS_riscv += memslot_perf_test TEST_GEN_PROGS_riscv += mmu_stress_test TEST_GEN_PROGS_riscv += rseq_test TEST_GEN_PROGS_riscv += steal_time From 9b8e8aad5896d66005d29920cb1643076a20b172 Mon Sep 17 00:00:00 2001 From: Claudio Imbrenda Date: Thu, 2 Apr 2026 17:01:35 +0200 Subject: [PATCH 344/373] KVM: s390: ucontrol: Fix memslot handling Fix memslots handling for UCONTROL guests. Attempts to delete user memslots will fail, as they should, without the risk of a NULL pointer dereference. 
Fixes: 413c98f24c63 ("KVM: s390: fake memslot for ucontrol VMs") Reviewed-by: Steffen Eiden Signed-off-by: Claudio Imbrenda --- arch/s390/kvm/kvm-s390.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 156878c95e06..63bc496d0c37 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -5642,7 +5642,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, struct kvm_memory_slot *new, enum kvm_mr_change change) { - if (kvm_is_ucontrol(kvm) && new->id < KVM_USER_MEM_SLOTS) + if (kvm_is_ucontrol(kvm) && new && new->id < KVM_USER_MEM_SLOTS) return -EINVAL; /* When we are protected, we should not change the memory slots */ From cb923ee6a80f4e604e6242a4702b59251e61a380 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:13 -0700 Subject: [PATCH 345/373] KVM: SEV: Lock all vCPUs when synchronizing VMSAs for SNP launch finish Lock all vCPUs when synchronizing and encrypting VMSAs for SNP guests, as allowing userspace to manipulate and/or run a vCPU while its state is being synchronized would at best corrupt vCPU state, and at worst crash the host kernel. Opportunistically assert that vcpu->mutex is held when synchronizing its VMSA (the SEV-ES path already locks vCPUs).
Fixes: ad27ce155566 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260310234829.2608037-6-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 10b12db7f902..709e611188c1 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -884,6 +884,8 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm) u8 *d; int i; + lockdep_assert_held(&vcpu->mutex); + if (vcpu->arch.guest_state_protected) return -EINVAL; @@ -2458,6 +2460,10 @@ static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp) if (kvm_is_vcpu_creation_in_progress(kvm)) return -EBUSY; + ret = kvm_lock_all_vcpus(kvm); + if (ret) + return ret; + data.gctx_paddr = __psp_pa(sev->snp_context); data.page_type = SNP_PAGE_TYPE_VMSA; @@ -2467,12 +2473,12 @@ static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp) ret = sev_es_sync_vmsa(svm); if (ret) - return ret; + goto out; /* Transition the VMSA page to a firmware state. 
*/ ret = rmp_make_private(pfn, INITIAL_VMSA_GPA, PG_LEVEL_4K, sev->asid, true); if (ret) - return ret; + goto out; /* Issue the SNP command to encrypt the VMSA */ data.address = __sme_pa(svm->sev_es.vmsa); @@ -2481,7 +2487,7 @@ static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp) if (ret) { snp_page_reclaim(kvm, pfn); - return ret; + goto out; } svm->vcpu.arch.guest_state_protected = true; @@ -2495,7 +2501,9 @@ static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp) svm_enable_lbrv(vcpu); } - return 0; +out: + kvm_unlock_all_vcpus(kvm); + return ret; } static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp) From 8075360f3b9648abe58bcedcb7a27d83d9bf210d Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:14 -0700 Subject: [PATCH 346/373] KVM: SEV: Lock all vCPUs for the duration of SEV-ES VMSA synchronization Lock and unlock all vCPUs in a single batch when synchronizing SEV-ES VMSAs during launch finish, partly to dedup the code by a tiny amount, but mostly so that sev_launch_update_vmsa() uses the same logic/flow as all other SEV ioctls that lock all vCPUs. 
Link: https://patch.msgid.link/20260310234829.2608037-7-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 709e611188c1..15086ffe6143 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -1037,19 +1037,18 @@ static int sev_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp) if (kvm_is_vcpu_creation_in_progress(kvm)) return -EBUSY; + ret = kvm_lock_all_vcpus(kvm); + if (ret) + return ret; + kvm_for_each_vcpu(i, vcpu, kvm) { - ret = mutex_lock_killable(&vcpu->mutex); - if (ret) - return ret; - ret = __sev_launch_update_vmsa(kvm, vcpu, &argp->error); - - mutex_unlock(&vcpu->mutex); if (ret) - return ret; + break; } - return 0; + kvm_unlock_all_vcpus(kvm); + return ret; } static int sev_launch_measure(struct kvm *kvm, struct kvm_sev_cmd *argp) From 5bf92e475311b22598770caa151dea697b63c0cf Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:15 -0700 Subject: [PATCH 347/373] KVM: SEV: Provide vCPU-scoped accessors for detecting SEV+ guests Provide vCPU-scoped accessors for detecting if the vCPU belongs to an SEV, SEV-ES, or SEV-SNP VM, partly to dedup a small amount of code, but mostly to better document which usages are "safe". Generally speaking, using the VM-scoped sev_guest() and friends outside of kvm->lock is unsafe, as they can get both false positives and false negatives. But for vCPUs, the accessors are guaranteed to provide a stable result as KVM disallows initializing SEV+ state after vCPUs are created. I.e. operating on a vCPU guarantees the VM can't "become" an SEV+ VM, and that it can't revert back to a "normal" VM. This will also allow dropping the stubs for the VM-scoped accessors, as it's relatively easy to eliminate usage of the accessors from common SVM once the vCPU-scoped checks are out of the way. No functional change intended.
Link: https://patch.msgid.link/20260310234829.2608037-8-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 49 +++++++++++++------------- arch/x86/kvm/svm/svm.c | 80 +++++++++++++++++++++--------------------- arch/x86/kvm/svm/svm.h | 17 +++++++++ 3 files changed, 82 insertions(+), 64 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 15086ffe6143..f36c6694247c 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -3268,7 +3268,7 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm; - if (!sev_es_guest(vcpu->kvm)) + if (!is_sev_es_guest(vcpu)) return; svm = to_svm(vcpu); @@ -3278,7 +3278,7 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu) * a guest-owned page. Transition the page to hypervisor state before * releasing it back to the system. */ - if (sev_snp_guest(vcpu->kvm)) { + if (is_sev_snp_guest(vcpu)) { u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT; if (kvm_rmp_make_shared(vcpu->kvm, pfn, PG_LEVEL_4K)) @@ -3479,7 +3479,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm) goto vmgexit_err; break; case SVM_VMGEXIT_AP_CREATION: - if (!sev_snp_guest(vcpu->kvm)) + if (!is_sev_snp_guest(vcpu)) goto vmgexit_err; if (lower_32_bits(control->exit_info_1) != SVM_VMGEXIT_AP_DESTROY) if (!kvm_ghcb_rax_is_valid(svm)) @@ -3493,12 +3493,12 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm) case SVM_VMGEXIT_TERM_REQUEST: break; case SVM_VMGEXIT_PSC: - if (!sev_snp_guest(vcpu->kvm) || !kvm_ghcb_sw_scratch_is_valid(svm)) + if (!is_sev_snp_guest(vcpu) || !kvm_ghcb_sw_scratch_is_valid(svm)) goto vmgexit_err; break; case SVM_VMGEXIT_GUEST_REQUEST: case SVM_VMGEXIT_EXT_GUEST_REQUEST: - if (!sev_snp_guest(vcpu->kvm) || + if (!is_sev_snp_guest(vcpu) || !PAGE_ALIGNED(control->exit_info_1) || !PAGE_ALIGNED(control->exit_info_2) || control->exit_info_1 == control->exit_info_2) @@ -3572,7 +3572,8 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm) int pre_sev_run(struct vcpu_svm *svm, int cpu) 
{ struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, cpu); - struct kvm *kvm = svm->vcpu.kvm; + struct kvm_vcpu *vcpu = &svm->vcpu; + struct kvm *kvm = vcpu->kvm; unsigned int asid = sev_get_asid(kvm); /* @@ -3580,7 +3581,7 @@ int pre_sev_run(struct vcpu_svm *svm, int cpu) * VMSA, e.g. if userspace forces the vCPU to be RUNNABLE after an SNP * AP Destroy event. */ - if (sev_es_guest(kvm) && !VALID_PAGE(svm->vmcb->control.vmsa_pa)) + if (is_sev_es_guest(vcpu) && !VALID_PAGE(svm->vmcb->control.vmsa_pa)) return -EINVAL; /* @@ -4126,7 +4127,7 @@ static int snp_handle_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_ sev_ret_code fw_err = 0; int ret; - if (!sev_snp_guest(kvm)) + if (!is_sev_snp_guest(&svm->vcpu)) return -EINVAL; mutex_lock(&sev->guest_req_mutex); @@ -4196,10 +4197,12 @@ static int snp_complete_req_certs(struct kvm_vcpu *vcpu) static int snp_handle_ext_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa) { - struct kvm *kvm = svm->vcpu.kvm; + struct kvm_vcpu *vcpu = &svm->vcpu; + struct kvm *kvm = vcpu->kvm; + u8 msg_type; - if (!sev_snp_guest(kvm)) + if (!is_sev_snp_guest(vcpu)) return -EINVAL; if (kvm_read_guest(kvm, req_gpa + offsetof(struct snp_guest_msg_hdr, msg_type), @@ -4218,7 +4221,6 @@ static int snp_handle_ext_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t r */ if (msg_type == SNP_MSG_REPORT_REQ) { struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct kvm_vcpu *vcpu = &svm->vcpu; u64 data_npages; gpa_t data_gpa; @@ -4335,7 +4337,7 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS); break; case GHCB_MSR_PREF_GPA_REQ: - if (!sev_snp_guest(vcpu->kvm)) + if (!is_sev_snp_guest(vcpu)) goto out_terminate; set_ghcb_msr_bits(svm, GHCB_MSR_PREF_GPA_NONE, GHCB_MSR_GPA_VALUE_MASK, @@ -4346,7 +4348,7 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) case GHCB_MSR_REG_GPA_REQ: { u64 gfn; - if (!sev_snp_guest(vcpu->kvm)) + if (!is_sev_snp_guest(vcpu)) 
goto out_terminate; gfn = get_ghcb_msr_bits(svm, GHCB_MSR_GPA_VALUE_MASK, @@ -4361,7 +4363,7 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) break; } case GHCB_MSR_PSC_REQ: - if (!sev_snp_guest(vcpu->kvm)) + if (!is_sev_snp_guest(vcpu)) goto out_terminate; ret = snp_begin_psc_msr(svm, control->ghcb_gpa); @@ -4434,7 +4436,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) sev_es_sync_from_ghcb(svm); /* SEV-SNP guest requires that the GHCB GPA must be registered */ - if (sev_snp_guest(svm->vcpu.kvm) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) { + if (is_sev_snp_guest(vcpu) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) { vcpu_unimpl(&svm->vcpu, "vmgexit: GHCB GPA [%#llx] is not registered.\n", ghcb_gpa); return -EINVAL; } @@ -4692,10 +4694,10 @@ void sev_init_vmcb(struct vcpu_svm *svm, bool init_event) */ clr_exception_intercept(svm, GP_VECTOR); - if (init_event && sev_snp_guest(vcpu->kvm)) + if (init_event && is_sev_snp_guest(vcpu)) sev_snp_init_protected_guest_state(vcpu); - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) sev_es_init_vmcb(svm, init_event); } @@ -4706,7 +4708,7 @@ int sev_vcpu_create(struct kvm_vcpu *vcpu) mutex_init(&svm->sev_es.snp_vmsa_mutex); - if (!sev_es_guest(vcpu->kvm)) + if (!is_sev_es_guest(vcpu)) return 0; /* @@ -4726,8 +4728,6 @@ int sev_vcpu_create(struct kvm_vcpu *vcpu) void sev_es_prepare_switch_to_guest(struct vcpu_svm *svm, struct sev_es_save_area *hostsa) { - struct kvm *kvm = svm->vcpu.kvm; - /* * All host state for SEV-ES guests is categorized into three swap types * based on how it is handled by hardware during a world switch: @@ -4766,7 +4766,8 @@ void sev_es_prepare_switch_to_guest(struct vcpu_svm *svm, struct sev_es_save_are * loaded with the correct values *if* the CPU writes the MSRs. 
*/ if (sev_vcpu_has_debug_swap(svm) || - (sev_snp_guest(kvm) && cpu_feature_enabled(X86_FEATURE_DEBUG_SWAP))) { + (cpu_feature_enabled(X86_FEATURE_DEBUG_SWAP) && + is_sev_snp_guest(&svm->vcpu))) { hostsa->dr0_addr_mask = amd_get_dr_addr_mask(0); hostsa->dr1_addr_mask = amd_get_dr_addr_mask(1); hostsa->dr2_addr_mask = amd_get_dr_addr_mask(2); @@ -5130,7 +5131,7 @@ struct vmcb_save_area *sev_decrypt_vmsa(struct kvm_vcpu *vcpu) int error = 0; int ret; - if (!sev_es_guest(vcpu->kvm)) + if (!is_sev_es_guest(vcpu)) return NULL; /* @@ -5143,7 +5144,7 @@ struct vmcb_save_area *sev_decrypt_vmsa(struct kvm_vcpu *vcpu) sev = to_kvm_sev_info(vcpu->kvm); /* Check if the SEV policy allows debugging */ - if (sev_snp_guest(vcpu->kvm)) { + if (is_sev_snp_guest(vcpu)) { if (!(sev->policy & SNP_POLICY_MASK_DEBUG)) return NULL; } else { @@ -5151,7 +5152,7 @@ struct vmcb_save_area *sev_decrypt_vmsa(struct kvm_vcpu *vcpu) return NULL; } - if (sev_snp_guest(vcpu->kvm)) { + if (is_sev_snp_guest(vcpu)) { struct sev_data_snp_dbg dbg = {0}; vmsa = snp_alloc_firmware_page(__GFP_ZERO); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 3f3290d5a0a6..d874af3d520a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -242,7 +242,7 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer) * Never intercept #GP for SEV guests, KVM can't * decrypt guest memory to workaround the erratum. */ - if (svm_gp_erratum_intercept && !sev_guest(vcpu->kvm)) + if (svm_gp_erratum_intercept && !is_sev_guest(vcpu)) set_exception_intercept(svm, GP_VECTOR); } } @@ -284,7 +284,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu, * SEV-ES does not expose the next RIP. The RIP update is controlled by * the type of exit and the #VC handler in the guest. 
*/ - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) goto done; if (nrips && svm->vmcb->control.next_rip != 0) { @@ -736,7 +736,7 @@ static void svm_recalc_lbr_msr_intercepts(struct kvm_vcpu *vcpu) svm_set_intercept_for_msr(vcpu, MSR_IA32_LASTINTFROMIP, MSR_TYPE_RW, intercept); svm_set_intercept_for_msr(vcpu, MSR_IA32_LASTINTTOIP, MSR_TYPE_RW, intercept); - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) svm_set_intercept_for_msr(vcpu, MSR_IA32_DEBUGCTLMSR, MSR_TYPE_RW, intercept); svm->lbr_msrs_intercepted = intercept; @@ -846,7 +846,7 @@ static void svm_recalc_msr_intercepts(struct kvm_vcpu *vcpu) svm_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP, MSR_TYPE_RW, !shstk_enabled); } - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) sev_es_recalc_msr_intercepts(vcpu); svm_recalc_pmu_msr_intercepts(vcpu); @@ -881,7 +881,7 @@ void svm_enable_lbrv(struct kvm_vcpu *vcpu) static void __svm_disable_lbrv(struct kvm_vcpu *vcpu) { - KVM_BUG_ON(sev_es_guest(vcpu->kvm), vcpu->kvm); + KVM_BUG_ON(is_sev_es_guest(vcpu), vcpu->kvm); to_svm(vcpu)->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK; } @@ -1223,7 +1223,7 @@ static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event) if (vcpu->kvm->arch.bus_lock_detection_enabled) svm_set_intercept(svm, INTERCEPT_BUSLOCK); - if (sev_guest(vcpu->kvm)) + if (is_sev_guest(vcpu)) sev_init_vmcb(svm, init_event); svm_hv_init_vmcb(vmcb); @@ -1397,7 +1397,7 @@ static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu) struct vcpu_svm *svm = to_svm(vcpu); struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, vcpu->cpu); - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) sev_es_unmap_ghcb(svm); if (svm->guest_state_loaded) @@ -1408,7 +1408,7 @@ static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu) * or subsequent vmload of host save area. 
*/ vmsave(sd->save_area_pa); - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) sev_es_prepare_switch_to_guest(svm, sev_es_host_save_area(sd)); if (tsc_scaling) @@ -1421,7 +1421,7 @@ static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu) * all CPUs support TSC_AUX virtualization). */ if (likely(tsc_aux_uret_slot >= 0) && - (!boot_cpu_has(X86_FEATURE_V_TSC_AUX) || !sev_es_guest(vcpu->kvm))) + (!boot_cpu_has(X86_FEATURE_V_TSC_AUX) || !is_sev_es_guest(vcpu))) kvm_set_user_return_msr(tsc_aux_uret_slot, svm->tsc_aux, -1ull); if (cpu_feature_enabled(X86_FEATURE_SRSO_BP_SPEC_REDUCE) && @@ -1488,7 +1488,7 @@ static bool svm_get_if_flag(struct kvm_vcpu *vcpu) { struct vmcb *vmcb = to_svm(vcpu)->vmcb; - return sev_es_guest(vcpu->kvm) + return is_sev_es_guest(vcpu) ? vmcb->control.int_state & SVM_GUEST_INTERRUPT_MASK : kvm_get_rflags(vcpu) & X86_EFLAGS_IF; } @@ -1722,7 +1722,7 @@ static void sev_post_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3) * contents of the VMSA, and future VMCB save area updates won't be * seen. */ - if (sev_es_guest(vcpu->kvm)) { + if (is_sev_es_guest(vcpu)) { svm->vmcb->save.cr3 = cr3; vmcb_mark_dirty(svm->vmcb, VMCB_CR); } @@ -1777,7 +1777,7 @@ void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) * SEV-ES guests must always keep the CR intercepts cleared. CR * tracking is done using the CR write traps. 
*/ - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) return; if (hcr0 == cr0) { @@ -1888,7 +1888,7 @@ static void svm_sync_dirty_debug_regs(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - if (WARN_ON_ONCE(sev_es_guest(vcpu->kvm))) + if (WARN_ON_ONCE(is_sev_es_guest(vcpu))) return; get_debugreg(vcpu->arch.db[0], 0); @@ -1967,7 +1967,7 @@ static int npf_interception(struct kvm_vcpu *vcpu) } } - if (sev_snp_guest(vcpu->kvm) && (error_code & PFERR_GUEST_ENC_MASK)) + if (is_sev_snp_guest(vcpu) && (error_code & PFERR_GUEST_ENC_MASK)) error_code |= PFERR_PRIVATE_ACCESS; trace_kvm_page_fault(vcpu, gpa, error_code); @@ -2112,7 +2112,7 @@ static int shutdown_interception(struct kvm_vcpu *vcpu) * The VM save area for SEV-ES guests has already been encrypted so it * cannot be reinitialized, i.e. synthesizing INIT is futile. */ - if (!sev_es_guest(vcpu->kvm)) { + if (!is_sev_es_guest(vcpu)) { clear_page(svm->vmcb); #ifdef CONFIG_KVM_SMM if (is_smm(vcpu)) @@ -2139,7 +2139,7 @@ static int io_interception(struct kvm_vcpu *vcpu) size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT; if (string) { - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) return sev_es_string_io(svm, size, port, in); else return kvm_emulate_instruction(vcpu, 0); @@ -2471,13 +2471,13 @@ static int task_switch_interception(struct kvm_vcpu *vcpu) static void svm_clr_iret_intercept(struct vcpu_svm *svm) { - if (!sev_es_guest(svm->vcpu.kvm)) + if (!is_sev_es_guest(&svm->vcpu)) svm_clr_intercept(svm, INTERCEPT_IRET); } static void svm_set_iret_intercept(struct vcpu_svm *svm) { - if (!sev_es_guest(svm->vcpu.kvm)) + if (!is_sev_es_guest(&svm->vcpu)) svm_set_intercept(svm, INTERCEPT_IRET); } @@ -2485,7 +2485,7 @@ static int iret_interception(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - WARN_ON_ONCE(sev_es_guest(vcpu->kvm)); + WARN_ON_ONCE(is_sev_es_guest(vcpu)); ++vcpu->stat.nmi_window_exits; svm->awaiting_iret_completion = true; @@ -2659,7 +2659,7 @@ 
static int dr_interception(struct kvm_vcpu *vcpu) * SEV-ES intercepts DR7 only to disable guest debugging and the guest issues a VMGEXIT * for DR7 write only. KVM cannot change DR7 (always swapped as type 'A') so return early. */ - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) return 1; if (vcpu->guest_debug == 0) { @@ -2741,7 +2741,7 @@ static int svm_get_feature_msr(u32 msr, u64 *data) static bool sev_es_prevent_msr_access(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { - return sev_es_guest(vcpu->kvm) && vcpu->arch.guest_state_protected && + return is_sev_es_guest(vcpu) && vcpu->arch.guest_state_protected && msr_info->index != MSR_IA32_XSS && !msr_write_intercepted(vcpu, msr_info->index); } @@ -2877,7 +2877,7 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) static int svm_complete_emulated_msr(struct kvm_vcpu *vcpu, int err) { struct vcpu_svm *svm = to_svm(vcpu); - if (!err || !sev_es_guest(vcpu->kvm) || WARN_ON_ONCE(!svm->sev_es.ghcb)) + if (!err || !is_sev_es_guest(vcpu) || WARN_ON_ONCE(!svm->sev_es.ghcb)) return kvm_complete_insn_gp(vcpu, err); svm_vmgexit_inject_exception(svm, X86_TRAP_GP); @@ -3058,7 +3058,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) * required in this case because TSC_AUX is restored on #VMEXIT * from the host save area. */ - if (boot_cpu_has(X86_FEATURE_V_TSC_AUX) && sev_es_guest(vcpu->kvm)) + if (boot_cpu_has(X86_FEATURE_V_TSC_AUX) && is_sev_es_guest(vcpu)) break; /* @@ -3158,7 +3158,7 @@ static int pause_interception(struct kvm_vcpu *vcpu) * vcpu->arch.preempted_in_kernel can never be true. Just * set in_kernel to false as well. */ - in_kernel = !sev_es_guest(vcpu->kvm) && svm_get_cpl(vcpu) == 0; + in_kernel = !is_sev_es_guest(vcpu) && svm_get_cpl(vcpu) == 0; grow_ple_window(vcpu); @@ -3323,9 +3323,9 @@ static void dump_vmcb(struct kvm_vcpu *vcpu) guard(mutex)(&vmcb_dump_mutex); - vm_type = sev_snp_guest(vcpu->kvm) ? "SEV-SNP" : - sev_es_guest(vcpu->kvm) ? 
"SEV-ES" : - sev_guest(vcpu->kvm) ? "SEV" : "SVM"; + vm_type = is_sev_snp_guest(vcpu) ? "SEV-SNP" : + is_sev_es_guest(vcpu) ? "SEV-ES" : + is_sev_guest(vcpu) ? "SEV" : "SVM"; pr_err("%s vCPU%u VMCB %p, last attempted VMRUN on CPU %d\n", vm_type, vcpu->vcpu_id, svm->current_vmcb->ptr, vcpu->arch.last_vmentry_cpu); @@ -3370,7 +3370,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu) pr_err("%-20s%016llx\n", "allowed_sev_features:", control->allowed_sev_features); pr_err("%-20s%016llx\n", "guest_sev_features:", control->guest_sev_features); - if (sev_es_guest(vcpu->kvm)) { + if (is_sev_es_guest(vcpu)) { save = sev_decrypt_vmsa(vcpu); if (!save) goto no_vmsa; @@ -3453,7 +3453,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu) "excp_from:", save->last_excp_from, "excp_to:", save->last_excp_to); - if (sev_es_guest(vcpu->kvm)) { + if (is_sev_es_guest(vcpu)) { struct sev_es_save_area *vmsa = (struct sev_es_save_area *)save; pr_err("%-15s %016llx\n", @@ -3514,7 +3514,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu) } no_vmsa: - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) sev_free_decrypted_vmsa(vcpu, save); } @@ -3603,7 +3603,7 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) struct kvm_run *kvm_run = vcpu->run; /* SEV-ES guests must use the CR write traps to track CR registers. */ - if (!sev_es_guest(vcpu->kvm)) { + if (!is_sev_es_guest(vcpu)) { if (!svm_is_intercept(svm, INTERCEPT_CR0_WRITE)) vcpu->arch.cr0 = svm->vmcb->save.cr0; if (npt_enabled) @@ -3655,7 +3655,7 @@ static int pre_svm_run(struct kvm_vcpu *vcpu) svm->current_vmcb->cpu = vcpu->cpu; } - if (sev_guest(vcpu->kvm)) + if (is_sev_guest(vcpu)) return pre_sev_run(svm, vcpu->cpu); /* FIXME: handle wraparound of asid_generation */ @@ -3815,7 +3815,7 @@ static void svm_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) * SEV-ES guests must always keep the CR intercepts cleared. CR * tracking is done using the CR write traps. 
*/ - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) return; if (nested_svm_virtualize_tpr(vcpu)) @@ -4015,7 +4015,7 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu) * ignores SEV-ES guest writes to EFER.SVME *and* CLGI/STGI are not * supported NAEs in the GHCB protocol. */ - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) return; if (!gif_set(svm)) { @@ -4303,7 +4303,7 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_in amd_clear_divider(); - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) __svm_sev_es_vcpu_run(svm, spec_ctrl_intercepted, sev_es_host_save_area(sd)); else @@ -4404,7 +4404,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL)) x86_spec_ctrl_restore_host(svm->virt_spec_ctrl); - if (!sev_es_guest(vcpu->kvm)) { + if (!is_sev_es_guest(vcpu)) { vcpu->arch.cr2 = svm->vmcb->save.cr2; vcpu->arch.regs[VCPU_REGS_RAX] = svm->vmcb->save.rax; vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp; @@ -4554,7 +4554,7 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) if (guest_cpuid_is_intel_compatible(vcpu)) guest_cpu_cap_clear(vcpu, X86_FEATURE_V_VMSAVE_VMLOAD); - if (sev_guest(vcpu->kvm)) + if (is_sev_guest(vcpu)) sev_vcpu_after_set_cpuid(svm); } @@ -4950,7 +4950,7 @@ static int svm_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type, return X86EMUL_UNHANDLEABLE_VECTORING; /* Emulation is always possible when KVM has access to all guest state. */ - if (!sev_guest(vcpu->kvm)) + if (!is_sev_guest(vcpu)) return X86EMUL_CONTINUE; /* #UD and #GP should never be intercepted for SEV guests. */ @@ -4962,7 +4962,7 @@ static int svm_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type, * Emulation is impossible for SEV-ES guests as KVM doesn't have access * to guest register state. 
*/ - if (sev_es_guest(vcpu->kvm)) + if (is_sev_es_guest(vcpu)) return X86EMUL_RETRY_INSTR; /* @@ -5099,7 +5099,7 @@ static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu) static void svm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector) { - if (!sev_es_guest(vcpu->kvm)) + if (!is_sev_es_guest(vcpu)) return kvm_vcpu_deliver_sipi_vector(vcpu, vector); sev_vcpu_deliver_sipi_vector(vcpu, vector); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 68675b25ef8e..79f00184a2ec 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -389,10 +389,27 @@ static __always_inline bool sev_snp_guest(struct kvm *kvm) return (sev->vmsa_features & SVM_SEV_FEAT_SNP_ACTIVE) && !WARN_ON_ONCE(!sev_es_guest(kvm)); } + +static __always_inline bool is_sev_guest(struct kvm_vcpu *vcpu) +{ + return sev_guest(vcpu->kvm); +} +static __always_inline bool is_sev_es_guest(struct kvm_vcpu *vcpu) +{ + return sev_es_guest(vcpu->kvm); +} + +static __always_inline bool is_sev_snp_guest(struct kvm_vcpu *vcpu) +{ + return sev_snp_guest(vcpu->kvm); +} #else #define sev_guest(kvm) false #define sev_es_guest(kvm) false #define sev_snp_guest(kvm) false +#define is_sev_guest(vcpu) false +#define is_sev_es_guest(vcpu) false +#define is_sev_snp_guest(vcpu) false #endif static inline bool ghcb_gpa_is_registered(struct vcpu_svm *svm, u64 val) From 138e5f6a3e1172fee8665bc8b1bbe695ba6b2adf Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:16 -0700 Subject: [PATCH 348/373] KVM: SEV: Add quad-underscore version of VM-scoped APIs to detect SEV+ guests Add "unsafe" quad-underscore versions of the SEV+ guest detectors in anticipation of hardening the APIs via lockdep assertions. This will allow adding exceptions for usage that is known to be safe in advance of the lockdep assertions. Use a pile of underscores to try and communicate that use of the "unsafe" shouldn't be done lightly. No functional change intended. 
Link: https://patch.msgid.link/20260310234829.2608037-9-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.h | 28 +++++++++++++++++++++------- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 79f00184a2ec..f14e2fe551cd 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -371,37 +371,51 @@ static __always_inline struct kvm_sev_info *to_kvm_sev_info(struct kvm *kvm) } #ifdef CONFIG_KVM_AMD_SEV -static __always_inline bool sev_guest(struct kvm *kvm) +static __always_inline bool ____sev_guest(struct kvm *kvm) { return to_kvm_sev_info(kvm)->active; } -static __always_inline bool sev_es_guest(struct kvm *kvm) +static __always_inline bool ____sev_es_guest(struct kvm *kvm) { struct kvm_sev_info *sev = to_kvm_sev_info(kvm); return sev->es_active && !WARN_ON_ONCE(!sev->active); } -static __always_inline bool sev_snp_guest(struct kvm *kvm) +static __always_inline bool ____sev_snp_guest(struct kvm *kvm) { struct kvm_sev_info *sev = to_kvm_sev_info(kvm); return (sev->vmsa_features & SVM_SEV_FEAT_SNP_ACTIVE) && - !WARN_ON_ONCE(!sev_es_guest(kvm)); + !WARN_ON_ONCE(!____sev_es_guest(kvm)); +} + +static __always_inline bool sev_guest(struct kvm *kvm) +{ + return ____sev_guest(kvm); +} +static __always_inline bool sev_es_guest(struct kvm *kvm) +{ + return ____sev_es_guest(kvm); +} + +static __always_inline bool sev_snp_guest(struct kvm *kvm) +{ + return ____sev_snp_guest(kvm); } static __always_inline bool is_sev_guest(struct kvm_vcpu *vcpu) { - return sev_guest(vcpu->kvm); + return ____sev_guest(vcpu->kvm); } static __always_inline bool is_sev_es_guest(struct kvm_vcpu *vcpu) { - return sev_es_guest(vcpu->kvm); + return ____sev_es_guest(vcpu->kvm); } static __always_inline bool is_sev_snp_guest(struct kvm_vcpu *vcpu) { - return sev_snp_guest(vcpu->kvm); + return ____sev_snp_guest(vcpu->kvm); } #else #define sev_guest(kvm) false From 56906910ea3084cbe82b9078a561130a6203f978 
Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:17 -0700 Subject: [PATCH 349/373] KVM: SEV: Document the SEV-ES check when querying SMM support as "safe" Use the "unsafe" API to check for an SEV-ES+ guest when determining whether or not SMBASE is a supported MSR, i.e. whether or not emulated SMM is supported. This will eventually allow adding lockdep assertions to the APIs for detecting SEV+ VMs without triggering "real" false positives. While svm_has_emulated_msr() doesn't hold kvm->lock, i.e. can get both false positives *and* false negatives, both are completely fine, as the only time the result isn't stable is when userspace is the sole consumer of the result. I.e. userspace can confuse itself, but that's it. No functional change intended. Link: https://patch.msgid.link/20260310234829.2608037-10-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index d874af3d520a..69a3efc14368 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4517,9 +4517,17 @@ static bool svm_has_emulated_msr(struct kvm *kvm, u32 index) case MSR_IA32_SMBASE: if (!IS_ENABLED(CONFIG_KVM_SMM)) return false; - /* SEV-ES guests do not support SMM, so report false */ - if (kvm && sev_es_guest(kvm)) + +#ifdef CONFIG_KVM_AMD_SEV + /* + * KVM can't access register state to emulate SMM for SEV-ES + * guests. Consuming stale data here is "fine", as KVM only + * checks for MSR_IA32_SMBASE support without a vCPU when + * userspace is querying KVM_CAP_X86_SMM.
+ */ + if (kvm && ____sev_es_guest(kvm)) return false; +#endif break; default: break; From 7341500f8b8624616f3760206765b1ea01e2b849 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:18 -0700 Subject: [PATCH 350/373] KVM: SEV: Move standard VM-scoped helpers to detect SEV+ guests to sev.c Now that all external usage of the VM-scoped APIs to detect SEV+ guests is gone, drop the stubs provided for CONFIG_KVM_AMD_SEV=n builds and bury the "standard" APIs in sev.c. No functional change intended. Link: https://patch.msgid.link/20260310234829.2608037-11-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 14 ++++++++++++++ arch/x86/kvm/svm/svm.h | 17 ----------------- 2 files changed, 14 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index f36c6694247c..56ace27f739c 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -107,6 +107,20 @@ static unsigned int nr_asids; static unsigned long *sev_asid_bitmap; static unsigned long *sev_reclaim_asid_bitmap; +static bool sev_guest(struct kvm *kvm) +{ + return ____sev_guest(kvm); +} +static bool sev_es_guest(struct kvm *kvm) +{ + return ____sev_es_guest(kvm); +} + +static bool sev_snp_guest(struct kvm *kvm) +{ + return ____sev_snp_guest(kvm); +} + static int snp_decommission_context(struct kvm *kvm); struct enc_region { diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index f14e2fe551cd..4c841e330aaf 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -390,20 +390,6 @@ static __always_inline bool ____sev_snp_guest(struct kvm *kvm) !WARN_ON_ONCE(!____sev_es_guest(kvm)); } -static __always_inline bool sev_guest(struct kvm *kvm) -{ - return ____sev_guest(kvm); -} -static __always_inline bool sev_es_guest(struct kvm *kvm) -{ - return ____sev_es_guest(kvm); -} - -static __always_inline bool sev_snp_guest(struct kvm *kvm) -{ - return ____sev_snp_guest(kvm); -} - static __always_inline bool 
is_sev_guest(struct kvm_vcpu *vcpu) { return ____sev_guest(vcpu->kvm); @@ -418,9 +404,6 @@ static __always_inline bool is_sev_snp_guest(struct kvm_vcpu *vcpu) return ____sev_snp_guest(vcpu->kvm); } #else -#define sev_guest(kvm) false -#define sev_es_guest(kvm) false -#define sev_snp_guest(kvm) false #define is_sev_guest(vcpu) false #define is_sev_es_guest(vcpu) false #define is_sev_snp_guest(vcpu) false From e353f1beeda3e7037f192235d5bd6abffacb49f6 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:19 -0700 Subject: [PATCH 351/373] KVM: SEV: Move SEV-specific VM initialization to sev.c Move SEV+ VM initialization to sev.c (as sev_vm_init()) so that kvm_sev_info (and all usage) can be gated on CONFIG_KVM_AMD_SEV=y without needing more #ifdefs. As a bonus, isolating the logic will make it easier to harden the flow, e.g. to WARN if the vm_type is unknown. No functional change intended (SEV, SEV_ES, and SNP VM types are only supported if CONFIG_KVM_AMD_SEV=y). Link: https://patch.msgid.link/20260310234829.2608037-12-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 15 +++++++++++++++ arch/x86/kvm/svm/svm.c | 12 +----------- arch/x86/kvm/svm/svm.h | 2 ++ 3 files changed, 18 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 56ace27f739c..4df0f17da3e2 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2925,6 +2925,21 @@ static int snp_decommission_context(struct kvm *kvm) return 0; } +void sev_vm_init(struct kvm *kvm) +{ + int type = kvm->arch.vm_type; + + if (type == KVM_X86_DEFAULT_VM || type == KVM_X86_SW_PROTECTED_VM) + return; + + kvm->arch.has_protected_state = (type == KVM_X86_SEV_ES_VM || + type == KVM_X86_SNP_VM); + to_kvm_sev_info(kvm)->need_init = true; + + kvm->arch.has_private_mem = (type == KVM_X86_SNP_VM); + kvm->arch.pre_fault_allowed = !kvm->arch.has_private_mem; +} + void sev_vm_destroy(struct kvm *kvm) { struct kvm_sev_info *sev 
= to_kvm_sev_info(kvm); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 69a3efc14368..261e563b9bab 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -5123,17 +5123,7 @@ static void svm_vm_destroy(struct kvm *kvm) static int svm_vm_init(struct kvm *kvm) { - int type = kvm->arch.vm_type; - - if (type != KVM_X86_DEFAULT_VM && - type != KVM_X86_SW_PROTECTED_VM) { - kvm->arch.has_protected_state = - (type == KVM_X86_SEV_ES_VM || type == KVM_X86_SNP_VM); - to_kvm_sev_info(kvm)->need_init = true; - - kvm->arch.has_private_mem = (type == KVM_X86_SNP_VM); - kvm->arch.pre_fault_allowed = !kvm->arch.has_private_mem; - } + sev_vm_init(kvm); if (!pause_filter_count || !pause_filter_thresh) kvm_disable_exits(kvm, KVM_X86_DISABLE_EXITS_PAUSE); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 4c841e330aaf..089726eb06b2 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -900,6 +900,7 @@ static inline struct page *snp_safe_alloc_page(void) int sev_vcpu_create(struct kvm_vcpu *vcpu); void sev_free_vcpu(struct kvm_vcpu *vcpu); +void sev_vm_init(struct kvm *kvm); void sev_vm_destroy(struct kvm *kvm); void __init sev_set_cpu_caps(void); void __init sev_hardware_setup(void); @@ -926,6 +927,7 @@ static inline struct page *snp_safe_alloc_page(void) static inline int sev_vcpu_create(struct kvm_vcpu *vcpu) { return 0; } static inline void sev_free_vcpu(struct kvm_vcpu *vcpu) {} +static inline void sev_vm_init(struct kvm *kvm) {} static inline void sev_vm_destroy(struct kvm *kvm) {} static inline void __init sev_set_cpu_caps(void) {} static inline void __init sev_hardware_setup(void) {} From da773ea3f59032f659bfc4c450ca86e384786168 Mon Sep 17 00:00:00 2001 From: Tao Cui Date: Thu, 9 Apr 2026 18:56:36 +0800 Subject: [PATCH 352/373] LoongArch: KVM: Use CSR_CRMD_PLV in kvm_arch_vcpu_in_kernel() The function reads LOONGARCH_CSR_CRMD but uses CSR_PRMD_PPLV to extract the privilege level. 
While both masks have the same value (0x3), CSR_CRMD_PLV is the semantically correct constant for CRMD. Cc: stable@vger.kernel.org Reviewed-by: Bibo Mao Signed-off-by: Tao Cui Signed-off-by: Huacai Chen --- arch/loongarch/kvm/vcpu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/loongarch/kvm/vcpu.c b/arch/loongarch/kvm/vcpu.c index 831f381a8fd1..ed2cfcd76f60 100644 --- a/arch/loongarch/kvm/vcpu.c +++ b/arch/loongarch/kvm/vcpu.c @@ -402,7 +402,7 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu) val = gcsr_read(LOONGARCH_CSR_CRMD); preempt_enable(); - return (val & CSR_PRMD_PPLV) == PLV_KERN; + return (val & CSR_CRMD_PLV) == PLV_KERN; } #ifdef CONFIG_GUEST_PERF_EVENTS From 14d2714d6537a9df5bea2515185034dbb9d30f03 Mon Sep 17 00:00:00 2001 From: Bibo Mao Date: Thu, 9 Apr 2026 18:56:36 +0800 Subject: [PATCH 353/373] LoongArch: KVM: Check kvm_request_pending() in kvm_late_check_requests() Add a kvm_request_pending() check first in kvm_late_check_requests(); most of the time there is no pending request, so the subsequent per-bit pending checks can be skipped. Also embed kvm_check_pmu() into kvm_late_check_requests(), and put it after the kvm_request_pending() check.
Signed-off-by: Bibo Mao Signed-off-by: Huacai Chen --- arch/loongarch/kvm/vcpu.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/loongarch/kvm/vcpu.c b/arch/loongarch/kvm/vcpu.c index ed2cfcd76f60..fc8f7290cfed 100644 --- a/arch/loongarch/kvm/vcpu.c +++ b/arch/loongarch/kvm/vcpu.c @@ -149,14 +149,6 @@ static void kvm_lose_pmu(struct kvm_vcpu *vcpu) kvm_restore_host_pmu(vcpu); } -static void kvm_check_pmu(struct kvm_vcpu *vcpu) -{ - if (kvm_check_request(KVM_REQ_PMU, vcpu)) { - kvm_own_pmu(vcpu); - vcpu->arch.aux_inuse |= KVM_LARCH_PMU; - } -} - static void kvm_update_stolen_time(struct kvm_vcpu *vcpu) { u32 version; @@ -232,6 +224,15 @@ static int kvm_check_requests(struct kvm_vcpu *vcpu) static void kvm_late_check_requests(struct kvm_vcpu *vcpu) { lockdep_assert_irqs_disabled(); + + if (!kvm_request_pending(vcpu)) + return; + + if (kvm_check_request(KVM_REQ_PMU, vcpu)) { + kvm_own_pmu(vcpu); + vcpu->arch.aux_inuse |= KVM_LARCH_PMU; + } + if (kvm_check_request(KVM_REQ_TLB_FLUSH_GPA, vcpu)) if (vcpu->arch.flush_gpa != INVALID_GPA) { kvm_flush_tlb_gpa(vcpu, vcpu->arch.flush_gpa); @@ -312,7 +313,6 @@ static int kvm_pre_enter_guest(struct kvm_vcpu *vcpu) /* Make sure the vcpu mode has been written */ smp_store_mb(vcpu->mode, IN_GUEST_MODE); kvm_check_vpid(vcpu); - kvm_check_pmu(vcpu); /* * Called after function kvm_check_vpid() From f62eb9ca8def410bcef39e8909945409d0968473 Mon Sep 17 00:00:00 2001 From: Bibo Mao Date: Thu, 9 Apr 2026 18:56:36 +0800 Subject: [PATCH 354/373] LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch CSR register LOONGARCH_CSR_EENTRY is shared between the host CPU and guest vCPU, so KVM needs to save and restore the LOONGARCH_CSR_EENTRY register. Move the LOONGARCH_CSR_EENTRY save into the context-switch function rather than doing it at VM entry; VM enter/exit is typically much more frequent than vCPU thread context switches.
Signed-off-by: Bibo Mao Signed-off-by: Huacai Chen --- arch/loongarch/kvm/vcpu.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/loongarch/kvm/vcpu.c b/arch/loongarch/kvm/vcpu.c index fc8f7290cfed..3367a9886b63 100644 --- a/arch/loongarch/kvm/vcpu.c +++ b/arch/loongarch/kvm/vcpu.c @@ -320,7 +320,6 @@ static int kvm_pre_enter_guest(struct kvm_vcpu *vcpu) * and it may also clear KVM_REQ_TLB_FLUSH_GPA pending bit */ kvm_late_check_requests(vcpu); - vcpu->arch.host_eentry = csr_read64(LOONGARCH_CSR_EENTRY); /* Clear KVM_LARCH_SWCSR_LATEST as CSR will change when enter guest */ vcpu->arch.aux_inuse &= ~KVM_LARCH_SWCSR_LATEST; @@ -1628,9 +1627,11 @@ static int _kvm_vcpu_load(struct kvm_vcpu *vcpu, int cpu) * If not, any old guest state from this vCPU will have been clobbered. */ context = per_cpu_ptr(vcpu->kvm->arch.vmcs, cpu); - if (migrated || (context->last_vcpu != vcpu)) + if (migrated || (context->last_vcpu != vcpu)) { + context->last_vcpu = vcpu; vcpu->arch.aux_inuse &= ~KVM_LARCH_HWCSR_USABLE; - context->last_vcpu = vcpu; + vcpu->arch.host_eentry = csr_read64(LOONGARCH_CSR_EENTRY); + } /* Restore timer state regardless */ kvm_restore_timer(vcpu); From aac656857e9f008a014ac9d58aab66e8fc803604 Mon Sep 17 00:00:00 2001 From: Bibo Mao Date: Thu, 9 Apr 2026 18:56:36 +0800 Subject: [PATCH 355/373] LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch CSR register LOONGARCH_CSR_GSTAT stores guest VMID information. In the existing implementation, the VMID is per vCPU, similar to an ASID in the kernel, and LOONGARCH_CSR_GSTAT is written at VM entry even if the VMID has not changed. Move the LOONGARCH_CSR_GSTAT save/restore into vCPU context switch, and update LOONGARCH_CSR_GSTAT at VM entry only when the VMID is updated; VM enter/exit is typically much more frequent than vCPU thread context switches.
Signed-off-by: Bibo Mao Signed-off-by: Huacai Chen --- arch/loongarch/kvm/main.c | 8 ++++---- arch/loongarch/kvm/vcpu.c | 2 ++ 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/arch/loongarch/kvm/main.c b/arch/loongarch/kvm/main.c index 2c593ac7892f..304c83863e71 100644 --- a/arch/loongarch/kvm/main.c +++ b/arch/loongarch/kvm/main.c @@ -271,11 +271,11 @@ void kvm_check_vpid(struct kvm_vcpu *vcpu) * memory with new address is changed on other VCPUs. */ set_gcsr_llbctl(CSR_LLBCTL_WCLLB); - } - /* Restore GSTAT(0x50).vpid */ - vpid = (vcpu->arch.vpid & vpid_mask) << CSR_GSTAT_GID_SHIFT; - change_csr_gstat(vpid_mask << CSR_GSTAT_GID_SHIFT, vpid); + + /* Restore GSTAT(0x50).vpid */ + vpid = (vcpu->arch.vpid & vpid_mask) << CSR_GSTAT_GID_SHIFT; + change_csr_gstat(vpid_mask << CSR_GSTAT_GID_SHIFT, vpid); + } } void kvm_init_vmcs(struct kvm *kvm) diff --git a/arch/loongarch/kvm/vcpu.c b/arch/loongarch/kvm/vcpu.c index 3367a9886b63..e28084c49e68 100644 --- a/arch/loongarch/kvm/vcpu.c +++ b/arch/loongarch/kvm/vcpu.c @@ -1699,6 +1699,7 @@ static int _kvm_vcpu_load(struct kvm_vcpu *vcpu, int cpu) /* Restore Root.GINTC from unused Guest.GINTC register */ write_csr_gintc(csr->csrs[LOONGARCH_CSR_GINTC]); + write_csr_gstat(csr->csrs[LOONGARCH_CSR_GSTAT]); /* * We should clear linked load bit to break interrupted atomics. This @@ -1794,6 +1795,7 @@ static int _kvm_vcpu_put(struct kvm_vcpu *vcpu, int cpu) kvm_save_hw_gcsr(csr, LOONGARCH_CSR_ISR3); } + csr->csrs[LOONGARCH_CSR_GSTAT] = read_csr_gstat(); vcpu->arch.aux_inuse |= KVM_LARCH_SWCSR_LATEST; out: From c43dce6f13fb12144571c168c7a593e5e546f3b5 Mon Sep 17 00:00:00 2001 From: Bibo Mao Date: Thu, 9 Apr 2026 18:56:36 +0800 Subject: [PATCH 356/373] LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function vcpu_is_preempted() is performance sensitive: it is called from osq_lock(), so make it a macro. That way the parameter is not evaluated most of the time, which avoids cache line thrashing across NUMA nodes.
Here is part of a UnixBench result on a Loongson-3C5000 dual-way machine with 32 cores and 2 NUMA nodes:

           original   inline    macro
  execl    7025.7     6991.2    7242.3
  fstime   474.6      703.1     1071

From the test result, making vcpu_is_preempted() a macro is best, and there is some improvement compared with the original function method. Signed-off-by: Bibo Mao Signed-off-by: Huacai Chen --- arch/loongarch/include/asm/qspinlock.h | 26 ++++++++++++++++++++++---- arch/loongarch/kernel/paravirt.c | 16 ++-------------- 2 files changed, 24 insertions(+), 18 deletions(-) diff --git a/arch/loongarch/include/asm/qspinlock.h b/arch/loongarch/include/asm/qspinlock.h index 66244801db67..0ee15b3b3937 100644 --- a/arch/loongarch/include/asm/qspinlock.h +++ b/arch/loongarch/include/asm/qspinlock.h @@ -2,11 +2,13 @@ #ifndef _ASM_LOONGARCH_QSPINLOCK_H #define _ASM_LOONGARCH_QSPINLOCK_H +#include #include #ifdef CONFIG_PARAVIRT - +DECLARE_STATIC_KEY_FALSE(virt_preempt_key); DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key); +DECLARE_PER_CPU(struct kvm_steal_time, steal_time); #define virt_spin_lock virt_spin_lock @@ -34,9 +36,25 @@ __retry: return true; } -#define vcpu_is_preempted vcpu_is_preempted - -bool vcpu_is_preempted(int cpu); +/* + * Macro is better than inline function here + * With macro, parameter cpu is parsed only when it is used. + * With inline function, parameter cpu is parsed even though it is not used. + * This may cause cache line thrashing across NUMA nodes.
+ */ +#define vcpu_is_preempted(cpu) \ +({ \ + bool __val; \ + \ + if (!static_branch_unlikely(&virt_preempt_key)) \ + __val = false; \ + else { \ + struct kvm_steal_time *src; \ + src = &per_cpu(steal_time, cpu); \ + __val = !!(READ_ONCE(src->preempted) & KVM_VCPU_PREEMPTED); \ + } \ + __val; \ +}) #endif /* CONFIG_PARAVIRT */ diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c index b74fe6db49ab..10821cce554c 100644 --- a/arch/loongarch/kernel/paravirt.c +++ b/arch/loongarch/kernel/paravirt.c @@ -10,9 +10,9 @@ #include static int has_steal_clock; -static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); -static DEFINE_STATIC_KEY_FALSE(virt_preempt_key); +DEFINE_STATIC_KEY_FALSE(virt_preempt_key); DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key); +DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); static bool steal_acc = true; @@ -260,18 +260,6 @@ static int pv_time_cpu_down_prepare(unsigned int cpu) return 0; } - -bool vcpu_is_preempted(int cpu) -{ - struct kvm_steal_time *src; - - if (!static_branch_unlikely(&virt_preempt_key)) - return false; - - src = &per_cpu(steal_time, cpu); - return !!(src->preempted & KVM_VCPU_PREEMPTED); -} -EXPORT_SYMBOL(vcpu_is_preempted); #endif static void pv_cpu_reboot(void *unused) From 229132c309d667bb05405fc8b539e7d90e0dfb3b Mon Sep 17 00:00:00 2001 From: Song Gao Date: Thu, 9 Apr 2026 18:56:37 +0800 Subject: [PATCH 357/373] LoongArch: KVM: Add DMSINTC device support Add device model for DMSINTC interrupt controller, implement basic create/destroy/set_attr interfaces, and register device model to kvm device table. 
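A subtle part of the new device's set_attr handler is deriving the usable CPU-index mask from the alignment of `msg_addr_base`: the position of the base address's lowest set bit bounds how many bits above `AVEC_CPU_SHIFT` can encode a CPU index. A userspace sketch of that derivation, using placeholder values for `AVEC_CPU_SHIFT`, `AVEC_CPU_BIT` and `AVEC_CPU_MASK` (the real constants come from the LoongArch headers and may differ):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define AVEC_CPU_SHIFT 4
#define AVEC_CPU_BIT   8
#define AVEC_CPU_MASK  ((1u << AVEC_CPU_BIT) - 1)

/* Mirrors the mask computation in kvm_dmsintc_ctrl_access(): the caller has
 * already checked that msg_addr_base is aligned to at least
 * BIT(AVEC_CPU_SHIFT), so the subtraction cannot go negative. */
static uint32_t dmsintc_cpu_mask(uint64_t msg_addr_base)
{
	unsigned int cpu_bit = __builtin_ctzll(msg_addr_base) - AVEC_CPU_SHIFT;

	if (cpu_bit > AVEC_CPU_BIT)
		cpu_bit = AVEC_CPU_BIT;
	/* GENMASK(cpu_bit - 1, 0) & AVEC_CPU_MASK */
	return ((1u << cpu_bit) - 1) & AVEC_CPU_MASK;
}
```

In words: a base aligned on a large boundary leaves many low address bits free for CPU encoding and yields a wide mask, while a minimally aligned base yields a mask of zero.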
Reviewed-by: Bibo Mao Signed-off-by: Song Gao Signed-off-by: Huacai Chen --- arch/loongarch/include/asm/kvm_dmsintc.h | 24 +++++ arch/loongarch/include/asm/kvm_host.h | 3 + arch/loongarch/include/uapi/asm/kvm.h | 4 + arch/loongarch/kvm/Makefile | 1 + arch/loongarch/kvm/intc/dmsintc.c | 108 +++++++++++++++++++++++ arch/loongarch/kvm/main.c | 6 ++ include/uapi/linux/kvm.h | 2 + 7 files changed, 148 insertions(+) create mode 100644 arch/loongarch/include/asm/kvm_dmsintc.h create mode 100644 arch/loongarch/kvm/intc/dmsintc.c diff --git a/arch/loongarch/include/asm/kvm_dmsintc.h b/arch/loongarch/include/asm/kvm_dmsintc.h new file mode 100644 index 000000000000..3c5ec9805ed4 --- /dev/null +++ b/arch/loongarch/include/asm/kvm_dmsintc.h @@ -0,0 +1,24 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2025 Loongson Technology Corporation Limited + */ + +#ifndef __ASM_KVM_DMSINTC_H +#define __ASM_KVM_DMSINTC_H + +#include + +struct loongarch_dmsintc { + struct kvm *kvm; + uint64_t msg_addr_base; + uint64_t msg_addr_size; + uint32_t cpu_mask; +}; + +struct dmsintc_state { + atomic64_t vector_map[4]; +}; + +int kvm_loongarch_register_dmsintc_device(void); + +#endif diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h index 19eb5e5c3984..130cedbb6b39 100644 --- a/arch/loongarch/include/asm/kvm_host.h +++ b/arch/loongarch/include/asm/kvm_host.h @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -133,6 +134,7 @@ struct kvm_arch { s64 time_offset; struct kvm_context __percpu *vmcs; struct loongarch_ipi *ipi; + struct loongarch_dmsintc *dmsintc; struct loongarch_eiointc *eiointc; struct loongarch_pch_pic *pch_pic; }; @@ -247,6 +249,7 @@ struct kvm_vcpu_arch { struct kvm_mp_state mp_state; /* ipi state */ struct ipi_state ipi_state; + struct dmsintc_state dmsintc_state; /* cpucfg */ u32 cpucfg[KVM_MAX_CPUCFG_REGS]; diff --git a/arch/loongarch/include/uapi/asm/kvm.h 
b/arch/loongarch/include/uapi/asm/kvm.h index 419647aacdf3..cd0b5c11ca9c 100644 --- a/arch/loongarch/include/uapi/asm/kvm.h +++ b/arch/loongarch/include/uapi/asm/kvm.h @@ -155,4 +155,8 @@ struct kvm_iocsr_entry { #define KVM_DEV_LOONGARCH_PCH_PIC_GRP_CTRL 0x40000006 #define KVM_DEV_LOONGARCH_PCH_PIC_CTRL_INIT 0 +#define KVM_DEV_LOONGARCH_DMSINTC_GRP_CTRL 0x40000007 +#define KVM_DEV_LOONGARCH_DMSINTC_MSG_ADDR_BASE 0x0 +#define KVM_DEV_LOONGARCH_DMSINTC_MSG_ADDR_SIZE 0x1 + #endif /* __UAPI_ASM_LOONGARCH_KVM_H */ diff --git a/arch/loongarch/kvm/Makefile b/arch/loongarch/kvm/Makefile index cb41d9265662..ae469edec99c 100644 --- a/arch/loongarch/kvm/Makefile +++ b/arch/loongarch/kvm/Makefile @@ -17,6 +17,7 @@ kvm-y += tlb.o kvm-y += vcpu.o kvm-y += vm.o kvm-y += intc/ipi.o +kvm-y += intc/dmsintc.o kvm-y += intc/eiointc.o kvm-y += intc/pch_pic.o kvm-y += irqfd.o diff --git a/arch/loongarch/kvm/intc/dmsintc.c b/arch/loongarch/kvm/intc/dmsintc.c new file mode 100644 index 000000000000..8f0b91eb95dc --- /dev/null +++ b/arch/loongarch/kvm/intc/dmsintc.c @@ -0,0 +1,108 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2025 Loongson Technology Corporation Limited + */ + +#include +#include +#include + +static int kvm_dmsintc_ctrl_access(struct kvm_device *dev, + struct kvm_device_attr *attr, bool is_write) +{ + int addr = attr->attr; + unsigned long cpu_bit, val; + void __user *data = (void __user *)attr->addr; + struct loongarch_dmsintc *s = dev->kvm->arch.dmsintc; + + switch (addr) { + case KVM_DEV_LOONGARCH_DMSINTC_MSG_ADDR_BASE: + if (is_write) { + if (copy_from_user(&val, data, sizeof(s->msg_addr_base))) + return -EFAULT; + if (s->msg_addr_base) + return -EFAULT; /* Duplicate setting are not allowed. 
*/ + if ((val & (BIT(AVEC_CPU_SHIFT) - 1)) != 0) + return -EINVAL; + s->msg_addr_base = val; + cpu_bit = find_first_bit((unsigned long *)&(s->msg_addr_base), 64) - AVEC_CPU_SHIFT; + cpu_bit = min(cpu_bit, AVEC_CPU_BIT); + s->cpu_mask = GENMASK(cpu_bit - 1, 0) & AVEC_CPU_MASK; + } + break; + case KVM_DEV_LOONGARCH_DMSINTC_MSG_ADDR_SIZE: + if (is_write) { + if (copy_from_user(&val, data, sizeof(s->msg_addr_size))) + return -EFAULT; + if (s->msg_addr_size) + return -EFAULT; /*Duplicate setting are not allowed. */ + s->msg_addr_size = val; + } + break; + default: + kvm_err("%s: unknown dmsintc register, addr = %d\n", __func__, addr); + return -ENXIO; + } + + return 0; +} + +static int kvm_dmsintc_set_attr(struct kvm_device *dev, + struct kvm_device_attr *attr) +{ + switch (attr->group) { + case KVM_DEV_LOONGARCH_DMSINTC_GRP_CTRL: + return kvm_dmsintc_ctrl_access(dev, attr, true); + default: + kvm_err("%s: unknown group (%d)\n", __func__, attr->group); + return -EINVAL; + } +} + +static int kvm_dmsintc_create(struct kvm_device *dev, u32 type) +{ + struct kvm *kvm; + struct loongarch_dmsintc *s; + + if (!dev) { + kvm_err("%s: kvm_device ptr is invalid!\n", __func__); + return -EINVAL; + } + + kvm = dev->kvm; + if (kvm->arch.dmsintc) { + kvm_err("%s: LoongArch DMSINTC has already been created!\n", __func__); + return -EINVAL; + } + + s = kzalloc(sizeof(struct loongarch_dmsintc), GFP_KERNEL); + if (!s) + return -ENOMEM; + + s->kvm = kvm; + kvm->arch.dmsintc = s; + + return 0; +} + +static void kvm_dmsintc_destroy(struct kvm_device *dev) +{ + + if (!dev || !dev->kvm || !dev->kvm->arch.dmsintc) + return; + + kfree(dev->kvm->arch.dmsintc); + kfree(dev); +} + +static struct kvm_device_ops kvm_dmsintc_dev_ops = { + .name = "kvm-loongarch-dmsintc", + .create = kvm_dmsintc_create, + .destroy = kvm_dmsintc_destroy, + .set_attr = kvm_dmsintc_set_attr, +}; + +int kvm_loongarch_register_dmsintc_device(void) +{ + return kvm_register_device_ops(&kvm_dmsintc_dev_ops, 
KVM_DEV_TYPE_LOONGARCH_DMSINTC); +} diff --git a/arch/loongarch/kvm/main.c b/arch/loongarch/kvm/main.c index 304c83863e71..76ebff2faedd 100644 --- a/arch/loongarch/kvm/main.c +++ b/arch/loongarch/kvm/main.c @@ -416,6 +416,12 @@ static int kvm_loongarch_env_init(void) /* Register LoongArch PCH-PIC interrupt controller interface. */ ret = kvm_loongarch_register_pch_pic_device(); + if (ret) + return ret; + + /* Register LoongArch DMSINTC interrupt controller interface */ + if (cpu_has_msgint) + ret = kvm_loongarch_register_dmsintc_device(); return ret; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 80364d4dbebb..9e7887230bdd 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1224,6 +1224,8 @@ enum kvm_device_type { #define KVM_DEV_TYPE_LOONGARCH_EIOINTC KVM_DEV_TYPE_LOONGARCH_EIOINTC KVM_DEV_TYPE_LOONGARCH_PCHPIC, #define KVM_DEV_TYPE_LOONGARCH_PCHPIC KVM_DEV_TYPE_LOONGARCH_PCHPIC + KVM_DEV_TYPE_LOONGARCH_DMSINTC, +#define KVM_DEV_TYPE_LOONGARCH_DMSINTC KVM_DEV_TYPE_LOONGARCH_DMSINTC KVM_DEV_TYPE_MAX, From 03de5eecb0f0f68f29086bc0075c7fd597bf4e4a Mon Sep 17 00:00:00 2001 From: Song Gao Date: Thu, 9 Apr 2026 18:56:37 +0800 Subject: [PATCH 358/373] LoongArch: KVM: Add DMSINTC inject msi to vCPU Implement irqfd delivery of MSIs to the vCPU and DMSINTC irq injection into the vCPU. Make pch_msi_set_irq() choose DMSINTC based on the MSI message address, and implement DMSINTC MSI irq setting.
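The address decode in the new dmsintc_set_irq() splits the MSI message address into a target CPU index and an interrupt vector. A hedged userspace sketch of that split (the shift and mask values here are placeholders, not the architectural `AVEC_*` constants, and `decode_msg_addr` is a name invented for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder layout for illustration only. */
#define AVEC_CPU_SHIFT 4
#define AVEC_IRQ_SHIFT 12
#define AVEC_IRQ_MASK  0xffu

/* Mirrors the field extraction in dmsintc_set_irq(): the vector sits in the
 * high field of the message address, the CPU index in the low field, bounded
 * by the cpu_mask derived from the configured msg_addr_base alignment. */
static void decode_msg_addr(uint64_t addr, uint32_t cpu_mask,
			    unsigned int *cpu, unsigned int *irq)
{
	*irq = (addr >> AVEC_IRQ_SHIFT) & AVEC_IRQ_MASK;
	*cpu = (addr >> AVEC_CPU_SHIFT) & cpu_mask;
}
```

After the decode, the real code looks up the vCPU by CPU index and hands the vector to dmsintc_deliver_msi_to_vcpu() for injection.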
Signed-off-by: Song Gao Signed-off-by: Huacai Chen --- arch/loongarch/include/asm/kvm_dmsintc.h | 3 + arch/loongarch/include/asm/kvm_pch_pic.h | 3 +- arch/loongarch/kvm/intc/dmsintc.c | 74 ++++++++++++++++++++++++ arch/loongarch/kvm/intc/pch_pic.c | 15 ++++- arch/loongarch/kvm/interrupt.c | 2 + arch/loongarch/kvm/irqfd.c | 10 ++-- 6 files changed, 99 insertions(+), 8 deletions(-) diff --git a/arch/loongarch/include/asm/kvm_dmsintc.h b/arch/loongarch/include/asm/kvm_dmsintc.h index 3c5ec9805ed4..5a71b9ccbe78 100644 --- a/arch/loongarch/include/asm/kvm_dmsintc.h +++ b/arch/loongarch/include/asm/kvm_dmsintc.h @@ -20,5 +20,8 @@ struct dmsintc_state { }; int kvm_loongarch_register_dmsintc_device(void); +void dmsintc_inject_irq(struct kvm_vcpu *vcpu); +int dmsintc_set_irq(struct kvm *kvm, u64 addr, int data, int level); +int dmsintc_deliver_msi_to_vcpu(struct kvm *kvm, struct kvm_vcpu *vcpu, u32 vector, int level); #endif diff --git a/arch/loongarch/include/asm/kvm_pch_pic.h b/arch/loongarch/include/asm/kvm_pch_pic.h index 7f33a3039272..e74b3b742634 100644 --- a/arch/loongarch/include/asm/kvm_pch_pic.h +++ b/arch/loongarch/include/asm/kvm_pch_pic.h @@ -68,8 +68,9 @@ struct loongarch_pch_pic { uint64_t pch_pic_base; }; +struct kvm_kernel_irq_routing_entry; int kvm_loongarch_register_pch_pic_device(void); void pch_pic_set_irq(struct loongarch_pch_pic *s, int irq, int level); -void pch_msi_set_irq(struct kvm *kvm, int irq, int level); +int pch_msi_set_irq(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e, int level); #endif /* __ASM_KVM_PCH_PIC_H */ diff --git a/arch/loongarch/kvm/intc/dmsintc.c b/arch/loongarch/kvm/intc/dmsintc.c index 8f0b91eb95dc..de25735ce039 100644 --- a/arch/loongarch/kvm/intc/dmsintc.c +++ b/arch/loongarch/kvm/intc/dmsintc.c @@ -4,9 +4,83 @@ */ #include +#include #include #include +void dmsintc_inject_irq(struct kvm_vcpu *vcpu) +{ + unsigned int i; + unsigned long vector[4], old; + struct dmsintc_state *ds = &vcpu->arch.dmsintc_state; + + if 
(!ds) + return; + + for (i = 0; i < 4; i++) { + old = atomic64_read(&(ds->vector_map[i])); + if (old) + vector[i] = atomic64_xchg(&(ds->vector_map[i]), 0); + } + + if (vector[0]) { + old = kvm_read_hw_gcsr(LOONGARCH_CSR_ISR0); + kvm_write_hw_gcsr(LOONGARCH_CSR_ISR0, vector[0] | old); + } + + if (vector[1]) { + old = kvm_read_hw_gcsr(LOONGARCH_CSR_ISR1); + kvm_write_hw_gcsr(LOONGARCH_CSR_ISR1, vector[1] | old); + } + + if (vector[2]) { + old = kvm_read_hw_gcsr(LOONGARCH_CSR_ISR2); + kvm_write_hw_gcsr(LOONGARCH_CSR_ISR2, vector[2] | old); + } + + if (vector[3]) { + old = kvm_read_hw_gcsr(LOONGARCH_CSR_ISR3); + kvm_write_hw_gcsr(LOONGARCH_CSR_ISR3, vector[3] | old); + } +} + +int dmsintc_deliver_msi_to_vcpu(struct kvm *kvm, + struct kvm_vcpu *vcpu, u32 vector, int level) +{ + struct kvm_interrupt vcpu_irq; + struct dmsintc_state *ds = &vcpu->arch.dmsintc_state; + + if (!level) + return 0; + if (!vcpu || vector >= 256) + return -EINVAL; + if (!ds) + return -ENODEV; + + vcpu_irq.irq = INT_AVEC; + set_bit(vector, (unsigned long *)&ds->vector_map); + kvm_vcpu_ioctl_interrupt(vcpu, &vcpu_irq); + kvm_vcpu_kick(vcpu); + + return 0; +} + +int dmsintc_set_irq(struct kvm *kvm, u64 addr, int data, int level) +{ + unsigned int irq, cpu; + struct kvm_vcpu *vcpu; + + irq = (addr >> AVEC_IRQ_SHIFT) & AVEC_IRQ_MASK; + cpu = (addr >> AVEC_CPU_SHIFT) & kvm->arch.dmsintc->cpu_mask; + if (cpu >= KVM_MAX_VCPUS) + return -EINVAL; + vcpu = kvm_get_vcpu_by_cpuid(kvm, cpu); + if (!vcpu) + return -EINVAL; + + return dmsintc_deliver_msi_to_vcpu(kvm, vcpu, irq, level); +} + static int kvm_dmsintc_ctrl_access(struct kvm_device *dev, struct kvm_device_attr *attr, bool is_write) { diff --git a/arch/loongarch/kvm/intc/pch_pic.c b/arch/loongarch/kvm/intc/pch_pic.c index dd7e7f8d53db..aa0ed59ae8cf 100644 --- a/arch/loongarch/kvm/intc/pch_pic.c +++ b/arch/loongarch/kvm/intc/pch_pic.c @@ -3,6 +3,7 @@ * Copyright (C) 2024 Loongson Technology Corporation Limited */ +#include #include #include #include @@ 
-67,9 +68,19 @@ void pch_pic_set_irq(struct loongarch_pch_pic *s, int irq, int level) } /* msi irq handler */ -void pch_msi_set_irq(struct kvm *kvm, int irq, int level) +int pch_msi_set_irq(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e, int level) { - eiointc_set_irq(kvm->arch.eiointc, irq, level); + u64 msg_addr = (((u64)e->msi.address_hi) << 32) | e->msi.address_lo; + + if (cpu_has_msgint && kvm->arch.dmsintc && + msg_addr >= kvm->arch.dmsintc->msg_addr_base && + msg_addr < (kvm->arch.dmsintc->msg_addr_base + kvm->arch.dmsintc->msg_addr_size)) { + return dmsintc_set_irq(kvm, msg_addr, e->msi.data, level); + } + + eiointc_set_irq(kvm->arch.eiointc, e->msi.data, level); + + return 0; } static int loongarch_pch_pic_read(struct loongarch_pch_pic *s, gpa_t addr, int len, void *val) diff --git a/arch/loongarch/kvm/interrupt.c b/arch/loongarch/kvm/interrupt.c index fb704f4c8ac5..32930959f7c2 100644 --- a/arch/loongarch/kvm/interrupt.c +++ b/arch/loongarch/kvm/interrupt.c @@ -7,6 +7,7 @@ #include #include #include +#include static unsigned int priority_to_irq[EXCCODE_INT_NUM] = { [INT_TI] = CPU_TIMER, @@ -33,6 +34,7 @@ static int kvm_irq_deliver(struct kvm_vcpu *vcpu, unsigned int priority) irq = priority_to_irq[priority]; if (kvm_guest_has_msgint(&vcpu->arch) && (priority == INT_AVEC)) { + dmsintc_inject_irq(vcpu); set_gcsr_estat(irq); return 1; } diff --git a/arch/loongarch/kvm/irqfd.c b/arch/loongarch/kvm/irqfd.c index 9a39627aecf0..f4f953b22419 100644 --- a/arch/loongarch/kvm/irqfd.c +++ b/arch/loongarch/kvm/irqfd.c @@ -29,9 +29,7 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e, if (!level) return -1; - pch_msi_set_irq(kvm, e->msi.data, level); - - return 0; + return pch_msi_set_irq(kvm, e, level); } /* @@ -71,13 +69,15 @@ int kvm_set_routing_entry(struct kvm *kvm, int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm, int irq_source_id, int level, bool line_status) { + if (!level) + return -EWOULDBLOCK; + switch 
(e->type) { case KVM_IRQ_ROUTING_IRQCHIP: pch_pic_set_irq(kvm->arch.pch_pic, e->irqchip.pin, level); return 0; case KVM_IRQ_ROUTING_MSI: - pch_msi_set_irq(kvm, e->msi.data, level); - return 0; + return pch_msi_set_irq(kvm, e, level); default: return -EWOULDBLOCK; } From fa19ea9a7bdb97575e05d72305a4c40a3a631357 Mon Sep 17 00:00:00 2001 From: Song Gao Date: Thu, 9 Apr 2026 18:56:37 +0800 Subject: [PATCH 359/373] KVM: LoongArch: selftests: Add cpucfg read/write helpers Add helper macros and functions to read and write CPU configuration registers (cpucfg) from the guest and from the VMM. This interface is required in upcoming selftests for querying and setting CPU features, such as PMU capabilities. Signed-off-by: Song Gao Signed-off-by: Huacai Chen --- .../selftests/kvm/include/loongarch/processor.h | 11 +++++++++++ tools/testing/selftests/kvm/lib/loongarch/processor.c | 8 ++++++++ 2 files changed, 19 insertions(+) diff --git a/tools/testing/selftests/kvm/include/loongarch/processor.h b/tools/testing/selftests/kvm/include/loongarch/processor.h index 76840ddda57d..6c1e59484485 100644 --- a/tools/testing/selftests/kvm/include/loongarch/processor.h +++ b/tools/testing/selftests/kvm/include/loongarch/processor.h @@ -128,6 +128,17 @@ #define CSR_TLBREHI_PS_SHIFT 0 #define CSR_TLBREHI_PS (0x3fUL << CSR_TLBREHI_PS_SHIFT) +#define read_cpucfg(reg) \ +({ \ + register unsigned long __v; \ + __asm__ __volatile__( \ + "cpucfg %0, %1\n\t" \ + : "=r" (__v) \ + : "r" (reg) \ + : "memory"); \ + __v; \ +}) + #define csr_read(csr) \ ({ \ register unsigned long __v; \ diff --git a/tools/testing/selftests/kvm/lib/loongarch/processor.c b/tools/testing/selftests/kvm/lib/loongarch/processor.c index 17aa55a2047a..0ad4544517e9 100644 --- a/tools/testing/selftests/kvm/lib/loongarch/processor.c +++ b/tools/testing/selftests/kvm/lib/loongarch/processor.c @@ -251,6 +251,14 @@ static void loongarch_set_reg(struct kvm_vcpu *vcpu, uint64_t id, uint64_t val) __vcpu_set_reg(vcpu, id, val); } +static 
void loongarch_set_cpucfg(struct kvm_vcpu *vcpu, uint64_t id, uint64_t val) +{ + uint64_t cfgid; + + cfgid = KVM_REG_LOONGARCH_CPUCFG | KVM_REG_SIZE_U64 | 8 * id; + __vcpu_set_reg(vcpu, cfgid, val); +} + static void loongarch_get_csr(struct kvm_vcpu *vcpu, uint64_t id, void *addr) { uint64_t csrid; From 11c840192768a5a63b6aed75273c5e8e416230ee Mon Sep 17 00:00:00 2001 From: Song Gao Date: Thu, 9 Apr 2026 18:56:37 +0800 Subject: [PATCH 360/373] KVM: LoongArch: selftests: Add basic PMU event counting test Introduce a basic PMU test that verifies hardware event counting for four performance counters. The test enables the events for CPU cycles, instructions retired, branch instructions, and branch misses, runs a fixed number of loops, and checks that the counter values fall within expected ranges. It also validates that the host supports PMU and that the VM feature is enabled. Signed-off-by: Song Gao Signed-off-by: Huacai Chen --- tools/testing/selftests/kvm/Makefile.kvm | 3 +- .../selftests/kvm/include/loongarch/pmu.h | 45 +++++ .../kvm/include/loongarch/processor.h | 1 + .../selftests/kvm/lib/loongarch/processor.c | 7 +- .../selftests/kvm/loongarch/pmu_test.c | 169 ++++++++++++++++++ 5 files changed, 223 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/kvm/include/loongarch/pmu.h create mode 100644 tools/testing/selftests/kvm/loongarch/pmu_test.c diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index 6471fa214a9f..502c99258bd1 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -222,7 +222,8 @@ TEST_GEN_PROGS_riscv += mmu_stress_test TEST_GEN_PROGS_riscv += rseq_test TEST_GEN_PROGS_riscv += steal_time -TEST_GEN_PROGS_loongarch = arch_timer +TEST_GEN_PROGS_loongarch = loongarch/pmu_test +TEST_GEN_PROGS_loongarch += arch_timer TEST_GEN_PROGS_loongarch += coalesced_io_test TEST_GEN_PROGS_loongarch += demand_paging_test TEST_GEN_PROGS_loongarch += 
dirty_log_perf_test diff --git a/tools/testing/selftests/kvm/include/loongarch/pmu.h b/tools/testing/selftests/kvm/include/loongarch/pmu.h new file mode 100644 index 000000000000..2f734a1d1ae4 --- /dev/null +++ b/tools/testing/selftests/kvm/include/loongarch/pmu.h @@ -0,0 +1,45 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * LoongArch PMU specific interface + */ +#ifndef SELFTEST_KVM_PMU_H +#define SELFTEST_KVM_PMU_H + +#include "processor.h" + +#define LOONGARCH_CPUCFG6 0x6 +#define CPUCFG6_PMP BIT(0) +#define CPUCFG6_PAMVER GENMASK(3, 1) +#define CPUCFG6_PMNUM GENMASK(7, 4) +#define CPUCFG6_PMNUM_SHIFT 4 +#define CPUCFG6_PMBITS GENMASK(13, 8) +#define CPUCFG6_PMBITS_SHIFT 8 +#define CPUCFG6_UPM BIT(14) + +/* Performance Counter registers */ +#define LOONGARCH_CSR_PERFCTRL0 0x200 /* perf event 0 config */ +#define LOONGARCH_CSR_PERFCNTR0 0x201 /* perf event 0 count value */ +#define LOONGARCH_CSR_PERFCTRL1 0x202 /* perf event 1 config */ +#define LOONGARCH_CSR_PERFCNTR1 0x203 /* perf event 1 count value */ +#define LOONGARCH_CSR_PERFCTRL2 0x204 /* perf event 2 config */ +#define LOONGARCH_CSR_PERFCNTR2 0x205 /* perf event 2 count value */ +#define LOONGARCH_CSR_PERFCTRL3 0x206 /* perf event 3 config */ +#define LOONGARCH_CSR_PERFCNTR3 0x207 /* perf event 3 count value */ +#define CSR_PERFCTRL_PLV0 BIT(16) +#define CSR_PERFCTRL_PLV1 BIT(17) +#define CSR_PERFCTRL_PLV2 BIT(18) +#define CSR_PERFCTRL_PLV3 BIT(19) +#define PMU_ENVENT_ENABLED (CSR_PERFCTRL_PLV0 | CSR_PERFCTRL_PLV1 | CSR_PERFCTRL_PLV2 | CSR_PERFCTRL_PLV3) + +/* Hardware event codes (from LoongArch perf_event.c */ +#define LOONGARCH_PMU_EVENT_CYCLES 0x00 /* CPU cycles */ +#define LOONGARCH_PMU_EVENT_INSTR_RETIRED 0x01 /* Instructions retired */ +#define PERF_COUNT_HW_BRANCH_INSTRUCTIONS 0x02 /* Branch instructions */ +#define PERF_COUNT_HW_BRANCH_MISSES 0x03 /* Branch misses */ + +#define NUM_LOOPS 1000 +#define EXPECTED_INSTR_MIN (NUM_LOOPS + 10) /* Loop + overhead */ +#define EXPECTED_CYCLES_MIN 
NUM_LOOPS /* At least 1 cycle per iteration */ +#define UPPER_BOUND (10 * NUM_LOOPS) + +#endif diff --git a/tools/testing/selftests/kvm/include/loongarch/processor.h b/tools/testing/selftests/kvm/include/loongarch/processor.h index 6c1e59484485..916426707c86 100644 --- a/tools/testing/selftests/kvm/include/loongarch/processor.h +++ b/tools/testing/selftests/kvm/include/loongarch/processor.h @@ -189,6 +189,7 @@ struct handlers { handler_fn exception_handlers[VECTOR_NUM]; }; +void loongarch_vcpu_setup(struct kvm_vcpu *vcpu); void vm_init_descriptor_tables(struct kvm_vm *vm); void vm_install_exception_handler(struct kvm_vm *vm, int vector, handler_fn handler); diff --git a/tools/testing/selftests/kvm/lib/loongarch/processor.c b/tools/testing/selftests/kvm/lib/loongarch/processor.c index 0ad4544517e9..ee4ad3b1d2a4 100644 --- a/tools/testing/selftests/kvm/lib/loongarch/processor.c +++ b/tools/testing/selftests/kvm/lib/loongarch/processor.c @@ -5,6 +5,7 @@ #include #include "kvm_util.h" +#include "pmu.h" #include "processor.h" #include "ucall_common.h" @@ -275,9 +276,10 @@ static void loongarch_set_csr(struct kvm_vcpu *vcpu, uint64_t id, uint64_t val) __vcpu_set_reg(vcpu, csrid, val); } -static void loongarch_vcpu_setup(struct kvm_vcpu *vcpu) +void loongarch_vcpu_setup(struct kvm_vcpu *vcpu) { int width; + unsigned int cfg; unsigned long val; struct kvm_vm *vm = vcpu->vm; @@ -290,6 +292,9 @@ static void loongarch_vcpu_setup(struct kvm_vcpu *vcpu) TEST_FAIL("Unknown guest mode, mode: 0x%x", vm->mode); } + cfg = read_cpucfg(LOONGARCH_CPUCFG6); + loongarch_set_cpucfg(vcpu, LOONGARCH_CPUCFG6, cfg); + /* kernel mode and page enable mode */ val = PLV_KERN | CSR_CRMD_PG; loongarch_set_csr(vcpu, LOONGARCH_CSR_CRMD, val); diff --git a/tools/testing/selftests/kvm/loongarch/pmu_test.c b/tools/testing/selftests/kvm/loongarch/pmu_test.c new file mode 100644 index 000000000000..c0f25976b160 --- /dev/null +++ b/tools/testing/selftests/kvm/loongarch/pmu_test.c @@ -0,0 +1,169 @@ +// 
SPDX-License-Identifier: GPL-2.0 +/* + * LoongArch KVM PMU event counting test + * + * Test hardware event counting: CPU_CYCLES, INSTR_RETIRED, + * BRANCH_INSTRUCTIONS and BRANCH_MISSES. + */ +#include +#include "kvm_util.h" +#include "pmu.h" +#include "loongarch/processor.h" + +/* Check PMU support */ +static bool has_pmu_support(void) +{ + uint32_t cfg6; + + /* Read CPUCFG6 to check PMU */ + cfg6 = read_cpucfg(LOONGARCH_CPUCFG6); + + /* Check PMU present bit */ + if (!(cfg6 & CPUCFG6_PMP)) + return false; + + /* Check that at least one counter exists */ + if (((cfg6 & CPUCFG6_PMNUM) >> CPUCFG6_PMNUM_SHIFT) == 0) + return false; + + return true; +} + +/* Dump PMU capabilities */ +static void dump_pmu_caps(void) +{ + uint32_t cfg6; + int nr_counters, counter_bits; + + cfg6 = read_cpucfg(LOONGARCH_CPUCFG6); + nr_counters = ((cfg6 & CPUCFG6_PMNUM) >> CPUCFG6_PMNUM_SHIFT) + 1; + counter_bits = ((cfg6 & CPUCFG6_PMBITS) >> CPUCFG6_PMBITS_SHIFT) + 1; + + pr_info("PMU capabilities:\n"); + pr_info(" Counters present: %s\n", cfg6 & CPUCFG6_PMP ? "yes" : "no"); + pr_info(" Number of counters: %d\n", nr_counters); + pr_info(" Counter width: %d bits\n", counter_bits); +} + +/* Guest test code - runs inside VM */ +static void guest_pmu_base_test(void) +{ + int i; + uint32_t cfg6, pmnum; + uint64_t cnt[4]; + + cfg6 = read_cpucfg(LOONGARCH_CPUCFG6); + pmnum = (cfg6 >> 4) & 0xf; + GUEST_PRINTF("CPUCFG6 = 0x%x\n", cfg6); + GUEST_PRINTF("PMP enabled: %s\n", (cfg6 & 0x1) ? 
"YES" : "NO"); + GUEST_PRINTF("Number of counters (PMNUM): %x\n", pmnum + 1); + GUEST_ASSERT(pmnum == 3); + + GUEST_PRINTF("Clean csr_perfcntr0-3\n"); + csr_write(0, LOONGARCH_CSR_PERFCNTR0); + csr_write(0, LOONGARCH_CSR_PERFCNTR1); + csr_write(0, LOONGARCH_CSR_PERFCNTR2); + csr_write(0, LOONGARCH_CSR_PERFCNTR3); + GUEST_PRINTF("Set csr_perfctrl0 for cycles event\n"); + csr_write(PMU_ENVENT_ENABLED | + LOONGARCH_PMU_EVENT_CYCLES, LOONGARCH_CSR_PERFCTRL0); + GUEST_PRINTF("Set csr_perfctrl1 for instr_retired event\n"); + csr_write(PMU_ENVENT_ENABLED | + LOONGARCH_PMU_EVENT_INSTR_RETIRED, LOONGARCH_CSR_PERFCTRL1); + GUEST_PRINTF("Set csr_perfctrl2 for branch_instructions event\n"); + csr_write(PMU_ENVENT_ENABLED | + PERF_COUNT_HW_BRANCH_INSTRUCTIONS, LOONGARCH_CSR_PERFCTRL2); + GUEST_PRINTF("Set csr_perfctrl3 for branch_misses event\n"); + csr_write(PMU_ENVENT_ENABLED | + PERF_COUNT_HW_BRANCH_MISSES, LOONGARCH_CSR_PERFCTRL3); + + for (i = 0; i < NUM_LOOPS; i++) + cpu_relax(); + + cnt[0] = csr_read(LOONGARCH_CSR_PERFCNTR0); + GUEST_PRINTF("csr_perfcntr0 is %lx\n", cnt[0]); + cnt[1] = csr_read(LOONGARCH_CSR_PERFCNTR1); + GUEST_PRINTF("csr_perfcntr1 is %lx\n", cnt[1]); + cnt[2] = csr_read(LOONGARCH_CSR_PERFCNTR2); + GUEST_PRINTF("csr_perfcntr2 is %lx\n", cnt[2]); + cnt[3] = csr_read(LOONGARCH_CSR_PERFCNTR3); + GUEST_PRINTF("csr_perfcntr3 is %lx\n", cnt[3]); + + GUEST_PRINTF("assert csr_perfcntr0 >EXPECTED_CYCLES_MIN && csr_perfcntr0 < UPPER_BOUND\n"); + GUEST_ASSERT(cnt[0] > EXPECTED_CYCLES_MIN && cnt[0] < UPPER_BOUND); + GUEST_PRINTF("assert csr_perfcntr1 > EXPECTED_INSTR_MIN && csr_perfcntr1 < UPPER_BOUND\n"); + GUEST_ASSERT(cnt[1] > EXPECTED_INSTR_MIN && cnt[1] < UPPER_BOUND); + GUEST_PRINTF("assert csr_perfcntr2 > 0 && csr_perfcntr2 < UPPER_BOUND\n"); + GUEST_ASSERT(cnt[2] > 0 && cnt[2] < UPPER_BOUND); + GUEST_PRINTF("assert csr_perfcntr3 > 0 && csr_perfcntr3 < UPPER_BOUND\n"); + GUEST_ASSERT(cnt[3] > 0 && cnt[3] < UPPER_BOUND); +} + +static void guest_code(void) +{ 
+ guest_pmu_base_test(); + + GUEST_DONE(); +} + +int main(int argc, char *argv[]) +{ + int ret = 0; + struct kvm_device_attr attr; + struct kvm_vcpu *vcpu; + struct kvm_vm *vm; + struct ucall uc; + + /* Check host KVM PMU support */ + if (!has_pmu_support()) { + print_skip("PMU not supported by host hardware\n"); + dump_pmu_caps(); + return KSFT_SKIP; + } + pr_info("Host support PMU\n"); + + /* Dump PMU capabilities */ + dump_pmu_caps(); + + vm = vm_create(VM_MODE_P47V47_16K); + vcpu = vm_vcpu_add(vm, 0, guest_code); + + vm_init_descriptor_tables(vm); + loongarch_vcpu_setup(vcpu); + + attr.group = KVM_LOONGARCH_VM_FEAT_CTRL, + attr.attr = KVM_LOONGARCH_VM_FEAT_PMU, + + ret = ioctl(vm->fd, KVM_HAS_DEVICE_ATTR, &attr); + + if (ret == 0) { + pr_info("PMU is enabled in VM\n"); + } else { + print_skip("PMU not enabled by VM config\n"); + return KSFT_SKIP; + } + + while (1) { + vcpu_run(vcpu); + switch (get_ucall(vcpu, &uc)) { + case UCALL_PRINTF: + printf("%s", (const char *)uc.buffer); + break; + case UCALL_DONE: + printf("PMU test PASSED\n"); + goto done; + case UCALL_ABORT: + printf("PMU test FAILED\n"); + ret = -1; + goto done; + default: + printf("Unexpected exit\n"); + ret = -1; + goto done; + } + } + +done: + kvm_vm_free(vm); + return ret; +} From e47b8e1db9a9bbef6765e85b11e87f48e6b56846 Mon Sep 17 00:00:00 2001 From: Song Gao Date: Thu, 9 Apr 2026 18:56:38 +0800 Subject: [PATCH 361/373] KVM: LoongArch: selftests: Add PMU overflow interrupt test Extend the PMU test suite to cover overflow interrupts. The test enables the PMI (Performance Monitor Interrupt), sets counter 0 to one less than the overflow value, and verifies that an interrupt is raised when the counter overflows. A guest interrupt handler checks the interrupt cause and disables further PMU interrupts upon success. 
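The overflow arithmetic the test relies on can be modeled in a few lines of userspace C (a sketch of the arithmetic only, not the selftest itself):

```c
#include <assert.h>
#include <stdint.h>

#define PMU_OVERFLOW (1ULL << 63)

/* Models the test setup: the counter is preloaded to PMU_OVERFLOW - 1, one
 * event short of the overflow bit, so a single counted event sets bit 63,
 * which is the condition that raises the PMI. */
static int counter_overflowed(uint64_t start, uint64_t events)
{
	return ((start + events) & PMU_OVERFLOW) != 0;
}
```

This is why the guest only needs a single `cpu_relax()` after programming the counter: any counted cycle pushes the value past the overflow boundary and the interrupt handler observes `INT_PMI` in ESTAT.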
Signed-off-by: Song Gao Signed-off-by: Huacai Chen --- .../selftests/kvm/include/loongarch/pmu.h | 21 +++++++++++ .../kvm/include/loongarch/processor.h | 3 ++ .../selftests/kvm/loongarch/pmu_test.c | 36 +++++++++++++++++++ 3 files changed, 60 insertions(+) diff --git a/tools/testing/selftests/kvm/include/loongarch/pmu.h b/tools/testing/selftests/kvm/include/loongarch/pmu.h index 2f734a1d1ae4..478e6a9bbb2b 100644 --- a/tools/testing/selftests/kvm/include/loongarch/pmu.h +++ b/tools/testing/selftests/kvm/include/loongarch/pmu.h @@ -29,6 +29,7 @@ #define CSR_PERFCTRL_PLV1 BIT(17) #define CSR_PERFCTRL_PLV2 BIT(18) #define CSR_PERFCTRL_PLV3 BIT(19) +#define CSR_PERFCTRL_PMIE BIT(20) #define PMU_ENVENT_ENABLED (CSR_PERFCTRL_PLV0 | CSR_PERFCTRL_PLV1 | CSR_PERFCTRL_PLV2 | CSR_PERFCTRL_PLV3) /* Hardware event codes (from LoongArch perf_event.c */ @@ -42,4 +43,24 @@ #define EXPECTED_CYCLES_MIN NUM_LOOPS /* At least 1 cycle per iteration */ #define UPPER_BOUND (10 * NUM_LOOPS) +#define PMU_OVERFLOW (1ULL << 63) + +static inline void pmu_irq_enable(void) +{ + unsigned long val; + + val = csr_read(LOONGARCH_CSR_ECFG); + val |= ECFGF_PMU; + csr_write(val, LOONGARCH_CSR_ECFG); +} + +static inline void pmu_irq_disable(void) +{ + unsigned long val; + + val = csr_read(LOONGARCH_CSR_ECFG); + val &= ~ECFGF_PMU; + csr_write(val, LOONGARCH_CSR_ECFG); +} + #endif diff --git a/tools/testing/selftests/kvm/include/loongarch/processor.h b/tools/testing/selftests/kvm/include/loongarch/processor.h index 916426707c86..93dc1fbd2e79 100644 --- a/tools/testing/selftests/kvm/include/loongarch/processor.h +++ b/tools/testing/selftests/kvm/include/loongarch/processor.h @@ -83,6 +83,8 @@ #define LOONGARCH_CSR_PRMD 0x1 #define LOONGARCH_CSR_EUEN 0x2 #define LOONGARCH_CSR_ECFG 0x4 +#define ECFGB_PMU 10 +#define ECFGF_PMU (BIT_ULL(ECFGB_PMU)) #define ECFGB_TIMER 11 #define ECFGF_TIMER (BIT_ULL(ECFGB_TIMER)) #define LOONGARCH_CSR_ESTAT 0x5 /* Exception status */ @@ -90,6 +92,7 @@ #define 
CSR_ESTAT_EXC_WIDTH 6 #define CSR_ESTAT_EXC (0x3f << CSR_ESTAT_EXC_SHIFT) #define EXCCODE_INT 0 /* Interrupt */ +#define INT_PMI 10 /* PMU interrupt */ #define INT_TI 11 /* Timer interrupt*/ #define LOONGARCH_CSR_ERA 0x6 /* ERA */ #define LOONGARCH_CSR_BADV 0x7 /* Bad virtual address */ diff --git a/tools/testing/selftests/kvm/loongarch/pmu_test.c b/tools/testing/selftests/kvm/loongarch/pmu_test.c index c0f25976b160..88bb530e336e 100644 --- a/tools/testing/selftests/kvm/loongarch/pmu_test.c +++ b/tools/testing/selftests/kvm/loongarch/pmu_test.c @@ -10,6 +10,8 @@ #include "pmu.h" #include "loongarch/processor.h" +static int pmu_irq_count; + /* Check PMU support */ static bool has_pmu_support(void) { @@ -99,10 +101,41 @@ static void guest_pmu_base_test(void) GUEST_ASSERT(cnt[3] > 0 && cnt[3] < UPPER_BOUND); } +static void guest_irq_handler(struct ex_regs *regs) +{ + unsigned int intid; + + pmu_irq_disable(); + intid = !!(regs->estat & BIT(INT_PMI)); + GUEST_ASSERT_EQ(intid, 1); + GUEST_PRINTF("Get PMU interrupt\n"); + WRITE_ONCE(pmu_irq_count, pmu_irq_count + 1); +} + +static void guest_pmu_interrupt_test(void) +{ + uint64_t cnt; + + csr_write(PMU_OVERFLOW - 1, LOONGARCH_CSR_PERFCNTR0); + csr_write(PMU_ENVENT_ENABLED | CSR_PERFCTRL_PMIE | LOONGARCH_PMU_EVENT_CYCLES, LOONGARCH_CSR_PERFCTRL0); + + cpu_relax(); + + GUEST_ASSERT_EQ(pmu_irq_count, 1); + cnt = csr_read(LOONGARCH_CSR_PERFCNTR0); + GUEST_PRINTF("csr_perfcntr0 is %lx\n", cnt); + GUEST_PRINTF("PMU interrupt test success\n"); + +} + static void guest_code(void) { guest_pmu_base_test(); + pmu_irq_enable(); + local_irq_enable(); + guest_pmu_interrupt_test(); + GUEST_DONE(); } @@ -128,8 +161,11 @@ int main(int argc, char *argv[]) vm = vm_create(VM_MODE_P47V47_16K); vcpu = vm_vcpu_add(vm, 0, guest_code); + pmu_irq_count = 0; vm_init_descriptor_tables(vm); loongarch_vcpu_setup(vcpu); + vm_install_exception_handler(vm, EXCCODE_INT, guest_irq_handler); + sync_global_to_guest(vm, pmu_irq_count); attr.group = 
KVM_LOONGARCH_VM_FEAT_CTRL, attr.attr = KVM_LOONGARCH_VM_FEAT_PMU, From 4f67cf7e7e756436d2a525ac6743711e598573c4 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:20 -0700 Subject: [PATCH 362/373] KVM: SEV: WARN on unhandled VM type when initializing VM WARN if KVM encounters an unhandled VM type when setting up flags for SEV+ VMs, e.g. to guard against adding a new flavor of SEV without adding proper recognition in sev_vm_init(). Practically speaking, no functional change intended (the new "default" case should be unreachable). Link: https://patch.msgid.link/20260310234829.2608037-13-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 4df0f17da3e2..015d102b32d9 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2927,17 +2927,24 @@ static int snp_decommission_context(struct kvm *kvm) void sev_vm_init(struct kvm *kvm) { - int type = kvm->arch.vm_type; - - if (type == KVM_X86_DEFAULT_VM || type == KVM_X86_SW_PROTECTED_VM) - return; - - kvm->arch.has_protected_state = (type == KVM_X86_SEV_ES_VM || - type == KVM_X86_SNP_VM); - to_kvm_sev_info(kvm)->need_init = true; - - kvm->arch.has_private_mem = (type == KVM_X86_SNP_VM); - kvm->arch.pre_fault_allowed = !kvm->arch.has_private_mem; + switch (kvm->arch.vm_type) { + case KVM_X86_DEFAULT_VM: + case KVM_X86_SW_PROTECTED_VM: + break; + case KVM_X86_SNP_VM: + kvm->arch.has_private_mem = true; + fallthrough; + case KVM_X86_SEV_ES_VM: + kvm->arch.has_protected_state = true; + fallthrough; + case KVM_X86_SEV_VM: + kvm->arch.pre_fault_allowed = !kvm->arch.has_private_mem; + to_kvm_sev_info(kvm)->need_init = true; + break; + default: + WARN_ONCE(1, "Unsupported VM type %u", kvm->arch.vm_type); + break; + } } void sev_vm_destroy(struct kvm *kvm) From 85d2243a21122b9be328152b9e0a0a64625b7016 Mon Sep 17 
00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:21 -0700 Subject: [PATCH 363/373] KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y Bury "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y to make it harder for SEV specific code to sneak into common SVM code. No functional change intended. Link: https://patch.msgid.link/20260310234829.2608037-14-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 2 ++ arch/x86/kvm/svm/svm.h | 6 +++++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 261e563b9bab..20d36f78104c 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4245,8 +4245,10 @@ static void svm_cancel_injection(struct kvm_vcpu *vcpu) static int svm_vcpu_pre_run(struct kvm_vcpu *vcpu) { +#ifdef CONFIG_KVM_AMD_SEV if (to_kvm_sev_info(vcpu->kvm)->need_init) return -EINVAL; +#endif return 1; } diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 089726eb06b2..242a65400341 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -92,6 +92,7 @@ enum { /* TPR and CR2 are always written before VMRUN */ #define VMCB_ALWAYS_DIRTY_MASK ((1U << VMCB_INTR) | (1U << VMCB_CR2)) +#ifdef CONFIG_KVM_AMD_SEV struct kvm_sev_info { bool active; /* SEV enabled guest */ bool es_active; /* SEV-ES enabled guest */ @@ -117,6 +118,7 @@ struct kvm_sev_info { cpumask_var_t have_run_cpus; /* CPUs that have done VMRUN for this VM. */ bool snp_certs_enabled; /* SNP certificate-fetching support. 
*/ }; +#endif struct kvm_svm { struct kvm kvm; @@ -127,7 +129,9 @@ struct kvm_svm { u64 *avic_physical_id_table; struct hlist_node hnode; +#ifdef CONFIG_KVM_AMD_SEV struct kvm_sev_info sev_info; +#endif }; struct kvm_vcpu; @@ -365,12 +369,12 @@ static __always_inline struct kvm_svm *to_kvm_svm(struct kvm *kvm) return container_of(kvm, struct kvm_svm, kvm); } +#ifdef CONFIG_KVM_AMD_SEV static __always_inline struct kvm_sev_info *to_kvm_sev_info(struct kvm *kvm) { return &to_kvm_svm(kvm)->sev_info; } -#ifdef CONFIG_KVM_AMD_SEV static __always_inline bool ____sev_guest(struct kvm *kvm) { return to_kvm_sev_info(kvm)->active; From 2f34d421e8f041cf831b26a7596dc38336062d24 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:22 -0700 Subject: [PATCH 364/373] KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe" Document that the check for an SEV+ guest when reclaiming guest memory is safe even though kvm->lock isn't held. This will allow asserting that kvm->lock is held in the SEV accessors, without triggering false positives on the "safe" cases. No functional change intended. Link: https://patch.msgid.link/20260310234829.2608037-15-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 015d102b32d9..ed8bb60341ae 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -3293,8 +3293,14 @@ void sev_guest_memory_reclaimed(struct kvm *kvm) * With SNP+gmem, private/encrypted memory is unreachable via the * hva-based mmu notifiers, i.e. these events are explicitly scoped to * shared pages, where there's no need to flush caches. 
+ * + * Checking for SEV+ outside of kvm->lock is safe as __sev_guest_init() + * can only be done before vCPUs are created, caches can be incoherent + * if and only if a vCPU was run, and either this task will see the VM + * as being SEV+ or the vCPU won't be able to access the memory (because + * of the in-progress invalidation). */ - if (!sev_guest(kvm) || sev_snp_guest(kvm)) + if (!____sev_guest(kvm) || ____sev_snp_guest(kvm)) return; sev_writeback_caches(kvm); From ba903f7382490776d2df2fca6bf5c8ef2eb4663f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:23 -0700 Subject: [PATCH 365/373] KVM: SEV: Assert that kvm->lock is held when querying SEV+ support Assert that kvm->lock is held when checking if a VM is an SEV+ VM, as KVM sets *and* resets the relevant flags when initializing SEV state, i.e. it's extremely easy to end up with TOCTOU bugs if kvm->lock isn't held. Add waivers for a VM being torn down (refcount is '0') and for there being a loaded vCPU, with comments for both explaining why they're safe. Note, the "vCPU loaded" waiver is necessary to avoid splats on the SNP checks in sev_gmem_prepare() and sev_gmem_max_mapping_level(), which are currently called when handling nested page faults. Alternatively, those checks could key off KVM_X86_SNP_VM, as kvm_arch.vm_type is stable early in VM creation. Prioritize consistency, at least for now, and leave a "reminder" that the max mapping level code in particular likely needs special attention if/when KVM supports dirty logging for SNP guests. 
Link: https://patch.msgid.link/20260310234829.2608037-16-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index ed8bb60341ae..57f3ec36b62a 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -107,17 +107,42 @@ static unsigned int nr_asids; static unsigned long *sev_asid_bitmap; static unsigned long *sev_reclaim_asid_bitmap; +static __always_inline void kvm_lockdep_assert_sev_lock_held(struct kvm *kvm) +{ +#ifdef CONFIG_PROVE_LOCKING + /* + * Querying SEV+ support is safe if there are no other references, i.e. + * if concurrent initialization of SEV+ is impossible. + */ + if (!refcount_read(&kvm->users_count)) + return; + + /* + * Querying SEV+ support from vCPU context is always safe, as vCPUs can + * only be created after SEV+ is initialized (and KVM disallows all SEV + * sub-ioctls while vCPU creation is in-progress). + */ + if (kvm_get_running_vcpu()) + return; + + lockdep_assert_held(&kvm->lock); +#endif +} + static bool sev_guest(struct kvm *kvm) { + kvm_lockdep_assert_sev_lock_held(kvm); return ____sev_guest(kvm); } static bool sev_es_guest(struct kvm *kvm) { + kvm_lockdep_assert_sev_lock_held(kvm); return ____sev_es_guest(kvm); } static bool sev_snp_guest(struct kvm *kvm) { + kvm_lockdep_assert_sev_lock_held(kvm); return ____sev_snp_guest(kvm); } From 04d77ded6407199d7a6964fa8a74bd93e568b763 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Carlos=20L=C3=B3pez?= Date: Tue, 10 Mar 2026 16:48:24 -0700 Subject: [PATCH 366/373] KVM: SEV: use mutex guard in snp_launch_update() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Simplify the error paths in snp_launch_update() by using a mutex guard, allowing early return instead of using gotos. 
Signed-off-by: Carlos López Link: https://patch.msgid.link/20260120201013.3931334-4-clopez@suse.de Link: https://patch.msgid.link/20260310234829.2608037-17-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 31 ++++++++++++------------------- 1 file changed, 12 insertions(+), 19 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 57f3ec36b62a..96510b1ec4cc 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2407,7 +2407,6 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp) struct kvm_memory_slot *memslot; long npages, count; void __user *src; - int ret = 0; if (!sev_snp_guest(kvm) || !sev->snp_context) return -EINVAL; @@ -2452,13 +2451,11 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp) * initial expected state and better guard against unexpected * situations. */ - mutex_lock(&kvm->slots_lock); + guard(mutex)(&kvm->slots_lock); memslot = gfn_to_memslot(kvm, params.gfn_start); - if (!kvm_slot_has_gmem(memslot)) { - ret = -EINVAL; - goto out; - } + if (!kvm_slot_has_gmem(memslot)) + return -EINVAL; sev_populate_args.sev_fd = argp->sev_fd; sev_populate_args.type = params.type; @@ -2469,22 +2466,18 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp) argp->error = sev_populate_args.fw_error; pr_debug("%s: kvm_gmem_populate failed, ret %ld (fw_error %d)\n", __func__, count, argp->error); - ret = -EIO; - } else { - params.gfn_start += count; - params.len -= count * PAGE_SIZE; - if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO) - params.uaddr += count * PAGE_SIZE; - - ret = 0; - if (copy_to_user(u64_to_user_ptr(argp->data), ¶ms, sizeof(params))) - ret = -EFAULT; + return -EIO; } -out: - mutex_unlock(&kvm->slots_lock); + params.gfn_start += count; + params.len -= count * PAGE_SIZE; + if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO) + params.uaddr += count * PAGE_SIZE; - return ret; + if (copy_to_user(u64_to_user_ptr(argp->data), ¶ms, 
sizeof(params))) + return -EFAULT; + + return 0; } static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp) From 63e56d8425a71e20789f68c533f5615a9e7d43cc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Carlos=20L=C3=B3pez?= Date: Tue, 10 Mar 2026 16:48:25 -0700 Subject: [PATCH 367/373] KVM: SEV: use mutex guard in sev_mem_enc_ioctl() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Simplify the error paths in sev_mem_enc_ioctl() by using a mutex guard, allowing early return instead of using gotos. Signed-off-by: Carlos López Link: https://patch.msgid.link/20260120201013.3931334-5-clopez@suse.de Link: https://patch.msgid.link/20260310234829.2608037-18-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 25 ++++++++----------------- 1 file changed, 8 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 96510b1ec4cc..b7bc69f8b0f9 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2637,30 +2637,24 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp) if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd))) return -EFAULT; - mutex_lock(&kvm->lock); + guard(mutex)(&kvm->lock); /* Only the enc_context_owner handles some memory enc operations. */ if (is_mirroring_enc_context(kvm) && - !is_cmd_allowed_from_mirror(sev_cmd.id)) { - r = -EINVAL; - goto out; - } + !is_cmd_allowed_from_mirror(sev_cmd.id)) + return -EINVAL; /* * Once KVM_SEV_INIT2 initializes a KVM instance as an SNP guest, only * allow the use of SNP-specific commands. 
*/ - if (sev_snp_guest(kvm) && sev_cmd.id < KVM_SEV_SNP_LAUNCH_START) { - r = -EPERM; - goto out; - } + if (sev_snp_guest(kvm) && sev_cmd.id < KVM_SEV_SNP_LAUNCH_START) + return -EPERM; switch (sev_cmd.id) { case KVM_SEV_ES_INIT: - if (!sev_es_enabled) { - r = -ENOTTY; - goto out; - } + if (!sev_es_enabled) + return -ENOTTY; fallthrough; case KVM_SEV_INIT: r = sev_guest_init(kvm, &sev_cmd); @@ -2732,15 +2726,12 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp) r = snp_enable_certs(kvm); break; default: - r = -EINVAL; - goto out; + return -EINVAL; } if (copy_to_user(argp, &sev_cmd, sizeof(struct kvm_sev_cmd))) r = -EFAULT; -out: - mutex_unlock(&kvm->lock); return r; } From 84841f3941d7e837b23852e966c7bd4f4d6b62e6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Carlos=20L=C3=B3pez?= Date: Tue, 10 Mar 2026 16:48:26 -0700 Subject: [PATCH 368/373] KVM: SEV: use mutex guard in sev_mem_enc_unregister_region() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Simplify the error paths in sev_mem_enc_unregister_region() by using a mutex guard, allowing early return instead of using gotos. 
Signed-off-by: Carlos López Link: https://patch.msgid.link/20260120201013.3931334-7-clopez@suse.de Link: https://patch.msgid.link/20260310234829.2608037-19-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 20 +++++--------------- 1 file changed, 5 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index b7bc69f8b0f9..1cb9aa9f748c 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2811,35 +2811,25 @@ int sev_mem_enc_unregister_region(struct kvm *kvm, struct kvm_enc_region *range) { struct enc_region *region; - int ret; /* If kvm is mirroring encryption context it isn't responsible for it */ if (is_mirroring_enc_context(kvm)) return -EINVAL; - mutex_lock(&kvm->lock); + guard(mutex)(&kvm->lock); - if (!sev_guest(kvm)) { - ret = -ENOTTY; - goto failed; - } + if (!sev_guest(kvm)) + return -ENOTTY; region = find_enc_region(kvm, range); - if (!region) { - ret = -EINVAL; - goto failed; - } + if (!region) + return -EINVAL; sev_writeback_caches(kvm); __unregister_enc_region_locked(kvm, region); - mutex_unlock(&kvm->lock); return 0; - -failed: - mutex_unlock(&kvm->lock); - return ret; } int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd) From f09b7f4af9bb8a0e4f219bb2ed6af25b8baa8be3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Carlos=20L=C3=B3pez?= Date: Tue, 10 Mar 2026 16:48:27 -0700 Subject: [PATCH 369/373] KVM: SEV: use mutex guard in snp_handle_guest_req() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Simplify the error paths in snp_handle_guest_req() by using a mutex guard, allowing early return instead of using gotos. 
Signed-off-by: Carlos López Link: https://patch.msgid.link/20260120201013.3931334-8-clopez@suse.de Link: https://patch.msgid.link/20260310234829.2608037-20-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 23 ++++++++--------------- 1 file changed, 8 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 1cb9aa9f748c..4bf7b2c6b148 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -4171,12 +4171,10 @@ static int snp_handle_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_ if (!is_sev_snp_guest(&svm->vcpu)) return -EINVAL; - mutex_lock(&sev->guest_req_mutex); + guard(mutex)(&sev->guest_req_mutex); - if (kvm_read_guest(kvm, req_gpa, sev->guest_req_buf, PAGE_SIZE)) { - ret = -EIO; - goto out_unlock; - } + if (kvm_read_guest(kvm, req_gpa, sev->guest_req_buf, PAGE_SIZE)) + return -EIO; data.gctx_paddr = __psp_pa(sev->snp_context); data.req_paddr = __psp_pa(sev->guest_req_buf); @@ -4189,21 +4187,16 @@ static int snp_handle_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_ */ ret = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &fw_err); if (ret && !fw_err) - goto out_unlock; + return ret; - if (kvm_write_guest(kvm, resp_gpa, sev->guest_resp_buf, PAGE_SIZE)) { - ret = -EIO; - goto out_unlock; - } + if (kvm_write_guest(kvm, resp_gpa, sev->guest_resp_buf, PAGE_SIZE)) + return -EIO; /* No action is requested *from KVM* if there was a firmware error. 
*/ svm_vmgexit_no_action(svm, SNP_GUEST_ERR(0, fw_err)); - ret = 1; /* resume guest */ - -out_unlock: - mutex_unlock(&sev->guest_req_mutex); - return ret; + /* resume guest */ + return 1; } static int snp_req_certs_err(struct vcpu_svm *svm, u32 vmm_error) From 1d353dae3d33bf22fba47a96b627eeb7bfe37be8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Carlos=20L=C3=B3pez?= Date: Tue, 10 Mar 2026 16:48:28 -0700 Subject: [PATCH 370/373] KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extract the lock-protected parts of SEV ASID allocation into a new helper and opportunistically convert it to use guard() when acquiring the mutex. Preserve the goto even though it's a little odd, as there's a fair amount of subtlety that makes it surprisingly difficult to replicate the functionality with a loop construct, and arguably using goto yields the most readable code. No functional change intended. 
Signed-off-by: Carlos López [sean: move code to separate helper, rework shortlog+changelog] Link: https://patch.msgid.link/20260310234829.2608037-21-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 37 +++++++++++++++++++++++-------------- 1 file changed, 23 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 4bf7b2c6b148..1567d01ef464 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -237,6 +237,28 @@ static void sev_misc_cg_uncharge(struct kvm_sev_info *sev) misc_cg_uncharge(type, sev->misc_cg, 1); } +static unsigned int sev_alloc_asid(unsigned int min_asid, unsigned int max_asid) +{ + unsigned int asid; + bool retry = true; + + guard(mutex)(&sev_bitmap_lock); + +again: + asid = find_next_zero_bit(sev_asid_bitmap, max_asid + 1, min_asid); + if (asid > max_asid) { + if (retry && __sev_recycle_asids(min_asid, max_asid)) { + retry = false; + goto again; + } + + return asid; + } + + __set_bit(asid, sev_asid_bitmap); + return asid; +} + static int sev_asid_new(struct kvm_sev_info *sev, unsigned long vm_type) { /* @@ -244,7 +266,6 @@ static int sev_asid_new(struct kvm_sev_info *sev, unsigned long vm_type) * SEV-ES-enabled guest can use from 1 to min_sev_asid - 1. 
*/ unsigned int min_asid, max_asid, asid; - bool retry = true; int ret; if (vm_type == KVM_X86_SNP_VM) { @@ -277,24 +298,12 @@ static int sev_asid_new(struct kvm_sev_info *sev, unsigned long vm_type) return ret; } - mutex_lock(&sev_bitmap_lock); - -again: - asid = find_next_zero_bit(sev_asid_bitmap, max_asid + 1, min_asid); + asid = sev_alloc_asid(min_asid, max_asid); if (asid > max_asid) { - if (retry && __sev_recycle_asids(min_asid, max_asid)) { - retry = false; - goto again; - } - mutex_unlock(&sev_bitmap_lock); ret = -EBUSY; goto e_uncharge; } - __set_bit(asid, sev_asid_bitmap); - - mutex_unlock(&sev_bitmap_lock); - sev->asid = asid; return 0; e_uncharge: From bc0932cf9b9917e826871db947398aa2b62789b2 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 16:48:29 -0700 Subject: [PATCH 371/373] KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails Dedup a small amount of cleanup code in SEV ASID allocation by reusing an existing error label. No functional change intended. 
Link: https://patch.msgid.link/20260310234829.2608037-22-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 1567d01ef464..734e5206fbf9 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -289,14 +289,11 @@ static int sev_asid_new(struct kvm_sev_info *sev, unsigned long vm_type) if (min_asid > max_asid) return -ENOTTY; - WARN_ON(sev->misc_cg); + WARN_ON_ONCE(sev->misc_cg); sev->misc_cg = get_current_misc_cg(); ret = sev_misc_cg_try_charge(sev); - if (ret) { - put_misc_cg(sev->misc_cg); - sev->misc_cg = NULL; - return ret; - } + if (ret) + goto e_put_cg; asid = sev_alloc_asid(min_asid, max_asid); if (asid > max_asid) { @@ -306,8 +303,10 @@ static int sev_asid_new(struct kvm_sev_info *sev, unsigned long vm_type) sev->asid = asid; return 0; + e_uncharge: sev_misc_cg_uncharge(sev); +e_put_cg: put_misc_cg(sev->misc_cg); sev->misc_cg = NULL; return ret; From e30aa03d032df0f3ee5efb1995a7a2fe662177be Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 9 Apr 2026 12:13:41 -0700 Subject: [PATCH 372/373] x86/virt: Treat SVM as unsupported when running as an SEV+ guest When running as an SEV+ guest, treat SVM as unsupported even if CPUID (and other reporting, e.g. MSRs) enumerate support for SVM, as KVM doesn't support nested virtualization within an SEV VM (KVM would need to explicitly share all VMCBs and other assets with the untrusted host), let alone running nested VMs within SEV-ES+ guests (e.g. emulating VMLOAD, VMSAVE, and VMRUN all require access to guest register state). And outside of KVM, there is no in-tree user of SVM enabling. Arguably, the hypervisor/VMM (e.g. QEMU) should clear SVM from guest CPUID for SEV VMs, especially for SEV-ES+, but super duper technically, it's feasible to run nested VMs in SEV+ guests (with many caveats). 
More importantly, Linux-as-a-guest has played nice with SVM being advertised to SEV+ guests for a long time. Treating SVM as unsupported fixes a regression where a clean shutdown of an SEV-ES+ guest degrades into an abrupt termination. Due to a gnarly virtualization hole in SEV-ES (the architecture), where EFER must NOT be intercepted by the hypervisor (because the untrusted hypervisor can't set e.g. EFER.LME on behalf of the guest), the _host's_ EFER.SVME is visible to the guest. Because EFER.SVME must be always '1' while in guest mode, Linux-the-guest sees EFER.SVME=1 even when _its_ EFER.SVME is '0', thinks it has enabled virtualization, and ultimately can cause x86_svm_emergency_disable_virtualization_cpu() to execute STGI to ensure GIF is enabled. Executing STGI _should_ be fine, except Linux is also a wee bit paranoid when running as an SEV-ES guest. Because L0 sees EFER.SVME=0 for the guest, a well-behaved L0 hypervisor will intercept STGI (to inject #UD), and thus generate a #VC on the STGI. Which, again, should be fine. Unfortunately, vc_check_opcode_bytes() fails to account for STGI and other SVM instructions, throws a fatal error, and triggers a termination request. In a perfect world, the #VC handler would be more forgiving of unknown intercepts, especially when the #VC happened on an instruction with exception fixup. For now, just fix the immediate regression. 
Fixes: 428afac5a8ea ("KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystem") Reported-by: Srikanth Aithal Closes: https://lore.kernel.org/all/c820e242-9f3a-4210-b414-19d11b022404@amd.com Link: https://patch.msgid.link/20260409191341.1932853-1-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/virt/hw.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/virt/hw.c b/arch/x86/virt/hw.c index c898f16fe612..f647557d38ac 100644 --- a/arch/x86/virt/hw.c +++ b/arch/x86/virt/hw.c @@ -269,7 +269,8 @@ static __init int x86_svm_init(void) .emergency_disable_virtualization_cpu = x86_svm_emergency_disable_virtualization_cpu, }; - if (!cpu_feature_enabled(X86_FEATURE_SVM)) + if (!cpu_feature_enabled(X86_FEATURE_SVM) || + cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) return -EOPNOTSUPP; memcpy(&virt_ops, &svm_ops, sizeof(virt_ops)); From 01f217fa8a8c7878d28df90233f68c20bea9bdc7 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 13 Apr 2026 18:57:26 +0200 Subject: [PATCH 373/373] KVM: x86: use inlines instead of macros for is_sev_*guest This helps avoid more embarrassment to this maintainer, but also will catch mistakes more easily for others. 
Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/svm.h | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index fd0652b32c81..a10668d17a16 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -422,9 +422,19 @@ static __always_inline bool is_sev_snp_guest(struct kvm_vcpu *vcpu) return ____sev_snp_guest(vcpu->kvm); } #else -#define is_sev_guest(vcpu) false -#define is_sev_es_guest(vcpu) false -#define is_sev_snp_guest(vcpu) false +static __always_inline bool is_sev_guest(struct kvm_vcpu *vcpu) +{ + return false; +} +static __always_inline bool is_sev_es_guest(struct kvm_vcpu *vcpu) +{ + return false; +} + +static __always_inline bool is_sev_snp_guest(struct kvm_vcpu *vcpu) +{ + return false; +} #endif static inline bool ghcb_gpa_is_registered(struct vcpu_svm *svm, u64 val)