linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-18 06:44:00 -04:00

Author	SHA1	Message	Date
Will Deacon	8800dbf661	KVM: arm64: Allow userspace to create protected VMs when pKVM is enabled Introduce a new VM type for KVM/arm64 to allow userspace to request the creation of a "protected VM" when the host has booted with pKVM enabled. For now, this feature results in a taint on first use as many aspects of a protected VM are not yet protected! Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-32-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:09 +01:00
Will Deacon	5991916392	KVM: arm64: Return -EFAULT from VCPU_RUN on access to a poisoned pte If a protected vCPU faults on an IPA which appears to be mapped, query the hypervisor to determine whether or not the faulting pte has been poisoned by a forceful reclaim. If the pte has been poisoned, return -EFAULT back to userspace rather than retrying the instruction forever. Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-28-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:09 +01:00
Will Deacon	281a38ad29	KVM: arm64: Reclaim faulting page from pKVM in spurious fault handler Host kernel accesses to pages that are inaccessible at stage-2 result in the injection of a translation fault, which is fatal unless an exception table fixup is registered for the faulting PC (e.g. for user access routines). This is undesirable, since a get_user_pages() call could be used to obtain a reference to a donated page and then a subsequent access via a kernel mapping would lead to a panic(). Rework the spurious fault handler so that stage-2 faults injected back into the host result in the target page being forcefully reclaimed when no exception table fixup handler is registered. Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-27-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:09 +01:00
Will Deacon	4e6e03f9ea	KVM: arm64: Hook up reclaim hypercall to pkvm_pgtable_stage2_destroy() During teardown of a protected guest, its memory pages must be reclaimed from the hypervisor by issuing the '__pkvm_reclaim_dying_guest_page' hypercall. Add a new helper, __pkvm_pgtable_stage2_reclaim(), which is called during the VM teardown operation to reclaim pages from the hypervisor and drop the GUP pin on the host. Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-17-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:08 +01:00
Will Deacon	5fef16ef49	KVM: arm64: Hook up donation hypercall to pkvm_pgtable_stage2_map() Mapping pages into a protected guest requires the donation of memory from the host. Extend pkvm_pgtable_stage2_map() to issue a donate hypercall when the target VM is protected. Since the hypercall only handles a single page, the splitting logic used for the share path is not required. Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-14-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:08 +01:00
Will Deacon	6c58f914eb	KVM: arm64: Split teardown hypercall into two phases In preparation for reclaiming protected guest VM pages from the host during teardown, split the current 'pkvm_teardown_vm' hypercall into separate 'start' and 'finalise' calls. The 'pkvm_start_teardown_vm' hypercall puts the VM into a new 'is_dying' state, which is a point of no return past which no vCPU of the pVM is allowed to run any more. Once in this new state, 'pkvm_finalize_teardown_vm' can be used to reclaim meta-data and page-table pages from the VM. A subsequent patch will add support for reclaiming the individual guest memory pages. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Co-developed-by: Quentin Perret <qperret@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-12-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:07 +01:00
Will Deacon	f0877a1455	KVM: arm64: Prevent unsupported memslot operations on protected VMs Protected VMs do not support deleting or moving memslots after first run nor do they support read-only or dirty logging. Return -EPERM to userspace if such an operation is attempted. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-10-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:07 +01:00
Will Deacon	7250533ad2	KVM: arm64: Ignore MMU notifier callbacks for protected VMs In preparation for supporting the donation of pinned pages to protected VMs, return early from the MMU notifiers when called for a protected VM, as the necessary hypercalls are exposed only for non-protected guests. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-9-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:07 +01:00
Will Deacon	ff1e7f9897	KVM: arm64: Rename __pkvm_pgtable_stage2_unmap() In preparation for adding support for protected VMs, where pages are donated rather than shared, rename __pkvm_pgtable_stage2_unmap() to __pkvm_pgtable_stage2_unshare() to make it clearer about what is going on. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-5-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:07 +01:00
Will Deacon	9f02deef47	KVM: arm64: Move handle check into pkvm_pgtable_stage2_destroy_range() When pKVM is enabled, a VM has a 'handle' allocated by the hypervisor in kvm_arch_init_vm() and released later by kvm_arch_destroy_vm(). Consequently, the only time __pkvm_pgtable_stage2_unmap() can run into an uninitialised 'handle' is on the kvm_arch_init_vm() failure path, where we destroy the empty stage-2 page-table if we fail to allocate a handle. Move the handle check into pkvm_pgtable_stage2_destroy_range(), which will additionally handle protected VMs in subsequent patches. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-4-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:07 +01:00
Raghavendra Rao Ananta	d68d66e57e	KVM: arm64: Split kvm_pgtable_stage2_destroy() Split kvm_pgtable_stage2_destroy() into two: - kvm_pgtable_stage2_destroy_range(), that performs the page-table walk and free the entries over a range of addresses. - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD. This refactoring enables subsequent patches to free large page-tables in chunks, calling cond_resched() between each chunk, to yield the CPU as necessary. Existing callers of kvm_pgtable_stage2_destroy(), that probably cannot take advantage of this (such as nVMHE), will continue to function as is. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Suggested-by: Oliver Upton <oupton@kernel.org> Link: https://msgid.link/20251113052452.975081-3-rananta@google.com Signed-off-by: Oliver Upton <oupton@kernel.org>	2025-11-19 14:11:57 -08:00
Paolo Bonzini	924ebaefce	Merge tag 'kvmarm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.18 - Add support for FF-A 1.2 as the secure memory conduit for pKVM, allowing more registers to be used as part of the message payload. - Change the way pKVM allocates its VM handles, making sure that the privileged hypervisor is never tricked into using uninitialised data. - Speed up MMIO range registration by avoiding unnecessary RCU synchronisation, which results in VMs starting much quicker. - Add the dump of the instruction stream when panic-ing in the EL2 payload, just like the rest of the kernel has always done. This will hopefully help debugging non-VHE setups. - Add 52bit PA support to the stage-1 page-table walker, and make use of it to populate the fault level reported to the guest on failing to translate a stage-1 walk. - Add NV support to the GICv3-on-GICv5 emulation code, ensuring feature parity for guests, irrespective of the host platform. - Fix some really ugly architecture problems when dealing with debug in a nested VM. This has some bad performance impacts, but is at least correct. - Add enough infrastructure to be able to disable EL2 features and give effective values to the EL2 control registers. This then allows a bunch of features to be turned off, which helps cross-host migration. - Large rework of the selftest infrastructure to allow most tests to transparently run at EL2. This is the first step towards enabling NV testing. - Various fixes and improvements all over the map, including one BE fix, just in time for the removal of the feature.	2025-09-30 13:23:28 -04:00
Marc Zyngier	3064cee8c3	Merge branch kvm-arm64/pkvm_vm_handle into kvmarm-master/next * kvm-arm64/pkvm_vm_handle: : pKVM VM handle allocation fixes, courtesy of Fuad Tabba. : : From the cover letter (20250909072437.4110547-1-tabba@google.com): : : "In pKVM, this handle is allocated when the VM is initialized at the : hypervisor, which is on the first vCPU run. However, the host starts : initializing the VM and setting up its data structures earlier. MMU : notifiers for the VMs are also registered before VM initialization at : the hypervisor, and rely on the handle to identify the VM. : : Therefore, there is a potential gap between when the VM is (partially) : setup at the host, but still without a valid pKVM handle to identify it : when communicating with the hypervisor." KVM: arm64: Reserve pKVM handle during pkvm_init_host_vm() KVM: arm64: Introduce separate hypercalls for pKVM VM reservation and initialization KVM: arm64: Consolidate pKVM hypervisor VM initialization logic KVM: arm64: Separate allocation and insertion of pKVM VM table entries KVM: arm64: Decouple hyp VM creation state from its handle KVM: arm64: Clarify comments to distinguish pKVM mode from protected VMs KVM: arm64: Rename 'host_kvm' to 'kvm' in pKVM host code KVM: arm64: Rename pkvm.enabled to pkvm.is_protected KVM: arm64: Add build-time check for duplicate DECLARE_REG use Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-09-15 10:49:04 +01:00
Fuad Tabba	07aeb70707	KVM: arm64: Reserve pKVM handle during pkvm_init_host_vm() When a pKVM guest is active, TLB invalidations triggered by host MMU notifiers require a valid hypervisor handle. Currently, this handle is only allocated when the first vCPU is run. However, the guest's memory is associated with the host MMU much earlier, during kvm_arch_init_vm(). This creates a window where an MMU invalidation could occur after the kvm_pgtable pointer checked by the notifiers is set but before the pKVM handle has been created. Fix this by reserving the pKVM handle when the host VM is first set up. Move the call to the __pkvm_reserve_vm hypercall from the first-vCPU-run path into pkvm_init_host_vm(), which is called during initial VM setup. This ensures the handle is available before any subsystem can trigger an MMU notification for the VM. The VM destruction path is updated to call __pkvm_unreserve_vm for cases where a VM was reserved but never fully created at the hypervisor, ensuring the handle is properly released. This fix leverages the two-stage reservation/initialization hypercall interface introduced in preceding patches. Signed-off-by: Fuad Tabba <tabba@google.com> Tested-by: Mark Brown <broonie@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-09-15 10:46:55 +01:00
Fuad Tabba	256b4668cd	KVM: arm64: Introduce separate hypercalls for pKVM VM reservation and initialization The existing __pkvm_init_vm hypercall performs both the reservation of a VM table entry and the initialization of the hypervisor VM state in a single operation. This design prevents the host from obtaining a VM handle from the hypervisor until all preparation for the creation and the initialization of the VM is done, which is on the first vCPU run operation. To support more flexible VM lifecycle management, the host needs the ability to reserve a handle early, before the first vCPU run. Refactor the hypercall interface to enable this, splitting the single hypercall into a two-stage process: - __pkvm_reserve_vm: A new hypercall that allocates a slot in the hypervisor's vm_table, marks it as reserved, and returns a unique handle to the host. - __pkvm_unreserve_vm: A corresponding cleanup hypercall to safely release the reservation if the host fails to proceed with full initialization. - __pkvm_init_vm: The existing hypercall is modified to no longer allocate a slot. It now expects a pre-reserved handle and commits the donated VM memory to that slot. For now, the host-side code in __pkvm_create_hyp_vm calls the new reserve and init hypercalls back-to-back to maintain existing behavior. This paves the way for subsequent patches to separate the reservation and initialization steps in the VM's lifecycle. Signed-off-by: Fuad Tabba <tabba@google.com> Tested-by: Mark Brown <broonie@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-09-15 10:46:55 +01:00
Fuad Tabba	3c45b67625	KVM: arm64: Decouple hyp VM creation state from its handle Currently, the presence of a pKVM handle (pkvm.handle != 0) is used to determine if the corresponding hypervisor (EL2) VM has been created and initialized. This couples the handle's lifecycle with the VM's creation state. This coupling will become problematic with upcoming changes that will allocate the pKVM handle earlier in the VM's life, before the VM is instantiated at the hypervisor. To prepare for this and make the state tracking explicit, decouple the two concepts. Introduce a new boolean flag, 'pkvm.is_created', to track whether the hypervisor-side VM has been created and initialized. A new helper, pkvm_hyp_vm_is_created(), is added to check this flag. All call sites that previously checked for the handle's existence are converted to use the new, explicit check. The 'is_created' flag is set to true upon successful creation in the hypervisor (EL2) and cleared upon destruction. Signed-off-by: Fuad Tabba <tabba@google.com> Tested-by: Mark Brown <broonie@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-09-15 10:46:55 +01:00
Fuad Tabba	604a5032b4	KVM: arm64: Rename 'host_kvm' to 'kvm' in pKVM host code In hypervisor (EL2) code, it is important to distinguish between the host's 'struct kvm' and a protected VM's 'struct kvm'. Using 'host_kvm' as variable name in that context makes this distinction clear. However, in the host kernel code (EL1), there is no such ambiguity. The code is only ever concerned with the host's own 'struct kvm' instance. The 'host_' prefix is therefore redundant and adds unnecessary verbosity. Simplify the code by renaming the 'host_kvm' parameter to 'kvm' in all functions within host-side kernel code (EL1). This improves readability and makes the naming consistent with other host-side kernel code. No functional change intended. Signed-off-by: Fuad Tabba <tabba@google.com> Tested-by: Mark Brown <broonie@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-09-15 10:46:55 +01:00
Raghavendra Rao Ananta	0e89ca13ee	KVM: arm64: Split kvm_pgtable_stage2_destroy() Split kvm_pgtable_stage2_destroy() into two: - kvm_pgtable_stage2_destroy_range(), that performs the page-table walk and free the entries over a range of addresses. - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD. This refactoring enables subsequent patches to free large page-tables in chunks, calling cond_resched() between each chunk, to yield the CPU as necessary. Existing callers of kvm_pgtable_stage2_destroy(), that probably cannot take advantage of this (such as nVMHE), will continue to function as is. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Suggested-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250820162242.2624752-2-rananta@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-21 16:27:03 -07:00
Marc Zyngier	1b85d923ba	Merge branch kvm-arm64/misc-6.16 into kvmarm-master/next * kvm-arm64/misc-6.16: : . : Misc changes and improvements for 6.16: : : - Add a new selftest for the SVE host state being corrupted by a guest : : - Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, : ensuring that the window for interrupts is slightly bigger, and avoiding : a pretty bad erratum on the AmpereOne HW : : - Replace a couple of open-coded on/off strings with str_on_off() : : - Get rid of the pKVM memblock sorting, which now appears to be superflous : : - Drop superflous clearing of ICH_LR_EOI in the LR when nesting : : - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from : a pretty bad case of TLB corruption unless accesses to HCR_EL2 are : heavily synchronised : : - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables : in a human-friendly fashion : . KVM: arm64: Fix documentation for vgic_its_iter_next() KVM: arm64: vgic-its: Add debugfs interface to expose ITS tables arm64: errata: Work around AmpereOne's erratum AC04_CPU_23 KVM: arm64: nv: Remove clearing of ICH_LR<n>.EOI if ICH_LR<n>.HW == 1 KVM: arm64: Drop sort_memblock_regions() KVM: arm64: selftests: Add test for SVE host corruption KVM: arm64: Force HCR_EL2.xMO to 1 at all times in VHE mode KVM: arm64: Replace ternary flags with str_on_off() helper Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-05-23 10:59:43 +01:00
Vincent Donnefort	db14091d8f	KVM: arm64: Stage-2 huge mappings for np-guests Now np-guests hypercalls with range are supported, we can let the hypervisor to install block mappings whenever the Stage-1 allows it, that is when backed by either Hugetlbfs or THPs. The size of those block mappings is limited to PMD_SIZE. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/r/20250521124834.1070650-10-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-05-21 14:33:51 +01:00
Quentin Perret	3669ddd8fa	KVM: arm64: Add a range to pkvm_mappings In preparation for supporting stage-2 huge mappings for np-guest, add a nr_pages member for pkvm_mappings to allow EL1 to track the size of the stage-2 mapping. Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/r/20250521124834.1070650-9-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-05-21 14:33:51 +01:00
Quentin Perret	b38c9775f7	KVM: arm64: Convert pkvm_mappings to interval tree In preparation for supporting stage-2 huge mappings for np-guest, let's convert pgt.pkvm_mappings to an interval tree. No functional change intended. Suggested-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/r/20250521124834.1070650-8-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-05-21 14:33:51 +01:00
Vincent Donnefort	c4d99a833d	KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest() In preparation for supporting stage-2 huge mappings for np-guest. Add a nr_pages argument to the __pkvm_host_test_clear_young_guest hypercall. This range supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is 512 on a 4K-pages system). Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/r/20250521124834.1070650-7-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-05-21 14:33:51 +01:00
Vincent Donnefort	0eb802b3b4	KVM: arm64: Add a range to __pkvm_host_wrprotect_guest() In preparation for supporting stage-2 huge mappings for np-guest. Add a nr_pages argument to the __pkvm_host_wrprotect_guest hypercall. This range supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is 512 on a 4K-pages system). Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/r/20250521124834.1070650-6-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-05-21 14:33:51 +01:00
Vincent Donnefort	f28f1d02f4	KVM: arm64: Add a range to __pkvm_host_unshare_guest() In preparation for supporting stage-2 huge mappings for np-guest. Add a nr_pages argument to the __pkvm_host_unshare_guest hypercall. This range supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is 512 on a 4K-pages system). Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/r/20250521124834.1070650-5-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-05-21 14:33:51 +01:00
Vincent Donnefort	4274385ebf	KVM: arm64: Add a range to __pkvm_host_share_guest() In preparation for supporting stage-2 huge mappings for np-guest. Add a nr_pages argument to the __pkvm_host_share_guest hypercall. This range supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is 512 on a 4K-pages system). Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/r/20250521124834.1070650-4-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-05-21 14:33:51 +01:00
Gavin Shan	00b0300cf1	KVM: arm64: Drop sort_memblock_regions() Drop sort_memblock_regions() and avoid sorting the copied memory regions to be ascending order on their base addresses, because the source memory regions should have been sorted correctly when they are added by memblock_add() or its variants. This is generally reverting commit `a14307f531` ("KVM: arm64: Sort the hypervisor memblocks"). No functional changes intended. Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Quentin Perret <qperret@google.com> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250311043718.91004-1-gshan@redhat.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-05-08 11:29:45 +01:00
Quentin Perret	48d5645072	KVM: arm64: Extend pKVM selftest for np-guests The pKVM selftest intends to test as many memory 'transitions' as possible, so extend it to cover sharing pages with non-protected guests, including in the case of multi-sharing. Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20250416160900.3078417-5-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-05-06 09:56:18 +01:00
David Brazdil	74b13d5816	KVM: arm64: Add .hyp.data section The hypervisor has not needed its own .data section because all globals were either .rodata or .bss. To avoid having to initialize future data-structures at run-time, let's introduce add a .data section to the hypervisor. Signed-off-by: David Brazdil <dbrazdil@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20250416160900.3078417-2-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-05-06 09:56:18 +01:00
Fuad Tabba	1eab115486	KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu Instead of creating and initializing _all_ hyp vcpus in pKVM when the first host vcpu runs for the first time, initialize _each_ hyp vcpu in conjunction with its corresponding host vcpu. Some of the host vcpu state (e.g., system registers and traps values) is not initialized until the first time the host vcpu is run. Therefore, initializing a hyp vcpu before its corresponding host vcpu has run for the first time might not view the complete host state of these vcpus. Additionally, this behavior is inline with non-protected modes. Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20250314111832.4137161-5-tabba@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-03-14 16:06:03 -07:00
Fuad Tabba	8b21fb47c7	KVM: arm64: Factor out pKVM hyp vcpu creation to separate function Move the code that creates and initializes the hyp view of a vcpu in pKVM to its own function. This is meant to make the transition to initializing every vcpu individually clearer. Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20250314111832.4137161-4-tabba@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-03-14 16:04:23 -07:00
Vincent Donnefort	79ea662315	KVM: arm64: Count pKVM stage-2 usage in secondary pagetable stats Count the pages used by pKVM for the guest stage-2 in memory stats under secondary pagetable, similarly to what the VHE mode does. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250313114038.1502357-4-vdonnefort@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-03-14 00:56:30 -07:00
Vincent Donnefort	8c0d7d14c5	KVM: arm64: Distinct pKVM teardown memcache for stage-2 In order to account for memory dedicated to the stage-2 page-tables, use a separated memcache when tearing down the VM. Meanwhile rename reclaim_guest_pages to reflect the fact it only reclaim page-table pages. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250313114038.1502357-3-vdonnefort@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-03-14 00:56:29 -07:00
Quentin Perret	e912efed48	KVM: arm64: Introduce the EL1 pKVM MMU Introduce a set of helper functions allowing to manipulate the pKVM guest stage-2 page-tables from EL1 using pKVM's HVC interface. Each helper has an exact one-to-one correspondance with the traditional kvm_pgtable_stage2_*() functions from pgtable.c, with a strictly matching prototype. This will ease plumbing later on in mmu.c. These callbacks track the gfn->pfn mappings in a simple rb_tree indexed by IPA in lieu of a page-table. This rb-tree is kept in sync with pKVM's state and is protected by the mmu_lock like a traditional stage-2 page-table. Signed-off-by: Quentin Perret <qperret@google.com> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-18-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2024-12-20 09:44:00 +00:00
Quentin Perret	06cacc9d28	KVM: arm64: Prevent kmemleak from accessing .hyp.data We've added a .data section for the hypervisor, which kmemleak is eager to parse. This clearly doesn't go well, so add the section to kmemleak's block list. Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240423150538.2103045-13-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2024-05-01 16:48:14 +01:00
Fuad Tabba	40099dedb4	KVM: arm64: Do not re-initialize the KVM lock The lock is already initialized in core KVM code at kvm_create_vm(). Fixes: `9d0c063a4d` ("KVM: arm64: Instantiate pKVM hypervisor VM and vCPU structures from EL1") Signed-off-by: Fuad Tabba <tabba@google.com> Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240423150538.2103045-5-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2024-05-01 16:46:58 +01:00
Sebastian Ene	10c02aad11	KVM: arm64: Fix circular locking dependency The rule inside kvm enforces that the vcpu->mutex is taken inside kvm->lock. The rule is violated by the pkvm_create_hyp_vm() which acquires the kvm->lock while already holding the vcpu->mutex lock from kvm_vcpu_ioctl(). Avoid the circular locking dependency altogether by protecting the hyp vm handle with the config_lock, much like we already do for other forms of VM-scoped data. Signed-off-by: Sebastian Ene <sebastianene@google.com> Cc: stable@vger.kernel.org Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240124091027.1477174-2-sebastianene@google.com	2024-01-30 21:30:33 +00:00
Marc Zyngier	fe49fd940e	KVM: arm64: Move VTCR_EL2 into struct s2_mmu We currently have a global VTCR_EL2 value for each guest, even if the guest uses NV. This implies that the guest's own S2 must fit in the host's. This is odd, for multiple reasons: - the PARange values and the number of IPA bits don't necessarily match: you can have 33 bits of IPA space, and yet you can only describe 32 or 36 bits of PARange - When userspace set the IPA space, it creates a contract with the kernel saying "this is the IPA space I'm prepared to handle". At no point does it constraint the guest's own IPA space as long as the guest doesn't try to use a [I]PA outside of the IPA space set by userspace - We don't even try to hide the value of ID_AA64MMFR0_EL1.PARange. And then there is the consequence of the above: if a guest tries to create a S2 that has for input address something that is larger than the IPA space defined by the host, we inject a fatal exception. This is no good. For all intent and purposes, a guest should be able to have the S2 it really wants, as long as the output address of that S2 isn't outside of the IPA space. For that, we need to have a per-s2_mmu VTCR_EL2 setting, which allows us to represent the full PARange. Move the vctr field into the s2_mmu structure, which has no impact whatsoever, except for NV. Note that once we are able to override ID_AA64MMFR0_EL1.PARange from userspace, we'll also be able to restrict the size of the shadow S2 that NV uses. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231012205108.3937270-1-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2023-10-23 18:48:46 +00:00
Sudeep Holla	fa729bc7c9	KVM: arm64: Handle kvm_arm_init failure correctly in finalize_pkvm Currently there is no synchronisation between finalize_pkvm() and kvm_arm_init() initcalls. The finalize_pkvm() proceeds happily even if kvm_arm_init() fails resulting in the following warning on all the CPUs and eventually a HYP panic: \| kvm [1]: IPA Size Limit: 48 bits \| kvm [1]: Failed to init hyp memory protection \| kvm [1]: error initializing Hyp mode: -22 \| \| <snip> \| \| WARNING: CPU: 0 PID: 0 at arch/arm64/kvm/pkvm.c:226 _kvm_host_prot_finalize+0x30/0x50 \| Modules linked in: \| CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.4.0 #237 \| Hardware name: FVP Base RevC (DT) \| pstate: 634020c5 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) \| pc : _kvm_host_prot_finalize+0x30/0x50 \| lr : __flush_smp_call_function_queue+0xd8/0x230 \| \| Call trace: \| _kvm_host_prot_finalize+0x3c/0x50 \| on_each_cpu_cond_mask+0x3c/0x6c \| pkvm_drop_host_privileges+0x4c/0x78 \| finalize_pkvm+0x3c/0x5c \| do_one_initcall+0xcc/0x240 \| do_initcall_level+0x8c/0xac \| do_initcalls+0x54/0x94 \| do_basic_setup+0x1c/0x28 \| kernel_init_freeable+0x100/0x16c \| kernel_init+0x20/0x1a0 \| ret_from_fork+0x10/0x20 \| Failed to finalize Hyp protection: -22 \| dtb=fvp-base-revc.dtb \| kvm [95]: nVHE hyp BUG at: arch/arm64/kvm/hyp/nvhe/mem_protect.c:540! \| kvm [95]: nVHE call trace: \| kvm [95]: [<ffff800081052984>] __kvm_nvhe_hyp_panic+0xac/0xf8 \| kvm [95]: [<ffff800081059644>] __kvm_nvhe_handle_host_mem_abort+0x1a0/0x2ac \| kvm [95]: [<ffff80008105511c>] __kvm_nvhe_handle_trap+0x4c/0x160 \| kvm [95]: [<ffff8000810540fc>] __kvm_nvhe___skip_pauth_save+0x4/0x4 \| kvm [95]: ---[ end nVHE call trace ]--- \| kvm [95]: Hyp Offset: 0xfffe8db00ffa0000 \| Kernel panic - not syncing: HYP panic: \| PS:a34023c9 PC:0000f250710b973c ESR:00000000f2000800 \| FAR:ffff000800cb00d0 HPFAR:000000000880cb00 PAR:0000000000000000 \| VCPU:0000000000000000 \| CPU: 3 PID: 95 Comm: kworker/u16:2 Tainted: G W 6.4.0 #237 \| Hardware name: FVP Base RevC (DT) \| Workqueue: rpciod rpc_async_schedule \| Call trace: \| dump_backtrace+0xec/0x108 \| show_stack+0x18/0x2c \| dump_stack_lvl+0x50/0x68 \| dump_stack+0x18/0x24 \| panic+0x138/0x33c \| nvhe_hyp_panic_handler+0x100/0x184 \| new_slab+0x23c/0x54c \| ___slab_alloc+0x3e4/0x770 \| kmem_cache_alloc_node+0x1f0/0x278 \| __alloc_skb+0xdc/0x294 \| tcp_stream_alloc_skb+0x2c/0xf0 \| tcp_sendmsg_locked+0x3d0/0xda4 \| tcp_sendmsg+0x38/0x5c \| inet_sendmsg+0x44/0x60 \| sock_sendmsg+0x1c/0x34 \| xprt_sock_sendmsg+0xdc/0x274 \| xs_tcp_send_request+0x1ac/0x28c \| xprt_transmit+0xcc/0x300 \| call_transmit+0x78/0x90 \| __rpc_execute+0x114/0x3d8 \| rpc_async_schedule+0x28/0x48 \| process_one_work+0x1d8/0x314 \| worker_thread+0x248/0x474 \| kthread+0xfc/0x184 \| ret_from_fork+0x10/0x20 \| SMP: stopping secondary CPUs \| Kernel Offset: 0x57c5cb460000 from 0xffff800080000000 \| PHYS_OFFSET: 0x80000000 \| CPU features: 0x00000000,1035b7a3,ccfe773f \| Memory Limit: none \| ---[ end Kernel panic - not syncing: HYP panic: \| PS:a34023c9 PC:0000f250710b973c ESR:00000000f2000800 \| FAR:ffff000800cb00d0 HPFAR:000000000880cb00 PAR:0000000000000000 \| VCPU:0000000000000000 ]--- Fix it by checking for the successfull initialisation of kvm_arm_init() in finalize_pkvm() before proceeding any futher. Fixes: `87727ba2bb` ("KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege") Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: James Morse <james.morse@arm.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230704193243.3300506-1-sudeep.holla@arm.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2023-07-11 19:30:14 +00:00
Will Deacon	bc3888a0f4	KVM: arm64: Allocate pages for hypervisor FF-A mailboxes The FF-A proxy code needs to allocate its own buffer pair for communication with EL3 and for forwarding calls from the host at EL1. Reserve a couple of pages for this purpose and use them to initialise the hypervisor's FF-A buffer structure. Co-developed-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230523101828.7328-4-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2023-06-01 21:34:50 +00:00
Will Deacon	87727ba2bb	KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege Although pKVM supports CPU PMU emulation for non-protected guests since `722625c6f4` ("KVM: arm64: Reenable pmu in Protected Mode"), this relies on the PMU driver probing before the host has de-privileged so that the 'kvm_arm_pmu_available' static key can still be enabled by patching the hypervisor text. As it happens, both of these events hang off device_initcall() but the PMU consistently won the race until `7755cec63a` ("arm64: perf: Move PMUv3 driver to drivers/perf"). Since then, the host will fail to boot when pKVM is enabled: \| hw perfevents: enabled with armv8_pmuv3_0 PMU driver, 7 counters available \| kvm [1]: nVHE hyp BUG at: [<ffff8000090366e0>] __kvm_nvhe_handle_host_mem_abort+0x270/0x284! \| kvm [1]: Cannot dump pKVM nVHE stacktrace: !CONFIG_PROTECTED_NVHE_STACKTRACE \| kvm [1]: Hyp Offset: 0xfffea41fbdf70000 \| Kernel panic - not syncing: HYP panic: \| PS:a00003c9 PC:0000dbe04b0c66e0 ESR:00000000f2000800 \| FAR:fffffbfffddfcf00 HPFAR:00000000010b0bf0 PAR:0000000000000000 \| VCPU:0000000000000000 \| CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc7-00083-g0bce6746d154 #1 \| Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 \| Call trace: \| dump_backtrace+0xec/0x108 \| show_stack+0x18/0x2c \| dump_stack_lvl+0x50/0x68 \| dump_stack+0x18/0x24 \| panic+0x13c/0x33c \| nvhe_hyp_panic_handler+0x10c/0x190 \| aarch64_insn_patch_text_nosync+0x64/0xc8 \| arch_jump_label_transform+0x4c/0x5c \| __jump_label_update+0x84/0xfc \| jump_label_update+0x100/0x134 \| static_key_enable_cpuslocked+0x68/0xac \| static_key_enable+0x20/0x34 \| kvm_host_pmu_init+0x88/0xa4 \| armpmu_register+0xf0/0xf4 \| arm_pmu_acpi_probe+0x2ec/0x368 \| armv8_pmu_driver_init+0x38/0x44 \| do_one_initcall+0xcc/0x240 Fix the race properly by deferring the de-privilege step to device_initcall_sync(). This will also be needed in future when probing IOMMU devices and allows us to separate the pKVM de-privilege logic from the core hypervisor initialisation path. Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Fixes: `7755cec63a` ("arm64: perf: Move PMUv3 driver to drivers/perf") Tested-by: Fuad Tabba <tabba@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230420123356.2708-1-will@kernel.org Signed-off-by: Will Deacon <will@kernel.org>	2023-04-20 16:57:53 +01:00
Quentin Perret	f41dff4efb	KVM: arm64: Return guest memory from EL2 via dedicated teardown memcache Rather than relying on the host to free the previously-donated pKVM hypervisor VM pages explicitly on teardown, introduce a dedicated teardown memcache which allows the host to reclaim guest memory resources without having to keep track of all of the allocations made by the pKVM hypervisor at EL2. Tested-by: Vincent Donnefort <vdonnefort@google.com> Co-developed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Will Deacon <will@kernel.org> [maz: dropped __maybe_unused from unmap_donated_memory_noclear()] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110190259.26861-21-will@kernel.org	2022-11-11 17:18:58 +00:00
Fuad Tabba	9d0c063a4d	KVM: arm64: Instantiate pKVM hypervisor VM and vCPU structures from EL1 With the pKVM hypervisor at EL2 now offering hypercalls to the host for creating and destroying VM and vCPU structures, plumb these in to the existing arm64 KVM backend to ensure that the hypervisor data structures are allocated and initialised on first vCPU run for a pKVM guest. In the host, 'struct kvm_protected_vm' is introduced to hold the handle of the pKVM VM instance as well as to track references to the memory donated to the hypervisor so that it can be freed back to the host allocator following VM teardown. The stage-2 page-table, hypervisor VM and vCPU structures are allocated separately so as to avoid the need for a large physically-contiguous allocation in the host at run-time. Tested-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110190259.26861-14-will@kernel.org	2022-11-11 17:16:24 +00:00
Fuad Tabba	a1ec5c70d3	KVM: arm64: Add infrastructure to create and track pKVM instances at EL2 Introduce a global table (and lock) to track pKVM instances at EL2, and provide hypercalls that can be used by the untrusted host to create and destroy pKVM VMs and their vCPUs. pKVM VM/vCPU state is directly accessible only by the trusted hypervisor (EL2). Each pKVM VM is directly associated with an untrusted host KVM instance, and is referenced by the host using an opaque handle. Future patches will provide hypercalls to allow the host to initialize/set/get pKVM VM/vCPU state using the opaque handle. Tested-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Co-developed-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> [maz: silence warning on unmap_donated_memory_noclear()] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110190259.26861-13-will@kernel.org	2022-11-11 17:16:05 +00:00
Quentin Perret	8e6bcc3a45	KVM: arm64: Back the hypervisor 'struct hyp_page' array for all memory The EL2 'vmemmap' array in nVHE Protected mode is currently very sparse: only memory pages owned by the hypervisor itself have a matching 'struct hyp_page'. However, as the size of this struct has been reduced significantly since its introduction, it appears that we can now afford to back the vmemmap for all of memory. Having an easily accessible 'struct hyp_page' for every physical page in memory provides the hypervisor with a simple mechanism to store metadata (e.g. a refcount) that wouldn't otherwise fit in the very limited number of software bits available in the host stage-2 page-table entries. This will be used in subsequent patches when pinning host memory pages for use by the hypervisor at EL2. Tested-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110190259.26861-4-will@kernel.org	2022-11-11 16:40:54 +00:00
Will Deacon	9429f4b041	KVM: arm64: Move host EL1 code out of hyp/ directory kvm/hyp/reserved_mem.c contains host code executing at EL1 and is not linked into the hypervisor object. Move the file into kvm/pkvm.c and rework the headers so that the definitions shared between the host and the hypervisor live in asm/kvm_pkvm.h. Signed-off-by: Will Deacon <will@kernel.org> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211202171048.26924-4-will@kernel.org	2021-12-06 08:37:03 +00:00

46 Commits