Files
linux/arch/arm64/kvm/hyp/pgtable.c
Linus Torvalds 01f492e181 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "Arm:

   - Add support for tracing in the standalone EL2 hypervisor code,
     which should help both debugging and performance analysis. This
     uses the new infrastructure for 'remote' trace buffers that can be
     exposed by non-kernel entities such as firmware, and which came
     through the tracing tree

   - Add support for GICv5 Per Processor Interrupts (PPIs), as the
     starting point for supporting the new GIC architecture in KVM

   - Finally add support for pKVM protected guests, where pages are
     unmapped from the host as they are faulted into the guest and can
     be shared back from the guest using pKVM hypercalls. Protected
     guests are created using a new machine type identifier. As the
     elusive guestmem has not yet delivered on its promises, anonymous
     memory is also supported

     This is only a first step towards full isolation from the host; for
     example, the CPU register state and DMA accesses are not yet
     isolated. Because this does not really yet bring fully what it
     promises, it is hidden behind CONFIG_ARM_PKVM_GUEST +
     'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is
     created. Caveat emptor

   - Rework the dreaded user_mem_abort() function to make it more
     maintainable, reducing the amount of state being exposed to the
     various helpers and rendering a substantial amount of state
     immutable

   - Expand the Stage-2 page table dumper to support NV shadow page
     tables on a per-VM basis

   - Tidy up the pKVM PSCI proxy code to be slightly less hard to
     follow

   - Fix both SPE and TRBE in non-VHE configurations so that they do not
     generate spurious, out of context table walks that ultimately lead
     to very bad HW lockups

   - A small set of patches fixing the Stage-2 MMU freeing in error
     cases

   - Tighten-up accepted SMC immediate value to be only #0 for host
     SMCCC calls

   - The usual cleanups and other selftest churn

  LoongArch:

   - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel()

   - Add DMSINTC irqchip in kernel support

  RISC-V:

   - Fix steal time shared memory alignment checks

   - Fix vector context allocation leak

   - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()

   - Fix double-free of sdata in kvm_pmu_clear_snapshot_area()

   - Fix integer overflow in kvm_pmu_validate_counter_mask()

   - Fix shift-out-of-bounds in make_xfence_request()

   - Fix lost write protection on huge pages during dirty logging

   - Split huge pages during fault handling for dirty logging

   - Skip CSR restore if VCPU is reloaded on the same core

   - Implement kvm_arch_has_default_irqchip() for KVM selftests

   - Factored-out ISA checks into separate sources

   - Added hideleg to struct kvm_vcpu_config

   - Factored-out VCPU config into separate sources

   - Support configuration of per-VM HGATP mode from KVM user space

  s390:

   - Support for ESA (31-bit) guests inside nested hypervisors

   - Remove restriction on memslot alignment, which is not needed
     anymore with the new gmap code

   - Fix LPSW/E to update the bear (which of course is the breaking
     event address register)

  x86:

   - Shut up various UBSAN warnings on reading module parameter before
     they were initialized

   - Don't zero-allocate page tables that are used for splitting
     hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in
     the page table and thus write all bytes

   - As an optimization, bail early when trying to unsync 4KiB mappings
     if the target gfn can just be mapped with a 2MiB hugepage

  x86 generic:

   - Copy single-chunk MMIO write values into struct kvm_vcpu (more
     precisely struct kvm_mmio_fragment) to fix use-after-free stack
     bugs where KVM would dereference stack pointer after an exit to
     userspace

   - Clean up and comment the emulated MMIO code to try to make it
     easier to maintain (not necessarily "easy", but "easier")

   - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of
     VMX and SVM enabling) as it is needed for trusted I/O

   - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions

   - Immediately fail the build if a required #define is missing in one
     of KVM's headers that is included multiple times

   - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
     exception, mostly to prevent syzkaller from abusing the uAPI to
     trigger WARNs, but also because it can help prevent userspace from
     unintentionally crashing the VM

   - Exempt SMM from CPUID faulting on Intel, as per the spec

   - Misc hardening and cleanup changes

  x86 (AMD):

   - Fix and optimize IRQ window inhibit handling for AVIC; make it
     per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple
     vCPUs have to-be-injected IRQs

   - Clean up and optimize the OSVW handling, avoiding a bug in which
     KVM would overwrite state when enabling virtualization on multiple
     CPUs in parallel. This should not be a problem because OSVW should
     usually be the same for all CPUs

   - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains
     about a "too large" size based purely on user input

   - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION

   - Disallow synchronizing a VMSA of an already-launched/encrypted
     vCPU, as doing so for an SNP guest will crash the host due to an
     RMP violation page fault

   - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped
     queries are required to hold kvm->lock, and enforce it by lockdep.
     Fix various bugs where sev_guest() was not ensured to be stable for
     the whole duration of a function or ioctl

   - Convert a pile of kvm->lock SEV code to guard()

   - Play nicer with userspace that does not enable
     KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6
     as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the
     payload would end up in EXITINFO2 rather than CR2, for example).
     Only set CR2 and DR6 when consumption of the payload is imminent,
     but on the other hand force delivery of the payload in all paths
     where userspace retrieves CR2 or DR6

   - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT
     instead of vmcb02->save.cr2. The value is out of sync after a
     save/restore or after a #PF is injected into L2

   - Fix a class of nSVM bugs where some fields written by the CPU are
     not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so
     are not up-to-date when saved by KVM_GET_NESTED_STATE

   - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE
     and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly
     initialized after save+restore

   - Add a variety of missing nSVM consistency checks

   - Fix several bugs where KVM failed to correctly update VMCB fields
     on nested #VMEXIT

   - Fix several bugs where KVM failed to correctly synthesize #UD or
     #GP for SVM-related instructions

   - Add support for save+restore of virtualized LBRs (on SVM)

   - Refactor various helpers and macros to improve clarity and
     (hopefully) make the code easier to maintain

   - Aggressively sanitize fields when copying from vmcb12, to guard
     against unintentionally allowing L1 to utilize yet-to-be-defined
     features

   - Fix several bugs where KVM botched rAX legality checks when
     emulating SVM instructions. There are remaining issues in that KVM
     doesn't handle size prefix overrides for 64-bit guests

   - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails
     instead of somewhat arbitrarily synthesizing #GP (i.e. don't double
     down on AMD's architectural but sketchy behavior of generating #GP
     for "unsupported" addresses)

   - Cache all used vmcb12 fields to further harden against TOCTOU bugs

  x86 (Intel):

   - Drop obsolete branch hint prefixes from the VMX instruction macros

   - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
     register input when appropriate

   - Code cleanups

  guest_memfd:

   - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't
     support reclaim, the memory is unevictable, and there is no storage
     to write back to

  LoongArch selftests:

   - Add KVM PMU test cases

  s390 selftests:

   - Enable more memory selftests

  x86 selftests:

   - Add support for Hygon CPUs in KVM selftests

   - Fix a bug in the MSR test where it would get false failures on
     AMD/Hygon CPUs with exactly one of RDPID or RDTSCP

   - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test
     for a bug where the kernel would attempt to collapse guest_memfd
     folios against KVM's will"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits)
  KVM: x86: use inlines instead of macros for is_sev_*guest
  x86/virt: Treat SVM as unsupported when running as an SEV+ guest
  KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails
  KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper
  KVM: SEV: use mutex guard in snp_handle_guest_req()
  KVM: SEV: use mutex guard in sev_mem_enc_unregister_region()
  KVM: SEV: use mutex guard in sev_mem_enc_ioctl()
  KVM: SEV: use mutex guard in snp_launch_update()
  KVM: SEV: Assert that kvm->lock is held when querying SEV+ support
  KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe"
  KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y
  KVM: SEV: WARN on unhandled VM type when initializing VM
  KVM: LoongArch: selftests: Add PMU overflow interrupt test
  KVM: LoongArch: selftests: Add basic PMU event counting test
  KVM: LoongArch: selftests: Add cpucfg read/write helpers
  LoongArch: KVM: Add DMSINTC inject msi to vCPU
  LoongArch: KVM: Add DMSINTC device support
  LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function
  LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch
  LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch
  ...
2026-04-17 07:18:03 -07:00

1706 lines
42 KiB
C

// SPDX-License-Identifier: GPL-2.0-only
/*
* Stand-alone page-table allocator for hyp stage-1 and guest stage-2.
* No bombay mix was harmed in the writing of this file.
*
* Copyright (C) 2020 Google LLC
* Author: Will Deacon <will@kernel.org>
*/
#include <linux/bitfield.h>
#include <asm/kvm_pgtable.h>
#include <asm/stage2_pgtable.h>
struct kvm_pgtable_walk_data {
struct kvm_pgtable_walker *walker;
const u64 start;
u64 addr;
const u64 end;
};
static bool kvm_pgtable_walk_skip_bbm_tlbi(const struct kvm_pgtable_visit_ctx *ctx)
{
return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_BBM_TLBI);
}
static bool kvm_pgtable_walk_skip_cmo(const struct kvm_pgtable_visit_ctx *ctx)
{
return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_CMO);
}
static bool kvm_block_mapping_supported(const struct kvm_pgtable_visit_ctx *ctx, u64 phys)
{
u64 granule = kvm_granule_size(ctx->level);
if (!kvm_level_supports_block_mapping(ctx->level))
return false;
if (granule > (ctx->end - ctx->addr))
return false;
if (!IS_ALIGNED(phys, granule))
return false;
return IS_ALIGNED(ctx->addr, granule);
}
static u32 kvm_pgtable_idx(struct kvm_pgtable_walk_data *data, s8 level)
{
u64 shift = kvm_granule_shift(level);
u64 mask = BIT(PAGE_SHIFT - 3) - 1;
return (data->addr >> shift) & mask;
}
static u32 kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr)
{
u64 shift = kvm_granule_shift(pgt->start_level - 1); /* May underflow */
u64 mask = BIT(pgt->ia_bits) - 1;
return (addr & mask) >> shift;
}
static u32 kvm_pgd_pages(u32 ia_bits, s8 start_level)
{
struct kvm_pgtable pgt = {
.ia_bits = ia_bits,
.start_level = start_level,
};
return kvm_pgd_page_idx(&pgt, -1ULL) + 1;
}
static bool kvm_pte_table(kvm_pte_t pte, s8 level)
{
if (level == KVM_PGTABLE_LAST_LEVEL)
return false;
if (!kvm_pte_valid(pte))
return false;
return FIELD_GET(KVM_PTE_TYPE, pte) == KVM_PTE_TYPE_TABLE;
}
static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
{
return mm_ops->phys_to_virt(kvm_pte_to_phys(pte));
}
static void kvm_clear_pte(kvm_pte_t *ptep)
{
WRITE_ONCE(*ptep, 0);
}
static kvm_pte_t kvm_init_table_pte(kvm_pte_t *childp, struct kvm_pgtable_mm_ops *mm_ops)
{
kvm_pte_t pte = kvm_phys_to_pte(mm_ops->virt_to_phys(childp));
pte |= FIELD_PREP(KVM_PTE_TYPE, KVM_PTE_TYPE_TABLE);
pte |= KVM_PTE_VALID;
return pte;
}
static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, s8 level)
{
kvm_pte_t pte = kvm_phys_to_pte(pa);
u64 type = (level == KVM_PGTABLE_LAST_LEVEL) ? KVM_PTE_TYPE_PAGE :
KVM_PTE_TYPE_BLOCK;
pte |= attr & (KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI);
pte |= FIELD_PREP(KVM_PTE_TYPE, type);
pte |= KVM_PTE_VALID;
return pte;
}
static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data,
const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
struct kvm_pgtable_walker *walker = data->walker;
/* Ensure the appropriate lock is held (e.g. RCU lock for stage-2 MMU) */
WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx) && !kvm_pgtable_walk_lock_held());
return walker->cb(ctx, visit);
}
static bool kvm_pgtable_walk_continue(const struct kvm_pgtable_walker *walker,
int r)
{
/*
* Visitor callbacks return EAGAIN when the conditions that led to a
* fault are no longer reflected in the page tables due to a race to
* update a PTE. In the context of a fault handler this is interpreted
* as a signal to retry guest execution.
*
* Ignore the return code altogether for walkers outside a fault handler
* (e.g. write protecting a range of memory) and chug along with the
* page table walk.
*/
if (r == -EAGAIN)
return walker->flags & KVM_PGTABLE_WALK_IGNORE_EAGAIN;
return !r;
}
static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
struct kvm_pgtable_mm_ops *mm_ops, kvm_pteref_t pgtable, s8 level);
static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
struct kvm_pgtable_mm_ops *mm_ops,
kvm_pteref_t pteref, s8 level)
{
enum kvm_pgtable_walk_flags flags = data->walker->flags;
kvm_pte_t *ptep = kvm_dereference_pteref(data->walker, pteref);
struct kvm_pgtable_visit_ctx ctx = {
.ptep = ptep,
.old = READ_ONCE(*ptep),
.arg = data->walker->arg,
.mm_ops = mm_ops,
.start = data->start,
.addr = data->addr,
.end = data->end,
.level = level,
.flags = flags,
};
int ret = 0;
bool reload = false;
kvm_pteref_t childp;
bool table = kvm_pte_table(ctx.old, level);
if (table && (ctx.flags & KVM_PGTABLE_WALK_TABLE_PRE)) {
ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_TABLE_PRE);
reload = true;
}
if (!table && (ctx.flags & KVM_PGTABLE_WALK_LEAF)) {
ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_LEAF);
reload = true;
}
/*
* Reload the page table after invoking the walker callback for leaf
* entries or after pre-order traversal, to allow the walker to descend
* into a newly installed or replaced table.
*/
if (reload) {
ctx.old = READ_ONCE(*ptep);
table = kvm_pte_table(ctx.old, level);
}
if (!kvm_pgtable_walk_continue(data->walker, ret))
goto out;
if (!table) {
data->addr = ALIGN_DOWN(data->addr, kvm_granule_size(level));
data->addr += kvm_granule_size(level);
goto out;
}
childp = (kvm_pteref_t)kvm_pte_follow(ctx.old, mm_ops);
ret = __kvm_pgtable_walk(data, mm_ops, childp, level + 1);
if (!kvm_pgtable_walk_continue(data->walker, ret))
goto out;
if (ctx.flags & KVM_PGTABLE_WALK_TABLE_POST)
ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_TABLE_POST);
out:
if (kvm_pgtable_walk_continue(data->walker, ret))
return 0;
return ret;
}
static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
struct kvm_pgtable_mm_ops *mm_ops, kvm_pteref_t pgtable, s8 level)
{
u32 idx;
int ret = 0;
if (WARN_ON_ONCE(level < KVM_PGTABLE_FIRST_LEVEL ||
level > KVM_PGTABLE_LAST_LEVEL))
return -EINVAL;
for (idx = kvm_pgtable_idx(data, level); idx < PTRS_PER_PTE; ++idx) {
kvm_pteref_t pteref = &pgtable[idx];
if (data->addr >= data->end)
break;
ret = __kvm_pgtable_visit(data, mm_ops, pteref, level);
if (ret)
break;
}
return ret;
}
static int _kvm_pgtable_walk(struct kvm_pgtable *pgt, struct kvm_pgtable_walk_data *data)
{
u32 idx;
int ret = 0;
u64 limit = BIT(pgt->ia_bits);
if (data->addr > limit || data->end > limit)
return -ERANGE;
if (!pgt->pgd)
return -EINVAL;
for (idx = kvm_pgd_page_idx(pgt, data->addr); data->addr < data->end; ++idx) {
kvm_pteref_t pteref = &pgt->pgd[idx * PTRS_PER_PTE];
ret = __kvm_pgtable_walk(data, pgt->mm_ops, pteref, pgt->start_level);
if (ret)
break;
}
return ret;
}
int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
struct kvm_pgtable_walker *walker)
{
struct kvm_pgtable_walk_data walk_data = {
.start = ALIGN_DOWN(addr, PAGE_SIZE),
.addr = ALIGN_DOWN(addr, PAGE_SIZE),
.end = PAGE_ALIGN(walk_data.addr + size),
.walker = walker,
};
int r;
r = kvm_pgtable_walk_begin(walker);
if (r)
return r;
r = _kvm_pgtable_walk(pgt, &walk_data);
kvm_pgtable_walk_end(walker);
return r;
}
struct leaf_walk_data {
kvm_pte_t pte;
s8 level;
};
static int leaf_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
struct leaf_walk_data *data = ctx->arg;
data->pte = ctx->old;
data->level = ctx->level;
return 0;
}
int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
kvm_pte_t *ptep, s8 *level)
{
struct leaf_walk_data data;
struct kvm_pgtable_walker walker = {
.cb = leaf_walker,
.flags = KVM_PGTABLE_WALK_LEAF,
.arg = &data,
};
int ret;
ret = kvm_pgtable_walk(pgt, ALIGN_DOWN(addr, PAGE_SIZE),
PAGE_SIZE, &walker);
if (!ret) {
if (ptep)
*ptep = data.pte;
if (level)
*level = data.level;
}
return ret;
}
struct hyp_map_data {
const u64 phys;
kvm_pte_t attr;
};
static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
{
bool device = prot & KVM_PGTABLE_PROT_DEVICE;
u32 mtype = device ? MT_DEVICE_nGnRE : MT_NORMAL;
kvm_pte_t attr = FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX, mtype);
u32 sh = KVM_PTE_LEAF_ATTR_LO_S1_SH_IS;
u32 ap = (prot & KVM_PGTABLE_PROT_W) ? KVM_PTE_LEAF_ATTR_LO_S1_AP_RW :
KVM_PTE_LEAF_ATTR_LO_S1_AP_RO;
if (!(prot & KVM_PGTABLE_PROT_R))
return -EINVAL;
if (!cpus_have_final_cap(ARM64_KVM_HVHE))
prot &= ~KVM_PGTABLE_PROT_UX;
if (prot & KVM_PGTABLE_PROT_X) {
if (prot & KVM_PGTABLE_PROT_W)
return -EINVAL;
if (device)
return -EINVAL;
if (system_supports_bti_kernel())
attr |= KVM_PTE_LEAF_ATTR_HI_S1_GP;
}
if (cpus_have_final_cap(ARM64_KVM_HVHE)) {
if (!(prot & KVM_PGTABLE_PROT_PX))
attr |= KVM_PTE_LEAF_ATTR_HI_S1_PXN;
if (!(prot & KVM_PGTABLE_PROT_UX))
attr |= KVM_PTE_LEAF_ATTR_HI_S1_UXN;
} else {
if (!(prot & KVM_PGTABLE_PROT_PX))
attr |= KVM_PTE_LEAF_ATTR_HI_S1_XN;
}
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_AP, ap);
if (!kvm_lpa2_is_enabled())
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh);
attr |= KVM_PTE_LEAF_ATTR_LO_S1_AF;
attr |= prot & KVM_PTE_LEAF_ATTR_HI_SW;
*ptep = attr;
return 0;
}
enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte)
{
enum kvm_pgtable_prot prot = pte & KVM_PTE_LEAF_ATTR_HI_SW;
u32 ap;
if (!kvm_pte_valid(pte))
return prot;
if (cpus_have_final_cap(ARM64_KVM_HVHE)) {
if (!(pte & KVM_PTE_LEAF_ATTR_HI_S1_PXN))
prot |= KVM_PGTABLE_PROT_PX;
if (!(pte & KVM_PTE_LEAF_ATTR_HI_S1_UXN))
prot |= KVM_PGTABLE_PROT_UX;
} else {
if (!(pte & KVM_PTE_LEAF_ATTR_HI_S1_XN))
prot |= KVM_PGTABLE_PROT_PX;
}
ap = FIELD_GET(KVM_PTE_LEAF_ATTR_LO_S1_AP, pte);
if (ap == KVM_PTE_LEAF_ATTR_LO_S1_AP_RO)
prot |= KVM_PGTABLE_PROT_R;
else if (ap == KVM_PTE_LEAF_ATTR_LO_S1_AP_RW)
prot |= KVM_PGTABLE_PROT_RW;
return prot;
}
static bool hyp_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
struct hyp_map_data *data)
{
u64 phys = data->phys + (ctx->addr - ctx->start);
kvm_pte_t new;
if (!kvm_block_mapping_supported(ctx, phys))
return false;
new = kvm_init_valid_leaf_pte(phys, data->attr, ctx->level);
if (ctx->old == new)
return true;
if (!kvm_pte_valid(ctx->old))
ctx->mm_ops->get_page(ctx->ptep);
else if (WARN_ON((ctx->old ^ new) & ~KVM_PTE_LEAF_ATTR_HI_SW))
return false;
smp_store_release(ctx->ptep, new);
return true;
}
static int hyp_map_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
kvm_pte_t *childp, new;
struct hyp_map_data *data = ctx->arg;
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
if (hyp_map_walker_try_leaf(ctx, data))
return 0;
if (WARN_ON(ctx->level == KVM_PGTABLE_LAST_LEVEL))
return -EINVAL;
childp = (kvm_pte_t *)mm_ops->zalloc_page(NULL);
if (!childp)
return -ENOMEM;
new = kvm_init_table_pte(childp, mm_ops);
mm_ops->get_page(ctx->ptep);
smp_store_release(ctx->ptep, new);
return 0;
}
int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
enum kvm_pgtable_prot prot)
{
int ret;
struct hyp_map_data map_data = {
.phys = ALIGN_DOWN(phys, PAGE_SIZE),
};
struct kvm_pgtable_walker walker = {
.cb = hyp_map_walker,
.flags = KVM_PGTABLE_WALK_LEAF,
.arg = &map_data,
};
ret = hyp_set_prot_attr(prot, &map_data.attr);
if (ret)
return ret;
ret = kvm_pgtable_walk(pgt, addr, size, &walker);
dsb(ishst);
isb();
return ret;
}
static int hyp_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
kvm_pte_t *childp = NULL;
u64 granule = kvm_granule_size(ctx->level);
u64 *unmapped = ctx->arg;
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
if (!kvm_pte_valid(ctx->old))
return -EINVAL;
if (kvm_pte_table(ctx->old, ctx->level)) {
childp = kvm_pte_follow(ctx->old, mm_ops);
if (mm_ops->page_count(childp) != 1)
return 0;
kvm_clear_pte(ctx->ptep);
dsb(ishst);
__tlbi_level(vae2is, ctx->addr, TLBI_TTL_UNKNOWN);
} else {
if (ctx->end - ctx->addr < granule)
return -EINVAL;
kvm_clear_pte(ctx->ptep);
dsb(ishst);
__tlbi_level(vale2is, ctx->addr, ctx->level);
*unmapped += granule;
}
__tlbi_sync_s1ish_hyp();
isb();
mm_ops->put_page(ctx->ptep);
if (childp)
mm_ops->put_page(childp);
return 0;
}
u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
u64 unmapped = 0;
struct kvm_pgtable_walker walker = {
.cb = hyp_unmap_walker,
.arg = &unmapped,
.flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
};
if (!pgt->mm_ops->page_count)
return 0;
kvm_pgtable_walk(pgt, addr, size, &walker);
return unmapped;
}
int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
struct kvm_pgtable_mm_ops *mm_ops)
{
s8 start_level = KVM_PGTABLE_LAST_LEVEL + 1 -
ARM64_HW_PGTABLE_LEVELS(va_bits);
if (start_level < KVM_PGTABLE_FIRST_LEVEL ||
start_level > KVM_PGTABLE_LAST_LEVEL)
return -EINVAL;
pgt->pgd = (kvm_pteref_t)mm_ops->zalloc_page(NULL);
if (!pgt->pgd)
return -ENOMEM;
pgt->ia_bits = va_bits;
pgt->start_level = start_level;
pgt->mm_ops = mm_ops;
pgt->mmu = NULL;
pgt->force_pte_cb = NULL;
return 0;
}
static int hyp_free_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
if (!kvm_pte_valid(ctx->old))
return 0;
mm_ops->put_page(ctx->ptep);
if (kvm_pte_table(ctx->old, ctx->level))
mm_ops->put_page(kvm_pte_follow(ctx->old, mm_ops));
return 0;
}
void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
{
struct kvm_pgtable_walker walker = {
.cb = hyp_free_walker,
.flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
};
WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
pgt->mm_ops->put_page(kvm_dereference_pteref(&walker, pgt->pgd));
pgt->pgd = NULL;
}
struct stage2_map_data {
const u64 phys;
kvm_pte_t attr;
kvm_pte_t pte_annot;
kvm_pte_t *anchor;
kvm_pte_t *childp;
struct kvm_s2_mmu *mmu;
void *memcache;
/* Force mappings to page granularity */
bool force_pte;
/* Walk should update owner_id only */
bool annotation;
};
u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
{
u64 vtcr = VTCR_EL2_FLAGS;
s8 lvls;
vtcr |= FIELD_PREP(VTCR_EL2_PS, kvm_get_parange(mmfr0));
vtcr |= FIELD_PREP(VTCR_EL2_T0SZ, (UL(64) - phys_shift));
/*
* Use a minimum 2 level page table to prevent splitting
* host PMD huge pages at stage2.
*/
lvls = stage2_pgtable_levels(phys_shift);
if (lvls < 2)
lvls = 2;
/*
* When LPA2 is enabled, the HW supports an extra level of translation
* (for 5 in total) when using 4K pages. It also introduces VTCR_EL2.SL2
* to as an addition to SL0 to enable encoding this extra start level.
* However, since we always use concatenated pages for the first level
* lookup, we will never need this extra level and therefore do not need
* to touch SL2.
*/
vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
#ifdef CONFIG_ARM64_HW_AFDBM
/*
* Enable the Hardware Access Flag management, unconditionally
* on all CPUs. In systems that have asymmetric support for the feature
* this allows KVM to leverage hardware support on the subset of cores
* that implement the feature.
*
* The architecture requires VTCR_EL2.HA to be RES0 (thus ignored by
* hardware) on implementations that do not advertise support for the
* feature. As such, setting HA unconditionally is safe, unless you
* happen to be running on a design that has unadvertised support for
* HAFDBS. Here be dragons.
*/
if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38))
vtcr |= VTCR_EL2_HA;
#endif /* CONFIG_ARM64_HW_AFDBM */
if (kvm_lpa2_is_enabled())
vtcr |= VTCR_EL2_DS;
/* Set the vmid bits */
vtcr |= (get_vmid_bits(mmfr1) == 16) ? VTCR_EL2_VS : 0;
return vtcr;
}
void kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
phys_addr_t addr, size_t size)
{
unsigned long pages, inval_pages;
if (!system_supports_tlb_range()) {
kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
return;
}
pages = size >> PAGE_SHIFT;
while (pages > 0) {
inval_pages = min(pages, MAX_TLBI_RANGE_PAGES);
kvm_call_hyp(__kvm_tlb_flush_vmid_range, mmu, addr, inval_pages);
addr += inval_pages << PAGE_SHIFT;
pages -= inval_pages;
}
}
#define KVM_S2_MEMATTR(pgt, attr) \
({ \
kvm_pte_t __attr; \
\
if ((pgt)->flags & KVM_PGTABLE_S2_AS_S1) \
__attr = PAGE_S2_MEMATTR(AS_S1); \
else \
__attr = PAGE_S2_MEMATTR(attr); \
\
__attr; \
})
static int stage2_set_xn_attr(enum kvm_pgtable_prot prot, kvm_pte_t *attr)
{
bool px, ux;
u8 xn;
px = prot & KVM_PGTABLE_PROT_PX;
ux = prot & KVM_PGTABLE_PROT_UX;
if (!cpus_have_final_cap(ARM64_HAS_XNX) && px != ux)
return -EINVAL;
if (px && ux)
xn = 0b00;
else if (!px && ux)
xn = 0b01;
else if (!px && !ux)
xn = 0b10;
else
xn = 0b11;
*attr &= ~KVM_PTE_LEAF_ATTR_HI_S2_XN;
*attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_HI_S2_XN, xn);
return 0;
}
static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot prot,
kvm_pte_t *ptep)
{
kvm_pte_t attr;
u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
int r;
switch (prot & (KVM_PGTABLE_PROT_DEVICE |
KVM_PGTABLE_PROT_NORMAL_NC)) {
case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
return -EINVAL;
case KVM_PGTABLE_PROT_DEVICE:
if (prot & KVM_PGTABLE_PROT_X)
return -EINVAL;
attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE);
break;
case KVM_PGTABLE_PROT_NORMAL_NC:
if (prot & KVM_PGTABLE_PROT_X)
return -EINVAL;
attr = KVM_S2_MEMATTR(pgt, NORMAL_NC);
break;
default:
attr = KVM_S2_MEMATTR(pgt, NORMAL);
}
r = stage2_set_xn_attr(prot, &attr);
if (r)
return r;
if (prot & KVM_PGTABLE_PROT_R)
attr |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R;
if (prot & KVM_PGTABLE_PROT_W)
attr |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
if (!kvm_lpa2_is_enabled())
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
attr |= prot & KVM_PTE_LEAF_ATTR_HI_SW;
*ptep = attr;
return 0;
}
enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte)
{
enum kvm_pgtable_prot prot = pte & KVM_PTE_LEAF_ATTR_HI_SW;
if (!kvm_pte_valid(pte))
return prot;
if (pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R)
prot |= KVM_PGTABLE_PROT_R;
if (pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W)
prot |= KVM_PGTABLE_PROT_W;
switch (FIELD_GET(KVM_PTE_LEAF_ATTR_HI_S2_XN, pte)) {
case 0b00:
prot |= KVM_PGTABLE_PROT_PX | KVM_PGTABLE_PROT_UX;
break;
case 0b01:
prot |= KVM_PGTABLE_PROT_UX;
break;
case 0b11:
prot |= KVM_PGTABLE_PROT_PX;
break;
default:
break;
}
return prot;
}
static bool stage2_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
{
if (!kvm_pte_valid(old) || !kvm_pte_valid(new))
return true;
return ((old ^ new) & (~KVM_PTE_LEAF_ATTR_S2_PERMS));
}
static bool stage2_pte_is_counted(kvm_pte_t pte)
{
/*
* The refcount tracks valid entries as well as invalid entries if they
* encode ownership of a page to another entity than the page-table
* owner, whose id is 0.
*/
return !!pte;
}
static bool stage2_pte_is_locked(kvm_pte_t pte)
{
if (kvm_pte_valid(pte))
return false;
return FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) ==
KVM_INVALID_PTE_TYPE_LOCKED;
}
static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
{
if (!kvm_pgtable_walk_shared(ctx)) {
WRITE_ONCE(*ctx->ptep, new);
return true;
}
return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
}
/**
* stage2_try_break_pte() - Invalidates a pte according to the
* 'break-before-make' requirements of the
* architecture.
*
* @ctx: context of the visited pte.
* @mmu: stage-2 mmu
*
* Returns: true if the pte was successfully broken.
*
* If the removed pte was valid, performs the necessary serialization and TLB
* invalidation for the old value. For counted ptes, drops the reference count
* on the containing table page.
*/
static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
struct kvm_s2_mmu *mmu)
{
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
kvm_pte_t locked_pte;
if (stage2_pte_is_locked(ctx->old)) {
/*
* Should never occur if this walker has exclusive access to the
* page tables.
*/
WARN_ON(!kvm_pgtable_walk_shared(ctx));
return false;
}
locked_pte = FIELD_PREP(KVM_INVALID_PTE_TYPE_MASK,
KVM_INVALID_PTE_TYPE_LOCKED);
if (!stage2_try_set_pte(ctx, locked_pte))
return false;
if (!kvm_pgtable_walk_skip_bbm_tlbi(ctx)) {
/*
* Perform the appropriate TLB invalidation based on the
* evicted pte value (if any).
*/
if (kvm_pte_table(ctx->old, ctx->level)) {
u64 size = kvm_granule_size(ctx->level);
u64 addr = ALIGN_DOWN(ctx->addr, size);
kvm_tlb_flush_vmid_range(mmu, addr, size);
} else if (kvm_pte_valid(ctx->old)) {
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
ctx->addr, ctx->level);
}
}
if (stage2_pte_is_counted(ctx->old))
mm_ops->put_page(ctx->ptep);
return true;
}
static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
{
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
WARN_ON(!stage2_pte_is_locked(*ctx->ptep));
if (stage2_pte_is_counted(new))
mm_ops->get_page(ctx->ptep);
smp_store_release(ctx->ptep, new);
}
static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
{
/*
* If FEAT_TLBIRANGE is implemented, defer the individual
* TLB invalidations until the entire walk is finished, and
* then use the range-based TLBI instructions to do the
* invalidations. Condition deferred TLB invalidation on the
* system supporting FWB as the optimization is entirely
* pointless when the unmap walker needs to perform CMOs.
*/
return system_supports_tlb_range() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
}
static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
struct kvm_s2_mmu *mmu,
struct kvm_pgtable_mm_ops *mm_ops)
{
struct kvm_pgtable *pgt = ctx->arg;
/*
* Clear the existing PTE, and perform break-before-make if it was
* valid. Depending on the system support, defer the TLB maintenance
* for the same until the entire unmap walk is completed.
*/
if (kvm_pte_valid(ctx->old)) {
kvm_clear_pte(ctx->ptep);
if (kvm_pte_table(ctx->old, ctx->level)) {
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr,
TLBI_TTL_UNKNOWN);
} else if (!stage2_unmap_defer_tlb_flush(pgt)) {
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr,
ctx->level);
}
}
mm_ops->put_page(ctx->ptep);
}
static bool stage2_pte_cacheable(struct kvm_pgtable *pgt, kvm_pte_t pte)
{
u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
return kvm_pte_valid(pte) && memattr == KVM_S2_MEMATTR(pgt, NORMAL);
}
static bool stage2_pte_executable(kvm_pte_t pte)
{
return kvm_pte_valid(pte) && !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
}
static u64 stage2_map_walker_phys_addr(const struct kvm_pgtable_visit_ctx *ctx,
const struct stage2_map_data *data)
{
u64 phys = data->phys;
/* Work out the correct PA based on how far the walk has gotten */
return phys + (ctx->addr - ctx->start);
}
static bool stage2_leaf_mapping_allowed(const struct kvm_pgtable_visit_ctx *ctx,
struct stage2_map_data *data)
{
u64 phys = stage2_map_walker_phys_addr(ctx, data);
if (data->force_pte && ctx->level < KVM_PGTABLE_LAST_LEVEL)
return false;
if (data->annotation)
return true;
return kvm_block_mapping_supported(ctx, phys);
}
static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
struct stage2_map_data *data)
{
kvm_pte_t new;
u64 phys = stage2_map_walker_phys_addr(ctx, data);
u64 granule = kvm_granule_size(ctx->level);
struct kvm_pgtable *pgt = data->mmu->pgt;
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
if (!stage2_leaf_mapping_allowed(ctx, data))
return -E2BIG;
if (!data->annotation)
new = kvm_init_valid_leaf_pte(phys, data->attr, ctx->level);
else
new = data->pte_annot;
/*
* Skip updating the PTE if we are trying to recreate the exact
* same mapping or only change the access permissions. Instead,
* the vCPU will exit one more time from guest if still needed
* and then go through the path of relaxing permissions.
*/
if (!stage2_pte_needs_update(ctx->old, new))
return -EAGAIN;
/* If we're only changing software bits, then store them and go! */
if (!kvm_pgtable_walk_shared(ctx) &&
!((ctx->old ^ new) & ~KVM_PTE_LEAF_ATTR_HI_SW)) {
bool old_is_counted = stage2_pte_is_counted(ctx->old);
if (old_is_counted != stage2_pte_is_counted(new)) {
if (old_is_counted)
mm_ops->put_page(ctx->ptep);
else
mm_ops->get_page(ctx->ptep);
}
WARN_ON_ONCE(!stage2_try_set_pte(ctx, new));
return 0;
}
if (!stage2_try_break_pte(ctx, data->mmu))
return -EAGAIN;
/* Perform CMOs before installation of the guest stage-2 PTE */
if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->dcache_clean_inval_poc &&
stage2_pte_cacheable(pgt, new))
mm_ops->dcache_clean_inval_poc(kvm_pte_follow(new, mm_ops),
granule);
if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->icache_inval_pou &&
stage2_pte_executable(new))
mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
stage2_make_pte(ctx, new);
return 0;
}
static int stage2_map_walk_table_pre(const struct kvm_pgtable_visit_ctx *ctx,
struct stage2_map_data *data)
{
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
int ret;
if (!stage2_leaf_mapping_allowed(ctx, data))
return 0;
ret = stage2_map_walker_try_leaf(ctx, data);
if (ret)
return ret;
mm_ops->free_unlinked_table(childp, ctx->level);
return 0;
}
static int stage2_map_walk_leaf(const struct kvm_pgtable_visit_ctx *ctx,
struct stage2_map_data *data)
{
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
kvm_pte_t *childp, new;
int ret;
ret = stage2_map_walker_try_leaf(ctx, data);
if (ret != -E2BIG)
return ret;
if (WARN_ON(ctx->level == KVM_PGTABLE_LAST_LEVEL))
return -EINVAL;
if (!data->memcache)
return -ENOMEM;
childp = mm_ops->zalloc_page(data->memcache);
if (!childp)
return -ENOMEM;
if (!stage2_try_break_pte(ctx, data->mmu)) {
mm_ops->put_page(childp);
return -EAGAIN;
}
/*
* If we've run into an existing block mapping then replace it with
* a table. Accesses beyond 'end' that fall within the new table
* will be mapped lazily.
*/
new = kvm_init_table_pte(childp, mm_ops);
stage2_make_pte(ctx, new);
return 0;
}
/*
* The TABLE_PRE callback runs for table entries on the way down, looking
* for table entries which we could conceivably replace with a block entry
* for this mapping. If it finds one it replaces the entry and calls
* kvm_pgtable_mm_ops::free_unlinked_table() to tear down the detached table.
*
* Otherwise, the LEAF callback performs the mapping at the existing leaves
* instead.
*/
static int stage2_map_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
struct stage2_map_data *data = ctx->arg;
switch (visit) {
case KVM_PGTABLE_WALK_TABLE_PRE:
return stage2_map_walk_table_pre(ctx, data);
case KVM_PGTABLE_WALK_LEAF:
return stage2_map_walk_leaf(ctx, data);
default:
return -EINVAL;
}
}
int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
u64 phys, enum kvm_pgtable_prot prot,
void *mc, enum kvm_pgtable_walk_flags flags)
{
int ret;
struct stage2_map_data map_data = {
.phys = ALIGN_DOWN(phys, PAGE_SIZE),
.mmu = pgt->mmu,
.memcache = mc,
.force_pte = pgt->force_pte_cb && pgt->force_pte_cb(addr, addr + size, prot),
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
.flags = flags |
KVM_PGTABLE_WALK_TABLE_PRE |
KVM_PGTABLE_WALK_LEAF,
.arg = &map_data,
};
if (WARN_ON((pgt->flags & KVM_PGTABLE_S2_IDMAP) && (addr != phys)))
return -EINVAL;
ret = stage2_set_prot_attr(pgt, prot, &map_data.attr);
if (ret)
return ret;
ret = kvm_pgtable_walk(pgt, addr, size, &walker);
dsb(ishst);
return ret;
}
int kvm_pgtable_stage2_annotate(struct kvm_pgtable *pgt, u64 addr, u64 size,
void *mc, enum kvm_invalid_pte_type type,
kvm_pte_t pte_annot)
{
int ret;
struct stage2_map_data map_data = {
.mmu = pgt->mmu,
.memcache = mc,
.force_pte = true,
.annotation = true,
.pte_annot = pte_annot |
FIELD_PREP(KVM_INVALID_PTE_TYPE_MASK, type),
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
.flags = KVM_PGTABLE_WALK_TABLE_PRE |
KVM_PGTABLE_WALK_LEAF,
.arg = &map_data,
};
if (pte_annot & ~KVM_INVALID_PTE_ANNOT_MASK)
return -EINVAL;
if (!type || type == KVM_INVALID_PTE_TYPE_LOCKED)
return -EINVAL;
ret = kvm_pgtable_walk(pgt, addr, size, &walker);
return ret;
}
static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
struct kvm_pgtable *pgt = ctx->arg;
struct kvm_s2_mmu *mmu = pgt->mmu;
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
kvm_pte_t *childp = NULL;
bool need_flush = false;
if (!kvm_pte_valid(ctx->old)) {
if (stage2_pte_is_counted(ctx->old)) {
kvm_clear_pte(ctx->ptep);
mm_ops->put_page(ctx->ptep);
}
return 0;
}
if (kvm_pte_table(ctx->old, ctx->level)) {
childp = kvm_pte_follow(ctx->old, mm_ops);
if (mm_ops->page_count(childp) != 1)
return 0;
} else if (stage2_pte_cacheable(pgt, ctx->old)) {
need_flush = !cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
}
/*
* This is similar to the map() path in that we unmap the entire
* block entry and rely on the remaining portions being faulted
* back lazily.
*/
stage2_unmap_put_pte(ctx, mmu, mm_ops);
if (need_flush && mm_ops->dcache_clean_inval_poc)
mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
kvm_granule_size(ctx->level));
if (childp)
mm_ops->put_page(childp);
return 0;
}
int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
int ret;
struct kvm_pgtable_walker walker = {
.cb = stage2_unmap_walker,
.arg = pgt,
.flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
};
ret = kvm_pgtable_walk(pgt, addr, size, &walker);
if (stage2_unmap_defer_tlb_flush(pgt))
/* Perform the deferred TLB invalidations */
kvm_tlb_flush_vmid_range(pgt->mmu, addr, size);
return ret;
}
struct stage2_attr_data {
kvm_pte_t attr_set;
kvm_pte_t attr_clr;
kvm_pte_t pte;
s8 level;
};
static int stage2_attr_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
kvm_pte_t pte = ctx->old;
struct stage2_attr_data *data = ctx->arg;
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
if (!kvm_pte_valid(ctx->old))
return -EAGAIN;
data->level = ctx->level;
data->pte = pte;
pte &= ~data->attr_clr;
pte |= data->attr_set;
/*
* We may race with the CPU trying to set the access flag here,
* but worst-case the access flag update gets lost and will be
* set on the next access instead.
*/
if (data->pte != pte) {
/*
* Invalidate instruction cache before updating the guest
* stage-2 PTE if we are going to add executable permission.
*/
if (mm_ops->icache_inval_pou &&
stage2_pte_executable(pte) && !stage2_pte_executable(ctx->old))
mm_ops->icache_inval_pou(kvm_pte_follow(pte, mm_ops),
kvm_granule_size(ctx->level));
if (!stage2_try_set_pte(ctx, pte))
return -EAGAIN;
}
return 0;
}
static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
u64 size, kvm_pte_t attr_set,
kvm_pte_t attr_clr, kvm_pte_t *orig_pte,
s8 *level, enum kvm_pgtable_walk_flags flags)
{
int ret;
kvm_pte_t attr_mask = KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI;
struct stage2_attr_data data = {
.attr_set = attr_set & attr_mask,
.attr_clr = attr_clr & attr_mask,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_attr_walker,
.arg = &data,
.flags = flags | KVM_PGTABLE_WALK_LEAF,
};
ret = kvm_pgtable_walk(pgt, addr, size, &walker);
if (ret)
return ret;
if (orig_pte)
*orig_pte = data.pte;
if (level)
*level = data.level;
return 0;
}
int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
return stage2_update_leaf_attrs(pgt, addr, size, 0,
KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
NULL, NULL,
KVM_PGTABLE_WALK_IGNORE_EAGAIN);
}
void kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr,
enum kvm_pgtable_walk_flags flags)
{
int ret;
ret = stage2_update_leaf_attrs(pgt, addr, 1, KVM_PTE_LEAF_ATTR_LO_S2_AF, 0,
NULL, NULL, flags);
if (!ret)
dsb(ishst);
}
struct stage2_age_data {
bool mkold;
bool young;
};
static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
struct stage2_age_data *data = ctx->arg;
if (!kvm_pte_valid(ctx->old) || new == ctx->old)
return 0;
data->young = true;
/*
* stage2_age_walker() is always called while holding the MMU lock for
* write, so this will always succeed. Nonetheless, this deliberately
* follows the race detection pattern of the other stage-2 walkers in
* case the locking mechanics of the MMU notifiers is ever changed.
*/
if (data->mkold && !stage2_try_set_pte(ctx, new))
return -EAGAIN;
/*
* "But where's the TLBI?!", you scream.
* "Over in the core code", I sigh.
*
* See the '->clear_flush_young()' callback on the KVM mmu notifier.
*/
return 0;
}
bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
u64 size, bool mkold)
{
struct stage2_age_data data = {
.mkold = mkold,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_age_walker,
.arg = &data,
.flags = KVM_PGTABLE_WALK_LEAF,
};
WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
return data.young;
}
int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
enum kvm_pgtable_prot prot, enum kvm_pgtable_walk_flags flags)
{
kvm_pte_t xn = 0, set = 0, clr = 0;
s8 level;
int ret;
if (prot & KVM_PTE_LEAF_ATTR_HI_SW)
return -EINVAL;
if (prot & KVM_PGTABLE_PROT_R)
set |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R;
if (prot & KVM_PGTABLE_PROT_W)
set |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
ret = stage2_set_xn_attr(prot, &xn);
if (ret)
return ret;
set |= xn & KVM_PTE_LEAF_ATTR_HI_S2_XN;
clr |= ~xn & KVM_PTE_LEAF_ATTR_HI_S2_XN;
ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level, flags);
if (!ret || ret == -EAGAIN)
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level);
return ret;
}
static int stage2_flush_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
struct kvm_pgtable *pgt = ctx->arg;
struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
if (!stage2_pte_cacheable(pgt, ctx->old))
return 0;
if (mm_ops->dcache_clean_inval_poc)
mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
kvm_granule_size(ctx->level));
return 0;
}
int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
struct kvm_pgtable_walker walker = {
.cb = stage2_flush_walker,
.flags = KVM_PGTABLE_WALK_LEAF,
.arg = pgt,
};
if (cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
return 0;
return kvm_pgtable_walk(pgt, addr, size, &walker);
}
kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
u64 phys, s8 level,
enum kvm_pgtable_prot prot,
void *mc, bool force_pte)
{
struct stage2_map_data map_data = {
.phys = phys,
.mmu = pgt->mmu,
.memcache = mc,
.force_pte = force_pte,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
.flags = KVM_PGTABLE_WALK_LEAF |
KVM_PGTABLE_WALK_SKIP_BBM_TLBI |
KVM_PGTABLE_WALK_SKIP_CMO,
.arg = &map_data,
};
/*
* The input address (.addr) is irrelevant for walking an
* unlinked table. Construct an ambiguous IA range to map
* kvm_granule_size(level) worth of memory.
*/
struct kvm_pgtable_walk_data data = {
.walker = &walker,
.addr = 0,
.end = kvm_granule_size(level),
};
struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
kvm_pte_t *pgtable;
int ret;
if (!IS_ALIGNED(phys, kvm_granule_size(level)))
return ERR_PTR(-EINVAL);
ret = stage2_set_prot_attr(pgt, prot, &map_data.attr);
if (ret)
return ERR_PTR(ret);
pgtable = mm_ops->zalloc_page(mc);
if (!pgtable)
return ERR_PTR(-ENOMEM);
ret = __kvm_pgtable_walk(&data, mm_ops, (kvm_pteref_t)pgtable,
level + 1);
if (ret) {
kvm_pgtable_stage2_free_unlinked(mm_ops, pgtable, level);
return ERR_PTR(ret);
}
return pgtable;
}
/*
* Get the number of page-tables needed to replace a block with a
* fully populated tree up to the PTE entries. Note that @level is
* interpreted as in "level @level entry".
*/
static int stage2_block_get_nr_page_tables(s8 level)
{
switch (level) {
case 1:
return PTRS_PER_PTE + 1;
case 2:
return 1;
case 3:
return 0;
default:
WARN_ON_ONCE(level < KVM_PGTABLE_MIN_BLOCK_LEVEL ||
level > KVM_PGTABLE_LAST_LEVEL);
return -EINVAL;
};
}
static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
struct kvm_mmu_memory_cache *mc = ctx->arg;
struct kvm_s2_mmu *mmu;
kvm_pte_t pte = ctx->old, new, *childp;
enum kvm_pgtable_prot prot;
s8 level = ctx->level;
bool force_pte;
int nr_pages;
u64 phys;
/* No huge-pages exist at the last level */
if (level == KVM_PGTABLE_LAST_LEVEL)
return 0;
/* We only split valid block mappings */
if (!kvm_pte_valid(pte))
return 0;
nr_pages = stage2_block_get_nr_page_tables(level);
if (nr_pages < 0)
return nr_pages;
if (mc->nobjs >= nr_pages) {
/* Build a tree mapped down to the PTE granularity. */
force_pte = true;
} else {
/*
* Don't force PTEs, so create_unlinked() below does
* not populate the tree up to the PTE level. The
* consequence is that the call will require a single
* page of level 2 entries at level 1, or a single
* page of PTEs at level 2. If we are at level 1, the
* PTEs will be created recursively.
*/
force_pte = false;
nr_pages = 1;
}
if (mc->nobjs < nr_pages)
return -ENOMEM;
mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
phys = kvm_pte_to_phys(pte);
prot = kvm_pgtable_stage2_pte_prot(pte);
childp = kvm_pgtable_stage2_create_unlinked(mmu->pgt, phys,
level, prot, mc, force_pte);
if (IS_ERR(childp))
return PTR_ERR(childp);
if (!stage2_try_break_pte(ctx, mmu)) {
kvm_pgtable_stage2_free_unlinked(mm_ops, childp, level);
return -EAGAIN;
}
/*
* Note, the contents of the page table are guaranteed to be made
* visible before the new PTE is assigned because stage2_make_pte()
* writes the PTE using smp_store_release().
*/
new = kvm_init_table_pte(childp, mm_ops);
stage2_make_pte(ctx, new);
return 0;
}
int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
struct kvm_mmu_memory_cache *mc)
{
struct kvm_pgtable_walker walker = {
.cb = stage2_split_walker,
.flags = KVM_PGTABLE_WALK_LEAF,
.arg = mc,
};
int ret;
ret = kvm_pgtable_walk(pgt, addr, size, &walker);
dsb(ishst);
return ret;
}
int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
struct kvm_pgtable_mm_ops *mm_ops,
enum kvm_pgtable_stage2_flags flags,
kvm_pgtable_force_pte_cb_t force_pte_cb)
{
size_t pgd_sz;
u64 vtcr = mmu->vtcr;
u32 ia_bits = VTCR_EL2_IPA(vtcr);
u32 sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, vtcr);
s8 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0;
pgd_sz = kvm_pgd_pages(ia_bits, start_level) * PAGE_SIZE;
pgt->pgd = (kvm_pteref_t)mm_ops->zalloc_pages_exact(pgd_sz);
if (!pgt->pgd)
return -ENOMEM;
pgt->ia_bits = ia_bits;
pgt->start_level = start_level;
pgt->mm_ops = mm_ops;
pgt->mmu = mmu;
pgt->flags = flags;
pgt->force_pte_cb = force_pte_cb;
/* Ensure zeroed PGD pages are visible to the hardware walker */
dsb(ishst);
return 0;
}
size_t kvm_pgtable_stage2_pgd_size(u64 vtcr)
{
u32 ia_bits = VTCR_EL2_IPA(vtcr);
u32 sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, vtcr);
s8 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0;
return kvm_pgd_pages(ia_bits, start_level) * PAGE_SIZE;
}
static int stage2_free_leaf(const struct kvm_pgtable_visit_ctx *ctx)
{
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
mm_ops->put_page(ctx->ptep);
return 0;
}
static int stage2_free_table_post(const struct kvm_pgtable_visit_ctx *ctx)
{
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
if (mm_ops->page_count(childp) != 1)
return 0;
/*
* Drop references and clear the now stale PTE to avoid rewalking the
* freed page table.
*/
mm_ops->put_page(ctx->ptep);
mm_ops->put_page(childp);
kvm_clear_pte(ctx->ptep);
return 0;
}
static int stage2_free_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
if (!stage2_pte_is_counted(ctx->old))
return 0;
switch (visit) {
case KVM_PGTABLE_WALK_LEAF:
return stage2_free_leaf(ctx);
case KVM_PGTABLE_WALK_TABLE_POST:
return stage2_free_table_post(ctx);
default:
return -EINVAL;
}
}
void kvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
u64 addr, u64 size)
{
struct kvm_pgtable_walker walker = {
.cb = stage2_free_walker,
.flags = KVM_PGTABLE_WALK_LEAF |
KVM_PGTABLE_WALK_TABLE_POST,
};
WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
}
void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
{
size_t pgd_sz;
pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
/*
* Since the pgtable is unlinked at this point, and not shared with
* other walkers, safely deference pgd with kvm_dereference_pteref_raw()
*/
pgt->mm_ops->free_pages_exact(kvm_dereference_pteref_raw(pgt->pgd), pgd_sz);
pgt->pgd = NULL;
}
void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
{
kvm_pgtable_stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
kvm_pgtable_stage2_destroy_pgd(pgt);
}
void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, s8 level)
{
kvm_pteref_t ptep = (kvm_pteref_t)pgtable;
struct kvm_pgtable_walker walker = {
.cb = stage2_free_walker,
.flags = KVM_PGTABLE_WALK_LEAF |
KVM_PGTABLE_WALK_TABLE_POST,
};
struct kvm_pgtable_walk_data data = {
.walker = &walker,
/*
* At this point the IPA really doesn't matter, as the page
* table being traversed has already been removed from the stage
* 2. Set an appropriate range to cover the entire page table.
*/
.addr = 0,
.end = kvm_granule_size(level),
};
WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
WARN_ON(mm_ops->page_count(pgtable) != 1);
mm_ops->put_page(pgtable);
}