Files
linux/arch/x86/virt/vmx/tdx/tdx.h
Kai Huang 70060463cb x86/mce: Differentiate real hardware #MCs from TDX erratum ones
The first few generations of TDX hardware have an erratum.  Triggering
it in Linux requires some kind of kernel bug involving relatively exotic
memory writes to TDX private memory and will manifest via
spurious-looking machine checks when reading the affected memory.

Make an effort to detect these TDX-induced machine checks and spit out
a new blurb to dmesg so folks do not think their hardware is failing.

== Background ==

Virtually all kernel memory accesses operations happen in full
cachelines.  In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes where a write transaction
of less than cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.

== Problem ==

A partial write to a TDX private memory cacheline will silently "poison"
the line.  Subsequent reads will consume the poison and generate a
machine check.  According to the TDX hardware spec, neither of these
things should have happened.

To add insult to injury, the Linux machine code will present these as a
literal "Hardware error" when they were, in fact, a software-triggered
issue.

== Solution ==

In the end, this issue is hard to trigger.  Rather than do something
rash (and incomplete) like unmap TDX private memory from the direct map,
improve the machine check handler.

Currently, the #MC handler doesn't distinguish whether the memory is
TDX private memory or not but just dump, for instance, below message:

 [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
 [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
 	...
 [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] Kernel panic - not syncing: Fatal local machine check

Which says "Hardware Error" and "Data load in unrecoverable area of
kernel".

Ideally, it's better for the log to say "software bug around TDX private
memory" instead of "Hardware Error".  But in reality the real hardware
memory error can happen, and sadly such software-triggered #MC cannot be
distinguished from the real hardware error.  Also, the error message is
used by userspace tool 'mcelog' to parse, so changing the output may
break userspace.

So keep the "Hardware Error".  The "Data load in unrecoverable area of
kernel" is also helpful, so keep it too.

Instead of modifying above error log, improve the error log by printing
additional TDX related message to make the log like:

  ...
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.

Adding this additional message requires determination of whether the
memory page is TDX private memory.  There is no existing infrastructure
to do that.  Add an interface to query the TDX module to fill this gap.

== Impact ==

This issue requires some kind of kernel bug to trigger.

TDX private memory should never be mapped UC/WC.  A partial write
originating from these mappings would require *two* bugs, first mapping
the wrong page, then writing the wrong memory.  It would also be
detectable using traditional memory corruption techniques like
DEBUG_PAGEALLOC.

MOVNTI (and friends) could cause this issue with something like a simple
buffer overrun or use-after-free on the direct map.  It should also be
detectable with normal debug techniques.

The one place where this might get nasty would be if the CPU read data
then wrote back the same data.  That would trigger this problem but
would not, for instance, set off mechanisms like slab redzoning because
it doesn't actually corrupt data.

With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
TDX private memory would first need to be incorrectly mapped into the
I/O space and then a later DMA to that mapping would actually cause the
poisoning event.

[ dhansen: changelog tweaks ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com
2023-12-12 08:46:46 -08:00

122 lines
3.0 KiB
C

/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _X86_VIRT_TDX_H
#define _X86_VIRT_TDX_H
#include <linux/bits.h>
/*
* This file contains both macros and data structures defined by the TDX
* architecture and Linux defined software data structures and functions.
* The two should not be mixed together for better readability. The
* architectural definitions come first.
*/
/*
* TDX module SEAMCALL leaf functions
*/
#define TDH_PHYMEM_PAGE_RDMD 24
#define TDH_SYS_KEY_CONFIG 31
#define TDH_SYS_INIT 33
#define TDH_SYS_RD 34
#define TDH_SYS_LP_INIT 35
#define TDH_SYS_TDMR_INIT 36
#define TDH_SYS_CONFIG 45
/* TDX page types */
#define PT_NDA 0x0
#define PT_RSVD 0x1
/*
* Global scope metadata field ID.
*
* See Table "Global Scope Metadata", TDX module 1.5 ABI spec.
*/
#define MD_FIELD_ID_MAX_TDMRS 0x9100000100000008ULL
#define MD_FIELD_ID_MAX_RESERVED_PER_TDMR 0x9100000100000009ULL
#define MD_FIELD_ID_PAMT_4K_ENTRY_SIZE 0x9100000100000010ULL
#define MD_FIELD_ID_PAMT_2M_ENTRY_SIZE 0x9100000100000011ULL
#define MD_FIELD_ID_PAMT_1G_ENTRY_SIZE 0x9100000100000012ULL
/*
* Sub-field definition of metadata field ID.
*
* See Table "MD_FIELD_ID (Metadata Field Identifier / Sequence Header)
* Definition", TDX module 1.5 ABI spec.
*
* - Bit 33:32: ELEMENT_SIZE_CODE -- size of a single element of metadata
*
* 0: 8 bits
* 1: 16 bits
* 2: 32 bits
* 3: 64 bits
*/
#define MD_FIELD_ID_ELE_SIZE_CODE(_field_id) \
(((_field_id) & GENMASK_ULL(33, 32)) >> 32)
#define MD_FIELD_ID_ELE_SIZE_16BIT 1
struct tdmr_reserved_area {
u64 offset;
u64 size;
} __packed;
#define TDMR_INFO_ALIGNMENT 512
#define TDMR_INFO_PA_ARRAY_ALIGNMENT 512
struct tdmr_info {
u64 base;
u64 size;
u64 pamt_1g_base;
u64 pamt_1g_size;
u64 pamt_2m_base;
u64 pamt_2m_size;
u64 pamt_4k_base;
u64 pamt_4k_size;
/*
* The actual number of reserved areas depends on the value of
* field MD_FIELD_ID_MAX_RESERVED_PER_TDMR in the TDX module
* global metadata.
*/
DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas);
} __packed __aligned(TDMR_INFO_ALIGNMENT);
/*
* Do not put any hardware-defined TDX structure representations below
* this comment!
*/
/* Kernel defined TDX module status during module initialization. */
enum tdx_module_status_t {
TDX_MODULE_UNINITIALIZED,
TDX_MODULE_INITIALIZED,
TDX_MODULE_ERROR
};
struct tdx_memblock {
struct list_head list;
unsigned long start_pfn;
unsigned long end_pfn;
int nid;
};
/* "TDMR info" part of "Global Scope Metadata" for constructing TDMRs */
struct tdx_tdmr_sysinfo {
u16 max_tdmrs;
u16 max_reserved_per_tdmr;
u16 pamt_entry_size[TDX_PS_NR];
};
/* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */
#define TDMR_NR_WARN 4
struct tdmr_info_list {
void *tdmrs; /* Flexible array to hold 'tdmr_info's */
int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */
/* Metadata for finding target 'tdmr_info' and freeing @tdmrs */
int tdmr_sz; /* Size of one 'tdmr_info' */
int max_tdmrs; /* How many 'tdmr_info's are allocated */
};
#endif