Commit Graph

372 Commits

Author SHA1 Message Date
Joe Damato
ba17de9854 iommu/amd: Block identity domain when SNP enabled
Previously, commit 8388f7df93 ("iommu/amd: Do not support
IOMMU_DOMAIN_IDENTITY after SNP is enabled") prevented users from
changing the IOMMU domain to identity if SNP was enabled.

This resulted in an error when writing to sysfs:

  # echo "identity" > /sys/kernel/iommu_groups/50/type
  -bash: echo: write error: Cannot allocate memory

However, commit 4402f2627d ("iommu/amd: Implement global identity
domain") changed the flow of the code, skipping the SNP guard and
allowing users to change the IOMMU domain to identity after a machine
has booted.

Once the user does that, they will probably try to bind and the
device/driver will start to do DMA which will trigger errors:

  iommu ivhd3: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:43:00.0 pasid=0x00000 address=0x3737b01000 flags=0x0020]
  iommu ivhd3: AMD-Vi: Control Reg : 0xc22000142148d
  AMD-Vi: DTE[0]: 6000000000000003
  AMD-Vi: DTE[1]: 0000000000000001
  AMD-Vi: DTE[2]: 2000003088b3e013
  AMD-Vi: DTE[3]: 0000000000000000
  bnxt_en 0000:43:00.0 (unnamed net_device) (uninitialized): Error (timeout: 500015) msg {0x0 0x0} len:0
  iommu ivhd3: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:43:00.0 pasid=0x00000 address=0x3737b01000 flags=0x0020]
  iommu ivhd3: AMD-Vi: Control Reg : 0xc22000142148d
  AMD-Vi: DTE[0]: 6000000000000003
  AMD-Vi: DTE[1]: 0000000000000001
  AMD-Vi: DTE[2]: 2000003088b3e013
  AMD-Vi: DTE[3]: 0000000000000000
  bnxt_en 0000:43:00.0: probe with driver bnxt_en failed with error -16

To prevent this from happening, create an attach wrapper for
identity_domain_ops which returns EINVAL if amd_iommu_snp_en is true.

With this commit applied:

  # echo "identity" > /sys/kernel/iommu_groups/62/type
  -bash: echo: write error: Invalid argument

Fixes: 4402f2627d ("iommu/amd: Implement global identity domain")
Signed-off-by: Joe Damato <joe@dama.to>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-03-17 14:02:02 +01:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Linus Torvalds
1e0ea4dff0 Merge tag 'iommu-updates-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
 "Core changes:
   - Rust bindings for IO-pgtable code
   - IOMMU page allocation debugging support
   - Disable ATS during PCI resets

  Intel VT-d changes:
   - Skip dev-iotlb flush for inaccessible PCIe device
   - Flush cache for PASID table before using it
   - Use right invalidation method for SVA and NESTED domains
   - Ensure atomicity in context and PASID entry updates

  AMD-Vi changes:
   - Support for nested translations
   - Other minor improvements

  ARM-SMMU-v2 changes:
   - Configure SoC-specific prefetcher settings for Qualcomm's "MDSS"

  ARM-SMMU-v3 changes:
   - Improve CMDQ locking fairness for pathetically small queue sizes
   - Remove tracking of the IAS as this is only relevant for AArch32 and
     was causing C_BAD_STE errors
   - Add device-tree support for NVIDIA's CMDQV extension
   - Allow some hitless transitions for the 'MEV' and 'EATS' STE fields
   - Don't disable ATS for nested S1-bypass nested domains
   - Additions to the kunit selftests"

* tag 'iommu-updates-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (54 commits)
  iommupt: Always add IOVA range to iotlb_gather in gather_range_pages()
  iommu/amd: serialize sequence allocation under concurrent TLB invalidations
  iommu/amd: Fix type of type parameter to amd_iommufd_hw_info()
  iommu/arm-smmu-v3: Do not set disable_ats unless vSTE is Translate
  iommu/arm-smmu-v3-test: Add nested s1bypass/s1dssbypass coverage
  iommu/arm-smmu-v3: Mark EATS_TRANS safe when computing the update sequence
  iommu/arm-smmu-v3: Mark STE MEV safe when computing the update sequence
  iommu/arm-smmu-v3: Add update_safe bits to fix STE update sequence
  iommu/arm-smmu-v3: Add device-tree support for CMDQV driver
  iommu/tegra241-cmdqv: Decouple driver from ACPI
  iommu/arm-smmu-qcom: Restore ACTLR settings for MDSS on sa8775p
  iommu/vt-d: Fix race condition during PASID entry replacement
  iommu/vt-d: Clear Present bit before tearing down context entry
  iommu/vt-d: Clear Present bit before tearing down PASID entry
  iommu/vt-d: Flush piotlb for SVM and Nested domain
  iommu/vt-d: Flush cache for PASID table before using it
  iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
  rust: iommu: fix `srctree` link warning
  rust: iommu: fix Rust formatting
  ...
2026-02-11 16:36:08 -08:00
Linus Torvalds
4e21e585b6 Merge tag 'irq-cleanups-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq cleanups from Thomas Gleixner:
 "A series of treewide cleanups to ensure interrupt request consistency.

   - Add the missing IRQF_COND_ONESHOT flag to devm_request_irq()

     This is inconsistent vs request_irq() and causes the same issues
     which where addressed with the introduction of this flag

   - Cleanup IRQF_ONESHOT and IRQF_NO_THREAD usage

     Quite some drivers have inconsistent interrupt request flags
     related to interrupt threading namely IRQF_ONESHOT and
     IRQF_NO_THREAD. This leads to warnings and/or malfunction when
     forced interrupt threading is enabled.

   - Remove stub primary (hard interrupt) handlers

     A bunch of drivers implement a stub primary (hard interrupt)
     handler which just returns IRQ_WAKE_THREAD. The same functionality
     is provided by the core code when the primary handler argument of
     request_thread_irq() is set to NULL"

* tag 'irq-cleanups-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  media: pci: mg4b: Use IRQF_NO_THREAD
  mfd: wm8350-core: Use IRQF_ONESHOT
  thermal/qcom/lmh: Replace IRQF_ONESHOT with IRQF_NO_THREAD
  rtc: amlogic-a4: Remove IRQF_ONESHOT
  usb: typec: fusb302: Remove IRQF_ONESHOT
  EDAC/altera: Remove IRQF_ONESHOT
  char: tpm: cr50: Remove IRQF_ONESHOT
  ARM: versatile: Remove IRQF_ONESHOT
  scsi: efct: Use IRQF_ONESHOT and default primary handler
  Bluetooth: btintel_pcie: Use IRQF_ONESHOT and default primary handler
  bus: fsl-mc: Use default primary handler
  mailbox: bcm-ferxrm-mailbox: Use default primary handler
  iommu/amd: Use core's primary handler and set IRQF_ONESHOT
  platform/x86: int0002: Remove IRQF_ONESHOT from request_irq()
  genirq: Set IRQF_COND_ONESHOT in devm_request_irq().
2026-02-10 13:22:50 -08:00
Joerg Roedel
ad09563660 Merge branches 'fixes', 'arm/smmu/updates', 'intel/vt-d', 'amd/amd-vi' and 'core' into next 2026-02-06 11:10:40 +01:00
Ankit Soni
9e249c4841 iommu/amd: serialize sequence allocation under concurrent TLB invalidations
With concurrent TLB invalidations, completion wait randomly gets timed out
because cmd_sem_val was incremented outside the IOMMU spinlock, allowing
CMD_COMPL_WAIT commands to be queued out of sequence and breaking the
ordering assumption in wait_on_sem().
Move the cmd_sem_val increment under iommu->lock so completion sequence
allocation is serialized with command queuing.
And remove the unnecessary return.

Fixes: d2a0cac105 ("iommu/amd: move wait_on_sem() out of spinlock")

Tested-by: Srikanth Aithal <sraithal@amd.com>
Reported-by: Srikanth Aithal <sraithal@amd.com>
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-02-03 14:27:05 +01:00
Sebastian Andrzej Siewior
5bfcdccb4d iommu/amd: Use core's primary handler and set IRQF_ONESHOT
request_threaded_irq() is invoked with a primary and a secondary handler
and no flags are passed. The primary handler is the same as
irq_default_primary_handler() so there is no need to have an identical
copy.

The lack of the IRQF_ONESHOT can be dangerous because the interrupt
source is not masked while the threaded handler is active. This means,
especially on LEVEL typed interrupt lines, the interrupt can fire again
before the threaded handler had a chance to run.

Use the default primary interrupt handler by specifying NULL and set
IRQF_ONESHOT so the interrupt source is masked until the secondary
handler is done.

Fixes: 72fe00f01f ("x86/amd-iommu: Use threaded interupt handler")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260128095540.863589-4-bigeasy@linutronix.de
2026-02-01 17:37:13 +01:00
Vasant Hegde
3222b6de51 iommu/amd: Fix error path in amd_iommu_probe_device()
Currently, the error path of amd_iommu_probe_device() unconditionally
references dev_data, which may not be initialized if an early failure
occurs (like iommu_init_device() fails).

Move the out_err label to ensure the function exits immediately on
failure without accessing potentially uninitialized dev_data.

Fixes: 19e5cc156c ("iommu/amd: Enable support for up to 2K interrupts per function")
Cc: Rakuram Eswaran <rakuram.e96@gmail.com>
Cc: Jörg Rödel <joro@8bytes.org>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202512191724.meqJENXe-lkp@intel.com/
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-18 11:03:12 +01:00
Suravee Suthikulpanit
93eee2a49c iommu/amd: Refactor logic to program the host page table in DTE
Introduce the amd_iommu_set_dte_v1() helper function to configure
IOMMU host (v1) page table into DTE. This will be used later
when attaching nested doamin.

Also, remove obsolete warning when SNP is enabled and domain id
is zero since this check is no longer applicable.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-18 10:56:15 +01:00
Suravee Suthikulpanit
4e1b09d90b iommu/amd: Refactor persistent DTE bits programming into amd_iommu_make_clear_dte()
To help avoid duplicate logic when programing DTE for nested translation.

Note that this commit changes behavior of when the IOMMU driver is
switching domain during attach and the blocking domain, where DTE bit
fields for interrupt pass-through (i.e. Lint0, Lint1, NMI, INIT, ExtInt)
and System management message could be affected. These DTE bits are
specified in the IVRS table for specific devices, and should be persistent.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-18 10:56:14 +01:00
Suravee Suthikulpanit
757d2b1fdf iommu/amd: Introduce gDomID-to-hDomID Mapping and handle parent domain invalidation
Each nested domain is assigned guest domain ID (gDomID), which guest OS
programs into guest Device Table Entry (gDTE). For each gDomID, the driver
assigns a corresponding host domain ID (hDomID), which will be programmed
into the host Device Table Entry (hDTE).

The hDomID is allocated during amd_iommu_alloc_domain_nested(),
and free during nested_domain_free(). The gDomID-to-hDomID mapping info
(struct guest_domain_mapping_info) is stored in a per-viommu xarray
(struct amd_iommu_viommu.gdomid_array), which is indexed by gDomID.

Note also that parent domain can be shared among struct iommufd_viommu.
Therefore, when hypervisor invalidates the nest parent domain, the AMD
IOMMU command INVALIDATE_IOMMU_PAGES must be issued for each hDomID in
the gdomid_array. This is handled by the iommu_flush_pages_v1_hdom_ids(),
where it iterates through struct protection_domain.viommu_list.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-18 10:56:14 +01:00
Suravee Suthikulpanit
e113a72576 iommu/amd: Introduce struct amd_iommu_viommu
Which stores reference to nested parent domain assigned during the call to
struct iommu_ops.viommu_init(). Information in the nest parent is needed
when setting up the nested translation.

Note that the viommu initialization will be introduced in subsequent
commit.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-18 10:56:12 +01:00
Suravee Suthikulpanit
b43a29def2 iommu/amd: Add support for nest parent domain allocation
To support nested translation, the nest parent domain is allocated with
IOMMU_HWPT_ALLOC_NEST_PARENT flag, and stores information of the v1 page
table for stage 2 (i.e. GPA->SPA).

Also, only support nest parent domain on AMD system, which can support
the Guest CR3 Table (GCR3TRPMode) feature. This feature is required in
order to program DTE[GCR3 Table Root Pointer] with the GPA.

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-18 10:56:12 +01:00
Suravee Suthikulpanit
9b467a5af8 iommu/amd: Introduce helper function amd_iommu_update_dte()
Which includes DTE update, clone_aliases, DTE flush and completion-wait
commands to avoid code duplication when reuse to setup DTE for nested
translation.

Also, make amd_iommu_update_dte() non-static to reuse in
in a new nested.c file for nested translation.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-18 10:56:11 +01:00
Suravee Suthikulpanit
11cfa782f0 iommu/amd: Make amd_iommu_make_clear_dte() non-static inline
This will be reused in a new nested.c file for nested translation.

Also, remove unused function parameter ptr.

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-18 10:56:11 +01:00
Suravee Suthikulpanit
5335fc1657 iommu/amd: Rename DEV_DOMID_MASK to DTE_DOMID_MASK
Also change the define to use GENMASK_ULL instead.
There is no functional change.

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-18 10:56:10 +01:00
Suravee Suthikulpanit
7d8b06ecc4 iommu/amd: Add support for hw_info for iommu capability query
AMD IOMMU Extended Feature (EFR) and Extended Feature 2 (EFR2) registers
specify features supported by each IOMMU hardware instance.
The IOMMU driver checks each feature-specific bits before enabling
each feature at run time.

For IOMMUFD, the hypervisor passes the raw value of amd_iommu_efr and
amd_iommu_efr2 to VMM via iommufd IOMMU_DEVICE_GET_HW_INFO ioctl.

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-18 10:56:09 +01:00
Rakuram Eswaran
2e66659565 iommu/amd: Drop incorrect NULL check for iommu in alloc_irq_table()
alloc_irq_table() contains a conditional check for a NULL iommu pointer
when computing the NUMA node, but the function dereferences iommu
in multiple places afterwards.

All callers ensure that a valid iommu pointer is passed in, and a NULL
iommu is not expected by the current callers. Remove the incorrect
NULL check to make the assumptions consistent and address the Smatch
warning.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202512191724.meqJENXe-lkp@intel.com/
Signed-off-by: Rakuram Eswaran <rakuram.e96@gmail.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-10 11:17:43 +01:00
Ankit Soni
d2a0cac105 iommu/amd: move wait_on_sem() out of spinlock
With iommu.strict=1, the existing completion wait path can cause soft
lockups under stressed environment, as wait_on_sem() busy-waits under the
spinlock with interrupts disabled.

Move the completion wait in iommu_completion_wait() out of the spinlock.
wait_on_sem() only polls the hardware-updated cmd_sem and does not require
iommu->lock, so holding the lock during the busy wait unnecessarily
increases contention and extends the time with interrupts disabled.

Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-10 10:54:38 +01:00
Sairaj Kodilkar
c7fe9384c8 amd/iommu: Make protection domain ID functions non-static
So that both iommu.c and init.c can utilize them. Also define a new
function 'pdom_id_destroy()' to destroy 'pdom_ids' instead of directly
calling ida functions.

Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-12-19 11:23:49 +01:00
Joerg Roedel
0d081b1694 Merge branches 'arm/smmu/updates', 'arm/smmu/bindings', 'mediatek', 'nvidia/tegra', 'intel/vt-d', 'amd/amd-vi' and 'core' into next 2025-11-28 08:44:21 +01:00
Jason Gunthorpe
1eb0ae6fbd iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
If the IOVA is limited to less than 48 the page table will be constructed
with a 3 level configuration which is unsupported by hardware.

Like the second stage the caller needs to pass in both the top_level an
the vasz to specify a table that has more levels than required to hold the
IOVA range.

Fixes: 6cbc09b771 ("iommu/vt-d: Restore previous domain::aperture_end calculation")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-28 08:43:55 +01:00
Jinhui Guo
2381a1b40b iommu/amd: Propagate the error code returned by __modify_irte_ga() in modify_irte_ga()
The return type of __modify_irte_ga() is int, but modify_irte_ga()
treats it as a bool. Casting the int to bool discards the error code.

To fix the issue, change the type of ret to int in modify_irte_ga().

Fixes: 57cdb720ea ("iommu/amd: Do not flush IRTE when only updating isRun and destination fields")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-25 15:10:23 +01:00
Dheeraj Kumar Srivastava
d1e281f832 iommu/amd: Enhance "Completion-wait Time-out" error message
Current IOMMU driver prints "Completion-wait Time-out" error message with
insufficient information to further debug the issue.

Enhancing the error message as following:
1. Log IOMMU PCI device ID in the error message.
2. With "amd_iommu_dump=1" kernel command line option, dump entire
   command buffer entries including Head and Tail offset.

Dump the entire command buffer only on the first 'Completion-wait Time-out'
to avoid dmesg spam.

Signed-off-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-13 16:13:26 +01:00
Alejandro Jimenez
789a5913b2 iommu/amd: Use the generic iommu page table
Replace the io_pgtable versions with pt_iommu versions. The v2 page table
uses the x86 implementation that will be eventually shared with VT-d.

This supports the same special features as the original code:
 - increase_top for the v1 format to allow scaling from 3 to 6 levels
 - non-present flushing
 - Dirty tracking for v1 only
 - __sme_set() to adjust the PTEs for CC
 - Optimization for flushing with virtualization to minimize the range
 - amd_iommu_pgsize_bitmap override of the native page sizes
 - page tables allocate from the device's NUMA node

Rework the domain ops so that v1/v2 get their own ops. Make dedicated
allocation functions for v1 and v2. Hook up invalidation for a top change
to struct pt_iommu_flush_ops. Delete some of the iopgtable related code
that becomes unused in this patch. The next patch will delete the rest of
it.

This fixes a race bug in AMD's increase_address_space() implementation. It
stores the top level and top pointer in different memory, which prevents
other threads from reading a coherent version:

   increase_address_space()   alloc_pte()
                                level = pgtable->mode - 1;
	pgtable->root  = pte;
	pgtable->mode += 1;
                                pte = &pgtable->root[PM_LEVEL_INDEX(level, address)];

The iommupt version is careful to put mode and root under a single
READ_ONCE and then is careful to only READ_ONCE a single time per
walk.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:08:56 +01:00
Nicolin Chen
fd714986e4 iommu: Pass in old domain to attach_dev callback functions
The IOMMU core attaches each device to a default domain on probe(). Then,
every new "attach" operation has a fundamental meaning of two-fold:
 - detach from its currently attached (old) domain
 - attach to a given new domain

Modern IOMMU drivers following this pattern usually want to clean up the
things related to the old domain, so they call iommu_get_domain_for_dev()
to fetch the old domain.

Pass in the old domain pointer from the core to drivers, aligning with the
set_dev_pasid op that does so already.

Ensure all low-level attach fcuntions in the core can forward the correct
old domain pointer. Thus, rework those functions as well.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:55:35 +01:00
Nicolin Chen
c21b34762e iommu/amd: Set release_domain to blocked_domain
The set_dev_pasid for a release domain never gets called anyhow. So, there
is no point in defining a separate release_domain from the blocked_domain.

Simply reuse the blocked_domain.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:55:35 +01:00
Ashish Kalra
f32fe7cb01 iommu/amd: Add support to remap/unmap IOMMU buffers for kdump
After a panic if SNP is enabled in the previous kernel then the kdump
kernel boots with IOMMU SNP enforcement still enabled.

IOMMU completion wait buffers (CWBs), command buffers and event buffer
registers remain locked and exclusive to the previous kernel. Attempts
to allocate and use new buffers in the kdump kernel fail, as hardware
ignores writes to the locked MMIO registers as per AMD IOMMU spec
Section 2.12.2.1.

This results in repeated "Completion-Wait loop timed out" errors and a
second kernel panic: "Kernel panic - not syncing: timer doesn't work
through Interrupt-remapped IO-APIC"

The list of MMIO registers locked and which ignore writes after failed
SNP shutdown are mentioned in the AMD IOMMU specifications below:

Section 2.12.2.1.
https://docs.amd.com/v/u/en-US/48882_3.10_PUB

Reuse the pages of the previous kernel for completion wait buffers,
command buffers, event buffers and memremap them during kdump boot
and essentially work with an already enabled IOMMU configuration and
re-using the previous kernel’s data structures.

Reusing of command buffers and event buffers is now done for kdump boot
irrespective of SNP being enabled during kdump.

Re-use of completion wait buffers is only done when SNP is enabled as
the exclusion base register is used for the completion wait buffer
(CWB) address only when SNP is enabled.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/ff04b381a8fe774b175c23c1a336b28bc1396511.1756157913.git.ashish.kalra@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-05 14:44:30 +02:00
Xichao Zhao
d3d3b60427 iommu/amd: use str_plural() to simplify the code
Use the string choice helper function str_plural() to simplify the code.

Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Link: https://lore.kernel.org/r/20250818070556.458271-1-zhao.xichao@vivo.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-05 14:25:32 +02:00
Linus Torvalds
63eb28bb14 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Host driver for GICv5, the next generation interrupt controller for
     arm64, including support for interrupt routing, MSIs, interrupt
     translation and wired interrupts

   - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
     GICv5 hardware, leveraging the legacy VGIC interface

   - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
     userspace to disable support for SGIs w/o an active state on
     hardware that previously advertised it unconditionally

   - Map supporting endpoints with cacheable memory attributes on
     systems with FEAT_S2FWB and DIC where KVM no longer needs to
     perform cache maintenance on the address range

   - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the
     guest hypervisor to inject external aborts into an L2 VM and take
     traps of masked external aborts to the hypervisor

   - Convert more system register sanitization to the config-driven
     implementation

   - Fixes to the visibility of EL2 registers, namely making VGICv3
     system registers accessible through the VGIC device instead of the
     ONE_REG vCPU ioctls

   - Various cleanups and minor fixes

  LoongArch:

   - Add stat information for in-kernel irqchip

   - Add tracepoints for CPUCFG and CSR emulation exits

   - Enhance in-kernel irqchip emulation

   - Various cleanups

  RISC-V:

   - Enable ring-based dirty memory tracking

   - Improve perf kvm stat to report interrupt events

   - Delegate illegal instruction trap to VS-mode

   - MMU improvements related to upcoming nested virtualization

  s390x

   - Fixes

  x86:

   - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O
     APIC, PIC, and PIT emulation at compile time

   - Share device posted IRQ code between SVM and VMX and harden it
     against bugs and runtime errors

   - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups
     O(1) instead of O(n)

   - For MMIO stale data mitigation, track whether or not a vCPU has
     access to (host) MMIO based on whether the page tables have MMIO
     pfns mapped; using VFIO is prone to false negatives

   - Rework the MSR interception code so that the SVM and VMX APIs are
     more or less identical

   - Recalculate all MSR intercepts from scratch on MSR filter changes,
     instead of maintaining shadow bitmaps

   - Advertise support for LKGS (Load Kernel GS base), a new instruction
     that's loosely related to FRED, but is supported and enumerated
     independently

   - Fix a user-triggerable WARN that syzkaller found by setting the
     vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting
     the vCPU into VMX Root Mode (post-VMXON). Trying to detect every
     possible path leading to architecturally forbidden states is hard
     and even risks breaking userspace (if it goes from valid to valid
     state but passes through invalid states), so just wait until
     KVM_RUN to detect that the vCPU state isn't allowed

   - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling
     interception of APERF/MPERF reads, so that a "properly" configured
     VM can access APERF/MPERF. This has many caveats (APERF/MPERF
     cannot be zeroed on vCPU creation or saved/restored on suspend and
     resume, or preserved over thread migration let alone VM migration)
     but can be useful whenever you're interested in letting Linux
     guests see the effective physical CPU frequency in /proc/cpuinfo

   - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
     created, as there's no known use case for changing the default
     frequency for other VM types and it goes counter to the very reason
     why the ioctl was added to the vm file descriptor. And also, there
     would be no way to make it work for confidential VMs with a
     "secure" TSC, so kill two birds with one stone

   - Dynamically allocation the shadow MMU's hashed page list, and defer
     allocating the hashed list until it's actually needed (the TDP MMU
     doesn't use the list)

   - Extract many of KVM's helpers for accessing architectural local
     APIC state to common x86 so that they can be shared by guest-side
     code for Secure AVIC

   - Various cleanups and fixes

  x86 (Intel):

   - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
     Failure to honor FREEZE_IN_SMM can leak host state into guests

   - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to
     prevent L1 from running L2 with features that KVM doesn't support,
     e.g. BTF

  x86 (AMD):

   - WARN and reject loading kvm-amd.ko instead of panicking the kernel
     if the nested SVM MSRPM offsets tracker can't handle an MSR (which
     is pretty much a static condition and therefore should never
     happen, but still)

   - Fix a variety of flaws and bugs in the AVIC device posted IRQ code

   - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
     supports) instead of rejecting vCPU creation

   - Extend enable_ipiv module param support to SVM, by simply leaving
     IsRunning clear in the vCPU's physical ID table entry

   - Disable IPI virtualization, via enable_ipiv, if the CPU is affected
     by erratum #1235, to allow (safely) enabling AVIC on such CPUs

   - Request GA Log interrupts if and only if the target vCPU is
     blocking, i.e. only if KVM needs a notification in order to wake
     the vCPU

   - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to
     the vCPU's CPUID model

   - Accept any SNP policy that is accepted by the firmware with respect
     to SMT and single-socket restrictions. An incompatible policy
     doesn't put the kernel at risk in any way, so there's no reason for
     KVM to care

   - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
     use WBNOINVD instead of WBINVD when possible for SEV cache
     maintenance

   - When reclaiming memory from an SEV guest, only do cache flushes on
     CPUs that have ever run a vCPU for the guest, i.e. don't flush the
     caches for CPUs that can't possibly have cache lines with dirty,
     encrypted data

  Generic:

   - Rework irqbypass to track/match producers and consumers via an
     xarray instead of a linked list. Using a linked list leads to
     O(n^2) insertion times, which is hugely problematic for use cases
     that create large numbers of VMs. Such use cases typically don't
     actually use irqbypass, but eliminating the pointless registration
     is a future problem to solve as it likely requires new uAPI

   - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a
     "void *", to avoid making a simple concept unnecessarily difficult
     to understand

   - Decouple device posted IRQs from VFIO device assignment, as binding
     a VM to a VFIO group is not a requirement for enabling device
     posted IRQs

   - Clean up and document/comment the irqfd assignment code

   - Disallow binding multiple irqfds to an eventfd with a priority
     waiter, i.e. ensure an eventfd is bound to at most one irqfd
     through the entire host, and add a selftest to verify eventfd:irqfd
     bindings are globally unique

   - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
     related to private <=> shared memory conversions

   - Drop guest_memfd's .getattr() implementation as the VFS layer will
     call generic_fillattr() if inode_operations.getattr is NULL

   - Fix issues with dirty ring harvesting where KVM doesn't bound the
     processing of entries in any way, which allows userspace to keep
     KVM in a tight loop indefinitely

   - Kill off kvm_arch_{start,end}_assignment() and x86's associated
     tracking, now that KVM no longer uses assigned_device_count as a
     heuristic for either irqbypass usage or MDS mitigation

  Selftests:

   - Fix a comment typo

   - Verify KVM is loaded when getting any KVM module param so that
     attempting to run a selftest without kvm.ko loaded results in a
     SKIP message about KVM not being loaded/enabled (versus some random
     parameter not existing)

   - Skip tests that hit EACCES when attempting to access a file, and
     print a "Root required?" help message. In most cases, the test just
     needs to be run with elevated permissions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits)
  Documentation: KVM: Use unordered list for pre-init VGIC registers
  RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map()
  RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs
  RISC-V: perf/kvm: Add reporting of interrupt events
  RISC-V: KVM: Enable ring-based dirty memory tracking
  RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap
  RISC-V: KVM: Delegate illegal instruction fault to VS mode
  RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs
  RISC-V: KVM: Factor-out g-stage page table management
  RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence
  RISC-V: KVM: Introduce struct kvm_gstage_mapping
  RISC-V: KVM: Factor-out MMU related declarations into separate headers
  RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect()
  RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range()
  RISC-V: KVM: Don't flush TLB when PTE is unchanged
  RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH
  RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize()
  RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init()
  RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value
  KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list
  ...
2025-07-30 17:14:01 -07:00
Will Deacon
6ae1477fd3 Merge branch 'amd/amd-vi' into next
* amd/amd-vi:
  iommu/amd: Fix geometry.aperture_end for V2 tables
  iommu/amd: Wrap debugfs ABI testing symbols snippets in literal code blocks
  iommu/amd: Add documentation for AMD IOMMU debugfs support
  iommu/amd: Add debugfs support to dump IRT Table
  iommu/amd: Add debugfs support to dump device table
  iommu/amd: Add support for device id user input
  iommu/amd: Add debugfs support to dump IOMMU command buffer
  iommu/amd: Add debugfs support to dump IOMMU Capability registers
  iommu/amd: Add debugfs support to dump IOMMU MMIO registers
  iommu/amd: Refactor AMD IOMMU debugfs initial setup
  iommu/amd: Enable PASID and ATS capabilities in the correct order
  iommu/amd: Add efr[HATS] max v1 page table level
  iommu/amd: Add HATDis feature support
2025-07-24 11:17:59 +01:00
Jason Gunthorpe
8637afa79c iommu/amd: Fix geometry.aperture_end for V2 tables
The AMD IOMMU documentation seems pretty clear that the V2 table follows
the normal CPU expectation of sign extension. This is shown in

  Figure 25: AMD64 Long Mode 4-Kbyte Page Address Translation

Where bits Sign-Extend [63:57] == [56]. This is typical for x86 which
would have three regions in the page table: lower, non-canonical, upper.

The manual describes that the V1 table does not sign extend in section
2.2.4 Sharing AMD64 Processor and IOMMU Page Tables GPA-to-SPA

Further, Vasant has checked this and indicates the HW has an addtional
behavior that the manual does not yet describe. The AMDv2 table does not
have the sign extended behavior when attached to PASID 0, which may
explain why this has gone unnoticed.

The iommu domain geometry does not directly support sign extended page
tables. The driver should report only one of the lower/upper spaces. Solve
this by removing the top VA bit from the geometry to use only the lower
space.

This will also make the iommu_domain work consistently on all PASID 0 and
PASID != 1.

Adjust dma_max_address() to remove the top VA bit. It now returns:

5 Level:
  Before 0x1ffffffffffffff
  After  0x0ffffffffffffff
4 Level:
  Before 0xffffffffffff
  After  0x7fffffffffff

Fixes: 11c439a194 ("iommu/amd/pgtbl_v2: Fix domain max address")
Link: https://lore.kernel.org/all/8858d4d6-d360-4ef0-935c-bfd13ea54f42@amd.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/0-v2-0615cc99b88a+1ce-amdv2_geo_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-17 10:18:57 +01:00
Dheeraj Kumar Srivastava
fb3af1f4fe iommu/amd: Add debugfs support to dump IOMMU command buffer
IOMMU driver sends command to IOMMU hardware via command buffer. In cases
where IOMMU hardware fails to process commands in command buffer, dumping
it is a valuable input to debug the issue.

IOMMU hardware processes command buffer entry at offset equals to the head
pointer. Dumping just the entry at the head pointer may not always be
useful. The current head may not be pointing to the entry of the command
buffer which is causing the issue. IOMMU Hardware may have processed the
entry and updated the head pointer. So dumping the entire command buffer
gives a broad understanding of what hardware was/is doing. The command
buffer dump will have all entries from start to end of the command buffer.
Along with that, it will have a head and tail command buffer pointer
register dump to facilitate where the IOMMU driver and hardware are in
the command buffer for injecting and processing the entries respectively.

Command buffer is a per IOMMU data structure. So dumping on per IOMMU
basis.
eg.
-> To get command buffer dump for iommu<x> (say, iommu00)
   #cat /sys/kernel/debug/iommu/amd/iommu00/cmdbuf

Signed-off-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250702093804.849-5-dheerajkumar.srivastava@amd.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-15 11:41:52 +01:00
Easwar Hariharan
c694bc8b61 iommu/amd: Enable PASID and ATS capabilities in the correct order
Per the PCIe spec, behavior of the PASID capability is undefined if the
value of the PASID Enable bit changes while the Enable bit of the
function's ATS control register is Set. Unfortunately,
pdev_enable_caps() does exactly that by ordering enabling ATS for the
device before enabling PASID.

Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerry Snitselaar <jsnitsel@redhat.com>
Fixes: eda8c2860a ("iommu/amd: Enable device ATS/PASID/PRI capabilities independently")
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250703155433.6221-1-eahariha@linux.microsoft.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-07-11 09:13:58 +02:00
Marc Zyngier
0eaa67ad3a iommu/amd: Convert to msi_create_parent_irq_domain() helper
Now that we have a concise helper to create an MSI parent domain,
switch the AMD IOMMU remapping over to that.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nam Cao <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241204124549.607054-9-maz@kernel.org
Link: https://lore.kernel.org/r/92e5ae97a03e4ffc272349d0863cd2cc8f904c44.1750858125.git.namcao@linutronix.de
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-07-04 10:28:10 +02:00
Ankit Soni
025d1371cc iommu/amd: Add efr[HATS] max v1 page table level
The EFR[HATS] bits indicate maximum host translation level supported by
IOMMU. Adding support to set the maximum host page table level as indicated
by EFR[HATS]. If the HATS=11b (reserved), the driver will attempt to use
guest page table for DMA API.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Link: https://lore.kernel.org/r/df0f8562c2a20895cc185c86f1a02c4d826fd597.1749016436.git.Ankit.Soni@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-06-27 08:51:20 +02:00
Ankit Soni
7e5516e609 iommu/amd: Add HATDis feature support
Current AMD IOMMU assumes Host Address Translation (HAT) is always
supported, and Linux kernel enables this capability by default. However,
in case of emulated and virtualized IOMMU, this might not be the case.
For example,current QEMU-emulated AMD vIOMMU does not support host
translation for VFIO pass-through device, but the interrupt remapping
support is required for x2APIC (i.e. kvm-msi-ext-dest-id is also not
supported by the guest OS). This would require the guest kernel to boot
with guest kernel option iommu=pt to by-pass the initialization of
host (v1) table.

The AMD I/O Virtualization Technology (IOMMU) Specification Rev 3.10 [1]
introduces a new flag 'HATDis' in the IVHD 11h IOMMU attributes to indicate
that HAT is not supported on a particular IOMMU instance.

Therefore, modifies the AMD IOMMU driver to detect the new HATDis
attributes, and disable host translation and switch to use guest
translation if it is available. Otherwise, the driver will disable DMA
translation.

[1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf

Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Link: https://lore.kernel.org/r/8109b208f87b80e400c2abd24a2e44fcbc0763a5.1749016436.git.Ankit.Soni@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-06-27 08:51:18 +02:00
Sean Christopherson
b9e53f9ff4 iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
not an IRTE is configured to generate GA log interrupts.  KVM only needs a
notification if the target vCPU is blocking, so the vCPU can be awakened.
If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
task is scheduled back in, i.e. KVM doesn't need a notification.

Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
from the KVM changes insofar as possible.

Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
so that the match amd_iommu_activate_guest_mode().

Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as
a non-cached field, but per AMD hardware architects, it's not cached and
can be safely updated without an invalidation.

Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:51 -07:00
Sean Christopherson
a23480fe21 iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support
WARN if KVM attempts to update IRTE entries when virtual APIC isn't fully
supported, as KVM should guard all such calls on IRQ posting being enabled.

Link: https://lore.kernel.org/r/20250611224604.313496-58-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:48 -07:00
Sean Christopherson
6df262f915 iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited
If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the
vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave
the actual IRTE in remapped mode.  KVM already handles the case where AVIC
is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()),
but doesn't handle the scenario where a postable IRQ comes along while AVIC
is inhibited.

Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:40 -07:00
Sean Christopherson
f965255dc5 iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that
avic_physical_id_entry can be safely accessed, set the pCPU info
straight-away when setting vCPU affinity.  Putting the IRTE into posted
mode, and then immediately updating the IRTE a second time if the target
vCPU is running is wasteful and confusing.

This also fixes a flaw where a posted IRQ that arrives between putting
the IRTE into guest_mode and setting the correct destination could cause
the IOMMU to ring the doorbell on the wrong pCPU.

Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:39 -07:00
Sean Christopherson
0b2b541fa3 iommu/amd: Factor out helper for manipulating IRTE GA/CPU info
Split the guts of amd_iommu_update_ga() to a dedicated helper so that the
logic can be shared with flows that put the IRTE into posted mode.

Opportunistically move amd_iommu_update_ga() and its new helper above
amd_iommu_activate_guest_mode() so that it's all co-located.

Link: https://lore.kernel.org/r/20250611224604.313496-43-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:38 -07:00
Sean Christopherson
08d9ccdd1a iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
Infer whether or not a vCPU should be marked running from the validity of
the pCPU on which it is running.  amd_iommu_update_ga() already skips the
IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an
invalid pCPU would be a blatant and egregrious KVM bug.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:37 -07:00
Sean Christopherson
3be405e89f iommu/amd: Document which IRTE fields amd_iommu_update_ga() can modify
Add a comment to amd_iommu_update_ga() to document what fields it can
safely modify without issuing an invalidation of the IRTE, and to explain
its role in keeping GA IRTEs up-to-date.

Per page 93 of the IOMMU spec dated Feb 2025:

  When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and
  IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag],
  are not cached by the IOMMU. Modifications to these fields do not require an
  invalidation of the Interrupt Remapping Table.

Link: https://lore.kernel.org/all/9b7ceea3-8c47-4383-ad9c-1a9bbdc9044a@oracle.com
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:37 -07:00
Sean Christopherson
53527ea1b7 iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU
drivers into vendor specific structures.  Overloading a single structure
makes the code hard to read and maintain, is *very* misleading as it
suggests that mixing vendors is actually supported, and bastardizing
Intel's posted interrupt descriptor address when AMD's IOMMU already has
its own structure is quite unnecessary.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:31 -07:00
Sean Christopherson
95d50ebe6d iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"
Pass NULL to amd_ir_set_vcpu_affinity() to communicate "don't post to a
vCPU" now that there's no need to communicate information back to KVM
about the previous vCPU (KVM does its own tracking).

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:25 -07:00
Sean Christopherson
c4cdbaf9d8 iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the
GA root pointer.  KVM is the only source of amd_iommu_pi_data.base, and
KVM's one and only path for writing amd_iommu_pi_data.base computes the
exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base,
and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is
valid, i.e. amd_iommu_pi_data.base is fully redundant.

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:24 -07:00
Sean Christopherson
1da19c5ce0 iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
Delete the amd_ir_data.prev_ga_tag field now that all usage is
superfluous.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:55 -07:00
Joerg Roedel
879b141b7c Merge branches 'fixes', 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'fsl/pamu', 'mediatek', 'renesas/ipmmu', 's390', 'intel/vt-d', 'amd/amd-vi' and 'core' into next 2025-05-23 17:14:32 +02:00