Commit Graph

17039 Commits

Author SHA1 Message Date
Srinivasan Shanmugam
b2426a211d drm/amdgpu/userq: Fix fence reference leak on queue teardown v2
The user mode queue keeps a pointer to the most recent fence in
userq->last_fence. This pointer holds an extra dma_fence reference.

When the queue is destroyed, we free the fence driver and its xarray,
but we forgot to drop the last_fence reference.

Because of the missing dma_fence_put(), the last fence object can stay
alive when the driver unloads. This leaves an allocated object in the
amdgpu_userq_fence slab cache and triggers

This is visible during driver unload as:

  BUG amdgpu_userq_fence: Objects remaining on __kmem_cache_shutdown()
  kmem_cache_destroy amdgpu_userq_fence: Slab cache still has objects
  Call Trace:
    kmem_cache_destroy
    amdgpu_userq_fence_slab_fini
    amdgpu_exit
    __do_sys_delete_module

Fix this by putting userq->last_fence and clearing the pointer during
amdgpu_userq_fence_driver_free().

This makes sure the fence reference is released and the slab cache is
empty when the module exits.

v2: Update to only release userq->last_fence with dma_fence_put()
    (Christian)

Fixes: edc762a51c ("drm/amdgpu/userq: move some code around")
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 8e051e38a8)
2026-01-14 15:07:29 -05:00
Prike Liang
808c2052f0 Revert "drm/amdgpu: don't attach the tlb fence for SI"
This reverts commit 820b3d376e.

It’s better to validate VM TLB flushes in the flush‑TLB backend
rather than in the generic VM layer.

Reverting this patch depends on
commit fa7c231fc2b0 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()")
being present in the tree.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9163fe4d79)
2026-01-14 15:06:51 -05:00
Prike Liang
0bea77b13b drm/amdgpu: validate the flush_gpu_tlb_pasid()
Validate flush_gpu_tlb_pasid() availability before flushing tlb.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f4db9913e4)
2026-01-14 15:06:43 -05:00
Alex Deucher
b6dff005fc drm/amdgpu: make sure userqs are enabled in userq IOCTLs
These IOCTLs shouldn't be called when userqs are not
enabled.  Make sure they are enabled before executing
the IOCTLs.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit d967509651)
Cc: stable@vger.kernel.org
2026-01-14 14:57:55 -05:00
Xiaogang Chen
122b15cdbc drm/amdgpu: Use correct address to setup gart page table for vram access
Use dst input parameter to setup gart page table entries instead of using fixed
location.

Fixes: 237d623ae6 ("drm/amdgpu/gart: Add helper to bind VRAM pages (v2)")
Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ca5d4db8db)
2026-01-14 14:57:34 -05:00
Peter Colberg
9c81200152 Revert duplicate "drm/amdgpu: disable peer-to-peer access for DCC-enabled GC12 VRAM surfaces"
This reverts commit 22a36e660d once,
which was merged twice due to an incorrect backmerge resolution.

Fixes: ce0478b02e ("Merge tag 'v6.18-rc6' into drm-next")
Signed-off-by: Peter Colberg <pcolberg@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 38a0f4cf8c)
2026-01-14 14:51:36 -05:00
Mario Limonciello (AMD)
28695ca09d drm/amd: Clean up kfd node on surprise disconnect
When an eGPU is unplugged the KFD topology should also be destroyed
for that GPU. This never happens because the fini_sw callbacks never
get to run. Run them manually before calling amdgpu_device_ip_fini_early()
when a device has already been disconnected.

This location is intentionally chosen to make sure that the kfd locking
refcount doesn't get incremented unintentionally.

Cc: kent.russell@amd.com
Closes: https://community.frame.work/t/amd-egpu-on-linux/8691/33
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6a23e7b433)
Cc: stable@vger.kernel.org
2026-01-14 14:51:36 -05:00
Lu Yao
9cb6278b44 drm/amdgpu: fix drm panic null pointer when driver not support atomic
When driver not support atomic, fb using plane->fb rather than
plane->state->fb.

Fixes: fe151ed7af ("drm/amdgpu: add generic display panic helper code")
Signed-off-by: Lu Yao <yaolu@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2f2a72de67)
2026-01-14 14:51:36 -05:00
Philip Yang
292e5757b2 drm/amdgpu: Fix gfx9 update PTE mtype flag
Fix copy&paste error, that should have been an assignment instead of an or,
otherwise MTYPE_UC 0x3 can not be updated to MTYPE_RW 0x1.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fc1366016a)
Cc: stable@vger.kernel.org
2026-01-14 14:51:36 -05:00
Ivan Lipski
6a681cd903 drm/amd/display: Add an hdmi_hpd_debounce_delay_ms module
[Why&How]
Right now, the HDMI HPD filter is enabled by default at 1500ms.

We want to disable it by default, as most modern displays with HDMI do
not require it for DPMS mode.

The HPD can instead be enabled as a driver parameter with a custom delay
value in ms (up to 5000ms).

Fixes: c918e75e1e ("drm/amd/display: Add an HPD filter for HDMI")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4859
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-14 14:28:59 -05:00
Philip Yang
0cba5b27f1 drm/amdkfd: Add domain parameter to alloc kernel BO
To allocate kernel BO from VRAM domain for MQD in the following patch.
No functional change because kernel BO allocate all from GTT domain.

Rename amdgpu_amdkfd_alloc_gtt_mem to amdgpu_amdkfd_alloc_kernel_mem
Rename amdgpu_amdkfd_free_gtt_mem to amdgpu_amdkfd_free_kernel_mem
Rename mem_kfd_mem_obj gtt_mem to mem

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-14 14:28:59 -05:00
Srinivasan Shanmugam
8e051e38a8 drm/amdgpu/userq: Fix fence reference leak on queue teardown v2
The user mode queue keeps a pointer to the most recent fence in
userq->last_fence. This pointer holds an extra dma_fence reference.

When the queue is destroyed, we free the fence driver and its xarray,
but we forgot to drop the last_fence reference.

Because of the missing dma_fence_put(), the last fence object can stay
alive when the driver unloads. This leaves an allocated object in the
amdgpu_userq_fence slab cache and triggers

This is visible during driver unload as:

  BUG amdgpu_userq_fence: Objects remaining on __kmem_cache_shutdown()
  kmem_cache_destroy amdgpu_userq_fence: Slab cache still has objects
  Call Trace:
    kmem_cache_destroy
    amdgpu_userq_fence_slab_fini
    amdgpu_exit
    __do_sys_delete_module

Fix this by putting userq->last_fence and clearing the pointer during
amdgpu_userq_fence_driver_free().

This makes sure the fence reference is released and the slab cache is
empty when the module exits.

v2: Update to only release userq->last_fence with dma_fence_put()
    (Christian)

Fixes: edc762a51c ("drm/amdgpu/userq: move some code around")
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-14 14:28:59 -05:00
Prike Liang
9163fe4d79 Revert "drm/amdgpu: don't attach the tlb fence for SI"
This reverts commit 820b3d376e.

It’s better to validate VM TLB flushes in the flush‑TLB backend
rather than in the generic VM layer.

Reverting this patch depends on
commit fa7c231fc2b0 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()")
being present in the tree.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-14 14:28:59 -05:00
Prike Liang
f4db9913e4 drm/amdgpu: validate the flush_gpu_tlb_pasid()
Validate flush_gpu_tlb_pasid() availability before flushing tlb.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-14 14:28:59 -05:00
Xiaogang Chen
6cca686dfc drm/amdkfd: kfd driver supports hot unplug/replug amdgpu devices
This patch allows kfd driver function correctly when AMD gpu devices got
unplug/replug at run time.

When an AMD gpu device got unplug kfd driver gracefully terminates existing
kfd processes after stops all queues by sending SIGBUS to user process. After
that user space can still use remaining AMD gpu devices. When all AMD gpu
devices at system got removed kfd driver will not response new requests.

Unplugged AMD gpu devices can be re-plugged. kfd driver will use added devices
to function as usual.

The purpose of this patch is having kfd driver behavior as expected during and
after AMD gpu devices unplug/replug at run time.

Signed-off-by: Xiaogang Chen <Xiaogang.Chen@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-14 14:28:49 -05:00
Lang Yu
0f744593ad drm/amdgpu/mes: Simplify hqd mask initialization
"adev->mes.compute_hqd_mask[i] = adev->gfx.disable_kq ? 0xF"
is actually incorrect for MEC with 8 queues per pipe.
Let's get rid of version check and hardcode, calculate hqd
mask with number of queues per pipe and number of gfx/compute
queues kernel used.

Currently, only MEC1 is used for both kernel/user compute queue.
To enable other MEC, we need to redistribute queues per pipe and
adjust queue resource shared with kfd that needs a separate patch.
Just skip other MEC for now to avoid potential issues.

v2: Force reserved queues to 0 if kernel queue is explicitly disabled.

Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-10 14:21:52 -05:00
Srinivasan Shanmugam
efdc66fe12 drm/amdgpu: Refactor amdgpu_gem_va_ioctl for Handling Last Fence Update and Timeline Management v7
When GPU memory mappings are updated, the driver returns a fence so
userspace knows when the update is finished.

The previous refactor could pick the wrong fence or rely on checks that
are not safe for GPU mappings that stay valid even when memory is
missing. In some cases this could return an invalid fence or cause fence
reference counting problems.

Fix this by (v5,v6, per Christian):
- Starting from the VM’s existing last update fence, so a valid and
  meaningful fence is always returned even when no new work is required.
- Selecting the VM-level fence only for always-valid / PRT mappings using
  the required combined bo_va + bo guard.
- Using the per-BO page table update fence for normal MAP and REPLACE
  operations.
- For UNMAP and CLEAR, returning the fence provided by
  amdgpu_vm_clear_freed(), which may remain unchanged when nothing needs
  clearing.
- Keeping fence reference counting balanced.

v7: Drop the extra bo_va/bo NULL guard since
    amdgpu_vm_is_bo_always_valid() handles NULL BOs correctly (including
    PRT). (Christian)

This makes VM timeline fences correct and prevents crashes caused by
incorrect fence handling.

Fixes: bd8150a1b3 ("drm/amdgpu: Refactor amdgpu_gem_va_ioctl for Handling Last Fence Update and Timeline Management v4")
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-10 14:21:52 -05:00
Alex Deucher
d967509651 drm/amdgpu: make sure userqs are enabled in userq IOCTLs
These IOCTLs shouldn't be called when userqs are not
enabled.  Make sure they are enabled before executing
the IOCTLs.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-10 14:21:52 -05:00
Christophe JAILLET
9858810e62 drm/amdgpu: Slightly simplify base_addr_show()
sysfs_emit_at() never returns a negative error code. It returns 0 or the
number of characters written in the buffer.

Remove the useless tests. This simplifies the logic and saves a few lines
of code.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-10 14:21:52 -05:00
Christian König
96e97a562d drm/amdgpu: Drop MMIO_REMAP domain bit and keep it Internal
"AMDGPU_GEM_DOMAIN_MMIO_REMAP" - Never activated as UAPI and it turned
out that this was to inflexible.

Allocate the MMIO_REMAP buffer object as a regular GEM BO and explicitly
move it into the fixed AMDGPU_PL_MMIO_REMAP placement at the TTM level.

This avoids relying on GEM domain bits for MMIO_REMAP, keeps the
placement purely internal, and makes the lifetime and pinning of the
global MMIO_REMAP BO explicit. The BO is pinned in TTM so it cannot be
migrated or evicted.

The corresponding free path relies on normal DRM teardown ordering,
where no further user ioctls can access the global BO once TTM teardown
begins.

v2 (Srini):
- Updated patch title.
- Drop use of AMDGPU_GEM_DOMAIN_MMIO_REMAP in amdgpu_ttm.c. The
  MMIO_REMAP domain bit is removed from UAPI, so keep the MMIO_REMAP BO
  allocation domain-less (bp.domain = 0) and rely on the TTM placement
  (AMDGPU_PL_MMIO_REMAP) for backing/pinning.
- Keep fdinfo/mem-stats visibility for MMIO_REMAP by classifying BOs
  based on bo->tbo.resource->mem_type == AMDGPU_PL_MMIO_REMAP, since the
  domain bit is removed.

v3: Squash patches #1 & #3

Fixes: 0561324837 ("drm/amdgpu/uapi: Introduce AMDGPU_GEM_DOMAIN_MMIO_REMAP")
Fixes: 2a7a794eb8 ("drm/amdgpu/ttm: Allocate/Free 4K MMIO_REMAP Singleton")
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Leo Liu <leo.liu@amd.com>
Cc: Ruijing Dong <ruijing.dong@amd.com>
Cc: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-10 14:21:35 -05:00
Xiaogang Chen
ca5d4db8db drm/amdgpu: Use correct address to setup gart page table for vram access
Use dst input parameter to setup gart page table entries instead of using fixed
location.

Fixes: 237d623ae6 ("drm/amdgpu/gart: Add helper to bind VRAM pages (v2)")
Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-10 14:06:39 -05:00
YuBiao Wang
39c21b8111 drm/amdgpu: Skip loading SDMA_RS64 in VF
VFs use the PF SDMA ucode and are unable to load SDMA_RS64.

Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Signed-off-by: Victor Skvortsov <Victor.Skvortsov@amd.com>
Reviewed-by: Gavin Wan <gavin.wan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-10 14:06:33 -05:00
Peter Colberg
38a0f4cf8c Revert duplicate "drm/amdgpu: disable peer-to-peer access for DCC-enabled GC12 VRAM surfaces"
This reverts commit 22a36e660d once,
which was merged twice due to an incorrect backmerge resolution.

Fixes: ce0478b02e ("Merge tag 'v6.18-rc6' into drm-next")
Signed-off-by: Peter Colberg <pcolberg@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-08 15:18:13 -05:00
Mario Limonciello (AMD)
6a23e7b433 drm/amd: Clean up kfd node on surprise disconnect
When an eGPU is unplugged the KFD topology should also be destroyed
for that GPU. This never happens because the fini_sw callbacks never
get to run. Run them manually before calling amdgpu_device_ip_fini_early()
when a device has already been disconnected.

This location is intentionally chosen to make sure that the kfd locking
refcount doesn't get incremented unintentionally.

Cc: kent.russell@amd.com
Closes: https://community.frame.work/t/amd-egpu-on-linux/8691/33
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-08 11:43:23 -05:00
Hawking Zhang
5e3f50fda2 drm/amdgpu: Extend psp_skip_tmr for bare-metal and sriov
In SRIOV, guest drivers no longer setup/destory
VMR starting from mp0 v11_0_7.

In bare-metal, if boot-time TMR is enabled, some
generation (e.g., mp0 v13_0_x) don’t need runtime
TMR allocation but still require SETUP_TMR command
with tmr address 0 for backward compatibility.
some newer generations require neither SETUP_TMR nor
DESTROY_TMR and will return errors if they are sent.
Driver relies on boot_time_tmr and autoload_supported
to handle these cases correctly.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-08 11:43:09 -05:00
Philip Yang
698fa62f56 drm/amdgpu: Add helper to alloc GART entries
Add helper amdgpu_gtt_mgr_alloc/free_entries, define
GART_ENTRY_WITHOUT_BO_COLOR color for GART node not allocated with GTT
bo, then amdgpu_gtt_mgr_recover skip those mm_node.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-08 11:42:53 -05:00
Lu Yao
2f2a72de67 drm/amdgpu: fix drm panic null pointer when driver not support atomic
When driver not support atomic, fb using plane->fb rather than
plane->state->fb.

Fixes: fe151ed7af ("drm/amdgpu: add generic display panic helper code")
Signed-off-by: Lu Yao <yaolu@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-08 11:41:46 -05:00
Pratik Vishwakarma
c7fc0f3723 drm/amd: Enable SMU 15_0_0 support
Add SMU 15_0_0

v2: rebase (Alex)
v3: fix clang build (Alex)

Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-08 11:41:42 -05:00
Pratik Vishwakarma
52cd9bc47f drm/amd: Enable SMUIO 15_0_0 support
Add SMUIO 15_0_0.

Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-08 11:41:34 -05:00
Philip Yang
fc1366016a drm/amdgpu: Fix gfx9 update PTE mtype flag
Fix copy&paste error, that should have been an assignment instead of an or,
otherwise MTYPE_UC 0x3 can not be updated to MTYPE_RW 0x1.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-08 11:40:45 -05:00
Mario Limonciello (AMD)
6b2989ac5e Reapply "Revert "drm/amd: Skip power ungate during suspend for VPE""
Skipping power ungate exposed some scenarios that will fail
like below:

```
amdgpu: Register(0) [regVPEC_QUEUE_RESET_REQ] failed to reach value 0x00000000 != 0x00000001n
amdgpu 0000:c1:00.0: amdgpu: VPE queue reset failed
...
amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
```

The underlying s2idle issue that prompted this commit is going to
be fixed in BIOS.
This reverts commit 2a6c826cfe.

This was lost in the 6.19 merge so reapply it.

Fixes: 2a6c826cfe ("drm/amd: Skip power ungate during suspend for VPE")
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reported-by: Konstantin <answer2019@yandex.ru>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220812
Reported-by: Matthew Schwartz <matthew.schwartz@linux.dev>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 3925683515)
2026-01-07 17:24:16 -05:00
Perry Yuan
0de604d035 drm/amd/pm: Disable MMIO access during SMU Mode 1 reset
During Mode 1 reset, the ASIC undergoes a reset cycle and becomes
temporarily inaccessible via PCIe. Any attempt to access MMIO registers
during this window (e.g., from interrupt handlers or other driver threads)
can result in uncompleted PCIe transactions, leading to NMI panics or
system hangs.

To prevent this, set the `no_hw_access` flag to true immediately after
triggering the reset. This signals other driver components to skip
register accesses while the device is offline.

A memory barrier `smp_mb()` is added to ensure the flag update is
globally visible to all cores before the driver enters the sleep/wait
state.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 7edb503fe4)
2026-01-07 17:24:10 -05:00
Alan Liu
72d7f45736 drm/amdgpu: Fix query for VPE block_type and ip_count
[Why]
Query for VPE block_type and ip_count is missing.

[How]
Add VPE case in ip_block_type and hw_ip_count query.

Reviewed-by: Lang Yu <lang.yu@amd.com>
Signed-off-by: Alan Liu <haoping.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a6ea0a430a)
Cc: stable@vger.kernel.org
2026-01-05 17:34:48 -05:00
Pratap Nirujogi
7ed51e3a13 drm/amd/amdgpu: Fix SMU warning during isp suspend-resume
ISP mfd child devices are using genpd and the system suspend-resume
operations between genpd and amdgpu parent device which uses only
runtime suspend-resume are not in sync.

Linux power manager during suspend-resume resuming the genpd devices
earlier than the amdgpu parent device. This is resulting in the below
warning as SMU is in suspended state when genpd attempts to resume ISP.

WARNING: CPU: 13 PID: 5435 at drivers/gpu/drm/amd/amdgpu/../pm/swsmu/amdgpu_smu.c:398 smu_dpm_set_power_gate+0x36f/0x380 [amdgpu]

To fix this warning isp suspend-resume is handled as part of amdgpu
parent device suspend-resume instead of genpd sequence. Each ISP MFD
child device is marked as dev_pm_syscore_device to skip genpd
suspend-resume and use pm_runtime_force api's to suspend-resume
the devices when callbacks from amdgpu are received.

Co-developed-by: Gjorgji Rosikopulos <grosikop@amd.com>
Signed-off-by: Gjorgji Rosikopulos <grosikop@amd.com>
Signed-off-by: Bin Du <bin.du@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0288a345f1)
2026-01-05 17:29:51 -05:00
Alex Deucher
531b432609 drm/amdgpu: always backup and reemit fences
If when we backup the ring contents for reemit before a
ring reset, we skip jobs associated with the bad
context, however, we need to make sure the fences
are reemited as unprocessed submissions may depend on
them.

v2: clean up fence handling, make helpers static

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 155a748f14)
2026-01-05 17:28:45 -05:00
Alex Deucher
9fc27cbabe drm/amdgpu: don't reemit ring contents more than once
If we cancel a bad job and reemit the ring contents, and
we get another timeout, cancel everything rather than reemitting.
The wptr markers are only relevant for the original emit.  If
we reemit, the wptr markers are no longer correct.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fb62a2067c)
2026-01-05 17:28:45 -05:00
Julia Lawall
e9009c8b74 drm/amdgpu: update outdated comment
The function amdgpu_amdkfd_gpuvm_import_dmabuf() was split into
import_obj_create() and amdgpu_amdkfd_gpuvm_import_dmabuf_fd() in
commit 0188006d7c ("drm/amdkfd: Import DMABufs for interop
through DRM").  import_obj_create() now does the allocation for
the mem variable discussed in the comment.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 17:00:00 -05:00
Perry Yuan
7edb503fe4 drm/amd/pm: Disable MMIO access during SMU Mode 1 reset
During Mode 1 reset, the ASIC undergoes a reset cycle and becomes
temporarily inaccessible via PCIe. Any attempt to access MMIO registers
during this window (e.g., from interrupt handlers or other driver threads)
can result in uncompleted PCIe transactions, leading to NMI panics or
system hangs.

To prevent this, set the `no_hw_access` flag to true immediately after
triggering the reset. This signals other driver components to skip
register accesses while the device is offline.

A memory barrier `smp_mb()` is added to ensure the flag update is
globally visible to all cores before the driver enters the sleep/wait
state.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 17:00:00 -05:00
Srinivasan Shanmugam
bd8150a1b3 drm/amdgpu: Refactor amdgpu_gem_va_ioctl for Handling Last Fence Update and Timeline Management v4
This commit simplifies the amdgpu_gem_va_ioctl function, key updates
include:
 - Moved the logic for managing the last update fence directly into
   amdgpu_gem_va_update_vm.
 - Introduced checks for the timeline point to enable conditional
   replacement or addition of fences.

v2: Addressed review comments from Christian.
v3: Updated comments (Christian).
v4: The previous version selected the fence too early and did not manage its
    reference correctly, which could lead to stale or freed fences being used.
    This resulted in refcount underflows and could crash when updating GPU
    timelines.
    The fence is now chosen only after the VA mapping work is completed, and its
    reference is taken safely. After exporting it to the VM timeline syncobj, the
    driver always drops its local fence reference, ensuring balanced refcounting
    and avoiding use-after-free on dma_fence.

	Crash signature:
	[  205.828135] refcount_t: underflow; use-after-free.
	[  205.832963] WARNING: CPU: 30 PID: 7274 at lib/refcount.c:28 refcount_warn_saturate+0xbe/0x110
	...
	[  206.074014] Call Trace:
	[  206.076488]  <TASK>
	[  206.078608]  amdgpu_gem_va_ioctl+0x6ea/0x740 [amdgpu]
	[  206.084040]  ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
	[  206.089994]  drm_ioctl_kernel+0x86/0xe0 [drm]
	[  206.094415]  drm_ioctl+0x26e/0x520 [drm]
	[  206.098424]  ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
	[  206.104402]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
	[  206.109387]  __x64_sys_ioctl+0x96/0xe0
	[  206.113156]  do_syscall_64+0x66/0x2d0
	...
	[  206.553351] BUG: unable to handle page fault for address: ffffffffc0dfde90
	...
	[  206.553378] RIP: 0010:dma_fence_signal_timestamp_locked+0x39/0xe0
	...
	[  206.553405] Call Trace:
	[  206.553409]  <IRQ>
	[  206.553415]  ? __pfx_drm_sched_fence_free_rcu+0x10/0x10 [gpu_sched]
	[  206.553424]  dma_fence_signal+0x30/0x60
	[  206.553427]  drm_sched_job_done.isra.0+0x123/0x150 [gpu_sched]
	[  206.553434]  dma_fence_signal_timestamp_locked+0x6e/0xe0
	[  206.553437]  dma_fence_signal+0x30/0x60
	[  206.553441]  amdgpu_fence_process+0xd8/0x150 [amdgpu]
	[  206.553854]  sdma_v4_0_process_trap_irq+0x97/0xb0 [amdgpu]
	[  206.554353]  edac_mce_amd(E) ee1004(E)
	[  206.554270]  amdgpu_irq_dispatch+0x150/0x230 [amdgpu]
	[  206.554702]  amdgpu_ih_process+0x6a/0x180 [amdgpu]
	[  206.555101]  amdgpu_irq_handler+0x23/0x60 [amdgpu]
	[  206.555500]  __handle_irq_event_percpu+0x4a/0x1c0
	[  206.555506]  handle_irq_event+0x38/0x80
	[  206.555509]  handle_edge_irq+0x92/0x1e0
	[  206.555513]  __common_interrupt+0x3e/0xb0
	[  206.555519]  common_interrupt+0x80/0xa0
	[  206.555525]  </IRQ>
	[  206.555527]  <TASK>
	...
	[  206.555650] RIP: 0010:dma_fence_signal_timestamp_locked+0x39/0xe0
	...
	[  206.555667] Kernel panic - not syncing: Fatal exception in interrupt

Link: https://patchwork.freedesktop.org/patch/654669/
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 17:00:00 -05:00
Gangliang Xie
f9f281e839 drm/amdgpu: only check critical address when it is not reserved
when an address is reserved already, no need to check if it is
in critical or not, to save time

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 17:00:00 -05:00
Alan Liu
a6ea0a430a drm/amdgpu: Fix query for VPE block_type and ip_count
[Why]
Query for VPE block_type and ip_count is missing.

[How]
Add VPE case in ip_block_type and hw_ip_count query.

Reviewed-by: Lang Yu <lang.yu@amd.com>
Signed-off-by: Alan Liu <haoping.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 17:00:00 -05:00
Alex Deucher
5ab75f98fb drm/amdgpu/gfx9: Implement KGQ ring reset
GFX ring resets work differently on pre-GFX10 hardware since
there is no MQD managed by the scheduler.
For ring reset, you need issue the reset via CP_VMID_RESET
via KIQ or MMIO and submit the following to the gfx ring to
complete the reset:
1. EOP packet with EXEC bit set
2. WAIT_REG_MEM to wait for the fence
3. Clear CP_VMID_RESET to 0
4. EVENT_WRITE ENABLE_LEGACY_PIPELINE
5. EOP packet with EXEC bit set
6. WAIT_REG_MEM to wait for the fence
Once those commands have completed the reset should
be complete and the ring can accept new packets.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Tested-by: Jiqian Chen <Jiqian.Chen@amd.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 17:00:00 -05:00
Alex Deucher
9596097be4 drm/amdgpu/gfx9: rework pipeline sync packet sequence
Replace WAIT_REG_MEM with EVENT_WRITE flushes for all
shader types and ACQUIRE_MEM.  That should accomplish
the same thing and avoid having to wait on a fence
preventing any issues with pipeline syncs during
queue resets.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 16:59:59 -05:00
Alex Deucher
c8cf9ddc54 drm/amdgpu: avoid a warning in timedout job handler
Only set an error on the fence if the fence is not
signalled.  We can end up with a warning if the
per queue reset path signals the fence and sets an error
as part of the reset, but fails to recover.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 16:59:59 -05:00
Pratap Nirujogi
0288a345f1 drm/amd/amdgpu: Fix SMU warning during isp suspend-resume
ISP mfd child devices are using genpd and the system suspend-resume
operations between genpd and amdgpu parent device which uses only
runtime suspend-resume are not in sync.

Linux power manager during suspend-resume resuming the genpd devices
earlier than the amdgpu parent device. This is resulting in the below
warning as SMU is in suspended state when genpd attempts to resume ISP.

WARNING: CPU: 13 PID: 5435 at drivers/gpu/drm/amd/amdgpu/../pm/swsmu/amdgpu_smu.c:398 smu_dpm_set_power_gate+0x36f/0x380 [amdgpu]

To fix this warning isp suspend-resume is handled as part of amdgpu
parent device suspend-resume instead of genpd sequence. Each ISP MFD
child device is marked as dev_pm_syscore_device to skip genpd
suspend-resume and use pm_runtime_force api's to suspend-resume
the devices when callbacks from amdgpu are received.

Co-developed-by: Gjorgji Rosikopulos <grosikop@amd.com>
Signed-off-by: Gjorgji Rosikopulos <grosikop@amd.com>
Signed-off-by: Bin Du <bin.du@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 16:59:59 -05:00
Alex Deucher
b4ba5c9509 drm/amdgpu: use dma_fence_get_status() for adapter reset
We need to check if the fence was signaled without an
error as the per queue resets may have signalled the fence
while attempting to reset the queue.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 16:59:58 -05:00
Yo-Jung Leo Lin (AMD)
5946dbe1c8 Documentation/amdgpu: Add UMA carveout details
Add documentation for the uma/carveout_options and uma/carveout
attributes in sysfs

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Yo-Jung Leo Lin (AMD) <Leo.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 16:59:58 -05:00
Yo-Jung Leo Lin (AMD)
19ba61ac06 drm/amdgpu: add UMA allocation interfaces to sysfs
Add a uma/ directory containing two sysfs files as interfaces to
inspect or change UMA carveout size. These files are:

- uma/carveout_options: a read-only file listing all the available
  UMA allocation options and their index.

- uma/carveout: a file that is both readable and writable. On read,
  it shows the index of the current setting. Writing a valid index
  into this file allows users to change the UMA carveout size to that
  option on the next boot.

Co-developed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Yo-Jung Leo Lin (AMD) <Leo.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 16:59:58 -05:00
Yo-Jung Leo Lin (AMD)
379a316063 drm/amdgpu: add UMA allocation setting helpers
On some platforms, UMA allocation size can be set using the ATCS
methods. Add helper functions to interact with this functionality.

Co-developed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Yo-Jung Leo Lin (AMD) <Leo.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 16:59:58 -05:00
Yo-Jung Leo Lin (AMD)
685b7113e0 drm/amdgpu: add helper to read UMA carveout info
Currently, the available UMA allocation configs in the integrated system
information table have not been parsed. Add a helper function to retrieve
and store these configs.

Co-developed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Yo-Jung Leo Lin (AMD) <Leo.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 16:59:58 -05:00