Commit Graph

17039 Commits

Author SHA1 Message Date
Oleg Nesterov
7d08e0916a drm/amdgpu: don't abuse current->group_leader
Cleanup and preparation to simplify the next changes.

- Use current->tgid instead of current->group_leader->pid

- Use get_task_pid(current, PIDTYPE_TGID) instead of
  get_task_pid(current->group_leader, PIDTYPE_PID)

Link: https://lkml.kernel.org/r/aXY_wKewzV5lCa5I@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: Christan König <christian.koenig@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-03 08:21:25 -08:00
Dave Airlie
a60f627cf4 Merge tag 'amd-drm-next-6.20-2026-01-30' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.20-2026-01-30:

amdgpu:
- Misc cleanups
- SMU 13 fixes
- SMU 14 fixes
- GPUVM fault filter fix
- USB4 fixes
- DC FP guard fixes
- Powergating fix
- JPEG ring reset fix
- RAS fixes
- Xclk fix for soc21 APUs
- Fix COND_EXEC handling for GC 11
- UserQ fixes
- MQD size alignment fixes
- SMU feature interface cleanup
- GC 10-12 KGQ init fixes
- GC 11-12 KGQ reset fixes

amdkfd:
- Fix device snapshot reporting
- GC 12.1 trap handler fixes
- MQD size alignment fixes

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260130183257.28879-1-alexander.deucher@amd.com
2026-02-02 05:51:54 +10:00
Alex Deucher
dfd64f6e8c drm/amdgpu/gfx12: adjust KGQ reset sequence
Kernel gfx queues do not need to be reinitialized or
remapped after a reset.  Align with gfx11.

v2: preserve init and remap for MMIO case.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0a6d6ed694)
Cc: stable@vger.kernel.org
2026-01-29 12:39:21 -05:00
Alex Deucher
3eb46fbb60 drm/amdgpu/gfx11: adjust KGQ reset sequence
Kernel gfx queues do not need to be reinitialized or
remapped after a reset.  This fixes queue reset failures
on APUs.

v2: preserve init and remap for MMIO case.

Fixes: b3e9bfd866 ("drm/amdgpu/gfx11: add ring reset callbacks")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit b340ff216f)
Cc: stable@vger.kernel.org
2026-01-29 12:39:21 -05:00
Alex Deucher
9077d32a4b drm/amdgpu/gfx12: fix wptr reset in KGQ init
wptr is a 64 bit value and we need to update the
full value, not just 32 bits. Align with what we
already do for KCQs.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a2918f958d)
Cc: stable@vger.kernel.org
2026-01-29 12:39:15 -05:00
Alex Deucher
b1f810471c drm/amdgpu/gfx11: fix wptr reset in KGQ init
wptr is a 64 bit value and we need to update the
full value, not just 32 bits. Align with what we
already do for KCQs.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1f16866bdb)
Cc: stable@vger.kernel.org
2026-01-29 12:39:09 -05:00
Alex Deucher
cc4f433b14 drm/amdgpu/gfx10: fix wptr reset in KGQ init
wptr is a 64 bit value and we need to update the
full value, not just 32 bits. Align with what we
already do for KCQs.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e80b1d1aa1)
Cc: stable@vger.kernel.org
2026-01-29 12:38:54 -05:00
Alex Deucher
0a6d6ed694 drm/amdgpu/gfx12: adjust KGQ reset sequence
Kernel gfx queues do not need to be reinitialized or
remapped after a reset.  Align with gfx11.

v2: preserve init and remap for MMIO case.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29 12:27:37 -05:00
Alex Deucher
b340ff216f drm/amdgpu/gfx11: adjust KGQ reset sequence
Kernel gfx queues do not need to be reinitialized or
remapped after a reset.  This fixes queue reset failures
on APUs.

v2: preserve init and remap for MMIO case.

Fixes: b3e9bfd866 ("drm/amdgpu/gfx11: add ring reset callbacks")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29 12:27:29 -05:00
Alex Deucher
a2918f958d drm/amdgpu/gfx12: fix wptr reset in KGQ init
wptr is a 64 bit value and we need to update the
full value, not just 32 bits. Align with what we
already do for KCQs.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29 12:27:27 -05:00
Alex Deucher
1f16866bdb drm/amdgpu/gfx11: fix wptr reset in KGQ init
wptr is a 64 bit value and we need to update the
full value, not just 32 bits. Align with what we
already do for KCQs.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29 12:27:18 -05:00
Alex Deucher
e80b1d1aa1 drm/amdgpu/gfx10: fix wptr reset in KGQ init
wptr is a 64 bit value and we need to update the
full value, not just 32 bits. Align with what we
already do for KCQs.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29 12:27:10 -05:00
Lang Yu
a6a4dd519c drm/amdgpu: Use AMDGPU_MQD_SIZE_ALIGN in KGD
Use AMDGPU_MQD_SIZE_ALIGN for both kernel and user queue.

Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29 12:26:55 -05:00
Lang Yu
82a9ab369a drm/amdgpu: Add a helper macro to align mqd size
MES FW uses address(mqd_addr + sizeof(struct mqd) + 3*sizeof(uint32_t))
as fence address and writes a 32 bit fence value to this address. Driver
needs to allocate some extra memory(at least 4 DWs) in addition to
sizeof(struct mqd) as mqd memory(limited to gfx/compute/sdma queue).

For gfx11/12, sizeof(struct mqd) < PAGE_SIZE, KGD allocates mqd memory with
PAGE_SIZE aligned works. For gfx12.1, sizeof(struct mqd) == PAGE_SIZE,
it doesn't work.

KFD mqd manager hardcodes mqd size to PAGE_SIZE/MQD_SIZE across different
IP versions to solve this issue.

To avoid hardcoding in differnet places and across different IP versions.
Let's use AMDGPU_MQD_SIZE_ALIGN instead. It is used in two places.

1. mqd memory alloction
2. mqd stride handling for multi xcc config

v2: Use AMDGPU_GPU_PAGE_ALIGN. (Mukul)

Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: David Belanger <david.belanger@amd.com> (v1)
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29 12:26:26 -05:00
Jesse.Zhang
8079b87c02 drm/amdgpu: validate user queue size constraints
Add validation to ensure user queue sizes meet hardware requirements:
- Size must be a power of two for efficient ring buffer wrapping
- Size must be at least AMDGPU_GPU_PAGE_SIZE to prevent undersized allocations

This prevents invalid configurations that could lead to GPU faults or
unexpected behavior.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-29 12:26:15 -05:00
Alex Deucher
b1defcdc44 drm/amdgpu: Fix cond_exec handling in amdgpu_ib_schedule()
The EXEC_COUNT field must be > 0.  In the gfx shadow
handling we always emit a cond_exec packet after the gfx_shadow
packet, but the EXEC_COUNT never gets patched.  This leads
to a hang when we try and reset queues on gfx11 APUs.

Fixes: c68cbbfd54 ("drm/amdgpu: cleanup conditional execution")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ba205ac3d6)
Cc: stable@vger.kernel.org
2026-01-28 16:35:00 -05:00
Alex Deucher
e7fbff9e76 drm/amdgpu/soc21: fix xclk for APUs
The reference clock is supposed to be 100Mhz, but it
appears to actually be slightly lower (99.81Mhz).

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14451
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 637fee3954)
Cc: stable@vger.kernel.org
2026-01-28 16:34:24 -05:00
Alex Deucher
ba205ac3d6 drm/amdgpu: Fix cond_exec handling in amdgpu_ib_schedule()
The EXEC_COUNT field must be > 0.  In the gfx shadow
handling we always emit a cond_exec packet after the gfx_shadow
packet, but the EXEC_COUNT never gets patched.  This leads
to a hang when we try and reset queues on gfx11 APUs.

Fixes: c68cbbfd54 ("drm/amdgpu: cleanup conditional execution")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-28 16:21:45 -05:00
Alex Deucher
637fee3954 drm/amdgpu/soc21: fix xclk for APUs
The reference clock is supposed to be 100Mhz, but it
appears to actually be slightly lower (99.81Mhz).

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14451
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-28 16:21:31 -05:00
Dave Airlie
6704d98a4f BackMerge tag 'v6.19-rc7' into drm-next
Linux 6.19-rc7

This is needed for msm and rust trees.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2026-01-28 12:44:28 +10:00
Jon Doron
8b1ecc9377 drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove
On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
ih2 interrupt ring buffers are not initialized. This is by design, as
these secondary IH rings are only available on discrete GPUs. See
vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
AMD_IS_APU is set.

However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
get the timestamp of the last interrupt entry. When retry faults are
enabled on APUs (noretry=0), this function is called from the SVM page
fault recovery path, resulting in a NULL pointer dereference when
amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].

The crash manifests as:

  BUG: kernel NULL pointer dereference, address: 0000000000000004
  RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
  Call Trace:
   amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
   svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
   amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
   gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
   amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
   amdgpu_ih_process+0x84/0x100 [amdgpu]

This issue was exposed by commit 1446226d32 ("drm/amdgpu: Remove GC HW
IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
noretry=1 to noretry=0, enabling retry fault handling and thus
exercising the buggy code path.

Fix this by adding a check for ih1.ring_size before attempting to use
it. Also restore the soft_ih support from commit dd29944165 ("drm/amdgpu:
Rework retry fault removal").  This is needed if the hardware doesn't
support secondary HW IH rings.

v2: additional updates (Alex)

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
Fixes: dd29944165 ("drm/amdgpu: Rework retry fault removal")
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Jon Doron <jond@wiz.io>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6ce8d536c8)
Cc: stable@vger.kernel.org
2026-01-27 18:24:39 -05:00
Kent Russell
e0d11bdb29 drm/amdgpu: Send RMA CPER at bad page loading
Some older builds weren't sending RMA CPERs when the bad page threshold
was exceeded. Newer builds have resolved this, but there could be
systems out there with bad page numbers higher than the threshold, that
haven't sent out an RMA CPER. To be thorough and safe, send an RMA CPER
when we load the table, if the threshold is met or exceeded, instead of
waiting for the next UE to trigger the CPER.

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-27 18:13:24 -05:00
Jesse.Zhang
91544c45fa drm/amdgpu: Fix jpeg ring test order in vcn_v4_0_3
Fix the vcn reset sequence in vcn_v4_0_3_ring_reset() to restore
JPEG power state and unlock the JPEG powergating mutex before
running the JPEG ring post-reset helper.

Fixes: d25c67fd9d ("drm/amdgpu/vcn4.0.3: rework reset handling")
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-27 18:12:10 -05:00
Jon Doron
6ce8d536c8 drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove
On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
ih2 interrupt ring buffers are not initialized. This is by design, as
these secondary IH rings are only available on discrete GPUs. See
vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
AMD_IS_APU is set.

However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
get the timestamp of the last interrupt entry. When retry faults are
enabled on APUs (noretry=0), this function is called from the SVM page
fault recovery path, resulting in a NULL pointer dereference when
amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].

The crash manifests as:

  BUG: kernel NULL pointer dereference, address: 0000000000000004
  RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
  Call Trace:
   amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
   svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
   amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
   gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
   amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
   amdgpu_ih_process+0x84/0x100 [amdgpu]

This issue was exposed by commit 1446226d32 ("drm/amdgpu: Remove GC HW
IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
noretry=1 to noretry=0, enabling retry fault handling and thus
exercising the buggy code path.

Fix this by adding a check for ih1.ring_size before attempting to use
it. Also restore the soft_ih support from commit dd29944165 ("drm/amdgpu:
Rework retry fault removal").  This is needed if the hardware doesn't
support secondary HW IH rings.

v2: additional updates (Alex)

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
Fixes: dd29944165 ("drm/amdgpu: Rework retry fault removal")
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Jon Doron <jond@wiz.io>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-27 18:09:04 -05:00
Tvrtko Ursulin
dda702172d drm/amdgpu: Simplify sorting of the bo list
Sort function only cares about the sign so we can replace the conditionals
with a single subtraction.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-27 18:08:29 -05:00
Tvrtko Ursulin
f7e0678651 drm/amdgpu/mes: Remove idr leftovers v2
Commit
cb17fff3a2 ("drm/amdgpu/mes: remove unused functions")
removed most of the code using these IDRs but forgot to remove the struct
members and init/destroy paths.

There is also interrupt handling code in SDMA 5.0 and 5.2 which appears to
be using it, but is is unreachable since nothing ever allocates the
relevant IDR. We replace those with one time warnings just to avoid any
functional difference, but it is also possible they should be removed.

v2: also fix up gfx_v12_1.c and sdma_v7_1.c

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
References: cb17fff3a2 ("drm/amdgpu/mes: remove unused functions")
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-27 18:02:06 -05:00
Alex Deucher
095ca81517 drm/amdgpu: fix type for wptr in ring backup
Needs to be a u64.

Fixes: 77cc0da39c ("drm/amdgpu: track ring state associated with a fence")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 56fff1941a)
2026-01-21 14:55:56 -05:00
Timur Kristóf
fd2ac113a5 drm/amdgpu: Fix validating flush_gpu_tlb_pasid()
When a function holds a lock and we return without unlocking it,
it deadlocks the kernel. We should always unlock before returning.

This commit fixes suspend/resume on SI.
Tested on two Tahiti GPUs: FirePro W9000 and R9 280X.

Fixes: f4db9913e4 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202601190121.z9C0uml5-lkp@intel.com/
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e3a6eff92b)
2026-01-21 14:55:44 -05:00
Alex Deucher
ad7b68b043 drm/amdgpu: rename amdgpu_fence_driver_guilty_force_completion()
The function no longer signals the fence so rename it to
better match what it does.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21 14:27:45 -05:00
Alex Deucher
56fff1941a drm/amdgpu: fix type for wptr in ring backup
Needs to be a u64.

Fixes: 77cc0da39c ("drm/amdgpu: track ring state associated with a fence")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21 14:25:50 -05:00
Gangliang Xie
0028b86b52 drm/amdgpu: mark invalid records with U64_MAX
set retired_page of invalid ras records to U64_MAX, and skip
them when reading ras records

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21 14:25:43 -05:00
Lijo Lazar
9d03d404f4 drm/amdgpu: Avoid excessive dmesg log
KIQ access is not guaranteed to work reliably under all reset
situations. Avoid flooding dmesg with HDP flush failure messages.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21 14:25:31 -05:00
Timur Kristóf
e3a6eff92b drm/amdgpu: Fix validating flush_gpu_tlb_pasid()
When a function holds a lock and we return without unlocking it,
it deadlocks the kernel. We should always unlock before returning.

This commit fixes suspend/resume on SI.
Tested on two Tahiti GPUs: FirePro W9000 and R9 280X.

Fixes: f4db9913e4 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202601190121.z9C0uml5-lkp@intel.com/
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21 14:24:27 -05:00
Jesse.Zhang
fab47d2db5 drm/amdgpu/vcn5.0.1: rework reset handling
Resetting VCN resets the entire tile, including jpeg.
When resetting the VCN, we need to ensure that JPEG data blocks are accessible and we also need to handle the JPEG queue.
Add a helper function to restore the JPEG queue during the VCN reset.

v2: split the jpeg helper in two, in the top helper we can stop the sched workqueues and attempt to wait for any outstanding fences.
    Then in the bottom helper, we can force completion, re-init the rings, and restart the sched workqueues (Alex)

v3: merge patches 4 and 5 into one patch (Alex)

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21 14:18:09 -05:00
Jesse.Zhang
d25c67fd9d drm/amdgpu/vcn4.0.3: rework reset handling
Resetting VCN resets the entire tile, including jpeg.
When resetting the VCN, we need to ensure that JPEG data blocks are accessible and we also need to handle the JPEG queue.
Add a helper function to restore the JPEG queue during the VCN reset.

v2: split the jpeg helper in two, in the top helper we can stop the sched workqueues and attempt to wait for any outstanding fences.
    Then in the bottom helper, we can force completion, re-init the rings, and restart the sched workqueues (Alex)

v3: merge patches 1 and 2 into one patch (Alex)

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21 14:17:48 -05:00
Jesse.Zhang
de93bc3533 drm/amdgpu/vcn4.0.3: implement DPG pause mode handling for VCN 4.0.3
For MI projects, when Dynamic Power Gating (DPG) is enabled,
VCN reset operations should be performed with DPG in pause mode.
Otherwise, the hardware may perform undesirable reset operations

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-21 14:17:36 -05:00
Alex Deucher
82a401ceff drm/amdgpu: fix error handling in ib_schedule()
If fence emit fails, free the fence if necessary.

Fixes: db36632ea5 ("drm/amdgpu: clean up and unify hw fence handling")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5eb680a060)
2026-01-20 21:53:18 -05:00
Jiqian Chen
8e96b36d9b drm/amdgpu: free hw_vm_fence when fail in amdgpu_job_alloc
If drm_sched_job_init fails, hw_vm_fence is not freed currently,
then cause memory leak.

Fixes: db36632ea5 ("drm/amdgpu: clean up and unify hw fence handling")
Link: https://lore.kernel.org/amd-gfx/a5a828cb-0e4a-41f0-94c3-df31e5ddad52@amd.com/T/#t
Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
Reviewed-by: Amos Kong <kongjianjun@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5d42ee457c)
2026-01-20 21:50:03 -05:00
Likun Gao
1034325332 drm/amdgpu: remove frame cntl for gfx v12
Remove emit_frame_cntl function for gfx v12, which is not support.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5aaa5058de)
Cc: stable@vger.kernel.org
2026-01-20 21:49:25 -05:00
Alex Deucher
e51b709a6d drm/amdgpu: add new job ids
Use this for gfx, sdma, vpe IB tests and kernel shaders.
The end goal it to get rid of the direct IB submit without a
job structure.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20 17:25:53 -05:00
Alex Deucher
5eb680a060 drm/amdgpu: fix error handling in ib_schedule()
If fence emit fails, free the fence if necessary.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20 17:25:40 -05:00
Alex Deucher
e6048510c3 drm/amdgpu/jpeg4.0.3: remove redundant sr-iov check
The per queue reset flag is only set when sr-iov is
disabled so this check is not necessary as the function
will never be called on sr-iov.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20 17:17:48 -05:00
Perry Yuan
54f2fc76dd drm/amdgpu: Improve IP discovery checksum failure logging
Enhance the error logging in amdgpu_discovery_verify_checksum() to
print the calculated checksum, the expected checksum, the data size.

This extra context helps quickly identify if the issue is a data
corruption, a partially read binary, or an invalid table header without
requiring additional instrumentation.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20 17:17:23 -05:00
Jiqian Chen
5d42ee457c drm/amdgpu: free hw_vm_fence when fail in amdgpu_job_alloc
If drm_sched_job_init fails, hw_vm_fence is not freed currently,
then cause memory leak.

Fixes: db36632ea5 ("drm/amdgpu: clean up and unify hw fence handling")
Link: https://lore.kernel.org/amd-gfx/a5a828cb-0e4a-41f0-94c3-df31e5ddad52@amd.com/T/#t
Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
Reviewed-by: Amos Kong <kongjianjun@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20 17:16:37 -05:00
Jesse.Zhang
fc3336be9c drm/amd/amdgpu: Add independent hang detect work for user queue fence
In error scenarios (e.g., malformed commands), user queue fences may never
be signaled, causing processes to wait indefinitely. To address this while
preserving the requirement of infinite fence waits, implement an independent
timeout detection mechanism:

1. Initialize a hang detect work when creating a user queue (one-time setup)
2. Start the work with queue-type-specific timeout (gfx/compute/sdma) when
       the last fence is created via amdgpu_userq_signal_ioctl (per-fence timing)
3. Trigger queue reset logic if the timer expires before the fence is signaled

v2: make timeout per queue type (adev->gfx_timeout vs adev->compute_timeout vs adev->sdma_timeout) to be consistent with kernel queues. (Alex)
v3: The timeout detection must be independent from the fence, e.g. you don't wait for a timeout on the fence
        but rather have the timeout start as soon as the fence is initialized. (Christian)
v4: replace the timer with the `hang_detect_work` delayed work.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20 17:16:12 -05:00
Likun Gao
5aaa5058de drm/amdgpu: remove frame cntl for gfx v12
Remove emit_frame_cntl function for gfx v12, which is not support.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20 17:15:59 -05:00
Philip Yang
d4a814f400 drm/amdkfd: Move gfx9.4.3 and gfx 9.5 MQD to HBM
To reduce queue switch latency further, move MQD to VRAM domain, CP
access MQD and control stack via FB aperture, this requires contiguous
pages.

After MQD is initialized, updated or restored, flush HDP to guarantee
the data is written to HBM and GPU cache is invalidated, then CP will
read the new MQD.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20 17:15:46 -05:00
Dave Airlie
c098b1aa2f Merge tag 'amd-drm-next-6.20-2026-01-16' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.20-2026-01-16:

amdgpu:
- SR-IOV fixes
- Rework SMU mailbox handling
- Drop MMIO_REMAP domain
- UserQ fixes
- MES cleanups
- Panel Replay updates
- HDMI fixes
- Backlight fixes
- SMU 14.x fixes
- SMU 15 updates

amdkfd:
- Fix a memory leak
- Fixes for systems with non-4K pages
- LDS/Scratch cleanup
- MES process eviction fix

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260116202609.23107-1-alexander.deucher@amd.com
2026-01-19 06:54:46 +10:00
Dave Airlie
83dc0ba275 Merge tag 'amd-drm-next-6.20-2026-01-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.20-2026-01-09:

amdgpu:
- GPUVM updates
- Initial support for larger GPU address spaces
- Initial SMUIO 15.x support
- Documentation updates
- Initial PSP 15.x support
- Initial IH 7.1 support
- Initial IH 6.1.1 support
- SMU 13.0.12 updates
- RAS updates
- Initial MMHUB 3.4 support
- Initial MMHUB 4.2 support
- Initial GC 12.1 support
- Initial GC 11.5.4 support
- HDMI fixes
- Panel replay improvements
- DML updates
- DC FP fixes
- Initial SDMA 6.1.4 support
- Initial SDMA 7.1 support
- Userq updates
- DC HPD refactor
- SwSMU cleanups and refactoring
- TTM memory ops parallelization
- DCN 3.5 fixes
- DP audio fixes
- Clang fixes
- Misc spelling fixes and cleanups
- Initial SDMA 7.11.4 support
- Convert legacy DRM logging helpers to new drm logging helpers
- Initial JPEG 5.3 support
- Add support for changing UMA size via the driver
- DC analog fixes
- GC 9 gfx queue reset support
- Initial SMU 15.x support

amdkfd:
- Reserved SDMA rework
- Refactor SPM
- Initial GC 12.1 support
- Initial GC 11.5.4 support
- Initial SDMA 7.1 support
- Initial SDMA 6.1.4 support
- Increase the kfd process hash table
- Per context support
- Topology fixes

radeon:
- Convert legacy DRM logging helpers to new drm logging helpers
- Use devm for i2c adapters
- Variable sized array fix
- Misc cleanups

UAPI:
- KFD context support.  Proposed userspace:
  https://github.com/ROCm/rocm-systems/pull/1705
  https://github.com/ROCm/rocm-systems/pull/1701
- Add userq metadata queries for more queue types.  Proposed userspace:
  https://gitlab.freedesktop.org/yogeshmohan/mesa/-/commits/userq_query

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260109154713.3242957-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2026-01-15 14:49:33 +10:00
Ivan Lipski
d04f73668b drm/amd/display: Add an hdmi_hpd_debounce_delay_ms module
[Why&How]
Right now, the HDMI HPD filter is enabled by default at 1500ms.

We want to disable it by default, as most modern displays with HDMI do
not require it for DPMS mode.

The HPD can instead be enabled as a driver parameter with a custom delay
value in ms (up to 5000ms).

Fixes: c918e75e1e ("drm/amd/display: Add an HPD filter for HDMI")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4859
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6a681cd903)
2026-01-14 15:07:43 -05:00