Commit Graph

17069 Commits

Author SHA1 Message Date
Mario Limonciello
6d622755bc drm/amd: Drop some common modes from amdgpu_connector_add_common_modes()
[Why]
DC and non-DC codepaths have different sets of common modes that are
added for eDP and LVDS cases. This can cause different behaviors for
turning on DC on hardware that can support both.

[How]
Drop extra modes from amdgpu_connector_add_common_modes() not present
in amdgpu_dm_connector_add_common_modes().

Cc: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-5-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:54:12 -04:00
Alex Deucher
dbf2341569 drm/amdgpu: update MODULE_PARM_DESC for freesync_video
To better describe what it does.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3756
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:54:07 -04:00
Mario Limonciello
123a1750c5 drm/amd: Use dynamic array size declaration for amdgpu_connector_add_common_modes()
[Why]
Adding or removing a mode from common_modes[] can be fragile if a user
forgot to update the for loop boundaries.

[How]
Use ARRAY_SIZE() to detect size of the array and use that instead.

Cc: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-4-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:53:55 -04:00
Jesse.Zhang
b8ae2640f9 drm/amdgpu: Fix fence signaling race condition in userqueue
This commit fixes a potential race condition in the userqueue fence
signaling mechanism by replacing dma_fence_is_signaled_locked() with
dma_fence_is_signaled().

The issue occurred because:
1. dma_fence_is_signaled_locked() should only be used when holding
   the fence's individual lock, not just the fence list lock
2. Using the locked variant without the proper fence lock could lead
   to double-signaling scenarios:
   - Hardware completion signals the fence
   - Software path also tries to signal the same fence

By using dma_fence_is_signaled() instead, we properly handle the
locking hierarchy and avoid the race condition while still maintaining
the necessary synchronization through the fence_list_lock.

v2: drop the comment (Christian)

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:53:23 -04:00
Mario Limonciello
210844d2c0 drm/amd: Drop unnecessary check in amdgpu_connector_add_common_modes()
[Why]
amdgpu_connector_add_common_modes() has a check for the width and height
of common modes being too small, but the array of common_modes[] has fixed
values.  The check is dead code.

[How]
Drop unnecessary check.

Cc: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-3-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:50:58 -04:00
Sunil Khatri
4e3b45d7b6 drm/amdgpu: remove the redeclaration of variable i
Variable "i" has been redeclared as integer later in the function
which is wrong and not serving any purpose.

Fixes: 899fbde146 ("drm/amdgpu: replace get_user_pages with HMM mirror helpers")
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:46:35 -04:00
Prike Liang
883bd89d00 drm/amdgpu/userq: assign an error code for invalid userq va
It should return an error code if userq VA validation fails.

Fixes: 9e46b8bb05 ("drm/amdgpu: validate userq buffer virtual address and size")
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:40:18 -04:00
Christian König
90e09ea4cf drm/amdgpu: revert "rework reserved VMID handling" v2
This reverts commit e44a0fe630.

Initially we used VMID reservation to enforce isolation between
processes. That has now been replaced by proper fence handling.

Both OpenGL, RADV and ROCm developers requested a way to reserve a VMID
for SPM, so restore that approach by reverting back to only allowing a
single process to use the reserved VMID.

Only compile tested for now.

v2: use -ENOENT instead of -EINVAL if VMID is not available

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:39:00 -04:00
Christian König
66f3883dbc drm/amdgpu: remove leftover from enforcing isolation by VMID
Initially we enforced isolation by reserving a VMID, but that practice
was now removed.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:38:54 -04:00
Jesse.Zhang
7469567d88 drm/amdgpu: Add fallback to pipe reset if KCQ ring reset fails
Add a fallback mechanism to attempt pipe reset when KCQ reset
fails to recover the ring. After performing the KCQ reset and
queue remapping, test the ring functionality. If the ring test
fails, initiate a pipe reset as an additional recovery step.

v2: fix the typo (Lijo)
v3: try pipeline reset when kiq mapping fails (Lijo)

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:38:48 -04:00
Mario Limonciello (AMD)
0a6e9e098f drm/amd: Fix hybrid sleep
[Why]
commit 530694f54d ("drm/amdgpu: do not resume device in thaw for
normal hibernation") optimized the flow for systems that are going
into S4 where the power would be turned off.  Basically the thaw()
callback wouldn't resume the device if the hibernation image was
successfully created since the system would be powered off.

This however isn't the correct flow for a system entering into
s0i3 after the hibernation image is created.  Some of the amdgpu
callbacks have different behavior depending upon the intended
state of the suspend.

[How]
Use pm_hibernation_mode_is_suspend() as an input to decide whether
to run resume during thaw() callback.

Reported-by: Ionut Nechita <ionut_n2001@yahoo.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4573
Tested-by: Ionut Nechita <ionut_n2001@yahoo.com>
Fixes: 530694f54d ("drm/amdgpu: do not resume device in thaw for normal hibernation")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Tested-by: Kenneth Crudup <kenny@panix.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Cc: 6.17+ <stable@vger.kernel.org> # 6.17+: 495c8d3503: PM: hibernate: Add pm_hibernation_mode_is_suspend()
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-25 21:37:38 +02:00
Jesse.Zhang
5886090032 drm/amdgpu: Move VCN reset mask setup to late_init for VCN 5.0.1
This patch moves the initialization of the VCN supported_reset mask from
sw_init to a new late_init function for VCN 5.0.1. The change ensures
that all necessary hardware and firmware initialization is complete
before determining the supported reset types.

Key changes:
- Added vcn_v5_0_1_late_init() function to handle late initialization
- Moved supported_reset mask setup from sw_init to late_init
- Added check for per-queue reset support via amdgpu_dpm_reset_vcn_is_supported()
- Updated ip_funcs to use the new late_init function

This change helps ensure proper reset behavior by waiting until all
dependencies are initialized before determining available reset types.

Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:41:37 -04:00
Jesse.Zhang
dc704458dd drm/amdgpu: Add ring reset support for VCN v5.0.1
Implement the ring reset callback for VCN v5.0.1 to properly handle
hardware recovery when encountering GPU hangs. The new functionality:

1. Adds vcn_v5_0_1_ring_reset() function that:
   - Prepares for reset using amdgpu_ring_reset_helper_begin()
   - Performs VCN instance reset via amdgpu_dpm_reset_vcn()
   - Re-initializes hardware through vcn_v5_0_1_hw_init_inst()
   - Restarts DPG mode with vcn_v5_0_1_start_dpg_mode()
   - Completes reset with amdgpu_ring_reset_helper_end()

2. Hooks the reset function into the unified ring functions via:
   - Adding .reset = vcn_v5_0_1_ring_reset to vcn_v5_0_1_unified_ring_vm_funcs

3. Maintains existing behavior for SR-IOV VF cases by checking RRMT status

This provides proper hardware recovery capabilities for VCN 5.0.1 IP block
during fault conditions, matching functionality available in other VCN versions.

v2: Remove the RRMT_ENABLED cap setting in the reset function
    and replace adev->vcn.inst[ring->me].indirect_sram with vinst->indirect_sram (Lijo)

Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:41:27 -04:00
Jesse.Zhang
eb6910cdaa drm/amdgpu: Refactor VCN v5.0.1 HW init into separate instance function
Split the per-instance initialization code from vcn_v5_0_1_hw_init()
into a new vcn_v5_0_1_hw_init_inst() function. This improves code
organization by:

1. Separating the instance-specific initialization logic
2. Making the main init function more readable
3. Following the pattern used in queue reset

The SR-IOV specific initialization remains in the main function since
it has different requirements.

Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:41:11 -04:00
Rahul Kumar
86a54e45fd drm/amdgpu: Use kmalloc_array() instead of kmalloc()
Documentation/process/deprecated.rst recommends against the use of
kmalloc with dynamic size calculations due to the risk of overflow and
smaller allocation being made than the caller was expecting.

Replace kmalloc() with kmalloc_array() in amdgpu_amdkfd_gfx_v10.c,
amdgpu_amdkfd_gfx_v10_3.c, amdgpu_amdkfd_gfx_v11.c and
amdgpu_amdkfd_gfx_v12.c to make the intended allocation size clearer
and avoid potential overflow issues.

Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Rahul Kumar <rk0006818@gmail.com>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:35:54 -04:00
Sonny Jiang
854b9ab637 drm/amdgpu: Update amdgpu_vcn5_fw_shared for vcn_5_0_1
Align vcn5_fw_shared structure with FW

Signed-off-by: Sonny Jiang <sonny.jiang@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:22:51 -04:00
Mario Limonciello
1fb710793c drm/amdgpu: Enable MES lr_compute_wa by default
The MES set resources packet has an optional bit 'lr_compute_wa'
which can be used for preventing MES hangs on long compute jobs.

Set this bit by default.

Co-developed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:22:38 -04:00
Sunil Khatri
c5b3cc417b drm/amdgpu: use hmm_pfns instead of array of pages
we dont need to allocate local array of pages to hold
the pages returned by the hmm, instead we could use
the hmm_range structure itself to get to hmm_pfn
and get the required pages directly.

This avoids call to alloc/free quite a lot.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:22:31 -04:00
Lijo Lazar
b29c22b8da drm/amdgpu: Fix vbios build number parsing logic
It's not necessary that the build string and atom header section has a
difference of 32 bytes. Use the remaining bytes in the section as copy
limit.

Fixes: d6fa802661 ("drm/amdgpu: Add vbios build number interface")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:21:09 -04:00
Dave Airlie
342f141ba9 Merge tag 'amd-drm-next-6.18-2025-09-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.18-2025-09-19:

amdgpu:
- Fence drv clean up fix
- DPC fixes
- Misc display fixes
- Support the MMIO remap page as a ttm pool
- JPEG parser updates
- UserQ updates
- VCN ctx handling fixes
- Documentation updates
- Misc cleanups
- SMU 13.0.x updates
- SI DPM updates
- GC 11.x cleaner shader updates
- DMCUB updates
- DML fixes
- Improve fallback handling for pixel encoding
- VCN reset improvements
- DCE6 DC updates
- DSC fixes
- Use devm for i2c buses
- GPUVM locking updates
- GPUVM documentation improvements
- Drop non-DC DCE11 code
- S0ix fixes
- Backlight fix
- SR-IOV fixes

amdkfd:
- SVM updates

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250919193354.2989255-1-alexander.deucher@amd.com
2025-09-22 08:45:51 +10:00
Guangshuo Li
cc9a8e238e drm/amdgpu/atom: Check kcalloc() for WS buffer in amdgpu_atom_execute_table_locked()
kcalloc() may fail. When WS is non-zero and allocation fails, ectx.ws
remains NULL while ectx.ws_size is set, leading to a potential NULL
pointer dereference in atom_get_src_int() when accessing WS entries.

Return -ENOMEM on allocation failure to avoid the NULL dereference.

Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 16:59:27 -04:00
Christian König
59e4405e9e drm/amdgpu: revert to old status lock handling v3
It turned out that protecting the status of each bo_va with a
spinlock was just hiding problems instead of solving them.

Revert the whole approach, add a separate stats_lock and lockdep
assertions that the correct reservation lock is held all over the place.

This not only allows for better checks if a state transition is properly
protected by a lock, but also switching back to using list macros to
iterate over the state of lists protected by the dma_resv lock of the
root PD.

v2: re-add missing check
v3: split into two patches

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 16:59:14 -04:00
Alex Deucher
9272bb34b0 drm/amdgpu: suspend KFD and KGD user queues for S0ix
We need to make sure the user queues are preempted so
GFX can enter gfxoff.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: David Perry <david.perry@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f8b367e6fa)
Cc: stable@vger.kernel.org
2025-09-18 14:59:41 -04:00
Alex Deucher
2ade36eaa9 drm/amdkfd: add proper handling for S0ix
When in S0i3, the GFX state is retained, so all we need to do
is stop the runlist so GFX can enter gfxoff.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: David Perry <david.perry@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4bfa860993)
Cc: stable@vger.kernel.org
2025-09-18 14:59:24 -04:00
Sunil Khatri
0aa09d8a6c drm/amdgpu: add missing comment for the new argument
In function 'amdgpu_vm_lock_done_list' update the comment
for the new argument 'vm'.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509180211.UAqME0zj-lkp@intel.com/
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 09:43:38 -04:00
Alex Deucher
f8b367e6fa drm/amdgpu: suspend KFD and KGD user queues for S0ix
We need to make sure the user queues are preempted so
GFX can enter gfxoff.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: David Perry <david.perry@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 09:43:30 -04:00
Alex Deucher
846de1384a drm/amdgpu/userq: Optimize S0ix handling
In S0i3, GFX state is retained, so it's preferrable to
preempt queues rather than unmapping them as the overhead
is lower.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: David Perry <david.perry@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 09:43:23 -04:00
Joe.Wang
f05c03ffc7 drm/amdgpu: Fix PRT flag for gfx12
AMDGPU_PTE_PRT_GFX12 flag is missed during pageTable rework, add it back.

Fixes: 6716a823d1 ("drm/amdgpu: rework how PTE flags are generated v3")
Signed-off-by: Joe Wang <joe.wang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 09:43:10 -04:00
Xiang Liu
1ed511fb76 drm/amdgpu: Check VF critical region before RAS poison injection
Check VF critical region before RAS poison injection to ensure that the
poison injection will not hit the VF critical region.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Shravan Kumar Gande <Shravankumar.Gande@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 09:43:10 -04:00
Alex Deucher
4bfa860993 drm/amdkfd: add proper handling for S0ix
When in S0i3, the GFX state is retained, so all we need to do
is stop the runlist so GFX can enter gfxoff.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: David Perry <david.perry@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 09:43:02 -04:00
Xiang Liu
f1fdeb3d07 drm/amdgpu: Introduce VF critical region check for RAS poison injection
The SRIOV guest send requet to host to check whether the poison
injection address is in VF critical region or not via mabox.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Shravan Kumar Gande <Shravankumar.Gande@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 09:43:02 -04:00
Alex Deucher
18f769ff36 drm/amdgpu: remove non-DC DCE 11 code
DC has been the default for ~8 years now and supports
many things that the non-DC code does not (audio, DP MST, etc.).
No DCE 11.x IPs ever supported analog encoders so that is not
an issue.  Finally drop this code.

Acked-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 09:43:02 -04:00
Christian König
ed7a4397f5 drm/ttm: rename ttm_bo_put to _fini v3
Give TTM BOs a separate cleanup function.

No funktional change, but the next step in removing the TTM BO reference
counting and replacing it with the GEM object reference counting.

v2: move the code around a bit to make it clearer what's happening
v3: fix nouveau_bo_fini as well

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://lore.kernel.org/r/20250909144311.1927-1-christian.koenig@amd.com
2025-09-17 14:03:21 +02:00
Christian König
df99f6d112 drm/amdgpu: re-order and document VM code
Re-order fields in the VM structure and try to improve the
documentation a bit.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-16 17:51:48 -04:00
Christian König
930595df25 drm/amdgpu: remove check for BO reservation add assert instead
We should leave such checks to lockdep and not implement something
manually.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-16 17:51:35 -04:00
Rodrigo Siqueira
63137c7c8c drm/amdgpu: Use devm_i2c_add_adapter() in SMU V11
Instead of using i2c_add_adapter() and i2c_del_adapter() in the SMU V11,
use devm_i2c_add_adapter() to simplify the code path.

Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-16 17:47:24 -04:00
Rodrigo Siqueira
0f36a3c6af drm/amdgpu/amdgpu_i2c: Use devm_i2c_add_adapter instead of i2c_add_adapter
This commit replaces i2c_add_adapter() with devm_i2c_add_adapter() and
removes part of the cleanup logic since the new function handles the i2c
removal.

Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-16 17:47:22 -04:00
Christian König
39203f5e6d drm/amdgpu: fix userq VM validation v4
That was actually complete nonsense and not validating the BOs
at all. The code just cleared all VM areas were it couldn't grab the
lock for a BO.

Try to fix this. Only compile tested at the moment.

v2: fix fence slot reservation as well as pointed out by Sunil.
    also validate PDs, PTs, per VM BOs and update PDEs
v3: grab the status_lock while working with the done list.
v4: rename functions, add some comments, fix waiting for updates to
    complete.
v4: rename amdgpu_vm_lock_done_list(), add some more comments

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-16 17:47:06 -04:00
Christian König
d7ddcf921e drm/amdgpu: reject gang submissions under SRIOV
Gang submission means that the kernel driver guarantees that multiple
submissions are executed on the HW at the same time on different engines.

Background is that those submissions then depend on each other and each
can't finish stand alone.

SRIOV now uses world switch to preempt submissions on the engines to allow
sharing the HW resources between multiple VFs.

The problem is now that the SRIOV world switch can't know about such inter
dependencies and will cause a timeout if it waits for a partially running
gang submission.

To conclude SRIOV and gang submissions are fundamentally incompatible at
the moment. For now just disable them.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-16 17:47:00 -04:00
Srinivasan Shanmugam
c1b6b8c770 drm/amdgpu/gfx11: Add Cleaner Shader Support for GFX11.0.1/11.0.4 GPUs
Enable the cleaner shader for additional GFX11.0.1/11.0.4 series GPUs to
ensure data isolation among GPU tasks. The cleaner shader is tasked with
clearing the Local Data Store (LDS), Vector General Purpose Registers
(VGPRs), and Scalar General Purpose Registers (SGPRs), which helps avoid
data leakage and guarantees the accuracy of computational results.

This update extends cleaner shader support to GFX11.0.1/11.0.4 GPUs,
previously available for GFX11.0.3. It enhances security by clearing GPU
memory between processes and maintains a consistent GPU state across KGD
and KFD workloads.

Cc: Wasee Alam <wasee.alam@amd.com>
Cc: Mario Sopena-Novales <mario.novales@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0a71ceb27f)
2025-09-15 17:23:42 -04:00
Christian König
2740509623 drm/amdgpu: revert "Implement new dummy vram manager"
This is should be unnecessary since a VRAM manager isn't mandatory in
the first place.

It could be that we have some missing checks inside AMDGPU or TTM but
those should then be fixed instead of worked around like that.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15 17:04:49 -04:00
Christian König
a9273da04f drm/amdgpu: add AMDGPU_IDS_FLAGS_GANG_SUBMIT
Add a UAPI flag indicating if gang submit is supported or not.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15 17:04:42 -04:00
Shaoyun Liu
85442bac84 drm/amd/amdgpu: Fix the mes version that support inv_tlbs
MES pipe0 will do VM invalidation with engine set 5 when assign VMID to a process,
driver will submit inv_tlb package to mes pipe1. It might run into race condition
if both pipes use the same invalidate engine set. From MES version 0x83 it will use
invalidate engine set 6 for pipe1 to fix the issue

Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15 17:02:44 -04:00
Mario Limonciello (AMD)
531df041f2 drm/amd: Avoid evicting resources at S5
Normally resources are evicted on dGPUs at suspend or hibernate and
on APUs at hibernate.  These steps are unnecessary when using the S4
callbacks to put the system into S5.

Cc: AceLan Kao <acelan.kao@canonical.com>
Cc: Kai-Heng Feng <kaihengf@nvidia.com>
Cc: Mark Pearson <mpearson-lenovo@squebb.ca>
Cc: Denis Benato <benato.denis96@gmail.com>
Cc: Merthan Karakaş <m3rthn.k@gmail.com>
Tested-by: Eric Naim <dnaim@cachyos.org>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15 17:02:39 -04:00
Jesse.Zhang
bb1d7f157e drm/amdgpu: Switch user queues to use preempt/restore for eviction
This patch modifies the user queue management to use preempt/restore
operations instead of full map/unmap for queue eviction scenarios where
applicable. The changes include:

1. Introduces new helper functions:
   - amdgpu_userqueue_preempt_helper()
   - amdgpu_userqueue_restore_helper()

2. Updates queue state management to track PREEMPTED state

3. Modifies eviction handling to use preempt instead of unmap:
   - amdgpu_userq_evict_all() now uses preempt_helper
   - amdgpu_userq_restore_all() now uses restore_helper

The preempt/restore approach provides better performance during queue
eviction by avoiding the overhead of full queue teardown and setup.
Full map/unmap operations are still used for initial setup/teardown
and system suspend scenarios.

v2: rename amdgpu_userqueue_restore_helper/amdgpu_userqueue_preempt_helper to
amdgpu_userq_restore_helper/amdgpu_userq_preempt_helper for consistency. (Alex)

v3: amdgpu_userq_stop_sched_for_enforce_isolation() and
amdgpu_userq_start_sched_for_enforce_isolation() should use preempt and restore (Alex)

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15 17:02:33 -04:00
Jesse.Zhang
5cefcbb306 drm/amdgpu: adjust MES API used for suspend and resume
Use the suspend and resume API rather than remove queue
and add queue API.  The former just preempts the queue
while the latter remove it from the scheduler completely.
There is no need to do that, we only need preemption
in this case.

V2: replace queue_active with queue state
v3: set the suspend_fence_addr
v4: allocate another per queue buffer for the suspend fence, and  set the sequence number.
    also wait for the suspend fence. (Alex)
v5: use a wb slot (Alex)
v6: Change the timeout period. For MES, the default timeout  is  2100000; /* 2100 ms */ (Alex)

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15 17:02:28 -04:00
Hawking Zhang
46fbe1e349 Revert "drm/amdgpu: Allocate psp fw private buffer in vram"
This reverts commit 22dcb283d6.
Need to certain APU platforms and will proceed to rework
the patch accordingly

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Le Ma <Le.Ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15 16:56:15 -04:00
Srinivasan Shanmugam
0a71ceb27f drm/amdgpu/gfx11: Add Cleaner Shader Support for GFX11.0.1/11.0.4 GPUs
Enable the cleaner shader for additional GFX11.0.1/11.0.4 series GPUs to
ensure data isolation among GPU tasks. The cleaner shader is tasked with
clearing the Local Data Store (LDS), Vector General Purpose Registers
(VGPRs), and Scalar General Purpose Registers (SGPRs), which helps avoid
data leakage and guarantees the accuracy of computational results.

This update extends cleaner shader support to GFX11.0.1/11.0.4 GPUs,
previously available for GFX11.0.3. It enhances security by clearing GPU
memory between processes and maintains a consistent GPU state across KGD
and KFD workloads.

Cc: Wasee Alam <wasee.alam@amd.com>
Cc: Mario Sopena-Novales <mario.novales@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15 16:56:07 -04:00
Lijo Lazar
780f7a45e5 drm/amdgpu: Add virtual device capabilities
Add a member to define the capabilities of virtual device.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15 16:55:48 -04:00
Lijo Lazar
1f9ba8ea04 drm/amdgpu: Add generic capability class
Define a utility macro for defining capabilities and their attributes.
Capability attributes are read-only, write-only, read-write.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15 16:55:41 -04:00