linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-23 09:05:50 -04:00

Author	SHA1	Message	Date
Jiang Liu	38e8ca3e4b	amdgpu/soc15: enable asic reset for dGPU in case of suspend abort When GPU suspend is aborted, do the same for dGPU as APU to reset soc15 asic. Otherwise it may cause following errors: [ 547.229463] amdgpu 0001:81:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring kiq_0.2.1.0 test failed (-110) [ 555.126827] amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring kiq_0.2.1.0 test failed (-110) [ 555.126901] [drm:amdgpu_gfx_enable_kcq [amdgpu]] ERROR KCQ enable failed [ 555.126957] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] ERROR resume of IP block <gfx_v9_4_3> failed -110 [ 555.126959] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_resume failed (-110). [ 555.126965] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -110 [ 555.126966] PM: Device 0000:0a:00.0 failed to resume async: error -110 This fix has been tested on Mi308X. Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/2462b4b12eb9d025e82525178d568cbaa4c223ff.1736739303.git.gerry@linux.alibaba.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:58 -05:00
Jesse.zhang@amd.com	30f7f53a5b	drm/amdgpu/gfx10: implement gfx queue reset via MMIO Using mmio to do queue reset v2: Alignment the function with gfx9/gfx9.4.3. Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:57 -05:00
Jesse.zhang@amd.com	ffdd7a7b28	drm/amdgpu/gfx10: implement queue reset via MMIO Using mmio to do queue reset. v2: Alignment this function with gfx9/gfx9.4.3. Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:57 -05:00
Lijo Lazar	f7a594e405	drm/amdgpu: Use active umc info from discovery There could be configs where some UMC instances are harvested. This information is obtained through discovery data and populated in umc.active_mask. Avoid reassigning this as AID mask, instead use the mask directly while iterating through umc instances. This is to avoid accesses to harvested UMC instances. v2: fix warning (Alex) Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:56 -05:00
Amber Lin	46d0436a3e	drm/amdgpu: Set noretry default for GC 9.5.0 Set GC 9.5.0 noretry default as 1 for better performance. It can be changed by the administrator using amdgpu.noretry=0 or by the user using HSA_XNACK=1 environment variable. Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviwanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:56 -05:00
Le Ma	23cb207751	drm/amdgpu: read harvest info from harvest table for gfx950 Harvest table is applied for gfx950. Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:56 -05:00
Shiwu Zhang	667b96134c	drm/amdgpu: enlarge the VBIOS binary size limit Some chips have a larger VBIOS file so raise the size limit to support the flashing tool. Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:56 -05:00
Alex Deucher	5f95a15495	drm/amdgpu: add dynamic workload profile switching for gfx12 Enable dynamic workload profile switching for gfx12. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:56 -05:00
Alex Deucher	963537ca23	drm/amdgpu: add dynamic workload profile switching for gfx11 Enable dynamic workload profile switching for gfx11. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:56 -05:00
Alex Deucher	b9467983b7	drm/amdgpu: add dynamic workload profile switching for gfx10 Enable dynamic workload profile switching for gfx10. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:56 -05:00
Alex Deucher	8fdb3958e3	drm/amdgpu/gfx: add ring helpers for setting workload profile Add helpers to switch the workload profile dynamically when commands are submitted. This allows us to switch to the FULLSCREEN3D or COMPUTE profile when work is submitted. Add a delayed work handler to delay switching out of the selected profile if additional work comes in. This works the same as the VIDEO profile for VCN. This lets dynamically enable workload profiles on the fly and then move back to the default when there is no work. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:55 -05:00
Xiaogang Chen	8544374c0f	drm/amdkfd: Have kfd driver use same PASID values from graphic driver Current kfd driver has its own PASID value for a kfd process and uses it to locate vm at interrupt handler or mapping between kfd process and vm. That design is not working when a physical gpu device has multiple spatial partitions, ex: adev in CPX mode. This patch has kfd driver use same pasid values that graphic driver generated which is per vm per pasid. These pasid values are passed to fw/hardware. We do not need change interrupt handler though more pasid values are used. Also, pasid values at log are replaced by user process pid; pasid values are not exposed to user. Users see their process pids that have meaning in user space. Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:55 -05:00
Lijo Lazar	ca44922107	drm/amdgpu: Check RRMT status for JPEG v4.0.3 RRMT could get dynamically enabled/disabled by PSP firmware. Read the status from register for reading RRMT status. For VFs, this is not accessible, hence assume that it's always disabled for now. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:55 -05:00
Lijo Lazar	485380f7fe	drm/amdgpu: Check RRMT status for VCN v4.0.3 RRMT could get dynamically enabled/disabled by PSP firmware. Read the status from register for reading RRMT status. For VFs, this is not accessible, hence assume that it's always disabled for now. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:55 -05:00
Tim Huang	e55565f880	drm/amdgpu: add support for PSP IP version 14.0.5 This initializes PSP IP version 14.0.5. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:55 -05:00
Tim Huang	e7704d7c72	drm/amdgpu: add support for SMU IP version 14.0.5 This initializes SMU IP version 14.0.5. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:55 -05:00
Tim Huang	6d437d5203	drm/amdgpu: enable VCN/JPEG CGPG for GC IP version 11.5.3 Enable VCN/JPEG CGPG for ASIC with GFX version 11.5.3. Signed-off-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com> Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:55 -05:00
Tim Huang	6bde08d317	drm/amdgpu: add support for MMHUB IP version 3.3.2 This initializes MMHUB IP version 3.3.2. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:55 -05:00
Tim Huang	e659c9eb87	drm/amdgpu: add support for NBIO IP version 7.11.2 This initializes NBIO IP version 7.11.2. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:55 -05:00
Tim Huang	b2e5a04147	drm/amdgpu: add support for SDMA IP version 6.1.3 This initializes SDMA IP version 6.1.3. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:55 -05:00
Tim Huang	b784faeba2	drm/amdgpu: add support for GC IP version 11.5.3 This initializes GC IP version 11.5.3. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:55 -05:00
Alex Deucher	20f48be63d	drm/amdgpu: add OEM i2c bus for polaris chips It uses the VGADCC bus. DC doesn't use this bus, so it is safe to add it here. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:54 -05:00
Alex Deucher	1c0b144bf7	drm/amdgpu: rework i2c init and fini No functional change. Rework the code to allow for adding some additional i2c buses in conjunction with DC in the future. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:54 -05:00
Alex Deucher	ba7f8eb7e4	drm/amdgpu/atombios: drop empty function This was leftover from when amdgpu was forked from radeon. The function is empty so drop it. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:54 -05:00
Alex Deucher	b217105acb	drm/amd/display/dm: handle OEM i2c buses in i2c functions Allow the creation of an OEM i2c bus and use the proper DC helpers for that case. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:54 -05:00
Sathishkumar S	8064ca6e93	drm/amdgpu: increase amdgpu max rings limit increase max rings to 132 to support all JPEG5_0_1 cores, else ring_init fails due to ring count exceeding maximum limit. Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:54 -05:00
Jiang Liu	a0a455b4bc	drm/amdgpu: bail out when failed to load fw in psp_init_cap_microcode() In function psp_init_cap_microcode(), it should bail out when failed to load firmware, otherwise it may cause invalid memory access. Fixes: `07dbfc6b10` ("drm/amd: Use `amdgpu_ucode_*` helpers for PSP") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 19:47:15 -05:00
Alex Deucher	55ed2b1b50	drm/amdgpu: bump version for RV/PCO compute fix Bump the driver version for RV/PCO compute stability fix so mesa can use this check to enable compute queues on RV/PCO. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.12.x	2025-02-12 19:47:15 -05:00
Alex Deucher	b35eb9128e	drm/amdgpu/gfx9: manually control gfxoff for CS on RV When mesa started using compute queues more often we started seeing additional hangs with compute queues. Disabling gfxoff seems to mitigate that. Manually control gfxoff and gfx pg with command submissions to avoid any issues related to gfxoff. KFD already does the same thing for these chips. v2: limit to compute v3: limit to APUs v4: limit to Raven/PCO v5: only update the compute ring_funcs v6: Disable GFX PG v7: adjust order Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Suggested-by: Błażej Szczygieł <mumei6102@gmail.com> Suggested-by: Sergey Kovalenko <seryoga.engineering@gmail.com> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3861 Link: https://lists.freedesktop.org/archives/amd-gfx/2025-January/119116.html Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.12.x	2025-02-12 19:47:01 -05:00
Philipp Stanner	796a9f55a8	drm/sched: Use struct for drm_sched_init() params drm_sched_init() has a great many parameters and upcoming new functionality for the scheduler might add even more. Generally, the great number of parameters reduces readability and has already caused one missnaming, addressed in: commit `6f1cacf4eb` ("drm/nouveau: Improve variable name in nouveau_sched_init()"). Introduce a new struct for the scheduler init parameters and port all users. Reviewed-by: Liviu Dudau <liviu.dudau@arm.com> Acked-by: Matthew Brost <matthew.brost@intel.com> # for Xe Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> # for Panfrost and Panthor Reviewed-by: Christian Gmeiner <cgmeiner@igalia.com> # for Etnaviv Reviewed-by: Frank Binns <frank.binns@imgtec.com> # for Imagination Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> # for Sched Reviewed-by: Maíra Canal <mcanal@igalia.com> # for v3d Reviewed-by: Danilo Krummrich <dakr@kernel.org> Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> # for amdxdna Signed-off-by: Philipp Stanner <phasta@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20250211111422.21235-2-phasta@kernel.org	2025-02-12 11:59:52 +01:00
Maxime Ripard	93c7dd1b39	Merge drm/drm-next into drm-misc-next Bring rc1 to start the new release dev. Signed-off-by: Maxime Ripard <mripard@kernel.org>	2025-02-06 13:47:32 +01:00
Marek Olšák	2255b40cac	drm/amdgpu: add a BO metadata flag to disable write compression for Vulkan Vulkan can't support DCC and Z/S compression on GFX12 without WRITE_COMPRESS_DISABLE in this commit or a completely different DCC interface. AMDGPU_TILING_GFX12_SCANOUT is added because it's already used by userspace. Cc: stable@vger.kernel.org # 6.12.x Signed-off-by: Marek Olšák <marek.olsak@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-03 12:11:36 -05:00
Kenneth Feng	5cda56bd86	drm/amd/amdgpu: change the config of cgcg on gfx12 change the config of cgcg on gfx12 Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.12.x	2025-01-28 16:22:39 -05:00
Shaoyun Liu	335acfb64e	drm/amd/amdgpu: Enable scratch data dump for mes 12 MES internal will check CP_MES_MSCRATCH_LO/HI register to set scratch data location during ucode start, driver side need to start the MES one by one with different setting for each pipe Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-24 09:56:13 -05:00
Mario Limonciello	7e4cb7dea2	drm/amd: Clarify kdoc for amdgpu.gttsize Effectively amdgpu.gttsize gets set to ~1/2 of RAM, but that's controlled by what the TTM page limit is set to. Clarify the kdoc. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-24 09:56:08 -05:00
Srinivasan Shanmugam	dc915275ea	drm/amd/amdgpu: Prevent null pointer dereference in GPU bandwidth calculation If the parent is NULL, adev->pdev is used to retrieve the PCIe speed and width, ensuring that the function can still determine these capabilities from the device itself. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6193 amdgpu_device_gpu_bandwidth() error: we previously assumed 'parent' could be null (see line 6180) drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 6170 static void amdgpu_device_gpu_bandwidth(struct amdgpu_device adev, 6171 enum pci_bus_speed speed, 6172 enum pcie_link_width width) 6173 { 6174 struct pci_dev parent = adev->pdev; 6175 6176 if (!speed \|\| !width) 6177 return; 6178 6179 parent = pci_upstream_bridge(parent); 6180 if (parent && parent->vendor == PCI_VENDOR_ID_ATI) { ^^^^^^ If parent is NULL 6181 /* use the upstream/downstream switches internal to dGPU / 6182 speed = pcie_get_speed_cap(parent); 6183 width = pcie_get_width_cap(parent); 6184 while ((parent = pci_upstream_bridge(parent))) { 6185 if (parent->vendor == PCI_VENDOR_ID_ATI) { 6186 / use the upstream/downstream switches internal to dGPU / 6187 speed = pcie_get_speed_cap(parent); 6188 width = pcie_get_width_cap(parent); 6189 } 6190 } 6191 } else { 6192 / use the device itself / --> 6193 speed = pcie_get_speed_cap(parent); ^^^^^^ Then we are toasted here. 6194 *width = pcie_get_width_cap(parent); 6195 } 6196 } Fixes: `757e8b951c` ("drm/amdgpu: cache gpu pcie link width") Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-24 09:55:26 -05:00
Lin.Cao	b529093999	drm/amdgpu: fix ring timeout issue in gfx10 sr-iov environment commit `26c95e838e` ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare") set job->vm as NULL if there is no fence. It will cause emit switch buffer be skippen if job->vm set as NULL. Check job rather than vm could solve this problem. Fixes: `26c95e838e` ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare") Signed-off-by: Lin.Cao <lincao12@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-24 09:55:04 -05:00
Alex Deucher	64314e3f9c	drm/amdgpu: fix the PCIe lanes reporting in the INFO IOCTL Combine the platform and GPU caps like we do for PCIe Gen. This aligns properly with expectations and documentation for the interface. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3820 Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-24 09:53:30 -05:00
Alex Deucher	757e8b951c	drm/amdgpu: cache gpu pcie link width Get the PCIe link with of the device itself (or it's integrated upstream bridge) and cache that. v2: fix typo Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3820 Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-24 09:53:24 -05:00
Lijo Lazar	a0db1ea0dd	drm/amdgpu: Refine ip detection log message 'add ip block' causes a confusion if the blocks are disabled later with ip_block_mask. Instead change to 'detected' and also add device context. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-24 09:52:58 -05:00
Lijo Lazar	b1df8050e7	drm/amdgpu: Add handler for SDMA context empty Context empty interrupt is enabled for SDMA 4.4.2. Add a handler for context empty interrupt so that it is disposed of fast, and not propagated to KFD layer. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-24 09:52:43 -05:00
Srinivasan Shanmugam	19b7f7c721	drm/amdgpu/gfx12: Add Cleaner Shader Support for GFX12.0 GPUs This commit enables the cleaner shader feature for GFX12.0 and GFX12.0.1 GPUs. The cleaner shader is important for clearing GPU resources such as Local Data Share (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs) between workloads. - This feature ensures that GPU resources are reset between workloads, preventing data leaks and ensuring accurate computation. By enabling the cleaner shader, this update enhances the security and reliability of GPU operations on GFX12.0 hardware. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-14 11:06:50 -05:00
Kent Russell	f2935a3019	drm/amdgpu: Mark debug KFD module params as unsafe Mark options only meant to be used for debugging as unsafe so that the kernel is tainted when they are used. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-14 11:06:50 -05:00
Gui Chengming	62952a38d9	drm/amdgpu: fix fw attestation for MP0_14_0_{2/3} FW attestation was disabled on MP0_14_0_{2/3}. V2: Move check into is_fw_attestation_support func. (Frank) Remove DRM_WARN log info. (Alex) Fix format. (Christian) Signed-off-by: Gui Chengming <Jack.Gui@amd.com> Reviewed-by: Frank.Min <Frank.Min@amd.com> Reviewed-by: Christian König <Christian.Koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-14 11:06:50 -05:00
Christian König	def59436fb	drm/amdgpu: always sync the GFX pipe on ctx switch That is needed to enforce isolation between contexts. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-14 11:06:50 -05:00
Christian König	177b76a8d8	drm/amdgpu: mark a bunch of module parameters unsafe We sometimes have people trying to use debugging options in production environments. Mark options only meant to be used for debugging as unsafe so that the kernel is tainted when they are used. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Simona Vetter <simona.vetter@ffwll.ch> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-14 11:06:50 -05:00
Tvrtko Ursulin	e996127ec1	drm/amdgpu: Use DRM scheduler API in amdgpu_xcp_release_sched Lets use the existing helper instead of peeking into the structure directly. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Christian König <christian.koenig@amd.com> Cc: Danilo Krummrich <dakr@redhat.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Philipp Stanner <pstanner@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-14 11:06:50 -05:00
Kenneth Feng	2affe2bbc9	drm/amdgpu: disable gfxoff with the compute workload on gfx12 Disable gfxoff with the compute workload on gfx12. This is a workaround for the opencl test failure. Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-14 11:06:50 -05:00
Srinivasan Shanmugam	0b6b2dd383	drm/amdgpu: Fix Circular Locking Dependency in AMDGPU GFX Isolation This commit addresses a circular locking dependency issue within the GFX isolation mechanism. The problem was identified by a warning indicating a potential deadlock due to inconsistent lock acquisition order. - The `amdgpu_gfx_enforce_isolation_ring_begin_use` and `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously acquired `enforce_isolation_mutex` and called `amdgpu_gfx_kfd_sch_ctrl`, leading to potential deadlocks. ie., If `amdgpu_gfx_kfd_sch_ctrl` is called while `enforce_isolation_mutex` is held, and `amdgpu_gfx_enforce_isolation_handler` is called while `kfd_sch_mutex` is held, it can create a circular dependency. By ensuring consistent lock usage, this fix resolves the issue: [ 606.297333] ====================================================== [ 606.297343] WARNING: possible circular locking dependency detected [ 606.297353] 6.10.0-amd-mlkd-610-311224-lof #19 Tainted: G OE [ 606.297365] ------------------------------------------------------ [ 606.297375] kworker/u96:3/3825 is trying to acquire lock: [ 606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610 [ 606.297413] but task is already holding lock: [ 606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.297725] which lock already depends on the new lock. [ 606.297738] the existing dependency chain (in reverse order) is: [ 606.297749] -> #2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}: [ 606.297765] __mutex_lock+0x85/0x930 [ 606.297776] mutex_lock_nested+0x1b/0x30 [ 606.297786] amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.298007] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.298225] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.298412] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.298603] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.298866] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.298880] process_one_work+0x21e/0x680 [ 606.298890] worker_thread+0x190/0x350 [ 606.298899] kthread+0xe7/0x120 [ 606.298908] ret_from_fork+0x3c/0x60 [ 606.298919] ret_from_fork_asm+0x1a/0x30 [ 606.298929] -> #1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}: [ 606.298947] __mutex_lock+0x85/0x930 [ 606.298956] mutex_lock_nested+0x1b/0x30 [ 606.298966] amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu] [ 606.299190] process_one_work+0x21e/0x680 [ 606.299199] worker_thread+0x190/0x350 [ 606.299208] kthread+0xe7/0x120 [ 606.299217] ret_from_fork+0x3c/0x60 [ 606.299227] ret_from_fork_asm+0x1a/0x30 [ 606.299236] -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}: [ 606.299257] __lock_acquire+0x16f9/0x2810 [ 606.299267] lock_acquire+0xd1/0x300 [ 606.299276] __flush_work+0x250/0x610 [ 606.299286] cancel_delayed_work_sync+0x71/0x80 [ 606.299296] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.299509] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.299723] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.299909] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.300101] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.300355] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.300369] process_one_work+0x21e/0x680 [ 606.300378] worker_thread+0x190/0x350 [ 606.300387] kthread+0xe7/0x120 [ 606.300396] ret_from_fork+0x3c/0x60 [ 606.300406] ret_from_fork_asm+0x1a/0x30 [ 606.300416] other info that might help us debug this: [ 606.300428] Chain exists of: (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex [ 606.300458] Possible unsafe locking scenario: [ 606.300468] CPU0 CPU1 [ 606.300476] ---- ---- [ 606.300484] lock(&adev->gfx.kfd_sch_mutex); [ 606.300494] lock(&adev->enforce_isolation_mutex); [ 606.300508] lock(&adev->gfx.kfd_sch_mutex); [ 606.300521] lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)); [ 606.300536] * DEADLOCK * [ 606.300546] 5 locks held by kworker/u96:3/3825: [ 606.300555] #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680 [ 606.300577] #1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680 [ 606.300600] #2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu] [ 606.300837] #3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.301062] #4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610 [ 606.301083] stack backtrace: [ 606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G OE 6.10.0-amd-mlkd-610-311224-lof #19 [ 606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024 [ 606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched] [ 606.301140] Call Trace: [ 606.301146] <TASK> [ 606.301154] dump_stack_lvl+0x9b/0xf0 [ 606.301166] dump_stack+0x10/0x20 [ 606.301175] print_circular_bug+0x26c/0x340 [ 606.301187] check_noncircular+0x157/0x170 [ 606.301197] ? register_lock_class+0x48/0x490 [ 606.301213] __lock_acquire+0x16f9/0x2810 [ 606.301230] lock_acquire+0xd1/0x300 [ 606.301239] ? __flush_work+0x232/0x610 [ 606.301250] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301261] ? mark_held_locks+0x54/0x90 [ 606.301274] ? __flush_work+0x232/0x610 [ 606.301284] __flush_work+0x250/0x610 [ 606.301293] ? __flush_work+0x232/0x610 [ 606.301305] ? __pfx_wq_barrier_func+0x10/0x10 [ 606.301318] ? mark_held_locks+0x54/0x90 [ 606.301331] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301345] cancel_delayed_work_sync+0x71/0x80 [ 606.301356] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.301661] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.302050] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.302069] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.302452] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.302862] ? drm_sched_entity_error+0x82/0x190 [gpu_sched] [ 606.302890] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.303366] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.303388] process_one_work+0x21e/0x680 [ 606.303409] worker_thread+0x190/0x350 [ 606.303424] ? __pfx_worker_thread+0x10/0x10 [ 606.303437] kthread+0xe7/0x120 [ 606.303449] ? __pfx_kthread+0x10/0x10 [ 606.303463] ret_from_fork+0x3c/0x60 [ 606.303476] ? __pfx_kthread+0x10/0x10 [ 606.303489] ret_from_fork_asm+0x1a/0x30 [ 606.303512] </TASK> v2: Refactor lock handling to resolve circular dependency (Alex) - Introduced a `sched_work` flag to defer the call to `amdgpu_gfx_kfd_sch_ctrl` until after releasing `enforce_isolation_mutex`. - This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside the critical section, preventing the circular dependency and deadlock. - The `sched_work` flag is set within the mutex-protected section if conditions are met, and the actual function call is made afterward. - This approach ensures consistent lock acquisition order. Fixes: `afefd6f245` ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization") Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Suggested-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-01-14 10:38:33 -05:00
Dave Airlie	c3d590f8ba	Merge tag 'amd-drm-next-6.14-2025-01-10' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.14-2025-01-10: amdgpu: - Fix max surface handling in DC - clang fixes - DCN 3.5 fixes - DCN 4.0.1 fixes - DC CRC fixes - DML updates - DSC fixes - PSR fixes - DC add some divide by 0 checks - SMU13 updates - SR-IOV fixes - RAS fixes - Cleaner shader support for gfx10.3 dGPUs - fix drm buddy trim handling - SDMA engine reset updates _ Fix RB bitmap setup - Fix doorbell ttm cleanup - Add CEC notifier support - DPIA updates - MST fixes amdkfd: - Shader debugger fixes - Trap handler cleanup - Cleanup includes - Eviction fence wq fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250110172731.2960668-1-alexander.deucher@amd.com	2025-01-13 11:13:13 +10:00

... 14 15 16 17 18 ...

15965 Commits