linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-22 00:33:58 -04:00

Author	SHA1	Message	Date
Mangesh Gadre	5e9aec4ea3	drm/amdgpu:Add psp v13_0_15 ip block Add support for psp v13_0_15 ip block Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:23:39 -05:00
Alex Deucher	57d00816c6	drm/amdgpu: set family for GC 11.5.4 Set the family for GC 11.5.4 Fixes: `47ae1f938d` ("drm/amdgpu: add support for GC IP version 11.5.4") Cc: Tim Huang <tim.huang@amd.com> Cc: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com> Cc: Roman Li <Roman.Li@amd.com> Reviewed-by: Tim Huang <tim.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:23:09 -05:00
Pierre-Eric Pelloux-Prayer	b18fc0ab83	drm/amdgpu: fix sync handling in amdgpu_dma_buf_move_notify Invalidating a dmabuf will impact other users of the shared BO. In the scenario where process A moves the BO, it needs to inform process B about the move and process B will need to update its page table. The commit fixes a synchronisation bug caused by the use of the ticket: it made amdgpu_vm_handle_moved behave as if updating the page table immediately was correct but in this case it's not. An example is the following scenario, with 2 GPUs and glxgears running on GPU0 and Xorg running on GPU1, on a system where P2P PCI isn't supported: glxgears: export linear buffer from GPU0 and import using GPU1 submit frame rendering to GPU0 submit tiled->linear blit Xorg: copy of linear buffer The sequence of jobs would be: drm_sched_job_run # GPU0, frame rendering drm_sched_job_queue # GPU0, blit drm_sched_job_done # GPU0, frame rendering drm_sched_job_run # GPU0, blit move linear buffer for GPU1 access # amdgpu_dma_buf_move_notify -> update pt # GPU0 It this point the blit job on GPU0 is still running and would likely produce a page fault. Cc: stable@vger.kernel.org Fixes: `a448cb003e` ("drm/amdgpu: implement amdgpu_gem_prime_move_notify v2") Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:22:03 -05:00
Lijo Lazar	9aca641143	drm/amdgpu: Move xgmi status to interface header These definitions are used by user APIs. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:22:00 -05:00
Gangliang Xie	044f8d3b1f	drm/amdgpu: return when ras table checksum is error end the function flow when ras table checksum is error Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:21:56 -05:00
Ce Sun	3ee1c72606	drm/amdgpu: Adjust usleep_range in fence wait Tune the sleep interval in the PSP fence wait loop from 10-100us to 60-100us.This adjustment results in an overall wait window of 1.2s (60us * 20000 iterations) to 2 seconds (100us * 20000 iterations), which guarantees that we can retrieve the correct fence value Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:21:41 -05:00
Pratik Vishwakarma	33ed922d24	drm/amd: Add CG/PG flags for GC 11.5.4 Enable GFXOff for GC 11.5.4 Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:21:30 -05:00
Pratik Vishwakarma	39a96f126d	drm/amdgpu: enable mode2 reset for SMU IP v15.0.0 Set the default reset method to mode2 for SMU 15.0.0. Signed-off-by: Kanala Ramalingeswara Reddy <Kanala.RamalingeswaraReddy@amd.com> Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:21:25 -05:00
Pratik Vishwakarma	e98bb71e24	drm/amdgpu: Load TA ucode for PSP 15_0_0 TOC and TA both are required Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:20:56 -05:00
Kenneth Feng	d76c66c6ad	drm/amd/pm: send unload command to smu during modprobe -r amdgpu Send unload command to smu during modprobe -r amdgpu for smu 13/14. 1. This can fix the high voltage/temperatue issue after driver is unloaded. 2. Reloading driver could fail but with the debug port based mode1 reset during driver is reloaded, it is good and safe. Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:20:07 -05:00
Srinivasan Shanmugam	ba03806565	drm/amdgpu: Fix missing unwind in amdgpu_ib_schedule() error path amdgpu_ib_schedule() returns early after calling amdgpu_ring_undo(). This skips the common free_fence cleanup path. Other error paths were already changed to use goto free_fence, but this one was missed. Change the early return to goto free_fence so all error paths clean up the same way. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c:232 amdgpu_ib_schedule() warn: missing unwind goto? drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c 124 int amdgpu_ib_schedule(struct amdgpu_ring ring, unsigned int num_ibs, 125 struct amdgpu_ib ibs, struct amdgpu_job job, 126 struct dma_fence *f) 127 { ... 224 225 if (ring->funcs->insert_start) 226 ring->funcs->insert_start(ring); 227 228 if (job) { 229 r = amdgpu_vm_flush(ring, job, need_pipe_sync); 230 if (r) { 231 amdgpu_ring_undo(ring); --> 232 return r; The patch changed the other error paths to goto free_fence but this one was accidentally skipped. 233 } 234 } 235 236 amdgpu_ring_ib_begin(ring); ... 338 339 free_fence: 340 if (!job) 341 kfree(af); 342 return r; 343 } Fixes: `f903b85ed0` ("drm/amdgpu: fix possible fence leaks from job structure") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:17:31 -05:00
Kent Russell	5028a24aa8	drm/amdgpu: Send applicable RMA CPERs at end of RAS init Firmware and monitoring tools may not be ready to receive a CPER when we read the bad pages, so send the CPERs at the end of RAS initialization to ensure that the FW is ready to receive and process the CPER. This removes the previous CPER submission that was added during bad page load, and sends both in-band and out-of-band at the same time. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:28:34 -05:00
Mario Limonciello	f7afda7fcd	drm/amd: Fix hang on amdgpu unload by using pci_dev_is_disconnected() The commit `6a23e7b433` ("drm/amd: Clean up kfd node on surprise disconnect") introduced early KFD cleanup when drm_dev_is_unplugged() returns true. However, this causes hangs during normal module unload (rmmod amdgpu). The issue occurs because drm_dev_unplug() is called in amdgpu_pci_remove() for all removal scenarios, not just surprise disconnects. This was done intentionally in commit `39934d3ed5` ("Revert "drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled"") to fix IGT PCI software unplug test failures. As a result, drm_dev_is_unplugged() returns true even during normal module unload, triggering the early KFD cleanup inappropriately. The correct check should distinguish between: - Actual surprise disconnect (eGPU unplugged): pci_dev_is_disconnected() returns true - Normal module unload (rmmod): pci_dev_is_disconnected() returns false Replace drm_dev_is_unplugged() with pci_dev_is_disconnected() to ensure the early cleanup only happens during true hardware disconnect events. Cc: stable@vger.kernel.org Reported-by: Cal Peake <cp@absolutedigital.net> Closes: https://lore.kernel.org/all/b0c22deb-c0fa-3343-33cf-fd9a77d7db99@absolutedigital.net/ Fixes: `6a23e7b433` ("drm/amd: Clean up kfd node on surprise disconnect") Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:25:57 -05:00
Alex Deucher	6952255ed9	drm/amdgpu: re-add the bad job to the pending list for ring resets Returning DRM_GPU_SCHED_STAT_NO_HANG causes the scheduler to add the bad job back the pending list. We've already set the errors on the fence and killed the bad job at this point so it's the correct behavior. Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:25:44 -05:00
Victor Zhao	5cc7bbd9f1	drm/amdgpu: avoid sdma ring reset in sriov sdma ring reset is not supported in SRIOV. kfd driver does not check reset mask, and could queue sdma ring reset during unmap_queues_cpsch. Avoid the ring reset for sriov. Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:25:37 -05:00
Sunil Khatri	f025a2b8d9	drm/amdgpu: clean up the amdgpu_cs_parser_bos In low memory conditions, kmalloc can fail. In such conditions unlock the mutex for a clean exit. We do not need to amdgpu_bo_list_put as it's been handled in the amdgpu_cs_parser_fini. Fixes: `737da5363c` ("drm/amdgpu: update the functions to use amdgpu version of hmm") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202602030017.7E0xShmH-lkp@intel.com/ Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:24:33 -05:00
Philip Yang	3b948dd036	drm/amdgpu: Use 5-level paging if gmc support 57-bit VA Regardless if CPU enable 5-level paging, GPU vm use 5-level paging if gmc init with 57-bit address space support, because ARM64 4-level paging support 48-bit VA, x86 and GPU 4-level paging support 47-bit VA, require 5-level paging on GPU to support ARM64. NPA address space 52-bit mapping on NPA GPU VM require 5-level paging. Debugger trap get device snapshot expect LDS and Scratch base, limit above 57-bit, which is set only for 5-level paging. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.19.x	2026-02-05 17:23:51 -05:00
Yifan Zhang	39fc2bc4da	drm/amdgpu: Protect GPU register accesses in powergated state in some paths Ungate GPU CG/PG in device_fini_hw and device_halt to protect GPU register accesses, e.g. GC registers are accessed in amdgpu_irq_disable_all() and amdgpu_fence_driver_hw_fini(). Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org	2026-02-05 17:20:33 -05:00
Alex Deucher	56423871e9	drm/amdgpu/sdma6: enable queue resets unconditionally There is no firmware version dependency. This also enables sdma queue resets on all SDMA 6.x based chips. Fixes: `59fd50b866` ("drm/amdgpu: Add sysfs interface for sdma reset mask") Cc: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:20:04 -05:00
Alex Deucher	314d30ad50	drm/amdgpu/sdma5.2: enable queue resets unconditionally There is no firmware version dependency. This also enables sdma queue resets on all SDMA 5.2.x based chips. Fixes: `59fd50b866` ("drm/amdgpu: Add sysfs interface for sdma reset mask") Cc: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:19:57 -05:00
Alex Deucher	46a2cb7d24	drm/amdgpu/sdma5: enable queue resets unconditionally There is no firmware version dependency. Fixes: `59fd50b866` ("drm/amdgpu: Add sysfs interface for sdma reset mask") Cc: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:19:40 -05:00
Zilin Guan	ee41e5b63c	drm/amdgpu: Fix memory leak in amdgpu_ras_init() When amdgpu_nbio_ras_sw_init() fails in amdgpu_ras_init(), the function returns directly without freeing the allocated con structure, leading to a memory leak. Fix this by jumping to the release_con label to properly clean up the allocated memory before returning the error code. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: `fdc94d3a8c` ("drm/amdgpu: Rework pcie_bif ras sw_init") Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:19:09 -05:00
Zilin Guan	0c44d61945	drm/amdgpu: Use kvfree instead of kfree in amdgpu_gmc_get_nps_memranges() amdgpu_discovery_get_nps_info() internally allocates memory for ranges using kvcalloc(), which may use vmalloc() for large allocation. Using kfree() to release vmalloc memory will lead to a memory corruption. Use kvfree() to safely handle both kmalloc and vmalloc allocations. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: `b194d21b9b` ("drm/amdgpu: Use NPS ranges from discovery table") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:18:33 -05:00
Zilin Guan	c9be63d565	drm/amdgpu: Fix memory leak in amdgpu_acpi_enumerate_xcc() In amdgpu_acpi_enumerate_xcc(), if amdgpu_acpi_dev_init() returns -ENOMEM, the function returns directly without releasing the allocated xcc_info, resulting in a memory leak. Fix this by ensuring that xcc_info is properly freed in the error paths. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: `4d5275ab0b` ("drm/amdgpu: Add parsing of acpi xcc objects") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:17:57 -05:00
Lijo Lazar	8980be03b3	drm/amdgpu: Skip vcn poison irq release on VF VF doesn't enable VCN poison irq in VCNv2.5. Skip releasing it and avoid call trace during deinitialization. [ 71.913601] [drm] clean up the vf2pf work item [ 71.915088] ------------[ cut here ]------------ [ 71.915092] WARNING: CPU: 3 PID: 1079 at /tmp/amd.aFkFvSQl/amd/amdgpu/amdgpu_irq.c:641 amdgpu_irq_put+0xc6/0xe0 [amdgpu] [ 71.915355] Modules linked in: amdgpu(OE-) amddrm_ttm_helper(OE) amdttm(OE) amddrm_buddy(OE) amdxcp(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) drm_suballoc_helper drm_display_helper cec rc_core i2c_algo_bit video wmi binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common input_leds joydev serio_raw mac_hid qemu_fw_cfg sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 hid_generic crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel usbhid 8139too sha256_ssse3 sha1_ssse3 hid psmouse bochs i2c_i801 ahci drm_vram_helper libahci i2c_smbus lpc_ich drm_ttm_helper 8139cp mii ttm aesni_intel crypto_simd cryptd [ 71.915484] CPU: 3 PID: 1079 Comm: rmmod Tainted: G OE 6.8.0-87-generic #88~22.04.1-Ubuntu [ 71.915489] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9_5.1 04/01/2014 [ 71.915492] RIP: 0010:amdgpu_irq_put+0xc6/0xe0 [amdgpu] [ 71.915768] Code: 75 84 b8 ea ff ff ff eb d4 44 89 ea 48 89 de 4c 89 e7 e8 fd fc ff ff 5b 41 5c 41 5d 41 5e 5d 31 d2 31 f6 31 ff e9 55 30 3b c7 <0f> 0b eb d4 b8 fe ff ff ff eb a8 e9 b7 3b 8a 00 66 2e 0f 1f 84 00 [ 71.915771] RSP: 0018:ffffcf0800eafa30 EFLAGS: 00010246 [ 71.915775] RAX: 0000000000000000 RBX: ffff891bda4b0668 RCX: 0000000000000000 [ 71.915777] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 71.915779] RBP: ffffcf0800eafa50 R08: 0000000000000000 R09: 0000000000000000 [ 71.915781] R10: 0000000000000000 R11: 0000000000000000 R12: ffff891bda480000 [ 71.915782] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000 [ 71.915792] FS: 000070cff87c4c40(0000) GS:ffff893abfb80000(0000) knlGS:0000000000000000 [ 71.915795] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 71.915797] CR2: 00005fa13073e478 CR3: 000000010d634006 CR4: 0000000000770ef0 [ 71.915800] PKRU: 55555554 [ 71.915802] Call Trace: [ 71.915805] <TASK> [ 71.915809] vcn_v2_5_hw_fini+0x19e/0x1e0 [amdgpu] Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-03 16:48:12 -05:00
Harish Kasiviswanathan	6ba60345f4	drm/amdgpu: Fix double deletion of validate_list If amdgpu_amdkfd_gpuvm_free_memory_of_gpu() fails after kgd_mem is removed from validate_list, the mem handle still lingers in the KFD idr. This means when process is terminated, kfd_process_free_outstanding_kfd_bos() will call amdgpu_amdkfd_gpuvm_free_memory_of_gpu() again resulting in double deletion. To avoid this - (a) Check if list is empty before deleting it (b) Rearragne amdgpu_amdkfd_gpuvm_free_memory_of_gpu() such that it can be safely called again if it returns failure the first time. Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-03 16:46:33 -05:00
Andrew Martin	08b1eb621c	drm/amdgpu: Ignored various return code The return code of a non void function should not be ignored. In cases where we do not care, the code needs to suppress it. Signed-off-by: Andrew Martin <andrew.martin@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-03 16:46:31 -05:00
Jinzhou Su	c6fa06fc4c	drm/amdgpu/psp_v15_0_8: Add get ras capability Add get ras capability for psp 15.0.8. v2:Remove APU type check and IP version check. Signed-off-by: Jinzhou Su <jinzhou.su@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-03 16:46:25 -05:00
Bert Karwatzki	97a9689300	Revert "drm/amd: Check if ASPM is enabled from PCIe subsystem" This reverts commit `7294863a6f`. This commit was erroneously applied again after commit `0ab5d711ec` ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") removed it, leading to very hard to debug crashes, when used with a system with two AMD GPUs of which only one supports ASPM. Link: https://lore.kernel.org/linux-acpi/20251006120944.7880-1-spasswolf@web.de/ Link: https://github.com/acpica/acpica/issues/1060 Fixes: `0ab5d711ec` ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") Signed-off-by: Bert Karwatzki <spasswolf@web.de> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-03 16:43:59 -05:00
Stanley.Yang	500449eb70	drm/amdgpu: statistic xgmi training error count Report xgmi training error uncorrectable error count. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-03 16:43:46 -05:00
Mario Limonciello	c2d2ccc85f	drm/amd: Set minimum version for set_hw_resource_1 on gfx11 to 0x52 commit `f81cd79311` ("drm/amd/amdgpu: Fix MES init sequence") caused a dependency on new enough MES firmware to use amdgpu. This was fixed on most gfx11 and gfx12 hardware with commit `0180e0a5dd` ("drm/amdgpu/mes: add compatibility checks for set_hw_resource_1"), but this left out that GC 11.0.4 had breakage at MES 0x51. Bump the requirement to 0x52 instead. Reported-by: danijel@nausys.com Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4576 Fixes: `f81cd79311` ("drm/amd/amdgpu: Fix MES init sequence") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-03 16:40:15 -05:00
Perry Yuan	31b153315b	drm/amdgpu: ensure no_hw_access is visible before MMIO Add a full memory barrier after clearing no_hw_access in amdgpu_device_mode1_reset() so subsequent PCI state restore access cannot observe stale state on other CPUs. Fixes: `7edb503fe4` ("drm/amd/pm: Disable MMIO access during SMU Mode 1 reset") Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-03 16:23:11 -05:00
Dave Airlie	a60f627cf4	Merge tag 'amd-drm-next-6.20-2026-01-30' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.20-2026-01-30: amdgpu: - Misc cleanups - SMU 13 fixes - SMU 14 fixes - GPUVM fault filter fix - USB4 fixes - DC FP guard fixes - Powergating fix - JPEG ring reset fix - RAS fixes - Xclk fix for soc21 APUs - Fix COND_EXEC handling for GC 11 - UserQ fixes - MQD size alignment fixes - SMU feature interface cleanup - GC 10-12 KGQ init fixes - GC 11-12 KGQ reset fixes amdkfd: - Fix device snapshot reporting - GC 12.1 trap handler fixes - MQD size alignment fixes Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patch.msgid.link/20260130183257.28879-1-alexander.deucher@amd.com	2026-02-02 05:51:54 +10:00
Alex Deucher	0a6d6ed694	drm/amdgpu/gfx12: adjust KGQ reset sequence Kernel gfx queues do not need to be reinitialized or remapped after a reset. Align with gfx11. v2: preserve init and remap for MMIO case. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-29 12:27:37 -05:00
Alex Deucher	b340ff216f	drm/amdgpu/gfx11: adjust KGQ reset sequence Kernel gfx queues do not need to be reinitialized or remapped after a reset. This fixes queue reset failures on APUs. v2: preserve init and remap for MMIO case. Fixes: `b3e9bfd866` ("drm/amdgpu/gfx11: add ring reset callbacks") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789 Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-29 12:27:29 -05:00
Alex Deucher	a2918f958d	drm/amdgpu/gfx12: fix wptr reset in KGQ init wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-29 12:27:27 -05:00
Alex Deucher	1f16866bdb	drm/amdgpu/gfx11: fix wptr reset in KGQ init wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-29 12:27:18 -05:00
Alex Deucher	e80b1d1aa1	drm/amdgpu/gfx10: fix wptr reset in KGQ init wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-29 12:27:10 -05:00
Lang Yu	a6a4dd519c	drm/amdgpu: Use AMDGPU_MQD_SIZE_ALIGN in KGD Use AMDGPU_MQD_SIZE_ALIGN for both kernel and user queue. Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: David Belanger <david.belanger@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-29 12:26:55 -05:00
Lang Yu	82a9ab369a	drm/amdgpu: Add a helper macro to align mqd size MES FW uses address(mqd_addr + sizeof(struct mqd) + 3*sizeof(uint32_t)) as fence address and writes a 32 bit fence value to this address. Driver needs to allocate some extra memory(at least 4 DWs) in addition to sizeof(struct mqd) as mqd memory(limited to gfx/compute/sdma queue). For gfx11/12, sizeof(struct mqd) < PAGE_SIZE, KGD allocates mqd memory with PAGE_SIZE aligned works. For gfx12.1, sizeof(struct mqd) == PAGE_SIZE, it doesn't work. KFD mqd manager hardcodes mqd size to PAGE_SIZE/MQD_SIZE across different IP versions to solve this issue. To avoid hardcoding in differnet places and across different IP versions. Let's use AMDGPU_MQD_SIZE_ALIGN instead. It is used in two places. 1. mqd memory alloction 2. mqd stride handling for multi xcc config v2: Use AMDGPU_GPU_PAGE_ALIGN. (Mukul) Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: David Belanger <david.belanger@amd.com> (v1) Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-29 12:26:26 -05:00
Jesse.Zhang	8079b87c02	drm/amdgpu: validate user queue size constraints Add validation to ensure user queue sizes meet hardware requirements: - Size must be a power of two for efficient ring buffer wrapping - Size must be at least AMDGPU_GPU_PAGE_SIZE to prevent undersized allocations This prevents invalid configurations that could lead to GPU faults or unexpected behavior. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-29 12:26:15 -05:00
Alex Deucher	ba205ac3d6	drm/amdgpu: Fix cond_exec handling in amdgpu_ib_schedule() The EXEC_COUNT field must be > 0. In the gfx shadow handling we always emit a cond_exec packet after the gfx_shadow packet, but the EXEC_COUNT never gets patched. This leads to a hang when we try and reset queues on gfx11 APUs. Fixes: `c68cbbfd54` ("drm/amdgpu: cleanup conditional execution") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789 Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-28 16:21:45 -05:00
Alex Deucher	637fee3954	drm/amdgpu/soc21: fix xclk for APUs The reference clock is supposed to be 100Mhz, but it appears to actually be slightly lower (99.81Mhz). Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14451 Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-28 16:21:31 -05:00
Dave Airlie	6704d98a4f	BackMerge tag 'v6.19-rc7' into drm-next Linux 6.19-rc7 This is needed for msm and rust trees. Signed-off-by: Dave Airlie <airlied@redhat.com>	2026-01-28 12:44:28 +10:00
Kent Russell	e0d11bdb29	drm/amdgpu: Send RMA CPER at bad page loading Some older builds weren't sending RMA CPERs when the bad page threshold was exceeded. Newer builds have resolved this, but there could be systems out there with bad page numbers higher than the threshold, that haven't sent out an RMA CPER. To be thorough and safe, send an RMA CPER when we load the table, if the threshold is met or exceeded, instead of waiting for the next UE to trigger the CPER. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-27 18:13:24 -05:00
Jesse.Zhang	91544c45fa	drm/amdgpu: Fix jpeg ring test order in vcn_v4_0_3 Fix the vcn reset sequence in vcn_v4_0_3_ring_reset() to restore JPEG power state and unlock the JPEG powergating mutex before running the JPEG ring post-reset helper. Fixes: `d25c67fd9d` ("drm/amdgpu/vcn4.0.3: rework reset handling") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-27 18:12:10 -05:00
Jon Doron	6ce8d536c8	drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and ih2 interrupt ring buffers are not initialized. This is by design, as these secondary IH rings are only available on discrete GPUs. See vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when AMD_IS_APU is set. However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to get the timestamp of the last interrupt entry. When retry faults are enabled on APUs (noretry=0), this function is called from the SVM page fault recovery path, resulting in a NULL pointer dereference when amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[]. The crash manifests as: BUG: kernel NULL pointer dereference, address: 0000000000000004 RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu] Call Trace: amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu] svm_range_restore_pages+0xae5/0x11c0 [amdgpu] amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu] gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu] amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu] amdgpu_ih_process+0x84/0x100 [amdgpu] This issue was exposed by commit `1446226d32` ("drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1") which changed the default for Renoir APU from noretry=1 to noretry=0, enabling retry fault handling and thus exercising the buggy code path. Fix this by adding a check for ih1.ring_size before attempting to use it. Also restore the soft_ih support from commit `dd29944165` ("drm/amdgpu: Rework retry fault removal"). This is needed if the hardware doesn't support secondary HW IH rings. v2: additional updates (Alex) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814 Fixes: `dd29944165` ("drm/amdgpu: Rework retry fault removal") Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Jon Doron <jond@wiz.io> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-27 18:09:04 -05:00
Tvrtko Ursulin	dda702172d	drm/amdgpu: Simplify sorting of the bo list Sort function only cares about the sign so we can replace the conditionals with a single subtraction. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-27 18:08:29 -05:00
Tvrtko Ursulin	f7e0678651	drm/amdgpu/mes: Remove idr leftovers v2 Commit `cb17fff3a2` ("drm/amdgpu/mes: remove unused functions") removed most of the code using these IDRs but forgot to remove the struct members and init/destroy paths. There is also interrupt handling code in SDMA 5.0 and 5.2 which appears to be using it, but is is unreachable since nothing ever allocates the relevant IDR. We replace those with one time warnings just to avoid any functional difference, but it is also possible they should be removed. v2: also fix up gfx_v12_1.c and sdma_v7_1.c Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> References: `cb17fff3a2` ("drm/amdgpu/mes: remove unused functions") Cc: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-27 18:02:06 -05:00
Alex Deucher	095ca81517	drm/amdgpu: fix type for wptr in ring backup Needs to be a u64. Fixes: `77cc0da39c` ("drm/amdgpu: track ring state associated with a fence") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `56fff1941a`)	2026-01-21 14:55:56 -05:00

1 2 3 4 5 ...

17012 Commits