linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-25 01:52:32 -04:00

Author	SHA1	Message	Date
Jon Doron	8b1ecc9377	drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and ih2 interrupt ring buffers are not initialized. This is by design, as these secondary IH rings are only available on discrete GPUs. See vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when AMD_IS_APU is set. However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to get the timestamp of the last interrupt entry. When retry faults are enabled on APUs (noretry=0), this function is called from the SVM page fault recovery path, resulting in a NULL pointer dereference when amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[]. The crash manifests as: BUG: kernel NULL pointer dereference, address: 0000000000000004 RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu] Call Trace: amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu] svm_range_restore_pages+0xae5/0x11c0 [amdgpu] amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu] gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu] amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu] amdgpu_ih_process+0x84/0x100 [amdgpu] This issue was exposed by commit `1446226d32` ("drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1") which changed the default for Renoir APU from noretry=1 to noretry=0, enabling retry fault handling and thus exercising the buggy code path. Fix this by adding a check for ih1.ring_size before attempting to use it. Also restore the soft_ih support from commit `dd29944165` ("drm/amdgpu: Rework retry fault removal"). This is needed if the hardware doesn't support secondary HW IH rings. v2: additional updates (Alex) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814 Fixes: `dd29944165` ("drm/amdgpu: Rework retry fault removal") Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Jon Doron <jond@wiz.io> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `6ce8d536c8`) Cc: stable@vger.kernel.org	2026-01-27 18:24:39 -05:00
Kent Russell	e0d11bdb29	drm/amdgpu: Send RMA CPER at bad page loading Some older builds weren't sending RMA CPERs when the bad page threshold was exceeded. Newer builds have resolved this, but there could be systems out there with bad page numbers higher than the threshold, that haven't sent out an RMA CPER. To be thorough and safe, send an RMA CPER when we load the table, if the threshold is met or exceeded, instead of waiting for the next UE to trigger the CPER. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-27 18:13:24 -05:00
Jesse.Zhang	91544c45fa	drm/amdgpu: Fix jpeg ring test order in vcn_v4_0_3 Fix the vcn reset sequence in vcn_v4_0_3_ring_reset() to restore JPEG power state and unlock the JPEG powergating mutex before running the JPEG ring post-reset helper. Fixes: `d25c67fd9d` ("drm/amdgpu/vcn4.0.3: rework reset handling") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-27 18:12:10 -05:00
Jon Doron	6ce8d536c8	drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and ih2 interrupt ring buffers are not initialized. This is by design, as these secondary IH rings are only available on discrete GPUs. See vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when AMD_IS_APU is set. However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to get the timestamp of the last interrupt entry. When retry faults are enabled on APUs (noretry=0), this function is called from the SVM page fault recovery path, resulting in a NULL pointer dereference when amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[]. The crash manifests as: BUG: kernel NULL pointer dereference, address: 0000000000000004 RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu] Call Trace: amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu] svm_range_restore_pages+0xae5/0x11c0 [amdgpu] amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu] gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu] amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu] amdgpu_ih_process+0x84/0x100 [amdgpu] This issue was exposed by commit `1446226d32` ("drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1") which changed the default for Renoir APU from noretry=1 to noretry=0, enabling retry fault handling and thus exercising the buggy code path. Fix this by adding a check for ih1.ring_size before attempting to use it. Also restore the soft_ih support from commit `dd29944165` ("drm/amdgpu: Rework retry fault removal"). This is needed if the hardware doesn't support secondary HW IH rings. v2: additional updates (Alex) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814 Fixes: `dd29944165` ("drm/amdgpu: Rework retry fault removal") Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Jon Doron <jond@wiz.io> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-27 18:09:04 -05:00
Tvrtko Ursulin	dda702172d	drm/amdgpu: Simplify sorting of the bo list Sort function only cares about the sign so we can replace the conditionals with a single subtraction. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-27 18:08:29 -05:00
Tvrtko Ursulin	f7e0678651	drm/amdgpu/mes: Remove idr leftovers v2 Commit `cb17fff3a2` ("drm/amdgpu/mes: remove unused functions") removed most of the code using these IDRs but forgot to remove the struct members and init/destroy paths. There is also interrupt handling code in SDMA 5.0 and 5.2 which appears to be using it, but is is unreachable since nothing ever allocates the relevant IDR. We replace those with one time warnings just to avoid any functional difference, but it is also possible they should be removed. v2: also fix up gfx_v12_1.c and sdma_v7_1.c Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> References: `cb17fff3a2` ("drm/amdgpu/mes: remove unused functions") Cc: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-27 18:02:06 -05:00
Alex Deucher	095ca81517	drm/amdgpu: fix type for wptr in ring backup Needs to be a u64. Fixes: `77cc0da39c` ("drm/amdgpu: track ring state associated with a fence") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `56fff1941a`)	2026-01-21 14:55:56 -05:00
Timur Kristóf	fd2ac113a5	drm/amdgpu: Fix validating flush_gpu_tlb_pasid() When a function holds a lock and we return without unlocking it, it deadlocks the kernel. We should always unlock before returning. This commit fixes suspend/resume on SI. Tested on two Tahiti GPUs: FirePro W9000 and R9 280X. Fixes: `f4db9913e4` ("drm/amdgpu: validate the flush_gpu_tlb_pasid()") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202601190121.z9C0uml5-lkp@intel.com/ Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `e3a6eff92b`)	2026-01-21 14:55:44 -05:00
Alex Deucher	ad7b68b043	drm/amdgpu: rename amdgpu_fence_driver_guilty_force_completion() The function no longer signals the fence so rename it to better match what it does. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-21 14:27:45 -05:00
Alex Deucher	56fff1941a	drm/amdgpu: fix type for wptr in ring backup Needs to be a u64. Fixes: `77cc0da39c` ("drm/amdgpu: track ring state associated with a fence") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-21 14:25:50 -05:00
Gangliang Xie	0028b86b52	drm/amdgpu: mark invalid records with U64_MAX set retired_page of invalid ras records to U64_MAX, and skip them when reading ras records Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-21 14:25:43 -05:00
Lijo Lazar	9d03d404f4	drm/amdgpu: Avoid excessive dmesg log KIQ access is not guaranteed to work reliably under all reset situations. Avoid flooding dmesg with HDP flush failure messages. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-21 14:25:31 -05:00
Timur Kristóf	e3a6eff92b	drm/amdgpu: Fix validating flush_gpu_tlb_pasid() When a function holds a lock and we return without unlocking it, it deadlocks the kernel. We should always unlock before returning. This commit fixes suspend/resume on SI. Tested on two Tahiti GPUs: FirePro W9000 and R9 280X. Fixes: `f4db9913e4` ("drm/amdgpu: validate the flush_gpu_tlb_pasid()") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202601190121.z9C0uml5-lkp@intel.com/ Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-21 14:24:27 -05:00
Jesse.Zhang	fab47d2db5	drm/amdgpu/vcn5.0.1: rework reset handling Resetting VCN resets the entire tile, including jpeg. When resetting the VCN, we need to ensure that JPEG data blocks are accessible and we also need to handle the JPEG queue. Add a helper function to restore the JPEG queue during the VCN reset. v2: split the jpeg helper in two, in the top helper we can stop the sched workqueues and attempt to wait for any outstanding fences. Then in the bottom helper, we can force completion, re-init the rings, and restart the sched workqueues (Alex) v3: merge patches 4 and 5 into one patch (Alex) Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-21 14:18:09 -05:00
Jesse.Zhang	d25c67fd9d	drm/amdgpu/vcn4.0.3: rework reset handling Resetting VCN resets the entire tile, including jpeg. When resetting the VCN, we need to ensure that JPEG data blocks are accessible and we also need to handle the JPEG queue. Add a helper function to restore the JPEG queue during the VCN reset. v2: split the jpeg helper in two, in the top helper we can stop the sched workqueues and attempt to wait for any outstanding fences. Then in the bottom helper, we can force completion, re-init the rings, and restart the sched workqueues (Alex) v3: merge patches 1 and 2 into one patch (Alex) Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-21 14:17:48 -05:00
Jesse.Zhang	de93bc3533	drm/amdgpu/vcn4.0.3: implement DPG pause mode handling for VCN 4.0.3 For MI projects, when Dynamic Power Gating (DPG) is enabled, VCN reset operations should be performed with DPG in pause mode. Otherwise, the hardware may perform undesirable reset operations Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-21 14:17:36 -05:00
Alex Deucher	82a401ceff	drm/amdgpu: fix error handling in ib_schedule() If fence emit fails, free the fence if necessary. Fixes: `db36632ea5` ("drm/amdgpu: clean up and unify hw fence handling") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `5eb680a060`)	2026-01-20 21:53:18 -05:00
Jiqian Chen	8e96b36d9b	drm/amdgpu: free hw_vm_fence when fail in amdgpu_job_alloc If drm_sched_job_init fails, hw_vm_fence is not freed currently, then cause memory leak. Fixes: `db36632ea5` ("drm/amdgpu: clean up and unify hw fence handling") Link: https://lore.kernel.org/amd-gfx/a5a828cb-0e4a-41f0-94c3-df31e5ddad52@amd.com/T/#t Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Amos Kong <kongjianjun@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `5d42ee457c`)	2026-01-20 21:50:03 -05:00
Likun Gao	1034325332	drm/amdgpu: remove frame cntl for gfx v12 Remove emit_frame_cntl function for gfx v12, which is not support. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `5aaa5058de`) Cc: stable@vger.kernel.org	2026-01-20 21:49:25 -05:00
Alex Deucher	e51b709a6d	drm/amdgpu: add new job ids Use this for gfx, sdma, vpe IB tests and kernel shaders. The end goal it to get rid of the direct IB submit without a job structure. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-20 17:25:53 -05:00
Alex Deucher	5eb680a060	drm/amdgpu: fix error handling in ib_schedule() If fence emit fails, free the fence if necessary. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-20 17:25:40 -05:00
Alex Deucher	e6048510c3	drm/amdgpu/jpeg4.0.3: remove redundant sr-iov check The per queue reset flag is only set when sr-iov is disabled so this check is not necessary as the function will never be called on sr-iov. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-20 17:17:48 -05:00
Perry Yuan	54f2fc76dd	drm/amdgpu: Improve IP discovery checksum failure logging Enhance the error logging in amdgpu_discovery_verify_checksum() to print the calculated checksum, the expected checksum, the data size. This extra context helps quickly identify if the issue is a data corruption, a partially read binary, or an invalid table header without requiring additional instrumentation. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-20 17:17:23 -05:00
Jiqian Chen	5d42ee457c	drm/amdgpu: free hw_vm_fence when fail in amdgpu_job_alloc If drm_sched_job_init fails, hw_vm_fence is not freed currently, then cause memory leak. Fixes: `db36632ea5` ("drm/amdgpu: clean up and unify hw fence handling") Link: https://lore.kernel.org/amd-gfx/a5a828cb-0e4a-41f0-94c3-df31e5ddad52@amd.com/T/#t Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Amos Kong <kongjianjun@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-20 17:16:37 -05:00
Jesse.Zhang	fc3336be9c	drm/amd/amdgpu: Add independent hang detect work for user queue fence In error scenarios (e.g., malformed commands), user queue fences may never be signaled, causing processes to wait indefinitely. To address this while preserving the requirement of infinite fence waits, implement an independent timeout detection mechanism: 1. Initialize a hang detect work when creating a user queue (one-time setup) 2. Start the work with queue-type-specific timeout (gfx/compute/sdma) when the last fence is created via amdgpu_userq_signal_ioctl (per-fence timing) 3. Trigger queue reset logic if the timer expires before the fence is signaled v2: make timeout per queue type (adev->gfx_timeout vs adev->compute_timeout vs adev->sdma_timeout) to be consistent with kernel queues. (Alex) v3: The timeout detection must be independent from the fence, e.g. you don't wait for a timeout on the fence but rather have the timeout start as soon as the fence is initialized. (Christian) v4: replace the timer with the `hang_detect_work` delayed work. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-20 17:16:12 -05:00
Likun Gao	5aaa5058de	drm/amdgpu: remove frame cntl for gfx v12 Remove emit_frame_cntl function for gfx v12, which is not support. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-20 17:15:59 -05:00
Philip Yang	d4a814f400	drm/amdkfd: Move gfx9.4.3 and gfx 9.5 MQD to HBM To reduce queue switch latency further, move MQD to VRAM domain, CP access MQD and control stack via FB aperture, this requires contiguous pages. After MQD is initialized, updated or restored, flush HDP to guarantee the data is written to HBM and GPU cache is invalidated, then CP will read the new MQD. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-20 17:15:46 -05:00
Dave Airlie	c098b1aa2f	Merge tag 'amd-drm-next-6.20-2026-01-16' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.20-2026-01-16: amdgpu: - SR-IOV fixes - Rework SMU mailbox handling - Drop MMIO_REMAP domain - UserQ fixes - MES cleanups - Panel Replay updates - HDMI fixes - Backlight fixes - SMU 14.x fixes - SMU 15 updates amdkfd: - Fix a memory leak - Fixes for systems with non-4K pages - LDS/Scratch cleanup - MES process eviction fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patch.msgid.link/20260116202609.23107-1-alexander.deucher@amd.com	2026-01-19 06:54:46 +10:00
Dave Airlie	83dc0ba275	Merge tag 'amd-drm-next-6.20-2026-01-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.20-2026-01-09: amdgpu: - GPUVM updates - Initial support for larger GPU address spaces - Initial SMUIO 15.x support - Documentation updates - Initial PSP 15.x support - Initial IH 7.1 support - Initial IH 6.1.1 support - SMU 13.0.12 updates - RAS updates - Initial MMHUB 3.4 support - Initial MMHUB 4.2 support - Initial GC 12.1 support - Initial GC 11.5.4 support - HDMI fixes - Panel replay improvements - DML updates - DC FP fixes - Initial SDMA 6.1.4 support - Initial SDMA 7.1 support - Userq updates - DC HPD refactor - SwSMU cleanups and refactoring - TTM memory ops parallelization - DCN 3.5 fixes - DP audio fixes - Clang fixes - Misc spelling fixes and cleanups - Initial SDMA 7.11.4 support - Convert legacy DRM logging helpers to new drm logging helpers - Initial JPEG 5.3 support - Add support for changing UMA size via the driver - DC analog fixes - GC 9 gfx queue reset support - Initial SMU 15.x support amdkfd: - Reserved SDMA rework - Refactor SPM - Initial GC 12.1 support - Initial GC 11.5.4 support - Initial SDMA 7.1 support - Initial SDMA 6.1.4 support - Increase the kfd process hash table - Per context support - Topology fixes radeon: - Convert legacy DRM logging helpers to new drm logging helpers - Use devm for i2c adapters - Variable sized array fix - Misc cleanups UAPI: - KFD context support. Proposed userspace: https://github.com/ROCm/rocm-systems/pull/1705 https://github.com/ROCm/rocm-systems/pull/1701 - Add userq metadata queries for more queue types. Proposed userspace: https://gitlab.freedesktop.org/yogeshmohan/mesa/-/commits/userq_query From: Alex Deucher <alexander.deucher@amd.com> Link: https://patch.msgid.link/20260109154713.3242957-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>	2026-01-15 14:49:33 +10:00
Ivan Lipski	d04f73668b	drm/amd/display: Add an hdmi_hpd_debounce_delay_ms module [Why&How] Right now, the HDMI HPD filter is enabled by default at 1500ms. We want to disable it by default, as most modern displays with HDMI do not require it for DPMS mode. The HPD can instead be enabled as a driver parameter with a custom delay value in ms (up to 5000ms). Fixes: `c918e75e1e` ("drm/amd/display: Add an HPD filter for HDMI") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4859 Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `6a681cd903`)	2026-01-14 15:07:43 -05:00
Srinivasan Shanmugam	b2426a211d	drm/amdgpu/userq: Fix fence reference leak on queue teardown v2 The user mode queue keeps a pointer to the most recent fence in userq->last_fence. This pointer holds an extra dma_fence reference. When the queue is destroyed, we free the fence driver and its xarray, but we forgot to drop the last_fence reference. Because of the missing dma_fence_put(), the last fence object can stay alive when the driver unloads. This leaves an allocated object in the amdgpu_userq_fence slab cache and triggers This is visible during driver unload as: BUG amdgpu_userq_fence: Objects remaining on __kmem_cache_shutdown() kmem_cache_destroy amdgpu_userq_fence: Slab cache still has objects Call Trace: kmem_cache_destroy amdgpu_userq_fence_slab_fini amdgpu_exit __do_sys_delete_module Fix this by putting userq->last_fence and clearing the pointer during amdgpu_userq_fence_driver_free(). This makes sure the fence reference is released and the slab cache is empty when the module exits. v2: Update to only release userq->last_fence with dma_fence_put() (Christian) Fixes: `edc762a51c` ("drm/amdgpu/userq: move some code around") Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `8e051e38a8`)	2026-01-14 15:07:29 -05:00
Prike Liang	808c2052f0	Revert "drm/amdgpu: don't attach the tlb fence for SI" This reverts commit `820b3d376e`. It’s better to validate VM TLB flushes in the flush‑TLB backend rather than in the generic VM layer. Reverting this patch depends on commit fa7c231fc2b0 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()") being present in the tree. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `9163fe4d79`)	2026-01-14 15:06:51 -05:00
Prike Liang	0bea77b13b	drm/amdgpu: validate the flush_gpu_tlb_pasid() Validate flush_gpu_tlb_pasid() availability before flushing tlb. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `f4db9913e4`)	2026-01-14 15:06:43 -05:00
Alex Deucher	b6dff005fc	drm/amdgpu: make sure userqs are enabled in userq IOCTLs These IOCTLs shouldn't be called when userqs are not enabled. Make sure they are enabled before executing the IOCTLs. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `d967509651`) Cc: stable@vger.kernel.org	2026-01-14 14:57:55 -05:00
Xiaogang Chen	122b15cdbc	drm/amdgpu: Use correct address to setup gart page table for vram access Use dst input parameter to setup gart page table entries instead of using fixed location. Fixes: `237d623ae6` ("drm/amdgpu/gart: Add helper to bind VRAM pages (v2)") Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `ca5d4db8db`)	2026-01-14 14:57:34 -05:00
Peter Colberg	9c81200152	Revert duplicate "drm/amdgpu: disable peer-to-peer access for DCC-enabled GC12 VRAM surfaces" This reverts commit `22a36e660d` once, which was merged twice due to an incorrect backmerge resolution. Fixes: `ce0478b02e` ("Merge tag 'v6.18-rc6' into drm-next") Signed-off-by: Peter Colberg <pcolberg@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `38a0f4cf8c`)	2026-01-14 14:51:36 -05:00
Mario Limonciello (AMD)	28695ca09d	drm/amd: Clean up kfd node on surprise disconnect When an eGPU is unplugged the KFD topology should also be destroyed for that GPU. This never happens because the fini_sw callbacks never get to run. Run them manually before calling amdgpu_device_ip_fini_early() when a device has already been disconnected. This location is intentionally chosen to make sure that the kfd locking refcount doesn't get incremented unintentionally. Cc: kent.russell@amd.com Closes: https://community.frame.work/t/amd-egpu-on-linux/8691/33 Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `6a23e7b433`) Cc: stable@vger.kernel.org	2026-01-14 14:51:36 -05:00
Lu Yao	9cb6278b44	drm/amdgpu: fix drm panic null pointer when driver not support atomic When driver not support atomic, fb using plane->fb rather than plane->state->fb. Fixes: `fe151ed7af` ("drm/amdgpu: add generic display panic helper code") Signed-off-by: Lu Yao <yaolu@kylinos.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `2f2a72de67`)	2026-01-14 14:51:36 -05:00
Philip Yang	292e5757b2	drm/amdgpu: Fix gfx9 update PTE mtype flag Fix copy&paste error, that should have been an assignment instead of an or, otherwise MTYPE_UC 0x3 can not be updated to MTYPE_RW 0x1. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `fc1366016a`) Cc: stable@vger.kernel.org	2026-01-14 14:51:36 -05:00
Ivan Lipski	6a681cd903	drm/amd/display: Add an hdmi_hpd_debounce_delay_ms module [Why&How] Right now, the HDMI HPD filter is enabled by default at 1500ms. We want to disable it by default, as most modern displays with HDMI do not require it for DPMS mode. The HPD can instead be enabled as a driver parameter with a custom delay value in ms (up to 5000ms). Fixes: `c918e75e1e` ("drm/amd/display: Add an HPD filter for HDMI") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4859 Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-14 14:28:59 -05:00
Philip Yang	0cba5b27f1	drm/amdkfd: Add domain parameter to alloc kernel BO To allocate kernel BO from VRAM domain for MQD in the following patch. No functional change because kernel BO allocate all from GTT domain. Rename amdgpu_amdkfd_alloc_gtt_mem to amdgpu_amdkfd_alloc_kernel_mem Rename amdgpu_amdkfd_free_gtt_mem to amdgpu_amdkfd_free_kernel_mem Rename mem_kfd_mem_obj gtt_mem to mem Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-14 14:28:59 -05:00
Srinivasan Shanmugam	8e051e38a8	drm/amdgpu/userq: Fix fence reference leak on queue teardown v2 The user mode queue keeps a pointer to the most recent fence in userq->last_fence. This pointer holds an extra dma_fence reference. When the queue is destroyed, we free the fence driver and its xarray, but we forgot to drop the last_fence reference. Because of the missing dma_fence_put(), the last fence object can stay alive when the driver unloads. This leaves an allocated object in the amdgpu_userq_fence slab cache and triggers This is visible during driver unload as: BUG amdgpu_userq_fence: Objects remaining on __kmem_cache_shutdown() kmem_cache_destroy amdgpu_userq_fence: Slab cache still has objects Call Trace: kmem_cache_destroy amdgpu_userq_fence_slab_fini amdgpu_exit __do_sys_delete_module Fix this by putting userq->last_fence and clearing the pointer during amdgpu_userq_fence_driver_free(). This makes sure the fence reference is released and the slab cache is empty when the module exits. v2: Update to only release userq->last_fence with dma_fence_put() (Christian) Fixes: `edc762a51c` ("drm/amdgpu/userq: move some code around") Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-14 14:28:59 -05:00
Prike Liang	9163fe4d79	Revert "drm/amdgpu: don't attach the tlb fence for SI" This reverts commit `820b3d376e`. It’s better to validate VM TLB flushes in the flush‑TLB backend rather than in the generic VM layer. Reverting this patch depends on commit fa7c231fc2b0 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()") being present in the tree. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-14 14:28:59 -05:00
Prike Liang	f4db9913e4	drm/amdgpu: validate the flush_gpu_tlb_pasid() Validate flush_gpu_tlb_pasid() availability before flushing tlb. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-14 14:28:59 -05:00
Xiaogang Chen	6cca686dfc	drm/amdkfd: kfd driver supports hot unplug/replug amdgpu devices This patch allows kfd driver function correctly when AMD gpu devices got unplug/replug at run time. When an AMD gpu device got unplug kfd driver gracefully terminates existing kfd processes after stops all queues by sending SIGBUS to user process. After that user space can still use remaining AMD gpu devices. When all AMD gpu devices at system got removed kfd driver will not response new requests. Unplugged AMD gpu devices can be re-plugged. kfd driver will use added devices to function as usual. The purpose of this patch is having kfd driver behavior as expected during and after AMD gpu devices unplug/replug at run time. Signed-off-by: Xiaogang Chen <Xiaogang.Chen@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-14 14:28:49 -05:00
Lang Yu	0f744593ad	drm/amdgpu/mes: Simplify hqd mask initialization "adev->mes.compute_hqd_mask[i] = adev->gfx.disable_kq ? 0xF" is actually incorrect for MEC with 8 queues per pipe. Let's get rid of version check and hardcode, calculate hqd mask with number of queues per pipe and number of gfx/compute queues kernel used. Currently, only MEC1 is used for both kernel/user compute queue. To enable other MEC, we need to redistribute queues per pipe and adjust queue resource shared with kfd that needs a separate patch. Just skip other MEC for now to avoid potential issues. v2: Force reserved queues to 0 if kernel queue is explicitly disabled. Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-10 14:21:52 -05:00
Srinivasan Shanmugam	efdc66fe12	drm/amdgpu: Refactor amdgpu_gem_va_ioctl for Handling Last Fence Update and Timeline Management v7 When GPU memory mappings are updated, the driver returns a fence so userspace knows when the update is finished. The previous refactor could pick the wrong fence or rely on checks that are not safe for GPU mappings that stay valid even when memory is missing. In some cases this could return an invalid fence or cause fence reference counting problems. Fix this by (v5,v6, per Christian): - Starting from the VM’s existing last update fence, so a valid and meaningful fence is always returned even when no new work is required. - Selecting the VM-level fence only for always-valid / PRT mappings using the required combined bo_va + bo guard. - Using the per-BO page table update fence for normal MAP and REPLACE operations. - For UNMAP and CLEAR, returning the fence provided by amdgpu_vm_clear_freed(), which may remain unchanged when nothing needs clearing. - Keeping fence reference counting balanced. v7: Drop the extra bo_va/bo NULL guard since amdgpu_vm_is_bo_always_valid() handles NULL BOs correctly (including PRT). (Christian) This makes VM timeline fences correct and prevents crashes caused by incorrect fence handling. Fixes: `bd8150a1b3` ("drm/amdgpu: Refactor amdgpu_gem_va_ioctl for Handling Last Fence Update and Timeline Management v4") Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-10 14:21:52 -05:00
Alex Deucher	d967509651	drm/amdgpu: make sure userqs are enabled in userq IOCTLs These IOCTLs shouldn't be called when userqs are not enabled. Make sure they are enabled before executing the IOCTLs. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-10 14:21:52 -05:00
Christophe JAILLET	9858810e62	drm/amdgpu: Slightly simplify base_addr_show() sysfs_emit_at() never returns a negative error code. It returns 0 or the number of characters written in the buffer. Remove the useless tests. This simplifies the logic and saves a few lines of code. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-10 14:21:52 -05:00
Christian König	96e97a562d	drm/amdgpu: Drop MMIO_REMAP domain bit and keep it Internal "AMDGPU_GEM_DOMAIN_MMIO_REMAP" - Never activated as UAPI and it turned out that this was to inflexible. Allocate the MMIO_REMAP buffer object as a regular GEM BO and explicitly move it into the fixed AMDGPU_PL_MMIO_REMAP placement at the TTM level. This avoids relying on GEM domain bits for MMIO_REMAP, keeps the placement purely internal, and makes the lifetime and pinning of the global MMIO_REMAP BO explicit. The BO is pinned in TTM so it cannot be migrated or evicted. The corresponding free path relies on normal DRM teardown ordering, where no further user ioctls can access the global BO once TTM teardown begins. v2 (Srini): - Updated patch title. - Drop use of AMDGPU_GEM_DOMAIN_MMIO_REMAP in amdgpu_ttm.c. The MMIO_REMAP domain bit is removed from UAPI, so keep the MMIO_REMAP BO allocation domain-less (bp.domain = 0) and rely on the TTM placement (AMDGPU_PL_MMIO_REMAP) for backing/pinning. - Keep fdinfo/mem-stats visibility for MMIO_REMAP by classifying BOs based on bo->tbo.resource->mem_type == AMDGPU_PL_MMIO_REMAP, since the domain bit is removed. v3: Squash patches #1 & #3 Fixes: `0561324837` ("drm/amdgpu/uapi: Introduce AMDGPU_GEM_DOMAIN_MMIO_REMAP") Fixes: `2a7a794eb8` ("drm/amdgpu/ttm: Allocate/Free 4K MMIO_REMAP Singleton") Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Leo Liu <leo.liu@amd.com> Cc: Ruijing Dong <ruijing.dong@amd.com> Cc: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-10 14:21:35 -05:00

1 2 3 4 5 ...

17069 Commits