linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-26 18:42:25 -04:00

Author	SHA1	Message	Date
Lijo Lazar	1cbac73d1a	drm/amdgpu: Move reset-on-init sequence earlier Complete reset-on-init sequence before sysfs interfaces are created. Devices get properly initialized only after reset, and only then should sysfs interfaces be made available. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:35 -04:00
Lijo Lazar	9e2096baab	drm/amdgpu: Add amdgpu_discovery_info Add amdgpu_discovery_info structure to keep all discovery related information. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:35 -04:00
Lijo Lazar	aa6674f2da	drm/amdgpu: Reorganize sysfs ini/fini calls Aggregate sysfs ini/fini calls into separate functions. No functional change. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:35 -04:00
Alex Deucher	db36632ea5	drm/amdgpu: clean up and unify hw fence handling Decouple the amdgpu fence from the amdgpu_job structure. This lets us clean up the separate fence ops for the embedded fence and other fences. This also allows us to allocate the vm fence up front when we allocate the job. v2: Additional cleanup suggested by Christian v3: Additional cleanups suggested by Christian v4: Additional cleanups suggested by David and vm fence fix v5: cast seqno (David) Cc: David.Wu3@amd.com Cc: christian.koenig@amd.com Tested-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:35 -04:00
Alex Deucher	16dc933a4f	drm/amdgpu/userq: drop VCN and VPE doorbell handling VCN and VPE userqs are not yet supported and this code is not correct. Userspace should provide the correct doorbell offset within their doorbell page for the IP. Adjusting it here will not work as expected as userspace and the queue itself will have different offsets. We need to add an INFO IOCTL query to get the offset and range for each IP within the doorbell page to handle this properly. Cc: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com> Reviewed-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:35 -04:00
Mario Limonciello	1f3cca7794	drm/amd: Pass userq suspend failures up to caller If a userq failed to suspend the rest of the suspend sequence may have problems. Pass the error code up to the caller for a decision on what to do. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Mario Limonciello	7877934019	drm/amd: Fix error handling with multiple userq IDRs If multiple userq IDR are in use and there is an error handling one at suspend or resume it will be silently discarded. Switch the suspend/resume() code to use guards and return immediately. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Mario Limonciello	173360fe49	drm/amd: Pass IP suspend errors up to callers If IP suspend fails the callers should be notified so that they can potentially react. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Mario Limonciello	b7ff2e7924	drm/amd: Don't always set IP block HW status to false amdgpu_device_ip_suspend_phase2() calls amdgpu_ip_block_suspend() which already sets HW block status to false when succeeding with IP suspend. Remove the explicit call in amdgpu_device_ip_suspend_phase2() so that the status is accurate. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Mario Limonciello	f35f254178	drm/amd: Remove comment about handling errors in amdgpu_device_ip_suspend_phase1() Error handling was introduced in commit `e095026f00` ("drm/amdgpu: validate suspend before function call") so the comment about TODO is no longer needed. Fixes: `e095026f00` ("drm/amdgpu: validate suspend before function call") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Mario Limonciello	6062ede680	drm/amd: Stop exporting amdgpu_device_ip_suspend() outside amdgpu_device amdgpu_device_ip_suspend() doesn't have a caller outside of amdgpu_device.c. Make it static. No intended functional changes. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Mario Limonciello	e6af507de8	drm/amd: Unify shutdown() callback behavior [Why] The shutdown() callback uses amdgpu_ip_suspend() which doesn't notify drm clients during shutdown. This could lead to hangs. [How] Change amdgpu_pci_shutdown() to call the same sequence as suspend/resume. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Prike Liang	2e7ceac0ea	drm/amdgpu: validate userq va for GEM unmap When a user unmaps a userq VA, the driver must ensure the queue has no in-flight jobs. If there is pending work, the kernel should wait for the attached eviction (bookkeeping) fence to signal before deleting the mapping. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Prike Liang	89926812d3	drm/amdgpu: validate the queue va for resuming the queue It requires validating the userq VA whether is mapped before trying to resume the queue. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Prike Liang	873f44c327	drm/amdgpu: keeping waiting userq fence infinitely Keeping waiting the userq fence infinitely until hang detection, and then suspend the hang queue and set the fence error. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Prike Liang	2a28f9665d	drm/amdgpu: track the userq bo va for its obj management Track the userq obj for its life time, and reference and dereference the buffer flag at its creating and destroying period. Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:33 -04:00
Prike Liang	5cfa33fabf	drm/amdgpu: add userq object va track helpers Add the userq object virtual address list_add() helpers for tracking the userq obj va address usage. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:33 -04:00
Christian König	1bea57ea75	drm/amdgpu: reduce queue timeout to 2 seconds v2 There have been multiple complaints that 10 seconds is usually too long. The original requirement for a longer timeout came from compute tests on AMDVLK; since that is no longer a topic, reduce the timeout back to 2 seconds for all queues. While at it also remove any special handling for compute queues under SRIOV or pass through. v2: fix checkpatch warning. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:33 -04:00
Alex Deucher	fb3c34ef49	drm/amdgpu/mes: adjust the VMID masks The firmware limits the max vmid, but align the settings with the hw limits as well just to be safe. Reviewed-by: Shaoyun liu <Shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:33 -04:00
Lijo Lazar	4d0858d4eb	drm/amdgpu: Skip SDMA suspend during mode-2 reset For SDMA IP versions >= v4.4.2, firmware will take care of quiescing SDMA before mode-2 reset. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:33 -04:00
Pierre-Eric Pelloux-Prayer	c79cf5a7d9	drm/amdgpu: remove gart_window_lock usage from gmc v12 This lock was part of the SDMA workaround originally implemented in gmc_v10_0_flush_gpu_tlb (`a70cb2176f` moved it to amdgpu_gmc_flush_gpu_tlb). This means this lock is useless and can be safely dropped. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:33 -04:00
Pierre-Eric Pelloux-Prayer	0c1beb7cbf	drm/amdgpu: make non-NULL out fence mandatory amdgpu_ttm_copy_mem_to_mem has a single caller, make sure the out fence is non-NULL to simplify the code. Since none of the pointers should be NULL, we can enable __attribute__((nonnull)). While at it make the function static since it's only used from amdgpu_ttm.c. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:33 -04:00
Lijo Lazar	17f5bd9551	drm/amdgpu: Remove redundant return value gfx_v9_4_3_xcc_kcq_init_queue doesn't have a fail condition. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:33 -04:00
Prike Liang	4944e470a2	drm/amdgpu/userq: extend userq state Extend the userq state for identifying the userq invalid cases. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:32 -04:00
Jonathan Kim	277bb0f83e	drm/amdgpu: enable suspend/resume all for gfx 12 Suspend/resume all gangs has been available for GFX12 for a while now so enable it. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:28 -04:00
Jonathan Kim	0ef930e1fa	drm/amdgpu: fix hung reset queue array memory allocation By design the MES will return an array result that is twice the number of hung doorbells it can report. i.e. if up to k reported doorbells are supported, then the second half of the array, also of length k, holds the HQD information (type/queue/pipe) where queue 1 corresponds to index 0 and k, queue 2 corresponds to index 1 and k + 1 etc ... The driver will use the HQD info to target queue/pipe reset for hardware scheduled user compute queues. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:28 -04:00
Jonathan Kim	8745ca5efb	drm/amdgpu: fix initialization of doorbell array for detect and hang Initialized doorbells should be set to invalid rather than 0 to prevent driver from over counting hung doorbells since it checks against the invalid value to begin with. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:28 -04:00
Sunil Khatri	b1dd0db1c6	drm/amdgpu: clean up amdgpu hmm range functions Clean up the amdgpu hmm range functions for clearer definition of each. a. Split amdgpu_ttm_tt_get_user_pages_done into two: 1. amdgpu_hmm_range_valid: To check if the user pages are valid and update seq num 2. amdgpu_hmm_range_free: Clean up the hmm range and pfn memory. b. amdgpu_ttm_tt_get_user_pages_done and amdgpu_ttm_tt_discard_user_pages are similar function so remove discard and directly use amdgpu_hmm_range_free to clean up the hmm range and pfn memory. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:28 -04:00
Sunil Khatri	e095b55155	drm/amdgpu: use user provided hmm_range buffer in amdgpu_ttm_tt_get_user_pages update the amdgpu_ttm_tt_get_user_pages and all dependent function along with it callers to use a user allocated hmm_range buffer instead hmm layer allocates the buffer. This is a need to get hmm_range pointers easily accessible without accessing the bo and that is a requirement for the userqueue to lock the userptrs effectively. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:28 -04:00
Jonathan Kim	d0de79f66a	drm/amdgpu: fix gfx12 mes packet status return check GFX12 MES uses low 32 bits of status return for success (1 or 0) and high bits for debug information if low bits are 0. GFX11 MES doesn't do this so checking full 64-bit status return for 1 or 0 is still valid. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org	2025-10-13 14:14:16 -04:00
Jesse.Zhang	883f309add	drm/amdgpu: Fix NULL pointer dereference in VRAM logic for APU devices Previously, APU platforms (and other scenarios with uninitialized VRAM managers) triggered a NULL pointer dereference in `ttm_resource_manager_usage()`. The root cause is not that the `struct ttm_resource_manager man` pointer itself is NULL, but that `man->bdev` (the backing device pointer within the manager) remains uninitialized (NULL) on APUs—since APUs lack dedicated VRAM and do not fully set up VRAM manager structures. When `ttm_resource_manager_usage()` attempts to acquire `man->bdev->lru_lock`, it dereferences the NULL `man->bdev`, leading to a kernel OOPS. 1. amdgpu_cs.c: Extend the existing bandwidth control check in `amdgpu_cs_get_threshold_for_moves()` to include a check for `ttm_resource_manager_used()`. If the manager is not used (uninitialized `bdev`), return 0 for migration thresholds immediately—skipping VRAM-specific logic that would trigger the NULL dereference. 2. amdgpu_kms.c: Update the `AMDGPU_INFO_VRAM_USAGE` ioctl and memory info reporting to use a conditional: if the manager is used, return the real VRAM usage; otherwise, return 0. This avoids accessing `man->bdev` when it is NULL. 3. amdgpu_virt.c*: Modify the vf2pf (virtual function to physical function) data write path. Use `ttm_resource_manager_used()` to check validity: if the manager is usable, calculate `fb_usage` from VRAM usage; otherwise, set `fb_usage` to 0 (APUs have no discrete framebuffer to report). This approach is more robust than APU-specific checks because it: - Works for all scenarios where the VRAM manager is uninitialized (not just APUs), - Aligns with TTM's design by using its native helper function, - Preserves correct behavior for discrete GPUs (which have fully initialized `man->bdev` and pass the `ttm_resource_manager_used()` check). 
v4: use ttm_resource_manager_used(&adev->mman.vram_mgr.manager) instead of checking the adev->gmc.is_app_apu flag (Christian) Reviewed-by: Christian König <christian.koenig@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:15 -04:00
Christian König	33cc891b56	drm/amdgpu: hide VRAM sysfs attributes on GPUs without VRAM Otherwise accessing them can cause a crash. Signed-off-by: Christian König <christian.koenig@amd.com> Tested-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:15 -04:00
Sathishkumar S	74de0eaa00	drm/amdgpu: fix bit shift logic BIT_ULL(n) sets nth bit, remove explicit shift and set the position Fixes: `a7a411e246` ("drm/amdgpu: fix shift-out-of-bounds in amdgpu_debugfs_jpeg_sched_mask_set") Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:15 -04:00
Gui-Dong Han	6df8e84aa6	drm/amdgpu: use atomic functions with memory barriers for vm fault info The atomic variable vm_fault_info_updated is used to synchronize access to adev->gmc.vm_fault_info between the interrupt handler and get_vm_fault_info(). The default atomic functions like atomic_set() and atomic_read() do not provide memory barriers. This allows for CPU instruction reordering, meaning the memory accesses to vm_fault_info and the vm_fault_info_updated flag are not guaranteed to occur in the intended order. This creates a race condition that can lead to inconsistent or stale data being used. The previous implementation, which used an explicit mb(), was incomplete and inefficient. It failed to account for all potential CPU reorderings, such as the access of vm_fault_info being reordered before the atomic_read of the flag. This approach is also more verbose and less performant than using the proper atomic functions with acquire/release semantics. Fix this by switching to atomic_set_release() and atomic_read_acquire(). These functions provide the necessary acquire and release semantics, which act as memory barriers to ensure the correct order of operations. It is also more efficient and idiomatic than using explicit full memory barriers. Fixes: `b97dfa27ef` ("drm/amdgpu: save vm fault information for amdkfd") Cc: stable@vger.kernel.org Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:15 -04:00
Alex Deucher	ff780f4f80	drm/amdgpu: set an error on all fences from a bad context When we backup ring contents to reemit after a queue reset, we don't backup ring contents from the bad context. When we signal the fences, we should set an error on those fences as well. v2: misc cleanups v3: add locking for fence error, fix comment (Christian) v4: fix wrap around, locking (Christian) Fixes: `77cc0da39c` ("drm/amdgpu: track ring state associated with a fence") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:15 -04:00
Alex Deucher	1f22fcb88b	drm/amdgpu: handle wrap around in reemit handling Compare the sequence numbers directly. Fixes: `77cc0da39c` ("drm/amdgpu: track ring state associated with a fence") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:15 -04:00
Alex Deucher	357d90be2c	drm/amdgpu: fix handling of harvesting for ip_discovery firmware Chips which use the IP discovery firmware loaded by the driver reported incorrect harvesting information in the ip discovery table in sysfs because the driver only uses the ip discovery firmware for populating sysfs and not for direct parsing for the driver itself as such, the fields that are used to print the harvesting info in sysfs report incorrect data for some IPs. Populate the relevant fields for this case as well. Fixes: `514678da56` ("drm/amdgpu/discovery: fix fw based ip discovery") Acked-by: Tom St Denis <tom.stdenis@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:15 -04:00
Christian König	8f74c70be5	drm/amdgpu: block CE CS if not explicitely allowed by module option The Constant Engine found on gfx6-gfx10 HW has been a notorious source of problems. RADV never used it in the first place, radeonsi only used it for a few releases around 2017 for gfx6-gfx9 before dropping support for it as well. While investigating another problem I just recently found that submitting to the CE seems to be completely broken on gfx9 for quite a while. Since nobody complained about that problem it most likely means that nobody is using any of the affected radeonsi versions on current Linux kernels any more. So to potentially phase out the support for the CE and eliminate another source of problems block submitting CE IBs unless it is enabled again using a debug flag. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:14 -04:00
Christian König	5d55ed19d4	drm/amdgpu: remove two invalid BUG_ON()s Those can be triggered trivially by userspace. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:14 -04:00
Timur Kristóf	7bdd91abf0	drm/amd: Disable ASPM on SI Enabling ASPM causes random hangs on Tahiti and Oland on Zen4. It's unclear if this is a platform-specific or GPU-specific issue. Disable ASPM on SI for the time being. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:14 -04:00
Thomas Zimmermann	9b966ae422	Merge drm/drm-next into drm-misc-next Updating drm-misc-next to the state of v6.18-rc1. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>	2025-10-13 09:19:19 +02:00
Linus Torvalds	284fc30e66	Merge tag 'drm-next-2025-10-11-1' of https://gitlab.freedesktop.org/drm/kernel Pull more drm fixes from Dave Airlie: "Just the follow up fixes for rc1 from the next branch, amdgpu and xe mostly with a single v3d fix in there. amdgpu: - DC DCE6 fixes - GPU reset fixes - Secure diplay messaging cleanup - MES fix - GPUVM locking fixes - PMFW messaging cleanup - PCI US/DS switch handling fix - VCN queue reset fix - DC FPU handling fix - DCN 3.5 fix - DC mirroring fix amdkfd: - Fix kfd process ref leak - mmap write lock handling fix - Fix comments in IOCTL xe: - Fix build with clang 16 - Fix handling of invalid configfs syntax usage and spell out the expected syntax in the documentation - Do not try late bind firmware when running as VF since it shouldn't handle firmware loading - Fix idle assertion for local BOs - Fix uninitialized variable for late binding - Do not require perfmon_capable to expose free memory at page granularity. Handle it like other drm drivers do - Fix lock handling on suspend error path - Fix I2C controller resume after S3 v3d: - fix fence locking" * tag 'drm-next-2025-10-11-1' of https://gitlab.freedesktop.org/drm/kernel: (34 commits) drm/amd/display: Incorrect Mirror Cositing drm/amd/display: Enable Dynamic DTBCLK Switch drm/amdgpu: Report individual reset error drm/amdgpu: partially revert "revert to old status lock handling v3" drm/amd/display: Fix unsafe uses of kernel mode FPU drm/amd/pm: Disable VCN queue reset on SMU v13.0.6 due to regression drm/amdgpu: Fix general protection fault in amdgpu_vm_bo_reset_state_machine drm/amdgpu: Check swus/ds for switch state save drm/amdkfd: Fix two comments in kfd_ioctl.h drm/amd/pm: Avoid interface mismatch messaging drm/amdgpu: Merge amdgpu_vm_set_pasid into amdgpu_vm_init drm/amd/amdgpu: Fix the mes version that support inv_tlbs drm/amd: Check whether secure display TA loaded successfully drm/amdkfd: Fix mmap write lock not release drm/amdkfd: Fix kfd process ref leaking 
when userptr unmapping drm/amdgpu: Fix for GPU reset being blocked by KIQ I/O. drm/amd/display: Disable scaling on DCE6 for now drm/amd/display: Properly disable scaling on DCE6 drm/amd/display: Properly clear SCL__FILTER_CONTROL on DCE6 drm/amd/display: Add missing DCE6 SCL_HORZ_FILTER_INIT SRIs ...	2025-10-10 14:02:14 -07:00
Lijo Lazar	2e97663760	drm/amdgpu: Report individual reset error If reinitialization of one of the GPUs fails after reset, it logs failure on all subsequent GPUs even though they have resumed successfully. A sample log where only device at 0000:95:00.0 had a failure - amdgpu 0000:15:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:65:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:75:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:85:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:95:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:e5:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:f5:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:05:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:15:00.0: amdgpu: GPU reset end with ret = -5 To avoid confusion, report the error for each device separately and return the first error as the overall result. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-07 14:09:19 -04:00
Christian König	a107aeb6a2	drm/amdgpu: partially revert "revert to old status lock handling v3" The CI systems are pointing out list corruptions, so we still need to fix something here. Keep the asserts, but revert the lock changes for now. Fixes: `59e4405e9e` ("drm/amdgpu: revert to old status lock handling v3") Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-07 14:09:19 -04:00
Jesse.Zhang	8d557eab3a	drm/amdgpu: Fix general protection fault in amdgpu_vm_bo_reset_state_machine After GPU reset with VRAM loss, a general protection fault occurs during user queue restoration when accessing vm_bo->vm after spinlock release in amdgpu_vm_bo_reset_state_machine. The root cause is that vm_bo points to the last entry from the list_for_each_entry loop, but this becomes invalid after the spinlock is released. Accessing vm_bo->vm at this point leads to memory corruption. Crash log shows: [ 326.981811] Oops: general protection fault, probably for non-canonical address 0x4156415741e58ac8: 0000 [#1] SMP NOPTI [ 326.981820] CPU: 13 UID: 0 PID: 1035 Comm: kworker/13:3 Tainted: G E 6.16.0+ #25 PREEMPT(voluntary) [ 326.981826] Tainted: [E]=UNSIGNED_MODULE [ 326.981827] Hardware name: Gigabyte Technology Co., Ltd. X870E AORUS PRO ICE/X870E AORUS PRO ICE, BIOS F3i 12/19/2024 [ 326.981831] Workqueue: events amdgpu_userq_restore_worker [amdgpu] [ 326.981999] RIP: 0010:amdgpu_vm_assert_locked+0x16/0x70 [amdgpu] [ 326.982094] Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 45 48 8b 87 80 03 00 00 48 85 c0 74 40 <48> 8b b8 80 01 00 00 48 85 ff 74 3b 8b 05 0c b7 0e f0 85 c0 75 05 [ 326.982098] RSP: 0018:ffffaa91c2a6bc20 EFLAGS: 00010206 [ 326.982100] RAX: 4156415741e58948 RBX: ffff9e8f013e8330 RCX: 0000000000000000 [ 326.982102] RDX: 0000000000000005 RSI: 000000001d254e88 RDI: ffffffffc144814a [ 326.982104] RBP: ffffaa91c2a6bc68 R08: 0000004c21a25674 R09: 0000000000000001 [ 326.982106] R10: 0000000000000001 R11: dccaf3f2f82863fc R12: ffff9e8f013e8000 [ 326.982108] R13: ffff9e8f013e8000 R14: 0000000000000000 R15: ffff9e8f09980000 [ 326.982110] FS: 0000000000000000(0000) GS:ffff9e9e79995000(0000) knlGS:0000000000000000 [ 326.982112] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 326.982114] CR2: 000055ed6c9caa80 CR3: 0000000797060000 CR4: 0000000000750ef0 [ 326.982116] PKRU: 55555554 Reviewed-by: Christian König 
<christian.koenig@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-07 14:09:19 -04:00
Lijo Lazar	9b608fe948	drm/amdgpu: Check swus/ds for switch state save For saving switch state, check if the GPU is having SWUS/DS architecture. Otherwise, skip saving. Reported-by: Roman Elshin <roman.elshin@gmail.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4602 Fixes: `1dd2fa0e00` ("drm/amdgpu: Save and restore switch state") Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-07 14:09:19 -04:00
Jesse.Zhang	b809ca91a5	drm/amdgpu: Merge amdgpu_vm_set_pasid into amdgpu_vm_init As KFD no longer uses a separate PASID, the global amdgpu_vm_set_pasid() function is no longer necessary. Merge its functionality directly into amdgpu_vm_init() to simplify code flow and eliminate redundant locking. v2: remove superfluous check, adjust amdgpu_vm_fin and remove amdgpu_vm_set_pasid (Christian) v3: drop amdgpu_vm_assert_locked (Christian) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4614 Fixes: `59e4405e9e` ("drm/amdgpu: revert to old status lock handling v3") Reviewed-by: Christian König <christian.koenig@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-07 14:09:07 -04:00
Shaoyun Liu	8dbac5cf8b	drm/amd/amdgpu: Fix the mes version that support inv_tlbs MES version 0x83 is not stable to use the inv_tlbs API. Defer it to 0x84 version. Fixes: `85442bac84` ("drm/amd/amdgpu: Fix the mes version that support inv_tlbs") Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com> Reviewed-by: Michael Chen <michael.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-07 14:09:06 -04:00
Mario Limonciello	c760bcda83	drm/amd: Check whether secure display TA loaded successfully [Why] Not all renoir hardware supports secure display. If the TA is present but the feature isn't supported it will fail to load or send commands. This shows ERR messages to the user that make it seems like there is a problem. [How] Check the resp_status of the context to see if there was an error before trying to send any secure display commands. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1415 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-07 14:09:06 -04:00
Philip Yang	58e6fc2fb9	drm/amdkfd: Fix kfd process ref leaking when userptr unmapping kfd_lookup_process_by_pid hold the kfd process reference to ensure it doesn't get destroyed while sending the segfault event to user space. Calling kfd_lookup_process_by_pid as function parameter leaks the kfd process refcount and miss the NULL pointer check if app process is already destroyed. Fixes: `2d274bf709` ("amd/amdkfd: Trigger segfault for early userptr unmmapping") Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-07 14:09:06 -04:00
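
Illustration for commit 6df8e84aa6 (vm fault info barriers): the message describes publishing data behind a flag with release semantics and reading it with acquire semantics. The following is a minimal userspace sketch of that pattern using C11 `<stdatomic.h>` as a stand-in for the kernel's atomic_set_release() and atomic_read_acquire(). The names `fault_info`, `publish_fault` and `consume_fault` are hypothetical and are not taken from the driver.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical payload; the real driver structure is not shown in the log. */
struct fault_info {
    uint64_t addr;
    uint32_t status;
};

static struct fault_info g_info;   /* data protected by the flag */
static atomic_int g_updated;       /* 0 = empty, 1 = data is valid */

/* Writer side (the interrupt handler in the commit's description). */
bool publish_fault(uint64_t addr, uint32_t status)
{
    /* Do not overwrite data the reader has not consumed yet. */
    if (atomic_load_explicit(&g_updated, memory_order_acquire) != 0)
        return false;

    g_info.addr = addr;
    g_info.status = status;

    /* Release: the writes above become visible before the flag does. */
    atomic_store_explicit(&g_updated, 1, memory_order_release);
    return true;
}

/* Reader side. */
bool consume_fault(struct fault_info *out)
{
    /* Acquire: the copy below cannot be reordered before this load. */
    if (atomic_load_explicit(&g_updated, memory_order_acquire) != 1)
        return false;

    *out = g_info;

    /* Release: the copy completes before the slot is marked free. */
    atomic_store_explicit(&g_updated, 0, memory_order_release);
    return true;
}
```

In this sketch the flag is the only synchronization point, so each side pairs one acquire load with one release store instead of a full barrier.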


16673 Commits
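
Illustration for commit d0de79f66a (gfx12 mes packet status return check): per the message, GFX12 MES reports success in the low 32 bits of the status return and may place debug information in the high bits when the low bits are 0, while on GFX11 checking the full 64-bit value is still valid. A minimal sketch of such a check follows. The helper name `mes_status_ok` and the `is_gfx12` parameter are hypothetical, and treating any nonzero low word as success is an assumption; the actual driver code is not shown in this log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not the driver's function. */
bool mes_status_ok(uint64_t status, bool is_gfx12)
{
    if (is_gfx12) {
        /* Only the low 32 bits carry success; high bits may hold debug data. */
        return (uint32_t)(status & 0xffffffffu) != 0;
    }
    /* GFX11: the whole 64-bit value is the success indicator. */
    return status != 0;
}
```

For example, a GFX12 return of 0x0000000100000000 has a zero low word, so this sketch treats it as a failure even though the full 64-bit value is nonzero.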