linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-04 22:43:04 -04:00

Author	SHA1	Message	Date
Sunil Khatri	4eaf5d2c31	drm/amdgpu/userq: Fix the code alignment for readability Fix the code alignment for if condition and also provide a line space between multiline if condition and next statement. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-30 14:38:04 -04:00
Sunil Khatri	de95eda05f	drm/amdgpu/userq: amdgpu_userq_vm_validate does not need userq mutex amdgpu_userq_vm_validate function does not need userq_mutex and exec lock is good enough to locking all bos and updating the eviction fence. Also since we only need userq_mutex for amdgpu_userq_restore_all so move the locks in the function itself. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-30 14:31:55 -04:00
Junrui Luo	de1ef4ffd7	drm/amdgpu: validate doorbell_offset in user queue creation amdgpu_userq_get_doorbell_index() passes the user-provided doorbell_offset to amdgpu_doorbell_index_on_bar() without bounds checking. An arbitrarily large doorbell_offset can cause the calculated doorbell index to fall outside the allocated doorbell BO, potentially corrupting kernel doorbell space. Validate that doorbell_offset falls within the doorbell BO before computing the BAR index, using u64 arithmetic to prevent overflow. Fixes: `f09c1e6077` ("drm/amdgpu: generate doorbell index for userqueue") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-30 14:30:55 -04:00
Sunil Khatri	a3057aa926	drm/amdgpu/userq: schedule_delayed_work should be after fence signalled Reorganise the amdgpu_eviction_fence_suspend_worker code so schedule_delayed_work is the last thing we do after amdgpu_userq_evict is complete and the eviction fence is signalled. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-24 13:35:23 -04:00
Sunil Khatri	473527e70e	drm/amdgpu/userq: dont use goto to jump when at end of function In function amdgpu_userq_restore_worker we dont need to use goto as we already in the end of function and it will exit naturally. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-24 13:32:37 -04:00
Sunil Khatri	8f402ddd4f	drm/amdgpu/userq: cleanup amdgpu_userq_get/put where not needed amdgpu_userq_put/get are not needed in case we already holding the userq_mutex and reference is valid already from queue create time or from signal ioctl. These additional get/put could be a potential reason for deadlock in case the ref count reaches zero and destroy is called which again try to take the userq_mutex. Due to the above change we avoid deadlock between suspend/restore calling destroy queues trying to take userq_mutex again. Cc: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-23 14:17:31 -04:00
Srinivasan Shanmugam	9a62a097a7	drm/amdgpu: Drop redundant queue NULL check in hang detect worker amdgpu_userq_hang_detect_work() retrieves the queue pointer using container_of() from the embedded work item. Since the work structure is part of struct amdgpu_usermode_queue, the returned queue pointer cannot be NULL in normal execution. Remove the redundant !queue check and keep the validation for queue->userq_mgr. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c:159 amdgpu_userq_hang_detect_work() warn: can 'queue' even be NULL? Fixes: `290f46cf57` ("drm/amdgpu: Implement user queue reset functionality") Cc: Jesse Zhang <Jesse.Zhang@amd.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Acked-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 17:46:44 -04:00
Christian König	99f30a0607	drm/amdgpu: fix eviction fence and userq manager shutdown That is a really complicated dance and wasn't implemented fully correct. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 17:46:21 -04:00
Christian König	2cd7284ba5	drm/amdgpu: completely rework eviction fence handling v2 Well that was broken on multiple levels. First of all a lot of checks were placed at incorrect locations, especially if the resume worker should run or not. Then a bunch of code was just mid-layering because of incorrect assignment who should do what. And finally comments explaining what happens instead of why. Just re-write it from scratch, that should at least fix some of the hangs we are seeing. Use RCU for the eviction fence pointer in the manager, the spinlock usage was mostly incorrect as well. Then finally remove all the nonsense checks and actually add them in the correct locations. v2: some typo fixes and cleanups suggested by Sunil Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 17:46:13 -04:00
Sunil Khatri	f802f7b0bc	drm/amdgpu/userq: unlock cancel_delayed_work_sync for hang_detect_work cancel_delayed_work_sync() for hang_detect_work should not be called with the mutex held, since amdgpu_userq_hang_detect_work also needs the same mutex and the two can deadlock when they run together. We do not need to hold the mutex for cancel_delayed_work_sync(&queue->hang_detect_work). With this in place, if the cancel and the worker thread run at the same time they will not deadlock. If a failure triggers hang detection and reset, there is a deadlock scenario between the cancel and the main thread. [ 243.118276] task:kworker/9:0 state:D stack:0 pid:73 tgid:73 ppid:2 task_flags:0x4208060 flags:0x00080000 [ 243.118283] Workqueue: events amdgpu_userq_hang_detect_work [amdgpu] [ 243.118636] Call Trace: [ 243.118639] <TASK> [ 243.118644] __schedule+0x581/0x1810 [ 243.118649] ? srso_return_thunk+0x5/0x5f [ 243.118656] ? srso_return_thunk+0x5/0x5f [ 243.118659] ? wake_up_process+0x15/0x20 [ 243.118665] schedule+0x64/0xe0 [ 243.118668] schedule_preempt_disabled+0x15/0x30 [ 243.118671] __mutex_lock+0x346/0x950 [ 243.118677] __mutex_lock_slowpath+0x13/0x20 [ 243.118681] mutex_lock+0x2c/0x40 [ 243.118684] amdgpu_userq_hang_detect_work+0x63/0x90 [amdgpu] [ 243.118888] process_scheduled_works+0x1f0/0x450 [ 243.118894] worker_thread+0x27f/0x370 [ 243.118899] kthread+0x1ed/0x210 [ 243.118903] ? __pfx_worker_thread+0x10/0x10 [ 243.118906] ? srso_return_thunk+0x5/0x5f [ 243.118909] ? __pfx_kthread+0x10/0x10 [ 243.118913] ret_from_fork+0x10f/0x1b0 [ 243.118916] ? __pfx_kthread+0x10/0x10 [ 243.118920] ret_from_fork_asm+0x1a/0x30 Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 10:42:39 -04:00
Christian König	98dc529a27	drm/amdgpu: fix amdgpu_userq_evict Canceling the resume worker synchronously can deadlock because it can in turn wait for the eviction worker through the userq_mutex. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 10:33:39 -04:00
Sunil Khatri	3fd20c149e	Revert "drm/amdgpu: revert to old status lock handling v4" This reverts commit `7a9419ab42`. Reverting due to some of the probable issues caused by this change and CI is blocked. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 10:28:47 -04:00
Christian König	7a9419ab42	drm/amdgpu: revert to old status lock handling v4 It turned out that protecting the status of each bo_va with a spinlock was just hiding problems instead of solving them. Revert the whole approach, add a separate stats_lock and lockdep assertions that the correct reservation lock is held all over the place. This not only allows for better checks if a state transition is properly protected by a lock, but also switching back to using list macros to iterate over the state of lists protected by the dma_resv lock of the root PD. v2: re-add missing check v3: split into two patches v4: re-apply by fixing holding the VM lock at the right places. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-11 13:58:08 -04:00
Sunil Khatri	eb2e7f20c1	drm/amdgpu: push userq debugfs function in amdgpu_debugfs files Debugfs files for amdgpu are better to be handled in the dedicated amdgpu_debugfs.c/.h files. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:34:19 -05:00
Sunil Khatri	a07930e4db	drm/amdgpu/userq: declutter the code with goto Clean up the amdgpu_userq_create function clean up in failure condition using goto method. This avoid replication of cleanup for every failure condition. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:34:15 -05:00
Sunil Khatri	28cacaace5	drm/amdgpu/userq: defer queue publication until create completes The userq create path publishes queues to global xarrays such as userq_doorbell_xa and userq_xa before creation was fully complete. Later on if create queue fails, teardown could free an already visible queue, opening a UAF race with concurrent queue walkers. Also calling amdgpu_userq_put in such cases complicates the cleanup. Solution is to defer queue publication until create succeeds and no partially initialized queue is exposed. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:34:06 -05:00
Sunil Khatri	a978ed3d64	drm/amdgpu/userq: remove queue from doorbell xa during clean up If function amdgpu_userq_map_helper fails we do need to clean up and remove the queue from the userq_doorbell_xa. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:25:52 -05:00
Sunil Khatri	f0e46fd06c	drm/amdgpu/userq: remove queue from doorbell xarray In case of failure in xa_alloc, remove the queue during clean up from the userq_doorbell_xa. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:25:50 -05:00
Sunil Khatri	4952189b28	drm/amdgpu/userq: refcount userqueues to avoid any race conditions To avoid race conditions and UAF cases, implement kref based queues and protect the below operations using xa lock a. Getting a queue from xarray b. Increment/Decrement its refcount Every time someone wants to access a queue, always get it via amdgpu_userq_get to make sure we have locks in place and get the object if active. A userqueue is destroyed when the last refcount is dropped, which typically would be via IOCTL or during fini. v2: Add the missing drop in one of the conditions in the signal ioctl [Alex] v3: remove the queue from the xarray first in the free queue ioctl path [Christian] - Pass queue to the amdgpu_userq_put directly. - make amdgpu_userq_put xa_lock free since we are doing put for each get only and final put is done via destroy and we remove the queue from xa with lock. - use userq_put in fini too so cleanup is done fully. v4: Use xa_erase directly rather than doing load and erase in free ioctl. Also remove some of the error logs which could be exploited by the user to flood the logs [Christian] Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-04 11:50:56 -05:00
Sunil Khatri	2d60e9898a	drm/amdgpu/userq: change queue id type to u32 from int queue id always remain a positive value and should be of type unsigned. With this we also dont need to typecast the id to other types specially in xarray functions. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:41:19 -05:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Jesse.Zhang	8079b87c02	drm/amdgpu: validate user queue size constraints Add validation to ensure user queue sizes meet hardware requirements: - Size must be a power of two for efficient ring buffer wrapping - Size must be at least AMDGPU_GPU_PAGE_SIZE to prevent undersized allocations This prevents invalid configurations that could lead to GPU faults or unexpected behavior. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-29 12:26:15 -05:00
Jesse.Zhang	fc3336be9c	drm/amd/amdgpu: Add independent hang detect work for user queue fence In error scenarios (e.g., malformed commands), user queue fences may never be signaled, causing processes to wait indefinitely. To address this while preserving the requirement of infinite fence waits, implement an independent timeout detection mechanism: 1. Initialize a hang detect work when creating a user queue (one-time setup) 2. Start the work with queue-type-specific timeout (gfx/compute/sdma) when the last fence is created via amdgpu_userq_signal_ioctl (per-fence timing) 3. Trigger queue reset logic if the timer expires before the fence is signaled v2: make timeout per queue type (adev->gfx_timeout vs adev->compute_timeout vs adev->sdma_timeout) to be consistent with kernel queues. (Alex) v3: The timeout detection must be independent from the fence, e.g. you don't wait for a timeout on the fence but rather have the timeout start as soon as the fence is initialized. (Christian) v4: replace the timer with the `hang_detect_work` delayed work. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-20 17:16:12 -05:00
Alex Deucher	d967509651	drm/amdgpu: make sure userqs are enabled in userq IOCTLs These IOCTLs shouldn't be called when userqs are not enabled. Make sure they are enabled before executing the IOCTLs. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-10 14:21:52 -05:00
Saleemkhan Jamadar	1bc44dee26	drm/amdgpu: do not use amdgpu_bo_gpu_offset_no_check individually This should not be used individually, use amdgpu_bo_gpu_offset with bo reserved. v3 - unpin bo in queue destroy (Christian) v2 - pin bo so that offset returned won't change after unlock (Christian) Signed-off-by: Saleemkhan Jamadar <saleemkhan083@gmail.com> Suggested-by: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-12-16 13:27:13 -05:00
Lijo Lazar	7b0813f32a	drm/amdgpu: Rename userq_mgr_xa to userq_xa Rename since it is an xarray of userq pointers Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-12-08 13:56:39 -05:00
Lijo Lazar	dc21e39fd2	drm/amdgpu: Clean up userq helper functions Remove userq manager from function signatures. Get the associated manager from userq itself. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-12-08 13:56:39 -05:00
Lijo Lazar	473f12f820	drm/amdgpu: Change user queue interface signatures A userq is associated with its queue manager. Use that and make the userqueue interfaces to operate on queue. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-12-08 13:56:39 -05:00
Philip Yang	cf856ca9b9	drm/amdgpu: Update vm start, end, hole to support 57bit address Change gmc macro AMDGPU_GMC_HOLE_START/END/MASK to 57bit if vm root level is PDB3 for 5-level page tables. The macro access adev without passing adev as parameter is to minimize the code change to support 57bit, then we have to add adev variable in several places to use the macro. Because adev definition is not available in all amdgpu c files which include amdgpu_gmc.h, change inline function amdgpu_gmc_sign_extend to macro. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-12-08 13:56:30 -05:00
Jiapeng Chong	3b832487a9	drm/amdgpu/userqueue: Remove duplicate amdgpu_reset.h header ./drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c: amdgpu_reset.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=26930 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-11 21:54:17 -05:00
Sunil Khatri	cd6250f3ae	drm/amdgpu: validate the bo from done list for NULL Make sure the bo is valid before using it. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-04 11:53:21 -05:00
Jesse.Zhang	290f46cf57	drm/amdgpu: Implement user queue reset functionality This patch adds robust reset handling for user queues (userq) to improve recovery from queue failures. The key components include: 1. Queue detection and reset logic: - amdgpu_userq_detect_and_reset_queues() identifies failed queues - Per-IP detect_and_reset callbacks for targeted recovery - Falls back to full GPU reset when needed 2. Reset infrastructure: - Adds userq_reset_work workqueue for async reset handling - Implements pre/post reset handlers for queue state management - Integrates with existing GPU reset framework 3. Error handling improvements: - Enhanced state tracking with HUNG state - Automatic reset triggering on critical failures - VRAM loss handling during recovery 4. Integration points: - Added to device init/reset paths - Called during queue destroy, suspend, and isolation events - Handles both individual queue and full GPU resets The reset functionality works with both gfx/compute and sdma queues, providing better resilience against queue failures while minimizing disruption to unaffected queues. v2: add detection and reset calls when preemption/unmaped fails. add a per device userq counter for each user queue type.(Alex) v3: make sure we hold the adev->userq_mutex when we call amdgpu_userq_detect_and_reset_queues. (Alex) warn if the adev->userq_mutex is not held. v4: make sure we have all of the uqm->userq_mutex held. warn if the uqm->userq_mutex is not held. v5: Use array for user queue type counters.(Alex) all of the uqm->userq_mutex need to be held when calling detect and reset. (Alex) v6: fix lock dep warning in amdgpu_userq_fence_dence_driver_process v7: add the queue types in an array and use a loop in amdgpu_userq_detect_and_reset_queues (Lijo) v8: remove atomic_set(&userq_mgr->userq_count[i], 0). it should already be 0 since we kzalloc the structure (Alex) v9: For consistency with kernel queues, We may want something like: amdgpu_userq_is_reset_type_supported (Alex) Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-04 11:53:05 -05:00
Sakari Ailus	ef4a4b8781	drm/amd: Remove redundant pm_runtime_mark_last_busy() calls pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(), pm_runtime_autosuspend() and pm_request_autosuspend() now include a call to pm_runtime_mark_last_busy(). Remove the now-redundant explicit call to pm_runtime_mark_last_busy(). Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-28 11:31:45 -04:00
Jesse.Zhang	f18719ef4b	drm/amdgpu: Convert amdgpu userqueue management from IDR to XArray This commit refactors the AMDGPU userqueue management subsystem to replace IDR (ID Allocation) with XArray for improved performance, scalability, and maintainability. The changes address several issues with the previous IDR implementation and provide better locking semantics. Key changes: 1. Global XArray Introduction: - Added `userq_doorbell_xa` to `struct amdgpu_device` for global queue tracking - Uses doorbell_index as key for efficient global lookup - Replaces the previous `userq_mgr_list` linked list approach 2. Per-process XArray Conversion: - Replaced `userq_idr` with `userq_mgr_xa` in `struct amdgpu_userq_mgr` - Maintains per-process queue tracking with queue_id as key - Uses XA_FLAGS_ALLOC for automatic ID allocation 3. Locking Improvements: - Removed global `userq_mutex` from `struct amdgpu_device` - Replaced with fine-grained XArray locking using XArray's internal spinlocks 4. Runtime Idle Check Optimization: - Updated `amdgpu_runtime_idle_check_userq()` to use xa_empty 5. Queue Management Functions: - Converted all IDR operations to equivalent XArray functions: - `idr_alloc()` → `xa_alloc()` - `idr_find()` → `xa_load()` - `idr_remove()` → `xa_erase()` - `idr_for_each()` → `xa_for_each()` Benefits: - Performance: XArray provides better scalability for large numbers of queues - Memory Efficiency: Reduced memory overhead compared to IDR - Thread Safety: Improved locking semantics with XArray's internal spinlocks v2: rename userq_global_xa/userq_xa to userq_doorbell_xa/userq_mgr_xa Remove xa_lock and use its own lock. v3: Set queue->userq_mgr = uq_mgr in amdgpu_userq_create() v4: use xa_store_irq (Christian) hold the read side of the reset lock while creating/destroying queues and the manager data structure. (Christian) Acked-by: Alex Deucher <alexander.deucher@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-28 09:59:22 -04:00
Dan Carpenter	6142aa0660	drm/amdgpu/userqueue: Fix use after free in amdgpu_userq_buffer_vas_list_cleanup() The amdgpu_userq_buffer_va_list_del() function frees "va_cursor" but it is dereferenced on the next line when we print the debug message. Print the debug message first and then free it. Fixes: `2a28f9665d` ("drm/amdgpu: track the userq bo va for its obj management") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-28 09:50:58 -04:00
Sunil Khatri	42f1487884	drm/amdgpu/userqueue: validate userptrs for userqueues userptrs could be changed by the user at any time and hence while locking all the bos before GPU start processing validate all the userptr bos. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:36 -04:00
Alex Deucher	16dc933a4f	drm/amdgpu/userq: drop VCN and VPE doorbell handling VCN and VPE userqs are not yet supported and this code is not correct. Userspace should provide the correct doorbell offset with in their doorbell page for the IP. Adjusting it here will not work as expected as userspace and the queue itself will have different offsets. We need to add a INFO IOCTL query to get the offset and range for each IP within the doorbell page to handle this properly. Cc: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com> Reviewed-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:35 -04:00
Mario Limonciello	7877934019	drm/amd: Fix error handling with multiple userq IDRs If multiple userq IDR are in use and there is an error handling one at suspend or resume it will be silently discarded. Switch the suspend/resume() code to use guards and return immediately. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Prike Liang	2e7ceac0ea	drm/amdgpu: validate userq va for GEM unmap When a user unmaps a userq VA, the driver must ensure the queue has no in-flight jobs. If there is pending work, the kernel should wait for the attached eviction (bookkeeping) fence to signal before deleting the mapping. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Prike Liang	89926812d3	drm/amdgpu: validate the queue va for resuming the queue It requires validating the userq VA whether is mapped before trying to resume the queue. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Prike Liang	873f44c327	drm/amdgpu: keeping waiting userq fence infinitely Keeping waiting the userq fence infinitely until hang detection, and then suspend the hang queue and set the fence error. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:34 -04:00
Prike Liang	2a28f9665d	drm/amdgpu: track the userq bo va for its obj management Track the userq obj for its life time, and reference and dereference the buffer flag at its creating and destroying period. Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:33 -04:00
Prike Liang	5cfa33fabf	drm/amdgpu: add userq object va track helpers Add the userq object virtual address list_add() helpers for tracking the userq obj va address usage. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:33 -04:00
Christian König	a107aeb6a2	drm/amdgpu: partially revert "revert to old status lock handling v3" The CI systems are pointing out list corruptions, so we still need to fix something here. Keep the asserts, but revert the lock changes for now. Fixes: `59e4405e9e` ("drm/amdgpu: revert to old status lock handling v3") Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-07 14:09:19 -04:00
Prike Liang	883bd89d00	drm/amdgpu/userq: assign an error code for invalid userq va It should return an error code if userq VA validation fails. Fixes: `9e46b8bb05` ("drm/amdgpu: validate userq buffer virtual address and size") Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:40:18 -04:00
Christian König	59e4405e9e	drm/amdgpu: revert to old status lock handling v3 It turned out that protecting the status of each bo_va with a spinlock was just hiding problems instead of solving them. Revert the whole approach, add a separate stats_lock and lockdep assertions that the correct reservation lock is held all over the place. This not only allows for better checks if a state transition is properly protected by a lock, but also switching back to using list macros to iterate over the state of lists protected by the dma_resv lock of the root PD. v2: re-add missing check v3: split into two patches Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-18 16:59:14 -04:00
Alex Deucher	846de1384a	drm/amdgpu/userq: Optimize S0ix handling In S0i3, GFX state is retained, so it's preferable to preempt queues rather than unmapping them as the overhead is lower. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: David Perry <david.perry@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-18 09:43:23 -04:00
Christian König	39203f5e6d	drm/amdgpu: fix userq VM validation v4 That was actually complete nonsense and not validating the BOs at all. The code just cleared all VM areas were it couldn't grab the lock for a BO. Try to fix this. Only compile tested at the moment. v2: fix fence slot reservation as well as pointed out by Sunil. also validate PDs, PTs, per VM BOs and update PDEs v3: grab the status_lock while working with the done list. v4: rename functions, add some comments, fix waiting for updates to complete. v4: rename amdgpu_vm_lock_done_list(), add some more comments Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-16 17:47:06 -04:00
Jesse.Zhang	bb1d7f157e	drm/amdgpu: Switch user queues to use preempt/restore for eviction This patch modifies the user queue management to use preempt/restore operations instead of full map/unmap for queue eviction scenarios where applicable. The changes include: 1. Introduces new helper functions: - amdgpu_userqueue_preempt_helper() - amdgpu_userqueue_restore_helper() 2. Updates queue state management to track PREEMPTED state 3. Modifies eviction handling to use preempt instead of unmap: - amdgpu_userq_evict_all() now uses preempt_helper - amdgpu_userq_restore_all() now uses restore_helper The preempt/restore approach provides better performance during queue eviction by avoiding the overhead of full queue teardown and setup. Full map/unmap operations are still used for initial setup/teardown and system suspend scenarios. v2: rename amdgpu_userqueue_restore_helper/amdgpu_userqueue_preempt_helper to amdgpu_userq_restore_helper/amdgpu_userq_preempt_helper for consistency. (Alex) v3: amdgpu_userq_stop_sched_for_enforce_isolation() and amdgpu_userq_start_sched_for_enforce_isolation() should use preempt and restore (Alex) Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-15 17:02:33 -04:00
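
Two of the commits above (8079b87c02 and de1ef4ffd7) describe input validation on the user queue create path: the queue size must be a power of two and at least AMDGPU_GPU_PAGE_SIZE, and doorbell_offset must fall inside the doorbell BO, checked with u64 arithmetic so the computation cannot overflow. The following is only a userspace sketch of those two checks, not the driver code. The names sketch_queue_size_valid, sketch_doorbell_offset_valid and SKETCH_GPU_PAGE_SIZE are hypothetical, the 4096-byte page size is an assumption, and the doorbell offset is assumed to count fixed-size doorbell slots.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stand-in for AMDGPU_GPU_PAGE_SIZE; 4096 is illustrative only. */
#define SKETCH_GPU_PAGE_SIZE 4096u

/*
 * Hypothetical helper: a queue size is accepted only if it is at least
 * one GPU page and a power of two (exactly one bit set).
 */
static inline bool sketch_queue_size_valid(uint64_t size)
{
	if (size < SKETCH_GPU_PAGE_SIZE)
		return false;
	return (size & (size - 1)) == 0;
}

/*
 * Hypothetical helper: doorbell_offset is assumed to index slots of
 * slot_bytes each inside a doorbell buffer object of bo_bytes. The
 * comparison is written as a division so that no multiplication of
 * user-controlled values can wrap around in 64-bit arithmetic.
 */
static inline bool sketch_doorbell_offset_valid(uint64_t doorbell_offset,
						uint64_t bo_bytes,
						uint64_t slot_bytes)
{
	if (slot_bytes == 0 || bo_bytes < slot_bytes)
		return false;
	/* Largest slot index whose full slot still lies inside the BO. */
	return doorbell_offset <= (bo_bytes - slot_bytes) / slot_bytes;
}
```

With a 4096-byte BO and 8-byte slots, the sketch accepts offsets 0 through 511 and rejects 512 and above. The real driver may express the bound differently.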


77 Commits