drm/amd/amdgpu: Add independent hang detect work for user queue fence

In error scenarios (e.g., malformed commands), user queue fences may never
be signaled, causing processes to wait indefinitely. To address this while
preserving the requirement of infinite fence waits, implement an independent
timeout detection mechanism:

1. Initialize a hang detect work when creating a user queue (one-time setup)
2. Start the work with a queue-type-specific timeout (gfx/compute/sdma) when
   the last fence is created via amdgpu_userq_signal_ioctl (per-fence timing)
3. Trigger the queue reset logic if the timeout expires before the fence is
   signaled
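
The three steps above can be modeled in plain userspace C. This is only a
sketch of the control flow, not the amdgpu code: every type and function name
below is hypothetical, time is passed in explicitly instead of coming from a
delayed work, the timeout values are left to the caller, and re-arming on each
new fence is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; none of these exist in the driver. */
enum queue_type { QUEUE_GFX, QUEUE_COMPUTE, QUEUE_SDMA };

struct device_timeouts {        /* stands in for the per-type adev timeouts */
	uint64_t gfx_ms;
	uint64_t compute_ms;
	uint64_t sdma_ms;
};

struct fence {
	bool signaled;
};

struct user_queue {
	enum queue_type type;
	bool work_armed;                 /* is hang detection pending? */
	uint64_t deadline_ms;            /* absolute expiry time */
	struct fence *hang_detect_fence; /* fence being watched */
	unsigned int resets;             /* number of resets triggered */
};

/* Pick the timeout that matches the queue type. */
static uint64_t queue_timeout_ms(const struct device_timeouts *t,
				 enum queue_type type)
{
	switch (type) {
	case QUEUE_GFX:
		return t->gfx_ms;
	case QUEUE_COMPUTE:
		return t->compute_ms;
	case QUEUE_SDMA:
		return t->sdma_ms;
	}
	return t->gfx_ms; /* unreachable for valid enum values */
}

/* Step 1: one-time setup when the queue is created. */
static void queue_init(struct user_queue *q, enum queue_type type)
{
	q->type = type;
	q->work_armed = false;
	q->deadline_ms = 0;
	q->hang_detect_fence = NULL;
	q->resets = 0;
}

/*
 * Step 2: arm detection when the last fence is created. The deadline is
 * counted from fence creation, independent of whether anyone waits on it.
 */
static void queue_start_hang_detect(struct user_queue *q,
				    const struct device_timeouts *t,
				    struct fence *last, uint64_t now_ms)
{
	assert(last != NULL);
	q->hang_detect_fence = last;
	q->deadline_ms = now_ms + queue_timeout_ms(t, q->type);
	q->work_armed = true;
}

/*
 * Step 3: run at (or after) the deadline. Returns true if the watched
 * fence is still unsignaled and a reset was therefore triggered.
 */
static bool queue_hang_detect_tick(struct user_queue *q, uint64_t now_ms)
{
	struct fence *f;

	if (!q->work_armed || now_ms < q->deadline_ms)
		return false;

	f = q->hang_detect_fence;
	q->work_armed = false;
	q->hang_detect_fence = NULL;

	if (f && !f->signaled) {
		q->resets++; /* stand-in for the queue reset logic */
		return true;
	}
	return false;
}
```

In this model a fence that is signaled before the deadline makes the tick a
no-op, while an unsignaled fence leads to exactly one reset per armed
deadline.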

v2: Make the timeout per queue type (adev->gfx_timeout vs
    adev->compute_timeout vs adev->sdma_timeout) to be consistent with
    kernel queues. (Alex)
v3: Make the timeout detection independent of the fence: do not wait on the
    fence with a timeout, but start the timeout as soon as the fence is
    initialized. (Christian)
v4: Replace the timer with the `hang_detect_work` delayed work.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jesse.Zhang
2026-01-13 16:13:47 +08:00
committed by Alex Deucher
parent 5aaa5058de
commit fc3336be9c
3 changed files with 73 additions and 1 deletions

@@ -72,6 +72,8 @@ struct amdgpu_usermode_queue {
 	u32			xcp_id;
 	int			priority;
 	struct dentry		*debugfs_queue;
+	struct delayed_work	hang_detect_work;
+	struct dma_fence	*hang_detect_fence;
 	struct list_head	userq_va_list;
 };
@@ -147,6 +149,7 @@ int amdgpu_userq_start_sched_for_enforce_isolation(struct amdgpu_device *adev,
 void amdgpu_userq_reset_work(struct work_struct *work);
 void amdgpu_userq_pre_reset(struct amdgpu_device *adev);
 int amdgpu_userq_post_reset(struct amdgpu_device *adev, bool vram_lost);
+void amdgpu_userq_start_hang_detect_work(struct amdgpu_usermode_queue *queue);
 int amdgpu_userq_input_va_validate(struct amdgpu_device *adev,
 				   struct amdgpu_usermode_queue *queue,