linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-22 08:44:02 -04:00

Author	SHA1	Message	Date
Francois Dugast	3dc6da76ae	drm/xe/guc_submit: Make suspend_wait interruptible Rely on wait_event_interruptible_timeout() to put the process to sleep with TASK_INTERRUPTIBLE. It allows using this function in interruptible context. v2: Propagate error on wait_event_interruptible_timeout (Matt Brost) Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240809155156.1955925-3-francois.dugast@intel.com	2024-08-17 18:31:50 -07:00
Daniele Ceraolo Spurio	5a891a0e69	drm/xe/uc: Use devm to register cleanup that includes exec_queues Exec_queue cleanup requires HW access, so we need to use devm instead of drmm for it. Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240815230541.3828206-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-08-16 09:15:04 -07:00
Matthew Brost	d79fdaef2b	drm/xe: Allow suspend / resume to be safely called multiple times Switching modes between LR and dma-fence can result in multiple calls to suspend / resume. Make these calls safe while still enforcing call order. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240809191929.3138956-6-matthew.brost@intel.com	2024-08-09 19:09:33 -07:00
Matthew Brost	885c313825	drm/xe: Only enable scheduling upon resume if needed No need to enable scheduling in already enabled. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240809191929.3138956-5-matthew.brost@intel.com	2024-08-09 19:07:31 -07:00
Matthew Brost	1a394b4f50	drm/xe: Fix possible UAF in guc_exec_queue_process_msg Store xe_device ahead of processing message as message can be free'd in some cases. v2: - Including missing local changes v3: - Resend for CI Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202407231445.rpisd1vA-lkp@intel.com/ Fixes: `d930c19fdf` ("drm/xe: Build PM into GuC CT layer") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240724164341.1848954-1-matthew.brost@intel.com	2024-07-24 13:41:01 -07:00
Matthew Brost	8af13c3fc1	drm/xe: Store process name and pid in xe file An xe file can outlive the associated process as the GPU cleanup is just triggered upon file close (process kill) and completes sometime later. If the file close triggers error conditions (GPU hangs) the process cannot be safely referenced to retrieve the name and pid for debug information. Store the process name and pid directly in the xe file to be safe. v2: - Access file->pid via rcu_access_pointer (Matthew Auld) Fixes: `b10d0c5e9d` ("drm/xe: Add process name to devcoredump") Fixes: `f6ca930d97` ("drm/xe: Add process name and PID to job timedout message") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240723151045.1725417-1-matthew.brost@intel.com	2024-07-23 10:45:40 -07:00
Matthew Brost	d930c19fdf	drm/xe: Build PM into GuC CT layer Take PM ref when any G2H are outstanding, drop when none are outstanding. To safely ensure we have PM ref when in the GuC CT layer, a PM ref needs to be held when scheduler messages are pending too. v2: - Add outer PM protections to xe_file_close (CI) v3: - Only take PM ref 0->1 and drop on 1->0 (Matthew Auld) v4: - Add assert to G2H increment function v5: - Rebase v6: - Declare xe as local variable in xe_file_close (CI) Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240719172905.1527927-5-matthew.brost@intel.com	2024-07-19 19:45:34 -07:00
Matthew Brost	7dbe8af13c	drm/xe: Wedge the entire device Wedge the entire device, not just GT which may have triggered the wedge. To implement this, cleanup the layering so xe_device_declare_wedged() calls into the lower layers (GT) to ensure entire device is wedged. While we are here, also signal any pending GT TLB invalidations upon wedging device. Lastly, short circuit reset wait if device is wedged. v2: - Short circuit reset wait if device is wedged (Local testing) Fixes: `8ed9aaae39` ("drm/xe: Force wedged state and block GT reset upon any GPU hang") Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240716063902.1390130-1-matthew.brost@intel.com	2024-07-17 11:58:26 -07:00
José Roberto de Souza	f6ca930d97	drm/xe: Add process name and PID to job timedout message This will be very helpful for Mesa CI, where it uses PID to match the exacly test that cause timedout/GPU hang and mark that test as failing. Also printing the process name as it might be relavant for human readers. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240710213149.57662-1-jose.souza@intel.com	2024-07-11 13:44:15 -07:00
Matthew Brost	627c961d67	drm/xe: Add timeout to preempt fences To adhere to dma fencing rules that fences must signal within a reasonable amount of time, add a 5 second timeout to preempt fences. If this timeout occurs, kill the associated VM as this fatal to the VM. v2: - Add comment for smp_wmb (Checkpatch) - Fix kernel doc typo (Inspection) - Add comment for killed check (Niranjana) v3: - Drop smp_wmb (Matthew Auld) - Don't take vm->lock in preempt fence worker (Matthew Auld) - Drop RB given changes to patch v4: - Add WRITE/READ_ONCE (Niranjana) - Don't export xe_vm_kill (Niranjana) Cc: Matthew Auld <matthew.auld@intel.com> Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Tested-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240626004137.4060806-1-matthew.brost@intel.com	2024-07-03 15:27:50 -07:00
Matthew Brost	0d39640ace	drm/xe: Invert runnable_state / pending enable check and assert Rather than checking for pending enable and asserting runnable_state == 1 in sched done handler, invert these. This is more robust code taking action based on the G2H message and asserting KMD tracking state is correct. Suggested-by: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240614061343.2931649-1-matthew.brost@intel.com	2024-06-20 15:33:14 -07:00
Matthew Brost	7ddb9403dd	drm/xe: Sample ctx timestamp to determine if jobs have timed out In GuC TDR sample ctx timestamp to determine if jobs have timed out. The scheduling enable needs to be toggled to properly sample the timestamp. If a job has not been running for longer than the timeout period, re-enable scheduling and restart the TDR. v2: - Use GT clock to msec helper (Umesh, off list) - s/ctx_timestamp_job/ctx_job_timestamp v3: - Fix state machine for TDR, mainly decouple sched disable and deregister (testing) - Rebase (CI) v4: - Fix checkpatch && newline issue (CI) - Do not deregister on wedged or unregistered (CI) - Fix refcounting bugs (CI) - Move devcoredump above VM / kernel job check (John H) - Add comment for check_timeout state usage (John H) - Assert pending disable not inflight when enabling scheduling (John H) - Use enable_scheduling in other scheduling enable code (John H) - Add comments on a few steps in TDR (John H) - Add assert for timestamp overflow protection (John H) v6: - Use mul_u64_u32_div (CI, checkpath) - Change check time to dbg level (Paulo) - Add immediate mode to sched disable (inspection) - Use xe_gt_* messages (John H) - Fix typo in comment (John H) - Check timeout before clearing pending disable (Paulo) v7: - Fix ADJUST_FIVE_PERCENT macro (checkpatch) - Don't print sched disable failure message on GT reset (John H) - Move kernel / VM jobs WARNs near comment (John H) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-12-matthew.brost@intel.com	2024-06-12 19:14:10 -07:00
Matthew Brost	b47b83ef16	drm/xe: Add killed, banned, or wedged as stick bit during GuC reset These bits should be persistent across reset, treat them as such. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-11-matthew.brost@intel.com	2024-06-12 19:10:26 -07:00
Matthew Brost	fc592a81ff	drm/xe: Add pending disable assert to handle_sched_done Will help catch bugs in GuC state machine. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-10-matthew.brost@intel.com	2024-06-12 19:10:25 -07:00
Matthew Brost	716ce587a8	drm/xe: Add GuC state asserts to deregister_exec_queue Will help catch bugs in GuC state machine. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-9-matthew.brost@intel.com	2024-06-12 19:10:25 -07:00
Matthew Brost	7f4f492c70	drm/xe: Assert runnable state in handle_sched_done Ensure G2H and KMD GuC machine match. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-8-matthew.brost@intel.com	2024-06-12 19:10:24 -07:00
Matthew Brost	41e1fa93a2	drm/xe: Improve unexpected state error messages Include G2H handler name when an unexpected error state messages. v6: - Use xe_gt_err (Michal) - Print runnable state (John H) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-7-matthew.brost@intel.com	2024-06-12 19:10:23 -07:00
Matthew Brost	4468d0488e	drm/xe: Drop EXEC_QUEUE_FLAG_BANNED Clean up laying violation of setting q->flags EXEC_QUEUE_FLAG_BANNED bit in GuC backend. Move banned to GuC owned bit and report banned status to upper layers via reset_status vfunc. This is a slight change in behavior as reset_status returns true if wedged or killed bits set too, but in all of these cases submission to queue is no longer allowed. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240604184700.1946918-1-matthew.brost@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-06-07 12:16:36 -04:00
Niranjana Vishwanathapura	264eecdba2	drm/xe: Decouple xe_exec_queue and xe_lrc Decouple xe_lrc from xe_exec_queue and reference count xe_lrc. Removing hard coupling between xe_exec_queue and xe_lrc allows flexible design where the user interface xe_exec_queue can be destroyed independent of the hardware/firmware interface xe_lrc. v2: Fix lrc indexing in wq_item_append() Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240530032211.29299-1-niranjana.vishwanathapura@intel.com	2024-05-29 23:44:41 -07:00
Nirmoy Das	dac81a9adb	drm/xe: Add engine name to the engine reset and cat-err log Add engine name to the engine reset and cat error log which should be useful while debugging. v2: Add logical mask and engine class(Matt) Use xe_gt_{info|dbg} (Michal) Cc: Matthew Brost <matthew.brost@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240528101445.27688-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>	2024-05-29 10:57:10 +02:00
Umesh Nerlige Ramappa	45bb564de0	drm/xe: Use run_ticks instead of runtime for client stats Note that runtime is also used in the pm context, so it is confusing to use the same name to denote run time of the drm client. Use a more appropriate name for the client utilization. While at it, drop the incorrect multi-lrc comment in the helper description v2: s/show_runtime/show_run_ticks/ (Rodrigo) Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240524234744.1352543-1-umesh.nerlige.ramappa@intel.com	2024-05-27 14:07:44 -07:00
Matthew Brost	08f7200899	drm/xe: Decouple job seqno and lrc seqno Tightly coupling these seqno presents problems if alternative fences for jobs are used. Decouple these for correctness. v2: - Slightly reword commit message (Thomas) - Make sure the lrc fence ops are used in comparison (Thomas) - Assume seqno is unsigned rather than signed in format string (Thomas) Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240527135912.152156-2-thomas.hellstrom@linux.intel.com	2024-05-27 21:25:59 +02:00
José Roberto de Souza	83ee002df0	drm/xe: Nuke simple error capture This error capture prints into dmesg HW state when a gpu hang happens. It was useful when we did not had devcoredump, now it is a incompleted version of devcoredump that has potential to flood dmesg. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240522203431.191594-1-jose.souza@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-05-23 13:38:26 -04:00
Rodrigo Vivi	ad1e331fc4	drm/xe: Relax runtime pm protection during execution Limit the protection only during moments of actual job execution, and introduce protection for guc submit fini, which is currently unprotected due to the absence of exec_queue life protection. In the regular use case scenario, user space will create an exec queue, and keep it alive to reuse that until it is done with that kind of workload. For the regular desktop cases, it means that the exec_queue is alive even on idle scenarios where display goes off. This is unacceptable since this would entirely block runtime PM indefinitely, blocking deeper Package-C state. This would be a waste drainage of power. Cc: Matthew Brost <matthew.brost@intel.com> Tested-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240522170105.327472-3-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-05-23 11:52:56 -04:00
Niranjana Vishwanathapura	40672b792a	drm/xe: Properly handle alloc_guc_id() failure Release the submission_state lock if alloc_guc_id() fails. v2: Add Fixes tag and CC stable kernel Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: <stable@vger.kernel.org> # v6.8+ Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240521201711.4934-1-niranjana.vishwanathapura@intel.com	2024-05-22 12:33:37 -07:00
Michal Wajdeczko	7065b19bd5	drm/xe/guc: Allow to initialize submission with limited set of IDs While PF and native drivers may initialize submission code to use all available GuC contexts IDs, the VF driver may only use limited number of IDs. Update init function to accept number of context IDs available for use. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240521092518.624-2-michal.wajdeczko@intel.com	2024-05-22 12:53:43 +02:00
Umesh Nerlige Ramappa	6109f24f87	drm/xe: Add helper to accumulate exec queue runtime Add a helper to accumulate per-client runtime of all its exec queues. This is called every time a sched job is finished. v2: - Use guc_exec_queue_free_job() and execlist_job_free() to accumulate runtime when job is finished since xe_sched_job_completed() is not a notification that job finished. - Stop trying to update runtime from xe_exec_queue_fini() - that is redundant and may happen after xef is closed, leading to a use-after-free - Do not special case the first timestamp read: the default LRC sets CTX_TIMESTAMP to zero, so even the first sample should be a valid one. - Handle the parallel submission case by multiplying the runtime by width. v3: Update comments Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240517204310.88854-6-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-05-21 06:33:40 -07:00
Jonathan Cavitt	b31cfb47b2	drm/xe/xe_guc_submit: Declare reset if banned or killed or wedged Add an additional condition to the reset_status guc_exec_queue_op that returns true if the exec queue has been banned or killed or wedged. The reset_status op is only used for exiting any xe_wait_user_fence_ioctl that waits on an exec queue without timing out, so doing this will exit the ioctl early in cases where the exec queue can no longer function, such as after a GuC stop during a reset. Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240510194540.3246991-3-jonathan.cavitt@intel.com	2024-05-14 16:28:53 -07:00
Jonathan Cavitt	abdea2847a	drm/xe/xe_guc_submit: Allow lr exec queues to be banned LR queues currently don't get banned during a GT/GuC reset because they lack a job. Though they don't have a job to detect the reset status of, it's still possible to tell when they should be banned by looking at the LRC: if the LRC head and tail don't match, then the exec queue should be banned and cleaned up. This also requires swapping the usage of xe_sched_tdr_queue_imm with xe_guc_exec_queue_trigger_cleanup, as the former is specific to non-lr exec queues. Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240510194540.3246991-2-jonathan.cavitt@intel.com	2024-05-14 16:28:52 -07:00
Jonathan Cavitt	1564d411e1	drm/xe/xe_guc_submit: Fix exec queue stop race condition Reorder the xe_sched_tdr_queue_imm and set_exec_queue_banned calls in guc_exec_queue_stop. This prevents a possible race condition between the two events in which it's possible for xe_sched_tdr_queue_imm to wake the ufence waiter before the exec queue is banned, causing the ufence waiter to miss the banned state. Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240510194540.3246991-1-jonathan.cavitt@intel.com	2024-05-14 16:28:51 -07:00
Niranjana Vishwanathapura	d6219e1cd5	drm/xe: Add Indirect Ring State support When Indirect Ring State is enabled, the Ring Buffer state and Batch Buffer state are context save/restored to/from Indirect Ring State instead of the LRC. The Indirect Ring State is a 4K page mapped in global GTT at a 4K aligned address. This address is programmed in the INDIRECT_RING_STATE register of the corresponding context's LRC. v2: Fix kernel-doc, add bspec reference v3: Fix typo in commit text Bspec: 67296, 67139 Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240507224255.5059-3-niranjana.vishwanathapura@intel.com	2024-05-08 14:48:30 -07:00
Tejas Upadhyay	c18a5e3e61	drm/xe: skip error capture when exec queue is killed When user closes exec queue soon after job submission, we are generating error coredump. Instead check if exec queue is killed during job timeout then skip error coredump capture. V2: - Just skip error capture - MattB Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240430131229.2228809-1-tejas.upadhyay@intel.com	2024-05-07 11:43:08 -07:00
Himal Prasad Ghimiray	c832541ca8	drm/xe: Change xe_guc_submit_stop return to void The function xe_guc_submit_stop consistently returns 0 without an error state, prompting the caller to verify it, which is redundant. Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240424041911.2184868-1-himal.prasad.ghimiray@intel.com	2024-04-25 20:38:49 -07:00
Matthew Brost	edc9f11af3	drm/xe: Replace engine references with exec queue in xe_guc_submit.c Exec queue has replaced engine nomenclature. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240425232544.1935578-6-matthew.brost@intel.com	2024-04-25 18:41:29 -07:00
Matthew Brost	3713a383f5	drm/xe: Fix alignment in GuC exec queue state defines Normalize the alignment for readability. v3: - Fix typo in commit (Himal) - Fix EXEC_QUEUE_STATE_WEDGED too (Himal) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240425232544.1935578-5-matthew.brost@intel.com	2024-04-25 18:41:28 -07:00
Matthew Brost	1a1563e324	drm/xe: s/ENGINE_STATE_KILLED/EXEC_QUEUE_STATE_KILLED Exec queue has replaced engine nomenclature. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240425232544.1935578-4-matthew.brost@intel.com	2024-04-25 18:41:28 -07:00
Matthew Brost	03b3517630	drm/xe: s/ENGINE_STATE_SUSPENDED/EXEC_QUEUE_STATE_SUSPENDED Exec queue has replaced engine nomenclature. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240425232544.1935578-3-matthew.brost@intel.com	2024-04-25 18:41:27 -07:00
Matthew Brost	f85ada84f6	drm/xe: s/ENGINE_STATE_ENABLED/EXEC_QUEUE_STATE_ENABLED Exec queue has replaced engine nomenclature. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240425232544.1935578-2-matthew.brost@intel.com	2024-04-25 18:41:26 -07:00
Matthew Brost	3f371a98de	drm/xe: Delete unused GuC submission_state.suspend GuC submission_state.suspend is unused, delete it. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240425054747.1918811-1-matthew.brost@intel.com	2024-04-25 14:27:19 -07:00
Rodrigo Vivi	6b8ef44cc0	drm/xe: Introduce the wedged_mode debugfs So, the wedged mode can be selected per device at runtime, before the tests or before reproducing the issue. v2: - s/busted/wedged - some locking consistency v3: - remove mutex - toggle guc reset policy on any mode change Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240423221817.1285081-4-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-04-24 12:12:58 -04:00
Rodrigo Vivi	8ed9aaae39	drm/xe: Force wedged state and block GT reset upon any GPU hang In many validation situations when debugging GPU Hangs, it is useful to preserve the GT situation from the moment that the timeout occurred. This patch introduces a module parameter that could be used on situations like this. If xe.wedged module parameter is set to 2, Xe will be declared wedged on every single execution timeout (a.k.a. GPU hang) right after devcoredump snapshot capture and without attempting any kind of GT reset and blocking entirely any kind of execution. v2: Really block gt_reset from guc side. (Lucas) s/wedged/busted (Lucas) v3: - s/busted/wedged - Really use global_flags (Dafna) - More robust timeout handling when wedging it. v4: A really robust clean exit done by Matt Brost. No more kernel warns on unbind. v5: Simplify error message (Lucas) Cc: Matthew Brost <matthew.brost@intel.com> Cc: Dafna Hirschfeld <dhirschfeld@habana.ai> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> Cc: Himanshu Somaiya <himanshu.somaiya@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240423221817.1285081-3-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-04-24 12:12:58 -04:00
Matthew Brost	0417a5f848	drm/xe: Always capture exec queues on snapshot Always capture exec queues on snapshot regardless if exec queue has pending jobs or not. Having jobs or not does indicate whether the exec queue capture is useful. Example bugs that would not be easily detected by skipping capture when pending job list is empty: - Jobs pending on exec queue have dependencies - Leaking exec queue refs - GuC protocol issues (i.e. losing G2H) In addition to above bugs, in general it just useful to see every exec queue registered with the GuC and its state. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240405211632.223568-2-matthew.brost@intel.com	2024-04-08 14:47:37 -07:00
Michal Wajdeczko	83787afe06	drm/xe/guc: Initialize GuC ID manager sooner The GuC submission cleanup code may depend on the GuC ID manager, thus we can't initialize it after registering a submission cleanup action, as reverse cleanup sequence will destroy GuC ID manager prior to a call to guc_submit_fini(). Move GuC ID manager initialization up, right after managed mutex initialization, to have it available during guc_submit_fini(). Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240406143946.979-2-michal.wajdeczko@intel.com	2024-04-08 11:22:18 +02:00
Michal Wajdeczko	104f7519db	drm/xe/guc: Use drm_device-managed version of mutex_init() This is safer approach and will help resolve a cleanup ordering conflict related to the GuC ID manager. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240406143946.979-1-michal.wajdeczko@intel.com	2024-04-08 11:22:17 +02:00
Rodrigo Vivi	972d01d0e3	drm/xe: Protect devcoredump access after unbind While we don't have the full flow protection when devcoredump is accessed after device unbind. Let's at least for now protect against null dereference: [ 422.766508] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] [ 423.119584] RIP: 0010:xe_vm_snapshot_free+0x30/0x180 [xe] While at it, I also fixed a non-standard code-declaration block on the similar function of xe_guc_submit. v2: - Use IS_ERR_OR_NULL (Nirmoy) - Expand to other functions Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240403195044.239766-1-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-04-04 14:53:22 -04:00
Michal Wajdeczko	e6e7eff627	drm/xe/guc: Use GuC ID Manager in submission code We are ready to replace private guc_ids management code with separate GuC ID Manager that can be shared with upcoming SR-IOV PF provisioning code. Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240313221112.1089-5-michal.wajdeczko@intel.com	2024-03-27 20:19:29 +01:00
Michal Wajdeczko	f88beeed82	drm/xe/guc: Move GUC_ID_MAX definition to GuC ABI header This macro represents GuC firmware capability and shall be defined in the firmware ABI header. Move it to xe_guc_fwif.h file. Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240313221112.1089-2-michal.wajdeczko@intel.com	2024-03-27 20:19:23 +01:00
Daniele Ceraolo Spurio	4c15a6dcee	drm/xe/uc: Use u64 for offsets for which we use upper_32_bits() The GGTT is currently a 32 bit address space, but the HW and GuC support 48b addresses in GGTT-related operations, both to keep the interface/HW paths common between PPGTT and GGTT and to allow for future increase of the GGTT size. This leaves us having to program a 64b field with a 32b offset, which currently we're in some cases doing this by using an upper_32_bits() call on a 32b variable, which doesn't make any sense. To do this cleanly we have 2 options: 1 - Set the upper 32 bits directly to zero. 2 - Use 64b variables for the offset and keep programming the whole thing, so we're ready if we ever have bigger offsets. This patch goes with option #2 and switches the related variables to u64. v2: don't change the log ctl flag variable (John) Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240319195101.2784480-1-daniele.ceraolospurio@intel.com	2024-03-20 14:40:57 -07:00
Daniele Ceraolo Spurio	649a125a88	drm/xe: Always check force_wake_get return code A force_wake_get failure means that the HW might not be awake for the access we're doing; this can lead to an immediate error or it can be a more subtle problem (e.g. a register read might return an incorrect value that is still valid, leading the driver to make a wrong choice instead of flagging an error). We avoid an error from the force_wake function because callers might handle or tolerate the error, but this only works if all callers are checking the error code. The majority already do, but a few are not. These are mainly falling into 3 categories, which are each handled differently: 1) error capture: in this case we want to continue the capture, but we log an info message in dmesg to notify the user that the capture might have incorrect data. 2) ioctl: in this case we return a -EIO error to userspace 3) unabortable actions: these are scenarios where we can't simply abort and retry and so it's better to just try it anyway because there is a chance the HW is awake even with the failure. In this case we throw a warning so we know there was a forcewake problem if something fails down the line. v2: use gt_WARN_ON where appropriate Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240318154924.3453513-1-daniele.ceraolospurio@intel.com	2024-03-20 14:13:58 -07:00
Niranjana Vishwanathapura	aacf3f629a	drm/xe: Separate out sched/deregister_done handling Abstract out the core part of sched_done and deregister_done handlers to separate functions to decouple them from any protocol error handling part and make them more readable. Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240319184153.16667-1-niranjana.vishwanathapura@intel.com	2024-03-19 22:36:15 -07:00


103 Commits