Nirmoy Das
cdc21021f0
drm/xe: Don't restart parallel queues multiple times on GT reset
...
In case of parallel submissions, multiple GuC IDs point to the
same exec queue, so on GT reset such exec queues get restarted
multiple times, which is not desirable.
v2: don't use exec_queue_enabled() which could race,
do the same for xe_guc_submit_stop (Matt B)
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2295
Cc: Jonathan Cavitt <jonathan.cavitt@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241022103555.731557-1-nirmoy.das@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
(cherry picked from commit c8b0acd6d8)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-10-24 12:42:52 -05:00
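A rough userspace sketch of the problem described in cdc21021f0 above: when several IDs in a lookup table alias one parallel queue, a naive walk restarts that queue once per ID. One possible guard, assumed here for illustration and not taken from the patch, is to act only on the slot that matches the queue's first ID. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an exec queue that may span several IDs. */
struct toy_pq {
	unsigned int id;       /* first ID owned by this queue (assumption) */
	unsigned int width;    /* number of consecutive IDs it occupies */
	unsigned int restarts; /* how many times the queue was restarted */
};

/* Walk an ID -> queue lookup table and restart each queue exactly once. */
static unsigned int toy_restart_all(struct toy_pq **lookup, size_t n)
{
	unsigned int restarted = 0;
	size_t idx;

	for (idx = 0; idx < n; idx++) {
		struct toy_pq *q = lookup[idx];

		if (!q)
			continue;
		assert(q->width >= 1);
		/* Secondary IDs alias the same object, so skip them. */
		if (q->id != idx)
			continue;
		q->restarts++;
		restarted++;
	}
	return restarted;
}
```

With a width-3 queue occupying slots 1..3, the walk above touches it only at slot 1. The check compares plain integers rather than queue state, so it does not depend on state that another thread might be changing.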
Matthew Brost
82926f52d7
drm/xe: Don't free job in TDR
...
Freeing the job in the TDR is not safe, as the TDR can pass the run_job
thread, resulting in a UAF. It is only safe for free job to be called
naturally by the scheduler. Rather than free the job in the TDR, add it
to the pending list.
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2811
Cc: Matthew Auld <matthew.auld@intel.com>
Fixes: e275d61c5f ("drm/xe/guc: Handle timing out of signaled jobs gracefully")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241003001657.3517883-3-matthew.brost@intel.com
(cherry picked from commit ea2f6a77d0)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-10-16 09:00:22 -05:00
Dave Airlie
ac44ff7cec
Merge tag 'drm-xe-fixes-2024-10-10' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
...
Driver Changes:
- Fix error checking with xa_store() (Matthew Auld)
- Fix missing freq restore on GSC load error (Vinay)
- Fix wedged_mode file permission (Matt Roper)
- Fix use-after-free in ct communication (Matthew Auld)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/jri65tmv3bjbhqhxs5smv45nazssxzhtwphojem4uufwtjuliy@gsdhlh6kzsdy
2024-10-11 13:54:10 +10:00
Dave Airlie
b634acb2a0
Merge tag 'drm-misc-fixes-2024-10-10' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
...
Short summary of fixes pull:
fbdev-dma:
- Only clean up deferred I/O if instantiated
nouveau:
- dmem: Fix privileged error in copy engine channel; Fix possible
data leak in migrate_to_ram()
- gsp: Fix coding style
sched:
- Avoid leaking lockdep map
v3d:
- Stop active perfmon before destroying it
vc4:
- Stop active perfmon before destroying it
xe:
- Drop GuC submit_wq pool
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20241010133708.GA461532@localhost.localdomain
2024-10-11 09:03:30 +10:00
Matthew Auld
42465603a3
drm/xe/guc_submit: fix xa_store() error checking
...
Looks like we are meant to use xa_err() to extract the error encoded in
the ptr.
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Badal Nilawar <badal.nilawar@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-7-matthew.auld@intel.com
(cherry picked from commit f040327238)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-10-08 18:06:16 -05:00
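To illustrate the idea behind 42465603a3 above, here is a small userspace model of a store function that returns either the previous entry or an error encoded in the pointer, plus a decoder in the spirit of xa_err(). The encoding below (errno shifted left by two with a low tag of 2) and every toy_* name are assumptions made for this sketch. It is not the kernel's XArray implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy encoding (assumed): errno in the upper bits, tag 0b10 in the low bits. */
static void *toy_mk_err(int err)
{
	return (void *)((((uintptr_t)(intptr_t)err) << 2) | 2u);
}

static int toy_is_err(const void *entry)
{
	uintptr_t v = (uintptr_t)entry;

	return (v & 3u) == 2u && v >= (uintptr_t)toy_mk_err(-4095);
}

/* Decoder in the spirit of xa_err(): 0 for ordinary entries, -errno otherwise. */
static int toy_err(const void *entry)
{
	if (!toy_is_err(entry))
		return 0;
	return (int)((intptr_t)((uintptr_t)entry - 2u) / 4);
}

/* Hypothetical store: returns the old entry, or an encoded error on failure. */
static void *toy_store(void **slot, void *entry, int fail)
{
	void *old;

	if (fail)
		return toy_mk_err(-ENOMEM);
	old = *slot;
	*slot = entry;
	return old;
}

static int toy_register(void **slot, void *entry, int fail)
{
	/* A NULL or ordinary old entry decodes to 0; only errors are reported. */
	return toy_err(toy_store(slot, entry, fail));
}
```

The point of the sketch is that the return value of the store must go through the matching decoder. A check written for a different pointer-error encoding could miss a failure or mistake a valid old entry for one.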
Matthew Brost
1b30f87e08
drm/xe: Resume TDR after GT reset
...
Not starting the TDR after GT reset on exec queues which have been
restarted can allow jobs to run forever. Fix this by
restarting the TDR.
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240724235919.1917216-1-matthew.brost@intel.com
(cherry picked from commit 8ec5a4e5ce)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-10-03 01:19:44 -05:00
Matthew Auld
2d2be279f1
drm/xe: fix UAF around queue destruction
...
We currently do things like queuing the final destruction step on a
random system wq, which will outlive the driver instance. With bad
timing we can tear down the driver while one or more queued work items
are still alive, leading to various UAF splats. Add a fini step to ensure
user queues are properly torn down. At this point the GuC should already
be nuked, so the queue itself should no longer be referenced from the
HW's point of view.
v2 (Matt B)
- Looks much safer to use a waitqueue and then just wait for the
xa_array to become empty before triggering the drain.
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2317
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240923145647.77707-2-matthew.auld@intel.com
(cherry picked from commit 861108666c)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-10-03 01:13:54 -05:00
Matthew Auld
790533e44b
drm/xe/guc_submit: add missing locking in wedged_fini
...
Any non-wedged queue can have a zero refcount here and can be running
concurrently with an async queue destroy, therefore dereferencing the
queue ptr to check wedge status after the lookup can trigger UAF if
queue is not wedged. Fix this by keeping the submission_state lock held
around the check to postpone the free and make the check safe, before
dropping again around the put() to avoid the deadlock.
Fixes: 8ed9aaae39 ("drm/xe: Force wedged state and block GT reset upon any GPU hang")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240924150947.118433-2-matthew.auld@intel.com
(cherry picked from commit d28af0b6b9)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-10-03 01:13:54 -05:00
Matthew Brost
9286a191ab
drm/xe: Drop GuC submit_wq pool
...
Now that drm sched uses a single lockdep map for all submit_wq, drop the
GuC submit_wq pool hack.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241002131639.3425022-3-matthew.brost@intel.com
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2024-10-02 17:54:05 +02:00
Francois Dugast
3dc6da76ae
drm/xe/guc_submit: Make suspend_wait interruptible
...
Rely on wait_event_interruptible_timeout() to put the process to sleep
with TASK_INTERRUPTIBLE. It allows using this function in interruptible
context.
v2: Propagate error on wait_event_interruptible_timeout (Matt Brost)
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809155156.1955925-3-francois.dugast@intel.com
2024-08-17 18:31:50 -07:00
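A small sketch of the error propagation mentioned in v2 of 3dc6da76ae above. In the kernel, wait_event_interruptible_timeout() returns a negative value (-ERESTARTSYS) when interrupted by a signal, 0 when the timeout elapsed with the condition still false, and a positive value when the condition became true. How the driver maps these results is an assumption here, as are the toy_* names and the choice of -ETIME for the timeout case.

```c
#include <assert.h>
#include <errno.h>

/* ERESTARTSYS is kernel-internal; this value is defined here for the sketch. */
#define TOY_ERESTARTSYS 512

/* Map the result of an interruptible timed wait to 0 or a negative errno. */
static int toy_suspend_wait_result(long wait_ret)
{
	assert(wait_ret >= -4095);  /* negative results are errno values */

	if (wait_ret < 0)
		return (int)wait_ret;   /* interrupted: propagate to the caller */
	if (wait_ret == 0)
		return -ETIME;          /* timed out, condition still false */
	return 0;                   /* condition became true in time */
}
```

A caller in an interruptible context could then hand a result of -TOY_ERESTARTSYS back up the stack instead of treating every non-success as a timeout.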
Daniele Ceraolo Spurio
5a891a0e69
drm/xe/uc: Use devm to register cleanup that includes exec_queues
...
Exec_queue cleanup requires HW access, so we need to use devm instead of
drmm for it.
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240815230541.3828206-2-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-08-16 09:15:04 -07:00
Matthew Brost
d79fdaef2b
drm/xe: Allow suspend / resume to be safely called multiple times
...
Switching modes between LR and dma-fence can result in multiple calls to
suspend / resume. Make these calls safe while still enforcing call
order.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809191929.3138956-6-matthew.brost@intel.com
2024-08-09 19:09:33 -07:00
Matthew Brost
885c313825
drm/xe: Only enable scheduling upon resume if needed
...
No need to enable scheduling if already enabled.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240809191929.3138956-5-matthew.brost@intel.com
2024-08-09 19:07:31 -07:00
Matthew Brost
1a394b4f50
drm/xe: Fix possible UAF in guc_exec_queue_process_msg
...
Store xe_device ahead of processing the message, as the message can be
freed in some cases.
v2:
- Include missing local changes
v3:
- Resend for CI
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202407231445.rpisd1vA-lkp@intel.com/
Fixes: d930c19fdf ("drm/xe: Build PM into GuC CT layer")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240724164341.1848954-1-matthew.brost@intel.com
2024-07-24 13:41:01 -07:00
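The pattern behind 1a394b4f50 above can be sketched in plain C: read whatever is needed from the message before calling a handler that may free it, and never touch the message afterwards. The structures, the opcode, and the reference counter are hypothetical and only model the ordering.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_device {
	int pm_refs;              /* stands in for a held PM reference */
};

struct toy_msg {
	struct toy_device *dev;
	int opcode;
};

#define TOY_MSG_CLEANUP 1     /* hypothetical opcode whose handler frees msg */

static void toy_handle(struct toy_msg *msg)
{
	if (msg->opcode == TOY_MSG_CLEANUP)
		free(msg);            /* the handler consumes the message */
}

static void toy_process_msg(struct toy_msg *msg)
{
	/* Cache the device pointer first: msg may be gone after toy_handle(). */
	struct toy_device *dev = msg->dev;

	toy_handle(msg);

	/* From here on only the cached pointer is used, never msg. */
	assert(dev->pm_refs > 0);
	dev->pm_refs--;
}
```

Reading msg->dev after toy_handle() would be a use-after-free whenever the cleanup opcode is processed, which is the class of bug the commit message describes.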
Matthew Brost
8af13c3fc1
drm/xe: Store process name and pid in xe file
...
An xe file can outlive the associated process as the GPU cleanup is just
triggered upon file close (process kill) and completes sometime later.
If the file close triggers error conditions (GPU hangs) the process
cannot be safely referenced to retrieve the name and pid for debug
information. Store the process name and pid directly in the xe file to
be safe.
v2:
- Access file->pid via rcu_access_pointer (Matthew Auld)
Fixes: b10d0c5e9d ("drm/xe: Add process name to devcoredump")
Fixes: f6ca930d97 ("drm/xe: Add process name and PID to job timedout message")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240723151045.1725417-1-matthew.brost@intel.com
2024-07-23 10:45:40 -07:00
Matthew Brost
d930c19fdf
drm/xe: Build PM into GuC CT layer
...
Take PM ref when any G2H are outstanding, drop when none are
outstanding.
To safely ensure we have PM ref when in the GuC CT layer, a PM ref needs
to be held when scheduler messages are pending too.
v2:
- Add outer PM protections to xe_file_close (CI)
v3:
- Only take PM ref 0->1 and drop on 1->0 (Matthew Auld)
v4:
- Add assert to G2H increment function
v5:
- Rebase
v6:
- Declare xe as local variable in xe_file_close (CI)
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Nirmoy Das <nirmoy.das@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240719172905.1527927-5-matthew.brost@intel.com
2024-07-19 19:45:34 -07:00
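The "only take PM ref 0->1 and drop on 1->0" rule from v3 of d930c19fdf above can be modelled with a plain counter. This is a single-threaded sketch with hypothetical names. The real code would need locking or atomics, which are omitted here.

```c
#include <assert.h>

struct toy_ct {
	unsigned int g2h_outstanding; /* replies still expected from firmware */
	unsigned int pm_refs;         /* stands in for the runtime-PM reference */
};

/* Called when a request that expects a G2H reply is sent. */
static void toy_g2h_reserve(struct toy_ct *ct)
{
	if (ct->g2h_outstanding++ == 0)
		ct->pm_refs++;            /* 0 -> 1: take the PM reference */
}

/* Called when a G2H reply has been processed. */
static void toy_g2h_release(struct toy_ct *ct)
{
	assert(ct->g2h_outstanding > 0);
	if (--ct->g2h_outstanding == 0)
		ct->pm_refs--;            /* 1 -> 0: drop the PM reference */
}
```

However many replies are outstanding, at most one PM reference is held on their behalf, and it is released only when the last reply has arrived.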
Matthew Brost
7dbe8af13c
drm/xe: Wedge the entire device
...
Wedge the entire device, not just the GT which may have triggered the
wedge. To implement this, clean up the layering so
xe_device_declare_wedged() calls into the lower layers (GT) to ensure
the entire device is wedged.
While we are here, also signal any pending GT TLB invalidations upon
wedging device.
Lastly, short circuit reset wait if device is wedged.
v2:
- Short circuit reset wait if device is wedged (Local testing)
Fixes: 8ed9aaae39 ("drm/xe: Force wedged state and block GT reset upon any GPU hang")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240716063902.1390130-1-matthew.brost@intel.com
2024-07-17 11:58:26 -07:00
José Roberto de Souza
f6ca930d97
drm/xe: Add process name and PID to job timedout message
...
This will be very helpful for Mesa CI, which uses the PID to match
the exact test that caused the timeout/GPU hang and mark that test as
failing.
Also print the process name, as it might be relevant for human
readers.
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240710213149.57662-1-jose.souza@intel.com
2024-07-11 13:44:15 -07:00
Matthew Brost
627c961d67
drm/xe: Add timeout to preempt fences
...
To adhere to dma fencing rules that fences must signal within a
reasonable amount of time, add a 5 second timeout to preempt fences. If
this timeout occurs, kill the associated VM as this is fatal to the VM.
v2:
- Add comment for smp_wmb (Checkpatch)
- Fix kernel doc typo (Inspection)
- Add comment for killed check (Niranjana)
v3:
- Drop smp_wmb (Matthew Auld)
- Don't take vm->lock in preempt fence worker (Matthew Auld)
- Drop RB given changes to patch
v4:
- Add WRITE/READ_ONCE (Niranjana)
- Don't export xe_vm_kill (Niranjana)
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Tested-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240626004137.4060806-1-matthew.brost@intel.com
2024-07-03 15:27:50 -07:00
Matthew Brost
0d39640ace
drm/xe: Invert runnable_state / pending enable check and assert
...
Rather than checking for pending enable and asserting runnable_state ==
1 in sched done handler, invert these. This is more robust code taking
action based on the G2H message and asserting KMD tracking state is
correct.
Suggested-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240614061343.2931649-1-matthew.brost@intel.com
2024-06-20 15:33:14 -07:00
Matthew Brost
7ddb9403dd
drm/xe: Sample ctx timestamp to determine if jobs have timed out
...
In GuC TDR sample ctx timestamp to determine if jobs have timed out. The
scheduling enable needs to be toggled to properly sample the timestamp.
If a job has not been running for longer than the timeout period,
re-enable scheduling and restart the TDR.
v2:
- Use GT clock to msec helper (Umesh, off list)
- s/ctx_timestamp_job/ctx_job_timestamp
v3:
- Fix state machine for TDR, mainly decouple sched disable and
deregister (testing)
- Rebase (CI)
v4:
- Fix checkpatch && newline issue (CI)
- Do not deregister on wedged or unregistered (CI)
- Fix refcounting bugs (CI)
- Move devcoredump above VM / kernel job check (John H)
- Add comment for check_timeout state usage (John H)
- Assert pending disable not inflight when enabling scheduling (John H)
- Use enable_scheduling in other scheduling enable code (John H)
- Add comments on a few steps in TDR (John H)
- Add assert for timestamp overflow protection (John H)
v6:
- Use mul_u64_u32_div (CI, checkpatch)
- Change check time to dbg level (Paulo)
- Add immediate mode to sched disable (inspection)
- Use xe_gt_* messages (John H)
- Fix typo in comment (John H)
- Check timeout before clearing pending disable (Paulo)
v7:
- Fix ADJUST_FIVE_PERCENT macro (checkpatch)
- Don't print sched disable failure message on GT reset (John H)
- Move kernel / VM jobs WARNs near comment (John H)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-12-matthew.brost@intel.com
2024-06-12 19:14:10 -07:00
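A hedged sketch of the timeout decision described in 7ddb9403dd above: take the difference between two context timestamp samples, convert ticks to milliseconds, and compare against the timeout. The 32-bit timestamp width, the clock frequency passed in, and all toy_* names are assumptions for illustration. The conversion helper is a simple stand-in for a multiply-then-divide helper such as the one mentioned in v6.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Convert ticks to milliseconds without overflowing the intermediate product. */
static uint64_t toy_ticks_to_msec(uint64_t ticks, uint32_t freq_hz)
{
	assert(freq_hz != 0);
	return (ticks / freq_hz) * 1000u + ((ticks % freq_hz) * 1000u) / freq_hz;
}

/*
 * Decide whether a job ran for at least timeout_ms, given the timestamp
 * sampled when it started and the timestamp sampled now.
 */
static bool toy_job_timed_out(uint32_t ts_start, uint32_t ts_now,
			      uint32_t freq_hz, uint32_t timeout_ms)
{
	/* Unsigned subtraction gives the right delta across one wraparound. */
	uint32_t ran = ts_now - ts_start;

	return toy_ticks_to_msec(ran, freq_hz) >= timeout_ms;
}
```

When this returns false the job has not actually consumed its time budget, which corresponds to the "re-enable scheduling and restart the TDR" path in the commit message.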
Matthew Brost
b47b83ef16
drm/xe: Add killed, banned, or wedged as stick bit during GuC reset
...
These bits should be persistent across reset, treat them as such.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-11-matthew.brost@intel.com
2024-06-12 19:10:26 -07:00
Matthew Brost
fc592a81ff
drm/xe: Add pending disable assert to handle_sched_done
...
Will help catch bugs in GuC state machine.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-10-matthew.brost@intel.com
2024-06-12 19:10:25 -07:00
Matthew Brost
716ce587a8
drm/xe: Add GuC state asserts to deregister_exec_queue
...
Will help catch bugs in GuC state machine.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-9-matthew.brost@intel.com
2024-06-12 19:10:25 -07:00
Matthew Brost
7f4f492c70
drm/xe: Assert runnable state in handle_sched_done
...
Ensure the G2H and KMD GuC state machines match.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-8-matthew.brost@intel.com
2024-06-12 19:10:24 -07:00
Matthew Brost
41e1fa93a2
drm/xe: Improve unexpected state error messages
...
Include the G2H handler name in unexpected state error messages.
v6:
- Use xe_gt_err (Michal)
- Print runnable state (John H)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-7-matthew.brost@intel.com
2024-06-12 19:10:23 -07:00
Matthew Brost
4468d0488e
drm/xe: Drop EXEC_QUEUE_FLAG_BANNED
...
Clean up layering violation of setting q->flags EXEC_QUEUE_FLAG_BANNED bit
in GuC backend. Move banned to GuC owned bit and report banned status to
upper layers via reset_status vfunc. This is a slight change in behavior
as reset_status returns true if wedged or killed bits set too, but in
all of these cases submission to queue is no longer allowed.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240604184700.1946918-1-matthew.brost@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-06-07 12:16:36 -04:00
Niranjana Vishwanathapura
264eecdba2
drm/xe: Decouple xe_exec_queue and xe_lrc
...
Decouple xe_lrc from xe_exec_queue and reference count xe_lrc.
Removing hard coupling between xe_exec_queue and xe_lrc allows
flexible design where the user interface xe_exec_queue can be
destroyed independent of the hardware/firmware interface xe_lrc.
v2: Fix lrc indexing in wq_item_append()
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240530032211.29299-1-niranjana.vishwanathapura@intel.com
2024-05-29 23:44:41 -07:00
Nirmoy Das
dac81a9adb
drm/xe: Add engine name to the engine reset and cat-err log
...
Add engine name to the engine reset and cat error log
which should be useful while debugging.
v2: Add logical mask and engine class (Matt)
Use xe_gt_{info|dbg} (Michal)
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240528101445.27688-1-nirmoy.das@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-05-29 10:57:10 +02:00
Umesh Nerlige Ramappa
45bb564de0
drm/xe: Use run_ticks instead of runtime for client stats
...
Note that runtime is also used in the pm context, so it is confusing to
use the same name to denote run time of the drm client. Use a more
appropriate name for the client utilization.
While at it, drop the incorrect multi-lrc comment in the helper
description
v2: s/show_runtime/show_run_ticks/ (Rodrigo)
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240524234744.1352543-1-umesh.nerlige.ramappa@intel.com
2024-05-27 14:07:44 -07:00
Matthew Brost
08f7200899
drm/xe: Decouple job seqno and lrc seqno
...
Tightly coupling these seqno presents problems if alternative fences for
jobs are used. Decouple these for correctness.
v2:
- Slightly reword commit message (Thomas)
- Make sure the lrc fence ops are used in comparison (Thomas)
- Assume seqno is unsigned rather than signed in format string (Thomas)
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240527135912.152156-2-thomas.hellstrom@linux.intel.com
2024-05-27 21:25:59 +02:00
José Roberto de Souza
83ee002df0
drm/xe: Nuke simple error capture
...
This error capture prints HW state into dmesg when a GPU hang happens.
It was useful when we did not have devcoredump; now it is an incomplete
version of devcoredump that has the potential to flood dmesg.
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240522203431.191594-1-jose.souza@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-05-23 13:38:26 -04:00
Rodrigo Vivi
ad1e331fc4
drm/xe: Relax runtime pm protection during execution
...
Limit the protection only during moments of actual job execution,
and introduce protection for guc submit fini, which is currently
unprotected due to the absence of exec_queue life protection.
In the regular use case scenario, user space will create an
exec queue, and keep it alive to reuse that until it is done
with that kind of workload.
For the regular desktop cases, it means that the exec_queue
is alive even in idle scenarios where the display goes off. This
is unacceptable since it would block runtime PM
indefinitely, blocking deeper Package-C states. This would be
a wasteful drain of power.
Cc: Matthew Brost <matthew.brost@intel.com>
Tested-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240522170105.327472-3-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-05-23 11:52:56 -04:00
Niranjana Vishwanathapura
40672b792a
drm/xe: Properly handle alloc_guc_id() failure
...
Release the submission_state lock if alloc_guc_id() fails.
v2: Add Fixes tag and CC stable kernel
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: <stable@vger.kernel.org> # v6.8+
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240521201711.4934-1-niranjana.vishwanathapura@intel.com
2024-05-22 12:33:37 -07:00
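The fix in 40672b792a above is an instance of a common C error-path rule: every exit taken after acquiring a lock must release it. A minimal sketch with a toy lock and a toy ID allocator follows. All names are hypothetical and the lock is just a flag, so a leaked lock shows up as a failed assertion on the next acquire.

```c
#include <assert.h>
#include <errno.h>

struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);   /* a leaked lock would trip here on the next call */
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

struct toy_state {
	struct toy_lock lock;
	int ids_left;
	int next_id;
};

/* Must be called with the lock held; returns an ID or a negative errno. */
static int toy_alloc_id(struct toy_state *s)
{
	assert(s->lock.held);
	if (s->ids_left == 0)
		return -ENOSPC;
	s->ids_left--;
	return s->next_id++;
}

static int toy_queue_init(struct toy_state *s, int *id_out)
{
	int ret;

	toy_lock_acquire(&s->lock);
	ret = toy_alloc_id(s);
	if (ret < 0)
		goto out_unlock;    /* failure path still releases the lock */
	*id_out = ret;
	ret = 0;
out_unlock:
	toy_lock_release(&s->lock);
	return ret;
}
```

Funnelling both the success and the failure path through one unlock label keeps the lock balanced even when more failure cases are added later.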
Michal Wajdeczko
7065b19bd5
drm/xe/guc: Allow to initialize submission with limited set of IDs
...
While PF and native drivers may initialize submission code to use
all available GuC context IDs, the VF driver may only use a limited
number of IDs. Update the init function to accept the number of context
IDs available for use.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240521092518.624-2-michal.wajdeczko@intel.com
2024-05-22 12:53:43 +02:00
Umesh Nerlige Ramappa
6109f24f87
drm/xe: Add helper to accumulate exec queue runtime
...
Add a helper to accumulate per-client runtime of all its
exec queues. This is called every time a sched job is finished.
v2:
- Use guc_exec_queue_free_job() and execlist_job_free() to accumulate
runtime when job is finished since xe_sched_job_completed() is not a
notification that job finished.
- Stop trying to update runtime from xe_exec_queue_fini() - that is
redundant and may happen after xef is closed, leading to a
use-after-free
- Do not special case the first timestamp read: the default LRC sets
CTX_TIMESTAMP to zero, so even the first sample should be a valid
one.
- Handle the parallel submission case by multiplying the runtime by
width.
v3: Update comments
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240517204310.88854-6-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-05-21 06:33:40 -07:00
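The accumulation described in 6109f24f87 above can be sketched as follows: each time a job finishes, add the timestamp delta since the previous sample to a per-client total, multiplied by the queue width for parallel submission. The structures, the 32-bit timestamp, and the function name are assumptions for illustration. The first sample is treated as valid because the previous timestamp starts at zero, as the commit message notes for the default LRC.

```c
#include <assert.h>
#include <stdint.h>

struct toy_client {
	uint64_t run_ticks;   /* accumulated over all queues of this client */
};

struct toy_rq {
	struct toy_client *client;
	uint32_t last_ts;     /* previous timestamp sample, starts at 0 */
	uint32_t width;       /* number of engines used in parallel */
};

/* Called when a job on this queue has finished. */
static void toy_update_run_ticks(struct toy_rq *q, uint32_t new_ts)
{
	uint32_t delta = new_ts - q->last_ts; /* wrap-safe unsigned delta */

	assert(q->width >= 1);
	q->last_ts = new_ts;
	q->client->run_ticks += (uint64_t)delta * q->width;
}
```

Widening to 64 bits before the multiplication keeps a large delta on a wide queue from overflowing.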
Jonathan Cavitt
b31cfb47b2
drm/xe/xe_guc_submit: Declare reset if banned or killed or wedged
...
Add an additional condition to the reset_status guc_exec_queue_op that
returns true if the exec queue has been banned or killed or wedged. The
reset_status op is only used for exiting any xe_wait_user_fence_ioctl
that waits on an exec queue without timing out, so doing this will exit
the ioctl early in cases where the exec queue can no longer function,
such as after a GuC stop during a reset.
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240510194540.3246991-3-jonathan.cavitt@intel.com
2024-05-14 16:28:53 -07:00
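A minimal sketch of the reset_status condition described in b31cfb47b2 above: report "reset" whenever any of the banned, killed, or wedged bits is set in the queue state. The bit positions and toy_* names are assumptions and do not come from the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state bits; the values are chosen for this sketch only. */
#define TOY_STATE_ENABLED  (1u << 0)
#define TOY_STATE_BANNED   (1u << 1)
#define TOY_STATE_KILLED   (1u << 2)
#define TOY_STATE_WEDGED   (1u << 3)
#define TOY_STATE_ALL      (TOY_STATE_ENABLED | TOY_STATE_BANNED | \
			    TOY_STATE_KILLED | TOY_STATE_WEDGED)

static void toy_set_state(unsigned int *state, unsigned int bit)
{
	assert((bit & ~TOY_STATE_ALL) == 0);  /* only known bits may be set */
	*state |= bit;
}

/* True when submission to the queue is no longer allowed. */
static bool toy_reset_status(unsigned int state)
{
	return (state & (TOY_STATE_BANNED | TOY_STATE_KILLED |
			 TOY_STATE_WEDGED)) != 0;
}
```

A waiter polling this status would stop waiting as soon as any of the three terminal bits appears, instead of waiting on a queue that can no longer make progress.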
Jonathan Cavitt
abdea2847a
drm/xe/xe_guc_submit: Allow lr exec queues to be banned
...
LR queues currently don't get banned during a GT/GuC reset because they
lack a job. Though they don't have a job to detect the reset status of,
it's still possible to tell when they should be banned by looking at the
LRC: if the LRC head and tail don't match, then the exec queue should be
banned and cleaned up.
This also requires swapping the usage of xe_sched_tdr_queue_imm with
xe_guc_exec_queue_trigger_cleanup, as the former is specific to non-lr
exec queues.
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240510194540.3246991-2-jonathan.cavitt@intel.com
2024-05-14 16:28:52 -07:00
Jonathan Cavitt
1564d411e1
drm/xe/xe_guc_submit: Fix exec queue stop race condition
...
Reorder the xe_sched_tdr_queue_imm and set_exec_queue_banned calls in
guc_exec_queue_stop. This prevents a possible race condition between
the two events in which it's possible for xe_sched_tdr_queue_imm to
wake the ufence waiter before the exec queue is banned, causing the
ufence waiter to miss the banned state.
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240510194540.3246991-1-jonathan.cavitt@intel.com
2024-05-14 16:28:51 -07:00
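The two commits above (abdea2847a and 1564d411e1) can be combined into one small model: on stop, an LR queue whose ring head and tail differ is banned, and the banned state is set before anything that could wake a waiter. This single-threaded sketch only records what a woken waiter would observe. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_lr_queue {
	uint32_t ring_head;
	uint32_t ring_tail;
	bool banned;
	bool cleanup_triggered;
	bool banned_seen_by_waiter; /* what a waiter woken by cleanup would see */
};

/* Stands in for the call that kicks cleanup and may wake a waiter. */
static void toy_trigger_cleanup(struct toy_lr_queue *q)
{
	assert(!q->cleanup_triggered);
	q->banned_seen_by_waiter = q->banned;
	q->cleanup_triggered = true;
}

static void toy_lr_queue_stop(struct toy_lr_queue *q)
{
	/* Head == tail: nothing was in flight, so there is nothing to ban. */
	if (q->ring_head == q->ring_tail)
		return;

	q->banned = true;        /* publish the banned state first */
	toy_trigger_cleanup(q);  /* only then wake anyone waiting on the queue */
}
```

Swapping the last two statements would let the waiter observe banned == false, which is the race the second commit message describes.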
Niranjana Vishwanathapura
d6219e1cd5
drm/xe: Add Indirect Ring State support
...
When Indirect Ring State is enabled, the Ring Buffer state and
Batch Buffer state are context save/restored to/from Indirect
Ring State instead of the LRC. The Indirect Ring State is a 4K
page mapped in global GTT at a 4K aligned address. This address
is programmed in the INDIRECT_RING_STATE register of the
corresponding context's LRC.
v2: Fix kernel-doc, add bspec reference
v3: Fix typo in commit text
Bspec: 67296, 67139
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240507224255.5059-3-niranjana.vishwanathapura@intel.com
2024-05-08 14:48:30 -07:00
Tejas Upadhyay
c18a5e3e61
drm/xe: skip error capture when exec queue is killed
...
When the user closes an exec queue soon after job submission,
we generate an error coredump. Instead, check whether the
exec queue was killed during job timeout and, if so, skip
error coredump capture.
V2:
- Just skip error capture - MattB
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240430131229.2228809-1-tejas.upadhyay@intel.com
2024-05-07 11:43:08 -07:00
Himal Prasad Ghimiray
c832541ca8
drm/xe: Change xe_guc_submit_stop return to void
...
The function xe_guc_submit_stop consistently returns 0 without an error
state, prompting the caller to verify it, which is redundant.
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240424041911.2184868-1-himal.prasad.ghimiray@intel.com
2024-04-25 20:38:49 -07:00
Matthew Brost
edc9f11af3
drm/xe: Replace engine references with exec queue in xe_guc_submit.c
...
Exec queue has replaced engine nomenclature.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240425232544.1935578-6-matthew.brost@intel.com
2024-04-25 18:41:29 -07:00
Matthew Brost
3713a383f5
drm/xe: Fix alignment in GuC exec queue state defines
...
Normalize the alignment for readability.
v3:
- Fix typo in commit (Himal)
- Fix EXEC_QUEUE_STATE_WEDGED too (Himal)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240425232544.1935578-5-matthew.brost@intel.com
2024-04-25 18:41:28 -07:00
Matthew Brost
1a1563e324
drm/xe: s/ENGINE_STATE_KILLED/EXEC_QUEUE_STATE_KILLED
...
Exec queue has replaced engine nomenclature.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240425232544.1935578-4-matthew.brost@intel.com
2024-04-25 18:41:28 -07:00
Matthew Brost
03b3517630
drm/xe: s/ENGINE_STATE_SUSPENDED/EXEC_QUEUE_STATE_SUSPENDED
...
Exec queue has replaced engine nomenclature.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240425232544.1935578-3-matthew.brost@intel.com
2024-04-25 18:41:27 -07:00
Matthew Brost
f85ada84f6
drm/xe: s/ENGINE_STATE_ENABLED/EXEC_QUEUE_STATE_ENABLED
...
Exec queue has replaced engine nomenclature.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240425232544.1935578-2-matthew.brost@intel.com
2024-04-25 18:41:26 -07:00
Matthew Brost
3f371a98de
drm/xe: Delete unused GuC submission_state.suspend
...
GuC submission_state.suspend is unused, delete it.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240425054747.1918811-1-matthew.brost@intel.com
2024-04-25 14:27:19 -07:00
Rodrigo Vivi
6b8ef44cc0
drm/xe: Introduce the wedged_mode debugfs
...
So, the wedged mode can be selected per device at runtime,
before the tests or before reproducing the issue.
v2: - s/busted/wedged
- some locking consistency
v3: - remove mutex
- toggle guc reset policy on any mode change
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423221817.1285081-4-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-24 12:12:58 -04:00
Rodrigo Vivi
8ed9aaae39
drm/xe: Force wedged state and block GT reset upon any GPU hang
...
In many validation situations when debugging GPU Hangs,
it is useful to preserve the GT situation from the moment
that the timeout occurred.
This patch introduces a module parameter that could be used
in situations like this.
If xe.wedged module parameter is set to 2, Xe will be declared
wedged on every single execution timeout (a.k.a. GPU hang) right
after devcoredump snapshot capture and without attempting any
kind of GT reset and blocking entirely any kind of execution.
v2: Really block gt_reset from guc side. (Lucas)
s/wedged/busted (Lucas)
v3: - s/busted/wedged
- Really use global_flags (Dafna)
- More robust timeout handling when wedging it.
v4: A really robust clean exit done by Matt Brost.
No more kernel warns on unbind.
v5: Simplify error message (Lucas)
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Dafna Hirschfeld <dhirschfeld@habana.ai>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Himanshu Somaiya <himanshu.somaiya@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423221817.1285081-3-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-24 12:12:58 -04:00