An xe file can outlive the associated process as the GPU cleanup is just
triggered upon file close (process kill) and completes sometime later.
If the file close triggers error conditions (GPU hangs) the process
cannot be safely referenced to retrieve the name and pid for debug
information. Store the process name and pid directly in the xe file to
be safe.
v2:
- Access file->pid via rcu_access_pointer (Matthew Auld)
Fixes: b10d0c5e9d ("drm/xe: Add process name to devcoredump")
Fixes: f6ca930d97 ("drm/xe: Add process name and PID to job timedout message")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240723151045.1725417-1-matthew.brost@intel.com
Wedge the entire device, not just GT which may have triggered the wedge.
To implement this, cleanup the layering so xe_device_declare_wedged()
calls into the lower layers (GT) to ensure entire device is wedged.
While we are here, also signal any pending GT TLB invalidations upon
wedging device.
Lastly, short circuit reset wait if device is wedged.
v2:
- Short circuit reset wait if device is wedged (Local testing)
Fixes: 8ed9aaae39 ("drm/xe: Force wedged state and block GT reset upon any GPU hang")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240716063902.1390130-1-matthew.brost@intel.com
In GuC TDR sample ctx timestamp to determine if jobs have timed out. The
scheduling enable needs to be toggled to properly sample the timestamp.
If a job has not been running for longer than the timeout period,
re-enable scheduling and restart the TDR.
v2:
- Use GT clock to msec helper (Umesh, off list)
- s/ctx_timestamp_job/ctx_job_timestamp
v3:
- Fix state machine for TDR, mainly decouple sched disable and
deregister (testing)
- Rebase (CI)
v4:
- Fix checkpatch && newline issue (CI)
- Do not deregister on wedged or unregistered (CI)
- Fix refcounting bugs (CI)
- Move devcoredump above VM / kernel job check (John H)
- Add comment for check_timeout state usage (John H)
- Assert pending disable not inflight when enabling scheduling (John H)
- Use enable_scheduling in other scheduling enable code (John H)
- Add comments on a few steps in TDR (John H)
- Add assert for timestamp overflow protection (John H)
v6:
- Use mul_u64_u32_div (CI, checkpath)
- Change check time to dbg level (Paulo)
- Add immediate mode to sched disable (inspection)
- Use xe_gt_* messages (John H)
- Fix typo in comment (John H)
- Check timeout before clearing pending disable (Paulo)
v7:
- Fix ADJUST_FIVE_PERCENT macro (checkpatch)
- Don't print sched disable failure message on GT reset (John H)
- Move kernel / VM jobs WARNs near comment (John H)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240611144053.2805091-12-matthew.brost@intel.com
Limit the protection only during moments of actual job execution,
and introduce protection for guc submit fini, which is currently
unprotected due to the absence of exec_queue life protection.
In the regular use case scenario, user space will create an
exec queue, and keep it alive to reuse that until it is done
with that kind of workload.
For the regular desktop cases, it means that the exec_queue
is alive even on idle scenarios where display goes off. This
is unacceptable since this would entirely block runtime PM
indefinitely, blocking deeper Package-C state. This would be
a waste drainage of power.
Cc: Matthew Brost <matthew.brost@intel.com>
Tested-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240522170105.327472-3-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Add a helper to accumulate per-client runtime of all its
exec queues. This is called every time a sched job is finished.
v2:
- Use guc_exec_queue_free_job() and execlist_job_free() to accumulate
runtime when job is finished since xe_sched_job_completed() is not a
notification that job finished.
- Stop trying to update runtime from xe_exec_queue_fini() - that is
redundant and may happen after xef is closed, leading to a
use-after-free
- Do not special case the first timestamp read: the default LRC sets
CTX_TIMESTAMP to zero, so even the first sample should be a valid
one.
- Handle the parallel submission case by multiplying the runtime by
width.
v3: Update comments
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240517204310.88854-6-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
In many validation situations when debugging GPU Hangs,
it is useful to preserve the GT situation from the moment
that the timeout occurred.
This patch introduces a module parameter that could be used
on situations like this.
If xe.wedged module parameter is set to 2, Xe will be declared
wedged on every single execution timeout (a.k.a. GPU hang) right
after devcoredump snapshot capture and without attempting any
kind of GT reset and blocking entirely any kind of execution.
v2: Really block gt_reset from guc side. (Lucas)
s/wedged/busted (Lucas)
v3: - s/busted/wedged
- Really use global_flags (Dafna)
- More robust timeout handling when wedging it.
v4: A really robust clean exit done by Matt Brost.
No more kernel warns on unbind.
v5: Simplify error message (Lucas)
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Dafna Hirschfeld <dhirschfeld@habana.ai>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Himanshu Somaiya <himanshu.somaiya@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423221817.1285081-3-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Always capture exec queues on snapshot regardless if exec queue has
pending jobs or not. Having jobs or not does indicate whether the exec
queue capture is useful.
Example bugs that would not be easily detected by skipping capture when
pending job list is empty:
- Jobs pending on exec queue have dependencies
- Leaking exec queue refs
- GuC protocol issues (i.e. losing G2H)
In addition to above bugs, in general it just useful to see every exec
queue registered with the GuC and its state.
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240405211632.223568-2-matthew.brost@intel.com
While we don't have the full flow protection when devcoredump
is accessed after device unbind. Let's at least for now
protect against null dereference:
[ 422.766508] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[ 423.119584] RIP: 0010:xe_vm_snapshot_free+0x30/0x180 [xe]
While at it, I also fixed a non-standard code-declaration block
on the similar function of xe_guc_submit.
v2: - Use IS_ERR_OR_NULL (Nirmoy)
- Expand to other functions
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240403195044.239766-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
The GGTT is currently a 32 bit address space, but the HW and GuC
support 48b addresses in GGTT-related operations, both to keep the
interface/HW paths common between PPGTT and GGTT and to allow for
future increase of the GGTT size.
This leaves us having to program a 64b field with a 32b offset, which
currently we're in some cases doing this by using an upper_32_bits()
call on a 32b variable, which doesn't make any sense. To do this cleanly
we have 2 options:
1 - Set the upper 32 bits directly to zero.
2 - Use 64b variables for the offset and keep programming the whole thing,
so we're ready if we ever have bigger offsets.
This patch goes with option #2 and switches the related variables to u64.
v2: don't change the log ctl flag variable (John)
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240319195101.2784480-1-daniele.ceraolospurio@intel.com
A force_wake_get failure means that the HW might not be awake for the
access we're doing; this can lead to an immediate error or it can be a
more subtle problem (e.g. a register read might return an incorrect
value that is still valid, leading the driver to make a wrong choice
instead of flagging an error).
We avoid an error from the force_wake function because callers might
handle or tolerate the error, but this only works if all callers
are checking the error code. The majority already do, but a few are not.
These are mainly falling into 3 categories, which are each handled
differently:
1) error capture: in this case we want to continue the capture, but we
log an info message in dmesg to notify the user that the capture
might have incorrect data.
2) ioctl: in this case we return a -EIO error to userspace
3) unabortable actions: these are scenarios where we can't simply abort
and retry and so it's better to just try it anyway because there is a
chance the HW is awake even with the failure. In this case we throw a
warning so we know there was a forcewake problem if something fails
down the line.
v2: use gt_WARN_ON where appropriate
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240318154924.3453513-1-daniele.ceraolospurio@intel.com