linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-18 23:03:57 -04:00

Author	SHA1	Message	Date
Arnd Bergmann	9c088a5c0d	drm/xe: fix devcoredump chunk alignmnent calculation The device core dumps are copied in 1.5GB chunks, which leads to a link-time error on 32-bit builds because of the 64-bit division not getting trivially turned into mask and shift operations: ERROR: modpost: "__moddi3" [drivers/gpu/drm/xe/xe.ko] undefined! On top of this, I noticed that the ALIGN_DOWN() usage here cannot work because that is only defined for power-of-two alignments. Change ALIGN_DOWN into an explicit div_u64_rem() that avoids the link error and hopefully produces the right results. Doing a 1.5GB kvmalloc() does seem a bit suspicious as well, e.g. this will clearly fail on any 32-bit platform and is also likely to run out of memory on 64-bit systems under memory pressure, so using a much smaller power-of-two chunk size might be a good idea instead. v2: - Always call div_u64_rem (Matt) Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202504251238.JsNgFeFc-lkp@intel.com/ Fixes: `c4a2e5f865` ("drm/xe: Add devcoredump chunking") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250501012545.1045247-1-matthew.brost@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-01 11:48:56 -04:00
Matthew Brost	c4a2e5f865	drm/xe: Add devcoredump chunking Chunk devcoredump into 1.5G pieces to avoid hitting the kvmalloc limit of 2G. Simple algorithm reads 1.5G at time in xe_devcoredump_read callback as needed. Some memory allocations are changed to GFP_ATOMIC as they done in xe_devcoredump_read which holds lock in the path of reclaim. The allocations are small, so in practice should never fail. v2: - Update commit message wrt gfp atomic (John H) v6: - Drop GFP_ATOMIC change for hwconfig (John H) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://lore.kernel.org/r/20250423171725.597955-2-matthew.brost@intel.com	2025-04-24 15:51:38 -07:00
Shuicheng Lin	046eda6525	drm/xe/devcoredump: Remove IS_ERR_OR_NULL check for kzalloc kzalloc returns a valid pointer or NULL if the allocation fails. It never returns an error pointer. It is better to check for NULL directly. Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250220001710.1803749-3-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-02-24 13:31:35 -08:00
Shuicheng Lin	c504ad914f	drm/xe/devcoredump: Fix print typo of offset The log should print with "offset" instead of "size". Correct the typo in the comment. v2: split kzalloc change and add typo fix in commit message (Lucas) Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250220001710.1803749-2-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-02-24 13:31:12 -08:00
Lucas De Marchi	2c95bbf500	drm/xe: Fix and re-enable xe_print_blob_ascii85() Commit `70fb86a85d` ("drm/xe: Revert some changes that break a mesa debug tool") partially reverted some changes to workaround breakage caused to mesa tools. However, in doing so it also broke fetching the GuC log via debugfs since xe_print_blob_ascii85() simply bails out. The fix is to avoid the extra newlines: the devcoredump interface is line-oriented and adding random newlines in the middle breaks it. If a tool is able to parse it by looking at the data and checking for chars that are out of the ascii85 space, it can still do so. A format change that breaks the line-oriented output on devcoredump however needs better coordination with existing tools. v2: Add suffix description comment v3: Reword explanation of xe_print_blob_ascii85() calling drm_puts() in a loop Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: stable@vger.kernel.org Fixes: `70fb86a85d` ("drm/xe: Revert some changes that break a mesa debug tool") Fixes: `ec1455ce7e` ("drm/xe/devcoredump: Add ASCII85 dump helper function") Link: https://patchwork.freedesktop.org/patch/msgid/20250123202307.95103-2-jose.souza@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-01-27 19:40:00 -08:00
Lucas De Marchi	a37934ea75	drm/xe/devcoredump: Move exec queue snapshot to Contexts section Having the exec queue snapshot inside a "GuC CT" section was always wrong. Commit `c28fd6c358` ("drm/xe/devcoredump: Improve section headings and add tile info") tried to fix that bug, but with that also broke the mesa tool that parses the devcoredump, hence it was reverted in commit `70fb86a85d` ("drm/xe: Revert some changes that break a mesa debug tool"). With the mesa tool also fixed, this can propagate as a fix on both kernel and userspace side to avoid unnecessary headache for a debug feature. Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: stable@vger.kernel.org Fixes: `70fb86a85d` ("drm/xe: Revert some changes that break a mesa debug tool") Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250123051112.1938193-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-01-27 15:10:02 -08:00
Nitin Gote	75fd04f276	drm/xe: Fix all typos in xe Fix all typos in files of xe, reported by codespell tool. Signed-off-by: Nitin Gote <nitin.r.gote@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250106102646.1400146-2-nitin.r.gote@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>	2025-01-09 17:58:09 +01:00
John Harrison	70fb86a85d	drm/xe: Revert some changes that break a mesa debug tool There is a mesa debug tool for decoding devcoredump files. Recent changes to improve the devcoredump output broke that tool. So revert the changes until the tool can be extended to support the new fields. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Fixes: `c28fd6c358` ("drm/xe/devcoredump: Improve section headings and add tile info") Fixes: `ec1455ce7e` ("drm/xe/devcoredump: Add ASCII85 dump helper function") Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: intel-xe@lists.freedesktop.org Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241213172833.1733376-1-John.C.Harrison@Intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-12-13 17:01:26 -05:00
John Harrison	906c4b306e	drm/xe: Add mutex locking to devcoredump There are now multiple places that can trigger a coredump. Some of which can happen in parallel. There is already a check against capturing multiple dumps sequentially, but without locking it doesn't guarantee to work against concurrent dumps. And if two dumps do happen in parallel, they can end up doing Bad Things such as one call stack freeing the data the other call stack is still processing. Which leads to a crashed kernel. Further, it is possible for the DRM timeout to expire and trigger a free of the capture while a user is still reading that capture out through sysfs. Again leading to dodgy pointer problems. So, add a mutext lock around the capture, read and free functions to prevent inteference. v2: Swap tiny scope spin_lock for larger scope mutex and fix kernel-doc comment (review feedback from Matthew Brost) v3: Move mutex locks to exclude worker thread and add reclaim annotation (review feedback from Matthew Brost) v4: Fix typo. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241128210824.3302147-4-John.C.Harrison@Intel.com	2024-12-02 15:50:44 -08:00
John Harrison	90f51a7f4e	drm/xe: Move the coredump registration to the worker thread Adding lockdep checking to the coredump code showed that there was an existing violation. The dev_coredumpm_timeout() call is used to register the dump with the base coredump subsystem. However, that makes multiple memory allocations, only some of which use the GFP_ flags passed in. So that also needs to be deferred to the worker function where it is safe to allocate with arbitrary flags. In order to not add protoypes for the callback functions, moving the _timeout call also means moving the worker thread function to later in the file. v2: Rebased after other changes to the worker function. Fixes: `e799485044` ("drm/xe: Introduce the dev_coredump infrastructure.") Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Francois Dugast <francois.dugast@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: "Christian König" <christian.koenig@amd.com> Cc: intel-xe@lists.freedesktop.org Cc: linux-media@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: linaro-mm-sig@lists.linaro.org Cc: <stable@vger.kernel.org> # v6.8+ Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241128210824.3302147-3-John.C.Harrison@Intel.com	2024-12-02 15:50:43 -08:00
John Harrison	429915acae	drm/xe: Add a reason string to the devcoredump There are debug level prints giving more information about the cause of the hang immediately before core dumps are created. However, not everyone has debug level prints enabled or saves the dmesg log at all. So include that information in the dump file itself. Also, at least one of those prints included the pid as well as the process name. So include that in the capture too. v2: Fix kvfree vs kfree and missing kernel-doc (review feedback from Matthew Brost) v3: Use GFP_ATOMIC instead of GFP_KERNEL. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241128210824.3302147-2-John.C.Harrison@Intel.com	2024-12-02 15:50:41 -08:00
Matthew Brost	1c6878af11	drm/xe: Take PM ref in delayed snapshot capture worker The delayed snapshot capture worker can access the GPU or VRAM both of which require a PM reference. Take a reference in this worker. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Fixes: `4f04d07c0a` ("drm/xe: Faster devcoredump") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241126174615.2665852-5-matthew.brost@intel.com	2024-11-27 16:38:52 -08:00
Matthew Brost	a54b0de7ed	drm/xe: Change xe_engine_snapshot_capture_for_job to be for queue During capture time, the target job may be unavailable (e.g., if it's in LR mode). However, the associated exec queue will be available regardless, change xe_engine_snapshot_capture_for_job to take a queue argument ann rename to xe_engine_snapshot_capture_for_queue. v2: - Reword commit message (Jonathan) - Remove redundant queueu check (Zhanjun) - Remove devcoredump job member (Zhanjun) Cc: Zhanjun Dong <zhanjun.dong@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-7-matthew.brost@intel.com	2024-11-14 06:38:46 -08:00
Matthew Brost	f62e6edfc1	drm/xe: Add exec queue param to devcoredump During capture time, the target job may be unavailable (e.g., if it's in LR mode). However, the associated exec queue will be available regardless, so add an exec queue param for such cases. v2: - Reword commit message (Jonathan) Cc: Zhanjun Dong <zhanjun.dong@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-5-matthew.brost@intel.com	2024-11-14 06:38:44 -08:00
Lucas De Marchi	aa06cb8351	drm/xe: Improve devcoredump documentation Let the introduction be useful for both userspace and kernel. Also improve the formatting to wire up later to the documentation build. Reviewed-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241102161254.1818604-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-04 23:29:43 -08:00
John Harrison	db38fdb7bf	drm/xe/guc: Separate full CTB content from guc_info debugfs The guc_info debugfs file is meant to be a quick view of the current software state of the GuC interface. Including the full CTB contents makes the file as a whole much less human readable and is not partiular useful in the general case. So don't pollute the info dump with the full buffers. Instead, move those into a separate debugfs entry that can be read when that information is actually required. Also, improve the human readability by adding a few extra blank lines to delimt the sections. v2: Hide the internal capture/print params from external callers that don't need to know (review feedback from Matthew Brost). Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241024002554.1983101-3-John.C.Harrison@Intel.com	2024-10-29 13:11:34 -07:00
Himal Prasad Ghimiray	9ffd6ec2de	drm/xe/devcoredump: Update handling of xe_force_wake_get return xe_force_wake_get() now returns the reference count-incremented domain mask. If it fails for individual domains, the return value will always be 0. However, for XE_FORCEWAKE_ALL, it may return a non-zero value even in the event of failure. Use helper xe_force_wake_ref_has_domain to verify all domains are initialized or not. Update the return handling of xe_force_wake_get() to reflect this behavior, and ensure that the return value is passed as input to xe_force_wake_put(). v3 - return xe_wakeref_t instead of int in xe_force_wake_get() v5 - return unsigned int for xe_force_wake_get() v6 - use helper xe_force_wake_ref_has_domain() v7 - Fix commit message Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241014075601.2324382-12-himal.prasad.ghimiray@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-10-17 10:17:08 -04:00
Zhanjun Dong	0f1fdf5592	drm/xe/guc: Save manual engine capture into capture list Save manual engine capture into capture list. This removes duplicate register definitions across manual-capture vs guc-err-capture. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-7-zhanjun.dong@intel.com	2024-10-08 09:47:02 -07:00
Zhanjun Dong	ecb6336463	drm/xe/guc: Plumb GuC-capture into dev coredump When we decide to kill a job, (from guc_exec_queue_timedout_job), we could end up with 4 possible scenarios at this starting point of this decision: 1. the guc-captured register-dump is already there. 2. the driver is wedged.mode > 1, so GuC-engine-reset / GuC-err-capture will not happen. 3. the user has started the driver in execlist-submission mode. 4. the guc-captured register-dump is not ready yet so we force GuC to kill that context now, but: A. we don't know yet if GuC will be successful on the engine-reset and get the guc-err-capture, else kmd will do a manual reset later OR B. guc will be successful and we will get a guc-err-capture shortly. So to accomdate the scenarios of 2 and 4A, we will need to do a manual KMD capture first(which is not be reliable in guc-submission mode) and decide later if we need to use that for the cases of 2 or 4A. So this flow is part of the implementation for this patch. Provide xe_guc_capture_get_reg_desc_list to get the register dscriptor list. Add manual capture by read from hw engine if GuC capture is not ready. If it becomes ready at later time, GuC sourced data will be used. Although there may only be a small delay between (1) the check for whether guc-err-capture is available at the start of guc_exec_queue_timedout_job and (2) the decision on using a valid guc-err-capture or manual-capture, lets not take any chances and lock the matching node down so it doesn't get re-claimed if GuC-Err-Capture subsystem is running out of pre-cached nodes. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-6-zhanjun.dong@intel.com	2024-10-08 09:39:58 -07:00
John Harrison	8b7dfb9855	drm/xe/guc: Add GuC log to devcoredump captures Include the GuC log in devcoredump captures because they can be useful with debugging certain types of bug. v2: Fix kerneldoc v3: Drop module parameter as now using more compact ascii85 encoding rather than hexdump (although still not compressed) (review feedback from Matthew B). Rebase onto recent refactoring of devcoredump code. v4: Don't move the submission snapshot inside the GuC internals structure 'cos it really doesn't belong there. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-11-John.C.Harrison@Intel.com	2024-10-07 18:35:00 -07:00
John Harrison	ec1455ce7e	drm/xe/devcoredump: Add ASCII85 dump helper function There is a need to include the GuC log and other large binary objects in core dumps and via dmesg. So add a helper for dumping to a printer function via conversion to ASCII85 encoding. Another issue with dumping such a large buffer is that it can be slow, especially if dumping to dmesg over a serial port. So add a yield to prevent the 'task has been stuck for 120s' kernel hang check feature from firing. v2: Add a prefix to the output string. Fix memory allocation bug. v3: Correct a string size calculation and clean up a define (review feedback from Julia F). Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-5-John.C.Harrison@Intel.com	2024-10-07 18:12:12 -07:00
John Harrison	c28fd6c358	drm/xe/devcoredump: Improve section headings and add tile info The xe_guc_exec_queue_snapshot is not really a GuC internal thing and is definitely not a GuC CT thing. So give it its own section heading. The snapshot itself is really a capture of the submission backend's internal state. Although all it currently prints out is the submission contexts. So label it as 'Contexts'. If more general state is added later then it could be change to 'Submission backend' or some such. Further, everything from the GuC CT section onwards is GT specific but there was no indication of which GT it was related to (and that is impossible to work out from the other fields that are given). So add a GT section heading. Also include the tile id of the GT, because again significant information. Lastly, drop a couple of unnecessary line feeds within sections. v2: Add GT section heading, add tile id to device section. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-4-John.C.Harrison@Intel.com	2024-10-07 18:12:10 -07:00
John Harrison	9d86d080cf	drm/xe/devcoredump: Use drm_puts and already cached local variables There are a bunch of calls to drm_printf with static strings. Switch them to drm_puts instead. There are also a bunch of 'coredump->snapshot.XXX' references when 'coredump->snapshot' has alread been cached locally as 'ss'. So use 'ss->XXX' instead. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-3-John.C.Harrison@Intel.com	2024-10-07 18:12:09 -07:00
Matthew Brost	4f04d07c0a	drm/xe: Faster devcoredump The current algorithm to read out devcoredump is O(N*N) where N is the size of coredump due to usage of the drm_coredump_printer in xe_devcoredump_read. Switch to a O(N) algorithm which prints the devcoredump into a readable format in snapshot work and update xe_devcoredump_read to memcpy from the readable format directly. v2: - Fix double free on devcoredump removal (Testing) - Set read_data_size after snap work flush - Adjust remaining in iterator upon realloc (Testing) - Set read_data upon realloc (Testing) v3: - Kernel doc v4: - Two pass algorithm to determine size (Maarten) v5: - Use scope for reading variables (Johnathan) Reported-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2408 Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240801154118.2547543-4-matthew.brost@intel.com	2024-08-01 11:00:14 -07:00
Matthew Brost	8af13c3fc1	drm/xe: Store process name and pid in xe file An xe file can outlive the associated process as the GPU cleanup is just triggered upon file close (process kill) and completes sometime later. If the file close triggers error conditions (GPU hangs) the process cannot be safely referenced to retrieve the name and pid for debug information. Store the process name and pid directly in the xe file to be safe. v2: - Access file->pid via rcu_access_pointer (Matthew Auld) Fixes: `b10d0c5e9d` ("drm/xe: Add process name to devcoredump") Fixes: `f6ca930d97` ("drm/xe: Add process name and PID to job timedout message") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240723151045.1725417-1-matthew.brost@intel.com	2024-07-23 10:45:40 -07:00
José Roberto de Souza	ec3ac2c8d9	drm/xe: Increase devcoredump timeout 5 minutes is too short for a regular user to search and understand what he needs to do to report capture devcoredump and report a bug to us, so here increasing this timeout to 1 hour. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240611174716.72660-2-jose.souza@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-06-12 11:29:37 -04:00
Matthew Brost	f2bf9e9598	drm/xe: Fix NULL ptr dereference in devcoredump Kernel VM do not have an Xe file. Include a check for Xe file in the VM before trying to get pid from VM's Xe file when taking a devcoredump. Fixes: `b10d0c5e9d` ("drm/xe: Add process name to devcoredump") Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240530203341.1795181-1-matthew.brost@intel.com	2024-05-31 10:31:09 -07:00
Arnd Bergmann	9bbfab1c7c	drm/xe: replace format-less snprintf() with strscpy() Using snprintf() with a format string from task->comm is a bit dangerous since the string may be controlled by unprivileged userspace: drivers/gpu/drm/xe/xe_devcoredump.c: In function 'devcoredump_snapshot': drivers/gpu/drm/xe/xe_devcoredump.c:184:9: error: format not a string literal and no format arguments [-Werror=format-security] 184 \| snprintf(ss->process_name, sizeof(ss->process_name), process_name); \| ^~~~~~~~ In this case there is no reason for an snprintf(), so use a simpler string copy. Fixes: `b10d0c5e9d` ("drm/xe: Add process name to devcoredump") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240528133251.2310868-1-arnd@kernel.org	2024-05-30 18:57:26 +02:00
José Roberto de Souza	b10d0c5e9d	drm/xe: Add process name to devcoredump Process name help us track what application caused the gpug hang, this is crucial when running several applications at the same time. v2: - handle Xe KMD exec_queues without VM v3: - use get_pid_task() (suggested by Nirmoy) Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240522201203.145403-1-jose.souza@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-05-23 13:37:56 -04:00
Matthew Auld	cf13ae6b81	drm/xe/coredump: move over to devm Here we are using drmm to ensure we release the coredump when unloading the module, however the coredump is very much tied to the struct device underneath. We can see this when we hotunplug the device, for which we have already got a coredump attached. In such a case the coredump still remains and adding another is not possible. However we still register the release action via xe_driver_devcoredump_fini(), so in effect two or more releases for one dump. The other consideration is that the coredump state is embedded in the xe_driver instance, so technically once the drmm release action fires we might free the coredumpe state from a different driver instance, assuming we have two release actions and they can race. Rather use devm here to remove the coredump when the device is released. References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1679 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Andrzej Hajda <andrzej.hajda@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240522102143.128069-29-matthew.auld@intel.com	2024-05-22 13:22:39 +01:00
José Roberto de Souza	4209d635a8	drm/xe: Remove devcoredump during driver release This will remove devcoredump from file system and free its resources during driver unload. This fix the driver unload after gpu hang happened, otherwise this it would report that Xe KMD is still in use and it would leave the kernel in a state that Xe KMD can't be unload without a reboot. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> Acked-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240409200206.108452-2-jose.souza@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-04-11 09:39:48 -04:00
Matthew Brost	0417a5f848	drm/xe: Always capture exec queues on snapshot Always capture exec queues on snapshot regardless if exec queue has pending jobs or not. Having jobs or not does indicate whether the exec queue capture is useful. Example bugs that would not be easily detected by skipping capture when pending job list is empty: - Jobs pending on exec queue have dependencies - Leaking exec queue refs - GuC protocol issues (i.e. losing G2H) In addition to above bugs, in general it just useful to see every exec queue registered with the GuC and its state. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240405211632.223568-2-matthew.brost@intel.com	2024-04-08 14:47:37 -07:00
Himal Prasad Ghimiray	b15e653495	drm/xe/xe_devcoredump: Check NULL before assignments Assign 'xe_devcoredump_snapshot ' and 'xe_device ' only if 'coredump' is not NULL. v2 - Fix commit messages. v3 - Define variables before code.(Ashutosh/Jose) v4 - Drop return check for coredump_to_xe. (Jose/Rodrigo) v5 - Modify misleading commit message. (Matt) Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240328123739.3633428-1-himal.prasad.ghimiray@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-03-29 11:28:43 -04:00
José Roberto de Souza	e5f661bb56	drm/xe/devcoredump: Print errno if VM snapshot was not captured My testing machine has only 8GB of RAM and while running piglit tests I can reach the OOM cache in xe_vm_snapshot_capture() snap allocaiton sometimes. So to differentiate the OOM from race between capture and UMDs unbinbind VMs here I'm adding a '[0].error: -12' to devcoredump. v2: - fix returned errno values Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240307135229.41973-2-jose.souza@intel.com	2024-03-22 08:08:47 -07:00
Daniele Ceraolo Spurio	649a125a88	drm/xe: Always check force_wake_get return code A force_wake_get failure means that the HW might not be awake for the access we're doing; this can lead to an immediate error or it can be a more subtle problem (e.g. a register read might return an incorrect value that is still valid, leading the driver to make a wrong choice instead of flagging an error). We avoid an error from the force_wake function because callers might handle or tolerate the error, but this only works if all callers are checking the error code. The majority already do, but a few are not. These are mainly falling into 3 categories, which are each handled differently: 1) error capture: in this case we want to continue the capture, but we log an info message in dmesg to notify the user that the capture might have incorrect data. 2) ioctl: in this case we return a -EIO error to userspace 3) unabortable actions: these are scenarios where we can't simply abort and retry and so it's better to just try it anyway because there is a chance the HW is awake even with the failure. In this case we throw a warning so we know there was a forcewake problem if something fails down the line. v2: use gt_WARN_ON where appropriate Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240318154924.3453513-1-daniele.ceraolospurio@intel.com	2024-03-20 14:13:58 -07:00
Maarten Lankhorst	784b34100f	drm/xe: Add infrastructure for delayed LRC capture Add a xe_guc_exec_queue_snapshot_capture_delayed and xe_lrc_snapshot_capture_delayed function to capture the contents of LRC in the next patch. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240227131248.92910-2-maarten.lankhorst@linux.intel.com	2024-03-04 12:24:38 +01:00
Maarten Lankhorst	0eb2a18a8f	drm/xe: Implement VM snapshot support for BO's and userptr Since we cannot immediately capture the BO's and userptr, perform it in 2 stages. The immediate stage takes a reference to each BO and userptr, while a delayed worker captures the contents and then frees the reference. This is required because in signaling context, no locks can be taken, no memory can be allocated, and no waits on userspace can be performed. With the delayed worker, all of this can be performed very easily, without having to resort to hacks. Changes since v1: - Fix crash on NULL captured vm. - Use ascii85_encode to capture BO contents and save some space. - Add length to coredump output for each captured area. Changes since v2: - Dump each mapping on their own line, to simplify tooling. - Fix null pointer deref in xe_vm_snapshot_free. Changes since v3: - Don't add uninitialized value to snap->ofs. (Souza) - Use kernel types for u32 and u64. - Move snap_mutex destruction to final vm destruction. (Souza) Changes since v4: - Remove extra memset. (Souza) Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240221133024.898315-6-maarten.lankhorst@linux.intel.com	2024-02-21 20:08:57 +01:00
Maarten Lankhorst	bd71cdd209	drm/xe: Clear all snapshot members after deleting coredump It's not strictly needed to clear right now, but this prevents bugs from dangling pointers. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240221133024.898315-2-maarten.lankhorst@linux.intel.com	2024-02-21 20:08:20 +01:00
José Roberto de Souza	be7d51c5b4	drm/xe: Add batch buffer addresses to devcoredump Those addresses are necessary to Mesa tools knows where in VM are the batch buffers to parse and print instructions that are human readable. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Maarten Lankhorst <dev@lankhorst.se> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240130135648.30211-2-jose.souza@intel.com	2024-01-30 11:53:47 -08:00
José Roberto de Souza	4376cee620	drm/xe: Print more device information in devcoredump To properly decode batch buffer Mesa tools needs to know what platform is this one, for now we can do that with PCI id but already making it future proof by also printing GTs GMD version. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Maarten Lankhorst <dev@lankhorst.se> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240123204454.246788-5-jose.souza@intel.com	2024-01-24 11:08:25 -08:00
José Roberto de Souza	98fefec8c3	drm/xe: Change devcoredump functions parameters to xe_sched_job When devcoredump start to dump the VMs contents it will be necessary to know the starting addresses of batch buffers of the job that hang. This information it set in xe_sched_job and xe_sched_job is not easily acessible from xe_exec_queue, so here changing the parameter, next patch will append the batch buffer addresses to devcoredump snapshot capture. v3: - update functions documentation to xe_sched_job Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Maarten Lankhorst <dev@lankhorst.se> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240123204454.246788-2-jose.souza@intel.com	2024-01-24 10:53:38 -08:00
Francois Dugast	9b9529ce37	drm/xe: Rename engine to exec_queue Engine was inappropriately used to refer to execution queues and it also created some confusion with hardware engines. Where it applies the exec_queue variable name is changed to q and comments are also updated. Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/162 Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:39:20 -05:00
Francois Dugast	c22a4ed0c3	drm/xe: Rename xe_engine.[ch] to xe_exec_queue.[ch] This is a preparation commit for a larger renaming of engine to exec queue. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:39:17 -05:00
Rodrigo Vivi	01a87f3181	drm/xe: Add HW Engine snapshot to xe_devcoredump. Let's continue to add our existent simple logs to devcoredump one by one. Any format change should come on follow-up work. v2: remove unnecessary, and now duplicated, dma_fence annotation. (Matthew) v3: avoid for_each with faulty_engine since that can be already freed at the time of the read/free. Instead, iterate in the full array of hw_engines. (Kasan) Cc: Francois Dugast <francois.dugast@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com>	2023-12-19 18:33:53 -05:00
Rodrigo Vivi	3847ec03dd	drm/xe: Add GuC Submit Engine snapshot to xe_devcoredump. Let's start to move our existent logs to devcoredump one by one. Any format change should come on follow-up work. Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com>	2023-12-19 18:33:52 -05:00
Rodrigo Vivi	5ed5344632	drm/xe: Add GuC CT snapshot to xe_devcoredump. Let's start to move our existent logs to devcoredump one by one. Any format change should come on follow-up work. v2: Rebase and add the dma_fence locking annotation here. Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com>	2023-12-19 18:33:52 -05:00
Rodrigo Vivi	656d29506c	drm/xe: Do not take any action if our device was removed. Unfortunately devcoredump infrastructure does not provide and interface for us to force the device removal upon the pci_remove time of our device. The devcoredump is linked at the device level, so when in use it will prevent the module removal, but it doesn't prevent the call of the pci_remove callback. This callback cannot fail anyway and we end up clearing and freeing the entire pci device. Hence, after we removed the pci device, we shouldn't allow any read or free operations to avoid segmentation fault. Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com>	2023-12-19 18:33:51 -05:00
Rodrigo Vivi	e799485044	drm/xe: Introduce the dev_coredump infrastructure. The goal is to use devcoredump infrastructure to report error states captured at the crash time. The error state will contain useful information for GPU hang debug, such as INSTDONE registers and the current buffers getting executed, as well as any other information that helps user space and allow later replays of the error. The proposal here is to avoid a Xe only error_state like i915 and use a standard dev_coredump infrastructure to expose the error state. For our own case, the data is only useful if it is a snapshot of the time when the GPU crash has happened, since we reset the GPU immediately after and the registers might have changed. So the proposal here is to have an internal snapshot to be printed out later. Also, usually a subsequent GPU hang can be only a cause of the initial one. So we only save the 'first' hang. The dev_coredump has a delayed work queue where it remove the coredump and free all the data within a few moments of the error. When that happens we also reset our capture state and allow further snapshots. Right now this infra only print out the time of the hang. More information will be migrated here on subsequent work. Also, in order to organize the dump better, the goal is to propose dev_coredump changes itself to allow multiple files and different controls. But for now we start Xe usage of it without any dependency on dev_coredump core changes. v2: Add dma_fence annotation for capture that might happen during long running. (Thomas and Matt) Use xe->drm.primary->index on drm_info msg. (Jani) v3: checkpatch fixes v4: Fix building and locking issues found by Francois. Actually let's kill all of the locking in here. gt_reset serialization already guarantee that there will be only one capture at the same time. Also, the devcoredump has its own locking to protect the free and reads and drivers don't need to duplicate it. Besides this, the dma_fence locking was pushed to a following patch since it is not needed in this one. Fix a use after free identified by KASAN: Do not stash the faulty_engine since that will be freed somewhere else. v5: Fix Uptime - ktime_get_boottime actually returns the Uptime. (Francois) Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Francois Dugast <francois.dugast@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com>	2023-12-19 18:33:51 -05:00

48 Commits