linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-29 12:02:35 -04:00

Author	SHA1	Message	Date
Michal Wajdeczko	a2e1407eb8	drm/xe/guc: Clear whole g2h_fence during initialization The struct g2h_fence must be explicitly initializated using the g2h_fence_init() function to avoid trash values in its members, but we missed to update this helper function with the new member. To fix that and avoid any future mistakes, memset the whole struct first, then update remaining non-zero members. Fixes: `94de94d24e` ("drm/xe/guc: Cancel ongoing H2G requests when stopping CT") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Lukasz Laguna <lukasz.laguna@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250723175639.206875-1-michal.wajdeczko@intel.com (cherry picked from commit `159afd92ba`) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-07-28 10:22:38 -04:00
Michal Wajdeczko	94de94d24e	drm/xe/guc: Cancel ongoing H2G requests when stopping CT Once we have started a GT reset sequence, which includes stopping GuC CTB communication, we should also cancel all ongoing H2G send- recv requests, as either GuC is already dead, or due to imminent reset GuC will not be able to reply, or due to internal cleanup we will lose pending fences. With this we will report dedicated -ECANCELED error instead of misleading -ETIME. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250709174038.1876-4-michal.wajdeczko@intel.com	2025-07-10 21:46:29 +02:00
Michal Wajdeczko	4ecdcf9caf	drm/xe/guc: Move state change logger to helper In the state change helper we are already doing extra stuff, move debug state logger there to cover all state changes. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250709174038.1876-3-michal.wajdeczko@intel.com	2025-07-10 21:39:25 +02:00
Michal Wajdeczko	1b822b7f56	drm/xe/guc: Rename CT state change helper In this helper we are already doing much more than just setting a new CT state and its name was little misleading. Rename it. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250709174038.1876-2-michal.wajdeczko@intel.com	2025-07-10 21:38:43 +02:00
Matthew Brost	ec9223b49a	drm/xe: Drop bo->size bo->size is redundant because the base GEM object already has a size field with the same value. Drop bo->size and use the base GEM object’s size instead. While at it, introduce xe_bo_size() to abstract the BO size. v2: - Fix typo in kernel doc (Ashutosh) - Fix kunit (CI) - Fix line wrap (Checkpatch) v3: - Fix sriov build (CI) v4: - Fix display build (CI) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://lore.kernel.org/r/20250625144128.2827577-1-matthew.brost@intel.com	2025-06-27 14:52:31 -07:00
Maarten Lankhorst	396044c9d8	drm/xe: Simplify GuC early initialization Add a 2-stage GuC init. An early one for everything that is needed for VF, and a full one that loads GuC and is allowed to do allocations. Link: https://lore.kernel.org/r/20250619104858.418440-16-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-06-26 22:10:30 +02:00
Michal Wajdeczko	2ddbb73ec2	drm/xe/guc: Explicitly exit CT safe mode on unwind During driver probe we might be briefly using CT safe mode, which is based on a delayed work, but usually we are able to stop this once we have IRQ fully operational. However, if we abort the probe quite early then during unwind we might try to destroy the workqueue while there is still a pending delayed work that attempts to restart itself which triggers a WARN. This was recently observed during unsuccessful VF initialization: [ ] xe 0000:00:02.1: probe with driver xe failed with error -62 [ ] ------------[ cut here ]------------ [ ] workqueue: cannot queue safe_mode_worker_func [xe] on wq xe-g2h-wq [ ] WARNING: CPU: 9 PID: 0 at kernel/workqueue.c:2257 __queue_work+0x287/0x710 [ ] RIP: 0010:__queue_work+0x287/0x710 [ ] Call Trace: [ ] delayed_work_timer_fn+0x19/0x30 [ ] call_timer_fn+0xa1/0x2a0 Exit the CT safe mode on unwind to avoid that warning. Fixes: `09b286950f` ("drm/xe/guc: Allow CTB G2H processing without G2H IRQ") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250612220937.857-3-michal.wajdeczko@intel.com	2025-06-24 22:27:12 +02:00
Satyanarayana K V P	87c648c313	drm/xe: Add helper function to inject fault into ct_dead_capture() When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv() functions, the CI test systems are going out of space and crashing. To avoid this issue, a new helper function is created and when fault is injected into this xe_is_injection_active() helper function, ct dead capture is avoided which suppresses ct dumps in the log. Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Suggested-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://lore.kernel.org/r/20250612080402.22011-1-satyanarayana.k.v.p@intel.com	2025-06-12 16:53:56 -07:00
Daniele Ceraolo Spurio	0b93b7dcd9	drm/xe: Fix early wedge on GuC load failure When the GuC fails to load we declare the device wedged. However, the very first GuC load attempt on GT0 (from xe_gt_init_hwconfig) is done before the GT1 GuC objects are initialized, so things go bad when the wedge code attempts to cleanup GT1. To fix this, check the initialization status in the functions called during wedge. Fixes: `7dbe8af13c` ("drm/xe: Wedge the entire device") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Zhanjun Dong <zhanjun.dong@intel.com> Cc: stable@vger.kernel.org # v6.12+: `1e1981b16b`: drm/xe: Fix taking invalid lock on wedge Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250611214453.1159846-2-daniele.ceraolospurio@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-06-12 15:17:12 -07:00
John Harrison	16b7e65d29	drm/xe/guc: Track FAST_REQ H2Gs to report where errors came from Most H2G messages are FAST_REQ which means no synchronous response is expected. The messages are sent as fire-and-forget with no tracking. However, errors can still be returned when something goes unexpectedly wrong. That leads to confusion due to not being able to match up the error response to the originating H2G. So add support for tracking the FAST_REQ H2Gs and matching up an error response to its originator. This is only enabled in XE_DEBUG builds given that such errors should never happen in a working system and there is an overhead for the tracking. Further, if XE_DEBUG_GUC is enabled then even more memory and time is used to record the call stack of each H2G and report that with the error. That makes it much easier to work out where a specific H2G came from if there are multiple code paths that can send it. v2: Some re-wording of comments and prints, more consistent use of #if vs stub functions - review feedback from Daniele & Michal). v3: Split config change to separate patch, improve a debug print (review feedback from Michal). v4: Bunch of minor tweaks (review feedback from Michal). Original-i915-code: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250512215324.1457009-5-John.C.Harrison@Intel.com	2025-05-15 12:27:37 -07:00
John Harrison	fddf8cdd4b	drm/xe/guc: Remove double blank line An earlier patch moved a drm_print a few lines lower but accidentally left a double blank line behind. So fix that. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250512215324.1457009-2-John.C.Harrison@Intel.com	2025-05-15 12:27:33 -07:00
Tomasz Lis	cef88d1265	drm/xe/vf: Fixup CTB send buffer messages after migration During post-migration recovery of a VF, it is necessary to update GGTT references included in messages which are going to be sent to GuC. GuC will start consuming messages after VF KMD will inform it about fixups being done; before that, the VF KMD is expected to update any H2G messages which are already in send buffer but were not consumed by GuC. Only a small subset of messages allowed for VFs have GGTT references in them. This patch adds the functionality to parse the CTB send ring buffer and shift addresses contained within. While fixing the CTB content, ct->lock is not taken. This means the only barrier taken remains GGTT address lock - which is ok, because only requests with GGTT addresses matter, but it also means tail changes can happen during the CTB fixups execution (which may be ignored as any new messages will not have anything to fix). The GGTT address locking will be introduced in a future series. v2: removed storing shift as that's now done in VMA nodes patch; macros to inlines; warns to asserts; log messages fixes (Michal) v3: removed inline keywords, enums for offsets in CTB messages, less error messages, if return unused then made functs void (Michal) v4: update the cached head before starting fixups v5: removed/updated comments, wrapped lines, converted assert into error, enums for offsets to separate patch, reused xe_map_rd v6: define xe_map__array() macros, support CTB wrap which divides a message, updated comments, moved one function to an earlier patch v7: renamed few functions, wider use on previously introduced helper, separate cases in parsing messges, documented a static funct v8: Introduced more helpers, fixed coding style mistakes v9: Move xe_map() functs to macros, add asserts, add debug print v10: Errors in place of some asserts, style fixes v11: Fixed invalid conditionals, added debug-only local pointer v12: Removed redundant __maybe_unused Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250512114018.361843-5-tomasz.lis@intel.com	2025-05-12 15:53:38 +02:00
Satyanarayana K V P	104080e339	drm/xe: Introduce fault injection for guc CTB send/recv Fault can be injected with below steps. FAILTYPE=fail_function FAILFUNC=xe_guc_ct_send_recv echo > /sys/kernel/debug/$FAILTYPE/inject echo $FAILFUNC > /sys/kernel/debug/$FAILTYPE/inject printf %#x -19 > /sys/kernel/debug/$FAILTYPE/$FAILFUNC/retval echo N > /sys/kernel/debug/$FAILTYPE/task-filter echo 10 > /sys/kernel/debug/$FAILTYPE/probability echo 0 > /sys/kernel/debug/$FAILTYPE/interval echo -1 > /sys/kernel/debug/$FAILTYPE/times echo 0 > /sys/kernel/debug/$FAILTYPE/space echo 1 > /sys/kernel/debug/$FAILTYPE/verbose Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Michał Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250403120641.7258-3-satyanarayana.k.v.p@intel.com	2025-04-17 22:11:17 +02:00
Matthew Brost	045448da87	drm/xe: Add XE_BO_FLAG_PINNED_NORESTORE Not all BOs need to be restored on resume / d3cold exit, add XE_BO_FLAG_PINNED_NO_RESTORE which skips restoring of BOs rather just allocates VRAM for the BO. This should slightly speedup resume / d3cold exit flows. Marking GuC ADS, GuC CT, GuC log, GuC PC, and SA as NORESTORE. v2: - s/WONTNEED/NORESTORE (Vivi) - Rebase on newly added g2g and backup object flow Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250403102440.266113-11-matthew.auld@intel.com	2025-04-04 11:41:01 +01:00
John Harrison	66fb0dd2b1	drm/xe/guc: Reformat dead CT reason string to be devcoredump compatible The dump on a dead CT tries to emulate the devcoredump formatting (it would use devcoredump code directly but that requires more re-work to happen - work in progress). So update the print of the dead CT reason code to match the format of the 'reason' string that was added to the actual devcoredump a little while ago. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250325203111.3907426-1-John.C.Harrison@Intel.com	2025-03-28 09:44:56 -07:00
Lucas De Marchi	7748289df5	drm/xe/guc: Fix size_t print format Use %zx format to print size_t to remove the following warning when building for i386: >> drivers/gpu/drm/xe/xe_guc_ct.c:1727:43: warning: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat] 1727 \| drm_printf(p, "[CTB].length: 0x%lx\n", snapshot->ctb_size); \| ~~~ ^~~~~~~~~~~~~~~~~~ \| %zx Cc: José Roberto de Souza <jose.souza@intel.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501281627.H6nj184e-lkp@intel.com/ Fixes: `cb1f868ca1` ("drm/xe: Make GUC binaries dump consistent with other binaries in devcoredump") Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250128154242.3371687-1-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-01-29 10:53:21 -08:00
José Roberto de Souza	cb1f868ca1	drm/xe: Make GUC binaries dump consistent with other binaries in devcoredump All other(hwsp, hwctx and vmas) binaries follow this format: [name].length: 0x1000 [name].data: xxxxxxx [name].error: errno The error one is just in case by some reason it was not able to capture the binary. So this GuC binaries should follow the same patern. v2: - renamed GUC binary to LOG Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250123202307.95103-3-jose.souza@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-01-27 19:41:07 -08:00
Lucas De Marchi	2c95bbf500	drm/xe: Fix and re-enable xe_print_blob_ascii85() Commit `70fb86a85d` ("drm/xe: Revert some changes that break a mesa debug tool") partially reverted some changes to workaround breakage caused to mesa tools. However, in doing so it also broke fetching the GuC log via debugfs since xe_print_blob_ascii85() simply bails out. The fix is to avoid the extra newlines: the devcoredump interface is line-oriented and adding random newlines in the middle breaks it. If a tool is able to parse it by looking at the data and checking for chars that are out of the ascii85 space, it can still do so. A format change that breaks the line-oriented output on devcoredump however needs better coordination with existing tools. v2: Add suffix description comment v3: Reword explanation of xe_print_blob_ascii85() calling drm_puts() in a loop Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: stable@vger.kernel.org Fixes: `70fb86a85d` ("drm/xe: Revert some changes that break a mesa debug tool") Fixes: `ec1455ce7e` ("drm/xe/devcoredump: Add ASCII85 dump helper function") Link: https://patchwork.freedesktop.org/patch/msgid/20250123202307.95103-2-jose.souza@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-01-27 19:40:00 -08:00
Nitin Gote	75fd04f276	drm/xe: Fix all typos in xe Fix all typos in files of xe, reported by codespell tool. Signed-off-by: Nitin Gote <nitin.r.gote@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250106102646.1400146-2-nitin.r.gote@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>	2025-01-09 17:58:09 +01:00
John Harrison	36bcc52b9b	drm/xe/guc: Fix for dead CT dump not re-arming The state dump on a dead CT incident deliberately disarms itself after running. This is to prevent a long stream of errors causing continuous dumps. It was supposed to re-arm itself after a reset, however that was not happening. The re-arm flag was being set but the worker was not being run to process that flag. So fix that. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241203005949.3947920-1-John.C.Harrison@Intel.com	2024-12-05 15:25:44 -08:00
John Harrison	7d4d1c54c4	drm/xe/guc: Support crash dump notification from GuC Add support for the two crash dump notifications from GuC. Either one means GuC is toast, so just capture state trigger a reset. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241108212737.2044007-3-John.C.Harrison@Intel.com	2024-11-12 11:42:13 -08:00
Michal Wajdeczko	c4ed1bb128	drm/xe/guc: Log content of the failed G2H message We are already logging an error once we failed to process a G2H message, but then it's quite hard to extract the content of the broken G2H message from the captured snapshot. Extend our error log with the raw hexdump of the G2H message. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241105173032.1947-2-michal.wajdeczko@intel.com	2024-11-07 17:38:09 +01:00
Lucas De Marchi	be15f0bc4a	drm/xe: Fix drm-next merge Fix merge in commit `c787c2901e` ("Merge drm/drm-next into drm-xe-next"). That workaround needs to be done only once. While at it, also remove undesired newline before checking for error. Fixes: `c787c2901e` ("Merge drm/drm-next into drm-xe-next") Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241104183104.2265275-1-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-04 22:36:02 -08:00
Thomas Hellström	c787c2901e	Merge drm/drm-next into drm-xe-next Backmerging to get up-to-date and to bring in a fix that was merged through drm-misc-fixes. Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>	2024-11-04 09:21:20 +01:00
Dave Airlie	30169bb645	Backmerge v6.12-rc6 of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into drm-next Backmerge Linus tree for some drm-fixes needed for msm and xe merges. Signed-off-by: Dave Airlie <airlied@redhat.com>	2024-11-04 14:25:33 +10:00
Nirmoy Das	cbe006a649	drm/xe: Move LNL scheduling WA to xe_device.h Move LNL scheduling WA to xe_device.h so this can be used in other places without needing keep the same comment about removal of this WA in the future. The WA, which flushes work or workqueues, is now wrapped in macros and can be reused wherever needed. Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> cc: stable@vger.kernel.org # v6.11+ Suggested-by: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241029120117.449694-1-nirmoy.das@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-01 06:23:26 -07:00
John Harrison	db38fdb7bf	drm/xe/guc: Separate full CTB content from guc_info debugfs The guc_info debugfs file is meant to be a quick view of the current software state of the GuC interface. Including the full CTB contents makes the file as a whole much less human readable and is not partiular useful in the general case. So don't pollute the info dump with the full buffers. Instead, move those into a separate debugfs entry that can be read when that information is actually required. Also, improve the human readability by adding a few extra blank lines to delimt the sections. v2: Hide the internal capture/print params from external callers that don't need to know (review feedback from Matthew Brost). Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241024002554.1983101-3-John.C.Harrison@Intel.com	2024-10-29 13:11:34 -07:00
Badal Nilawar	22ef43c786	drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout In case if g2h worker doesn't get opportunity to within specified timeout delay then flush the g2h worker explicitly. v2: - Describe change in the comment and add TODO (Matt B/John H) - Add xe_gt_warn on fence done after G2H flush (John H) v3: - Updated the comment with root cause - Clean up xe_gt_warn message (John H) Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620 Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2902 Signed-off-by: Badal Nilawar <badal.nilawar@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241017111410.2553784-2-badal.nilawar@intel.com (cherry picked from commit `e515272338`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-24 12:42:52 -05:00
Matthew Brost	179e01793a	drm/xe: Mark G2H work queue with WQ_MEM_RECLAIM G2H work queue can be used to free memory thus we should allow this work queue to run during reclaim. Mark with G2H work queue with WQ_MEM_RECLAIM appropriately. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241021175705.1584521-4-matthew.brost@intel.com	2024-10-23 11:08:07 -07:00
Badal Nilawar	e515272338	drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout In case if g2h worker doesn't get opportunity to within specified timeout delay then flush the g2h worker explicitly. v2: - Describe change in the comment and add TODO (Matt B/John H) - Add xe_gt_warn on fence done after G2H flush (John H) v3: - Updated the comment with root cause - Clean up xe_gt_warn message (John H) Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620 Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2902 Signed-off-by: Badal Nilawar <badal.nilawar@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241017111410.2553784-2-badal.nilawar@intel.com	2024-10-17 09:13:41 -07:00
Matthew Auld	e863781abe	drm/xe/ct: fix xa_store() error checking Looks like we are meant to use xa_err() to extract the error encoded in the ptr. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-6-matthew.auld@intel.com (cherry picked from commit `1aa4b78647`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-08 18:06:10 -05:00
Matthew Auld	db7f92af62	drm/xe/ct: prevent UAF in send_recv() Ensure we serialize with completion side to prevent UAF with fence going out of scope on the stack, since we have no clue if it will fire after the timeout before we can erase from the xa. Also we have some dependent loads and stores for which we need the correct ordering, and we lack the needed barriers. Fix this by grabbing the ct->lock after the wait, which is also held by the completion side. v2 (Badal): - Also print done after acquiring the lock and seeing timeout. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-5-matthew.auld@intel.com (cherry picked from commit `52789ce35c`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-08 18:06:05 -05:00
Zhanjun Dong	8bfc496327	drm/xe/guc: Extract GuC error capture lists Upon the G2H Notify-Err-Capture event, parse through the GuC Log Buffer (error-capture-subregion) and generate one or more capture-nodes. A single node represents a single "engine- instance-capture-dump" and contains at least 3 register lists: global, engine-class and engine-instance. An internal link list is maintained to store one or more nodes. Because the link-list node generation happen before the call to devcoredump, duplicate global and engine-class register lists for each engine-instance register dump if we find dependent-engine resets in a engine-capture-group. To avoid dynamically allocate the output nodes during gt reset, pre-allocate a fixed number of empty nodes up front (at the time of ADS registration) that we can consume from or return to an internal cached list of nodes. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-5-zhanjun.dong@intel.com	2024-10-08 09:34:45 -07:00
John Harrison	d7c925b299	drm/xe/guc: Dump entire CTB on errors The dump of the CT buffers was only showing the unprocessed data which is not generally useful for saying why a hang occurred - because it was probably caused by the commands that were just processed. So save and dump the entire buffer but in a more compact dump format. Also zero fill it on allocation to avoid confusion over uninitialised data in the dump. v2: Add kerneldoc - review feedback from Michal W. v3: Fix kerneldoc. v4: Use ascii85 instead of hexdump (review feedback from Matthew B). v5: Dump the entire CTB object rather than separately dumping just the H2G and G2H sections. That way it includes the full header info. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-10-John.C.Harrison@Intel.com	2024-10-07 18:35:00 -07:00
John Harrison	d2c5a5a926	drm/xe/guc: Dead CT helper Add a worker function helper for asynchronously dumping state when an internal/fatal error is detected in CT processing. Being asynchronous is required to avoid deadlocks and scheduling-while-atomic or process-stalled-for-too-long issues. Also check for a bunch more error conditions and improve the handling of some existing checks. v2: Use compile time CONFIG check for new (but not directly CT_DEAD related) checks and use unsigned int for a bitmask, rename CT_DEAD_RESET to CT_DEAD_REARM and add some explaining comments, rename 'hxg' macro parameter to 'ctb' - review feedback from Michal W. Drop CT_DEAD_ALIVE as no need for a bitfield define to just set the entire mask to zero. v3: Fix kerneldoc v4: Nullify some floating pointers after free. v5: Add section headings and device info to make the state dump look more like a devcoredump to allow parsing by the same tools (eventual aim is to just call the devcoredump code itself, but that currently requires an xe_sched_job, which is not available in the CT code). v6: Fix potential for leaking snapshots with concurrent error conditions (review feedback from Julia F). v7: Don't complain about unexpected G2H messages yet because there is a known issue causing them. Fix bit shift bug with v6 change. Add GT id to fake coredump headers and use puts instead of printf. v8: Disable the head mis-match check in g2h_read because it is failing on various discrete platforms due to unknown reasons. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-9-John.C.Harrison@Intel.com	2024-10-07 18:34:59 -07:00
John Harrison	0114f66370	drm/xe/guc: Remove spurious line feed in debug print Including line feeds at the start of a debug print messes up the output when sent to dmesg. The break appears between all the useful prefix information and the actual string being printed. In this case, each block of data has a very clear start line and an extra delimeter is really not necessary. So don't do it. v2: Fix typo in commit message (review feedback from Michal W.) Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-2-John.C.Harrison@Intel.com	2024-10-07 18:12:08 -07:00
Francois Dugast	91b2c42c21	drm/xe: Use fault injection infrastructure to find issues at probe time The kernel fault injection infrastructure is used to test proper error handling during probe. The return code of the functions using ALLOW_ERROR_INJECTION() can be conditionnally modified at runtime by tuning some debugfs entries. This requires CONFIG_FUNCTION_ERROR_INJECTION (among others). One way to use fault injection at probe time by making each of those functions fail one at a time is: FAILTYPE=fail_function DEVICE="0000:00:08.0" # depends on the system ERRNO=-12 # -ENOMEM, can depend on the function echo N > /sys/kernel/debug/$FAILTYPE/task-filter echo 100 > /sys/kernel/debug/$FAILTYPE/probability echo 0 > /sys/kernel/debug/$FAILTYPE/interval echo -1 > /sys/kernel/debug/$FAILTYPE/times echo 0 > /sys/kernel/debug/$FAILTYPE/space echo 1 > /sys/kernel/debug/$FAILTYPE/verbose modprobe xe echo $DEVICE > /sys/bus/pci/drivers/xe/unbind grep -oP "^.* \[xe\]" /sys/kernel/debug/$FAILTYPE/injectable \| \ cut -d ' ' -f 1 \| while read -r FUNCTION ; do echo "Injecting fault in $FUNCTION" echo "" > /sys/kernel/debug/$FAILTYPE/inject echo $FUNCTION > /sys/kernel/debug/$FAILTYPE/inject printf %#x $ERRNO > /sys/kernel/debug/$FAILTYPE/$FUNCTION/retval echo $DEVICE > /sys/bus/pci/drivers/xe/bind done rmmod xe It will also be integrated into IGT for systematic execution by CI. v2: Wrappers are not needed in the cases covered by this patch, so remove them and use ALLOW_ERROR_INJECTION() directly. v3: Document the use of fault injection at probe time in xe_pci_probe and refer to it where ALLOW_ERROR_INJECTION() is used. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240927151207.399354-1-francois.dugast@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-10-03 08:58:26 -04:00
Matthew Auld	11bfc4a2cf	drm/xe/ct: drop irq usage of xa_erase() Unclear why disabling interrupts is needed here. Nothing seems to be touching fence_lookup and its corresponding lock from an irq so there should be no risk of deadlock. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-8-matthew.auld@intel.com	2024-10-03 08:34:21 +01:00
Matthew Auld	1aa4b78647	drm/xe/ct: fix xa_store() error checking Looks like we are meant to use xa_err() to extract the error encoded in the ptr. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-6-matthew.auld@intel.com	2024-10-03 08:34:19 +01:00
Matthew Auld	52789ce35c	drm/xe/ct: prevent UAF in send_recv() Ensure we serialize with completion side to prevent UAF with fence going out of scope on the stack, since we have no clue if it will fire after the timeout before we can erase from the xa. Also we have some dependent loads and stores for which we need the correct ordering, and we lack the needed barriers. Fix this by grabbing the ct->lock after the wait, which is also held by the completion side. v2 (Badal): - Also print done after acquiring the lock and seeing timeout. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-5-matthew.auld@intel.com	2024-10-03 08:34:18 +01:00
Nitin Gote	cd89de14bb	drm/xe: Replace double space with single space after comma Avoid using double space, ", " in function or macro parameters where it's not required by any alignment purpose. Replace it with a single space, ", ". Signed-off-by: Nitin Gote <nitin.r.gote@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240823080643.2461992-1-nitin.r.gote@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>	2024-09-05 18:20:00 +02:00
Stuart Summers	df2dbc925f	drm/xe/guc: Bump the G2H queue size to account for page faults With the increase in the size of the recoverable page fault queue, we want to ensure the initial messages from GuC in the G2H buffer have space while we transfer those out to the actual pf_queue. Bump the G2H queue size to account for this increase in the pf_queue size. Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/4c2b6974801bcffd8a010d838c8733fa4092573d.1723862633.git.stuart.summers@intel.com	2024-08-20 09:45:54 -07:00
Matthew Brost	5e9209c373	drm/xe: Assert G2H outstanding when releasing G2H Ensure we are managing G2H credits correctly. Extra important now that this is tied to PM. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240725231801.1958038-1-matthew.brost@intel.com	2024-07-26 06:05:07 -07:00
Matthew Brost	d930c19fdf	drm/xe: Build PM into GuC CT layer Take PM ref when any G2H are outstanding, drop when none are outstanding. To safely ensure we have PM ref when in the GuC CT layer, a PM ref needs to be held when scheduler messages are pending too. v2: - Add outer PM protections to xe_file_close (CI) v3: - Only take PM ref 0->1 and drop on 1->0 (Matthew Auld) v4: - Add assert to G2H increment function v5: - Rebase v6: - Declare xe as local variable in xe_file_close (CI) Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240719172905.1527927-5-matthew.brost@intel.com	2024-07-19 19:45:34 -07:00
Nirmoy Das	eb523ec382	drm/xe/guc: Configure TLB timeout based on CT buffer size GuC TLB invalidation depends on GuC to process the request from the CT queue and then the real time to invalidate TLB. Add a function to return overestimated possible time a TLB inval H2G might take which can be used as timeout value for TLB invalidation wait time. v4: Make sure CTB is in 4K blocks(Michal) and other doc fixes v3: Pass CT to xe_guc_ct_queue_proc_time_jiffies() (Michal) Add tlb_timeout_jiffies() that replaces TLB_TIMEOUT(Michal) v2: Address reviews from Michal. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1622 Cc: Matthew Brost <matthew.brost@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Suggested-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240628085845.2369-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>	2024-07-01 17:38:48 +02:00
Michal Wajdeczko	92e9db6e1f	drm/xe/guc: Print GuC error codes as hex value We maintain GuC error code values in hex format. Also print them in that format for easier matching. While at it, slightly reformat the log and add missing \n. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240625141258.1257-4-michal.wajdeczko@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-06-26 18:25:13 -04:00
Michal Wajdeczko	be3bf9dd1c	drm/xe/guc: Demote the H2G retry log message to debug The G2H RETRY message sent by the GuC does not necessary indicate any serious problem and can be a part of the normal communication flow. Switch the log level from warning to more appropriate debug. This will also let the CI ignore these logs which were seen in few SR-IOV scenarios. While at it, use hex to print the reason and add missing \n. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240625141258.1257-2-michal.wajdeczko@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-06-26 18:25:02 -04:00
Radhakrishna Sripada	3cba2f1d3f	drm/xe/trace: Print device_id in xe_trace_guc events In multi-gpu environments it is important to know the device guc txn belongs to. The tracing information includes the device_id to indicate the device the event is associated with. v2: Use variable sized variant to display dev name(Gustavo) v3: Pass single argument to __assign_str to fix kunit error v4: Minor formatting tweaks Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com> Signed-off-by: Radhakrishna Sripada <radhakrishna.sripada@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240607182943.3572524-5-radhakrishna.sripada@intel.com	2024-06-12 09:25:12 -07:00
Radhakrishna Sripada	6a04e1fc36	drm/xe/trace: Extract guc related traces xe_trace.h is starting to get over crowded. Move the traces related to guc to its own file. v2: Update year in License(Gustavo) Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com> Suggested-by: Jani Nikula <jani.nikula@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Radhakrishna Sripada <radhakrishna.sripada@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240607182943.3572524-3-radhakrishna.sripada@intel.com	2024-06-12 09:25:10 -07:00
Michal Wajdeczko	09b286950f	drm/xe/guc: Allow CTB G2H processing without G2H IRQ During early initialization, in the xe_guc_min_load_for_hwconfig() function, we are successfully enabling CTB communication, but it will only allow us to send non-blocking H2G messages, as due to not yet enabled IRQs, including G2H IRQs, we will not notice any new G2H message sent by the GuC, including replies to our blocking H2G request messages. And those successful replies are mandatory for the VF drivers to continue normal operations. As attempt to workaround this driver initialization ordering issue, introduce special safe-mode CTB worker, that will periodically trigger G2H processing, like original IRQ handler, in case no MSI/MSIX IRQs were enabled on the driver yet. Once we detect that IRQ were enabled, we will stop this worker. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240606130639.1504-3-michal.wajdeczko@intel.com	2024-06-07 12:24:30 +02:00

1 2 3

107 Commits