linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-03 22:12:32 -04:00

Author	SHA1	Message	Date
Dave Airlie	ce0478b02e	Merge tag 'v6.18-rc6' into drm-next Linux 6.18-rc6 Backmerge in order to merge msm next Signed-off-by: Dave Airlie <airlied@redhat.com>	2025-11-21 08:55:08 +10:00
Matthew Brost	f289f78071	drm/xe: Add xe_guc_pagefault layer Add xe_guc_pagefault layer (producer) which parses G2H fault messages messages into struct xe_pagefault, forwards them to the page fault layer (consumer) for servicing, and provides a vfunc to acknowledge faults to the GuC upon completion. Replace the old (and incorrect) GT page fault layer with this new layer throughout the driver. As part of this change, the ACC handling code has been removed, as it is dead code that is currently unused. v2: - Include engine instance (Stuart) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Tested-by: Francois Dugast <francois.dugast@intel.com> Link: https://patch.msgid.link/20251031165416.2871503-7-matthew.brost@intel.com	2025-11-04 09:04:29 -08:00
Matthew Brost	79be336d1a	drm/xe: Implement xe_pagefault_reset Squash any pending faults on the GT being reset by setting the GT field in struct xe_pagefault to NULL. v4: - Only do reset it page faults queues initialized (CI) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Tested-by: Francois Dugast <francois.dugast@intel.com> Link: https://patch.msgid.link/20251031165416.2871503-4-matthew.brost@intel.com	2025-11-04 09:04:29 -08:00
Balasubramani Vivekanandan	09c452d117	drm/xe/gt: Synchronize GT reset with device unbind When unbinding wait for any GT reset in progress to complete. Unbinding will release the mmio mapping but mmio operations are performed during GT reset causing Kernel panic. Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patch.msgid.link/20251103123144.3231829-5-balasubramani.vivekanandan@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-11-03 11:29:00 -08:00
Lucas De Marchi	1f8a87be9c	drm/xe: Inline gt_reset in the worker gt_reset() doesn't make sense by itself: it can only be called as part of the worker. Inline it there to avoid it being called from elsewhere and clarify the gt_reset() vs do_gt_reset() paths. Note that the error return from gt_reset() was just being ignored. Also add a comment to the xe_pm_runtime_put() to make sure the get()/put() pair is clear. Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251031222244.37735-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-11-02 22:00:15 -08:00
Matthew Brost	b3fbda1a63	drm/xe: Do not wake device during a GT reset Waking the device during a GT reset can lead to unintended memory allocation, which is not allowed since GT resets occur in the reclaim path. Prevent this by holding a PM reference while a reset is in flight. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20251022005538.828980-3-matthew.brost@intel.com (cherry picked from commit `480b358e7d`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-29 11:43:30 -07:00
Matthew Brost	480b358e7d	drm/xe: Do not wake device during a GT reset Waking the device during a GT reset can lead to unintended memory allocation, which is not allowed since GT resets occur in the reclaim path. Prevent this by holding a PM reference while a reset is in flight. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20251022005538.828980-3-matthew.brost@intel.com	2025-10-22 09:10:06 -07:00
Matthew Brost	efd38d619a	drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF VF CCS restore is a primary GT operation on which the media GT depends. Therefore, it doesn't make much sense to run these operations in parallel. To address this, point the media GT's ordered work queue to the primary GT's ordered work queue on platforms that require (PTL VFs) CCS restore as part of VF post-migration recovery. v7: - Remove bool from xe_gt_alloc (Lucas) v9: - Fix typo (Lucas) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20251008214532.3442967-32-matthew.brost@intel.com	2025-10-09 03:24:28 -07:00
Matthew Brost	f1029b9dde	drm/xe/vf: Don't allow GT reset to be queued during VF post migration recovery With well-behaved software, a GT reset should never occur, nor should it happen during VF post-migration recovery. If it does, trigger a warning but suppress the GT reset, as VF post-migration recovery is expected to bring the VF back to a working state. v3: - Better commit message (Tomasz) v5: - Use xe_gt_WARN_ON (Michal) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Tomasz Lis <tomasz.lis@intel.com> Link: https://lore.kernel.org/r/20251008214532.3442967-17-matthew.brost@intel.com	2025-10-09 03:22:37 -07:00
Matthew Brost	b47c0c07c3	drm/xe/vf: Teardown VF post migration worker on driver unload Be cautious and ensure the VF post-migration worker is not running during driver unload. v3: - More teardown later in driver init, use devm (Tomasz) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Tomasz Lis <tomasz.lis@intel.com> Link: https://lore.kernel.org/r/20251008214532.3442967-16-matthew.brost@intel.com	2025-10-09 03:22:36 -07:00
Matthew Brost	e1587f1660	drm/xe/vf: Make VF recovery run on per-GT worker VF recovery is a per-GT operation, so it makes sense to isolate it to a per-GT queue. Scheduling this operation on the same worker as the GT reset and TDR not only aligns with this design but also helps avoid race conditions, as those operations can also modify the queue state. v2: - Fix lockdep splat (Adam) - Use xe_sriov_vf_migration_supported helper v3: - Drop xe_gt_sriov_ prefix for private functions (Michal) - Drop message in xe_gt_sriov_vf_migration_init_early (Michal) - Logic rework in vf_post_migration_notify_resfix_done (Michal) - Rework init sequence layering (Michal) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Tomasz Lis <tomasz.lis@intel.com> Link: https://lore.kernel.org/r/20251008214532.3442967-10-matthew.brost@intel.com	2025-10-09 03:22:25 -07:00
Michal Wajdeczko	9a54b5127f	drm/xe/pf: Make the late-initialization really late While the late PF per-GT initialization is done quite late in the single GT initialization flow, in case of multi-GT platforms, it may still be done before other GT early initialization. That leads to some issues during unwind, when there are cross-GT dependencies, like resource cleanup that is shared by both GTs, but the other GT may already be sanitized or disabled. The following errors could be observed when trying to unload the PF driver with some LMEM/VRAM already provisioned for few VFs: [ ] xe 0000:03:00.0: DEVRES REL ffff88814708f240 fini_config (16 bytes) [ ] xe 0000:03:00.0: [drm:lmtt_write_pte [xe]] PF: LMTT: WRITE level=2 index=1 pte=0x0 [ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19 [ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=0 addr=53a470000 [ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=1 addr=53a4b0000 [ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19 [ ] xe 0000:03:00.0: [drm] PF: LMTT0 invalidation failed (-ENODEV) [ ] xe 0000:03:00.0: [drm:lmtt_write_pte [xe]] PF: LMTT: WRITE level=2 index=2 pte=0x0 [ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19 [ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=0 addr=539b70000 [ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=1 addr=539bf0000 [ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19 [ ] xe 0000:03:00.0: [drm] PF: LMTT0 invalidation failed (-ENODEV) Move all PF per-GT late initialization to the already defined late SR-IOV initialization function to allow proper order of the cleanup actions. While around, format all PF function stubs as one-liners, like many other stubs are defined in the Xe driver. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Link: https://lore.kernel.org/r/20251004162008.1782-1-michal.wajdeczko@intel.com	2025-10-06 19:30:17 +02:00
Michal Wajdeczko	846a81abbe	drm/xe: Detect GT workqueue allocation failure The allocation of the per-GT workqueue may fail and we shouldn't ignore that. While around use drm managed allocation function to drop our custom fini action. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20251001144051.202040-1-michal.wajdeczko@intel.com	2025-10-02 18:48:10 +02:00
Daniele Ceraolo Spurio	8843444843	drm/xe/guc: Set RCS/CCS yield policy All recent platforms (including all the ones officially supported by the Xe driver) do not allow concurrent execution of RCS and CCS workloads from different address spaces, with the HW blocking the context switch when it detects such a scenario. The DUAL_QUEUE flag helps with this, by causing the GuC to not submit a context it knows will not be able to execute. This, however, causes a new problem: if RCS and CCS queues have pending workloads from different address spaces, the GuC needs to choose from which of the 2 queues to pick the next workload to execute. By default, the GuC prioritizes RCS submissions over CCS ones, which can lead to CCS workloads being significantly (or completely) starved of execution time. The driver can tune this by setting a dedicated scheduling policy KLV; this KLV allows the driver to specify a quantum (in ms) and a ratio (percentage value between 0 and 100), and the GuC will prioritize the CCS for that percentage of each quantum. Given that we want to guarantee enough RCS throughput to avoid missing frames, we set the yield policy to 20% of each 80ms interval. v2: updated quantum and ratio, improved comment, use xe_guc_submit_disable in gt_sanitize Fixes: `d9a1ae0d17` ("drm/xe/guc: Enable WA_DUAL_QUEUE for newer platforms") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Tested-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://lore.kernel.org/r/20250905235632.3333247-2-daniele.ceraolospurio@intel.com	2025-09-11 09:45:35 -07:00
Matthew Brost	db16f9d90c	drm/xe: Split TLB invalidation code in frontend and backend The frontend exposes an API to the driver to send invalidations, handles sequence number assignment, synchronization (fences), and provides a timeout mechanism. The backend issues the actual invalidation to the hardware (or firmware). The new layering easily allows issuing TLB invalidations to different hardware or firmware interfaces. Normalize some naming while here too. Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-10-stuart.summers@intel.com	2025-08-27 11:49:31 -07:00
Matthew Brost	15366239e2	drm/xe: Decouple TLB invalidations from GT Decouple TLB invalidations from the GT by updating the TLB invalidation layer to accept a `struct xe_tlb_inval` instead of a `struct xe_gt`. Also, rename gt_tlb to tlb. The internals of the TLB invalidation code still operate on a GT, but this is now hidden from the rest of the driver. Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-7-stuart.summers@intel.com	2025-08-27 11:49:18 -07:00
Matthew Brost	c697ddcf27	drm/xe: s/tlb_invalidation/tlb_inval tlb_invalidation is a bit verbose leading to ugly wraps in the code, shorten to tlb_inval. Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-4-stuart.summers@intel.com	2025-08-27 11:49:00 -07:00
Stuart Summers	76186a253a	drm/xe: Cancel pending TLB inval workers on teardown Add a new _fini() routine on the GT TLB invalidation side to handle this worker cleanup on driver teardown. v2: Move the TLB teardown to the gt fini() routine called during gt_init rather than in gt_alloc. This way the GT structure stays alive for while we reset the TLB state. Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-3-stuart.summers@intel.com	2025-08-27 11:48:56 -07:00
Matt Atwood	342d1f8432	drm/xe: Update function names for GT specific workarounds Now that there distinctly different OOB functions, update the names to reflect the IPs they interact with. Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250807214224.32728-2-matthew.s.atwood@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-08 10:50:45 -04:00
Matt Atwood	4d5c98eb77	drm/xe: rename XE_WA to XE_GT_WA Now that there are two types of wa tables and infrastructure, be more concise in the naming of GT wa macros. v2: update the documentation Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250807214224.32728-1-matthew.s.atwood@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-08 10:50:45 -04:00
Tomasz Lis	a0840b1ce9	drm/xe: Block reset while recovering from VF migration Resetting GuC during recovery could interfere with the recovery process. Such reset might be also triggered without justification, due to migration taking time, rather than due to the workload not progressing. Doing GuC reset during the recovery would cause exit of RESFIX state, and therefore continuation of GuC work while fixups are still being applied. To avoid that, reset needs to be blocked during the recovery. This patch blocks the reset during recovery. Reset request in that time range will be stalled, and unblocked only after GuC goes out of RESFIX state. In case a reset procedure already started while the recovery is triggered, there isn't much we can do - we cannot wait for it to finish as it involves waiting for hardware, and we can't be sure at which exact point of the reset procedure the GPU got switched. Therefore, the rare cases where migration happens while reset is in progress, are still dangerous. Resets are not a part of the standard flow, and cause unfinished workloads - that will happen during the reset interrupted by migration as well, so it doesn't diverge that much from what normally happens during such resets. v2: Introduce a new atomic for reset blocking, as we cannot reuse `stopped` atomic (that could lead to losing a workload). v3: Switched atomic functs to ones which include proper barriers Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Michal Winiarski <michal.winiarski@intel.com> Reviewed-by: Michal Winiarski <michal.winiarski@intel.com> Link: https://lore.kernel.org/r/20250802031045.1127138-4-tomasz.lis@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>	2025-08-04 16:46:34 +02:00
Satyanarayana K V P	a843b98947	drm/xe/vf: Fix VM crash during VF driver release The VF CCS save/restore series (patchwork #149108) has a dependency on the migration framework. A recent migration update in commit `d65ff1ec85` ("drm/xe: Split xe_migrate allocation from initialization") caused a VM crash during XE driver release for iGPU devices. Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b83: 0000 [#1] SMP NOPTI RIP: 0010:xe_lrc_ring_head+0x12/0xb0 [xe] Call Trace: xe_sriov_vf_ccs_fini+0x1e/0x40 [xe] devm_action_release+0x12/0x30 release_nodes+0x3a/0x120 devres_release_all+0x96/0xd0 device_unbind_cleanup+0x12/0x80 device_release_driver_internal+0x23a/0x280 device_release_driver+0x12/0x20 pci_stop_bus_device+0x69/0x90 pci_stop_and_remove_bus_device+0x12/0x30 pci_iov_remove_virtfn+0xbd/0x130 sriov_disable+0x42/0x100 pci_disable_sriov+0x34/0x50 xe_pci_sriov_configure+0xf71/0x1020 [xe] Update the VF CCS migration initialization sequence to align with the new migration framework changes, resolving the release-time crash. Fixes: `f3009272ff` ("drm/xe/vf: Create contexts for CCS read write") Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Piotr Piórkowski <piotr.piorkowski@intel.com> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250729120720.13990-1-satyanarayana.k.v.p@intel.com	2025-07-29 22:05:14 -07:00
Michal Wajdeczko	9f50b729dd	drm/xe/pf: Prepare to stop SR-IOV support prior GT reset As part of the resume or GT reset, the PF driver schedules work which is then used to complete restarting of the SR-IOV support, including resending to the GuC configurations of provisioned VFs. However, in case of short delay between those two actions, which could be seen by triggering a GT reset on the suspened device: $ echo 1 > /sys/kernel/debug/dri/0000:00:02.0/gt0/force_reset this PF worker might be still busy, which lead to errors due to just stopped or disabled GuC CTB communication: [ ] xe 0000:00:02.0: [drm:xe_gt_resume [xe]] GT0: resumed [ ] xe 0000:00:02.0: [drm] GT0: trying reset from force_reset_show [xe] [ ] xe 0000:00:02.0: [drm] GT0: reset queued [ ] xe 0000:00:02.0: [drm] GT0: reset started [ ] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] GT0: GuC CT communication channel stopped [ ] xe 0000:00:02.0: [drm:guc_ct_send_recv [xe]] GT0: H2G request 0x5503 canceled! [ ] xe 0000:00:02.0: [drm] GT0: PF: Failed to push VF1 12 config KLVs (-ECANCELED) [ ] xe 0000:00:02.0: [drm] GT0: PF: Failed to push VF1 configuration (-ECANCELED) [ ] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] GT0: GuC CT communication channel disabled [ ] xe 0000:00:02.0: [drm] GT0: PF: Failed to push VF2 12 config KLVs (-ENODEV) [ ] xe 0000:00:02.0: [drm] GT0: PF: Failed to push VF2 configuration (-ENODEV) [ ] xe 0000:00:02.0: [drm] GT0: PF: Failed to push 2 of 2 VFs configurations [ ] xe 0000:00:02.0: [drm:pf_worker_restart_func [xe]] GT0: PF: restart completed While this VFs reprovisioning will be successful during next spin of the worker, to avoid those errors, make sure to cancel restart worker if we are about to trigger next reset. Fixes: `411220808c` ("drm/xe/pf: Restart VFs provisioning after GT reset") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Link: https://lore.kernel.org/r/20250711193316.1920-2-michal.wajdeczko@intel.com	2025-07-15 13:05:15 +02:00
Tvrtko Ursulin	aded26ccaa	drm/xe: Waste fewer instructions in emit_wa_job() I was debugging some unrelated issue and noticed the current code was very verbose. We can improve it easily by using the more common batch buffer building pattern. Before: bb->cs[bb->len++] = MI_LOAD_REGISTER_REG \| MI_LRR_DST_CS_MMIO; c4d: 41 8b 56 10 mov 0x10(%r14),%edx c51: 49 8b 4e 08 mov 0x8(%r14),%rcx c55: 8d 72 01 lea 0x1(%rdx),%esi c58: 41 89 76 10 mov %esi,0x10(%r14) c5c: c7 04 91 01 00 08 15 movl $0x15080001,(%rcx,%rdx,4) bb->cs[bb->len++] = entry->reg.addr; c63: 8b 08 mov (%rax),%ecx c65: 41 8b 56 10 mov 0x10(%r14),%edx c69: 49 8b 76 08 mov 0x8(%r14),%rsi c6d: 81 e1 ff ff 3f 00 and $0x3fffff,%ecx c73: 8d 7a 01 lea 0x1(%rdx),%edi c76: 41 89 7e 10 mov %edi,0x10(%r14) c7a: 89 0c 96 mov %ecx,(%rsi,%rdx,4) ..etc.. After: cs++ = MI_LOAD_REGISTER_REG \| MI_LRR_DST_CS_MMIO; c52: 41 c7 04 24 01 00 08 movl $0x15080001,(%r12) c59: 15 cs++ = entry->reg.addr; c5a: 8b 10 mov (%rax),%edx ..etc.. Resulting in the following binary change: add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-348 (-348) Function old new delta xe_gt_record_default_lrcs.cold 304 296 -8 xe_gt_record_default_lrcs 2200 1860 -340 Total: Before=13554, After=13206, chg -2.57% Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250710-lrc-refactors-v2-7-a5e2ca03f6bd@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-07-14 13:40:17 -07:00
Lucas De Marchi	f4b538245f	drm/xe/gt: Drop third submission for default context There's no need to submit the nop job again on the first queue. Any state needed is already saved when the first LRC is switched out. The comment is a little misleading regarding indirect W/A: first of all there's still no indirect W/A enabled and secondly, even after they are, there's no need to submit this job again for having their state propagated: the indirect W/A will actually run on every LRC switch. Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250710-lrc-refactors-v2-6-a5e2ca03f6bd@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-07-14 13:40:17 -07:00
Lucas De Marchi	fab2cc0c09	drm/xe/gt: Extract emit_job_sync() Both the nop and wa jobs are going through the same boiler plate calls to emit the job with a timeout and handling error for both bb and job. Extract emit_job_sync() so those functions create the bb, handling possible errors and delegate the part about really emitting the job and waiting for its completion. Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Link: https://lore.kernel.org/r/20250710-lrc-refactors-v2-3-a5e2ca03f6bd@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-07-14 13:20:03 -07:00
Lucas De Marchi	e4cb5823ba	drm/xe: Count dwords before allocating The bb allocation in emit_wa_job() is wrong in 2 ways: first it's allocating enough space for the 3DSTATE or hardcoding 4k depending on the engine. In the first case it doesn't account for the WAs and in the former it may not be sufficient. Secondly it's using the size instead of number of dwords, causing the buffer to be 4x bigger than needed: xe_bb_new() receives number of dwords as parameter and its declaration was also not following its implementation. Lastly, reword the debug message since it's not only about the LRC WAs anymore as it also include the 3DSTATE for render. While it's unlikely this is causing any real issue, let's calculate the needed space and allocate just enough. Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Link: https://lore.kernel.org/r/20250710-lrc-refactors-v2-2-a5e2ca03f6bd@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-07-14 13:20:02 -07:00
Michal Wajdeczko	ffab82b062	drm/xe: Introduce xe_gt_is_main_type helper Instead of checking for not being a media type GT provide a small helper to explicitly express our intentions. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Link: https://lore.kernel.org/r/20250713103625.1964-5-michal.wajdeczko@intel.com	2025-07-14 18:19:29 +02:00
Matthew Brost	beb72acb5b	drm/xe: Move page fault init after topology init We need the topology to determine GT page fault queue size, move page fault init after topology init. Cc: stable@vger.kernel.org Fixes: `3338e4f90c` ("drm/xe: Use topology to determine page fault queue size") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250710191208.1040215-1-matthew.brost@intel.com	2025-07-11 10:43:11 -07:00
Maarten Lankhorst	18635b6328	drm/xe: Rename xe_uc_init_hw to xe_uc_load_hw It feels to me like load is closer to the intention than init_hw. It makes the init calls slightly less confusing to me. :) Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250619104858.418440-24-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-26 22:11:35 +02:00
Maarten Lankhorst	80fa03eb8a	drm/xe: Remove xe_uc_init_hwconfig() This function is called immediately after xe_uc_init(), so just put it there instead. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250619104858.418440-22-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-26 22:11:35 +02:00
Maarten Lankhorst	11bf0f0b3a	drm/xe: Split init of xe_gt_init_hwconfig to xe_gt_init and *_early Now that we added the separate step of initialising GUC in xe_gt_init_early, it should be ok to initialise the minimum during early init, and the rest after allocations are allowed. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250619104858.418440-20-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-26 22:11:35 +02:00
Maarten Lankhorst	6386a49951	drm/xe: Rename gt_init sub-functions s/gt_fw_domain_init/gt_init_with_gt_forcewake()/ s/all_fw_domain_init/gt_init_with_all_forcewake()/ Clarify that the functions are the part of gt_init() that are called with the respective power domains held. all_domain() of course only works after discovering and initialisation of force_wake on all engines, that's why the split is needed in the first place. Suggested-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250619104858.418440-19-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-26 22:11:35 +02:00
Maarten Lankhorst	4c5517e9ec	drm/xe: Only dump PAT when xe_hw_engines_init_early fails After discussion with Lucas De Marchi, it turns out that is the specific caller requiring a dump. This allows us to cleanup gt_init in a bit. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250619104858.418440-18-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-26 22:11:35 +02:00
Maarten Lankhorst	396044c9d8	drm/xe: Simplify GuC early initialization Add a 2-stage GuC init. An early one for everything that is needed for VF, and a full one that loads GuC and is allowed to do allocations. Link: https://lore.kernel.org/r/20250619104858.418440-16-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-06-26 22:10:30 +02:00
Maarten Lankhorst	b3412d7233	drm/xe/sriov: Move VF bootstrap and query_config to vf_guc_init We want to split up GUC init to an alloc and noalloc part to keep the init path the same for VF and !VF as much as possible. Everything in vf_guc_init should be done as early as possible, otherwise VRAM probing becomes impossible. Also move xe_gt_mmio_init to the end of xe_gt_init_early(), cleaning up the init in xe_device slightly. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250619104858.418440-15-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-26 22:07:44 +02:00
Vinay Belgaumkar	6ab42fa03d	drm/xe/bmg: Update Wa_16023588340 This allows for additional L2 caching modes. Fixes: `01570b4469` ("drm/xe/bmg: implement Wa_16023588340") Cc: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://lore.kernel.org/r/20250612-wa-14022085890-v4-2-94ba5dcc1e30@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-06-12 23:23:39 -07:00
Balasubramani Vivekanandan	241cc827c0	drm/xe/mocs: Initialize MOCS index early MOCS uc_index is used even before it is initialized in the following callstack guc_prepare_xfer() __xe_guc_upload() xe_guc_min_load_for_hwconfig() xe_uc_init_hwconfig() xe_gt_init_hwconfig() Do MOCS index initialization earlier in the device probe. Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Reviewed-by: Ravi Kumar Vodapalli <ravi.kumar.vodapalli@intel.com> Link: https://lore.kernel.org/r/20250520142445.2792824-1-balasubramani.vivekanandan@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-05-29 14:29:18 -07:00
Daniele Ceraolo Spurio	12370bfcc4	drm/xe/gsc: do not flush the GSC worker from the reset path The workqueue used for the reset worker is marked as WQ_MEM_RECLAIM, while the GSC one isn't (and can't be as we need to do memory allocations in the gsc worker). Therefore, we can't flush the latter from the former. The reason why we had such a flush was to avoid interrupting either the GSC FW load or in progress GSC proxy operations. GSC proxy operations fall into 2 categories: 1) GSC proxy init: this only happens once immediately after GSC FW load and does not support being interrupted. The only way to recover from an interruption of the proxy init is to do an FLR and re-load the GSC. 2) GSC proxy request: this can happen in response to a request that the driver sends to the GSC. If this is interrupted, the GSC FW will timeout and the driver request will be failed, but overall the GSC will keep working fine. Flushing the work allowed us to avoid interruption in both cases (unless the hang came from the GSC engine itself, in which case we're toast anyway). However, a failure on a proxy request is tolerable if we're in a scenario where we're triggering a GT reset (i.e., something is already gone pretty wrong), so what we really need to avoid is interrupting the init flow, which we can do by polling on the register that reports when the proxy init is complete (as that ensure us that all the load and init operations have been completed). Note that during suspend we still want to do a flush of the worker to make sure it completes any operations involving the HW before the power is cut. v2: fix spelling in commit msg, rename waiter function (Julia) Fixes: `dd0e89e5ed` ("drm/xe/gsc: GSC FW load") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4830 Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://lore.kernel.org/r/20250502155104.2201469-1-daniele.ceraolospurio@intel.com	2025-05-05 13:36:52 -07:00
Michal Wajdeczko	f2f90989cc	drm/xe: Avoid reading RMW registers in emit_wa_job To allow VFs properly handle LRC WAs, we should postpone doing all RMW register operations and let them be run by the engine itself, since attempt to perform read registers from within the driver will fail on the VF. Use MI_MATH and ALU for that. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250303173522.1822-4-michal.wajdeczko@intel.com	2025-03-12 11:37:51 +01:00
Tvrtko Ursulin	067a974fd8	drm/xe: Add performance tunings to debugfs Add a list of active tunings to debugfs, analogous to the existing list of workarounds. Rationale being that it seems to make sense to either have both or none. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250227101304.46660-6-tvrtko.ursulin@igalia.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-02-28 21:47:33 -08:00
Tvrtko Ursulin	25d434cef7	drm/xe: Fix GT "for each engine" workarounds Any rules using engine matching are currently broken due RTP processing happening too in early init, before the list of hardware engines has been initialised. Fix this by moving workaround processing to later in the driver probe sequence, to just before the processed list is used for the first time. Looking at the debugfs gt0/workarounds on ADL-P we notice 14011060649 should be present while we see, before: GT Workarounds 14011059788 14015795083 And with the patch: GT Workarounds 14011060649 14011059788 14015795083 Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: stable@vger.kernel.org # v6.11+ Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250227101304.46660-2-tvrtko.ursulin@igalia.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-02-28 21:42:49 -08:00
Harish Chegondi	9a0b11d4cf	drm/xe/eustall: Add support to init, enable and disable EU stall sampling Implement EU stall sampling APIs introduced in the previous patch for Xe_HPC (PVC). Add register definitions and the code that accesses these registers to the APIs. Add initialization and clean up functions and their implementations, EU stall enable and disable functions. v11: Move stream->xecore_buf alloc to xe_eu_stall_data_buf_alloc(). Register xe_eu_stall_fini() with devm_add_action_or_reset() instead of calling it from xe_gt_fini(). Changed a couple of variables in struct xe_eu_stall_data_stream from unsigned int to int. v10: Fixed error rewinding code Moved code around as per review feedback v9: Moved structure definitions from xe_eu_stall.h to xe_eu_stall.c Moved read and poll implementations to the next patch Used xe_bo_create_pin_map_at_aligned instead of xe_bo_create_pin_map Changed lock names as per review feedback Moved drop data handling into a subsequent patch Moved code around as per review feedback v8: Updated copyright year in xe_eu_stall_regs.h to 2025. Renamed struct drm_xe_eu_stall_data_pvc to struct xe_eu_stall_data_pvc since it is a local structure. v6: Fix buffer wrap around over write bug (Matt Olson) Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/b6aeca593d521828a0b4fbf6cfd2844716c4fc66.1740533885.git.harish.chegondi@intel.com	2025-02-26 11:30:59 -08:00
Ilia Levi	eb79d71e50	drm/xe: Add xe_mmio_init() initialization function Add a convenience function for minimal initialization of struct xe_mmio. This function also validates that the entirety of the provided mmio region is usable with struct xe_reg. v2: Modify commit message, add kernel doc, refactor assert (Michal) v3: Fix off-by-one bug, add clarifying macro (Michal) v4: Derive bitfield width from size (Michal) Signed-off-by: Ilia Levi <ilia.levi@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250213093559.204652-1-ilia.levi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-02-18 08:27:11 -08:00
Lucas De Marchi	f5ebe80e32	drm/xe: Cleanup extra calls to xe_hw_fence_irq_finish() Now that xe_gt_remove is handled entirely by xe_gt, it's clear there are some extra calls to xe_hw_fence_irq_finish() that aren't necessary. Neither all_fw_domain_init() or gt_fw_domain_init() need to do that since it's handled by the caller on any error. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250213192909.996148-8-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-02-14 11:42:55 -08:00
Lucas De Marchi	ff6cd29b69	drm/xe: Cleanup unwind of gt initialization The only thing in xe_gt_remove() that really needs to happen on the device remove callback is the xe_uc_remove(). That's because of the following call chain: xe_gt_remove() xe_uc_remove() xe_gsc_remove() xe_gsc_proxy_remove() Move xe_gsc_proxy_remove() to be handled as a xe_device_remove_action, so it's recorded when it should run during device removal. The rest can be handled normally by devm infra. Besides removing the deep call chain above, xe_device_probe() doesn't have to unwind the gt loop and it's also more in line with the xe_device_probe() style. Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250213192909.996148-7-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-02-14 11:42:55 -08:00
Michal Wajdeczko	459777724d	drm/xe/vf: Don't try to trigger a full GT reset if VF VFs don't have access to the GDRST(0x941c) register that driver uses to reset a GT. Attempt to trigger a reset using debugfs: $ cat /sys/kernel/debug/dri/0000:00:02.1/gt0/force_reset or due to a hang condition detected by the driver leads to: [ ] xe 0000:00:02.1: [drm] GT0: trying reset from force_reset [xe] [ ] xe 0000:00:02.1: [drm] GT0: reset queued [ ] xe 0000:00:02.1: [drm] GT0: reset started [ ] ------------[ cut here ]------------ [ ] xe 0000:00:02.1: [drm] GT0: VF is trying to write 0x1 to an inaccessible register 0x941c+0x0 [ ] WARNING: CPU: 3 PID: 3069 at drivers/gpu/drm/xe/xe_gt_sriov_vf.c:996 xe_gt_sriov_vf_write32+0xc6/0x580 [xe] [ ] RIP: 0010:xe_gt_sriov_vf_write32+0xc6/0x580 [xe] [ ] Call Trace: [ ] <TASK> [ ] ? show_regs+0x6c/0x80 [ ] ? __warn+0x93/0x1c0 [ ] ? xe_gt_sriov_vf_write32+0xc6/0x580 [xe] [ ] ? report_bug+0x182/0x1b0 [ ] ? handle_bug+0x6e/0xb0 [ ] ? exc_invalid_op+0x18/0x80 [ ] ? asm_exc_invalid_op+0x1b/0x20 [ ] ? xe_gt_sriov_vf_write32+0xc6/0x580 [xe] [ ] ? xe_gt_sriov_vf_write32+0xc6/0x580 [xe] [ ] ? xe_gt_tlb_invalidation_reset+0xef/0x110 [xe] [ ] ? __mutex_unlock_slowpath+0x41/0x2e0 [ ] xe_mmio_write32+0x64/0x150 [xe] [ ] do_gt_reset+0x2f/0xa0 [xe] [ ] gt_reset_worker+0x14e/0x1e0 [xe] [ ] process_one_work+0x21c/0x740 [ ] worker_thread+0x1db/0x3c0 Fix that by sending H2G VF_RESET(0x5507) action instead. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4078 Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250131182502.852-1-michal.wajdeczko@intel.com	2025-02-04 15:31:45 +01:00
Michal Wajdeczko	9ebb5846e1	drm/xe/pf: Fix migration initialization The migration support only needs to be initialized once, but it was incorrectly called from the xe_gt_sriov_pf_init_hw(), which is part of the reset flow and may be called multiple times. Fixes: `d86e3737c7` ("drm/xe/pf: Add functions to save and restore VF GuC state") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250120232443.544-1-michal.wajdeczko@intel.com	2025-01-21 21:52:47 +01:00
Michal Wajdeczko	bbd8429264	drm/xe: Always setup GT MMIO adjustment data While we believed that xe_gt_mmio_init() will be called just once per GT, this might not be a case due to some tweaks that need to performed by the VF driver during early probe. To avoid leaving any stale data in case of the re-run, reset the GT MMIO adjustment data for the non-media GT case. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114175955.2299-2-michal.wajdeczko@intel.com	2025-01-18 22:03:42 +01:00
Lucas De Marchi	5001ef3af8	drm/xe: Fix tlb invalidation when wedging If GuC fails to load, the driver wedges, but in the process it tries to do stuff that may not be initialized yet. This moves the xe_gt_tlb_invalidation_init() to be done earlier: as its own doc says, it's a software-only initialization and should had been named with the _early() suffix. Move it to be called by xe_gt_init_early(), so the locks and seqno are initialized, avoiding a NULL ptr deref when wedging: xe 0000:03:00.0: [drm] ERROR GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01 xe 0000:03:00.0: [drm] ERROR GT0: firmware signature verification failed xe 0000:03:00.0: [drm] ERROR CRITICAL: Xe has declared device 0000:03:00.0 as wedged. ... BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 9 UID: 0 PID: 3908 Comm: modprobe Tainted: G U W 6.13.0-rc4-xe+ #3 Tainted: [U]=USER, [W]=WARN Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-S ADP-S DDR5 UDIMM CRB, BIOS ADLSFWI1.R00.3275.A00.2207010640 07/01/2022 RIP: 0010:xe_gt_tlb_invalidation_reset+0x75/0x110 [xe] This can be easily triggered by poking the GuC binary to force a signature failure. There will still be an extra message, xe 0000:03:00.0: [drm] ERROR GT0: GuC mmio request 0x4100: no reply 0x4100 but that's better than a NULL ptr deref. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3956 Fixes: `7dbe8af13c` ("drm/xe: Wedge the entire device") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250103001111.331684-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-01-03 12:43:01 -08:00

1 2 3 4

196 Commits