linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-25 10:02:31 -04:00

Author	SHA1	Message	Date
Michal Wajdeczko	3088f485de	drm/xe/configfs: Don't expose survivability_mode if not applicable The survivability_mode attribute is applicable only for DGFX and platforms newer than BATTLEMAGE. Use .is_visible() hook to hide this attribute when above conditions are not met. Remove code that was trying to fix such configuration during the runtime. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250902131744.5076-4-michal.wajdeczko@intel.com	2025-09-04 22:33:51 +02:00
Michal Wajdeczko	079a5c83db	drm/xe/configfs: Don't touch survivability_mode on fini This is a user controlled configfs attribute, we should not modify that outside the configfs attr.store() implementation. Fixes: `bc417e54e2` ("drm/xe: Enable configfs support for survivability mode") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250904103521.7130-1-michal.wajdeczko@intel.com	2025-09-04 22:31:26 +02:00
Riana Tauro	a7df563b45	drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Add support to handle CSC firmware reported errors. When CSC firmware errors are encoutered, a error interrupt is received by the GFX device as a MSI interrupt. Device Source control registers indicates the source of the error as CSC The HEC error status register indicates that the error is firmware reported Depending on the type of error, the error cause is written to the HEC Firmware error register. On encountering such CSC firmware errors, the graphics device is non-recoverable from driver context. The only way to recover from these errors is firmware flash. System admin/userspace is notified of the necessity of firmware flash with a combination of vendor-specific drm device edged uevent, dmesg logs and runtime survivability sysfs. It is the responsiblity of the consumer to verify all the actions and then trigger a firmware flash using tools like fwupd. $ udevadm monitor --property --kernel monitor will print the received events for: KERNEL - the kernel uevent KERNEL[754.709341] change /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm) ACTION=change DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 SUBSYSTEM=drm WEDGED=vendor-specific DEVNAME=/dev/dri/card0 DEVTYPE=drm_minor SEQNUM=5973 MAJOR=226 MINOR=0 Logs xe 0000:03:00.0: [drm] ERROR [Hardware Error]: Tile0 reported NONFATAL error 0x20000 xe 0000:03:00.0: [drm] ERROR [Hardware Error]: NONFATAL: HEC Uncorrected FW FD Corruption error reported, bit[2] is set xe 0000:03:00.0: Runtime Survivability mode enabled xe 0000:03:00.0: [drm] ERROR CRITICAL: Xe has declared device 0000:03:00.0 as wedged. IOCTLs and executions are blocked. Only a rebind may clear the failure Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new xe 0000:03:00.0: [drm] device wedged, needs recovery xe 0000:03:00.0: Firmware flash required, Please refer to the userspace documentation for more details! Runtime survivability Sysfs: /sys/bus/pci/devices/<device>/survivability_mode v2: use vendor recovery method with runtime survivability (Christian, Rodrigo, Raag) v3: move declare wedged to runtime survivability mode (Rodrigo) v4: update commit message Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-10-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	f646c9f937	drm/xe/doc: Document device wedged and runtime survivability Add documentation for vendor specific device wedged recovery method and runtime survivability. v2: fix documentation (Raag) v3: add userspace tool for firmware update (Raag) v4: use consistent documentation (Raag) v5: add more documentation Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-8-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	a2ca0633a0	drm/xe/xe_survivability: Add support for Runtime survivability mode Certain runtime firmware errors can cause the device to be in a unusable state requiring a firmware flash to restore normal operation. Runtime Survivability Mode indicates firmware flash is necessary by wedging the device and exposing survivability mode sysfs. The below sysfs is an indication that device is in survivability mode /sys/bus/pci/devices/<device>/survivability_mode v2: Fix kernel-doc (Umesh) v3: Add user friendly dmesg (Frank) Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-7-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	41ff795aff	drm/xe/xe_survivability: Refactor survivability mode Refactor survivability mode code to support both boot and runtime survivability. Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-6-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	f5c5d29522	drm/xe/xe_i2c: Add support for i2c in survivability mode Initialize i2c in survivability mode to allow firmware update of Add-In Management Controller (AMC) in survivability mode. Signed-off-by: Riana Tauro <riana.tauro@intel.com> Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://lore.kernel.org/r/20250701122252.2590230-6-heikki.krogerus@linux.intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-07-10 10:19:41 -04:00
Riana Tauro	bc417e54e2	drm/xe: Enable configfs support for survivability mode Enable survivability mode if supported and configfs attribute is set. Enabling survivability mode manually is useful in cases where pcode does not detect failure, validation and for IFR (in-field-repair). To set configfs survivability mode attribute for a device echo 1 > /sys/kernel/config/xe/0000:03:00.0/survivability_mode The card enters survivability mode if supported v2: add a log if survivability mode is enabled for unsupported platforms (Rodrigo) Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250407051414.1651616-4-riana.tauro@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-04-08 22:24:00 -07:00
Riana Tauro	77052ab245	drm/xe: Add documentation for survivability mode Add survivability mode document to pcode document as it is enabled when pcode detects a failure. v2: fix kernel-doc (Lucas) Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250407051414.1651616-3-riana.tauro@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-04-08 22:23:59 -07:00
Lucas De Marchi	14efa739ca	drm/xe: Set survivability mode before heci init Commit `d40f275d96` ("drm/xe: Move survivability entirely to xe_pci") tried to follow the logic: initialize everything needed and if everything succeeds, set the flag that it's enabled. While it fixed some corner cases of those calls failing, it was wrong for setting the flag after the call to xe_heci_gsc_init(): that function does a different initialization for survivability mode. Fix that and add comments about this being done on purpose. Suggested-by: Riana Tauro <riana.tauro@intel.com> Fixes: `d40f275d96` ("drm/xe: Move survivability entirely to xe_pci") Reviewed-by: Riana Tauro <riana.tauro@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250314-fix-survivability-v5-2-fdb3559ea965@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-03-21 11:48:22 -07:00
Lucas De Marchi	86b5e0dbba	drm/xe: Move survivability back to xe Commit `d40f275d96` ("drm/xe: Move survivability entirely to xe_pci") moved the survivability handling to be done entirely in the xe_pci layer. However there are some issues with that approach: 1) Survivability mode needs at least the mmio initialized, otherwise it can't really read a register to decide if it should enter that state 2) SR-IOV mode should be initialized, otherwise it's not possible to check if it's VF Besides, as pointed by Riana the check for xe_survivability_mode_enable() was wrong in xe_pci_probe() since it's not a bool return. Fix that by moving the initialization to be entirely in the xe_device layer, with the correct dependencies handled: only after mmio and sriov initialization, and not triggering it on error from wait_for_lmem_ready(). This restores the trigger behavior before that commit. The xe_pci layer now only checks for "is it enabled?", like it's doing in xe_pci_suspend()/xe_pci_remove(), etc. Cc: Riana Tauro <riana.tauro@intel.com> Fixes: `d40f275d96` ("drm/xe: Move survivability entirely to xe_pci") Reviewed-by: Riana Tauro <riana.tauro@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250314-fix-survivability-v5-1-fdb3559ea965@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-03-21 11:48:09 -07:00
Lucas De Marchi	292b1a8a50	drm/xe: Stop ignoring errors from xe_heci_gsc_init() Do not ignore errors from xe_heci_gsc_init(). For example, it shouldn't be fine to report successfully entering survivability mode when there's no communication with gsc working. The driver should also not be half-initialized in the normal case neither. Cc: Riana Tauro <riana.tauro@intel.com> Cc: Alexander Usyskin <alexander.usyskin@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-10-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-02-25 14:32:03 -08:00
Lucas De Marchi	d40f275d96	drm/xe: Move survivability entirely to xe_pci There's an odd split between xe_pci.c and xe_device.c wrt xe_survivability: it's initialized by xe_device, but then finalized by xe_pci. Move it entirely to the outer layer, xe_pci, so it controls the flow entirely. This also allows to stop ignoring some of the errors. E.g.: if there's an -ENOMEM, it shouldn't continue as if it survivability had been enabled. One change worth mentioning is that if "wait for lmem" fails, it will also check the pcode status to decide if it should enter or not in survivability mode, which it was not doing before. The bit from pcode for that decision should remain the same after lmem failed initialization, so it should be fine. Cc: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Riana Tauro <riana.tauro@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-9-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-02-25 14:32:03 -08:00
Lucas De Marchi	83e3d08767	drm/xe: Stop setting drvdata to NULL PCI subsystem is not supposed to call the remove() function when probe fails and doesn't need a protection for that. The only places checking for NULL drvdata, is on 2 sysfs files and they shouldn't be needed since the files are removed and reads on open fds just return an error. For this protection the core driver implementation in drivers/base/dd.c:device_unbind_cleanup() already sets it to NULL, after the release of dev resources. Remove the setting to NULL so it's possible to obtain the xe pointer from callbacks like the component unbind from device_unbind_cleanup(), i.e. after xe_pci_remove() already finished. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-5-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-02-25 14:29:06 -08:00
Riana Tauro	d9bc304437	drm/xe: Skip survivability mode for VF Follow the probe flow in case of VF and do not enter survivability mode in case of pcode init failure. Fixes: `5e940312a2` ("drm/xe: Add functions and sysfs for boot survivability") Suggested-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250131080527.2256475-1-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-01-31 05:40:10 -05:00
Riana Tauro	8b47c9cdb6	drm/xe: Initialize mei-gsc and vsec in survivability mode Initialize mei-gsc in survivability mode and disable HECI interrupts. Also initialize vsec in survivability mode Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Alexander Usyskin <alexander.usyskin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250128095632.1294722-4-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-01-28 08:58:46 -05:00
Riana Tauro	256daa32c9	drm/xe: Enable Boot Survivability mode Enable boot survivability mode if pcode initialization fails and if boot status indicates a failure. In this mode, drm card is not exposed and driver probe returns success after loading the bare minimum to allow firmware to be flashed via mei. v2: abstract survivability mode variable add BMG check inside function (Jani, Rodrigo) v3: return -EBUSY during system suspend (Anshuman) check survivability mode in pci probe only on error Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250128095632.1294722-3-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-01-28 08:58:46 -05:00
Riana Tauro	5e940312a2	drm/xe: Add functions and sysfs for boot survivability Boot Survivability is a software based workflow for recovering a system in a failed boot state. Here system recoverability is concerned with recovering the firmware responsible for boot. This is implemented by loading the driver with bare minimum (no drm card) to allow the firmware to be flashed through mei-gsc and collect telemetry. The driver's probe flow is modified such that it enters survivability mode when pcode initialization is incomplete and boot status denotes a failure. In this mode, drm card is not exposed and presence of survivability_mode entry in PCI sysfs is used to indicate survivability mode and provide additional information required for debug This patch adds initialization functions and exposes admin readable sysfs entries The new sysfs will have the below layout /sys/bus/.../bdf ├── survivability_mode v2: reorder headers fix doc remove survivability info and use mode to display information use separate function for logging survivability information for critical error (Rodrigo) v3: use for loop use dev logs instead of drm use helper function for aux history(Rodrigo) remove unnecessary error check of greater than max_scratch as we are reading only 3 bit v4: fix checkpatch warnings fix space (Rodrigo) rename register Signed-off-by: Riana Tauro <riana.tauro@intel.com> Acked-by: Ashwin Kumar Kulkarni <ashwin.kumar.kulkarni@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250128095632.1294722-2-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-01-28 08:58:45 -05:00

18 Commits