Commit Graph

7 Commits

Author SHA1 Message Date
Lucas De Marchi
292b1a8a50 drm/xe: Stop ignoring errors from xe_heci_gsc_init()
Do not ignore errors from xe_heci_gsc_init(). For example, it shouldn't
be fine to report successfully entering survivability mode when there's
no communication with gsc working. The driver should also not be
half-initialized in the normal case neither.

Cc: Riana Tauro <riana.tauro@intel.com>
Cc: Alexander Usyskin <alexander.usyskin@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-10-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-25 14:32:03 -08:00
Lucas De Marchi
d40f275d96 drm/xe: Move survivability entirely to xe_pci
There's an odd split between xe_pci.c and xe_device.c wrt
xe_survivability: it's initialized by xe_device, but then finalized by
xe_pci. Move it entirely to the outer layer, xe_pci, so it controls
the flow entirely.

This also allows to stop ignoring some of the errors. E.g.: if there's
an -ENOMEM, it shouldn't continue as if it survivability had been
enabled.

One change worth mentioning is that if "wait for lmem" fails, it will
also check the pcode status to decide if it should enter or not in
survivability mode, which it was not doing before. The bit from pcode
for that decision should remain the same after lmem failed
initialization, so it should be fine.

Cc: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Riana Tauro <riana.tauro@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-9-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-25 14:32:03 -08:00
Lucas De Marchi
83e3d08767 drm/xe: Stop setting drvdata to NULL
PCI subsystem is not supposed to call the remove() function when probe
fails and doesn't need a protection for that. The only places checking
for NULL drvdata, is on 2 sysfs files and they shouldn't be needed since
the files are removed and reads on open fds just return an error.

For this protection the core driver implementation in
drivers/base/dd.c:device_unbind_cleanup() already sets it to NULL, after
the release of dev resources.

Remove the setting to NULL so it's possible to obtain the xe pointer
from callbacks like the component unbind from device_unbind_cleanup(),
i.e. after xe_pci_remove() already finished.

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-5-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-25 14:29:06 -08:00
Riana Tauro
d9bc304437 drm/xe: Skip survivability mode for VF
Follow the probe flow in case of VF and do not enter survivability mode
in case of pcode init failure.

Fixes: 5e940312a2 ("drm/xe: Add functions and sysfs for boot survivability")
Suggested-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250131080527.2256475-1-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-01-31 05:40:10 -05:00
Riana Tauro
8b47c9cdb6 drm/xe: Initialize mei-gsc and vsec in survivability mode
Initialize mei-gsc in survivability mode and disable HECI
interrupts. Also initialize vsec in survivability mode

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Alexander Usyskin <alexander.usyskin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250128095632.1294722-4-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-01-28 08:58:46 -05:00
Riana Tauro
256daa32c9 drm/xe: Enable Boot Survivability mode
Enable boot survivability mode if pcode initialization fails and
if boot status indicates a failure. In this mode, drm card is not
exposed and driver probe returns success after loading the bare minimum
to allow firmware to be flashed via mei.

v2: abstract survivability mode variable
    add BMG check inside function (Jani, Rodrigo)

v3: return -EBUSY during system suspend (Anshuman)
    check survivability mode in pci probe only
    on error

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250128095632.1294722-3-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-01-28 08:58:46 -05:00
Riana Tauro
5e940312a2 drm/xe: Add functions and sysfs for boot survivability
Boot Survivability is a software based workflow for recovering a system
in a failed boot state. Here system recoverability is concerned with
recovering the firmware responsible for boot.

This is implemented by loading the driver with bare minimum (no drm card)
to allow the firmware to be flashed through mei-gsc and collect telemetry.
The driver's probe flow is modified such that it enters survivability mode
when pcode initialization is incomplete and boot status denotes a failure.
In this mode, drm card is not exposed and presence of survivability_mode
entry in PCI sysfs  is used to indicate survivability mode and
provide additional information required for debug

This patch adds initialization functions and exposes admin
readable sysfs entries

The new sysfs will have the below layout

	/sys/bus/.../bdf
             	     ├── survivability_mode

v2: reorder headers
    fix doc
    remove survivability info and use mode to display information
    use separate function for logging survivability information
    for critical error (Rodrigo)

v3: use for loop
    use dev logs instead of drm
    use helper function for aux history(Rodrigo)
    remove unnecessary error check of greater than max_scratch
    as we are reading only 3 bit

v4: fix checkpatch warnings
    fix space (Rodrigo)
    rename register

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Acked-by: Ashwin Kumar Kulkarni <ashwin.kumar.kulkarni@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250128095632.1294722-2-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-01-28 08:58:45 -05:00