drm/ras: Introduce the DRM RAS infrastructure over generic netlink

Introduces the DRM RAS infrastructure over generic netlink.

The new interface allows drivers to expose RAS nodes and their
associated error counters to userspace in a structured and extensible
way. Each drm_ras node can register its own set of error counters, which
are then discoverable and queryable through netlink operations. This
lays the groundwork for reporting and managing hardware error states
in a unified manner across different DRM drivers.

Currently it only supports error-counter nodes. But it can be
extended later.

The registration is also not tied to any drm node, so it can be
used by accel devices as well.

It uses the new and mandatory YAML description format stored in
Documentation/netlink/specs/. This forces a single generic netlink
family namespace for the entire drm: "drm-ras".
But multiple-endpoints are supported within the single family.

Any modification to this API needs to be applied to
Documentation/netlink/specs/drm_ras.yaml before regenerating the
code:

 $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
 -o include/uapi/drm/drm_ras.h

 $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode kernel \
 --header -o drivers/gpu/drm/drm_ras_nl.h

 $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml \
 --mode kernel --source -o drivers/gpu/drm/drm_ras_nl.c

Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org
Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patch.msgid.link/20260304074412.464435-8-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
103
Documentation/gpu/drm-ras.rst
Normal file
@@ -0,0 +1,103 @@
.. SPDX-License-Identifier: GPL-2.0+

============================
DRM RAS over Generic Netlink
============================

The DRM RAS (Reliability, Availability, Serviceability) interface provides a
standardized way for GPU/accelerator drivers to expose error counters and
other reliability nodes to user space via Generic Netlink. This allows
diagnostic tools, monitoring daemons, or test infrastructure to query hardware
health in a uniform way across different DRM drivers.

Key Goals:

* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
  data center monitoring and reliability operations.
* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
  specifications and centralize all RAS-related communication in one namespace.
* Support a basic error counter interface, addressing the immediate, essential
  monitoring needs.
* Offer a flexible, future-proof interface that can be extended to support
  additional types of RAS data in the future.
* Allow multiple nodes per driver, enabling drivers to register separate
  nodes for different IP blocks, sub-blocks, or other logical subdivisions
  as applicable.

Nodes
=====

Nodes are logical abstractions representing an error type or error source within
the device. Currently, only error counter nodes are supported.

Drivers are responsible for registering and unregistering nodes via the
`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.

Node Management
---------------

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :doc: DRM RAS Node Management

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :internal:

Generic Netlink Usage
=====================

The interface is implemented as a Generic Netlink family named ``drm-ras``.
User space tools can:

* List registered nodes with the ``list-nodes`` command.
* List all error counters in a node with the ``get-error-counter`` command,
  passing ``node-id`` as a parameter.
* Query specific error counter values with the ``get-error-counter`` command,
  using both ``node-id`` and ``error-id`` as parameters.

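
The three operations listed above can be sketched as ynl command lines. This
is a minimal illustration that only assembles argument vectors in the style of
the ``ynl`` examples in this document; the helper name ``ynl_argv`` is
hypothetical, and nothing is sent to the kernel.

```python
import json


def ynl_argv(op, params=None, dump=False):
    """Build a ynl command line for the drm_ras spec (hypothetical helper)."""
    argv = ["ynl", "--family", "drm_ras", "--dump" if dump else "--do", op]
    if params is not None:
        # Request attributes are passed as a JSON object.
        argv += ["--json", json.dumps(params)]
    return argv


# List registered nodes.
list_nodes = ynl_argv("list-nodes", dump=True)
# List all error counters of node 0.
all_counters = ynl_argv("get-error-counter", {"node-id": 0}, dump=True)
# Query a single error counter of node 0.
one_counter = ynl_argv("get-error-counter", {"node-id": 0, "error-id": 1})

for argv in (list_nodes, all_counters, one_counter):
    print(argv)
```

Listing all counters and querying one counter use the same command; in this
sketch they differ only in dump versus do mode and in whether ``error-id`` is
supplied.
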
YAML-based Interface
--------------------

The interface is described in a YAML specification,
``Documentation/netlink/specs/drm_ras.yaml``.

This YAML is used to auto-generate user space bindings via
``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
attributes and operations.

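
The commit that introduces this interface regenerates three files from the
spec after every change. A minimal sketch that assembles those command lines is
shown here; the paths and options are taken from that commit message, while the
names ``TARGETS`` and ``regen_commands`` are illustrative only. The commands
are assumed to be run from the top of the kernel tree.

```python
SPEC = "Documentation/netlink/specs/drm_ras.yaml"
GENERATOR = "tools/net/ynl/pyynl/ynl_gen_c.py"

# (mode, kind, output file) for each generated file.
TARGETS = [
    ("uapi", "--header", "include/uapi/drm/drm_ras.h"),
    ("kernel", "--header", "drivers/gpu/drm/drm_ras_nl.h"),
    ("kernel", "--source", "drivers/gpu/drm/drm_ras_nl.c"),
]


def regen_commands():
    """Return one argument vector per generated file."""
    return [
        [GENERATOR, "--spec", SPEC, "--mode", mode, kind, "-o", out]
        for mode, kind, out in TARGETS
    ]


for argv in regen_commands():
    print(" ".join(argv))
```
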
Usage Notes
-----------

* User space must first enumerate nodes to obtain their IDs.
* Node IDs or node names can be used for all further queries, such as error
  counters.
* Error counters can be queried by either the error ID or the error name.
* Query parameters should be defined as part of the uAPI to ensure user
  interface stability.
* The interface supports future extension by adding new node types and
  additional attributes.

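
The enumerate-then-query flow in the notes above can be sketched in user space
as follows. The sample data mirrors the ``list-nodes`` reply shown in the
examples in this document; the helper ``find_node_id`` is hypothetical and
performs the name lookup on the client side.

```python
# Sample list-nodes reply, mirroring the example output in this document.
NODES = [
    {"device-name": "0000:03:00.0", "node-id": 0,
     "node-name": "correctable-errors", "node-type": "error-counter"},
    {"device-name": "0000:03:00.0", "node-id": 1,
     "node-name": "uncorrectable-errors", "node-type": "error-counter"},
]


def find_node_id(nodes, device_name, node_name):
    """Return the node-id matching a device and node name, or None."""
    for node in nodes:
        if (node["device-name"] == device_name
                and node["node-name"] == node_name):
            return node["node-id"]
    return None


node_id = find_node_id(NODES, "0000:03:00.0", "uncorrectable-errors")
print(node_id)
```

The resulting ID can then be passed as ``node-id`` to ``get-error-counter``.
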
Example: List nodes using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump list-nodes
    [{'device-name': '0000:03:00.0',
      'node-id': 0,
      'node-name': 'correctable-errors',
      'node-type': 'error-counter'},
     {'device-name': '0000:03:00.0',
      'node-id': 1,
      'node-name': 'uncorrectable-errors',
      'node-type': 'error-counter'}]

Example: List all error counters using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}'
    [{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
     {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}]

Example: Query an error counter for a given node

.. code-block:: bash

    sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
    {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}

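
A monitoring script might post-process such output. The following is a minimal
sketch, assuming the reply is printed as a Python literal exactly as in the
examples above; the helper ``nonzero_counters`` is hypothetical.

```python
import ast

# Output of the get-error-counter dump, copied from the example above.
YNL_OUTPUT = """
[{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
 {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}]
"""


def nonzero_counters(text):
    """Parse Python-literal output and keep counters with a value above 0."""
    counters = ast.literal_eval(text.strip())
    return [c for c in counters if c["error-value"] > 0]


# Every sample counter is 0, so nothing is reported here.
print(nonzero_counters(YNL_OUTPUT))
```

``ast.literal_eval`` is used instead of ``eval`` so that only literals are
accepted from the tool output.
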
@@ -9,6 +9,7 @@ GPU Driver Developer's Guide
   drm-mm
   drm-kms
   drm-kms-helpers
   drm-ras
   drm-uapi
   drm-usage-stats
   driver-uapi