Pull drm fixes from Dave Airlie:
"This is pretty much two weeks worth of fixes, plus one thing that
might be considered next: amdkfd is now able to be enabled on risc-v
platforms.
Otherwise, amdgpu and xe with the majority of fixes, and then a
smattering all over.
panel:
- nt37801: fix IS_ERR
- nt37801: fix KConfig
connector:
- Fix null deref in HDMI audio helper.
bridge:
- analogix_dp: fixup clk-disable removal
nouveau:
- minor typo fix (',' vs ';')
msm:
- mailmap updates
i915:
- Fix the enabling/disabling of DP audio SDP splitting
- Fix PSR register definitions for ALPM
- Fix u32 overflow in SNPS PHY HDMI PLL setup
- Fix GuC pending message underflow when submit fails
- Fix GuC wakeref underflow race during reset
xe:
- Two documentation fixes
- A couple of vm init fixes
- Hwmon fixes
- Drop reduntant conversion to bool
- Fix CONFIG_INTEL_VSEC dependency
- Rework eviction rejection of bound external bos
- Stop re-submitting signalled jobs
- A couple of pxp fixes
- Add back a fix that got lost in a merge
- Create LRC bo without VM
- Fix for the above fix
amdgpu:
- UserQ fixes
- SMU 13.x fixes
- VCN fixes
- JPEG fixes
- Misc cleanups
- runtime pm fix
- DCN 4.0.1 fixes
- Misc display fixes
- ISP fix
- VRAM manager fix
- RAS fixes
- IP discovery fix
- Cleaner shader fix for GC 10.1.x
- OD fix
- Non-OLED panel fix
- Misc display fixes
- Brightness fixes
amdkfd:
- Enable CONFIG_HSA_AMD on RISCV
- SVM fix
- Misc cleanups
- Ref leak fix
- WPTR BO fix
radeon:
- Misc cleanups"
* tag 'drm-next-2025-06-06' of https://gitlab.freedesktop.org/drm/kernel: (105 commits)
drm/nouveau/vfn/r535: Convert comma to semicolon
drm/xe: remove unmatched xe_vm_unlock() from __xe_exec_queue_init()
drm/xe: Create LRC BO without VM
drm/xe/guc_submit: add back fix
drm/xe/pxp: Clarify PXP queue creation behavior if PXP is not ready
drm/xe/pxp: Use the correct define in the set_property_funcs array
drm/xe/sched: stop re-submitting signalled jobs
drm/xe: Rework eviction rejection of bound external bos
drm/xe/vsec: fix CONFIG_INTEL_VSEC dependency
drm/xe: drop redundant conversion to bool
drm/xe/hwmon: Move card reactive critical power under channel card
drm/xe/hwmon: Add support to manage power limits though mailbox
drm/xe/vm: move xe_svm_init() earlier
drm/xe/vm: move rebind_work init earlier
MAINTAINERS: .mailmap: update Rob Clark's email address
mailmap: Update entry for Akhil P Oommen
MAINTAINERS: update my email address
MAINTAINERS: drop myself as maintainer
drm/i915/display: Fix u32 overflow in SNPS PHY HDMI PLL setup
drm/amd/display: Fix default DC and AC levels
...
Pull drm updates from Dave Airlie:
"As part of building up nova-core/nova-drm pieces we've brought in some
rust abstractions through this tree, aux bus being the main one, with
devres changes also in the driver-core tree. Along with the drm core
abstractions and enough nova-core/nova-drm to use them. This is still
all stub work under construction, to build the nova driver upstream.
The other big NVIDIA related one is nouveau adds support for
Hopper/Blackwell GPUs, this required a new GSP firmware update to
570.144, and a bunch of rework in order to support multiple fw
interfaces.
There is also the introduction of an asahi uapi header file as a
precursor to getting the real driver in later, but to unblock
userspace mesa packages while the driver is trapped behind rust
enablement.
Otherwise it's the usual mixture of stuff all over, amdgpu, i915/xe,
and msm being the main ones, and some changes to vsprintf.
new drivers:
- bring in the asahi uapi header standalone
- nova-drm: stub driver
rust dependencies (for nova-core):
- auxiliary
- bus abstractions
- driver registration
- sample driver
- devres changes from driver-core
- revocable changes
core:
- add Apple fourcc modifiers
- add virtio capset definitions
- extend EXPORT_SYNC_FILE for timeline syncobjs
- convert to devm_platform_ioremap_resource
- refactor shmem helper page pinning
- DP powerup/down link helpers
- extended %p4cc in vsprintf.c to support fourcc prints
- change vsprintf %p4cn to %p4chR, remove %p4cn
- Add drm_file_err function
- IN_FORMATS_ASYNC property
- move sitronix from tiny to their own subdir
rust:
- add drm core infrastructure rust abstractions
(device/driver, ioctl, file, gem)
dma-buf:
- adjust sg handling to not cache map on attach
- allow setting dma-device for import
- Add a helper to sort and deduplicate dma_fence arrays
docs:
- updated drm scheduler docs
- fbdev todo update
- fb rendering
- actual brightness
ttm:
- fix delayed destroy resv object
bridge:
- add kunit tests
- convert tc358775 to atomic
- convert drivers to devm_drm_bridge_alloc
- convert rk3066_hdmi to bridge driver
scheduler:
- add kunit tests
panel:
- refcount panels to improve lifetime handling
- Powertip PH128800T004-ZZA01
- NLT NL13676BC25-03F, Tianma TM070JDHG34-00
- Himax HX8279/HX8279-D DDIC
- Visionox G2647FB105
- Sitronix ST7571
- ZOTAC rotation quirk
vkms:
- allow attaching more displays
i915:
- xe3lpd display updates
- vrr refactor
- intel_display struct conversions
- xe2hpd memory type identification
- add link rate/count to i915_display_info
- cleanup VGA plane handling
- refactor HDCP GSC
- fix SLPC wait boosting reference counting
- add 20ms delay to engine reset
- fix fence release on early probe errors
xe:
- SRIOV updates
- BMG PCI ID update
- support separate firmware for each GT
- SVM fix, prelim SVM multi-device work
- export fan speed
- temp disable d3cold on BMG
- backup VRAM in PM notifier instead of suspend/freeze
- update xe_ttm_access_memory to use GPU for non-visible access
- fix guc_info debugfs for VFs
- use copy_from_user instead of __copy_from_user
- append PCIe gen5 limitations to xe_firmware document
amdgpu:
- DSC cleanup
- DC Scaling updates
- Fused I2C-over-AUX updates
- DMUB updates
- Use drm_file_err in amdgpu
- Enforce isolation updates
- Use new dma_fence helpers
- USERQ fixes
- Documentation updates
- SR-IOV updates
- RAS updates
- PSP 12 cleanups
- GC 9.5 updates
- SMU 13.x updates
- VCN / JPEG SR-IOV updates
amdkfd:
- Update error messages for SDMA
- Userptr updates
- XNACK fixes
radeon:
- CIK doorbell cleanup
nouveau:
- add support for NVIDIA r570 GSP firmware
- enable Hopper/Blackwell support
nova-core:
- fix task list
- register definition infrastructure
- move firmware into own rust module
- register auxiliary device for nova-drm
nova-drm:
- initial driver skeleton
msm:
- GPU:
- ACD (adaptive clock distribution) for X1-85
- drop fictional address_space_size
- improve GMU HFI response time out robustness
- fix crash when throttling during boot
- DPU:
- use single CTL path for flushing on DPU 5.x+
- improve SSPP allocation code for better sharing
- Enabled SmartDMA on SM8150, SC8180X, SC8280XP, SM8550
- Added SAR2130P support
- Disabled DSC support on MSM8937, MSM8917, MSM8953, SDM660
- DP:
- switch to new audio helpers
- better LTTPR handling
- DSI:
- Added support for SA8775P
- Added SAR2130P support
- HDMI:
- Switched to use new helpers for ACR data
- Fixed old standing issue of HPD not working in some cases
amdxdna:
- add dma-buf support
- allow empty command submits
renesas:
- add dma-buf support
- add zpos, alpha, blend support
panthor:
- fail properly for NO_MMAP bos
- add SET_LABEL ioctl
- debugfs BO dumping support
imagination:
- update DT bindings
- support TI AM68 GPU
hibmc:
- improve interrupt handling and HPD support
virtio:
- add panic handler support
rockchip:
- add RK3588 support
- add DP AUX bus panel support
ivpu:
- add heartbeat based hangcheck
mediatek:
- prepares support for MT8195/99 HDMIv2/DDCv2
anx7625:
- improve HPD
tegra:
- speed up firmware loading
* tag 'drm-next-2025-05-28' of https://gitlab.freedesktop.org/drm/kernel: (1627 commits)
drm/nouveau/tegra: Fix error pointer vs NULL return in nvkm_device_tegra_resource_addr()
drm/xe: Default auto_link_downgrade status to false
drm/xe/guc: Make creation of SLPC debugfs files conditional
drm/i915/display: Add check for alloc_ordered_workqueue() and alloc_workqueue()
drm/i915/dp_mst: Work around Thunderbolt sink disconnect after SINK_COUNT_ESI read
drm/i915/ptl: Use everywhere the correct DDI port clock select mask
drm/nouveau/kms: add support for GB20x
drm/dp: add option to disable zero sized address only transactions.
drm/nouveau: add support for GB20x
drm/nouveau/gsp: add hal for fifo.chan.doorbell_handle
drm/nouveau: add support for GB10x
drm/nouveau/gf100-: track chan progress with non-WFI semaphore release
drm/nouveau/nv50-: separate CHANNEL_GPFIFO handling out from CHANNEL_DMA
drm/nouveau: add helper functions for allocating pinned/cpu-mapped bos
drm/nouveau: add support for GH100
drm/nouveau: improve handling of 64-bit BARs
drm/nouveau/gv100-: switch to volta semaphore methods
drm/nouveau/gsp: support deeper page tables in COPY_SERVER_RESERVED_PDES
drm/nouveau/gsp: init client VMMs with NV0080_CTRL_DMA_SET_PAGE_DIRECTORY
drm/nouveau/gsp: fetch level shift and PDE from BAR2 VMM
...
Context Timestamp (CTX_TIMESTAMP) in the LRC accumulates the run ticks
of the context, but only gets updated when the context switches out. In
order to check how long a context has been active before it switches
out, two things are required:
(1) Determine if the context is running:
To do so, we program the WA BB to set an initial value for CTX_TIMESTAMP
in the LRC. The value chosen is 1 since 0 is the initial value when the
LRC is initialized. During a query, we just check for this value to
determine if the context is active. If the context switched out, it
would overwrite this location with the actual CTX_TIMESTAMP MMIO value.
Note that WA BB runs as the last part of the context restore, so reusing
this LRC location will not clobber anything.
(2) Calculate the time that the context has been active for:
The CTX_TIMESTAMP ticks only when the context is active. If a context is
active, we just use the CTX_TIMESTAMP MMIO as the new value of
utilization. While doing so, we need to read the CTX_TIMESTAMP MMIO
for the specific engine instance. Since we do not know which instance
the context is running on until it is scheduled, we also read the
ENGINE_ID MMIO in the WA BB and store it in the PPHSWP.
Using the above 2 instructions in a WA BB, capture active context
utilization.
v2: (Matt Brost)
- This breaks TDR, fix it by saving the CTX_TIMESTAMP register
"drm/xe: Save CTX_TIMESTAMP mmio value instead of LRC value"
- Drop tile from LRC if using gt
"drm/xe: Save the gt pointer in LRC and drop the tile"
v3:
- Remove helpers for bb_per_ctx_ptr (Matt)
- Add define for context active value (Matt)
- Use 64 bit CTX TIMESTAMP for platforms that support it. For platforms
that don't, live with the rare race. (Matt, Lucas)
- Convert engine id to hwe and get the MMIO value (Lucas)
- Correct commit message on when WA BB runs (Lucas)
v4:
- s/GRAPHICS_VER(...)/xe->info.has_64bit_timestamp/ (Matt)
- Drop support for active utilization on a VF (CI failure)
- In xe_lrc_init ensure the lrc value is 0 to begin with (CI regression)
v5:
- Minor checkpatch fix
- Squash into previous commit and make TDR use 32-bit time
- Update code comment to match commit msg
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4532
Cc: <stable@vger.kernel.org> # v6.13+
Suggested-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250509161159.2173069-8-umesh.nerlige.ramappa@intel.com
(cherry picked from commit 82b98cadb0)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
If device wedges on e.g. GuC upload, the submission is not yet enabled
and the state is not even initialized. Protect the wedge call so it does
nothing in this case. It fixes the following splat:
[] xe 0000:bf:00.0: [drm] device wedged, needs recovery
[] ------------[ cut here ]------------
[] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[] WARNING: CPU: 48 PID: 312 at kernel/locking/mutex.c:564 __mutex_lock+0x8a1/0xe60
...
[] RIP: 0010:__mutex_lock+0x8a1/0xe60
[] mutex_lock_nested+0x1b/0x30
[] xe_guc_submit_wedge+0x80/0x2b0 [xe]
Reviewed-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Link: https://lore.kernel.org/r/20250402-warn-after-wedge-v1-1-93e971511fa5@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Pull drm updates from Dave Airlie:
"Outside of drm there are some rust patches from Danilo who maintains
that area in here, and some pieces for drm header check tests.
The major things in here are a new driver supporting the touchbar
displays on M1/M2, the nova-core stub driver which is just the vehicle
for adding rust abstractions and start developing a real driver inside
of.
xe adds support for SVM with a non-driver specific SVM core
abstraction that will hopefully be useful for other drivers, along
with support for shrinking for TTM devices. I'm sure xe and AMD
support new devices, but the pipeline depth on these things is hard to
know what they end up being in the marketplace!
uapi:
- add mediatek tiled fourcc
- add support for notifying userspace on device wedged
new driver:
- appletbdrm: support for Apple Touchbar displays on m1/m2
- nova-core: skeleton rust driver to develop nova inside off
firmware:
- add some rust firmware pieces
rust:
- add 'LocalModule' type alias
component:
- add helper to query bound status
fbdev:
- fbtft: remove access to page->index
media:
- cec: tda998x: import driver from drm
dma-buf:
- add fast path for single fence merging
tests:
- fix lockdep warnings
atomic:
- allow full modeset on connector changes
- clarify semantics of allow_modeset and drm_atomic_helper_check
- async-flip: support on arbitary planes
- writeback: fix UAF
- Document atomic-state history
format-helper:
- support ARGB8888 to ARGB4444 conversions
buddy:
- fix multi-root cleanup
ci:
- update IGT
dp:
- support extended wake timeout
- mst: fix RAD to string conversion
- increase DPCD eDP control CAP size to 5 bytes
- add DPCD eDP v1.5 definition
- add helpers for LTTPR transparent mode
panic:
- encode QR code according to Fido 2.2
scheduler:
- add parameter struct for init
- improve job peek/pop operations
- optimise drm_sched_job struct layout
ttm:
- refactor pool allocation
- add helpers for TTM shrinker
panel-orientation:
- add a bunch of new quirks
panel:
- convert panels to multi-style functions
- edp: Add support for B140UAN04.4, BOE NV140FHM-NZ, CSW MNB601LS1-3,
LG LP079QX1-SP0V, MNE007QS3-7, STA 116QHD024002, Starry
116KHD024006, Lenovo T14s Gen6 Snapdragon
- himax-hx83102: Add support for CSOT PNA957QT1-1, Kingdisplay
kd110n11-51ie, Starry 2082109qfh040022-50e
- visionox-r66451: use multi-style MIPI-DSI functions
- raydium-rm67200: Add driver for Raydium RM67200
- simple: Add support for BOE AV123Z7M-N17, BOE AV123Z7M-N17
- sony-td4353-jdi: Use MIPI-DSI multi-func interface
- summit: Add driver for Apple Summit display panel
- visionox-rm692e5: Add driver for Visionox RM692E5
bridge:
- pass full atomic state to various callbacks
- adv7511: Report correct capabilities
- it6505: Fix HDCP V compare
- snd65dsi86: fix device IDs
- nwl-dsi: set bridge type
- ti-sn65si83: add error recovery and set bridge type
- synopsys: add HDMI audio support
xe:
- support device-wedged event
- add mmap support for PCI memory barrier
- perf pmu integration and expose per-engien activity
- add EU stall sampling support
- GPU SVM and Xe SVM implementation
- use TTM shrinker
- add survivability mode to allow the driver to do firmware updates
in critical failure states
- PXP HWDRM support for MTL and LNL
- expose package/vram temps over hwmon
- enable DP tunneling
- drop mmio_ext abstraction
- Reject BO evcition if BO is bound to current VM
- Xe suballocator improvements
- re-use display vmas when possible
- add GuC Buffer Cache abstraction
- PCI ID update for Panther Lake and Battlemage
- Enable SRIOV for Panther Lake
- Refactor VRAM manager location
i915:
- enable extends wake timeout
- support device-wedged event
- Enable DP 128b/132b SST DSC
- FBC dirty rectangle support for display version 30+
- convert i915/xe to drm client setup
- Compute HDMI PLLS for rates not in fixed tables
- Allow DSB usage when PSR is enabled on LNL+
- Enable panel replay without full modeset
- Enable async flips with compressed buffers on ICL+
- support luminance based brightness via DPCD for eDP
- enable VRR enable/disable without full modeset
- allow GuC SLPC default strategies on MTL+ for performance
- lots of display refactoring in move to struct intel_display
amdgpu:
- add device wedged event
- support async page flips on overlay planes
- enable broadcast RGB drm property
- add info ioctl for virt mode
- OEM i2c support for RGB lights
- GC 11.5.2 + 11.5.3 support
- SDMA 6.1.3 support
- NBIO 7.9.1 + 7.11.2 support
- MMHUB 1.8.1 + 3.3.2 support
- DCN 3.6.0 support
- Add dynamic workload profile switching for GC 10-12
- support larger VBIOS sizes
- Mark gttsize parameters as deprecated
- Initial JPEG queue resset support
amdkfd:
- add KFD per process flags for setting precision
- sync pasid values between KGD and KFD
- improve GTT/VRAM handling for APUs
- fix user queue validation on GC7/8
- SDMA queue reset support
raedeon:
- rs400 hyperz fix
i2c:
- td998x: drop platform_data, split driver into media and bridge
ast:
- transmitter chip detection refactoring
- vbios display mode refactoring
- astdp: fix connection status and filter unsupported modes
- cursor handling refactoring
imagination:
- check job dependencies with sched helper
ivpu:
- improve command queue handling
- use workqueue for IRQ handling
- add support HW fault injection
- locking fixes
mgag200:
- add support for G200eH5
msm:
- dpu: add concurrent writeback support for DPU 10.x+
- use LTTPR helpers
- GPU:
- Fix obscure GMU suspend failure
- Expose syncobj timeline support
- Extend GPU devcoredump with pagetable info
- a623 support
- Fix a6xx gen1/gen2 indexed-register blocks in gpu snapshot /
devcoredump
- Display:
- Add cpu-cfg interconnect paths on SM8560 and SM8650
- Introduce KMS OMMU fault handler, causing devcoredump snapshot
- Fixed error pointer dereference in msm_kms_init_aspace()
- DPU:
- Fix mode_changing handling
- Add writeback support on SM6150 (QCS615)
- Fix DSC programming in 1:1:1 topology
- Reworked hardware resource allocation, moving it to the CRTC code
- Enabled support for Concurrent WriteBack (CWB) on SM8650
- Enabled CDM blocks on all relevant platforms
- Reworked debugfs interface for BW/clocks debugging
- Clear perf params before calculating bw
- Support YUV formats on writeback
- Fixed double inclusion
- Fixed writeback in YUV formats when using cloned output, Dropped
wb2_formats_rgb
- Corrected dpu_crtc_check_mode_changed and struct dpu_encoder_virt
kerneldocs
- Fixed uninitialized variable in dpu_crtc_kickoff_clone_mode()
- DSI:
- DSC-related fixes
- Rework clock programming
- DSI PHY:
- Fix 7nm (and lower) PHY programming
- Add proper DT schema definitions for DSI PHY clocks
- HDMI:
- Rework the driver, enabling the use of the HDMI Connector
framework
- Bindings:
- Added eDP PHY on SA8775P
nouveau:
- move drm_slave_encoder interface into driver
- nvkm: refactor GSP RPC
- use LTTPR helpers
mediatek:
- HDMI fixup and refinement
- add MT8188 dsc compatible
- MT8365 SoC support
panthor:
- Expose sizes of intenral BOs via fdinfo
- Fix race between reset and suspend
- Improve locking
qaic:
- Add support for AIC200
renesas:
- Fix limits in DT bindings
rockchip:
- support rk3562-mali
- rk3576: Add HDMI support
- vop2: Add new display modes on RK3588 HDMI0 up to 4K
- Don't change HDMI reference clock rate
- Fix DT bindings
- analogix_dp: add eDP support
- fix shutodnw
solomon:
- Set SPI device table to silence warnings
- Fix pixel and scanline encoding
v3d:
- handle clock
vc4:
- Use drm_exec
- Use dma-resv for wait-BO ioctl
- Remove seqno infrastructure
virtgpu:
- Support partial mappings of GEM objects
- Reserve VGA resources during initialization
- Fix UAF in virtgpu_dma_buf_free_obj()
- Add panic support
vkms:
- Switch to a managed modesetting pipeline
- Add support for ARGB8888
- fix UAf
xlnx:
- Set correct DMA segment size
- use mutex guards
- Fix error handling
- Fix docs"
* tag 'drm-next-2025-03-28' of https://gitlab.freedesktop.org/drm/kernel: (1762 commits)
drm/amd/pm: Update feature list for smu_v13_0_6
drm/amdgpu: Add parameter documentation for amdgpu_sync_fence
drm/amdgpu/discovery: optionally use fw based ip discovery
drm/amdgpu/discovery: use specific ip_discovery.bin for legacy asics
drm/amdgpu/discovery: check ip_discovery fw file available
drm/amd/pm: Remove unnecessay UQ10 to UINT conversion
drm/amd/pm: Remove unnecessay UQ10 to UINT conversion
drm/amdgpu/sdma_v4_4_2: update VM flush implementation for SDMA
drm/amdgpu: Optimize VM invalidation engine allocation and synchronize GPU TLB flush
drm/amd/amdgpu: Increase max rings to enable SDMA page ring
drm/amdgpu: Decode deferred error type in gfx aca bank parser
drm/amdgpu/gfx11: Add Cleaner Shader Support for GFX11.5 GPUs
drm/amdgpu/mes: clean up SDMA HQD loop
drm/amdgpu/mes: enable compute pipes across all MEC
drm/amdgpu/mes: drop MES 10.x leftovers
drm/amdgpu/mes: optimize compute loop handling
drm/amdgpu/sdma: guilty tracking is per instance
drm/amdgpu/sdma: fix engine reset handling
drm/amdgpu: remove invalid usage of sched.ready
drm/amdgpu: add cleaner shader trace point
...
Allow user to provide a low latency hint. When set, KMD sends a hint
to GuC which results in special handling for that process. SLPC will
ramp the GT frequency aggressively every time it switches to this
process.
We need to enable the use of SLPC Compute strategy during init, but
it will apply only to processes that set this bit during process
creation.
Improvement with this approach as below:
Before,
:~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak --kernel-latency
Platform: Intel(R) OpenCL Graphics
Device: Intel(R) Graphics [0xe20b]
Driver version : 24.52.0 (Linux x64)
Compute units : 160
Clock frequency : 2850 MHz
Kernel launch latency : 283.16 us
After,
:~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak --kernel-latency
Platform: Intel(R) OpenCL Graphics
Device: Intel(R) Graphics [0xe20b]
Driver version : 24.52.0 (Linux x64)
Compute units : 160
Clock frequency : 2850 MHz
Kernel launch latency : 63.38 us
Compute PR: https://github.com/intel/compute-runtime/pull/794
Mesa PR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33214
IGT PR: https://patchwork.freedesktop.org/patch/639989/
V10(Lucas):
- Remove doc from drm-uapi.rst
v9(Vinay):
- remove extra line, align commit message
v8(Vinay):
- Add separate example for using low latency hint
v7(Jose):
- Update UMD PR
- applicable to all gpus
V6:
- init flags, remove redundant flags check (MAuld)
V5:
- Move uapi doc to documentation and GuC ABI specific change (Rodrigo)
- Modify logic to restrict exec queue flags (MAuld)
V4:
- To make it clear, dont use exec queue word (Vinay)
- Correct typo in description of flag (Jose/Vinay)
- rename set_strategy api and replace ctx with exec queue(Vinay)
- Start with 0th bit to indentify user flags (Jose)
V3:
- Conver user flag to kernel internal flag and use (Oak)
- Support query config for use to check kernel support (Jose)
- Dont need to take runtime pm (Vinay)
V2:
- DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 planned for other hint(Szymon)
- Add motivation to description (Lucas)
Acked-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250228070224.739295-2-tejas.upadhyay@intel.com
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
The async call to __guc_exec_queue_fini_async frees the scheduler
while a submission may time out and restart. To prevent this race
condition, the pending job timer should be canceled before freeing
the scheduler.
V3(MattB):
- Adjust position of cancel pending job
- Remove gitlab issue# from commit message
V2(MattB):
- Cancel pending jobs before scheduler finish
Fixes: a20c75dba1 ("drm/xe: Call __guc_exec_queue_fini_async direct for KERNEL exec_queues")
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250225045754.600905-1-tejas.upadhyay@intel.com
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
(cherry picked from commit 18fbd567e7)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
The async call to __guc_exec_queue_fini_async frees the scheduler
while a submission may time out and restart. To prevent this race
condition, the pending job timer should be canceled before freeing
the scheduler.
V3(MattB):
- Adjust position of cancel pending job
- Remove gitlab issue# from commit message
V2(MattB):
- Cancel pending jobs before scheduler finish
Fixes: a20c75dba1 ("drm/xe: Call __guc_exec_queue_fini_async direct for KERNEL exec_queues")
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250225045754.600905-1-tejas.upadhyay@intel.com
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
There are debug level prints giving more information about the cause
of the hang immediately before core dumps are created. However, not
everyone has debug level prints enabled or saves the dmesg log at all.
So include that information in the dump file itself. Also, at least
one of those prints included the pid as well as the process name. So
include that in the capture too.
v2: Fix kvfree vs kfree and missing kernel-doc (review feedback from
Matthew Brost)
v3: Use GFP_ATOMIC instead of GFP_KERNEL.
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241128210824.3302147-2-John.C.Harrison@Intel.com
Currently in some testcases we can trigger:
xe 0000:03:00.0: [drm] Assertion `exec_queue_destroyed(q)` failed!
....
WARNING: CPU: 18 PID: 2640 at drivers/gpu/drm/xe/xe_guc_submit.c:1826 xe_guc_sched_done_handler+0xa54/0xef0 [xe]
xe 0000:03:00.0: [drm] *ERROR* GT1: DEREGISTER_DONE: Unexpected engine state 0x00a1, guc_id=57
Looking at a snippet of corresponding ftrace for this GuC id we can see:
162.673311: xe_sched_msg_add: dev=0000:03:00.0, gt=1 guc_id=57, opcode=3
162.673317: xe_sched_msg_recv: dev=0000:03:00.0, gt=1 guc_id=57, opcode=3
162.673319: xe_exec_queue_scheduling_disable: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0x29, flags=0x0
162.674089: xe_exec_queue_kill: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0x29, flags=0x0
162.674108: xe_exec_queue_close: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa9, flags=0x0
162.674488: xe_exec_queue_scheduling_done: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa9, flags=0x0
162.678452: xe_exec_queue_deregister: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa1, flags=0x0
It looks like we try to suspend the queue (opcode=3), setting
suspend_pending and triggering a disable_scheduling. The user then
closes the queue. However the close will also forcefully signal the
suspend fence after killing the queue, later when the G2H response for
disable_scheduling comes back we have now cleared suspend_pending when
signalling the suspend fence, so the disable_scheduling now incorrectly
tries to also deregister the queue. This leads to warnings since the queue
has yet to even be marked for destruction. We also seem to trigger
errors later with trying to double unregister the same queue.
To fix this tweak the ordering when handling the response to ensure we
don't race with a disable_scheduling that didn't actually intend to
perform an unregister. The destruction path should now also correctly
wait for any pending_disable before marking as destroyed.
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3371
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241122161914.321263-6-matthew.auld@intel.com
Currently in some testcases we can trigger:
[drm] *ERROR* GT0: SCHED_DONE: Unexpected engine state 0x02b1, guc_id=8, runnable_state=0
[drm] *ERROR* GT0: G2H action 0x1002 failed (-EPROTO) len 3 msg 02 10 00 90 08 00 00 00 00 00 00 00
Looking at a snippet of corresponding ftrace for this GuC id we can see:
498.852891: xe_sched_msg_add: dev=0000:03:00.0, gt=0 guc_id=8, opcode=3
498.854083: xe_sched_msg_recv: dev=0000:03:00.0, gt=0 guc_id=8, opcode=3
498.855389: xe_exec_queue_kill: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x3, flags=0x0
498.855436: xe_exec_queue_lr_cleanup: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x83, flags=0x0
498.856767: xe_exec_queue_close: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x83, flags=0x0
498.862889: xe_exec_queue_scheduling_disable: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0xa9, flags=0x0
498.863032: xe_exec_queue_scheduling_disable: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b9, flags=0x0
498.875596: xe_exec_queue_scheduling_done: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b9, flags=0x0
498.875604: xe_exec_queue_deregister: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b1, flags=0x0
499.074483: xe_exec_queue_deregister_done: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b1, flags=0x0
This looks to be the two scheduling_disable racing with each other, one
from the suspend (opcode=3) and then again during lr cleanup. While
those two operations are serialized, the G2H portion is not, therefore
when marking the queue as pending_disabled and then firing off the first
request, we proceed do the same again, however the first disable
response only fires after this which then clears the pending_disabled.
At this point the second comes back and is processed, however the
pending_disabled is no longer set, hence triggering the warning.
To fix this wait for pending_disabled when doing the lr cleanup and
calling disable_scheduling_deregister. Also do the same for all other
disable_scheduling callers.
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3515
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <mattheq.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241122161914.321263-5-matthew.auld@intel.com
The exec queue timestamp is only really useful when it's being queried
through the fdinfo. There's no need to update it so often, on every
job_free. Tracing a simple app like vkcube running shows an update
rate of ~ 120Hz. In case of discrete, the BO is on vram, creating a lot
of pcie transactions.
The update on job_free() is used to cover a gap: if exec
queue is created and destroyed rapidly, before a new query, the
timestamp still needs to be accumulated and accounted for in the xef.
Initial implementation in commit 6109f24f87 ("drm/xe: Add helper to
accumulate exec queue runtime") couldn't do it on the exec_queue_fini
since the xef could be gone at that point. However since commit
ce8c161cba ("drm/xe: Add ref counting for xe_file") the xef is
refcounted and the exec queue always holds a reference, making this safe
now.
Improve the fix in commit 2149ded630 ("drm/xe: Fix use after free when
client stats are captured") by reducing the frequency in which the
update is needed.
Fixes: 2149ded630 ("drm/xe: Fix use after free when client stats are captured")
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241104143815.2112272-3-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Backmerging to get up-to-date and to bring in a fix that was
merged through drm-misc-fixes.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Short circuiting TDR on jobs not started is an optimization which is not
required. On LNL we are facing an issue where jobs do not get scheduled
by the GuC if it misses a GGTT page update. When this occurs let the TDR
fire, toggle the scheduling which may get the job unstuck, and print a
warning message. If the TDR fires twice on job that hasn't started,
timeout the job.
v2:
- Add warning message (Paulo)
- Add fixes tag (Paulo)
- Timeout job which hasn't started after TDR firing twice
v3:
- Include local change
v4:
- Short circuit check_timeout on job not started
- use warn level rather than notice (Paulo)
Fixes: 7ddb9403dd ("drm/xe: Sample ctx timestamp to determine if jobs have timed out")
Cc: stable@vger.kernel.org
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241025214330.2010521-2-matthew.brost@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 35d25a4a00)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Short circuiting TDR on jobs not started is an optimization which is not
required. On LNL we are facing an issue where jobs do not get scheduled
by the GuC if it misses a GGTT page update. When this occurs let the TDR
fire, toggle the scheduling which may get the job unstuck, and print a
warning message. If the TDR fires twice on job that hasn't started,
timeout the job.
v2:
- Add warning message (Paulo)
- Add fixes tag (Paulo)
- Timeout job which hasn't started after TDR firing twice
v3:
- Include local change
v4:
- Short circuit check_timeout on job not started
- use warn level rather than notice (Paulo)
Fixes: 7ddb9403dd ("drm/xe: Sample ctx timestamp to determine if jobs have timed out")
Cc: stable@vger.kernel.org
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241025214330.2010521-2-matthew.brost@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
xe_force_wake_get() now returns the reference count-incremented domain
mask. If it fails for individual domains, the return value will always
be 0. However, for XE_FORCEWAKE_ALL, it may return a non-zero value even
in the event of failure. Use helper xe_force_wake_ref_has_domain to
verify all domains are initialized or not. Update the return handling of
xe_force_wake_get() to reflect this behavior, and ensure that the return
value is passed as input to xe_force_wake_put().
v3
- return xe_wakeref_t instead of int in xe_force_wake_get()
- xe_force_wake_put() error doesn't need to be checked. It internally
WARNS on domain ack failure.
v5
- return unsigned int from xe_force_wake_get()
- Remove redundant xe_gt_WARN_ON
v6
- use helper xe_force_wake_ref_has_domain()
v7
- Fix commit message
v9
- Rebase
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241014075601.2324382-17-himal.prasad.ghimiray@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Short summary of fixes pull:
fbdev-dma:
- Only clean up deferred I/O if instanciated
nouveau:
- dmem: Fix privileged error in copy engine channel; Fix possible
data leak in migrate_to_ram()
- gsp: Fix coding style
sched:
- Avoid leaking lockdep map
v3d:
- Stop active perfmon before destroying it
vc4:
- Stop active perfmon before destroying it
xe:
- Drop GuC submit_wq pool
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20241010133708.GA461532@localhost.localdomain
When we decide to kill a job, (from guc_exec_queue_timedout_job), we could
end up with 4 possible scenarios at this starting point of this decision:
1. the guc-captured register-dump is already there.
2. the driver is wedged.mode > 1, so GuC-engine-reset / GuC-err-capture
will not happen.
3. the user has started the driver in execlist-submission mode.
4. the guc-captured register-dump is not ready yet so we force GuC to kill
that context now, but:
A. we don't know yet if GuC will be successful on the engine-reset
and get the guc-err-capture, else kmd will do a manual reset later
OR B. guc will be successful and we will get a guc-err-capture
shortly.
So to accomdate the scenarios of 2 and 4A, we will need to do a manual KMD
capture first(which is not be reliable in guc-submission mode) and decide
later if we need to use that for the cases of 2 or 4A. So this flow is
part of the implementation for this patch.
Provide xe_guc_capture_get_reg_desc_list to get the register dscriptor
list.
Add manual capture by read from hw engine if GuC capture is not ready.
If it becomes ready at later time, GuC sourced data will be used.
Although there may only be a small delay between (1) the check for whether
guc-err-capture is available at the start of guc_exec_queue_timedout_job
and (2) the decision on using a valid guc-err-capture or manual-capture,
lets not take any chances and lock the matching node down so it doesn't
get re-claimed if GuC-Err-Capture subsystem is running out of pre-cached
nodes.
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-6-zhanjun.dong@intel.com
Upon the G2H Notify-Err-Capture event, parse through the
GuC Log Buffer (error-capture-subregion) and generate one or
more capture-nodes. A single node represents a single "engine-
instance-capture-dump" and contains at least 3 register lists:
global, engine-class and engine-instance. An internal link
list is maintained to store one or more nodes.
Because the link-list node generation happen before the call
to devcoredump, duplicate global and engine-class register
lists for each engine-instance register dump if we find
dependent-engine resets in a engine-capture-group.
To avoid dynamically allocate the output nodes during gt reset,
pre-allocate a fixed number of empty nodes up front (at the
time of ADS registration) that we can consume from or return to
an internal cached list of nodes.
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-5-zhanjun.dong@intel.com
The xe_guc_exec_queue_snapshot is not really a GuC internal thing and
is definitely not a GuC CT thing. So give it its own section heading.
The snapshot itself is really a capture of the submission backend's
internal state. Although all it currently prints out is the submission
contexts. So label it as 'Contexts'. If more general state is added
later then it could be change to 'Submission backend' or some such.
Further, everything from the GuC CT section onwards is GT specific but
there was no indication of which GT it was related to (and that is
impossible to work out from the other fields that are given). So add a
GT section heading. Also include the tile id of the GT, because again
significant information.
Lastly, drop a couple of unnecessary line feeds within sections.
v2: Add GT section heading, add tile id to device section.
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-4-John.C.Harrison@Intel.com
Any non-wedged queue can have a zero refcount here and can be running
concurrently with an async queue destroy, therefore dereferencing the
queue ptr to check wedge status after the lookup can trigger UAF if
queue is not wedged. Fix this by keeping the submission_state lock held
around the check to postpone the free and make the check safe, before
dropping again around the put() to avoid the deadlock.
Fixes: 8ed9aaae39 ("drm/xe: Force wedged state and block GT reset upon any GPU hang")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240924150947.118433-2-matthew.auld@intel.com
(cherry picked from commit d28af0b6b9)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>