Commit Graph

121 Commits

Author SHA1 Message Date
Lizhi Hou
6b13cb8f48 accel/amdxdna: Fix runtime suspend deadlock when there is pending job
The runtime suspend callback drains the running job workqueue before
suspending the device. If a job is still executing and calls
pm_runtime_resume_and_get(), it can deadlock with the runtime suspend
path.

Fix this by moving pm_runtime_resume_and_get() from the job execution
routine to the job submission routine, ensuring the device is resumed
before the job is queued and avoiding the deadlock during runtime
suspend.

Fixes: 063db45183 ("accel/amdxdna: Enhance runtime power management")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260310180058.336348-1-lizhi.hou@amd.com
2026-03-10 11:46:40 -07:00
Lizhi Hou
d5b8b0347f accel/amdxdna: Split mailbox channel create function
The management channel used for firmware control command submission is
currently created after the firmware is started. If channel creation
fails (for example, due to memory allocation failure or workqueue
creation interruption), the firmware remains in a pending state and is
unable to receive any control commands.

To avoid leaving the firmware in this inconsistent state, split
xdna_mailbox_create_channel() into two separate functions so that
resource allocation can be completed before interacting with the
hardware.
  xdna_mailbox_alloc_channel()
    Allocates memory and initializes the workqueue. This can be called
    earlier, before interacting with the hardware.
  xdna_mailbox_start_channel()
    Performs the hardware interaction required to start the channel.

Rename xdna_mailbox_destroy_channel() to xdna_mailbox_free_channel().
Ensure that xdna_mailbox_stop_channel() and xdna_mailbox_free_channel()
properly unwind the corresponding start and allocation steps, respectively.

Fixes: b87f920b93 ("accel/amdxdna: Support hardware mailbox")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260305062041.3954024-1-lizhi.hou@amd.com
2026-03-05 09:24:33 -08:00
Lizhi Hou
f82859c84a accel/amdxdna: Fix major version check on NPU1 platform
Add the missing major number in npu1_fw_feature_table.

Without the major version specified, the firmware feature check fails,
preventing new firmware commands from being enabled on the NPU1
platform.

With the correct major version populated, the driver properly detects
firmware support and enables the new command.

Fixes: f1eac46fe5 ("accel/amdxdna: Update firmware version check for latest firmware")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260304195012.3616908-1-lizhi.hou@amd.com
2026-03-04 12:05:02 -08:00
Lizhi Hou
6270ee26e1 accel/amdxdna: Fix NULL pointer dereference of mgmt_chann
mgmt_chann may be set to NULL if the firmware returns an unexpected
error in aie2_send_mgmt_msg_wait(). This can later lead to a NULL
pointer dereference in aie2_hw_stop().

Fix this by introducing a dedicated helper to destroy mgmt_chann
and by adding proper NULL checks before accessing it.

Fixes: b87f920b93 ("accel/amdxdna: Support hardware mailbox")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260226213857.3068474-1-lizhi.hou@amd.com
2026-03-02 09:43:22 -08:00
Lizhi Hou
89ff45359a accel/amdxdna: Fill invalid payload for failed command
Newer userspace applications may read the payload of a failed command
to obtain detailed error information. However, the driver and old firmware
versions may not support returning advanced error information.
In this case, initialize the command payload with an invalid value so
userspace can detect that no detailed error information is available.

Fixes: aac243092b ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260227004841.3080241-1-lizhi.hou@amd.com
2026-02-27 23:01:36 -08:00
Lizhi Hou
75c151ceaa accel/amdxdna: Use a different name for latest firmware
Using legacy driver with latest firmware causes a power off issue.

Fix this by assigning a different filename (npu_7.sbin) to the latest
firmware. The driver attempts to load the latest firmware first and falls
back to the previous firmware version if loading fails.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/5009
Fixes: f1eac46fe5 ("accel/amdxdna: Update firmware version check for latest firmware")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260225204752.2711734-1-lizhi.hou@amd.com
2026-02-25 13:51:31 -08:00
Lizhi Hou
901ec34709 accel/amdxdna: Validate command buffer payload count
The count field in the command header is used to determine the valid
payload size. Verify that the valid payload does not exceed the remaining
buffer space.

Fixes: aac243092b ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260219211946.1920485-1-lizhi.hou@amd.com
2026-02-23 09:24:21 -08:00
Lizhi Hou
03808abb1d accel/amdxdna: Prevent ubuf size overflow
The ubuf size calculation may overflow, resulting in an undersized
allocation and possible memory corruption.

Use check_add_overflow() helpers to validate the size calculation before
allocation.

Fixes: bd72d4acda ("accel/amdxdna: Support user space allocated buffer")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260217192815.1784689-1-lizhi.hou@amd.com
2026-02-23 09:24:21 -08:00
Lizhi Hou
1110a94967 accel/amdxdna: Fix out-of-bounds memset in command slot handling
The remaining space in a command slot may be smaller than the size of
the command header. Clearing the command header with memset() before
verifying the available slot space can result in an out-of-bounds write
and memory corruption.

Fix this by moving the memset() call after the size validation.

Fixes: 3d32eb7a5e ("accel/amdxdna: Fix cu_idx being cleared by memset() during command setup")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260217185415.1781908-1-lizhi.hou@amd.com
2026-02-23 09:24:20 -08:00
Lizhi Hou
07efce5a66 accel/amdxdna: Fix command hang on suspended hardware context
When a hardware context is suspended, the job scheduler is stopped. If a
command is submitted while the context is suspended, the job is queued in
the scheduler but aie2_sched_job_run() is never invoked to restart the
hardware context. As a result, the command hangs.

Fix this by modifying the hardware context suspend routine to keep the job
scheduler running so that queued jobs can trigger context restart properly.

Fixes: aac243092b ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260211205341.722982-1-lizhi.hou@amd.com
2026-02-23 09:24:19 -08:00
Lizhi Hou
fdb65acfe6 accel/amdxdna: Fix suspend failure after enabling turbo mode
Enabling turbo mode disables hardware clock gating. Suspend requires
hardware clock gating to be re-enabled, otherwise suspend will fail.
Fix this by calling aie2_runtime_cfg() from aie2_hw_stop() to
re-enable clock gating during suspend. Also ensure that firmware is
initialized in aie2_hw_start() before modifying clock-gating
settings during resume.

Fixes: f4d7b8a6bc ("accel/amdxdna: Enhance power management settings")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260211204716.722788-1-lizhi.hou@amd.com
2026-02-23 09:24:19 -08:00
Lizhi Hou
1aa82181a3 accel/amdxdna: Fix dead lock for suspend and resume
When an application issues a query IOCTL while auto suspend is running,
a deadlock can occur. The query path holds dev_lock and then calls
pm_runtime_resume_and_get(), which waits for the ongoing suspend to
complete. Meanwhile, the suspend callback attempts to acquire dev_lock
and blocks, resulting in a deadlock.

Fix this by releasing dev_lock before calling pm_runtime_resume_and_get()
and reacquiring it after the call completes. Also acquire dev_lock in the
resume callback to keep the locking consistent.

Fixes: 063db45183 ("accel/amdxdna: Enhance runtime power management")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260211204644.722758-1-lizhi.hou@amd.com
2026-02-23 09:24:17 -08:00
Mario Limonciello
57aa3917a3 accel/amdxdna: Reduce log noise during process termination
During process termination, several error messages are logged that are
not actual errors but expected conditions when a process is killed or
interrupted. This creates unnecessary noise in the kernel log.

The specific scenarios are:

1. HMM invalidation returns -ERESTARTSYS when the wait is interrupted by
   a signal during process cleanup. This is expected when a process is
   being terminated and should not be logged as an error.

2. Context destruction returns -ENODEV when the firmware or device has
   already stopped, which commonly occurs during cleanup if the device
   was already torn down. This is also an expected condition during
   orderly shutdown.

Downgrade these expected error conditions from error level to debug level
to reduce log noise while still keeping genuine errors visible.

Fixes: 97f2757383 ("accel/amdxdna: Fix potential NULL pointer dereference in context cleanup")
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260210164521.1094274-3-mario.limonciello@amd.com
2026-02-23 09:24:16 -08:00
Lizhi Hou
8363c02863 accel/amdxdna: Fix crash when destroying a suspended hardware context
If userspace issues an ioctl to destroy a hardware context that has
already been automatically suspended, the driver may crash because the
mailbox channel pointer is NULL for the suspended context.

Fix this by checking the mailbox channel pointer in aie2_destroy_context()
before accessing it.

Fixes: 97f2757383 ("accel/amdxdna: Fix potential NULL pointer dereference in context cleanup")
Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260206060306.4050531-1-lizhi.hou@amd.com
2026-02-23 09:24:15 -08:00
Lizhi Hou
c68a6af400 accel/amdxdna: Switch to always use chained command
Preempt commands are only supported when submitted as chained commands.
To ensure preempt support works consistently, always submit commands in
chained command format.

Set force_cmdlist to true so that single commands are filled using the
chained command layout, enabling correct handling of preempt commands.

Fixes: 3a0ff7b98a ("accel/amdxdna: Support preemption requests")
Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260206060251.4050512-1-lizhi.hou@amd.com
2026-02-23 09:24:15 -08:00
Lizhi Hou
08fe1b5166 accel/amdxdna: Remove buffer size check when creating command BO
Large command buffers may be used, and they do not always need to be
mapped or accessed by the driver. Performing a size check at command BO
creation time unnecessarily rejects valid use cases.

Remove the buffer size check from command BO creation, and defer vmap
and size validation to the paths where the driver actually needs to map
and access the command buffer.

Fixes: ac49797c18 ("accel/amdxdna: Add GEM buffer object management")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260206060237.4050492-1-lizhi.hou@amd.com
2026-02-23 09:23:41 -08:00
Linus Torvalds
32a92f8c89 Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split
over multiple lines.  I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.

Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script.  I probably had made it a bit _too_ trivial.

So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.

The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 20:03:00 -08:00
Linus Torvalds
323bbfcf1e Convert 'alloc_flex' family to use the new default GFP_KERNEL argument
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.

As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Lizhi Hou
69674c1c70 accel/amdxdna: Move RPM resume into job run function
Currently, amdxdna_pm_resume_get() is called during job creation, and
amdxdna_pm_suspend_put() is called when the hardware notifies job
completion. If a job is canceled before it is run, no hardware
completion notification is generated, resulting in an unbalanced
runtime PM resume/suspend pair.

Fix this by moving amdxdna_pm_resume_get() to the job run path, ensuring
runtime PM is only resumed for jobs that are actually executed.

Fixes: 063db45183 ("accel/amdxdna: Enhance runtime power management")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260204171118.3165607-1-lizhi.hou@amd.com
2026-02-04 13:08:48 -08:00
Lizhi Hou
d19d963d2a accel/amdxdna: Fix incorrect DPM level after suspend/resume
The suspend routine sets the DPM level to 0, which unintentionally
overwrites the previously saved DPM level. As a result, the device always
resumes with DPM level 0 instead of restoring the original value.

Fix this by ensuring the suspend path does not overwrite the saved DPM
level, allowing the correct DPM level to be restored during resume.

Fixes: f4d7b8a6bc ("accel/amdxdna: Enhance power management settings")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260204171048.3165580-1-lizhi.hou@amd.com
2026-02-04 13:08:35 -08:00
Lizhi Hou
750817a7c4 accel/amdxdna: Fix incorrect error code returned for failed chain command
The driver currently returns an incorrect error code when a chain command
fails. In this case, ERT_CMD_STATE_ERROR is expected to be reported for
failed chain commands.

Fixes: aac243092b ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260203184037.2751889-1-lizhi.hou@amd.com
2026-02-03 11:35:53 -08:00
Lizhi Hou
b853007fdc accel/amdxdna: Remove hardware context status
One newly supported command does not require hardware context configuration
to be performed upfront. As a result, checking hardware context status
causes this command to fail incorrectly.

Remove hardware context status handling entirely. For other commands,
if userspace submits a request without configuring the hardware context
first, the firmware will report an error or time out as appropriate.

Fixes: aac243092b ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260202212450.2681273-1-lizhi.hou@amd.com
2026-02-03 09:20:24 -08:00
Zishun Yi
84dd57fb03 accel/amdxdna: Fix memory leak in amdxdna_ubuf_map
The amdxdna_ubuf_map() function allocates memory for sg and
internal sg table structures, but it fails to free them if subsequent
operations (sg_alloc_table_from_pages or dma_map_sgtable) fail.

Fixes: bd72d4acda ("accel/amdxdna: Support user space allocated buffer")
Signed-off-by: Zishun Yi <zishun.yi.dev@gmail.com>
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
Reviewed-by: Min Ma <mamin506@gmail.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260129171022.68578-1-zishun.yi.dev@gmail.com
2026-01-30 11:52:59 -08:00
Lizhi Hou
f1370241fe accel/amdxdna: Stop job scheduling across aie2_release_resource()
Running jobs on a hardware context while it is in the process of
releasing resources can lead to use-after-free and crashes.

Fix this by stopping job scheduling before calling
aie2_release_resource() and restarting it after the release completes.
Additionally, aie2_sched_job_run() now checks whether the hardware
context is still active.

Fixes: 4fd6ca90fc ("accel/amdxdna: Refactor hardware context destroy routine")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260130003255.2083255-1-lizhi.hou@amd.com
2026-01-30 11:52:53 -08:00
Lizhi Hou
a9162439ad accel/amdxdna: Hold mm structure across iommu_sva_unbind_device()
Some tests trigger a crash in iommu_sva_unbind_device() due to
accessing iommu_mm after the associated mm structure has been
freed.

Fix this by taking an explicit reference to the mm structure
after successfully binding the device, and releasing it only
after the device is unbound. This ensures the mm remains valid
for the entire SVA bind/unbind lifetime.

Fixes: be462c97b7 ("accel/amdxdna: Add hardware context")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260128002356.1858122-1-lizhi.hou@amd.com
2026-01-30 11:52:45 -08:00
Dave Airlie
37b812b7fd Merge tag 'drm-misc-next-2026-01-15' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
drm-misc-next for 6.20:

Core Changes:

- atomic: Introduce Gamma/Degamma LUT size check
- gem: Fix a leak in drm_gem_get_unmapped_area
- gpuvm: API sanitation for Rust bindings
- panic: Few corner-cases fixes

Driver Changes:

- Replace system workqueue with percpu equivalent

- amdxdna: Update message buffer allocation requirements, Update
  firmware version check
- imagination: Add AM62P support
- ivpu: Implement warm boot flow
- rockchip: Get rid of atomic_check fixups, Add Rockchip RK3506 Support
- rocket: Cleanups

- bridge:
  - dw-hdmi-qp: Add support for HPD-less setups
- panel:
  - mantix: Various power management related improvements
  - new panels: Innolux G150XGE-L05,

- dma-buf:
  - cma: Call clear_page instead of memset

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <mripard@redhat.com>
Link: https://patch.msgid.link/20260115-lilac-dragon-of-opposition-ac0a30@houat
2026-01-16 11:04:03 +10:00
Lizhi Hou
b36178488d accel/amdxdna: Fix notifier_wq flushing warning
Create notifier_wq with WQ_MEM_RECLAIM flag to fix the possible warning.

  workqueue: WQ_MEM_RECLAIM amdxdna_js:drm_sched_free_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM notifier_wq:0x0

Fixes: e486147c91 ("accel/amdxdna: Add BO import and export")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260113173624.256053-1-lizhi.hou@amd.com
2026-01-14 09:07:33 -08:00
Lizhi Hou
f1eac46fe5 accel/amdxdna: Update firmware version check for latest firmware
The latest firmware increases the major version number. Update
aie2_check_protocol() to accept and support the new firmware version.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251219014356.2234241-2-lizhi.hou@amd.com
2026-01-08 09:48:43 -08:00
Lizhi Hou
1222e06934 accel/amdxdna: Update message DMA buffer allocation
The latest firmware requires the message DMA buffer to
  - have a minimum size of 8K
  - use a power-of-two size
  - be aligned to the buffer size
  - not cross 64M boundary

Update the buffer allocation logic to meet these requirements and support
the latest firmware.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251219014356.2234241-1-lizhi.hou@amd.com
2026-01-08 09:48:34 -08:00
Dave Airlie
7bc0f871f9 Merge tag 'drm-misc-next-2025-12-19' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
drm-misc-next for 6.20:

Core Changes:

  - dma-buf: Add tracepoints
  - sched: Introduce new helpers

Driver Changes:

  - amdxdna: Enable hardware context priority, Remove (obsolete and
    never public) NPU2 Support, Race condition fix
  - rockchip: Add RK3368 HDMI Support
  - rz-du: Add RZ/V2H(P) MIPI-DSI Support

  - panels:
    - st7571: Introduce SPI support
    - New panels: Sitronix ST7920, Samsung LTL106HL02, LG LH546WF1-ED01, HannStar HSD156JUW2

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <mripard@redhat.com>
Link: https://patch.msgid.link/20251219-arcane-quaint-skunk-e383b0@houat
2025-12-26 19:00:41 +10:00
Dave Airlie
6c8e404891 Merge tag 'drm-misc-next-2025-12-12' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
drm-misc-next for 6.19:

UAPI Changes:

  - panfrost: Add PANFROST_BO_SYNC ioctl
  - panthor: Add PANTHOR_BO_SYNC ioctl

Core Changes:

  - atomic: Add drm_device pointer to drm_private_obj
  - bridge: Introduce drm_bridge_unplug, drm_bridge_enter, and
    drm_bridge_exit
  - dma-buf: Improve sg_table debugging
  - dma-fence: Add new helpers, and use them when needed
  - dp_mst: Avoid out-of-bounds access with VCPI==0
  - gem: Reduce page table overhead with transparent huge pages
  - panic: Report invalid panic modes
  - sched: Add TODO entries
  - ttm: Various cleanups
  - vblank: Various refactoring and cleanups

  - Kconfig cleanups
  - Removed support for kdb

Driver Changes:

  - amdxdna: Fix race conditions at suspend, Improve handling of zero
    tail pointers, Fix cu_idx being overwritten during command setup
  - ast: Support imported cursor buffers
  -
  - panthor: Enable timestamp propagation, Multiple improvements and
    fixes to improve the overall robustness, notably of the scheduler.

  - panels:
    - panel-edp: Support for CSW MNE007QB3-1, AUO B140HAN06.4, AUO B140QAX01.H

Signed-off-by: Dave Airlie <airlied@redhat.com>

[airlied: fix mm conflict]
From: Maxime Ripard <mripard@redhat.com>
Link: https://patch.msgid.link/20251212-spectacular-agama-of-abracadabra-aaef32@penduick
2025-12-26 18:15:33 +10:00
Lizhi Hou
332070795b accel/amdxdna: Enable hardware context priority
Newer firmware supports hardware context priority. Set the priority based
on application input.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251217171719.2139025-1-lizhi.hou@amd.com
2025-12-18 10:36:44 -08:00
Lizhi Hou
7818618a09 accel/amdxdna: Enable temporal sharing only mode
Newer firmware versions prefer temporal sharing only mode. In this mode,
the driver no longer needs to manage AIE array column allocation. Instead,
a new field, num_unused_col, is added to the hardware context creation
request to specify how many columns will not be used by this hardware
context.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251217191150.2145937-1-lizhi.hou@amd.com
2025-12-18 10:36:33 -08:00
Lizhi Hou
3ef9384103 accel/amdxdna: Remove NPU2 support
NPU2 hardware was never publicly released and is now obsolete.
Remove all remaining NPU2 support from the driver.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251217190818.2145781-1-lizhi.hou@amd.com
2025-12-18 10:36:22 -08:00
Lizhi Hou
e8c28e16c3 accel/amdxdna: Remove amdxdna_flush()
amdxdna_flush() was introduced to ensure that the device does not access
a process address space after it has been freed. However, this is no
longer necessary because the driver now increments the mm reference count
when a command is submitted and decrements it only after the command has
completed. This guarantees that the process address space remains valid
for the entire duration of command execution. Remove amdxdna_flush to
simplify the teardown path.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251216031311.2033399-1-lizhi.hou@amd.com
2025-12-17 08:28:53 -08:00
Mario Limonciello (AMD)
7bbf6d15e9 accel/amdxdna: Block running under a hypervisor
SVA support is required, which isn't configured by hypervisor
solutions.

Closes: https://github.com/QubesOS/qubes-issues/issues/10275
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4656
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251213054513.87925-1-superm1@kernel.org
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
2025-12-15 13:00:03 -06:00
Lizhi Hou
97f2757383 accel/amdxdna: Fix potential NULL pointer dereference in context cleanup
aie_destroy_context() is invoked during error handling in
aie2_create_context(). However, aie_destroy_context() assumes that the
context's mailbox channel pointer is non-NULL. If mailbox channel
creation fails, the pointer remains NULL and calling aie_destroy_context()
can lead to a NULL pointer dereference.

In aie2_create_context(), replace aie_destroy_context() with a function
which request firmware to remove the context created previously.

Fixes: be462c97b7 ("accel/amdxdna: Add hardware context")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251212183244.1826318-1-lizhi.hou@amd.com
2025-12-15 09:52:59 -08:00
Lizhi Hou
343f5683cf accel/amdxdna: Fix race where send ring appears full due to delayed head update
The firmware sends a response and interrupts the driver before advancing
the mailbox send ring head pointer. As a result, the driver may observe
the response and attempt to send a new request before the firmware has
updated the head pointer. In this window, the send ring still appears
full, causing the driver to incorrectly fail the send operation.

This race can be triggered more easily in a multithreaded environment,
leading to unexpected and spurious "send ring full" failures.

To address this, poll the send ring head pointer for up to 100us before
returning a full-ring condition. This allows the firmware time to update
the head pointer.

Fixes: b87f920b93 ("accel/amdxdna: Support hardware mailbox")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251211045125.1724604-1-lizhi.hou@amd.com
2025-12-12 09:49:10 -08:00
Lizhi Hou
3d32eb7a5e accel/amdxdna: Fix cu_idx being cleared by memset() during command setup
For one command type, cu_idx is assigned before calling memset() on the
command structure. This results in cu_idx being overwritten, causing the
firmware to receive an incomplete or invalid command and leading to
unexpected command failures.

Fix this by moving the memset() call before initializing cu_idx so that
all fields are populated in the correct order.

Fixes: 71829d7f2f ("accel/amdxdna: Use MSG_OP_CHAIN_EXEC_NPU when supported")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251209211639.1636888-1-lizhi.hou@amd.com
2025-12-10 09:07:18 -08:00
Lizhi Hou
00ffe45ece accel/amdxdna: Fix race condition when checking rpm_on
When autosuspend is triggered, driver rpm_on flag is set to indicate that
a suspend/resume is already in progress. However, when a userspace
application submits a command during this narrow window,
amdxdna_pm_resume_get() may incorrectly skip the resume operation because
the rpm_on flag is still set. This results in commands being submitted
while the device has not actually resumed, causing unexpected behavior.

The set_dpm() is called by suspend/resume, it relied on rpm_on flag to
avoid calling into rpm suspend/resume recursivly. So to fix this, remove
the use of the rpm_on flag entirely. Instead, introduce aie2_pm_set_dpm()
which explicitly resumes the device before invoking set_dpm(). With this
change, set_dpm() is called directly inside the suspend or resume execution
path. Otherwise, aie2_pm_set_dpm() is called.

Fixes: 063db45183 ("accel/amdxdna: Enhance runtime power management")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251208165356.1549237-1-lizhi.hou@amd.com
2025-12-09 11:51:03 -08:00
Lizhi Hou
cd77d5a4aa accel/amdxdna: Fix tail-pointer polling in mailbox_get_msg()
In mailbox_get_msg(), mailbox_reg_read_non_zero() is called to poll for a
non-zero tail pointer. This assumed that a zero value indicates an error.
However, certain corner cases legitimately produce a zero tail pointer.
To handle these cases, remove mailbox_reg_read_non_zero(). The zero tail
pointer will be treated as a valid rewind event.

Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251204181603.793824-1-lizhi.hou@amd.com
2025-12-08 08:29:34 -08:00
Lizhi Hou
3d3ac202c7 accel/amdxdna: Poll MPNPU_PWAITMODE after requesting firmware suspend
After issuing a firmware suspend request, the driver must ensure that the
suspend operation has completed before proceeding. Add polling of the
MPNPU_PWAITMODE register to confirm that the firmware has fully entered
the suspended state. This prevents race conditions where subsequent
operations assume the firmware is idle before it has actually completed
its suspend sequence.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251202165427.507414-1-lizhi.hou@amd.com
2025-12-02 16:31:14 -08:00
Lizhi Hou
ca25834123 accel/amdxdna: Fix deadlock between context destroy and job timeout
Hardware context destroy function holds dev_lock while waiting for all jobs
to complete. The timeout job also needs to acquire dev_lock, this leads to
a deadlock.

Fix the issue by temporarily releasing dev_lock before waiting for all
jobs to finish, and reacquiring it afterward.

Fixes: 4fd6ca90fc ("accel/amdxdna: Refactor hardware context destroy routine")
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251107181050.1293125-1-lizhi.hou@amd.com
2025-11-13 09:10:43 -08:00
Lizhi Hou
6ff9385c07 accel/amdxdna: Clear mailbox interrupt register during channel creation
The mailbox interrupt register is not always cleared when a mailbox channel
is created. This can leave stale interrupt states from previous operations.

Fix this by explicitly clearing the interrupt register in the mailbox
channel creation function.

Fixes: b87f920b93 ("accel/amdxdna: Support hardware mailbox")
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251107181115.1293158-1-lizhi.hou@amd.com
2025-11-13 08:36:08 -08:00
Lizhi Hou
1750e652a9 accel/amdxdna: Treat power-off failure as unrecoverable error
Failing to set power off indicates an unrecoverable hardware or firmware
error. Update the driver to treat such a failure as a fatal condition
and stop further operations that depend on successful power state
transition.

This prevents undefined behavior when the hardware remains in an
unexpected state after a failed power-off attempt.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251106180521.1095218-1-lizhi.hou@amd.com
2025-11-07 08:52:29 -08:00
Lizhi Hou
dea9f84776 accel/amdxdna: Fix dma_fence leak when job is canceled
Currently, dma_fence_put(job->fence) is called in job notification
callback. However, if a job is canceled, the notification callback is never
invoked, leading to a memory leak. Move dma_fence_put(job->fence)
to the job cleanup function to ensure the fence is always released.

Fixes: aac243092b ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251105194140.1004314-1-lizhi.hou@amd.com
2025-11-06 09:23:42 -08:00
Lizhi Hou
3a0ff7b98a accel/amdxdna: Support preemption requests
The driver checks the firmware version during initialization.If preemption
is supported, the driver configures preemption accordingly and handles
userspace preemption requests. Otherwise, the driver returns an error for
userspace preemption requests.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251104185340.897560-1-lizhi.hou@amd.com
2025-11-05 08:56:28 -08:00
Lizhi Hou
e568dc3e62 accel/amdxdna: Add IOCTL parameter for telemetry data
Extend DRM_IOCTL_AMDXDNA_GET_INFO to include additional parameters
that allow collection of telemetry data.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251104062546.833771-3-lizhi.hou@amd.com
2025-11-04 09:04:21 -08:00