Commit Graph

243 Commits

Author SHA1 Message Date
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Andrew Martin
08b1eb621c drm/amdgpu: Ignored various return code
The return code of a non void function should not be ignored. In cases
where we do not care, the code needs to suppress it.

Signed-off-by: Andrew Martin <andrew.martin@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-03 16:46:31 -05:00
Philip Yang
d4a814f400 drm/amdkfd: Move gfx9.4.3 and gfx 9.5 MQD to HBM
To reduce queue switch latency further, move MQD to VRAM domain, CP
access MQD and control stack via FB aperture, this requires contiguous
pages.

After MQD is initialized, updated or restored, flush HDP to guarantee
the data is written to HBM and GPU cache is invalidated, then CP will
read the new MQD.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-20 17:15:46 -05:00
Philip Yang
0cba5b27f1 drm/amdkfd: Add domain parameter to alloc kernel BO
To allocate kernel BO from VRAM domain for MQD in the following patch.
No functional change because kernel BO allocate all from GTT domain.

Rename amdgpu_amdkfd_alloc_gtt_mem to amdgpu_amdkfd_alloc_kernel_mem
Rename amdgpu_amdkfd_free_gtt_mem to amdgpu_amdkfd_free_kernel_mem
Rename mem_kfd_mem_obj gtt_mem to mem

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-14 14:28:59 -05:00
Xiaogang Chen
6cca686dfc drm/amdkfd: kfd driver supports hot unplug/replug amdgpu devices
This patch allows kfd driver function correctly when AMD gpu devices got
unplug/replug at run time.

When an AMD gpu device got unplug kfd driver gracefully terminates existing
kfd processes after stops all queues by sending SIGBUS to user process. After
that user space can still use remaining AMD gpu devices. When all AMD gpu
devices at system got removed kfd driver will not response new requests.

Unplugged AMD gpu devices can be re-plugged. kfd driver will use added devices
to function as usual.

The purpose of this patch is having kfd driver behavior as expected during and
after AMD gpu devices unplug/replug at run time.

Signed-off-by: Xiaogang Chen <Xiaogang.Chen@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-14 14:28:49 -05:00
Mario Limonciello (AMD)
e291729873 drm/amd: Convert DRM_*() to drm_*()
The drm_*() macros include the device which is helpful for debugging
issues in multi-GPU systems.

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 16:59:55 -05:00
YiPeng Chai
6a37539973 drm/amdgpu: Fix the issue of missing ras message on sriov host
This code only applies to amdgpu processing
poison consumption after uniras is enabled,
but not to sriov.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-14 11:28:15 -05:00
YiPeng Chai
3d72d2e5f4 drm/amdgpu: Intercept ras interrupts to ras module
Intercept ras interrupts to ras module.

V2:
  Change function names in ras module.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20 18:18:26 -04:00
Alex Deucher
2ade36eaa9 drm/amdkfd: add proper handling for S0ix
When in S0i3, the GFX state is retained, so all we need to do
is stop the runlist so GFX can enter gfxoff.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: David Perry <david.perry@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4bfa860993)
Cc: stable@vger.kernel.org
2025-09-18 14:59:24 -04:00
Dave Airlie
7e2818386a Merge tag 'amd-drm-next-6.17-2025-07-01' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.17-2025-07-01:

amdgpu:
- FAMS2 fixes
- OLED fixes
- Misc cleanups
- AUX fixes
- DMCUB updates
- SR-IOV hibernation support
- RAS updates
- DP tunneling fixes
- DML2 fixes
- Backlight improvements
- Suspend improvements
- Use scaling for non-native modes on eDP
- SDMA 4.4.x fixes
- PCIe DPM fixes
- SDMA 5.x fixes
- Cleaner shader updates for GC 9.x
- Remove fence slab
- ISP genpd support
- Parition handling rework
- SDMA FW checks for userq support
- Add missing firmware declaration
- Fix leak in amdgpu_ctx_mgr_entity_fini()
- Freesync fix
- Ring reset refactoring
- Legacy dpm verbosity changes

amdkfd:
- GWS fix
- mtype fix for ext coherent system memory
- MMU notifier fix
- gfx7/8 fix

radeon:
- CS validation support for additional GL extensions
- Bump driver version for new CS validation checks

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250701194707.32905-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-07-04 10:06:29 +10:00
Emily Deng
54f7a24e14 drm/amdkfd: Move the process suspend and resume out of full access
For the suspend and resume process, exclusive access is not required.
Therefore, it can be moved out of the full access section to reduce the
duration of exclusive access.

v3:
Move suspend processes before hardware fini.
Remove twice call for bare metal.

v4:
Refine code

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:19 -04:00
Jonathan Kim
96f75f9594 drm/amdkfd: allow compute partition mode switch with cgroup exclusions
The KFD currently bars a compute partition mode switch while a KFD
process exists.

Since cgroup excluded devices remain excluded for the lifetime of a KFD
process and user space is able to mode switch single devices, allow
users to mode switch a device with any running process that has been
cgroup excluded from this device.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:17 -04:00
Thomas Zimmermann
c598d5eb9f Merge drm/drm-next into drm-misc-next
Backmerging to forward to v6.16-rc1

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2025-06-11 09:01:34 +02:00
Lang Yu
30837a49bd drm/amdkfd: Map wptr BO to GART unconditionally
For simulation C models that don't run CP FW where adev->mes.sched_version
is not populated correctly. This causes NULL dereference in
amdgpu_amdkfd_free_gtt_mem(dev->adev, (void **)&pqn->q->wptr_bo_gart)
and warning on unpinned BO in amdgpu_bo_gpu_offset(q->properties.wptr_bo).

Compared with adding version check here and there,
always map wptr BO to GART simplifies things.

v2: Add NULL check in amdgpu_amdkfd_free_gtt_mem.(Philip)

Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:58:44 -04:00
Pierre-Eric Pelloux-Prayer
2956554823 drm/sched: Store the drm client_id in drm_sched_fence
This will be used in a later commit to trace the drm client_id in
some of the gpu_scheduler trace events.

This requires changing all the users of drm_sched_job_init to
add an extra parameter.

The newly added drm_client_id field in the drm_sched_fence is a bit
of a duplicate of the owner one. One suggestion I received was to
merge those 2 fields - this can't be done right now as amdgpu uses
some special values (AMDGPU_FENCE_OWNER_*) that can't really be
translated into a client id. Christian is working on getting rid of
those; when it's done we should be able to squash owner/drm_client_id
together.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250526125505.2360-3-pierre-eric.pelloux-prayer@amd.com
2025-05-28 16:15:58 +02:00
Jonathan Kim
9424a5bf08 drm/amdgpu: simplify xgmi peer info calls
Deprecate KFD XGMI peer info calls in favour of calling directly from
simplified XGMI peer info functions.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-25 11:45:12 -05:00
Alex Deucher
8b0d068e7d drm/amdkfd: add a new flag to manage where VRAM allocations go
On big and small APUs we send KFD VRAM allocations to GTT
since the carve out is either non-existent or relatively
small.  However, if someone sets the carve out size to be
relatively large, we may end up using GTT rather than VRAM.

No change of logic with this patch, but it allows the
driver to determine which logic to use based on the
carve out size in the future.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12 21:04:08 -05:00
Linus Torvalds
96c84703f1 Merge tag 'drm-next-2025-01-17' of https://gitlab.freedesktop.org/drm/kernel
Pull drm updates from Dave Airlie:
 "There are two external interactions of note, the msm tree pull in some
  opp tree, hopefully the opp tree arrives from the same git tree
  however it normally does.

  There is also a new cgroup controller for device memory, that is used
  by drm, so is merging through my tree. This will hopefully help open
  up gpu cgroup usage a bit more and move us forward.

  There is a new accelerator driver for the AMD XDNA Ryzen AI NPUs.

  Then the usual xe/amdgpu/i915/msm leaders and lots of changes and
  refactors across the board:

  core:
   - device memory cgroup controller added
   - Remove driver date from drm_driver
   - Add drm_printer based hex dumper
   - drm memory stats docs update
   - scheduler documentation improvements

  new driver:
   - amdxdna - Ryzen AI NPU support

  connector:
   - add a mutex to protect ELD
   - make connector setup two-step

  panels:
   - Introduce backlight quirks infrastructure
   - New panels: KDB KD116N2130B12, Tianma TM070JDHG34-00,
   - Multi-Inno Technology MI1010Z1T-1CP11

  bridge:
   - ti-sn65dsi83: Add ti,lvds-vod-swing optional properties
   - Provide default implementation of atomic_check for HDMI bridges
   - it605: HDCP improvements, MCCS Support

  xe:
   - make OA buffer size configurable
   - GuC capture fixes
   - add ufence and g2h flushes
   - restore system memory GGTT mappings
   - ioctl fixes
   - SRIOV PF scheduling priority
   - allow fault injection
   - lots of improvements/refactors
   - Enable GuC's WA_DUAL_QUEUE for newer platforms
   - IRQ related fixes and improvements

  i915:
   - More accurate engine busyness metrics with GuC submission
   - Ensure partial BO segment offset never exceeds allowed max
   - Flush GuC CT receive tasklet during reset preparation
   - Some DG2 refactor to fix DG2 bugs when operating with certain CPUs
   - Fix DG1 power gate sequence
   - Enabling uncompressed 128b/132b UHBR SST
   - Handle hdmi connector init failures, and no HDMI/DP cases
   - More robust engine resets on Haswell and older

  i915/xe display:
   - HDCP fixes for Xe3Lpd
   - New GSC FW ARL-H/ARL-U
   - support 3 VDSC engines 12 slices
   - MBUS joining sanitisation
   - reconcile i915/xe display power mgmt
   - Xe3Lpd fixes
   - UHBR rates for Thunderbolt

  amdgpu:
   - DRM panic support
   - track BO memory stats at runtime
   - Fix max surface handling in DC
   - Cleaner shader support for gfx10.3 dGPUs
   - fix drm buddy trim handling
   - SDMA engine reset updates
   - Fix doorbell ttm cleanup
   - RAS updates
   - ISP updates
   - SDMA queue reset support
   - Rework DPM powergating interfaces
   - Documentation updates and cleanups
   - DCN 3.5 updates
   - Use a pm notifier to more gracefully handle VRAM eviction on
     suspend or hibernate
   - Add debugfs interfaces for forcing scheduling to specific engine
     instances
   - GG 9.5 updates
   - IH 4.4 updates
   - Make missing optional firmware less noisy
   - PSP 13.x updates
   - SMU 13.x updates
   - VCN 5.x updates
   - JPEG 5.x updates
   - GC 12.x updates
   - DC FAMS updates

  amdkfd:
   - GG 9.5 updates
   - Logging improvements
   - Shader debugger fixes
   - Trap handler cleanup
   - Cleanup includes
   - Eviction fence wq fix

  msm:
   - MDSS:
      - properly described UBWC registers
      - added SM6150 (aka QCS615) support
   - DPU:
      - added SM6150 (aka QCS615) support
      - enabled wide planes if virtual planes are enabled (by using two
        SSPPs for a single plane)
      - added CWB hardware blocks support
   - DSI:
      - added SM6150 (aka QCS615) support
   - GPU:
      - Print GMU core fw version
      - GMU bandwidth voting for a740 and a750
      - Expose uche trap base via uapi
      - UAPI error reporting

  rcar-du:
   - Add r8a779h0 Support

  ivpu:
   - Fix qemu crash when using passthrough

  nouveau:
   - expose GSP-RM logging buffers via debugfs

  panfrost:
   - Add MT8188 Mali-G57 MC3 support

  rockchip:
   - Gamma LUT support

  hisilicon:
   - new HIBMC support

  virtio-gpu:
   - convert to helpers
   - add prime support for scanout buffers

  v3d:
   - Add DRM_IOCTL_V3D_PERFMON_SET_GLOBAL

  vc4:
   - Add support for BCM2712

  vkms:
   - line-per-line compositing algorithm to improve performance

  zynqmp:
   - Add DP audio support

  mediatek:
   - dp: Add sdp path reset
   - dp: Support flexible length of DP calibration data

  etnaviv:
   - add fdinfo memory support
   - add explicit reset handling"

* tag 'drm-next-2025-01-17' of https://gitlab.freedesktop.org/drm/kernel: (1070 commits)
  drm/bridge: fix documentation for the hdmi_audio_prepare() callback
  doc/cgroup: Fix title underline length
  drm/doc: Include new drm-compute documentation
  cgroup/dmem: Fix parameters documentation
  cgroup/dmem: Select PAGE_COUNTER
  kernel/cgroup: Remove the unused variable climit
  drm/display: hdmi: Do not read EDID on disconnected connectors
  drm/tests: hdmi: Add connector disablement test
  drm/connector: hdmi: Do atomic check when necessary
  drm/amd/display: 3.2.316
  drm/amd/display: avoid reset DTBCLK at clock init
  drm/amd/display: improve dpia pre-train
  drm/amd/display: Apply DML21 Patches
  drm/amd/display: Use HW lock mgr for PSR1
  drm/amd/display: Revised for Replay Pseudo vblank control
  drm/amd/display: Add a new flag for replay low hz
  drm/amd/display: Remove unused read_ono_state function from Hwss module
  drm/amd/display: Do not elevate mem_type change to full update
  drm/amd/display: Do not wait for PSR disable on vbl enable
  drm/amd/display: Remove unnecessary eDP power down
  ...
2025-01-21 16:09:47 -08:00
Kenneth Feng
90505894c4 drm/amdgpu: disable gfxoff with the compute workload on gfx12
Disable gfxoff with the compute workload on gfx12. This is a
workaround for the opencl test failure.

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2affe2bbc9)
Cc: stable@vger.kernel.org # 6.12.x
2025-01-14 11:54:07 -05:00
Christian König
11815bb0e3 drm/amdgpu: partially revert "reduce reset time"
This partially reverts commit 194eb174cb.

This commit introduced a new state variable into adev without even
remotely worrying about CPU barriers.

Since we already have the amdgpu_in_reset() function exactly for this
use case partially revert that.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:07 -05:00
Andrew Martin
357ef5b3b7 drm/amdgpu: Failed to check various return code
Clean up code to quiet the compiler on us failing to check the return
code.

Signed-off-by: Andrew Martin <Andrew.Martin@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:18:20 -05:00
Boyuan Zhang
80d8051124 drm/amdgpu: pass ip_block in set_powergating_state
Pass ip_block instead of adev in set_powergating_state callback function.
Modify set_powergating_state ip functions for all correspoding ip blocks.

v2: fix a ip block index error.

v3: remove type casting

Signed-off-by: Boyuan Zhang <boyuan.zhang@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-10 10:26:47 -05:00
Lijo Lazar
fa31798582 drm/amdgpu: Fix map/unmap queue logic
In current logic, it calls ring_alloc followed by a ring_test. ring_test
in turn will call another ring_alloc. This is illegal usage as a
ring_alloc is expected to be closed properly with a ring_commit. Change
to commit the map/unmap queue packet first followed by a ring_test. Add a
comment about the usage of ring_test.

Also, reorder the current pre-condition checks of job hang or kiq ring
scheduler not ready. Without them being met, it is not useful to attempt
ring or memory allocations.

Fixes tag refers to the original patch which introduced this issue which
then got carried over into newer code.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fixes: 6c10b5cc4e ("drm/amdgpu: Remove duplicate code in gfx_v8_0.c")
2024-11-08 11:10:00 -05:00
Alex Deucher
8fe7cf58ff drm/amdkfd: add an interface to query whether is KFD is active
Add an interface to query whether KFD has any active queues.

v2: fix build issues

Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-11-04 11:25:42 -05:00
Feifei Xu
3eebfd5e9c drm/amdkfd:Add kfd function to config sq perfmon
Expose the interface for kfd to config sq perfmon.

Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:17 -04:00
Hawking Zhang
b05d6476ae drm/amdgpu: Retire query_utcl2_poison_status callback
Driver switches to interrupt source id to identify
utcl2 poison event. polling interface is not needed.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-23 10:53:16 -04:00
Amber Lin
234eebe161 drm/amdkfd: APIs to stop/start KFD scheduling
Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling
compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling.
When amdgpu_amdkfd_stop_sched is called, KFD will unmap queues from
runlist. If users send ioctls to KFD to create queues, they'll be added
but those queues won't be mapped to runlist (so not scheduled) until
amdgpu_amdkfd_start_sched is called.

v2: fix build (Alex)

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:07:28 -04:00
Philip Yang
c86ad39140 drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer
Pass pointer reference to amdgpu_bo_unref to clear the correct pointer,
otherwise amdgpu_bo_unref clear the local variable, the original pointer
not set to NULL, this could cause use-after-free bug.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23 17:34:44 -04:00
Eric Huang
dbe2c4c8ab drm/amdkfd: add reset cause in gpu pre-reset smi event
reset cause is requested by customer as additional
info for gpu reset smi event.

v2: integerate reset sources suggested by Lijo Lazar

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05 11:25:14 -04:00
Lang Yu
89773b8559 drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
Small APUs(i.e., consumer, embedded products) usually have a small
carveout device memory which can't satisfy most compute workloads
memory allocation requirements.

We can't even run a Basic MNIST Example with a default 512MB carveout.
https://github.com/pytorch/examples/tree/main/mnist. Error Log:

"torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate
84.00 MiB. GPU 0 has a total capacity of 512.00 MiB of which 0 bytes
is free. Of the allocated memory 103.83 MiB is allocated by PyTorch,
and 22.17 MiB is reserved by PyTorch but unallocated"

Though we can change BIOS settings to enlarge carveout size,
which is inflexible and may bring complaint. On the other hand,
the memory resource can't be effectively used between host and device.

The solution is MI300A approach, i.e., let VRAM allocations go to GTT.
Then device and host can flexibly and effectively share memory resource.

v2: Report local_mem_size_private as 0. (Felix)

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-08 15:17:04 -04:00
YiPeng Chai
bfa579b38b drm/amdgpu: prepare to handle pasid poison consumption
Prepare to handle pasid poison consumption.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:42 -04:00
Tao Zhou
2fc46e0b2f drm/amdgpu: make reset method configurable for RAS poison
Each RAS block has different requirement for gpu reset in poison
consumption handling.
Add support for mmhub RAS poison consumption handling.

v2: remove the mmhub poison support for kfd int v10.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Tao Zhou
d8070c4241 drm/amdgpu: support utcl2 RAS poison query for mmhub
Support the query for both gfxhub and mmhub, also replace
xcc_id with hub_inst.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:14 -04:00
Tao Zhou
71a8d61ebc drm/amdgpu: retire gfx ras query_utcl2_poison_status
Replace it with related interface in gfxhub functions.

v2: replace node id with xcc id.
    get node id for query_utcl2_poison_status

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:37:36 -04:00
Ahmad Rehman
f679fd6057 drm/amdgpu: Init zone device and drm client after mode-1 reset on reload
In passthrough environment, when amdgpu is reloaded after unload, mode-1
is triggered after initializing the necessary IPs, That init does not
include KFD, and KFD init waits until the reset is completed. KFD init
is called in the reset handler, but in this case, the zone device and
drm client is not initialized, causing app to create kernel panic.

v2: Removing the init KFD condition from amdgpu_amdkfd_drm_client_create.
As the previous version has the potential of creating DRM client twice.

v3: v2 patch results in SDMA engine hung as DRM open causes VM clear to SDMA
before SDMA init. Adding the condition to in drm client creation, on top of v1,
to guard against drm client creation call multiple times.

Signed-off-by: Ahmad Rehman <Ahmad.Rehman@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:12:57 -04:00
Lijo Lazar
e1f6746f33 drm/amdkfd: Skip packet submission on fatal error
If fatal error is detected, packet submission won't go through. Return
error in such cases. Also, avoid waiting for fence when fatal error is
detected.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-26 11:14:31 -05:00
Le Ma
c0125b848a drm/amdgpu: move the drm client creation behind drm device registration
This patch is to eliminate interrupt warning below:

  "[drm] Fence fallback timer expired on ring sdma0.0".

An early vm pt clearing job is sent to SDMA ahead of interrupt enabled.
And re-locating the drm client creation following after drm_dev_register
looks like a more proper flow.

v2: wrap the drm client creation

Fixes: 1819200166 ("drm/amdkfd: Export DMABufs from KFD using GEM handles")
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-29 15:35:13 -05:00
YiPeng Chai
ed1e1e42fd drm/amdgpu: Support passing poison consumption ras block to SRIOV
Support passing poison consumption ras blocks
to SRIOV.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25 14:58:03 -05:00
Ori Messinger
776b0953ab drm/amdgpu: Enable GFXOFF for Compute on GFX11
On GFX version 11, GFXOFF was disabled due to a MES KIQ firmware
issue, which has since been fixed after version 64.
This patch only re-enables GFXOFF for GFX version 11 if the GPU's
MES KIQ firmware version is newer than version 64.

V2: Keep GFXOFF disabled on GFX11 if MES KIQ is below version 64.
V3: Add parentheses to avoid GCC warning for parentheses:
"suggest parentheses around comparison in operand of ‘&’"
V4: Remove "V3" from commit title
V5: Change commit description and insert 'Acked-by'

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-18 15:46:56 -05:00
Flora Cui
733965a90f drm/amdkfd: init drm_client with funcs hook
otherwise drm_client_dev_unregister() would try to
kfree(&adev->kfd.client).

Fixes: 1819200166 ("drm/amdkfd: Export DMABufs from KFD using GEM handles")
Signed-off-by: Flora Cui <flora.cui@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-18 15:41:08 -05:00
Felix Kuehling
1819200166 drm/amdkfd: Export DMABufs from KFD using GEM handles
Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM
handles are created in a drm_client_dev context to avoid exposing them
in user mode contexts through a DMABuf import.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-13 15:09:55 -05:00
Jonathan Kim
35c425f5cc drm/amdgpu: update xgmi num links info post gc9.4.2
GC IP 9.4.2 and up support TA reporting of the number
of xGMI links between peers.

Tested-by: Vignesh Chander <vignesh.chander@amd.com>
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-29 16:48:45 -05:00
Felix Kuehling
94e2dae0a8 drm/amdkfd: Move TLB flushing logic into amdgpu
This will make it possible for amdgpu GEM ioctls to flush TLBs on compute
VMs.

This removes VMID-based TLB flushing and always uses PASID-based
flushing. This still works because it scans the VMID-PASID mapping
registers to find the right VMID. It's only slightly less efficient. This
is not a production use case.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-17 09:29:53 -05:00
Jesse Zhang
e341631f4a drm/amdgpu: disable GFXOFF and PG during compute for GFX9
Temporary workaround to fix issues observed in some compute
applications when GFXOFF is enabled on GFX9.

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-13 11:33:56 -04:00
Rajneesh Bhardwaj
f4bff6e0b9 drm/amdgpu: Use ttm_pages_limit to override vram reporting
On GFXIP9.4.3 APU, allow the memory reporting as per the ttm pages
limit in NPS1 mode.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-04 18:36:42 -04:00
Alex Deucher
0021d70a06 drm/amdkfd: drop struct kfd_cu_info
I think this was an abstraction back from when
kfd supported both radeon and amdgpu.  Since we just
support amdgpu now, there is no more need for this and
we can use the amdgpu structures directly.

This also avoids having the kfd_cu_info structures on
the stack when inlining which can blow up the stack.

Cc: Arnd Bergmann <arnd@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-03 15:41:13 -04:00
Lijo Lazar
4e8303cf2c drm/amdgpu: Use function for IP version check
Use an inline function for version check. Gives more flexibility to
handle any format changes.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 12:23:28 -04:00
Mukul Joshi
68fa72a437 drm/amdgpu: Rename KGD_MAX_QUEUES to AMDGPU_MAX_QUEUES
Rename KGD_MAX_QUEUES to AMDGPU_MAX_QUEUES to conform with
the naming convention followed in amdgpu_gfx.h. No functional
change.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:15:41 -04:00
Mukul Joshi
f705a6f021 drm/amdgpu: Store CU info from all XCCs for GFX v9.4.3
Currently, we store CU info only for a single XCC assuming
that it is the same for all XCCs. However, that may not be
true. As a result, store CU info for all XCCs. This info is
later used for CU masking.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:10:19 -04:00