Commit Graph

13800 Commits

Author SHA1 Message Date
Tao Zhou
8b3495eafb drm/amdgpu: add socket id parameter for psp query address cmd
And set the socket id.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:54:54 -04:00
Shashank Sharma
f88a7dd06a drm/amdgpu: Add a NULL check for freeing root PT
This patch adds a NULL check to fix this crash reported during the
freeing of root PT entry:

 BUG: unable to handle page fault for address: ffffc9002d637aa0
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 RIP: 0010:amdgpu_vm_pt_free+0x66/0xe0 [amdgpu]
 PKRU: 55555554
 Call Trace:
 <TASK>
  amdgpu_vm_pt_free_root+0x60/0xa0 [amdgpu]
  amdgpu_vm_fini+0x2cb/0x5d0 [amdgpu]
  ? amdgpu_ctx_mgr_entity_fini+0x53/0x1c0 [amdgpu]
  amdgpu_driver_postclose_kms+0x191/0x2d0 [amdgpu]
  drm_file_free.part.0+0x1e5/0x260 [drm]

Cc: Christian König <Christian.Koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Acked-by: Christian König <Christian.Koenig@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:51:55 -04:00
Sunil Khatri
9022f01b97 drm/amdgpu: refactor code to split devcoredump code
Refractor devcoredump code into new files since its
functionality is expanded further and better to slit
and devcoredump to have its own file.

v2: Fix the build failure caught by arm compiler
of implicit function declaration with #ifdef

v3: squash in fix for implicit declaration error

Cc: Ivan Lipski <ivan.lipski@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:51:48 -04:00
Peyton Lee
9ddafd1d14 drm/amdgpu/vpe: power on vpe when hw_init
To fix mode2 reset failure.
Should power on VPE when hw_init.

Signed-off-by: Peyton Lee <peytolee@amd.com>
Reviewed-by: Lang Yu <lang.yu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:50:20 -04:00
Yang Wang
31fd330b97 drm/amdgpu: add ras event id support for ACA
add ras event id support for ACA.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:48:18 -04:00
Yang Wang
bd15bf742f drm/amdgpu: avoid update aca bank multi times during ras isr
Because the UE Valid MCA count will only be cleared after reset,
in order to avoid repeated counting of the error count,
the aca bank is only updated once during ras isr.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:48:11 -04:00
Yang Wang
f7bcfb7a56 drm/amdgpu: retrieve umc odecc error count for aca umc v12.0
retrieve umc odecc error count for aca umc v12.0

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:48:03 -04:00
Shashank Sharma
b6c4f90b38 drm/amdgpu: sync page table freeing with tlb flush
The idea behind this patch is to delay the freeing of PT entry objects
until the TLB flush is done.

This patch:
- Adds a tlb_flush_waitlist in amdgpu_vm_update_params which will keep the
  objects that need to be freed after tlb_flush.
- Adds PT entries in this list in amdgpu_vm_ptes_update after finding
  the PT entry.
- Changes functionality of amdgpu_vm_pt_free_dfs from (df_search + free)
  to simply freeing of the BOs, also renames it to
  amdgpu_vm_pt_free_list to reflect this same.
- Exports function amdgpu_vm_pt_free_list to be called directly.
- Calls amdgpu_vm_pt_free_list directly from amdgpu_vm_update_range.

V2: rebase
V4: Addressed review comments from Christian
    - add only locked PTEs entries in TLB flush waitlist.
    - do not create a separate function for list flush.
    - do not create a new lock for TLB flush.
    - there is no need to wait on tlb_flush_fence exclusively.

V5: Addressed review comments from Christian
    - change the amdgpu_vm_pt_free_dfs's functionality to simple freeing
      of the objects and rename it.
    - add all the PTE objects in params->tlb_flush_waitlist
    - let amdgpu_vm_pt_free_root handle the freeing of BOs independently
    - call amdgpu_vm_pt_free_list directly

V6: Rebase
V7: Rebase
V8: Added a NULL check to fix this backtrace issue:
[  415.351447] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  415.359245] #PF: supervisor write access in kernel mode
[  415.365081] #PF: error_code(0x0002) - not-present page
[  415.370817] PGD 101259067 P4D 101259067 PUD 10125a067 PMD 0
[  415.377140] Oops: 0002 [#1] PREEMPT SMP NOPTI
[  415.382004] CPU: 0 PID: 25481 Comm: test_with_MPI.e Tainted: G           OE     5.18.2-mi300-build-140423-ubuntu-22.04+ #24
[  415.394437] Hardware name: AMD Corporation Sh51p/Sh51p, BIOS RMO1001AS 02/21/2024
[  415.402797] RIP: 0010:amdgpu_vm_ptes_update+0x6fd/0xa10 [amdgpu]
[  415.409648] Code: 4c 89 ff 4d 8d 66 30 e8 f1 ed ff ff 48 85 db 74 42 48 39 5d a0 74 40 48 8b 53 20 48 8b 4b 18 48 8d 43 18 48 8d 75 b0 4c 89 ff <48
> 89 51 08 48 89 0a 49 8b 56 30 48 89 42 08 48 89 53 18 4c 89 63
[  415.430621] RSP: 0018:ffffc9000401f990 EFLAGS: 00010287
[  415.436456] RAX: ffff888147bb82f0 RBX: ffff888147bb82d8 RCX: 0000000000000000
[  415.444426] RDX: 0000000000000000 RSI: ffffc9000401fa30 RDI: ffff888161f80000
[  415.452397] RBP: ffffc9000401fa80 R08: 0000000000000000 R09: ffffc9000401fa00
[  415.460368] R10: 00000007f0cc0000 R11: 00000007f0c85000 R12: ffffc9000401fb20
[  415.468340] R13: 00000007f0d00000 R14: ffffc9000401faf0 R15: ffff888161f80000
[  415.476312] FS:  00007f132ff89840(0000) GS:ffff889f87c00000(0000) knlGS:0000000000000000
[  415.485350] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  415.491767] CR2: 0000000000000008 CR3: 0000000161d46003 CR4: 0000000000770ef0
[  415.499738] PKRU: 55555554
[  415.502750] Call Trace:
[  415.505482]  <TASK>
[  415.507825]  amdgpu_vm_update_range+0x32a/0x880 [amdgpu]
[  415.513869]  amdgpu_vm_clear_freed+0x117/0x250 [amdgpu]
[  415.519814]  amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu+0x18c/0x250 [amdgpu]
[  415.527729]  kfd_ioctl_unmap_memory_from_gpu+0xed/0x340 [amdgpu]
[  415.534551]  kfd_ioctl+0x3b6/0x510 [amdgpu]

V9: Addressed review comments from Christian
    - No NULL check reqd for root PT freeing
    - Free PT list regardless of needs_flush
    - Move adding BOs in list in a separate function

V10: Added Christian's RB
V11: squash in list fix

Cc: Christian König <Christian.Koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Reviewed-by: Christian König <Christian.Koenig@amd.com>
Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:47:17 -04:00
Hawking Zhang
a61e2ce9d4 drm/amdgpu: Enable smuio v14_0_2 callbacks
Enable smuio v14_0_2_callbacks

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:16 -04:00
Hawking Zhang
d80e44a34e drm/amdgpu: Add smuio callback to get gpu clk counter
Add smuio callback to get gpu clk counter

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:16 -04:00
Hawking Zhang
2d93151de8 drm/amdgpu: Add smuio v14_0_2 ip block support
Add smuio v14_0_2 ip block support

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:16 -04:00
Yang Wang
b93d759f54 drm/amdgpu: add umc v12.0.0 deferred error support
add umc v12.0.0 deferred error support.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
865d339763 drm/amdgpu: add aca deferred error type support
add aca deferred error type support

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Tao Zhou
2fc46e0b2f drm/amdgpu: make reset method configurable for RAS poison
Each RAS block has different requirement for gpu reset in poison
consumption handling.
Add support for mmhub RAS poison consumption handling.

v2: remove the mmhub poison support for kfd int v10.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
e3d4de8d8b drm/amdgpu: retire unused aca_bank_report data structure
retire unused aca_bank_report data structure.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Candice Li
f26c4e3fc9 drm/amdgpu: Update setting EEPROM table version
Use helper function instead of umc callback to set
EEPROM table version.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
69bf42fbb2 drm/amdgpu: refine aca error cache for umc v12.0
refine aca error cache for umc v12.0

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
87428b4054 drm/amdgpu: refine aca error cache for sdma v4.4.2
refine aca error cache for sdma v4.4.2

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
62d2aaa7d4 drm/amdgpu: refine aca error cache for xgmi v6.4.0
refine aca error cache for xgmi v6.4.0

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Tao Zhou
d8070c4241 drm/amdgpu: support utcl2 RAS poison query for mmhub
Support the query for both gfxhub and mmhub, also replace
xcc_id with hub_inst.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:14 -04:00
Tao Zhou
176c3e8956 drm/amdgpu: add utcl2 RAS poison query for mmhub
Add it for mmhub v1.8.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:14 -04:00
Yang Wang
5275114a70 drm/amdgpu: refine aca error cache for mmhub v1.8
refine aca error cache for mmhub v1.8

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:14 -04:00
Christian Koenig
d8a3f0a034 drm/amdgpu: implement TLB flush fence
The problem is that when (for example) 4k pages are replaced
with a single 2M page we need to wait for change to be flushed
out by invalidating the TLB before the PT can be freed.

Solve this by moving the TLB flush into a DMA-fence object which
can be used to delay the freeing of the PT BOs until it is signaled.

V2: (Shashank)
    - rebase
    - set dma_fence_error only in case of error
    - add tlb_flush fence only when PT/PD BO is locked (Felix)
    - use vm->pasid when f is NULL (Mukul)

V4: - add a wait for (f->dependency) in tlb_fence_work (Christian)
    - move the misplaced fence_create call to the end (Philip)

V5: - free the f->dependency properly

V6: (Shashank)
    - light code movement, moved all the clean-up in previous patch
    - introduce params.needs_flush and its usage in this patch
    - rebase without TLB HW sequence patch

V7:
   - Keep the vm->last_update_fence and tlb_cb code until
     we can fix the HW sequencing (Christian)
   - Move all the tlb_fence related code in a separate function so that
     its easier to read and review

V9: Addressed review comments from Christian
    - start PT update only when we have callback memory allocated

V10:
    - handle device unlock in OOM case (Christian, Mukul)
    - added Christian's R-B

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Reviewed-by: Shashank Sharma <shashank.sharma@amd.com>
Reviewed-by: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:14 -04:00
Yang Wang
e6136150cd drm/amdgpu: refine aca error cache for gfx v9.4.3
refine aca error cache for gfx 9.4.3

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:14 -04:00
Yang Wang
949899cbac drm/amdgpu: add new api to save error count into aca cache
add new api to save error count into aca cache.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:14 -04:00
Yang Wang
abc3b5d21d drm/amdgpu: add new aca_smu_type support
Add new types to distinguish between ACA error type and smu mca type.

e.g.:
the ACA_ERROR_TYPE_DEFERRED is not matched any smu mca valid bank
channel, so add new type 'aca_smu_type' to distinguish aca error type
and smu mca type.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:14 -04:00
Sunil Khatri
6fe4dab331 drm/amdgpu: remove the adev check for NULL
adev is a global data structure and isn't expected
to be NULL and hence removing the redundant adev
check from the devcoredump code.

Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:13 -04:00
Likun Gao
3cfaadbe0f drm/amdgpu: add support for atom fw version v3_5
Support for atom_firmware_info_v3_5.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:13 -04:00
Hawking Zhang
765bea0d73 drm/amdgpu: Apply retry to IP discovery v2 and v4
To ensure GPU driver touch the local framebuffer until
it is initialized by integrated firmware.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:13 -04:00
Yang Wang
9dc57c2adf drm/amdgpu: add ras event id support
add amdgpu ras event id support to better distinguish different
error information sources in dmesg logs.

the following log will be identify by event id:
{event_id} interrupt to inform RAS event
{event_id} ACA logs
{event_id} errors statistic since from current injection/error query
{event_id} errors statistic since from gpu load

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:13 -04:00
Zhigang Luo
ab66c83284 drm/amdgpu: trigger flr_work if reading pf2vf data failed
if reading pf2vf data failed 30 times continuously, it means something is
wrong. Need to trigger flr_work to recover the issue.

also use dev_err to print the error message to get which device has
issue and add warning message if waiting IDH_FLR_NOTIFICATION_CMPL
timeout.

Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com>
Acked-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:13 -04:00
Sunil Khatri
d72e2bdac4 drm/amdgpu: add the hw_ip version of all IP's
Add all the IP's version information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:11 -04:00
Victor Skvortsov
6bb89d1340 drm/amdgpu: Skip virt_exchange_init on SDMA poison consumption
Host will initiate an FLR in SDMA poison consumption scenario.
Guest should wait for FLR message to re-init data exchange.

Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:37:38 -04:00
Lijo Lazar
dfe9c3cde2 drm/amdgpu: Do a basic health check before reset
Check if the device is present in the bus before trying to recover. It
could be that device itself is lost from the bus in some hang
situations.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:37:38 -04:00
Shashank Sharma
97d5aa6030 drm/amdgpu: cleanup unused variable
This patch removes an unused input variable in the MES
doorbell function.

Cc: Christian König <Christian.Koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <Christian.Koenig@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:37:37 -04:00
Tao Zhou
0c501d3c11 drm/amdgpu: skip GFX FED error in page fault handling
Let kfd interrupt handler process it.

v2: return 0 instead of 1 for fed error.
drop the usage of strcmp in interrupt handler.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:37:36 -04:00
Tao Zhou
71a8d61ebc drm/amdgpu: retire gfx ras query_utcl2_poison_status
Replace it with related interface in gfxhub functions.

v2: replace node id with xcc id.
    get node id for query_utcl2_poison_status

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:37:36 -04:00
Sunil Khatri
3eb899c40a drm/amdgpu: add ring buffer information in devcoredump
Add relevant ringbuffer information such as
rptr, wptr,rb mask, ring name, ring size and also
the rings content for each ring on a gpu reset.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:37:36 -04:00
Sunil Khatri
583681d4a4 drm/amdgpu: add vm fault information to devcoredump
Add page fault information to the devcoredump.

Output of devcoredump:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed
Faulty page starting at address: 0x0000000000000000
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:37:36 -04:00
Tao Zhou
fb0f5f5414 drm/amdgpu: add utcl2 poison query for gfxhub
Implement it for gfxhub 1.0 and 1.2.

v2: input logical xcc id for poison query interface.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:37:36 -04:00
Sunil Khatri
dc406d92a0 drm/amdgpu: add recent pagefault info in vm_manager
Currently page fault information is stored per
vm and which could be freed or stale during
reset. Add it pagefault information in the
vm_manager which is a global space for vm's
and remains valid across.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:37:36 -04:00
Le Ma
ad550dbe8a drm/amdgpu: drop setting buffer funcs in sdma442
To fix the entity rq NULL issue. This setting has been moved
to upper level.

Fixes: b70438004a ("drm/amdgpu: move buffer funcs setting up a level")
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:36:29 -04:00
Lang Yu
1b7eec6bf3 Revert "drm/amdgpu/vpe: don't emit cond exec command under collaborate mode"
Ready now. Remove this workaround.
This reverts commit d40f6213b5.

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Tested-by: Alan Liu <haoping.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:12:59 -04:00
Ma Jun
03c6284df1 Revert "drm/amd/amdgpu: Fix potential ioremap() memory leaks in amdgpu_device_init()"
This patch causes the following iounmap erorr and calltrace
iounmap: bad address 00000000d0b3631f

The original patch was unjustified because amdgpu_device_fini_sw() will
always cleanup the rmmio mapping.

This reverts commit eb4f139888.

Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:12:59 -04:00
Hawking Zhang
9b3fec307f drm/amdgpu: Bypass display ta if display hw is not available
Do not load/invoke display TA if display hardware
is not available.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:12:58 -04:00
Prike Liang
43bda3e782 drm/amdgpu: correct the KGQ fallback message
Fix the KGQ fallback function name, as this will
help differentiate the failure in the KCQ enablement.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:12:58 -04:00
ZhenGuo Yin
56b30ac84c drm/amdgpu: Skip access PF-only registers on gfx10/gfxhub2_1 under SRIOV
[Why]
RLCG interface returns "out-of-range" error under SRIOV VF when accessing
PF-only registers.

[How]
Skip access PF-only registers on gfx10/gfxhub2_1 under SRIOV.

Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:12:57 -04:00
Ahmad Rehman
f679fd6057 drm/amdgpu: Init zone device and drm client after mode-1 reset on reload
In passthrough environment, when amdgpu is reloaded after unload, mode-1
is triggered after initializing the necessary IPs, That init does not
include KFD, and KFD init waits until the reset is completed. KFD init
is called in the reset handler, but in this case, the zone device and
drm client is not initialized, causing app to create kernel panic.

v2: Removing the init KFD condition from amdgpu_amdkfd_drm_client_create.
As the previous version has the potential of creating DRM client twice.

v3: v2 patch results in SDMA engine hung as DRM open causes VM clear to SDMA
before SDMA init. Adding the condition to in drm client creation, on top of v1,
to guard against drm client creation call multiple times.

Signed-off-by: Ahmad Rehman <Ahmad.Rehman@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:12:57 -04:00
Philip Yang
6c6064cbe5 drm/amdgpu: amdgpu_ttm_gart_bind set gtt bound flag
Otherwise after the GTT bo is released, the GTT and gart space is freed
but amdgpu_ttm_backend_unbind will not clear the gart page table entry
and leave valid mapping entry pointing to the stale system page. Then
if GPU access the gart address mistakely, it will read undefined value
instead page fault, harder to debug and reproduce the real issue.

Cc: stable@vger.kernel.org
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:12:57 -04:00
Saleemkhan Jamadar
6a7cbbc267 drm/amdgpu/vcn: enable vcn1 fw load for VCN 4_0_6
v1 - update the fw header for each vcn instance (Veera)

VCN1 has different FW binary in VCN v4_0_6.
Add changes to load the VCN1 fw binary

Signed-off-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Reviewed-by: Veerabadhran Gopalakrishnan <Veerabadhran.Gopalakrishnan@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:12:57 -04:00