Commit Graph

283 Commits

Author SHA1 Message Date
Mukul Joshi
d1d22df174 drm/amdkfd: Populate memory info before adding GPU node to topology
The local memory info needs to be fetched before the GPU node is added
to topology. Without this, the sysfs is incorrectly populated and the
size is reported as 0. This was causing rocr tests to fail. This issue
was caused because of a bad merge.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:44:22 -04:00
Mukul Joshi
1bd6dd21fc drm/amdkfd: Add SDMA info for SDMA 4.4.2
Update SDMA queue information for SDMA 4.4.2.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:44:18 -04:00
Jonathan Kim
92085240ef drm/amdkfd: add gpu compute cores io links for gfx9.4.3
The PSP TA will only provide xGMI topology info for links between GPU
sockets so links between partitions from different sockets will be
hardcoded as 3 xGMI hops with 1 hops weighted as xGMI and 2 hops
weighted with a new intra-socket weight to indicate the longest
possible distance.

If the link between a partition and the CPU is non-PCIe, then assume
the CPU (CCDs) is located within the same socket as the partition
and represent the link as an intra-socket weighted single hop XGMI link
with memory bandwidth.

Links between partitions within a single socket will be abstracted as
single hop xGMI links weighted with the new intra-socket weight and
will have memory bandwidth.

Finally, use the unused function bits in the location ID to represent the
coordinates of the compute partition within its socket.

A follow on patch will resolve the requirement for GPU socket xGMI
link representation sometime later.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:44:07 -04:00
Mukul Joshi
a805889a15 drm/amdkfd: Update SDMA queue management for GFX9.4.3
This patch updates SDMA queue management for multi XCC in GFX9.4.3.
- Allocate/deallocate SDMA queues from the correct SDMA engines
  based on the partition mode.
- Updates the kgd2kfd interface to fetch the correct SDMA register
  addresses.
- It also fixes dumping correct SDMA queue info in debugfs.

v2: squash in fix "drm/amdkfd: Fix XGMI SDMA user-mode queue allocation"

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:43:05 -04:00
Mukul Joshi
74c5b85da7 drm/amdkfd: Add spatial partitioning support in KFD
This patch introduces multi-partition support in KFD.
This patch includes:
- Support for maximum 8 spatial partitions in KFD.
- Initialize one HIQ per partition.
- Management of VMID range depending on partition mode.
- Management of doorbell aperture space between all
  partitions.
- Each partition does its own queue management, interrupt
  handling, SMI event reporting.
- IOMMU, if enabled with multiple partitions, will only work
  on first partition.
- SPM is only supported on the first partition.
- Currently, there is no support for resetting individual
  partitions. All partitions will reset together.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Tested-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:42:33 -04:00
Mukul Joshi
8dc1db3172 drm/amdkfd: Introduce kfd_node struct (v5)
Introduce a new structure, kfd_node, which will now represent
a compute node. kfd_node is carved out of kfd_dev structure.
kfd_dev struct now will become the parent of kfd_node, and will
store common resources such as doorbells, GTT sub-alloctor etc.
kfd_node struct will store all resources specific to a compute
node, such as device queue manager, interrupt handling etc.

This is the first step in adding compute partition support in KFD.

v2: introduce kfd_node struct to gc v11 (Hawking)
v3: make reference to kfd_dev struct through kfd_node (Morris)
v4: use kfd_node instead for kfd isr/mqd functions (Morris)
v5: rebase (Alex)

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Tested-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Morris Zhang <Shiwu.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:42:27 -04:00
Amber Lin
f544afac3f drm/amdgpu: Add kgd2kfd for GC 9.4.3
New GC (v9.4.3) and ATHUB (v1.8.0) versions
are used. Add kgd_gfx_v9_4_3_*
functions if registers in use of kgd_gfx_v9_*
functions are changed or have different offset.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:40:40 -04:00
Graham Sider
70bdfedaae drm/amdkfd: Add gfx_target_version for GC 9.4.3
Required for Thunk GFX version sysfs query.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-14 13:47:48 -04:00
Sreekant Somasekharan
00fa40353b drm/amdkfd: Check PCIe atomics support on GFX11 to set CP_HQD_HQ_STATUS0[29]
CP_HQD_HQ_STATUS0[29] bit will be used by CPFW to acknowledge whether
PCIe atomics are supported. The default value of this bit is set
to 0. Driver will check whether PCIe atomics are supported and set the
bit to 1 if supported. This will force CPFW to use real atomic ops.
If the bit is not set, CPFW will default to read/modify/write using the
firmware itself.

This is applicable only to GFX11 RS64 CP with MEC FW >= 509. If MEC
FW < 509 and for all GFX11 F32 CP, PCIe atomics needs to be supported
else it will skip the device.

This commit also involves moving amdgpu_amdkfd_device_probe() function
call after per-IP early_init loop in amdgpu_device_ip_early_init()
function so as to check for RS64 enabled device.

Signed-off-by: Sreekant Somasekharan <sreekant.somasekharan@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-11 18:03:45 -04:00
Jay Cornwall
1d44ff3d7a drm/amdkfd: Trap handler changes for GC 9.4.3 v2
v1:
Check new exception bits in TRAPSTS register
Remove single step exception workaround, now part of
exception bits

v2:
GC 9.4.3 uses ttmp11 to store {1’b0, dispatch index [24:0],
wave_id_in_workgroup[5:0]}, so use ttmp13 instead of ttmp11 to
preserve ib_sts. (Laurent)

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Signed-off-by: Laurent Morichetti <Laurent.Morichetti@amd.com>
Reviewed-by: Laurent Morichetti <laurent.morichetti@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-31 11:18:55 -04:00
Hawking Zhang
828d9a872c drm/amdkfd: Add GC 9.4.3 KFD support
Add initial KFD support
Convert a few structures to IP version checking (Hawking)

Signed-off-by: Elena Sakhnovitch <elena.sakhnovitch@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-31 11:18:55 -04:00
Felix Kuehling
7ee938ac00 drm/amdgpu: Don't resume IOMMU after incomplete init
Check kfd->init_complete in kgd2kfd_iommu_resume, consistent with other
kgd2kfd calls. This should fix IOMMU errors on resume from suspend when
KFD IOMMU initialization failed.

Reported-by: Matt Fagnani <matt.fagnani@bell.net>
Link: https://lore.kernel.org/r/4a3b225c-2ffd-e758-4de1-447375e34cad@bell.net/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217170
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2454
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info>
Cc: stable@vger.kernel.org
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Tested-by: Matt Fagnani <matt.fagnani@bell.net>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Alex Deucher
f200521899 drm/amdkfd: simplify cases
A number of the gfx8 cases were the same.  Clean them
up.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-05 11:41:57 -05:00
Yifan Zhang
88c21c2b56 drm/amdkfd: add GC 11.0.4 KFD support
Add initial support for GC 11.0.4 in KFD compute driver.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Aaron Liu <aaron.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-29 11:03:36 -05:00
Jonathan Kim
beb15bc1c6 drm/amdkfd: enable cooperative launch for gfx10.3
FW fix available to enable cooperative launch for GFX10.3.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-17 18:08:58 -05:00
Mukul Joshi
d69a3b762d drm/amdkfd: Cleanup kfd_dev struct
Cleanup kfd_dev struct by removing ddev and pdev as both
drm_device and pci_dev can be fetched from amdgpu_device.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Tested-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27 15:12:09 -04:00
David Belanger
5ddb5fe9e5 drm/amdkfd: Added GFX 11.0.3 Support
Added missing cases for GFX 11.0.3 code in a few switch statements.

Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-30 16:37:11 -04:00
Prike Liang
2724efa389 drm/amdkfd: Fix isa version for the GC 10.3.7
Correct the isa version for handling KFD test.

Fixes: 7c4f4f197e ("drm/amdkfd: Add GC 10.3.6 and 10.3.7 KFD definitions")
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Aaron Liu <aaron.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-25 13:35:17 -04:00
Yifan Zhang
e48e6a131d drm/amdkfd: reserve 2 queues for sdma 6.0.1 in bitmap
There is only one engine in sdma 6.0.1, the total number of
reserved queues should be 2, reflect this number in bitmap as well.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Tim Huang <Tim.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16 18:06:23 -04:00
Prike Liang
c4e8555119 drm/amdkfd: correct the MEC atomic support firmware checking for GC 10.3.7
On the GC 10.3.7 platform the initial MEC release version #3 can support
atomic operation,so need correct and set its MEC atomic support version to #3.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Aaron Liu <aaron.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-13 11:25:16 -04:00
Philip Yang
c7f21978fa drm/amdkfd: Add user queue eviction restore SMI event
Output user queue eviction and restore event. User queue eviction may be
triggered by svm or userptr MMU notifier, TTM eviction, device suspend
and CRIU checkpoint and restore.

User queue restore may be rescheduled if eviction happens again while
restore.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-30 15:31:14 -04:00
Yifan Zhang
efb4fd107c drm/amdkfd: correct sdma queue number of sdma 6.0.1
sdma 6.0.1 has 8 queues instead of 2.

Fixes: 26776a7031 ("drm/amdkfd: add GC 11.0.1 KFD support")
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Tim Huang <Tim.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-21 18:17:24 -04:00
Jesse Zhang
68e355c00f drm/amdkfd:Fix fw version for 10.3.6
fix fw error when loading fw for 10.3.6

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-08 11:39:56 -04:00
Mario Limonciello
7c4f4f197e drm/amdkfd: Add GC 10.3.6 and 10.3.7 KFD definitions
Loading amdgpu on GC 10.3.7 shows an ERR level message:
`kfd kfd: amdgpu: GC IP 0a0307 not supported in kfd`

Add these targets to match yellow carp structures.

Reported-by: David Chang <david.chang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Jesse(Jie) Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 5.18.x
2022-06-03 16:26:36 -04:00
Jay Cornwall
6a8170383c drm/amdkfd: Add gfx11 trap handler
Based on gfx10 with following changes:

- GPR_ALLOC.VGPR_SIZE field moved (and size corrected in gfx10)
- s_sendmsg_rtn_b64 replaces some s_sendmsg/s_getreg
- Buffer instructions no longer have direct-to-LDS modifier

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Laurent Morichetti <laurent.morichetti@amd.com>
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-26 14:56:33 -04:00
Huang Rui
26776a7031 drm/amdkfd: add GC 11.0.1 KFD support
Add initial support for GC 11.0.1 in KFD compute driver.

Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Xiaojian Du <Xiaojian.Du@amd.com>
Reviewed-by: Aaron Liu <aaron.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-06 10:36:15 -04:00
Eric Huang
ec661f1ca4 drm/amdkfd: add asic support for GC 11.0.2
Changes are inherited from GC 11.0.0.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-05 16:52:48 -04:00
Eric Huang
22dd871e2b drm/amdkfd: add asic support for SDMA 6.0.2
It is inherited from SDMA 6.0.0.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-05 16:52:30 -04:00
Mukul Joshi
cc009e613d drm/amdkfd: Add KFD support for soc21 v3
Add initial support for soc21 in KFD compute
driver (Mukul)
- Add new definition for soc21 device.
- Add new file for amdgpu-kfd interface for GFX11 family.
- Add new file for queue management, interrupt handling,
  mqd management for GFX11 family in KFD driver.
- Related changes/updates for soc21 device in
  KFD driver.
- Repurpose last 2 entries of SDMA MQD for driver use.

v2: Add an optional argument into update queue operation (Mukul)

v3: Switch to ip version check, replace kgd_dev with
    amdgpu_device (Hawking)

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:43:54 -04:00
Mukul Joshi
b179fc28d5 drm/amdkfd: Fix circular lock dependency warning
[  168.544078] ======================================================
[  168.550309] WARNING: possible circular locking dependency detected
[  168.556523] 5.16.0-kfd-fkuehlin #148 Tainted: G            E
[  168.562558] ------------------------------------------------------
[  168.568764] kfdtest/3479 is trying to acquire lock:
[  168.573672] ffffffffc0927a70 (&topology_lock){++++}-{3:3}, at:
		kfd_topology_device_by_id+0x16/0x60 [amdgpu] [  168.583663]
                but task is already holding lock:
[  168.589529] ffff97d303dee668 (&mm->mmap_lock#2){++++}-{3:3}, at:
		vm_mmap_pgoff+0xa9/0x180 [  168.597755]
                which lock already depends on the new lock.

[  168.605970]
                the existing dependency chain (in reverse order) is:
[  168.613487]
                -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
[  168.619700]        lock_acquire+0xca/0x2e0
[  168.623814]        down_read+0x3e/0x140
[  168.627676]        do_user_addr_fault+0x40d/0x690
[  168.632399]        exc_page_fault+0x6f/0x270
[  168.636692]        asm_exc_page_fault+0x1e/0x30
[  168.641249]        filldir64+0xc8/0x1e0
[  168.645115]        call_filldir+0x7c/0x110
[  168.649238]        ext4_readdir+0x58e/0x940
[  168.653442]        iterate_dir+0x16a/0x1b0
[  168.657558]        __x64_sys_getdents64+0x83/0x140
[  168.662375]        do_syscall_64+0x35/0x80
[  168.666492]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  168.672095]
                -> #2 (&type->i_mutex_dir_key#6){++++}-{3:3}:
[  168.679008]        lock_acquire+0xca/0x2e0
[  168.683122]        down_read+0x3e/0x140
[  168.686982]        path_openat+0x5b2/0xa50
[  168.691095]        do_file_open_root+0xfc/0x190
[  168.695652]        file_open_root+0xd8/0x1b0
[  168.702010]        kernel_read_file_from_path_initns+0xc4/0x140
[  168.709542]        _request_firmware+0x2e9/0x5e0
[  168.715741]        request_firmware+0x32/0x50
[  168.721667]        amdgpu_cgs_get_firmware_info+0x370/0xdd0 [amdgpu]
[  168.730060]        smu7_upload_smu_firmware_image+0x53/0x190 [amdgpu]
[  168.738414]        fiji_start_smu+0xcf/0x4e0 [amdgpu]
[  168.745539]        pp_dpm_load_fw+0x21/0x30 [amdgpu]
[  168.752503]        amdgpu_pm_load_smu_firmware+0x4b/0x80 [amdgpu]
[  168.760698]        amdgpu_device_fw_loading+0xb8/0x140 [amdgpu]
[  168.768412]        amdgpu_device_init.cold+0xdf6/0x1716 [amdgpu]
[  168.776285]        amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
[  168.784034]        amdgpu_pci_probe+0x19b/0x3a0 [amdgpu]
[  168.791161]        local_pci_probe+0x40/0x80
[  168.797027]        work_for_cpu_fn+0x10/0x20
[  168.802839]        process_one_work+0x273/0x5b0
[  168.808903]        worker_thread+0x20f/0x3d0
[  168.814700]        kthread+0x176/0x1a0
[  168.819968]        ret_from_fork+0x1f/0x30
[  168.825563]
                -> #1 (&adev->pm.mutex){+.+.}-{3:3}:
[  168.834721]        lock_acquire+0xca/0x2e0
[  168.840364]        __mutex_lock+0xa2/0x930
[  168.846020]        amdgpu_dpm_get_mclk+0x37/0x60 [amdgpu]
[  168.853257]        amdgpu_amdkfd_get_local_mem_info+0xba/0xe0 [amdgpu]
[  168.861547]        kfd_create_vcrat_image_gpu+0x1b1/0xbb0 [amdgpu]
[  168.869478]        kfd_create_crat_image_virtual+0x447/0x510 [amdgpu]
[  168.877884]        kfd_topology_add_device+0x5c8/0x6f0 [amdgpu]
[  168.885556]        kgd2kfd_device_init.cold+0x385/0x4c5 [amdgpu]
[  168.893347]        amdgpu_amdkfd_device_init+0x138/0x180 [amdgpu]
[  168.901177]        amdgpu_device_init.cold+0x141b/0x1716 [amdgpu]
[  168.909025]        amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
[  168.916458]        amdgpu_pci_probe+0x19b/0x3a0 [amdgpu]
[  168.923442]        local_pci_probe+0x40/0x80
[  168.929249]        work_for_cpu_fn+0x10/0x20
[  168.935008]        process_one_work+0x273/0x5b0
[  168.940944]        worker_thread+0x20f/0x3d0
[  168.946623]        kthread+0x176/0x1a0
[  168.951765]        ret_from_fork+0x1f/0x30
[  168.957277]
                -> #0 (&topology_lock){++++}-{3:3}:
[  168.965993]        check_prev_add+0x8f/0xbf0
[  168.971613]        __lock_acquire+0x1299/0x1ca0
[  168.977485]        lock_acquire+0xca/0x2e0
[  168.982877]        down_read+0x3e/0x140
[  168.987975]        kfd_topology_device_by_id+0x16/0x60 [amdgpu]
[  168.995583]        kfd_device_by_id+0xa/0x20 [amdgpu]
[  169.002180]        kfd_mmap+0x95/0x200 [amdgpu]
[  169.008293]        mmap_region+0x337/0x5a0
[  169.013679]        do_mmap+0x3aa/0x540
[  169.018678]        vm_mmap_pgoff+0xdc/0x180
[  169.024095]        ksys_mmap_pgoff+0x186/0x1f0
[  169.029734]        do_syscall_64+0x35/0x80
[  169.035005]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  169.041754]
                other info that might help us debug this:

[  169.053276] Chain exists of:
                  &topology_lock --> &type->i_mutex_dir_key#6 --> &mm->mmap_lock#2

[  169.068389]  Possible unsafe locking scenario:

[  169.076661]        CPU0                    CPU1
[  169.082383]        ----                    ----
[  169.088087]   lock(&mm->mmap_lock#2);
[  169.092922]                                lock(&type->i_mutex_dir_key#6);
[  169.100975]                                lock(&mm->mmap_lock#2);
[  169.108320]   lock(&topology_lock);
[  169.112957]
                 *** DEADLOCK ***

This commit fixes the deadlock warning by ensuring pm.mutex is not
held while holding the topology lock. For this, kfd_local_mem_info
is moved into the KFD dev struct and filled during device init.
This cached value can then be used instead of querying the value
again and again.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-28 17:45:40 -04:00
Christophe JAILLET
b8b9ba58b6 drm/amdkfd: Use non-atomic bitmap functions when possible
All uses of the 'kfd->gtt_sa_bitmap' bitmap are protected with the
'kfd->gtt_sa_lock' mutex.

So:
   - prefer the non-atomic '__set_bit()' function
   - use the non-atomic 'bitmap_[set|clear]()' functions instead of
     equivalent 'for' loops. These functions can work on several bits at a
     time

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-28 17:44:29 -04:00
Christophe JAILLET
f43a9f18e0 drm/amdkfd: Use bitmap_zalloc() when applicable
'kfd->gtt_sa_bitmap' is a bitmap. So use 'bitmap_zalloc()' to simplify
code, improve the semantic and avoid some open-coded arithmetic in
allocator arguments.

Also change the corresponding 'kfree()' into 'bitmap_free()' to keep
consistency.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-28 17:44:23 -04:00
Divya Shikre
c5650327ab drm/amdkfd: Check use_xgmi_p2p before reporting hive_id
Recently introduced commit 158a05a0b8 ("drm/amdgpu: Add
use_xgmi_p2p module parameter") did not update XGMI iolinks
when use_xgmi_p2p is disabled. Add fix to not create XGMI
iolinks in KFD topology when this parameter is disabled.

Fixes: 158a05a0b8 ("drm/amdgpu: Add use_xgmi_p2p module parameter")
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25 12:40:25 -04:00
Tushar Patel
b7dfbd2e60 drm/amdkfd: Fix Incorrect VMIDs passed to HWS
Compute-only GPUs have more than 8 VMIDs allocated to KFD. Fix
this by passing correct number of VMIDs to HWS

v2: squash in warning fix (Alex)

Signed-off-by: Tushar Patel <tushar.patel@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25 12:40:25 -04:00
Rajneesh Bhardwaj
2243f4937a drm/amdkfd: Fix leftover errors and warnings
A bunch of errors and warnings are leftover KFD over the years, attempt
to fix the errors and most warnings reported by checkpatch tool. Still a
few warnings remain which may be false positives so ignore them for now.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Rajneesh Bhardwaj
d87f36a063 drm/amdkfd: update SPDX license header
Update the SPDX License header for all the KFD files.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Lang Yu
f9ed188d5a drm/amdgpu: add support for GC 10.1.4
Add basic support for GC 10.1.4,
it uses same IP blocks with GC 10.1.3

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:11:55 -05:00
Mukul Joshi
5bdd3eb253 drm/amdkfd: Remove unused old debugger implementation
Cleanup the kfd code by removing the unused old debugger
implementation.
The address watch was only ever implemented in the upstream
driver for GFXv7 (Kaveri). The user mode tools runtime using
this API was never open-sourced. Work on the old debugger
prototype that used this API has been discontinued years ago.
Only a small piece of resetting wavefronts is kept and
is moved to kfd_device_queue_manager.c.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:51 -05:00
Graham Sider
20c5e425d3 drm/amdkfd: Fix indentation on switch statement
Cases should be same indentation as switch. Also fix string spanning
across multiple lines.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:00 -05:00
Kent Russell
5eb877b282 drm/amdkfd: Fix ASIC name typos
Three misspelled ASICs in comments here, so fix the spelling

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 15:44:28 -05:00
Guchun Chen
f89c6bf734 drm/amdkfd: correct sdma queue number in kfd device init (v3)
This patch keeps the setting of sdma queue number to the same
after recent KFD code refactor. Additionally, improve code to
use switch case to list IP version to complete kfd device_info
structure filling for IH version assignment. This makes consistency
with the IP parse code in amdgpu_discovery.c.

v2: use dev_warn for the default switch case;
    set default sdma queue per engine(8) and IH handler to v9. (Jonathan)

v3: Fix missed IP version check of Raven.

Fixes: f0dc99a6f7 ("drm/amdkfd: add kfd_device_info_init function")
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-30 08:54:43 -05:00
Jonathan Kim
addaac0cf7 drm/amdgpu: disable default navi2x co-op kernel support
This patch reverts the following:
commit 48733b224f ("drm/amdkfd: add Navi2x to GWS init conditions")

Disable GWS usage in default settings for now due to FW bugs.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:32:35 -05:00
Graham Sider
48733b224f drm/amdkfd: add Navi2x to GWS init conditions
Initalize GWS on Navi2x with mec2_fw_version >= 0x42.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-and-tested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:32:35 -05:00
Felix Kuehling
0f7ef0b99d drm/amdkfd: Make KFD support on Hawaii experimental
Hawaii support is mostly untested these days. ROCm user mode also
depends on custom firmware for AQL packet processing, that was never
pushed upstream due to quality regressions in graphics driver testing.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:32:35 -05:00
chen gong
27cc310f13 drm/amdkfd: Correct the value of the no_atomic_fw_version variable
145:
navi10            IP_VERSION(10, 1, 10)
navi12            IP_VERSION(10, 1, 2)
navi14            IP_VERSION(10, 1, 1)

92:
sienna_cichlid    IP_VERSION(10, 3, 0)
navy_flounder     IP_VERSION(10, 3, 2)
vangogh           IP_VERSION(10, 3, 1)
dimgrey_cavefish  IP_VERSION(10, 3, 4)
beige_goby        IP_VERSION(10, 3, 5)
yellow_carp       IP_VERSION(10, 3, 3)

Signed-off-by: chen gong <curry.gong@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-07 13:07:08 -05:00
Graham Sider
2c1f19b327 drm/amdkfd: remove hardcoded device_info structs
With device_info initialization being handled in kfd_device_info_init,
these structs may be removed. Also add comments to help matching IP
versions to asic names.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:15:47 -05:00
Graham Sider
f0dc99a6f7 drm/amdkfd: add kfd_device_info_init function
Initializes kfd->device_info given either asic_type (enum) if GFX
version is less than GFX9, or GC IP version if greater. Also takes in vf
and the target compiler gfx version. Uses SDMA version to determine
num_sdma_queues_per_engine.

Convert device_info to a non-pointer member of kfd, change references
accordingly.

Change unsupported asic condition to only probe f2g, move device_info
initialization post-switch.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:15:37 -05:00
Graham Sider
b7675b7bbc drm/amdkfd: replace asic_name with amdgpu_asic_name
device_info->asic_name and amdgpu_asic_name[adev->asic_type] both
provide asic name strings, with the only difference being casing.
Remove asic_name from device_info and replace sysfs entry with lowercase
amdgpu_asic_name[]. Ensures string is null-terminated so that this
doesn't break if dev->node_props.name ever gets set anywhere else.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:15:29 -05:00
Amber Lin
a0e7e140b5 drm/amdkfd: Remove unused entries in table
Remove unused entries in kfd_device_info table: num_xgmi_sdma_engines
and num_sdma_queues_per_engine. They are calculated in
kfd_get_num_sdma_engines and kfd_get_num_xgmi_sdma_engines instead.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22 14:59:05 -05:00
Amber Lin
ee2f17f4d0 drm/amdkfd: Retrieve SDMA numbers from amdgpu
Instead of hard coding the number of sdma engines and the number of
sdma_xgmi engines in the device_info table, get the number of toal SDMA
instances from amdgpu. The first two engines are sdma engines and the
rest are sdma-xgmi engines unless the ASIC doesn't support XGMI.

v2: add kfd_ prefix to non static function names

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22 14:46:03 -05:00