Commit Graph

253 Commits

Author SHA1 Message Date
Mukul Joshi
cc009e613d drm/amdkfd: Add KFD support for soc21 v3
Add initial support for soc21 in KFD compute
driver (Mukul)
- Add new definition for soc21 device.
- Add new file for amdgpu-kfd interface for GFX11 family.
- Add new file for queue management, interrupt handling,
  mqd management for GFX11 family in KFD driver.
- Related changes/updates for soc21 device in
  KFD driver.
- Repurpose last 2 entries of SDMA MQD for driver use.

v2: Add an optional argument into update queue operation (Mukul)

v3: Switch to ip version check, replace kgd_dev with
    amdgpu_device (Hawking)

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:43:54 -04:00
Mukul Joshi
b179fc28d5 drm/amdkfd: Fix circular lock dependency warning
[  168.544078] ======================================================
[  168.550309] WARNING: possible circular locking dependency detected
[  168.556523] 5.16.0-kfd-fkuehlin #148 Tainted: G            E
[  168.562558] ------------------------------------------------------
[  168.568764] kfdtest/3479 is trying to acquire lock:
[  168.573672] ffffffffc0927a70 (&topology_lock){++++}-{3:3}, at:
		kfd_topology_device_by_id+0x16/0x60 [amdgpu] [  168.583663]
                but task is already holding lock:
[  168.589529] ffff97d303dee668 (&mm->mmap_lock#2){++++}-{3:3}, at:
		vm_mmap_pgoff+0xa9/0x180 [  168.597755]
                which lock already depends on the new lock.

[  168.605970]
                the existing dependency chain (in reverse order) is:
[  168.613487]
                -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
[  168.619700]        lock_acquire+0xca/0x2e0
[  168.623814]        down_read+0x3e/0x140
[  168.627676]        do_user_addr_fault+0x40d/0x690
[  168.632399]        exc_page_fault+0x6f/0x270
[  168.636692]        asm_exc_page_fault+0x1e/0x30
[  168.641249]        filldir64+0xc8/0x1e0
[  168.645115]        call_filldir+0x7c/0x110
[  168.649238]        ext4_readdir+0x58e/0x940
[  168.653442]        iterate_dir+0x16a/0x1b0
[  168.657558]        __x64_sys_getdents64+0x83/0x140
[  168.662375]        do_syscall_64+0x35/0x80
[  168.666492]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  168.672095]
                -> #2 (&type->i_mutex_dir_key#6){++++}-{3:3}:
[  168.679008]        lock_acquire+0xca/0x2e0
[  168.683122]        down_read+0x3e/0x140
[  168.686982]        path_openat+0x5b2/0xa50
[  168.691095]        do_file_open_root+0xfc/0x190
[  168.695652]        file_open_root+0xd8/0x1b0
[  168.702010]        kernel_read_file_from_path_initns+0xc4/0x140
[  168.709542]        _request_firmware+0x2e9/0x5e0
[  168.715741]        request_firmware+0x32/0x50
[  168.721667]        amdgpu_cgs_get_firmware_info+0x370/0xdd0 [amdgpu]
[  168.730060]        smu7_upload_smu_firmware_image+0x53/0x190 [amdgpu]
[  168.738414]        fiji_start_smu+0xcf/0x4e0 [amdgpu]
[  168.745539]        pp_dpm_load_fw+0x21/0x30 [amdgpu]
[  168.752503]        amdgpu_pm_load_smu_firmware+0x4b/0x80 [amdgpu]
[  168.760698]        amdgpu_device_fw_loading+0xb8/0x140 [amdgpu]
[  168.768412]        amdgpu_device_init.cold+0xdf6/0x1716 [amdgpu]
[  168.776285]        amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
[  168.784034]        amdgpu_pci_probe+0x19b/0x3a0 [amdgpu]
[  168.791161]        local_pci_probe+0x40/0x80
[  168.797027]        work_for_cpu_fn+0x10/0x20
[  168.802839]        process_one_work+0x273/0x5b0
[  168.808903]        worker_thread+0x20f/0x3d0
[  168.814700]        kthread+0x176/0x1a0
[  168.819968]        ret_from_fork+0x1f/0x30
[  168.825563]
                -> #1 (&adev->pm.mutex){+.+.}-{3:3}:
[  168.834721]        lock_acquire+0xca/0x2e0
[  168.840364]        __mutex_lock+0xa2/0x930
[  168.846020]        amdgpu_dpm_get_mclk+0x37/0x60 [amdgpu]
[  168.853257]        amdgpu_amdkfd_get_local_mem_info+0xba/0xe0 [amdgpu]
[  168.861547]        kfd_create_vcrat_image_gpu+0x1b1/0xbb0 [amdgpu]
[  168.869478]        kfd_create_crat_image_virtual+0x447/0x510 [amdgpu]
[  168.877884]        kfd_topology_add_device+0x5c8/0x6f0 [amdgpu]
[  168.885556]        kgd2kfd_device_init.cold+0x385/0x4c5 [amdgpu]
[  168.893347]        amdgpu_amdkfd_device_init+0x138/0x180 [amdgpu]
[  168.901177]        amdgpu_device_init.cold+0x141b/0x1716 [amdgpu]
[  168.909025]        amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
[  168.916458]        amdgpu_pci_probe+0x19b/0x3a0 [amdgpu]
[  168.923442]        local_pci_probe+0x40/0x80
[  168.929249]        work_for_cpu_fn+0x10/0x20
[  168.935008]        process_one_work+0x273/0x5b0
[  168.940944]        worker_thread+0x20f/0x3d0
[  168.946623]        kthread+0x176/0x1a0
[  168.951765]        ret_from_fork+0x1f/0x30
[  168.957277]
                -> #0 (&topology_lock){++++}-{3:3}:
[  168.965993]        check_prev_add+0x8f/0xbf0
[  168.971613]        __lock_acquire+0x1299/0x1ca0
[  168.977485]        lock_acquire+0xca/0x2e0
[  168.982877]        down_read+0x3e/0x140
[  168.987975]        kfd_topology_device_by_id+0x16/0x60 [amdgpu]
[  168.995583]        kfd_device_by_id+0xa/0x20 [amdgpu]
[  169.002180]        kfd_mmap+0x95/0x200 [amdgpu]
[  169.008293]        mmap_region+0x337/0x5a0
[  169.013679]        do_mmap+0x3aa/0x540
[  169.018678]        vm_mmap_pgoff+0xdc/0x180
[  169.024095]        ksys_mmap_pgoff+0x186/0x1f0
[  169.029734]        do_syscall_64+0x35/0x80
[  169.035005]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  169.041754]
                other info that might help us debug this:

[  169.053276] Chain exists of:
                  &topology_lock --> &type->i_mutex_dir_key#6 --> &mm->mmap_lock#2

[  169.068389]  Possible unsafe locking scenario:

[  169.076661]        CPU0                    CPU1
[  169.082383]        ----                    ----
[  169.088087]   lock(&mm->mmap_lock#2);
[  169.092922]                                lock(&type->i_mutex_dir_key#6);
[  169.100975]                                lock(&mm->mmap_lock#2);
[  169.108320]   lock(&topology_lock);
[  169.112957]
                 *** DEADLOCK ***

This commit fixes the deadlock warning by ensuring pm.mutex is not
held while holding the topology lock. For this, kfd_local_mem_info
is moved into the KFD dev struct and filled during device init.
This cached value can then be used instead of querying the value
again and again.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-28 17:45:40 -04:00
Felix Kuehling
c3eb12dff0 drm/amdkfd: Ignore bogus signals from MEC efficiently
MEC firmware sometimes sends signal interrupts without a valid context ID
on end of pipe events that don't intend to signal any HSA signals.
This triggers the slow path in kfd_signal_event_interrupt that scans the
entire event page for signaled events. Detect these signals in the top
half interrupt handler to stop processing them as early as possible.

Because we now always treat event ID 0 as invalid, reserve that ID during
process initialization.

v2: Update firmware version checks to support more GPUs

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-25 17:05:48 -04:00
David Yat Sin
747eea0732 drm/amdkfd: CRIU add support for GWS queues
Add support to checkpoint/restore GWS (Global Wave Sync) queues.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-21 11:32:38 -04:00
Lang Yu
459ccca5f7 drm/amdkfd: move kfd_flush_tlb_after_unmap into kfd_priv.h
To make kfd_flush_tlb_after_unmap visible in kfd_svm.c,
move it into kfd_priv.h. And change it to an inline function.

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-19 13:55:37 -04:00
Mukul Joshi
46d18d510d drm/amdkfd: Cleanup IO links during KFD device removal
Currently, the IO-links to the device being removed from topology,
are not cleared. As a result, there would be dangling links left in
the KFD topology. This patch aims to fix the following:
1. Cleanup all IO links to the device being removed.
2. Ensure that node numbering in sysfs and nodes proximity domain
   values are consistent after the device is removed:
   a. Adding a device and removing a GPU device are made mutually
      exclusive.
   b. The global proximity domain counter is no longer required to be
      an atomic counter. A normal 32-bit counter can be used instead.
3. Update generation_count to let user-mode know that topology has
   changed due to device removal.

CC: Shuotao Xu <shuotaoxu@microsoft.com>
Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-13 09:14:22 -04:00
Philip Yang
8fde0248a3 drm/amdkfd: Use atomic64_t type for pdd->tlb_seq
To support multi-thread update page table.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-31 23:05:54 -04:00
Christian König
bffa91dadf drm/amdkfd: start using tlb_seq from the VM subsystem
Instead of trying to figure out if a TLB flush is necessary or not use
the information provided by the VM subsystem now.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philip Yang<Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25 12:40:51 -04:00
Felix Kuehling
a0c5fd46b2 drm/amdkfd: Use real device for messages
kfd_chardev() doesn't provide much useful information in dev_... messages
on multi-GPU systems because there is only one KFD device, which doesn't
correspond to any particular GPU. Use the actual GPU device to indicate
the GPU that caused a message.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23 14:02:50 -05:00
Changcheng Deng
d5c831566d drm/amdkfd: Replace zero-length array with flexible-array member
There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use "flexible array members" for these cases. The older
style of one-element or zero-length arrays should no longer be used.
Reference:
https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-16 17:30:02 -05:00
Jonathan Kim
d2cb0b21b8 drm/amdkfd: remove unneeded unmap single queue option
The KFD only unmaps all queues, all dynamics queues or all process queues
since RUN_LIST is mapped with all KFD queues.

There's no need to provide a single type unmap so remove this option.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:41 -05:00
Rajneesh Bhardwaj
2243f4937a drm/amdkfd: Fix leftover errors and warnings
A bunch of errors and warnings are leftover KFD over the years, attempt
to fix the errors and most warnings reported by checkpatch tool. Still a
few warnings remain which may be false positives so ignore them for now.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Rajneesh Bhardwaj
d87f36a063 drm/amdkfd: update SPDX license header
Update the SPDX License header for all the KFD files.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Mukul Joshi
5bdd3eb253 drm/amdkfd: Remove unused old debugger implementation
Cleanup the kfd code by removing the unused old debugger
implementation.
The address watch was only ever implemented in the upstream
driver for GFXv7 (Kaveri). The user mode tools runtime using
this API was never open-sourced. Work on the old debugger
prototype that used this API has been discontinued years ago.
Only a small piece of resetting wavefronts is kept and
is moved to kfd_device_queue_manager.c.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:51 -05:00
Tao Zhou
03e5b167bd drm/amdkfd: rename kfd_process_vm_fault to kfd_dqm_evict_pasid
As the function is used in more different cases, use a more general
name.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 14:14:53 -05:00
Rajneesh Bhardwaj
c2db32ce77 drm/amdkfd: CRIU prepare for svm resume
During CRIU restore phase, the VMAs for the virtual address ranges are
not at their final location yet so in this stage, only cache the data
required to successfully resume the svm ranges during an imminent CRIU
resume phase.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
08a987a8a0 drm/amdkfd: CRIU Discover svm ranges
A KFD process may contain a number of virtual address ranges for shared
virtual memory management and each such range can have many SVM
attributes spanning across various nodes within the process boundary.
This change reports the total number of such SVM ranges and
their total private data size by extending the PROCESS_INFO op of the the
CRIU IOCTL to discover the svm ranges in the target process and a future
patches brings in the required support for checkpoint and restore for
SVM ranges.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
4717fe3d8d drm/amdkfd: CRIU checkpoint and restore xnack mode
Recoverable page faults are represented by the xnack mode setting inside
a kfd process and are used to represent the device page faults. For CR,
we don't consider negative values which are typically used for querying
the current xnack mode without modifying it.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
bef153b70c drm/amdkfd: CRIU implement gpu_id remapping
When doing a restore on a different node, the gpu_id's on the restore
node may be different. But the user space application will still refer
use the original gpu_id's in the ioctl calls. Adding code to create a
gpu id mapping so that kfd can determine actual gpu_id during the user
ioctl's.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
40e8a766a7 drm/amdkfd: CRIU checkpoint and restore events
Add support to existing CRIU ioctl's to save and restore events during
criu checkpoint and restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
3a9822d7bd drm/amdkfd: CRIU checkpoint and restore queue control stack
Checkpoint contents of queue control stacks on CRIU dump and restore them
during CRIU restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
42c6c48214 drm/amdkfd: CRIU checkpoint and restore queue mqds
Checkpoint contents of queue MQD's on CRIU dump and restore them during
CRIU restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
8668dfc30d drm/amdkfd: CRIU restore queue ids
When re-creating queues during CRIU restore, restore the queue with the
same queue id value used during CRIU dump.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
626f7b3190 drm/amdkfd: CRIU add queues support
Add support to existing CRIU ioctl's to save number of queues and queue
properties for each queue during checkpoint and re-create queues on
restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
cd9f791030 drm/amdkfd: CRIU Implement KFD unpause operation
Introducing UNPAUSE op. After CRIU amdgpu plugin performs a PROCESS_INFO
op the queues will be stay in an evicted state. Once the plugin is done
draining BO contents, it is safe to perform an UNPAUSE op for the queues
to resume.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:46 -05:00
Rajneesh Bhardwaj
011bbb0302 drm/amdkfd: CRIU Implement KFD resume ioctl
This adds support to create userptr BOs on restore and introduces a new
ioctl op to restart memory notifiers for the restored userptr BOs.
When doing CRIU restore MMU notifications can happen anytime after we call
amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the
restore process i.e. criu_resume ioctl op is received, and the process is
ready to be resumed. This ioctl is different from other KFD CRIU ioctls
since its called by CRIU master restore process for all the target
processes being resumed by CRIU.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:41 -05:00
Rajneesh Bhardwaj
5ccbb057c0 drm/amdkfd: CRIU Implement KFD checkpoint ioctl
This adds support to discover the  buffer objects that belong to a
process being checkpointed. The data corresponding to these buffer
objects is returned to user space plugin running under criu master
context which then stores this info to recreate these buffer objects
during a restore operation.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:29 -05:00
Rajneesh Bhardwaj
3698807094 drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs
Checkpoint-Restore in userspace (CRIU) is a powerful tool that can
snapshot a running process and later restore it on same or a remote
machine but expects the processes that have a device file (e.g. GPU)
associated with them, provide necessary driver support to assist CRIU
and its extensible plugin interface. Thus, In order to support the
Checkpoint-Restore of any ROCm process, the AMD Radeon Open Compute
Kernel driver, needs to provide a set of new APIs that provide
necessary VRAM metadata and its contents to a userspace component
(CRIU plugin) that can store it in form of image files.

This introduces some new ioctls which will be used to checkpoint-Restore
any KFD bound user process. KFD only allows ioctl calls from the same
process that opened the KFD file descriptor. Since these ioctls are
expected to be called from a KFD criu plugin which has elevated ptrace
attached privileges and CAP_CHECKPOINT_RESTORE capabilities attached with
the file descriptors so modify KFD to allow such calls.

(API redesigned by David Yat Sin)
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:03 -05:00
Tao Zhou
b6485bed40 drm/amdkfd: reset queue which consumes RAS poison (v2)
CP supports unmap queue with reset mode which only destroys specific queue without affecting others.
Replacing whole gpu reset with reset queue mode for RAS poison consumption
saves much time, and we can also fallback to gpu reset solution if reset
queue fails.

v2: Return directly if process is NULL;
    Reset queue solution is not applicable to SDMA, fallback to legacy
way;
    Call kfd_unref_process after lookup process.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-28 16:02:59 -05:00
Graham Sider
f0dc99a6f7 drm/amdkfd: add kfd_device_info_init function
Initializes kfd->device_info given either asic_type (enum) if GFX
version is less than GFX9, or GC IP version if greater. Also takes in vf
and the target compiler gfx version. Uses SDMA version to determine
num_sdma_queues_per_engine.

Convert device_info to a non-pointer member of kfd, change references
accordingly.

Change unsupported asic condition to only probe f2g, move device_info
initialization post-switch.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:15:37 -05:00
Graham Sider
b7675b7bbc drm/amdkfd: replace asic_name with amdgpu_asic_name
device_info->asic_name and amdgpu_asic_name[adev->asic_type] both
provide asic name strings, with the only difference being casing.
Remove asic_name from device_info and replace sysfs entry with lowercase
amdgpu_asic_name[]. Ensures string is null-terminated so that this
doesn't break if dev->node_props.name ever gets set anywhere else.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:15:29 -05:00
Philip Yang
2e4477282c drm/amdkfd: simplify drain retry fault
unmap range always increase atomic svms->drain_pagefaults to simplify
both parent range and child range unmap, page fault handle ignores the
retry fault if svms->drain_pagefaults is set to speed up interrupt
handling. svm_range_drain_retry_fault restart draining if another
range unmap from cpu.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-24 14:06:53 -05:00
Amber Lin
a0e7e140b5 drm/amdkfd: Remove unused entries in table
Remove unused entries in kfd_device_info table: num_xgmi_sdma_engines
and num_sdma_queues_per_engine. They are calculated in
kfd_get_num_sdma_engines and kfd_get_num_xgmi_sdma_engines instead.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22 14:59:05 -05:00
Amber Lin
ee2f17f4d0 drm/amdkfd: Retrieve SDMA numbers from amdgpu
Instead of hard coding the number of sdma engines and the number of
sdma_xgmi engines in the device_info table, get the number of toal SDMA
instances from amdgpu. The first two engines are sdma engines and the
rest are sdma-xgmi engines unless the ASIC doesn't support XGMI.

v2: add kfd_ prefix to non static function names

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22 14:46:03 -05:00
Graham Sider
7eb0502ac0 drm/amdkfd: replace asic_family with asic_type
asic_family was a duplicate of asic_type, both of type amd_asic_type.
Replace all instances of device_info->asic_family with adev->asic_type
and remove asic_family from device_info.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 17:10:01 -05:00
Graham Sider
dd0ae064e7 drm/amdkfd: convert KFD_IS_SOC to IP version checking
Defined as GC HWIP >= IP_VERSION(9, 0, 1).

Also defines KFD_GC_VERSION to return GC HWIP version.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 17:09:28 -05:00
Graham Sider
b5d1d755c1 drm/amdkfd: remove kgd_dev declaration and initialization
Completes removal of kgd_dev. Direct references to amdgpu_device objects
should now be used instead.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:03 -05:00
Graham Sider
56c5977eae drm/amdkfd: replace/remove remaining kgd_dev references
Remove get_amdgpu_device and other remaining kgd_dev references aside
from declaration/kfd struct entry and initialization.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:02 -05:00
Graham Sider
574c4183ef drm/amdkfd: replace kgd_dev in get amdgpu_amdkfd funcs
Modified definitions:

- amdgpu_amdkfd_get_fw_version
- amdgpu_amdkfd_get_local_mem_info
- amdgpu_amdkfd_get_gpu_clock_counter
- amdgpu_amdkfd_get_max_engine_clock_in_mhz
- amdgpu_amdkfd_get_cu_info
- amdgpu_amdkfd_get_dmabuf_info
- amdgpu_amdkfd_get_vram_usage
- amdgpu_amdkfd_get_hive_id
- amdgpu_amdkfd_get_unique_id
- amdgpu_amdkfd_get_mmio_remap_phys_addr
- amdgpu_amdkfd_get_num_gws
- amdgpu_amdkfd_get_asic_rev_id
- amdgpu_amdkfd_get_noretry
- amdgpu_amdkfd_get_xgmi_hops_count
- amdgpu_amdkfd_get_xgmi_bandwidth_mbytes
- amdgpu_amdkfd_get_pcie_bandwidth_mbytes

Also replaces kfd_device_by_kgd with kfd_device_by_adev, now
searching via adev rather than kgd.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:02 -05:00
Graham Sider
c6c5744638 drm/amdkfd: add amdgpu_device entry to kfd_dev
Patch series to remove kgd_dev struct and replace all instances with
amdgpu_device objects.

amdgpu_device needs to be declared in kgd_kfd_interface.h to be visible
to kfd2kgd_calls.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:57:59 -05:00
Felix Kuehling
a44fe9ee05 drm/amdkfd: Fix retry fault drain race conditions
The check for whether to drain retry faults must be under the mmap write
lock to serialize with munmap notifier callbacks.

We were also missing checks on child ranges. To fix that, simplify the
logic by using a flag rather than checking on each prange. That also
allows draining less freqeuntly when many ranges are unmapped at once.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Tested-by: Alex Sierra <Alex.Sierra@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-09 17:08:00 -05:00
Alex Sierra
a6283010e2 drm/amdkfd: avoid recursive lock in migrations back to RAM
[Why]:
When we call hmm_range_fault to map memory after a migration, we don't
expect memory to be migrated again as a result of hmm_range_fault. The
driver ensures that all memory is in GPU-accessible locations so that
no migration should be needed. However, there is one corner case where
hmm_range_fault can unexpectedly cause a migration from DEVICE_PRIVATE
back to system memory due to a write-fault when a system memory page in
the same range was mapped read-only (e.g. COW). Ranges with individual
pages in different locations are usually the result of failed page
migrations (e.g. page lock contention). The unexpected migration back
to system memory causes a deadlock from recursive locking in our
driver.

[How]:
Creating a task reference new member under svm_range_list struct.
Setting this with "current" reference, right before the hmm_range_fault
is called. This member is checked against "current" reference at
svm_migrate_to_ram callback function. If equal, the migration will be
ignored.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-05 14:10:58 -04:00
Lang Yu
7c695a2c54 drm/amdkfd: Remove cu mask from struct queue_properties(v2)
Actually, cu_mask has been copied to mqd memory and
does't have to persist in queue_properties. Remove it
from queue_properties.

And use struct mqd_update_info to store such properties,
then pass it to update queue operation.

v2:
* Rename pqm_update_queue to pqm_update_queue_properties.
* Rename struct queue_update_info to struct mqd_update_info.
* Rename pqm_set_cu_mask to pqm_update_mqd.

Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-28 14:26:13 -04:00
Lang Yu
c6e559eb3b drm/amdkfd: Add an optional argument into update queue operation(v2)
Currently, queue is updated with data in queue_properties.
And all allocated resource in queue_properties will not
be freed until the queue is destroyed.

But some properties(e.g., cu mask) bring some memory
management headaches(e.g., memory leak) and make code
complex. Actually they have been copied to mqd and
don't have to persist in queue_properties.

Add an argument into update queue to pass such properties,
then we can remove them from queue_properties.

v2: Don't use void *.

Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-28 14:26:13 -04:00
Lang Yu
68df0f195a drm/amdkfd: Separate pinned BOs destruction from general routine
Currently, all kfd BOs use same destruction routine. But pinned
BOs are not unpinned properly. Separate them from general routine.

v2 (Felix):
Add safeguard to prevent user space from freeing signal BO.
Kunmap signal BO in the event of setting event page error.
Just kunmap signal BO to avoid duplicating the code.

Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-28 14:26:12 -04:00
Felix Kuehling
fb932dfeb8 drm/amdkfd: make needs_pcie_atomics FW-version dependent
On some GPUs the PCIe atomic requirement for KFD depends on the MEC
firmware version. Add a firmware version check for this. The minimum
firmware version that works without atomics can be updated in the
device_info structure for each GPU type.

Move PCIe atomic detection from kgd2kfd_probe into kgd2kfd_device_init
because the MEC firmware is not loaded yet at the probe stage.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-09-16 09:56:23 -04:00
Graham Sider
9d6fa9c7ff drm/amdkfd: Expose GFXIP engine version to sysfs
Add u32 gfx_target_version field to kfd_node_properties and
kfd_device_info. Populate <asic>_device_info structs accordingly and
expose to sysfs.

This allows eliminating device-ID-based lookup tables in user mode for
future ASICs.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-05 21:18:00 -04:00
Oak Zeng
4f942aaeb1 drm/amdkfd: Fix a concurrency issue during kfd recovery
start_cpsch and stop_cpsch can be called during kfd device
initialization or during gpu reset/recovery. So they can
run concurrently. Currently in start_cpsch and stop_cpsch,
pm_init and pm_uninit is not protected by the dpm lock.
Imagine such a case that user use packet manager's function
to submit a pm4 packet to hang hws (ie through command
cat /sys/class/kfd/kfd/topology/nodes/1/gpu_id | sudo tee
/sys/kernel/debug/kfd/hang_hws), while kfd device is under
device reset/recovery so packet manager can be not initialized.
There will be unpredictable protection fault in such case.

This patch moves pm_init/uninit inside the dpm lock and check
packet manager is initialized before using packet manager
function.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Acked-by: Christian Konig <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23 10:08:00 -04:00
Philip Yang
751580b3ff drm/amdkfd: add sysfs counters for vm fault and migration
This is part of SVM profiling API, export sysfs counters for
per-process, per-GPU vm retry fault, pages migrated in and out of GPU vram.

counters will not be updated in parallel in GPU retry fault handler and
migration to vram/ram path, use READ_ONCE to avoid compiler
optimization.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-30 00:18:23 -04:00
Felix Kuehling
5a75ea56e3 drm/amdkfd: Disable SVM per GPU, not per process
When some GPUs don't support SVM, don't disabe it for the entire process.
That would be inconsistent with the information the process got from the
topology, which indicates SVM support per GPU.

Instead disable SVM support only for the unsupported GPUs. This is done
by checking any per-device attributes against the bitmap of supported
GPUs. Also use the supported GPU bitmap to initialize access bitmaps for
new SVM address ranges.

Don't handle recoverable page faults from unsupported GPUs. (I don't
think there will be unsupported GPUs that can generate recoverable page
faults. But better safe than sorry.)

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-15 17:25:41 -04:00