Commit Graph

1685 Commits

Author SHA1 Message Date
Philip Yang
d58b8a99cb drm/amdkfd: Add SMI add event helper
To remove duplicate code, unify event message format and simplify new
event add in the following patches.

Use KFD_SMI_EVENT_MSG_SIZE to define msg size, the same size will be
used in user space to alloc the msg receive buffer.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-02 18:40:05 -05:00
Philip Yang
38abd56bed drm/amdkfd: Correct SMI event read size
sizeof(buf) is 8 bytes because it is defined as unsigned char *buf,
each SMI event read only copy max 8 bytes to user buffer. Correct this
by using the buf allocate size.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-02 18:40:05 -05:00
Philip Yang
e433d68433 Revert "drm/amdkfd: process_info lock not needed for svm"
This reverts commit 3abfe30d80.

To fix deadlock in kFDSVMEvictTest when xnack off.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-02 18:40:05 -05:00
Harish Kasiviswanathan
0c41b9b561 drm/amdkfd: Print bdf in peer map failure message
Print alloc node, peer node and memory domain when peer map fails. This
is more useful

v2: use dev_err instead of pr_err
    use bdf for identify peer gpu

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23 14:26:35 -05:00
Felix Kuehling
a0c5fd46b2 drm/amdkfd: Use real device for messages
kfd_chardev() doesn't provide much useful information in dev_... messages
on multi-GPU systems because there is only one KFD device, which doesn't
correspond to any particular GPU. Use the actual GPU device to indicate
the GPU that caused a message.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23 14:02:50 -05:00
David Yat Sin
8f7519b2f3 drm/amdkfd: Fix for possible integer overflow
Fix for possible integer overflow when doing addition.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23 14:02:50 -05:00
Alex Deucher
9dff13f9ed drm/amdkfd: make CRAT table missing message informational only
The driver has a fallback so make the message informational
rather than a warning. The driver has a fallback if the
Component Resource Association Table (CRAT) is missing, so
make this informational now.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1906
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-22 14:52:39 -05:00
Felix Kuehling
22804e03f7 drm/amdkfd: Fix criu_restore_bo error handling
Clang static analysis reports this problem
kfd_chardev.c:2327:2: warning: 1st function call argument
  is an uninitialized value
  kvfree(bo_privs);
  ^~~~~~~~~~~~~~~~

Make sure bo_buckets and bo_privs are initialized so freeing them in the
error handling code path will never result in undefined behaviour.

Fixes: 73fa13b6a5 ("drm/amdkfd: CRIU Implement KFD restore ioctl")
Reported-by: Tom Rix <trix@redhat.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-22 14:40:44 -05:00
Kent Russell
757f9e4dd5 drm/amdkfd: Drop IH ring overflow message to dbg
When this was first implemented, overflows weren't expected in regular
operations, and tests weren't in place to cause said overflow. Now there
are cases where overflows occur with real workloads, but we know that
the kernel can handle this robustly, so move the message to a debug
message.

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-22 14:40:26 -05:00
Nathan Chancellor
b63c54d978 drm/amdkfd: Use proper enum in pm_unmap_queues_v9()
Clang warns:

  drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_packet_manager_v9.c:267:3:
  error: implicit conversion from enumeration type 'enum
  mes_map_queues_extended_engine_sel_enum' to different enumeration type
  'enum mes_unmap_queues_extended_engine_sel_enum'
  [-Werror,-Wenum-conversion]
                  extended_engine_sel__mes_map_queues__sdma0_to_7_sel :
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  1 error generated.

Use 'extended_engine_sel__mes_unmap_queues__sdma0_to_7_sel' to eliminate
the warning, which is the same numeric value of the proper type.

Fixes: 009e9a1585 ("drm/amdkfd: navi2x requires extended engines to map and unmap sdma queues")
Link: https://github.com/ClangBuiltLinux/linux/issues/1596
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-17 15:59:06 -05:00
Tao Zhou
29b440d204 drm/amdkfd: add return value check for queue eviction
Otherwise gpu reset will be triggered unconditionally in poison
consumption.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-16 17:30:02 -05:00
Changcheng Deng
d5c831566d drm/amdkfd: Replace zero-length array with flexible-array member
There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use "flexible array members" for these cases. The older
style of one-element or zero-length arrays should no longer be used.
Reference:
https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-16 17:30:02 -05:00
Jonathan Kim
009e9a1585 drm/amdkfd: navi2x requires extended engines to map and unmap sdma queues
SDMA 5.2.x queues are required to be mapped and unmapped from the extended
engines.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:41 -05:00
Jonathan Kim
d2cb0b21b8 drm/amdkfd: remove unneeded unmap single queue option
The KFD only unmaps all queues, all dynamics queues or all process queues
since RUN_LIST is mapped with all KFD queues.

There's no need to provide a single type unmap so remove this option.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:41 -05:00
Rajneesh Bhardwaj
2243f4937a drm/amdkfd: Fix leftover errors and warnings
A bunch of errors and warnings are leftover KFD over the years, attempt
to fix the errors and most warnings reported by checkpatch tool. Still a
few warnings remain which may be false positives so ignore them for now.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Rajneesh Bhardwaj
d87f36a063 drm/amdkfd: update SPDX license header
Update the SPDX License header for all the KFD files.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Alex Sierra
7f161df1a5 drm/amdkfd: replace err by dbg print at svm vram migration
Avoid spam the kernel log on application memory allocation failures.
__func__ argument was also removed from dev_fmt macro due to
parameter conflicts with dynamic_dev_dbg.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.comi>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:20:24 -05:00
Rajneesh Bhardwaj
24992ab0b8 drm/amdkfd: Fix prototype warning for get_process_num_bos
Fix the warning: no previous prototype for 'get_process_num_bos'
[-Wmissing-prototypes]

Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:20:17 -05:00
Rajneesh Bhardwaj
b010a46bd3 drm/amdkfd: CRIU fix extra whitespace and block comment warnings
Fix checkpatch reported warning for a quoted line and block line
comments.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:20:08 -05:00
Lang Yu
f9ed188d5a drm/amdgpu: add support for GC 10.1.4
Add basic support for GC 10.1.4,
it uses same IP blocks with GC 10.1.3

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:11:55 -05:00
Tom Rix
d8a25e4858 drm/amdkfd: fix loop error handling
Clang static analysis reports this problem
kfd_chardev.c:2594:16: warning: The expression is an uninitialized value.
  The computed value will also be garbage
        while (ret && i--) {
                      ^~~

i is a loop variable and this block unwinds a problem in the loop.
When the error happens before the loop, this value is garbage.
Move the initialization of i to its decalaration.

Fixes: be072b06c7 ("drm/amdkfd: CRIU export BOs as prime dmabuf objects")
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:11:33 -05:00
Tom Rix
574ff46f10 drm/amdkfd: fix freeing an unset pointer
clang static analysis reports this problem
kfd_chardev.c:2092:2: warning: 1st function call argument
  is an uninitialized value
        kvfree(bo_privs);
        ^~~~~~~~~~~~~~~~

When bo_buckets alloc fails, it jumps to an error handler
that frees the yet to be allocated bo_privs.  Because
bo_buckets is the first error, return directly.

Fixes: 5ccbb057c0 ("drm/amdkfd: CRIU Implement KFD checkpoint ioctl")
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:10:40 -05:00
Dan Carpenter
5aa71bd773 drm/amdkfd: CRIU return -EFAULT for copy_to_user() failure
If copy_to_user() fails, it returns the number of bytes remaining to
be copied but we want to return a negative error code (-EFAULT) to the
user.

Fixes: 9d5dabfeff ("drm/amdkfd: CRIU Save Shared Virtual Memory ranges")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:10:06 -05:00
Dan Carpenter
e5af61ffaa drm/amdkfd: CRIU fix a NULL vs IS_ERR() check
The kfd_process_device_data_by_id() does not return error pointers,
it returns NULL.

Fixes: bef153b70c ("drm/amdkfd: CRIU implement gpu_id remapping")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:09:26 -05:00
Mukul Joshi
a439b890db drm/amdkfd: Consolidate MQD manager functions
A few MQD manager functions are duplicated for all versions of
MQD manager. Remove this duplication by moving the common
functions into kfd_mqd_manager.c file.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:51 -05:00
Mukul Joshi
5bdd3eb253 drm/amdkfd: Remove unused old debugger implementation
Cleanup the kfd code by removing the unused old debugger
implementation.
The address watch was only ever implemented in the upstream
driver for GFXv7 (Kaveri). The user mode tools runtime using
this API was never open-sourced. Work on the old debugger
prototype that used this API has been discontinued years ago.
Only a small piece of resetting wavefronts is kept and
is moved to kfd_device_queue_manager.c.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:51 -05:00
Mukul Joshi
6c1a786773 drm/amdkfd: Fix TLB flushing in KFD SVM with no HWS
With no HWS, TLB flushing will not work in SVM code.
Fix this by calling kfd_flush_tlb() which works for both
HWS and no HWS case.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:51 -05:00
Tao Zhou
b1c87b0874 drm/amdkfd: use unmap all queues for poison consumption
Replace reset queue for specific PASID with unmap all queues, reset
queue could break CP scheduler.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 14:15:07 -05:00
Tao Zhou
03e5b167bd drm/amdkfd: rename kfd_process_vm_fault to kfd_dqm_evict_pasid
As the function is used in more different cases, use a more general
name.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 14:14:53 -05:00
Rajneesh Bhardwaj
2a909ae718 drm/amdkfd: CRIU resume shared virtual memory ranges
In CRIU resume stage, resume all the shared virtual memory ranges from
the data stored inside the resuming kfd process during CRIU restore
phase. Also setup xnack mode and free up the resources.

KFD_IOCTL_SVM_ATTR_CLR_FLAGS is not available for querying via get_attr
interface but we must clear the flags during restore as there might be
some default flags set when the prange is created. Also handle the
invalid PREFETCH atribute values saved during checkpoint by replacing
them with another dummy KFD_IOCTL_SVM_ATTR_SET_FLAGS attribute.

(rajneesh: Fixed the checkpatch reported problems)
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
c2db32ce77 drm/amdkfd: CRIU prepare for svm resume
During CRIU restore phase, the VMAs for the virtual address ranges are
not at their final location yet so in this stage, only cache the data
required to successfully resume the svm ranges during an imminent CRIU
resume phase.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
9d5dabfeff drm/amdkfd: CRIU Save Shared Virtual Memory ranges
During checkpoint stage, save the shared virtual memory ranges and
attributes for the target process. A process may contain a number of svm
ranges and each range might contain a number of attributes. While not
all attributes may be applicable for a given prange but during
checkpoint we store all possible values for the max possible attribute
types.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
08a987a8a0 drm/amdkfd: CRIU Discover svm ranges
A KFD process may contain a number of virtual address ranges for shared
virtual memory management and each such range can have many SVM
attributes spanning across various nodes within the process boundary.
This change reports the total number of such SVM ranges and
their total private data size by extending the PROCESS_INFO op of the the
CRIU IOCTL to discover the svm ranges in the target process and a future
patches brings in the required support for checkpoint and restore for
SVM ranges.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
d763d8030f drm/amdkfd: use user_gpu_id for svm ranges
Currently the SVM ranges use actual_gpu_id but with Checkpoint Restore
support its possible that the SVM ranges can be resumed on another node
where the actual_gpu_id may not be same as the original (user_gpu_id)
gpu id. So modify svm code to use user_gpu_id.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
d1289b41ec drm/amdkfd: CRIU allow external mm for svm ranges
Both svm_range_get_attr and svm_range_set_attr helpers use mm struct
from current but for a Checkpoint or Restore operation, the current->mm
will fetch the mm for the CRIU master process. So modify these helpers to
accept the task mm for a target kfd process to support Checkpoint
Restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
4717fe3d8d drm/amdkfd: CRIU checkpoint and restore xnack mode
Recoverable page faults are represented by the xnack mode setting inside
a kfd process and are used to represent the device page faults. For CR,
we don't consider negative values which are typically used for querying
the current xnack mode without modifying it.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
Rajneesh Bhardwaj
be072b06c7 drm/amdkfd: CRIU export BOs as prime dmabuf objects
KFD buffer objects do not associate a GEM handle with them so cannot
directly be used with libdrm to initiate a system dma (sDMA) operation
to speedup the checkpoint and restore operation so export them as dmabuf
objects and use with libdrm helper (amdgpu_bo_import) to further process
the sdma command submissions.

With sDMA, we see huge improvement in checkpoint and restore operations
compared to the generic pci based access via host data path.

Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
bef153b70c drm/amdkfd: CRIU implement gpu_id remapping
When doing a restore on a different node, the gpu_id's on the restore
node may be different. But the user space application will still refer
use the original gpu_id's in the ioctl calls. Adding code to create a
gpu id mapping so that kfd can determine actual gpu_id during the user
ioctl's.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
40e8a766a7 drm/amdkfd: CRIU checkpoint and restore events
Add support to existing CRIU ioctl's to save and restore events during
criu checkpoint and restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
3a9822d7bd drm/amdkfd: CRIU checkpoint and restore queue control stack
Checkpoint contents of queue control stacks on CRIU dump and restore them
during CRIU restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
42c6c48214 drm/amdkfd: CRIU checkpoint and restore queue mqds
Checkpoint contents of queue MQD's on CRIU dump and restore them during
CRIU restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
5bb6a8fa75 drm/amdkfd: CRIU restore queue doorbell id
When re-creating queues during CRIU restore, restore the queue with the
same doorbell id value used during CRIU dump.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
2485c12c98 drm/amdkfd: CRIU restore sdma id for queues
When re-creating queues during CRIU restore, restore the queue with the
same sdma id value used during CRIU dump.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
8668dfc30d drm/amdkfd: CRIU restore queue ids
When re-creating queues during CRIU restore, restore the queue with the
same queue id value used during CRIU dump.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
626f7b3190 drm/amdkfd: CRIU add queues support
Add support to existing CRIU ioctl's to save number of queues and queue
properties for each queue during checkpoint and re-create queues on
restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
cd9f791030 drm/amdkfd: CRIU Implement KFD unpause operation
Introducing UNPAUSE op. After CRIU amdgpu plugin performs a PROCESS_INFO
op the queues will be stay in an evicted state. Once the plugin is done
draining BO contents, it is safe to perform an UNPAUSE op for the queues
to resume.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:46 -05:00
Rajneesh Bhardwaj
011bbb0302 drm/amdkfd: CRIU Implement KFD resume ioctl
This adds support to create userptr BOs on restore and introduces a new
ioctl op to restart memory notifiers for the restored userptr BOs.
When doing CRIU restore MMU notifications can happen anytime after we call
amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the
restore process i.e. criu_resume ioctl op is received, and the process is
ready to be resumed. This ioctl is different from other KFD CRIU ioctls
since its called by CRIU master restore process for all the target
processes being resumed by CRIU.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:41 -05:00
Rajneesh Bhardwaj
73fa13b6a5 drm/amdkfd: CRIU Implement KFD restore ioctl
This implements the KFD CRIU Restore ioctl that lays the basic
foundation for the CRIU restore operation. It provides support to
create the buffer objects corresponding to the checkpointed image.
This ioctl creates various types of buffer objects such as VRAM,
MMIO, Doorbell, GTT based on the date sent from the userspace plugin.
The data mostly contains the previously checkpointed KFD images from
some KFD processs.

While restoring a criu process, attach old IDR values to newly
created BOs. This also adds the minimal gpu mapping support for a single
gpu checkpoint restore use case.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:35 -05:00
Rajneesh Bhardwaj
5ccbb057c0 drm/amdkfd: CRIU Implement KFD checkpoint ioctl
This adds support to discover the  buffer objects that belong to a
process being checkpointed. The data corresponding to these buffer
objects is returned to user space plugin running under criu master
context which then stores this info to recreate these buffer objects
during a restore operation.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:29 -05:00
Rajneesh Bhardwaj
f185381b64 drm/amdkfd: CRIU Implement KFD process_info ioctl
This IOCTL op is expected to be called as a precursor to the actual
Checkpoint operation. This does the basic discovery into the target
process seized by CRIU and relays the information to the userspace that
utilizes it to start the Checkpoint operation via another dedicated
IOCTL op.

The process_info IOCTL op determines the number of GPUs, buffer objects
that are associated with the target process, its process id in
caller's namespace since /proc/pid/mem interface maybe used to drain
the contents of the discovered buffer objects in userspace and getpid
returns the pid of CRIU dumper process. Also the pid of a process
inside a container might be different than its global pid so return
the ns pid.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:20 -05:00