linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-25 18:12:26 -04:00

Author	SHA1	Message	Date
Philip Yang	d58b8a99cb	drm/amdkfd: Add SMI add event helper To remove duplicate code, unify event message format and simplify new event add in the following patches. Use KFD_SMI_EVENT_MSG_SIZE to define msg size, the same size will be used in user space to alloc the msg receive buffer. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-03-02 18:40:05 -05:00
Philip Yang	38abd56bed	drm/amdkfd: Correct SMI event read size sizeof(buf) is 8 bytes because it is defined as unsigned char *buf, each SMI event read only copy max 8 bytes to user buffer. Correct this by using the buf allocate size. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-03-02 18:40:05 -05:00
Philip Yang	e433d68433	Revert "drm/amdkfd: process_info lock not needed for svm" This reverts commit `3abfe30d80`. To fix deadlock in kFDSVMEvictTest when xnack off. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-03-02 18:40:05 -05:00
Harish Kasiviswanathan	0c41b9b561	drm/amdkfd: Print bdf in peer map failure message Print alloc node, peer node and memory domain when peer map fails. This is more useful v2: use dev_err instead of pr_err use bdf for identify peer gpu Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-23 14:26:35 -05:00
Felix Kuehling	a0c5fd46b2	drm/amdkfd: Use real device for messages kfd_chardev() doesn't provide much useful information in dev_... messages on multi-GPU systems because there is only one KFD device, which doesn't correspond to any particular GPU. Use the actual GPU device to indicate the GPU that caused a message. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-23 14:02:50 -05:00
David Yat Sin	8f7519b2f3	drm/amdkfd: Fix for possible integer overflow Fix for possible integer overflow when doing addition. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-23 14:02:50 -05:00
Alex Deucher	9dff13f9ed	drm/amdkfd: make CRAT table missing message informational only The driver has a fallback so make the message informational rather than a warning. The driver has a fallback if the Component Resource Association Table (CRAT) is missing, so make this informational now. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1906 Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-22 14:52:39 -05:00
Felix Kuehling	22804e03f7	drm/amdkfd: Fix criu_restore_bo error handling Clang static analysis reports this problem kfd_chardev.c:2327:2: warning: 1st function call argument is an uninitialized value kvfree(bo_privs); ^~~~~~~~~~~~~~~~ Make sure bo_buckets and bo_privs are initialized so freeing them in the error handling code path will never result in undefined behaviour. Fixes: `73fa13b6a5` ("drm/amdkfd: CRIU Implement KFD restore ioctl") Reported-by: Tom Rix <trix@redhat.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-22 14:40:44 -05:00
Kent Russell	757f9e4dd5	drm/amdkfd: Drop IH ring overflow message to dbg When this was first implemented, overflows weren't expected in regular operations, and tests weren't in place to cause said overflow. Now there are cases where overflows occur with real workloads, but we know that the kernel can handle this robustly, so move the message to a debug message. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-22 14:40:26 -05:00
Nathan Chancellor	b63c54d978	drm/amdkfd: Use proper enum in pm_unmap_queues_v9() Clang warns: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_packet_manager_v9.c:267:3: error: implicit conversion from enumeration type 'enum mes_map_queues_extended_engine_sel_enum' to different enumeration type 'enum mes_unmap_queues_extended_engine_sel_enum' [-Werror,-Wenum-conversion] extended_engine_sel__mes_map_queues__sdma0_to_7_sel : ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1 error generated. Use 'extended_engine_sel__mes_unmap_queues__sdma0_to_7_sel' to eliminate the warning, which is the same numeric value of the proper type. Fixes: `009e9a1585` ("drm/amdkfd: navi2x requires extended engines to map and unmap sdma queues") Link: https://github.com/ClangBuiltLinux/linux/issues/1596 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-17 15:59:06 -05:00
Tao Zhou	29b440d204	drm/amdkfd: add return value check for queue eviction Otherwise gpu reset will be triggered unconditionally in poison consumption. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-16 17:30:02 -05:00
Changcheng Deng	d5c831566d	drm/amdkfd: Replace zero-length array with flexible-array member There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use "flexible array members" for these cases. The older style of one-element or zero-length arrays should no longer be used. Reference: https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-16 17:30:02 -05:00
Jonathan Kim	009e9a1585	drm/amdkfd: navi2x requires extended engines to map and unmap sdma queues SDMA 5.2.x queues are required to be mapped and unmapped from the extended engines. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
Jonathan Kim	d2cb0b21b8	drm/amdkfd: remove unneeded unmap single queue option The KFD only unmaps all queues, all dynamics queues or all process queues since RUN_LIST is mapped with all KFD queues. There's no need to provide a single type unmap so remove this option. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
Rajneesh Bhardwaj	2243f4937a	drm/amdkfd: Fix leftover errors and warnings A bunch of errors and warnings are leftover KFD over the years, attempt to fix the errors and most warnings reported by checkpatch tool. Still a few warnings remain which may be false positives so ignore them for now. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:40 -05:00
Rajneesh Bhardwaj	d87f36a063	drm/amdkfd: update SPDX license header Update the SPDX License header for all the KFD files. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:40 -05:00
Alex Sierra	7f161df1a5	drm/amdkfd: replace err by dbg print at svm vram migration Avoid spam the kernel log on application memory allocation failures. __func__ argument was also removed from dev_fmt macro due to parameter conflicts with dynamic_dev_dbg. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.comi> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:20:24 -05:00
Rajneesh Bhardwaj	24992ab0b8	drm/amdkfd: Fix prototype warning for get_process_num_bos Fix the warning: no previous prototype for 'get_process_num_bos' [-Wmissing-prototypes] Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:20:17 -05:00
Rajneesh Bhardwaj	b010a46bd3	drm/amdkfd: CRIU fix extra whitespace and block comment warnings Fix checkpatch reported warning for a quoted line and block line comments. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:20:08 -05:00
Lang Yu	f9ed188d5a	drm/amdgpu: add support for GC 10.1.4 Add basic support for GC 10.1.4, it uses same IP blocks with GC 10.1.3 Signed-off-by: Lang Yu <Lang.Yu@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:11:55 -05:00
Tom Rix	d8a25e4858	drm/amdkfd: fix loop error handling Clang static analysis reports this problem kfd_chardev.c:2594:16: warning: The expression is an uninitialized value. The computed value will also be garbage while (ret && i--) { ^~~ i is a loop variable and this block unwinds a problem in the loop. When the error happens before the loop, this value is garbage. Move the initialization of i to its decalaration. Fixes: `be072b06c7` ("drm/amdkfd: CRIU export BOs as prime dmabuf objects") Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:11:33 -05:00
Tom Rix	574ff46f10	drm/amdkfd: fix freeing an unset pointer clang static analysis reports this problem kfd_chardev.c:2092:2: warning: 1st function call argument is an uninitialized value kvfree(bo_privs); ^~~~~~~~~~~~~~~~ When bo_buckets alloc fails, it jumps to an error handler that frees the yet to be allocated bo_privs. Because bo_buckets is the first error, return directly. Fixes: `5ccbb057c0` ("drm/amdkfd: CRIU Implement KFD checkpoint ioctl") Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:10:40 -05:00
Dan Carpenter	5aa71bd773	drm/amdkfd: CRIU return -EFAULT for copy_to_user() failure If copy_to_user() fails, it returns the number of bytes remaining to be copied but we want to return a negative error code (-EFAULT) to the user. Fixes: `9d5dabfeff` ("drm/amdkfd: CRIU Save Shared Virtual Memory ranges") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Yat Sin <david.yatsin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:10:06 -05:00
Dan Carpenter	e5af61ffaa	drm/amdkfd: CRIU fix a NULL vs IS_ERR() check The kfd_process_device_data_by_id() does not return error pointers, it returns NULL. Fixes: `bef153b70c` ("drm/amdkfd: CRIU implement gpu_id remapping") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Yat Sin <david.yatsin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:09:26 -05:00
Mukul Joshi	a439b890db	drm/amdkfd: Consolidate MQD manager functions A few MQD manager functions are duplicated for all versions of MQD manager. Remove this duplication by moving the common functions into kfd_mqd_manager.c file. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-09 16:57:51 -05:00
Mukul Joshi	5bdd3eb253	drm/amdkfd: Remove unused old debugger implementation Cleanup the kfd code by removing the unused old debugger implementation. The address watch was only ever implemented in the upstream driver for GFXv7 (Kaveri). The user mode tools runtime using this API was never open-sourced. Work on the old debugger prototype that used this API has been discontinued years ago. Only a small piece of resetting wavefronts is kept and is moved to kfd_device_queue_manager.c. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-09 16:57:51 -05:00
Mukul Joshi	6c1a786773	drm/amdkfd: Fix TLB flushing in KFD SVM with no HWS With no HWS, TLB flushing will not work in SVM code. Fix this by calling kfd_flush_tlb() which works for both HWS and no HWS case. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-09 16:57:51 -05:00
Tao Zhou	b1c87b0874	drm/amdkfd: use unmap all queues for poison consumption Replace reset queue for specific PASID with unmap all queues, reset queue could break CP scheduler. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-09 14:15:07 -05:00
Tao Zhou	03e5b167bd	drm/amdkfd: rename kfd_process_vm_fault to kfd_dqm_evict_pasid As the function is used in more different cases, use a more general name. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-09 14:14:53 -05:00
Rajneesh Bhardwaj	2a909ae718	drm/amdkfd: CRIU resume shared virtual memory ranges In CRIU resume stage, resume all the shared virtual memory ranges from the data stored inside the resuming kfd process during CRIU restore phase. Also setup xnack mode and free up the resources. KFD_IOCTL_SVM_ATTR_CLR_FLAGS is not available for querying via get_attr interface but we must clear the flags during restore as there might be some default flags set when the prange is created. Also handle the invalid PREFETCH atribute values saved during checkpoint by replacing them with another dummy KFD_IOCTL_SVM_ATTR_SET_FLAGS attribute. (rajneesh: Fixed the checkpatch reported problems) Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj	c2db32ce77	drm/amdkfd: CRIU prepare for svm resume During CRIU restore phase, the VMAs for the virtual address ranges are not at their final location yet so in this stage, only cache the data required to successfully resume the svm ranges during an imminent CRIU resume phase. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj	9d5dabfeff	drm/amdkfd: CRIU Save Shared Virtual Memory ranges During checkpoint stage, save the shared virtual memory ranges and attributes for the target process. A process may contain a number of svm ranges and each range might contain a number of attributes. While not all attributes may be applicable for a given prange but during checkpoint we store all possible values for the max possible attribute types. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj	08a987a8a0	drm/amdkfd: CRIU Discover svm ranges A KFD process may contain a number of virtual address ranges for shared virtual memory management and each such range can have many SVM attributes spanning across various nodes within the process boundary. This change reports the total number of such SVM ranges and their total private data size by extending the PROCESS_INFO op of the the CRIU IOCTL to discover the svm ranges in the target process and a future patches brings in the required support for checkpoint and restore for SVM ranges. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj	d763d8030f	drm/amdkfd: use user_gpu_id for svm ranges Currently the SVM ranges use actual_gpu_id but with Checkpoint Restore support its possible that the SVM ranges can be resumed on another node where the actual_gpu_id may not be same as the original (user_gpu_id) gpu id. So modify svm code to use user_gpu_id. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj	d1289b41ec	drm/amdkfd: CRIU allow external mm for svm ranges Both svm_range_get_attr and svm_range_set_attr helpers use mm struct from current but for a Checkpoint or Restore operation, the current->mm will fetch the mm for the CRIU master process. So modify these helpers to accept the task mm for a target kfd process to support Checkpoint Restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj	4717fe3d8d	drm/amdkfd: CRIU checkpoint and restore xnack mode Recoverable page faults are represented by the xnack mode setting inside a kfd process and are used to represent the device page faults. For CR, we don't consider negative values which are typically used for querying the current xnack mode without modifying it. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:52 -05:00
Rajneesh Bhardwaj	be072b06c7	drm/amdkfd: CRIU export BOs as prime dmabuf objects KFD buffer objects do not associate a GEM handle with them so cannot directly be used with libdrm to initiate a system dma (sDMA) operation to speedup the checkpoint and restore operation so export them as dmabuf objects and use with libdrm helper (amdgpu_bo_import) to further process the sdma command submissions. With sDMA, we see huge improvement in checkpoint and restore operations compared to the generic pci based access via host data path. Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:52 -05:00
David Yat Sin	bef153b70c	drm/amdkfd: CRIU implement gpu_id remapping When doing a restore on a different node, the gpu_id's on the restore node may be different. But the user space application will still refer use the original gpu_id's in the ioctl calls. Adding code to create a gpu id mapping so that kfd can determine actual gpu_id during the user ioctl's. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:52 -05:00
David Yat Sin	40e8a766a7	drm/amdkfd: CRIU checkpoint and restore events Add support to existing CRIU ioctl's to save and restore events during criu checkpoint and restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:52 -05:00
David Yat Sin	3a9822d7bd	drm/amdkfd: CRIU checkpoint and restore queue control stack Checkpoint contents of queue control stacks on CRIU dump and restore them during CRIU restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:52 -05:00
David Yat Sin	42c6c48214	drm/amdkfd: CRIU checkpoint and restore queue mqds Checkpoint contents of queue MQD's on CRIU dump and restore them during CRIU restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:52 -05:00
David Yat Sin	5bb6a8fa75	drm/amdkfd: CRIU restore queue doorbell id When re-creating queues during CRIU restore, restore the queue with the same doorbell id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:52 -05:00
David Yat Sin	2485c12c98	drm/amdkfd: CRIU restore sdma id for queues When re-creating queues during CRIU restore, restore the queue with the same sdma id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:52 -05:00
David Yat Sin	8668dfc30d	drm/amdkfd: CRIU restore queue ids When re-creating queues during CRIU restore, restore the queue with the same queue id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:52 -05:00
David Yat Sin	626f7b3190	drm/amdkfd: CRIU add queues support Add support to existing CRIU ioctl's to save number of queues and queue properties for each queue during checkpoint and re-create queues on restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:52 -05:00
David Yat Sin	cd9f791030	drm/amdkfd: CRIU Implement KFD unpause operation Introducing UNPAUSE op. After CRIU amdgpu plugin performs a PROCESS_INFO op the queues will be stay in an evicted state. Once the plugin is done draining BO contents, it is safe to perform an UNPAUSE op for the queues to resume. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:46 -05:00
Rajneesh Bhardwaj	011bbb0302	drm/amdkfd: CRIU Implement KFD resume ioctl This adds support to create userptr BOs on restore and introduces a new ioctl op to restart memory notifiers for the restored userptr BOs. When doing CRIU restore MMU notifications can happen anytime after we call amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the restore process i.e. criu_resume ioctl op is received, and the process is ready to be resumed. This ioctl is different from other KFD CRIU ioctls since its called by CRIU master restore process for all the target processes being resumed by CRIU. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:41 -05:00
Rajneesh Bhardwaj	73fa13b6a5	drm/amdkfd: CRIU Implement KFD restore ioctl This implements the KFD CRIU Restore ioctl that lays the basic foundation for the CRIU restore operation. It provides support to create the buffer objects corresponding to the checkpointed image. This ioctl creates various types of buffer objects such as VRAM, MMIO, Doorbell, GTT based on the date sent from the userspace plugin. The data mostly contains the previously checkpointed KFD images from some KFD processs. While restoring a criu process, attach old IDR values to newly created BOs. This also adds the minimal gpu mapping support for a single gpu checkpoint restore use case. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:35 -05:00
Rajneesh Bhardwaj	5ccbb057c0	drm/amdkfd: CRIU Implement KFD checkpoint ioctl This adds support to discover the buffer objects that belong to a process being checkpointed. The data corresponding to these buffer objects is returned to user space plugin running under criu master context which then stores this info to recreate these buffer objects during a restore operation. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:29 -05:00
Rajneesh Bhardwaj	f185381b64	drm/amdkfd: CRIU Implement KFD process_info ioctl This IOCTL op is expected to be called as a precursor to the actual Checkpoint operation. This does the basic discovery into the target process seized by CRIU and relays the information to the userspace that utilizes it to start the Checkpoint operation via another dedicated IOCTL op. The process_info IOCTL op determines the number of GPUs, buffer objects that are associated with the target process, its process id in caller's namespace since /proc/pid/mem interface maybe used to drain the contents of the discovered buffer objects in userspace and getpid returns the pid of CRIU dumper process. Also the pid of a process inside a container might be different than its global pid so return the ns pid. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 17:59:20 -05:00

... 11 12 13 14 15 ...

1685 Commits