linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-22 16:53:59 -04:00

Author	SHA1	Message	Date
Felix Kuehling	7e38ccb527	drm/amdkfd: Fix eviction fence handling Handle case that dma_fence_get_rcu_safe returns NULL. If restore work is already scheduled, only update its timer. The same work item cannot be queued twice, so undo the extra queue eviction. Fixes: `9a1c1339ab` ("drm/amdkfd: Run restore_workers on freezable WQs") Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Tested-by: Gang BA <Gang.Ba@amd.com> Reviewed-by: Gang BA <Gang.Ba@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-18 23:54:48 -04:00
Hawking Zhang	5e984b0a3d	drm/amdgpu: Use driver mode reset for data poison mode-2 reset is the only reliable method that can get GC/SDMA back when poison is consumed. mmhub requires mode-1 reset. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-18 23:46:23 -04:00
Felix Kuehling	18921b2050	drm/amdkfd: Fix memory leak in create_process failure Fix memory leak due to a leaked mmget reference on an error handling code path that is triggered when attempting to create KFD processes while a GPU reset is in progress. Fixes: `0ab2d7532b` ("drm/amdkfd: prepare per-process debug enable and disable") CC: Xiaogang Chen <xiaogang.chen@amd.com> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Tested-by: Harish Kasiviswanthan <Harish.Kasiviswanthan@amd.com> Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org	2024-04-17 11:05:09 -04:00
Felix Kuehling	4135899209	drm/amdkfd: Fix memory leak in create_process failure Fix memory leak due to a leaked mmget reference on an error handling code path that is triggered when attempting to create KFD processes while a GPU reset is in progress. Fixes: `0ab2d7532b` ("drm/amdkfd: prepare per-process debug enable and disable") CC: Xiaogang Chen <xiaogang.chen@amd.com> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Tested-by: Harish Kasiviswanthan <Harish.Kasiviswanthan@amd.com> Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-12 00:35:49 -04:00
Dave Airlie	3b0daecfea	amdkfd: use calloc instead of kzalloc to avoid integer overflow This uses calloc instead of doing the multiplication which might overflow. Cc: stable@vger.kernel.org Signed-off-by: Dave Airlie <airlied@redhat.com>	2024-04-12 11:11:59 +10:00
Zhigang Luo	d06af584be	amd/amdkfd: sync all devices to wait all processes being evicted If there are more than one device doing reset in parallel, the first device will call kfd_suspend_all_processes() to evict all processes on all devices, this call takes time to finish. other device will start reset and recover without waiting. if the process has not been evicted before doing recover, it will be restored, then caused page fault. Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-09 23:28:30 -04:00
Harish Kasiviswanathan	8bdfb4ea95	drm/amdkfd: Reset GPU on queue preemption failure Currently, with F32 HWS GPU reset is only when unmap queue fails. However, if compute queue doesn't repond to preemption request in time unmap will return without any error. In this case, only preemption error is logged and Reset is not triggered. Call GPU reset in this case also. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org	2024-04-09 23:09:31 -04:00
Zhigang Luo	dfb15c4ab5	amd/amdkfd: sync all devices to wait all processes being evicted If there are more than one device doing reset in parallel, the first device will call kfd_suspend_all_processes() to evict all processes on all devices, this call takes time to finish. other device will start reset and recover without waiting. if the process has not been evicted before doing recover, it will be restored, then caused page fault. Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-09 22:14:03 -04:00
Jonathan Kim	0cac183b98	drm/amdkfd: range check cp bad op exception interrupts Due to a CP interrupt bug, bad packet garbage exception codes are raised. Do a range check so that the debugger and runtime do not receive garbage codes. Update the user api to guard exception code type checking as well. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Tested-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-27 08:53:02 -04:00
Eric Huang	1210e2f103	drm/amdkfd: fix TLB flush after unmap for GFX9.4.2 TLB flush after unmap accidentially was removed on gfx9.4.2. It is to add it back. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org	2024-03-27 08:51:47 -04:00
Mukul Joshi	9d7993a7ab	drm/amdkfd: Check cgroup when returning DMABuf info Check cgroup permissions when returning DMA-buf info and based on cgroup info return the GPU id of the GPU that have access to the BO. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-27 08:49:29 -04:00
Harish Kasiviswanathan	d7e8ddc392	drm/amdkfd: Reset GPU on queue preemption failure Currently, with F32 HWS GPU reset is only when unmap queue fails. However, if compute queue doesn't repond to preemption request in time unmap will return without any error. In this case, only preemption error is logged and Reset is not triggered. Call GPU reset in this case also. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-27 01:44:53 -04:00
Mukul Joshi	417f78a2a1	drm/amdkfd: Cleanup workqueue during module unload Destroy the high priority workqueue that handles interrupts during KFD node cleanup. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-22 15:54:48 -04:00
Jonathan Kim	fb880635e0	drm/amdkfd: range check cp bad op exception interrupts Due to a CP interrupt bug, bad packet garbage exception codes are raised. Do a range check so that the debugger and runtime do not receive garbage codes. Update the user api to guard exception code type checking as well. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Tested-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-22 15:54:42 -04:00
Eric Huang	acf760c890	drm/amdkfd: fix TLB flush after unmap for GFX9.4.2 TLB flush after unmap accidentially was removed on gfx9.4.2. It is to add it back. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-22 15:51:49 -04:00
Mukul Joshi	e2680ee222	drm/amdkfd: Check cgroup when returning DMABuf info Check cgroup permissions when returning DMA-buf info and based on cgroup info return the GPU id of the GPU that have access to the BO. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-22 15:47:56 -04:00
Tao Zhou	2fc46e0b2f	drm/amdgpu: make reset method configurable for RAS poison Each RAS block has different requirement for gpu reset in poison consumption handling. Add support for mmhub RAS poison consumption handling. v2: remove the mmhub poison support for kfd int v10. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-20 13:38:15 -04:00
Tao Zhou	d8070c4241	drm/amdgpu: support utcl2 RAS poison query for mmhub Support the query for both gfxhub and mmhub, also replace xcc_id with hub_inst. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-20 13:38:14 -04:00
Mukul Joshi	0991a4c192	drm/amdkfd: Check preemption status on all XCDs This patch adds the following functionality: - Check the queue preemption status on all XCDs in a partition for GFX 9.4.3. - Update the queue preemption debug message to print the queue doorbell id for which preemption failed. - Change the signature of check preemption failed function to return a bool instead of uint32_t and pass the MQD manager as an argument. Suggested-by: Jay Cornwall <jay.cornwall@amd.com> Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-20 13:38:12 -04:00
Mukul Joshi	26d97182bb	drm/amdkfd: Rename read_doorbell_id in MQD functions Rename read_doorbell_id function to a more meaningful name, implying what it is used for. No functional change. Suggested-by: Jay Cornwall <jay.cornwall@amd.com> Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-20 13:38:12 -04:00
Tao Zhou	71a8d61ebc	drm/amdgpu: retire gfx ras query_utcl2_poison_status Replace it with related interface in gfxhub functions. v2: replace node id with xcc id. get node id for query_utcl2_poison_status Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-20 13:37:36 -04:00
Ricardo B. Marliere	6d3b27e046	drm/amdkfd: make kfd_class constant Since commit `43a7206b09` ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the kfd_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-06 15:24:50 -05:00
Laurent Morichetti	45bbf800c5	drm/amdkfd: Use SQC when TCP would fail in gfx10.1 context save Similarly to gfx9, gfx10.1 drops vector stores when an xnack error is raised. To work around this issue, use scalar stores instead of vector stores when trapsts.xnack_error == 1. Signed-off-by: Laurent Morichetti <laurent.morichetti@amd.com> Reviewed-by: Jay Cornwall <jay.cornwall@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-06 15:24:50 -05:00
Laurent Morichetti	f36e3f7260	drm/amdkfd: Increase the size of the memory reserved for the TBA In a future commit, the cwsr trap handler code size for gfx10.1 will increase to slightly above the one page mark. Since the TMA does not need to be page aligned, and only 2 pointers are stored in it, push the TMA offset by 2 KiB and keep the TBA+TMA reserved memory size to two pages. Signed-off-by: Laurent Morichetti <laurent.morichetti@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-06 15:24:49 -05:00
Shashank Sharma	b8f67b9ddf	drm/amdgpu: change vm->task_info handling This patch changes the handling and lifecycle of vm->task_info object. The major changes are: - vm->task_info is a dynamically allocated ptr now, and its uasge is reference counted. - introducing two new helper funcs for task_info lifecycle management - amdgpu_vm_get_task_info: reference counts up task_info before returning this info - amdgpu_vm_put_task_info: reference counts down task_info - last put to task_info() frees task_info from the vm. This patch also does logistical changes required for existing usage of vm->task_info. V2: Do not block all the prints when task_info not found (Felix) V3: Fixed review comments from Felix - Fix wrong indentation - No debug message for -ENOMEM - Add NULL check for task_info - Do not duplicate the debug messages (ti vs no ti) - Get first reference of task_info in vm_init(), put last in vm_fini() V4: Fixed review comments from Felix - fix double reference increment in create_task_info - change amdgpu_vm_get_task_info_pasid - additional changes in amdgpu_gem.c while porting Cc: Christian Koenig <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-04 15:59:08 -05:00
Eric Huang	1761d9a688	amd/amdkfd: remove unused parameter The adev can be found from bo by amdgpu_ttm_adev(bo->tbo.bdev), and adev is also not used in the function amdgpu_amdkfd_map_gtt_bo_to_gart(). Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-28 17:10:53 -05:00
Lijo Lazar	c37ce764cd	drm/amdkfd: Add partition id field to location_id On devices which have multi-partition nodes, keep partition id in location_id[31:28]. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-26 11:15:38 -05:00
Lijo Lazar	e1f6746f33	drm/amdkfd: Skip packet submission on fatal error If fatal error is detected, packet submission won't go through. Return error in such cases. Also, avoid waiting for fence when fatal error is detected. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-26 11:14:31 -05:00
Jonathan Kim	6f05159a0d	drm/amdkfd: fix process reference drop on debug ioctl Prevent dropping the KFD process reference at the end of a debug IOCTL call where the acquired process value is an error. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-22 10:27:50 -05:00
Yifan Zhang	f5f83441c4	drm/amdkfd: add KFD support for GC 11.5.1 Enable KFD for GC 11.5.1. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-22 10:27:10 -05:00
Felix Kuehling	34a1de0f79	drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole The TBA and TMA, along with an unused IB allocation, reside at low addresses in the VM address space. A stray VM fault which hits these pages must be serviced by making their page table entries invalid. The scheduler depends upon these pages being resident and fails, preventing a debugger from inspecting the failure state. By relocating these pages above 47 bits in the VM address space they can only be reached when bits [63:48] are set to 1. This makes it much less likely for a misbehaving program to generate accesses to them. The current placement at VA (PAGE_SIZE*2) is readily hit by a NULL access with a small offset. v2: - Move it to the reserved space to avoid concflicts with Mesa - Add macros to make reserved space management easier v3: - Move VM max PFN calculation into AMDGPU_VA_RESERVED macros Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Cc: Christian Koenig <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-16 15:41:50 -05:00
Rajneesh Bhardwaj	3459ffe8a8	drm/amdgpu: Fix implicit assumtion in gfx11 debug flags Gfx11 debug flags mask is currently set with an implicit assumption that no other mqd update flags exist. This needs to be fixed with newly introduced flag UPDATE_FLAG_IS_GWS by the previous patch. Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-14 17:15:35 -05:00
Rajneesh Bhardwaj	5869b32bbe	drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards In certain cooperative group dispatch scenarios the default SPI resource allocation may cause reduced per-CU workgroup occupancy. Set COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang scenarions. Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Suggested-by: Joseph Greathouse <Joseph.Greathouse@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-14 17:15:26 -05:00
Kent Russell	1727816961	drm/amdkfd: Fix L2 cache size reporting in GFX9.4.3 Its currently incorrectly multiplied by number of XCCs in the partition Fixes: `be457b2252` ("drm/amdkfd: Update cache info for GFX 9.4.3") Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-12 16:06:05 -05:00
Laurent Morichetti	804bf74b16	drm/amdkfd: pass debug exceptions to second-level trap handler Call the 2nd level trap handler if the cwsr handler is entered with any one of wave_start, wave_end, or trap_after_inst exceptions. Signed-off-by: Laurent Morichetti <laurent.morichetti@amd.com> Tested-by: Lancelot Six <lancelot.six@amd.com> Reviewed-by: Jay Cornwall <jay.cornwall@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-12 16:05:33 -05:00
Jonathan Kim	7f9dde7884	drm/amdkfd: fill in data for control stack header for gfx10 The debugger requires the control stack header to be filled in to update_waves. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-12 16:05:26 -05:00
Joseph Greathouse	5a2df8ecba	drm/amdkfd: Add cache line sizes to KFD topology The KFD topology includes cache line size, but we have not been filling that information out unless we are parsing a CRAT table. Fill in this information for the devices where we have cache information structs, and pipe this information to the topology sysfs files. v2: squash in fix from Joe (Alex) Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-07 12:25:51 -05:00
Lang Yu	0c93bd4957	drm/amdkfd: reserve the BO before validating it Fix a warning. v2: Avoid unmapping attachment repeatedly when ERESTARTSYS. v3: Lock the BO before accessing ttm->sg to avoid race conditions.(Felix) [ 41.708711] WARNING: CPU: 0 PID: 1463 at drivers/gpu/drm/ttm/ttm_bo.c:846 ttm_bo_validate+0x146/0x1b0 [ttm] [ 41.708989] Call Trace: [ 41.708992] <TASK> [ 41.708996] ? show_regs+0x6c/0x80 [ 41.709000] ? ttm_bo_validate+0x146/0x1b0 [ttm] [ 41.709008] ? __warn+0x93/0x190 [ 41.709014] ? ttm_bo_validate+0x146/0x1b0 [ttm] [ 41.709024] ? report_bug+0x1f9/0x210 [ 41.709035] ? handle_bug+0x46/0x80 [ 41.709041] ? exc_invalid_op+0x1d/0x80 [ 41.709048] ? asm_exc_invalid_op+0x1f/0x30 [ 41.709057] ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80 [amdgpu] [ 41.709185] ? ttm_bo_validate+0x146/0x1b0 [ttm] [ 41.709197] ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80 [amdgpu] [ 41.709337] ? srso_alias_return_thunk+0x5/0x7f [ 41.709346] kfd_mem_dmaunmap_attachment+0x9e/0x1e0 [amdgpu] [ 41.709467] amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x56/0x80 [amdgpu] [ 41.709586] kfd_ioctl_unmap_memory_from_gpu+0x1b7/0x300 [amdgpu] [ 41.709710] kfd_ioctl+0x1ec/0x650 [amdgpu] [ 41.709822] ? __pfx_kfd_ioctl_unmap_memory_from_gpu+0x10/0x10 [amdgpu] [ 41.709945] ? srso_alias_return_thunk+0x5/0x7f [ 41.709949] ? tomoyo_file_ioctl+0x20/0x30 [ 41.709959] __x64_sys_ioctl+0x9c/0xd0 [ 41.709967] do_syscall_64+0x3f/0x90 [ 41.709973] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Fixes: `101b810430` ("drm/amdkfd: Move dma unmapping after TLB flush") Signed-off-by: Lang Yu <Lang.Yu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-01-31 14:05:19 -05:00
Mukul Joshi	50d3cf5e5a	drm/amdkfd: Use correct drm device for cgroup permission check On GFX 9.4.3, for a given KFD node, fetch the correct drm device from XCP manager when checking for cgroup permissions. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-01-29 15:38:30 -05:00
Jay Cornwall	7297ff96ea	drm/amdkfd: Use S_ENDPGM_SAVED in trap handler This instruction has no functional difference to S_ENDPGM but allows performance counters to track save events correctly. Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Laurent Morichetti <laurent.morichetti@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-01-29 15:38:20 -05:00
Philip Yang	34e98e5b07	drm/amdkfd: Correct partial migration virtual addr Partial migration to system memory should use migrate.addr, not prange->start as virtual address to allocate system memory page. Fixes: `a546a27684` ("drm/amdkfd: Use partial migrations/mapping for GPU/CPU page faults in SVM") Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Xiaogang Chen <Xiaogang.Chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-01-29 15:36:47 -05:00
YiPeng Chai	ed1e1e42fd	drm/amdgpu: Support passing poison consumption ras block to SRIOV Support passing poison consumption ras blocks to SRIOV. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-01-25 14:58:03 -05:00
Alex Deucher	fc8f5a29d4	drm/amdgpu/gfx11: set UNORD_DISPATCH in compute MQDs This needs to be set to 1 to avoid a potential deadlock in the GC 10.x and newer. On GC 9.x and older, this needs to be set to 0. This can lead to hangs in some mixed graphics and compute workloads. Updated firmware is also required for AQL. Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org	2024-01-25 14:49:03 -05:00
Alex Deucher	ca01082353	drm/amdgpu/gfx10: set UNORD_DISPATCH in compute MQDs This needs to be set to 1 to avoid a potential deadlock in the GC 10.x and newer. On GC 9.x and older, this needs to be set to 0. This can lead to hangs in some mixed graphics and compute workloads. Updated firmware is also required for AQL. Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org	2024-01-25 14:48:58 -05:00
YiPeng Chai	22f6e3e112	drm/amdgpu: Add log info for umc_v12_0 Add log info for umc_v12_0. v2: Delete redundant logs. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-01-22 17:13:25 -05:00
Srinivasan Shanmugam	81d4b97068	drm/amdkfd: Fix variable dereferenced before NULL check in 'kfd_dbg_trap_device_snapshot()' Fixes the below: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_debug.c:1024 kfd_dbg_trap_device_snapshot() warn: variable dereferenced before check 'entry_size' (see line 1021) Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-01-15 18:35:37 -05:00
Felix Kuehling	50661eb1a2	drm/amdgpu: Auto-validate DMABuf imports in compute VMs DMABuf imports in compute VMs are not wrapped in a kgd_mem object on the process_info->kfd_bo_list. There is no explicit KFD API call to validate them or add eviction fences to them. This patch automatically validates and fences dymanic DMABuf imports when they are added to a compute VM. Revalidation after evictions is handled in the VM code. v2: * Renamed amdgpu_vm_validate_evicted_bos to amdgpu_vm_validate * Eliminated evicted_user state, use evicted state for VM BOs and user BOs * Fixed and simplified amdgpu_vm_fence_imports, depends on reserved BOs * Moved dma_resv_reserve_fences for amdgpu_vm_fence_imports into amdgpu_vm_validate, outside the vm->status_lock * Added dummy version of amdgpu_amdkfd_bo_validate_and_fence for builds without KFD v4: Eliminate amdgpu_vm_fence_imports. It's not needed because the reservation with its fences is shared with the export, as long as all imports are from KFD, with the exports already reserved, validated and fenced by the KFD restore worker. v5: Reintroduced separate evicted_user state to simplify the state machine and CS error handling when amdgpu_vm_validate is called without a ticket. Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-01-15 18:35:35 -05:00
Srinivasan Shanmugam	d7a254fad8	drm/amdkfd: Fix 'node' NULL check in 'svm_range_get_range_boundaries()' Range interval [start, last] is ordered by rb_tree, rb_prev, rb_next return value still needs NULL check, thus modified from "node" to "rb_node". Fixes the below: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:2691 svm_range_get_range_boundaries() warn: can 'node' even be NULL? Suggested-by: Philip Yang <Philip.Yang@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-01-15 18:33:00 -05:00
Dafna Hirschfeld	02eed83abc	drm/amdkfd: fixes for HMM mem allocation Fix err return value and reset pgmap->type after checking it. Fixes: `c83dee9b63` ("drm/amdkfd: add SPM support for SVM") Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-01-15 18:30:46 -05:00
Felix Kuehling	c147ddc68e	drm/amdkfd: Fix sparse __rcu annotation warnings Properly mark kfd_process->ef as __rcu and consistently use the right accessor functions. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/ Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-01-09 15:44:13 -05:00

1 2 3 4 5 ...

1685 Commits