linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-22 16:53:59 -04:00

Author	SHA1	Message	Date
Jesse Zhang	7f7f43f28e	drm/amdkfd: remove logically dead code idr_for_each_entry can ensure that mem is not empty during the loop. So don't need check mem again. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-06-05 11:25:14 -04:00
Eric Huang	dbe2c4c8ab	drm/amdkfd: add reset cause in gpu pre-reset smi event reset cause is requested by customer as additional info for gpu reset smi event. v2: integerate reset sources suggested by Lijo Lazar Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-06-05 11:25:14 -04:00
Jesse Zhang	b5b561621d	drm/amdkfd: remove dead code in kfd_create_vcrat_image_gpu kfd_create_vcrat_image_gpu itself checks the avail_size at the start. So the value of avail_size is at least VCRAT_SIZE_FOR_GPU(16384), minus struct crat_header(40UL) and struct crat_subtype_compute(40UL) it cannot be less than 0. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-06-05 11:25:13 -04:00
Jesse Zhang	8178cfb0b4	drm/amdkfd: fix the kdf debugger issue The expression caps \| HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED and caps \| HSA_CAP_TRAP_DEBUG_PRECISE_ALU_OPERATIONS_SUPPORTED are always 1/true regardless of the values of its operand. Fixes: `9243240bed` ("drm/amdkfd: enable single alu ops for gfx12") Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-06-05 11:25:00 -04:00
Jesse Zhang	46eb63ec8a	drm/amdkfd: Comment out the unused variable use_static in pm_map_queues_v9 To fix the warning about unused value, remove the use_static and use the parameter is_static directly. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-06-05 11:06:52 -04:00
Jay Cornwall	c5afb313e7	drm/amdkfd: Handle deallocated VPGRs in gfx11+ trap handler A wavefront may deallocate its VGPRs at the end of a program while waiting for memory transactions to complete. If it subsequently receives a context save exception it will be unable to save, since this requires VGPRs. In this case the trap handler should terminate the wavefront. Fixes intermittent VM faults under context switching load. V2: Use S_ENDPGM instead of S_ENDPGM_SAVED for performance counters Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Lancelot Six <lancelot.six@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-06-05 11:05:57 -04:00
Jesse Zhang	dfe190aff8	drm/amdkfd: remove dead code in the function svm_range_get_pte_flags The varible uncached set false, the condition uncached cannot be true. So remove the dead code, mapping flags will set the flag AMDGPU_VM_MTYPE_UC in else. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-06-05 11:02:59 -04:00
Jay Cornwall	fda812ebe3	drm/amdkfd: gfx12 context save/restore trap handler fixes Fix LDS size interpretation: 512 bytes (>= gfx12) vs 256 (< gfx12). Ensure STATE_PRIV.BARRIER_COMPLETE cannot change after reading or before writing. Other waves in the threadgroup may cause this field to assert if they complete the barrier. Do not overwrite EXCP_FLAG_PRIV.{SAVE_CONTEXT,HOST_TRAP} when restoring this register. Both of these fields can assert while the wavefront is running the trap handler. Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Lancelot Six <lancelot.six@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-06-05 10:57:40 -04:00
Jay Cornwall	c5e358913d	drm/amdkfd: Replace deprecated gfx12 trap handler instructions Newer assemblers reject S_WAITCNT. All instances of S_WAITCNT can be replaced by S_WAITCNT 0 (< gfx12) or S_WAIT_IDLE (>= gfx12) since there is no concurrency of different memory instruction classes. Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Lancelot Six <lancelot.six@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-29 14:48:30 -04:00
Jay Cornwall	dec4f2d224	drm/amdkfd: Sync trap handler binary with source Source and binary have become mismatched during branch activity. Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Lancelot Six <lancelot.six@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-29 14:48:30 -04:00
Alex Deucher	7978c4d414	drm/amdkfd: simplify APU VRAM handling With commit `89773b8559` ("drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs") big and small APU "VRAM" handling in KFD was unified. Since AMD_IS_APU is set for both big and small APUs, we can simplify the checks in the code. v2: clean up a few more places (Lang) Acked-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Lang Yu <Lang.Yu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-29 14:48:30 -04:00
Alex Deucher	d4ab6c409b	Revert "drm/amdkfd: fix gfx_target_version for certain 11.0.3 devices" This reverts commit `28ebbb4981`. Revert this commit as apparently the LLVM code to take advantage of this never landed. Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: Feifei Xu <feifei.xu@amd.com>	2024-05-29 14:48:30 -04:00
Alex Deucher	f2a1fbdd1f	drm/amdgpu: drop MES 10.1 support v3 It was an enablement vehicle for MES 11 and was never productized. Remove it. v2: drop additional checks in the GFX10 code. v3: drop mes_api_def.h Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-29 14:09:01 -04:00
Sreekant Somasekharan	5d32b7e77b	drm/amdkfd: Add GFX1201 to svm_range_get_pte_flags function GFX1201 was missed in the commit below. Adding it in. Fixes: `628e1ace23` ("drm/amdkfd: mark GFX12 system and peer GPU memory mappings as MTYPE_NC") Signed-off-by: Sreekant Somasekharan <sreekant.somasekharan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-17 17:40:37 -04:00
Harish Kasiviswanathan	3ed181b8ff	drm/amdkfd: Ensure gpu_id is unique gpu_id needs to be unique for user space to identify GPUs via KFD interface. In the current implementation there is a very small probability of having non unique gpu_ids. v2: Add check to confirm if gpu_id is unique. If not unique, find one Changed commit header to reflect the above v3: Use crc16 as suggested-by: Lijo Lazar <lijo.lazar@amd.com> Ensure that gpu_id != 0 Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-17 17:40:37 -04:00
Harish Kasiviswanathan	8e8c68f4c9	drm/amdkfd: Use dev_error intead of pr_error No functional change. This will help in moving gpu_id creation to next step while still being able to identify the correct GPU Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-13 16:12:02 -04:00
Mukul Joshi	85cf43c554	drm/amdkfd: Fix CU Masking for GFX 9.4.3 We are incorrectly passing the first XCC's MQD when updating CU masks for other XCCs in the partition. Fix this by passing the MQD for the XCC currently being updated with CU mask to update_cu_mask function. Fixes: `fc6efed2c7` ("drm/amdkfd: Update CU masking for GFX 9.4.3") Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-13 15:48:21 -04:00
Ramesh Errabolu	d2d3a44008	drm/amd/amdkfd: Fix a resource leak in svm_range_validate_and_map() Analysis of code by Coverity, a static code analyser, has identified a resource leak in the symbol hmm_range. This leak occurs when one of the prior steps before it is released encounters an error. Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-13 15:44:32 -04:00
Philip Yang	9095e55440	drm/amdkfd: Remove arbitrary timeout for hmm_range_fault On system with khugepaged enabled and user cases with THP buffer, the hmm_range_fault may takes > 15 seconds to return -EBUSY, the arbitrary timeout value is not accurate, cause memory allocation failure. Remove the arbitrary timeout value, return EAGAIN to application if hmm_range_fault return EBUSY, then userspace libdrm and Thunk will call ioctl again. Change EAGAIN to debug message as this is not error. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-13 15:44:02 -04:00
Michael Chen	10f624ef23	drm/amdkfd: Reconcile the definition and use of oem_id in struct kfd_topology_device Currently oem_id is defined as uint8_t[6] and casted to uint64_t* in some use case. This would lead code scanner to complain about access beyond. Re-define it in union to enforce 8-byte size and alignment to avoid potential issue. Signed-off-by: Michael Chen <michael.chen@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-08 18:46:26 -04:00
Alex Deucher	24e82654e9	drm/amdkfd: don't allow mapping the MMIO HDP page with large pages We don't get the right offset in that case. The GPU has an unused 4K area of the register BAR space into which you can remap registers. We remap the HDP flush registers into this space to allow userspace (CPU or GPU) to flush the HDP when it updates VRAM. However, on systems with >4K pages, we end up exposing PAGE_SIZE of MMIO space. Fixes: `d8e408a827` ("drm/amdkfd: Expose HDP registers to user space") Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org	2024-05-08 15:17:05 -04:00
Lin.Cao	547033b593	drm/amdkfd: Check debug trap enable before write dbg_ev_file In interrupt context, write dbg_ev_file will be run by work queue. It will cause write dbg_ev_file execution after debug_trap_disable, which will cause NULL pointer access. v2: cancel work "debug_event_workarea" before set dbg_ev_file as NULL. Signed-off-by: Lin.Cao <lincao12@amd.com> Reviewed-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-08 15:17:05 -04:00
Lijo Lazar	421226e5c9	Revert "drm/amdkfd: Add partition id field to location_id" This reverts commit `c37ce764cd`. RCCL library is currently not treating spatial partitions differently, hence this change is causing issues. Revert temporarily till RCCL implementation is ready for spatial partitions. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-08 15:17:04 -04:00
Lang Yu	89773b8559	drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs Small APUs(i.e., consumer, embedded products) usually have a small carveout device memory which can't satisfy most compute workloads memory allocation requirements. We can't even run a Basic MNIST Example with a default 512MB carveout. https://github.com/pytorch/examples/tree/main/mnist. Error Log: "torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 512.00 MiB of which 0 bytes is free. Of the allocated memory 103.83 MiB is allocated by PyTorch, and 22.17 MiB is reserved by PyTorch but unallocated" Though we can change BIOS settings to enlarge carveout size, which is inflexible and may bring complaint. On the other hand, the memory resource can't be effectively used between host and device. The solution is MI300A approach, i.e., let VRAM allocations go to GTT. Then device and host can flexibly and effectively share memory resource. v2: Report local_mem_size_private as 0. (Felix) Signed-off-by: Lang Yu <Lang.Yu@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-08 15:17:04 -04:00
Sreekant Somasekharan	628e1ace23	drm/amdkfd: mark GFX12 system and peer GPU memory mappings as MTYPE_NC Due to a HW bug, the system memory mappings and peer GPU mappings on GFX12 need to be marked as MTYPE_NC. Cc: Joe Greathouse <joseph.greathouse@amd.com> Cc: David Belanger <david.belanger@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Sreekant Somasekharan <sreekant.somasekharan@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:13 -04:00
Sreekant Somasekharan	a8a4615ba0	drm/amd/amdkfd: Add GFX12 PTE flag to SVM get PTE function Add new GFX12 PTE flag AMDGPU_PTE_IS_PTE to svm_range_get_pte_flags function. This resolves the issues related to SVM enablement in GFX12. Signed-off-by: Sreekant Somasekharan <sreekant.somasekharan@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:13 -04:00
David Belanger	c5faf18bbe	drm/amdkfd: Enable atomic support for GFX12 Enable flag in KFD and set the atomic support bit in MQD. Signed-off-by: David Belanger <david.belanger@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:13 -04:00
Eric Huang	a921c35ae5	drm/amdkfd: fix NULL ptr for debugfs mqds on GFX v12 mqd_stride function in gfx v12 is not implemented, that causes NULL ptr error. Add the generic func to fix it. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:13 -04:00
Jonathan Kim	9243240bed	drm/amdkfd: enable single alu ops for gfx12 GFX12 debugging requires setting up precise ALU operation for catching ALU exceptions. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Tested-by: Lancelot Six <lancelot.six@amd.com> Reviewed-by: Eric Huang <jinhuieric.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:13 -04:00
Jonathan Kim	fda3f378c4	drm/amdkfd: always enable ttmp setup for gfx12 Similar to GFX11, always enable the setup of trap temporaries on GFX12. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:13 -04:00
David Belanger	782b93436a	drm/amdkfd: Enable GFX12 trap handler Updated switch statement to use GFX12 trap handler. Signed-off-by: David Belanger <david.belanger@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:13 -04:00
Laurent Morichetti	cf338b5dfe	drm/amdkfd: enable missed single-step workaround for gfx12 When trap_ctrl.trap_after_inst is set, it is possible for a wave to enter the trap handler, after single-stepping an instruction and a save_context is raised, with only save_context set in excp_flag_priv. Because excp_flag_priv.trap_after_inst is not reliably set, we need to use the missed single-step workaround for gfx12 as well. Also add wave_start and wave_end as exceptions that should be handled by the 2nd level trap handler. Signed-off-by: Laurent Morichetti <laurent.morichetti@amd.com> Tested-by: Lancelot Six <lancelot.six@amd.com> Reviewed-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:13 -04:00
Lancelot SIX	450abfe433	drm/amdkfd: save and restore barrier state for gfx12 Add support to save and restore the work group barrier state in gfx12 CWSR trap handler. There is no support to directly restore the signal count of a barrier state, so instead this patch repeatedly calls s_barrier_signal to increment the signal count to the desired value. In this patch, I have implemented the logic to restore the barrier at the end of the block restoring the HWREGs. This process needs to be done by exactly 1 wave per work group. To achieve this, the initial value of s_restore_spi_init_hi (containing a FIRST_WAVE bit) needs to be saved up until that point. An alternative could be restore the barrier earlier in the process (around when LDS is restored, as the same wave does both). Doing this would break the pattern that the restore procedure follows the CWSR area layout. Before restoring the barrier, this patch checks if the barrier was whose state was saved has the "valid" bit set, even if I don't think this barrier can be in an invalid state during context save. I expect this test to always be true. Signed-off-by: Lancelot SIX <lancelot.six@amd.com> Reviewed-by: Jay Cornwall <jay.cornwall@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:12 -04:00
Jay Cornwall	f281003336	drm/amdkfd: Add gfx12 trap handler support - HWREG changes since gfx11 - Save/restore barrier state - get_wave_size is now reserved by assembler v2: rebase (Alex) Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:12 -04:00
Jay Cornwall	385093fde8	drm/amdkfd: Move trap handler coherence flags to preprocessor No functional change. Preparation for gfx12 support. v2: drop unrelated change (Alex) Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:12 -04:00
David Belanger	90e4fc8369	drm/amdkfd: Added gfx_v12_kfd2kgd interface for GFX12. Initial implementation, based on GFX11. v2: Removed functions not needed by cp scheduler. v3: Fixed typos. v4: squash in warning fix (Alex) Signed-off-by: David Belanger <david.belanger@amd.com> Acked-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:12 -04:00
David Belanger	47fa09b788	drm/amdkfd: Added device queue manager files for GFX12. Initial implementation, based on GFX11. v2: squash in include fix from David (Alex) Signed-off-by: David Belanger <david.belanger@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:12 -04:00
David Belanger	48f0bdf4e3	drm/amdkfd: Added MQD manager files for GFX12. Initial implementation, based on GFX11. v2: Removed dbg_wa code as not needed on GFX12. v3: squash in SDMA queue fixes (Alex) v4: rebase (Alex) Signed-off-by: David Belanger <david.belanger@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:12 -04:00
David Belanger	8aa89b69d6	drm/amdkfd: Added temporary changes for GFX12. Added cases for GFX12 in switch statement, code relying on GFX11 implementation until GFX12 implementation is complete. Signed-off-by: David Belanger <david.belanger@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:12 -04:00
David Belanger	592a5d7de4	drm/amdkfd: Basic SDMA and cache info changes for GFX12. Added GFX12 support to a few switch statements. Signed-off-by: David Belanger <david.belanger@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 16:18:12 -04:00
Hawking Zhang	5f571c61b9	drm/amdgpu: Add gfx v9_4_4 ip block Add gfx v9_4_4 ip block support Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 15:49:16 -04:00
Hawking Zhang	3d1bb1a2e0	drm/amdgpu: Add sdma v4_4_5 ip block Add sdma v4_4_5 ip block support Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 15:48:57 -04:00
Lancelot SIX	a89a05e3ca	drm/amdkfd: Flush the process wq before creating a kfd_process There is a race condition when re-creating a kfd_process for a process. This has been observed when a process under the debugger executes exec(3). In this scenario: - The process executes exec. - This will eventually release the process's mm, which will cause the kfd_process object associated with the process to be freed (kfd_process_free_notifier decrements the reference count to the kfd_process to 0). This causes kfd_process_ref_release to enqueue kfd_process_wq_release to the kfd_process_wq. - The debugger receives the PTRACE_EVENT_EXEC notification, and tries to re-enable AMDGPU traps (KFD_IOC_DBG_TRAP_ENABLE). - When handling this request, KFD tries to re-create a kfd_process. This eventually calls kfd_create_process and kobject_init_and_add. At this point the call to kobject_init_and_add can fail because the old kfd_process.kobj has not been freed yet by kfd_process_wq_release. This patch proposes to avoid this race by making sure to drain kfd_process_wq before creating a new kfd_process object. This way, we know that any cleanup task is done executing when we reach kobject_init_and_add. Signed-off-by: Lancelot SIX <lancelot.six@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-30 10:00:15 -04:00
Harish Kasiviswanathan	7f11a836e1	drm/amdkfd: Enforce queue BO's adev Queue buffer, though it is in system memory, has to be created using the correct amdgpu device. Enforce this as the BO needs to mapped to the GART for MES Hardware scheduler to access it. Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-26 17:22:42 -04:00
YiPeng Chai	bfa579b38b	drm/amdgpu: prepare to handle pasid poison consumption Prepare to handle pasid poison consumption. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-26 17:22:42 -04:00
Frank Min	ea9238a81b	drm/amdgpu: replace tmz flag into buffer flag Replace tmz flag into buffer flag to make it easier to understand and extend Signed-off-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Frank Min <Frank.Min@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-26 17:22:38 -04:00
Mukul Joshi	63335b383a	drm/amdkfd: Add VRAM accounting for SVM migration Do VRAM accounting when doing migrations to vram to make sure there is enough available VRAM and migrating to VRAM doesn't evict other possible non-unified memory BOs. If migrating to VRAM fails, driver can fall back to using system memory seamlessly. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-26 17:22:38 -04:00
Felix Kuehling	f989ecccdf	drm/amdkfd: Fix rescheduling of restore worker Handle the case that the restore worker was already scheduled by another eviction while the restore was in progress. Fixes: `9a1c1339ab` ("drm/amdkfd: Run restore_workers on freezable WQs") Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Tested-by: Yunxiang Li <Yunxiang.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-26 17:22:38 -04:00
Alex Deucher	80f071a343	drm/amdkfd: demote unsupported device messages to dev_info It's not really an error since the devices don't support the necessary hardware functionality. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3331 Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-04-23 12:08:30 -04:00
Dave Airlie	0208ca55aa	Backmerge tag 'v6.9-rc5' into drm-next Linux 6.9-rc5 I've had a persistent msm failure on clang, and the fix is in fixes so just pull it back to fix that. Signed-off-by: Dave Airlie <airlied@redhat.com>	2024-04-22 14:35:52 +10:00

1 2 3 4 5 ...

1685 Commits