linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-04 06:22:40 -04:00

Author	SHA1	Message	Date
Harish Kasiviswanathan	076470b9f6	drm/amdkfd: Fix GPU mappings for APU after prefetch Fix the following corner case:- Consider a 2M huge page SVM allocation, followed by prefetch call for the first 4K page. The whole range is initially mapped with single PTE. After the prefetch, this range gets split to first page + rest of the pages. Currently, the first page mapping is not updated on MI300A (APU) since page hasn't migrated. However, after range split PTE mapping it not valid. Fix this by forcing page table update for the whole range when prefetch is called. Calling prefetch on APU doesn't improve performance. If all it deteriotes. However, functionality has to be supported. v2: Use apu_prefer_gtt as this issue doesn't apply to APUs with carveout VRAM v3: Simplify by setting the flag for all ASICs as it doesn't affect dGPU v4: Remove v2 and v3 changes. Force update_mapping when range is split at a size that is not aligned to prange granularity Suggested-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-11 21:54:19 -05:00
Jonathan Kim	15bd4958fe	drm/amdkfd: relax checks for over allocation of save area Over allocation of save area is not fatal, only under allocation is. ROCm has various components that independently claim authority over save area size. Unless KFD decides to claim single authority, relax size checks. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Philip Yang <philip.yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-11 21:54:17 -05:00
Ahmad Rehman	37cdb89c0a	drm/amdkfd: Fixing the clang format This patch fixes the formatting in the patch "amdkfd: Do not wait for queue op response during reset" Signed-off-by: Ahmad Rehman <Ahmad.Rehman@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-11 21:54:14 -05:00
Ahmad Rehman	07528f7d97	drm/amdkfd: Do not wait for queue op response during reset This patch adds the condition to not wait for the queue response for unmap, if the gpu is in reset. Signed-off-by: Ahmad Rehman <Ahmad.Rehman@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-06 09:56:30 -05:00
Philip Yang	88ef4de35f	Revert "drm/amdkfd: Improve signal event slow path" To fix regression report on gfx8, which requires the exhaustive search path for signaled event. The high CPU usage of KFD interrupt wq issue is gone after HIP/ROCr add option to reduce HW event interrupts, safe to revert this optimization patch now. This reverts commit `de844846f7`. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-04 11:53:22 -05:00
Sunday Clement	ff7644faf3	drm/amdkfd: Fix Unchecked Return Values Properly Check for return values from calls to debug functions in runtime_disable(). v2: storing the last non zero returned value from the loop. Signed-off-by: Sunday Clement <Sunday.Clement@amd.com> Reviewed-by: Jonathan Kim <Jonathan.Kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-04 11:53:05 -05:00
Sunil Khatri	c0de552910	drm/amdkfd: clean up the code to free hmm_range a. hmm_range is either NULL or a valid pointer so we do not need to set range to NULL ever. b. keep the hmm_range_free in the end irrespective of the other conditions to avoid some additional checks and also avoid double free issue. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-04 11:33:53 -05:00
Sakari Ailus	ef4a4b8781	drm/amd: Remove redundant pm_runtime_mark_last_busy() calls pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(), pm_runtime_autosuspend() and pm_request_autosuspend() now include a call to pm_runtime_mark_last_busy(). Remove the now-redundant explicit call to pm_runtime_mark_last_busy(). Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-28 11:31:45 -04:00
Sunday Clement	fea8f13f4f	drm/amdkfd: Fix Unchecked Return Value Properly check the return values for function, as done elsewhere. Signed-off-by: Sunday Clement <Sunday.Clement@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-28 10:10:23 -04:00
Srinivasan Shanmugam	dfc74e37bd	drm/amdkfd: Fix use-after-free of HMM range in svm_range_validate_and_map() The function svm_range_validate_and_map() was freeing `range` when amdgpu_hmm_range_get_pages() failed. But later, the code still used the same `range` pointer and freed it again. This could cause a use-after-free and double-free issue. The fix sets `range = NULL` right after it is freed and checks for `range` before using or freeing it again. v2: Removed duplicate !r check in the condition for clarity. v3: In amdgpu_hmm_range_get_pages(), when hmm_range_fault() fails, we kvfree(pfns) but leave the pointer in hmm_range->hmm_pfns still pointing to freed memory. The caller (or amdgpu_hmm_range_free(range)) may try to free range->hmm_range.hmm_pfns again, causing a double free, Setting hmm_range->hmm_pfns = NULL immediately after kvfree(pfns) prevents both double free. (Philip) In svm_range_validate_and_map(), When r == 0, it means success → range is not NULL. When r != 0, it means failure → already made range = NULL. So checking both (!r && range) is unnecessary because the moment r == 0, we automatically know range exists and is safe to use. (Philip) Fixes: `737da5363c` ("drm/amdgpu: update the functions to use amdgpu version of hmm") Reported by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Philip Yang <Philip.Yang@amd.com> Cc: Sunil Khatri <sunil.khatri@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-28 10:02:21 -04:00
Sunil Khatri	1017e393ad	drm/amdkfd: fix the clean up when amdgpu_hmm_range_alloc fails we need to unreserve the bo's too during clean up along with freeing the memory of context. Fixes: `7bb02a34c2` ("drm/amdkfd: add missing return value check for range") Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-28 10:00:35 -04:00
Philip Yang	22d9e071e5	drm/amdkfd: Dequeue user queues when process mm released Move dequeue user queues and destroy user queues from kfd_process_wq_release to mmu notifier release callback, to ensure no system memory access from GPU because the process memory is going to free from CPU after mmu release notifier callback returns. Destroy queue releases the svm prange queue_refcount, this also removes fake flase positive warning message "Freeing queue vital buffer" message if application crash or killed. Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-28 09:59:44 -04:00
Sunil Khatri	7bb02a34c2	drm/amdkfd: add missing return value check for range amdgpu_hmm_range_alloc could fails in case of low memory condition and hence we should have a check for the return value. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Shirish S <shirish.s@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-28 09:59:37 -04:00
YiPeng Chai	1d87afd610	drm/amdgpu: Add poison consumption sequence numbers for gfx and sdma Add poison consumption sequence numbers for gfx and sdma. V3: Use RAS_EVENT_LOG to print ras log info. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-20 18:18:47 -04:00
Sunil Khatri	737da5363c	drm/amdgpu: update the functions to use amdgpu version of hmm At times we need a bo reference for hmm and for that add a new struct amdgpu_hmm_range which will hold an optional bo member and hmm_range. Use amdgpu_hmm_range instead of hmm_range and let the bo as an optional argument for the caller if they want to the bo reference to be taken or they want to handle that explicitly. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:36 -04:00
Sunil Khatri	b1dd0db1c6	drm/amdgpu: clean up amdgpu hmm range functions Clean up the amdgpu hmm range functions for clearer definition of each. a. Split amdgpu_ttm_tt_get_user_pages_done into two: 1. amdgpu_hmm_range_valid: To check if the user pages are valid and update seq num 2. amdgpu_hmm_range_free: Clean up the hmm range and pfn memory. b. amdgpu_ttm_tt_get_user_pages_done and amdgpu_ttm_tt_discard_user_pages are similar function so remove discard and directly use amdgpu_hmm_range_free to clean up the hmm range and pfn memory. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:28 -04:00
Sunil Khatri	e095b55155	drm/amdgpu: use user provided hmm_range buffer in amdgpu_ttm_tt_get_user_pages update the amdgpu_ttm_tt_get_user_pages and all dependent function along with it callers to use a user allocated hmm_range buffer instead hmm layer allocates the buffer. This is a need to get hmm_range pointers easily accessible without accessing the bo and that is a requirement for the userqueue to lock the userptrs effectively. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:28 -04:00
Jonathan Kim	079ae5118e	drm/amdkfd: fix suspend/resume all calls in mes based eviction path Suspend/resume all gangs should be done with the device lock is held. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-13 14:14:28 -04:00
Philip Yang	7574f30337	drm/amdkfd: Fix mmap write lock not release If mmap write lock is taken while draining retry fault, mmap write lock is not released because svm_range_restore_pages calls mmap_read_unlock then returns. This causes deadlock and system hangs later because mmap read or write lock cannot be taken. Downgrade mmap write lock to read lock if draining retry fault fix this bug. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-07 14:09:06 -04:00
Yifan Zhang	45da20e00d	amd/amdkfd: enhance kfd process check in switch partition current switch partition only check if kfd_processes_table is empty. kfd_prcesses_table entry is deleted in kfd_process_notifier_release, but kfd_process tear down is in kfd_process_wq_release. consider two processes: Process A (workqueue) -> kfd_process_wq_release -> Access kfd_node member Process B switch partition -> amdgpu_xcp_pre_partition_switch -> amdgpu_amdkfd_device_fini_sw -> kfd_node tear down. Process A and B may trigger a race as shown in dmesg log. This patch is to resolve the race by adding an atomic kfd_process counter kfd_processes_count, it increment as create kfd process, decrement as finish kfd_process_wq_release. v2: Put kfd_processes_count per kfd_dev, move decrement to kfd_process_destroy_pdds and bug fix. (Philip Yang) [3966658.307702] divide error: 0000 [#1] SMP NOPTI [3966658.350818] i10nm_edac [3966658.356318] CPU: 124 PID: 38435 Comm: kworker/124:0 Kdump: loaded Tainted [3966658.356890] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu] [3966658.362839] nfit [3966658.366457] RIP: 0010:kfd_get_num_sdma_engines+0x17/0x40 [amdgpu] [3966658.366460] Code: 00 00 e9 ac 81 02 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 48 8b 4f 08 48 8b b7 00 01 00 00 8b 81 58 26 03 00 99 <f7> be b8 01 00 00 80 b9 70 2e 00 00 00 74 0b 83 f8 02 ba 02 00 00 [3966658.380967] x86_pkg_temp_thermal [3966658.391529] RSP: 0018:ffffc900a0edfdd8 EFLAGS: 00010246 [3966658.391531] RAX: 0000000000000008 RBX: ffff8974e593b800 RCX: ffff888645900000 [3966658.391531] RDX: 0000000000000000 RSI: ffff888129154400 RDI: ffff888129151c00 [3966658.391532] RBP: ffff8883ad79d400 R08: 0000000000000000 R09: ffff8890d2750af4 [3966658.391532] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000000 [3966658.391533] R13: ffff8883ad79d400 R14: ffffe87ff662ba00 R15: ffff8974e593b800 [3966658.391533] FS: 0000000000000000(0000) GS:ffff88fe7f600000(0000) knlGS:0000000000000000 [3966658.391534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3966658.391534] CR2: 0000000000d71000 CR3: 000000dd0e970004 CR4: 0000000002770ee0 [3966658.391535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3966658.391535] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [3966658.391536] PKRU: 55555554 [3966658.391536] Call Trace: [3966658.391674] deallocate_sdma_queue+0x38/0xa0 [amdgpu] [3966658.391762] process_termination_cpsch+0x1ed/0x480 [amdgpu] [3966658.399754] intel_powerclamp [3966658.402831] kfd_process_dequeue_from_all_devices+0x5b/0xc0 [amdgpu] [3966658.402908] kfd_process_wq_release+0x1a/0x1a0 [amdgpu] [3966658.410516] coretemp [3966658.434016] process_one_work+0x1ad/0x380 [3966658.434021] worker_thread+0x49/0x310 [3966658.438963] kvm_intel [3966658.446041] ? process_one_work+0x380/0x380 [3966658.446045] kthread+0x118/0x140 [3966658.446047] ? __kthread_bind_mask+0x60/0x60 [3966658.446050] ret_from_fork+0x1f/0x30 [3966658.446053] Modules linked in: kpatch_20765354(OEK) [3966658.455310] kvm [3966658.464534] mptcp_diag xsk_diag raw_diag unix_diag af_packet_diag netlink_diag udp_diag act_pedit act_mirred act_vlan cls_flower kpatch_21951273(OEK) kpatch_18424469(OEK) kpatch_19749756(OEK) [3966658.473462] idxd_mdev [3966658.482306] kpatch_17971294(OEK) sch_ingress xt_conntrack amdgpu(OE) amdxcp(OE) amddrm_buddy(OE) amd_sched(OE) amdttm(OE) amdkcl(OE) intel_ifs iptable_mangle tcm_loop target_core_pscsi tcp_diag target_core_file inet_diag target_core_iblock target_core_user target_core_mod coldpgs kpatch_18383292(OEK) ip6table_nat ip6table_filter ip6_tables ip_set_hash_ipportip ip_set_hash_ipportnet ip_set_hash_ipport ip_set_bitmap_port xt_comment iptable_nat nf_nat iptable_filter ip_tables ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sn_core_odd(OE) i40e overlay binfmt_misc tun bonding(OE) aisqos(OE) aisqos_hotfixes(OE) rfkill uio_pci_generic uio cuse fuse nf_tables nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm idxd_mdev [3966658.491237] vfio_pci [3966658.501196] vfio_pci vfio_virqfd mdev vfio_iommu_type1 vfio iax_crypto intel_pmt_telemetry iTCO_wdt intel_pmt_class iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq [3966658.508537] vfio_virqfd [3966658.517569] snd_seq_device ipmi_ssif isst_if_mbox_pci isst_if_mmio pcspkr snd_pcm idxd intel_uncore ses isst_if_common intel_vsec idxd_bus enclosure snd_timer mei_me snd i2c_i801 i2c_smbus mei i2c_ismt soundcore joydev acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad vfat fat [3966658.526851] mdev [3966658.536096] nfsd auth_rpcgss nfs_acl lockd grace slb_vtoa(OE) sunrpc dm_mod hookers mlx5_ib(OE) ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm mlx5_core(OE) mlxfw(OE) [3966658.540381] vfio_iommu_type1 [3966658.544341] nvme mpt3sas tls drm nvme_core pci_hyperv_intf raid_class psample libcrc32c crc32c_intel mlxdevm(OE) i2c_core [3966658.551254] vfio [3966658.558742] scsi_transport_sas wmi pinctrl_emmitsburg sd_mod t10_pi sg ahci libahci libata rdma_ucm(OE) ib_uverbs(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE) [3966658.563004] iax_crypto [3966658.570988] [last unloaded: diagnose] [3966658.571027] ---[ end trace cc9dbb180f9ae537 ]--- Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Philip.Yang<Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:53:08 -04:00
Yifan Zhang	99d7181bca	amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw There is race in amdgpu_amdkfd_device_fini_sw and interrupt. if amdgpu_amdkfd_device_fini_sw run in b/w kfd_cleanup_nodes and kfree(kfd), and KGD interrupt generated. kernel panic log: BUG: kernel NULL pointer dereference, address: 0000000000000098 amdgpu 0000:c8:00.0: amdgpu: Requesting 4 partitions through PSP PGD d78c68067 P4D d78c68067 kfd kfd: amdgpu: Allocated 3969056 bytes on gart PUD 1465b8067 PMD @ Oops: @002 [#1] SMP NOPTI kfd kfd: amdgpu: Total number of KFD nodes to be created: 4 CPU: 115 PID: @ Comm: swapper/115 Kdump: loaded Tainted: G S W OE K RIP: 0010:_raw_spin_lock_irqsave+0x12/0x40 Code: 89 e@ 41 5c c3 cc cc cc cc 66 66 2e Of 1f 84 00 00 00 00 00 OF 1f 40 00 Of 1f 44% 00 00 41 54 9c 41 5c fa 31 cO ba 01 00 00 00 <fO> OF b1 17 75 Ba 4c 89 e@ 41 Sc 89 c6 e8 07 38 5d RSP: 0018: ffffc90@1a6b0e28 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000018 0000000000000001 RSI: ffff8883bb623e00 RDI: 0000000000000098 ffff8883bb000000 RO8: ffff888100055020 ROO: ffff888100055020 0000000000000000 R11: 0000000000000000 R12: 0900000000000002 ffff888F2b97da0@ R14: @000000000000098 R15: ffff8883babdfo00 CS: 010 DS: 0000 ES: 0000 CRO: 0000000080050033 CR2: 0000000000000098 CR3: 0000000e7cae2006 CR4: 0000000002770ce0 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 0000000000000000 DR6: 00000000fffeO7FO DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> kgd2kfd_interrupt+@x6b/0x1f@ [amdgpu] ? amdgpu_fence_process+0xa4/0x150 [amdgpu] kfd kfd: amdgpu: Node: 0, interrupt_bitmap: 3 YcpxFl Rant tErace amdgpu_irq_dispatch+0x165/0x210 [amdgpu] amdgpu_ih_process+0x80/0x100 [amdgpu] amdgpu: Virtual CRAT table created for GPU amdgpu_irq_handler+0x1f/@x60 [amdgpu] __handle_irq_event_percpu+0x3d/0x170 amdgpu: Topology: Add dGPU node [0x74a2:0x1002] handle_irq_event+0x5a/@xcO handle_edge_irq+0x93/0x240 kfd kfd: amdgpu: KFD node 1 partition @ size 49148M asm_call_irq_on_stack+0xf/@x20 </IRQ> common_interrupt+0xb3/0x130 asm_common_interrupt+0x1le/0x40 5.10.134-010.a1i5000.a18.x86_64 #1 Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:52:36 -04:00
Sunil Khatri	c5b3cc417b	drm/amdgpu: use hmm_pfns instead of array of pages we dont need to allocate local array of pages to hold the pages returned by the hmm, instead we could use the hmm_range structure itself to get to hmm_pfn and get the required pages directly. This avoids call to alloc/free quite a lot. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:22:31 -04:00
Dave Airlie	342f141ba9	Merge tag 'amd-drm-next-6.18-2025-09-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.18-2025-09-19: amdgpu: - Fence drv clean up fix - DPC fixes - Misc display fixes - Support the MMIO remap page as a ttm pool - JPEG parser updates - UserQ updates - VCN ctx handling fixes - Documentation updates - Misc cleanups - SMU 13.0.x updates - SI DPM updates - GC 11.x cleaner shader updates - DMCUB updates - DML fixes - Improve fallback handling for pixel encoding - VCN reset improvements - DCE6 DC updates - DSC fixes - Use devm for i2c buses - GPUVM locking updates - GPUVM documentation improvements - Drop non-DC DCE11 code - S0ix fixes - Backlight fix - SR-IOV fixes amdkfd: - SVM updates Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250919193354.2989255-1-alexander.deucher@amd.com	2025-09-22 08:45:51 +10:00
Alex Deucher	4bfa860993	drm/amdkfd: add proper handling for S0ix When in S0i3, the GFX state is retained, so all we need to do is stop the runlist so GFX can enter gfxoff. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: David Perry <david.perry@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-18 09:43:02 -04:00
James Zhu	a450d22532	drm/amdkfd: add function svm_migrate_successful_pages to get migration pages. dst MIGRATE_PFN_VALID bit and src MIGRATE_PFN_MIGRATE bit should always be set when migration success. cpage includes src MIGRATE_PFN_MIGRATE bit set and MIGRATE_PFN_VALID bit unset pages for both RAM and VRAM when memory is only allocated without being populated before migration, those ram pages should be counted as migrated pages and those vram pages should not be counted as migrated pages. Here migration pages refer to how many vram pages invloved. Current svm_migrate_unsuccessful_pages only covers the unsuccessful case that source is on RAM. So far, we only see two unsuccessful migration cases. Since we can clearly identify successful migration cases through dst MIGRATE_PFN_VALID bit and src MIGRATE_PFN_MIGRATE bit within this prange, also eventually successful migration pages will be used, so we can use function svm_migrate_successful_pages to replace function svm_migrate_unsuccessful_pages. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-15 17:03:44 -04:00
James Zhu	7ccaaf1319	Revert "drm/amdkfd: return migration pages from copy function" This reverts commit `bd6093e2f1`. migrate_vma_pages can fail if a CPU thread faults on the same page. However, the page table is locked and only one of the new pages will be inserted. The device driver will see that the MIGRATE_PFN_MIGRATE bit is cleared if it loses the race. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-15 17:02:50 -04:00
Thorsten Blum	6156c101e5	drm/amdkfd: Replace kzalloc + copy_from_user with memdup_user Replace kzalloc() followed by copy_from_user() with memdup_user() to improve and simplify kfd_ioctl_set_cu_mask(). Return early if an error occurs and remove the obsolete 'out' label. No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-15 16:51:48 -04:00
Dave Airlie	cf99b26d30	Merge tag 'amd-drm-next-6.18-2025-09-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.18-2025-09-09: amdgpu: - Add CRIU support for gem objects - SI UVD fix - SI DPM fixes - Misc code cleanups - RAS updates - GPUVM debugfs fixes - Cyan Skillfish updates - UserQ updates - OEM i2c fix - SMU 13.0.x updates - DPCD probe quirk fix - Make vbios build number available in sysfs - HDCP updates - Brightness curve fixes - eDP updates - Vblank fixes - DCN 3.5 PG fix - PBN calcution fix amdkfd: - Add CRIU support for gem objects - Flexible array fix - P2P topology fix - APU memlimit fixes - Misc code cleanups UAPI: - Add CRIU support for gem objects Proposed userspace: https://github.com/checkpoint-restore/criu/pull/2613 radeon: - Use dev_warn_once() in CS parsers Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250909161928.942785-1-alexander.deucher@amd.com	2025-09-12 13:37:41 +10:00
Qianfeng Rong	cbda64f3f5	drm/amdkfd: Fix error code sign for EINVAL in svm_ioctl() Use negative error code -EINVAL instead of positive EINVAL in the default case of svm_ioctl() to conform to Linux kernel error code conventions. Fixes: `42de677f79` ("drm/amdkfd: register svm range") Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:39 -04:00
Eric Huang	36cc7d1317	drm/amdkfd: fix p2p links bug in topology When creating p2p links, KFD needs to check XGMI link with two conditions, hive_id and is_sharing_enabled, but it is missing to check is_sharing_enabled, so add it to fix the error. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 16:06:43 -04:00
Dave Airlie	6dc1d3c191	Merge tag 'drm-misc-next-2025-09-04' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for v6.18: Cross-subsystem Changes: - Update a number of DT bindings for STM32MP25 Arm SoC Core Changes: gem: - Simplify locking for GPUVM panel-backlight-quirks: - Add additional quirks for EDID, DMI, brightness sched: - Fix race condition in trace code - Clean up sysfb: - Clean up Driver Changes: amdgpu: - Give kernel jobs a unique id for better tracing amdxdna: - Improve error reporting bridge: - Improve ref counting on bridge management - adv7511: Provide SPD and HDMI infoframes - it6505: Replace crypto_shash with sha() - synopsys: Add support for DW DPTX Controller plus DT bindings gud: - Replace simple-KMS pipe with regular atomic helpers imagination: - Improve power management - Add support for TH1520 GPU - Support Risc-V architectures ivpu: - Clean up nouveau: - Improve error reporting panthor: - Fail VM bind if BO has offset - Clean up rcar-du: - Make number of lanes configurable rockchip: - Add support for RK3588 DPTX output rocket: - Use kfree() and sizeof() correctly - Test DMA status - Clean up sitronix: - st7571-i2c: Add support for inverted displays and 2-bit grayscale - Clean up stm: - ltdc: Add support support for STM32MP257F-EV1 plus DT bindings tidss: - Convert to kernel's FIELD_ macros v3d: - Improve job management and locking Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://lore.kernel.org/r/20250904090932.GA193997@linux.fritz.box	2025-09-05 11:49:01 +10:00
David Francis	85705b18ae	drm/amdgpu: Allow kfd CRIU with no buffer objects The kfd CRIU checkpoint ioctl would return an error if trying to checkpoint a process with no kfd buffer objects. This is a normal case and should not be an error. Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: David Francis <David.Francis@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-02 15:53:56 -04:00
Pierre-Eric Pelloux-Prayer	256576ed68	drm/amdgpu: give each kernel job a unique id Userspace jobs have drm_file.client_id as a unique identifier as job's owners. For kernel jobs, we can allocate arbitrary values - the risk of overlap with userspace ids is small (given that it's a u64 value). In the unlikely case the overlap happens, it'll only impact trace events. Since this ID is traced in the gpu_scheduler trace events, this allows to determine the source of each job sent to the hardware. To make grepping easier, the IDs are defined as they will appear in the trace output. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Link: https://lore.kernel.org/r/20250604122827.2191-1-pierre-eric.pelloux-prayer@amd.com	2025-09-01 10:49:31 +05:30
Amber Lin	f3820e9d35	drm/amdkfd: Tie UNMAP_LATENCY to queue_preemption When KFD asks CP to preempt queues, other than preempt CP queues, CP also requests SDMA to preempt SDMA queues with UNMAP_LATENCY timeout. Currently queue_preemption_timeout_ms is 9000 ms by default but can be configured via module parameter. KFD_UNMAP_LATENCY_MS is hard coded as 4000 ms though. This patch ties KFD_UNMAP_LATENCY_MS to queue_preemption_timeout_ms so in a slow system such as emulator, both CP and SDMA slowness are taken into account. Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:49 -04:00
Eric Huang	93aa919ca0	drm/amdkfd: fix vram allocation failure for a special case When it only allocates vram without va, which is 0, and a SVM range allocated stays in this range, the vram allocation returns failure. It should be skipped for this case from SVM usage check. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:48 -04:00
Sunday Clement	7bbfa1b1fa	drm/amdkfd: Allow device error to be logged The addition of a WARN_ON() check in order to return early in the kq_initialize function retroactively causes the default case in the following switch statement to never be executed, preventing dev_err from logging device errors in the kernel. Both logs are now checked in the default case. Signed-off-by: Sunday Clement <Sunday.Clement@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:48 -04:00
Kent Russell	0ed704d058	drm/amdkfd: Handle lack of READ permissions in SVM mapping HMM assumes that pages have READ permissions by default. Inside svm_range_validate_and_map, we add READ permissions then add WRITE permissions if the VMA isn't read-only. This will conflict with regions that only have PROT_WRITE or have PROT_NONE. When that happens, svm_range_restore_work will continue to retry, silently, giving the impression of a hang if pr_debug isn't enabled to show the retries.. If pages don't have READ permissions, simply unmap them and continue. If they weren't mapped in the first place, this would be a no-op. Since x86 doesn't support write-only, and PROT_NONE doesn't allow reads or writes anyways, this will allow the svm range validation to continue without getting stuck in a loop forever on mappings we can't use with HMM. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-15 13:04:41 -04:00
Eric Huang	3a75edf93a	drm/amdkfd: set uuid for each partition in topology Currently each kfd compute partition/node is sharing the same uuid of AID, which doen't meet the CUDA spec for visible device, so corresponding XCD id for each partition in smu has been assigned to xcp, and exposed to kfd topology. v2: add NULL check (Lijo) Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-15 13:04:02 -04:00
Geoffrey McRae	57af162bfc	drm/amdkfd: return -ENOTTY for unsupported IOCTLs Some kfd ioctls may not be available depending on the kernel version the user is running, as such we need to report -ENOTTY so userland can determine the cause of the ioctl failure. Signed-off-by: Geoffrey McRae <geoffrey.mcrae@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-12 14:14:54 -04:00
Amber Lin	2e58401a24	drm/amdkfd: Destroy KFD debugfs after destroy KFD wq Since KFD proc content was moved to kernel debugfs, we can't destroy KFD debugfs before kfd_process_destroy_wq. Move kfd_process_destroy_wq prior to kfd_debugfs_fini to fix a kernel NULL pointer problem. It happens when /sys/kernel/debug/kfd was already destroyed in kfd_debugfs_fini but kfd_process_destroy_wq calls kfd_debugfs_remove_process. This line debugfs_remove_recursive(entry->proc_dentry); tries to remove /sys/kernel/debug/kfd/proc/<pid> while /sys/kernel/debug/kfd is already gone. It hangs the kernel by kernel NULL pointer. Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Eric Huang <jinhuieric.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `0333052d90`) Cc: stable@vger.kernel.org	2025-08-06 16:52:08 -04:00
James Zhu	bd6093e2f1	drm/amdkfd: return migration pages from copy function dst MIGRATE_PFN_VALID bit and src MIGRATE_PFN_MIGRATE bit should always be set when migration success. cpage includes src MIGRATE_PFN_MIGRATE bit set and MIGRATE_PFN_VALID bit unset pages for both ram and vram when memory is only allocated without being populated before migration, those ram pages should be counted as migrate pages and those vram pages should not be counted as migrate pages. Here migration pages refer to how many vram pages involved. -v2 use dst to check MIGRATE_PFN_VALID bit (suggested-by Philip) -v3 add warning when vram pages is less than migration pages return migration pages directly from copy function -v4 correct comments and copy function return mpage (suggested-by Felix) Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:19:12 -04:00
James Zhu	5c15a05b52	drm/amdkfd: remove unused code upages is assigned under cpages = 0, so it isn't really used in this function. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Philip.Yang<Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:19:01 -04:00
Amber Lin	0333052d90	drm/amdkfd: Destroy KFD debugfs after destroy KFD wq Since KFD proc content was moved to kernel debugfs, we can't destroy KFD debugfs before kfd_process_destroy_wq. Move kfd_process_destroy_wq prior to kfd_debugfs_fini to fix a kernel NULL pointer problem. It happens when /sys/kernel/debug/kfd was already destroyed in kfd_debugfs_fini but kfd_process_destroy_wq calls kfd_debugfs_remove_process. This line debugfs_remove_recursive(entry->proc_dentry); tries to remove /sys/kernel/debug/kfd/proc/<pid> while /sys/kernel/debug/kfd is already gone. It hangs the kernel by kernel NULL pointer. Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Eric Huang <jinhuieric.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:18:02 -04:00
David Yat Sin	f6c0f3d244	drm/amdkfd: Fix checkpoint-restore on multi-xcc GPUs with multi-xcc have multiple MQDs per queue. This patch saves and restores all the MQDs within the partition. Signed-off-by: David Yat Sin <David.YatSin@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `a578f2a58c`) Cc: stable@vger.kernel.org	2025-08-04 15:38:49 -04:00
David Yat Sin	a578f2a58c	drm/amdkfd: Fix checkpoint-restore on multi-xcc GPUs with multi-xcc have multiple MQDs per queue. This patch saves and restores all the MQDs within the partition. Signed-off-by: David Yat Sin <David.YatSin@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:27:21 -04:00
Christian König	6716a823d1	drm/amdgpu: rework how PTE flags are generated v3 Previously we tried to keep the HW specific PTE flags in each mapping, but for CRIU that isn't sufficient any more since the original value is needed for the checkpoint procedure. So rework the whole handling, nuke the early mapping function, keep the UAPI flags in each mapping instead of the HW flags and translate them to the HW flags while filling in the PTEs. Only tested on Navi 23 for now, so probably needs quite a bit of more work. v2: fix KFD and SVN handling v3: one more SVN fix pointed out by Felix v4: squash in gfx12 fix from David Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:26:38 -04:00
Han Gao	fa301127ba	drm/amdkfd: enable kfd on LoongArch systems KFD has been confirmed that can run on LoongArch systems. It's necessary to support CONFIG_HSA_AMD on LoongArch. Signed-off-by: Han Gao <rabenda.cn@gmail.com> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-07-15 14:07:50 -04:00
Alex Deucher	0c3c2e334c	drm/amdgpu/sdma: allow caller to handle kernel rings in engine reset Add a parameter to amdgpu_sdma_reset_engine() to let the caller handle the kernel rings. This allows the kernel rings to back up their unprocessed state if the reset comes in via the drm scheduler rather than KFD. Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-07-07 13:48:25 -04:00
Lijo Lazar	91134e8008	drm/amdkfd: Avoid queue reset if disabled If ring reset is disabled, skip resetting queues. Instead, fall back to device based reset. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-07-07 13:48:12 -04:00
Dave Airlie	7e2818386a	Merge tag 'amd-drm-next-6.17-2025-07-01' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.17-2025-07-01: amdgpu: - FAMS2 fixes - OLED fixes - Misc cleanups - AUX fixes - DMCUB updates - SR-IOV hibernation support - RAS updates - DP tunneling fixes - DML2 fixes - Backlight improvements - Suspend improvements - Use scaling for non-native modes on eDP - SDMA 4.4.x fixes - PCIe DPM fixes - SDMA 5.x fixes - Cleaner shader updates for GC 9.x - Remove fence slab - ISP genpd support - Parition handling rework - SDMA FW checks for userq support - Add missing firmware declaration - Fix leak in amdgpu_ctx_mgr_entity_fini() - Freesync fix - Ring reset refactoring - Legacy dpm verbosity changes amdkfd: - GWS fix - mtype fix for ext coherent system memory - MMU notifier fix - gfx7/8 fix radeon: - CS validation support for additional GL extensions - Bump driver version for new CS validation checks From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250701194707.32905-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>	2025-07-04 10:06:29 +10:00

1 2 3 4 5 ...

1861 Commits