Commit Graph

23 Commits

Author SHA1 Message Date
Yang Li
0de5472a01 drm/amdkfd: fix resource_size.cocci warnings
Use resource_size function on resource object
instead of explicit computation.

Clean up coccicheck warning:
./drivers/gpu/drm/amd/amdkfd/kfd_migrate.c:905:10-13: ERROR: Missing
resource_size with res

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Amos Kong <kongjianjun@gmail.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-09-29 17:30:00 -04:00
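As a rough illustration of the cleanup in the commit above, here is a minimal userspace sketch of the computation that resource_size stands in for. `struct demo_resource` and `demo_resource_size` are hypothetical stand-ins, not the kernel definitions. The sketch assumes only that the range bounds are inclusive.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for a kernel resource range. */
struct demo_resource {
    uint64_t start; /* first address in the range */
    uint64_t end;   /* last address in the range, inclusive */
};

/* The bounds are inclusive, so the size is end - start + 1. */
static inline uint64_t demo_resource_size(const struct demo_resource *res)
{
    return res->end - res->start + 1;
}
```

Using a helper instead of open-coding the subtraction avoids the off-by-one that arises when the `+ 1` is forgotten.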
Philip Yang
22f4f4faf3 drm/amdkfd: fix svm_migrate_fini warning
The device manager releases device-specific resources when a driver
disconnects from a device, so the devm_memunmap_pages and
devm_release_mem_region calls in svm_migrate_fini are redundant.

They cause the warning trace below after patch "drm/amdgpu: Split
amdgpu_device_fini into early and late", so remove function
svm_migrate_fini.

BUG: https://gitlab.freedesktop.org/drm/amd/-/issues/1718

WARNING: CPU: 1 PID: 3646 at drivers/base/devres.c:795
devm_release_action+0x51/0x60
Call Trace:
    ? memunmap_pages+0x360/0x360
    svm_migrate_fini+0x2d/0x60 [amdgpu]
    kgd2kfd_device_exit+0x23/0xa0 [amdgpu]
    amdgpu_amdkfd_device_fini_sw+0x1d/0x30 [amdgpu]
    amdgpu_device_fini_sw+0x45/0x290 [amdgpu]
    amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
    drm_dev_release+0x20/0x40 [drm]
    release_nodes+0x196/0x1e0
    device_release_driver_internal+0x104/0x1d0
    driver_detach+0x47/0x90
    bus_remove_driver+0x7a/0xd0
    pci_unregister_driver+0x3d/0x90
    amdgpu_exit+0x11/0x20 [amdgpu]

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-09-23 16:34:57 -04:00
Philip Yang
586d71a427 drm/amdkfd: handle svm migrate init error
If svm migration init failed to create pgmap for device memory, set
pgmap type to 0 to disable device SVM support capability.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-09-23 16:34:49 -04:00
Alex Sierra
7981ec6549 drm/amdkfd: Maintain svm_bo reference in page->zone_device_data
Each zone-device page holds a reference to the SVM BO that manages its
backing storage. This is necessary to correctly hold on to the BO in
case zone_device pages are shared with a child-process.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
3bf8282c6b drm/amdkfd: add invalid pages debug at vram migration
This is for debug purposes only.
It conditionally generates partial migrations to test mixed
CPU/GPU memory domain pages in a prange easily.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
6ffecc946f drm/amdkfd: skip migration for pages already in VRAM
Migration is skipped for pages that are already in the VRAM
domain. These could be the result of previous partial
migrations to SYS RAM, and prefetch back to VRAM.
For example, coherent pages in VRAM that were not written/invalidated
after a copy-on-write.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
1ade5f84cc drm/amdkfd: skip invalid pages during migrations
Invalid pages can be the result of pages that have been migrated
already due to the copy-on-write procedure or pages that were never
migrated to VRAM in the first place. This is not an issue anymore,
as pranges now support mixed memory domains (CPU/GPU).

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
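The skipping behaviour described in the commit above can be pictured with a small userspace sketch. Everything here is hypothetical: `DEMO_PFN_MIGRATE` is an invented flag meaning "this entry may be migrated", and pages are modelled as plain 64-bit values. It is not the driver's code, only an illustration of skipping invalid entries while still processing the rest of the range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag marking an entry as eligible for migration. */
#define DEMO_PFN_MIGRATE (1u << 0)

/*
 * Walk the source entries and copy only those flagged for migration.
 * Returns the number of pages actually copied; skipped entries leave
 * the corresponding destination slot untouched.
 */
static size_t demo_migrate_valid_pages(const uint32_t *src_flags,
                                       const uint64_t *src_data,
                                       uint64_t *dst_data, size_t npages)
{
    size_t copied = 0;

    for (size_t i = 0; i < npages; i++) {
        if (!(src_flags[i] & DEMO_PFN_MIGRATE))
            continue; /* invalid entry: skip it, keep going */
        dst_data[i] = src_data[i];
        copied++;
    }
    return copied;
}
```

The return value lets a caller distinguish a full migration from a partial one, which is the situation mixed-domain ranges have to tolerate.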
Alex Sierra
a010d98a78 drm/amdkfd: set owner ref to svm range prefault
svm_range_prefault is called right before migrations to VRAM,
to make sure pages are resident in system memory before the migration.
With partial migrations, this reference is used by hmm range get pages
to avoid migrating pages that are already in the same VRAM domain.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
3a61dae854 drm/amdkfd: device pgmap owner at the svm migrate init
GPUs in the same XGMI hive have direct access to all
members' VRAM. When mapping memory to a GPU, we don't need
hmm_range_fault to fault device-private pages in the same
hive back to the host. Identifying the page owner as the hive,
rather than the individual GPU, accomplishes this.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
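A minimal sketch of the ownership idea in the commit above, assuming a simplified model in which a GPU either belongs to a hive or stands alone. The types `demo_hive` and `demo_gpu` and both helper functions are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* Hypothetical model: a hive groups GPUs that can reach each other's VRAM. */
struct demo_hive { int id; };
struct demo_gpu  { struct demo_hive *hive; /* NULL when not in a hive */ };

/* The page owner is the hive when there is one, otherwise the GPU itself. */
static const void *demo_pgmap_owner(const struct demo_gpu *gpu)
{
    return gpu->hive ? (const void *)gpu->hive : (const void *)gpu;
}

/*
 * A device-private page only has to be faulted back to the host when
 * the mapping GPU does not share an owner with the page.
 */
static int demo_needs_fault_to_host(const struct demo_gpu *mapping_gpu,
                                    const void *page_owner)
{
    return demo_pgmap_owner(mapping_gpu) != page_owner;
}
```

With this comparison, two GPUs in the same hive see each other's pages as "already owned", while a GPU outside the hive does not.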
Philip Yang
d4ebc20070 drm/amdkfd: implement counters for vm fault and migration
Add helper function to get process device data structure from adev to
update counters.

Updates to the vm faults, page_in, and page_out counters will not be
executed in parallel, so use WRITE_ONCE to avoid any form of compiler
optimization.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-30 00:18:23 -04:00
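The counter update pattern in the commit above can be approximated in userspace as follows. `DEMO_WRITE_ONCE` is a hypothetical stand-in for the kernel macro, built on a volatile store and the GNU `__typeof__` extension (so it assumes GCC or Clang). The struct and function names are likewise invented for this sketch.

```c
#include <stdint.h>

/* Hypothetical stand-in: force a single store the compiler may not elide. */
#define DEMO_WRITE_ONCE(var, val) \
    (*(volatile __typeof__(var) *)&(var) = (val))

/* Hypothetical per-process, per-device counters. */
struct demo_pdd_counters {
    uint64_t faults;
    uint64_t page_in;
    uint64_t page_out;
};

static void demo_count_fault(struct demo_pdd_counters *c)
{
    DEMO_WRITE_ONCE(c->faults, c->faults + 1);
}

static void demo_count_page_in(struct demo_pdd_counters *c, uint64_t npages)
{
    DEMO_WRITE_ONCE(c->page_in, c->page_in + npages);
}

static void demo_count_page_out(struct demo_pdd_counters *c, uint64_t npages)
{
    DEMO_WRITE_ONCE(c->page_out, c->page_out + npages);
}
```

This is not an atomic read-modify-write. It relies on the commit's stated premise that updates to a given counter never run in parallel.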
Christian König
2fdcb55dfc drm/amdkfd: use resource cursor in svm_migrate_copy_to_vram v2
Access to the mm_node is now forbidden. So instead of hand-wiring that,
use the cursor functionality.

v2: fix handling as pointed out by Philip.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by and Tested-by: Philip Yang <philip.yang@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210602100914.46246-5-christian.koenig@amd.com
2021-06-04 15:16:46 +02:00
Philip Yang
04fe3fd10e drm/amdkfd: handle errors returned by svm_migrate_copy_to_vram/ram
If the migration copy fails because the process is killed, or because
VRAM or system memory runs out, pass the error code back to the caller
so it can handle the error gracefully.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:08:32 -04:00
Felix Kuehling
2e4ec25162 drm/amdkfd: Make svm_migrate_put_sys_page static
This function is only used in this source file.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:06:44 -04:00
Philip Yang
c0f76fc8ad drm/amdkfd: fix double free device pgmap resource
Use devm_memunmap_pages instead of memunmap_pages to release pgmap
and remove pgmap from device action, to avoid double free pgmap when
unloading driver module.

Release device memory region if failed to create device memory pages
structure.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:36:04 -04:00
Colin Ian King
a40eb089b4 drm/amdkfd: remove redundant initialization to variable r
The variable r is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-23 17:17:50 -04:00
Felix Kuehling
1a3b2b5dca drm/amdkfd: multiple gpu migrate vram to vram
If a range is prefetched to a GPU while its actual location is another
GPU, or a GPU retry fault restores pages by migrating a range whose
actual location is a GPU, then migrate from one GPU to another GPU.

Use system memory as a bridge because the sdma engine may not be able to
access another GPU's vram: use the sdma of the source GPU to migrate to
system memory, then use the sdma of the destination GPU to migrate from
system memory to that GPU.

Print out gpuid or gpuidx in debug messages.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:50:22 -04:00
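The two-hop copy described in the commit above can be sketched in userspace like this. VRAM is modelled as a byte array and `memcpy` stands in for an sdma copy. `demo_gpu_mem`, `demo_sdma_copy` and `demo_migrate_vram_to_vram` are hypothetical names, and the sketch assumes the caller provides a system-memory bounce buffer of at least `len` bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a GPU: just a VRAM byte array. */
struct demo_gpu_mem {
    unsigned char *vram;
    size_t size;
};

/* Stand-in for an sdma copy issued by the GPU that owns the VRAM side. */
static void demo_sdma_copy(unsigned char *dst, const unsigned char *src,
                           size_t len)
{
    memcpy(dst, src, len);
}

/* Migrate len bytes from src VRAM to dst VRAM via a system-memory bridge. */
static int demo_migrate_vram_to_vram(const struct demo_gpu_mem *src,
                                     struct demo_gpu_mem *dst,
                                     unsigned char *sysmem, size_t len)
{
    if (len > src->size || len > dst->size)
        return -1; /* range does not fit on one side */

    demo_sdma_copy(sysmem, src->vram, len); /* source GPU: vram -> ram */
    demo_sdma_copy(dst->vram, sysmem, len); /* dest GPU:   ram -> vram */
    return 0;
}
```

Each hop touches only one GPU's VRAM, which is why neither engine needs access to the other device's memory.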
Felix Kuehling
cda0f85bfa drm/amdkfd: refine migration policy with xnack on
With xnack on, the GPU vm fault handler decides the best restore location,
then migrates the range to that location and updates the GPU mapping
to recover from the GPU vm fault.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:50:03 -04:00
Felix Kuehling
90d7d3eda5 drm/amdkfd: invalidate tables on page retry fault
GPU page tables are invalidated by unmapping prange directly at
the mmu notifier, when page fault retry is enabled through
amdgpu_noretry global parameter. The restore page table is
performed at the page fault handler.

If xnack is on, we update GPU mappings after migration to avoid
unnecessary GPUVM faults.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:50 -04:00
Felix Kuehling
48ff079b28 drm/amdkfd: HMM migrate vram to ram
If a CPU page fault happens, the HMM pgmap_ops callback migrate_to_ram
starts migrating memory from vram to ram in these steps:

1. migrate_vma_pages get vram pages, and notify HMM to invalidate the
pages, HMM interval notifier callback evict process queues
2. Allocate system memory pages
3. Use svm copy memory to migrate data from vram to ram
4. migrate_vma_pages copy pages structure from vram pages to ram pages
5. Return VM_FAULT_SIGBUS if migration failed, to notify application
6. migrate_vma_finalize put vram pages, page_free callback free vram
pages and vram nodes
7. Restore work wait for migration is finished, then update GPU page
table mapping to system memory, and resume process queues

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:43 -04:00
Felix Kuehling
0b0e518d61 drm/amdkfd: HMM migrate ram to vram
If an svm range is registered with the same address and size but its
preferred_location changed from CPU to GPU or from GPU to CPU, trigger
migration of the svm range from ram to vram or from vram to ram.

If svm range prefetch location is GPU with flags
KFD_IOCTL_SVM_FLAG_HOST_ACCESS, validate the svm range on ram first,
then migrate it from ram to vram.

After migration to vram is done, CPU access will cause a CPU page fault;
the page fault handler migrates the range back to ram and resumes CPU
access.

Migration steps:

1. migrate_vma_pages get svm range ram pages, notify the
interval is invalidated and unmap from CPU page table, HMM interval
notifier callback evict process queues
2. Allocate new pages in vram using TTM
3. Use svm copy memory to sdma copy data from ram to vram
4. migrate_vma_pages copy ram pages structure to vram pages structure
5. migrate_vma_finalize put ram pages to free ram pages and memory
6. Restore work wait for migration is finished, then update GPUs page
table mapping to new vram pages, resume process queues

If migrate_vma_setup failed to collect all ram pages of range, retry 3
times until success to start migration.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:30 -04:00
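The retry rule at the end of the commit above might look roughly like the sketch below. It reads "retry 3 times" as "at most three attempts in total", which is an assumption about the wording. `demo_collect_pages` is a hypothetical stub that imitates a collection step which only gathers every page from a chosen attempt onward.

```c
#include <stddef.h>

#define DEMO_MAX_ATTEMPTS 3

/* Hypothetical test double for the page-collection step. */
struct demo_setup_ctx {
    int calls;           /* how many times collection was attempted */
    int succeed_on_call; /* attempt number from which all pages are collected */
};

/* Returns the number of pages collected on this attempt. */
static size_t demo_collect_pages(struct demo_setup_ctx *ctx, size_t npages)
{
    ctx->calls++;
    return ctx->calls >= ctx->succeed_on_call ? npages : npages / 2;
}

/* Try up to DEMO_MAX_ATTEMPTS times; 0 on full collection, -1 otherwise. */
static int demo_collect_with_retry(struct demo_setup_ctx *ctx, size_t npages)
{
    for (int attempt = 0; attempt < DEMO_MAX_ATTEMPTS; attempt++) {
        if (demo_collect_pages(ctx, npages) == npages)
            return 0; /* every page collected: migration can start */
    }
    return -1; /* still incomplete after the last attempt */
}
```

Bounding the loop keeps a range that can never be fully collected from stalling the caller indefinitely.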
Philip Yang
50ea50cf6f drm/amdkfd: copy memory through gart table
Use sdma linear copy to migrate data between ram and vram. The sdma
linear copy command uses kernel buffer function queue to access system
memory through gart table.

Use reserved gart table window 0 to map system page address, and vram
page address is direct mapping. Use the same kernel buffer function to
fill in gart table mapping, so this is serialized with memory copy by
sdma job submit. We only need to wait for the last memory copy sdma fence
for larger buffer migration.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:23 -04:00
Philip Yang
c46ebb6a6d drm/amdkfd: set memory limit to avoid OOM with HMM enabled
HMM migration allocates sizeof(struct page) of system memory for each VRAM
page, which amounts to 1GB of system memory reserved for 64GB of VRAM. To
avoid application OOM, increase the system memory used size based on the
VRAM size of all GPUs, so that application memory allocation fails if
system memory usage reaches the limit.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:00 -04:00
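The 1GB-per-64GB figure in the commit above can be checked with a short calculation. The sketch assumes 4 KiB pages and a 64-byte struct page, which is typical on x86_64 but not guaranteed. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * System memory consumed by page metadata for a given amount of VRAM:
 * one struct page per VRAM page.
 */
static inline uint64_t demo_struct_page_overhead(uint64_t vram_bytes,
                                                 uint64_t page_size,
                                                 uint64_t struct_page_size)
{
    return (vram_bytes / page_size) * struct_page_size;
}
```

For 64 GiB of VRAM this gives 2^36 / 2^12 = 2^24 pages, and 2^24 * 64 bytes = 2^30 bytes, i.e. 1 GiB, which matches the ratio stated in the commit message.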
Philip Yang
814ab9930c drm/amdkfd: register HMM device private zone
Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
allocate vram backing pages for page migration.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:47:54 -04:00