A user reported a bug on CAPE VERDE system where uvd_v3_1
IP component failed to initialize as there is an issue with
BO move code from one memory to other.
In function amdgpu_mem_visible() called by amdgpu_bo_move(),
when there are no blocks to compare or if we have a single
block then break the loop.
Fixes: 312b4dc11d ("drm/amdgpu: Fix VRAM BO swap issue")
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
If mes is not dequeued during fini, mes will be in an uncleaned state
during reload, then mes couldn't receive some commands which leads to
reload failure.
[How]
Perform MES dequeue via MMIO after all the unmap jobs are done by mes
and before kiq fini.
v2: Move the dequeue operation inside kiq_hw_fini.
Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Reviewed-by: Jack Xiao <Jack.Xiao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
- refactor mode2 on v11.0.7 to align with aldebaran
- comment out using mode2 reset as default for now, will introduce
another controller to replace previous reset_level_mask
v2: squash in unused variable removal (Alex)
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit dac6b80818.
This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.
Fixes: dac6b80818 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit 5bd8d53f6f.
This commit breaks the reset logic for aldebaran, revert it for now.
Will move the mask inside the reset handler.
Fixes: 5bd8d53f6f ("drm/amdgpu: add debugfs amdgpu_reset_level")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
For asic with VF MMIO access protection avoid using CPU for VM table updates.
CPU pagetable updates have issues with HDP flush as VF MMIO access protection
blocks write to mmBIF_BX_DEV0_EPF0_VF0_HDP_MEM_COHERENCY_FLUSH_CNTL register
during sriov runtime.
v3: introduce virtualization capability flag AMDGPU_VF_MMIO_ACCESS_PROTECT
which indicates that VF MMIO write access is not allowed in sriov runtime
Signed-off-by: Danijel Slivka <danijel.slivka@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
A user reported a bug on CAPE VERDE system where uvd_v3_1
IP component failed to initialize as there is an issue with
BO move code from one memory to other.
In function amdgpu_mem_visible() called by amdgpu_bo_move(),
when there are no blocks to compare or if we have a single
block then break the loop.
Fixes: 312b4dc11d ("drm/amdgpu: Fix VRAM BO swap issue")
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
If mes is not dequeued during fini, mes will be in an uncleaned state
during reload, then mes couldn't receive some commands which leads to
reload failure.
[How]
Perform MES dequeue via MMIO after all the unmap jobs are done by mes
and before kiq fini.
v2: Move the dequeue operation inside kiq_hw_fini.
Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Reviewed-by: Jack Xiao <Jack.Xiao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
kmap() is being deprecated in favor of kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since its use in amdgpu/amdgpu_ttm.c is safe, it should be preferred.
Therefore, replace kmap() with kmap_local_page() in amdgpu/amdgpu_ttm.c.
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
RAS error address translation algorithm is common
across dGPU and A + A platform as along as the SOC
integrates the same generation of UMC IP.
UMC RAS is managed by x86 MCA on A + A platform,
umc_ras in GPU driver is not initialized at all on
A + A platform. In such case, any umc_ras callback
implemented for dGPU config shouldn't be invoked
from A + A specific callback.
The change moves convert_error_address out of dGPU
umc_ras structure and makes it share between A + A
and dGPU config.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
- refactor mode2 on v11.0.7 to align with aldebaran
- comment out using mode2 reset as default for now, will introduce
another controller to replace previous reset_level_mask
v2: squash in unused variable removal (Alex)
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit dac6b80818.
This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.
Fixes: dac6b80818 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit 5bd8d53f6f.
This commit breaks the reset logic for aldebaran, revert it for now.
Will move the mask inside the reset handler.
Fixes: 5bd8d53f6f ("drm/amdgpu: add debugfs amdgpu_reset_level")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
For asic with VF MMIO access protection avoid using CPU for VM table updates.
CPU pagetable updates have issues with HDP flush as VF MMIO access protection
blocks write to mmBIF_BX_DEV0_EPF0_VF0_HDP_MEM_COHERENCY_FLUSH_CNTL register
during sriov runtime.
v3: introduce virtualization capability flag AMDGPU_VF_MMIO_ACCESS_PROTECT
which indicates that VF MMIO write access is not allowed in sriov runtime
Signed-off-by: Danijel Slivka <danijel.slivka@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
After commit 571c053658 ("drm/amdgpu: switch sdma buffer function
tear down to a helper"), the variable 'ring' is not used anymore, it
can be removed.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Pull more drm updates from Dave Airlie:
"Round of fixes for the merge window stuff, bunch of amdgpu and i915
changes, this should have the gcc11 warning fix, amongst other
changes.
amdgpu:
- DC mutex fix
- DC SubVP fixes
- DCN 3.2.x fixes
- DCN 3.1.x fixes
- SDMA 6.x fixes
- Enable DPIA for 3.1.4
- VRR fixes
- VRAM BO swapping fix
- Revert dirty fb helper change
- SR-IOV suspend/resume fixes
- Work around GCC array bounds check fail warning
- UMC 8.10 fixes
- Misc fixes and cleanups
i915:
- Round to closest in g4x+ HDMI clock readout
- Update MOCS table for EHL
- Fix PSR_IMR/IIR field handling
- Fix watermark calculations for gen12+/DG2 modifiers
- Reject excessive dotclocks early
- Fix revocation of non-persistent contexts
- Handle migration for dpt
- Fix display problems after resume
- Allow control over the flags when migrating
- Consider DG2_RC_CCS_CC when migrating buffers"
* tag 'drm-next-2022-10-14' of git://anongit.freedesktop.org/drm/drm: (110 commits)
drm/amd/display: Add HUBP surface flip interrupt handler
drm/i915/display: consider DG2_RC_CCS_CC when migrating buffers
drm/i915: allow control over the flags when migrating
drm/amd/display: Simplify bool conversion
drm/amd/display: fix transfer function passed to build_coefficients()
drm/amd/display: add a license to cursor_reg_cache.h
drm/amd/display: make virtual_disable_link_output static
drm/amd/display: fix indentation in dc.c
drm/amd/display: make dcn32_split_stream_for_mpc_or_odm static
drm/amd/display: fix build error on arm64
drm/amd/display: 3.2.207
drm/amd/display: Clean some DCN32 macros
drm/amdgpu: Add poison mode query for umc v8_10_0
drm/amdgpu: Update umc v8_10_0 headers
drm/amdgpu: fix coding style issue for mca notifier
drm/amdgpu: define convert_error_address for umc v8.7
drm/amdgpu: define RAS convert_error_address API
drm/amdgpu: remove check for CE in RAS error address query
drm/i915: Fix display problems after resume
drm/amd/display: fix array-bounds error in dc_stream_remove_writeback() [take 2]
...
Update all SDMA versions that support SR-IOV to properly
tear down the ttm buffer functions on suspend.
Tested-by: Bokun Zhang <Bokun.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Switch all of the SDMA implementations to use the helper to
tear down the ttm buffer manager.
Tested-by: Bokun Zhang <Bokun.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
- Under SRIOV, SDMA engine is shared between VFs. Therefore,
we will not stop SDMA during hw_fini. This is not an issue
with normal dirver loading and unloading.
- However, when we put the SDMA engine to suspend state and resume
it, the issue starts to show up. Something could attempt to use
that SDMA engine to clear or move memory before the engine is
initialized since the DRM entity is still there.
- Therefore, we will call sdma_v5_2_enable(false) during hw_fini,
and if we are under SRIOV, we will call sdma_v5_2_enable(true)
afterwards to allow other VFs to use SDMA. This way, the DRM
entity of SDMA engine is emptied and it will follow the flow
of resume code path.
Tested-by: Bokun Zhang <Bokun.Zhang@amd.com>
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Pull driver core updates from Greg KH:
"Here is the big set of driver core and debug printk changes for
6.1-rc1. Included in here is:
- dynamic debug updates for the core and the drm subsystem. The drm
changes have all been acked by the relevant maintainers
- kernfs fixes for syzbot reported problems
- kernfs refactors and updates for cgroup requirements
- magic number cleanups and removals from the kernel tree (they were
not being used and they really did not actually do anything)
- other tiny cleanups
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (74 commits)
docs: filesystems: sysfs: Make text and code for ->show() consistent
Documentation: NBD_REQUEST_MAGIC isn't a magic number
a.out: restore CMAGIC
device property: Add const qualifier to device_get_match_data() parameter
drm_print: add _ddebug descriptor to drm_*dbg prototypes
drm_print: prefer bare printk KERN_DEBUG on generic fn
drm_print: optimize drm_debug_enabled for jump-label
drm-print: add drm_dbg_driver to improve namespace symmetry
drm-print.h: include dyndbg header
drm_print: wrap drm_*_dbg in dyndbg descriptor factory macro
drm_print: interpose drm_*dbg with forwarding macros
drm: POC drm on dyndbg - use in core, 2 helpers, 3 drivers.
drm_print: condense enum drm_debug_category
debugfs: use DEFINE_SHOW_ATTRIBUTE to define debugfs_regset32_fops
driver core: use IS_ERR_OR_NULL() helper in device_create_groups_vargs()
Documentation: ENI155_MAGIC isn't a magic number
Documentation: NBD_REPLY_MAGIC isn't a magic number
nbd: remove define-only NBD_MAGIC, previously magic number
Documentation: FW_HEADER_MAGIC isn't a magic number
Documentation: EEPROM_MAGIC_VALUE isn't a magic number
...
amdkfd_total_mem_size is the size of total GPUs vram plus system memory
to estimate page tables memory usage and leave enough VRAM room for page
tables allocation.
Calculate amdkfd_total_mem_size in amdgpu_amdkfd_device_probe is
incorrect because adev->gmc.real_vram_size is still 0 called from
amdgpu_device_ip_early_init. Move the calculation
to amdgpu_amdkfd_device_init to get the correct VRAM size.
Do reverse calculation in amdgpu_amdkfd_device_fini_sw to support
hot-unplugging GPUs.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Under VRAM usage pression, map to GPU may fail to create pt bo and
vmbo->shadow_list is not initialized, then ttm_bo_release calling
amdgpu_bo_vm_destroy to access vmbo->shadow_list generates below
dmesg and NULL pointer access backtrace:
Set vmbo destroy callback to amdgpu_bo_vm_destroy only after creating pt
bo successfully, otherwise use default callback amdgpu_bo_destroy.
amdgpu: amdgpu_vm_bo_update failed
amdgpu: update_gpuvm_pte() failed
amdgpu: Failed to map bo to gpuvm
amdgpu 0000:43:00.0: amdgpu: Failed to map peer:0000:43:00.0 mem_domain:2
BUG: kernel NULL pointer dereference, address:
RIP: 0010:amdgpu_bo_vm_destroy+0x4d/0x80 [amdgpu]
Call Trace:
<TASK>
ttm_bo_release+0x207/0x320 [amdttm]
amdttm_bo_init_reserved+0x1d6/0x210 [amdttm]
amdgpu_bo_create+0x1ba/0x520 [amdgpu]
amdgpu_bo_create_vm+0x3a/0x80 [amdgpu]
amdgpu_vm_pt_create+0xde/0x270 [amdgpu]
amdgpu_vm_ptes_update+0x63b/0x710 [amdgpu]
amdgpu_vm_update_range+0x2e7/0x6e0 [amdgpu]
amdgpu_vm_bo_update+0x2bd/0x600 [amdgpu]
update_gpuvm_pte+0x160/0x420 [amdgpu]
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x313/0x1130 [amdgpu]
kfd_ioctl_map_memory_to_gpu+0x115/0x390 [amdgpu]
kfd_ioctl+0x24a/0x5b0 [amdgpu]
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
DRM buddy manager allocates the contiguous memory requests in
a single block or multiple blocks. So for the ttm move operation
(incase of low vram memory) we should consider all the blocks to
compute the total memory size which compared with the struct
ttm_resource num_pages in order to verify that the blocks are
contiguous for the eviction process.
v2: Added a Fixes tag
v3: Rewrite the code to save a bit of calculations and
variables (Christian)
Fixes: c9cad937c0 ("drm/amdgpu: add drm buddy support to amdgpu")
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>