If a system does not have swap and memory is under 100% usage,
amdgpu will fail to evict resources. Currently the suspend
carries on proceeding to reset the GPU:
```
[drm] evicting device resources failed
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
[drm] free PSP TMR buffer
[TTM] Failed allocating page table
[drm] evicting device resources failed
amdgpu 0000:03:00.0: amdgpu: MODE1 reset
amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
```
At this point if the suspend actually succeeded I think that amdgpu
would have recovered because the GPU would have power cut off and
restored. However the kernel fails to continue the suspend from the
memory pressure and amdgpu fails to run the "resume" from the aborted
suspend.
```
ACPI: PM: Preparing to enter system sleep state S3
SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
node 0: slabs: 22, objs: 1122, free: 0
ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)
[drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
[drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
amdgpu 0000:03:00.0: PM: failed to resume async: error -62
```
To avoid this series of unfortunate events, fail amdgpu's suspend
when the memory eviction fails. This will let the system gracefully
recover and the user can try suspend again when the memory pressure
is relieved.
Reported-by: post@davidak.de
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use mes.sched_version, mes.kiq_version for debugfs as
mes.ucode_fw_version does not contain correct versioning information.
Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Jack Xiao <Jack.Xiao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This patch to fix the gdm3 start failure with virual display:
/usr/libexec/gdm-x-session[1711]: (II) AMDGPU(0): Setting screen physical size to 270 x 203
/usr/libexec/gdm-x-session[1711]: (EE) AMDGPU(0): Failed to make import prime FD as pixmap: 22
/usr/libexec/gdm-x-session[1711]: (EE) AMDGPU(0): failed to set mode: Invalid argument
/usr/libexec/gdm-x-session[1711]: (WW) AMDGPU(0): Failed to set mode on CRTC 0
/usr/libexec/gdm-x-session[1711]: (EE) AMDGPU(0): Failed to enable any CRTC
gnome-shell[1840]: Running GNOME Shell (using mutter 42.2) as a X11 window and compositing manager
/usr/libexec/gdm-x-session[1711]: (EE) AMDGPU(0): failed to set mode: Invalid argument
vkms doesn't have modifiers support, set fb_modifiers_not_supported to bring the gdm back.
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tim Huang <Tim.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
rn_vbios_smu_set_dcn_low_power_state() produces a valid warning with
gcc-13:
drivers/gpu/drm/amd/display/dc/clk_mgr/dcn21/rn_clk_mgr_vbios_smu.c:237:6: error: conflicting types for 'rn_vbios_smu_set_dcn_low_power_state' due to enum/integer mismatch; have 'void(struct clk_mgr_internal *, enum dcn_pwr_state)'
drivers/gpu/drm/amd/display/dc/clk_mgr/dcn21/rn_clk_mgr_vbios_smu.h:36:6: note: previous declaration of 'rn_vbios_smu_set_dcn_low_power_state' with type 'void(struct clk_mgr_internal *, int)'
I.e. the type of the 2nd parameter of
rn_vbios_smu_set_dcn_low_power_state() in the declaration is int, while
the definition spells enum dcn_pwr_state. Synchronize them to the
latter (and add a forward enum declaration).
Cc: Martin Liska <mliska@suse.cz>
Cc: Harry Wentland <harry.wentland@amd.com>
Cc: Leo Li <sunpeng.li@amd.com>
Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
DC version 3.2.210 brings along the following:
- Investigate tool reported FCLK P-state deviations
- Fix null pointer issues found in emulation
- Add DSC delay factor workaround
- Round up DST_after_scaler to nearest int
- Use forced DSC bpp in DML
- Fix DCN32 DSC delay calculation
- Add a debug option HBR2CP2520 over TPS4
- Stop waiting for vblank during pipe programming
- Modify checks to enable TPS3 pattern when required
- Remove rate check from pixel rate divider update
- Check validation passed after applying pipe split changes
- Update DML formula
- Don't enable ODM + MPO
- Include virtual signal to set k1 and k2 values
- Reinit DPG when exiting dynamic ODM
Acked-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Aric Cyr <aric.cyr@amd.com>
Tested-by: Mark Broadworth <mark.broadworth@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
Fix for some of the tool reported modes for FCLK
P-state deviations and UCLK P-state deviations that
are coming from DSC terms and/or Scaling terms
causing MinActiveFCLKChangeLatencySupported
and MaxActiveDRAMClockChangeLatencySupported
incorrectly calculated in DML for these configurations.
Reviewed-by: Chaitanya Dhere <Chaitanya.Dhere@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Acked-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Nevenko Stupar <Nevenko.Stupar@amd.com>
Tested-by: Mark Broadworth <mark.broadworth@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
Certain 4K high refresh rate modes requiring DSC are exhibiting top
of screen underflow corruption. Increasing the DSC delay by a factor
of 6 percent stops the underflow for most use cases.
[How]
Multiply DSC delay requirement in DML by a factor.
Add debug option to make this DSC delay factor configurable.
Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com>
Acked-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: George Shen <george.shen@amd.com>
Tested-by: Mark Broadworth <mark.broadworth@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
DSC config is calculated separately from DML calculations.
DML should use these separately calculated DSC params. The issue is
that the calculated bpp is not properly propagated into DML.
[How]
Correctly used forced_bpp value in DML.
Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com>
Acked-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: George Shen <george.shen@amd.com>
Tested-by: Mark Broadworth <mark.broadworth@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
DCN32 DSC delay calculation had an unintentional integer division,
resulting in a mismatch against the DML spreadsheet.
[How]
Cast numerator to double before performing the division.
Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com>
Acked-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: George Shen <george.shen@amd.com>
Tested-by: Mark Broadworth <mark.broadworth@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[why and how]
This line was originally removed for a compliance issue, but then
reverted as it caused a fringe underflow case.
However, the addition of this line caused a underflow regression
when subVP is on, and it needs to be removed again. We plan to
fix subvp underflow and then re-add in this line. After that,
we will investigate what to do next for the compliance issue.
Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com>
Acked-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Martin Leung <Martin.Leung@amd.com>
Tested-by: Mark Broadworth <mark.broadworth@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[WHY?]
Validation can fail for configurations that were previously supported, by only
changing parameters such as the DET allocations, which is currently unexpected.
[HOW?]
Add a check that validation passes after applying pipe split related changes.
Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com>
Acked-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Dillon Varone <Dillon.Varone@amd.com>
Tested-by: Mark Broadworth <mark.broadworth@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
As of commit 09a5df6c44 ("drm/amd/display: Fix multi-display support
for idle opt workqueue"), vblank_lock is no longer being used. So, don't
init it in amdgpu_dm_init() and remove it from struct
amdgpu_display_manager.
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
gfxhub_v3_0_3 system aperture registers are removed from RLCG register access range.
[How]
Skip access gfxhub_v3_0_3 system aperture registers under SRIOV VF.
These registers will be programmed on host side.
Signed-off-by: Yifan Zha <Yifan.Zha@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
SDMA0_F32_CNTL is a PF_only regitser which will be blocked by L1.
RLCG will not program the register as well.
[How]
Skip to program SDMA0_F32_CNTL under SRIOV VF.
Signed-off-by: Yifan Zha <Yifan.Zha@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
GRBM_CNTL is a PF_only register on gfx_v11.
RLCG interface will return "out of range" under SRIOV VF.
[How]
Skip access GRBM_CNTL under gfx_v11 SRIOV VF.
Signed-off-by: Yifan Zha <Yifan.Zha@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The recent change brought a bug on SRIOV envrionment. It caused
unloading amdgpu failed on Guest VM. The reason is that the VF
FLR was requested while unloading amdgpu driver, but the VF FLR
of SRIOV sequence is wrong while removing PCI device.
For SRIOV, the guest driver should not trigger the whole XGMI hive
to do the reset. Host driver control how the device been reset.
Fixes: f5c7e77970 ("drm/amdgpu: Adjust removal control flow for smu v13_0_2")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Gavin Wan <Gavin.Wan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If a system does not have swap and memory is under 100% usage,
amdgpu will fail to evict resources. Currently the suspend
carries on proceeding to reset the GPU:
```
[drm] evicting device resources failed
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
[drm] free PSP TMR buffer
[TTM] Failed allocating page table
[drm] evicting device resources failed
amdgpu 0000:03:00.0: amdgpu: MODE1 reset
amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
```
At this point if the suspend actually succeeded I think that amdgpu
would have recovered because the GPU would have power cut off and
restored. However the kernel fails to continue the suspend from the
memory pressure and amdgpu fails to run the "resume" from the aborted
suspend.
```
ACPI: PM: Preparing to enter system sleep state S3
SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
node 0: slabs: 22, objs: 1122, free: 0
ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)
[drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
[drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
amdgpu 0000:03:00.0: PM: failed to resume async: error -62
```
To avoid this series of unfortunate events, fail amdgpu's suspend
when the memory eviction fails. This will let the system gracefully
recover and the user can try suspend again when the memory pressure
is relieved.
Reported-by: post@davidak.de
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Update EEPROM_I2C_MADDR_SMU_13_0_0 to EEPROM_I2C_MADDR_54H
2. Add EEPROM I2C address support for smu v13_0_0 and v13_0_10.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Add a function pointer structure ta_funcs to psp context
2. Make the interfaces generic to all TAs
3. Leverage exisitng TA context and remove unused functions
4. Fix return code bugs
v2: Add comments for ta funcs macros and correct typo
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Candice Li <candice.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Save TA unload psp response status
2. Add RAS TA loading status check for initializaiton
3. Drop RAS context teardown to allow RAS TA to be reloaded
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Candice Li <candice.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>