drm/amdgpu: break driver init process when it's bad GPU(v5)

When retrieving bad gpu tag from eeprom, GPU init should
fail as the GPU needs to be retired for further check.

v2: Fix spelling typo, correct the condition to detect
    bad gpu tag and refine error message.

v3: Refine function argument name.

v4: Fix missing check of returning value of i2c
    initialization error case.

v5: Use dev_err to print PCI information in dmesg instead
    of DRM_ERROR.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This commit is contained in:
Guchun Chen
2020-07-23 15:42:19 +08:00
committed by Alex Deucher
parent 1d6a9d122d
commit b82e65a935
4 changed files with 36 additions and 7 deletions

View File

@@ -2055,13 +2055,19 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
* it should be called after amdgpu_device_ip_hw_init_phase2 since
* for some ASICs the RAS EEPROM code relies on SMU fully functioning
* for I2C communication which only true at this point.
* recovery_init may fail, but it can free all resources allocated by
* itself and its failure should not stop amdgpu init process.
*
* amdgpu_ras_recovery_init may fail, but the upper only cares the
* failure from bad gpu situation and stop amdgpu init process
* accordingly. For other failed cases, it will still release all
* the resource and print error message, rather than returning one
* negative value to upper level.
*
* Note: theoretically, this should be called before all vram allocations
* to protect retired page from abusing
*/
amdgpu_ras_recovery_init(adev);
r = amdgpu_ras_recovery_init(adev);
if (r)
goto init_failed;
if (adev->gmc.xgmi.num_physical_nodes > 1)
amdgpu_xgmi_add_device(adev);