mirror of
https://github.com/torvalds/linux.git
synced 2026-04-18 23:03:57 -04:00
drm/amdgpu: refactor bad_page_work for corner case handling
When a poison is consumed on the guest before the guest receives the host's poison creation msg, a corner case may occur to have poison_handler complete processing earlier than it should to cause the guest to hang waiting for the req_bad_pages reply during a VF FLR, resulting in the VM becoming inaccessible in stress tests.
To fix this issue, this patch refactored the mailbox sequence by seperating the bad_page_work into two parts req_bad_pages_work and handle_bad_pages_work.
Old sequence:
1.Stop data exchange work
2.Guest sends MB_REQ_RAS_BAD_PAGES to host and keep polling for IDH_RAS_BAD_PAGES_READY
3.If the IDH_RAS_BAD_PAGES_READY arrives within timeout limit, re-init the data exchange region for updated bad page info
else timeout with error message
New sequence:
req_bad_pages_work:
1.Stop data exhange work
2.Guest sends MB_REQ_RAS_BAD_PAGES to host
Once Guest receives IDH_RAS_BAD_PAGES_READY event
handle_bad_pages_work:
3.re-init the data exchange region for updated bad page info
Signed-off-by: Chenglei Xie <Chenglei.Xie@amd.com>
Reviewed-by: Shravan Kumar Gande <Shravankumar.Gande@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This commit is contained in:
committed by
Alex Deucher
parent
fc4e990a32
commit
d2fa0ec6e0
@@ -267,7 +267,8 @@ struct amdgpu_virt {
|
||||
struct amdgpu_irq_src rcv_irq;
|
||||
|
||||
struct work_struct flr_work;
|
||||
struct work_struct bad_pages_work;
|
||||
struct work_struct req_bad_pages_work;
|
||||
struct work_struct handle_bad_pages_work;
|
||||
|
||||
struct amdgpu_mm_table mm_table;
|
||||
const struct amdgpu_virt_ops *ops;
|
||||
|
||||
Reference in New Issue
Block a user