Users reported oopses on list corruptions when using i915 perf with a
number of concurrently running graphics applications. Root cause analysis
pointed at an issue in barrier processing code -- a race among perf open /
close replacing active barriers with perf requests on kernel context and
concurrent barrier preallocate / acquire operations performed during user
context first pin / last unpin.
When adding a request to a composite tracker, we try to reuse an existing
fence tracker, already allocated and registered with that composite. The
tracker we obtain may already track another fence, may be an idle barrier,
or an active barrier.
If the tracker we get occurs a non-idle barrier then we try to delete that
barrier from a list of barrier tasks it belongs to. However, while doing
that we don't respect return value from a function that performs the
barrier deletion. Should the deletion ever fail, we would end up reusing
the tracker still registered as a barrier task. Since the same structure
field is reused with both fence callback lists and barrier tasks list,
list corruptions would likely occur.
Barriers are now deleted from a barrier tasks list by temporarily removing
the list content, traversing that content with skip over the node to be
deleted, then populating the list back with the modified content. Should
that intentionally racy concurrent deletion attempts be not serialized,
one or more of those may fail because of the list being temporary empty.
Related code that ignores the results of barrier deletion was initially
introduced in v5.4 by commit d8af05ff38 ("drm/i915: Allow sharing the
idle-barrier from other kernel requests"). However, all users of the
barrier deletion routine were apparently serialized at that time, then the
issue didn't exhibit itself. Results of git bisect with help of a newly
developed igt@gem_barrier_race@remote-request IGT test indicate that list
corruptions might start to appear after commit 311770173f ("drm/i915/gt:
Schedule request retirement when timeline idles"), introduced in v5.5.
Respect results of barrier deletion attempts -- mark the barrier as idle
only if successfully deleted from the list. Then, before proceeding with
setting our fence as the one currently tracked, make sure that the tracker
we've got is not a non-idle barrier. If that check fails then don't use
that tracker but go back and try to acquire a new, usable one.
v3: use unlikely() to document what outcome we expect (Andi),
- fix bad grammar in commit description.
v2: no code changes,
- blame commit 311770173f ("drm/i915/gt: Schedule request retirement
when timeline idles"), v5.5, not commit d8af05ff38 ("drm/i915: Allow
sharing the idle-barrier from other kernel requests"), v5.4,
- reword commit description.
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6333
Fixes: 311770173f ("drm/i915/gt: Schedule request retirement when timeline idles")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: stable@vger.kernel.org # v5.5
Cc: Andi Shyti <andi.shyti@linux.intel.com>
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230302120820.48740-1-janusz.krzysztofik@linux.intel.com
The Driver-FLR flow may inadvertently exit early before the full
completion of the re-init of the internal HW state if we only poll
GU_DEBUG Bit31 (polling for it to toggle from 0 -> 1). Instead
we need a two-step completion wait-for-completion flow that also
involves GU_CNTL. See the patch and new code comments for detail.
This is new direction from HW architecture folks.
v2: - Add error message for the teardown timeout (Anshuman)
- Don't duplicate code in comments (Jani)
v3: - Add get/put runtime-pm for this function. Though
not functionally required during unload, its so the uncore
doesn't complain.
v4: - Remove the get/put runtime-pm - that was for a prior
version of this patch (not needed for drm-managed callback).
- Remove the fixes tag since this is only for MTL and MTL
still needs force probe (Daniele).
- Bit 31 of GU_CNTL should be DRIVERFLR instead of
DRIVERFLR_STATUS (Daniele).
Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Tested-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230224001758.544817-1-alan.previn.teres.alexis@intel.com
Pull drm fixes from Dave Airlie:
"fbdev:
- fix uninit var in error path
shmem:
- revert unGPLing an export
i915:
- Don't use stolen memory or BAR mappings for ring buffers with LLC
- Add inverted backlight quirk for HP 14-r206nv
- Fix GSI offset for MCR lookups
- GVT fixes (memleak, debugfs attributes, kconfig, typos)
amdgpu:
- SMU 13 fixes
- Enable TMZ for GC 10.3.6
- Misc display fixes
- Buddy allocator fixes
- GC 11 fixes
- S0ix fix
- INFO IOCTL queries for GC 11
- VCN harvest fixes for SR-IOV
- UMC 8.10 RAS fixes
- Don't restrict bpc to 8
- NBIO 7.5 fix
- Allow freesync on PCon for more devices
amdkfd:
- SDMA fix
- Illegal memory access fix"
* tag 'drm-next-2023-03-03-1' of git://anongit.freedesktop.org/drm/drm: (45 commits)
drm/amdgpu/vcn: fix compilation issue with legacy gcc
drm/amd/display: Extend Freesync over PCon support for more devices
Revert "drm/amd/display: Do not set DRR on pipe commit"
drm/amd/display: fix shift-out-of-bounds in CalculateVMAndRowBytes
drm/amd/display: Ext displays with dock can't recognized after resume
drm/amdgpu: fix ttm_bo calltrace warning in psp_hw_fini
drm/amdgpu: remove unused variable ring
drm/amd/display: fix dm irq error message in gpu recover
drm/amd: Fix initialization for nbio 7.5.1
drm/amd/display: Don't restrict bpc to 8 bpc
drm/amdgpu: Make umc_v8_10_convert_error_address static and remove unused variable
drm/radeon: Fix eDP for single-display iMac11,2
drm/shmem-helper: Revert accidental non-GPL export
drm: omapdrm: Do not use helper unininitialized in omap_fbdev_init()
drm/amd/pm: downgrade log level upon SMU IF version mismatch
drm/amdgpu: Add ecc info query interface for umc v8_10
drm/amdgpu: Add convert_error_address function for umc v8_10
drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err
drm/amdgpu: change default behavior of bad_page_threshold parameter
drm/amdgpu: exclude duplicate pages from UMC RAS UE count
...
This patch is used to fix following compilation issue with legacy gcc
error: ‘for’ loop initial declarations are only allowed in C99 mode
for (int i = 0; i < adev->vcn.num_vcn_inst; ++i) {
Signed-off-by: bobzhou <bob.zhou@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[why]
More branch devices are able to support Freesync
over PCon so include them in the list of supporting devices.
[how]
Add more compatible PCon devices in the whitelist
for Freesync over Pcon.
Reviewed-by: Harry Wentland <Harry.Wentland@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Sung Joon Kim <sungjoon.kim@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[WHY]
When PTEBufferSizeInRequests is zero, UBSAN reports the following
warning because dml_log2 returns an unexpected negative value:
shift exponent 4294966273 is too large for 32-bit type 'int'
[HOW]
In the case PTEBufferSizeInRequests is zero, skip the dml_log2() and
assign the result directly.
Reviewed-by: Jun Lei <Jun.Lei@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
building with gcc and W=1 reports
drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c:81:29: error: variable
‘ring’ set but not used [-Werror=unused-but-set-variable]
81 | struct amdgpu_ring *ring;
| ^~~~
ring is not used so remove it.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
Variable adev->crtc_irq.num_types was initialized as the value of
adev->mode_info.num_crtc at early_init stage, later at hw_init stage,
the num_crtc changed due to the display pipe harvest on some SKUs,
but the num_types was not updated accordingly, that cause below error
in gpu recover.
*ERROR* amdgpu_dm_set_crtc_irq_state: crtc is NULL at id :3
*ERROR* amdgpu_dm_set_crtc_irq_state: crtc is NULL at id :3
*ERROR* amdgpu_dm_set_crtc_irq_state: crtc is NULL at id :3
*ERROR* amdgpu_dm_set_pflip_irq_state: crtc is NULL at id :3
*ERROR* amdgpu_dm_set_pflip_irq_state: crtc is NULL at id :3
*ERROR* amdgpu_dm_set_pflip_irq_state: crtc is NULL at id :3
*ERROR* amdgpu_dm_set_pflip_irq_state: crtc is NULL at id :3
*ERROR* amdgpu_dm_set_vupdate_irq_state: crtc is NULL at id :3
*ERROR* amdgpu_dm_set_vupdate_irq_state: crtc is NULL at id :3
*ERROR* amdgpu_dm_set_vupdate_irq_state: crtc is NULL at id :3
[How]
Defer the initialization of num_types to eliminate the error logs.
Signed-off-by: tiancyin <tianci.yin@amd.com>
Reviewed-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This will let us pass the kms_hdr.bpc_switch IGT
test.
The reason the bpc restriction was required is
historical. At one point in time we were not falling
back to a lower bpc when we didn't have enough
bandwidth for the maximum bpc reported by a display.
This meant that we couldn't enable some high refresh
modes unless we limitted the bpc.
Starting with this patch the issue is fixed:
commit cbd14ae7ea ("drm/amd/display: Fix
incorrectly pruned modes with deep color")
This patch implemented a fallback mechanism if mode
validation failed at the max bpc. This means users
now automatically get all modes that can be supported
by at least 6 bpc. The driver will enable the mode
with the highest possible bpc that is supported by
the display.
v2:
- explain why this is no longer needed (Michel)
- refer to commit that fixed bpc fallback (Michel)
Signed-off-by: Harry Wentland <harry.wentland@amd.com>
Cc: Pekka Paalanen <ppaalanen@gmail.com>
Cc: Sebastian Wick <sebastian.wick@redhat.com>
Cc: Vitaly.Prosyak@amd.com
Cc: Joshua Ashton <joshua@froggi.es>
Cc: dri-devel@lists.freedesktop.org
Cc: amd-gfx@lists.freedesktop.org
Cc: Michel Dänzer <michel.daenzer@mailbox.org>
Reviewed-by: Joshua Ashton <joshua@froggi.es>
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fixes following warnings:
warning: no previous prototype for 'umc_v8_10_convert_error_address'
warning: variable 'channel_index' set but not used
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Apple iMac11,2 (mid 2010) also with Radeon HD-4670 that has the same
issue as iMac10,1 (late 2009) where the internal eDP panel stays dark on
driver load. This patch treats iMac11,2 the same as iMac10,1,
so the eDP panel stays active.
Additional steps:
Kernel boot parameter radeon.nomodeset=0 required to keep the eDP
panel active.
This patch is an extension of
commit 564d8a2cf3 ("drm/radeon: Fix eDP for single-display iMac10,1 (v2)")
Link: https://lore.kernel.org/all/lsq.1507553064.833262317@decadent.org.uk/
Signed-off-by: Mark Hawrylak <mark.hawrylak@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
It seems that commit bc3c5e0809 ("drm/i915/sseu: Don't try to store EU
mask internally in UAPI format") exposed a potential out-of-bounds
access, reported by UBSAN as following on a laptop with a gen 11 i915
card:
UBSAN: array-index-out-of-bounds in drivers/gpu/drm/i915/gt/intel_sseu.c:65:27
index 6 is out of range for type 'u16 [6]'
CPU: 2 PID: 165 Comm: systemd-udevd Not tainted 6.2.0-9-generic #9-Ubuntu
Hardware name: Dell Inc. XPS 13 9300/077Y9N, BIOS 1.11.0 03/22/2022
Call Trace:
<TASK>
show_stack+0x4e/0x61
dump_stack_lvl+0x4a/0x6f
dump_stack+0x10/0x18
ubsan_epilogue+0x9/0x3a
__ubsan_handle_out_of_bounds.cold+0x42/0x47
gen11_compute_sseu_info+0x121/0x130 [i915]
intel_sseu_info_init+0x15d/0x2b0 [i915]
intel_gt_init_mmio+0x23/0x40 [i915]
i915_driver_mmio_probe+0x129/0x400 [i915]
? intel_gt_probe_all+0x91/0x2e0 [i915]
i915_driver_probe+0xe1/0x3f0 [i915]
? drm_privacy_screen_get+0x16d/0x190 [drm]
? acpi_dev_found+0x64/0x80
i915_pci_probe+0xac/0x1b0 [i915]
...
According to the definition of sseu_dev_info, eu_mask->hsw is limited to
a maximum of GEN_MAX_SS_PER_HSW_SLICE (6) sub-slices, but
gen11_sseu_info_init() can potentially set 8 sub-slices, in the
!IS_JSL_EHL(gt->i915) case.
Fix this by reserving up to 8 slots for max_subslices in the eu_mask
struct.
Reported-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Fixes: bc3c5e0809 ("drm/i915/sseu: Don't try to store EU mask internally in UAPI format")
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230220171858.131416-1-andrea.righi@canonical.com
Grab the HDR DPCD refresh timeout (time we need to wait after
writing the sourc OUI before the HDR DPCD registers are ready)
from the VBT.
Windows doesn't even seem to have any default value for this,
which is perhaps a bit weird since the VBT value is documented
as TGL+ and I thought the HDR backlight stuff might already be
used on earlier platforms. To play it safe I left the old
hardcoded 30ms default in place. Digging through some internal
stuff that seems to have been a number given by the vendor for
one particularly slow TCON. Although I did see 50ms mentioned
somewhere as well.
Let's also include the value in the debug print to ease
debugging, and toss in the customary connector id+name as well.
The TGL Thinkpad T14 I have sets this to 0 btw. So the delay
is now gone on this machine:
[CONNECTOR:308:eDP-1] Detected Intel HDR backlight interface version 1
[CONNECTOR:308:eDP-1] Using Intel proprietary eDP backlight controls
[CONNECTOR:308:eDP-1] SDR backlight is controlled through PWM
[CONNECTOR:308:eDP-1] Using native PCH PWM for backlight control (controller=0)
[CONNECTOR:308:eDP-1] Using AUX HDR interface for backlight control (range 0..496)
[CONNECTOR:308:eDP-1] Performing OUI wait (0 ms)
Cc: Lyude Paul <lyude@redhat.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230220164718.23117-1-ville.syrjala@linux.intel.com
Reviewed-by: Jouni Högander <jouni.hogander@intel.com>
Remove the bogus csync check and replace it with something that:
- triggers for all forms of csync, not just the basic analog variant
- actually populates the mode csync flags so that drivers can
decide what to do with the mode
Originally the code tried to outright reject csync, but that
apparently broke some bogus LCD monitor that claimed to have
a detailed mode that uses analog csync, despite also claiming
the monitor only support separate sync:
https://bugzilla.redhat.com/show_bug.cgi?id=540024
Potentially that monitor should just be quirked or something.
Anyways, what we are dealing with now is some kind of funny i915
JSL machine with eDP where the panel claims to support a sensible
60Hz separate sync mode, and a 50Hz mode with bipolar analog
csync. The 50Hz mode does not work so we want to not use it.
Easiest way is to just correctly flag it as csync and the driver
will reject it.
TODO: or should we just reject any form of csync (or at least
the analog variants) for digital display interfaces?
v2: Grab digital csync polarity from hsync polarity bit (Jani)
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8146
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230228213610.26283-1-ville.syrjala@linux.intel.com