In order to avoid confusing the HW, we must never submit an empty ring
during lite-restore, that is we should always advance the RING_TAIL
before submitting to stay ahead of the RING_HEAD.
Normally this is prevented by keeping a couple of spare NOPs in the
request->wa_tail so that on resubmission we can advance the tail. This
relies on the request only being resubmitted once, which is the normal
condition as it is seen once for ELSP[1] and then later in ELSP[0]. On
preemption, the requests are unwound and the tail reset back to the
normal end point (as we know the request is incomplete and therefore its
RING_HEAD is even earlier).
However, if this w/a should fail we would try and resubmit the request
with the RING_TAIL already set to the location of this request's wa_tail
potentially causing a GPU hang. We can spot when we do try and
incorrectly resubmit without advancing the RING_TAIL and spare any
embarrassment by forcing the context restore.
In the case of preempt-to-busy, we leave the requests running on the HW
while we unwind. As the ring is still live, we cannot rewind our
rq->tail without forcing a reload so leave it set to rq->wa_tail and
only force a reload if we resubmit after a lite-restore. (Normally, the
forced reload will be a part of the preemption event.)
Fixes: 22b7a426bb ("drm/i915/execlists: Preempt-to-busy")
Closes: https://gitlab.freedesktop.org/drm/intel/issues/673
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: stable@kernel.vger.org
Link: https://patchwork.freedesktop.org/patch/msgid/20191209023215.3519970-1-chris@chris-wilson.co.uk
(cherry picked from commit 82c69bf586)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
In order to avoid confusing the HW, we must never submit an empty ring
during lite-restore, that is we should always advance the RING_TAIL
before submitting to stay ahead of the RING_HEAD.
Normally this is prevented by keeping a couple of spare NOPs in the
request->wa_tail so that on resubmission we can advance the tail. This
relies on the request only being resubmitted once, which is the normal
condition as it is seen once for ELSP[1] and then later in ELSP[0]. On
preemption, the requests are unwound and the tail reset back to the
normal end point (as we know the request is incomplete and therefore its
RING_HEAD is even earlier).
However, if this w/a should fail we would try and resubmit the request
with the RING_TAIL already set to the location of this request's wa_tail
potentially causing a GPU hang. We can spot when we do try and
incorrectly resubmit without advancing the RING_TAIL and spare any
embarrassment by forcing the context restore.
In the case of preempt-to-busy, we leave the requests running on the HW
while we unwind. As the ring is still live, we cannot rewind our
rq->tail without forcing a reload so leave it set to rq->wa_tail and
only force a reload if we resubmit after a lite-restore. (Normally, the
forced reload will be a part of the preemption event.)
Fixes: 22b7a426bb ("drm/i915/execlists: Preempt-to-busy")
Closes: https://gitlab.freedesktop.org/drm/intel/issues/673
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: stable@kernel.vger.org
Link: https://patchwork.freedesktop.org/patch/msgid/20191209023215.3519970-1-chris@chris-wilson.co.uk
Instead of relying on the workqueue, the upcoming reworked GuC
submission flow will offer the host driver indipendent control over
the execution status of each context submitted to GuC. As part of this,
the doorbell usage model has been reworked, with each doorbell being
paired to a single lrc and a doorbell ring representing new work
available for that specific context. This mechanism, however, limits
the number of contexts that can be registered with GuC to the number of
doorbells, which is an undesired limitation. To avoid this limitation,
we requested the GuC team to also provide a H2G that will allow the host
to notify the GuC of work available for a specified lrc, so we can use
that mechanism instead of relying on the doorbells. We can therefore drop
the doorbell code we currently have, also given the fact that in the
unlikely case we'd want to switch back to using doorbells we'd have to
heavily rework it.
The workqueue will still have a use in the new interface to pass special
commands, so that code has been retained for now.
With the doorbells gone and the GuC client becoming even simpler, the
existing GuC selftests don't give us any meaningful coverage so we can
remove them as well. Some selftests might come with the new code, but
they will look different from what we have now so if doesn't seem worth
it to keep the file around in the meantime.
v2: fix comments and commit message (John)
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191205220243.27403-3-daniele.ceraolospurio@intel.com
Pull more drm updates from Dave Airlie:
"Rob pointed out I missed his pull request for msm-next, it's been in
next for a while outside of my tree so shouldn't cause any unexpected
issues, it has some OCMEM support in drivers/soc that is acked by
other maintainers as it's outside my tree.
Otherwise it's a usual fixes pull, i915, amdgpu, the main ones, with
some tegra, omap, mgag200 and one core fix.
Summary:
msm-next:
- OCMEM support for a3xx and a4xx GPUs.
- a510 support + display support
core:
- mst payload deletion fix
i915:
- uapi alignment fix
- fix for power usage regression due to security fixes
- change default preemption timeout to 640ms from 100ms
- EHL voltage level display fixes
- TGL DGL PHY fix
- gvt - MI_ATOMIC cmd parser fix, CFL non-priv warning
- CI spotted deadlock fix
- EHL port D programming fix
amdgpu:
- VRAM lost fixes on BACO for CI/VI
- navi14 DC fixes
- misc SR-IOV, gfx10 fixes
- XGMI fixes for arcturus
- SRIOV fixes
amdkfd:
- KFD on ppc64le enabled
- page table optimisations
radeon:
- fix for r1xx/2xx register checker.
tegra:
- displayport regression fixes
- DMA API regression fixes
mgag200:
- fix devices that can't scanout except at 0 addr
omap:
- fix dma_addr refcounting"
* tag 'drm-next-2019-12-06' of git://anongit.freedesktop.org/drm/drm: (100 commits)
drm/dp_mst: Correct the bug in drm_dp_update_payload_part1()
drm/omap: fix dma_addr refcounting
drm/tegra: Run hub cleanup on ->remove()
drm/tegra: sor: Make the +5V HDMI supply optional
drm/tegra: Silence expected errors on IOMMU attach
drm/tegra: vic: Export module device table
drm/tegra: sor: Implement system suspend/resume
drm/tegra: Use proper IOVA address for cursor image
drm/tegra: gem: Remove premature import restrictions
drm/tegra: gem: Properly pin imported buffers
drm/tegra: hub: Remove bogus connection mutex check
ia64: agp: Replace empty define with do while
agp: Add bridge parameter documentation
agp: remove unused variable num_segments
agp: move AGPGART_MINOR to include/linux/miscdevice.h
agp: remove unused variable size in agp_generic_create_gatt_table
drm/dp_mst: Fix build on systems with STACKTRACE_SUPPORT=n
drm/radeon: fix r1xx/r2xx register checker for POT textures
drm/amdgpu: fix GFX10 missing CSIB set(v3)
drm/amdgpu: should stop GFX ring in hw_fini
...
This is really just an alias of mmap_gtt. The 'mmap offset' nomenclature
comes from the value returned by this ioctl which is the offset into the
device fd which userpace uses with mmap(2).
mmap_gtt was our initial mmap_offset implementation, this extends
our CPU mmap support to allow additional fault handlers that depends on
the object's backing pages.
Note that we multiplex mmap_gtt and mmap_offset through the same ioctl,
and use the zero extending behaviour of drm to differentiate between
them, when we inspect the flags.
To support multiple mmap types on an object we need to support multiple
mmap_offsets for an object (each offset in the global device address
space corresponding to a unique instance of the object for a file + mmap
type). As we drop the simplified drm core idea of a single mmap_offset,
we need to provide replacement hooks for the dumb mmap interface as
well.
Link: https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1675
Testcase: igt/gem_mmap_offset
Signed-off-by: Abdiel Janulgue <abdiel.janulgue@linux.intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20191204120032.3682839-1-chris@chris-wilson.co.uk
Only along the submission path can we guarantee that the locked request
is indeed from a foreign engine, and so the nesting of engine/rq is
permissible. On the submission tasklet (process_csb()), we may find
ourselves competing with the normal nesting of rq/engine, invalidating
our nesting. As we only use the spinlock for debug purposes, skip the
debug if we cannot acquire the spinlock for safe validation - catching
99% of the bugs is better than causing a hard lockup.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Fixes: c95d31c3df ("drm/i915/execlists: Lock the request while validating it during promotion")
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191203152631.3107653-2-chris@chris-wilson.co.uk
After much hair pulling, resort to preallocating the ppGTT entries on
init to circumvent the apparent lack of PD invalidate following the
write to PP_DCLV upon switching mm between contexts (and here the same
context after binding new objects). However, the details of that PP_DCLV
invalidate are still unknown, and it appears we need to reload the mm
twice to cover over a timing issue. Worrying.
Fixes: 3dc007fe9b ("drm/i915/gtt: Downgrade gen7 (ivb, byt, hsw) back to aliasing-ppgtt")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191129201328.1398583-1-chris@chris-wilson.co.uk
Implement Wa_1604555607 (set the DS pairing timer to 128 cycles).
FF_MODE2 is part of the register state context, that's why it is
implemented here.
At TGL A0 stepping, FF_MODE2 register read back is broken, hence
disabling the WA verification.
v2: Rebased on top of the WA refactoring (Oscar)
v3: Correctly add to ctx_workarounds_init (Michel)
v4:
uncore read is used [Tvrtko]
Macros as used for MASK definition [Chris]
v5:
Skip the Wa_1604555607 verification [Ram]
i915 ptr retrieved from engine. [Tvrtko]
v6:
Added wa_add as a wrapper for __wa_add [Chris]
wa_add is directly called instead of new wrapper [tvrtko]
BSpec: 19363
HSDES: 1604555607
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Ramalingam C <ramlingam.c@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> [v5]
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191128021005.3350-1-ramalingam.c@intel.com
Pull drm updates from Dave Airlie:
"Lots of stuff in here, though it hasn't been too insane this merge
apart from dealing with the security fun.
uapi:
- export different colorspace properties on DP vs HDMI
- new fourcc for ARM 16x16 block format
- syncobj: allow querying last submitted timeline value
- DRM_FORMAT_BIG_ENDIAN defined as unsigned
core:
- allow using gem vma manager in ttm
- connector/encoder/bridge doc fixes
- allow more than 3 encoders for a connector
- displayport mst suspend/resume reprobing support
- vram lazy unmapping, uniform vram mm and gem vram
- edid cleanups + AVI informframe bar info
- displayport helpers - dpcd parser added
dp_cec:
- Allow a connector to be associated with a cec device
ttm:
- pipelining with no_gpu_wait fix
- always keep BOs on the LRU
sched:
- allow free_job routine to sleep
i915:
- Block userptr from mappable GTT
- i915 perf uapi versioning
- OA stream dynamic reconfiguration
- make context persistence optional
- introduce DRM_I915_UNSTABLE Kconfig
- add fake lmem testing under unstable
- BT.2020 support for DP MSA
- struct mutex elimination
- Tigerlake display/PLL/power management improvements
- Jasper Lake PCH support
- refactor PMU for multiple GPUs
- Icelake firmware update
- Split out vga + switcheroo code
amdgpu:
- implement dma-buf import/export without helpers
- vega20 RAS enablement
- DC i2c over aux fixes
- renoir GPU reset
- DC HDCP support
- BACO support for CI/VI asics
- MSI-X support
- Arcturus EEPROM support
- Arcturus VCN encode support
- VCN dynamic powergating on RV/RV2
amdkfd:
- add navi12/14/renoir support to kfd
radeon:
- SI dpm fix ported from amdgpu
- fix bad DMA on ppc platforms
gma500:
- memory leak fixes
qxl:
- convert to new gem mmap
exynos:
- build warning fix
komeda:
- add aclk sysfs attribute
v3d:
- userspace cleanup uapi change
i810:
- fix for underflow in dispatch ioctls
ast:
- refactor show_cursor
mgag200:
- refactor show_cursor
arcgpu:
- encoder finding improvements
mediatek:
- mipi_tx, dsi and partial crtc support for MT8183 SoC
- rotation support
meson:
- add suspend/resume support
omap:
- misc refactors
tegra:
- DisplayPort support for Tegra 210, 186 and 194.
- IOMMU-backed DMA API fixes
panfrost:
- fix lockdep issue
- simplify devfreq integration
rcar-du:
- R8A774B1 SoC support
- fixes for H2 ES2.0
sun4i:
- vcc-dsi regulator support
virtio-gpu:
- vmexit vs spinlock fix
- move to gem shmem helpers
- handle large command buffers with cma"
* tag 'drm-next-2019-11-27' of git://anongit.freedesktop.org/drm/drm: (1855 commits)
drm/amdgpu: invalidate mmhub semaphore workaround in gmc9/gmc10
drm/amdgpu: initialize vm_inv_eng0_sem for gfxhub and mmhub
drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus VF.
drm/amd/amdgpu/sriov temporarily skip ras,dtm,hdcp for arcturus VF
drm/amdgpu/gfx10: re-init clear state buffer after gpu reset
merge fix for "ftrace: Rework event_create_dir()"
drm/amdgpu: Update Arcturus golden registers
drm/amdgpu/gfx10: fix out-of-bound mqd_backup array access
drm/amdgpu/gfx10: explicitly wait for cp idle after halt/unhalt
Revert "drm/amd/display: enable S/G for RAVEN chip"
drm/amdgpu: disable gfxoff on original raven
drm/amdgpu: remove experimental flag for Navi14
drm/amdgpu: disable gfxoff when using register read interface
drm/amdgpu/powerplay: properly set PP_GFXOFF_MASK (v2)
drm/amdgpu: fix bad DMA from INTERRUPT_CNTL2
drm/radeon: fix bad DMA from INTERRUPT_CNTL2
drm/amd/display: Fix debugfs on MST connectors
drm/amdgpu/nv: add asic func for fetching vbios from rom directly
drm/amdgpu: put flush_delayed_work at first
drm/amdgpu/vcn2.5: fix the enc loop with hw fini
...
The design of our interrupt handlers is that we ack the receipt of the
interrupt first, inside the critical section where the master interrupt
control is off and other cpus cannot start processing the next
interrupt; and then process the interrupt events afterwards. However,
Icelake introduced a whole new set of banked GT_IIR that are inherently
serialised and slow to retrieve the IIR and must be processed within the
critical section. We can still push our breadcrumbs out of this critical
section by using our irq_worker. On bdw+, this should not make too much
of a difference as we only slightly defer the breadcrumbs, but on icl+
this should make a big difference to our throughput of interrupts from
concurrently executing engines.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191127115813.3345823-1-chris@chris-wilson.co.uk
The expected downside to commit 58b4c1a07a ("drm/i915: Reduce nested
prepare_remote_context() to a trylock") was that it would need to return
-EAGAIN to userspace in order to resolve potential mutex inversion. Such
an unsightly round trip is unnecessary if we could atomically insert a
barrier into the i915_active_fence, so make it happen.
Currently, we use the timeline->mutex (or some other named outer lock)
to order insertion into the i915_active_fence (and so individual nodes
of i915_active). Inside __i915_active_fence_set, we only need then
serialise with the interrupt handler in order to claim the timeline for
ourselves.
However, if we remove the outer lock, we need to ensure the order is
intact between not only multiple threads trying to insert themselves
into the timeline, but also with the interrupt handler completing the
previous occupant. We use xchg() on insert so that we have an ordered
sequence of insertions (and each caller knows the previous fence on
which to wait, preserving the chain of all fences in the timeline), but
we then have to cmpxchg() in the interrupt handler to avoid overwriting
the new occupant. The only nasty side-effect is having to temporarily
strip off the RCU-annotations to apply the atomic operations, otherwise
the rules are much more conventional!
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=112402
Fixes: 58b4c1a07a ("drm/i915: Reduce nested prepare_remote_context() to a trylock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191127134527.3438410-1-chris@chris-wilson.co.uk
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- A comprehensive rewrite of the robust/PI futex code's exit handling
to fix various exit races. (Thomas Gleixner et al)
- Rework the generic REFCOUNT_FULL implementation using
atomic_fetch_* operations so that the performance impact of the
cmpxchg() loops is mitigated for common refcount operations.
With these performance improvements the generic implementation of
refcount_t should be good enough for everybody - and this got
confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
REFCOUNT_FULL entirely, leaving the generic implementation enabled
unconditionally. (Will Deacon)
- Other misc changes, fixes, cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
lkdtm: Remove references to CONFIG_REFCOUNT_FULL
locking/refcount: Remove unused 'refcount_error_report()' function
locking/refcount: Consolidate implementations of refcount_t
locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
locking/refcount: Move saturation warnings out of line
locking/refcount: Improve performance of generic REFCOUNT_FULL code
locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header
locking/refcount: Remove unused refcount_*_checked() variants
locking/refcount: Ensure integer operands are treated as signed
locking/refcount: Define constants for saturation and max refcount values
futex: Prevent exit livelock
futex: Provide distinct return value when owner is exiting
futex: Add mutex around futex exit
futex: Provide state handling for exec() as well
futex: Sanitize exit state handling
futex: Mark the begin of futex exit explicitly
futex: Set task::futex_state to DEAD right after handling futex exit
futex: Split futex_mm_release() for exit/exec
exit/exec: Seperate mm_release()
futex: Replace PF_EXITPIDONE with a state
...
In order to avoid some nasty mutex inversions, commit 09c5ab384f
("drm/i915: Keep rings pinned while the context is active") allowed the
intel_ring unpinning to be run concurrently with the next context
pinning it. Thus each step in intel_ring_unpin() needed to be atomic and
ordered in a nice onion with intel_ring_pin() so that the lifetimes
overlapped and were always safe.
Sadly, a few steps in intel_ring_unpin() were overlooked, such as
closing the read/write pointers of the ring and discarding the
intel_ring.vaddr, as these steps were not serialised with
intel_ring_pin() and so could leave the ring in disarray.
Fixes: 09c5ab384f ("drm/i915: Keep rings pinned while the context is active")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191118230254.2615942-6-chris@chris-wilson.co.uk
(cherry picked from commit a266bf4200)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
The major drawback of commit 7e34f4e4aa ("drm/i915/gen8+: Add RC6 CTX
corruption WA") is that it disables RC6 while Skylake (and friends) is
active, and we do not consider the GPU idle until all outstanding
requests have been retired and the engine switched over to the kernel
context. If userspace is idle, this task falls onto our background idle
worker, which only runs roughly once a second, meaning that userspace has
to have been idle for a couple of seconds before we enable RC6 again.
Naturally, this causes us to consume considerably more energy than
before as powersaving is effectively disabled while a display server
(here's looking at you Xorg) is running.
As execlists will get a completion event as each context is completed,
we can use this interrupt to queue a retire worker bound to this engine
to cleanup idle timelines. We will then immediately notice the idle
engine (without userspace intervention or the aid of the background
retire worker) and start parking the GPU. Thus during light workloads,
we will do much more work to idle the GPU faster... Hopefully with
commensurate power saving!
v2: Watch context completions and only look at those local to the engine
when retiring to reduce the amount of excess work we perform.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=112315
References: 7e34f4e4aa ("drm/i915/gen8+: Add RC6 CTX corruption WA")
References: 2248a28384 ("drm/i915/gen8+: Add RC6 CTX corruption WA")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191125105858.1718307-3-chris@chris-wilson.co.uk
(cherry picked from commit 4f88f8747f)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
In the next patch, we will introduce a new asynchronous retirement
worker, fed by execlists CS events. Here we may queue a retirement as
soon as a request is submitted to HW (and completes instantly), and we
also want to process that retirement as early as possible and cannot
afford to postpone (as there may not be another opportunity to retire it
for a few seconds). To allow the new async retirer to run in parallel
with our submission, pull the __i915_request_queue (that passes the
request to HW) inside the timelines spinlock so that the retirement
cannot release the timeline before we have completed the submission.
v2: Actually to play nicely with engine_retire, we have to raise the
timeline.active_lock before releasing the HW. intel_gt_retire_requsts()
is still serialised by the outer lock so they cannot see this
intermediate state, and engine_retire is serialised by HW submission.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191125105858.1718307-2-chris@chris-wilson.co.uk
(cherry picked from commit 88a4655e75)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Since we want to do a lockless read of the current active request, and
that request is written to by process_csb also without serialisation, we
need to instruct gcc to take care in reading the pointer itself.
Otherwise, we have observed execlists_active() to report 0x40.
[ 2400.760381] igt/para-4098 1..s. 2376479300us : process_csb: rcs0 cs-irq head=3, tail=4
[ 2400.760826] igt/para-4098 1..s. 2376479303us : process_csb: rcs0 csb[4]: status=0x00000001:0x00000000
[ 2400.761271] igt/para-4098 1..s. 2376479306us : trace_ports: rcs0: promote { b9c59:2622, b9c55:2624 }
[ 2400.761726] igt/para-4097 0d... 2376479311us : __i915_schedule: rcs0: -2147483648->3, inflight:0000000000000040, rq:ffff888208c1e940
which is impossible!
The answer is that as we keep the existing execlists->active pointing
into the array as we copy over that array, the unserialised read may see
a partial pointer value.
Fixes: df40306902 ("drm/i915/execlists: Lift process_csb() out of the irq-off spinlock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191125094318.1630806-1-chris@chris-wilson.co.uk
(cherry picked from commit 331bf90591)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
In commit a79ca656b6 ("drm/i915: Push the wakeref->count deferral to
the backend"), I erroneously concluded that we last modify the engine
inside __i915_request_commit() meaning that we could enable concurrent
submission for userspace as we enqueued this request. However, this
falls into a trap with other users of the engine->kernel_context waking
up and submitting their request before the idle-switch is queued, with
the result that the kernel_context is executed out-of-sequence most
likely upsetting the GPU and certainly ourselves when we try to retire
the out-of-sequence requests.
As such we need to hold onto the effective engine->kernel_context mutex
lock (via the engine pm mutex proxy) until we have finish queuing the
request to the engine.
v2: Serialise against concurrent intel_gt_retire_requests()
v3: Describe the hairy locking scheme with intel_gt_retire_requests()
for future reference.
v4: Combine timeline->lock and engine pm release; it's hairy.
Fixes: a79ca656b6 ("drm/i915: Push the wakeref->count deferral to the backend")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191120165514.3955081-2-chris@chris-wilson.co.uk
(cherry picked from commit 5cba288466)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
The general concept was that intel_timeline.active_count was locked by
the intel_timeline.mutex. The exception was for power management, where
the engine->kernel_context->timeline could be manipulated under the
global wakeref.mutex.
This was quite solid, as we always manipulated the timeline only while
we held an engine wakeref.
And then we started retiring requests outside of struct_mutex, only
using the timelines.active_list and the timeline->mutex. There we
started manipulating intel_timeline.active_count outside of an engine
wakeref, and so introduced a race between __engine_park() and
intel_gt_retire_requests(), a race that could result in the
engine->kernel_context not being added to the active timelines and so
losing requests, which caused us to keep the system permanently powered
up [and unloadable].
The race would be easy to close if we could take the engine wakeref for
the timeline before we retire -- except timelines are not bound to any
engine and so we would need to keep all active engines awake. The
alternative is to guard intel_timeline_enter/intel_timeline_exit for use
outside of the timeline->mutex.
Fixes: e5dadff4b0 ("drm/i915: Protect request retirement with timeline->mutex")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191120165514.3955081-1-chris@chris-wilson.co.uk
(cherry picked from commit a6edbca74b)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
The major drawback of commit 7e34f4e4aa ("drm/i915/gen8+: Add RC6 CTX
corruption WA") is that it disables RC6 while Skylake (and friends) is
active, and we do not consider the GPU idle until all outstanding
requests have been retired and the engine switched over to the kernel
context. If userspace is idle, this task falls onto our background idle
worker, which only runs roughly once a second, meaning that userspace has
to have been idle for a couple of seconds before we enable RC6 again.
Naturally, this causes us to consume considerably more energy than
before as powersaving is effectively disabled while a display server
(here's looking at you Xorg) is running.
As execlists will get a completion event as each context is completed,
we can use this interrupt to queue a retire worker bound to this engine
to cleanup idle timelines. We will then immediately notice the idle
engine (without userspace intervention or the aid of the background
retire worker) and start parking the GPU. Thus during light workloads,
we will do much more work to idle the GPU faster... Hopefully with
commensurate power saving!
v2: Watch context completions and only look at those local to the engine
when retiring to reduce the amount of excess work we perform.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=112315
References: 7e34f4e4aa ("drm/i915/gen8+: Add RC6 CTX corruption WA")
References: 2248a28384 ("drm/i915/gen8+: Add RC6 CTX corruption WA")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191125105858.1718307-3-chris@chris-wilson.co.uk
In the next patch, we will introduce a new asynchronous retirement
worker, fed by execlists CS events. Here we may queue a retirement as
soon as a request is submitted to HW (and completes instantly), and we
also want to process that retirement as early as possible and cannot
afford to postpone (as there may not be another opportunity to retire it
for a few seconds). To allow the new async retirer to run in parallel
with our submission, pull the __i915_request_queue (that passes the
request to HW) inside the timelines spinlock so that the retirement
cannot release the timeline before we have completed the submission.
v2: Actually to play nicely with engine_retire, we have to raise the
timeline.active_lock before releasing the HW. intel_gt_retire_requsts()
is still serialised by the outer lock so they cannot see this
intermediate state, and engine_retire is serialised by HW submission.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191125105858.1718307-2-chris@chris-wilson.co.uk
As the engine->kernel_context is used within the engine-pm barrier, we
have to be careful when emitting requests outside of the barrier, as the
strict timeline locking rules do not apply. Instead, we must ensure the
engine_park() cannot be entered as we build the request, which is
simplest by taking an explicit engine-pm wakeref around the request
construction.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191125105858.1718307-1-chris@chris-wilson.co.uk