The dpy_reg_mmio_read_x functions directly copy 4 bytes data to the
target address with considering the length. If may cause the target
memory corrupted if the requested length less than 4 bytes. Fix it
for safety even we already have some checking to avoid this happen.
And for convince, the 3 functions are merged.
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
When host connects a crt screen, linux guest will detect two
screens: crt and dp. This is wrong as linux guest has only
one dp.
In order to avoid guest get host crt screen, we should set
ADPA_CRT_HOTPLUG_MONITOR to none. But MMIO_RO(PCH_ADPA) prevent
from that. So MMIO_DH should be used instead of MMIO_RO.
v2: Clear its staus to none at initialize, so guest don't
get host crt.(Zhangyu)
v3: SKL doesn't have this register, limit it to pre_skl.(xiong)
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
On BDW, when host physical screen and guest virtual screen aren't on
the same DDI port, guest i915 driver prints the following error and
stop running.
[ 6.775873] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000068
[ 6.775928] IP: intel_ddi_clock_get+0x81/0x430 [i915]
[ 6.776206] Call Trace:
[ 6.776233] ? vgpu_read32+0x4f/0x100 [i915]
[ 6.776264] intel_ddi_get_config+0x11c/0x230 [i915]
[ 6.776298] intel_modeset_setup_hw_state+0x313/0xd40 [i915]
[ 6.776334] intel_modeset_init+0xe49/0x18d0 [i915]
[ 6.776368] ? vgpu_write32+0x53/0x100 [i915]
[ 6.776731] ? intel_i2c_reset+0x42/0x50 [i915]
[ 6.777085] ? intel_setup_gmbus+0x32a/0x350 [i915]
[ 6.777427] i915_driver_load+0xabc/0x14d0 [i915]
[ 6.777768] i915_pci_probe+0x4f/0x70 [i915]
The null pointer is guest intel_crtc_state->shared_dpll which is
setted in haswell_get_ddi_pll(). When guest and host screen are
on different DDI port, host driver won't set PORT_CLK_SET(guest_port),
so haswell_get_ddi_pll() will return null and don't set
pipe_config->shared_dpll, once the following program refernce this
structure, it will print the above error.
This patch set the initial val of guest PORT_CLK_SEL(guest_port) to
LCPLL_810. And guest i915 driver will reset this value according to
guest screen mode.
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
If we write a relocation into the buffer, we require our own implicit
synchronisation added after the start of the execbuf, outside of the
user's control. As we may end up clflushing, or doing the patch itself
on the GPU, asynchronously we need to look at the implicit serialisation
on obj->resv and hence need to disable EXEC_OBJECT_ASYNC for this
object.
If the user does trigger a stall for relocations, we make sure the stall
is complete enough so that the batch is not submitted before we complete
those relocations.
Fixes: 77ae995789 ("drm/i915: Enable userspace to opt-out of implicit fencing")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
(cherry picked from commit 071750e550)
[danvet: Resolve conflicts, resolution reviewed by Tvrtko on irc.]
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
commit 2889caa923 ("drm/i915: Eliminate lots of iterations over the
execobjects array") jiggled around the error handling and replace a test
that we cleaned up properly after ourselves with an assertion. That
assertion failed because in the release function (moments after the
assertion) we were indeed forgetting to mark the vma as cleared. The
consequence was when testing an invalid relocation address, we would try
to release the vma twice (following the couple of attempts to verify the
address) and on the second release notice that the first release was
incomplete.
Testcase: igt/gem_reloc_overflow/invalid-address
Fixes: 2889caa923 ("drm/i915: Eliminate lots of iterations over the execobjects array")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170622104722.2583-1-chris@chris-wilson.co.uk
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
(cherry picked from commit 51d05e1b29)
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
There are two kinds of locking sequence.
One is in the thread which is started by vfio ioctl to do
the iommu unmapping. The locking sequence is:
down_read(&group_lock) ----> mutex_lock(&cached_lock)
The other is in the vfio release thread which will unpin all
the cached pages. The lock sequence is:
mutex_lock(&cached_lock) ---> down_read(&group_lock)
And, the cache_lock is used to protect the rb tree of the cache
node and doing vfio unpin doesn't require this lock. Move the
vfio unpin out of the cache_lock protected region.
v2:
- use for style instead of do{}while(1). (Zhenyu)
Fixes: f30437c5e7 ("drm/i915/gvt: add KVMGT support")
Signed-off-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Final pile of features for 4.13
New uabi:
- batch bo in first slot, for faster execbuf assembly in userspace
(Chris Wilson)
- (sub)slice getparam, needed for mesa perf support (Robert Bragg)
First pile of patches for cnl/cfl support, maintained by Rodrigo but
with lots of contributions from others. Still incomplete since public
review still ongoing.
Features/refactoring:
- Make execbuf faster (Chris Wilson), a pile of series to make execbuf
buffer handling have fewer passes, use less list walking, postpone
more work to async workers and shuffle buffers less, all to make the
common case much faster (in some cases at least).
- cold boot support for glk dsi (Madhav Chauhan)
- Clean up pipe A quirk and related old platform hacks (Ville)
- perf sampling support for kbl/glk (Lionel)
- perf cleanups (Robert Bragg)
- wire atomic state to backlight code, to avoid pipe lookup hacks
(Maarten)
- reduce request waiting latency/overhead to remove the spinning and
associated cpu cycle wasting (Chris)
- fix 90/270 rotation wm computation (Ville)
- new ddb allocation algo for skl (Kumar Mahesh)
- fix regression due to system suspend optimiazatino (Imre)
- the usual pile of small cleanups and refactors all over
GVT updates contained in this tag:
- optimization for per-VM mmio save/restore (Changbin)
- optimization for mmio hash table (Changbin)
- scheduler optimization with event (Ping)
- vGPU reset refinement (Fred)
- other misc refactor and cleanups, etc.
* tag 'drm-intel-next-2017-06-19' of git://anongit.freedesktop.org/git/drm-intel: (170 commits)
drm/i915: Update DRIVER_DATE to 20170619
drm/i915/cfl: Introduce Coffee Lake workarounds.
drm/i915: Store 9 bits of PCI Device ID for platforms with a LP PCH
drm/i915: Stash a pointer to the obj's resv in the vma
drm/i915: Async GPU relocation processing
drm/i915: Allow execbuffer to use the first object as the batch
drm/i915: Wait upon userptr get-user-pages within execbuffer
drm/i915: First try the previous execbuffer location
drm/i915: Store a persistent reference for an object in the execbuffer cache
drm/i915: Eliminate lots of iterations over the execobjects array
drm/i915: Disable EXEC_OBJECT_ASYNC when doing relocations
drm/i915: Pass vma to relocate entry
drm/i915: Store a direct lookup from object handle to vma
drm/i915: Fix retrieval of hangcheck stats
drm/i915: Store i915_gem_object_is_coherent() as a bit next to cache-dirty
drm/i915: Mark CPU cache as dirty on every transition for CPU writes
drm/i915: Make i915_vma_destroy() static
drm/i915: Actually attach the tv_format property to the SDVO connector
Revert "drm/i915/skl: New ddb allocation algorithm"
drm/i915/glk: Add cold boot sequence for GLK DSI
...
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename:
wait_queue_t => wait_queue_entry_t
'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.
Start sorting this out by renaming it to 'wait_queue_entry_t'.
This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I tried __GFP_NORETRY in the belief that __GFP_RECLAIM was effective. It
struggles with handling reclaim of our dirty buffers and relies on
reclaim via kswapd. As a result, a single pass of direct reclaim is
unreliable when i915 occupies the majority of available memory, and the
only means of effectively waiting on kswapd to amke progress is by not
setting the __GFP_NORETRY flag and lopping. That leaves us with the
dilemma of invoking the oomkiller instead of propagating the allocation
failure back to userspace where it can be handled more gracefully (one
hopes). In the future we may have __GFP_MAYFAIL to allow repeats up until
we genuinely run out of memory and the oomkiller would have been invoked.
Until then, let the oomkiller wreck havoc.
v2: Stop playing with side-effects of gfp flags and await __GFP_MAYFAIL
v3: Update comments that direct reclaim only appears to be ignoring our
dirty buffers!
Fixes: 24f8e00a8a ("drm/i915: Prefer to report ENOMEM rather than incur the oom for gfx allocations")
Testcase: igt/gem_tiled_swapping
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Michal Hocko <mhocko@suse.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170609110350.1767-2-chris@chris-wilson.co.uk
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
(cherry picked from commit eaf4180155)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Commit 24f8e00a8a ("drm/i915: Prefer to report ENOMEM rather than
incur the oom for gfx allocations") made the bold decision to try and
avoid the oomkiller by reporting -ENOMEM to userspace if our allocation
failed after attempting to free enough buffer objects. In short, it
appears we were giving up too easily (even before we start wondering if
one pass of reclaim is as strong as we would like). Part of the problem
is that if we only shrink just enough pages for our expected allocation,
the likelihood of those pages becoming available to us is less than 100%
To counter-act that we ask for twice the number of pages to be made
available. Furthermore, we allow the shrinker to pull pages from the
active list in later passes.
v2: Be a little more cautious in paging out gfx buffers, and leave that
to a more balanced approach from shrink_slab(). Important when combined
with "drm/i915: Start writeback from the shrinker" as anything shrunk is
immediately swapped out and so should be more conservative.
Fixes: 24f8e00a8a ("drm/i915: Prefer to report ENOMEM rather than incur the oom for gfx allocations")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170609110350.1767-1-chris@chris-wilson.co.uk
(cherry picked from commit 4846bf0ca8)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
During execbuf, a mandatory step is that we add this request (this
fence) to each object's reservation_object. Inside execbuf, we track the
vma, and to add the fence to the reservation_object then means having to
first chase the obj, incurring another cache miss. We can reduce the
number of cache misses by stashing a pointer to the reservation_object
in the vma itself.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170616140525.6394-1-chris@chris-wilson.co.uk
If the user requires patching of their batch or auxiliary buffers, we
currently make the alterations on the cpu. If they are active on the GPU
at the time, we wait under the struct_mutex for them to finish executing
before we rewrite the contents. This happens if shared relocation trees
are used between different contexts with separate address space (and the
buffers then have different addresses in each), the 3D state will need
to be adjusted between execution on each context. However, we don't need
to use the CPU to do the relocation patching, as we could queue commands
to the GPU to perform it and use fences to serialise the operation with
the current activity and future - so the operation on the GPU appears
just as atomic as performing it immediately. Performing the relocation
rewrites on the GPU is not free, in terms of pure throughput, the number
of relocations/s is about halved - but more importantly so is the time
under the struct_mutex.
v2: Break out the request/batch allocation for clearer error flow.
v3: A few asserts to ensure rq ordering is maintained
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Currently, the last object in the execlist is the always the batch.
However, when building the batch buffer we often know the batch object
first and if we can use the first slot in the execlist we can emit
relocation instructions relative to it immediately and avoid a separate
pass to adjust the relocations to point to the last execlist slot.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
This simply hides the EAGAIN caused by userptr when userspace causes
resource contention. However, it is quite beneficial with highly
contended userptr users as we avoid repeating the setup costs and
kernel-user context switches.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
When choosing a slot for an execbuffer, we ideally want to use the same
address as last time (so that we don't have to rebind it) and the same
address as expected by the user (so that we don't have to fixup any
relocations pointing to it). If we first try to bind the incoming
execbuffer->offset from the user, or the currently bound offset that
should hopefully achieve the goal of avoiding the rebind cost and the
relocation penalty. However, if the object is not currently bound there
we don't want to arbitrarily unbind an object in our chosen position and
so choose to rebind/relocate the incoming object instead. After we
report the new position back to the user, on the next pass the
relocations should have settled down.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtien@linux.intel.com>
If we take a reference to the object/vma when it is first used in an
execbuf, we can keep that reference until the object's file-local handle
is closed. Thereby saving a frequent ref/unref pair.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
The major scaling bottleneck in execbuffer is the processing of the
execobjects. Creating an auxiliary list is inefficient when compared to
using the execobject array we already have allocated.
Reservation is then split into phases. As we lookup up the VMA, we
try and bind it back into active location. Only if that fails, do we add
it to the unbound list for phase 2. In phase 2, we try and add all those
objects that could not fit into their previous location, with fallback
to retrying all objects and evicting the VM in case of severe
fragmentation. (This is the same as before, except that phase 1 is now
done inline with looking up the VMA to avoid an iteration over the
execobject array. In the ideal case, we eliminate the separate reservation
phase). During the reservation phase, we only evict from the VM between
passes (rather than currently as we try to fit every new VMA). In
testing with Unreal Engine's Atlantis demo which stresses the eviction
logic on gen7 class hardware, this speed up the framerate by a factor of
2.
The second loop amalgamation is between move_to_gpu and move_to_active.
As we always submit the request, even if incomplete, we can use the
current request to track active VMA as we perform the flushes and
synchronisation required.
The next big advancement is to avoid copying back to the user any
execobjects and relocations that are not changed.
v2: Add a Theory of Operation spiel.
v3: Fall back to slow relocations in preparation for flushing userptrs.
v4: Document struct members, factor out eb_validate_vma(), add a few
more comments to explain some magic and hide other magic behind macros.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
If we write a relocation into the buffer, we require our own implicit
synchronisation added after the start of the execbuf, outside of the
user's control. As we may end up clflushing, or doing the patch itself
on the GPU, asynchronously we need to look at the implicit serialisation
on obj->resv and hence need to disable EXEC_OBJECT_ASYNC for this
object.
If the user does trigger a stall for relocations, we make sure the stall
is complete enough so that the batch is not submitted before we complete
those relocations.
Fixes: 77ae995789 ("drm/i915: Enable userspace to opt-out of implicit fencing")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
We can simplify our tracking of pending writes in an execbuf to the
single bit in the vma->exec_entry->flags, but that requires the
relocation function knowing the object's vma. Pass it along.
Note we have only been using a single bit to track flushing since
commit cc889e0f6c
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Wed Jun 13 20:45:19 2012 +0200
drm/i915: disable flushing_list/gpu_write_list
unconditionally flushed all render caches before the breadcrumb and
commit 6ac42f4148
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Sat Jul 21 12:25:01 2012 +0200
drm/i915: Replace the complex flushing logic with simple invalidate/flush all
did away with the explicit GPU domain tracking. This was then codified
into the ABI with NO_RELOC in
commit ed5982e6ce
Author: Daniel Vetter <daniel.vetter@ffwll.ch> # Oi! Patch stealer!
Date: Thu Jan 17 22:23:36 2013 +0100
drm/i915: Allow userspace to hint that the relocations were known
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
The advent of full-ppgtt lead to an extra indirection between the object
and its binding. That extra indirection has a noticeable impact on how
fast we can convert from the user handles to our internal vma for
execbuffer. In order to bypass the extra indirection, we use a
resizable hashtable to jump from the object to the per-ctx vma.
rhashtable was considered but we don't need the online resizing feature
and the extra complexity proved to undermine its usefulness. Instead, we
simply reallocate the hastable on demand in a background task and
serialize it before iterating.
In non-full-ppgtt modes, multiple files and multiple contexts can share
the same vma. This leads to having multiple possible handle->vma links,
so we only use the first to establish the fast path. The majority of
buffers are not shared and so we should still be able to realise
speedups with multiple clients.
v2: Prettier names, more magic.
v3: Many style tweaks, most notably hiding the misuse of execobj[].rsvd2
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Currently, we only mark the CPU cache as dirty if we skip a clflush.
This leads to some confusion where we have to ask if the object is in
the write domain or missed a clflush. If we always mark the cache as
dirty, this becomes a much simply question to answer.
The goal remains to do as few clflushes as required and to do them as
late as possible, in the hope of deferring the work to a kthread and not
block the caller (e.g. execbuf, flips).
v2: Always call clflush before GPU execution when the cache_dirty flag
is set. This may cause some extra work on llc systems that migrate dirty
buffers back and forth - but we do try to limit that by only setting
cache_dirty at the end of the gpu sequence.
v3: Always mark the cache as dirty upon a level change, as we need to
invalidate any stale cachelines due to external writes.
Reported-by: Dongwon Kim <dongwon.kim@intel.com>
Fixes: a6a7cc4b7d ("drm/i915: Always flush the dirty CPU cache when pinning the scanout")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Tested-by: Dongwon Kim <dongwon.kim@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170615123850.26843-1-chris@chris-wilson.co.uk
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>