Since the migrate code is using the identity map for addressing VRAM,
copy chunks may become as small as 64K if the VRAM resource is fragmented.
However, a chunk size smaller that 1MiB may lead to the *next* chunk's
offset into the CCS metadata backup memory may not be page-aligned, and
the XY_CTRL_SURF_COPY_BLT command can't handle that, and even if it could,
the current code doesn't handle the offset calculaton correctly.
To fix this, make sure we align the size of VRAM copy chunks to 1MiB. If
the remaining data to copy is smaller than that, that's not a problem,
so use the remaining size. If the VRAM copy cunk becomes fragmented due
to the size alignment restriction, don't use the identity map, but instead
emit PTEs into the page-table like we do for system memory.
v2:
- Rebase
v3:
- Future proof somewhat by taking into account the real data size to
flat CCS metadata size ratio. (Matt Roper)
- Invert a couple of if-statements for better readability.
- Fix support for 4K-granularity VRAM sizes. (Tested on DG1).
v4:
- Fix up code comments
- Fix debug printout format typo.
v5:
- Add a Fixes: tag.
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Matthew Auld <matthew.william.auld@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Fixes: e89b384cde ("drm/xe/migrate: Update emit_pte to cope with a size level than 4k")
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240110163415.524165-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit ef51d7542d)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
After exec_queue has been created, we cannot simply modify q->priority.
This needs to be done by the backend via q->ops. However in this case,
it would be more efficient to simply pass a flag when creating the
exec_queue and set the desired priority upfront during queue creation.
To that end: new flag EXEC_QUEUE_FLAG_HIGH_PRIORITY is introduced.
The priority field is moved to be with other scheduling properties and
is now exec_queue.sched_props.priority. This is no longer set to initial
value by the backend, but is now set within __xe_exec_queue_create().
Fixes: b4eecedc75 ("drm/xe: Fix potential deadlock handling page faults")
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
(cherry picked from commit a8004af338)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
- Clear flat ccs during user bo creation.
- copy ccs meta data between flat ccs and bo during eviction and
restore.
- Add a bool field ccs_cleared in bo, true means ccs region of bo is
already cleared.
v2:
- Rebase.
v3:
- Maintain order of xe_bo_move_notify for ttm_bo_type_sg.
v4:
- xe_migrate_copy can be used to copy src to dst bo on igfx too.
Add a bool which handles only ccs metadata copy.
v5:
- on dgfx ccs should be cleared even if the bo is not compression enabled.
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
- The XY_CTRL_SURF_COPY_BLT instruction operating on ccs data expects
size in pages of main memory for which CCS data should be copied.
- The bitfield representing copy size in XY_CTRL_SURF_COPY_BLT has
shifted one bit higher in the instruction.
v2:
- Fix the num_pages for ccs size calculation.
- Address nits (Thomas)
v3:
- Use FIELD_PREP and FIELD_FIT instead of shifts and numbers.(Matt)
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
The idea being out-syncs can signal indicating all previous operations
on the bind queue are complete. An example use case of this would be
support for implementing vkQueueWaitIdle easily.
All in-syncs are waited on before signaling out-syncs. This is
implemented by forming a composite software fence of in-syncs and
installing this fence in the out-syncs and exec queue last fence slot.
The last fence must be added as a dependency for jobs on user exec
queues as it is possible for the last fence to be a composite software
fence (unordered, ioctl with zero bb or binds) rather than hardware
fence (ordered, previous job on queue).
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Update xe_migrate_prepare_vm() to use the usm batch buffer even for
servicing device page faults on integrated platforms. And as we have
no VRAM on integrated platforms, device pagefault handler should not
attempt to migrate into VRAM.
LNL is first integrated platform to support device pagefaults.
Signed-off-by: Brian Welty <brian.welty@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
There could be active fences already in the dma-resv for the object
prior to clearing. Make sure to input them as dependencies for the clear
job.
v2 (Matt B):
- We can use USAGE_KERNEL here, since it's only the move fences we
care about here. Also add a comment.
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Spec says: "This is a privileged command; it will not be effective (will
be converted to a no-op) if executed from within a non-privileged batch
buffer." However here it looks like we are just emitting it inside some
bb which was jumped to via the ppGTT, which should be considered
a non-privileged address space.
It looks like we just need some way of preventing things like the
emit_pte() and later copy/clear being preempted in-between so rather
just emit directly in the ring for migration jobs.
Bspec: 45716
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Extracting the common MI_* instructions that can be used with any engine
to their own header will make it easier as we add additional engine
instructions in upcoming patches.
Also, since the majority of GPU instructions (both MI and non-MI) have
a "length" field in bits 7:0 of the instruction header, a common define
is added for that. Instruction-specific length fields are still defined
for special case instructions that have larger/smaller length fields.
v2:
- Use "instr" instead of "inst" as the short form of "instruction"
everywhere. (Lucas)
- Include xe_reg_defs.h instead of the i915 compat header. (Lucas)
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20231016163449.1300701-12-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
MI_STORE_DATA_IMM can store either dword values or qword values, and can
store more than one value if the instruction's length field is large
enough. Create explicit defines to specify the number of dwords/qwords
to be stored, which will set the instruction length correctly and, if
necessary, turn on the 'store qword' bit.
While we're here, also replace an open-coded version of
MI_STORE_DATA_IMM with the common macros.
Bspec: 60246
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20231016163449.1300701-11-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Device Physical Address (DPA) is the starting offset device memory.
Update xe_migrate identity map base PTE entries to start at dpa_base
instead of 0.
The VM offset value should be 0 relative instead of DPA relative.
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: "Michael J. Ruhl" <michael.j.ruhl@intel.com>
Signed-off-by: David Kershner <david.kershner@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Set bits 30 and 31 of XY_FAST_COPY_BLT's dword1 for XeHP and above.
Destination or source being Y-Major is selected on dword0 and there's
nothing to set on dword1. According to the bspec for Xe2,
"Behavior is undefined when programmed the value 0". Also for XeHP,
the only value allowed in those bits is 0b11, not being possible to
select "Legacy Tile-Y" anymore, only the newer Tile4.
So, unconditionally set those bits for graphics IP 12.50 and above.
v2: Reword commit message and extend it to graphics version >= 12.50
(Matt Roper)
Bspec: 57567
Cc: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Haridhar Kalvala <haridhar.kalvala@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://lore.kernel.org/r/20230929213640.3189912-4-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Change the xelp_pte_encode() and xelp_pde_encode() functions to use the
platform-dependent pat_index. The same function can be used for all
platforms as they only need to encode the pat_index bits in the same
pte/pde layout. For platforms that don't have the most significant bit,
as long as they don't return a bogus index they should be fine.
v2: Use the same logic to encode pde as it's compatible with previous
logic, it's more future proof and also fixes the cache setting for
PVC (Matt Roper)
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://lore.kernel.org/r/20230927193902.2849159-10-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
On platforms with multiple BCS engines (i.e., PVC and Xe2), not all BCS
engines are created equal. The BCS0 engine is what the specs refer to
as a "resource copy engine," which supports the platform's full set of
copy/fill instructions. In contast, the non-BCS0 "service copy" engines
are more streamlined and only support a subset of the GPU instructions
supported by the resource copy engine. Platforms with both types of
copy engines always support the MEM_COPY and MEM_SET instructions which
can be used for simple copy and fill operations on either type of BCS
engine. Since the simple MEM_SET instruction meets the needs of Xe's
migrate code (and since the more elaborate XY_FAST_COLOR_BLT instruction
isn't available to use on service copy engines), we always prefer to use
MEM_SET for clearing buffers on our newer platforms.
We've been using a 'has_link_copy_engine' feature flag to keep track of
which platforms should use MEM_SET for fills. However a feature flag
like this is unnecessary since we can already derive the presence of
service copy engines (and in turn the MEM_SET instruction) just by
looking at the platform's pre-fusing engine list. Utilizing the engine
list for this also avoids mistakes like we've made on Xe2 where we
forget to set the feature flag in the IP definition.
For clarity, "service copy" is a general term that covers any blitter
engines that support a limited subset of the overall blitter instruction
set (in practice this is any non-BCS0 blitter engine). The "link copy
engines" introduced on PVC and the "paging copy engine" present in Xe2
are both instances of service copy engines.
v2:
- Rewrite / expand the commit message. (Bala)
- Fix checkpatch whitespace error.
Bspec: 65019
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Reviewed-by: Haridhar Kalvala <haridhar.kalvala@intel.com>
Link: https://lore.kernel.org/r/20230927205143.2695089-2-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
The XE_WARN_ON macro maps to WARN_ON which is not justified
in many cases where only a simple debug check is needed.
Replace the use of the XE_WARN_ON macro with the new xe_assert
macros which relies on drm_*. This takes a struct drm_device
argument, which is one of the main changes in this commit. The
other main change is that the condition is reversed, as with
XE_WARN_ON a message is displayed if the condition is true,
whereas with xe_assert it is if the condition is false.
v2:
- Rebase
- Keep WARN splats in xe_wopcm.c (Matt Roper)
v3:
- Rebase
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Some copy hardware engine instances are faster than others on PVC.
Use a virtual engine of these plus the reserved instance for the migrate
engine on PVC. The idea being if a fast instance is available it will be
used and the throughput of kernel copies, clears, and pagefault
servicing will be higher.
v2: Use OOB WA, use all copy engines if no WA is required
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
If an engine is only destroyed on driver unload, we can skip its
clean-up steps with the GuC because the GuC is going to be tuned off as
well, so it doesn't matter if we're in sync with it or not. Currently,
we apply this optimization to all engines marked as kernel, but this
stops us to supporting kernel engines that don't stick around until
unload. To remove this limitation, add a separate flag to indicate if
the engine is expected to only be destryed on driver unload and use that
to trigger the optimzation.
While at it, add a small comment to explain what each engine flag
represents.
v2: s/XE_BUG_ON/XE_WARN_ON, s/ENGINE/EXEC_QUEUE
v3: rebased
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20230822173334.1664332-3-daniele.ceraolospurio@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Make a xe_mem_region structure which will be used in the
coming patches. The new structure is used in both xe device
level (xe->mem.vram) and xe_tile level (tile->vram).
Make the definition of xe_mem_region.dpa_base to be the DPA
base of this memory region and change codes according to
this new definition.
v1:
- rename xe_mem_region.base to dpa_base per conversation with Mike
Ruhl
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Replace calls to XE_BUG_ON() with calls XE_WARN_ON() which in turn calls
WARN() instead of BUG(). BUG() crashes the kernel and should only be
used when it is absolutely unavoidable in case of catastrophic and
unrecoverable failures, which is not the case here.
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Integrated graphics 1270 and beyond should set the PTE_LM bit in the PTE
when it's stolen memory. Add a new function, xe_bo_is_stolen_devmem(),
and use it when encoding the PTE.
In some places in the spec the PTE bit is called "Local Memory",
abbreviated as LM, and in others it's called "Device Memory" (DM). Since
we moved away from "Local Memory" and preferred the "vram" terminology,
also rename the macros as DM to follow the name of the new function.
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://lore.kernel.org/r/20230726160708.3967790-7-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
We currently have a race between bind engines which can result in
corrupted page tables leading to faults.
A simple example:
bind A 0x0000-0x1000, engine A, has unsatisfied in-fence
bind B 0x1000-0x2000, engine B, no in-fences
exec A uses 0x1000-0x2000
Bind B will pass bind A and exec A will fault. This occurs as bind A
programs the root of the page table in a bind job which is held up by an
in-fence. Bind B in this case just programs a leaf entry of the
structure.
To fix use range-fence utility to track cross bind engine conflicts. In
the above example bind A would insert an dependency into the range-fence
tree with a key of 0x0-0x7fffffffff, bind B would find that dependency
and its bind job would scheduled behind the unsatisfied in-fence and
bind A's job.
Reviewed-by: Maarten Lankhorst<maarten.lankhorst@linux.intel.com>
Co-developed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reduce the number of warnings reported by checkpatch.pl from 118 to 48 by
addressing those warnings types:
LEADING_SPACE
LINE_SPACING
BRACES
TRAILING_SEMICOLON
CONSTANT_COMPARISON
BLOCK_COMMENT_STYLE
RETURN_VOID
ONE_SEMICOLON
SUSPECT_CODE_INDENT
LINE_CONTINUATIONS
UNNECESSARY_ELSE
UNSPECIFIED_INT
UNNECESSARY_INT
MISORDERED_TYPE
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Rather than open coding VM binds and VMA tracking, use the GPUVA
library. GPUVA provides a common infrastructure for VM binds to use mmap
/ munmap semantics and support for VK sparse bindings.
The concepts are:
1) xe_vm inherits from drm_gpuva_manager
2) xe_vma inherits from drm_gpuva
3) xe_vma_op inherits from drm_gpuva_op
4) VM bind operations (MAP, UNMAP, PREFETCH, UNMAP_ALL) call into the
GPUVA code to generate an VMA operations list which is parsed, committed,
and executed.
v2 (CI): Add break after default in case statement.
v3: Rebase
v4: Fix some error handling
v5: Use unlocked version VMA in error paths
v6: Rebase, address some review feedback mainly Thomas H
v7: Fix compile error in xe_vma_op_unwind, address checkpatch
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Migration primarily focuses on the memory associated with a tile, so it
makes more sense to track this at the tile level (especially since the
driver was already skipping migration operations on media GTs).
Note that the blitter engine used to perform the migration always lives
in the tile's primary GT today. In theory that could change if media
GTs ever start including blitter engines in the future, but we can
extend the design if/when that happens in the future.
v2:
- Fix kunit test build
- Kerneldoc parameter name update
v3:
- Removed leftover prototype for removed function. (Gustavo)
- Remove unrelated / unwanted error handling change. (Gustavo)
Cc: Gustavo Sousa <gustavo.sousa@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Acked-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://lore.kernel.org/r/20230601215244.678611-15-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Since memory and address spaces are a tile concept rather than a GT
concept, we need to plumb tile-based handling through lots of
memory-related code.
Note that one remaining shortcoming here that will need to be addressed
before media GT support can be re-enabled is that although the address
space is shared between a tile's GTs, each GT caches the PTEs
independently in their own TLB and thus TLB invalidation should be
handled at the GT level.
v2:
- Fix kunit test build.
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20230601215244.678611-13-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
The _io_offset helper function is returning an offset into the GPU
address space. Using the CPU address offset (io_) is not correct.
Rename to reflect usage.
Update to use GPU offset information.
Update PT dma_offset to use the helper
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
There is no mention that migrate_copy() will skip copying the CCS aux
state for all types of vram -> vram transfers. Currently we don't need
such a facility but might be surprising if we ever do.
v2: (Lucas):
- s/lmem/vram/ in the commit message
- Tidy up the code a bit; use one emit_copy_ccs()
v3:
- Reword the commit message
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Acked-by: Nirmoy Das <nirmoy.das@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Modify the xe_migrate_copy() function somewhat to explicitly allow
copying of data between two buffer objects including system memory
buffer objects. Update the migrate test accordingly.
v2:
- Check that buffer object sizes match when copying (Matthew Auld)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>