Commit Graph

41 Commits

Author SHA1 Message Date
Arnd Bergmann
78e9b217d7 accel/habanalabs: add more debugfs stub helpers
Two functions got added with normal prototypes for debugfs, but not
alternative when building without it:

drivers/accel/habanalabs/common/device.c: In function 'hl_device_init':
drivers/accel/habanalabs/common/device.c:2177:14: error: implicit declaration of function 'hl_debugfs_device_init'; did you mean 'hl_debugfs_init'? [-Werror=implicit-function-declaration]
drivers/accel/habanalabs/common/device.c:2305:9: error: implicit declaration of function 'hl_debugfs_device_fini'; did you mean 'hl_debugfs_remove_file'? [-Werror=implicit-function-declaration]

Add stubs for these as well.

Fixes: 3b9abb4fa6 ("accel/habanalabs: expose debugfs files later")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Acked-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20230609120636.3969045-1-arnd@kernel.org
2023-07-20 12:51:59 +02:00
Dani Liberman
e6f49e96bc accel/habanalabs: refactor error info reset
Moved error info reset code to single function for future use from
other places in the driver.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:56 +03:00
Oded Gabbay
569210233a accel/habanalabs: remove sim code
There were a few places where simulator only code got into the upstream.
Remove those places that can confuse other developers.

Fixes: 2a0a839b6a ("habanalabs: extend fatal messages to contain PCI info")
Cc: Moti Haimovski <mhaimovski@habana.ai>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:56 +03:00
Ofir Bitton
583f12a80d accel/habanalabs: remove support for mmu disable
As mmu disable mode is only used for bring-up stages, let's remove this
option and all code related to it.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:55 +03:00
Tomer Tayar
3b9abb4fa6 accel/habanalabs: expose debugfs files later
Currently the debugfs root folder and files for a device are created at
an early step, before the device initialization and before the char
device and sysfs files are exposed to user.
As there is no real reason not to do it together with the device
creation, postpone it to be done right afterwards.

The initialization of the debugfs entry structure is left in its
current position because it is used before creating the files.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Ofir Bitton
d8b9cea584 accel/habanalabs: add pci health check during heartbeat
Currently upon a heartbeat failure, we don't know if the failure
is due to firmware hang or due to a bad PCI link. Hence, we
are reading a PCI config space register with a known value (vendor ID)
so we will know which of the two possibilities caused the heartbeat
failure.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Koby Elbaz
9a4e44a4ee accel/habanalabs: refactor abort of completions and waits
Aborting CS completions should be in command_submission.c but aborting
waiting for user interrupts should be in device.c.

This separation is also for adding more abort operations in the future.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Dafna Hirschfeld
cc7b790d41 accel/habanalabs: do soft-reset using cpucp packet
This is done depending on the FW version. The cpucp method is
preferable and saves scratchpads resource.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:34 +03:00
Dafna Hirschfeld
a12428acf8 accel/habanalabs: check fw version using sw version
The fw inner version is less trustable, instead use the fw general
sw release version.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Dafna Hirschfeld
dd5667ff6f accel/habanalabs: extract and save the FW's SW major/minor/sub-minor
It is not always possible to know the FW's SW version from the inner FW
version. Therefore we should extract the general SW version in addition
to the FW version and use it in functions like
'hl_is_fw_ver_below_1_9' etc.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Dafna Hirschfeld
3071247ca0 accel/habanalabs: rename fw_{major/minor}_version to fw_inner_{major/minor}_ver
We later want to add fields for Firmware SW version. The current
extracted FW version is the inner FW versioning so the new name
is better and also better differentiate from the FW's SW version.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Ofir Bitton
38f3c732fc accel/habanalabs: fixes for unexpected error interrupt
Removing redundant asic prop variable as we don't need to expose this
to common code. In addition, fix some typos.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:34 +03:00
Tal Cohen
802f25b6c2 accel/habanalabs: sync f/w events interrupt in hard reset
Receiving events from FW, while the device is in hard reset, causes
a warning message in Driver log. The message may point to a
problem in the Driver or FW. But It also can appear as a result
of events that have been sent from FW just before the hard reset.
In order to avoid receiving events from FW while the device is in reset
and is already in 'disabled' mode, sync the f/w events interrupt right
before setting the device to 'disabled'.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:34 +03:00
Tomer Tayar
9cf56f0d97 accel/habanalabs: remove completion from abnormal interrupt work name
Decoder abnormal interrupts are for errors and not for completion, so
rename the relevant work and work function to not include 'completion'.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:34 +03:00
Dani Liberman
ec48493183 accel/habanalabs: change razwi handle after fw fix
FW had one data route for tpc0 and tpc1 when running in secured mode
and a different one when running without secured mode. After fw fixed
this issue, both mode have the same data path.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:34 +02:00
Ofir Bitton
e1ef053e08 accel/habanalabs: add handling for unexpected user event
In order for the user to be aware of unexpected events in Gaudi2 that
aren't assigned to a specific engine, we are adding the handling of
this dedicated interrupt.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:34 +02:00
Ofir Bitton
958e47977b accel/habanalabs: expose rotator mask to userspace
All engine masks are exposed to user, make sure user gets the
correct rotator enabled mask in gaudi2.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:34 +02:00
Tomer Tayar
2e8e9a895c accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release()
The memory manager IDR is currently destroyed when user releases the
file descriptor.
However, at this point the user context might be still held, and memory
buffers might be still in use.
Later on, calls to release those buffers will fail due to not finding
their handles in the IDR, leading to a memory leak.
To avoid this leak, split the IDR destruction from the memory manager
fini, and postpone it to hpriv_release() when there is no user context
and no buffers are used.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Koby Elbaz
f7f0085eec accel/habanalabs: add uapi to stall/resume engine
The user might want to stall/resume engines to perform power testing
for various scenarios. Because our current
HL_CS_FLAGS_ENGINE_CORE_COMMAND command only handles the engines' cores,
we need to add another opcode for handling entire engine and not just
its core.

The user supplies an array, where each entry holds the engine's ID and
the command to send to the engine. The size of the array is limited
by the number of engines in the ASIC (only Gaudi2 is currently
supported).

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Koby Elbaz
7ffb5ced2b accel/habanalabs: use a mutex rather than a spinlock
There are two reasons why mutex is better here:
1. There's a critical section relatively long, where in
certain scenarios (e.g., multiple VM allocations) taking a spinlock
might cause noticeable performance degradation.
2. It will remove the incorrect usage of mutex under
spin_lock (where preemption is disabled).

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
farah kassabri
25ebbc57ca accel/habanalabs: fix few misspelled words in the code
Run spell checker on the code and fix accordingly.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Dafna Hirschfeld
5e09ae9203 accel/habanalabs: change hw_fini to return int to indicate error
We later use cpucp packet for soft reset which might fail
so we should be able propagate the failure case.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:14 +02:00
Sagiv Ozeri
66116a3981 accel/habanalabs: organize hl_device structure comment
Make the comments align with the order of the fields in the structure

Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:14 +02:00
Dafna Hirschfeld
4a2e9d11fc accel/habanalabs: don't trace cpu accessible dma alloc/free
The cpu accessible dma allocations use the gen_pool api which actually
does not allocate new memory from the system but manages memory already
allocated before. When tracing this together with real dma
allocation/free it cause confusing logs like a '0' dma address and
a cpu address appearing twice etc.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Tomer Tayar
34bef313e4 accel/habanalabs: remove hl_irq_handler_default()
hl_irq_handler_default() is not used and can be removed.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Ofir Bitton
4713ace324 accel/habanalabs: add support for TPC assert
In order to allow TPC engines to raise an assert, we must expose
the relevant MSIX interrupt to the user so he will configure the engine
correctly. In addition, we implement the corresponding interrupt
handler that will notify the user upon such an event.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Ofir Bitton
bcfcd084aa accel/habanalabs: capture interrupt timestamp in handler
In order for interrupt timestamp to be more accurate we should
capture it during the interrupt handling rather than in threaded
irq context.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Tal Cohen
6012235899 accel/habanalabs: change user interrupt to threaded IRQ
We prefer not to handle the user interrupt job inside the interrupt
context. Instead, use threaded IRQ to handle the user interrupts.
This will allow to avoid disabling interrupts when the user process
registers for a new event and to avoid long handling inside an
interrupt.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Ofir Bitton
7fc0d011c3 accel/habanalabs: expose engine core int reg address
In order for engine cores to raise interrupts towards FW, They need
to know which register the event data should be written to.
Hence, we forward the relevant scratchpad register received during
dynamic regs handshake with FW.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Moti Haimovski
313e9f63b7 accel/habanalabs: add critical-event bit in notifier
Enhance the existing user notifications by adding a HW and FW critical
event bits to be used when a HW or FW event occur that requires
both SW abort and hard-resetting the chip.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Tomer Tayar
d43bce6e76 accel/habanalabs: add info when FD released while device still in use
When user closes the device file descriptor, it is checked whether the
device is still in use, and a message is printed if it is.
To make this message more informative, add to this print also the reason
due to which the device is considered as in use.
The possible reasons which are checked for now are active CS and
exported dma-buf.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Oded Gabbay
323adae99f accel/habanalabs: save class in hdev
It is more concise than to pass it to device init. Once we will add the
accel class, then we won't need to change the function signatures.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:11 +02:00
Koby Elbaz
f7d67c1cfd habanalabs/gaudi2: find decode error root cause
When a decode error happens, we often don't know the exact root
cause (the erroneous address that was accessed) and the exact engine
that created the erroneous transaction.

To find out, we need to go over all the relevant register blocks
in the ASIC. Once we find the relevant engine, we print its details
and the offending address.

This helps tremendously when debugging an error that was created
by running a user workload.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:14 +02:00
Ofir Bitton
75b6984ef6 habanalabs: optimize command submission completion timestamp
Completion timestamp is taken during the actual command submission
release. As the release happens in a work queue, the timestamp taken
is not accurate. Hence, we will take the timestamp in the interrupt
handler itself while propagating it to the release function.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:13 +02:00
Ofir Bitton
9a7d530a80 habanalabs: refactor user interrupt type
In order to support more user interrupt types in the future, we
enumerate the user interrupt type instead of using a boolean.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:13 +02:00
Ohad Sharabi
ab509d81c9 habanalabs: add set engines masks ASIC function
This function shall be used whenever components enable/binning masks
should be updated.

Usage is in one of the below cases:
- update user (or default) component masks
- update when getting the masks from FW (either CPUCP or COMMS)

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:11 +02:00
Koby Elbaz
571d1a7222 habanalabs: protect access to dynamic mem 'user_mappings'
When HL_INFO_USER_MAPPINGS IOCTL is called, we copy_to_user from
a dynamically allocated memory - 'user_mappings'.
Since freeing/allocating it happens in runtime (upon a page fault),
it not unlikely to access it even before being initially allocated
(i.e., accessing a NULL pointer).

The solution is to simply mark the spot when the err info has been
collected, and that way to know whether err info (either page fault
or RAZWI) is available to be read.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:11 +02:00
Koby Elbaz
78baccbdc3 habanalabs: refactor razwi/page-fault information structures
This refactor makes the code clearer and the new variables' names
better describe their roles.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:11 +02:00
Tomer Tayar
e2a079a206 habanalabs: verify that kernel CB is destroyed only once
Remove the distinction between user CB and kernel CB, and verify for
both that they are not destroyed more than once.

As kernel CB might be taken from the pre-allocated CB pool, so we need
to clear the handle destroyed indication when returning a CB to the
pool.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:10 +02:00
Ohad Sharabi
20faaeec37 habanalabs: add uapi to flush inbound HBM transactions
When doing p2p with a NIC device, the NIC needs to make sure all the
writes to the HBM (through the PCI bar of the Gaudi device) were
flushed.

It can be done by either the NIC or the host reading through the PCI
bar.

To support the host side, we supply a simple uapi to perform this flush
through the driver, because the user can't create such a transaction
by itself (the PCI bar isn't exposed to normal users).

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:10 +02:00
Oded Gabbay
e65e175b07 habanalabs: move driver to accel subsystem
Now that we have a subsystem for compute accelerators, move the
habanalabs driver to it.

This patch only moves the files and fixes the Makefiles. Future
patches will change the existing code to register to the accel
subsystem and expose the accel device char files instead of the
habanalabs device char files.

Update the MAINTAINERS file to reflect this change.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:10 +02:00