linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-30 04:22:32 -04:00

Author	SHA1	Message	Date
farah kassabri	84190b92cc	accel/habanalabs: fix bug in decoder wait for cs completion The decoder interrupts are handled in the interrupt context same as all user interrupts. In such case, the wait list should be protected by spin_lock_irqsave in order to avoid deadlock that might happen with the user submission flow. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:24 +03:00
Dafna Hirschfeld	2ba0236f5b	accel/habanalabs: remove wrong doc for init_phys_pg_pack_from_userptr The function does not pin the pages so remove that from the inline doc. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:24 +03:00
Arnd Bergmann	c1805bf36a	accel/habanalabs: add missing debugfs function stubs Two function stubs were removed in an earlier commit but are now needed again: drivers/accel/habanalabs/common/device.c: In function 'hl_device_init': drivers/accel/habanalabs/common/device.c:2231:14: error: implicit declaration of function 'hl_debugfs_device_init'; did you mean 'drm_debugfs_dev_init'? [-Werror=implicit-function-declaration] 2231 \| rc = hl_debugfs_device_init(hdev); drivers/accel/habanalabs/common/device.c:2367:9: error: implicit declaration of function 'hl_debugfs_device_fini'; did you mean 'hl_debugfs_remove_file'? [-Werror=implicit-function-declaration] 2367 \| hl_debugfs_device_fini(hdev); \| ^~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:24 +03:00
Oded Gabbay	1630d14f8d	accel/habanalabs: minor cosmetic update to habanalabs.h - Update copyright years - Align fields in struct hl_userptr - Fix comments Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>	2023-10-09 12:37:24 +03:00
Oded Gabbay	4355f2c322	accel/habanalabs/gaudi: remove define used for simulator We don't support simulator in upstream. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>	2023-10-09 12:37:24 +03:00
Oded Gabbay	87c60e23f2	accel/habanalabs: remove leftover code This code was added as part of a bigger feature which was never upstreamed, so remove this code. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>	2023-10-09 12:37:24 +03:00
Oded Gabbay	6fc69ca84a	accel/habanalabs: print device name when it is removed Notifies the user which device was removed. It is important in a server with multiple devices. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>	2023-10-09 12:37:23 +03:00
Oded Gabbay	e5873f6b91	accel/habanalabs: remove unused field flags in struct wait_interrupt_data is not used anywhere so remove it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>	2023-10-09 12:37:23 +03:00
Ohad Sharabi	ff92d01052	accel/habanalabs: trace dma map sgtable Traces the DMA [un]map_sgtable using the new traces we added. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:23 +03:00
Oded Gabbay	d7aa294805	accel/habanalabs: remove unused asic functions asic_dma_{un}map_single() asic-specific functions are no longer called from the common code, so delete these functions. In addition, delete the gaudi2 implementation as they are also not called. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>	2023-10-09 12:37:23 +03:00
Ariel Suller	de8773fdc5	accel/habanalabs: update boot status print FW shutdown preparation status was added to spec. Signed-off-by: Ariel Suller <asuller@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:23 +03:00
Dafna Hirschfeld	674f77798e	accel/habanalabs: extend preboot timeout when preboot might take longer There are cases such when FW runs MBIST, that preboot is expected to take longer than the usual. In such cases the firmware reports status SECURITY_READY/IN_PREBOOT and we extend the timeout waiting for it. This is currently implemented for Gaudi2 only. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
Tomer Tayar	3824be1f4d	accel/habanalabs: add debug prints to dump content of SG table for dma-buf Add debug prints to dump the content of the SG table which is prepared when the dma-buf map op is called. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
Tomer Tayar	d16945f602	accel/habanalabs: add missing offset handling for dma-buf On devices with virtual device memory (Gaudi2 onwards), user can provide an offset within an allocated device memory from which he wants to export a dma-buf object. The offset value is verified by driver, but it is not taken into consideration when the importer driver maps the dma-buf and the SG table it prepared. Add the missing offset handling. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
Tomer Tayar	878ebc14db	accel/habanalabs: set hl_dmabuf_priv.device_address only when needed The device_address member of 'struct hl_dmabuf_priv' is used only when virtual device memory is not supported and dma-buf is exported from address. Set the value of this field only when it is relevant, and add "phys" to its name so it would be clearer that it can't be a device virtual address. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
Tomer Tayar	bb644f6197	accel/habanalabs: fix SG table creation for dma-buf mapping In some cases the calculated number of required entries for the dma-buf SG table is wrong. For example, if the page size is larger than both the dma max segment size of the importer device and from the exported side, or if the exported size is part of a phys_pg_pack that is composed of several pages. In these cases, redundant entries will be added to the SG table. Modify the method that the number of entries is calculated, and the way they are prepared. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
farah kassabri	ba24b5ec78	accel/habanalabs: split user interrupts pending list Currently driver maintain one list for both pending user interrupts which seeks to wait till CQ reaches it's target value and also the ones that seeks to get timestamp records when the CQ reaches it's target value. This causes delay in handling the waiters which gets higher priority than the timestamp records. In order to solve this, let's split the list into two, one for each case and each one is protected by it's own spinlock. Waiters will be handled within the interrupt context first, then the timestamp records will be set. Freeing the timestamp related memory will be handled in a workqueue. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
farah kassabri	1157b5d6b3	accel/habanalabs: optimize timestamp registration handler Currently we use dynamic allocation inside the irq handler in order to allocate free node to be used for the free jobs. This operation is expensive, especially when we deal with large burst of events records that get released at the same time. The alternative is to have pre allocated pool of free nodes and just fetch nodes from this pool at irq handling time instead of allocating them. In case the pool becomes full, then the driver will fallback to dynamic allocations. As part of the optimization also update the unregister flow upon re-using a timestamp record, by making the operation much simpler and quicker. We already have the record in the registration flow and now we just seek to re-use with different interrupt. Therefore, no need to look for buffer according to the user handle. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
farah kassabri	0165994c21	accel/habanalabs: fix bug in timestamp interrupt handling There is a potential race between user thread seeking to re-use a timestamp record with new interrupt id, while this record is still in the middle of interrupt handling and it is about to be freed. Imagine the driver set the record in_use to 0 and only then fill the free_node information. This might lead to unpleasant scenario where the new registration thread detects the record as free to use, and change the cq buff address. That will cause the free_node to get the wrong buffer address to put refcount to. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
Tomer Tayar	d89d329a2b	accel/habanalabs: tiny refactor of hl_map_dmabuf() alloc_sgt_from_device_pages() includes relatively many parameters, and in a subsequent change another offset parameter is going to be added. Using structure fields directly when calling this function, and in hl_map_dmabuf() it is done twice, makes it a little bit difficult to understand the meaning of the parameters. To make it clearer, assign the required values into local variables with explicit names, and use the variables when calling the function. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
Tomer Tayar	0b75cb5b24	accel/habanalabs: export dma-buf only if size/offset multiples of PAGE_SIZE It is currently allowed for a user to export dma-buf with size and offset that are not multiples of PAGE_SIZE. The exported memory is mapped for the importer device, and there it will be rounded to PAGE_SIZE, leading to actually exporting more than the user intended to. To make the user be aware of it, accept only size and offset which are multiple of PAGE_SIZE. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
Tomer Tayar	efbca048c6	accel/habanalabs: use exported size from dma_buf and not from phys_pg_pack The 'exported_size' member in 'struct hl_vm_phys_pg_pack' is used to keep the exported dma-buf size, to be later used when the buffer is mapped. However it is possible that the same phys_pg_pack will be exported more than once, and independently of when the mapping takes place. Remove this member from the phys_pg_pack structure, and simply use the size in the dma-buf object as the exported size when mapping. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Tomer Tayar	dfdbc55a9c	accel/habanalabs: always pass exported size to alloc_sgt_from_device_pages() For Gaudi1 the exported dma-buf is always composed of a single page, and therefore the exported size is equal to this page's size. When calling alloc_sgt_from_device_pages(), we pass 0 as the exported size and internally calculate it as "number of pages * page size". This makes alloc_sgt_from_device_pages() less clear, because the exported size parameter is not understood as a restriction on the pages' size. Modify to always pass the exported size explicitly. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
farah kassabri	051868d93c	accel/habanalabs: prevent sending heartbeat before events are enabled After the heartbeat mechanism is now expanded to be used also for EQ health check, we shouldn't send heartbeat messages to FW before driver allow events to be received from FW. Because if the driver will send two heartbeats before it enables events to be received from FW, then the EQ health check will fail and reset the device. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
farah kassabri	764bfd138f	accel/habanalabs/gaudi2: add eq health check using irq This is the second patch for applying the eq health check mechanism which will add support for the interrupt flow for gaudi2 asic. More info about the interrupt mechanism: set a dedicated msix for the eq error interrupt, and add interrupt handler for it. when FW detects some issue with EQ like EQ_FULL, it'll raise that interrupt and driver should reset the device. Driver will inform the FW which msix index to use through the already existing handshake mechanism which will send msix info message to fw. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
farah kassabri	7c4130e6dd	accel/habanalabs/gaudi2: handle eq health heartbeat check Add mechanism for fw eq health check. this will be done using two flows: using the heartbeat mechanism and raising a dedicated interrupt to indicate an eq failure like EQ full. This patch will add implementation for the eq heartbeat for gaudi2 asic. More info about the heartbeat mechanism: Expand the heartbeat mechanism to monitor a new event that will be sent from FW upon receiving heartbeat message. that way driver can know that the eq is working or not. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Hen Alon	0648c4d080	accel/habanalabs: add tsc clock sampling to clock sync info Add tsc clock to clock sync info, to enable using this clock for sampling and sync it with device time. Signed-off-by: Hen Alon <halon@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Dafna Hirschfeld	e0f452802b	accel/habanalabs: fix inline doc typos Fix two typos Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Dafna Hirschfeld	ab574f6a81	accel/habanalabs: disable events ioctls on control device Because it is not used and also, for graceful reset to work those ioctls should run on the compute device. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
David Meriin	2b76129c5a	accel/habanalabs: move cpucp interface to linux/habanalabs The CPUCP interface is moved to a shared folder outside of accel as a pre-requisite to upstream the NIC drivers that will also include this file. Signed-off-by: David Meriin <dmeriin@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Justin Stitt	571bfeb48a	accel/habanalabs: refactor deprecated strncpy `strncpy` is deprecated for use on NUL-terminated destination strings [1]. A suitable replacement is `strscpy` [2] due to the fact that it guarantees NUL-termination on its destination buffer argument which is _not_ the case for `strncpy`! There is likely no bug happening in this case since HL_STR_MAX is strictly larger than all source strings. Nonetheless, prefer a safer and more robust interface. It should also be noted that `strscpy` will not pad like `strncpy`. If this NUL-padding behavior is _required_ we should use `strscpy_pad` instead of `strscpy`. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Tomer Tayar	57963ff8ad	accel/habanalabs: Move ioctls to the device specific ioctls range To use drm_ioctl(), move the ioctls to the device specific ioctls range at [DRM_COMMAND_BASE, DRM_COMMAND_END). Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Tomer Tayar	fe77368c0f	accel/habanalabs: register compute device as an accel device Register the compute device as an accel device, and remove the creation of the habanalabs compute char device. The IOCTLs in this patch are still handled by the current driver handler. Moving to DRM IOCTL handling requires moving the IOCTLs numbers to a specific range, so it will be handled in subsequent patches. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Ofir Bitton	a8ab1a81cc	accel/habanalabs: add info ioctl for engine error reports User gets notification for every engine error report, but he still lacks the exact engine information. Hence, we allow user to query for the exact engine reported an error. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Tomer Tayar	10926f6005	accel/habanalabs: set default device release watchdog T/O as 30 sec After being notified about certain errors, user is expected to finish his post-errors actions and to release the device within some timeout, after which is deice is being reset. The default timeout value is 5 sec, which in some case is not enough for a user application to collect debug data. Increase the default value to 30 sec. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Dani Liberman	8887279092	accel/habanalabs: handle f/w reserved dram space request It is possible for FW to request reserved space in dram. If the device supports this option, it will retrieve the size from the f/w and will reserve it. Currently we add the common code infrastructure to support it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Ofir Bitton	d33c3d0541	accel/habanalabs: dump temperature threshold boot error Add dump of an error reported from f/w during boot time. This error indicates a failure with setting temperature threshold. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Oded Gabbay	37d72439a4	accel/habanalabs: reset device if scrubbing failed If scrubbing memory after user released device has failed it means the device is in a bad state and should be reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>	2023-10-09 12:37:19 +03:00
Oded Gabbay	89803af535	accel/habanalabs: remove pdev check on idle check Our simulator supports idle check so no need anymore to check if pdev exists. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>	2023-10-09 12:37:18 +03:00
farah kassabri	2da9f8d805	accel/habanalabs: fix wait_for_interrupt abortion flow When the driver needs to abort waiters for interrupts, for cases such as critical events that occur and driver need to do hard reset, in such scenario the driver will complete the fence to wake up the waiting thread, and will set the fence error indication. The return value of the completion API will be greater than 0 since it will return the timeout, but as this indicates successful completion, the driver should mark it as aborted. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
farah kassabri	eaa43a06b7	accel/habanalabs: Allow single timestamp registration request at a time Protect against concurrency of user requesting to register a timestamp offset (where the driver fills the timestamp when the command submission has finished executing) to a specific user interrupt ID. The protection is basically to allow only one timestamp registration request to be handled at a time. This is needed because the user can decide to re-use a timestamp offset (register an already registered offset, to a different interrupt ID). This means the request will cause the timestamp node to move from one interrupt list to another interrupt list. In such scenario, without proper protection, we could end up adding the same node twice to the interrupts wait lists. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Koby Elbaz	964b1f675d	accel/habanalabs: rename fd_list to hpriv_list Every time an FD is returned to the user, the driver adds a corresponding private structure to the list. Yet, it's still a list of private structures rather than of FDs. Remove, as well, an unnecessary comment. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Koby Elbaz	942f18c56d	accel/habanalabs: call put_pid after hpriv list is updated Because we might still be using related resources, decrementing PID's reference count should be done at later stages of the device release. A good place is right after the representing private structure is removed from LKD's list. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Koby Elbaz	2b541cf913	accel/habanalabs: print return code when process termination fails As part of driver teardown, we attempt to kill all user processes. It shouldn't fail, but if it does we want to print the error code that the kapi returned to us. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
farah kassabri	bffd2f16ae	accel/habanalabs: fix standalone preboot descriptor request The preboot used to statically allocate memory for the comms descriptor on the device memory when driver requested the descriptor information. Now preboot moved to dynamic memory allocation where it wants to check the size the driver expects vs. what the f/w expects. Note there are no backward compatibility issues as older f/w versions simply ignore this value. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Koby Elbaz	e4a97d6b62	accel/habanalabs: set device status 'malfunction' while in rmmod hl_device_status() returns the status of an acquired device. If a device is going down (following an rmmod cmd), it should be marked as an unusable/malfunctioning device, and hence should not be acquired. However, since this was not the case so far (i.e., a device going down would inaccurately return 'in reset' status allowing the user to acquire the device) it introduced a bug where as part of a reset flow, the driver could not kill processes that have not run yet, and since those processes aren't blocked from reacquiring a device, we get eventually a new flow of a driver attempting to kill all processes in a list that can't be ever really empty. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Tomer Tayar	e7b2902a33	accel/habanalabs: print task name upon creation of a user context It is useful for debug to know which user process have acquired the device. Add this info to the relevant debug print, in addition to the already printed user context's ASID. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Tomer Tayar	7dccb064a7	accel/habanalabs: print task name and request code upon ioctl failure When an ioctl fails, it is useful to know what is the task command name and the full ioctl request code, in addition to the task pid and the ioctl number. Add the additional information to the relevant debug error prints. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:17 +03:00
Tomer Tayar	a35c997601	accel/habanalabs: update pending reset flags with new reset requests If hl_device_cond_reset() is called while a reset is already pending but hasn't started, the reset request will be dropped. If the flags of the new request are more severe, e.g. a hard reset while the pending reset is a compute reset, the eventual reset won't be suitable for the device status. To prevent such cases, update the pending reset flags with the new requests flags before the requests are dropped. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:17 +03:00
Tomer Tayar	5d89ce6f8c	accel/habanalabs: prevent immediate hard reset due to 2 adjacent H/W events When a H/W event is received while a user is registered to events, no immediate hard reset will happen, and instead the user will be notified and will have some time to handle it and eventually release the device, after which the reset will be done. If a user, as part of the handling and as part of the cleanup steps towards releasing the device, unregisters from receiving those events, and at that time an adjacent H/W event is received, it will be assumed that the user is not registered to events and thus an immediate hard reset is required. To prevent such an unwanted immediate reset, modify the driver to perform it if the user is not registered to events AND we don't already have a pending reset for a previous H/W event. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:17 +03:00

1 2 3 4

156 Commits