linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-19 15:24:02 -04:00

Author	SHA1	Message	Date
Dani Liberman	e6f49e96bc	accel/habanalabs: refactor error info reset Moved error info reset code to single function for future use from other places in the driver. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-06-08 12:35:56 +03:00
Oded Gabbay	569210233a	accel/habanalabs: remove sim code There were a few places where simulator only code got into the upstream. Remove those places that can confuse other developers. Fixes: `2a0a839b6a` ("habanalabs: extend fatal messages to contain PCI info") Cc: Moti Haimovski <mhaimovski@habana.ai> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-06-08 12:35:56 +03:00
Tomer Tayar	3b9abb4fa6	accel/habanalabs: expose debugfs files later Currently the debugfs root folder and files for a device are created at an early step, before the device initialization and before the char device and sysfs files are exposed to user. As there is no real reason not to do it together with the device creation, postpone it to be done right afterwards. The initialization of the debugfs entry structure is left in its current position because it is used before creating the files. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-06-08 12:35:54 +03:00
Ofir Bitton	d8b9cea584	accel/habanalabs: add pci health check during heartbeat Currently upon a heartbeat failure, we don't know if the failure is due to firmware hang or due to a bad PCI link. Hence, we are reading a PCI config space register with a known value (vendor ID) so we will know which of the two possibilities caused the heartbeat failure. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-06-08 12:35:54 +03:00
Koby Elbaz	9a4e44a4ee	accel/habanalabs: refactor abort of completions and waits Aborting CS completions should be in command_submission.c but aborting waiting for user interrupts should be in device.c. This separation is also for adding more abort operations in the future. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-06-08 12:35:54 +03:00
Tal Cohen	802f25b6c2	accel/habanalabs: sync f/w events interrupt in hard reset Receiving events from FW while the device is in hard reset causes a warning message in the driver log. The message may point to a problem in the driver or FW, but it can also appear as a result of events that were sent from FW just before the hard reset. To avoid receiving events from FW while the device is in reset and is already in 'disabled' mode, sync the f/w events interrupt right before setting the device to 'disabled'. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-04-08 10:39:34 +03:00
Tal Cohen	3a8d7c3a7d	accel/habanalabs: send disable pci when compute ctx is active Fix an issue in hard reset flow in which the driver didn't send a disable pci message if there was an active compute context. In hard reset, disable pci message should be sent no matter if a compute context exists or not. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-04-08 10:39:33 +03:00
Tal Cohen	a855f710f5	accel/habanalabs: remove duplicated disable pci msg The disable pci message is sent in reset device. It informs the FW not to raise more EQs. The Driver may ignore received EQs, when the device is in disabled mode. The duplication happens when hard reset is scheduled during compute reset and also performs 'escalate_reset_flow'. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-04-08 10:39:33 +03:00
Cai Huoqing	248ed9e227	accel/habanalabs: Remove redundant pci_clear_master Remove pci_clear_master to simplify the code, the bus-mastering is also cleared in do_pci_disable_device, like this: ./drivers/pci/pci.c:2197 static void do_pci_disable_device(struct pci_dev *dev) { u16 pci_command; pci_read_config_word(dev, PCI_COMMAND, &pci_command); if (pci_command & PCI_COMMAND_MASTER) { pci_command &= ~PCI_COMMAND_MASTER; pci_write_config_word(dev, PCI_COMMAND, pci_command); } pcibios_disable_device(dev); }. And dev->is_busmaster is set to 0 in pci_disable_device. Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-04-08 10:39:33 +03:00
Tomer Tayar	2e8e9a895c	accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release() The memory manager IDR is currently destroyed when user releases the file descriptor. However, at this point the user context might be still held, and memory buffers might be still in use. Later on, calls to release those buffers will fail due to not finding their handles in the IDR, leading to a memory leak. To avoid this leak, split the IDR destruction from the memory manager fini, and postpone it to hpriv_release() when there is no user context and no buffers are used. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-03-15 13:29:15 +02:00
Tomer Tayar	28fbc058f2	accel/habanalabs: use scnprintf() in print_device_in_use_info() compose_device_in_use_info() was added to handle the snprintf() return value in a single place. However, the buffer size in print_device_in_use_info() is set such that it would be enough for the max possible print, so compose_device_in_use_info() is not really needed. Moreover, scnprintf() can be used instead of snprintf(), which saves checking whether the return value is larger than the given size. Cc: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:15 +02:00
Dafna Hirschfeld	86b74d8438	accel/habanalabs: assert return value of hw_fini Since hw_fini returns an error code to indicate failure, we should check its return value. Currently it might only fail upon soft-reset from hl_device_reset. A later patch will add hw_fini failure in case of polling timeout in hard-reset. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-03-15 13:29:14 +02:00
Sagiv Ozeri	efbd36b281	accel/habanalabs: add device id to all threads names Compute driver threads names will start with hlX-*, when X is the device id. This will help distinguish them from the NIC thread names. Signed-off-by: Sagiv Ozeri <sozeri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-03-15 13:29:14 +02:00
Tomer Tayar	a8c14f5388	accel/habanalabs: improve readability of engines idle mask print Remove leading zeroes when printing the idle mask to make it clearer. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:14 +02:00
Tom Rix	3a621af637	accel/habanalabs: set hl_capture_*_err storage-class-specifier to static smatch reports drivers/accel/habanalabs/common/device.c:2619:6: warning: symbol 'hl_capture_hw_err' was not declared. Should it be static? drivers/accel/habanalabs/common/device.c:2641:6: warning: symbol 'hl_capture_fw_err' was not declared. Should it be static? both are only used in device.c, so they should be static Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-03-15 13:29:14 +02:00
Dafna Hirschfeld	4a2e9d11fc	accel/habanalabs: don't trace cpu accessible dma alloc/free The cpu accessible dma allocations use the gen_pool api, which does not allocate new memory from the system but manages memory that was already allocated. Tracing this together with real dma allocation/free causes confusing logs, such as a '0' dma address, a cpu address appearing twice, etc. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:13 +02:00
Dafna Hirschfeld	1d0f9ad7ce	accel/habanalabs: in hl_device_reset small refactor for readabilty in the out_err flow, combine the two cases of soft-reset since they have mostly common code. In addition unlock reset_info.lock after touching reset count. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:13 +02:00
Dafna Hirschfeld	39ab4da9c1	accel/habanalabs: in hl_device_reset remove 'hard_instead_of_soft' Because this field is only used for debug print, we can do more precise debug directly instead. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:13 +02:00
Dafna Hirschfeld	7810c5244d	accel/habanalabs: tiny refactor of hl_device_reset for readability Align assignment of reset_upon_device_release to the convention used in this function. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:13 +02:00
Tomer Tayar	18d1358459	accel/habanalabs: enable graceful reset mechanism for compute-reset The graceful reset mechanism is currently enabled only for reset requests that will end up with hard-reset. In future, reset requests due to errors in some device engines, are going to be modified to request compute-reset, as the much longer hard-reset is not really needed there. To allow it, enable graceful reset also for compute-reset, and reset after user releases the device won't be escalated to hard-reset in those cases. If watchdog expires and user didn't release the device, hard-reset will be initiated in any case. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:12 +02:00
Koby Elbaz	57479adb41	accel/habanalabs: disable PCI when escalating compute to hard-reset In case a compute reset has failed or a request for a hard reset has just arrived, then we escalate current reset procedure from compute to hard-reset. In such a case, the FW should be aware of the updated error cause, and if LKD is the one who performs the reset (rather than the FW), then we ask the FW to disable PCI access. We would also like to have relevant debug info and therefore we print the currently escalating reset type. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:12 +02:00
Moti Haimovski	313e9f63b7	accel/habanalabs: add critical-event bit in notifier Enhance the existing user notifications by adding HW and FW critical event bits, to be used when a HW or FW event occurs that requires both SW abort and hard-resetting the chip. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:12 +02:00
Tomer Tayar	d43bce6e76	accel/habanalabs: add info when FD released while device still in use When user closes the device file descriptor, it is checked whether the device is still in use, and a message is printed if it is. To make this message more informative, add to this print also the reason due to which the device is considered as in use. The possible reasons which are checked for now are active CS and exported dma-buf. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:12 +02:00
Oded Gabbay	323adae99f	accel/habanalabs: save class in hdev It is more concise than to pass it to device init. Once we will add the accel class, then we won't need to change the function signatures. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:11 +02:00
Oded Gabbay	89859a8997	accel/habanalabs: split cdev creation to separate function Move the cdev creation code from the main hdev init function to a separate function. This will make the code more readable once we add the accel registration code (instead/in addition to legacy cdev). Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>	2023-03-15 13:29:11 +02:00
Tomer Tayar	44155bb627	habanalabs: clear in_compute_reset when escalating to hard reset If resetting device upon release while the release watchdog work is scheduled, the compute reset is replaced with hard reset. In this case, need to clear the in_compute_reset indication in the device reset information structure. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-01-26 11:52:14 +02:00
Tomer Tayar	0c93eb098f	habanalabs: run error handling if scrub_device_mem fails after reset If device memory scrubbing from hl_device_reset() fails, we return with an error code but do not run the error handling code. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-01-26 11:52:13 +02:00
Koby Elbaz	a6685b573c	habanalabs: block soft-reset on an unusable device A device with status malfunction indicates that it can't be used. In such a case we do not support certain reset types, e.g., all kinds of soft-resets (compute reset, inference soft-reset), and reset upon device release. A hard-reset is the only way that an unusable device can change its status. All other reset procedures can't put the device in a reset procedure, which might ultimately cause the device to change its status, unintentionally, to become operational again. Such a scenario has recently occurred, when a user requested a hard-reset while another heavy user workload was ongoing (reset request is queued). Since the workload couldn't finish within reset's timeout limits, the reset has failed and set a device status malfunction. Eventually, when the user released the FD, an unsuccessful soft-reset occurred, hence followed by an additional hard-reset that changed the ASICs status back to be operational. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-01-26 11:52:13 +02:00
Moti Haimovski	2a0a839b6a	habanalabs: extend fatal messages to contain PCI info This commit attaches the PCI device address to driver fatal messages in order to ease debugging in multi-device setups. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-01-26 11:52:12 +02:00
Ohad Sharabi	54fcb384be	habanalabs: trace LBW reads/writes Add traces to LBW reads/writes. This may be handy when debugging configuration failure or events when tracking configuration flow. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-01-26 11:52:11 +02:00
Koby Elbaz	571d1a7222	habanalabs: protect access to dynamic mem 'user_mappings' When HL_INFO_USER_MAPPINGS IOCTL is called, we copy_to_user from a dynamically allocated memory - 'user_mappings'. Since freeing/allocating it happens at runtime (upon a page fault), it is not unlikely that it will be accessed even before being initially allocated (i.e., accessing a NULL pointer). The solution is to simply mark the spot when the err info has been collected, and that way to know whether err info (either page fault or RAZWI) is available to be read. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-01-26 11:52:11 +02:00
Koby Elbaz	78baccbdc3	habanalabs: refactor razwi/page-fault information structures This refactor makes the code clearer and the new variables' names better describe their roles. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-01-26 11:52:11 +02:00
Tomer Tayar	e2a079a206	habanalabs: verify that kernel CB is destroyed only once Remove the distinction between user CB and kernel CB, and verify for both that they are not destroyed more than once. As a kernel CB might be taken from the pre-allocated CB pool, we need to clear the handle destroyed indication when returning a CB to the pool. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-01-26 11:52:10 +02:00
Oded Gabbay	e65e175b07	habanalabs: move driver to accel subsystem Now that we have a subsystem for compute accelerators, move the habanalabs driver to it. This patch only moves the files and fixes the Makefiles. Future patches will change the existing code to register to the accel subsystem and expose the accel device char files instead of the habanalabs device char files. Update the MAINTAINERS file to reflect this change. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-01-26 11:52:10 +02:00
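
Commit 28fbc058f2 above replaces snprintf() with scnprintf() so the caller no longer has to check for a return value larger than the buffer. The following is a minimal userspace sketch of that difference, not the driver's code: scnprintf_sketch() is a hypothetical stand-in that mimics the kernel helper's behavior (returning the number of characters actually written, excluding the terminating NUL), and compose_in_use_sketch() with its message strings is an illustrative assumption, not the real print_device_in_use_info().

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for the kernel's scnprintf():
 * returns the number of characters actually stored in buf (without
 * the terminating NUL), never the "would have been written" length
 * that snprintf() reports on truncation.
 */
static int scnprintf_sketch(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)
		return 0;
	if ((size_t)n >= size)
		return (int)(size - 1);	/* output was truncated */
	return n;
}

/*
 * Illustrative composer (names and strings are assumptions): because
 * the return value can never exceed the remaining space, "off" can be
 * accumulated without any bounds check between calls.
 */
static size_t compose_in_use_sketch(char *buf, size_t size,
				    int active_cs, int exported_dmabuf)
{
	size_t off = 0;

	off += scnprintf_sketch(buf + off, size - off, "device in use:");
	if (active_cs)
		off += scnprintf_sketch(buf + off, size - off, " active CS");
	if (exported_dmabuf)
		off += scnprintf_sketch(buf + off, size - off,
					" exported dma-buf");
	return off;
}
```

With plain snprintf(), the same accumulation would need a check after every call, since on truncation the returned length can push the offset past the end of the buffer.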

34 Commits