linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-02 05:22:49 -04:00

Author	SHA1	Message	Date
Oded Gabbay	0b0ae02440	habanalabs: rename soft reset to compute reset Doing compute reset can be the traditional inference soft reset that is supported only in Goya. Or it can be the new reset upon device release, which is supported in Gaudi2 and above. Therefore, wherever suitable, use the terminology of compute reset instead of soft reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:31 +03:00
Oded Gabbay	e3b20f3ee4	habanalabs: add status of reset after device release The user might want to know the device is in reset after device release, which is not an erroneous event as a regular reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:31 +03:00
Oded Gabbay	bd4a338886	habanalabs: fix update of is_in_soft_reset reset_info.is_in_soft_reset should be updated both before in_reset and inside the spin lock of the reset info structure. The reasons are: - When we are inside soft reset, it implies we are in reset. Therefore, if someone checks if we are in soft reset, he can deduce we are in reset, while the opposite is not correct and might be misleading. - Both these flags are changed together so they must be changed inside the reset info spinlock. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:31 +03:00
Ofir Bitton	08f0aa9548	habanalabs: expose only valid debugfs nodes In case security is enabled on the device, some debugfs nodes will fail. Hence, we do not expose them. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:31 +03:00
Tomer Tayar	af2e650b36	habanalabs: add a value field to hl_fw_send_pci_access_msg() For gaudi2 we need to send a value to F/W as part of the PCI_ACCESS packet. As a preparation, modify hl_fw_send_pci_access_msg() to have a 'value' field. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:30 +03:00
Ohad Sharabi	20cd88a775	habanalabs: fixes to the poll-timeout macros - use conventional internal macro variables (double underscore prefix) - adjust address casting - on register poll using ELBI use ELBI read rather than BAR read on error condition - remove unused macro Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:30 +03:00
Oded Gabbay	b596ad6f11	habanalabs: initialize variable explicitly Fix warning of "warning: ‘old_base’ may be used uninitialized in this function" Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:30 +03:00
Christophe JAILLET	6d24b4d17d	habanalabs: Use the bitmap API to allocate bitmaps Use bitmap_zalloc()/bitmap_free() instead of hand-writing them. It is less verbose and it improves the semantic. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:30 +03:00
Oded Gabbay	cf008f5acb	habanalabs: make sure variable is set before used timestamp could be unset in both _hl_interrupt_wait_ioctl() and _hl_interrupt_wait_ioctl_user_addr() so it is better to explicitly initialize it to 0 when declaring it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:29 +03:00
Oded Gabbay	f2d9ec872c	habanalabs: don't declare tmp twice in same function tmp is declared in the scope of the function cs_do_release() and inside a block inside that function. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:29 +03:00
Ofir Bitton	cc81c0f3b0	habanalabs: do not set max power on a secured device Max power API is not supported in secured devices. Hence, we should skip setting it during boot. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:29 +03:00
Oded Gabbay	18913d6870	habanalabs: allow detection of unsupported f/w packets If we send a packet to the f/w, and that packet is unsupported, we want to be able to identify this situation and possibly ignore this. Therefore, if the f/w returned an error, we need to propagate it to the callers in the result value, if those callers were interested in it. In addition, no point of printing the error code here because each caller prints its own error with a specific message. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:29 +03:00
Sagiv Ozeri	ea9770e653	habanalabs: save f/w preboot minor version We need this property for backward compatibility against the f/w. Signed-off-by: Sagiv Ozeri <sozeri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:29 +03:00
Ofir Bitton	d6a66d5960	habanalabs: add support for common decoder interrupts User application should be able to get notification for any decoder completion. Hence, we introduce a new interface in which a user can wait for all current decoder pending interrupts. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:29 +03:00
Ofir Bitton	1a6609cdd4	habanalabs: naming refactor of user interrupt flow Current naming convention can be misleading. Hence renaming some variables and defines in order to be more explicit. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:28 +03:00
Ohad Sharabi	2b9e583d0a	habanalabs: wait for preboot ready after hard reset Currently we are not waiting for preboot ready after hard reset. This leads to a race in which COMMs protocol begins but will get no response from the f/w. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:28 +03:00
Oded Gabbay	6b4e8a12b2	habanalabs: enable gaudi2 code in driver Enable the Gaudi2 ASIC code in the pci probe callback of the driver so the driver will handle Gaudi2 ASICs. Add the PCI ID to the PCI table and add the ASIC enum value to all relevant places. Fixup the device parameters initialization for Gaudi2. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:28 +03:00
Moti Haimovski	8aa1e1e605	habanalabs: add gaudi2 MMU support Gaudi2 has new MMU units. A PMMU for device->host accesses, and HMMU for HBM accesses. The page tables of both MMUs are located in the host's memory (referred to in the code as host-resident pgt). Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:28 +03:00
Oded Gabbay	f73c637645	habanalabs: add gaudi2 wait-for-CS support In Gaudi2 we moved to a different wait for command submission completion model. Instead of receiving interrupt only on external queues, we use the device's sync manager to notify us when the entire command submission finishes. This enables us to remove the categorization of queues to external and internal, and treat each queue equally, without the need to parse and patch any command buffer. This change also requires refactoring to the IRQ handling of CS completions. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:28 +03:00
Ofir Bitton	e392d1bd04	habanalabs: add generic security module As the ASICs become more complex and have many more registers, we need a better way to configure the security properties. As a reminder, we have two dedicated mechanisms for security: Range Registers and Protection bits. Those mechanisms protect sensitive memory and configuration areas inside the device. The generic module handles the low-level part of the configuration, because the configuration mechanism is identical in all ASICs. The difference is the address ranges and register names. Any ASIC that uses this block should first block all the register blocks in the ASIC. Then, it should open only the registers that need to be accessed by the user (This is opposed to Goya and Gaudi, where we blocked only what should not be accessed by the user). The module contains several functions, to unblock single register, multiple registers, entire blocks, ranges, ranges with mask. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:27 +03:00
Oded Gabbay	c47082c22d	habanalabs: remove obsolete device variables used for testing There are a couple of device variables that are used for testing purposes and they are set to fixed values. Remove the variables that are not relevant anymore and document the remaining variables. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:27 +03:00
Oded Gabbay	be7813eaa6	habanalabs: initialize new asic properties New asic properties were added for Gaudi2. We want to initialize and use them, when relevant, also for Goya and Gaudi. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:27 +03:00
Oded Gabbay	d7bb1ac89b	habanalabs: add gaudi2 asic-specific code Add the ASIC-specific code for Gaudi2. Supply (almost) all of the function callbacks that the driver's common code need to initialize, finalize and submit workloads to the Gaudi2 ASIC. It also contains the code to initialize the F/W of the Gaudi2 ASIC and to receive events from the F/W. It contains new debugfs entry to dump razwi events. razwi is a case where the device's engines create a transaction that reaches an invalid destination. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:27 +03:00
Ofir Bitton	ccf991e4f2	habanalabs: remove redundant argument in access_dev_mem APIs Region structure is derived from region type, hence no need to pass it as an argument. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:26 +03:00
Ohad Sharabi	fce854e9bc	habanalabs: communicate supported page sizes to user Because in future ASICs the driver will allow the user to set the page size we need to make sure this data is propagated in all APIs. In addition, since this is already an ASIC property we no longer need ASIC function for it. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:26 +03:00
Tomer Tayar	a74cf4a8f6	habanalabs: remove dead code from free_device_memory() free_device_memory() ends with if and else, each has a return statement, followed by another return statement that can never be reached. Restructure the function and remove this dead code. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:26 +03:00
Ohad Sharabi	b2711ab2b0	habanalabs: page size can only be a power of 2 We dropped support for page sizes that are not power of 2. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:26 +03:00
Ohad Sharabi	1ef0c327e1	habanalabs: refactor dma asic-specific functions This is a pre-requisite patch for adding tracepoints to the DMA memory operations (allocation/free) in the driver. The main purpose is to be able to cross data with the map operations and determine whether memory violation occurred, for example free DMA allocation before unmapping it from device memory. To achieve this the DMA alloc/free code flows were refactored so that a single DMA tracepoint will catch many flows. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:26 +03:00
Dafna Hirschfeld	7659c30d19	habanalabs: set default value for memory_scrub Set a default value for memory scrubbing Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:25 +03:00
Dafna Hirschfeld	605e1ef3d5	habanalabs: move call to scrub_device_mem after ctx_fini In future ASICs, it would be possible to have a non-idle device when context is released. We thus need to postpone the scrubbing. Postpone it to hpriv release if reset is not executed or to device late init if reset is executed. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:25 +03:00
Dafna Hirschfeld	8c834a1442	habanalabs: don't send addr and size to scrub_device_mem cb We use scrub_device_mem only to scrub the entire SRAM and entire DRAM. Therefore there is no need to send addr and size args to the callback. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:25 +03:00
Dafna Hirschfeld	c1048d14c0	habanalabs: don't do memory scrubbing when unmapping There is no need to do memory scrub when unmapping anymore as it is an overhead as long as we have a single user at any given time. Remove that code and change return value of free_phys_pg_pack to void Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:24 +03:00
Ofir Bitton	856fe7b0aa	habanalabs: print if firmware is secured during load For easier debug, it is desirable to have a simple way to know whether the device is secured or not, hence we dump this indication during boot. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:24 +03:00
Yuri Nudelman	17ab47d2d6	habanalabs/gaudi: fix a race condition causing DMAR error There is a rare race condition in CB completion mechanism, that can occur under a very high pressure of command submissions. The preconditions for this to happen are: 1. There should be enough command submissions for the pre-allocated patched CB pool to run out of commands. At this stage we start allocating new patched CBs as they arrive. 2. CB size has to be exactly (128*n + 104)B for some n, i.e. 24B below a cache line end. The flow: 1. Two command buffers being completed on different streams, at the same time. Denote those CB1 and CB2. 2. Each command buffer is injected with two messages, 16B each - one for a HBW update of the completion queue, another to raise interrupt. 3. Assume CB1 updated the completion queue and raise the interrupt. 4. Assume CB2 updated the completion queue but did not raise the interrupt yet. 5. The host receives the interrupt. It goes over the completion queue and sees two completions - CB1 and CB2. Release them both. 6. CB2 performs the last command. The problem is that the last command is split between 2 cache lines. So to read the last 8B of the last command, it has to access the host again. Problem is - CB2 is already released. This causes a DMAR error. The solution to this problem is simply to make sure the last two commands in the CB are always in the same cache line, using NOP padding. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:24 +03:00
Dafna Hirschfeld	792588a8c2	habanalabs: move memory_scrub_val to hdev struct move the field memory_scrub_val from struct hl_dbg_device_entry to struct hl_device. This is because we want to use this field also if debugfs is off. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:24 +03:00
Oded Gabbay	0d98943437	habanalabs: fix comment style function name should not be preceded with @ Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:24 +03:00
Oded Gabbay	fb1155a9f0	habanalabs: use kvcalloc when possible kvcalloc is same as kvmalloc_array with GFP_ZERO. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:24 +03:00
Oded Gabbay	b63539a6fa	habanalabs: print pointer with correct modifier Use %p instead of %llx for printing pointers. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:24 +03:00
Oded Gabbay	abe85a9c11	habanalabs: check fence pointer before use fence pointer can be NULL in this path, as shown by an earlier check. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:23 +03:00
Oded Gabbay	4cd213807b	habanalabs: remove unused get_dma_desc_list_size This asic callback function is not called anymore from the common code. The asic-specific function itself is called but from within the asic-specific code. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:23 +03:00
Yuri Nudelman	a189977701	habanalabs: fix NULL dereference on cs timeout Device descriptor is accessed before an assignment Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:23 +03:00
farah kassabri	d64a29af12	habanalabs: add validity check for cq counter offset Driver performs no validity check for the user cq counter offset used in both wait_for_interrupt and register_for_timestamp APIs. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:23 +03:00
Dani Liberman	ada103b677	habanalabs: avoid unnecessary error print When sending a packet to FW right after it made reset, we will get packet timeout. Since it is expected behavior, we don't need to print an error in such case. Hence, when driver is in hard reset it will avoid from printing error messages about packet timeout. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:23 +03:00
Tal Cohen	fa9deaca2f	habanalabs: send an event notification when CS timeout occurs The Driver needs to inform the User process whenever one of its CS is timed out. The Driver shall recognize the CS timeout and shall send an eventfd notification, towards user space, whenever a timeout is expired on a CS. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:22 +03:00
Tal Cohen	6474691483	habanalabs: expose undefined opcode status via info ioctl The info ioctl retrieves information on the last undefined opcode occurred. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:22 +03:00
Tal Cohen	a7d6c35bcd	habanalabs/gaudi: collect undefined opcode error info When an undefined opcode error occurs, the driver collects the relevant information from the Qman and stores it inside the hdev data structure. An event fd indication is sent towards the user space. Note: another commit shall follow which will add support to read the error info by an ioctl. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:22 +03:00
Tomer Tayar	41021f728a	habanalabs: fix race between hl_get_compute_ctx() and hl_ctx_put() hl_get_compute_ctx() is used to get the pointer to the compute context from the hpriv object. The function is called in code paths that are not necessarily initiated by user, so it is possible that a context release process will happen in parallel. This can lead to a race condition in which hl_get_compute_ctx() retrieves the context pointer, and just before it increments the context refcount, the context object is released and a freed memory is accessed. To avoid this race, add a mutex to protect the context pointer in hpriv. With this lock, hl_get_compute_ctx() will be able to detect if the context has been released or is about to be released. struct hl_ctx_mgr has a mutex for contexts IDR with a similar "ctx_lock" name, so rename it to just "lock" to avoid a confusion with the new lock. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:22 +03:00
Yuri Nudelman	2bc61bc4f3	habanalabs: keep a record of completed CS outcomes Often, the user is not interested in the completion timestamp of all command submissions. A common situation is, for example, when the user submits a burst of, possibly, several thousands of commands, then requests the completion timestamp of only a couple of specific key commands from all the burst. The problem is that currently, the outcome of the early commands may be lost, due to a large amount of later commands, that the user does not really care about. This patch creates a separate store with the outcomes of commands the user has marked explicitly as interested in. This store does not mix the marked commands with the unmarked ones, hence the data there will survive for much longer. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:22 +03:00
Tal Cohen	d0c92afc0e	habanalabs: change the write flag name of error info structs positive flags naming will make more clear code while adding more 'error info' structures Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:21 +03:00
Dafna Hirschfeld	78d503087b	habanalabs: add terminating NULL to attrs arrays Arrays of struct attribute are expected to be NULL terminated. This is required by API methods such as device_add_groups. This fixes a crash when loading the driver for Goya device. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:21 +03:00
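
The arithmetic behind commit 17ab47d2d6 (the DMAR race fix) can be illustrated with a small userspace sketch. It is not the driver's code: the function names `tail_padding` and `last_cmd_is_split` are hypothetical, and the only numbers taken from the commit message are the 128B cache line and the two 16B completion messages appended to each CB. The sketch assumes padding is added in front of those two messages so that both land in one cache line.

```c
#include <stdbool.h>
#include <stddef.h>

/* Values quoted in the commit message. */
#define CACHE_LINE_SIZE 128u
#define MSG_SIZE        16u             /* one injected completion message */
#define TAIL_SIZE       (2u * MSG_SIZE) /* CQ update + interrupt message */

/*
 * Hypothetical helper: true if the last of the two appended messages
 * would straddle a cache-line boundary for a CB body of cb_size bytes.
 */
static bool last_cmd_is_split(size_t cb_size)
{
    size_t start = cb_size + MSG_SIZE;      /* first byte of last message */
    size_t last  = cb_size + TAIL_SIZE - 1; /* final byte of last message */

    return (start / CACHE_LINE_SIZE) != (last / CACHE_LINE_SIZE);
}

/*
 * Hypothetical helper: number of padding bytes to insert before the two
 * messages so that both fit inside a single cache line. How the padding
 * is encoded as NOP packets is left out of this sketch.
 */
static size_t tail_padding(size_t cb_size)
{
    size_t off = cb_size % CACHE_LINE_SIZE;

    if (off + TAIL_SIZE <= CACHE_LINE_SIZE)
        return 0;                  /* both messages already share a line */

    return CACHE_LINE_SIZE - off;  /* push the tail to the next line */
}
```

For the size named in the commit, 128*n + 104, the last message would occupy bytes 120 to 135 of its line, and `tail_padding` returns 24, which matches the "24B below a cache line end" wording.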

535 Commits