Commit Graph

29 Commits

Author SHA1 Message Date
Tomer Tayar
ffd123fe83 habanalabs: Disable file operations after device is removed
A device can be removed from the PCI subsystem while a process holds the
file descriptor opened.
In such a case, the driver attempts to kill the process, but as it is
still possible that the process will be alive after this step, the
device removal will complete, and we will end up with a process object
that points to a device object which was already released.

To prevent the usage of this released device object, disable the
following file operations for this process object, and avoid the cleanup
steps when the file descriptor is eventually closed.
The latter is just a best effort, as memory leak will occur.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-03-10 09:16:09 +01:00
Tomer Tayar
27ac5aada0 habanalabs: Call put_pid() when releasing control device
The refcount of the "hl_fpriv" structure is not used for the control
device, and thus hl_hpriv_put() is not called when releasing this
device.
This results with no call to put_pid(), so add it explicitly in
hl_device_release_ctrl().

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-03-10 09:16:09 +01:00
Oded Gabbay
28bcf1fdc4 habanalabs: enable F/W events after init done
Only after the initialization of the device is done, the driver is
ready to receive events from the F/W. The driver can't handle events
before that because of races so it will ignore events. In case of
a fatal event, the driver won't know about it and the device will be
operational although it shouldn't be.

Same logic should be applied after hard-reset.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-02-08 18:20:08 +02:00
Ofir Bitton
d00697fbe1 habanalabs: add new mem ioctl op for mapping hw blocks
For future ASIC support the driver allows user to map certain regions
in the device's configuration space for direct access from userspace.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:51 +02:00
Alon Mizrahi
75d9a2a0aa habanalabs: replace WARN/WARN_ON with dev_crit in driver
Often WARN is defined in data-centers as BUG and we would like to
avoid hanging the entire server on some internal error of the driver
(important as it might be).

Therefore, use dev_crit instead.

Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Ofir Bitton
8e39e75a13 habanalabs: Init the VM module for kernel context
In order for reserving VA ranges for kernel memory, we need
to allow the VM module to be initiated with kernel context.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:48 +02:00
Ohad Sharabi
cb6ef0ee6d habanalabs: refactor MMU locks code
remove mmu_cache_lock as it protects a section which is already
protected by mmu_lock.

in addition, wrap mmu cache invalidate calls in hl_vm_ctx_fini with
mmu_lock.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:48 +02:00
Oded Gabbay
2dc4a6d791 habanalabs: disable FW events on device removal
When device is removed, we need to make sure the F/W won't send us
any more events because during the remove process we disable the
interrupts.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-21 20:30:22 +02:00
Oded Gabbay
aa6df6533b habanalabs: fix reset process in case of failures
There are some points in the reset process where if the code fails
for some reason, and the system admin tries to initiate the reset
process again we will get a kernel panic.

This is because there aren't any protections in different fini
functions that are called during the reset process.

The protections that are added in this patch make sure that if the fini
functions are called multiple times, without calling init functions
between them, there won't be double release of already released
resources.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-12 14:59:52 +02:00
Dinghao Liu
b000700d6d habanalabs: Fix memleak in hl_device_reset
When kzalloc() fails, we should execute hl_mmu_fini()
to release the MMU module. It's the same when
hl_ctx_init() fails.

Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-29 23:23:12 +02:00
Oded Gabbay
097c62b6f0 habanalabs: fix order of status check
When the device is in reset or needs to be reset, the disabled property
is don't-care.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:39 +02:00
Greg Kroah-Hartman
a3ab07c642 Merge 5.10-rc7 into char-misc-next
We want the fixes in here, and this resolves a merge issue with
drivers/misc/habanalabs/common/memory.c.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-07 10:08:14 +01:00
Ofir Bitton
adb51298fd habanalabs: improve hard reset procedure
We want to handle the scenario in which the driver was not able
to kill all user processes due to many memory mappings.
We need to retry again after some period while releasing the cores.
The devices will be unusable and "in-reset" status during that time.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:34 +02:00
Tomer Tayar
804a72276c habanalabs: Rename hw_queues_mirror to cs_mirror
Future command submission types might be submitted to HW not via the
QMAN queues path. However, it would be still required to have the TDR
mechanism for these CS, and thus the patch renames the TDR fields and
replaces the hw_queues_ prefix with cs_.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:34 +02:00
Ofir Bitton
d1ddd90551 habanalabs: move HW dirty check to a proper location
Driver must verify if HW is dirty before trying to fetch preboot
information. Hence, we move this validation to a prior stage of
the boot sequence.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:33 +02:00
Ofir Bitton
66a76401c5 habanalabs: add 'needs reset' state in driver
The new state indicates that device should be reset in order
to re-gain funcionality.
This unique state can occur if reset_on_lockup is disabled
and an actual lockup has occurred.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:33 +02:00
Omer Shpigelman
f2d032ee13 habanalabs: fix hard reset print and comment
One of the first steps of a hard reset flow is to close all open user
contexts. This user process teradown might take some time due to long
cleanup in our driver or some other reason even before our cleanup flow.
Hence fix the relevant print and comment to be more accurate.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:33 +02:00
Tomer Tayar
ba7e389c30 habanalabs: Move repeatedly included headers to habanalabs.h
Several header files are repeatedly included in many files.
Move these files to habanalabs.h which is included by all.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:32 +02:00
Ofir Bitton
5555b7c56b habanalabs: put devices before driver removal
Driver never puts its device and control_device objects, hence
a memory leak is introduced every driver removal.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:30:16 +02:00
Oded Gabbay
9e2e8fc7d6 habanalabs: release kernel context after hw_fini
Some engines use resources that belong to the kernel context (e.g. MMU
mappings). In case the halt-engines doesn't work properly due to H/W
restriction, we need to make sure the kernel context lives on until after
the hw_fini. The hw_fini resets the ASIC after that no engine is alive and
we can safely close the kernel context.

Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-09-25 14:44:20 +03:00
Moti Haimovski
d83fe66928 habanalabs: refactor MMU as device-oriented
As preparation to MMU v2, rework MMU to be device oriented
instantiated according to the device in hand.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-09-22 18:49:53 +03:00
Oded Gabbay
3174ac9bb1 habanalabs: restructure hl_mmap
Arrange the hl_mmap code to be more structured and expandable for the
future. Add better defines that describe our usage of the vm_pgoff.

Note that I shamelessly took the code and defines from the amdkfd driver
(my previous driver).

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-09-22 18:49:51 +03:00
Oded Gabbay
2f55342c5e habanalabs: replace armcp with the generic cpucp
ArmCP mandates that the device CPU is always an ARM processor, which might
be wrong in the future.

Most of this change is an internal renaming of variables, functions and
defines but there are two entries in sysfs which have armcp in their
names. Add identical cpucp entries but don't remove yet the armcp entries.
Those will be deprecated next year. Add the documentation about it in sysfs
documentation.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-09-22 18:49:51 +03:00
Oded Gabbay
f907af183b habanalabs: cast int to u32 before printing it with %u
%u is used for unsigned so we need to cast the int variable to u32 to avoid
compiler warning.

Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-09-22 18:49:49 +03:00
Colin Ian King
804d057cfa habanalabs: fix incorrect check on failed workqueue create
The null check on a failed workqueue create is currently null checking
hdev->cq_wq rather than the pointer hdev->cq_wq[i] and so the test
will never be true on a failed workqueue create. Fix this by checking
hdev->cq_wq[i].

Addresses-Coverity: ("Dereference before null check")
Fixes: 5574cb2194 ("habanalabs: Assign each CQ with its own work queue")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22 12:47:58 +03:00
Oded Gabbay
58361aae4b habanalabs: set max power according to card type
In Gaudi, the default max power setting is different between PCI and PMC
cards. Therefore, the driver need to set the default after knowing what is
the card type.

The current code has a bug where it limits the maximum power of the PMC
card to 200W after a reset occurs.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22 12:47:57 +03:00
Greg Kroah-Hartman
65a9bde6ed Merge 5.8-rc7 into char-misc-next
This should resolve the merge/build issues reported when trying to
create linux-next.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-27 11:49:37 +02:00
Oded Gabbay
8df8cb1efc habanalabs: enable device before hw_init()
Device is now enabled before the hw_init() because part of the
initialization requires communication with the device firmware to get
information that is required for the initialization itself

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
2020-07-24 20:31:37 +03:00
Oded Gabbay
70b2f993ea habanalabs: create common folder
For internal needs of our CI we need to move all the common code into a
common folder instead of putting them in the root folder of the driver.

Same applies to the common header files under include/

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
2020-07-24 20:31:37 +03:00