Manivannan writes:
MHI Host
========
- Fix conflict between MHI power up and SYSERR state transitions by issuing MHI
reset only if the device is in SYSERR state while in SBL/PBL EEs. The device
won't respond to reset if it is not in SYSERR state in SBL/PBL EEs.
- Remove redundant call to pci_assign_resource() since PCI core calls this API
internally.
- Add Telit FN920C04 modem which is based on Qcom SDX35 chipset.
* tag 'mhi-for-v6.16' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mani/mhi:
bus: mhi: host: pci_generic: Add Telit FN920C04 modem support
bus: mhi: host: pci_generic: Remove redundant assign resource usage
bus: mhi: host: Fix conflict between power_up and SYSERR
When mhi_async_power_up() enables IRQs, it is possible that we could
receive a SYSERR notification from the device if the firmware has crashed
for some reason. Then the SYSERR notification queues a work item that
cannot execute until the pm_mutex is released by mhi_async_power_up().
So the SYSERR work item will be pending. If mhi_async_power_up() detects
the SYSERR, it will handle it. If the device is in PBL, then the PBL state
transition event will be queued, resulting in a work item after the
pending SYSERR work item. Once mhi_async_power_up() releases the pm_mutex,
the SYSERR work item can run. It will blindly attempt to reset the MHI
state machine, which is the recovery action for SYSERR. PBL/SBL are not
interrupt driven and will ignore the MHI Reset unless SYSERR is actively
advertised. This will cause the SYSERR work item to timeout waiting for
reset to be cleared, and will leave the host state in SYSERR processing.
The PBL transition work item will then run, and immediately fail because
SYSERR processing is not a valid state for PBL transition.
This leaves the device uninitialized.
This issue has a fairly unique signature in the kernel log:
mhi mhi3: Requested to power ON
Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from
device. Attempting to recover
mhi mhi3: Power on setup success
mhi mhi3: Device failed to exit MHI Reset state
mhi mhi3: Device MHI is not in valid state
We cannot remove the SYSERR handling from mhi_async_power_up() because the
device may be in the SYSERR state, but we missed the notification as the
irq was fired before irqs were enabled. We also can't queue the SYSERR work
item from mhi_async_power_up() if SYSERR is detected because that may
result in a duplicate work item, and cause the same issue since the
duplicate item will blindly issue MHI reset even if SYSERR is no longer
active.
Instead, add a check in the SYSERR work item to make sure that MHI reset is
only issued if the device is in SYSERR state for PBL or SBL EEs.
Fixes: a6e2e3522f ("bus: mhi: core: Add support for PM state transitions")
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Troy Hanson <quic_thanson@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250328163526.3365497-1-jeff.hugo@oss.qualcomm.com
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull char / misc / IIO driver updates from Greg KH:
"Here is the big set of char, misc, iio, and other smaller driver
subsystems for 6.15-rc1. Lots of stuff in here, including:
- loads of IIO changes and driver updates
- counter driver updates
- w1 driver updates
- faux conversions for some drivers that were abusing the platform
bus interface
- coresight driver updates
- rust miscdevice binding updates based on real-world-use
- other minor driver updates
All of these have been in linux-next with no reported issues for quite
a while"
* tag 'char-misc-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (292 commits)
samples: rust_misc_device: fix markup in top-level docs
Coresight: Fix a NULL vs IS_ERR() bug in probe
misc: lis3lv02d: convert to use faux_device
tlclk: convert to use faux_device
regulator: dummy: convert to use the faux device interface
bus: mhi: host: Fix race between unprepare and queue_buf
coresight: configfs: Constify struct config_item_type
doc: iio: ad7380: describe offload support
iio: ad7380: add support for SPI offload
iio: light: Add check for array bounds in veml6075_read_int_time_ms
iio: adc: ti-ads7924 Drop unnecessary function parameters
staging: iio: ad9834: Use devm_regulator_get_enable()
staging: iio: ad9832: Use devm_regulator_get_enable()
iio: gyro: bmg160_spi: add of_match_table
dt-bindings: iio: adc: Add i.MX94 and i.MX95 support
iio: adc: ad7768-1: remove unnecessary locking
Documentation: ABI: add wideband filter type to sysfs-bus-iio
iio: adc: ad7768-1: set MOSI idle state to prevent accidental reset
iio: adc: ad7768-1: Fix conversion result sign
iio: adc: ad7124: Benefit of dev = indio_dev->dev.parent in ad7124_parse_channel_config()
...
Manivannan writes:
MHI Host
========
- Remove unused mhi_device_get() and mhi_queue_dma() APIs.
- Add support for SA8775p MHI endpoint device with IP_SW0 channel.
- Fix the race between mhi_unprepare_from_transfer() and mhi_queue_buf().
* tag 'mhi-for-v6.15' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mani/mhi:
bus: mhi: host: Fix race between unprepare and queue_buf
bus: mhi: host: pci_generic: Add support for SA8775P endpoint
bus: mhi: host: Remove unused functions
A client driver may use mhi_unprepare_from_transfer() to quiesce
incoming data during the client driver's tear down. The client driver
might also be processing data at the same time, resulting in a call to
mhi_queue_buf() which will invoke mhi_gen_tre(). If mhi_gen_tre() runs
after mhi_unprepare_from_transfer() has torn down the channel, a panic
will occur due to an invalid dereference leading to a page fault.
This occurs because mhi_gen_tre() does not verify the channel state
after locking it. Fix this by having mhi_gen_tre() confirm the channel
state is valid, or return error to avoid accessing deinitialized data.
Cc: stable@vger.kernel.org # 6.8
Fixes: b89b6a863d ("bus: mhi: host: Add spinlock to protect WP access when queueing TREs")
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Reviewed-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Reviewed-by: Youssef Samir <quic_yabdulra@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Troy Hanson <quic_thanson@quicinc.com>
Link: https://lore.kernel.org/r/20250306172913.856982-1-jeff.hugo@oss.qualcomm.com
[mani: added stable tag]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
There are multiple places from where the recovery work gets scheduled
asynchronously. Also, there are multiple places where the caller waits
synchronously for the recovery to be completed. One such place is during
the PM shutdown() callback.
If the device is not alive during recovery_work, it will try to reset the
device using pci_reset_function(). This function internally will take the
device_lock() first before resetting the device. By this time, if the lock
has already been acquired, then recovery_work will get stalled while
waiting for the lock. And if the lock was already acquired by the caller
which waits for the recovery_work to be completed, it will lead to
deadlock.
This is what happened on the X1E80100 CRD device when the device died
before shutdown() callback. Driver core calls the driver's shutdown()
callback while holding the device_lock() leading to deadlock.
And this deadlock scenario can occur on other paths as well, like during
the PM suspend() callback, where the driver core would hold the
device_lock() before calling driver's suspend() callback. And if the
recovery_work was already started, it could lead to deadlock. This is also
observed on the X1E80100 CRD.
So to fix both issues, use pci_try_reset_function() in recovery_work. This
function first checks for the availability of the device_lock() before
trying to reset the device. If the lock is available, it will acquire it
and reset the device. Otherwise, it will return -EAGAIN. If that happens,
recovery_work will fail with the error message "Recovery failed" as not
much could be done.
Cc: stable@vger.kernel.org # 5.12
Reported-by: Johan Hovold <johan@kernel.org>
Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com
Fixes: 7389337f0a ("mhi: pci_generic: Add suspend/resume/recovery procedure")
Reviewed-by: Johan Hovold <johan+linaro@kernel.org>
Tested-by: Johan Hovold <johan+linaro@kernel.org>
Analyzed-by: Johan Hovold <johan@kernel.org>
Link: https://lore.kernel.org/mhi/Z2KKjWY2mPen6GPL@hovoldconsulting.com/
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Link: https://lore.kernel.org/r/20250108-mhi_recovery_fix-v1-1-a0a00a17da46@linaro.org
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
We need the IIO fixes in here as well, and it resolves a merge conflict
in:
drivers/iio/adc/ti-ads1119.c
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Foxconn uses a unique firmware for their MHI based modems. So the generic
firmware from Qcom won't work. Hence, update the EDL firmware path to
include the 'foxconn' subdirectory based on the modem SoC so that the
Foxconn specific firmware could be used.
Respective firmware will be upstreamed to linux-firmware repo.
Cc: stable@vger.kernel.org # 6.11
Fixes: bf30a75e6e ("bus: mhi: host: Add support for Foxconn SDX72 modems")
Signed-off-by: Slark Xiao <slark_xiao@163.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/20240725022941.65948-1-slark_xiao@163.com
[mani: Reworded the subject and description]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Add Netprisma LCUR57 and FCUN69 hardware revision:
LCUR57:
02:00.0 Unassigned class [ff00]: Device 203e:1000
Subsystem: Device 203e:1000
FCUN69:
02:00.0 Unassigned class [ff00]: Device 203e:1001
Subsystem: Device 203e:1001
Both of these modules create IP interfaces through MBIM.
And these modules can be checked for successful recognition through the
following command:
$ mmcli -L
/org/freedesktop/ModemManager1/Modem/0 [NetPrisma] LCUR57-WWD
$ mmcli -L
/org/freedesktop/ModemManager1/Modem/0 [NetPrisma] FCUN69-WWD
Signed-off-by: Mank Wang <mank.wang@netprisma.us>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/PH7PR22MB30386647BE2D813B502226CF81942@PH7PR22MB3038.namprd22.prod.outlook.com
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Pull driver core updates from Greg KH:
"Here is the big set of driver core changes for 6.11-rc1.
Lots of stuff in here, with not a huge diffstat, but apis are evolving
which required lots of files to be touched. Highlights of the changes
in here are:
- platform remove callback api final fixups (Uwe took many releases
to get here, finally!)
- Rust bindings for basic firmware apis and initial driver-core
interactions.
It's not all that useful for a "write a whole driver in rust" type
of thing, but the firmware bindings do help out the phy rust
drivers, and the driver core bindings give a solid base on which
others can start their work.
There is still a long way to go here before we have a multitude of
rust drivers being added, but it's a great first step.
- driver core const api changes.
This reached across all bus types, and there are some fix-ups for
some not-common bus types that linux-next and 0-day testing shook
out.
This work is being done to help make the rust bindings more safe,
as well as the C code, moving toward the end-goal of allowing us to
put driver structures into read-only memory. We aren't there yet,
but are getting closer.
- minor devres cleanups and fixes found by code inspection
- arch_topology minor changes
- other minor driver core cleanups
All of these have been in linux-next for a very long time with no
reported problems"
* tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (55 commits)
ARM: sa1100: make match function take a const pointer
sysfs/cpu: Make crash_hotplug attribute world-readable
dio: Have dio_bus_match() callback take a const *
zorro: make match function take a const pointer
driver core: module: make module_[add|remove]_driver take a const *
driver core: make driver_find_device() take a const *
driver core: make driver_[create|remove]_file take a const *
firmware_loader: fix soundness issue in `request_internal`
firmware_loader: annotate doctests as `no_run`
devres: Correct code style for functions that return a pointer type
devres: Initialize an uninitialized struct member
devres: Fix memory leakage caused by driver API devm_free_percpu()
devres: Fix devm_krealloc() wasting memory
driver core: platform: Switch to use kmemdup_array()
driver core: have match() callback in struct bus_type take a const *
MAINTAINERS: add Rust device abstractions to DRIVER CORE
device: rust: improve safety comments
MAINTAINERS: add Danilo as FIRMWARE LOADER maintainer
MAINTAINERS: add Rust FW abstractions to FIRMWARE LOADER
firmware: rust: improve safety comments
...
In the match() callback, the struct device_driver * should not be
changed, so change the function callback to be a const *. This is one
step of many towards making the driver core safe to have struct
device_driver in read-only memory.
Because the match() callback is in all busses, all busses are modified
to handle this properly. This does entail switching some container_of()
calls to container_of_const() to properly handle the constant *.
For some busses, like PCI and USB and HV, the const * is cast away in
the match callback as those busses do want to modify those structures at
this point in time (they have a local lock in the driver structure.)
That will have to be changed in the future if they wish to have their
struct device * in read-only-memory.
Cc: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Alex Elder <elder@kernel.org>
Acked-by: Sumit Garg <sumit.garg@linaro.org>
Link: https://lore.kernel.org/r/2024070136-wrongdoer-busily-01e8@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, a single 'mhi_pci_dev_info' is shared across different product
families. Even though it makes the device functional, it misleads the users
by sharing the common product name.
For instance, below message will be printed for Foxconn SDX62 modem during
boot:
"MHI PCI device found: foxconn-sdx65"
But this is quite misleading to the users since the actual modem plugged in
could be 'T99W373' which is based on SDX62.
So fix this issue by using a unique 'mhi_pci_dev_info' for product
families. This allows us to specify a unique product name for each product
family. Also, once this name is exposed to client drivers, they may use
this name to identify the modems and use any modem specific configuration.
Modems of unknown product families are not impacted by this change.
CC: Slark Xiao <slark_xiao@163.com>
Reviewed-by: Slark Xiao <slark_xiao@163.com>
Link: https://lore.kernel.org/r/20240626053237.4227-1-manivannan.sadhasivam@linaro.org
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
With the rework of how the __string() handles dynamic strings where it
saves off the source string in field in the helper structure[1], the
assignment of that value to the trace event field is stored in the helper
value and does not need to be passed in again.
This means that with:
__string(field, mystring)
Which use to be assigned with __assign_str(field, mystring), no longer
needs the second parameter and it is unused. With this, __assign_str()
will now only get a single parameter.
There's over 700 users of __assign_str() and because coccinelle does not
handle the TRACE_EVENT() macro I ended up using the following sed script:
git grep -l __assign_str | while read a ; do
sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
mv /tmp/test-file $a;
done
I then searched for __assign_str() that did not end with ';' as those
were multi line assignments that the sed script above would fail to catch.
Note, the same updates will need to be done for:
__assign_str_len()
__assign_rel_str()
__assign_rel_str_len()
I tested this with both an allmodconfig and an allyesconfig (build only for both).
[1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/
Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs
Tested-by: Guenter Roeck <linux@roeck-us.net>
Some of the MHI modems like SDX65 based ones are capable of entering the
EDL mode as per the standard triggering mechanism defined in the MHI spec
v1.2. So let's add a common mhi_pci_generic_edl_trigger() function that
triggers the EDL mode in the device when user writes to the
/sys/bus/mhi/devices/.../trigger_edl file.
As per the spec, the EDL mode can be triggered by writing a cookie to the
EDL doorbell register and then resetting the device.
Devices supporting this standard way of entering EDL mode can set the
mhi_pci_dev_info::edl_trigger flag.
Signed-off-by: Qiang Yu <quic_qianyu@quicinc.com>
Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/1713928915-18229-4-git-send-email-quic_qianyu@quicinc.com
[mani: reworded commit message]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Some controllers may want to access a specific doorbell register. Hence add
a new API that reads the CHDBOFF register and returns the offset of the
doorbell registers from MMIO base, so that the controller can calculate the
address of the specific doorbell register by adding the register offset
with doorbell offset and MMIO base address.
Signed-off-by: Qiang Yu <quic_qianyu@quicinc.com>
Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Link: https://lore.kernel.org/r/1713928915-18229-3-git-send-email-quic_qianyu@quicinc.com
[mani: reworded commit message and Kdoc]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Add sysfs entry to allow users of MHI bus to force device to enter EDL
(Emergency Download) mode to download the device firmware. Since there is
no guarantee that all the devices will support EDL mode, the sysfs entry
is kept as an optional one and will appear only for the supported devices.
Controllers supporting the EDL mode are expected to provide edl_trigger()
callback that puts the device into EDL mode.
Signed-off-by: Qiang Yu <quic_qianyu@quicinc.com>
Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/1713928915-18229-2-git-send-email-quic_qianyu@quicinc.com
[mani: fixed the kernel version and reworded the commit message]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Currently, ath11k fails to resume from system suspend/hibernation on some
the x86 host machines with below error message:
```
ath11k_pci 0000:06:00.0: timeout while waiting for restart complete
```
This happens because, ath11k powers down the MHI stack during suspend and
that leads to destruction of the struct device associated with the MHI
channels. And during resume, ath11k calls calling mhi_sync_power_up() to
power up the MHI subsystem and that eventually calls the driver framework's
device_add() API from mhi_create_devices(). But the PM framework blocks the
struct device creation during device_add() and this leads to probe deferral
as below:
```
mhi mhi0_IPCR: Driver qcom_mhi_qrtr force probe deferral
```
The reason for deferring device creation during resume is explained in
dpm_prepare():
/*
* It is unsafe if probing of devices will happen during suspend or
* hibernation and system behavior will be unpredictable in this
* case. So, let's prohibit device's probing here and defer their
* probes instead. The normal behavior will be restored in
* dpm_complete().
*/
Due to the device probe deferral, qcom_mhi_qrtr_probe() API is not getting
called during resume and thus MHI channels are not prepared. So this blocks
the QMI messages from being transferred between ath11k and firmware,
resulting in a firmware initialization failure.
After consulting with Rafael, it was decided to not destroy the struct
device for the MHI channels during system suspend/hibernation because the
device is bound to appear again during resume.
So to achieve this, a new API called mhi_power_down_keep_dev() is
introduced for MHI controllers to keep the struct device when required.
This API is similar to the existing mhi_power_down() API, except that it
keeps the struct device associated with MHI channels instead of destroying
them.
Tested-on: WCN6855 hw2.0 PCI WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.30
Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240305021320.3367-2-quic_bqiang@quicinc.com
[mani: reworded the commit message and subject]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
This reverts commit 3316ab2b45.
The MHI spec owner pointed out that the SOC_HW_VERSION register is part
of the BHIe segment, and only valid on devices which implement BHIe.
Only a small subset of MHI devices implement BHIe so blindly accessing
the register for all devices is not correct. Also, since the BHIe
segment offset is not used when accessing the register, any
implementation which moves the BHIe segment will result in accessing
some other register. We've seen that accessing this register on AIC100
which does not support BHIe can result in initialization failures.
We could try to put checks into the code to address these issues, but in
the roughly 4 years this functionality has existed, no one has used it.
Easier to drop this dead code and address the issues if anyone comes up
with a real world use for it.
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/20240219180748.1591527-1-quic_jhugo@quicinc.com
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
This change adds ftrace support for following functions which
helps in debugging the issues when there is Channel state & MHI
state change and also when we receive data and control events:
1. mhi_intvec_mhi_states
2. mhi_process_data_event_ring
3. mhi_process_ctrl_ev_ring
4. mhi_gen_tre
5. mhi_update_channel_state
6. mhi_tryset_pm_state
7. mhi_pm_st_worker
Change the implementation of the arrays which has enum to strings mapping
to make it consistent in both trace header file and other files.
Where ever the trace events are added, debug messages are removed.
Signed-off-by: Krishna chaitanya chundru <quic_krichai@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20240206-ftrace_support-v11-1-3f71dc187544@quicinc.com
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
The OEM PK HASH registers in the BHI region are read once during firmware
load (boot), cached, and displayed on demand via sysfs. This has a few
problems - if firmware load is skipped, the registers will not be read and
if the register values change over the life of the device the local cache
will be out of sync.
Qualcomm Cloud AI 100 can expose both these problems. It is possible for
mhi_async_power_up() to be invoked while the device is in AMSS EE, which
would bypass firmware loading. Also, Qualcomm Cloud AI 100 has 5 PK HASH
slots which can be dynamically provisioned while the device is active,
which would result in the values changing and users may want to know what
keys are active.
Address these concerns by reading the PK HASH registers on-demand during
the sysfs read. This will result in showing the most current information.
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/20240105174253.863388-1-quic_jhugo@quicinc.com
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
When processing a SYSERR, if the device does not respond to the MHI_RESET
from the host, the host will be stuck in a difficult to recover state.
The host will remain in MHI_PM_SYS_ERR_PROCESS and not clean up the host
channels. Clients will not be notified of the SYSERR via the destruction
of their channel devices, which means clients may think that the device is
still up. Subsequent SYSERR events such as a device fatal error will not
be processed as the state machine cannot transition from PROCESS back to
DETECT. The only way to recover from this is to unload the mhi module
(wipe the state machine state) or for the mhi controller to initiate
SHUTDOWN.
This issue was discovered by stress testing soc_reset events on AIC100
via the sysfs node.
soc_reset is processed entirely in hardware. When the register write
hits the endpoint hardware, it causes the soc to reset without firmware
involvement. In stress testing, there is a rare race where soc_reset N
will cause the soc to reset and PBL to signal SYSERR (fatal error). If
soc_reset N+1 is triggered before PBL can process the MHI_RESET from the
host, then the soc will reset again, and re-run PBL from the beginning.
This will cause PBL to lose all state. PBL will be waiting for the host
to respond to the new syserr, but host will be stuck expecting the
previous MHI_RESET to be processed.
Additionally, the AMSS EE firmware (QSM) was hacked to synthetically
reproduce the issue by simulating a FW hang after the QSM issued a
SYSERR. In this case, soc_reset would not recover the device.
For this failure case, to recover the device, we need a state similar to
PROCESS, but can transition to DETECT. There is not a viable existing
state to use. POR has the needed transitions, but assumes the device is
in a good state and could allow the host to attempt to use the device.
Allowing PROCESS to transition to DETECT invites the possibility of
parallel SYSERR processing which could get the host and device out of
sync.
Thus, invent a new state - MHI_PM_SYS_ERR_FAIL
This essentially a holding state. It allows us to clean up the host
elements that are based on the old state of the device (channels), but
does not allow us to directly advance back to an operational state. It
does allow the detection and processing of another SYSERR which may
recover the device, or allows the controller to do a clean shutdown.
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/20240112180800.536733-1-quic_jhugo@quicinc.com
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>