When the AP masks are updated via apmask_store() or aqmask_store(),
ap_bus_revise_bindings() is called after ap_attr_mutex has been
released.
This calls __ap_revise_reserved(), which accesses the driver_override
field without holding any lock, racing against a concurrent
driver_override_store() that may free the old string, resulting in a
potential UAF.
Fix this by using the driver-core driver_override infrastructure, which
protects all accesses with an internal spinlock.
Note that unlike most other buses, the AP bus does not check
driver_override in its match() callback; the override is checked in
ap_device_probe() and __ap_revise_reserved() instead.
Also note that we do not enable the driver_override feature of struct
bus_type, as AP - in contrast to most other buses - passes "" to
sysfs_emit() when the driver_override pointer is NULL. Thus, printing
"\n" instead of "(null)\n".
Additionally, AP has a custom counter that is modified in the
corresponding custom driver_override_store().
Fixes: d38a87d7c0 ("s390/ap: Support driver_override for AP queue devices")
Tested-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Link: https://patch.msgid.link/20260324005919.2408620-11-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Pull non-MM updates from Andrew Morton:
- "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
disk space by teaching ocfs2 to reclaim suballocator block group
space (Heming Zhao)
- "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
ARRAY_END() macro and uses it in various places (Alejandro Colomar)
- "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
page size (Pnina Feder)
- "kallsyms: Prevent invalid access when showing module buildid" cleans
up kallsyms code related to module buildid and fixes an invalid
access crash when printing backtraces (Petr Mladek)
- "Address page fault in ima_restore_measurement_list()" fixes a
kexec-related crash that can occur when booting the second-stage
kernel on x86 (Harshit Mogalapalli)
- "kho: ABI headers and Documentation updates" updates the kexec
handover ABI documentation (Mike Rapoport)
- "Align atomic storage" adds the __aligned attribute to atomic_t and
atomic64_t definitions to get natural alignment of both types on
csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)
- "kho: clean up page initialization logic" simplifies the page
initialization logic in kho_restore_page() (Pratyush Yadav)
- "Unload linux/kernel.h" moves several things out of kernel.h and into
more appropriate places (Yury Norov)
- "don't abuse task_struct.group_leader" removes the usage of
->group_leader when it is "obviously unnecessary" (Oleg Nesterov)
- "list private v2 & luo flb" adds some infrastructure improvements to
the live update orchestrator (Pasha Tatashin)
* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
procfs: fix missing RCU protection when reading real_parent in do_task_stat()
watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
kho: fix doc for kho_restore_pages()
tests/liveupdate: add in-kernel liveupdate test
liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
liveupdate: luo_file: Use private list
list: add kunit test for private list primitives
list: add primitives for private list manipulations
delayacct: fix uapi timespec64 definition
panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
netclassid: use thread_group_leader(p) in update_classid_task()
RDMA/umem: don't abuse current->group_leader
drm/pan*: don't abuse current->group_leader
drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
drm/amdgpu: don't abuse current->group_leader
android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
android/binder: don't abuse current->group_leader
kho: skip memoryless NUMA nodes when reserving scratch areas
...
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of
hex.h interfaces to directly #include <linux/hex.h> as part of the process
of putting kernel.h on a diet.
Removing hex.h from kernel.h means that 36K C source files don't have to
pay the price of parsing hex.h for the roughly 120 C source files that
need it.
This change has been build-tested with allmodconfig on most ARCHes. Also,
all users/callers of <linux/hex.h> in the entire source tree have been
updated if needed (if not already #included).
Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Revisit and rework module parameter permissions for AP bus and zcrypt
device drivers.
In general all sysfs permissions for AP bus and zcrypt parameters should be
0444 so that user space tools like lszcrypt can read the current value of
module parameters.
Some exceptions are only for some internal tweak parameters like
ap_msg_pool_min_items and zcrypt_mempool_threshold which should only be
readable by an administrator.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The KMSG_COMPONENT macro is a leftover of the s390 specific "kernel
message catalog" which never made it upstream.
Remove the macro in order to get rid of a pointless indirection. Replace
all users with the string it defines. In almost all cases this leads to a
simple replacement like this:
- #define KMSG_COMPONENT "appldata"
- #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+ #define pr_fmt(fmt) "appldata: " fmt
Except for some special cases this is just mechanical/scripted work.
Acked-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Introduce a restriction for the driver_override feature versus apmask
and aqmask:
- driver_override is only allowed when the apmask and aqmask values
both are default (=0xffff..ffff).
- apmask and aqmask modifications are only allowed when there is no
driver_override on any AP device active.
So in the end the user is restricted to choose to either use
apmask/apmask to divide the AP devices into host owned and vfio owned
or use the driver_override feature but not mix these two approaches.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The mutex ap_perms_mutex was already used not only for protection
of the struct ap_perms ap_perms variable but also for an consistent
update of the AP bus sysfs attributes apmask and aqmask.
So rename this mutex to ap_attr_mutex which better reflects the
current use. This is also a preparation for an upcoming patch which
will use this mutex to lock updates on a new sysfs attribute.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add a new sysfs attribute driver_override the AP queue's
directory. Writing in a string overrides the default driver
determination and the drivers are matched against this string
instead. This overrules the driver binding determined by the
apmask/aqmask bitmask fields.
According to the common understanding of how the driver_override
behavior shall work, there is no further checking done. Neither about
the string which is given as override driver nor if this device is
currently in use by an mdev device. Another patch may limit this
behavior to refuse a mixed usage of the driver_override and
apmask/aqmask feature.
As there exists some tooling for this kind of driver_override
(see package driverctl) the AP bus behavior for re-binding
should be compatible to this. The steps for a driver_override are:
1) unbind the current driver from the device. For example
echo "17.0005" > /sys/devices/ap/card17/17.0005/driver/unbind
2) set the new driver for this device in the sysfs
driver_override attribute. For example
echo "vfio_ap" > /sys//devices/ap/card17/17.0005/driver_override
3) trigger a bus reprobe of this device. For example
echo "17.0005" > /sys/bus/ap/drivers_probe
With the driverctl package this is more comfortable and
the settings get persisted:
driverctl -b ap set-override 17.0005 vfio_ap
and unset with
driverctl -b ap unset-override 17.0005
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
For the in_use() check of an updated apmask the host's aqmask
was provided to the vfio function. Similar on an update of the
aqmask the host's apmask was provided to the vfio in_use()
function. This led to false results on the check for apmask or
aqmask updates. For example with only one APQN when exactly
this card is tried to be re-assigned back to the host, the
in_use() check did not complain.
The correct behavior is achieved with providing a full mask
for aqmask when an adapter is to be checked and similar a full
mask for aqmask when a domain is to be checked for usage.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
If no AP instructions are available the AP bus module leaks registered
debug feature files. Change function call order to fix this.
Fixes: cccd85bfb7 ("s390/zcrypt: Rework debug feature invocations.")
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The AP bus udev event BINDINGS=complete is sent out when the
first time all devices detected by the AP bus scan have been
bound to device drivers. This is the ideal time to for example
change the AP bus masks apmask and aqmask to re-establish a
persistent change on the decision about which cards/domains
should be available for the host and which should go into the
pool for kvm guests.
However, if exactly this initial udev event is sent out early
in the boot process a udev rule may not have been established
yet and thus this event will never be recognized. To have
some indication about if the AP bus binding complete has
already happened, the internal ap_bindings_complete_count
counter is exposed via sysfs with this patch.
Suggested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Explicitly include <linux/export.h> in files which contain an
EXPORT_SYMBOL().
See commit a934a57a42 ("scripts/misc-check: check missing #include
<linux/export.h> when W=1") for more details.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Introduce a new flag parameter for the both cprb send functions
zcrypt_send_cprb() and zcrypt_send_ep11_cprb(). This new
xflags parameter ("execution flags") shall be used to provide
execution hints and flags for this crypto request.
There are two flags implemented to be used with these functions:
* ZCRYPT_XFLAG_USERSPACE - indicates to the lower layers that
all the ptrs address userspace. So when construction the ap msg
copy_from_user() is to be used. If this flag is NOT set, the ptrs
address kernel memory and thus memcpy() is to be used.
* ZCRYPT_XFLAG_NOMEMALLOC - indicates that this task must not
allocate memory which may be allocated with io operations.
For the AP bus and zcrypt message layer this means:
* The ZCRYPT_XFLAG_USERSPACE is mapped to the already existing
bool variable "userspace" which is propagated to the zcrypt
proto implementations.
* The ZCRYPT_XFLAG_NOMEMALLOC results in setting the AP flag
AP_MSG_FLAG_MEMPOOL when the AP msg buffer is initialized.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-6-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
There is a need for a do-not-allocate-memory path through the AP bus
layer. The pkey layer may be triggered via the in-kernel interface
from a protected key crypto algorithm (namely PAES) to convert a
secure key into a protected key. This happens in a workqueue context,
so sleeping is allowed but memory allocations causing IO operations
are not permitted.
To accomplish this, an AP message memory pool with pre-allocated space
is established. When ap_init_apmsg() with use_mempool set to true is
called, instead of kmalloc() the ap message buffer is allocated from
the ap_msg_pool. This pool only holds a limited amount of buffers:
ap_msg_pool_min_items with the item size AP_DEFAULT_MAX_MSG_SIZE and
exactly one of these items (if available) is returned if
ap_init_apmsg() with the use_mempool arg set to true is called. When
this pool is exhausted and use_mempool is set true, ap_init_apmsg()
returns -ENOMEM without any attempt to allocate memory and the caller
has to deal with that.
Default values for this mempool of ap messages is:
* Each buffer is 12KB (that is the default AP bus size
and all the urgent messages should fit into this space).
* Minimum items held in the pool is 8. This value is adjustable
via module parameter ap.msgpool_min_items.
The zcrypt layer may use this flag to indicate to the ap bus that the
processing path for this message should not allocate memory but should
use pre-allocated memory buffer instead. This is to prevent deadlocks
with crypto and io for example with encrypted swap volumes.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-4-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Slight rework on the way how AP message buffers are allocated.
Instead of having multiple places with kmalloc() calls all
the AP message buffers are now allocated and freed on exactly
one place: ap_init_apmsg() allocates the current AP bus max
limit of ap_max_msg_size (defaults to 12KB). The AP message
buffer is then freed in ap_release_apmsg().
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20250424133619.16495-3-freude@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Pull s390 updates from Vasily Gorbik:
- Add sorting of mcount locations at build time
- Rework uaccess functions with C exception handling to shorten inline
assembly size and enable full inlining. This yields near-optimal code
for small constant copies with a ~40kb kernel size increase
- Add support for a configurable STRICT_MM_TYPECHECKS which allows to
generate better code, but also allows to have type checking for debug
builds
- Optimize get_lowcore() for common callers with alternatives that
nearly revert to the pre-relocated lowcore code, while also slightly
reducing syscall entry and exit time
- Convert MACHINE_HAS_* checks for single facility tests into cpu_has_*
style macros that call test_facility(), and for features with
additional conditions, add a new ALT_TYPE_FEATURE alternative to
provide a static branch via alternative patching. Also, move machine
feature detection to the decompressor for early patching and add
debugging functionality to easily show which alternatives are patched
- Add exception table support to early boot / startup code to get rid
of the open coded exception handling
- Use asm_inline for all inline assemblies with EX_TABLE or ALTERNATIVE
to ensure correct inlining and unrolling decisions
- Remove 2k page table leftovers now that s390 has been switched to
always allocate 4k page tables
- Split kfence pool into 4k mappings in arch_kfence_init_pool() and
remove the architecture-specific kfence_split_mapping()
- Use READ_ONCE_NOCHECK() in regs_get_kernel_stack_nth() to silence
spurious KASAN warnings from opportunistic ftrace argument tracing
- Force __atomic_add_const() variants on s390 to always return void,
ensuring compile errors for improper usage
- Remove s390's ioremap_wt() and pgprot_writethrough() due to
mismatched semantics and lack of known users, relying on asm-generic
fallbacks
- Signal eventfd in vfio-ap to notify userspace when the guest AP
configuration changes, including during mdev removal
- Convert mdev_types from an array to a pointer in vfio-ccw and vfio-ap
drivers to avoid fake flex array confusion
- Cleanup trap code
- Remove references to the outdated linux390@de.ibm.com address
- Other various small fixes and improvements all over the code
* tag 's390-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (78 commits)
s390: Use inline qualifier for all EX_TABLE and ALTERNATIVE inline assemblies
s390/kfence: Split kfence pool into 4k mappings in arch_kfence_init_pool()
s390/ptrace: Avoid KASAN false positives in regs_get_kernel_stack_nth()
s390/boot: Ignore vmlinux.map
s390/sysctl: Remove "vm/allocate_pgste" sysctl
s390: Remove 2k vs 4k page table leftovers
s390/tlb: Use mm_has_pgste() instead of mm_alloc_pgste()
s390/lowcore: Use lghi instead llilh to clear register
s390/syscall: Merge __do_syscall() and do_syscall()
s390/spinlock: Implement SPINLOCK_LOCKVAL with inline assembly
s390/smp: Implement raw_smp_processor_id() with inline assembly
s390/current: Implement current with inline assembly
s390/lowcore: Use inline qualifier for get_lowcore() inline assembly
s390: Move s390 sysctls into their own file under arch/s390
s390/syscall: Simplify syscall_get_arguments()
s390/vfio-ap: Notify userspace that guest's AP config changed when mdev removed
s390: Remove ioremap_wt() and pgprot_writethrough()
s390/mm: Add configurable STRICT_MM_TYPECHECKS
s390/mm: Convert pgste_val() into function
s390/mm: Convert pgprot_val() into function
...
Move machine type detection to the decompressor and use static branches
to implement and use machine_is_[lpar|vm|kvm]() instead of a runtime check
via MACHINE_IS_[LPAR|VM|KVM].
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The result of xchg() is not used, and in addition it is used on a one byte
memory area which leads to inefficient code.
Use WRITE_ONCE() instead to achieve the same result with much less
generated code.
Acked-by: Harald Freudenberger <freude@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
A crypto card comes in 3 flavors: accelerator, CCA co-processor or
EP11 co-processor. Within a protected execution environment only the
accelerator and EP11 co-processor is supported. However, it is
possible to set up a KVM guest with a CCA card and run it as a
protected execution guest. There is nothing at the host side which
prevents this. Within such a guest, a CCA card is shown as "illicit"
and you can't do anything with such a crypto card.
Regardless of the unsupported CCA card within a protected execution
guest there are a couple of user space applications which
unconditional try to run crypto requests to the zcrypt device
driver. There was a bug within the AP bus code which allowed such a
request to be forwarded to a CCA card where it is finally
rejected and the driver reacts with -ENODEV but also triggers an AP
bus scan. Together with a retry loop this caused some kind of "hang"
of the KVM guest. On startup it caused timeouts and finally led the
KVM guest startup fail. Fix that by closing the gap and make sure a
CCA card is not usable within a protected execution environment.
Another behavior within an protected execution environment with CCA
cards was that the se_bind and se_associate AP queue sysfs attributes
where shown. The implementation unconditional always added these
attributes. Fix that by checking if the card mode is supported within
a protected execution environment and only if valid, add the attribute
group.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Pull s390 updates from Vasily Gorbik:
- Optimize ftrace and kprobes code patching and avoid stop machine for
kprobes if sequential instruction fetching facility is available
- Add hiperdispatch feature to dynamically adjust CPU capacity in
vertical polarization to improve scheduling efficiency and overall
performance. Also add infrastructure for handling warning track
interrupts (WTI), allowing for graceful CPU preemption
- Rework crypto code pkey module and split it into separate,
independent modules for sysfs, PCKMO, CCA, and EP11, allowing modules
to load only when the relevant hardware is available
- Add hardware acceleration for HMAC modes and the full AES-XTS cipher,
utilizing message-security assist extensions (MSA) 10 and 11. It
introduces new shash implementations for HMAC-SHA224/256/384/512 and
registers the hardware-accelerated AES-XTS cipher as the preferred
option. Also add clear key token support
- Add MSA 10 and 11 processor activity instrumentation counters to perf
and update PAI Extension 1 NNPA counters
- Cleanup cpu sampling facility code and rework debug/WARN_ON_ONCE
statements
- Add support for SHA3 performance enhancements introduced with MSA 12
- Add support for the query authentication information feature of MSA
13 and introduce the KDSA CPACF instruction. Provide query and query
authentication information in sysfs, enabling tools like cpacfinfo to
present this data in a human-readable form
- Update kernel disassembler instructions
- Always enable EXPOLINE_EXTERN if supported by the compiler to ensure
kpatch compatibility
- Add missing warning handling and relocated lowcore support to the
early program check handler
- Optimize ftrace_return_address() and avoid calling unwinder
- Make modules use kernel ftrace trampolines
- Strip relocs from the final vmlinux ELF file to make it roughly 2
times smaller
- Dump register contents and call trace for early crashes to the
console
- Generate ptdump address marker array dynamically
- Fix rcu_sched stalls that might occur when adding or removing large
amounts of pages at once to or from the CMM balloon
- Fix deadlock caused by recursive lock of the AP bus scan mutex
- Unify sync and async register save areas in entry code
- Cleanup debug prints in crypto code
- Various cleanup and sanitizing patches for the decompressor
- Various small ftrace cleanups
* tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (84 commits)
s390/crypto: Display Query and Query Authentication Information in sysfs
s390/crypto: Add Support for Query Authentication Information
s390/crypto: Rework RRE and RRF CPACF inline functions
s390/crypto: Add KDSA CPACF Instruction
s390/disassembler: Remove duplicate instruction format RSY_RDRU
s390/boot: Move boot_printk() code to own file
s390/boot: Use boot_printk() instead of sclp_early_printk()
s390/boot: Rename decompressor_printk() to boot_printk()
s390/boot: Compile all files with the same march flag
s390: Use MARCH_HAS_*_FEATURES defines
s390: Provide MARCH_HAS_*_FEATURES defines
s390/facility: Disable compile time optimization for decompressor code
s390/boot: Increase minimum architecture to z10
s390/als: Remove obsolete comment
s390/sha3: Fix SHA3 selftests failures
s390/pkey: Add AES xts and HMAC clear key token support
s390/cpacf: Add MSA 10 and 11 new PCKMO functions
s390/mm: Add cond_resched() to cmm_alloc/free_pages()
s390/pai_ext: Update PAI extension 1 counters
s390/pai_crypto: Add support for MSA 10 and 11 pai counters
...
There is a possibility to deadlock with an recursive
lock of the AP bus scan mutex ap_scan_bus_mutex:
... kernel: ============================================
... kernel: WARNING: possible recursive locking detected
... kernel: 5.14.0-496.el9.s390x #3 Not tainted
... kernel: --------------------------------------------
... kernel: kworker/12:1/130 is trying to acquire lock:
... kernel: 0000000358bc1510 (ap_scan_bus_mutex){+.+.}-{3:3}, at: ap_bus_force_rescan+0x92/0x108
... kernel:
but task is already holding lock:
... kernel: 0000000358bc1510 (ap_scan_bus_mutex){+.+.}-{3:3}, at: ap_scan_bus_wq_callback+0x28/0x60
... kernel:
other info that might help us debug this:
... kernel: Possible unsafe locking scenario:
... kernel: CPU0
... kernel: ----
... kernel: lock(ap_scan_bus_mutex);
... kernel: lock(ap_scan_bus_mutex);
... kernel:
*** DEADLOCK ***
Here is how the callstack looks like:
... [<00000003576fe9ce>] process_one_work+0x2a6/0x748
... [<0000000358150c00>] ap_scan_bus_wq_callback+0x40/0x60 <- mutex locked
... [<00000003581506e2>] ap_scan_bus+0x5a/0x3b0
... [<000000035815037c>] ap_scan_adapter+0x5b4/0x8c0
... [<000000035814fa34>] ap_scan_domains+0x2d4/0x668
... [<0000000357d989b4>] device_add+0x4a4/0x6b8
... [<0000000357d9bb54>] bus_probe_device+0xb4/0xc8
... [<0000000357d9daa8>] __device_attach+0x120/0x1b0
... [<0000000357d9a632>] bus_for_each_drv+0x8a/0xd0
... [<0000000357d9d548>] __device_attach_driver+0xc0/0x140
... [<0000000357d9d3d8>] driver_probe_device+0x40/0xf0
... [<0000000357d9cec2>] really_probe+0xd2/0x460
... [<000000035814d7b0>] ap_device_probe+0x150/0x208
... [<000003ff802a5c46>] zcrypt_cex4_queue_probe+0xb6/0x1c0 [zcrypt_cex4]
... [<000003ff7fb2d36e>] zcrypt_queue_register+0xe6/0x1b0 [zcrypt]
... [<000003ff7fb2c8ac>] zcrypt_rng_device_add+0x94/0xd8 [zcrypt]
... [<0000000357d7bc52>] hwrng_register+0x212/0x228
... [<0000000357d7b8c2>] add_early_randomness+0x102/0x110
... [<000003ff7fb29c94>] zcrypt_rng_data_read+0x94/0xb8 [zcrypt]
... [<0000000358150aca>] ap_bus_force_rescan+0x92/0x108
... [<0000000358177572>] mutex_lock_interruptible_nested+0x32/0x40 <- lock again
Note this only happens when the very first random data providing
crypto card appears via hot plug in the system AND is in disabled
state ("deconfig"). Then the initial pull of random data fails and
a re-scan of the AP bus is triggered while already in the middle
of an AP bus scan caused by the appearing new hardware.
The fix is relatively simple once the scenario us understood:
The AP bus force rescan function will immediately return if there
is currently an AP bus scan running with the very same thread id.
Fixes: eacf5b3651 ("s390/ap: introduce mutex to lock the AP bus scan")
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The dynamic debugging provides function names on request. So remove
all explicit function strings.
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
With the rework of the AP bus scan and the introduction of
a bindings complete completion also the timing until the
userspace finally receives a AP bus binding complete uevent
had increased. Unfortunately this event triggers some important
jobs for preparation of KVM guests, for example the modification
of card/queue masks to reassign AP resources to the alternate
AP queue device driver (vfio_ap) which is the precondition
for building mediated devices which may be a precondition for
starting KVM guests using AP resources.
This small fix now triggers the check for binding complete
each time an AP device driver has registered. With this patch
the bindings complete may be posted up to 30s earlier as there
is no need to wait for the next AP bus scan any more.
Fixes: 778412ab91 ("s390/ap: rearm APQNs bindings complete completion")
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Cc: stable@vger.kernel.org
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In the match() callback, the struct device_driver * should not be
changed, so change the function callback to be a const *. This is one
step of many towards making the driver core safe to have struct
device_driver in read-only memory.
Because the match() callback is in all busses, all busses are modified
to handle this properly. This does entail switching some container_of()
calls to container_of_const() to properly handle the constant *.
For some busses, like PCI and USB and HV, the const * is cast away in
the match callback as those busses do want to modify those structures at
this point in time (they have a local lock in the driver structure.)
That will have to be changed in the future if they wish to have their
struct device * in read-only-memory.
Cc: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Alex Elder <elder@kernel.org>
Acked-by: Sumit Garg <sumit.garg@linaro.org>
Link: https://lore.kernel.org/r/2024070136-wrongdoer-busily-01e8@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With the mentioned commit (see the fixes tag) on every AP bus scan an
uevent "AP bus change bindings complete" is emitted. Furthermore if an AP
device switched from one driver to another, for example by manipulating the
apmask, there was never a "bindings complete" uevent generated.
The "bindings complete" event should be sent once when all AP devices have
been bound to device drivers and again if unbind/bind actions take place
and finally all AP devices are bound again. Therefore implement this.
Fixes: 778412ab91 ("s390/ap: rearm APQNs bindings complete completion")
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
A system crash like this
Failing address: 200000cb7df6f000 TEID: 200000cb7df6f403
Fault in home space mode while using kernel ASCE.
AS:00000002d71bc007 R3:00000003fe5b8007 S:000000011a446000 P:000000015660c13d
Oops: 0038 ilc:3 [#1] PREEMPT SMP
Modules linked in: mlx5_ib ...
CPU: 8 PID: 7556 Comm: bash Not tainted 6.9.0-rc7 #8
Hardware name: IBM 3931 A01 704 (LPAR)
Krnl PSW : 0704e00180000000 0000014b75e7b606 (ap_parse_bitmap_str+0x10e/0x1f8)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000001 ffffffffffffffc0 0000000000000001 00000048f96b75d3
000000cb00000100 ffffffffffffffff ffffffffffffffff 000000cb7df6fce0
000000cb7df6fce0 00000000ffffffff 000000000000002b 00000048ffffffff
000003ff9b2dbc80 200000cb7df6fcd8 0000014bffffffc0 000000cb7df6fbc8
Krnl Code: 0000014b75e7b5fc: a7840047 brc 8,0000014b75e7b68a
0000014b75e7b600: 18b2 lr %r11,%r2
#0000014b75e7b602: a7f4000a brc 15,0000014b75e7b616
>0000014b75e7b606: eb22d00000e6 laog %r2,%r2,0(%r13)
0000014b75e7b60c: a7680001 lhi %r6,1
0000014b75e7b610: 187b lr %r7,%r11
0000014b75e7b612: 84960021 brxh %r9,%r6,0000014b75e7b654
0000014b75e7b616: 18e9 lr %r14,%r9
Call Trace:
[<0000014b75e7b606>] ap_parse_bitmap_str+0x10e/0x1f8
([<0000014b75e7b5dc>] ap_parse_bitmap_str+0xe4/0x1f8)
[<0000014b75e7b758>] apmask_store+0x68/0x140
[<0000014b75679196>] kernfs_fop_write_iter+0x14e/0x1e8
[<0000014b75598524>] vfs_write+0x1b4/0x448
[<0000014b7559894c>] ksys_write+0x74/0x100
[<0000014b7618a440>] __do_syscall+0x268/0x328
[<0000014b761a3558>] system_call+0x70/0x98
INFO: lockdep is turned off.
Last Breaking-Event-Address:
[<0000014b75e7b636>] ap_parse_bitmap_str+0x13e/0x1f8
Kernel panic - not syncing: Fatal exception: panic_on_oops
occured when /sys/bus/ap/a[pq]mask was updated with a relative mask value
(like +0x10-0x12,+60,-90) with one of the numeric values exceeding INT_MAX.
The fix is simple: use unsigned long values for the internal variables. The
correct checks are already in place in the function but a simple int for
the internal variables was used with the possibility to overflow.
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Tested-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The direct dependency of chsc and the AP bus prevents the
modularization of ap bus. Introduce a notifier interface for AP
changes, which decouples the producer of the change events (chsc) from
the consumer (ap_bus).
Remove the ap_cfg_chg() interface and replace it with the notifier
invocation. The ap bus module registers a notification handler, which
triggers the AP bus scan.
Cc: Vineeth Vijayan <vneethv@linux.ibm.com>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Acked-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The IRQ handler may rely on the bus or the root device. Register the
adapter IRQ after setting up the bus and the root device to avoid any
race conditions.
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Rework the ap initialization and add missing cleanups to the error path.
Errors during the registration of IRQ handler is now also detected.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Since qci is available on most of the current machines, move away from
the dynamic buffers for qci information and store it instead in a
statically defined buffer.
The new flags member in struct ap_config_info is now used as an
indicator, if qci is available in the system (at least one of these
bits is set).
Suggested-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Rework the invocations around ap_scan_bus():
- Protect ap_scan_bus() with a mutex to make sure only one
scan at a time is running.
- The workqueue invocation which is triggered by either the
module init or via AP bus scan timer expiration uses this
mutex and if there is already a scan running, the work
is simple aborted (as the job is done by another task).
- The ap_bus_force_rescan() which is invoked by higher level
layers mostly on failures which indicate a bus scan may
help is reworked to call ap_scan_bus() direct instead of
enqueuing work into a system workqueue and waiting for that
to finish. Of course the mutex is respected and in case of
another task already running a bus scan the shortcut of
waiting for this scan to finish and reusing the scan result
is taken.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The AP scan bus function now returns true if there have
been any config changes detected. This will become
important in a follow up patch which will exploit this
hint for further actions. This also required to have
the AP scan bus timer callback reworked as the function
signature has changed to bool ap_scan_bus(void).
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
This patch tries to clarify the functions and variables
around the AP scan bus job. All these variables and
functions start with ap_scan_bus and are declared in
one place now.
No functional changes in this patch - only renaming and
move of code or declarations.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The APQN bindings complete completion was used to reflect
that 1st the AP bus initial scan is done and 2nd all the
detected APQNs have been bound to a device driver.
This was a single-shot action. However, as the AP bus
supports hot-plug it may be that new APQNs appear reflected
as new AP queue and card devices which need to be bound
to appropriate device drivers. So the condition that
all existing AP queue devices are bound to device drivers
may go away for a certain time.
This patch now checks during AP bus scan for maybe new AP
devices appearing and does a re-init of the internal completion
variable. So the AP bus function ap_wait_apqn_bindings_complete()
now may block on this condition variable even later after
initial scan is through when new APQNs appear which need to
get bound.
This patch also moves the check for binding complete invocation
from the probe function to the end of the AP bus scan function.
This change also covers some weird scenarios where during a
card hotplug the binding of the card device was sufficient for
binding complete but the queue devices where still in the
process of being discovered.
As of now this change has no impact on existing code. The
behavior change in the now later bindings complete should not
impact any code (and has been tested so far). The only
exploiter is the zcrypt function zcrypt_wait_api_operational()
which only initial calls ap_wait_apqn_bindings_complete().
However, this new behavior of the AP bus wait for APQNs bindings
complete function will be used in a later patch exploiting
this for the zcrypt API layer.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The ap_bus is using inline functions of the ultravisor (uv) in-kernel
API. The related header file is implicitly included via several other
headers. Replace this by an explicit include of the ultravisor header
in the ap_bus file.
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
This patch replaces all the s390 debug feature calls with
debug level by dynamic debug calls pr_debug. These calls
are much more flexible and each single invocation can get
enabled/disabled at runtime wheres the s390 debug feature
debug calls have only one knob - enable or disable all in
one bunch. The benefit is especially significant with
high frequency called functions like the AP bus scan. In
most debugging scenarios you don't want and need them, but
sometimes it is crucial to know exactly when and how long
the AP bus scan took.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
This patch harmonizes the calls and defines around the
s390 debug feature as it is used in the AP bus and
zcrypt device driver code.
More or less cleanup and renaming, no functional changes.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
This patch addresses some weird scenarios where an outband
manipulation of the SE bind state of a queue assigned and
maybe in use by an SE guest with AP pass-through support
took place. So for example when the guest has bound and
associated a queue and then this domain has been zeroed on
the service element.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
As of now the AP card struct held only part of the
queue's hwinfo (that is the GR2 register content returned
with an TAPQ invocation). This patch reworks struct ap_card
to hold the whole hwinfo now.
As there is a nice bit field union on top of this
ap_tapq_hwinfo struct, all the ugly bit checkings can
now get replaced by simple evaluations of the required
bit field.
Suggested-by: Ingo Franzki <ifranzki@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
A secure execution (SE, also known as confidential computing)
guest may see asynchronous errors on a crypto firmware queue.
The current implementation to gather information about cards
and queues in ap_queue_info() simple returns if an asynchronous
error is hanging on the firmware queue. If such a situation
happened and it was the only queue visible for a crypto card
within an SE guest, then the card vanished from sysfs as
the AP bus scan function refuses to hold a card without any
type information. As lszcrypt evaluates the sysfs such
a card vanished from the lszcrypt card listing and the
user is baffled and has no way to reset and thus clear the
pending asynchronous error.
This patch improves the named function to also evaluate GR2
of the TAPQ in case of asynchronous error pending. If there
is a not-null value stored in, the info is processed now.
In the end, a queue with pending asynchronous error does not
lead to a vanishing card any more.
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Fix kernel crash in AP bus code caused by very early invocation of the
config change callback function via SCLP.
After a fresh IML of the machine the crypto cards are still offline and
will get switched online only with activation of any LPAR which has the
card in it's configuration. A crypto card coming online is reported
to the LPAR via SCLP and the AP bus offers a callback function to get
this kind of information. However, it may happen that the callback is
invoked before the AP bus init function is complete. As the callback
triggers a synchronous AP bus scan, the scan may already run but some
internal states are not initialized by the AP bus init function resulting
in a crash like this:
[ 11.635859] Unable to handle kernel pointer dereference in virtual kernel address space
[ 11.635861] Failing address: 0000000000000000 TEID: 0000000000000887
[ 11.635862] Fault in home space mode while using kernel ASCE.
[ 11.635864] AS:00000000894c4007 R3:00000001fece8007 S:00000001fece7800 P:000000000000013d
[ 11.635879] Oops: 0004 ilc:1 [#1] SMP
[ 11.635882] Modules linked in:
[ 11.635884] CPU: 5 PID: 42 Comm: kworker/5:0 Not tainted 6.6.0-rc3-00003-g4dbf7cdc6b42 #12
[ 11.635886] Hardware name: IBM 3931 A01 751 (LPAR)
[ 11.635887] Workqueue: events_long ap_scan_bus
[ 11.635891] Krnl PSW : 0704c00180000000 0000000000000000 (0x0)
[ 11.635895] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 11.635897] Krnl GPRS: 0000000001000a00 0000000000000000 0000000000000006 0000000089591940
[ 11.635899] 0000000080000000 0000000000000a00 0000000000000000 0000000000000000
[ 11.635901] 0000000081870c00 0000000089591000 000000008834e4e2 0000000002625a00
[ 11.635903] 0000000081734200 0000038000913c18 000000008834c6d6 0000038000913ac8
[ 11.635906] Krnl Code:>0000000000000000: 0000 illegal
[ 11.635906] 0000000000000002: 0000 illegal
[ 11.635906] 0000000000000004: 0000 illegal
[ 11.635906] 0000000000000006: 0000 illegal
[ 11.635906] 0000000000000008: 0000 illegal
[ 11.635906] 000000000000000a: 0000 illegal
[ 11.635906] 000000000000000c: 0000 illegal
[ 11.635906] 000000000000000e: 0000 illegal
[ 11.635915] Call Trace:
[ 11.635916] [<0000000000000000>] 0x0
[ 11.635918] [<000000008834e4e2>] ap_queue_init_state+0x82/0xb8
[ 11.635921] [<000000008834ba1c>] ap_scan_domains+0x6fc/0x740
[ 11.635923] [<000000008834c092>] ap_scan_adapter+0x632/0x8b0
[ 11.635925] [<000000008834c3e4>] ap_scan_bus+0xd4/0x288
[ 11.635927] [<00000000879a33ba>] process_one_work+0x19a/0x410
[ 11.635930] Discipline DIAG cannot be used without z/VM
[ 11.635930] [<00000000879a3a2c>] worker_thread+0x3fc/0x560
[ 11.635933] [<00000000879aea60>] kthread+0x120/0x128
[ 11.635936] [<000000008792afa4>] __ret_from_fork+0x3c/0x58
[ 11.635938] [<00000000885ebe62>] ret_from_fork+0xa/0x30
[ 11.635942] Last Breaking-Event-Address:
[ 11.635942] [<000000008834c6d4>] ap_wait+0xcc/0x148
This patch improves the ap_bus_force_rescan() function which is
invoked by the config change callback by checking if a first
initial AP bus scan has been done. If not, the force rescan request
is simple ignored. Anyhow it does not make sense to trigger AP bus
re-scans even before the very first bus scan is complete.
Cc: stable@vger.kernel.org
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
On a state toggle from config off to config on and on the
state toggle from checkstop to not checkstop the queue's
internal states was set but the state machine was not
nudged. This did not care as on the first enqueue of a
request the state machine kick ran.
However, within an Secure Execution guest a queue is
only chosen by the scheduler when it has been bound.
But to bind a queue, it needs to run through the initial
states (reset, enable interrupts, ...). So this is like
a chicken-and-egg problem and the result was in fact
that a queue was unusable after a config off/on toggle.
With some slight rework of the handling of these states
now the new function _ap_queue_init_state() is called
which is the core of the ap_queue_init_state() function
but without locking handling. This has the benefit that
it can be called on all the places where a (re-)init
of the AP queue's state machine is needed.
Fixes: 2d72eaf036 ("s390/ap: implement SE AP bind, unbind and associate")
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Secure execution guest environments require an empty pinblob in all
key generation and unwrap requests. Empty pinblobs are only available
in EP11 API ordinal 6 or higher.
Add an empty pinblob to key generation and unwrap requests, if the AP
secure binding facility is available. In all other cases, stay with
the empty pin tag (no pinblob) and the current API ordinals.
The EP11 API ordinal also needs to be considered when the pkey module
tries to figure out the list of eligible cards for key operations
with protected keys in secure execution environment.
These changes are transparent to userspace but required for running
an secure execution guest with handling key generate and key derive
(e.g. secure key to protected key) correct. Especially using EP11
secure keys with the kernel dm-crypt layer requires this patch.
Co-developed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Ingo Franzki <ifranzki@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Remove the legacy device driver code for CEX2 and CEX3 cards.
The last machines which are able to handle CEX2 crypto cards
are z10 EC first available 2008 and z10 BC first available 2009.
The last machines able to handle a CEX3 crypto card are
z196 first available 2010 and z114 first available 2011.
Please note that this does not imply to drop CEX2 and CEX3
support in general. With older kernels on hardware up to the
aforementioned machine models these crypto cards will get
support by IBM.
The removal of the CEX2 and CEX3 device drivers code opens up
some simplifications, for example support for crypto cards
without rng support can be removed also.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>