Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "The biggest changes are MPAM enablement in drivers/resctrl and new PMU
  support under drivers/perf.

   On the core side, FEAT_LSUI lets futex atomic operations run with EL0
   permissions, avoiding PAN toggling.

  The rest is mostly TLB invalidation refactoring, further generic entry
  work, sysreg updates and a few fixes.

  Core features:

   - Add support for FEAT_LSUI, allowing futex atomic operations without
     toggling Privileged Access Never (PAN)

   - Further refactor the arm64 exception handling code towards the
     generic entry infrastructure

   - Optimise __READ_ONCE() with CONFIG_LTO=y and allow alias analysis
     through it

  Memory management:

   - Refactor the arm64 TLB invalidation API and implementation for
     better control over barrier placement and level-hinted invalidation

   - Enable batched TLB flushes during memory hot-unplug

   - Fix rodata=full block mapping support for realm guests (when
     BBML2_NOABORT is available)

  Perf and PMU:

   - Add support for a whole bunch of system PMUs featured in NVIDIA's
     Tegra410 SoC (cspmu extensions for the fabric and PCIe, new drivers
     for CPU/C2C memory latency PMUs)

   - Clean up iomem resource handling in the Arm CMN driver

   - Fix signedness handling of AA64DFR0.{PMUVer,PerfMon}

  MPAM (Memory Partitioning And Monitoring):

   - Add architecture context-switch and hiding of the feature from KVM

   - Add interface to allow MPAM to be exposed to user-space using
     resctrl

   - Add errata workaround for some existing platforms

   - Add documentation for using MPAM and what shape of platforms can
     use resctrl

  Miscellaneous:

   - Check DAIF (and PMR, where relevant) at task-switch time

   - Skip TFSR_EL1 checks and barriers in synchronous MTE tag check mode
     (only relevant to asynchronous or asymmetric tag check modes)

   - Remove a duplicate allocation in the kexec code

   - Remove redundant save/restore of SCS SP on entry to/from EL0

   - Generate the KERNEL_HWCAP_ definitions from the arm64 hwcap
     descriptions

   - Add kselftest coverage for cmpbr_sigill()

   - Update sysreg definitions"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (109 commits)
  arm64: rsi: use linear-map alias for realm config buffer
  arm64: Kconfig: fix duplicate word in CMDLINE help text
  arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode
  arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12
  arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps
  arm64: kexec: Remove duplicate allocation for trans_pgd
  ACPI: AGDI: fix missing newline in error message
  arm64: Check DAIF (and PMR) at task-switch time
  arm64: entry: Use split preemption logic
  arm64: entry: Use irqentry_{enter_from,exit_to}_kernel_mode()
  arm64: entry: Consistently prefix arm64-specific wrappers
  arm64: entry: Don't preempt with SError or Debug masked
  entry: Split preemption from irqentry_exit_to_kernel_mode()
  entry: Split kernel mode logic from irqentry_{enter,exit}()
  entry: Move irqentry_enter() prototype later
  entry: Remove local_irq_{enable,disable}_exit_to_user()
  ...
This commit is contained in:
Linus Torvalds
2026-04-14 16:48:56 -07:00
82 changed files with 7100 additions and 943 deletions


@@ -24,7 +24,8 @@ Performance monitor support
thunderx2-pmu
alibaba_pmu
dwc_pcie_pmu
-nvidia-pmu
+nvidia-tegra241-pmu
+nvidia-tegra410-pmu
meson-ddr-pmu
cxl
ampere_cspmu


@@ -1,8 +1,8 @@
-=========================================================
-NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
-=========================================================
+============================================================
+NVIDIA Tegra241 SoC Uncore Performance Monitoring Unit (PMU)
+============================================================
-The NVIDIA Tegra SoC includes various system PMUs to measure key performance
+The NVIDIA Tegra241 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:
* Scalable Coherency Fabric (SCF)


@@ -0,0 +1,522 @@
=====================================================================
NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
=====================================================================
The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:
* Unified Coherence Fabric (UCF)
* PCIE
* PCIE-TGT
* CPU Memory (CMEM) Latency
* NVLink-C2C
* NV-CLink
* NV-DLink
PMU Driver
----------
The PMU driver describes the available events and configuration of each PMU in
sysfs. Please see the sections below for the sysfs path of each PMU. Like
other uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to
show the CPU id used to handle the PMU events. There is also an
"associated_cpus" sysfs attribute, which contains a list of CPUs associated
with the PMU instance.
UCF PMU
-------
The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
distributed last-level cache for CPU and CXL memory, and as a cache-coherent
interconnect that supports hardware coherence across multiple coherently
caching agents, including:
* CPU clusters
* GPU
* PCIe Ordering Controller Unit (OCU)
* Other IO-coherent requesters
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.
Some of the events available in this PMU can be used to measure bandwidth and
utilization:
* slc_access_rd: count the number of read requests to SLC.
* slc_access_wr: count the number of write requests to SLC.
* slc_bytes_rd: count the number of bytes transferred by slc_access_rd.
* slc_bytes_wr: count the number of bytes transferred by slc_access_wr.
* mem_access_rd: count the number of read requests to local or remote memory.
* mem_access_wr: count the number of write requests to local or remote memory.
* mem_bytes_rd: count the number of bytes transferred by mem_access_rd.
* mem_bytes_wr: count the number of bytes transferred by mem_access_wr.
* cycles: counts the UCF cycles.
The average bandwidth is calculated as::

    AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
    AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
    AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
    AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

    AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
    AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
    AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
    AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES
More details about what other events are available can be found in the
Tegra410 SoC technical reference manual.
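As a worked example, the bandwidth and request-rate formulas can be evaluated
from raw counter values. This is only an illustration; the counter values
below are made up, not measurements from real hardware:

```shell
# Hypothetical counter values read via 'perf stat' over a 1-second window.
SLC_BYTES_RD=64000000000      # 64 GB read from the SLC
SLC_ACCESS_RD=1000000000      # 1e9 read requests
CYCLES=2000000000             # 2e9 UCF cycles
ELAPSED_TIME_IN_NS=1000000000 # 1 second

# AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
bw=$(awk -v b="$SLC_BYTES_RD" -v t="$ELAPSED_TIME_IN_NS" 'BEGIN { printf "%.1f", b / t }')

# AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
rate=$(awk -v a="$SLC_ACCESS_RD" -v c="$CYCLES" 'BEGIN { printf "%.2f", a / c }')

echo "SLC read bandwidth: ${bw} GB/s, request rate: ${rate} req/cycle"
```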
The events can be filtered based on source or destination. The source filter
indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
remote socket. The destination filter specifies the destination memory type,
e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
local/remote classification of the destination filter is based on the home
socket of the address, not where the data actually resides. The available
filters are described in
/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.
The list of UCF PMU event filters:

* Source filter:

  * src_loc_cpu: if set, count events from local CPU
  * src_loc_noncpu: if set, count events from local non-CPU device
  * src_rem: if set, count events from CPU, GPU, PCIE devices of remote socket

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_other: if set, count events to local CXL memory address
  * dst_rem: if set, count events to CPU, GPU, and CXL memory address of remote socket
If the source is not specified, the PMU will count events from all sources. If
the destination is not specified, the PMU will count events to all destinations.
Example usage:

* Count event id 0x0 in socket 0 from all sources and to all destinations::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0/

* Count event id 0x0 in socket 0 with source filter = local CPU and destination
  filter = local system memory (CMEM)::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/

* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
  destination filter = remote memory::

    perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
PCIE PMU
--------
This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors all read/write traffic from the root port(s)
or a particular BDF in a PCIE RC to local or remote memory. There is one PMU per
PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into
up to 8 root ports. The traffic from each root port can be filtered using RP or
BDF filter. For example, specifying "src_rp_mask=0xFF" means the PMU counter will
capture traffic from all RPs. Please see below for more details.
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.
The events in this PMU can be used to measure bandwidth, utilization, and
latency:
* rd_req: count the number of read requests by PCIE device.
* wr_req: count the number of write requests by PCIE device.
* rd_bytes: count the number of bytes transferred by rd_req.
* wr_bytes: count the number of bytes transferred by wr_req.
* rd_cum_outs: count outstanding rd_req each cycle.
* cycles: count the clock cycles of SOC fabric connected to the PCIE interface.
The average bandwidth is calculated as::

    AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
    AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

    AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
    AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The average latency is calculated as::

    FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
    AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
    AVG_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
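The latency formulas can be checked numerically. The values below are made up
for illustration (a 2 GHz fabric with a 20-cycle average read latency):

```shell
# Hypothetical counters over a 1-second measurement window.
RD_CUM_OUTS=1000000000   # outstanding-request accumulator
RD_REQ=50000000          # number of read requests
CYCLES=2000000000        # fabric cycles (2 GHz over 1 s)
ELAPSED_TIME_IN_NS=1000000000

# FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
# AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
# AVG_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
lat_ns=$(awk -v o="$RD_CUM_OUTS" -v r="$RD_REQ" -v c="$CYCLES" -v t="$ELAPSED_TIME_IN_NS" \
    'BEGIN { printf "%.1f", (o / r) / (c / t) }')
echo "average read latency: ${lat_ns} ns"
```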
The PMU events can be filtered based on the traffic source and destination.
The source filter indicates the PCIE devices that will be monitored. The
destination filter specifies the destination memory type, e.g. local system
memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
classification of the destination filter is based on the home socket of the
address, not where the data actually resides. These filters can be found in
/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
The list of event filters:

* Source filter:

  * src_rp_mask: bitmask of root ports that will be monitored. Each bit in
    this bitmask represents the RP index in the RC. If the bit is set, all
    devices under the associated RP will be monitored. E.g. "src_rp_mask=0xF"
    will monitor devices in root ports 0 to 3.
  * src_bdf: the BDF that will be monitored. This is a 16-bit value that
    follows the formula (bus << 8) + (device << 3) + (function). For example,
    the value of BDF 27:01.1 is 0x2709.
  * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
    "src_bdf" is used to filter the traffic.

  Note that the root-port and BDF filters are mutually exclusive and the PMU
  in each RC can only have one BDF filter shared by all counters. If the BDF
  filter is enabled, its value is applied to all events.
* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_pcie_p2p: if set, count events to local PCIE peer address
  * dst_loc_pcie_cxl: if set, count events to local CXL memory address
  * dst_rem: if set, count events to remote memory address
If the source filter is not specified, the PMU will count events from all root
ports. If the destination filter is not specified, the PMU will count events
to all destinations.
Example usage:

* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/

* Count event id 0x1 from root port 0 and 1 of PCIE RC-1 on socket 0 and
  targeting just local CMEM of socket 0::

    perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/

* Count event id 0x3 from root port 0 and 1 of PCIE RC-3 on socket 1 and
  targeting just local CMEM of socket 1::

    perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/
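The src_bdf encoding can be computed with shell arithmetic. The BDF 02:03.1
below is an arbitrary illustrative value, assuming the standard PCI
(bus << 8) + (device << 3) + function packing described above:

```shell
# Encode BDF 02:03.1 (bus 0x02, device 0x03, function 1) as a 16-bit src_bdf.
bus=0x02; dev=0x03; fn=0x1
src_bdf=$(printf '0x%04x' $(( (bus << 8) | (dev << 3) | fn )))
echo "src_bdf=${src_bdf}"
```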
.. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section:

Mapping the RC# to lspci segment number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mapping the RC# to lspci segment number can be non-trivial; hence a new NVIDIA
Designated Vendor Specific Capability (DVSEC) register is added into the PCIE
config space for each RP. This DVSEC has vendor id "10de" and DVSEC id of
"0x4". The DVSEC register contains the following information to map PCIE
devices under the RP back to its RC#:

- Bus# (byte 0xc): bus number as reported by the lspci output
- Segment# (byte 0xd): segment number as reported by the lspci output
- RP# (byte 0xe): port number as reported by the LnkCap attribute from lspci
  for a device with Root Port capability
- RC# (byte 0xf): root complex number associated with the RP
- Socket# (byte 0x10): socket number associated with the RP
Example script for mapping lspci BDF to RC# and socket#::

    #!/bin/bash
    while read bdf rest; do
        dvsec4_reg=$(lspci -vv -s $bdf | awk '
            /Designated Vendor-Specific: Vendor=10de ID=0004/ {
                match($0, /\[([0-9a-fA-F]+)/, arr);
                print "0x" arr[1];
                exit
            }
        ')
        if [ -n "$dvsec4_reg" ]; then
            bus=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xc))).b)
            segment=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xd))).b)
            rp=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xe))).b)
            rc=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xf))).b)
            socket=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0x10))).b)
            echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
        fi
    done < <(lspci -d 10de:)
Example output::

    0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
    0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
    0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
    0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
    0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
    0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
    0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
    0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
    0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
    0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
    0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
    0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
    000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
    000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
    000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
    000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
    000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
    000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
    000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
    000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
    000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
PCIE-TGT PMU
------------
This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors traffic targeting PCIE BAR and CXL HDM ranges.
There is one PCIE-TGT PMU per PCIE RC in the SoC. Each RC in Tegra410 SoC can
have up to 16 lanes that can be bifurcated into up to 8 root ports (RP). The
PMU provides an RP filter to count PCIE BAR traffic to each RP and an address
filter to count accesses to PCIE BAR or CXL HDM ranges. The details of the
filters are described in the following sections.
Mapping the RC# to lspci segment number is similar to the PCIE PMU. Please see
:ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info.
The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>.
The events in this PMU can be used to measure bandwidth and utilization:
* rd_req: count the number of read requests to PCIE.
* wr_req: count the number of write requests to PCIE.
* rd_bytes: count the number of bytes transferred by rd_req.
* wr_bytes: count the number of bytes transferred by wr_req.
* cycles: count the clock cycles of SOC fabric connected to the PCIE interface.
The average bandwidth is calculated as::

    AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
    AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

    AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
    AVG_WR_REQUEST_RATE = WR_REQ / CYCLES
The PMU events can be filtered based on the destination root port or target
address range. Filtering based on RP is only available for PCIE BAR traffic.
Address filter works for both PCIE BAR and CXL HDM ranges. These filters can be
found in sysfs, see
/sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
Destination filter settings:

* dst_rp_mask: bitmask to select the root port(s) to monitor. E.g.
  "dst_rp_mask=0xFF" corresponds to all root ports (from 0 to 7) in the PCIE
  RC. Note that this filter is only available for PCIE BAR traffic.
* dst_addr_base: BAR or CXL HDM filter base address.
* dst_addr_mask: BAR or CXL HDM filter address mask.
* dst_addr_en: enable the BAR or CXL HDM address range filter. If this is set,
  the address range specified by "dst_addr_base" and "dst_addr_mask" will be
  used to filter the PCIE BAR and CXL HDM traffic address. The PMU uses the
  following comparison to determine if the traffic destination address falls
  within the filter range::

      (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)

  If the comparison succeeds, then the event will be counted.
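The masked comparison can be checked numerically. The transaction address
0x10040 below is an arbitrary illustrative value; the base and mask match the
range 0x10000-0x100FF used elsewhere in this document:

```shell
# Does address 0x10040 fall inside the range selected by
# dst_addr_base=0x10000 and dst_addr_mask=0xFFF00?
addr=0x10040; base=0x10000; mask=0xFFF00
if [ $(( addr & mask )) -eq $(( base & mask )) ]; then
    match=yes   # event would be counted
else
    match=no    # event would be filtered out
fi
echo "address ${addr} matches the filter: ${match}"
```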
If the destination filter is not specified, the RP filter will be configured by default
to count PCIE BAR traffic to all root ports.
Example usage:

* Count event id 0x0 to root port 0 and 1 of PCIE RC-0 on socket 0::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/

* Count event id 0x1 for accesses to PCIE BAR or CXL HDM address range
  0x10000 to 0x100FF on socket 0's PCIE RC-1::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/
CPU Memory (CMEM) Latency PMU
-----------------------------
This PMU monitors latency events of memory read requests from the edge of the
Unified Coherence Fabric (UCF) to local CPU DRAM:
* RD_REQ counters: count read requests (32B per request).
* RD_CUM_OUTS counters: accumulated outstanding request counters, which track
  how many cycles the read requests are in flight.
* CYCLES counter: counts the number of elapsed cycles.
The average latency is calculated as::

    FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
    AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
    AVG_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>.
Example usage::

    perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}'
NVLink-C2C PMU
--------------
This PMU monitors latency events of memory read/write requests that pass through
the NVIDIA Chip-to-Chip (C2C) interface. Bandwidth events are not available
in this PMU, unlike the C2C PMU in Grace (Tegra241 SoC).
The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>.
The list of events:
* IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
* IN_RD_REQ: the number of incoming read requests.
* IN_WR_CUM_OUTS: accumulated outstanding request (in cycles) of incoming write requests.
* IN_WR_REQ: the number of incoming write requests.
* OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
* OUT_RD_REQ: the number of outgoing read requests.
* OUT_WR_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing write requests.
* OUT_WR_REQ: the number of outgoing write requests.
* CYCLES: NVLink-C2C interface cycle counts.
The incoming events count the reads/writes from the remote device to the SoC.
The outgoing events count the reads/writes from the SoC to the remote device.
The sysfs file /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer
contains information about the connected device.
When the C2C interface is connected to GPU(s), the user can use the
"gpu_mask" parameter to filter traffic to/from specific GPU(s). Each bit represents the GPU
index, e.g. "gpu_mask=0x1" corresponds to GPU 0 and "gpu_mask=0x3" is for GPU 0 and 1.
The PMU will monitor all GPUs by default if not specified.
When connected to another SoC, only the read events are available.
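The "gpu_mask" value is a plain bitmask of GPU indices, so a filter for
several GPUs can be built with shell arithmetic. The GPU indices below are
illustrative:

```shell
# Build a gpu_mask selecting GPU 0 and GPU 2 (each bit = one GPU index).
gpu_mask=$(printf '0x%x' $(( (1 << 0) | (1 << 2) )))
echo "gpu_mask=${gpu_mask}"
# Usage (illustrative): perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req,gpu_mask=0x5/
```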
The events can be used to calculate the average latency of the read/write
requests::

    C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
    IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
    IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
    IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ
    IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
    OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
    OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
    OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ
    OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
Example usage:

* Count incoming traffic from all GPUs connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/

* Count incoming traffic from GPU 0 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/

* Count incoming traffic from GPU 1 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/

* Count outgoing traffic to all GPUs connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/

* Count outgoing traffic to GPU 0 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/

* Count outgoing traffic to GPU 1 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/
NV-CLink PMU
------------
This PMU monitors latency events of memory read requests that pass through
the NV-CLINK interface. Bandwidth events are not available in this PMU.
In Tegra410 SoC, the NV-CLink interface is used to connect to another Tegra410
SoC and this PMU only counts read traffic.
The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>.
The list of events:
* IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
* IN_RD_REQ: the number of incoming read requests.
* OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
* OUT_RD_REQ: the number of outgoing read requests.
* CYCLES: NV-CLINK interface cycle counts.
The incoming events count the reads from remote device to the SoC.
The outgoing events count the reads from the SoC to remote device.
The events can be used to calculate the average latency of the read requests::

    CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
    IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
    IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ
    OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
    OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ

Example usage:

* Count incoming read traffic from remote SoC connected via NV-CLINK::

    perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/

* Count outgoing read traffic to remote SoC connected via NV-CLINK::

    perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/
NV-DLink PMU
------------
This PMU monitors latency events of memory read requests that pass through
the NV-DLINK interface. Bandwidth events are not available in this PMU.
In Tegra410 SoC, this PMU only counts CXL memory read traffic.
The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>.
The list of events:
* IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory.
* IN_RD_REQ: the number of read requests to CXL memory.
* CYCLES: NV-DLINK interface cycle counts.
The events can be used to calculate the average latency of the read requests::

    DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
    IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
    IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ

Example usage:

* Count read events to CXL memory::

    perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}'


@@ -23,6 +23,7 @@ ARM64 Architecture
memory
memory-tagging-extension
mops
mpam
perf
pointer-authentication
ptdump


@@ -0,0 +1,72 @@
.. SPDX-License-Identifier: GPL-2.0

====
MPAM
====

What is MPAM
============
MPAM (Memory Partitioning and Monitoring) is a feature of the CPUs and memory
system components, such as the caches or memory controllers, that allows
memory traffic to be labelled, partitioned and monitored.
Traffic is labelled by the CPU, based on the control or monitor group the
current task is assigned to using resctrl. Partitioning policy can be set
using the schemata file in resctrl, and monitor values read via resctrl.
See Documentation/filesystems/resctrl.rst for more details.
This allows tasks that share memory system resources, such as caches, to be
isolated from each other according to the partitioning policy (the so-called
'noisy neighbour' problem).
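The resctrl flow can be sketched as a shell session. This is a hedged sketch
using the generic resctrl interface: the group name "noisy" and the schemata
value are hypothetical, and the commands require a kernel with MPAM support,
an MPAM firmware description, and root privileges:

```shell
# Mount the resctrl filesystem (requires MPAM support in the kernel/firmware).
mount -t resctrl resctrl /sys/fs/resctrl

# Create a control group and move the current shell into it.
mkdir /sys/fs/resctrl/noisy
echo $$ > /sys/fs/resctrl/noisy/tasks

# Restrict the group to 4 cache portions of the L3 on domain 0
# (the exact bitmask width is platform-dependent).
echo "L3:0=0xf" > /sys/fs/resctrl/noisy/schemata

# Read the group's LLC occupancy (backed by an MPAM CSU monitor), in bytes.
cat /sys/fs/resctrl/noisy/mon_data/mon_L3_00/llc_occupancy
```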
Supported Platforms
===================
Use of this feature requires CPU support, support in the memory system
components, and a description from firmware of where the MPAM device controls
are in the MMIO address space (e.g. the 'MPAM' ACPI table).

The MMIO device that provides MPAM controls/monitors for a memory system
component is called a memory system component (MSC).

Because the user interface to MPAM is via resctrl, only MPAM features that are
compatible with resctrl can be exposed to user-space.

MSCs are grouped based on the topology: MSCs that correspond to the L3 cache
are considered together, and it is not possible to mix MSCs between L2 and L3
to 'cover' a resctrl schema.
The supported features are:
* Cache portion bitmap controls (CPOR) on the L2 or L3 caches. To expose
CPOR at L2 or L3, every CPU must have a corresponding CPU cache at this
level that also supports the feature. Mismatched big/little platforms are
not supported as resctrl's controls would then also depend on task
placement.
* Memory bandwidth maximum controls (MBW_MAX) on or after the L3 cache.
  resctrl uses the L3 cache-id to identify where the memory bandwidth
  control is applied. For this reason the platform must have an L3 cache
  with cache-ids supplied by firmware. (It doesn't need to support MPAM.)
  To be exported as the 'MB' schema, the topology of the group of MSCs chosen
  must match the topology of the L3 cache so that the cache-ids can be
  repainted. For example, platforms with memory bandwidth maximum controls
  on CPU-less NUMA nodes cannot expose the 'MB' schema to resctrl, as these
  nodes do not have a corresponding L3 cache. If the memory bandwidth
  control is on the memory rather than the L3, then there must be a single
  global L3, as otherwise it is unknown which L3 the traffic came from. There
  must be no caches between the L3 and the memory so that the two ends of
  the path see equivalent traffic.
When the MPAM driver finds multiple groups of MSC it can use for the 'MB'
schema, it prefers the group closest to the L3 cache.
* Cache Storage Usage (CSU) counters can expose the 'llc_occupancy' provided
there is at least one CSU monitor on each MSC that makes up the L3 group.
Exposing CSU counters from other caches or devices is not supported.
Reporting Bugs
==============
If you are not seeing the counters or controls you expect, please share the
debug messages produced when enabling dynamic debug and booting with::

    dyndbg="file mpam_resctrl.c +pl"


@@ -214,6 +214,9 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | SI L1 | #4311569 | ARM64_ERRATUM_4311569 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | CMN-650 | #3642720 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| Broadcom | Brahma-B53 | N/A | ARM64_ERRATUM_845719 |
+----------------+-----------------+-----------------+-----------------------------+
| Broadcom | Brahma-B53 | N/A | ARM64_ERRATUM_843419 |
@@ -247,6 +250,12 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 GICv3/4.x | T241-FABRIC-4 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 MPAM | T241-MPAM-1 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 MPAM | T241-MPAM-4 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 MPAM | T241-MPAM-6 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| Freescale/NXP | LS2080A/LS1043A | A-008585 | FSL_ERRATUM_A008585 |
+----------------+-----------------+-----------------+-----------------------------+


@@ -238,6 +238,13 @@ static inline void kvm_vcpu_pmu_resync_el0(void) {}
static inline bool pmuv3_implemented(int pmuver)
{
/*
* PMUVer follows the standard ID scheme for an unsigned field with the
* exception of 0xF (IMP_DEF) which is treated specially and implies
* FEAT_PMUv3 is not implemented.
*
* See DDI0487L.a D24.1.3.2 for more details.
*/
return !(pmuver == ARMV8_PMU_DFR_VER_IMP_DEF ||
pmuver == ARMV8_PMU_DFR_VER_NI);
}


@@ -61,32 +61,6 @@ config ARM64
select ARCH_HAVE_ELF_PROT
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_HAVE_TRACE_MMIO_ACCESS
-select ARCH_INLINE_READ_LOCK if !PREEMPTION
-select ARCH_INLINE_READ_LOCK_BH if !PREEMPTION
-select ARCH_INLINE_READ_LOCK_IRQ if !PREEMPTION
-select ARCH_INLINE_READ_LOCK_IRQSAVE if !PREEMPTION
-select ARCH_INLINE_READ_UNLOCK if !PREEMPTION
-select ARCH_INLINE_READ_UNLOCK_BH if !PREEMPTION
-select ARCH_INLINE_READ_UNLOCK_IRQ if !PREEMPTION
-select ARCH_INLINE_READ_UNLOCK_IRQRESTORE if !PREEMPTION
-select ARCH_INLINE_WRITE_LOCK if !PREEMPTION
-select ARCH_INLINE_WRITE_LOCK_BH if !PREEMPTION
-select ARCH_INLINE_WRITE_LOCK_IRQ if !PREEMPTION
-select ARCH_INLINE_WRITE_LOCK_IRQSAVE if !PREEMPTION
-select ARCH_INLINE_WRITE_UNLOCK if !PREEMPTION
-select ARCH_INLINE_WRITE_UNLOCK_BH if !PREEMPTION
-select ARCH_INLINE_WRITE_UNLOCK_IRQ if !PREEMPTION
-select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE if !PREEMPTION
-select ARCH_INLINE_SPIN_TRYLOCK if !PREEMPTION
-select ARCH_INLINE_SPIN_TRYLOCK_BH if !PREEMPTION
-select ARCH_INLINE_SPIN_LOCK if !PREEMPTION
-select ARCH_INLINE_SPIN_LOCK_BH if !PREEMPTION
-select ARCH_INLINE_SPIN_LOCK_IRQ if !PREEMPTION
-select ARCH_INLINE_SPIN_LOCK_IRQSAVE if !PREEMPTION
-select ARCH_INLINE_SPIN_UNLOCK if !PREEMPTION
-select ARCH_INLINE_SPIN_UNLOCK_BH if !PREEMPTION
-select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION
-select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION
select ARCH_KEEP_MEMBLOCK
select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
select ARCH_USE_CMPXCHG_LOCKREF
@@ -2009,8 +1983,8 @@ config ARM64_TLB_RANGE
config ARM64_MPAM
bool "Enable support for MPAM"
select ACPI_MPAM if ACPI
select ARM64_MPAM_DRIVER
select ARCH_HAS_CPU_RESCTRL
help
Memory System Resource Partitioning and Monitoring (MPAM) is an
optional extension to the Arm architecture that allows each
@@ -2032,6 +2006,8 @@ config ARM64_MPAM
MPAM is exposed to user-space via the resctrl pseudo filesystem.
This option enables the extra context switch code.
endmenu # "ARMv8.4 architectural features"
menu "ARMv8.5 architectural features"
@@ -2208,6 +2184,26 @@ config ARM64_GCS
endmenu # "ARMv9.4 architectural features"
config AS_HAS_LSUI
def_bool $(as-instr,.arch_extension lsui)
help
Supported by LLVM 20+ and binutils 2.45+.
menu "ARMv9.6 architectural features"
config ARM64_LSUI
bool "Support Unprivileged Load Store Instructions (LSUI)"
default y
depends on AS_HAS_LSUI && !CPU_BIG_ENDIAN
help
The Unprivileged Load Store Instructions (LSUI) extension provides
variants of the load/store instructions that access user-space memory
from the kernel without clearing the PSTATE.PAN bit.
This feature is supported by LLVM 20+ and binutils 2.45+.
endmenu # "ARMv9.6 architectural features"
config ARM64_SVE
bool "ARM Scalable Vector Extension support"
default y
@@ -2365,7 +2361,7 @@ config CMDLINE
default ""
help
Provide a set of default command-line options at build time by
entering them here. As a minimum, you should specify the
root device (e.g. root=/dev/nfs).
choice

View File

@@ -15,7 +15,7 @@
#ifdef CONFIG_ARM64_SW_TTBR0_PAN
.macro __uaccess_ttbr0_disable, tmp1
mrs \tmp1, ttbr1_el1 // swapper_pg_dir
bic \tmp1, \tmp1, #TTBRx_EL1_ASID_MASK
sub \tmp1, \tmp1, #RESERVED_SWAPPER_OFFSET // reserved_pg_dir
msr ttbr0_el1, \tmp1 // set reserved TTBR0_EL1
add \tmp1, \tmp1, #RESERVED_SWAPPER_OFFSET

View File

@@ -71,6 +71,8 @@ cpucap_is_possible(const unsigned int cap)
return true;
case ARM64_HAS_PMUV3:
return IS_ENABLED(CONFIG_HW_PERF_EVENTS);
case ARM64_HAS_LSUI:
return IS_ENABLED(CONFIG_ARM64_LSUI);
}
return true;

View File

@@ -513,7 +513,8 @@
check_override id_aa64pfr0, ID_AA64PFR0_EL1_MPAM_SHIFT, .Linit_mpam_\@, .Lskip_mpam_\@, x1, x2
.Linit_mpam_\@:
mov x0, #MPAM2_EL2_EnMPAMSM_MASK
msr_s SYS_MPAM2_EL2, x0 // use the default partition,
// and disable lower traps
mrs_s x0, SYS_MPAMIDR_EL1
tbz x0, #MPAMIDR_EL1_HAS_HCR_SHIFT, .Lskip_mpam_\@ // skip if no MPAMHCR reg

View File

@@ -9,71 +9,292 @@
#include <linux/uaccess.h>
#include <asm/errno.h>
#include <asm/lsui.h>
#define FUTEX_MAX_LOOPS 128 /* What's the largest number you can think of? */
#define LLSC_FUTEX_ATOMIC_OP(op, insn) \
static __always_inline int \
__llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval) \
{ \
unsigned int loops = FUTEX_MAX_LOOPS; \
int ret, oldval, newval; \
\
uaccess_enable_privileged(); \
asm volatile( \
asm volatile("// __llsc_futex_atomic_" #op "\n" \
" prfm pstl1strm, %[uaddr]\n" \
"1: ldxr %w[oldval], %[uaddr]\n" \
insn "\n" \
"2: stlxr %w[ret], %w[newval], %[uaddr]\n" \
" cbz %w[ret], 3f\n" \
" sub %w[loops], %w[loops], %w[ret]\n" \
" cbnz %w[loops], 1b\n" \
" mov %w[ret], %w[err]\n" \
"3:\n" \
" dmb ish\n" \
_ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w[ret]) \
_ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w[ret]) \
: [ret] "=&r" (ret), [oldval] "=&r" (oldval), \
[uaddr] "+Q" (*uaddr), [newval] "=&r" (newval), \
[loops] "+r" (loops) \
: [oparg] "r" (oparg), [err] "Ir" (-EAGAIN) \
: "memory"); \
uaccess_disable_privileged(); \
\
if (!ret) \
*oval = oldval; \
\
return ret; \
}
LLSC_FUTEX_ATOMIC_OP(add, "add %w[newval], %w[oldval], %w[oparg]")
LLSC_FUTEX_ATOMIC_OP(or, "orr %w[newval], %w[oldval], %w[oparg]")
LLSC_FUTEX_ATOMIC_OP(and, "and %w[newval], %w[oldval], %w[oparg]")
LLSC_FUTEX_ATOMIC_OP(eor, "eor %w[newval], %w[oldval], %w[oparg]")
LLSC_FUTEX_ATOMIC_OP(set, "mov %w[newval], %w[oparg]")
static __always_inline int
__llsc_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
{
int ret = 0;
unsigned int loops = FUTEX_MAX_LOOPS;
u32 val, tmp;
uaccess_enable_privileged();
asm volatile("//__llsc_futex_cmpxchg\n"
" prfm pstl1strm, %[uaddr]\n"
"1: ldxr %w[curval], %[uaddr]\n"
" eor %w[tmp], %w[curval], %w[oldval]\n"
" cbnz %w[tmp], 4f\n"
"2: stlxr %w[tmp], %w[newval], %[uaddr]\n"
" cbz %w[tmp], 3f\n"
" sub %w[loops], %w[loops], %w[tmp]\n"
" cbnz %w[loops], 1b\n"
" mov %w[ret], %w[err]\n"
"3:\n"
" dmb ish\n"
"4:\n"
_ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w[ret])
_ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w[ret])
: [ret] "+r" (ret), [curval] "=&r" (val),
[uaddr] "+Q" (*uaddr), [tmp] "=&r" (tmp),
[loops] "+r" (loops)
: [oldval] "r" (oldval), [newval] "r" (newval),
[err] "Ir" (-EAGAIN)
: "memory");
uaccess_disable_privileged();
if (!ret)
*oval = val;
return ret;
}
#ifdef CONFIG_ARM64_LSUI
/*
* Wrap LSUI instructions with uaccess_ttbr0_enable()/disable(), as
* PAN toggling is not required.
*/
#define LSUI_FUTEX_ATOMIC_OP(op, asm_op) \
static __always_inline int \
__lsui_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval) \
{ \
int ret = 0; \
int oldval; \
\
uaccess_ttbr0_enable(); \
\
asm volatile("// __lsui_futex_atomic_" #op "\n" \
__LSUI_PREAMBLE \
"1: " #asm_op "al %w[oparg], %w[oldval], %[uaddr]\n" \
"2:\n" \
_ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w[ret]) \
: [ret] "+r" (ret), [uaddr] "+Q" (*uaddr), \
[oldval] "=r" (oldval) \
: [oparg] "r" (oparg) \
: "memory"); \
\
uaccess_ttbr0_disable(); \
\
if (!ret) \
*oval = oldval; \
return ret; \
}
LSUI_FUTEX_ATOMIC_OP(add, ldtadd)
LSUI_FUTEX_ATOMIC_OP(or, ldtset)
LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr)
LSUI_FUTEX_ATOMIC_OP(set, swpt)
static __always_inline int
__lsui_cmpxchg64(u64 __user *uaddr, u64 *oldval, u64 newval)
{
int ret = 0;
uaccess_ttbr0_enable();
asm volatile("// __lsui_cmpxchg64\n"
__LSUI_PREAMBLE
"1: casalt %[oldval], %[newval], %[uaddr]\n"
"2:\n"
_ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w[ret])
: [ret] "+r" (ret), [uaddr] "+Q" (*uaddr),
[oldval] "+r" (*oldval)
: [newval] "r" (newval)
: "memory");
uaccess_ttbr0_disable();
return ret;
}
static __always_inline int
__lsui_cmpxchg32(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
{
u64 __user *uaddr64;
bool futex_pos, other_pos;
u32 other, orig_other;
union {
u32 futex[2];
u64 raw;
} oval64, orig64, nval64;
uaddr64 = (u64 __user *)PTR_ALIGN_DOWN(uaddr, sizeof(u64));
futex_pos = !IS_ALIGNED((unsigned long)uaddr, sizeof(u64));
other_pos = !futex_pos;
oval64.futex[futex_pos] = oldval;
if (get_user(oval64.futex[other_pos], (u32 __user *)uaddr64 + other_pos))
return -EFAULT;
orig64.raw = oval64.raw;
nval64.futex[futex_pos] = newval;
nval64.futex[other_pos] = oval64.futex[other_pos];
if (__lsui_cmpxchg64(uaddr64, &oval64.raw, nval64.raw))
return -EFAULT;
oldval = oval64.futex[futex_pos];
other = oval64.futex[other_pos];
orig_other = orig64.futex[other_pos];
if (other != orig_other)
return -EAGAIN;
*oval = oldval;
return 0;
}
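The trick used by `__lsui_cmpxchg32()` above — emulating a 32-bit compare-and-swap with only a 64-bit CAS by operating on the aligned doubleword containing the futex word, and reporting a retryable failure when the *neighbouring* half changed underneath — can be modelled in plain user-space C. This is a sketch, not the kernel code: GCC/Clang `__atomic` builtins stand in for `casalt`, the function name and return values are illustrative, and a little-endian layout is assumed (matching the `!CPU_BIG_ENDIAN` dependency of `CONFIG_ARM64_LSUI`).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Emulate a 32-bit CAS on *uaddr using a 64-bit CAS on the containing
 * aligned doubleword. Returns 0 on success, -1 if either half of the
 * doubleword changed (the kernel distinguishes -EFAULT/-EAGAIN; callers
 * there retry on -EAGAIN).
 */
static int cmpxchg32_via_cas64(uint32_t *uaddr, uint32_t oldval,
			       uint32_t newval, uint32_t *oval)
{
	uint64_t *uaddr64 = (uint64_t *)((uintptr_t)uaddr & ~(uintptr_t)7);
	/* On little-endian, the higher-addressed word is the high half. */
	bool futex_hi = ((uintptr_t)uaddr & 7) != 0;
	union { uint32_t w[2]; uint64_t raw; } expected, desired;

	expected.raw = *uaddr64;		/* snapshot both halves */
	expected.w[futex_hi] = oldval;		/* caller's expected value  */
	desired.raw = expected.raw;
	desired.w[futex_hi] = newval;

	if (!__atomic_compare_exchange_n(uaddr64, &expected.raw, desired.raw,
					 false, __ATOMIC_SEQ_CST,
					 __ATOMIC_SEQ_CST))
		return -1;	/* futex word or its neighbour changed */

	*oval = oldval;
	return 0;
}
```

A failed CAS here can mean either the futex word mismatched or the adjacent word was concurrently written; the kernel version compares the neighbour against its snapshot and returns -EAGAIN in the latter case so callers retry.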
static __always_inline int
__lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
{
/*
* Undo the bitwise negation applied to the oparg passed from
* arch_futex_atomic_op_inuser() with FUTEX_OP_ANDN.
*/
return __lsui_futex_atomic_andnot(~oparg, uaddr, oval);
}
static __always_inline int
__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
{
u32 oldval, newval, val;
int ret, i;
if (get_user(oldval, uaddr))
return -EFAULT;
/*
* there are no ldteor/stteor instructions...
*/
for (i = 0; i < FUTEX_MAX_LOOPS; i++) {
newval = oldval ^ oparg;
ret = __lsui_cmpxchg32(uaddr, oldval, newval, &val);
switch (ret) {
case -EFAULT:
return ret;
case -EAGAIN:
continue;
}
if (val == oldval) {
*oval = val;
return 0;
}
oldval = val;
}
return -EAGAIN;
}
static __always_inline int
__lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
{
/*
* Callers of futex_atomic_cmpxchg_inatomic() already retry on
* -EAGAIN, no need for another loop of max retries.
*/
return __lsui_cmpxchg32(uaddr, oldval, newval, oval);
}
#endif /* CONFIG_ARM64_LSUI */
#define FUTEX_ATOMIC_OP(op) \
static __always_inline int \
__futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval) \
{ \
return __lsui_llsc_body(futex_atomic_##op, oparg, uaddr, oval); \
}
FUTEX_ATOMIC_OP(add)
FUTEX_ATOMIC_OP(or)
FUTEX_ATOMIC_OP(and)
FUTEX_ATOMIC_OP(eor)
FUTEX_ATOMIC_OP(set)
static __always_inline int
__futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
{
return __lsui_llsc_body(futex_cmpxchg, uaddr, oldval, newval, oval);
}
static inline int
arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *_uaddr)
{
int ret;
u32 __user *uaddr;
if (!access_ok(_uaddr, sizeof(u32)))
return -EFAULT;
uaddr = __uaccess_mask_ptr(_uaddr);
switch (op) {
case FUTEX_OP_SET:
ret = __futex_atomic_set(oparg, uaddr, oval);
break;
case FUTEX_OP_ADD:
ret = __futex_atomic_add(oparg, uaddr, oval);
break;
case FUTEX_OP_OR:
ret = __futex_atomic_or(oparg, uaddr, oval);
break;
case FUTEX_OP_ANDN:
ret = __futex_atomic_and(~oparg, uaddr, oval);
break;
case FUTEX_OP_XOR:
ret = __futex_atomic_eor(oparg, uaddr, oval);
break;
default:
ret = -ENOSYS;
}
return ret;
}
@@ -81,40 +302,14 @@ static inline int
futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *_uaddr,
u32 oldval, u32 newval)
{
u32 __user *uaddr;
if (!access_ok(_uaddr, sizeof(u32)))
return -EFAULT;
uaddr = __uaccess_mask_ptr(_uaddr);
return __futex_cmpxchg(uaddr, oldval, newval, uval);
}
#endif /* __ASM_FUTEX_H */

View File

@@ -71,23 +71,23 @@ static inline void __flush_hugetlb_tlb_range(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
unsigned long stride,
tlbf_t flags)
{
switch (stride) {
#ifndef __PAGETABLE_PMD_FOLDED
case PUD_SIZE:
__flush_tlb_range(vma, start, end, PUD_SIZE, 1, flags);
break;
#endif
case CONT_PMD_SIZE:
case PMD_SIZE:
__flush_tlb_range(vma, start, end, PMD_SIZE, 2, flags);
break;
case CONT_PTE_SIZE:
__flush_tlb_range(vma, start, end, PAGE_SIZE, 3, flags);
break;
default:
__flush_tlb_range(vma, start, end, PAGE_SIZE, TLBI_TTL_UNKNOWN, flags);
}
}
@@ -98,7 +98,7 @@ static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
{
unsigned long stride = huge_page_size(hstate_vma(vma));
__flush_hugetlb_tlb_range(vma, start, end, stride, TLBF_NONE);
}
#endif /* __ASM_HUGETLB_H */

View File

@@ -60,126 +60,10 @@
* of KERNEL_HWCAP_{feature}.
*/
#define __khwcap_feature(x) const_ilog2(HWCAP_ ## x)
#define __khwcap2_feature(x) (const_ilog2(HWCAP2_ ## x) + 64)
#define __khwcap3_feature(x) (const_ilog2(HWCAP3_ ## x) + 128)
#include "asm/kernel-hwcap.h"
/*
* This yields a mask that user programs can use to figure out what

View File

@@ -0,0 +1,27 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __ASM_LSUI_H
#define __ASM_LSUI_H
#include <linux/compiler_types.h>
#include <linux/stringify.h>
#include <asm/alternative.h>
#include <asm/alternative-macros.h>
#include <asm/cpucaps.h>
#define __LSUI_PREAMBLE ".arch_extension lsui\n"
#ifdef CONFIG_ARM64_LSUI
#define __lsui_llsc_body(op, ...) \
({ \
alternative_has_cap_unlikely(ARM64_HAS_LSUI) ? \
__lsui_##op(__VA_ARGS__) : __llsc_##op(__VA_ARGS__); \
})
#else /* CONFIG_ARM64_LSUI */
#define __lsui_llsc_body(op, ...) __llsc_##op(__VA_ARGS__)
#endif /* CONFIG_ARM64_LSUI */
#endif /* __ASM_LSUI_H */

View File

@@ -10,20 +10,12 @@
#define MMCF_AARCH32 0x1 /* mm context flag for AArch32 executables */
#define USER_ASID_BIT 48
#define USER_ASID_FLAG (UL(1) << USER_ASID_BIT)
#ifndef __ASSEMBLER__
#include <linux/refcount.h>
#include <asm/cpufeature.h>
typedef struct {
atomic64_t id;
#ifdef CONFIG_COMPAT
@@ -112,5 +104,7 @@ void kpti_install_ng_mappings(void);
static inline void kpti_install_ng_mappings(void) {}
#endif
extern bool page_alloc_available;
#endif /* !__ASSEMBLER__ */
#endif

View File

@@ -210,7 +210,8 @@ static inline void update_saved_ttbr0(struct task_struct *tsk,
if (mm == &init_mm)
ttbr = phys_to_ttbr(__pa_symbol(reserved_pg_dir));
else
ttbr = phys_to_ttbr(virt_to_phys(mm->pgd)) | ASID(mm) << 48;
ttbr = phys_to_ttbr(virt_to_phys(mm->pgd)) |
FIELD_PREP(TTBRx_EL1_ASID_MASK, ASID(mm));
WRITE_ONCE(task_thread_info(tsk)->ttbr0, ttbr);
}

View File

@@ -0,0 +1,96 @@
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (C) 2025 Arm Ltd. */
#ifndef __ASM__MPAM_H
#define __ASM__MPAM_H
#include <linux/arm_mpam.h>
#include <linux/bitfield.h>
#include <linux/jump_label.h>
#include <linux/percpu.h>
#include <linux/sched.h>
#include <asm/sysreg.h>
DECLARE_STATIC_KEY_FALSE(mpam_enabled);
DECLARE_PER_CPU(u64, arm64_mpam_default);
DECLARE_PER_CPU(u64, arm64_mpam_current);
/*
* The value of the MPAM0_EL1 sysreg when a task is in resctrl's default group.
The context switch code uses it to detect such tasks and apply the
per-CPU resctrl value instead. The value is modified when CDP is
enabled/disabled by mounting
* the resctrl filesystem.
*/
extern u64 arm64_mpam_global_default;
#ifdef CONFIG_ARM64_MPAM
static inline u64 __mpam_regval(u16 partid_d, u16 partid_i, u8 pmg_d, u8 pmg_i)
{
return FIELD_PREP(MPAM0_EL1_PARTID_D, partid_d) |
FIELD_PREP(MPAM0_EL1_PARTID_I, partid_i) |
FIELD_PREP(MPAM0_EL1_PMG_D, pmg_d) |
FIELD_PREP(MPAM0_EL1_PMG_I, pmg_i);
}
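The `FIELD_PREP()` packing done by `__mpam_regval()` can be sketched in user-space C with explicit shifts. The bit positions used here (PARTID_I in [15:0], PARTID_D in [31:16], PMG_I in [39:32], PMG_D in [47:40]) follow the architectural MPAM0_EL1 layout; in the kernel the masks come from the generated sysreg headers, so treat the constants below as assumptions of the sketch:

```c
#include <stdint.h>

/* Pack the two PARTID/PMG pairs into an MPAM0_EL1-shaped value. */
static uint64_t mpam_regval(uint16_t partid_d, uint16_t partid_i,
			    uint8_t pmg_d, uint8_t pmg_i)
{
	return ((uint64_t)pmg_d    << 40) |	/* MPAM0_EL1.PMG_D    */
	       ((uint64_t)pmg_i    << 32) |	/* MPAM0_EL1.PMG_I    */
	       ((uint64_t)partid_d << 16) |	/* MPAM0_EL1.PARTID_D */
	        (uint64_t)partid_i;		/* MPAM0_EL1.PARTID_I */
}
```

Packing both the data and instruction PARTID/PMG into one u64 is what lets `mpam_thread_switch()` publish a task's whole MPAM configuration with a single `WRITE_ONCE()`, avoiding torn partid/pmg pairs as the comment below describes.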
static inline void mpam_set_cpu_defaults(int cpu, u16 partid_d, u16 partid_i,
u8 pmg_d, u8 pmg_i)
{
u64 default_val = __mpam_regval(partid_d, partid_i, pmg_d, pmg_i);
WRITE_ONCE(per_cpu(arm64_mpam_default, cpu), default_val);
}
/*
* The resctrl filesystem writes to the partid/pmg values for threads and CPUs,
* which may race with reads in mpam_thread_switch(). Ensure only one of the old
* or new values are used. Particular care should be taken with the pmg field as
* mpam_thread_switch() may read a partid and pmg that don't match, causing this
* value to be stored with cache allocations, despite being considered 'free' by
* resctrl.
*/
static inline u64 mpam_get_regval(struct task_struct *tsk)
{
return READ_ONCE(task_thread_info(tsk)->mpam_partid_pmg);
}
static inline void mpam_set_task_partid_pmg(struct task_struct *tsk,
u16 partid_d, u16 partid_i,
u8 pmg_d, u8 pmg_i)
{
u64 regval = __mpam_regval(partid_d, partid_i, pmg_d, pmg_i);
WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval);
}
static inline void mpam_thread_switch(struct task_struct *tsk)
{
u64 oldregval;
int cpu = smp_processor_id();
u64 regval = mpam_get_regval(tsk);
if (!static_branch_likely(&mpam_enabled))
return;
if (regval == READ_ONCE(arm64_mpam_global_default))
regval = READ_ONCE(per_cpu(arm64_mpam_default, cpu));
oldregval = READ_ONCE(per_cpu(arm64_mpam_current, cpu));
if (oldregval == regval)
return;
write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1);
if (system_supports_sme())
write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D), SYS_MPAMSM_EL1);
isb();
/* Synchronising the EL0 write is left until the ERET to EL0 */
write_sysreg_s(regval, SYS_MPAM0_EL1);
WRITE_ONCE(per_cpu(arm64_mpam_current, cpu), regval);
}
#else
static inline void mpam_thread_switch(struct task_struct *tsk) {}
#endif /* CONFIG_ARM64_MPAM */
#endif /* __ASM__MPAM_H */

View File

@@ -252,6 +252,9 @@ static inline void mte_check_tfsr_entry(void)
if (!kasan_hw_tags_enabled())
return;
if (!system_uses_mte_async_or_asymm_mode())
return;
mte_check_tfsr_el1();
}
@@ -260,6 +263,9 @@ static inline void mte_check_tfsr_exit(void)
if (!kasan_hw_tags_enabled())
return;
if (!system_uses_mte_async_or_asymm_mode())
return;
/*
* The asynchronous faults are sync'ed automatically with
* TFSR_EL1 on kernel entry but for exit an explicit dsb()

View File

@@ -223,8 +223,6 @@
*/
#define TTBR_CNP_BIT (UL(1) << 0)
/*
* TCR flags.
*/
@@ -287,9 +285,12 @@
#endif
#ifdef CONFIG_ARM64_VA_BITS_52
#define PTRS_PER_PGD_52_VA (UL(1) << (52 - PGDIR_SHIFT))
#define PTRS_PER_PGD_48_VA (UL(1) << (48 - PGDIR_SHIFT))
#define PTRS_PER_PGD_EXTRA (PTRS_PER_PGD_52_VA - PTRS_PER_PGD_48_VA)
/* Must be at least 64-byte aligned to prevent corruption of the TTBR */
#define TTBR1_BADDR_4852_OFFSET (PTRS_PER_PGD_EXTRA << PTDESC_ORDER)
#endif
#endif
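The `TTBR1_BADDR_4852_OFFSET` rewrite above is pure arithmetic: `PTRS_PER_PGD_EXTRA << PTDESC_ORDER` must equal the older `(2^(52-PGDIR_SHIFT) - 2^(48-PGDIR_SHIFT)) * 8` expression. A quick user-space check, assuming 8-byte descriptors (`PTDESC_ORDER == 3`) and, purely for the example, `PGDIR_SHIFT == 39` (a 4K-page configuration):

```c
#include <stdint.h>

#define EX_PGDIR_SHIFT	39	/* example value, not taken from the kernel */
#define EX_PTDESC_ORDER	3	/* log2(8-byte descriptor) */

/* The pre-change expression: (PTRS_PER_PGD_52 - PTRS_PER_PGD_48) * 8. */
static uint64_t offset_old(void)
{
	return ((UINT64_C(1) << (52 - EX_PGDIR_SHIFT)) -
		(UINT64_C(1) << (48 - EX_PGDIR_SHIFT))) * 8;
}

/* The post-change expression: PTRS_PER_PGD_EXTRA << PTDESC_ORDER. */
static uint64_t offset_new(void)
{
	uint64_t extra = (UINT64_C(1) << (52 - EX_PGDIR_SHIFT)) -
			 (UINT64_C(1) << (48 - EX_PGDIR_SHIFT));
	return extra << EX_PTDESC_ORDER;
}
```

Multiplying an entry count by 8 and shifting it left by `PTDESC_ORDER` are the same operation, so the rewrite only swaps a magic `* 8` for a named constant.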

View File

@@ -25,6 +25,8 @@
*/
#define PTE_PRESENT_INVALID (PTE_NG) /* only when !PTE_VALID */
#define PTE_PRESENT_VALID_KERNEL (PTE_VALID | PTE_MAYBE_NG)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
#define PTE_UFFD_WP (_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
#define PTE_SWP_UFFD_WP (_AT(pteval_t, 1) << 3) /* only for swp ptes */

View File

@@ -89,9 +89,9 @@ static inline void arch_leave_lazy_mmu_mode(void)
/* Set stride and tlb_level in flush_*_tlb_range */
#define flush_pmd_tlb_range(vma, addr, end) \
__flush_tlb_range(vma, addr, end, PMD_SIZE, 2, TLBF_NONE)
#define flush_pud_tlb_range(vma, addr, end) \
__flush_tlb_range(vma, addr, end, PUD_SIZE, 1, TLBF_NONE)
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/*
@@ -101,10 +101,11 @@ static inline void arch_leave_lazy_mmu_mode(void)
* entries exist.
*/
#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
__flush_tlb_page(vma, address, TLBF_NOBROADCAST | TLBF_NONOTIFY)
#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
__flush_tlb_range(vma, address, address + PMD_SIZE, PMD_SIZE, 2, \
TLBF_NOBROADCAST | TLBF_NONOTIFY | TLBF_NOWALKCACHE)
/*
* ZERO_PAGE is a global shared page that is always zero: used
@@ -322,9 +323,11 @@ static inline pte_t pte_mknoncont(pte_t pte)
return clear_pte_bit(pte, __pgprot(PTE_CONT));
}
static inline pte_t pte_mkvalid_k(pte_t pte)
{
pte = clear_pte_bit(pte, __pgprot(PTE_PRESENT_INVALID));
pte = set_pte_bit(pte, __pgprot(PTE_PRESENT_VALID_KERNEL));
return pte;
}
static inline pte_t pte_mkinvalid(pte_t pte)
@@ -594,6 +597,7 @@ static inline int pmd_protnone(pmd_t pmd)
#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mkvalid_k(pmd) pte_pmd(pte_mkvalid_k(pmd_pte(pmd)))
#define pmd_mkinvalid(pmd) pte_pmd(pte_mkinvalid(pmd_pte(pmd)))
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
#define pmd_uffd_wp(pmd) pte_uffd_wp(pmd_pte(pmd))
@@ -635,6 +639,8 @@ static inline pmd_t pmd_mkspecial(pmd_t pmd)
#define pud_young(pud) pte_young(pud_pte(pud))
#define pud_mkyoung(pud) pte_pud(pte_mkyoung(pud_pte(pud)))
#define pud_mkwrite_novma(pud) pte_pud(pte_mkwrite_novma(pud_pte(pud)))
#define pud_mkvalid_k(pud) pte_pud(pte_mkvalid_k(pud_pte(pud)))
#define pud_write(pud) pte_write(pud_pte(pud))
static inline pud_t pud_mkhuge(pud_t pud)
@@ -779,9 +785,13 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
#define pmd_table(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
PMD_TYPE_TABLE)
#define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
PMD_TYPE_SECT)
#define pmd_leaf pmd_leaf
static inline bool pmd_leaf(pmd_t pmd)
{
return pmd_present(pmd) && !pmd_table(pmd);
}
#define pmd_bad(pmd) (!pmd_table(pmd))
#define pmd_leaf_size(pmd) (pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
@@ -799,11 +809,8 @@ static inline int pmd_trans_huge(pmd_t pmd)
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
static inline bool pud_sect(pud_t pud) { return false; }
static inline bool pud_table(pud_t pud) { return true; }
#else
#define pud_sect(pud) ((pud_val(pud) & PUD_TYPE_MASK) == \
PUD_TYPE_SECT)
#define pud_table(pud) ((pud_val(pud) & PUD_TYPE_MASK) == \
PUD_TYPE_TABLE)
#endif
@@ -873,7 +880,11 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
PUD_TYPE_TABLE)
#define pud_present(pud) pte_present(pud_pte(pud))
#ifndef __PAGETABLE_PMD_FOLDED
#define pud_leaf pud_leaf
static inline bool pud_leaf(pud_t pud)
{
return pud_present(pud) && !pud_table(pud);
}
#else
#define pud_leaf(pud) false
#endif
@@ -1247,9 +1258,18 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
}
extern int __ptep_set_access_flags_anysz(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty,
unsigned long pgsize);
static inline int __ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty)
{
return __ptep_set_access_flags_anysz(vma, address, ptep, entry, dirty,
PAGE_SIZE);
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
@@ -1257,8 +1277,8 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
pmd_t entry, int dirty)
{
return __ptep_set_access_flags_anysz(vma, address, (pte_t *)pmdp,
pmd_pte(entry), dirty, PMD_SIZE);
}
#endif
@@ -1320,7 +1340,7 @@ static inline int __ptep_clear_flush_young(struct vm_area_struct *vma,
* context-switch, which provides a DSB to complete the TLB
* invalidation.
*/
__flush_tlb_page(vma, address, TLBF_NOSYNC);
}
return young;

View File

@@ -0,0 +1,2 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/arm_mpam.h>

View File

@@ -10,6 +10,11 @@
#ifdef CONFIG_SHADOW_CALL_STACK
scs_sp .req x18
.macro scs_load_current_base
get_current_task scs_sp
ldr scs_sp, [scs_sp, #TSK_TI_SCS_BASE]
.endm
.macro scs_load_current
get_current_task scs_sp
ldr scs_sp, [scs_sp, #TSK_TI_SCS_SP]
@@ -19,6 +24,9 @@
str scs_sp, [\tsk, #TSK_TI_SCS_SP]
.endm
#else
.macro scs_load_current_base
.endm
.macro scs_load_current
.endm

View File

@@ -41,6 +41,9 @@ struct thread_info {
#ifdef CONFIG_SHADOW_CALL_STACK
void *scs_base;
void *scs_sp;
#endif
#ifdef CONFIG_ARM64_MPAM
u64 mpam_partid_pmg;
#endif
u32 cpu;
};

View File

@@ -53,7 +53,7 @@ static inline int tlb_get_level(struct mmu_gather *tlb)
static inline void tlb_flush(struct mmu_gather *tlb)
{
struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
tlbf_t flags = tlb->freed_tables ? TLBF_NONE : TLBF_NOWALKCACHE;
unsigned long stride = tlb_get_unmap_size(tlb);
int tlb_level = tlb_get_level(tlb);
@@ -63,13 +63,13 @@ static inline void tlb_flush(struct mmu_gather *tlb)
* reallocate our ASID without invalidating the entire TLB.
*/
if (tlb->fullmm) {
if (tlb->freed_tables)
flush_tlb_mm(tlb->mm);
return;
}
__flush_tlb_range(&vma, tlb->start, tlb->end, stride,
tlb_level, flags);
}
static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,


@@ -97,24 +97,69 @@ static inline unsigned long get_trans_granule(void)
#define TLBI_TTL_UNKNOWN INT_MAX
#define __tlbi_level(op, addr, level) do { \
u64 arg = addr; \
\
if (alternative_has_cap_unlikely(ARM64_HAS_ARMv8_4_TTL) && \
level >= 0 && level <= 3) { \
u64 ttl = level & 3; \
ttl |= get_trans_granule() << 2; \
arg &= ~TLBI_TTL_MASK; \
arg |= FIELD_PREP(TLBI_TTL_MASK, ttl); \
} \
\
__tlbi(op, arg); \
} while(0)
typedef void (*tlbi_op)(u64 arg);
#define __tlbi_user_level(op, arg, level) do { \
if (arm64_kernel_unmapped_at_el0()) \
__tlbi_level(op, (arg | USER_ASID_FLAG), level); \
} while (0)
static __always_inline void vae1is(u64 arg)
{
__tlbi(vae1is, arg);
__tlbi_user(vae1is, arg);
}
static __always_inline void vae2is(u64 arg)
{
__tlbi(vae2is, arg);
}
static __always_inline void vale1(u64 arg)
{
__tlbi(vale1, arg);
__tlbi_user(vale1, arg);
}
static __always_inline void vale1is(u64 arg)
{
__tlbi(vale1is, arg);
__tlbi_user(vale1is, arg);
}
static __always_inline void vale2is(u64 arg)
{
__tlbi(vale2is, arg);
}
static __always_inline void vaale1is(u64 arg)
{
__tlbi(vaale1is, arg);
}
static __always_inline void ipas2e1(u64 arg)
{
__tlbi(ipas2e1, arg);
}
static __always_inline void ipas2e1is(u64 arg)
{
__tlbi(ipas2e1is, arg);
}
static __always_inline void __tlbi_level_asid(tlbi_op op, u64 addr, u32 level,
u16 asid)
{
u64 arg = __TLBI_VADDR(addr, asid);
if (alternative_has_cap_unlikely(ARM64_HAS_ARMv8_4_TTL) && level <= 3) {
u64 ttl = level | (get_trans_granule() << 2);
FIELD_MODIFY(TLBI_TTL_MASK, &arg, ttl);
}
op(arg);
}
static inline void __tlbi_level(tlbi_op op, u64 addr, u32 level)
{
__tlbi_level_asid(op, addr, level, 0);
}
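For reference, the TTL hint packs the level and translation granule into a 4-bit field of the TLBI operand, as `__tlbi_level_asid()` does above. A standalone sketch of that packing (the field position at bit 44 and the granule codes follow the FEAT_TTL definition; treat them as assumptions here):

```c
#include <assert.h>
#include <stdint.h>

#define TLBI_TTL_SHIFT 44
#define TLBI_TTL_MASK  (0xfULL << TLBI_TTL_SHIFT)

/* FEAT_TTL granule codes: 4KB = 1, 16KB = 2, 64KB = 3. */
static uint64_t tlbi_with_ttl(uint64_t arg, unsigned int granule,
			      unsigned int level)
{
	uint64_t ttl = (level & 3) | (granule << 2);

	/* Replace any stale TTL bits with the new hint. */
	arg &= ~TLBI_TTL_MASK;
	arg |= ttl << TLBI_TTL_SHIFT;
	return arg;
}
```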
/*
* This macro creates a properly formatted VA operand for the TLB RANGE. The
@@ -141,19 +186,6 @@ static inline unsigned long get_trans_granule(void)
#define TLBIR_TTL_MASK GENMASK_ULL(38, 37)
#define TLBIR_BADDR_MASK GENMASK_ULL(36, 0)
#define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \
({ \
unsigned long __ta = 0; \
unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \
__ta |= FIELD_PREP(TLBIR_BADDR_MASK, baddr); \
__ta |= FIELD_PREP(TLBIR_TTL_MASK, __ttl); \
__ta |= FIELD_PREP(TLBIR_NUM_MASK, num); \
__ta |= FIELD_PREP(TLBIR_SCALE_MASK, scale); \
__ta |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule()); \
__ta |= FIELD_PREP(TLBIR_ASID_MASK, asid); \
__ta; \
})
/* These macros are used by the TLBI RANGE feature. */
#define __TLBI_RANGE_PAGES(num, scale) \
((unsigned long)((num) + 1) << (5 * (scale) + 1))
@@ -167,11 +199,7 @@ static inline unsigned long get_trans_granule(void)
* range.
*/
#define __TLBI_RANGE_NUM(pages, scale) \
({ \
int __pages = min((pages), \
__TLBI_RANGE_PAGES(31, (scale))); \
(__pages >> (5 * (scale) + 1)) - 1; \
})
(((pages) >> (5 * (scale) + 1)) - 1)
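The relationship between 'num', 'scale' and page count is easy to check in isolation. A small sketch of the two macros above as functions (the simplified `__TLBI_RANGE_NUM` relies on callers keeping pages within MAX_TLBI_RANGE_PAGES so that num never exceeds 31):

```c
#include <assert.h>

/* Pages covered by one range op: (num + 1) << (5*scale + 1). */
static unsigned long range_pages(int num, int scale)
{
	return (unsigned long)(num + 1) << (5 * scale + 1);
}

/* 'num' encoding for 'pages' at this 'scale'; a negative result means
 * this scale is too coarse and the caller should try the next one down. */
static int range_num(unsigned long pages, int scale)
{
	return (int)((pages >> (5 * scale + 1)) - 1);
}
```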
#define __repeat_tlbi_sync(op, arg...) \
do { \
@@ -241,10 +269,7 @@ static inline void __tlbi_sync_s1ish_hyp(void)
* unmapping pages from vmalloc/io space.
*
* flush_tlb_page(vma, addr)
* Invalidate a single user mapping for address 'addr' in the
* address space corresponding to 'vma->mm'. Note that this
* operation only invalidates a single, last-level page-table
* entry and therefore does not affect any walk-caches.
* Equivalent to __flush_tlb_page(..., flags=TLBF_NONE)
*
*
* Next, we have some undocumented invalidation routines that you probably
@@ -258,30 +283,28 @@ static inline void __tlbi_sync_s1ish_hyp(void)
* CPUs, ensuring that any walk-cache entries associated with the
* translation are also invalidated.
*
* __flush_tlb_range(vma, start, end, stride, last_level, tlb_level)
* __flush_tlb_range(vma, start, end, stride, tlb_level, flags)
* Invalidate the virtual-address range '[start, end)' on all
* CPUs for the user address space corresponding to 'vma->mm'.
* The invalidation operations are issued at a granularity
* determined by 'stride' and only affect any walk-cache entries
* if 'last_level' is equal to false. tlb_level is the level at
* determined by 'stride'. tlb_level is the level at
* which the invalidation must take place. If the level is wrong,
* no invalidation may take place. In the case where the level
* cannot be easily determined, the value TLBI_TTL_UNKNOWN will
* perform a non-hinted invalidation.
* perform a non-hinted invalidation. flags may be TLBF_NONE (0) or
* any combination of TLBF_NOWALKCACHE (elide eviction of walk
* cache entries), TLBF_NONOTIFY (don't call mmu notifiers),
* TLBF_NOSYNC (don't issue trailing dsb) and TLBF_NOBROADCAST
* (only perform the invalidation for the local cpu).
*
* local_flush_tlb_page(vma, addr)
* Local variant of flush_tlb_page(). Stale TLB entries may
* remain in remote CPUs.
*
* local_flush_tlb_page_nonotify(vma, addr)
* Same as local_flush_tlb_page() except MMU notifier will not be
* called.
*
* local_flush_tlb_contpte(vma, addr)
* Invalidate the virtual-address range
* '[addr, addr+CONT_PTE_SIZE)' mapped with contpte on local CPU
* for the user address space corresponding to 'vma->mm'. Stale
* TLB entries may remain in remote CPUs.
* __flush_tlb_page(vma, addr, flags)
* Invalidate a single user mapping for address 'addr' in the
* address space corresponding to 'vma->mm'. Note that this
* operation only invalidates a single level 3 page-table entry
* and therefore does not affect any walk-caches. flags may contain
* any combination of TLBF_NONOTIFY (don't call mmu notifiers),
* TLBF_NOSYNC (don't issue trailing dsb) and TLBF_NOBROADCAST
* (only perform the invalidation for the local cpu).
*
* Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
* on top of these routines, since that is our interface to the mmu_gather
@@ -315,59 +338,6 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
}
static inline void __local_flush_tlb_page_nonotify_nosync(struct mm_struct *mm,
unsigned long uaddr)
{
unsigned long addr;
dsb(nshst);
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1, addr);
__tlbi_user(vale1, addr);
}
static inline void local_flush_tlb_page_nonotify(struct vm_area_struct *vma,
unsigned long uaddr)
{
__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
dsb(nsh);
}
static inline void local_flush_tlb_page(struct vm_area_struct *vma,
unsigned long uaddr)
{
__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
(uaddr & PAGE_MASK) + PAGE_SIZE);
dsb(nsh);
}
static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
unsigned long uaddr)
{
unsigned long addr;
dsb(ishst);
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1is, addr);
__tlbi_user(vale1is, addr);
mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
(uaddr & PAGE_MASK) + PAGE_SIZE);
}
static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
unsigned long uaddr)
{
return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
}
static inline void flush_tlb_page(struct vm_area_struct *vma,
unsigned long uaddr)
{
flush_tlb_page_nosync(vma, uaddr);
__tlbi_sync_s1ish();
}
static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
{
return true;
@@ -397,14 +367,13 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
/*
* __flush_tlb_range_op - Perform TLBI operation upon a range
*
* @op: TLBI instruction that operates on a range (has 'r' prefix)
* @lop: TLBI level operation to perform
* @rop: TLBI range operation to perform
* @start: The start address of the range
* @pages: Range as the number of pages from 'start'
* @stride: Flush granularity
* @asid: The ASID of the task (0 for IPA instructions)
* @tlb_level: Translation Table level hint, if known
* @tlbi_user: If 'true', call an additional __tlbi_user()
* (typically for user ASIDs). 'false' for IPA instructions
* @level: Translation Table level hint, if known
* @lpa2: If 'true', the lpa2 scheme is used as set out below
*
* When the CPU does not support TLB range operations, flush the TLB
@@ -427,116 +396,181 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
* operations can only span an even number of pages. We save this for last to
* ensure 64KB start alignment is maintained for the LPA2 case.
*/
#define __flush_tlb_range_op(op, start, pages, stride, \
asid, tlb_level, tlbi_user, lpa2) \
do { \
typeof(start) __flush_start = start; \
typeof(pages) __flush_pages = pages; \
int num = 0; \
int scale = 3; \
int shift = lpa2 ? 16 : PAGE_SHIFT; \
unsigned long addr; \
\
while (__flush_pages > 0) { \
if (!system_supports_tlb_range() || \
__flush_pages == 1 || \
(lpa2 && __flush_start != ALIGN(__flush_start, SZ_64K))) { \
addr = __TLBI_VADDR(__flush_start, asid); \
__tlbi_level(op, addr, tlb_level); \
if (tlbi_user) \
__tlbi_user_level(op, addr, tlb_level); \
__flush_start += stride; \
__flush_pages -= stride >> PAGE_SHIFT; \
continue; \
} \
\
num = __TLBI_RANGE_NUM(__flush_pages, scale); \
if (num >= 0) { \
addr = __TLBI_VADDR_RANGE(__flush_start >> shift, asid, \
scale, num, tlb_level); \
__tlbi(r##op, addr); \
if (tlbi_user) \
__tlbi_user(r##op, addr); \
__flush_start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; \
__flush_pages -= __TLBI_RANGE_PAGES(num, scale);\
} \
scale--; \
} \
} while (0)
#define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
__flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false, kvm_lpa2_is_enabled());
static inline bool __flush_tlb_range_limit_excess(unsigned long start,
unsigned long end, unsigned long pages, unsigned long stride)
static __always_inline void rvae1is(u64 arg)
{
/*
* When the system does not support TLB range based flush
* operation, (MAX_DVM_OPS - 1) pages can be handled. But
* with TLB range based operation, MAX_TLBI_RANGE_PAGES
* pages can be handled.
*/
if ((!system_supports_tlb_range() &&
(end - start) >= (MAX_DVM_OPS * stride)) ||
pages > MAX_TLBI_RANGE_PAGES)
return true;
return false;
__tlbi(rvae1is, arg);
__tlbi_user(rvae1is, arg);
}
static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
unsigned long start, unsigned long end,
unsigned long stride, bool last_level,
int tlb_level)
static __always_inline void rvale1(u64 arg)
{
__tlbi(rvale1, arg);
__tlbi_user(rvale1, arg);
}
static __always_inline void rvale1is(u64 arg)
{
__tlbi(rvale1is, arg);
__tlbi_user(rvale1is, arg);
}
static __always_inline void rvaale1is(u64 arg)
{
__tlbi(rvaale1is, arg);
}
static __always_inline void ripas2e1is(u64 arg)
{
__tlbi(ripas2e1is, arg);
}
static __always_inline void __tlbi_range(tlbi_op op, u64 addr,
u16 asid, int scale, int num,
u32 level, bool lpa2)
{
u64 arg = 0;
arg |= FIELD_PREP(TLBIR_BADDR_MASK, addr >> (lpa2 ? 16 : PAGE_SHIFT));
arg |= FIELD_PREP(TLBIR_TTL_MASK, level > 3 ? 0 : level);
arg |= FIELD_PREP(TLBIR_NUM_MASK, num);
arg |= FIELD_PREP(TLBIR_SCALE_MASK, scale);
arg |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule());
arg |= FIELD_PREP(TLBIR_ASID_MASK, asid);
op(arg);
}
static __always_inline void __flush_tlb_range_op(tlbi_op lop, tlbi_op rop,
u64 start, size_t pages,
u64 stride, u16 asid,
u32 level, bool lpa2)
{
u64 addr = start, end = start + pages * PAGE_SIZE;
int scale = 3;
while (addr != end) {
int num;
pages = (end - addr) >> PAGE_SHIFT;
if (!system_supports_tlb_range() || pages == 1)
goto invalidate_one;
if (lpa2 && !IS_ALIGNED(addr, SZ_64K))
goto invalidate_one;
num = __TLBI_RANGE_NUM(pages, scale);
if (num >= 0) {
__tlbi_range(rop, addr, asid, scale, num, level, lpa2);
addr += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT;
}
scale--;
continue;
invalidate_one:
__tlbi_level_asid(lop, addr, level, asid);
addr += stride;
}
}
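The rewritten `__flush_tlb_range_op()` greedily emits range operations from the largest scale downwards, falling back to single-entry invalidations where a range op does not apply. A userspace sketch of the same splitting logic, counting operations instead of issuing TLBIs (PAGE_SHIFT is assumed 4K, range support is assumed present, and the LPA2 64K-alignment fallback is omitted):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

static unsigned long range_pages(int num, int scale)
{
	return (unsigned long)(num + 1) << (5 * scale + 1);
}

static int range_num(unsigned long pages, int scale)
{
	return (int)((pages >> (5 * scale + 1)) - 1);
}

/* Count how many TLBI operations the split emits for 'pages' pages. */
static int count_ops(unsigned long start, unsigned long pages)
{
	unsigned long addr = start, end = start + pages * PAGE_SIZE;
	int scale = 3, ops = 0;

	while (addr != end) {
		unsigned long left = (end - addr) >> PAGE_SHIFT;

		if (left == 1) {	/* single last-level invalidation */
			addr += PAGE_SIZE;
			ops++;
			continue;
		}
		int num = range_num(left, scale);
		if (num >= 0) {		/* one range op covers this chunk */
			addr += range_pages(num, scale) << PAGE_SHIFT;
			ops++;
		}
		scale--;
	}
	return ops;
}
```

At scale 0 a range op always covers an even number of pages, so at most one trailing single-page invalidation remains and 'scale' never goes negative while still in use.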
#define __flush_s1_tlb_range_op(op, start, pages, stride, asid, tlb_level) \
__flush_tlb_range_op(op, r##op, start, pages, stride, asid, tlb_level, lpa2_is_enabled())
#define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
__flush_tlb_range_op(op, r##op, start, pages, stride, 0, tlb_level, kvm_lpa2_is_enabled())
static inline bool __flush_tlb_range_limit_excess(unsigned long pages,
unsigned long stride)
{
/*
* Assume that the worst case number of DVM ops required to flush a
* given range on a system that supports tlb-range is 20 (4 scales, 1
* final page, 15 for alignment on LPA2 systems), which is much smaller
* than MAX_DVM_OPS.
*/
if (system_supports_tlb_range())
return pages > MAX_TLBI_RANGE_PAGES;
return pages >= (MAX_DVM_OPS * stride) >> PAGE_SHIFT;
}
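The budget check above can be mirrored in a standalone sketch. The MAX_DVM_OPS value is an assumption for illustration (in the kernel it is derived from the page-table geometry); MAX_TLBI_RANGE_PAGES matches `__TLBI_RANGE_PAGES(31, 3)`:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT           12
#define PAGE_SIZE            (1UL << PAGE_SHIFT)
#define MAX_DVM_OPS          512           /* assumed value for the sketch */
#define MAX_TLBI_RANGE_PAGES (32UL << 16)  /* __TLBI_RANGE_PAGES(31, 3) */

/* Decide whether a ranged flush would exceed the per-flush budget and
 * the caller should fall back to flushing the whole mm instead. */
static bool limit_excess(unsigned long pages, unsigned long stride,
			 bool have_range_ops)
{
	if (have_range_ops)
		return pages > MAX_TLBI_RANGE_PAGES;
	/* Without FEAT_TLBIRANGE, every 'stride' costs one DVM operation. */
	return pages >= (MAX_DVM_OPS * stride) >> PAGE_SHIFT;
}
```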
typedef unsigned __bitwise tlbf_t;
/* No special behaviour. */
#define TLBF_NONE ((__force tlbf_t)0)
/* Invalidate tlb entries only, leaving the page table walk cache intact. */
#define TLBF_NOWALKCACHE ((__force tlbf_t)BIT(0))
/* Skip the trailing dsb after issuing tlbi. */
#define TLBF_NOSYNC ((__force tlbf_t)BIT(1))
/* Suppress tlb notifier callbacks for this flush operation. */
#define TLBF_NONOTIFY ((__force tlbf_t)BIT(2))
/* Perform the tlbi locally without broadcasting to other CPUs. */
#define TLBF_NOBROADCAST ((__force tlbf_t)BIT(3))
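The walk-cache and broadcast bits together select which TLBI instruction is used; the other flags only gate the notifier call and the trailing barrier. A compilable mirror of that dispatch (the mnemonic strings stand in for the real instructions; TLBF_NOBROADCAST on its own is an unused combination):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t tlbf_t;
#define TLBF_NONE         0u
#define TLBF_NOWALKCACHE  (1u << 0)
#define TLBF_NOSYNC       (1u << 1)
#define TLBF_NONOTIFY     (1u << 2)
#define TLBF_NOBROADCAST  (1u << 3)

/* Pick the TLBI mnemonic the flag combination maps to, as in the
 * switch statement of __do_flush_tlb_range(). */
static const char *pick_op(tlbf_t flags)
{
	switch (flags & (TLBF_NOWALKCACHE | TLBF_NOBROADCAST)) {
	case TLBF_NONE:                           return "vae1is";
	case TLBF_NOWALKCACHE:                    return "vale1is";
	case TLBF_NOWALKCACHE | TLBF_NOBROADCAST: return "vale1";
	default:                                  return NULL; /* unused */
	}
}
```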
static __always_inline void __do_flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long stride, int tlb_level,
tlbf_t flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long asid, pages;
start = round_down(start, stride);
end = round_up(end, stride);
pages = (end - start) >> PAGE_SHIFT;
if (__flush_tlb_range_limit_excess(start, end, pages, stride)) {
if (__flush_tlb_range_limit_excess(pages, stride)) {
flush_tlb_mm(mm);
return;
}
dsb(ishst);
if (!(flags & TLBF_NOBROADCAST))
dsb(ishst);
else
dsb(nshst);
asid = ASID(mm);
if (last_level)
__flush_tlb_range_op(vale1is, start, pages, stride, asid,
tlb_level, true, lpa2_is_enabled());
else
__flush_tlb_range_op(vae1is, start, pages, stride, asid,
tlb_level, true, lpa2_is_enabled());
switch (flags & (TLBF_NOWALKCACHE | TLBF_NOBROADCAST)) {
case TLBF_NONE:
__flush_s1_tlb_range_op(vae1is, start, pages, stride,
asid, tlb_level);
break;
case TLBF_NOWALKCACHE:
__flush_s1_tlb_range_op(vale1is, start, pages, stride,
asid, tlb_level);
break;
case TLBF_NOBROADCAST:
/* Combination unused */
BUG();
break;
case TLBF_NOWALKCACHE | TLBF_NOBROADCAST:
__flush_s1_tlb_range_op(vale1, start, pages, stride,
asid, tlb_level);
break;
}
mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
if (!(flags & TLBF_NONOTIFY))
mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
if (!(flags & TLBF_NOSYNC)) {
if (!(flags & TLBF_NOBROADCAST))
__tlbi_sync_s1ish();
else
dsb(nsh);
}
}
static inline void __flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long stride, bool last_level,
int tlb_level)
unsigned long stride, int tlb_level,
tlbf_t flags)
{
__flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
last_level, tlb_level);
__tlbi_sync_s1ish();
}
static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
unsigned long addr)
{
unsigned long asid;
addr = round_down(addr, CONT_PTE_SIZE);
dsb(nshst);
asid = ASID(vma->vm_mm);
__flush_tlb_range_op(vale1, addr, CONT_PTES, PAGE_SIZE, asid,
3, true, lpa2_is_enabled());
mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, addr,
addr + CONT_PTE_SIZE);
dsb(nsh);
start = round_down(start, stride);
end = round_up(end, stride);
__do_flush_tlb_range(vma, start, end, stride, tlb_level, flags);
}
static inline void flush_tlb_range(struct vm_area_struct *vma,
@@ -548,7 +582,23 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
* Set the tlb_level to TLBI_TTL_UNKNOWN because we can not get enough
* information here.
*/
__flush_tlb_range(vma, start, end, PAGE_SIZE, false, TLBI_TTL_UNKNOWN);
__flush_tlb_range(vma, start, end, PAGE_SIZE, TLBI_TTL_UNKNOWN, TLBF_NONE);
}
static inline void __flush_tlb_page(struct vm_area_struct *vma,
unsigned long uaddr, tlbf_t flags)
{
unsigned long start = round_down(uaddr, PAGE_SIZE);
unsigned long end = start + PAGE_SIZE;
__do_flush_tlb_range(vma, start, end, PAGE_SIZE, 3,
TLBF_NOWALKCACHE | flags);
}
static inline void flush_tlb_page(struct vm_area_struct *vma,
unsigned long uaddr)
{
__flush_tlb_page(vma, uaddr, TLBF_NONE);
}
static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
@@ -560,14 +610,14 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
end = round_up(end, stride);
pages = (end - start) >> PAGE_SHIFT;
if (__flush_tlb_range_limit_excess(start, end, pages, stride)) {
if (__flush_tlb_range_limit_excess(pages, stride)) {
flush_tlb_all();
return;
}
dsb(ishst);
__flush_tlb_range_op(vaale1is, start, pages, stride, 0,
TLBI_TTL_UNKNOWN, false, lpa2_is_enabled());
__flush_s1_tlb_range_op(vaale1is, start, pages, stride, 0,
TLBI_TTL_UNKNOWN);
__tlbi_sync_s1ish();
isb();
}
@@ -589,7 +639,10 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm, unsigned long start, unsigned long end)
{
__flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
struct vm_area_struct vma = { .vm_mm = mm, .vm_flags = 0 };
__flush_tlb_range(&vma, start, end, PAGE_SIZE, 3,
TLBF_NOWALKCACHE | TLBF_NOSYNC);
}
static inline bool __pte_flags_need_flush(ptdesc_t oldval, ptdesc_t newval)
@@ -618,6 +671,8 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
}
#define huge_pmd_needs_flush huge_pmd_needs_flush
#undef __tlbi_user
#undef __TLBI_VADDR
#endif
#endif


@@ -62,7 +62,7 @@ static inline void __uaccess_ttbr0_disable(void)
local_irq_save(flags);
ttbr = read_sysreg(ttbr1_el1);
ttbr &= ~TTBR_ASID_MASK;
ttbr &= ~TTBRx_EL1_ASID_MASK;
/* reserved_pg_dir placed before swapper_pg_dir */
write_sysreg(ttbr - RESERVED_SWAPPER_OFFSET, ttbr0_el1);
/* Set reserved ASID */
@@ -85,8 +85,8 @@ static inline void __uaccess_ttbr0_enable(void)
/* Restore active ASID */
ttbr1 = read_sysreg(ttbr1_el1);
ttbr1 &= ~TTBR_ASID_MASK; /* safety measure */
ttbr1 |= ttbr0 & TTBR_ASID_MASK;
ttbr1 &= ~TTBRx_EL1_ASID_MASK; /* safety measure */
ttbr1 |= ttbr0 & TTBRx_EL1_ASID_MASK;
write_sysreg(ttbr1, ttbr1_el1);
/* Restore user page table */


@@ -68,6 +68,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_VMCORE_INFO) += vmcore_info.o
obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o
obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o
obj-$(CONFIG_ARM64_MPAM) += mpam.o
obj-$(CONFIG_ARM64_MTE) += mte.o
obj-y += vdso-wrap.o
obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o


@@ -610,6 +610,20 @@ static int __init armv8_deprecated_init(void)
}
#endif
#ifdef CONFIG_SWP_EMULATION
/*
* The purpose of supporting LSUI is to eliminate PAN toggling. CPUs
* that support LSUI are unlikely to support a 32-bit runtime. Rather
* than emulating the SWP instruction using LSUI instructions, simply
* disable SWP emulation.
*/
if (cpus_have_final_cap(ARM64_HAS_LSUI)) {
insn_swp.status = INSN_UNAVAILABLE;
pr_info("swp/swpb instruction emulation is not supported on this system\n");
}
#endif
for (int i = 0; i < ARRAY_SIZE(insn_emulations); i++) {
struct insn_emulation *ie = insn_emulations[i];


@@ -77,6 +77,7 @@
#include <linux/percpu.h>
#include <linux/sched/isolation.h>
#include <asm/arm_pmuv3.h>
#include <asm/cpu.h>
#include <asm/cpufeature.h>
#include <asm/cpu_ops.h>
@@ -86,6 +87,7 @@
#include <asm/kvm_host.h>
#include <asm/mmu.h>
#include <asm/mmu_context.h>
#include <asm/mpam.h>
#include <asm/mte.h>
#include <asm/hypervisor.h>
#include <asm/processor.h>
@@ -281,6 +283,7 @@ static const struct arm64_ftr_bits ftr_id_aa64isar2[] = {
static const struct arm64_ftr_bits ftr_id_aa64isar3[] = {
ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_FPRCVT_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_LSUI_SHIFT, 4, ID_AA64ISAR3_EL1_LSUI_NI),
ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_LSFE_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_FAMINMAX_SHIFT, 4, 0),
ARM64_FTR_END,
@@ -565,7 +568,7 @@ static const struct arm64_ftr_bits ftr_id_aa64dfr0[] = {
* We can instantiate multiple PMU instances with different levels
* of support.
*/
S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, ID_AA64DFR0_EL1_PMUVer_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, ID_AA64DFR0_EL1_PMUVer_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_EXACT, ID_AA64DFR0_EL1_DebugVer_SHIFT, 4, 0x6),
ARM64_FTR_END,
};
@@ -709,7 +712,7 @@ static const struct arm64_ftr_bits ftr_id_pfr2[] = {
static const struct arm64_ftr_bits ftr_id_dfr0[] = {
/* [31:28] TraceFilt */
S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, ID_DFR0_EL1_PerfMon_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, ID_DFR0_EL1_PerfMon_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_MProfDbg_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_MMapTrc_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_CopTrc_SHIFT, 4, 0),
@@ -1927,19 +1930,10 @@ static bool has_pmuv3(const struct arm64_cpu_capabilities *entry, int scope)
u64 dfr0 = read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1);
unsigned int pmuver;
/*
* PMUVer follows the standard ID scheme for an unsigned field with the
* exception of 0xF (IMP_DEF) which is treated specially and implies
* FEAT_PMUv3 is not implemented.
*
* See DDI0487L.a D24.1.3.2 for more details.
*/
pmuver = cpuid_feature_extract_unsigned_field(dfr0,
ID_AA64DFR0_EL1_PMUVer_SHIFT);
if (pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF)
return false;
return pmuver >= ID_AA64DFR0_EL1_PMUVer_IMP;
return pmuv3_implemented(pmuver);
}
#endif
@@ -2501,13 +2495,19 @@ test_has_mpam(const struct arm64_cpu_capabilities *entry, int scope)
static void
cpu_enable_mpam(const struct arm64_cpu_capabilities *entry)
{
/*
* Access by the kernel (at EL1) should use the reserved PARTID
* which is configured unrestricted. This avoids priority-inversion
* where latency sensitive tasks have to wait for a task that has
* been throttled to release the lock.
*/
write_sysreg_s(0, SYS_MPAM1_EL1);
int cpu = smp_processor_id();
u64 regval = 0;
if (IS_ENABLED(CONFIG_ARM64_MPAM) && static_branch_likely(&mpam_enabled))
regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu));
write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1);
if (cpus_have_cap(ARM64_SME))
write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D), SYS_MPAMSM_EL1);
isb();
/* Synchronising the EL0 write is left until the ERET to EL0 */
write_sysreg_s(regval, SYS_MPAM0_EL1);
}
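The regval written here is a single u64 image carrying PARTID and PMG for both data and instruction accesses, the same shape cached in thread_info::mpam_partid_pmg. A rough userspace sketch of the packing; the field offsets (PARTID_D at bit 0, PARTID_I at 16, PMG_D at 32, PMG_I at 40) follow the Arm MPAM supplement and should be treated as assumptions to check against the spec:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed MPAMn_ELx field positions (see the Arm MPAM supplement). */
#define MPAM_PARTID_D(x)  ((uint64_t)(x) << 0)
#define MPAM_PARTID_I(x)  ((uint64_t)(x) << 16)
#define MPAM_PMG_D(x)     ((uint64_t)(x) << 32)
#define MPAM_PMG_I(x)     ((uint64_t)(x) << 40)

/* Pack one PARTID/PMG pair for both data and instruction accesses. */
static uint64_t mpam_pack(uint16_t partid, uint8_t pmg)
{
	return MPAM_PARTID_D(partid) | MPAM_PARTID_I(partid) |
	       MPAM_PMG_D(pmg) | MPAM_PMG_I(pmg);
}
```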
static bool
@@ -3178,6 +3178,15 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
.cpu_enable = cpu_enable_ls64_v,
ARM64_CPUID_FIELDS(ID_AA64ISAR1_EL1, LS64, LS64_V)
},
#ifdef CONFIG_ARM64_LSUI
{
.desc = "Unprivileged Load Store Instructions (LSUI)",
.capability = ARM64_HAS_LSUI,
.type = ARM64_CPUCAP_SYSTEM_FEATURE,
.matches = has_cpuid_feature,
ARM64_CPUID_FIELDS(ID_AA64ISAR3_EL1, LSUI, IMP)
},
#endif
{},
};


@@ -35,11 +35,11 @@
* Before this function is called it is not safe to call regular kernel code,
* instrumentable code, or any code which may trigger an exception.
*/
static noinstr irqentry_state_t enter_from_kernel_mode(struct pt_regs *regs)
static noinstr irqentry_state_t arm64_enter_from_kernel_mode(struct pt_regs *regs)
{
irqentry_state_t state;
state = irqentry_enter(regs);
state = irqentry_enter_from_kernel_mode(regs);
mte_check_tfsr_entry();
mte_disable_tco_entry(current);
@@ -51,11 +51,14 @@ static noinstr irqentry_state_t enter_from_kernel_mode(struct pt_regs *regs)
* After this function returns it is not safe to call regular kernel code,
* instrumentable code, or any code which may trigger an exception.
*/
static void noinstr exit_to_kernel_mode(struct pt_regs *regs,
irqentry_state_t state)
static void noinstr arm64_exit_to_kernel_mode(struct pt_regs *regs,
irqentry_state_t state)
{
local_irq_disable();
irqentry_exit_to_kernel_mode_preempt(regs, state);
local_daif_mask();
mte_check_tfsr_exit();
irqentry_exit(regs, state);
irqentry_exit_to_kernel_mode_after_preempt(regs, state);
}
/*
@@ -298,11 +301,10 @@ static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
unsigned long far = read_sysreg(far_el1);
irqentry_state_t state;
state = enter_from_kernel_mode(regs);
state = arm64_enter_from_kernel_mode(regs);
local_daif_inherit(regs);
do_mem_abort(far, esr, regs);
local_daif_mask();
exit_to_kernel_mode(regs, state);
arm64_exit_to_kernel_mode(regs, state);
}
static void noinstr el1_pc(struct pt_regs *regs, unsigned long esr)
@@ -310,55 +312,50 @@ static void noinstr el1_pc(struct pt_regs *regs, unsigned long esr)
unsigned long far = read_sysreg(far_el1);
irqentry_state_t state;
state = enter_from_kernel_mode(regs);
state = arm64_enter_from_kernel_mode(regs);
local_daif_inherit(regs);
do_sp_pc_abort(far, esr, regs);
local_daif_mask();
exit_to_kernel_mode(regs, state);
arm64_exit_to_kernel_mode(regs, state);
}
static void noinstr el1_undef(struct pt_regs *regs, unsigned long esr)
{
irqentry_state_t state;
state = enter_from_kernel_mode(regs);
state = arm64_enter_from_kernel_mode(regs);
local_daif_inherit(regs);
do_el1_undef(regs, esr);
local_daif_mask();
exit_to_kernel_mode(regs, state);
arm64_exit_to_kernel_mode(regs, state);
}
static void noinstr el1_bti(struct pt_regs *regs, unsigned long esr)
{
irqentry_state_t state;
state = enter_from_kernel_mode(regs);
state = arm64_enter_from_kernel_mode(regs);
local_daif_inherit(regs);
do_el1_bti(regs, esr);
local_daif_mask();
exit_to_kernel_mode(regs, state);
arm64_exit_to_kernel_mode(regs, state);
}
static void noinstr el1_gcs(struct pt_regs *regs, unsigned long esr)
{
irqentry_state_t state;
state = enter_from_kernel_mode(regs);
state = arm64_enter_from_kernel_mode(regs);
local_daif_inherit(regs);
do_el1_gcs(regs, esr);
local_daif_mask();
exit_to_kernel_mode(regs, state);
arm64_exit_to_kernel_mode(regs, state);
}
static void noinstr el1_mops(struct pt_regs *regs, unsigned long esr)
{
irqentry_state_t state;
state = enter_from_kernel_mode(regs);
state = arm64_enter_from_kernel_mode(regs);
local_daif_inherit(regs);
do_el1_mops(regs, esr);
local_daif_mask();
exit_to_kernel_mode(regs, state);
arm64_exit_to_kernel_mode(regs, state);
}
static void noinstr el1_breakpt(struct pt_regs *regs, unsigned long esr)
@@ -420,11 +417,10 @@ static void noinstr el1_fpac(struct pt_regs *regs, unsigned long esr)
{
irqentry_state_t state;
state = enter_from_kernel_mode(regs);
state = arm64_enter_from_kernel_mode(regs);
local_daif_inherit(regs);
do_el1_fpac(regs, esr);
local_daif_mask();
exit_to_kernel_mode(regs, state);
arm64_exit_to_kernel_mode(regs, state);
}
asmlinkage void noinstr el1h_64_sync_handler(struct pt_regs *regs)
@@ -491,13 +487,13 @@ static __always_inline void __el1_irq(struct pt_regs *regs,
{
irqentry_state_t state;
state = enter_from_kernel_mode(regs);
state = arm64_enter_from_kernel_mode(regs);
irq_enter_rcu();
do_interrupt_handler(regs, handler);
irq_exit_rcu();
exit_to_kernel_mode(regs, state);
arm64_exit_to_kernel_mode(regs, state);
}
static void noinstr el1_interrupt(struct pt_regs *regs,
void (*handler)(struct pt_regs *))


@@ -273,7 +273,7 @@ alternative_if ARM64_HAS_ADDRESS_AUTH
alternative_else_nop_endif
1:
scs_load_current
scs_load_current_base
.else
add x21, sp, #PT_REGS_SIZE
get_current_task tsk
@@ -378,8 +378,6 @@ alternative_if ARM64_WORKAROUND_845719
alternative_else_nop_endif
#endif
3:
scs_save tsk
/* Ignore asynchronous tag check faults in the uaccess routines */
ldr x0, [tsk, THREAD_SCTLR_USER]
clear_mte_async_tcf x0
@@ -473,7 +471,7 @@ alternative_else_nop_endif
*/
SYM_CODE_START_LOCAL(__swpan_entry_el1)
mrs x21, ttbr0_el1
tst x21, #TTBR_ASID_MASK // Check for the reserved ASID
tst x21, #TTBRx_EL1_ASID_MASK // Check for the reserved ASID
orr x23, x23, #PSR_PAN_BIT // Set the emulated PAN in the saved SPSR
b.eq 1f // TTBR0 access already disabled
and x23, x23, #~PSR_PAN_BIT // Clear the emulated PAN in the saved SPSR


@@ -129,9 +129,6 @@ int machine_kexec_post_load(struct kimage *kimage)
}
/* Create a copy of the linear map */
trans_pgd = kexec_page_alloc(kimage);
if (!trans_pgd)
return -ENOMEM;
rc = trans_pgd_create_copy(&info, &trans_pgd, PAGE_OFFSET, PAGE_END);
if (rc)
return rc;

arch/arm64/kernel/mpam.c (new file, 62 lines)

@@ -0,0 +1,62 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2025 Arm Ltd. */
#include <asm/mpam.h>
#include <linux/arm_mpam.h>
#include <linux/cpu_pm.h>
#include <linux/jump_label.h>
#include <linux/percpu.h>
DEFINE_STATIC_KEY_FALSE(mpam_enabled);
DEFINE_PER_CPU(u64, arm64_mpam_default);
DEFINE_PER_CPU(u64, arm64_mpam_current);
u64 arm64_mpam_global_default;
static int mpam_pm_notifier(struct notifier_block *self,
unsigned long cmd, void *v)
{
u64 regval;
int cpu = smp_processor_id();
switch (cmd) {
case CPU_PM_EXIT:
/*
* Don't use mpam_thread_switch() as the system register
* value has changed under our feet.
*/
regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu));
write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1);
if (system_supports_sme()) {
write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D),
SYS_MPAMSM_EL1);
}
isb();
write_sysreg_s(regval, SYS_MPAM0_EL1);
return NOTIFY_OK;
default:
return NOTIFY_DONE;
}
}
static struct notifier_block mpam_pm_nb = {
.notifier_call = mpam_pm_notifier,
};
static int __init arm64_mpam_register_cpus(void)
{
u64 mpamidr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1);
u16 partid_max = FIELD_GET(MPAMIDR_EL1_PARTID_MAX, mpamidr);
u8 pmg_max = FIELD_GET(MPAMIDR_EL1_PMG_MAX, mpamidr);
if (!system_supports_mpam())
return 0;
cpu_pm_register_notifier(&mpam_pm_nb);
return mpam_register_requestor(partid_max, pmg_max);
}
/* Must occur before mpam_msc_driver_init() from subsys_initcall() */
arch_initcall(arm64_mpam_register_cpus)


@@ -291,6 +291,9 @@ void mte_thread_switch(struct task_struct *next)
/* TCO may not have been disabled on exception entry for the current task. */
mte_disable_tco_entry(next);
if (!system_uses_mte_async_or_asymm_mode())
return;
/*
* Check if an async tag exception occurred at EL1.
*
@@ -315,8 +318,8 @@ void mte_cpu_setup(void)
* CnP is not a boot feature so MTE gets enabled before CnP, but let's
* make sure that is the case.
*/
BUG_ON(read_sysreg(ttbr0_el1) & TTBR_CNP_BIT);
BUG_ON(read_sysreg(ttbr1_el1) & TTBR_CNP_BIT);
BUG_ON(read_sysreg(ttbr0_el1) & TTBRx_EL1_CnP);
BUG_ON(read_sysreg(ttbr1_el1) & TTBRx_EL1_CnP);
/* Normal Tagged memory type at the corresponding MAIR index */
sysreg_clear_set(mair_el1,
@@ -350,6 +353,9 @@ void mte_suspend_enter(void)
if (!system_supports_mte())
return;
if (!system_uses_mte_async_or_asymm_mode())
return;
/*
* The barriers are required to guarantee that the indirect writes
* to TFSR_EL1 are synchronized before we report the state.


@@ -51,6 +51,7 @@
#include <asm/fpsimd.h>
#include <asm/gcs.h>
#include <asm/mmu_context.h>
#include <asm/mpam.h>
#include <asm/mte.h>
#include <asm/processor.h>
#include <asm/pointer_auth.h>
@@ -699,6 +700,29 @@ void update_sctlr_el1(u64 sctlr)
isb();
}
static inline void debug_switch_state(void)
{
if (system_uses_irq_prio_masking()) {
unsigned long daif_expected = 0;
unsigned long daif_actual = read_sysreg(daif);
unsigned long pmr_expected = GIC_PRIO_IRQOFF;
unsigned long pmr_actual = read_sysreg_s(SYS_ICC_PMR_EL1);
WARN_ONCE(daif_actual != daif_expected ||
pmr_actual != pmr_expected,
"Unexpected DAIF + PMR: 0x%lx + 0x%lx (expected 0x%lx + 0x%lx)\n",
daif_actual, pmr_actual,
daif_expected, pmr_expected);
} else {
unsigned long daif_expected = DAIF_PROCCTX_NOIRQ;
unsigned long daif_actual = read_sysreg(daif);
WARN_ONCE(daif_actual != daif_expected,
"Unexpected DAIF value: 0x%lx (expected 0x%lx)\n",
daif_actual, daif_expected);
}
}
/*
* Thread switching.
*/
@@ -708,6 +732,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
{
struct task_struct *last;
debug_switch_state();
fpsimd_thread_switch(next);
tls_thread_switch(next);
hw_breakpoint_thread_switch(next);
@@ -738,6 +764,12 @@ struct task_struct *__switch_to(struct task_struct *prev,
if (prev->thread.sctlr_user != next->thread.sctlr_user)
update_sctlr_el1(next->thread.sctlr_user);
/*
* MPAM thread switch happens after the DSB to ensure prev's accesses
* use prev's MPAM settings.
*/
mpam_thread_switch(next);
/* the actual thread switch */
last = cpu_switch_to(prev, next);

@@ -145,7 +145,7 @@ void __init arm64_rsi_init(void)
return;
if (!rsi_version_matches())
return;
if (WARN_ON(rsi_get_realm_config(&config)))
if (WARN_ON(rsi_get_realm_config(lm_alias(&config))))
return;
prot_ns_shared = __phys_to_pte_val(BIT(config.ipa_bits - 1));

@@ -36,7 +36,7 @@ __do_compat_cache_op(unsigned long start, unsigned long end)
* The workaround requires an inner-shareable tlbi.
* We pick the reserved-ASID to minimise the impact.
*/
__tlbi(aside1is, __TLBI_VADDR(0, 0));
__tlbi(aside1is, 0UL);
__tlbi_sync_s1ish();
}

@@ -9,6 +9,7 @@
#include <asm/esr.h>
#include <asm/kvm_hyp.h>
#include <asm/kvm_mmu.h>
#include <asm/lsui.h>
static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool s1ptw)
{
@@ -1679,6 +1680,35 @@ int __kvm_find_s1_desc_level(struct kvm_vcpu *vcpu, u64 va, u64 ipa, int *level)
}
}
static int __lsui_swap_desc(u64 __user *ptep, u64 old, u64 new)
{
u64 tmp = old;
int ret = 0;
/*
* Wrap LSUI instructions with uaccess_ttbr0_enable()/disable(),
* as PAN toggling is not required.
*/
uaccess_ttbr0_enable();
asm volatile(__LSUI_PREAMBLE
"1: cast %[old], %[new], %[addr]\n"
"2:\n"
_ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w[ret])
: [old] "+r" (old), [addr] "+Q" (*ptep), [ret] "+r" (ret)
: [new] "r" (new)
: "memory");
uaccess_ttbr0_disable();
if (ret)
return ret;
if (tmp != old)
return -EAGAIN;
return ret;
}
static int __lse_swap_desc(u64 __user *ptep, u64 old, u64 new)
{
u64 tmp = old;
@@ -1754,7 +1784,9 @@ int __kvm_at_swap_desc(struct kvm *kvm, gpa_t ipa, u64 old, u64 new)
return -EPERM;
ptep = (void __user *)hva + offset;
if (cpus_have_final_cap(ARM64_HAS_LSE_ATOMICS))
if (cpus_have_final_cap(ARM64_HAS_LSUI))
r = __lsui_swap_desc(ptep, old, new);
else if (cpus_have_final_cap(ARM64_HAS_LSE_ATOMICS))
r = __lse_swap_desc(ptep, old, new);
else
r = __llsc_swap_desc(ptep, old, new);
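The three swap helpers above share one contract: atomically compare-and-swap the descriptor at `ptep` from `old` to `new`, and return `-EAGAIN` when the value observed in memory differs from `old`. A minimal userspace sketch of that contract, using GCC's `__atomic` builtins in place of the LSUI/LSE/LL-SC instruction sequences (the function name is illustrative):

```c
#include <stdint.h>
#include <errno.h>
#include <assert.h>

/*
 * Sketch of the swap-desc contract: CAS *ptep from old to new;
 * return -EAGAIN if another writer changed the descriptor first.
 */
static int swap_desc(uint64_t *ptep, uint64_t old, uint64_t new)
{
	uint64_t expected = old;

	/* Strong CAS with full ordering, like a CASAL instruction. */
	if (__atomic_compare_exchange_n(ptep, &expected, new, 0,
					__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
		return 0;

	return -EAGAIN;	/* expected now holds the current value */
}
```

Callers such as `__kvm_at_swap_desc()` can then retry on `-EAGAIN` after re-reading the descriptor.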

@@ -10,6 +10,7 @@
#include <linux/kvm_host.h>
#include <linux/hw_breakpoint.h>
#include <asm/arm_pmuv3.h>
#include <asm/debug-monitors.h>
#include <asm/kvm_asm.h>
#include <asm/kvm_arm.h>
@@ -75,8 +76,10 @@ static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu)
void kvm_init_host_debug_data(void)
{
u64 dfr0 = read_sysreg(id_aa64dfr0_el1);
unsigned int pmuver = cpuid_feature_extract_unsigned_field(dfr0,
ID_AA64DFR0_EL1_PMUVer_SHIFT);
if (cpuid_feature_extract_signed_field(dfr0, ID_AA64DFR0_EL1_PMUVer_SHIFT) > 0)
if (pmuv3_implemented(pmuver))
*host_data_ptr(nr_event_counters) = FIELD_GET(ARMV8_PMU_PMCR_N,
read_sysreg(pmcr_el0));

@@ -267,7 +267,8 @@ static inline void __deactivate_traps_hfgxtr(struct kvm_vcpu *vcpu)
static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu)
{
u64 r = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1;
u64 clr = MPAM2_EL2_EnMPAMSM;
u64 set = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1;
if (!system_supports_mpam())
return;
@@ -277,18 +278,21 @@ static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu)
write_sysreg_s(MPAMHCR_EL2_TRAP_MPAMIDR_EL1, SYS_MPAMHCR_EL2);
} else {
/* From v1.1 TIDR can trap MPAMIDR, set it unconditionally */
r |= MPAM2_EL2_TIDR;
set |= MPAM2_EL2_TIDR;
}
write_sysreg_s(r, SYS_MPAM2_EL2);
sysreg_clear_set_s(SYS_MPAM2_EL2, clr, set);
}
static inline void __deactivate_traps_mpam(void)
{
u64 clr = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1 | MPAM2_EL2_TIDR;
u64 set = MPAM2_EL2_EnMPAMSM;
if (!system_supports_mpam())
return;
write_sysreg_s(0, SYS_MPAM2_EL2);
sysreg_clear_set_s(SYS_MPAM2_EL2, clr, set);
if (system_supports_mpam_hcr())
write_sysreg_s(MPAMHCR_HOST_FLAGS, SYS_MPAMHCR_EL2);

@@ -130,7 +130,7 @@ SYM_CODE_START_LOCAL(___kvm_hyp_init)
ldr x1, [x0, #NVHE_INIT_PGD_PA]
phys_to_ttbr x2, x1
alternative_if ARM64_HAS_CNP
orr x2, x2, #TTBR_CNP_BIT
orr x2, x2, #TTBRx_EL1_CnP
alternative_else_nop_endif
msr ttbr0_el2, x2
@@ -291,7 +291,7 @@ SYM_TYPED_FUNC_START(__pkvm_init_switch_pgd)
/* Install the new pgtables */
phys_to_ttbr x5, x0
alternative_if ARM64_HAS_CNP
orr x5, x5, #TTBR_CNP_BIT
orr x5, x5, #TTBRx_EL1_CnP
alternative_else_nop_endif
msr ttbr0_el2, x5

@@ -270,7 +270,7 @@ static void fixmap_clear_slot(struct hyp_fixmap_slot *slot)
* https://lore.kernel.org/kvm/20221017115209.2099-1-will@kernel.org/T/#mf10dfbaf1eaef9274c581b81c53758918c1d0f03
*/
dsb(ishst);
__tlbi_level(vale2is, __TLBI_VADDR(addr, 0), level);
__tlbi_level(vale2is, addr, level);
__tlbi_sync_s1ish_hyp();
isb();
}

@@ -158,7 +158,6 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
* Instead, we invalidate Stage-2 for this IPA, and the
* whole of Stage-1. Weep...
*/
ipa >>= 12;
__tlbi_level(ipas2e1is, ipa, level);
/*
@@ -188,7 +187,6 @@ void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
* Instead, we invalidate Stage-2 for this IPA, and the
* whole of Stage-1. Weep...
*/
ipa >>= 12;
__tlbi_level(ipas2e1, ipa, level);
/*

@@ -490,14 +490,14 @@ static int hyp_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
kvm_clear_pte(ctx->ptep);
dsb(ishst);
__tlbi_level(vae2is, __TLBI_VADDR(ctx->addr, 0), TLBI_TTL_UNKNOWN);
__tlbi_level(vae2is, ctx->addr, TLBI_TTL_UNKNOWN);
} else {
if (ctx->end - ctx->addr < granule)
return -EINVAL;
kvm_clear_pte(ctx->ptep);
dsb(ishst);
__tlbi_level(vale2is, __TLBI_VADDR(ctx->addr, 0), ctx->level);
__tlbi_level(vale2is, ctx->addr, ctx->level);
*unmapped += granule;
}

@@ -183,6 +183,21 @@ void sysreg_restore_guest_state_vhe(struct kvm_cpu_context *ctxt)
}
NOKPROBE_SYMBOL(sysreg_restore_guest_state_vhe);
/*
* The _EL0 value was written by the host's context switch and belongs to the
* VMM. Copy this into the guest's _EL1 register.
*/
static inline void __mpam_guest_load(void)
{
u64 mask = MPAM0_EL1_PARTID_D | MPAM0_EL1_PARTID_I | MPAM0_EL1_PMG_D | MPAM0_EL1_PMG_I;
if (system_supports_mpam()) {
u64 val = (read_sysreg_s(SYS_MPAM0_EL1) & mask) | MPAM1_EL1_MPAMEN;
write_sysreg_el1(val, SYS_MPAM1);
}
}
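The field manipulation in `__mpam_guest_load()` can be sketched standalone: copy only the partition-ID and performance-monitoring-group fields from the value the host wrote for EL0, then force the enable bit on for the guest's EL1 view. The masks and bit positions below are illustrative stand-ins, not the architected MPAM0_EL1 layout:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative field masks; the kernel uses the sysreg definitions. */
#define SKETCH_PARTID_MASK	0x000000000000ffffUL	/* partition IDs   */
#define SKETCH_PMG_MASK		0x0000ff0000000000UL	/* monitoring grps */
#define SKETCH_MPAMEN		(1UL << 63)		/* enable bit      */

/*
 * Sketch of __mpam_guest_load(): keep only the PARTID/PMG fields from
 * the host-written EL0 value and build the guest EL1 value with MPAM
 * enabled; everything outside the mask is dropped.
 */
static uint64_t mpam_guest_el1(uint64_t host_el0)
{
	uint64_t mask = SKETCH_PARTID_MASK | SKETCH_PMG_MASK;

	return (host_el0 & mask) | SKETCH_MPAMEN;
}
```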
/**
* __vcpu_load_switch_sysregs - Load guest system registers to the physical CPU
*
@@ -222,6 +237,7 @@ void __vcpu_load_switch_sysregs(struct kvm_vcpu *vcpu)
*/
__sysreg32_restore_state(vcpu);
__sysreg_restore_user_state(guest_ctxt);
__mpam_guest_load();
if (unlikely(is_hyp_ctxt(vcpu))) {
__sysreg_restore_vel2_state(vcpu);

@@ -104,7 +104,6 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
* Instead, we invalidate Stage-2 for this IPA, and the
* whole of Stage-1. Weep...
*/
ipa >>= 12;
__tlbi_level(ipas2e1is, ipa, level);
/*
@@ -136,7 +135,6 @@ void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
* Instead, we invalidate Stage-2 for this IPA, and the
* whole of Stage-1. Weep...
*/
ipa >>= 12;
__tlbi_level(ipas2e1, ipa, level);
/*

@@ -1805,7 +1805,7 @@ static u64 __kvm_read_sanitised_id_reg(const struct kvm_vcpu *vcpu,
break;
case SYS_ID_AA64ISAR3_EL1:
val &= ID_AA64ISAR3_EL1_FPRCVT | ID_AA64ISAR3_EL1_LSFE |
ID_AA64ISAR3_EL1_FAMINMAX;
ID_AA64ISAR3_EL1_FAMINMAX | ID_AA64ISAR3_EL1_LSUI;
break;
case SYS_ID_AA64MMFR2_EL1:
val &= ~ID_AA64MMFR2_EL1_CCIDX_MASK;
@@ -3252,6 +3252,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
ID_AA64ISAR2_EL1_GPA3)),
ID_WRITABLE(ID_AA64ISAR3_EL1, (ID_AA64ISAR3_EL1_FPRCVT |
ID_AA64ISAR3_EL1_LSFE |
ID_AA64ISAR3_EL1_LSUI |
ID_AA64ISAR3_EL1_FAMINMAX)),
ID_UNALLOCATED(6,4),
ID_UNALLOCATED(6,5),
@@ -3376,6 +3377,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
{ SYS_DESC(SYS_MPAM1_EL1), undef_access },
{ SYS_DESC(SYS_MPAM0_EL1), undef_access },
{ SYS_DESC(SYS_MPAMSM_EL1), undef_access },
{ SYS_DESC(SYS_VBAR_EL1), access_rw, reset_val, VBAR_EL1, 0 },
{ SYS_DESC(SYS_DISR_EL1), NULL, reset_val, DISR_EL1, 0 },

@@ -354,15 +354,15 @@ void cpu_do_switch_mm(phys_addr_t pgd_phys, struct mm_struct *mm)
/* Skip CNP for the reserved ASID */
if (system_supports_cnp() && asid)
ttbr0 |= TTBR_CNP_BIT;
ttbr0 |= TTBRx_EL1_CnP;
/* SW PAN needs a copy of the ASID in TTBR0 for entry */
if (IS_ENABLED(CONFIG_ARM64_SW_TTBR0_PAN))
ttbr0 |= FIELD_PREP(TTBR_ASID_MASK, asid);
ttbr0 |= FIELD_PREP(TTBRx_EL1_ASID_MASK, asid);
/* Set ASID in TTBR1 since TCR.A1 is set */
ttbr1 &= ~TTBR_ASID_MASK;
ttbr1 |= FIELD_PREP(TTBR_ASID_MASK, asid);
ttbr1 &= ~TTBRx_EL1_ASID_MASK;
ttbr1 |= FIELD_PREP(TTBRx_EL1_ASID_MASK, asid);
cpu_set_reserved_ttbr0_nosync();
write_sysreg(ttbr1, ttbr1_el1);

@@ -225,7 +225,8 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
*/
if (!system_supports_bbml2_noabort())
__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, 3,
TLBF_NOWALKCACHE);
__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
}
@@ -551,8 +552,8 @@ int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
* See comment in __ptep_clear_flush_young(); same rationale for
* eliding the trailing DSB applies here.
*/
__flush_tlb_range_nosync(vma->vm_mm, addr, end,
PAGE_SIZE, true, 3);
__flush_tlb_range(vma, addr, end, PAGE_SIZE, 3,
TLBF_NOWALKCACHE | TLBF_NOSYNC);
}
return young;
@@ -685,7 +686,10 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
__ptep_set_access_flags(vma, addr, ptep, entry, 0);
if (dirty)
local_flush_tlb_contpte(vma, start_addr);
__flush_tlb_range(vma, start_addr,
start_addr + CONT_PTE_SIZE,
PAGE_SIZE, 3,
TLBF_NOWALKCACHE | TLBF_NOBROADCAST);
} else {
__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
__ptep_set_access_flags(vma, addr, ptep, entry, dirty);

@@ -204,12 +204,13 @@ static void show_pte(unsigned long addr)
*
* Returns whether or not the PTE actually changed.
*/
int __ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty)
int __ptep_set_access_flags_anysz(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty, unsigned long pgsize)
{
pteval_t old_pteval, pteval;
pte_t pte = __ptep_get(ptep);
int level;
if (pte_same(pte, entry))
return 0;
@@ -238,8 +239,27 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
* may still cause page faults and be invalidated via
* flush_tlb_fix_spurious_fault().
*/
if (dirty)
local_flush_tlb_page(vma, address);
if (dirty) {
switch (pgsize) {
case PAGE_SIZE:
level = 3;
break;
case PMD_SIZE:
level = 2;
break;
#ifndef __PAGETABLE_PMD_FOLDED
case PUD_SIZE:
level = 1;
break;
#endif
default:
level = TLBI_TTL_UNKNOWN;
WARN_ON(1);
}
__flush_tlb_range(vma, address, address + pgsize, pgsize, level,
TLBF_NOWALKCACHE | TLBF_NOBROADCAST);
}
return 1;
}
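The pgsize-to-level mapping added above is mechanical, and drives the level hint passed to the TLB invalidation. A standalone sketch of the same switch (sizes are illustrative for a 4K-granule configuration; `0` stands in for the kernel's `TLBI_TTL_UNKNOWN`):

```c
#include <assert.h>

/* Illustrative 4K-granule sizes; the kernel derives these from config. */
#define SKETCH_PAGE_SIZE	(1UL << 12)
#define SKETCH_PMD_SIZE		(1UL << 21)
#define SKETCH_PUD_SIZE		(1UL << 30)
#define SKETCH_TTL_UNKNOWN	0	/* stand-in for TLBI_TTL_UNKNOWN */

/*
 * Map a leaf mapping size to the translation-table level used to hint
 * the TLB invalidation: 3 for a PTE, 2 for a PMD block, 1 for a PUD
 * block; unknown sizes fall back to the unhinted invalidation.
 */
static int tlbi_level_for_pgsize(unsigned long pgsize)
{
	switch (pgsize) {
	case SKETCH_PAGE_SIZE:
		return 3;
	case SKETCH_PMD_SIZE:
		return 2;
	case SKETCH_PUD_SIZE:
		return 1;
	default:
		return SKETCH_TTL_UNKNOWN;
	}
}
```

The level hint lets hardware skip walking cache entries for levels that cannot hold the mapping being invalidated.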

@@ -181,7 +181,7 @@ static pte_t get_clear_contig_flush(struct mm_struct *mm,
struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
unsigned long end = addr + (pgsize * ncontig);
__flush_hugetlb_tlb_range(&vma, addr, end, pgsize, true);
__flush_hugetlb_tlb_range(&vma, addr, end, pgsize, TLBF_NOWALKCACHE);
return orig_pte;
}
@@ -209,7 +209,7 @@ static void clear_flush(struct mm_struct *mm,
if (mm == &init_mm)
flush_tlb_kernel_range(saddr, addr);
else
__flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
__flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, TLBF_NOWALKCACHE);
}
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
@@ -427,11 +427,11 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
pte_t orig_pte;
VM_WARN_ON(!pte_present(pte));
ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize);
if (!pte_cont(pte))
return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize);
return __ptep_set_access_flags_anysz(vma, addr, ptep, pte,
dirty, pgsize);
if (!__cont_access_flags_changed(ptep, pte, ncontig))
return 0;

@@ -350,7 +350,6 @@ void __init arch_mm_preinit(void)
}
swiotlb_init(swiotlb, flags);
swiotlb_update_mem_attributes();
/*
* Check boundaries twice: Some fundamental inconsistencies can be
@@ -377,6 +376,14 @@ void __init arch_mm_preinit(void)
}
}
bool page_alloc_available __ro_after_init;
void __init mem_init(void)
{
page_alloc_available = true;
swiotlb_update_mem_attributes();
}
void free_initmem(void)
{
void *lm_init_begin = lm_alias(__init_begin);

@@ -112,7 +112,7 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
}
EXPORT_SYMBOL(phys_mem_access_prot);
static phys_addr_t __init early_pgtable_alloc(enum pgtable_type pgtable_type)
static phys_addr_t __init early_pgtable_alloc(enum pgtable_level pgtable_level)
{
phys_addr_t phys;
@@ -197,14 +197,14 @@ static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
unsigned long end, phys_addr_t phys,
pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
phys_addr_t (*pgtable_alloc)(enum pgtable_level),
int flags)
{
unsigned long next;
pmd_t pmd = READ_ONCE(*pmdp);
pte_t *ptep;
BUG_ON(pmd_sect(pmd));
BUG_ON(pmd_leaf(pmd));
if (pmd_none(pmd)) {
pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
phys_addr_t pte_phys;
@@ -212,7 +212,7 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
if (flags & NO_EXEC_MAPPINGS)
pmdval |= PMD_TABLE_PXN;
BUG_ON(!pgtable_alloc);
pte_phys = pgtable_alloc(TABLE_PTE);
pte_phys = pgtable_alloc(PGTABLE_LEVEL_PTE);
if (pte_phys == INVALID_PHYS_ADDR)
return -ENOMEM;
ptep = pte_set_fixmap(pte_phys);
@@ -252,7 +252,7 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type), int flags)
phys_addr_t (*pgtable_alloc)(enum pgtable_level), int flags)
{
unsigned long next;
@@ -292,7 +292,7 @@ static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
unsigned long end, phys_addr_t phys,
pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
phys_addr_t (*pgtable_alloc)(enum pgtable_level),
int flags)
{
int ret;
@@ -303,7 +303,7 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
/*
* Check for initial section mappings in the pgd/pud.
*/
BUG_ON(pud_sect(pud));
BUG_ON(pud_leaf(pud));
if (pud_none(pud)) {
pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
phys_addr_t pmd_phys;
@@ -311,7 +311,7 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
if (flags & NO_EXEC_MAPPINGS)
pudval |= PUD_TABLE_PXN;
BUG_ON(!pgtable_alloc);
pmd_phys = pgtable_alloc(TABLE_PMD);
pmd_phys = pgtable_alloc(PGTABLE_LEVEL_PMD);
if (pmd_phys == INVALID_PHYS_ADDR)
return -ENOMEM;
pmdp = pmd_set_fixmap(pmd_phys);
@@ -349,7 +349,7 @@ out:
static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
phys_addr_t (*pgtable_alloc)(enum pgtable_level),
int flags)
{
int ret = 0;
@@ -364,7 +364,7 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
if (flags & NO_EXEC_MAPPINGS)
p4dval |= P4D_TABLE_PXN;
BUG_ON(!pgtable_alloc);
pud_phys = pgtable_alloc(TABLE_PUD);
pud_phys = pgtable_alloc(PGTABLE_LEVEL_PUD);
if (pud_phys == INVALID_PHYS_ADDR)
return -ENOMEM;
pudp = pud_set_fixmap(pud_phys);
@@ -415,7 +415,7 @@ out:
static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
phys_addr_t (*pgtable_alloc)(enum pgtable_level),
int flags)
{
int ret;
@@ -430,7 +430,7 @@ static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
if (flags & NO_EXEC_MAPPINGS)
pgdval |= PGD_TABLE_PXN;
BUG_ON(!pgtable_alloc);
p4d_phys = pgtable_alloc(TABLE_P4D);
p4d_phys = pgtable_alloc(PGTABLE_LEVEL_P4D);
if (p4d_phys == INVALID_PHYS_ADDR)
return -ENOMEM;
p4dp = p4d_set_fixmap(p4d_phys);
@@ -467,7 +467,7 @@ out:
static int __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
unsigned long virt, phys_addr_t size,
pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
phys_addr_t (*pgtable_alloc)(enum pgtable_level),
int flags)
{
int ret;
@@ -500,7 +500,7 @@ static int __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
static int __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
unsigned long virt, phys_addr_t size,
pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
phys_addr_t (*pgtable_alloc)(enum pgtable_level),
int flags)
{
int ret;
@@ -516,7 +516,7 @@ static int __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
unsigned long virt, phys_addr_t size,
pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
phys_addr_t (*pgtable_alloc)(enum pgtable_level),
int flags)
{
int ret;
@@ -528,7 +528,7 @@ static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
}
static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
enum pgtable_type pgtable_type)
enum pgtable_level pgtable_level)
{
/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
@@ -539,40 +539,43 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
pa = page_to_phys(ptdesc_page(ptdesc));
switch (pgtable_type) {
case TABLE_PTE:
switch (pgtable_level) {
case PGTABLE_LEVEL_PTE:
BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
break;
case TABLE_PMD:
case PGTABLE_LEVEL_PMD:
BUG_ON(!pagetable_pmd_ctor(mm, ptdesc));
break;
case TABLE_PUD:
case PGTABLE_LEVEL_PUD:
pagetable_pud_ctor(ptdesc);
break;
case TABLE_P4D:
case PGTABLE_LEVEL_P4D:
pagetable_p4d_ctor(ptdesc);
break;
case PGTABLE_LEVEL_PGD:
VM_WARN_ON(1);
break;
}
return pa;
}
static phys_addr_t
pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp)
pgd_pgtable_alloc_init_mm_gfp(enum pgtable_level pgtable_level, gfp_t gfp)
{
return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type);
return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_level);
}
static phys_addr_t __maybe_unused
pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
pgd_pgtable_alloc_init_mm(enum pgtable_level pgtable_level)
{
return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL);
return pgd_pgtable_alloc_init_mm_gfp(pgtable_level, GFP_PGTABLE_KERNEL);
}
static phys_addr_t
pgd_pgtable_alloc_special_mm(enum pgtable_type pgtable_type)
pgd_pgtable_alloc_special_mm(enum pgtable_level pgtable_level)
{
return __pgd_pgtable_alloc(NULL, GFP_PGTABLE_KERNEL, pgtable_type);
return __pgd_pgtable_alloc(NULL, GFP_PGTABLE_KERNEL, pgtable_level);
}
static void split_contpte(pte_t *ptep)
@@ -593,7 +596,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
pte_t *ptep;
int i;
pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp);
pte_phys = pgd_pgtable_alloc_init_mm_gfp(PGTABLE_LEVEL_PTE, gfp);
if (pte_phys == INVALID_PHYS_ADDR)
return -ENOMEM;
ptep = (pte_t *)phys_to_virt(pte_phys);
@@ -602,6 +605,8 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
tableprot |= PMD_TABLE_PXN;
prot = __pgprot((pgprot_val(prot) & ~PTE_TYPE_MASK) | PTE_TYPE_PAGE);
if (!pmd_valid(pmd))
prot = pte_pgprot(pte_mkinvalid(pfn_pte(0, prot)));
prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
if (to_cont)
prot = __pgprot(pgprot_val(prot) | PTE_CONT);
@@ -638,7 +643,7 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
pmd_t *pmdp;
int i;
pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp);
pmd_phys = pgd_pgtable_alloc_init_mm_gfp(PGTABLE_LEVEL_PMD, gfp);
if (pmd_phys == INVALID_PHYS_ADDR)
return -ENOMEM;
pmdp = (pmd_t *)phys_to_virt(pmd_phys);
@@ -647,6 +652,8 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
tableprot |= PUD_TABLE_PXN;
prot = __pgprot((pgprot_val(prot) & ~PMD_TYPE_MASK) | PMD_TYPE_SECT);
if (!pud_valid(pud))
prot = pmd_pgprot(pmd_mkinvalid(pfn_pmd(0, prot)));
prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
if (to_cont)
prot = __pgprot(pgprot_val(prot) | PTE_CONT);
@@ -768,30 +775,51 @@ static inline bool force_pte_mapping(void)
}
static DEFINE_MUTEX(pgtable_split_lock);
static bool linear_map_requires_bbml2;
int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
{
int ret;
/*
* !BBML2_NOABORT systems should not be trying to change permissions on
* anything that is not pte-mapped in the first place. Just return early
* and let the permission change code raise a warning if not already
* pte-mapped.
*/
if (!system_supports_bbml2_noabort())
return 0;
/*
* If the region is within a pte-mapped area, there is no need to try to
* split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
* change permissions from atomic context so for those cases (which are
* always pte-mapped), we must not go any further because taking the
* mutex below may sleep.
* mutex below may sleep. Do not call force_pte_mapping() here because
* it could return a confusing result if called from a secondary cpu
* prior to finalizing caps. Instead, linear_map_requires_bbml2 gives us
* what we need.
*/
if (force_pte_mapping() || is_kfence_address((void *)start))
if (!linear_map_requires_bbml2 || is_kfence_address((void *)start))
return 0;
if (!system_supports_bbml2_noabort()) {
/*
* !BBML2_NOABORT systems should not be trying to change
* permissions on anything that is not pte-mapped in the first
* place. Just return early and let the permission change code
* raise a warning if not already pte-mapped.
*/
if (system_capabilities_finalized())
return 0;
/*
* Boot-time: split_kernel_leaf_mapping_locked() allocates from
* page allocator. Can't split until it's available.
*/
if (WARN_ON(!page_alloc_available))
return -EBUSY;
/*
* Boot-time: Started secondary cpus but don't know if they
* support BBML2_NOABORT yet. Can't allow splitting in this
* window in case they don't.
*/
if (WARN_ON(num_online_cpus() > 1))
return -EBUSY;
}
/*
* Ensure start and end are at least page-aligned since this is the
* finest granularity we can split to.
@@ -891,8 +919,6 @@ static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp
return ret;
}
static bool linear_map_requires_bbml2 __initdata;
u32 idmap_kpti_bbml2_flag;
static void __init init_idmap_kpti_bbml2_flag(void)
@@ -1226,7 +1252,7 @@ static void __init declare_vma(struct vm_struct *vma,
static phys_addr_t kpti_ng_temp_alloc __initdata;
static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_type type)
static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_level pgtable_level)
{
kpti_ng_temp_alloc -= PAGE_SIZE;
return kpti_ng_temp_alloc;
@@ -1458,10 +1484,14 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
WARN_ON(!pte_present(pte));
__pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
if (free_mapped)
if (free_mapped) {
/* CONT blocks are not supported in the vmemmap */
WARN_ON(pte_cont(pte));
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
free_hotplug_page_range(pte_page(pte),
PAGE_SIZE, altmap);
}
/* unmap_hotplug_range() flushes TLB for !free_mapped */
} while (addr += PAGE_SIZE, addr < end);
}
@@ -1480,17 +1510,16 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
continue;
WARN_ON(!pmd_present(pmd));
if (pmd_sect(pmd)) {
if (pmd_leaf(pmd)) {
pmd_clear(pmdp);
/*
* One TLBI should be sufficient here as the PMD_SIZE
* range is mapped with a single block entry.
*/
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
if (free_mapped)
if (free_mapped) {
/* CONT blocks are not supported in the vmemmap */
WARN_ON(pmd_cont(pmd));
flush_tlb_kernel_range(addr, addr + PMD_SIZE);
free_hotplug_page_range(pmd_page(pmd),
PMD_SIZE, altmap);
}
/* unmap_hotplug_range() flushes TLB for !free_mapped */
continue;
}
WARN_ON(!pmd_table(pmd));
@@ -1513,17 +1542,14 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
continue;
WARN_ON(!pud_present(pud));
if (pud_sect(pud)) {
if (pud_leaf(pud)) {
pud_clear(pudp);
/*
* One TLBI should be sufficient here as the PUD_SIZE
* range is mapped with a single block entry.
*/
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
if (free_mapped)
if (free_mapped) {
flush_tlb_kernel_range(addr, addr + PUD_SIZE);
free_hotplug_page_range(pud_page(pud),
PUD_SIZE, altmap);
}
/* unmap_hotplug_range() flushes TLB for !free_mapped */
continue;
}
WARN_ON(!pud_table(pud));
@@ -1553,6 +1579,7 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
static void unmap_hotplug_range(unsigned long addr, unsigned long end,
bool free_mapped, struct vmem_altmap *altmap)
{
unsigned long start = addr;
unsigned long next;
pgd_t *pgdp, pgd;
@@ -1574,6 +1601,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
WARN_ON(!pgd_present(pgd));
unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap);
} while (addr = next, addr < end);
if (!free_mapped)
flush_tlb_kernel_range(start, end);
}
static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
@@ -1627,7 +1657,7 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
if (pmd_none(pmd))
continue;
WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd));
WARN_ON(!pmd_present(pmd) || !pmd_table(pmd));
free_empty_pte_table(pmdp, addr, next, floor, ceiling);
} while (addr = next, addr < end);
@@ -1667,7 +1697,7 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
if (pud_none(pud))
continue;
WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud));
WARN_ON(!pud_present(pud) || !pud_table(pud));
free_empty_pmd_table(pudp, addr, next, floor, ceiling);
} while (addr = next, addr < end);
@@ -1763,7 +1793,7 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
{
vmemmap_verify((pte_t *)pmdp, node, addr, next);
return pmd_sect(READ_ONCE(*pmdp));
return pmd_leaf(READ_ONCE(*pmdp));
}
int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -1827,7 +1857,7 @@ void p4d_clear_huge(p4d_t *p4dp)
int pud_clear_huge(pud_t *pudp)
{
if (!pud_sect(READ_ONCE(*pudp)))
if (!pud_leaf(READ_ONCE(*pudp)))
return 0;
pud_clear(pudp);
return 1;
@@ -1835,7 +1865,7 @@ int pud_clear_huge(pud_t *pudp)
int pmd_clear_huge(pmd_t *pmdp)
{
if (!pmd_sect(READ_ONCE(*pmdp)))
if (!pmd_leaf(READ_ONCE(*pmdp)))
return 0;
pmd_clear(pmdp);
return 1;
@@ -2010,6 +2040,107 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
}
static bool addr_splits_kernel_leaf(unsigned long addr)
{
pgd_t *pgdp, pgd;
p4d_t *p4dp, p4d;
pud_t *pudp, pud;
pmd_t *pmdp, pmd;
pte_t *ptep, pte;
/*
* If the given address points at the start address of
* a possible leaf, we certainly won't split. Otherwise,
* check if we would actually split a leaf by traversing
* the page tables further.
*/
if (IS_ALIGNED(addr, PGDIR_SIZE))
return false;
pgdp = pgd_offset_k(addr);
pgd = pgdp_get(pgdp);
if (!pgd_present(pgd))
return false;
if (IS_ALIGNED(addr, P4D_SIZE))
return false;
p4dp = p4d_offset(pgdp, addr);
p4d = p4dp_get(p4dp);
if (!p4d_present(p4d))
return false;
if (IS_ALIGNED(addr, PUD_SIZE))
return false;
pudp = pud_offset(p4dp, addr);
pud = pudp_get(pudp);
if (!pud_present(pud))
return false;
if (pud_leaf(pud))
return true;
if (IS_ALIGNED(addr, CONT_PMD_SIZE))
return false;
pmdp = pmd_offset(pudp, addr);
pmd = pmdp_get(pmdp);
if (!pmd_present(pmd))
return false;
if (pmd_cont(pmd))
return true;
if (IS_ALIGNED(addr, PMD_SIZE))
return false;
if (pmd_leaf(pmd))
return true;
if (IS_ALIGNED(addr, CONT_PTE_SIZE))
return false;
ptep = pte_offset_kernel(pmdp, addr);
pte = __ptep_get(ptep);
if (!pte_present(pte))
return false;
if (pte_cont(pte))
return true;
return !IS_ALIGNED(addr, PAGE_SIZE);
}
static bool can_unmap_without_split(unsigned long pfn, unsigned long nr_pages)
{
unsigned long phys_start, phys_end, start, end;
phys_start = PFN_PHYS(pfn);
phys_end = phys_start + nr_pages * PAGE_SIZE;
/* PFN range's linear map edges are leaf entry aligned */
start = __phys_to_virt(phys_start);
end = __phys_to_virt(phys_end);
if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) {
pr_warn("[%lx %lx] splits a leaf entry in linear map\n",
phys_start, phys_end);
return false;
}
/* PFN range's vmemmap edges are leaf entry aligned */
BUILD_BUG_ON(!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP));
start = (unsigned long)pfn_to_page(pfn);
end = (unsigned long)pfn_to_page(pfn + nr_pages);
if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) {
pr_warn("[%lx %lx] splits a leaf entry in vmemmap\n",
phys_start, phys_end);
return false;
}
return true;
}
/*
* This memory hotplug notifier helps prevent boot memory from being
* inadvertently removed as it blocks pfn range offlining process in
@@ -2018,8 +2149,11 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
* In future if and when boot memory could be removed, this notifier
* should be dropped and free_hotplug_page_range() should handle any
* reserved pages allocated during boot.
*
* This also blocks any memory remove that would have caused a split
* of a leaf entry in the kernel linear or vmemmap mapping.
*/
static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
static int prevent_memory_remove_notifier(struct notifier_block *nb,
unsigned long action, void *data)
{
struct mem_section *ms;
@@ -2065,11 +2199,15 @@ static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
return NOTIFY_DONE;
}
}
if (!can_unmap_without_split(pfn, arg->nr_pages))
return NOTIFY_BAD;
return NOTIFY_OK;
}
static struct notifier_block prevent_bootmem_remove_nb = {
.notifier_call = prevent_bootmem_remove_notifier,
static struct notifier_block prevent_memory_remove_nb = {
.notifier_call = prevent_memory_remove_notifier,
};
/*
@@ -2119,7 +2257,7 @@ static void validate_bootmem_online(void)
}
}
static int __init prevent_bootmem_remove_init(void)
static int __init prevent_memory_remove_init(void)
{
int ret = 0;
@@ -2127,13 +2265,13 @@ static int __init prevent_bootmem_remove_init(void)
return ret;
validate_bootmem_online();
ret = register_memory_notifier(&prevent_bootmem_remove_nb);
ret = register_memory_notifier(&prevent_memory_remove_nb);
if (ret)
pr_err("%s: Notifier registration failed %d\n", __func__, ret);
return ret;
}
early_initcall(prevent_bootmem_remove_init);
early_initcall(prevent_memory_remove_init);
#endif
pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
@@ -2149,7 +2287,7 @@ pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
*/
if (pte_accessible(vma->vm_mm, pte) && pte_user_exec(pte))
__flush_tlb_range(vma, addr, nr * PAGE_SIZE,
PAGE_SIZE, true, 3);
PAGE_SIZE, 3, TLBF_NOWALKCACHE);
}
return pte;
@@ -2188,7 +2326,7 @@ void __cpu_replace_ttbr1(pgd_t *pgdp, bool cnp)
phys_addr_t ttbr1 = phys_to_ttbr(virt_to_phys(pgdp));
if (cnp)
ttbr1 |= TTBR_CNP_BIT;
ttbr1 |= TTBRx_EL1_CnP;
replace_phys = (void *)__pa_symbol(idmap_cpu_replace_ttbr1);

@@ -25,6 +25,11 @@ static ptdesc_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
{
struct page_change_data *masks = walk->private;
/*
* Some users clear and set bits which alias each other (e.g. PTE_NG and
* PTE_PRESENT_INVALID). It is therefore important that we always clear
* first then set.
*/
val &= ~(pgprot_val(masks->clear_mask));
val |= (pgprot_val(masks->set_mask));
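The ordering constraint explained in the new comment is easy to demonstrate: when a bit appears in both masks (as with aliasing bits like `PTE_NG`/`PTE_PRESENT_INVALID`), clear-then-set keeps it, while the opposite order would drop it. A standalone sketch on plain `uint64_t` masks (bit positions are illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* Apply the clear mask first, then the set mask, as the kernel's
 * set_pageattr_masks() does. A bit in both masks survives. */
static uint64_t clear_then_set(uint64_t val, uint64_t clear, uint64_t set)
{
	return (val & ~clear) | set;
}

/* The wrong order: a bit in both masks ends up cleared. */
static uint64_t set_then_clear(uint64_t val, uint64_t clear, uint64_t set)
{
	return (val | set) & ~clear;
}
```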
@@ -36,7 +41,7 @@ static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
{
pud_t val = pudp_get(pud);
if (pud_sect(val)) {
if (pud_leaf(val)) {
if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
return -EINVAL;
val = __pud(set_pageattr_masks(pud_val(val), walk));
@@ -52,7 +57,7 @@ static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
{
pmd_t val = pmdp_get(pmd);
if (pmd_sect(val)) {
if (pmd_leaf(val)) {
if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
return -EINVAL;
val = __pmd(set_pageattr_masks(pmd_val(val), walk));
@@ -132,11 +137,12 @@ static int __change_memory_common(unsigned long start, unsigned long size,
ret = update_range_prot(start, size, set_mask, clear_mask);
/*
* If the memory is being made valid without changing any other bits
* then a TLBI isn't required as a non-valid entry cannot be cached in
* the TLB.
* If the memory is being switched from present-invalid to valid without
* changing any other bits then a TLBI isn't required as a non-valid
* entry cannot be cached in the TLB.
*/
if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask))
if (pgprot_val(set_mask) != PTE_PRESENT_VALID_KERNEL ||
pgprot_val(clear_mask) != PTE_PRESENT_INVALID)
flush_tlb_kernel_range(start, start + size);
return ret;
}
@@ -237,18 +243,18 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
{
if (enable)
return __change_memory_common(addr, PAGE_SIZE * numpages,
__pgprot(PTE_VALID),
__pgprot(0));
__pgprot(PTE_PRESENT_VALID_KERNEL),
__pgprot(PTE_PRESENT_INVALID));
else
return __change_memory_common(addr, PAGE_SIZE * numpages,
__pgprot(0),
__pgprot(PTE_VALID));
__pgprot(PTE_PRESENT_INVALID),
__pgprot(PTE_PRESENT_VALID_KERNEL));
}
int set_direct_map_invalid_noflush(struct page *page)
{
pgprot_t clear_mask = __pgprot(PTE_VALID);
pgprot_t set_mask = __pgprot(0);
pgprot_t clear_mask = __pgprot(PTE_PRESENT_VALID_KERNEL);
pgprot_t set_mask = __pgprot(PTE_PRESENT_INVALID);
if (!can_set_direct_map())
return 0;
@@ -259,8 +265,8 @@ int set_direct_map_invalid_noflush(struct page *page)
int set_direct_map_default_noflush(struct page *page)
{
pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
pgprot_t clear_mask = __pgprot(PTE_RDONLY);
pgprot_t set_mask = __pgprot(PTE_PRESENT_VALID_KERNEL | PTE_WRITE);
pgprot_t clear_mask = __pgprot(PTE_PRESENT_INVALID | PTE_RDONLY);
if (!can_set_direct_map())
return 0;
@@ -296,8 +302,8 @@ static int __set_memory_enc_dec(unsigned long addr,
* entries or Synchronous External Aborts caused by RIPAS_EMPTY
*/
ret = __change_memory_common(addr, PAGE_SIZE * numpages,
__pgprot(set_prot),
__pgprot(clear_prot | PTE_VALID));
__pgprot(set_prot | PTE_PRESENT_INVALID),
__pgprot(clear_prot | PTE_PRESENT_VALID_KERNEL));
if (ret)
return ret;
@@ -311,8 +317,8 @@ static int __set_memory_enc_dec(unsigned long addr,
return ret;
return __change_memory_common(addr, PAGE_SIZE * numpages,
__pgprot(PTE_VALID),
__pgprot(0));
__pgprot(PTE_PRESENT_VALID_KERNEL),
__pgprot(PTE_PRESENT_INVALID));
}
static int realm_set_memory_encrypted(unsigned long addr, int numpages)
@@ -404,15 +410,15 @@ bool kernel_page_present(struct page *page)
pud = READ_ONCE(*pudp);
if (pud_none(pud))
return false;
if (pud_sect(pud))
return true;
if (pud_leaf(pud))
return pud_valid(pud);
pmdp = pmd_offset(pudp, addr);
pmd = READ_ONCE(*pmdp);
if (pmd_none(pmd))
return false;
if (pmd_sect(pmd))
return true;
if (pmd_leaf(pmd))
return pmd_valid(pmd);
ptep = pte_offset_kernel(pmdp, addr);
return pte_valid(__ptep_get(ptep));


@@ -31,36 +31,6 @@ static void *trans_alloc(struct trans_pgd_info *info)
return info->trans_alloc_page(info->trans_alloc_arg);
}
static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
{
pte_t pte = __ptep_get(src_ptep);
if (pte_valid(pte)) {
/*
* Resume will overwrite areas that may be marked
* read only (code, rodata). Clear the RDONLY bit from
* the temporary mappings we use during restore.
*/
__set_pte(dst_ptep, pte_mkwrite_novma(pte));
} else if (!pte_none(pte)) {
/*
* debug_pagealloc will have removed the PTE_VALID bit if
* the page isn't in use by the resume kernel. It may have
* been in use by the original kernel, in which case we need
* to put it back in our copy to do the restore.
*
* Other cases include kfence / vmalloc / memfd_secret which
* may call `set_direct_map_invalid_noflush()`.
*
* Before marking this entry valid, check that the pfn is
* actually mapped.
*/
BUG_ON(!pfn_valid(pte_pfn(pte)));
__set_pte(dst_ptep, pte_mkvalid(pte_mkwrite_novma(pte)));
}
}
static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp,
pmd_t *src_pmdp, unsigned long start, unsigned long end)
{
@@ -76,7 +46,11 @@ static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp,
src_ptep = pte_offset_kernel(src_pmdp, start);
do {
_copy_pte(dst_ptep, src_ptep, addr);
pte_t pte = __ptep_get(src_ptep);
if (pte_none(pte))
continue;
__set_pte(dst_ptep, pte_mkvalid_k(pte_mkwrite_novma(pte)));
} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end);
return 0;
@@ -109,8 +83,7 @@ static int copy_pmd(struct trans_pgd_info *info, pud_t *dst_pudp,
if (copy_pte(info, dst_pmdp, src_pmdp, addr, next))
return -ENOMEM;
} else {
set_pmd(dst_pmdp,
__pmd(pmd_val(pmd) & ~PMD_SECT_RDONLY));
set_pmd(dst_pmdp, pmd_mkvalid_k(pmd_mkwrite_novma(pmd)));
}
} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
@@ -145,8 +118,7 @@ static int copy_pud(struct trans_pgd_info *info, p4d_t *dst_p4dp,
if (copy_pmd(info, dst_pudp, src_pudp, addr, next))
return -ENOMEM;
} else {
set_pud(dst_pudp,
__pud(pud_val(pud) & ~PUD_SECT_RDONLY));
set_pud(dst_pudp, pud_mkvalid_k(pud_mkwrite_novma(pud)));
}
} while (dst_pudp++, src_pudp++, addr = next, addr != end);


@@ -3,7 +3,7 @@
gen := arch/$(ARCH)/include/generated
kapi := $(gen)/asm
kapisyshdr-y := cpucap-defs.h sysreg-defs.h
kapisyshdr-y := cpucap-defs.h kernel-hwcap.h sysreg-defs.h
kapi-hdrs-y := $(addprefix $(kapi)/, $(kapisyshdr-y))
@@ -18,11 +18,17 @@ kapi: $(kapi-hdrs-y)
quiet_cmd_gen_cpucaps = GEN $@
cmd_gen_cpucaps = mkdir -p $(dir $@); $(AWK) -f $(real-prereqs) > $@
quiet_cmd_gen_kernel_hwcap = GEN $@
cmd_gen_kernel_hwcap = mkdir -p $(dir $@); /bin/sh -e $(real-prereqs) > $@
quiet_cmd_gen_sysreg = GEN $@
cmd_gen_sysreg = mkdir -p $(dir $@); $(AWK) -f $(real-prereqs) > $@
$(kapi)/cpucap-defs.h: $(src)/gen-cpucaps.awk $(src)/cpucaps FORCE
$(call if_changed,gen_cpucaps)
$(kapi)/kernel-hwcap.h: $(src)/gen-kernel-hwcaps.sh $(srctree)/arch/arm64/include/uapi/asm/hwcap.h FORCE
$(call if_changed,gen_kernel_hwcap)
$(kapi)/sysreg-defs.h: $(src)/gen-sysreg.awk $(src)/sysreg FORCE
$(call if_changed,gen_sysreg)


@@ -48,6 +48,7 @@ HAS_LPA2
HAS_LSE_ATOMICS
HAS_LS64
HAS_LS64_V
HAS_LSUI
HAS_MOPS
HAS_NESTED_VIRT
HAS_BBML2_NOABORT


@@ -0,0 +1,23 @@
#!/bin/sh -e
# SPDX-License-Identifier: GPL-2.0
#
# gen-kernel-hwcap.sh - Generate kernel internal hwcap.h definitions
#
# Copyright 2026 Arm, Ltd.
if [ "$1" = "" ]; then
echo "$0: no filename specified"
exit 1
fi
echo "#ifndef __ASM_KERNEL_HWCAPS_H"
echo "#define __ASM_KERNEL_HWCAPS_H"
echo ""
echo "/* Generated file - do not edit */"
echo ""
grep -E '^#define HWCAP[0-9]*_[A-Z0-9_]+' $1 | \
sed 's/.*HWCAP\([0-9]*\)_\([A-Z0-9_]\+\).*/#define KERNEL_HWCAP_\2\t__khwcap\1_feature(\2)/'
echo ""
echo "#endif /* __ASM_KERNEL_HWCAPS_H */"


@@ -1496,6 +1496,7 @@ UnsignedEnum 27:24 B16B16
0b0000 NI
0b0001 IMP
0b0010 BFSCALE
0b0011 B16MM
EndEnum
UnsignedEnum 23:20 BF16
0b0000 NI
@@ -1522,6 +1523,7 @@ UnsignedEnum 3:0 SVEver
0b0001 SVE2
0b0010 SVE2p1
0b0011 SVE2p2
0b0100 SVE2p3
EndEnum
EndSysreg
@@ -1530,7 +1532,11 @@ UnsignedEnum 63 FA64
0b0 NI
0b1 IMP
EndEnum
Res0 62:61
Res0 62
UnsignedEnum 61 LUT6
0b0 NI
0b1 IMP
EndEnum
UnsignedEnum 60 LUTv2
0b0 NI
0b1 IMP
@@ -1540,6 +1546,7 @@ UnsignedEnum 59:56 SMEver
0b0001 SME2
0b0010 SME2p1
0b0011 SME2p2
0b0100 SME2p3
EndEnum
UnsignedEnum 55:52 I16I64
0b0000 NI
@@ -1654,7 +1661,13 @@ UnsignedEnum 26 F8MM4
0b0 NI
0b1 IMP
EndEnum
Res0 25:2
Res0 25:16
UnsignedEnum 15 F16MM2
0b0 NI
0b1 IMP
EndEnum
Res0 14:8
Raz 7:2
UnsignedEnum 1 F8E4M3
0b0 NI
0b1 IMP
@@ -1835,6 +1848,8 @@ EndEnum
UnsignedEnum 51:48 FHM
0b0000 NI
0b0001 IMP
0b0010 F16F32DOT
0b0011 F16F32MM
EndEnum
UnsignedEnum 47:44 DP
0b0000 NI
@@ -1976,6 +1991,7 @@ EndEnum
UnsignedEnum 59:56 LUT
0b0000 NI
0b0001 IMP
0b0010 LUT6
EndEnum
UnsignedEnum 55:52 CSSC
0b0000 NI
@@ -3655,11 +3671,15 @@ Field 3:0 BS
EndSysreg
Sysreg SMIDR_EL1 3 1 0 0 6
Res0 63:32
Res0 63:60
Field 59:56 NSMC
Field 55:52 HIP
Field 51:32 AFFINITY2
Field 31:24 IMPLEMENTER
Field 23:16 REVISION
Field 15 SMPS
Res0 14:12
Field 14:13 SH
Res0 12
Field 11:0 AFFINITY
EndSysreg
@@ -5172,6 +5192,14 @@ Field 31:16 PARTID_D
Field 15:0 PARTID_I
EndSysreg
Sysreg MPAMSM_EL1 3 0 10 5 3
Res0 63:48
Field 47:40 PMG_D
Res0 39:32
Field 31:16 PARTID_D
Res0 15:0
EndSysreg
Sysreg ISR_EL1 3 0 12 1 0
Res0 63:11
Field 10 IS


@@ -36,7 +36,7 @@ static int agdi_sdei_probe(struct platform_device *pdev,
err = sdei_event_register(adata->sdei_event, agdi_sdei_handler, pdev);
if (err) {
dev_err(&pdev->dev, "Failed to register for SDEI event %d",
dev_err(&pdev->dev, "Failed to register for SDEI event %d\n",
adata->sdei_event);
return err;
}


@@ -311,4 +311,18 @@ config MARVELL_PEM_PMU
Enable support for PCIe Interface performance monitoring
on Marvell platform.
config NVIDIA_TEGRA410_CMEM_LATENCY_PMU
tristate "NVIDIA Tegra410 CPU Memory Latency PMU"
depends on ARM64 && ACPI
help
Enable perf support for monitoring the CPU memory latency counters
on the NVIDIA Tegra410 SoC.
config NVIDIA_TEGRA410_C2C_PMU
tristate "NVIDIA Tegra410 C2C PMU"
depends on ARM64 && ACPI
help
Enable perf support for the counters in the NVIDIA C2C interface of
the NVIDIA Tegra410 SoC.
endmenu


@@ -35,3 +35,5 @@ obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
obj-$(CONFIG_CXL_PMU) += cxl_pmu.o
obj-$(CONFIG_NVIDIA_TEGRA410_CMEM_LATENCY_PMU) += nvidia_t410_cmem_latency_pmu.o
obj-$(CONFIG_NVIDIA_TEGRA410_C2C_PMU) += nvidia_t410_c2c_pmu.o


@@ -2132,6 +2132,8 @@ static void arm_cmn_init_dtm(struct arm_cmn_dtm *dtm, struct arm_cmn_node *xp, i
static int arm_cmn_init_dtc(struct arm_cmn *cmn, struct arm_cmn_node *dn, int idx)
{
struct arm_cmn_dtc *dtc = cmn->dtc + idx;
const struct resource *cfg;
resource_size_t base, size;
dtc->pmu_base = dn->pmu_base;
dtc->base = dtc->pmu_base - arm_cmn_pmu_offset(cmn, dn);
@@ -2139,6 +2141,13 @@ static int arm_cmn_init_dtc(struct arm_cmn *cmn, struct arm_cmn_node *dn, int id
if (dtc->irq < 0)
return dtc->irq;
cfg = platform_get_resource(to_platform_device(cmn->dev), IORESOURCE_MEM, 0);
base = dtc->base - cmn->base + cfg->start;
size = cmn->part == PART_CMN600 ? SZ_16K : SZ_64K;
if (!devm_request_mem_region(cmn->dev, base, size, dev_name(cmn->dev)))
return dev_err_probe(cmn->dev, -EBUSY,
"Failed to request DTC region 0x%pa\n", &base);
writel_relaxed(CMN_DT_DTC_CTL_DT_EN, dtc->base + CMN_DT_DTC_CTL);
writel_relaxed(CMN_DT_PMCR_PMU_EN | CMN_DT_PMCR_OVFL_INTR_EN, CMN_DT_PMCR(dtc));
writeq_relaxed(0, CMN_DT_PMCCNTR(dtc));
@@ -2525,43 +2534,26 @@ static int arm_cmn_discover(struct arm_cmn *cmn, unsigned int rgn_offset)
return 0;
}
static int arm_cmn600_acpi_probe(struct platform_device *pdev, struct arm_cmn *cmn)
{
struct resource *cfg, *root;
cfg = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!cfg)
return -EINVAL;
root = platform_get_resource(pdev, IORESOURCE_MEM, 1);
if (!root)
return -EINVAL;
if (!resource_contains(cfg, root))
swap(cfg, root);
/*
* Note that devm_ioremap_resource() is dumb and won't let the platform
* device claim cfg when the ACPI companion device has already claimed
* root within it. But since they *are* already both claimed in the
* appropriate name, we don't really need to do it again here anyway.
*/
cmn->base = devm_ioremap(cmn->dev, cfg->start, resource_size(cfg));
if (!cmn->base)
return -ENOMEM;
return root->start - cfg->start;
}
static int arm_cmn600_of_probe(struct device_node *np)
static int arm_cmn_get_root(struct arm_cmn *cmn, const struct resource *cfg)
{
const struct device_node *np = cmn->dev->of_node;
const struct resource *root;
u32 rootnode;
return of_property_read_u32(np, "arm,root-node", &rootnode) ?: rootnode;
if (cmn->part != PART_CMN600)
return 0;
if (np)
return of_property_read_u32(np, "arm,root-node", &rootnode) ?: rootnode;
root = platform_get_resource(to_platform_device(cmn->dev), IORESOURCE_MEM, 1);
return root ? root->start - cfg->start : -EINVAL;
}
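The `of_property_read_u32(...) ?: rootnode` expression above uses GCC's conditional-with-omitted-middle-operand extension: `err ?: val` evaluates to `err` when it is nonzero (the error path) and to `val` otherwise. A hedged stand-alone sketch, where `read_u32()` is an invented stand-in for the OF helper, not the real API:

```c
#include <assert.h>

/* Stand-in for of_property_read_u32(): returns 0 on success and fills
 * *out, or a negative errno on failure. */
static int read_u32(int fail, unsigned int *out)
{
	if (fail)
		return -22; /* -EINVAL */
	*out = 42;
	return 0;
}

/* One expression propagates either the error code or the value read,
 * mirroring the arm_cmn_get_root() pattern. */
static int get_root_demo(int fail)
{
	unsigned int rootnode = 0;

	return read_u32(fail, &rootnode) ?: rootnode;
}
```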
static int arm_cmn_probe(struct platform_device *pdev)
{
struct arm_cmn *cmn;
const struct resource *cfg;
const char *name;
static atomic_t id;
int err, rootnode, this_id;
@@ -2575,16 +2567,16 @@ static int arm_cmn_probe(struct platform_device *pdev)
cmn->cpu = cpumask_local_spread(0, dev_to_node(cmn->dev));
platform_set_drvdata(pdev, cmn);
if (cmn->part == PART_CMN600 && has_acpi_companion(cmn->dev)) {
rootnode = arm_cmn600_acpi_probe(pdev, cmn);
} else {
rootnode = 0;
cmn->base = devm_platform_ioremap_resource(pdev, 0);
if (IS_ERR(cmn->base))
return PTR_ERR(cmn->base);
if (cmn->part == PART_CMN600)
rootnode = arm_cmn600_of_probe(pdev->dev.of_node);
}
cfg = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!cfg)
return -EINVAL;
/* Map the whole region now, claim the DTCs once we've found them */
cmn->base = devm_ioremap(cmn->dev, cfg->start, resource_size(cfg));
if (!cmn->base)
return -ENOMEM;
rootnode = arm_cmn_get_root(cmn, cfg);
if (rootnode < 0)
return rootnode;


@@ -16,7 +16,7 @@
* The user should refer to the vendor technical documentation to get details
* about the supported events.
*
* Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*
*/
@@ -1134,6 +1134,23 @@ static int arm_cspmu_acpi_get_cpus(struct arm_cspmu *cspmu)
return 0;
}
struct acpi_device *arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu)
{
char hid[16] = {};
char uid[16] = {};
const struct acpi_apmt_node *apmt_node;
apmt_node = arm_cspmu_apmt_node(cspmu->dev);
if (!apmt_node || apmt_node->type != ACPI_APMT_NODE_TYPE_ACPI)
return NULL;
memcpy(hid, &apmt_node->inst_primary, sizeof(apmt_node->inst_primary));
snprintf(uid, sizeof(uid), "%u", apmt_node->inst_secondary);
return acpi_dev_get_first_match_dev(hid, uid, -1);
}
EXPORT_SYMBOL_GPL(arm_cspmu_acpi_dev_get);
#else
static int arm_cspmu_acpi_get_cpus(struct arm_cspmu *cspmu)
{


@@ -1,13 +1,14 @@
/* SPDX-License-Identifier: GPL-2.0
*
* ARM CoreSight Architecture PMU driver.
* Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*
*/
#ifndef __ARM_CSPMU_H__
#define __ARM_CSPMU_H__
#include <linux/acpi.h>
#include <linux/bitfield.h>
#include <linux/cpumask.h>
#include <linux/device.h>
@@ -255,4 +256,18 @@ int arm_cspmu_impl_register(const struct arm_cspmu_impl_match *impl_match);
/* Unregister vendor backend. */
void arm_cspmu_impl_unregister(const struct arm_cspmu_impl_match *impl_match);
#if defined(CONFIG_ACPI) && defined(CONFIG_ARM64)
/**
* Get ACPI device associated with the PMU.
* The caller is responsible for calling acpi_dev_put() on the returned device.
*/
struct acpi_device *arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu);
#else
static inline struct acpi_device *
arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu)
{
return NULL;
}
#endif
#endif /* __ARM_CSPMU_H__ */


@@ -1,6 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*
*/
@@ -8,6 +8,7 @@
#include <linux/io.h>
#include <linux/module.h>
#include <linux/property.h>
#include <linux/topology.h>
#include "arm_cspmu.h"
@@ -21,6 +22,44 @@
#define NV_CNVL_PORT_COUNT 4ULL
#define NV_CNVL_FILTER_ID_MASK GENMASK_ULL(NV_CNVL_PORT_COUNT - 1, 0)
#define NV_UCF_SRC_COUNT 3ULL
#define NV_UCF_DST_COUNT 4ULL
#define NV_UCF_FILTER_ID_MASK GENMASK_ULL(11, 0)
#define NV_UCF_FILTER_SRC GENMASK_ULL(2, 0)
#define NV_UCF_FILTER_DST GENMASK_ULL(11, 8)
#define NV_UCF_FILTER_DEFAULT (NV_UCF_FILTER_SRC | NV_UCF_FILTER_DST)
#define NV_PCIE_V2_PORT_COUNT 8ULL
#define NV_PCIE_V2_FILTER_ID_MASK GENMASK_ULL(24, 0)
#define NV_PCIE_V2_FILTER_PORT GENMASK_ULL(NV_PCIE_V2_PORT_COUNT - 1, 0)
#define NV_PCIE_V2_FILTER_BDF_VAL GENMASK_ULL(23, NV_PCIE_V2_PORT_COUNT)
#define NV_PCIE_V2_FILTER_BDF_EN BIT(24)
#define NV_PCIE_V2_FILTER_BDF_VAL_EN GENMASK_ULL(24, NV_PCIE_V2_PORT_COUNT)
#define NV_PCIE_V2_FILTER_DEFAULT NV_PCIE_V2_FILTER_PORT
#define NV_PCIE_V2_DST_COUNT 5ULL
#define NV_PCIE_V2_FILTER2_ID_MASK GENMASK_ULL(4, 0)
#define NV_PCIE_V2_FILTER2_DST GENMASK_ULL(NV_PCIE_V2_DST_COUNT - 1, 0)
#define NV_PCIE_V2_FILTER2_DEFAULT NV_PCIE_V2_FILTER2_DST
#define NV_PCIE_TGT_PORT_COUNT 8ULL
#define NV_PCIE_TGT_EV_TYPE_CC 0x4
#define NV_PCIE_TGT_EV_TYPE_COUNT 3ULL
#define NV_PCIE_TGT_EV_TYPE_MASK GENMASK_ULL(NV_PCIE_TGT_EV_TYPE_COUNT - 1, 0)
#define NV_PCIE_TGT_FILTER2_MASK GENMASK_ULL(NV_PCIE_TGT_PORT_COUNT, 0)
#define NV_PCIE_TGT_FILTER2_PORT GENMASK_ULL(NV_PCIE_TGT_PORT_COUNT - 1, 0)
#define NV_PCIE_TGT_FILTER2_ADDR_EN BIT(NV_PCIE_TGT_PORT_COUNT)
#define NV_PCIE_TGT_FILTER2_ADDR GENMASK_ULL(15, NV_PCIE_TGT_PORT_COUNT)
#define NV_PCIE_TGT_FILTER2_DEFAULT NV_PCIE_TGT_FILTER2_PORT
#define NV_PCIE_TGT_ADDR_COUNT 8ULL
#define NV_PCIE_TGT_ADDR_STRIDE 20
#define NV_PCIE_TGT_ADDR_CTRL 0xD38
#define NV_PCIE_TGT_ADDR_BASE_LO 0xD3C
#define NV_PCIE_TGT_ADDR_BASE_HI 0xD40
#define NV_PCIE_TGT_ADDR_MASK_LO 0xD44
#define NV_PCIE_TGT_ADDR_MASK_HI 0xD48
#define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0)
#define NV_PRODID_MASK (PMIIDR_PRODUCTID | PMIIDR_VARIANT | PMIIDR_REVISION)
@@ -124,6 +163,55 @@ static struct attribute *mcf_pmu_event_attrs[] = {
NULL,
};
static struct attribute *ucf_pmu_event_attrs[] = {
ARM_CSPMU_EVENT_ATTR(bus_cycles, 0x1D),
ARM_CSPMU_EVENT_ATTR(slc_allocate, 0xF0),
ARM_CSPMU_EVENT_ATTR(slc_wb, 0xF3),
ARM_CSPMU_EVENT_ATTR(slc_refill_rd, 0x109),
ARM_CSPMU_EVENT_ATTR(slc_refill_wr, 0x10A),
ARM_CSPMU_EVENT_ATTR(slc_hit_rd, 0x119),
ARM_CSPMU_EVENT_ATTR(slc_access_dataless, 0x183),
ARM_CSPMU_EVENT_ATTR(slc_access_atomic, 0x184),
ARM_CSPMU_EVENT_ATTR(slc_access_rd, 0x111),
ARM_CSPMU_EVENT_ATTR(slc_access_wr, 0x112),
ARM_CSPMU_EVENT_ATTR(slc_bytes_rd, 0x113),
ARM_CSPMU_EVENT_ATTR(slc_bytes_wr, 0x114),
ARM_CSPMU_EVENT_ATTR(mem_access_rd, 0x121),
ARM_CSPMU_EVENT_ATTR(mem_access_wr, 0x122),
ARM_CSPMU_EVENT_ATTR(mem_bytes_rd, 0x123),
ARM_CSPMU_EVENT_ATTR(mem_bytes_wr, 0x124),
ARM_CSPMU_EVENT_ATTR(local_snoop, 0x180),
ARM_CSPMU_EVENT_ATTR(ext_snp_access, 0x181),
ARM_CSPMU_EVENT_ATTR(ext_snp_evict, 0x182),
ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
NULL
};
static struct attribute *pcie_v2_pmu_event_attrs[] = {
ARM_CSPMU_EVENT_ATTR(rd_bytes, 0x0),
ARM_CSPMU_EVENT_ATTR(wr_bytes, 0x1),
ARM_CSPMU_EVENT_ATTR(rd_req, 0x2),
ARM_CSPMU_EVENT_ATTR(wr_req, 0x3),
ARM_CSPMU_EVENT_ATTR(rd_cum_outs, 0x4),
ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
NULL
};
static struct attribute *pcie_tgt_pmu_event_attrs[] = {
ARM_CSPMU_EVENT_ATTR(rd_bytes, 0x0),
ARM_CSPMU_EVENT_ATTR(wr_bytes, 0x1),
ARM_CSPMU_EVENT_ATTR(rd_req, 0x2),
ARM_CSPMU_EVENT_ATTR(wr_req, 0x3),
ARM_CSPMU_EVENT_ATTR(cycles, NV_PCIE_TGT_EV_TYPE_CC),
NULL
};
static struct attribute *generic_pmu_event_attrs[] = {
ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
NULL,
@@ -152,6 +240,40 @@ static struct attribute *cnvlink_pmu_format_attrs[] = {
NULL,
};
static struct attribute *ucf_pmu_format_attrs[] = {
ARM_CSPMU_FORMAT_EVENT_ATTR,
ARM_CSPMU_FORMAT_ATTR(src_loc_noncpu, "config1:0"),
ARM_CSPMU_FORMAT_ATTR(src_loc_cpu, "config1:1"),
ARM_CSPMU_FORMAT_ATTR(src_rem, "config1:2"),
ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config1:8"),
ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config1:9"),
ARM_CSPMU_FORMAT_ATTR(dst_loc_other, "config1:10"),
ARM_CSPMU_FORMAT_ATTR(dst_rem, "config1:11"),
NULL
};
static struct attribute *pcie_v2_pmu_format_attrs[] = {
ARM_CSPMU_FORMAT_EVENT_ATTR,
ARM_CSPMU_FORMAT_ATTR(src_rp_mask, "config1:0-7"),
ARM_CSPMU_FORMAT_ATTR(src_bdf, "config1:8-23"),
ARM_CSPMU_FORMAT_ATTR(src_bdf_en, "config1:24"),
ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config2:0"),
ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config2:1"),
ARM_CSPMU_FORMAT_ATTR(dst_loc_pcie_p2p, "config2:2"),
ARM_CSPMU_FORMAT_ATTR(dst_loc_pcie_cxl, "config2:3"),
ARM_CSPMU_FORMAT_ATTR(dst_rem, "config2:4"),
NULL
};
static struct attribute *pcie_tgt_pmu_format_attrs[] = {
ARM_CSPMU_FORMAT_ATTR(event, "config:0-2"),
ARM_CSPMU_FORMAT_ATTR(dst_rp_mask, "config:3-10"),
ARM_CSPMU_FORMAT_ATTR(dst_addr_en, "config:11"),
ARM_CSPMU_FORMAT_ATTR(dst_addr_base, "config1:0-63"),
ARM_CSPMU_FORMAT_ATTR(dst_addr_mask, "config2:0-63"),
NULL
};
static struct attribute *generic_pmu_format_attrs[] = {
ARM_CSPMU_FORMAT_EVENT_ATTR,
ARM_CSPMU_FORMAT_FILTER_ATTR,
@@ -183,6 +305,32 @@ nv_cspmu_get_name(const struct arm_cspmu *cspmu)
return ctx->name;
}
#if defined(CONFIG_ACPI) && defined(CONFIG_ARM64)
static int nv_cspmu_get_inst_id(const struct arm_cspmu *cspmu, u32 *id)
{
struct fwnode_handle *fwnode;
struct acpi_device *adev;
int ret;
adev = arm_cspmu_acpi_dev_get(cspmu);
if (!adev)
return -ENODEV;
fwnode = acpi_fwnode_handle(adev);
ret = fwnode_property_read_u32(fwnode, "instance_id", id);
if (ret)
dev_err(cspmu->dev, "Failed to get instance ID\n");
acpi_dev_put(adev);
return ret;
}
#else
static int nv_cspmu_get_inst_id(const struct arm_cspmu *cspmu, u32 *id)
{
return -EINVAL;
}
#endif
static u32 nv_cspmu_event_filter(const struct perf_event *event)
{
const struct nv_cspmu_ctx *ctx =
@@ -228,6 +376,20 @@ static void nv_cspmu_set_ev_filter(struct arm_cspmu *cspmu,
}
}
static void nv_cspmu_reset_ev_filter(struct arm_cspmu *cspmu,
const struct perf_event *event)
{
const struct nv_cspmu_ctx *ctx =
to_nv_cspmu_ctx(to_arm_cspmu(event->pmu));
const u32 offset = 4 * event->hw.idx;
if (ctx->get_filter)
writel(0, cspmu->base0 + PMEVFILTR + offset);
if (ctx->get_filter2)
writel(0, cspmu->base0 + PMEVFILT2R + offset);
}
static void nv_cspmu_set_cc_filter(struct arm_cspmu *cspmu,
const struct perf_event *event)
{
@@ -236,10 +398,386 @@ static void nv_cspmu_set_cc_filter(struct arm_cspmu *cspmu,
writel(filter, cspmu->base0 + PMCCFILTR);
}
static u32 ucf_pmu_event_filter(const struct perf_event *event)
{
u32 ret, filter, src, dst;
filter = nv_cspmu_event_filter(event);
/* Monitor all sources if none is selected. */
src = FIELD_GET(NV_UCF_FILTER_SRC, filter);
if (src == 0)
src = GENMASK_ULL(NV_UCF_SRC_COUNT - 1, 0);
/* Monitor all destinations if none is selected. */
dst = FIELD_GET(NV_UCF_FILTER_DST, filter);
if (dst == 0)
dst = GENMASK_ULL(NV_UCF_DST_COUNT - 1, 0);
ret = FIELD_PREP(NV_UCF_FILTER_SRC, src);
ret |= FIELD_PREP(NV_UCF_FILTER_DST, dst);
return ret;
}
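The "monitor everything if nothing is selected" defaulting above can be sketched with hand-rolled stand-ins for the kernel's FIELD_GET()/FIELD_PREP() macros. The masks mirror NV_UCF_FILTER_SRC (bits 2:0) and NV_UCF_FILTER_DST (bits 11:8); the macro definitions are simplified approximations, not the bitfield.h originals.

```c
#include <assert.h>

#define SRC_MASK 0x007UL	/* bits 2:0, three sources */
#define DST_MASK 0xF00UL	/* bits 11:8, four destinations */

/* (mask & -mask) isolates the mask's lowest set bit, which acts as the
 * field's scaling factor for extraction and insertion. */
#define FIELD_GET_DEMO(mask, val)  (((val) & (mask)) / ((mask) & -(mask)))
#define FIELD_PREP_DEMO(mask, val) (((val) * ((mask) & -(mask))) & (mask))

/* An empty sub-field is widened to its full width before the filter is
 * reassembled, so "no selection" means "select all". */
static unsigned long ucf_filter_demo(unsigned long filter)
{
	unsigned long src = FIELD_GET_DEMO(SRC_MASK, filter);
	unsigned long dst = FIELD_GET_DEMO(DST_MASK, filter);

	if (src == 0)
		src = 0x7;	/* all 3 sources */
	if (dst == 0)
		dst = 0xF;	/* all 4 destinations */

	return FIELD_PREP_DEMO(SRC_MASK, src) | FIELD_PREP_DEMO(DST_MASK, dst);
}
```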
static u32 pcie_v2_pmu_bdf_val_en(u32 filter)
{
const u32 bdf_en = FIELD_GET(NV_PCIE_V2_FILTER_BDF_EN, filter);
/* Returns both the BDF value and the enable bit if BDF filtering is enabled. */
if (bdf_en)
return FIELD_GET(NV_PCIE_V2_FILTER_BDF_VAL_EN, filter);
/* Ignore the BDF value if BDF filter is not enabled. */
return 0;
}
static u32 pcie_v2_pmu_event_filter(const struct perf_event *event)
{
u32 filter, lead_filter, lead_bdf;
struct perf_event *leader;
const struct nv_cspmu_ctx *ctx =
to_nv_cspmu_ctx(to_arm_cspmu(event->pmu));
filter = event->attr.config1 & ctx->filter_mask;
if (filter != 0)
return filter;
leader = event->group_leader;
/* Use leader's filter value if its BDF filtering is enabled. */
if (event != leader) {
lead_filter = pcie_v2_pmu_event_filter(leader);
lead_bdf = pcie_v2_pmu_bdf_val_en(lead_filter);
if (lead_bdf != 0)
return lead_filter;
}
/* Otherwise, return default filter value. */
return ctx->filter_default_val;
}
static int pcie_v2_pmu_validate_event(struct arm_cspmu *cspmu,
struct perf_event *new_ev)
{
/*
* Make sure the events are using the same BDF filter, since the PCIE-SRC
* PMU only supports one common BDF filter setting for all of the counters.
*/
int idx;
u32 new_filter, new_rp, new_bdf, new_lead_filter, new_lead_bdf;
struct perf_event *new_leader;
if (cspmu->impl.ops.is_cycle_counter_event(new_ev))
return 0;
new_leader = new_ev->group_leader;
new_filter = pcie_v2_pmu_event_filter(new_ev);
new_lead_filter = pcie_v2_pmu_event_filter(new_leader);
new_bdf = pcie_v2_pmu_bdf_val_en(new_filter);
new_lead_bdf = pcie_v2_pmu_bdf_val_en(new_lead_filter);
new_rp = FIELD_GET(NV_PCIE_V2_FILTER_PORT, new_filter);
if (new_rp != 0 && new_bdf != 0) {
dev_err(cspmu->dev,
"RP and BDF filtering are mutually exclusive\n");
return -EINVAL;
}
if (new_bdf != new_lead_bdf) {
dev_err(cspmu->dev,
"sibling and leader BDF value should be equal\n");
return -EINVAL;
}
/* Compare BDF filter on existing events. */
idx = find_first_bit(cspmu->hw_events.used_ctrs,
cspmu->cycle_counter_logical_idx);
if (idx != cspmu->cycle_counter_logical_idx) {
struct perf_event *leader = cspmu->hw_events.events[idx]->group_leader;
const u32 lead_filter = pcie_v2_pmu_event_filter(leader);
const u32 lead_bdf = pcie_v2_pmu_bdf_val_en(lead_filter);
if (new_lead_bdf != lead_bdf) {
dev_err(cspmu->dev, "only one BDF value is supported\n");
return -EINVAL;
}
}
return 0;
}
struct pcie_tgt_addr_filter {
u32 refcount;
u64 base;
u64 mask;
};
struct pcie_tgt_data {
struct pcie_tgt_addr_filter addr_filter[NV_PCIE_TGT_ADDR_COUNT];
void __iomem *addr_filter_reg;
};
#if defined(CONFIG_ACPI) && defined(CONFIG_ARM64)
static int pcie_tgt_init_data(struct arm_cspmu *cspmu)
{
int ret;
struct acpi_device *adev;
struct pcie_tgt_data *data;
struct list_head resource_list;
struct resource_entry *rentry;
struct nv_cspmu_ctx *ctx = to_nv_cspmu_ctx(cspmu);
struct device *dev = cspmu->dev;
data = devm_kzalloc(dev, sizeof(struct pcie_tgt_data), GFP_KERNEL);
if (!data)
return -ENOMEM;
adev = arm_cspmu_acpi_dev_get(cspmu);
if (!adev) {
dev_err(dev, "failed to get associated PCIE-TGT device\n");
return -ENODEV;
}
INIT_LIST_HEAD(&resource_list);
ret = acpi_dev_get_memory_resources(adev, &resource_list);
if (ret < 0) {
dev_err(dev, "failed to get PCIE-TGT device memory resources\n");
acpi_dev_put(adev);
return ret;
}
rentry = list_first_entry_or_null(
&resource_list, struct resource_entry, node);
if (rentry) {
data->addr_filter_reg = devm_ioremap_resource(dev, rentry->res);
ret = 0;
}
if (IS_ERR(data->addr_filter_reg)) {
dev_err(dev, "failed to get address filter resource\n");
ret = PTR_ERR(data->addr_filter_reg);
}
acpi_dev_free_resource_list(&resource_list);
acpi_dev_put(adev);
ctx->data = data;
return ret;
}
#else
static int pcie_tgt_init_data(struct arm_cspmu *cspmu)
{
return -ENODEV;
}
#endif
static struct pcie_tgt_data *pcie_tgt_get_data(struct arm_cspmu *cspmu)
{
struct nv_cspmu_ctx *ctx = to_nv_cspmu_ctx(cspmu);
return ctx->data;
}
/* Find the first available address filter slot. */
static int pcie_tgt_find_addr_idx(struct arm_cspmu *cspmu, u64 base, u64 mask,
bool is_reset)
{
int i;
struct pcie_tgt_data *data = pcie_tgt_get_data(cspmu);
for (i = 0; i < NV_PCIE_TGT_ADDR_COUNT; i++) {
if (!is_reset && data->addr_filter[i].refcount == 0)
return i;
if (data->addr_filter[i].base == base &&
data->addr_filter[i].mask == mask)
return i;
}
return -ENODEV;
}
static u32 pcie_tgt_pmu_event_filter(const struct perf_event *event)
{
u32 filter;
filter = (event->attr.config >> NV_PCIE_TGT_EV_TYPE_COUNT) &
NV_PCIE_TGT_FILTER2_MASK;
return filter;
}
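The PCIE-TGT config encoding can be illustrated in isolation: bits 2:0 of `config` carry the event type, and the next nine bits (eight port-select bits plus the address-filter enable, matching NV_PCIE_TGT_FILTER2_MASK = GENMASK(8, 0)) are shifted down to form the FILT2 value. A minimal sketch with literal masks standing in for the driver's macros:

```c
#include <assert.h>

#define EV_TYPE_BITS 3
#define EV_TYPE_MASK 0x7ULL	/* bits 2:0: event type */
#define FILT2_MASK   0x1FFULL	/* 8 port bits + addr-filter enable */

/* Extract the event type from the low bits of config. */
static unsigned int tgt_event_type(unsigned long long config)
{
	return config & EV_TYPE_MASK;
}

/* Extract the filter bits that sit immediately above the event type. */
static unsigned int tgt_filter2(unsigned long long config)
{
	return (config >> EV_TYPE_BITS) & FILT2_MASK;
}
```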
static bool pcie_tgt_pmu_addr_en(const struct perf_event *event)
{
u32 filter = pcie_tgt_pmu_event_filter(event);
return FIELD_GET(NV_PCIE_TGT_FILTER2_ADDR_EN, filter) != 0;
}
static u32 pcie_tgt_pmu_port_filter(const struct perf_event *event)
{
u32 filter = pcie_tgt_pmu_event_filter(event);
return FIELD_GET(NV_PCIE_TGT_FILTER2_PORT, filter);
}
static u64 pcie_tgt_pmu_dst_addr_base(const struct perf_event *event)
{
return event->attr.config1;
}
static u64 pcie_tgt_pmu_dst_addr_mask(const struct perf_event *event)
{
return event->attr.config2;
}
static int pcie_tgt_pmu_validate_event(struct arm_cspmu *cspmu,
struct perf_event *new_ev)
{
u64 base, mask;
int idx;
if (!pcie_tgt_pmu_addr_en(new_ev))
return 0;
/* Make sure there is a slot available for the address filter. */
base = pcie_tgt_pmu_dst_addr_base(new_ev);
mask = pcie_tgt_pmu_dst_addr_mask(new_ev);
idx = pcie_tgt_find_addr_idx(cspmu, base, mask, false);
if (idx < 0)
return -EINVAL;
return 0;
}
static void pcie_tgt_pmu_config_addr_filter(struct arm_cspmu *cspmu,
bool en, u64 base, u64 mask, int idx)
{
struct pcie_tgt_data *data;
struct pcie_tgt_addr_filter *filter;
void __iomem *filter_reg;
data = pcie_tgt_get_data(cspmu);
filter = &data->addr_filter[idx];
filter_reg = data->addr_filter_reg + (idx * NV_PCIE_TGT_ADDR_STRIDE);
if (en) {
filter->refcount++;
if (filter->refcount == 1) {
filter->base = base;
filter->mask = mask;
writel(lower_32_bits(base), filter_reg + NV_PCIE_TGT_ADDR_BASE_LO);
writel(upper_32_bits(base), filter_reg + NV_PCIE_TGT_ADDR_BASE_HI);
writel(lower_32_bits(mask), filter_reg + NV_PCIE_TGT_ADDR_MASK_LO);
writel(upper_32_bits(mask), filter_reg + NV_PCIE_TGT_ADDR_MASK_HI);
writel(1, filter_reg + NV_PCIE_TGT_ADDR_CTRL);
}
} else {
filter->refcount--;
if (filter->refcount == 0) {
writel(0, filter_reg + NV_PCIE_TGT_ADDR_CTRL);
writel(0, filter_reg + NV_PCIE_TGT_ADDR_BASE_LO);
writel(0, filter_reg + NV_PCIE_TGT_ADDR_BASE_HI);
writel(0, filter_reg + NV_PCIE_TGT_ADDR_MASK_LO);
writel(0, filter_reg + NV_PCIE_TGT_ADDR_MASK_HI);
filter->base = 0;
filter->mask = 0;
}
}
}
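The refcounted sharing above can be modeled without hardware: slot programming happens only on the 0 -> 1 refcount transition and teardown on the last put, so events with the same (base, mask) pair share one slot. This is a toy model with MMIO replaced by a plain array; the names are illustrative, not the driver's.

```c
#include <assert.h>

#define SLOT_COUNT 8

struct slot {
	unsigned int refcount;
	unsigned long long base, mask;
};

static struct slot slots[SLOT_COUNT];
static unsigned long long regs[SLOT_COUNT];	/* stands in for MMIO */

/* Share an existing slot with matching (base, mask), or claim and
 * program the first free one. Returns the slot index or -1. */
static int slot_get(unsigned long long base, unsigned long long mask)
{
	int i, free_idx = -1;

	for (i = 0; i < SLOT_COUNT; i++) {
		if (slots[i].refcount &&
		    slots[i].base == base && slots[i].mask == mask)
			break;			/* share an existing slot */
		if (free_idx < 0 && !slots[i].refcount)
			free_idx = i;		/* remember first empty slot */
	}
	if (i == SLOT_COUNT) {
		if (free_idx < 0)
			return -1;		/* no slot available */
		i = free_idx;
		slots[i].base = base;
		slots[i].mask = mask;
		regs[i] = base & mask;		/* "program" the hardware once */
	}
	slots[i].refcount++;
	return i;
}

/* Tear the slot down only when the last user drops it. */
static void slot_put(int i)
{
	if (--slots[i].refcount == 0)
		regs[i] = 0;
}
```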
static void pcie_tgt_pmu_set_ev_filter(struct arm_cspmu *cspmu,
const struct perf_event *event)
{
bool addr_filter_en;
int idx;
u32 filter2_val, filter2_offset, port_filter;
u64 base, mask;
filter2_val = 0;
filter2_offset = PMEVFILT2R + (4 * event->hw.idx);
addr_filter_en = pcie_tgt_pmu_addr_en(event);
if (addr_filter_en) {
base = pcie_tgt_pmu_dst_addr_base(event);
mask = pcie_tgt_pmu_dst_addr_mask(event);
idx = pcie_tgt_find_addr_idx(cspmu, base, mask, false);
if (idx < 0) {
dev_err(cspmu->dev,
"Unable to find a slot for address filtering\n");
writel(0, cspmu->base0 + filter2_offset);
return;
}
/* Configure address range filter registers. */
pcie_tgt_pmu_config_addr_filter(cspmu, true, base, mask, idx);
/* Configure the counter to use the selected address filter slot. */
filter2_val |= FIELD_PREP(NV_PCIE_TGT_FILTER2_ADDR, 1U << idx);
}
port_filter = pcie_tgt_pmu_port_filter(event);
/* Monitor all ports if no filter is selected. */
if (!addr_filter_en && port_filter == 0)
port_filter = NV_PCIE_TGT_FILTER2_PORT;
filter2_val |= FIELD_PREP(NV_PCIE_TGT_FILTER2_PORT, port_filter);
writel(filter2_val, cspmu->base0 + filter2_offset);
}
static void pcie_tgt_pmu_reset_ev_filter(struct arm_cspmu *cspmu,
const struct perf_event *event)
{
bool addr_filter_en;
u64 base, mask;
int idx;
addr_filter_en = pcie_tgt_pmu_addr_en(event);
if (!addr_filter_en)
return;
base = pcie_tgt_pmu_dst_addr_base(event);
mask = pcie_tgt_pmu_dst_addr_mask(event);
idx = pcie_tgt_find_addr_idx(cspmu, base, mask, true);
if (idx < 0) {
dev_err(cspmu->dev,
"Unable to find the address filter slot to reset\n");
return;
}
pcie_tgt_pmu_config_addr_filter(cspmu, false, base, mask, idx);
}
static u32 pcie_tgt_pmu_event_type(const struct perf_event *event)
{
return event->attr.config & NV_PCIE_TGT_EV_TYPE_MASK;
}
static bool pcie_tgt_pmu_is_cycle_counter_event(const struct perf_event *event)
{
u32 event_type = pcie_tgt_pmu_event_type(event);
return event_type == NV_PCIE_TGT_EV_TYPE_CC;
}
enum nv_cspmu_name_fmt {
NAME_FMT_GENERIC,
NAME_FMT_SOCKET
NAME_FMT_SOCKET,
NAME_FMT_SOCKET_INST,
};
struct nv_cspmu_match {
@@ -342,6 +880,63 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
.init_data = NULL
},
},
{
.prodid = 0x2CF20000,
.prodid_mask = NV_PRODID_MASK,
.name_pattern = "nvidia_ucf_pmu_%u",
.name_fmt = NAME_FMT_SOCKET,
.template_ctx = {
.event_attr = ucf_pmu_event_attrs,
.format_attr = ucf_pmu_format_attrs,
.filter_mask = NV_UCF_FILTER_ID_MASK,
.filter_default_val = NV_UCF_FILTER_DEFAULT,
.filter2_mask = 0x0,
.filter2_default_val = 0x0,
.get_filter = ucf_pmu_event_filter,
},
},
{
.prodid = 0x10301000,
.prodid_mask = NV_PRODID_MASK,
.name_pattern = "nvidia_pcie_pmu_%u_rc_%u",
.name_fmt = NAME_FMT_SOCKET_INST,
.template_ctx = {
.event_attr = pcie_v2_pmu_event_attrs,
.format_attr = pcie_v2_pmu_format_attrs,
.filter_mask = NV_PCIE_V2_FILTER_ID_MASK,
.filter_default_val = NV_PCIE_V2_FILTER_DEFAULT,
.filter2_mask = NV_PCIE_V2_FILTER2_ID_MASK,
.filter2_default_val = NV_PCIE_V2_FILTER2_DEFAULT,
.get_filter = pcie_v2_pmu_event_filter,
.get_filter2 = nv_cspmu_event_filter2,
},
.ops = {
.validate_event = pcie_v2_pmu_validate_event,
.reset_ev_filter = nv_cspmu_reset_ev_filter,
}
},
{
.prodid = 0x10700000,
.prodid_mask = NV_PRODID_MASK,
.name_pattern = "nvidia_pcie_tgt_pmu_%u_rc_%u",
.name_fmt = NAME_FMT_SOCKET_INST,
.template_ctx = {
.event_attr = pcie_tgt_pmu_event_attrs,
.format_attr = pcie_tgt_pmu_format_attrs,
.filter_mask = 0x0,
.filter_default_val = 0x0,
.filter2_mask = NV_PCIE_TGT_FILTER2_MASK,
.filter2_default_val = NV_PCIE_TGT_FILTER2_DEFAULT,
.init_data = pcie_tgt_init_data
},
.ops = {
.is_cycle_counter_event = pcie_tgt_pmu_is_cycle_counter_event,
.event_type = pcie_tgt_pmu_event_type,
.validate_event = pcie_tgt_pmu_validate_event,
.set_ev_filter = pcie_tgt_pmu_set_ev_filter,
.reset_ev_filter = pcie_tgt_pmu_reset_ev_filter,
}
},
{
.prodid = 0,
.prodid_mask = 0,
@@ -365,7 +960,7 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
static char *nv_cspmu_format_name(const struct arm_cspmu *cspmu,
const struct nv_cspmu_match *match)
{
char *name = NULL;
struct device *dev = cspmu->dev;
static atomic_t pmu_generic_idx = {0};
@@ -379,13 +974,20 @@ static char *nv_cspmu_format_name(const struct arm_cspmu *cspmu,
socket);
break;
}
case NAME_FMT_SOCKET_INST: {
const int cpu = cpumask_first(&cspmu->associated_cpus);
const int socket = cpu_to_node(cpu);
u32 inst_id;
if (!nv_cspmu_get_inst_id(cspmu, &inst_id))
name = devm_kasprintf(dev, GFP_KERNEL,
match->name_pattern, socket, inst_id);
break;
}
case NAME_FMT_GENERIC:
name = devm_kasprintf(dev, GFP_KERNEL, match->name_pattern,
atomic_fetch_inc(&pmu_generic_idx));
break;
default:
name = NULL;
break;
}
return name;
@@ -426,8 +1028,12 @@ static int nv_cspmu_init_ops(struct arm_cspmu *cspmu)
cspmu->impl.ctx = ctx;
/* NVIDIA specific callbacks. */
SET_OP(validate_event, impl_ops, match, NULL);
SET_OP(event_type, impl_ops, match, NULL);
SET_OP(is_cycle_counter_event, impl_ops, match, NULL);
SET_OP(set_cc_filter, impl_ops, match, nv_cspmu_set_cc_filter);
SET_OP(set_ev_filter, impl_ops, match, nv_cspmu_set_ev_filter);
SET_OP(reset_ev_filter, impl_ops, match, NULL);
SET_OP(get_event_attrs, impl_ops, match, nv_cspmu_get_event_attrs);
SET_OP(get_format_attrs, impl_ops, match, nv_cspmu_get_format_attrs);
SET_OP(get_name, impl_ops, match, nv_cspmu_get_name);

@@ -0,0 +1,736 @@
// SPDX-License-Identifier: GPL-2.0
/*
* NVIDIA Tegra410 CPU Memory (CMEM) Latency PMU driver.
*
* Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*/
#include <linux/acpi.h>
#include <linux/bitops.h>
#include <linux/cpumask.h>
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/module.h>
#include <linux/perf_event.h>
#include <linux/platform_device.h>
#define NUM_INSTANCES 14
/* Register offsets. */
#define CMEM_LAT_CG_CTRL 0x800
#define CMEM_LAT_CTRL 0x808
#define CMEM_LAT_STATUS 0x810
#define CMEM_LAT_CYCLE_CNTR 0x818
#define CMEM_LAT_MC0_REQ_CNTR 0x820
#define CMEM_LAT_MC0_AOR_CNTR 0x830
#define CMEM_LAT_MC1_REQ_CNTR 0x838
#define CMEM_LAT_MC1_AOR_CNTR 0x848
#define CMEM_LAT_MC2_REQ_CNTR 0x850
#define CMEM_LAT_MC2_AOR_CNTR 0x860
/* CMEM_LAT_CTRL values. */
#define CMEM_LAT_CTRL_DISABLE 0x0ULL
#define CMEM_LAT_CTRL_ENABLE 0x1ULL
#define CMEM_LAT_CTRL_CLR 0x2ULL
/* CMEM_LAT_CG_CTRL values. */
#define CMEM_LAT_CG_CTRL_DISABLE 0x0ULL
#define CMEM_LAT_CG_CTRL_ENABLE 0x1ULL
/* CMEM_LAT_STATUS register field. */
#define CMEM_LAT_STATUS_CYCLE_OVF BIT(0)
#define CMEM_LAT_STATUS_MC0_AOR_OVF BIT(1)
#define CMEM_LAT_STATUS_MC0_REQ_OVF BIT(3)
#define CMEM_LAT_STATUS_MC1_AOR_OVF BIT(4)
#define CMEM_LAT_STATUS_MC1_REQ_OVF BIT(6)
#define CMEM_LAT_STATUS_MC2_AOR_OVF BIT(7)
#define CMEM_LAT_STATUS_MC2_REQ_OVF BIT(9)
/* Events. */
#define CMEM_LAT_EVENT_CYCLES 0x0
#define CMEM_LAT_EVENT_REQ 0x1
#define CMEM_LAT_EVENT_AOR 0x2
#define CMEM_LAT_NUM_EVENTS 0x3
#define CMEM_LAT_MASK_EVENT 0x3
#define CMEM_LAT_MAX_ACTIVE_EVENTS 32
#define CMEM_LAT_ACTIVE_CPU_MASK 0x0
#define CMEM_LAT_ASSOCIATED_CPU_MASK 0x1
static unsigned long cmem_lat_pmu_cpuhp_state;
struct cmem_lat_pmu_hw_events {
struct perf_event *events[CMEM_LAT_MAX_ACTIVE_EVENTS];
DECLARE_BITMAP(used_ctrs, CMEM_LAT_MAX_ACTIVE_EVENTS);
};
struct cmem_lat_pmu {
struct pmu pmu;
struct device *dev;
const char *name;
const char *identifier;
void __iomem *base_broadcast;
void __iomem *base[NUM_INSTANCES];
cpumask_t associated_cpus;
cpumask_t active_cpu;
struct hlist_node node;
struct cmem_lat_pmu_hw_events hw_events;
};
#define to_cmem_lat_pmu(p) \
container_of(p, struct cmem_lat_pmu, pmu)
/* Get event type from perf_event. */
static inline u32 get_event_type(struct perf_event *event)
{
return (event->attr.config) & CMEM_LAT_MASK_EVENT;
}
/* PMU operations. */
static int cmem_lat_pmu_get_event_idx(struct cmem_lat_pmu_hw_events *hw_events,
struct perf_event *event)
{
unsigned int idx;
idx = find_first_zero_bit(hw_events->used_ctrs, CMEM_LAT_MAX_ACTIVE_EVENTS);
if (idx >= CMEM_LAT_MAX_ACTIVE_EVENTS)
return -EAGAIN;
set_bit(idx, hw_events->used_ctrs);
return idx;
}
static bool cmem_lat_pmu_validate_event(struct pmu *pmu,
struct cmem_lat_pmu_hw_events *hw_events,
struct perf_event *event)
{
int ret;
if (is_software_event(event))
return true;
/* Reject groups spanning multiple HW PMUs. */
if (event->pmu != pmu)
return false;
ret = cmem_lat_pmu_get_event_idx(hw_events, event);
if (ret < 0)
return false;
return true;
}
/* Make sure the group of events can be scheduled at once on the PMU. */
static bool cmem_lat_pmu_validate_group(struct perf_event *event)
{
struct perf_event *sibling, *leader = event->group_leader;
struct cmem_lat_pmu_hw_events fake_hw_events;
if (event->group_leader == event)
return true;
memset(&fake_hw_events, 0, sizeof(fake_hw_events));
if (!cmem_lat_pmu_validate_event(event->pmu, &fake_hw_events, leader))
return false;
for_each_sibling_event(sibling, leader) {
if (!cmem_lat_pmu_validate_event(event->pmu, &fake_hw_events, sibling))
return false;
}
return cmem_lat_pmu_validate_event(event->pmu, &fake_hw_events, event);
}
static int cmem_lat_pmu_event_init(struct perf_event *event)
{
struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
struct hw_perf_event *hwc = &event->hw;
u32 event_type = get_event_type(event);
if (event->attr.type != event->pmu->type ||
event_type >= CMEM_LAT_NUM_EVENTS)
return -ENOENT;
/*
* Sampling, per-process mode, and per-task counters are not supported
* since this PMU is shared across all CPUs.
*/
if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) {
dev_dbg(cmem_lat_pmu->pmu.dev,
"Can't support sampling and per-process mode\n");
return -EOPNOTSUPP;
}
if (event->cpu < 0) {
dev_dbg(cmem_lat_pmu->pmu.dev, "Can't support per-task counters\n");
return -EINVAL;
}
/*
* Make sure the CPU assignment is on one of the CPUs associated with
* this PMU.
*/
if (!cpumask_test_cpu(event->cpu, &cmem_lat_pmu->associated_cpus)) {
dev_dbg(cmem_lat_pmu->pmu.dev,
"Requested cpu is not associated with the PMU\n");
return -EINVAL;
}
/* Enforce the current active CPU to handle the events in this PMU. */
event->cpu = cpumask_first(&cmem_lat_pmu->active_cpu);
if (event->cpu >= nr_cpu_ids)
return -EINVAL;
if (!cmem_lat_pmu_validate_group(event))
return -EINVAL;
hwc->idx = -1;
hwc->config = event_type;
return 0;
}
static u64 cmem_lat_pmu_read_status(struct cmem_lat_pmu *cmem_lat_pmu,
unsigned int inst)
{
return readq(cmem_lat_pmu->base[inst] + CMEM_LAT_STATUS);
}
static u64 cmem_lat_pmu_read_cycle_counter(struct perf_event *event)
{
const unsigned int instance = 0;
u64 status;
struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
struct device *dev = cmem_lat_pmu->dev;
/*
* Use the reading from the first instance since all instances are
* identical.
*/
status = cmem_lat_pmu_read_status(cmem_lat_pmu, instance);
if (status & CMEM_LAT_STATUS_CYCLE_OVF)
dev_warn(dev, "Cycle counter overflow\n");
return readq(cmem_lat_pmu->base[instance] + CMEM_LAT_CYCLE_CNTR);
}
static u64 cmem_lat_pmu_read_req_counter(struct perf_event *event)
{
unsigned int i;
u64 status, val = 0;
struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
struct device *dev = cmem_lat_pmu->dev;
/* Sum up the counts from all instances. */
for (i = 0; i < NUM_INSTANCES; i++) {
status = cmem_lat_pmu_read_status(cmem_lat_pmu, i);
if (status & CMEM_LAT_STATUS_MC0_REQ_OVF)
dev_warn(dev, "MC0 request counter overflow\n");
if (status & CMEM_LAT_STATUS_MC1_REQ_OVF)
dev_warn(dev, "MC1 request counter overflow\n");
if (status & CMEM_LAT_STATUS_MC2_REQ_OVF)
dev_warn(dev, "MC2 request counter overflow\n");
val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC0_REQ_CNTR);
val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC1_REQ_CNTR);
val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC2_REQ_CNTR);
}
return val;
}
static u64 cmem_lat_pmu_read_aor_counter(struct perf_event *event)
{
unsigned int i;
u64 status, val = 0;
struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
struct device *dev = cmem_lat_pmu->dev;
/* Sum up the counts from all instances. */
for (i = 0; i < NUM_INSTANCES; i++) {
status = cmem_lat_pmu_read_status(cmem_lat_pmu, i);
if (status & CMEM_LAT_STATUS_MC0_AOR_OVF)
dev_warn(dev, "MC0 AOR counter overflow\n");
if (status & CMEM_LAT_STATUS_MC1_AOR_OVF)
dev_warn(dev, "MC1 AOR counter overflow\n");
if (status & CMEM_LAT_STATUS_MC2_AOR_OVF)
dev_warn(dev, "MC2 AOR counter overflow\n");
val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC0_AOR_CNTR);
val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC1_AOR_CNTR);
val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC2_AOR_CNTR);
}
return val;
}
static u64 (*read_counter_fn[CMEM_LAT_NUM_EVENTS])(struct perf_event *) = {
[CMEM_LAT_EVENT_CYCLES] = cmem_lat_pmu_read_cycle_counter,
[CMEM_LAT_EVENT_REQ] = cmem_lat_pmu_read_req_counter,
[CMEM_LAT_EVENT_AOR] = cmem_lat_pmu_read_aor_counter,
};
static void cmem_lat_pmu_event_update(struct perf_event *event)
{
u32 event_type;
u64 prev, now;
struct hw_perf_event *hwc = &event->hw;
if (hwc->state & PERF_HES_STOPPED)
return;
event_type = hwc->config;
do {
prev = local64_read(&hwc->prev_count);
now = read_counter_fn[event_type](event);
} while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
local64_add(now - prev, &event->count);
hwc->state |= PERF_HES_UPTODATE;
}
static void cmem_lat_pmu_start(struct perf_event *event, int pmu_flags)
{
event->hw.state = 0;
}
static void cmem_lat_pmu_stop(struct perf_event *event, int pmu_flags)
{
event->hw.state |= PERF_HES_STOPPED;
}
static int cmem_lat_pmu_add(struct perf_event *event, int flags)
{
struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
struct cmem_lat_pmu_hw_events *hw_events = &cmem_lat_pmu->hw_events;
struct hw_perf_event *hwc = &event->hw;
int idx;
if (WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(),
&cmem_lat_pmu->associated_cpus)))
return -ENOENT;
idx = cmem_lat_pmu_get_event_idx(hw_events, event);
if (idx < 0)
return idx;
hw_events->events[idx] = event;
hwc->idx = idx;
hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
if (flags & PERF_EF_START)
cmem_lat_pmu_start(event, PERF_EF_RELOAD);
/* Propagate changes to the userspace mapping. */
perf_event_update_userpage(event);
return 0;
}
static void cmem_lat_pmu_del(struct perf_event *event, int flags)
{
struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
struct cmem_lat_pmu_hw_events *hw_events = &cmem_lat_pmu->hw_events;
struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx;
cmem_lat_pmu_stop(event, PERF_EF_UPDATE);
hw_events->events[idx] = NULL;
clear_bit(idx, hw_events->used_ctrs);
perf_event_update_userpage(event);
}
static void cmem_lat_pmu_read(struct perf_event *event)
{
cmem_lat_pmu_event_update(event);
}
static inline void cmem_lat_pmu_cg_ctrl(struct cmem_lat_pmu *cmem_lat_pmu,
u64 val)
{
writeq(val, cmem_lat_pmu->base_broadcast + CMEM_LAT_CG_CTRL);
}
static inline void cmem_lat_pmu_ctrl(struct cmem_lat_pmu *cmem_lat_pmu, u64 val)
{
writeq(val, cmem_lat_pmu->base_broadcast + CMEM_LAT_CTRL);
}
static void cmem_lat_pmu_enable(struct pmu *pmu)
{
bool disabled;
struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(pmu);
disabled = bitmap_empty(cmem_lat_pmu->hw_events.used_ctrs,
CMEM_LAT_MAX_ACTIVE_EVENTS);
if (disabled)
return;
/* Enable all the counters. */
cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CMEM_LAT_CG_CTRL_ENABLE);
cmem_lat_pmu_ctrl(cmem_lat_pmu, CMEM_LAT_CTRL_ENABLE);
}
static void cmem_lat_pmu_disable(struct pmu *pmu)
{
int idx;
struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(pmu);
/* Disable all the counters. */
cmem_lat_pmu_ctrl(cmem_lat_pmu, CMEM_LAT_CTRL_DISABLE);
/*
* The counters will start from 0 again on restart.
* Update the events immediately to avoid losing the counts.
*/
for_each_set_bit(idx, cmem_lat_pmu->hw_events.used_ctrs,
CMEM_LAT_MAX_ACTIVE_EVENTS) {
struct perf_event *event = cmem_lat_pmu->hw_events.events[idx];
if (!event)
continue;
cmem_lat_pmu_event_update(event);
local64_set(&event->hw.prev_count, 0ULL);
}
cmem_lat_pmu_ctrl(cmem_lat_pmu, CMEM_LAT_CTRL_CLR);
cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CMEM_LAT_CG_CTRL_DISABLE);
}
/* PMU identifier attribute. */
static ssize_t cmem_lat_pmu_identifier_show(struct device *dev,
struct device_attribute *attr,
char *page)
{
struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(dev_get_drvdata(dev));
return sysfs_emit(page, "%s\n", cmem_lat_pmu->identifier);
}
static struct device_attribute cmem_lat_pmu_identifier_attr =
__ATTR(identifier, 0444, cmem_lat_pmu_identifier_show, NULL);
static struct attribute *cmem_lat_pmu_identifier_attrs[] = {
&cmem_lat_pmu_identifier_attr.attr,
NULL
};
static struct attribute_group cmem_lat_pmu_identifier_attr_group = {
.attrs = cmem_lat_pmu_identifier_attrs,
};
/* Format attributes. */
#define NV_PMU_EXT_ATTR(_name, _func, _config) \
(&((struct dev_ext_attribute[]){ \
{ \
.attr = __ATTR(_name, 0444, _func, NULL), \
.var = (void *)_config \
} \
})[0].attr.attr)
static struct attribute *cmem_lat_pmu_formats[] = {
NV_PMU_EXT_ATTR(event, device_show_string, "config:0-1"),
NULL
};
static const struct attribute_group cmem_lat_pmu_format_group = {
.name = "format",
.attrs = cmem_lat_pmu_formats,
};
/* Event attributes. */
static ssize_t cmem_lat_pmu_sysfs_event_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct perf_pmu_events_attr *pmu_attr;
pmu_attr = container_of(attr, typeof(*pmu_attr), attr);
return sysfs_emit(buf, "event=0x%llx\n", pmu_attr->id);
}
#define NV_PMU_EVENT_ATTR(_name, _config) \
PMU_EVENT_ATTR_ID(_name, cmem_lat_pmu_sysfs_event_show, _config)
static struct attribute *cmem_lat_pmu_events[] = {
NV_PMU_EVENT_ATTR(cycles, CMEM_LAT_EVENT_CYCLES),
NV_PMU_EVENT_ATTR(rd_req, CMEM_LAT_EVENT_REQ),
NV_PMU_EVENT_ATTR(rd_cum_outs, CMEM_LAT_EVENT_AOR),
NULL
};
static const struct attribute_group cmem_lat_pmu_events_group = {
.name = "events",
.attrs = cmem_lat_pmu_events,
};
/* Cpumask attributes. */
static ssize_t cmem_lat_pmu_cpumask_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct pmu *pmu = dev_get_drvdata(dev);
struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(pmu);
struct dev_ext_attribute *eattr =
container_of(attr, struct dev_ext_attribute, attr);
unsigned long mask_id = (unsigned long)eattr->var;
const cpumask_t *cpumask;
switch (mask_id) {
case CMEM_LAT_ACTIVE_CPU_MASK:
cpumask = &cmem_lat_pmu->active_cpu;
break;
case CMEM_LAT_ASSOCIATED_CPU_MASK:
cpumask = &cmem_lat_pmu->associated_cpus;
break;
default:
return 0;
}
return cpumap_print_to_pagebuf(true, buf, cpumask);
}
#define NV_PMU_CPUMASK_ATTR(_name, _config) \
NV_PMU_EXT_ATTR(_name, cmem_lat_pmu_cpumask_show, \
(unsigned long)_config)
static struct attribute *cmem_lat_pmu_cpumask_attrs[] = {
NV_PMU_CPUMASK_ATTR(cpumask, CMEM_LAT_ACTIVE_CPU_MASK),
NV_PMU_CPUMASK_ATTR(associated_cpus, CMEM_LAT_ASSOCIATED_CPU_MASK),
NULL
};
static const struct attribute_group cmem_lat_pmu_cpumask_attr_group = {
.attrs = cmem_lat_pmu_cpumask_attrs,
};
/* Per PMU device attribute groups. */
static const struct attribute_group *cmem_lat_pmu_attr_groups[] = {
&cmem_lat_pmu_identifier_attr_group,
&cmem_lat_pmu_format_group,
&cmem_lat_pmu_events_group,
&cmem_lat_pmu_cpumask_attr_group,
NULL
};
static int cmem_lat_pmu_cpu_online(unsigned int cpu, struct hlist_node *node)
{
struct cmem_lat_pmu *cmem_lat_pmu =
hlist_entry_safe(node, struct cmem_lat_pmu, node);
if (!cpumask_test_cpu(cpu, &cmem_lat_pmu->associated_cpus))
return 0;
/* If the PMU is already managed, there is nothing to do */
if (!cpumask_empty(&cmem_lat_pmu->active_cpu))
return 0;
/* Use this CPU for event counting */
cpumask_set_cpu(cpu, &cmem_lat_pmu->active_cpu);
return 0;
}
static int cmem_lat_pmu_cpu_teardown(unsigned int cpu, struct hlist_node *node)
{
unsigned int dst;
struct cmem_lat_pmu *cmem_lat_pmu =
hlist_entry_safe(node, struct cmem_lat_pmu, node);
/* Nothing to do if this CPU doesn't own the PMU */
if (!cpumask_test_and_clear_cpu(cpu, &cmem_lat_pmu->active_cpu))
return 0;
/* Choose a new CPU to migrate ownership of the PMU to */
dst = cpumask_any_and_but(&cmem_lat_pmu->associated_cpus,
cpu_online_mask, cpu);
if (dst >= nr_cpu_ids)
return 0;
/* Use this CPU for event counting */
perf_pmu_migrate_context(&cmem_lat_pmu->pmu, cpu, dst);
cpumask_set_cpu(dst, &cmem_lat_pmu->active_cpu);
return 0;
}
static int cmem_lat_pmu_get_cpus(struct cmem_lat_pmu *cmem_lat_pmu,
unsigned int socket)
{
int cpu;
for_each_possible_cpu(cpu) {
if (cpu_to_node(cpu) == socket)
cpumask_set_cpu(cpu, &cmem_lat_pmu->associated_cpus);
}
if (cpumask_empty(&cmem_lat_pmu->associated_cpus)) {
dev_dbg(cmem_lat_pmu->dev,
"No cpu associated with PMU socket-%u\n", socket);
return -ENODEV;
}
return 0;
}
static int cmem_lat_pmu_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
struct acpi_device *acpi_dev;
struct cmem_lat_pmu *cmem_lat_pmu;
char *name;
const char *uid_str;
int ret, i;
u32 socket;
acpi_dev = ACPI_COMPANION(dev);
if (!acpi_dev)
return -ENODEV;
uid_str = acpi_device_uid(acpi_dev);
if (!uid_str)
return -ENODEV;
ret = kstrtou32(uid_str, 0, &socket);
if (ret)
return ret;
cmem_lat_pmu = devm_kzalloc(dev, sizeof(*cmem_lat_pmu), GFP_KERNEL);
name = devm_kasprintf(dev, GFP_KERNEL, "nvidia_cmem_latency_pmu_%u", socket);
if (!cmem_lat_pmu || !name)
return -ENOMEM;
cmem_lat_pmu->dev = dev;
cmem_lat_pmu->name = name;
cmem_lat_pmu->identifier = acpi_device_hid(acpi_dev);
platform_set_drvdata(pdev, cmem_lat_pmu);
cmem_lat_pmu->pmu = (struct pmu) {
.parent = &pdev->dev,
.task_ctx_nr = perf_invalid_context,
.pmu_enable = cmem_lat_pmu_enable,
.pmu_disable = cmem_lat_pmu_disable,
.event_init = cmem_lat_pmu_event_init,
.add = cmem_lat_pmu_add,
.del = cmem_lat_pmu_del,
.start = cmem_lat_pmu_start,
.stop = cmem_lat_pmu_stop,
.read = cmem_lat_pmu_read,
.attr_groups = cmem_lat_pmu_attr_groups,
.capabilities = PERF_PMU_CAP_NO_EXCLUDE |
PERF_PMU_CAP_NO_INTERRUPT,
};
/* Map the address of all the instances. */
for (i = 0; i < NUM_INSTANCES; i++) {
cmem_lat_pmu->base[i] = devm_platform_ioremap_resource(pdev, i);
if (IS_ERR(cmem_lat_pmu->base[i])) {
dev_err(dev, "Failed to map address for instance %d\n", i);
return PTR_ERR(cmem_lat_pmu->base[i]);
}
}
/* Map broadcast address. */
cmem_lat_pmu->base_broadcast = devm_platform_ioremap_resource(pdev,
NUM_INSTANCES);
if (IS_ERR(cmem_lat_pmu->base_broadcast)) {
dev_err(dev, "Failed to map broadcast address\n");
return PTR_ERR(cmem_lat_pmu->base_broadcast);
}
ret = cmem_lat_pmu_get_cpus(cmem_lat_pmu, socket);
if (ret)
return ret;
ret = cpuhp_state_add_instance(cmem_lat_pmu_cpuhp_state,
&cmem_lat_pmu->node);
if (ret) {
dev_err(&pdev->dev, "Error %d registering hotplug\n", ret);
return ret;
}
cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CMEM_LAT_CG_CTRL_ENABLE);
cmem_lat_pmu_ctrl(cmem_lat_pmu, CMEM_LAT_CTRL_CLR);
cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CMEM_LAT_CG_CTRL_DISABLE);
ret = perf_pmu_register(&cmem_lat_pmu->pmu, name, -1);
if (ret) {
dev_err(&pdev->dev, "Failed to register PMU: %d\n", ret);
cpuhp_state_remove_instance(cmem_lat_pmu_cpuhp_state,
&cmem_lat_pmu->node);
return ret;
}
dev_dbg(&pdev->dev, "Registered %s PMU\n", name);
return 0;
}
static void cmem_lat_pmu_device_remove(struct platform_device *pdev)
{
struct cmem_lat_pmu *cmem_lat_pmu = platform_get_drvdata(pdev);
perf_pmu_unregister(&cmem_lat_pmu->pmu);
cpuhp_state_remove_instance(cmem_lat_pmu_cpuhp_state,
&cmem_lat_pmu->node);
}
static const struct acpi_device_id cmem_lat_pmu_acpi_match[] = {
{ "NVDA2021" },
{ }
};
MODULE_DEVICE_TABLE(acpi, cmem_lat_pmu_acpi_match);
static struct platform_driver cmem_lat_pmu_driver = {
.driver = {
.name = "nvidia-t410-cmem-latency-pmu",
.acpi_match_table = ACPI_PTR(cmem_lat_pmu_acpi_match),
.suppress_bind_attrs = true,
},
.probe = cmem_lat_pmu_probe,
.remove = cmem_lat_pmu_device_remove,
};
static int __init cmem_lat_pmu_init(void)
{
int ret;
ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
"perf/nvidia/cmem_latency:online",
cmem_lat_pmu_cpu_online,
cmem_lat_pmu_cpu_teardown);
if (ret < 0)
return ret;
cmem_lat_pmu_cpuhp_state = ret;
return platform_driver_register(&cmem_lat_pmu_driver);
}
static void __exit cmem_lat_pmu_exit(void)
{
platform_driver_unregister(&cmem_lat_pmu_driver);
cpuhp_remove_multi_state(cmem_lat_pmu_cpuhp_state);
}
module_init(cmem_lat_pmu_init);
module_exit(cmem_lat_pmu_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("NVIDIA Tegra410 CPU Memory Latency PMU driver");
MODULE_AUTHOR("Besar Wicaksono <bwicaksono@nvidia.com>");


@@ -1,6 +1,7 @@
menuconfig ARM64_MPAM_DRIVER
bool "MPAM driver"
depends on ARM64 && ARM64_MPAM
select ACPI_MPAM if ACPI
help
Memory System Resource Partitioning and Monitoring (MPAM) driver for
System IP, e.g. caches and memory controllers.
@@ -22,3 +23,9 @@ config MPAM_KUNIT_TEST
If unsure, say N.
endif
config ARM64_MPAM_RESCTRL_FS
bool
default y if ARM64_MPAM_DRIVER && RESCTRL_FS
select RESCTRL_RMID_DEPENDS_ON_CLOSID
select RESCTRL_ASSIGN_FIXED


@@ -1,4 +1,5 @@
obj-$(CONFIG_ARM64_MPAM_DRIVER) += mpam.o
mpam-y += mpam_devices.o
mpam-$(CONFIG_ARM64_MPAM_RESCTRL_FS) += mpam_resctrl.o
ccflags-$(CONFIG_ARM64_MPAM_DRIVER_DEBUG) += -DDEBUG


@@ -29,7 +29,15 @@
#include "mpam_internal.h"
DEFINE_STATIC_KEY_FALSE(mpam_enabled); /* This moves to arch code */
/* Values for the T241 errata workaround */
#define T241_CHIPS_MAX 4
#define T241_CHIP_NSLICES 12
#define T241_SPARE_REG0_OFF 0x1b0000
#define T241_SPARE_REG1_OFF 0x1c0000
#define T241_CHIP_ID(phys) FIELD_GET(GENMASK_ULL(44, 43), phys)
#define T241_SHADOW_REG_OFF(sidx, pid) (0x360048 + (sidx) * 0x10000 + (pid) * 8)
#define SMCCC_SOC_ID_T241 0x036b0241
static void __iomem *t241_scratch_regs[T241_CHIPS_MAX];
/*
* mpam_list_lock protects the SRCU lists when writing. Once the
@@ -75,6 +83,14 @@ static DECLARE_WORK(mpam_broken_work, &mpam_disable);
/* When mpam is disabled, the printed reason to aid debugging */
static char *mpam_disable_reason;
/*
* Whether resctrl has been setup. Used by cpuhp in preference to
* mpam_is_enabled(). The disable call after an error interrupt makes
* mpam_is_enabled() false before the cpuhp callbacks are made.
* Reads/writes should hold mpam_cpuhp_state_lock, (or be cpuhp callbacks).
*/
static bool mpam_resctrl_enabled;
/*
* An MSC is a physical container for controls and monitors, each identified by
* their RIS index. These share a base-address, interrupts and some MMIO
@@ -624,6 +640,86 @@ static struct mpam_msc_ris *mpam_get_or_create_ris(struct mpam_msc *msc,
return ERR_PTR(-ENOENT);
}
static int mpam_enable_quirk_nvidia_t241_1(struct mpam_msc *msc,
const struct mpam_quirk *quirk)
{
s32 soc_id = arm_smccc_get_soc_id_version();
struct resource *r;
phys_addr_t phys;
/*
* A mapping to a device other than the MSC is needed; check that
* the SOC_ID identifies the NVIDIA T241 chip (036b:0241).
*/
if (soc_id < 0 || soc_id != SMCCC_SOC_ID_T241)
return -EINVAL;
r = platform_get_resource(msc->pdev, IORESOURCE_MEM, 0);
if (!r)
return -EINVAL;
/* Find the internal registers base addr from the CHIP ID */
msc->t241_id = T241_CHIP_ID(r->start);
phys = FIELD_PREP(GENMASK_ULL(45, 44), msc->t241_id) | 0x19000000ULL;
t241_scratch_regs[msc->t241_id] = ioremap(phys, SZ_8M);
if (WARN_ON_ONCE(!t241_scratch_regs[msc->t241_id]))
return -EINVAL;
pr_info_once("Enabled workaround for NVIDIA T241 erratum T241-MPAM-1\n");
return 0;
}
static const struct mpam_quirk mpam_quirks[] = {
{
/* NVIDIA t241 erratum T241-MPAM-1 */
.init = mpam_enable_quirk_nvidia_t241_1,
.iidr = MPAM_IIDR_NVIDIA_T241,
.iidr_mask = MPAM_IIDR_MATCH_ONE,
.workaround = T241_SCRUB_SHADOW_REGS,
},
{
/* NVIDIA t241 erratum T241-MPAM-4 */
.iidr = MPAM_IIDR_NVIDIA_T241,
.iidr_mask = MPAM_IIDR_MATCH_ONE,
.workaround = T241_FORCE_MBW_MIN_TO_ONE,
},
{
/* NVIDIA t241 erratum T241-MPAM-6 */
.iidr = MPAM_IIDR_NVIDIA_T241,
.iidr_mask = MPAM_IIDR_MATCH_ONE,
.workaround = T241_MBW_COUNTER_SCALE_64,
},
{
/* ARM CMN-650 CSU erratum 3642720 */
.iidr = MPAM_IIDR_ARM_CMN_650,
.iidr_mask = MPAM_IIDR_MATCH_ONE,
.workaround = IGNORE_CSU_NRDY,
},
{ NULL } /* Sentinel */
};
static void mpam_enable_quirks(struct mpam_msc *msc)
{
const struct mpam_quirk *quirk;
for (quirk = &mpam_quirks[0]; quirk->iidr_mask; quirk++) {
int err = 0;
if (quirk->iidr != (msc->iidr & quirk->iidr_mask))
continue;
if (quirk->init)
err = quirk->init(msc, quirk);
if (err)
continue;
mpam_set_quirk(quirk->workaround, msc);
}
}
/*
* IHI009A.a has this nugget: "If a monitor does not support automatic behaviour
* of NRDY, software can use this bit for any purpose" - so hardware might not
@@ -715,6 +811,13 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris)
mpam_set_feature(mpam_feat_mbw_part, props);
props->bwa_wd = FIELD_GET(MPAMF_MBW_IDR_BWA_WD, mbw_features);
/*
* The BWA_WD field can represent 0-63, but the control fields it
* describes have a maximum of 16 bits.
*/
props->bwa_wd = min(props->bwa_wd, 16);
if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MAX, mbw_features))
mpam_set_feature(mpam_feat_mbw_max, props);
@@ -851,8 +954,11 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc)
/* Grab an IDR value to find out how many RIS there are */
mutex_lock(&msc->part_sel_lock);
idr = mpam_msc_read_idr(msc);
msc->iidr = mpam_read_partsel_reg(msc, IIDR);
mutex_unlock(&msc->part_sel_lock);
mpam_enable_quirks(msc);
msc->ris_max = FIELD_GET(MPAMF_IDR_RIS_MAX, idr);
/* Use these values so partid/pmg always starts with a valid value */
@@ -903,6 +1009,7 @@ struct mon_read {
enum mpam_device_features type;
u64 *val;
int err;
bool waited_timeout;
};
static bool mpam_ris_has_mbwu_long_counter(struct mpam_msc_ris *ris)
@@ -1052,7 +1159,7 @@ static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val,
}
}
static u64 __mpam_msmon_overflow_val(enum mpam_device_features type)
{
/* TODO: implement scaling counters */
switch (type) {
@@ -1067,6 +1174,18 @@ static u64 mpam_msmon_overflow_val(enum mpam_device_features type)
}
}
static u64 mpam_msmon_overflow_val(enum mpam_device_features type,
struct mpam_msc *msc)
{
u64 overflow_val = __mpam_msmon_overflow_val(type);
if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc) &&
type != mpam_feat_msmon_mbwu_63counter)
overflow_val *= 64;
return overflow_val;
}
static void __ris_msmon_read(void *arg)
{
u64 now;
@@ -1137,6 +1256,10 @@ static void __ris_msmon_read(void *arg)
if (mpam_has_feature(mpam_feat_msmon_csu_hw_nrdy, rprops))
nrdy = now & MSMON___NRDY;
now = FIELD_GET(MSMON___VALUE, now);
if (mpam_has_quirk(IGNORE_CSU_NRDY, msc) && m->waited_timeout)
nrdy = false;
break;
case mpam_feat_msmon_mbwu_31counter:
case mpam_feat_msmon_mbwu_44counter:
@@ -1157,13 +1280,17 @@ static void __ris_msmon_read(void *arg)
now = FIELD_GET(MSMON___VALUE, now);
}
if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc) &&
m->type != mpam_feat_msmon_mbwu_63counter)
now *= 64;
if (nrdy)
break;
mbwu_state = &ris->mbwu_state[ctx->mon];
if (overflow)
mbwu_state->correction += mpam_msmon_overflow_val(m->type, msc);
/*
* Include bandwidth consumed before the last hardware reset and
@@ -1270,6 +1397,7 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx,
.ctx = ctx,
.type = type,
.val = val,
.waited_timeout = true,
};
*val = 0;
@@ -1338,6 +1466,75 @@ static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd)
__mpam_write_reg(msc, reg, bm);
}
static void mpam_apply_t241_erratum(struct mpam_msc_ris *ris, u16 partid)
{
int sidx, i, lcount = 1000;
void __iomem *regs;
u64 val0, val;
regs = t241_scratch_regs[ris->vmsc->msc->t241_id];
for (i = 0; i < lcount; i++) {
/* Read the shadow register at index 0 */
val0 = readq_relaxed(regs + T241_SHADOW_REG_OFF(0, partid));
/* Check if all the shadow registers have the same value */
for (sidx = 1; sidx < T241_CHIP_NSLICES; sidx++) {
val = readq_relaxed(regs +
T241_SHADOW_REG_OFF(sidx, partid));
if (val != val0)
break;
}
if (sidx == T241_CHIP_NSLICES)
break;
}
if (i == lcount)
pr_warn_once("t241: inconsistent values in shadow regs\n");
/* Write zero to the spare registers to make the MBW configuration take effect */
writeq_relaxed(0, regs + T241_SPARE_REG0_OFF);
writeq_relaxed(0, regs + T241_SPARE_REG1_OFF);
}
static void mpam_quirk_post_config_change(struct mpam_msc_ris *ris, u16 partid,
struct mpam_config *cfg)
{
if (mpam_has_quirk(T241_SCRUB_SHADOW_REGS, ris->vmsc->msc))
mpam_apply_t241_erratum(ris, partid);
}
static u16 mpam_wa_t241_force_mbw_min_to_one(struct mpam_props *props)
{
u16 max_hw_value, min_hw_granule, res0_bits;
res0_bits = 16 - props->bwa_wd;
max_hw_value = ((1 << props->bwa_wd) - 1) << res0_bits;
min_hw_granule = ~max_hw_value;
return min_hw_granule + 1;
}
static u16 mpam_wa_t241_calc_min_from_max(struct mpam_props *props,
struct mpam_config *cfg)
{
u16 val = 0;
u16 max;
u16 delta = ((5 * MPAMCFG_MBW_MAX_MAX) / 100) - 1;
if (mpam_has_feature(mpam_feat_mbw_max, cfg)) {
max = cfg->mbw_max;
} else {
/* Resetting, so use the RIS-specific default. */
max = GENMASK(15, 16 - props->bwa_wd);
}
if (max > delta)
val = max - delta;
return val;
}
/* Called via IPI. Call while holding an SRCU reference */
static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid,
struct mpam_config *cfg)
@@ -1364,36 +1561,41 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid,
__mpam_intpart_sel(ris->ris_idx, partid, msc);
}
if (mpam_has_feature(mpam_feat_cpor_part, rprops)) {
if (mpam_has_feature(mpam_feat_cpor_part, cfg))
mpam_write_partsel_reg(msc, CPBM, cfg->cpbm);
else
mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, rprops->cpbm_wd);
}
if (mpam_has_feature(mpam_feat_mbw_part, rprops)) {
if (mpam_has_feature(mpam_feat_mbw_part, cfg))
mpam_write_partsel_reg(msc, MBW_PBM, cfg->mbw_pbm);
else
mpam_reset_msc_bitmap(msc, MPAMCFG_MBW_PBM, rprops->mbw_pbm_bits);
}
if (mpam_has_feature(mpam_feat_mbw_min, rprops)) {
u16 val = 0;
if (mpam_has_quirk(T241_FORCE_MBW_MIN_TO_ONE, msc)) {
u16 min = mpam_wa_t241_force_mbw_min_to_one(rprops);
val = mpam_wa_t241_calc_min_from_max(rprops, cfg);
val = max(val, min);
}
mpam_write_partsel_reg(msc, MBW_MIN, val);
}
if (mpam_has_feature(mpam_feat_mbw_max, rprops)) {
if (mpam_has_feature(mpam_feat_mbw_max, cfg))
mpam_write_partsel_reg(msc, MBW_MAX, cfg->mbw_max);
else
mpam_write_partsel_reg(msc, MBW_MAX, MPAMCFG_MBW_MAX_MAX);
}
if (mpam_has_feature(mpam_feat_mbw_prop, rprops))
mpam_write_partsel_reg(msc, MBW_PROP, 0);
if (mpam_has_feature(mpam_feat_cmax_cmax, rprops))
@@ -1421,6 +1623,8 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid,
mpam_write_partsel_reg(msc, PRI, pri_val);
}
mpam_quirk_post_config_change(ris, partid, cfg);
mutex_unlock(&msc->part_sel_lock);
}
@@ -1493,16 +1697,6 @@ static int mpam_save_mbwu_state(void *arg)
return 0;
}
static void mpam_init_reset_cfg(struct mpam_config *reset_cfg)
{
*reset_cfg = (struct mpam_config) {
.reset_cpbm = true,
.reset_mbw_pbm = true,
.reset_mbw_max = true,
};
bitmap_fill(reset_cfg->features, MPAM_FEATURE_LAST);
}
/*
* Called via smp_call_on_cpu() to prevent migration, while still being
* pre-emptible. Caller must hold mpam_srcu.
@@ -1510,14 +1704,12 @@ static void mpam_init_reset_cfg(struct mpam_config *reset_cfg)
static int mpam_reset_ris(void *arg)
{
u16 partid, partid_max;
struct mpam_config reset_cfg;
struct mpam_config reset_cfg = {};
struct mpam_msc_ris *ris = arg;
if (ris->in_reset_state)
return 0;
mpam_init_reset_cfg(&reset_cfg);
spin_lock(&partid_max_lock);
partid_max = mpam_partid_max;
spin_unlock(&partid_max_lock);
@@ -1632,6 +1824,9 @@ static int mpam_cpu_online(unsigned int cpu)
mpam_reprogram_msc(msc);
}
if (mpam_resctrl_enabled)
return mpam_resctrl_online_cpu(cpu);
return 0;
}
@@ -1675,6 +1870,9 @@ static int mpam_cpu_offline(unsigned int cpu)
{
struct mpam_msc *msc;
if (mpam_resctrl_enabled)
mpam_resctrl_offline_cpu(cpu);
guard(srcu)(&mpam_srcu);
list_for_each_entry_srcu(msc, &mpam_all_msc, all_msc_list,
srcu_read_lock_held(&mpam_srcu)) {
@@ -1971,6 +2169,7 @@ static bool mpam_has_cmax_wd_feature(struct mpam_props *props)
* resulting safe value must be compatible with both. When merging values in
* the tree, all the aliasing resources must be handled first.
* On mismatch, parent is modified.
* Quirks on an MSC will apply to all MSCs in that class.
*/
static void __props_mismatch(struct mpam_props *parent,
struct mpam_props *child, bool alias)
@@ -2090,6 +2289,7 @@ static void __props_mismatch(struct mpam_props *parent,
* nobble the class feature, as we can't configure all the resources.
* e.g. The L3 cache is composed of two resources with 13 and 17 portion
* bitmaps respectively.
* Quirks on an MSC will apply to all MSCs in that class.
*/
static void
__class_props_mismatch(struct mpam_class *class, struct mpam_vmsc *vmsc)
@@ -2103,6 +2303,9 @@ __class_props_mismatch(struct mpam_class *class, struct mpam_vmsc *vmsc)
dev_dbg(dev, "Merging features for class:0x%lx &= vmsc:0x%lx\n",
(long)cprops->features, (long)vprops->features);
/* Merge quirks */
class->quirks |= vmsc->msc->quirks;
/* Take the safe value for any common features */
__props_mismatch(cprops, vprops, false);
}
@@ -2167,6 +2370,9 @@ static void mpam_enable_merge_class_features(struct mpam_component *comp)
list_for_each_entry(vmsc, &comp->vmsc, comp_list)
__class_props_mismatch(class, vmsc);
if (mpam_has_quirk(T241_FORCE_MBW_MIN_TO_ONE, class))
mpam_clear_feature(mpam_feat_mbw_min, &class->props);
}
/*
@@ -2520,6 +2726,12 @@ static void mpam_enable_once(void)
mutex_unlock(&mpam_list_lock);
cpus_read_unlock();
if (!err) {
err = mpam_resctrl_setup();
if (err)
pr_err("Failed to initialise resctrl: %d\n", err);
}
if (err) {
mpam_disable_reason = "Failed to enable.";
schedule_work(&mpam_broken_work);
@@ -2527,6 +2739,7 @@ static void mpam_enable_once(void)
}
static_branch_enable(&mpam_enabled);
mpam_resctrl_enabled = true;
mpam_register_cpuhp_callbacks(mpam_cpu_online, mpam_cpu_offline,
"mpam:online");
@@ -2559,7 +2772,7 @@ static void mpam_reset_component_locked(struct mpam_component *comp)
}
}
static void mpam_reset_class_locked(struct mpam_class *class)
void mpam_reset_class_locked(struct mpam_class *class)
{
struct mpam_component *comp;
@@ -2586,24 +2799,39 @@ static void mpam_reset_class(struct mpam_class *class)
void mpam_disable(struct work_struct *ignored)
{
int idx;
bool do_resctrl_exit;
struct mpam_class *class;
struct mpam_msc *msc, *tmp;
if (mpam_is_enabled())
static_branch_disable(&mpam_enabled);
mutex_lock(&mpam_cpuhp_state_lock);
if (mpam_cpuhp_state) {
cpuhp_remove_state(mpam_cpuhp_state);
mpam_cpuhp_state = 0;
}
/*
* Removing the cpuhp state called mpam_cpu_offline() and told resctrl
* all the CPUs are offline.
*/
do_resctrl_exit = mpam_resctrl_enabled;
mpam_resctrl_enabled = false;
mutex_unlock(&mpam_cpuhp_state_lock);
static_branch_disable(&mpam_enabled);
if (do_resctrl_exit)
mpam_resctrl_exit();
mpam_unregister_irqs();
idx = srcu_read_lock(&mpam_srcu);
list_for_each_entry_srcu(class, &mpam_classes, classes_list,
srcu_read_lock_held(&mpam_srcu))
srcu_read_lock_held(&mpam_srcu)) {
mpam_reset_class(class);
if (do_resctrl_exit)
mpam_resctrl_teardown_class(class);
}
srcu_read_unlock(&mpam_srcu, idx);
mutex_lock(&mpam_list_lock);
@@ -2694,6 +2922,7 @@ int mpam_apply_config(struct mpam_component *comp, u16 partid,
srcu_read_lock_held(&mpam_srcu)) {
arg.ris = ris;
mpam_touch_msc(msc, __write_config, &arg);
ris->in_reset_state = false;
}
mutex_unlock(&msc->cfg_lock);
}


@@ -12,22 +12,31 @@
#include <linux/jump_label.h>
#include <linux/llist.h>
#include <linux/mutex.h>
#include <linux/resctrl.h>
#include <linux/spinlock.h>
#include <linux/srcu.h>
#include <linux/types.h>
#include <asm/mpam.h>
#define MPAM_MSC_MAX_NUM_RIS 16
struct platform_device;
DECLARE_STATIC_KEY_FALSE(mpam_enabled);
#ifdef CONFIG_MPAM_KUNIT_TEST
#define PACKED_FOR_KUNIT __packed
#else
#define PACKED_FOR_KUNIT
#endif
/*
 * These 'mon' values must not alias an actual monitor, so must be larger than
 * U16_MAX, but must not be confused with an errno value, so must be smaller
 * than (u32)-SZ_4K.
* USE_PRE_ALLOCATED is used to avoid confusion with an actual monitor.
*/
#define USE_PRE_ALLOCATED (U16_MAX + 1)
static inline bool mpam_is_enabled(void)
{
return static_branch_likely(&mpam_enabled);
@@ -76,6 +85,8 @@ struct mpam_msc {
u8 pmg_max;
unsigned long ris_idxs;
u32 ris_max;
u32 iidr;
u16 quirks;
/*
* error_irq_lock is taken when registering/unregistering the error
@@ -119,6 +130,9 @@ struct mpam_msc {
void __iomem *mapped_hwpage;
size_t mapped_hwpage_sz;
/* Values only used on some platforms for quirks */
u32 t241_id;
struct mpam_garbage garbage;
};
@@ -207,6 +221,42 @@ struct mpam_props {
#define mpam_set_feature(_feat, x) __set_bit(_feat, (x)->features)
#define mpam_clear_feature(_feat, x) __clear_bit(_feat, (x)->features)
/* Workaround bits for msc->quirks */
enum mpam_device_quirks {
T241_SCRUB_SHADOW_REGS,
T241_FORCE_MBW_MIN_TO_ONE,
T241_MBW_COUNTER_SCALE_64,
IGNORE_CSU_NRDY,
MPAM_QUIRK_LAST
};
#define mpam_has_quirk(_quirk, x) ((x)->quirks & (1 << (_quirk)))
#define mpam_set_quirk(_quirk, x) ((x)->quirks |= (1 << (_quirk)))
struct mpam_quirk {
int (*init)(struct mpam_msc *msc, const struct mpam_quirk *quirk);
u32 iidr;
u32 iidr_mask;
enum mpam_device_quirks workaround;
};
#define MPAM_IIDR_MATCH_ONE (FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0xfff) | \
FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0xf) | \
FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0xf) | \
FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0xfff))
#define MPAM_IIDR_NVIDIA_T241 (FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0x241) | \
FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0) | \
FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0) | \
FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0x36b))
#define MPAM_IIDR_ARM_CMN_650 (FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0) | \
FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0) | \
FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0) | \
FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0x43b))
/* The values for MSMON_CFG_MBWU_FLT.RWBW */
enum mon_filter_options {
COUNT_BOTH = 0,
@@ -215,7 +265,11 @@ enum mon_filter_options {
};
struct mon_cfg {
u16 mon;
/*
* mon must be large enough to hold out-of-range values like
* USE_PRE_ALLOCATED
*/
u32 mon;
u8 pmg;
bool match_pmg;
bool csu_exclude_clean;
@@ -246,6 +300,7 @@ struct mpam_class {
struct mpam_props props;
u32 nrdy_usec;
u16 quirks;
u8 level;
enum mpam_class_types type;
@@ -266,10 +321,6 @@ struct mpam_config {
u32 mbw_pbm;
u16 mbw_max;
bool reset_cpbm;
bool reset_mbw_pbm;
bool reset_mbw_max;
struct mpam_garbage garbage;
};
@@ -337,6 +388,32 @@ struct mpam_msc_ris {
struct mpam_garbage garbage;
};
struct mpam_resctrl_dom {
struct mpam_component *ctrl_comp;
/*
* There is no single mon_comp because different events may be backed
* by different class/components. mon_comp is indexed by the event
* number.
*/
struct mpam_component *mon_comp[QOS_NUM_EVENTS];
struct rdt_ctrl_domain resctrl_ctrl_dom;
struct rdt_l3_mon_domain resctrl_mon_dom;
};
struct mpam_resctrl_res {
struct mpam_class *class;
struct rdt_resource resctrl_res;
bool cdp_enabled;
};
struct mpam_resctrl_mon {
struct mpam_class *class;
/* per-class data that resctrl needs will live here */
};
static inline int mpam_alloc_csu_mon(struct mpam_class *class)
{
struct mpam_props *cprops = &class->props;
@@ -381,6 +458,9 @@ extern u8 mpam_pmg_max;
void mpam_enable(struct work_struct *work);
void mpam_disable(struct work_struct *work);
/* Reset all the RIS in a class under cpus_read_lock() */
void mpam_reset_class_locked(struct mpam_class *class);
int mpam_apply_config(struct mpam_component *comp, u16 partid,
struct mpam_config *cfg);
@@ -391,6 +471,20 @@ void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx);
int mpam_get_cpumask_from_cache_id(unsigned long cache_id, u32 cache_level,
cpumask_t *affinity);
#ifdef CONFIG_RESCTRL_FS
int mpam_resctrl_setup(void);
void mpam_resctrl_exit(void);
int mpam_resctrl_online_cpu(unsigned int cpu);
void mpam_resctrl_offline_cpu(unsigned int cpu);
void mpam_resctrl_teardown_class(struct mpam_class *class);
#else
static inline int mpam_resctrl_setup(void) { return 0; }
static inline void mpam_resctrl_exit(void) { }
static inline int mpam_resctrl_online_cpu(unsigned int cpu) { return 0; }
static inline void mpam_resctrl_offline_cpu(unsigned int cpu) { }
static inline void mpam_resctrl_teardown_class(struct mpam_class *class) { }
#endif /* CONFIG_RESCTRL_FS */
/*
* MPAM MSCs have the following register layout. See:
* Arm Memory System Resource Partitioning and Monitoring (MPAM) System

File diff suppressed because it is too large.


@@ -0,0 +1,315 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (C) 2025 Arm Ltd.
/* This file is intended to be included into mpam_resctrl.c */
#include <kunit/test.h>
#include <linux/array_size.h>
#include <linux/bits.h>
#include <linux/math.h>
#include <linux/sprintf.h>
struct percent_value_case {
u8 pc;
u8 width;
u16 value;
};
/*
* Mysterious inscriptions taken from the union of ARM DDI 0598D.b,
* "Arm Architecture Reference Manual Supplement - Memory System
* Resource Partitioning and Monitoring (MPAM), for A-profile
* architecture", Section 9.8, "About the fixed-point fractional
* format" (exact percentage entries only) and ARM IHI0099B.a
* "MPAM system component specification", Section 9.3,
* "The fixed-point fractional format":
*/
static const struct percent_value_case percent_value_cases[] = {
/* Architectural cases: */
{ 1, 8, 1 }, { 1, 12, 0x27 }, { 1, 16, 0x28e },
{ 25, 8, 0x3f }, { 25, 12, 0x3ff }, { 25, 16, 0x3fff },
{ 33, 8, 0x53 }, { 33, 12, 0x546 }, { 33, 16, 0x5479 },
{ 35, 8, 0x58 }, { 35, 12, 0x598 }, { 35, 16, 0x5998 },
{ 45, 8, 0x72 }, { 45, 12, 0x732 }, { 45, 16, 0x7332 },
{ 50, 8, 0x7f }, { 50, 12, 0x7ff }, { 50, 16, 0x7fff },
{ 52, 8, 0x84 }, { 52, 12, 0x850 }, { 52, 16, 0x851d },
{ 55, 8, 0x8b }, { 55, 12, 0x8cb }, { 55, 16, 0x8ccb },
{ 58, 8, 0x93 }, { 58, 12, 0x946 }, { 58, 16, 0x9479 },
{ 75, 8, 0xbf }, { 75, 12, 0xbff }, { 75, 16, 0xbfff },
{ 80, 8, 0xcb }, { 80, 12, 0xccb }, { 80, 16, 0xcccb },
{ 88, 8, 0xe0 }, { 88, 12, 0xe13 }, { 88, 16, 0xe146 },
{ 95, 8, 0xf2 }, { 95, 12, 0xf32 }, { 95, 16, 0xf332 },
{ 100, 8, 0xff }, { 100, 12, 0xfff }, { 100, 16, 0xffff },
};
static void test_percent_value_desc(const struct percent_value_case *param,
char *desc)
{
snprintf(desc, KUNIT_PARAM_DESC_SIZE,
"pc=%d, width=%d, value=0x%.*x\n",
param->pc, param->width,
DIV_ROUND_UP(param->width, 4), param->value);
}
KUNIT_ARRAY_PARAM(test_percent_value, percent_value_cases,
test_percent_value_desc);
struct percent_value_test_info {
u32 pc; /* result of value-to-percent conversion */
u32 value; /* result of percent-to-value conversion */
u32 max_value; /* maximum raw value allowed by test params */
unsigned int shift; /* promotes raw testcase value to 16 bits */
};
/*
* Convert a reference percentage to a fixed-point MAX value and
* vice-versa, based on param (not test->param_value!)
*/
static void __prepare_percent_value_test(struct kunit *test,
struct percent_value_test_info *res,
const struct percent_value_case *param)
{
struct mpam_props fake_props = { };
/* Reject bogus test parameters that would break the tests: */
KUNIT_ASSERT_GE(test, param->width, 1);
KUNIT_ASSERT_LE(test, param->width, 16);
KUNIT_ASSERT_LT(test, param->value, 1 << param->width);
mpam_set_feature(mpam_feat_mbw_max, &fake_props);
fake_props.bwa_wd = param->width;
res->shift = 16 - param->width;
res->max_value = GENMASK_U32(param->width - 1, 0);
res->value = percent_to_mbw_max(param->pc, &fake_props);
res->pc = mbw_max_to_percent(param->value << res->shift, &fake_props);
}
static void test_get_mba_granularity(struct kunit *test)
{
int ret;
struct mpam_props fake_props = { };
/* Use MBW_MAX */
mpam_set_feature(mpam_feat_mbw_max, &fake_props);
fake_props.bwa_wd = 0;
KUNIT_EXPECT_FALSE(test, mba_class_use_mbw_max(&fake_props));
fake_props.bwa_wd = 1;
KUNIT_EXPECT_TRUE(test, mba_class_use_mbw_max(&fake_props));
/* Architectural maximum: */
fake_props.bwa_wd = 16;
KUNIT_EXPECT_TRUE(test, mba_class_use_mbw_max(&fake_props));
/* No usable control... */
fake_props.bwa_wd = 0;
ret = get_mba_granularity(&fake_props);
KUNIT_EXPECT_EQ(test, ret, 0);
fake_props.bwa_wd = 1;
ret = get_mba_granularity(&fake_props);
KUNIT_EXPECT_EQ(test, ret, 50); /* DIV_ROUND_UP(100, 1 << 1)% = 50% */
fake_props.bwa_wd = 2;
ret = get_mba_granularity(&fake_props);
KUNIT_EXPECT_EQ(test, ret, 25); /* DIV_ROUND_UP(100, 1 << 2)% = 25% */
fake_props.bwa_wd = 3;
ret = get_mba_granularity(&fake_props);
KUNIT_EXPECT_EQ(test, ret, 13); /* DIV_ROUND_UP(100, 1 << 3)% = 13% */
fake_props.bwa_wd = 6;
ret = get_mba_granularity(&fake_props);
KUNIT_EXPECT_EQ(test, ret, 2); /* DIV_ROUND_UP(100, 1 << 6)% = 2% */
fake_props.bwa_wd = 7;
ret = get_mba_granularity(&fake_props);
KUNIT_EXPECT_EQ(test, ret, 1); /* DIV_ROUND_UP(100, 1 << 7)% = 1% */
/* Granularity saturates at 1% */
fake_props.bwa_wd = 16; /* architectural maximum */
ret = get_mba_granularity(&fake_props);
KUNIT_EXPECT_EQ(test, ret, 1); /* DIV_ROUND_UP(100, 1 << 16)% = 1% */
}
static void test_mbw_max_to_percent(struct kunit *test)
{
const struct percent_value_case *param = test->param_value;
struct percent_value_test_info res;
/*
* Since the reference values in percent_value_cases[] all
* correspond to exact percentages, round-to-nearest will
* always give the exact percentage back when the MPAM max
* value has precision of 0.5% or finer. (Always true for the
* reference data, since they all specify 8 bits or more of
	 * precision.)
*
* So, keep it simple and demand an exact match:
*/
__prepare_percent_value_test(test, &res, param);
KUNIT_EXPECT_EQ(test, res.pc, param->pc);
}
static void test_percent_to_mbw_max(struct kunit *test)
{
const struct percent_value_case *param = test->param_value;
struct percent_value_test_info res;
__prepare_percent_value_test(test, &res, param);
KUNIT_EXPECT_GE(test, res.value, param->value << res.shift);
KUNIT_EXPECT_LE(test, res.value, (param->value + 1) << res.shift);
KUNIT_EXPECT_LE(test, res.value, res.max_value << res.shift);
/* No flexibility allowed for 0% and 100%! */
if (param->pc == 0)
KUNIT_EXPECT_EQ(test, res.value, 0);
if (param->pc == 100)
KUNIT_EXPECT_EQ(test, res.value, res.max_value << res.shift);
}
static const void *test_all_bwa_wd_gen_params(struct kunit *test, const void *prev,
char *desc)
{
uintptr_t param = (uintptr_t)prev;
if (param > 15)
return NULL;
param++;
snprintf(desc, KUNIT_PARAM_DESC_SIZE, "wd=%u\n", (unsigned int)param);
return (void *)param;
}
static unsigned int test_get_bwa_wd(struct kunit *test)
{
uintptr_t param = (uintptr_t)test->param_value;
KUNIT_ASSERT_GE(test, param, 1);
KUNIT_ASSERT_LE(test, param, 16);
return param;
}
static void test_mbw_max_to_percent_limits(struct kunit *test)
{
struct mpam_props fake_props = {0};
u32 max_value;
mpam_set_feature(mpam_feat_mbw_max, &fake_props);
fake_props.bwa_wd = test_get_bwa_wd(test);
max_value = GENMASK(15, 16 - fake_props.bwa_wd);
KUNIT_EXPECT_EQ(test, mbw_max_to_percent(max_value, &fake_props),
MAX_MBA_BW);
KUNIT_EXPECT_EQ(test, mbw_max_to_percent(0, &fake_props),
get_mba_min(&fake_props));
/*
* Rounding policy dependent 0% sanity-check:
* With round-to-nearest, the minimum mbw_max value really
* should map to 0% if there are at least 200 steps.
* (100 steps may be enough for some other rounding policies.)
*/
if (fake_props.bwa_wd >= 8)
KUNIT_EXPECT_EQ(test, mbw_max_to_percent(0, &fake_props), 0);
if (fake_props.bwa_wd < 8 &&
mbw_max_to_percent(0, &fake_props) == 0)
kunit_warn(test, "wd=%d: Test-suite/driver rounding policy mismatch?",
fake_props.bwa_wd);
}
/*
* Check that converting a percentage to mbw_max and back again (or, as
* appropriate, vice-versa) always restores the original value:
*/
static void test_percent_max_roundtrip_stability(struct kunit *test)
{
struct mpam_props fake_props = {0};
unsigned int shift;
u32 pc, max, pc2, max2;
mpam_set_feature(mpam_feat_mbw_max, &fake_props);
fake_props.bwa_wd = test_get_bwa_wd(test);
shift = 16 - fake_props.bwa_wd;
/*
* Converting a valid value from the coarser scale to the finer
* scale and back again must yield the original value:
*/
if (fake_props.bwa_wd >= 7) {
/* More than 100 steps: only test exact pc values: */
for (pc = get_mba_min(&fake_props); pc <= MAX_MBA_BW; pc++) {
max = percent_to_mbw_max(pc, &fake_props);
pc2 = mbw_max_to_percent(max, &fake_props);
KUNIT_EXPECT_EQ(test, pc2, pc);
}
} else {
/* Fewer than 100 steps: only test exact mbw_max values: */
for (max = 0; max < 1 << 16; max += 1 << shift) {
pc = mbw_max_to_percent(max, &fake_props);
max2 = percent_to_mbw_max(pc, &fake_props);
KUNIT_EXPECT_EQ(test, max2, max);
}
}
}
static void test_percent_to_max_rounding(struct kunit *test)
{
const struct percent_value_case *param = test->param_value;
unsigned int num_rounded_up = 0, total = 0;
struct percent_value_test_info res;
for (param = percent_value_cases, total = 0;
param < &percent_value_cases[ARRAY_SIZE(percent_value_cases)];
param++, total++) {
__prepare_percent_value_test(test, &res, param);
if (res.value > param->value << res.shift)
num_rounded_up++;
}
/*
* The MPAM driver applies a round-to-nearest policy, whereas a
* round-down policy seems to have been applied in the
* reference table from which the test vectors were selected.
*
* For a large and well-distributed suite of test vectors,
* about half should be rounded up and half down compared with
* the reference table. The actual test vectors are few in
* number and probably not very well distributed however, so
* tolerate a round-up rate of between 1/4 and 3/4 before
* crying foul:
*/
kunit_info(test, "Round-up rate: %u%% (%u/%u)\n",
DIV_ROUND_CLOSEST(num_rounded_up * 100, total),
num_rounded_up, total);
KUNIT_EXPECT_GE(test, 4 * num_rounded_up, 1 * total);
KUNIT_EXPECT_LE(test, 4 * num_rounded_up, 3 * total);
}
static struct kunit_case mpam_resctrl_test_cases[] = {
KUNIT_CASE(test_get_mba_granularity),
KUNIT_CASE_PARAM(test_mbw_max_to_percent, test_percent_value_gen_params),
KUNIT_CASE_PARAM(test_percent_to_mbw_max, test_percent_value_gen_params),
KUNIT_CASE_PARAM(test_mbw_max_to_percent_limits, test_all_bwa_wd_gen_params),
KUNIT_CASE(test_percent_to_max_rounding),
KUNIT_CASE_PARAM(test_percent_max_roundtrip_stability,
test_all_bwa_wd_gen_params),
{}
};
static struct kunit_suite mpam_resctrl_test_suite = {
.name = "mpam_resctrl_test_suite",
.test_cases = mpam_resctrl_test_cases,
};
kunit_test_suites(&mpam_resctrl_test_suite);


@@ -5,6 +5,7 @@
#define __LINUX_ARM_MPAM_H
#include <linux/acpi.h>
#include <linux/resctrl_types.h>
#include <linux/types.h>
struct mpam_msc;
@@ -49,6 +50,37 @@ static inline int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx,
}
#endif
bool resctrl_arch_alloc_capable(void);
bool resctrl_arch_mon_capable(void);
void resctrl_arch_set_cpu_default_closid(int cpu, u32 closid);
void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid);
void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 rmid);
void resctrl_arch_sched_in(struct task_struct *tsk);
bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid);
bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid);
u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid);
void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid);
u32 resctrl_arch_system_num_rmid_idx(void);
struct rdt_resource;
void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, enum resctrl_event_id evtid);
void resctrl_arch_mon_ctx_free(struct rdt_resource *r, enum resctrl_event_id evtid, void *ctx);
/*
* The CPU configuration for MPAM is cheap to write, and is only written if it
 * has changed. No need for fine-grained enables.
*/
static inline void resctrl_arch_enable_mon(void) { }
static inline void resctrl_arch_disable_mon(void) { }
static inline void resctrl_arch_enable_alloc(void) { }
static inline void resctrl_arch_disable_alloc(void) { }
static inline unsigned int resctrl_arch_round_mon_val(unsigned int val)
{
return val;
}
/**
* mpam_register_requestor() - Register a requestor with the MPAM driver
* @partid_max: The maximum PARTID value the requestor can generate.


@@ -324,7 +324,7 @@ static __always_inline void syscall_exit_to_user_mode(struct pt_regs *regs)
{
instrumentation_begin();
syscall_exit_to_user_mode_work(regs);
local_irq_disable_exit_to_user();
local_irq_disable();
syscall_exit_to_user_mode_prepare(regs);
instrumentation_end();
exit_to_user_mode();


@@ -109,37 +109,6 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
instrumentation_end();
}
/**
* local_irq_enable_exit_to_user - Exit to user variant of local_irq_enable()
* @ti_work: Cached TIF flags gathered with interrupts disabled
*
* Defaults to local_irq_enable(). Can be supplied by architecture specific
* code.
*/
static inline void local_irq_enable_exit_to_user(unsigned long ti_work);
#ifndef local_irq_enable_exit_to_user
static __always_inline void local_irq_enable_exit_to_user(unsigned long ti_work)
{
local_irq_enable();
}
#endif
/**
* local_irq_disable_exit_to_user - Exit to user variant of local_irq_disable()
*
* Defaults to local_irq_disable(). Can be supplied by architecture specific
* code.
*/
static inline void local_irq_disable_exit_to_user(void);
#ifndef local_irq_disable_exit_to_user
static __always_inline void local_irq_disable_exit_to_user(void)
{
local_irq_disable();
}
#endif
/**
* arch_exit_to_user_mode_work - Architecture specific TIF work for exit
* to user mode.
@@ -348,6 +317,8 @@ static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
*/
static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
{
lockdep_assert_irqs_disabled();
instrumentation_begin();
irqentry_exit_to_user_mode_prepare(regs);
instrumentation_end();
@@ -378,6 +349,207 @@ typedef struct irqentry_state {
} irqentry_state_t;
#endif
/**
* irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
*
* Conditional reschedule with additional sanity checks.
*/
void raw_irqentry_exit_cond_resched(void);
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched
#define irqentry_exit_cond_resched_dynamic_disabled NULL
DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
#define irqentry_exit_cond_resched() static_call(irqentry_exit_cond_resched)()
#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
void dynamic_irqentry_exit_cond_resched(void);
#define irqentry_exit_cond_resched() dynamic_irqentry_exit_cond_resched()
#endif
#else /* CONFIG_PREEMPT_DYNAMIC */
#define irqentry_exit_cond_resched() raw_irqentry_exit_cond_resched()
#endif /* CONFIG_PREEMPT_DYNAMIC */
/**
* irqentry_enter_from_kernel_mode - Establish state before invoking the irq handler
 * @regs: Pointer to current's pt_regs
*
* Invoked from architecture specific entry code with interrupts disabled.
* Can only be called when the interrupt entry came from kernel mode. The
* calling code must be non-instrumentable. When the function returns all
* state is correct and the subsequent functions can be instrumented.
*
* The function establishes state (lockdep, RCU (context tracking), tracing) and
* is provided for architectures which require a strict split between entry from
* kernel and user mode and therefore cannot use irqentry_enter() which handles
* both entry modes.
*
* Returns: An opaque object that must be passed to irqentry_exit_to_kernel_mode().
*/
static __always_inline irqentry_state_t irqentry_enter_from_kernel_mode(struct pt_regs *regs)
{
irqentry_state_t ret = {
.exit_rcu = false,
};
/*
* If this entry hit the idle task invoke ct_irq_enter() whether
* RCU is watching or not.
*
* Interrupts can nest when the first interrupt invokes softirq
* processing on return which enables interrupts.
*
* Scheduler ticks in the idle task can mark quiescent state and
* terminate a grace period, if and only if the timer interrupt is
* not nested into another interrupt.
*
 * Checking for rcu_is_watching() here would prevent the nesting
 * interrupt from invoking ct_irq_enter(). If that nested interrupt is
* the tick then rcu_flavor_sched_clock_irq() would wrongfully
* assume that it is the first interrupt and eventually claim
* quiescent state and end grace periods prematurely.
*
* Unconditionally invoke ct_irq_enter() so RCU state stays
* consistent.
*
* TINY_RCU does not support EQS, so let the compiler eliminate
* this part when enabled.
*/
if (!IS_ENABLED(CONFIG_TINY_RCU) &&
(is_idle_task(current) || arch_in_rcu_eqs())) {
/*
* If RCU is not watching then the same careful
* sequence vs. lockdep and tracing is required
* as in irqentry_enter_from_user_mode().
*/
lockdep_hardirqs_off(CALLER_ADDR0);
ct_irq_enter();
instrumentation_begin();
kmsan_unpoison_entry_regs(regs);
trace_hardirqs_off_finish();
instrumentation_end();
ret.exit_rcu = true;
return ret;
}
/*
* If RCU is watching then RCU only wants to check whether it needs
* to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
* already contains a warning when RCU is not watching, so no point
* in having another one here.
*/
lockdep_hardirqs_off(CALLER_ADDR0);
instrumentation_begin();
kmsan_unpoison_entry_regs(regs);
rcu_irq_enter_check_tick();
trace_hardirqs_off_finish();
instrumentation_end();
return ret;
}
/**
* irqentry_exit_to_kernel_mode_preempt - Run preempt checks on return to kernel mode
* @regs: Pointer to current's pt_regs
* @state: Return value from matching call to irqentry_enter_from_kernel_mode()
*
* This is to be invoked before irqentry_exit_to_kernel_mode_after_preempt() to
* allow kernel preemption on return from interrupt.
*
* Must be invoked with interrupts disabled and CPU state which allows kernel
* preemption.
*
* After returning from this function, the caller can modify CPU state before
* invoking irqentry_exit_to_kernel_mode_after_preempt(), which is required to
* re-establish the tracing, lockdep and RCU state for returning to the
* interrupted context.
*/
static inline void irqentry_exit_to_kernel_mode_preempt(struct pt_regs *regs,
irqentry_state_t state)
{
if (regs_irqs_disabled(regs) || state.exit_rcu)
return;
if (IS_ENABLED(CONFIG_PREEMPTION))
irqentry_exit_cond_resched();
hrtimer_rearm_deferred();
}
/**
* irqentry_exit_to_kernel_mode_after_preempt - Establish trace, lockdep and RCU state
* @regs: Pointer to current's pt_regs
* @state: Return value from matching call to irqentry_enter_from_kernel_mode()
*
* This is to be invoked after irqentry_exit_to_kernel_mode_preempt() and before
* actually returning to the interrupted context.
*
* There are no requirements for the CPU state other than being able to complete
* the tracing, lockdep and RCU state transitions. After this function returns
* the caller must return directly to the interrupted context.
*/
static __always_inline void
irqentry_exit_to_kernel_mode_after_preempt(struct pt_regs *regs, irqentry_state_t state)
{
if (!regs_irqs_disabled(regs)) {
/*
* If RCU was not watching on entry this needs to be done
* carefully and needs the same ordering of lockdep/tracing
* and RCU as the return to user mode path.
*/
if (state.exit_rcu) {
instrumentation_begin();
/* Tell the tracer that IRET will enable interrupts */
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare();
instrumentation_end();
ct_irq_exit();
lockdep_hardirqs_on(CALLER_ADDR0);
return;
}
instrumentation_begin();
/* Covers both tracing and lockdep */
trace_hardirqs_on();
instrumentation_end();
} else {
/*
* IRQ flags state is correct already. Just tell RCU if it
* was not watching on entry.
*/
if (state.exit_rcu)
ct_irq_exit();
}
}
/**
* irqentry_exit_to_kernel_mode - Run preempt checks and establish state after
* invoking the interrupt handler
* @regs: Pointer to current's pt_regs
* @state: Return value from matching call to irqentry_enter_from_kernel_mode()
*
* This is the counterpart of irqentry_enter_from_kernel_mode() and combines
* the calls to irqentry_exit_to_kernel_mode_preempt() and
* irqentry_exit_to_kernel_mode_after_preempt().
*
* The requirement for the CPU state is that it can schedule. After the function
* returns the tracing, lockdep and RCU state transitions are completed and the
* caller must return directly to the interrupted context.
*/
static __always_inline void irqentry_exit_to_kernel_mode(struct pt_regs *regs,
irqentry_state_t state)
{
lockdep_assert_irqs_disabled();
instrumentation_begin();
irqentry_exit_to_kernel_mode_preempt(regs, state);
instrumentation_end();
irqentry_exit_to_kernel_mode_after_preempt(regs, state);
}
/**
* irqentry_enter - Handle state tracking on ordinary interrupt entries
* @regs: Pointer to pt_regs of interrupted context
@@ -407,32 +579,10 @@ typedef struct irqentry_state {
* establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
* would not be possible.
*
* Returns: An opaque object that must be passed to idtentry_exit()
* Returns: An opaque object that must be passed to irqentry_exit()
*/
irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
/**
* irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
*
* Conditional reschedule with additional sanity checks.
*/
void raw_irqentry_exit_cond_resched(void);
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched
#define irqentry_exit_cond_resched_dynamic_disabled NULL
DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
#define irqentry_exit_cond_resched() static_call(irqentry_exit_cond_resched)()
#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
void dynamic_irqentry_exit_cond_resched(void);
#define irqentry_exit_cond_resched() dynamic_irqentry_exit_cond_resched()
#endif
#else /* CONFIG_PREEMPT_DYNAMIC */
#define irqentry_exit_cond_resched() raw_irqentry_exit_cond_resched()
#endif /* CONFIG_PREEMPT_DYNAMIC */
/**
* irqentry_exit - Handle return from exception that used irqentry_enter()
* @regs: Pointer to pt_regs (exception entry regs)


@@ -47,7 +47,7 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
 	 */
 	while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {
-		local_irq_enable_exit_to_user(ti_work);
+		local_irq_enable();
 
 		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
 			if (!rseq_grant_slice_extension(ti_work, TIF_SLICE_EXT_DENY))
@@ -74,7 +74,7 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
 	 * might have changed while interrupts and preemption was
 	 * enabled above.
 	 */
-	local_irq_disable_exit_to_user();
+	local_irq_disable();
 
 	/* Check if any of the above work has queued a deferred wakeup */
 	tick_nohz_user_enter_prepare();
@@ -105,70 +105,16 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 {
-	irqentry_state_t ret = {
-		.exit_rcu = false,
-	};
-
 	if (user_mode(regs)) {
+		irqentry_state_t ret = {
+			.exit_rcu = false,
+		};
+
 		irqentry_enter_from_user_mode(regs);
 		return ret;
 	}
 
-	/*
-	 * If this entry hit the idle task invoke ct_irq_enter() whether
-	 * RCU is watching or not.
-	 *
-	 * Interrupts can nest when the first interrupt invokes softirq
-	 * processing on return which enables interrupts.
-	 *
-	 * Scheduler ticks in the idle task can mark quiescent state and
-	 * terminate a grace period, if and only if the timer interrupt is
-	 * not nested into another interrupt.
-	 *
-	 * Checking for rcu_is_watching() here would prevent the nesting
-	 * interrupt to invoke ct_irq_enter(). If that nested interrupt is
-	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
-	 * assume that it is the first interrupt and eventually claim
-	 * quiescent state and end grace periods prematurely.
-	 *
-	 * Unconditionally invoke ct_irq_enter() so RCU state stays
-	 * consistent.
-	 *
-	 * TINY_RCU does not support EQS, so let the compiler eliminate
-	 * this part when enabled.
-	 */
-	if (!IS_ENABLED(CONFIG_TINY_RCU) &&
-	    (is_idle_task(current) || arch_in_rcu_eqs())) {
-		/*
-		 * If RCU is not watching then the same careful
-		 * sequence vs. lockdep and tracing is required
-		 * as in irqentry_enter_from_user_mode().
-		 */
-		lockdep_hardirqs_off(CALLER_ADDR0);
-		ct_irq_enter();
-		instrumentation_begin();
-		kmsan_unpoison_entry_regs(regs);
-		trace_hardirqs_off_finish();
-		instrumentation_end();
-
-		ret.exit_rcu = true;
-		return ret;
-	}
-
-	/*
-	 * If RCU is watching then RCU only wants to check whether it needs
-	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
-	 * already contains a warning when RCU is not watching, so no point
-	 * in having another one here.
-	 */
-	lockdep_hardirqs_off(CALLER_ADDR0);
-	instrumentation_begin();
-	kmsan_unpoison_entry_regs(regs);
-	rcu_irq_enter_check_tick();
-	trace_hardirqs_off_finish();
-	instrumentation_end();
-
-	return ret;
+	return irqentry_enter_from_kernel_mode(regs);
 }
 
 /**
@@ -212,45 +158,10 @@ void dynamic_irqentry_exit_cond_resched(void)
 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 {
 	lockdep_assert_irqs_disabled();
 
 	/* Check whether this returns to user mode */
-	if (user_mode(regs)) {
+	if (user_mode(regs))
 		irqentry_exit_to_user_mode(regs);
-	} else if (!regs_irqs_disabled(regs)) {
-		/*
-		 * If RCU was not watching on entry this needs to be done
-		 * carefully and needs the same ordering of lockdep/tracing
-		 * and RCU as the return to user mode path.
-		 */
-		if (state.exit_rcu) {
-			instrumentation_begin();
-			hrtimer_rearm_deferred();
-			/* Tell the tracer that IRET will enable interrupts */
-			trace_hardirqs_on_prepare();
-			lockdep_hardirqs_on_prepare();
-			instrumentation_end();
-			ct_irq_exit();
-			lockdep_hardirqs_on(CALLER_ADDR0);
-			return;
-		}
-
-		instrumentation_begin();
-		if (IS_ENABLED(CONFIG_PREEMPTION))
-			irqentry_exit_cond_resched();
-		hrtimer_rearm_deferred();
-		/* Covers both tracing and lockdep */
-		trace_hardirqs_on();
-		instrumentation_end();
-	} else {
-		/*
-		 * IRQ flags state is correct already. Just tell RCU if it
-		 * was not watching on entry.
-		 */
-		if (state.exit_rcu)
-			ct_irq_exit();
-	}
+	else
+		irqentry_exit_to_kernel_mode(regs, state);
 }
 
 irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)


@@ -56,7 +56,8 @@ static void atomics_sigill(void)
 static void cmpbr_sigill(void)
 {
 	/* Not implemented, too complicated and unreliable anyway */
+	asm volatile(".inst 0x74C00040\n"	/* CBEQ w0, w0, +8 */
+		     "udf #0" : : : "cc");	/* UDF #0 */
 }
 
 static void crc32_sigill(void)


@@ -124,6 +124,7 @@ static const struct reg_ftr_bits ftr_id_aa64isar2_el1[] = {
 static const struct reg_ftr_bits ftr_id_aa64isar3_el1[] = {
 	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, FPRCVT, 0),
+	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, LSUI, 0),
 	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, LSFE, 0),
 	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, FAMINMAX, 0),
 	REG_FTR_END,