mirror of
https://github.com/torvalds/linux.git
synced 2026-04-18 06:44:00 -04:00
Adds NVIDIA C2C PMU support in Tegra410 SOC. This PMU is used to measure memory latency between the SOC and device memory, e.g GPU Memory (GMEM), CXL Memory, or memory on remote Tegra410 SOC. Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com> Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com> Signed-off-by: Will Deacon <will@kernel.org>
=====================================================================
NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
=====================================================================

The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Unified Coherence Fabric (UCF)
* PCIE
* PCIE-TGT
* CPU Memory (CMEM) Latency
* NVLink-C2C
* NV-CLink
* NV-DLink
|
|
|
|
PMU Driver
----------

The PMU driver describes the available events and configuration of each PMU in
sysfs. Please see the sections below to get the sysfs path of each PMU. Like
other uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to show
the CPU id used to handle the PMU event. There is also an "associated_cpus"
sysfs attribute, which contains a list of CPUs associated with the PMU instance.
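
As a small illustration of the two attributes described above, the following
shell sketch prints them for every PMU device whose name starts with "nvidia_".
The function name ``show_pmu_cpus`` and its optional base-directory argument are
hypothetical helpers introduced only for this example; the "cpumask" and
"associated_cpus" attribute names are the ones provided by the driver::

    #!/bin/bash
    # show_pmu_cpus [BASE_DIR] (hypothetical helper, not part of the driver)
    # Print the "cpumask" and "associated_cpus" attributes of every
    # nvidia_* PMU device found under BASE_DIR.
    show_pmu_cpus() {
        local base=${1:-/sys/bus/event_source/devices} dev
        for dev in "$base"/nvidia_*; do
            # Skip entries without a readable cpumask (this also covers
            # the case where the glob matched nothing).
            [ -r "$dev/cpumask" ] || continue
            printf '%s: cpumask=%s associated_cpus=%s\n' \
                "${dev##*/}" \
                "$(cat "$dev/cpumask")" \
                "$(cat "$dev/associated_cpus" 2>/dev/null)"
        done
    }

    show_pmu_cpus

On a system without these PMUs the function prints nothing.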
|
|
|
|
UCF PMU
-------

The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
distributed last-level cache for CPU Memory and CXL Memory, and as a
cache-coherent interconnect that supports hardware coherence across multiple
coherently caching agents, including:

* CPU clusters
* GPU
* PCIe Ordering Controller Unit (OCU)
* Other IO-coherent requesters

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.
|
|
|
|
Some of the events available in this PMU can be used to measure bandwidth and
utilization:

* slc_access_rd: count the number of read requests to SLC.
* slc_access_wr: count the number of write requests to SLC.
* slc_bytes_rd: count the number of bytes transferred by slc_access_rd.
* slc_bytes_wr: count the number of bytes transferred by slc_access_wr.
* mem_access_rd: count the number of read requests to local or remote memory.
* mem_access_wr: count the number of write requests to local or remote memory.
* mem_bytes_rd: count the number of bytes transferred by mem_access_rd.
* mem_bytes_wr: count the number of bytes transferred by mem_access_wr.
* cycles: counts the UCF cycles.

The average bandwidth is calculated as::

    AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
    AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
    AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
    AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

    AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
    AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
    AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
    AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES
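
The formulas above can be evaluated directly from raw counter values. The
following is a minimal sketch, assuming the byte, request, and cycle counts and
the elapsed time have already been extracted from the perf output. The helper
names and the sample numbers are made up for illustration::

    #!/bin/bash
    # Hypothetical helpers; they are not part of perf or the PMU driver.

    # avg_bandwidth_gbps BYTES ELAPSED_TIME_IN_NS
    # Bytes per nanosecond is numerically equal to GB/s (1 GB = 10^9 bytes).
    avg_bandwidth_gbps() {
        awk -v bytes="$1" -v ns="$2" 'BEGIN { printf "%.3f\n", bytes / ns }'
    }

    # avg_request_rate REQUESTS CYCLES
    # Average number of requests per UCF cycle.
    avg_request_rate() {
        awk -v req="$1" -v cyc="$2" 'BEGIN { printf "%.3f\n", req / cyc }'
    }

    # Made-up sample: 64e9 bytes in 2 s, and 1e9 requests in 4e9 cycles.
    avg_bandwidth_gbps 64000000000 2000000000   # prints 32.000
    avg_request_rate 1000000000 4000000000      # prints 0.250

The same two helpers apply to any of the byte/request counter pairs listed
above, since all four bandwidth formulas and all four rate formulas share the
same shape.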
|
|
|
|
More details about what other events are available can be found in the Tegra410
SoC technical reference manual.

The events can be filtered based on source or destination. The source filter
indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
remote socket. The destination filter specifies the destination memory type,
e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
local/remote classification of the destination filter is based on the home
socket of the address, not where the data actually resides. The available
filters are described in
/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.

The list of UCF PMU event filters:

* Source filter:

  * src_loc_cpu: if set, count events from local CPU
  * src_loc_noncpu: if set, count events from local non-CPU device
  * src_rem: if set, count events from CPU, GPU, PCIE devices of remote socket

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_other: if set, count events to local CXL memory address
  * dst_rem: if set, count events to CPU, GPU, and CXL memory address of remote socket

If the source is not specified, the PMU will count events from all sources. If
the destination is not specified, the PMU will count events to all destinations.
|
|
|
|
Example usage:

* Count event id 0x0 in socket 0 from all sources and to all destinations::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0/

* Count event id 0x0 in socket 0 with source filter = local CPU and destination
  filter = local system memory (CMEM)::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/

* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
  destination filter = remote memory::

    perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
|
|
|
|
PCIE PMU
--------

This PMU is located in the SoC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors all read/write traffic from the root port(s)
or a particular BDF in a PCIE RC to local or remote memory. There is one PMU per
PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into
up to 8 root ports. The traffic from each root port can be filtered using the RP
or BDF filter. For example, specifying "src_rp_mask=0xFF" means the PMU counter
will capture traffic from all RPs. Please see below for more details.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.
|
|
|
|
The events in this PMU can be used to measure bandwidth, utilization, and
latency:

* rd_req: count the number of read requests by PCIE device.
* wr_req: count the number of write requests by PCIE device.
* rd_bytes: count the number of bytes transferred by rd_req.
* wr_bytes: count the number of bytes transferred by wr_req.
* rd_cum_outs: count outstanding rd_req each cycle.
* cycles: count the clock cycles of SOC fabric connected to the PCIE interface.

The average bandwidth is calculated as::

    AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
    AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

    AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
    AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The average latency is calculated as::

    FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
    AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
    AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
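
The three latency steps above can be combined into one calculation. The sketch
below assumes the counter values and the elapsed time have already been read
from the perf output. The function name and the sample numbers are hypothetical
and only illustrate the arithmetic::

    #!/bin/bash
    # avg_latency_ns RD_CUM_OUTS RD_REQ CYCLES ELAPSED_TIME_IN_NS
    # Hypothetical helper that applies the latency formulas above.
    avg_latency_ns() {
        awk -v outs="$1" -v req="$2" -v cyc="$3" -v ns="$4" 'BEGIN {
            # Avoid dividing by zero when nothing was counted.
            if (req == 0 || cyc == 0 || ns == 0) { print "n/a"; exit }
            freq_ghz = cyc / ns          # cycles per ns is the clock in GHz
            lat_cycles = outs / req      # average cycles a read is in flight
            printf "%.2f\n", lat_cycles / freq_ghz
        }'
    }

    # Made-up sample: 1000 reads accumulating 500000 outstanding cycles,
    # measured over 1 s with a 2 GHz fabric clock.
    avg_latency_ns 500000 1000 2000000000 1000000000   # prints 250.00

Because the frequency is derived from the cycles counter, the three events
should be collected over the same interval, for example in one perf event
group.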
|
|
|
|
The PMU events can be filtered based on the traffic source and destination.
The source filter indicates the PCIE devices that will be monitored. The
destination filter specifies the destination memory type, e.g. local system
memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
classification of the destination filter is based on the home socket of the
address, not where the data actually resides. These filters can be found in
/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
|
|
|
|
The list of event filters:

* Source filter:

  * src_rp_mask: bitmask of root ports that will be monitored. Each bit in this
    bitmask represents the RP index in the RC. If the bit is set, all devices under
    the associated RP will be monitored. E.g. "src_rp_mask=0xF" will monitor
    devices in root port 0 to 3.
  * src_bdf: the BDF that will be monitored. This is a 16-bit value that
    follows the formula: (bus << 8) + (device << 3) + (function). For example,
    the value of BDF 27:01.1 is 0x2709.
  * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
    "src_bdf" is used to filter the traffic.

  Note that the Root-Port and BDF filters are mutually exclusive, and the PMU in
  each RC has a single BDF filter that is shared by all of its counters. If the
  BDF filter is enabled, the BDF filter value is applied to all events.
|
|
|
|
* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_pcie_p2p: if set, count events to local PCIE peer address
  * dst_loc_pcie_cxl: if set, count events to local CXL memory address
  * dst_rem: if set, count events to remote memory address

If the source filter is not specified, the PMU will count events from all root
ports. If the destination filter is not specified, the PMU will count events
to all destinations.
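
The "src_bdf" value can be derived from the bus, device, and function numbers
printed by lspci. The sketch below applies the formula
(bus << 8) + (device << 3) + (function) exactly as written in the filter
description. The function name is hypothetical, and the arguments are assumed
to be hexadecimal fields without a "0x" prefix, as lspci prints them::

    #!/bin/bash
    # encode_src_bdf BUS DEVICE FUNCTION
    # Hypothetical helper; arguments are hex strings such as "27" "01" "1".
    encode_src_bdf() {
        local bus=$((16#$1)) dev=$((16#$2)) fn=$((16#$3))
        # 8 bits of bus, 5 bits of device, 3 bits of function.
        printf '0x%04x\n' $(( (bus << 8) + (dev << 3) + fn ))
    }

    encode_src_bdf 27 01 1   # prints 0x2709
    encode_src_bdf 01 01 0   # prints 0x0108

The result can be passed as "src_bdf=<value>" together with "src_bdf_en=0x1".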
|
|
|
|
Example usage:

* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/

* Count event id 0x1 from root port 0 and 1 of PCIE RC-1 on socket 0 and
  targeting just local CMEM of socket 0::

    perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/

* Count event id 0x3 from root port 0 and 1 of PCIE RC-3 on socket 1 and
  targeting just local CMEM of socket 1::

    perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/
|
|
|
|
* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/
|
|
|
|
.. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section:

Mapping the RC# to lspci segment number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mapping the RC# to lspci segment number can be non-trivial; hence a new NVIDIA
Designated Vendor Specific Capability (DVSEC) register is added into the PCIE config space
for each RP. This DVSEC has vendor id "10de" and DVSEC id of "0x4". The DVSEC register
contains the following information to map PCIE devices under the RP back to its RC#:

- Bus# (byte 0xc) : bus number as reported by the lspci output
- Segment# (byte 0xd) : segment number as reported by the lspci output
- RP# (byte 0xe) : port number as reported by LnkCap attribute from lspci for a device with Root Port capability
- RC# (byte 0xf): root complex number associated with the RP
- Socket# (byte 0x10): socket number associated with the RP
|
|
|
|
Example script for mapping lspci BDF to RC# and socket#::

    #!/bin/bash
    # Needs gawk: the three-argument match() is a gawk extension.
    # Reading PCIE capabilities usually requires root.
    while read -r bdf rest; do
        # Find the config space offset of the NVIDIA DVSEC with id 0x4.
        dvsec4_reg=$(lspci -vv -s "$bdf" | awk '
            /Designated Vendor-Specific: Vendor=10de ID=0004/ {
                match($0, /\[([0-9a-fA-F]+)/, arr);
                print "0x" arr[1];
                exit
            }
        ')
        if [ -n "$dvsec4_reg" ]; then
            # Read one byte at each documented offset inside the DVSEC.
            bus=$(setpci -s "$bdf" "$(printf '0x%x' $((${dvsec4_reg} + 0xc))).b")
            segment=$(setpci -s "$bdf" "$(printf '0x%x' $((${dvsec4_reg} + 0xd))).b")
            rp=$(setpci -s "$bdf" "$(printf '0x%x' $((${dvsec4_reg} + 0xe))).b")
            rc=$(setpci -s "$bdf" "$(printf '0x%x' $((${dvsec4_reg} + 0xf))).b")
            socket=$(setpci -s "$bdf" "$(printf '0x%x' $((${dvsec4_reg} + 0x10))).b")
            echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
        fi
    done < <(lspci -d 10de:)
|
|
|
|
Example output::

    0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
    0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
    0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
    0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
    0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
    0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
    0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
    0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
    0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
    0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
    0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
    0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
    000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
    000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
    000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
    000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
    000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
    000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
    000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
    000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
    000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
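
A line of this output can be turned into the name of the matching PCIE PMU
device. The following sketch is only an illustration: the function name is
hypothetical, and it assumes that the socket and RC ids in the sysfs device
name are the decimal form of the hexadecimal bytes read from the DVSEC::

    #!/bin/bash
    # pcie_pmu_name MAPPING_LINE
    # Hypothetical helper: build nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>
    # from one line printed by the mapping script above.
    pcie_pmu_name() {
        local re='RC=([0-9a-fA-F]+), Socket=([0-9a-fA-F]+)'
        [[ $1 =~ $re ]] || return 1
        # The DVSEC bytes are printed in hex; convert them to decimal
        # (assumption: the sysfs names use decimal ids).
        printf 'nvidia_pcie_pmu_%d_rc_%d\n' \
            "$((16#${BASH_REMATCH[2]}))" "$((16#${BASH_REMATCH[1]}))"
    }

    pcie_pmu_name "0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00"
    # prints nvidia_pcie_pmu_0_rc_4

The printed name can be checked against the entries under
/sys/bus/event_source/devices/ before it is used in a perf command.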
|
|
|
|
PCIE-TGT PMU
------------

This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors traffic targeting PCIE BAR and CXL HDM ranges.
There is one PCIE-TGT PMU per PCIE RC in the SoC. Each RC in Tegra410 SoC can
have up to 16 lanes that can be bifurcated into up to 8 root ports (RP). The PMU
provides RP filter to count PCIE BAR traffic to each RP and address filter to
count access to PCIE BAR or CXL HDM ranges. The details of the filters are
described in the following sections.

Mapping the RC# to lspci segment number is similar to the PCIE PMU. Please see
:ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>.

The events in this PMU can be used to measure bandwidth and utilization:

* rd_req: count the number of read requests to PCIE.
* wr_req: count the number of write requests to PCIE.
* rd_bytes: count the number of bytes transferred by rd_req.
* wr_bytes: count the number of bytes transferred by wr_req.
* cycles: count the clock cycles of SOC fabric connected to the PCIE interface.

The average bandwidth is calculated as::

    AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
    AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

    AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
    AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The PMU events can be filtered based on the destination root port or target
address range. Filtering based on RP is only available for PCIE BAR traffic.
Address filter works for both PCIE BAR and CXL HDM ranges. These filters can be
found in sysfs, see
/sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/.

Destination filter settings:

* dst_rp_mask: bitmask to select the root port(s) to monitor. E.g. "dst_rp_mask=0xFF"
  corresponds to all root ports (from 0 to 7) in the PCIE RC. Note that this filter is
  only available for PCIE BAR traffic.
* dst_addr_base: BAR or CXL HDM filter base address.
* dst_addr_mask: BAR or CXL HDM filter address mask.
* dst_addr_en: enable BAR or CXL HDM address range filter. If this is set, the
  address range specified by "dst_addr_base" and "dst_addr_mask" will be used to filter
  the PCIE BAR and CXL HDM traffic address. The PMU uses the following comparison
  to determine if the traffic destination address falls within the filter range::

    (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)

  If the comparison succeeds, then the event will be counted.

If the destination filter is not specified, the RP filter will be configured by default
to count PCIE BAR traffic to all root ports.
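
The address comparison used by "dst_addr_en" can be tried out in the shell
before choosing a base and mask. This is a minimal sketch of the comparison
only; the function name is hypothetical and the addresses are sample values::

    #!/bin/bash
    # addr_filter_match ADDR BASE MASK
    # Hypothetical helper: exit status 0 if a transaction to ADDR would
    # pass the filter, i.e. (ADDR & MASK) == (BASE & MASK).
    addr_filter_match() {
        local addr=$(($1)) base=$(($2)) mask=$(($3))
        (( (addr & mask) == (base & mask) ))
    }

    addr_filter_match 0x10080 0x10000 0xFFF00 && echo "0x10080: counted"
    addr_filter_match 0x10100 0x10000 0xFFF00 || echo "0x10100: not counted"

Address bits that are not set in the mask are ignored by the comparison. With
the mask 0xFFF00, for example, 0x110080 also matches the base 0x10000, so the
mask should cover every upper address bit that needs to be distinguished.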
|
|
|
|
Example usage:

* Count event id 0x0 to root port 0 and 1 of PCIE RC-0 on socket 0::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/

* Count event id 0x1 for accesses to PCIE BAR or CXL HDM address range
  0x10000 to 0x100FF on socket 0's PCIE RC-1::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/
|
|
|
|
CPU Memory (CMEM) Latency PMU
-----------------------------

This PMU monitors latency events of memory read requests from the edge of the
Unified Coherence Fabric (UCF) to local CPU DRAM:

* RD_REQ counters: count read requests (32B per request).
* RD_CUM_OUTS counters: accumulated outstanding request counter, which tracks
  how many cycles the read requests are in flight.
* CYCLES counter: counts the number of elapsed cycles.

The average latency is calculated as::

    FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
    AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
    AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>.

Example usage::

    perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}'
|
|
|
|
NVLink-C2C PMU
--------------

This PMU monitors latency events of memory read/write requests that pass through
the NVIDIA Chip-to-Chip (C2C) interface. Bandwidth events are not available
in this PMU, unlike the C2C PMU in Grace (Tegra241 SoC).

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>.

The list of events:

* IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
* IN_RD_REQ: the number of incoming read requests.
* IN_WR_CUM_OUTS: accumulated outstanding request (in cycles) of incoming write requests.
* IN_WR_REQ: the number of incoming write requests.
* OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
* OUT_RD_REQ: the number of outgoing read requests.
* OUT_WR_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing write requests.
* OUT_WR_REQ: the number of outgoing write requests.
* CYCLES: NVLink-C2C interface cycle counts.

The incoming events count the reads/writes from remote device to the SoC.
The outgoing events count the reads/writes from the SoC to remote device.

The sysfs /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer
contains the information about the connected device.

When the C2C interface is connected to GPU(s), the user can use the
"gpu_mask" parameter to filter traffic to/from specific GPU(s). Each bit represents the GPU
index, e.g. "gpu_mask=0x1" corresponds to GPU 0 and "gpu_mask=0x3" is for GPU 0 and 1.
If "gpu_mask" is not specified, the PMU monitors all GPUs.
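
Since each bit of "gpu_mask" selects one GPU index, the mask for an arbitrary
set of GPUs can be built by OR-ing one bit per index. The helper below is a
hypothetical sketch of that arithmetic and is not provided by perf or the
driver::

    #!/bin/bash
    # make_mask INDEX...
    # Hypothetical helper: set bit N for every index N given as argument.
    make_mask() {
        local mask=0 idx
        for idx in "$@"; do
            mask=$(( mask | (1 << idx) ))
        done
        printf '0x%x\n' "$mask"
    }

    make_mask 0      # prints 0x1
    make_mask 0 1    # prints 0x3

    # Possible use in an event string:
    #   perf stat -a -e "nvidia_nvlink_c2c_pmu_0/in_rd_req,gpu_mask=$(make_mask 0 1)/"

The same bit-per-index arithmetic applies to the root port masks of the PCIE
and PCIE-TGT PMUs.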
|
|
|
|
When connected to another SoC, only the read events are available.

The events can be used to calculate the average latency of the read/write requests::

    C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

    IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
    IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

    IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ
    IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

    OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
    OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

    OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ
    OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

Example usage:

* Count incoming traffic from all GPUs connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/

* Count incoming traffic from GPU 0 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/

* Count incoming traffic from GPU 1 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/

* Count outgoing traffic to all GPUs connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/

* Count outgoing traffic to GPU 0 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/

* Count outgoing traffic to GPU 1 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/
|
|
|
|
NV-CLink PMU
------------

This PMU monitors latency events of memory read requests that pass through
the NV-CLINK interface. Bandwidth events are not available in this PMU.
In Tegra410 SoC, the NV-CLink interface is used to connect to another Tegra410
SoC and this PMU only counts read traffic.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>.

The list of events:

* IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
* IN_RD_REQ: the number of incoming read requests.
* OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
* OUT_RD_REQ: the number of outgoing read requests.
* CYCLES: NV-CLINK interface cycle counts.

The incoming events count the reads from remote device to the SoC.
The outgoing events count the reads from the SoC to remote device.

The events can be used to calculate the average latency of the read requests::

    CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

    IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
    IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ

    OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
    OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ

Example usage:

* Count incoming read traffic from remote SoC connected via NV-CLINK::

    perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/

* Count outgoing read traffic to remote SoC connected via NV-CLINK::

    perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/
|
|
|
|
NV-DLink PMU
------------

This PMU monitors latency events of memory read requests that pass through
the NV-DLINK interface. Bandwidth events are not available in this PMU.
In Tegra410 SoC, this PMU only counts CXL memory read traffic.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>.

The list of events:

* IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory.
* IN_RD_REQ: the number of read requests to CXL memory.
* CYCLES: NV-DLINK interface cycle counts.

The events can be used to calculate the average latency of the read requests::

    DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

    IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
    IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ

Example usage:

* Count read events to CXL memory::

    perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}'