1. What’s LPprof ?
Lpprof combines Linux-tools perf record and perf stat commands on parallel processes. It analyzes perf record samples and perf stat results to provide an HPC oriented performance summary.
2. Obtaining, Building and Installating LPprof
LPprof’s code is available at GitHub https://github.com/edf-hpc/lpprof. You can clone it using git:
$ git clone https://github.com/edf-hpc/lpprof.git
Debian packaging files are included, so if you are using a Debian or a Debian derivative system you only need to build a Debian package and install it. Before building the package, you need to install the following packages:
$ apt-get install asciidoctor debhelper dh-python python3-all python3-setuptools pandoc dpkg-dev
If you are not using Debian, Lpprof provides a setuptools script for its installation. Just run:
$ python setup.py install
3. Profiling with LPprof
3.1. Commands
Monoprocessus profiling:
$ lpprof --frequency=<sampling_frequency(Hz)> <exe>
Profile a process given its PID (or a list of processes by giving a list of their pids):
$ lpprof --frequency=<sampling_frequency(Hz)> --pids <pid1,pid2,...> <exe>
If pids are split on different hosts it is possible to precise hostnames and ranks for each pid :
$ lpprof --frequency=<sampling_frequency(Hz)> --pids <rank1:hostname1:pid1,rank2,hostname2,pid2,...> <exe>
Profile a program executed with a parallel launcher (ex: srun):
$ lpprof --launcher=<launcher> --frequency=<sampling_frequency(Hz)> <exe>
3.2. Results
LPprof makes a perf_<date> directory that contains the following files at the end of the run :
perf_data.<rank> (perf record output) perf_stat.<rank> (perf stat output) LPprof_perf_report (lpprof performance report)
4. Lpprof performance report
Lpprof performance report provides the following metrics. For each metric minimum, maximum and average values across ranks are given with associated rank numbers.
hwc counter metrics:
Hardware counters like cycles or cpu-clock and metrics derived from hardware counters like instructions per cycle or cycles spent due to TLBmiss which is computed as the following ratio :
(ITLB_MISSES.WALK_DURATION + DTLB_MISSES.WALK_DURATION)*100 / cycles
Theses counters are available on modern Intel CPU at least from Sandy-Bridge to Kaby-Lake architectures.
vectorization metrics:
SSE, AVX and AVX2 proportion are computed from assembly instructions samples. Only double precision operations are taken into account.
asm metrics:
Most frequently used assembly instructions (top 95%). Assembly instructions that are reported correspond to instruction pointer addresses found in perf samples.
sym metrics:
Most frequently used symbols (top 95%) found in samples collected with perf record.
5. LPprof Slurm Spank Plugin
The Slurm Spank plugin:
srun --lpprof_f=<frequency>
It is possible to profile only certain ranks by giving a list of ranks to profile:
srun --lpprof_f=<frequency> --lpprof_r=<ranks>
Example for a profiling at 99Hz of ranks 0,7,8,9:
srun --lpprof_f=99 --lpprof_r=0,7-9
When lpprof is used throught the spank plugin the ouput directory is named perf_<slurm_job_id>.