1. What is LPprof?

LPprof combines the perf record and perf stat commands from linux-tools and runs them on parallel processes. It analyzes the perf record samples and the perf stat results to produce an HPC-oriented performance summary.

2. Obtaining, Building and Installing LPprof

LPprof’s code is available on GitHub at https://github.com/edf-hpc/lpprof. You can clone it using git:

$ git clone https://github.com/edf-hpc/lpprof.git

Debian packaging files are included, so on Debian or a Debian derivative you only need to build a Debian package and install it. Before building the package, install the following build dependencies:

$ apt-get install asciidoctor debhelper dh-python python3-all python3-setuptools pandoc dpkg-dev

If you are not using Debian, LPprof provides a setuptools script for installation. Just run:

$ python setup.py install

3. Profiling with LPprof

3.1. Commands

Single-process profiling:

$ lpprof --frequency=<sampling_frequency(Hz)> <exe>

Profile a process given its PID (or several processes given a comma-separated list of their PIDs):

$ lpprof --frequency=<sampling_frequency(Hz)> --pids <pid1,pid2,...> <exe>

If the PIDs are spread across different hosts, you can specify the hostname and rank for each PID:

$ lpprof --frequency=<sampling_frequency(Hz)> --pids <rank1:hostname1:pid1,rank2:hostname2:pid2,...> <exe>
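
When the ranks, hostnames and PIDs come from a script, the --pids value can be assembled programmatically. The following is a minimal sketch, assuming each entry has the form rank:hostname:pid and that entries are separated by commas. The helper name format_pids_arg and the sample values are hypothetical and not part of LPprof.

```python
def format_pids_arg(entries):
    """Build a --pids value from (rank, hostname, pid) triples.

    Assumes the rank:hostname:pid entry form, with entries
    joined by commas. This helper is not part of LPprof.
    """
    return ",".join(f"{rank}:{host}:{pid}" for rank, host, pid in entries)


# Hypothetical ranks, hostnames and PIDs, for illustration only.
arg = format_pids_arg([(0, "node01", 4242), (1, "node02", 4310)])
print(arg)  # 0:node01:4242,1:node02:4310
```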

Profile a program executed with a parallel launcher (e.g. srun):

$ lpprof --launcher=<launcher> --frequency=<sampling_frequency(Hz)> <exe>

3.2. Results

At the end of the run, LPprof creates a perf_<date> directory that contains the following files:

perf_data.<rank> (perf record output)
perf_stat.<rank> (perf stat output)
LPprof_perf_report (lpprof performance report)
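
For post-processing, the per-rank files can be paired up by their rank suffix. The sketch below only parses file names following the layout listed above. It assumes the <rank> suffix is a decimal integer, and the function name group_by_rank is hypothetical.

```python
from collections import defaultdict


def group_by_rank(file_names):
    """Map each rank to the per-rank output files found for it.

    Assumes names of the form perf_data.<rank> and perf_stat.<rank>,
    where <rank> is a decimal integer. Other names are ignored.
    """
    by_rank = defaultdict(dict)
    for name in file_names:
        for prefix, kind in (("perf_data.", "record"), ("perf_stat.", "stat")):
            suffix = name[len(prefix):]
            if name.startswith(prefix) and suffix.isdigit():
                by_rank[int(suffix)][kind] = name
    return dict(by_rank)


# Hypothetical directory listing for a two-rank run.
names = ["perf_data.0", "perf_stat.0", "perf_data.1", "perf_stat.1",
         "LPprof_perf_report"]
print(group_by_rank(names))
```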

4. LPprof Performance Report

The LPprof performance report provides the following metrics. For each metric, the minimum, maximum and average values across ranks are given, together with the associated rank numbers.
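
The per-metric summary can be illustrated as follows. This is a sketch of the aggregation described above, not LPprof's own code. The function name summarize and the sample values are hypothetical, and it assumes that rank numbers are attached to the minimum and maximum only.

```python
def summarize(values_by_rank):
    """Return the min and max (each with its rank) and the average."""
    if not values_by_rank:
        raise ValueError("at least one rank is required")
    min_rank = min(values_by_rank, key=values_by_rank.get)
    max_rank = max(values_by_rank, key=values_by_rank.get)
    avg = sum(values_by_rank.values()) / len(values_by_rank)
    return {
        "min": (values_by_rank[min_rank], min_rank),
        "max": (values_by_rank[max_rank], max_rank),
        "avg": avg,
    }


# Hypothetical instructions-per-cycle values for four ranks.
ipc = {0: 1.5, 1: 1.0, 2: 2.0, 3: 1.5}
print(summarize(ipc))
```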

hwc counter metrics:

Raw hardware counters such as cycles or cpu-clock, and metrics derived from them, such as instructions per cycle or the percentage of cycles spent on TLB misses. The latter is computed as the following ratio:

(ITLB_MISSES.WALK_DURATION + DTLB_MISSES.WALK_DURATION)*100 / cycles

These counters are available on Intel CPUs from at least the Sandy Bridge to the Kaby Lake architectures.
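
The TLB-miss ratio above can be written as a small function. This is a minimal sketch in which the parameter names mirror the counter names in the formula. The function name and the sample counter values are hypothetical.

```python
def tlb_miss_cycles_percent(itlb_walk_duration, dtlb_walk_duration, cycles):
    """Percentage of cycles spent in page walks caused by TLB misses.

    Implements (ITLB_MISSES.WALK_DURATION + DTLB_MISSES.WALK_DURATION)
    * 100 / cycles.
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return (itlb_walk_duration + dtlb_walk_duration) * 100 / cycles


# Hypothetical counter values: 10,000 walk cycles out of 1,000,000 cycles.
print(tlb_miss_cycles_percent(2_000, 8_000, 1_000_000))  # 1.0
```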

vectorization metrics:

The proportions of SSE, AVX and AVX2 instructions are computed from the sampled assembly instructions. Only double-precision operations are taken into account.

asm metrics:

Most frequently used assembly instructions (top 95%). The reported instructions correspond to the instruction pointer addresses found in the perf samples.
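
One way to read "top 95%" is as a cumulative cut-off: keep the most frequent items until together they cover 95% of the samples. The sketch below implements that reading. It is an assumption about the selection rule, not a description of LPprof's code, and the function name top_share and the sample mnemonics are hypothetical.

```python
from collections import Counter


def top_share(samples, share=0.95):
    """Return (item, count) pairs, most frequent first, until their
    cumulative count reaches `share` of all samples.

    Assumed reading of "top 95%"; LPprof's actual rule may differ.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    selected, covered = [], 0
    for item, count in counts.most_common():
        if covered >= share * total:
            break
        selected.append((item, count))
        covered += count
    return selected


# Hypothetical sampled instruction mnemonics: 100 samples in total.
samples = ["vmulpd"] * 60 + ["vaddpd"] * 36 + ["mov"] * 4
print(top_share(samples))  # [('vmulpd', 60), ('vaddpd', 36)]
```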

sym metrics:

Most frequently used symbols (top 95%) found in samples collected with perf record.

5. LPprof Slurm Spank Plugin

The Slurm Spank plugin adds LPprof options to srun. To profile a job at a given sampling frequency:

srun --lpprof_f=<frequency>

It is possible to profile only certain ranks by giving a list of ranks to profile:

srun --lpprof_f=<frequency> --lpprof_r=<ranks>

Example: profiling ranks 0, 7, 8 and 9 at 99 Hz:

srun --lpprof_f=99 --lpprof_r=0,7-9
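
The rank list in this example mixes single ranks and an inclusive range. The sketch below shows how such a list expands. It is based only on the example above (0,7-9 meaning ranks 0, 7, 8 and 9) and is not the plugin's own parser. The function name expand_ranks is hypothetical.

```python
def expand_ranks(spec):
    """Expand a rank list such as "0,7-9" into sorted rank numbers.

    Assumes comma-separated items, each either a single rank or an
    inclusive first-last range.
    """
    ranks = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            ranks.update(range(int(first), int(last) + 1))
        else:
            ranks.add(int(part))
    return sorted(ranks)


print(expand_ranks("0,7-9"))  # [0, 7, 8, 9]
```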

When LPprof is used through the Spank plugin, the output directory is named perf_<slurm_job_id>.