Deeper look at CPU utilization: The power of PMU events

Suppose we have a CPU-bound application/query/program. How do we know what the CPU is really doing? What is the CPU bottleneck? How much of the time are the CPUs stalled, and on which resource? How do we characterize the workload?

Answering these questions can help direct performance tuning!

Let’s take a sample program to analyze:

Test env: Oracle 12.2.0.1 / OEL 7.0 / kernel 3.10 / Intel Core i5-6500 / 2×DDR3-1600 (2×4GB)

This CPU-intensive PL/SQL program does nothing useful: it just reads a fully buffered table of about 1400MB in a loop.

[Screenshot: the PL/SQL test program]
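For illustration, a minimal sketch of such a program might look like this (BIG_TABLE is a hypothetical name; the actual code is in the screenshot above):

-- Hypothetical sketch: repeatedly full-scan a table that fits in the
-- buffer cache, generating CPU/memory work with no disk I/O.
DECLARE
  l_cnt NUMBER;
BEGIN
  FOR i IN 1 .. 1000 LOOP
    SELECT /*+ FULL(t) */ COUNT(*) INTO l_cnt FROM big_table t;
  END LOOP;
END;
/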

[Screenshot: the test program running]

To identify hotspot functions, we can use Brendan Gregg’s CPU Flame Graphs:

perf record -a -g -- sleep 10
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf-flame.svg

[Screenshot: CPU flame graph of the workload]

This allows for quick identification of hot code paths.

But what were those functions doing while they were on-CPU? Running or stalled?

Extract from Brendan Gregg’s “CPU Utilization is Wrong”:

[Figure: busy vs. stalled CPU cycles, from “CPU Utilization is Wrong”]

“Stalled means the processor was not making forward progress with instructions, and usually happens because it is waiting on memory I/O. The ratio I drew above (between busy and stalled) is what I typically see in production. Chances are, you’re mostly stalled, but don’t know it.”

As stated by Brendan Gregg, a good starting point for identifying what the CPU is really doing is IPC (instructions per cycle):

“If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.

If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.”

Let’s check our IPC using perf:
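perf stat reports the ratio directly as “insn per cycle” (system-wide here for 10 seconds; use -p <pid> to scope it to one process):

# Count cycles and instructions system-wide; perf prints "insn per cycle"
perf stat -a -e cycles,instructions -- sleep 10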

[Screenshot: perf stat output showing IPC]

I’m using the Skylake micro-architecture, which can retire up to 4 instructions per cycle.

Following the previous rule of thumb (IPC > 1.0), I’m likely instruction bound. But let’s dig deeper.

Using Brendan Gregg’s CPI Flame Graph (CPI is the inverse of IPC), we can get a clear visualization of where the stall cycles are:

perf record -a -g -e resource_stalls.any -e cpu-cycles -- sleep 10
perf script | ./stackcollapse-perf.pl --event-filter=cpu-cycles > out_cycle.perf-folded
perf script | ./stackcollapse-perf.pl --event-filter=resource_stalls.any > out_stall.perf-folded
./difffolded.pl -n out_stall.perf-folded out_cycle.perf-folded | ./flamegraph.pl --title "CPI Flame Graph: blue=stalls, red=instructions" --width=900 > cpi.svg

[Screenshot: CPI flame graph (blue=stalls, red=instructions)]

The color now shows what each function was doing while it was on-CPU: running or stalled (highest CPI in blue (slowest instructions), lowest CPI in red (fastest instructions)). Strictly speaking, this graph is closer to “Cycles per Resource Stall” than “Cycles per Instruction”, since we are using the event RESOURCE_STALLS.ANY rather than INST_RETIRED.ANY_P. Checking the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3B:

[Screenshot: RESOURCE_STALLS.ANY event description from the Intel SDM, Vol. 3B]

There are many reasons that can lead to low IPC and stalled/inefficient execution. PMU performance events (hundreds of core/offcore/uncore events) can be used to drill down further. However, it’s far from obvious which events are useful for detecting the true bottleneck.
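To get a sense of how many events there are, you can browse them with perf list (pmu-tools’ ocperf.py additionally exposes the full model-specific event list by name). For example, to see the stall-related ones:

# Browse the events perf knows about on this CPU, filtering for stalls
perf list | grep -i stall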

What we need is a method, and that’s exactly what TMAM provides!

Top-down Micro-architecture Analysis Method (TMAM): systematically find the true bottleneck with less guesswork

TMAM lets us identify the true bottlenecks in a simple, structured, hierarchical process. Its simplified hierarchy avoids the steep micro-architecture learning curve.

More info (a must-read to get a better understanding): Ahmad Yasin, “A Top-Down Method for Performance Analysis and Counters Architecture”.

At the top level, TMAM classifies pipeline slots (a pipeline slot represents the hardware resources needed to process one uop) into four main categories:

  • Front End Bound: the front end is delivering < 4 uops per cycle while the back end of the pipeline is ready to accept uops
  • Bad Speculation: tracks uops that never retire, or allocation slots wasted due to recovery from branch mispredictions or machine clears
  • Retiring: successfully delivered uops that eventually do retire
  • Back End Bound: no uops are delivered due to a lack of required resources in the back end of the pipeline

The drill-down is performed recursively until a tree leaf is reached.
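For the curious, the four level-1 ratios can be computed from a handful of core events. This is a sketch based on the formulas in Ahmad Yasin’s TMAM paper; the event names below are Intel symbolic names and may differ (or be unavailable) on other models:

# Collect the raw events (check `perf list` for availability on your CPU)
perf stat -a \
  -e cpu_clk_unhalted.thread \
  -e uops_issued.any,uops_retired.retire_slots \
  -e idq_uops_not_delivered.core,int_misc.recovery_cycles -- sleep 10

# With SLOTS = 4 * cpu_clk_unhalted.thread (4-wide pipeline):
#   Retiring        = uops_retired.retire_slots / SLOTS
#   Frontend Bound  = idq_uops_not_delivered.core / SLOTS
#   Bad Speculation = (uops_issued.any - uops_retired.retire_slots
#                      + 4 * int_misc.recovery_cycles) / SLOTS
#   Backend Bound   = 1 - (Retiring + Frontend Bound + Bad Speculation)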

[Figure: the TMAM hierarchy]

This method is adopted by multiple in-production tools, including VTune and PMU-TOOLS. So let’s try them:

Let’s give VTune a try:

[Screenshot: VTune summary (IPC and Top-down breakdown)]

VTune correctly indicates that we have an IPC of 1.8 (1/CPI) and that we are mostly Back-End Bound. We can also get a detailed breakdown of the execution cycles!
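As a side note, VTune can collect the same Top-down data without the GUI. A rough sketch (the amplxe-cl binary name, analysis type, and flags are my recollection of VTune Amplifier 2017; check amplxe-cl -help on your install):

# Assumed VTune command-line usage; verify flags on your version
amplxe-cl -collect general-exploration -duration 10 -result-dir ./r001
amplxe-cl -report summary -result-dir ./r001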

Grouping by function:

[Screenshot: VTune breakdown grouped by function]

Expand the Back-End Bound section to get more detail!

[Screenshot: VTune Back-End Bound breakdown]

And there is a lot more!

Let’s give PMU-TOOLS’ ./toplev.py a try:

Displaying only the first level:
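A typical invocation looks like this (toplev.py drives perf under the hood; -l1 limits the output to the first level of the hierarchy; flags per the pmu-tools documentation):

# Level-1 top-down breakdown of the whole system for 10 seconds
./toplev.py -l1 -a sleep 10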

[Screenshot: toplev.py level-1 output]

So our PL/SQL program seems to be Back-End Bound, and specifically memory bound (DRAM Bound)!

As we all know, the common bottleneck is moving from disks to the memory subsystem.

Memory is the new disk

As we are looking at a memory-bound program (DRAM Bound), let’s analyze its memory bandwidth consumption.

[Screenshot: processor specification showing 34GB/s max memory bandwidth]

Our processor’s spec indicates a peak memory bandwidth of 34GB/s. However, we have only two DDR3-1600 DIMMs, which gives a theoretical peak bandwidth of about 25GB/s.
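For the arithmetic: one DDR3-1600 channel moves 1600 MT/s × 8 bytes = 12.8GB/s, so two channels give 2 × 12.8GB/s ≈ 25.6GB/s.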

[Screenshot: theoretical peak bandwidth of the installed DDR3-1600 memory]

The peak memory bandwidth calculated by VTune is 21GB/s.

[Screenshot: memory bandwidth while running the program]

When running our program, memory bandwidth consumption reached 7GB/s (calculated using offcore PMU events).
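If you don’t have VTune at hand, a rough DRAM bandwidth estimate can also be read from the integrated memory controller’s uncore counters. A sketch, assuming a client (desktop) part like this i5-6500 where the kernel exposes uncore_imc events (perf scales the counts to MiB):

# Read/write traffic at the memory controller for 10 seconds;
# bandwidth ~ (data_reads + data_writes) MiB / 10 s
perf stat -a -e uncore_imc/data_reads/,uncore_imc/data_writes/ -- sleep 10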

Running the program in two different sessions consumes more than half of the available memory bandwidth (12GB/s).

[Screenshot: memory bandwidth with two concurrent sessions]

Basically, an application that has saturated the available memory bandwidth won’t scale effectively with more cores (which share the same memory resources). For a more concrete example, take a look at the great investigation done by Luca Canali: “Performance Analysis of a CPU-Intensive Workload in Apache Spark”.

Phew! That was a long blog post 😀 I hope you now see how crucial PMU events are for analyzing modern system bottlenecks!

That’s it 😀
