Suppose we have a CPU-bound application, query, or program. How do we know what the CPU is really doing? Where is the CPU bottleneck? How much of the time are the CPUs stalled, and on which resource? How do we characterize the workload?
Answering these questions can help direct performance tuning!
Let’s take a sample program to analyze:
Test environment: Oracle 126.96.36.199 / OEL 7.0 / kernel 3.10 / i5-6500 processor / 2 × 4GB DDR3-1600
This CPU-intensive PL/SQL program does nothing useful: it just reads a fully buffered table of about 1400MB in a loop.
To identify hotspot functions we can use Brendan Gregg’s CPU Flame Graphs:
perf record -a -g -- sleep 10
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf-flame.svg
This allows quick identification of hot code paths.
But what were those functions doing while they were on-CPU? Running or stalled?
Extract from Brendan Gregg’s “CPU Utilization is Wrong”:
“Stalled means the processor was not making forward progress with instructions, and usually happens because it is waiting on memory I/O. The ratio I drew above (between busy and stalled) is what I typically see in production. Chances are, you’re mostly stalled, but don’t know it.”
As stated by Brendan Gregg, a good starting point for identifying what the CPU is really doing is IPC (instructions per cycle):
“If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.
If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.”
Let’s check our IPC using perf:
I’m using the Skylake micro-architecture, which can retire up to 4 instructions per cycle.
Following the previous rule of thumb (IPC > 1.0), I’m likely instruction bound.
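The rule of thumb quoted above boils down to one division. Here is a minimal sketch of it in Python. The counter values are hypothetical (chosen only to give a round IPC), the function names are mine, and the 1.0 threshold is the one from the quote, not a hard limit.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per CPU cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def classify(ipc_value: float, threshold: float = 1.0) -> str:
    """Apply the quoted rule of thumb; the threshold is a heuristic."""
    if ipc_value > threshold:
        return "likely instruction bound"
    if ipc_value < threshold:
        return "likely memory stalled"
    return "borderline"


# Hypothetical counts, as one would read them from the
# "instructions" and "cycles" lines of a perf stat run.
instructions = 180_000_000_000
cycles = 100_000_000_000

value = ipc(instructions, cycles)
print(f"IPC = {value:.2f}, CPI = {1 / value:.2f} -> {classify(value)}")
```

Keep in mind that a 4-wide core tops out at an IPC of 4, so an IPC above 1.0 still leaves a lot of room for stalls.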
Using Brendan Gregg’s CPI Flame Graph (CPI is the inverse of IPC) we can get a clear visualization of where the stall cycles are:
perf record -a -g -e resource_stalls.any -e cpu-cycles -- sleep 10
perf script | ./stackcollapse-perf.pl --event-filter=cpu-cycles > out_cycle.perf-folded
perf script | ./stackcollapse-perf.pl --event-filter=resource_stalls.any > out_stall.perf-folded
./difffolded.pl -n out_stall.perf-folded out_cycle.perf-folded | ./flamegraph.pl --title "CPI Flame Graph: blue=stalls, red=instructions" --width=900 > cpi.svg
The color now shows what each function was doing while it was on-CPU, running or stalled: the highest CPI is blue (slowest instructions) and the lowest CPI is red (fastest instructions). This graph is closer to “cycles per resource stall” than “cycles per instruction”, as we are using the event RESOURCE_STALLS.ANY and not INST_RETIRED.ANY_P. See the Intel® 64 and IA-32 Architectures Developer’s Manual, Vol. 3B, for the definition of this event.
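To make the idea behind the coloring more concrete, here is a rough sketch in Python that compares two folded profiles (the `stack count` format produced by stackcollapse-perf.pl) and reports, per stack, its share of stall samples relative to its share of cycle samples. This is not what difffolded.pl does internally, only an illustration of the comparison; the stack names and counts are made up.

```python
from collections import Counter


def parse_folded(text: str) -> Counter:
    """Parse folded stacks: one 'frame;frame;frame count' entry per line."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        counts[stack] += int(count)
    return counts


def relative_stall_ratio(stalls: Counter, cycles: Counter) -> dict:
    """Share of stall samples divided by share of cycle samples, per stack.

    A value above 1 means the stack gets more than its share of stalls.
    """
    total_stalls = sum(stalls.values())
    total_cycles = sum(cycles.values())
    ratios = {}
    for stack, cycle_count in cycles.items():
        cycle_share = cycle_count / total_cycles
        stall_share = stalls.get(stack, 0) / total_stalls
        ratios[stack] = stall_share / cycle_share
    return ratios


# Hypothetical folded output for the two events.
cycles_folded = "main;scan_rows 600\nmain;hash_key 400\n"
stalls_folded = "main;scan_rows 90\nmain;hash_key 10\n"

ratios = relative_stall_ratio(parse_folded(stalls_folded),
                              parse_folded(cycles_folded))
for stack, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{ratio:5.2f}  {stack}")
```

Shares are used instead of raw counts because the two events are sampled with different periods, so their absolute sample counts are not directly comparable.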
There are many reasons that may lead to low IPC and stalled or inefficient execution. PMU performance events (hundreds of core/offcore/uncore events) can be used to drill down further. However, it is not obvious which events are useful for detecting the true bottleneck.
What we need is a method, and that method is TMAM!
Top-down Micro-architecture Analysis Method (systematically find the true bottleneck with less guesswork):
TMAM lets us identify the true bottlenecks in a simple, structured, hierarchical process. Its simplified hierarchy avoids the steep micro-architecture learning curve.
More info (Must read to get a better understanding) :
- Tuning Applications Using a Top-down Microarchitecture Analysis Method
At the top level, TMAM classifies pipeline slots (a pipeline slot represents the hardware resources needed to process one uop) into four main categories:
- Front End Bound : The front end is delivering < 4 uops per cycle while the back end of the pipeline is ready to accept uops
- Bad Speculation : Tracks uops that never retire, or allocation slots wasted due to recovery from branch misprediction or clears
- Retiring : Successfully delivered uops that eventually do retire
- Back End Bound : No uops are delivered due to lack of required resources at the back end of the pipeline
The drill down is recursively performed until a tree-leaf is reached.
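The four top-level categories can be computed from a handful of counters. Below is a minimal sketch in Python using the level-1 formulas from Intel’s top-down description for 4-wide cores; the event names in the comments are the ones I believe apply to this processor generation (treat them as an assumption and check your CPU’s event list), and the counter values are invented for illustration.

```python
from dataclasses import dataclass

PIPELINE_WIDTH = 4  # pipeline slots per cycle on a 4-wide core


@dataclass
class Counters:
    cycles: int              # CPU_CLK_UNHALTED.THREAD
    uops_not_delivered: int  # IDQ_UOPS_NOT_DELIVERED.CORE
    uops_issued: int         # UOPS_ISSUED.ANY
    uops_retired_slots: int  # UOPS_RETIRED.RETIRE_SLOTS
    recovery_cycles: int     # INT_MISC.RECOVERY_CYCLES


def top_level(c: Counters) -> dict:
    """Split all pipeline slots into the four level-1 TMAM categories."""
    slots = PIPELINE_WIDTH * c.cycles
    frontend = c.uops_not_delivered / slots
    bad_speculation = (c.uops_issued - c.uops_retired_slots
                       + PIPELINE_WIDTH * c.recovery_cycles) / slots
    retiring = c.uops_retired_slots / slots
    # Whatever is left was lost waiting on back-end resources.
    backend = 1.0 - (frontend + bad_speculation + retiring)
    return {
        "Front End Bound": frontend,
        "Bad Speculation": bad_speculation,
        "Retiring": retiring,
        "Back End Bound": backend,
    }


# Hypothetical counter values, not measurements from this test.
sample = Counters(cycles=1_000_000, uops_not_delivered=200_000,
                  uops_issued=1_900_000, uops_retired_slots=1_800_000,
                  recovery_cycles=25_000)

for name, share in top_level(sample).items():
    print(f"{name:16s} {share:6.1%}")
```

Back End Bound is derived by subtraction, which is why the four shares always sum to 100% of the slots.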
This method is adopted by multiple in-production tools, including VTune and pmu-tools. So let’s try them:
Let’s give VTune a try :
VTune correctly indicated that we have an IPC of 1.8 (1/CPI) and that we are mostly Back-End Bound. We can also get a detailed breakdown of the execution cycles!
Grouping by function :
Expand the Back-End Bound section to get more detail !
And there is a lot more !
Now let’s give pmu-tools’ ./toplev.py a try:
Displaying only the first level :
So our PL/SQL program seems to be Back-End Bound, and specifically memory bound (DRAM Bound)!
As we all know, the common bottleneck is moving from disks to the memory subsystem.
Memory is the new disk
As we are looking at a memory-bound program (DRAM Bound), let’s analyze its memory bandwidth consumption.
Our processor’s specification indicates a peak bandwidth of 34GB/s. However, we have only two DDR3-1600 modules, which gives us a theoretical peak bandwidth of about 25GB/s.
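The theoretical figure is simple arithmetic: transfers per second × bytes per transfer × number of channels. A small sketch in Python, assuming the two modules run in dual-channel mode with the usual 64-bit channel width (both are assumptions about this test box); the DDR4-2133 line is my guess at where the 34GB/s specification figure comes from.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float,
                        bus_width_bits: int = 64,
                        channels: int = 2) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# DDR3-1600: 1600 million transfers per second per channel.
print(f"DDR3-1600 x2 channels: {peak_bandwidth_gb_s(1600):.1f} GB/s")
# Assumed origin of the specification figure: dual-channel DDR4-2133.
print(f"DDR4-2133 x2 channels: {peak_bandwidth_gb_s(2133):.1f} GB/s")
```

A real system never sustains the theoretical value, which is why a measured peak is the better reference when judging saturation.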
The peak memory bandwidth measured by VTune is 21GB/s.
When running our program, memory bandwidth consumption reached 7GB/s (calculated using offcore PMU events).
Running the program in two different sessions consumes more than half of the available memory bandwidth (12GB/s).
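To put these numbers in perspective, here is a small sketch in Python. It assumes the bandwidth is derived from a count of 64-byte cache lines transferred over a measurement interval (the exact offcore events are not shown here, and the cache-line count is invented to land on 7GB/s), then compares consumption with the 21GB/s measured peak.

```python
CACHE_LINE_BYTES = 64  # cache line size on x86


def bandwidth_gb_s(cache_lines: int, seconds: float) -> float:
    """Bandwidth from a count of cache lines moved during an interval."""
    return cache_lines * CACHE_LINE_BYTES / seconds / 1e9


def utilization(used_gb_s: float, peak_gb_s: float) -> float:
    """Fraction of the peak bandwidth that is consumed."""
    return used_gb_s / peak_gb_s


# Hypothetical: cache lines counted over a 10 second window.
one_session = bandwidth_gb_s(1_093_750_000, 10.0)
print(f"one session : {one_session:.1f} GB/s, "
      f"{utilization(one_session, 21.0):.0%} of measured peak")
print(f"two sessions: 12.0 GB/s, "
      f"{utilization(12.0, 21.0):.0%} of measured peak")
```

With one session at about a third of the peak and two sessions above half, there is not much headroom left for adding more concurrent sessions.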
Basically, an application that has saturated the available memory bandwidth won’t scale effectively with more cores (sharing the same memory resources). For a more concrete example, take a look at the great investigation done by Luca Canali: Performance Analysis of a CPU-Intensive Workload in Apache Spark.
Ouf! That was a long blog post 😀 I hope that now you see how crucial PMU events are for analyzing bottlenecks on modern systems!
That’s it 😀