Troubleshooting Latch Contention using systemtap

The purpose of this blog post is to show how to troubleshoot contention on a specific latch using a systemtap script. The post is heavily inspired by the “latchprof” script developed by Tanel Poder and by his systematic approach to latch contention troubleshooting (for more info, see latch-contention-troubleshooting).

This is what we are going to achieve:

Tested on: Oracle 11.2.0.4/OEL6/UEK4

stap -v monitor_latch.stp "latch_address" "latch#" "refresh_time"

[Screenshot capture-01: sample output of the script]

The script shows a breakdown of latch holders by PID, session ID, and SQL hash for the “cache buffers chains” latch at address “0x000000009F69FF60”.

UPDATE 27/04/2017: The latch address specified is valid only in the context of the target instance (processes that have the same shared memory mapped). Because we are dealing with virtual address spaces, the same address can be used by other processes for other purposes, so I modified the script so that the memory watchpoint fires only when the content at that address is modified by a process belonging to the target instance. As an example, here is a hardware breakpoint set on an address that fired twice, for two different programs:

[Screenshot: a hardware breakpoint firing twice, for two different programs]

Part 1: Monitoring latch acquisition/release

To monitor latch activity I used a hardware breakpoint that fires whenever the latch address is modified. The number of hardware breakpoints available is limited because they rely on dedicated debug registers (usually four on x86), so we cannot monitor many latch addresses this way (I limited myself to one).

But how do we know whether the latch is being acquired or released at each modification?

Whenever the latch is acquired or released, the first word at the latch address is modified, as stated by Andrey Nikolaev, to reflect the PID of the holding process or the number of holding processes, depending on the latch type and acquisition mode. Also, as demonstrated in my previous post, the number of gets is incremented at release time.

Assuming that the latch address is modified only when the latch is acquired or released, we can state that:

  • If the address is modified by a process X and the number of gets does not change => latch acquired
  • If the address is modified by a process X and the number of gets changes => latch released
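The rule above can be sketched in a few lines of Python. This is only an illustration of the logic, not the systemtap script itself: the function name, the event format (one pair of PID and gets value per watchpoint hit), and the sample trace are all hypothetical.

```python
def classify_writes(events, initial_gets):
    """Label each write to the latch word as an acquisition or a release.

    events: iterable of (pid, gets_after_write) pairs, one per watchpoint hit
            (hypothetical format, chosen for this illustration).
    initial_gets: value of the gets counter before the first event.
    """
    previous_gets = initial_gets
    labelled = []
    for pid, gets in events:
        # Unchanged gets counter: the latch was just acquired.
        # Changed gets counter: the latch was just released.
        kind = "released" if gets != previous_gets else "acquired"
        labelled.append((pid, kind))
        previous_gets = gets
    return labelled


# Made-up trace: PID 22 takes and frees the latch, then PID 31 does the same.
trace = [(22, 1), (22, 2), (31, 2), (31, 3)]
for pid, kind in classify_writes(trace, initial_gets=1):
    print(pid, kind)
```

The sketch relies on the same assumption as the text: every write to the latch address is either an acquisition or a release.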

We can read the latch “gets” value at a specific offset from the latch address. This offset differs between shared and exclusive latches.

Exclusive latch memory layout:

oradebug peek 200222A0 24
[200222A0, 200222B8) = 00000016 00000001 000001D0 00000007
                       ^pid     gets     latch#   level#

Shared latch memory layout:

oradebug peek 0x6000AEA8 24
[6000AEA8, 6000AEC0) = 00000002 00000000 00000001 00000007
                       ^Nproc   ^X flag  gets     latch#

Reference : Andrey Nikolaev
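To make the two layouts easier to read, here is a small Python sketch that maps the words of an `oradebug peek` line to the field names shown above. The helper name `decode_peek` and the field tuples are illustrative assumptions, and the sketch assumes the line has exactly the format of the two dumps in this post.

```python
import re

# Field order taken from the two annotated dumps above.
EXCLUSIVE_FIELDS = ("pid", "gets", "latch#", "level#")
SHARED_FIELDS = ("nproc", "x_flag", "gets", "latch#")


def decode_peek(line, fields):
    """Map the 32-bit hex words of an `oradebug peek` line to field names."""
    match = re.match(r"\[[0-9A-Fa-f]+,\s*[0-9A-Fa-f]+\)\s*=\s*(.*)", line)
    if match is None:
        raise ValueError("not an oradebug peek line: %r" % line)
    words = [int(word, 16) for word in match.group(1).split()]
    return dict(zip(fields, words))


exclusive = "[200222A0, 200222B8) = 00000016 00000001 000001D0 00000007"
shared = "[6000AEA8, 6000AEC0) = 00000002 00000000 00000001 00000007"

print(decode_peek(exclusive, EXCLUSIVE_FIELDS))
print(decode_peek(shared, SHARED_FIELDS))
```

For the exclusive dump this gives pid 22 (0x16), 1 get and latch# 464 (0x1D0). For the shared dump it gives 2 holders, no X flag, 1 get and latch# 7.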

I used the latch# to determine the offset of the gets counter, which is why it is passed as a parameter to the script.
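One way the latch# could be used for this is sketched below in Python. It is an assumption about the approach, not a transcription of monitor_latch.stp: look for the latch# at the two positions where the layouts place it, and infer the offset of gets from whichever matches. The function name `gets_offset` and the little-endian 32-bit word size are also assumptions (x86).

```python
import struct


def gets_offset(latch_bytes, latch_number):
    """Return the byte offset of the gets counter inside the latch structure.

    latch_bytes: at least the first 16 bytes read from the latch address.
    latch_number: the latch# passed by the user.
    """
    words = struct.unpack("<4I", latch_bytes[:16])
    if words[2] == latch_number:
        # Exclusive layout: pid, gets, latch#, level#
        return 4
    if words[3] == latch_number:
        # Shared layout: nproc, x_flag, gets, latch#
        return 8
    raise ValueError("latch# not found at either expected offset")


# Byte images rebuilt from the two dumps shown earlier in this post.
exclusive = struct.pack("<4I", 0x16, 0x1, 0x1D0, 0x7)
shared = struct.pack("<4I", 0x2, 0x0, 0x1, 0x7)

print(gets_offset(exclusive, 0x1D0))
print(gets_offset(shared, 0x7))
```

Note that this heuristic can misfire if the gets counter of a shared latch happens to equal the latch#, so a real script would need a more robust check.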

NOTE 1: This program monitors only latches acquired in willing-to-wait mode, because the number of gets is not incremented in immediate-get mode.

NOTE 2: This program is far from perfect and is just an example of using hardware breakpoints for monitoring. It needs to be enhanced to work reliably in a multi-process environment.

Part 2: Getting session addr/SID/SQL hash

To get the session address I used the technique described in my previous post: I extracted it from the global symbol “ksupga_” plus some offset, and then used x$kqfco and x$kqfta to find the offsets of the other fields (SQL_HASH/SID).

DOWNLOAD : monitor_latch.stp

That’s it 😀

2 thoughts on “Troubleshooting Latch Contention using systemtap”

  1. Hi Hatem,
    I am getting an error while trying to run the script for one of the latches.
    Can you please help! 🙂

    [root@dixitlab NEW]# stap -v monitor_latch.stp 0000000060055800 551 3
    parse error: number invalid or out of range
    saw: number ‘0000000060055800’ at monitor_latch.stp:42:90
    source: printf (“——————–Monitoring latch with address 0x%x———————-\n”, … 0000000060055800 … $1)

    ^

    parse error: number invalid or out of range
    saw: number ‘0000000060055800’ at monitor_latch.stp:48:19
    source: probe kernel.data( … 0000000060055800 … $1).length(8).write
    ^

    2 parse errors.
    Pass 1: parsed user script and 112 library script(s) using 206596virt/34396res/3372shr/31412data kb, in 230usr/70sys/1665real ms.
    Pass 1: parse failed. [man error::pass1]
    [root@dixitlab NEW]#
