Improved start-up times for Oracle Database in UEK4 (concurrent huge page faults)

As you may know, Unbreakable Enterprise Kernel Release 4 is finally here with many enhancements. So if you plan to upgrade your kernel to UEK4 and you are using huge pages for your database (and you probably should), you may see a great improvement in start-up times. So what has changed?

From the Oracle Unbreakable Enterprise Kernel Release 4 Release Notes: “Improve page-fault scalability in hugetlb by handling concurrent page faults. Previously, the kernel could only handle a single hugetlb page fault at a time. Typically, the startup time for a 10-gigabyte Oracle database, which generates approximately 5000 page table faults, decreases to 25.7 seconds from 37.5 seconds. Larger workloads should experience even greater improvements in start-up times.”
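
Two quick sanity checks on those numbers: a 10-gigabyte SGA backed by 2 MB huge pages corresponds to 10240 / 2 = 5120 pages, which matches the “approximately 5000” page faults quoted above, and going from 37.5 seconds to 25.7 seconds is roughly a 31% reduction in startup time.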

Let’s take a closer look! The function responsible for handling huge page faults is hugetlb_fault(), located in mm/hugetlb.c. Here is the interesting part from UEKR4 (kernel 4.1.12):


        /*
         * Serialize hugepage allocation and instantiation, so that we don't
         * get spurious allocation failures if two CPUs race to instantiate
         * the same page in the page cache.
         */
        hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, address);
        mutex_lock(&hugetlb_fault_mutex_table[hash]);

        entry = huge_ptep_get(ptep);
        if (huge_pte_none(entry)) {
                ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
                goto out_mutex;
        }

The function uses a table of mutexes, which allows much better parallelization: each huge page is serialized individually instead of every fault going through a single global lock, and that’s great news. For more info, check the upstream kernel commit.
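
To illustrate the idea, here is a minimal user-space sketch (this is not the kernel code; the table size and hash function below are made up for illustration): faults on different pages usually hash to different slots and proceed in parallel, while two faults on the same page hash to the same mutex and still serialize correctly.

#include <pthread.h>

#define FAULT_MUTEX_TABLE_SIZE 256   /* hypothetical size; a power of two */

/* One mutex per slot; every slot starts unlocked (GCC range initializer). */
static pthread_mutex_t fault_mutex_table[FAULT_MUTEX_TABLE_SIZE] =
        { [0 ... FAULT_MUTEX_TABLE_SIZE - 1] = PTHREAD_MUTEX_INITIALIZER };

/* Hash the identity of the faulting page (mapping + page index) into a
 * slot; the real kernel hash also mixes in mm/vma state. */
static unsigned long fault_mutex_hash(void *mapping, unsigned long idx)
{
        unsigned long key = (unsigned long)mapping ^ (idx * 2654435761UL);
        return key & (FAULT_MUTEX_TABLE_SIZE - 1);
}

static void handle_huge_fault(void *mapping, unsigned long idx)
{
        unsigned long hash = fault_mutex_hash(mapping, idx);

        pthread_mutex_lock(&fault_mutex_table[hash]);
        /* Allocate/instantiate the page here: only faults that hash to
         * the same slot serialize; unrelated pages fault in parallel. */
        pthread_mutex_unlock(&fault_mutex_table[hash]);
}

int main(void)
{
        /* Faults on two different page indexes of the same mapping usually
         * take different mutexes, so they would not block each other. */
        handle_huge_fault((void *)0x1000, 0);
        handle_huge_fault((void *)0x1000, 1);
        return 0;
}

Hashing on the page’s identity is what turns one global lock into per-page serialization: the lock protects a single page’s instantiation, not the whole fault path.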

Let’s now check the code for kernel version UEKR3 (kernel 3.8.13):

        /*
         * Serialize hugepage allocation and instantiation, so that we don't
         * get spurious allocation failures if two CPUs race to instantiate
         * the same page in the page cache.
         */
        hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
        mutex_lock(&htlb_fault_mutex_table[hash]);

        entry = huge_ptep_get(ptep);
        if (huge_pte_none(entry)) {
                ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
                goto out_mutex;
        }

It seems that this version also uses the same mechanism for serializing access. What changed in UEKR4 is this:

“hugetlb page faults are currently synchronized by the table of mutexes (htlb_fault_mutex_table). fallocate code will need to synchronize with the page fault code when it allocates or deletes pages. Expose interfaces so that fallocate operations can be synchronized with page faults. Minor name changes to be more consistent with other global hugetlb symbols.” (from the upstream commit message)

For previous versions of the kernel, this is how serialization was handled (for example UEKR2, kernel 2.6.39):

        /*
         * Serialize hugepage allocation and instantiation, so that we don't
         * get spurious allocation failures if two CPUs race to instantiate
         * the same page in the page cache.
         */
        mutex_lock(&hugetlb_instantiation_mutex);
        entry = huge_ptep_get(ptep);
        if (huge_pte_none(entry)) {
                ret = hugetlb_no_page(mm, vma, address, ptep, flags);
                goto out_mutex;
        }

As previously stated, the kernel in this version could only handle a single hugetlb page fault at a time. The global hugetlb_instantiation_mutex can become quite contended during the startup of large databases that make use of huge pages.

NOTE: So this feature is not really new, as it was already implemented in UEKR3.

TIME TO TEST:

  • ORACLE 12.1.0.2.6
  • HugePages_Total:    4001
  • Hugepagesize:       2048 kB
  • *.pga_aggregate_target=400M
  • *.sga_target=8000M
  • *.pre_page_sga=TRUE (the default since 12c, with enhanced behavior; for more info check PRE_PAGE_SGA Behaviour Change in Oracle Release 12c (Doc ID 1987975.1))
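
Note that the configuration adds up: an 8000M SGA backed by 2048 kB pages needs 8000 / 2 = 4000 huge pages, which is why HugePages_Total is 4001 (presumably one extra page of headroom). And with pre_page_sga=TRUE, the instance touches the entire SGA at startup, which is exactly what generates the burst of huge page faults we want to observe.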

Startup UEKR2:

(screenshot: database startup time on UEKR2)

Startup UEKR3:

(screenshot: database startup time on UEKR3)

Startup UEKR4:

(screenshot: database startup time on UEKR4)

As expected, there is no big difference in startup time between UEKR3 and UEKR4, and a huge one compared with UEKR2.

To get a closer look, we can use SystemTap!

Before using SystemTap, use rpmbuild to generate the kernel-debuginfo package from the kernel source RPM. (SystemTap needs the kernel’s DWARF debug info to resolve function and statement probes, so the debuginfo package must match the exact running kernel.)

(screenshots: building the kernel-debuginfo package with rpmbuild)

Here is a SystemTap script for tracing the number of concurrent huge page faults and other statistics every second:

#! /usr/bin/env stap

global max_conc_page_fault = 0
global conc_page_fault = 0
global nb_page_fault = 0
global in_huge_page_fault
global mutex_wait_time
global page_fault_time
global page_fault_begin
global mutex_begin
global in_serialized_sec = 0
global out_serialized_sec = 0


probe kernel.function("hugetlb_fault") {
  in_huge_page_fault[tid()] = 1
  page_fault_begin[tid()] = gettimeofday_ns()
  nb_page_fault = nb_page_fault + 1
}

probe kernel.function("hugetlb_fault").return {
  delete in_huge_page_fault[tid()]
  delta =  gettimeofday_ns() -  page_fault_begin[tid()]
  page_fault_time <<<  delta
  delete page_fault_begin[tid()]
  out_serialized_sec = out_serialized_sec + 1
  conc_page_fault = conc_page_fault - 1
}


probe kernel.function("mutex_lock").call {
if (in_huge_page_fault[tid()] == 1) {
    mutex_begin[tid()] = gettimeofday_ns()
}
}

probe kernel.function("mutex_lock").return {
if (in_huge_page_fault[tid()] == 1) {
  delta =  gettimeofday_ns() -  mutex_begin[tid()]
  mutex_wait_time <<<  delta
  delete mutex_begin[tid()]
}
}



# In the serialized code section (the statement just after mutex_lock).
probe kernel.statement("hugetlb_fault@mm/hugetlb.c+43") {
  in_serialized_sec = in_serialized_sec + 1
  conc_page_fault = conc_page_fault + 1
  if (conc_page_fault > max_conc_page_fault) {
    max_conc_page_fault = conc_page_fault
  }
}


probe timer.ms(1000) {
  # Skip until at least one fault has completed, to avoid reading
  # empty aggregates.
  if (out_serialized_sec == 0) next
  printf("Huge page fault : Current concurrent %d / Max concurrent : %d / Total : %d / Mutex wait : %d / Elapsed time: %d / In serialized : %d / Out serialized : %d\n", conc_page_fault, max_conc_page_fault, nb_page_fault, @sum(mutex_wait_time), @sum(page_fault_time), in_serialized_sec, out_serialized_sec)
}

The probe used for tracking concurrent processes executing the serialized section of code is kernel.statement("hugetlb_fault@mm/hugetlb.c+43"), where +43 is the line offset, relative to the start of hugetlb_fault, of the statement just after the mutex_lock() call.
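
Note that the +43 offset is specific to this particular kernel build. If you adapt the script to another kernel version, re-derive it by listing the probeable statements of the function, for example with stap -l 'kernel.statement("hugetlb_fault@mm/hugetlb.c:*")', and pick the line just after the mutex_lock() call.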
The script displays:

  • Current concurrent: number of concurrent processes executing the serialized part of hugetlb_fault (the part between mutex_lock and mutex_unlock)
  • Max concurrent: maximum number of concurrent processes executing the serialized part of hugetlb_fault
  • Mutex wait: elapsed time waiting for mutex acquisition in the context of a huge page fault
  • Elapsed time: elapsed time handling huge page faults
  • In serialized: number of processes that entered the serialized section
  • Out serialized: number of processes that exited the serialized section
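
To reproduce this, save the script (say as hugetlb_fault.stp; the name is just an example), run it as root with stap hugetlb_fault.stp in one terminal, and start the database instance in another; the script prints one summary line per second.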

TEST CASE: UEK4

Launch the SystemTap script and test the startup time:

(screenshot: SystemTap output during instance startup on UEKR4)

Interpreting the output:

  • 1646087865 ns were spent handling huge page faults.
  • 43070655 ns were spent waiting for mutex acquisition.
  • The maximum number of concurrent huge page faults was 3.
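
Put differently, only about 43 ms out of roughly 1.65 s of total fault-handling time (about 2.6%) was spent waiting on the fault mutexes, while up to three faults were being served concurrently.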

 

Due to some problems I encountered, I haven’t tested the SystemTap script on UEK2. (Maybe later; but if you test it, you should see that Max concurrent never exceeds 1, along with a much higher mutex wait time.)

That’s it 😀