Manual balancing of SGA components across NUMA nodes [NUMA API: move_pages()]

In my previous blog post I showed how to display the distribution of memory components (buffer cache, shared pool, large pool, etc.) across the different NUMA nodes using the NUMA API. But what if we want more control? Can we, for example, isolate a specific SGA component on a specific set of nodes?

Suppose, for example, that you are using the IN-MEMORY column store and only a few users rely heavily on it. Would it be useful to collocate them on a specific set of nodes to improve memory access latency? It depends, for sure! But we can do it! Using the NUMA API, and specifically the function “move_pages”, we can distribute the memory pages across NUMA nodes as we want!

Automatic NUMA Balancing, which is enabled by default on UEK R4, relies on a similar mechanism to move memory pages closer to where the task is executing (for more info check this), but for now it does not support the migration of huge pages (hugetlbfs).

[root@svltest ~]# sysctl -a | grep numa_balancing
kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256

This is what we are going to achieve in this blog post :

Capture 0

Capture 20

TEST SERVER :

  • OEL 6 UEK R4
  • ORACLE 12.1.0.2.6
  • HugePages_Total:    4001
  • Hugepagesize:       2048 kB
  • *.inmemory_size=300M
  • *.pga_aggregate_target=400M
  • *.sga_target=8000M
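
As a quick sanity check on these settings, the number of huge pages needed to back a memory area is its size divided by the huge page size, rounded up. Here is a minimal sketch, assuming the whole sga_target is backed by 2048 kB huge pages (the helper name huge_pages_needed is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Number of huge pages needed to back size_bytes, rounded up. */
uint64_t huge_pages_needed(uint64_t size_bytes, uint64_t page_bytes)
{
    assert(page_bytes > 0);                 /* avoid a division by zero */
    return (size_bytes + page_bytes - 1) / page_bytes;
}
```

With the values above, huge_pages_needed(8000 MB, 2048 kB) gives 4000 pages, which fits in the 4001 pages configured, and a 300M inmemory_size corresponds to 150 pages.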

Step 1 : Reserve pages on the target node

Before moving any pages we must prepare the target node to receive them.

As we can see, there are not enough free huge pages on node 2 (HugePages_Free) :

Capture 10

Let’s reserve some :

echo 1046 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

Capture 11
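
The per-node counters can also be read from a program before starting a migration. The following is a minimal sketch, assuming 2M huge pages and the same per-node sysfs directory as in the echo command; the helper names hugepage_counter_path and read_hugepage_counter are mine, not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path of a per-node huge page counter such as
 * "free_hugepages" or "nr_hugepages".
 * Returns 0 on success, -1 on a bad argument or a too-small buffer. */
int hugepage_counter_path(char *buf, size_t len, int node, const char *counter)
{
    assert(buf != NULL && counter != NULL);
    if (node < 0 || strchr(counter, '/') != NULL)
        return -1;                          /* keep the path inside the node directory */
    int n = snprintf(buf, len,
                     "/sys/devices/system/node/node%d/hugepages/hugepages-2048kB/%s",
                     node, counter);
    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/* Read one counter; returns -1 if the file cannot be opened or parsed. */
long read_hugepage_counter(int node, const char *counter)
{
    char path[256];
    long value = -1;

    if (hugepage_counter_path(path, sizeof path, node, counter) != 0)
        return -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

For example, read_hugepage_counter(2, "free_hugepages") should return the number of free huge pages on node 2, which can then be compared with the number of pages about to be moved.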

Step 2 : Moving the IN-MEMORY data part to node 2

The IN-MEMORY pool storing the actual column formatted data is located in a separate shared memory segment which is evenly distributed across NUMA nodes :

Capture 12

Let’s write a C program to move it to node 2. This is an extension of the C program written in my previous post.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <string.h>   /* strerror */
#include <numa.h>     /* numa_available */
#include <numaif.h>   /* get_mempolicy, move_pages */
#include <sys/shm.h>

/* Link with -lnuma. */

#define HUGE_PAGE_SIZE 2097152UL   /* Huge page size 2M */

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "Usage: %s <shmid> <nb_page>\n", argv[0]);
        return 1;
    }

    if (numa_available() < 0) {
        printf("System does not support NUMA API!\n");
        return 1;
    }

    int shmid = atoi(argv[1]);
    int nb_page = atoi(argv[2]);

    /* Attach the segment at the SGA BASE address */
    void *mem = (void *) 0x00000061000000;
    void *addr = shmat(shmid, mem, SHM_RDONLY);
    if (addr == (void *) -1) {
        perror("shmop: shmat failed");
        return 1;
    }

    /* Align to page boundary */
    unsigned long a = (unsigned long) mem;
    a = a - (a % HUGE_PAGE_SIZE);
    void *pa = (void *) a;

    int numa_node = -1;
    int status[1];
    int nodes[1];
    nodes[0] = 2;   /* target node */

    for (int i = 0; i < nb_page; i = i + 1) {
        status[0] = -1;

        /* Current node of the page (this also faults the page in) */
        if (get_mempolicy(&numa_node, NULL, 0, pa, MPOL_F_NODE | MPOL_F_ADDR) != 0) {
            printf("Error %d %p \n", errno, pa);
            break;
        }
        printf("Memory at %p is in node %d\n", pa, numa_node);

        /* move_pages can return 0 while the page itself failed to move:
           status[0] then holds a negative errno instead of a node number */
        long ret_code = move_pages(0 /* self memory */, 1, &pa, nodes, status, MPOL_MF_MOVE_ALL);
        if (ret_code == 0 && status[0] >= 0) {
            printf("Move memory at %p to node %d\n", pa, status[0]);
        } else if (ret_code < 0) {
            printf("Error moving memory at %p. Error : %s\n", pa, strerror(errno));
        } else {
            printf("Error moving memory at %p. Page status : %d\n", pa, status[0]);
        }

        a = a + HUGE_PAGE_SIZE;
        pa = (void *) a;
    }

    shmdt(addr);
    return 0;
}
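
The program issues one get_mempolicy call and one move_pages call per page. move_pages also accepts an array of pages, and when its nodes argument is NULL it moves nothing and only reports, in status, the node on which each page currently resides. The sketch below uses that to look up a whole range in one call. It goes through the raw system call with syscall(SYS_move_pages, ...) so that it builds without libnuma; the move_pages() wrapper from numaif.h takes the same arguments. The helper names align_down, build_page_list and query_page_nodes are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* 2M huge pages, as on the test server */

/* Round an address down to the start of its page (page_size: power of two). */
uintptr_t align_down(uintptr_t addr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/* Fill pages[0..count-1] with the start address of each consecutive page. */
void build_page_list(void *base, unsigned long count, uintptr_t page_size, void **pages)
{
    uintptr_t first = align_down((uintptr_t) base, page_size);
    for (unsigned long i = 0; i < count; i++)
        pages[i] = (void *) (first + i * page_size);
}

/* Ask the kernel where each page lives. nodes == NULL means "do not move,
 * only fill status[]". Returns 0 on success, -1 with errno set on failure. */
long query_page_nodes(void **pages, unsigned long count, int *status)
{
    return syscall(SYS_move_pages, 0 /* self */, count, pages, NULL, status, 0);
}
```

After build_page_list(mem, nb_page, HUGE_PAGE_SIZE, pages), a single query_page_nodes(pages, nb_page, status) call fills status with one node number (or one negative errno) per page. Pages that the calling process has not touched yet may be reported as -ENOENT. Passing an array of target nodes instead of NULL, together with MPOL_MF_MOVE_ALL, should turn the same call into a batched migration.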

Execute the program :

Capture 30

As you can see, the page at address “0x61000000” was moved from node 0 to node 2.

As I have not reserved enough huge pages on the target node, we receive an error when trying to move some of the pages.

Capture 31

So take care to reserve the right number of huge pages! (In our case we should have reserved 90 pages.)
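
Even when move_pages returns 0, the outcome for each page is reported separately in the status array: a value of zero or more is the node the page is on, and a negative value is a negated errno. The sketch below decodes the values listed in the move_pages man page; the function names page_moved and describe_page_status are mine. A shortage of free huge pages on the target node is likely to show up as -ENOMEM, although that is an assumption about this particular failure.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* 1 if the status entry says the page now sits on target_node, else 0. */
int page_moved(int status, int target_node)
{
    assert(target_node >= 0);
    return status == target_node;
}

/* Translate one entry of the move_pages() status array into a short message. */
const char *describe_page_status(int status)
{
    if (status >= 0)
        return "page is on the node given by the status value";
    switch (-status) {
    case ENOMEM: return "unable to allocate memory on the target node";
    case ENOENT: return "page is not present";
    case EBUSY:  return "page is busy, try again later";
    case EACCES: return "page is mapped by several processes, MPOL_MF_MOVE_ALL is needed";
    case EFAULT: return "zero page, or address not mapped by the process";
    default:     return strerror(-status);   /* any other errno */
    }
}
```

In the loop of the program, printing describe_page_status(status[0]) whenever page_moved(status[0], nodes[0]) is 0 would make a failed page easier to diagnose than the raw number.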

That’s it 😀 Have fun playing with NUMA !
