Reverse engineering: What we need to know as a DBA?

“Reverse engineering, also called back engineering, is the processes of extracting knowledge or design information from anything man-made and re-producing it or re-producing anything based on the extracted information” (Ref: Wikipedia)

Reverse engineering serves many purposes, such as discovering software bugs, security auditing, removing copy protection, filling gaps in documentation, and learning. But why is this important to us?

There are many situations where a basic background in reverse engineering can prove very useful. One of them is the use of dynamic tracing tools for performance troubleshooting and for analyzing application internals, which is becoming more widespread these days. In fact, many applications, like the Oracle database, are delivered without source code or any debugging information, which makes it much harder to identify the meaning of a function, the number, types and values of its arguments, its return value, and so on. This latter information is needed if we want to extend the Oracle instrumentation or understand the application's behavior using, for example, a SystemTap script.

In this post I will try to cover some basics that may prove important in many situations, using a simple C program as an example.

Test system: Linux x86_64, OEL6

ELF standard (Executable and Linking Format):

A standard file format must be used when generating an executable or a library, so that the operating system can recognize and manage it (how to load the file into memory, where to look for the code and data, which shared libraries need to be loaded, etc.). ELF was chosen as the standard binary file format for Unix and Unix-like systems on x86.

There are three main types of object files:

  • Relocatable file: Holds code and data suitable for linking with other object files to create an executable or a shared object file.
  • Executable file: Holds a program suitable for execution.
  • Shared object file: Holds code and data suitable for linking in two contexts:
    1. The link editor may process it with other relocatable and shared object files to create another object file.
    2. The dynamic linker combines it with an executable file and other shared objects to create a process image.

The ELF format specifies two views of an ELF file:

  • Sections contain important data for linking and relocation (.text: executable code, .data: initialized variables, .bss: uninitialized variables, etc.).
  • Segments contain information that is necessary for runtime execution. The linker groups sections into segments, so that one or more sections map to a segment in the executable, and the loader maps these segments into memory.

capture-02

Time for a little example:

We first create a shared library using position-independent code (-fPIC), as this is required on the 64-bit architecture.

lib_add.c


int add_value(int a, int b, int c, int d, int e, int f, int g)
{
    int x = 9;   /* local variable, kept in the function's stack frame */
    return a + b + c + d + e + f + g + x;
}

gcc -g -fPIC -shared -o lib_add.so lib_add.c

main_add.c

#include <stdio.h>

/* Defined in the shared library lib_add.so */
int add_value(int a, int b, int c, int d, int e, int f, int g);

int global_var = 8;

/* Local (static) function */
static int add_value2(int a, int b)
{
    return a + b;
}

int main(void)
{
    printf("%d\n", add_value(1, 2, 3, 4, 5, 6, 7));
    printf("%d\n", add_value2(1, 2));
    printf("%d\n", global_var);
    return 0;
}

gcc -g -o main_add  main_add.c -L. -l_add

Let’s take a look at the ELF header using the readelf utility:

capture-07

There is a lot of information that we can extract from the ELF header above. The main_add executable file is composed of 37 sections (Number of section headers) and 8 segments (Number of program headers).

Using readelf, let’s list the different sections available and how they will be mapped to segments when loaded into memory.

Section Header Table of main_add:

capture-03

A detailed explanation of every section can be found in the ELF specification document. I will focus on only some of these sections later.

Program Header Table of main_add:

capture-04

We can clearly see the section-to-segment mapping (segments have several types, such as LOAD and GNU_STACK; for more information, check the ELF reference). Segment number two, of type LOAD, which contains the executable code (.text), is flagged R E for read/execute. The interpreter that will be used for this binary is also listed, in the segment of type INTERP: it is the dynamic loader (ld-linux).

We can check the real layout of the process segments using pmap. Let’s use gdb to run the program, break on the function add_value to stop execution, and then check the layout of the process segments with pmap:


gdb main_add

break add_value

r

capture-06

Segment number two, which contains the .text, .rodata, .plt, etc. sections, is marked read/execute and loaded at address 0x400000, which corresponds to its VirtAddr (in general, the virtual addresses in the program headers might not represent the actual virtual addresses of the program’s memory image). Segment number three, which contains the sections responsible for dynamic relocation, initialized and uninitialized data, and other things, is loaded at address 0x600000 (aligned to a page boundary) and marked read/write.

Time to go deeper:

The .text section:

It contains the program’s instructions. We can use objdump to confirm that:

objdump -d -j .text main_add

capture-08

Our function “add_value2” and the “main” function are indeed there. Have you spotted how the function “add_value”, which is defined in the shared library “lib_add”, is called?

400648:       e8 c3 fe ff ff          callq  400510 <add_value@plt>

This is how dynamic relocation behaves when using position-independent code. Basically, with lazy binding, the call does not go to the external function directly but through a PLT stub. The GOT (Global Offset Table) and PLT (Procedure Linkage Table, an array of stubs) sections are the keys to dynamic linking.

The first time the function is called, this is what happens (the address of the external function is resolved and placed in its GOT entry before the function is called):

capture-09

After the first call, the resolver is no longer involved: the PLT stub jumps straight to the function, as its address is already stored in the GOT:

capture-10

For detailed information on how dynamic relocation works, I invite you to read the following articles:

But how did the disassembler figure out that address “0x400604” corresponds to the function “add_value2”? objdump does the address-to-symbol translation for us using the symbol tables.

SYMBOL TABLES:

When writing a program, we are basically giving names (symbols) to addresses and areas of memory containing data and instructions, by means of functions and variables. From the ELF specification: the symbol table holds information needed to locate and relocate a program’s symbolic definitions and references. Basically, symbol tables are used by symbolic debuggers and dynamic tracing tools to translate a known hex address into a human-readable name, and by the linker for symbol resolution.

There are two distinct symbol tables:

.symtab

The full symbol table, which contains local and global symbols.

.dynsym

A subset of .symtab. It contains the minimal set of symbols required for dynamic linking, such as information about imported and exported dynamically linked functions.

For more information on symbol tables and why there are two of them, please take a look at:

Let’s use readelf to display these sections:

capture-11

capture-12

As expected, the function “add_value2”, which is a local function, is defined only in the .symtab symbol table, whereas the global function “add_value” is defined in both sections. You can also check that the function “add_value2” is defined at address “0x400604” in section 13 (“.text”), which corresponds exactly to what the objdump disassembler showed.

Symbols that are not needed for relocation processing, like local functions, can be stripped from the program using the strip utility.

capture-13

The “.symtab” section, which contains the full symbol table, was eliminated. This reduces the file size, but the drawback is that symbolic debuggers and dynamic tracers can no longer translate local symbol addresses into a human-readable form. For example, when displaying the call stack trace of a program, the call to the function “add_value2” will be displayed as a hex value.

Program compiled with debuginfo and containing the full symbol table:

capture-14

Program compiled without debuginfo and without the full symbol table (fully stripped):

capture-15

It’s clear that the call stack trace is not as useful as before (I will give a more detailed view of what a call stack is later).

But what do we mean by “compiled with debuginfo”?

When compiling with debug info, extra information in DWARF format is generated and kept in special sections of the ELF file (a program’s debugging information can also be put in a file separate from the executable itself). The debug info contains details such as function arguments and return values, variable types and sizes, corresponding source line numbers, etc.

We can use readelf to display the different sections holding the debug information:

capture-16

Here is a simple example to emphasize the usefulness of the debug info sections:

Program compiled with debuginfo:

capture-17

Program compiled without debuginfo:

capture-18

Clearly, the debugger is still able to resolve symbols using the symbol tables, but a lot of information is missing, such as function argument values and source code line numbers.

I hope that by now you see the importance of having both the full symbol table (.symtab) and the debuginfo of the program at hand. Without this information, using debuggers and dynamic tracing tools to troubleshoot performance problems and to get a deeper understanding of application internals becomes more difficult.

Time to look at the oracle executable:

Using the file and readelf utilities, we can reveal a lot of useful information:

  • Dynamically linked
  • ELF executable
  • Not stripped: so the full symbol table is there! Good!

capture-19

Using readelf to display the symbol tables:

capture-22

We can also list the shared library dependencies using the ldd utility:

capture-20

Does it contain debug information?

capture-21

There is no sign of debuginfo in the ELF file, and Oracle does not ship any separate debuginfo file with the executable. So it appears that it will be difficult to access function arguments, local variables, return values and much else. And then what? Are we stuck?

In the OS/ABI field of the ELF header we can read UNIX - System V.

But what is an ABI?

“The System V Application Binary Interface is a set of specifications that detail calling conventions, object file formats, executable file formats, dynamic linking semantics, and much more for systems that complies with the X/Open Common Application Environment Specification and the System V Interface Definition. It is today the standard ABI used by the major Unix operating systems such as Linux, the BSD systems, and many others.” ref

So the ABI defines many standards, such as the calling conventions, which detail how function arguments are passed (pushed on the stack or placed in registers), where the return value of a function is placed, and many other things (for more detail, please take a look at the ABI specification).

So, after all, it seems possible to get a lot of useful information even though we have no debuginfo at hand. But before going any further, let’s see what a stack is.

What is a Stack?

Basically, a stack is an area of memory allocated to a process that stores information about the active subroutines. It’s composed of multiple stack frames, as a new frame is allocated for each active function (one that has not yet returned). Stack frames may contain information such as:

  • Arguments of the called function
  • Local variables
  • Return address into the caller
  • Etc.

The most important purpose of a stack is to keep track of the point to which each active function should return control when it finishes executing.

Here is an example of a call stack layout from Wikipedia:

call_stack_layout

Special registers are used along with the stack:

  • RSP: Stack pointer: points to the top of the stack.
  • RBP: Frame pointer: provides a stable reference from which function arguments and local variables can be accessed. (The frame pointer is optional and can be omitted using “-fomit-frame-pointer”; in this case local variables and arguments are accessed relative to the stack pointer.)

From the ABI spec:

“The conventional use of %rbp as a frame pointer for the stack frame may be avoided by using %rsp (the stack pointer) to index into the stack frame. This technique saves two instructions in the prologue and epilogue and makes one additional general-purpose register (%rbp) available.”

This is only a very brief overview of the stack. For more detailed info:

Time to look at the disassembled code (of our test program, of course 😀):

Using GDB, let’s disassemble our main function!

capture-23

The first two instructions belong to the function prologue, which is responsible for preparing the stack and registers for use within the function (wiki).

The value of the frame pointer is pushed onto the stack, after which the frame pointer takes the value of the stack pointer: a new stack frame is created on top of the old one.

capture-24-1

According to the ABI specification, function arguments can be passed using both the stack and specific registers (check the ABI spec for more information on which registers can be used). In this particular case, after making some space on the stack using the instruction “sub $0x10,%rsp”, the 7th argument is stored on the stack, then the 6 other arguments are copied into specific registers. After that, the add_value function is called (which pushes the return address onto the stack) and, as indicated in the ABI, the return value is placed in the EAX register, which is then copied into the EDX register.

capture-24-2

 

Let’s now disassemble the “add_value” function:

We can see the function prologue again, and also how the function arguments are copied from the registers onto the stack, which frees the registers for future use (in case of another function call). Then the local variable “x”, which takes the value 9, is stored on the stack using the instruction “movl   $0x9,-0x4(%rbp)”. The remaining instructions add up the different variables, leaving the result in the EAX register, and form the function epilogue.

The function epilogue reverses the actions of the function prologue. It consists, in this case, of the two instructions “leaveq” and “retq”.

capture-25

With this knowledge, we can now place a probe point on the function “add_value” and display its arguments (both on the stack and in registers), its return value and much more. Here is a simple GDB script:


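break add_value
commands
# a..f come from the argument registers; g (7th argument) sits on the stack,
# just above the saved RBP and the return address
printf "Args : a=%d b=%d c=%d d=%d e=%d f=%d g=%d \n", $edi, $esi, $edx, $ecx, $r8d, $r9d, *(int *)($rbp+0x10)
c
end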

capture-28

There is much more to see, but I think I will stop here, as this post is getting really long!

If you want more, here is a free and awesome book by Dennis Yurichev!

That’s it 😀 Stay tuned, more to come!

4 thoughts on “Reverse engineering: What we need to know as a DBA?”

  1. Mahmoud Hatem, once again a great article about OS/Oracle internals. I realize why you did not dig deeper into the Oracle binary, but I will try to use your lesson/article to do it. Once again, thanks.

    Adam
