Oracle 20c DbNest : Linux namespaces/seccomp/capabilities/cgroups

One of the new features in Oracle 20c (which is still a preview release) is dbnest. So what is dbnest :

“DbNest provides hierarchical, isolated run-time environments at the CDB and PDB level.

These run-time environments provide file system isolation, process ID number space isolation, and secure computing for PDBs and CDBs. To protect the multitenant environment from security breaches, dbNest uses the latest Linux resource isolation, namespace, and control group features.”

So dbnest adds further protection to the databases (PDB/CDB) in a consolidated environment by isolating every PDB in its own container/nest. In fact, it’s powered by the same fundamental technologies used by containers as we know them today.

So before giving it a try, let’s first take a quick look at the base technologies it uses.

Linux namespaces :

Namespaces are a fundamental aspect of containers on Linux. So what are they :

“A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers.”

In total there are 8 namespace types :

  • Cgroup : Isolates the cgroup root directory
  • IPC : Isolates System V IPC and POSIX message queues
  • Network : Isolates network devices, stacks, ports, routing tables, etc.
  • Mount : Isolates mount points
  • PID : Isolates process IDs
  • Time : Isolates boot and monotonic clocks
  • User : Isolates security-related identifiers and attributes (user IDs and group IDs/root directory/keys/capabilities)
  • UTS : Isolates hostname and NIS domain name

A Linux system starts out with a single namespace of each type, used by all processes. We can use the lsns command to list the namespaces currently accessible on our system.

We can also use it to display information about a specific namespace by specifying its inode number.

The ps command also allows us to display the PID namespace a process belongs to. For example, if we want to display all processes belonging to a specific namespace we could do the following :
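As a rough sketch of the same idea without lsns or ps, the namespace membership of a process can be read straight from the symbolic links under /proc/PID/ns. The helper names ns_id and pids_in_same_ns are made up for this illustration, and processes owned by other users may be skipped when not running as root.

```shell
# Print the namespace identifier (for example "pid:[4026531836]") of a process.
# $1 = PID (or "self"), $2 = namespace type (pid, mnt, user, uts, ...)
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# List the PIDs that share a namespace with a reference process.
# $1 = reference PID, $2 = namespace type
pids_in_same_ns() {
    ref=$(ns_id "$1" "$2") || return 1
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        # Links of processes we may not inspect are unreadable; skip them.
        [ "$(ns_id "$pid" "$2" 2>/dev/null)" = "$ref" ] && echo "$pid"
    done
    return 0
}

ns_id self pid            # identifier of our own PID namespace
pids_in_same_ns $$ pid    # every visible process in that same namespace
```

The number between the brackets is the inode number that lsns reports for the namespace.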

What about having some fun ? In the following I will use the unshare command to create new namespaces and the nsenter command to join an existing namespace and execute a command in it. I will focus only on the namespaces used by dbnest, that is the PID, mount and user namespaces. How do I know that :

It’s indicated in the dbnest program ! (But more on that in my next blog post)

Mount namespace :

The mount namespace was the first of its kind; you can think of it as an evolution of chroot.

“Mount namespaces provide isolation of the list of mount points seen by the processes in each namespace instance. Thus, the processes in each of the mount namespace instances will see distinct single-directory hierarchies.”

Here are the mount points on my server :

I create a new mount namespace and unmount some volumes.

Nothing has changed in my initial namespace.

That’s how containers isolate mount points 🙂
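A small sketch of how the difference can be checked, assuming the mount list of the initial namespace was saved to a file before unmounting anything. It reads the mount point column of /proc/PID/mountinfo; the helper names and the /tmp/mounts.before path are hypothetical.

```shell
# Print the mount points listed in a mountinfo file, one per line.
# $1 = mountinfo file (defaults to the one of the current process)
mount_points() {
    # Field 5 of every mountinfo line is the mount point.
    awk '{ print $5 }' "${1:-/proc/self/mountinfo}" | sort
}

# Print the mount points present in file $1 but absent from file $2.
mounts_missing_from() {
    mount_points "$1" | while IFS= read -r mp; do
        mount_points "$2" | grep -qxF -- "$mp" || echo "$mp"
    done
}

# Possible usage (the unshare step needs root or CAP_SYS_ADMIN):
#   cp /proc/self/mountinfo /tmp/mounts.before
#   unshare --mount sh      # then unmount something inside this shell
#   mounts_missing_from /tmp/mounts.before /proc/self/mountinfo
mount_points | head -n 5
```

Run inside the new mount namespace, the last comparison should list exactly the volumes that were unmounted there, while the same comparison in the initial namespace prints nothing.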

PID namespaces :

“PID namespaces isolate the process ID number space, meaning that processes in different PID namespaces can have the same PID.

[...] a process can see (e.g., send signals with kill(2), set nice values with setpriority(2), etc.) only processes contained in its own PID namespace and in descendants of that namespace.”

Here is a great article on namespaces that I liked very much : Separation Anxiety: A Tutorial for Isolating Your System with Linux Namespaces

In the kernel source we can clearly see that a process can have a different PID in each namespace :

Ok let’s give it a try :

I have a PID of 1 and I cannot see the other processes in my parent namespace, how cool is that 🙂

Executing a command in the new namespace and checking its PID :

Checking its PID in the parent namespace :

We clearly see that the process has a different PID in its parent namespace !
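Another way to see both PIDs at once, sketched here under the assumption of a kernel recent enough (4.1 or later) to expose the NSpid field in /proc/PID/status. The helper name nspid is invented for this example.

```shell
# Print the PIDs of a process in each nested PID namespace it belongs to,
# outermost namespace first.
# $1 = status file (defaults to /proc/self/status)
nspid() {
    awk '/^NSpid:/ { $1 = ""; sub(/^ +/, ""); print }' "${1:-/proc/self/status}"
}

# In the initial namespace this prints a single number; for a process
# started inside a nested PID namespace it prints one number per level.
nspid
```

For a process created in a new PID namespace, calling nspid on its status file from the parent namespace would print something like the parent-side PID followed by the PID seen inside the namespace.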

User namespace :

“User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs (see credentials(7)), the root directory, keys (see keyrings(7)), and capabilities (see capabilities(7)). A process’s user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace; in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.”

That’s very interesting, so we can pretend to be god (root) in our namespace, with full capabilities (more on them in the next section) over the resources we own, but outside we are nobody (with very limited capabilities).
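The "root inside, nobody outside" idea comes down to an ID mapping that the kernel exposes in /proc/PID/uid_map, where each line holds the first ID inside the namespace, the ID it maps to outside, and the length of the range. Here is a minimal sketch of that translation; the helper name outside_uid is an assumption made for this example.

```shell
# Translate a UID seen inside a user namespace into the UID it maps to
# outside. Fails (exit status 1) when the UID is not mapped at all.
# $1 = inside UID, $2 = uid_map file (defaults to /proc/self/uid_map)
outside_uid() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { if (!found) exit 1 }
    ' "${2:-/proc/self/uid_map}"
}

# In the initial namespace every ID maps to itself, so this prints 0.
outside_uid 0
```

In a shell started with unshare --user --map-root-user by an ordinary user, the map holds a single line mapping 0 to that user's real UID, so outside_uid 0 would print the unprivileged UID.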

Let’s give it a try :

It’s the only namespace that doesn’t require a special privilege (CAP_SYS_ADMIN) for creation, so I’m going to create one with the oracle user (I will use getpcaps to display the effective capabilities) :

So even if it seems that we are root, the regular privilege checks still kick in when we access resources in the parent namespace that we do not own. Still, I now have enough capabilities to create a new sub-namespace, so let’s create a UTS one and change the hostname :

That’s great 🙂 Let’s go to the next one !

Linux capabilities :

“For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero). Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process’s credentials (usually: effective UID, effective GID, and supplementary group list).

Starting with kernel 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute.”

So in simple words, capabilities provide fine-grained control over superuser permissions. Before having them, we were used to setting the setuid bit on a program whenever it needed special permissions. Take for example the ping executable, which used to have the setuid attribute set :

The setuid bit has been replaced by a more granular privilege, improving security !

To list the effective capabilities of the current process in our session :

We have seen earlier, using user namespaces, how capabilities allow us to create a semi-privileged environment.

For more information on capabilities here is a great post.
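Without any extra tool, the effective set is also visible as a hexadecimal bit mask in the CapEff field of /proc/PID/status, where bit N stands for capability number N (13 for CAP_NET_RAW, 21 for CAP_SYS_ADMIN). A small sketch of decoding it, with helper names chosen for this example only:

```shell
# Print the effective capability mask of a process.
# $1 = status file (defaults to /proc/self/status)
cap_eff() {
    awk '/^CapEff:/ { print $2 }' "${1:-/proc/self/status}"
}

# Succeed if capability number $2 is set in the hexadecimal mask $1.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

mask=$(cap_eff)
echo "CapEff: $mask"
if has_cap "$mask" 21; then
    echo "CAP_SYS_ADMIN is in the effective set"
else
    echo "CAP_SYS_ADMIN is not in the effective set"
fi
```

An ordinary user session normally shows an all-zero mask, while a shell running as root inside a fresh user namespace shows a full one.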

Ok let’s go to the next one :

Linux seccomp :

We can think of Linux seccomp as a syscall filter. It’s used extensively in container environments to reduce the kernel attack surface by limiting the syscalls a process is allowed to make. For example, Docker’s default seccomp profile disables around 44 syscalls out of more than 300. We can see that it’s also used by dbnest :

In this Docker runtime a custom seccomp profile is used.

We can also apply a seccomp policy to a systemd service directly using the “SystemCallFilter” directive.

To check if seccomp is enabled for a process we can use :

grep -i seccomp /proc/*/status
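The number reported in the Seccomp field is the mode: 0 means seccomp is disabled, 1 is strict mode and 2 is filter mode (the one used by container runtimes). A short sketch that turns it into a readable name; the function name seccomp_mode is invented for this example.

```shell
# Print the seccomp mode of a process as a word.
# $1 = status file (defaults to /proc/self/status)
seccomp_mode() {
    awk '/^Seccomp:/ {
        if ($2 == 0) print "disabled"
        else if ($2 == 1) print "strict"
        else if ($2 == 2) print "filter"
        else print "unknown"
    }' "${1:-/proc/self/status}"
}

seccomp_mode                   # mode of the current process
# seccomp_mode /proc/1234/status   # 1234 is a placeholder PID
```

The pattern is anchored on "Seccomp:" so that the separate Seccomp_filters line found on recent kernels is not picked up.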

Ok let’s move to the final one !

Linux cgroups :

“The control groups, abbreviated as cgroups in this guide, are a Linux kernel feature that allows you to allocate resources — such as CPU time, system memory, network bandwidth, or combinations of these resources — among hierarchically ordered groups of processes running on a system. By using cgroups, system administrators gain fine-grained control over allocating, prioritizing, denying, managing, and monitoring system resources. Hardware resources can be smartly divided up among applications and users, increasing overall efficiency.”

To manage cgroups we can use one of the following :

  • Using directives in systemd unit files (recommended way starting from Red Hat Enterprise Linux 7)
  • Using the libcgroup package (deprecated method starting from Red Hat Enterprise Linux 7)
  • Manipulating the cgroup filesystem directly

There are two versions of cgroups, v1 and v2, where the primary difference resides in the fact that “CGroups v1 has cgroups associated with controllers whereas CGroups v2 has controllers associated with cgroups”.
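That difference shows up in /proc/PID/cgroup: with v1 there is one line per controller hierarchy (hierarchy ID, controller list, path), whereas v2 has a single line of the form 0::path. Here is a rough sketch of a lookup that handles both layouts; the helper name cgroup_path is made up for this illustration.

```shell
# Print the cgroup path of a process.
# $1 = v1 controller name (for example "cpu"), or "" for the unified v2 path
# $2 = cgroup file (defaults to /proc/self/cgroup)
cgroup_path() {
    awk -F: -v ctrl="$1" '
        function path(    p) { p = $0; sub(/^[^:]*:[^:]*:/, "", p); return p }
        # v2: the unified hierarchy has ID 0 and an empty controller list.
        ctrl == "" && $1 == 0 && $2 == "" { print path(); exit }
        # v1: look for the controller in the comma-separated list.
        ctrl != "" {
            n = split($2, c, ",")
            for (i = 1; i <= n; i++) if (c[i] == ctrl) { print path(); exit }
        }
    ' "${2:-/proc/self/cgroup}"
}

cgroup_path cpu    # path in the v1 cpu hierarchy, if there is one
cgroup_path ""     # path in the v2 unified hierarchy, if there is one
```

On a pure v1 system the second call prints nothing, and on a pure v2 system the first one prints nothing.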

But let’s not get into too much detail and let’s give it a try :

I will use the stress tool to impose load on the CPU.

Ok, we are at 100% CPU usage. Let’s now use the systemd-run command to create and start a transient service in its own transient cgroup. I used the CPUQuota attribute and set it to 25% to specify how much CPU time the unit gets at maximum, relative to the total CPU time available on one CPU.

That’s great, the stress command doesn’t exceed 25% !

We can use the systemd-cgls command to display the cgroup hierarchy and systemd-cgtop to monitor it :

And there is our dbnest ! But more on it in the next blog post 🙂
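Behind the scenes the percentage becomes a CFS bandwidth limit: a quota of CPU time, in microseconds, that the cgroup may consume per scheduling period. A minimal sketch of that arithmetic, assuming the usual 100 ms period; the unit name stress-demo in the comment is hypothetical.

```shell
# Convert a CPUQuota percentage into a CFS quota in microseconds.
# $1 = percentage of one CPU, $2 = period in microseconds (default 100000)
cpu_quota_us() {
    echo $(( $1 * ${2:-100000} / 100 ))
}

cpu_quota_us 25     # prints 25000: 25 ms of CPU time per 100 ms period
cpu_quota_us 200    # prints 200000: the equivalent of two full CPUs

# A transient service limited this way could be started like this
# (needs root; shown as a comment so the sketch stays harmless):
#   systemd-run --unit=stress-demo -p CPUQuota=25% stress --cpu 1
```

With cgroup v1 the resulting value should appear in the cpu.cfs_quota_us file of the service's cgroup, and with v2 as the first number in cpu.max.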

Using the command “cgsnapshot -s” here is the cgroup configuration for our transient service :

That’s it ! I hope this blog post helped demystify the concept of containers a little bit. Stay tuned for the upcoming blog post on dbnest 🙂
