ImperialViolet

SELinux from the inside out (14 Jul 2009)

There are some great sources of information for users and sysadmins about SELinux [1] [2] but your author has always preferred to understand a system from the bottom-up and, in this regard, found the information somewhat lacking. This document is a guide to the internals of SELinux by starting at the kernel source and working outwards.

We'll be drawing on three different sources in order to write this document.

Access vectors

SELinux is fundamentally about answering questions of the form “May x do y to z?” and enforcing the result. Although the nature of the subject and object can be complex, they all boil down to security identifiers (SIDs), which are unsigned 32-bit integers.

The action boils down to a class and a permission. Each class can have up to 32 permissions (because they are stored as a bitmask in a 32-bit int). Examples of classes are FILE, TCP_SOCKET and X_EVENT. For the FILE class, some examples of permissions are READ, WRITE, LOCK etc.
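To make the bitmask idea concrete, here's a minimal sketch of a permission check against a 32-bit access vector. The bit values and the names (DEMO_FILE__READ, av_permits and so on) are made up for illustration and are not copied from av_permissions.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative permission bits for a FILE-like class (assumed values). */
enum {
        DEMO_FILE__READ  = 1u << 1,
        DEMO_FILE__WRITE = 1u << 2,
        DEMO_FILE__LOCK  = 1u << 6,
};

/* True only if every requested bit is present in the allowed vector. */
static bool av_permits(uint32_t allowed, uint32_t requested)
{
        assert(requested != 0);  /* an empty request is treated as a caller bug */
        return (allowed & requested) == requested;
}
```

Since a request can name several permissions at once, the check compares the masked result with the whole request rather than just testing for a non-zero overlap.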

At the time of writing there are 73 different classes (selinux/libselinux/include/selinux/flask.h) and 1025 different permissions (.../av_permissions.h).

The security policy of a system can be thought of as a table, with subjects running down the left edge, objects across the top and, in each cell, the set of actions which that subject can perform on that object.

This is reflected in the first part of the SELinux code that we'll look at: the access vector cache (security/selinux/avc.c). The AVC is a hash map from (subject, object, class) to the bitset of permissions allowed:

struct avc_entry {
        u32                     ssid;    // subject SID
        u32                     tsid;    // object SID
        u16                     tclass;  // class
        struct av_decision      avd;     // contains the set of permissions for that class
};
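
As a rough illustration of that mapping, here's a toy, direct-mapped cache keyed on (subject SID, object SID, class). It is only a sketch: the names, the slot count and the hash are assumptions, and the kernel's version is more elaborate (it chains entries within a bucket and has to deal with locking and eviction).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_AVC_SLOTS 512  /* hypothetical size; must be a power of two */

struct demo_avc_entry {
        uint32_t ssid;     /* subject SID */
        uint32_t tsid;     /* object SID */
        uint16_t tclass;   /* class */
        uint32_t allowed;  /* cached permission bitset for that class */
        bool     used;
};

static struct demo_avc_entry demo_avc[DEMO_AVC_SLOTS];

/* Mix the three key fields into a slot index (illustrative hash). */
static size_t demo_avc_hash(uint32_t ssid, uint32_t tsid, uint16_t tclass)
{
        return (ssid ^ (tsid << 2) ^ ((uint32_t)tclass << 4)) & (DEMO_AVC_SLOTS - 1);
}

/* Remember a decision, overwriting whatever occupied the slot. */
static void demo_avc_insert(uint32_t ssid, uint32_t tsid, uint16_t tclass,
                            uint32_t allowed)
{
        struct demo_avc_entry *e = &demo_avc[demo_avc_hash(ssid, tsid, tclass)];
        e->ssid = ssid;
        e->tsid = tsid;
        e->tclass = tclass;
        e->allowed = allowed;
        e->used = true;
}

/* 1 = allowed, 0 = denied, -1 = cache miss (the caller must ask elsewhere). */
static int demo_avc_has_perm(uint32_t ssid, uint32_t tsid, uint16_t tclass,
                             uint32_t requested)
{
        assert(requested != 0);
        const struct demo_avc_entry *e = &demo_avc[demo_avc_hash(ssid, tsid, tclass)];
        if (!e->used || e->ssid != ssid || e->tsid != tsid || e->tclass != tclass)
                return -1;
        return (e->allowed & requested) == requested;
}
```

The interesting case is the -1: a miss doesn't mean "denied", it means that something else has to compute the answer before it can be cached.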

The AVC is queried when the kernel needs to make security decisions. SELinux hooks into the kernel using the LSM hooks and is called whenever the kernel is about to perform an action which needs a security check. Consider the getpgid system call to get the current process group ID. When SELinux is built into a kernel, this ends up calling the following hook function (security/selinux/hooks.c):

static int selinux_task_getpgid(struct task_struct *p)
{
        return current_has_perm(p, PROCESS__GETPGID);
}

static int current_has_perm(const struct task_struct *tsk,
                            u32 perms)
{
        u32 sid, tsid;

        sid = current_sid();
        tsid = task_sid(tsk);
        return avc_has_perm(sid, tsid, SECCLASS_PROCESS, perms, NULL);
}

Referring back to the table concept: in order to check if a process with SID x may call getpgid we find x across and x down and check that SECCLASS_PROCESS:PROCESS__GETPGID is in the set of allowed actions.

So now we have to discover what the AVC is actually caching, and where these SIDs are coming from. We'll tackle the latter question first.

SIDs and Security Contexts

SIDs turn out to be much like interned symbols in some languages. Rather than keeping track of complex objects and spending time comparing them during lookups, they are reduced to an identifier via a table. SIDs are the interned identifiers of security contexts. The sidtab maps from one to the other (security/selinux/ss/sidtab.h):

struct sidtab {
        struct sidtab_node **htable;
        unsigned int nel;       /* number of elements */
        unsigned int next_sid;  /* next SID to allocate */
        unsigned char shutdown;
        spinlock_t lock;
};

struct sidtab_node {
        u32 sid;                       /* security identifier */
        struct context context;        /* security context structure */
        struct sidtab_node *next;
};

The SID table is optimised for mapping from SIDs to security contexts. Mapping the other way involves walking the whole hash table.
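
The asymmetry between the two directions can be sketched with a toy table. This is an assumption-laden illustration rather than the kernel's code: a plain string stands in for struct context, the bucket count is arbitrary, all names are invented, and 0 is used here as a "no SID" marker.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_SIDTAB_BUCKETS 128  /* hypothetical bucket count */

struct demo_node {
        uint32_t sid;
        char *context;           /* a string stands in for struct context */
        struct demo_node *next;
};

struct demo_sidtab {
        struct demo_node *htable[DEMO_SIDTAB_BUCKETS];
        uint32_t next_sid;
};

/* SID -> context: hash straight to one bucket and walk its short chain. */
static const char *demo_sid_to_context(const struct demo_sidtab *t, uint32_t sid)
{
        const struct demo_node *n;
        for (n = t->htable[sid % DEMO_SIDTAB_BUCKETS]; n != NULL; n = n->next)
                if (n->sid == sid)
                        return n->context;
        return NULL;
}

/* context -> SID: nothing is indexed by context, so every bucket is scanned.
 * An unknown context is interned. Returns 0 if allocation fails. */
static uint32_t demo_context_to_sid(struct demo_sidtab *t, const char *context)
{
        assert(context != NULL);
        for (size_t i = 0; i < DEMO_SIDTAB_BUCKETS; i++)
                for (struct demo_node *n = t->htable[i]; n != NULL; n = n->next)
                        if (strcmp(n->context, context) == 0)
                                return n->sid;

        struct demo_node *node = malloc(sizeof(*node));
        if (node == NULL)
                return 0;
        node->context = malloc(strlen(context) + 1);
        if (node->context == NULL) {
                free(node);
                return 0;
        }
        strcpy(node->context, context);
        node->sid = ++t->next_sid;  /* SIDs are handed out from 1 upwards */
        const size_t bucket = node->sid % DEMO_SIDTAB_BUCKETS;
        node->next = t->htable[bucket];
        t->htable[bucket] = node;
        return node->sid;
}
```

Interning the same context twice yields the same SID, which is what lets the rest of the system compare two integers instead of two structures.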

The structure for the security context is probably familiar to you if you have worked with SELinux before (security/selinux/ss/context.h):

struct context {
        u32 user;
        u32 role;
        u32 type;
        u32 len;        /* length of string in bytes */
        struct mls_range range;
        char *str;        /* string representation if context cannot be mapped. */
};

If you have an SELinux enabled system, you can look at your current security context with id -Z. Running that will produce something like unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023. This string splits into four parts:

  1. The SELinux “user”: unconfined_u
  2. The role: unconfined_r
  3. The type: unconfined_t (we'll mostly be concentrating on types)
  4. The multi-level-security (MLS) sensitivity and compartments: s0-s0:c0.c1023

(You might notice that the parts are broken up with colons, but that the MLS part can contain colons too! Obviously, this is the only part that can contain colons to avoid ambiguity.)
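
The splitting rule is easy to get wrong, so here's a small sketch of it: split on the first three colons only and treat everything after the third as the MLS part. The structure, the buffer sizes and the function names are invented for this example, and it assumes a four-part (MLS-enabled) context string.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct demo_context {
        char user[64], role[64], type[64], range[128];  /* arbitrary sizes */
};

/* Copy len bytes into dst and terminate; reject empty or oversized fields. */
static bool demo_copy_field(char *dst, size_t cap, const char *src, size_t len)
{
        if (len == 0 || len >= cap)
                return false;
        memcpy(dst, src, len);
        dst[len] = '\0';
        return true;
}

/* Split "user:role:type:range" on the first three colons only. */
static bool demo_parse_context(const char *s, struct demo_context *out)
{
        assert(s != NULL && out != NULL);
        const char *c1 = strchr(s, ':');
        const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
        const char *c3 = c2 ? strchr(c2 + 1, ':') : NULL;
        if (c3 == NULL)
                return false;  /* fewer than four parts */
        return demo_copy_field(out->user, sizeof(out->user), s, (size_t)(c1 - s)) &&
               demo_copy_field(out->role, sizeof(out->role), c1 + 1, (size_t)(c2 - c1 - 1)) &&
               demo_copy_field(out->type, sizeof(out->type), c2 + 1, (size_t)(c3 - c2 - 1)) &&
               demo_copy_field(out->range, sizeof(out->range), c3 + 1, strlen(c3 + 1));
}
```

Fed the string from id -Z above, this leaves s0-s0:c0.c1023 intact in the range field, embedded colon and all.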

When the system's security policy is compiled, these names are mapped to IDs. It's these IDs which end up in the kernel's context structure. Also notice that, by convention, types end in _t, roles in _r and users in _u. Don't confuse UNIX users with SELinux users; they are separate namespaces. For a sense of scale, on a Fedora 11 box, the default policy includes 8 users, 11 roles and 2727 types.

The Security Server

We now address the question of what it is that the access vector cache is actually caching. When a question is asked of the AVC to which it doesn't have an answer, it falls back on the security server. The security server is responsible for interpreting the policy from userspace. The code lives in context_struct_compute_av (in security/selinux/ss/services.c). We'll walk through its logic (and we'll expand on each of these points below):

  1. The subject and object's type are used to index type_attr_map, which results in a set of types for each of them.
  2. We consider the Cartesian product of the two sets and build up a 32-bit allowed bit-vector based on the union of the permissions in the access vector table for each (subject, object) pair.
  3. For each pair in the product, we also include the union of permissions from a second access vector table: the conditional access vector table.
  4. The target type is used to index an array and from that we get a linked list of “constraints”. Each constraint contains byte code for a stack based virtual machine and can limit the granted permissions.
  5. If the resulting set of permissions includes role transition, then we walk a linked list of allowed role transitions. If the transition isn't whitelisted, those permissions are removed.
  6. If either the subject or object's type is ‘bounded’, then we recurse and check the permissions of the bounded types. We verify that the resulting permissions are a subset of the permissions enjoyed by types that they are bounded by. This should be statically enforced by the tool which produced the policy so, if we find a violation, it's logged and the resulting permissions are clipped.

Now, dealing with each of those steps in more detail:

Type attributes

Type attributes are discussed in the Configuring the SELinux Policy report. They are used for grouping types together: by including a type attribute on a new type, the new type inherits all the permissions granted to the type attribute. As can be seen from the description above, type attributes are implemented as types themselves.

These attributes could have been statically expanded by the tool which generated the policy file. Expanding at generation time is a time/space tradeoff and the SELinux developers opted for the smaller policy file.

It's also worth noting that type_attr_map isn't expanded recursively: one can only have one level of type attributes.
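
Steps 1 and 2 of the walk-through can be sketched as follows. Everything here is a simplification under stated assumptions: a flat array of rules stands in for the access vector table, a small array of IDs stands in for a row of type_attr_map, and all of the names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One allow rule: source and target may each be a type or a type attribute. */
struct demo_rule {
        uint32_t source, target;
        uint16_t tclass;
        uint32_t perms;
};

/* A type_attr_map-style row: the type itself plus its attributes. */
struct demo_expansion {
        const uint32_t *ids;
        size_t n;
};

/* Union of the permissions of every rule matching (source, target, class). */
static uint32_t demo_avtab_lookup(const struct demo_rule *rules, size_t nrules,
                                  uint32_t source, uint32_t target, uint16_t tclass)
{
        uint32_t perms = 0;
        for (size_t i = 0; i < nrules; i++)
                if (rules[i].source == source && rules[i].target == target &&
                    rules[i].tclass == tclass)
                        perms |= rules[i].perms;
        return perms;
}

/* Walk the Cartesian product of the two expansions, OR-ing as we go. */
static uint32_t demo_compute_allowed(const struct demo_rule *rules, size_t nrules,
                                     const struct demo_expansion *src,
                                     const struct demo_expansion *tgt,
                                     uint16_t tclass)
{
        assert(src != NULL && tgt != NULL);
        uint32_t allowed = 0;
        for (size_t i = 0; i < src->n; i++)
                for (size_t j = 0; j < tgt->n; j++)
                        allowed |= demo_avtab_lookup(rules, nrules, src->ids[i],
                                                     tgt->ids[j], tclass);
        return allowed;
}
```

Because the expansion is a single lookup per side, a rule written against an attribute applies to every type carrying it, but an attribute of an attribute would never be found.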

Type attributes conventionally end in _type (as opposed to types, which end in _t). In the Fedora 11 policy, here are the top five type attributes:

Name of type attribute Number of types with that attribute
file_type 1406
non_security_file_type 1401
exec_type 484
entry_type 478
domain 442

The graph of types and type attributes is, as expected, bipartite.

The conditional access vector table

The conditional access vector table contains permissions just like the regular access vector table except that each, optionally, has an extra flag: AV_ENABLED (security/selinux/avtab.h). This flag can be enabled and disabled at run time by changing the value of ‘booleans’. These booleans are quite well covered by the higher-level documentation for the policy language (here and here).

The set of booleans can be found in /selinux/booleans (if you are running SELinux). They can be read without special authority although you should be aware of a bug: trying to read more than a page from one of those files results in -EINVAL and recent coreutils binaries (like cat) use a buffer size of 32K. Instead you can use dd, or just run the friendly tool: semanage boolean -l.

The AV_ENABLED flag is updated when a boolean is changed. The conditional access vector table is populated by a list of cond_node structures (security/selinux/conditional.h). These contain bytecode for a limited, stack based machine and two lists of access vectors which should be enabled or disabled in the case that the machine returns true or false.

The stack machine can read any of the configured booleans and combine them with standard boolean algebra, returning a single bit result.
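
A machine of this kind fits in a few lines. The sketch below is not the kernel's evaluator: the opcode names, the instruction layout and the stack depth are assumptions chosen to show the idea of a postfix program over the configured booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_op { DEMO_BOOL, DEMO_NOT, DEMO_AND, DEMO_OR, DEMO_XOR };

struct demo_insn {
        enum demo_op op;
        size_t bool_index;  /* only meaningful for DEMO_BOOL */
};

#define DEMO_STACK_DEPTH 10  /* hypothetical limit */

/* Runs a postfix program. Returns 1 or 0, or -1 if the program is malformed. */
static int demo_cond_eval(const struct demo_insn *prog, size_t len,
                          const bool *booleans, size_t nbooleans)
{
        bool stack[DEMO_STACK_DEPTH];
        size_t sp = 0;  /* number of values currently on the stack */

        assert(prog != NULL && booleans != NULL);
        for (size_t i = 0; i < len; i++) {
                switch (prog[i].op) {
                case DEMO_BOOL:  /* push the current value of a boolean */
                        if (sp == DEMO_STACK_DEPTH || prog[i].bool_index >= nbooleans)
                                return -1;
                        stack[sp++] = booleans[prog[i].bool_index];
                        break;
                case DEMO_NOT:   /* negate the top of the stack in place */
                        if (sp < 1)
                                return -1;
                        stack[sp - 1] = !stack[sp - 1];
                        break;
                case DEMO_AND:
                case DEMO_OR:
                case DEMO_XOR:   /* pop two operands, push one result */
                        if (sp < 2)
                                return -1;
                        sp--;
                        if (prog[i].op == DEMO_AND)
                                stack[sp - 1] = stack[sp - 1] && stack[sp];
                        else if (prog[i].op == DEMO_OR)
                                stack[sp - 1] = stack[sp - 1] || stack[sp];
                        else
                                stack[sp - 1] = stack[sp - 1] != stack[sp];
                        break;
                default:
                        return -1;
                }
        }
        return sp == 1 ? (int)stack[0] : -1;
}
```

In terms of the cond_node description above, the single-bit result would then select which of the two lists of access vectors gets AV_ENABLED set.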

Constraints

One of the parts of the SELinux policy language is the ability to define constraints. Constraints are defined using the neverallow command. Constraints are used to prevent people from writing bad policy, or in the case of MLS, to enforce rules governing information flow. http://danwalsh.livejournal.com/12333.html

As you can see if you read the above linked blog post, constraints are statically enforced by the policy tools where possible and also checked by the kernel. Constraints are evaluated by running a stack-machine bytecode. (This is a different machine than that which is used for the conditional access vector table.) Based on the kernel code for the stack-machine, we can write a simple disassembler and see what constraints are enforced in the kernel.

In the Fedora 11 policy, 32 classes have constraints applied to them. Let's have a look at some of them. Here's the first one:

constraint for class 'process' permissions:800000:
  DYNTRANSITION
subject.user == object.user?
subject.role == object.role?
and

Roughly translated, this means “Whenever operating on an object of class process, the DYNTRANSITION permission is forbidden unless the user and role of the subject and object match”. A DYNTRANSITION (dynamic transition) is when a process switches security contexts without execing a binary. Think of it like a setuid call for security contexts (we'll cover how to perform this later).

Here's another constraint, a longer one this time:

constraint for class 'file' permissions:188:
  CREATE
  RELABELFROM
  RELABELTO
subject.user == object.user?
[bootloader_t, devicekit_power_t, logrotate_t, ldconfig_t, unconfined_cronjob_t, unconfined_sendmail_t, setfiles_mac_t,
initrc_t, sysadm_t, ada_t, fsadm_t, kudzu_t, lvm_t, mdadm_t, mono_t, rpm_t, wine_t, xdm_t, unconfined_mount_t,
oddjob_mkhomedir_t, saslauthd_t, krb5kdc_t, newrole_t, prelink_t, anaconda_t, local_login_t, rpm_script_t,
sysadm_passwd_t, system_cronjob_t, tmpreaper_t, samba_unconfined_net_t, unconfined_notrans_t, unconfined_execmem_t,
devicekit_disk_t, firstboot_t, samba_unconfined_script_t, unconfined_java_t, unconfined_mono_t,
httpd_unconfined_script_t, groupadd_t, depmod_t, insmod_t, kernel_t, kpropd_t, livecd_t, oddjob_t, passwd_t, apmd_t,
chfn_t, clvmd_t, crond_t, ftpd_t, inetd_t, init_t, rshd_t, sshd_t, staff_t, udev_t, virtd_t, xend_t, devicekit_t,
remote_login_t, inetd_child_t, qemu_unconfined_t, restorecond_t, setfiles_t, unconfined_t, kadmind_t,
ricci_modcluster_t, rlogind_t, sulogin_t, yppasswdd_t, telnetd_t, useradd_t, xserver_t] contains subject.type?
or

This means that when you create a file or change its security context, either the SELinux user of the file has to match your current SELinux user, or you have to be one of a list of privileged types.

One last example foreshadows several large subjects: user-land object managers and multi-level security. For now I'll leave it undiscussed to whet your appetite.

constraint for class 'db_database' permissions:7de:
  DROP
  GETATTR
  SETATTR
  RELABELFROM
  ACCESS
  INSTALL_MODULE
  LOAD_MODULE
  GET_PARAM
  SET_PARAM
object.sensitivity[high] dominates type?

Roles and users

In step 5, above, we mention ‘role transitions’, so we should probably discuss SELinux users and roles. Keep in mind that SELinux users are separate from normal UNIX users.

Each type inhabits some set of roles and each role inhabits some set of SELinux users. UNIX users are mapped to SELinux users at login time (run `semanage login -l`) and so each user has some set of roles that they may operate under. Roles serve the same purpose as the standard custom of administering a system by logging in as a normal user and using sudo only for the tasks which need root privilege. Although a given physical user may need to perform administrative tasks, they probably don't want to have that power all the time. If they did, then there would be a confused deputy problem when they perform what should be an unprivileged task which does far more than intended because they performed it with excess authority.

Here's the graph of users and roles in the Fedora 11 targeted policy:

An SELinux user can move between roles with the newrole command, if such a role transition is permitted. Here's the graph of permitted role transitions in the Fedora policy:

With the targeted policy at least, roles and users play a relatively small part in SELinux and we won't cover them again.

Bounded types

A type in SELinux may be “bounded” to another type. This means that the bounded type's permissions are a subset of its parent's and here we find the beginnings of a type hierarchy. The code for enforcing this originally existed only in the user-space tools which build the policy, but recently it was directly integrated into the kernel.

In the future, this will make it possible for a lesser privileged process to safely carve out subsets of policy underneath the administratively-defined policy. At the time of writing, this functionality has yet to be integrated in any shipping distribution.

(Thanks to Stephen Smalley for clearing up this section.)

The SELinux filesystem

The kernel mostly communicates with userspace via filesystems. There's both the SELinux filesystem (usually mounted at /selinux) and the standard proc filesystem. Here we'll run down some of the various SELinux specific entries in each.

But first, a quick note. Several of the entries are described as ‘transaction’ files. This means that you must open them, perform a single write and then a single read to get the result. You must use the same file descriptor for both (so, no echo, cat pairs in shell scripts).
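
In C, the pattern looks something like the following sketch. The helper name is invented and error handling is kept to a minimum; the important part is that the write and the read share one descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* One write, then one read, on the same descriptor.
 * Returns the number of bytes read (reply is NUL-terminated), or -1. */
static ssize_t demo_transact(const char *path, const char *request,
                             char *reply, size_t reply_size)
{
        assert(path != NULL && request != NULL && reply != NULL && reply_size > 0);
        const int fd = open(path, O_RDWR);
        if (fd < 0)
                return -1;

        const size_t len = strlen(request);
        ssize_t n = write(fd, request, len);
        if (n == (ssize_t)len) {
                n = read(fd, reply, reply_size - 1);
                if (n >= 0)
                        reply[n] = '\0';
        } else {
                n = -1;  /* short or failed write: give up on the transaction */
        }
        close(fd);
        return n;
}
```

On a system with SELinux mounted at /selinux, calling this with the path /selinux/context and a context string would be one way to exercise the simplest of the transaction files described below.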

/selinux/enforcing

A boolean file which specifies if the system is in ‘enforcing’ mode. If so, SELinux permissions checks are enforced. Otherwise, they only cause audit messages.

(Read: unprivileged. Write: requires root, SECURITY:SETENFORCE and that the kernel be built with CONFIG_SECURITY_SELINUX_DEVELOP.)

/selinux/disable

A write only, boolean file which causes SELinux to be disabled. The LSM hooks are reset, the SELinux filesystem is unregistered etc. SELinux can only be disabled once and probably doesn't leave your kernel in the best of states.

(Read: unsupported. Write: requires root, and that the kernel be built with CONFIG_SECURITY_SELINUX_DISABLE.)

/selinux/policyvers

A read only file which contains the version of the current policy. The version of a policy is contained in the binary policy file and the kernel contains logic to deal with older policy versions, should the version number in the file suggest that it's needed.

(Read: unprivileged. Write: unsupported.)

/selinux/load

A write only file which is used to load policies into the kernel. Loading a new policy triggers a global AVC invalidation.

(Read: unsupported. Write: requires root and SECURITY:LOAD_POLICY.)

/selinux/context

A transaction file. One writes a security context string and then reads the resulting, canonicalised context. The context is canonicalised by running it via the sidtab.

(Read/Write: unprivileged.)

/selinux/checkreqprot

A boolean file which determines which permissions are checked for mmap and mprotect calls. In certain cases the kernel can actually grant a process more access than it requests with these calls. (For example, if a shared library is marked as needing an executable stack, then the kernel may add the PROT_EXEC permission if the process didn't request it.)

If the value of this boolean is 1, then SELinux checks the permissions requested by the process. If 0, it checks the permissions which the process will actually receive.
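
Put as code, the choice is a one-liner. The function name is invented; reqprot is what the process asked for and prot is what the kernel is about to apply.

```c
#include <assert.h>
#include <sys/mman.h>

/* Pick the protection bits that the SELinux check should be run against. */
static int demo_prot_to_check(int checkreqprot, int reqprot, int prot)
{
        assert(checkreqprot == 0 || checkreqprot == 1);
        return checkreqprot ? reqprot : prot;
}
```

For the executable-stack case above, with reqprot set to PROT_READ and prot set to PROT_READ | PROT_EXEC, a value of 1 means the execute permission is never checked.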

(Read: unprivileged. Write: requires root, and SECURITY:SETCHECKREQPROT.)

/selinux/access

A transaction file which allows a user-space process to query the access vector table. This is the basis of user-space object managers.

The write phase consists of a string of the following form: ${subject security context (string)} ${object security context (string)} ${class (uint16_t, base 10)} ${requested permissions (uint32_t bitmap, base 16)}.

The read phase results in a string with this format: ${allowed permissions (uint32_t bitmap, base 16)} 0xffffffff ${audit allow (uint32_t bitmap, base 16)} ${audit deny (uint32_t bitmap, base 16)} ${sequence number (uint32_t, base 10)} ${flags (uint32_t, base 16)}.

This call will be covered in greater detail in the User-space Object Managers section, below.

(Read/Write: SECURITY:COMPUTE_AV.)
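
Building the request and picking apart the reply can be sketched with snprintf and sscanf. The structure and function names are made up for this example, and the parser treats the trailing flags field as optional since not every kernel emits it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct demo_av_reply {
        unsigned int allowed, decided, auditallow, auditdeny, seqno, flags;
        int has_flags;  /* 0 if the kernel didn't send the flags field */
};

/* Builds the write-phase string. Returns its length, or -1 if buf is too small. */
static int demo_format_access_request(char *buf, size_t size,
                                      const char *scon, const char *tcon,
                                      uint16_t tclass, uint32_t requested)
{
        assert(buf != NULL && scon != NULL && tcon != NULL);
        const int n = snprintf(buf, size, "%s %s %u %08x", scon, tcon,
                               (unsigned int)tclass, (unsigned int)requested);
        return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Parses the read-phase string. Returns 0 on success, -1 on a malformed reply. */
static int demo_parse_access_reply(const char *s, struct demo_av_reply *r)
{
        assert(s != NULL && r != NULL);
        r->flags = 0;
        const int n = sscanf(s, "%x %x %x %x %u %x", &r->allowed, &r->decided,
                             &r->auditallow, &r->auditdeny, &r->seqno, &r->flags);
        if (n < 5)
                return -1;
        r->has_flags = (n == 6);
        return 0;
}
```

A caller would then test the permission it cares about with (reply.allowed & requested) == requested.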

Attribute files

SELinux is also responsible for a number of attribute files in /proc. The attribute system is actually a generic LSM hook, although the names of the nodes are currently hardcoded into the code for the proc filesystem.

/proc/pid/attr/current

Contains the current security context for the process. Writing to this performs a dynamic transition to the new context. In order to do this:

  • The current security context must have PROCESS:DYNTRANSITION to the new context.
  • The process must be single threaded or the transition must be to a context bounded by the current context.
  • If the process is being traced, the tracer must have permissions to trace the new context.

(Read: PROCESS:GETATTR. Write: only allowed for the current process and requires PROCESS:SETCURRENT)

/proc/pid/attr/exec

Sets the security context for child processes. The permissions checking is done at exec time rather than when writing this file.

(Read: PROCESS:GETATTR. Write: only allowed for the current process and requires PROCESS:SETEXEC)

/proc/pid/attr/fscreate

Sets the security context for files created by the current process. The permissions checking is done at creat/open time rather than when writing this file.

(Read: PROCESS:GETATTR. Write: only allowed for the current process and requires PROCESS:SETFSCREATE)

/proc/pid/attr/keycreate

Sets the security context for keys created by the current process. Keys support in the kernel is documented in Documentation/keys.txt. The permissions checking is done at creation time rather than when writing this file.

(Read: PROCESS:GETATTR. Write: only allowed for the current process and requires PROCESS:SETKEYCREATE)

/proc/pid/attr/sockcreate

Sets the security context for sockets created by the current process. The permissions checking is done at creation time rather than when writing this file.

(Read: PROCESS:GETATTR. Write: only allowed for the current process and requires PROCESS:SETSOCKCREATE)

User-space object managers

Although the kernel is a large source of authority for many processes, it's certainly not the only one these days. An increasing amount of ambient authority is being granted via new services like DBus and then there's always the venerable old X server which, by default, allows clients to screenshot other windows, grab the keyboard input etc.

The most common example of a user-space process attempting to enforce a security policy is probably a SQL server. PostgreSQL and MySQL both have a login system and an internal user namespace, permissions database etc. This leads to administrators having to learn a whole separate security system, use password authentication over local sockets, include passwords embedded in CGI scripts etc.

User-space object managers are designed to solve this issue by allowing a single policy to express the allowed actions for objects which are managed outside the kernel. The NSA has published a number of papers about securing these types of systems: X: [1] [2], DBus: [1]. See also the SE-PostgreSQL project for details on PostgreSQL and Apache.

In order to implement such a design a user-space process needs to be able to label its objects, query the global policy and determine the security context of requests from clients. The libselinux library contains the canonical functions for doing all these things, but this document is about what lies under the hood, so we'll be doing it raw here.

The task of labeling objects is quite specific to each different object manager and this problem is discussed in the above referenced papers. Labels need to be stored (probably persistently) and administrators need some way to query and manipulate them. For example, in the X server, objects are often labeled with a type which derives from their name ("XInput" → "input_ext_t").

When it comes to querying the policy database, a process could either open the policy file from disk (which we'll cover later) or it could query the kernel. Querying the kernel solves a number of issues around locating the policy and invalidating caches when it gets reloaded, so that's the path which the SELinux folks have taken. See the section on /selinux/access for the interface for doing this.

In order to authenticate requests from clients, SELinux allows a process to get the security context of the other end of a local socket. There are efforts underway to extend this beyond the scope of a single computer, but I'm going to omit the details for brevity here.

I'll start with a code example of this authentication first:

  const int afd = socket(PF_UNIX, SOCK_STREAM, 0);
  assert(afd >= 0);

  struct sockaddr_un sun;
  memset(&sun, 0, sizeof(sun));
  sun.sun_family = AF_UNIX;
  strcpy(sun.sun_path, "test-socket");
  assert(bind(afd, (struct sockaddr*) &sun, sizeof(sun)) == 0);
  assert(listen(afd, 1) == 0);
  const int fd = accept(afd, NULL, NULL);
  assert(fd >= 0);

  char buf[256];
  socklen_t bufsize = sizeof(buf);
  assert(getsockopt(fd, SOL_SOCKET, SO_PEERSEC, buf, &bufsize) == 0);
  printf("%s\n", buf);

This code snippet will print the security context of any process which connects to it. Running it without any special configuration on a Fedora 11 system (targeted policy) will result in a context of unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023. Don't try running it on a socket pair, however: you end up with system_u:object_r:unlabeled_t:s0.

If you already have code which is using SCM_CREDENTIALS to authenticate peers, you can use getpidcon to get a security context from a PID. Under the hood this just reads /proc/pid/attr/current.

Now that we can label requests, the next part of the puzzle is getting access decisions from the kernel. As hinted at above, the /selinux/access file allows this. See above for the details of the transaction format. As an example, we'll see if the action PROCESS:GETATTR, with a subject and object of unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023, is permitted.

  → unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 2 00010000
  ← f77fffff ffffffff 0 fffafb7f 11

This is telling us that it is permitted (the only bits missing are for EXECHEAP and DYNTRANSITION). It also tells us the permissions which should be logged on allow and deny and the sequence number of the policy state in the kernel. Note that, above, we documented an additional flags field, however it's missing in this example. That's another good reason to use libselinux! The flags field was only recently added and isn't in the kernel which I'm using for these examples.

At this time, the astute reader will be worried about the performance impact of getting this information from the kernel in such a manner. The solution is to use the same access vector cache code that the kernel uses, in user-space to cache the answers from the kernel. This is another benefit which libselinux brings.

However, every cache brings with it problems of consistency and this is no different. All user-space object managers need to know when the administrator updates the system security policy so that they can flush their AVCs. This notification is achieved via a netlink socket, as demonstrated by the following snippet:

  const int fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_SELINUX);
  assert(fd >= 0);

  struct sockaddr_nl addr;
  int len = sizeof(addr);
  memset(&addr, 0, len);
  addr.nl_family = AF_NETLINK;
  addr.nl_groups = SELNL_GRP_AVC;
  assert(bind(fd, (struct sockaddr*) &addr, len) == 0);

  struct sockaddr_nl nladdr;
  socklen_t nladdrlen;
  char buf[1024];
  struct nlmsghdr *nlh = (struct nlmsghdr *)buf;

  for (;;) {
    nladdrlen = sizeof(nladdr);
    const ssize_t r = recvfrom(fd, buf, sizeof(buf), 0,
                               (struct sockaddr*) &nladdr, &nladdrlen);
    assert(r >= 0);
    assert(nladdrlen == sizeof(nladdr));
    assert(nladdr.nl_pid == 0);
    assert((nlh->nlmsg_flags & MSG_TRUNC) == 0);
    assert(nlh->nlmsg_len <= r);

    if (nlh->nlmsg_type == SELNL_MSG_SETENFORCE) {
      struct selnl_msg_setenforce *msg = NLMSG_DATA(nlh);
      printf("enforcing %s\n", msg->val ? "on" : "off");
    } else if (nlh->nlmsg_type == SELNL_MSG_POLICYLOAD) {
      struct selnl_msg_policyload *msg = NLMSG_DATA(nlh);
      printf("policy loaded, seqno:%d\n", msg->seqno);
    }
  }

If you toggle the enforcing mode off and on, or reload the system policy (with `semodule -R`), a message is delivered via the netlink socket. A user-space object manager can then flush its AVC etc.

With all the above, hopefully it's now clear how user-space object managers work. If you wish to write your own, remember to read the libselinux man pages first.

Reading binary policy files

The system security policy is written in a text-based language which has been well documented elsewhere. These text files are compiled and checked by user-space tools and converted into a binary blob that can be loaded into the kernel. The binary blob is also saved on disk and can be a useful source for information.

The SELinux user-space tools contain libsepol which is very useful for parsing these files. Here's a snippet of example code which returns the number of users, roles and types defined in a policy file:

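#include <stdio.h>

#include <sepol/policydb.h>
#include <sepol/policydb/policydb.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "usage: %s <policy file>\n", argv[0]);
    return 1;
  }

  FILE* file = fopen(argv[1], "r");
  if (file == NULL) {
    perror("fopen");
    return 1;
  }

  // Wrap the FILE* so that libsepol can read from it.
  sepol_policy_file_t* input;
  sepol_policy_file_create(&input);
  sepol_policy_file_set_fp(input, file);

  // Parse the binary policy into an in-memory policy database.
  sepol_policydb_t* policy;
  sepol_policydb_create(&policy);
  if (sepol_policydb_read(policy, input) < 0) {
    fprintf(stderr, "failed to read policy\n");
    return 1;
  }

  // nprim is the number of primary names in each symbol table.
  printf("users:%u roles:%u types:%u\n",
         policy->p.p_users.nprim,
         policy->p.p_roles.nprim,
         policy->p.p_types.nprim);

  return 0;
}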

By looking in the sepol/policydb/policydb.h header, you can probably find whatever you are looking for. Pay special heed to the comments about indexing however. Users, roles and types are indexed from 1 in some places and from 0 in others.

With a little C code, much of the useful information can be extracted from the policy files. The numbers and graphs above were generated this way, with a little help from a few Python scripts.

Conclusion

Hopefully we've covered some useful areas of SELinux that some people were unfamiliar with before, or at least shown the inner workings of something which you already knew about.

If you want information about the practical aspects of administering a system with SELinux, you should start with the Fedora documentation on the subject. After reading this document I hope that some of it is clearer now.