Adam Langley (firstname.lastname@example.org)
This is LSMSB, a sandboxing scheme for Linux based on the ideas of the OS X sandbox (which, in turn, was inspired by TrustedBSD and FreeBSD).
Imagine that you're working on a university computer and you get a binary which promises to do some fiendishly complex calculation, reading from a file ./input and writing to a file ./output. It also talks to a specific server to access a pre-computed lookup table. You want to run it, but you don't want to have to trust that it won't do anything malicious (save giving the wrong answer).
You current options are very limited. Without root access you cannot setup a chroot jail, as troublesome as that is. If the system has SELinux or AppArmor installed, the tools are there but those are MAC systems and only root can define their policies.
Your best bet, currently, is either to use ptrace or to run a whole virtual machine. The former is slow and difficult to get right in the face of threads and the latter is a sledgehammer when we just want to crack a nut.
To address these concerns we need a sandboxing system which is:
We present a sandboxing scheme using the LSM hooks in the Linux kernel. At the moment this scheme is a prototype only and this code is based off of 2.6.30-rcx.
1 Kernel code
LSMSB uses the Linux Security Modules hooks to intercept security decisions in the kernel. The policies are implemented by tables of rules which are uploaded from user-space. Each process has a stack of zero or more sandboxes. Each sandbox may define a rule table for a given operation. An action must be accepted by all the sandboxes in the stack for it to be permitted. Child processes inherit the sandbox stack of their parents and are free to push extra sandboxes onto their stack. By construction, this can only reduce their authority.
2 Rule tables
An LSMSB filter evaluates a table of rules to decide if a given action is permitted or not. The table of rules is specific to a given filter as the filter defines the context in which the table is evaluated. The context determines which arguments are provided to the table, their order, syntax and semantics. It also defines the meaning of the return code although, by convention, zero always means reject and non-zero means accept.
The rule tables are very limited. Conditionals do exist, but a rule can only jump forward, thus there are no loops. This also means that the rule tables are not Turing complete and are guaranteed to terminate.
Each rule in the table is a single 32-bit unsigned integer. Because constant values often don't fit in 32-bits, we need another way to deal with them. Thus constants are kept in a side array, linked with each filter. The rules in the table can reference them by their index.
For working storage, the rules in the table have an array of 16 registers which can either store a 32-bit unsigned integer or a byte-string (which is a normal string, but may contain NUL characters). If a table is exceedingly complex, it may need more than 16-registers of storage to hold temporary values. For these situations, we also allow the table to specify the number of spill slots that it needs. Spill slots act just like registers except that they have to be read and written explicitly.
From this, the structure for a filter is obvious:
3 Filter values
The contents of registers, spill slots and constants are all ‘values’. These values are either a 32-bit unsigned int or a bytestring and represented by the following structure.
For integer values, value is the integer and data is set to NULL. For bytestrings, data is a pointer to the bytes (and thus not NULL) and value is the number of bytes.
Each rule in the table consists of at least an operation, which is encoded in the top 8-bits. Given the operation, there are often other arguments (register numbers etc) encoded in the remaining 24-bits, the format of which is specific to each operation.
The operations are numbered from zero by the following enum:
|MOV||Move||reg1 = reg2|
|LDI||Load immediate||reg1 = immediate value|
|LDC||Load constant||reg1 = constant value|
|RET||Return value in register||terminates the filter|
|JMP||Jump||Skips the next n rules|
|SPILL||Spill register||spill1 = reg1|
|UNSPILL||Unspill register||reg1 = spill1|
|JC||Jump conditionally||If reg1 > 0 then skip the next n rules|
|EQ||Equal?||reg1 = reg2 == reg3|
|GT||Greater than?||reg1 = reg2 > reg3|
|LT||Less than?||reg1 = reg2 < reg3|
|GTE||Greater than or equal?||reg1 = reg2 >= reg3|
|LTE||Less than or equal?||reg1 = reg2 <= reg3|
|AND||Bitwise conjunction||reg1 = reg2 & reg3|
|OR||Bitwise disjunction||reg1 = reg2 | reg3|
|XOR||Bitwise exclusive-or||reg1 = reg2 ^ reg3|
|ISPREFIXOF||Bytestring prefix||reg1 = reg2 ⊆ reg3|
5 Available filter types
The type of a filter signifies the operation which it intends to filter, as well as the context that it runs in. (See above). Here are the currently defined filters.
|DENTRY_OPEN||File open||Filename(bs) and mode(i)|
The rule tables are simple enough that they can be validated such that we can know they they can never cause a run-time error. This is helpful as it means that we can pay the verification cost once (when loading the table) and omit run-time checks when evaluating.
7 Predecessor tables
For each rule in the table there is a set of rules which can be the immediate predecessor of that rule. In the simple case, the rule preceding will fall though and be the single predecessor. In more complex cases, a rule might be the target of several jumps and a fall though.
The predecessor table conceptually has n rows and x columns, where n is the number of rules in the rule table and x is a constant value. If we didn't have jumps each rule could only have a single predecessor, x could be one, and we would have a simple array. However, in the presence of jumps x may need to be greater than one.
In the pessimal case, every rule in the table except the last is a jump to the very last one. Then x would need to be n - 1, even though most of the slots in the table would be unused. Therefore we take a hybrid approach. We define x to be a small number and, if a row overflows, we mark it with a magic value. Most of the time this will work fine but, in exceptional cases, we have to deal with the overflows.
8 Appending to a predecessor table row
Appending to a row simply involves scanning the entries in the row until a free slot is found. If we don't find a free slot we mark the first entry in the row with the magic overflow tag.
9 Building the predecessor table
We build the predecessor table by first clearing it then, for each rule, we find its one or two successor instructions and mark it as a predecessor for those instructions.
10 Clearing the predecessor table
Since we chose our magic unused value to be 0xffff, we can clear the table using memset.
11 Type unification
Type unification is the process of finding, for each rule, the set of possible types for each register and spill slot. Since we only have two types (integers and bytestrings), we can encode this using only a couple of bits.
If we then find a rule which is going to operate on a register which could possibly hold the wrong type, we reject the rule table.
12 Type vectors
A type vector is an array of 2-bit values, one for each register and spill slot. Each 2-bit value is either:
|Undefined:||at the start of evaluation, every spill slot and register which isn't defined by the context has an undefined type.|
|Integer:||an integer type.|
|Bytestring:||a byte-string type.|
|Conflicting:||the type differs depending on the control flow.|
13 Unifying type vectors
Given two type vectors (say, the type vectors from the two predecessors of a rule), we unify them by considering each element of each vector, pairwise. Then, for each pair, if they match then we output that value, otherwise, we record the value as conflicting.
To implement this quickly, we try to operate on many elements concurrently. Most of the time a rule table will not use spill slots so the type vector will be 2 × 16 = 32 bits long and we can do them all in one go.
To understand the code below, consider a single pair of 2-bit inputs. If we exclusive-or them, we'll end up with a true bit in the result iff the inputs differed. If we could map 01, 10, and 11 to 11 and then OR with the original input, that would map equal inputs to themselves and differing inputs to LSMSB_TYPE_CONFLICTING. (Remember that LSMSB_TYPE_CONFLICTING has a value of 11 in binary).
In order to transform the exclusive-or output we define left to be the left bit and right to be the right bit. Then we OR together left, right, left >> 1 and right << 1. This results in the desired mapping.
We work with 32-bit quantities where possible and fall back to 8-bit values otherwise. Although the type vector might not be a multiple of 4 values long, we still operate on the undefined bits at the end because it's harmless and faster.
14 Updating the type vectors for an operation
Each operation may require certain types as inputs and may mutate the type of outputs. We model this in a function which operates on type vectors.
At any point, if an operation would encounter a type error, we return 0.
15 Type vector arrays
A type vector array is a simply an array of type vectors, one for each operation in the rule table. We can build a type vector array by iterating over the operations, unifying the type vectors for all predecessors and then updating the type vector for that operation.
By the end we will have either found a type error or proved that evaluating the rule table will not encounter any run-time errors.
16 Unifying normal rows
With a row where the predecessor table didn't overflow, we can easily find the predecessor operations. For the first one, we need to copy its type vector. For all subsequent predecessors, we unify the current type vector with their type vector.
Here we also deal with the case of an operation with no predecessor. This can only occur if no control flow path leads to it. In this case, we reject the rule table. We assume that the tools used to generate the rule tables are sufficiently smart to remove dead code and we don't want to waste kernel memory keeping it around.
17 Unifying overflow rows
When the predecessor table is marked as overflowed, we have to find all the predecessors ourselves. We can do this by checking all previous operations for jumps to the current operation and checking the previous instruction for fall though. (Recall that we only have forward jumps.)
18 Testing for predecessor rules
19 Type vector utility functions
20 Bringing typechecking together
Pulling together the above code is straight-forward. We allocate memory for the predecessor table and type vector array and generate them both. If we manage to generate them both and the rule table checks out, then we're done.
21 Evaluating filters
Since typechecking has eliminated most run-time errors from the filter, the evaluation of filters can happen without many of those checks.
The evaluation is straight-forward: a register machine is simulated with with the register values in an array on the stack. Each instruction is dispatched using a switch. There are several pieces of low-hanging fruit that could make this code faster but, for now, we choose the simplest code that works.
22 Utility functions
We used a few utility functions in this code which we'll now flesh out.
23 Filter contexts
The context of a filter (the semantics and types of the registers on entry) are specified implicitly in the code for running each different type of filter. In order to perform typechecking we need to duplicate that information , or at least the types, here.
24 Getting an initial type vector for a filter
Once we have filter_contexts, we can define a function to build an initial type vector for a filter given the type string.
25 Installing sandboxes.
A sandbox is installed from userspace by writing a compact representation of a set of filters to the kernel. Currently, the userspace process does this by writing to /proc/self/sandbox, although that could change in future versions.
26 External structures
Several structures are used in the data which is provided to the kernel. These structures thus become part of the kernel ABI. They are named with a _wire suffix to mark them as such.
27 Installing the sandbox
The filters are prefixed by a uint32_t which contains the number of filters in this sandbox. As with most kernel APIs, the values are in native-endian.
Then we read each filter in turn and insert it into the correct place in the sandbox. Once we have a complete sandbox, we can push it onto the stack of active sandboxes for the current process.
28 Pushing a new sandbox
The sandboxes are conceptually in a stack. The top of the stack is pointed by by the struct task_struct of a given process. Thus, pushing a new sandbox on the top of the stack involves an RCU update of the current struct task_struct.
29 Limiting the number of sandboxes
In order to stop userspace processes from consuming kernel memory unboundedly, we limit the number of sandboxes which can be active for any given process.
30 Installing a filter
Installing a filter involves copying the filter header from userspace, followed by the operation stream and any constants. At this point a number of limits are imposed on the sizes of the various structures for sanity's sake and also to avoid having to worry about integer overflows in other parts of the code.
At this point we also learn the filter_code, which identifies the hook that this this filter processes. Once we know the filter code, we can typecheck the filter.
31 Installing a constant
Each constant is described by a struct constant_wire and, optionally, followed by its data if it's a bytestring.
32 Interfacing with LSM
LSM modules provide a structure of function pointers. Each function pointer corresponds to an LSM hook. For each hook that we wish to implement we need to provide a function which converts the hook's arguments to the form which the corresponding filter expects and runs the filter (if any) for each sandbox stacked on the current process.
33 Handling sandbox lifetimes
We only have to provide a couple of functions to handle sandbox lifetimes. One to destroy sandboxes and another to duplicate them. (The sandboxes are reference countered, so ‘duplicating’ is very cheap.)
34 cred structures
A struct cred contains the authority of a process; it's various UIDs and GIDs and an opaque pointer to the LSM data for a process. In our case, that points to the sandbox that is at the top of the stack for the process.
We need a couple of functions to deal with them. The first comes into play when duplicating a cred structure for a new process. The new process initially gets the same sandbox stack as the parent so we just copy the pointer and increment the reference count.
The second deals with freeing a cred structure. When freeing a sandbox we have to keep in mind that the parent pointer is reference counted. Thus, when we delete a sandbox that might cause it's parent to be deleted and so on.
35 The dentry_open hook
We now come the list of hooks. These functions are LSM hook functions which convert their arguments into a context for a filter and evaluate the stack of sandboxes.
dentry_open is called when a process opens a file. This function is currently incomplete as it doesn't deal with files which cannot be named in the current context (for example, the file is outside the current root for the process). Nor does it currently deal differences between the view of the filesystem that was active when the sandbox was installed vs the current view of the filesystem.
37 Initialising the module
LSM modules, despite the name, can no longer actually be loadable modules. They are initialised during the boot sequence by the security code so that they can label kernel objects early on.
The security code only allows a single LSM module to register. Either this is the first module that tries to register, or the module named on the command line.
40 The assembler
Obviously humans are going to need some assistance when building these filter structures to load into the kernel. Preferably, a high level language would allow programmers to precisely specify their desired level of access. For now at least, we only provide a low level, assembly like language for this purpose.
The following code defines a sandbox with a single filter for dentry-open. If you recall, the context of dentry-open specifies that the mode argument to open is provided in register one. The following code tests the least significant bit of this argument (which requests write access) and fails if it's set.
41 A more complex example
Here we have a more complex example which involves bytestrings and constants. As you can see, we define a bytestring constant named etc-prefix which we can use as an argument to ldc.
Once loaded, we test the bytestring in register zero (which, according to the context for dentry-open, is the full path to the file to be opened) and test that /etc/ is a prefix of the path.
44 The filter structureWhile parsing filters, we build up a structure called a Filter (we've switched to C++ for this code).
45 The constant structure
Constants in this language have a name and a type. We implement this as a base class which holds the name with subclasses for each of the types.
The Write member serialises the constant to standard out in the external format which the kernel expects.
The parser is written using Ragel which generates code for a state-machine parser from a description. Readers are directed to the Ragel documentation to fully understand the following.
The parser is non-recursive and uses a single int value for its current state. Several magic variables are used in the snippets of code embedded in the parser:
|start||A pointer, into the input, to the start of the current word/string etc.|
|fpc||A pointer, into the input, to the byte which has just been parsed.|
|current_filter||A pointer to a Filter structure.|
|line_no||The current line number.|
|op||The current operation (a uint32_t).|
The first chunk of the parser defines an action (next_line) which increments the current line counter, a parser (ws) to skip whitespace and another action (start) to set the global start variable.
47 Parsing a filter
48 Starting to parse a new filter
In the Ragel chunk above, a token is parsed for the filter name and, at the end of the token, the filter_new action is called. This sets the global current_filter variable to a new filter and looks up the filter name in the list of supported filters
When a filter is complete, it's pushed onto a list of filters.
49 Parsing constants
50 Helper functions
51 Parsing the spill slots declaration
52 Parsing instructions
After the constants and spill-slots declarations, each line is an 'instruction'. I put the word in quotes because a jump target is included as an instruction according to the parser, but it doesn't actually emit an operation in the filter. Since all our jumps are forwards, jump resolution is simple and we do everything in a single pass.
53 Parsing simple instructions
First we'll consider how to parse the easy instructions which don't reference any other structures.
54 Helper actions
We build up the current operation (in the global op) as we parse, so we start with a bunch of helper actions for setting various parts of op.
55 Helper functions
Those helper actions called several helper functions which we'll expand on now:
56 Pushing new operations
Every time we complete an operation, we need to add it to the current filter:
57 Parsing the simple operations
58 Parsing ldc
The ldc instruction is a little more complicated since we have to translate the constant name, given in the instruction, into an index into the constant table.
We simply use the order which the constants were defined as the index and walk the vector till we find the correct index.
59 Parsing jumps
In the code which the kernel sees, all the jumps are expressed as an unsigned number of operations to skip. However, people don't like writing code like that, counting lines and all, so we let them specify jump targets as strings and define them later.
So, uniquely for jump instructions, we don't actually know what the exact operation is when we finish parsing the instruction: the offset is still unknown. Thus we keep a map of named jump targets to a vector of offsets into the operation list for jumps to that label.
When we parse a jump, we add an entry to the map for the current instruction and when we parse a jump target, we lookup in the map and write the correct offset for each instruction which jumps there.
60 Parsing a whole file
A file, as a whole, is just a list of filters:
61 The main function
62 Opening and reading the input
63 Parsing variables
A detailed above, during parsing a number of globals are used. These happen to not actually be globals in the C sense. Since the parsing code is expanded inline into this function, the 'globals' are actually local variables to this function.
64 Typechecking the filters
Once we have parsed the filters we typecheck them since the kernel will reject them anyway if they fail to typecheck. We simply call the Typecheck member on each filter which we'll expand upon below.
65 Typechecking a filter
We wouldn't want the typechecking to diverge between the kernel and userspace, so we use the exact same code here as we do in the kernel.
We do so by filling out an lsmsb_filter structure and an array of constants.
66 Serialising the filters
Once the filters have been typechecked, we write them out to stdout in the format which the kernel expects to parse them in (see the kernel code for installing a sandbox). This code follows the pattern of typechecking: all the actual work is in an a Filter member function which we'll expand on below.
67 Serialising a filter
We serialise a filter by filling out the external structures which the kernel expects and writing to stdout.
68 The writea utility function
This is a very thin wrapper around write which handles short writes.
69 Compatibility with kernel code
Several of the functions which we are using in this code were written for the kernel code and, as such, use kernel specific functions like kmalloc. In order to have them run in a userspace context we provide small shims for them.
71 Using a sandbox
Once a sandbox has been built, using it is very simple. One needs only to write the sandbox to /proc/self/sandbox. We provide a very simple binary which installs a given sandbox and runs a shell within it.
Note that, because sandboxes are composable, this can be done multiple times.