ImperialViolet

Sandboxing on Linux (27 Nov 2008)

This blog post has been brought about because of the issues of sandboxing Chromium on Linux (no, it's not ready and wont be for months).

Chromium uses a multiprocess model where each tab (roughly) is a separate process which performs IPCs to a UI process. This means that we can do parallel rendering and withstand crashes in the renderer. It also means that we should be able to sandbox the renderers.

Since the renderer are parsing HTML, CSS, Javascript, running plugins etc, sandboxing them would be very desirable. There's a lot of scope for buffer overflows and other issues in a code base that large and a good sandbox would dramatically reduce the scope of any exploits.

Traditional sandboxes: chroot, resource limits

People have been using chroot jails for many years. A chroot call changes the root of the filesystem for the current process. Once that has happened the process cannot interact with any of the filesystem outside the jail. As long as the process cannot gain root access, it's a good security measure.

Resource limits prevent denial of service attacks by, say, trying to use up all the memory on the system. See the getrlimit manpage for details.

These two mechanisms are supported by most UNIX like systems. However, there are some limitations:

Network access, for one, is not mediated by the filesystem on these platforms, so a compromised process could spew spam or launch attacks on an internal network. Also, the chroot call requires root access. Traditionally this has been done with a small SUID helper binary, but then root access is needed to install etc.

ptrace jails

The ptrace call is used by the strace utility which shows a trace of all the system calls that a child makes. It can also be used to mediate those system calls.

It works like this: the untrusted child is traced by a trusted parent and the kernel arranges that all system calls that the child makes cause a SIGTRAP, stopping the child. The parent can then read the registers and memory of the child and decide if the system call is allowed, permitting it or simulating an error if not.

The first issue is that some system calls take pointers to userspace memory which needs to be validated. Take open, which passes a pointer to the filename to be opened. If the parent wishes to validate the filename it has to read the child's memory and check that it's within limits. That's perfectly doable with ptrace.

The issue comes when there are multiple threads in the untrusted address space. In between the parent validating the filename and the kernel reading it, another thread can change its contents. In the case of open that means that the validator in the parent see one (safe) filename but the kernel actually acts on another. Because of this, either multithreaded children need to be prohibited, or the validator must forbid all system calls which take a pointer to a buffer which needs to be validated.

When calls like open have been prohibited, there's another trick which can be used to securely replace it:

UNIX domain sockets are able to transmit file descriptors between processes. Not just the integer value, but a reference to the actual descriptor (which will almost certainly have a different integer value in the other process). For details see the unix and cmsg manpages.

With this ability an untrusted child can securely open a file by making a request, over a UNIX domain socket to a trusted broker. The broker can validate the filename requested in safety: because it's in another address space the filename is safe between validation and use by the kernel. The broker can then return the file descriptor over the socket to the untrusted child.

The major problem with ptrace jails is that they have a high cost at every system call. On my 2.33GHz Core2 a simple getpid call takes 128ns. When a process is ptraced, that rises to 13,800ns (a factor of 100x slower). Additionally, Chromium on Linux is a 32-bit process because of our JIT, so getting the current time is a system call too.

Seccomp

Seccomp has a rather messy past (see the linked Wikipedia page for details). It's a Linux specific mode which a process can request whereby only read, write, exit and sigreturn system calls are allowed. Making any system call not on the permitted list results in immediate termination of the process.

This is a very tight jail, designed for pure computation and is perfect for that. It's enabled by default in kernel builds (although some distributions disable it I believe). It used to be enabled via a file in /proc but, in order to save space, it's now a prctl.

This issue is that the jail is too tight. It's great that read and write calls are enabled without overhead because that's much of what one of our rendering processes will use, but many other system calls would be nice (brk and mmap for memory allocation, gettimeofday etc). We would have to use the broker model for all of them.

For some calls the broker model has to be updated. Allocating memory to an address space isn't something which can be performed outside that address space so, in this case, the broker for these calls has to be in the same address space. This means that there's an untrusted thread running under seccomp and a trusted thread, not running seccomped, in the same process. The untrusted thread can request more memory by making an request over a pipe to the trusted thread. The trusted thread can then perform the allocation in the same address space.

This presents some issues when writing the trusted code. Because untrusted code has access to the memory the only thing the trusted thread can trust are its registers. That means no stack nor heap usage. Basically the trusted code has to be written in assembly and has to be pretty simple. That's not a huge problem for us however.

But we will be making lots of these other system calls, not just the memory allocation ones, but time calls, poll etc. All have to use a broker model.

To recap, a basic system call (getpid) on my 2.33GHz Core2 takes about 128ns. Performing the same operation over a pipe to another thread takes 7,775ns and to another process takes 8,423ns, roughly a factor of 60x slower.

Again, this is a very painful slowdown given the volume of such calls that we expect to make.

SELinux

Fedora, rightfully, makes a lot of noise about the fact that they have SELinux. It's a huge beast and Fedora's work has mostly been a process of taming the complexity and dealing with the fact that very little is written with SELinux in mind.

I don't have Fedora installed anywhere, but this may be a very nice solution to our issues. However, I suspect that root access will be required, again, to configure it. I speak mostly from a position of ignorance here, however. I should install Fedora at some point and have a play.

The Other Man's Grass

Recent releases of OSX have a system call actually called sandbox_init. It's a little half-baked at the moment, but shows great promise.

It's a feature from TrustedBSD and, in the limit, allows for a Scheme like language to give a detailed specification of the shape of the sandbox which is compiled to bytecode and loaded into the kernel. You can see some examples of the profile language in the slides for this USENIX talk. But, for the moment, I believe that just a few preset profiles are provided (see the manpage).

Rolling one's own

SELinux is implemented atop of LSM which is a general framework for hooking security decisions in the Linux kernel. It's conceivable that one could write a sandboxing module using these hooks.

It would require root access to install, but then so do many of the other solutions. It would probably play badly with other LSM users too, but Fedora is the only major distribution to be using them as far as I know. However, it would also be a large distraction.

Summary of data
PlatformSimple system call... via a broker thread... via a broker process... when ptraced
32-bit 136.9ns 8161.4ns 8327.3ns 14087.0ns
64-bit 128.7ns 7775.0ns 8423.3ns 13779.9ns