I wrote an article for LWN about Chromium's seccomp sandbox. They decided that it wasn't in the right style for LWN, and they rewrote it to fit. Their version has just become available for free. I'm including my version below:
The Chromium seccomp sandbox
The Chromium sandbox is an important part of keeping users safe. The web is a very complicated place these days and the code to parse and interpret it is large and on the front-line of security. We try to make sure that this code is free of security bugs, but history suggests that we can't be perfect. So, we plan for the case where someone has an exploit against our rendering code and run it in its own process with limited authority. It's the sandbox's job to limit that authority as much as possible.
Chromium renderers need very little authority. They need access to fontconfig to find fonts on the system and to open those font files. However, these can be handled as IPC requests to the browser process. They do not need access to the X server (which is why we don't have GTK widgets on web pages), nor should they be able to access DBus, which is increasingly powerful these days.
Drawing is handled using SysV shared memory (so that we can share memory directly with X). Everything else is either serialised over a socketpair or passed using a file descriptor to a tmpfs file. This means that we can deny filesystem access completely. The renderer requires no network access: the network stack is entirely within the browser process.
Traditional sandboxing schemes on Linux involve switching UIDs and using chroot. We'll be using some of those techniques too. But this text is about the most experimental part of our sandbox: the seccomp layer which my colleague Markus Gutschke has been writing.
The kernel provides a little-known feature whereby any process can enter ‘seccomp mode’. Once enabled, it cannot be disabled. A process running in seccomp mode can make only four system calls: read, write, sigreturn and exit. Attempting any other system call results in the immediate termination of the process.
This is quite desirable for preventing attacks. It removes network access, which is traditionally difficult to limit otherwise (although CLONE_NEWNET might help here). It also limits access to new, possibly dangerous, system calls that we don't otherwise need, like tee and vmsplice. Also, because read and write proceed at full speed, if we limit our use of other system calls, we can hope for minimal performance overhead.
But we do need to support some other system calls. Allocating memory is certainly very useful. The traditional way to support this would be to RPC to a trusted helper process which could validate and perform the needed actions. However, a different process cannot allocate memory on our behalf. In order to affect the address space of the sandboxed code, the trusted code would have to be inside the process!
So that's what we do: each untrusted thread has a trusted helper thread running in the same process. This certainly presents a fairly hostile environment for the trusted code to run in. For one, it can trust only its CPU registers: all memory must be assumed to be hostile. Since C code will spill to the stack when needed and may pass arguments on the stack, all the code for the trusted thread has to be carefully written in assembly.
The trusted thread can receive requests to make system calls from the untrusted thread over a socket pair, validate the system call number and perform them on its behalf. We can stop the untrusted thread from breaking out by only using CPU registers and by refusing to let the untrusted code manipulate the VM in unsafe ways with mmap, mprotect etc.
That could work, if only the untrusted code would make RPCs rather than system calls. Our renderer code is very large, however. We couldn't patch every call site and, even if we could, our upstream libraries don't want those patches. Alternatively, we could try to intercept at dynamic-linking time, assuming that all the system calls go via glibc. Even if that were true, glibc's functions make system calls directly, so we would have to patch at the level of functions like printf rather than write.
This would seem to be a very tough problem, but keep in mind that if we miss a call site, it's not a security issue: the kernel will kill us. It's just a crash bug. So we could use a theoretically incorrect solution so long as it actually worked in practice. And this is what we do:
At startup we haven't processed any untrusted input, so we assume that the program is uncompromised. Now we can disassemble our own memory, find the sites where we make system calls and patch them. Correctly parsing x86 machine code is very tough. Native Client uses a customised compiler which only generates a subset of x86 in order to do it. But we don't need a perfect disassembler so long as it works in practice for the code that we have. It turns out that a simple disassembler does the job perfectly well, with only a few corner cases.
Now that we have patched all the call sites to call our RPC wrapper instead of the kernel, we are almost done. We have only to consider system calls which pass arguments in memory. Because the untrusted code can modify any memory that the trusted code can, the trusted thread cannot validate calls like open: it could verify the filename being requested, but the untrusted code could change the filename before the kernel copied the string from user space.
For these cases, we also have a single trusted process. This trusted process shares a couple of pages of memory with each of the trusted threads. When the trusted thread is asked to make a system call which it cannot safely validate, it forwards the call to the trusted process. Since the trusted process has a different address space, it can safely validate the arguments without interference. It then copies the validated arguments into the shared memory pages. These memory pages are writable by the trusted process, but read-only in the sandboxed process. Thus the untrusted code cannot modify them and the trusted code can safely make the system call using the validated, read-only arguments.
We also use this trick for system calls like mmap which don't take arguments in memory, but are complicated to verify. Recall that the trusted thread has to be hand written in assembly so we try to minimise the amount of this code where possible.
Once we have this scheme in place we can intercept, examine and deny any system call. We start off denying everything and then, slowly, add the system calls that we need. For each system call we need to consider the security implications it might have. Calls like getpid are easy, but what damage could one do with mmap/munmap? Well, the untrusted code could replace the code which the trusted threads are running, for one! So, when a call might be dangerous we allow only a minimal, and carefully examined, subset of flags which match the uses that we actually have in our code.
We'll be layering this sandbox with some more traditional UNIX sandboxing techniques in the final design. However, you can get a preview of the code in its incomplete state already at its Google Code homepage.
There's still much work to be done. A given renderer can load a web page with an iframe to any domain. Those iframes are handled in the same renderer, so a compromised renderer can ask the browser for any of the user's cookies. Microsoft Research developed Gazelle, which has much stricter controls on a renderer, at the expense of web compatibility. We know that users won't accept browsers that don't work with their favourite websites, but we are also very jealous of Gazelle's security properties, so hopefully we can improve Chromium along those lines in the future.
Another weak spot is installed plugins. Plugin support on Linux is very new, but on Windows, at least, we don't sandbox plugins. They don't expect to be sandboxed and we hurt web compatibility (and break their auto-updating) if we limit them. That means that plugins are a vector for more serious attacks against web browsers. As ever, keep up to date with the latest security patches!