BoringSSL (17 Oct 2015)
We recently switched Google's two billion line repository over to BoringSSL, our fork of OpenSSL. This means that BoringSSL is now powering Chromium (on nearly all platforms), Android M and Google's production services. For the first time, the majority of Google's products are sharing a single TLS stack and making changes no longer involves several days of work juggling patch files across multiple repositories.
This is a big positive for Google and I'm going to document some of the changes that we've made in BoringSSL in this post. I am not saying that people should be ditching OpenSSL and switching to BoringSSL. For Linux distributions that doesn't even make sense because we've removed too much for many applications to run unaltered and, without linker trickery, it's not possible to have both OpenSSL and BoringSSL in the same process because their symbols will collide. Even if you're in the position of shipping your own TLS stack with your code, you should still heed the warnings in the README well.
OpenSSL have considerably improved their processes since last April, which is great and important because huge swathes of the Internet will continue to depend on it. BoringSSL started before those changes but, even taking them into consideration, I'm still happy with my decision to fork. (But note that Google employs OpenSSL team members Emilia Käsper, Bodo Möller and Ben Laurie and contributes monetarily via the Core Infrastructure Initiative, so we haven't dropped our support of OpenSSL as a project.)
With that in mind, I'm going to mention some of the cleanups that we've done in BoringSSL from the lowest level, upwards. While most people should continue to use OpenSSL, there are lots of developers outside of Google who work on Chromium and Android and thus this document shouldn't be internal to Google. This post may seem critical of OpenSSL, but remember that many of these changes are possible because we only have to worry about Google's needs—we have an order of magnitude fewer platforms and configurations to support than OpenSSL and we don't keep any ABI compatibility. We also have the superpower of being able to change, where needed, the code that calls BoringSSL, so you can't really compare the two.
The “we”, above, is primarily myself and my colleagues David Benjamin and Matt Braithwaite. But BoringSSL is open source and Brian Smith has clocked up 55 patches and we've also had contributions from Opera and CloudFlare. (Brian's number would be higher if I had had more time to review his pending changes in the past couple of weeks).
Generally when people say “forking” they mean that they took a copy of the code and started landing patches independently of the original source. That's not what we did with BoringSSL. Rather than start with a copy, I started with an empty directory and went through OpenSSL function-by-function, reformatting, cleaning up (sometimes discarding) and documenting each one. So BoringSSL headers and sources look like this rather than this. The comments in BoringSSL headers can be extracted by a tool to produce documentation of a sort. (Although it could do with a make-over.)
(Clang's formatting tool and its Vim integration are very helpful! It's been the biggest improvement in my code-editing experience in many years.)
For much of the code, lengths were converted from ints to size_ts and functions that returned one, zero or minus one were converted to just returning one or zero. (Not handling a minus one return value is an easy and dangerous mistake.)
I didn't always get everything right: sometimes I discarded a function that we later found we actually needed or I changed something that, on balance, wasn't worth the changes required in other code. Where possible, code that we've needed to bring back has gone into a separate section called “decrepit” which isn't built in Chromium or Android.
But large amounts of OpenSSL could simply be discarded given our more limited scope. All the following were simply never copied into the main BoringSSL: Blowfish, Camllia, CMS, compression, the ENGINE code, IDEA, JPAKE, Kerberos, MD2, MDC2, OCSP, PKCS#7, RC5, RIPE-MD, SEED, SRP, timestamping and Whirlpool. The OpenSSL that we started from has about 468,000 lines of code but, today, even with the things that we've added (including tests) BoringSSL is just 200,000. Even projects that were using OpenSSL's OPENSSL_NO_x defines to exclude functionality at compile time have seen binaries sizes drop by 300KB when switching to BoringSSL.
Some important bits of OpenSSL are too large to bite off all at once, however. The SSL, ASN.1 and X.509 code were “forked” in the traditional sense: they were copied with minimal changes and improved incrementally. (Or, in the case of ASN.1 and X.509, left alone until they could be replaced completely.)
OpenSSL has a confusing number of initialisation functions. Code that uses OpenSSL generally takes a shotgun approach to calling some subset of OpenSSL_add_all_algorithms, SSL_library_init, ERR_load_crypto_strings and the deprecated SSLeay aliases of the same. BoringSSL doesn't need any of them; everything works immediately and the errors don't print out funny just because you forgot to load the error strings. If, like Chromium, you care about avoiding static initialisation (because every disk seek to load pages of code delays displaying the window at startup) then you can build with BORINGSSL_NO_STATIC_INITIALIZER and initialise the library when you need with CRYPTO_library_init. But the vast majority of code just wants to avoid having to think about it. In the future, we would like to move to an automatic lazy-init which would solve even Chromium's needs.
OpenSSL and BoringSSL are often built into shared libraries, but OpenSSL doesn't have any visibility annotations. By default symbols are not hidden and ELF requires that any non-hidden symbols can be interposed. So if you look at libcrypto.so in a Linux distribution you'll see lots of internal functions polluting the dynamic symbol table and calls to those functions from within the library have to indirect via the PLT. BoringSSL builds with hidden visibility by default so calls to internal functions are direct and only functions marked OPENSSL_EXPORT are included in the dynamic symbol table.
Multi-threaded code is common these days but OpenSSL requires that you install callbacks to lock and unlock a conceptual array of locks. This trips up people who now take thread-safety to be a given, and can also mean that contention profiling shows a large, opaque amount of contention in the locking callback with no hint as to the real source. BoringSSL has a native concept of locks so is thread-safe by default. It also has “once” objects, atomic reference counting and thread-local storage, which eliminates much of the need for locking in the first place.
OpenSSL has a fairly unique method of handling errors: it pushes errors onto a per-thread queue as the stack unwinds. This means that OpenSSL errors can generally give you something like a stack trace that you might expect from gdb or a Python exception, which is definitely helpful in some cases. For contrast, NSS (Mozilla's crypto library) uses a more traditional, errno-like system of error codes. Debugging an NSS error involves looking up the numeric error code and then grepping the source code to find all the places where that error code can be set and figuring out which triggered this time.
However, this single error-code system is better for programmatic use. Code that tries to do something with OpenSSL errors (other than dumping them for human debugging) tends to look only at the first (i.e. deepest) error on the queue and tries to match on the reason or even function code. Thus changing the name of even internal functions could break calling code because these names were implicitly exported by the error system. Adding errors could also break code because now a different error could be first in the queue. Lastly, forgetting to clear the error queue after a failed function is very easy to do and thus endemic.
So BoringSSL no longer saves functions in the error queue: they all appear as OPENSSL_internal, which saved about 15KB of binary size alone. As a bonus, we no longer need to run a script every time we add a new function. The file name and line number is still saved but, thankfully, I've never seen code try to match line numbers from the error queue. Trying to match on reason codes is still problematic, but we're living with it for now. We also have no good answer for forgetting to clear the error queue. It's possible that we'll change things in the future to automatically clear the error queue when calling most functions as, now that we're using thread-local storage, that'll no longer cause servers to burst into a flaming ball of lock contention. But we've not done that yet.
Parsing and serialisation
OpenSSL's parsing and serialisation involves a lot of incrementing pointers with single-letter names. BoringSSL drags this firmly into the 1990's with functions that automatically check bounds for parsing and functions that automatically resize buffers for serialisation. This code also handles parsing and serialising ASN.1 in an imperative fashion and we're slowly switching over to these functions because the OpenSSL ASN.1 code is just too complicated for us.
But I should note that OpenSSL's master branch now uses some similar parsing functions for parsing TLS structures at least. I've no idea whether that was inspired by BoringSSL, but it's great to see.
Random number generation
Random number generation in OpenSSL suffers because entropy used to be really difficult. There were entropy files on disk that applications would read and write, timestamps and PIDs would be mixed into entropy pools and applications would try other tricks to gather entropy and mix it into the pool. That has all made OpenSSL complicated.
BoringSSL just uses urandom—it's the right answer. (Although we'll probably do it via getrandom rather than /dev/urandom in the future.) There are no return values that you can forget to check: if anything goes wrong, it crashes the address space.
For the vast majority of code, that's all that you need to know, although there are some concessions to performance in the details:
TLS servers that are pushing lots of AES-CBC need the RNG to be really fast because each record needs a random IV. Because of this, if BoringSSL detects that the machine supports Intel's RDRAND instruction, it'll read a seed from urandom, expand it with ChaCha20 and XOR entropy from RDRAND. The seed is thread-local and refreshed every 1024 calls or 1MB output, whichever happens first.
Handing people a block cipher and hash function and expecting them to figure out the rest does not work. Authenticated Encryption is much closer to being reasonable and BoringSSL promotes it where possible. One very pleasing BoringSSL tale is that I handed that header file to a non-crypto developer and they produced secure code, first time. That would not have happened had I pointed them at EVP_CIPHER.
There is more to be done here as I've talked about before: we need nonce-misuse-resistant primitives and solutions for large files but what we have now is a significant improvement and the foundations for that future work are now in place.
As I mentioned, the SSL/TLS code wasn't reworked function-by-function like most of BoringSSL. It was copied whole and incrementally improved, predominantly by David Benjamin. I'm really happy with what he's managed to do with it.
At the small scale, most of the parsing and serialisation is now using the safe functions that I covered above. (Changes to convert most of the remaining pointer-juggling code are in my review queue.) TLS extensions are now a bit saner and no longer handled with huge switch statements. Support for SSLv2, DSS, SRP and Kerberos has all been dropped. The header file actually has comments.
Some important, small scale cleanups are less obvious. The large number of “functions” that were actually macros around ctrl functions (that bypassed the type system) are now real functions. In order to get TLS 1.0–1.2 you no longer use the ridiculously named SSLv23_method and then disable SSLv2 and SSLv3 by setting options on the SSL_CTX, rather you use TLS_method and control the versions by setting a minimum and maximum version.
There is lots more that I could mention like that.
At the larger scale, the buffer handling code has been substantially improved and the TLS code now does symmetric crypto using the AEAD interface, which cleanly partitions concerns that previously leaked all over the SSL code. We've also rewritten the version negotiation code so it no longer preprocesses the ClientHello and fiddles with method tables to use the correct version. This avoids some duplicated code and session resumption bugs and OpenSSL has since done a similar rewrite for 1.1.0. To solve a particular problem for Chrome, we've added some support for asynchronous private key operations so that slow smartcards don't block the network thread. Much of the DTLS logic has also been rewritten or pruned.
Perhaps most importantly, the state machine is much reduced. Renegotiation has been dropped except for the case of a TLS client handling renegotiation from a server while the application data flow has stopped, and even that is disabled by default. The DTLS code (a source of many bugs) is much saner in light of this.
OpenSSL has always had decent test coverage of lower-level parts like hash functions and ciphers, but testing of the more complex SSL/TLS code has been lacking. Testing that code is harder because you need to be able to produce sufficiently correct handshakes to get close to its edge cases, but you don't want to litter your real code with dozens of options for producing incorrect outputs in order to hit them. In BoringSSL, we've solved this by using a copy of Go's TLS stack for testing and we've littered it with such options. Our tests also stress asynchronous resume points across a range of handshakes. We wrote partial DTLS support in Go to test DTLS-only edge cases like reassembly, replay and retransmission. Along the way, we even discovered one of OpenSSL's old bug workarounds didn't work, allowing both projects to shed some code.
In C, any malloc call may fail. OpenSSL attempts to handle this, but such code is error-prone and rarely tested. It's best to use a malloc which crashes on failure, but for the benefit of consumers who can't, we have a "malloc test" mode. This runs all tests repeatedly, causing each successive allocation to fail, looking for crashes.
We now have 1,139 TLS tests which gives us 70% coverage of the TLS code—still better than any other TLS library that we've used.
Now that we've done the task of aligning Google around BoringSSL, we'll hopefully be able to turn a little bit more attention to some feature work. Support for the IETF-approved ChaCha20-Poly1305 is coming soon. (Brian Smith has a change waiting for me.) Curve25519 and Ed25519 support are likely too. Next year, we will probably start on TLS 1.3 support.
But more cleanups are probably more important. The big one is the elimination of the ASN.1 and X.509 code in many cases. If you recall, we imported that code whole without cleanups and it hasn't been touched since. We've been incrementally replacing uses of the ASN.1 code with the new CBS and CBB functions but X.509 remains as a substantial user. We're not going to be able to drop that code completely because too much expects the X.509 functions to be available for reading and writing certificates, but we can make it so that the rest of the code doesn't depend on it. Then we can put it in a separate library and drop in a new certificate verification library that some of my Chromium colleagues are writing. Most users of BoringSSL will then, transparently, end up using the new library.
In the SSL code, the SSL object itself is a mess. We need to partition state that's really needed for the whole connection from state that can be thrown away after the handshake from state that can be optionally discarded after the handshake. That will save memory in servers as well as improving the clarity of the code. Since we don't have ABI compatibility, we can also reorder the structs to pack them better.
Lastly, we need to make fuzzing part of our process. Michał Zalewski's AFL has substantially improved the state of fuzzing but, whether we're using AFL or LibFuzzer, it's still a one-off for us. It should be much more like our CI builders. So should running clang-analyzer.
(David Benjamin contributed to this post.)