Cryptographic Agility (16 May 2016)
I wasn't quite sure how technical to pitch this talk so it's relatively introductory; regular readers probably know all this.
This isn't a transcript of what I said, but I try to hit the main points in my notes.)
Firstly I'd like to separate extensibility from agility. A protocol is extensible if you can add features to it without having to update every implementation at the same time—which is generally impossible. Cryptographic agility depends on having extensibility, at least if you ever want to use something that wasn't designed into a protocol at the beginning.
Protocols should be extensible: the world keeps changing and no design is going to be perfect for all time. But extensibility is much harder in practice than it sounds.
I happen to be particularly familiar with TLS and TLS has two, major extensibility mechanisms. The first is a simple version number. Here's how the specification says that it should work:
Client: I support up to version 1.2.
Server: (calculates the minimum of the version that the client supports and the maximum version that the server supports) Ok, let's use version 1.1.
This is commendably simple: it's not possible to express a range of versions and certainly not a discontinuous range. This is about as simple as an extensibility mechanism could be, and yet lots of implementations get it wrong. It's a common mistake for implementations to return an error when the client offers a version number that they don't understand.
This, of course, means that deploying new versions doesn't work. But it's insidious because the server will work fine until someone tries to deploy a new version. We thought that we had flexibility in the protocol but it turned out that bugs in code had rusted it in place.
At this point it's worth recalling the Law of the Internet: blame attaches to the last thing that changed. If Chrome updates and then something stops working then Chrome gets the blame. It doesn't matter that the server couldn't correctly calculate the minimum of two numbers. No normal person understands or cares about that.
What's to be done about this? Well, we work around issues if they're big and suck up the breakage if they're small. It's taken about 15 years to get to the point where web browsers don't have to work around broken version negotiation in TLS and that's mostly because we only have three active versions of TLS. When we try to add a fourth (TLS 1.3) in the next year, we'll have to add back the workaround, no doubt. In summary, this extensibility mechanism hasn't worked well because it's rarely used and that lets bugs thrive.
TLS has a second, major extension mechanism which is a series of (key, value) pairs where servers should ignore unknown keys. This has worked a little better because, while there are only three or four versions in play, with many years between versions, there are 25 to 30 extensions defined. It's not perfect: bugs in implementations have led them to be dependent on the order of extensions and somebody at least managed to write a server that breaks if the last value is empty.
Sometimes more extensibility points have been added inside of extensions in the expectation that it'll save adding another, top-level extension in the future. This has generally been a mistake: these extension points have added complexity for little reason and, when we try to use them, we often find that bugs have rusted them solid anyway. They've just been a waste.
There's a lesson in all this: have one joint and keep it well oiled.
Protocol designers underestimate how badly people will implement their designs. Writing down how you think it should work and hoping that it'll work, doesn't work. TLS's protocol negotiation is trivial and the specification is clear, yet it still didn't work in practice because it's difficult to oil.
Rather one needs to minimise complexity, concentrate all extensibility in a single place and actively defend it. An active defense can take many forms: fuzzing the extensibility system in test suites and compliance testing is good. You might want to define and implement dummy extensions once a year or such, and retire old ones on a similar schedule. When extensions contain lists of values, define a range of values that clients insert at random. In short, be creative otherwise you'll find that bug rust will quickly settle in.
Cryptographic agility is a huge cost. Implementing and supporting multiple algorithms means more code. More code begets more bugs. More things in general means less academic focus on any one thing, and less testing and code-review per thing. Any increase in the number of options also means more combinations and a higher chance for a bad interaction to arise.
Let's just consider symmetric ciphers for a moment. Because everyone wants them to be as fast as possible, BoringSSL currently contains 27 thousand lines of Perl scripts (taken from OpenSSL, who wrote them all) that generate assembly code just in order to implement AES-GCM. That's a tremendous amount of work and a tremendous scope for bugs.
Focusing again on TLS: over the years, 25 different ciphers and modes have been specified for use in TLS. Thankfully, of those, only nine are actively used. But that doesn't mean that the zombies of the others might not still be lurking around, ready to cause problems.
Where did this mess of diversity come from?
1. Old age / we had no idea what we were doing in the 1990's:
A lot of mistakes were made in the 1990's—we really didn't know what we were doing. Phil Rogaway did, but sadly not enough people listened to him; probably because they were busy fighting the US Government which was trying to ban the whole field of study at the time. Unfortunately that coincided with the early inflation period of the internet and a lot of those mistakes were embedded pretty deeply. We're still living with them today.
2. National pride cipher suites
The next cause of excess agility are the national pride cipher suites. Many countries consider cryptography to be an area of national interest but then mistakenly believe that means that they have to invent their own standards and primitives. South Korea and Japan were especially forthright about this and so managed to get these ciphers assigned code points in TLS but Russia and China and, to some extent, many other countries do the same thing.
Although they receive limited analysis compared to something like AES, they're generally not bad, per se, but they bring nothing new to the table: they add nothing but costs, and the costs are significant. Cryptographic diversity for the point of national pride should be strenuously resisted for that reason. Other countries may complain that the US got their standards widely used but the US got to specify a lot about the internet by being the first mover. (And AES is from Belgium anyway.) However, it is the case that I'm not aware of any of these national standards being used to promote something that's actually a deliberate backdoor; which is, of course, not true of the US.
3. Reasonable cases for diversity:
- Embedded systems want to minimise circuit size: AES_128_CCM and AES_256_CCM.
- We want something faster for when we don't have AES hardware: CHACHA20_POLY1305.
- US Government standard, got hardware support from Intel: AES_128_GCM and AES_256_GCM.
Now we come to the ones that are reasonable to use and the reasons for diversity there. It's all about performance optimisation for different environments really: tiny devices want CCM because it only needs an AES-encrypt circuit. Devices without hardware support for AES-GCM want to use ChaCha20-Poly1305 because it's much more efficient in software. Everything else wants to use AES-GCM.
Agility has allowed us to introduce the ciphers in the final set and that's really important. But it's equally important to kill off the old stuff, and that's very hard. Nearly all the incentives are aligned against it. Recall the Law of the Internet (mentioned above); users hate stuff breaking and always blame you. Even djb will take to Twitter when one drops DSA support.
We have a long conveyor belt of primitives, we put new ones at the front and, every so often, we turn the crank and something drops off the end. In addition to all the obvious problems with killing off old stuff, that also means that there's a lot of inadvisable options that will generally function at any given time and this is leading to new products launching with no idea that they're sitting towards the end of this conveyor belt. These products expect a lifetime of some number of years and are unaware that we hope to discontinue something that they're using much sooner than that. It's no longer the case that we can assume that waiting a year will result in a reduction of the amount of use that a deprecated primitive gets because of new launches.
Google tries to address this where it can by requiring support for the newest options in our certification process for devices that interact with our services. But only a tiny subset of the things that interact with Google go through any of our certifications.
Things are even harder in non-interactive cases. TLS at least gets to negotiate between the client and server but algorithms in S/MIME messages and certificate signatures don't allow that. (One can think of ways to help change that, but the current reality is that they're not negotiated.) That's why dropping SHA-1 support in certificates has been a such a gruesome fight and why PKCS#8 messages still require us to support 40-bit RC2.
So what's the lesson here? I'd say that you need extensibility but, when it comes to cryptographic agility, have one option. Maybe two. Fight to keep it that small.
It's worth highlighting that, for the purposes of time, I've simplified things dramatically. I've considered only symmetric ciphers and modes above but, even within TLS, there's a whole separate conveyor belt for asymmetric algorithms. And I've not mentioned the oncoming storm of quantum computers. Quantum computers are going to be hilarious and I hope to be retired before they get big enough to cause problems!