If you do not know what ACVP is then you should read no further. If you think it might be useful then what you're actually looking for is Wycheproof; ACVP is only for those who have no choice.

If you're still reading and you're vaguely aware that your previous CAVP infrastructure isn't applicable any longer, and that you'll need to deal with ACVP next time, then you might be interested in BoringSSL's ACVP infrastructure. We have a number of different FIPS modules to test and wanted something generic rather than repeating the bespoke-per-module way that we handled CAVP. We also need to test not just BoringCrypto (a software module) but also embedded devices.

The result, acvptool, lives within the BoringSSL repo and can translate ACVP JSON into a series of reasonably simple IPC calls that a “module wrapper” speaks over stdin/stdout. BoringSSL's module wrapper is the reference implementation, but there's also a tiny one for testing that could easily be repurposed to forward over a serial link, etc, for embedded devices.

It's reasonably likely that you'll find some case that's not handled, but the code is just Go throwing around JSON so you should be able to extend it without too much bother. But, for the cases that are already handled, the weird undocumented quirks that'll otherwise consume hours of your life are taken care of.

Letter to 20 years ago

I noticed that I have not posted anything here in 2020! There's a bunch of reasons for this: the work I'm doing at the moment does not lend itself so well to blog posts, and life intervenes, leaving less time for personal projects. But in order to head off the risk that I'll post nothing at all this year I pulled something from one of my notebooks. 2020 is a round number so I decided to do some reflection and this was a letter that I imagined writing to myself 20 years ago. It is very much a letter to me! The topics are going to be quite specific and if you weren't paying attention to the computing industry in the year 2000 I’m not sure how much of it will make sense. But these are the points that I think me of 20 years ago would have wondered about.

You must be thinking that computers will be crazy fast by now. Yes…ish. It's complicated, and that's going to be a theme here. You've been hearing from Intel that the NetBurst chips will hit 10GHz in a few years, so with another few doublings what will have by now? 50GHz? Actually common values are around 3.5GHz. Some reach 5GHz, but only in bursts. Intel never hit 10GHz and nor did anybody else. It’s better than it sounds: instructions per clock are up a lot, so each cycle is worth more. (Although maybe we'll get to some of the issues that caused!) More importantly, all systems are multiprocessor now. It's physically a single chip, but inside is often 8- to 32-way SMT. Yep, that's cool. And yep, it only helps for certain sorts of workloads. Multithreaded programming is not going away.

Memory? 10s of gigabytes is common. Hard drives? It's nearly all flash now. You can still buy hard drives and they’re huge and cheap, but the speed of flash is pretty sweet. Computers really are quite substantially faster — don't be too put off by the clock speeds.

Your day-to-day life is a bunch of xterms and a web browser. Nothing's changed; you are dramatically underestimating the importance of path dependence. Terminals are still emulating a fancy VT-100 and sometimes they get messed up and need a reset. No fixes there. It's still bash or zsh; nearly unchanged from your time. The kernel has been fixed a little: you can now get a handle to a process, so no more PID races. You can open a file relative to a directory descriptor and you can create an unlinked file in a directory and link it later. Yes it's good that these things are possible now, but it is not a fundamental change and it took a long time. Actually you know what? Windows grew a much smarter shell, leapfrogging Linux in several respects. They had hardly moved forward since DOS so it was easier there, perversely.

So innovation must have happened higher up where there was new ground and little legacy, right? What about the semantic web? How did that turn out? Not well. We don't have lots of data in machine-readable formats and fancy GUIs so that anyone can create automation. Information is somewhere between impossible and a huge pain to access. You’ve read The Dilbert Future by now and its ‘confusopoly’ concept is much closer to the mark. The Semantic Web stuff failed so badly that nobody even tries any longer. (I'm afraid Scott Adams won’t seem so wholesome in the future either.) The closest you'll get is that your web browser can fill out your name, address, and credit card details. And it has to work really hard to do that because there’s almost no assistance from web pages. Go find Weaving the Web and throw it away.

Something more positive: bandwidth! You are using a dial-up that tops out at 5 KB/s and charges by the minute. You use a local proxy that keeps a copy of everything so that viewed pages are available offline and it lets you mark missing pages for batch fetching to reduce the cost. This problem is now solved. You can assume that any house in a city can get an always-on, several 10s of Mb/s connection. It's not as cheap as it could be but it's a standard household expense now. (Note: past me doesn't live in the US! —agl.) Also, everyone carries an impossibly fancy PDA that has that level of connection wirelessly and everywhere. I don't need to equivocate here, connectivity is solved in the sorts of places you're likely to live. But … there's a second edge to that sword. This connectivity can be a bit … much? There are some advantages to having the internet be stationary, metered, and behind 30 seconds of banshee wailing and static. Imagine your whole social life getting run through IRC, and that you're always connected. It’s tough to explain but there's a problem. But these PDAs? They have GPS and maps. Nobody gets lost anymore. Nobody carries paper street maps in their car. Connectivity can be pretty sweet.

This next bit is going to upset you a little: the whole Palladium / trusted boot stuff never took off on the desktop, but these PDAs are pretty locked down. One type of them is completely locked down and you can’t run non-approved software. The other will make you jump through hoops and, even then, you can't access the data of other programs. On the latter sort you can install a completely custom OS most of the time, but there's attestation and some things won't cooperate. This is still playing out and people are fighting over the details (because of money, of course). It remains a concern, but you underestimate the benefits of this sort of system. Your idea that people should own their own computers because they’re critical tools isn't wrong, but it is elitist. For the vast majority of people, their desktops degrade into a fragile truce with a whole ecosystem of malware and near-malware. Maybe it's their “fault” for having installed it, but these PDAs are so popular, in part, because they're hard to screw up. Bad stuff does get through the approval process, but it cannot mess things up to the wipe-and-reinstall level that desktops reach. The jury is still out about whether we will regret this, but you're wrong about the viability of giving people Windows XP and getting a good result.

Back on a positive note: the music industry switched to a $15 a month stream-whatever-you-want model and it works fine. You were completely right about this. Music still exists and it still pays a few at the top large sums and the rest very little. The music industry itself didn't sort this out though, other companies did it for them. What you're missing is that you’re not taking things far enough: companies also did this for TV and (many) movies. There are still rips of this stuff on BitTorrent, but it's not a live issue because people pay the subscription for the ease, by and large. In fact, access to scientific papers is a hotter issue now!

Basically, rates of change are really uneven.

Real-world measurements of structured-lattices and supersingular isogenies in TLS

This is the third in a series of posts about running experiments on post-quantum confidentiality in TLS. The first detailed experiments that measured the estimated network overhead of three families of post-quantum key exchanges. The second detailed the choices behind a specific structured-lattice scheme. This one gives details of a full, end-to-end measurement of that scheme and a supersingular isogeny scheme, SIKE/p434. This was done in collaboration with Cloudflare, who integrated Microsoft's SIKE code into BoringSSL for the tests, and ran the server-side of the experiment.


Google Chrome installs, on Dev and Canary channels, and on all platforms except iOS, were randomly assigned to one of three groups: control (30%), CECPQ2 (30%), or CECPQ2b (30%). (A random ten percent of installs did not take part in the experiment so the numbers only add up to 90.) CECPQ2 is the hybrid X25519+structured-lattice scheme previously described. CECPQ2b is the name that we gave to the combination of X25519 and the SIKE/p434 scheme.

Because optimised assembly implementations are labour-intensive to write, they were only available/written for AArch64 and x86-64. Because SIKE is computationally expensive, it wasn’t feasible to enable it without an assembly implementation, thus only AArch64 and x86-64 clients were included in the experiment and ARMv7 and x86 clients did not contribute to the results even if they were assigned to one of the experiment groups.

Cloudflare servers were updated to include support for both CECPQ2 and CECPQ2b, and to support an empty TLS extension that indicated that they were part of the experiment. Depending on the experiment group, Chrome would either offer CECPQ2, CECPQ2b, or just non-post-quantum options, in its TLS 1.3 handshake, along with the signaling extension to indicate which clients were part of the control group. Measurements were taken of how long TLS handshakes took to complete using Chrome’s metrics system. Chrome knew which servers were part of the experiment because they echoed the signaling extension, thus all three groups were measuring handshake duration against the same set of servers.

After this phase of the trial was complete, client-side measurements were disabled and Chrome Canary was switched to a mode where it randomly picked one of CECPQ2, CECPQ2b, or neither to offer. This enabled some additional, server-side measurements to ensure that nothing unexpected was occuring.

(Cloudflare has a significantly more detailed write up of this experiment.)


We’re aware of a couple of biases and these need to be kept in mind when looking at the results. Firstly, since ARMv7 and x86 platforms were excluded, the population was significantly biased towards more powerful CPUs. This will make supersingular isogenies look better. Also, we’ve seen from past experiments that Canary and Dev Chrome users tend to have worse networks than the Chrome user population as a whole, and this too will tend to advantage supersingular isogenies since they require less network traffic.


Here are histograms of client-side results, first from Windows (representing desktops/laptops) and from Android (representing mobile devices):

1 10 100 1000 10000 TLS handshake time (ms) TLS handshake latency (Windows) Control Control CECPQ2 CECPQ2 CECPQ2b CECPQ2b 1 10 100 1000 10000 TLS handshake time (ms) TLS handshake latency (Android) Control Control CECPQ2 CECPQ2 CECPQ2b CECPQ2b

From the histograms we can see that the CECPQ2b (SIKE) group shifts visibly to the right (i.e. slower) in both cases. (On Android, a similar but smaller shift is seen for CECPQ2.) Despite the advantages of removing the slower clients and experimenting with worse-than-usual networks, the computational demands of SIKE out-weigh the reduced network traffic. Only for the slowest 5% of connections are the smaller messages of SIKE a net advantage.

Cloudflare have a much more detailed analysis of the server-side results, which are very similar.


While there may be cases where the smaller messages of SIKE are a decisive advantage, that doesn’t appear to be the case for TLS, where the computational advantages of structured lattices make them a more attractive choice for post-quantum confidentiality.

Username (and password) free login with security keys

Most readers of this blog will be familiar with the traditional security key user experience: you register a token with a site then, when logging in, you enter a username and password as normal but are also required to press a security key in order for it to sign a challenge from the website. This is an effective defense against phishing, phone number takeover, etc. But modern security keys are capable of serving the roles of username and password too, so the user experience can just involve clicking a login button, pressing the security key, and (perhaps) entering a locally-validated PIN if the security key doesn't do biometrics. This is possible with the recently released Chromium 76 and also with Edge or Firefox on current versions of Windows.

On the plus side, this one-button flow frees users from having to remember and type their username and password for a given site. It also avoids sites having to receive and validate a password, potentially avoiding both having a password database (which, even with aggressively slow hashes, will leak many users' passwords if disclosed), and removing any possibility of accidentally logging the plaintext values (which both Google and Facebook have done recently). On the negative side, users will need a modern security key (or Windows Hello-enabled computer) and may still need to enter a PIN.

Which security keys count as “modern”? For most people it'll mean a series-5 black Yubikey or else a blue Yubikey that has a faint “2” printed on the upper side. Of course, there are other manufacturers who make security keys and, if it advertises “CTAP2” support, there's a good chance that it'll work too. But those Yubikeys certainly do.

In practical terms, web sites exercise this capability via WebAuthn, the same API that handles the traditional security key flow. (I'm not going to go into much detail about how to use WebAuthn. Readers wanting more introductory information can see what I've written previously or else see one of the several tutorials that come up in a Google search.)

When registering a security key for a username-free login, the important differences are that you need to make requireResidentKey true, set userVerification to required, and set a meaningful user ID.

In WebAuthn terms, a “resident” credential is one that can be discovered without knowing its ID. Generally, most security keys operate statelessly, i.e. the credential ID is an encrypted private seed, and the security key doesn't store any per-credential information itself. Thus the credential ID is required for the security key to function so the server sends a list of them to the browser during login, implying that the server already knows which user is logging in. Resident keys, on the other hand, require some state to be kept by the security key because they can be used without presenting their ID first. (Note that, while resident keys require some state to be kept, security keys are free to keep state for non-resident keys too: resident vs non-resident is all about whether the credential ID is needed.)

User verification is about whether the security key is providing one or two authentication factors. With the traditional experience, the security key is something you have and the password is something you know. In order to get rid of the password, the security key now needs to provide two factors all by itself. It's still something you have so the second security key factor becomes a PIN (something you know) or a biometric (something you are). That begs the question: what's the difference between a PIN and a password? On the surface: nothing. A security key PIN is an arbitrary string, not limited to numbers. (I think it was probably considered too embarrassing to call it a password since FIDO's slogan is “solving the world's password problem”.) So you should think of it as a password, but it is a password with some deeper advantages: firstly, it doesn't get sent to web sites, so they can't leak it and people can safely use a single password everywhere. Secondly, brute-force resistance is enforced by the hardware of the security key, which will only allow eight attempts before locking and requiring a reset. Still, it'll be nice when biometrics are common in security keys.

A user ID is an opaque identifier that should not be personally identifying. Most systems will have some database primary key that identifies a user and, if using that as a WebAuthn user ID, ensure that you encrypt it first with a key that is only used for this purpose. That way it doesn't matter if those primary keys surface elsewhere too. Security keys will only store a single credential for a given pair of domain name and user ID. So, if you register a second credential with the same user ID on the same domain, it'll overwrite the first.

The fact that you can register more than one credential for a given domain means that it's important to set the metadata correctly when creating a resident credential. This isn't unique to resident keys, but it's much more important in this context. The user name and displayName will be shown by the browser during login when there's more than one credential for a domain. Also the relying party name and displayName will be shown in interfaces for managing the contents of a security key.

When logging in, WebAuthn works as normal except you leave the list of credential IDs empty and set userVerification to required. That triggers the resident-credential flow and the resulting credential will include the user ID, with which you look up the user and their set of registered public keys, and then validate the public key and other parameters.

Microsoft have a good test site (enter any username) where you can experiment with crafting different WebAuthn requests.

Exposed credentials

In order to support the above, security keys obviously have a command that says “what credentials do you have for domain x?”. But what level of authentication is needed to run that command is a little complex. While it doesn't matter for the web, one might want to use security keys to act as, for example, door access badges; especially over NFC. In that case one probably doesn't want to bother with a PIN etc. Thus the pertinent resident credentials would have to be discoverable and exercisable given only physical presence. But in a web context, perhaps you don't want your security key to indicate that it has a credential stored for (or to anyone who can plug it in. Current security keys, however, will disclose whether they have a resident credential for a given domain, and the user ID and public key for that credential, to anyone with physical access. (Which is one reason why user IDs should not be identifying.) Future security keys will have a concept of a per-credential protection level which will prevent them from being disclosed without user verification (i.e. PIN or biometrics), or without knowing their random credential ID. While Chromium will configure credential protection automatically if supported, other browsers may not. Thus it doesn't hurt to set credProtect: 2 in the extensions dictionary during registration.

Zero-knowledge attestation

U2F/FIDO tokens (a.k.a. “Security Keys”) are a solid contender for doing something about the effectiveness of phishing and so I believe they're pretty important. I've written a fairly lengthy introduction to them previously and, as mentioned there, one concerning aspect of their design is that they permit attestation: when registering a key it's possible for a site to learn a cryptographically authenticated make, model, and batch. As a browser vendor who has dealt with User-Agent sniffing, and as a large-site operator, who has dealt with certificate pervasiveness issues, that's quite concerning for public sites.

It's already the case that one significant financial site has enforced a single-vendor policy using attestation (i.e. you can only register a token made by that vendor). That does not feel very congruent with the web, where any implementation that follows the standards is supposed to be a first-class citizen. (Sure, we may have undermined that with staggering levels of complexity, but that doesn't discredit the worth of the goal itself.)

Even in cases where a site's intended policy is more reasonable (say, they want to permit all tokens with some baseline competence), there are strong grounds for suspecting that things won't turn out well. Firstly, the policies of any two sites may not completely align, leading to a crappy user experience where a user needs multiple tokens to cover all the sites that they use, and also has to remember which works where. Secondly, sites have historically not been so hot about staying up-to-date. New token vendors may find themselves excluded from the market because it's not feasible to get every site to update their attestation whitelists. That feels similar to past issues with User-Agent headers but the solution there was to spoof other browsers. Since attestation involves a cryptographic signature, that answer doesn't work here.

So the strong recommendation for public sites is not to request attestation and not to worry about it. The user, after all, has control of the browser once logged in, so it's not terribly clear what threats it would address.

However, if we assume that certain classes of sites probably are going to use attestation, then users have a collective interest in those sites enforcing the same, transparent standard, and in them keeping their attestation metadata current. But without any impetus towards those ends, that's not going to happen. Which begs the question: can browsers do something about that?

Ultimately, in such a world, sites only operate on a single bit of information about any registration: was this public-key generated in a certified device or not? The FIDO Alliance wants to run the certification process, so then the problem reduces down to providing that bit to the site. Maybe they would simply trust the browser to send it: the browser could keep a current copy of the attestation metadata and tell the site whether the device is certified or not. I don't present that as a straw-man: if the site's aim is just to ensure that the vast majority of users aren't using some backdoored token that came out of a box of breakfast cereal then it might work, and it's certainly simple for the site.

But that would be a short blog post, and I suspect that trusting the browser probably wouldn't fly in some cases.

So what we're looking for is something like a group signature scheme, but we can't change existing tokens. So we need to retrospectively impose a group signature on top of signers that are using vanilla P-256 ECDSA.

Zero-knowledge proofs

It is a surprising but true result in cryptography that it's possible to create a convincing proof of any statement in NP that discloses nothing except the truth of the statement. As an example of such a statement, we might consider “I know a valid signature of message x from one of the public keys in this set”. That's a pretty dense couple of sentences but rather than write an introduction to zero-knowledge proofs here, I'm going to refer you to Matthew Green's posts[1][2]. He does a better job than I would.

I obviously didn't pick that example at random. If there was a well-known set of acceptable public keys (say, as approved by the FIDO Alliance) then a browser could produce a zero-knowledge proof that it knew a valid attestation signature from one of those keys, without disclosing anything else, notably without disclosing which public key was used. That could serve as an “attestation valid” bit, as hypothesised above, that doesn't require trusting the browser.

As a concrete instantiation of zero-knowledge proofs for this task, I'll be using Bulletproofs [BBBPWM17]. (See for a good collection of many different ZK systems. Also, dalek-cryptography have excellent notes on Bulletproofs; Cathie Yun and Henry de Valence from that group were kind enough to help me with a question about Bulletproofs too.)

The computational model for Bulletproofs is an arithmetic circuit: an acyclic graph where public and secret inputs enter and each node either adds or multiplies all its inputs. Augmenting that are linear constraints on the nodes of the circuit. In the tool that I wrote for generating these circuits, this is represented as a series of equations where the only operations are multiplication, addition, and subtraction. Here are some primitives that hopefully convince you that non-trivial functions can be built from this:


Using single bit values in an arithmetic circuit certainly works, but it's inefficient. Getting past single-bit values, the arithmetic circuits in Bulletproofs don't work in ℤ (i.e. arbitrary-length integers), rather they work over a finite field. Bulletproofs are built on top of an elliptic curve and the finite field of the arithmetic circuit is the scalar field of that curve.

When dealing with elliptic curves (as used in cryptography) there are two finite fields in play: the x and y coordinates of the points on the curve are in the coordinate field of the curve. Multiples of the base point (B) then generate a prime number (n) of points in the group before cycling back to the base point. So xB + yB = (x + y mod n)B — i.e. you can reduce the multiple mod n before multiplying because it'll give the same result. Since n is prime, reduction mod n gives a field, the scalar field.

(I'm omitting powers of primes, cofactors, and some other complications in the above, but it'll serve.)

So Bulletproofs work in the scalar field of whatever elliptic curve they're implemented with, but we want to build P-256 ECDSA verification inside of a Bulletproof, and that involves lots of operations in P-256's coordinate field. So, ideally, the Bulletproofs need to work on a curve whose scalar field is equal to P-256's coordinate field. Usually when generating a curve, one picks the coordinate field to be computationally convenient, iterates other parameters until the curve meets standard security properties, and the scalar field is whatever it ends up as. However, after some quality time with “Constructing elliptic curves of prime order” (Broker & Stevenhagen) and Sage, we find that y² = x³ - 3x + B over GF(PP) where:

… gives a curve with the correct number of points, and which seems plausibly secure based on the SafeCurves criteria. (A more exhaustive check would be needed before using it for real, but it'll do for a holiday exploration.) Given its relationship to P-256, I called it “PP-256” in the code.

ECDSA verification

Reviewing the ECDSA verification algorithm, the public keys and message hash are obviously public inputs. The r and s values that make up the signature cannot be both be public because then the verifier could just try each public key and find which one generated the signature. However, one of r and s can be public. From the generation algorithm, r is the x-coordinate of a random point and s is blinded by the inverse of the nonce. So on their own, neither r nor s disclose any information and so can just be given to the verifier—moving work outside of the expensive zero-knowledge proof. (I'm not worrying about tokens trying to use a covert channel here but, if you do worry about that, see True2F.)

If we disclose s to the verifier directly then what's left inside the zero-knowledge proof is 1) selecting the public key; 2) checking that the secret r is in range; 3) u₂ = r/s mod n; 4) scalar-multiplication of the public key by u₂; 5) adding in the (now) public multiple of the base point; and 6) showing that the x-coordinate of resulting point equals the original r, mod n.

The public-key is a 4-tooth comb, which is a precomputed form that speeds up scalar multiplications. It consists of 30 values. The main measure that we want to minimise in the arithmetic circuit is the number of multiplications where both inputs are secret. When selecting from t possible public keys the prover supplies a secret t-bit vector where only one of the bits is set. The proof shows that each value is, indeed, either zero or one using IsBit (from above, at a cost of one multiply per bit), and that exactly one bit is set by requiring that the sum of the values equals one. Each of the 30t public-key values is multiplied by one of the bits and summed to select exactly one key.

Rather than checking that the secret r is within [0, n-1], which would cost 512 multiplies, we just check that it's not equal to zero mod n. That's the important condition here since an out of range r is otherwise just an encoding error. Showing that a number is not zero mod n just involves showing that it's not equal to zero or n, as 2n is outside of the arithmetic circuit field. Proving a ≠ b is easy: the prover just provides an inverse for a - b (since zero doesn't have an inverse) and the proof shows that (a - b) × (a - b)⁻¹ = 1.

Calculating r/s mod n is the most complex part of the whole proof! Since the arithmetic circuit is working mod P-256's p, working mod n (which is the order of P-256—slightly less than p) is awkward. The prover gives bit-wise breakdown of r; the proof does the multiplication as three words of 86, 86, and 84 bits; the prover supplies the values for the carry-chain (since bit-shifts aren't a native operation in the arithmetic circuit); the prover then gives the result in the form a×n + b, where b is a 256-bit number; and the proof does another multiplication and carry-chain to check that the results are equal. All for a total cost of 2152 multiplication nodes!

After that, the elliptic curve operation itself is pretty easy. Using the formulae from “Complete addition formulas for prime order elliptic curves” (Renes, Costello, and Batina) it takes 5365 multiplication nodes to do a 4-tooth comb scalar-mult with a secret scalar and a secret point. Then a final 17 multiplication nodes add in the public base-point multiple, supply the inverse to convert to affine form, and check that the resulting x-coordinate matches the original r value. The circuit does not reduce the x-coordinate mod n in order to save work: for P-256, that means that around one in 2¹²⁸ signatures may be incorrectly rejected, but that's below the noise floor of arithmetic errors in CPUs. Perhaps if this were to be used in the real world, that would be worth doing correctly, but I go back to work tomorrow so I'm out of time.

In total, the full circuit contains 7534 multiplication nodes, 2154 secret inputs, and 17 236 constraints.

(Pratyush Mishra points out that affine formulae would be more efficient than projective since inversion is cheap in this model. Oops!)


My tool for generating the matrices that Bulletproofs operate on outputs 136KB of LZMA-compressed data for the circuit described above. In some contexts, that amount of binary size would be problematic, but it's not infeasible. There is also quite a lot of redundancy: the data includes instructions for propagating the secret inputs through the arithmetic circuit, but it also includes matrices from which that information could be derived.

The implementation is based on BoringSSL's generic-curve code. It doesn't even use Shamir's trick for multi-scalar multiplication of curve points, it doesn't use Montgomery form in a lot of places, and it doesn't use any of the optimisations described in the Bulletproofs paper. In short, the following timings are extremely pessimistic and should not be taken as any evidence about the efficiency of Bulletproofs. But, on a 4GHz Skylake, proving takes 18 seconds and verification takes 13 seconds. That's not really practical, but there is a lot of room for optimisation and for multiple cores to be used concurrently.

The proof is 70 450 bytes, dominated by the 2154 secret-input commitments. That's not very large by the standards of today's web pages. (And Dan Boneh points out that I should have used a vector commitment to the secret inputs, which would shrink the proof down to just a few hundred bytes.)

Intermediates and FIDO2

One important limitation of the above is that it only handles one level of signatures. U2F allows an intermediate certificate to be provided so that only less-frequently-updated roots need to be known a priori. With support for only a single level of signatures, manufacturers would have to publish their intermediates too. (But we already require that for the WebPKI.)

Another issue is that it doesn't work with the updated FIDO2 standard. While only a tiny fraction of Security Keys are FIDO2-based so far, that's likely to increase. With FIDO2, the model of the device is also included in the signed message, so the zero-knowledge proof would also have to show that a SHA-256 preimage has a certain structure. While Bulletproofs are quite efficient for implementing elliptic curves, a binary-based algorithm like SHA-256 is quite expensive: the Bulletproofs paper notes a SHA-256 circuit using 25 400 multiplications. There may be a good solution in combining different zero-knowledge systems based on “Efficient Zero-Knowledge Proof of Algebraic and Non-Algebraic Statements with Applications to Privacy Preserving Credentials” (Chase, Ganesh, Mohassel), but that'll have to be future work.

Happy new year.


CECPQ1 was the experiment in post-quantum confidentiality that my colleague, Matt Braithwaite, and I ran in 2016. It's about time for CECPQ2.

I've previously written about the experiments in Chrome which lead to the conclusion that structured lattices were likely the best area in which to look for a new key-exchange mechanism at the current time. Thanks to the NIST process we now have a great many candidates to choose from in that space. While this is obviously welcome, it also presents a problem: the fitness space of structured lattices looks quite flat so there's no obviously correct choice. Would you like keys to be products (RLWE) or quotients (NTRU; much slower key-gen, but subsequent operations are faster; older, more studied)? Do you want the ring to be NTT-friendly (fast multiplication, but more structure), or to have just a power-of-two modulus (easy reduction), or to have as little structure as possible? What noise profile and failure probability? Smart people can reasonably end up with different preferences.

This begs the question of why do CECPQ2 now at all? In some number of years NIST will eventually whittle down the field and write standards. Adrian Stanger of the NSA said at CRYPTO this year that the NSA is looking to publish post-quantum standards around 2024, based on NIST's output. (And even said that they would be pure-PQ algorithms, not combined with an elliptic-curve operation as a safeguard.) So if we wait five years things are likely to be a lot more settled.

Firstly, you might not be happy with the idea of waiting five years if you believe Michele Mosca's estimate of a one sixth chance of a large quantum computer in ten years. More practically, as we sail past the two year mark of trying to deploy TLS 1.3, another concern is that if we don't exercise this ability now we might find it extremely difficult to deploy any eventual design.

TLS 1.3 should have been straightforward to deploy because the TLS specs make accommodations for future changes. However, in practice, we had to run a series of large-scale experiments to measure what patterns of bytes would actually weave through all the bugs in the TLS ecosystem. TLS 1.3 now has several oddities in the wire-format that exist purely to confuse various network intermediaries into working. Even after that, we're still dealing with issues. Gallingly, because we delayed our server deployment in order to ease the client deployment, we're now having to work around bugs in TLS 1.3 client implementations that wouldn't have been able to get established had we quickly and fully enabled it.

The internet is very large and it's not getting any easier to steer. So it seems dangerous to assume that we can wait for a post-quantum standard and then deploy it. Any CECPQ2 is probably, in the long-term, destined to be replaced. But by starting the deployment now it can hopefully make that replacement viable by exercising things like larger TLS messages. Also, some practical experience might yield valuable lessons when it comes to choosing a standard. If the IETF had published the TLS 1.3 RFC before feedback from deployment, it would have been a divisive mess.

CECPQ2 Details

At the time of CECPQ1, the idea of running both a post-quantum and elliptic-curve primitive concurrently (to ensure that even if the post-quantum part was useless, at least the security wasn't worse than before) wasn't universally embraced. But we thought it important enough to highlight the idea in the name: combined elliptic-curve and post-quantum. It's a much more widely accepted idea now and still makes sense, and the best choice of elliptic-curve primitive hasn't changed, so CECPQ2 is still a combination with X25519.

As for the post-quantum part, it's based on the HRSS scheme of Hülsing, Rijneveld, Schanck, and Schwabe. This is an instantiation of NTRU, the patent for which has expired since we did CECPQ1. (The list of people who've had a hand in NTRU is quite long. See the Wikipedia page for a start.)

Last year Saito, Xagawa, and Yamakawa (SXY) published a derivative of HRSS with a tight, QROM-proof of CCA2 security from an assumption close to CPA security. It requires changes (that slow HRSS down a little), but HRSS+SXY is currently the basis of CECPQ2. Since HRSS+SXY no longer requires an XOF, SHAKE-128 has been replaced with SHA-256.

Having said that it's hard to choose in the structured lattice space, obviously HRSS is a choice and there were motivations behind it:

  1. CCA2-security is worthwhile, even though TLS can do without. CCA2 security roughly means that a private-key can be used multiple times. The step down is CPA security, where a private-key is only safe to use once. NewHope, used in CECPQ1, was only CPA secure and that worked for TLS since its confidentiality keys are ephemeral. But CPA vs CCA security is a subtle and dangerous distinction, and if we're going to invest in a post-quantum primitive, better it not be fragile.
  2. Avoiding decryption failures is attractive. Not because we're worried about unit tests failing (hardware errors set a noise floor for that anyway), but because the analysis of failure probabilities is complex. In the time since we picked HRSS, a paper has appeared chipping away at these failures. Eliminating them simplifies things.
  3. Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused, this is interesting as long as the key-generation speed is still reasonable for TLS.
  4. NTRU has a long history. In a space without a clear winner, that's a small positive.

CECPQ2 will be moving slowly: It depends on TLS 1.3 and, as mentioned, 1.3 is taking a while. The larger messages may take some time to deploy if we hit middlebox- or server-compatibility issues. Also the messages are currently too large to include in QUIC. But working though these problems now is a lot of the reason for doing CECPQ2—to ensure that post-quantum TLS remains feasible.

Lastly, I want to highlight that this only addresses confidentiality, not authenticity. Confidentiality is more pressing since it can be broken retrospectively, but it's also much easier to deal with in TLS because it's negotiated independently for every connection. Post-quantum authenticity will be entangled with the certificate and CA ecosystem and thus will be significantly more work.

Post-quantum confidentiality for TLS

In 2016, my colleague, Matt Braithwaite, ran an experiment in Google Chrome which integrated a post-quantum key-agreement primitive (NewHope) with a standard, elliptic-curve one (X25519). Since that time, the submissions for the 1st round of NIST’s post-quantum process have arrived. We thus wanted to consider which of the submissions, representing the new state of the art, would be most suitable for future work on post-quantum confidentiality in TLS.

A major change since the 2016 experiment is the transition from TLS 1.2 to TLS 1.3 (a nearly-final version of which is now enabled by default in Chrome). This impacts where in the TLS handshake the larger post-quantum messages appear:

In TLS 1.2, the client offers possible cipher suites in its initial flow, the server selects one and sends a public-key in its reply, then the client completes the key-agreement in its second flow. With TLS 1.3, the client offers several possible public-keys in its initial flow and the server completes one of them in its reply. Thus, in a TLS 1.3 world, the larger post-quantum keys will be sent to every TLS server, whether or not they’ll use it.

(There is a mechanism in TLS 1.3 for a client to advertise support for a key-agreement but not provide a public-key for it. In this case, if the server wishes to use that key-agreement, it replies with a special message to indicate that the client should try again and include a public-key because it’ll be accepted. However, this obviously adds an extra round-trip and we don’t want to penalise more secure options—at least not if we want them adopted. Therefore I’m assuming here that any post-quantum key agreement will be optimistically included in the initial message. For a diagram of how this might work with a post-quantum KEM, see figure one of the Kyber paper. Also I'm using the term “public-key” here in keeping with the KEM constructions of the post-quantum candidates. It's not quite the same as a Diffie-Hellman value, but it's close enough in this context.)

In order to evaluate the likely latency impact of a post-quantum key-exchange in TLS, Chrome was augmented with the ability to include a dummy, arbitrarily-sized extension in the TLS ClientHello. To be clear: this was not an implementation of any post-quantum primitive, it was just a fixed number of bytes of random noise that could be sized to simulate the bandwidth impact of different options. It was included for all versions of TLS because this experiment straddled the enabling of TLS 1.3.

Post quantum families

Based on submissions to NIST’s process, we grouped likely candidates into three “families” of algorithms, based on size: supersingular isogenies (SI), structured lattices (SL), and unstructured lattices (UL). Not all submissions fit into these families, of course: in some cases the public-key + ciphertext size was too large to be viable in the context of TLS, and others were based on problems that we were too unfamiliar with to be able to evaluate.

As with our 2016 experiment, we expect any post-quantum algorithm to be coupled with a traditional, elliptic-curve primitive so that, if the post-quantum component is later found to be flawed, confidentiality is still protected as well as it would otherwise have been.

This led to the following, rough sizes for evaluation:

  1. Supersingular isogenies (SI): 400 bytes
  2. Structured lattices (SL): 1 100 bytes
  3. Unstructured lattices (UL): 10 000 bytes

Incompatibilities with unstructured lattices

Before the experiment began, in order to establish some confidence that TLS connections wouldn’t break, the top 2 500 sites from the Alexa list were probed to see whether large handshakes caused problems. Unfortunately, the 10 000-byte extension caused 21 of these sites to fail, including common sites such as,, and

Oddly enough, reducing the size of the extension to 9 999 bytes reduced that number to eight, but was still included. These remaining eight sites seem to reject ClientHello messages larger than 3 970 bytes.

This indicated that widespread testing of a 10 000-byte extension was not viable. Thus the unstructured lattice configuration was replaced with a 3 300-byte “unstructured lattice stand-in” (ULS). This size was chosen to be about as large as we could reasonably test (although four sites still broke.

Phase One

In February 2018, a fraction of Chrome installs (Canary and Dev channels only) were configured to enable the dummy extension in one of four sizes, with the configuration randomly selected, per install, from:

  1. Control group: no extension sent
  2. Supersingular isogenies (SI): 400 bytes
  3. Structured lattices (SL): 1 100 bytes
  4. Unstructured lattice standing (ULS): 3 300 bytes

We measured TLS handshake latency at the 50th and 95th percentile, split out by mobile / non-mobile and by whether the handshake was a full or resumption handshake.

Given that we measured a stand-in for the UL case, we needed to extrapolate from that to an estimate for the UL latency. We modeled a linear cost of each additional byte in the handshake, 66 bytes of overhead per packet, and an MTU around 1 500 bytes. The number of packets for the UL case is the same for MTU values around 1 500, but the number of packets for the SL case is less certain: a session ticket could push the ClientHello message into two packets there. We chose to model the SL case as a single packet for the purposes of estimating the UL cost.

We also ignored the possibility that the client’s initial congestion window could impact the UL case. If that were to happen, a UL client would have to wait an additional round-trip, thus our UL estimates might be optimistic.

Despite the simplicity of the model, the control, SI, and SL points for the various configurations are reasonably co-linear under it and we draw the least-squares line out to 10 000 bytes to estimate the UL latency.

Configuration Additional latency over control group
SI SL UL (estimated)
Desktop, Full, Median 4.0% 6.4% 71.2%
Desktop, Full, 95% 4.7% 9.6% 117.0%
Desktop, Resume, Median 4.3% 12.5% 118.6%
Desktop, Resume, 95% 5.2% 17.7% 205.1%
Mobile, Full, Median -0.2% 3.4% 34.3%
Mobile, Full, 95% 0.5% 7.2% 110.7%
Mobile, Resume, Median 0.6% 7.2% 66.7%
Mobile, Resume, 95% 4.2% 12.5% 149.5%

(The fact that one of the SI values came out implausibly negative should give some feeling for the accuracy of the results—i.e. the first decimal place is probably just noise, even at the median.)

As can be seen, the estimated latency overhead of unstructured lattices is notably large in TLS, even without including the effects of the initial congestion window. For this reason, we feel that UL primitives are probably not preferable for real-time TLS connections.

It is also important to note that, in a real deployment, the server would also have to return a message to the client. Therefore the transmission overhead, when connecting to a compatible server, would be doubled. This phase of the experiment experiment did not take that into account at all. Which leads us to…

Phase Two

In the second phase of the experiment we wished to measure the effect of actually negotiating a post-quantum primitive, i.e. when both client and server send post-quantum messages. We again added random noise of various sizes to simulate the bandwidth costs of this. Cloudflare and Google servers were augmented to echo an equal amount of random noise back to clients that included it.

In this phase, latencies were only measured when the server echoed the random noise back so that the set of measured servers could be equal between control and experiment groups. This required that the control group send one byte of noise (rather than nothing).

Since unstructured lattices were not our focus after the phase one results, there were only three groups in phase two: one byte (control), 400 bytes (SI), and 1100 bytes (SL).

Unlike phase one, we did not break the measurements out into resumption / full handshakes this time. We did that in phase one simply because that’s what Chrome measures by default. However, it begs the question of what fraction of handshakes resume, since that value is needed to correctly weigh the two numbers. Thus, in phase two, we simply measured the overall latency, with resumption (or not) implicitly included.

Here are the latency results, this time in milliseconds in order to reason about CPU costs below:

Configuration Additional latency over control group (ms)
Mobile, Median 3.5 9.6
Mobile, 95% 18.4 159.0
Desktop, Median 2.6 5.5
Desktop, 95% 19.2 136.9

So far we’ve only talked about families of algorithms but, to analyse these results, we have to be more specific. That’s easy in the case supersingular isogenies (SI) because there is only one SI-based submission to NIST: SIKE.

For SIKEp503, the paper claims an encapsulation speed on a 3.4GHz Skylake of 16 619 kilocycles, or 4.9ms of work on a server. They also report on an assembly implementation for Aarch64, which makes a good example client. There the key generation + decapsulation time is 86 078 kilocycles, or 43ms on their 2GHz Cortex-A79.

On the other hand, structured lattices (SL) are much faster. There are several candidates here but most have comparable speed, and it’s under a 10th of a millisecond on modern Intel chips.

Therefore, if computation were included, the SI numbers above would have 48ms added to them. That happens to makes SL clearly faster at the median, although it does not close the gap at the 95% level. (The SI key-generation could be amortised across connections, at the cost of code complexity and maybe some forward security. Amortising key-generation across 16 connections results in 28ms of computation per connection, which doesn’t affect the argument.)

We had previously deployed CECPQ1 (X25519 + NewHope) in TLS, which should be comparable to SL measurements, and gotten quite different results: we found a 1ms increase at the median and only 20ms at 95%. (And CECPQ1 took roughly 1 900 bytes.)

On a purely technical level, CECPQ1 was based on TLS 1.2 and therefore increased the sizes of the server’s first flow and the client’s second flow. It’s possible that bytes in the client’s second flow are less expensive, but much more significantly, TLS 1.2 does not do a fresh key-agreement for a resumption handshake. Therefore all the successful resumptions had no overhead with CECPQ1, but with TLS 1.3, they would. Going back to historical Chrome statistics from the time of the CECPQ1 experiment, we estimate that about 50% of connections would have been resumptions, which closes the gap.

Additionally, there may be a significant difference in Chrome user populations between the experiments. CECPQ1 went to Chrome Stable (non-mobile only) while the post-quantum measurements here have been done on Dev and Canary channels. It is the case that, for all non-mobile TLS connections (completely unrelated to any post-quantum experiments), the population of Dev and Canary channel users see higher TLS latencies and the effect is larger than the SI or SL effect measured above. Thus the user population for these experiments might have had poorer internet connections than the population for CECPQ1, exaggerating the costs of extra handshake bytes.

Apart from CPU time, a second major consideration is that we are not comparing like with like. Our 400-byte figure for SI maps to a NIST “category one” strength, while 1 100 bytes for SL is roughly in line with “category three”. A comparable, “category one” SL primitive comes in at more like 800 bytes, while a “category three” SI is much slower.

(Although there is a recent paper suggesting that SIKEp503 might be more like category three after all. So maybe this point is moot, but such gyrations highlight that SI is a relatively new field of study.)

Lastly, and much more qualitatively, we have had many years of dealing with subtle bugs in elliptic-curve implementations. The desire for performance pushes for carefully-optimised implementations, but missed carries, or other range violations, are typically impossible for random tests to find. After many years, we are at the point where we are starting to use formally-synthesised code for elliptic-curve field operations but SI would take us deeper into the same territory, likely ensuring that the pattern of ECC bugs continues into the future. (As Hamburg says in his NIST submission, “some cryptographers were probably hoping never to see a carry chain again”.)

On the other hand, SL designs are much more like symmetric primitives, using small fields and bitwise operations. While it’s not impossible to botch, say, AES, my experience is that the defect rates in implementations of things like AES and SHA-256 is dramatically lower, and this is strong argument for SL over SI. Combining post-quantum primitives with traditional ECC is a good defense against bugs, but we should still be reticent to adopt potentially fragile constructs.

(Not all SL designs have this benefit, mind you. Hamburg, quoted above, uses ℤ/(23120 - 21560 - 1) in ThreeBears and lattice schemes for other primitives may need operations like discrete Gaussian sampling. But it holds for a number of SL-based KEMs.)

Thus the overall conclusion of these experiments is that post-quantum confidentiality in TLS should probably be based on structured lattices although there is still an open question around QUIC, for which exceeding a single packet is problematic and thus the larger size of SL is more significant.

Security Keys


Predictions of, and calls for, the end of passwords have been ringing through the press for many years now. The first instance of this that Google can find is from Bill Gates in 2004, although I suspect it wasn’t the first.

None the less, the experience of most people is that passwords remain a central, albeit frustrating, feature of their online lives.

Security Keys are another attempt address this problem—initially in the form of a second authentication factor but, in the future, potentially as a complete replacement. Security Keys have gotten more traction than many other attempts to solve this problem and this post exists to explain and, to some extent, advocate for them to a technical audience.

Very briefly, Security Keys are separate pieces of hardware capable of generating public/private key pairs and signing with them. By being separate, they can hopefully better protect those keys than a general purpose computer can, and they can be moved between devices to serve as a means of safely authorising multiple devices. Most current examples attach via USB, but NFC and Bluetooth devices also exist.

Contrasts with existing solutions

Security Keys are not the first attempt at solving the problems of passwords, but they do have different properties than many of the other solutions.

One common form of second factor authentication is TOTP/HOTP. This often takes the form of an app on the user’s phone (e.g. Google Authenticator) which produces codes that change every minute or so. It can also take the form of a token with an LCD display to show such codes (e.g. RSA SecurID).

These codes largely solve the problem of password reuse between sites as different sites have different seed values. Thus stealing the password (and/or seed) database from one site no longer compromises accounts at other sites, as is the case with passwords.

However, these codes are still phishable: the user may be tricked into entering their password and code on a fake site, which can promptly forward them to the real site and impersonate the user. The codes may also be socially engineered as they can be read over the phone etc by a confused user.

Another common form of second factor authentication are SMS-delivered codes. These share all the flaws of HOTP/TOTP and add concerns around the social engineering of phone companies to redirect messages and, in extreme cases, manipulation of the SS7 network.

Lastly, many security guides advocate for the use of password managers. These, if used correctly, can also solve the password reuse problem and significantly help with phishing, since passwords will not be auto-filled on the wrong site. Thus this is sound advice and, uniquely amongst the solutions discussed here, can be deployed unilaterally by users.

Password managers, however, do not conveniently solve the problem of authenticating new devices, and their automated assistance is generally limited to a web context. They also change a password authentication from an (admittedly weak) verification of the user, to a verification of the device; an effect which has provoked hostility and counter-measures from relying parties.

In light of this, Security Keys should be seen as a way to improve upon, and exceed the abilities of, password managers in a context where the relying party is cooperating and willing to make changes. Security Keys are unphishable to a greater extent than password managers because credentials are bound to a given site, plus it’s infeasible to socially engineer someone to read a binary signature over the phone. Also, like TOTP/HOTP, they use fresh credentials for each site so that no reuse is possible. Unlike password managers, they can work outside of a web context and they can serve to authenticate new devices.

They aren’t magic, however. The unphishability of Security Keys depends on the extent to which the user may be mislead into compromising other aspects of their device. If the user can be tricked into installing malware, it could access the Security Key and request login signatures for arbitrary sites. Also, malware may compromise the user’s login session in a browser after successfully authenticating with a Security Key. Still, that’s a heck of a lot better than the common case of people using the same password across dozens of sites.

All the different terms

There is a lot of terminology specific to this topic. The first of which I’ve already used above: “relying parties”. This term refers to any entity trying to authenticate a user. When logging into a website, for example, the website is the relying party.

The FIDO Alliance is a group of major relying parties, secure token manufacturers, and others which defines many of the standards around Security Keys. The term that FIDO uses for Security Keys is “Universal 2nd factor” (U2F) so you’ll often see “U2F security key” used—it’s talking about the same thing. The terms “authenticator” and “token” are also often used interchangeably to refer to these devices.

At the time of writing, all Security Keys are based version one of FIDO’s “Client To Authenticator Protocol” (CTAP1). This protocol is split between documentation of the core protocol and separate documents that describe how the core protocol is transported over USB, NFC, and Bluetooth.

FIDO also defines a U2F Javascript API for websites to be able to interact with and use Security Keys. However, no browser ever implemented that API prior to a forthcoming (at the time of writing) version of Firefox.

But sites have been able to use Security Keys with Google Chrome for some years because Chrome ships with a hidden, internal extension through which the U2F API can be implemented with a Javascript polyfill, which Google also provides. (Extensions for Firefox were also available prior to native support in that browser.)

Thus all sites which supported Security Keys prior to 2018 used some polyfill in combination with Chrome’s internal extension, or one of the Firefox extensions, to do so.

The FIDO Javascript API is not the future, however. Instead, the W3C is defining an official Web Authentication standard for Security Keys, which is commonly called by its short name “webauthn”. This standard is significantly more capable (and significantly more complex) than the U2F API but, by the end of 2018, it is likely that all of Edge, Chrome, and Firefox will support it by default.

The webauthn standard has been designed to work with existing (CTAP1-based) devices, but FIDO is working on an updated standard for tokens, CTAP2, which will allow them to take advantage of the new capabilities in webauthn. (The standards were co-developed so it’s equally reasonable to see it from the other direction and say that webauthn allows browsers to take advantage of the new capabilities in CTAP2.)

There are no CTAP2 devices on the market yet but their major distinguishing feature will be that they can be used as a 1st (and only) factor. I.e. they have enough internal storage that they can contain a username and so both provide an identity and authenticate it. This text will mostly skim over CTAP2 since the devices are not yet available. But developers should keep it in mind when dealing with webauthn as it explains many, otherwise superfluous, features in that standard.

CTAP1 Basics

Since all current Security Keys use CTAP1, and webauthn is backwards compatible with it, understanding CTAP1 is pretty helpful for understanding the space in general. Here I’ll include some Python snippets for communicating with USB CTAP1 devices to make things concrete, although I’ll skip over everything that deals with framing.

CTAP1 defines two operations: creating a new key, and signing with an existing key. I’ll focus on them in turn.

Creating a new key

This operation is called “registration“ in CTAP1 terminology and it takes two, 32-byte arguments: a “challenge” and an “application parameter”. From the point of view of the token these arguments are opaque byte strings, but they’re intended to be hashes and the hash function has to be SHA-256 if you want to interoperate.

The challenge argument, when used with a web browser, ends up being the hash of a JSON-encoded structure that includes a random nonce from the relying party as well as other information. This nonce is intended to prove freshness: if it was signed by the newly generated key then the relying party could know that the key really was fresh and that this was the only time it had been registered. Unfortunately, CTAP1 doesn’t include any self-signature (and CTAP2 devices probably won’t either). Instead the situation is a lot more complex, which we’ll get to.

The application parameter identifies a relying party. In U2F it’s a hash of the origin (e.g. SHA-256(“”)) while in webauthn it’s a hash of the domain (e.g. SHA-256(“”)). As we’ll see, the signing operation also takes an application parameter and the token checks that it’s the same value that was given when the key was created. A phishing site will operate on a look-alike domain, but when the browser hashes that domain, the result will be different. Thus the application parameter sent to the token will be different and the token will refuse to allow the key to be used. Thus keys are bound to specific origins (or, with webauthn, domains) and cannot be used outside of that context.

Here’s some sample code that’ll shine some light on other aspects of the protocol, including the outputs from key creation:

(The full source is available if you want to play with it, although it’ll only work on Linux.)

The first thing to note is that the operation runs in a loop. CTAP1 devices require a “user presence” test before performing operations. In practice this means that they’ll have a button or capacitive sensor that you have to press. While the button/sensor isn’t triggered, operations return a specific error code and the host is expected to retry the operation after a brief delay until it succeeds.

A user-presence requirement ensures that operations cannot happen without a human being physically present. This both stops silent authentication (which could be used to track people) and it stops malware from silently proxying requests to a connected token. (Although it doesn’t stop malware from exploiting a touch that the user believes is authorising a legitimate action.)

Once the operation is successful, the response can be parsed. In the spirit of short example code everywhere, errors aren’t checked, so don’t use this code for real.

Key generation, of course, produces a public key. For CTAP1 tokens, that key will always be an uncompressed, X9.62-encoded, ECDSA P-256 public key. That encoding happens to always be 65 bytes long.

After that comes the key handle. This is an opaque value that the token uses to identify this key, and this evinces another important facet of CTAP1: the tokens are nearly stateless in practice.

In theory, the key handle could be a small identifier for a key which is stored on the device. In practice, however, the key handle is always an encrypted version of the private key itself (or a generating seed). This allows storage of the keys to be offloaded to the relying party, which is important for keeping token costs down.

This also means that CTAP1 tokens cannot be used without the user entering a username. The relying party has to maintain a database of key handles, and that database has to be indexed by something. This is changing with CTAP2, but I’ll not mention that further until CTAP2 tokens are being commercially produced.

Lastly, there’s the attestation certificate (just one; no chain) and a signature. As I mentioned above, the signature is sadly not from the newly created key, but from a separate attestation private key contained in the token. This’ll be covered in a later section

Signing with a key

Once a key has been created, we can ask the token to sign with it.

Again, a “challenge” and “application parameter” have to be given, and they have the same meaning and format as when creating a key. The application parameter has to be identical to the value presented when the key was created otherwise the token will return an error.

The code looks very similar:

The same pattern for waiting for a button press is used: the token is polled and returns an error until the button/sensor is pressed.

Three values are returned: a flag confirming that the button was pressed, a signature counter, and the signature itself—which signs over the challenge, application parameter, flags, and counter.

In the web context, the challenge parameter will be the hash of a JSON structure again, and that includes a nonce from the relying party. Therefore, the relying party can be convinced that the signature has been freshly generated.

The signature counter is a strictly monotonic counter and the intent is that a relying party can record the values and so notice if a private key has been duplicated, as the strictly-monotonic property will eventually be violated if multiple, independent copies of the key are used.

There are numerous problems with this, however. Firstly, recall that CTAP1 tokens have very little state in order to keep costs down. Because of that, all tokens that I’m aware of have a single, global counter shared by all keys created by the device. (The only exception I’ve seen is the Bluink key because it keeps state on a phone.) This means that the value and growth rate of the counter is a trackable signal that’s transmitted to all sites that the token is used to login with. For example, the token that I’m using right now has a counter of 431 and I probably use it far more often than most because I’m doing things like writing example Python code to trigger signature generation. I’m probably pretty identifiable because of that.

A signature counter is also not a very effective defense, especially if it’s per-token, not per-key. Security Keys are generally used to bless long-term logins, and an attacker is likely to be able to login once with a cloned key. In fact, the login that violates the monotonicity of the counter will probably be a legitimate login so relying parties that strictly enforce the requirement are likely to lock the real user out after a compromise.

Since the counter is almost universally per-token, that means that it’ll commonly jump several values between logins to the same site because the token will have been used to login elsewhere in-between. That makes the counter less effective at detecting cloning. If login sessions are long-lived, then the attacker may only need to sign once and quite possibly never be detected. If an attacker is able to observe the evolution of the counter, say by having the user login to an attacker-controlled site periodically, they can avoid ever triggering a counter regression on a victim site.

Finally, signature counters move the threat model from one that deals with phishing and password reuse, to one where attackers are capable of extracting key material from hardware tokens. That’s quite a change and the signature counter is not optional in the protocol.


When creating a key we observed that the token returned a certificate, and a signature over the newly created key by that certificate. Here’s the certificate from the token that I happen to be using as I write this:

        Version: 3 (0x2)
        Serial Number: 95815033 (0x5b60579)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = Yubico U2F Root CA Serial 457200631
            Not Before: Aug  1 00:00:00 2014 GMT
            Not After : Sep  4 00:00:00 2050 GMT
        Subject: CN = Yubico U2F EE Serial 95815033
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
    Signature Algorithm: sha256WithRSAEncryption

Clearly, it identifies the manufacturer and the unknown extension in there identifies the exact model of the key.

The certificate, despite the 32-bit serial number, doesn’t uniquely identify the device. The FIDO rules say that at least 100 000 other devices should share the same certificate. (Some devices mistakenly shipped with uniquely identifying attestation certifications. These will be recognised and suppressed in Chrome 67.) This device happens to be a special, GitHub-branded one. I don’t know whether this attestation certificate is specific to that run of devices, but the same model of device purchased on Amazon around the same time (but unbranded) had a different certificate serial number, so maybe.

But the practical upshot is that a relying party can determine, with some confidence, that a newly created key is stored in a Yubico 4th-gen U2F device by checking the attestation certificate and signature. That fact should cause people to pause; it has weighty ramifications.

Traditionally, anyone who implemented the various specifications that make up a web browser or server has been an equal participant in the web. There’s never been any permission needed to create a new browser or server and, although the specifications on the client side are now so complex that implementing them is a major effort, it’s possible to leverage existing efforts and use WebKit, Firefox, or Chromium as a base.

(The DRM in Encrypted Media Extensions might be an exception, and there was an appropriately large fight over that. I’m not making any judgment about the result of that here though.)

But with attestation, it’s quite possible for a site to require that you have particular hardware if you want to use webauthn. The concerns about that vary depending on the context: for an internal, corporate site to demand employees use particular hardware seems unobjectionable; a large public site raises many more concerns. There appear to be two, main ways in which this could develop in an unhealthy direction:

Firstly, as we’ve experienced with User-Agent headers, sites are not always as responsive as we would like. User-Agent headers have layers upon layers of browsers spoofing other browsers—the history of browser development is written in there—all because of this. Spoofing was necessary because sites would implement workarounds or degraded experiences for some browsers but fail to update things as the browser landscape changed. In order to be viable, each new browser entrant had to spoof the identity of the currently dominant one.

But attestation is not spoofable. Therefore, if sites launch webauthn support and accept attestations from the current set of token vendors, future vendors may be locked out of the market: Their devices won’t work because their attestations aren’t trusted, and they won’t be able to get sites to update because they won’t have enough market presence to matter.

If things get really bad, we may see a market develop where attestation certificates are traded between companies to overcome this. The costs of that are ultimately borne by increased token prices, which users have to pay.

The second concern is that, if different sites adopt different policies, users can have no confidence that a given token will work on a given site. They may be forced to have multiple tokens to span the set of sites that they use, and to remember in each case which token goes with which site. This would substantially degrade the user experience.

FIDO does not dismiss these worries and their answer, for the moment, is the metadata service (MDS). Essentially this is a unified root store that all sites checking attestation are supposed to use and update from. That would solve the problem of site stagnation by creating a single place for new vendors to submit their roots which would, in theory, then auto-update everywhere else. It would also help with the problem of divergent policies because it includes a FIDO-specified security-level for each type of token, which would at least reduce the variety of possible policies and make them more understandable—if used.

The challenge at the moment is that major vendors are not even included in the MDS, and using the MDS properly is harder for sites than throwing together a quick hack that’ll be “good enough for now”.

Thus my advice is for sites to ignore attestation if you’re serving the public. As we’ll see when we cover the webauthn API, attestation information is not even provided by default. Sites newly supporting webauthn are probably just using passwords, or maybe SMS OTP, and thus Security Keys offer a clear improvement. Worrying about whether the Security Key is a physical piece of hardware, and what certifications it has, is a distraction.


Now that we understand the underlying protocol, here’s how to actually use it. As mentioned above, there’s an older, “U2F” API but that’ll disappear in time and so it’s not covered here. Rather, the future is the W3C Web Authentication API, which builds on the Credential Management API.

Right now, if you want to experiment, you can use Firefox Nightly, Chrome Canary, or Edge 14291+. In Firefox Nightly, things should be enabled by default. For Chrome Canary, run with –enable-features=WebAuthentication –enable-experimental-web-platform-features. For Edge, I believe that you have to enable it in about:flags, but I’m just going off documentation.

Testing for support

Before trying to use webauthn, you should use feature detection to ensure that it’s supported by a given browser:

Creating a new key

The following are partial snippets of Javascript for creating a key. In-between the snippets is discussion so you would need to concatenate all the snippets in this section to get a complete example.

The name field is required, but currently discarded. In the future it’s intended that CTAP2 tokens will be able to store this and so it could be used in an account chooser interface.

The id field is optional and it sets the relying party ID (RP ID) of the credential. This is the domain name associated with the key and it defaults to the domain that’s creating the key. It can only be overridden to set it within the eTLD+1 of the current domain—like a cookie.

All these fields are required. However, like the previous chunk, they’re currently discarded and intended for forthcoming CTAP2 tokens.

The id identifies an account. A given CTAP2 token should not store two keys for the same RP ID and user ID. However, it’s possible that CTAP2 keys will allow blind enumeration of user IDs given physical possession of the token. (We have to see how that part of the CTAP2 spec is implemented in practice.) Therefore, you don’t want to store a username in the id field. Instead you could do something like HMAC-SHA256(key = server-side-secret, input = username)[:16]. (Although you’ll need a reverse index in the future if you want to use CTAP2 tokens in 1st-factor mode.)

The name and displayName are again strings intended for a future account chooser UI that doesn’t currently exist.

The challenge field for a new key is a little complex. The webauthn spec is very clear that this must be a strong, server-generated nonce and, for the assertion request which we’ll get to next, that’s correct. Also, if you’re checking attestation then this challenge is your only assurance of freshness, so you’ll want it in that case too.

However, as I mentioned above, it’s far from clear how well attestation will work outside of a controlled environment and I recommend that you ignore it in the general case. Given that, the utility of the challenge when creating a key is questionable. Generated U2F keys don’t sign over it and, while CTAP2 keys have the option of covering it with a self-signature, it remains to be seen whether any will actually do that.

One advantage that it does bring is that it stops CSRF attacks from registering a new key. But, if you don’t have a solid, general solution to CSRF already then you have far larger security issues.

Thus, since it’s easy and you’ll need it for assertions anyway, I still recommend that you generate a 16- or 32-byte nonce on the server as the spec suggests. I just thought that you should be aware that the benefit is a little fuzzier than you might imagine.

This enumerates the types of public keys that the server can process. The type field is always public-key and the alg comes from the IANA COSE list. The value -7 means ECDSA with P-256, which is effectively mandatory since it’s what all U2F tokens implement. At the moment, that’s the only value that makes sense although there might be some TPM-based implementations in the future that use RSA.

The timeout specifies how many milliseconds to wait for the user to select a token before returning an error.

The excludeCredentials is something that can be ignored while you get something working, but which you’ll have to circle back and read the spec on before deploying anything real. It allows you to exclude tokens that the user has already created a key on when adding new keys.

The promise will, if everything goes well, resolve to a PublicKeyCredential, the response member of which is an AuthenticatorAttestationResponse. I’m not going to pointlessly rewrite all the server-side processing steps from the spec here but I will note a couple of things:

Firstly, if you don’t know what token-binding is, that’s fine: ignore it. Same goes for extensions unless you previously supported the U2F API, in which case see the section below. I also suggest that you ignore the parts dealing with attestation (see above), which eliminates 95% of the complexity. Lastly, while I recommend following the remaining steps, see the bit above about the challenge parameter and don’t erroneously believe that checking, say, the origin member of the JSON is more meaningful than it is.

I’ll also mention something about all the different formats in play here: In order to process webauthn, you’ll end up dealing with JSON, ASN.1, bespoke binary formats, and CBOR. You may not have even heard of the last of those but CBOR is yet another serialisation format, joining Protocol Buffers, Msgpack, Thrift, Cap’n Proto, JSON, ASN.1, Avro, BSON, …. Whatever you might think of the IETF defining another format rather than using any of the numerous existing ones, you will have to deal with it to support webauthn, and you’ll have to deal with COSE, the CBOR translation of JOSE/JWT. You don’t have to support tags or indefinite lengths in CBOR though because the CTAP2 canonicalisation format forbids them.

The problem is that there’s going to be a lot of people having to implement converters from COSE key format to something that their crypto library accepts. Maybe I should start a GitHub repo of sample code for that in various libraries. But, short of that, here’s a quick reference to help you navigate all the different formats in play:


Used in: the public key in the attested credential data when creating a key. You have to parse this because you need the public key in order to check assertions.
Format: a CBOR map, defined here. However, you’ll struggle to figure out, concretely, what the keys in that map are from the RFC so, for reference, if you’re decoding a P-256 key you should expect these entries in the map: 1 (key type) = 2 (elliptic curve, x&y), 3 (algorithm) = -7 (ECDSA with SHA-256), -1 (curve) = 1 (P-256), and then x and y coordinates are 32-byte values with keys -2 and -3, respectively.
How to recognise: the first nibble is 0xa, for a small CBOR map.

X9.62 key

Used in: lots of things, this is the standard format for EC public keys. Contained within the SPKI format, which is used by Web Crypto and X.509.
Format: a type byte (0x04 for standard, uncompressed keys), followed by x and y values.
How to recognise: for P-256, it’s 65 bytes long and start with 0x04. (There’s also a compressed version of X9.62 but it’s exceedingly rare.)


Used in: X.509 and Web Crypto.
Format: an ASN.1 structure called SubjectPublicKeyInfo: an algorithm identifier followed by a lump of bytes in an algorithm-specific format, which is X9.62 for elliptic-curve keys.
How to recognise: starts with 0x30.

ASN.1 signatures

Used in: nearly everything that deals with ECDSA signatures.
Format: an ASN.1 SEQUENCE containing two INTEGER values for ECDSA’s r and s values.
How to recognise: starts with 0x30 and you’re expecting a signature, not a key.

“Raw” signatures

Used in: Web Crypto.
Format: a pair of 32-byte values (for P-256).
How to recognise: 64 bytes long.

Getting assertions

Once you have one (or more) keys registered for a user you can challenge them for a signature. Typically this is done at login time although CTAP2 envisions user verifying tokens that take a fingerprint or PIN number, so signatures from those devices could replace reauthentication requests. (I.e. those times when a site asks you to re-enter your password before showing particularly sensitive information.)

When getting an assertion, the challenge (i.e. nonce) really must be securely generated at the server—there’s none of the equivocation as with generation.

The credentialID values are those extracted from the attested credential data from the registration.

Again, there’s no need for me to reiterate the processing steps already enumerated in the spec except to repeat that you can ignore token-binding if you don’t know what it is, and to reference the comments above about signature counters. If you implement the signature counter check, you should think through how the user will experience various scenarios and ensure that you’re monitoring metrics for the number of counter failures observed. (I don’t think we know, in practice, how often counter writes will be lost, or counters will be corrupted.)

Supporting keys created with the U2F API

This is just for the handful of sites that supported Security Keys using the older, U2F API. If you have keys that were registered with that API then they aren’t immediately going to work with webauthn because the AppID from U2F is a URL, but the relying party ID in webauthn is just a domain. Thus the same origin is considered to be a different relying party when using U2F and webauthn and the keys won’t be accepted. (Just the same as if you tried to use a credential ID from a different site.)

However, there is a solution to this: in the publicKey argument of the get call add an extensions field containing a dict with a key appid whose value is the AppID of your U2F keys. This can be done uniformly for all keys if you wish since both the relying party ID and AppID will be tried when this is asserted. With this in place, keys registered with U2F should work with webauthn.

There is no way to create a U2F key with webauthn however. So if you rollout webauthn support, have users create keys with webauthn, and have to roll it back for some reason, those new keys will not work with the U2F API. So complete the transition to webauthn of your login process first, then transition registration.

Concluding remarks

It’s exciting to see webauthn support coming to most browsers this year. I get to use Security Keys with about the same number of sites as I use SMS OTP with, and I use Security Keys anywhere I can. While I, like everyone technical, assumes that I’m less likely to get phished than most, I’m still human. So I hope that wide-spread support for webauthn encourages more sites to support Security Keys.

Challenges remain, though. Probably the most obvious is that NFC is the ideal interface for mobile devices, but it doesn’t work with iOS. Bluetooth works for both Android and iOS, but requires a battery and isn’t as frictionless.

Security keys will probably remain the domain of the more security conscious in the short-term since, with CTAP1, they can only be an additional authentication step. But I hope to see CTAP2 tokens generally available this year with fingerprint readers or PIN support. It might be that, in a few years time, a significant number of people have a passwordless experience with at least one site that they use regularly. That’ll be exciting.

TLS 1.3 and Proxies

I'll generally ignore the internet froth in a given week as much as possible, but when Her Majesty's Government starts repeating misunderstandings about TLS 1.3 it is necessary to write something, if only to have a pointer ready for when people start citing it as evidence.

The first misunderstanding in the piece is the claim that it's possible for man-in-the-middle proxies to selectively proxy TLS 1.2 connections, but not TLS 1.3 connections because the latter encrypts certificates.

The TLS 1.2 proxy behaviour that's presumed here is the following: the proxy forwards the client's ClientHello message to the server and inspects the resulting ServerHello and certificates. Based on the name in the certificate, the proxy may “drop out” of the connection (i.e. allow the client and server to communicate directly) or may choose to interpose itself, answering the client with an alternative ServerHello and the server with an alternative ClientKeyExchange, negotiating different encrypted sessions with each and forwarding so that it can see the plaintext of the connection. In order to satisfy the client in this case the client must trust the proxy, but that's taken care of in the enterprise setting by installing a root CA on the client. (Or, in Syria, by hoping that users click through the the certificate error.)

While there do exist products that attempt to do this, they break repeatedly because it's a fundamentally flawed design: by forwarding the ClientHello to the server, the proxy has committed to supporting every feature that the client advertises because, if the server selects a given feature, it's too late for the proxy to change its mind. Therefore, with every new cipher suite, new curve, and new extension introduced, a proxy that does this finds that it cannot understand the connection that it's trying to interpose.

One option that some proxies take is to try and heuristically detect when it can't support a connection and fail open. However, if you believe that your proxy is a strong defense against something then failing open is a bit of problem.

Thus another avenue that some proxies have tried is to use the same heuristics to detect unsupported connections, discard the incomplete, outgoing connection, and start another by sending a ClientHello that only includes features that the proxy supports. That's unfortunate for the server because it doubles its handshaking cost, but gives the proxy a usable connection.

However, both those tricks only slow down the rate at which customers lurch from outage to outage. The heuristics are necessarily imprecise because TLS extensions can change anything about a connection after the ClientHello and some additions to TLS have memorably broken them, leading to confused proxies cutting enterprises off from the internet.

So the idea that selective proxying based on the server certificate ever functioned is false. A proxy can, with all versions of TLS, examine a ClientHello and decide to proxy the connection or not but, if it does so, it must craft a fresh ClientHello to send to the server containing only features that it supports. Making assumptions about any TLS message after a ClientHello that you didn't craft is invalid. Since, in practice, this has not been as obvious as the designers of TLS had imagined, the 1.3 draft has a section laying out these requirements.

Sadly, it's precisely this sort of proxy misbehaviour that has delayed TLS 1.3 for over a year while my colleagues (David Benjamin and Steven Valdez) repeatedly deployed experiments and measured success rates of different serialisations. In the end we found that making TLS 1.3 look like a TLS 1.2 resumption solved a huge number of problems, suggesting that many proxies blindly pass through such connections. (Which should, again, make one wonder about what security properties they're providing.)

But, given all that, you might ponder why we bothered encrypting certificates? Partly it's one component of an effort to make browsing more private but, more concretely, it's because anything not encrypted suffers these problems. TLS 1.3 was difficult to deploy because TLS's handshake is, perforce, exposed to the network. The idea that we should make TLS a little more efficient by compressing certificates has been bouncing around for many years. But it's only with TLS 1.3 that we might make it happen because everyone expected to hit another swamp of proxy issues if we tried it without encrypting certificates first.

It's also worth examining the assumption behind waiting for the server certificate before making an interception decision: that the client might be malicious and attempt to fool the proxy but (to quote the article) the certificate is “tightly bound to the server we’re actually interacting with”. The problem here is that a certificate for any given site, and a valid signature over a ServerKeyExchange from that certificate, is easily available: just connect to the server and it'll send it to you. Therefore if you're worried about malware, how is it that the malware C&C server won't just reply with a certificate for a reputable site? The malware client, after all, can be crafted to compensate for any such trickery. Unless the proxy is interposing and performing the cryptographic checks, then the server certificate isn't tightly bound to anything at all and the whole reason for the design seems flawed.

On that subject, I'll briefly mention the fact that HTTPS proxies aren't always so great at performing cryptographic checks. (We recently notified a major proxy vendor that their product didn't appear to validate certificates at all. We were informed that they can validate certificates, it's just disabled by default. It's unclear what fraction of their customers are aware of that.)

Onto the second claim of the article: that TLS 1.3 is incompatible with PCI-DSS (credit card standards) and HIPAA (US healthcare regulation). No reasoning is given for the claim, so let's take a look:

Many PCI-DSS compliant systems use TLS 1.2, primarily stemming from requirement 4.1: “use strong cryptography and security protocols to safeguard sensitive cardholder data during transmission over open, public networks, including a) only trusted keys and certificates are accepted, b) the protocol in use only supports secure versions or configurations, and c) the encryption strength is appropriate for the encryption methodology in use”.

As you can see, the PCI-DSS requirements are general enough to adapt to new versions of TLS and, if TLS 1.2 is sufficient, then TLS 1.3 is better. (Even those misunderstanding aspects of TLS 1.3 are saying it's stronger than 1.2.)

HIPAA is likewise, requiring that one must “implement technical security measures to guard against unauthorized access to electronic protected health information that is being transmitted over an electronic communications network”.

TLS 1.3 is enabled in Chrome 65, which is rolling out now. It is a major improvement in TLS and lets us eliminate session-ticket encryption keys as a mass-decryption threat, which both PCI-DSS- and HIPAA-compliance experts should take great interest in. It does not require special measures by proxies—they need only implement TLS 1.2 correctly.

Testing Security Keys

Last time I reviewed various security keys at a fairly superficial level: basic function, physical characteristics etc. This post considers lower-level behaviour.

Security Keys implement the FIDO U2F spec, which borrows a lot from ISO 7816-4. Each possible transport (i.e. USB, NFC, or Bluetooth) has its own spec for how to encapsulate the U2F messages over that transport (e.g. here's the USB one). FIDO is working on much more complex (and more capable) second versions of these specs, but currently all security keys implement the basic ones.

In essence, the U2F spec only contains three functions: Register, Authenticate, and Check. Register creates a new key-pair. Authenticate signs with an existing key-pair, after the user confirms physical presence, and Check confirms whether or not a key-pair is known to a security key.

In more detail, Register takes a 32-byte challenge and a 32-byte appID. These are intended to be SHA-256 hashes, but are opaque and can be anything. The challenge acts as a nonce, while the appID is bound to the resulting key and represents the context of the key. For web browsers, the appID is the hash of a URL in the origin of the login page.

Register returns a P-256 public key, an opaque key handle, the security key's batch certificate, and a signature, by the public key in the certificate, of the challenge, appID, key handle, and public key. Since the security keys are small devices with limited storage, it's universally the case in the ones that I've looked at that the key handle is actually an encrypted private key, i.e. the token offloads storage of private keys. However, in theory, the key handle could just be an integer that indexes storage within the token.

Authenticate takes a challenge, an appID, and a key handle, verifies that the appID matches the value given to Register, and returns a signature, from the public key associated with that key handle, over the challenge and appID.

Check takes a key handle and an appID and returns a positive result if the key handle came from this security key and the appID matches.

Given that, there are a number of properties that should hold. Some of the most critical:

But there are a number of other things that would be nice to test:

So given all those desirable properties, how do various security keys manage?


Easy one first: I can find no flaws in Yubico's U2F Security Key.

VASCO SecureClick

I've acquired one of these since the round-up of security keys that I did last time so I'll give a full introduction here. (See also Brad's review.)

This is a Bluetooth Low-Energy (BLE) token, which means that it works with both Android and iOS devices. For non-mobile devices, it includes a USB-A BLE dongle. The SecureClick uses a Chrome extension for configuring and pairing the dongle, which works across platforms. The dongle appears as a normal USB device until it sees a BLE signal from the token, at which point it “disconnects” and reconnects as a different device for actually doing the U2F operation. Once an operation that requires user-presence (i.e. a Register or Authenticate) has completed, the token powers down and the dongle disconnects and reconnects as the original USB device again.

If you're using Linux and you configure udev to grant access to the vendor ID & product ID of the token as it appears normally, nothing will work because the vendor ID and product ID are different when it's active. The Chrome extension will get very confused about this.

However, once I'd figured that out, everything else worked well. The problem, as is inherent with BLE devices, is that the token needs a battery that will run out eventually. (It takes a CR2012 and can be replaced.) VASCO claims that it can be used 10 times a day for two years, which seems plausible. I did run the battery out during testing, but testing involves a lot of operations. Like the Yubico, I did not find any problems with this token.

I did have it working with iOS, but it didn't work when I tried to check the battery level just now, and I'm not sure what changed. (Perhaps iOS 11?)

Feitian ePass

ASN.1 DER is designed to be a “distinguished” encoding, i.e. there should be a unique serialisation for a given value and all other representations are invalid. As such, numbers are supposed to be encoded minimally, with no leading zeros (unless necessary to make a number positive). Feitian doesn't get that right with this security key: numbers that start with 9 leading zero bits have an invalid zero byte at the beginning. Presumably, numbers starting with 17 zero bits have two invalid zero bytes at the beginning and so on, but I wasn't able to press the button enough times to get such an example. Thus something like one in 256 signatures produced by this security key are invalid.

Also, the final eight bytes of the key handle seem to be superfluous: you can change them to whatever value you like and the security key doesn't care. That is not immediately a problem, but it does beg the question: if they're not being used, what are they?

Lastly, the padding data in USB packets isn't zeroed. However, it's obviously just the previous contents of the transmit buffer, so there's nothing sensitive getting leaked.


With this device, I can't test things like key handle mutability and whether the appID is being checked because of some odd behaviour. The response to the first Check is invalid, according to the spec: it returns status 0x9000 (“NO_ERROR”), when it should be 0x6985 or 0x6a80. After that, it starts rejecting all key handles (even valid ones) with 0x6a80 until it's unplugged and reinserted.

This device has the same non-minimal signature encoding issue as the Feitian ePass. Also, if you click too fast, this security key gets upset and rejects a few requests with status 0x6ffe.

USB padding bytes aren't zeroed, but appear to be parts of the request message and thus nothing interesting.

U2F Zero

A 1KiB ping message crashes this device (i.e. it stops responding to USB messages and needs to be unplugged and reinserted). Testing a corrupted key handle also crashes it and thus I wasn't able to run many tests.


The Key-ID (and HyperFIDO devices, which have the same firmware, I think) have the same non-minimal encoding issue as the Feitian ePass, but also have a second ASN.1 flaw. In ASN.1 DER, if the most-significant bit of a number is set, that number is negative. If it's not supposed to be negative, then a zero pad byte is needed. I think what happened here is that, when testing the most-significant bit, the security key checks whether the first byte is > 0x80, but it should be checking whether it's >= 0x80. The upshot is the sometimes it produces signatures that contain negative numbers and are thus invalid.

USB padding bytes aren't zeroed, and include data that was not part of the request or response. It's unlikely to be material, but it does beg the question of where it comes from.

The wrapped keys also have some unfortunate properties. Firstly, bytes 16 through 31 are a function of the device and the appID, thus a given site can passively identify the same token when used by different accounts. Bytes 48 through 79 are unauthenticated and, when altered, everything still works except the signatures are wrong. That suggests that these bytes are the encrypted private key (or the encrypted seed to generate it). It's not obvious that there's any vulnerability from being able to tweak the private key like this, but all bytes of the key handle should be authenticated as a matter of best practice. Lastly, bytes 32 through 47 can't be arbitrarily manipulated, but can be substituted with the same bytes from a different key handle, which causes the signatures to be incorrect. I don't know what's going on there.

Overall, the key handle structure is sufficiently far from the obvious construction to cause worry, but not an obvious vulnerability.

Security Keys

Security Keys are (generally) USB-connected hardware fobs that are capable of key generation and oracle signing. Websites can “enroll” a security key by asking it to generate a public key bound to an “appId” (which is limited by the browser based on the site's origin). Later, when a user wants to log in, the website can send a challenge to the security key, which signs it to prove possession of the corresponding private key. By having a physical button, which must be pressed to enroll or sign, operations can't happen without user involvement. By having the security keys encrypt state and hand it to the website to store, they can be stateless(*) and robust.

(* well, they can almost be stateless, but there's a signature counter in the spec. Hopefully it'll go away in a future revision for that and other reasons.)

The point is that security keys are unphishable: a phisher can only get a signature for their appId which, because it's based on the origin, has to be invalid for the real site. Indeed, a user cannot be socially engineered into compromising themselves with a security key, short of them physically giving it to the attacker. This is a step up from app- or SMS-based two-factor authentication, which only solves password reuse. (And SMS has other issues.)

The W3C standard for security keys is still a work in progress, but sites can use them via the FIDO API today. In Chrome you can load an implementation of that API which forwards requests to an internal extension that handles the USB communication. If you do that, then there's a Firefox extension that implements the same API by running a local binary to handle it. (Although the Firefox extension appears to stop working with Firefox 57, based on reports.)

Google, GitHub, Facebook and Dropbox (and others) all support security keys this way. If you administer a G Suite domain, you can require security keys for your users. (“G Suite” is the new name for Gmail etc on a custom domain.)

But, to get all this, you need an actual security key, and probably two of them if you want a backup. (And a backup is a good idea, especially if you plan on dropping your phone number for account recovery.) So I did a search on Amazon for “U2F security key” and bought everything on the first page of results that was under $20 and available to ship now.

(Update: Brad Hill setup a page on GitHub that includes these reviews and his own.)

Yubico Security Key

Brand: Yubico, Firmware: Yubico, Chip: NXP, Price: $17.99, Connection: USB-A

Yubico is the leader in this space and their devices are the most common. They have a number of more expensive and more capable devices that some people might be familiar with, but this one only does U2F. The sensor is a capacitive so a light touch is sufficient to trigger it. You'll have no problems with this key, but it is the most expensive of the under $20 set.

Thetis U2F Security Key

Brand: Thetis, Firmware: Excelsecu, Chip: ?, Price: $13.95, Connection: USB-A

This security key is fashioned more like a USB thumb drive. The plastic inner part rotates within the outer metal shell and so the USB connector can be protected by it. The button is in the axis and is clicky, rather than capacitive, but doesn't require too much force to press. If you'll be throwing your security key in bags and worry about damaging them then perhaps this one will work well for you.

A minor nit is that the attestation certificate is signed with SHA-1. That doesn't really matter, but it suggests that the firmware writers aren't paying as much attention as one would hope. (I.e. it's a brown M&M.)

Feitian ePass

Brand: Feitian, Firmware: Feitian, Chip: NXP, Price: $16.99, Connection: USB-A, NFC

This one is very much like the Yubico, just a little fatter around the middle. Otherwise, it's also a sealed plastic body and capacitive touch sensor. The differences are a dollar and NFC support—which should let it work with Android. However, I haven't tested this feature.

I don't know what the opposite of a brown M&M is, but this security key is the only one here that has its metadata correctly registered with the FIDO Metadata Service.

U2F Zero

Brand: U2F Zero, Firmware: Conor Patrick, Chip: Atmel, Price: $8.99, Connection: USB-A

I did bend the rules a little to include this one: it wasn't immediately available when I did the main order from Amazon. But it's the only token on Amazon that has open source firmware (and hardware designs), and that was worth waiting for. It's also the cheapest of all the options here.

Sadly, I have to report that I can't quite recommend it because, in my laptop (a Chromebook Pixel), it's not thick enough to sit in the USB port correctly: Since it only has the “tongue” of a USB connector, it can move around in the port a fair bit. That's true of the other tokens too, but with the U2F Zero, unless I hold it just right, it fails to make proper contact. Since operating it requires pressing the button, it's almost unusable in my laptop.

However, it's fine with a couple of USB hubs that I have and in my desktop computer, so it might be fine for you. Depends how much you value the coolness factor of it being open-source.

KEY-ID FIDO U2F Security Key

Brand: KEY-ID, Firmware: Feitian(?), Chip: ?, Price: $12.00, Connection: USB-A

I photographed this one while plugged in in order to show the most obvious issue with this device: everyone will know when you're using it! Whenever it's plugged in, the green LED on the end is lit up and, although the saturation in the photo exaggerates the situation a little, it really is too bright. When it's waiting for a touch, it starts flashing too.

In addition, whenever I remove this from my desktop computer, the computer reboots. That suggests an electrical issue with the device itself—it's probably shorting something that shouldn't be shorted, like the USB power pin to ground, for example.

While this device is branded “KEY-ID”, I believe that the firmware is done by Feitian. There are similarities in certificate that match the Feitian device and, if you look up the FIDO certification, you find that Feitian registered a device called “KEY-ID FIDO® U2F Security Key”. Possibly Feitian decided against putting their brand on this.

(Update: Brad Hill, at least, reports no problems with these when using a MacBook.)

HyperFIDO Mini

Brand: HyperFIDO, Firmware: Feitian(?), Chip: ?, Price: $13.75, Connection: USB-A

By observation, this is physically identical to the KEY-ID device, save for the colour. It has the same green LED too (see above).

However, it manages to be worse. The KEY-ID device is highlighted in Amazon as a “new 2017 model”, and maybe this an example of the older model. Not only does it cause my computer to reliably reboot when removed (I suffered to bring you this review, dear reader), it also causes all devices on a USB hub to stop working when plugged in. When plugged into my laptop it does work—as long as you hold it up in the USB socket. The only saving grace is that, when you aren't pressing it upwards, at least the green LED doesn't light up.

HyperFIDO U2F Security Key

Brand: HyperFIDO, Firmware: Feitian(?), Chip: ?, Price: $9.98, Connection: USB-A

This HyperFIDO device is plastic so avoids the electrical issues of the KEY-ID and HyperFIDO Mini, above. It also avoids having an LED that can blind small children.

However, at least on the one that I received, the plastic USB part is only just small enough to fit into a USB socket. It takes a fair bit of force to insert and remove it. Also the end cap looks like it should be symmetrical and so able to go on either way around, but it doesn't quite work when upside down.

Once inserted, pressing the button doesn't take too much force, but it's enough to make the device bend worryingly in the socket. It doesn't actually appear to be a problem, but it adds a touch of anxiety to each use. Overall, it's cheap and you'll know it.

Those are the devices that matched my initial criteria. But, sometimes, $20 isn't going to be enough I'm afraid. These are some other security keys that I've ended up with:

Yubikey 4C

Brand: Yubico, Firmware: Yubico, Chip: NXP?, Price: $50 (direct from Yubico), Connection: USB-C

If you have a laptop that only has USB-C ports then a USB-A device is useless to you. Currently your only option is the Yubikey 4C at $50 a piece. This works well enough: the “button” is capacitive and triggers when you touch either of the contacts on the sides. The visual indicator is an LED that shines through the plastic at the very end.

Note that, as a full Yubikey, it can do more than just being a security key. Yubico have a site for that.

(Update: several people have mentioned in comments that this device wasn't very robust for them and had physical problems after some months. I can't confirm that, but there's always the option of a USB-A to USB-C adaptor. Just be careful to get one that'll work—the adaptor that I have doesn't hold “tongue-only” USB devices well enough.)

Many people lacking USB-A ports will have a Touch Bar, which includes a fingerprint sensor and secure element. One might spy an alternative (and cheaper solution) there. GitHub have published SoftU2F which does some of that but, from what I can tell, doesn't actually store keys in the secure element yet. However, in time, there might be a good answer for this.

(Update: the behaviour of SoftU2F might be changing.)

Yubikey Nano

Brand: Yubico, Firmware: Yubico, Chip: NXP?, Price: $50 (direct from Yubico), Connection: USB-A

Another $50 security key from Yubico, but I've included it because it's my preferred form-factor: this key is designed to sit semi-permanently inside the USB-A port. The edge is a capacitive touch sensor so you can trigger it by running your finger along it.

It does mean that you give up a USB port, but it also means that you've never rummaging around to find it.

(Note: newer Nanos look slightly different. See Yubico's page for a photo of the current design.)

Maybe Skip SHA-3

In 2005 and 2006, a series of significant results were published against SHA-1 [1][2][3]. These repeated break-throughs caused something of a crisis of faith as cryptographers questioned whether we knew how to build hash functions at all. After all, many hash functions from the 1990's had not aged well [1][2].

In the wake of this, NIST announced (PDF) a competition to develop SHA-3 in order to hedge the risk of SHA-2 falling. In 2012, Keccak (pronounced “ket-chak”, I believe) won (PDF) and became SHA-3. But the competition itself proved that we do know how to build hash functions: the series of results in 2005 didn't extend to SHA-2 and the SHA-3 process produced a number of hash functions, all of which are secure as far as we can tell. Thus, by the time it existed, it was no longer clear that SHA-3 was needed. Yet there is a natural tendency to assume that SHA-3 must be better than SHA-2 because the number is bigger.

As I've mentioned before, diversity of cryptographic primitives is expensive. It contributes to the exponential number of combinations that need to be tested and hardened; it draws on limited developer resources as multiple platforms typically need separate, optimised code; and it contributes to code-size, which is a worry again in the mobile age. SHA-3 is also slow, and is even slower than SHA-2 which is already a comparative laggard amongst crypto primitives.

SHA-3 did introduce something useful: extendable output functions (XOFs), in the form of the SHAKE algorithms. In an XOF, input is hashed and then an (effectively) unlimited amount of output can be produced from it. It's convenient, although the same effect can be produced for a limited amount of output using HKDF, or by hashing to a key and running ChaCha20 or AES-CTR.

Thus I believe that SHA-3 should probably not be used. It offers no compelling advantage over SHA-2 and brings many costs. The only argument that I can credit is that it's nice to have a backup hash function, but both SHA-256 and SHA-512 are commonly supported and have different cores. So we already have two secure hash functions deployed and I don't think we need another.

BLAKE2 is another new, secure hash function, but it at least offers much improved speed over SHA-2. Speed is important. Not only does it mean less CPU time spent on cryptography, it means that cryptography can be economically deployed in places where it couldn't be before. BLAKE2, however, has too many versions: eight at the current count (BLAKE2(X)?[sb](p)?). In response to complaints about speed, the Keccak team now have KangarooTwelve and MarsupilamiFourteen, which have a vector-based design for better performance. (Although a vector-based design can also be used to speed up SHA-2.)

So there are some interesting prospects for a future, faster replacement for SHA-2. But SHA-3 itself isn't one of them.

Update: two points came up in discussion about this. Firstly, what about length-extension? SHA-2 has the property that simply hashing a secret with some data is not a secure MAC construction, that's why we have HMAC. SHA-3 does not have this problem.

That is an advantage of SHA-3 because it means that people who don't know they need to use HMAC (with SHA-2) won't be caught out by it. Hopefully, in time, we end up with a hash function that has that property. But SHA-512/256, BLAKE2, K12, M14 and all the other SHA-3 candidates do have this property. In fact, it's implausible that any future hash function wouldn't.

Overall, I don't feel that solving length-extension is a sufficiently pressing concern that we should all invest in SHA-3 now, rather than a hash function that hopefully comes with more advantages. If it is a major concern for you now, try SHA-512/256—a member of the SHA-2 family.

The second point was that SHA-3 is just the first step towards a permutation-based future: SHA-3 has an elegant foundation that is suitable for implementing the full range of symmetric algorithms. In the future, a single optimised permutation function could be the basis of hashes, MACs, and AEADs, thus saving code size / die area and complexity. (E.g. STROBE.)

But skipping SHA-3 doesn't preclude any of that. SHA-3 is the hash standard itself, and even the Keccak team appear to be pushing K12 rather than SHA-3 now. It seems unlikely that a full set of primitives built around the Keccak permutation would choose to use the SHA-3 parameters at this point.

Indeed, SHA-3 adoption might inhibit that ecosystem by pushing it towards those bad parameters. (There was a thing about NIST tweaking the parameters at the end of the process if you want some background.)

One might argue that SHA-3 should be supported because you believe that it'll result in hardware implementations of the permutation and you hope that they'll be flexible enough to support what you really want to do with it. I'm not sure that would be the best approach even if your goal was to move to a permutation-based world. Instead I would nail down the whole family of primitives as you would like to see it and try to push small chips, where area is a major concern, to adopt it. Even then, the hash function in the family probably wouldn't be exactly SHA-3, but more like K12.


AEADs combine encryption and authentication in a way that provides the properties that people generally expect when they “encrypt” something. This is great because, historically, handing people a block cipher and a hash function has resulted in a lot of bad and broken constructions. Standardising AEADs avoids this.

Common AEADs have a sharp edge though: you must never encrypt two different messages with the same key and nonce. Doing so generally violates the confidentiality of the two messages and might break much more.

There are some situations where obtaining a unique nonce is easy, for example in transport security where a simple counter can be used. (Due to poor design in TLS 1.2, some TLS implementations still managed to duplicate nonces. The answer there is to do what ChaCha20-Poly1305 does in TLS 1.2, and what everything does in TLS 1.3: make the nonce implicit. That means that any mistakes result in your code not interoperating, which should be noticed.)

But there are situations where ensuring nonce uniqueness is not trivial, generally where multiple machines are independently encrypting with the same key. A system for distributing nonces is complex and hard to have confidence in. Generating nonces randomly and depending on statistical uniqueness is reasonable, but only if the space of nonces is large enough. XSalsa20-Poly1305 (paper, code) has a 192-bit nonce and good performance across a range of CPUs. In many situations, using that with random nonces is a sound choice.

However, lots of chips now have hardware support for the AES-GCM AEAD, meaning that its performance and power use is hard to beat. Also, a 192-bit nonce results in a ~40% increase in overhead vs the 96-bit nonce of AES-GCM, which is undesirable in some contexts. (Overhead consists of the nonce and tag, thus the increase from 96- to 192-bit nonces is not 100%.)

But using random nonces in a 96-bit space isn't that comfortable. NIST recommends a limit of 232 messages when using random nonces with AES-GCM which, while quite large, is often not large enough not to have to worry about. Because of this, Shay Gueron, Yehuda Lindell and I have been working on AES-GCM-SIV (paper, spec): AES-GCM with some forgiveness. It uses the same primitives as AES-GCM, and thus enjoys the same hardware support, but it doesn't fail catastrophically if you repeat a nonce. Thus you can use random, 96-bit nonces with a far larger number of messages, or withstand a glitch in your nonce distribution scheme. (For precise numbers, see section 5.3 of the paper.)

There is a performance cost: AES-GCM-SIV encryption runs at about 70% the speed of AES-GCM, although decryption runs at the same speed. (Measured using current BoringSSL on an Intel Skylake chip with 8KiB messages.) But, in any situation where you don't have a watertight argument for nonce uniqueness, that might be pretty cheap compared to the alternative.

For example, both TLS and QUIC need to encrypt small messages at the server that can be decrypted by other servers. For TLS, these messages are the session tickets and, in QUIC, they are the source-address tokens. This is an example of a situation where many servers are independently encrypting with the same key and so Google's QUIC implementation is in the process of switching to using AES-GCM-SIV for source-address tokens. (Our TLS implementation may switch too in the future, although that will be more difficult because of existing APIs.)

AEADs that can withstand nonce duplication are called “nonce-misuse resistant” and that name appears to have caused some people to believe that they are infinitely resistant. I.e. that an unlimited number of messages can be encrypted with a fixed nonce with no loss in security. That is not the case, and the term wasn't defined that way originally by Rogaway and Shrimpton (nor does their SIV mode have that property). So it's important to emphasise that AES-GCM-SIV (and nonce-misuse resistant modes in general) are not a magic invulnerability shield. Figure four and section five of the the paper give precise bounds but, if in doubt, consider AES-GCM-SIV to be a safety net for accidental nonce duplication and otherwise treat it like a traditional AEAD.

AES-GCM-SIV is supported in BoringSSL now and, while one may not want to use the whole of BoringSSL, the core assembly is ISC licensed. Also, Shay has published reference, assembly and intrinsics versions for those who think AES-GCM-SIV might be useful to them.

CFI directives in assembly files

(This post uses x86-64 for illustration throughout. The fundamentals are similar for other platforms but will need some translation that I don't cover here.)

Despite compilers getting better over time, it's still the case that hand-written assembly can be worthwhile for certain hot-spots. Sometimes there are special CPU instructions for the thing that you're trying to do, sometimes you need detailed control of the resulting code and, to some extent, it remains possible for some people to out-optimise a compiler.

But hand-written assembly doesn't automatically get some of the things that the compiler generates for normal code, such as debugging information. Perhaps your assembly code never crashes (although any function that takes a pointer can suffer from bugs in other code) but you probably still care about accurate profiling information. In order for debuggers to walk up the stack in a core file, or for profilers to correctly account for CPU time, they need be able to unwind call frames.

Unwinding used to be easy as every function would have a standard prologue:

push rbp
mov rbp, rsp

This would make the stack look like this (remember that stacks grow downwards in memory):

Caller's stackRSP value before CALLRSP at function entryCaller's RBPCallee's local variablesSaved RIP(pushed by CALL)RBP always points here

So, upon entry to a function, the CALL instruction that jumped to the function in question will have pushed the previous program counter (from the RIP register) onto the stack. Then the function prologue saves the current value of RBP on the stack and copies the current value of the stack pointer into RBP. From this point until the function is complete, RBP won't be touched.

This makes stack unwinding easy because RBP always points to the call frame for the current function. That gets you the saved address of the parent call and the saved value of its RBP and so on.

The problems with this scheme are that a) the function prologue can be excessive for small functions and b) we would like to be able to use RBP as a general purpose register to avoid spills. Which is why the GCC documentation says that “-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging”. This means that you can't depend on being able to unwind stacks like this. A process can be comprised of various shared libraries, any of which might be compiled with optimisations.

To be able to unwind the stack without depending on this convention, additional debugging tables are needed. The compiler will generate these automatically (when asked) for code that it generates, but it's something that we need to worry about when writing assembly functions ourselves if we want profilers and debuggers to work.

The reference for the assembly directives that we'll need is here, but they are very lightly documented. You can understand more by reading the DWARF spec, which documents the data that is being generated. Specifically see sections 6.4 and D.6. But I'll try to tie the two together in this post.

The tables that we need the assembler to emit for us are called Call Frame Information (CFI). (Not to be confused with Control Flow Integrity, which is very different.) Based on that name, all the assembler directives begin with .cfi_.

Next we need to define the Canonical Frame Address (CFA). This is the value of the stack pointer just before the CALL instruction in the parent function. In the diagram above, it's the value indicated by “RSP value before CALL”. Our first task will be to define data that allows the CFA to be calculated for any given instruction.

The CFI tables allow the CFA to be expressed as a register value plus an offset. For example, immediately upon function entry the CFA is RSP + 8. (The eight byte offset is because the CALL instruction will have pushed the previous RIP on the stack.)

As the function executes, however, the expression will probably change. If nothing else, after pushing a value onto the stack we would need to increase the offset.

So one design for the CFI table would be to store a (register, offset) pair for every instruction. Conceptually that's what we do but, to save space, only changes from instruction to instruction are stored.

It's time for an example, so here's a trivial assembly function that includes CFI directives and a running commentary.

  .globl  square
  .type   square,@function
  .hidden square

This is a standard preamble for a function that's unrelated to CFI. Your assembly code should already be full of this.


Our first CFI directive. This is needed at the start of every annotated function. It causes a new CFI table for this function to be initialised.

  .cfi_def_cfa rsp, 8

This is defining the CFA expression as a register plus offset. One of the things that you'll see compilers do is express the registers as numbers rather than names. But, at least with GAS, you can write names. (I've included a table of DWARF register numbers and names below in case you need it.)

Getting back to the directive, this is just specifying what I discussed above: on entry to a function, the CFA is at RSP + 8.

  push    rbp
  .cfi_def_cfa rsp, 16

After pushing something to the stack, the value of RSP will have changed so we need to update the CFA expression. It's now RSP + 16, to account for the eight bytes we pushed.

  mov     rbp, rsp
  .cfi_def_cfa rbp, 16

This function happens to have a standard prologue, so we'll save the frame pointer in RBP, following the old convention. Thus, for the rest of the function we can define the CFA as RBP + 16 and manipulate the stack without having to worry about it again.

  mov     DWORD PTR [rbp-4], edi
  mov     eax, DWORD PTR [rbp-4]
  imul    eax, DWORD PTR [rbp-4]
  pop     rbp
  .cfi_def_cfa rsp, 8

We're getting ready to return from this function and, after restoring RBP from the stack, the old CFA expression is invalid because the value of RBP has changed. So we define it as RSP + 8 again.


At the end of the function we need to trigger the CFI table to be emitted. (It's an error if a CFI table is left open at the end of the file.)

The CFI tables for an object file can be dumped with objdump -W and, if you do that for the example above, you'll see two tables: something called a CIE and something called an FDE.

The CIE (Common Information Entry) table contains information common to all functions and it's worth taking a look at it:

  Version:         1
  Augmentation:    "zR"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address column: 16
  Augmentation data:     1b

  DW_CFA_def_cfa: r7 (rsp) ofs 8
  DW_CFA_offset: r16 (rip) at cfa-8

You can ignore everything until the DW_CFA_… lines at the end. They define CFI directives that are common to all functions (that reference this CIE). The first is saying that the CFA is at RSP + 8, which is what we had already defined at function entry. This means that you don't need a CFI directive at the beginning of the function. Basically RSP + 8 is already the default.

The second directive is something that we'll get to when we discuss saving registers.

If we look at the FDE (Frame Description Entry) for the example function that we defined, we see that it reflects the CFI directives from the assembly:

… FDE cie=…
  DW_CFA_advance_loc: 1 to 0000000000000001
  DW_CFA_def_cfa: r7 (rsp) ofs 16
  DW_CFA_advance_loc: 3 to 0000000000000004
  DW_CFA_def_cfa: r6 (rbp) ofs 16
  DW_CFA_advance_loc: 11 to 000000000000000f
  DW_CFA_def_cfa: r7 (rsp) ofs 8

The FDE describes the range of instructions that it's valid for and is a series of operations to either update the CFA expression, or to skip over the next n bytes of instructions. Fairly obvious.

Optimisations for CFA directives

There are some shortcuts when writing CFA directives:

Firstly, you can update just the offset, or just the register, with cfi_def_cfa_offset and cfi_def_cfa_register respectively. This not only saves typing in the source file, it saves bytes in the table too.

Secondly, you can update the offset with a relative value using cfi_adjust_cfa_offset. This is useful when pushing lots of values to the stack as the offset will increase by eight each time.

Here's the example from above, but using these directives and omitting the first directive that we don't need because of the CIE:

  .globl  square
  .type   square,@function
  .hidden square
  push    rbp
  .cfi_adjust_cfa_offset 8
  mov     rbp, rsp
  .cfi_def_cfa_register rbp
  mov     DWORD PTR [rbp-4], edi
  mov     eax, DWORD PTR [rbp-4]
  imul    eax, DWORD PTR [rbp-4]
  pop     rbp
  .cfi_def_cfa rsp, 8

Saving registers

Consider a profiler that is unwinding the stack after a profiling signal. It calculates the CFA of the active function and, from that, finds the parent function. Now it needs to calculate the parent function's CFA and, from the CFI tables, discovers that it's related to RBX. Since RBX is a callee-saved register, that's reasonable, but the active function might have stomped RBX. So, in order for the unwinding to proceed it needs a way to find where the active function saved the old value of RBX. So there are more CFI directives that let you document where registers have been saved.

Registers can either be saved at an offset from the CFA (i.e. on the stack), or in another register. Most of the time they'll be saved on the stack though because, if you had a caller-saved register to spare, you would be using it first.

To indicate that a register is saved on the stack, use cfi_offset. In the same example as above (see the stack diagram at the top) the caller's RBP is saved at CFA - 16 bytes. So, with saved registers annotated too, it would start like this:

  push    rbp
  .cfi_adjust_cfa_offset 8
  .cfi_offset rbp, -16

If you need to save a register in another register for some reason, see the documentation for cfi_register.

If you get all of that correct then your debugger should be able to unwind crashes correctly, and your profiler should be able to avoid recording lots of detached functions. However, I'm afraid that I don't know of a better way to test this than to zero RBP, add a crash in the assembly code, and check whether GBD can go up correctly.

(None of this works for Windows. But Per Vognsen, via Twitter, notes that there are similar directives in MASM.)

CFI expressions

New in version three of the DWARF standard are CFI Expressions. These define a stack machine for calculating the CFA value and can be useful when your stack frame is non-standard (which is fairly common in assembly code). However, there's no assembler support for them that I've been able to find, so one has to use cfi_escape and provide the raw DWARF data in a .s file. As an example, see this kernel patch.

Since there's no assembler support, you'll need to read section 2.5 of the standard, then search for DW_CFA_def_cfa_expression and, perhaps, search for cfi_directive in OpenSSL's perlasm script for x86-64 and the places in OpenSSL where that is used. Good luck.

(I suggest testing by adding some instructions that write to NULL in the assembly code and checking that gdb can correctly step up the stack and that info reg shows the correct values for callee-saved registers in the parent frame.)

CFI register numbers

In case you need to use or read the raw register numbers, here they are for a few architectures:

(may be EBP on MacOS) (may be ESP on MacOS)
Register numberx86-64x86ARM

(x86 values taken from page 25 of this doc. x86-64 values from page 57 of this doc. ARM taken from page 7 of this doc.)

RISC-V assembly

RISC-V is a new, open instruction set. Fabrice Bellard wrote a Javascript emulator for it that boots Linux here (more info). I happen to have just gotten a physical chip that implements it too (one of these) and what's cool is that you can get the source code to the chip on GitHub.

The full, user-level instruction set is documented but there's a lot of information in there. I wanted a brief summary of things that I could keep in mind when reading disassembly. This blog post is just a dump of my notes; it's probably not very useful for most people! It also leaves out a lot of things that come up less frequently. There's also an unofficial reference card which is good, but still aimed a little low for me.

RISC-V is little-endian and comes in 32 and 64 bit flavours. In keeping with the RISC-V documents, the flavour (either 32 or 64) is called XLEN below in the few places where it matters.

For both, int is 32 bits. Pointers and long are the native register size. Signed values are always sign extended in a larger register and unsigned 8- and 16-bit values are zero extended. But unsigned 32-bit values are sign-extended. Everything has natural alignment in structs and the like, but alignment isn't required by the hardware.

The stack grows downwards and is 16-byte aligned upon entry to a function.

Instructions are all 32 bits and 32-bit aligned. There is a compressed-instruction extension that adds 16-bit instructions and changes the required alignment of all instructions to just 16 bits. Unlike Thumb on ARM, RISC-V compressed instructions are just a short-hand for some 32-bit instruction and 16- and 32-bit instructions can be mixed freely.

There are 31 general purpose registers called x1–x31. Register x0 drops all writes and always reads as zero. Each register is given an alias too, which describes their conventional use:

RegisterAliasDescriptionSaved by
x1raReturn addressCaller
x2spStack pointerCallee
x3gpGlobal pointer
x4tpThread pointer
x8s0/fpSaved register / frame pointerCallee
x9s1Saved registerCallee
x10–17a1–a7Arguments and return valuesCaller
x18–27s2–11Saved registersCallee

There are two types of immediates, which will be written as ‘I’ and ‘J’. Type ‘I’ intermediates are 12-bit, signed values and type ‘J’ are 20-bit, signed values. They are always sign-extended to the width of a register before use.


There's no carry flag, instead one has to use a comparison instruction on the result in order to get the carry in a register.

A U suffix indicates an unsigned operation. Note that the multiplication and division instructions are part of the ‘M’ extension, but I expect most chips to have them.

ADD dest,src1,src2dest = src1 + src2
SUB dest,src1,src2dest = src1 - src2
ADDI dest,src1,Idest = src1 + I
MUL dest,src1,src2dest = src1 × src2
MULH[|U|SH] dest,src1,src2dest = (src1 × src2) >> XLENThis returns the high word of the multiplication result for signed×signed, unsigned×unsigned or signed×unsigned, respectively
DIV[U] dest,src1,src2dest = src1/src2There's no trap on divide-by-zero, rather special values are returned. (See documentation.)
REM[U] dest,src1,src2dest = src1%src2


These should all be obvious:

AND dest,src1,src2
OR dest,src1,src2
XOR dest,src1,src2
ANDI dest,src1,I
ORI dest,src1,I
XORI dest,src1,I


Instructions here are Shift [Left|Right] [Logical|Arithmetic].

Note that only the minimal number of bits are read from the shift count. So shifting by the width of the register doesn't zero it, it does nothing.

SLL dest,src1,src2dest = src1 << src2%XLEN
SRL dest,src1,src2dest = src1 >> src2%XLEN
SRA dest,src1,src2dest = src1 >> src2%XLEN
SLLI dest,src1,0‥31/63dest = src1 << I
SRLI dest,src1,0‥31/63dest = src1 >> I
SRAI dest,src1,0‥31/63dest = src1 >> I


These instructions set the destination register to one or zero depending on whether the relation is true or not.

SLT dest,src1,src2dest = src1<src2(signed compare)
SLTU dest,src1,src2dest = src1<src2(unsigned compare)
SLTI dest,src1,Idest = src1<I(signed compare)
SLTIU dest,src1,Idest = src1<I(unsigned compare, but remember that the immediate is sign-extended first)

Control flow

The JAL[R] instructions write the address of the next instruction to the destination register for the subsequent return. This can be x0 if you don't care.

Many instructions here take an immediate that is doubled before use. That's because no instruction (even with compressed instructions) can ever start on an odd address. However, when you write the value in assembly you write the actual value; the assembler will halve it when encoding.

JAL dest,Jdest = pc+2/4; pc+=2*J
JALR dest,src,Idest = pc+2/4; pc=src1+(I&~1)Note that the least-significant bit of the address is ignored
BEQ src1,src2,Iif src1==src2, pc+=2*I
BNE src1,src2,Iif src1!=src2, pc+=2*I
BLT src1,src2,Iif src1<src2, pc+=2*Isigned compare
BLTU src1,src2,Iif src1<src2, pc+=2*Iunsigned compare
BGE src1,src2,Iif src1>=src2, pc+=2*Isigned compare
BGEU src1,src2,Iif src1>=src2, pc+=2*Iunsigned compare


The suffix on these instructions denotes the size of the value read or written: ‘D’ = double-word (64 bits), ‘W’ = word (32 bits), ‘H’ = half-word (16 bits), ´B’ = byte. Reading and writing 64-bit values only works on 64-bit systems, and LWU only exists on 64-bit systems because it's the same as LW otherwise.

Alignment is not required, but might be faster. Also note that there's no consistent direction of data flow in the textual form of the assembly: the register always comes first.

L[D|W|H|B] dest,I(src)dest = *(src + I)result is sign extended
L[D|W|H|B]U dest,I(src)dest = *(src + I)result is zero extended
S[D|W|H|B] src1,I(src2)*(src2 + I) = src1


LUI dest,Jdest = J<<12“Load Upper Immediate”. The result is sign extended on 64-bit.
AUIPC dest,Jdest = pc + J<<12“Add Upper Immediate to PC”. The result of the shift is sign-extended on 64-bit before the addition.

W instructions

These instructions only exist on 64-bit and are copies of the same instruction without the ‘W’ suffix, except that they operate only on the lower 32 bits of the registers and, when writing the result, sign-extend to 64 bits. They are ADDW, SUBW, ADDIW, DIV[U]W, REM[U]W, S[L|R]LW, SRAW, S[L|R]LIW and SRAIW


For clarity, there are also a number of pseudo-instructions defined. These are just syntactic sugar for one of the primitive instructions, above. Here's a handful of the more useful ones:

nopaddi x0,x0,0
mv dest,srcaddi dest,src,0
not dest,srcxor dest,src,-1
seqz dest,srcsltiu dest,src,1
snez dest,srcsltu dest,x0,src
j Jjal x0,J
retjalr x0,x1,0
call offsetauipc x6, offset >> 12; jalr x1,x6,offset & 0xfff
li dest,value(possibly several instructions to load arbitrary value)
la dest,symbolauipc dest, symbol >> 12; addi dest,dest,offset & 0xfff
l[d|w|h|b] dest,symbolauipc dest, symbol >> 12; lx dest,offset & 0xfff(dest)
s[d|w|h|b] src,symbolauipc dest, symbol >> 12; sx dest,offset & 0xfff(dest)

Note that the instructions that expand to two instructions where the first is AUIPC aren't quite as simple as they appear. Since the 12-bit immediate in the second instruction is sign extended, it could end up negative. If that happens, the J immediate for AUIPC needs to be one greater and then you can reach the value by subtracting from there.

CECPQ1 results

In July my colleague, Matt Braithwaite, announced that Chrome and Google would be experimenting with a post-quantum key-agreement primitive in TLS. One should read the original announcement for details, but we had two goals for this experiment:

Firstly we wanted to direct cryptoanalytic attention at the family of Ring Learning-with-Errors (RLWE) problems. The algorithm that we used, NewHope, is part of this family and appeared to be the most promising when we selected one at the end of 2015.

It's very difficult to know whether we had any impact here, but it's good to see the recent publication and withdrawal of a paper describing a quantum attack on a fundamental lattice problem. Although the algorithm contained an error, it still shows that capable people are testing these new foundations.

Our second goal was to measure the feasibility of deploying post-quantum key-agreement in TLS by combining NewHope with an existing key-agreement (X25519). We called the combination CECPQ1.

TLS key agreements have never been so large and we expected a latency impact from the extra network traffic. Also, any incompatibilities with middleboxes can take years to sort out, so it's best to discover them as early as possible.

Here the results are more concrete: we did not find any unexpected impediment to deploying something like NewHope. There were no reported problems caused by enabling it.

Although the median connection latency only increased by a millisecond, the latency for the slowest 5% increased by 20ms and, for the slowest 1%, by 150ms. Since NewHope is computationally inexpensive, we're assuming that this is caused entirely by the increased message sizes. Since connection latencies compound on the web (because subresource discovery is delayed), the data requirement of NewHope is moderately expensive for people on slower connections.

None the less, if the need arose, it would be practical to quickly deploy NewHope in TLS 1.2. (TLS 1.3 makes things a little more complex and we did not test with CECPQ1 with it.)

At this point the experiment is concluded. We do not want to promote CECPQ1 as a de-facto standard and so a future Chrome update will disable CECPQ1 support. It's likely that TLS will want a post-quantum key-agreement in the future but a more multilateral approach is preferable for something intended to be more than an experiment.


Security protocols often assume an accurate, local clock (e.g. TLS, Kerberos, DNSSEC and more). It's a widely accepted assumption when designing protocols but, for a lot of people, it just isn't true. We find good evidence that at least 25% of all certificate errors in Chrome are due to a bad local clock.

Even when the local clock is being synchronised, it's very likely to be using unauthenticated NTP. So if your threat model includes man-in-the-middle attackers then you still can't trust the local clock.

There have been efforts to augment NTP with authentication, but they still assume a world where each client trusts one or more time servers absolutely. In order to explore a solution that allows time servers to be effectively audited by clients, myself and my colleague Matt Braithwaite (with assistance and advice from Ben Laurie and Michael Shields) have developed Roughtime.

Very briefly: using some tricks we believe that it's viable to deploy servers that sign a client-chosen nonce and timestamp on demand. Once you have several of these servers, clients can generate their nonces by hashing replies from other servers with some entropy. That proves that a nonce was created after the reply was received. Clients maintain a chain of nonces and replies and, if a server misbehaves, can use replies from several other servers to prove and report it.

Currently there's only one Roughtime service running, so the idea of spreading trust around is inchoate. But we would like to gauge whether others are interested in this idea, specifically whether there are any organisations who would be seriously interested in deploying something like this in their clients. (Because I assume that, if you have clients, then you'll also be interested in running a server.)

There's a much longer introduction and many more details on the Roughtime site and we've also setup a mailing list.

memcpy (and friends) with NULL pointers

The C standard (ISO/IEC 9899:2011) has a sane-seeming definition of memcpy (section

The memcpy function copies n characters from the object pointed to by s2 into the object pointed to by s1.

Apart from a prohibition on passing overlapping objects, I think every C programmer understands that.

However, the standard also says (section 7.1.4):

If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after promotion) not expected by a function with variable number of arguments, the behavior is undefined.

(Emphasis is mine.)

I'm sure that 7.1.4 seemed quite reasonable in isolation, but how does it interact with the case where memcpy is called with a zero length? If you read then you might well think that, since the function copies zero bytes, it's valid to pass NULL as either of the pointer arguments. I claim that the vast majority of C programmers would agree with that, but 7.24.1(2) clarifies that 7.1.4 really does apply:

Where an argument declared as size_t n specifies the length of the array for a function, n can have the value zero […] pointer arguments on such a call shall still have valid values, as described in 7.1.4.

(Nobody would actually write memcpy(NULL, NULL, 0), of course, because it (at best) does nothing. But such a call can easily arise at run-time when an empty object is handled by a more general function.)

Some compilers will use this corner of the standard to assume that pointers passed to memcpy are non-NULL, irrespective of the length argument. GCC has built this in, while Clang can get it from the fact that glibc annotates memcpy with nonnull specifications.

Consider the following function:

#include <stdint.h>
#include <string.h>

int f(uint8_t *dest, uint8_t *src, size_t len) {
  memcpy(dest, src, len);
  return dest == NULL;

Here's the output of building that with GCC 6.1.1 with -O2:

0000000000000000 <f>:
   0:	48 83 ec 08          	sub    rsp,0x8
   4:	e8 00 00 00 00       	call   9 <f+0x9>  # memcpy
   9:	31 c0                	xor    eax,eax
   b:	48 83 c4 08          	add    rsp,0x8
   f:	c3                   	ret

From that we can see that rax (which holds the return value of a function in the amd64 ABI) is unconditionally set to zero, i.e. the compiler has assumed that dest == NULL is false because it has been passed to memcpy. The compiler's reasoning goes like this: 7.1.4 says that passing a NULL pointer to a standard library function is undefined behaviour, therefore if dest was NULL any behaviour is reasonable. So the code can be optimised with the assumption that it's non-NULL, as that's the only case with defined behaviour.

(You can also play with this snippet in Matt Godbolt's excellent tool.)

Opinions on this vary from “the C standard defines the language thus that optimisation is fine by definition” to “that's crazy: there's a huge amount of code out there that probably assumes the obvious behaviour of memcpy”. Personally, I find myself further towards the latter position than the former.

Also, it's not just memcpy: the same optimisations are annotated in glibc for (at least) memccpy, memset, memcmp, memchr, memrchr, memmem, mempcpy, bcopy and bcmp. Section 7.1.4 can be applied to any standard library function.


To try and figure out the impact that this optimisation is having I built a number of open-source programs with GCC 6.1.1, with -fno-builtin (to disable GCC's built-in versions of these functions) and with glibc's string.h including, or not, the nonnull annotations. For example, the snippet of code above produces this diff when tested this way:

0000000000000000 <f>:
-   0:	48 83 ec 08          	sub    rsp,0x8
+   0:	53                   	push   rbx
+   1:	48 89 fb             	mov    rbx,rdi
    4:	e8 00 00 00 00       	call   9 <f+0x9>
    9:	31 c0                	xor    eax,eax
-   b:	48 83 c4 08          	add    rsp,0x8
-   f:	c3                   	ret    
+   b:	48 85 db             	test   rbx,rbx
+   e:	0f 94 c0             	sete   al
+  11:	5b                   	pop    rbx
+  12:	c3                   	ret    

The added code tests dest to set the return value, as intended.

The first program I tested was BIND 9.9.5 because of this advisory that says: “GCC now includes (by default) an optimization which is intended to eliminate unnecessary null pointer comparisons in compiled code. Unfortunately this optimization removes checks which are necessary in BIND and the demonstrated effect is to cause unpredictable assertion failures during execution of named, resulting in termination of the server process”. Although version 9.9.5 should be affected according to the advisory, I found no differences in the compiled output based on nonnull annotations in string.h. Perhaps it's because I'm using a different GCC, perhaps I just got something wrong in my testing, or perhaps these checks were eliminated for different reasons. (For example, a local root exploit in the kernel was enabled by a dereference-based removal of a NULL check.)

Next up, I tried something that I'm more involved with: BoringSSL. Here there are two changes: a reordering of two conditions in OPENSSL_realloc_clean (which has no semantic effect) and extensive changes in BN_mod_exp_mont. I'm sure I would be able to do a better analysis if I were more experienced with disassembling large programs, but I'm just using objdump and diff. Still, I believe that all the changes are the result of a single NULL check being removed and then the resulting offset shift of all the following code. That counts as an optimisation, but it's statically clear that the pointer cannot be NULL even without any assumptions about string.h functions so I struggle to give much credit.

Since BoringSSL showed some changes, I tried OpenSSL 1.0.2h. This also shows the same large changes around BN_mod_exp_mont. There's also a large change in dsa_builtin_paramgen2 (a function that we don't have in BoringSSL) but that appears to be another insignificant NULL-check removed and a consequent change of all the later offsets. Lastly, there are a handful of no-op changes: like swapping the arguments to cmp before jne.

Next I tried openssh-7.2p2, which shows no changes. I wondered whether someone had already done this analysis and corrected any problems in OpenSSH so tried a much older version too: 5.4p1. That does show a small, but non-trivial, change in ssh_rsa_verify. After a bit of thought, I believe that GCC has managed to eliminate a test for a non-NULL pointer at the end of openssh_RSA_verify. Just like the BoringSSL case, it's already possible to deduce that the pointer must be non-NULL without any section 7.1.4 assumptions.


It's clear that one has to write C code that's resilient to the compiler assuming that any pointers passed to standard library functions are non-NULL. There have been too many releases of glibc and GCC with this in to safely assume that it'll ever go away.

However, the benefits of this (i.e. the optimisations that the compiler can perform because of it) are nearly zero. Large programs can be built where it has no effect. When there are changes they are either cases that the compiler should have been able to figure out anyway, or else noise changes with no effect.

As for the costs: there have been several cases where removing NULL checks has resulted in a security vulnerability, although I can't find any cases of this precise corner of the C standard causing it. It also adds a very subtle, exceptional case to several very common functions, burdening programmers. But it thankfully rarely seems to make a difference in real-life code, so hopefully there's not a large pool of bugs in legacy code that have been created by this change.

Still, given the huge amount of legacy C code that exists, this optimisation seems unwise to me. Although I've little hope of it happening, I'd suggest that GCC and glibc remove these assumptions and that the next revision of the C standard change 7.24.1(2) to clarify that when a length is zero, pointers can be NULL.

If anyone wishes to check my results here, I've put the scripts that I used on GitHub. I'm afraid that it takes a bit of manual setup and, given variation in GCC versions across systems, some differences are to be expected but the results should be reproducible.

Cryptographic Agility

(These are notes that I wrote up from a talk that I gave at the National Academies Forum on Cyber Resilience. You can tell that it was in Washington, DC because of the “cyber”.

I wasn't quite sure how technical to pitch this talk so it's relatively introductory; regular readers probably know all this.

This isn't a transcript of what I said, but I try to hit the main points in my notes.)

Firstly I'd like to separate extensibility from agility. A protocol is extensible if you can add features to it without having to update every implementation at the same time—which is generally impossible. Cryptographic agility depends on having extensibility, at least if you ever want to use something that wasn't designed into a protocol at the beginning.

Protocols should be extensible: the world keeps changing and no design is going to be perfect for all time. But extensibility is much harder in practice than it sounds.

I happen to be particularly familiar with TLS and TLS has two, major extensibility mechanisms. The first is a simple version number. Here's how the specification says that it should work:

Client: I support up to version 1.2.

Server: (calculates the minimum of the version that the client supports and the maximum version that the server supports) Ok, let's use version 1.1.

This is commendably simple: it's not possible to express a range of versions and certainly not a discontinuous range. This is about as simple as an extensibility mechanism could be, and yet lots of implementations get it wrong. It's a common mistake for implementations to return an error when the client offers a version number that they don't understand.

This, of course, means that deploying new versions doesn't work. But it's insidious because the server will work fine until someone tries to deploy a new version. We thought that we had flexibility in the protocol but it turned out that bugs in code had rusted it in place.

At this point it's worth recalling the Law of the Internet: blame attaches to the last thing that changed. If Chrome updates and then something stops working then Chrome gets the blame. It doesn't matter that the server couldn't correctly calculate the minimum of two numbers. No normal person understands or cares about that.

What's to be done about this? Well, we work around issues if they're big and suck up the breakage if they're small. It's taken about 15 years to get to the point where web browsers don't have to work around broken version negotiation in TLS and that's mostly because we only have three active versions of TLS. When we try to add a fourth (TLS 1.3) in the next year, we'll have to add back the workaround, no doubt. In summary, this extensibility mechanism hasn't worked well because it's rarely used and that lets bugs thrive.

TLS has a second, major extension mechanism which is a series of (key, value) pairs where servers should ignore unknown keys. This has worked a little better because, while there are only three or four versions in play, with many years between versions, there are 25 to 30 extensions defined. It's not perfect: bugs in implementations have led them to be dependent on the order of extensions and somebody at least managed to write a server that breaks if the last value is empty.

Sometimes more extensibility points have been added inside of extensions in the expectation that it'll save adding another, top-level extension in the future. This has generally been a mistake: these extension points have added complexity for little reason and, when we try to use them, we often find that bugs have rusted them solid anyway. They've just been a waste.

There's a lesson in all this: have one joint and keep it well oiled.

Protocol designers underestimate how badly people will implement their designs. Writing down how you think it should work and hoping that it'll work, doesn't work. TLS's protocol negotiation is trivial and the specification is clear, yet it still didn't work in practice because it's difficult to oil.

Rather one needs to minimise complexity, concentrate all extensibility in a single place and actively defend it. An active defense can take many forms: fuzzing the extensibility system in test suites and compliance testing is good. You might want to define and implement dummy extensions once a year or such, and retire old ones on a similar schedule. When extensions contain lists of values, define a range of values that clients insert at random. In short, be creative otherwise you'll find that bug rust will quickly settle in.

Agility itself

Cryptographic agility is a huge cost. Implementing and supporting multiple algorithms means more code. More code begets more bugs. More things in general means less academic focus on any one thing, and less testing and code-review per thing. Any increase in the number of options also means more combinations and a higher chance for a bad interaction to arise.

Let's just consider symmetric ciphers for a moment. Because everyone wants them to be as fast as possible, BoringSSL currently contains 27 thousand lines of Perl scripts (taken from OpenSSL, who wrote them all) that generate assembly code just in order to implement AES-GCM. That's a tremendous amount of work and a tremendous scope for bugs.

Focusing again on TLS: over the years, 25 different ciphers and modes have been specified for use in TLS. Thankfully, of those, only nine are actively used. But that doesn't mean that the zombies of the others might not still be lurking around, ready to cause problems.

Where did this mess of diversity come from?

1. Old age / we had no idea what we were doing in the 1990's:

RC2_CBC_40 RC4_128 RC4_40

A lot of mistakes were made in the 1990's—we really didn't know what we were doing. Phil Rogaway did, but sadly not enough people listened to him; probably because they were busy fighting the US Government which was trying to ban the whole field of study at the time. Unfortunately that coincided with the early inflation period of the internet and a lot of those mistakes were embedded pretty deeply. We're still living with them today.

2. National pride cipher suites


The next cause of excess agility are the national pride cipher suites. Many countries consider cryptography to be an area of national interest but then mistakenly believe that means that they have to invent their own standards and primitives. South Korea and Japan were especially forthright about this and so managed to get these ciphers assigned code points in TLS but Russia and China and, to some extent, many other countries do the same thing.

Although they receive limited analysis compared to something like AES, they're generally not bad, per se, but they bring nothing new to the table: they add nothing but costs, and the costs are significant. Cryptographic diversity for the point of national pride should be strenuously resisted for that reason. Other countries may complain that the US got their standards widely used but the US got to specify a lot about the internet by being the first mover. (And AES is from Belgium anyway.) However, it is the case that I'm not aware of any of these national standards being used to promote something that's actually a deliberate backdoor; which is, of course, not true of the US.

3. Reasonable cases for diversity:

  • Embedded systems want to minimise circuit size: AES_128_CCM and AES_256_CCM.
  • We want something faster for when we don't have AES hardware: CHACHA20_POLY1305.
  • US Government standard, got hardware support from Intel: AES_128_GCM and AES_256_GCM.

Now we come to the ones that are reasonable to use and the reasons for diversity there. It's all about performance optimisation for different environments really: tiny devices want CCM because it only needs an AES-encrypt circuit. Devices without hardware support for AES-GCM want to use ChaCha20-Poly1305 because it's much more efficient in software. Everything else wants to use AES-GCM.

Agility has allowed us to introduce the ciphers in the final set and that's really important. But it's equally important to kill off the old stuff, and that's very hard. Nearly all the incentives are aligned against it. Recall the Law of the Internet (mentioned above); users hate stuff breaking and always blame you. Even djb will take to Twitter when one drops DSA support.

We have a long conveyor belt of primitives, we put new ones at the front and, every so often, we turn the crank and something drops off the end. In addition to all the obvious problems with killing off old stuff, that also means that there's a lot of inadvisable options that will generally function at any given time and this is leading to new products launching with no idea that they're sitting towards the end of this conveyor belt. These products expect a lifetime of some number of years and are unaware that we hope to discontinue something that they're using much sooner than that. It's no longer the case that we can assume that waiting a year will result in a reduction of the amount of use that a deprecated primitive gets because of new launches.

Google tries to address this where it can by requiring support for the newest options in our certification process for devices that interact with our services. But only a tiny subset of the things that interact with Google go through any of our certifications.

Things are even harder in non-interactive cases. TLS at least gets to negotiate between the client and server but algorithms in S/MIME messages and certificate signatures don't allow that. (One can think of ways to help change that, but the current reality is that they're not negotiated.) That's why dropping SHA-1 support in certificates has been a such a gruesome fight and why PKCS#8 messages still require us to support 40-bit RC2.

So what's the lesson here? I'd say that you need extensibility but, when it comes to cryptographic agility, have one option. Maybe two. Fight to keep it that small.

It's worth highlighting that, for the purposes of time, I've simplified things dramatically. I've considered only symmetric ciphers and modes above but, even within TLS, there's a whole separate conveyor belt for asymmetric algorithms. And I've not mentioned the oncoming storm of quantum computers. Quantum computers are going to be hilarious and I hope to be retired before they get big enough to cause problems!

Post-quantum key agreement

If large quantum computers can be built (and not of the D-Wave variety) then they'll make quite a mess of public-key cryptography. RSA and systems based on discrete logarithms (i.e. finite-field and elliptic-curve Diffie-Hellman) are all broken. I've written about hash-based signatures (which resist quantum attacks) before and the recent PQCRYPTO recommendations suggest those for signatures and McEliece for public-key encryption. Those are both sound, conservative recommendations but both have some size problems: McEliece public keys can be on the order of a megabyte of data and conservative hash-based signatures are about 40KB.

In some situations that's not a problem, and things like firmware signatures, where the key is embedded in hard-to-change silicon, should consider using hash-based signatures today. But those costs motivate the search for post-quantum schemes that are closer to the small, fast primitives that we have today.

One candidate is called Ring Learning-With-Errors (RLWE) and I'll try to give a taste of how it works in this post. This is strongly based on the A New Hope paper by Alkim, Ducas, Pöppelmann & Schwabe, and, in turn, on papers by Bos, Costello, Naehrig & Stebila and Peikert.

Firstly, the basic stats (because I always hate having to dig around for these values in papers). I've included the numbers for a current, elliptic-curve based Diffie-Hellman scheme (X25519) for comparision:

A New HopeX25519
Alice's transmission size1,824 bytesa32 bytes
Alice's computation129,638 cyclesb331,568 cyclesc
Bob's transmission size1,824 bytes32 bytes
Bob's computation126,236 cycles331,568 cycles

(a This is using a more compact scheme than in the paper. b These are Haswell cycle counts from the paper. c These are values from the SUPERCOP benchmark on titan0.)

Something to keep in mind when looking at the numbers above: sending 1,824 bytes on a 10MBit link takes 5.1 million cycles, assuming that your CPU is running at 3.5GHz.

RLWE key agreement

Our fundamental setting is ℤ12289[X]/(X1024+1). That's the set of polynomial equations where the largest power of x is 1023 and the coefficients are values between zero and 12288 (inclusive). For example, 66 + 4532x + 10000x2 + … + 872x1023.

Addition and multiplication can be defined for polynomials in this set. Addition is done by matching up powers of x and adding the corresponding coefficients. If the result is out of range then it's reduced modulo 12289.

Multiplication is high-school polynomial multiplication where the polynomial on the right is multiplied by every term of the polynomial on the left and the result is the sum of those terms. Coefficients are reduced modulo 12289 to keep them in range, but it's likely that the resulting powers of x will be too large—multiplying two x1023 terms gives a result in terms of x2046.

Polynomials with too large a degree can be reduced modulo x1024+1 until they're in range again. So, if we ended up with a term of a×x2046 then we could subtract a×x1022(x1024+1) to eliminate it. By repeated application of that trick, all the terms with powers of x greater than 1023 can be eliminated and then the result is back in the set.

Now that we can add and multiply within this set of polynomials we need to define a noise polynomial: this is simply a polynomial where each coefficient is sampled from a random distribution. In this case, the distribution will be a centered binomial distribution that ranges from -12 to 12. The probability density looks like this:

image/svg+xml Gnuplot Gnuplot Produced by GNUPLOT 5.0 patchlevel 1 0 -15 -10 -5 0 5 10 15

An important feature of noise polynomials is that the magnitude of each coefficient is small. That will be critical later on.

A random polynomial is one where each coefficient is sampled from a uniform distribution over the full range of zero to 12288.

To build a Diffie-Hellman protocol from this, Alice generates a random polynomial, a, and two noise polynomials, s and e. Alice calculates b = as+e and sends a and b to Bob. Bob generates his own s′ and e′, uses Alice's a to calculate u = as′+e′, and sends u back to Alice. Now Alice can calculate us = (as′+e′)s = as′s+e′s and Bob can calculate bs′ = (as+e)s′ = ass′+es′. But the keen-eyed will notice that those are different values!

The added noise is necessary for security but it means that the two sides to this protocol calculate different values. But, while the values are different, they're very similar because the magnitude of the noise polynomials is small. So a reconciliation mechanism is needed to find a shared value given two, similar polynomials.


So far I've been following the A New Hope paper and it does include a reconciliation mechanism. But, to be honest, I'm not sure that I understand it, so I'm going to be describing a mechanism by Peikert here:

The reconciliation will treat each coefficient in the similar polynomials separately and will extract a single, shared bit from each. Since we're dealing with polynomials that have 1024 terms, we'll get a 1024-bit shared secret in total but I'm just going to discuss the process of processing a single coefficient to get a single bit.

Consider the coefficient space as a circle: zero and 12289 are the same value modulo 12289 and we put that at the top of the circle. At the bottom of the circle will be 12289/2 = 6144 (rounding down). We know that, for each coefficient, Alice and Bob will have similar values—meaning that the values will be close by on the circle.

image/svg+xml 0 6144

One option is to split the circle into left and right halves and say that if the point is in the left half then it's a zero, otherwise it's a one.

image/svg+xml 0 6144 0 1

But while that will work most of the time, there's obviously a problem when the points are near the top or the bottom. In these cases, a small difference in location can result in a point being in a different half, and thus Alice and Bob will get a different result.

In this case we want to split the circle into top (zero) and bottom (one), so that both points are clearly in the bottom half.

image/svg+xml 0 6144 0 1

But that just moves the problem around to the left and right edges of the circle. So how about we vary the basis in which we measure the points depending where they are? If the point is near the bottom or the top we'll use the top–bottom (blue) basis and, if not, we'll use the left–right (red) basis.

image/svg+xml 0 6144

But there's still a problem! Consider the two points in the diagram just above. One party will think that it's in a red area, measure it left–right and conclude that the shared bit is a zero. The other will think it's in a blue area, measure it top–bottom and conclude that the shared bit is a one.

There isn't a solution to this in which the parties operate independently so, instead, one of the parties chooses the basis in which each coefficient will be measured. This information is a single bit of information (i.e. red or blue) that we have Bob send to Alice along with his value u. With this reconciliation information in hand, both parties can measure their points in the same, optimal basis and calculate the same, shared value.


There's lots in the paper that I've skipped over here. Most importantly (for performance), a variant of the Fourier transform can be used to convert the polynomials into the frequency domain where multiplication is much faster. Some of the values transmitted can be transmitted in the frequency domain too to save conversions overall. Also, random polynomials can be sampled directly in the frequency domain.

The parameters here have also been carefully selected so that the reduction stage of multiplication happens magically, just by point-wise multiplication of a some constants before and after the transform.

The a value that Alice generates could be a global constant, but in order to save worrying about how it was generated, and to eliminate the possibility of all-for-the-price-of-one attacks (like LogJam), it's generated fresh for each instance. Rather than transmit it in full, Alice need only send a seed value for it to Bob.

Contrasts with Diffie-Hellman

The most important change from Diffie-Hellman is that this scheme requires that all values be ephemeral. Both QUIC and TLS 1.3 assume that a Diffie-Hellman public-value can be reused but, in this scheme, that breaks the security of the system. More traditional uses of Diffie-Hellman, e.g. TLS 1.2 and SSH, are fine though.

Another important change is that this scheme takes a full round-trip for Alice. With Diffie-Hellman, both parties can transmit a message at time zero and as soon as the other party has received the message, they can calculate the shared key and start transmitting. But even if the a value is a global constant in this scheme, the reconciliation process means that Bob can't send a message until he's received Alice's message, and Alice can't calculate the shared key until she has received Bob's message. So Alice has a wait a full round-trip.

Often that limitation isn't important because other parts of the protocol already require the round-trip (for example, in TLS). But for some uses it's a critical difference.

Also, since this protocol involves random noise it has a failure probability: it's possible that the reconciliation mechanism produces different answers for each side. Thankfully this probability can be made negligible (i.e. less than one in 264).

I should also note that the RLWE problem is hypothesised to be resistant to quantum attack, but we don't know that. We also don't know that it's resistant to attacks by classical computers! It's possible that someone will develop a classical algorithm tomorrow that breaks the scheme above. Thus it should be used concurrently with a standard Diffie-Hellman (e.g. X25519) and the outputs of each should concatenated as the input keying material for a KDF.

Juniper: recording some Twitter conversations

Update: Ralf wrote up some notes from his work. These now include an update themselves with information from Willem Pinckaers that suggests that the presumed Dual-EC output is exposed to the world in Juniper devices.

On Thursday, Juniper announced that some of their products were affected by “unauthorized code in ScreenOS that could allow a knowledgeable attacker to gain administrative access to NetScreen® devices and to decrypt VPN connections”. That sounds like an attacker managed to subvert Juniper's source code repository and insert a backdoor. Of course, any glimpses that we get of these sorts of attacks are fascinating.

Juniper followed up with a slightly more detailed post that noted that there were two backdoors: one via SSH and one that “may allow a knowledgeable attacker who can monitor VPN traffic to decrypt that traffic”. Either of these would be very interesting to a nation-state attacker but that latter—passive decryption of VPN connections—is really in their neighborhood.

So, of course, smarter people than I quickly took to Twitter to pull apart the differences in the fixed firmware versions. Since Twitter conversations are terrible to try and pick apart after the fact, I'm writing down the gist of things here. But I'm just the scribe in this case; other people did the work.

One of the first things that people focused on was a difference to a large, hex value that was visible by just diffing the strings of the two firmwares. That change is interesting not just because it's a large, opaque hex string in a binary, but because of the hex strings that immediately precede it. Specially they were:

  • FFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF: this is the prime order of the underlying field of P-256, a standard elliptic curve.
  • FFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFC: P-256 is typically written in short-Weierstrass form: y2=x3+ax+b. This is then the a value for P-256.
  • 5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B: This is the b value for the P-256 equation.
  • 6B17D1F2E12C4247F8BCE6E563A440F277037D812DEB33A0F4A13945D898C296: This is the x coordinate for the standard generator of P-256—the starting point for operations on the curve.
  • FFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551: This is the number of points on P-256.

So all the values just before the changed one are constants for P-256, suggesting that the changed value is cryptographic too. The obvious, missing value would be the y coordinate for the standard generator. One possibility was that the attack put in the wrong y value. This could put the generator on the wrong curve, say a weaker curve that shares most of the same parameters as P-256 but with a different value for b. But the curve that would have resulted, while weaker, wasn't real-time-passive-decryption weak. Also the replacement value in the fixed version wasn't the standard y value either.

Ralf-Philipp Weinmann was looking at the code itself and found:

That means that the changed value is an x coordinate and that the code was calculating the y value from it given the curve equation. Thus it would only need the x values and the points would always be on the correct curve. So perhaps it's a public key for something?

Changing a public key could easily be a big backdoor, but recall that the result here is somehow passive decryption of VPN traffic. It's unclear how changing a public key could result in passive decryption.

Oh dear. To explain: “EC PRNG” suggests that the value might be a constant in an elliptic-curve based pseudo-random number generator. That could certainly explain how passive decryption of VPN traffic was possible because it brings up memories of Dual-EC. Dual-EC was an NSA effort to introduce a backdoored pseudo-random number generator (PRNG) that, given knowledge of a secret key, allowed an attacker to observe output from the RNG and then predict its future output. If an attacker can predict the output of the PRNG then they can know the keys that one or both sides of a VPN connection will choose and decrypt it. (For more details, see the research paper.)

Indeed, it quickly came to light that Juniper have a page where they say that the VPN devices in question here “do utilize Dual_EC_DRBG, but do not use the pre-defined points cited by NIST”. In short, they used a backdoored RNG but changed the locks. Then this attack might be explained by saying that someone broke in and changed the locks again.

We're not sure that's actually what happened, but it seems like a reasonable hypothesis at this point. If it's correct, this is fairly bananas. Dual-EC is not a reasonable RNG: it's massively larger, slower and more complex than standard RNGs. It's output isn't even very uniform. Huge compromises were made in its design in order to meet its primary objective: to be a NOBUS, passive backdoor. (“NOBUS” is an intelligence community term for “nobody but us”, i.e. other parties shouldn't be able to use the backdoor.) Why would it be used in ScreenOS in the first place?

Again, assuming this hypothesis is correct then, if it wasn't the NSA who did this, we have a case where a US government backdoor effort (Dual-EC) laid the groundwork for someone else to attack US interests. Certainly this attack would be a lot easier given the presence of a backdoor-friendly RNG already in place. And I've not even discussed the SSH backdoor which, as Wired notes, could have been the work of a different group entirely. That backdoor certainly isn't NOBUS—Fox-IT claim to have found the backdoor password in six hours.


We recently switched Google's two billion line repository over to BoringSSL, our fork of OpenSSL. This means that BoringSSL is now powering Chromium (on nearly all platforms), Android M and Google's production services. For the first time, the majority of Google's products are sharing a single TLS stack and making changes no longer involves several days of work juggling patch files across multiple repositories.

This is a big positive for Google and I'm going to document some of the changes that we've made in BoringSSL in this post. I am not saying that people should be ditching OpenSSL and switching to BoringSSL. For Linux distributions that doesn't even make sense because we've removed too much for many applications to run unaltered and, without linker trickery, it's not possible to have both OpenSSL and BoringSSL in the same process because their symbols will collide. Even if you're in the position of shipping your own TLS stack with your code, you should still heed the warnings in the README well.

OpenSSL have considerably improved their processes since last April, which is great and important because huge swathes of the Internet will continue to depend on it. BoringSSL started before those changes but, even taking them into consideration, I'm still happy with my decision to fork. (But note that Google employs OpenSSL team members Emilia Käsper, Bodo Möller and Ben Laurie and contributes monetarily via the Core Infrastructure Initiative, so we haven't dropped our support of OpenSSL as a project.)

With that in mind, I'm going to mention some of the cleanups that we've done in BoringSSL from the lowest level, upwards. While most people should continue to use OpenSSL, there are lots of developers outside of Google who work on Chromium and Android and thus this document shouldn't be internal to Google. This post may seem critical of OpenSSL, but remember that many of these changes are possible because we only have to worry about Google's needs—we have an order of magnitude fewer platforms and configurations to support than OpenSSL and we don't keep any ABI compatibility. We also have the superpower of being able to change, where needed, the code that calls BoringSSL, so you can't really compare the two.

The “we”, above, is primarily myself and my colleagues David Benjamin and Matt Braithwaite. But BoringSSL is open source and Brian Smith has clocked up 55 patches and we've also had contributions from Opera and CloudFlare. (Brian's number would be higher if I had had more time to review his pending changes in the past couple of weeks).


Generally when people say “forking” they mean that they took a copy of the code and started landing patches independently of the original source. That's not what we did with BoringSSL. Rather than start with a copy, I started with an empty directory and went through OpenSSL function-by-function, reformatting, cleaning up (sometimes discarding) and documenting each one. So BoringSSL headers and sources look like this rather than this. The comments in BoringSSL headers can be extracted by a tool to produce documentation of a sort. (Although it could do with a make-over.)

(Clang's formatting tool and its Vim integration are very helpful! It's been the biggest improvement in my code-editing experience in many years.)

For much of the code, lengths were converted from ints to size_ts and functions that returned one, zero or minus one were converted to just returning one or zero. (Not handling a minus one return value is an easy and dangerous mistake.)

I didn't always get everything right: sometimes I discarded a function that we later found we actually needed or I changed something that, on balance, wasn't worth the changes required in other code. Where possible, code that we've needed to bring back has gone into a separate section called “decrepit” which isn't built in Chromium or Android.

But large amounts of OpenSSL could simply be discarded given our more limited scope. All the following were simply never copied into the main BoringSSL: Blowfish, Camllia, CMS, compression, the ENGINE code, IDEA, JPAKE, Kerberos, MD2, MDC2, OCSP, PKCS#7, RC5, RIPE-MD, SEED, SRP, timestamping and Whirlpool. The OpenSSL that we started from has about 468,000 lines of code but, today, even with the things that we've added (including tests) BoringSSL is just 200,000. Even projects that were using OpenSSL's OPENSSL_NO_x defines to exclude functionality at compile time have seen binaries sizes drop by 300KB when switching to BoringSSL.

Some important bits of OpenSSL are too large to bite off all at once, however. The SSL, ASN.1 and X.509 code were “forked” in the traditional sense: they were copied with minimal changes and improved incrementally. (Or, in the case of ASN.1 and X.509, left alone until they could be replaced completely.)

The lowest-levels

OpenSSL has a confusing number of initialisation functions. Code that uses OpenSSL generally takes a shotgun approach to calling some subset of OpenSSL_­add_­all_­algorithms, SSL_­library_­init, ERR_­load_­crypto_­strings and the deprecated SSLeay aliases of the same. BoringSSL doesn't need any of them; everything works immediately and the errors don't print out funny just because you forgot to load the error strings. If, like Chromium, you care about avoiding static initialisation (because every disk seek to load pages of code delays displaying the window at startup) then you can build with BORINGSSL_­NO_­STATIC_­INITIALIZER and initialise the library when you need with CRYPTO_­library_­init. But the vast majority of code just wants to avoid having to think about it. In the future, we would like to move to an automatic lazy-init which would solve even Chromium's needs.

OpenSSL and BoringSSL are often built into shared libraries, but OpenSSL doesn't have any visibility annotations. By default symbols are not hidden and ELF requires that any non-hidden symbols can be interposed. So if you look at in a Linux distribution you'll see lots of internal functions polluting the dynamic symbol table and calls to those functions from within the library have to indirect via the PLT. BoringSSL builds with hidden visibility by default so calls to internal functions are direct and only functions marked OPENSSL_­EXPORT are included in the dynamic symbol table.

Multi-threaded code is common these days but OpenSSL requires that you install callbacks to lock and unlock a conceptual array of locks. This trips up people who now take thread-safety to be a given, and can also mean that contention profiling shows a large, opaque amount of contention in the locking callback with no hint as to the real source. BoringSSL has a native concept of locks so is thread-safe by default. It also has “once” objects, atomic reference counting and thread-local storage, which eliminates much of the need for locking in the first place.


OpenSSL has a fairly unique method of handling errors: it pushes errors onto a per-thread queue as the stack unwinds. This means that OpenSSL errors can generally give you something like a stack trace that you might expect from gdb or a Python exception, which is definitely helpful in some cases. For contrast, NSS (Mozilla's crypto library) uses a more traditional, errno-like system of error codes. Debugging an NSS error involves looking up the numeric error code and then grepping the source code to find all the places where that error code can be set and figuring out which triggered this time.

However, this single error-code system is better for programmatic use. Code that tries to do something with OpenSSL errors (other than dumping them for human debugging) tends to look only at the first (i.e. deepest) error on the queue and tries to match on the reason or even function code. Thus changing the name of even internal functions could break calling code because these names were implicitly exported by the error system. Adding errors could also break code because now a different error could be first in the queue. Lastly, forgetting to clear the error queue after a failed function is very easy to do and thus endemic.

So BoringSSL no longer saves functions in the error queue: they all appear as OPENSSL_­internal, which saved about 15KB of binary size alone. As a bonus, we no longer need to run a script every time we add a new function. The file name and line number is still saved but, thankfully, I've never seen code try to match line numbers from the error queue. Trying to match on reason codes is still problematic, but we're living with it for now. We also have no good answer for forgetting to clear the error queue. It's possible that we'll change things in the future to automatically clear the error queue when calling most functions as, now that we're using thread-local storage, that'll no longer cause servers to burst into a flaming ball of lock contention. But we've not done that yet.

Parsing and serialisation

OpenSSL's parsing and serialisation involves a lot of incrementing pointers with single-letter names. BoringSSL drags this firmly into the 1990's with functions that automatically check bounds for parsing and functions that automatically resize buffers for serialisation. This code also handles parsing and serialising ASN.1 in an imperative fashion and we're slowly switching over to these functions because the OpenSSL ASN.1 code is just too complicated for us.

But I should note that OpenSSL's master branch now uses some similar parsing functions for parsing TLS structures at least. I've no idea whether that was inspired by BoringSSL, but it's great to see.

Random number generation

Random number generation in OpenSSL suffers because entropy used to be really difficult. There were entropy files on disk that applications would read and write, timestamps and PIDs would be mixed into entropy pools and applications would try other tricks to gather entropy and mix it into the pool. That has all made OpenSSL complicated.

BoringSSL just uses urandom—it's the right answer. (Although we'll probably do it via getrandom rather than /dev/urandom in the future.) There are no return values that you can forget to check: if anything goes wrong, it crashes the address space.

For the vast majority of code, that's all that you need to know, although there are some concessions to performance in the details:

TLS servers that are pushing lots of AES-CBC need the RNG to be really fast because each record needs a random IV. Because of this, if BoringSSL detects that the machine supports Intel's RDRAND instruction, it'll read a seed from urandom, expand it with ChaCha20 and XOR entropy from RDRAND. The seed is thread-local and refreshed every 1024 calls or 1MB output, whichever happens first.

Authenticated Encryption

Handing people a block cipher and hash function and expecting them to figure out the rest does not work. Authenticated Encryption is much closer to being reasonable and BoringSSL promotes it where possible. One very pleasing BoringSSL tale is that I handed that header file to a non-crypto developer and they produced secure code, first time. That would not have happened had I pointed them at EVP_CIPHER.

There is more to be done here as I've talked about before: we need nonce-misuse-resistant primitives and solutions for large files but what we have now is a significant improvement and the foundations for that future work are now in place.


As I mentioned, the SSL/TLS code wasn't reworked function-by-function like most of BoringSSL. It was copied whole and incrementally improved, predominantly by David Benjamin. I'm really happy with what he's managed to do with it.

At the small scale, most of the parsing and serialisation is now using the safe functions that I covered above. (Changes to convert most of the remaining pointer-juggling code are in my review queue.) TLS extensions are now a bit saner and no longer handled with huge switch statements. Support for SSLv2, DSS, SRP and Kerberos has all been dropped. The header file actually has comments.

Some important, small scale cleanups are less obvious. The large number of “functions” that were actually macros around ctrl functions (that bypassed the type system) are now real functions. In order to get TLS 1.0–1.2 you no longer use the ridiculously named SSLv23_method and then disable SSLv2 and SSLv3 by setting options on the SSL_CTX, rather you use TLS_method and control the versions by setting a minimum and maximum version.

There is lots more that I could mention like that.

At the larger scale, the buffer handling code has been substantially improved and the TLS code now does symmetric crypto using the AEAD interface, which cleanly partitions concerns that previously leaked all over the SSL code. We've also rewritten the version negotiation code so it no longer preprocesses the ClientHello and fiddles with method tables to use the correct version. This avoids some duplicated code and session resumption bugs and OpenSSL has since done a similar rewrite for 1.1.0. To solve a particular problem for Chrome, we've added some support for asynchronous private key operations so that slow smartcards don't block the network thread. Much of the DTLS logic has also been rewritten or pruned.

Perhaps most importantly, the state machine is much reduced. Renegotiation has been dropped except for the case of a TLS client handling renegotiation from a server while the application data flow has stopped, and even that is disabled by default. The DTLS code (a source of many bugs) is much saner in light of this.


OpenSSL has always had decent test coverage of lower-level parts like hash functions and ciphers, but testing of the more complex SSL/TLS code has been lacking. Testing that code is harder because you need to be able to produce sufficiently correct handshakes to get close to its edge cases, but you don't want to litter your real code with dozens of options for producing incorrect outputs in order to hit them. In BoringSSL, we've solved this by using a copy of Go's TLS stack for testing and we've littered it with such options. Our tests also stress asynchronous resume points across a range of handshakes. We wrote partial DTLS support in Go to test DTLS-only edge cases like reassembly, replay and retransmission. Along the way, we even discovered one of OpenSSL's old bug workarounds didn't work, allowing both projects to shed some code.

In C, any malloc call may fail. OpenSSL attempts to handle this, but such code is error-prone and rarely tested. It's best to use a malloc which crashes on failure, but for the benefit of consumers who can't, we have a "malloc test" mode. This runs all tests repeatedly, causing each successive allocation to fail, looking for crashes.

We now have 1,139 TLS tests which gives us 70% coverage of the TLS code—still better than any other TLS library that we've used.

The future

Now that we've done the task of aligning Google around BoringSSL, we'll hopefully be able to turn a little bit more attention to some feature work. Support for the IETF-approved ChaCha20-Poly1305 is coming soon. (Brian Smith has a change waiting for me.) Curve25519 and Ed25519 support are likely too. Next year, we will probably start on TLS 1.3 support.

But more cleanups are probably more important. The big one is the elimination of the ASN.1 and X.509 code in many cases. If you recall, we imported that code whole without cleanups and it hasn't been touched since. We've been incrementally replacing uses of the ASN.1 code with the new CBS and CBB functions but X.509 remains as a substantial user. We're not going to be able to drop that code completely because too much expects the X.509 functions to be available for reading and writing certificates, but we can make it so that the rest of the code doesn't depend on it. Then we can put it in a separate library and drop in a new certificate verification library that some of my Chromium colleagues are writing. Most users of BoringSSL will then, transparently, end up using the new library.

In the SSL code, the SSL object itself is a mess. We need to partition state that's really needed for the whole connection from state that can be thrown away after the handshake from state that can be optionally discarded after the handshake. That will save memory in servers as well as improving the clarity of the code. Since we don't have ABI compatibility, we can also reorder the structs to pack them better.

Lastly, we need to make fuzzing part of our process. Michał Zalewski's AFL has substantially improved the state of fuzzing but, whether we're using AFL or LibFuzzer, it's still a one-off for us. It should be much more like our CI builders. So should running clang-analyzer.

(David Benjamin contributed to this post.)

The ICANN Public Comments on WHOIS Privacy

ICANN is currently considering a proposal that “domains used for online financial transactions for commercial purpose should be ineligible for [WHOIS] privacy and proxy registrations” [PDF]. Given the vagueness around what would count as “commercial purpose” (tip jars? advertising? promoting your Kickstarter?) and concerns that some commercial sites are for small, home-run businesses, quite a lot of people are grumpy about this.

ICANN has a public comment period on this document until July 7th and what's interesting is that the comments (those that were emailed at least) are all in a mailing list archive. When you submit to the comment address ( you receive a confirmation email with a link that needs to be followed and quite a clear statement that the comments are public, so I think that this is deliberate.

I was curious what the comments box on this sort of topic is full of so did a quick analysis. The comment period doesn't close until July 7th so obviously I'm missing a couple of days worth of responses, but it was a two month comment period so that doesn't seem too bad.

When I checked there were 11,017 messages and 9,648 (87.6%) of them were strongly based on the Respect Our Privacy form letter. Several hundred additional messages included wording from it so I think that campaign resulted in about 90% of messages. (And it's worth noting the the primary flow on that site is to call ICANN—of course, I've no data on the volume of phone calls that resulted.)

Another campaign site, Save Domain Privacy, has a petition and quite a few messages included its wording.

I classified all such messages as “against” and wrote a quick program to manually review the remaining messages that weren't trivially classifiable by string matching against those template messages.

  • Nine messages were so odd or confused that it's unclear what the writer believed.
  • Three messages were asking questions and not expressing an opinion.
  • Two messages were sufficiently equivocal that they didn't express a clear feeling in either direction.
  • One message was commenting on a different section of the document.
  • One message suggested that WHOIS privacy be available to all, but that it should have a significant (monetary) cost in order to discourage its use.

Many more messages were against and not based on either of the two template letters. That leaves 13 messages that expressed support for the proposal. (That's 0.12% for those who are counting, although I very much question how much meaning that number has.):

  • Three messages suggested that private WHOIS registration was contrary to the openness of the Internet.
  • Three messages believed that shutting down sites that infringe copyright, or sell counterfeit trademarked goods, was a compelling reason.
  • Two writers believed that it was compelling, in general, to have contact details for websites. One of who claimed to be a security researcher and wanted CERTs to have access to full WHOIS details.
  • Two messages suggested that it would hinder “cyber-bullies” and “pædophiles”, one of which described how hard it was to have a stalker's site shut down.
  • One author believed that being able to contact site owners in the event of a domain-name dispute was a compelling reason.
  • One message suggested that WHOIS privacy should be removed for all .com sites, but no others.
  • One commenter opined that the Internet is inherently hostile to privacy and thus those who want privacy should not register domains.

The comment period opened on May 5th, but between then and June 22nd there were only seven messages. However, the week of the 22nd brought 10,015 messages. The the week of the 29th brought 995 more. So I think it's clear that, without significant outside promotion of these topics, almost nobody would have noticed this proposal.

AEADs: getting better at symmetric cryptography

I gave a talk a couple of weeks ago at the Yahoo Unconference. The conference was at the end of a particually hard week for a bunch of reasons and I fear that the talk wasn't that great. (Afterwards I got home about 3pm and pretty much slept until the following morning.) This post is a, hopefully clearer, articulation of its contents.

I've been primarily working on getting Google products switched over to BoringSSL for a little over a year now. (Chromium is done on many platforms and AOSP switched recently.) This is neccessary work, but it doesn't exactly lend itself to talk material. So the talk was titled “Lessons Learnt from Questions”—the idea being that smart developers at Google often ask lots of cryptography questions, and from those questions one can tell what is unclear in existing documentation. Points that are not obvious to smart, non-experts are thus good topics to talk about because it's likely that lots of people are missing them.

I was aiming for an hour but, due to a misunderstanding in the weeks prior, I thought that I had to cut it down to 20 minutes, so I ditched all but one question:

“How do I encrypt lots of records given a per-user key?” —Anonymous developer

This question is about applying symmetric cryptography, which is the originally motivation of all of cryptography. You would have thought that we would have figured it out by now, but we totally haven't.

In the 1990's it was common for cryptographers to produce block ciphers and hash functions, and then developers would be tasked with composing them. In the talk I stepped through the design of, and attacks on, the SSLv3/TLS CBC construction as a demonstration that plausible-sounding design features can lead to disaster, but I'm going to omit that here because it worked better live. (Phillip Rogaway tried to warn about these issues when SSL, IPSec, etc were being developed, but sadly the right people didn't listen to him.)

So, for quite a long time now, cryptographers have understood that block ciphers and hash functions are too low-level an abstraction and that the constructions themselves need to be studied and standardised. The result is called authenticated encryption (AE), and let's ponder what it might look like.

Obviously an AE function must take some plaintext and, if it's symmetric encryption, it must take a key. We also assume that its going to take care of both confidentiality and authenticity for us, so not only will an attacker not be able to tell the plaintext from the ciphertext but, if the ciphertext is altered in any way, the decryption will fail cleanly. So let's experiment with a hypothetical SSH-like protocol that's trying to protect a series of key presses:

AE(key, plaintext) → ciphertext

AE(key, ‘h’) → α
AE(key, ‘e’) → β
AE(key, ‘l’) → γ
AE(key, ‘l’) → γ
AE(key, ‘o’) → δ

Oh no! The two l's in “hello” turned into the same ciphertext, so an attacker can see patterns in the input reflected in the ciphertext. That's pretty terrible; clearly we need our AE function to map the same plaintext to different ciphertexts. There's two ways that can happen: either the function is non-deterministic (i.e. it reads entropy internally as a hidden input), or we pass in some varying argument. Explicit is better than implicit so we choose the latter and add an argument called the nonce:

AE(key, plaintext, nonce) → ciphertext

AE(key, ‘h’, n0) → (α, n0)
AE(key, ‘e’, n1) → (β, n1)
AE(key, ‘l’, n2) → (γ, n2)
AE(key, ‘l’, n3) → (ɛ, n3)
AE(key, ‘o’, n4) → (δ, n4)

We assume that n0…4 are all distinct and thus the two l's now get different ciphertexts and patterns in the plaintext have been eliminated. Note that (for now) we need to include the nonce value with the ciphertext because it'll be needed when decrypting.

(As an aside: if you've done anything with cryptography before you've probably come across the acronym “IV”, for initialisation vector. Although not completely standard, I differentiate an IV and an nonce thus: an IV needs to be unpredictable while an nonce needs only to be distinct. The values 0, 1, 2, … are distinct, but they aren't unpredictable. For something like CBC mode, which needs an IV, using a counter would be unacceptable.)

We solved one problem but there's another thing that an evil attacker might do: reorder messages:

AE(key, ‘h’, n0) → (α, n0)(γ, n2)
AE(key, ‘e’, n1) → (β, n1)(δ, n4)
AE(key, ‘l’, n2) → (γ, n2)…untrusted network…(α, n0)
AE(key, ‘l’, n3) → (ɛ, n3)(ɛ, n3)
AE(key, ‘o’, n4) → (δ, n4)(β, n1)

All the ciphertexts on the right are perfectly valid and coupled with a valid nonce, but they end up decrypting to “lohle”—hardly what was intended.

There are two solutions to this. The first is to use an implicit nonce: don't transmit the nonce, just use a counter for each direction. Now, if a message is out of order, the receiver will attempt to decrypt it with the wrong nonce and it'll fail to decrypt. That works perfectly well for transport protocols like SSH and TLS because it's easy to keep a counter in those situations, but sometimes the problem isn't quite so synchronous. For these cases, one could include a sequence number or other context in the plaintext and that works fine, but it's a waste of space if the receiver already knew the information and just wanted to confirm it.

This motivates a common extension to authenticated encryption called authenticated encryption with associated data (AEAD). The associated data is another argument, of arbitrary length, which must be equal at the encryption and decryption ends:

AEAD(key, plaintext, nonce, ad) → ciphertext

AEAD(key, ‘h’, n0, 0) → (α, n0)
AEAD(key, ‘e’, n1, 1) → (β, n1)
AEAD(key, ‘l’, n2, 2) → (γ, n2)
AEAD(key, ‘l’, n3, 3) → (ɛ, n3)
AEAD(key, ‘o’, n4, 4) → (δ, n4)

In this example the associated data happens to be a counter, but it could be anything. The associated data isn't included in the ciphertext, but it must be identical when decrypting otherwise the decryption will fail. In other protocols it could equally well be some context denoting, say, the first or last record. This is the second solution to the reordering problem.

The associated data might seem quite a lot like an nonce. They both need to be presented at encryption and decryption time and they must both match for decryption to succeed, so why do we bother having both? Mostly because the associated data is free-form: you can have as much or as little of it as you like and you can repeat it etc. The requirements of an nonce are much stricter, in fact if you remember only one thing from this please remember this, which I'm calling The Law of AEADs:

Thou shall never reuse the same (key, nonce) pair, for all time. (With high probability.)

So, if you generate a random key and use it to encrypt a single message, it's ok to set the nonce to zero. If you generate a random key and encrypt a series of messages you must ensure that the nonce never repeats. A counter is one way to do this, but if you need to store that counter on disk then stop: the chances of you screwing up and reusing an nonce value are way too high in designs like that.

It would be nice if reusing an nonce just meant that the same plaintext would result in the same ciphertext. That's the least bad thing that an AEAD could do in that situation. However the reality is significantly worse: common AEADs tend to lose confidentiality of messages with a repeated nonce and authenticity tends to collaspe completely for all messages. (I.e. it's very bad.) We like these common AEADs because they're fast, but you must have a solid story about nonce uniqueness. AEADs like AES-GCM and ChaCha20-Poly1305 fail in this fashion.

So what should you do when using a counter isn't trivially safe? One option is to generate the nonce at random and consider the probability of duplicates. AES-GCM takes a 96-bit nonce and NIST says that you can only encrypt 232 messages under a single key if using random nonces. This is because if you throw 232 balls at 296 buckets then you have roughly a 2-33 chance of getting two in the same bucket and NIST drew the line there. That probability might seem either high or low to you. It's pretty small in absolute terms and, unlike a work factor, an attacker can't spend resources against it, but it's a very long way from the safety margins that we usually use in cryptography. So there are also functions like crypto_­secretbox_­xsalsa20­poly1305 (from NaCl) that have a 192-bit nonce. The probabilities with random nonces are much more comforting at that size.

Another approach would be to use an AEAD that doesn't fail quite so catastrophically when an nonce is repeated. This is called an nonce-misuse resistant AEAD and we're hitting the boundary of developed practice now. The CAESAR competition has several nonce-misuse resistant entries in it, although it's not scheduled to conclude until 2018. Closer to established primitives, Gueron and Lindell propose an AES-GCM-SIV mode with a strong foundation (assuming that you trust AES-GCM) and good performance (assuming that you have good AES-GCM performance).

AEADs with large plaintexts

If you look at AEAD APIs you'll generally notice that they take the entire plaintext or ciphertext at once. In other words, they aren't “streaming” APIs. This is not a mistake, rather it's the streaming APIs that are generally a mistake.

I've complained about this in the past, so I'll be brief here. In short, old standards (e.g. PGP) will encrypt plaintexts of any length and then put an authenticator at the end. The likely outcome of such a design is that some implementations will stream out unauthenticated plaintext and only notice any problem when they get to the end of the ciphertext and try to check the authenticator. But by that time the damage has been done—it doesn't take much searching to find people suggesting piping the output of gpg to tar or even a shell.

So, if streaming is required, large plaintexts should be chunked and each chunk should easily fit into a “one-shot” API like those linked to above. That prevents unauthenticated plaintext from being processed. However, there is no standard for this; it's another case where we've hit the borders of existing practice. Implementations need to construct the nonces and associated data so that the first chunk is known to be the first, so that each chunk is in order and so that truncation is always detectable.

Streaming decryption always runs the risk of an attacker truncating the ciphertext however. Designs must always be able to detect truncation, but it's very easy to imagine that the rest of a system doesn't handle it well. (For example, see the Cookie Cutter attack against HTTPS.)

If streaming isn't essential then an all-or-nothing transform would be ideal. But, again, there's no substantial standards or practice around this. Hopefully in ten years or so there will be clear answers for this and for large-plaintext constructions with AEADs (AERO is a start). But, for now, even symmetric encryption isn't a “solved problem”.

Why not DANE in browsers

Thomas Ptacek laid out a number of arguments against DNSSEC recently (and in a follow up). We don't fully agree on everything, but it did prompt me to write why, even if you assume DNSSEC, DANE (the standard for speaking about the intersection of TLS and DNSSEC) is not a foregone conclusion in web browsers.

There are two ways that you might wish to use DANE in a web browser: either to block a certificate that would normally be considered valid, or to bless a certificate that would normally be rejected. The first, obviously, requires that DANE information always be obtained—if a lookup failure was ignored, a network attacker with a bad certificate would just simulate a lookup failure. But requiring that browsers always obtain DANE information (or a proof of absence) is nearly implausible:

Some years ago now, Chrome did an experiment where we would lookup a TXT record that we knew existed when we knew the Internet connection was working. At the time, some 4–5% of users couldn't lookup that record; we assume because the network wasn't transparent to non-standard DNS resource types. DANE records are going to be even more non-standard, are going to be larger and browsers would have to fetch lots of them because they'll need the DNSKEY/RRSIG chain up from the root. Even if DNSSEC record lookup worked flawlessly for everyone, we probably still wouldn't implement this aspect of DANE because each extra lookup is more latency and another chance for packet loss to cause an expensive timeout and retransmit.

Instead, for this we have HPKP, which is a memory-based pinning solution using HTTP headers. We also have pre-loaded pinning in Chrome for larger or more obvious targets. Thomas Ptacek seems bullish on pinning but I'm much more lukewarm. HPKP is quite complex and no doubt someone will write a “supercookies” story about it at some point, as they have for HSTS. Additionally, pinning is quite dangerous. Clients deciding that “pinning is good” have caused headaches at Google. It's also worth noting that CryptoCat has committed pinning-suicide in Chrome at at the moment due to their CA having switched intermediates between renewals. They're waiting for the release of Chrome 41 to recover.

But what about the other side of DANE: blessing certificates that would otherwise be considered untrusted? In this case, DNSSEC can be seen as something like another CA. The same problems with looking up DNSSEC records apply, but are much less painful when you only need to depend on the lookup for sites that are using DANE certificates. Indeed, Chrome even supported something very like DANE for a while. In that case the DNSSEC records were contained in the certificate to avoid the latency and complexity of looking them up in the client. (DNSSEC records contains signatures so need not be transported over the DNS protocol.)

But support for that was removed because it was a bunch of parsing code outside of the sandbox, wasn't really being used and it conflicted with two long-term plans for the health of the HTTPS ecosystem: eliminating 1024-bit RSA and Certificate Transparency. The conflicts are probably the most important reasons for not wanting to renew the experiment.

The effort to remove 1024-bit RSA from HTTPS has been going for years and is, perhaps, nearing completion now. (That noise that you can hear is my colleague, Ryan Sleevi, crying softly.). There are still some 1024-bit root certificates, but they are nearly gone from the Mozilla set. The amount of work involved is an order of magnitude greater than you expect because of the interactions of different X.509 validation libraries, intermediate chains and varying root stores on different platforms and versions.

DNSSEC, however, is littered with 1024-bit RSA. You literally can't avoid it because the root zone transits through a 1024-bit key. DNSSEC has them because of (I think) concerns about the size of responses and they are usually rotated every two or three months. The RFC suggests that 1024-bit RSA is good for “most zones” until 2022. Dan Bernstein's paper on Batch NFS deals well with the question of whether that's wise.

Next, Certificate Transparency is our effort to add strong, technical audits to the CA system by creating a trustworthy log of all valid certificates. CT logs only accept certificates from CA as an anti-spam measure but people can create DANE certificates for domains at will. This is far from an insurmountable problem, but it is a problem that would need to be solved and the CT team already have their hands full with the staged rollout of CT in Chrome.

The 1024-bit RSA problem isn't insurmountable either (although it's baked in much deeper), so it's possible that browsers might accept a DNSSEC signature chain in a certificate in the future, but it's a long way out.

The POODLE bites again

October's POODLE attack affected CBC-mode cipher suites in SSLv3 due to SSLv3's under-specification of the contents of the CBC padding bytes. Since SSLv3 didn't say what the padding bytes should be, implementations couldn't check them and that opened SSLv3 up to an oracle attack.

We're done pretty well at killing off SSLv3 in response to that. Chrome 39 (released Nov 18th) removed fallback to SSLv3 and Chrome 40 is scheduled to remove SSLv3 completely. Firefox 34 (released Dec 1st) has already removed SSLv3 support.

We're removing SSLv3 in favour of TLS because TLS fully specifies the contents of the padding bytes and thus stops the attack. However, TLS's padding is a subset of SSLv3's padding so, technically, you could use an SSLv3 decoding function with TLS and it would still work fine. It wouldn't check the padding bytes but that wouldn't cause any problems in normal operation. However, if an SSLv3 decoding function was used with TLS, then the POODLE attack would work, even against TLS connections.

This was noted by, at least, Brian Smith on the TLS list ([1][2]) and I was sufficiently cynical to assume that there were probably more instances of this than the old versions of NSS that Brian cited, and so wrote a scanner for the issue.

Unfortunately, I found a number of major sites that had this problem. At least one of whom I had good enough contacts at to quickly find that they used an F5 device to terminate connections. I contacted F5 on October 21st and they started working on a fix. Yngve Pettersen also independently found this issue and contacted me about it around this time.

F5 reported that some of the affected sites weren't customers of theirs, which meant that there was (at least) a second vendor with the same issue. After more digging, I found that some A10 devices also have this problem. I emailed a number of contacts at A10 on October 30th but sadly didn't get a reply from any of them. It wasn't until November 13th that I found the right person at A10 to deal with this.

F5 and A10 have posted patches for their products (F5's are here and A10's are here and they have an advisory here). I'm not completely sure that I've found every affected vendor but, now that this issue is public, any other affected products should quickly come to light. (Citrix devices have an odd behaviour in this area in that they'll accept padding bytes that are all zeros, but not random padding. That's unexpected but I can't make an attack out of it.)

(Update: since posting this, it appears that products from Fortinet, Cisco, IBM (WebSphere, Domino, Tivoli) and Juniper may also be affected.)

Ivan Ristić has added a test for this issue to his excellent scanner at SSLLabs. Affected sites will have their grade set to F and will report “This server is vulnerable to the POODLE attack against TLS servers”.

This seems like a good moment to reiterate that everything less than TLS 1.2 with an AEAD cipher suite is cryptographically broken. An IETF draft to prohibit RC4 is in Last Call at the moment but it would be wrong to believe that RC4 is uniquely bad. While RC4 is fundamentally broken and no implementation can save it, attacks against MtE-CBC ciphers have repeatedly been shown to be far more practical. Thankfully, TLS 1.2 support is about to hit 50% at the time of writing.

POODLE attacks on SSLv3

My colleague, Bodo Möller, in collaboration with Thai Duong and Krzysztof Kotowicz (also Googlers), just posted details about a padding oracle attack against CBC-mode ciphers in SSLv3. This attack, called POODLE, is similar to the BEAST attack and also allows a network attacker to extract the plaintext of targeted parts of an SSL connection, usually cookie data. Unlike the BEAST attack, it doesn't require such extensive control of the format of the plaintext and thus is more practical.

Fundamentally, the design flaw in SSL/TLS that allows this is the same as with Lucky13 and Vaudenay's two attacks: SSL got encryption and authentication the wrong way around – it authenticates before encrypting.

Consider the following plaintext HTTP request, which I've broken into 8-byte blocks (as in 3DES), but the same idea works for 16-byte blocks (as in AES) just as well:


The last block contains seven bytes of padding (represented as •) and the final byte is the length of the padding. (And I've used a fictional, 8-byte MAC, but that doesn't matter.) Before transmission, those blocks would be encrypted with 3DES or AES in CBC mode to provide confidentiality.

Now consider how CBC decryption works at the receiver, thanks to this public domain diagram from Wikipedia:

image/svg+xml block cipherdecryption Key Plaintext Ciphertext Initialization Vector (IV) block cipherdecryption Key Plaintext Ciphertext block cipherdecryption Key Plaintext Ciphertext

An attacker can't see the plaintext contents like we can in the diagram, above. They only see the CBC-encrypted ciphertext blocks. But what happens if the attacker duplicates the block containing the cookie data and overwrites the last block with it? When the receiver decrypts the last block it XORs in the contents of the previous ciphertext (which the attacker knows) and checks the authenticity of the data. Critically, since SSLv3 doesn't specify the contents of the padding (•) bytes, the receiver cannot check them. Thus the record will be accepted if, and only if, the last byte ends up as a seven.

An attacker can run Javascript in any origin in a browser and cause the browser to make requests (with cookies) to any other origin. If the attacker does this block duplication trick they have a 1-in-256 chance that the receiver won't reject the record and close the connection. If the receiver accepts the record then the attacker knows that the decryption of the cookie block that they duplicated, XORed with the ciphertext of the previous block, equals seven. Thus they've found the last byte of the cookie using (on average) 256 requests.

Now the attacker can increase the length of the requested URL and decrease the length of something after the cookies and make the request look like this:


Note that the Cookie data has been shifted so that the second to last byte of the data is now at the end of the block. So, with another 256 requests the attacker can expect to have decrypted that byte and so on.

Thus, with an average of 256×n requests and a little control of the layout of those requests, an attacker can decrypt n bytes of plaintext from SSLv3. The critical part of this attack is that SSLv3 doesn't specify the contents of padding bytes (the •s). TLS does and so this attack doesn't work because the attacker only has a 2-64 or 2-128 chance of a duplicated block being a valid padding block.

This should be an academic curiosity because SSLv3 was deprecated very nearly 15 years ago. However, the Internet is vast and full of bugs. The vastness means that a non-trivial number of SSLv3 servers still exist and workarounds for the bugs mean that an attacker can convince a browser to use SSLv3 even when both the browser and server support a more recent version. Thus, this attack is widely applicable.

SSL/TLS has a perfectly good version negotiation mechanism that should prevent a browser and server that support a modern TLS version from using anything less. However, because some servers are buggy and don't implement version negotiation correctly, browsers break this mechanism by retrying connections with lesser SSL/TLS versions when TLS handshaking fails. These days we're more aware of the fact that fallback behaviour like this is a landmine for the future (as demonstrated today) but this TLS fallback behaviour was enshrined long ago and we're stuck with it. It means that, by injecting some trivial errors on the network, an attacker can cause a browser to speak SSLv3 to any server and then run the above attack.

What's to be done?

It's no revelation that this fallback behaviour is bad news. In fact, Bodo and I have a draft out for a mechanism to add a second, less bug-rusted mechanism to prevent it called TLS_FALLBACK_SCSV. Chrome and Google have implemented it since February this year and so connections from Chrome to Google are already protected. We are urging server operators and other browsers to implement it too. It doesn't just protect against this specific attack, it solves the fallback problem in general. For example, it stops attackers from downgrading TLS 1.2 to 1.1 and 1.0 and thus removing modern, AEAD ciphers from a connection. (Remember, everything less than TLS 1.2 with an AEAD mode is cryptographically broken.) There should soon be an updated OpenSSL version that supports it.

Even with TLS_FALLBACK_SCSV, there will be a long tail of servers that don't update. Because of that, I've just landed a patch on Chrome trunk that disables fallback to SSLv3 for all servers. This change will break things and so we don't feel that we can jump it straight to Chrome's stable channel. But we do hope to get it there within weeks and so buggy servers that currently function only because of SSLv3 fallback will need to be updated.

Chrome users that just want to get rid of SSLv3 can use the command line flag --ssl-version-min=tls1 to do so. (We used to have an entry in the preferences for that but people thought that “SSL 3.0” was a higher version than “TLS 1.0” and would mistakenly disable the latter.)

In Firefox you can go into about:config and set security.tls.version.min to 1. I expect that other browser vendors will publish similar instructions over the coming days.

As a server operator, it is possible to stop this attack by disabling SSLv3, or by disabling CBC-mode ciphers in SSLv3. However, the compatibility impact of this is unclear. Certainly, disabling SSLv3 completely is likely to break IE6. Some sites will be happy doing that, some will not.

A little further down the line, perhaps in about three months, we hope to disable SSLv3 completely. The changes that I've just landed in Chrome only disable fallback to SSLv3 – a server that correctly negotiates SSLv3 can still use it. Disabling SSLv3 completely will break even more than just disabling the fallback but SSLv3 is now completely broken with CBC-mode ciphers and the only other option is RC4, which is hardly that attractive. Any servers depending on SSLv3 are thus on notice that they need to address that now.

We hardened SSLv3 and TLS 1.0 against the BEAST attack with 1/n-1 record splitting and, based on an idea by Håvard Molland, it is possible to do something called anti-POODLE record splitting this time. I'll omit the details, but one can ensure that the last block in a CBC record contains at least fixed six bytes using only one split for AES and two for 3DES. With this, CBC is probably less bad than RC4. However, while anti-POODLE record splitting should be easy to deploy because it's valid according to the spec, so was 1/n-1 and deploying that was very painful. Thus there's a high risk that this would also cause compatibility problems. Therefore I'm not proceeding with anti-POODLE record splitting and concentrating on removing SSLv3 fallback instead. (There's also the possibility that an attacker could run the attack on the server to client direction. If we assume that both the client and the server get patched then we might as well assume that they are patched for TLS_FALLBACK_SCSV, which makes record splitting moot.)

PKCS#1 signature validation

On Wednesday, Chrome and Mozilla did coordinated updates to fix an RSA signature verification bug in NSS — the crypto library that handles SSL in Firefox and (currently) Chrome on most platforms. The updates should be well spread now and the bug has been detailed on Reddit, so I think it's safe to talk about.

(Hilariously, on the same day, bash turned out to have a little security issue and so hardly anyone noticed the NSS one!)

The NSS bug is another variant of Bleichenbacher's 2006 attack on RSA signature validation. To recap: an RSA signature is roughly the cube root of the message hash modulo the RSA modulus. Verification involves cubing the signature and checking that the result is the hash that you expected. Cubing modulo an RSA modulus is easy but finding a cube root is infeasible.

There's a little more to it because one needs to eliminate some properties of the RSA operation by formatting the hash that you're going to sign — called padding.

The standard for RSA signing and encryption is PKCS#1 version 1.5. The PKCS series of standards are amazing. Although I've no doubt that the people writing them were doing their best, it was a long time ago and mistakes were made. In a modern light, they are all completely terrible. If you wanted something that was plausible enough to be widely implemented but complex enough to ensure that cryptography would forever be hamstrung by implementation bugs, you would be hard pressed to do better. If you can find some implementers, just say "PKCS#11!" or "PKCS#12!" as punchlines. You'll get a good laugh, although possibly also a few panic attacks.

(PKCS#1 version 2 is much better, but only because they took it from Bellare and Rogaway and slapped the PKCS#1 title on it.)

PKCS#1 version 1.5 wanted to include an identifier for the hash function that's being used, inside the signature. This is a fine idea, but they did it by encoding the algorithm and hash value with ASN.1. This caused many implementations to include the complexity of an ASN.1 parser inside signature validation and that let the bugs in.

Bleichenbacher's original attack was based on the observation that the ASN.1 parsers, in many cases, didn't reject extra trailing data. This is reasonable behaviour for a generic ASN.1 parser, but a disaster in signature verification. Because the parser could ignore so much of the signature, Bleichenbacher could arrange for a perfect cube to have a suitable prefix and thus calculate the cube root over the integers — ignoring the RSA modulus completely!

That was fixed, but there was another common bug: the ASN.1 structure used to identify the hash was a standard structure called AlgorithmIdentifier, which includes an optional parameter. Implementations were ignoring the parameter and that also introduced a vulnerability: arbitrary bytes could be inserted as a parameter and that was sufficient to forge signatures. (See section five of this paper for details.)

This week's vulnerability was similar. Antoine Delignat-Lavaud (who, by the way, has been doing stellar work along with the Prosecco team at INRIA: see Triple Handshake, Cookie Cutter, SPDY and virtual host attacks and miTLS) noticed that the check on the contents of the parameters in NSS wasn't very strict — it was only checking that the length was at most two bytes. This is because, due to complexity, there wasn't universal agreement on what the the parameter should be. The ASN.1 for the parameter is an ANY type (with the concrete type depending on a preceding object id) but also optional. So, when not needed, should it be an ASN.1 NULL (which is two bytes long), or should it be omitted completely? The answer is the former, but it was underspecified for a long time.

Once Antoine had put a spotlight on the code, Brian Smith noticed something worse: an integer overflow in the ASN.1 parser. ASN.1 (in DER form at least, because, of course, ASN.1 has multiple defined encodings) has variable-length lengths, and NSS was using a generic ASN.1 parser which didn't check for overflow. So you could specify that a length was arbitrarily long and the parser would do something similar to:

unsigned length = 0;
for (i = 0; i < length_of_length; i++) {
  length <<= 8;
  length |= length[i];

Thus, as long as the last 4 or 8 bytes (depending on whether the system was 32 or 64 bit) encoded the correct length, the bytes before that would be ignored. That allows arbitrary bytes in the signature again and the attack from section 5 of the previous paper can be still be used to make forged signatures.

(Intel's ATR group also reported the same issue to Mozilla a little bit later. Bugs often seem to work like scientific discovery.)

The moral of the story

The moral here is that an ASN.1 parser should never have been put somewhere so sensitive; parsing is dangerous. It was a mistake to have defined PKCS#1 that way, but it can be ameliorated: rather than parsing, generate the expected signature contents and compare it against the plaintext. Go has always done this. I believe that SSH does this. Bouncy Castle reportedly does this. BoringSSL does this because I changed it from OpenSSL (prior to learning about the NSS issue — it's just a good idea). NSS does it now, although they're still worried about the omitted parameter vs NULL parameter confusion so they serialise two ASN.1 outputs and compare against both.

I generated a number of tests of different possible points of flexibility in signature parsing so that other libraries can be tested. For example, you can test OpenSSL (which still uses an ASN.1 parser) against them like this:

for cert in $(ls *.crt | grep -v root2048.crt); do
  openssl verify -CAfile root2048.crt $cert 2>> /dev/null | grep -q "$cert: OK"
  if [ $? -eq 0 ] ; then
    echo "$cert accepted"

The file control.crt should always be accepted. The file missingnull.crt will be accepted if you allow the parameter to be omitted. (I recommend against that. BoringSSL doesn't allow it and we'll see whether we can make that stick.) Nothing else should be accepted.

Sadly, OpenSSL 1.0.1 also accepts highvaltag.crt (ordinary ASN.1 tags in high-value form) and indeflen.crt (BER indefinite lengths rather than DER). The OpenSSL development branch also accepts bernull.crt (superfluous zero bytes in a length). To be clear, there is no known vulnerability here, but it's unwelcome flexibility in something that should be rigid. (I let OpenSSL team know on Wednesday and that they're welcome to use the code from BoringSSL.)

Remember, we use PKCS in browsers because we have to deal with the world as we find it, not as we want it. If you are not so encumbered, consider something simpler.

A couple more formal systems

After my last post on formal analysis of C code, people pointed out several other options to me and I wanted to do a follow up because the conclusion is a little more positive now.

Firstly, I noted last time that Z3 was the best, Why3-compatible SMT solver for my problem but that it came with a non-commercial license. Craig Stuntz pointed out that Microsoft sell a commercial license for $10K. Sure, that's quite a lot for a hobbyist project, but it's easy for a corporation and a lot easier than having to negotiate specially.

The first additional thing that I've looked into is AutoCorres. I mentioned last time that SeL4's refinement proof to C code involved a C parser in Isabelle that output structures in the Simpl framework and that Simpl was pretty intimidating.

AutoCorres is a tool for verifiably simplifing the Simpl structures and thus making proofs easier. Let's repeat the example from last time and show how AutoCorres helps. Here's the test C function:

int add(int a, int b) {
  return a+b;

And here's the raw result from the C parser, exactly as I showed last time:

test_global_addresses.add_body ≡
  Guard SignedArithmetic ⦃-2147483648 ≤ sint ´a + sint ´b ∧ sint ´a + sint ´b ≤ 2147483647⦄
   (creturn global_exn_var_'_update ret__int_'_update (λs. a_' s + b_' s));;
  Guard DontReach {} SKIP

And here's what AutoCorres makes of it:

add' ?a ?b ≡
  DO oguard (λs. INT_MIN ≤ ?a + ?b);
     oguard (λs. ?a + ?b ≤ INT_MAX);
     oreturn (?a + ?b)

That's very impressive! That's using a Maybe monad (using the Haskell name) in Isabelle to handle the possible overflow. Obviously that's a lot easier to deal with than the raw Simpl. Here's the field addition function that I was using for the other examples in the previous post:

void add(int *out, const int *x, const int *y) {
  unsigned int i;
  unsigned int size = 10;
  for (i = 0; i < size; i++) {
    out[i] = x[i] + y[i];

And here it is after AutoCorres:

add' ?out ?x ?y ≡
do i ← whileLoop (λi s. i < 0xA)
          (λi. do guard (λs. is_valid_w32 s (ptr_coerce (?out +⇩p uint i)));
                  guard (λs. is_valid_w32 s (ptr_coerce (?x +⇩p uint i)));
                  guard (λs. is_valid_w32 s (ptr_coerce (?y +⇩p uint i)));
                  guard (λs. INT_MIN ≤ sint (ucast s[ptr_coerce (?x +⇩p uint i)]) + sint (ucast s[ptr_coerce (?y +⇩p uint i)]));
                  guard (λs. sint (ucast s[ptr_coerce (?x +⇩p uint i)]) + sint (ucast s[ptr_coerce (?y +⇩p uint i)]) ≤ INT_MAX);
                  modify (λs. s[ptr_coerce (?out +⇩p uint i) := s[ptr_coerce (?x +⇩p uint i)] + s[ptr_coerce (?y +⇩p uint i)]]);
                  return (i + 1)

While the AutoCorres output looks very clean, there isn't any documentation at the moment which means that it's a little hard to actually use. (There are a couple of papers on the home page that are well worth reading - but I didn't find anything like a user guide.) There are some examples from which it's possible to make some guesses, but I wasn't able to get very far in a proof. Also, since AutoCorres builds on Simpl and C parser, it has the same limitations - most notably that the address of local variables can't be taken.

None the less, AutoCorres holds a lot of promise.

Next, is VST from Andrew Appel and others (yes, that Appel). VST is a framework for proving the behaviour of C code with Coq by using one of the intermediate languages from CompCert. Coq (which I'm pronouncing as “coke”) is a proof system like Isabelle and CompCert is a formally verified compiler. That means that it's possible to have a chain of formal verification from C code all the way to the semantics of the machine language! Additionally, the C parser isn't correctness critical because VST proves properties using the same parse tree as the verified compiler. That's an extremely strong verification standard.

CompCert is a mixture of GPL and “non-commercial” licensed code. It appears that everything needed for VST is licensed as GPL or another free license. So, one can also use VST to verify C code and then use a different compiler to build it if the non-commercial restriction is a problem. That moves the CompCert C parser and the other compiler into the trusted set, but so it goes.

The VST logic also appears to be very capable. It doesn't have a problem with pointers to local variables and should even have the capability to support reasoning about concurrent programs. The restrictions that it does have are local changes: only one memory load per line (at the moment) or side effect per statement and void functions need an explicit return at the end. (Unfortunately, one doesn't get an error from violating these requirements - rather the proof cannot be completed.)

VST thankfully, has brief documentation online which is expanded upon in the form of a book ($15 as a DRMed download). The book contains a very abstract articulation of the underlying logic which, to be honest, defeated me initially. I'd recommend skipping chapters 2 through 21 on a first reading and coming back to them later. They are useful, but they're also a lot clearer after having played with the software for a while.

Here's the same example function, translated into VST style. I switched back from having a special output array because that was something that I changed to make the SPARK verification easier but it was never part of the original code and VST doesn't have problem with overlapping the input and output. I've also switched from a for loop to a while loop because the documentation and examples for the latter are better. Otherwise, the changes are just to limit each statement to a single memory access and to include the final return.

void add(int *a, int *b) {
	int t1, t2;
	int i = 0;
	while (i < 10) {
		t1 = a[i];
		t2 = b[i];
		a[i] = t1 + t2;

We can use clightgen from CompCert to parse that into the intermediate language that CompCert and VST use. (A word to the wise - if you want to install VST, use exactly Coq 8.4p2, not the latest version which is 8.4p4.) Then we can use Coq to prove its behaviour.

Compared to Isabelle, Coq's proofs aren't as clear as the Isar style proofs in Isabelle (although the terseness is more useful when proofs are this complex) and neither Proof General nor CoqIDE are as nice as Isabelle's jedit mode. But it's not otherwise dramatically different.

Here's a specification of add in VST/Coq's language:

Definition bound_int (v : val) (b : Z) :=
  match v with
  | Vint i => -b < (Int.signed i) < b
  | _ => False

Definition partial_add (i: Z) (l: Z -> val) (r: Z -> val) (j: Z) :=
  if zlt j i then Val.add (l j) (r j) else (l j).

Definition add_spec :=
  DECLARE _add
  WITH a0 : val, b0 : val, sh : share, orig_a : Z -> val, orig_b : Z -> val
  PRE [_a OF (tptr tint), _b OF (tptr tint)]
    PROP (writable_share sh;
          forall i, 0 <= i < 10 -> is_int (orig_a i);
          forall i, 0 <= i < 10 -> is_int (orig_b i);
          forall i, 0 <= i < 10 -> bound_int (orig_a i) 134217728;
          forall i, 0 <= i < 10 -> bound_int (orig_b i) 134217728)
    LOCAL (`(eq a0) (eval_id _a);
           `(eq b0) (eval_id _b);
           `isptr (eval_id _a);
           `isptr (eval_id _b))
    SEP (`(array_at tint sh orig_a 0 10 a0);
         `(array_at tint sh orig_b 0 10 b0))
  POST [ tvoid ]
    PROP ()
    LOCAL ()
    SEP (`(array_at tint sh (partial_add 10 orig_a orig_b) 0 10 a0);
         `(array_at tint sh orig_b 0 10 b0)).

The partial_add function implements a map that reflects the state of the a array at step i of the loop. I've written it like that so that it can also be used in the loop invariant.

And here's the proof: It is quite long, but at least half of that is because I've not used Coq before and I don't know what I'm doing. I wouldn't call it readable however.

Definition inv orig_a a0 orig_b b0 sh :=
  EX i:Z,
    (PROP (0 <= i < 11;
           isptr a0;
           isptr b0)
     LOCAL (`(eq a0) (eval_id _a);
            `(eq b0) (eval_id _b);
            `(eq (Vint (Int.repr i))) (eval_id _i))
   SEP (`(array_at tint sh (partial_add i orig_a orig_b) 0 10 a0);
        `(array_at tint sh orig_b 0 10 b0))).

Lemma mod_nop : forall (i : Z) (m : Z), 0 <= i < m -> m > 0 -> i mod m = i.
rewrite Zdiv.Zmod_eq.
apply Zdiv_small.
exact H.
rewrite H1.
exact H0.

Lemma body_sumarray: semax_body Vprog Gprog f_add add_spec.
  (inv orig_a a0 orig_b b0 sh)
     PROP ()
     LOCAL ()
     SEP (`(array_at tint sh (partial_add 10 orig_a orig_b) 0 10 a0);
          `(array_at tint sh orig_b 0 10 b0))).
apply exp_right with 0.
rewrite H7.
unfold partial_add.
assert(is_int (orig_a i)).
assert(is_int (orig_b i)).
exact H7.
apply prop_right.
assert(Vint (Int.add _id1 _id) = partial_add (i+1) orig_a orig_b (Int.signed (Int.repr i))).
Focus 2.
symmetry in H9.
apply H9.
unfold partial_add.
assert(Int.signed (Int.repr i) = i).
rewrite Int.signed_repr.
unfold Int.max_signed, Int.min_signed.
rewrite H9.
unfold Val.add.
rewrite Int.signed_repr_eq in H7.
rewrite mod_nop in H7.
if_tac in H7.
unfold partial_add in H7.
if_tac in H7.
rewrite Int.signed_repr_eq in H6.
rewrite mod_nop in H6.
if_tac in H6.
unfold Int.add, Int.unsigned, Int.intval, Int.repr.
assert(bound_int (orig_a i) 134217728).
unfold bound_int in H12.
symmetry in H7, H6.
rewrite H7.
rewrite H6.
assert(Int.modulus = 4294967296).
rewrite H13.
assert(Int.modulus = 4294967296).
assert(i < Int.half_modulus).
assert(Int.half_modulus = 2147483648).
rewrite H12.
rewrite H12 in H11.
assert(Int.modulus = 4294967296).
assert(Int.modulus = 4294967296).
rewrite H11.
unfold inv.
apply exp_right with (Zsucc i).
apply derives_refl'.
apply equal_f.
apply array_at_ext.
unfold upd.
rewrite H10.
unfold Z.succ.
unfold partial_add.

That's certainly quite a chunk! Practically you have to step through the proof in Proof General and see the state evolve to understand anything. Additionally, when in the proof, it would be very useful if subgoals has some sort of description like “show the loop invariant plus not the loop condition results in the loop post-condition” - it's very easy to get lost in the proof. But I do feel that, while it would be a lot of work, I could make progress with VST, while I don't (at the moment) feel with other systems. Prof. Appel recently did a formal verification of the SHA256 from OpenSSL.

However, there is no community around VST that I can find - no mailing list, wiki, etc. The Subversion repo is only accessible via HTTP - you can't clone it from what I can tell. I think I found a (completeness) bug in VST, but the only option would be to try emailing Prof. Appel. (I didn't; I'm probably wrong.)

Still, AutoCorres and especially VST leave me more optimistic than I was at the end of the last post!


I didn't have time to look at everything that deserved attention. Cryptol is one. Cryptol is a tool written in Haskell designed for the specification and implementation of cryptographic code and can call out to SMT solvers to show certain properties. From its internal language it can export implementations to (at least) C.

Next, Frama-C, which I mentioned last time in relation to the Jessie plugin, has other plugins. One that's useful is the value analysis, which can be properly used to eliminate NULL derefs, array overruns etc. In fact, Polar SSL has been fully validated in certain configurations using that. That's certainly very valuable, and you can do it without that multi-page Coq proof! Although such analysis wouldn't have caught the carry bug in Ed25519 nor the reduction bug in Donna, not all code is going to be suitable for formal analysis.

Microsoft also have lots of things. There's Dafny, VCC (Verifier for Concurrent C), Boogie, Havoc, Z3 and others! I didn't look at them because I don't have a Windows box to hand this week (although I see that some are exposed via a web interface too). I was thinking that maybe I'd have time when a Windows box was to hand but, realistically, probably not. So I've updated this post to mention them here. If you are Windows based, you will probably do well to pay attention to them first.

A shallow survey of formal methods for C code

Two interesting things in formally verified software happened recently. The big one was the release of SeL4 - a formally verified L4 microkernel. The second was much smaller, but closer to my usual scope: a paper which showed the correctness of sections of a couple of the assembly implementations of Curve25519.

Overflow and carry bugs are subtle and, although with 32-bit words you might hope to be able to do enough random testing to eliminate them, that's much less plausible with 64-bit implementations. The TweetNaCl paper mentions a bug that lived in one of the assembly implementations of ed25519 for years and I've sinned too:

When I wrote curve25519-donna (back when it was the only 64-bit implementation of Curve25519), I first wrote a 32-bit version which mirrored the implementation of the athlon assembly version. This was just to provide a reference for the 64-bit code, but it was included in the repository for education. It was never really supposed to be used, and wasn't constant-time, but I was aware that it was being used by some groups.

Many years later, someone noticed that it was missing a reduction in the final contraction. Values between 2255-19 and 2255 would be output when they should have been reduced. This turned out to be harmless (as best I can tell), but embarrassing none the less.

And Curve25519 is a very nice curve to implement! Take a look at the overflow considerations for this version of P-256. I hope that I got everything right there, but it's very complex. More formal verification of the software would be welcome!

SeL4 has a multi-stage verification using refinement. Refinement is the process of showing that two pieces of code behave identically where one version is more abstract. SeL4's proof refines from an abstract specification to a Haskell implementation to the C implementation. It also has a SAT-solver based proof that the ARMv6 assembly matches the C implementation.

The refinement proofs are done in Isabelle/HOL which, along with Coq, are the major proof systems. (There are several other systems including HOL Light, HOL4 and Mizar.) Proof systems assist in the construction of automated proofs using simple rules of inference and axioms. Although not used for software verification, Metamath is a good introduction to the idea. It clearly explains its axioms (which Isabelle and Coq are very bad at) and gives an idea for the scale of formal proofs with an example of 2+2 = 4.

The best introduction to Isabelle that I know of is the book Concrete Semantics, although I do wish that it would start with Isar style proofs much sooner. If you're doing work in Isabelle I think you need page 57 printed and stuck to the wall and if you're going through that book and exercises I'd recommend peeking at the Isar chapter sooner.

But Isabelle's traditional workflow is to write in its functional language and export to OCaml or Haskell. That's no help if we're seeking to prove things about C code.

Isabelle's theorems are objects in its underlying ML language and so you can write a program in ML that parses C and use the result in Isabelle. That's what SeL4 does. The underlying framework for expressing imperative languages in Isabelle is Schirmer's Simpl and this imposes some odd limitations on the C that can be processed. For one, all the local variables with the same name in a given .c file have to have the same type because the local state of all functions in a .c file is represented in a single struct. For the same reason, it's not possible to take the address of a local variable.

But we can try parsing a very simple C file with SeL4's parser:

int add(int a, int b) {
  return a+b;

That fits SeL4's C subset (called StrictC - see l4v/tools/c-parser/doc in the SeL4 source) and results in this Simpl construct (you'll need to read the Simpl paper to really understand it):

test_global_addresses.add_body ≡
  Guard SignedArithmetic ⦃-2147483648 ≤ sint ´a + sint ´b ∧ sint ´a + sint ´b ≤ 2147483647⦄
   (creturn global_exn_var_'_update ret__int_'_update (λs. a_' s + b_' s));;
  Guard DontReach {} SKIP

There's clearly an overflow check in there, which is good, but to call the process of starting to prove things about it intimidating to the beginner would be an understatement. That and the oddities of the C subset motivated me to look further.

Dependent types are closely related to this problem so I checked Wikipedia for the dependently typed languages which support imperative programming and that are still active. That's just ATS and F*, according to that table. ATS doesn't deal with overflows and, while F*/F7 is worthy of a look because of miTLS, it's a CLR language and so not applicable here.

Next up, based on searches, is SPARK. SPARK is a subset of Ada designed for verification. There's a commercial company behind it, but versions are published under the GPL. It even comes with an IDE.

SPARK uses an SMT solver to prove correctness, which is very different from a proof in Isabelle. While an Isabelle proof is, by construction, just an application of axioms and rules of inference, an SMT solver is essentially a magic box into which you feed a proposition and out of which (maybe) comes a true/false answer. Trusting in an SMT solver is much, much stronger than not doing so, but it is a step down from a formal proof system. SPARK uses a version of Why3 to abstract over SMT solvers and we'll discuss Why3 more later.

Anyway, here's the same, trivial function in Ada:

function Add(X, Y : in Integer_32) return Integer_32 is
   return X + Y;
end Add;

If we ask SPARK to process that then it generates verification conditions (VCs): one for each possible thing that could go wrong - overflow, array indexing etc. For that code it throws an error saying that X + Y might overflow, which is promising! In order to eliminate that, one needs to prove the verification condition that says that the addition is safe by adding preconditions to the function (which then become VCs at each callsite):

function Add(X, Y : in Integer_32) return Integer_32 with
  Pre => X < 2**30 and X > -2*30 and Y < 2**30 and Y > -2**30;

With that precondition, the function is accepted. I should note that this is SPARK 2014; there are many older versions (SPARK has been going for a long time) and they kept preconditions and the like in comments. Lots of pages about SPARK still talk about the older versions and major projects in SPARK (like Ironsides - a DNS server) still use the older versions.

With that initial success, let's try something a tiny bit more involved. Curve25519 deals with 255-bit integers but CPUs don't. So, in the same way that a 64-int integer could be represented with a pair of 32-bit integers, 255-bit integers are represented in Donna with ten, 32-bit integers. This representation is a little odd because the values aren't spaced at 32-bit multiples, but rather at alternating 26 and 25 bit positions. That also means that the representation is non-unique (setting bit 26 of the first word is equal to setting bit zero of the second) and that's done for performance. See this for more details, but it's not important right now.

Adding these 255-bit integers is done simply by adding the underlying words without carry, with suitable bounds on the inputs so that overflow doesn't happen:

type Ints is array (Integer range 0 .. 9) of Integer_32;

function Add (X, Y : in Ints) return Ints with
   Pre => (for all i in Ints'Range => X(i) < 2**30 and X(i) > -2**30 and Y(i) < 2**30 and Y(i) > -2**30),
   Post => (for all i in Ints'Range => Add'Result(i) = X(i) + Y(i));

function Add (X, Y : in Ints) return Ints is
   Sum : Ints;
   for i in Ints'Range loop
      Sum(i) := X(i) + Y(i);
   end loop;

   return Sum;
end Add;

That mostly works, but SPARK can't prove that the result is X(i) + Y(i) for all i despite it being exactly what the function says. One really needs to spoon feed the prover: in this case with a loop invariant in the for loop:

for i in Ints'Range loop
   pragma Loop_Invariant (for all j in Ints'First .. i-1 => Sum(j) = X(j) + Y(j));
   Sum(i) := X(i) + Y(i);
end loop;

Sadly, that doesn't work either, despite being correct, because SPARK doesn't seem to like i-1 when i might be zero. So, trying again:

for i in Ints'Range loop
   Sum(i) := X(i) + Y(i);
   pragma Loop_Invariant (for all j in Ints'First .. i => Sum(j) = X(j) + Y(j));
end loop;

That one works, but now SPARK isn't sure whether uninitialised elements of Sum are being referenced. The property of being initialised isn't part of SPARK's logic and seems that the only way to proceed here is to disable that warning!

But, there's some promise! We would now like to show a higher level property of Add: that the 255-bit integer that the returned array represents is the sum of the integers that the input arrays represent. This means that the logic needs to deal with arbitrary integers and that we need to write a function in the logic that turns an array into an abstract integer.

Enabling arbitrary integers in the logic is possible (although good luck finding how to do it in the documentation: the answer is to put a file called gnat.adc in the gnatprove subdirectory of the project containing pragma Overflow_Mode (General => Strict, Assertions => Eliminated);). However, I failed at writing functions in the logic without dropping down into code and triggering overflow checks. It appears that ghost functions should be able to do this but, after getting the answer to a documentation bug on StackOverflow, actually trying to use any logic functions, even the identity function, caused the proof not to terminate. SPARK claims to be able to use Isabelle for proofs but I couldn't get that to work at all: strace says that the .thy file isn't ever written.

Stuck, I moved on from SPARK and tried Frama-C and its Jessie plugin. Even if SPARK was working for me, using C (even a subset) has advantages: it's convenient to use in other projects, there exist verified compilers for it and there's an extension to CompCert that allows for static verification of constant-time behaviour (although I've lost the paper!)

So here's the same function in Frama-C:

/*@ requires \valid_range(out, 0, 9);
  @ requires \valid_range(x, 0, 9);
  @ requires \valid_range(y, 0, 9);
  @ requires \forall integer i; 0 <= i < 10 ==>
      (x[i] > -1000 && x[i] < 1000);
  @ requires \forall integer i; 0 <= i < 10 ==>
      (y[i] > -1000 && y[i] < 1000);
  @ ensures \forall integer i; 0 <= i < 10 ==> (out[i] == x[i] + y[i]);
void add(int *out, const int *x, const int *y) {
  /*@ loop invariant i >= 0 && i <= 10 &&
        (\forall integer j; 0 <= j < i ==> out[j] == x[j] + y[j]);
    @ loop variant 10-i;
  for (int i = 0; i < 10; i++) {
    out[i] = x[i] + y[i];

Note that this time we not only need a loop invariant to show the postcondition, but we also need a loop variant to show that the for loop terminates: you really need to spoon feed the verification! But Frama-C/Jessie has no problem with functions in the logic:

/*@ logic integer felemvalue (int *x) =
      x[0] + x[1] * (1 << 26) + x[2] * (1 << 51) + x[3] * (1 << 77) +
      x[4] * (1 << 102) + x[5] * 340282366920938463463374607431768211456 +
      x[6] * 11417981541647679048466287755595961091061972992 +
      x[7] * 766247770432944429179173513575154591809369561091801088 +
      x[8] * 25711008708143844408671393477458601640355247900524685364822016 +
      x[9] * 1725436586697640946858688965569256363112777243042596638790631055949824; */

That function maps from an array to the abstract integer that it represents. Note that the terms switched from using shifts to represent the constants to literal numbers. That's because the proof process suddenly became non-terminating (in a reasonable time) once the values reached about 104-bits, but the literals still worked.

With that logic function in hand, the we can easily prove the higher-level concept that we wanted as a post condition of the add function:

@ ensures felemvalue(out) == felemvalue(x) + felemvalue(y);

Flushed with success, I moved onto the next most basic function: multiplication of 255-bit integers. The code that Donna uses is the textbook, polynomial multiplication code. Here's how the function starts:

/* Multiply two numbers: output = in2 * in
 * output must be distinct to both inputs. The inputs are reduced coefficient
 * form, the output is not.
 * output[x] <= 14 * the largest product of the input limbs. */
static void fproduct(limb *output, const limb *in2, const limb *in) {
  output[0] =       ((limb) ((s32) in2[0])) * ((s32) in[0]);
  output[1] =       ((limb) ((s32) in2[0])) * ((s32) in[1]) +
                    ((limb) ((s32) in2[1])) * ((s32) in[0]);
  output[2] =  2 *  ((limb) ((s32) in2[1])) * ((s32) in[1]) +
                    ((limb) ((s32) in2[0])) * ((s32) in[2]) +
                    ((limb) ((s32) in2[2])) * ((s32) in[0]);

Each of the lines is casting a 64-bit number down to 32 bits and then doing a 32×32⇒64 multiplication. (The casts down to 32 bits are just to tell the compiler not to waste time on a 64×64⇒64 multiplication.) In Frama-C/Jessie we need to show that the casts are safe, multiplications don't overflow and nor do the sums and multiplications by two etc. This should be easy. Here are the preconditions that set the bounds on the input.

/*@ requires \valid_range(output, 0, 18);
  @ requires \valid_range(in, 0, 9);
  @ requires \valid_range(in2, 0, 9);
  @ requires \forall integer i; 0 <= i < 10 ==> -134217728 < in[i] < 134217728;
  @ requires \forall integer i; 0 <= i < 10 ==> -134217728 < in2[i] < 134217728;

However, Why3, which is the prover backend that Jessie (and SPARK) uses, quickly gets bogged down. In an attempt to help it out, I moved the casts to the beginning of the function and put the results in arrays so that the code was less complex.

As you can see in the Jessie documentation, the Why3 UI allows one to do transforms on the verification conditions to try and make life easier for the prover. Quickly, splitting a requirement becomes necessary. But this throws away a lot of information. In this case, each of the 32×32⇒64 multiplications is assumed to fit only within its bounds and so the intermediate bounds need to be reestablished every time. This means that the equations have to be split like this:

limb t, t2;
t = ((limb) in2_32[0]) * in_32[1];
//@ assert -18014398509481984 < t < 18014398509481984;
t2 = ((limb) in2_32[1]) * in_32[0];
//@ assert -18014398509481984 < t2 < 18014398509481984;
output[1] = t + t2;

That's only a minor problem really, but the further one goes into the function, the harder the prover needs to work for some reason. It's unclear why because every multiplication has the same bounds - they are all instances of the same proof. But it's very clear that the later multiplications are causing more work.

Why3 is an abstraction layer for SMT solvers and a large number are supported (see “External provers” on its homepage for a list). I tried quite a few and, for this task, Z3 is clearly the best with CVC4 coming in second. However, Z3 has a non-commercial license which is very worrying - does that mean that a verified version of Curve25519 that used Z3 has to also have a non-commercial license? So I stuck to using CVC4.

However, about a quarter of the way into the function, both CVC4 and Z3 are stuck. Despite the bounds being a trivial problem, and just instances of the same problem that they can solve at the beginning of the function, somehow either Why3 is doing something bad or the SMT solvers are failing to discard irrelevant facts. I left it running overnight and Z3 solved one more instance after six hours but achieved nothing after 12 hours on the next one. Splitting and inlining the problem further in the Why3 GUI didn't help either.

Like SPARK, Jessie can use Isabelle for proofs too (and for the same reason: they are both using Why3, which supports Isabelle as a backend). It even worked this time, once I added Real to the Isabelle imports. However, in the same way that the C parser in Isabelle was an ML function that created Isabelle objects internally, the Why3 connection to Isabelle is a function (why3_open) that parses an XML file. This means that the proof has large numbers of generated variable names (o33, o34, etc) and you have no idea which intermediate values they are. Additionally, the bounds problem is something that Isabelle could have solved automatically, but the assumptions that you can use are similarly autonamed as things like local.H165. In short, the Isabelle integration appears to be unworkable.

Perhaps I could split up each statement in the original code into a separate function and then write the statement again in the logic in order in order to show the higher level properties, but at some point you have to accept that you're not going to be able to dig this hole with a plastic spoon.


The conclusion is a bit disappointing really: Curve25519 has no side effects and performs no allocation, it's a pure function that should be highly amenable to verification and yet I've been unable to find anything that can get even 20 lines into it. Some of this might be my own stupidity, but I put a fair amount of work into trying to find something that worked.

There seems to be a lot of promise in the area and some pieces work well (SMT solvers are often quite impressive, the Frama-C framework appears to be solid, Isabelle is quite pleasant) but nothing I found worked well together, at least for verifying C code. That makes efforts like SeL4 and Ironsides even more impressive. However, if you're happy to work at a higher level I'm guessing that verifying functional programs is a lot easier going.

HSTS for new TLDs

Whatever you might think of them, the new TLDs are rapidly arriving. New TLDs approved so far this month include alsace, sarl, iinet, poker, gifts, restaurant, fashion, tui and allfinanz. The full list for last month is over twice as long.

That means that there's lots of people currently trying to figure out how to differentiate themselves from other TLDs. Here's an idea: why not ask me to set HSTS for the entire TLD? That way, every single site runs over HTTPS, always. It strikes me that could be useful if you're trying to build trust with users unfamiliar with the zoo of new domains.

(I can't speak for Firefox and Safari but I think it's safe to assume that Firefox would be on board with this. It's still unclear whether IE's support for HSTS will include preloading.)

I'm guessing that, with such a large number of new TLDs, I should be able to reach at least some operators of them via this blog post.

Encrypting streams

When sending data over the network, chunking is pretty much a given. TLS has a maximum record size of 16KB and this fits neatly with authenticated encryption APIs which all operate on an entire message at once.

But file encryption frequently gets this wrong. Take OpenPGP: it bulk encrypts the data and sticks a single MAC on the end. Ideally everyone is decrypting to a temporary file and waiting until the decryption and verification is complete before touching the plaintext, but it takes a few seconds of searching to find people suggesting commands like this:

gpg -d your_archive.tgz.gpg | tar xz

With that construction, tar receives unauthenticated input and will happily extract it to the filesystem. An attacker doesn't (we assume) know the secret key, but they can guess at the structure of the plaintext and flip bits in the ciphertext. Bit flips in the ciphertext will produce a corresponding bit flip in the plaintext, followed by randomising the next block. I bet some smart attacker can do useful things with that ability. Sure the gpg command will exit with an error code, but do you think that the shell script writer carefully handled that case and undid the changes to the filesystem?

The flaw here isn't in CFB mode's malleability, but in OpenPGP forcing the use of unauthenticated plaintext in practical situations. (Indeed, if you are ever thinking about the malleability of ciphertext, you have probably already lost.)

I will even claim that the existance of an API that can operate in a streaming fashion over large records (i.e. will encrypt and defer the authenticator and will decrypt and return unauthenticated plaintext) is a mistake. Not only is it too easy to misunderstand and misuse (like the gpg example above) but, even if correctly buffered in a particular implementation, the existance of large records may force other implementations to do dangerous things because of a lack of buffer space.

If large messages are chunked at 16KB then the overhead of sixteen bytes of authenticator for every chunk is only 0.1%. Additionally, you can safely stream the decryption (as long as you can cope with truncation of the plaintext).

Although safer in general, when chunking one has to worry that an attacker hasn't reordered chunks, hasn't dropped chunks from the start and hasn't dropped chunks from the end. But sadly there's not a standard construction for taking an AEAD and making a scheme suitable for encrypting large files (AERO might be close, but it's not quite what I have in mind). Ideally such a scheme would take an AEAD and produce something very like an AEAD in that it takes a key, nonce and additional data, but can safely work in a streaming fashion. I don't think it need be very complex: take 64 bits of the nonce from the underlying AEAD as the chunk number, always start with chunk number zero and feed the additional data into chunk zero with a zero byte prefix. Prefix each chunk ciphertext with a 16 bit length and set the MSB to indicate the last chunk and authenticate that indication by setting the additional data to a single, 1 byte. The major worry might be that for many underlying AEADs, taking 64 bits of the nonce for the chunk counter leaves one with very little (or none!) left.

That requires more thought before using it for real but, if you are ever building encryption-at-rest, please don't mess it up like we did 20 years ago. (Although, even with that better design, piping the output into tar may still be unwise because an attacker can truncate the stream at a chunk boundary: perhaps eliminating important files in the process.)

Update: On Twitter, zooko points to Tahoe-LAFS as an example of getting it right. Additionally, taking the MAC of the current state of a digest operation and continuing the operation has been proposed for sponge functions (like SHA-3) under the name MAC-and-continue. The exact provenance of this isn't clear, but I think it might have been from the Keccak team in this paper. Although MAC-and-continue doesn't allow random access, which might be important for some situations.


Earlier this year, before Apple had too many goto fails and GnuTLS had too few, before everyone learnt that TLS heart-beat messages were a thing and that some bugs are really old, I started a tidy up of the OpenSSL code that we use at Google.

We have used a number of patches on top of OpenSSL for many years. Some of them have been accepted into the main OpenSSL repository, but many of them don’t mesh with OpenSSL’s guarantee of API and ABI stability and many of them are a little too experimental.

But as Android, Chrome and other products have started to need some subset of these patches, things have grown very complex. The effort involved in keeping all these patches (and there are more than 70 at the moment) straight across multiple code bases is getting to be too much.

So we’re switching models to one where we import changes from OpenSSL rather than rebasing on top of them. The result of that will start to appear in the Chromium repository soon and, over time, we hope to use it in Android and internally too.

There are no guarantees of API or ABI stability with this code: we are not aiming to replace OpenSSL as an open-source project. We will still be sending them bug fixes when we find them and we will be importing changes from upstream. Also, we will still be funding the Core Infrastructure Initiative and the OpenBSD Foundation.

But we’ll also be more able to import changes from LibreSSL and they are welcome to take changes from us. We have already relicensed some of our prior contributions to OpenSSL under an ISC license at their request and completely new code that we write will also be so licensed.

(Note: the name is aspirational and not yet a promise.)

Early ChangeCipherSpec Attack

OpenSSL 1.0.1h (and others) were released today with a scary looking security advisiory and that's always an event worth looking into. (Hopefully people are practiced at updating OpenSSL now!)

Update: the original reporter has a blog post up. Also, I won't, personally, be answering questions about specific Google services. (I cut this blog post together from notes that I'm writing for internal groups to evaluate and this is still very fresh.)

Update: my initial thoughts from looking at the diff still seem to be holding up. Someone is welcome to write a more detailed analysis than below. HP/ZDI have a write up of one of the DTLS issues.

There are some critical bug fixes to DTLS (TLS over datagram transports, i.e. UDP), but most people will be more concerned about the MITM attack against TLS (CVE-2014-0224).

The code changes are around the rejection of ChangeCipherSpec messages, which are messages sent during the TLS handshake that mark the change from unencrypted to encrypted traffic. These messages aren't part of the handshake protocol itself and aren't linked into the handshake state machine in OpenSSL. Rather there's a check in the code that they are only received when a new cipher is ready to be used. However, that check (for s->s3->tmp.new_cipher in s3_pkt.c) seems reasonable, but new_cipher is actually set as soon as the cipher for the connection has been decided (i.e. once the ServerHello message has been sent/received), not when the cipher is actually ready! It looks like this is the problem that's getting fixed in this release.

Here's the code in question that handles a ChangeCipherSpec message:

int ssl3_do_change_cipher_spec(SSL *s)
	int i;
	const char *sender;
	int slen;

	if (s->state & SSL_ST_ACCEPT)

	if (s->s3->tmp.key_block == NULL)1
		if (s->session == NULL)
			/* might happen if dtls1_read_bytes() calls this */
			return (0);

		if (!s->method->ssl3_enc->setup_key_block(s)) return(0); 2

	if (!s->method->ssl3_enc->change_cipher_state(s,i))

	/* we have to record the message digest at
	 * this point so we can get it before we read
	 * the finished message */
	if (s->state & SSL_ST_CONNECT)

	i = s->method->ssl3_enc->final_finish_mac(s,
		sender,slen,s->s3->tmp.peer_finish_md); 3
	if (i == 0)
		return 0;
	s->s3->tmp.peer_finish_md_len = i;


If a ChangeCipherSpec message is injected into the connection after the ServerHello, but before the master secret has been generated, then ssl3_do_change_cipher_spec will generate the keys (2) and the expected Finished hash (3) for the handshake with an empty master secret. This means that both are based only on public information. Additionally, the keys will be latched because of the check at (1) - further ChangeCipherSpec messages will regenerate the expected Finished hash, but not the keys.

The oldest source code on the OpenSSL site is for 0.9.1c (Dec 28, 1998) and the affected code appears almost unchanged. So it looks like this bug has existed for 15+ years.

The implications of this are pretty complex.

For a client there's an additional check in the code that requires that a CCS message appear before the Finished and after the master secret has been generated. An attacker can still inject an early CCS too and the keys will be calculated with an empty master secret. Those keys will be latched - another CCS won't cause them to be recalculated. However, when sending the second CCS that the client code requires, the Finished hash is recalculated with the correct master secret. This means that the attacker can't fabricate an acceptable Finished hash. This stops the obvious, generic impersonation attack against the client.

For a server, there's no such check and it appears to be possible to send an early CCS message and then fabricate the Finished hash because it's based on an empty master secret. However, that doesn't obviously gain an attacker anything. It would be interesting if a connection with a client certificate could be hijacked, but there is a check in ssl3_get_cert_verify that a CCS hasn't been already processed so that isn't possible.

Someone may be able to do something more creative with this bug; I'm not ruling out that there might be other implications.

Things change with an OpenSSL 1.0.1 server however. In 1.0.1, a patch of mine was included that moves the point where Finished values are calculated. The server will now use the correct Finished hash even if it erroneously processed a CCS, but this interacts badly with this bug. The server code (unlike the client) won't accept two CCS messages in a handshake. So, if an attacker injects an early CCS at the server to fixate the bad keys, then it's not possible for them to send a second in order to get it to calculate the correct Finished hash. But with 1.0.1, the server will use the correct Finished hash and will reply with the correct hash to the client. This explains the 1.0.1 and 1.0.2 mention in the advisory. With any OpenSSL client talking to an OpenSSL 1.0.1 server, an attacker can inject CCS messages to fixate the bad keys at both ends but the Finished hashes will still line up. So it's possible for the attacker to decrypt and/or hijack the connection completely.

The good news is that these attacks need man-in-the-middle position against the victim and that non-OpenSSL clients (IE, Firefox, Chrome on Desktop and iOS, Safari etc) aren't affected. None the less, all OpenSSL users should be updating.

Update: I hacked up the Go TLS library to perform the broken handshake in both 0.9.8/1.0.0 and 1.0.1 modes. There's a patch for crypto/tls and a tool that will attempt to do those broken handshakes and tell you if either worked. For example (at the time of writing):

$ ./earlyccs_check
Handshake failed with error: remote error: unexpected message
Looks ok.
$ ./earlyccs_check
Server is affected (1.0.1).
$ ./earlyccs_check
Server is affected (0.9.8 or 1.0.0).

Matching primitive strengths

It's a common, and traditional, habit to match the strengths of cryptographic primitives. For example, one might design at the 128-bit security level and pick AES-128, SHA-256 and P-256. Or if someone wants “better” security, one might target a 192-bit security level with AES-192, SHA-384 and P-384. NIST does this sort of thing, for example in 800-57/3, section 5.2.

The logic underlying this is that an attacker will attack at a system's weakest point, so anything that is stronger than the weakest part is a waste. But, while matching up the security level might have made sense in the past, I think it warrants reconsideration now.

Consider the energy needed to do any operation 2128 times: Landauer tells us that erasing a bit takes a minimum amount of energy any actual crypto function will involve erasing many bits. But, for the sake of argument, let's say that we're doing a single bit XOR. If we take the minimum energy at room temperature (to save energy on refrigeration) we get 2.85×10-21 × 2128 = 0.97×1018J. That's about “yearly electricity consumption of South Korea as of 2009” according to Wikipedia. If we consider that actually doing anything useful in each of those 2128 steps is going to take orders of magnitude more bit operations, and that we can't build a computer anywhere near the theoretic limit, we can easily reach, say, 5.5×1024J, which is “total energy from the Sun that strikes the face of the Earth each year”.

(The original version of this post discussed incrementing a 128-bit counter and said that it took two erasures per increment. Samuel Neves on Twitter pointed out that incrementing the counter doesn't involve destroying the information of the old value and several people pointed out that one can hit all the bit-patterns with a Gray code and thus only need to flip a single bit each time. So I changed the above to use a one-bit XOR as the basic argument still holds.)

So any primitive at or above the 128-bit security level is equally matched today, because they are all effectively infinitely strong.

One way to take this is to say that anything above 128 bits is a waste of time. Another is to say that one might want to use primitives above that level, but based on the hope that discontiguous advances leave it standing but not the lesser primitives. The latter involves a lot more guesswork than the old practice of making the security levels match up. Additionally, I don't think most people would think that an analytic result is equally likely for all classes of primitives, so one probably doesn't want to spend equal amounts of them all.

Consider the difference between P-256 and P-384 (or curve25519 and Curve41417 or Hamburg's Goldilocks-448). If they are all infinitely strong today then the reason to want to use a larger curve is because you expect some breakthrough, but not too much of a breakthrough. Discrete log in an elliptic-curve group is a square-root work function today. For those larger curves to make sense you have to worry about a breakthrough that can do it in cube-root time, but not really forth-root and certainly not fifth-root. How much would you pay to insure yourself against that, precise possibility? And how does that compare against the risk of a similar breakthrough in AES?

It is also worth considering whether the security-level is defined in a way matches your needs. AES-128 is generally agreed to have a 128-bit security level, but if an attacker is happy breaking any one of 2n keys then they can realise a significant speedup. An “unbalanced” choice of cipher key size might be reasonable if that attack is a concern.

So maybe a 256-bit cipher (ciphers do pretty well now, just worry about Grover's algorithm), SHA-384 (SHA-256 should be fine, but hash functions have, historically, had a bad time) and 128512(? see below)-bit McEliece is actually “balanced” these days.

(Thanks to Mike Hamburg and Trevor Perrin, with whom I was discussing this after Mike presented his Goldilocks curve. Thanks to JP Aumasson on Twitter and pbsd on HN for suggesting multi-target attacks as a motivation for large cipher key sizes. Also to pbsd for pointing out a quantum attack against McEliece that I wasn't aware of.)

SHA-256 certificates are coming

It's a neat result in cryptography that you can build a secure hash function given a secure signature scheme, and you can build a secure signature scheme given a secure hash function. However, far from the theory, in the real world, lots of signatures today depend on SHA-1, which is looking increasingly less like a secure hash function.

There are lots of properties by which one can evaluate a hash function, but the most important are preimage resistance (if I give you a hash, can you find an input with that hash value?), second preimage resistance (if I give you a message, can you find another that hashes to the same value?) and collision resistance (can you find two messages with the same hash?). Of those, the third appears to be much harder to meet than the first two, based on historical results against hash functions.

Back when certificates were signed with MD5, a chosen-prefix collision attack (i.e. given two messages, can you append different data to each so that the results have the same hash?) against MD5 was used at least twice to break the security of the PKI. First the MD5 Collisions Inc demo against RapidSSL and then the Flame malware.

Today, SHA-1 is at the point where a collision attack is estimated to take 261 work and a chosen-prefix collision to take 277. Both are below the design strength of 280 and even that design strength isn't great by today's standards.

We hope that we have a second line of defense for SHA-1: after the MD5 Collisions Inc demo, CAs were required to use random serial numbers. A chosen-prefix attack requires being able to predict the certificate contents that will be signed and the random serial number should thwart that. With random serials we should be resting on a stronger hash function property called target-collision resistance. (Although I'm not aware of any proofs that random serials actually put us on TCR.)

Still, it would be better not to depend on those band-aids and to use a hash function with a better design strength while we do it. So certificates are starting to switch to using SHA-256. A large part of that shift came from Microsoft forbidding certificates using SHA-1 starting in 2016.

For most people, this will have no effect. Twitter ended up with a SHA-256 certificate after they replaced their old one because of the OpenSSL heartbeat bug. So, if you can still load Twitter's site, then you're fine.

But there are a lot of people using Windows XP prior to Service Pack 3, and they will have problems. We've already seen lots of user reports of issues with Twitter (and other sites) from these users. Wherever possible, installing SP3 is the answer. (Or, better yet, updating from Windows XP.)

There are also likely to be problems with embedded clients, old phones etc. Some of these may not come to light for a while.

We've not yet decided what Google's timeline for switching is, but we will be switching prior to 2016. If you've involved with a something embedded that speaks to Google over HTTPS (and there's a lot of it), now is the time to test, which is using a SHA-256 certificate. I'll be using other channels to contact the groups that we know about but, on the web, we don't always know what's talking to us.

Revocation still doesn't work

I was hoping to be done with revocation for a bit, but sadly not.

GRC have published a breathless piece attacking a straw man argument: “Google tells us ... that Chrome's unique CRLSet solution provides all the protection we need”. I call it a straw man because I need only quote my own posts that GRC have read to show it.

The original CRLSets announcement contained two points. Firstly that online revocation checking doesn't work and that we were switching it off in Chrome in most cases. Secondly, that we would be using CRLSets to avoid having to do full binary updates in order to respond to emergency incidents.

In the last two paragraphs I mentioned something else. Quoting myself:

“Since we're pushing a list of revoked certificates anyway, we would like to invite CAs to contribute their revoked certificates (CRLs) to the list. We have to be mindful of size, but the vast majority of revocations happen for purely administrative reasons and can be excluded. So, if we can get the details of the more important revocations, we can improve user security. Our criteria for including revocations are:

  1. The CRL must be crawlable: we must be able to fetch it over HTTP and robots.txt must not exclude GoogleBot.
  2. The CRL must be valid by RFC 5280 and none of the serial numbers may be negative.
  3. CRLs that cover EV certificates are taken in preference, while still considering point (4).
  4. CRLs that include revocation reasons can be filtered to take less space and are preferred.”

In short, since we were pushing revocations anyway, maybe we could get some extra benefit from it. It would be better than nothing, which is what browsers otherwise have with soft-fail revocation checking.

I mentioned it again, last week (emphasis added):

“We compile daily lists of some high-value revocations and use Chrome's auto-update mechanism to push them to Chrome installations. It's called the CRLSet and it's not complete, nor big enough to cope with large numbers of revocations, but it allows us to react quickly to situations like Diginotar and ANSSI. It's certainly not perfect, but it's more than many other browsers do.”

“The original hope with CRLSets was that we could get revocations categorised into important and administrative and push only the important ones. (Administrative revocations occur when a certificate is changed with no reason to suspect that any key compromise occurred.) Sadly, that mostly hasn't happened.”

And yet, GRC managed to write pages (including cartoons!) exposing the fact that it doesn't cover many revocations and attacking Chrome for it.

They also claim that soft-fail revocation checking is effective:

The claim is that a user will download and cache a CRL while not being attacked and then be protected from a local attacker using a certificate that was included on that CRL. (I'm paraphrasing; you can search for “A typical Internet user” in their article for the original.)

There are two protocols used in online revocation checking: OCSP and CRL. The problem is that OCSP only covers a single certificate and OCSP is used in preference because it's so much smaller and thus removes the need to download CRLs. So the user isn't expected to download and cache the CRL anyway. So that doesn't work.

However, it's clear that someone is downloading CRLs because Cloudflare are spending half a million dollars a month to serve CRLs. Possibly it's non-browser clients but the bulk is probably CAPI (the library that IE, Chrome and other software on Windows typically uses - although not Firefox). A very obscure fact about CAPI is that it will download a CRL when it has 50 or more OCSP responses cached for a given CA certificate. But CAs know this and split their CRLs so that they don't get hit with huge bandwidth bills. But a split CRL renders the claimed protection from caching ineffective, because the old certificate for a given site is probably in a different part.

So I think the claim is that doing blocking OCSP lookups is a good idea because, if you use CAPI on Windows, then you might cache 50 OCSP responses for a given CA certificate. Then you'll download and cache a CRL for a while and then, depending on whether the CA splits their CRLs, you might have some revocations cached for a site that you visit.

It seems that argument is actually for something like CRLSets (download and cache revocations) over online checking, it's just a Rube Goldberg machine to, perhaps, implement it!

So, once again, revocation doesn't work. It doesn't work in other browsers and CRLSets aren't going to cover much in Chrome, as I've always said. GRC's conclusions follow from those errors and end up predictably far from the mark.

In order to make this post a little less depressing, let's consider whether its reasonable to aim to cover everything with CRLSets. (Which I mentioned before as well, but I'll omit the quote.) GRC quote numbers from Netcraft claiming 2.85 million revocations, although some are from certificate authorities not trusted by mainstream web browsers. I spent a few minutes harvesting CRLs from the CT log. This only includes certificates that are trusted by reasonable number of people and I only waited for half the log to download. From that half I threw in the CRLs included by CRLSets and got 2356 URLs after discarding LDAP ones.

I tried downloading them and got 2164 files. I parsed them and eliminated duplicates and my quick-and-dirty parser skipped quite a few. None the less, this very casual search found 1,062 issuers and 4.2 million revocations. If they were all dropped into the CRLSet, it would take 39MB.

So that's a ballpark figure, but we need to design for something that will last a few years. I didn't find figures on the growth of HTTPS sites, but Netcraft say that the number of web sites overall is growing at 37% a year. It seems reasonable that the number of HTTPS sites, thus certificates, thus revocations will double a couple of times in the next five years.

There's also the fact that if we subsidise revocation with the bandwidth and computers of users, then demand for revocation will likely increase. Symantec, I believe, have experimented with certificates that are issued for the maximum time (five years at the moment) but sold in single year increments. When it comes to the end of the year, they'll remind you to renew. If you do, you don't need to do anything, otherwise the certificate gets revoked. This seems like a good deal for the CA so, if we make revocation really effective, I'd expect to see a lot more of it. Perhaps factor in another doubling for that.

So we would need to scale to ~35M revocations in five years, which is ~300MB of raw data. That would need to be pushed to all users and delta updated. (I don't have numbers for the daily delta size.)

That would obviously take a lot of bandwidth and a lot of resources on the client. We would really need to aim at mobile as well, because that is likely to be the dominant HTTPS client in a few years, making storage and bandwidth use even more expensive.

Some probabilistic data structure might help. I've written about that before. Then you have to worry about false positives and the number of reads needed per validation. Phones might use Flash for storage, but it's remarkably slow.

Also, would you rather spend those resources on revocation, or on distributing the Safe Browsing set to mobile devices? Or would you spend the memory on improving ASLR on mobile devices? Security is big picture full of trade-offs. It's not clear the revocation warrants all that spending.

So, if you believe that downloading and caching revocation is the way forward, I think those are the parameters that you have to deal with.

No, don't enable revocation checking

Revocation checking is in the news again because of a large number of revocations resulting from precautionary rotations for servers affected by the OpenSSL heartbeat bug. However, revocation checking is a complex topic and there's a fair amount of misinformation around. In short, it doesn't work and you are no more secure by switching it on. But let's quickly catch up on the background:

Certificates bind a public key and an identity (commonly a DNS name) together. Because of the way the incentives work out, they are typically issued for a period of several years. But events occur and sometimes the binding between public key and name that the certificate asserts becomes invalid during that time. In the recent cases, people who ran a server that was affected by the heartbeat bug are worried that their private key might have been obtained by someone else and so they want to invalidate the old binding, and bind to a new public key. However, the old certificates are still valid and so someone who had obtained that private key could still use them.

Revocation is the process of invalidating a certificate before its expiry date. All certificates include a statement that essentially says “please phone the following number to check that this certificate has not been revoked”. The term online revocation checking refers to the process of making that phone call. It's not actually a phone call, of course, rather browsers and other clients can use a protocol called OCSP to check the status of a certificate. OCSP supplies a signed statement that says that the certificate is still valid (or not) and, critically, the OCSP statement itself is valid for a much shorter period of time, typically a few days.

The critical question is what to do in the event that you can't get an answer about a certificate's revocation status. If you reject certificates when you can't get an answer, that's called hard-fail. If you accept certificates when you can't get an answer that's called soft-fail.

Everyone does soft-fail for a number of reasons on top of the general principle that single points of failure should be avoided. Firstly, the Internet is a noisy place and sometimes you can't get through to OCSP servers for some reason. If you fail in those cases then the level of random errors increases. Also, captive portals (e.g. hotel WiFi networks where you have to “login” before you can use the Internet) frequently use HTTPS (and thus require certificates) but don't allow you to access OCSP servers. Lastly, if everyone did hard-fail then taking down an OCSP service would be sufficient to take down lots of Internet sites. That would mean that DDoS attackers would turn their attention to them, greatly increasing the costs of running them and it's unclear whether the CAs (who pay those costs) could afford it. (And the disruption is likely to be significant while they figure that out.)

So soft-fail is the only viable answer but it has a problem: it's completely useless. But that's not immediately obvious so we have to consider a few cases:

If you're worried about an attacker using a revoked certificate then the attacker first must be able to intercept your traffic to the site in question. (If they can't even intercept the traffic then you didn't need any authentication to protect it from them in the first place.) Most of the time, such an attacker is near you. For example, they might be running a fake WiFi access point, or maybe they're at an ISP. In these cases the important fact is that the attacker can intercept all your traffic, including OCSP traffic. Thus they can block OCSP lookups and soft-fail behaviour means that a revoked certificate will be accepted.

The next class of attacker might be a state-level attacker. For example, Syria trying to intercept Facebook connections. These attackers might not be physically close, but they can still intercept all your traffic because they control the cables going into and out of a country. Thus, the same reasoning applies.

We're left with cases where the attacker can only intercept traffic between a user and a website, but not between the user and the OCSP service. The attacker could be close to the website's servers and thus able to intercept all traffic to that site, but not anything else. More likely, the attacker might be able to perform a DNS hijacking: they persuade a DNS registrar to change the mapping between a domain ( and its IP address(es) and thus direct the site's traffic to themselves. In these cases, soft-fail still doesn't work, although the reasoning is more complex:

Firstly, the attacker can use OCSP stapling to include the OCSP response with the revoked certificate. Because OCSP responses are generally valid for some number of days, they can store one from before the certificate was revoked and use it for as long as it's valid for. DNS hijackings are generally noticed and corrected faster than the OCSP response will expire. (On top of that, you need to worry about browser cache poisoning, but I'm not going to get into that.)

Secondly, and more fundamentally, when issuing certificates a CA validates ownership of a domain by sending an email, or looking for a specially formed page on the site. If the attacker is controlling the site, they can get new certificates issued. The original owners might revoke the certificates that they know about, but it doesn't matter because the attacker is using different ones. The true owners could try contacting CAs, convincing them that they are the true owners and get other certificates revoked, but if the attacker still has control of the site, they can hop from CA to CA getting certificates. (And they will have the full OCSP validity period to use after each revocation.) That circus could go on for weeks and weeks.

That's why I claim that online revocation checking is useless - because it doesn't stop attacks. Turning it on does nothing but slow things down. You can tell when something is security theater because you need some absurdly specific situation in order for it to be useful.

So, for a couple of years now, Chrome hasn't done these useless checks by default in most cases. Rather, we have tried a different mechanism. We compile daily lists of some high-value revocations and use Chrome's auto-update mechanism to push them to Chrome installations. It's called the CRLSet and it's not complete, nor big enough to cope with large numbers of revocations, but it allows us to react quickly to situations like Diginotar and ANSSI. It's certainly not perfect, but it's more than many other browsers do.

A powerful attacker may be able to block a user from receiving CRLSet updates if they can intercept all of that user's traffic for long periods of time. But that's a pretty fundamental limit; we can only respond to any Chrome issue, including security bugs, by pushing updates.

The original hope with CRLSets was that we could get revocations categorised into important and administrative and push only the important ones. (Administrative revocations occur when a certificate is changed with no reason to suspect that any key compromise occurred.) Sadly, that mostly hasn't happened. Perhaps we need to consider a system that can handle much larger numbers of revocations, but the data in that case is likely to be two orders of magnitude larger and it's very unlikely that the current CRLSet design is still optimal when the goal moves that far. It's also a lot of data for every user to be downloading and perhaps efforts would be better spent elsewhere. It's still the case that an attacker that can intercept traffic can easily perform an SSL Stripping attack on most sites; they hardly need to fight revoked certificates.

In order to end on a positive note, I'll mention a case where online revocation checking does work, and another, possible way to solve the revocation problem for browsers.

The arguments above started with the point that an attacker using a revoked certificate first needs to be able to intercept traffic between the victim and the site. That's true for browsers, but it's not true for code-signing. In the case where you're checking the signature on a program or document that could be distributed via, say, email, then soft-fail is still valuable. That's because it increases the costs on the attacker substantially: they need to go from being able to email you to needing to be able to intercept OCSP checks. In these cases, online revocation checking makes sense.

If we want a scalable solution to the revocation problem then it's probably going to come in the form of short-lived certificates or something like OCSP Must Staple. Recall that the original problem stems from the fact that certificates are valid for years. If they were only valid for days then revocation would take care of itself. (This is the approach taken by DNSSEC.) For complex reasons, it might be easier to deploy that by having certificates that are valid for years, but include a marker in them that indicates that an OCSP response must be provided along with the certificate. The OCSP response is then only valid for a few days and the effect is the same (although less efficient).

TLS Triple Handshakes

Today, the TLS WG mailing list meeting received a note about the work of Karthikeyan Bhargavan, Antoine Delignat-Lavaud, Cedric Fournet, Alfredo Pironti and Pierre-Yves Strub on triple handshake attacks against TLS. This is more complex than just a duplicated goto and I'm not going to try and reproduce their explanation here. Instead, I'll link to their site again, which also includes a copy of their paper.

In short, the TLS handshake hashes in too little information, and always has. Because of that it's possible to synchronise the state of two TLS sessions in a way that breaks assumptions made in the rest of the protocol.

I'd like to thank the researchers for doing a very good job of disclosing this. The note today even included a draft for fixing the TLS key derivation to include all the needed information to stop this attack, and it'll be presented at the WG meeting tomorrow.

In the mean time, people shouldn't panic. The impact of this attack is limited to sites that use TLS client-certificate authentication with renegotiation, and protocols that depend on channel binding. The vast majority of users have never used client certificates.

The client-certificate issues can be fixed with a unilateral, client change to be stricter about verifying certificates during a renegotiation, as suggested by the authors. I've included an image, below, that is loaded over an HTTPS connection that renegotiates with unrelated certificates before returning the image data. Hopefully the image below is broken. If not, then it likely will be soon because of a browser update. (I took the server down.)

Protocols that depend on channel binding (including ChannelID) need other changes. Ideally, the proposed update to the master-secret computation will be finalised and implemented. (For ChannelID, we have already updated the protocol to include a change similar to the proposed draft.)

It's likely that there are still concrete problems to be found because of the channel-binding break. Hopefully with today's greater publicity people can start to find them.

TLS Symmetric Crypto

At this time last year, the TLS world was mostly running on RC4-SHA and AES-CBC. The Lucky 13 attack against CBC in TLS had just been published and I had spent most of January writing patches for OpenSSL and NSS to implement constant-time CBC decoding. The RC4 biases paper is still a couple of week away, but it's already clear that both these major TLS cipher suite families are finished and need replacing. (The question of which is worse is complicated.)

Here's Chrome's view of TLS by number of TLS connections from around this time (very minor cipher suite families were omitted):

Cipher suite familyPercentage of TLS connections made by Chrome

A whole bunch of people, myself included, have been working to address that since.

AES-GCM was already standardised, implemented in OpenSSL and has hardware support in Intel chips, so Wan-Teh Chang and I started to implement it in NSS, although first we had to implement TLS 1.2, on which it depends. That went out in Chrome 30 and 31. Brian Smith (and probably others!) at Mozilla shipped that code in Firefox also, with Firefox 27.

AES-GCM is great if you have hardware support for it: Haswell chips can do it in just about 1 cycle/byte. However, it's very much a hardware orientated algorithm and it's very difficult to implement securely in software with reasonable speed. Also, since TLS has all the infrastructure for cipher suite negotiation already, it's nice to have a backup in the wings should it be needed.

So we implemented ChaCha20-Poly1305 for TLS in NSS and OpenSSL. (Thanks to Elie Bursztein for doing a first pass in OpenSSL.) This is an AEAD made up of primitives from djb and we're using implementations from djb, Andrew M, Ted Krovetz and Peter Schwabe. Although it can't beat AES-GCM when AES-GCM has dedicated hardware, I suspect there will be lots of chips for a long time that don't have such hardware. Certainly most mobile phones at the moment don't (although AArch64 is coming).

Here's an example of the difference in speeds:

ChipAES-128-GCM speedChaCha20-Poly1305 speed
OMAP 446024.1 MB/s75.3 MB/s
Snapdragon S4 Pro41.5 MB/s130.9 MB/s
Sandy Bridge Xeon (AESNI)900 MB/s500 MB/s

(The AES-128-GCM implementation is from OpenSSL 1.0.1f. Note, ChaCha20 is a 256-bit cipher and AES-128 obviously isn't.)

There's also an annoying niggle with AES-GCM in TLS because the spec says that records have an eight byte, explicit nonce. Being an AEAD, the nonce is required to be unique for a given key. Since an eight-byte value is too small to pick at random with a sufficiently low collision probability, the only safe implementation is a counter. But TLS already has a suitable, implicit record sequence counter that has always been used to number records. So the explicit nonce is at best a waste of eight bytes per record, and possibly dangerous should anyone attempt to use random values with it. Thankfully, all the major implementations use a counter and I did a scan of the Alexa, top 200K sites to check that none are using random values - and none are. (Thanks to Dr Henson for pointing out that OpenSSL is using a counter, but with a random starting value.)

For ChaCha20-Poly1305 we can save 8 bytes of overhead per record by using the implicit sequence number for the nonce.

Cipher suite selection

Given the relative merits of AES-GCM and ChaCha20-Poly1305, we wish to use the former when hardware support is available and the latter otherwise. But TLS clients typically advertise a fixed preference list of cipher suites and TLS servers either respect it, or override the client's preferences with their own, fixed list. (E.g. Apache's SSLHonorCipherOrder directive.)

So, first, the client needs to alter its cipher suite preference list depending on whether it has AES-GCM hardware support - which Chrome now does. Second, servers that enforce their own preferences (which most large sites do) need a new concept of an equal-preference group: a set of cipher suites in the server's preference order which are all “equally good”. When choosing a cipher suite using the server preferences, the server finds its most preferable cipher suite that the client also supports and, if that is in an equal preference group, picks whichever member of the group is the client's most preferable. For example, Google servers have a cipher suite preference that includes AES-GCM and ChaCha20-Poly1305 cipher suites in an equal preference group at the top of the preference list. So if the client supports any cipher suite in that group, then the server will pick whichever was most preferable for the client.

The end result is that Chrome talking to Google uses AES-GCM if there's hardware support at the client and ChaCha20-Poly1305 otherwise.

After a year of work, let's see what Chrome's view of TLS is now:

Cipher suite familyPercentage of TLS connections made by Chrome
AES128-GCM & ChaCha20-Poly130539.9%

What remains is standardisation work and getting the code out to the world for others. The ChaCha20-Poly1305 cipher suites are unlikely to survive the IETF without changes so we'll need to respin them at some point. As for getting the code out, I hope to have something to say about that in the coming months. Before then I don't recommend that anyone try to repurpose code from the Chromium repos - it is maintained with only Chromium in mind.


Now that TLS's symmetric crypto is working better there's another, long-standing problem that needs to be addressed: TLS fallbacks. The new, AEAD cipher suites only work with TLS 1.2, which shouldn't be a problem because TLS has a secure negotiation mechanism. Sadly, browsers have a long history of doing TLS fallbacks: reconnecting with a lesser TLS version when a handshake fails. This is because there are lots of buggy HTTPS servers on the Internet that don't implement version negotiation correctly and fallback allows them to continue to work. (It also promotes buggy implementations by making them appear to function; but fallback was common before Chrome even existed so we didn't really have a choice.)

Chrome now has a four stage fallback: TLS 1.2, 1.1, 1.0 then SSLv3. The problem is that fallback can be triggered by an attacker injecting TCP packets and it reduces the security level of the connection. In the TLS 1.2 to 1.1 fallback, AES-GCM and ChaCha20-Poly1305 are lost. In the TLS 1.0 to SSLv3 fallback, all elliptic curve support, and thus ECDHE, disappears. If we're going to depend on these new cipher suites then we cannot allow attackers to do this, but we also cannot break all the buggy servers.

So, with Chrome 33, we implemented a fallback SCSV. This involves adding a pseudo-cipher suite in the client's handshake that indicates when the client is trying a fallback connection. If an updated server sees this pseudo-cipher suite, and the connection is not using the highest TLS version supported by the server, then it replies with an error. This is essentially duplicating the version negotiation in the cipher suite list, which isn't pretty, but the reality is that the cipher suite list still works but the primary version negotiation is rusted over with bugs.

With this in place (and also implemented on Google servers) we can actually depend on having TLS 1.2, at least between Chrome and Google (for now).


Few rollouts are trouble free and Chrome 33 certainly wasn't, although most of the problems were because of a completely different change.

Firstly, ChaCha20-Poly1305 is running at reduced speed on ARM because early-revision, S3 phones have a problem that causes the NEON, Poly1305 code to sometimes calculate the wrong result. (It's not clear if it's the MSM8960 chip or the power supply in the phone.) Of course, it works well enough to run the unit tests successfully because otherwise I wouldn't have needed two solid days to pin it down. Future releases will need to run a self-test to gather data about which revisions have the problem and then we can selectively disable the fast-path code on those devices. Until then, it's disabled everywhere. (Even so, ChaCha20-Poly1305 is still much faster than AES-GCM on ARM.)

But it was the F5 workaround that caused most of the headaches. Older (but common) firmware revisions of F5 devices hang the connection when the ClientHello is longer than 255 bytes. This appeared to be a major issue because it meant that we couldn't add things to the ClientHello without breaking lots of big sites that use F5 devices. After a plea and a discussion on the TLS list, an engineer from F5 explained that it was just handshake messages between 256 and 511 bytes in length that caused the issue. That suggested an obvious solution: pad the handshake message so that it doesn't fall into the troublesome size range.

Chrome 33 padded all ClientHellos to 512 bytes, which broke at least two “anti-virus” products that perform local MITM of TLS connections on Windows and at least one “filtering” device that doesn't do MITM, but killed the TCP connections anyway. All made the assumption that the ClientHello will never be longer than some tiny amount. In the firewall case, the fallback SCSV stopped a fallback to SSLv3 that would otherwise have hidden the problem by removing the padding.

Thankfully, two of the three vendors have acted quickly to address the issue. The last, Kaspersky, has known about these sorts of problems for 18 months now (since Chrome tried to deploy TLS 1.1 and hit a similar issue) but, thankfully, doesn't enable MITM mode by default.

Overall, it looks like the changes are viable. Hopefully the path will now be easier for any other clients that wish to do the same.

Apple's SSL/TLS bug

Yesterday, Apple pushed a rather spooky security update for iOS that suggested that something was horribly wrong with SSL/TLS in iOS but gave no details. Since the answer is at the top of the Hacker News thread, I guess the cat's out of the bag already and we're into the misinformation-quashing stage now.

So here's the Apple bug:

static OSStatus
SSLVerifySignedServerKeyExchange(SSLContext *ctx, bool isRsa, SSLBuffer signedParams,
                                 uint8_t *signature, UInt16 signatureLen)
	OSStatus        err;

	if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
		goto fail;
	if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
		goto fail;
		goto fail;
	if ((err =, &hashOut)) != 0)
		goto fail;

	return err;

(Quoted from Apple's published source code.)

Note the two goto fail lines in a row. The first one is correctly bound to the if statement but the second, despite the indentation, isn't conditional at all. The code will always jump to the end from that second goto, err will contain a successful value because the SHA1 update operation was successful and so the signature verification will never fail.

This signature verification is checking the signature in a ServerKeyExchange message. This is used in DHE and ECDHE ciphersuites to communicate the ephemeral key for the connection. The server is saying “here's the ephemeral key and here's a signature, from my certificate, so you know that it's from me”. Now, if the link between the ephemeral key and the certificate chain is broken, then everything falls apart. It's possible to send a correct certificate chain to the client, but sign the handshake with the wrong private key, or not sign it at all! There's no proof that the server possesses the private key matching the public key in its certificate.

Since this is in SecureTransport, it affects iOS from some point prior to 7.0.6 (I confirmed on 7.0.4) and also OS X prior to 10.9.2 (confirmed on 10.9.1). It affects anything that uses SecureTransport, which is most software on those platforms although not Chrome and Firefox, which both use NSS for SSL/TLS. However, that doesn't mean very much if, say, the software update systems on your machine might be using SecureTransport.

I coded up a very quick test site at Note the port number (which is the CVE number), the normal site is running on port 443 and that is expected to work. On port 1266 the server is sending the same certificates but signing with a completely different key. If you can load an HTTPS site on port 1266 then you have this bug.

Because the certificate chain is correct and it's the link from the handshake to that chain which is broken, I don't believe any sort of certificate pinning would have stopped this. Also, this doesn't only affect sites using DHE or ECDHE ciphersuites - the attacker gets to choose the ciphersuite in this case and will choose the one that works for them.

Also, this doesn't affect TLS 1.2 because there's a different function for verifying the different ServerKeyExchange message in TLS 1.2. But, again, the attacker can choose any version that the client will accept. But if the client only enables TLS 1.2 then it appears that would workaround this issue. Likewise, if the client only enabled the plain, RSA ciphersuites then there's no ServerKeyExchange and that should also work around this issue. (Of the two, the former workaround is much more preferable.)

Based on my test site, both iOS 7.0.6 and OS X 10.9.2 fix the issue. (Update: it looks like the bug was introduced in 10.9 for OS X but existed in at least some versions of iOS 6. iOS 6.1.6 was released yesterday to fix it.)

This sort of subtle bug deep in the code is a nightmare. I believe that it's just a mistake and I feel very bad for whoever might have slipped in an editor and created it.

Here's a stripped down that code with the same issue:

extern int f();

int g() {
	int ret = 1;

	goto out;
	ret = f();

	return ret;

If I compile with -Wall (enable all warnings), neither GCC 4.8.2 or Clang 3.3 from Xcode make a peep about the dead code. That's surprising to me. A better warning could have stopped this but perhaps the false positive rate is too high over real codebases? (Thanks to Peter Nelson for pointing out the Clang does have -Wunreachable-code to warn about this, but it's not in -Wall.)

Maybe the coding style contributed to this by allowing ifs without braces, but one can have incorrect indentation with braces too, so that doesn't seem terribly convincing to me.

A test case could have caught this, but it's difficult because it's so deep into the handshake. One needs to write a completely separate TLS stack, with lots of options for sending invalid handshakes. In Chromium we have a patched version of TLSLite to do this sort of thing but I cannot recall that we have a test case for exactly this. (Sounds like I know what my Monday morning involves if not.)

Code review can be effective against these sorts of bug. Not just auditing, but review of each change as it goes in. I've no idea what the code review culture is like at Apple but I strongly believe that my colleagues, Wan-Teh or Ryan Sleevi, would have caught it had I slipped up like this. Although not everyone can be blessed with folks like them.

Lastly, there was a lot of discussion yesterday that Apple missed checking the hostname in the certificate. It's true that curl on the OS X command line oddly accepts HTTPS connections to IP addresses when the IP address isn't in the certificate, but I can't find that there's anything more than that and Safari doesn't have that problem.

Implementing Elligator for Curve25519

There are some situations where you would like to encode a point on an elliptic curve in such a way that it appears to be a uniform, random string. This comes up in some password authenticated key exchange systems, where one doesn't want to leak any bits of the password. It also comes up in protocols that are trying to be completely uniform so as to make life hard for network censors.

For the purposes of this post, I'm only going to be considering Curve25519. Recall that Curve25519 is the set of points (u,v) where v2 = u3 + 486662x2 + u and u and v are taken from GF(q=2255-19). Curve25519 works with a Montgomery ladder construction and thus only exchanges v coordinates. So, when sending a v coordinate there are two ways that an attacker can distinguish it from a random string:

Firstly, since the field is only 255 bits, the 256th bit is always zero. Thus if an attacker sees a series of 32-byte strings where the top bit of the last byte is always zero, then they can be confident that they are not random strings. This is easy to fix however, just XOR in a random bit and mask it out before processing.

Secondly, the attacker can assume that a 32-byte string is a v coordinate and check whether v3 + 486662x2 + v is a square. This will always be true if the strings are v coordinates, by the curve equation, but will only be true 50% of the time otherwise. This problem is a lot harder to fix.

Thankfully; djb, Tanja Lange, Mike Hamburg and Anna Krasnova have published Elligator [paper]. In that paper they present two maps from elliptic curve points to uniform, random strings and you should, at least, read the introduction to the paper, which contains much more detail on the use cases and previous work.

Elligator 2 is suitable for Curve25519 and there are some hints about an implementation in section 5.5. However, this blog post contains my own notes about implementing each direction of the map with only a couple of exponentiations. You may need to read the paper first before the following makes much sense.

I'll start with the inverse map, because that's what's used during key-generation. Throughout I'll write (x, y) for affine coordinates on the twisted, Edwards isomorphism of Curve25519, as defined for Ed25519; (u,v) for affine coordinates on Curve25519 itself; and X, Y, Z for extended coordinates on the twisted, Edwards curve.

The inverse map, ψ-1, takes a point on the curve with the following limitations and produces a uniform, representative value:

  1. u ≠ -A. (The value A is a parameter of Curve25519 and has the value 486662.)
  2. -2u(u + A) is a square

Since we're generating the points randomly, I'm going to ignore the first condition because it happens far less frequently than malfunctions in the CPU instructions that I might use to detect it.

The second condition excludes about half the points on the curve: these points don't have any representation as a uniform string. When generating a key we need to detect and restart if we happen upon one of those points. (When working in the other direction, ψ(r) = ψ(-r), which explains why half the image is missing.)

If the point is in the image of the map then the uniform string representation, r, is defined by:

(If you can't see the equation, you need a browser that can handle inline SVG.)

So our key generation function is going to take a private key and either produce nothing (if the resulting point isn't in the image of the map), or produce a public key and the uniform representation of it. We're working with Curve25519, so the public key, if any, will exactly match the result of the scalar_base_mult operation on Curve25519.

Since I happened to have the ref10 code from Ed25519 close to hand, we'll be calculating the scalar multiplication on the twisted, Edwards isomorphism of Curve25519 that Ed25519 uses. I've discussed that before but, in short, it allows for the use of precomputed tables and so is faster than scalar_base_mult with the Curve25519 code.

The result of the scalar multiplication of the base point with the private key is a point on the twisted, Edwards curve in extended coordinates: (X : Y : Z), where x = X/Z and y = Y/Z and where (x, y) are the affine coordinates on the Edwards curve. In order to convert a point on the Edwards curve to a point on Curve25519, we evaluate:

This is looking quite expensive: first we have to convert from extended coordinates on the Edwards curve to affine, then to affine on Curve25519. In finite fields, inversions (i.e. division) are very expensive because they are done by raising the divisor to the power of q-2 (where q=2255-19). Compared to such large exponentiations, multiplications and additions are almost free.

But, we can merge all those inversions and do the whole conversion with just one. First, substitute the extended coordinate conversion: x = X/Z, y = Y/Z.


By calculating just 1/(X(Y-Z)), we can get (u,v). The reciprocal can be used directly to calculate v and then multiplied by X to calculate u.

What remains is to check whether -2u(u + A) is a square and then to calculate the correct square root. Happily, we can calculate all that (including both square roots because we want to be constant-time), with a single exponentiation.

Square roots are defined in the standard way for finite fields where q ≅ 5 mod 8:

Let's consider the square roots first. Both have a common, constant term of (-1/2) which we can ignore for now. With that gone, both are fractions that we need to raise to (q+3)/8. Section 5 of the Ed25519 paper contains a trick for merging the inversion with the exponentiation:

But let's see what else we can do with the c value.

So c7 gives us the variable part of the other square root without another large exponentiation! It only remains to multiply in the constant term of (-1/2)(q+3)/8 to each and then, for each, test whether the square root of -1 needs to be multiplied. For the first square root, for example, we should have that r2 = -u/(2(u+A)) thus 2r2(u+A)+u should be zero. If not, then we need to multiply r by sqrt(-1).

Lastly, we need to check that the point is in the image of the map at all! (An implementation may want to do this sooner so that failures can be rejected faster.) For this we need to check that -2u(u + A) is a square. We use the fact that an element of the field is a non-zero square iff raising it to (q-1)/2 results in one. Thus we need to calculate (-2)(q-1)/2(u(u+A))(q-1)/2. The first part is just a constant, but the latter would be an expensive, large exponentiation, if not for c again:

With that result we can pick the correct square root (and in constant time, of course!)

The map in the forward direction

In the forward direction, the map takes a uniform, representative value and produces a public key, which is the x coordinate of a Curve25519 point in this context: ψ(r) = u. (Remember that the affine coordinates on Curve25519 are written (u,v) here.) This is much simpler than the reverse direction, although, as best I can tell, equally expensive. Given a value, r, one can calculate the value u with:

This requires one inversion to calculate d and another, large exponentiation to calculate ε. The resulting u value will match the output of the key generation stage which, in turn, matches the value that running scalar_base_mult from the Curve25519 code would have produced for the same private key.

Note that the Elligator software page says that GMP can compute the Legendre symbol (the ε calculation) in only 9000 cycles, so I'm clearly missing some implementation trick somewhere. Additionally, the Elligator paper raises the possibility of implementing the inversions with a blinded, Euclidean algorithm. That would certainly be faster but I'm not comfortable that I know how to do that safely, so I'm sticking with exponentiation here.


I've a constant-time, pure Go implementation, based on the field operations from the pure-C, SUPERCOP, Ed25519 code. The field operations have been ported pretty mechanically from the C code, so it's hardly pretty Go code, but it seems to work.

On my, 3.1GHz, Ivy Bridge system (with TurboBoost enabled) the key generation takes 155µs and the forward map operation takes 28µs. (But remember that half of the key generations will fail, so ~double that time.)

I have not carefully reviewed the code, so beware. There are also some non-obvious concerns when using Elligator: for example, if you are hiding a uniform representative in an nonce field then the attacker might assume that it's a field element, negate it and let the connection continue. In a normal protocol, altering the nonce usually leads to a handshake failure but negating an Elligator representative is a no-op. If the connection continues then the attacker can be pretty sure that the value wasn't really random. So care is required.

So if you think that Elligator might be right for you, consult your cryptographer.

Forward security for journalists

A number of journalists have been writing stories about forward security recently. Some of them ended up talking to me and I completely failed to come up with a good metaphor that didn't cause more problems than it solved. Eventually I wrote the following as an introduction to forward security for journalists. It's deeply simplified but I think it captures the broad strokes.

It remains unclear whether anyone actually came away with a better understanding after reading it, however.

A great many metaphors have been mangled, and analogies tortured, trying to explain what forward security is. But, fundamentally, it doesn't involve anything more complex than high-school math and anybody should be able to understand it. So this document attempts to describe what forward security is.

In the 1970s, cryptography was purely the concern of the military. The NSA existed (unofficially) and believed that it was the sole, legitimate center of cryptographic knowledge in the US. In this environment, Whitfield Diffie set out from MIT across the country to try and gather up what scraps of cryptography has escaped into the public realm. If we omit huge amounts of detail (see Steven Levy's book, “Crypto” for the details) we can say that he ended up at Stanford with a revolutionary idea in his head: public key cryptography.

Up to that point, cryptography had involved secret keys that both the sender and receiver were in possession of. Diffie wondered about a system where one could publish part of a key (the public part) and keep another part of the key private. However, he didn't have any concrete implementation of this.

At Stanford he found Martin Hellman and, together they published New Directions in Cryptography (1976), which contained the Diffie-Hellman key-agreement algorithm. This is possibly the most significant work in cryptographic history and is still a core algorithm today. (Later we learnt that a researcher at GCHQ had discovered the idea a few years earlier. But GCHQ didn't do anything with it and never published it.)

We'll need modular arithmetic, which is like working with times. What's 8pm plus five hours? One o'clock. Because 8+5=13 and 13 divided by 12 has a remainder of one. Likewise, 8pm plus 17 hours is still one o'clock, because adding multiples of 12 doesn't make any difference to a clock. We'll be using the same idea, but with numbers (called the modulus) other than 12.

To explain Diffie-Hellman we'll work modulo 11. So 5+10=4 now, because 15 divided by 11 has a remainder of 4.

Next, exponentiation: 23 means 2, multiplied by itself three times. So 23=2×2×2=8. Now we have everything that we need to implement Diffie-Hellman.

In classic, cryptographic tradition we'll imagine two parties and we'll call them Alice and Bob. Alice generates a random number between 1 and 9 and calls it a. This is her private key; let's say that it's five. Alice calculates her public key as A=2a=25=32=10 mod 11.

Bob does the same but his private key is called b and his public key is B. Let's say that his private key is six which means that B=2b=26=64=9 mod 11.

Alice and Bob can freely publish A and B because, given them, no other party can work backwards and calculate their private keys. Well, in this case they can because we're using a toy example with a tiny modulus. A real modulus has over 500 digits. Then it's not possible with any current computer.

Once Alice has Bob's public key, or vice-versa, she can perform key-agreement. She combines her private key with Bob's public key like this: Ba=95=59049=1 mod 11. Bob can combine his private key with Alice's public key in the same way: Ab=106=1000000=1 mod 11. Both Alice and Bob arrived at the same, shared value (one) which nobody else can calculate. And without exchanging any secret information!

This was a huge breakthrough but it still left something to be desired: signatures. It seemed that it would be very useful for Alice to be able to calculate some function of a message that could be broadcast and would serve as a signature. So it had to be something that only Alice (or, rather, the entity holding the private part of Alice's public key) could have generated. Diffie-Hellman doesn't provide this.

In 1977, three researchers at MIT, Rivest, Shamir and Adleman, produced such a scheme. Now it's called by their initials: RSA. Again, the basics are high-school math.

We'll be working with modular arithmetic again but this time it'll be modulo the product of two primes. Like the example above, the numbers involved are usually huge, but we're going to use tiny numbers for the purposes of the examples.

So our two primes will be p=11 and q=17. This will serve as Alice's RSA private key. In order to calculate an RSA public key, one calculates n=11×17=187.

The core of RSA is that, when working modulo 187, it's very easy to calculate the cube of a number. But, unless you know that 187=11×17, it's very difficult to calculate the cube root of a number. Let's say that our message is 15. The cube is easy: 153=3375=9 mod 187. But to calculate the cube root you need to calculate the modular inverse of 3 and (p-1)(q-1)=10×16=160. We're getting a little past high-school math here so trust me that it's 107. By calculating 9107 mod 187 we find that we get our original number again: 15.

Think about what this means. If Bob knows that Alice's public key is 187, he can cube a message and send it to Alice and only Alice can calculate the cube root and recover the message. Also, if Alice starts with a message, she can calculate the cube root of it and publish that. Everyone else can be convinced that only Alice could have calculated the cube root of the message and so that functions as a digital signature of the message!

Now we have everything that we need to understand forward security so hang on while we wrap it all up.

RSA and Diffie-Hellman are both too slow to use for large amounts of data. Bulk data transfer uses secret-key cryptography: the old style cryptography where both sides need to know the secret key in order to process the ciphertext. But, in order to get the secret key to the other party, RSA or Diffie-Hellman is used.

Traditional HTTPS has the browser use RSA to transmit a secret key to the server. The browser picks a random, secret key, cubes it modulo the server's RSA modulus and sends it to the server. Only the server knows the two numbers that, multiplied together, equal its modulus so only the server can calculate the cube root and get the secret key for the connection.

However, if anyone were to record the HTTPS connection and later learn those two numbers (called factors) then they can retrospectively decrypt the recorded connections. They can also watch new connections to the server and decrypt those because they can calculate cube roots just as well as the real server can.

With forward security, the method of establishing the secret key for the connection is changed. When a client connects, the server generates a brand new Diffie-Hellman private key just for that connection. (It's called the ephemeral private key.) Then the server calculates its Diffie-Hellman public key (by 2a mod m, remember?) signs the public key with its RSA key (by calculating the cube root of it, modulo its RSA modulus) and sends all that to the browser. The browser verifies the signature (by cubing it and checking that the result equals the public key) and then generates its own Diffie-Hellman private and public keys. Now, the client has the server's Diffie-Hellman public key, so it can combine it with its Diffie-Hellman private key and calculated the shared-secret. It sends its Diffie-Hellman public key to the server and the server ends up with the same, shared secret. That secret is used as the secret key for the connection.

The advantage here is that, should the server's, long-term, RSA private key be compromised in the future it isn't as bad. The confidentiality of the connections is protected by the Diffie-Hellman secrets which were generated fresh for the connection and destroyed immediately afterwards. Also, the attacker, even with the RSA private key cannot passively decrypt connections in real-time. The attacker has to impersonate the server, which is much more work.

That's why forward-secrecy is often written as DHE: it stands for Diffie-Hellman ephemeral. There's also ECDHE (Elliptic Curve, Diffie-Hellman ephemeral) which is the same idea in a more efficient mathematical structure.


It seems that secure messaging projects are all the rage right now. Heck, Matthew Green is writing about it in the New Yorker this week!

One of my side-projects for a year or so has been about secure messaging. Of course, that means that I was working on it before it was cool (insert hipster emoticon here, :{D ) … but I'm slow because my side projects get very little time. (When I started, the subheading for the project was “How to better organise a discreet relationship with the Director of the CIA”, because that was the topical reference at the time. Oh, how things change.)

It's still not really ready, but the one year mark seemed like a good deadline to give myself to mention it in public. You should pay heed to the warning on the project page however. I did contact one, well known, security auditing company to see whether they would take a couple thousand dollars to spend a day on it, but they said that don't really do small projects and that I'd need to be talking 10-20 times that for a couple of weeks at least. So don't depend on it yet and don't be surprised if I've just broken it somehow by replacing the ratchet code. (Although, thanks, Trevor, for the design.)

But I fear that it's not going to make Matthew Green happy! It's certainly not a drop-in replacement for email. I spend my days working on small changes with a large multiplier, in my free time I choose to shift the balance completely the other way.

  • Ephemeral messages: although no software can prevent a recipient from making a copy of a message, it's the social norm and technical default in Pond that messages are erased after a week.
  • No spam: only approved contacts can send you messages. In order for the server to enforce that without revealing the identity of every sender, pairing-based, group signatures are used. (The flip side is that there are no public addresses, so you can't message someone that you don't already know.)
  • Forward security: the exchange of messages drives a Diffie-Hellman ratchet preventing a point-in-time compromise from giving the attacker the ability to decrypt past messages (modulo the week-long retention).
  • Erasure storage: since B-tree filesystems like btrfs and log-structured SSDs mean that one can have very little confidence in being able to securely erase anything, Pond tries to use the TPM NVRAM to securely delete old state. (Note: not implemented for OS X yet.)
  • One-to-one only: No messages to multiple contacts (yet). Haven't implemented it.

The source code is in Go, but there are binaries for several Linux flavors and OS X. (Although it looks terrible on OS X and one user reports that the program can only be closed by killing it from the command line: I've not reproduced that yet.) In addition to the default server, I learned recently that the Wau Holland Foundation also runs a Pond server.

For more details, see the user guide, threat model etc which are linked from the project page.

If you get it running, but don't know anyone else, you can use:

gpg --recv-key C92172384F387DBAED4D420165EB9636F02C5704

to get my public key and email me a shared-secret. (Although I only check that email during the evenings.)

Please update F5/BIG-IP firmware

Update (Dec 7th): an F5 engineer commented on the TLS WG mailing list describing the internals of the hang bug. Based on this, we realised that it would be possible to work around the issue by padding ClientHello messages of a certain size. We now have a design for doing this, it's implemented in Chrome and it appears to be working! So, while you should always keep software up to date, it appears that the Internet can dodge this bug!

If you use F5/BIG-IP devices to terminate SSL connections, please update the firmware on the things! We're trying to run an Internet here and old versions of these devices are a real problem for deploying new TLS features. You need to be running at least version 10.2.4 (as far as I know), but running the latest version is generally good advice.

If you just try to connect to these sites with a recent version of OpenSSL, you should find that the connection hangs - which is terrible. We can detect a server that returns an error, but hanging the connection isn't something we can generally work around.

$ openssl version
OpenSSL 1.0.1e 11 Feb 2013
$ openssl s_client -connect
$ openssl s_client -connect -tls1
New, TLSv1/SSLv3, Cipher is AES256-SHA
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE

I did have a long list of major sites that were affected by this issue here. I've removed it because some of them had updated and, because of the update at the top, it's no longer a crippling problem.

ChaCha20 and Poly1305 for TLS

Today, TLS connections predominantly use one of two families of cipher suites: RC4 based or AES-CBC based. However, in recent years both of these families of cipher suites have suffered major problems. TLS's CBC construction was the subject of the BEAST attack (fixed with 1/n-1 record splitting or TLS 1.1) and Lucky13 (fixed with complex implementation tricks). RC4 was found to have key-stream biases that cannot effectively be fixed.

Although AES-CBC is currently believed to be secure when correctly implemented, a correct implementation is so complex that there remains strong motivation to replace it.

Clearly we need something better. An obvious alternative is AES-GCM (AES-GCM is AES in counter mode with a polynomial authenticator over GF(2128)), which is already specified for TLS and has some implementations. Support for it is in the latest versions of OpenSSL and it has been the top preference cipher of Google servers for some time now. Chrome support should be coming in Chrome 31, which is expected in November. (Although we're still fighting to get TLS 1.2 support deployed in Chrome 30 due to buggy servers.)

AES-GCM isn't perfect, however. Firstly, implementing AES and GHASH (the authenticator part of GCM) in software in a way which is fast, secure and has good key agility is very difficult. Both primitives are suited to hardware implementations and good software implementations are worthy of conference papers. The fact that a naive implementation (which is also what's recommended in the standard for GHASH!) leaks timing information is a problem.

AES-GCM also isn't very quick on lower-powered devices such as phones, and phones are now a very important class of device. A standard phone (which is always defined by whatever I happen to have in my pocket; a Galaxy Nexus at the moment) can do AES-128-GCM at only 25MB/s and AES-256-GCM at 20MB/s (both measured with an 8KB block size).

Lastly, if we left things as they are, AES-GCM would be the only good cipher suite in TLS. While there are specifications for AES-CCM and for fixing the AES-CBC construction, they are all AES based and, in the past, having some diversity in cipher suites has proven useful. So we would be looking for an alternative even if AES-GCM were perfect.

In light of this, Google servers and Chrome will soon be supporting cipher suites based around ChaCha20 and Poly1305. These are primitives developed by Dan Bernstein and are fast, secure, have high quality, public domain implementations, are naturally constant time and have nearly perfect key agility.

On the same phone as the AES-GCM speeds were measured, the ChaCha20+Poly1305 cipher suite runs at 92MB/s (which should be compared against the AES-256-GCM speed as ChaCha20 is a 256-bit cipher).

In addition to support in Chrome and on Google's servers, myself and my colleague, Elie Bursztein, are working on patches for NSS and OpenSSL to support this cipher suite. (And I should thank Dan Bernstein, Andrew M, Ted Krovetz and Peter Schwabe for their excellent, public domain implementations of these algorithms. Also Ben Laurie and Wan-Teh Chang for code reviews, suggestions etc.)

But while AES-GCM's hardware orientation is troublesome for software implementations, it's obviously good news for hardware implementations and some systems do have hardware AES-GCM support. Most notably, Intel chips have had such support (which they call AES-NI) since Westmere. Where such support exists, it would be a shame not to use it because it's constant time and very fast (see slide 17). So, once ChaCha20+Poly1305 is running, I hope to have clients change their cipher suite preferences depending on the hardware that they're running on, so that, in cases where both client and server support AES-GCM in hardware, it'll be used.

To wrap all this up, we need to solve a long standing, browser TLS problem: in order to deal with buggy HTTPS servers on the Internet (of which there are many, sadly), browsers will retry failed HTTPS connections with lower TLS version numbers in order to try and find a version that doesn't trigger the problem. As a last attempt, they'll try an SSLv3 connection with no extensions.

Several useful features get jettisoned when this occurs but the important one for security, up until now, has been that elliptic curve support is disabled in SSLv3. For servers that support ECDHE but not DHE that means that a network attacker can trigger version downgrades and remove forward security from a connection. Now that AES-GCM and ChaCha20+Poly1305 are important we have to worry about them too as these cipher suites are only defined for TLS 1.2.

Something needs to be done to fix this so, with Chrome 31, Chrome will no longer downgrade to SSLv3 for Google servers. In this experiment, Google servers are being used as an example of non-buggy servers. The experiment hopes to show that networks are sufficiently transparent that they'll let at least TLS 1.0 through. We know from Chrome's statistics that connections to Google servers do end up getting downgraded to SSLv3 sometimes, but all sorts of random network events can trigger a downgrade. The fear is that there's some common network element that blocks TLS connections by deep-packet inspection, which we'll measure by breaking them and seeing how many bug reports we get.

If that works then, in Chrome 32, no fallbacks will be permitted for Google servers at all. This stage of the experiment tests that the network is transparent to TLS 1.2 by, again, breaking anything that isn't and seeing if it causes bug reports.

If both those experiments work then, great! We can define a way for servers to securely indicate that they don't need version fallback. It would be nice to have used the renegotiation extension for this, but I think that there are likely already too many broken servers that support that, so another SCSV is probably needed.

If we get all of the above working then we're not in too bad a state with respect to TLS cipher suites. At least, once most of the world has upgraded in any case.

Playing with the Certificate Transparency pilot log

I've written about Certificate Transparency several times in the past, but now there's actually something to play with! (And, to be clear, this is entirely thanks to several of my colleagues, not me!) So I though it would be neat to step through some simple requests of the pilot log.

I'm going to be using Go for this, because I like it. But it's substantially just JSON and HTTP and one could do it in any language.

In order to query a log you need to know its URL prefix and public key. Here's a structure to represent that.

// Log represents a public log.
type Log struct {
	Root string
	Key *ecdsa.PublicKey

For the pilot log, the URL prefix is and here's the public key:

-----END PUBLIC KEY-----

Since it's a little obscure, here's the code to parse a public key like that. (This code, and much of the code below, is missing error checking because I'm just playing around.)

	block, _ := pem.Decode([]byte(pilotKeyPEM))
	key, _ := x509.ParsePKIXPublicKey(block.Bytes)
	pilotKey = key.(*ecdsa.PublicKey)
	pilotLog = &Log{Root: "", Key: pilotKey}

The log is a tree, so the first thing that we want to do with a log is to get its head. This is detailed in section 4.3 of the RFC: we make an HTTP GET to get-sth and it returns a JSON blob.

Any HTTP client can do that, so give it a go with curl on the command line:

$ curl 2>>/dev/null | less

You should get a JSON blob that, with a bit of formatting, looks like this:

	"tree_size": 1979426,
	"timestamp": 1368891548960,
	"sha256_root_hash": "8UkrV2kjoLcZ5fP0xxVtpsSsWAnvcV8aPv39vh96J2o=",
	"tree_head_signature": "BAMARjBEAiAc95/ONhz2vQsULrISlLumvpo..."
In Go, we'll parse that into a structure like this:

// Head contains a signed tree head.
type Head struct {
	Size uint64 `json:"tree_size"`
	Time time.Time `json:"-"`
	Hash []byte `json:"sha256_root_hash"`
	Signature []byte `json:"tree_head_signature"`
	Timestamp uint64 `json:"timestamp"`

And here's some code to make the HTTP request, check the signature and return such a structure:

func (log *Log) GetHead() (*Head, error) {
	// See
	resp, err := http.Get(log.Root + "/ct/v1/get-sth")
	if err != nil {
		return nil, err

	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		return nil, errors.New("ct: error from server")
	if resp.ContentLength == 0 {
		return nil, errors.New("ct: body unexpectedly missing")
	if resp.ContentLength > 1<<16 {
		return nil, errors.New("ct: body too large")
	data, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	head := new(Head)
	if err := json.Unmarshal(data, &head); err != nil {
		return nil, err

	head.Time = time.Unix(int64(head.Timestamp/1000), int64(head.Timestamp%1000))

	// See
	if len(head.Signature) < 4 {
		return nil, errors.New("ct: signature truncated")
	if head.Signature[0] != hashSHA256 {
		return nil, errors.New("ct: unknown hash function")
	if head.Signature[1] != sigECDSA {
		return nil, errors.New("ct: unknown signature algorithm")

	signatureBytes := head.Signature[4:]
	var sig struct {
		R, S *big.Int

	if signatureBytes, err = asn1.Unmarshal(signatureBytes, &sig); err != nil {
		return nil, errors.New("ct: failed to parse signature: " + err.Error())
	if len(signatureBytes) > 0 {
		return nil, errors.New("ct: trailing garbage after signature")
	// See
	signed := make([]byte, 2 + 8 + 8 + 32)
	x := signed
	x[0] = logVersion
	x[1] = treeHash
	x = x[2:]
	binary.BigEndian.PutUint64(x, head.Timestamp)
	x = x[8:]
	binary.BigEndian.PutUint64(x, head.Size)
	x = x[8:]
	copy(x, head.Hash)

	h := sha256.New()
	digest := h.Sum(nil)

	if !ecdsa.Verify(log.Key, digest, sig.R, sig.S) {
		return nil, errors.New("ct: signature verification failed")

	return head, nil

If one runs this code against the current pilot, we can get the same information as we got with curl: 1979426 certificates at Sat May 18 11:39:08 EDT 2013, although now the date will be newer and there will be more entries.

As you can see, the log has been seeded with some certificates gathered from a scan of the public Internet. Since this data is probably more recent than the EFF's Observatory data, it might be of interest and anyone can download it using the documented interface. Again, we can try a dummy request in curl:

$ curl '' 2>>/dev/null | less

You should see a result that starts with {"entries":[{"leaf_input":"AAAAAAE9p. You can request log entries in batches and even download everything if you wish. My toy code simply writes a single large file containing the certificates, gzipped and concatenated with a length prefix.

My toy code is go gettable from In the tools directory is ct-sync, which will incrementally download the pilot log and ct-map, which demonstrates how to look for interesting things by extracting the certificates from a local mirror file. Both will use multiple cores if possible so be sure to set GOMAXPROCS. (This is very helpful when calculating the hash of the tree since the code doesn't do that incrementally.)

The project's official code is at See the README and the main site.

And since we have the tools to look for interesting things, I spent a couple of minutes doing so:

I wrote a quick script to find ‘TURKTRUST’ certificates - leaf certificates that are CA certificates. The log only contains certificates that are valid by a fairly promiscuous root set in order to prevent spam, so we don't have to worry about self-signed certificates giving us false positives. As ever the Korean Government CA is good to highlight malpractice with 40 certificates that mistakenly have the CA bit set (and are currently valid). This issue was reported by a colleague of mine last year. Thankfully, the issuing certificate has a path length constraint that should prevent this issue from being exploited (as long as your X.509 validation checks these sort of details). The root CA is also only present in the Microsoft root set as far as I know.

A different Korean root CA (KISA) can't encode ASN.1 correctly and is issuing certificates with negative serial numbers because ASN.1 INTEGERs are negative if the most significant bit is one.

Another CA is issuing certificates with 8-byte IP addresses (I've let them know).

The ct-map.go in the repository is setup to find publicly valid certificates with names that end in .corp (which will soon not get an HTTPS indication in Chrome.)

Anyway, at the moment the log is just pre-seeded with some public certificates. The real value of CT comes when we have strong assurances that all valid certificates are in the log. That will take some time, but we're working towards it.

Hash based signatures

It's likely that Kevin Poulsen will still be alive in twenty years time, and that his encrypted message to Edward Snowden will still be saved somewhere. In fact, there are several places where people post RSA-encrypted messages and seem to assume that they're secure. Even assuming that no huge advances in factoring occur (and Joux et al have just killed discrete logs in binary fields, so it's not impossible), large, quantum computers will be able to break them.

It might be the case that building practical, quantum computers doesn't turn out to be possible (D-Wave machines don't count here), but there are people trying hard to do it. From “Deep State” (a rather fragmented book, but interesting in parts):

There is a quantum computing arms race of sorts under way. China, Israel and Russia have advanced quantum computing programs with direct aim of gaining geopolitical advantage, as does DARPA itself.

There are classified and semi-classified DARPA and Army/Air Force/Navy Research Lab programs for the potential use of quantum computers for a select set of defense technologies, including: the ability to design a perfect sensing laser for drones or satellites; the ability to design radar that can defeat counter-stealth techniques; the ability to design coatings for aircraft that truly are stealth, owning to the exploitation of quantum fluid dynamics.

Now, government agencies don't care about Wired, but they do care about those things. So it's a possibility that, in a few decades, an academic lab will be in possession of a large quantum computer and might want to pursue some light-hearted historical investigations. (Just in case, I've included the Wired message and key below in case either is deleted ☺)

Despite a common misunderstanding, quantum computers don't break all public key cryptography, and quantum cryptography isn't the answer. (Unless you have lots of money and a dedicated fiber line, and possibly not even then.) Instead post-quantum cryptography is the answer, and there's even a conference on it.

So, should we be using post-quantum algorithms now? The reason that we don't currently use them is because they generally take one or two orders of magnitude more time, space or both. For high-volume work like TLS, that puts them outside the bounds of practicality. But for an application like PGP, where the volume of messages is very low, why not? Size might be an issue but, if it takes 1 second to process a message vs 10 milliseconds, it doesn't really make a difference. You might as well dial the security up to 11.

There are a couple of things that one needs for a system like PGP: encryption and signatures. One option for building post-quantum, public-key signatures is hash-based signatures. These are actually really old! They were described by Lamport in 1979, only a couple of years after RSA. In fact, as Rompel showed, a secure signature scheme exists if and only if a secure hash-based signature scheme exists. Individual hash functions may be broken, but if hash-based signatures are broken in general then there can be no secure signatures at all!

The most primitive hash-based signature can be used only once, and only to sign a single-bit message: Alice picks two random values, x and y and publishes her public key: H(x) and H(y). In order to sign a message, Alice publishes one of the two, original values depending on the value of the bit.

Clearly this only works for a single bit message. It's also a one-time signature because once one of the original values has been published, it's insecure for the same public key to be used again.

The first extension to this scheme is to run many instances in parallel in order to sign larger messages. Fix the hash function, H, as SHA-256. Now Alice picks 512 random values, x0..255 and y0..255 and publishes the hash of each of them as her public key. Now she can sign an arbitrarily long message, m, by calculating H(m) and then publishing either xi or yi depending on the bits of H(m).

It's still one-time though. As soon as Alice signs a different message with the same public key then, for some values of i, both xi and yi have been published. This lets an attacker forge signatures for messages that Alice hasn't signed.

In order to allow signing multiple messages there's one easy solution: just have multiple one-time signatures! Let's say that we want a thousand. The issue is that the public and private keys of our one-time scheme were 512×32 bytes = 16KB and, if we want a thousand of them, then the public and private keys end up being 16MB, which is quite a lot. We can solve the private key size by generating a stream of values from a cipher with just a 32-byte seed key, but the large public key remains.

The solution to this is to use a Merkle hash tree, introduced (unsurprisingly) by Merkle, in 1979.

The the diagram, above, four leaf hashes are included in the hash tree: A, B, C and D. The hash tree is a binary tree where each non-leaf node contains the hash of its children. If Bob believes that H(H(AB)H(CD)) is Alice's public key, then anyone can convince him that A is part of Alice's public key by giving him A, B and H(CD). Bob calculates H(AB) from A and B, and then H(H(AB)H(CD)) from H(AB) and H(CD). If the values match, and if the hash function isn't broken, then A must be part of Alice's public key.

Now, if A, B, C and D were one-time signature public keys then Alice's public key has been reduced to a single hash. Each signature now needs to include the one-time signature value, the rest of the public key, and also a proof that the public key is in Alice's public key. In more detail, the signature contains either xi or yi depending on the bits of the hash of the message to be signed and also H(xi) or H(yi) for the bits that weren't used. With that information, Bob can reconstruct the one-time signature public key. Then the signature contains the path up the tree - i.e. the values B, and H(CD) in our example, which convince Bob that the one-time signature public key is part of Alice's overall public key.

So now the public and private keys have been reduced to just 32 bytes (at least if you're using a 256-bit hash). We can make the tree arbitrarily large, thus supporting arbitrarily many signatures. The signatures, however, are 16KB plus change, which is pretty chunky. So the next trick will try to address this, and it's thanks to Winternitz.

The Winternitz trick was described in Merkle's original, 1979 work and it involves iterating hashes. We started off by considering the one-time signature of a single bit and solved it by publishing H(x) and H(y) and then revealing one of the preimages. What if we generated a single random value, z, and published H(H(z))? We would reveal H(z) if the bit was one, or z if the bit was zero. But an attacker could take our signature of zero (z) and calculate H(z), thus creating a signature of one.

So we need a checksum to make sure that, for any other message, the hash function needs to be broken. So we would need two of these structures and then we would sign 01 or 10. That way, one of the hash chains needs to be broken in order to calculate the signature of a different message. The second bit is the Winternitz checksum.

As it stands, Winternitz's trick hasn't achieved anything - we're still publishing two hash values and revealing two secrets. But when we start signing larger messages it becomes valuable.

Let's say that the Winternitz hash chains were 16 values long. That means that one of them can cover four bits of the hash that we're signing. If we were signing a 256-bit value, then we only need 64 Winternitz chains and three for the checksum. Also, the Winternitz scheme makes it easy for the verifier to calculate the original public key. The signature size has gone from 16KB to 2,144 bytes. Not bad!

But! There's still a critical flaw in this! Recall that using a one-time signature twice completely breaks the system. Because of that, the signature scheme is “stateful”. This means that, when signing, the signer absolutely must record that a one-time key has been used so that they never use it again. If the private key was copied onto another computer and used there, then the whole system is broken.

That limitation might be ok in some situations, and actually means that one can build forward-secure signature schemes: schemes where signatures prior to a key compromise can still be trusted. Perhaps for a CA where the key is in an HSM that might be useful. However, for most environments it's a huge foot-cannon.

We could pick one-time public keys to use based on the hash of the message, or at random in the hope of never colliding but, in order for such schemes to be safe the tree of one-time public keys needs to be really big. Big as in, more than 2128 entries. If we use the trick of generating the private keys from the output of a cipher then, as soon as we fix the cipher private key, we have determined the whole tree. However, although we can compute any part of it, we can't compute all of it. At least not without an infinite amount of time (or close enough to infinite as makes no difference).

So we have to defer calculating parts of the tree until we actually need them in order to generate a signature. In order to defer bits of the tree, we have to split the tree up into a forest of smaller trees. The top-level tree is calculated as before and the Merkle hash of it is the public key. But rather than sign a message with the one-time leaves of that tree, we sign the root hash of a subtree. Since everything is deterministic, the subtree at any position is always the same so we're always signing the same message - which is safe.

Thus, if we wished to have a conceptual hash tree with 2160 leaves, we could split it up, say, every 16 bits. The tip of the tree would be a hash tree with 216 leaves. Those leaves would sign the root of a hash tree of another 216 leaves and so one. In order to make 2160, we would need 10 layers of such trees.

I can't have an image with all 216 leaves in each tree, so 22 will have to do. When generating the key, only the top (blue) tree needs to be calculated. When generating each signature, all the subtrees down the path to the final one-time key need to be calculated. So the signature consists of a one-time signature, from the 3rd leaf of the blue tree, of the root of the red tree (as well as Merkle path to the blue root), plus a one-time signature, from the 1st leaf of the red tree, of the root of the green tree (as well as a Merkle path to the red root), etc.

Obviously these signatures are going to be rather larger, due to all the intermediate signatures, and will take rather longer to calculate, due to calculating all the subtrees on the fly. Below is a little Haskell program that tries to estimate signature size vs signature time. It may well have some bugs because I haven't actually coded up a working example to ensure that it does actually work.

log2 = logBase 2 :: Double -> Double

-- We assume SHA-256 for the sake of this example.
hashBits = 256
hashBytes = hashBits/8
-- This is the Winternitz parameter.
w = 64
wBits = log2 w
-- These are parameters for the Winternitz signature.
l1 = fromIntegral $ ceiling (hashBits/wBits)
l2 = 1 + (fromIntegral $ floor $ log2(l1*(w-1))/wBits)
-- wWords is the number of hash chains that we'll have.
wWords = l1 + l2

-- An SSE2, SHA-256 implementation can hash at 5 cycles/byte
hashCyclesPerByte = 5
-- We assume that we can generate keystream bytes at 2 cycles/byte.
secretCyclesPerByte = 2

otsSignatureBytes = wWords * hashBytes
-- In order to generate a single OTS public key we need to calculate wWords
-- hash chains to a depth of w hashes.
otsHashOps = wWords * w
otsHashBytes = hashBytes * otsHashOps
otsHashCycles = hashCyclesPerByte * otsHashBytes
-- In order to generate the private key, we need to generate this many bytes of
-- key stream.
otsPrivateBytes = wWords * hashBytes
otsPublicBytes = otsPrivateBytes
otsPrivateCycles = secretCyclesPerByte * otsPrivateBytes

-- This is the height of a tree.
treeHeight = 8
-- To calculate the public keys for a whole tree we have to generate 2^treeHeight public keys.
treePublicKeyGenCycles = (otsPrivateCycles + otsHashCycles) * 2**treeHeight
-- To calculate the hash of the root of the tree, we first need to hash all the
-- public keys.
treePublicKeyHashCycles = otsPublicBytes * hashCyclesPerByte * 2**treeHeight
-- Then we need to hash all the hashes
treeMerkleCycles = 2**(treeHeight-1) * hashBytes * hashCyclesPerByte
treeSignatureCycles = treePublicKeyGenCycles + treePublicKeyHashCycles + treeMerkleCycles
-- As part of the signature we need to publish the OTS signature and a path up
-- the Merkle tree. The signature is the same size as a public key because the
-- Winternitz hash chain are just slightly shorter.
treeSignatureBytes = otsPublicBytes + treeHeight*hashBytes

logKeys = 160
forestHeight = fromIntegral $ ceiling (logKeys/treeHeight)
forestSignatureBytes = forestHeight * treeSignatureBytes
forestSignatureCycles = forestHeight * treeSignatureCycles
forestSignatureTime = forestSignatureCycles / 2.9e9

The Haskell code assumes that one can calculate SHA-256 at 5 cycles/byte, which is possible for concurrent hashing as is needed in this case. There are several parameters that one can play with: the Winternitz value, the size of the trees (which is assumed to be uniform here) and the size of the forest. I've taken the size of the forest to be 2160, although that only allows 216 signatures before the probability of a collision reaches 2-128. However, the overall security of the scheme is somewhat less than 128 bits because there are multiple hashes to attack concurrently.

With the parameters that I picked above, the estimated, lower-bound time for signing is 0.8 seconds (on a 2.9GHz machine) and the signatures are 34KB. The algorithm is embarrassingly parallel, so multi-core machines can cut the signing time down a lot, but the size is more difficult to improve. Changing the parameters to achieve a 20KB signature results in a signing time of nearly 30 seconds! (And it's a lower-bound of the signing time - in the real world there will be additional overhead.)

So, should everyone be using hash-based, PGP signatures now? Well, probably not, at least not for a subkey: PGP clearsigned messages would look silly with 34KB of base64-encoded noise at the bottom. Perhaps for a long-term signing key if you really want. But, over the long-term, it's generally confidentiality (i.e. encryption) that you need to maintain and I haven't talked about post-quantum encryption at all in this post. (Maybe in the future.)

But for problems like software updates, a reasonable case can be made. The size of the signature is generally trivial compared to the size of the message and software really does last for decades now. Hash-based signatures allow you to forget about any cryptographic breakthroughs to the greatest degree possible.

There's lots more to hash-based signatures than I've covered here:

  • GMSS treats the forest construction as an optimisation problem and solves for several cases.
  • W-OTS+ builds on several, previous papers and changes the hash property needed from collision resistance to second preimage resistance, which is huge for classical machines! It means that the size of the hash can be halved. However, quantum computers can do a pre-image in 2n/2 time, the same amount of time that classical computers can do a collision, so it doesn't help you there. (Can quantum computers do a collision in 2n/3 time? djb says don't worry.)
  • If you don't mind keeping state, then XMSS is probably the most recent work exploring that problem that I'm aware of. (Although one might want to combine it with W-OTS+.)
  • Tahoe-LAFS has some notes on hash-based signatures, notably some from Daira Hopwood.

And those messages for anyone in possession of a large, quantum computer:

Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
Comment: GPGTools -




How to botch TLS forward secrecy

There's been some discussion about forward secret TLS recently, spurred, I think, by this piece from Declan McCullagh on CNET.

In case there are people out there wondering about turning it on, I thought it would be a good time to discuss some of the ways to mess it up!

In the case of multiplicative Diffie-Hellman (i.e. DHE), servers are free to choose their own, arbitrary DH groups. Historically some servers have had groups as small as 256-bits! Opera used to reconnect without DHE cipher suites in this case and Chrome now simply fails the connection with ERR_SSL_WEAK_SERVER_EPHEMERAL_DH_KEY. However, it's still the case that some servers use 512-bit DH groups, meaning that the connection can be broken open with relatively little effort.

So the first way to mess up forward secrecy is to use under sized DH groups. Ideally the DH group would match or exceed the RSA key size but 1024-bit DHE is arguably better than straight 2048-bit RSA so you can get away with that if you want to.

If you're using ECDHE then you don't need to worry about it being too small because the smallest EC group that clients support (P-256) is far, far stronger than 2048-bit RSA anyway.

So the next way to mess up forward secrecy is to get compromised via session resumption. TLS offers two session resumption mechanisms: session IDs (where the server and client each store their own secret state) and session tickets (where the client stores the server's state, encrypted by the server). If an attacker can obtain the session resumption information for a connection then they can decrypt the connection. (This needn't be completely true, but it is for TLS because of the way that TLS is designed.)

In the case of session IDs, the session information will either be kept in memory or kept on disk, depending on how the server is configured. (See the SSLSessionCache directive in Apache.) The forward secrecy of a connection is bounded by how long the session information is stored at the server. A small, in-memory cache likely has a high turnover rate, but a disk-cache could retain that information for a long time. Ideally one would have a medium sized, in-memory cache that turned over once a day or so.

In the case of session tickets, the server's encrypted state is transmitted to the client in the clear, so the server's session ticket key is what's protecting the connection. If you're running a single server then the session ticket key is probably generated randomly, at startup and kept in memory. So the forward secrecy of the connections is actually limited by the security of the server's memory until the server is restarted. It's possible for servers to run for many months without a restart so connections could be a lot less forward secure than you might think.

If you run several servers then they all need to share the same session ticket key otherwise they can't decrypt each other's session tickets. In Apache you would configure that with the SSLSessionTicketKeyFile directive. In this case your forward secrecy is limited by the security of that file on disk. Since your SSL private key is probably kept on the same disk, enabling forward secure cipher suites probably hasn't actually achieved anything other than changing the file that the attacker needs to steal!

So how do you run forward secrecy with several servers and support session tickets? You need to generate session ticket keys randomly, distribute them to the servers without ever touching persistent storage and rotate them frequently. However, I'm not aware of any open source servers that support anything like that.

Sudden Death Entropy Failures

During the time that the RSA patent was in force, DSA was the signature algorithm of choice for any software that didn't want to deal with patent licenses. (Which is why lots of old PGP keys are still DSA.) It has slowly disappeared since the patent expired and it appears that 4096-bit RSA is now the algorithm of choice if you're on the run from the NSA [1]. (And if you're a journalist trying to get a reply: keyid BDA0DF3C.)

But DSA can also be used with elliptic curves in the form of ECDSA and, in that form, it's likely that we'll see it return in the future, at least to some extent. SSH and GPG both support ECDSA now and CAs are starting to offer ECDSA certificates for HTTPS.

Unfortunately, DSA has an important weakness that RSA doesn't: an entropy failure leaks your private key. If you used a machine affected by the Debian entropy bug then, in that time, messages that you encrypted with RSA can be broken. But if you signed anything with a DSA key, then your private key is compromised.

The randomness in DSA is absolutely critical. Given enough signatures, leaking just a handful bits per signature is sufficient to break it. In the limit, you can make the make the mistake that Sony did and not understand that the random number needs to be generated for each message and, seemingly, just pick one and code it in. (See XKCD.)

But it doesn't need to be this way! All that is required of the nonce is that it be unique for each distinct message and secret, and we can achieve that by hashing the message and private key together. That's what Ed25519 does and I've added the option to OpenSSL to do the same. Unlike RSA and Ed25519 signatures, DSA signatures are probabilistic - signing the same message twice with the same key will result in different signatures. Since someone may be depending on that, OpenSSL also hashes in some randomness to maintain that feature.

(p.s. if you're an actual cryptographer, please take a look at the core of the code and let me know if I'm an idiot.)

Since the DSA specification says that the nonce must be randomly generated, and compliance with the spec is important for many users, this isn't enabled by default. For ECDSA keys, one needs to call EC_KEY_set_nonce_from_hash, for example. But hopefully we can measurably improve things with little effort by doing this.


Here's an example of breaking DSA with a known nonce: essentially the Sony PlayStation hack. I've used the variable names from the Wikipedia page for those that want to follow along, although I'm not using a subgroup to keep things simple.

(The example is done with Sage.)

Let's pick some (public) DSA parameters:

p = 2903
F = GF(p)
g = F(2)
n = g.multiplicative_order()

Next, generate a key pair. x is the private key and y is the public key:

x = int(F.random_element()) % n
y = g^x
(x, y)
(1282, 966)

Generate our nonce, which will be fixed for two different messages in this example:

k = int(F.random_element()) % n
kInv = inverse_mod(k, n)

The first message that we sign will be 42 and the pair (r, s) is the signature. (Normally the message would be a hash of a much larger message.) The signature is very simple: r is gk and s is (m + xr)/k:

m = 42
r = int(g^k) % n
s = ((m + x*r) * kInv) % n
(r, s)
(1401, 1168)

As an aside, we can verify the signature that we just made:

w = inverse_mod(s, n)
u1 = m*w
u2 = r*w
v = g^u1*y^u2
v == r

Next, we also sign the message 24 (which I'll call mm) and we gather the two signatures. We have s and ss, both with the same r. Since s = (m + xr)/k, we can subtract the two to get (m + xr - mm - xr)/k, which is just (m - mm)/k. Thus k = (m - mm)/(s - ss). Given k, we can rearrange the original equation for s: s = (m + xr)/k ⇒ sk = m + xr ⇒ sk - m = xr ⇒ (sk - m)/r = x. And thus we have the private key:

(r, s) = (1401, 1168)
(r, ss) = (1401, 1212)
kk = ((42 - 24)*inverse_mod(s-ss,n)) % n
xx = ((s*kk - 42) * inverse_mod(r, n))%n
print xx == x

Faster curve25519 with precomputation

(Update: based on questions, I've clearly failed to point out what this is, or rather what it isn't. Fixed base multiplication only applies when performing many operations on the same public key, i.e. key generation or perhaps some other cases. If you're working with random public keys then precomputaion cannot help I'm afraid.)

Diffie-Hellman is one of the most important public-key cryptographic primitives, even if it doesn't have the name recognition of, say, RSA. When anything is providing forward secrecy then it's very likely using Diffie-Hellman. In fact, the reason that web servers do RSA operations at all is mostly a legacy hangover. If we could design things afresh then there would be a lot of more Diffie-Hellman and everything would be a lot faster!

Of the concrete implementations of Diffie-Hellman, curve25519 is the fastest, common one. There are some faster primitives in eBACS, but the ones that are significantly faster are also significantly weaker. My machine (E5-2690, with Hyperthreading and TurboBoost disabled) can do a curve25519, arbitrary base point, scalar multiplication (a.k.a. Diffie-Hellman key agreement) in just 67.5µs.

But nobody will complain if it gets faster, right?

The core of Diffie-Hellman is calculating xy in some group. The simplest way to calculate xy is the classic square-and-multiply algorithm:

(Note: I'll be using multiplicative notation throughout because I think it's more familiar to most people, although additive is more common for the specific groups that we're dealing with here.)

Start with the number 1, which is x0 (because anything to the power of zero is 1). If we multiply by x then we get x1. So, by multiplying by x we've added one to the exponent. If we square our new number then we get x2 - squaring doubles the exponent.

Now consider the exponent as a binary number: x1 is x1b and squaring it gets us x10b. Squaring is a left shift of the exponent! Now, given any y in binary form, we can calculate xy with a series of additions by 1 and left shifts - i.e. multiplies by x and squarings.

If we have y = 1011b then we start with x0 = 1, as always, and read y left-to-right: the first bit is a one so we multiply by x to get x1b. Left shift: x10b. We already have the 0 in there so left shift again: x100b. Multiply by x: x101b. Left shift: x1010b. Multiply by x: x1011b. Done.

With square-and-multiply, the bits of the desired exponent are clocked in, bit by bit, as the squarings do the left shifting.

This is essentially what curve25519 does. It has lots of tricks to make it fast but, if you're calculating lots of powers of the same x then there are a number of tricks to make things faster. I'm only going to cover a couple of basic ones here.

Firstly, one can precalculate x1, x2, x3, … x15. Now, by multiplying by one of those values we can add any value from 0 to 15 to the exponent in a single multiplication. We're essentially clocking in four bits at a time. After one multiplication we have to do four squarings to left shift 4 bits to make space for the next 4 bits to be clocked in. This reduces the number of multiplications by a factor of 4.

Next, imagine that our exponents were always 8-bits long. We could precalculate x00000000b, x00000001b, x00010000b and x00010001b. By multiplying by one of those values we can clock in bits in two positions at once. Previously we always clocked in bits at the right hand side of the exponent, but those precomputed powers mean that we can clock in bits at the forth bit too. Thus we deal with the exponent as two, 4-bit values and the squarings are shared between them. We only have to square and multiply four times to cover the whole 8-bit value.

In cryptography, the powers are usually larger than 8-bits, of course, but you can clock in at 4 or 8 positions if you wish. It's a classic time/memory trade-off because more positions mean exponentially more precomputed values.

(There's more and better tricks than the two that I've just outlined, some are detailed on this Wikipedia page.)

Precomputation in curve25519

The reason that such precomputation hasn't been applied to curve25519 before is (I think), because of a detail that I glossed over before: curve25519 is actually a Montgomery curve that uses differential addition. Rather than being able to multiply arbitrary values, a and b, you instead need to also know the value of a/b first. That throws a spanner in the works of the typical optimisations.

If you review elliptic curves, you'll recall that operations on the elliptic curve group consist of operations on the underlying field. In curve25519's case it's the field GF(2255-19), which is where it gets its name from. We can roughly measure the amount of time that something will take by the number of field operations required. (Note to practitioners: I'm taking a field squaring as 0.8 multiplications and one multiplication is an ‘operation’.)

A curve25519 exponent is 255 bits long. Each squaring takes 3.6 operations and each multiplication by x takes 4.6 operations. 255*(3.6+4.6) is 2091 operations. We also need a final conversion (a field inversion) that costs 215.2 operations for a total of 2306.2. That's the number to beat.

In order to work in a group which has nice, arbitrary multiplication operations we map curve25519 to an isomorphic, Twisted Edwards curve. In fact, exactly the same curve as used by Ed25519, which is a fast signature primitive. This also means that we get to use Ed25519's highly optimised code. In order to be able to use as much code as possible, we use exactly the same precomputation scheme as described in section 4 of the paper.

This scheme is based on 256-bit exponents, which it splits into 64, 4-bit chunks. But it uses lots of precomputed values! For every other 4-bit chunk, it precomputes all 16 possible values. So the first subtable contains the 16 values x0000b, x0001b, x0010b, …, x1111b. The next subtable contains the 16 values for the next but one 4-bit chunk. So that's x0000 0000 0000b, x0001 0000 0000b, x0010 0000 0000b, …, x1111 0000 0000b. It has 32 of those subtables!

It picks one value from each subtable and, with 32 multiplications, takes care of half the bits of the exponent. Then it squares four times to left-shift those values into place and does another 32 multiplications from the same subtables to take care of the other half of the bits. That's a total of 64 multiplications and 4 squarings.

(It actually uses a trick which means that it only needs to store 8 of the 16 values in each subtable, but that's outside the scope of this post.)

We're in a different group now and so the costs are different: multiplications cost 6 operations and squarings cost 7. (The field is the same so an ‘operation’ takes the same time.) 64*6 + 4*7 = 412, so that's looking good. We need another field inversion to finish up, same as curve25519, for a total cost of 627.2. Those rough costs suggest that the Twisted Edwards precomputation should be 3.6x faster, which is promising!

(Updated: I originally suggested that two field inversions were required because I'm an idiot. Thanks to @CodesInChaos for asking why.)

The actual timings

Based on the amd64-64-24k implementation of Ed25519 in SUPERCOP, it's fairly easy to implement. On the same machine as the curve25519 speed was measured, above, the precomputed curve25519 runs in 21.9µs, a 3.1x speedup with 24KB of tables. So it's not as fast as the rough calculations suggested, but that's because we're now processing 24KB of memory. That cache pressure isn't free of course, although benchmarks make it look mostly free because they run in a tight loop and so the tables will always be in L1 cache. Still, it means that CPU can sustain 350,000 public key operations per second.

(Code still too messy to release. I hope to tidy it up next week.)


Since its inception, SPDY has depended on a TLS extension called NPN. NPN allows a TLS connection to negotiate which application-level protocol will be running across it.

NPN allows SPDY to be enabled efficiently. If we had run SPDY on a different port, then we would have had to be constantly creating probing connections to see whether a site supported SPDY as well as HTTPS. Even if we knew that a site supported SPDY, network devices between any given client and that site might block connections to the different TCP port. If we had tried an HTTP Upgrade header, that would have slowed everything down and caused compatibility issues with servers and proxies that didn't process the header correctly.

NPN also allows us to update SPDY without spending round trips on a version negotiation. Overall, NPN has worked very well for SPDY.

NPN attempted to be a little bit future proof by sending the selected application protocol name under encryption, so that network devices couldn't discriminate. The benefit was somewhat limited because the server's list of supported protocols was still sent in the clear but we believe that anything that can be encrypted, should be encrypted.

There is an alternative to NPN: ALPN is essentially the same design except that the negotiation is done in the clear (like other TLS extensions).

Last Friday, at IETF 86 in Orlando, the TLS working group considered both designs and came to a rough consensus on ALPN. ALPN is currently on track to be published as an RFC at some point and we will be switching SPDY over to it and deprecating NPN.

Once IANA has assigned a TLS extension number for ALPN, Google servers will start supporting both NPN and ALPN, with a preference for ALPN. Chrome and, I expect, other browsers will start sending both NPN and ALPN extensions. During this time, SPDY servers will be able to switch from NPN to ALPN without dropping SPDY support for current clients.

At some point after the end of 2014, I plan on removing NPN support from Chrome and Google servers. Any old servers and clients will continue to function just fine: they'll just use HTTPS.

Lucky Thirteen attack on TLS CBC

In an upcoming paper (made public this morning), Nadhem AlFardan and Kenny Paterson describe another method of performing Vaudenay's attack on CBC as used in TLS. Firstly I'd like to thank the researchers for notifying the various vendors ahead of time so that patches could be prepared: the disclosure process has gone very smoothly in this case. I couldn't have asked for anything more - they did everything right.

Vaudenay's attack requires an attacker to be able to detect when a CBC padding check has succeeded, even if the authentication check then fails. Authentication should always be applied after encryption to avoid this, but TLS famously did it the wrong way round with CBC mode.

Knowing whether a padding check succeeded reveals information about the decrypted plaintext. By tweaking the ciphertext over many trials, it's possible to progressively decrypt an unknown ciphertext. For details, see their paper, which does a better job of explaining it than I would.

Vaudenay first used the fact that these two situations (padding check failure and authenticator failure) resulted in different TLS alert values, although that didn't result in a practical attack because TLS errors are encrypted. Once that was corrected (by specifying that the same alert value should be sent in each case) the next year's paper used a timing-side channel: if the authentication check wasn't performed when the padding check failed then, by timing the server's response, an attacker could tell whether the server aborted processing early.

To try and remove that side-channel, TLS stacks perform the authentication check whether or not the padding check succeeds. However, here's what I commented in the Go TLS code:

Note that we still have a timing side-channel in the MAC check, below. An attacker can align the record so that a correct padding will cause one less hash block to be calculated. Then they can iteratively decrypt a record by breaking each byte. However, our behavior matches OpenSSL, so we leak only as much as they do.

That pretty much sums up the new attack: the side-channel defenses that were hoped to be sufficient were found not to be (again). So the answer, this time I believe, is to make the processing rigorously constant-time. (Details below.)

As a practical matter, since a padding or authenticator check failure is fatal to a TLS connection, performing this attack requires a client to send the same plaintext secret on thousands of different connections to the same server. This isn't a trivial obstacle but it's possible to meet this requirement for cookies with a browser and bit of Javascript injected into any origin in the same session.

For DTLS the attack is much easier because a rejected record doesn't cause the connection to be destroyed and the same authors developed a method for amplifing timing attacks against DTLS in a