ImperialViolet

Adam Langley's Weblog

HSTS for new TLDs (06 Jul 2014)

Whatever you might think of them, the new TLDs are rapidly arriving. New TLDs approved so far this month include alsace, sarl, iinet, poker, gifts, restaurant, fashion, tui and allfinanz. The full list for last month is over twice as long.

That means that there's lots of people currently trying to figure out how to differentiate themselves from other TLDs. Here's an idea: why not ask me to set HSTS for the entire TLD? That way, every single site runs over HTTPS, always. It strikes me that could be useful if you're trying to build trust with users unfamiliar with the zoo of new domains.

(I can't speak for Firefox and Safari but I think it's safe to assume that Firefox would be on board with this. It's still unclear whether IE's support for HSTS will include preloading.)

I'm guessing that, with such a large number of new TLDs, I should be able to reach at least some operators of them via this blog post.

Encrypting streams (27 Jun 2014)

When sending data over the network, chunking is pretty much a given. TLS has a maximum record size of 16KB and this fits neatly with authenticated encryption APIs which all operate on an entire message at once.

But file encryption frequently gets this wrong. Take OpenPGP: it bulk encrypts the data and sticks a single MAC on the end. Ideally everyone is decrypting to a temporary file and waiting until the decryption and verification is complete before touching the plaintext, but it takes a few seconds of searching to find people suggesting commands like this:

gpg -d your_archive.tgz.gpg | tar xz

With that construction, tar receives unauthenticated input and will happily extract it to the filesystem. An attacker doesn't (we assume) know the secret key, but they can guess at the structure of the plaintext and flip bits in the ciphertext. Bit flips in the ciphertext will produce a corresponding bit flip in the plaintext, followed by randomising the next block. I bet some smart attacker can do useful things with that ability. Sure the gpg command will exit with an error code, but do you think that the shell script writer carefully handled that case and undid the changes to the filesystem?

The flaw here isn't in CFB mode's malleability, but in OpenPGP forcing the use of unauthenticated plaintext in practical situations. (Indeed, if you are ever thinking about the malleability of ciphertext, you have probably already lost.)

I will even claim that the existance of an API that can operate in a streaming fashion over large records (i.e. will encrypt and defer the authenticator and will decrypt and return unauthenticated plaintext) is a mistake. Not only is it too easy to misunderstand and misuse (like the gpg example above) but, even if correctly buffered in a particular implementation, the existance of large records may force other implementations to do dangerous things because of a lack of buffer space.

If large messages are chunked at 16KB then the overhead of sixteen bytes of authenticator for every chunk is only 0.1%. Additionally, you can safely stream the decryption (as long as you can cope with truncation of the plaintext).

Although safer in general, when chunking one has to worry that an attacker hasn't reordered chunks, hasn't dropped chunks from the start and hasn't dropped chunks from the end. But sadly there's not a standard construction for taking an AEAD and making a scheme suitable for encrypting large files (AERO might be close, but it's not quite what I have in mind). Ideally such a scheme would take an AEAD and produce something very like an AEAD in that it takes a key, nonce and additional data, but can safely work in a streaming fashion. I don't think it need be very complex: take 64 bits of the nonce from the underlying AEAD as the chunk number, always start with chunk number zero and feed the additional data into chunk zero with a zero byte prefix. Prefix each chunk ciphertext with a 16 bit length and set the MSB to indicate the last chunk and authenticate that indication by setting the additional data to a single, 1 byte. The major worry might be that for many underlying AEADs, taking 64 bits of the nonce for the chunk counter leaves one with very little (or none!) left.

That requires more thought before using it for real but, if you are ever building encryption-at-rest, please don't mess it up like we did 20 years ago. (Although, even with that better design, piping the output into tar may still be unwise because an attacker can truncate the stream at a chunk boundary: perhaps eliminating important files in the process.)

Update: On Twitter, zooko points to Tahoe-LAFS as an example of getting it right. Additionally, taking the MAC of the current state of a digest operation and continuing the operation has been proposed for sponge functions (like SHA-3) under the name MAC-and-continue. The exact provenance of this isn't clear, but I think it might have been from the Keccak team in this paper. Although MAC-and-continue doesn't allow random access, which might be important for some situations.

BoringSSL (20 Jun 2014)

Earlier this year, before Apple had too many goto fails and GnuTLS had too few, before everyone learnt that TLS heart-beat messages were a thing and that some bugs are really old, I started a tidy up of the OpenSSL code that we use at Google.

We have used a number of patches on top of OpenSSL for many years. Some of them have been accepted into the main OpenSSL repository, but many of them don’t mesh with OpenSSL’s guarantee of API and ABI stability and many of them are a little too experimental.

But as Android, Chrome and other products have started to need some subset of these patches, things have grown very complex. The effort involved in keeping all these patches (and there are more than 70 at the moment) straight across multiple code bases is getting to be too much.

So we’re switching models to one where we import changes from OpenSSL rather than rebasing on top of them. The result of that will start to appear in the Chromium repository soon and, over time, we hope to use it in Android and internally too.

There are no guarantees of API or ABI stability with this code: we are not aiming to replace OpenSSL as an open-source project. We will still be sending them bug fixes when we find them and we will be importing changes from upstream. Also, we will still be funding the Core Infrastructure Initiative and the OpenBSD Foundation.

But we’ll also be more able to import changes from LibreSSL and they are welcome to take changes from us. We have already relicensed some of our prior contributions to OpenSSL under an ISC license at their request and completely new code that we write will also be so licensed.

(Note: the name is aspirational and not yet a promise.)

Early ChangeCipherSpec Attack (05 Jun 2014)

OpenSSL 1.0.1h (and others) were released today with a scary looking security advisiory and that's always an event worth looking into. (Hopefully people are practiced at updating OpenSSL now!)

Update: the original reporter has a blog post up. Also, I won't, personally, be answering questions about specific Google services. (I cut this blog post together from notes that I'm writing for internal groups to evaluate and this is still very fresh.)

Update: my initial thoughts from looking at the diff still seem to be holding up. Someone is welcome to write a more detailed analysis than below. HP/ZDI have a write up of one of the DTLS issues.

There are some critical bug fixes to DTLS (TLS over datagram transports, i.e. UDP), but most people will be more concerned about the MITM attack against TLS (CVE-2014-0224).

The code changes are around the rejection of ChangeCipherSpec messages, which are messages sent during the TLS handshake that mark the change from unencrypted to encrypted traffic. These messages aren't part of the handshake protocol itself and aren't linked into the handshake state machine in OpenSSL. Rather there's a check in the code that they are only received when a new cipher is ready to be used. However, that check (for s->s3->tmp.new_cipher in s3_pkt.c) seems reasonable, but new_cipher is actually set as soon as the cipher for the connection has been decided (i.e. once the ServerHello message has been sent/received), not when the cipher is actually ready! It looks like this is the problem that's getting fixed in this release.

Here's the code in question that handles a ChangeCipherSpec message:

int ssl3_do_change_cipher_spec(SSL *s)
	{
	int i;
	const char *sender;
	int slen;

	if (s->state & SSL_ST_ACCEPT)
		i=SSL3_CHANGE_CIPHER_SERVER_READ;
	else
		i=SSL3_CHANGE_CIPHER_CLIENT_READ;

	if (s->s3->tmp.key_block == NULL)1
		{
		if (s->session == NULL)
			{
			/* might happen if dtls1_read_bytes() calls this */
			SSLerr(SSL_F_SSL3_DO_CHANGE_CIPHER_SPEC,SSL_R_CCS_RECEIVED_EARLY);
			return (0);
			}

		s->session->cipher=s->s3->tmp.new_cipher;
		if (!s->method->ssl3_enc->setup_key_block(s)) return(0); 2
		}

	if (!s->method->ssl3_enc->change_cipher_state(s,i))
		return(0);

	/* we have to record the message digest at
	 * this point so we can get it before we read
	 * the finished message */
	if (s->state & SSL_ST_CONNECT)
		{
		sender=s->method->ssl3_enc->server_finished_label;
		slen=s->method->ssl3_enc->server_finished_label_len;
		}
	else
		{
		sender=s->method->ssl3_enc->client_finished_label;
		slen=s->method->ssl3_enc->client_finished_label_len;
		}

	i = s->method->ssl3_enc->final_finish_mac(s,
		sender,slen,s->s3->tmp.peer_finish_md); 3
	if (i == 0)
		{
		SSLerr(SSL_F_SSL3_DO_CHANGE_CIPHER_SPEC, ERR_R_INTERNAL_ERROR);
		return 0;
		}
	s->s3->tmp.peer_finish_md_len = i;

	return(1);
	}

If a ChangeCipherSpec message is injected into the connection after the ServerHello, but before the master secret has been generated, then ssl3_do_change_cipher_spec will generate the keys (2) and the expected Finished hash (3) for the handshake with an empty master secret. This means that both are based only on public information. Additionally, the keys will be latched because of the check at (1) - further ChangeCipherSpec messages will regenerate the expected Finished hash, but not the keys.

The oldest source code on the OpenSSL site is for 0.9.1c (Dec 28, 1998) and the affected code appears almost unchanged. So it looks like this bug has existed for 15+ years.

The implications of this are pretty complex.

For a client there's an additional check in the code that requires that a CCS message appear before the Finished and after the master secret has been generated. An attacker can still inject an early CCS too and the keys will be calculated with an empty master secret. Those keys will be latched - another CCS won't cause them to be recalculated. However, when sending the second CCS that the client code requires, the Finished hash is recalculated with the correct master secret. This means that the attacker can't fabricate an acceptable Finished hash. This stops the obvious, generic impersonation attack against the client.

For a server, there's no such check and it appears to be possible to send an early CCS message and then fabricate the Finished hash because it's based on an empty master secret. However, that doesn't obviously gain an attacker anything. It would be interesting if a connection with a client certificate could be hijacked, but there is a check in ssl3_get_cert_verify that a CCS hasn't been already processed so that isn't possible.

Someone may be able to do something more creative with this bug; I'm not ruling out that there might be other implications.

Things change with an OpenSSL 1.0.1 server however. In 1.0.1, a patch of mine was included that moves the point where Finished values are calculated. The server will now use the correct Finished hash even if it erroneously processed a CCS, but this interacts badly with this bug. The server code (unlike the client) won't accept two CCS messages in a handshake. So, if an attacker injects an early CCS at the server to fixate the bad keys, then it's not possible for them to send a second in order to get it to calculate the correct Finished hash. But with 1.0.1, the server will use the correct Finished hash and will reply with the correct hash to the client. This explains the 1.0.1 and 1.0.2 mention in the advisory. With any OpenSSL client talking to an OpenSSL 1.0.1 server, an attacker can inject CCS messages to fixate the bad keys at both ends but the Finished hashes will still line up. So it's possible for the attacker to decrypt and/or hijack the connection completely.

The good news is that these attacks need man-in-the-middle position against the victim and that non-OpenSSL clients (IE, Firefox, Chrome on Desktop and iOS, Safari etc) aren't affected. None the less, all OpenSSL users should be updating.

Update: I hacked up the Go TLS library to perform the broken handshake in both 0.9.8/1.0.0 and 1.0.1 modes. There's a patch for crypto/tls and a tool that will attempt to do those broken handshakes and tell you if either worked. For example (at the time of writing):

$ ./earlyccs_check google.com
Handshake failed with error: remote error: unexpected message
Looks ok.
$ ./earlyccs_check github.com
Server is affected (1.0.1).
$ ./earlyccs_check amazon.com
Server is affected (0.9.8 or 1.0.0).

Matching primitive strengths (25 May 2014)

It's a common, and traditional, habit to match the strengths of cryptographic primitives. For example, one might design at the 128-bit security level and pick AES-128, SHA-256 and P-256. Or if someone wants “better” security, one might target a 192-bit security level with AES-192, SHA-384 and P-384. NIST does this sort of thing, for example in 800-57/3, section 5.2.

The logic underlying this is that an attacker will attack at a system's weakest point, so anything that is stronger than the weakest part is a waste. But, while matching up the security level might have made sense in the past, I think it warrants reconsideration now.

Consider the energy needed to do any operation 2128 times: Landauer tells us that erasing a bit takes a minimum amount of energy any actual crypto function will involve erasing many bits. But, for the sake of argument, let's say that we're doing a single bit XOR. If we take the minimum energy at room temperature (to save energy on refrigeration) we get 2.85×10-21 × 2128 = 0.97×1018J. That's about “yearly electricity consumption of South Korea as of 2009” according to Wikipedia. If we consider that actually doing anything useful in each of those 2128 steps is going to take orders of magnitude more bit operations, and that we can't build a computer anywhere near the theoretic limit, we can easily reach, say, 5.5×1024J, which is “total energy from the Sun that strikes the face of the Earth each year”.

(The original version of this post discussed incrementing a 128-bit counter and said that it took two erasures per increment. Samuel Neves on Twitter pointed out that incrementing the counter doesn't involve destroying the information of the old value and several people pointed out that one can hit all the bit-patterns with a Gray code and thus only need to flip a single bit each time. So I changed the above to use a one-bit XOR as the basic argument still holds.)

So any primitive at or above the 128-bit security level is equally matched today, because they are all effectively infinitely strong.

One way to take this is to say that anything above 128 bits is a waste of time. Another is to say that one might want to use primitives above that level, but based on the hope that discontiguous advances leave it standing but not the lesser primitives. The latter involves a lot more guesswork than the old practice of making the security levels match up. Additionally, I don't think most people would think that an analytic result is equally likely for all classes of primitives, so one probably doesn't want to spend equal amounts of them all.

Consider the difference between P-256 and P-384 (or curve25519 and Curve41417 or Hamburg's Goldilocks-448). If they are all infinitely strong today then the reason to want to use a larger curve is because you expect some breakthrough, but not too much of a breakthrough. Discrete log in an elliptic-curve group is a square-root work function today. For those larger curves to make sense you have to worry about a breakthrough that can do it in cube-root time, but not really forth-root and certainly not fifth-root. How much would you pay to insure yourself against that, precise possibility? And how does that compare against the risk of a similar breakthrough in AES?

It is also worth considering whether the security-level is defined in a way matches your needs. AES-128 is generally agreed to have a 128-bit security level, but if an attacker is happy breaking any one of 2n keys then they can realise a significant speedup. An “unbalanced” choice of cipher key size might be reasonable if that attack is a concern.

So maybe a 256-bit cipher (ciphers do pretty well now, just worry about Grover's algorithm), SHA-384 (SHA-256 should be fine, but hash functions have, historically, had a bad time) and 128512(? see below)-bit McEliece is actually “balanced” these days.

(Thanks to Mike Hamburg and Trevor Perrin, with whom I was discussing this after Mike presented his Goldilocks curve. Thanks to JP Aumasson on Twitter and pbsd on HN for suggesting multi-target attacks as a motivation for large cipher key sizes. Also to pbsd for pointing out a quantum attack against McEliece that I wasn't aware of.)

SHA-256 certificates are coming (14 May 2014)

It's a neat result in cryptography that you can build a secure hash function given a secure signature scheme, and you can build a secure signature scheme given a secure hash function. However, far from the theory, in the real world, lots of signatures today depend on SHA-1, which is looking increasingly less like a secure hash function.

There are lots of properties by which one can evaluate a hash function, but the most important are preimage resistance (if I give you a hash, can you find an input with that hash value?), second preimage resistance (if I give you a message, can you find another that hashes to the same value?) and collision resistance (can you find two messages with the same hash?). Of those, the third appears to be much harder to meet than the first two, based on historical results against hash functions.

Back when certificates were signed with MD5, a chosen-prefix collision attack (i.e. given two messages, can you append different data to each so that the results have the same hash?) against MD5 was used at least twice to break the security of the PKI. First the MD5 Collisions Inc demo against RapidSSL and then the Flame malware.

Today, SHA-1 is at the point where a collision attack is estimated to take 261 work and a chosen-prefix collision to take 277. Both are below the design strength of 280 and even that design strength isn't great by today's standards.

We hope that we have a second line of defense for SHA-1: after the MD5 Collisions Inc demo, CAs were required to use random serial numbers. A chosen-prefix attack requires being able to predict the certificate contents that will be signed and the random serial number should thwart that. With random serials we should be resting on a stronger hash function property called target-collision resistance. (Although I'm not aware of any proofs that random serials actually put us on TCR.)

Still, it would be better not to depend on those band-aids and to use a hash function with a better design strength while we do it. So certificates are starting to switch to using SHA-256. A large part of that shift came from Microsoft forbidding certificates using SHA-1 starting in 2016.

For most people, this will have no effect. Twitter ended up with a SHA-256 certificate after they replaced their old one because of the OpenSSL heartbeat bug. So, if you can still load Twitter's site, then you're fine.

But there are a lot of people using Windows XP prior to Service Pack 3, and they will have problems. We've already seen lots of user reports of issues with Twitter (and other sites) from these users. Wherever possible, installing SP3 is the answer. (Or, better yet, updating from Windows XP.)

There are also likely to be problems with embedded clients, old phones etc. Some of these may not come to light for a while.

We've not yet decided what Google's timeline for switching is, but we will be switching prior to 2016. If you've involved with a something embedded that speaks to Google over HTTPS (and there's a lot of it), now is the time to test https://cert-test.sandbox.google.com, which is using a SHA-256 certificate. I'll be using other channels to contact the groups that we know about but, on the web, we don't always know what's talking to us.

Revocation still doesn't work (29 Apr 2014)

I was hoping to be done with revocation for a bit, but sadly not.

GRC have published a breathless piece attacking a straw man argument: “Google tells us ... that Chrome's unique CRLSet solution provides all the protection we need”. I call it a straw man because I need only quote my own posts that GRC have read to show it.

The original CRLSets announcement contained two points. Firstly that online revocation checking doesn't work and that we were switching it off in Chrome in most cases. Secondly, that we would be using CRLSets to avoid having to do full binary updates in order to respond to emergency incidents.

In the last two paragraphs I mentioned something else. Quoting myself:

“Since we're pushing a list of revoked certificates anyway, we would like to invite CAs to contribute their revoked certificates (CRLs) to the list. We have to be mindful of size, but the vast majority of revocations happen for purely administrative reasons and can be excluded. So, if we can get the details of the more important revocations, we can improve user security. Our criteria for including revocations are:

  1. The CRL must be crawlable: we must be able to fetch it over HTTP and robots.txt must not exclude GoogleBot.
  2. The CRL must be valid by RFC 5280 and none of the serial numbers may be negative.
  3. CRLs that cover EV certificates are taken in preference, while still considering point (4).
  4. CRLs that include revocation reasons can be filtered to take less space and are preferred.”

In short, since we were pushing revocations anyway, maybe we could get some extra benefit from it. It would be better than nothing, which is what browsers otherwise have with soft-fail revocation checking.

I mentioned it again, last week (emphasis added):

“We compile daily lists of some high-value revocations and use Chrome's auto-update mechanism to push them to Chrome installations. It's called the CRLSet and it's not complete, nor big enough to cope with large numbers of revocations, but it allows us to react quickly to situations like Diginotar and ANSSI. It's certainly not perfect, but it's more than many other browsers do.”

“The original hope with CRLSets was that we could get revocations categorised into important and administrative and push only the important ones. (Administrative revocations occur when a certificate is changed with no reason to suspect that any key compromise occurred.) Sadly, that mostly hasn't happened.”

And yet, GRC managed to write pages (including cartoons!) exposing the fact that it doesn't cover many revocations and attacking Chrome for it.

They also claim that soft-fail revocation checking is effective:

The claim is that a user will download and cache a CRL while not being attacked and then be protected from a local attacker using a certificate that was included on that CRL. (I'm paraphrasing; you can search for “A typical Internet user” in their article for the original.)

There are two protocols used in online revocation checking: OCSP and CRL. The problem is that OCSP only covers a single certificate and OCSP is used in preference because it's so much smaller and thus removes the need to download CRLs. So the user isn't expected to download and cache the CRL anyway. So that doesn't work.

However, it's clear that someone is downloading CRLs because Cloudflare are spending half a million dollars a month to serve CRLs. Possibly it's non-browser clients but the bulk is probably CAPI (the library that IE, Chrome and other software on Windows typically uses - although not Firefox). A very obscure fact about CAPI is that it will download a CRL when it has 50 or more OCSP responses cached for a given CA certificate. But CAs know this and split their CRLs so that they don't get hit with huge bandwidth bills. But a split CRL renders the claimed protection from caching ineffective, because the old certificate for a given site is probably in a different part.

So I think the claim is that doing blocking OCSP lookups is a good idea because, if you use CAPI on Windows, then you might cache 50 OCSP responses for a given CA certificate. Then you'll download and cache a CRL for a while and then, depending on whether the CA splits their CRLs, you might have some revocations cached for a site that you visit.

It seems that argument is actually for something like CRLSets (download and cache revocations) over online checking, it's just a Rube Goldberg machine to, perhaps, implement it!

So, once again, revocation doesn't work. It doesn't work in other browsers and CRLSets aren't going to cover much in Chrome, as I've always said. GRC's conclusions follow from those errors and end up predictably far from the mark.

In order to make this post a little less depressing, let's consider whether its reasonable to aim to cover everything with CRLSets. (Which I mentioned before as well, but I'll omit the quote.) GRC quote numbers from Netcraft claiming 2.85 million revocations, although some are from certificate authorities not trusted by mainstream web browsers. I spent a few minutes harvesting CRLs from the CT log. This only includes certificates that are trusted by reasonable number of people and I only waited for half the log to download. From that half I threw in the CRLs included by CRLSets and got 2356 URLs after discarding LDAP ones.

I tried downloading them and got 2164 files. I parsed them and eliminated duplicates and my quick-and-dirty parser skipped quite a few. None the less, this very casual search found 1,062 issuers and 4.2 million revocations. If they were all dropped into the CRLSet, it would take 39MB.

So that's a ballpark figure, but we need to design for something that will last a few years. I didn't find figures on the growth of HTTPS sites, but Netcraft say that the number of web sites overall is growing at 37% a year. It seems reasonable that the number of HTTPS sites, thus certificates, thus revocations will double a couple of times in the next five years.

There's also the fact that if we subsidise revocation with the bandwidth and computers of users, then demand for revocation will likely increase. Symantec, I believe, have experimented with certificates that are issued for the maximum time (five years at the moment) but sold in single year increments. When it comes to the end of the year, they'll remind you to renew. If you do, you don't need to do anything, otherwise the certificate gets revoked. This seems like a good deal for the CA so, if we make revocation really effective, I'd expect to see a lot more of it. Perhaps factor in another doubling for that.

So we would need to scale to ~35M revocations in five years, which is ~300MB of raw data. That would need to be pushed to all users and delta updated. (I don't have numbers for the daily delta size.)

That would obviously take a lot of bandwidth and a lot of resources on the client. We would really need to aim at mobile as well, because that is likely to be the dominant HTTPS client in a few years, making storage and bandwidth use even more expensive.

Some probabilistic data structure might help. I've written about that before. Then you have to worry about false positives and the number of reads needed per validation. Phones might use Flash for storage, but it's remarkably slow.

Also, would you rather spend those resources on revocation, or on distributing the Safe Browsing set to mobile devices? Or would you spend the memory on improving ASLR on mobile devices? Security is big picture full of trade-offs. It's not clear the revocation warrants all that spending.

So, if you believe that downloading and caching revocation is the way forward, I think those are the parameters that you have to deal with.

No, don't enable revocation checking (19 Apr 2014)

Revocation checking is in the news again because of a large number of revocations resulting from precautionary rotations for servers affected by the OpenSSL heartbeat bug. However, revocation checking is a complex topic and there's a fair amount of misinformation around. In short, it doesn't work and you are no more secure by switching it on. But let's quickly catch up on the background:

Certificates bind a public key and an identity (commonly a DNS name) together. Because of the way the incentives work out, they are typically issued for a period of several years. But events occur and sometimes the binding between public key and name that the certificate asserts becomes invalid during that time. In the recent cases, people who ran a server that was affected by the heartbeat bug are worried that their private key might have been obtained by someone else and so they want to invalidate the old binding, and bind to a new public key. However, the old certificates are still valid and so someone who had obtained that private key could still use them.

Revocation is the process of invalidating a certificate before its expiry date. All certificates include a statement that essentially says “please phone the following number to check that this certificate has not been revoked”. The term online revocation checking refers to the process of making that phone call. It's not actually a phone call, of course, rather browsers and other clients can use a protocol called OCSP to check the status of a certificate. OCSP supplies a signed statement that says that the certificate is still valid (or not) and, critically, the OCSP statement itself is valid for a much shorter period of time, typically a few days.

The critical question is what to do in the event that you can't get an answer about a certificate's revocation status. If you reject certificates when you can't get an answer, that's called hard-fail. If you accept certificates when you can't get an answer that's called soft-fail.

Everyone does soft-fail for a number of reasons on top of the general principle that single points of failure should be avoided. Firstly, the Internet is a noisy place and sometimes you can't get through to OCSP servers for some reason. If you fail in those cases then the level of random errors increases. Also, captive portals (e.g. hotel WiFi networks where you have to “login” before you can use the Internet) frequently use HTTPS (and thus require certificates) but don't allow you to access OCSP servers. Lastly, if everyone did hard-fail then taking down an OCSP service would be sufficient to take down lots of Internet sites. That would mean that DDoS attackers would turn their attention to them, greatly increasing the costs of running them and it's unclear whether the CAs (who pay those costs) could afford it. (And the disruption is likely to be significant while they figure that out.)

So soft-fail is the only viable answer but it has a problem: it's completely useless. But that's not immediately obvious so we have to consider a few cases:

If you're worried about an attacker using a revoked certificate then the attacker first must be able to intercept your traffic to the site in question. (If they can't even intercept the traffic then you didn't need any authentication to protect it from them in the first place.) Most of the time, such an attacker is near you. For example, they might be running a fake WiFi access point, or maybe they're at an ISP. In these cases the important fact is that the attacker can intercept all your traffic, including OCSP traffic. Thus they can block OCSP lookups and soft-fail behaviour means that a revoked certificate will be accepted.

The next class of attacker might be a state-level attacker. For example, Syria trying to intercept Facebook connections. These attackers might not be physically close, but they can still intercept all your traffic because they control the cables going into and out of a country. Thus, the same reasoning applies.

We're left with cases where the attacker can only intercept traffic between a user and a website, but not between the user and the OCSP service. The attacker could be close to the website's servers and thus able to intercept all traffic to that site, but not anything else. More likely, the attacker might be able to perform a DNS hijacking: they persuade a DNS registrar to change the mapping between a domain (example.com) and its IP address(es) and thus direct the site's traffic to themselves. In these cases, soft-fail still doesn't work, although the reasoning is more complex:

Firstly, the attacker can use OCSP stapling to include the OCSP response with the revoked certificate. Because OCSP responses are generally valid for some number of days, they can store one from before the certificate was revoked and use it for as long as it's valid for. DNS hijackings are generally noticed and corrected faster than the OCSP response will expire. (On top of that, you need to worry about browser cache poisoning, but I'm not going to get into that.)

Secondly, and more fundamentally, when issuing certificates a CA validates ownership of a domain by sending an email, or looking for a specially formed page on the site. If the attacker is controlling the site, they can get new certificates issued. The original owners might revoke the certificates that they know about, but it doesn't matter because the attacker is using different ones. The true owners could try contacting CAs, convincing them that they are the true owners and get other certificates revoked, but if the attacker still has control of the site, they can hop from CA to CA getting certificates. (And they will have the full OCSP validity period to use after each revocation.) That circus could go on for weeks and weeks.

That's why I claim that online revocation checking is useless - because it doesn't stop attacks. Turning it on does nothing but slow things down. You can tell when something is security theater because you need some absurdly specific situation in order for it to be useful.

So, for a couple of years now, Chrome hasn't done these useless checks by default in most cases. Rather, we have tried a different mechanism. We compile daily lists of some high-value revocations and use Chrome's auto-update mechanism to push them to Chrome installations. It's called the CRLSet and it's not complete, nor big enough to cope with large numbers of revocations, but it allows us to react quickly to situations like Diginotar and ANSSI. It's certainly not perfect, but it's more than many other browsers do.

A powerful attacker may be able to block a user from receiving CRLSet updates if they can intercept all of that user's traffic for long periods of time. But that's a pretty fundamental limit; we can only respond to any Chrome issue, including security bugs, by pushing updates.

The original hope with CRLSets was that we could get revocations categorised into important and administrative and push only the important ones. (Administrative revocations occur when a certificate is changed with no reason to suspect that any key compromise occurred.) Sadly, that mostly hasn't happened. Perhaps we need to consider a system that can handle much larger numbers of revocations, but the data in that case is likely to be two orders of magnitude larger and it's very unlikely that the current CRLSet design is still optimal when the goal moves that far. It's also a lot of data for every user to be downloading and perhaps efforts would be better spent elsewhere. It's still the case that an attacker that can intercept traffic can easily perform an SSL Stripping attack on most sites; they hardly need to fight revoked certificates.

In order to end on a positive note, I'll mention a case where online revocation checking does work, and another, possible way to solve the revocation problem for browsers.

The arguments above started with the point that an attacker using a revoked certificate first needs to be able to intercept traffic between the victim and the site. That's true for browsers, but it's not true for code-signing. In the case where you're checking the signature on a program or document that could be distributed via, say, email, then soft-fail is still valuable. That's because it increases the costs on the attacker substantially: they need to go from being able to email you to needing to be able to intercept OCSP checks. In these cases, online revocation checking makes sense.

If we want a scalable solution to the revocation problem then it's probably going to come in the form of short-lived certificates or something like OCSP Must Staple. Recall that the original problem stems from the fact that certificates are valid for years. If they were only valid for days then revocation would take care of itself. (This is the approach taken by DNSSEC.) For complex reasons, it might be easier to deploy that by having certificates that are valid for years, but include a marker in them that indicates that an OCSP response must be provided along with the certificate. The OCSP response is then only valid for a few days and the effect is the same (although less efficient).

TLS Triple Handshakes (03 Mar 2014)

Today, the TLS WG mailing list meeting received a note about the work of Karthikeyan Bhargavan, Antoine Delignat-Lavaud, Cedric Fournet, Alfredo Pironti and Pierre-Yves Strub on triple handshake attacks against TLS. This is more complex than just a duplicated goto and I'm not going to try and reproduce their explanation here. Instead, I'll link to their site again, which also includes a copy of their paper.

In short, the TLS handshake hashes in too little information, and always has. Because of that it's possible to synchronise the state of two TLS sessions in a way that breaks assumptions made in the rest of the protocol.

I'd like to thank the researchers for doing a very good job of disclosing this. The note today even included a draft for fixing the TLS key derivation to include all the needed information to stop this attack, and it'll be presented at the WG meeting tomorrow.

In the mean time, people shouldn't panic. The impact of this attack is limited to sites that use TLS client-certificate authentication with renegotiation, and protocols that depend on channel binding. The vast majority of users have never used client certificates.

The client-certificate issues can be fixed with a unilateral, client change to be stricter about verifying certificates during a renegotiation, as suggested by the authors. I've included an image, below, that is loaded over an HTTPS connection that renegotiates with unrelated certificates before returning the image data. Hopefully the image below is broken. If not, then it likely will be soon because of a browser update. (Or, if you're reading long after this was posted, I may have taken the test server down!)

Protocols that depend on channel binding (including ChannelID) need other changes. Ideally, the proposed update to the master-secret computation will be finalised and implemented. (For ChannelID, we have already updated the protocol to include a change similar to the proposed draft.)

It's likely that there are still concrete problems to be found because of the channel-binding break. Hopefully with today's greater publicity people can start to find them.

TLS Symmetric Crypto (27 Feb 2014)

At this time last year, the TLS world was mostly running on RC4-SHA and AES-CBC. The Lucky 13 attack against CBC in TLS had just been published and I had spent most of January writing patches for OpenSSL and NSS to implement constant-time CBC decoding. The RC4 biases paper is still a couple of week away, but it's already clear that both these major TLS cipher suite families are finished and need replacing. (The question of which is worse is complicated.)

Here's Chrome's view of TLS by number of TLS connections from around this time (very minor cipher suite families were omitted):

Cipher suite familyPercentage of TLS connections made by Chrome
RC4-MD52.8%
RC4-SHA148.9%
AES128-SHA11.2%
AES256-SHA146.3%

A whole bunch of people, myself included, have been working to address that since.

AES-GCM was already standardised, implemented in OpenSSL and has hardware support in Intel chips, so Wan-Teh Chang and I started to implement it in NSS, although first we had to implement TLS 1.2, on which it depends. That went out in Chrome 30 and 31. Brian Smith (and probably others!) at Mozilla shipped that code in Firefox also, with Firefox 27.

AES-GCM is great if you have hardware support for it: Haswell chips can do it in just about 1 cycle/byte. However, it's very much a hardware orientated algorithm and it's very difficult to implement securely in software with reasonable speed. Also, since TLS has all the infrastructure for cipher suite negotiation already, it's nice to have a backup in the wings should it be needed.

So we implemented ChaCha20-Poly1305 for TLS in NSS and OpenSSL. (Thanks to Elie Bursztein for doing a first pass in OpenSSL.) This is an AEAD made up of primitives from djb and we're using implementations from djb, Andrew M, Ted Krovetz and Peter Schwabe. Although it can't beat AES-GCM when AES-GCM has dedicated hardware, I suspect there will be lots of chips for a long time that don't have such hardware. Certainly most mobile phones at the moment don't (although AArch64 is coming).

Here's an example of the difference in speeds:

ChipAES-128-GCM speedChaCha20-Poly1305 speed
OMAP 446024.1 MB/s75.3 MB/s
Snapdragon S4 Pro41.5 MB/s130.9 MB/s
Sandy Bridge Xeon (AESNI)900 MB/s500 MB/s

(The AES-128-GCM implementation is from OpenSSL 1.0.1f. Note, ChaCha20 is a 256-bit cipher and AES-128 obviously isn't.)

There's also an annoying niggle with AES-GCM in TLS because the spec says that records have an eight byte, explicit nonce. Being an AEAD, the nonce is required to be unique for a given key. Since an eight-byte value is too small to pick at random with a sufficiently low collision probability, the only safe implementation is a counter. But TLS already has a suitable, implicit record sequence counter that has always been used to number records. So the explicit nonce is at best a waste of eight bytes per record, and possibly dangerous should anyone attempt to use random values with it. Thankfully, all the major implementations use a counter and I did a scan of the Alexa, top 200K sites to check that none are using random values - and none are. (Thanks to Dr Henson for pointing out that OpenSSL is using a counter, but with a random starting value.)

For ChaCha20-Poly1305 we can save 8 bytes of overhead per record by using the implicit sequence number for the nonce.

Cipher suite selection

Given the relative merits of AES-GCM and ChaCha20-Poly1305, we wish to use the former when hardware support is available and the latter otherwise. But TLS clients typically advertise a fixed preference list of cipher suites and TLS servers either respect it, or override the client's preferences with their own, fixed list. (E.g. Apache's SSLHonorCipherOrder directive.)

So, first, the client needs to alter its cipher suite preference list depending on whether it has AES-GCM hardware support - which Chrome now does. Second, servers that enforce their own preferences (which most large sites do) need a new concept of an equal-preference group: a set of cipher suites in the server's preference order which are all “equally good”. When choosing a cipher suite using the server preferences, the server finds its most preferable cipher suite that the client also supports and, if that is in an equal preference group, picks whichever member of the group is the client's most preferable. For example, Google servers have a cipher suite preference that includes AES-GCM and ChaCha20-Poly1305 cipher suites in an equal preference group at the top of the preference list. So if the client supports any cipher suite in that group, then the server will pick whichever was most preferable for the client.

The end result is that Chrome talking to Google uses AES-GCM if there's hardware support at the client and ChaCha20-Poly1305 otherwise.

After a year of work, let's see what Chrome's view of TLS is now:

Cipher suite familyPercentage of TLS connections made by Chrome
RC4-MD53.1%
RC4-SHA114.1%
AES128-SHA12.6%
AES256-SHA140.2%
AES128-GCM & ChaCha20-Poly130539.9%

What remains is standardisation work and getting the code out to the world for others. The ChaCha20-Poly1305 cipher suites are unlikely to survive the IETF without changes so we'll need to respin them at some point. As for getting the code out, I hope to have something to say about that in the coming months. Before then I don't recommend that anyone try to repurpose code from the Chromium repos - it is maintained with only Chromium in mind.

Fallbacks

Now that TLS's symmetric crypto is working better there's another, long-standing problem that needs to be addressed: TLS fallbacks. The new, AEAD cipher suites only work with TLS 1.2, which shouldn't be a problem because TLS has a secure negotiation mechanism. Sadly, browsers have a long history of doing TLS fallbacks: reconnecting with a lesser TLS version when a handshake fails. This is because there are lots of buggy HTTPS servers on the Internet that don't implement version negotiation correctly and fallback allows them to continue to work. (It also promotes buggy implementations by making them appear to function; but fallback was common before Chrome even existed so we didn't really have a choice.)

Chrome now has a four stage fallback: TLS 1.2, 1.1, 1.0 then SSLv3. The problem is that fallback can be triggered by an attacker injecting TCP packets and it reduces the security level of the connection. In the TLS 1.2 to 1.1 fallback, AES-GCM and ChaCha20-Poly1305 are lost. In the TLS 1.0 to SSLv3 fallback, all elliptic curve support, and thus ECDHE, disappears. If we're going to depend on these new cipher suites then we cannot allow attackers to do this, but we also cannot break all the buggy servers.

So, with Chrome 33, we implemented a fallback SCSV. This involves adding a pseudo-cipher suite in the client's handshake that indicates when the client is trying a fallback connection. If an updated server sees this pseudo-cipher suite, and the connection is not using the highest TLS version supported by the server, then it replies with an error. This is essentially duplicating the version negotiation in the cipher suite list, which isn't pretty, but the reality is that the cipher suite list still works but the primary version negotiation is rusted over with bugs.

With this in place (and also implemented on Google servers) we can actually depend on having TLS 1.2, at least between Chrome and Google (for now).

Bugs

Few rollouts are trouble free and Chrome 33 certainly wasn't, although most of the problems were because of a completely different change.

Firstly, ChaCha20-Poly1305 is running at reduced speed on ARM because early-revision, S3 phones have a problem that causes the NEON, Poly1305 code to sometimes calculate the wrong result. (It's not clear if it's the MSM8960 chip or the power supply in the phone.) Of course, it works well enough to run the unit tests successfully because otherwise I wouldn't have needed two solid days to pin it down. Future releases will need to run a self-test to gather data about which revisions have the problem and then we can selectively disable the fast-path code on those devices. Until then, it's disabled everywhere. (Even so, ChaCha20-Poly1305 is still much faster than AES-GCM on ARM.)

But it was the F5 workaround that caused most of the headaches. Older (but common) firmware revisions of F5 devices hang the connection when the ClientHello is longer than 255 bytes. This appeared to be a major issue because it meant that we couldn't add things to the ClientHello without breaking lots of big sites that use F5 devices. After a plea and a discussion on the TLS list, an engineer from F5 explained that it was just handshake messages between 256 and 511 bytes in length that caused the issue. That suggested an obvious solution: pad the handshake message so that it doesn't fall into the troublesome size range.

Chrome 33 padded all ClientHellos to 512 bytes, which broke at least two “anti-virus” products that perform local MITM of TLS connections on Windows and at least one “filtering” device that doesn't do MITM, but killed the TCP connections anyway. All made the assumption that the ClientHello will never be longer than some tiny amount. In the firewall case, the fallback SCSV stopped a fallback to SSLv3 that would otherwise have hidden the problem by removing the padding.

Thankfully, two of the three vendors have acted quickly to address the issue. The last, Kaspersky, has known about these sorts of problems for 18 months now (since Chrome tried to deploy TLS 1.1 and hit a similar issue) but, thankfully, doesn't enable MITM mode by default.

Overall, it looks like the changes are viable. Hopefully the path will now be easier for any other clients that wish to do the same.


There's an index of all posts and one, long page with them all too. Email: agl AT imperialviolet DOT org.