Faster curve25519 with precomputation (10 May 2013)
(Update: based on questions, I've clearly failed to point out what this is, or rather what it isn't. Fixed base multiplication only applies when performing many operations on the same public key, i.e. key generation or perhaps some other cases. If you're working with random public keys then precomputaion cannot help I'm afraid.)
Diffie-Hellman is one of the most important public-key cryptographic primitives, even if it doesn't have the name recognition of, say, RSA. When anything is providing forward secrecy then it's very likely using Diffie-Hellman. In fact, the reason that web servers do RSA operations at all is mostly a legacy hangover. If we could design things afresh then there would be a lot of more Diffie-Hellman and everything would be a lot faster!
Of the concrete implementations of Diffie-Hellman, curve25519 is the fastest, common one. There are some faster primitives in eBACS, but the ones that are significantly faster are also significantly weaker. My machine (E5-2690, with Hyperthreading and TurboBoost disabled) can do a curve25519, arbitrary base point, scalar multiplication (a.k.a. Diffie-Hellman key agreement) in just 67.5µs.
But nobody will complain if it gets faster, right?
The core of Diffie-Hellman is calculating xy in some group. The simplest way to calculate xy is the classic square-and-multiply algorithm:
(Note: I'll be using multiplicative notation throughout because I think it's more familiar to most people, although additive is more common for the specific groups that we're dealing with here.)
Start with the number 1, which is x0 (because anything to the power of zero is 1). If we multiply by x then we get x1. So, by multiplying by x we've added one to the exponent. If we square our new number then we get x2 - squaring doubles the exponent.
Now consider the exponent as a binary number: x1 is x1b and squaring it gets us x10b. Squaring is a left shift of the exponent! Now, given any y in binary form, we can calculate xy with a series of additions by 1 and left shifts - i.e. multiplies by x and squarings.
If we have y = 1011b then we start with x0 = 1, as always, and read y left-to-right: the first bit is a one so we multiply by x to get x1b. Left shift: x10b. We already have the 0 in there so left shift again: x100b. Multiply by x: x101b. Left shift: x1010b. Multiply by x: x1011b. Done.
With square-and-multiply, the bits of the desired exponent are clocked in, bit by bit, as the squarings do the left shifting.
This is essentially what curve25519 does. It has lots of tricks to make it fast but, if you're calculating lots of powers of the same x then there are a number of tricks to make things faster. I'm only going to cover a couple of basic ones here.
Firstly, one can precalculate x1, x2, x3, … x15. Now, by multiplying by one of those values we can add any value from 0 to 15 to the exponent in a single multiplication. We're essentially clocking in four bits at a time. After one multiplication we have to do four squarings to left shift 4 bits to make space for the next 4 bits to be clocked in. This reduces the number of multiplications by a factor of 4.
Next, imagine that our exponents were always 8-bits long. We could precalculate x00000000b, x00000001b, x00010000b and x00010001b. By multiplying by one of those values we can clock in bits in two positions at once. Previously we always clocked in bits at the right hand side of the exponent, but those precomputed powers mean that we can clock in bits at the forth bit too. Thus we deal with the exponent as two, 4-bit values and the squarings are shared between them. We only have to square and multiply four times to cover the whole 8-bit value.
In cryptography, the powers are usually larger than 8-bits, of course, but you can clock in at 4 or 8 positions if you wish. It's a classic time/memory trade-off because more positions mean exponentially more precomputed values.
(There's more and better tricks than the two that I've just outlined, some are detailed on this Wikipedia page.)
Precomputation in curve25519
The reason that such precomputation hasn't been applied to curve25519 before is (I think), because of a detail that I glossed over before: curve25519 is actually a Montgomery curve that uses differential addition. Rather than being able to multiply arbitrary values, a and b, you instead need to also know the value of a/b first. That throws a spanner in the works of the typical optimisations.
If you review elliptic curves, you'll recall that operations on the elliptic curve group consist of operations on the underlying field. In curve25519's case it's the field GF(2255-19), which is where it gets its name from. We can roughly measure the amount of time that something will take by the number of field operations required. (Note to practitioners: I'm taking a field squaring as 0.8 multiplications and one multiplication is an ‘operation’.)
A curve25519 exponent is 255 bits long. Each squaring takes 3.6 operations and each multiplication by x takes 4.6 operations. 255*(3.6+4.6) is 2091 operations. We also need a final conversion (a field inversion) that costs 215.2 operations for a total of 2306.2. That's the number to beat.
In order to work in a group which has nice, arbitrary multiplication operations we map curve25519 to an isomorphic, Twisted Edwards curve. In fact, exactly the same curve as used by Ed25519, which is a fast signature primitive. This also means that we get to use Ed25519's highly optimised code. In order to be able to use as much code as possible, we use exactly the same precomputation scheme as described in section 4 of the paper.
This scheme is based on 256-bit exponents, which it splits into 64, 4-bit chunks. But it uses lots of precomputed values! For every other 4-bit chunk, it precomputes all 16 possible values. So the first subtable contains the 16 values x0000b, x0001b, x0010b, …, x1111b. The next subtable contains the 16 values for the next but one 4-bit chunk. So that's x0000 0000 0000b, x0001 0000 0000b, x0010 0000 0000b, …, x1111 0000 0000b. It has 32 of those subtables!
It picks one value from each subtable and, with 32 multiplications, takes care of half the bits of the exponent. Then it squares four times to left-shift those values into place and does another 32 multiplications from the same subtables to take care of the other half of the bits. That's a total of 64 multiplications and 4 squarings.
(It actually uses a trick which means that it only needs to store 8 of the 16 values in each subtable, but that's outside the scope of this post.)
We're in a different group now and so the costs are different: multiplications cost 6 operations and squarings cost 7. (The field is the same so an ‘operation’ takes the same time.) 64*6 + 4*7 = 412, so that's looking good. We need another field inversion to finish up, same as curve25519, for a total cost of 627.2. Those rough costs suggest that the Twisted Edwards precomputation should be 3.6x faster, which is promising!
(Updated: I originally suggested that two field inversions were required because I'm an idiot. Thanks to @CodesInChaos for asking why.)
The actual timings
Based on the amd64-64-24k implementation of Ed25519 in SUPERCOP, it's fairly easy to implement. On the same machine as the curve25519 speed was measured, above, the precomputed curve25519 runs in 21.9µs, a 3.1x speedup with 24KB of tables. So it's not as fast as the rough calculations suggested, but that's because we're now processing 24KB of memory. That cache pressure isn't free of course, although benchmarks make it look mostly free because they run in a tight loop and so the tables will always be in L1 cache. Still, it means that CPU can sustain 350,000 public key operations per second.
(Code still too messy to release. I hope to tidy it up next week.)
NPN and ALPN (20 Mar 2013)
Since its inception, SPDY has depended on a TLS extension called NPN. NPN allows a TLS connection to negotiate which application-level protocol will be running across it.
NPN allows SPDY to be enabled efficiently. If we had run SPDY on a different port, then we would have had to be constantly creating probing connections to see whether a site supported SPDY as well as HTTPS. Even if we knew that a site supported SPDY, network devices between any given client and that site might block connections to the different TCP port. If we had tried an HTTP Upgrade header, that would have slowed everything down and caused compatibility issues with servers and proxies that didn't process the header correctly.
NPN also allows us to update SPDY without spending round trips on a version negotiation. Overall, NPN has worked very well for SPDY.
NPN attempted to be a little bit future proof by sending the selected application protocol name under encryption, so that network devices couldn't discriminate. The benefit was somewhat limited because the server's list of supported protocols was still sent in the clear but we believe that anything that can be encrypted, should be encrypted.
There is an alternative to NPN: ALPN is essentially the same design except that the negotiation is done in the clear (like other TLS extensions).
Last Friday, at IETF 86 in Orlando, the TLS working group considered both designs and came to a rough consensus on ALPN. ALPN is currently on track to be published as an RFC at some point and we will be switching SPDY over to it and deprecating NPN.
Once IANA has assigned a TLS extension number for ALPN, Google servers will start supporting both NPN and ALPN, with a preference for ALPN. Chrome and, I expect, other browsers will start sending both NPN and ALPN extensions. During this time, SPDY servers will be able to switch from NPN to ALPN without dropping SPDY support for current clients.
At some point after the end of 2014, I plan on removing NPN support from Chrome and Google servers. Any old servers and clients will continue to function just fine: they'll just use HTTPS.
Lucky Thirteen attack on TLS CBC (04 Feb 2013)
In an upcoming paper (made public this morning), Nadhem AlFardan and Kenny Paterson describe another method of performing Vaudenay's attack on CBC as used in TLS. Firstly I'd like to thank the researchers for notifying the various vendors ahead of time so that patches could be prepared: the disclosure process has gone very smoothly in this case. I couldn't have asked for anything more - they did everything right.
Vaudenay's attack requires an attacker to be able to detect when a CBC padding check has succeeded, even if the authentication check then fails. Authentication should always be applied after encryption to avoid this, but TLS famously did it the wrong way round with CBC mode.
Knowing whether a padding check succeeded reveals information about the decrypted plaintext. By tweaking the ciphertext over many trials, it's possible to progressively decrypt an unknown ciphertext. For details, see their paper, which does a better job of explaining it than I would.
Vaudenay first used the fact that these two situations (padding check failure and authenticator failure) resulted in different TLS alert values, although that didn't result in a practical attack because TLS errors are encrypted. Once that was corrected (by specifying that the same alert value should be sent in each case) the next year's paper used a timing-side channel: if the authentication check wasn't performed when the padding check failed then, by timing the server's response, an attacker could tell whether the server aborted processing early.
To try and remove that side-channel, TLS stacks perform the authentication check whether or not the padding check succeeds. However, here's what I commented in the Go TLS code:
Note that we still have a timing side-channel in the MAC check, below. An attacker can align the record so that a correct padding will cause one less hash block to be calculated. Then they can iteratively decrypt a record by breaking each byte. However, our behavior matches OpenSSL, so we leak only as much as they do.
That pretty much sums up the new attack: the side-channel defenses that were hoped to be sufficient were found not to be (again). So the answer, this time I believe, is to make the processing rigorously constant-time. (Details below.)
As a practical matter, since a padding or authenticator check failure is fatal to a TLS connection, performing this attack requires a client to send the same plaintext secret on thousands of different connections to the same server. This isn't a trivial obstacle but it's possible to meet this requirement for cookies with a browser and bit of Javascript injected into any origin in the same session.
For DTLS the attack is much easier because a rejected record doesn't cause the connection to be destroyed and the same authors developed a method for amplifing timing attacks against DTLS in a previous paper
Unfortunately, unlike BEAST, this isn't something that we can unilaterally fix on the client side (except by disabling CBC mode, which might have significant compatibility issues.) On the server side, making RC4 the most preferable cipher solves the problem although I do recommend that server admins apply patches as they become available even so. Having RC4 as an option has been a useful fallback for CBC problems but, if patches aren't applied, then RC4 becomes a monoculture in the future.
Implementing constant time CBC decoding
Fixing this problem in a backwards compatible manner is, sadly, really rather complex. What need to ensure that all CBC records are authenticated in constant time. Initially we define constant-time to mean that, for all values of the secret input, the trace of instructions executed and memory locations accessed is identical (on an in-order machine). Here the secret input is the decrypted CBC record before the padding and MAC have been checked.
Meeting this definition is tough and there's a great temptation to remove some of the more glaring timing leaks and hope that the remainder is small enough not to be practical. But the DTLS trick mentioned above can be used to amplify any timing side-channels. Additionally, situations change and implementations that are currently only exposed over a network may, in the future, be exposed to attackers running on the same machine. At that point, cache-line probing and other powerful timing probes can be brought to bear against any side-channels that remain.
For the purposes of this post, we consider the decrypted, CBC record to contain the following parts:
- n bytes of plaintext. (Which may be compressed but, within our scope, it's plaintext.)
- mac_size bytes of MAC. (Up to 48 bytes in TLS with SHA-384. We conservatively assume that the minimum number of MAC bytes is zero since this is clearly a valid lower-bound and accounts for truncated MAC extensions.)
- padding_length bytes of padding.
- One byte of padding length. (Some consider this to be part of the padding itself, but I'll try to be consistent that this length byte is separate.)
With this secret data in hand, we need to perform three steps in constant time:
- Verify that the padding bytes are correct: the padding cannot be longer than the record and, with SSLv3, must be minimal. In TLS, the padding bytes must all have the same value as the length byte.
- Calculate the MAC of the header and plaintext (where the header is what I'm calling the additional data that is also covered by the MAC: the sequence number etc).
- Extract the MAC from the record (with constant memory accesses!) and compare it against the calculated MAC.
Utilities
We will make extensive use of bitwise operations and mask variables: variables where the bits are either all one or all zero. In order to create mask values it's very useful to have a method of replicating the most-significant-bit to all the bits in a value. We'll assume that an arithmetic right shift will shift in the MSB and perform this operation for us:
#define DUPLICATE_MSB_TO_ALL(x) ( (unsigned)( (int)(x) >> (sizeof(int)*8-1) ) ) #define DUPLICATE_MSB_TO_ALL_8(x) ( (uint8_t)( (int8_t)(x) >> 7) )
However, note that the C standard does not guarantee this behaviour, although all CPUs that I've encountered do so. If you're worried then you can implement this as a series of logical shifts and ORs.
We'll define some more utility functions as we need them.
SSLv3 padding
SSLv3 padding checks are reasonably simple: we need only to test that the padding isn't longer than the record and that it's less than a whole cipher block. We assume that the record is at least a byte long. (Note that the length of the whole record is public information and so we can fast reject invalid records if we're using that length alone.) We can use the fact that, if a and b are bounded such that the MSB cannot be set, then the MSB of a - b is one iff b>a.
padding_length = data[length-1]; unsigned t = (length-1)-padding_length; unsigned good = DUPLICATE_MSB_TO_ALL(~t); t = block_size - (padding_length+1); good &= DUPLICATE_MSB_TO_ALL(~t);
The resulting value of good is a mask value which is all ones iff the padding is valid.
TLS padding
Padding in TLS differs in two respects: the value of the padding bytes is defined and must be checked, and the padding is not required to be minimal. Therefore we must always make memory accesses as if the padding was the maximum length, with an exception only in the case that the record is shorter than the maximum. (Again, the total record length is public information so we don't leak anything by using it.)
Here we use a mask variable (mask), which is all ones for those bytes which should be part of the padding, to discard the result of checking the other bytes. However, our memory access pattern is the same for all padding lengths.
unsigned padding_length = data[length-1];
unsigned good = (length-1)-padding_length;
good = DUPLICATE_MSB_TO_ALL(~good);
unsigned to_check = 255; /* maximum amount of padding. */
if (to_check > length-1) {
to_check = length-1;
}
for (unsigned i = 0; i < to_check; i++) {
unsigned t = padding_length - i;
uint8_t mask = DUPLICATE_MSB_TO_ALL(~t);
uint8_t b = data[length-1-i];
good &= ~(mask&(padding_length ^ b));
}
If any of the padding bytes had an incorrect value, or the padding length was invalid itself, one of more of the bottom eight bits of good will be zero. Now we can map any value of good with such a zero bit to 0, and leave the case of all ones the same:
good &= good >> 4; good &= good >> 2; good &= good >> 1; good <<= sizeof(good)*8-1; good = DUPLICATE_MSB_TO_ALL(good);
In a very similar way, good can be updated to check that there are enough bytes for a MAC once the padding has been removed and then be used as a mask to subtract the given amount of padding from the length of the record. Otherwise, in subsequent sections, the code can end up under-running the record's buffer.
Calculating the MAC
All the hash functions that we're concerned with are Merkle–Damgård hashes and nearly all used in TLS have a 64-byte block size (MD5, SHA1, SHA-256). Some review of this family of hash functions is called for:
All these hash functions have an initial state, a transform function that takes exactly 64 bytes of data and updates that state, and a final_raw function that marshals the internal state as a series of bytes. In order to hash a message, it must be a multiple of 64 bytes long and that's achieved by padding. (This is a completely different padding than discussed above!)
In order to pad a message for hashing, a 0x80 byte is appended, followed by the minimum number of zero bytes needed to make the length of the message congruent to 56 mod 64. Then the length of the message (not including the padding) is written as a 64-bit, big-endian number to round out the 64 bytes. If the length of the message is 0 or ≥ 55 mod 64 then the padding will cause an extra block to be processed.
In TLS, the number of hash blocks needed to compute a message can vary by up to six, depending on the length of the padding and MAC. Thus we can process all but the last six blocks normally because the (secret) padding length cannot affect them. However, the last six blocks need to be handled carefully.
The high-level idea is that we generate the contents of each of the final hash blocks in constant time and hash each of them. For each block we serialize the hash and copy it with a mask so that only the correct hash value is copied out, but the amount of computation is constant.
SHA-384 is the odd one out: it has a 128 byte block size and the serialised length is twice as long, although it is otherwise similar. The example below assumes a 64 byte block size for clarity.
We calculate two indexes: index_a and index_b. index_a is the hash block where the application data ends and the 0x80 byte is put. index_b is the final hash block where the length bytes are put. They may be the same, or index_b may be one greater. We also calculate c: the offset of the 0x80 byte in the index_a block.
The arithmetic takes some thought. One wrinkle so far ignored is that additional data is hashed in prior to the application data. TLS uses HMAC, which includes a block containing the masked key and then there are 13 bytes of sequence number, record type etc. SSLv3 has more than a block's worth of padding data, MAC key, sequence etc. These are collectively called the header here. The variable k in the following tracks the starting position of the data for the constant time blocks and includes this header.
for (i = num_starting_blocks; i <= num_starting_blocks+varience_blocks; i++) {
uint8_t block[64];
uint8_t is_block_a = constant_time_eq(i, index_a);
uint8_t is_block_b = constant_time_eq(i, index_b);
for (j = 0; j < 64; j++) {
uint8_t b = 0;
if (k < header_length) {
b = header[k];
} else if (k < data_plus_mac_plus_padding_size + header_length) {
b = data[k-header_length];
}
k++;
uint8_t is_past_c = is_block_a & constant_time_ge(j, c);
uint8_t is_past_cp1 = is_block_a & constant_time_ge(j, c+1);
/* If this is the block containing the end of the
* application data, and we are at, or past, the offset
* for the 0x80 value, then overwrite b with 0x80. */
b = (b&~is_past_c) | (0x80&is_past_c);
/* If this the the block containing the end of the
* application data and we're past the 0x80 value then
* just write zero. */
b = b&~is_past_cp1;
/* If this is index_b (the final block), but not
* index_a (the end of the data), then the 64-bit
* length didn't fit into index_a and we're having to
* add an extra block of zeros. */
b &= ~is_block_b | is_block_a;
/* The final eight bytes of one of the blocks contains the length. */
if (j >= 56) {
/* If this is index_b, write a length byte. */
b = (b&~is_block_b) | (is_block_b&length_bytes[j-56]);
}
block[j] = b;
}
md_transform(md_state, block);
md_final_raw(md_state, block);
/* If this is index_b, copy the hash value to |mac_out|. */
for (j = 0; j < md_size; j++) {
mac_out[j] |= block[j]&is_block_b;
}
}
Finally, the hash value needs to be used to finish up either the HMAC or SSLv3 computations. Since this involves processing an amount of data that depends only on the MAC length, and the MAC length is public, this can be done in the standard manner.
In the case of SSLv3, since the padding has to be minimal, only the last couple of hash blocks can vary and so need to be processed in constant time. Another optimisation is to reduce the number of hash blocks processed in constant time if the (public) record length is small enough that some cannot exist for any padding value.
The above code is a problem to implement because it uses a different hash API than the one usually provided. It needs a final_raw function that doesn't append the hash padding, rather than the common final function, which does. Because of this implementers may be tempted to bodge this part of the code and perhaps the resulting timing side-channel will be too small to exploit, even with more accurate local attacks. Or perhaps it'll be good for another paper!
Extracting the MAC from the record.
We can't just copy the MAC from the record because its position depends on the (secret) amount of padding. The obvious, constant-time method of copying out the MAC is rather slow so my thanks to Emilia Kasper and Bodo Möller for suggesting better ways of doing it.
We can read every location where the MAC might be found and copy to a MAC-sized buffer. However, the MAC may be byte-wise rotated by this copy so we need to do another mac_size2 operations to rotate it in constant-time (since the amount of rotation is also secret):
unsigned mac_end = length;
unsigned mac_start = mac_end - md_size;
/* scan_start contains the number of bytes that we can ignore because the
* MAC's position can only vary by 255 bytes. */
unsigned scan_start = 0;
if (public_length_inc_padding > md_size + 255 + 1) {
scan_start = public_length_inc_padding - (md_size + 255 + 1);
}
unsigned char rotate_offset = (mac_start - scan_start) % md_size;
memset(rotated_mac, 0, sizeof(rotated_mac));
for (unsigned i = scan_start; i < public_length_inc_padding; i += md_size) {
for (unsigned j = 0; j < md_size; j++) {
unsigned char mac_started = constant_time_ge(i + j, mac_start);
unsigned char mac_ended = constant_time_ge(i + j, mac_end);
unsigned char b = 0;
if (i + j < public_length_inc_padding) {
b = data[i + j];
}
rotated_mac[j] |= b & mac_started & ~mac_ended;
}
}
/* Now rotate the MAC */
memset(out, 0, md_size);
for (unsigned i = 0; i < md_size; i++) {
unsigned char offset = (md_size - rotate_offset + i) % md_size;
for (j = 0; j < md_size; j++) {
out[j] |= rotated_mac[i] & constant_time_eq_8(j, offset);
}
}
Finally, the MAC should be compared in constant time by ORing a value with the XOR of each byte in place from the calculated and given MACs. If the result is zero, then the record is valid. In order to avoid leaking timing information, the mask value from the padding check at the very beginning should be carried throughout the process and ANDed with the final value from the MAC compare.
Limitations of our model
We started out by defining constant-time with a specific model. But this model fails in a couple of ways:
Firstly, extracting the MAC as detailed above is rather expensive. On some chips all memory accesses within a cache-line are constant time and so we can relax our model a little and use a line-aligned buffer to perform the rotation in when we detect that we're building on such a platform.
Secondly, when I sent an early version of the OpenSSL patch to the paper's authors for them to test, they reported that they found a timing difference! I instrumented the OpenSSL code with the CPU cycle counter and measured a median time of 18020 cycles to reject a record with a large amount of padding but only 18004 cycles with small padding.
This difference may be small, but it was stable. To cut a long story short, it came from this line:
unsigned char rotate_offset = (mac_start - scan_start) % md_size;
When the padding was larger, the MAC started earlier in the record (because the total size was the same). So the argument to the DIV instruction was smaller and DIV, on Intel, takes a variable amount of time depending on its arguments!
The solution was to add a large multiple of md_size to make the DIV always take the full amount of time (and to do it in such a way that the compiler, hopefully, won't be able to optimise it away). Certainly 20 CPU cycles is probably too small to exploit, but since we've already “solved” this problem twice, I'd rather not have to worry in the future.
Real World Crypto 2013 (13 Jan 2013)
(These are my notes for a talk that I gave last week at Real World Crypto. The premise of the conference is that it brings together theoretical cryptographers and practitioners. So this talk is aimed at theoretical cryptographers but it's fairly simple as I don't have anything complex worth saying to real cryptographers! Slides for other talks are linked from the program page and Rogaway's are relevant to this talk.
Note that this isn't a transcript: I actually say more words than this, but it contains the main points.)
Hello all.
For those who don't know me, I'm Adam Langley. I work at Google, mostly on our serving side HTTPS infrastructure these days. But I also still do some work on Chrome's SSL stack from time to time.
When I was asked to come up with a title of a talk I picked “Things that bit us, things we fixed and things that are waiting in the grass” with reference to HTTPS. Partly because that's what I know about but also because HTTPS is the only place where most people knowingly interact with public crypto, so it's one of the few examples of real world crypto at really large, really messy scales.
I could also have titled this talk “Know your enemy” because your real enemy is not trying to achieve an advantage with greater than negligible probability. As I hope to convince you, your enemy is me. I am the idiot who will mess up the implementation of your lovely cryptosystem. Your 128-bit security level is worthless in the face of my stupidity and so I'm here to seek your help in making my life easier and, therefore, everything more secure.
But before we get into anything crypto related, I'd like to set the scene. By and large, transport security on the Internet is doing OK because few people bother to attack it. If you want to steal banking credentials, you can get malware toolkits that will work on a large enough fraction of machines to be useful - e.g. using the Java exploit made public this week. If you want to steal passwords, you can SQL inject the site and reverse the hashes (if the site even bothered to hash), or just use the Ruby on Rails exploit made public this week. Given the level of password reuse, it doesn't even have to be the same site! Economically, attacking the transport doesn't make sense for many attackers and so they don't do it.
If you do want to attack the transport, by far the best methods are SSL stripping, mixed scripting and insecure cookie vulnerabilities.
SSL stripping means that, when the user types in example.com, since the default scheme is unencrypted, the browser makes an insecure request. The attacker can simply answer that request and proxy the entire site while removing any attempts to upgrade to HTTPS. In the majority of cases, the user will never notice.
Mixed scripting results from pages that source Javascript, CSS or plugins from an HTTP URL. The request is made insecurely but the response is trusted with the same authority as the HTTPS origin that requested them. On sites that serve both HTTP and HTTPS, mixed scripting is endemic.
Lastly, cookies are secrets that the client sends to authenticate requests. If, when creating a cookie, the server doesn't set the secure flag, the client will also send the same cookie over unencrypted connections. Since forgetting to set that flag doesn't break anything in normal operation, it happens fairly often.
Oh, and all that is assuming that any transport security is used at all. HTTP is still much more common than HTTPS.
I've breezed over those issues, but the important point is that none of them involve crypto, they're all systems issues. On the whole, the crypto is doing great in comparison to everything else!
I'll cycle back to those issues towards the end but I wanted to give some perspective as I'll be talking about the crypto a lot, since this is Real World Crypto, but crypto is only a small part of web security.
When writing this talk I sat down and made a list of the cryptographic issues that have bitten HTTPS over the past few years and tried to group in order to make points that I think would be useful for the research community. I didn't end up using them all, so this is hardly a exhaustive list and I used some things twice, but here they are:
Go beyond a paper
- CRIME (compression leaks.)
- BEAST (CBC vs adaptive, chosen plaintext.)
The first group of issues I called ‘go beyond a paper’:
The CRIME attack resulted from the attacker having partial, chosen plaintext abilities on a channel which was performing compression. This applied to both TLS, which supports compression, and SPDY, an HTTP replacement that runs over TLS that applied compression internally. Most, major HTTPS sites don't support TLS compression so SPDY was the bigger hole, but SPDY is a relatively new protocol and the idea that compression can leak information isn't. I can easily find it going back ten years (“Compression and Information Leakage of Plaintext”, Kelsey, 2002) and here's what I said on the SPDY mailing list before the CRIME work:
With a good model of zlib, I think you could extract a ~40 byte cookie with ~13K requests. That's a practical attack and would make a great paper if someone has the time.
Of course, I never had the time but we did start working on a better compression scheme to solve the issue.
But SPDY was done by smart networking folks. Since the compression leak issues weren't known widely enough at the time, they picked gzip as a seemingly reasonable compression algorithm. It wasn't clearly stupid to compose TLS and gzip and yet it blew up when they did so. Had the Kelsey paper instead been a splashy, public demonstration, as CRIME was, then it's possible that the idea would have sunk into the collective memory to the point where simply using gzip wouldn't have seemed so reasonable. As it is, we're now using a horrible gzip hack to segment cookies and other sensitive headers from the attacker controlled values, although the replacement compression is mostly done, pending SPDY 4.
So the first idea in this section is that it's OK to make a splash in order to get some new idea into the minds of the developer community in general.
Somewhat similarly, the origins of the BEAST attack dates back to at least 2002 in the context of SSH (Möller, 2002).
In this case, Rizzo and Duong contacted the major SSL stacks prior to their work going public with a clear case that it was worth looking into. This, at least from my point of view, is very welcome! Please do this if you can. You can contact me if nothing else and I can rope in all the other usual suspects. Please set a clear go-public date and allow us to share the facts of the matter with other major vendors.
In the case of BEAST, this produced a rare example of major browsers breaking things together for the common good. The breakage wasn't trivial: we took out Disneyland Tokyo's online ticking system amongst many others, but we got it deployed.
So the second point of this group is that you should consider treating it like a security disclosure if it's warranted. We don't bite!
The world often sucks
- Hash collisions in MD5
- BEAST
- Downgrade attacks
The second group I put together under the title ‘the world often sucks’.
I'm afraid that sometimes it takes a while to make changes even if we are given a very clear demonstration of the issue. After a very clear demo of MD5 collisions causing vulnerabilities in certificate issuance in 2008 it still took years to remove support for it. Other remedial actions were taken: public CAs were required to use random serial numbers and stopped using MD5 for new certificates. But it wasn't until early 2012 that Chrome removed support for MD5 in signatures. Sadly, many MITM proxies still used MD5 and, despite giving them lots of notice, they didn't do anything and we broke quite a few people with that change.
Of course, later in the year it was found that Flame broke Microsoft's code signing with an MD5 collision. Maybe that'll be enough to convince people to move.
But the point of this section is to encourage people to think about workarounds because the ‘right fix’ often isn't feasible. For hash collisions we have randomised serial numbers and although we might like everyone to be using SHA-256 or SHA-3, realistically those randomised serial numbers are going to be buttressing SHA-1 for quite some time to come.
Marc Stevens talked earlier about how to detect SHA-1 collisions given only one of the colliding messages. That's great! That's absolutely something that we can put into certificate validation now and hopefully prevent problems in the future.
In relation to BEAST: I mentioned that the core weakness had been known since 2002 and, because of that, there was even a workaround in OpenSSL for it: empty fragments. By sending an empty CBC fragment before each real record, the MAC would effectively randomise the IV. However, empty fragments caused compatibility issues because some SSL stacks returned a zero length when encountering them and higher layers of the code took this to be EOF. Because of that, the workaround was never widely enabled.
In the course of discussing BEAST another workaround was proposed: 1/n-1 record splitting. Rather than putting an empty fragment before each record, include a single byte of the plaintext. The protection isn't quite as good, but it solves the EOF problem. Some servers still broke because they assumed that the complete HTTP request would come in a single read, but the lower rate of problems probably made BEAST mitigation viable.
Lastly, there's the first of our snakes in the grass (problems that are still pending to bite us): SSLv3 downgrade. Since there exist so many broken servers and network middleboxes on the Internet that can't handle TLS version negotiation, in the event of a TLS handshake failure browsers will fall back to SSLv3. However, SSLv3 doesn't support all of the features that TLS does. Most significantly from my point of view, it doesn't support ECDHE, which Google servers use. So a network attacker can trigger a fallback and downgrade a capable client to a non-forward secure ciphersuite.
This is obviously not good. The correct fix is to remove the fallback to SSLv3 of course, but that's sadly not viable right now. Instead, as a workaround, Yngve (formerly of Opera) suggested using the TLS renegotiation extension as a signal that a server is reasonably recent and therefore we shouldn't have performed the fallback.
Numbers from Chrome indicate that a high fraction of the servers that we perform fallback for are renego patched, suggesting that's a bad signal and we should instead create a different one. Although maybe the number of fallbacks is dominated by transient network problems and that's skewing the data. Eric Rescorla has suggested replicating the TLS version negotiation using ciphersuite values. It's something that we will hopefully address in one way or another in 2013.
So that's the point of this group: please consider workarounds because we can't always manage to deploy the ‘right’ fix.
Side-channels are a big deal.
- RSA PKCS#1 v1.5 padding oracles ("Million message attack")
- CBC padding oracles (Vaudenay's attack)
- Timing attacks against RSA CRT.
- Side-channel attacks against AES
- Side-channel attacks against group operations
This big group is about the fact that side channels are a big deal. Writing constant-time code is a very odd corner of programming and not something that can be easily tested. I have a valgrind hack called ctgrind that allows for some automated testing, but it certainly has its limitations.
But what does constant-time even mean? There's the obvious definition: the CPU runs for the same amount of time, independent of any secret inputs. But CPUs are preemptively multitasked and have frequency scaling and thermal limiting these days so that's not a very useful definition in practice. A more workable definition is that if the code were to be run on an abstract Von Neumann machine, then the trace of memory fetches is identical for all secret inputs. That means that the trace of instructions fetched and data accesses is constant for all secret inputs. That takes care of all the obvious problems.
In practice, it can be useful to relax the definition a little and require only that the set of cache lines accessed for data is constant, rather than the exact addresses. In practice CPUs often fetch whole cache lines at a time and using that fact can lead to speedups at the cost of having to know the cache line length of the CPU.
This model assumes that the individual instructions themselves are constant time. As a research topic it would be interesting to know how variable time CPU instructions affect this. For example, from the Intel optimisation manual:
The latency and throughput of IDIV in Enhanced Intel Core micro-architecture varies with operand sizes and with the number of significant digits of the quotient of the division.
Is this a problem, cryptographically? I don't know. I think multiplies on ARM are variable time too. They are on PowerPC, but that's a much more obscure platform.
As a researcher, what's there to keep in mind with respect to constant-time implementations? Firstly, know that Moore's law will never help you. CPUs may get faster, but the amount of data that we need to process is increasing just as fast, if not faster. So you can't assume that the slower, constant time code will become viable in time - you have to be constant time from the start.
Even having a non-constant time implementation is a danger. There are usually two sides to a protocol and I may not control the software on the other side. If I specify AES in a protocol then I have to consider that it may well be a non-constant time implementation. I just made up the term ‘implementation ecosystem’ for this. AES is very difficult to implement securely in software: good implementations are still still topics for research papers. So the implementation ecosystem for AES is terrible! There are lots and lots of side-channel vulnerable implementations out there because no normal person, given the AES spec, will produce a secure implementation.
If we're aiming for a 128-bit security level then that possibility is a much larger problem than many other, more traditional crypto concerns.
So, for new primitives, you may want to produce solid implementations for different platforms to seed the implementation ecosystem. Not just reference implementations, but ones that are good enough that they dominate the set of implementations. For example, if I specify curve25519 in a protocol, I can be pretty sure that everyone is going to be using djb's reference code. That's a major advantage.
You should consider what an implementation is going to look like when designing. Of course, building a solid implementation will make sure that you end up considering this, so that's another good reason. There are certain patterns that are inherently dangerous. Square-and-multiply loops for example. You should recognise that and make sure that even the description of the algorithm includes counter measures. Binary fields are another which are very likely to result in non-constant time code.
Lastly, please don't assume that CPU changes are going to solve your constant-time or performance problems. Intel have added specific instructions for AES and binary fields in their latest chips and, while that does have some benefit, they will be a small fraction of all chips for a very long time. The chance of both sides of a connection having these chips is even smaller.
Cryptographic Room 101.
Room 101 is a fairly long running British TV show where people nominate things they dislike to put into Room 101, which banishes them from the world. Based on the above I've a number of things that I'd to put into Room 101, from least controversial to most. These are more specific points than the general ones that I just made and I'd like to see if anyone in the room disagrees with me!
1. MAC then Encrypt.
I hope that everyone agrees on this. Of course, it's pretty ubiquitous in the world today. Just because I want to banish something doesn't mean that I'm not going to spend years dealing with it in the future!
2. CBC mode.
With all the problems of chosen plaintexts, padding oracles etc I think it's time to get rid of CBC mode forever. Although it's possible to implement it securely, it's been done wrong for so long and is so easy to mess up, I think it's best to get rid of it.
3. ‘Sudden death’ entropy failure: plain DSA.
DSA (and ECDSA) has a very unfortunate property that an entropy failure leaks the private key. Even a slight bias over many signatures can be exploited. This is ridiculous. As Nadia and Debian have demonstrated, entropy failures do happen and they are inherently very bad. But that's not a reason to amplify them! By hashing in the private key and message, this problem can be avoided. So please consider what happens when your nonces are actually ntwices.
4. AES-GCM.
I've saved the really controversial one for last!
AES-GCM so easily leads to timing side-channels that I'd like to put it into Room 101. It took a decade of research to produce solid, high-speed, constant time AES implementations and they are very complex. In that decade, many, many non-constant time AES implementations have found their way into everything, poisoning the ecosystem when it comes to using AES.
I haven't seen any research on extracting the key from GHASH, but I've certainly seen vulnerable implementations and there's every reason to believe that it's possible. Most GHASH implementations look like AES implementations from 10 years ago. I'm aware of one, reasonable, constant-time AES-GCM implementation (Käsper and Schwabe, CHES 2009), but it runs at 22 cycles/byte on a Core2.
If you have a recent Intel chip, and software that implements AES-GCM using the specific instructions provided, then it's great. But most chips don't have that and I think it would have been much more preferable to pick an AEAD that's easy to securely implement everywhere and then speed it up on certain chips.
But it's still much better than AES-CBC I guess!
(At this point I cycled back to talking about the larger transport security issues and how browsers are addressing them with HSTS, mixed script blocking etc. But I've written about that before.)
Certificate Transparency (06 Nov 2012)
These are my notes for a talk that I gave today at IETF 85. For more details, see the draft.Certificates are public statements that everyone trusts, but they aren't public record. Other critical details about companies are generally public record, their address, directors etc, but not their public key. Why not? Work like the EFF Observatory has already means that CA customer lists, which they might consider confidential, are public.
If certificates were public record then you would be able to see what CAs are asserting about you, and hopefully correct any mistakes. I would very much like to know the set of valid certificates within google.com at any given time. I know about the real ones, of course, but I don’t know that there aren’t others. At the moment, there's no accountability for CAs unless you get really, publicly caught screwing up.
However, there is a class of certificates that we don't want published: internal names. Some companies get certificates from a real CA for their internal networks and those names should not be published. So we'll keep that in mind as we work out a design.
How might we achieve this? Well, we could have CAs publish the certificates that they issue. That's a good start; and actually useful to some degree, but you'll probably admit that this is weaker than we might like, so how do we make it stronger?
So that's our design goal, now let's talk about the constraints:
Firstly, we aren't going to update every HTTPS server on the planet. Any design must be incrementally deployable and, initially, we are talking about HTTPS. Hopefully any design would be generally applicable, but HTTPS is what we have in mind right now.
Secondly, no blocking lookups. Making calls to a third-party during certificate validation is very convenient, but it's just too costly. It creates latency when it's working and a disaster when it's not. Too many networks filter too much for it to be dependable and we've never managed to make it work for OCSP, so there's little chance that it'll work for this.
So whatever information we need to get to the client has to be in the handshake, and anything that client needs to report back has to be asynchronous.
We had our straw-man proposal where we just ask CAs to publish their certificate stream, but we want to do better because that only works when the CA is simply mistaken. So the problem that we need to address is how do clients know that a certificate they receive has been published at all?
Well, we can have an independent certificate log sign a statement of publication and, based on our constraints, we have to put that signature somewhere in the TLS handshake. We call the statements of publication Signed Certificate Timestamps and, since we don't want to delay certificate issuance, they’re not actually statements of publications, rather they are a promise that the certificate will be published soon.
Now we've just moved the problem from "what if the CA doesn't publish it?" to "what if the log promises to publish something, but doesn't?". We can require SCTs from a quorum of independent logs, and we'll do that, but running an append only log is a job that can be verified by clients.
In order to make a log verifiable, we put the log entries in a Merkle tree, with certificates at the leaves. Each log is an ever growing Merkle tree of certificates and the log periodically signs the root of the tree and publishes it, along with all the entries. But what does `periodically' mean? All logs must publish certificates that they have received within a time called the Maximum Merge Delay. That merge delay is the amount of time that a certificate can be used without being published.
We detect violations of the MMD by having clients verify the log's behaviour. Clients can request from the logs (or a mirror), a path from any certificate to the root of a published Merkle tree. In order to preserve client privacy this request may be made via DNS, if possible, or by requesting paths for a range of certificates rather than a specific one.
Once a client has a signed root, it needs to gossip about it. Clients implicitly trust their software provider so one common answer may be to request a signed statement that the log root has been observed from an auditor service run by them. But these auditors can be run by anyone, and clients can configure their auditor however they like.
Since these client checks are asynchronous, they can be blocked by an attacker. Clients will have to be tolerant to that to some extent because many networks are indistinguishable to an attack. However, if after some amount of time, the client has a certificate that it hasn't been able to check, or a log root that it hasn't verified with its auditor, it should try to publish it in various, unspecified ways. In the extreme, one can imagine asking the user to email a file to an email address.
So to `break' this system, by which I mean to get a certificate that will be silently accepted by clients without the world knowing about it, one needs to compromise a CA, a quorum of logs, and then either partition the client forever, or also compromise the client's auditor.
In addition to auditors, there’s another class of log observers: monitors. Monitors watch the logs for interesting events. We envision at least one, obvious, type of monitor service: an `alerts’ system where anyone can sign-up to receive emails when a certificate is issued in a certain domain.
So now some practical points:
First, how do we get the SCTs (the receipts from the logs) into the TLS handshake without altering every HTTPS server? Well, the one thing that every HTTPS server can do is serve a certificate chain. So we tried a couple of tricks:
One, we tried adding a superfluous certificate to the end of the chain. Most clients will ignore it, but a few smaller ones, and old versions of Android, don't and IIS doesn't like configuring such a chain. So we aren't pushing ahead with that idea.
We also tried putting SCTs into the unsigned portion of the leaf certificate and that works fairly well except for breaking Java. Nonetheless, we may continue to support that as an option.
And options are the way we're going to go with this problem: there doesn't seem to be a good single answer.
So another option is that CAs can do the work for you and put the SCTs into the signed part of the certificate, in an X.509 extension. Of course, the SCT contains a hash of the certificate, and a hash cannot include itself, so CAs issue a `pre-cert' from a special intermediate with a magic EKU that makes it invalid for normal use. The pre-cert can be submitted to the logs to get SCTs for embedding in the real certificate.
Another way in which CAs can do it for you is to put the SCTs in an OCSP response and then one uses OCSP stapling on the server.
Finally, for servers that can be updated, SCTs can be included in an Authorisation Data TLS-extension, and support for that is working its way through OpenSSL and Apache.
We're experimenting with CAs at the moment to find what works and we may whittle down this list in time. But for now we're promiscuous about solutions to this problem.
The logs also have to worry about spam, so they only accept certificate chains that end at a known root. There's a very low bar for the logs to accept a root because it doesn't grant any authority, it's just an anti-spam measure, but we do have to make sure that the logs can’t be spammed to death.
Going back to one of our requirements at the beginning: we don't want to log internal names in certificates. One solution to this is to allow intermediates with name constraints to be logged and then any certificates from there don't need to be. So an intermediate constrained to issue within example.com can be logged and then we don't need to log any leaf certificates within example.com: they can't be used against anyone else and, by using an intermediate, its security is up to you.
However, that's a little complex so the initial solution will probably be that we let companies configure the clients to say "don't require logging within these domains". At least for Chrome, people report great success with our Enterprise Policy configuration and that's an easy route for such config.
Lastly, the very tricky question: deployment.
Initially we can deploy in the same manner as pinning and HSTS: require it only for certain, opt-in domains (or possibly for opt-in CAs). But the goal is very much to get it everywhere. That will take years, but we do plan to do it. We need to make the error messages in the browser informative for server admins as well as users because this will be something new for them, although hopefully their CA will just take care of it. We can also enforce it for certificates issued past a certain date, rather than having a flag day. That way the certificate change will be the triggering factor, which is something under the control of the site. Lastly, we'll probably have to do the work to patch software like EJBCA.
We do not yet know exactly who will be running the logs, nor how many we want. But Google expects to run one and that project is staffed. We’re also working with Digicert and Comodo who are experimenting with CT integration and are considering running logs.
Since we can verify the operation of logs, they don't have the same trusted status as CAs. However, the set of logs has to be globally agreed so having to revoke a log would be a pain, so we do want logs that are operationally competent.
There is no doubt that this will be a difficult and lengthy deployment process. But it doesn't seem completely hopeless.
NIST may not have you in mind (21 Oct 2012)
A couple of weeks back, NIST announced that Keccak would be SHA-3. Keccak has somewhat disappointing software performance but is a gift to hardware implementations.
This is rather a theme for NIST selected algorithms. AES includes a number of operations on a binary field and CPUs generally don't implement binary field operations because they aren't really useful for anything else. So many software implementations of AES include four, 1024-byte tables of results in that field in order to achieve reasonable performance. However, these table lookups cause a timing side-channel which is fairly effective even over a network. On the same machine, where the CPU cache state can be probed more directly, it's very effective.
In response to this, some AES implementations made their tables smaller to reduce the leak. That's obviously an improvement, but it came at the cost of slowing AES down and AES was never very quick to start off with. And while a smaller leak is better, AES is supposed to be operating at (at least) an 128-bit security level!
Next, the mode recommended for AES by NIST is GCM: it's AES in counter mode coupled with a polynomial authenticator. Cryptographically this is perfectly sensible but the polynomial authenticator uses binary fields again. For a hardware implementation, this is another gift but software implementations need, yet again, tables in memory to implement it at reasonable speed.
Tables not only cause problems with timing side-channels, they also cause slowdowns that don't appear in micro-benchmarks. Tables are great when the benchmark is running with a single key and they are contained in L1 cache. But when running in a larger system, the cache pressure causes slowdowns and the real-world speed of an such an algorithm can be dramatically less than the benchmarks suggest.
But there's no reason why a polynomial authenticator needs to use binary fields. Why not use a field that CPUs actually implement? If you do so you get a very fast authenticator with dramatically improved key agility.
So why does NIST continually pick algorithms that aren't suited to software implementations? One of my colleagues proffers (somewhat jokingly) a conspiracy theory but I suspect that the answer is much more mundane. At USENIX Security this year, Dickie George gave an excellent keynote involving some NSA history and described how they would spend a decade or more designing a cryptographic system from the chip upwards. Everything was custom built, carefully designed and it was a completely closed system. It's certainly nice to be able to do that!
If you're designing such a system, of course you would be orientated towards hardware. Every implementation of your system would be a custom-designed chip; there's no benefit to making something easy to implement on commodity systems. And, in that world, AES, GCM, and SHA3 make a lot of sense. As the GCM paper says, “binary Galois field multiplication is especially suitable for hardware implementations”. I'm assuming that there's a fair amount of shared worldview between NIST and the NSA, but that doesn't seem unreasonable.
(In a NIST paper about the selection of AES they state: “table lookup: not vulnerable to timing attacks.” Oops. If you're building your own hardware, maybe.)
I'm not suggesting that the NSA is actually using AES, GCM etc in their own designs. But a generation or more of cryptographers from that world are probably comfortable with features that were chosen with hardware in mind. Those features, like binary fields, are what they will naturally gravitate towards.
People who aren't in the situation of designing custom chips, but rather have to design protocols for implementation on a menagerie of commodity chips, might want to consider that NIST doesn't appear to have them in mind. Although hardware implementations of AES have existed for a while, unless they are CPU instructions (like AES-NI), then the system-call overhead can cause the acceleration to be slower than the plain CPU for common, 1400-byte, chunk sizes. Even when you have a CPU instruction to implement AES and GCM, it's unlikely that all the chips that your code is going to run on will be so endowed.
Hardware implementations of these, poorly-suited primitives also take chip area and development time away from other tasks. Why not choose primitives that use CPU abilities that lots of code benefits from? Then optimisations of those CPU instructions lifts all boats.
So if you want to pick algorithms designed for CPUs, try Salsa20 rather than AES and Poly1305 rather than GCM. But we don't have a great answer for a modern hash function I'm afraid, and that's a little disappointing. SHA-2 is fine for now, although Zooko has some thoughts on the matter.
(I should note that, thanks to some great work by djb, Peter Schwabe, Emilia Käsper and others, we do have bitsliced, constant-time implementations of AES now that run reasonably fast (for AES). But it's only effective for counter mode, and I believe that it needs long streams of data to achieve its headline speed. (A commenter pointed out that it can be made to work with only 8 blocks. Later, another pointed out that that I'm doubly wrong! We can do it for single blocks in reasonable time.) That's great, but it took 10 years of optimisation work and implies that AES can only be effectively implemented by some of the best practical cryptographers in the world.)
DANE stapled certificates (20 Oct 2012)
We've supported DNSSEC stapled certificates in Chrome for a while now and since the DANE RFC has been published, it's about time that I updated that code to support DANE, rather than the hacked up CAA record that it currently uses.
I've written the DANE code, but it also seemed like an opportune time to reevaluate the idea. The promise of DNSSEC, as Dan Kaminsky put it, it that it can reduce the number of meetings required to get something done.
You have a whole bunch of internal hosts that some company that you're partnering with needs to interface with securely? Well, perhaps you used an internal CA, so will they install that root on their systems? (Meeting.) They will if it's name constrained, can we get it reissued with name constraints? (Meeting.) Ok, now can IT get that root pushed out everywhere? (Meeting.) You see how this is going.
But every bit of code needs to justify its complexity and, for small cases, StartSSL will give you free certificates. Perhaps just Chrome doing it isn't very interesting, and everyone hacks up custom solutions anyway.
So, if it's useful to you, you should let me know (agl at chromium dot org).
CRIME (21 Sep 2012)
Last year I happened to worry on the SPDY mailing list about whether sensitive information could be obtained via SPDY's use of zlib for compressing headers. Sadly, I never got the time to follow up and find out whether it was a viable attack. Thankfully there exist security researchers who, independently, wondered the same thing and did the work for me! Today Duong and Rizzo presented that work at ekoparty 2012.
They were also kind enough to let Firefox and ourselves know ahead of time so that we could develop and push security fixes before the public presentation. In order to explain what we did, let's start by looking at how SPDY compressed headers:
(This is inline SVG, if you can't see it, check here.)
That's a pretty busy diagram! But I don't think it's too bad with a bit of explanation:
zlib uses a language with basically two statements: “output these literal bytes” and “go back x bytes and duplicate y bytes from there”. In the diagram, red text was included literally and black text came from duplicating previous text.
The duplicated text is underlined. A dark blue underline means that the original text is in the diagram and there will be a gray line pointing to where it came from. (You can hover the mouse over one of those lines to make it darker.)
A light blue underline means that the original text came from a pre-shared dictionary of strings. SPDY defines some common text, for zlib to be able to refer to, that contains strings that we expect to find in the headers. This is most useful at the beginning of compression when there wouldn't otherwise be any text to refer back to.
The problem that CRIME highlights is that sensitive cookie data and an attacker controlled path is compressed together in the same context. Cookie data makes up most of the red, uncompressed bytes in the diagram. If the path contains some cookie data, then the compressed headers will be shorter because zlib will be able to refer back to the path, rather than have to output all the literal bytes of the cookie. If you arrange things so that you can probe the contents of the cookie incrementally, then (assuming that the cookie is base64), you can extract the cookie byte-by-byte by inducing the browser to make requests.


For details of how to get zlib to reveal that information in practice, I'll just refer you to Duong and Rizzo's CRIME presentation. It's good work.
In order to carry out this attack, the attacker needs to be able to observe your network traffic and to be able to cause many arbitrary requests to be sent. An active network attacker can do both by injecting Javascript into any HTTP page load that you make in the same session.
When we learned of this work, we were already in the process of designing the compression for SPDY/4, which avoids this problem. But we still needed to do something about SPDY/2 and SPDY/3 which are currently deployed. To that end, Chrome 21 and Firefox 15 have switched off SPDY header compression because that's a minimal change that easily backports.
Chrome has also switched off TLS compression, through which a very similar attack can be mounted.
But we like SPDY header compression because it saves significant amounts of data on the wire! Since SPDY/4 isn't ready to go yet we have a more complex solution for Chrome 22/23 that compresses data separately while still being backwards compatible.
Most importantly cookie data will only ever be duplicated exactly, and in its entirety, against other cookie data. Each cookie will also be placed in its own Huffman group (Huffman coding is a zlib detail that I skipped over in the explanation above). Finally, in case other headers contain sensitive data (i.e. when set by an XMLHttpRequest), non-standard headers will be compressed in their own Huffman group without any back references.
That's only a brief overview of the rules. The code to follow them and continue to produce a valid zlib stream wasn't one of the cleaner patches ever landed in Chrome and I'll be happy to revert it when SPDY/4 is ready. But it's effective at getting much of the benefit of compression back.
To the right are a couple of images of the same sort of diagram as above, but zoomed out. At this level of zoom, all you can really see are the blocks of red (literal) and blue (duplicated) bytes. The diagram on the right has the new rules enabled and, as you can see, there is certainly more red in there. However that's mostly the result of limited window size. In order to save on server memory, Chrome only uses 2048-byte compression windows and, under the new rules, a previous cookie value has to fit completely within the window in order to be matched. So things are a little less efficient until SPDY/4, although we might choose to trade a little more memory to make up for that.
SSL interstitial bypass rates (20 Jul 2012)
In yesterday's post I threw in the following that I've been asked to clarify:
“we know that those bypass buttons are clicked 60% of the time by Chrome users”
Chrome collects anonymous statistics from users who opt in to such collection (and thank you to those who do!). One of those statistics covers how frequently people bypass SSL interstitials. (As always, the Chrome privacy policy has the details.)
We define bypassing the interstitial as clicking the ‘Proceed anyway’ button and not bypassing as either closing the tab, navigating elsewhere, or clicking the ‘Back’ button.
I picked five days at random over the past six weeks and averaged the percentages of the time that users bypassed rather than not. That came to 61.6%.
There may be some biases here: we may have a biased population because we only include users who have opted in to statistics collection. We are also counting all interstitals: there may be a small number of users who bypass a lot of SSL errors. But that's the data that we have.
Living with HTTPS (19 Jul 2012)
(These are my notes from the first half of my talk at HOPE9 last weekend. I write notes like these not as a script, but so that I have at least some words ready in my head when I'm speaking. They are more conversational and less organised than a usual blog post, so please forgive me the rough edges.)
HTTPS tends to cause people to give talks mocking certificate security and the ecosystem around it. Perhaps that's well deserved, but that's not what this talk is about. If you want to have fun at the expense of CAs, dig up one of Moxie's talks. This talk deals with the fact that your HTTPS site, and the sites that you use, probably don't even reach the level where you get to start worrying about certificates.
I'm a transport security person so the model for this talk is that we have two computers talking over a malicious network. We assume that the computers themselves are honest and uncompromised. That might be a stretch in these malware-ridden times, but that's the area of host security and I'm not talking about that today. The network can drop, alter or fabricate packets at will. As a lemma, we also assume that the network can cause the browser to load any URL it wishes. The network can do this by inserting HTML into any HTTP request and we assume that every user makes some unencrypted requests while browsing.
Stripping
If the average user typed mail.google.com into a browser and saw the following, what fraction of them do you think would login, none the wiser?
Can you even see what's terribly wrong here?
The problem is that the page isn't served over HTTPS. It should have been, but when a user types a hostname into a browser, the default scheme is HTTP. The server may attempt to redirect users to HTTPS, but that redirect is insecure: a MITM attacker can rewrite it and keep the user on HTTP, spoofing the real site the whole time. The attacker can now intercept all the traffic to this perfectly well configured and secure website.
This is called SSL stripping and it's terribly simple and devastatingly effective. We probably don't see it very often because it's not something that corporate proxies need to do, so it's not in off-the-shelf devices. But that respite is unlikely to last very long and maybe it's already over: how would we even know if it was being used?
In order to stop SSL stripping, we need to make HTTPS the only protocol. We can't do that for the whole Internet, but we can do it site-by-site with HTTP Strict Transport Security (HSTS).
HSTS tells browsers to always make requests over HTTPS to HSTS sites. Sites become HSTS either by being built into the browser, or by advertising a header:
Strict-Transport-Security: max-age=8640000; includeSubDomains
The header is in force for the given number of seconds and may also apply to all subdomains. The header must be received over a clean HTTPS connection.
Once the browser knows that a site is HTTPS only, the user typing mail.google.com is safe: the initial request uses HTTPS and there's no hole for an attacker to exploit.
(mail.google.com and a number of other sites are already built into Chrome as HSTS sites so it's not actually possible to access accounts.google.com over HTTP with Chrome - I had to doctor that image! If you want to be included in Chrome's built-in HSTS list, email me.)
HSTS can also protect you, the webmaster, from making silly mistakes. Let's assume that you've told your mother that she should always type https:// before going to her banking site or maybe you setup a bookmark for her. That's honestly more than we can, or should, expect of our users. But let's say that our supererogatory user enters https://www.citibank.com in order to securely connect to her bank. What happens? Well, https://www.citibank.com redirects her to http://www.citibank.com. They've downgraded the user! From there, the HTTP site should redirect back to HTTPS, but the damage has been done. An attacker can get in through the hole.
I'm honestly not picking on Citibank here. They were simply the second site that I tried and I was some surprised that the first site didn't have the problem. It's a very easy mistake to make, and everything just works! It's a completely silent disaster! But HSTS would have either prevented it, or would have failed closed.
HSTS also does something else. It turns this:
Into this:
The “bypass this certificate error” button has gone. That button is a UI disaster. Asking regular people to evaluate the validity of X.509 certificates is insane. It's a security cop-out that we're saddled with, and which is causing real damage.
We've seen widespread MITM attacks using invalid certificates in Syria and, in recent weeks, Jordan. These attacks are invalid! This is squarely within our threat model for transport security and it shouldn't have been a risk for anybody. But we know that those bypass buttons are clicked 60% of the time by Chrome users. People are certainly habituated to clicking them and I bet that a large number of people were victims of attacks that we should have been able to prevent.
If you take only one thing away from this talk, HSTS should be it.
Mixed scripting
One we're sorted HSTS we have another problem. These snippets of HTML are gaping wide holes in your security:
<script src="http://... <link href="http://... <embed src="http://...
It's called mixed scripting and it happens when a secure site loads critical sub-resources over HTTP. It's a subset of mixed content: mixed content covers loading any sub-resource insecurely. Mixed content is bad, but when the resource is Javascript, CSS, or a plugin we give is another name to make it clear that its a lot more damaging.
When you load sub-resources over HTTP, an attacker can replace them with content of their choosing. The attacker also gets to choose any page on your HTTPS site with the problem. That includes pages that you don't expect to be served over HTTPS, but happen to be mapped. If you have this problem anywhere, on any HTTPS page, the attacker wins.
With complex sites, it's very difficult to ensure that this doesn't happen. One good way to limit it is to only serve over HTTPS so that there aren't any pages that you expect to serve over HTTP. Also, HSTS might also save you if you're loading mixed script from the same domain.
Another mitigation is to use scheme-relative URLs everywhere possible. These URLs look like //example.com/bar.js and are valid in all browsers. They inherit the scheme of the parent page, which will be HTTP or HTTPS as needed. (Although it does mean that if you load the page from disk then things will break. The scheme will be file:// in that case.)
Fundamentally this is such an easy mistake to make, and such a problem, that the only long term solution is for browsers to stop loading insecure Javascript, CSS and plugins for HTTPS sites. To their credit, IE9 does this and did it before Chrome. But I'm glad to say that Chrome has caught up and mixed scripting is now blocked by default, although with a user-override:
Yes, there's another of those damm bypass buttons. But we swear that it's just a transitional strategy and it already stops silent exploitation in a hidden iframe.
Cookies
HTTP and HTTPS cookie jars are the same. No really: cookies aren't scoped to a protocol! That means that if you set a Cookie on https://example.com and then make a request to http://example.com, the cookies will be sent in the clear! In order to prevent this, you should be setting secure on your HTTPS cookies. Sadly this is a very easy thing to miss because everything will still work if you omit it, but without the secure tag, attackers can easily steal your cookies.
It's worth noting that HSTS can protect you from this: by preventing the HTTP request from being sent, the cookies can't be leaked that way, but you'll need to include all subdomains in the HSTS coverage.
There's a second corollary to this: attackers can set your HTTPS cookies to. By causing an request to be sent to http://example.com, they can spoof a reply with cookies that then override any existing cookies. In this fashion, an attacker can log a user in as themselves during their interaction with an HTTPS site. Then, say, emails that they send will be saved in the attacker's out-box.
There's no very good protection against this except HSTS again. By preventing any HTTP requests you can stop the attacker from spoofing a reply to set the cookies. Against, HSTS needs to cover all subdomains in order to be effective against this.
Get yourself checked out
You should go to https://www.ssllabs.com and run their scan against your site. It's very good, but ignore it if it complains about the BEAST attack. It'll do so if you make a CBC ciphersuite your top preference. Browsers have worked around BEAST and non-browser clients are very unlikely to provide the attacker enough access to be able to pull it off. You have a limited amount of resources to address HTTPS issues and I don't think BEAST should make the list.
Get a real certificate
You should get a real certificate. You probably already have one but, if you don't, then you're just training more people to ignore certificate errors and you can't have HSTS without a real certificate. StartSSL give them away for free. Get one.
If you've reached this far and have done all of the above, congratulations: you're in the top 0.1% of HTTPS sites. If you're getting bored, this is a reasonable point to stop reading: everything else is just bonus points from now on.
Forward secrecy
You should consider forward secrecy. Forward secrecy means that the keys for a connection aren't stored on disk. You might have limited the amount of information that you log in order to protect the privacy of your users, but if you don't have forward secrecy then your private key is capable of decrypting all past connections. Someone else might be doing the logging for you.
In order to enable forward secrecy you need to have DHE or ECDHE ciphersuites as your top preference. DHE ciphersuites are somewhat expensive if you're terminating lots of SSL connections and you should be aware that your server will probably only allow 1024-bit DHE. I think 1024-bit DHE-RSA is preferable to 2048-bit RSA, but opinions vary. If you're using ECDHE, use P-256.
You also need to be aware of Session Tickets in order to implement forward secrecy correctly. There are two ways to resume a TLS connection: either the server chooses a random number and both sides store the session information, of the server can encrypt the session information with a secret, local key and send that to the client. The former is called Session IDs and the latter is called Session Tickets.
But Session Tickets are transmitted over the wire and so the server's Session Ticket encryption key is capable of decrypting past connections. Most servers will generate a random Session Ticket key at startup unless otherwise configured, but you should check.
I'm not going to take the time to detail how to configure this here. There are lots of webservers and it would take a while. This is more of a pointer so that you can go away and research it if you wish.
(The rest of the talk touched on OpenSSL speed, public key pinning, TLS 1.1 and 1.2 and a few other matters, but I did those bits mostly on the fly and don't have notes for them.)