Sudden Death Entropy Failures

During the time that the RSA patent was in force, DSA was the signature algorithm of choice for any software that didn't want to deal with patent licenses. (Which is why lots of old PGP keys are still DSA.) It has slowly disappeared since the patent expired and it appears that 4096-bit RSA is now the algorithm of choice if you're on the run from the NSA [1]. (And if you're a journalist trying to get a reply: keyid BDA0DF3C.)

But DSA can also be used with elliptic curves in the form of ECDSA and, in that form, it's likely that we'll see it return in the future, at least to some extent. SSH and GPG both support ECDSA now and CAs are starting to offer ECDSA certificates for HTTPS.

Unfortunately, DSA has an important weakness that RSA doesn't: an entropy failure leaks your private key. If you used a machine affected by the Debian entropy bug then, in that time, messages that you encrypted with RSA can be broken. But if you signed anything with a DSA key, then your private key is compromised.

The randomness in DSA is absolutely critical. Given enough signatures, leaking just a handful bits per signature is sufficient to break it. In the limit, you can make the make the mistake that Sony did and not understand that the random number needs to be generated for each message and, seemingly, just pick one and code it in. (See XKCD.)

But it doesn't need to be this way! All that is required of the nonce is that it be unique for each distinct message and secret, and we can achieve that by hashing the message and private key together. That's what Ed25519 does and I've added the option to OpenSSL to do the same. Unlike RSA and Ed25519 signatures, DSA signatures are probabilistic - signing the same message twice with the same key will result in different signatures. Since someone may be depending on that, OpenSSL also hashes in some randomness to maintain that feature.

(p.s. if you're an actual cryptographer, please take a look at the core of the code and let me know if I'm an idiot.)

Since the DSA specification says that the nonce must be randomly generated, and compliance with the spec is important for many users, this isn't enabled by default. For ECDSA keys, one needs to call EC_KEY_set_nonce_from_hash, for example. But hopefully we can measurably improve things with little effort by doing this.

Appendix

Here's an example of breaking DSA with a known nonce: essentially the Sony PlayStation hack. I've used the variable names from the Wikipedia page for those that want to follow along, although I'm not using a subgroup to keep things simple.

Let's pick some (public) DSA parameters:

p = 2903
F = GF(p)
g = F(2)
n = g.multiplicative_order()

Next, generate a key pair. x is the private key and y is the public key:

x = int(F.random_element()) % n
y = g^x
(x, y)
(1282, 966)

Generate our nonce, which will be fixed for two different messages in this example:

k = int(F.random_element()) % n
kInv = inverse_mod(k, n)

The first message that we sign will be 42 and the pair (r, s) is the signature. (Normally the message would be a hash of a much larger message.) The signature is very simple: r is gk and s is (m + xr)/k:

m = 42
r = int(g^k) % n
s = ((m + x*r) * kInv) % n
(r, s)
(1401, 1168)

As an aside, we can verify the signature that we just made:

w = inverse_mod(s, n)
u1 = m*w
u2 = r*w
v = g^u1*y^u2
v == r
True

Next, we also sign the message 24 (which I'll call mm) and we gather the two signatures. We have s and ss, both with the same r. Since s = (m + xr)/k, we can subtract the two to get (m + xr - mm - xr)/k, which is just (m - mm)/k. Thus k = (m - mm)/(s - ss). Given k, we can rearrange the original equation for s: s = (m + xr)/k ⇒ sk = m + xr ⇒ sk - m = xr ⇒ (sk - m)/r = x. And thus we have the private key:

(r, s) = (1401, 1168)
(r, ss) = (1401, 1212)
kk = ((42 - 24)*inverse_mod(s-ss,n)) % n
xx = ((s*kk - 42) * inverse_mod(r, n))%n
print xx == x
xx
True
1282

Faster curve25519 with precomputation

(Update: based on questions, I've clearly failed to point out what this is, or rather what it isn't. Fixed base multiplication only applies when performing many operations on the same public key, i.e. key generation or perhaps some other cases. If you're working with random public keys then precomputaion cannot help I'm afraid.)

Diffie-Hellman is one of the most important public-key cryptographic primitives, even if it doesn't have the name recognition of, say, RSA. When anything is providing forward secrecy then it's very likely using Diffie-Hellman. In fact, the reason that web servers do RSA operations at all is mostly a legacy hangover. If we could design things afresh then there would be a lot of more Diffie-Hellman and everything would be a lot faster!

Of the concrete implementations of Diffie-Hellman, curve25519 is the fastest, common one. There are some faster primitives in eBACS, but the ones that are significantly faster are also significantly weaker. My machine (E5-2690, with Hyperthreading and TurboBoost disabled) can do a curve25519, arbitrary base point, scalar multiplication (a.k.a. Diffie-Hellman key agreement) in just 67.5µs.

But nobody will complain if it gets faster, right?

The core of Diffie-Hellman is calculating xy in some group. The simplest way to calculate xy is the classic square-and-multiply algorithm:

(Note: I'll be using multiplicative notation throughout because I think it's more familiar to most people, although additive is more common for the specific groups that we're dealing with here.)

Start with the number 1, which is x0 (because anything to the power of zero is 1). If we multiply by x then we get x1. So, by multiplying by x we've added one to the exponent. If we square our new number then we get x2 - squaring doubles the exponent.

Now consider the exponent as a binary number: x1 is x1b and squaring it gets us x10b. Squaring is a left shift of the exponent! Now, given any y in binary form, we can calculate xy with a series of additions by 1 and left shifts - i.e. multiplies by x and squarings.

If we have y = 1011b then we start with x0 = 1, as always, and read y left-to-right: the first bit is a one so we multiply by x to get x1b. Left shift: x10b. We already have the 0 in there so left shift again: x100b. Multiply by x: x101b. Left shift: x1010b. Multiply by x: x1011b. Done.

With square-and-multiply, the bits of the desired exponent are clocked in, bit by bit, as the squarings do the left shifting.

This is essentially what curve25519 does. It has lots of tricks to make it fast but, if you're calculating lots of powers of the same x then there are a number of tricks to make things faster. I'm only going to cover a couple of basic ones here.

Firstly, one can precalculate x1, x2, x3, … x15. Now, by multiplying by one of those values we can add any value from 0 to 15 to the exponent in a single multiplication. We're essentially clocking in four bits at a time. After one multiplication we have to do four squarings to left shift 4 bits to make space for the next 4 bits to be clocked in. This reduces the number of multiplications by a factor of 4.

Next, imagine that our exponents were always 8-bits long. We could precalculate x00000000b, x00000001b, x00010000b and x00010001b. By multiplying by one of those values we can clock in bits in two positions at once. Previously we always clocked in bits at the right hand side of the exponent, but those precomputed powers mean that we can clock in bits at the forth bit too. Thus we deal with the exponent as two, 4-bit values and the squarings are shared between them. We only have to square and multiply four times to cover the whole 8-bit value.

In cryptography, the powers are usually larger than 8-bits, of course, but you can clock in at 4 or 8 positions if you wish. It's a classic time/memory trade-off because more positions mean exponentially more precomputed values.

(There's more and better tricks than the two that I've just outlined, some are detailed on this Wikipedia page.)

Precomputation in curve25519

The reason that such precomputation hasn't been applied to curve25519 before is (I think), because of a detail that I glossed over before: curve25519 is actually a Montgomery curve that uses differential addition. Rather than being able to multiply arbitrary values, a and b, you instead need to also know the value of a/b first. That throws a spanner in the works of the typical optimisations.

If you review elliptic curves, you'll recall that operations on the elliptic curve group consist of operations on the underlying field. In curve25519's case it's the field GF(2255-19), which is where it gets its name from. We can roughly measure the amount of time that something will take by the number of field operations required. (Note to practitioners: I'm taking a field squaring as 0.8 multiplications and one multiplication is an ‘operation’.)

A curve25519 exponent is 255 bits long. Each squaring takes 3.6 operations and each multiplication by x takes 4.6 operations. 255*(3.6+4.6) is 2091 operations. We also need a final conversion (a field inversion) that costs 215.2 operations for a total of 2306.2. That's the number to beat.

In order to work in a group which has nice, arbitrary multiplication operations we map curve25519 to an isomorphic, Twisted Edwards curve. In fact, exactly the same curve as used by Ed25519, which is a fast signature primitive. This also means that we get to use Ed25519's highly optimised code. In order to be able to use as much code as possible, we use exactly the same precomputation scheme as described in section 4 of the paper.

This scheme is based on 256-bit exponents, which it splits into 64, 4-bit chunks. But it uses lots of precomputed values! For every other 4-bit chunk, it precomputes all 16 possible values. So the first subtable contains the 16 values x0000b, x0001b, x0010b, …, x1111b. The next subtable contains the 16 values for the next but one 4-bit chunk. So that's x0000 0000 0000b, x0001 0000 0000b, x0010 0000 0000b, …, x1111 0000 0000b. It has 32 of those subtables!

It picks one value from each subtable and, with 32 multiplications, takes care of half the bits of the exponent. Then it squares four times to left-shift those values into place and does another 32 multiplications from the same subtables to take care of the other half of the bits. That's a total of 64 multiplications and 4 squarings.

(It actually uses a trick which means that it only needs to store 8 of the 16 values in each subtable, but that's outside the scope of this post.)

We're in a different group now and so the costs are different: multiplications cost 6 operations and squarings cost 7. (The field is the same so an ‘operation’ takes the same time.) 64*6 + 4*7 = 412, so that's looking good. We need another field inversion to finish up, same as curve25519, for a total cost of 627.2. Those rough costs suggest that the Twisted Edwards precomputation should be 3.6x faster, which is promising!

(Updated: I originally suggested that two field inversions were required because I'm an idiot. Thanks to @CodesInChaos for asking why.)

The actual timings

Based on the amd64-64-24k implementation of Ed25519 in SUPERCOP, it's fairly easy to implement. On the same machine as the curve25519 speed was measured, above, the precomputed curve25519 runs in 21.9µs, a 3.1x speedup with 24KB of tables. So it's not as fast as the rough calculations suggested, but that's because we're now processing 24KB of memory. That cache pressure isn't free of course, although benchmarks make it look mostly free because they run in a tight loop and so the tables will always be in L1 cache. Still, it means that CPU can sustain 350,000 public key operations per second.

(Code still too messy to release. I hope to tidy it up next week.)

NPN and ALPN

Since its inception, SPDY has depended on a TLS extension called NPN. NPN allows a TLS connection to negotiate which application-level protocol will be running across it.

NPN allows SPDY to be enabled efficiently. If we had run SPDY on a different port, then we would have had to be constantly creating probing connections to see whether a site supported SPDY as well as HTTPS. Even if we knew that a site supported SPDY, network devices between any given client and that site might block connections to the different TCP port. If we had tried an HTTP Upgrade header, that would have slowed everything down and caused compatibility issues with servers and proxies that didn't process the header correctly.

NPN also allows us to update SPDY without spending round trips on a version negotiation. Overall, NPN has worked very well for SPDY.

NPN attempted to be a little bit future proof by sending the selected application protocol name under encryption, so that network devices couldn't discriminate. The benefit was somewhat limited because the server's list of supported protocols was still sent in the clear but we believe that anything that can be encrypted, should be encrypted.

There is an alternative to NPN: ALPN is essentially the same design except that the negotiation is done in the clear (like other TLS extensions).

Last Friday, at IETF 86 in Orlando, the TLS working group considered both designs and came to a rough consensus on ALPN. ALPN is currently on track to be published as an RFC at some point and we will be switching SPDY over to it and deprecating NPN.

Once IANA has assigned a TLS extension number for ALPN, Google servers will start supporting both NPN and ALPN, with a preference for ALPN. Chrome and, I expect, other browsers will start sending both NPN and ALPN extensions. During this time, SPDY servers will be able to switch from NPN to ALPN without dropping SPDY support for current clients.

At some point after the end of 2014, I plan on removing NPN support from Chrome and Google servers. Any old servers and clients will continue to function just fine: they'll just use HTTPS.

Lucky Thirteen attack on TLS CBC

In an upcoming paper (made public this morning), Nadhem AlFardan and Kenny Paterson describe another method of performing Vaudenay's attack on CBC as used in TLS. Firstly I'd like to thank the researchers for notifying the various vendors ahead of time so that patches could be prepared: the disclosure process has gone very smoothly in this case. I couldn't have asked for anything more - they did everything right.

Vaudenay's attack requires an attacker to be able to detect when a CBC padding check has succeeded, even if the authentication check then fails. Authentication should always be applied after encryption to avoid this, but TLS famously did it the wrong way round with CBC mode.

Knowing whether a padding check succeeded reveals information about the decrypted plaintext. By tweaking the ciphertext over many trials, it's possible to progressively decrypt an unknown ciphertext. For details, see their paper, which does a better job of explaining it than I would.

Vaudenay first used the fact that these two situations (padding check failure and authenticator failure) resulted in different TLS alert values, although that didn't result in a practical attack because TLS errors are encrypted. Once that was corrected (by specifying that the same alert value should be sent in each case) the next year's paper used a timing-side channel: if the authentication check wasn't performed when the padding check failed then, by timing the server's response, an attacker could tell whether the server aborted processing early.

To try and remove that side-channel, TLS stacks perform the authentication check whether or not the padding check succeeds. However, here's what I commented in the Go TLS code:

Note that we still have a timing side-channel in the MAC check, below. An attacker can align the record so that a correct padding will cause one less hash block to be calculated. Then they can iteratively decrypt a record by breaking each byte. However, our behavior matches OpenSSL, so we leak only as much as they do.

That pretty much sums up the new attack: the side-channel defenses that were hoped to be sufficient were found not to be (again). So the answer, this time I believe, is to make the processing rigorously constant-time. (Details below.)

As a practical matter, since a padding or authenticator check failure is fatal to a TLS connection, performing this attack requires a client to send the same plaintext secret on thousands of different connections to the same server. This isn't a trivial obstacle but it's possible to meet this requirement for cookies with a browser and bit of Javascript injected into any origin in the same session.

For DTLS the attack is much easier because a rejected record doesn't cause the connection to be destroyed and the same authors developed a method for amplifing timing attacks against DTLS in a previous paper

Unfortunately, unlike BEAST, this isn't something that we can unilaterally fix on the client side (except by disabling CBC mode, which might have significant compatibility issues.) On the server side, making RC4 the most preferable cipher solves the problem although I do recommend that server admins apply patches as they become available even so. Having RC4 as an option has been a useful fallback for CBC problems but, if patches aren't applied, then RC4 becomes a monoculture in the future.

Implementing constant time CBC decoding

Fixing this problem in a backwards compatible manner is, sadly, really rather complex. What need to ensure that all CBC records are authenticated in constant time. Initially we define constant-time to mean that, for all values of the secret input, the trace of instructions executed and memory locations accessed is identical (on an in-order machine). Here the secret input is the decrypted CBC record before the padding and MAC have been checked.

Meeting this definition is tough and there's a great temptation to remove some of the more glaring timing leaks and hope that the remainder is small enough not to be practical. But the DTLS trick mentioned above can be used to amplify any timing side-channels. Additionally, situations change and implementations that are currently only exposed over a network may, in the future, be exposed to attackers running on the same machine. At that point, cache-line probing and other powerful timing probes can be brought to bear against any side-channels that remain.

For the purposes of this post, we consider the decrypted, CBC record to contain the following parts:

  1. n bytes of plaintext. (Which may be compressed but, within our scope, it's plaintext.)
  2. mac_size bytes of MAC. (Up to 48 bytes in TLS with SHA-384. We conservatively assume that the minimum number of MAC bytes is zero since this is clearly a valid lower-bound and accounts for truncated MAC extensions.)
  3. padding_length bytes of padding.
  4. One byte of padding length. (Some consider this to be part of the padding itself, but I'll try to be consistent that this length byte is separate.)

With this secret data in hand, we need to perform three steps in constant time:

  1. Verify that the padding bytes are correct: the padding cannot be longer than the record and, with SSLv3, must be minimal. In TLS, the padding bytes must all have the same value as the length byte.
  2. Calculate the MAC of the header and plaintext (where the header is what I'm calling the additional data that is also covered by the MAC: the sequence number etc).
  3. Extract the MAC from the record (with constant memory accesses!) and compare it against the calculated MAC.
Utilities

We will make extensive use of bitwise operations and mask variables: variables where the bits are either all one or all zero. In order to create mask values it's very useful to have a method of replicating the most-significant-bit to all the bits in a value. We'll assume that an arithmetic right shift will shift in the MSB and perform this operation for us:

#define DUPLICATE_MSB_TO_ALL(x) ( (unsigned)( (int)(x) >> (sizeof(int)*8-1) ) )
#define DUPLICATE_MSB_TO_ALL_8(x) ( (uint8_t)( (int8_t)(x) >> 7) )

However, note that the C standard does not guarantee this behaviour, although all CPUs that I've encountered do so. If you're worried then you can implement this as a series of logical shifts and ORs.

We'll define some more utility functions as we need them.

SSLv3 padding

SSLv3 padding checks are reasonably simple: we need only to test that the padding isn't longer than the record and that it's less than a whole cipher block. We assume that the record is at least a byte long. (Note that the length of the whole record is public information and so we can fast reject invalid records if we're using that length alone.) We can use the fact that, if a and b are bounded such that the MSB cannot be set, then the MSB of a - b is one iff b>a.

padding_length = data[length-1];
unsigned t = (length-1)-padding_length;
unsigned good = DUPLICATE_MSB_TO_ALL(~t);
t = block_size - (padding_length+1);
good &= DUPLICATE_MSB_TO_ALL(~t);

The resulting value of good is a mask value which is all ones iff the padding is valid.

TLS padding

Padding in TLS differs in two respects: the value of the padding bytes is defined and must be checked, and the padding is not required to be minimal. Therefore we must always make memory accesses as if the padding was the maximum length, with an exception only in the case that the record is shorter than the maximum. (Again, the total record length is public information so we don't leak anything by using it.)

Here we use a mask variable (mask), which is all ones for those bytes which should be part of the padding, to discard the result of checking the other bytes. However, our memory access pattern is the same for all padding lengths.

unsigned padding_length = data[length-1];
unsigned good = (length-1)-padding_length;
good = DUPLICATE_MSB_TO_ALL(~good);
unsigned to_check = 255; /* maximum amount of padding. */
if (to_check > length-1) {
        to_check = length-1;
}

for (unsigned i = 0; i < to_check; i++) {
        unsigned t = padding_length - i;
        uint8_t mask = DUPLICATE_MSB_TO_ALL(~t);
        uint8_t b = data[length-1-i];
        good &= ~(mask&(padding_length ^ b));
}

If any of the padding bytes had an incorrect value, or the padding length was invalid itself, one of more of the bottom eight bits of good will be zero. Now we can map any value of good with such a zero bit to 0, and leave the case of all ones the same:

good &= good >> 4;
good &= good >> 2;
good &= good >> 1;
good <<= sizeof(good)*8-1;
good = DUPLICATE_MSB_TO_ALL(good);

In a very similar way, good can be updated to check that there are enough bytes for a MAC once the padding has been removed and then be used as a mask to subtract the given amount of padding from the length of the record. Otherwise, in subsequent sections, the code can end up under-running the record's buffer.

Calculating the MAC

All the hash functions that we're concerned with are Merkle–Damgård hashes and nearly all used in TLS have a 64-byte block size (MD5, SHA1, SHA-256). Some review of this family of hash functions is called for:

All these hash functions have an initial state, a transform function that takes exactly 64 bytes of data and updates that state, and a final_raw function that marshals the internal state as a series of bytes. In order to hash a message, it must be a multiple of 64 bytes long and that's achieved by padding. (This is a completely different padding than discussed above!)

In order to pad a message for hashing, a 0x80 byte is appended, followed by the minimum number of zero bytes needed to make the length of the message congruent to 56 mod 64. Then the length of the message (not including the padding) is written as a 64-bit, big-endian number to round out the 64 bytes. If the length of the message is 0 or ≥ 55 mod 64 then the padding will cause an extra block to be processed.

In TLS, the number of hash blocks needed to compute a message can vary by up to six, depending on the length of the padding and MAC. Thus we can process all but the last six blocks normally because the (secret) padding length cannot affect them. However, the last six blocks need to be handled carefully.

The high-level idea is that we generate the contents of each of the final hash blocks in constant time and hash each of them. For each block we serialize the hash and copy it with a mask so that only the correct hash value is copied out, but the amount of computation is constant.

SHA-384 is the odd one out: it has a 128 byte block size and the serialised length is twice as long, although it is otherwise similar. The example below assumes a 64 byte block size for clarity.

image/svg+xml 0x80 0x00 ... application data 0x00 ... length bytes index_a index_b c

We calculate two indexes: index_a and index_b. index_a is the hash block where the application data ends and the 0x80 byte is put. index_b is the final hash block where the length bytes are put. They may be the same, or index_b may be one greater. We also calculate c: the offset of the 0x80 byte in the index_a block.

The arithmetic takes some thought. One wrinkle so far ignored is that additional data is hashed in prior to the application data. TLS uses HMAC, which includes a block containing the masked key and then there are 13 bytes of sequence number, record type etc. SSLv3 has more than a block's worth of padding data, MAC key, sequence etc. These are collectively called the header here. The variable k in the following tracks the starting position of the data for the constant time blocks and includes this header.

for (i = num_starting_blocks; i <= num_starting_blocks+varience_blocks; i++) {
        uint8_t block[64];
        uint8_t is_block_a = constant_time_eq(i, index_a);
        uint8_t is_block_b = constant_time_eq(i, index_b);
        for (j = 0; j < 64; j++) {
                uint8_t b = 0;
                if (k < header_length) {
                        b = header[k];
                } else if (k < data_plus_mac_plus_padding_size + header_length) {
                        b = data[k-header_length];
                }
                k++;

                uint8_t is_past_c = is_block_a & constant_time_ge(j, c);
                uint8_t is_past_cp1 = is_block_a & constant_time_ge(j, c+1);
                /* If this is the block containing the end of the
                 * application data, and we are at, or past, the offset
                 * for the 0x80 value, then overwrite b with 0x80. */
                b = (b&~is_past_c) | (0x80&is_past_c);
                /* If this the the block containing the end of the
                 * application data and we're past the 0x80 value then
                 * just write zero. */
                b = b&~is_past_cp1;
                /* If this is index_b (the final block), but not
                 * index_a (the end of the data), then the 64-bit
                 * length didn't fit into index_a and we're having to
                 * add an extra block of zeros. */
                b &= ~is_block_b | is_block_a;

                /* The final eight bytes of one of the blocks contains the length. */
                if (j >= 56) {
                        /* If this is index_b, write a length byte. */
                        b = (b&~is_block_b) | (is_block_b&length_bytes[j-56]);
                }
                block[j] = b;
        }

        md_transform(md_state, block);
        md_final_raw(md_state, block);
        /* If this is index_b, copy the hash value to |mac_out|. */
        for (j = 0; j < md_size; j++) {
                mac_out[j] |= block[j]&is_block_b;
        }
}

Finally, the hash value needs to be used to finish up either the HMAC or SSLv3 computations. Since this involves processing an amount of data that depends only on the MAC length, and the MAC length is public, this can be done in the standard manner.

In the case of SSLv3, since the padding has to be minimal, only the last couple of hash blocks can vary and so need to be processed in constant time. Another optimisation is to reduce the number of hash blocks processed in constant time if the (public) record length is small enough that some cannot exist for any padding value.

The above code is a problem to implement because it uses a different hash API than the one usually provided. It needs a final_raw function that doesn't append the hash padding, rather than the common final function, which does. Because of this implementers may be tempted to bodge this part of the code and perhaps the resulting timing side-channel will be too small to exploit, even with more accurate local attacks. Or perhaps it'll be good for another paper!

Extracting the MAC from the record.

We can't just copy the MAC from the record because its position depends on the (secret) amount of padding. The obvious, constant-time method of copying out the MAC is rather slow so my thanks to Emilia Kasper and Bodo Möller for suggesting better ways of doing it.

We can read every location where the MAC might be found and copy to a MAC-sized buffer. However, the MAC may be byte-wise rotated by this copy so we need to do another mac_size2 operations to rotate it in constant-time (since the amount of rotation is also secret):

unsigned mac_end = length;
unsigned mac_start = mac_end - md_size;
/* scan_start contains the number of bytes that we can ignore because the
 * MAC's position can only vary by 255 bytes. */
unsigned scan_start = 0;
if (public_length_inc_padding > md_size + 255 + 1) {
        scan_start = public_length_inc_padding - (md_size + 255 + 1);
}
unsigned char rotate_offset = (mac_start - scan_start) % md_size;

memset(rotated_mac, 0, sizeof(rotated_mac));
for (unsigned i = scan_start; i < public_length_inc_padding; i += md_size) {
        for (unsigned j = 0; j < md_size; j++) {
                unsigned char mac_started = constant_time_ge(i + j, mac_start);
                unsigned char mac_ended = constant_time_ge(i + j, mac_end);
                unsigned char b = 0;
                if (i + j < public_length_inc_padding) {
                        b = data[i + j];
                }
                rotated_mac[j] |= b & mac_started & ~mac_ended;
        }
}

/* Now rotate the MAC */
memset(out, 0, md_size);
for (unsigned i = 0; i < md_size; i++) {
        unsigned char offset = (md_size - rotate_offset + i) % md_size;
        for (j = 0; j < md_size; j++) {
                out[j] |= rotated_mac[i] & constant_time_eq_8(j, offset);
        }
}

Finally, the MAC should be compared in constant time by ORing a value with the XOR of each byte in place from the calculated and given MACs. If the result is zero, then the record is valid. In order to avoid leaking timing information, the mask value from the padding check at the very beginning should be carried throughout the process and ANDed with the final value from the MAC compare.

Limitations of our model

We started out by defining constant-time with a specific model. But this model fails in a couple of ways:

Firstly, extracting the MAC as detailed above is rather expensive. On some chips all memory accesses within a cache-line are constant time and so we can relax our model a little and use a line-aligned buffer to perform the rotation in when we detect that we're building on such a platform.

Secondly, when I sent an early version of the OpenSSL patch to the paper's authors for them to test, they reported that they found a timing difference! I instrumented the OpenSSL code with the CPU cycle counter and measured a median time of 18020 cycles to reject a record with a large amount of padding but only 18004 cycles with small padding.

This difference may be small, but it was stable. To cut a long story short, it came from this line:

unsigned char rotate_offset = (mac_start - scan_start) % md_size;

When the padding was larger, the MAC started earlier in the record (because the total size was the same). So the argument to the DIV instruction was smaller and DIV, on Intel, takes a variable amount of time depending on its arguments!

The solution was to add a large multiple of md_size to make the DIV always take the full amount of time (and to do it in such a way that the compiler, hopefully, won't be able to optimise it away). Certainly 20 CPU cycles is probably too small to exploit, but since we've already “solved” this problem twice, I'd rather not have to worry in the future.

Real World Crypto 2013

(These are my notes for a talk that I gave last week at Real World Crypto. The premise of the conference is that it brings together theoretical cryptographers and practitioners. So this talk is aimed at theoretical cryptographers but it's fairly simple as I don't have anything complex worth saying to real cryptographers! Slides for other talks are linked from the program page and Rogaway's are relevant to this talk.

Note that this isn't a transcript: I actually say more words than this, but it contains the main points.)

Hello all.

For those who don't know me, I'm Adam Langley. I work at Google, mostly on our serving side HTTPS infrastructure these days. But I also still do some work on Chrome's SSL stack from time to time.

When I was asked to come up with a title of a talk I picked “Things that bit us, things we fixed and things that are waiting in the grass” with reference to HTTPS. Partly because that's what I know about but also because HTTPS is the only place where most people knowingly interact with public crypto, so it's one of the few examples of real world crypto at really large, really messy scales.

I could also have titled this talk “Know your enemy” because your real enemy is not trying to achieve an advantage with greater than negligible probability. As I hope to convince you, your enemy is me. I am the idiot who will mess up the implementation of your lovely cryptosystem. Your 128-bit security level is worthless in the face of my stupidity and so I'm here to seek your help in making my life easier and, therefore, everything more secure.

But before we get into anything crypto related, I'd like to set the scene. By and large, transport security on the Internet is doing OK because few people bother to attack it. If you want to steal banking credentials, you can get malware toolkits that will work on a large enough fraction of machines to be useful - e.g. using the Java exploit made public this week. If you want to steal passwords, you can SQL inject the site and reverse the hashes (if the site even bothered to hash), or just use the Ruby on Rails exploit made public this week. Given the level of password reuse, it doesn't even have to be the same site! Economically, attacking the transport doesn't make sense for many attackers and so they don't do it.

If you do want to attack the transport, by far the best methods are SSL stripping, mixed scripting and insecure cookie vulnerabilities.

SSL stripping means that, when the user types in example.com, since the default scheme is unencrypted, the browser makes an insecure request. The attacker can simply answer that request and proxy the entire site while removing any attempts to upgrade to HTTPS. In the majority of cases, the user will never notice.

Mixed scripting results from pages that source Javascript, CSS or plugins from an HTTP URL. The request is made insecurely but the response is trusted with the same authority as the HTTPS origin that requested them. On sites that serve both HTTP and HTTPS, mixed scripting is endemic.

Lastly, cookies are secrets that the client sends to authenticate requests. If, when creating a cookie, the server doesn't set the secure flag, the client will also send the same cookie over unencrypted connections. Since forgetting to set that flag doesn't break anything in normal operation, it happens fairly often.

Oh, and all that is assuming that any transport security is used at all. HTTP is still much more common than HTTPS.

I've breezed over those issues, but the important point is that none of them involve crypto, they're all systems issues. On the whole, the crypto is doing great in comparison to everything else!

I'll cycle back to those issues towards the end but I wanted to give some perspective as I'll be talking about the crypto a lot, since this is Real World Crypto, but crypto is only a small part of web security.

When writing this talk I sat down and made a list of the cryptographic issues that have bitten HTTPS over the past few years and tried to group in order to make points that I think would be useful for the research community. I didn't end up using them all, so this is hardly a exhaustive list and I used some things twice, but here they are:

Go beyond a paper

  1. CRIME (compression leaks.)
  2. BEAST (CBC vs adaptive, chosen plaintext.)

The first group of issues I called ‘go beyond a paper’:

The CRIME attack resulted from the attacker having partial, chosen plaintext abilities on a channel which was performing compression. This applied to both TLS, which supports compression, and SPDY, an HTTP replacement that runs over TLS that applied compression internally. Most, major HTTPS sites don't support TLS compression so SPDY was the bigger hole, but SPDY is a relatively new protocol and the idea that compression can leak information isn't. I can easily find it going back ten years (“Compression and Information Leakage of Plaintext”, Kelsey, 2002) and here's what I said on the SPDY mailing list before the CRIME work:

With a good model of zlib, I think you could extract a ~40 byte cookie with ~13K requests. That's a practical attack and would make a great paper if someone has the time.

Of course, I never had the time but we did start working on a better compression scheme to solve the issue.

But SPDY was done by smart networking folks. Since the compression leak issues weren't known widely enough at the time, they picked gzip as a seemingly reasonable compression algorithm. It wasn't clearly stupid to compose TLS and gzip and yet it blew up when they did so. Had the Kelsey paper instead been a splashy, public demonstration, as CRIME was, then it's possible that the idea would have sunk into the collective memory to the point where simply using gzip wouldn't have seemed so reasonable. As it is, we're now using a horrible gzip hack to segment cookies and other sensitive headers from the attacker controlled values, although the replacement compression is mostly done, pending SPDY 4.

So the first idea in this section is that it's OK to make a splash in order to get some new idea into the minds of the developer community in general.

Somewhat similarly, the origins of the BEAST attack dates back to at least 2002 in the context of SSH (Möller, 2002).

In this case, Rizzo and Duong contacted the major SSL stacks prior to their work going public with a clear case that it was worth looking into. This, at least from my point of view, is very welcome! Please do this if you can. You can contact me if nothing else and I can rope in all the other usual suspects. Please set a clear go-public date and allow us to share the facts of the matter with other major vendors.

In the case of BEAST, this produced a rare example of major browsers breaking things together for the common good. The breakage wasn't trivial: we took out Disneyland Tokyo's online ticking system amongst many others, but we got it deployed.

So the second point of this group is that you should consider treating it like a security disclosure if it's warranted. We don't bite!

The world often sucks

  1. Hash collisions in MD5
  2. BEAST
  3. Downgrade attacks

The second group I put together under the title ‘the world often sucks’.

I'm afraid that sometimes it takes a while to make changes even if we are given a very clear demonstration of the issue. After a very clear demo of MD5 collisions causing vulnerabilities in certificate issuance in 2008 it still took years to remove support for it. Other remedial actions were taken: public CAs were required to use random serial numbers and stopped using MD5 for new certificates. But it wasn't until early 2012 that Chrome removed support for MD5 in signatures. Sadly, many MITM proxies still used MD5 and, despite giving them lots of notice, they didn't do anything and we broke quite a few people with that change.

Of course, later in the year it was found that Flame broke Microsoft's code signing with an MD5 collision. Maybe that'll be enough to convince people to move.

But the point of this section is to encourage people to think about workarounds because the ‘right fix’ often isn't feasible. For hash collisions we have randomised serial numbers and although we might like everyone to be using SHA-256 or SHA-3, realistically those randomised serial numbers are going to be buttressing SHA-1 for quite some time to come.

Marc Stevens talked earlier about how to detect SHA-1 collisions given only one of the colliding messages. That's great! That's absolutely something that we can put into certificate validation now and hopefully prevent problems in the future.

In relation to BEAST: I mentioned that the core weakness had been known since 2002 and, because of that, there was even a workaround in OpenSSL for it: empty fragments. By sending an empty CBC fragment before each real record, the MAC would effectively randomise the IV. However, empty fragments caused compatibility issues because some SSL stacks returned a zero length when encountering them and higher layers of the code took this to be EOF. Because of that, the workaround was never widely enabled.

In the course of discussing BEAST another workaround was proposed: 1/n-1 record splitting. Rather than putting an empty fragment before each record, include a single byte of the plaintext. The protection isn't quite as good, but it solves the EOF problem. Some servers still broke because they assumed that the complete HTTP request would come in a single read, but the lower rate of problems probably made BEAST mitigation viable.

Lastly, there's the first of our snakes in the grass (problems that are still pending to bite us): SSLv3 downgrade. Since there exist so many broken servers and network middleboxes on the Internet that can't handle TLS version negotiation, in the event of a TLS handshake failure browsers will fall back to SSLv3. However, SSLv3 doesn't support all of the features that TLS does. Most significantly from my point of view, it doesn't support ECDHE, which Google servers use. So a network attacker can trigger a fallback and downgrade a capable client to a non-forward secure ciphersuite.

This is obviously not good. The correct fix is to remove the fallback to SSLv3 of course, but that's sadly not viable right now. Instead, as a workaround, Yngve (formerly of Opera) suggested using the TLS renegotiation extension as a signal that a server is reasonably recent and therefore we shouldn't have performed the fallback.

Numbers from Chrome indicate that a high fraction of the servers that we perform fallback for are renego patched, suggesting that's a bad signal and we should instead create a different one. Although maybe the number of fallbacks is dominated by transient network problems and that's skewing the data. Eric Rescorla has suggested replicating the TLS version negotiation using ciphersuite values. It's something that we will hopefully address in one way or another in 2013.

So that's the point of this group: please consider workarounds because we can't always manage to deploy the ‘right’ fix.

Side-channels are a big deal.

  1. RSA PKCS#1 v1.5 padding oracles ("Million message attack")
  2. CBC padding oracles (Vaudenay's attack)
  3. Timing attacks against RSA CRT.
  4. Side-channel attacks against AES
  5. Side-channel attacks against group operations

This big group is about the fact that side channels are a big deal. Writing constant-time code is a very odd corner of programming and not something that can be easily tested. I have a valgrind hack called ctgrind that allows for some automated testing, but it certainly has its limitations.

But what does constant-time even mean? There's the obvious definition: the CPU runs for the same amount of time, independent of any secret inputs. But CPUs are preemptively multitasked and have frequency scaling and thermal limiting these days so that's not a very useful definition in practice. A more workable definition is that if the code were to be run on an abstract Von Neumann machine, then the trace of memory fetches is identical for all secret inputs. That means that the trace of instructions fetched and data accesses is constant for all secret inputs. That takes care of all the obvious problems.

In practice, it can be useful to relax the definition a little and require only that the set of cache lines accessed for data is constant, rather than the exact addresses. In practice CPUs often fetch whole cache lines at a time and using that fact can lead to speedups at the cost of having to know the cache line length of the CPU.

This model assumes that the individual instructions themselves are constant time. As a research topic it would be interesting to know how variable time CPU instructions affect this. For example, from the Intel optimisation manual:

The latency and throughput of IDIV in Enhanced Intel Core micro-architecture varies with operand sizes and with the number of significant digits of the quotient of the division.

Is this a problem, cryptographically? I don't know. I think multiplies on ARM are variable time too. They are on PowerPC, but that's a much more obscure platform.

As a researcher, what's there to keep in mind with respect to constant-time implementations? Firstly, know that Moore's law will never help you. CPUs may get faster, but the amount of data that we need to process is increasing just as fast, if not faster. So you can't assume that the slower, constant time code will become viable in time - you have to be constant time from the start.

Even having a non-constant time implementation is a danger. There are usually two sides to a protocol and I may not control the software on the other side. If I specify AES in a protocol then I have to consider that it may well be a non-constant time implementation. I just made up the term ‘implementation ecosystem’ for this. AES is very difficult to implement securely in software: good implementations are still still topics for research papers. So the implementation ecosystem for AES is terrible! There are lots and lots of side-channel vulnerable implementations out there because no normal person, given the AES spec, will produce a secure implementation.

If we're aiming for a 128-bit security level then that possibility is a much larger problem than many other, more traditional crypto concerns.

So, for new primitives, you may want to produce solid implementations for different platforms to seed the implementation ecosystem. Not just reference implementations, but ones that are good enough that they dominate the set of implementations. For example, if I specify curve25519 in a protocol, I can be pretty sure that everyone is going to be using djb's reference code. That's a major advantage.

You should consider what an implementation is going to look like when designing. Of course, building a solid implementation will make sure that you end up considering this, so that's another good reason. There are certain patterns that are inherently dangerous. Square-and-multiply loops for example. You should recognise that and make sure that even the description of the algorithm includes counter measures. Binary fields are another which are very likely to result in non-constant time code.

Lastly, please don't assume that CPU changes are going to solve your constant-time or performance problems. Intel have added specific instructions for AES and binary fields in their latest chips and, while that does have some benefit, they will be a small fraction of all chips for a very long time. The chance of both sides of a connection having these chips is even smaller.

Cryptographic Room 101.

Room 101 is a fairly long running British TV show where people nominate things they dislike to put into Room 101, which banishes them from the world. Based on the above I've a number of things that I'd to put into Room 101, from least controversial to most. These are more specific points than the general ones that I just made and I'd like to see if anyone in the room disagrees with me!

1. MAC then Encrypt.

I hope that everyone agrees on this. Of course, it's pretty ubiquitous in the world today. Just because I want to banish something doesn't mean that I'm not going to spend years dealing with it in the future!

2. CBC mode.

With all the problems of chosen plaintexts, padding oracles etc I think it's time to get rid of CBC mode forever. Although it's possible to implement it securely, it's been done wrong for so long and is so easy to mess up, I think it's best to get rid of it.

3. ‘Sudden death’ entropy failure: plain DSA.

DSA (and ECDSA) has a very unfortunate property that an entropy failure leaks the private key. Even a slight bias over many signatures can be exploited. This is ridiculous. As Nadia and Debian have demonstrated, entropy failures do happen and they are inherently very bad. But that's not a reason to amplify them! By hashing in the private key and message, this problem can be avoided. So please consider what happens when your nonces are actually ntwices.

4. AES-GCM.

I've saved the really controversial one for last!

AES-GCM so easily leads to timing side-channels that I'd like to put it into Room 101. It took a decade of research to produce solid, high-speed, constant time AES implementations and they are very complex. In that decade, many, many non-constant time AES implementations have found their way into everything, poisoning the ecosystem when it comes to using AES.

I haven't seen any research on extracting the key from GHASH, but I've certainly seen vulnerable implementations and there's every reason to believe that it's possible. Most GHASH implementations look like AES implementations from 10 years ago. I'm aware of one, reasonable, constant-time AES-GCM implementation (Käsper and Schwabe, CHES 2009), but it runs at 22 cycles/byte on a Core2.

If you have a recent Intel chip, and software that implements AES-GCM using the specific instructions provided, then it's great. But most chips don't have that and I think it would have been much more preferable to pick an AEAD that's easy to securely implement everywhere and then speed it up on certain chips.

But it's still much better than AES-CBC I guess!

(At this point I cycled back to talking about the larger transport security issues and how browsers are addressing them with HSTS, mixed script blocking etc. But I've written about that before.)

Certificate Transparency

These are my notes for a talk that I gave today at IETF 85. For more details, see the draft.

Certificates are public statements that everyone trusts, but they aren't public record. Other critical details about companies are generally public record, their address, directors etc, but not their public key. Why not? Work like the EFF Observatory has already means that CA customer lists, which they might consider confidential, are public.

If certificates were public record then you would be able to see what CAs are asserting about you, and hopefully correct any mistakes. I would very much like to know the set of valid certificates within google.com at any given time. I know about the real ones, of course, but I don’t know that there aren’t others. At the moment, there's no accountability for CAs unless you get really, publicly caught screwing up.

However, there is a class of certificates that we don't want published: internal names. Some companies get certificates from a real CA for their internal networks and those names should not be published. So we'll keep that in mind as we work out a design.

How might we achieve this? Well, we could have CAs publish the certificates that they issue. That's a good start; and actually useful to some degree, but you'll probably admit that this is weaker than we might like, so how do we make it stronger?

So that's our design goal, now let's talk about the constraints:

Firstly, we aren't going to update every HTTPS server on the planet. Any design must be incrementally deployable and, initially, we are talking about HTTPS. Hopefully any design would be generally applicable, but HTTPS is what we have in mind right now.

Secondly, no blocking lookups. Making calls to a third-party during certificate validation is very convenient, but it's just too costly. It creates latency when it's working and a disaster when it's not. Too many networks filter too much for it to be dependable and we've never managed to make it work for OCSP, so there's little chance that it'll work for this.

So whatever information we need to get to the client has to be in the handshake, and anything that client needs to report back has to be asynchronous.

We had our straw-man proposal where we just ask CAs to publish their certificate stream, but we want to do better because that only works when the CA is simply mistaken. So the problem that we need to address is how do clients know that a certificate they receive has been published at all?

Well, we can have an independent certificate log sign a statement of publication and, based on our constraints, we have to put that signature somewhere in the TLS handshake. We call the statements of publication Signed Certificate Timestamps and, since we don't want to delay certificate issuance, they’re not actually statements of publications, rather they are a promise that the certificate will be published soon.

Now we've just moved the problem from "what if the CA doesn't publish it?" to "what if the log promises to publish something, but doesn't?". We can require SCTs from a quorum of independent logs, and we'll do that, but running an append only log is a job that can be verified by clients.

In order to make a log verifiable, we put the log entries in a Merkle tree, with certificates at the leaves. Each log is an ever growing Merkle tree of certificates and the log periodically signs the root of the tree and publishes it, along with all the entries. But what does `periodically' mean? All logs must publish certificates that they have received within a time called the Maximum Merge Delay. That merge delay is the amount of time that a certificate can be used without being published.

We detect violations of the MMD by having clients verify the log's behaviour. Clients can request from the logs (or a mirror), a path from any certificate to the root of a published Merkle tree. In order to preserve client privacy this request may be made via DNS, if possible, or by requesting paths for a range of certificates rather than a specific one.

Once a client has a signed root, it needs to gossip about it. Clients implicitly trust their software provider so one common answer may be to request a signed statement that the log root has been observed from an auditor service run by them. But these auditors can be run by anyone, and clients can configure their auditor however they like.

Since these client checks are asynchronous, they can be blocked by an attacker. Clients will have to be tolerant to that to some extent because many networks are indistinguishable to an attack. However, if after some amount of time, the client has a certificate that it hasn't been able to check, or a log root that it hasn't verified with its auditor, it should try to publish it in various, unspecified ways. In the extreme, one can imagine asking the user to email a file to an email address.

So to `break' this system, by which I mean to get a certificate that will be silently accepted by clients without the world knowing about it, one needs to compromise a CA, a quorum of logs, and then either partition the client forever, or also compromise the client's auditor.

In addition to auditors, there’s another class of log observers: monitors. Monitors watch the logs for interesting events. We envision at least one, obvious, type of monitor service: an `alerts’ system where anyone can sign-up to receive emails when a certificate is issued in a certain domain.

So now some practical points:

First, how do we get the SCTs (the receipts from the logs) into the TLS handshake without altering every HTTPS server? Well, the one thing that every HTTPS server can do is serve a certificate chain. So we tried a couple of tricks:

One, we tried adding a superfluous certificate to the end of the chain. Most clients will ignore it, but a few smaller ones, and old versions of Android, don't and IIS doesn't like configuring such a chain. So we aren't pushing ahead with that idea.

We also tried putting SCTs into the unsigned portion of the leaf certificate and that works fairly well except for breaking Java. Nonetheless, we may continue to support that as an option.

And options are the way we're going to go with this problem: there doesn't seem to be a good single answer.

So another option is that CAs can do the work for you and put the SCTs into the signed part of the certificate, in an X.509 extension. Of course, the SCT contains a hash of the certificate, and a hash cannot include itself, so CAs issue a `pre-cert' from a special intermediate with a magic EKU that makes it invalid for normal use. The pre-cert can be submitted to the logs to get SCTs for embedding in the real certificate.

Another way in which CAs can do it for you is to put the SCTs in an OCSP response and then one uses OCSP stapling on the server.

Finally, for servers that can be updated, SCTs can be included in an Authorisation Data TLS-extension, and support for that is working its way through OpenSSL and Apache.

We're experimenting with CAs at the moment to find what works and we may whittle down this list in time. But for now we're promiscuous about solutions to this problem.

The logs also have to worry about spam, so they only accept certificate chains that end at a known root. There's a very low bar for the logs to accept a root because it doesn't grant any authority, it's just an anti-spam measure, but we do have to make sure that the logs can’t be spammed to death.

Going back to one of our requirements at the beginning: we don't want to log internal names in certificates. One solution to this is to allow intermediates with name constraints to be logged and then any certificates from there don't need to be. So an intermediate constrained to issue within example.com can be logged and then we don't need to log any leaf certificates within example.com: they can't be used against anyone else and, by using an intermediate, its security is up to you.

However, that's a little complex so the initial solution will probably be that we let companies configure the clients to say "don't require logging within these domains". At least for Chrome, people report great success with our Enterprise Policy configuration and that's an easy route for such config.

Lastly, the very tricky question: deployment.

Initially we can deploy in the same manner as pinning and HSTS: require it only for certain, opt-in domains (or possibly for opt-in CAs). But the goal is very much to get it everywhere. That will take years, but we do plan to do it. We need to make the error messages in the browser informative for server admins as well as users because this will be something new for them, although hopefully their CA will just take care of it. We can also enforce it for certificates issued past a certain date, rather than having a flag day. That way the certificate change will be the triggering factor, which is something under the control of the site. Lastly, we'll probably have to do the work to patch software like EJBCA.

We do not yet know exactly who will be running the logs, nor how many we want. But Google expects to run one and that project is staffed. We’re also working with Digicert and Comodo who are experimenting with CT integration and are considering running logs.

Since we can verify the operation of logs, they don't have the same trusted status as CAs. However, the set of logs has to be globally agreed so having to revoke a log would be a pain, so we do want logs that are operationally competent.

There is no doubt that this will be a difficult and lengthy deployment process. But it doesn't seem completely hopeless.

NIST may not have you in mind

A couple of weeks back, NIST announced that Keccak would be SHA-3. Keccak has somewhat disappointing software performance but is a gift to hardware implementations.

This is rather a theme for NIST selected algorithms. AES includes a number of operations on a binary field and CPUs generally don't implement binary field operations because they aren't really useful for anything else. So many software implementations of AES include four, 1024-byte tables of results in that field in order to achieve reasonable performance. However, these table lookups cause a timing side-channel which is fairly effective even over a network. On the same machine, where the CPU cache state can be probed more directly, it's very effective.

In response to this, some AES implementations made their tables smaller to reduce the leak. That's obviously an improvement, but it came at the cost of slowing AES down and AES was never very quick to start off with. And while a smaller leak is better, AES is supposed to be operating at (at least) an 128-bit security level!

Next, the mode recommended for AES by NIST is GCM: it's AES in counter mode coupled with a polynomial authenticator. Cryptographically this is perfectly sensible but the polynomial authenticator uses binary fields again. For a hardware implementation, this is another gift but software implementations need, yet again, tables in memory to implement it at reasonable speed.

Tables not only cause problems with timing side-channels, they also cause slowdowns that don't appear in micro-benchmarks. Tables are great when the benchmark is running with a single key and they are contained in L1 cache. But when running in a larger system, the cache pressure causes slowdowns and the real-world speed of an such an algorithm can be dramatically less than the benchmarks suggest.

But there's no reason why a polynomial authenticator needs to use binary fields. Why not use a field that CPUs actually implement? If you do so you get a very fast authenticator with dramatically improved key agility.

So why does NIST continually pick algorithms that aren't suited to software implementations? One of my colleagues proffers (somewhat jokingly) a conspiracy theory but I suspect that the answer is much more mundane. At USENIX Security this year, Dickie George gave an excellent keynote involving some NSA history and described how they would spend a decade or more designing a cryptographic system from the chip upwards. Everything was custom built, carefully designed and it was a completely closed system. It's certainly nice to be able to do that!

If you're designing such a system, of course you would be orientated towards hardware. Every implementation of your system would be a custom-designed chip; there's no benefit to making something easy to implement on commodity systems. And, in that world, AES, GCM, and SHA3 make a lot of sense. As the GCM paper says, “binary Galois field multiplication is especially suitable for hardware implementations”. I'm assuming that there's a fair amount of shared worldview between NIST and the NSA, but that doesn't seem unreasonable.

(In a NIST paper about the selection of AES they state: “table lookup: not vulnerable to timing attacks.” Oops. If you're building your own hardware, maybe.)

I'm not suggesting that the NSA is actually using AES, GCM etc in their own designs. But a generation or more of cryptographers from that world are probably comfortable with features that were chosen with hardware in mind. Those features, like binary fields, are what they will naturally gravitate towards.

People who aren't in the situation of designing custom chips, but rather have to design protocols for implementation on a menagerie of commodity chips, might want to consider that NIST doesn't appear to have them in mind. Although hardware implementations of AES have existed for a while, unless they are CPU instructions (like AES-NI), then the system-call overhead can cause the acceleration to be slower than the plain CPU for common, 1400-byte, chunk sizes. Even when you have a CPU instruction to implement AES and GCM, it's unlikely that all the chips that your code is going to run on will be so endowed.

Hardware implementations of these, poorly-suited primitives also take chip area and development time away from other tasks. Why not choose primitives that use CPU abilities that lots of code benefits from? Then optimisations of those CPU instructions lifts all boats.

So if you want to pick algorithms designed for CPUs, try Salsa20 rather than AES and Poly1305 rather than GCM. But we don't have a great answer for a modern hash function I'm afraid, and that's a little disappointing. SHA-2 is fine for now, although Zooko has some thoughts on the matter.

(I should note that, thanks to some great work by djb, Peter Schwabe, Emilia Käsper and others, we do have bitsliced, constant-time implementations of AES now that run reasonably fast (for AES). But it's only effective for counter mode, and I believe that it needs long streams of data to achieve its headline speed. (A commenter pointed out that it can be made to work with only 8 blocks. Later, another pointed out that that I'm doubly wrong! We can do it for single blocks in reasonable time.) That's great, but it took 10 years of optimisation work and implies that AES can only be effectively implemented by some of the best practical cryptographers in the world.)

DANE stapled certificates

We've supported DNSSEC stapled certificates in Chrome for a while now and since the DANE RFC has been published, it's about time that I updated that code to support DANE, rather than the hacked up CAA record that it currently uses.

I've written the DANE code, but it also seemed like an opportune time to reevaluate the idea. The promise of DNSSEC, as Dan Kaminsky put it, it that it can reduce the number of meetings required to get something done.

You have a whole bunch of internal hosts that some company that you're partnering with needs to interface with securely? Well, perhaps you used an internal CA, so will they install that root on their systems? (Meeting.) They will if it's name constrained, can we get it reissued with name constraints? (Meeting.) Ok, now can IT get that root pushed out everywhere? (Meeting.) You see how this is going.

But every bit of code needs to justify its complexity and, for small cases, StartSSL will give you free certificates. Perhaps just Chrome doing it isn't very interesting, and everyone hacks up custom solutions anyway.

So, if it's useful to you, you should let me know (agl at chromium dot org).

CRIME

Last year I happened to worry on the SPDY mailing list about whether sensitive information could be obtained via SPDY's use of zlib for compressing headers. Sadly, I never got the time to follow up and find out whether it was a viable attack. Thankfully there exist security researchers who, independently, wondered the same thing and did the work for me! Today Duong and Rizzo presented that work at ekoparty 2012.

They were also kind enough to let Firefox and ourselves know ahead of time so that we could develop and push security fixes before the public presentation. In order to explain what we did, let's start by looking at how SPDY compressed headers:

(This is inline SVG, if you can't see it, check here.)

? ? ? ? ? ? ? ? : h o s t ? ? ? ? w w w . g o o g l e . c o m ? ? ? ? : m e t h o d ? ? ? ? G E T ? ? ? ? : p a t h ? ? ? ? / ? ? ? ? : s c h e m e ? ? ? ? h t t p s ? ? ? ? : v e r s i o n ? ? ? ? H T T P / 1 . 1 ? ? ? ? a c c e p t ? ? ? ? t e x t / h t m l , a p p l i c a t i o n / x h t m l + x m l , a p p l i c a t i o n / x m l ; q = 0 . 9 , * / * ; q = 0 . 8 ? ? ? ? a c c e p t - c h a r s e t ? ? ? ? I S O - 8 8 5 9 - 1 , u t f - 8 ; q = 0 . 7 , * ; q = 0 . 3 ? ? ? ? a c c e p t - e n c o d i n g ? ? ? ? g z i p , d e f l a t e , s d c h ? ? ? ? a c c e p t - l a n g u a g e ? ? ? ? e n - U S , e n ; q = 0 . 8 ? ? ? ? c o o k i e ? ? ? ? P R E F = I D = 7 0 c b b f e 7 e e 8 8 a 5 e 2 : F F = 0 : T M = 1 3 4 7 4 0 0 4 7 9 : L M = 1 7 8 7 4 1 4 3 9 5 : S = f J q m r k s _ 8 G h 3 i 1 K D ; N I D = 6 3 = k 9 f K d q _ 2 g u F 0 O 1 F 5 g V 6 K 3 C I t F b x d d 2 f D W L g x T H f a Q 3 5 P q 4 S D d d d A i H F 9 G G 9 6 2 3 A J v - W U b U p A h 8 _ 0 Y Z T 6 B H Q N 4 f A B 2 j O 3 _ Y 5 q H 7 w x e d A N v I n J 5 R r h l s i i h M Y q e - s u 1 U 1 O ? ? ? ? u s e r - a g e n t ? ? ? g M o z i l l a / 5 . 0 ( X 1 1 ; L i n u x x 8 6 _ 6 4 ) A p p l e W e b K i t / 5 3 7 . 1 0 ( K H T M L , l i k e G e c k o ) C h r o m e / 2 3 . 0 . 1 2 6 4 . 0 S a f a r i / 5 3 7 . 1 0 ? ? ? ? x - c h r o m e - v a r i a t i o n s ? ? ? 0 C M + 1 y Q E I l b b J A Q i d t s k B C K S 2 y Q E I p 7 b J A Q i 9 t s k B C L u D y g E = ? ? ? ? ? ? ? ? : h o s t ? ? ? ? w w w . g o o g l e . c o m ? ? ? ? : m e t h o d ? ? ? ? G E T ? ? ? ? : p a t h ? ? ? ? / c s i ? v = 3 ? s = w e b h p ? a c t i o n = ? s r t = 2 4 6 ? p = s ? n p n = 1 ? e = 1 7 2 5 9 , 2 3 6 2 8 , 2 3 6 7 0 , 3 2 6 9 0 , 3 5 7 0 4 , 3 7 1 0 2 , 3 8 0 3 4 , 3 8 4 4 9 , 3 8 4 6 6 , 3 9 1 5 4 , 3 9 3 3 2 , 3 9 5 2 3 , 3 9 9 7 8 , 4 0 1 9 5 , 4 0 3 3 3 , 3 3 0 0 0 4 7 , 3 3 0 0 1 1 7 , 3 3 0 0 1 2 5 , 3 3 0 0 1 3 2 , 3 3 0 0 1 3 5 , 3 3 0 0 1 5 7 , 3 3 1 0 0 1 1 , 4 0 0 0 1 1 6 , 4 0 0 0 2 6 0 , 4 0 0 0 2 6 7 , 4 0 0 0 2 7 8 , 4 0 0 0 3 0 8 , 4 0 0 0 3 5 2 , 4 0 0 0 3 5 4 , 4 0 0 0 4 7 2 , 4 0 0 0 4 7 6 , 4 0 0 0 5 1 6 , 4 0 0 0 5 1 9 , 4 0 0 0 5 5 3 , 4 0 0 0 5 9 3 , 4 0 0 0 6 0 5 , 4 0 0 0 6 1 6 , 4 0 0 0 7 6 2 , 4 0 0 0 8 2 5 , 4 0 0 0 8 3 7 , 4 0 0 0 8 4 1 , 4 0 0 0 8 4 9 ? e i = Y r N P U N P t H 8 T C g A f Y 7 I C A C g ? i m c = 2 ? i m n = 2 ? i m p = 2 ? r t =

That's a pretty busy diagram! But I don't think it's too bad with a bit of explanation:

zlib uses a language with basically two statements: “output these literal bytes” and “go back x bytes and duplicate y bytes from there”. In the diagram, red text was included literally and black text came from duplicating previous text.

The duplicated text is underlined. A dark blue underline means that the original text is in the diagram and there will be a gray line pointing to where it came from. (You can hover the mouse over one of those lines to make it darker.)

A light blue underline means that the original text came from a pre-shared dictionary of strings. SPDY defines some common text, for zlib to be able to refer to, that contains strings that we expect to find in the headers. This is most useful at the beginning of compression when there wouldn't otherwise be any text to refer back to.

The problem that CRIME highlights is that sensitive cookie data and an attacker controlled path is compressed together in the same context. Cookie data makes up most of the red, uncompressed bytes in the diagram. If the path contains some cookie data, then the compressed headers will be shorter because zlib will be able to refer back to the path, rather than have to output all the literal bytes of the cookie. If you arrange things so that you can probe the contents of the cookie incrementally, then (assuming that the cookie is base64), you can extract the cookie byte-by-byte by inducing the browser to make requests.

For details of how to get zlib to reveal that information in practice, I'll just refer you to Duong and Rizzo's CRIME presentation. It's good work.

In order to carry out this attack, the attacker needs to be able to observe your network traffic and to be able to cause many arbitrary requests to be sent. An active network attacker can do both by injecting Javascript into any HTTP page load that you make in the same session.

When we learned of this work, we were already in the process of designing the compression for SPDY/4, which avoids this problem. But we still needed to do something about SPDY/2 and SPDY/3 which are currently deployed. To that end, Chrome 21 and Firefox 15 have switched off SPDY header compression because that's a minimal change that easily backports.

Chrome has also switched off TLS compression, through which a very similar attack can be mounted.

But we like SPDY header compression because it saves significant amounts of data on the wire! Since SPDY/4 isn't ready to go yet we have a more complex solution for Chrome 22/23 that compresses data separately while still being backwards compatible.

Most importantly cookie data will only ever be duplicated exactly, and in its entirety, against other cookie data. Each cookie will also be placed in its own Huffman group (Huffman coding is a zlib detail that I skipped over in the explanation above). Finally, in case other headers contain sensitive data (i.e. when set by an XMLHttpRequest), non-standard headers will be compressed in their own Huffman group without any back references.

That's only a brief overview of the rules. The code to follow them and continue to produce a valid zlib stream wasn't one of the cleaner patches ever landed in Chrome and I'll be happy to revert it when SPDY/4 is ready. But it's effective at getting much of the benefit of compression back.

To the right are a couple of images of the same sort of diagram as above, but zoomed out. At this level of zoom, all you can really see are the blocks of red (literal) and blue (duplicated) bytes. The diagram on the right has the new rules enabled and, as you can see, there is certainly more red in there. However that's mostly the result of limited window size. In order to save on server memory, Chrome only uses 2048-byte compression windows and, under the new rules, a previous cookie value has to fit completely within the window in order to be matched. So things are a little less efficient until SPDY/4, although we might choose to trade a little more memory to make up for that.

SSL interstitial bypass rates

In yesterday's post I threw in the following that I've been asked to clarify:

“we know that those bypass buttons are clicked 60% of the time by Chrome users”

Chrome collects anonymous statistics from users who opt in to such collection (and thank you to those who do!). One of those statistics covers how frequently people bypass SSL interstitials. (As always, the Chrome privacy policy has the details.)

We define bypassing the interstitial as clicking the ‘Proceed anyway’ button and not bypassing as either closing the tab, navigating elsewhere, or clicking the ‘Back’ button.

I picked five days at random over the past six weeks and averaged the percentages of the time that users bypassed rather than not. That came to 61.6%.

There may be some biases here: we may have a biased population because we only include users who have opted in to statistics collection. We are also counting all interstitals: there may be a small number of users who bypass a lot of SSL errors. But that's the data that we have.

Living with HTTPS

(These are my notes from the first half of my talk at HOPE9 last weekend. I write notes like these not as a script, but so that I have at least some words ready in my head when I'm speaking. They are more conversational and less organised than a usual blog post, so please forgive me the rough edges.)

HTTPS tends to cause people to give talks mocking certificate security and the ecosystem around it. Perhaps that's well deserved, but that's not what this talk is about. If you want to have fun at the expense of CAs, dig up one of Moxie's talks. This talk deals with the fact that your HTTPS site, and the sites that you use, probably don't even reach the level where you get to start worrying about certificates.

I'm a transport security person so the model for this talk is that we have two computers talking over a malicious network. We assume that the computers themselves are honest and uncompromised. That might be a stretch in these malware-ridden times, but that's the area of host security and I'm not talking about that today. The network can drop, alter or fabricate packets at will. As a lemma, we also assume that the network can cause the browser to load any URL it wishes. The network can do this by inserting HTML into any HTTP request and we assume that every user makes some unencrypted requests while browsing.

Stripping

If the average user typed mail.google.com into a browser and saw the following, what fraction of them do you think would login, none the wiser?

Can you even see what's terribly wrong here?

The problem is that the page isn't served over HTTPS. It should have been, but when a user types a hostname into a browser, the default scheme is HTTP. The server may attempt to redirect users to HTTPS, but that redirect is insecure: a MITM attacker can rewrite it and keep the user on HTTP, spoofing the real site the whole time. The attacker can now intercept all the traffic to this perfectly well configured and secure website.

This is called SSL stripping and it's terribly simple and devastatingly effective. We probably don't see it very often because it's not something that corporate proxies need to do, so it's not in off-the-shelf devices. But that respite is unlikely to last very long and maybe it's already over: how would we even know if it was being used?

In order to stop SSL stripping, we need to make HTTPS the only protocol. We can't do that for the whole Internet, but we can do it site-by-site with HTTP Strict Transport Security (HSTS).

HSTS tells browsers to always make requests over HTTPS to HSTS sites. Sites become HSTS either by being built into the browser, or by advertising a header:

Strict-Transport-Security: max-age=8640000; includeSubDomains

The header is in force for the given number of seconds and may also apply to all subdomains. The header must be received over a clean HTTPS connection.

Once the browser knows that a site is HTTPS only, the user typing mail.google.com is safe: the initial request uses HTTPS and there's no hole for an attacker to exploit.

(mail.google.com and a number of other sites are already built into Chrome as HSTS sites so it's not actually possible to access accounts.google.com over HTTP with Chrome - I had to doctor that image! If you want to be included in Chrome's built-in HSTS list, email me.)

HSTS can also protect you, the webmaster, from making silly mistakes. Let's assume that you've told your mother that she should always type https:// before going to her banking site or maybe you setup a bookmark for her. That's honestly more than we can, or should, expect of our users. But let's say that our supererogatory user enters https://www.citibank.com in order to securely connect to her bank. What happens? Well, https://www.citibank.com redirects her to http://www.citibank.com. They've downgraded the user! From there, the HTTP site should redirect back to HTTPS, but the damage has been done. An attacker can get in through the hole.

I'm honestly not picking on Citibank here. They were simply the second site that I tried and I was some surprised that the first site didn't have the problem. It's a very easy mistake to make, and everything just works! It's a completely silent disaster! But HSTS would have either prevented it, or would have failed closed.

HSTS also does something else. It turns this:

Into this:

The “bypass this certificate error” button has gone. That button is a UI disaster. Asking regular people to evaluate the validity of X.509 certificates is insane. It's a security cop-out that we're saddled with, and which is causing real damage.

We've seen widespread MITM attacks using invalid certificates in Syria and, in recent weeks, Jordan. These attacks are invalid! This is squarely within our threat model for transport security and it shouldn't have been a risk for anybody. But we know that those bypass buttons are clicked 60% of the time by Chrome users. People are certainly habituated to clicking them and I bet that a large number of people were victims of attacks that we should have been able to prevent.

If you take only one thing away from this talk, HSTS should be it.

Mixed scripting

One we're sorted HSTS we have another problem. These snippets of HTML are gaping wide holes in your security:

<script src="http://...

<link href="http://...

<embed src="http://...

It's called mixed scripting and it happens when a secure site loads critical sub-resources over HTTP. It's a subset of mixed content: mixed content covers loading any sub-resource insecurely. Mixed content is bad, but when the resource is Javascript, CSS, or a plugin we give is another name to make it clear that its a lot more damaging.

When you load sub-resources over HTTP, an attacker can replace them with content of their choosing. The attacker also gets to choose any page on your HTTPS site with the problem. That includes pages that you don't expect to be served over HTTPS, but happen to be mapped. If you have this problem anywhere, on any HTTPS page, the attacker wins.

With complex sites, it's very difficult to ensure that this doesn't happen. One good way to limit it is to only serve over HTTPS so that there aren't any pages that you expect to serve over HTTP. Also, HSTS might also save you if you're loading mixed script from the same domain.

Another mitigation is to use scheme-relative URLs everywhere possible. These URLs look like //example.com/bar.js and are valid in all browsers. They inherit the scheme of the parent page, which will be HTTP or HTTPS as needed. (Although it does mean that if you load the page from disk then things will break. The scheme will be file:// in that case.)

Fundamentally this is such an easy mistake to make, and such a problem, that the only long term solution is for browsers to stop loading insecure Javascript, CSS and plugins for HTTPS sites. To their credit, IE9 does this and did it before Chrome. But I'm glad to say that Chrome has caught up and mixed scripting is now blocked by default, although with a user-override:

Yes, there's another of those damm bypass buttons. But we swear that it's just a transitional strategy and it already stops silent exploitation in a hidden iframe.

Cookies

HTTP and HTTPS cookie jars are the same. No really: cookies aren't scoped to a protocol! That means that if you set a Cookie on https://example.com and then make a request to http://example.com, the cookies will be sent in the clear! In order to prevent this, you should be setting secure on your HTTPS cookies. Sadly this is a very easy thing to miss because everything will still work if you omit it, but without the secure tag, attackers can easily steal your cookies.

It's worth noting that HSTS can protect you from this: by preventing the HTTP request from being sent, the cookies can't be leaked that way, but you'll need to include all subdomains in the HSTS coverage.

There's a second corollary to this: attackers can set your HTTPS cookies to. By causing an request to be sent to http://example.com, they can spoof a reply with cookies that then override any existing cookies. In this fashion, an attacker can log a user in as themselves during their interaction with an HTTPS site. Then, say, emails that they send will be saved in the attacker's out-box.

There's no very good protection against this except HSTS again. By preventing any HTTP requests you can stop the attacker from spoofing a reply to set the cookies. Against, HSTS needs to cover all subdomains in order to be effective against this.

Get yourself checked out

You should go to https://www.ssllabs.com and run their scan against your site. It's very good, but ignore it if it complains about the BEAST attack. It'll do so if you make a CBC ciphersuite your top preference. Browsers have worked around BEAST and non-browser clients are very unlikely to provide the attacker enough access to be able to pull it off. You have a limited amount of resources to address HTTPS issues and I don't think BEAST should make the list.

Get a real certificate

You should get a real certificate. You probably already have one but, if you don't, then you're just training more people to ignore certificate errors and you can't have HSTS without a real certificate. StartSSL give them away for free. Get one.

If you've reached this far and have done all of the above, congratulations: you're in the top 0.1% of HTTPS sites. If you're getting bored, this is a reasonable point to stop reading: everything else is just bonus points from now on.

Forward secrecy

You should consider forward secrecy. Forward secrecy means that the keys for a connection aren't stored on disk. You might have limited the amount of information that you log in order to protect the privacy of your users, but if you don't have forward secrecy then your private key is capable of decrypting all past connections. Someone else might be doing the logging for you.

In order to enable forward secrecy you need to have DHE or ECDHE ciphersuites as your top preference. DHE ciphersuites are somewhat expensive if you're terminating lots of SSL connections and you should be aware that your server will probably only allow 1024-bit DHE. I think 1024-bit DHE-RSA is preferable to 2048-bit RSA, but opinions vary. If you're using ECDHE, use P-256.

You also need to be aware of Session Tickets in order to implement forward secrecy correctly. There are two ways to resume a TLS connection: either the server chooses a random number and both sides store the session information, of the server can encrypt the session information with a secret, local key and send that to the client. The former is called Session IDs and the latter is called Session Tickets.

But Session Tickets are transmitted over the wire and so the server's Session Ticket encryption key is capable of decrypting past connections. Most servers will generate a random Session Ticket key at startup unless otherwise configured, but you should check.

I'm not going to take the time to detail how to configure this here. There are lots of webservers and it would take a while. This is more of a pointer so that you can go away and research it if you wish.

(The rest of the talk touched on OpenSSL speed, public key pinning, TLS 1.1 and 1.2 and a few other matters, but I did those bits mostly on the fly and don't have notes for them.)

Decrypting SSL packet dumps

We all love transport security but it can get in the way of a good tcpdump. Unencrypted protocols like HTTP, DNS etc can be picked apart for debugging but anything running over SSL can be impenetrable. Of course, that's an advantage too: the end-to-end principle is dead for any common, unencrypted protocol. But we want to have our cake and eat it.

Wireshark (a common tool for dissecting packet dumps) has long had the ability to decrypt some SSL connections given the private key of the server, but the private key isn't always something that you can get hold of, or want to spread around. MITM proxies (like Fiddler) can sit in the middle of a connection and produce plaintext, but they also alter the connection: SPDY, client-certificates etc won't work through them (at least not without special support).

So here's another option: if you get a dev channel release of Chrome and a trunk build of Wireshark you can run Chrome with the environment variable SSLKEYLOGFILE set to, say, /home/foo/keylog. Then, in Wireshark's preferences for SSL, you can tell it about that key log file. As Chrome makes SSL connections, it'll dump an identifier and the connection key to that file and Wireshark can read those and decrypt SSL connections.

The format of the key log file is described here. There's an older format just for RSA ciphersuites that I added when Wireshark decrypted purely based on RSA pre-master secrets. However, that doesn't work with ECDHE ciphersuites (amongst others) so the newer format can be used to decrypt any connection. (You need the trunk build of Wireshark to support the newer format.) Chrome currently writes records of both formats.

This can also be coupled with spdy-shark to dissect SPDY connections.

Since key log support is part of NSS, support will hopefully end up in Firefox in the future.

New TLS versions

TLS is the protocol behind most secure connections on the Internet and most TLS is TLS 1.0, despite that fact that the RFC for 1.0 was published in January 1999, over 13 years ago.

Since then there have a two newer versions of TLS: 1.1 (2006) and 1.2 (2008). TLS 1.1 added an explicit IV for CBC mode ciphers as a response to CBC weaknesses that eventually turned into the BEAST attack. TLS 1.2 changes the previous MD5/SHA1 combination hash to use SHA256 and introduces AEAD ciphers like AES-GCM.

However, neither of these versions saw any significant adoption for a long time because TLS's extension mechanism allowed 1.0 to adapt to new needs.

But things are starting to change:

In the long run, getting to 1.2 is worthwhile. The MD5/SHA1 hash combination used previous versions was hoped to be more secure than either hash function alone, but [1] suggests that it's probably only as secure as SHA1. Also, the GCM cipher modes allow AES to be used without the problems (and space overhead) of CBC mode. GCM is hardware accelerated in recent Intel and AMD chips along with AES itself.

But there are always realities to contend with I'm afraid:

Firstly, there's the usual problem of buggy servers. TLS has a version negotiation mechanism, but some servers will fail if a client indicates that it supports the newer TLS versions. (Last year, Yngve Pettersen suggested that 2% of HTTPS servers failed if the client indicated TLS 1.1 and 3% for TLS 1.2.)

Because of this Chrome implements a fallback from TLS 1.1 to TLS 1.0 if the server sends a TLS error. (And we have a fallback from TLS 1.0 to SSL 3.0 if we get another TLS error on the second try.) This, sadly, means that supporting TLS 1.1 cannot bring any security benefits because an attacker can cause us to fallback. Thankfully, the major security benefit of TLS 1.1, the explicit CBC IVs, was retrofitted to previous versions in the form of 1/n-1 record splitting after the BEAST demonstration.

Since these fallbacks can be a security concern (especially the fallback to SSLv3, which eliminates ECDHE forward secrecy) I fear that it's necessary to add a second, redundant version negotiation mechanism to the protocol. It's an idea which has been floated before and I raised it again recently.

But buggy servers are something that we've known about for many years. Deploying new TLS versions has introduced a new problem: buggy networks.

Appallingly it appears that there are several types of network device that manage to break when confronted with new TLS versions. There are some that break any attempt to offer TLS 1.1 or 1.2, and some that break any connections that negotiate these versions. These failures, so far, manifest in the form of TCP resets, which isn't currently a trigger for Chrome to fallback. Although we may be forced to add it.

Chrome dev or iOS users suffering from the first type of device see all of their HTTPS connections fail. Users suffering the second type only see failures when connecting to sites that support TLS 1.1 or 1.2. (Which includes Google.). iOS leaves it up to the application to implement fallback if they wish and adding TLS 1.2 support to Google's servers has caused some problems because of these bad networks.

We're working to track down the vendors with issues at the moment and to make sure that updates are available, and that they inform their customers of this. I'm very interested in any cases where Chrome 21 suddenly caused all or some HTTPS connections to fail with ERR_CONNECTION_RESET. If you hit this, please let me know (agl at chromium dot org).

([1] Antoine Joux, Multicollisions in Iterated Hash Functions: Application to Cascaded Constructions, CRYPTO (Matthew K. Franklin, ed.), Lecture Notes in Computer Science, vol. 3152, Springer, 2004, pp. 306–316.)

False Start's Failure

Eighteen months ago(ish), Chrome started using False Start. False Start reduces the average time for an SSL handshake by 30%.

Since the biggest problem with transport security is that most sites don't use it, anything that reduces the latency impact of HTTPS is important. Making things faster doesn't just make them faster, it also makes them cheaper and more prevalent. When HTTPS is faster, it'll be used in more places than it would otherwise be.

But, sadly, False Start will be disabled, except for sites doing NPN, in Chrome 20. NPN is a TLS extension that we use to negotiate SPDY, although you don't have to use it to negotiate SPDY, you can advertise http/1.1 if you wish.

False Start was known to cause problems with a very small number of servers and the initial announcement outlined the uncommon scheme that we used to deploy it: we scanned the public Internet and built up a list of problematic sites. That list was built into Chrome and we didn't use False Start for connections to those sites. Over time the list was randomly eroded away and I'd try to address any issues that came up. (Preemptively so in the case of large sites.)

It did work to some extent. Many sites that had problems were fixed and it's a deployment scheme that is worth considering in the future. But it didn't ultimately work well enough for False Start.

Initially we believed that False Start issues were deterministic so long as the TLS Finished and application data records were sent in the same TCP packet. We changed Chrome to do this in the hopes of making False Start issues deterministic. However, we later discovered some HTTPS servers that were still non-deterministically False Start intolerant. I hypothesise that the servers run two threads per connection: one for reading and one for writing. Although the TCP packet was received atomically, thread scheduling could mean that the read thread may or may not be scheduled before the write thread had updated the connection state in response to the Finished.

This non-determinism made False Start intolerance difficult to diagnose and reduced our confidence in the blacklist.

The `servers' with problems were nearly always SSL terminators. These hardware devices terminate SSL connections and proxy unencrypted data to backend HTTP servers. I believe that False Start intolerance is very simple to fix in the code and one vendor suggested that was the case. None the less, of the vendors who did issue an update, most failed to communicate that fact to their customers. (A pattern that has repeated with the BEAST fix.)

One, fairly major, SSL terminator vendor refused to update to fix their False Start intolerance despite problems that their customers were having. I don't believe that this was done in bad faith, but rather a case of something much more mundane along the lines of “the SSL guy left and nobody touches that code any more”. However, it did mean that there was no good answer for their customers who were experiencing problems.

Lastly, it was becoming increasingly clear that we had a bigger problem internationally. Foreign admins have problems finding information on the subject (which is mostly in English) and foreign users have problems reporting bugs because we can't read them. We do have excellent agents in countries who liaise locally but it was still a big issue, and we don't cover every country with them. I also suspect that the distribution of problematic SSL terminators is substantially larger in some countries and that the experience with the US and Europe caused us to underestimate the problem.

In aggregate this lead us to decide that False Start was causing more problems than it was worth. We will now limit it to sites that support the NPN extension. This unfortunately means that it'll be an arcane, unused optimisation for the most part: at least until SPDY takes over the world.

Very large RSA public exponents

After yesterday's post that advocated using RSA public exponents of 3 or 216+1 in DNSSEC for performance, Dan Kaminsky asked me whether there was a potential DoS vector by using really big public exponents.

Recall that the RSA signature verification core is me mod n. By making e and n larger, we can make the operation slower. But there are limits, at least in OpenSSL:

/* for large moduli, enforce exponent limit */
if (BN_num_bits(rsa->n) > OPENSSL_RSA_SMALL_MODULUS_BITS) {
        if (BN_num_bits(rsa->e) > OPENSSL_RSA_MAX_PUBEXP_BITS) {
                RSAerr(RSA_F_RSA_EAY_PUBLIC_ENCRYPT, RSA_R_BAD_E_VALUE);
                return -1;
        }
}

So, if n is large, we enforce a limit on e. The values of the #defines are such that for n>3072 bits, e must be less than or equal to 64 bits. So the slowest operations happen with an n and e of 3072 bits. (The fact that e<n is enforced earlier in the code.)

So I setup *.bige.imperialviolet.org. This is a perfectly valid and well signed zone which happens to use a 3072-bit key with a 3072-bit public exponent. (I could probably have slowed things down more by picking a public exponent with lots of 1s in its binary representation, but it's just a random number in this case.) One can resolve records 1.bige.imperialviolet.org, 2.bige.imperialviolet.org, … and the server doesn't have to sign anything because it's a wildcard: a single signature covers all of the names. However, the resolver validates the signature every time.

On my 2.66GHz, Core 2 laptop, 15 requests per second causes unbound to take 95% of a core. A couple hundred queries per second would probably put most DNSSEC resolvers in serious trouble.

So I'd recommend limiting the public exponent size in DNSSEC to 216+1, except that people are already mistakenly using 232+1, so I guess that needs to be the limit. The DNSSEC spec limits the modulus size to 4096-bits, and 4096-bit signatures are about 13 times slower to verify than the typical 1024-bit signatures used in DNSSEC. But that's a lot less of a DoS vector than bige.imperialviolet.org, which is 2230 times slower than normal.

RSA public exponent size

RSA public operations are much faster than private operations. Thirty-two times faster for a 2048-bit key on my machine. But the two operations are basically the same: take the message, raise it to the power of the public or private exponent and reduce modulo the key's modulus.

What makes the public operations so much faster is that the public exponent is typically tiny, while the private exponent should be the same size as the modulus. One can actually use a public exponent of three in RSA, and clearly cubing a number should be faster than raising it to the power of a 2048-bit number.

In 2006, Daniel Bleichenbacher (a fellow Googler) gave a talk at the CRYPTO 2006 rump session where he outlined a bug in several, very common RSA signature verification implementations at the time. The bug wasn't nearly as bad as it could have been because it only affected public keys that used a public exponent of three and most keys used a larger exponent: 216+1. But the fact that a slightly larger public exponent saved these buggy verifiers cemented 216+1 as sensible default value for the public exponent.

But there's absolutely no reason to think that a public exponent larger than 216+1 does any good. I even checked that with Bleichenbacher before writing this. Three should be fine, 216+1 saved some buggy software a couple of times and any larger is probably a mistake.

Because of that, when writing the Go RSA package, I didn't bother supporting public exponents larger than 231-1. But then several people reported that they couldn't verify some DNSSEC signatures. Sure enough, the DNSKEY records for .cz and .us are using a public exponent of 232+1.

DNSSEC is absolutely the last place to use large, public exponents. Busy DNS resolvers have to resolve tens or hundreds of thousands of records a second; fast RSA signature verification is big deal in DNSSEC. So I measured the speed impact of various public exponent sizes with OpenSSL (1.0.1-beta3):

Public exponentVerification time (µs)
310.6
216+123.9
232+142.7
2127-1160.7

So a public exponent of 232+1 makes signature verification over four times slower than an exponent of three. As DNSSEC grows, and DNSSEC resolvers too, that extra CPU time is going to be a big deal.

It looks like the zones using a value of 232+1 are just passing the -e flag to BIND's dnssec-keygen utility. There's some suggestion that -e used to select 216+1 but silently changed semantics in some release.

So today's lesson is don't pass -e to dnssec-keygen! The default of dnssec-keygen is 216+1 and that's certainly safe. The .com zone uses a value of three and I think that's probably the best choice given the CPU cost and the fact that the original Bleichenbacher bug has been long since fixed.

Forward secrecy for IE and Safari too

When we announced forward secrecy for Google HTTPS we noted that “we hope to support IE in the future”. It wasn't that we didn't want to support IE, but IE doesn't implement the combination of ECDHE with RC4. IE (or rather, SChannel) does implement ECDHE with AES on Vista and Windows 7, but I only wanted to make one change at a time.

With the release of MS12-006 a month ago to address the BEAST weakness in TLS 1.0's CBC design, we've now made ECDHE-RSA-AES128-SHA our second preference cipher. That means that Chrome and Firefox will still use ECDHE-RSA-RC4-SHA, but IE on Vista and Windows 7 will get ECDHE now. This change also means that we support ECDHE with Safari (at least on Lion, where I tried it.)

RSA 2012

Just a brief note that I'll be at RSA 2012 in San Francisco at the end of the month and will be speaking with several others about certificate revocation. (tech-106, Tuesday, 1pm.).

If you want to chat and don't manage to grab me then, drop me an email. (agl at imperialviolet.org if you didn't already know.)

Revocation checking and Chrome's CRL

When a browser connects to an HTTPS site it receives signed certificates which allow it to verify that it's really connecting to the domain that it should be connecting to. In those certificates are pointers to services, run by the Certificate Authorities (CAs) that issued the certificate, that allow the browser to get up-to-date information.

All the major desktop browsers will contact those services to inquire whether the certificate has been revoked. There are two protocols/formats involved: OCSP and CRL, although the differences aren't relevant here. I mention them only so that readers can recognise the terms in other discussions.

The problem with these checks, that we call online revocation checks, is that the browser can't be sure that it can reach the CA's servers. There are lots of cases where it's not possible: captive portals are one. A captive portal frequently requires you to sign in on an HTTPS site, but blocks traffic to all other sites, including the CA's OCSP servers.

If browsers were to insist on talking to the CA before accepting a certificate, all these cases would stop working. There's also the concern that the CA may experience downtime and it's bad engineering practice to build in single points of failure.

Therefore online revocation checks which result in a network error are effectively ignored (this is called “soft-fail”). I've previously documented the resulting behaviour of several browsers.

But an attacker who can intercept HTTPS connections can also make online revocation checks appear to fail and so bypass the revocation checks! In cases where the attacker can only intercept a subset of a victim's traffic (i.e. the SSL traffic but not the revocation checks), the attacker is likely to be a backbone provider capable of DNS or BGP poisoning to block the revocation checks too.

If the attacker is close to the server then online revocation checks can be effective, but an attacker close to the server can get certificates issued from many CAs and deploy different certificates as needed. In short, even revocation checks don't stop this from being a real mess.

So soft-fail revocation checks are like a seat-belt that snaps when you crash. Even though it works 99% of the time, it's worthless because it only works when you don't need it.

While the benefits of online revocation checking are hard to find, the costs are clear: online revocation checks are slow and compromise privacy. The median time for a successful OCSP check is ~300ms and the mean is nearly a second. This delays page loading and discourages sites from using HTTPS. They are also a privacy concern because the CA learns the IP address of users and which sites they're visiting.

On this basis, we're currently planning on disabling online revocation checks in a future version of Chrome. (There is a class of higher-security certificate, called an EV certificate, where we haven't made a decision about what to do yet.)

Pushing a revocation list

Our current method of revoking certificates in response to major incidents is to push a software update. Microsoft, Opera and Firefox also push software updates for serious incidents rather than rely on online revocation checks. But our software updates require that users restart their browser before they take effect, so we would like a lighter weight method of revoking certificates.

So Chrome will start to reuse its existing update mechanism to maintain a list of revoked certificates, as first proposed to the CA/Browser Forum by Chris Bailey and Kirk Hall of AffirmTrust last April. This list can take effect without having to restart the browser.

An attacker can still block updates, but they have to be able to maintain the block constantly, from the time of revocation, to prevent the update. This is much harder than blocking an online revocation check, where the attacker only has to block the checks during the attack.

Since we're pushing a list of revoked certificates anyway, we would like to invite CAs to contribute their revoked certificates (CRLs) to the list. We have to be mindful of size, but the vast majority of revocations happen for purely administrative reasons and can be excluded. So, if we can get the details of the more important revocations, we can improve user security. Our criteria for including revocations are:

  1. The CRL must be crawlable: we must be able to fetch it over HTTP and robots.txt must not exclude GoogleBot.
  2. The CRL must be valid by RFC 5280 and none of the serial numbers may be negative.
  3. CRLs that cover EV certificates are taken in preference, while still considering point (4).
  4. CRLs that include revocation reasons can be filtered to take less space and are preferred.

For the curious, there is a tool for fetching and parsing Chrome's list of revoked certificates at https://github.com/agl/crlset-tools.

Extracting Mozilla's Root Certificates

When people need a list of root certificates, they often turn to Mozilla's. However, Mozilla doesn't produce a nice list of PEM encoded certificates. Rather, they keep them in a form which is convenient for NSS to build from: https://mxr.mozilla.org/mozilla/source/security/nss/lib/ckfw/builtins/certdata.txt?raw=1.

Several people have written quick scripts to try and convert this into PEM format, but they often miss something critical: some certificates are explicitly distrusted. These include the DigiNotar certificates and the misissued COMODO certificates. If you don't parse the trust records from the NSS data file, then you end up trusting these too! There's at least one, major example of this that I know of.

(Even with a correct root file, unless you do hard fail revocation checking you're still vulnerable to the misissued COMODO certificates.)

So, at the prodding of Denton Gentry, I've open-sourced a tool for converting NSS's file to PEM format: extract-nss-root-certs. (At the time of writing it requires a 6g built from the weekly or current tree (hg -r weekly), not the release tree. A few of the APIs have changed since the last Go release was done. This will be resolved when Go 1.0 is released.)

BEAST followup

(See the original post for background.)

Everyone seems to have settled on 1/n-1 record splitting as a workaround for the BEAST attack in TLS 1.0 and SSLv3. Briefly: 1/n-1 record splitting breaks CBC encrypted records in two: the first with only a single byte of application data and the second with the rest. This effectively randomises the IV and stops the attack.

The workaround which OpenSSL tried many years ago, and which hit significant issues, was 0/n record splitting. It's the same thing, but with the first record being empty. The problem with it was that many stacks processed the empty record and returned a 0-byte read, which higher levels took to mean EOF.

1/n-1 record splitting doesn't hit that problem, but it turns out that there's a fair amount of code out there that assumes that the entire HTTP request comes in a single read. The single byte record breaks that.

We first implemented 1/n-1 record splitting in Chrome 15 but backed off after only a couple of days because logging into several large sites broke. But that did motivate the sites to fix things so that we could switch it on in Chrome 16 and it stuck that time.

Opera also implemented it around this time, but I think Chrome took the brunt of the bug reports and it's time consuming dealing with them. Myself and a colleague have been emailing and phoning a lot of sites and vendors while dealing with upset users and site admins. Chrome certainly paid a price for moving before Firefox and IE but then we're nice like that.

Thankfully, this week, Microsoft released a security update which implements 1/n-1 record splitting in SChannel and switches it on in IE. (Although it defaults to off for other users of SChannel, unlike NSS.) Now the sites which broke with Chrome 16 are also broken in a patched IE and that takes some pressure off us. In a few weeks, Firefox 10 should be released and then we'll be about as good as we're going to get.

After taking the brunt with Chrome 16, there is one case that I'm not going to fight: Plesk can't handle POST payloads that don't come in a single read. Chrome (currently) sends POSTs as two writes: one for the HTTP headers and a second for the POST body. That means that each write is split into two records and Plesk breaks because of the second split. IE and Firefox send the headers and body in a single write, so there's only a single split in the HTTP headers, which Plesk handles.

Chrome will start merging small POST bodies into the headers with Chrome 17 (hopefully) and this will fix Plesk. Also, merging as Firefox and IE do saves an extra packet so it's worthwhile on its own. Once again, anything that's mostly true soon becomes an unwritten rule on the Internet.

It's worth contrasting the BEAST response to the renegotiation attack. The BEAST workaround caused a number of problems, but it worked fine for the vast majority of sites. The renegotiation fix requires that very nearly every HTTPS site on the Internet be updated and then that browsers refuse to talk to unpatched servers.

I'd bet that we'll not manage to get enough patched servers for any browser to require it this side of 2020. Unpatched servers can still disable renegotiation to protect themselves, but it's still not hard to find major sites that allow insecure renegotiation (www.chase.com was literally the second site that I tried).

OTR in Go

“Off the record” is, unfortunately, an overloaded term. To many it's feature in gTalk and AOL IM which indicates that the conversation isn't logged. However, to crypto folks it's a protocol for secure chat.

(In fact, resoloving the ambiguity is on the EFF's wish list.)

Pidgin has been my chat client of choice for some time because it's pretty fully featured and supports OTR via a plugin. However, I just don't trust it from a security point of view. The latest incident was only a couple of weeks ago: CVE-2011-3919.

So, I implemented otr in Go, as well as an XMPP library and client. It's an absolutely minimal client (except for OTR support) and implements only what I absolutely need in a client.

But it does mean that the whole stack, including the TLS library, is implemented in a memory safe language. (On the other hand, pretty much everything in that stack, from the modexp function to the terminal handling code was written by me and has never really been audited. I'm a decent programmer but I'm sure there are some howlers of security issues in there somewhere.)

Certificate Transparency

(I don't have comments on this blog, but you can comment on my Google+ post.)

Ben Laurie and I have been working on a longer term plan for improving the foundations of the certificate infrastructure on which most Internet transport security is based on these days. Although Chrome has public key pinning for some domains, which limits the set of permitted certificates, we don't see public key pinning as a long term solution (and nor was it ever designed to be).

For the 10 second summary of the plan, I'll quote Ben: “certificates are registered in a public audit log. Servers present proofs that their certificate is registered, along with the certificate itself. Clients check these proofs and domain owners monitor the logs.”. I would add only that anyone can check the logs: the certificate infrastructure would be fully transparent.

We now have an outline of the basic idea and will be continuing to flesh it out in the coming months, hopefully in conjunction with other browser vendors.

But I thought that, at the outset, it would be helpful to describe some of the limitations to the design space, as I see them:

No side-channels

As I've previously described, side-channels occur when a browser needs to contact a server other than the immediate destination in order to verify a certificate. Revocation checking with OCSP is an example of a side-channel used today.

But in order to be effective, side-channels invariably need to block the certificate verification until they complete, and that's a big problem. The Internet isn't fully connected. Captive portals, proxies and firewalls all mean that the only thing you can really depend on talking to is the immediate destination server. Because of this, browsers have never been able to make OCSP lookups blocking, and therefore OCSP is basically useless for security.

And that's not to mention the privacy, performance and functionality issues that arise from needing side-channels. (What happens when the side-channel server goes down?)

So our design requires that the servers send us the information that we require. We can use side-channels to check up on the logs, but it's an asynchronous lookup.

It's not opt-in, it's all-in

SSL Stripping is a very easy and very effective attack. HSTS prevents it and is as easy to deploy as anything can be for HTTPS sites. But, despite all this, and despite a significant amount of evangelism, take up has been very limited, even by sites which are HTTPS only and the subject of attacks.

While HSTS really has to be opt-in, a solution to the certificate problem doesn't. Although our scheme is incrementally deployable, the eventual aim is that it's required for everybody. Thankfully, since certificates have to be renewed there's an obvious means to incrementally deploy it: require it for certificates issued after a certain date. Although an eventual hard requirement is still needed, it's a lot less of a problem.

It's easy on the server operator

Since the aim is to make it a requirement for all servers, we've sacrificed a lot in order to make it very easy on the server operator. For most server operators, their CA will probably take care of fetching the audit proofs, meaning there's no additional work at all.

Some initial designs included short-lived log signatures, which would have solved the revocation problem. (Revocation would simply be a matter of instructing the logs to stop signing a given certificate.) However, this would have required server operators to update their audit proofs on a near-daily basis. After discussions it became clear that such a requirement wouldn't be tenable for many and so we reluctantly dropped it.

We are also sacrificing decentralisation to make things easy on the server. As I've previously argued, decentralisation isn't all it's cracked up to be in most cases because 99.99% of people will never change any default settings, so we haven't given up much. Our design does imply a central set of trusted logs which is universally agreed. This saves the server from possibly having to fetch additional audit proofs at runtime, something which requires server code changes and possible network changes.

There are more valid certificates than the one that's currently serving

Cheap virtual hosting and EC2 have made multi-homed services common. Even small to medium scale sites have multiple servers these days. So any scheme that asserts that the currently serving certificate is the only valid certificate will run into problems when certificates change. Unless all the servers are updated at exactly the same time, then users will see errors during the switch. These schemes also make testing a certificate with a small number of users impossible.

In the end, the only real authority on whether a certificate is valid is the site itself. So we don't rely on external observations to decide on whether a certificate is valid, instead to seek to make the set of valid certificates for a site public knowledge (which it currently isn't), so that the site can determine whether it's correct.

It's not easy to do

We believe that this design will have a significant, positive impact on an important part of Internet security and that it's deployable. We also believe that any design that shares those two properties ends up looking a lot like it. (It's no coincidence that we share several ideas with the EFF's Sovereign Keys.)

None the less, deployment won't be easy and, hopefully, we won't be pushing it alone.

Forward secrecy for Google HTTPS

As announced on the Google Security Blog, Google HTTPS sites now support forward secrecy. What this means in practice is two things:

Firstly, the preferred cipher suite for most Google HTTPS servers is ECDHE-RSA-RC4-SHA. If you have a client that supports it, you'll be using that ciphersuite. Chrome and Firefox, at least, support it.

Previously we were using RSA-RC4-SHA, which means that the client (i.e. browser) picks a random key for the session, encrypts it with the server's public key and sends it to the server. Since only the holder of the private key can decrypt the session key, the connection is secure.

However, if an attacker obtains the private key in the future then they can decrypt recorded traffic. The encrypted session key can be decrypted just as well in ten years time as it can be decrypted today and, in ten years time, the attacker has much more computing power to break the server's public key. If an attacker obtains the private key, they can decrypt everything encrypted to it, which could be many months of traffic.

ECDHE-RSA-RC4-SHA means elliptic curve, ephemeral Diffie-Hellman, signed by an RSA key. You can see a previous post about elliptic curves for an introduction, but the use of elliptic curves is an implementation detail.

Ephemeral Diffie-Hellman means that the server generates a new Diffie-Hellman public key for each session and signs the public key. The client also generates a public key and, thanks to the magic of Diffie-Hellman they both generate a mutual key that no eavesdropper can know.

The important part here is that there's a different public key for each session. If the attacker breaks a single public key then they can decrypt only a single session. Also, the elliptic curve that we're using (P-256) is estimated to be as strong as a 3248-bit RSA key (by ECRYPT II), so it's unlikely that the attacker will ever be able to break a single instance of it without a large, quantum computer.

While working on this, Bodo Möller, Emilia Kasper and I wrote fast, constant-time implementations of P-224, P-256 and P-521 for OpenSSL. This work has been open-sourced and submitted upstream to OpenSSL. We also fixed several bugs in OpenSSL's ECDHE handling during deployment and those bug fixes are in OpenSSL 1.0.0e.

Session Tickets

The second part of forward secrecy is dealing with TLS session tickets.

Session tickets allow a client to resume a previous session without requiring that the server maintain any state. When a new session is established the server encrypts the state of the session and sends it back to the client, in a session ticket. Later, the client can echo that encrypted session ticket back to the server to resume the session.

Since the session ticket contains the state of the session, and thus keys that can decrypt the session, it too must be protected by ephemeral keys. But, in order for session resumption to be effective, the keys protecting the session ticket have to be kept around for a certain amount of time: the idea of session resumption is that you can resume the session in the future, and you can't do that if the server can't decrypt the ticket!

So the ephemeral, session ticket keys have to be distributed to all the frontend machines, without being written to any kind of persistent storage, and frequently rotated.

Result

We believe that forward secrecy provides a fairly significant benefit to our users and we've contributed our work back to OpenSSL in the hope that others will make use of it.

Classifying solutions to the certificate problem

This is something that I wrote up internally, but which Chris Palmer suggested would be useful to post publicly for reference. It presents some, somewhat artificial, axis on which I believe that all proposed solutions to addressing weaknesses in the current certificate ecosystem fall. By dividing the problem into those decisions, I hope to make the better solutions clearer.

Axis 1: Private vs public signing

At the moment we have private signing. A CA can sign a certificate and tell nobody about it and the certificate is still valid. This is the reason that we have to go crawl the Internet in order to figure out how many intermediate CA certs there are.

In public signing, certificates aren't valid unless they're public.

There are degrees of how public schemes are. Convergence is a public scheme: certificates have to be visible to the notaries, but it's a lot less public than a scheme where all certificates have to be published. And published where? Highly public schemes imply some centralisation.

Private schemes don't protect us from CAs acting in bad faith or CAs which have been compromised. Public schemes help a lot in these cases, increasingly so the more public they are. Although a certificate might be valid for a short time, evidence of misbehavior is recorded. Public schemes also allow each domain to monitor all the certificates for their domain. Fundamentally, the only entity that can answer the question of whether a certificate is legitimate is the subject of the certificate itself. (See this tale of a Facebook certificate.)

Private schemes have the advantage of protecting the details of internal networks (i.e. not leaking the name www.corp.example.com).

Axis 2: Side channels or not

Revocation checking which calls back to the CA is a side channel. Convergence notaries are too. OCSP stapling is not, because the server provides the client with everything that it needs (assuming that OCSP stapling worked).

Side channels which need to be a hard fail are a problem: it's why revocation checking doesn't actually work. Private networks, hotel networks and server downtime are the issues. Side-channels are also a privacy and performance concern. But they're easier on the server operator and so more likely to be deployed.

In the middle are soft-fail side-channels. These offer limited protection because the connection proceeds even when the side-channel fails. They often queue the certificate up for later reporting.

Axis 3: Clocks or not

OCSP with nonces doesn't need clock sync. OCSP with time stamps does.

Keeping clocks in sync is a problem in itself, but it allows for short lived statements to be cached. It can also be a useful security advantage: Nonces require that a signing key be online because it has to sign a constant stream of requests. With clocks, a single response can be served up repeatedly and the key kept largely off-line. That moves the online key problem to the clock servers, but that's a smaller problem. A compromised clock server key can be handled by querying several concurrently and picking the largest cluster of values.

Examples

OCSP today is {private,side-channel,clock}. The 'let's fix OCSP' solutions are typically {private,side-channel,no-clock} (with an option for no-side-channel). Convergence is {mostly-public,side-channel,no-clock}.

I think the answer lies with {public,no-side-channel,clock}, but it's a trek to get there. Maybe something for a future post.

False Start: Brocade broken again

I wrote previously that Brocade had released a firmware update for their ServerIron SSL terminators which fixed an incompatibility with False Start. However, it now appears that they didn't fix the underlying issue, rather they special cased Chrome's current behaviour.

Sadly, due to Chrome implementing a workaround for the BEAST attack, their special casing doesn't work anymore and breaks Chrome 15.

So, if you run a Brocade ServerIron you should contact me ASAP. I'll be adding sites running this hardware to the False Start blacklist for Chrome 15 to allow Brocade to release another firmware update.

Chrome and the BEAST

Thai Duong and Juliano Rizzo today demoed an attack against TLS 1.0's use of cipher block chaining (CBC) in a browser environment. The authors contacted browser vendors several months ago about this and so, in order not to preempt their demo, I haven't discussed any details until now.

Contrary to several press reports, Duong and Rizzo have not found, nor do they claim, any new flaws in TLS. They have shown a concrete proof of concept for a flaw in CBC that, sadly, has a long history. Early reports of the problem date back nearly ten years ago and Bard published two papers detailing the problem.

The problem has been fixed in TLS 1.1 and a workaround for SSL 3.0 and TLS 1.0 is known, so why is this still an issue?

The workaround (prepending empty application data records) is perfectly valid according to the protocol but several buggy implementations of SSL/TLS misbehaved when it was enabled. After Duong and Rizzo notified us, we put the same workaround back into Chrome to see if the state of the Internet had improved in the years since the last attempt. Sadly it was still infeasible to implement this workaround and the change had to be reverted.

Use of TLS 1.1 would also have solved the issue but, despite it being published in 2006, common SSL/TLS libraries still don't implement it. But even if there was widespread deployment of TLS 1.1, it wouldn't have helped avoid problems like this. Due to a different, common bug in SSL 3.0 implementations (nearly 12 years after SSL 3.0 was obsoleted by TLS 1.0), browsers still have to perform SSL 3.0 downgrades to support buggy servers. So even with a TLS 1.1 capable browser and server, an attacker can trigger a downgrade to SSL 3.0 and bypass the protections of TLS 1.1.

Finally, the CBC attacks were believed to be largely theoretical but, as Duong and Rizzo have pointed out today, that's no longer the case.

Initially the authors identified HTML5 WebSockets as a viable method of exploiting the CBC weakness but, due to unrelated reasons, the WebSockets protocol was already in the process of changing in such a way that stopped it. The new WebSockets protocol shipped with Chrome 14 but several plugins have been found to offer features that might allow the attack to be performed.

Duong and Rizzo confirmed that the Java plugin can be used, but Chrome already blocks the execution of Java by default. Other plugins, if installed, can be disabled on the about:plugins page if the user wishes.

The attack is still a difficult one; the attacker has to have high-bandwidth MITM access to the victim. This is typically achieved by being on the same wireless network as the victim. None the less, it's a much less serious issue than a problem which can be exploited by having the victim merely visit a webpage. (Incidentally, we pushed out a fix to all Chrome users for such a Flash bug only a few days ago.)

Also, an attacker with MITM abilities doesn't need to implement this complex attack. SSL stripping and mixed-scripting issues are much easier to exploit and will take continued, sustained effort to address. Duong and Rizzo have highlighted this fact by choosing to attack one of the few HSTS sites.

Thanks to an idea suggested by Xuelei Fan, we have another workaround for the problem which will, hopefully, cause fewer incompatibility problems. This workaround is currently being tested on the Chrome dev and beta channels but we haven't pushed it on the stable channel yet. Since we don't really know if the fix will cause problems it's not something that we want to drop on people without testing. The benefit of a fix to Chrome's TLS stack is also limited as Chrome already uses the newer WebSockets protocol and it doesn't fix problems in plugins.

If it turns out that we've misjudged something we can quickly react, thanks to Chrome's auto-update mechanism.

It's also worth noting that Google's servers aren't vulnerable to this problem. In part due to CBC's history, Google servers have long preferred RC4, a cipher that doesn't involve CBC mode. Mention has also been made of Chrome's False Start feature but, since we don't believe that there are any vectors using Chrome's stack, that's immaterial.

DNSSEC Certificates now in Chrome Stable

A few months back I described DNSSEC authenticated HTTPS in Chrome. This allows sites to use DNSSEC, rather than traditional, certificates and is aimed at sites which currently use no HTTPS, or self-signed certificates. Since Chrome 14 is now stable, all Chrome users now have this experimental feature.

(Also, the serialisation format has been documented.)

Why not Convergence?

In light of recent events, I've had several requests to implement Convergence in Chrome. For those who don't know and, frankly, for anyone interested in SSL, I highly recommend watching Moxie's talk on the subject from this year's Black Hat. You can also check out the project website.

Moxie, having actually thought about the issue and coded something up, has already done a thousand times more to address the problem than almost anyone else. But I don't think that Convergence is something we would add in Chrome:

Although the idea of trust agility is great, 99.99% of Chrome users would never change the default settings. (The percentage is not an exaggeration.) Indeed, I don't believe that an option for setting custom notaries would even meet the standards for inclusion in the preferences UI.

Given that essentially the whole population of Chrome users would use the default notary settings, those notaries will get a large amount of traffic. Also, we have a very strong interest for the notaries to function, otherwise Chrome stops working. Combined, that means that Google would end up running the notaries. So the design boils down to Chrome phoning home for certificate validation. That has both unacceptable privacy implications and very high uptime requirements on the notary service.

It also doesn't address the two problems that Moxie highlights: internal servers and captive portals. It's not clear how either would work in this design, at least without giving up on security and asking the user. (These two problems, captive portals esp, are the bane of many an idea in this area.)

None of the above argues against allowing Convergence as an extension for those who wish to run it. We don't currently have an extension API for controlling certificate decisions and I'm not inherently opposed to one. It would be additional complexity and something that we would have to support in the future, so it's not without costs, but mostly it's not there because nobody has written it and I'm afraid that I don't have any plans to do so.

False Start: past time to fix your servers

A year ago I wrote about False Start in Chrome. In short: Chrome cuts the TLS handshake short to save a round trip in a full handshake. Since then we've posted results that show a 30% drop in SSL handshake latency.

When we enabled False Start by default in Chrome, we also included a list of the very small number of incompatible sites. This list was built into the browser in order to avoid breaking these sites. (See the original post for the reasoning.)

For some time I've been randomly eliminating chunks of that list. Mostly it's been the case that sites have already upgraded. I don't think that they did so specifically with False Start in mind, but that it was just a regular maintainance.

But it's now time that all sites are updated because the list is fading away fast:

DNSSEC authenticated HTTPS in Chrome

DNSSEC validation of HTTPS sites has been hanging around in Chrome for nearly a year now. But it's now enabled by default in the current canary and dev channels of Chrome and is on schedule to go stable with Chrome 14. If you're running a canary or dev channel (and you need today's dev channel release: 14.0.794.0) then you can go to https://dnssec.imperialviolet.org and see a DNSSEC signed site in action.

DNSSEC stapled certificates (and the reason that I use that phrase will become clear in a minute) are aimed at sites that currently have, or would use, self-signed certificates and, possibly, larger organisations that are Chrome based and want certificates for internal sites without having to bother with installing a custom root CA on all the client devices. Suggesting that this heralds the end of the CA system would be utterly inaccurate. Given the deployed base of software, all non-trival sites will continue to use CA signed certificates for decades, at least. DNSSEC signing is just a gateway drug to better transport security.

I'm also going to see how it goes for a while. The most likely outcome is that nobody uses them and I pull the code out in another year's time. There are also some risks:

In theory, it should be hard to get a CA certificate for, say, paypa1.com (note the digit 1 in there). The CA should detect a phishing attempt and deny the certificate. With DNSSEC, there's no such filtering and a phisher could get a green padlock on any site that they can get a DNS name for. It's not clear whether this filtering is actually enforced by all public CAs, nor that phishing is more effective over HTTPS. Additionally, Chrome has anti-phishing warnings built in which is a better layer at which to handle this problem.

In the end, if you want a certificate that identifies a legal entity, rather than a domain name, then you want EV certificates. None the less, if DNSSEC stapled certificates end up being predominantly used for abuse then I'll probably kill them.

It's also possible that lots of sites will use these certificates, triggering warnings in other browsers thus further desensitising those users to such warnings. But that's one of those “good to have” problems and, in this very unlikely situation, there will have to be a push to add wider support.

For site operators

For site operators it's not just a question of dropping in a DNS record and forgetting about it (although nearly). The DNSSEC chain has to be embedded in the certificate and the certificate has to be refreshed, probably in a nightly cron job.

This is why I used the phrase “DNSSEC stapled certificate” above. These certificates are self-signed X.509 certificates with a DNSSEC chain embedded in an extension. The alternative would be to have the client perform a DNSSEC lookup itself. In the long term this may be the answer but, for now, client platforms don't support DNSSEC resolution and I'm in no rush to drop a DNSSEC resolving library into Chrome. We also have data that suggests that ~1% of Chrome (Linux) users can't resolve arbitary DNS record types (because of ‘hotel networks’, firewalls etc). Finally, having the client do a lookup is slower. Since DNSSEC data is signed it doesn't matter how you get it so having the server do the work and cache the result is simplier, faster and dependable.

The DNSSEC stapled data is embedded in an X.509 certificate (as opposed to extending TLS or using a different certificate format) because every HTTPS server will take an X.509 certificate as an opaque blob and it Just Works. All the other possiblilies introduce significant barriers to adoption.

You should also keep in mind that the format of the DNS record may change in the future. I've no current plans to change it but these things are still working themselves out.

Setting it up

First you need a zone that's DNSSEC signed. Setting that up is a little beyond the scope of this blog post but I use BIND with auto-signing as a server and DynDNS as a registra.

Second, you need to clone git://github.com/agl/dnssec-tls-tools.git.

$ openssl genrsa 1024 > privkey.pem
$ openssl rsa -pubout -in privkey.pem > pubkey.pem
$ python ./gencaa.py pubkey.pem
EXAMPLE.COM. 60 IN TYPE257 \# 70 020461757468303e3039060a2b06010401d6790203010…

Take the DNS record that's printed out by gencaa.py, change the domain name (and possibly the TTL) and add it to your zone. If you need to do anything else to make the record accessable to the world, do that now. You should be able to see the record with dig -t type257 example.com.

Now we get the DNSSEC chain and create a certificate with it embedded:

$ python ./chain.py example.com chain
$ gcc -o gencert gencert.c -Wall -lcrypto
$ ./gencert privkey.pem chain > cert.pem

You can check the certificate with openssl x509 -text < cert.pem | less.

CNAMES should work but I haven't implemented DNS wildcards. Also, that code was written to run easily (hence C and Python) rather than being nice code that's going to handle lots of edge cases well. chain.py is just shelling out to dig :)

You now have a key pair (privkey.pem and cert.pem) that can be used by any HTTPS server in the usual way. Keep in mind that DNSSEC signatures expire, often on the order of a week. So you'll need to setup a cron job to download the chain, rebuild the certificate and HUP the HTTPS server.

Goopenpgp

--- title: "OpenPGP support in Go" layout: post ---

Go is currently my favourite programming language for recreational programming. Mentioning Go appears to send some people into a rage judging by the sort of comments that arise on Hacker News, proggit and so on but, thankfully, this blog doesn't have comments.

Over the past few months I've been (very) slowly building OpenPGP support into Go's crypto library. (You can get my key, DNSSEC signed, via PKA: dig +dnssec -t txt agl._pka.imperialviolet.org). I don't get to use PGP often enough but, even when I have (I used to run an email verifying auto-signer), I've resorted to shelling out to gpg. I understand that GPG's support for acting as a library has improved considerably since then, but half the aim of this is that it's recreational programming.

So, here's a complete Go program using the OpenPGP library: (Note, if you wish to compile these programs you'll need a tip-of-tree Go)

package main

import (
  "crypto/openpgp"
  "crypto/openpgp/armor"
  "fmt"
  "os"
)

func main() {
  w, _ := armor.Encode(os.Stdout, "PGP MESSAGE", nil)
  plaintext, _ := openpgp.SymmetricallyEncrypt(w, []byte("golang"), nil)
  fmt.Fprintf(plaintext, "Hello from golang.\n")
  plaintext.Close()
  w.Close()
  fmt.Print("\n")
}

It'll output a symmetrically encrypted message, like this:

-----BEGIN PGP MESSAGE-----

wx4EBwMCkdZyiLAEewZgIuDNFqo0FBYNp4ZGpaiaAjHS4AHkLcerCkW9sCqLdBQc
GH6HUOEwZeCr4ILhUBHgnOID5zAh4PDkdVSUDxFj0KITDfgDXMptL+Ai4aTL4Mzg
4uCX5C7gKEnS8gsxdwC67zQzUMbiSS92duG39AA=
=NuOw
-----END PGP MESSAGE-----

You can feed that into gpg -c and decrypt with the passphrase golang if you like. Since all the APIs are stream based you can process arbitrary length messages without having to fit them into memory.

We can also do public key stuff:

package main

import (
  "crypto/openpgp"
  "crypto/openpgp/armor"
  "fmt"
  "os"
)

func getKeyByEmail(keyring openpgp.EntityList, email string) *openpgp.Entity {
  for _, entity := range keyring {
    for _, ident := range entity.Identities {
      if ident.UserId.Email == email {
        return entity
      }
    }
  }

  return nil
}

func main() {
  pubringFile, _ := os.Open("pubring.gpg")
  pubring, _ := openpgp.ReadKeyRing(pubringFile)
  privringFile, _ := os.Open("secring.gpg")
  privring, _ := openpgp.ReadKeyRing(privringFile)

  myPrivateKey := getKeyByEmail(privring, "me@mydomain.com")
  theirPublicKey := getKeyByEmail(pubring, "bob@example.com")

  w, _ := armor.Encode(os.Stdout, "PGP MESSAGE", nil)
  plaintext, _ := openpgp.Encrypt(w, []*openpgp.Entity{theirPublicKey}, myPrivateKey, nil)
  fmt.Fprintf(plaintext, "Hello from golang.\n")
  plaintext.Close()
  w.Close()
  fmt.Printf("\n")
}

And, of course, it can also decrypt those messages, check signatures etc. The missing bits are ElGamal encryption, support for clearsigning and big bits of functionality around web-of-trust, manipulating keys etc. None the less, Camlistore is already using it and it appears to be working for them.

Public key pinning

Starting with Chrome 13, we'll have HTTPS pins for most Google properties. This means that certificate chains for, say, https://www.google.com, must include a whitelisted public key. It's a fatal error otherwise. Credit goes to my colleague, Chris Evans, for much of this.

The whitelisted public keys for Google currently include Verisign, Google Internet Authority, Equifax and GeoTrust. Thus Chrome will not accept certificates for Google properties from other CAs. As ever, you can look at the source if you wish (search that file for "Verisign"). DigiCert also issues some of our certificates, but we aren't yet enforcing pinning for those domains.

This works with HSTS preloading of Google properties to ensure that, when a user types gmail.com into the address bar, they only get the real Gmail, no matter how hostile the network.

What about MITM proxies, Fiddler etc?

There are a number of cases where HTTPS connections are intercepted by using local, ephemeral certificates. These certificates are signed by a root certificate that has to be manually installed on the client. Corporate MITM proxies may do this, several anti-virus/parental control products do this and debugging tools like Fiddler can also do this. Since we cannot break in these situations, user installed root CAs are given the authority to override pins. We don't believe that there will be any incompatibility issues.

Why public key hashes, not certificate hashes?

In general, hashing certificates is the obvious solution, but the wrong one. The problem is that CA certificates are often reissued: there are multiple certificates with the same public key, subject name etc but different extensions or expiry dates. Browsers build certificates chains from a pool of certificates, bottom up, and an alternative version of a certificate might be substituted for the one that you expect.

For example, StartSSL has two root certificates: one signed with SHA1 and the other with SHA256. If you wished to pin to StartSSL as your CA, which certificate hash would you use? You would have to use both, but how would you know about the other root if I hadn't just told you?

Conversely, public key hashes must be correct:

Browsers assume that the leaf certificate is fixed: it's always the starting point of the chain. The leaf certificate contains a signature which must be a valid signature, from its parent, for that certificate. That implies that the public key of the parent is fixed by the leaf certificate. So, inductively, the chain of public keys is fixed, modulo truncation.

The only sharp edge is that you mustn't pin to a cross-certifying root. For example, GoDaddy's root is signed by Valicert so that older clients, which don't recognise GoDaddy as a root, still trust those certificates. However, you wouldn't want to pin to Valicert because newer clients will stop their chain at GoDaddy.

Also, we're hashing the SubjectPublicKeyInfo not the public key bit string. The SPKI includes the type of the public key and some parameters along with the public key itself. This is important because just hashing the public key leaves one open to misinterpretation attacks. Consider a Diffie-Hellman public key: if one only hashes the public key, not the full SPKI, then an attacker can use the same public key but make the client interpret it in a different group. Likewise one could force an RSA key to be interpreted as a DSA key etc.

(It's possible that a certificate could be reissued with a different SPKI for the same public key. That would be a second sharp edge but, to my knowledge, that has never happened. SPKIs for any public key type should have a distinguished encoding.)

Can I get this for my site?

If you run a large, high security site and want Chrome to include pins, let me know. For everyone else we hope to expose pinning via HSTS, although the details have yet to be worked out. You can experiment with pinning via the HSTS debug UI (chrome://net-internals/#hsts) if you have a new enough version of Chrome.

Smaller than Bloom filters

I've been looking at what would be needed in order to have a global view of CRLs in browsers. At the moment revocation has three problems: privacy (by asking the CA about the status of a certificate you're revealing to them what you're looking at), performance (OCSP checks take hundreds of milliseconds which adds up to thousands of milliseconds when subdomains are in play) and functionality (it doesn't work).

Having a global view of (nearly) all CRLs would mostly solve these issues. We could quickly determine that a certificate is good without leaking information and perform a hard fail revocation check if we believe that a certificate has been revoked. So that begs the question of how to get a global view of CRLs to the client. A Bloom filter would be the obvious choice since we can't accept false negatives (and a Bloom doesn't generate them) but we can cope with false positives (which it does produce).

But we really want to keep this structure small. Every extra KB has to be transferred to hundreds of millions of clients (including mobile devices) and takes up memory on each of those devices. We also have another trick up to play: we don't care much about query time. For a normal Bloom filter, query time is very important but here, if it took several milliseconds, that would be ok.

Given those limitations we can look at data structures other than a traditional Bloom filter and which are smaller and slower.

An optimum Bloom filter takes up n·log(e)ln(1/p) bits (where p is the false positive probability and n is the number of inserted elements). You can't always hit that because you can't have a fraction of a hash function but the theoretical lower bound is just n·log(1/p) bits, a factor if log(e)≅1.44 smaller. Other structures get closer to this bound.

First, there are compressed bloom filters[PDF]. The idea here is that one builds a larger Bloom filter with fewer hash functions and compresses it. In the paper they discuss compressing for transport, but I'm also thinking about it being compressed in memory. An interesting result from the paper is that the optimum number of hash functions for a normal Bloom filter is actually the worst case for a compressed filter.

The optimum number of hash functions for a compressed Bloom is one. Normally that would make the Bloom too large to deal with uncompressed, but if you're not actually going to store the uncompressed version, that's less of an issue.

A Bloom filter with a single hash function requires 1/p bits per element. From that, it's obvious that the density of the filter is p. So a perfect entropy coder (and a range coder is basically perfect) should be able to compress that to just n·1/p·H(p) bits, where H is the binary entropy. Now we can consider how close a compressed Bloom gets to the lower bound that we considered before

A compressed bloom takes 1/p·H(p) per element and the lower bound is ln(1/p). Here's a Wolfram Alpha plot of the expansion factor. The expansion factor goes from about 0.1 to 0.2 over a range of reasonable values of p. For p = 1/1024, it's 0.144. Remember that the expansion factor of a normal Bloom filter is 0.44, so that's quite a bit better.

However, the uncompressed size, even if you're just streaming it, is a problem. For 400K revoked certificates with a 1 in 1024 false positive rate, you're looking at 51MB of uncompressed data. In order to process that you would have to call into the entropy coder 410 million times.

There's an obvious solution to this: split the Bloom into chunks and hash the input in order to find the correct chunk. That works fine, but there's also another answer which may be cleaner: Golomb Compressed Sets. (The idea was first mentioned in this paper, I believe.)

A GCS mirrors the structure of the compressed Bloom filter: we're hashing the elements into a space of size n/p. But, while a compressed Bloom filter treats this as a bitmap, a GCS treats it as a list of values. Since the values are the result of hashing, we can assume that they are uniformly distributed, sort them and build a list of differences. The differences will be geometrically distributed with a parameter of p. Golomb coding is the optimal encoding for geometrically distributed values: you divide by 1/p, encode that in unary then encode the remainder in binary.

A GCS is going to be slightly larger than a compressed bloom filter because the Golomb coding uses a whole number of bits for each value. I don't have an equation for the overhead, but I found it to be between 0.1 and 0.2 bits per set element, which is pretty minimal. It's also an advantage because it means that the user can index the GCS to any degree. With a chunked, compressed Bloom filter the generator has to decide on the chunking and also deal with the inefficiencies resulting from imperfect balance.

But what about that lower bound, can anything hit it? Yes, matrix filters (paper). The idea here is that you hash elements into n values of a finite field. Then, from n such elements you have an n by n matrix where the rows are independent with very high probability. You then perform Gaussian elimination to solve M·b = 1 (where M is the matrix, b is an n column vector and 1 is an n column vector of ones).

After this, b is your matrix filter. Given b, one can hash an element to generate a vector, multiply by b and get the original value back. If the equation was in M then the result has to be 1. Otherwise, the result is a random element from the field. If you use GF(210) as the field, then the false positive probability is 1 in 1024.

As an example, consider the case of a two element set. You would hash these elements into equations that define lines in the plane and the matrix filter is the intersection of these lines. Querying the filter means hashing the element to form a line and seeing if it intersects that point.

Matrix filters can also be used to associate values with the elements. In this case, for an element in the set you'll always get the correct value and for anything else you'll get a random value. (So you don't get to know if the element was in the set in this case.)

Matrix filters hit the lower bound, but have some issues. First, in order to query the filter you have to read and process the whole filter, so you will have to chunk it. Secondly, if your false positive rate is not of the form 1/2n, then the field arithmetic is slower. But I'll deal with the most important problem, from my point of view, next.

Back to the motivation for a moment: we're looking to sync a global view of CRLs to browsers. The initial download is one thing, but we also have to push updates frequently (probably daily). That means that delta encoding is terribly important.

Delta compressing a compressed Bloom filter or a GCS is pretty obvious with a range coder in hand. Additions and deletions from the set are just additions and deletions of bits or values, respectively. But changes to a matrix filter mean rewriting everything. Consider the example of lines in a plane: changing one of the lines moves the intersection point in both dimensions.

So, in order to add elements you would either have to retransmit the chunks that they fall into or send an additional chunk (that would have to be checked every time). Deletions can't be realised without retransmitting a whole chunk so, over time, the useful fraction of elements in a chunk would fall. Since a GCS is only about 15% bigger, it wouldn't take long for the matrix filter to be worse off.

So, although matrix filters are beautiful, I think a GCS works best. I suspect that there's a deep reason why a structure which hits the lower size bound cannot be delta compressed, but I'm too dumb to figure it out if there is.

In the process of playing with this, I implemented everything to some degree. I'll throw the code on github at some point.

Multi-prime RSA trade offs

(This is a technical post: casual readers should skip it.)

I couldn't find data on the speed/security trade offs of multi-prime RSA anywhere, so I gathered the data myself and hopefully this blog post can be found by people in the future.

I used the equations from Lenstra (Unbelievable security: Matching AES security using public key systems, in ASIACRYTPT 2001) and scaled them so that two prime, 1024 bit RSA is 80 bits. The speed numbers are from OpenSSL for a single core of a Xeon E5520 at 2.3GHz.

I'm assuming a 2048-bit modulus. The blue, horizontal lines are the estimated difficulty of the NFS for 2048 and 1024-bit composities. The blue curve is the estimated difficulty of factoring a multi-prime composite using ECM. Really you don't want to choose so many primes at this drops below the NFS level.

image/svg+xml Produced by GNUPLOT 4.2 patchlevel 6 0 500 1000 1500 2000 2 3 4 5 6 7 8 0 50 100 150 ops/s security (bits) Num primes Security Speed

A couple of useful observations from this data: For 2048-bits, 3 primes is the most you want (this is confirmed by table 1 of this paper). Also, if your CA is requiring a 2048-bit modulus, you can use seven primes to get basically the same speed and security as a 1024-bit key.

This is the same data, in table form:

1024-bits (80-bit NFS security)

nops/sECM security
21807104
3268086
4430575

2048-bits (107-bit NFS security)

nops/sECM security
2279148
3524121
4872106
5104795
6126387
7157981
8199777

4096-bits (144-bit NFS security)

nops/sECM security
238213
377173
4135150
5193135
6254123
7288114
8409107

Revocation doesn't work

When an HTTPS certificate is issued, it's typically valid for a year or two. But what if something bad happens? What if the site loses control of its key?

In that case you would really need a way to invalidate a certificate before it expires and that's what revocation does. Certificates contain instructions for how to find out whether they are revoked and clients should check this before accepting a certificate.

There are basically two methods for checking if a certificate is revoked: certificate revocation lists (CRLs) and OCSP. CRLs are long lists of serial numbers that have been revoked while OCSP only deals with a single certificate. But the details are unimportant, they are both methods of getting signed and timestamped statements about the status of a certificate.

But both methods rely on the CA being available to answer CRL or OCSP queries. If a CA went down then it could take out huge sections of the web. Because of this, clients (and I'm thinking mainly of browsers) have historically been forgiving of an unavailable CA.

But an event this week gave me cause to wonder how well revocation actually works. So I wrote the the world's dumbest HTTP proxy. It's just a convenient way to intercept network traffic from a browser. HTTPS requests involve the CONNECT method, which is implemented. All other requests (including all revocation checks) simply return a 500 error. This isn't even as advanced as Moxie's trick of returning 3.

To be clear, the proxy is just for testing. An actual attack would intercept TCP connections. The results:

Firstly, IE 8 on Windows 7:

No indication of a problem at all even though the revocation checks returned 500s. It's even EV.

(Aside: I used Wireshark to confirm that revocation checks weren't bypassing the proxy. It's also the case that SChannel can cache revocation information. I don't know how to clear this cache, so I simply used a site that I hadn't visited before. Also confirming that a cache wasn't in effect is the fact that Chrome uses SChannel to verify certificates and Chrome knew that there was an issue..)

Firefox 3.6 on Windows 7, no indication:

Update: George Macon pointed out that Firefox removes the EV indication from this site. I'm not a regular Firefox user so I'm afraid that I didn't notice:

Chrome 12 on Windows 7:

There's something at least! We can click on it...

So Chrome has an indication but it's the same indication as mixed-content and I suspect that rather a lot of people will ignore it. There is one saving grace, Chrome implements HSTS and for HSTS sites, revocation failure is fatal:

Update: I changed this behaviour. See http://www.ietf.org/mail-archive/web/websec/current/msg00296.html

On other platforms, the story is much the same. Safari doesn't indicate that anything is wrong, nor does Firefox 4 on OS X, nor Firefox 3.6 on Linux. Chrome on Mac is the same as Windows, but on Linux doesn't indicate and doesn't protect HSTS sites. (Chrome Linux's behaviour is due to a limitation in NSS as I understand it.).

So, what's the effect? Well it depends where an attacker is. If the attacker is spoofing a site and is situated close to the site then they can attack all users, because they can get all the traffic destined for the site. However, such an attacker probably can't intercept traffic to the CA's servers, so revocation will actually work because the users will receive a firm ‘revoked’ message from the CA. If the attacker is close to the user, then they can only attack a smaller number of users, but they can intercept traffic to the CA and thus defeat revocation. Attacks in Tunisia and only open WiFi networks are the sort of attacks which can defeat revocation.

So should browsers be stricter about revocation checking? Maybe, but it does mean that a CA outage would disable large parts of the web. Imagine if Verisign corrupted their revocation database and were down for six hours while they rebuilt it. An global outage of large parts of the HTTPS web like that would seriously damage the image of web security to the point where sites would think twice about using HTTPS at all.

(You can configure Firefox to be strict about checking if you wish: security.OCSP.require in about:config.)

A much better solution would be for certificates to only be valid for a few days and to forget about revocation altogether. This doesn't mean that the private key needs to change every few days, just the certificate. And the certificate is public data, so servers could just download their refreshed certificate over HTTP periodically and automatically (like OCSP stapling). Clients wouldn't have to perform revocation checks (which are very complex and slow), CAs wouldn't have to pay for massive, DDoS proof serving capacity and revocation would actually work. If the CA went down for six hours, nobody cares. Only if the CA is down for days is there a problem. If you want to “revoke” a certificate, just stop renewing it.

HSTS UI in Chrome

HSTS is designed to address the fact that HTTP is the default protocol on the web. You might run a secure site that redirects everyone to HTTPS, but your users will just type in foo.com and that initial HTTP request is vulnerable to manipulation and SSL stripping attacks.

HSTS allows sites to advertise that they are HTTPS only. See the Chromium site page for more details.

However, it's difficult to test HSTS. Chrome's HSTS database stores only the hashes of sites (which may not have been the right choice, but I don't see sufficient motivation to change it at the moment) so it's hard to edit by hand. There are tools like craSH's Chrome-STS to help edit the database, but it's too hard for the average developer.

So I've added a debugging UI to Chrome to query, add and remove entries from the HSTS database. You have to type about:net-internals into the address bar and click the HSTS tab. You also need a version of Chrome after r75282. Today that means a trunk build but it should be in the dev channel release next week. Until then, here's a screenshot:

So, if you want to see if your site will break before deploying HSTS, you can add an entry locally for it and find out. Modifications persist on disk and have a expiry of 1000 days. You can also use it as a way to elect to always access sites via HTTPS (i.e. mail.google.com).

Origin poisoning

Mixed-scripting is when an HTTPS page loads resources like Javascript, CSS or plugins over HTTP. It's a subset of `mixed-content' but it's the really dangerous subset: by controlling those types of resources a network attacker can take control of the whole page rather than just, say, a single image.

Chrome, unlike other browsers, tracks mixed-scripting for origins rather than for pages. This means that, if page x triggers mixed-scripting and you follow a link to another page, y, in the same origin, we'll decorate y with mixed-scripting warnings even if it's clean when considered in isolation. This is because Javascript can control other pages in the same origin, see this paper.

Chrome's detection actually errs a little on the side of too many warnings. Strictly speaking, if there's only a single page open in the origin and you navigate it, then there's nowhere for evil Javascript to hide and so the next page can be clean. Chrome, rather, considers an origin to be poisoned once it has triggered mixed-scripting and that taint is only lifted once that renderer process has shutdown.

(Although there's also an argument to be made that Chrome's behavior is fine. Consider this case: you log into a web email client and the front page has mixed scripting. Then you click a link to compose an email. The compose page might be free of mixed-scripting and all the evil Javascript might have been flushed. But the mixed-scripting could have switched your cookies and logged you in as the attacker's account. Now, when you send the email, it'll appear in the attacker's `Sent Mail'.)

But that's all background. The problem is that developers get confused about these mixed-scripting warnings and invariably blame Chrome. If you have a mixed-scripting issue occur on a page which redirects, or in an HTTPS iframe on an HTTP site, then the Javascript console doesn't help you. If can be very difficult to figure out what the problem is.

So I've landed a patch in Chrome which logs mixed-scripting to the debug log no matter where it occurs. Just get a build after r74119 (see about:version) and follow the instructions to enable debugging logging.

Still not computationally expensive

F5 have put up a blog post explaining why my previous statements about SSL's computational costs are a myth. I'm glad that the idea is getting out there!

Keep in mind, however, that F5 want to sell you SSL hardware and that the blog post is a marketing piece. (I've never used F5 products but, from the client side, I've never found any problems with them either.)

Some assertions in the text are simple and wrong, so I'll just point these out:

Certificate lengths

It's true that NIST have recommended that 1024-bit certificates should be phased out by 2013. CA certificates should be 2048-bit now and we need to work to remove the 1024-bit ones.

But even in 2013 it's still going to take tens of millions of dollars of computer hardware a year to factor a 1024-bit RSA key. If you're the sort of organisation which is considering deploying HTTPS do you think that your attackers are going to do that, or are they going to bribe someone to steal the private key from the servers?

Likewise, SSL hardware will probably make your key harder to steal via a software exploit, but keys can be revoked the problem dealt with. There are some organisations where hardware protected keys make sense, but for 99% of sites, key material isn't what's important. It would be far more damaging for customer data to be dumped on the web and SSL hardware doesn't save you there.

As we go towards 2013, CAs will try to issue fewer 1024-bit certificates and 2048-bit certificates are 5x slower. But it only takes one CA to start issuing ECDSA certificates along with a 2048-bit RSA one and the problem is much less vexing. Browsers largely support it and you can serve the ECDSA certificate to the ones which do and the RSA to the rest. P256-ECDSA is about as fast as 1024-bit RSA. The only problem is that people don't know to demand it from their CAs so the CAs don't do it yet.

Ciphers

RC4 has good key agility, it's a stream cipher (which saves bytes on the wire), it's quick and very well analysed. It's not the strongest cipher in the world, but it's significantly stronger than some other parts of the SSL ecosystem. With overwhelming probability, you are not the sort of organisation that needs to worry about attackers brute-forcing your cipher.

All together now...

SSL is just not that computationally expensive any more. Here are the real costs of HTTPS deployment these days:

The F5 article does mention the first of these, but SSL hardware doesn't help with either of them.

All sites should deploy HTTPS because attacks like Firesheep are too easy to do. Even sites where you don't login should deploy HTTPS (imagine the effect of spoofing news websites at a major financial conference to headline “Market crashes”). You should use HSTS to stop sslstrip. But you are probably not the sort of organisation which needs to worry about multi-million dollar attacks aimed at factoring your key.

Unfortunate current practices for HTTP over TLS

(For amusement value.)

Network Working Group                                         A. Langley
Internet-Draft                                                Google Inc
Expires: July 5, 2011                                           Jan 2011


            Unfortunate current practices for HTTP over TLS
                      draft-agl-tls-oppractices-00

Abstract

   This document describes some of the unfortunate current practices
   which are needed in order to transport HTTP over TLS on the public
   Internet.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on July 5, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the BSD License.


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Handshake message fragmentation  . . . . . . . . . . . . . . .  4
   3.  Protocol Fallback  . . . . . . . . . . . . . . . . . . . . . .  5
   4.  More implementation mistakes . . . . . . . . . . . . . . . . .  6
   5.  Certificate Chains . . . . . . . . . . . . . . . . . . . . . .  7
   6.  Insufficient Security  . . . . . . . . . . . . . . . . . . . .  8
   7.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . .  9
   8.  Normative References . . . . . . . . . . . . . . . . . . . . . 10
   Appendix A.  Changes . . . . . . . . . . . . . . . . . . . . . . . 11
   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 12


1.  Introduction

   HTTP [RFC2616] is one of the most common application level protocols
   transported over TLS [RFC5246].  (This combination is commonly known
   as HTTPS based on the URL scheme used to indicate it.)  HTTPS clients
   have to function with a huge range of TLS implementations, some of
   higher quality than others.  This text aims to document some of the
   behaviours of existing HTTPS clients that are designed to ensure
   maximum interoperability.

   This text should not be taken as a recommendation that future HTTPS
   clients adopt these behaviours.  The security implications of each
   need to be carefully considered by each implementation.  However,
   these behaviours are common and the authors consider it better to
   document the state of practice than to simply wish it were otherwise.


2.  Handshake message fragmentation

   Many servers will fail to process a handshake message that spans more
   than one record.  These servers will close the connection when they
   encounter such a handshake message.  HTTPS clients will commonly
   ensure against that by either packing all handshake messages in a
   flow into a single record, or by creating a single record for each
   handshake message.


3.  Protocol Fallback

   Despite it being nearly twelve years since the publication of TLS 1.0
   [RFC2246], around 3% of HTTPS servers will reject a valid TLS
   "ClientHello".  These rejections can take the form of immediately
   closing the connection or a fatal alert.  Intolerance to the
   following has been observed:

      Advertising version TLS 1.0.

      Advertising a TLS version greater than TLS 1.0 (around 2% for 1.1
      or 1.2, around 3% for greater than 1.2).

      Advertising a version greater than 0x03ff (around 65% of servers)

      The presence of any extensions (around 7% of servers)

      The presence of specific extensions ("server_name" and
      "status_request" intolerance has been observed, although in very
      low numbers).

      The presence of any advertised compression algorithms

   Next, some servers will misbehave after processing the "ClientHello"
   message.  Negotiating the use of "DEFLATE" compression can result in
   fatal "bad_record_mac", "decompression_failure" or
   "decryption_failed" alerts.  Notably, OpenSSL prior to version 0.9.8c
   will intermittently fail to process compressed finished messages due
   to a work around of a previous padding bug.

   Lastly, some servers will negotiate the use of SSLv3 but select a
   TLS-only cipher suite.

   In all these cases, HTTPS clients will often enter a fallback mode.
   The connection is retried using only SSLv3 and without advertising
   any compression algorithms.  (This is obviously an easy downgrade
   attack.)  Also, the fallback can be triggered by transient network
   problems, which often manifest as an abruptly closed connection.
   Since SSLv3 does not provide any means of Server Name Indication
   [RFC3546], the fallback connection can use the wrong certificate
   chain, resulting in a very surprising certificate error.


4.  More implementation mistakes

   Non-fatal errors in version negotiation also occur.  Some 0.2% of
   servers use the version from the record header.  Around 0.6% of
   servers require that the version in the "ClientHello" and record
   header match in order to respect the version in the "ClientHello".  A
   very low number of servers echo whatever version the client
   advertises.

   In the event that the client supports a higher protocol version than
   the server, about 0.4% of servers require that the RSA
   "ClientKeyExchange" message include the server's protocol version.

   Some 30% of servers don't check the version in an RSA
   "ClientKeyExchange" at all.


5.  Certificate Chains

   Certificate chains presented by servers will commonly be missing
   intermediate certificates, have certificates in the wrong order and
   will include unrelated, superfluous certificates.  Servers have been
   observed presenting more than ten certificates in what we assume is a
   drive-by shooting approach to including the correct intermediate
   certificate.

   In order to validate chains which are missing certificates, some
   HTTPS clients will collect intermediate certificates from other
   servers.  Clients will commonly put all the presented certificates
   into a set and try to validate a chain assuming only that the first
   certificate is the leaf.


6.  Insufficient Security

   Some 65% of servers support SSLv2 (beyond just supporting the
   handshake in order to upgrade to SSLv3 or TLS).  HTTPS clients will
   typically not support SSLv2, nor send SSLv2 handshakes by default.
   Of those servers, 80% support the export ciphersuites.  (Although
   about 3% of those servers negotiate weak ciphersuites only to show a
   warning.)

   Some servers will choose very small multiplicative group sizes for
   their ephemeral Diffie-Hellman exchange (for example, 256-bits).
   Some HTTPS clients will reject all multiplicative group sizes smaller
   than 512-bits while others will retry after demoting DHE ciphersuites
   in their "ClientHello".


7.  Acknowledgements

   Yngve Pettersen made significant contributions and many of the
   numbers in this document come from his scanning work.  Other numbers
   were taken from Ivan Ristic's SSL Survey.

   Thanks to Wan Teh Chang for reviewing early drafts.


8.  Normative References

   [RFC2246]  Dierks, T. and C. Allen, "The TLS Protocol Version 1.0",
              RFC 2246, January 1999.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [RFC3546]  Blake-Wilson, S., Nystrom, M., Hopwood, D., Mikkelsen, J.,
              and T. Wright, "Transport Layer Security (TLS)
              Extensions", RFC 3546, June 2003.

   [RFC5246]  Dierks, T. and E. Rescorla, "The Transport Layer Security
              (TLS) Protocol Version 1.2", RFC 5246, August 2008.

OpenSSH 5.7 with ECC support

OpenSSH 5.7 has been released and its major notable new feature is support for ECDH key agreement and ECDSA host and user keys.

What a difference a prime makes

In a previous post I covered some of the issues with implementing ECC operations. I also mentioned the problems with the number of terms in the P256 prime.

Having spent a fair amount of time optimising OpenSSL's elliptic curve code, here are the current speeds:

image/svg+xml Produced by GNUPLOT 4.2 patchlevel 6 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 80 100 120 140 160 180 200 220 240 260 1024-bit DH 2048-bit DH P224 curve25519 P256 P521 Security level (bits) Operations / core second

(The graph is inline SVG. Can't see it? Get a better browser.)

P224 and P256 are not, fundamentally, very different. However, P256 has a nasty prime formation, as I explained previously, which kills the speed. Sadly, if you want to support the existing fleet of browsers, P256 is your fastest option.

(The multiplicative DH is measured using OpenSSL's code. curve25519 is my own code, based on djb's. P224 is Emilia Kasper's, inspired by curve25519. P521 is my own code based on Emilia's. P256 is my own code, also based on Emilia's, with a very different form from in an attempt to get it running faster.)

(The measurements are taken on a single core of my 2.3GHz Xeon E5520. They are a scalar multiplication of the base point followed by a scalar multiplication of the result. They don't include point validation. The OpenSSL code involves additional overhead from the BIGNUM conversions, however I consider that fair game because you have to do them if you want to use the code.)

Elliptic curves and their implementation (pointer)

If you read this via Google Reader (or, possibly, other feed readers) then the last post won't have appeared, possibly because of the inline SVG (getting cut by the bleeding edge again).

So, click to read Elliptic curves and their implementation.

Elliptic curves and their implementation

So what's an elliptic curve? Well, for starters, it's not an ellipse. An elliptic curve is a set of points on a plane which satisfy an equation of the form y2 = x3 + ax + b.

As an example, here's the elliptic curve y2 = x3 - 3x + 3:

image/svg+xml 1 2 3 -1 -2 -3 1 2 3 -1 -2 -3

(Thanks to Wolfram Alpha for plotting the curve. Can't see the diagram? Get a better browser).

At this point I'm going to point out that I'm omitting details in order to keep things simple and I'm going to keep doing it without mentioning it again. For example, that elliptic curve equation is based on a transformation which is only valid if the underlying field characteristic is not 2 nor 3. If you want to full details then you should read up. Carrying on …

So the elliptic curve is the set of points which satisfy an equation like that. For the curve above you can see that (1, 1) and (1, -1) are both on the curve and a quick mental calculation should confirm that they fit into the equation.

But the points on an elliptic curve aren't just a set of points, with a small hack they have a structure to them which is sufficient to form a group. (A group is a mathematical term, not just a noun that I pulled from the air.)

Being a group requires four things: that you can add two elements to get a third element which is also in the group. That this is true for all elements: (a + b) + c = a + (b + c). That there's a zero element so that a + 0 = a. Finally that, for every element, there's a negative of that element (written -a), so that a + -a = 0.

And, when you define addition correctly, the set of points on an elliptic curve has that structure. Also, the addition rule for elliptic curves has a very nice graphical definition, so I can give you a couple more diagrams.

The addition rule says that, to add a and b, you draw a line from a to b, intersect it with the curve and reflect in the x-axis:

image/svg+xml a b a+b 1 2 3 -1 -2 -3 1 2 3 -1 -2 -3

There's a slight wrinkle in this: what if a and b are the same point? A line needs two, distinct points to be defined.

So there's a special rule for doubling a point: you take the tangent to the curve at that point, intersect it with the curve and then reflect in the x-axis:

image/svg+xml a 2a 1 2 3 -1 -2 -3 1 2 3 -1 -2 -3

So we defined addition, but the zero element is a bit of a hack. First, we define the negative of an element to be its reflection in the x-axis. We know that adding an element and its negative results in the zero element so think about what adding an element and its reflection means. We draw a line from the element to its reflection, which means that the line is vertical. Now we have to find the third point where that line intersects the curve. But, from looking at the y2 term of the equation, it's clear that the curve only has two solutions for any given x value. So the line doesn't intersect the curve!

And this is where the zero element comes from. It's called the point at infinity and it's a magic element which is infinitely far away. This means that the vertical line ends up hitting it. It also means that it's directly above every point so, when you add the zero element to a point, you get its reflection which you then reflect back to the original point.

And, with that brief wart, we have a group! But what's the point? Well, given addition, you can define multiplication (it's just repeated addition). So, given a curve and a point on that curve (called the base point and written G), you can calculate some multiple of that point. But, given the resulting point, you can't feasibly work backwards to find out what the multiple was. It's a one-way function.

To see how this might be useful for cryptography our old friends, Alice and Bob, enter the scene. They are standing in a room with lots of other people and would like a private conservation, but there's not enough room. Thankfully, Alice and Bob both know a standard, public elliptic curve and base point. Each think up a random number and multiply the base point by that number.

We'll call Alice's random number a and Bob's, b. Each tell the other (and, therefore, everyone) aG and bG: the multiples of the base point. So Alice knows a and bG and multiplies the latter by the former to get abG. Bob knows b and aG and multiplies the latter by the former to get baG. abG = baG so both now have a shared, private key. The other people in the room know only aG and bG and they can't divide to get either a nor b, so they can't calculate the secret.

(That's an elliptic curve Diffie-Hellman key agreement.)

Implementing elliptic curve operations in software.

The diagrams for addition and doubling are pretty, but aren't code. Thankfully they translate pretty easily into equations which can be found at the top of the EFD page for Short Weierstrass curves (which is the specific subset of elliptic curve that we're dealing with).

Those equations are defined in terms of affine coordinates (i.e. (x,y) pairs) but serious implementations will invariably use a transformation. Transformations are applied at the start of a long calculation so that the steps of the calculation are easier. The transformation can be reversed at the end to get affine coordinates out.

A common example of a transformation is the use of polar coordinates. When working with points on the plane, working with polar coordinates can make some calculations vastly easier. Elliptic curve transformations are like that, although with different transformations.

The transformations that you'll commonly see are Jacobian coordinates and XZ (aka Montgomery's trick), although the latter sort of isn't invertible (more on that later).

The underlying field

When we defined the elliptic curve, above, we talked about a set of points but we never defined a point. Well, a point on the plane can be represented as an (x,y) pair, but what are x and y? If this were a maths class they would be elements of ℜ, the set of real numbers. But real numbers on computers are either very slow or approximate. The rules of the group quickly breakdown if you are only approximate.

So, in cryptographic applications, we use an alternative, finite field. (Again, field is a mathematical term, like group.) The finite fields in this case are the numbers modulo a prime. (Although all finite fields of the same size are isomorphic so it doesn't really matter.)

At this point, we're going to speed up. I'm not going to explain finite fields, I'm going to jump into the tricks of implementation. If you made it this far, well done!

It's important to keep in mind that we have two different structures in play here. We have the elliptic curve group, where we are adding and doubling points on the curve, and we have the underlying field. In order to do operations on the group, we have to do a number of operations on elements of the underlying field. For example, here's the formula for group addition under a Jacobian transform. It's given a cost in terms of the underlying field: 11 multiplications plus five squarings.

Our job is to do these group and field operations quickly and in constant time. The latter is important! Look at the power of the timing attacks against AES and RSA!

Limb schedules

If the order of the field is between 2n-1 and 2n then we call it an n-bit group. (The order can't be 2n for n > 0 because the order must be prime and 2n is even.)

For cryptographically useful field sizes, elements aren't going to fit in a single register. Therefore we have to represent field elements with a number of limbs (which we'll call a, b, c, ...). If we were working in a field of size 2127-1 then we might represent field elements with two limbs by 264×a + b. Think of it as a two digit number in base 264.

Since we're using a pair of 64-bit limbs and we're probably putting them in 64-bit registers this is a uniform, saturated (because the registers are full) limb schedule. As a is multiplied by 264 we say that it's “at 264”.

However, we're going to hit some issues with this scheme. Think about squaring these field elements (multiplication is the same, but I would need more variables). Squaring means calculating a2 + 2ab + b2. That results in limbs at 20 (b2), 264 (the middle term) and 2128 (a2).

But multiplying the 64-bit limbs gives a 128-bit result (x86-64 chips can do that, although you need some tricks to pull it off from C code). When it comes to the middle term, you can't multiply the 128-bit by 2 without overflowing. You would have to process the result into a form which you could manipulate. Also, if you look at the algorithms which we're implementing, there are small scalar factors and subtractions, both of which are much easier to do with some headroom.

So we could consider an unsaturated schedule. For example, for a 224-bit field, four 56-bit limbs works well. It's still uniform, but the results of multiplication are only 112-bit, so you can add several of those numbers together in a 128-bit limb without overflowing.

There are also non-uniform schedules. For a 255-bit field, a schedule of 25, 26, 25, 26, ... for ten limbs works well. It's best to keep the schedule as uniform as possible however, otherwise, when multiplying, you end up with values at odd positions and might not have enough headroom to shift them in-place to match up with a limb position.

Prime structures

When implementing curves you typically don't get to choose the prime. There are a few standard primes defined by NIST (in FIPS 186-3) and less common ones (like curve25519).

The structure of the prime matters because the most complex operation will be the reduction. When we did a squaring in the example above, we ended up with a term at 128 bits. We have to eliminate that otherwise, as we keep multiplying results, the results will just get bigger and bigger.

If you think of reducing 2127 modulo 2127-1, I hope it's clear that the result is one. If you had a limb at 127 bits, then you could just take that value, add it in at 0 bits and drop the limb at 127 bits. A limb at 127 bits `reflects' off the term at 20 and ends getting added in at 0 bits.

Since we had a limb at 128 bits, its reflection off that term ends up at 1 bit. So we would have to left shift the limb by one before adding it in at 0. (And hope that your limb schedule gave you some headroom to do the left shift!). That's pretty much an ideal case and 2127-1 is called a Mersenne prime. Sadly, there aren't any between 2127-1 and 2521-1, so you'll probably get something harder.

Take NIST's P256 curve. Here the prime is 2256 - 2224 + 2192 + 296 - 1. It was designed so that, with 32-bit limbs, all the reflections end up on 32-bit boundaries. However, we have 64-bit systems these days and we really don't want to waste those big 64-bit full multipliers. However, the prime is nasty for other limb structures. That term at 224 bits is so high up that reflections off it usually end up above 256 and get reflected again. That's terrible if your limbs don't line up with the reflections. For example, if you try 52-bit, unsaturated limbs with that prime, then the reflections for the limb at 416 bits are at 224, 192, 160, 128, 96, 64, 32, and 0 bits! With all that shifting and adding (and more because you don't have enough headroom for all those shifts), you end up with a slow reduction.

I have typically ended up experimenting with different limb schedules and seeing which is the fastest. You can get a rough feeling for how well a given schedule will do but, with CPUs being so complex these days, it's never terribly clear which will be the fastest without trying them.

Inversion

The usual way to do inversion in a finite field is with Euclid's algorithm. However, this isn't constant time. Thankfully, we can use Fermat's little theorem (xp ≅ x (mod p)) and notice that it easily follows that xp-2 ≅ x-1 (mod p). So, just raise the element to power of p - 2. It's a long operation, but you don't do it very often.

(Credit to djb for that trick.)

Subtraction

Although you can use signed limbs, I consider the complexity to be too much to bother with. But, without signed limbs how do you do subtraction?

Well, everything is modulo p, so we can always add 0 mod p. If you have unsaturated limbs then the top bits of the limb overlap with the bottom bits of the next limb. So, you can add a large number to the first limb and subtract a small number from the next one. But what stops the next limb from underflowing? Well, you want to add the same large number there and subtract from the next etc.

This continues until you reach the top limb. Here, you can calculate the excess, reduce modulo p and correct the lower limbs to counteract it. In the end, you end up with a few, static values which you can quickly add to a field element without changing the element value, but so that you can now prove that the subtraction doesn't underflow any limbs.

(Credit to Emilia Kasper for that one.)

Remarks

I could write a lot more, but it's getting late. The reason for this write up is that I'm currently working on a fast P256 implementation for OpenSSL. It won't be as fast as I would like due to the issues with the P256 prime, given above. But, I hope to be done by the end of the week and hope to have some exciting news about it in the new year.

Google Charts API over HTTPS

I'm happy to announce that the Google Charts API is now available over HTTPS at https://chart.googleapis.com. For example:

The documentation should be updated in due course.

Random number generators in Psi experiments

(Note: I'm not making any comment on the research in question, merely making a technical remark.)

In Feeling the Future, the author, D. Bem, has an interesting problem (page 11 of the linked preprint). If an experiment shows that people can predict the output of a RNG, what does that tell us?

Well, if the RNG was a PRNG, seeded at the beginning of the experiment, then the subject could either be showing precognition (i.e. they can sense the result from the future) or clairvoyance (i.e. they can sense the state of the PRNG and thus know the next result). If you clock a true, hardware RNG before the trial then the same possibilities arise.

If you clock a true RNG after the trial however, then the possibilities are precognition and psycokinesis (i.e. the subject altered the state of the RNG, causing the result).

My observation was that you could XOR a PRNG and a RNG. (The latter being clocked after the trial.) In this case, prediction implies, at least, either both clairvoyance (for the PRNG) and psychokinesis (for the RNG), or precognition. That appears to be a useful trick.

However, this rather relies on the result that the sum of two RNGs is at least as random as the best of them. However, if we have a result which suggests precognition, then that doesn't hold! An evil RNG could, when summed with a true RNG, predict its future output and cancel it out, resulting in a less random result!

Changing HTTPS

At Google we're pretty fanatical about speed. We know that when things run faster people aren't just happier to use them, but that it fundamentally changes the way that people use them. That's why we strive to make Chrome the fastest browser possible and include experiments like SPDY.

As part of this, we've been looking at SSL/TLS, the protocol which secures HTTPS. We love SSL/TLS because of the privacy and security that it gives to our users and, like everything, we want to make it faster.

One of the simplest changes that we're experimenting with is called False Start. It can shave a round trip from the setup time of many HTTPS connections. A round trip varies depending on how far away the webserver is from the client. Crossing the continental US and back takes about 70ms. Going from the west coast to Europe and back takes about 150ms.

That might not seem like very much. But these costs are multiplied when loading a complex site. If it takes an extra 100ms to start fetching the contents of the page, then that's 100ms until the browser discovers resources on other websites that will be needed to render its contents. If loading those resources reveals more dependents then they are delayed by three round trips.

And this change disproportionately benefits smaller websites (who aren't multihomed around the world) and mobile users or areas with poorer Internet connectivity (who have longer round trip times).

Most attractively, this change can be made unilaterally. Browsers can implement False Start and benefit without having to update webservers.

However, we are aware that this change will cause issues with about 0.05% of websites on the Internet. There are a number of possible responses to this:

The most common would be to admit defeat. Rightly or wrongly, users assign blame to the last thing to change. Thus, no matter how grievous or damaging their behaviour, anything that worked at some point in the past is regarded as blameless for any future problem. As the Internet becomes larger and more diverse, more and more devices are attached that improperly implement Internet protocols. This means that any practice that isn't common is likely to be non-functional for a significant number of users.

In light of this, many efforts which try to make lower level changes fail. Others move on top of common protocols in the hope that bugs in the network don't reach that high.

Chrome still carries an idealism that means that we're going to try to make low level changes and try to make them work.

The second common response to these problems is to work around them. This means detecting when something has gone wrong and falling back to the old behaviour. In the specific case of False Start, detecting the failure is problematic. However, even if it were not, then we still wouldn't want to work around the issue because we think that it's damaging to the Internet in its own way.

Fallbacks allow problematic sites to continue to function and stems the flow of angry users. That alone makes it very attractive to developers. However, it also removes any motivation for the sites to change. In most cases, it means that sites will never even know about the issue. Fallbacks do nothing to move the world towards a position where the change in question functions correctly.

Additionally, fallbacks add more unwritten rules and complexity. As an example, take the change from SSLv3 to TLS 1.0. This was a change to the same SSL/TLS protocol that False Start changes and it was finialised as a standard nearly 12 years ago. TLS 1.0 was designed so that browsers could talk to older SSLv3 websites without issues. The only slight problem was that some webservers needed to be updated to ignore some extra data that TLS 1.0 clients would send. A very minor change.

In order to make the transition smoother, browsers added a fallback from TLS to SSLv3. The Internet in 1999 was a much smaller and more flexible place. It was assumed that the problematic webservers could be fixed in a few years and the fallback could be removed.

Twelve years later, the fallback is in robust health and still adding complexity. A security update to TLS earlier this year was made much more complex by the need to account for SSLv3 fallback. The operators of the problematic webservers are largely unaware of the problems that they are causing and have no incentive to change in any case. Meanwhile, the cost of SSLv3 fallback continues to accumulate.

Because of these problems, we're going to try a new approach with False Start: blacklists.

Chrome (on the dev channel at first) will contain a list of the 0.05% of problematic sites and we won't use False Start with them. We're generating this list by probing a large list of websites although, of course, we'll miss some which we'll have to address from bug reports.

Blacklisting gives us two advantages. Firstly, it limits the accumulation of new problematic websites. Sites which have never worked are a very different case from sites which used to work.

Secondly, we can contact the problematic sites in question. We already have a good idea of where the problem lies with many of them and we're in contact with the stakeholders to plan a way forward.

Blacklists require effort to maintain and we'll have to be responsive to make it work. But, with our near-weekly dev channel and even more frequently updated Canary channel, we think that we can do it. In the end, success will be measured by whether we manage to make the Internet a safe place to implement False Start and by how much we manage to reduce the blacklist over time.

DNSSEC and TLS

Ever since the DNS root was signed on July 15th, quite a few people have been wondering about the interaction of TLS and DNSSEC. On the one hand, trust in the CA system is lukewarm but, on the other, the deployment issues with getting DNSSEC to the client seem immense.

Those who saw Dan Kaminsky's talk at BlackHat, which included a patched version of Chromium performing DNSSEC validation, have probably already guessed that Dan, myself and others have been working on this problem. In this blog post I'm going to try to explain the design space as I see it.

Objectives

In the long term, we want a stronger foundation of trust for the Internet. This means both pairing back the power of the 1500 root certificates and making TLS easier to deploy and thus more commonly used.

So one of the goals is to serve sites which currently don't have a CA signed certificate. We can do that by allowing them to publish fingerprints in DNS and having browsers accept those (DNSSEC secured) fingerprints.

There might also be speed advantages to be had by avoiding OCSP and CRL lookups.

We also want to serve those sites which currently do have CA signed certificates. For them we can provide exclusion of other certificates issued either mistakenly or maliciously by CAs.

Yet it's important to note that this isn't a plan for the elimination of CAs. CAs are still a good way to link DNS names to legal entities. For that we need EV certificates and CAs to issue them.

The Design Space

Firstly we have the issues which are, in many respects, the least important. How should we encode the data?

What type of record? The two major candidates are a TXT record and a new type of CERT record. It's important that we figure it out as DNS queries can only ask for one type of response (ignoring ANY, which we don't want) and each query is another chance for packet loss and a painful timeout.

TXT records have the advantage that the crippled web interfaces, by which many people manage their DNS, often support them. They are also already common (try looking up TXT records for google.com, amazon.com etc). They are human readable. They are easily extendable given a sensible format.

A new CERT type is the ‘cleaner’ solution. Stuffing things in TXT records is clearly a hack and working around crappy web interfaces feels like being a good man standing still.

Where to put the record? If one were to use a TXT record then the temptation would be to put it on a prefix label like _tls.www.example.com. However, if www.example.com is a CNAME then we won't get a helpful answer: probably just an NXDOMAIN and we'll have to hunt around: taking many round trips and a painful latency hit. Because of this, I like putting the record on the target domain name.

Next we'll consider some of the deployment options because they inform some of the points to follow.

What about clients without DNSSEC resolution ability? This is a big question. We don't care if your ISP's resolver is going to set the AD (Authenticated Data) bit in DNS replies: we don't trust it. We need either a local recursive resolver or we need the full DNSSEC chain so that we can check the signatures ourselves.

(It's worth pointing out that, although we can't trust any fingerprint information without DNSSSEC, we can provide exclusion without DNSSEC. It's a bit of a weak threat model: the attacker would have to control some of your network but not your DNS resolutions: maybe you have the records cached with a long TTL.)

One option is to put an aggressive DNSSEC resolver in the client and set DO (DNSSEC OK) and CD (Checking Disabled) bits in the requests to get the signatures themselves.

What about clients which can't even do DNS resolution correctly? (Also known as the Hotel Network Problem.) For them we could tunnel DNS over HTTP. In fact, if you encode the request in a GET request you can set the HTTP caching headers so that HTTP caches are DNS caches too (Dan's trick).

If we're talking over HTTP, why not get the server to give us the whole chain? In that case, what about an EDNS0 option to ask for the full chain? Both are possibilities.

How about putting the DNSSEC chain in other protocols? What about embedding the chain in an X.509 certificate? Chain embedding can solve the needs of sites which want to use self-signed certificates.

Now we can start to get to some of the more meaty issues:

Fingerprint in record? If you only have a single certificate then you want to put the fingerprint in a record on the domain name. That way the client can start the lookup at the same time as for the A record. But if you have many fingerprints, that becomes troublesome and you want to lookup a domain named by the fingerprint. That's slower because you can only start that lookup once you have the certificate from the server. The answer looks to be that one has to handle both types of lookup.

What to hash? If we are going to embed a DNSSEC chain in a certificate, then the fingerprint can't cover the whole certificate (because then the hash would have to cover itself). So that suggests hashing only the public key. However, if we do that then the attacker can change other parts of the certificate at will.

Here we have a bit of an impedance mismatch with the X.509 world. Existing APIs are usually of the form “please validate this certificate”. Once validated, everything in the certificate is trusted because CAs are the Voice of God. However, in a DNSSEC world, authority is scoped.

If we hash only the public key then an attacker could include things in an X.509 certificate for example.com which would appear to have the authority of example.com to an unsuspecting application. If we hash the whole certificate then example.com could put things in its certificate which might assert things which example.com shouldn't be allowed to. Applications which work on the Voice of God model could be misled.

It's tricky. At the moment we are supporting both models.

Should we include a flag to perform CA validation in addition? For performance reasons, people might want to use just DNSSEC because one can avoid OCSP and CRL lookups that way. For EV certs, we certainly want to perform CA validation in addition, but what about DV certs? Some people might like the idea of enforcing the expiry and OCSP checks.

TLS extensions? Embedding a DNSSEC chain in an X.509 certificate means that you need to regenerate the certificate once a week or so (because DNSSEC signatures expire). Also, CAs aren't going to sign certificates with a random blob of binary data that they don't understand. Because of this, it's been suggested that we could carry the DNSSEC chain in a TLS extension. However, chain embedding is for those servers which don't have a CA issued certificate. Clients aren't going to be able to require DNSSEC for many years, if ever. And if we're going to accept CA certificates without DNSSEC, then an embedded chain can't provide exclusion.

The Code

The code is minimal so far. Chrome trunk already includes support for validating DNSSEC chains embedded inside an X.509 cert, although it requires a command line flag to enable. I also have code to generate the chains. However, the format of the chain is going to change so I'm not making that public yet.

Hopefully there will be something more substantial in a couple of months.

SSL Survey

Ivan Ristić has put together a fantastic survey of the state of SSL/TLS on the web. Some highlights include:

Clipping in Chromium

I'm a guest poster on the Chromium Notes blog today, talking about clipping in Skia and Chromium.

Overclocking SSL

(This is a write up of the talk that I gave at Velocity 2010 last Thursday. This is a joint work of myself, Nagendra Modadugu and Wan-Teh Chang.)

The ‘S’ in HTTPS stands for ‘secure’ and the security is provided by SSL/TLS. SSL/TLS is a standard network protocol which is implemented in every browser and web server to provide confidentiality and integrity for HTTPS traffic.

If there's one point that we want to communicate to the world, it's that SSL/TLS is not computationally expensive any more. Ten years ago it might have been true, but it's just not the case any more. You too can afford to enable HTTPS for your users.

In January this year (2010), Gmail switched to using HTTPS for everything by default. Previously it had been introduced as an option, but now all of our users use HTTPS to secure their email between their browsers and Google, all the time. In order to do this we had to deploy no additional machines and no special hardware. On our production frontend machines, SSL/TLS accounts for less than 1% of the CPU load, less than 10KB of memory per connection and less than 2% of network overhead. Many people believe that SSL takes a lot of CPU time and we hope the above numbers (public for the first time) will help to dispel that.

If you stop reading now you only need to remember one thing: SSL/TLS is not computationally expensive any more.

The first part of this text contains hints for SSL/TLS performance and then the second half deals with things that Google is doing to address the latency that SSL/TLS adds.

Basic configuration

Modern hardware can perform 1500 handshakes/second/core. That's assuming that the handshakes involve a 1024-bit RSA private operation (make sure to use 64-bit software). If your site needs high security then you might want larger public key sizes or ephemeral Diffie-Hellman, but then you're not the HTTP-only site that this presentation is aimed at. But pick your certificate size carefully.

It's also important to pick your preferred ciphersuites. Most large sites (Google included) will try to pick RC4 because it's very fast and, as a stream cipher, doesn't require padding. Recent Intel chips (Westmere) contain AES instructions which can make AES the better choice, but remember that there's no point using AES-256 with a 1024-bit public key. Also keep in mind that ephemeral Diffie-Hellman (EDH or DHE) ciphersuites will handshake at about half the speed of pure RSA ciphersuites. However, with a pure RSA ciphersuite, an attacker can record traffic, crack (or steal) your private key at will and decrypt the traffic retrospectively, so consider your needs.

OpenSSL tends to allocate about 50KB of memory for each connection. We have patched OpenSSL to reduce this to about 5KB and the Tor project have independently written a similar patch that is now upstream. (Check for SSL_MODE_RELEASE_BUFFERS in OpenSSL 1.0.0a.). Keeping memory usage down is vitally important when dealing with many connections.

Resumption

There are two types of SSL/TLS handshake: a full handshake and an abbreviated handshake. The full handshake takes two round trips (in addition to the round trip from the TCP handshake):

The abbreviated handshake occurs when the connection can resume a previous session. This can only occur if the client has the previous session information cached. Since the session information contains key material, it's never cached on disk so the attempted client resume rate, seen by Google, is only 50%. Older clients also require that the server cache the session information. Since these old clients haven't gone away yet, it's vitally important to setup a shared session cache if you have multiple frontend machines. The server-side miss rate at Google is less than 10%.

An abbreviated handshake saves the server performing an RSA operation, but those are cheap anyway. More importantly, it saves a round-trip time:

Addressing round trips is a major focus of our SSL/TLS work at Google (see below).

Certificates

We've already mentioned that you probably don't want to use 4096-bit certificates without a very good reason, but there are other certificate issues which can cause a major slowdown.

Firstly, most certificates from CAs require an intermediate certificate to be presented by the server. Rather than have the root certificate sign the end certificates directly, the root signs the intermediate and the intermediate signs the end certificate. Sometimes there are several intermediate certificates.

If you forget to include an intermediate certificate then things will probably still work. The end certificate will contain the URL of the intermediate certificate and, if the intermediate certificate is missing, the browser will fetch it. This is obviously very slow (a DNS lookup, TCP connection and HTTP request blocking the handshake to your site). Unfortunately there's a constant pressure on browsers to work around issues and, because of this, many sites which are missing certificates will never notice because the site still functions. So make sure to include all your certificates (in the correct order)!

There's a second certificate issue that can increase latency: the certificate chain can be too large. A TCP connection will only send so much data before waiting to hear from the other side. It slowly ramps this amount up over time, but a new TCP connection will only send (typically) three packets. This is called the initial congestion window (initcwnd). If your certificates are large enough, they can overflow the initcwnd and cause an additional round trip as the server waits for the client to ACK the packets:

For example, www.bankofamerica.com sends four certificates: 1624 bytes, 1488 bytes, 1226 bytes and 576 bytes. That will overflow the initcwnd, but it's not clear what they could do about that if their CA really requires that many intermediate certificates. On the other hand, edgecastcdn.net has a single certificate that's 4093 bytes long, containing 107 hostnames!

Packets and records

SSL/TLS packages the bytes of the application protocol (HTTP in our case) into records. Each record has a signature and a header. Records are packed into packets and each packet has headers. The overhead of a record is typically 25 to 40 bytes (based on common ciphersuites) and the overhead of a packet is around 52 bytes. So it's vitally important not to send lots of small packets with small records in them.

I don't want to be seen to be picking on Bank Of America, it's honestly just the first site that I tried, but looking at their packets in Wireshark, we see many small records, often sent in a single packet. A quick sample of the record sizes: 638 bytes, 1363, 15628, 69, 182, 34, 18, … This is often caused because OpenSSL will build a record from each call to SSL_write and the kernel, with Nagle disabled, will send out packets to minimise latency.

This can be fixed with a couple of tricks: buffer in front of OpenSSL and don't make SSL_write calls with small amounts of data if you have more coming. Also, if code limitations mean that you are building small records in some cases, then use TCP_CORK to pack several of them into a packet.

But don't make the records too large either! See the 15KB record that https://www.bankofamerica.com sent? None of that data can be parsed by the browser until the whole record has been received. As the congestion window opens up, those large records tend to span several windows and so there's an extra round trip of delay before the browser gets any of that data. Since the browser is pre-parsing the HTML for subresources, it'll delay discovery and cause more knock-on effects.

So how large should records be? There's always going to be some uncertainty in that number because the size of the TCP header depends on the OS and the number of SACK blocks that need to be sent. In the ideal case, each packet is full and contains exactly one record. Start with a value of 1415 bytes for a non-padded ciphersuite (like RC4), or 1403 bytes for an AES based ciphersuite and look at the packets which result from that.

OCSP and CRLs

OCSP and CRLs are both methods of dealing with certificate revocation: what to do when you lose control of your private key. The certificates themselves contain the details of how to check if they have been revoked.

OCSP is a protocol for asking the issuing authority “What's the status of this certificate?” and a CRL is a list of certificates which have been revoked by the issuing authority. Both are fetched over HTTP and a certificate can specify an OCSP URL, a CRL URL, or both. But certificate authorities will typically use at least OCSP.

Firefox 2 and IE on Windows XP won't block an SSL/TLS handshake for either revocation method. IE on Vista will block for OCSP, as will Firefox 3. Since there can be several OCSP requests resulting from a handshake (one for each certificate), and because OCSP responders can be slow, this can result in hundreds of milliseconds of additional latency for the first connection. I don't have really good data yet, but hopefully soon.

The answer to this is OCSP stapling: the SSL/TLS server includes the OCSP response in the handshake. OCSP responses are public and typically last for a week, so the server can do the work of fetching them and reuse the response for many connections. The latest alpha of Apache supports this (httpd 2.3.6-alpha). Google is currently rolling out support.

However, OCSP stapling has several issues. Firstly, the protocol only allows the server to staple one response into the handshake: so if you have more than one certificate in the chain the client will probably end up doing an OCSP check anyway. Secondly, an OCSP response is about 1K of data. Remember the issue with overflowing the initcwnd with large certificates? Well the OCSP response is included in the same part of the handshake, so it puts even more pressure on certificate sizes.

Google's SSL/TLS work

Google is working on a number of fronts here. I'll deal with them in order of deployment and complexity.

Firstly we have False Start. All you really need to know is that it reduces the number of round trips for a full handshake from two to one. Thus, there's no longer any latency advantage to resuming. It's a client-side change which should be live in Chrome 8 and is already in some builds of Android Froyo.

Secondly, Chrome will soon have OCSP preloading. This involves predicting the certificates that a server will use based on past experience and starting the OCSP lookups concurrently with the DNS resolution and TCP connection.

Slightly more complex, but deployed, is Next Protocol Negotiation. This pushes negotiation of the application level protocol down into the TLS handshake. It's how we trigger the use of SPDY. It's live on the Google frontend servers and in Chrome (6).

Lastly, and most complex, is Snap Start. This reduces the round trip times to zero for both types of handshakes. It's a client and server side change and it assumes that the client has the server's certificate cached and has up-to-date OCSP information. However, since the certificate and OCSP responses are public information, we can cache them on disk for long periods of time.

Conclusion

I hope that was helpful. We want to make the web faster and more secure and this sort of communication helps keep the world abreast of what we're doing.

Also, don't forget that we recently deployed encrypted web search on https://encrypted.google.com. Switch your search engine!

(updated 26th Oct 2010: mentioned OpenSSL 1.0.0a now that it's released, updated the status of OCSP stapling and False Start and added a mention of OCSP preloading.)

curve25519 in iOS4

Someone pointed out to me that I'm in the credits for iOS4 (the operating system formerly known as iPhone OS) for curve25519-donna. Unfortunately, I have no idea what they're using it for! I would be fascinated to know.

Update: (2012-06-26) The use is described in this document from Apple.

TLS latency improvements

We've published several drafts recently concerning reducing the number of round trips in SSL/TLS. Firstly, False Start is a client-side only change which reduces the number of round trips for a full handshake to one. It's in Android Froyo and Chrome (Linux since 5.x, Mac and Windows since 6.x). Secondly, Snap Start is a client and server change which reduces the overhead to zero. The code for it is still preliminary, but it might make it into Chrome 6.x. Snap Start requires that the client cache some information about the server but, unlike resume information, that data can be cached on disk.

This is all part of an ongoing effort to make the web faster.

Full handshakeResume handshake
(Round trip times)
Standard TLS21
False Start11
Snap Start00

Speaking at Velocity

O'Reilly Velocity Conference 2010

I'll be presenting various SSL/TLS work at the O'Reilly Velocity Conference in June. Specifically, 2:30pm on Thursday the 24th of June.

You can also see the outline of the talk here, although wtc, ngm and I have yet to nail down the precise contents.

WOFF support in Chromium

Last week I added WOFF support to Chromium. It's currently on trunk and hopefully will be included in a dev channel release by the end of this week.

WOFF is a repackaging of the old sfnt/TrueType/OpenType format. Conceptually it's just a gzipped TTF file with optional sections for a chunk of XML and a chunk of 'private data'. Since TrueType is a table based format already, those two optional sections could have been included as tables. And since we have gzip as a transport encoding in all reasonable HTTP servers, that's a little pointless too.

So, technically, there's no need to WOFF to exist at all. It's all politics. The type foundries didn't want a format which people could use on their computers. Of course, it's trivial to convert them (Chromium does it internally), but if it gets the foundries and Microsoft on board, then it's worthwhile.

But I do worry that WOFF will actually make web sites look worse. All the samples and examples that I've seen (even the nice ones) are terribly hinted. I assume that this is because the people doing them are using Macs, which traditionally don't hint anyway. But I prefer strong hinting and, in this mode, the free fonts look blotchy. They just haven't had the time and effort put in that my system fonts (often msttcorefonts) have had.

So I can always tell when a site is using @font-face, because it looks bad. Which is rather a shame given the intention.

Checking that functions are constant time with Valgrind

Information leaks via timing side channels can be deadly. You can steal RSA keys from other processes on the same host, extract the kernel's dm_crypt keys and steal AES keys over the network.

In order for a function to be constant time, the branches taken and memory addresses accessed must be independent of any secret inputs. (That's assuming that the fundamental processor instructions are constant time, but that's true for all sane CPUs.)

However, it's tough to write constant time functions. You can see the source to a few simple ones that I wrote Go. Sometimes the design of the function is fundamentally flawed (like AES), but even when the function can be efficiently implemented in a constant time fashion, it's easy to slip up. Careful code review is currently the best practice but it's error prone as the amount of code increases and fragile in the face of change.

A type system could probably help here, but that's not the path I'm taking today. Since cryptographic functions result in abnormally straight line code, it's common for a typical input to exercise every instruction. So a tool like Valgrind could check all the branches and memory accesses to make sure that they haven't been tainted with secret data.

This would mean keeping track of every bit in memory to know if it's secret or not, likewise for all the CPU registers. Preferably at the bit level. The tool would also have to know that adding secret and non-secret data results in secret data etc. That suggests that it would be quite a complex tool.

But memcheck already does this! In order to keep track of uninitialised data it shadows every bit of memory and will warn you if you use uninitialised data in a branch or to index a memory access. So if we could tell memcheck to treat our secret data as uninitialised, everything should just work.

I've a Valgrind patch does just that (against SVN r11097). It will intercept any calls to ct_poison and ct_unpoison in libctgrind.so*.

Let's look at a bad way to test a 128-bit MAC against a calculated value:

char check16_bad(unsigned char *a, unsigned char *b) {
  unsigned i;
  for (i = 0; i < 16; i++) {
    if (a[i] != b[i])
      return 0;
  }

  return 1;
}
And now, testing it with memcheck/ctgrind:
int
main() {
  unsigned char a[16], b[16];

  memset(a, 42, sizeof(a));
  memset(b, 42, sizeof(b));

  ct_poison(a, sizeof(a));

  printf("check16_bad\n");
  check16_bad(a, b);

  return 0;
}
$ /home/agl/devel/valgrind/vg-in-place ./a.out
...
check16_bad
==30993== Conditional jump or move depends on uninitialised value(s)
==30993==    at 0x40067F: check16_bad (test.c:11)
==30993==    by 0x40075F: main (test.c:44)

It seems to work! There are a few other tests in test.c which show the correct way to implement that function and well as a demo to show that secret dependent memory accesses are also caught by this tool.

We can test a few other things too. It confirms that donna-c64 is constant time, which is nice. I also tested BN_mod_exp_mont_consttime from OpenSSL since that's a large function which calls functions from several other files.

It turns out not to be constant time! There's a secret dependent memory access:

==31076== Use of uninitialised value of size 8
==31076==    at 0x402210: MOD_EXP_CTIME_COPY_FROM_PREBUF (bn_exp.c:554)
==31076==    by 0x40277B: BN_mod_exp_mont_consttime (bn_exp.c:703)
==31076==    by 0x4011FF: main (ossl.c:24)

From inspection of the code, this appears to be a true positive. To be fair, the comment above the function suggests that it's rather misnamed:

/* This variant of BN_mod_exp_mont() uses fixed windows and the special
 * precomputation memory layout to limit data-dependency to a minimum
 * to protect secret exponents

It only claims to "limit" the side-channel leaks. I don't know how serious this is, but then you never do till someone gets a paper published by stealing all your secrets.

Time to update OpenSSL 0.9.8m

If you're using OpenSSL 0.9.8m as either a client or server (but esp as a server) then it's time to update to 0.9.8n.

OpenSSL Security Advisory [24 March 2010]

"Record of death" vulnerability in OpenSSL 0.9.8f through 0.9.8m
================================================================

In TLS connections, certain incorrectly formatted records can cause an OpenSSL
client or server to crash due to a read attempt at NULL.

Affected versions depend on the C compiler used with OpenSSL:

- If 'short' is a 16-bit integer, this issue applies only to OpenSSL 0.9.8m.
- Otherwise, this issue applies to OpenSSL 0.9.8f through 0.9.8m.

Users of OpenSSL should update to the OpenSSL 0.9.8n release, which contains a
patch to correct this issue.  If upgrading is not immediately possible, the
source code patch provided in this advisory should be applied.

Bodo Moeller and Adam Langley (Google) have identified the vulnerability
and prepared the fix.

Macs everywhere

The Setup is a neat site. They find (somewhat) famous techie people and interview them about what hardware and software they use. I was browsing through it because I recognised some of the names, and because it's always neat to find out about tools that you don't use.

But it struck me how many people were using OS X as their primary, day to day, operating system. So I went through every one of them and added up the numbers (except Why the Lucky Stiff, because they put an underscore at the start of the domain name; stopping it from resolving).

Windows: 3.5, Linux: 3, Mac: 29.5

These folks aren't all hardcore coders to be sure, but one of the Linux users is RMS and I'm not sure he counts! It would be like asking Steve Jobs what he uses.

But gosh, that's a total OS X domination.

Strict Transport Security

Chrome 4 went stable yesterday. One of the many new things in this release is the addition of Strict Transport Security. STS allows a site to request that it always be contacted over HTTPS. So far, only Chrome supports it. However, the popular NoScript Firefox extension also supports it and hopefully support will appear in Firefox proper at some point.

The issue that STS addresses is that users tend to type http:// at best, and omit the scheme entirely most of the time. In the latter case, browsers will insert http:// for them.

However, HTTP is insecure. An attacker can grab that connection, manipulate it and only the most eagle eyed users might notice that it redirected to https://www.bank0famerica.com or some such. From then on, the user is under the control of the attacker, who can intercept passwords etc at will.

An STS enabled server can include the following header in an HTTPS reply:

Strict-Transport-Security: max-age=16070400; includeSubDomains

When the browser sees this, it will remember, for the given number of seconds, that the current domain should only be contacted over HTTPS. In the future, if the user types http:// or omits the scheme, HTTPS is the default. In fact, all requests for URLs in the current domain will be redirected to HTTPS. (So you have to make sure that you can serve them all!).

For more details, see the specification.

There is still a window where a user who has a fresh install, or who wipes out their local state, is vulnerable. Because of that, we'll be starting a "Preloaded STS" list. These domains will be configured for STS out of the box. In the beginning, this will be hardcoded into the binary. As it (hopefully) grows, it can change into a list this is shared across browsers, like the safe-browsing database is today.

If you own a site that you would like to see included in the preloaded STS list, contact me at .

Setting up Apache with OCSP stapling

OCSP is the Online Certificate Status Protocol. It's a way for TLS clients to check if a certificate is expired. A certificate with OCSP enabled includes a URL to which a client can send a POST request and receive a signed statement that a given certificate is still valid.

This adds quite a bit of latency to the TLS connection setup as the client has to perform a DNS lookup on the OCSP server hostname, create an HTTP connection and perform the request-response transaction.

OCSP stapling allows the TLS server to include a recent OCSP response in the TLS handshake so that the client doesn't have to perform its own check. This also reduces load on the OCSP server.

Apache recently got support for OCSP stapling and this post details how to set it up.

1. Prerequisites

Apache support got added in this revision. At the time of writing, no release of Apache includes this so we get it from SVN below.

OpenSSL support was added in 0.9.8h. The version in Ubuntu Karmic is not recent enough, so I pulled the packages from lucid for this:

cd /tmp
wget 'http://mirrors.kernel.org/ubuntu/pool/main/o/openssl/openssl_0.9.8k-7ubuntu3_amd64.deb'
wget 'http://mirrors.kernel.org/ubuntu/pool/main/o/openssl/libssl-dev_0.9.8k-7ubuntu3_amd64.deb'
wget 'http://mirrors.kernel.org/ubuntu/pool/main/o/openssl/libssl0.9.8_0.9.8k-7ubuntu3_amd64.deb'

sudo dpkg -i openssl_0.9.8k-7ubuntu3_amd64.deb libssl0.9.8_0.9.8k-7ubuntu3_amd64.deb  libssl-dev_0.9.8k-7ubuntu3_amd64.deb

2. Building Apache

As noted, we need the SVN version of Apache at the time of writing. I'll be using some paths in the following that you should change (like /home/agl/local/ocsp):

svn checkout http://svn.apache.org/repos/asf/httpd/httpd/trunk httpd
cd httpd
svn co http://svn.apache.org/repos/asf/apr/apr/trunk srclib/apr
./buildconf
cd srclib/apr
./configure --prefix=/home/agl/local/ocsp

At this point, I had to patch APR in order to get it to build. I suspect this build break will be fixed in short order but, for the sake of completeness, here's the patch that I applied:

--- poll/unix/pollset.c (revision 892677)
+++ poll/unix/pollset.c (working copy)
@@ -129,6 +129,8 @@
 
 static apr_status_t close_wakeup_pipe(apr_pollset_t *pollset)
 {
+    apr_status_t rv0, rv1;
+
     /* Close both sides of the wakeup pipe */
     if (pollset->wakeup_pipe[0]) {
     rv0 = apr_file_close(pollset->wakeup_pipe[0]);

Now we can build and install Apache itself. Since we are giving a prefix option, this doesn't conflict with any system installs.

cd ../..
./configure --prefix=/home/agl/local/ocsp --with-apr=/home/agl/local/ocsp --enable-ssl --enable-socache-dbm

Again, for the sake of completeness, I'll mention that Apache SVN had a bug at the time of writing that will stop OCSP stapling from working:

--- modules/ssl/ssl_util_stapling.c     (revision 892677)
+++ modules/ssl/ssl_util_stapling.c     (working copy)
@@ -414,6 +414,10 @@
        goto done;
    }

+    if (uri.port == 0) {
+        uri.port = APR_URI_HTTP_DEFAULT_PORT;
+    }
+
    *prsp = modssl_dispatch_ocsp_request(&uri, mctx->stapling_responder_timeout,
                                         req, conn, vpool);

Then build and install it...

make -j4
make install

3. Generating certs

For this example, I'll be generating a CA cert, a server cert and an OCSP responder cert for that CA. In the real world you'll probably be getting the certs from a true CA, so you can skip this step.

cd /home/agl/local/ocsp
mkdir certs && cd certs
wget 'https://fedorahosted.org/pkinit-nss/browser/doc/openssl/make-certs.sh?format=txt'
mv make-certs.sh\?format=txt make-certs.sh
/bin/bash ./make-certs.sh europa.sfo.corp.google.com test@example.com all ocsp:http://europa.sfo.corp.google.com/
cat ocsp.crt ocsp.key > ocsp.pem

Now I'm going to add the CA that was just generated to the CA set. Firstly, Chromium uses an NSS database in your home directory:

certutil -d sql:/home/agl/.pki/nssdb -A -n testCA -i ~/local/ocsp/certs/ca.crt -t Cu,,

OpenSSL uses a file of PEM certs:

cd /etc/ssl
cp cert.pem cert.pem.orig
rm cert.pem
cat cert.pem.orig /home/agl/local/ocsp/certs/ca.crt > cert.pem

4. Running the responder

In the real world, the OCSP responder is run by the CA that you got your certificate from. But here I'm going to be running my own since I generated a new CA in section 3.

cd ~/local/ocsp
touch index.txt
sudo openssl ocsp -index index.txt -port 80 -rsigner certs/ca.pem -CA certs/ca.pem

5. Configuring Apache

I won't cover the basics of configuring Apache here. There are plenty of documents on the web about that. I'll just note that I have Apache only listening on port 443 since my OCSP responder is running on 80.

The config that you'll need is roughly this:

SSLStaplingCache dbm:/tmp/staples
SSLCACertificateFile "/etc/ssl/cert.pem"
SSLUseStapling on

(You probably want to choose a better location for the cache.)

Apache will parse its server certificate on startup and extract the OCSP responder URL. It needs to find the CA certificate in order to validate OCSP responces and that's why the SSLCACertificateFile directive is there (and why we added the CA to that file in section 3).

After restarting Apache, look in the error.log. What you don't want to see is the following:

[Sun Dec 20 17:24:28 2009] [error] ssl_stapling_init_cert: Can't retrieve issuer certificate!
[Sun Dec 20 17:24:28 2009] [error] Unable to configure server certificate for stapling

That means that Apache couldn't find the CA cert.

There are other directives, but they are currently undocumented. Your best bet is to look at the original Apache bug and the commit itself.

Chrome Linux Beta

Life goals: get a comic published. Check.

Digital Economy Bill

Cory Doctorow seems to have crafted the lexicon of the opposition to the Digital Economy Bill with his phrase ‘Pirate Finder General’ [1]. In his follow up he claims that this bill will introduce three-strikes, ISP spying and powers for Peter Mandelson to rewrite copyright law at will.

I spent a fun-filled Sunday afternoon reading the bill to see how bad it really is so that you don't have to (although you still should). You're welcome.

Firstly, the changes to copyright law are only a small part of the bill. Other parts of the bill cover: Channel 4 and Channel 3 licensing, removing the requirements for Teletext, digital switch over, radio licenses and classification of video games. I won't be talking about those in this post.

The bill requires that ISPs pass on infringement notices to subscribers, either via email or via the postal system. Copyright owners can also request infringement lists. This allows them to see that their notices A, B, C all went to the same subscriber. They can then take this information to court and proceed with the usual actions. (124A and 124B.)

The bill defers much of the policy in this area to a code that is to be written by OFCOM with the approval of the Secretary of State. This code includes the number of infringement notices that a single subscriber needs in order to be included in a report, the size of fines to be imposed on ISPs for failing to follow the code and the appeals process. The Secretary of State sets the compensation paid from copyright holders to ISPs and from everyone to OFCOM (124L).

What isn't in the bill is any talk of disconnection, ISP spying or three strikes. However, 124H gives the Secretary of State the power to require any “technical obligation”.

(3) A "technical measure" is a measure that (a) limits the speed or other capacity of the service provided to a subscriber; (b) prevents a subscriber from using the service to gain access to particular material, or limits such use; (c) suspends the service provided to a subscriber; or (d) limits the service provided to a subscriber in another way.

A brief interlude about statutory instruments is needed here. SIs are published by the government and cannot be amended by Parliament. There are two types. The first takes effect automatically and Parliament has a short time (usually 40 days depending on holidays etc) to annul it. The second requires positive action by both Houses before it takes effect.

According to Wikipedia, the last time that an SI was annulled was in 2000, and 1979 before that. The last time that an SI wasn't approved was 40 years ago.

The powers to impose technical measures are of the annul variety: they take effect automatically after 40 days. They don't appear to include requiring ISPs to spy.

After that power, 302A gives the Secretary of State the power to change the Copyright Act via a positive action SI “for the purpose of preventing or reducing the infringement of copyright by means of the internet”.

Next up, 124N gives the Secretary of State the power to take over any domain name registrar. By my reading, that is no exaggeration.

The Secretary can take action if they believe that the actions of a registrar affect “(a) the reputation or availability of electronic communications networks or electronic communications services [...] (b) the interests of consumers or members of the public [...].”. The action can either be to appoint a “manager”, who has total power, or to ask a court to change the constitution of the body and enjoin them from changing it back.

There's no requirement that this registrar be based in the UK, or even to allocate in the uk ccTLD.

Understandably, they are quite upset about this.

I'd also like to quickly note that this bill contains provisions for the licensing of orphan works without the copyright holder being involved also for libraries to have the rights to lend out e-books. (Although without giving any powers to do anything about technical limitations on e-books that might prevent it.)

Lastly, and I might be misunderstanding something here, on page 52 the Secretary of State seems to get powers to amend or annul anything relating to this act, in any bill in this session of Parliament, or before, by using a positive action SI.

Hopefully the above provides pointers for people who want to understand and read the bill. Now, my (informed) opining:

This is an abomination. It's an attempt to subvert Parliament by giving the government the power to write copyright law at will. The provisions are extraordinary. The sanctions against domain name registrars are staggering. I don't know of any other case when the government can sequestrate a private entity at will. Given the international nature of the domain name system, this should cause international concern.

I expect that this power is largely intended to be a very large stick with which to force the removal of xyzsucks.com style names that embarrass business or government. No registrar will dare cross their interests if they have this power.

If you vote in the UK, goto TheyWorkForYou, lookup your MP, write a letter (on paper). Sign the petition. Support ORG.

Recent changes to SSL/TLS on the web

Most of the movement around TLS (aka SSL) currently involves people dealing with the renegotiation issues, but I'm going to sound a happier note today. TLS isn't static; things are changing for the better:

Strict transport security

My colleagues, Dr Barth and Collin Jackson proposed ForceHTTPS some time ago. This has picked up Jeff Hodges, from PayPal, and morphed into Strict Transport Security. Dr Barth and I have implemented this in Chromium and Firefox supports it with the NoScript extension.

In short, you can add a header to your HTTPS replies like: Strict-Transport-Security: max-age=86400 and the browser will remember, for the next 86400 seconds (1 day), that the origin host should only be contacted over HTTPS. It also forbids mixed content.

(Update: Dr Barth points out that the limits on mixed content have been removed as the standard has advanced!)

Chrome dev channel releases already support this and it'll be in Chrome 4.0. The hosts are stored in a JSON file in the profile directory:

{
   "+7cOz6FDyMiPEjNtc0haTPwdZPbvbPFP2NyZIA82GTM=": {
      "expiry": 1258514505.715938,
      "include_subdomains": false
   }
}

If you try to navigate to an http:// URL when that host has STS enabled, the browser will internally rewrite it to https://. Suitable sites (banks etc) should start using this as soon as possible.

Compression

Well, this certainly isn't new! OpenSSL has supported deflate compression on TLS connections for ages, but NSS (the SSL/TLS library used in all Mozilla based products for one) hasn't. This means that Firefox never supported compression, nor Thunderbird (and it's a fairly big deal for IMAP connections).

However, Wan Teh Chang and I have added deflate support to NSS and it'll be in next release. Thanks to Nelson Bolyard for the code review.

Cut through

Here's a diagram of a TLS connection from the RFC:

Client                                               Server

      ClientHello                  -------->
                                                      ServerHello
                                                     Certificate*
                                               ServerKeyExchange*
                                              CertificateRequest*
                                   <--------      ServerHelloDone
      Certificate*
      ClientKeyExchange
      CertificateVerify*
      [ChangeCipherSpec]
      Finished                     -------->
                                               [ChangeCipherSpec]
                                   <--------             Finished
      Application Data             <------->     Application Data

This means that an HTTPS connection adds an extra two round trips on top of HTTP.

Nagendra Modadugu and myself (independently) came up with a “cut through” mode for TLS handshakes. Rather than wait for the server's Finished message, the client can send application data after only one round trip. This means than an attacker can perform a downgrade attack on the cipher and force the client to transmit with a weaker cipher than it might have normally used. However, an attacker cannot get the key so, as long as all the supported ciphers are strong enough, it all works out.

This cuts a round-trip time from a normal HTTPS handshake and should be appearing in Chromium and Android soon.

(Nelson Bolyard tells me that this isn't a novel idea, although it doesn't seem to have had much traction up til now.)

Next protocol negotiation

TLS over port 443 is the only clean channel that many hosts have these days. However, this means that the TCP destination port number can no longer be used to select an application level protocol since it's fixed by firewalls, proxies etc.

The specific use case for this would be SDPY, a new transport layer for HTTP. We want to know, before we send the first request, if the server supports SDPY.

draft-agl-tls-nextprotoneg describes an extension to let you do that. It's being tested in Chromium at the moment (although not yet in the public tree).

Go launch

I'm delighted to be a minor part of the Go launch today:

Go is an experimental language from Google that I've been coding in for the past month or so. It sits in a similar niche to Java: performant but garbage collected. However, it's vastly more enjoyable to code in than Java!

Thanks to a suite of compilers, it compiles to machine code very quickly. There's also a frontend to GCC in the works. It's runtime and type safe, concurrent, has a novel (for me, at least) take on object orientation and provides runtime reflections on types.

Personally, I think it gets a place in my list of favoured tools, which currently contains C, C++, Python and Haskell.

The TLS flaw that wasn't

There were many articles yesterday suggesting that a major new flaw in TLS (aka SSL) had been found ([1][2][3]). The last of those is a post by Ben Laurie, an expert in these matters, with a suitably hyperbolic title: “Another Protocol Bites The Dust”

Here's the issue: there's an extremely uncommon configuration of web servers where they're setup to require client side certificates for some URLs and not others. If a user has an HTTPS connection open that didn't handshake with a client side certificate and they try to access such a URL, the webserver will perform another handshake on the same connection. As soon as that handshake completes with the correct certificate, they'll run the request that was received from before the connection was fully authenticated.

It's a bug in the web server. There was a misunderstanding between what the folks writing the webserver thought that TLS was providing and what it actually provides. One might also argue that it's a short coming in the HTTP protocol (there's no way for a server to ask a client to redo a request). One might also argue that TLS should provide the properties that the web servers expected.

But it's not a flaw in TLS. The TLS security properties are exactly what was intended.

Now, it appears that the fix will be to TLS. That's fine, but the place that gets ‘fixed’ isn't always the place that made the mistake.

I don't understand why knowledgeable folks like EKR and Laurie are so eager to attribute this problem to TLS.

Anti aliased clipping, a tale of woe

People have been complaining that rounded rectangles in Chrome aren't anti-aliased. If you're a web developer, it seems that this is a Big Deal.

The issue is that almost anything can have rounded corners in WebKit. There's not a drawRoundedRectangle function, instead, clipping paths are created and then normal drawing proceeds. On Safari (which is also WebKit, but sitting on top of the CoreGraphics library), clipping to a path is anti-aliased and everything looks pretty. However, Chrome's graphics library, Skia, doesn't do anti-aliased clipping for a good reason.

Consider the figure below:

At the top left is an anti-aliased clipping region. The darker the pixel, the more is covered by the path. If we were to fill the region with green, we would get the image at the bottom left. When drawing, we consider how much of the clipping region covers each pixel and convert that to an alpha value. For a pixel which was half covered by the clipping region we would calculate 50% × background_color + 50% × green.

However, consider what happens when we first fill with red (top right) and then with green (bottom right). We would expect that the result would be the same as filling with green - the second fill should cover the first. But for pixels which are fractionally covered by clipping region, this isn't the case.

The first fill, with red, works correctly as detailed above. But when we come to do the second fill, the background_color isn't the original background color, but the slightly red color resulting from the first fill. Both CoreGraphics and Firefox's <canvas> have this bug.

It might seem trivial, but if you end up covering anti-aliased clipping regions multiple times you end up with unsightly borders around the clip paths. This is why Skia only supports 1-bit clip paths.

The correct way to do anti-aliased clipping is to draw to a layer, on top of the original bitmap, and, when the clipping path is popped from the clip stack, erase outside of the path (anti-aliased) and composite the result onto the underlying bitmap.

This works just fine, the problem is that <canvas> users don't always pop the clip stack. They expect to be able to set a clipping path, draw and have it appear without managing a stack. We could collapse the clipping stack for them when we paint to the screen, but then we need to restore it afterwards, which would require major surgery to Skia.

The second problem with anti-aliasing, even when done correctly, is that it makes it impossible to put polygons next to each other. Try this demo in Firefox and note the hairlines caused by anti-aliasing.

I think what Chrome will end up doing is to anti-alias the clipping paths (correctly) for everything except <canvas>. This isn't a great solution, but it's better than what we have now.

Chromium's seccomp Sandbox

I wrote an article for LWN about Chromium's seccomp sandbox. They decided that it wasn't in the right style for LWN, and they rewrote it to fit. Their version has just become available for free. I'm including my version below:

The Chromium seccomp sandbox

As part of the process of porting Chromium to Linux, we had to decide how to implement Chromium's sandbox on Linux.

The Chromium sandbox is an important part of keeping users safe. The web is a very complicated place these days and the code to parse and interpret it is large and on the front-line of security. We try to make sure that this code is free of security bugs, but history suggests that we can't be perfect. So, we plan for the case where someone has an exploit against our rendering code and run it in its own process with limited authority. It's the sandbox's job to limit that authority as much as possible.

Chromium renderers need very little authority. They need access to fontconfig to find fonts on the system and to open those font files. However, these can be handled as IPC requests to the browser process. They do not need access to the X server (which is why we don't have GTK widgets on web pages), nor should they be able to access DBus, which is increasingly powerful these days.

Drawing is handled using SysV shared memory (so that we can share memory directly with X). Everything else is either serialised over a socketpair or passed using a file descriptor to a tmpfs file. This means that we can deny filesystem access completely. The renderer requires no network access: the network stack is entirely within the browser process.

Traditional sandboxing schemes on Linux involve switching UIDs and using chroot. We'll be using some of those techniques too. But this text is about the most experimental part of our sandbox: the seccomp layer which my colleague Markus Gutschke has been writing.

The kernel provides a little known feature where by any process can enter ‘seccomp mode’. Once enabled it cannot be disabled. Any process running in seccomp mode can only make four system calls: read, write, sigreturn and exit. Attempting any other system call will result in the immediate termination of the process.

This is quite desirable for preventing attacks. It removes network access, which is traditionally difficult to limit otherwise (although CLONE_NEWNET is might help here). It also limits access to new, possibly dangerous, system calls that we don't otherwise need like tee and vmsplice. Also, because read and write proceed at full speed, if we limit our use of other system calls, we can hope to have a minimal performance overhead.

But we do need to support some other system calls. Allocating memory is certainly very useful. The traditional way to support this would be to RPC to a trusted helper process which could validate and perform the needed actions. However, a different process cannot allocate memory on our behalf. In order to affect the address space of the sandboxed code, the trusted code would have to be inside the process!

So that's what we do: each untrusted thread has a trusted helper thread running in the same process. This certainly presents a fairly hostile environment for the trusted code to run in. For one, it can only trust its CPU registers - all memory must be assumed to be hostile. Since C code will spill to the stack when needed and may pass arguments on the stack, all the code for the trusted thread has to carefully written in assembly.

The trusted thread can receive requests to make system calls from the untrusted thread over a socket pair, validate the system call number and perform them on its behalf. We can stop the untrusted thread from breaking out by only using CPU registers and by refusing to let the untrusted code manipulate the VM in unsafe ways with mmap, mprotect etc.

That could work, if only the untrusted code would make RPCs rather than system calls. Our renderer code is very large however. We couldn't patch every call site and, even if we could, our upstream libraries don't want those patches. Alternatively, we could try and intercept at dynamic linking time, assuming that all the system calls are via glibc. Even if that were true, glibc's functions make system calls directly, so we would have to patch at the level of functions like printf rather than write.

This would seem to be a very tough problem, but keep in mind that if we miss a call site, it's not a security issue: the kernel will kill us. It's just a crash bug. So we could use a theoretically incorrect solution so long as it actually worked in practice. And this is what we do:

At startup we haven't processed any untrusted input, so we assume that the program is uncompromised. Now we can disassemble our own memory, find sites where we make system calls and patch them. Correctly parsing x86 machine code is very tough. Native Client uses a customised compiler which only generates a subset of x86 in order to do it. But we don't need a perfect disassembler so long as it works in practice for the code that we have. It turns out that a simple disassembler does the job perfectly well with only a very few corner cases.

Now that we have patched all the call sites to call our RPC wrapper, instead of the kernel, we are almost done. We have only to consider system calls which pass arguments in memory. Because the untrusted code can modify any memory that the trusted code can, the trusted code couldn't validate calls like open. It could verify the filename being requested but the untrusted code could change the filename before the kernel copied the string from user-space.

For these cases, we also have a single trusted process. This trusted process shares a couple of pages of memory with each of the trusted threads. When the trusted thread is asked to make a system call which it cannot safely validate, it forwards the call to the trusted process. Since the trusted process has a different address space, it can safely validate the arguments without interference. It then copies the validated arguments into the shared memory pages. These memory pages are writable by the trusted process, but read-only in the sandboxed process. Thus the untrusted code cannot modify them and the trusted code can safely make the system call using the validated, read-only arguments.

We also use this trick for system calls like mmap which don't take arguments in memory, but are complicated to verify. Recall that the trusted thread has to be hand written in assembly so we try to minimise the amount of this code where possible.

Once we have this scheme in place we can intercept, examine and deny any system calls. We start off denying everything and then, slowly, add system calls that we need. For each system call we need to consider the security implications it might have. Calls like getpid are easy, but what damage could one do with mmap/munmap? Well, the untrusted code could replace the code which the trusted threads are running for one! So, when a call might be dangerous we allow only a minimal, and carefully examimed, subset of flags which match the uses that we actually have in our code.

We'll be layering this sandbox with some more traditional UNIX sandboxing techniques in the final design. However, you can get a preview of the code in it's incomplete state already at its Google Code homepage.

There's still much work to be done. A given renderer could load a web page with an iframe to any domain. Those iframes are handled in the same renderer, thus a compromised renderer can ask the browser for any of the user's cookies. Microsoft research developed Gazelle, which has much stricter controls on a renderer, at the expense of web-compatibility. We know that users wont accept browsers that don't work with their favourite websites, but we are also very jealous of Gazelle's security properties so hopefully we can improve Chromium along those lines in the future.

Another weak spot are installed plugins. Plugin support on Linux is very new but on Windows, at least, we don't sandbox plugins. They don't expect to be sandboxed and we hurt web-compatibility (and break their auto-updating) if we limit them. That means that plugins are a vector for more serious attacks against web browsers. As ever, keep up to date with the latest security patches!

DNSCurve Internet Draft

Matthew has posted a Internet draft for DNSCurve. DNSCurve is a way of securing DNS which isn't DNSSEC. See Dan's talk from a couple of weeks ago about why DNSCurve is the better answer.

DEFCON 17

I'll be going to DEFCON this year. Ping me if you'll be around.

SELinux from the inside out

There are some great sources of information for users and sysadmins about SELinux [1] [2] but your author has always preferred to understand a system from the bottom-up and, in this regard, found the information somewhat lacking. This document is a guide to the internals of SELinux by starting at the kernel source and working outwards.

We'll be drawing on three different sources in order to write this document.

Access vectors

SELinux is fundamentally about answering questions of the form “May x do y to z?” and enforcing the result. Although the nature of the subject and object can be complex, they all boil down to security identifiers (SIDs), which are unsigned 32-bit integers.

The action boils down to a class and a permission. Each class can have up to 32 permissions (because they are stored as a bitmask in a 32-bit int). Examples of classes are FILE, TCP_SOCKET and X_EVENT. For the FILE class, some examples of permissions are READ, WRITE, LOCK etc.

At the time of writing there are 73 different classes (selinux/libselinux/include/selinux/flask.h) and 1025 different permissions (.../av_permissions.h).

The security policy of a system can be thought of as a table, with subjects running down the left edge, objects across the top and, in each cell, the set of actions which that subject can perform on that object.

This is reflected in the first part of the SELinux code that we'll look at : the access vector cache (security/selinux/avc.c). The AVC is a hash map from (subject, object, class) to the bitset of permissions allowed:

struct avc_entry {
        u32                     ssid;    // subject SID
        u32                     tsid;    // object SID
        u16                     tclass;  // class
        struct av_decision      avd;     // contains the set of permissions for that class
};

The AVC is queried when the kernel needs to make security decisions. SELinux hooks into the kernel using the LSM hooks and is called whenever the kernel is about to perform an action which needs a security check. Consider the getpgid system call to get the current process group ID. When SELinux is built into a kernel, this ends up calling the following hook function (security/selinux/hooks.c):

static int selinux_task_getpgid(struct task_struct *p)
{
        return current_has_perm(p, PROCESS__GETPGID);
}

static int current_has_perm(const struct task_struct *tsk,
                            u32 perms)
{
        u32 sid, tsid;

        sid = current_sid();
        tsid = task_sid(tsk);
        return avc_has_perm(sid, tsid, SECCLASS_PROCESS, perms, NULL);
}

Referring back to the table concept: in order to check if a process with SID x may call getpgid we find x across and x down and check that SECCLASS_PROCESS:PROCESS__GETPID is in the set of allowed actions.

So now we have to discover what the AVC is actually caching, and where these SIDs are coming from. We'll tackle the latter question first.

SIDs and Security Contexts

SIDs turn out to be much like interned symbols in some languages. Rather than keeping track of complex objects and spending time comparing them during lookups, they are reduced to an identifier via a table. SIDs are the interned identifiers of security contexts. The sidtab maps from one to the other (security/selinux/ss/sidtab.h):

struct sidtab {
        struct sidtab_node **htable;
        unsigned int nel;       /* number of elements */
        unsigned int next_sid;  /* next SID to allocate */
        unsigned char shutdown;
        spinlock_t lock;
};

struct sidtab_node {
        u32 sid;                       /* security identifier */
        struct context context;        /* security context structure */
        struct sidtab_node *next;
};

The SID table is optimised for mapping from SIDs to security contexts. Mapping the other way involves walking the whole hash table.

The structure for the security context is probably familiar to you if you have worked with SELinux before (security/selinux/ss/context.h):

struct context {
        u32 user;
        u32 role;
        u32 type;
        u32 len;        /* length of string in bytes */
        struct mls_range range;
        char *str;        /* string representation if context cannot be mapped. */
};

If you have an SELinux enabled system, you can look at your current security context with id -Z. Running that will produce something like unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023. This string splits into four parts:

  1. The SELinux “user”: unconfined_u
  2. The role: unconfined_r
  3. The type: unconfined_t (we'll mostly be concentrating on types)
  4. The multi-level-security (MLS) sensitivity and compartments: s0-s0:c0.c1023

(You might notice that the parts are broken up with colons, but that the MLS part can contain colons too! Obviously, this is the only part that can contain colons to avoid ambiguity.)

When the system's security policy is compiled, these names are mapped to IDs. It's these IDs which end up in the kernel's context structure. Also notice that, by convention, types end in _t, roles with _r and users with _u. Don't confuse UNIX users with SELinux users; they are separate namespaces. For a sense of scale, on a Fedora 11 box, the default policy includes 8 users, 11 roles and 2727 types.

The Security Server

We now address the question of what it is that the access vector cache is actually caching. When a question is asked of the AVC to which it doesn't have an answer, it falls back on the security server. The security server is responsible for interpreting the policy from userspace. The code lives in context_struct_compute_av (in security/selinux/ss/services.c). We'll walk through its logic (and we'll expand on each of these points below):

  1. The subject and object's type are used to index type_attr_map, which results in a set of types for each of them.
  2. We consider the Cartesian product of the two sets and build up a 32-bit allowed bit-vector based on the union of the permissions in the access vector table for each (subject, object) pair.
  3. For each pair in the product, we also include the union of permissions from a second access vector table: the conditional access vector table.
  4. The target type is used to index an array and from that we get a linked list of “constraints”. Each constraint contains byte code for a stack based virtual machine and can limit the granted permissions.
  5. If the resulting set of permissions includes role transition, then we walk a linked list of allowed role transitions. If the transition isn't whitelisted, those permissions are removed.
  6. If either the subject or object's type is ‘bounded’, then we recurse and check the permissions of the bounded types. We verify that the resulting permissions are a subset of the permissions enjoyed by types that they are bounded by. This should be statically enforced by the tool with produced the policy so, if we find a violation, it's logged and the resulting permissions are clipped.

Now, dealing with each of those steps in more detail:

Type attributes

Type attributes are discussed in the Configuring the SELinux Policy report. They are used for grouping types together: by including a type attribute on a new type, the new type inherits all the permissions granted to the type attribute. As can be seen from the description above, type attributes are implemented as types themselves.

These attributes could have been statically expanded by the tool which generated the policy file. Expanding at generation time is a time/space tradeoff and the SELinux developers opted for the smaller policy file.

It's also worth noting that type_attr_map isn't expanded recursively: one can only have one level of type attributes.

Type attributes conventionally end in _type (as opposed to types, which end in _t). In the Fedora 11 policy, here are the top five type attributes:

Name of type attribute Number of types with that attribute
file_type 1406
non_security_file_type 1401
exec_type 484
entry_type 478
domain 442

The graph of types and type attributes is, as expected, bipartite.

The conditional access vector table

The conditional access vector table contains permissions just like the regular access vector table except that each, optionally, has an extra flag: AV_ENABLED (security/selinux/avtab.h). This flag can be enabled and disabled at run time by changing the value of ‘booleans’. These booleans are quite well covered by the higher-level documentation for the policy language (here and here).

The set of booleans can be found in /selinux/booleans (if you are running SELinux). They can be read without special authority although you should be aware of a bug: trying to read more than a page from one of those files results in -EINVAL and recent coreutils binaries (like cat) use a buffer size of 32K. Instead you can use dd, or just run the friendly tool: semanage boolean -l.

The AV_ENABLED flag is updated when a boolean is changed. The conditional access vector table is populated by a list of cond_node structures (security/selinux/conditional.h). These contain a bytecode for a limited, stack based machine and and two lists of access vectors which should be enabled or disabled in the case that the machine returns true or false.

The stack machine can read any of the configured booleans and combine them with standard boolean algebra, returning a single bit result.

Constraints

One of the parts of the SELinux policy language is the ability to define constraints. Constraints are defined using the neverallow command. Constraints are used to prevent people from writing bad policy, or in the case of MLS, to enforce rules governing information flow. http://danwalsh.livejournal.com/12333.html

As you can see if you read the above linked blog post, constraints are statically enforced by the policy tools where possible and also checked by the kernel. Constraints are evaluated by running a stack-machine bytecode. (This is a different machine than that which is used for the conditional access vector table.) Based on the kernel code for the stack-machine, we can write a simple disassembler and see what constraints are enforced in the kernel.

In the Fedora 11 policy, 32 classes have constraints applied to them. Let's have a look at some of them. Here's the first one:

constraint for class 'process' permissions:800000:
  DYNTRANSITION
subject.user == object.user?
subject.role == object.role?
and

Roughly translated, this means “Whenever operating on an object of class process, the DYNTRANSITION permission is forbidden unless the user and role of the subject and object match”. A DYNTRANSITION (dynamic transition) is when a process switches security contexts without execing a binary. Think of it like a setuid call for security contexts (we'll cover how to perform this later).

Here's another constraint, a longer one this time:

constraint for class 'file' permissions:188:
  CREATE
  RELABELFROM
  RELABELTO
subject.user == object.user?
[bootloader_t, devicekit_power_t, logrotate_t, ldconfig_t, unconfined_cronjob_t, unconfined_sendmail_t, setfiles_mac_t,
initrc_t, sysadm_t, ada_t, fsadm_t, kudzu_t, lvm_t, mdadm_t, mono_t, rpm_t, wine_t, xdm_t, unconfined_mount_t,
oddjob_mkhomedir_t, saslauthd_t, krb5kdc_t, newrole_t, prelink_t, anaconda_t, local_login_t, rpm_script_t,
sysadm_passwd_t, system_cronjob_t, tmpreaper_t, samba_unconfined_net_t, unconfined_notrans_t, unconfined_execmem_t,
devicekit_disk_t, firstboot_t, samba_unconfined_script_t, unconfined_java_t, unconfined_mono_t,
httpd_unconfined_script_t, groupadd_t, depmod_t, insmod_t, kernel_t, kpropd_t, livecd_t, oddjob_t, passwd_t, apmd_t,
chfn_t, clvmd_t, crond_t, ftpd_t, inetd_t, init_t, rshd_t, sshd_t, staff_t, udev_t, virtd_t, xend_t, devicekit_t,
remote_login_t, inetd_child_t, qemu_unconfined_t, restorecond_t, setfiles_t, unconfined_t, kadmind_t,
ricci_modcluster_t, rlogind_t, sulogin_t, yppasswdd_t, telnetd_t, useradd_t, xserver_t] contains subject.type?
or

This means that when you create a file or change its security context, either the SELinux user of the file has to match your current SELinux user, or you have to be one of a list of privileged types.

One last example foreshadows several large subjects: user-land object managers and multi-level security. For now I'll leave it undiscussed to wet your appetite.

constraint for class 'db_database' permissions:7de:
  DROP
  GETATTR
  SETATTR
  RELABELFROM
  ACCESS
  INSTALL_MODULE
  LOAD_MODULE
  GET_PARAM
  SET_PARAM
object.sensitivity[high] dominates type?

Roles and users

In step 5, above, we mention ‘role transitions’, so we should probably discuss SELinux users and roles. Keep in mind that SELinux users are separate from normal UNIX users.

Each type inhabits some set of roles and each role inhabits some set of SELinux users. UNIX users are mapped to SELinux users at login time (run `semanage login -l`) and so each user has some set of roles that they may operate under. Like the standard custom of administrating a system by logging in as a normal user and using sudo only for the tasks which need root privilege, roles are designed for the same purpose. Although a given physical user may need to perform administrative tasks, they probably don't want to have that power all the time. If they did, then there would be a confused deputy problem when they perform what should be an unprivileged task which does far more than intended because they performed it with excess authority.

Here's the graph of users and roles in the Fedora 11 targeted policy:

An SELinux user can move between roles with the newrole command, if such a role transition is permitted. Here's the graph of permitted role transitions in the Fedora policy:

With the targeted policy at least, roles and users play a relatively small part in SELinux and we won't cover them again.

Bounded types

A type in SELinux may be “bounded” to another type. This means that the bounded type's permissions are a strict subset of the parent and here we find the beginnings of a type hierarchy. The code for enforcing this originally existed only in the user-space tools which build the policy, but recently it was directly integrated into the kernel.

In the future, this will make it possible for a lesser privileged process to safely carve out subsets of policy underneath the administratively-defined policy. At the time of writing, this functionality has yet to be integrated in any shipping distribution.

(Thanks to Stephen Smalley for clearing up this section.)

The SELinux filesystem

The kernel mostly communicates with userspace via filesystems. There's both the SELinux filesystem (usually mounted at /selinux) and the standard proc filesystem. Here we'll run down some of the various SELinux specific entries in each.

But first, a quick note. Several of the entries are described as ‘transaction’ files. This means that you must open them, perform a single write and then a single read to get the result. You must use the same file descriptor for both (so, no echo, cat pairs in shell scripts).

/selinux/enforcing

A boolean file which specifies if the system is in ‘enforcing’ mode. If so, SELinux permissions checks are enforced. Otherwise, they only cause audit messages.

(Read: unprivileged. Write: requires root, SECURITY:SETENFORCE and that the kernel be built with CONFIG_SECURITY_SELINUX_DEVELOP.)

/selinux/disable

A write only, boolean file which causes SELinux to be disabled. The LSM looks are reset, the SELinux filesystem is unregistered etc. SELinux can only be disabled once and probably doesn't leave your kernel in the best of states.

(Read: unsupported. Write: requires root, and that the kernel be built with CONFIG_SECURITY_SELINUX_DISABLE.)

/selinux/policyvers

A read only file which contains the version of the current policy. The version of a policy is contained in the binary policy file and the kernel contains logic to deal with older policy versions, should the version number in the file suggest that it's needed.

(Read: unprivileged. Write: unsupported.)

/selinux/load

A write only file which is used to load policies into the kernel. Loading a new policy triggers a global AVC invalidation.

(Read: unsupported. Write: requires root and SECURITY:LOAD_POLICY.)

/selinux/context

A transaction file. One writes a security context string and then reads the resulting, canonicalised context. The context is canonicalised by running it via the sidtab.

(Read/Write: unprivileged.)

/selinux/checkreqprot

A boolean file which determines which permissions are checked for mmap and mprotect calls. In certain cases the kernel can actually grant a process more access than it requests with these calls. (For example, if a shared library is marked as needing an executable stack, then the kernel may add the PROT_EXEC permission if the process didn't request it.)

If the value of this boolean is one, then SELinux checks the permissions requested by the process. If 0, it checks the permissions which the process will actually receive.

(Read: unprivileged. Write: requires root, and SECURITY:SETCHECKREQPROT.)

/selinux/access

A transaction file which allows a user-space process to query the access vector table. This is the basis of user-space object managers.

The write phase consists of a string of the following form: ${subject security context (string)} ${object security context (string)} ${class (uint16_t, base 10)} ${requested permissions (uint32_t bitmap, base 16)}.

The read phase results in a string with this format: ${allowed permissions (uint32_t bitmap, base 16)} 0xffffffff ${audit allow (uint32_t bitmap, base 16)} ${audit deny (uint32_t bitmap, base 16)} ${sequence number (uint32_t, base 10)} ${flags (uint32_t, base 16)}.

This call will be covered in greater detail in the User-space Object Managers section, below.

(Read/Write: SECURITY:COMPUTE_AV.)

Attribute files

SELinux is also responsible for a number of attribute files in /proc. The attribute system is actually a generic LSM hook, although the names of the nodes are current hardcoded into the code for the proc filesystem.

/proc/pid/attr/current

Contains the current security context for the process. Writing to this performs a dynamic transition to the new context. In order to do this:

(Read: PROCESS:GETATTR. Write: only allowed for the current process and requires PROCESS:SETCURRENT)

/proc/pid/attr/exec

Sets the security context for child processes. The permissions checking is done at exec time rather than when writing this file.

(Read: PROCESS:GETATTR. Write: only allowed for the current process and requires PROCESS:SETEXEC)

/proc/pid/attr/fscreate

Sets the security context for files created by the current process. The permissions checking is done at creat/open time rather than when writing this file.

(Read: PROCESS:GETATTR. Write: only allowed for the current process and requires PROCESS:SETFSCREATE)

/proc/pid/attr/keycreate

Sets the security context for keys created by the current process. Keys support in the kernel is documented in Documentation/keys.txt. The permissions checking is done at creation time rather than when writing this file.

(Read: PROCESS:GETATTR. Write: only allowed for the current process and requires PROCESS:SETKEYCREATE)

/proc/pid/attr/sockcreate

Sets the security context for sockets created by the current process. The permissions checking is done at creation time rather than when writing this file.

(Read: PROCESS:GETATTR. Write: only allowed for the current process and requires PROCESS:SETSOCKCREATE)

User-space object managers

Although the kernel is a large source of authority for many process, it's certainly not the only one these days. An increasing amount of ambient authority is being granted via new services like DBus and then there's always the venerable old X server which, by default, allows clients to screenshot other windows, grab the keyboard input etc.

The most common example of a user-space process attempting to enforce a security policy is probably a SQL servers. PostgreSQL and MySQL both have a login system and an internal user namespace, permissions database etc. This leads to administrators having to learn a whole separate security system, use password authentication over local sockets, include passwords embedded in CGI scripts etc.

User-space object managers are designed to solve this issue by allowing a single policy to express the allowed actions for objects which are managed outside the kernel. The NSA has published a number of papers about securing these types of systems: X: [1] [2], DBus: [1]. See also the SE-PostgreSQL project for details on PostgreSQL and Apache.

In order to implement such a design a user-space process needs to be able to label its objects, query the global policy and determine the security context of requests from clients. The libselinux library contains the canonical functions for doing all these things, but this document is about what lies under the hood, so we'll be doing it raw here.

The task of labeling objects is quite specific to each different object manager and this problem is discussed in the above referenced papers. Labels need to be stored (probably persistently) and administrators need some way to query and manipulate them. For example, in the X server, objects are often labeled with a type which derives from its name ("XInput" → "input_ext_t")

When it comes to querying the policy database, a process could either open the policy file from disk (which we'll cover later) or it could query the kernel. Querying the kernel solves a number of issues around locating the policy and invalidating caches when it gets reloaded, so that's the path which the SELinux folks have taken. See the section on /selinux/access for the interface for doing this.

In order to authenticate requests from clients, SELinux allows a process to get the security context of the other end of a local socket. There are efforts underway to extend this beyond the scope of a single computer, but I'm going to omit the details for brevity here.

I'll start with a code example of this authentication first:

  const int afd = socket(PF_UNIX, SOCK_STREAM, 0);
  assert(afd >= 0);

  struct sockaddr_un sun;
  memset(&sun, 0, sizeof(sun));
  sun.sun_family = AF_UNIX;
  strcpy(sun.sun_path, "test-socket");
  assert(bind(afd, (struct sockaddr*) &sun, sizeof(sun)) == 0);
  assert(listen(afd, 1) == 0);
  const int fd = accept(afd, NULL, NULL);
  assert(fd >= 0);

  char buf[256];
  socklen_t bufsize = sizeof(buf);
  assert(getsockopt(fd, SOL_SOCKET, SO_PEERSEC, buf, &bufsize) == 0);
  printf("%s\n", buf);

This code snippet will print the security context of any process which connects to it. Running it without any special configuration on a Fedora 11 system (targeted policy) will result in a context of unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023. Don't try running it on a socket pair however, you end up with system_u:object_r:unlabeled_t:s0.

If you already have code which is using SCM_CREDENTIALS to authenticate peers, you can use getpidcon to get a security context from a PID. Under the hood this just reads /proc/pid/attr/context.

Now that we can label requests, the next part of the puzzle is getting access decisions from the kernel. As hinted at above, the /selinux/access file allows this. See above for the details of the transaction format. As an example, we'll see if the action PROCESS:GETATTR, with a subject and object of unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023, is permitted.

  → unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 2 00010000
  ← f77fffff ffffffff 0 fffafb7f 11

This is telling us that it is permitted (the only bits missing are for EXECHEAP and DYNTRANSITION). It also tells us the permissions which should be logged on allow and deny and the sequence number of the policy state in the kernel. Note that, above, we documented an additional flags field, however it's missing in this example. That's another good reason to use libselinux! The flags field was only recently added and isn't in the kernel which I'm using for these examples.

At this time, the astute reader will be worried about the performance impact of getting this information from the kernel in such a manner. The solution is to use the same access vector cache code that the kernel uses, in user-space to cache the answers from the kernel. This is another benefit which libselinux brings.

However, every cache brings with it problems of consistency and this is no different. All user-space object managers need to know when the administrator updates the system security policy so that they can flush their AVCs. This notification is achieved via a netlink socket, as demonstrated by the following snippet:

  const int fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_SELINUX);
  assert(fd >= 0);

  struct sockaddr_nl addr;
  int len = sizeof(addr);
  memset(&addr, 0, len);
  addr.nl_family = AF_NETLINK;
  addr.nl_groups = SELNL_GRP_AVC;
  assert(bind(fd, (struct sockaddr*) &ddr, len) == 0);

  struct sockaddr_nl nladdr;
  socklen_t nladdrlen;
  char buf[1024];
  struct nlmsghdr *nlh = (struct nlmsghdr *)buf;

  for (;;) {
    nladdrlen = sizeof(nladdr);
    const ssize_t r = recvfrom(fd, buf, sizeof(buf), 0,
                               (struct sockaddr*) &nladdr, &nladdrlen);
    assert(r >= 0);
    assert(nladdrlen == sizeof(nladdr));
    assert(nladdr.nl_pid == 0);
    assert((nlh->nlmsg_flags & MSG_TRUNC) == 0);
    assert(nlh->nlmsg_len <= r);

    if (nlh->nlmsg_type == SELNL_MSG_SETENFORCE) {
      struct selnl_msg_setenforce *msg = NLMSG_DATA(nlh);
      printf("enforcing %s\n", msg->val ? "on" : "off");
    } else if (nlh->nlmsg_type == SELNL_MSG_POLICYLOAD) {
      struct selnl_msg_policyload *msg = NLMSG_DATA(nlh);
      printf("policy loaded, seqno:%d\n", msg->seqno);
    }
  }

If you toggle the enforcing mode off and on, or reload the system policy (with `semodule -R`), a message is delivered to via the netlink socket. A user-space object manager can then flush its AVC etc.

With all the above, hopefully it's now clear how user-space object managers work. If you wish to write your own, remember to read the libselinux man pages first.

Reading binary policy files

The system security policy is written in a text-based language which has been well documented elsewhere. These text files are compiled and checked by user-space tools and converted into a binary blob that can be loaded into the kernel. The binary blob is also saved on disk and can be a useful source for information.

The SELinux user-space tools contain libsepol which is very useful for parsing these files. Here's a snippet of example code which returns the number of users, roles and types defined in a policy file:

#include <sepol/policydb.h>
#include <sepol/policydb/policydb.h>

int main(int argc, char **argv) {
  FILE* file = fopen(argv[1], "r");
  sepol_policy_file_t* input;
  sepol_policy_file_create(&input);
  sepol_policy_file_set_fp(input, file);

  sepol_policydb_t* policy;
  sepol_policydb_create(&policy);
  sepol_policydb_read(policy, input);

  printf("users:%d roles:%d types:%d\n",
         policy->p.p_users.nprim
         policy->p.p_roles.nprim
         policy->p.p_types.nprim);

  return 0;
};

By looking in the sepol/policydb/policydb.h header, you can probably find whatever you are looking for. Pay special heed to the comments about indexing however. Users, roles and types are indexed from 1 in some places and from 0 in others.

With a little C code, much of the useful information can be extracted from the policy files. The numbers and graphs above were generated this way, with a little help from a few Python scripts.

Conclusion

Hopefully we've covered some useful areas of SELinux that some people were unfamiliar with before, or at least shown the inner workings of something which you already knew about.

If you want information about the practical aspects of administering a system with SELinux, you should start with the Fedora documentation on the subject. After reading this document I hope that some of it is clearer now.

General homomorphic encryption

If you've heard of Hal Finney, the following quote should be enough to get you to read: his explanation of the recent homomorphic encryption paper:

This is IMO one of the most remarkable crypto papers ever. Not only does it solve one of the oldest open problems in cryptography, the construction of a fully homomorphic encryption system, it does so by means of a self-embedding technique reminiscent of Godel's theorem.

Linux sandboxing with LSMSB

Chrome Linux got a dev channel release and I'm very happy with it. It's now my primary browser.

However, one of the big selling points for Chrome on Windows is that the renderers (which deal with decoding the HTML, CSS, image files etc) are sandboxed. We've had exploitable issues in the renderers which which have been stopped by the sandbox. It's a Good Thing.

However, we don't have a sandbox on Linux! The Mac team have been talking about how nice their sandbox is (and I expect we'll get some official documentation about it after WWDC this week). We have to hack around with SUID binaries, chrooting, seccomp and one-size-fits all SELinux solutions.

(I don't wish to discount the good work that the SELinux folks have done: we'll probably use something like that sandbox on Fedora, but Chromium was very carefully written to be sandboxed and we should aim higher.)

So, as part of the exploration of what we could do with sandboxing on Linux, longer term, I have a prototype implementation of LSMSB. It's another literate program of mine, so you can usefully read the source too. The README is included below:

This is LSMSB, a sandboxing scheme for Linux based on the ideas of the OS X
sandbox (which, in turn, was inspired by TrustedBSD and FreeBSD).

Imagine that you're working on a university computer and you get a binary which
promises to do some fiendishly complex calculation, reading from a file ./input
and writing to a file ./output. It also talks to a specific server to access a
pre-computed lookup table. You want to run it, but you don't want to have to
trust that it won't do anything malicious (save giving the wrong answer).

This code is incomplete, but currently you can take a sandbox specification
like this:

filter dentry-open {
  constants {
    var etc-prefix bytestring = "/etc/";
  }

  ldc r2,etc-prefix;
  isprefixof r2,r2,r0;
  jc r2,#fail;
  ldi r0,1;
  ret r0;
#fail:
  ldi r0,0;
  ret r0;
}

... and use it to remove access to /etc.

*** This code functions, but is incomplete ***

It's written in a literate programming style, but the derived sources are
included so that you don't have to bother with that in order to build. You'll
need a recent (> 2.6.30-rc1) kernel in order to apply the included patch. Once
you've applied the patch, drop lsmsb.c into security/lsmsb and rebuild.

You can assemble a sandbox file with:
  ./lsmsb-as sandbox-input.sb > sandbox
And then run a shell in the sandbox with:
  ./lsmsb-install sandbox

To read the code, see http://www.imperialviolet.org/binary/lsmsb.html

Chrome for Linux

Myself and the rest of the Chrome Linux team have been working hard over the past few months to get Chrome ported to Linux. It's certainly very rough still, but it runs and the first development release just got released.

I'm very happy with this and we should be pushing new releases frequently from now on. If you're in the San Francisco office tomorrow, feel free to pop by the office as I'll be bringing in champagne.

Just to be clear, here are some of the things which don't work:

W2SP and Seccomp

I gave a talk today at W2SP about opportunistic encryption. You would have to ask someone in the audience how it went to get a real answer, but I feel it went OK.

The talk was based on a paper that I wrote for the conference.

Also, LWN covered some recent work that I've been doing at Google with Linux sandboxing.

Moved to GitHub

I've finally manged to move IV off heeps, a server which it's been ticking along on for the last half decade.

In the process, I've moved to GitHub using their Pages system. We'll see how well it works out!

In the process I've cleaned out a lot of stuff and probably broken lots of links. I trust that the search engines will figure it all out soon enough.

I'll be at CodeCon this y...

I'll be at CodeCon this year.

Thanks to Alexander Sotir...

Thanks to Alexander Sotirov to pushing me to check that the carry chains in donna-c64 were sufficient. I don't know if I realised something when I wrote it which I'm currently missing, or if I just screwed up, but I now believe that they're wrong.

I wrote this Haskell code to check it:

This Haskell code has been written to experiment with the carry chains in
curve25519-donna-c64. It's a literate Haskell program, one can load it into
GHCI and play along.

> module Main where
>
> import Data.Bits (shiftR, (.&.))

There are two constants that we'll need.

Our five limbs are, nominally, 51 bits wide, so this is the maximum value of
their initial values.

> twoFiftyOneMinusOne = (2 ^ 51- 1

2^128 - 1 is the limit of the range of our temporary variables. If we exceed
this at any point, our calculations will be incorrect.

> two128MinusOne = (2 ^ 128- 1

Now we define a type which mimics our 128-bit unsigned type in C. It's a
disjuction of an Integer and the distinguished value 'Overflow'. 'Overflow' is
contagious: if we try to perform any operations where one or both of the
operands is 'Overflow', then the result is also 'Overflow'.

> data U128 = U128 Integer
>           | Overflow
>           deriving (Show, Eq)

We make U128 an instance of Num so that we can perform arithmetic with it.

> instance Num U128 where
>   (U128 a) + (U128 b) = mayOverflow (a + b)
>   _ + _ = Overflow
>   (U128 a) * (U128 b) = mayOverflow (a * b)
>   _ * _ = Overflow
>   (U128 a) - (U128 b) = mayOverflow (a - b)
>   _ - _ = Overflow
>   negate _ = Overflow
>   abs a@(U128 _) = a
>   abs _ = Overflow
>   signum (U128 _) = 1
>   signum _ = 0
>   fromInteger = mayOverflow

> instance Ord U128 where
>   compare (U128 a) (U128 b) = compare a b
>   compare _ _ = EQ

This function lifts an Integer to a U128. If the value is out of range, the
result is 'Overflow'

> mayOverflow :: Integer -> U128
> mayOverflow x
>   | x > two128MinusOne = Overflow
>   | x < 0 = Overflow
>   | otherwise = U128 x

Our field elements consist of five limbs. In the C code, these limbs are
actually uint64_t's, but we keep them as U128's here. We will convince ourselves
that we don't hit any 64-bit overflows later.

> data FieldElement = FieldElement { m0 :: U128, m1 :: U128, m2 :: U128,
>                                    m3 :: U128, m4 :: U128 }
>                                  deriving (Show, Eq)

Now, two helper functions:

This function takes only the bottom 51-bits of a value

> clamp :: U128 -> U128
> clamp (U128 a) = U128 $ a .&. 0x7ffffffffffff
> clamp _ = Overflow

This function drop the bottom 51-bits of a value

> topBits :: U128 -> U128
> topBits (U128 a) = U128 $ a `shiftR` 51
> topBits _ = Overflow

This function simulates the 'fsquare' function in donna-c64, including its carry
chain. If the carry chain is sufficient, then iterating this function for any
valid initial value should never overflow.

> square :: FieldElement -> FieldElement
> square e = result where
>   t0 = m0 e * m0 e
>   t1 = m0 e * m1 e +
>        m1 e * m0 e
>   t2 = m0 e * m2 e +
>        m2 e * m0 e +
>        m1 e * m1 e
>   t3 = m0 e * m3 e +
>        m3 e * m0 e +
>        m1 e * m2 e +
>        m2 e * m1 e
>   t4 = m0 e * m4 e +
>        m4 e * m0 e +
>        m3 e * m1 e +
>        m1 e * m3 e +
>        m2 e * m2 e
>   t5 = m4 e * m1 e +
>        m1 e * m4 e +
>        m2 e * m3 e +
>        m3 e * m2 e
>   t6 = m4 e * m2 e +
>        m2 e * m4 e +
>        m3 e * m3 e
>   t7 = m3 e * m4 e +
>        m4 e * m3 e
>   t8 = m4 e * m4 e
>
>   t0' = t0 + t5 * 19
>   t1' = t1 + t6 * 19
>   t2' = t2 + t7 * 19
>   t3' = t3 + t8 * 19
>
>   t1'' = t1' + topBits t0'
>   t2'' = t2' + topBits t1''
>   t3'' = t3' + topBits t2''
>   t4' = t4 + topBits t3''
>   t0'' = t0' + 19 * topBits t4'
>   t1''' = clamp t1'' + topBits t0''

At this point, we implement two carry chains. If 'currentChain' is true, then we
implement the carry chain as currently written in donna-c64. Otherwise, we
perform an extra step and carry t1 into t2.

>   result = if currentChain
>               then FieldElement (clamp t0'') t1''' (clamp t2'') (clamp t3'')
>                                 (clamp t4')
>               else FieldElement (clamp t0'') (clamp t1''') t2''' (clamp t3'')
>                                 (clamp t4') where
>                    t2''' = clamp t2'' + topBits t1'''

This is the maximum initial element: an element where all limbs are 2^51 - 1.
Inspection of the 'fexpand' function should be sufficient to convince oneself of
this.

> maxInitialElement :: FieldElement
> maxInitialElement = FieldElement twoFiftyOneMinusOne twoFiftyOneMinusOne
>                                  twoFiftyOneMinusOne twoFiftyOneMinusOne
>                                  twoFiftyOneMinusOne

This function takes two field elements and returns the worst case result: one
where the maximum of each limb is chosen.

> elementWiseMax :: FieldElement -> FieldElement -> FieldElement
> elementWiseMax x y = FieldElement (f m0) (f m1) (f m2) (f m3) (f m4) where
>   f :: (FieldElement -> U128) -> U128
>   f accessor = max (accessor x) (accessor y)

We now define a series of values generated by squaring the previous element and
setting any limb that is less than the maximum to the maximum value.

> maxSeries = iterate (elementWiseMax maxInitialElement . square)
>                     maxInitialElement

This value controls which carry chain is used in 'square', the current one or
the one with the extra carry

> currentChain = True

By running this, we can see that the current carry chain is insufficient for
this simulation:

ghci> maxSeries !! 4
FieldElement {m0 = Overflow, m1 = Overflow, m2 = Overflow, m3 = Overflow,
              m4 = Overflow}

The series overflows after only four iterations. However, if we use the
alternative carry chain, the series is stable far beyound the requirements of
the Montgomery ladder used in donna-c64:

ghci> maxSeries !! 100000
FieldElement {m0 = U128 2251799813685247, m1 = U128 2251799813685247,
              m2 = U128 2251799813685247, m3 = U128 2251799813685247,
              m4 = U128 2251799813685247}

Additionally, these values are small enough not to overflow the 64-limb limbs.

When I wrote curve25519-d...

When I wrote curve25519-donna I implemented many of the critical functions in x86-64 assembly. It was a lot of code, even using the C preprocessor! This got a good 20% boost in speed. This was clearly very important because it made donna-x86-64 faster than djb's version .

However, djb just pointed out that the 64-bit C implementation of donna was now as fast as my hand coded version. Turns out that GCC 4.3 greatly improved the quality of the code generation for this sort of code and now equals my hand crafted efforts! Well done to the GCC team because the C code is vastly smaller and easier to understand. Thus, the x86-64 of donna has been removed from the repo.

Packet sizes in DNSSEC

Even when the DNS root hasn't started signing records, one can still use trust-anchors to employ DNSSEC for those TLDs which support it. Follow the links from Ben Laurie's latest blog post on the matter.

The .se ccTLD is one of those TLDs which support DNSSEC. You can test it with: dig +dnssec -t any se @a.ns.se. You'll see lots of NSEC, RRSIG and DNSKEY records. (DNSSEC is very complicated.)

However, the size of that reply is 3974 bytes long! All that from a request packet of 31 bytes. That's a very easy to use 100x DoS amplication. Of course, if you use mirror amplication like that, you cannot forge the source addresses of the flooding packets, making the flood easier to filter. However, DNSSEC may well bring DoS floods into the reach of many more attackers.

When Layers of Abstractio...

When Layers of Abstraction Don't Get Along: The Difficulty of Fixing Cache Side-Channel Vulnerabilities.

Why networked software should expire.

If your business is still writing letters in WordStar and printing them out on an original Apple Laserwriter, good for you. If you're writing your own LISP in vi on a PDP 10, best of luck. However, if you're still using IE6, that's just rude.

Networked software (by which I just mean programs that talk to other programs) on public networks have a different cost model than the first two examples, but our mental models haven't caught up with that fact yet. We're stuck with the idea that what software you run is your own business and that's preventing needed changes. Here's one example:

ECN (Explicit Congestion Notification) is a modification to TCP and IP which allows routers to indicate congestion by altering packets as they pass though. Early routers dropped packets only when their buffers overflowed and this was taken as an indication of congestion. It was soon noticed that a more probabilistic method of indicating congestion performed better. So routers starting using RED (random early drop) where, approximately, if a buffer is 50% full, a packet has a 50% chance of getting dropped. This gives an indication of congestion sooner and prevented cases where TCP timeouts for many different hosts would start to synchronise and resonate.

To indication congestion, RED drops a packet that has already traversed part of the network; throwing away information. So ECN was developed to indicate congestion without dropping the packets. Network simulations and small scale testing showed a small, but significant benefit from it.

But when ECN was enabled for vger.kernel.org, the mailing list server which handles Linux kernel mailing lists, many people suddenly noticed that their mails weren't getting though. It turned out that many buggy routers and firewalls simply dropped all packets which were using ECN. This was clearly against the specifications and, in terms of code, an easy fix.

ECN wasn't enabled by default in order to give time for the routers to get fixed. In a year or so, it was hoped, it could start to be used and the Internet could start to benefit.

That was over eight years ago now. ECN is still not enabled by default in any major OS. The latest numbers I've seen (which I collected) suggest that 0.5% of destinations will still stop working if you enable ECN and the number of hosts supporting ECN appears to be dropping.

The world has payed a price for not having ECN for the past eight years. Not a lot, but downloads have been a little slower and maybe more infrastructure has been build than was really needed. But who actually paid? Every user of the Internet did, a little bit. But that cost was imposed by router manufactures who didn't test their products and network operators who didn't install updates. Those people saved money by doing less and everyone else paid the price.

These problems are multiplying with the increasing amount of network middleware (routers, firewalls etc) getting deployed; often in homes and owned by people who don't know of care about them.

Recently, Linux 2.6.27 was released and broke Internet access for, probably, thousands of people. Ubuntu Intrepid released with it and had to disable TCP timestamps as a work around while the issue was fixed.

But the issue wasn't a bug in 2.6.27. It was a bug in many home routers (WiFi access points and the like) which was triggered by a perfectly innocent change in Linux that caused the order of TCP options to change. (I felt specifically aggrieved about this because I made that change.) The order was soon changed back and everything started working again.

But, for the future, this now means that the order cannot change. It's not written down anywhere, it's a rule written in bugs. This imposes costs on anyone who might write new TCP stacks in the future, by requiring increased testing and reduced sales as some customers find that it wont work with their routers. These are all costs created by router manufactures and paid by others.

Economists calls these sorts of costs externalities and they are seen as a failure which needs to be addressed. Often, in other areas, they are addressed by regulation or privitisation. Neither of those options appeal in this case.

An uncontroversial suggestion that I'm going to make is that we require better test suites. As a router manufacturer, testing involves checking that your equipment works with a couple of flavors of Windows and, if we're lucky, Linux and some BSDs too. This is much too small of a testing surface. There needs to be an open source test suite designed to test every corner of the RFCs. The NFS connectathons are similar in spirit and probably saved millions of man-hours of debugging over their lifetimes. Likewise, the ACID tests for web browsers focused attention on areas where they were poorly implementing the standards.

And, although my two examples above are both IP/TCP related, I don't want to suggest that the problem stops there. Every common RFC should have such a test suite. HTTP may be a simple protocol but I'll bet that most implementations can't cope with continued header lines. It's those corners which a test suite should address.

Testing should help, but I don't think that it'll be enough. Problems will slip through. Testing against specifications will also never catch problems with the specification itself.

DNS requests can carry multiple questions. There's a big counter in the packet to say how many questions you are asking. However, the reply format can only hold one response code. Thus, I don't know of any DNS server which handles multiple questions (most consider the request to be invalid).

The ability to ask multiple questions would be very helpful. Just look at the number of places which suggest that you turn off IPv6 to make your networking faster. That's because software will otherwise ask a single IPv6 question of DNS, wait for the reply and then ask the IPv4 question. This delay, caused by not being able to request both results in a single request, is causing people to report slowdowns and disable IPv6.

We need to fix DNS, but we never can because one cannot afford break the world. We can't even start a backwards compatible transition because of broken implementations.

That's why networked software should have an expiry date. After the expiry date, the code should make it very clear that it's time to upgrade. For a router, print a big banner when an administrator connects. Flash all the error lights. For software, pop up a dialog every time you start. For home routers, beep and flash a big indicator.

We don't need everyone to update and, as manufacturers fold, maybe there won't be any firmware updates or software upgrades. Almost certainly the device shouldn't stop working. But we need to make more of an effort to recognise that large populations of old code hold everyone else back.

If we can know that nearly all the old code is going to be gone by some date, maybe we can make progress.

(Thanks to Evan for first putting this idea in my mind.)

rwb0fuz1024 included in eBATS

rwb0fuz1024 (pronounced 'robo-fuzz') has been included in the eBATS benchmarking suite. Not all of the test systems have been run with it yet, but here's one which has. It's the fastest verification by fa r .

(Full results here)

Sandboxing on Linux

This blog post has been brought about because of the issues of sandboxing Chromium on Linux (no, it's not ready and wont be for months).

Chromium uses a multiprocess model where each tab (roughly) is a separate process which performs IPCs to a UI process. This means that we can do parallel rendering and withstand crashes in the renderer. It also means that we should be able to sandbox the renderers.

Since the renderer are parsing HTML, CSS, Javascript, running plugins etc, sandboxing them would be very desirable. There's a lot of scope for buffer overflows and other issues in a code base that large and a good sandbox would dramatically reduce the scope of any exploits.

Traditional sandboxes: chroot, resource limits

People have been using chroot jails for many years. A chroot call changes the root of the filesystem for the current process. Once that has happened the process cannot interact with any of the filesystem outside the jail. As long as the process cannot gain root access, it's a good security measure.

Resource limits prevent denial of service attacks by, say, trying to use up all the memory on the system. See the getrlimit manpage for details.

These two mechanisms are supported by most UNIX like systems. However, there are some limitations:

Network access, for one, is not mediated by the filesystem on these platforms, so a compromised process could spew spam or launch attacks on an internal network. Also, the chroot call requires root access. Traditionally this has been done with a small SUID helper binary, but then root access is needed to install etc.

ptrace jails

The ptrace call is used by the strace utility which shows a trace of all the system calls that a child makes. It can also be used to mediate those system calls.

It works like this: the untrusted child is traced by a trusted parent and the kernel arranges that all system calls that the child makes cause a SIGTRAP, stopping the child. The parent can then read the registers and memory of the child and decide if the system call is allowed, permitting it or simulating an error if not.

The first issue is that some system calls take pointers to userspace memory which needs to be validated. Take open, which passes a pointer to the filename to be opened. If the parent wishes to validate the filename it has to read the child's memory and check that it's within limits. That's perfectly doable with ptrace.

The issue comes when there are multiple threads in the untrusted address space. In between the parent validating the filename and the kernel reading it, another thread can change its contents. In the case of open that means that the validator in the parent see one (safe) filename but the kernel actually acts on another. Because of this, either multithreaded children need to be prohibited, or the validator must forbid all system calls which take a pointer to a buffer which needs to be validated.

When calls like open have been prohibited, there's another trick which can be used to securely replace it:

UNIX domain sockets are able to transmit file descriptors between processes. Not just the integer value, but a reference to the actual descriptor (which will almost certainly have a different integer value in the other process). For details see the unix and cmsg manpages.

With this ability an untrusted child can securely open a file by making a request, over a UNIX domain socket to a trusted broker. The broker can validate the filename requested in safety: because it's in another address space the filename is safe between validation and use by the kernel. The broker can then return the file descriptor over the socket to the untrusted child.

The major problem with ptrace jails is that they have a high cost at every system call. On my 2.33GHz Core2 a simple getpid call takes 128ns. When a process is ptraced, that rises to 13,800ns (a factor of 100x slower). Additionally, Chromium on Linux is a 32-bit process because of our JIT, so getting the current time is a system call too.

Seccomp

Seccomp has a rather messy past (see the linked Wikipedia page for details). It's a Linux specific mode which a process can request whereby only read, write, exit and sigreturn system calls are allowed. Making any system call not on the permitted list results in immediate termination of the process.

This is a very tight jail, designed for pure computation and is perfect for that. It's enabled by default in kernel builds (although some distributions disable it I believe). It used to be enabled via a file in /proc but, in order to save space, it's now a prctl.

This issue is that the jail is too tight. It's great that read and write calls are enabled without overhead because that's much of what one of our rendering processes will use, but many other system calls would be nice (brk and mmap for memory allocation, gettimeofday etc). We would have to use the broker model for all of them.

For some calls the broker model has to be updated. Allocating memory to an address space isn't something which can be performed outside that address space so, in this case, the broker for these calls has to be in the same address space. This means that there's an untrusted thread running under seccomp and a trusted thread, not running seccomped, in the same process. The untrusted thread can request more memory by making an request over a pipe to the trusted thread. The trusted thread can then perform the allocation in the same address space.

This presents some issues when writing the trusted code. Because untrusted code has access to the memory the only thing the trusted thread can trust are its registers. That means no stack nor heap usage. Basically the trusted code has to be written in assembly and has to be pretty simple. That's not a huge problem for us however.

But we will be making lots of these other system calls, not just the memory allocation ones, but time calls, poll etc. All have to use a broker model.

To recap, a basic system call (getpid) on my 2.33GHz Core2 takes about 128ns. Performing the same operation over a pipe to another thread takes 7,775ns and to another process takes 8,423ns, roughly a factor of 60x slower.

Again, this is a very painful slowdown given the volume of such calls that we expect to make.

SELinux

Fedora, rightfully, makes a lot of noise about the fact that they have SELinux. It's a huge beast and Fedora's work has mostly been a process of taming the complexity and dealing with the fact that very little is written with SELinux in mind.

I don't have Fedora installed anywhere, but this may be a very nice solution to our issues. However, I suspect that root access will be required, again, to configure it. I speak mostly from a position of ignorance here, however. I should install Fedora at some point and have a play.

The Other Man's Grass

Recent releases of OSX have a system call actually called sandbox_init. It's a little half-baked at the moment, but shows great promise.

It's a feature from TrustedBSD and, in the limit, allows for a Scheme like language to give a detailed specification of the shape of the sandbox which is compiled to bytecode and loaded into the kernel. You can see some examples of the profile language in the slides for this USENIX talk. But, for the moment, I believe that just a few preset profiles are provided (see the manpage).

Rolling one's own

SELinux is implemented atop of LSM which is a general framework for hooking security decisions in the Linux kernel. It's conceivable that one could write a sandboxing module using these hooks.

It would require root access to install, but then so do many of the other solutions. It would probably play badly with other LSM users too, but Fedora is the only major distribution to be using them as far as I know. However, it would also be a large distraction.

Summary of data
PlatformSimple system call... via a broker thread... via a broker process... when ptraced
32-bit 136.9ns 8161.4ns 8327.3ns 14087.0ns
64-bit 128.7ns 7775.0ns 8423.3ns 13779.9ns

Obfuscated TCP

It's now in its 3rd iteration, Obfuscated TCP now has an updated site, mostly working code etc. I need people to go to the site, look at the docs, watch the video, build the code, try stuff out etc. Tell me what works and what doesn't. Email address is at the top of the page. Thanks to all who do, and remember that you don't just have to email if you have problems, positive reports are good too!

Google datacenters

There are different levels of secrets at Google. Almost everything unreleased is “confidential” - which means that we don't talk about it to the outside world. Then there is the “top secret” stuff - stuff that you don't even talk about to other Googlers.

Now, top secret stuff is rare because it's a little poisonous. An environment where lots of things are secret between coworkers isn't a pleasant one. How we cool our data centers was one of those items and I was sworn to secrecy when I was lucky enough to be given a guided tour of our Oregon operations.

But, for whatever reasons, this information is now public! Seriously, this is some of the coolest (no pun intended) stuff that Google does: go read about evaporative cooling.

A Rabin-Williams signature scheme: rwb0fuz1024

I wrote a Rabin-Williams signature scheme [source]:

Crit-bit trees

I wrote up djb's implementation of crit-bit trees for strings here [pdf]. Crit-bit trees have several nice properties:

Several groups of Linux k...

Several groups of Linux kernel papers have been published recently. Here's my pick of them:

First we have the Proceedings of the 2008 Linux Symposium (these are in some order of order, favourite first):

Next there's the ACM SIGOPS Operating Systems Review. These papers are about much more experimental developments in the kernel and are thus more fun, even if they are less likely to see the light of day:

I've just releasedtwo new...

I've just released two new curve25519 implementations: one in C and one in x86-64 assembly. The latter is 10% faster than djb's implementation.

curve25519 is an elliptic curve, developed by Dan Bernstein, for fast Diffie-Hellman key agreement. DJB's original implementation was written in a language of his own devising called qhasm. The original qhasm source isn't available, only the x86 32-bit assembly output.

Since many x86 systems are now 64-bit, and portability is important, this project provides alternative implementations for other platforms.

Implementation Platform Author 32-bit speed 64-bit speed
curve25519 x86 32-bit djb 265µs N/A
curve25519-donna-x86-64 x86 64-bit agl N/A 240µs
curve25591-donna Portable C agl 2179µs 628µs

(All tests run on a 2.33GHz Intel Core2)

Google has, at last, open...

Google has, at last, open sourced Protocol buffers. My, very minor contribution to this is that I wrote the basis for the encoding documentation.

Protocol buffers pretty much hit the sweet spot of complexity and capability. (See XML and ASN.1 for examples of attempts which missed.) I have the beginnings of a protocol buffer compiler for Haskell that I wrote for internal apps. Now that the C/Java/Python versions are out, I should probably clean that up and put it on Hackage. But every coder should consider protocol buffers for their serialisation needs from now on.

The Black Swan

Firstly, if you're wondering what happened to all the ObsTCP stuff, it didn't disappear, it just moved to a different blog. Things are still moving as fast as I can push them.

(ISBN: 1400063515)

This book has some good, if unoriginal, points about the stupidity of much of the modeling done in today's world, esp the world of finance. Sadly, these are hidden in many pages of self-centered rambling and discourse on adventitious topics. If you're thinking of buying this book, get The (Mis)behaviour of Markets by Mandelbrot instead; you'll thank me.

I've added a bunch of Obs...

I've added a bunch of Obsfucated TCP stuff to the obstcp project page code.google.com include kernel patches, userland tools, specs and friendly introductions.

Also, I posted it to Reddit. If it doesn't get downvoted into /dev/null in 60 seconds, the comments will probably end up there.

OpenID - not actually spawn of Satan

A blog post aggregating complaints about OpenID has been popping up in different places this morning. If you've read it, you might want a little perspective. I'm not going to deal with each point in turn because there's so many, mostly repeating each other.

Phishing

At login time, the site that you're logging into can end up redirecting you to your OpenID provider. Your provider then tells you to go to their site and enter your login information, then click a button to try again. They don't provide a "link" to their site and they don't ask for your password.

Some early providers might not have followed these basic steps, but all the reasonable ones do.

Yes, it's still possible for users to be confused but, by habit they'll be used to doing to right thing.

XSS and CSRF

XSS problems on the providers site are a big deal. This criticism is reasonable.

CSRF may be a bigger deal because you are more likely to be 'logged in' to the target. However, most users already keep persistent cookies to save logging into these sites. The additional attack surface here is dubious; CSRF issues are a problem with or without OpenID.

DNS poisoning

If your OpenID starts with https://, you should be protected from DNS poisoning attacks and the like by the usual TLS PKI. This isn't perfect, but it's pretty good.

However, the OpenID spec says that plain domain names are normalised by prepending http://. This is a technical problem with the spec and should be fixed. Until then, this is a reasonable criticism but not a fundamental issue.

Privacy

The OpenID provider has a lot of information about your activities. This is little different than, say, your email account and many people are happy with Gmail. Likewise, password recovery on most of the sites which could use OpenID is based on email access, so most people already have a single password that suffices for entry to many sites.

If you don't like the idea of Gmail you can run your own email server. Likewise, you can run your own OpenID provider.

Using the same OpenID on many sites does allow them to link your activities. So does giving these sites your email address for password recovery. So does using the same IP (although to a lesser extent).

Some providers will let you have many OpenIDs linked to the same account for this reason. Joe user probably won't use that feature and probably gives the same email address to all those sites already and so looses nothing.

Trust problems

OpenID is not a trust system. Trust systems may be built on top of identity systems. Likewise, apples are not oranges and complaints about their lack of tangyness are moot.

Usability / Adoption

Somewhat valid points here. It's a big job to get widespread adoption and, at the moment, it's a pretty small crowd that uses OpenID. However, OpenID doesn't need a flag day; it can have incremental deployment.

Availability

Valid points. If your provider goes down you're going to have a bad day.

Conclusion

I don't believe that OpenID should be used to login to your bank account. However, for the myriad of sites that I login to (Google Reader, reddit, ...) it would be nice to just be able to type my OpenID in. It's decently suited to that because I'm fed up with all these accounts.

I'm now running a Ubuntu ...

I'm now running a Ubuntu based laptop with a somewhat functions Obsfucated TCP patch in its kernel. (If you have a Neo like view of the Internets you'll be able to see it by the funny options in the SYN packets.)

Hopefully soon I'll be able to post a first draft patch for other people to try. In the mean time, I wrote the start of the mounds of documentation I suspect it'll need: a very non-technical introduction.

I've updated the patches ...

I've updated the patches linked to in the last post with today's work. Both sides now end up with the same shared key (and not just because they got the same private key from lack of entropy like before). That took some fun tracking down of bugs.

Also, packets are now HMAC-MD5'ed with the shared key, and invalid packets are dropped. That also took far longer than expected. I ended up using the MD5 implementation from the CIFS filesystem because the kernel's crypto library is just plain terrible. It's also totally undocumented but, from what I can see, you can't lookup an algorithm without taking a semaphore, and that requires that you be able to sleep. I almost think I must be missing something because that's dumber than the bastard offspring of Randy Hickey and Jade Goodie.

But there we go. Encryption (with Salsa20) to come next Wednesday.

First Obsfucated TCP patches

After a day of kernel hacking, I have a few patches which, together, make a start towards implementing ObsTCP.

At the moment, it will advertise ObsTCP on all connections and, if you have two kernels which support it, you'll get a shared key setup. At the moment, the private key is generated at boot time and since the host doesn't have any entropy then, it's always the same. So I'll have to do something special there. Also, I've a problem where the ACK with the connecting host's public key can get lost. Since ACKs aren't ACKed, this can be a real pain. I think I need to include it in every transmitted packet until (yet another) option signifies that it's been received.

After the last post expla...

After the last post explained why small curves aren't good enough for obsfucated TCP, I decided that, since I'm going to have to do some damage to the TCP header to get a bigger public key in there anyway, I might as well go the whole way and use curve25519, by djb. Now, djb has forgotton more about elliptic curves than I'll ever know and I feel much happier using a curve that's been designed by him. As you can probably guess from the name, it's a curve over 2255-19 - a prime. So the public keys are 32 bytes long.

In order to get that much public key material into a TCP header, here's my proposed hack: Jumbo TCP options.

djb's sample implementation of curve25519 is written in a special assembly language called qhasm. Sadly, it's so alpha that he's not actually released it. So the sample implementation is for ia32 only, uses the floating point registers and has 5100 lines of uncommented assembly. It is, however, freaking quick.

However, since I have kernel-space in mind for this I've written a C implementation. It's about 1/3 the speed (and I've not really tried to optimise it yet), doesn't use any floating point (since kernel-space doesn't have easy access to the fp registers in Linux) and fuzz testing seems to indicate that it's correct. (At least, it's giving the same answers as djb's code.)

Next step: hacking up the kernel. (And I thought the elliptic curve maths was hard enough.)

Elliptic curves don't work either

(For context, see my previous post on OTCP)

In any Diffie-Hellman exchange based on elliptic curves, we have Q=aP where P and Q are points on an elliptic curve. The operation of multiplying a point and a scalar is well defined, but unimportant here. The problem facing the attacker is, given Q and P, find a. If they can do that, we're sunk.

If you could find a pair of numbers such that: cP + dQ = eP + fQ then you're done because: (c-e)P = (f-d)Q = (f-d)aP, then a = (c-e)/(f-d) mod n, where n is the size of the field underlying the curve.

Finding such a point by picking random examples is never going to work because of the storage requirements. However, if you define a step function which takes a pair (c, d) and produces a new pair (c', d') you have defined a cycle through the search space. (It must be a cycle because the search space is finite. At some point you must hit a previous state and loop forever.) Now you can use Floyd's cycle finding algorithm to find a collision with constant space. This is an √n algorithm for breaking this problem and is well known as Pollard's rho method.

Now, if you have many of these problems you get a big speed up by using some storage. Assume that you do the legwork to solve an instance of the problem and that you record some fraction of the points that you evaluated. (How you choose the points isn't important so long as it's a function of the point; say pick all points where the first m bits are zero.)

Now, future attempts to break the problem can collide with one of the previous points. If you find cP + dQ = eP + fR (note that P is a constant of the elliptic curve system) and also that R = bP (because we solved this instance previously) then cP + dQ = cP + adP = (e+fb)P and so (c-(e+fb)) / d = a (and we know all the values on the left-hand side).

Now, 2112 (14 bytes) is about as big an elliptic curve point as we can fit in a TCP header. The maximum options payload is 40 bytes, of which 20 are already taken up in modern TCP stacks. We need 2 bytes of fluff per option and, unless we want this to be the last TCP header ever, we need to leave at least 4 bytes. That's where the 14 byte limit comes from.

We give the attacker 250 bytes of space. I believe that each point will take 3*14 bytes of space for the (c,d,Y) triple, where Y = cP+dQ. Thus they can store 244 distinguished points. Thus one in 256-44=12 points are distinguished. Additionally, generating those 244 points isn't that hard, computationally. This suggests that an attacker can find a collision in only 212 iterations., or about 213 field multiplications.

So, again, a reasonable attacker can break our crypto in real time.

This scheme becomes much harder to sell if we have to do evil things to the TCP header in order to make it work.

If you've been wondering ...

If you've been wondering what I'm up to at work, we now have a public blog for the RechargeIt project.

How sad: from reading the...

How sad: from reading the sleepcat documentation on network partitions, it's clear that BDB uses a broken replication system (i.e. not Paxos). That's a shame because I was hoping to use it.

Yahoo now has OpenID for ...

Yahoo now has OpenID for all its accounts, which is great. Wonderful in fact. OpenID is a good thing for many authentication needs on the Internet and will make the world a better place.

However,...