ImperialViolet

False Start's Failure (11 Apr 2012)

Eighteen months ago(ish), Chrome started using False Start. False Start reduces the average time for an SSL handshake by 30%.

Since the biggest problem with transport security is that most sites don't use it, anything that reduces the latency impact of HTTPS is important. Making things faster doesn't just make them faster, it also makes them cheaper and more prevalent. When HTTPS is faster, it'll be used in more places than it would otherwise be.

But, sadly, False Start will be disabled in Chrome 20, except for sites doing NPN. NPN is a TLS extension that we use to negotiate SPDY, although a server doesn't have to offer SPDY in order to use it: you can advertise only http/1.1 if you wish.
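To make the new rule concrete, here is a minimal sketch of the decision, not Chrome's actual code. The function name and the convention of passing `None` when the server sent no NPN extension are assumptions made for illustration.

```python
def false_start_allowed(server_npn_protocols):
    """Decide whether to attempt False Start under the Chrome 20 rule.

    server_npn_protocols is the list of protocols the server advertised
    via NPN, or None if the server did not send the extension at all
    (a hypothetical representation, chosen for this sketch).
    """
    # Only the presence of NPN matters, not whether SPDY is on offer.
    return server_npn_protocols is not None
```

Under this sketch a server advertising only `["http/1.1"]` still gets False Start, while a server that sends no NPN extension does not.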

False Start was known to cause problems with a very small number of servers and the initial announcement outlined the uncommon scheme that we used to deploy it: we scanned the public Internet and built up a list of problematic sites. That list was built into Chrome and we didn't use False Start for connections to those sites. Over time the list was randomly eroded away and I'd try to address any issues that came up. (Preemptively so in the case of large sites.)
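The blacklist check can be sketched roughly as follows. This is an illustration of the idea only: the host names, the set representation and the function name are all hypothetical, and the real list in Chrome need not have been stored or matched this way.

```python
# Hypothetical entries; the real list came from scanning the public Internet.
FALSE_START_BLACKLIST = frozenset({
    "intolerant.example.com",
    "old-terminator.example.net",
})

def may_false_start(hostname, blacklist=FALSE_START_BLACKLIST):
    """Return True unless the host is known to be False Start intolerant."""
    # Normalise case and any trailing dot before the lookup.
    normalised = hostname.lower().rstrip(".")
    return normalised not in blacklist
```

Eroding the list over time then amounts to shipping a build with some entries removed and watching for bug reports.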

It did work to some extent. Many sites that had problems were fixed and it's a deployment scheme that is worth considering in the future. But it didn't ultimately work well enough for False Start.

Initially we believed that False Start issues were deterministic so long as the TLS Finished and application data records were sent in the same TCP packet. We changed Chrome to do this in the hopes of making False Start issues deterministic. However, we later discovered some HTTPS servers that were still non-deterministically False Start intolerant. I hypothesise that the servers run two threads per connection: one for reading and one for writing. Although the TCP packet was received atomically, thread scheduling could mean that the read thread may or may not be scheduled before the write thread had updated the connection state in response to the Finished.
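The hypothesised race can be modelled without real threads by replaying both possible scheduling orders. This is a toy model of my guess about these servers, not code from any actual SSL terminator; the step names and the "accepted"/"rejected" outcomes are assumptions.

```python
from itertools import permutations

# Hypothetical step names for the two per-connection threads.
WRITE_STEP = "write thread updates state after Finished"
READ_STEP = "read thread receives application data"

def run(order):
    """Replay one scheduling order and report how the early data is handled."""
    handshake_complete = False
    outcome = None
    for step in order:
        if step == WRITE_STEP:
            handshake_complete = True
        elif step == READ_STEP:
            # Early application data is only accepted once the state is updated.
            outcome = "accepted" if handshake_complete else "rejected"
    return outcome

# Both orders are possible even though the bytes arrived in one TCP packet.
outcomes = {order: run(order) for order in permutations((WRITE_STEP, READ_STEP))}
```

The same input yields "accepted" under one ordering and "rejected" under the other, which is the kind of behaviour that would make a server look intermittently intolerant.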

This non-determinism made False Start intolerance difficult to diagnose and reduced our confidence in the blacklist.

The “servers” with problems were nearly always SSL terminators. These hardware devices terminate SSL connections and proxy unencrypted data to backend HTTP servers. I believe that False Start intolerance is very simple to fix in the code and one vendor suggested that was the case. Nonetheless, of the vendors who did issue an update, most failed to communicate that fact to their customers. (A pattern that has repeated with the BEAST fix.)

One, fairly major, SSL terminator vendor refused to update to fix their False Start intolerance despite problems that their customers were having. I don't believe that this was done in bad faith, but rather a case of something much more mundane along the lines of “the SSL guy left and nobody touches that code any more”. However, it did mean that there was no good answer for their customers who were experiencing problems.

Lastly, it was becoming increasingly clear that we had a bigger problem internationally. Foreign admins have problems finding information on the subject (which is mostly in English) and foreign users have problems reporting bugs because we can't read them. We do have excellent agents in some countries who liaise locally but it was still a big issue, and we don't cover every country with them. I also suspect that problematic SSL terminators are substantially more common in some countries and that our experience with the US and Europe caused us to underestimate the problem.

In aggregate this led us to decide that False Start was causing more problems than it was worth. We will now limit it to sites that support the NPN extension. This unfortunately means that it'll be an arcane, unused optimisation for the most part: at least until SPDY takes over the world.