CFI directives in assembly files

(This post uses x86-64 for illustration throughout. The fundamentals are similar for other platforms but will need some translation that I don't cover here.)

Despite compilers getting better over time, it's still the case that hand-written assembly can be worthwhile for certain hot-spots. Sometimes there are special CPU instructions for the thing that you're trying to do, sometimes you need detailed control of the resulting code and, to some extent, it remains possible for some people to out-optimise a compiler.

But hand-written assembly doesn't automatically get some of the things that the compiler generates for normal code, such as debugging information. Perhaps your assembly code never crashes (although any function that takes a pointer can suffer from bugs in other code) but you probably still care about accurate profiling information. In order for debuggers to walk up the stack in a core file, or for profilers to correctly account for CPU time, they need be able to unwind call frames.

Unwinding used to be easy as every function would have a standard prologue:

push rbp
mov rbp, rsp

This would make the stack look like this (remember that stacks grow downwards in memory):

Caller's stackRSP value before CALLRSP at function entryCaller's RBPCallee's local variablesSaved RIP(pushed by CALL)RBP always points here

So, upon entry to a function, the CALL instruction that jumped to the function in question will have pushed the previous program counter (from the RIP register) onto the stack. Then the function prologue saves the current value of RBP on the stack and copies the current value of the stack pointer into RBP. From this point until the function is complete, RBP won't be touched.

This makes stack unwinding easy because RBP always points to the call frame for the current function. That gets you the saved address of the parent call and the saved value of its RBP and so on.

The problems with this scheme are that a) the function prologue can be excessive for small functions and b) we would like to be able to use RBP as a general purpose register to avoid spills. Which is why the GCC documentation says that “-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging”. This means that you can't depend on being able to unwind stacks like this. A process can be comprised of various shared libraries, any of which might be compiled with optimisations.

To be able to unwind the stack without depending on this convention, additional debugging tables are needed. The compiler will generate these automatically (when asked) for code that it generates, but it's something that we need to worry about when writing assembly functions ourselves if we want profilers and debuggers to work.

The reference for the assembly directives that we'll need is here, but they are very lightly documented. You can understand more by reading the DWARF spec, which documents the data that is being generated. Specifically see sections 6.4 and D.6. But I'll try to tie the two together in this post.

The tables that we need the assembler to emit for us are called Call Frame Information (CFI). (Not to be confused with Control Flow Integrity, which is very different.) Based on that name, all the assembler directives begin with .cfi_.

Next we need to define the Canonical Frame Address (CFA). This is the value of the stack pointer just before the CALL instruction in the parent function. In the diagram above, it's the value indicated by “RSP value before CALL”. Our first task will be to define data that allows the CFA to be calculated for any given instruction.

The CFI tables allow the CFA to be expressed as a register value plus an offset. For example, immediately upon function entry the CFA is RSP + 8. (The eight byte offset is because the CALL instruction will have pushed the previous RIP on the stack.)

As the function executes, however, the expression will probably change. If nothing else, after pushing a value onto the stack we would need to increase the offset.

So one design for the CFI table would be to store a (register, offset) pair for every instruction. Conceptually that's what we do but, to save space, only changes from instruction to instruction are stored.

It's time for an example, so here's a trivial assembly function that includes CFI directives and a running commentary.

  .globl  square
  .type   square,@function
  .hidden square

This is a standard preamble for a function that's unrelated to CFI. Your assembly code should already be full of this.


Our first CFI directive. This is needed at the start of every annotated function. It causes a new CFI table for this function to be initialised.

  .cfi_def_cfa rsp, 8

This is defining the CFA expression as a register plus offset. One of the things that you'll see compilers do is express the registers as numbers rather than names. But, at least with GAS, you can write names. (I've included a table of DWARF register numbers and names below in case you need it.)

Getting back to the directive, this is just specifying what I discussed above: on entry to a function, the CFA is at RSP + 8.

  push    rbp
  .cfi_def_cfa rsp, 16

After pushing something to the stack, the value of RSP will have changed so we need to update the CFA expression. It's now RSP + 16, to account for the eight bytes we pushed.

  mov     rbp, rsp
  .cfi_def_cfa rbp, 16

This function happens to have a standard prologue, so we'll save the frame pointer in RBP, following the old convention. Thus, for the rest of the function we can define the CFA as RBP + 16 and manipulate the stack without having to worry about it again.

  mov     DWORD PTR [rbp-4], edi
  mov     eax, DWORD PTR [rbp-4]
  imul    eax, DWORD PTR [rbp-4]
  pop     rbp
  .cfi_def_cfa rsp, 8

We're getting ready to return from this function and, after restoring RBP from the stack, the old CFA expression is invalid because the value of RBP has changed. So we define it as RSP + 8 again.


At the end of the function we need to trigger the CFI table to be emitted. (It's an error if a CFI table is left open at the end of the file.)

The CFI tables for an object file can be dumped with objdump -W and, if you do that for the example above, you'll see two tables: something called a CIE and something called an FDE.

The CIE (Common Information Entry) table contains information common to all functions and it's worth taking a look at it:

  Version:         1
  Augmentation:    "zR"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address column: 16
  Augmentation data:     1b

  DW_CFA_def_cfa: r7 (rsp) ofs 8
  DW_CFA_offset: r16 (rip) at cfa-8

You can ignore everything until the DW_CFA_… lines at the end. They define CFI directives that are common to all functions (that reference this CIE). The first is saying that the CFA is at RSP + 8, which is what we had already defined at function entry. This means that you don't need a CFI directive at the beginning of the function. Basically RSP + 8 is already the default.

The second directive is something that we'll get to when we discuss saving registers.

If we look at the FDE (Frame Description Entry) for the example function that we defined, we see that it reflects the CFI directives from the assembly:

… FDE cie=…
  DW_CFA_advance_loc: 1 to 0000000000000001
  DW_CFA_def_cfa: r7 (rsp) ofs 16
  DW_CFA_advance_loc: 3 to 0000000000000004
  DW_CFA_def_cfa: r6 (rbp) ofs 16
  DW_CFA_advance_loc: 11 to 000000000000000f
  DW_CFA_def_cfa: r7 (rsp) ofs 8

The FDE describes the range of instructions that it's valid for and is a series of operations to either update the CFA expression, or to skip over the next n bytes of instructions. Fairly obvious.

Optimisations for CFA directives

There are some shortcuts when writing CFA directives:

Firstly, you can update just the offset, or just the register, with cfi_def_cfa_offset and cfi_def_cfa_register respectively. This not only saves typing in the source file, it saves bytes in the table too.

Secondly, you can update the offset with a relative value using cfi_adjust_cfa_offset. This is useful when pushing lots of values to the stack as the offset will increase by eight each time.

Here's the example from above, but using these directives and omitting the first directive that we don't need because of the CIE:

  .globl  square
  .type   square,@function
  .hidden square
  push    rbp
  .cfi_adjust_cfa_offset 8
  mov     rbp, rsp
  .cfi_def_cfa_register rbp
  mov     DWORD PTR [rbp-4], edi
  mov     eax, DWORD PTR [rbp-4]
  imul    eax, DWORD PTR [rbp-4]
  pop     rbp
  .cfi_def_cfa rsp, 8

Saving registers

Consider a profiler that is unwinding the stack after a profiling signal. It calculates the CFA of the active function and, from that, finds the parent function. Now it needs to calculate the parent function's CFA and, from the CFI tables, discovers that it's related to RBX. Since RBX is a callee-saved register, that's reasonable, but the active function might have stomped RBX. So, in order for the unwinding to proceed it needs a way to find where the active function saved the old value of RBX. So there are more CFI directives that let you document where registers have been saved.

Registers can either be saved at an offset from the CFA (i.e. on the stack), or in another register. Most of the time they'll be saved on the stack though because, if you had a caller-saved register to spare, you would be using it first.

To indicate that a register is saved on the stack, use cfi_offset. In the same example as above (see the stack diagram at the top) the caller's RBP is saved at CFA - 16 bytes. So, with saved registers annotated too, it would start like this:

  push    rbp
  .cfi_adjust_cfa_offset 8
  .cfi_offset rbp, -16

If you need to save a register in another register for some reason, see the documentation for cfi_register.

If you get all of that correct then your debugger should be able to unwind crashes correctly, and your profiler should be able to avoid recording lots of detached functions. However, I'm afraid that I don't know of a better way to test this than to zero RBP, add a crash in the assembly code, and check whether GBD can go up correctly.

(None of this works for Windows. But Per Vognsen, via Twitter, notes that there are similar directives in MASM.)

CFI register numbers

In case you need to use or read the raw register numbers, here they are for a few architectures:

(may be EBP on MacOS) (may be ESP on MacOS)
Register numberx86-64x86ARM

(x86 values taken from page 25 of this doc. x86-64 values from page 57 of this doc. ARM taken from page 7 of this doc.)

RISC-V assembly

RISC-V is a new, open instruction set. Fabrice Bellard wrote a Javascript emulator for it that boots Linux here (more info). I happen to have just gotten a physical chip that implements it too (one of these) and what's cool is that you can get the source code to the chip on GitHub.

The full, user-level instruction set is documented but there's a lot of information in there. I wanted a brief summary of things that I could keep in mind when reading disassembly. This blog post is just a dump of my notes; it's probably not very useful for most people! It also leaves out a lot of things that come up less frequently. There's also an unofficial reference card which is good, but still aimed a little low for me.

RISC-V is little-endian and comes in 32 and 64 bit flavours. In keeping with the RISC-V documents, the flavour (either 32 or 64) is called XLEN below in the few places where it matters.

For both, int is 32 bits. Pointers and long are the native register size. Signed values are always sign extended in a larger register and unsigned 8- and 16-bit values are zero extended. But unsigned 32-bit values are sign-extended. Everything has natural alignment in structs and the like, but alignment isn't required by the hardware.

The stack grows downwards and is 16-byte aligned upon entry to a function.

Instructions are all 32 bits and 32-bit aligned. There is a compressed-instruction extension that adds 16-bit instructions and changes the required alignment of all instructions to just 16 bits. Unlike Thumb on ARM, RISC-V compressed instructions are just a short-hand for some 32-bit instruction and 16- and 32-bit instructions can be mixed freely.

There are 31 general purpose registers called x1–x31. Register x0 drops all writes and always reads as zero. Each register is given an alias too, which describes their conventional use:

RegisterAliasDescriptionSaved by
x1raReturn addressCaller
x2spStack pointerCallee
x3gpGlobal pointer
x4tpThread pointer
x8s0/fpSaved register / frame pointerCallee
x9s1Saved registerCallee
x10–17a1–a7Arguments and return valuesCaller
x18–27s2–11Saved registersCallee

There are two types of immediates, which will be written as ‘I’ and ‘J’. Type ‘I’ intermediates are 12-bit, signed values and type ‘J’ are 20-bit, signed values. They are always sign-extended to the width of a register before use.


There's no carry flag, instead one has to use a comparison instruction on the result in order to get the carry in a register.

A U suffix indicates an unsigned operation. Note that the multiplication and division instructions are part of the ‘M’ extension, but I expect most chips to have them.

ADD dest,src1,src2dest = src1 + src2
SUB dest,src1,src2dest = src1 - src2
ADDI dest,src1,Idest = src1 + I
MUL dest,src1,src2dest = src1 × src2
MULH[|U|SH] dest,src1,src2dest = (src1 × src2) >> XLENThis returns the high word of the multiplication result for signed×signed, unsigned×unsigned or signed×unsigned, respectively
DIV[U] dest,src1,src2dest = src1/src2There's no trap on divide-by-zero, rather special values are returned. (See documentation.)
REM[U] dest,src1,src2dest = src1%src2


These should all be obvious:

AND dest,src1,src2
OR dest,src1,src2
XOR dest,src1,src2
ANDI dest,src1,I
ORI dest,src1,I
XORI dest,src1,I


Instructions here are Shift [Left|Right] [Logical|Arithmetic].

Note that only the minimal number of bits are read from the shift count. So shifting by the width of the register doesn't zero it, it does nothing.

SLL dest,src1,src2dest = src1 << src2%XLEN
SRL dest,src1,src2dest = src1 >> src2%XLEN
SRA dest,src1,src2dest = src1 >> src2%XLEN
SLLI dest,src1,0‥31/63dest = src1 << I
SRLI dest,src1,0‥31/63dest = src1 >> I
SRAI dest,src1,0‥31/63dest = src1 >> I


These instructions set the destination register to one or zero depending on whether the relation is true or not.

SLT dest,src1,src2dest = src1<src2(signed compare)
SLTU dest,src1,src2dest = src1<src2(unsigned compare)
SLTI dest,src1,Idest = src1<I(signed compare)
SLTIU dest,src1,Idest = src1<I(unsigned compare, but remember that the immediate is sign-extended first)

Control flow

The JAL[R] instructions write the address of the next instruction to the destination register for the subsequent return. This can be x0 if you don't care.

Many instructions here take an immediate that is doubled before use. That's because no instruction (even with compressed instructions) can ever start on an odd address. However, when you write the value in assembly you write the actual value; the assembler will halve it when encoding.

JAL dest,Jdest = pc+2/4; pc+=2*J
JALR dest,src,Idest = pc+2/4; pc=src1+(I&~1)Note that the least-significant bit of the address is ignored
BEQ src1,src2,Iif src1==src2, pc+=2*I
BNE src1,src2,Iif src1!=src2, pc+=2*I
BLT src1,src2,Iif src1<src2, pc+=2*Isigned compare
BLTU src1,src2,Iif src1<src2, pc+=2*Iunsigned compare
BGE src1,src2,Iif src1>=src2, pc+=2*Isigned compare
BGEU src1,src2,Iif src1>=src2, pc+=2*Iunsigned compare


The suffix on these instructions denotes the size of the value read or written: ‘D’ = double-word (64 bits), ‘W’ = word (32 bits), ‘H’ = half-word (16 bits), ´B’ = byte. Reading and writing 64-bit values only works on 64-bit systems, and LWU only exists on 64-bit systems because it's the same as LW otherwise.

Alignment is not required, but might be faster. Also note that there's no consistent direction of data flow in the textual form of the assembly: the register always comes first.

L[D|W|H|B] dest,I(src)dest = *(src + I)result is sign extended
L[D|W|H|B]U dest,I(src)dest = *(src + I)result is zero extended
S[D|W|H|B] src1,I(src2)*(src2 + I) = src1


LUI dest,Jdest = J<<12“Load Upper Immediate”. The result is sign extended on 64-bit.
AUIPC dest,Jdest = pc + J<<12“Add Upper Immediate to PC”. The result of the shift is sign-extended on 64-bit before the addition.

W instructions

These instructions only exist on 64-bit and are copies of the same instruction without the ‘W’ suffix, except that they operate only on the lower 32 bits of the registers and, when writing the result, sign-extend to 64 bits. They are ADDW, SUBW, ADDIW, DIV[U]W, REM[U]W, S[L|R]LW, SRAW, S[L|R]LIW and SRAIW


For clarity, there are also a number of pseudo-instructions defined. These are just syntactic sugar for one of the primitive instructions, above. Here's a handful of the more useful ones:

nopaddi x0,x0,0
mv dest,srcaddi dest,src,0
not dest,srcxor dest,src,-1
seqz dest,srcsltiu dest,src,1
snez dest,srcsltu dest,x0,src
j Jjal x0,J
retjalr x0,x1,0
call offsetauipc x6, offset >> 12; jalr x1,x6,offset & 0xfff
li dest,value(possibly several instructions to load arbitrary value)
la dest,symbolauipc dest, symbol >> 12; addi dest,dest,offset & 0xfff
l[d|w|h|b] dest,symbolauipc dest, symbol >> 12; lx dest,offset & 0xfff(dest)
s[d|w|h|b] src,symbolauipc dest, symbol >> 12; sx dest,offset & 0xfff(dest)

Note that the instructions that expand to two instructions where the first is AUIPC aren't quite as simple as they appear. Since the 12-bit immediate in the second instruction is sign extended, it could end up negative. If that happens, the J immediate for AUIPC needs to be one greater and then you can reach the value by subtracting from there.

CECPQ1 results

In July my colleague, Matt Braithwaite, announced that Chrome and Google would be experimenting with a post-quantum key-agreement primitive in TLS. One should read the original announcement for details, but we had two goals for this experiment:

Firstly we wanted to direct cryptoanalytic attention at the family of Ring Learning-with-Errors (RLWE) problems. The algorithm that we used, NewHope, is part of this family and appeared to be the most promising when we selected one at the end of 2015.

It's very difficult to know whether we had any impact here, but it's good to see the recent publication and withdrawal of a paper describing a quantum attack on a fundamental lattice problem. Although the algorithm contained an error, it still shows that capable people are testing these new foundations.

Our second goal was to measure the feasibility of deploying post-quantum key-agreement in TLS by combining NewHope with an existing key-agreement (X25519). We called the combination CECPQ1.

TLS key agreements have never been so large and we expected a latency impact from the extra network traffic. Also, any incompatibilities with middleboxes can take years to sort out, so it's best to discover them as early as possible.

Here the results are more concrete: we did not find any unexpected impediment to deploying something like NewHope. There were no reported problems caused by enabling it.

Although the median connection latency only increased by a millisecond, the latency for the slowest 5% increased by 20ms and, for the slowest 1%, by 150ms. Since NewHope is computationally inexpensive, we're assuming that this is caused entirely by the increased message sizes. Since connection latencies compound on the web (because subresource discovery is delayed), the data requirement of NewHope is moderately expensive for people on slower connections.

None the less, if the need arose, it would be practical to quickly deploy NewHope in TLS 1.2. (TLS 1.3 makes things a little more complex and we did not test with CECPQ1 with it.)

At this point the experiment is concluded. We do not want to promote CECPQ1 as a de-facto standard and so a future Chrome update will disable CECPQ1 support. It's likely that TLS will want a post-quantum key-agreement in the future but a more multilateral approach is preferable for something intended to be more than an experiment.


Security protocols often assume an accurate, local clock (e.g. TLS, Kerberos, DNSSEC and more). It's a widely accepted assumption when designing protocols but, for a lot of people, it just isn't true. We find good evidence that at least 25% of all certificate errors in Chrome are due to a bad local clock.

Even when the local clock is being synchronised, it's very likely to be using unauthenticated NTP. So if your threat model includes man-in-the-middle attackers then you still can't trust the local clock.

There have been efforts to augment NTP with authentication, but they still assume a world where each client trusts one or more time servers absolutely. In order to explore a solution that allows time servers to be effectively audited by clients, myself and my colleague Matt Braithwaite (with assistance and advice from Ben Laurie and Michael Shields) have developed Roughtime.

Very briefly: using some tricks we believe that it's viable to deploy servers that sign a client-chosen nonce and timestamp on demand. Once you have several of these servers, clients can generate their nonces by hashing replies from other servers with some entropy. That proves that a nonce was created after the reply was received. Clients maintain a chain of nonces and replies and, if a server misbehaves, can use replies from several other servers to prove and report it.

Currently there's only one Roughtime service running, so the idea of spreading trust around is inchoate. But we would like to gauge whether others are interested in this idea, specifically whether there are any organisations who would be seriously interested in deploying something like this in their clients. (Because I assume that, if you have clients, then you'll also be interested in running a server.)

There's a much longer introduction and many more details on the Roughtime site and we've also setup a mailing list.

memcpy (and friends) with NULL pointers

The C standard (ISO/IEC 9899:2011) has a sane-seeming definition of memcpy (section

The memcpy function copies n characters from the object pointed to by s2 into the object pointed to by s1.

Apart from a prohibition on passing overlapping objects, I think every C programmer understands that.

However, the standard also says (section 7.1.4):

If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after promotion) not expected by a function with variable number of arguments, the behavior is undefined.

(Emphasis is mine.)

I'm sure that 7.1.4 seemed quite reasonable in isolation, but how does it interact with the case where memcpy is called with a zero length? If you read then you might well think that, since the function copies zero bytes, it's valid to pass NULL as either of the pointer arguments. I claim that the vast majority of C programmers would agree with that, but 7.24.1(2) clarifies that 7.1.4 really does apply:

Where an argument declared as size_t n specifies the length of the array for a function, n can have the value zero […] pointer arguments on such a call shall still have valid values, as described in 7.1.4.

(Nobody would actually write memcpy(NULL, NULL, 0), of course, because it (at best) does nothing. But such a call can easily arise at run-time when an empty object is handled by a more general function.)

Some compilers will use this corner of the standard to assume that pointers passed to memcpy are non-NULL, irrespective of the length argument. GCC has built this in, while Clang can get it from the fact that glibc annotates memcpy with nonnull specifications.

Consider the following function:

#include <stdint.h>
#include <string.h>

int f(uint8_t *dest, uint8_t *src, size_t len) {
  memcpy(dest, src, len);
  return dest == NULL;

Here's the output of building that with GCC 6.1.1 with -O2:

0000000000000000 <f>:
   0:	48 83 ec 08          	sub    rsp,0x8
   4:	e8 00 00 00 00       	call   9 <f+0x9>  # memcpy
   9:	31 c0                	xor    eax,eax
   b:	48 83 c4 08          	add    rsp,0x8
   f:	c3                   	ret

From that we can see that rax (which holds the return value of a function in the amd64 ABI) is unconditionally set to zero, i.e. the compiler has assumed that dest == NULL is false because it has been passed to memcpy. The compiler's reasoning goes like this: 7.1.4 says that passing a NULL pointer to a standard library function is undefined behaviour, therefore if dest was NULL any behaviour is reasonable. So the code can be optimised with the assumption that it's non-NULL, as that's the only case with defined behaviour.

(You can also play with this snippet in Matt Godbolt's excellent tool.)

Opinions on this vary from “the C standard defines the language thus that optimisation is fine by definition” to “that's crazy: there's a huge amount of code out there that probably assumes the obvious behaviour of memcpy”. Personally, I find myself further towards the latter position than the former.

Also, it's not just memcpy: the same optimisations are annotated in glibc for (at least) memccpy, memset, memcmp, memchr, memrchr, memmem, mempcpy, bcopy and bcmp. Section 7.1.4 can be applied to any standard library function.


To try and figure out the impact that this optimisation is having I built a number of open-source programs with GCC 6.1.1, with -fno-builtin (to disable GCC's built-in versions of these functions) and with glibc's string.h including, or not, the nonnull annotations. For example, the snippet of code above produces this diff when tested this way:

0000000000000000 <f>:
-   0:	48 83 ec 08          	sub    rsp,0x8
+   0:	53                   	push   rbx
+   1:	48 89 fb             	mov    rbx,rdi
    4:	e8 00 00 00 00       	call   9 <f+0x9>
    9:	31 c0                	xor    eax,eax
-   b:	48 83 c4 08          	add    rsp,0x8
-   f:	c3                   	ret    
+   b:	48 85 db             	test   rbx,rbx
+   e:	0f 94 c0             	sete   al
+  11:	5b                   	pop    rbx
+  12:	c3                   	ret    

The added code tests dest to set the return value, as intended.

The first program I tested was BIND 9.9.5 because of this advisory that says: “GCC now includes (by default) an optimization which is intended to eliminate unnecessary null pointer comparisons in compiled code. Unfortunately this optimization removes checks which are necessary in BIND and the demonstrated effect is to cause unpredictable assertion failures during execution of named, resulting in termination of the server process”. Although version 9.9.5 should be affected according to the advisory, I found no differences in the compiled output based on nonnull annotations in string.h. Perhaps it's because I'm using a different GCC, perhaps I just got something wrong in my testing, or perhaps these checks were eliminated for different reasons. (For example, a local root exploit in the kernel was enabled by a dereference-based removal of a NULL check.)

Next up, I tried something that I'm more involved with: BoringSSL. Here there are two changes: a reordering of two conditions in OPENSSL_realloc_clean (which has no semantic effect) and extensive changes in BN_mod_exp_mont. I'm sure I would be able to do a better analysis if I were more experienced with disassembling large programs, but I'm just using objdump and diff. Still, I believe that all the changes are the result of a single NULL check being removed and then the resulting offset shift of all the following code. That counts as an optimisation, but it's statically clear that the pointer cannot be NULL even without any assumptions about string.h functions so I struggle to give much credit.

Since BoringSSL showed some changes, I tried OpenSSL 1.0.2h. This also shows the same large changes around BN_mod_exp_mont. There's also a large change in dsa_builtin_paramgen2 (a function that we don't have in BoringSSL) but that appears to be another insignificant NULL-check removed and a consequent change of all the later offsets. Lastly, there are a handful of no-op changes: like swapping the arguments to cmp before jne.

Next I tried openssh-7.2p2, which shows no changes. I wondered whether someone had already done this analysis and corrected any problems in OpenSSH so tried a much older version too: 5.4p1. That does show a small, but non-trivial, change in ssh_rsa_verify. After a bit of thought, I believe that GCC has managed to eliminate a test for a non-NULL pointer at the end of openssh_RSA_verify. Just like the BoringSSL case, it's already possible to deduce that the pointer must be non-NULL without any section 7.1.4 assumptions.


It's clear that one has to write C code that's resilient to the compiler assuming that any pointers passed to standard library functions are non-NULL. There have been too many releases of glibc and GCC with this in to safely assume that it'll ever go away.

However, the benefits of this (i.e. the optimisations that the compiler can perform because of it) are nearly zero. Large programs can be built where it has no effect. When there are changes they are either cases that the compiler should have been able to figure out anyway, or else noise changes with no effect.

As for the costs: there have been several cases where removing NULL checks has resulted in a security vulnerability, although I can't find any cases of this precise corner of the C standard causing it. It also adds a very subtle, exceptional case to several very common functions, burdening programmers. But it thankfully rarely seems to make a difference in real-life code, so hopefully there's not a large pool of bugs in legacy code that have been created by this change.

Still, given the huge amount of legacy C code that exists, this optimisation seems unwise to me. Although I've little hope of it happening, I'd suggest that GCC and glibc remove these assumptions and that the next revision of the C standard change 7.24.1(2) to clarify that when a length is zero, pointers can be NULL.

If anyone wishes to check my results here, I've put the scripts that I used on GitHub. I'm afraid that it takes a bit of manual setup and, given variation in GCC versions across systems, some differences are to be expected but the results should be reproducible.

Cryptographic Agility

(These are notes that I wrote up from a talk that I gave at the National Academies Forum on Cyber Resilience. You can tell that it was in Washington, DC because of the “cyber”.

I wasn't quite sure how technical to pitch this talk so it's relatively introductory; regular readers probably know all this.

This isn't a transcript of what I said, but I try to hit the main points in my notes.)

Firstly I'd like to separate extensibility from agility. A protocol is extensible if you can add features to it without having to update every implementation at the same time—which is generally impossible. Cryptographic agility depends on having extensibility, at least if you ever want to use something that wasn't designed into a protocol at the beginning.

Protocols should be extensible: the world keeps changing and no design is going to be perfect for all time. But extensibility is much harder in practice than it sounds.

I happen to be particularly familiar with TLS and TLS has two, major extensibility mechanisms. The first is a simple version number. Here's how the specification says that it should work:

Client: I support up to version 1.2.

Server: (calculates the minimum of the version that the client supports and the maximum version that the server supports) Ok, let's use version 1.1.

This is commendably simple: it's not possible to express a range of versions and certainly not a discontinuous range. This is about as simple as an extensibility mechanism could be, and yet lots of implementations get it wrong. It's a common mistake for implementations to return an error when the client offers a version number that they don't understand.

This, of course, means that deploying new versions doesn't work. But it's insidious because the server will work fine until someone tries to deploy a new version. We thought that we had flexibility in the protocol but it turned out that bugs in code had rusted it in place.

At this point it's worth recalling the Law of the Internet: blame attaches to the last thing that changed. If Chrome updates and then something stops working then Chrome gets the blame. It doesn't matter that the server couldn't correctly calculate the minimum of two numbers. No normal person understands or cares about that.

What's to be done about this? Well, we work around issues if they're big and suck up the breakage if they're small. It's taken about 15 years to get to the point where web browsers don't have to work around broken version negotiation in TLS and that's mostly because we only have three active versions of TLS. When we try to add a fourth (TLS 1.3) in the next year, we'll have to add back the workaround, no doubt. In summary, this extensibility mechanism hasn't worked well because it's rarely used and that lets bugs thrive.

TLS has a second, major extension mechanism which is a series of (key, value) pairs where servers should ignore unknown keys. This has worked a little better because, while there are only three or four versions in play, with many years between versions, there are 25 to 30 extensions defined. It's not perfect: bugs in implementations have led them to be dependent on the order of extensions and somebody at least managed to write a server that breaks if the last value is empty.

Sometimes more extensibility points have been added inside of extensions in the expectation that it'll save adding another, top-level extension in the future. This has generally been a mistake: these extension points have added complexity for little reason and, when we try to use them, we often find that bugs have rusted them solid anyway. They've just been a waste.

There's a lesson in all this: have one joint and keep it well oiled.

Protocol designers underestimate how badly people will implement their designs. Writing down how you think it should work and hoping that it'll work, doesn't work. TLS's protocol negotiation is trivial and the specification is clear, yet it still didn't work in practice because it's difficult to oil.

Rather one needs to minimise complexity, concentrate all extensibility in a single place and actively defend it. An active defense can take many forms: fuzzing the extensibility system in test suites and compliance testing is good. You might want to define and implement dummy extensions once a year or such, and retire old ones on a similar schedule. When extensions contain lists of values, define a range of values that clients insert at random. In short, be creative otherwise you'll find that bug rust will quickly settle in.

Agility itself

Cryptographic agility is a huge cost. Implementing and supporting multiple algorithms means more code. More code begets more bugs. More things in general means less academic focus on any one thing, and less testing and code-review per thing. Any increase in the number of options also means more combinations and a higher chance for a bad interaction to arise.

Let's just consider symmetric ciphers for a moment. Because everyone wants them to be as fast as possible, BoringSSL currently contains 27 thousand lines of Perl scripts (taken from OpenSSL, who wrote them all) that generate assembly code just in order to implement AES-GCM. That's a tremendous amount of work and a tremendous scope for bugs.

Focusing again on TLS: over the years, 25 different ciphers and modes have been specified for use in TLS. Thankfully, of those, only nine are actively used. But that doesn't mean that the zombies of the others might not still be lurking around, ready to cause problems.

Where did this mess of diversity come from?

1. Old age / we had no idea what we were doing in the 1990's:

RC2_CBC_40 RC4_128 RC4_40

A lot of mistakes were made in the 1990's—we really didn't know what we were doing. Phil Rogaway did, but sadly not enough people listened to him; probably because they were busy fighting the US Government which was trying to ban the whole field of study at the time. Unfortunately that coincided with the early inflation period of the internet and a lot of those mistakes were embedded pretty deeply. We're still living with them today.

2. National pride cipher suites


The next cause of excess agility are the national pride cipher suites. Many countries consider cryptography to be an area of national interest but then mistakenly believe that means that they have to invent their own standards and primitives. South Korea and Japan were especially forthright about this and so managed to get these ciphers assigned code points in TLS but Russia and China and, to some extent, many other countries do the same thing.

Although they receive limited analysis compared to something like AES, they're generally not bad, per se, but they bring nothing new to the table: they add nothing but costs, and the costs are significant. Cryptographic diversity for the point of national pride should be strenuously resisted for that reason. Other countries may complain that the US got their standards widely used but the US got to specify a lot about the internet by being the first mover. (And AES is from Belgium anyway.) However, it is the case that I'm not aware of any of these national standards being used to promote something that's actually a deliberate backdoor; which is, of course, not true of the US.

3. Reasonable cases for diversity:

  • Embedded systems want to minimise circuit size: AES_128_CCM and AES_256_CCM.
  • We want something faster for when we don't have AES hardware: CHACHA20_POLY1305.
  • US Government standard, got hardware support from Intel: AES_128_GCM and AES_256_GCM.

Now we come to the ones that are reasonable to use and the reasons for diversity there. It's all about performance optimisation for different environments really: tiny devices want CCM because it only needs an AES-encrypt circuit. Devices without hardware support for AES-GCM want to use ChaCha20-Poly1305 because it's much more efficient in software. Everything else wants to use AES-GCM.

Agility has allowed us to introduce the ciphers in the final set and that's really important. But it's equally important to kill off the old stuff, and that's very hard. Nearly all the incentives are aligned against it. Recall the Law of the Internet (mentioned above); users hate stuff breaking and always blame you. Even djb will take to Twitter when one drops DSA support.

We have a long conveyor belt of primitives, we put new ones at the front and, every so often, we turn the crank and something drops off the end. In addition to all the obvious problems with killing off old stuff, that also means that there's a lot of inadvisable options that will generally function at any given time and this is leading to new products launching with no idea that they're sitting towards the end of this conveyor belt. These products expect a lifetime of some number of years and are unaware that we hope to discontinue something that they're using much sooner than that. It's no longer the case that we can assume that waiting a year will result in a reduction of the amount of use that a deprecated primitive gets because of new launches.

Google tries to address this where it can by requiring support for the newest options in our certification process for devices that interact with our services. But only a tiny subset of the things that interact with Google go through any of our certifications.

Things are even harder in non-interactive cases. TLS at least gets to negotiate between the client and server but algorithms in S/MIME messages and certificate signatures don't allow that. (One can think of ways to help change that, but the current reality is that they're not negotiated.) That's why dropping SHA-1 support in certificates has been a such a gruesome fight and why PKCS#8 messages still require us to support 40-bit RC2.

So what's the lesson here? I'd say that you need extensibility but, when it comes to cryptographic agility, have one option. Maybe two. Fight to keep it that small.

It's worth highlighting that, for the purposes of time, I've simplified things dramatically. I've considered only symmetric ciphers and modes above but, even within TLS, there's a whole separate conveyor belt for asymmetric algorithms. And I've not mentioned the oncoming storm of quantum computers. Quantum computers are going to be hilarious and I hope to be retired before they get big enough to cause problems!

Post-quantum key agreement

If large quantum computers can be built (and not of the D-Wave variety) then they'll make quite a mess of public-key cryptography. RSA and systems based on discrete logarithms (i.e. finite-field and elliptic-curve Diffie-Hellman) are all broken. I've written about hash-based signatures (which resist quantum attacks) before and the recent PQCRYPTO recommendations suggest those for signatures and McEliece for public-key encryption. Those are both sound, conservative recommendations but both have some size problems: McEliece public keys can be on the order of a megabyte of data and conservative hash-based signatures are about 40KB.

In some situations that's not a problem, and things like firmware signatures, where the key is embedded in hard-to-change silicon, should consider using hash-based signatures today. But those costs motivate the search for post-quantum schemes that are closer to the small, fast primitives that we have today.

One candidate is called Ring Learning-With-Errors (RLWE) and I'll try to give a taste of how it works in this post. This is strongly based on the A New Hope paper by Alkim, Ducas, Pöppelmann & Schwabe, and, in turn, on papers by Bos, Costello, Naehrig & Stebila and Peikert.

Firstly, the basic stats (because I always hate having to dig around for these values in papers). I've included the numbers for a current, elliptic-curve based Diffie-Hellman scheme (X25519) for comparision:

A New HopeX25519
Alice's transmission size1,824 bytesa32 bytes
Alice's computation129,638 cyclesb331,568 cyclesc
Bob's transmission size1,824 bytes32 bytes
Bob's computation126,236 cycles331,568 cycles

(a This is using a more compact scheme than in the paper. b These are Haswell cycle counts from the paper. c These are values from the SUPERCOP benchmark on titan0.)

Something to keep in mind when looking at the numbers above: sending 1,824 bytes on a 10MBit link takes 5.1 million cycles, assuming that your CPU is running at 3.5GHz.

RLWE key agreement

Our fundamental setting is ℤ12289[X]/(X1024+1). That's the set of polynomial equations where the largest power of x is 1023 and the coefficients are values between zero and 12288 (inclusive). For example, 66 + 4532x + 10000x2 + … + 872x1023.

Addition and multiplication can be defined for polynomials in this set. Addition is done by matching up powers of x and adding the corresponding coefficients. If the result is out of range then it's reduced modulo 12289.

Multiplication is high-school polynomial multiplication where the polynomial on the right is multiplied by every term of the polynomial on the left and the result is the sum of those terms. Coefficients are reduced modulo 12289 to keep them in range, but it's likely that the resulting powers of x will be too large—multiplying two x1023 terms gives a result in terms of x2046.

Polynomials with too large a degree can be reduced modulo x1024+1 until they're in range again. So, if we ended up with a term of a×x2046 then we could subtract a×x1022(x1024+1) to eliminate it. By repeated application of that trick, all the terms with powers of x greater than 1023 can be eliminated and then the result is back in the set.

Now that we can add and multiply within this set of polynomials we need to define a noise polynomial: this is simply a polynomial where each coefficient is sampled from a random distribution. In this case, the distribution will be a centered binomial distribution that ranges from -12 to 12. The probability density looks like this:

image/svg+xml Gnuplot Gnuplot Produced by GNUPLOT 5.0 patchlevel 1 0 -15 -10 -5 0 5 10 15

An important feature of noise polynomials is that the magnitude of each coefficient is small. That will be critical later on.

A random polynomial is one where each coefficient is sampled from a uniform distribution over the full range of zero to 12288.

To build a Diffie-Hellman protocol from this, Alice generates a random polynomial, a, and two noise polynomials, s and e. Alice calculates b = as+e and sends a and b to Bob. Bob generates his own s′ and e′, uses Alice's a to calculate u = as′+e′, and sends u back to Alice. Now Alice can calculate us = (as′+e′)s = as′s+e′s and Bob can calculate bs′ = (as+e)s′ = ass′+es′. But the keen-eyed will notice that those are different values!

The added noise is necessary for security but it means that the two sides to this protocol calculate different values. But, while the values are different, they're very similar because the magnitude of the noise polynomials is small. So a reconciliation mechanism is needed to find a shared value given two, similar polynomials.


So far I've been following the A New Hope paper and it does include a reconciliation mechanism. But, to be honest, I'm not sure that I understand it, so I'm going to be describing a mechanism by Peikert here:

The reconciliation will treat each coefficient in the similar polynomials separately and will extract a single, shared bit from each. Since we're dealing with polynomials that have 1024 terms, we'll get a 1024-bit shared secret in total but I'm just going to discuss the process of processing a single coefficient to get a single bit.

Consider the coefficient space as a circle: zero and 12289 are the same value modulo 12289 and we put that at the top of the circle. At the bottom of the circle will be 12289/2 = 6144 (rounding down). We know that, for each coefficient, Alice and Bob will have similar values—meaning that the values will be close by on the circle.

image/svg+xml 0 6144

One option is to split the circle into left and right halves and say that if the point is in the left half then it's a zero, otherwise it's a one.

image/svg+xml 0 6144 0 1

But while that will work most of the time, there's obviously a problem when the points are near the top or the bottom. In these cases, a small difference in location can result in a point being in a different half, and thus Alice and Bob will get a different result.

In this case we want to split the circle into top (zero) and bottom (one), so that both points are clearly in the bottom half.

image/svg+xml 0 6144 0 1

But that just moves the problem around to the left and right edges of the circle. So how about we vary the basis in which we measure the points depending where they are? If the point is near the bottom or the top we'll use the top–bottom (blue) basis and, if not, we'll use the left–right (red) basis.

image/svg+xml 0 6144

But there's still a problem! Consider the two points in the diagram just above. One party will think that it's in a red area, measure it left–right and conclude that the shared bit is a zero. The other will think it's in a blue area, measure it top–bottom and conclude that the shared bit is a one.

There isn't a solution to this in which the parties operate independently so, instead, one of the parties chooses the basis in which each coefficient will be measured. This information is a single bit of information (i.e. red or blue) that we have Bob send to Alice along with his value u. With this reconciliation information in hand, both parties can measure their points in the same, optimal basis and calculate the same, shared value.


There's lots in the paper that I've skipped over here. Most importantly (for performance), a variant of the Fourier transform can be used to convert the polynomials into the frequency domain where multiplication is much faster. Some of the values transmitted can be transmitted in the frequency domain too to save conversions overall. Also, random polynomials can be sampled directly in the frequency domain.

The parameters here have also been carefully selected so that the reduction stage of multiplication happens magically, just by point-wise multiplication of a some constants before and after the transform.

The a value that Alice generates could be a global constant, but in order to save worrying about how it was generated, and to eliminate the possibility of all-for-the-price-of-one attacks (like LogJam), it's generated fresh for each instance. Rather than transmit it in full, Alice need only send a seed value for it to Bob.

Contrasts with Diffie-Hellman

The most important change from Diffie-Hellman is that this scheme requires that all values be ephemeral. Both QUIC and TLS 1.3 assume that a Diffie-Hellman public-value can be reused but, in this scheme, that breaks the security of the system. More traditional uses of Diffie-Hellman, e.g. TLS 1.2 and SSH, are fine though.

Another important change is that this scheme takes a full round-trip for Alice. With Diffie-Hellman, both parties can transmit a message at time zero and as soon as the other party has received the message, they can calculate the shared key and start transmitting. But even if the a value is a global constant in this scheme, the reconciliation process means that Bob can't send a message until he's received Alice's message, and Alice can't calculate the shared key until she has received Bob's message. So Alice has a wait a full round-trip.

Often that limitation isn't important because other parts of the protocol already require the round-trip (for example, in TLS). But for some uses it's a critical difference.

Also, since this protocol involves random noise it has a failure probability: it's possible that the reconciliation mechanism produces different answers for each side. Thankfully this probability can be made negligible (i.e. less than one in 264).

I should also note that the RLWE problem is hypothesised to be resistant to quantum attack, but we don't know that. We also don't know that it's resistant to attacks by classical computers! It's possible that someone will develop a classical algorithm tomorrow that breaks the scheme above. Thus it should be used concurrently with a standard Diffie-Hellman (e.g. X25519) and the outputs of each should concatenated as the input keying material for a KDF.

Juniper: recording some Twitter conversations

Update: Ralf wrote up some notes from his work. These now include an update themselves with information from Willem Pinckaers that suggests that the presumed Dual-EC output is exposed to the world in Juniper devices.

On Thursday, Juniper announced that some of their products were affected by “unauthorized code in ScreenOS that could allow a knowledgeable attacker to gain administrative access to NetScreen® devices and to decrypt VPN connections”. That sounds like an attacker managed to subvert Juniper's source code repository and insert a backdoor. Of course, any glimpses that we get of these sorts of attacks are fascinating.

Juniper followed up with a slightly more detailed post that noted that there were two backdoors: one via SSH and one that “may allow a knowledgeable attacker who can monitor VPN traffic to decrypt that traffic”. Either of these would be very interesting to a nation-state attacker but that latter—passive decryption of VPN connections—is really in their neighborhood.

So, of course, smarter people than I quickly took to Twitter to pull apart the differences in the fixed firmware versions. Since Twitter conversations are terrible to try and pick apart after the fact, I'm writing down the gist of things here. But I'm just the scribe in this case; other people did the work.

One of the first things that people focused on was a difference to a large, hex value that was visible by just diffing the strings of the two firmwares. That change is interesting not just because it's a large, opaque hex string in a binary, but because of the hex strings that immediately precede it. Specially they were:

  • FFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF: this is the prime order of the underlying field of P-256, a standard elliptic curve.
  • FFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFC: P-256 is typically written in short-Weierstrass form: y2=x3+ax+b. This is then the a value for P-256.
  • 5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B: This is the b value for the P-256 equation.
  • 6B17D1F2E12C4247F8BCE6E563A440F277037D812DEB33A0F4A13945D898C296: This is the x coordinate for the standard generator of P-256—the starting point for operations on the curve.
  • FFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551: This is the number of points on P-256.

So all the values just before the changed one are constants for P-256, suggesting that the changed value is cryptographic too. The obvious, missing value would be the y coordinate for the standard generator. One possibility was that the attack put in the wrong y value. This could put the generator on the wrong curve, say a weaker curve that shares most of the same parameters as P-256 but with a different value for b. But the curve that would have resulted, while weaker, wasn't real-time-passive-decryption weak. Also the replacement value in the fixed version wasn't the standard y value either.

Ralf-Philipp Weinmann was looking at the code itself and found:

That means that the changed value is an x coordinate and that the code was calculating the y value from it given the curve equation. Thus it would only need the x values and the points would always be on the correct curve. So perhaps it's a public key for something?

Changing a public key could easily be a big backdoor, but recall that the result here is somehow passive decryption of VPN traffic. It's unclear how changing a public key could result in passive decryption.

Oh dear. To explain: “EC PRNG” suggests that the value might be a constant in an elliptic-curve based pseudo-random number generator. That could certainly explain how passive decryption of VPN traffic was possible because it brings up memories of Dual-EC. Dual-EC was an NSA effort to introduce a backdoored pseudo-random number generator (PRNG) that, given knowledge of a secret key, allowed an attacker to observe output from the RNG and then predict its future output. If an attacker can predict the output of the PRNG then they can know the keys that one or both sides of a VPN connection will choose and decrypt it. (For more details, see the research paper.)

Indeed, it quickly came to light that Juniper have a page where they say that the VPN devices in question here “do utilize Dual_EC_DRBG, but do not use the pre-defined points cited by NIST”. In short, they used a backdoored RNG but changed the locks. Then this attack might be explained by saying that someone broke in and changed the locks again.

We're not sure that's actually what happened, but it seems like a reasonable hypothesis at this point. If it's correct, this is fairly bananas. Dual-EC is not a reasonable RNG: it's massively larger, slower and more complex than standard RNGs. It's output isn't even very uniform. Huge compromises were made in its design in order to meet its primary objective: to be a NOBUS, passive backdoor. (“NOBUS” is an intelligence community term for “nobody but us”, i.e. other parties shouldn't be able to use the backdoor.) Why would it be used in ScreenOS in the first place?

Again, assuming this hypothesis is correct then, if it wasn't the NSA who did this, we have a case where a US government backdoor effort (Dual-EC) laid the groundwork for someone else to attack US interests. Certainly this attack would be a lot easier given the presence of a backdoor-friendly RNG already in place. And I've not even discussed the SSH backdoor which, as Wired notes, could have been the work of a different group entirely. That backdoor certainly isn't NOBUS—Fox-IT claim to have found the backdoor password in six hours.


We recently switched Google's two billion line repository over to BoringSSL, our fork of OpenSSL. This means that BoringSSL is now powering Chromium (on nearly all platforms), Android M and Google's production services. For the first time, the majority of Google's products are sharing a single TLS stack and making changes no longer involves several days of work juggling patch files across multiple repositories.

This is a big positive for Google and I'm going to document some of the changes that we've made in BoringSSL in this post. I am not saying that people should be ditching OpenSSL and switching to BoringSSL. For Linux distributions that doesn't even make sense because we've removed too much for many applications to run unaltered and, without linker trickery, it's not possible to have both OpenSSL and BoringSSL in the same process because their symbols will collide. Even if you're in the position of shipping your own TLS stack with your code, you should still heed the warnings in the README well.

OpenSSL have considerably improved their processes since last April, which is great and important because huge swathes of the Internet will continue to depend on it. BoringSSL started before those changes but, even taking them into consideration, I'm still happy with my decision to fork. (But note that Google employs OpenSSL team members Emilia Käsper, Bodo Möller and Ben Laurie and contributes monetarily via the Core Infrastructure Initiative, so we haven't dropped our support of OpenSSL as a project.)

With that in mind, I'm going to mention some of the cleanups that we've done in BoringSSL from the lowest level, upwards. While most people should continue to use OpenSSL, there are lots of developers outside of Google who work on Chromium and Android and thus this document shouldn't be internal to Google. This post may seem critical of OpenSSL, but remember that many of these changes are possible because we only have to worry about Google's needs—we have an order of magnitude fewer platforms and configurations to support than OpenSSL and we don't keep any ABI compatibility. We also have the superpower of being able to change, where needed, the code that calls BoringSSL, so you can't really compare the two.

The “we”, above, is primarily myself and my colleagues David Benjamin and Matt Braithwaite. But BoringSSL is open source and Brian Smith has clocked up 55 patches and we've also had contributions from Opera and CloudFlare. (Brian's number would be higher if I had had more time to review his pending changes in the past couple of weeks).


Generally when people say “forking” they mean that they took a copy of the code and started landing patches independently of the original source. That's not what we did with BoringSSL. Rather than start with a copy, I started with an empty directory and went through OpenSSL function-by-function, reformatting, cleaning up (sometimes discarding) and documenting each one. So BoringSSL headers and sources look like this rather than this. The comments in BoringSSL headers can be extracted by a tool to produce documentation of a sort. (Although it could do with a make-over.)

(Clang's formatting tool and its Vim integration are very helpful! It's been the biggest improvement in my code-editing experience in many years.)

For much of the code, lengths were converted from ints to size_ts and functions that returned one, zero or minus one were converted to just returning one or zero. (Not handling a minus one return value is an easy and dangerous mistake.)

I didn't always get everything right: sometimes I discarded a function that we later found we actually needed or I changed something that, on balance, wasn't worth the changes required in other code. Where possible, code that we've needed to bring back has gone into a separate section called “decrepit” which isn't built in Chromium or Android.

But large amounts of OpenSSL could simply be discarded given our more limited scope. All the following were simply never copied into the main BoringSSL: Blowfish, Camllia, CMS, compression, the ENGINE code, IDEA, JPAKE, Kerberos, MD2, MDC2, OCSP, PKCS#7, RC5, RIPE-MD, SEED, SRP, timestamping and Whirlpool. The OpenSSL that we started from has about 468,000 lines of code but, today, even with the things that we've added (including tests) BoringSSL is just 200,000. Even projects that were using OpenSSL's OPENSSL_NO_x defines to exclude functionality at compile time have seen binaries sizes drop by 300KB when switching to BoringSSL.

Some important bits of OpenSSL are too large to bite off all at once, however. The SSL, ASN.1 and X.509 code were “forked” in the traditional sense: they were copied with minimal changes and improved incrementally. (Or, in the case of ASN.1 and X.509, left alone until they could be replaced completely.)

The lowest-levels

OpenSSL has a confusing number of initialisation functions. Code that uses OpenSSL generally takes a shotgun approach to calling some subset of OpenSSL_­add_­all_­algorithms, SSL_­library_­init, ERR_­load_­crypto_­strings and the deprecated SSLeay aliases of the same. BoringSSL doesn't need any of them; everything works immediately and the errors don't print out funny just because you forgot to load the error strings. If, like Chromium, you care about avoiding static initialisation (because every disk seek to load pages of code delays displaying the window at startup) then you can build with BORINGSSL_­NO_­STATIC_­INITIALIZER and initialise the library when you need with CRYPTO_­library_­init. But the vast majority of code just wants to avoid having to think about it. In the future, we would like to move to an automatic lazy-init which would solve even Chromium's needs.

OpenSSL and BoringSSL are often built into shared libraries, but OpenSSL doesn't have any visibility annotations. By default symbols are not hidden and ELF requires that any non-hidden symbols can be interposed. So if you look at in a Linux distribution you'll see lots of internal functions polluting the dynamic symbol table and calls to those functions from within the library have to indirect via the PLT. BoringSSL builds with hidden visibility by default so calls to internal functions are direct and only functions marked OPENSSL_­EXPORT are included in the dynamic symbol table.

Multi-threaded code is common these days but OpenSSL requires that you install callbacks to lock and unlock a conceptual array of locks. This trips up people who now take thread-safety to be a given, and can also mean that contention profiling shows a large, opaque amount of contention in the locking callback with no hint as to the real source. BoringSSL has a native concept of locks so is thread-safe by default. It also has “once” objects, atomic reference counting and thread-local storage, which eliminates much of the need for locking in the first place.


OpenSSL has a fairly unique method of handling errors: it pushes errors onto a per-thread queue as the stack unwinds. This means that OpenSSL errors can generally give you something like a stack trace that you might expect from gdb or a Python exception, which is definitely helpful in some cases. For contrast, NSS (Mozilla's crypto library) uses a more traditional, errno-like system of error codes. Debugging an NSS error involves looking up the numeric error code and then grepping the source code to find all the places where that error code can be set and figuring out which triggered this time.

However, this single error-code system is better for programmatic use. Code that tries to do something with OpenSSL errors (other than dumping them for human debugging) tends to look only at the first (i.e. deepest) error on the queue and tries to match on the reason or even function code. Thus changing the name of even internal functions could break calling code because these names were implicitly exported by the error system. Adding errors could also break code because now a different error could be first in the queue. Lastly, forgetting to clear the error queue after a failed function is very easy to do and thus endemic.

So BoringSSL no longer saves functions in the error queue: they all appear as OPENSSL_­internal, which saved about 15KB of binary size alone. As a bonus, we no longer need to run a script every time we add a new function. The file name and line number is still saved but, thankfully, I've never seen code try to match line numbers from the error queue. Trying to match on reason codes is still problematic, but we're living with it for now. We also have no good answer for forgetting to clear the error queue. It's possible that we'll change things in the future to automatically clear the error queue when calling most functions as, now that we're using thread-local storage, that'll no longer cause servers to burst into a flaming ball of lock contention. But we've not done that yet.

Parsing and serialisation

OpenSSL's parsing and serialisation involves a lot of incrementing pointers with single-letter names. BoringSSL drags this firmly into the 1990's with functions that automatically check bounds for parsing and functions that automatically resize buffers for serialisation. This code also handles parsing and serialising ASN.1 in an imperative fashion and we're slowly switching over to these functions because the OpenSSL ASN.1 code is just too complicated for us.

But I should note that OpenSSL's master branch now uses some similar parsing functions for parsing TLS structures at least. I've no idea whether that was inspired by BoringSSL, but it's great to see.

Random number generation

Random number generation in OpenSSL suffers because entropy used to be really difficult. There were entropy files on disk that applications would read and write, timestamps and PIDs would be mixed into entropy pools and applications would try other tricks to gather entropy and mix it into the pool. That has all made OpenSSL complicated.

BoringSSL just uses urandom—it's the right answer. (Although we'll probably do it via getrandom rather than /dev/urandom in the future.) There are no return values that you can forget to check: if anything goes wrong, it crashes the address space.

For the vast majority of code, that's all that you need to know, although there are some concessions to performance in the details:

TLS servers that are pushing lots of AES-CBC need the RNG to be really fast because each record needs a random IV. Because of this, if BoringSSL detects that the machine supports Intel's RDRAND instruction, it'll read a seed from urandom, expand it with ChaCha20 and XOR entropy from RDRAND. The seed is thread-local and refreshed every 1024 calls or 1MB output, whichever happens first.

Authenticated Encryption

Handing people a block cipher and hash function and expecting them to figure out the rest does not work. Authenticated Encryption is much closer to being reasonable and BoringSSL promotes it where possible. One very pleasing BoringSSL tale is that I handed that header file to a non-crypto developer and they produced secure code, first time. That would not have happened had I pointed them at EVP_CIPHER.

There is more to be done here as I've talked about before: we need nonce-misuse-resistant primitives and solutions for large files but what we have now is a significant improvement and the foundations for that future work are now in place.


As I mentioned, the SSL/TLS code wasn't reworked function-by-function like most of BoringSSL. It was copied whole and incrementally improved, predominantly by David Benjamin. I'm really happy with what he's managed to do with it.

At the small scale, most of the parsing and serialisation is now using the safe functions that I covered above. (Changes to convert most of the remaining pointer-juggling code are in my review queue.) TLS extensions are now a bit saner and no longer handled with huge switch statements. Support for SSLv2, DSS, SRP and Kerberos has all been dropped. The header file actually has comments.

Some important, small scale cleanups are less obvious. The large number of “functions” that were actually macros around ctrl functions (that bypassed the type system) are now real functions. In order to get TLS 1.0–1.2 you no longer use the ridiculously named SSLv23_method and then disable SSLv2 and SSLv3 by setting options on the SSL_CTX, rather you use TLS_method and control the versions by setting a minimum and maximum version.

There is lots more that I could mention like that.

At the larger scale, the buffer handling code has been substantially improved and the TLS code now does symmetric crypto using the AEAD interface, which cleanly partitions concerns that previously leaked all over the SSL code. We've also rewritten the version negotiation code so it no longer preprocesses the ClientHello and fiddles with method tables to use the correct version. This avoids some duplicated code and session resumption bugs and OpenSSL has since done a similar rewrite for 1.1.0. To solve a particular problem for Chrome, we've added some support for asynchronous private key operations so that slow smartcards don't block the network thread. Much of the DTLS logic has also been rewritten or pruned.

Perhaps most importantly, the state machine is much reduced. Renegotiation has been dropped except for the case of a TLS client handling renegotiation from a server while the application data flow has stopped, and even that is disabled by default. The DTLS code (a source of many bugs) is much saner in light of this.


OpenSSL has always had decent test coverage of lower-level parts like hash functions and ciphers, but testing of the more complex SSL/TLS code has been lacking. Testing that code is harder because you need to be able to produce sufficiently correct handshakes to get close to its edge cases, but you don't want to litter your real code with dozens of options for producing incorrect outputs in order to hit them. In BoringSSL, we've solved this by using a copy of Go's TLS stack for testing and we've littered it with such options. Our tests also stress asynchronous resume points across a range of handshakes. We wrote partial DTLS support in Go to test DTLS-only edge cases like reassembly, replay and retransmission. Along the way, we even discovered one of OpenSSL's old bug workarounds didn't work, allowing both projects to shed some code.

In C, any malloc call may fail. OpenSSL attempts to handle this, but such code is error-prone and rarely tested. It's best to use a malloc which crashes on failure, but for the benefit of consumers who can't, we have a "malloc test" mode. This runs all tests repeatedly, causing each successive allocation to fail, looking for crashes.

We now have 1,139 TLS tests which gives us 70% coverage of the TLS code—still better than any other TLS library that we've used.

The future

Now that we've done the task of aligning Google around BoringSSL, we'll hopefully be able to turn a little bit more attention to some feature work. Support for the IETF-approved ChaCha20-Poly1305 is coming soon. (Brian Smith has a change waiting for me.) Curve25519 and Ed25519 support are likely too. Next year, we will probably start on TLS 1.3 support.

But more cleanups are probably more important. The big one is the elimination of the ASN.1 and X.509 code in many cases. If you recall, we imported that code whole without cleanups and it hasn't been touched since. We've been incrementally replacing uses of the ASN.1 code with the new CBS and CBB functions but X.509 remains as a substantial user. We're not going to be able to drop that code completely because too much expects the X.509 functions to be available for reading and writing certificates, but we can make it so that the rest of the code doesn't depend on it. Then we can put it in a separate library and drop in a new certificate verification library that some of my Chromium colleagues are writing. Most users of BoringSSL will then, transparently, end up using the new library.

In the SSL code, the SSL object itself is a mess. We need to partition state that's really needed for the whole connection from state that can be thrown away after the handshake from state that can be optionally discarded after the handshake. That will save memory in servers as well as improving the clarity of the code. Since we don't have ABI compatibility, we can also reorder the structs to pack them better.

Lastly, we need to make fuzzing part of our process. Michał Zalewski's AFL has substantially improved the state of fuzzing but, whether we're using AFL or LibFuzzer, it's still a one-off for us. It should be much more like our CI builders. So should running clang-analyzer.

(David Benjamin contributed to this post.)

The ICANN Public Comments on WHOIS Privacy

ICANN is currently considering a proposal that “domains used for online financial transactions for commercial purpose should be ineligible for [WHOIS] privacy and proxy registrations” [PDF]. Given the vagueness around what would count as “commercial purpose” (tip jars? advertising? promoting your Kickstarter?) and concerns that some commercial sites are for small, home-run businesses, quite a lot of people are grumpy about this.

ICANN has a public comment period on this document until July 7th and what's interesting is that the comments (those that were emailed at least) are all in a mailing list archive. When you submit to the comment address ( you receive a confirmation email with a link that needs to be followed and quite a clear statement that the comments are public, so I think that this is deliberate.

I was curious what the comments box on this sort of topic is full of so did a quick analysis. The comment period doesn't close until July 7th so obviously I'm missing a couple of days worth of responses, but it was a two month comment period so that doesn't seem too bad.

When I checked there were 11,017 messages and 9,648 (87.6%) of them were strongly based on the Respect Our Privacy form letter. Several hundred additional messages included wording from it so I think that campaign resulted in about 90% of messages. (And it's worth noting the the primary flow on that site is to call ICANN—of course, I've no data on the volume of phone calls that resulted.)

Another campaign site, Save Domain Privacy, has a petition and quite a few messages included its wording.

I classified all such messages as “against” and wrote a quick program to manually review the remaining messages that weren't trivially classifiable by string matching against those template messages.

  • Nine messages were so odd or confused that it's unclear what the writer believed.
  • Three messages were asking questions and not expressing an opinion.
  • Two messages were sufficiently equivocal that they didn't express a clear feeling in either direction.
  • One message was commenting on a different section of the document.
  • One message suggested that WHOIS privacy be available to all, but that it should have a significant (monetary) cost in order to discourage its use.

Many more messages were against and not based on either of the two template letters. That leaves 13 messages that expressed support for the proposal. (That's 0.12% for those who are counting, although I very much question how much meaning that number has.):

  • Three messages suggested that private WHOIS registration was contrary to the openness of the Internet.
  • Three messages believed that shutting down sites that infringe copyright, or sell counterfeit trademarked goods, was a compelling reason.
  • Two writers believed that it was compelling, in general, to have contact details for websites. One of who claimed to be a security researcher and wanted CERTs to have access to full WHOIS details.
  • Two messages suggested that it would hinder “cyber-bullies” and “pædophiles”, one of which described how hard it was to have a stalker's site shut down.
  • One author believed that being able to contact site owners in the event of a domain-name dispute was a compelling reason.
  • One message suggested that WHOIS privacy should be removed for all .com sites, but no others.
  • One commenter opined that the Internet is inherently hostile to privacy and thus those who want privacy should not register domains.

The comment period opened on May 5th, but between then and June 22nd there were only seven messages. However, the week of the 22nd brought 10,015 messages. The the week of the 29th brought 995 more. So I think it's clear that, without significant outside promotion of these topics, almost nobody would have noticed this proposal.

AEADs: getting better at symmetric cryptography

I gave a talk a couple of weeks ago at the Yahoo Unconference. The conference was at the end of a particually hard week for a bunch of reasons and I fear that the talk wasn't that great. (Afterwards I got home about 3pm and pretty much slept until the following morning.) This post is a, hopefully clearer, articulation of its contents.

I've been primarily working on getting Google products switched over to BoringSSL for a little over a year now. (Chromium is done on many platforms and AOSP switched recently.) This is neccessary work, but it doesn't exactly lend itself to talk material. So the talk was titled “Lessons Learnt from Questions”—the idea being that smart developers at Google often ask lots of cryptography questions, and from those questions one can tell what is unclear in existing documentation. Points that are not obvious to smart, non-experts are thus good topics to talk about because it's likely that lots of people are missing them.

I was aiming for an hour but, due to a misunderstanding in the weeks prior, I thought that I had to cut it down to 20 minutes, so I ditched all but one question:

“How do I encrypt lots of records given a per-user key?” —Anonymous developer

This question is about applying symmetric cryptography, which is the originally motivation of all of cryptography. You would have thought that we would have figured it out by now, but we totally haven't.

In the 1990's it was common for cryptographers to produce block ciphers and hash functions, and then developers would be tasked with composing them. In the talk I stepped through the design of, and attacks on, the SSLv3/TLS CBC construction as a demonstration that plausible-sounding design features can lead to disaster, but I'm going to omit that here because it worked better live. (Phillip Rogaway tried to warn about these issues when SSL, IPSec, etc were being developed, but sadly the right people didn't listen to him.)

So, for quite a long time now, cryptographers have understood that block ciphers and hash functions are too low-level an abstraction and that the constructions themselves need to be studied and standardised. The result is called authenticated encryption (AE), and let's ponder what it might look like.

Obviously an AE function must take some plaintext and, if it's symmetric encryption, it must take a key. We also assume that its going to take care of both confidentiality and authenticity for us, so not only will an attacker not be able to tell the plaintext from the ciphertext but, if the ciphertext is altered in any way, the decryption will fail cleanly. So let's experiment with a hypothetical SSH-like protocol that's trying to protect a series of key presses:

AE(key, plaintext) → ciphertext

AE(key, ‘h’) → α
AE(key, ‘e’) → β
AE(key, ‘l’) → γ
AE(key, ‘l’) → γ
AE(key, ‘o’) → δ

Oh no! The two l's in “hello” turned into the same ciphertext, so an attacker can see patterns in the input reflected in the ciphertext. That's pretty terrible; clearly we need our AE function to map the same plaintext to different ciphertexts. There's two ways that can happen: either the function is non-deterministic (i.e. it reads entropy internally as a hidden input), or we pass in some varying argument. Explicit is better than implicit so we choose the latter and add an argument called the nonce:

AE(key, plaintext, nonce) → ciphertext

AE(key, ‘h’, n0) → (α, n0)
AE(key, ‘e’, n1) → (β, n1)
AE(key, ‘l’, n2) → (γ, n2)
AE(key, ‘l’, n3) → (ɛ, n3)
AE(key, ‘o’, n4) → (δ, n4)

We assume that n0…4 are all distinct and thus the two l's now get different ciphertexts and patterns in the plaintext have been eliminated. Note that (for now) we need to include the nonce value with the ciphertext because it'll be needed when decrypting.

(As an aside: if you've done anything with cryptography before you've probably come across the acronym “IV”, for initialisation vector. Although not completely standard, I differentiate an IV and an nonce thus: an IV needs to be unpredictable while an nonce needs only to be distinct. The values 0, 1, 2, … are distinct, but they aren't unpredictable. For something like CBC mode, which needs an IV, using a counter would be unacceptable.)

We solved one problem but there's another thing that an evil attacker might do: reorder messages:

AE(key, ‘h’, n0) → (α, n0)(γ, n2)
AE(key, ‘e’, n1) → (β, n1)(δ, n4)
AE(key, ‘l’, n2) → (γ, n2)…untrusted network…(α, n0)
AE(key, ‘l’, n3) → (ɛ, n3)(ɛ, n3)
AE(key, ‘o’, n4) → (δ, n4)(β, n1)

All the ciphertexts on the right are perfectly valid and coupled with a valid nonce, but they end up decrypting to “lohle”—hardly what was intended.

There are two solutions to this. The first is to use an implicit nonce: don't transmit the nonce, just use a counter for each direction. Now, if a message is out of order, the receiver will attempt to decrypt it with the wrong nonce and it'll fail to decrypt. That works perfectly well for transport protocols like SSH and TLS because it's easy to keep a counter in those situations, but sometimes the problem isn't quite so synchronous. For these cases, one could include a sequence number or other context in the plaintext and that works fine, but it's a waste of space if the receiver already knew the information and just wanted to confirm it.

This motivates a common extension to authenticated encryption called authenticated encryption with associated data (AEAD). The associated data is another argument, of arbitrary length, which must be equal at the encryption and decryption ends:

AEAD(key, plaintext, nonce, ad) → ciphertext

AEAD(key, ‘h’, n0, 0) → (α, n0)
AEAD(key, ‘e’, n1, 1) → (β, n1)
AEAD(key, ‘l’, n2, 2) → (γ, n2)
AEAD(key, ‘l’, n3, 3) → (ɛ, n3)
AEAD(key, ‘o’, n4, 4) → (δ, n4)

In this example the associated data happens to be a counter, but it could be anything. The associated data isn't included in the ciphertext, but it must be identical when decrypting otherwise the decryption will fail. In other protocols it could equally well be some context denoting, say, the first or last record. This is the second solution to the reordering problem.

The associated data might seem quite a lot like an nonce. They both need to be presented at encryption and decryption time and they must both match for decryption to succeed, so why do we bother having both? Mostly because the associated data is free-form: you can have as much or as little of it as you like and you can repeat it etc. The requirements of an nonce are much stricter, in fact if you remember only one thing from this please remember this, which I'm calling The Law of AEADs:

Thou shall never reuse the same (key, nonce) pair, for all time. (With high probability.)

So, if you generate a random key and use it to encrypt a single message, it's ok to set the nonce to zero. If you generate a random key and encrypt a series of messages you must ensure that the nonce never repeats. A counter is one way to do this, but if you need to store that counter on disk then stop: the chances of you screwing up and reusing an nonce value are way too high in designs like that.

It would be nice if reusing an nonce just meant that the same plaintext would result in the same ciphertext. That's the least bad thing that an AEAD could do in that situation. However the reality is significantly worse: common AEADs tend to lose confidentiality of messages with a repeated nonce and authenticity tends to collaspe completely for all messages. (I.e. it's very bad.) We like these common AEADs because they're fast, but you must have a solid story about nonce uniqueness. AEADs like AES-GCM and ChaCha20-Poly1305 fail in this fashion.

So what should you do when using a counter isn't trivially safe? One option is to generate the nonce at random and consider the probability of duplicates. AES-GCM takes a 96-bit nonce and NIST says that you can only encrypt 232 messages under a single key if using random nonces. This is because if you throw 232 balls at 296 buckets then you have roughly a 2-33 chance of getting two in the same bucket and NIST drew the line there. That probability might seem either high or low to you. It's pretty small in absolute terms and, unlike a work factor, an attacker can't spend resources against it, but it's a very long way from the safety margins that we usually use in cryptography. So there are also functions like crypto_­secretbox_­xsalsa20­poly1305 (from NaCl) that have a 192-bit nonce. The probabilities with random nonces are much more comforting at that size.

Another approach would be to use an AEAD that doesn't fail quite so catastrophically when an nonce is repeated. This is called an nonce-misuse resistant AEAD and we're hitting the boundary of developed practice now. The CAESAR competition has several nonce-misuse resistant entries in it, although it's not scheduled to conclude until 2018. Closer to established primitives, Gueron and Lindell propose an AES-GCM-SIV mode with a strong foundation (assuming that you trust AES-GCM) and good performance (assuming that you have good AES-GCM performance).

AEADs with large plaintexts

If you look at AEAD APIs you'll generally notice that they take the entire plaintext or ciphertext at once. In other words, they aren't “streaming” APIs. This is not a mistake, rather it's the streaming APIs that are generally a mistake.

I've complained about this in the past, so I'll be brief here. In short, old standards (e.g. PGP) will encrypt plaintexts of any length and then put an authenticator at the end. The likely outcome of such a design is that some implementations will stream out unauthenticated plaintext and only notice any problem when they get to the end of the ciphertext and try to check the authenticator. But by that time the damage has been done—it doesn't take much searching to find people suggesting piping the output of gpg to tar or even a shell.

So, if streaming is required, large plaintexts should be chunked and each chunk should easily fit into a “one-shot” API like those linked to above. That prevents unauthenticated plaintext from being processed. However, there is no standard for this; it's another case where we've hit the borders of existing practice. Implementations need to construct the nonces and associated data so that the first chunk is known to be the first, so that each chunk is in order and so that truncation is always detectable.

Streaming decryption always runs the risk of an attacker truncating the ciphertext however. Designs must always be able to detect truncation, but it's very easy to imagine that the rest of a system doesn't handle it well. (For example, see the Cookie Cutter attack against HTTPS.)

If streaming isn't essential then an all-or-nothing transform would be ideal. But, again, there's no substantial standards or practice around this. Hopefully in ten years or so there will be clear answers for this and for large-plaintext constructions with AEADs (AERO is a start). But, for now, even symmetric encryption isn't a “solved problem”.

Why not DANE in browsers

Thomas Ptacek laid out a number of arguments against DNSSEC recently (and in a follow up). We don't fully agree on everything, but it did prompt me to write why, even if you assume DNSSEC, DANE (the standard for speaking about the intersection of TLS and DNSSEC) is not a foregone conclusion in web browsers.

There are two ways that you might wish to use DANE in a web browser: either to block a certificate that would normally be considered valid, or to bless a certificate that would normally be rejected. The first, obviously, requires that DANE information always be obtained—if a lookup failure was ignored, a network attacker with a bad certificate would just simulate a lookup failure. But requiring that browsers always obtain DANE information (or a proof of absence) is nearly implausible:

Some years ago now, Chrome did an experiment where we would lookup a TXT record that we knew existed when we knew the Internet connection was working. At the time, some 4–5% of users couldn't lookup that record; we assume because the network wasn't transparent to non-standard DNS resource types. DANE records are going to be even more non-standard, are going to be larger and browsers would have to fetch lots of them because they'll need the DNSKEY/RRSIG chain up from the root. Even if DNSSEC record lookup worked flawlessly for everyone, we probably still wouldn't implement this aspect of DANE because each extra lookup is more latency and another chance for packet loss to cause an expensive timeout and retransmit.

Instead, for this we have HPKP, which is a memory-based pinning solution using HTTP headers. We also have pre-loaded pinning in Chrome for larger or more obvious targets. Thomas Ptacek seems bullish on pinning but I'm much more lukewarm. HPKP is quite complex and no doubt someone will write a “supercookies” story about it at some point, as they have for HSTS. Additionally, pinning is quite dangerous. Clients deciding that “pinning is good” have caused headaches at Google. It's also worth noting that CryptoCat has committed pinning-suicide in Chrome at at the moment due to their CA having switched intermediates between renewals. They're waiting for the release of Chrome 41 to recover.

But what about the other side of DANE: blessing certificates that would otherwise be considered untrusted? In this case, DNSSEC can be seen as something like another CA. The same problems with looking up DNSSEC records apply, but are much less painful when you only need to depend on the lookup for sites that are using DANE certificates. Indeed, Chrome even supported something very like DANE for a while. In that case the DNSSEC records were contained in the certificate to avoid the latency and complexity of looking them up in the client. (DNSSEC records contains signatures so need not be transported over the DNS protocol.)

But support for that was removed because it was a bunch of parsing code outside of the sandbox, wasn't really being used and it conflicted with two long-term plans for the health of the HTTPS ecosystem: eliminating 1024-bit RSA and Certificate Transparency. The conflicts are probably the most important reasons for not wanting to renew the experiment.

The effort to remove 1024-bit RSA from HTTPS has been going for years and is, perhaps, nearing completion now. (That noise that you can hear is my colleague, Ryan Sleevi, crying softly.). There are still some 1024-bit root certificates, but they are nearly gone from the Mozilla set. The amount of work involved is an order of magnitude greater than you expect because of the interactions of different X.509 validation libraries, intermediate chains and varying root stores on different platforms and versions.

DNSSEC, however, is littered with 1024-bit RSA. You literally can't avoid it because the root zone transits through a 1024-bit key. DNSSEC has them because of (I think) concerns about the size of responses and they are usually rotated every two or three months. The RFC suggests that 1024-bit RSA is good for “most zones” until 2022. Dan Bernstein's paper on Batch NFS deals well with the question of whether that's wise.

Next, Certificate Transparency is our effort to add strong, technical audits to the CA system by creating a trustworthy log of all valid certificates. CT logs only accept certificates from CA as an anti-spam measure but people can create DANE certificates for domains at will. This is far from an insurmountable problem, but it is a problem that would need to be solved and the CT team already have their hands full with the staged rollout of CT in Chrome.

The 1024-bit RSA problem isn't insurmountable either (although it's baked in much deeper), so it's possible that browsers might accept a DNSSEC signature chain in a certificate in the future, but it's a long way out.

The POODLE bites again

October's POODLE attack affected CBC-mode cipher suites in SSLv3 due to SSLv3's under-specification of the contents of the CBC padding bytes. Since SSLv3 didn't say what the padding bytes should be, implementations couldn't check them and that opened SSLv3 up to an oracle attack.

We're done pretty well at killing off SSLv3 in response to that. Chrome 39 (released Nov 18th) removed fallback to SSLv3 and Chrome 40 is scheduled to remove SSLv3 completely. Firefox 34 (released Dec 1st) has already removed SSLv3 support.

We're removing SSLv3 in favour of TLS because TLS fully specifies the contents of the padding bytes and thus stops the attack. However, TLS's padding is a subset of SSLv3's padding so, technically, you could use an SSLv3 decoding function with TLS and it would still work fine. It wouldn't check the padding bytes but that wouldn't cause any problems in normal operation. However, if an SSLv3 decoding function was used with TLS, then the POODLE attack would work, even against TLS connections.

This was noted by, at least, Brian Smith on the TLS list ([1][2]) and I was sufficiently cynical to assume that there were probably more instances of this than the old versions of NSS that Brian cited, and so wrote a scanner for the issue.

Unfortunately, I found a number of major sites that had this problem. At least one of whom I had good enough contacts at to quickly find that they used an F5 device to terminate connections. I contacted F5 on October 21st and they started working on a fix. Yngve Pettersen also independently found this issue and contacted me about it around this time.

F5 reported that some of the affected sites weren't customers of theirs, which meant that there was (at least) a second vendor with the same issue. After more digging, I found that some A10 devices also have this problem. I emailed a number of contacts at A10 on October 30th but sadly didn't get a reply from any of them. It wasn't until November 13th that I found the right person at A10 to deal with this.

F5 and A10 have posted patches for their products (F5's are here and A10's are here and they have an advisory here). I'm not completely sure that I've found every affected vendor but, now that this issue is public, any other affected products should quickly come to light. (Citrix devices have an odd behaviour in this area in that they'll accept padding bytes that are all zeros, but not random padding. That's unexpected but I can't make an attack out of it.)

(Update: since posting this, it appears that products from Fortinet, Cisco, IBM (WebSphere, Domino, Tivoli) and Juniper may also be affected.)

Ivan Ristić has added a test for this issue to his excellent scanner at SSLLabs. Affected sites will have their grade set to F and will report “This server is vulnerable to the POODLE attack against TLS servers”.

This seems like a good moment to reiterate that everything less than TLS 1.2 with an AEAD cipher suite is cryptographically broken. An IETF draft to prohibit RC4 is in Last Call at the moment but it would be wrong to believe that RC4 is uniquely bad. While RC4 is fundamentally broken and no implementation can save it, attacks against MtE-CBC ciphers have repeatedly been shown to be far more practical. Thankfully, TLS 1.2 support is about to hit 50% at the time of writing.

POODLE attacks on SSLv3

My colleague, Bodo Möller, in collaboration with Thai Duong and Krzysztof Kotowicz (also Googlers), just posted details about a padding oracle attack against CBC-mode ciphers in SSLv3. This attack, called POODLE, is similar to the BEAST attack and also allows a network attacker to extract the plaintext of targeted parts of an SSL connection, usually cookie data. Unlike the BEAST attack, it doesn't require such extensive control of the format of the plaintext and thus is more practical.

Fundamentally, the design flaw in SSL/TLS that allows this is the same as with Lucky13 and Vaudenay's two attacks: SSL got encryption and authentication the wrong way around – it authenticates before encrypting.

Consider the following plaintext HTTP request, which I've broken into 8-byte blocks (as in 3DES), but the same idea works for 16-byte blocks (as in AES) just as well:


The last block contains seven bytes of padding (represented as •) and the final byte is the length of the padding. (And I've used a fictional, 8-byte MAC, but that doesn't matter.) Before transmission, those blocks would be encrypted with 3DES or AES in CBC mode to provide confidentiality.

Now consider how CBC decryption works at the receiver, thanks to this public domain diagram from Wikipedia:

image/svg+xml block cipherdecryption Key Plaintext Ciphertext Initialization Vector (IV) block cipherdecryption Key Plaintext Ciphertext block cipherdecryption Key Plaintext Ciphertext

An attacker can't see the plaintext contents like we can in the diagram, above. They only see the CBC-encrypted ciphertext blocks. But what happens if the attacker duplicates the block containing the cookie data and overwrites the last block with it? When the receiver decrypts the last block it XORs in the contents of the previous ciphertext (which the attacker knows) and checks the authenticity of the data. Critically, since SSLv3 doesn't specify the contents of the padding (•) bytes, the receiver cannot check them. Thus the record will be accepted if, and only if, the last byte ends up as a seven.

An attacker can run Javascript in any origin in a browser and cause the browser to make requests (with cookies) to any other origin. If the attacker does this block duplication trick they have a 1-in-256 chance that the receiver won't reject the record and close the connection. If the receiver accepts the record then the attacker knows that the decryption of the cookie block that they duplicated, XORed with the ciphertext of the previous block, equals seven. Thus they've found the last byte of the cookie using (on average) 256 requests.

Now the attacker can increase the length of the requested URL and decrease the length of something after the cookies and make the request look like this:


Note that the Cookie data has been shifted so that the second to last byte of the data is now at the end of the block. So, with another 256 requests the attacker can expect to have decrypted that byte and so on.

Thus, with an average of 256×n requests and a little control of the layout of those requests, an attacker can decrypt n bytes of plaintext from SSLv3. The critical part of this attack is that SSLv3 doesn't specify the contents of padding bytes (the •s). TLS does and so this attack doesn't work because the attacker only has a 2-64 or 2-128 chance of a duplicated block being a valid padding block.

This should be an academic curiosity because SSLv3 was deprecated very nearly 15 years ago. However, the Internet is vast and full of bugs. The vastness means that a non-trivial number of SSLv3 servers still exist and workarounds for the bugs mean that an attacker can convince a browser to use SSLv3 even when both the browser and server support a more recent version. Thus, this attack is widely applicable.

SSL/TLS has a perfectly good version negotiation mechanism that should prevent a browser and server that support a modern TLS version from using anything less. However, because some servers are buggy and don't implement version negotiation correctly, browsers break this mechanism by retrying connections with lesser SSL/TLS versions when TLS handshaking fails. These days we're more aware of the fact that fallback behaviour like this is a landmine for the future (as demonstrated today) but this TLS fallback behaviour was enshrined long ago and we're stuck with it. It means that, by injecting some trivial errors on the network, an attacker can cause a browser to speak SSLv3 to any server and then run the above attack.

What's to be done?

It's no revelation that this fallback behaviour is bad news. In fact, Bodo and I have a draft out for a mechanism to add a second, less bug-rusted mechanism to prevent it called TLS_FALLBACK_SCSV. Chrome and Google have implemented it since February this year and so connections from Chrome to Google are already protected. We are urging server operators and other browsers to implement it too. It doesn't just protect against this specific attack, it solves the fallback problem in general. For example, it stops attackers from downgrading TLS 1.2 to 1.1 and 1.0 and thus removing modern, AEAD ciphers from a connection. (Remember, everything less than TLS 1.2 with an AEAD mode is cryptographically broken.) There should soon be an updated OpenSSL version that supports it.

Even with TLS_FALLBACK_SCSV, there will be a long tail of servers that don't update. Because of that, I've just landed a patch on Chrome trunk that disables fallback to SSLv3 for all servers. This change will break things and so we don't feel that we can jump it straight to Chrome's stable channel. But we do hope to get it there within weeks and so buggy servers that currently function only because of SSLv3 fallback will need to be updated.

Chrome users that just want to get rid of SSLv3 can use the command line flag --ssl-version-min=tls1 to do so. (We used to have an entry in the preferences for that but people thought that “SSL 3.0” was a higher version than “TLS 1.0” and would mistakenly disable the latter.)

In Firefox you can go into about:config and set security.tls.version.min to 1. I expect that other browser vendors will publish similar instructions over the coming days.

As a server operator, it is possible to stop this attack by disabling SSLv3, or by disabling CBC-mode ciphers in SSLv3. However, the compatibility impact of this is unclear. Certainly, disabling SSLv3 completely is likely to break IE6. Some sites will be happy doing that, some will not.

A little further down the line, perhaps in about three months, we hope to disable SSLv3 completely. The changes that I've just landed in Chrome only disable fallback to SSLv3 – a server that correctly negotiates SSLv3 can still use it. Disabling SSLv3 completely will break even more than just disabling the fallback but SSLv3 is now completely broken with CBC-mode ciphers and the only other option is RC4, which is hardly that attractive. Any servers depending on SSLv3 are thus on notice that they need to address that now.

We hardened SSLv3 and TLS 1.0 against the BEAST attack with 1/n-1 record splitting and, based on an idea by Håvard Molland, it is possible to do something called anti-POODLE record splitting this time. I'll omit the details, but one can ensure that the last block in a CBC record contains at least fixed six bytes using only one split for AES and two for 3DES. With this, CBC is probably less bad than RC4. However, while anti-POODLE record splitting should be easy to deploy because it's valid according to the spec, so was 1/n-1 and deploying that was very painful. Thus there's a high risk that this would also cause compatibility problems. Therefore I'm not proceeding with anti-POODLE record splitting and concentrating on removing SSLv3 fallback instead. (There's also the possibility that an attacker could run the attack on the server to client direction. If we assume that both the client and the server get patched then we might as well assume that they are patched for TLS_FALLBACK_SCSV, which makes record splitting moot.)

PKCS#1 signature validation

On Wednesday, Chrome and Mozilla did coordinated updates to fix an RSA signature verification bug in NSS — the crypto library that handles SSL in Firefox and (currently) Chrome on most platforms. The updates should be well spread now and the bug has been detailed on Reddit, so I think it's safe to talk about.

(Hilariously, on the same day, bash turned out to have a little security issue and so hardly anyone noticed the NSS one!)

The NSS bug is another variant of Bleichenbacher's 2006 attack on RSA signature validation. To recap: an RSA signature is roughly the cube root of the message hash modulo the RSA modulus. Verification involves cubing the signature and checking that the result is the hash that you expected. Cubing modulo an RSA modulus is easy but finding a cube root is infeasible.

There's a little more to it because one needs to eliminate some properties of the RSA operation by formatting the hash that you're going to sign — called padding.

The standard for RSA signing and encryption is PKCS#1 version 1.5. The PKCS series of standards are amazing. Although I've no doubt that the people writing them were doing their best, it was a long time ago and mistakes were made. In a modern light, they are all completely terrible. If you wanted something that was plausible enough to be widely implemented but complex enough to ensure that cryptography would forever be hamstrung by implementation bugs, you would be hard pressed to do better. If you can find some implementers, just say "PKCS#11!" or "PKCS#12!" as punchlines. You'll get a good laugh, although possibly also a few panic attacks.

(PKCS#1 version 2 is much better, but only because they took it from Bellare and Rogaway and slapped the PKCS#1 title on it.)

PKCS#1 version 1.5 wanted to include an identifier for the hash function that's being used, inside the signature. This is a fine idea, but they did it by encoding the algorithm and hash value with ASN.1. This caused many implementations to include the complexity of an ASN.1 parser inside signature validation and that let the bugs in.

Bleichenbacher's original attack was based on the observation that the ASN.1 parsers, in many cases, didn't reject extra trailing data. This is reasonable behaviour for a generic ASN.1 parser, but a disaster in signature verification. Because the parser could ignore so much of the signature, Bleichenbacher could arrange for a perfect cube to have a suitable prefix and thus calculate the cube root over the integers — ignoring the RSA modulus completely!

That was fixed, but there was another common bug: the ASN.1 structure used to identify the hash was a standard structure called AlgorithmIdentifier, which includes an optional parameter. Implementations were ignoring the parameter and that also introduced a vulnerability: arbitrary bytes could be inserted as a parameter and that was sufficient to forge signatures. (See section five of this paper for details.)

This week's vulnerability was similar. Antoine Delignat-Lavaud (who, by the way, has been doing stellar work along with the Prosecco team at INRIA: see Triple Handshake, Cookie Cutter, SPDY and virtual host attacks and miTLS) noticed that the check on the contents of the parameters in NSS wasn't very strict — it was only checking that the length was at most two bytes. This is because, due to complexity, there wasn't universal agreement on what the the parameter should be. The ASN.1 for the parameter is an ANY type (with the concrete type depending on a preceding object id) but also optional. So, when not needed, should it be an ASN.1 NULL (which is two bytes long), or should it be omitted completely? The answer is the former, but it was underspecified for a long time.

Once Antoine had put a spotlight on the code, Brian Smith noticed something worse: an integer overflow in the ASN.1 parser. ASN.1 (in DER form at least, because, of course, ASN.1 has multiple defined encodings) has variable-length lengths, and NSS was using a generic ASN.1 parser which didn't check for overflow. So you could specify that a length was arbitrarily long and the parser would do something similar to:

unsigned length = 0;
for (i = 0; i < length_of_length; i++) {
  length <<= 8;
  length |= length[i];

Thus, as long as the last 4 or 8 bytes (depending on whether the system was 32 or 64 bit) encoded the correct length, the bytes before that would be ignored. That allows arbitrary bytes in the signature again and the attack from section 5 of the previous paper can be still be used to make forged signatures.

(Intel's ATR group also reported the same issue to Mozilla a little bit later. Bugs often seem to work like scientific discovery.)

The moral of the story

The moral here is that an ASN.1 parser should never have been put somewhere so sensitive; parsing is dangerous. It was a mistake to have defined PKCS#1 that way, but it can be ameliorated: rather than parsing, generate the expected signature contents and compare it against the plaintext. Go has always done this. I believe that SSH does this. Bouncy Castle reportedly does this. BoringSSL does this because I changed it from OpenSSL (prior to learning about the NSS issue — it's just a good idea). NSS does it now, although they're still worried about the omitted parameter vs NULL parameter confusion so they serialise two ASN.1 outputs and compare against both.

I generated a number of tests of different possible points of flexibility in signature parsing so that other libraries can be tested. For example, you can test OpenSSL (which still uses an ASN.1 parser) against them like this:

for cert in $(ls *.crt | grep -v root2048.crt); do
  openssl verify -CAfile root2048.crt $cert 2>> /dev/null | grep -q "$cert: OK"
  if [ $? -eq 0 ] ; then
    echo "$cert accepted"

The file control.crt should always be accepted. The file missingnull.crt will be accepted if you allow the parameter to be omitted. (I recommend against that. BoringSSL doesn't allow it and we'll see whether we can make that stick.) Nothing else should be accepted.

Sadly, OpenSSL 1.0.1 also accepts highvaltag.crt (ordinary ASN.1 tags in high-value form) and indeflen.crt (BER indefinite lengths rather than DER). The OpenSSL development branch also accepts bernull.crt (superfluous zero bytes in a length). To be clear, there is no known vulnerability here, but it's unwelcome flexibility in something that should be rigid. (I let OpenSSL team know on Wednesday and that they're welcome to use the code from BoringSSL.)

Remember, we use PKCS in browsers because we have to deal with the world as we find it, not as we want it. If you are not so encumbered, consider something simpler.

A couple more formal systems

After my last post on formal analysis of C code, people pointed out several other options to me and I wanted to do a follow up because the conclusion is a little more positive now.

Firstly, I noted last time that Z3 was the best, Why3-compatible SMT solver for my problem but that it came with a non-commercial license. Craig Stuntz pointed out that Microsoft sell a commercial license for $10K. Sure, that's quite a lot for a hobbyist project, but it's easy for a corporation and a lot easier than having to negotiate specially.

The first additional thing that I've looked into is AutoCorres. I mentioned last time that SeL4's refinement proof to C code involved a C parser in Isabelle that output structures in the Simpl framework and that Simpl was pretty intimidating.

AutoCorres is a tool for verifiably simplifing the Simpl structures and thus making proofs easier. Let's repeat the example from last time and show how AutoCorres helps. Here's the test C function:

int add(int a, int b) {
  return a+b;

And here's the raw result from the C parser, exactly as I showed last time:

test_global_addresses.add_body ≡
  Guard SignedArithmetic ⦃-2147483648 ≤ sint ´a + sint ´b ∧ sint ´a + sint ´b ≤ 2147483647⦄
   (creturn global_exn_var_'_update ret__int_'_update (λs. a_' s + b_' s));;
  Guard DontReach {} SKIP

And here's what AutoCorres makes of it:

add' ?a ?b ≡
  DO oguard (λs. INT_MIN ≤ ?a + ?b);
     oguard (λs. ?a + ?b ≤ INT_MAX);
     oreturn (?a + ?b)

That's very impressive! That's using a Maybe monad (using the Haskell name) in Isabelle to handle the possible overflow. Obviously that's a lot easier to deal with than the raw Simpl. Here's the field addition function that I was using for the other examples in the previous post:

void add(int *out, const int *x, const int *y) {
  unsigned int i;
  unsigned int size = 10;
  for (i = 0; i < size; i++) {
    out[i] = x[i] + y[i];

And here it is after AutoCorres:

add' ?out ?x ?y ≡
do i ← whileLoop (λi s. i < 0xA)
          (λi. do guard (λs. is_valid_w32 s (ptr_coerce (?out +⇩p uint i)));
                  guard (λs. is_valid_w32 s (ptr_coerce (?x +⇩p uint i)));
                  guard (λs. is_valid_w32 s (ptr_coerce (?y +⇩p uint i)));
                  guard (λs. INT_MIN ≤ sint (ucast s[ptr_coerce (?x +⇩p uint i)]) + sint (ucast s[ptr_coerce (?y +⇩p uint i)]));
                  guard (λs. sint (ucast s[ptr_coerce (?x +⇩p uint i)]) + sint (ucast s[ptr_coerce (?y +⇩p uint i)]) ≤ INT_MAX);
                  modify (λs. s[ptr_coerce (?out +⇩p uint i) := s[ptr_coerce (?x +⇩p uint i)] + s[ptr_coerce (?y +⇩p uint i)]]);
                  return (i + 1)

While the AutoCorres output looks very clean, there isn't any documentation at the moment which means that it's a little hard to actually use. (There are a couple of papers on the home page that are well worth reading - but I didn't find anything like a user guide.) There are some examples from which it's possible to make some guesses, but I wasn't able to get very far in a proof. Also, since AutoCorres builds on Simpl and C parser, it has the same limitations - most notably that the address of local variables can't be taken.

None the less, AutoCorres holds a lot of promise.

Next, is VST from Andrew Appel and others (yes, that Appel). VST is a framework for proving the behaviour of C code with Coq by using one of the intermediate languages from CompCert. Coq (which I'm pronouncing as “coke”) is a proof system like Isabelle and CompCert is a formally verified compiler. That means that it's possible to have a chain of formal verification from C code all the way to the semantics of the machine language! Additionally, the C parser isn't correctness critical because VST proves properties using the same parse tree as the verified compiler. That's an extremely strong verification standard.

CompCert is a mixture of GPL and “non-commercial” licensed code. It appears that everything needed for VST is licensed as GPL or another free license. So, one can also use VST to verify C code and then use a different compiler to build it if the non-commercial restriction is a problem. That moves the CompCert C parser and the other compiler into the trusted set, but so it goes.

The VST logic also appears to be very capable. It doesn't have a problem with pointers to local variables and should even have the capability to support reasoning about concurrent programs. The restrictions that it does have are local changes: only one memory load per line (at the moment) or side effect per statement and void functions need an explicit return at the end. (Unfortunately, one doesn't get an error from violating these requirements - rather the proof cannot be completed.)

VST thankfully, has brief documentation online which is expanded upon in the form of a book ($15 as a DRMed download). The book contains a very abstract articulation of the underlying logic which, to be honest, defeated me initially. I'd recommend skipping chapters 2 through 21 on a first reading and coming back to them later. They are useful, but they're also a lot clearer after having played with the software for a while.

Here's the same example function, translated into VST style. I switched back from having a special output array because that was something that I changed to make the SPARK verification easier but it was never part of the original code and VST doesn't have problem with overlapping the input and output. I've also switched from a for loop to a while loop because the documentation and examples for the latter are better. Otherwise, the changes are just to limit each statement to a single memory access and to include the final return.

void add(int *a, int *b) {
	int t1, t2;
	int i = 0;
	while (i < 10) {
		t1 = a[i];
		t2 = b[i];
		a[i] = t1 + t2;

We can use clightgen from CompCert to parse that into the intermediate language that CompCert and VST use. (A word to the wise - if you want to install VST, use exactly Coq 8.4p2, not the latest version which is 8.4p4.) Then we can use Coq to prove its behaviour.

Compared to Isabelle, Coq's proofs aren't as clear as the Isar style proofs in Isabelle (although the terseness is more useful when proofs are this complex) and neither Proof General nor CoqIDE are as nice as Isabelle's jedit mode. But it's not otherwise dramatically different.

Here's a specification of add in VST/Coq's language:

Definition bound_int (v : val) (b : Z) :=
  match v with
  | Vint i => -b < (Int.signed i) < b
  | _ => False

Definition partial_add (i: Z) (l: Z -> val) (r: Z -> val) (j: Z) :=
  if zlt j i then Val.add (l j) (r j) else (l j).

Definition add_spec :=
  DECLARE _add
  WITH a0 : val, b0 : val, sh : share, orig_a : Z -> val, orig_b : Z -> val
  PRE [_a OF (tptr tint), _b OF (tptr tint)]
    PROP (writable_share sh;
          forall i, 0 <= i < 10 -> is_int (orig_a i);
          forall i, 0 <= i < 10 -> is_int (orig_b i);
          forall i, 0 <= i < 10 -> bound_int (orig_a i) 134217728;
          forall i, 0 <= i < 10 -> bound_int (orig_b i) 134217728)
    LOCAL (`(eq a0) (eval_id _a);
           `(eq b0) (eval_id _b);
           `isptr (eval_id _a);
           `isptr (eval_id _b))
    SEP (`(array_at tint sh orig_a 0 10 a0);
         `(array_at tint sh orig_b 0 10 b0))
  POST [ tvoid ]
    PROP ()
    LOCAL ()
    SEP (`(array_at tint sh (partial_add 10 orig_a orig_b) 0 10 a0);
         `(array_at tint sh orig_b 0 10 b0)).

The partial_add function implements a map that reflects the state of the a array at step i of the loop. I've written it like that so that it can also be used in the loop invariant.

And here's the proof: It is quite long, but at least half of that is because I've not used Coq before and I don't know what I'm doing. I wouldn't call it readable however.

Definition inv orig_a a0 orig_b b0 sh :=
  EX i:Z,
    (PROP (0 <= i < 11;
           isptr a0;
           isptr b0)
     LOCAL (`(eq a0) (eval_id _a);
            `(eq b0) (eval_id _b);
            `(eq (Vint (Int.repr i))) (eval_id _i))
   SEP (`(array_at tint sh (partial_add i orig_a orig_b) 0 10 a0);
        `(array_at tint sh orig_b 0 10 b0))).

Lemma mod_nop : forall (i : Z) (m : Z), 0 <= i < m -> m > 0 -> i mod m = i.
rewrite Zdiv.Zmod_eq.
apply Zdiv_small.
exact H.
rewrite H1.
exact H0.

Lemma body_sumarray: semax_body Vprog Gprog f_add add_spec.
  (inv orig_a a0 orig_b b0 sh)
     PROP ()
     LOCAL ()
     SEP (`(array_at tint sh (partial_add 10 orig_a orig_b) 0 10 a0);
          `(array_at tint sh orig_b 0 10 b0))).
apply exp_right with 0.
rewrite H7.
unfold partial_add.
assert(is_int (orig_a i)).
assert(is_int (orig_b i)).
exact H7.
apply prop_right.
assert(Vint (Int.add _id1 _id) = partial_add (i+1) orig_a orig_b (Int.signed (Int.repr i))).
Focus 2.
symmetry in H9.
apply H9.
unfold partial_add.
assert(Int.signed (Int.repr i) = i).
rewrite Int.signed_repr.
unfold Int.max_signed, Int.min_signed.
rewrite H9.
unfold Val.add.
rewrite Int.signed_repr_eq in H7.
rewrite mod_nop in H7.
if_tac in H7.
unfold partial_add in H7.
if_tac in H7.
rewrite Int.signed_repr_eq in H6.
rewrite mod_nop in H6.
if_tac in H6.
unfold Int.add, Int.unsigned, Int.intval, Int.repr.
assert(bound_int (orig_a i) 134217728).
unfold bound_int in H12.
symmetry in H7, H6.
rewrite H7.
rewrite H6.
assert(Int.modulus = 4294967296).
rewrite H13.
assert(Int.modulus = 4294967296).
assert(i < Int.half_modulus).
assert(Int.half_modulus = 2147483648).
rewrite H12.
rewrite H12 in H11.
assert(Int.modulus = 4294967296).
assert(Int.modulus = 4294967296).
rewrite H11.
unfold inv.
apply exp_right with (Zsucc i).
apply derives_refl'.
apply equal_f.
apply array_at_ext.
unfold upd.
rewrite H10.
unfold Z.succ.
unfold partial_add.

That's certainly quite a chunk! Practically you have to step through the proof in Proof General and see the state evolve to understand anything. Additionally, when in the proof, it would be very useful if subgoals has some sort of description like “show the loop invariant plus not the loop condition results in the loop post-condition” - it's very easy to get lost in the proof. But I do feel that, while it would be a lot of work, I could make progress with VST, while I don't (at the moment) feel with other systems. Prof. Appel recently did a formal verification of the SHA256 from OpenSSL.

However, there is no community around VST that I can find - no mailing list, wiki, etc. The Subversion repo is only accessible via HTTP - you can't clone it from what I can tell. I think I found a (completeness) bug in VST, but the only option would be to try emailing Prof. Appel. (I didn't; I'm probably wrong.)

Still, AutoCorres and especially VST leave me more optimistic than I was at the end of the last post!


I didn't have time to look at everything that deserved attention. Cryptol is one. Cryptol is a tool written in Haskell designed for the specification and implementation of cryptographic code and can call out to SMT solvers to show certain properties. From its internal language it can export implementations to (at least) C.

Next, Frama-C, which I mentioned last time in relation to the Jessie plugin, has other plugins. One that's useful is the value analysis, which can be properly used to eliminate NULL derefs, array overruns etc. In fact, Polar SSL has been fully validated in certain configurations using that. That's certainly very valuable, and you can do it without that multi-page Coq proof! Although such analysis wouldn't have caught the carry bug in Ed25519 nor the reduction bug in Donna, not all code is going to be suitable for formal analysis.

Microsoft also have lots of things. There's Dafny, VCC (Verifier for Concurrent C), Boogie, Havoc, Z3 and others! I didn't look at them because I don't have a Windows box to hand this week (although I see that some are exposed via a web interface too). I was thinking that maybe I'd have time when a Windows box was to hand but, realistically, probably not. So I've updated this post to mention them here. If you are Windows based, you will probably do well to pay attention to them first.

A shallow survey of formal methods for C code

Two interesting things in formally verified software happened recently. The big one was the release of SeL4 - a formally verified L4 microkernel. The second was much smaller, but closer to my usual scope: a paper which showed the correctness of sections of a couple of the assembly implementations of Curve25519.

Overflow and carry bugs are subtle and, although with 32-bit words you might hope to be able to do enough random testing to eliminate them, that's much less plausible with 64-bit implementations. The TweetNaCl paper mentions a bug that lived in one of the assembly implementations of ed25519 for years and I've sinned too:

When I wrote curve25519-donna (back when it was the only 64-bit implementation of Curve25519), I first wrote a 32-bit version which mirrored the implementation of the athlon assembly version. This was just to provide a reference for the 64-bit code, but it was included in the repository for education. It was never really supposed to be used, and wasn't constant-time, but I was aware that it was being used by some groups.

Many years later, someone noticed that it was missing a reduction in the final contraction. Values between 2255-19 and 2255 would be output when they should have been reduced. This turned out to be harmless (as best I can tell), but embarrassing none the less.

And Curve25519 is a very nice curve to implement! Take a look at the overflow considerations for this version of P-256. I hope that I got everything right there, but it's very complex. More formal verification of the software would be welcome!

SeL4 has a multi-stage verification using refinement. Refinement is the process of showing that two pieces of code behave identically where one version is more abstract. SeL4's proof refines from an abstract specification to a Haskell implementation to the C implementation. It also has a SAT-solver based proof that the ARMv6 assembly matches the C implementation.

The refinement proofs are done in Isabelle/HOL which, along with Coq, are the major proof systems. (There are several other systems including HOL Light, HOL4 and Mizar.) Proof systems assist in the construction of automated proofs using simple rules of inference and axioms. Although not used for software verification, Metamath is a good introduction to the idea. It clearly explains its axioms (which Isabelle and Coq are very bad at) and gives an idea for the scale of formal proofs with an example of 2+2 = 4.

The best introduction to Isabelle that I know of is the book Concrete Semantics, although I do wish that it would start with Isar style proofs much sooner. If you're doing work in Isabelle I think you need page 57 printed and stuck to the wall and if you're going through that book and exercises I'd recommend peeking at the Isar chapter sooner.

But Isabelle's traditional workflow is to write in its functional language and export to OCaml or Haskell. That's no help if we're seeking to prove things about C code.

Isabelle's theorems are objects in its underlying ML language and so you can write a program in ML that parses C and use the result in Isabelle. That's what SeL4 does. The underlying framework for expressing imperative languages in Isabelle is Schirmer's Simpl and this imposes some odd limitations on the C that can be processed. For one, all the local variables with the same name in a given .c file have to have the same type because the local state of all functions in a .c file is represented in a single struct. For the same reason, it's not possible to take the address of a local variable.

But we can try parsing a very simple C file with SeL4's parser:

int add(int a, int b) {
  return a+b;

That fits SeL4's C subset (called StrictC - see l4v/tools/c-parser/doc in the SeL4 source) and results in this Simpl construct (you'll need to read the Simpl paper to really understand it):

test_global_addresses.add_body ≡
  Guard SignedArithmetic ⦃-2147483648 ≤ sint ´a + sint ´b ∧ sint ´a + sint ´b ≤ 2147483647⦄
   (creturn global_exn_var_'_update ret__int_'_update (λs. a_' s + b_' s));;
  Guard DontReach {} SKIP

There's clearly an overflow check in there, which is good, but to call the process of starting to prove things about it intimidating to the beginner would be an understatement. That and the oddities of the C subset motivated me to look further.

Dependent types are closely related to this problem so I checked Wikipedia for the dependently typed languages which support imperative programming and that are still active. That's just ATS and F*, according to that table. ATS doesn't deal with overflows and, while F*/F7 is worthy of a look because of miTLS, it's a CLR language and so not applicable here.

Next up, based on searches, is SPARK. SPARK is a subset of Ada designed for verification. There's a commercial company behind it, but versions are published under the GPL. It even comes with an IDE.

SPARK uses an SMT solver to prove correctness, which is very different from a proof in Isabelle. While an Isabelle proof is, by construction, just an application of axioms and rules of inference, an SMT solver is essentially a magic box into which you feed a proposition and out of which (maybe) comes a true/false answer. Trusting in an SMT solver is much, much stronger than not doing so, but it is a step down from a formal proof system. SPARK uses a version of Why3 to abstract over SMT solvers and we'll discuss Why3 more later.

Anyway, here's the same, trivial function in Ada:

function Add(X, Y : in Integer_32) return Integer_32 is
   return X + Y;
end Add;

If we ask SPARK to process that then it generates verification conditions (VCs): one for each possible thing that could go wrong - overflow, array indexing etc. For that code it throws an error saying that X + Y might overflow, which is promising! In order to eliminate that, one needs to prove the verification condition that says that the addition is safe by adding preconditions to the function (which then become VCs at each callsite):

function Add(X, Y : in Integer_32) return Integer_32 with
  Pre => X < 2**30 and X > -2*30 and Y < 2**30 and Y > -2**30;

With that precondition, the function is accepted. I should note that this is SPARK 2014; there are many older versions (SPARK has been going for a long time) and they kept preconditions and the like in comments. Lots of pages about SPARK still talk about the older versions and major projects in SPARK (like Ironsides - a DNS server) still use the older versions.

With that initial success, let's try something a tiny bit more involved. Curve25519 deals with 255-bit integers but CPUs don't. So, in the same way that a 64-int integer could be represented with a pair of 32-bit integers, 255-bit integers are represented in Donna with ten, 32-bit integers. This representation is a little odd because the values aren't spaced at 32-bit multiples, but rather at alternating 26 and 25 bit positions. That also means that the representation is non-unique (setting bit 26 of the first word is equal to setting bit zero of the second) and that's done for performance. See this for more details, but it's not important right now.

Adding these 255-bit integers is done simply by adding the underlying words without carry, with suitable bounds on the inputs so that overflow doesn't happen:

type Ints is array (Integer range 0 .. 9) of Integer_32;

function Add (X, Y : in Ints) return Ints with
   Pre => (for all i in Ints'Range => X(i) < 2**30 and X(i) > -2**30 and Y(i) < 2**30 and Y(i) > -2**30),
   Post => (for all i in Ints'Range => Add'Result(i) = X(i) + Y(i));

function Add (X, Y : in Ints) return Ints is
   Sum : Ints;
   for i in Ints'Range loop
      Sum(i) := X(i) + Y(i);
   end loop;

   return Sum;
end Add;

That mostly works, but SPARK can't prove that the result is X(i) + Y(i) for all i despite it being exactly what the function says. One really needs to spoon feed the prover: in this case with a loop invariant in the for loop:

for i in Ints'Range loop
   pragma Loop_Invariant (for all j in Ints'First .. i-1 => Sum(j) = X(j) + Y(j));
   Sum(i) := X(i) + Y(i);
end loop;

Sadly, that doesn't work either, despite being correct, because SPARK doesn't seem to like i-1 when i might be zero. So, trying again:

for i in Ints'Range loop
   Sum(i) := X(i) + Y(i);
   pragma Loop_Invariant (for all j in Ints'First .. i => Sum(j) = X(j) + Y(j));
end loop;

That one works, but now SPARK isn't sure whether uninitialised elements of Sum are being referenced. The property of being initialised isn't part of SPARK's logic and seems that the only way to proceed here is to disable that warning!

But, there's some promise! We would now like to show a higher level property of Add: that the 255-bit integer that the returned array represents is the sum of the integers that the input arrays represent. This means that the logic needs to deal with arbitrary integers and that we need to write a function in the logic that turns an array into an abstract integer.

Enabling arbitrary integers in the logic is possible (although good luck finding how to do it in the documentation: the answer is to put a file called gnat.adc in the gnatprove subdirectory of the project containing pragma Overflow_Mode (General => Strict, Assertions => Eliminated);). However, I failed at writing functions in the logic without dropping down into code and triggering overflow checks. It appears that ghost functions should be able to do this but, after getting the answer to a documentation bug on StackOverflow, actually trying to use any logic functions, even the identity function, caused the proof not to terminate. SPARK claims to be able to use Isabelle for proofs but I couldn't get that to work at all: strace says that the .thy file isn't ever written.

Stuck, I moved on from SPARK and tried Frama-C and its Jessie plugin. Even if SPARK was working for me, using C (even a subset) has advantages: it's convenient to use in other projects, there exist verified compilers for it and there's an extension to CompCert that allows for static verification of constant-time behaviour (although I've lost the paper!)

So here's the same function in Frama-C:

/*@ requires \valid_range(out, 0, 9);
  @ requires \valid_range(x, 0, 9);
  @ requires \valid_range(y, 0, 9);
  @ requires \forall integer i; 0 <= i < 10 ==>
      (x[i] > -1000 && x[i] < 1000);
  @ requires \forall integer i; 0 <= i < 10 ==>
      (y[i] > -1000 && y[i] < 1000);
  @ ensures \forall integer i; 0 <= i < 10 ==> (out[i] == x[i] + y[i]);
void add(int *out, const int *x, const int *y) {
  /*@ loop invariant i >= 0 && i <= 10 &&
        (\forall integer j; 0 <= j < i ==> out[j] == x[j] + y[j]);
    @ loop variant 10-i;
  for (int i = 0; i < 10; i++) {
    out[i] = x[i] + y[i];

Note that this time we not only need a loop invariant to show the postcondition, but we also need a loop variant to show that the for loop terminates: you really need to spoon feed the verification! But Frama-C/Jessie has no problem with functions in the logic:

/*@ logic integer felemvalue (int *x) =
      x[0] + x[1] * (1 << 26) + x[2] * (1 << 51) + x[3] * (1 << 77) +
      x[4] * (1 << 102) + x[5] * 340282366920938463463374607431768211456 +
      x[6] * 11417981541647679048466287755595961091061972992 +
      x[7] * 766247770432944429179173513575154591809369561091801088 +
      x[8] * 25711008708143844408671393477458601640355247900524685364822016 +
      x[9] * 1725436586697640946858688965569256363112777243042596638790631055949824; */

That function maps from an array to the abstract integer that it represents. Note that the terms switched from using shifts to represent the constants to literal numbers. That's because the proof process suddenly became non-terminating (in a reasonable time) once the values reached about 104-bits, but the literals still worked.

With that logic function in hand, the we can easily prove the higher-level concept that we wanted as a post condition of the add function:

@ ensures felemvalue(out) == felemvalue(x) + felemvalue(y);

Flushed with success, I moved onto the next most basic function: multiplication of 255-bit integers. The code that Donna uses is the textbook, polynomial multiplication code. Here's how the function starts:

/* Multiply two numbers: output = in2 * in
 * output must be distinct to both inputs. The inputs are reduced coefficient
 * form, the output is not.
 * output[x] <= 14 * the largest product of the input limbs. */
static void fproduct(limb *output, const limb *in2, const limb *in) {
  output[0] =       ((limb) ((s32) in2[0])) * ((s32) in[0]);
  output[1] =       ((limb) ((s32) in2[0])) * ((s32) in[1]) +
                    ((limb) ((s32) in2[1])) * ((s32) in[0]);
  output[2] =  2 *  ((limb) ((s32) in2[1])) * ((s32) in[1]) +
                    ((limb) ((s32) in2[0])) * ((s32) in[2]) +
                    ((limb) ((s32) in2[2])) * ((s32) in[0]);

Each of the lines is casting a 64-bit number down to 32 bits and then doing a 32×32⇒64 multiplication. (The casts down to 32 bits are just to tell the compiler not to waste time on a 64×64⇒64 multiplication.) In Frama-C/Jessie we need to show that the casts are safe, multiplications don't overflow and nor do the sums and multiplications by two etc. This should be easy. Here are the preconditions that set the bounds on the input.

/*@ requires \valid_range(output, 0, 18);
  @ requires \valid_range(in, 0, 9);
  @ requires \valid_range(in2, 0, 9);
  @ requires \forall integer i; 0 <= i < 10 ==> -134217728 < in[i] < 134217728;
  @ requires \forall integer i; 0 <= i < 10 ==> -134217728 < in2[i] < 134217728;

However, Why3, which is the prover backend that Jessie (and SPARK) uses, quickly gets bogged down. In an attempt to help it out, I moved the casts to the beginning of the function and put the results in arrays so that the code was less complex.

As you can see in the Jessie documentation, the Why3 UI allows one to do transforms on the verification conditions to try and make life easier for the prover. Quickly, splitting a requirement becomes necessary. But this throws away a lot of information. In this case, each of the 32×32⇒64 multiplications is assumed to fit only within its bounds and so the intermediate bounds need to be reestablished every time. This means that the equations have to be split like this:

limb t, t2;
t = ((limb) in2_32[0]) * in_32[1];
//@ assert -18014398509481984 < t < 18014398509481984;
t2 = ((limb) in2_32[1]) * in_32[0];
//@ assert -18014398509481984 < t2 < 18014398509481984;
output[1] = t + t2;

That's only a minor problem really, but the further one goes into the function, the harder the prover needs to work for some reason. It's unclear why because every multiplication has the same bounds - they are all instances of the same proof. But it's very clear that the later multiplications are causing more work.

Why3 is an abstraction layer for SMT solvers and a large number are supported (see “External provers” on its homepage for a list). I tried quite a few and, for this task, Z3 is clearly the best with CVC4 coming in second. However, Z3 has a non-commercial license which is very worrying - does that mean that a verified version of Curve25519 that used Z3 has to also have a non-commercial license? So I stuck to using CVC4.

However, about a quarter of the way into the function, both CVC4 and Z3 are stuck. Despite the bounds being a trivial problem, and just instances of the same problem that they can solve at the beginning of the function, somehow either Why3 is doing something bad or the SMT solvers are failing to discard irrelevant facts. I left it running overnight and Z3 solved one more instance after six hours but achieved nothing after 12 hours on the next one. Splitting and inlining the problem further in the Why3 GUI didn't help either.

Like SPARK, Jessie can use Isabelle for proofs too (and for the same reason: they are both using Why3, which supports Isabelle as a backend). It even worked this time, once I added Real to the Isabelle imports. However, in the same way that the C parser in Isabelle was an ML function that created Isabelle objects internally, the Why3 connection to Isabelle is a function (why3_open) that parses an XML file. This means that the proof has large numbers of generated variable names (o33, o34, etc) and you have no idea which intermediate values they are. Additionally, the bounds problem is something that Isabelle could have solved automatically, but the assumptions that you can use are similarly autonamed as things like local.H165. In short, the Isabelle integration appears to be unworkable.

Perhaps I could split up each statement in the original code into a separate function and then write the statement again in the logic in order in order to show the higher level properties, but at some point you have to accept that you're not going to be able to dig this hole with a plastic spoon.


The conclusion is a bit disappointing really: Curve25519 has no side effects and performs no allocation, it's a pure function that should be highly amenable to verification and yet I've been unable to find anything that can get even 20 lines into it. Some of this might be my own stupidity, but I put a fair amount of work into trying to find something that worked.

There seems to be a lot of promise in the area and some pieces work well (SMT solvers are often quite impressive, the Frama-C framework appears to be solid, Isabelle is quite pleasant) but nothing I found worked well together, at least for verifying C code. That makes efforts like SeL4 and Ironsides even more impressive. However, if you're happy to work at a higher level I'm guessing that verifying functional programs is a lot easier going.

HSTS for new TLDs

Whatever you might think of them, the new TLDs are rapidly arriving. New TLDs approved so far this month include alsace, sarl, iinet, poker, gifts, restaurant, fashion, tui and allfinanz. The full list for last month is over twice as long.

That means that there's lots of people currently trying to figure out how to differentiate themselves from other TLDs. Here's an idea: why not ask me to set HSTS for the entire TLD? That way, every single site runs over HTTPS, always. It strikes me that could be useful if you're trying to build trust with users unfamiliar with the zoo of new domains.

(I can't speak for Firefox and Safari but I think it's safe to assume that Firefox would be on board with this. It's still unclear whether IE's support for HSTS will include preloading.)

I'm guessing that, with such a large number of new TLDs, I should be able to reach at least some operators of them via this blog post.

Encrypting streams

When sending data over the network, chunking is pretty much a given. TLS has a maximum record size of 16KB and this fits neatly with authenticated encryption APIs which all operate on an entire message at once.

But file encryption frequently gets this wrong. Take OpenPGP: it bulk encrypts the data and sticks a single MAC on the end. Ideally everyone is decrypting to a temporary file and waiting until the decryption and verification is complete before touching the plaintext, but it takes a few seconds of searching to find people suggesting commands like this:

gpg -d your_archive.tgz.gpg | tar xz

With that construction, tar receives unauthenticated input and will happily extract it to the filesystem. An attacker doesn't (we assume) know the secret key, but they can guess at the structure of the plaintext and flip bits in the ciphertext. Bit flips in the ciphertext will produce a corresponding bit flip in the plaintext, followed by randomising the next block. I bet some smart attacker can do useful things with that ability. Sure the gpg command will exit with an error code, but do you think that the shell script writer carefully handled that case and undid the changes to the filesystem?

The flaw here isn't in CFB mode's malleability, but in OpenPGP forcing the use of unauthenticated plaintext in practical situations. (Indeed, if you are ever thinking about the malleability of ciphertext, you have probably already lost.)

I will even claim that the existance of an API that can operate in a streaming fashion over large records (i.e. will encrypt and defer the authenticator and will decrypt and return unauthenticated plaintext) is a mistake. Not only is it too easy to misunderstand and misuse (like the gpg example above) but, even if correctly buffered in a particular implementation, the existance of large records may force other implementations to do dangerous things because of a lack of buffer space.

If large messages are chunked at 16KB then the overhead of sixteen bytes of authenticator for every chunk is only 0.1%. Additionally, you can safely stream the decryption (as long as you can cope with truncation of the plaintext).

Although safer in general, when chunking one has to worry that an attacker hasn't reordered chunks, hasn't dropped chunks from the start and hasn't dropped chunks from the end. But sadly there's not a standard construction for taking an AEAD and making a scheme suitable for encrypting large files (AERO might be close, but it's not quite what I have in mind). Ideally such a scheme would take an AEAD and produce something very like an AEAD in that it takes a key, nonce and additional data, but can safely work in a streaming fashion. I don't think it need be very complex: take 64 bits of the nonce from the underlying AEAD as the chunk number, always start with chunk number zero and feed the additional data into chunk zero with a zero byte prefix. Prefix each chunk ciphertext with a 16 bit length and set the MSB to indicate the last chunk and authenticate that indication by setting the additional data to a single, 1 byte. The major worry might be that for many underlying AEADs, taking 64 bits of the nonce for the chunk counter leaves one with very little (or none!) left.

That requires more thought before using it for real but, if you are ever building encryption-at-rest, please don't mess it up like we did 20 years ago. (Although, even with that better design, piping the output into tar may still be unwise because an attacker can truncate the stream at a chunk boundary: perhaps eliminating important files in the process.)

Update: On Twitter, zooko points to Tahoe-LAFS as an example of getting it right. Additionally, taking the MAC of the current state of a digest operation and continuing the operation has been proposed for sponge functions (like SHA-3) under the name MAC-and-continue. The exact provenance of this isn't clear, but I think it might have been from the Keccak team in this paper. Although MAC-and-continue doesn't allow random access, which might be important for some situations.


Earlier this year, before Apple had too many goto fails and GnuTLS had too few, before everyone learnt that TLS heart-beat messages were a thing and that some bugs are really old, I started a tidy up of the OpenSSL code that we use at Google.

We have used a number of patches on top of OpenSSL for many years. Some of them have been accepted into the main OpenSSL repository, but many of them don’t mesh with OpenSSL’s guarantee of API and ABI stability and many of them are a little too experimental.

But as Android, Chrome and other products have started to need some subset of these patches, things have grown very complex. The effort involved in keeping all these patches (and there are more than 70 at the moment) straight across multiple code bases is getting to be too much.

So we’re switching models to one where we import changes from OpenSSL rather than rebasing on top of them. The result of that will start to appear in the Chromium repository soon and, over time, we hope to use it in Android and internally too.

There are no guarantees of API or ABI stability with this code: we are not aiming to replace OpenSSL as an open-source project. We will still be sending them bug fixes when we find them and we will be importing changes from upstream. Also, we will still be funding the Core Infrastructure Initiative and the OpenBSD Foundation.

But we’ll also be more able to import changes from LibreSSL and they are welcome to take changes from us. We have already relicensed some of our prior contributions to OpenSSL under an ISC license at their request and completely new code that we write will also be so licensed.

(Note: the name is aspirational and not yet a promise.)

Early ChangeCipherSpec Attack

OpenSSL 1.0.1h (and others) were released today with a scary looking security advisiory and that's always an event worth looking into. (Hopefully people are practiced at updating OpenSSL now!)

Update: the original reporter has a blog post up. Also, I won't, personally, be answering questions about specific Google services. (I cut this blog post together from notes that I'm writing for internal groups to evaluate and this is still very fresh.)

Update: my initial thoughts from looking at the diff still seem to be holding up. Someone is welcome to write a more detailed analysis than below. HP/ZDI have a write up of one of the DTLS issues.

There are some critical bug fixes to DTLS (TLS over datagram transports, i.e. UDP), but most people will be more concerned about the MITM attack against TLS (CVE-2014-0224).

The code changes are around the rejection of ChangeCipherSpec messages, which are messages sent during the TLS handshake that mark the change from unencrypted to encrypted traffic. These messages aren't part of the handshake protocol itself and aren't linked into the handshake state machine in OpenSSL. Rather there's a check in the code that they are only received when a new cipher is ready to be used. However, that check (for s->s3->tmp.new_cipher in s3_pkt.c) seems reasonable, but new_cipher is actually set as soon as the cipher for the connection has been decided (i.e. once the ServerHello message has been sent/received), not when the cipher is actually ready! It looks like this is the problem that's getting fixed in this release.

Here's the code in question that handles a ChangeCipherSpec message:

int ssl3_do_change_cipher_spec(SSL *s)
	int i;
	const char *sender;
	int slen;

	if (s->state & SSL_ST_ACCEPT)

	if (s->s3->tmp.key_block == NULL)1
		if (s->session == NULL)
			/* might happen if dtls1_read_bytes() calls this */
			return (0);

		if (!s->method->ssl3_enc->setup_key_block(s)) return(0); 2

	if (!s->method->ssl3_enc->change_cipher_state(s,i))

	/* we have to record the message digest at
	 * this point so we can get it before we read
	 * the finished message */
	if (s->state & SSL_ST_CONNECT)

	i = s->method->ssl3_enc->final_finish_mac(s,
		sender,slen,s->s3->tmp.peer_finish_md); 3
	if (i == 0)
		return 0;
	s->s3->tmp.peer_finish_md_len = i;


If a ChangeCipherSpec message is injected into the connection after the ServerHello, but before the master secret has been generated, then ssl3_do_change_cipher_spec will generate the keys (2) and the expected Finished hash (3) for the handshake with an empty master secret. This means that both are based only on public information. Additionally, the keys will be latched because of the check at (1) - further ChangeCipherSpec messages will regenerate the expected Finished hash, but not the keys.

The oldest source code on the OpenSSL site is for 0.9.1c (Dec 28, 1998) and the affected code appears almost unchanged. So it looks like this bug has existed for 15+ years.

The implications of this are pretty complex.

For a client there's an additional check in the code that requires that a CCS message appear before the Finished and after the master secret has been generated. An attacker can still inject an early CCS too and the keys will be calculated with an empty master secret. Those keys will be latched - another CCS won't cause them to be recalculated. However, when sending the second CCS that the client code requires, the Finished hash is recalculated with the correct master secret. This means that the attacker can't fabricate an acceptable Finished hash. This stops the obvious, generic impersonation attack against the client.

For a server, there's no such check and it appears to be possible to send an early CCS message and then fabricate the Finished hash because it's based on an empty master secret. However, that doesn't obviously gain an attacker anything. It would be interesting if a connection with a client certificate could be hijacked, but there is a check in ssl3_get_cert_verify that a CCS hasn't been already processed so that isn't possible.

Someone may be able to do something more creative with this bug; I'm not ruling out that there might be other implications.

Things change with an OpenSSL 1.0.1 server however. In 1.0.1, a patch of mine was included that moves the point where Finished values are calculated. The server will now use the correct Finished hash even if it erroneously processed a CCS, but this interacts badly with this bug. The server code (unlike the client) won't accept two CCS messages in a handshake. So, if an attacker injects an early CCS at the server to fixate the bad keys, then it's not possible for them to send a second in order to get it to calculate the correct Finished hash. But with 1.0.1, the server will use the correct Finished hash and will reply with the correct hash to the client. This explains the 1.0.1 and 1.0.2 mention in the advisory. With any OpenSSL client talking to an OpenSSL 1.0.1 server, an attacker can inject CCS messages to fixate the bad keys at both ends but the Finished hashes will still line up. So it's possible for the attacker to decrypt and/or hijack the connection completely.

The good news is that these attacks need man-in-the-middle position against the victim and that non-OpenSSL clients (IE, Firefox, Chrome on Desktop and iOS, Safari etc) aren't affected. None the less, all OpenSSL users should be updating.

Update: I hacked up the Go TLS library to perform the broken handshake in both 0.9.8/1.0.0 and 1.0.1 modes. There's a patch for crypto/tls and a tool that will attempt to do those broken handshakes and tell you if either worked. For example (at the time of writing):

$ ./earlyccs_check
Handshake failed with error: remote error: unexpected message
Looks ok.
$ ./earlyccs_check
Server is affected (1.0.1).
$ ./earlyccs_check
Server is affected (0.9.8 or 1.0.0).

Matching primitive strengths

It's a common, and traditional, habit to match the strengths of cryptographic primitives. For example, one might design at the 128-bit security level and pick AES-128, SHA-256 and P-256. Or if someone wants “better” security, one might target a 192-bit security level with AES-192, SHA-384 and P-384. NIST does this sort of thing, for example in 800-57/3, section 5.2.

The logic underlying this is that an attacker will attack at a system's weakest point, so anything that is stronger than the weakest part is a waste. But, while matching up the security level might have made sense in the past, I think it warrants reconsideration now.

Consider the energy needed to do any operation 2128 times: Landauer tells us that erasing a bit takes a minimum amount of energy any actual crypto function will involve erasing many bits. But, for the sake of argument, let's say that we're doing a single bit XOR. If we take the minimum energy at room temperature (to save energy on refrigeration) we get 2.85×10-21 × 2128 = 0.97×1018J. That's about “yearly electricity consumption of South Korea as of 2009” according to Wikipedia. If we consider that actually doing anything useful in each of those 2128 steps is going to take orders of magnitude more bit operations, and that we can't build a computer anywhere near the theoretic limit, we can easily reach, say, 5.5×1024J, which is “total energy from the Sun that strikes the face of the Earth each year”.

(The original version of this post discussed incrementing a 128-bit counter and said that it took two erasures per increment. Samuel Neves on Twitter pointed out that incrementing the counter doesn't involve destroying the information of the old value and several people pointed out that one can hit all the bit-patterns with a Gray code and thus only need to flip a single bit each time. So I changed the above to use a one-bit XOR as the basic argument still holds.)

So any primitive at or above the 128-bit security level is equally matched today, because they are all effectively infinitely strong.

One way to take this is to say that anything above 128 bits is a waste of time. Another is to say that one might want to use primitives above that level, but based on the hope that discontiguous advances leave it standing but not the lesser primitives. The latter involves a lot more guesswork than the old practice of making the security levels match up. Additionally, I don't think most people would think that an analytic result is equally likely for all classes of primitives, so one probably doesn't want to spend equal amounts of them all.

Consider the difference between P-256 and P-384 (or curve25519 and Curve41417 or Hamburg's Goldilocks-448). If they are all infinitely strong today then the reason to want to use a larger curve is because you expect some breakthrough, but not too much of a breakthrough. Discrete log in an elliptic-curve group is a square-root work function today. For those larger curves to make sense you have to worry about a breakthrough that can do it in cube-root time, but not really forth-root and certainly not fifth-root. How much would you pay to insure yourself against that, precise possibility? And how does that compare against the risk of a similar breakthrough in AES?

It is also worth considering whether the security-level is defined in a way matches your needs. AES-128 is generally agreed to have a 128-bit security level, but if an attacker is happy breaking any one of 2n keys then they can realise a significant speedup. An “unbalanced” choice of cipher key size might be reasonable if that attack is a concern.

So maybe a 256-bit cipher (ciphers do pretty well now, just worry about Grover's algorithm), SHA-384 (SHA-256 should be fine, but hash functions have, historically, had a bad time) and 128512(? see below)-bit McEliece is actually “balanced” these days.

(Thanks to Mike Hamburg and Trevor Perrin, with whom I was discussing this after Mike presented his Goldilocks curve. Thanks to JP Aumasson on Twitter and pbsd on HN for suggesting multi-target attacks as a motivation for large cipher key sizes. Also to pbsd for pointing out a quantum attack against McEliece that I wasn't aware of.)

SHA-256 certificates are coming

It's a neat result in cryptography that you can build a secure hash function given a secure signature scheme, and you can build a secure signature scheme given a secure hash function. However, far from the theory, in the real world, lots of signatures today depend on SHA-1, which is looking increasingly less like a secure hash function.

There are lots of properties by which one can evaluate a hash function, but the most important are preimage resistance (if I give you a hash, can you find an input with that hash value?), second preimage resistance (if I give you a message, can you find another that hashes to the same value?) and collision resistance (can you find two messages with the same hash?). Of those, the third appears to be much harder to meet than the first two, based on historical results against hash functions.

Back when certificates were signed with MD5, a chosen-prefix collision attack (i.e. given two messages, can you append different data to each so that the results have the same hash?) against MD5 was used at least twice to break the security of the PKI. First the MD5 Collisions Inc demo against RapidSSL and then the Flame malware.

Today, SHA-1 is at the point where a collision attack is estimated to take 261 work and a chosen-prefix collision to take 277. Both are below the design strength of 280 and even that design strength isn't great by today's standards.

We hope that we have a second line of defense for SHA-1: after the MD5 Collisions Inc demo, CAs were required to use random serial numbers. A chosen-prefix attack requires being able to predict the certificate contents that will be signed and the random serial number should thwart that. With random serials we should be resting on a stronger hash function property called target-collision resistance. (Although I'm not aware of any proofs that random serials actually put us on TCR.)

Still, it would be better not to depend on those band-aids and to use a hash function with a better design strength while we do it. So certificates are starting to switch to using SHA-256. A large part of that shift came from Microsoft forbidding certificates using SHA-1 starting in 2016.

For most people, this will have no effect. Twitter ended up with a SHA-256 certificate after they replaced their old one because of the OpenSSL heartbeat bug. So, if you can still load Twitter's site, then you're fine.

But there are a lot of people using Windows XP prior to Service Pack 3, and they will have problems. We've already seen lots of user reports of issues with Twitter (and other sites) from these users. Wherever possible, installing SP3 is the answer. (Or, better yet, updating from Windows XP.)

There are also likely to be problems with embedded clients, old phones etc. Some of these may not come to light for a while.

We've not yet decided what Google's timeline for switching is, but we will be switching prior to 2016. If you've involved with a something embedded that speaks to Google over HTTPS (and there's a lot of it), now is the time to test, which is using a SHA-256 certificate. I'll be using other channels to contact the groups that we know about but, on the web, we don't always know what's talking to us.

Revocation still doesn't work

I was hoping to be done with revocation for a bit, but sadly not.

GRC have published a breathless piece attacking a straw man argument: “Google tells us ... that Chrome's unique CRLSet solution provides all the protection we need”. I call it a straw man because I need only quote my own posts that GRC have read to show it.

The original CRLSets announcement contained two points. Firstly that online revocation checking doesn't work and that we were switching it off in Chrome in most cases. Secondly, that we would be using CRLSets to avoid having to do full binary updates in order to respond to emergency incidents.

In the last two paragraphs I mentioned something else. Quoting myself:

“Since we're pushing a list of revoked certificates anyway, we would like to invite CAs to contribute their revoked certificates (CRLs) to the list. We have to be mindful of size, but the vast majority of revocations happen for purely administrative reasons and can be excluded. So, if we can get the details of the more important revocations, we can improve user security. Our criteria for including revocations are:

  1. The CRL must be crawlable: we must be able to fetch it over HTTP and robots.txt must not exclude GoogleBot.
  2. The CRL must be valid by RFC 5280 and none of the serial numbers may be negative.
  3. CRLs that cover EV certificates are taken in preference, while still considering point (4).
  4. CRLs that include revocation reasons can be filtered to take less space and are preferred.”

In short, since we were pushing revocations anyway, maybe we could get some extra benefit from it. It would be better than nothing, which is what browsers otherwise have with soft-fail revocation checking.

I mentioned it again, last week (emphasis added):

“We compile daily lists of some high-value revocations and use Chrome's auto-update mechanism to push them to Chrome installations. It's called the CRLSet and it's not complete, nor big enough to cope with large numbers of revocations, but it allows us to react quickly to situations like Diginotar and ANSSI. It's certainly not perfect, but it's more than many other browsers do.”

“The original hope with CRLSets was that we could get revocations categorised into important and administrative and push only the important ones. (Administrative revocations occur when a certificate is changed with no reason to suspect that any key compromise occurred.) Sadly, that mostly hasn't happened.”

And yet, GRC managed to write pages (including cartoons!) exposing the fact that it doesn't cover many revocations and attacking Chrome for it.

They also claim that soft-fail revocation checking is effective:

The claim is that a user will download and cache a CRL while not being attacked and then be protected from a local attacker using a certificate that was included on that CRL. (I'm paraphrasing; you can search for “A typical Internet user” in their article for the original.)

There are two protocols used in online revocation checking: OCSP and CRL. The problem is that OCSP only covers a single certificate and OCSP is used in preference because it's so much smaller and thus removes the need to download CRLs. So the user isn't expected to download and cache the CRL anyway. So that doesn't work.

However, it's clear that someone is downloading CRLs because Cloudflare are spending half a million dollars a month to serve CRLs. Possibly it's non-browser clients but the bulk is probably CAPI (the library that IE, Chrome and other software on Windows typically uses - although not Firefox). A very obscure fact about CAPI is that it will download a CRL when it has 50 or more OCSP responses cached for a given CA certificate. But CAs know this and split their CRLs so that they don't get hit with huge bandwidth bills. But a split CRL renders the claimed protection from caching ineffective, because the old certificate for a given site is probably in a different part.

So I think the claim is that doing blocking OCSP lookups is a good idea because, if you use CAPI on Windows, then you might cache 50 OCSP responses for a given CA certificate. Then you'll download and cache a CRL for a while and then, depending on whether the CA splits their CRLs, you might have some revocations cached for a site that you visit.

It seems that argument is actually for something like CRLSets (download and cache revocations) over online checking, it's just a Rube Goldberg machine to, perhaps, implement it!

So, once again, revocation doesn't work. It doesn't work in other browsers and CRLSets aren't going to cover much in Chrome, as I've always said. GRC's conclusions follow from those errors and end up predictably far from the mark.

In order to make this post a little less depressing, let's consider whether its reasonable to aim to cover everything with CRLSets. (Which I mentioned before as well, but I'll omit the quote.) GRC quote numbers from Netcraft claiming 2.85 million revocations, although some are from certificate authorities not trusted by mainstream web browsers. I spent a few minutes harvesting CRLs from the CT log. This only includes certificates that are trusted by reasonable number of people and I only waited for half the log to download. From that half I threw in the CRLs included by CRLSets and got 2356 URLs after discarding LDAP ones.

I tried downloading them and got 2164 files. I parsed them and eliminated duplicates and my quick-and-dirty parser skipped quite a few. None the less, this very casual search found 1,062 issuers and 4.2 million revocations. If they were all dropped into the CRLSet, it would take 39MB.

So that's a ballpark figure, but we need to design for something that will last a few years. I didn't find figures on the growth of HTTPS sites, but Netcraft say that the number of web sites overall is growing at 37% a year. It seems reasonable that the number of HTTPS sites, thus certificates, thus revocations will double a couple of times in the next five years.

There's also the fact that if we subsidise revocation with the bandwidth and computers of users, then demand for revocation will likely increase. Symantec, I believe, have experimented with certificates that are issued for the maximum time (five years at the moment) but sold in single year increments. When it comes to the end of the year, they'll remind you to renew. If you do, you don't need to do anything, otherwise the certificate gets revoked. This seems like a good deal for the CA so, if we make revocation really effective, I'd expect to see a lot more of it. Perhaps factor in another doubling for that.

So we would need to scale to ~35M revocations in five years, which is ~300MB of raw data. That would need to be pushed to all users and delta updated. (I don't have numbers for the daily delta size.)

That would obviously take a lot of bandwidth and a lot of resources on the client. We would really need to aim at mobile as well, because that is likely to be the dominant HTTPS client in a few years, making storage and bandwidth use even more expensive.

Some probabilistic data structure might help. I've written about that before. Then you have to worry about false positives and the number of reads needed per validation. Phones might use Flash for storage, but it's remarkably slow.

Also, would you rather spend those resources on revocation, or on distributing the Safe Browsing set to mobile devices? Or would you spend the memory on improving ASLR on mobile devices? Security is big picture full of trade-offs. It's not clear the revocation warrants all that spending.

So, if you believe that downloading and caching revocation is the way forward, I think those are the parameters that you have to deal with.

No, don't enable revocation checking

Revocation checking is in the news again because of a large number of revocations resulting from precautionary rotations for servers affected by the OpenSSL heartbeat bug. However, revocation checking is a complex topic and there's a fair amount of misinformation around. In short, it doesn't work and you are no more secure by switching it on. But let's quickly catch up on the background:

Certificates bind a public key and an identity (commonly a DNS name) together. Because of the way the incentives work out, they are typically issued for a period of several years. But events occur and sometimes the binding between public key and name that the certificate asserts becomes invalid during that time. In the recent cases, people who ran a server that was affected by the heartbeat bug are worried that their private key might have been obtained by someone else and so they want to invalidate the old binding, and bind to a new public key. However, the old certificates are still valid and so someone who had obtained that private key could still use them.

Revocation is the process of invalidating a certificate before its expiry date. All certificates include a statement that essentially says “please phone the following number to check that this certificate has not been revoked”. The term online revocation checking refers to the process of making that phone call. It's not actually a phone call, of course, rather browsers and other clients can use a protocol called OCSP to check the status of a certificate. OCSP supplies a signed statement that says that the certificate is still valid (or not) and, critically, the OCSP statement itself is valid for a much shorter period of time, typically a few days.

The critical question is what to do in the event that you can't get an answer about a certificate's revocation status. If you reject certificates when you can't get an answer, that's called hard-fail. If you accept certificates when you can't get an answer that's called soft-fail.

Everyone does soft-fail for a number of reasons on top of the general principle that single points of failure should be avoided. Firstly, the Internet is a noisy place and sometimes you can't get through to OCSP servers for some reason. If you fail in those cases then the level of random errors increases. Also, captive portals (e.g. hotel WiFi networks where you have to “login” before you can use the Internet) frequently use HTTPS (and thus require certificates) but don't allow you to access OCSP servers. Lastly, if everyone did hard-fail then taking down an OCSP service would be sufficient to take down lots of Internet sites. That would mean that DDoS attackers would turn their attention to them, greatly increasing the costs of running them and it's unclear whether the CAs (who pay those costs) could afford it. (And the disruption is likely to be significant while they figure that out.)

So soft-fail is the only viable answer but it has a problem: it's completely useless. But that's not immediately obvious so we have to consider a few cases:

If you're worried about an attacker using a revoked certificate then the attacker first must be able to intercept your traffic to the site in question. (If they can't even intercept the traffic then you didn't need any authentication to protect it from them in the first place.) Most of the time, such an attacker is near you. For example, they might be running a fake WiFi access point, or maybe they're at an ISP. In these cases the important fact is that the attacker can intercept all your traffic, including OCSP traffic. Thus they can block OCSP lookups and soft-fail behaviour means that a revoked certificate will be accepted.

The next class of attacker might be a state-level attacker. For example, Syria trying to intercept Facebook connections. These attackers might not be physically close, but they can still intercept all your traffic because they control the cables going into and out of a country. Thus, the same reasoning applies.

We're left with cases where the attacker can only intercept traffic between a user and a website, but not between the user and the OCSP service. The attacker could be close to the website's servers and thus able to intercept all traffic to that site, but not anything else. More likely, the attacker might be able to perform a DNS hijacking: they persuade a DNS registrar to change the mapping between a domain ( and its IP address(es) and thus direct the site's traffic to themselves. In these cases, soft-fail still doesn't work, although the reasoning is more complex:

Firstly, the attacker can use OCSP stapling to include the OCSP response with the revoked certificate. Because OCSP responses are generally valid for some number of days, they can store one from before the certificate was revoked and use it for as long as it's valid for. DNS hijackings are generally noticed and corrected faster than the OCSP response will expire. (On top of that, you need to worry about browser cache poisoning, but I'm not going to get into that.)

Secondly, and more fundamentally, when issuing certificates a CA validates ownership of a domain by sending an email, or looking for a specially formed page on the site. If the attacker is controlling the site, they can get new certificates issued. The original owners might revoke the certificates that they know about, but it doesn't matter because the attacker is using different ones. The true owners could try contacting CAs, convincing them that they are the true owners and get other certificates revoked, but if the attacker still has control of the site, they can hop from CA to CA getting certificates. (And they will have the full OCSP validity period to use after each revocation.) That circus could go on for weeks and weeks.

That's why I claim that online revocation checking is useless - because it doesn't stop attacks. Turning it on does nothing but slow things down. You can tell when something is security theater because you need some absurdly specific situation in order for it to be useful.

So, for a couple of years now, Chrome hasn't done these useless checks by default in most cases. Rather, we have tried a different mechanism. We compile daily lists of some high-value revocations and use Chrome's auto-update mechanism to push them to Chrome installations. It's called the CRLSet and it's not complete, nor big enough to cope with large numbers of revocations, but it allows us to react quickly to situations like Diginotar and ANSSI. It's certainly not perfect, but it's more than many other browsers do.

A powerful attacker may be able to block a user from receiving CRLSet updates if they can intercept all of that user's traffic for long periods of time. But that's a pretty fundamental limit; we can only respond to any Chrome issue, including security bugs, by pushing updates.

The original hope with CRLSets was that we could get revocations categorised into important and administrative and push only the important ones. (Administrative revocations occur when a certificate is changed with no reason to suspect that any key compromise occurred.) Sadly, that mostly hasn't happened. Perhaps we need to consider a system that can handle much larger numbers of revocations, but the data in that case is likely to be two orders of magnitude larger and it's very unlikely that the current CRLSet design is still optimal when the goal moves that far. It's also a lot of data for every user to be downloading and perhaps efforts would be better spent elsewhere. It's still the case that an attacker that can intercept traffic can easily perform an SSL Stripping attack on most sites; they hardly need to fight revoked certificates.

In order to end on a positive note, I'll mention a case where online revocation checking does work, and another, possible way to solve the revocation problem for browsers.

The arguments above started with the point that an attacker using a revoked certificate first needs to be able to intercept traffic between the victim and the site. That's true for browsers, but it's not true for code-signing. In the case where you're checking the signature on a program or document that could be distributed via, say, email, then soft-fail is still valuable. That's because it increases the costs on the attacker substantially: they need to go from being able to email you to needing to be able to intercept OCSP checks. In these cases, online revocation checking makes sense.

If we want a scalable solution to the revocation problem then it's probably going to come in the form of short-lived certificates or something like OCSP Must Staple. Recall that the original problem stems from the fact that certificates are valid for years. If they were only valid for days then revocation would take care of itself. (This is the approach taken by DNSSEC.) For complex reasons, it might be easier to deploy that by having certificates that are valid for years, but include a marker in them that indicates that an OCSP response must be provided along with the certificate. The OCSP response is then only valid for a few days and the effect is the same (although less efficient).

TLS Triple Handshakes

Today, the TLS WG mailing list meeting received a note about the work of Karthikeyan Bhargavan, Antoine Delignat-Lavaud, Cedric Fournet, Alfredo Pironti and Pierre-Yves Strub on triple handshake attacks against TLS. This is more complex than just a duplicated goto and I'm not going to try and reproduce their explanation here. Instead, I'll link to their site again, which also includes a copy of their paper.

In short, the TLS handshake hashes in too little information, and always has. Because of that it's possible to synchronise the state of two TLS sessions in a way that breaks assumptions made in the rest of the protocol.

I'd like to thank the researchers for doing a very good job of disclosing this. The note today even included a draft for fixing the TLS key derivation to include all the needed information to stop this attack, and it'll be presented at the WG meeting tomorrow.

In the mean time, people shouldn't panic. The impact of this attack is limited to sites that use TLS client-certificate authentication with renegotiation, and protocols that depend on channel binding. The vast majority of users have never used client certificates.

The client-certificate issues can be fixed with a unilateral, client change to be stricter about verifying certificates during a renegotiation, as suggested by the authors. I've included an image, below, that is loaded over an HTTPS connection that renegotiates with unrelated certificates before returning the image data. Hopefully the image below is broken. If not, then it likely will be soon because of a browser update. (I took the server down.)

Protocols that depend on channel binding (including ChannelID) need other changes. Ideally, the proposed update to the master-secret computation will be finalised and implemented. (For ChannelID, we have already updated the protocol to include a change similar to the proposed draft.)

It's likely that there are still concrete problems to be found because of the channel-binding break. Hopefully with today's greater publicity people can start to find them.

TLS Symmetric Crypto

At this time last year, the TLS world was mostly running on RC4-SHA and AES-CBC. The Lucky 13 attack against CBC in TLS had just been published and I had spent most of January writing patches for OpenSSL and NSS to implement constant-time CBC decoding. The RC4 biases paper is still a couple of week away, but it's already clear that both these major TLS cipher suite families are finished and need replacing. (The question of which is worse is complicated.)

Here's Chrome's view of TLS by number of TLS connections from around this time (very minor cipher suite families were omitted):

Cipher suite familyPercentage of TLS connections made by Chrome

A whole bunch of people, myself included, have been working to address that since.

AES-GCM was already standardised, implemented in OpenSSL and has hardware support in Intel chips, so Wan-Teh Chang and I started to implement it in NSS, although first we had to implement TLS 1.2, on which it depends. That went out in Chrome 30 and 31. Brian Smith (and probably others!) at Mozilla shipped that code in Firefox also, with Firefox 27.

AES-GCM is great if you have hardware support for it: Haswell chips can do it in just about 1 cycle/byte. However, it's very much a hardware orientated algorithm and it's very difficult to implement securely in software with reasonable speed. Also, since TLS has all the infrastructure for cipher suite negotiation already, it's nice to have a backup in the wings should it be needed.

So we implemented ChaCha20-Poly1305 for TLS in NSS and OpenSSL. (Thanks to Elie Bursztein for doing a first pass in OpenSSL.) This is an AEAD made up of primitives from djb and we're using implementations from djb, Andrew M, Ted Krovetz and Peter Schwabe. Although it can't beat AES-GCM when AES-GCM has dedicated hardware, I suspect there will be lots of chips for a long time that don't have such hardware. Certainly most mobile phones at the moment don't (although AArch64 is coming).

Here's an example of the difference in speeds:

ChipAES-128-GCM speedChaCha20-Poly1305 speed
OMAP 446024.1 MB/s75.3 MB/s
Snapdragon S4 Pro41.5 MB/s130.9 MB/s
Sandy Bridge Xeon (AESNI)900 MB/s500 MB/s

(The AES-128-GCM implementation is from OpenSSL 1.0.1f. Note, ChaCha20 is a 256-bit cipher and AES-128 obviously isn't.)

There's also an annoying niggle with AES-GCM in TLS because the spec says that records have an eight byte, explicit nonce. Being an AEAD, the nonce is required to be unique for a given key. Since an eight-byte value is too small to pick at random with a sufficiently low collision probability, the only safe implementation is a counter. But TLS already has a suitable, implicit record sequence counter that has always been used to number records. So the explicit nonce is at best a waste of eight bytes per record, and possibly dangerous should anyone attempt to use random values with it. Thankfully, all the major implementations use a counter and I did a scan of the Alexa, top 200K sites to check that none are using random values - and none are. (Thanks to Dr Henson for pointing out that OpenSSL is using a counter, but with a random starting value.)

For ChaCha20-Poly1305 we can save 8 bytes of overhead per record by using the implicit sequence number for the nonce.

Cipher suite selection

Given the relative merits of AES-GCM and ChaCha20-Poly1305, we wish to use the former when hardware support is available and the latter otherwise. But TLS clients typically advertise a fixed preference list of cipher suites and TLS servers either respect it, or override the client's preferences with their own, fixed list. (E.g. Apache's SSLHonorCipherOrder directive.)

So, first, the client needs to alter its cipher suite preference list depending on whether it has AES-GCM hardware support - which Chrome now does. Second, servers that enforce their own preferences (which most large sites do) need a new concept of an equal-preference group: a set of cipher suites in the server's preference order which are all “equally good”. When choosing a cipher suite using the server preferences, the server finds its most preferable cipher suite that the client also supports and, if that is in an equal preference group, picks whichever member of the group is the client's most preferable. For example, Google servers have a cipher suite preference that includes AES-GCM and ChaCha20-Poly1305 cipher suites in an equal preference group at the top of the preference list. So if the client supports any cipher suite in that group, then the server will pick whichever was most preferable for the client.

The end result is that Chrome talking to Google uses AES-GCM if there's hardware support at the client and ChaCha20-Poly1305 otherwise.

After a year of work, let's see what Chrome's view of TLS is now:

Cipher suite familyPercentage of TLS connections made by Chrome
AES128-GCM & ChaCha20-Poly130539.9%

What remains is standardisation work and getting the code out to the world for others. The ChaCha20-Poly1305 cipher suites are unlikely to survive the IETF without changes so we'll need to respin them at some point. As for getting the code out, I hope to have something to say about that in the coming months. Before then I don't recommend that anyone try to repurpose code from the Chromium repos - it is maintained with only Chromium in mind.


Now that TLS's symmetric crypto is working better there's another, long-standing problem that needs to be addressed: TLS fallbacks. The new, AEAD cipher suites only work with TLS 1.2, which shouldn't be a problem because TLS has a secure negotiation mechanism. Sadly, browsers have a long history of doing TLS fallbacks: reconnecting with a lesser TLS version when a handshake fails. This is because there are lots of buggy HTTPS servers on the Internet that don't implement version negotiation correctly and fallback allows them to continue to work. (It also promotes buggy implementations by making them appear to function; but fallback was common before Chrome even existed so we didn't really have a choice.)

Chrome now has a four stage fallback: TLS 1.2, 1.1, 1.0 then SSLv3. The problem is that fallback can be triggered by an attacker injecting TCP packets and it reduces the security level of the connection. In the TLS 1.2 to 1.1 fallback, AES-GCM and ChaCha20-Poly1305 are lost. In the TLS 1.0 to SSLv3 fallback, all elliptic curve support, and thus ECDHE, disappears. If we're going to depend on these new cipher suites then we cannot allow attackers to do this, but we also cannot break all the buggy servers.

So, with Chrome 33, we implemented a fallback SCSV. This involves adding a pseudo-cipher suite in the client's handshake that indicates when the client is trying a fallback connection. If an updated server sees this pseudo-cipher suite, and the connection is not using the highest TLS version supported by the server, then it replies with an error. This is essentially duplicating the version negotiation in the cipher suite list, which isn't pretty, but the reality is that the cipher suite list still works but the primary version negotiation is rusted over with bugs.

With this in place (and also implemented on Google servers) we can actually depend on having TLS 1.2, at least between Chrome and Google (for now).


Few rollouts are trouble free and Chrome 33 certainly wasn't, although most of the problems were because of a completely different change.

Firstly, ChaCha20-Poly1305 is running at reduced speed on ARM because early-revision, S3 phones have a problem that causes the NEON, Poly1305 code to sometimes calculate the wrong result. (It's not clear if it's the MSM8960 chip or the power supply in the phone.) Of course, it works well enough to run the unit tests successfully because otherwise I wouldn't have needed two solid days to pin it down. Future releases will need to run a self-test to gather data about which revisions have the problem and then we can selectively disable the fast-path code on those devices. Until then, it's disabled everywhere. (Even so, ChaCha20-Poly1305 is still much faster than AES-GCM on ARM.)

But it was the F5 workaround that caused most of the headaches. Older (but common) firmware revisions of F5 devices hang the connection when the ClientHello is longer than 255 bytes. This appeared to be a major issue because it meant that we couldn't add things to the ClientHello without breaking lots of big sites that use F5 devices. After a plea and a discussion on the TLS list, an engineer from F5 explained that it was just handshake messages between 256 and 511 bytes in length that caused the issue. That suggested an obvious solution: pad the handshake message so that it doesn't fall into the troublesome size range.

Chrome 33 padded all ClientHellos to 512 bytes, which broke at least two “anti-virus” products that perform local MITM of TLS connections on Windows and at least one “filtering” device that doesn't do MITM, but killed the TCP connections anyway. All made the assumption that the ClientHello will never be longer than some tiny amount. In the firewall case, the fallback SCSV stopped a fallback to SSLv3 that would otherwise have hidden the problem by removing the padding.

Thankfully, two of the three vendors have acted quickly to address the issue. The last, Kaspersky, has known about these sorts of problems for 18 months now (since Chrome tried to deploy TLS 1.1 and hit a similar issue) but, thankfully, doesn't enable MITM mode by default.

Overall, it looks like the changes are viable. Hopefully the path will now be easier for any other clients that wish to do the same.

Apple's SSL/TLS bug

Yesterday, Apple pushed a rather spooky security update for iOS that suggested that something was horribly wrong with SSL/TLS in iOS but gave no details. Since the answer is at the top of the Hacker News thread, I guess the cat's out of the bag already and we're into the misinformation-quashing stage now.

So here's the Apple bug:

static OSStatus
SSLVerifySignedServerKeyExchange(SSLContext *ctx, bool isRsa, SSLBuffer signedParams,
                                 uint8_t *signature, UInt16 signatureLen)
	OSStatus        err;

	if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
		goto fail;
	if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
		goto fail;
		goto fail;
	if ((err =, &hashOut)) != 0)
		goto fail;

	return err;

(Quoted from Apple's published source code.)

Note the two goto fail lines in a row. The first one is correctly bound to the if statement but the second, despite the indentation, isn't conditional at all. The code will always jump to the end from that second goto, err will contain a successful value because the SHA1 update operation was successful and so the signature verification will never fail.

This signature verification is checking the signature in a ServerKeyExchange message. This is used in DHE and ECDHE ciphersuites to communicate the ephemeral key for the connection. The server is saying “here's the ephemeral key and here's a signature, from my certificate, so you know that it's from me”. Now, if the link between the ephemeral key and the certificate chain is broken, then everything falls apart. It's possible to send a correct certificate chain to the client, but sign the handshake with the wrong private key, or not sign it at all! There's no proof that the server possesses the private key matching the public key in its certificate.

Since this is in SecureTransport, it affects iOS from some point prior to 7.0.6 (I confirmed on 7.0.4) and also OS X prior to 10.9.2 (confirmed on 10.9.1). It affects anything that uses SecureTransport, which is most software on those platforms although not Chrome and Firefox, which both use NSS for SSL/TLS. However, that doesn't mean very much if, say, the software update systems on your machine might be using SecureTransport.

I coded up a very quick test site at Note the port number (which is the CVE number), the normal site is running on port 443 and that is expected to work. On port 1266 the server is sending the same certificates but signing with a completely different key. If you can load an HTTPS site on port 1266 then you have this bug.

Because the certificate chain is correct and it's the link from the handshake to that chain which is broken, I don't believe any sort of certificate pinning would have stopped this. Also, this doesn't only affect sites using DHE or ECDHE ciphersuites - the attacker gets to choose the ciphersuite in this case and will choose the one that works for them.

Also, this doesn't affect TLS 1.2 because there's a different function for verifying the different ServerKeyExchange message in TLS 1.2. But, again, the attacker can choose any version that the client will accept. But if the client only enables TLS 1.2 then it appears that would workaround this issue. Likewise, if the client only enabled the plain, RSA ciphersuites then there's no ServerKeyExchange and that should also work around this issue. (Of the two, the former workaround is much more preferable.)

Based on my test site, both iOS 7.0.6 and OS X 10.9.2 fix the issue. (Update: it looks like the bug was introduced in 10.9 for OS X but existed in at least some versions of iOS 6. iOS 6.1.6 was released yesterday to fix it.)

This sort of subtle bug deep in the code is a nightmare. I believe that it's just a mistake and I feel very bad for whoever might have slipped in an editor and created it.

Here's a stripped down that code with the same issue:

extern int f();

int g() {
	int ret = 1;

	goto out;
	ret = f();

	return ret;

If I compile with -Wall (enable all warnings), neither GCC 4.8.2 or Clang 3.3 from Xcode make a peep about the dead code. That's surprising to me. A better warning could have stopped this but perhaps the false positive rate is too high over real codebases? (Thanks to Peter Nelson for pointing out the Clang does have -Wunreachable-code to warn about this, but it's not in -Wall.)

Maybe the coding style contributed to this by allowing ifs without braces, but one can have incorrect indentation with braces too, so that doesn't seem terribly convincing to me.

A test case could have caught this, but it's difficult because it's so deep into the handshake. One needs to write a completely separate TLS stack, with lots of options for sending invalid handshakes. In Chromium we have a patched version of TLSLite to do this sort of thing but I cannot recall that we have a test case for exactly this. (Sounds like I know what my Monday morning involves if not.)

Code review can be effective against these sorts of bug. Not just auditing, but review of each change as it goes in. I've no idea what the code review culture is like at Apple but I strongly believe that my colleagues, Wan-Teh or Ryan Sleevi, would have caught it had I slipped up like this. Although not everyone can be blessed with folks like them.

Lastly, there was a lot of discussion yesterday that Apple missed checking the hostname in the certificate. It's true that curl on the OS X command line oddly accepts HTTPS connections to IP addresses when the IP address isn't in the certificate, but I can't find that there's anything more than that and Safari doesn't have that problem.

Implementing Elligator for Curve25519

There are some situations where you would like to encode a point on an elliptic curve in such a way that it appears to be a uniform, random string. This comes up in some password authenticated key exchange systems, where one doesn't want to leak any bits of the password. It also comes up in protocols that are trying to be completely uniform so as to make life hard for network censors.

For the purposes of this post, I'm only going to be considering Curve25519. Recall that Curve25519 is the set of points (u,v) where v2 = u3 + 486662x2 + u and u and v are taken from GF(q=2255-19). Curve25519 works with a Montgomery ladder construction and thus only exchanges v coordinates. So, when sending a v coordinate there are two ways that an attacker can distinguish it from a random string:

Firstly, since the field is only 255 bits, the 256th bit is always zero. Thus if an attacker sees a series of 32-byte strings where the top bit of the last byte is always zero, then they can be confident that they are not random strings. This is easy to fix however, just XOR in a random bit and mask it out before processing.

Secondly, the attacker can assume that a 32-byte string is a v coordinate and check whether v3 + 486662x2 + v is a square. This will always be true if the strings are v coordinates, by the curve equation, but will only be true 50% of the time otherwise. This problem is a lot harder to fix.

Thankfully; djb, Tanja Lange, Mike Hamburg and Anna Krasnova have published Elligator [paper]. In that paper they present two maps from elliptic curve points to uniform, random strings and you should, at least, read the introduction to the paper, which contains much more detail on the use cases and previous work.

Elligator 2 is suitable for Curve25519 and there are some hints about an implementation in section 5.5. However, this blog post contains my own notes about implementing each direction of the map with only a couple of exponentiations. You may need to read the paper first before the following makes much sense.

I'll start with the inverse map, because that's what's used during key-generation. Throughout I'll write (x, y) for affine coordinates on the twisted, Edwards isomorphism of Curve25519, as defined for Ed25519; (u,v) for affine coordinates on Curve25519 itself; and X, Y, Z for extended coordinates on the twisted, Edwards curve.

The inverse map, ψ-1, takes a point on the curve with the following limitations and produces a uniform, representative value:

  1. u ≠ -A. (The value A is a parameter of Curve25519 and has the value 486662.)
  2. -2u(u + A) is a square

Since we're generating the points randomly, I'm going to ignore the first condition because it happens far less frequently than malfunctions in the CPU instructions that I might use to detect it.

The second condition excludes about half the points on the curve: these points don't have any representation as a uniform string. When generating a key we need to detect and restart if we happen upon one of those points. (When working in the other direction, ψ(r) = ψ(-r), which explains why half the image is missing.)

If the point is in the image of the map then the uniform string representation, r, is defined by:

(If you can't see the equation, you need a browser that can handle inline SVG.)

So our key generation function is going to take a private key and either produce nothing (if the resulting point isn't in the image of the map), or produce a public key and the uniform representation of it. We're working with Curve25519, so the public key, if any, will exactly match the result of the scalar_base_mult operation on Curve25519.

Since I happened to have the ref10 code from Ed25519 close to hand, we'll be calculating the scalar multiplication on the twisted, Edwards isomorphism of Curve25519 that Ed25519 uses. I've discussed that before but, in short, it allows for the use of precomputed tables and so is faster than scalar_base_mult with the Curve25519 code.

The result of the scalar multiplication of the base point with the private key is a point on the twisted, Edwards curve in extended coordinates: (X : Y : Z), where x = X/Z and y = Y/Z and where (x, y) are the affine coordinates on the Edwards curve. In order to convert a point on the Edwards curve to a point on Curve25519, we evaluate:

This is looking quite expensive: first we have to convert from extended coordinates on the Edwards curve to affine, then to affine on Curve25519. In finite fields, inversions (i.e. division) are very expensive because they are done by raising the divisor to the power of q-2 (where q=2255-19). Compared to such large exponentiations, multiplications and additions are almost free.

But, we can merge all those inversions and do the whole conversion with just one. First, substitute the extended coordinate conversion: x = X/Z, y = Y/Z.


By calculating just 1/(X(Y-Z)), we can get (u,v). The reciprocal can be used directly to calculate v and then multiplied by X to calculate u.

What remains is to check whether -2u(u + A) is a square and then to calculate the correct square root. Happily, we can calculate all that (including both square roots because we want to be constant-time), with a single exponentiation.

Square roots are defined in the standard way for finite fields where q ≅ 5 mod 8:

Let's consider the square roots first. Both have a common, constant term of (-1/2) which we can ignore for now. With that gone, both are fractions that we need to raise to (q+3)/8. Section 5 of the Ed25519 paper contains a trick for merging the inversion with the exponentiation:

But let's see what else we can do with the c value.

So c7 gives us the variable part of the other square root without another large exponentiation! It only remains to multiply in the constant term of (-1/2)(q+3)/8 to each and then, for each, test whether the square root of -1 needs to be multiplied. For the first square root, for example, we should have that r2 = -u/(2(u+A)) thus 2r2(u+A)+u should be zero. If not, then we need to multiply r by sqrt(-1).

Lastly, we need to check that the point is in the image of the map at all! (An implementation may want to do this sooner so that failures can be rejected faster.) For this we need to check that -2u(u + A) is a square. We use the fact that an element of the field is a non-zero square iff raising it to (q-1)/2 results in one. Thus we need to calculate (-2)(q-1)/2(u(u+A))(q-1)/2. The first part is just a constant, but the latter would be an expensive, large exponentiation, if not for c again:

With that result we can pick the correct square root (and in constant time, of course!)

The map in the forward direction

In the forward direction, the map takes a uniform, representative value and produces a public key, which is the x coordinate of a Curve25519 point in this context: ψ(r) = u. (Remember that the affine coordinates on Curve25519 are written (u,v) here.) This is much simpler than the reverse direction, although, as best I can tell, equally expensive. Given a value, r, one can calculate the value u with:

This requires one inversion to calculate d and another, large exponentiation to calculate ε. The resulting u value will match the output of the key generation stage which, in turn, matches the value that running scalar_base_mult from the Curve25519 code would have produced for the same private key.

Note that the Elligator software page says that GMP can compute the Legendre symbol (the ε calculation) in only 9000 cycles, so I'm clearly missing some implementation trick somewhere. Additionally, the Elligator paper raises the possibility of implementing the inversions with a blinded, Euclidean algorithm. That would certainly be faster but I'm not comfortable that I know how to do that safely, so I'm sticking with exponentiation here.


I've a constant-time, pure Go implementation, based on the field operations from the pure-C, SUPERCOP, Ed25519 code. The field operations have been ported pretty mechanically from the C code, so it's hardly pretty Go code, but it seems to work.

On my, 3.1GHz, Ivy Bridge system (with TurboBoost enabled) the key generation takes 155µs and the forward map operation takes 28µs. (But remember that half of the key generations will fail, so ~double that time.)

I have not carefully reviewed the code, so beware. There are also some non-obvious concerns when using Elligator: for example, if you are hiding a uniform representative in an nonce field then the attacker might assume that it's a field element, negate it and let the connection continue. In a normal protocol, altering the nonce usually leads to a handshake failure but negating an Elligator representative is a no-op. If the connection continues then the attacker can be pretty sure that the value wasn't really random. So care is required.

So if you think that Elligator might be right for you, consult your cryptographer.

Forward security for journalists

A number of journalists have been writing stories about forward security recently. Some of them ended up talking to me and I completely failed to come up with a good metaphor that didn't cause more problems than it solved. Eventually I wrote the following as an introduction to forward security for journalists. It's deeply simplified but I think it captures the broad strokes.

It remains unclear whether anyone actually came away with a better understanding after reading it, however.

A great many metaphors have been mangled, and analogies tortured, trying to explain what forward security is. But, fundamentally, it doesn't involve anything more complex than high-school math and anybody should be able to understand it. So this document attempts to describe what forward security is.

In the 1970s, cryptography was purely the concern of the military. The NSA existed (unofficially) and believed that it was the sole, legitimate center of cryptographic knowledge in the US. In this environment, Whitfield Diffie set out from MIT across the country to try and gather up what scraps of cryptography has escaped into the public realm. If we omit huge amounts of detail (see Steven Levy's book, “Crypto” for the details) we can say that he ended up at Stanford with a revolutionary idea in his head: public key cryptography.

Up to that point, cryptography had involved secret keys that both the sender and receiver were in possession of. Diffie wondered about a system where one could publish part of a key (the public part) and keep another part of the key private. However, he didn't have any concrete implementation of this.

At Stanford he found Martin Hellman and, together they published New Directions in Cryptography (1976), which contained the Diffie-Hellman key-agreement algorithm. This is possibly the most significant work in cryptographic history and is still a core algorithm today. (Later we learnt that a researcher at GCHQ had discovered the idea a few years earlier. But GCHQ didn't do anything with it and never published it.)

We'll need modular arithmetic, which is like working with times. What's 8pm plus five hours? One o'clock. Because 8+5=13 and 13 divided by 12 has a remainder of one. Likewise, 8pm plus 17 hours is still one o'clock, because adding multiples of 12 doesn't make any difference to a clock. We'll be using the same idea, but with numbers (called the modulus) other than 12.

To explain Diffie-Hellman we'll work modulo 11. So 5+10=4 now, because 15 divided by 11 has a remainder of 4.

Next, exponentiation: 23 means 2, multiplied by itself three times. So 23=2×2×2=8. Now we have everything that we need to implement Diffie-Hellman.

In classic, cryptographic tradition we'll imagine two parties and we'll call them Alice and Bob. Alice generates a random number between 1 and 9 and calls it a. This is her private key; let's say that it's five. Alice calculates her public key as A=2a=25=32=10 mod 11.

Bob does the same but his private key is called b and his public key is B. Let's say that his private key is six which means that B=2b=26=64=9 mod 11.

Alice and Bob can freely publish A and B because, given them, no other party can work backwards and calculate their private keys. Well, in this case they can because we're using a toy example with a tiny modulus. A real modulus has over 500 digits. Then it's not possible with any current computer.

Once Alice has Bob's public key, or vice-versa, she can perform key-agreement. She combines her private key with Bob's public key like this: Ba=95=59049=1 mod 11. Bob can combine his private key with Alice's public key in the same way: Ab=106=1000000=1 mod 11. Both Alice and Bob arrived at the same, shared value (one) which nobody else can calculate. And without exchanging any secret information!

This was a huge breakthrough but it still left something to be desired: signatures. It seemed that it would be very useful for Alice to be able to calculate some function of a message that could be broadcast and would serve as a signature. So it had to be something that only Alice (or, rather, the entity holding the private part of Alice's public key) could have generated. Diffie-Hellman doesn't provide this.

In 1977, three researchers at MIT, Rivest, Shamir and Adleman, produced such a scheme. Now it's called by their initials: RSA. Again, the basics are high-school math.

We'll be working with modular arithmetic again but this time it'll be modulo the product of two primes. Like the example above, the numbers involved are usually huge, but we're going to use tiny numbers for the purposes of the examples.

So our two primes will be p=11 and q=17. This will serve as Alice's RSA private key. In order to calculate an RSA public key, one calculates n=11×17=187.

The core of RSA is that, when working modulo 187, it's very easy to calculate the cube of a number. But, unless you know that 187=11×17, it's very difficult to calculate the cube root of a number. Let's say that our message is 15. The cube is easy: 153=3375=9 mod 187. But to calculate the cube root you need to calculate the modular inverse of 3 and (p-1)(q-1)=10×16=160. We're getting a little past high-school math here so trust me that it's 107. By calculating 9107 mod 187 we find that we get our original number again: 15.

Think about what this means. If Bob knows that Alice's public key is 187, he can cube a message and send it to Alice and only Alice can calculate the cube root and recover the message. Also, if Alice starts with a message, she can calculate the cube root of it and publish that. Everyone else can be convinced that only Alice could have calculated the cube root of the message and so that functions as a digital signature of the message!

Now we have everything that we need to understand forward security so hang on while we wrap it all up.

RSA and Diffie-Hellman are both too slow to use for large amounts of data. Bulk data transfer uses secret-key cryptography: the old style cryptography where both sides need to know the secret key in order to process the ciphertext. But, in order to get the secret key to the other party, RSA or Diffie-Hellman is used.

Traditional HTTPS has the browser use RSA to transmit a secret key to the server. The browser picks a random, secret key, cubes it modulo the server's RSA modulus and sends it to the server. Only the server knows the two numbers that, multiplied together, equal its modulus so only the server can calculate the cube root and get the secret key for the connection.

However, if anyone were to record the HTTPS connection and later learn those two numbers (called factors) then they can retrospectively decrypt the recorded connections. They can also watch new connections to the server and decrypt those because they can calculate cube roots just as well as the real server can.

With forward security, the method of establishing the secret key for the connection is changed. When a client connects, the server generates a brand new Diffie-Hellman private key just for that connection. (It's called the ephemeral private key.) Then the server calculates its Diffie-Hellman public key (by 2a mod m, remember?) signs the public key with its RSA key (by calculating the cube root of it, modulo its RSA modulus) and sends all that to the browser. The browser verifies the signature (by cubing it and checking that the result equals the public key) and then generates its own Diffie-Hellman private and public keys. Now, the client has the server's Diffie-Hellman public key, so it can combine it with its Diffie-Hellman private key and calculated the shared-secret. It sends its Diffie-Hellman public key to the server and the server ends up with the same, shared secret. That secret is used as the secret key for the connection.

The advantage here is that, should the server's, long-term, RSA private key be compromised in the future it isn't as bad. The confidentiality of the connections is protected by the Diffie-Hellman secrets which were generated fresh for the connection and destroyed immediately afterwards. Also, the attacker, even with the RSA private key cannot passively decrypt connections in real-time. The attacker has to impersonate the server, which is much more work.

That's why forward-secrecy is often written as DHE: it stands for Diffie-Hellman ephemeral. There's also ECDHE (Elliptic Curve, Diffie-Hellman ephemeral) which is the same idea in a more efficient mathematical structure.


It seems that secure messaging projects are all the rage right now. Heck, Matthew Green is writing about it in the New Yorker this week!

One of my side-projects for a year or so has been about secure messaging. Of course, that means that I was working on it before it was cool (insert hipster emoticon here, :{D ) … but I'm slow because my side projects get very little time. (When I started, the subheading for the project was “How to better organise a discreet relationship with the Director of the CIA”, because that was the topical reference at the time. Oh, how things change.)

It's still not really ready, but the one year mark seemed like a good deadline to give myself to mention it in public. You should pay heed to the warning on the project page however. I did contact one, well known, security auditing company to see whether they would take a couple thousand dollars to spend a day on it, but they said that don't really do small projects and that I'd need to be talking 10-20 times that for a couple of weeks at least. So don't depend on it yet and don't be surprised if I've just broken it somehow by replacing the ratchet code. (Although, thanks, Trevor, for the design.)

But I fear that it's not going to make Matthew Green happy! It's certainly not a drop-in replacement for email. I spend my days working on small changes with a large multiplier, in my free time I choose to shift the balance completely the other way.

  • Ephemeral messages: although no software can prevent a recipient from making a copy of a message, it's the social norm and technical default in Pond that messages are erased after a week.
  • No spam: only approved contacts can send you messages. In order for the server to enforce that without revealing the identity of every sender, pairing-based, group signatures are used. (The flip side is that there are no public addresses, so you can't message someone that you don't already know.)
  • Forward security: the exchange of messages drives a Diffie-Hellman ratchet preventing a point-in-time compromise from giving the attacker the ability to decrypt past messages (modulo the week-long retention).
  • Erasure storage: since B-tree filesystems like btrfs and log-structured SSDs mean that one can have very little confidence in being able to securely erase anything, Pond tries to use the TPM NVRAM to securely delete old state. (Note: not implemented for OS X yet.)
  • One-to-one only: No messages to multiple contacts (yet). Haven't implemented it.

The source code is in Go, but there are binaries for several Linux flavors and OS X. (Although it looks terrible on OS X and one user reports that the program can only be closed by killing it from the command line: I've not reproduced that yet.) In addition to the default server, I learned recently that the Wau Holland Foundation also runs a Pond server.

For more details, see the user guide, threat model etc which are linked from the project page.

If you get it running, but don't know anyone else, you can use:

gpg --recv-key C92172384F387DBAED4D420165EB9636F02C5704

to get my public key and email me a shared-secret. (Although I only check that email during the evenings.)

Please update F5/BIG-IP firmware

Update (Dec 7th): an F5 engineer commented on the TLS WG mailing list describing the internals of the hang bug. Based on this, we realised that it would be possible to work around the issue by padding ClientHello messages of a certain size. We now have a design for doing this, it's implemented in Chrome and it appears to be working! So, while you should always keep software up to date, it appears that the Internet can dodge this bug!

If you use F5/BIG-IP devices to terminate SSL connections, please update the firmware on the things! We're trying to run an Internet here and old versions of these devices are a real problem for deploying new TLS features. You need to be running at least version 10.2.4 (as far as I know), but running the latest version is generally good advice.

If you just try to connect to these sites with a recent version of OpenSSL, you should find that the connection hangs - which is terrible. We can detect a server that returns an error, but hanging the connection isn't something we can generally work around.

$ openssl version
OpenSSL 1.0.1e 11 Feb 2013
$ openssl s_client -connect
$ openssl s_client -connect -tls1
New, TLSv1/SSLv3, Cipher is AES256-SHA
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE

I did have a long list of major sites that were affected by this issue here. I've removed it because some of them had updated and, because of the update at the top, it's no longer a crippling problem.

ChaCha20 and Poly1305 for TLS

Today, TLS connections predominantly use one of two families of cipher suites: RC4 based or AES-CBC based. However, in recent years both of these families of cipher suites have suffered major problems. TLS's CBC construction was the subject of the BEAST attack (fixed with 1/n-1 record splitting or TLS 1.1) and Lucky13 (fixed with complex implementation tricks). RC4 was found to have key-stream biases that cannot effectively be fixed.

Although AES-CBC is currently believed to be secure when correctly implemented, a correct implementation is so complex that there remains strong motivation to replace it.

Clearly we need something better. An obvious alternative is AES-GCM (AES-GCM is AES in counter mode with a polynomial authenticator over GF(2128)), which is already specified for TLS and has some implementations. Support for it is in the latest versions of OpenSSL and it has been the top preference cipher of Google servers for some time now. Chrome support should be coming in Chrome 31, which is expected in November. (Although we're still fighting to get TLS 1.2 support deployed in Chrome 30 due to buggy servers.)

AES-GCM isn't perfect, however. Firstly, implementing AES and GHASH (the authenticator part of GCM) in software in a way which is fast, secure and has good key agility is very difficult. Both primitives are suited to hardware implementations and good software implementations are worthy of conference papers. The fact that a naive implementation (which is also what's recommended in the standard for GHASH!) leaks timing information is a problem.

AES-GCM also isn't very quick on lower-powered devices such as phones, and phones are now a very important class of device. A standard phone (which is always defined by whatever I happen to have in my pocket; a Galaxy Nexus at the moment) can do AES-128-GCM at only 25MB/s and AES-256-GCM at 20MB/s (both measured with an 8KB block size).

Lastly, if we left things as they are, AES-GCM would be the only good cipher suite in TLS. While there are specifications for AES-CCM and for fixing the AES-CBC construction, they are all AES based and, in the past, having some diversity in cipher suites has proven useful. So we would be looking for an alternative even if AES-GCM were perfect.

In light of this, Google servers and Chrome will soon be supporting cipher suites based around ChaCha20 and Poly1305. These are primitives developed by Dan Bernstein and are fast, secure, have high quality, public domain implementations, are naturally constant time and have nearly perfect key agility.

On the same phone as the AES-GCM speeds were measured, the ChaCha20+Poly1305 cipher suite runs at 92MB/s (which should be compared against the AES-256-GCM speed as ChaCha20 is a 256-bit cipher).

In addition to support in Chrome and on Google's servers, myself and my colleague, Elie Bursztein, are working on patches for NSS and OpenSSL to support this cipher suite. (And I should thank Dan Bernstein, Andrew M, Ted Krovetz and Peter Schwabe for their excellent, public domain implementations of these algorithms. Also Ben Laurie and Wan-Teh Chang for code reviews, suggestions etc.)

But while AES-GCM's hardware orientation is troublesome for software implementations, it's obviously good news for hardware implementations and some systems do have hardware AES-GCM support. Most notably, Intel chips have had such support (which they call AES-NI) since Westmere. Where such support exists, it would be a shame not to use it because it's constant time and very fast (see slide 17). So, once ChaCha20+Poly1305 is running, I hope to have clients change their cipher suite preferences depending on the hardware that they're running on, so that, in cases where both client and server support AES-GCM in hardware, it'll be used.

To wrap all this up, we need to solve a long standing, browser TLS problem: in order to deal with buggy HTTPS servers on the Internet (of which there are many, sadly), browsers will retry failed HTTPS connections with lower TLS version numbers in order to try and find a version that doesn't trigger the problem. As a last attempt, they'll try an SSLv3 connection with no extensions.

Several useful features get jettisoned when this occurs but the important one for security, up until now, has been that elliptic curve support is disabled in SSLv3. For servers that support ECDHE but not DHE that means that a network attacker can trigger version downgrades and remove forward security from a connection. Now that AES-GCM and ChaCha20+Poly1305 are important we have to worry about them too as these cipher suites are only defined for TLS 1.2.

Something needs to be done to fix this so, with Chrome 31, Chrome will no longer downgrade to SSLv3 for Google servers. In this experiment, Google servers are being used as an example of non-buggy servers. The experiment hopes to show that networks are sufficiently transparent that they'll let at least TLS 1.0 through. We know from Chrome's statistics that connections to Google servers do end up getting downgraded to SSLv3 sometimes, but all sorts of random network events can trigger a downgrade. The fear is that there's some common network element that blocks TLS connections by deep-packet inspection, which we'll measure by breaking them and seeing how many bug reports we get.

If that works then, in Chrome 32, no fallbacks will be permitted for Google servers at all. This stage of the experiment tests that the network is transparent to TLS 1.2 by, again, breaking anything that isn't and seeing if it causes bug reports.

If both those experiments work then, great! We can define a way for servers to securely indicate that they don't need version fallback. It would be nice to have used the renegotiation extension for this, but I think that there are likely already too many broken servers that support that, so another SCSV is probably needed.

If we get all of the above working then we're not in too bad a state with respect to TLS cipher suites. At least, once most of the world has upgraded in any case.

Playing with the Certificate Transparency pilot log

I've written about Certificate Transparency several times in the past, but now there's actually something to play with! (And, to be clear, this is entirely thanks to several of my colleagues, not me!) So I though it would be neat to step through some simple requests of the pilot log.

I'm going to be using Go for this, because I like it. But it's substantially just JSON and HTTP and one could do it in any language.

In order to query a log you need to know its URL prefix and public key. Here's a structure to represent that.

// Log represents a public log.
type Log struct {
	Root string
	Key *ecdsa.PublicKey

For the pilot log, the URL prefix is and here's the public key:

-----END PUBLIC KEY-----

Since it's a little obscure, here's the code to parse a public key like that. (This code, and much of the code below, is missing error checking because I'm just playing around.)

	block, _ := pem.Decode([]byte(pilotKeyPEM))
	key, _ := x509.ParsePKIXPublicKey(block.Bytes)
	pilotKey = key.(*ecdsa.PublicKey)
	pilotLog = &Log{Root: "", Key: pilotKey}

The log is a tree, so the first thing that we want to do with a log is to get its head. This is detailed in section 4.3 of the RFC: we make an HTTP GET to get-sth and it returns a JSON blob.

Any HTTP client can do that, so give it a go with curl on the command line:

$ curl 2>>/dev/null | less

You should get a JSON blob that, with a bit of formatting, looks like this:

	"tree_size": 1979426,
	"timestamp": 1368891548960,
	"sha256_root_hash": "8UkrV2kjoLcZ5fP0xxVtpsSsWAnvcV8aPv39vh96J2o=",
	"tree_head_signature": "BAMARjBEAiAc95/ONhz2vQsULrISlLumvpo..."
In Go, we'll parse that into a structure like this:

// Head contains a signed tree head.
type Head struct {
	Size uint64 `json:"tree_size"`
	Time time.Time `json:"-"`
	Hash []byte `json:"sha256_root_hash"`
	Signature []byte `json:"tree_head_signature"`
	Timestamp uint64 `json:"timestamp"`

And here's some code to make the HTTP request, check the signature and return such a structure:

func (log *Log) GetHead() (*Head, error) {
	// See
	resp, err := http.Get(log.Root + "/ct/v1/get-sth")
	if err != nil {
		return nil, err

	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		return nil, errors.New("ct: error from server")
	if resp.ContentLength == 0 {
		return nil, errors.New("ct: body unexpectedly missing")
	if resp.ContentLength > 1<<16 {
		return nil, errors.New("ct: body too large")
	data, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	head := new(Head)
	if err := json.Unmarshal(data, &head); err != nil {
		return nil, err

	head.Time = time.Unix(int64(head.Timestamp/1000), int64(head.Timestamp%1000))

	// See
	if len(head.Signature) < 4 {
		return nil, errors.New("ct: signature truncated")
	if head.Signature[0] != hashSHA256 {
		return nil, errors.New("ct: unknown hash function")
	if head.Signature[1] != sigECDSA {
		return nil, errors.New("ct: unknown signature algorithm")

	signatureBytes := head.Signature[4:]
	var sig struct {
		R, S *big.Int

	if signatureBytes, err = asn1.Unmarshal(signatureBytes, &sig); err != nil {
		return nil, errors.New("ct: failed to parse signature: " + err.Error())
	if len(signatureBytes) > 0 {
		return nil, errors.New("ct: trailing garbage after signature")
	// See
	signed := make([]byte, 2 + 8 + 8 + 32)
	x := signed
	x[0] = logVersion
	x[1] = treeHash
	x = x[2:]
	binary.BigEndian.PutUint64(x, head.Timestamp)
	x = x[8:]
	binary.BigEndian.PutUint64(x, head.Size)
	x = x[8:]
	copy(x, head.Hash)

	h := sha256.New()
	digest := h.Sum(nil)

	if !ecdsa.Verify(log.Key, digest, sig.R, sig.S) {
		return nil, errors.New("ct: signature verification failed")

	return head, nil

If one runs this code against the current pilot, we can get the same information as we got with curl: 1979426 certificates at Sat May 18 11:39:08 EDT 2013, although now the date will be newer and there will be more entries.

As you can see, the log has been seeded with some certificates gathered from a scan of the public Internet. Since this data is probably more recent than the EFF's Observatory data, it might be of interest and anyone can download it using the documented interface. Again, we can try a dummy request in curl:

$ curl '' 2>>/dev/null | less

You should see a result that starts with {"entries":[{"leaf_input":"AAAAAAE9p. You can request log entries in batches and even download everything if you wish. My toy code simply writes a single large file containing the certificates, gzipped and concatenated with a length prefix.

My toy code is go gettable from In the tools directory is ct-sync, which will incrementally download the pilot log and ct-map, which demonstrates how to look for interesting things by extracting the certificates from a local mirror file. Both will use multiple cores if possible so be sure to set GOMAXPROCS. (This is very helpful when calculating the hash of the tree since the code doesn't do that incrementally.)

The project's official code is at See the README and the main site.

And since we have the tools to look for interesting things, I spent a couple of minutes doing so:

I wrote a quick script to find ‘TURKTRUST’ certificates - leaf certificates that are CA certificates. The log only contains certificates that are valid by a fairly promiscuous root set in order to prevent spam, so we don't have to worry about self-signed certificates giving us false positives. As ever the Korean Government CA is good to highlight malpractice with 40 certificates that mistakenly have the CA bit set (and are currently valid). This issue was reported by a colleague of mine last year. Thankfully, the issuing certificate has a path length constraint that should prevent this issue from being exploited (as long as your X.509 validation checks these sort of details). The root CA is also only present in the Microsoft root set as far as I know.

A different Korean root CA (KISA) can't encode ASN.1 correctly and is issuing certificates with negative serial numbers because ASN.1 INTEGERs are negative if the most significant bit is one.

Another CA is issuing certificates with 8-byte IP addresses (I've let them know).

The ct-map.go in the repository is setup to find publicly valid certificates with names that end in .corp (which will soon not get an HTTPS indication in Chrome.)

Anyway, at the moment the log is just pre-seeded with some public certificates. The real value of CT comes when we have strong assurances that all valid certificates are in the log. That will take some time, but we're working towards it.

Hash based signatures

It's likely that Kevin Poulsen will still be alive in twenty years time, and that his encrypted message to Edward Snowden will still be saved somewhere. In fact, there are several places where people post RSA-encrypted messages and seem to assume that they're secure. Even assuming that no huge advances in factoring occur (and Joux et al have just killed discrete logs in binary fields, so it's not impossible), large, quantum computers will be able to break them.

It might be the case that building practical, quantum computers doesn't turn out to be possible (D-Wave machines don't count here), but there are people trying hard to do it. From “Deep State” (a rather fragmented book, but interesting in parts):

There is a quantum computing arms race of sorts under way. China, Israel and Russia have advanced quantum computing programs with direct aim of gaining geopolitical advantage, as does DARPA itself.

There are classified and semi-classified DARPA and Army/Air Force/Navy Research Lab programs for the potential use of quantum computers for a select set of defense technologies, including: the ability to design a perfect sensing laser for drones or satellites; the ability to design radar that can defeat counter-stealth techniques; the ability to design coatings for aircraft that truly are stealth, owning to the exploitation of quantum fluid dynamics.

Now, government agencies don't care about Wired, but they do care about those things. So it's a possibility that, in a few decades, an academic lab will be in possession of a large quantum computer and might want to pursue some light-hearted historical investigations. (Just in case, I've included the Wired message and key below in case either is deleted ☺)

Despite a common misunderstanding, quantum computers don't break all public key cryptography, and quantum cryptography isn't the answer. (Unless you have lots of money and a dedicated fiber line, and possibly not even then.) Instead post-quantum cryptography is the answer, and there's even a conference on it.

So, should we be using post-quantum algorithms now? The reason that we don't currently use them is because they generally take one or two orders of magnitude more time, space or both. For high-volume work like TLS, that puts them outside the bounds of practicality. But for an application like PGP, where the volume of messages is very low, why not? Size might be an issue but, if it takes 1 second to process a message vs 10 milliseconds, it doesn't really make a difference. You might as well dial the security up to 11.

There are a couple of things that one needs for a system like PGP: encryption and signatures. One option for building post-quantum, public-key signatures is hash-based signatures. These are actually really old! They were described by Lamport in 1979, only a couple of years after RSA. In fact, as Rompel showed, a secure signature scheme exists if and only if a secure hash-based signature scheme exists. Individual hash functions may be broken, but if hash-based signatures are broken in general then there can be no secure signatures at all!

The most primitive hash-based signature can be used only once, and only to sign a single-bit message: Alice picks two random values, x and y and publishes her public key: H(x) and H(y). In order to sign a message, Alice publishes one of the two, original values depending on the value of the bit.

Clearly this only works for a single bit message. It's also a one-time signature because once one of the original values has been published, it's insecure for the same public key to be used again.

The first extension to this scheme is to run many instances in parallel in order to sign larger messages. Fix the hash function, H, as SHA-256. Now Alice picks 512 random values, x0..255 and y0..255 and publishes the hash of each of them as her public key. Now she can sign an arbitrarily long message, m, by calculating H(m) and then publishing either xi or yi depending on the bits of H(m).

It's still one-time though. As soon as Alice signs a different message with the same public key then, for some values of i, both xi and yi have been published. This lets an attacker forge signatures for messages that Alice hasn't signed.

In order to allow signing multiple messages there's one easy solution: just have multiple one-time signatures! Let's say that we want a thousand. The issue is that the public and private keys of our one-time scheme were 512×32 bytes = 16KB and, if we want a thousand of them, then the public and private keys end up being 16MB, which is quite a lot. We can solve the private key size by generating a stream of values from a cipher with just a 32-byte seed key, but the large public key remains.

The solution to this is to use a Merkle hash tree, introduced (unsurprisingly) by Merkle, in 1979.

The the diagram, above, four leaf hashes are included in the hash tree: A, B, C and D. The hash tree is a binary tree where each non-leaf node contains the hash of its children. If Bob believes that H(H(AB)H(CD)) is Alice's public key, then anyone can convince him that A is part of Alice's public key by giving him A, B and H(CD). Bob calculates H(AB) from A and B, and then H(H(AB)H(CD)) from H(AB) and H(CD). If the values match, and if the hash function isn't broken, then A must be part of Alice's public key.

Now, if A, B, C and D were one-time signature public keys then Alice's public key has been reduced to a single hash. Each signature now needs to include the one-time signature value, the rest of the public key, and also a proof that the public key is in Alice's public key. In more detail, the signature contains either xi or yi depending on the bits of the hash of the message to be signed and also H(xi) or H(yi) for the bits that weren't used. With that information, Bob can reconstruct the one-time signature public key. Then the signature contains the path up the tree - i.e. the values B, and H(CD) in our example, which convince Bob that the one-time signature public key is part of Alice's overall public key.

So now the public and private keys have been reduced to just 32 bytes (at least if you're using a 256-bit hash). We can make the tree arbitrarily large, thus supporting arbitrarily many signatures. The signatures, however, are 16KB plus change, which is pretty chunky. So the next trick will try to address this, and it's thanks to Winternitz.

The Winternitz trick was described in Merkle's original, 1979 work and it involves iterating hashes. We started off by considering the one-time signature of a single bit and solved it by publishing H(x) and H(y) and then revealing one of the preimages. What if we generated a single random value, z, and published H(H(z))? We would reveal H(z) if the bit was one, or z if the bit was zero. But an attacker could take our signature of zero (z) and calculate H(z), thus creating a signature of one.

So we need a checksum to make sure that, for any other message, the hash function needs to be broken. So we would need two of these structures and then we would sign 01 or 10. That way, one of the hash chains needs to be broken in order to calculate the signature of a different message. The second bit is the Winternitz checksum.

As it stands, Winternitz's trick hasn't achieved anything - we're still publishing two hash values and revealing two secrets. But when we start signing larger messages it becomes valuable.

Let's say that the Winternitz hash chains were 16 values long. That means that one of them can cover four bits of the hash that we're signing. If we were signing a 256-bit value, then we only need 64 Winternitz chains and three for the checksum. Also, the Winternitz scheme makes it easy for the verifier to calculate the original public key. The signature size has gone from 16KB to 2,144 bytes. Not bad!

But! There's still a critical flaw in this! Recall that using a one-time signature twice completely breaks the system. Because of that, the signature scheme is “stateful”. This means that, when signing, the signer absolutely must record that a one-time key has been used so that they never use it again. If the private key was copied onto another computer and used there, then the whole system is broken.

That limitation might be ok in some situations, and actually means that one can build forward-secure signature schemes: schemes where signatures prior to a key compromise can still be trusted. Perhaps for a CA where the key is in an HSM that might be useful. However, for most environments it's a huge foot-cannon.

We could pick one-time public keys to use based on the hash of the message, or at random in the hope of never colliding but, in order for such schemes to be safe the tree of one-time public keys needs to be really big. Big as in, more than 2128 entries. If we use the trick of generating the private keys from the output of a cipher then, as soon as we fix the cipher private key, we have determined the whole tree. However, although we can compute any part of it, we can't compute all of it. At least not without an infinite amount of time (or close enough to infinite as makes no difference).

So we have to defer calculating parts of the tree until we actually need them in order to generate a signature. In order to defer bits of the tree, we have to split the tree up into a forest of smaller trees. The top-level tree is calculated as before and the Merkle hash of it is the public key. But rather than sign a message with the one-time leaves of that tree, we sign the root hash of a subtree. Since everything is deterministic, the subtree at any position is always the same so we're always signing the same message - which is safe.

Thus, if we wished to have a conceptual hash tree with 2160 leaves, we could split it up, say, every 16 bits. The tip of the tree would be a hash tree with 216 leaves. Those leaves would sign the root of a hash tree of another 216 leaves and so one. In order to make 2160, we would need 10 layers of such trees.

I can't have an image with all 216 leaves in each tree, so 22 will have to do. When generating the key, only the top (blue) tree needs to be calculated. When generating each signature, all the subtrees down the path to the final one-time key need to be calculated. So the signature consists of a one-time signature, from the 3rd leaf of the blue tree, of the root of the red tree (as well as Merkle path to the blue root), plus a one-time signature, from the 1st leaf of the red tree, of the root of the green tree (as well as a Merkle path to the red root), etc.

Obviously these signatures are going to be rather larger, due to all the intermediate signatures, and will take rather longer to calculate, due to calculating all the subtrees on the fly. Below is a little Haskell program that tries to estimate signature size vs signature time. It may well have some bugs because I haven't actually coded up a working example to ensure that it does actually work.

log2 = logBase 2 :: Double -> Double

-- We assume SHA-256 for the sake of this example.
hashBits = 256
hashBytes = hashBits/8
-- This is the Winternitz parameter.
w = 64
wBits = log2 w
-- These are parameters for the Winternitz signature.
l1 = fromIntegral $ ceiling (hashBits/wBits)
l2 = 1 + (fromIntegral $ floor $ log2(l1*(w-1))/wBits)
-- wWords is the number of hash chains that we'll have.
wWords = l1 + l2

-- An SSE2, SHA-256 implementation can hash at 5 cycles/byte
hashCyclesPerByte = 5
-- We assume that we can generate keystream bytes at 2 cycles/byte.
secretCyclesPerByte = 2

otsSignatureBytes = wWords * hashBytes
-- In order to generate a single OTS public key we need to calculate wWords
-- hash chains to a depth of w hashes.
otsHashOps = wWords * w
otsHashBytes = hashBytes * otsHashOps
otsHashCycles = hashCyclesPerByte * otsHashBytes
-- In order to generate the private key, we need to generate this many bytes of
-- key stream.
otsPrivateBytes = wWords * hashBytes
otsPublicBytes = otsPrivateBytes
otsPrivateCycles = secretCyclesPerByte * otsPrivateBytes

-- This is the height of a tree.
treeHeight = 8
-- To calculate the public keys for a whole tree we have to generate 2^treeHeight public keys.
treePublicKeyGenCycles = (otsPrivateCycles + otsHashCycles) * 2**treeHeight
-- To calculate the hash of the root of the tree, we first need to hash all the
-- public keys.
treePublicKeyHashCycles = otsPublicBytes * hashCyclesPerByte * 2**treeHeight
-- Then we need to hash all the hashes
treeMerkleCycles = 2**(treeHeight-1) * hashBytes * hashCyclesPerByte
treeSignatureCycles = treePublicKeyGenCycles + treePublicKeyHashCycles + treeMerkleCycles
-- As part of the signature we need to publish the OTS signature and a path up
-- the Merkle tree. The signature is the same size as a public key because the
-- Winternitz hash chain are just slightly shorter.
treeSignatureBytes = otsPublicBytes + treeHeight*hashBytes

logKeys = 160
forestHeight = fromIntegral $ ceiling (logKeys/treeHeight)
forestSignatureBytes = forestHeight * treeSignatureBytes
forestSignatureCycles = forestHeight * treeSignatureCycles
forestSignatureTime = forestSignatureCycles / 2.9e9

The Haskell code assumes that one can calculate SHA-256 at 5 cycles/byte, which is possible for concurrent hashing as is needed in this case. There are several parameters that one can play with: the Winternitz value, the size of the trees (which is assumed to be uniform here) and the size of the forest. I've taken the size of the forest to be 2160, although that only allows 216 signatures before the probability of a collision reaches 2-128. However, the overall security of the scheme is somewhat less than 128 bits because there are multiple hashes to attack concurrently.

With the parameters that I picked above, the estimated, lower-bound time for signing is 0.8 seconds (on a 2.9GHz machine) and the signatures are 34KB. The algorithm is embarrassingly parallel, so multi-core machines can cut the signing time down a lot, but the size is more difficult to improve. Changing the parameters to achieve a 20KB signature results in a signing time of nearly 30 seconds! (And it's a lower-bound of the signing time - in the real world there will be additional overhead.)

So, should everyone be using hash-based, PGP signatures now? Well, probably not, at least not for a subkey: PGP clearsigned messages would look silly with 34KB of base64-encoded noise at the bottom. Perhaps for a long-term signing key if you really want. But, over the long-term, it's generally confidentiality (i.e. encryption) that you need to maintain and I haven't talked about post-quantum encryption at all in this post. (Maybe in the future.)

But for problems like software updates, a reasonable case can be made. The size of the signature is generally trivial compared to the size of the message and software really does last for decades now. Hash-based signatures allow you to forget about any cryptographic breakthroughs to the greatest degree possible.

There's lots more to hash-based signatures than I've covered here:

  • GMSS treats the forest construction as an optimisation problem and solves for several cases.
  • W-OTS+ builds on several, previous papers and changes the hash property needed from collision resistance to second preimage resistance, which is huge for classical machines! It means that the size of the hash can be halved. However, quantum computers can do a pre-image in 2n/2 time, the same amount of time that classical computers can do a collision, so it doesn't help you there. (Can quantum computers do a collision in 2n/3 time? djb says don't worry.)
  • If you don't mind keeping state, then XMSS is probably the most recent work exploring that problem that I'm aware of. (Although one might want to combine it with W-OTS+.)
  • Tahoe-LAFS has some notes on hash-based signatures, notably some from Daira Hopwood.

And those messages for anyone in possession of a large, quantum computer:

Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
Comment: GPGTools -




How to botch TLS forward secrecy

There's been some discussion about forward secret TLS recently, spurred, I think, by this piece from Declan McCullagh on CNET.

In case there are people out there wondering about turning it on, I thought it would be a good time to discuss some of the ways to mess it up!

In the case of multiplicative Diffie-Hellman (i.e. DHE), servers are free to choose their own, arbitrary DH groups. Historically some servers have had groups as small as 256-bits! Opera used to reconnect without DHE cipher suites in this case and Chrome now simply fails the connection with ERR_SSL_WEAK_SERVER_EPHEMERAL_DH_KEY. However, it's still the case that some servers use 512-bit DH groups, meaning that the connection can be broken open with relatively little effort.

So the first way to mess up forward secrecy is to use under sized DH groups. Ideally the DH group would match or exceed the RSA key size but 1024-bit DHE is arguably better than straight 2048-bit RSA so you can get away with that if you want to.

If you're using ECDHE then you don't need to worry about it being too small because the smallest EC group that clients support (P-256) is far, far stronger than 2048-bit RSA anyway.

So the next way to mess up forward secrecy is to get compromised via session resumption. TLS offers two session resumption mechanisms: session IDs (where the server and client each store their own secret state) and session tickets (where the client stores the server's state, encrypted by the server). If an attacker can obtain the session resumption information for a connection then they can decrypt the connection. (This needn't be completely true, but it is for TLS because of the way that TLS is designed.)

In the case of session IDs, the session information will either be kept in memory or kept on disk, depending on how the server is configured. (See the SSLSessionCache directive in Apache.) The forward secrecy of a connection is bounded by how long the session information is stored at the server. A small, in-memory cache likely has a high turnover rate, but a disk-cache could retain that information for a long time. Ideally one would have a medium sized, in-memory cache that turned over once a day or so.

In the case of session tickets, the server's encrypted state is transmitted to the client in the clear, so the server's session ticket key is what's protecting the connection. If you're running a single server then the session ticket key is probably generated randomly, at startup and kept in memory. So the forward secrecy of the connections is actually limited by the security of the server's memory until the server is restarted. It's possible for servers to run for many months without a restart so connections could be a lot less forward secure than you might think.

If you run several servers then they all need to share the same session ticket key otherwise they can't decrypt each other's session tickets. In Apache you would configure that with the SSLSessionTicketKeyFile directive. In this case your forward secrecy is limited by the security of that file on disk. Since your SSL private key is probably kept on the same disk, enabling forward secure cipher suites probably hasn't actually achieved anything other than changing the file that the attacker needs to steal!

So how do you run forward secrecy with several servers and support session tickets? You need to generate session ticket keys randomly, distribute them to the servers without ever touching persistent storage and rotate them frequently. However, I'm not aware of any open source servers that support anything like that.

Sudden Death Entropy Failures

During the time that the RSA patent was in force, DSA was the signature algorithm of choice for any software that didn't want to deal with patent licenses. (Which is why lots of old PGP keys are still DSA.) It has slowly disappeared since the patent expired and it appears that 4096-bit RSA is now the algorithm of choice if you're on the run from the NSA [1]. (And if you're a journalist trying to get a reply: keyid BDA0DF3C.)

But DSA can also be used with elliptic curves in the form of ECDSA and, in that form, it's likely that we'll see it return in the future, at least to some extent. SSH and GPG both support ECDSA now and CAs are starting to offer ECDSA certificates for HTTPS.

Unfortunately, DSA has an important weakness that RSA doesn't: an entropy failure leaks your private key. If you used a machine affected by the Debian entropy bug then, in that time, messages that you encrypted with RSA can be broken. But if you signed anything with a DSA key, then your private key is compromised.

The randomness in DSA is absolutely critical. Given enough signatures, leaking just a handful bits per signature is sufficient to break it. In the limit, you can make the make the mistake that Sony did and not understand that the random number needs to be generated for each message and, seemingly, just pick one and code it in. (See XKCD.)

But it doesn't need to be this way! All that is required of the nonce is that it be unique for each distinct message and secret, and we can achieve that by hashing the message and private key together. That's what Ed25519 does and I've added the option to OpenSSL to do the same. Unlike RSA and Ed25519 signatures, DSA signatures are probabilistic - signing the same message twice with the same key will result in different signatures. Since someone may be depending on that, OpenSSL also hashes in some randomness to maintain that feature.

(p.s. if you're an actual cryptographer, please take a look at the core of the code and let me know if I'm an idiot.)

Since the DSA specification says that the nonce must be randomly generated, and compliance with the spec is important for many users, this isn't enabled by default. For ECDSA keys, one needs to call EC_KEY_set_nonce_from_hash, for example. But hopefully we can measurably improve things with little effort by doing this.


Here's an example of breaking DSA with a known nonce: essentially the Sony PlayStation hack. I've used the variable names from the Wikipedia page for those that want to follow along, although I'm not using a subgroup to keep things simple.

(The example is done with Sage.)

Let's pick some (public) DSA parameters:

p = 2903
F = GF(p)
g = F(2)
n = g.multiplicative_order()

Next, generate a key pair. x is the private key and y is the public key:

x = int(F.random_element()) % n
y = g^x
(x, y)
(1282, 966)

Generate our nonce, which will be fixed for two different messages in this example:

k = int(F.random_element()) % n
kInv = inverse_mod(k, n)

The first message that we sign will be 42 and the pair (r, s) is the signature. (Normally the message would be a hash of a much larger message.) The signature is very simple: r is gk and s is (m + xr)/k:

m = 42
r = int(g^k) % n
s = ((m + x*r) * kInv) % n
(r, s)
(1401, 1168)

As an aside, we can verify the signature that we just made:

w = inverse_mod(s, n)
u1 = m*w
u2 = r*w
v = g^u1*y^u2
v == r

Next, we also sign the message 24 (which I'll call mm) and we gather the two signatures. We have s and ss, both with the same r. Since s = (m + xr)/k, we can subtract the two to get (m + xr - mm - xr)/k, which is just (m - mm)/k. Thus k = (m - mm)/(s - ss). Given k, we can rearrange the original equation for s: s = (m + xr)/k ⇒ sk = m + xr ⇒ sk - m = xr ⇒ (sk - m)/r = x. And thus we have the private key:

(r, s) = (1401, 1168)
(r, ss) = (1401, 1212)
kk = ((42 - 24)*inverse_mod(s-ss,n)) % n
xx = ((s*kk - 42) * inverse_mod(r, n))%n
print xx == x

Faster curve25519 with precomputation

(Update: based on questions, I've clearly failed to point out what this is, or rather what it isn't. Fixed base multiplication only applies when performing many operations on the same public key, i.e. key generation or perhaps some other cases. If you're working with random public keys then precomputaion cannot help I'm afraid.)

Diffie-Hellman is one of the most important public-key cryptographic primitives, even if it doesn't have the name recognition of, say, RSA. When anything is providing forward secrecy then it's very likely using Diffie-Hellman. In fact, the reason that web servers do RSA operations at all is mostly a legacy hangover. If we could design things afresh then there would be a lot of more Diffie-Hellman and everything would be a lot faster!

Of the concrete implementations of Diffie-Hellman, curve25519 is the fastest, common one. There are some faster primitives in eBACS, but the ones that are significantly faster are also significantly weaker. My machine (E5-2690, with Hyperthreading and TurboBoost disabled) can do a curve25519, arbitrary base point, scalar multiplication (a.k.a. Diffie-Hellman key agreement) in just 67.5µs.

But nobody will complain if it gets faster, right?

The core of Diffie-Hellman is calculating xy in some group. The simplest way to calculate xy is the classic square-and-multiply algorithm:

(Note: I'll be using multiplicative notation throughout because I think it's more familiar to most people, although additive is more common for the specific groups that we're dealing with here.)

Start with the number 1, which is x0 (because anything to the power of zero is 1). If we multiply by x then we get x1. So, by multiplying by x we've added one to the exponent. If we square our new number then we get x2 - squaring doubles the exponent.

Now consider the exponent as a binary number: x1 is x1b and squaring it gets us x10b. Squaring is a left shift of the exponent! Now, given any y in binary form, we can calculate xy with a series of additions by 1 and left shifts - i.e. multiplies by x and squarings.

If we have y = 1011b then we start with x0 = 1, as always, and read y left-to-right: the first bit is a one so we multiply by x to get x1b. Left shift: x10b. We already have the 0 in there so left shift again: x100b. Multiply by x: x101b. Left shift: x1010b. Multiply by x: x1011b. Done.

With square-and-multiply, the bits of the desired exponent are clocked in, bit by bit, as the squarings do the left shifting.

This is essentially what curve25519 does. It has lots of tricks to make it fast but, if you're calculating lots of powers of the same x then there are a number of tricks to make things faster. I'm only going to cover a couple of basic ones here.

Firstly, one can precalculate x1, x2, x3, … x15. Now, by multiplying by one of those values we can add any value from 0 to 15 to the exponent in a single multiplication. We're essentially clocking in four bits at a time. After one multiplication we have to do four squarings to left shift 4 bits to make space for the next 4 bits to be clocked in. This reduces the number of multiplications by a factor of 4.

Next, imagine that our exponents were always 8-bits long. We could precalculate x00000000b, x00000001b, x00010000b and x00010001b. By multiplying by one of those values we can clock in bits in two positions at once. Previously we always clocked in bits at the right hand side of the exponent, but those precomputed powers mean that we can clock in bits at the forth bit too. Thus we deal with the exponent as two, 4-bit values and the squarings are shared between them. We only have to square and multiply four times to cover the whole 8-bit value.

In cryptography, the powers are usually larger than 8-bits, of course, but you can clock in at 4 or 8 positions if you wish. It's a classic time/memory trade-off because more positions mean exponentially more precomputed values.

(There's more and better tricks than the two that I've just outlined, some are detailed on this Wikipedia page.)

Precomputation in curve25519

The reason that such precomputation hasn't been applied to curve25519 before is (I think), because of a detail that I glossed over before: curve25519 is actually a Montgomery curve that uses differential addition. Rather than being able to multiply arbitrary values, a and b, you instead need to also know the value of a/b first. That throws a spanner in the works of the typical optimisations.

If you review elliptic curves, you'll recall that operations on the elliptic curve group consist of operations on the underlying field. In curve25519's case it's the field GF(2255-19), which is where it gets its name from. We can roughly measure the amount of time that something will take by the number of field operations required. (Note to practitioners: I'm taking a field squaring as 0.8 multiplications and one multiplication is an ‘operation’.)

A curve25519 exponent is 255 bits long. Each squaring takes 3.6 operations and each multiplication by x takes 4.6 operations. 255*(3.6+4.6) is 2091 operations. We also need a final conversion (a field inversion) that costs 215.2 operations for a total of 2306.2. That's the number to beat.

In order to work in a group which has nice, arbitrary multiplication operations we map curve25519 to an isomorphic, Twisted Edwards curve. In fact, exactly the same curve as used by Ed25519, which is a fast signature primitive. This also means that we get to use Ed25519's highly optimised code. In order to be able to use as much code as possible, we use exactly the same precomputation scheme as described in section 4 of the paper.

This scheme is based on 256-bit exponents, which it splits into 64, 4-bit chunks. But it uses lots of precomputed values! For every other 4-bit chunk, it precomputes all 16 possible values. So the first subtable contains the 16 values x0000b, x0001b, x0010b, …, x1111b. The next subtable contains the 16 values for the next but one 4-bit chunk. So that's x0000 0000 0000b, x0001 0000 0000b, x0010 0000 0000b, …, x1111 0000 0000b. It has 32 of those subtables!

It picks one value from each subtable and, with 32 multiplications, takes care of half the bits of the exponent. Then it squares four times to left-shift those values into place and does another 32 multiplications from the same subtables to take care of the other half of the bits. That's a total of 64 multiplications and 4 squarings.

(It actually uses a trick which means that it only needs to store 8 of the 16 values in each subtable, but that's outside the scope of this post.)

We're in a different group now and so the costs are different: multiplications cost 6 operations and squarings cost 7. (The field is the same so an ‘operation’ takes the same time.) 64*6 + 4*7 = 412, so that's looking good. We need another field inversion to finish up, same as curve25519, for a total cost of 627.2. Those rough costs suggest that the Twisted Edwards precomputation should be 3.6x faster, which is promising!

(Updated: I originally suggested that two field inversions were required because I'm an idiot. Thanks to @CodesInChaos for asking why.)

The actual timings

Based on the amd64-64-24k implementation of Ed25519 in SUPERCOP, it's fairly easy to implement. On the same machine as the curve25519 speed was measured, above, the precomputed curve25519 runs in 21.9µs, a 3.1x speedup with 24KB of tables. So it's not as fast as the rough calculations suggested, but that's because we're now processing 24KB of memory. That cache pressure isn't free of course, although benchmarks make it look mostly free because they run in a tight loop and so the tables will always be in L1 cache. Still, it means that CPU can sustain 350,000 public key operations per second.

(Code still too messy to release. I hope to tidy it up next week.)


Since its inception, SPDY has depended on a TLS extension called NPN. NPN allows a TLS connection to negotiate which application-level protocol will be running across it.

NPN allows SPDY to be enabled efficiently. If we had run SPDY on a different port, then we would have had to be constantly creating probing connections to see whether a site supported SPDY as well as HTTPS. Even if we knew that a site supported SPDY, network devices between any given client and that site might block connections to the different TCP port. If we had tried an HTTP Upgrade header, that would have slowed everything down and caused compatibility issues with servers and proxies that didn't process the header correctly.

NPN also allows us to update SPDY without spending round trips on a version negotiation. Overall, NPN has worked very well for SPDY.

NPN attempted to be a little bit future proof by sending the selected application protocol name under encryption, so that network devices couldn't discriminate. The benefit was somewhat limited because the server's list of supported protocols was still sent in the clear but we believe that anything that can be encrypted, should be encrypted.

There is an alternative to NPN: ALPN is essentially the same design except that the negotiation is done in the clear (like other TLS extensions).

Last Friday, at IETF 86 in Orlando, the TLS working group considered both designs and came to a rough consensus on ALPN. ALPN is currently on track to be published as an RFC at some point and we will be switching SPDY over to it and deprecating NPN.

Once IANA has assigned a TLS extension number for ALPN, Google servers will start supporting both NPN and ALPN, with a preference for ALPN. Chrome and, I expect, other browsers will start sending both NPN and ALPN extensions. During this time, SPDY servers will be able to switch from NPN to ALPN without dropping SPDY support for current clients.

At some point after the end of 2014, I plan on removing NPN support from Chrome and Google servers. Any old servers and clients will continue to function just fine: they'll just use HTTPS.

Lucky Thirteen attack on TLS CBC

In an upcoming paper (made public this morning), Nadhem AlFardan and Kenny Paterson describe another method of performing Vaudenay's attack on CBC as used in TLS. Firstly I'd like to thank the researchers for notifying the various vendors ahead of time so that patches could be prepared: the disclosure process has gone very smoothly in this case. I couldn't have asked for anything more - they did everything right.

Vaudenay's attack requires an attacker to be able to detect when a CBC padding check has succeeded, even if the authentication check then fails. Authentication should always be applied after encryption to avoid this, but TLS famously did it the wrong way round with CBC mode.

Knowing whether a padding check succeeded reveals information about the decrypted plaintext. By tweaking the ciphertext over many trials, it's possible to progressively decrypt an unknown ciphertext. For details, see their paper, which does a better job of explaining it than I would.

Vaudenay first used the fact that these two situations (padding check failure and authenticator failure) resulted in different TLS alert values, although that didn't result in a practical attack because TLS errors are encrypted. Once that was corrected (by specifying that the same alert value should be sent in each case) the next year's paper used a timing-side channel: if the authentication check wasn't performed when the padding check failed then, by timing the server's response, an attacker could tell whether the server aborted processing early.

To try and remove that side-channel, TLS stacks perform the authentication check whether or not the padding check succeeds. However, here's what I commented in the Go TLS code:

Note that we still have a timing side-channel in the MAC check, below. An attacker can align the record so that a correct padding will cause one less hash block to be calculated. Then they can iteratively decrypt a record by breaking each byte. However, our behavior matches OpenSSL, so we leak only as much as they do.

That pretty much sums up the new attack: the side-channel defenses that were hoped to be sufficient were found not to be (again). So the answer, this time I believe, is to make the processing rigorously constant-time. (Details below.)

As a practical matter, since a padding or authenticator check failure is fatal to a TLS connection, performing this attack requires a client to send the same plaintext secret on thousands of different connections to the same server. This isn't a trivial obstacle but it's possible to meet this requirement for cookies with a browser and bit of Javascript injected into any origin in the same session.

For DTLS the attack is much easier because a rejected record doesn't cause the connection to be destroyed and the same authors developed a method for amplifing timing attacks against DTLS in a previous paper

Unfortunately, unlike BEAST, this isn't something that we can unilaterally fix on the client side (except by disabling CBC mode, which might have significant compatibility issues.) On the server side, making RC4 the most preferable cipher solves the problem although I do recommend that server admins apply patches as they become available even so. Having RC4 as an option has been a useful fallback for CBC problems but, if patches aren't applied, then RC4 becomes a monoculture in the future.

Implementing constant time CBC decoding

Fixing this problem in a backwards compatible manner is, sadly, really rather complex. What need to ensure that all CBC records are authenticated in constant time. Initially we define constant-time to mean that, for all values of the secret input, the trace of instructions executed and memory locations accessed is identical (on an in-order machine). Here the secret input is the decrypted CBC record before the padding and MAC have been checked.

Meeting this definition is tough and there's a great temptation to remove some of the more glaring timing leaks and hope that the remainder is small enough not to be practical. But the DTLS trick mentioned above can be used to amplify any timing side-channels. Additionally, situations change and implementations that are currently only exposed over a network may, in the future, be exposed to attackers running on the same machine. At that point, cache-line probing and other powerful timing probes can be brought to bear against any side-channels that remain.

For the purposes of this post, we consider the decrypted, CBC record to contain the following parts:

  1. n bytes of plaintext. (Which may be compressed but, within our scope, it's plaintext.)
  2. mac_size bytes of MAC. (Up to 48 bytes in TLS with SHA-384. We conservatively assume that the minimum number of MAC bytes is zero since this is clearly a valid lower-bound and accounts for truncated MAC extensions.)
  3. padding_length bytes of padding.
  4. One byte of padding length. (Some consider this to be part of the padding itself, but I'll try to be consistent that this length byte is separate.)

With this secret data in hand, we need to perform three steps in constant time:

  1. Verify that the padding bytes are correct: the padding cannot be longer than the record and, with SSLv3, must be minimal. In TLS, the padding bytes must all have the same value as the length byte.
  2. Calculate the MAC of the header and plaintext (where the header is what I'm calling the additional data that is also covered by the MAC: the sequence number etc).
  3. Extract the MAC from the record (with constant memory accesses!) and compare it against the calculated MAC.

We will make extensive use of bitwise operations and mask variables: variables where the bits are either all one or all zero. In order to create mask values it's very useful to have a method of replicating the most-significant-bit to all the bits in a value. We'll assume that an arithmetic right shift will shift in the MSB and perform this operation for us:

#define DUPLICATE_MSB_TO_ALL(x) ( (unsigned)( (int)(x) >> (sizeof(int)*8-1) ) )
#define DUPLICATE_MSB_TO_ALL_8(x) ( (uint8_t)( (int8_t)(x) >> 7) )

However, note that the C standard does not guarantee this behaviour, although all CPUs that I've encountered do so. If you're worried then you can implement this as a series of logical shifts and ORs.

We'll define some more utility functions as we need them.

SSLv3 padding

SSLv3 padding checks are reasonably simple: we need only to test that the padding isn't longer than the record and that it's less than a whole cipher block. We assume that the record is at least a byte long. (Note that the length of the whole record is public information and so we can fast reject invalid records if we're using that length alone.) We can use the fact that, if a and b are bounded such that the MSB cannot be set, then the MSB of a - b is one iff b>a.

padding_length = data[length-1];
unsigned t = (length-1)-padding_length;
unsigned good = DUPLICATE_MSB_TO_ALL(~t);
t = block_size - (padding_length+1);

The resulting value of good is a mask value which is all ones iff the padding is valid.

TLS padding

Padding in TLS differs in two respects: the value of the padding bytes is defined and must be checked, and the padding is not required to be minimal. Therefore we must always make memory accesses as if the padding was the maximum length, with an exception only in the case that the record is shorter than the maximum. (Again, the total record length is public information so we don't leak anything by using it.)

Here we use a mask variable (mask), which is all ones for those bytes which should be part of the padding, to discard the result of checking the other bytes. However, our memory access pattern is the same for all padding lengths.

unsigned padding_length = data[length-1];
unsigned good = (length-1)-padding_length;
good = DUPLICATE_MSB_TO_ALL(~good);
unsigned to_check = 255; /* maximum amount of padding. */
if (to_check > length-1) {
        to_check = length-1;

for (unsigned i = 0; i < to_check; i++) {
        unsigned t = padding_length - i;
        uint8_t mask = DUPLICATE_MSB_TO_ALL(~t);
        uint8_t b = data[length-1-i];
        good &= ~(mask&(padding_length ^ b));

If any of the padding bytes had an incorrect value, or the padding length was invalid itself, one of more of the bottom eight bits of good will be zero. Now we can map any value of good with such a zero bit to 0, and leave the case of all ones the same:

good &= good >> 4;
good &= good >> 2;
good &= good >> 1;
good <<= sizeof(good)*8-1;
good = DUPLICATE_MSB_TO_ALL(good);

In a very similar way, good can be updated to check that there are enough bytes for a MAC once the padding has been removed and then be used as a mask to subtract the given amount of padding from the length of the record. Otherwise, in subsequent sections, the code can end up under-running the record's buffer.

Calculating the MAC

All the hash functions that we're concerned with are Merkle–Damgård hashes and nearly all used in TLS have a 64-byte block size (MD5, SHA1, SHA-256). Some review of this family of hash functions is called for:

All these hash functions have an initial state, a transform function that takes exactly 64 bytes of data and updates that state, and a final_raw function that marshals the internal state as a series of bytes. In order to hash a message, it must be a multiple of 64 bytes long and that's achieved by padding. (This is a completely different padding than discussed above!)

In order to pad a message for hashing, a 0x80 byte is appended, followed by the minimum number of zero bytes needed to make the length of the message congruent to 56 mod 64. Then the length of the message (not including the padding) is written as a 64-bit, big-endian number to round out the 64 bytes. If the length of the message is 0 or ≥ 55 mod 64 then the padding will cause an extra block to be processed.

In TLS, the number of hash blocks needed to compute a message can vary by up to six, depending on the length of the padding and MAC. Thus we can process all but the last six blocks normally because the (secret) padding length cannot affect them. However, the last six blocks need to be handled carefully.

The high-level idea is that we generate the contents of each of the final hash blocks in constant time and hash each of them. For each block we serialize the hash and copy it with a mask so that only the correct hash value is copied out, but the amount of computation is constant.

SHA-384 is the odd one out: it has a 128 byte block size and the serialised length is twice as long, although it is otherwise similar. The example below assumes a 64 byte block size for clarity.

image/svg+xml 0x80 0x00 ... application data 0x00 ... length bytes index_a index_b c

We calculate two indexes: index_a and index_b. index_a is the hash block where the application data ends and the 0x80 byte is put. index_b is the final hash block where the length bytes are put. They may be the same, or index_b may be one greater. We also calculate c: the offset of the 0x80 byte in the index_a block.

The arithmetic takes some thought. One wrinkle so far ignored is that additional data is hashed in prior to the application data. TLS uses HMAC, which includes a block containing the masked key and then there are 13 bytes of sequence number, record type etc. SSLv3 has more than a block's worth of padding data, MAC key, sequence etc. These are collectively called the header here. The variable k in the following tracks the starting position of the data for the constant time blocks and includes this header.

for (i = num_starting_blocks; i <= num_starting_blocks+varience_blocks; i++) {
        uint8_t block[64];
        uint8_t is_block_a = constant_time_eq(i, index_a);
        uint8_t is_block_b = constant_time_eq(i, index_b);
        for (j = 0; j < 64; j++) {
                uint8_t b = 0;
                if (k < header_length) {
                        b = header[k];
                } else if (k < data_plus_mac_plus_padding_size + header_length) {
                        b = data[k-header_length];

                uint8_t is_past_c = is_block_a & constant_time_ge(j, c);
                uint8_t is_past_cp1 = is_block_a & constant_time_ge(j, c+1);
                /* If this is the block containing the end of the
                 * application data, and we are at, or past, the offset
                 * for the 0x80 value, then overwrite b with 0x80. */
                b = (b&~is_past_c) | (0x80&is_past_c);
                /* If this the the block containing the end of the
                 * application data and we're past the 0x80 value then
                 * just write zero. */
                b = b&~is_past_cp1;
                /* If this is index_b (the final block), but not
                 * index_a (the end of the data), then the 64-bit
                 * length didn't fit into index_a and we're having to
                 * add an extra block of zeros. */
                b &= ~is_block_b | is_block_a;

                /* The final eight bytes of one of the blocks contains the length. */
                if (j >= 56) {
                        /* If this is index_b, write a length byte. */
                        b = (b&~is_block_b) | (is_block_b&length_bytes[j-56]);
                block[j] = b;

        md_transform(md_state, block);
        md_final_raw(md_state, block);
        /* If this is index_b, copy the hash value to |mac_out|. */
        for (j = 0; j < md_size; j++) {
                mac_out[j] |= block[j]&is_block_b;

Finally, the hash value needs to be used to finish up either the HMAC or SSLv3 computations. Since this involves processing an amount of data that depends only on the MAC length, and the MAC length is public, this can be done in the standard manner.

In the case of SSLv3, since the padding has to be minimal, only the last couple of hash blocks can vary and so need to be processed in constant time. Another optimisation is to reduce the number of hash blocks processed in constant time if the (public) record length is small enough that some cannot exist for any padding value.

The above code is a problem to implement because it uses a different hash API than the one usually provided. It needs a final_raw function that doesn't append the hash padding, rather than the common final function, which does. Because of this implementers may be tempted to bodge this part of the code and perhaps the resulting timing side-channel will be too small to exploit, even with more accurate local attacks. Or perhaps it'll be good for another paper!

Extracting the MAC from the record.

We can't just copy the MAC from the record because its position depends on the (secret) amount of padding. The obvious, constant-time method of copying out the MAC is rather slow so my thanks to Emilia Kasper and Bodo Möller for suggesting better ways of doing it.

We can read every location where the MAC might be found and copy to a MAC-sized buffer. However, the MAC may be byte-wise rotated by this copy so we need to do another mac_size2 operations to rotate it in constant-time (since the amount of rotation is also secret):

unsigned mac_end = length;
unsigned mac_start = mac_end - md_size;
/* scan_start contains the number of bytes that we can ignore because the
 * MAC's position can only vary by 255 bytes. */
unsigned scan_start = 0;
if (public_length_inc_padding > md_size + 255 + 1) {
        scan_start = public_length_inc_padding - (md_size + 255 + 1);
unsigned char rotate_offset = (mac_start - scan_start) % md_size;

memset(rotated_mac, 0, sizeof(rotated_mac));
for (unsigned i = scan_start; i < public_length_inc_padding; i += md_size) {
        for (unsigned j = 0; j < md_size; j++) {
                unsigned char mac_started = constant_time_ge(i + j, mac_start);
                unsigned char mac_ended = constant_time_ge(i + j, mac_end);
                unsigned char b = 0;
                if (i + j < public_length_inc_padding) {
                        b = data[i + j];
                rotated_mac[j] |= b & mac_started & ~mac_ended;

/* Now rotate the MAC */
memset(out, 0, md_size);
for (unsigned i = 0; i < md_size; i++) {
        unsigned char offset = (md_size - rotate_offset + i) % md_size;
        for (j = 0; j < md_size; j++) {
                out[j] |= rotated_mac[i] & constant_time_eq_8(j, offset);

Finally, the MAC should be compared in constant time by ORing a value with the XOR of each byte in place from the calculated and given MACs. If the result is zero, then the record is valid. In order to avoid leaking timing information, the mask value from the padding check at the very beginning should be carried throughout the process and ANDed with the final value from the MAC compare.

Limitations of our model

We started out by defining constant-time with a specific model. But this model fails in a couple of ways:

Firstly, extracting the MAC as detailed above is rather expensive. On some chips all memory accesses within a cache-line are constant time and so we can relax our model a little and use a line-aligned buffer to perform the rotation in when we detect that we're building on such a platform.

Secondly, when I sent an early version of the OpenSSL patch to the paper's authors for them to test, they reported that they found a timing difference! I instrumented the OpenSSL code with the CPU cycle counter and measured a median time of 18020 cycles to reject a record with a large amount of padding but only 18004 cycles with small padding.

This difference may be small, but it was stable. To cut a long story short, it came from this line:

unsigned char rotate_offset = (mac_start - scan_start) % md_size;

When the padding was larger, the MAC started earlier in the record (because the total size was the same). So the argument to the DIV instruction was smaller and DIV, on Intel, takes a variable amount of time depending on its arguments!

The solution was to add a large multiple of md_size to make the DIV always take the full amount of time (and to do it in such a way that the compiler, hopefully, won't be able to optimise it away). Certainly 20 CPU cycles is probably too small to exploit, but since we've already “solved” this problem twice, I'd rather not have to worry in the future.

Real World Crypto 2013

(These are my notes for a talk that I gave last week at Real World Crypto. The premise of the conference is that it brings together theoretical cryptographers and practitioners. So this talk is aimed at theoretical cryptographers but it's fairly simple as I don't have anything complex worth saying to real cryptographers! Slides for other talks are linked from the program page and Rogaway's are relevant to this talk.

Note that this isn't a transcript: I actually say more words than this, but it contains the main points.)

Hello all.

For those who don't know me, I'm Adam Langley. I work at Google, mostly on our serving side HTTPS infrastructure these days. But I also still do some work on Chrome's SSL stack from time to time.

When I was asked to come up with a title of a talk I picked “Things that bit us, things we fixed and things that are waiting in the grass” with reference to HTTPS. Partly because that's what I know about but also because HTTPS is the only place where most people knowingly interact with public crypto, so it's one of the few examples of real world crypto at really large, really messy scales.

I could also have titled this talk “Know your enemy” because your real enemy is not trying to achieve an advantage with greater than negligible probability. As I hope to convince you, your enemy is me. I am the idiot who will mess up the implementation of your lovely cryptosystem. Your 128-bit security level is worthless in the face of my stupidity and so I'm here to seek your help in making my life easier and, therefore, everything more secure.

But before we get into anything crypto related, I'd like to set the scene. By and large, transport security on the Internet is doing OK because few people bother to attack it. If you want to steal banking credentials, you can get malware toolkits that will work on a large enough fraction of machines to be useful - e.g. using the Java exploit made public this week. If you want to steal passwords, you can SQL inject the site and reverse the hashes (if the site even bothered to hash), or just use the Ruby on Rails exploit made public this week. Given the level of password reuse, it doesn't even have to be the same site! Economically, attacking the transport doesn't make sense for many attackers and so they don't do it.

If you do want to attack the transport, by far the best methods are SSL stripping, mixed scripting and insecure cookie vulnerabilities.

SSL stripping means that, when the user types in, since the default scheme is unencrypted, the browser makes an insecure request. The attacker can simply answer that request and proxy the entire site while removing any attempts to upgrade to HTTPS. In the majority of cases, the user will never notice.

Mixed scripting results from pages that source Javascript, CSS or plugins from an HTTP URL. The request is made insecurely but the response is trusted with the same authority as the HTTPS origin that requested them. On sites that serve both HTTP and HTTPS, mixed scripting is endemic.

Lastly, cookies are secrets that the client sends to authenticate requests. If, when creating a cookie, the server doesn't set the secure flag, the client will also send the same cookie over unencrypted connections. Since forgetting to set that flag doesn't break anything in normal operation, it happens fairly often.

Oh, and all that is assuming that any transport security is used at all. HTTP is still much more common than HTTPS.

I've breezed over those issues, but the important point is that none of them involve crypto, they're all systems issues. On the whole, the crypto is doing great in comparison to everything else!

I'll cycle back to those issues towards the end but I wanted to give some perspective as I'll be talking about the crypto a lot, since this is Real World Crypto, but crypto is only a small part of web security.

When writing this talk I sat down and made a list of the cryptographic issues that have bitten HTTPS over the past few years and tried to group in order to make points that I think would be useful for the research community. I didn't end up using them all, so this is hardly a exhaustive list and I used some things twice, but here they are:

Go beyond a paper

  1. CRIME (compression leaks.)
  2. BEAST (CBC vs adaptive, chosen plaintext.)

The first group of issues I called ‘go beyond a paper’:

The CRIME attack resulted from the attacker having partial, chosen plaintext abilities on a channel which was performing compression. This applied to both TLS, which supports compression, and SPDY, an HTTP replacement that runs over TLS that applied compression internally. Most, major HTTPS sites don't support TLS compression so SPDY was the bigger hole, but SPDY is a relatively new protocol and the idea that compression can leak information isn't. I can easily find it going back ten years (“Compression and Information Leakage of Plaintext”, Kelsey, 2002) and here's what I said on the SPDY mailing list before the CRIME work:

With a good model of zlib, I think you could extract a ~40 byte cookie with ~13K requests. That's a practical attack and would make a great paper if someone has the time.

Of course, I never had the time but we did start working on a better compression scheme to solve the issue.

But SPDY was done by smart networking folks. Since the compression leak issues weren't known widely enough at the time, they picked gzip as a seemingly reasonable compression algorithm. It wasn't clearly stupid to compose TLS and gzip and yet it blew up when they did so. Had the Kelsey paper instead been a splashy, public demonstration, as CRIME was, then it's possible that the idea would have sunk into the collective memory to the point where simply using gzip wouldn't have seemed so reasonable. As it is, we're now using a horrible gzip hack to segment cookies and other sensitive headers from the attacker controlled values, although the replacement compression is mostly done, pending SPDY 4.

So the first idea in this section is that it's OK to make a splash in order to get some new idea into the minds of the developer community in general.

Somewhat similarly, the origins of the BEAST attack dates back to at least 2002 in the context of SSH (Möller, 2002).

In this case, Rizzo and Duong contacted the major SSL stacks prior to their work going public with a clear case that it was worth looking into. This, at least from my point of view, is very welcome! Please do this if you can. You can contact me if nothing else and I can rope in all the other usual suspects. Please set a clear go-public date and allow us to share the facts of the matter with other major vendors.

In the case of BEAST, this produced a rare example of major browsers breaking things together for the common good. The breakage wasn't trivial: we took out Disneyland Tokyo's online ticking system amongst many others, but we got it deployed.

So the second point of this group is that you should consider treating it like a security disclosure if it's warranted. We don't bite!

The world often sucks

  1. Hash collisions in MD5
  2. BEAST
  3. Downgrade attacks

The second group I put together under the title ‘the world often sucks’.

I'm afraid that sometimes it takes a while to make changes even if we are given a very clear demonstration of the issue. After a very clear demo of MD5 collisions causing vulnerabilities in certificate issuance in 2008 it still took years to remove support for it. Other remedial actions were taken: public CAs were required to use random serial numbers and stopped using MD5 for new certificates. But it wasn't until early 2012 that Chrome removed support for MD5 in signatures. Sadly, many MITM proxies still used MD5 and, despite giving them lots of notice, they didn't do anything and we broke quite a few people with that change.

Of course, later in the year it was found that Flame broke Microsoft's code signing with an MD5 collision. Maybe that'll be enough to convince people to move.

But the point of this section is to encourage people to think about workarounds because the ‘right fix’ often isn't feasible. For hash collisions we have randomised serial numbers and although we might like everyone to be using SHA-256 or SHA-3, realistically those randomised serial numbers are going to be buttressing SHA-1 for quite some time to come.

Marc Stevens talked earlier about how to detect SHA-1 collisions given only one of the colliding messages. That's great! That's absolutely something that we can put into certificate validation now and hopefully prevent problems in the future.

In relation to BEAST: I mentioned that the core weakness had been known since 2002 and, because of that, there was even a workaround in OpenSSL for it: empty fragments. By sending an empty CBC fragment before each real record, the MAC would effectively randomise the IV. However, empty fragments caused compatibility issues because some SSL stacks returned a zero length when encountering them and higher layers of the code took this to be EOF. Because of that, the workaround was never widely enabled.

In the course of discussing BEAST another workaround was proposed: 1/n-1 record splitting. Rather than putting an empty fragment before each record, include a single byte of the plaintext. The protection isn't quite as good, but it solves the EOF problem. Some servers still broke because they assumed that the complete HTTP request would come in a single read, but the lower rate of problems probably made BEAST mitigation viable.

Lastly, there's the first of our snakes in the grass (problems that are still pending to bite us): SSLv3 downgrade. Since there exist so many broken servers and network middleboxes on the Internet that can't handle TLS version negotiation, in the event of a TLS handshake failure browsers will fall back to SSLv3. However, SSLv3 doesn't support all of the features that TLS does. Most significantly from my point of view, it doesn't support ECDHE, which Google servers use. So a network attacker can trigger a fallback and downgrade a capable client to a non-forward secure ciphersuite.

This is obviously not good. The correct fix is to remove the fallback to SSLv3 of course, but that's sadly not viable right now. Instead, as a workaround, Yngve (formerly of Opera) suggested using the TLS renegotiation extension as a signal that a server is reasonably recent and therefore we shouldn't have performed the fallback.

Numbers from Chrome indicate that a high fraction of the servers that we perform fallback for are renego patched, suggesting that's a bad signal and we should instead create a different one. Although maybe the number of fallbacks is dominated by transient network problems and that's skewing the data. Eric Rescorla has suggested replicating the TLS version negotiation using ciphersuite values. It's something that we will hopefully address in one way or another in 2013.

So that's the point of this group: please consider workarounds because we can't always manage to deploy the ‘right’ fix.

Side-channels are a big deal.

  1. RSA PKCS#1 v1.5 padding oracles ("Million message attack")
  2. CBC padding oracles (Vaudenay's attack)
  3. Timing attacks against RSA CRT.
  4. Side-channel attacks against AES
  5. Side-channel attacks against group operations

This big group is about the fact that side channels are a big deal. Writing constant-time code is a very odd corner of programming and not something that can be easily tested. I have a valgrind hack called ctgrind that allows for some automated testing, but it certainly has its limitations.

But what does constant-time even mean? There's the obvious definition: the CPU runs for the same amount of time, independent of any secret inputs. But CPUs are preemptively multitasked and have frequency scaling and thermal limiting these days so that's not a very useful definition in practice. A more workable definition is that if the code were to be run on an abstract Von Neumann machine, then the trace of memory fetches is identical for all secret inputs. That means that the trace of instructions fetched and data accesses is constant for all secret inputs. That takes care of all the obvious problems.

In practice, it can be useful to relax the definition a little and require only that the set of cache lines accessed for data is constant, rather than the exact addresses. In practice CPUs often fetch whole cache lines at a time and using that fact can lead to speedups at the cost of having to know the cache line length of the CPU.

This model assumes that the individual instructions themselves are constant time. As a research topic it would be interesting to know how variable time CPU instructions affect this. For example, from the Intel optimisation manual:

The latency and throughput of IDIV in Enhanced Intel Core micro-architecture varies with operand sizes and with the number of significant digits of the quotient of the division.

Is this a problem, cryptographically? I don't know. I think multiplies on ARM are variable time too. They are on PowerPC, but that's a much more obscure platform.

As a researcher, what's there to keep in mind with respect to constant-time implementations? Firstly, know that Moore's law will never help you. CPUs may get faster, but the amount of data that we need to process is increasing just as fast, if not faster. So you can't assume that the slower, constant time code will become viable in time - you have to be constant time from the start.

Even having a non-constant time implementation is a danger. There are usually two sides to a protocol and I may not control the software on the other side. If I specify AES in a protocol then I have to consider that it may well be a non-constant time implementation. I just made up the term ‘implementation ecosystem’ for this. AES is very difficult to implement securely in software: good implementations are still still topics for research papers. So the implementation ecosystem for AES is terrible! There are lots and lots of side-channel vulnerable implementations out there because no normal person, given the AES spec, will produce a secure implementation.

If we're aiming for a 128-bit security level then that possibility is a much larger problem than many other, more traditional crypto concerns.

So, for new primitives, you may want to produce solid implementations for different platforms to seed the implementation ecosystem. Not just reference implementations, but ones that are good enough that they dominate the set of implementations. For example, if I specify curve25519 in a protocol, I can be pretty sure that everyone is going to be using djb's reference code. That's a major advantage.

You should consider what an implementation is going to look like when designing. Of course, building a solid implementation will make sure that you end up considering this, so that's another good reason. There are certain patterns that are inherently dangerous. Square-and-multiply loops for example. You should recognise that and make sure that even the description of the algorithm includes counter measures. Binary fields are another which are very likely to result in non-constant time code.

Lastly, please don't assume that CPU changes are going to solve your constant-time or performance problems. Intel have added specific instructions for AES and binary fields in their latest chips and, while that does have some benefit, they will be a small fraction of all chips for a very long time. The chance of both sides of a connection having these chips is even smaller.

Cryptographic Room 101.

Room 101 is a fairly long running British TV show where people nominate things they dislike to put into Room 101, which banishes them from the world. Based on the above I've a number of things that I'd to put into Room 101, from least controversial to most. These are more specific points than the general ones that I just made and I'd like to see if anyone in the room disagrees with me!

1. MAC then Encrypt.

I hope that everyone agrees on this. Of course, it's pretty ubiquitous in the world today. Just because I want to banish something doesn't mean that I'm not going to spend years dealing with it in the future!

2. CBC mode.

With all the problems of chosen plaintexts, padding oracles etc I think it's time to get rid of CBC mode forever. Although it's possible to implement it securely, it's been done wrong for so long and is so easy to mess up, I think it's best to get rid of it.

3. ‘Sudden death’ entropy failure: plain DSA.

DSA (and ECDSA) has a very unfortunate property that an entropy failure leaks the private key. Even a slight bias over many signatures can be exploited. This is ridiculous. As Nadia and Debian have demonstrated, entropy failures do happen and they are inherently very bad. But that's not a reason to amplify them! By hashing in the private key and message, this problem can be avoided. So please consider what happens when your nonces are actually ntwices.


I've saved the really controversial one for last!

AES-GCM so easily leads to timing side-channels that I'd like to put it into Room 101. It took a decade of research to produce solid, high-speed, constant time AES implementations and they are very complex. In that decade, many, many non-constant time AES implementations have found their way into everything, poisoning the ecosystem when it comes to using AES.

I haven't seen any research on extracting the key from GHASH, but I've certainly seen vulnerable implementations and there's every reason to believe that it's possible. Most GHASH implementations look like AES implementations from 10 years ago. I'm aware of one, reasonable, constant-time AES-GCM implementation (Käsper and Schwabe, CHES 2009), but it runs at 22 cycles/byte on a Core2.

If you have a recent Intel chip, and software that implements AES-GCM using the specific instructions provided, then it's great. But most chips don't have that and I think it would have been much more preferable to pick an AEAD that's easy to securely implement everywhere and then speed it up on certain chips.

But it's still much better than AES-CBC I guess!

(At this point I cycled back to talking about the larger transport security issues and how browsers are addressing them with HSTS, mixed script blocking etc. But I've written about that before.)

Certificate Transparency

These are my notes for a talk that I gave today at IETF 85. For more details, see the draft.

Certificates are public statements that everyone trusts, but they aren't public record. Other critical details about companies are generally public record, their address, directors etc, but not their public key. Why not? Work like the EFF Observatory has already means that CA customer lists, which they might consider confidential, are public.

If certificates were public record then you would be able to see what CAs are asserting about you, and hopefully correct any mistakes. I would very much like to know the set of valid certificates within at any given time. I know about the real ones, of course, but I don’t know that there aren’t others. At the moment, there's no accountability for CAs unless you get really, publicly caught screwing up.

However, there is a class of certificates that we don't want published: internal names. Some companies get certificates from a real CA for their internal networks and those names should not be published. So we'll keep that in mind as we work out a design.

How might we achieve this? Well, we could have CAs publish the certificates that they issue. That's a good start; and actually useful to some degree, but you'll probably admit that this is weaker than we might like, so how do we make it stronger?

So that's our design goal, now let's talk about the constraints:

Firstly, we aren't going to update every HTTPS server on the planet. Any design must be incrementally deployable and, initially, we are talking about HTTPS. Hopefully any design would be generally applicable, but HTTPS is what we have in mind right now.

Secondly, no blocking lookups. Making calls to a third-party during certificate validation is very convenient, but it's just too costly. It creates latency when it's working and a disaster when it's not. Too many networks filter too much for it to be dependable and we've never managed to make it work for OCSP, so there's little chance that it'll work for this.

So whatever information we need to get to the client has to be in the handshake, and anything that client needs to report back has to be asynchronous.

We had our straw-man proposal where we just ask CAs to publish their certificate stream, but we want to do better because that only works when the CA is simply mistaken. So the problem that we need to address is how do clients know that a certificate they receive has been published at all?

Well, we can have an independent certificate log sign a statement of publication and, based on our constraints, we have to put that signature somewhere in the TLS handshake. We call the statements of publication Signed Certificate Timestamps and, since we don't want to delay certificate issuance, they’re not actually statements of publications, rather they are a promise that the certificate will be published soon.

Now we've just moved the problem from "what if the CA doesn't publish it?" to "what if the log promises to publish something, but doesn't?". We can require SCTs from a quorum of independent logs, and we'll do that, but running an append only log is a job that can be verified by clients.

In order to make a log verifiable, we put the log entries in a Merkle tree, with certificates at the leaves. Each log is an ever growing Merkle tree of certificates and the log periodically signs the root of the tree and publishes it, along with all the entries. But what does `periodically' mean? All logs must publish certificates that they have received within a time called the Maximum Merge Delay. That merge delay is the amount of time that a certificate can be used without being published.

We detect violations of the MMD by having clients verify the log's behaviour. Clients can request from the logs (or a mirror), a path from any certificate to the root of a published Merkle tree. In order to preserve client privacy this request may be made via DNS, if possible, or by requesting paths for a range of certificates rather than a specific one.

Once a client has a signed root, it needs to gossip about it. Clients implicitly trust their software provider so one common answer may be to request a signed statement that the log root has been observed from an auditor service run by them. But these auditors can be run by anyone, and clients can configure their auditor however they like.

Since these client checks are asynchronous, they can be blocked by an attacker. Clients will have to be tolerant to that to some extent because many networks are indistinguishable to an attack. However, if after some amount of time, the client has a certificate that it hasn't been able to check, or a log root that it hasn't verified with its auditor, it should try to publish it in various, unspecified ways. In the extreme, one can imagine asking the user to email a file to an email address.

So to `break' this system, by which I mean to get a certificate that will be silently accepted by clients without the world knowing about it, one needs to compromise a CA, a quorum of logs, and then either partition the client forever, or also compromise the client's auditor.

In addition to auditors, there’s another class of log observers: monitors. Monitors watch the logs for interesting events. We envision at least one, obvious, type of monitor service: an `alerts’ system where anyone can sign-up to receive emails when a certificate is issued in a certain domain.

So now some practical points:

First, how do we get the SCTs (the receipts from the logs) into the TLS handshake without altering every HTTPS server? Well, the one thing that every HTTPS server can do is serve a certificate chain. So we tried a couple of tricks:

One, we tried adding a superfluous certificate to the end of the chain. Most clients will ignore it, but a few smaller ones, and old versions of Android, don't and IIS doesn't like configuring such a chain. So we aren't pushing ahead with that idea.

We also tried putting SCTs into the unsigned portion of the leaf certificate and that works fairly well except for breaking Java. Nonetheless, we may continue to support that as an option.

And options are the way we're going to go with this problem: there doesn't seem to be a good single answer.

So another option is that CAs can do the work for you and put the SCTs into the signed part of the certificate, in an X.509 extension. Of course, the SCT contains a hash of the certificate, and a hash cannot include itself, so CAs issue a `pre-cert' from a special intermediate with a magic EKU that makes it invalid for normal use. The pre-cert can be submitted to the logs to get SCTs for embedding in the real certificate.

Another way in which CAs can do it for you is to put the SCTs in an OCSP response and then one uses OCSP stapling on the server.

Finally, for servers that can be updated, SCTs can be included in an Authorisation Data TLS-extension, and support for that is working its way through OpenSSL and Apache.

We're experimenting with CAs at the moment to find what works and we may whittle down this list in time. But for now we're promiscuous about solutions to this problem.

The logs also have to worry about spam, so they only accept certificate chains that end at a known root. There's a very low bar for the logs to accept a root because it doesn't grant any authority, it's just an anti-spam measure, but we do have to make sure that the logs can’t be spammed to death.

Going back to one of our requirements at the beginning: we don't want to log internal names in certificates. One solution to this is to allow intermediates with name constraints to be logged and then any certificates from there don't need to be. So an intermediate constrained to issue within can be logged and then we don't need to log any leaf certificates within they can't be used against anyone else and, by using an intermediate, its security is up to you.

However, that's a little complex so the initial solution will probably be that we let companies configure the clients to say "don't require logging within these domains". At least for Chrome, people report great success with our Enterprise Policy configuration and that's an easy route for such config.

Lastly, the very tricky question: deployment.

Initially we can deploy in the same manner as pinning and HSTS: require it only for certain, opt-in domains (or possibly for opt-in CAs). But the goal is very much to get it everywhere. That will take years, but we do plan to do it. We need to make the error messages in the browser informative for server admins as well as users because this will be something new for them, although hopefully their CA will just take care of it. We can also enforce it for certificates issued past a certain date, rather than having a flag day. That way the certificate change will be the triggering factor, which is something under the control of the site. Lastly, we'll probably have to do the work to patch software like EJBCA.

We do not yet know exactly who will be running the logs, nor how many we want. But Google expects to run one and that project is staffed. We’re also working with Digicert and Comodo who are experimenting with CT integration and are considering running logs.

Since we can verify the operation of logs, they don't have the same trusted status as CAs. However, the set of logs has to be globally agreed so having to revoke a log would be a pain, so we do want logs that are operationally competent.

There is no doubt that this will be a difficult and lengthy deployment process. But it doesn't seem completely hopeless.

NIST may not have you in mind

A couple of weeks back, NIST announced that Keccak would be SHA-3. Keccak has somewhat disappointing software performance but is a gift to hardware implementations.

This is rather a theme for NIST selected algorithms. AES includes a number of operations on a binary field and CPUs generally don't implement binary field operations because they aren't really useful for anything else. So many software implementations of AES include four, 1024-byte tables of results in that field in order to achieve reasonable performance. However, these table lookups cause a timing side-channel which is fairly effective even over a network. On the same machine, where the CPU cache state can be probed more directly, it's very effective.

In response to this, some AES implementations made their tables smaller to reduce the leak. That's obviously an improvement, but it came at the cost of slowing AES down and AES was never very quick to start off with. And while a smaller leak is better, AES is supposed to be operating at (at least) an 128-bit security level!

Next, the mode recommended for AES by NIST is GCM: it's AES in counter mode coupled with a polynomial authenticator. Cryptographically this is perfectly sensible but the polynomial authenticator uses binary fields again. For a hardware implementation, this is another gift but software implementations need, yet again, tables in memory to implement it at reasonable speed.

Tables not only cause problems with timing side-channels, they also cause slowdowns that don't appear in micro-benchmarks. Tables are great when the benchmark is running with a single key and they are contained in L1 cache. But when running in a larger system, the cache pressure causes slowdowns and the real-world speed of an such an algorithm can be dramatically less than the benchmarks suggest.

But there's no reason why a polynomial authenticator needs to use binary fields. Why not use a field that CPUs actually implement? If you do so you get a very fast authenticator with dramatically improved key agility.

So why does NIST continually pick algorithms that aren't suited to software implementations? One of my colleagues proffers (somewhat jokingly) a conspiracy theory but I suspect that the answer is much more mundane. At USENIX Security this year, Dickie George gave an excellent keynote involving some NSA history and described how they would spend a decade or more designing a cryptographic system from the chip upwards. Everything was custom built, carefully designed and it was a completely closed system. It's certainly nice to be able to do that!

If you're designing such a system, of course you would be orientated towards hardware. Every implementation of your system would be a custom-designed chip; there's no benefit to making something easy to implement on commodity systems. And, in that world, AES, GCM, and SHA3 make a lot of sense. As the GCM paper says, “binary Galois field multiplication is especially suitable for hardware implementations”. I'm assuming that there's a fair amount of shared worldview between NIST and the NSA, but that doesn't seem unreasonable.

(In a NIST paper about the selection of AES they state: “table lookup: not vulnerable to timing attacks.” Oops. If you're building your own hardware, maybe.)

I'm not suggesting that the NSA is actually using AES, GCM etc in their own designs. But a generation or more of cryptographers from that world are probably comfortable with features that were chosen with hardware in mind. Those features, like binary fields, are what they will naturally gravitate towards.

People who aren't in the situation of designing custom chips, but rather have to design protocols for implementation on a menagerie of commodity chips, might want to consider that NIST doesn't appear to have them in mind. Although hardware implementations of AES have existed for a while, unless they are CPU instructions (like AES-NI), then the system-call overhead can cause the acceleration to be slower than the plain CPU for common, 1400-byte, chunk sizes. Even when you have a CPU instruction to implement AES and GCM, it's unlikely that all the chips that your code is going to run on will be so endowed.

Hardware implementations of these, poorly-suited primitives also take chip area and development time away from other tasks. Why not choose primitives that use CPU abilities that lots of code benefits from? Then optimisations of those CPU instructions lifts all boats.

So if you want to pick algorithms designed for CPUs, try Salsa20 rather than AES and Poly1305 rather than GCM. But we don't have a great answer for a modern hash function I'm afraid, and that's a little disappointing. SHA-2 is fine for now, although Zooko has some thoughts on the matter.

(I should note that, thanks to some great work by djb, Peter Schwabe, Emilia Käsper and others, we do have bitsliced, constant-time implementations of AES now that run reasonably fast (for AES). But it's only effective for counter mode, and I believe that it needs long streams of data to achieve its headline speed. (A commenter pointed out that it can be made to work with only 8 blocks. Later, another pointed out that that I'm doubly wrong! We can do it for single blocks in reasonable time.) That's great, but it took 10 years of optimisation work and implies that AES can only be effectively implemented by some of the best practical cryptographers in the world.)

DANE stapled certificates

We've supported DNSSEC stapled certificates in Chrome for a while now and since the DANE RFC has been published, it's about time that I updated that code to support DANE, rather than the hacked up CAA record that it currently uses.

I've written the DANE code, but it also seemed like an opportune time to reevaluate the idea. The promise of DNSSEC, as Dan Kaminsky put it, it that it can reduce the number of meetings required to get something done.

You have a whole bunch of internal hosts that some company that you're partnering with needs to interface with securely? Well, perhaps you used an internal CA, so will they install that root on their systems? (Meeting.) They will if it's name constrained, can we get it reissued with name constraints? (Meeting.) Ok, now can IT get that root pushed out everywhere? (Meeting.) You see how this is going.

But every bit of code needs to justify its complexity and, for small cases, StartSSL will give you free certificates. Perhaps just Chrome doing it isn't very interesting, and everyone hacks up custom solutions anyway.

So, if it's useful to you, you should let me know (agl at chromium dot org).


Last year I happened to worry on the SPDY mailing list about whether sensitive information could be obtained via SPDY's use of zlib for compressing headers. Sadly, I never got the time to follow up and find out whether it was a viable attack. Thankfully there exist security researchers who, independently, wondered the same thing and did the work for me! Today Duong and Rizzo presented that work at ekoparty 2012.

They were also kind enough to let Firefox and ourselves know ahead of time so that we could develop and push security fixes before the public presentation. In order to explain what we did, let's start by looking at how SPDY compressed headers:

(This is inline SVG, if you can't see it, check here.)

? ? ? ? ? ? ? ? : h o s t ? ? ? ? w w w . g o o g l e . c o m ? ? ? ? : m e t h o d ? ? ? ? G E T ? ? ? ? : p a t h ? ? ? ? / ? ? ? ? : s c h e m e ? ? ? ? h t t p s ? ? ? ? : v e r s i o n ? ? ? ? H T T P / 1 . 1 ? ? ? ? a c c e p t ? ? ? ? t e x t / h t m l , a p p l i c a t i o n / x h t m l + x m l , a p p l i c a t i o n / x m l ; q = 0 . 9 , * / * ; q = 0 . 8 ? ? ? ? a c c e p t - c h a r s e t ? ? ? ? I S O - 8 8 5 9 - 1 , u t f - 8 ; q = 0 . 7 , * ; q = 0 . 3 ? ? ? ? a c c e p t - e n c o d i n g ? ? ? ? g z i p , d e f l a t e , s d c h ? ? ? ? a c c e p t - l a n g u a g e ? ? ? ? e n - U S , e n ; q = 0 . 8 ? ? ? ? c o o k i e ? ? ? ? P R E F = I D = 7 0 c b b f e 7 e e 8 8 a 5 e 2 : F F = 0 : T M = 1 3 4 7 4 0 0 4 7 9 : L M = 1 7 8 7 4 1 4 3 9 5 : S = f J q m r k s _ 8 G h 3 i 1 K D ; N I D = 6 3 = k 9 f K d q _ 2 g u F 0 O 1 F 5 g V 6 K 3 C I t F b x d d 2 f D W L g x T H f a Q 3 5 P q 4 S D d d d A i H F 9 G G 9 6 2 3 A J v - W U b U p A h 8 _ 0 Y Z T 6 B H Q N 4 f A B 2 j O 3 _ Y 5 q H 7 w x e d A N v I n J 5 R r h l s i i h M Y q e - s u 1 U 1 O ? ? ? ? u s e r - a g e n t ? ? ? g M o z i l l a / 5 . 0 ( X 1 1 ; L i n u x x 8 6 _ 6 4 ) A p p l e W e b K i t / 5 3 7 . 1 0 ( K H T M L , l i k e G e c k o ) C h r o m e / 2 3 . 0 . 1 2 6 4 . 0 S a f a r i / 5 3 7 . 1 0 ? ? ? ? x - c h r o m e - v a r i a t i o n s ? ? ? 0 C M + 1 y Q E I l b b J A Q i d t s k B C K S 2 y Q E I p 7 b J A Q i 9 t s k B C L u D y g E = ? ? ? ? ? ? ? ? : h o s t ? ? ? ? w w w . g o o g l e . c o m ? ? ? ? : m e t h o d ? ? ? ? G E T ? ? ? ? : p a t h ? ? ? ? / c s i ? v = 3 ? s = w e b h p ? a c t i o n = ? s r t = 2 4 6 ? p = s ? n p n = 1 ? e = 1 7 2 5 9 , 2 3 6 2 8 , 2 3 6 7 0 , 3 2 6 9 0 , 3 5 7 0 4 , 3 7 1 0 2 , 3 8 0 3 4 , 3 8 4 4 9 , 3 8 4 6 6 , 3 9 1 5 4 , 3 9 3 3 2 , 3 9 5 2 3 , 3 9 9 7 8 , 4 0 1 9 5 , 4 0 3 3 3 , 3 3 0 0 0 4 7 , 3 3 0 0 1 1 7 , 3 3 0 0 1 2 5 , 3 3 0 0 1 3 2 , 3 3 0 0 1 3 5 , 3 3 0 0 1 5 7 , 3 3 1 0 0 1 1 , 4 0 0 0 1 1 6 , 4 0 0 0 2 6 0 , 4 0 0 0 2 6 7 , 4 0 0 0 2 7 8 , 4 0 0 0 3 0 8 , 4 0 0 0 3 5 2 , 4 0 0 0 3 5 4 , 4 0 0 0 4 7 2 , 4 0 0 0 4 7 6 , 4 0 0 0 5 1 6 , 4 0 0 0 5 1 9 , 4 0 0 0 5 5 3 , 4 0 0 0 5 9 3 , 4 0 0 0 6 0 5 , 4 0 0 0 6 1 6 , 4 0 0 0 7 6 2 , 4 0 0 0 8 2 5 , 4 0 0 0 8 3 7 , 4 0 0 0 8 4 1 , 4 0 0 0 8 4 9 ? e i = Y r N P U N P t H 8 T C g A f Y 7 I C A C g ? i m c = 2 ? i m n = 2 ? i m p = 2 ? r t =

That's a pretty busy diagram! But I don't think it's too bad with a bit of explanation:

zlib uses a language with basically two statements: “output these literal bytes” and “go back x bytes and duplicate y bytes from there”. In the diagram, red text was included literally and black text came from duplicating previous text.

The duplicated text is underlined. A dark blue underline means that the original text is in the diagram and there will be a gray line pointing to where it came from. (You can hover the mouse over one of those lines to make it darker.)

A light blue underline means that the original text came from a pre-shared dictionary of strings. SPDY defines some common text, for zlib to be able to refer to, that contains strings that we expect to find in the headers. This is most useful at the beginning of compression when there wouldn't otherwise be any text to refer back to.

The problem that CRIME highlights is that sensitive cookie data and an attacker controlled path is compressed together in the same context. Cookie data makes up most of the red, uncompressed bytes in the diagram. If the path contains some cookie data, then the compressed headers will be shorter because zlib will be able to refer back to the path, rather than have to output all the literal bytes of the cookie. If you arrange things so that you can probe the contents of the cookie incrementally, then (assuming that the cookie is base64), you can extract the cookie byte-by-byte by inducing the browser to make requests.

For details of how to get zlib to reveal that information in practice, I'll just refer you to Duong and Rizzo's CRIME presentation. It's good work.

In order to carry out this attack, the attacker needs to be able to observe your network traffic and to be able to cause many arbitrary requests to be sent. An active network attacker can do both by injecting Javascript into any HTTP page load that you make in the same session.

When we learned of this work, we were already in the process of designing the compression for SPDY/4, which avoids this problem. But we still needed to do something about SPDY/2 and SPDY/3 which are currently deployed. To that end, Chrome 21 and Firefox 15 have switched off SPDY header compression because that's a minimal change that easily backports.

Chrome has also switched off TLS compression, through which a very similar attack can be mounted.

But we like SPDY header compression because it saves significant amounts of data on the wire! Since SPDY/4 isn't ready to go yet we have a more complex solution for Chrome 22/23 that compresses data separately while still being backwards compatible.

Most importantly cookie data will only ever be duplicated exactly, and in its entirety, against other cookie data. Each cookie will also be placed in its own Huffman group (Huffman coding is a zlib detail that I skipped over in the explanation above). Finally, in case other headers contain sensitive data (i.e. when set by an XMLHttpRequest), non-standard headers will be compressed in their own Huffman group without any back references.

That's only a brief overview of the rules. The code to follow them and continue to produce a valid zlib stream wasn't one of the cleaner patches ever landed in Chrome and I'll be happy to revert it when SPDY/4 is ready. But it's effective at getting much of the benefit of compression back.

To the right are a couple of images of the same sort of diagram as above, but zoomed out. At this level of zoom, all you can really see are the blocks of red (literal) and blue (duplicated) bytes. The diagram on the right has the new rules enabled and, as you can see, there is certainly more red in there. However that's mostly the result of limited window size. In order to save on server memory, Chrome only uses 2048-byte compression windows and, under the new rules, a previous cookie value has to fit completely within the window in order to be matched. So things are a little less efficient until SPDY/4, although we might choose to trade a little more memory to make up for that.

SSL interstitial bypass rates

In yesterday's post I threw in the following that I've been asked to clarify:

“we know that those bypass buttons are clicked 60% of the time by Chrome users”

Chrome collects anonymous statistics from users who opt in to such collection (and thank you to those who do!). One of those statistics covers how frequently people bypass SSL interstitials. (As always, the Chrome privacy policy has the details.)

We define bypassing the interstitial as clicking the ‘Proceed anyway’ button and not bypassing as either closing the tab, navigating elsewhere, or clicking the ‘Back’ button.

I picked five days at random over the past six weeks and averaged the percentages of the time that users bypassed rather than not. That came to 61.6%.

There may be some biases here: we may have a biased population because we only include users who have opted in to statistics collection. We are also counting all interstitals: there may be a small number of users who bypass a lot of SSL errors. But that's the data that we have.

Living with HTTPS

(These are my notes from the first half of my talk at HOPE9 last weekend. I write notes like these not as a script, but so that I have at least some words ready in my head when I'm speaking. They are more conversational and less organised than a usual blog post, so please forgive me the rough edges.)

HTTPS tends to cause people to give talks mocking certificate security and the ecosystem around it. Perhaps that's well deserved, but that's not what this talk is about. If you want to have fun at the expense of CAs, dig up one of Moxie's talks. This talk deals with the fact that your HTTPS site, and the sites that you use, probably don't even reach the level where you get to start worrying about certificates.

I'm a transport security person so the model for this talk is that we have two computers talking over a malicious network. We assume that the computers themselves are honest and uncompromised. That might be a stretch in these malware-ridden times, but that's the area of host security and I'm not talking about that today. The network can drop, alter or fabricate packets at will. As a lemma, we also assume that the network can cause the browser to load any URL it wishes. The network can do this by inserting HTML into any HTTP request and we assume that every user makes some unencrypted requests while browsing.


If the average user typed into a browser and saw the following, what fraction of them do you think would login, none the wiser?

Can you even see wh