ImperialViolet

Adam Langley's Weblog

Security Keys (13 Aug 2017)

Security Keys are (generally) USB-connected hardware fobs that are capable of key generation and oracle signing. Websites can “enroll” a security key by asking it to generate a public key bound to an “appId” (which is limited by the browser based on the site's origin). Later, when a user wants to log in, the website can send a challenge to the security key, which signs it to prove possession of the corresponding private key. By having a physical button, which must be pressed to enroll or sign, operations can't happen without user involvement. By having the security keys encrypt state and hand it to the website to store, they can be stateless(*) and robust.

(* well, they can almost be stateless, but there's a signature counter in the spec. Hopefully it'll go away in a future revision for that and other reasons.)

The point is that security keys are unphishable: a phisher can only get a signature for their appId which, because it's based on the origin, has to be invalid for the real site. Indeed, a user cannot be socially engineered into compromising themselves with a security key, short of them physically giving it to the attacker. This is a step up from app- or SMS-based two-factor authentication, which only solves password reuse. (And SMS has other issues.)

The W3C standard for security keys is still a work in progress, but sites can use them via the FIDO API today. In Chrome you can load an implementation of that API which forwards requests to an internal extension that handles the USB communication. If you do that, then there's a Firefox extension that implements the same API by running a local binary to handle it. (Although the Firefox extension appears to stop working with Firefox 57, based on reports.)

Google, GitHub, Facebook and Dropbox (and others) all support security keys this way. If you administer a G Suite domain, you can require security keys for your users. (“G Suite” is the new name for Gmail etc on a custom domain.)

But, to get all this, you need an actual security key, and probably two of them if you want a backup. (And a backup is a good idea, especially if you plan on dropping your phone number for account recovery.) So I did a search on Amazon for “U2F security key” and bought everything on the first page of results that was under $20 and available to ship now.

Yubico Security Key

Brand: Yubico, Firmware: Yubico, Chip: NXP, Price: $17.99, Connection: USB-A

Yubico is the leader in this space and their devices are the most common. They have a number of more expensive and more capable devices that some people might be familiar with, but this one only does U2F. The sensor is capacitive, so a light touch is sufficient to trigger it. You'll have no problems with this key, but it is the most expensive of the under-$20 set.

Thetis U2F Security Key

Brand: Thetis, Firmware: Excelsecu, Chip: ?, Price: $13.95, Connection: USB-A

This security key is fashioned more like a USB thumb drive. The plastic inner part rotates within the outer metal shell and so the USB connector can be protected by it. The button is in the axis and is clicky, rather than capacitive, but doesn't require too much force to press. If you'll be throwing your security key in bags and worry about damaging them then perhaps this one will work well for you.

A minor nit is that the attestation certificate is signed with SHA-1. That doesn't really matter, but it suggests that the firmware writers aren't paying as much attention as one would hope. (I.e. it's a brown M&M.)

Feitian ePass

Brand: Feitian, Firmware: Feitian, Chip: NXP, Price: $16.99, Connection: USB-A, NFC

This one is very much like the Yubico, just a little fatter around the middle. Otherwise, it's also a sealed plastic body and capacitive touch sensor. The differences are a dollar and NFC support—which should let it work with Android. However, I haven't tested this feature.

I don't know what the opposite of a brown M&M is, but this security key is the only one here that has its metadata correctly registered with the FIDO Metadata Service.

U2F Zero

Brand: U2F Zero, Firmware: Conor Patrick, Chip: Atmel, Price: $8.99, Connection: USB-A

I did bend the rules a little to include this one: it wasn't immediately available when I did the main order from Amazon. But it's the only token on Amazon that has open source firmware (and hardware designs), and that was worth waiting for. It's also the cheapest of all the options here.

Sadly, I have to report that I can't quite recommend it because, in my laptop (a Chromebook Pixel), it's not thick enough to sit in the USB port correctly: Since it only has the “tongue” of a USB connector, it can move around in the port a fair bit. That's true of the other tokens too, but with the U2F Zero, unless I hold it just right, it fails to make proper contact. Since operating it requires pressing the button, it's almost unusable in my laptop.

However, it's fine with a couple of USB hubs that I have and in my desktop computer, so it might be fine for you. Depends how much you value the coolness factor of it being open-source.

KEY-ID FIDO U2F Security Key

Brand: KEY-ID, Firmware: Feitian(?), Chip: ?, Price: $12.00, Connection: USB-A

I photographed this one while plugged in in order to show the most obvious issue with this device: everyone will know when you're using it! Whenever it's plugged in, the green LED on the end is lit up and, although the saturation in the photo exaggerates the situation a little, it really is too bright. When it's waiting for a touch, it starts flashing too.

In addition, whenever I remove this from my desktop computer, the computer reboots. That suggests an electrical issue with the device itself—it's probably shorting something that shouldn't be shorted, such as the USB power pin to ground.

While this device is branded “KEY-ID”, I believe that the firmware is done by Feitian. There are similarities in the certificate that match the Feitian device and, if you look up the FIDO certification, you find that Feitian registered a device called “KEY-ID FIDO® U2F Security Key”. Possibly Feitian decided against putting their brand on this.

(Update: Brad Hill, at least, reports no problems with these when using a MacBook.)

HyperFIDO Mini

Brand: HyperFIDO, Firmware: Feitian(?), Chip: ?, Price: $13.75, Connection: USB-A

By observation, this is physically identical to the KEY-ID device, save for the colour. It has the same green LED too (see above).

However, it manages to be worse. The KEY-ID device is highlighted in Amazon as a “new 2017 model”, and maybe this is an example of the older model. Not only does it cause my computer to reliably reboot when removed (I suffered to bring you this review, dear reader), it also causes all devices on a USB hub to stop working when plugged in. When plugged into my laptop it does work—as long as you hold it up in the USB socket. The only saving grace is that, when you aren't pressing it upwards, at least the green LED doesn't light up.

HyperFIDO U2F Security Key

Brand: HyperFIDO, Firmware: Feitian(?), Chip: ?, Price: $9.98, Connection: USB-A

This HyperFIDO device is plastic so avoids the electrical issues of the KEY-ID and HyperFIDO Mini, above. It also avoids having an LED that can blind small children.

However, at least on the one that I received, the plastic USB part is only just small enough to fit into a USB socket. It takes a fair bit of force to insert and remove it. Also the end cap looks like it should be symmetrical and so able to go on either way around, but it doesn't quite work when upside down.

Once inserted, pressing the button doesn't take too much force, but it's enough to make the device bend worryingly in the socket. It doesn't actually appear to be a problem, but it adds a touch of anxiety to each use. Overall, it's cheap and you'll know it.

Those are the devices that matched my initial criteria. But, sometimes, $20 isn't going to be enough I'm afraid. These are some other security keys that I've ended up with:

Yubikey 4C

Brand: Yubico, Firmware: Yubico, Chip: NXP?, Price: $50 (direct from Yubico), Connection: USB-C

If you have a laptop that only has USB-C ports then a USB-A device is useless to you. Currently your only option is the Yubikey 4C at $50 apiece. This works well enough: the “button” is capacitive and triggers when you touch either of the contacts on the sides. The visual indicator is an LED that shines through the plastic at the very end.

Note that, as a full Yubikey, it can do more than just being a security key. Yubico have a site for that.

Many people lacking USB-A ports will have a Touch Bar, which includes a fingerprint sensor and secure element. One might spy an alternative (and cheaper) solution there. GitHub have published SoftU2F which does some of that but, from what I can tell, doesn't actually store keys in the secure element yet. However, in time, there might be a good answer for this.

Yubikey Nano

Brand: Yubico, Firmware: Yubico, Chip: NXP?, Price: $50 (direct from Yubico), Connection: USB-A

Another $50 security key from Yubico, but I've included it because it's my preferred form-factor: this key is designed to sit semi-permanently inside the USB-A port. The edge is a capacitive touch sensor so you can trigger it by running your finger along it.

It does mean that you give up a USB port, but it also means that you're never rummaging around to find it.

(Note: newer Nanos look slightly different. See Yubico's page for a photo of the current design.)

Maybe Skip SHA-3 (31 May 2017)

In 2005 and 2006, a series of significant results were published against SHA-1 [1][2][3]. These repeated breakthroughs caused something of a crisis of faith as cryptographers questioned whether we knew how to build hash functions at all. After all, many hash functions from the 1990's had not aged well [1][2].

In the wake of this, NIST announced (PDF) a competition to develop SHA-3 in order to hedge the risk of SHA-2 falling. In 2012, Keccak (pronounced “ket-chak”, I believe) won (PDF) and became SHA-3. But the competition itself proved that we do know how to build hash functions: the series of results in 2005 didn't extend to SHA-2 and the SHA-3 process produced a number of hash functions, all of which are secure as far as we can tell. Thus, by the time it existed, it was no longer clear that SHA-3 was needed. Yet there is a natural tendency to assume that SHA-3 must be better than SHA-2 because the number is bigger.

As I've mentioned before, diversity of cryptographic primitives is expensive. It contributes to the exponential number of combinations that need to be tested and hardened; it draws on limited developer resources as multiple platforms typically need separate, optimised code; and it contributes to code-size, which is a worry again in the mobile age. SHA-3 is also slow, and is even slower than SHA-2 which is already a comparative laggard amongst crypto primitives.

SHA-3 did introduce something useful: extendable output functions (XOFs), in the form of the SHAKE algorithms. In an XOF, input is hashed and then an (effectively) unlimited amount of output can be produced from it. It's convenient, although the same effect can be produced for a limited amount of output using HKDF, or by hashing to a key and running ChaCha20 or AES-CTR.
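
(For illustration, here's roughly what the HKDF approach looks like. This is a sketch that assumes BoringSSL's one-shot HKDF function with SHA-256, so it's limited to 255×32 bytes of output; the function name and label are mine, not from any real code.)

#include <stddef.h>
#include <stdint.h>
#include <openssl/digest.h>
#include <openssl/hkdf.h>

/* Sketch: produce |out_len| bytes of output from |in|, XOF-style, using
 * HKDF-SHA-256. Unlike a real XOF, the output length is capped at 255×32
 * bytes. */
int xof_like(uint8_t *out, size_t out_len, const uint8_t *in, size_t in_len) {
  static const uint8_t kInfo[] = "xof-like"; /* illustrative label */
  return HKDF(out, out_len, EVP_sha256(), in, in_len, /*salt=*/NULL, 0, kInfo,
              sizeof(kInfo) - 1);
}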

Thus I believe that SHA-3 should probably not be used. It offers no compelling advantage over SHA-2 and brings many costs. The only argument that I can credit is that it's nice to have a backup hash function, but both SHA-256 and SHA-512 are commonly supported and have different cores. So we already have two secure hash functions deployed and I don't think we need another.

BLAKE2 is another new, secure hash function, but it at least offers much improved speed over SHA-2. Speed is important. Not only does it mean less CPU time spent on cryptography, it means that cryptography can be economically deployed in places where it couldn't be before. BLAKE2, however, has too many versions: eight at the current count (BLAKE2(X)?[sb](p)?). In response to complaints about speed, the Keccak team now have KangarooTwelve and MarsupilamiFourteen, which have a vector-based design for better performance. (Although a vector-based design can also be used to speed up SHA-2.)

So there are some interesting prospects for a future, faster replacement for SHA-2. But SHA-3 itself isn't one of them.

Update: two points came up in discussion about this. Firstly, what about length-extension? SHA-2 suffers from it, which means that simply hashing a secret together with some data is not a secure MAC construction; that's why we have HMAC. SHA-3 does not have this problem.

That is an advantage of SHA-3 because it means that people who don't know they need to use HMAC (with SHA-2) won't be caught out by it. Hopefully, in time, we end up with a hash function that has that property. But SHA-512/256, BLAKE2, K12, M14 and all the other SHA-3 candidates already have this property; in fact, it's implausible that any future hash function wouldn't.

Overall, I don't feel that solving length-extension is a sufficiently pressing concern that we should all invest in SHA-3 now, rather than a hash function that hopefully comes with more advantages. If it is a major concern for you now, try SHA-512/256—a member of the SHA-2 family.

The second point was that SHA-3 is just the first step towards a permutation-based future: SHA-3 has an elegant foundation that is suitable for implementing the full range of symmetric algorithms. In the future, a single optimised permutation function could be the basis of hashes, MACs, and AEADs, thus saving code size / die area and complexity. (E.g. STROBE.)

But skipping SHA-3 doesn't preclude any of that. SHA-3 is the hash standard itself, and even the Keccak team appear to be pushing K12 rather than SHA-3 now. It seems unlikely that a full set of primitives built around the Keccak permutation would choose to use the SHA-3 parameters at this point.

Indeed, SHA-3 adoption might inhibit that ecosystem by pushing it towards those bad parameters. (There was a thing about NIST tweaking the parameters at the end of the process if you want some background.)

One might argue that SHA-3 should be supported because you believe that it'll result in hardware implementations of the permutation and you hope that they'll be flexible enough to support what you really want to do with it. I'm not sure that would be the best approach even if your goal was to move to a permutation-based world. Instead I would nail down the whole family of primitives as you would like to see it and try to push small chips, where area is a major concern, to adopt it. Even then, the hash function in the family probably wouldn't be exactly SHA-3, but more like K12.

AES-GCM-SIV (14 May 2017)

AEADs combine encryption and authentication in a way that provides the properties that people generally expect when they “encrypt” something. This is great because, historically, handing people a block cipher and a hash function has resulted in a lot of bad and broken constructions. Standardising AEADs avoids this.

Common AEADs have a sharp edge though: you must never encrypt two different messages with the same key and nonce. Doing so generally violates the confidentiality of the two messages and might break much more.

There are some situations where obtaining a unique nonce is easy, for example in transport security where a simple counter can be used. (Due to poor design in TLS 1.2, some TLS implementations still managed to duplicate nonces. The answer there is to do what ChaCha20-Poly1305 does in TLS 1.2, and what everything does in TLS 1.3: make the nonce implicit. That means that any mistakes result in your code not interoperating, which should be noticed.)
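
(Concretely, the implicit-nonce construction is simple: the per-record nonce is the connection's static IV XORed with the record sequence number, so nothing nonce-related is sent on the wire. A minimal sketch of that scheme, as used in TLS 1.3 and by ChaCha20-Poly1305 in TLS 1.2:)

#include <stdint.h>
#include <string.h>

/* Sketch: derive the per-record AEAD nonce from a static, per-connection IV
 * and the 64-bit record sequence number (left-padded to 12 bytes, then
 * XORed). Any disagreement about the sequence number simply fails to
 * decrypt. */
void record_nonce(uint8_t nonce[12], const uint8_t iv[12], uint64_t seq) {
  memcpy(nonce, iv, 12);
  for (int i = 0; i < 8; i++) {
    nonce[11 - i] ^= (uint8_t)(seq >> (8 * i));
  }
}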

But there are situations where ensuring nonce uniqueness is not trivial, generally where multiple machines are independently encrypting with the same key. A system for distributing nonces is complex and hard to have confidence in. Generating nonces randomly and depending on statistical uniqueness is reasonable, but only if the space of nonces is large enough. XSalsa20-Poly1305 (paper, code) has a 192-bit nonce and good performance across a range of CPUs. In many situations, using that with random nonces is a sound choice.
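
(As a concrete sketch of that, using libsodium, which implements XSalsa20-Poly1305 as crypto_secretbox; apart from the libsodium calls and constants, the names here are illustrative.)

#include <stddef.h>
#include <stdint.h>
#include <sodium.h>

/* Sketch: encrypt |msg| under |key| with a random 192-bit nonce. |out| must
 * have room for msg_len + crypto_secretbox_MACBYTES bytes and the nonce
 * needs to be stored or sent alongside the ciphertext. */
int seal_random_nonce(uint8_t *out, uint8_t nonce[crypto_secretbox_NONCEBYTES],
                      const uint8_t *msg, size_t msg_len,
                      const uint8_t key[crypto_secretbox_KEYBYTES]) {
  if (sodium_init() < 0) {
    return -1;
  }
  /* With a 192-bit nonce, random collisions are negligible for any plausible
   * number of messages under one key. */
  randombytes_buf(nonce, crypto_secretbox_NONCEBYTES);
  return crypto_secretbox_easy(out, msg, msg_len, nonce, key);
}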

However, lots of chips now have hardware support for the AES-GCM AEAD, meaning that its performance and power use is hard to beat. Also, a 192-bit nonce results in a ~40% increase in overhead vs the 96-bit nonce of AES-GCM, which is undesirable in some contexts. (Overhead consists of the nonce and tag, thus the increase from 96- to 192-bit nonces is not 100%.)

But using random nonces in a 96-bit space isn't that comfortable. NIST recommends a limit of 2³² messages when using random nonces with AES-GCM which, while quite large, is often not large enough to stop worrying about. Because of this, Shay Gueron, Yehuda Lindell and I have been working on AES-GCM-SIV (paper, spec): AES-GCM with some forgiveness. It uses the same primitives as AES-GCM, and thus enjoys the same hardware support, but it doesn't fail catastrophically if you repeat a nonce. Thus you can use random, 96-bit nonces with a far larger number of messages, or withstand a glitch in your nonce distribution scheme. (For precise numbers, see section 5.3 of the paper.)

There is a performance cost: AES-GCM-SIV encryption runs at about 70% the speed of AES-GCM, although decryption runs at the same speed. (Measured using current BoringSSL on an Intel Skylake chip with 8KiB messages.) But, in any situation where you don't have a watertight argument for nonce uniqueness, that might be pretty cheap compared to the alternative.

For example, both TLS and QUIC need to encrypt small messages at the server that can be decrypted by other servers. For TLS, these messages are the session tickets and, in QUIC, they are the source-address tokens. This is an example of a situation where many servers are independently encrypting with the same key and so Google's QUIC implementation is in the process of switching to using AES-GCM-SIV for source-address tokens. (Our TLS implementation may switch too in the future, although that will be more difficult because of existing APIs.)

AEADs that can withstand nonce duplication are called “nonce-misuse resistant” and that name appears to have caused some people to believe that they are infinitely resistant. I.e. that an unlimited number of messages can be encrypted with a fixed nonce with no loss in security. That is not the case, and the term wasn't defined that way originally by Rogaway and Shrimpton (nor does their SIV mode have that property). So it's important to emphasise that AES-GCM-SIV (and nonce-misuse resistant modes in general) are not a magic invulnerability shield. Figure four and section five of the paper give precise bounds but, if in doubt, consider AES-GCM-SIV to be a safety net for accidental nonce duplication and otherwise treat it like a traditional AEAD.

AES-GCM-SIV is supported in BoringSSL now and, while one may not want to use the whole of BoringSSL, the core assembly is ISC licensed. Also, Shay has published reference, assembly and intrinsics versions for those who think AES-GCM-SIV might be useful to them.
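
(If you're wondering what using it looks like, here's a rough sketch against BoringSSL's AEAD interface, with error handling trimmed; treat it as an outline rather than something to copy verbatim.)

#include <stddef.h>
#include <stdint.h>
#include <openssl/aead.h>
#include <openssl/rand.h>

/* Sketch: seal |msg| with AES-256-GCM-SIV under a random 96-bit nonce. |out|
 * must have room for msg_len plus the AEAD overhead, and the nonce has to be
 * conveyed alongside the ciphertext. */
int seal_gcm_siv(uint8_t *out, size_t *out_len, size_t max_out_len,
                 uint8_t nonce[12], const uint8_t key[32], const uint8_t *msg,
                 size_t msg_len) {
  EVP_AEAD_CTX ctx;
  if (!EVP_AEAD_CTX_init(&ctx, EVP_aead_aes_256_gcm_siv(), key, 32,
                         EVP_AEAD_DEFAULT_TAG_LENGTH, NULL)) {
    return 0;
  }
  RAND_bytes(nonce, 12); /* random nonces are the point here */
  int ok = EVP_AEAD_CTX_seal(&ctx, out, out_len, max_out_len, nonce, 12, msg,
                             msg_len, /*ad=*/NULL, 0);
  EVP_AEAD_CTX_cleanup(&ctx);
  return ok;
}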

CFI directives in assembly files (18 Jan 2017)

(This post uses x86-64 for illustration throughout. The fundamentals are similar for other platforms but will need some translation that I don't cover here.)

Despite compilers getting better over time, it's still the case that hand-written assembly can be worthwhile for certain hot-spots. Sometimes there are special CPU instructions for the thing that you're trying to do, sometimes you need detailed control of the resulting code and, to some extent, it remains possible for some people to out-optimise a compiler.

But hand-written assembly doesn't automatically get some of the things that the compiler generates for normal code, such as debugging information. Perhaps your assembly code never crashes (although any function that takes a pointer can suffer from bugs in other code) but you probably still care about accurate profiling information. In order for debuggers to walk up the stack in a core file, or for profilers to correctly account for CPU time, they need to be able to unwind call frames.

Unwinding used to be easy as every function would have a standard prologue:

push rbp
mov rbp, rsp

This would make the stack look like this (remember that stacks grow downwards in memory):

  +----------------------------+
  | Caller's stack             |
  +----------------------------+ ← RSP value before CALL
  | Saved RIP (pushed by CALL) |
  +----------------------------+ ← RSP at function entry
  | Caller's RBP               | ← RBP always points here
  +----------------------------+
  | Callee's local variables   |
  +----------------------------+

So, upon entry to a function, the CALL instruction that jumped to the function in question will have pushed the previous program counter (from the RIP register) onto the stack. Then the function prologue saves the current value of RBP on the stack and copies the current value of the stack pointer into RBP. From this point until the function is complete, RBP won't be touched.

This makes stack unwinding easy because RBP always points to the call frame for the current function. That gets you the saved address of the parent call and the saved value of its RBP and so on.

The problems with this scheme are that a) the function prologue can be excessive for small functions and b) we would like to be able to use RBP as a general purpose register to avoid spills. Which is why the GCC documentation says that “-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging”. This means that you can't depend on being able to unwind stacks like this. A process can be composed of various shared libraries, any of which might be compiled with optimisations.

To be able to unwind the stack without depending on this convention, additional debugging tables are needed. The compiler will generate these automatically (when asked) for code that it generates, but it's something that we need to worry about when writing assembly functions ourselves if we want profilers and debuggers to work.

The reference for the assembly directives that we'll need is here, but they are very lightly documented. You can understand more by reading the DWARF spec, which documents the data that is being generated. Specifically see sections 6.4 and D.6. But I'll try to tie the two together in this post.

The tables that we need the assembler to emit for us are called Call Frame Information (CFI). (Not to be confused with Control Flow Integrity, which is very different.) Based on that name, all the assembler directives begin with .cfi_.

Next we need to define the Canonical Frame Address (CFA). This is the value of the stack pointer just before the CALL instruction in the parent function. In the diagram above, it's the value indicated by “RSP value before CALL”. Our first task will be to define data that allows the CFA to be calculated for any given instruction.

The CFI tables allow the CFA to be expressed as a register value plus an offset. For example, immediately upon function entry the CFA is RSP + 8. (The eight byte offset is because the CALL instruction will have pushed the previous RIP on the stack.)

As the function executes, however, the expression will probably change. If nothing else, after pushing a value onto the stack we would need to increase the offset.

So one design for the CFI table would be to store a (register, offset) pair for every instruction. Conceptually that's what we do but, to save space, only changes from instruction to instruction are stored.

It's time for an example, so here's a trivial assembly function that includes CFI directives and a running commentary.

  .globl  square
  .type   square,@function
  .hidden square
square:

This is a standard preamble for a function that's unrelated to CFI. Your assembly code should already be full of this.

  .cfi_startproc

Our first CFI directive. This is needed at the start of every annotated function. It causes a new CFI table for this function to be initialised.

  .cfi_def_cfa rsp, 8

This is defining the CFA expression as a register plus offset. One of the things that you'll see compilers do is express the registers as numbers rather than names. But, at least with GAS, you can write names. (I've included a table of DWARF register numbers and names below in case you need it.)

Getting back to the directive, this is just specifying what I discussed above: on entry to a function, the CFA is at RSP + 8.

  push    rbp
  .cfi_def_cfa rsp, 16

After pushing something to the stack, the value of RSP will have changed so we need to update the CFA expression. It's now RSP + 16, to account for the eight bytes we pushed.

  mov     rbp, rsp
  .cfi_def_cfa rbp, 16

This function happens to have a standard prologue, so we'll save the frame pointer in RBP, following the old convention. Thus, for the rest of the function we can define the CFA as RBP + 16 and manipulate the stack without having to worry about it again.

  mov     DWORD PTR [rbp-4], edi
  mov     eax, DWORD PTR [rbp-4]
  imul    eax, DWORD PTR [rbp-4]
  pop     rbp
  .cfi_def_cfa rsp, 8

We're getting ready to return from this function and, after restoring RBP from the stack, the old CFA expression is invalid because the value of RBP has changed. So we define it as RSP + 8 again.

  ret
  .cfi_endproc

At the end of the function we need to trigger the CFI table to be emitted. (It's an error if a CFI table is left open at the end of the file.)

The CFI tables for an object file can be dumped with objdump -W and, if you do that for the example above, you'll see two tables: something called a CIE and something called an FDE.

The CIE (Common Information Entry) table contains information common to all functions and it's worth taking a look at it:

… CIE
  Version:         1
  Augmentation:    "zR"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address column: 16
  Augmentation data:     1b

  DW_CFA_def_cfa: r7 (rsp) ofs 8
  DW_CFA_offset: r16 (rip) at cfa-8

You can ignore everything until the DW_CFA_… lines at the end. They define CFI directives that are common to all functions (that reference this CIE). The first is saying that the CFA is at RSP + 8, which is what we had already defined at function entry. This means that you don't need a CFI directive at the beginning of the function. Basically RSP + 8 is already the default.

The second directive is something that we'll get to when we discuss saving registers.

If we look at the FDE (Frame Description Entry) for the example function that we defined, we see that it reflects the CFI directives from the assembly:

… FDE cie=…
  DW_CFA_advance_loc: 1 to 0000000000000001
  DW_CFA_def_cfa: r7 (rsp) ofs 16
  DW_CFA_advance_loc: 3 to 0000000000000004
  DW_CFA_def_cfa: r6 (rbp) ofs 16
  DW_CFA_advance_loc: 11 to 000000000000000f
  DW_CFA_def_cfa: r7 (rsp) ofs 8

The FDE describes the range of instructions that it's valid for and is a series of operations to either update the CFA expression, or to skip over the next n bytes of instructions. Fairly obvious.

Optimisations for CFA directives

There are some shortcuts when writing CFA directives:

Firstly, you can update just the offset, or just the register, with cfi_def_cfa_offset and cfi_def_cfa_register respectively. This not only saves typing in the source file, it saves bytes in the table too.

Secondly, you can update the offset with a relative value using cfi_adjust_cfa_offset. This is useful when pushing lots of values to the stack as the offset will increase by eight each time.

Here's the example from above, but using these directives and omitting the first directive that we don't need because of the CIE:

  .globl  square
  .type   square,@function
  .hidden square
square:
  .cfi_startproc
  push    rbp
  .cfi_adjust_cfa_offset 8
  mov     rbp, rsp
  .cfi_def_cfa_register rbp
  mov     DWORD PTR [rbp-4], edi
  mov     eax, DWORD PTR [rbp-4]
  imul    eax, DWORD PTR [rbp-4]
  pop     rbp
  .cfi_def_cfa rsp, 8
  ret
  .cfi_endproc

Saving registers

Consider a profiler that is unwinding the stack after a profiling signal. It calculates the CFA of the active function and, from that, finds the parent function. Now it needs to calculate the parent function's CFA and, from the CFI tables, discovers that it's related to RBX. Since RBX is a callee-saved register, that's reasonable, but the active function might have stomped RBX. So, in order for the unwinding to proceed it needs a way to find where the active function saved the old value of RBX. So there are more CFI directives that let you document where registers have been saved.

Registers can either be saved at an offset from the CFA (i.e. on the stack), or in another register. Most of the time they'll be saved on the stack though because, if you had a caller-saved register to spare, you would be using it first.

To indicate that a register is saved on the stack, use cfi_offset. In the same example as above (see the stack diagram at the top) the caller's RBP is saved at CFA - 16 bytes. So, with saved registers annotated too, it would start like this:

square:
  .cfi_startproc
  push    rbp
  .cfi_adjust_cfa_offset 8
  .cfi_offset rbp, -16

If you need to save a register in another register for some reason, see the documentation for cfi_register.

If you get all of that correct then your debugger should be able to unwind crashes correctly, and your profiler should be able to avoid recording lots of detached functions. However, I'm afraid that I don't know of a better way to test this than to zero RBP, add a crash in the assembly code, and check whether GDB can go up correctly.

(None of this works for Windows. But Per Vognsen, via Twitter, notes that there are similar directives in MASM.)

CFI expressions

New in version three of the DWARF standard are CFI Expressions. These define a stack machine for calculating the CFA value and can be useful when your stack frame is non-standard (which is fairly common in assembly code). However, there's no assembler support for them that I've been able to find, so one has to use cfi_escape and provide the raw DWARF data in a .s file. As an example, see this kernel patch.

Since there's no assembler support, you'll need to read section 2.5 of the standard, then search for DW_CFA_def_cfa_expression and, perhaps, search for cfi_directive in OpenSSL's perlasm script for x86-64 and the places in OpenSSL where that is used. Good luck.

(I suggest testing by adding some instructions that write to NULL in the assembly code and checking that gdb can correctly step up the stack and that info reg shows the correct values for callee-saved registers in the parent frame.)

CFI register numbers

In case you need to use or read the raw register numbers, here they are for a few architectures:

Register number   x86-64   x86                         ARM
0                 RAX      EAX                         r0
1                 RDX      ECX                         r1
2                 RCX      EDX                         r2
3                 RBX      EBX                         r3
4                 RSI      ESP (may be EBP on MacOS)   r4
5                 RDI      EBP (may be ESP on MacOS)   r5
6                 RBP      ESI                         r6
7                 RSP      EDI                         r7
8                 R8                                   r8
9                 R9                                   r9
10                R10                                  r10
11                R11                                  r11
12                R12                                  r12
13                R13                                  r13
14                R14                                  r14
15                R15                                  r15
16                RIP

(x86 values taken from page 25 of this doc. x86-64 values from page 57 of this doc. ARM taken from page 7 of this doc.)

RISC-V assembly (31 Dec 2016)

RISC-V is a new, open instruction set. Fabrice Bellard wrote a Javascript emulator for it that boots Linux here (more info). I happen to have just gotten a physical chip that implements it too (one of these) and what's cool is that you can get the source code to the chip on GitHub.

The full, user-level instruction set is documented but there's a lot of information in there. I wanted a brief summary of things that I could keep in mind when reading disassembly. This blog post is just a dump of my notes; it's probably not very useful for most people! It also leaves out a lot of things that come up less frequently. There's also an unofficial reference card which is good, but still aimed a little low for me.

RISC-V is little-endian and comes in 32 and 64 bit flavours. In keeping with the RISC-V documents, the flavour (either 32 or 64) is called XLEN below in the few places where it matters.

For both, int is 32 bits. Pointers and long are the native register size. Signed values are always sign extended in a larger register and unsigned 8- and 16-bit values are zero extended. But unsigned 32-bit values are sign-extended. Everything has natural alignment in structs and the like, but alignment isn't required by the hardware.

The stack grows downwards and is 16-byte aligned upon entry to a function.

Instructions are all 32 bits and 32-bit aligned. There is a compressed-instruction extension that adds 16-bit instructions and changes the required alignment of all instructions to just 16 bits. Unlike Thumb on ARM, RISC-V compressed instructions are just a short-hand for some 32-bit instruction and 16- and 32-bit instructions can be mixed freely.

There are 31 general purpose registers called x1–x31. Register x0 drops all writes and always reads as zero. Each register is given an alias too, which describes their conventional use:

Register   Alias   Description                      Saved by
x0         zero    Zero
x1         ra      Return address                   Caller
x2         sp      Stack pointer                    Callee
x3         gp      Global pointer
x4         tp      Thread pointer
x5–7       t0–2    Temporary                        Caller
x8         s0/fp   Saved register / frame pointer   Callee
x9         s1      Saved register                   Callee
x10–17     a0–7    Arguments and return values      Caller
x18–27     s2–11   Saved registers                  Callee
x28–31     t3–6    Temporary                        Caller

There are two types of immediates, which will be written as ‘I’ and ‘J’. Type ‘I’ immediates are 12-bit, signed values and type ‘J’ immediates are 20-bit, signed values. They are always sign-extended to the width of a register before use.

Arithmetic

There's no carry flag, instead one has to use a comparison instruction on the result in order to get the carry in a register.
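
For example, here's the usual idiom in C, which a compiler typically turns into an ADD followed by an SLTU (a sketch; the function is mine, just for illustration):

#include <stdint.h>

/* Add two 64-bit values and recover the carry without a flags register: the
 * wrapped sum is less than either input exactly when the addition
 * overflowed. */
uint64_t add_with_carry(uint64_t a, uint64_t b, uint64_t *carry_out) {
  uint64_t sum = a + b; /* add  sum, a, b     */
  *carry_out = sum < a; /* sltu carry, sum, a */
  return sum;
}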

A U suffix indicates an unsigned operation. Note that the multiplication and division instructions are part of the ‘M’ extension, but I expect most chips to have them.

Instruction                   Effect                          Notes
ADD dest,src1,src2            dest = src1 + src2
SUB dest,src1,src2            dest = src1 - src2
ADDI dest,src1,I              dest = src1 + I
MUL dest,src1,src2            dest = src1 × src2
MULH[|U|SU] dest,src1,src2    dest = (src1 × src2) >> XLEN    Returns the high word of the multiplication result for signed×signed, unsigned×unsigned or signed×unsigned, respectively
DIV[U] dest,src1,src2         dest = src1/src2                There's no trap on divide-by-zero, rather special values are returned. (See documentation.)
REM[U] dest,src1,src2         dest = src1%src2

Bitwise

These should all be obvious:

Instruction
AND dest,src1,src2
OR dest,src1,src2
XOR dest,src1,src2
ANDI dest,src1,I
ORI dest,src1,I
XORI dest,src1,I

Shifts

Instructions here are Shift [Left|Right] [Logical|Arithmetic].

Note that only the low bits of the shift count are read (five of them on 32-bit, six on 64-bit). So shifting by the width of the register doesn't zero the value; it does nothing.

Instruction               Effect                      Notes
SLL dest,src1,src2        dest = src1 << src2%XLEN
SRL dest,src1,src2        dest = src1 >> src2%XLEN
SRA dest,src1,src2        dest = src1 >> src2%XLEN
SLLI dest,src1,0‥31/63    dest = src1 << I
SRLI dest,src1,0‥31/63    dest = src1 >> I
SRAI dest,src1,0‥31/63    dest = src1 >> I

Comparisons

These instructions set the destination register to one or zero depending on whether the relation is true or not.

Instruction            Effect              Notes
SLT dest,src1,src2     dest = src1<src2    Signed compare
SLTU dest,src1,src2    dest = src1<src2    Unsigned compare
SLTI dest,src1,I       dest = src1<I       Signed compare
SLTIU dest,src1,I      dest = src1<I       Unsigned compare, but remember that the immediate is sign-extended first

Control flow

The JAL[R] instructions write the address of the next instruction to the destination register for the subsequent return. This can be x0 if you don't care.

Many instructions here take an immediate that is doubled before use. That's because no instruction (even with compressed instructions) can ever start on an odd address. However, when you write the value in assembly you write the actual value; the assembler will halve it when encoding.

Instruction         Effect                            Notes
JAL dest,J          dest = pc+2/4; pc += 2*J
JALR dest,src,I     dest = pc+2/4; pc = src+(I&~1)    The least-significant bit of the address is ignored
BEQ src1,src2,I     if src1==src2, pc += 2*I
BNE src1,src2,I     if src1!=src2, pc += 2*I
BLT src1,src2,I     if src1<src2, pc += 2*I           Signed compare
BLTU src1,src2,I    if src1<src2, pc += 2*I           Unsigned compare
BGE src1,src2,I     if src1>=src2, pc += 2*I          Signed compare
BGEU src1,src2,I    if src1>=src2, pc += 2*I          Unsigned compare

Memory

The suffix on these instructions denotes the size of the value read or written: ‘D’ = double-word (64 bits), ‘W’ = word (32 bits), ‘H’ = half-word (16 bits), ‘B’ = byte. Reading and writing 64-bit values only works on 64-bit systems, and LWU only exists on 64-bit systems because it's the same as LW otherwise.

Alignment is not required, but might be faster. Also note that there's no consistent direction of data flow in the textual form of the assembly: the register always comes first.

Instruction                Effect               Notes
L[D|W|H|B] dest,I(src)     dest = *(src + I)    Result is sign extended
L[W|H|B]U dest,I(src)      dest = *(src + I)    Result is zero extended
S[D|W|H|B] src1,I(src2)    *(src2 + I) = src1

Other

Instruction      Effect               Notes
LUI dest,J       dest = J<<12         “Load Upper Immediate”. The result is sign extended on 64-bit.
AUIPC dest,J     dest = pc + J<<12    “Add Upper Immediate to PC”. The result of the shift is sign-extended on 64-bit before the addition.

W instructions

These instructions only exist on 64-bit and are copies of the same instruction without the ‘W’ suffix, except that they operate only on the lower 32 bits of the registers and, when writing the result, sign-extend to 64 bits. They are ADDW, SUBW, ADDIW, MULW, DIV[U]W, REM[U]W, S[L|R]LW, SRAW, S[L|R]LIW and SRAIW.

Pseudo-instructions

For clarity, there are also a number of pseudo-instructions defined. These are just syntactic sugar for one of the primitive instructions, above. Here's a handful of the more useful ones:

Pseudo-instruction        Translation
nop                       addi x0,x0,0
mv dest,src               addi dest,src,0
not dest,src              xori dest,src,-1
seqz dest,src             sltiu dest,src,1
snez dest,src             sltu dest,x0,src
j J                       jal x0,J
ret                       jalr x0,x1,0
call offset               auipc x6, offset >> 12; jalr x1,x6,offset & 0xfff
li dest,value             (possibly several instructions to load an arbitrary value)
la dest,symbol            auipc dest, offset >> 12; addi dest,dest,offset & 0xfff
l[d|w|h|b] dest,symbol    auipc dest, offset >> 12; lx dest,offset & 0xfff(dest)
s[d|w|h|b] src,symbol     auipc temp, offset >> 12; sx src,offset & 0xfff(temp)

Note that the pseudo-instructions that expand to two instructions starting with AUIPC aren't quite as simple as they appear. (In the table above, offset is the pc-relative distance to symbol.) Since the 12-bit immediate in the second instruction is sign extended, it could end up negative. If that happens, the J immediate for AUIPC needs to be one greater than a simple shift would give and then the second instruction reaches the value by subtracting from there. For example, to add an offset of 0x1800 to pc, the expansion is an AUIPC with an immediate of two (giving pc + 0x2000) followed by an addition of -0x800.

CECPQ1 results (28 Nov 2016)

In July my colleague, Matt Braithwaite, announced that Chrome and Google would be experimenting with a post-quantum key-agreement primitive in TLS. One should read the original announcement for details, but we had two goals for this experiment:

Firstly we wanted to direct cryptanalytic attention at the family of Ring Learning-with-Errors (RLWE) problems. The algorithm that we used, NewHope, is part of this family and appeared to be the most promising when we selected one at the end of 2015.

It's very difficult to know whether we had any impact here, but it's good to see the recent publication and withdrawal of a paper describing a quantum attack on a fundamental lattice problem. Although the algorithm contained an error, it still shows that capable people are testing these new foundations.

Our second goal was to measure the feasibility of deploying post-quantum key-agreement in TLS by combining NewHope with an existing key-agreement (X25519). We called the combination CECPQ1.

TLS key agreements have never been so large and we expected a latency impact from the extra network traffic. Also, any incompatibilities with middleboxes can take years to sort out, so it's best to discover them as early as possible.

Here the results are more concrete: we did not find any unexpected impediment to deploying something like NewHope. There were no reported problems caused by enabling it.

Although the median connection latency only increased by a millisecond, the latency for the slowest 5% increased by 20ms and, for the slowest 1%, by 150ms. Since NewHope is computationally inexpensive, we're assuming that this is caused entirely by the increased message sizes. Since connection latencies compound on the web (because subresource discovery is delayed), the data requirement of NewHope is moderately expensive for people on slower connections.

Nonetheless, if the need arose, it would be practical to quickly deploy NewHope in TLS 1.2. (TLS 1.3 makes things a little more complex and we did not test CECPQ1 with it.)

At this point the experiment is concluded. We do not want to promote CECPQ1 as a de-facto standard and so a future Chrome update will disable CECPQ1 support. It's likely that TLS will want a post-quantum key-agreement in the future but a more multilateral approach is preferable for something intended to be more than an experiment.

Roughtime (19 Sep 2016)

Security protocols often assume an accurate, local clock (e.g. TLS, Kerberos, DNSSEC and more). It's a widely accepted assumption when designing protocols but, for a lot of people, it just isn't true. We find good evidence that at least 25% of all certificate errors in Chrome are due to a bad local clock.

Even when the local clock is being synchronised, it's very likely to be using unauthenticated NTP. So if your threat model includes man-in-the-middle attackers then you still can't trust the local clock.

There have been efforts to augment NTP with authentication, but they still assume a world where each client trusts one or more time servers absolutely. In order to explore a solution that allows time servers to be effectively audited by clients, my colleague Matt Braithwaite and I (with assistance and advice from Ben Laurie and Michael Shields) have developed Roughtime.

Very briefly: using some tricks we believe that it's viable to deploy servers that sign a client-chosen nonce and timestamp on demand. Once you have several of these servers, clients can generate their nonces by hashing replies from other servers with some entropy. That proves that a nonce was created after the reply was received. Clients maintain a chain of nonces and replies and, if a server misbehaves, can use replies from several other servers to prove and report it.
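
(Stripped of all the encoding details, the nonce chaining looks roughly like this; it's a sketch of the idea only, not the actual message format, and the function is mine.)

#include <stddef.h>
#include <stdint.h>
#include <openssl/rand.h>
#include <openssl/sha.h>

/* Sketch: the next nonce is a hash of the previous server's reply plus fresh
 * randomness (a "blind"). Keeping the reply and the blind lets the client
 * later prove that this nonce could only have been created after that reply
 * was received. */
void next_nonce(const uint8_t *prev_reply, size_t prev_reply_len,
                uint8_t blind[32], uint8_t nonce[SHA512_DIGEST_LENGTH]) {
  SHA512_CTX ctx;
  RAND_bytes(blind, 32);
  SHA512_Init(&ctx);
  SHA512_Update(&ctx, prev_reply, prev_reply_len);
  SHA512_Update(&ctx, blind, 32);
  SHA512_Final(nonce, &ctx);
}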

Currently there's only one Roughtime service running, so the idea of spreading trust around is inchoate. But we would like to gauge whether others are interested in this idea, specifically whether there are any organisations who would be seriously interested in deploying something like this in their clients. (Because I assume that, if you have clients, then you'll also be interested in running a server.)

There's a much longer introduction and many more details on the Roughtime site and we've also set up a mailing list.

memcpy (and friends) with NULL pointers (26 Jun 2016)

The C standard (ISO/IEC 9899:2011) has a sane-seeming definition of memcpy (section 7.24.2.1):

The memcpy function copies n characters from the object pointed to by s2 into the object pointed to by s1.

Apart from a prohibition on passing overlapping objects, I think every C programmer understands that.

However, the standard also says (section 7.1.4):

If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after promotion) not expected by a function with variable number of arguments, the behavior is undefined.

(Emphasis is mine.)

I'm sure that 7.1.4 seemed quite reasonable in isolation, but how does it interact with the case where memcpy is called with a zero length? If you read 7.24.2.1 then you might well think that, since the function copies zero bytes, it's valid to pass NULL as either of the pointer arguments. I claim that the vast majority of C programmers would agree with that, but 7.24.1(2) clarifies that 7.1.4 really does apply:

Where an argument declared as size_t n specifies the length of the array for a function, n can have the value zero […] pointer arguments on such a call shall still have valid values, as described in 7.1.4.

(Nobody would actually write memcpy(NULL, NULL, 0), of course, because it (at best) does nothing. But such a call can easily arise at run-time when an empty object is handled by a more general function.)
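
(For example, something like the following, where an empty buffer is represented by a NULL pointer and a zero length, will do it; the function is illustrative, not from any particular codebase.)

#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct buffer {
  uint8_t *data; /* NULL when len == 0 */
  size_t len;
};

/* Copy |src| into |out|. When |src| is empty this is memcpy(out, NULL, 0). */
void buffer_copy(uint8_t *out, const struct buffer *src) {
  memcpy(out, src->data, src->len);
}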

Some compilers will use this corner of the standard to assume that pointers passed to memcpy are non-NULL, irrespective of the length argument. GCC has built this in, while Clang can get it from the fact that glibc annotates memcpy with nonnull specifications.

Consider the following function:

#include <stdint.h>
#include <string.h>

int f(uint8_t *dest, uint8_t *src, size_t len) {
  memcpy(dest, src, len);
  return dest == NULL;
}

Here's the output of building that with GCC 6.1.1 with -O2:

0000000000000000 <f>:
   0:	48 83 ec 08          	sub    rsp,0x8
   4:	e8 00 00 00 00       	call   9 <f+0x9>  # memcpy
   9:	31 c0                	xor    eax,eax
   b:	48 83 c4 08          	add    rsp,0x8
   f:	c3                   	ret

From that we can see that rax (which holds the return value of a function in the amd64 ABI) is unconditionally set to zero, i.e. the compiler has assumed that dest == NULL is false because it has been passed to memcpy. The compiler's reasoning goes like this: 7.1.4 says that passing a NULL pointer to a standard library function is undefined behaviour, therefore if dest was NULL any behaviour is reasonable. So the code can be optimised with the assumption that it's non-NULL, as that's the only case with defined behaviour.

(You can also play with this snippet in Matt Godbolt's excellent tool.)

Opinions on this vary from “the C standard defines the language thus that optimisation is fine by definition” to “that's crazy: there's a huge amount of code out there that probably assumes the obvious behaviour of memcpy”. Personally, I find myself further towards the latter position than the former.

Also, it's not just memcpy: the same optimisations are annotated in glibc for (at least) memccpy, memset, memcmp, memchr, memrchr, memmem, mempcpy, bcopy and bcmp. Section 7.1.4 can be applied to any standard library function.

Measurement

To try and figure out the impact that this optimisation is having I built a number of open-source programs with GCC 6.1.1, with -fno-builtin (to disable GCC's built-in versions of these functions) and with glibc's string.h including, or not, the nonnull annotations. For example, the snippet of code above produces this diff when tested this way:

0000000000000000 <f>:
-   0:	48 83 ec 08          	sub    rsp,0x8
+   0:	53                   	push   rbx
+   1:	48 89 fb             	mov    rbx,rdi
    4:	e8 00 00 00 00       	call   9 <f+0x9>
    9:	31 c0                	xor    eax,eax
-   b:	48 83 c4 08          	add    rsp,0x8
-   f:	c3                   	ret    
+   b:	48 85 db             	test   rbx,rbx
+   e:	0f 94 c0             	sete   al
+  11:	5b                   	pop    rbx
+  12:	c3                   	ret    

The added code tests dest to set the return value, as intended.

The first program I tested was BIND 9.9.5 because of this advisory that says: “GCC now includes (by default) an optimization which is intended to eliminate unnecessary null pointer comparisons in compiled code. Unfortunately this optimization removes checks which are necessary in BIND and the demonstrated effect is to cause unpredictable assertion failures during execution of named, resulting in termination of the server process”. Although version 9.9.5 should be affected according to the advisory, I found no differences in the compiled output based on nonnull annotations in string.h. Perhaps it's because I'm using a different GCC, perhaps I just got something wrong in my testing, or perhaps these checks were eliminated for different reasons. (For example, a local root exploit in the kernel was enabled by a dereference-based removal of a NULL check.)

Next up, I tried something that I'm more involved with: BoringSSL. Here there are two changes: a reordering of two conditions in OPENSSL_realloc_clean (which has no semantic effect) and extensive changes in BN_mod_exp_mont. I'm sure I would be able to do a better analysis if I were more experienced with disassembling large programs, but I'm just using objdump and diff. Still, I believe that all the changes are the result of a single NULL check being removed and then the resulting offset shift of all the following code. That counts as an optimisation, but it's statically clear that the pointer cannot be NULL even without any assumptions about string.h functions so I struggle to give much credit.

Since BoringSSL showed some changes, I tried OpenSSL 1.0.2h. This also shows the same large changes around BN_mod_exp_mont. There's also a large change in dsa_builtin_paramgen2 (a function that we don't have in BoringSSL) but that appears to be another insignificant NULL-check removed and a consequent change of all the later offsets. Lastly, there are a handful of no-op changes: like swapping the arguments to cmp before jne.

Next I tried openssh-7.2p2, which shows no changes. I wondered whether someone had already done this analysis and corrected any problems in OpenSSH so tried a much older version too: 5.4p1. That does show a small, but non-trivial, change in ssh_rsa_verify. After a bit of thought, I believe that GCC has managed to eliminate a test for a non-NULL pointer at the end of openssh_RSA_verify. Just like the BoringSSL case, it's already possible to deduce that the pointer must be non-NULL without any section 7.1.4 assumptions.

Conclusions

It's clear that one has to write C code that's resilient to the compiler assuming that any pointers passed to standard library functions are non-NULL. There have been too many releases of glibc and GCC with this in to safely assume that it'll ever go away.

However, the benefits of this (i.e. the optimisations that the compiler can perform because of it) are nearly zero. Large programs can be built where it has no effect. When there are changes they are either cases that the compiler should have been able to figure out anyway, or else noise changes with no effect.

As for the costs: there have been several cases where removing NULL checks has resulted in a security vulnerability, although I can't find any cases of this precise corner of the C standard causing it. It also adds a very subtle, exceptional case to several very common functions, burdening programmers. But it thankfully rarely seems to make a difference in real-life code, so hopefully there's not a large pool of bugs in legacy code that have been created by this change.

Still, given the huge amount of legacy C code that exists, this optimisation seems unwise to me. Although I've little hope of it happening, I'd suggest that GCC and glibc remove these assumptions and that the next revision of the C standard change 7.24.1(2) to clarify that when a length is zero, pointers can be NULL.

If anyone wishes to check my results here, I've put the scripts that I used on GitHub. I'm afraid that it takes a bit of manual setup and, given variation in GCC versions across systems, some differences are to be expected but the results should be reproducible.

Cryptographic Agility (16 May 2016)

(These are notes that I wrote up from a talk that I gave at the National Academies Forum on Cyber Resilience. You can tell that it was in Washington, DC because of the “cyber”.

I wasn't quite sure how technical to pitch this talk so it's relatively introductory; regular readers probably know all this.

This isn't a transcript of what I said, but I try to hit the main points in my notes.)

Firstly I'd like to separate extensibility from agility. A protocol is extensible if you can add features to it without having to update every implementation at the same time—which is generally impossible. Cryptographic agility depends on having extensibility, at least if you ever want to use something that wasn't designed into a protocol at the beginning.

Protocols should be extensible: the world keeps changing and no design is going to be perfect for all time. But extensibility is much harder in practice than it sounds.

I happen to be particularly familiar with TLS and TLS has two, major extensibility mechanisms. The first is a simple version number. Here's how the specification says that it should work:

Client: I support up to version 1.2.

Server: (calculates the minimum of the client's maximum version and its own maximum version) Ok, let's use version 1.1.

This is commendably simple: it's not possible to express a range of versions and certainly not a discontinuous range. This is about as simple as an extensibility mechanism could be, and yet lots of implementations get it wrong. It's a common mistake for implementations to return an error when the client offers a version number that they don't understand.
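
(The intended server-side logic is tiny. Here's a sketch, with placeholder constants rather than anything from a real TLS stack:)

#include <stdint.h>

#define SERVER_MIN_VERSION 0x0301 /* TLS 1.0; placeholder values */
#define SERVER_MAX_VERSION 0x0303 /* TLS 1.2 */

/* Return the version to use, or 0 if there's no version in common. A client
 * offering a higher version than we know about is not an error: we simply
 * reply with our own maximum. */
uint16_t negotiate_version(uint16_t client_max) {
  uint16_t version =
      client_max < SERVER_MAX_VERSION ? client_max : SERVER_MAX_VERSION;
  return version >= SERVER_MIN_VERSION ? version : 0;
}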

This, of course, means that deploying new versions doesn't work. But it's insidious because the server will work fine until someone tries to deploy a new version. We thought that we had flexibility in the protocol but it turned out that bugs in code had rusted it in place.

At this point it's worth recalling the Law of the Internet: blame attaches to the last thing that changed. If Chrome updates and then something stops working then Chrome gets the blame. It doesn't matter that the server couldn't correctly calculate the minimum of two numbers. No normal person understands or cares about that.

What's to be done about this? Well, we work around issues if they're big and suck up the breakage if they're small. It's taken about 15 years to get to the point where web browsers don't have to work around broken version negotiation in TLS and that's mostly because we only have three active versions of TLS. When we try to add a fourth (TLS 1.3) in the next year, we'll have to add back the workaround, no doubt. In summary, this extensibility mechanism hasn't worked well because it's rarely used and that lets bugs thrive.

TLS has a second, major extension mechanism which is a series of (key, value) pairs where servers should ignore unknown keys. This has worked a little better because, while there are only three or four versions in play, with many years between versions, there are 25 to 30 extensions defined. It's not perfect: bugs in implementations have led them to be dependent on the order of extensions and somebody at least managed to write a server that breaks if the last value is empty.

Sometimes more extensibility points have been added inside of extensions in the expectation that it'll save adding another, top-level extension in the future. This has generally been a mistake: these extension points have added complexity for little reason and, when we try to use them, we often find that bugs have rusted them solid anyway. They've just been a waste.

There's a lesson in all this: have one joint and keep it well oiled.

Protocol designers underestimate how badly people will implement their designs. Writing down how you think it should work and hoping that it'll work, doesn't work. TLS's protocol negotiation is trivial and the specification is clear, yet it still didn't work in practice because it's difficult to oil.

Rather one needs to minimise complexity, concentrate all extensibility in a single place and actively defend it. An active defense can take many forms: fuzzing the extensibility system in test suites and compliance testing is good. You might want to define and implement dummy extensions once a year or such, and retire old ones on a similar schedule. When extensions contain lists of values, define a range of values that clients insert at random. In short, be creative otherwise you'll find that bug rust will quickly settle in.

Agility itself

Cryptographic agility is a huge cost. Implementing and supporting multiple algorithms means more code. More code begets more bugs. More things in general means less academic focus on any one thing, and less testing and code-review per thing. Any increase in the number of options also means more combinations and a higher chance for a bad interaction to arise.

Let's just consider symmetric ciphers for a moment. Because everyone wants them to be as fast as possible, BoringSSL currently contains 27 thousand lines of Perl scripts (taken from OpenSSL, who wrote them all) that generate assembly code just in order to implement AES-GCM. That's a tremendous amount of work and a tremendous scope for bugs.

Focusing again on TLS: over the years, 25 different ciphers and modes have been specified for use in TLS. Thankfully, of those, only nine are actively used. But the zombies of the others may still be lurking around, ready to cause problems.

Where did this mess of diversity come from?

1. Old age / we had no idea what we were doing in the 1990's:

3DES_EDE_CBC AES_128_CBC AES_256_CBC DES40_CBC
DES_CBC DES_CBC_40 IDEA_CBC NULL
RC2_CBC_40 RC4_128 RC4_40

A lot of mistakes were made in the 1990's—we really didn't know what we were doing. Phil Rogaway did, but sadly not enough people listened to him; probably because they were busy fighting the US Government which was trying to ban the whole field of study at the time. Unfortunately that coincided with the early inflation period of the internet and a lot of those mistakes were embedded pretty deeply. We're still living with them today.

2. National pride cipher suites

ARIA_128_CBC ARIA_128_GCM ARIA_256_CBC ARIA_256_GCM
CAMELLIA_128_CBC CAMELLIA_128_GCM CAMELLIA_256_CBC CAMELLIA_256_GCM
SEED_CBC

The next cause of excess agility is the national pride cipher suites. Many countries consider cryptography to be an area of national interest, but then mistakenly believe that this means they have to invent their own standards and primitives. South Korea and Japan were especially forthright about this and so managed to get these ciphers assigned code points in TLS, but Russia, China and, to some extent, many other countries do the same thing.

Although they receive limited analysis compared to something like AES, they're generally not bad per se, but they bring nothing new to the table: they add nothing but costs, and the costs are significant. Cryptographic diversity for the sake of national pride should be strenuously resisted for that reason. Other countries may complain that the US got its standards widely used, but the US got to specify a lot about the internet by being the first mover. (And AES is from Belgium anyway.) That said, I'm not aware of any of these national standards being used to promote something that's actually a deliberate backdoor; which is, of course, not true of the US.

3. Reasonable cases for diversity:

  • Embedded systems want to minimise circuit size: AES_128_CCM and AES_256_CCM.
  • We want something faster for when we don't have AES hardware: CHACHA20_POLY1305.
  • US Government standard, got hardware support from Intel: AES_128_GCM and AES_256_GCM.

Now we come to the ones that are reasonable to use and the reasons for diversity there. It's all about performance optimisation for different environments really: tiny devices want CCM because it only needs an AES-encrypt circuit. Devices without hardware support for AES-GCM want to use ChaCha20-Poly1305 because it's much more efficient in software. Everything else wants to use AES-GCM.

Agility has allowed us to introduce the ciphers in the final set and that's really important. But it's equally important to kill off the old stuff, and that's very hard. Nearly all the incentives are aligned against it. Recall the Law of the Internet (mentioned above); users hate stuff breaking and always blame you. Even djb will take to Twitter when one drops DSA support.

We have a long conveyor belt of primitives: we put new ones on at the front and, every so often, we turn the crank and something drops off the end. In addition to all the obvious problems with killing off old stuff, that also means that there are a lot of inadvisable options that will generally function at any given time, and that leads to new products launching with no idea that they're sitting towards the end of the conveyor belt. These products expect a lifetime of some number of years and are unaware that we hope to discontinue something they're using much sooner than that. It's no longer the case that we can assume that waiting a year will reduce the amount of use a deprecated primitive gets, because new launches keep adding to it.

Google tries to address this where it can by requiring support for the newest options in our certification process for devices that interact with our services. But only a tiny subset of the things that interact with Google go through any of our certifications.

Things are even harder in non-interactive cases. TLS at least gets to negotiate between the client and server, but algorithms in S/MIME messages and certificate signatures don't allow that. (One can think of ways to help change that, but the current reality is that they're not negotiated.) That's why dropping SHA-1 support in certificates has been such a gruesome fight and why PKCS#8 messages still require us to support 40-bit RC2.

So what's the lesson here? I'd say that you need extensibility but, when it comes to cryptographic agility, have one option. Maybe two. Fight to keep it that small.

It's worth highlighting that, for the purposes of time, I've simplified things dramatically. I've considered only symmetric ciphers and modes above but, even within TLS, there's a whole separate conveyor belt for asymmetric algorithms. And I've not mentioned the oncoming storm of quantum computers. Quantum computers are going to be hilarious and I hope to be retired before they get big enough to cause problems!

Post-quantum key agreement (24 Dec 2015)

If large quantum computers can be built (and not of the D-Wave variety) then they'll make quite a mess of public-key cryptography. RSA and systems based on discrete logarithms (i.e. finite-field and elliptic-curve Diffie-Hellman) are all broken. I've written about hash-based signatures (which resist quantum attacks) before and the recent PQCRYPTO recommendations suggest those for signatures and McEliece for public-key encryption. Those are both sound, conservative recommendations but both have some size problems: McEliece public keys can be on the order of a megabyte of data and conservative hash-based signatures are about 40KB.

In some situations that's not a problem, and things like firmware signatures, where the key is embedded in hard-to-change silicon, should consider using hash-based signatures today. But those costs motivate the search for post-quantum schemes that are closer to the small, fast primitives that we have today.

One candidate is called Ring Learning-With-Errors (RLWE) and I'll try to give a taste of how it works in this post. This is strongly based on the A New Hope paper by Alkim, Ducas, Pöppelmann & Schwabe, and, in turn, on papers by Bos, Costello, Naehrig & Stebila and Peikert.

Firstly, the basic stats (because I always hate having to dig around for these values in papers). I've included the numbers for a current, elliptic-curve based Diffie-Hellman scheme (X25519) for comparison:

                             A New Hope            X25519
  Alice's transmission size  1,824 bytes (a)       32 bytes
  Alice's computation        129,638 cycles (b)    331,568 cycles (c)
  Bob's transmission size    1,824 bytes           32 bytes
  Bob's computation          126,236 cycles        331,568 cycles

((a) This is using a more compact scheme than in the paper. (b) These are Haswell cycle counts from the paper. (c) These are values from the SUPERCOP benchmark on titan0.)

Something to keep in mind when looking at the numbers above: sending 1,824 bytes on a 10MBit link takes 5.1 million cycles, assuming that your CPU is running at 3.5GHz.
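For reference, the arithmetic behind that figure:

    bits = 1824 * 8            # 14,592 bits to send
    seconds = bits / 10e6      # ≈ 1.46 ms on a 10 MBit/s link
    cycles = seconds * 3.5e9   # ≈ 5.1 million cycles at 3.5 GHz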

RLWE key agreement

Our fundamental setting is ℤ_12289[X]/(X^1024+1). That's the set of polynomials where the largest power of x is 1023 and the coefficients are values between zero and 12288 (inclusive). For example, 66 + 4532x + 10000x^2 + … + 872x^1023.

Addition and multiplication can be defined for polynomials in this set. Addition is done by matching up powers of x and adding the corresponding coefficients. If the result is out of range then it's reduced modulo 12289.

Multiplication is high-school polynomial multiplication where the polynomial on the right is multiplied by every term of the polynomial on the left and the result is the sum of those terms. Coefficients are reduced modulo 12289 to keep them in range, but it's likely that the resulting powers of x will be too large: multiplying two x^1023 terms gives a result in terms of x^2046.

Polynomials with too large a degree can be reduced modulo x^1024+1 until they're in range again. So, if we ended up with a term of a×x^2046 then we could subtract a×x^1022(x^1024+1) to eliminate it. By repeated application of that trick, all the terms with powers of x greater than 1023 can be eliminated and then the result is back in the set.
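Here's a minimal sketch of that arithmetic in Python, storing a polynomial as a plain list of 1024 coefficients (index i holding the coefficient of x^i). It uses schoolbook multiplication for clarity; real implementations are much cleverer (see the optimisations section below).

    Q = 12289   # coefficient modulus
    N = 1024    # we work modulo x^N + 1

    def add_poly(a, b):
        """Add two polynomials coefficient-wise, reducing modulo Q."""
        return [(x + y) % Q for x, y in zip(a, b)]

    def mul_poly(a, b):
        """Schoolbook multiplication followed by reduction modulo x^N + 1.

        Since x^N = -1 in this ring, a term c*x^(N+k) reduces to -c*x^k."""
        full = [0] * (2 * N - 1)
        for i, x in enumerate(a):
            for j, y in enumerate(b):
                full[i + j] = (full[i + j] + x * y) % Q
        reduced = full[:N]
        for k in range(N, 2 * N - 1):
            reduced[k - N] = (reduced[k - N] - full[k]) % Q
        return reduced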

Now that we can add and multiply within this set of polynomials we need to define a noise polynomial: this is simply a polynomial where each coefficient is sampled from a random distribution. In this case, the distribution will be a centered binomial distribution that ranges from -12 to 12. The probability density looks like this:

(Figure: probability density of the centred binomial noise distribution, plotted from -15 to 15.)

An important feature of noise polynomials is that the magnitude of each coefficient is small. That will be critical later on.

A random polynomial is one where each coefficient is sampled from a uniform distribution over the full range of zero to 12288.
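In code (continuing the sketch above, with k=12 matching the range quoted here, and using Python's random module for brevity where a real implementation would need a cryptographically secure source of randomness):

    import random

    def noise_poly(k=12):
        """Sample a noise polynomial: each coefficient is centred-binomial,
        (sum of k coin flips) - (sum of k coin flips), so it lies in [-k, k]
        and is stored modulo Q. The small magnitude is the important part."""
        return [(sum(random.getrandbits(1) for _ in range(k)) -
                 sum(random.getrandbits(1) for _ in range(k))) % Q
                for _ in range(N)]

    def random_poly():
        """Sample a uniformly random polynomial with coefficients in [0, Q)."""
        return [random.randrange(Q) for _ in range(N)]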

To build a Diffie-Hellman protocol from this, Alice generates a random polynomial, a, and two noise polynomials, s and e. Alice calculates b = as+e and sends a and b to Bob. Bob generates his own s′ and e′, uses Alice's a to calculate u = as′+e′, and sends u back to Alice. Now Alice can calculate us = (as′+e′)s = as′s+e′s and Bob can calculate bs′ = (as+e)s′ = ass′+es′. But the keen-eyed will notice that those are different values!

The added noise is necessary for security but it means that the two sides of this protocol calculate different values. But, while the values are different, they're very similar because the magnitude of the noise polynomials is small. So a reconciliation mechanism is needed to find a shared value given two similar polynomials.
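Putting those pieces together, the exchange (before reconciliation) looks like this; a sketch using the add_poly, mul_poly, noise_poly and random_poly functions from above:

    # Alice
    a = random_poly()
    s, e = noise_poly(), noise_poly()
    b = add_poly(mul_poly(a, s), e)      # b = as + e; Alice sends (a, b)

    # Bob
    s2, e2 = noise_poly(), noise_poly()
    u = add_poly(mul_poly(a, s2), e2)    # u = as' + e'; Bob sends u

    # Each side's view of the (almost) shared value:
    alice_view = mul_poly(u, s)          # as's + e's
    bob_view = mul_poly(b, s2)           # ass' + es'
    # The two views differ only by the small e's - es' term, so corresponding
    # coefficients are close to each other modulo Q.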

Reconciliation

So far I've been following the A New Hope paper and it does include a reconciliation mechanism. But, to be honest, I'm not sure that I understand it, so I'm going to be describing a mechanism by Peikert here:

The reconciliation will treat each coefficient in the similar polynomials separately and will extract a single, shared bit from each. Since we're dealing with polynomials that have 1024 terms, we'll get a 1024-bit shared secret in total, but I'm just going to discuss how a single coefficient is processed to get a single bit.

Consider the coefficient space as a circle: zero and 12289 are the same value modulo 12289 and we put that at the top of the circle. At the bottom of the circle will be 12289/2 = 6144 (rounding down). We know that, for each coefficient, Alice and Bob will have similar values—meaning that the values will be close by on the circle.

(Figure: the coefficient space drawn as a circle, with 0 at the top and 6144 at the bottom.)

One option is to split the circle into left and right halves and say that if the point is in the left half then it's a zero, otherwise it's a one.

(Figure: the circle split into left and right halves, labelled 0 and 1.)

But while that will work most of the time, there's obviously a problem when the points are near the top or the bottom. In these cases, a small difference in location can result in a point being in a different half, and thus Alice and Bob will get a different result.

In this case we want to split the circle into top (zero) and bottom (one), so that both points are clearly in the bottom half.

(Figure: the circle split into top and bottom halves, labelled 0 and 1.)

But that just moves the problem around to the left and right edges of the circle. So how about we vary the basis in which we measure the points depending on where they are? If the point is near the bottom or the top we'll use the top–bottom (blue) basis and, if not, we'll use the left–right (red) basis.

(Figure: the circle divided into red regions, measured left–right, and blue regions near the top and bottom, measured top–bottom, with two nearby points straddling a boundary between regions.)

But there's still a problem! Consider the two points in the diagram just above. One party will think that it's in a red area, measure it left–right and conclude that the shared bit is a zero. The other will think it's in a blue area, measure it top–bottom and conclude that the shared bit is a one.

There isn't a solution to this in which the parties operate independently so, instead, one of the parties chooses the basis in which each coefficient will be measured. That choice is a single bit of information per coefficient (i.e. red or blue) that we have Bob send to Alice along with his value u. With this reconciliation information in hand, both parties can measure their points in the same, optimal basis and calculate the same, shared value.
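Here's a sketch of that idea for a single coefficient, continuing the Python sketches above. It's a simplification of Peikert's mechanism rather than a faithful implementation (in particular it ignores the careful handling of boundary cases and bias that the paper deals with), but it shows how the hint bit lets both sides measure in the same basis:

    def choose_basis(v):
        """Bob's hint for one coefficient: use the top-bottom ('blue') split
        if his point is near the top (0) or bottom (Q/2) of the circle,
        otherwise the left-right ('red') split."""
        dist_from_red_split = min(v % (Q // 2), (Q // 2) - (v % (Q // 2)))
        return 'blue' if dist_from_red_split < Q // 8 else 'red'

    def extract_bit(v, basis):
        """Measure one coefficient in the given basis to get one shared bit."""
        if basis == 'red':                      # split through 0 and Q/2
            return 1 if 0 < v < Q // 2 else 0
        else:                                   # split through Q/4 and 3Q/4
            return 1 if Q // 4 <= v < 3 * Q // 4 else 0

    # Bob computes the hint from his coefficient and sends it along with u.
    bob_coeff = 6100
    alice_coeff = (bob_coeff + 37) % Q          # differs by a noise-sized amount
    hint = choose_basis(bob_coeff)
    assert extract_bit(bob_coeff, hint) == extract_bit(alice_coeff, hint)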

Optimisations

There's lots in the paper that I've skipped over here. Most importantly (for performance), a variant of the Fourier transform can be used to convert the polynomials into the frequency domain where multiplication is much faster. Some of the values transmitted can be transmitted in the frequency domain too to save conversions overall. Also, random polynomials can be sampled directly in the frequency domain.

The parameters here have also been carefully selected so that the reduction stage of multiplication happens magically, just by point-wise multiplication by some constants before and after the transform.

The a value that Alice generates could be a global constant, but in order to save worrying about how it was generated, and to eliminate the possibility of all-for-the-price-of-one attacks (like LogJam), it's generated fresh for each instance. Rather than transmit it in full, Alice need only send a seed value for it to Bob.

Contrasts with Diffie-Hellman

The most important change from Diffie-Hellman is that this scheme requires that all values be ephemeral. Both QUIC and TLS 1.3 assume that a Diffie-Hellman public-value can be reused but, in this scheme, that breaks the security of the system. More traditional uses of Diffie-Hellman, e.g. TLS 1.2 and SSH, are fine though.

Another important change is that this scheme takes a full round-trip for Alice. With Diffie-Hellman, both parties can transmit a message at time zero and, as soon as the other party has received the message, they can calculate the shared key and start transmitting. But even if the a value is a global constant in this scheme, the reconciliation process means that Bob can't send a message until he's received Alice's message, and Alice can't calculate the shared key until she has received Bob's message. So Alice has to wait a full round-trip.

Often that limitation isn't important because other parts of the protocol already require the round-trip (for example, in TLS). But for some uses it's a critical difference.

Also, since this protocol involves random noise it has a failure probability: it's possible that the reconciliation mechanism produces different answers for each side. Thankfully this probability can be made negligible (i.e. less than one in 2^64).

I should also note that the RLWE problem is hypothesised to be resistant to quantum attack, but we don't know that. We also don't know that it's resistant to attacks by classical computers! It's possible that someone will develop a classical algorithm tomorrow that breaks the scheme above. Thus it should be used concurrently with a standard Diffie-Hellman (e.g. X25519) and the outputs of each should be concatenated as the input keying material for a KDF.
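A minimal sketch of that hybrid construction, assuming you already have the two raw shared secrets as byte strings (a single HMAC step stands in for a full KDF such as HKDF, and the label is arbitrary):

    import hashlib, hmac

    def combined_key(rlwe_shared, x25519_shared, label=b"hybrid key agreement"):
        """Concatenate both shared secrets and run them through a KDF step, so
        the result stays secure as long as either input remains unbroken."""
        return hmac.new(label, rlwe_shared + x25519_shared, hashlib.sha256).digest()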

