Passkeys

This is an opinionated, “quick-start” guide to using passkeys as a web developer. It’s hopefully broadly applicable, but one size will never fit all authentication needs and this guide ignores everything that’s optional. So take it as a worked example, but not as gospel.

It doesn't use any WebAuthn libraries, it just assumes that you have access to functions for verifying signatures. That mightn't be optimal—maybe finding a good library is better idea—but passkeys aren't so complex that it's unreasonable for people to know what's going on.

This is probably a post that'll need updating over time, making it a bad fit for a blog, so maybe I'll move it in the future. But it's here for now.

Platforms for developing with passkeys include:

Database changes

Each user will need a passkey user ID. The user ID identifies an account, but should not contain any personally identifiable information (PII). You probably already have a user ID in your system, but you should make one specifically for passkeys to more easily keep it PII-free. Create a new column in your users table and populate it with large random values for this purpose. (The following is in SQLite syntax so you’ll need to adjust for other databases.)

/* SQLite can't set a non-constant DEFAULT when altering a table, only
 * when creating it, but this is what we would like to write. */
ALTER TABLE users ADD COLUMN passkey_id blob DEFAULT(randomblob(16));

/* The CASE expression causes the function to be non-constant. */
UPDATE USERS SET passkey_id=hex(randomblob(CASE rowid WHEN 0
                                                      THEN 16
                                                      ELSE 16 END));

A user can only have a single password but can have multiple passkeys. So create a table for them:

CREATE TABLE passkeys (
  id BLOB PRIMARY KEY,
  username STRING NOT NULL,
  public_key_spki BLOB,
  backed_up BOOLEAN,
  FOREIGN KEY(username) REFERENCES users(username));

Secure contexts

Nothing in WebAuthn works outside of a secure context, so if you’re not using HTTPS, go fix that first.

Enrolling existing users

When a user signs in with a password, you might want to prompt them to create a passkey on the local device for easier sign-in next time. First, check to see if their device has a local authenticator and that the browser is going to support passkeys:

if (!window.PublicKeyCredential ||
    !(PublicKeyCredential as any).isConditionalMediationAvailable) {
  return;
}

Promise.all([
    (PublicKeyCredential as any).isConditionalMediationAvailable(),
    PublicKeyCredential
      .isUserVerifyingPlatformAuthenticatorAvailable()])
  .then((values) => {
    if (values.every(x => x === true)) {
      promptUserToCreatePlatformCredential();
    }
  })

(The snippets here are in TypeScript. It should be easy to convert them to plain Javascript if that’s what you need. You might notice several places where TypeScript’s DOM types are getting overridden because lib.dom.d.ts hasn’t caught up. I hope these cases will disappear in time.)

If the user accepts, ask the browser to create a local credential:

var createOptions : CredentialCreationOptions = {
  publicKey: {
    rp: {
      // The RP ID. This needs some thought. See comments below.
      id: SEE_BELOW,
      // This field is required to be set to something, but you can
      // ignore it.
      name: "",
    },

    user: {
      // `userIdBase64` is the user's passkey ID, from the database,
      // base64-encoded.
      id: Uint8Array.from(atob(userIdBase64), c => c.charCodeAt(0)),
      // `username` is the user's username. Whatever they would type
      // when signing in with a password.
      name: username,
      // `displayName` can be a more human name for the user, or
      // just leave it blank.
      displayName: "",
    },

    // This lists the ids of the user's existing credentials. I.e.
    //   SELECT id FROM passkeys WHERE username = ?
    // and supply the resulting list of values, base64-encoded, as
    // existingCredentialIdsBase64 here.
    excludeCredentials: existingCredentialIdsBase64.map(id => {
      return {
        type: "public-key",
        id: Uint8Array.from(atob(id), c => c.charCodeAt(0)),
      };
    }),

    // Boilerplate that advertises support for P-256 ECDSA and RSA
    // PKCS#1v1.5. Supporting these key types results in universal
    // coverage so far.
    pubKeyCredParams: [{
      type: "public-key",
      alg: -7
    }, {
      type: "public-key",
      alg: -257
    }],

    // Unused during registrations, except in some enterprise
    // deployments. But don't do this during sign-in!
    challenge: new Uint8Array([0]),

    authenticatorSelection: {
      authenticatorAttachment: "platform",
      requireResidentKey: true,
    },

    // Three minutes.
    timeout: 180000,
  }
};

navigator.credentials.create(createOptions).then(
  handleCreation, handleCreationError);

RP IDs

There are two levels of controls that prevent passkeys from being used on the wrong website. You need to know about this upfront to prevent getting stuck later.

“RP” stands for “relying party”. You (the website) are a “relying party” in authentication-speak. An RP ID is a domain name and every passkey has one that’s fixed at creation time. Every passkey operation asserts an RP ID and, if a passkey’s RP ID doesn’t match, then it doesn’t exist for that operation.

This prevents one site from using another’s passkeys. A passkey with an RP ID of foo.com can’t be used on bar.com because bar.com can’t assert an RP ID of foo.com. A site may use any RP ID formed by discarding zero or more labels from the left of its domain name until it hits an eTLD. So say that you’re https://www.foo.co.uk: you can assert www.foo.co.uk (discarding zero labels), foo.co.uk (discarding one label), but not co.uk because that hits an eTLD. If you don’t set an RP ID in a request then the default is the site’s full domain.

Our www.foo.co.uk example might happily be creating passkeys with the default RP ID but later decide that it wants to move all sign-in activity to an isolated origin, https://accounts.foo.co.uk. But none of the passkeys could be used from that origin! If would have needed to create them with an RP ID of foo.co.uk in the first place to allow that.

But you might want to be careful about always setting the most general RP ID because then usercontent.foo.co.uk could access and overwrite them too. That brings us to the second control mechanism. As you’ll see later, when a passkey is used to sign in, the browser includes the origin that made the request in the signed data. So accounts.foo.co.uk would be able to see that a request was triggered by usercontent.foo.co.uk and reject it, even if the passkey’s RP ID allowed usercontent.foo.co.uk to use it. But that mechanism can’t do anything about usercontent.foo.co.uk being able to overwrite them.

So either pick an RP ID and put it in the “SEE BELOW” placeholder, above. Or else don’t include the rp.id field at all and use the default.

Recording a passkey

When the promise from navigator.credentials.create resolves successfully, you have a newly created passkey! Now you have to ensure that it gets recorded by the server.

The promise will result in a PublicKeyCredential object, the response field of which is an AuthenticatorAttestationResponse. First, sanity check some data from the browser. Since this data isn’t signed over in the configuration that we’re using, it’s fine to do this check client-side.

const cdj = JSON.parse(
    new TextDecoder().decode(cred.response.clientDataJSON));
if (cdj.type != 'webauthn.create' ||
    (('crossOrigin' in cdj) && cdj.crossOrigin) ||
    cdj.origin != 'https://YOURSITEHERE') {
  // handle error
}

Call getAuthenticatorData() and getPublicKey() on response and send those ArrayBuffers to the server.

At the server, we want to insert a row into the passkeys table for this user. The authenticator data is a fairly simple, binary format. Offset 32 contains the flags byte. Sanity check that bit 7 is set and then extract:

  1. Bit 4 as the value of backed_up. (I.e. (authData[32] >> 4) & 1.)
  2. The big-endian, uint16 at offset 53 as the length of the credential ID.
  3. That many bytes from offset 55 as the value of id.

The ArrayBuffer that came from getPublicKey() is the value for public_key_spki. That should be all the values needed to insert the row.

Handling a registration exception

The promise from create() might also result in an exception. InvalidStateError is special and means that a passkey already exists for the local device. This is not an error, and no error will have been shown to the user. They’ll have seen a UI just like they were registering a passkey but the server doesn’t need to update anything.

NotAllowedError means that the user canceled the operation. Other exceptions mean that something more unexpected happened.

To test whether an exception is one of these values do something like:

function handleCreationError(e: Error) {
  if (e instanceof DOMException) {
    switch (e.name) {
      case 'InvalidStateError':
        console.log('InvalidStateError');
        return;

      case 'NotAllowedError':
        console.log('NotAllowedError');
        return;
    }
  }

  console.log(e);
}

(But obviously don’t just log them to the console in real code.)

Signing in with autocomplete

Somewhere on your site you have username & password inputs. On the username input element, add webauthn to the autocomplete attribute. So if you have:

<input type="text" name="username" autocomplete="username">

… then change that to …

<input type="text" name="username" autocomplete="username webauthn">

Autocomplete for passkeys works differently than for passwords. For the latter, when the user selects a username & password from the pop-up, the input fields are filled for them. Then they can click a button to submit the form and sign in. With passkeys, no fields are filled, but rather a pending promise is resolved. It’s then the site’s responsibility to navigate/update the page so that the user is signed in.

That pending promise must be set up by the site before the user focuses the username field and triggers autocomplete. (Just adding the webauthn tag doesn’t do anything if there’s not a pending promise for the browser to resolve.) To create it, run a function at page load that:

  1. Does feature detection and, if supported,
  2. Starts a “conditional” WebAuthn request to produce the promise that will be resolved if the user selects a credential.

Here’s how to do the feature detection:

if (!window.PublicKeyCredential ||
    !(PublicKeyCredential as any).isConditionalMediationAvailable) {
  return;
}

(PublicKeyCredential as any).isConditionalMediationAvailable()
  .then((result: boolean) => {
    if (!result) {
      return;
    }

    startConditionalRequest();
  });

Then, to start the conditional request:

var getOptions : CredentialRequestOptions = {
  // This is the critical option that tells the browser not to show
  // modal UI.
  mediation: "conditional" as CredentialMediationRequirement,

  publicKey: {
    challenge: Uint8Array.from(atob(CHALLENGE_SEE_BELOW), c =>
                                 c.charCodeAt(0)),

    rpId: SAME_AS_YOU_USED_FOR_REGISTRATION,
  }
};

navigator.credentials.get(getOptions).then(
  handleSignIn, handleSignInError);

Challenges

Challenges are random values, generated by the server, that are signed over when using a passkey. Because they are large random values, the server knows that the signature must have been generated after it generated the challenge. This stops “replay” attacks where a signature is captured and used multiple times.

Challenges are a little like a CSRF token: they should be large (16- or 32-byte), cryptographically-random values and stored in the session object. They should only be used once: when a sign-in attempt is received, the challenge should be invalidated. Future sign-in attempts will have to use a fresh challenge.

The snippet above has a value CHALLENGE_SEE_BELOW which is assumed to be the base64-encoded challenge for the sign-in. The sign-in page might XHR to get the challenge, or the challenge might be injected into the page’s template. Either way, it must be generated at the server!

Handling sign-in

If the user selects a passkey then handle­Sign­In will be called with a Public­Key­Credential object, the response field of which is a Authenticator­Assertion­Response. Send the Array­Buffers raw­Id, response.­client­Data­JSON, response.­authenticator­Data, and response.­signature to the server.

At the server, first look up the passkey: SELECT (username, public_key_spki, backed_up) FROM passkey WHERE id = ? and give the value of rawId for matching. The id column is a primary key, so there can either be zero or one matching rows. If there are zero rows then the user is signing in with a passkey that the server doesn’t know about—perhaps they deleted it. This is an error, reject the sign-in.

Otherwise, the server now knows the claimed username and public key. To validate the signature you’ll need to construct the signed data and parse the public key. The public_key_spki values from the database are stored in SubjectPublicKeyInfo format and most languages will have some way to ingest them. Here are some examples:

Your languages’s crypto library should provide a function that takes a signature and some signed data and tells you whether that signature is valid for a given public key. For the signature, pass in the value of the signature ArrayBuffer that the client sent. For the signed data, calculate the SHA-256 hash of clientDataJSON and append it to the contents of authenticatorData. If the signature isn’t valid, reject the sign-in.

But there are still a bunch of things that you need to check!

Parse the clientDataJSON as UTF-8 JSON and check that:

  1. The type member is “webauthn.get”.
  2. The challenge member is equal to the base64url encoding of the challenge that the server gave for this sign-in.
  3. The origin member is equal to your site’s sign-in origin (e.g. a string like “https://www.example.com”).
  4. The crossOrigin member, if present, is false.

There’s more! Take the authenticatorData and check that:

  1. The first 32 bytes are equal to the SHA-256 hash of the RP ID that you’re using.
  2. That bit zero of the byte at offset 32 is one. I.e. (authData[32] & 1) == 1. This is the user presence bit that indicates that a user approved the signature.

If all those checks work out then sign in the user whose passkey it was. I.e. set a cookie and respond to the running Javascript so that it can update the page.

If the stored value of backed_up is not equal to (authData[32] >> 4) & 1 then update that in the database.

Removing passwords

Once a user is using passkeys to sign in, great! But if they were upgraded from a password then that password is hanging around on the account, doing nothing useful yet creating risk. It would be good to ask the user about removing the password.

Doing this is reasonable if the account has a backed-up passkey. I.e. if SELECT 1 FROM passkeys WHERE username = ? AND backed_up = TRUE has results. A site might consider prompting the user to remove the password on an account when they sign in with a passkey and have a backed-up one registered.

Registering new, passkey-only users

For sign ups of new users, consider making them passkey-only if the feature detection (from the section on enrolling users) is happy.

When enrolling users where a passkey will be their only sign-in method you really want the passkey to end up “in their pocket”, i.e. on their phone. Otherwise they could have a passkey on the computer that they signed-up with but, if it’s not syncing to their phone, that’s not very convenient. There is not, currently, a great answer for this I’m afraid! Hopefully, in a few months, calling navigator­.credentials.­create() with authenticator­Selection.­authenticator­Attachment set to cross-plat­form will do the right thing. But with iOS 16 it’ll exclude the platform authenticator.

So, for now, do that on all platforms except for iOS/iPadOS, where authenticator­Attachment should continue to be plat­form.

(I’ll try and update this section when the answer is simplier!)

Settings

If you’ve used security keys with any sites then you’ll have noticed that they tend to list registered security keys in their account settings, have users name each one, show the last-used time, and let them be individually removed. You can do that with passkeys too if you like, but it’s quite a lot of complexity. Instead, I think you can have just two buttons:

First, a button to add a passkey that uses the createOptions object from above, but with authenticatorAttachment deleted in order to allow other devices to be registered.

Second, a “reset passkeys” button (like a “reset password” button). It would prompt for a new passkey registration, delete all other passkeys, and invalidate all other active sessions for the user.

Test vectors

Connecting up to your language’s crypto libraries is one of the trickier parts of this. To help, here are some test vectors to give you a ground truth to check against, in the format of Python 3 code that checks an assertion signature.

import codecs

from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.serialization import (
  load_der_public_key)

# This is the public key in SPKI format, as obtained from the
# `getPublicKey` call at registration time.
public_key_spki_hex = '''
3059301306072a8648ce3d020106082a8648ce3d03010703420004dfacc605c6
e1192f4ab89671edff7dff80c8d5e2d4d44fa284b8d1453fe34ccc5742e48286
d39ec681f46e3f38fe127ce27c30941252430bd373b0a12b3e94c8
'''

# This is the contents of the `clientDataJSON` field at assertion
# time. This is UTF-8 JSON that you also need to validate in several
# ways; see the main body of the text.
client_data_json_hex = '''
7b2274797065223a22776562617574686e2e676574222c226368616c6c656e67
65223a22594934476c4170525f6653654b4d455a444e36326d74624a73345878
47316e6f757642445a483664436141222c226f726967696e223a226874747073
3a2f2f73656375726974796b6579732e696e666f222c2263726f73734f726967
696e223a66616c73657d
'''

# This is the `authenticatorData` field at assertion time. You also
# need to validate this in several ways; see the main body of the
# text.
authenticator_data_hex = '''
26bd7278be463761f1faa1b10ab4c4f82670269c410c726a1fd6e05855e19b46
0100000cc7
'''

# This is the signature at assertion time.
signature_hex = '''
3046022100af548d9095e22e104197f2810ee9563135316609bc810877d1685b
cff62dcd5b022100b31a97961a94b4983088386fd2b7edb09117f4546cf8a5c1
732420b2370384fd
'''

def from_hex(h):
    return codecs.decode(h.replace('\n', ''), 'hex')

def sha256(m):
    digest = hashes.Hash(hashes.SHA256())
    digest.update(m)
    return digest.finalize()

# The signed message is calculated from the authenticator data and
# clientDataJSON, but the latter is hashed first.
signed_message = (from_hex(authenticator_data_hex) +
                  sha256(from_hex(client_data_json_hex)))

public_key = load_der_public_key(from_hex(public_key_spki_hex))
public_key.verify(from_hex(signature_hex),
                  signed_message,
                  ec.ECDSA(hashes.SHA256()))
# `verify` throws an exception if the signature isn't valid.
print('ok')

Where to ask questions

StackOverflow is a reasonable place, with the passkey tag.

Passkeys

The presentations are out now (Google I/O, WWDC): we're making a push to take WebAuthn to the masses.

WebAuthn has been working reasonably well for enterprises and technically adept users. But we were not going to see broad adoption while the model was that you had to purchase a pair of security keys, be sure to register the backup on every site, yet also keep it in a fire safe. So the model for consumers replaces security keys with phones, and swaps out having a backup authenticator with backing up the private keys themselves. This could have been a different API, but it would have been a lot to have a second web API for public-key authentication, so WebAuthn it is. Basing things on WebAuthn also means that you can still use your security key if you like*, and hopefully with a expanded ranges of sites.

(* albeit not with Android this year, because it doesn't support a recent enough version of CTAP yet. Sorry.)

The WebAuthn spec is not a gentle introduction, but you can find several guides on how to make the API calls. Probably there'll be more coming. What I wanted to cover in this post is the groundwork semantics around passkeys. These are not new if you're fully versed in WebAuthn, but they are new if you've only used WebAuthn in a 2nd-factor context.

I'll probably build on this in some future posts and maybe amalgamate some past writings into a single document. The next paragraph just drops you into things without a lot of context. Perhaps it'll be useful for someone, but best to understand it as fragments of a better document that I'm accumulating.

Ok:

An authenticator is a map from (RP ID, user ID) pairs, to public key credentials. I'll expand on each of those terms:

An authenticator, traditionally, is the physical thing that holds keys and signs stuff. Security keys are authenticators. Laptops can be too; Windows Hello has been an authenticator for a while now. In the world of passkeys, phones are important authenticators. Now that keys may sync, you might consider the sync account itself to be a distributed authenticator. So, rather than thinking of authenticators as physical things, think of it as whatever maintains this map that contains the user's credentials.

An RP ID identifies a website. It's a string that contains a domain name. (In non-web cases, like SSH, it can be a URL, but I'm not covering them here.) A website can use an RP ID if that RP ID is equal to, or a suffix of, the site's domain, and the RP ID is at least an eTLD + 1. So https://foo.bar.example.com can use RP IDs foo.bar.example.com, bar.example.com, and example.com, but not com (because that's less than an eTLD + 1), nor example.org (because that's an unrelated domain). Because the credential map is keyed by RP ID, one website can't use another's credentials. However, be conscious of subdomains: usercontent.example.com can use an RP ID of example.com, although the client data will let the server know which origin made any given request.

Next, a user ID is an opaque byte string that identifies an account on a website. The spec says that it mustn't contain identifiable information (i.e. don't just make it the user's email address) because security keys don't protect the user ID to the same degree that they protect other information. The recommendation is to add a column to your users table, generate a large random value on demand, and store it there for each user. (The spec says 64 bytes but if you just generated 16 from a secure random source, I think you would be fine.) You could also HMAC an internal user ID, although that concentrates risk in that HMAC key.

Recall that an authenticator maps (RP ID, user ID) pairs to credentials? The important consequence is that if a site creates a credential it'll overwrite any existing credential with the same user ID. So an authenticator only contains a single credential for a given account. (For those who know WebAuthn already: I'm assuming discoverable credentials throughout.)

A credential is a collection of various fields. Most obviously a private key, but also metadata: the RP ID, for one, and user information. There's three pieces of user information: the user name, display name, and ID. We've covered the user ID already. The other two are free-form strings that get displayed in UI to help a user select a credential. The user name is generally how a user identifies themselves to the site, e.g. an email address. The display name is how the user would want to be addressed, which could be their legal name. Of the two, the user name will likely be more prominent in UIs.

Lastly a passkey is a WebAuthn credential that is safe and available when the user needs it, i.e. backed up. Not all implementations will be backing up credentials right away and passkeys on those platforms can be called “single-device passkeys”, which is a little awkward, but so's the lack of backup.

Another important aspect of the structure of things is that, while an account only has a single password, it can have multiple passkeys. That's because passkeys can't be copy–pasted around like passwords can. Instead users will register a passkeys as needed to cover their set of devices.

The authenticatorAttachment field in the assertion structure hints to the website about when an additional passkey might be needed. If the value of that field is cross-platform then the user had to use another device to sign-in and it might be worth asking them if they want to register the local device.

When there are multiple passkeys registered on a site, users will need to manage them. The way that sites have often ended up managing 2nd-factor WebAuthn credentials is via an explicit list in the user's account settings. Usually the user will be prompted to name a credential at registration time to distinguish them. That's still a reasonable way to manage passkeys if you like. (We pondered whether browsers could send a passkey “name” at registration time to avoid prompting the user for one but, in a world of syncing, there doesn't seem to be a good value that isn't either superfluously generic, e.g. “Google devices”, or a privacy problem, e.g. the sync account identifier).

If prompting for names and having an explicit list seems too complex then I think it would also be fine to simply have a reset button that a) registers a new passkey, b) deletes every other passkey on the account, and c) signs out all other sessions. I.e. a button that mirrors a password reset flow. For a great many sites, that would work well.

We don't want passwords to hang around on an account forever as a neglected weak-link. In order to guide sites in determining when that might be worth prompting the user to remove their password, there's a new backup state bit in the authenticator data that the server gets with each authentication. If it's set then the passkey will survive the loss of the device. (Unless the device is destroyed after creating the passkey but before managing to sync. But that's the same as generating a random password in a password manager.)

There are not firm rules around using that bit, but once the user has a backed up passkey on a portable authenticator then it's probably time to ask about dropping the password. Sites can know this when they see a creation or assertion operation with the backup state bit set and either the attachment is cross-platform, or it's platform but that platform is a mobile device. That's a conservative set of rules because, for example, a credential created on a MacBook might get synced to the user's iPhone. But maybe that user doesn't have an iPhone.

As you can see, one of the challenges with passkeys is the complexity! Several teams are Google are still working on fleshing out the things demoed at I/O but we know that good guidance will be important. In the mean time, I'm happy to answer questions over Twitter and am pondering if public office hours would be helpful too.

The several canons of CBOR

There are many encoding formats. CBOR is one of them. Like several others, a subset of it basically fine—I'm not starting that fight today.

Whatever encoding you use, it's nice to reduce flexibility. If there are multiple ways of encoding the same thing then, for anything with a non-negligible diversity of implementations, you'll find that there is a canonical encoding, it's just not documented. As soon as one implementation tries using a valid, but less common encoding, it'll find that it doesn't work and something will change to sort it all out. But that process of eventual coordination is expensive—better to do it explicitly and up-front.

For example, TCP options are chunks of tag–length–value and the spec doesn't define any restriction on their ordering. That didn't stop me blowing up a Ubuntu release because a change of mine happened to alter the order that Linux sent them, and some DSL modems dropped packets when options came in that order. That's an expensive discovery of an implicit canonicalisation rule.

If you're using CBOR then RFC 7049 has a section titled “Canonical CBOR”. One of the things it defines is how to order map elements. Great! No more TCP options messes. You can read that section for yourself, but one interpretation of the words there is what I'll call the three-step ordering for map keys:

  1. Lowest valued types come first then, within each type,
  2. Shortest encoded key comes first then, within consecutive keys with equal lengths,
  3. Sort lexicographically.

However, there is another interpretation of the same section which I'll call the two-step ordering. It drops the first step, i.e. doesn't sort by types. When I first read that section, I certainly came away with only one interpretation, but I can somewhat see how the other arises.

CTAP, the protocol for talking to security keys, uses CBOR a lot and explicitly picks the three-step ordering. An errata was eventually raised against the RFC to “clarify” that the three-step order was correct, but it was rejected as a semantic change. So perhaps that's an official ruling that two-step was the correct understanding?

It only matters if you mix different types of keys in the same map, and sometimes you don't, so maybe it's moot for you. (But keep in mind that negative and non-negative numbers are different types in CBOR.) Anyway, the IETF revised the CBOR RFC to firmly clarify the issue add a third option:

RFC 7049 is obsoleted by 8949. The old “Canonical CBOR” section is replaced by “Core Deterministic Encoding Requirements” and that specifies what I'll have to call the one-step ordering:

  1. Sort lexicographically on the encoded key.

It has some advantages! Like, why was it ever more complicated than that in the first place? But it was, and now there's two (or three) orderings. The two-step ordering even has a subsection of its own in the new RFC and a note saying that it “could be called Old Canonical CBOR”. If only shipped implementations were as changeable as the X-Men universe.

So that's why there's two and half “canonical” CBORs and now I have something that I can reference when people ask me about this.

Update: Someone on Twitter (Thomas Duboucher, perhaps? Sorry, I lost the tweet!) pointed out that the 1- and 3-step ordering give the same results in many cases. Having thought about it, that's correct! As long as you don't use maps, arrays, or tagged values as map keys, I believe they coincide. For something like CTAP2, where map keys are always ints or strings, they work out the same. So perhaps things aren't so bad. (Or perhaps a really subtle difference is worse! Time will tell.)

Picking parameters

When taking something from cryptographic theory into practice, it's very important to pick parameters. I don't mean picking the right parameters — although that certainly helps. I mean picking parameters at all.

That might seem obvious, but there are pressures pushing towards abdication: what if you get it wrong? Why not hedge bets and add another option? What about the two groups who already shipped something? We could just add options so that everyone's happy.

There are always exceptions, but the costs of being flexible are considerable. ANSI X9.62, from 1998, specified how elliptic curves were encoded in ASN.1 and included not just 27 possible curves, but allowed the parameters to be inherited from the issuing CA and allowed completely arbitrary elliptic curves to be defined inside a public key. That specifies almost nothing and avoids really standardising on anything. Thankfully I've not seen parameter inheritance implemented for ECC (it would be a security mess) but support for custom elliptic curves does, sadly, exist.

A couple of years ago, Microsoft had a signature bypass because of support for arbitrary curves [details]. Today, OpenSSL had an infinite loop in certificate parsing because of the same. On the flip side, I'm not aware that anyone has ever used the ability to specify arbitrary curves for anything useful.

It's not fair to just blame X9.62: don't get me started about RSA PSS. The cost of having too many options is frequently underestimated. Why does AES-192 exist, for example, other than to inflate test matrices?

As an aside, it's interesting to wonder how many CPU decades have been spent fuzzing OpenSSL's certificate parsing and yet today's issue wasn't reported before. This isn't a 2-128 fraction of state space that fuzzers couldn't be expected to find, and I'm not sure that Tavis found it from fuzzing at all.

Phones as security keys in Chrome

With Chrome 94, if you have an Android phone with Chrome on it, and it’s syncing to the same Google account as Chrome on a Chrome OS/Windows/macOS device, then you’ll be able to use that phone as a security key. You should be able to try this out on any WebAuthn using website, for example here. (But not accounts.google.com, which uses a different system.)

The reason that you are reading about this here and not on an official Google site is that people shouldn’t start registering their phone as a security key unless they have a physical security key as a back up. Just like a regular security key, if you lose the phone then you lose the credentials. So, just like a regular security key, you should have a back up. (You can also lose your credentials if you remove the screen lock, or somehow wipeout Play Services state — e.g. by doing a factory reset.)

We have plans for addressing this and making this suitable for regular people, and to allow use in other profiles, but we’re not there yet. We are interested in whether the communications infrastructure is good enough though. (More below.)

For signing into Google, it has long been possible to use your phone as a security key. This only worked in Chrome and functioned over BLE GATT between the desktop and phone. We have wanted to expand this to the web in general for years, but the success rate that we measured with BLE was poor. After quite a lot of work trying to improve the BLE success rate, we didn’t achieve much.

But the use of BLE is more than just a convenience. The security model demands some proof of physical proximity between the authenticator and the machine that is being authenticated. For a USB security key the authenticator will only respond to something that is making physical contact with it. So when a phone is acting as a security key it needs to prove that the machine it is talking to is physically close by. (Or at least that the attacker is in control of a BLE radio that is physically close.)

We looked at other Bluetooth modes in the hopes that they might work better, but classic Bluetooth RFCOMM isn’t supported on iOS and requires a lot of user interaction on android. BLE L2CAP is supported on iOS, but isn’t supported (in user space) on Windows. It’s also flaky in the face of MAC address rotation if the devices aren’t paired.

So where we’ve ended up is that all the communication happens over the internet connection, but the phone sends a nonce in a BLE advert and the other end of the channel has to prove receipt. That’s the least amount of Bluetooth we could use while still requiring physical proximity. Needing bilateral internet connectivity is unfortunate though. So you can also connect the phone with a USB cable while the security key operation is running. (But very sadly not on Windows; The USB stack there just isn’t designed in the right way for it.) We might also add L2CAP as an option in the future.

This isn’t enabled on Linux at the moment. Historically trying to do the BLE GATT connection would often fail with bluez, and so the phone as a security key infrastructure was disabled on Linux. Now that the desktop only needs to receive a BLE advert it looks like it could work, but we haven’t flipped that switch yet.

As I mentioned above, we are interested in whether the underlying infrastructure is plausible. Aggregated anonymous statistics are useful for many things but in this case they suggest that BLE isn’t always working as well as it should, but don’t tell us why not. So if you are especially keen about security keys and want to try this out, I’d be interested in your experiences. I can't promise to respond but I will read everything if you send me an email (agl at chromium dot org) or tweet at me (agl__).

Some troubleshooting hints if you're having issues because this will be much faster than asking me!

If no phones appear as options: you're using Windows, macOS, or Chrome OS, yes? And it's an Android phone? And the machine has working Bluetooth? And Chrome is up to date everywhere? If you navigate to chrome://sync-internals, is the “Transport state” at the top-left reporting “Active”? If not, enable Sync. On the phone, does Settings say that Sync is on at the top? If not, enable it. Is the account listed in Settings on the phone the same as the “Username” in chrome://sync-internals on the desktop? If all that's good then probably you just need to wait because it can take a couple of days for the registration to propagate. Probably in the “Device Info” section of the “Sync Node Browser” in chrome://sync-internals your phone is listed, but there's no paask_fields section yet. If you want to short-circuit that, you can install Chrome Canary on the phone and enable syncing there. That should register quite quickly.

You can select the phone on the desktop, but nothing happens: the phone should be getting a cloud message poke that triggers a notification. Does the phone have internet access? Did you completely disable notifications from Chrome? `adb logcat | grep -i cable` would be interesting if you're setup for that. Otherwise, if this is a common issue, I might have to add some logging and ask for the tunnel URL from chrome://device-log.

You get the notification and tap it, but something else goes wrong: if an error code or error message is displayed then I want to know what it is! If it's hanging then the message at the bottom of the screen changes for each step of the process so that's what'll be useful. Most problems seem to be BLE: is the phone close to the desktop? What other BLE is happening on the devices?

Efficient QR codes

QR codes seem to have won the battle for 2D barcodes, but they're not just a bag of bits inside. Their payload is a series of segments, each of which can have a different encoding. Segments are bitstrings, concatenated without any byte-alignment, and terminated with an empty segment of type zero. If you want to squeeze the maximum amount of data into a QR code without it turning into a gray square, understanding segmentation helps.

The most basic segment type is byte mode, which is a series of 8-bit bytes. If you control the QR decoder then this is perfectly efficient for encoding binary data. But you probably need to work with a variety of decoders. In that case, beware: the first edition of the QR standard, ISO/IEC 18004, said that byte mode should be interpreted as JIS X 0201. The 2005 edition changed that to be ISO/IEC 8859-1 (i.e. Latin-1). In practice, some QR decoders will attempt to content-sniff the encoding because, while UTF-8 contents should be indicated with an ECI header, they often aren't and UTF-8 is really common.

So if you put binary data into a QR code, you are probably going to hit these edge cases. The contents are likely going to be passed to a general operating system API for handling URLs — do you think the full pipeline will handle NUL bytes in the string, and UTF-8 non-characters and invalid surrogate pairs when interpreted as UTF-8? Also, you probably want your QR contents to be a printable string: bits of it might be shown in the scanner's UI; users might need to copy and paste them.

So let's assume that you want something printable and consider an obvious answer: base64url. It's very common, printable, and doesn't contain any characters that are special in URLs for maximum compatibility. It'll be encoded in a byte-mode segment: each base64url character contains 6 bits of information and takes 8 bits in the QR code for an efficiency of 75%. That's our baseline.

The next segment type to consider is digit mode. This only encodes the digits, 0–9, by packing triples of digits into 10 bits. If there are two digits left over at the end then they take 7 bits, and a singleton takes 4 bits. Ignoring the potential digits at the end, this lets you store 3×log2(10) = 3×3.322 = 9.966 bits of information in 10 bits of space. That's 99.66% efficient! So you can clearly do better than base64url.

The last segment type for these purposes is alphanumeric mode. These segments can encode A–Z, 0–9, and nine special characters: $, %, *, +, -, ., /, :, and space. Pairs of these characters are encoded in 11 bits. (If there's an odd number then the last takes 6 bits.) If you consider this to be “base-45” encoding then it stores 2×log2(45) = 10.98 bits in 11 bits of space, for 99.85% efficiency. Even better than digit mode, although only just.

So maybe you should base-45 encode your data using that alphabet. But, of the special characters supported in alphanumeric mode, only two (minus and period) are unreserved (i.e. safe) in URLs. So you might be reduced to base-38, which cuts the efficiency to 95.42%. But having textually smaller QR contents might be valuable and worth a few percent efficiency in your context.

If you've picked base-10 (digits), base-38, or even base-45 for your data then you need to get it into that form. Base-64 is easy because that's exactly 6 bits per character; you work on 3 bytes of input at a time and produce exactly 4 characters of output. But 10, 38, and 45 aren't powers of two. You've got three options here. The obvious conversion would be to treat the input as a bigint and repeatedly divmod by 10 (or 38, etc) to generate the encoding. If you have a bigint library to hand then it almost certainly has the functions for that, but it's a rather obnoxious (and quadratic) amount of computation and a significant dependency. So you might be willing to waste a few bits to make things easier.

Next option is an encoding noted by djb that achieves similar efficiency but with less computation and no long-division. I updated this post to include it, so it's covered in a section below.

The third option is to chunk the input and convert each chunk independently. Ideal input chunks would be 8 bytes or fewer, because nearly all environments will support a uint64 type and nearly all hardware can do a divmod on them. If you're using base-10 then there's going to be a function that can “print” a uint64 to digits for you, so let's take digits as an example. With a chunk size of two bytes you would get five digits. Each digit takes 3⅓ bits of space, so 16 input bits takes 16⅔ bits: 96% efficient. Less than the 99.66% we can get with digits for sure. But if you consider all chunk sizes from one to eight bytes, turning 7-byte chunks into 17 digits is 98.82% efficient. That's pretty good for the complexity savings.

For base-38, 7-byte chunks are the best again, producing 11 characters for 92.56% efficiency. For base-45, two-byte chunks produce 3 characters for 96.97% efficiency. (Four- and eight-byte chunks do the same if you want fewer loop iterations.)

(And you should use little-endian chunks because little-endian has won, even if the IETF hasn't caught up to that yet.)

Now you've got your payload encoding sorted … probably. A wrinkle is that it's difficult to know how your QR encoder will segment what you give it. You might have crafted a wonderful base-38 input and it might stuff it all into a byte-mode segment! (68.65% efficient, worse than base64url.) I'm sadly not aware of a good QR “debugger” that shows all the details of a QR code. ZXing's online service will give a hex-dump of the raw contents, but that assumes that you can decode the segment headers. (QR-Logo promises better debugging output but doesn't work for me.) My best advice is to use ZXing on a sample QR code, ignore the 0xec, 0x11 padding pattern at the end, and calculate whether the number of bytes used roughly makes sense.

You probably want to put a URL-like prefix at the front to make your QR codes distinguishable. One thing to consider is that “https://www.example.com/” is a byte-mode segment that takes 204 bits, but “HTTPS://WWW.EXAMPLE.COM/” is alphanumeric and needs only 145 bits. (That's assuming QR versions 1 through 9, the lengths are slightly different otherwise.) DNS names are case insensitive and “an implementation should accept uppercase letters” for the scheme says RFC 3986. Maybe it just looks odd and that's not worth the bits, though?

We'll finish up with a quick look at an example, which is the thing that started me on this path in the first place: SMART Health Cards. (Thank you to the person who pointed me at them, who likely wants to remain anonymous.)

SHC's are trying to squeeze a lot of data into a QR code: they minify their JSON structure and compress it but, even then, they sometimes span multiple QR codes and the user has to scan them all. As such their contents are a) a binary segment containing “shc:/” (and maybe sequence numbers if using multiple QR codes), and then b) a digits segment containing the payload. So they didn't use “SHC:/” to save bits, but the difference is small.

One thing to note is that the QR spec (ISO/IEC 18004:2005) has a whole section on “structured append” mode, where multiple QR codes can be combined into one. But trying that with iOS and Android shows that neither support it, so probably it can be considered dead and that's why SHC is replicating the same feature at a higher level.

Another thing to note is that SHC is using digits for better efficiency, which is great, but the way that they do it is not. They're using JWT, which is bad but not today's topic, so they have three base64-encoded strings. They then take each base64 character, subtract 45, and turn that into two base-10 digits! All that work minifying JSON and compressing it, and then they throw away 10% of their bits on such a trivial thing!

So SHC did pretty well, but missed an easy win. Having read this, you can do better.

The NTRU Prime encoding

Above, I referenced an encoding that gets nearly all the space efficiency of the bigint-and-divmod method, but without the computational costs. This section is about that. It's taken from page 18 of the NTRU Prime NIST submission.

Our motivating issue is thus: if you have a whole byte then taking it mod 10 to generate a digit works fairly well. The digits 0–5 have probability 26/256 while 6–9 have probability 25/256. That's not uniform therefore it doesn't encode the maximum amount of entropy, but it's 99.992% efficient, which is pretty good. But when you have a smaller range of input values the non-uniformity becomes significant and so does the reduction in information transmitted.

The encoding in NTRU Prime takes input values (which can be larger than bytes) and combines pairs of them. It produces some output values from each pair but, once the non-uniformity gets unconfortable, it pairs up the leftovers to increase the range. This repeats in a binary-tree.

As a concrete example we'll use the Python code from page 18 and set M = [256…] (because our inputs are bytes), change the 256 and 255 to 10 and 9 (to extract digits, not bytes), and set limit to 1024. Our example input will be 419dd0ed371f44b7.

I 65 157 208 237 55 31 68 183 40257→7,5 60880→0,8 7991→1,9 46916→6,1 402/656 608/656 79/656 469/656 399250→0,5,2 307743→3,4,7 399/431 185761→6,1,7 →2,3,1 307/431 132/186

The input bytes are written (in base 10) along the top. Pairs of them are combined. Take the top-right box: its value is 157×256 + 65 = 40257. That can be considered to be 40257 mod 65536 and, since there's a reasonable number of bits in there, two digits are extracted. Obviously 40257 mod 10 = 7, and 4025 mod 10 = 5. So the two digits are 7 and 5. That leaves 402 mod 656, and 656 is below the limit of 1024 that we set, so it's passed down to be combined into the box below. This continues down the tree: each time there's too little data left to extract another digit, the leftovers are passed down to be combinined. At the bottom there's nothing else to combine with so the final leftover value, 132 mod 186, is converted into the digits 2, 3, and 1. The ultimate output digits are concatenated from left-to-right, top-to-bottom.

This achieves good encoding efficiency without repeated long-division, and can be parallelised.

ACVP

If you do not know what ACVP is then you should read no further. If you think it might be useful then what you're actually looking for is Wycheproof; ACVP is only for those who have no choice.

If you're still reading and you're vaguely aware that your previous CAVP infrastructure isn't applicable any longer, and that you'll need to deal with ACVP next time, then you might be interested in BoringSSL's ACVP infrastructure. We have a number of different FIPS modules to test and wanted something generic rather than repeating the bespoke-per-module way that we handled CAVP. We also need to test not just BoringCrypto (a software module) but also embedded devices.

The result, acvptool, lives within the BoringSSL repo and can translate ACVP JSON into a series of reasonably simple IPC calls that a “module wrapper” speaks over stdin/stdout. BoringSSL's module wrapper is the reference implementation, but there's also a tiny one for testing that could easily be repurposed to forward over a serial link, etc, for embedded devices.

It's reasonably likely that you'll find some case that's not handled, but the code is just Go throwing around JSON so you should be able to extend it without too much bother. But, for the cases that are already handled, the weird undocumented quirks that'll otherwise consume hours of your life are taken care of.

Letter to 20 years ago

I noticed that I have not posted anything here in 2020! There's a bunch of reasons for this: the work I'm doing at the moment does not lend itself so well to blog posts, and life intervenes, leaving less time for personal projects. But in order to head off the risk that I'll post nothing at all this year I pulled something from one of my notebooks. 2020 is a round number so I decided to do some reflection and this was a letter that I imagined writing to myself 20 years ago. It is very much a letter to me! The topics are going to be quite specific and if you weren't paying attention to the computing industry in the year 2000 I’m not sure how much of it will make sense. But these are the points that I think me of 20 years ago would have wondered about.

You must be thinking that computers will be crazy fast by now. Yes…ish. It's complicated, and that's going to be a theme here. You've been hearing from Intel that the NetBurst chips will hit 10GHz in a few years, so with another few doublings what will have by now? 50GHz? Actually common values are around 3.5GHz. Some reach 5GHz, but only in bursts. Intel never hit 10GHz and nor did anybody else. It’s better than it sounds: instructions per clock are up a lot, so each cycle is worth more. (Although maybe we'll get to some of the issues that caused!) More importantly, all systems are multiprocessor now. It's physically a single chip, but inside is often 8- to 32-way SMT. Yep, that's cool. And yep, it only helps for certain sorts of workloads. Multithreaded programming is not going away.

Memory? 10s of gigabytes is common. Hard drives? It's nearly all flash now. You can still buy hard drives and they’re huge and cheap, but the speed of flash is pretty sweet. Computers really are quite substantially faster — don't be too put off by the clock speeds.

Your day-to-day life is a bunch of xterms and a web browser. Nothing's changed; you are dramatically underestimating the importance of path dependence. Terminals are still emulating a fancy VT-100 and sometimes they get messed up and need a reset. No fixes there. It's still bash or zsh; nearly unchanged from your time. The kernel has been fixed a little: you can now get a handle to a process, so no more PID races. You can open a file relative to a directory descriptor and you can create an unlinked file in a directory and link it later. Yes it's good that these things are possible now, but it is not a fundamental change and it took a long time. Actually you know what? Windows grew a much smarter shell, leapfrogging Linux in several respects. They had hardly moved forward since DOS so it was easier there, perversely.

So innovation must have happened higher up where there was new ground and little legacy, right? What about the semantic web? How did that turn out? Not well. We don't have lots of data in machine-readable formats and fancy GUIs so that anyone can create automation. Information is somewhere between impossible and a huge pain to access. You’ve read The Dilbert Future by now and its ‘confusopoly’ concept is much closer to the mark. The Semantic Web stuff failed so badly that nobody even tries any longer. (I'm afraid Scott Adams won’t seem so wholesome in the future either.) The closest you'll get is that your web browser can fill out your name, address, and credit card details. And it has to work really hard to do that because there’s almost no assistance from web pages. Go find Weaving the Web and throw it away.

Something more positive: bandwidth! You are using a dial-up that tops out at 5 KB/s and charges by the minute. You use a local proxy that keeps a copy of everything so that viewed pages are available offline and it lets you mark missing pages for batch fetching to reduce the cost. This problem is now solved. You can assume that any house in a city can get an always-on, several 10s of Mb/s connection. It's not as cheap as it could be but it's a standard household expense now. (Note: past me doesn't live in the US! —agl.) Also, everyone carries an impossibly fancy PDA that has that level of connection wirelessly and everywhere. I don't need to equivocate here, connectivity is solved in the sorts of places you're likely to live. But … there's a second edge to that sword. This connectivity can be a bit … much? There are some advantages to having the internet be stationary, metered, and behind 30 seconds of banshee wailing and static. Imagine your whole social life getting run through IRC, and that you're always connected. It’s tough to explain but there's a problem. But these PDAs? They have GPS and maps. Nobody gets lost anymore. Nobody carries paper street maps in their car. Connectivity can be pretty sweet.

This next bit is going to upset you a little: the whole Palladium / trusted boot stuff never took off on the desktop, but these PDAs are pretty locked down. One type of them is completely locked down and you can’t run non-approved software. The other will make you jump through hoops and, even then, you can't access the data of other programs. On the latter sort you can install a completely custom OS most of the time, but there's attestation and some things won't cooperate. This is still playing out and people are fighting over the details (because of money, of course). It remains a concern, but you underestimate the benefits of this sort of system. Your idea that people should own their own computers because they’re critical tools isn't wrong, but it is elitist. For the vast majority of people, their desktops degrade into a fragile truce with a whole ecosystem of malware and near-malware. Maybe it's their “fault” for having installed it, but these PDAs are so popular, in part, because they're hard to screw up. Bad stuff does get through the approval process, but it cannot mess things up to the wipe-and-reinstall level that desktops reach. The jury is still out about whether we will regret this, but you're wrong about the viability of giving people Windows XP and getting a good result.

Back on a positive note: the music industry switched to a $15 a month stream-whatever-you-want model and it works fine. You were completely right about this. Music still exists and it still pays a few at the top large sums and the rest very little. The music industry itself didn't sort this out though, other companies did it for them. What you're missing is that you’re not taking things far enough: companies also did this for TV and (many) movies. There are still rips of this stuff on BitTorrent, but it's not a live issue because people pay the subscription for the ease, by and large. In fact, access to scientific papers is a hotter issue now!

Basically, rates of change are really uneven.

Real-world measurements of structured-lattices and supersingular isogenies in TLS

This is the third in a series of posts about running experiments on post-quantum confidentiality in TLS. The first detailed experiments that measured the estimated network overhead of three families of post-quantum key exchanges. The second detailed the choices behind a specific structured-lattice scheme. This one gives details of a full, end-to-end measurement of that scheme and a supersingular isogeny scheme, SIKE/p434. This was done in collaboration with Cloudflare, who integrated Microsoft's SIKE code into BoringSSL for the tests, and ran the server-side of the experiment.

Setup

Google Chrome installs, on Dev and Canary channels, and on all platforms except iOS, were randomly assigned to one of three groups: control (30%), CECPQ2 (30%), or CECPQ2b (30%). (A random ten percent of installs did not take part in the experiment so the numbers only add up to 90.) CECPQ2 is the hybrid X25519+structured-lattice scheme previously described. CECPQ2b is the name that we gave to the combination of X25519 and the SIKE/p434 scheme.

Because optimised assembly implementations are labour-intensive to write, they were only available/written for AArch64 and x86-64. Because SIKE is computationally expensive, it wasn’t feasible to enable it without an assembly implementation, thus only AArch64 and x86-64 clients were included in the experiment and ARMv7 and x86 clients did not contribute to the results even if they were assigned to one of the experiment groups.

Cloudflare servers were updated to include support for both CECPQ2 and CECPQ2b, and to support an empty TLS extension that indicated that they were part of the experiment. Depending on the experiment group, Chrome would either offer CECPQ2, CECPQ2b, or just non-post-quantum options, in its TLS 1.3 handshake, along with the signaling extension to indicate which clients were part of the control group. Measurements were taken of how long TLS handshakes took to complete using Chrome’s metrics system. Chrome knew which servers were part of the experiment because they echoed the signaling extension, thus all three groups were measuring handshake duration against the same set of servers.

After this phase of the trial was complete, client-side measurements were disabled and Chrome Canary was switched to a mode where it randomly picked one of CECPQ2, CECPQ2b, or neither to offer. This enabled some additional, server-side measurements to ensure that nothing unexpected was occuring.

(Cloudflare has a significantly more detailed write up of this experiment.)

Biases

We’re aware of a couple of biases and these need to be kept in mind when looking at the results. Firstly, since ARMv7 and x86 platforms were excluded, the population was significantly biased towards more powerful CPUs. This will make supersingular isogenies look better. Also, we’ve seen from past experiments that Canary and Dev Chrome users tend to have worse networks than the Chrome user population as a whole, and this too will tend to advantage supersingular isogenies since they require less network traffic.

Results

Here are histograms of client-side results, first from Windows (representing desktops/laptops) and from Android (representing mobile devices):

1 10 100 1000 10000 TLS handshake time (ms) TLS handshake latency (Windows) Control Control CECPQ2 CECPQ2 CECPQ2b CECPQ2b 1 10 100 1000 10000 TLS handshake time (ms) TLS handshake latency (Android) Control Control CECPQ2 CECPQ2 CECPQ2b CECPQ2b

From the histograms we can see that the CECPQ2b (SIKE) group shifts visibly to the right (i.e. slower) in both cases. (On Android, a similar but smaller shift is seen for CECPQ2.) Despite the advantages of removing the slower clients and experimenting with worse-than-usual networks, the computational demands of SIKE out-weigh the reduced network traffic. Only for the slowest 5% of connections are the smaller messages of SIKE a net advantage.

Cloudflare have a much more detailed analysis of the server-side results, which are very similar.

Conclusion

While there may be cases where the smaller messages of SIKE are a decisive advantage, that doesn’t appear to be the case for TLS, where the computational advantages of structured lattices make them a more attractive choice for post-quantum confidentiality.

Username (and password) free login with security keys

Most readers of this blog will be familiar with the traditional security key user experience: you register a token with a site then, when logging in, you enter a username and password as normal but are also required to press a security key in order for it to sign a challenge from the website. This is an effective defense against phishing, phone number takeover, etc. But modern security keys are capable of serving the roles of username and password too, so the user experience can just involve clicking a login button, pressing the security key, and (perhaps) entering a locally-validated PIN if the security key doesn't do biometrics. This is possible with the recently released Chromium 76 and also with Edge or Firefox on current versions of Windows.

On the plus side, this one-button flow frees users from having to remember and type their username and password for a given site. It also avoids sites having to receive and validate a password, potentially avoiding both having a password database (which, even with aggressively slow hashes, will leak many users' passwords if disclosed), and removing any possibility of accidentally logging the plaintext values (which both Google and Facebook have done recently). On the negative side, users will need a modern security key (or Windows Hello-enabled computer) and may still need to enter a PIN.

Which security keys count as “modern”? For most people it'll mean a series-5 black Yubikey or else a blue Yubikey that has a faint “2” printed on the upper side. Of course, there are other manufacturers who make security keys and, if it advertises “CTAP2” support, there's a good chance that it'll work too. But those Yubikeys certainly do.

In practical terms, web sites exercise this capability via WebAuthn, the same API that handles the traditional security key flow. (I'm not going to go into much detail about how to use WebAuthn. Readers wanting more introductory information can see what I've written previously or else see one of the several tutorials that come up in a Google search.)

When registering a security key for a username-free login, the important differences are that you need to make requireResidentKey true, set userVerification to required, and set a meaningful user ID.

In WebAuthn terms, a “resident” credential is one that can be discovered without knowing its ID. Generally, most security keys operate statelessly, i.e. the credential ID is an encrypted private seed, and the security key doesn't store any per-credential information itself. Thus the credential ID is required for the security key to function so the server sends a list of them to the browser during login, implying that the server already knows which user is logging in. Resident keys, on the other hand, require some state to be kept by the security key because they can be used without presenting their ID first. (Note that, while resident keys require some state to be kept, security keys are free to keep state for non-resident keys too: resident vs non-resident is all about whether the credential ID is needed.)

User verification is about whether the security key is providing one or two authentication factors. With the traditional experience, the security key is something you have and the password is something you know. In order to get rid of the password, the security key now needs to provide two factors all by itself. It's still something you have so the second security key factor becomes a PIN (something you know) or a biometric (something you are). That begs the question: what's the difference between a PIN and a password? On the surface: nothing. A security key PIN is an arbitrary string, not limited to numbers. (I think it was probably considered too embarrassing to call it a password since FIDO's slogan is “solving the world's password problem”.) So you should think of it as a password, but it is a password with some deeper advantages: firstly, it doesn't get sent to web sites, so they can't leak it and people can safely use a single password everywhere. Secondly, brute-force resistance is enforced by the hardware of the security key, which will only allow eight attempts before locking and requiring a reset. Still, it'll be nice when biometrics are common in security keys.

A user ID is an opaque identifier that should not be personally identifying. Most systems will have some database primary key that identifies a user and, if using that as a WebAuthn user ID, ensure that you encrypt it first with a key that is only used for this purpose. That way it doesn't matter if those primary keys surface elsewhere too. Security keys will only store a single credential for a given pair of domain name and user ID. So, if you register a second credential with the same user ID on the same domain, it'll overwrite the first.

The fact that you can register more than one credential for a given domain means that it's important to set the metadata correctly when creating a resident credential. This isn't unique to resident keys, but it's much more important in this context. The user name and displayName will be shown by the browser during login when there's more than one credential for a domain. Also the relying party name and displayName will be shown in interfaces for managing the contents of a security key.

When logging in, WebAuthn works as normal except you leave the list of credential IDs empty and set userVerification to required. That triggers the resident-credential flow and the resulting credential will include the user ID, with which you look up the user and their set of registered public keys, and then validate the public key and other parameters.

Microsoft have a good test site (enter any username) where you can experiment with crafting different WebAuthn requests.

Exposed credentials

In order to support the above, security keys obviously have a command that says “what credentials do you have for domain x?”. But what level of authentication is needed to run that command is a little complex. While it doesn't matter for the web, one might want to use security keys to act as, for example, door access badges; especially over NFC. In that case one probably doesn't want to bother with a PIN etc. Thus the pertinent resident credentials would have to be discoverable and exercisable given only physical presence. But in a web context, perhaps you don't want your security key to indicate that it has a credential stored for cia.gov (or mss.gov.cn) to anyone who can plug it in. Current security keys, however, will disclose whether they have a resident credential for a given domain, and the user ID and public key for that credential, to anyone with physical access. (Which is one reason why user IDs should not be identifying.) Future security keys will have a concept of a per-credential protection level which will prevent them from being disclosed without user verification (i.e. PIN or biometrics), or without knowing their random credential ID. While Chromium will configure credential protection automatically if supported, other browsers may not. Thus it doesn't hurt to set credProtect: 2 in the extensions dictionary during registration.

Zero-knowledge attestation

U2F/FIDO tokens (a.k.a. “Security Keys”) are a solid contender for doing something about the effectiveness of phishing and so I believe they're pretty important. I've written a fairly lengthy introduction to them previously and, as mentioned there, one concerning aspect of their design is that they permit attestation: when registering a key it's possible for a site to learn a cryptographically authenticated make, model, and batch. As a browser vendor who has dealt with User-Agent sniffing, and as a large-site operator, who has dealt with certificate pervasiveness issues, that's quite concerning for public sites.

It's already the case that one significant financial site has enforced a single-vendor policy using attestation (i.e. you can only register a token made by that vendor). That does not feel very congruent with the web, where any implementation that follows the standards is supposed to be a first-class citizen. (Sure, we may have undermined that with staggering levels of complexity, but that doesn't discredit the worth of the goal itself.)

Even in cases where a site's intended policy is more reasonable (say, they want to permit all tokens with some baseline competence), there are strong grounds for suspecting that things won't turn out well. Firstly, the policies of any two sites may not completely align, leading to a crappy user experience where a user needs multiple tokens to cover all the sites that they use, and also has to remember which works where. Secondly, sites have historically not been so hot about staying up-to-date. New token vendors may find themselves excluded from the market because it's not feasible to get every site to update their attestation whitelists. That feels similar to past issues with User-Agent headers but the solution there was to spoof other browsers. Since attestation involves a cryptographic signature, that answer doesn't work here.

So the strong recommendation for public sites is not to request attestation and not to worry about it. The user, after all, has control of the browser once logged in, so it's not terribly clear what threats it would address.

However, if we assume that certain classes of sites probably are going to use attestation, then users have a collective interest in those sites enforcing the same, transparent standard, and in them keeping their attestation metadata current. But without any impetus towards those ends, that's not going to happen. Which begs the question: can browsers do something about that?

Ultimately, in such a world, sites only operate on a single bit of information about any registration: was this public-key generated in a certified device or not? The FIDO Alliance wants to run the certification process, so then the problem reduces down to providing that bit to the site. Maybe they would simply trust the browser to send it: the browser could keep a current copy of the attestation metadata and tell the site whether the device is certified or not. I don't present that as a straw-man: if the site's aim is just to ensure that the vast majority of users aren't using some backdoored token that came out of a box of breakfast cereal then it might work, and it's certainly simple for the site.

But that would be a short blog post, and I suspect that trusting the browser probably wouldn't fly in some cases.

So what we're looking for is something like a group signature scheme, but we can't change existing tokens. So we need to retrospectively impose a group signature on top of signers that are using vanilla P-256 ECDSA.

Zero-knowledge proofs

It is a surprising but true result in cryptography that it's possible to create a convincing proof of any statement in NP that discloses nothing except the truth of the statement. As an example of such a statement, we might consider “I know a valid signature of message x from one of the public keys in this set”. That's a pretty dense couple of sentences but rather than write an introduction to zero-knowledge proofs here, I'm going to refer you to Matthew Green's posts[1][2]. He does a better job than I would.

I obviously didn't pick that example at random. If there was a well-known set of acceptable public keys (say, as approved by the FIDO Alliance) then a browser could produce a zero-knowledge proof that it knew a valid attestation signature from one of those keys, without disclosing anything else, notably without disclosing which public key was used. That could serve as an “attestation valid” bit, as hypothesised above, that doesn't require trusting the browser.

As a concrete instantiation of zero-knowledge proofs for this task, I'll be using Bulletproofs [BBBPWM17]. (See zkp.science for a good collection of many different ZK systems. Also, dalek-cryptography have excellent notes on Bulletproofs; Cathie Yun and Henry de Valence from that group were kind enough to help me with a question about Bulletproofs too.)

The computational model for Bulletproofs is an arithmetic circuit: an acyclic graph where public and secret inputs enter and each node either adds or multiplies all its inputs. Augmenting that are linear constraints on the nodes of the circuit. In the tool that I wrote for generating these circuits, this is represented as a series of equations where the only operations are multiplication, addition, and subtraction. Here are some primitives that hopefully convince you that non-trivial functions can be built from this:

PP-256

Using single bit values in an arithmetic circuit certainly works, but it's inefficient. Getting past single-bit values, the arithmetic circuits in Bulletproofs don't work in ℤ (i.e. arbitrary-length integers), rather they work over a finite field. Bulletproofs are built on top of an elliptic curve and the finite field of the arithmetic circuit is the scalar field of that curve.

When dealing with elliptic curves (as used in cryptography) there are two finite fields in play: the x and y coordinates of the points on the curve are in the coordinate field of the curve. Multiples of the base point (B) then generate a prime number (n) of points in the group before cycling back to the base point. So xB + yB = (x + y mod n)B — i.e. you can reduce the multiple mod n before multiplying because it'll give the same result. Since n is prime, reduction mod n gives a field, the scalar field.

(I'm omitting powers of primes, cofactors, and some other complications in the above, but it'll serve.)

So Bulletproofs work in the scalar field of whatever elliptic curve they're implemented with, but we want to build P-256 ECDSA verification inside of a Bulletproof, and that involves lots of operations in P-256's coordinate field. So, ideally, the Bulletproofs need to work on a curve whose scalar field is equal to P-256's coordinate field. Usually when generating a curve, one picks the coordinate field to be computationally convenient, iterates other parameters until the curve meets standard security properties, and the scalar field is whatever it ends up as. However, after some quality time with “Constructing elliptic curves of prime order” (Broker & Stevenhagen) and Sage, we find that y² = x³ - 3x + B over GF(PP) where:

… gives a curve with the correct number of points, and which seems plausibly secure based on the SafeCurves criteria. (A more exhaustive check would be needed before using it for real, but it'll do for a holiday exploration.) Given its relationship to P-256, I called it “PP-256” in the code.

ECDSA verification

Reviewing the ECDSA verification algorithm, the public keys and message hash are obviously public inputs. The r and s values that make up the signature cannot be both be public because then the verifier could just try each public key and find which one generated the signature. However, one of r and s can be public. From the generation algorithm, r is the x-coordinate of a random point and s is blinded by the inverse of the nonce. So on their own, neither r nor s disclose any information and so can just be given to the verifier—moving work outside of the expensive zero-knowledge proof. (I'm not worrying about tokens trying to use a covert channel here but, if you do worry about that, see True2F.)

If we disclose s to the verifier directly then what's left inside the zero-knowledge proof is 1) selecting the public key; 2) checking that the secret r is in range; 3) u₂ = r/s mod n; 4) scalar-multiplication of the public key by u₂; 5) adding in the (now) public multiple of the base point; and 6) showing that the x-coordinate of resulting point equals the original r, mod n.

The public-key is a 4-tooth comb, which is a precomputed form that speeds up scalar multiplications. It consists of 30 values. The main measure that we want to minimise in the arithmetic circuit is the number of multiplications where both inputs are secret. When selecting from t possible public keys the prover supplies a secret t-bit vector where only one of the bits is set. The proof shows that each value is, indeed, either zero or one using IsBit (from above, at a cost of one multiply per bit), and that exactly one bit is set by requiring that the sum of the values equals one. Each of the 30t public-key values is multiplied by one of the bits and summed to select exactly one key.

Rather than checking that the secret r is within [0, n-1], which would cost 512 multiplies, we just check that it's not equal to zero mod n. That's the important condition here since an out of range r is otherwise just an encoding error. Showing that a number is not zero mod n just involves showing that it's not equal to zero or n, as 2n is outside of the arithmetic circuit field. Proving a ≠ b is easy: the prover just provides an inverse for a - b (since zero doesn't have an inverse) and the proof shows that (a - b) × (a - b)⁻¹ = 1.

Calculating r/s mod n is the most complex part of the whole proof! Since the arithmetic circuit is working mod P-256's p, working mod n (which is the order of P-256—slightly less than p) is awkward. The prover gives bit-wise breakdown of r; the proof does the multiplication as three words of 86, 86, and 84 bits; the prover supplies the values for the carry-chain (since bit-shifts aren't a native operation in the arithmetic circuit); the prover then gives the result in the form a×n + b, where b is a 256-bit number; and the proof does another multiplication and carry-chain to check that the results are equal. All for a total cost of 2152 multiplication nodes!

After that, the elliptic curve operation itself is pretty easy. Using the formulae from “Complete addition formulas for prime order elliptic curves” (Renes, Costello, and Batina) it takes 5365 multiplication nodes to do a 4-tooth comb scalar-mult with a secret scalar and a secret point. Then a final 17 multiplication nodes add in the public base-point multiple, supply the inverse to convert to affine form, and check that the resulting x-coordinate matches the original r value. The circuit does not reduce the x-coordinate mod n in order to save work: for P-256, that means that around one in 2¹²⁸ signatures may be incorrectly rejected, but that's below the noise floor of arithmetic errors in CPUs. Perhaps if this were to be used in the real world, that would be worth doing correctly, but I go back to work tomorrow so I'm out of time.

In total, the full circuit contains 7534 multiplication nodes, 2154 secret inputs, and 17 236 constraints.

(Pratyush Mishra points out that affine formulae would be more efficient than projective since inversion is cheap in this model. Oops!)

Implementation

My tool for generating the matrices that Bulletproofs operate on outputs 136KB of LZMA-compressed data for the circuit described above. In some contexts, that amount of binary size would be problematic, but it's not infeasible. There is also quite a lot of redundancy: the data includes instructions for propagating the secret inputs through the arithmetic circuit, but it also includes matrices from which that information could be derived.

The implementation is based on BoringSSL's generic-curve code. It doesn't even use Shamir's trick for multi-scalar multiplication of curve points, it doesn't use Montgomery form in a lot of places, and it doesn't use any of the optimisations described in the Bulletproofs paper. In short, the following timings are extremely pessimistic and should not be taken as any evidence about the efficiency of Bulletproofs. But, on a 4GHz Skylake, proving takes 18 seconds and verification takes 13 seconds. That's not really practical, but there is a lot of room for optimisation and for multiple cores to be used concurrently.

The proof is 70 450 bytes, dominated by the 2154 secret-input commitments. That's not very large by the standards of today's web pages. (And Dan Boneh points out that I should have used a vector commitment to the secret inputs, which would shrink the proof down to just a few hundred bytes.)

Intermediates and FIDO2

One important limitation of the above is that it only handles one level of signatures. U2F allows an intermediate certificate to be provided so that only less-frequently-updated roots need to be known a priori. With support for only a single level of signatures, manufacturers would have to publish their intermediates too. (But we already require that for the WebPKI.)

Another issue is that it doesn't work with the updated FIDO2 standard. While only a tiny fraction of Security Keys are FIDO2-based so far, that's likely to increase. With FIDO2, the model of the device is also included in the signed message, so the zero-knowledge proof would also have to show that a SHA-256 preimage has a certain structure. While Bulletproofs are quite efficient for implementing elliptic curves, a binary-based algorithm like SHA-256 is quite expensive: the Bulletproofs paper notes a SHA-256 circuit using 25 400 multiplications. There may be a good solution in combining different zero-knowledge systems based on “Efficient Zero-Knowledge Proof of Algebraic and Non-Algebraic Statements with Applications to Privacy Preserving Credentials” (Chase, Ganesh, Mohassel), but that'll have to be future work.

Happy new year.

CECPQ2

CECPQ1 was the experiment in post-quantum confidentiality that my colleague, Matt Braithwaite, and I ran in 2016. It's about time for CECPQ2.

I've previously written about the experiments in Chrome which lead to the conclusion that structured lattices were likely the best area in which to look for a new key-exchange mechanism at the current time. Thanks to the NIST process we now have a great many candidates to choose from in that space. While this is obviously welcome, it also presents a problem: the fitness space of structured lattices looks quite flat so there's no obviously correct choice. Would you like keys to be products (RLWE) or quotients (NTRU; much slower key-gen, but subsequent operations are faster; older, more studied)? Do you want the ring to be NTT-friendly (fast multiplication, but more structure), or to have just a power-of-two modulus (easy reduction), or to have as little structure as possible? What noise profile and failure probability? Smart people can reasonably end up with different preferences.

This begs the question of why do CECPQ2 now at all? In some number of years NIST will eventually whittle down the field and write standards. Adrian Stanger of the NSA said at CRYPTO this year that the NSA is looking to publish post-quantum standards around 2024, based on NIST's output. (And even said that they would be pure-PQ algorithms, not combined with an elliptic-curve operation as a safeguard.) So if we wait five years things are likely to be a lot more settled.

Firstly, you might not be happy with the idea of waiting five years if you believe Michele Mosca's estimate of a one sixth chance of a large quantum computer in ten years. More practically, as we sail past the two year mark of trying to deploy TLS 1.3, another concern is that if we don't exercise this ability now we might find it extremely difficult to deploy any eventual design.

TLS 1.3 should have been straightforward to deploy because the TLS specs make accommodations for future changes. However, in practice, we had to run a series of large-scale experiments to measure what patterns of bytes would actually weave through all the bugs in the TLS ecosystem. TLS 1.3 now has several oddities in the wire-format that exist purely to confuse various network intermediaries into working. Even after that, we're still dealing with issues. Gallingly, because we delayed our server deployment in order to ease the client deployment, we're now having to work around bugs in TLS 1.3 client implementations that wouldn't have been able to get established had we quickly and fully enabled it.

The internet is very large and it's not getting any easier to steer. So it seems dangerous to assume that we can wait for a post-quantum standard and then deploy it. Any CECPQ2 is probably, in the long-term, destined to be replaced. But by starting the deployment now it can hopefully make that replacement viable by exercising things like larger TLS messages. Also, some practical experience might yield valuable lessons when it comes to choosing a standard. If the IETF had published the TLS 1.3 RFC before feedback from deployment, it would have been a divisive mess.

CECPQ2 Details

At the time of CECPQ1, the idea of running both a post-quantum and elliptic-curve primitive concurrently (to ensure that even if the post-quantum part was useless, at least the security wasn't worse than before) wasn't universally embraced. But we thought it important enough to highlight the idea in the name: combined elliptic-curve and post-quantum. It's a much more widely accepted idea now and still makes sense, and the best choice of elliptic-curve primitive hasn't changed, so CECPQ2 is still a combination with X25519.

As for the post-quantum part, it's based on the HRSS scheme of Hülsing, Rijneveld, Schanck, and Schwabe. This is an instantiation of NTRU, the patent for which has expired since we did CECPQ1. (The list of people who've had a hand in NTRU is quite long. See the Wikipedia page for a start.)

Last year Saito, Xagawa, and Yamakawa (SXY) published a derivative of HRSS with a tight, QROM-proof of CCA2 security from an assumption close to CPA security. It requires changes (that slow HRSS down a little), but HRSS+SXY is currently the basis of CECPQ2. Since HRSS+SXY no longer requires an XOF, SHAKE-128 has been replaced with SHA-256.

Having said that it's hard to choose in the structured lattice space, obviously HRSS is a choice and there were motivations behind it:

  1. CCA2-security is worthwhile, even though TLS can do without. CCA2 security roughly means that a private-key can be used multiple times. The step down is CPA security, where a private-key is only safe to use once. NewHope, used in CECPQ1, was only CPA secure and that worked for TLS since its confidentiality keys are ephemeral. But CPA vs CCA security is a subtle and dangerous distinction, and if we're going to invest in a post-quantum primitive, better it not be fragile.
  2. Avoiding decryption failures is attractive. Not because we're worried about unit tests failing (hardware errors set a noise floor for that anyway), but because the analysis of failure probabilities is complex. In the time since we picked HRSS, a paper has appeared chipping away at these failures. Eliminating them simplifies things.
  3. Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused, this is interesting as long as the key-generation speed is still reasonable for TLS.
  4. NTRU has a long history. In a space without a clear winner, that's a small positive.

CECPQ2 will be moving slowly: It depends on TLS 1.3 and, as mentioned, 1.3 is taking a while. The larger messages may take some time to deploy if we hit middlebox- or server-compatibility issues. Also the messages are currently too large to include in QUIC. But working though these problems now is a lot of the reason for doing CECPQ2—to ensure that post-quantum TLS remains feasible.

Lastly, I want to highlight that this only addresses confidentiality, not authenticity. Confidentiality is more pressing since it can be broken retrospectively, but it's also much easier to deal with in TLS because it's negotiated independently for every connection. Post-quantum authenticity will be entangled with the certificate and CA ecosystem and thus will be significantly more work.

Post-quantum confidentiality for TLS

In 2016, my colleague, Matt Braithwaite, ran an experiment in Google Chrome which integrated a post-quantum key-agreement primitive (NewHope) with a standard, elliptic-curve one (X25519). Since that time, the submissions for the 1st round of NIST’s post-quantum process have arrived. We thus wanted to consider which of the submissions, representing the new state of the art, would be most suitable for future work on post-quantum confidentiality in TLS.

A major change since the 2016 experiment is the transition from TLS 1.2 to TLS 1.3 (a nearly-final version of which is now enabled by default in Chrome). This impacts where in the TLS handshake the larger post-quantum messages appear:

In TLS 1.2, the client offers possible cipher suites in its initial flow, the server selects one and sends a public-key in its reply, then the client completes the key-agreement in its second flow. With TLS 1.3, the client offers several possible public-keys in its initial flow and the server completes one of them in its reply. Thus, in a TLS 1.3 world, the larger post-quantum keys will be sent to every TLS server, whether or not they’ll use it.

(There is a mechanism in TLS 1.3 for a client to advertise support for a key-agreement but not provide a public-key for it. In this case, if the server wishes to use that key-agreement, it replies with a special message to indicate that the client should try again and include a public-key because it’ll be accepted. However, this obviously adds an extra round-trip and we don’t want to penalise more secure options—at least not if we want them adopted. Therefore I’m assuming here that any post-quantum key agreement will be optimistically included in the initial message. For a diagram of how this might work with a post-quantum KEM, see figure one of the Kyber paper. Also I'm using the term “public-key” here in keeping with the KEM constructions of the post-quantum candidates. It's not quite the same as a Diffie-Hellman value, but it's close enough in this context.)

In order to evaluate the likely latency impact of a post-quantum key-exchange in TLS, Chrome was augmented with the ability to include a dummy, arbitrarily-sized extension in the TLS ClientHello. To be clear: this was not an implementation of any post-quantum primitive, it was just a fixed number of bytes of random noise that could be sized to simulate the bandwidth impact of different options. It was included for all versions of TLS because this experiment straddled the enabling of TLS 1.3.

Post quantum families

Based on submissions to NIST’s process, we grouped likely candidates into three “families” of algorithms, based on size: supersingular isogenies (SI), structured lattices (SL), and unstructured lattices (UL). Not all submissions fit into these families, of course: in some cases the public-key + ciphertext size was too large to be viable in the context of TLS, and others were based on problems that we were too unfamiliar with to be able to evaluate.

As with our 2016 experiment, we expect any post-quantum algorithm to be coupled with a traditional, elliptic-curve primitive so that, if the post-quantum component is later found to be flawed, confidentiality is still protected as well as it would otherwise have been.

This led to the following, rough sizes for evaluation:

  1. Supersingular isogenies (SI): 400 bytes
  2. Structured lattices (SL): 1 100 bytes
  3. Unstructured lattices (UL): 10 000 bytes

Incompatibilities with unstructured lattices

Before the experiment began, in order to establish some confidence that TLS connections wouldn’t break, the top 2 500 sites from the Alexa list were probed to see whether large handshakes caused problems. Unfortunately, the 10 000-byte extension caused 21 of these sites to fail, including common sites such as godaddy.com, linkedin.com, and python.org.

Oddly enough, reducing the size of the extension to 9 999 bytes reduced that number to eight, but linkedin.com was still included. These remaining eight sites seem to reject ClientHello messages larger than 3 970 bytes.

This indicated that widespread testing of a 10 000-byte extension was not viable. Thus the unstructured lattice configuration was replaced with a 3 300-byte “unstructured lattice stand-in” (ULS). This size was chosen to be about as large as we could reasonably test (although four sites still broke.

Phase One

In February 2018, a fraction of Chrome installs (Canary and Dev channels only) were configured to enable the dummy extension in one of four sizes, with the configuration randomly selected, per install, from:

  1. Control group: no extension sent
  2. Supersingular isogenies (SI): 400 bytes
  3. Structured lattices (SL): 1 100 bytes
  4. Unstructured lattice standing (ULS): 3 300 bytes

We measured TLS handshake latency at the 50th and 95th percentile, split out by mobile / non-mobile and by whether the handshake was a full or resumption handshake.

Given that we measured a stand-in for the UL case, we needed to extrapolate from that to an estimate for the UL latency. We modeled a linear cost of each additional byte in the handshake, 66 bytes of overhead per packet, and an MTU around 1 500 bytes. The number of packets for the UL case is the same for MTU values around 1 500, but the number of packets for the SL case is less certain: a session ticket could push the ClientHello message into two packets there. We chose to model the SL case as a single packet for the purposes of estimating the UL cost.

We also ignored the possibility that the client’s initial congestion window could impact the UL case. If that were to happen, a UL client would have to wait an additional round-trip, thus our UL estimates might be optimistic.

Despite the simplicity of the model, the control, SI, and SL points for the various configurations are reasonably co-linear under it and we draw the least-squares line out to 10 000 bytes to estimate the UL latency.

Configuration Additional latency over control group
SI SL UL (estimated)
Desktop, Full, Median 4.0% 6.4% 71.2%
Desktop, Full, 95% 4.7% 9.6% 117.0%
Desktop, Resume, Median 4.3% 12.5% 118.6%
Desktop, Resume, 95% 5.2% 17.7% 205.1%
Mobile, Full, Median -0.2% 3.4% 34.3%
Mobile, Full, 95% 0.5% 7.2% 110.7%
Mobile, Resume, Median 0.6% 7.2% 66.7%
Mobile, Resume, 95% 4.2% 12.5% 149.5%

(The fact that one of the SI values came out implausibly negative should give some feeling for the accuracy of the results—i.e. the first decimal place is probably just noise, even at the median.)

As can be seen, the estimated latency overhead of unstructured lattices is notably large in TLS, even without including the effects of the initial congestion window. For this reason, we feel that UL primitives are probably not preferable for real-time TLS connections.

It is also important to note that, in a real deployment, the server would also have to return a message to the client. Therefore the transmission overhead, when connecting to a compatible server, would be doubled. This phase of the experiment experiment did not take that into account at all. Which leads us to…

Phase Two

In the second phase of the experiment we wished to measure the effect of actually negotiating a post-quantum primitive, i.e. when both client and server send post-quantum messages. We again added random noise of various sizes to simulate the bandwidth costs of this. Cloudflare and Google servers were augmented to echo an equal amount of random noise back to clients that included it.

In this phase, latencies were only measured when the server echoed the random noise back so that the set of measured servers could be equal between control and experiment groups. This required that the control group send one byte of noise (rather than nothing).

Since unstructured lattices were not our focus after the phase one results, there were only three groups in phase two: one byte (control), 400 bytes (SI), and 1100 bytes (SL).

Unlike phase one, we did not break the measurements out into resumption / full handshakes this time. We did that in phase one simply because that’s what Chrome measures by default. However, it begs the question of what fraction of handshakes resume, since that value is needed to correctly weigh the two numbers. Thus, in phase two, we simply measured the overall latency, with resumption (or not) implicitly included.

Here are the latency results, this time in milliseconds in order to reason about CPU costs below:

Configuration Additional latency over control group (ms)
SI SL
Mobile, Median 3.5 9.6
Mobile, 95% 18.4 159.0
Desktop, Median 2.6 5.5
Desktop, 95% 19.2 136.9

So far we’ve only talked about families of algorithms but, to analyse these results, we have to be more specific. That’s easy in the case supersingular isogenies (SI) because there is only one SI-based submission to NIST: SIKE.

For SIKEp503, the paper claims an encapsulation speed on a 3.4GHz Skylake of 16 619 kilocycles, or 4.9ms of work on a server. They also report on an assembly implementation for Aarch64, which makes a good example client. There the key generation + decapsulation time is 86 078 kilocycles, or 43ms on their 2GHz Cortex-A79.

On the other hand, structured lattices (SL) are much faster. There are several candidates here but most have comparable speed, and it’s under a 10th of a millisecond on modern Intel chips.

Therefore, if computation were included, the SI numbers above would have 48ms added to them. That happens to makes SL clearly faster at the median, although it does not close the gap at the 95% level. (The SI key-generation could be amortised across connections, at the cost of code complexity and maybe some forward security. Amortising key-generation across 16 connections results in 28ms of computation per connection, which doesn’t affect the argument.)

We had previously deployed CECPQ1 (X25519 + NewHope) in TLS, which should be comparable to SL measurements, and gotten quite different results: we found a 1ms increase at the median and only 20ms at 95%. (And CECPQ1 took roughly 1 900 bytes.)

On a purely technical level, CECPQ1 was based on TLS 1.2 and therefore increased the sizes of the server’s first flow and the client’s second flow. It’s possible that bytes in the client’s second flow are less expensive, but much more significantly, TLS 1.2 does not do a fresh key-agreement for a resumption handshake. Therefore all the successful resumptions had no overhead with CECPQ1, but with TLS 1.3, they would. Going back to historical Chrome statistics from the time of the CECPQ1 experiment, we estimate that about 50% of connections would have been resumptions, which closes the gap.

Additionally, there may be a significant difference in Chrome user populations between the experiments. CECPQ1 went to Chrome Stable (non-mobile only) while the post-quantum measurements here have been done on Dev and Canary channels. It is the case that, for all non-mobile TLS connections (completely unrelated to any post-quantum experiments), the population of Dev and Canary channel users see higher TLS latencies and the effect is larger than the SI or SL effect measured above. Thus the user population for these experiments might have had poorer internet connections than the population for CECPQ1, exaggerating the costs of extra handshake bytes.

Apart from CPU time, a second major consideration is that we are not comparing like with like. Our 400-byte figure for SI maps to a NIST “category one” strength, while 1 100 bytes for SL is roughly in line with “category three”. A comparable, “category one” SL primitive comes in at more like 800 bytes, while a “category three” SI is much slower.

(Although there is a recent paper suggesting that SIKEp503 might be more like category three after all. So maybe this point is moot, but such gyrations highlight that SI is a relatively new field of study.)

Lastly, and much more qualitatively, we have had many years of dealing with subtle bugs in elliptic-curve implementations. The desire for performance pushes for carefully-optimised implementations, but missed carries, or other range violations, are typically impossible for random tests to find. After many years, we are at the point where we are starting to use formally-synthesised code for elliptic-curve field operations but SI would take us deeper into the same territory, likely ensuring that the pattern of ECC bugs continues into the future. (As Hamburg says in his NIST submission, “some cryptographers were probably hoping never to see a carry chain again”.)

On the other hand, SL designs are much more like symmetric primitives, using small fields and bitwise operations. While it’s not impossible to botch, say, AES, my experience is that the defect rates in implementations of things like AES and SHA-256 is dramatically lower, and this is strong argument for SL over SI. Combining post-quantum primitives with traditional ECC is a good defense against bugs, but we should still be reticent to adopt potentially fragile constructs.

(Not all SL designs have this benefit, mind you. Hamburg, quoted above, uses ℤ/(23120 - 21560 - 1) in ThreeBears and lattice schemes for other primitives may need operations like discrete Gaussian sampling. But it holds for a number of SL-based KEMs.)

Thus the overall conclusion of these experiments is that post-quantum confidentiality in TLS should probably be based on structured lattices although there is still an open question around QUIC, for which exceeding a single packet is problematic and thus the larger size of SL is more significant.

Security Keys

Introduction

Predictions of, and calls for, the end of passwords have been ringing through the press for many years now. The first instance of this that Google can find is from Bill Gates in 2004, although I suspect it wasn’t the first.

None the less, the experience of most people is that passwords remain a central, albeit frustrating, feature of their online lives.

Security Keys are another attempt address this problem—initially in the form of a second authentication factor but, in the future, potentially as a complete replacement. Security Keys have gotten more traction than many other attempts to solve this problem and this post exists to explain and, to some extent, advocate for them to a technical audience.

Very briefly, Security Keys are separate pieces of hardware capable of generating public/private key pairs and signing with them. By being separate, they can hopefully better protect those keys than a general purpose computer can, and they can be moved between devices to serve as a means of safely authorising multiple devices. Most current examples attach via USB, but NFC and Bluetooth devices also exist.

Contrasts with existing solutions

Security Keys are not the first attempt at solving the problems of passwords, but they do have different properties than many of the other solutions.

One common form of second factor authentication is TOTP/HOTP. This often takes the form of an app on the user’s phone (e.g. Google Authenticator) which produces codes that change every minute or so. It can also take the form of a token with an LCD display to show such codes (e.g. RSA SecurID).

These codes largely solve the problem of password reuse between sites as different sites have different seed values. Thus stealing the password (and/or seed) database from one site no longer compromises accounts at other sites, as is the case with passwords.

However, these codes are still phishable: the user may be tricked into entering their password and code on a fake site, which can promptly forward them to the real site and impersonate the user. The codes may also be socially engineered as they can be read over the phone etc by a confused user.

Another common form of second factor authentication are SMS-delivered codes. These share all the flaws of HOTP/TOTP and add concerns around the social engineering of phone companies to redirect messages and, in extreme cases, manipulation of the SS7 network.

Lastly, many security guides advocate for the use of password managers. These, if used correctly, can also solve the password reuse problem and significantly help with phishing, since passwords will not be auto-filled on the wrong site. Thus this is sound advice and, uniquely amongst the solutions discussed here, can be deployed unilaterally by users.

Password managers, however, do not conveniently solve the problem of authenticating new devices, and their automated assistance is generally limited to a web context. They also change a password authentication from an (admittedly weak) verification of the user, to a verification of the device; an effect which has provoked hostility and counter-measures from relying parties.

In light of this, Security Keys should be seen as a way to improve upon, and exceed the abilities of, password managers in a context where the relying party is cooperating and willing to make changes. Security Keys are unphishable to a greater extent than password managers because credentials are bound to a given site, plus it’s infeasible to socially engineer someone to read a binary signature over the phone. Also, like TOTP/HOTP, they use fresh credentials for each site so that no reuse is possible. Unlike password managers, they can work outside of a web context and they can serve to authenticate new devices.

They aren’t magic, however. The unphishability of Security Keys depends on the extent to which the user may be mislead into compromising other aspects of their device. If the user can be tricked into installing malware, it could access the Security Key and request login signatures for arbitrary sites. Also, malware may compromise the user’s login session in a browser after successfully authenticating with a Security Key. Still, that’s a heck of a lot better than the common case of people using the same password across dozens of sites.

All the different terms

There is a lot of terminology specific to this topic. The first of which I’ve already used above: “relying parties”. This term refers to any entity trying to authenticate a user. When logging into a website, for example, the website is the relying party.

The FIDO Alliance is a group of major relying parties, secure token manufacturers, and others which defines many of the standards around Security Keys. The term that FIDO uses for Security Keys is “Universal 2nd factor” (U2F) so you’ll often see “U2F security key” used—it’s talking about the same thing. The terms “authenticator” and “token” are also often used interchangeably to refer to these devices.

At the time of writing, all Security Keys are based version one of FIDO’s “Client To Authenticator Protocol” (CTAP1). This protocol is split between documentation of the core protocol and separate documents that describe how the core protocol is transported over USB, NFC, and Bluetooth.

FIDO also defines a U2F Javascript API for websites to be able to interact with and use Security Keys. However, no browser ever implemented that API prior to a forthcoming (at the time of writing) version of Firefox.

But sites have been able to use Security Keys with Google Chrome for some years because Chrome ships with a hidden, internal extension through which the U2F API can be implemented with a Javascript polyfill, which Google also provides. (Extensions for Firefox were also available prior to native support in that browser.)

Thus all sites which supported Security Keys prior to 2018 used some polyfill in combination with Chrome’s internal extension, or one of the Firefox extensions, to do so.

The FIDO Javascript API is not the future, however. Instead, the W3C is defining an official Web Authentication standard for Security Keys, which is commonly called by its short name “webauthn”. This standard is significantly more capable (and significantly more complex) than the U2F API but, by the end of 2018, it is likely that all of Edge, Chrome, and Firefox will support it by default.

The webauthn standard has been designed to work with existing (CTAP1-based) devices, but FIDO is working on an updated standard for tokens, CTAP2, which will allow them to take advantage of the new capabilities in webauthn. (The standards were co-developed so it’s equally reasonable to see it from the other direction and say that webauthn allows browsers to take advantage of the new capabilities in CTAP2.)

There are no CTAP2 devices on the market yet but their major distinguishing feature will be that they can be used as a 1st (and only) factor. I.e. they have enough internal storage that they can contain a username and so both provide an identity and authenticate it. This text will mostly skim over CTAP2 since the devices are not yet available. But developers should keep it in mind when dealing with webauthn as it explains many, otherwise superfluous, features in that standard.

CTAP1 Basics

Since all current Security Keys use CTAP1, and webauthn is backwards compatible with it, understanding CTAP1 is pretty helpful for understanding the space in general. Here I’ll include some Python snippets for communicating with USB CTAP1 devices to make things concrete, although I’ll skip over everything that deals with framing.

CTAP1 defines two operations: creating a new key, and signing with an existing key. I’ll focus on them in turn.

Creating a new key

This operation is called “registration“ in CTAP1 terminology and it takes two, 32-byte arguments: a “challenge” and an “application parameter”. From the point of view of the token these arguments are opaque byte strings, but they’re intended to be hashes and the hash function has to be SHA-256 if you want to interoperate.

The challenge argument, when used with a web browser, ends up being the hash of a JSON-encoded structure that includes a random nonce from the relying party as well as other information. This nonce is intended to prove freshness: if it was signed by the newly generated key then the relying party could know that the key really was fresh and that this was the only time it had been registered. Unfortunately, CTAP1 doesn’t include any self-signature (and CTAP2 devices probably won’t either). Instead the situation is a lot more complex, which we’ll get to.

The application parameter identifies a relying party. In U2F it’s a hash of the origin (e.g. SHA-256(“https://example.com”)) while in webauthn it’s a hash of the domain (e.g. SHA-256(“example.com”)). As we’ll see, the signing operation also takes an application parameter and the token checks that it’s the same value that was given when the key was created. A phishing site will operate on a look-alike domain, but when the browser hashes that domain, the result will be different. Thus the application parameter sent to the token will be different and the token will refuse to allow the key to be used. Thus keys are bound to specific origins (or, with webauthn, domains) and cannot be used outside of that context.

Here’s some sample code that’ll shine some light on other aspects of the protocol, including the outputs from key creation:

(The full source is available if you want to play with it, although it’ll only work on Linux.)

The first thing to note is that the operation runs in a loop. CTAP1 devices require a “user presence” test before performing operations. In practice this means that they’ll have a button or capacitive sensor that you have to press. While the button/sensor isn’t triggered, operations return a specific error code and the host is expected to retry the operation after a brief delay until it succeeds.

A user-presence requirement ensures that operations cannot happen without a human being physically present. This both stops silent authentication (which could be used to track people) and it stops malware from silently proxying requests to a connected token. (Although it doesn’t stop malware from exploiting a touch that the user believes is authorising a legitimate action.)

Once the operation is successful, the response can be parsed. In the spirit of short example code everywhere, errors aren’t checked, so don’t use this code for real.

Key generation, of course, produces a public key. For CTAP1 tokens, that key will always be an uncompressed, X9.62-encoded, ECDSA P-256 public key. That encoding happens to always be 65 bytes long.

After that comes the key handle. This is an opaque value that the token uses to identify this key, and this evinces another important facet of CTAP1: the tokens are nearly stateless in practice.

In theory, the key handle could be a small identifier for a key which is stored on the device. In practice, however, the key handle is always an encrypted version of the private key itself (or a generating seed). This allows storage of the keys to be offloaded to the relying party, which is important for keeping token costs down.

This also means that CTAP1 tokens cannot be used without the user entering a username. The relying party has to maintain a database of key handles, and that database has to be indexed by something. This is changing with CTAP2, but I’ll not mention that further until CTAP2 tokens are being commercially produced.

Lastly, there’s the attestation certificate (just one; no chain) and a signature. As I mentioned above, the signature is sadly not from the newly created key, but from a separate attestation private key contained in the token. This’ll be covered in a later section

Signing with a key

Once a key has been created, we can ask the token to sign with it.

Again, a “challenge” and “application parameter” have to be given, and they have the same meaning and format as when creating a key. The application parameter has to be identical to the value presented when the key was created otherwise the token will return an error.

The code looks very similar:

The same pattern for waiting for a button press is used: the token is polled and returns an error until the button/sensor is pressed.

Three values are returned: a flag confirming that the button was pressed, a signature counter, and the signature itself—which signs over the challenge, application parameter, flags, and counter.

In the web context, the challenge parameter will be the hash of a JSON structure again, and that includes a nonce from the relying party. Therefore, the relying party can be convinced that the signature has been freshly generated.

The signature counter is a strictly monotonic counter and the intent is that a relying party can record the values and so notice if a private key has been duplicated, as the strictly-monotonic property will eventually be violated if multiple, independent copies of the key are used.

There are numerous problems with this, however. Firstly, recall that CTAP1 tokens have very little state in order to keep costs down. Because of that, all tokens that I’m aware of have a single, global counter shared by all keys created by the device. (The only exception I’ve seen is the Bluink key because it keeps state on a phone.) This means that the value and growth rate of the counter is a trackable signal that’s transmitted to all sites that the token is used to login with. For example, the token that I’m using right now has a counter of 431 and I probably use it far more often than most because I’m doing things like writing example Python code to trigger signature generation. I’m probably pretty identifiable because of that.

A signature counter is also not a very effective defense, especially if it’s per-token, not per-key. Security Keys are generally used to bless long-term logins, and an attacker is likely to be able to login once with a cloned key. In fact, the login that violates the monotonicity of the counter will probably be a legitimate login so relying parties that strictly enforce the requirement are likely to lock the real user out after a compromise.

Since the counter is almost universally per-token, that means that it’ll commonly jump several values between logins to the same site because the token will have been used to login elsewhere in-between. That makes the counter less effective at detecting cloning. If login sessions are long-lived, then the attacker may only need to sign once and quite possibly never be detected. If an attacker is able to observe the evolution of the counter, say by having the user login to an attacker-controlled site periodically, they can avoid ever triggering a counter regression on a victim site.

Finally, signature counters move the threat model from one that deals with phishing and password reuse, to one where attackers are capable of extracting key material from hardware tokens. That’s quite a change and the signature counter is not optional in the protocol.

Attestation

When creating a key we observed that the token returned a certificate, and a signature over the newly created key by that certificate. Here’s the certificate from the token that I happen to be using as I write this:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 95815033 (0x5b60579)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = Yubico U2F Root CA Serial 457200631
        Validity
            Not Before: Aug  1 00:00:00 2014 GMT
            Not After : Sep  4 00:00:00 2050 GMT
        Subject: CN = Yubico U2F EE Serial 95815033
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:fd:b8:de:b3:a1:ed:70:eb:63:6c:06:6e:b6:00:
                    69:96:a5:f9:70:fc:b5:db:88:fc:3b:30:5d:41:e5:
                    96:6f:0c:1b:54:b8:52:fe:f0:a0:90:7e:d1:7f:3b:
                    ff:c2:9d:4d:32:1b:9c:f8:a8:4a:2c:ea:a0:38:ca:
                    bd:35:d5:98:de
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            1.3.6.1.4.1.41482.2:
                1.3.6.1.4.1.41482.1.1
    Signature Algorithm: sha256WithRSAEncryption

Clearly, it identifies the manufacturer and the unknown extension in there identifies the exact model of the key.

The certificate, despite the 32-bit serial number, doesn’t uniquely identify the device. The FIDO rules say that at least 100 000 other devices should share the same certificate. (Some devices mistakenly shipped with uniquely identifying attestation certifications. These will be recognised and suppressed in Chrome 67.) This device happens to be a special, GitHub-branded one. I don’t know whether this attestation certificate is specific to that run of devices, but the same model of device purchased on Amazon around the same time (but unbranded) had a different certificate serial number, so maybe.

But the practical upshot is that a relying party can determine, with some confidence, that a newly created key is stored in a Yubico 4th-gen U2F device by checking the attestation certificate and signature. That fact should cause people to pause; it has weighty ramifications.

Traditionally, anyone who implemented the various specifications that make up a web browser or server has been an equal participant in the web. There’s never been any permission needed to create a new browser or server and, although the specifications on the client side are now so complex that implementing them is a major effort, it’s possible to leverage existing efforts and use WebKit, Firefox, or Chromium as a base.

(The DRM in Encrypted Media Extensions might be an exception, and there was an appropriately large fight over that. I’m not making any judgment about the result of that here though.)

But with attestation, it’s quite possible for a site to require that you have particular hardware if you want to use webauthn. The concerns about that vary depending on the context: for an internal, corporate site to demand employees use particular hardware seems unobjectionable; a large public site raises many more concerns. There appear to be two, main ways in which this could develop in an unhealthy direction:

Firstly, as we’ve experienced with User-Agent headers, sites are not always as responsive as we would like. User-Agent headers have layers upon layers of browsers spoofing other browsers—the history of browser development is written in there—all because of this. Spoofing was necessary because sites would implement workarounds or degraded experiences for some browsers but fail to update things as the browser landscape changed. In order to be viable, each new browser entrant had to spoof the identity of the currently dominant one.

But attestation is not spoofable. Therefore, if sites launch webauthn support and accept attestations from the current set of token vendors, future vendors may be locked out of the market: Their devices won’t work because their attestations aren’t trusted, and they won’t be able to get sites to update because they won’t have enough market presence to matter.

If things get really bad, we may see a market develop where attestation certificates are traded between companies to overcome this. The costs of that are ultimately borne by increased token prices, which users have to pay.

The second concern is that, if different sites adopt different policies, users can have no confidence that a given token will work on a given site. They may be forced to have multiple tokens to span the set of sites that they use, and to remember in each case which token goes with which site. This would substantially degrade the user experience.

FIDO does not dismiss these worries and their answer, for the moment, is the metadata service (MDS). Essentially this is a unified root store that all sites checking attestation are supposed to use and update from. That would solve the problem of site stagnation by creating a single place for new vendors to submit their roots which would, in theory, then auto-update everywhere else. It would also help with the problem of divergent policies because it includes a FIDO-specified security-level for each type of token, which would at least reduce the variety of possible policies and make them more understandable—if used.

The challenge at the moment is that major vendors are not even included in the MDS, and using the MDS properly is harder for sites than throwing together a quick hack that’ll be “good enough for now”.

Thus my advice is for sites to ignore attestation if you’re serving the public. As we’ll see when we cover the webauthn API, attestation information is not even provided by default. Sites newly supporting webauthn are probably just using passwords, or maybe SMS OTP, and thus Security Keys offer a clear improvement. Worrying about whether the Security Key is a physical piece of hardware, and what certifications it has, is a distraction.

webauthn

Now that we understand the underlying protocol, here’s how to actually use it. As mentioned above, there’s an older, “U2F” API but that’ll disappear in time and so it’s not covered here. Rather, the future is the W3C Web Authentication API, which builds on the Credential Management API.

Right now, if you want to experiment, you can use Firefox Nightly, Chrome Canary, or Edge 14291+. In Firefox Nightly, things should be enabled by default. For Chrome Canary, run with –enable-features=WebAuthentication –enable-experimental-web-platform-features. For Edge, I believe that you have to enable it in about:flags, but I’m just going off documentation.

Testing for support

Before trying to use webauthn, you should use feature detection to ensure that it’s supported by a given browser:

Creating a new key

The following are partial snippets of Javascript for creating a key. In-between the snippets is discussion so you would need to concatenate all the snippets in this section to get a complete example.

The name field is required, but currently discarded. In the future it’s intended that CTAP2 tokens will be able to store this and so it could be used in an account chooser interface.

The id field is optional and it sets the relying party ID (RP ID) of the credential. This is the domain name associated with the key and it defaults to the domain that’s creating the key. It can only be overridden to set it within the eTLD+1 of the current domain—like a cookie.

All these fields are required. However, like the previous chunk, they’re currently discarded and intended for forthcoming CTAP2 tokens.

The id identifies an account. A given CTAP2 token should not store two keys for the same RP ID and user ID. However, it’s possible that CTAP2 keys will allow blind enumeration of user IDs given physical possession of the token. (We have to see how that part of the CTAP2 spec is implemented in practice.) Therefore, you don’t want to store a username in the id field. Instead you could do something like HMAC-SHA256(key = server-side-secret, input = username)[:16]. (Although you’ll need a reverse index in the future if you want to use CTAP2 tokens in 1st-factor mode.)

The name and displayName are again strings intended for a future account chooser UI that doesn’t currently exist.

The challenge field for a new key is a little complex. The webauthn spec is very clear that this must be a strong, server-generated nonce and, for the assertion request which we’ll get to next, that’s correct. Also, if you’re checking attestation then this challenge is your only assurance of freshness, so you’ll want it in that case too.

However, as I mentioned above, it’s far from clear how well attestation will work outside of a controlled environment and I recommend that you ignore it in the general case. Given that, the utility of the challenge when creating a key is questionable. Generated U2F keys don’t sign over it and, while CTAP2 keys have the option of covering it with a self-signature, it remains to be seen whether any will actually do that.

One advantage that it does bring is that it stops CSRF attacks from registering a new key. But, if you don’t have a solid, general solution to CSRF already then you have far larger security issues.

Thus, since it’s easy and you’ll need it for assertions anyway, I still recommend that you generate a 16- or 32-byte nonce on the server as the spec suggests. I just thought that you should be aware that the benefit is a little fuzzier than you might imagine.

This enumerates the types of public keys that the server can process. The type field is always public-key and the alg comes from the IANA COSE list. The value -7 means ECDSA with P-256, which is effectively mandatory since it’s what all U2F tokens implement. At the moment, that’s the only value that makes sense although there might be some TPM-based implementations in the future that use RSA.

The timeout specifies how many milliseconds to wait for the user to select a token before returning an error.

The excludeCredentials is something that can be ignored while you get something working, but which you’ll have to circle back and read the spec on before deploying anything real. It allows you to exclude tokens that the user has already created a key on when adding new keys.

The promise will, if everything goes well, resolve to a PublicKeyCredential, the response member of which is an AuthenticatorAttestationResponse. I’m not going to pointlessly rewrite all the server-side processing steps from the spec here but I will note a couple of things:

Firstly, if you don’t know what token-binding is, that’s fine: ignore it. Same goes for extensions unless you previously supported the U2F API, in which case see the section below. I also suggest that you ignore the parts dealing with attestation (see above), which eliminates 95% of the complexity. Lastly, while I recommend following the remaining steps, see the bit above about the challenge parameter and don’t erroneously believe that checking, say, the origin member of the JSON is more meaningful than it is.

I’ll also mention something about all the different formats in play here: In order to process webauthn, you’ll end up dealing with JSON, ASN.1, bespoke binary formats, and CBOR. You may not have even heard of the last of those but CBOR is yet another serialisation format, joining Protocol Buffers, Msgpack, Thrift, Cap’n Proto, JSON, ASN.1, Avro, BSON, …. Whatever you might think of the IETF defining another format rather than using any of the numerous existing ones, you will have to deal with it to support webauthn, and you’ll have to deal with COSE, the CBOR translation of JOSE/JWT. You don’t have to support tags or indefinite lengths in CBOR though because the CTAP2 canonicalisation format forbids them.

The problem is that there’s going to be a lot of people having to implement converters from COSE key format to something that their crypto library accepts. Maybe I should start a GitHub repo of sample code for that in various libraries. But, short of that, here’s a quick reference to help you navigate all the different formats in play:

COSE Key

Used in: the public key in the attested credential data when creating a key. You have to parse this because you need the public key in order to check assertions.
Format: a CBOR map, defined here. However, you’ll struggle to figure out, concretely, what the keys in that map are from the RFC so, for reference, if you’re decoding a P-256 key you should expect these entries in the map: 1 (key type) = 2 (elliptic curve, x&y), 3 (algorithm) = -7 (ECDSA with SHA-256), -1 (curve) = 1 (P-256), and then x and y coordinates are 32-byte values with keys -2 and -3, respectively.
How to recognise: the first nibble is 0xa, for a small CBOR map.

X9.62 key

Used in: lots of things, this is the standard format for EC public keys. Contained within the SPKI format, which is used by Web Crypto and X.509.
Format: a type byte (0x04 for standard, uncompressed keys), followed by x and y values.
How to recognise: for P-256, it’s 65 bytes long and start with 0x04. (There’s also a compressed version of X9.62 but it’s exceedingly rare.)

SPKI

Used in: X.509 and Web Crypto.
Format: an ASN.1 structure called SubjectPublicKeyInfo: an algorithm identifier followed by a lump of bytes in an algorithm-specific format, which is X9.62 for elliptic-curve keys.
How to recognise: starts with 0x30.

ASN.1 signatures

Used in: nearly everything that deals with ECDSA signatures.
Format: an ASN.1 SEQUENCE containing two INTEGER values for ECDSA’s r and s values.
How to recognise: starts with 0x30 and you’re expecting a signature, not a key.

“Raw” signatures

Used in: Web Crypto.
Format: a pair of 32-byte values (for P-256).
How to recognise: 64 bytes long.

Getting assertions

Once you have one (or more) keys registered for a user you can challenge them for a signature. Typically this is done at login time although CTAP2 envisions user verifying tokens that take a fingerprint or PIN number, so signatures from those devices could replace reauthentication requests. (I.e. those times when a site asks you to re-enter your password before showing particularly sensitive information.)

When getting an assertion, the challenge (i.e. nonce) really must be securely generated at the server—there’s none of the equivocation as with generation.

The credentialID values are those extracted from the attested credential data from the registration.

Again, there’s no need for me to reiterate the processing steps already enumerated in the spec except to repeat that you can ignore token-binding if you don’t know what it is, and to reference the comments above about signature counters. If you implement the signature counter check, you should think through how the user will experience various scenarios and ensure that you’re monitoring metrics for the number of counter failures observed. (I don’t think we know, in practice, how often counter writes will be lost, or counters will be corrupted.)

Supporting keys created with the U2F API

This is just for the handful of sites that supported Security Keys using the older, U2F API. If you have keys that were registered with that API then they aren’t immediately going to work with webauthn because the AppID from U2F is a URL, but the relying party ID in webauthn is just a domain. Thus the same origin is considered to be a different relying party when using U2F and webauthn and the keys won’t be accepted. (Just the same as if you tried to use a credential ID from a different site.)

However, there is a solution to this: in the publicKey argument of the get call add an extensions field containing a dict with a key appid whose value is the AppID of your U2F keys. This can be done uniformly for all keys if you wish since both the relying party ID and AppID will be tried when this is asserted. With this in place, keys registered with U2F should work with webauthn.

There is no way to create a U2F key with webauthn however. So if you rollout webauthn support, have users create keys with webauthn, and have to roll it back for some reason, those new keys will not work with the U2F API. So complete the transition to webauthn of your login process first, then transition registration.

Concluding remarks

It’s exciting to see webauthn support coming to most browsers this year. I get to use Security Keys with about the same number of sites as I use SMS OTP with, and I use Security Keys anywhere I can. While I, like everyone technical, assumes that I’m less likely to get phished than most, I’m still human. So I hope that wide-spread support for webauthn encourages more sites to support Security Keys.

Challenges remain, though. Probably the most obvious is that NFC is the ideal interface for mobile devices, but it doesn’t work with iOS. Bluetooth works for both Android and iOS, but requires a battery and isn’t as frictionless.

Security keys will probably remain the domain of the more security conscious in the short-term since, with CTAP1, they can only be an additional authentication step. But I hope to see CTAP2 tokens generally available this year with fingerprint readers or PIN support. It might be that, in a few years time, a significant number of people have a passwordless experience with at least one site that they use regularly. That’ll be exciting.

TLS 1.3 and Proxies

I'll generally ignore the internet froth in a given week as much as possible, but when Her Majesty's Government starts repeating misunderstandings about TLS 1.3 it is necessary to write something, if only to have a pointer ready for when people start citing it as evidence.

The first misunderstanding in the piece is the claim that it's possible for man-in-the-middle proxies to selectively proxy TLS 1.2 connections, but not TLS 1.3 connections because the latter encrypts certificates.

The TLS 1.2 proxy behaviour that's presumed here is the following: the proxy forwards the client's ClientHello message to the server and inspects the resulting ServerHello and certificates. Based on the name in the certificate, the proxy may “drop out” of the connection (i.e. allow the client and server to communicate directly) or may choose to interpose itself, answering the client with an alternative ServerHello and the server with an alternative ClientKeyExchange, negotiating different encrypted sessions with each and forwarding so that it can see the plaintext of the connection. In order to satisfy the client in this case the client must trust the proxy, but that's taken care of in the enterprise setting by installing a root CA on the client. (Or, in Syria, by hoping that users click through the the certificate error.)

While there do exist products that attempt to do this, they break repeatedly because it's a fundamentally flawed design: by forwarding the ClientHello to the server, the proxy has committed to supporting every feature that the client advertises because, if the server selects a given feature, it's too late for the proxy to change its mind. Therefore, with every new cipher suite, new curve, and new extension introduced, a proxy that does this finds that it cannot understand the connection that it's trying to interpose.

One option that some proxies take is to try and heuristically detect when it can't support a connection and fail open. However, if you believe that your proxy is a strong defense against something then failing open is a bit of problem.

Thus another avenue that some proxies have tried is to use the same heuristics to detect unsupported connections, discard the incomplete, outgoing connection, and start another by sending a ClientHello that only includes features that the proxy supports. That's unfortunate for the server because it doubles its handshaking cost, but gives the proxy a usable connection.

However, both those tricks only slow down the rate at which customers lurch from outage to outage. The heuristics are necessarily imprecise because TLS extensions can change anything about a connection after the ClientHello and some additions to TLS have memorably broken them, leading to confused proxies cutting enterprises off from the internet.

So the idea that selective proxying based on the server certificate ever functioned is false. A proxy can, with all versions of TLS, examine a ClientHello and decide to proxy the connection or not but, if it does so, it must craft a fresh ClientHello to send to the server containing only features that it supports. Making assumptions about any TLS message after a ClientHello that you didn't craft is invalid. Since, in practice, this has not been as obvious as the designers of TLS had imagined, the 1.3 draft has a section laying out these requirements.

Sadly, it's precisely this sort of proxy misbehaviour that has delayed TLS 1.3 for over a year while my colleagues (David Benjamin and Steven Valdez) repeatedly deployed experiments and measured success rates of different serialisations. In the end we found that making TLS 1.3 look like a TLS 1.2 resumption solved a huge number of problems, suggesting that many proxies blindly pass through such connections. (Which should, again, make one wonder about what security properties they're providing.)

But, given all that, you might ponder why we bothered encrypting certificates? Partly it's one component of an effort to make browsing more private but, more concretely, it's because anything not encrypted suffers these problems. TLS 1.3 was difficult to deploy because TLS's handshake is, perforce, exposed to the network. The idea that we should make TLS a little more efficient by compressing certificates has been bouncing around for many years. But it's only with TLS 1.3 that we might make it happen because everyone expected to hit another swamp of proxy issues if we tried it without encrypting certificates first.

It's also worth examining the assumption behind waiting for the server certificate before making an interception decision: that the client might be malicious and attempt to fool the proxy but (to quote the article) the certificate is “tightly bound to the server we’re actually interacting with”. The problem here is that a certificate for any given site, and a valid signature over a ServerKeyExchange from that certificate, is easily available: just connect to the server and it'll send it to you. Therefore if you're worried about malware, how is it that the malware C&C server won't just reply with a certificate for a reputable site? The malware client, after all, can be crafted to compensate for any such trickery. Unless the proxy is interposing and performing the cryptographic checks, then the server certificate isn't tightly bound to anything at all and the whole reason for the design seems flawed.

On that subject, I'll briefly mention the fact that HTTPS proxies aren't always so great at performing cryptographic checks. (We recently notified a major proxy vendor that their product didn't appear to validate certificates at all. We were informed that they can validate certificates, it's just disabled by default. It's unclear what fraction of their customers are aware of that.)

Onto the second claim of the article: that TLS 1.3 is incompatible with PCI-DSS (credit card standards) and HIPAA (US healthcare regulation). No reasoning is given for the claim, so let's take a look:

Many PCI-DSS compliant systems use TLS 1.2, primarily stemming from requirement 4.1: “use strong cryptography and security protocols to safeguard sensitive cardholder data during transmission over open, public networks, including a) only trusted keys and certificates are accepted, b) the protocol in use only supports secure versions or configurations, and c) the encryption strength is appropriate for the encryption methodology in use”.

As you can see, the PCI-DSS requirements are general enough to adapt to new versions of TLS and, if TLS 1.2 is sufficient, then TLS 1.3 is better. (Even those misunderstanding aspects of TLS 1.3 are saying it's stronger than 1.2.)

HIPAA is likewise, requiring that one must “implement technical security measures to guard against unauthorized access to electronic protected health information that is being transmitted over an electronic communications network”.

TLS 1.3 is enabled in Chrome 65, which is rolling out now. It is a major improvement in TLS and lets us eliminate session-ticket encryption keys as a mass-decryption threat, which both PCI-DSS- and HIPAA-compliance experts should take great interest in. It does not require special measures by proxies—they need only implement TLS 1.2 correctly.

Testing Security Keys

Last time I reviewed various security keys at a fairly superficial level: basic function, physical characteristics etc. This post considers lower-level behaviour.

Security Keys implement the FIDO U2F spec, which borrows a lot from ISO 7816-4. Each possible transport (i.e. USB, NFC, or Bluetooth) has its own spec for how to encapsulate the U2F messages over that transport (e.g. here's the USB one). FIDO is working on much more complex (and more capable) second versions of these specs, but currently all security keys implement the basic ones.

In essence, the U2F spec only contains three functions: Register, Authenticate, and Check. Register creates a new key-pair. Authenticate signs with an existing key-pair, after the user confirms physical presence, and Check confirms whether or not a key-pair is known to a security key.

In more detail, Register takes a 32-byte challenge and a 32-byte appID. These are intended to be SHA-256 hashes, but are opaque and can be anything. The challenge acts as a nonce, while the appID is bound to the resulting key and represents the context of the key. For web browsers, the appID is the hash of a URL in the origin of the login page.

Register returns a P-256 public key, an opaque key handle, the security key's batch certificate, and a signature, by the public key in the certificate, of the challenge, appID, key handle, and public key. Since the security keys are small devices with limited storage, it's universally the case in the ones that I've looked at that the key handle is actually an encrypted private key, i.e. the token offloads storage of private keys. However, in theory, the key handle could just be an integer that indexes storage within the token.

Authenticate takes a challenge, an appID, and a key handle, verifies that the appID matches the value given to Register, and returns a signature, from the public key associated with that key handle, over the challenge and appID.

Check takes a key handle and an appID and returns a positive result if the key handle came from this security key and the appID matches.

Given that, there are a number of properties that should hold. Some of the most critical:

But there are a number of other things that would be nice to test:

So given all those desirable properties, how do various security keys manage?

Yubico

Easy one first: I can find no flaws in Yubico's U2F Security Key.

VASCO SecureClick

I've acquired one of these since the round-up of security keys that I did last time so I'll give a full introduction here. (See also Brad's review.)

This is a Bluetooth Low-Energy (BLE) token, which means that it works with both Android and iOS devices. For non-mobile devices, it includes a USB-A BLE dongle. The SecureClick uses a Chrome extension for configuring and pairing the dongle, which works across platforms. The dongle appears as a normal USB device until it sees a BLE signal from the token, at which point it “disconnects” and reconnects as a different device for actually doing the U2F operation. Once an operation that requires user-presence (i.e. a Register or Authenticate) has completed, the token powers down and the dongle disconnects and reconnects as the original USB device again.

If you're using Linux and you configure udev to grant access to the vendor ID & product ID of the token as it appears normally, nothing will work because the vendor ID and product ID are different when it's active. The Chrome extension will get very confused about this.

However, once I'd figured that out, everything else worked well. The problem, as is inherent with BLE devices, is that the token needs a battery that will run out eventually. (It takes a CR2012 and can be replaced.) VASCO claims that it can be used 10 times a day for two years, which seems plausible. I did run the battery out during testing, but testing involves a lot of operations. Like the Yubico, I did not find any problems with this token.

I did have it working with iOS, but it didn't work when I tried to check the battery level just now, and I'm not sure what changed. (Perhaps iOS 11?)

Feitian ePass

ASN.1 DER is designed to be a “distinguished” encoding, i.e. there should be a unique serialisation for a given value and all other representations are invalid. As such, numbers are supposed to be encoded minimally, with no leading zeros (unless necessary to make a number positive). Feitian doesn't get that right with this security key: numbers that start with 9 leading zero bits have an invalid zero byte at the beginning. Presumably, numbers starting with 17 zero bits have two invalid zero bytes at the beginning and so on, but I wasn't able to press the button enough times to get such an example. Thus something like one in 256 signatures produced by this security key are invalid.

Also, the final eight bytes of the key handle seem to be superfluous: you can change them to whatever value you like and the security key doesn't care. That is not immediately a problem, but it does beg the question: if they're not being used, what are they?

Lastly, the padding data in USB packets isn't zeroed. However, it's obviously just the previous contents of the transmit buffer, so there's nothing sensitive getting leaked.

Thetis

With this device, I can't test things like key handle mutability and whether the appID is being checked because of some odd behaviour. The response to the first Check is invalid, according to the spec: it returns status 0x9000 (“NO_ERROR”), when it should be 0x6985 or 0x6a80. After that, it starts rejecting all key handles (even valid ones) with 0x6a80 until it's unplugged and reinserted.

This device has the same non-minimal signature encoding issue as the Feitian ePass. Also, if you click too fast, this security key gets upset and rejects a few requests with status 0x6ffe.

USB padding bytes aren't zeroed, but appear to be parts of the request message and thus nothing interesting.

U2F Zero

A 1KiB ping message crashes this device (i.e. it stops responding to USB messages and needs to be unplugged and reinserted). Testing a corrupted key handle also crashes it and thus I wasn't able to run many tests.

KEY-ID / HyperFIDO

The Key-ID (and HyperFIDO devices, which have the same firmware, I think) have the same non-minimal encoding issue as the Feitian ePass, but also have a second ASN.1 flaw. In ASN.1 DER, if the most-significant bit of a number is set, that number is negative. If it's not supposed to be negative, then a zero pad byte is needed. I think what happened here is that, when testing the most-significant bit, the security key checks whether the first byte is > 0x80, but it should be checking whether it's >= 0x80. The upshot is the sometimes it produces signatures that contain negative numbers and are thus invalid.

USB padding bytes aren't zeroed, and include data that was not part of the request or response. It's unlikely to be material, but it does beg the question of where it comes from.

The wrapped keys also have some unfortunate properties. Firstly, bytes 16 through 31 are a function of the device and the appID, thus a given site can passively identify the same token when used by different accounts. Bytes 48 through 79 are unauthenticated and, when altered, everything still works except the signatures are wrong. That suggests that these bytes are the encrypted private key (or the encrypted seed to generate it). It's not obvious that there's any vulnerability from being able to tweak the private key like this, but all bytes of the key handle should be authenticated as a matter of best practice. Lastly, bytes 32 through 47 can't be arbitrarily manipulated, but can be substituted with the same bytes from a different key handle, which causes the signatures to be incorrect. I don't know what's going on there.

Overall, the key handle structure is sufficiently far from the obvious construction to cause worry, but not an obvious vulnerability.

Security Keys

Security Keys are (generally) USB-connected hardware fobs that are capable of key generation and oracle signing. Websites can “enroll” a security key by asking it to generate a public key bound to an “appId” (which is limited by the browser based on the site's origin). Later, when a user wants to log in, the website can send a challenge to the security key, which signs it to prove possession of the corresponding private key. By having a physical button, which must be pressed to enroll or sign, operations can't happen without user involvement. By having the security keys encrypt state and hand it to the website to store, they can be stateless(*) and robust.

(* well, they can almost be stateless, but there's a signature counter in the spec. Hopefully it'll go away in a future revision for that and other reasons.)

The point is that security keys are unphishable: a phisher can only get a signature for their appId which, because it's based on the origin, has to be invalid for the real site. Indeed, a user cannot be socially engineered into compromising themselves with a security key, short of them physically giving it to the attacker. This is a step up from app- or SMS-based two-factor authentication, which only solves password reuse. (And SMS has other issues.)

The W3C standard for security keys is still a work in progress, but sites can use them via the FIDO API today. In Chrome you can load an implementation of that API which forwards requests to an internal extension that handles the USB communication. If you do that, then there's a Firefox extension that implements the same API by running a local binary to handle it. (Although the Firefox extension appears to stop working with Firefox 57, based on reports.)

Google, GitHub, Facebook and Dropbox (and others) all support security keys this way. If you administer a G Suite domain, you can require security keys for your users. (“G Suite” is the new name for Gmail etc on a custom domain.)

But, to get all this, you need an actual security key, and probably two of them if you want a backup. (And a backup is a good idea, especially if you plan on dropping your phone number for account recovery.) So I did a search on Amazon for “U2F security key” and bought everything on the first page of results that was under $20 and available to ship now.

(Update: Brad Hill setup a page on GitHub that includes these reviews and his own.)

Yubico Security Key

Brand: Yubico, Firmware: Yubico, Chip: NXP, Price: $17.99, Connection: USB-A

Yubico is the leader in this space and their devices are the most common. They have a number of more expensive and more capable devices that some people might be familiar with, but this one only does U2F. The sensor is a capacitive so a light touch is sufficient to trigger it. You'll have no problems with this key, but it is the most expensive of the under $20 set.

Thetis U2F Security Key

Brand: Thetis, Firmware: Excelsecu, Chip: ?, Price: $13.95, Connection: USB-A

This security key is fashioned more like a USB thumb drive. The plastic inner part rotates within the outer metal shell and so the USB connector can be protected by it. The button is in the axis and is clicky, rather than capacitive, but doesn't require too much force to press. If you'll be throwing your security key in bags and worry about damaging them then perhaps this one will work well for you.

A minor nit is that the attestation certificate is signed with SHA-1. That doesn't really matter, but it suggests that the firmware writers aren't paying as much attention as one would hope. (I.e. it's a brown M&M.)

Feitian ePass

Brand: Feitian, Firmware: Feitian, Chip: NXP, Price: $16.99, Connection: USB-A, NFC

This one is very much like the Yubico, just a little fatter around the middle. Otherwise, it's also a sealed plastic body and capacitive touch sensor. The differences are a dollar and NFC support—which should let it work with Android. However, I haven't tested this feature.

I don't know what the opposite of a brown M&M is, but this security key is the only one here that has its metadata correctly registered with the FIDO Metadata Service.

U2F Zero

Brand: U2F Zero, Firmware: Conor Patrick, Chip: Atmel, Price: $8.99, Connection: USB-A

I did bend the rules a little to include this one: it wasn't immediately available when I did the main order from Amazon. But it's the only token on Amazon that has open source firmware (and hardware designs), and that was worth waiting for. It's also the cheapest of all the options here.

Sadly, I have to report that I can't quite recommend it because, in my laptop (a Chromebook Pixel), it's not thick enough to sit in the USB port correctly: Since it only has the “tongue” of a USB connector, it can move around in the port a fair bit. That's true of the other tokens too, but with the U2F Zero, unless I hold it just right, it fails to make proper contact. Since operating it requires pressing the button, it's almost unusable in my laptop.

However, it's fine with a couple of USB hubs that I have and in my desktop computer, so it might be fine for you. Depends how much you value the coolness factor of it being open-source.

KEY-ID FIDO U2F Security Key

Brand: KEY-ID, Firmware: Feitian(?), Chip: ?, Price: $12.00, Connection: USB-A

I photographed this one while plugged in in order to show the most obvious issue with this device: everyone will know when you're using it! Whenever it's plugged in, the green LED on the end is lit up and, although the saturation in the photo exaggerates the situation a little, it really is too bright. When it's waiting for a touch, it starts flashing too.

In addition, whenever I remove this from my desktop computer, the computer reboots. That suggests an electrical issue with the device itself—it's probably shorting something that shouldn't be shorted, like the USB power pin to ground, for example.

While this device is branded “KEY-ID”, I believe that the firmware is done by Feitian. There are similarities in certificate that match the Feitian device and, if you look up the FIDO certification, you find that Feitian registered a device called “KEY-ID FIDO® U2F Security Key”. Possibly Feitian decided against putting their brand on this.

(Update: Brad Hill, at least, reports no problems with these when using a MacBook.)

HyperFIDO Mini

Brand: HyperFIDO, Firmware: Feitian(?), Chip: ?, Price: $13.75, Connection: USB-A

By observation, this is physically identical to the KEY-ID device, save for the colour. It has the same green LED too (see above).

However, it manages to be worse. The KEY-ID device is highlighted in Amazon as a “new 2017 model”, and maybe this an example of the older model. Not only does it cause my computer to reliably reboot when removed (I suffered to bring you this review, dear reader), it also causes all devices on a USB hub to stop working when plugged in. When plugged into my laptop it does work—as long as you hold it up in the USB socket. The only saving grace is that, when you aren't pressing it upwards, at least the green LED doesn't light up.

HyperFIDO U2F Security Key

Brand: HyperFIDO, Firmware: Feitian(?), Chip: ?, Price: $9.98, Connection: USB-A

This HyperFIDO device is plastic so avoids the electrical issues of the KEY-ID and HyperFIDO Mini, above. It also avoids having an LED that can blind small children.

However, at least on the one that I received, the plastic USB part is only just small enough to fit into a USB socket. It takes a fair bit of force to insert and remove it. Also the end cap looks like it should be symmetrical and so able to go on either way around, but it doesn't quite work when upside down.

Once inserted, pressing the button doesn't take too much force, but it's enough to make the device bend worryingly in the socket. It doesn't actually appear to be a problem, but it adds a touch of anxiety to each use. Overall, it's cheap and you'll know it.

Those are the devices that matched my initial criteria. But, sometimes, $20 isn't going to be enough I'm afraid. These are some other security keys that I've ended up with:

Yubikey 4C

Brand: Yubico, Firmware: Yubico, Chip: NXP?, Price: $50 (direct from Yubico), Connection: USB-C

If you have a laptop that only has USB-C ports then a USB-A device is useless to you. Currently your only option is the Yubikey 4C at $50 a piece. This works well enough: the “button” is capacitive and triggers when you touch either of the contacts on the sides. The visual indicator is an LED that shines through the plastic at the very end.

Note that, as a full Yubikey, it can do more than just being a security key. Yubico have a site for that.

(Update: several people have mentioned in comments that this device wasn't very robust for them and had physical problems after some months. I can't confirm that, but there's always the option of a USB-A to USB-C adaptor. Just be careful to get one that'll work—the adaptor that I have doesn't hold “tongue-only” USB devices well enough.)

Many people lacking USB-A ports will have a Touch Bar, which includes a fingerprint sensor and secure element. One might spy an alternative (and cheaper solution) there. GitHub have published SoftU2F which does some of that but, from what I can tell, doesn't actually store keys in the secure element yet. However, in time, there might be a good answer for this.

(Update: the behaviour of SoftU2F might be changing.)

Yubikey Nano

Brand: Yubico, Firmware: Yubico, Chip: NXP?, Price: $50 (direct from Yubico), Connection: USB-A

Another $50 security key from Yubico, but I've included it because it's my preferred form-factor: this key is designed to sit semi-permanently inside the USB-A port. The edge is a capacitive touch sensor so you can trigger it by running your finger along it.

It does mean that you give up a USB port, but it also means that you've never rummaging around to find it.

(Note: newer Nanos look slightly different. See Yubico's page for a photo of the current design.)

Maybe Skip SHA-3

In 2005 and 2006, a series of significant results were published against SHA-1 [1][2][3]. These repeated break-throughs caused something of a crisis of faith as cryptographers questioned whether we knew how to build hash functions at all. After all, many hash functions from the 1990's had not aged well [1][2].

In the wake of this, NIST announced (PDF) a competition to develop SHA-3 in order to hedge the risk of SHA-2 falling. In 2012, Keccak (pronounced “ket-chak”, I believe) won (PDF) and became SHA-3. But the competition itself proved that we do know how to build hash functions: the series of results in 2005 didn't extend to SHA-2 and the SHA-3 process produced a number of hash functions, all of which are secure as far as we can tell. Thus, by the time it existed, it was no longer clear that SHA-3 was needed. Yet there is a natural tendency to assume that SHA-3 must be better than SHA-2 because the number is bigger.

As I've mentioned before, diversity of cryptographic primitives is expensive. It contributes to the exponential number of combinations that need to be tested and hardened; it draws on limited developer resources as multiple platforms typically need separate, optimised code; and it contributes to code-size, which is a worry again in the mobile age. SHA-3 is also slow, and is even slower than SHA-2 which is already a comparative laggard amongst crypto primitives.

SHA-3 did introduce something useful: extendable output functions (XOFs), in the form of the SHAKE algorithms. In an XOF, input is hashed and then an (effectively) unlimited amount of output can be produced from it. It's convenient, although the same effect can be produced for a limited amount of output using HKDF, or by hashing to a key and running ChaCha20 or AES-CTR.

Thus I believe that SHA-3 should probably not be used. It offers no compelling advantage over SHA-2 and brings many costs. The only argument that I can credit is that it's nice to have a backup hash function, but both SHA-256 and SHA-512 are commonly supported and have different cores. So we already have two secure hash functions deployed and I don't think we need another.

BLAKE2 is another new, secure hash function, but it at least offers much improved speed over SHA-2. Speed is important. Not only does it mean less CPU time spent on cryptography, it means that cryptography can be economically deployed in places where it couldn't be before. BLAKE2, however, has too many versions: eight at the current count (BLAKE2(X)?[sb](p)?). In response to complaints about speed, the Keccak team now have KangarooTwelve and MarsupilamiFourteen, which have a vector-based design for better performance. (Although a vector-based design can also be used to speed up SHA-2.)

So there are some interesting prospects for a future, faster replacement for SHA-2. But SHA-3 itself isn't one of them.

Update: two points came up in discussion about this. Firstly, what about length-extension? SHA-2 has the property that simply hashing a secret with some data is not a secure MAC construction, that's why we have HMAC. SHA-3 does not have this problem.

That is an advantage of SHA-3 because it means that people who don't know they need to use HMAC (with SHA-2) won't be caught out by it. Hopefully, in time, we end up with a hash function that has that property. But SHA-512/256, BLAKE2, K12, M14 and all the other SHA-3 candidates do have this property. In fact, it's implausible that any future hash function wouldn't.

Overall, I don't feel that solving length-extension is a sufficiently pressing concern that we should all invest in SHA-3 now, rather than a hash function that hopefully comes with more advantages. If it is a major concern for you now, try SHA-512/256—a member of the SHA-2 family.

The second point was that SHA-3 is just the first step towards a permutation-based future: SHA-3 has an elegant foundation that is suitable for implementing the full range of symmetric algorithms. In the future, a single optimised permutation function could be the basis of hashes, MACs, and AEADs, thus saving code size / die area and complexity. (E.g. STROBE.)

But skipping SHA-3 doesn't preclude any of that. SHA-3 is the hash standard itself, and even the Keccak team appear to be pushing K12 rather than SHA-3 now. It seems unlikely that a full set of primitives built around the Keccak permutation would choose to use the SHA-3 parameters at this point.

Indeed, SHA-3 adoption might inhibit that ecosystem by pushing it towards those bad parameters. (There was a thing about NIST tweaking the parameters at the end of the process if you want some background.)

One might argue that SHA-3 should be supported because you believe that it'll result in hardware implementations of the permutation and you hope that they'll be flexible enough to support what you really want to do with it. I'm not sure that would be the best approach even if your goal was to move to a permutation-based world. Instead I would nail down the whole family of primitives as you would like to see it and try to push small chips, where area is a major concern, to adopt it. Even then, the hash function in the family probably wouldn't be exactly SHA-3, but more like K12.

AES-GCM-SIV

AEADs combine encryption and authentication in a way that provides the properties that people generally expect when they “encrypt” something. This is great because, historically, handing people a block cipher and a hash function has resulted in a lot of bad and broken constructions. Standardising AEADs avoids this.

Common AEADs have a sharp edge though: you must never encrypt two different messages with the same key and nonce. Doing so generally violates the confidentiality of the two messages and might break much more.

There are some situations where obtaining a unique nonce is easy, for example in transport security where a simple counter can be used. (Due to poor design in TLS 1.2, some TLS implementations still managed to duplicate nonces. The answer there is to do what ChaCha20-Poly1305 does in TLS 1.2, and what everything does in TLS 1.3: make the nonce implicit. That means that any mistakes result in your code not interoperating, which should be noticed.)

But there are situations where ensuring nonce uniqueness is not trivial, generally where multiple machines are independently encrypting with the same key. A system for distributing nonces is complex and hard to have confidence in. Generating nonces randomly and depending on statistical uniqueness is reasonable, but only if the space of nonces is large enough. XSalsa20-Poly1305 (paper, code) has a 192-bit nonce and good performance across a range of CPUs. In many situations, using that with random nonces is a sound choice.

However, lots of chips now have hardware support for the AES-GCM AEAD, meaning that its performance and power use is hard to beat. Also, a 192-bit nonce results in a ~40% increase in overhead vs the 96-bit nonce of AES-GCM, which is undesirable in some contexts. (Overhead consists of the nonce and tag, thus the increase from 96- to 192-bit nonces is not 100%.)

But using random nonces in a 96-bit space isn't that comfortable. NIST recommends a limit of 232 messages when using random nonces with AES-GCM which, while quite large, is often not large enough not to have to worry about. Because of this, Shay Gueron, Yehuda Lindell and I have been working on AES-GCM-SIV (paper, spec): AES-GCM with some forgiveness. It uses the same primitives as AES-GCM, and thus enjoys the same hardware support, but it doesn't fail catastrophically if you repeat a nonce. Thus you can use random, 96-bit nonces with a far larger number of messages, or withstand a glitch in your nonce distribution scheme. (For precise numbers, see section 5.3 of the paper.)

There is a performance cost: AES-GCM-SIV encryption runs at about 70% the speed of AES-GCM, although decryption runs at the same speed. (Measured using current BoringSSL on an Intel Skylake chip with 8KiB messages.) But, in any situation where you don't have a watertight argument for nonce uniqueness, that might be pretty cheap compared to the alternative.

For example, both TLS and QUIC need to encrypt small messages at the server that can be decrypted by other servers. For TLS, these messages are the session tickets and, in QUIC, they are the source-address tokens. This is an example of a situation where many servers are independently encrypting with the same key and so Google's QUIC implementation is in the process of switching to using AES-GCM-SIV for source-address tokens. (Our TLS implementation may switch too in the future, although that will be more difficult because of existing APIs.)

AEADs that can withstand nonce duplication are called “nonce-misuse resistant” and that name appears to have caused some people to believe that they are infinitely resistant. I.e. that an unlimited number of messages can be encrypted with a fixed nonce with no loss in security. That is not the case, and the term wasn't defined that way originally by Rogaway and Shrimpton (nor does their SIV mode have that property). So it's important to emphasise that AES-GCM-SIV (and nonce-misuse resistant modes in general) are not a magic invulnerability shield. Figure four and section five of the the paper give precise bounds but, if in doubt, consider AES-GCM-SIV to be a safety net for accidental nonce duplication and otherwise treat it like a traditional AEAD.

AES-GCM-SIV is supported in BoringSSL now and, while one may not want to use the whole of BoringSSL, the core assembly is ISC licensed. Also, Shay has published reference, assembly and intrinsics versions for those who think AES-GCM-SIV might be useful to them.

CFI directives in assembly files

(This post uses x86-64 for illustration throughout. The fundamentals are similar for other platforms but will need some translation that I don't cover here.)

Despite compilers getting better over time, it's still the case that hand-written assembly can be worthwhile for certain hot-spots. Sometimes there are special CPU instructions for the thing that you're trying to do, sometimes you need detailed control of the resulting code and, to some extent, it remains possible for some people to out-optimise a compiler.

But hand-written assembly doesn't automatically get some of the things that the compiler generates for normal code, such as debugging information. Perhaps your assembly code never crashes (although any function that takes a pointer can suffer from bugs in other code) but you probably still care about accurate profiling information. In order for debuggers to walk up the stack in a core file, or for profilers to correctly account for CPU time, they need be able to unwind call frames.

Unwinding used to be easy as every function would have a standard prologue:

push rbp
mov rbp, rsp

This would make the stack look like this (remember that stacks grow downwards in memory):

Caller's stackRSP value before CALLRSP at function entryCaller's RBPCallee's local variablesSaved RIP(pushed by CALL)RBP always points here

So, upon entry to a function, the CALL instruction that jumped to the function in question will have pushed the previous program counter (from the RIP register) onto the stack. Then the function prologue saves the current value of RBP on the stack and copies the current value of the stack pointer into RBP. From this point until the function is complete, RBP won't be touched.

This makes stack unwinding easy because RBP always points to the call frame for the current function. That gets you the saved address of the parent call and the saved value of its RBP and so on.

The problems with this scheme are that a) the function prologue can be excessive for small functions and b) we would like to be able to use RBP as a general purpose register to avoid spills. Which is why the GCC documentation says that “-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging”. This means that you can't depend on being able to unwind stacks like this. A process can be comprised of various shared libraries, any of which might be compiled with optimisations.

To be able to unwind the stack without depending on this convention, additional debugging tables are needed. The compiler will generate these automatically (when asked) for code that it generates, but it's something that we need to worry about when writing assembly functions ourselves if we want profilers and debuggers to work.

The reference for the assembly directives that we'll need is here, but they are very lightly documented. You can understand more by reading the DWARF spec, which documents the data that is being generated. Specifically see sections 6.4 and D.6. But I'll try to tie the two together in this post.

The tables that we need the assembler to emit for us are called Call Frame Information (CFI). (Not to be confused with Control Flow Integrity, which is very different.) Based on that name, all the assembler directives begin with .cfi_.

Next we need to define the Canonical Frame Address (CFA). This is the value of the stack pointer just before the CALL instruction in the parent function. In the diagram above, it's the value indicated by “RSP value before CALL”. Our first task will be to define data that allows the CFA to be calculated for any given instruction.

The CFI tables allow the CFA to be expressed as a register value plus an offset. For example, immediately upon function entry the CFA is RSP + 8. (The eight byte offset is because the CALL instruction will have pushed the previous RIP on the stack.)

As the function executes, however, the expression will probably change. If nothing else, after pushing a value onto the stack we would need to increase the offset.

So one design for the CFI table would be to store a (register, offset) pair for every instruction. Conceptually that's what we do but, to save space, only changes from instruction to instruction are stored.

It's time for an example, so here's a trivial assembly function that includes CFI directives and a running commentary.

  .globl  square
  .type   square,@function
  .hidden square
square:

This is a standard preamble for a function that's unrelated to CFI. Your assembly code should already be full of this.

  .cfi_startproc

Our first CFI directive. This is needed at the start of every annotated function. It causes a new CFI table for this function to be initialised.

  .cfi_def_cfa rsp, 8

This is defining the CFA expression as a register plus offset. One of the things that you'll see compilers do is express the registers as numbers rather than names. But, at least with GAS, you can write names. (I've included a table of DWARF register numbers and names below in case you need it.)

Getting back to the directive, this is just specifying what I discussed above: on entry to a function, the CFA is at RSP + 8.

  push    rbp
  .cfi_def_cfa rsp, 16

After pushing something to the stack, the value of RSP will have changed so we need to update the CFA expression. It's now RSP + 16, to account for the eight bytes we pushed.

  mov     rbp, rsp
  .cfi_def_cfa rbp, 16

This function happens to have a standard prologue, so we'll save the frame pointer in RBP, following the old convention. Thus, for the rest of the function we can define the CFA as RBP + 16 and manipulate the stack without having to worry about it again.

  mov     DWORD PTR [rbp-4], edi
  mov     eax, DWORD PTR [rbp-4]
  imul    eax, DWORD PTR [rbp-4]
  pop     rbp
  .cfi_def_cfa rsp, 8

We're getting ready to return from this function and, after restoring RBP from the stack, the old CFA expression is invalid because the value of RBP has changed. So we define it as RSP + 8 again.

  ret
  .cfi_endproc

At the end of the function we need to trigger the CFI table to be emitted. (It's an error if a CFI table is left open at the end of the file.)

The CFI tables for an object file can be dumped with objdump -W and, if you do that for the example above, you'll see two tables: something called a CIE and something called an FDE.

The CIE (Common Information Entry) table contains information common to all functions and it's worth taking a look at it:

… CIE
  Version:         1
  Augmentation:    "zR"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address column: 16
  Augmentation data:     1b

  DW_CFA_def_cfa: r7 (rsp) ofs 8
  DW_CFA_offset: r16 (rip) at cfa-8

You can ignore everything until the DW_CFA_… lines at the end. They define CFI directives that are common to all functions (that reference this CIE). The first is saying that the CFA is at RSP + 8, which is what we had already defined at function entry. This means that you don't need a CFI directive at the beginning of the function. Basically RSP + 8 is already the default.

The second directive is something that we'll get to when we discuss saving registers.

If we look at the FDE (Frame Description Entry) for the example function that we defined, we see that it reflects the CFI directives from the assembly:

… FDE cie=…
  DW_CFA_advance_loc: 1 to 0000000000000001
  DW_CFA_def_cfa: r7 (rsp) ofs 16
  DW_CFA_advance_loc: 3 to 0000000000000004
  DW_CFA_def_cfa: r6 (rbp) ofs 16
  DW_CFA_advance_loc: 11 to 000000000000000f
  DW_CFA_def_cfa: r7 (rsp) ofs 8

The FDE describes the range of instructions that it's valid for and is a series of operations to either update the CFA expression, or to skip over the next n bytes of instructions. Fairly obvious.

Optimisations for CFA directives

There are some shortcuts when writing CFA directives:

Firstly, you can update just the offset, or just the register, with cfi_def_cfa_offset and cfi_def_cfa_register respectively. This not only saves typing in the source file, it saves bytes in the table too.

Secondly, you can update the offset with a relative value using cfi_adjust_cfa_offset. This is useful when pushing lots of values to the stack as the offset will increase by eight each time.

Here's the example from above, but using these directives and omitting the first directive that we don't need because of the CIE:

  .globl  square
  .type   square,@function
  .hidden square
square:
  .cfi_startproc
  push    rbp
  .cfi_adjust_cfa_offset 8
  mov     rbp, rsp
  .cfi_def_cfa_register rbp
  mov     DWORD PTR [rbp-4], edi
  mov     eax, DWORD PTR [rbp-4]
  imul    eax, DWORD PTR [rbp-4]
  pop     rbp
  .cfi_def_cfa rsp, 8
  ret
  .cfi_endproc

Saving registers

Consider a profiler that is unwinding the stack after a profiling signal. It calculates the CFA of the active function and, from that, finds the parent function. Now it needs to calculate the parent function's CFA and, from the CFI tables, discovers that it's related to RBX. Since RBX is a callee-saved register, that's reasonable, but the active function might have stomped RBX. So, in order for the unwinding to proceed it needs a way to find where the active function saved the old value of RBX. So there are more CFI directives that let you document where registers have been saved.

Registers can either be saved at an offset from the CFA (i.e. on the stack), or in another register. Most of the time they'll be saved on the stack though because, if you had a caller-saved register to spare, you would be using it first.

To indicate that a register is saved on the stack, use cfi_offset. In the same example as above (see the stack diagram at the top) the caller's RBP is saved at CFA - 16 bytes. So, with saved registers annotated too, it would start like this:

square:
  .cfi_startproc
  push    rbp
  .cfi_adjust_cfa_offset 8
  .cfi_offset rbp, -16

If you need to save a register in another register for some reason, see the documentation for cfi_register.

If you get all of that correct then your debugger should be able to unwind crashes correctly, and your profiler should be able to avoid recording lots of detached functions. However, I'm afraid that I don't know of a better way to test this than to zero RBP, add a crash in the assembly code, and check whether GBD can go up correctly.

(None of this works for Windows. But Per Vognsen, via Twitter, notes that there are similar directives in MASM.)

CFI expressions

New in version three of the DWARF standard are CFI Expressions. These define a stack machine for calculating the CFA value and can be useful when your stack frame is non-standard (which is fairly common in assembly code). However, there's no assembler support for them that I've been able to find, so one has to use cfi_escape and provide the raw DWARF data in a .s file. As an example, see this kernel patch.

Since there's no assembler support, you'll need to read section 2.5 of the standard, then search for DW_CFA_def_cfa_expression and, perhaps, search for cfi_directive in OpenSSL's perlasm script for x86-64 and the places in OpenSSL where that is used. Good luck.

(I suggest testing by adding some instructions that write to NULL in the assembly code and checking that gdb can correctly step up the stack and that info reg shows the correct values for callee-saved registers in the parent frame.)

CFI register numbers

In case you need to use or read the raw register numbers, here they are for a few architectures:

(may be EBP on MacOS) (may be ESP on MacOS)
Register numberx86-64x86ARM
0RAXEAXr0
1RDXECXr1
2RCXEDXr2
3RBXEBXr3
4RSIESPr4
5RDIEBPr5
6RBPESIr6
7RSPEDIr7
8R8r8
9R9r9
10R10r10
11R11r11
12R12r12
13R13r13
14R14r14
15R15r15
16RIP

(x86 values taken from page 25 of this doc. x86-64 values from page 57 of this doc. ARM taken from page 7 of this doc.)

RISC-V assembly

RISC-V is a new, open instruction set. Fabrice Bellard wrote a Javascript emulator for it that boots Linux here (more info). I happen to have just gotten a physical chip that implements it too (one of these) and what's cool is that you can get the source code to the chip on GitHub.

The full, user-level instruction set is documented but there's a lot of information in there. I wanted a brief summary of things that I could keep in mind when reading disassembly. This blog post is just a dump of my notes; it's probably not very useful for most people! It also leaves out a lot of things that come up less frequently. There's also an unofficial reference card which is good, but still aimed a little low for me.

RISC-V is little-endian and comes in 32 and 64 bit flavours. In keeping with the RISC-V documents, the flavour (either 32 or 64) is called XLEN below in the few places where it matters.

For both, int is 32 bits. Pointers and long are the native register size. Signed values are always sign extended in a larger register and unsigned 8- and 16-bit values are zero extended. But unsigned 32-bit values are sign-extended. Everything has natural alignment in structs and the like, but alignment isn't required by the hardware.

The stack grows downwards and is 16-byte aligned upon entry to a function.

Instructions are all 32 bits and 32-bit aligned. There is a compressed-instruction extension that adds 16-bit instructions and changes the required alignment of all instructions to just 16 bits. Unlike Thumb on ARM, RISC-V compressed instructions are just a short-hand for some 32-bit instruction and 16- and 32-bit instructions can be mixed freely.

There are 31 general purpose registers called x1–x31. Register x0 drops all writes and always reads as zero. Each register is given an alias too, which describes their conventional use:

RegisterAliasDescriptionSaved by
x0zeroZero
x1raReturn addressCaller
x2spStack pointerCallee
x3gpGlobal pointer
x4tpThread pointer
x5–7t0–2TemporaryCaller
x8s0/fpSaved register / frame pointerCallee
x9s1Saved registerCallee
x10–17a1–a7Arguments and return valuesCaller
x18–27s2–11Saved registersCallee
x28–31t3–6TemporaryCaller

There are two types of immediates, which will be written as ‘I’ and ‘J’. Type ‘I’ intermediates are 12-bit, signed values and type ‘J’ are 20-bit, signed values. They are always sign-extended to the width of a register before use.

Arithmetic

There's no carry flag, instead one has to use a comparison instruction on the result in order to get the carry in a register.

A U suffix indicates an unsigned operation. Note that the multiplication and division instructions are part of the ‘M’ extension, but I expect most chips to have them.

InstructionEffectNotes
ADD dest,src1,src2dest = src1 + src2
SUB dest,src1,src2dest = src1 - src2
ADDI dest,src1,Idest = src1 + I
MUL dest,src1,src2dest = src1 × src2
MULH[|U|SH] dest,src1,src2dest = (src1 × src2) >> XLENThis returns the high word of the multiplication result for signed×signed, unsigned×unsigned or signed×unsigned, respectively
DIV[U] dest,src1,src2dest = src1/src2There's no trap on divide-by-zero, rather special values are returned. (See documentation.)
REM[U] dest,src1,src2dest = src1%src2

Bitwise

These should all be obvious:

Instruction
AND dest,src1,src2
OR dest,src1,src2
XOR dest,src1,src2
ANDI dest,src1,I
ORI dest,src1,I
XORI dest,src1,I

Shifts

Instructions here are Shift [Left|Right] [Logical|Arithmetic].

Note that only the minimal number of bits are read from the shift count. So shifting by the width of the register doesn't zero it, it does nothing.

InstructionEffectNotes
SLL dest,src1,src2dest = src1 << src2%XLEN
SRL dest,src1,src2dest = src1 >> src2%XLEN
SRA dest,src1,src2dest = src1 >> src2%XLEN
SLLI dest,src1,0‥31/63dest = src1 << I
SRLI dest,src1,0‥31/63dest = src1 >> I
SRAI dest,src1,0‥31/63dest = src1 >> I

Comparisons

These instructions set the destination register to one or zero depending on whether the relation is true or not.

InstructionEffectNotes
SLT dest,src1,src2dest = src1<src2(signed compare)
SLTU dest,src1,src2dest = src1<src2(unsigned compare)
SLTI dest,src1,Idest = src1<I(signed compare)
SLTIU dest,src1,Idest = src1<I(unsigned compare, but remember that the immediate is sign-extended first)

Control flow

The JAL[R] instructions write the address of the next instruction to the destination register for the subsequent return. This can be x0 if you don't care.

Many instructions here take an immediate that is doubled before use. That's because no instruction (even with compressed instructions) can ever start on an odd address. However, when you write the value in assembly you write the actual value; the assembler will halve it when encoding.

InstructionEffectNotes
JAL dest,Jdest = pc+2/4; pc+=2*J
JALR dest,src,Idest = pc+2/4; pc=src1+(I&~1)Note that the least-significant bit of the address is ignored
BEQ src1,src2,Iif src1==src2, pc+=2*I
BNE src1,src2,Iif src1!=src2, pc+=2*I
BLT src1,src2,Iif src1<src2, pc+=2*Isigned compare
BLTU src1,src2,Iif src1<src2, pc+=2*Iunsigned compare
BGE src1,src2,Iif src1>=src2, pc+=2*Isigned compare
BGEU src1,src2,Iif src1>=src2, pc+=2*Iunsigned compare

Memory

The suffix on these instructions denotes the size of the value read or written: ‘D’ = double-word (64 bits), ‘W’ = word (32 bits), ‘H’ = half-word (16 bits), ´B’ = byte. Reading and writing 64-bit values only works on 64-bit systems, and LWU only exists on 64-bit systems because it's the same as LW otherwise.

Alignment is not required, but might be faster. Also note that there's no consistent direction of data flow in the textual form of the assembly: the register always comes first.

InstructionEffectNotes
L[D|W|H|B] dest,I(src)dest = *(src + I)result is sign extended
L[D|W|H|B]U dest,I(src)dest = *(src + I)result is zero extended
S[D|W|H|B] src1,I(src2)*(src2 + I) = src1

Other

InstructionEffectNotes
LUI dest,Jdest = J<<12“Load Upper Immediate”. The result is sign extended on 64-bit.
AUIPC dest,Jdest = pc + J<<12“Add Upper Immediate to PC”. The result of the shift is sign-extended on 64-bit before the addition.

W instructions

These instructions only exist on 64-bit and are copies of the same instruction without the ‘W’ suffix, except that they operate only on the lower 32 bits of the registers and, when writing the result, sign-extend to 64 bits. They are ADDW, SUBW, ADDIW, DIV[U]W, REM[U]W, S[L|R]LW, SRAW, S[L|R]LIW and SRAIW

Pseudo-instructions

For clarity, there are also a number of pseudo-instructions defined. These are just syntactic sugar for one of the primitive instructions, above. Here's a handful of the more useful ones:

Pseudo-instructionTranslation
nopaddi x0,x0,0
mv dest,srcaddi dest,src,0
not dest,srcxor dest,src,-1
seqz dest,srcsltiu dest,src,1
snez dest,srcsltu dest,x0,src
j Jjal x0,J
retjalr x0,x1,0
call offsetauipc x6, offset >> 12; jalr x1,x6,offset & 0xfff
li dest,value(possibly several instructions to load arbitrary value)
la dest,symbolauipc dest, symbol >> 12; addi dest,dest,offset & 0xfff
l[d|w|h|b] dest,symbolauipc dest, symbol >> 12; lx dest,offset & 0xfff(dest)
s[d|w|h|b] src,symbolauipc dest, symbol >> 12; sx dest,offset & 0xfff(dest)

Note that the instructions that expand to two instructions where the first is AUIPC aren't quite as simple as they appear. Since the 12-bit immediate in the second instruction is sign extended, it could end up negative. If that happens, the J immediate for AUIPC needs to be one greater and then you can reach the value by subtracting from there.

CECPQ1 results

In July my colleague, Matt Braithwaite, announced that Chrome and Google would be experimenting with a post-quantum key-agreement primitive in TLS. One should read the original announcement for details, but we had two goals for this experiment:

Firstly we wanted to direct cryptoanalytic attention at the family of Ring Learning-with-Errors (RLWE) problems. The algorithm that we used, NewHope, is part of this family and appeared to be the most promising when we selected one at the end of 2015.

It's very difficult to know whether we had any impact here, but it's good to see the recent publication and withdrawal of a paper describing a quantum attack on a fundamental lattice problem. Although the algorithm contained an error, it still shows that capable people are testing these new foundations.

Our second goal was to measure the feasibility of deploying post-quantum key-agreement in TLS by combining NewHope with an existing key-agreement (X25519). We called the combination CECPQ1.

TLS key agreements have never been so large and we expected a latency impact from the extra network traffic. Also, any incompatibilities with middleboxes can take years to sort out, so it's best to discover them as early as possible.

Here the results are more concrete: we did not find any unexpected impediment to deploying something like NewHope. There were no reported problems caused by enabling it.

Although the median connection latency only increased by a millisecond, the latency for the slowest 5% increased by 20ms and, for the slowest 1%, by 150ms. Since NewHope is computationally inexpensive, we're assuming that this is caused entirely by the increased message sizes. Since connection latencies compound on the web (because subresource discovery is delayed), the data requirement of NewHope is moderately expensive for people on slower connections.

None the less, if the need arose, it would be practical to quickly deploy NewHope in TLS 1.2. (TLS 1.3 makes things a little more complex and we did not test with CECPQ1 with it.)

At this point the experiment is concluded. We do not want to promote CECPQ1 as a de-facto standard and so a future Chrome update will disable CECPQ1 support. It's likely that TLS will want a post-quantum key-agreement in the future but a more multilateral approach is preferable for something intended to be more than an experiment.

Roughtime

Security protocols often assume an accurate, local clock (e.g. TLS, Kerberos, DNSSEC and more). It's a widely accepted assumption when designing protocols but, for a lot of people, it just isn't true. We find good evidence that at least 25% of all certificate errors in Chrome are due to a bad local clock.

Even when the local clock is being synchronised, it's very likely to be using unauthenticated NTP. So if your threat model includes man-in-the-middle attackers then you still can't trust the local clock.

There have been efforts to augment NTP with authentication, but they still assume a world where each client trusts one or more time servers absolutely. In order to explore a solution that allows time servers to be effectively audited by clients, myself and my colleague Matt Braithwaite (with assistance and advice from Ben Laurie and Michael Shields) have developed Roughtime.

Very briefly: using some tricks we believe that it's viable to deploy servers that sign a client-chosen nonce and timestamp on demand. Once you have several of these servers, clients can generate their nonces by hashing replies from other servers with some entropy. That proves that a nonce was created after the reply was received. Clients maintain a chain of nonces and replies and, if a server misbehaves, can use replies from several other servers to prove and report it.

Currently there's only one Roughtime service running, so the idea of spreading trust around is inchoate. But we would like to gauge whether others are interested in this idea, specifically whether there are any organisations who would be seriously interested in deploying something like this in their clients. (Because I assume that, if you have clients, then you'll also be interested in running a server.)

There's a much longer introduction and many more details on the Roughtime site and we've also setup a mailing list.

memcpy (and friends) with NULL pointers

The C standard (ISO/IEC 9899:2011) has a sane-seeming definition of memcpy (section 7.24.2.1):

The memcpy function copies n characters from the object pointed to by s2 into the object pointed to by s1.

Apart from a prohibition on passing overlapping objects, I think every C programmer understands that.

However, the standard also says (section 7.1.4):

If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after promotion) not expected by a function with variable number of arguments, the behavior is undefined.

(Emphasis is mine.)

I'm sure that 7.1.4 seemed quite reasonable in isolation, but how does it interact with the case where memcpy is called with a zero length? If you read 7.24.2.1 then you might well think that, since the function copies zero bytes, it's valid to pass NULL as either of the pointer arguments. I claim that the vast majority of C programmers would agree with that, but 7.24.1(2) clarifies that 7.1.4 really does apply:

Where an argument declared as size_t n specifies the length of the array for a function, n can have the value zero […] pointer arguments on such a call shall still have valid values, as described in 7.1.4.

(Nobody would actually write memcpy(NULL, NULL, 0), of course, because it (at best) does nothing. But such a call can easily arise at run-time when an empty object is handled by a more general function.)

Some compilers will use this corner of the standard to assume that pointers passed to memcpy are non-NULL, irrespective of the length argument. GCC has built this in, while Clang can get it from the fact that glibc annotates memcpy with nonnull specifications.

Consider the following function:

#include <stdint.h>
#include <string.h>

int f(uint8_t *dest, uint8_t *src, size_t len) {
  memcpy(dest, src, len);
  return dest == NULL;
}

Here's the output of building that with GCC 6.1.1 with -O2:

0000000000000000 <f>:
   0:	48 83 ec 08          	sub    rsp,0x8
   4:	e8 00 00 00 00       	call   9 <f+0x9>  # memcpy
   9:	31 c0                	xor    eax,eax
   b:	48 83 c4 08          	add    rsp,0x8
   f:	c3                   	ret

From that we can see that rax (which holds the return value of a function in the amd64 ABI) is unconditionally set to zero, i.e. the compiler has assumed that dest == NULL is false because it has been passed to memcpy. The compiler's reasoning goes like this: 7.1.4 says that passing a NULL pointer to a standard library function is undefined behaviour, therefore if dest was NULL any behaviour is reasonable. So the code can be optimised with the assumption that it's non-NULL, as that's the only case with defined behaviour.

(You can also play with this snippet in Matt Godbolt's excellent tool.)

Opinions on this vary from “the C standard defines the language thus that optimisation is fine by definition” to “that's crazy: there's a huge amount of code out there that probably assumes the obvious behaviour of memcpy”. Personally, I find myself further towards the latter position than the former.

Also, it's not just memcpy: the same optimisations are annotated in glibc for (at least) memccpy, memset, memcmp, memchr, memrchr, memmem, mempcpy, bcopy and bcmp. Section 7.1.4 can be applied to any standard library function.

Measurement

To try and figure out the impact that this optimisation is having I built a number of open-source programs with GCC 6.1.1, with -fno-builtin (to disable GCC's built-in versions of these functions) and with glibc's string.h including, or not, the nonnull annotations. For example, the snippet of code above produces this diff when tested this way:

0000000000000000 <f>:
-   0:	48 83 ec 08          	sub    rsp,0x8
+   0:	53                   	push   rbx
+   1:	48 89 fb             	mov    rbx,rdi
    4:	e8 00 00 00 00       	call   9 <f+0x9>
    9:	31 c0                	xor    eax,eax
-   b:	48 83 c4 08          	add    rsp,0x8
-   f:	c3                   	ret    
+   b:	48 85 db             	test   rbx,rbx
+   e:	0f 94 c0             	sete   al
+  11:	5b                   	pop    rbx
+  12:	c3                   	ret    

The added code tests dest to set the return value, as intended.

The first program I tested was BIND 9.9.5 because of this advisory that says: “GCC now includes (by default) an optimization which is intended to eliminate unnecessary null pointer comparisons in compiled code. Unfortunately this optimization removes checks which are necessary in BIND and the demonstrated effect is to cause unpredictable assertion failures during execution of named, resulting in termination of the server process”. Although version 9.9.5 should be affected according to the advisory, I found no differences in the compiled output based on nonnull annotations in string.h. Perhaps it's because I'm using a different GCC, perhaps I just got something wrong in my testing, or perhaps these checks were eliminated for different reasons. (For example, a local root exploit in the kernel was enabled by a dereference-based removal of a NULL check.)

Next up, I tried something that I'm more involved with: BoringSSL. Here there are two changes: a reordering of two conditions in OPENSSL_realloc_clean (which has no semantic effect) and extensive changes in BN_mod_exp_mont. I'm sure I would be able to do a better analysis if I were more experienced with disassembling large programs, but I'm just using objdump and diff. Still, I believe that all the changes are the result of a single NULL check being removed and then the resulting offset shift of all the following code. That counts as an optimisation, but it's statically clear that the pointer cannot be NULL even without any assumptions about string.h functions so I struggle to give much credit.

Since BoringSSL showed some changes, I tried OpenSSL 1.0.2h. This also shows the same large changes around BN_mod_exp_mont. There's also a large change in dsa_builtin_paramgen2 (a function that we don't have in BoringSSL) but that appears to be another insignificant NULL-check removed and a consequent change of all the later offsets. Lastly, there are a handful of no-op changes: like swapping the arguments to cmp before jne.

Next I tried openssh-7.2p2, which shows no changes. I wondered whether someone had already done this analysis and corrected any problems in OpenSSH so tried a much older version too: 5.4p1. That does show a small, but non-trivial, change in ssh_rsa_verify. After a bit of thought, I believe that GCC has managed to eliminate a test for a non-NULL pointer at the end of openssh_RSA_verify. Just like the BoringSSL case, it's already possible to deduce that the pointer must be non-NULL without any section 7.1.4 assumptions.

Conclusions

It's clear that one has to write C code that's resilient to the compiler assuming that any pointers passed to standard library functions are non-NULL. There have been too many releases of glibc and GCC with this in to safely assume that it'll ever go away.

However, the benefits of this (i.e. the optimisations that the compiler can perform because of it) are nearly zero. Large programs can be built where it has no effect. When there are changes they are either cases that the compiler should have been able to figure out anyway, or else noise changes with no effect.

As for the costs: there have been several cases where removing NULL checks has resulted in a security vulnerability, although I can't find any cases of this precise corner of the C standard causing it. It also adds a very subtle, exceptional case to several very common functions, burdening programmers. But it thankfully rarely seems to make a difference in real-life code, so hopefully there's not a large pool of bugs in legacy code that have been created by this change.

Still, given the huge amount of legacy C code that exists, this optimisation seems unwise to me. Although I've little hope of it happening, I'd suggest that GCC and glibc remove these assumptions and that the next revision of the C standard change 7.24.1(2) to clarify that when a length is zero, pointers can be NULL.

If anyone wishes to check my results here, I've put the scripts that I used on GitHub. I'm afraid that it takes a bit of manual setup and, given variation in GCC versions across systems, some differences are to be expected but the results should be reproducible.

Cryptographic Agility

(These are notes that I wrote up from a talk that I gave at the National Academies Forum on Cyber Resilience. You can tell that it was in Washington, DC because of the “cyber”.

I wasn't quite sure how technical to pitch this talk so it's relatively introductory; regular readers probably know all this.

This isn't a transcript of what I said, but I try to hit the main points in my notes.)

Firstly I'd like to separate extensibility from agility. A protocol is extensible if you can add features to it without having to update every implementation at the same time—which is generally impossible. Cryptographic agility depends on having extensibility, at least if you ever want to use something that wasn't designed into a protocol at the beginning.

Protocols should be extensible: the world keeps changing and no design is going to be perfect for all time. But extensibility is much harder in practice than it sounds.

I happen to be particularly familiar with TLS and TLS has two, major extensibility mechanisms. The first is a simple version number. Here's how the specification says that it should work:

Client: I support up to version 1.2.

Server: (calculates the minimum of the version that the client supports and the maximum version that the server supports) Ok, let's use version 1.1.

This is commendably simple: it's not possible to express a range of versions and certainly not a discontinuous range. This is about as simple as an extensibility mechanism could be, and yet lots of implementations get it wrong. It's a common mistake for implementations to return an error when the client offers a version number that they don't understand.

This, of course, means that deploying new versions doesn't work. But it's insidious because the server will work fine until someone tries to deploy a new version. We thought that we had flexibility in the protocol but it turned out that bugs in code had rusted it in place.

At this point it's worth recalling the Law of the Internet: blame attaches to the last thing that changed. If Chrome updates and then something stops working then Chrome gets the blame. It doesn't matter that the server couldn't correctly calculate the minimum of two numbers. No normal person understands or cares about that.

What's to be done about this? Well, we work around issues if they're big and suck up the breakage if they're small. It's taken about 15 years to get to the point where web browsers don't have to work around broken version negotiation in TLS and that's mostly because we only have three active versions of TLS. When we try to add a fourth (TLS 1.3) in the next year, we'll have to add back the workaround, no doubt. In summary, this extensibility mechanism hasn't worked well because it's rarely used and that lets bugs thrive.

TLS has a second, major extension mechanism which is a series of (key, value) pairs where servers should ignore unknown keys. This has worked a little better because, while there are only three or four versions in play, with many years between versions, there are 25 to 30 extensions defined. It's not perfect: bugs in implementations have led them to be dependent on the order of extensions and somebody at least managed to write a server that breaks if the last value is empty.

Sometimes more extensibility points have been added inside of extensions in the expectation that it'll save adding another, top-level extension in the future. This has generally been a mistake: these extension points have added complexity for little reason and, when we try to use them, we often find that bugs have rusted them solid anyway. They've just been a waste.

There's a lesson in all this: have one joint and keep it well oiled.

Protocol designers underestimate how badly people will implement their designs. Writing down how you think it should work and hoping that it'll work, doesn't work. TLS's protocol negotiation is trivial and the specification is clear, yet it still didn't work in practice because it's difficult to oil.

Rather one needs to minimise complexity, concentrate all extensibility in a single place and actively defend it. An active defense can take many forms: fuzzing the extensibility system in test suites and compliance testing is good. You might want to define and implement dummy extensions once a year or such, and retire old ones on a similar schedule. When extensions contain lists of values, define a range of values that clients insert at random. In short, be creative otherwise you'll find that bug rust will quickly settle in.

Agility itself

Cryptographic agility is a huge cost. Implementing and supporting multiple algorithms means more code. More code begets more bugs. More things in general means less academic focus on any one thing, and less testing and code-review per thing. Any increase in the number of options also means more combinations and a higher chance for a bad interaction to arise.

Let's just consider symmetric ciphers for a moment. Because everyone wants them to be as fast as possible, BoringSSL currently contains 27 thousand lines of Perl scripts (taken from OpenSSL, who wrote them all) that generate assembly code just in order to implement AES-GCM. That's a tremendous amount of work and a tremendous scope for bugs.

Focusing again on TLS: over the years, 25 different ciphers and modes have been specified for use in TLS. Thankfully, of those, only nine are actively used. But that doesn't mean that the zombies of the others might not still be lurking around, ready to cause problems.

Where did this mess of diversity come from?

1. Old age / we had no idea what we were doing in the 1990's:

3DES_EDE_CBC AES_128_CBC AES_256_CBC DES40_CBC
DES_CBC DES_CBC_40 IDEA_CBC NULL
RC2_CBC_40 RC4_128 RC4_40

A lot of mistakes were made in the 1990's—we really didn't know what we were doing. Phil Rogaway did, but sadly not enough people listened to him; probably because they were busy fighting the US Government which was trying to ban the whole field of study at the time. Unfortunately that coincided with the early inflation period of the internet and a lot of those mistakes were embedded pretty deeply. We're still living with them today.

2. National pride cipher suites

ARIA_128_CBC ARIA_128_GCM ARIA_256_CBC ARIA_256_GCM
CAMELLIA_128_CBC CAMELLIA_128_GCM CAMELLIA_256_CBC CAMELLIA_256_GCM
SEED_CBC

The next cause of excess agility are the national pride cipher suites. Many countries consider cryptography to be an area of national interest but then mistakenly believe that means that they have to invent their own standards and primitives. South Korea and Japan were especially forthright about this and so managed to get these ciphers assigned code points in TLS but Russia and China and, to some extent, many other countries do the same thing.

Although they receive limited analysis compared to something like AES, they're generally not bad, per se, but they bring nothing new to the table: they add nothing but costs, and the costs are significant. Cryptographic diversity for the point of national pride should be strenuously resisted for that reason. Other countries may complain that the US got their standards widely used but the US got to specify a lot about the internet by being the first mover. (And AES is from Belgium anyway.) However, it is the case that I'm not aware of any of these national standards being used to promote something that's actually a deliberate backdoor; which is, of course, not true of the US.

3. Reasonable cases for diversity:

  • Embedded systems want to minimise circuit size: AES_128_CCM and AES_256_CCM.
  • We want something faster for when we don't have AES hardware: CHACHA20_POLY1305.
  • US Government standard, got hardware support from Intel: AES_128_GCM and AES_256_GCM.

Now we come to the ones that are reasonable to use and the reasons for diversity there. It's all about performance optimisation for different environments really: tiny devices want CCM because it only needs an AES-encrypt circuit. Devices without hardware support for AES-GCM want to use ChaCha20-Poly1305 because it's much more efficient in software. Everything else wants to use AES-GCM.

Agility has allowed us to introduce the ciphers in the final set and that's really important. But it's equally important to kill off the old stuff, and that's very hard. Nearly all the incentives are aligned against it. Recall the Law of the Internet (mentioned above); users hate stuff breaking and always blame you. Even djb will take to Twitter when one drops DSA support.

We have a long conveyor belt of primitives, we put new ones at the front and, every so often, we turn the crank and something drops off the end. In addition to all the obvious problems with killing off old stuff, that also means that there's a lot of inadvisable options that will generally function at any given time and this is leading to new products launching with no idea that they're sitting towards the end of this conveyor belt. These products expect a lifetime of some number of years and are unaware that we hope to discontinue something that they're using much sooner than that. It's no longer the case that we can assume that waiting a year will result in a reduction of the amount of use that a deprecated primitive gets because of new launches.

Google tries to address this where it can by requiring support for the newest options in our certification process for devices that interact with our services. But only a tiny subset of the things that interact with Google go through any of our certifications.

Things are even harder in non-interactive cases. TLS at least gets to negotiate between the client and server but algorithms in S/MIME messages and certificate signatures don't allow that. (One can think of ways to help change that, but the current reality is that they're not negotiated.) That's why dropping SHA-1 support in certificates has been a such a gruesome fight and why PKCS#8 messages still require us to support 40-bit RC2.

So what's the lesson here? I'd say that you need extensibility but, when it comes to cryptographic agility, have one option. Maybe two. Fight to keep it that small.

It's worth highlighting that, for the purposes of time, I've simplified things dramatically. I've considered only symmetric ciphers and modes above but, even within TLS, there's a whole separate conveyor belt for asymmetric algorithms. And I've not mentioned the oncoming storm of quantum computers. Quantum computers are going to be hilarious and I hope to be retired before they get big enough to cause problems!

Post-quantum key agreement

If large quantum computers can be built (and not of the D-Wave variety) then they'll make quite a mess of public-key cryptography. RSA and systems based on discrete logarithms (i.e. finite-field and elliptic-curve Diffie-Hellman) are all broken. I've written about hash-based signatures (which resist quantum attacks) before and the recent PQCRYPTO recommendations suggest those for signatures and McEliece for public-key encryption. Those are both sound, conservative recommendations but both have some size problems: McEliece public keys can be on the order of a megabyte of data and conservative hash-based signatures are about 40KB.

In some situations that's not a problem, and things like firmware signatures, where the key is embedded in hard-to-change silicon, should consider using hash-based signatures today. But those costs motivate the search for post-quantum schemes that are closer to the small, fast primitives that we have today.

One candidate is called Ring Learning-With-Errors (RLWE) and I'll try to give a taste of how it works in this post. This is strongly based on the A New Hope paper by Alkim, Ducas, Pöppelmann & Schwabe, and, in turn, on papers by Bos, Costello, Naehrig & Stebila and Peikert.

Firstly, the basic stats (because I always hate having to dig around for these values in papers). I've included the numbers for a current, elliptic-curve based Diffie-Hellman scheme (X25519) for comparision:

A New HopeX25519
Alice's transmission size1,824 bytesa32 bytes
Alice's computation129,638 cyclesb331,568 cyclesc
Bob's transmission size1,824 bytes32 bytes
Bob's computation126,236 cycles331,568 cycles

(a This is using a more compact scheme than in the paper. b These are Haswell cycle counts from the paper. c These are values from the SUPERCOP benchmark on titan0.)

Something to keep in mind when looking at the numbers above: sending 1,824 bytes on a 10MBit link takes 5.1 million cycles, assuming that your CPU is running at 3.5GHz.

RLWE key agreement

Our fundamental setting is ℤ12289[X]/(X1024+1). That's the set of polynomial equations where the largest power of x is 1023 and the coefficients are values between zero and 12288 (inclusive). For example, 66 + 4532x + 10000x2 + … + 872x1023.

Addition and multiplication can be defined for polynomials in this set. Addition is done by matching up powers of x and adding the corresponding coefficients. If the result is out of range then it's reduced modulo 12289.

Multiplication is high-school polynomial multiplication where the polynomial on the right is multiplied by every term of the polynomial on the left and the result is the sum of those terms. Coefficients are reduced modulo 12289 to keep them in range, but it's likely that the resulting powers of x will be too large—multiplying two x1023 terms gives a result in terms of x2046.

Polynomials with too large a degree can be reduced modulo x1024+1 until they're in range again. So, if we ended up with a term of a×x2046 then we could subtract a×x1022(x1024+1) to eliminate it. By repeated application of that trick, all the terms with powers of x greater than 1023 can be eliminated and then the result is back in the set.

Now that we can add and multiply within this set of polynomials we need to define a noise polynomial: this is simply a polynomial where each coefficient is sampled from a random distribution. In this case, the distribution will be a centered binomial distribution that ranges from -12 to 12. The probability density looks like this:

image/svg+xml Gnuplot Gnuplot Produced by GNUPLOT 5.0 patchlevel 1 0 -15 -10 -5 0 5 10 15

An important feature of noise polynomials is that the magnitude of each coefficient is small. That will be critical later on.

A random polynomial is one where each coefficient is sampled from a uniform distribution over the full range of zero to 12288.

To build a Diffie-Hellman protocol from this, Alice generates a random polynomial, a, and two noise polynomials, s and e. Alice calculates b = as+e and sends a and b to Bob. Bob generates his own s′ and e′, uses Alice's a to calculate u = as′+e′, and sends u back to Alice. Now Alice can calculate us = (as′+e′)s = as′s+e′s and Bob can calculate bs′ = (as+e)s′ = ass′+es′. But the keen-eyed will notice that those are different values!

The added noise is necessary for security but it means that the two sides to this protocol calculate different values. But, while the values are different, they're very similar because the magnitude of the noise polynomials is small. So a reconciliation mechanism is needed to find a shared value given two, similar polynomials.

Reconciliation

So far I've been following the A New Hope paper and it does include a reconciliation mechanism. But, to be honest, I'm not sure that I understand it, so I'm going to be describing a mechanism by Peikert here:

The reconciliation will treat each coefficient in the similar polynomials separately and will extract a single, shared bit from each. Since we're dealing with polynomials that have 1024 terms, we'll get a 1024-bit shared secret in total but I'm just going to discuss the process of processing a single coefficient to get a single bit.

Consider the coefficient space as a circle: zero and 12289 are the same value modulo 12289 and we put that at the top of the circle. At the bottom of the circle will be 12289/2 = 6144 (rounding down). We know that, for each coefficient, Alice and Bob will have similar values—meaning that the values will be close by on the circle.

image/svg+xml 0 6144

One option is to split the circle into left and right halves and say that if the point is in the left half then it's a zero, otherwise it's a one.

image/svg+xml 0 6144 0 1

But while that will work most of the time, there's obviously a problem when the points are near the top or the bottom. In these cases, a small difference in location can result in a point being in a different half, and thus Alice and Bob will get a different result.

In this case we want to split the circle into top (zero) and bottom (one), so that both points are clearly in the bottom half.

image/svg+xml 0 6144 0 1

But that just moves the problem around to the left and right edges of the circle. So how about we vary the basis in which we measure the points depending where they are? If the point is near the bottom or the top we'll use the top–bottom (blue) basis and, if not, we'll use the left–right (red) basis.

image/svg+xml 0 6144

But there's still a problem! Consider the two points in the diagram just above. One party will think that it's in a red area, measure it left–right and conclude that the shared bit is a zero. The other will think it's in a blue area, measure it top–bottom and conclude that the shared bit is a one.

There isn't a solution to this in which the parties operate independently so, instead, one of the parties chooses the basis in which each coefficient will be measured. This information is a single bit of information (i.e. red or blue) that we have Bob send to Alice along with his value u. With this reconciliation information in hand, both parties can measure their points in the same, optimal basis and calculate the same, shared value.

Optimisations

There's lots in the paper that I've skipped over here. Most importantly (for performance), a variant of the Fourier transform can be used to convert the polynomials into the frequency domain where multiplication is much faster. Some of the values transmitted can be transmitted in the frequency domain too to save conversions overall. Also, random polynomials can be sampled directly in the frequency domain.

The parameters here have also been carefully selected so that the reduction stage of multiplication happens magically, just by point-wise multiplication of a some constants before and after the transform.

The a value that Alice generates could be a global constant, but in order to save worrying about how it was generated, and to eliminate the possibility of all-for-the-price-of-one attacks (like LogJam), it's generated fresh for each instance. Rather than transmit it in full, Alice need only send a seed value for it to Bob.

Contrasts with Diffie-Hellman

The most important change from Diffie-Hellman is that this scheme requires that all values be ephemeral. Both QUIC and TLS 1.3 assume that a Diffie-Hellman public-value can be reused but, in this scheme, that breaks the security of the system. More traditional uses of Diffie-Hellman, e.g. TLS 1.2 and SSH, are fine though.

Another important change is that this scheme takes a full round-trip for Alice. With Diffie-Hellman, both parties can transmit a message at time zero and as soon as the other party has received the message, they can calculate the shared key and start transmitting. But even if the a value is a global constant in this scheme, the reconciliation process means that Bob can't send a message until he's received Alice's message, and Alice can't calculate the shared key until she has received Bob's message. So Alice has a wait a full round-trip.

Often that limitation isn't important because other parts of the protocol already require the round-trip (for example, in TLS). But for some uses it's a critical difference.

Also, since this protocol involves random noise it has a failure probability: it's possible that the reconciliation mechanism produces different answers for each side. Thankfully this probability can be made negligible (i.e. less than one in 264).

I should also note that the RLWE problem is hypothesised to be resistant to quantum attack, but we don't know that. We also don't know that it's resistant to attacks by classical computers! It's possible that someone will develop a classical algorithm tomorrow that breaks the scheme above. Thus it should be used concurrently with a standard Diffie-Hellman (e.g. X25519) and the outputs of each should concatenated as the input keying material for a KDF.

Juniper: recording some Twitter conversations

Update: Ralf wrote up some notes from his work. These now include an update themselves with information from Willem Pinckaers that suggests that the presumed Dual-EC output is exposed to the world in Juniper devices.

On Thursday, Juniper announced that some of their products were affected by “unauthorized code in ScreenOS that could allow a knowledgeable attacker to gain administrative access to NetScreen® devices and to decrypt VPN connections”. That sounds like an attacker managed to subvert Juniper's source code repository and insert a backdoor. Of course, any glimpses that we get of these sorts of attacks are fascinating.

Juniper followed up with a slightly more detailed post that noted that there were two backdoors: one via SSH and one that “may allow a knowledgeable attacker who can monitor VPN traffic to decrypt that traffic”. Either of these would be very interesting to a nation-state attacker but that latter—passive decryption of VPN connections—is really in their neighborhood.

So, of course, smarter people than I quickly took to Twitter to pull apart the differences in the fixed firmware versions. Since Twitter conversations are terrible to try and pick apart after the fact, I'm writing down the gist of things here. But I'm just the scribe in this case; other people did the work.

One of the first things that people focused on was a difference to a large, hex value that was visible by just diffing the strings of the two firmwares. That change is interesting not just because it's a large, opaque hex string in a binary, but because of the hex strings that immediately precede it. Specially they were:

  • FFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF: this is the prime order of the underlying field of P-256, a standard elliptic curve.
  • FFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFC: P-256 is typically written in short-Weierstrass form: y2=x3+ax+b. This is then the a value for P-256.
  • 5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B: This is the b value for the P-256 equation.
  • 6B17D1F2E12C4247F8BCE6E563A440F277037D812DEB33A0F4A13945D898C296: This is the x coordinate for the standard generator of P-256—the starting point for operations on the curve.
  • FFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551: This is the number of points on P-256.

So all the values just before the changed one are constants for P-256, suggesting that the changed value is cryptographic too. The obvious, missing value would be the y coordinate for the standard generator. One possibility was that the attack put in the wrong y value. This could put the generator on the wrong curve, say a weaker curve that shares most of the same parameters as P-256 but with a different value for b. But the curve that would have resulted, while weaker, wasn't real-time-passive-decryption weak. Also the replacement value in the fixed version wasn't the standard y value either.

Ralf-Philipp Weinmann was looking at the code itself and found:

That means that the changed value is an x coordinate and that the code was calculating the y value from it given the curve equation. Thus it would only need the x values and the points would always be on the correct curve. So perhaps it's a public key for something?

Changing a public key could easily be a big backdoor, but recall that the result here is somehow passive decryption of VPN traffic. It's unclear how changing a public key could result in passive decryption.

Oh dear. To explain: “EC PRNG” suggests that the value might be a constant in an elliptic-curve based pseudo-random number generator. That could certainly explain how passive decryption of VPN traffic was possible because it brings up memories of Dual-EC. Dual-EC was an NSA effort to introduce a backdoored pseudo-random number generator (PRNG) that, given knowledge of a secret key, allowed an attacker to observe output from the RNG and then predict its future output. If an attacker can predict the output of the PRNG then they can know the keys that one or both sides of a VPN connection will choose and decrypt it. (For more details, see the research paper.)

Indeed, it quickly came to light that Juniper have a page where they say that the VPN devices in question here “do utilize Dual_EC_DRBG, but do not use the pre-defined points cited by NIST”. In short, they used a backdoored RNG but changed the locks. Then this attack might be explained by saying that someone broke in and changed the locks again.

We're not sure that's actually what happened, but it seems like a reasonable hypothesis at this point. If it's correct, this is fairly bananas. Dual-EC is not a reasonable RNG: it's massively larger, slower and more complex than standard RNGs. It's output isn't even very uniform. Huge compromises were made in its design in order to meet its primary objective: to be a NOBUS, passive backdoor. (“NOBUS” is an intelligence community term for “nobody but us”, i.e. other parties shouldn't be able to use the backdoor.) Why would it be used in ScreenOS in the first place?

Again, assuming this hypothesis is correct then, if it wasn't the NSA who did this, we have a case where a US government backdoor effort (Dual-EC) laid the groundwork for someone else to attack US interests. Certainly this attack would be a lot easier given the presence of a backdoor-friendly RNG already in place. And I've not even discussed the SSH backdoor which, as Wired notes, could have been the work of a different group entirely. That backdoor certainly isn't NOBUS—Fox-IT claim to have found the backdoor password in six hours.

BoringSSL

We recently switched Google's two billion line repository over to BoringSSL, our fork of OpenSSL. This means that BoringSSL is now powering Chromium (on nearly all platforms), Android M and Google's production services. For the first time, the majority of Google's products are sharing a single TLS stack and making changes no longer involves several days of work juggling patch files across multiple repositories.

This is a big positive for Google and I'm going to document some of the changes that we've made in BoringSSL in this post. I am not saying that people should be ditching OpenSSL and switching to BoringSSL. For Linux distributions that doesn't even make sense because we've removed too much for many applications to run unaltered and, without linker trickery, it's not possible to have both OpenSSL and BoringSSL in the same process because their symbols will collide. Even if you're in the position of shipping your own TLS stack with your code, you should still heed the warnings in the README well.

OpenSSL have considerably improved their processes since last April, which is great and important because huge swathes of the Internet will continue to depend on it. BoringSSL started before those changes but, even taking them into consideration, I'm still happy with my decision to fork. (But note that Google employs OpenSSL team members Emilia Käsper, Bodo Möller and Ben Laurie and contributes monetarily via the Core Infrastructure Initiative, so we haven't dropped our support of OpenSSL as a project.)

With that in mind, I'm going to mention some of the cleanups that we've done in BoringSSL from the lowest level, upwards. While most people should continue to use OpenSSL, there are lots of developers outside of Google who work on Chromium and Android and thus this document shouldn't be internal to Google. This post may seem critical of OpenSSL, but remember that many of these changes are possible because we only have to worry about Google's needs—we have an order of magnitude fewer platforms and configurations to support than OpenSSL and we don't keep any ABI compatibility. We also have the superpower of being able to change, where needed, the code that calls BoringSSL, so you can't really compare the two.

The “we”, above, is primarily myself and my colleagues David Benjamin and Matt Braithwaite. But BoringSSL is open source and Brian Smith has clocked up 55 patches and we've also had contributions from Opera and CloudFlare. (Brian's number would be higher if I had had more time to review his pending changes in the past couple of weeks).

“Forking”

Generally when people say “forking” they mean that they took a copy of the code and started landing patches independently of the original source. That's not what we did with BoringSSL. Rather than start with a copy, I started with an empty directory and went through OpenSSL function-by-function, reformatting, cleaning up (sometimes discarding) and documenting each one. So BoringSSL headers and sources look like this rather than this. The comments in BoringSSL headers can be extracted by a tool to produce documentation of a sort. (Although it could do with a make-over.)

(Clang's formatting tool and its Vim integration are very helpful! It's been the biggest improvement in my code-editing experience in many years.)

For much of the code, lengths were converted from ints to size_ts and functions that returned one, zero or minus one were converted to just returning one or zero. (Not handling a minus one return value is an easy and dangerous mistake.)

I didn't always get everything right: sometimes I discarded a function that we later found we actually needed or I changed something that, on balance, wasn't worth the changes required in other code. Where possible, code that we've needed to bring back has gone into a separate section called “decrepit” which isn't built in Chromium or Android.

But large amounts of OpenSSL could simply be discarded given our more limited scope. All the following were simply never copied into the main BoringSSL: Blowfish, Camllia, CMS, compression, the ENGINE code, IDEA, JPAKE, Kerberos, MD2, MDC2, OCSP, PKCS#7, RC5, RIPE-MD, SEED, SRP, timestamping and Whirlpool. The OpenSSL that we started from has about 468,000 lines of code but, today, even with the things that we've added (including tests) BoringSSL is just 200,000. Even projects that were using OpenSSL's OPENSSL_NO_x defines to exclude functionality at compile time have seen binaries sizes drop by 300KB when switching to BoringSSL.

Some important bits of OpenSSL are too large to bite off all at once, however. The SSL, ASN.1 and X.509 code were “forked” in the traditional sense: they were copied with minimal changes and improved incrementally. (Or, in the case of ASN.1 and X.509, left alone until they could be replaced completely.)

The lowest-levels

OpenSSL has a confusing number of initialisation functions. Code that uses OpenSSL generally takes a shotgun approach to calling some subset of OpenSSL_­add_­all_­algorithms, SSL_­library_­init, ERR_­load_­crypto_­strings and the deprecated SSLeay aliases of the same. BoringSSL doesn't need any of them; everything works immediately and the errors don't print out funny just because you forgot to load the error strings. If, like Chromium, you care about avoiding static initialisation (because every disk seek to load pages of code delays displaying the window at startup) then you can build with BORINGSSL_­NO_­STATIC_­INITIALIZER and initialise the library when you need with CRYPTO_­library_­init. But the vast majority of code just wants to avoid having to think about it. In the future, we would like to move to an automatic lazy-init which would solve even Chromium's needs.

OpenSSL and BoringSSL are often built into shared libraries, but OpenSSL doesn't have any visibility annotations. By default symbols are not hidden and ELF requires that any non-hidden symbols can be interposed. So if you look at libcrypto.so in a Linux distribution you'll see lots of internal functions polluting the dynamic symbol table and calls to those functions from within the library have to indirect via the PLT. BoringSSL builds with hidden visibility by default so calls to internal functions are direct and only functions marked OPENSSL_­EXPORT are included in the dynamic symbol table.

Multi-threaded code is common these days but OpenSSL requires that you install callbacks to lock and unlock a conceptual array of locks. This trips up people who now take thread-safety to be a given, and can also mean that contention profiling shows a large, opaque amount of contention in the locking callback with no hint as to the real source. BoringSSL has a native concept of locks so is thread-safe by default. It also has “once” objects, atomic reference counting and thread-local storage, which eliminates much of the need for locking in the first place.

Errors

OpenSSL has a fairly unique method of handling errors: it pushes errors onto a per-thread queue as the stack unwinds. This means that OpenSSL errors can generally give you something like a stack trace that you might expect from gdb or a Python exception, which is definitely helpful in some cases. For contrast, NSS (Mozilla's crypto library) uses a more traditional, errno-like system of error codes. Debugging an NSS error involves looking up the numeric error code and then grepping the source code to find all the places where that error code can be set and figuring out which triggered this time.

However, this single error-code system is better for programmatic use. Code that tries to do something with OpenSSL errors (other than dumping them for human debugging) tends to look only at the first (i.e. deepest) error on the queue and tries to match on the reason or even function code. Thus changing the name of even internal functions could break calling code because these names were implicitly exported by the error system. Adding errors could also break code because now a different error could be first in the queue. Lastly, forgetting to clear the error queue after a failed function is very easy to do and thus endemic.

So BoringSSL no longer saves functions in the error queue: they all appear as OPENSSL_­internal, which saved about 15KB of binary size alone. As a bonus, we no longer need to run a script every time we add a new function. The file name and line number is still saved but, thankfully, I've never seen code try to match line numbers from the error queue. Trying to match on reason codes is still problematic, but we're living with it for now. We also have no good answer for forgetting to clear the error queue. It's possible that we'll change things in the future to automatically clear the error queue when calling most functions as, now that we're using thread-local storage, that'll no longer cause servers to burst into a flaming ball of lock contention. But we've not done that yet.

Parsing and serialisation

OpenSSL's parsing and serialisation involves a lot of incrementing pointers with single-letter names. BoringSSL drags this firmly into the 1990's with functions that automatically check bounds for parsing and functions that automatically resize buffers for serialisation. This code also handles parsing and serialising ASN.1 in an imperative fashion and we're slowly switching over to these functions because the OpenSSL ASN.1 code is just too complicated for us.

But I should note that OpenSSL's master branch now uses some similar parsing functions for parsing TLS structures at least. I've no idea whether that was inspired by BoringSSL, but it's great to see.

Random number generation

Random number generation in OpenSSL suffers because entropy used to be really difficult. There were entropy files on disk that applications would read and write, timestamps and PIDs would be mixed into entropy pools and applications would try other tricks to gather entropy and mix it into the pool. That has all made OpenSSL complicated.

BoringSSL just uses urandom—it's the right answer. (Although we'll probably do it via getrandom rather than /dev/urandom in the future.) There are no return values that you can forget to check: if anything goes wrong, it crashes the address space.

For the vast majority of code, that's all that you need to know, although there are some concessions to performance in the details:

TLS servers that are pushing lots of AES-CBC need the RNG to be really fast because each record needs a random IV. Because of this, if BoringSSL detects that the machine supports Intel's RDRAND instruction, it'll read a seed from urandom, expand it with ChaCha20 and XOR entropy from RDRAND. The seed is thread-local and refreshed every 1024 calls or 1MB output, whichever happens first.

Authenticated Encryption

Handing people a block cipher and hash function and expecting them to figure out the rest does not work. Authenticated Encryption is much closer to being reasonable and BoringSSL promotes it where possible. One very pleasing BoringSSL tale is that I handed that header file to a non-crypto developer and they produced secure code, first time. That would not have happened had I pointed them at EVP_CIPHER.

There is more to be done here as I've talked about before: we need nonce-misuse-resistant primitives and solutions for large files but what we have now is a significant improvement and the foundations for that future work are now in place.

SSL/TLS

As I mentioned, the SSL/TLS code wasn't reworked function-by-function like most of BoringSSL. It was copied whole and incrementally improved, predominantly by David Benjamin. I'm really happy with what he's managed to do with it.

At the small scale, most of the parsing and serialisation is now using the safe functions that I covered above. (Changes to convert most of the remaining pointer-juggling code are in my review queue.) TLS extensions are now a bit saner and no longer handled with huge switch statements. Support for SSLv2, DSS, SRP and Kerberos has all been dropped. The header file actually has comments.

Some important, small scale cleanups are less obvious. The large number of “functions” that were actually macros around ctrl functions (that bypassed the type system) are now real functions. In order to get TLS 1.0–1.2 you no longer use the ridiculously named SSLv23_method and then disable SSLv2 and SSLv3 by setting options on the SSL_CTX, rather you use TLS_method and control the versions by setting a minimum and maximum version.

There is lots more that I could mention like that.

At the larger scale, the buffer handling code has been substantially improved and the TLS code now does symmetric crypto using the AEAD interface, which cleanly partitions concerns that previously leaked all over the SSL code. We've also rewritten the version negotiation code so it no longer preprocesses the ClientHello and fiddles with method tables to use the correct version. This avoids some duplicated code and session resumption bugs and OpenSSL has since done a similar rewrite for 1.1.0. To solve a particular problem for Chrome, we've added some support for asynchronous private key operations so that slow smartcards don't block the network thread. Much of the DTLS logic has also been rewritten or pruned.

Perhaps most importantly, the state machine is much reduced. Renegotiation has been dropped except for the case of a TLS client handling renegotiation from a server while the application data flow has stopped, and even that is disabled by default. The DTLS code (a source of many bugs) is much saner in light of this.

Testing

OpenSSL has always had decent test coverage of lower-level parts like hash functions and ciphers, but testing of the more complex SSL/TLS code has been lacking. Testing that code is harder because you need to be able to produce sufficiently correct handshakes to get close to its edge cases, but you don't want to litter your real code with dozens of options for producing incorrect outputs in order to hit them. In BoringSSL, we've solved this by using a copy of Go's TLS stack for testing and we've littered it with such options. Our tests also stress asynchronous resume points across a range of handshakes. We wrote partial DTLS support in Go to test DTLS-only edge cases like reassembly, replay and retransmission. Along the way, we even discovered one of OpenSSL's old bug workarounds didn't work, allowing both projects to shed some code.

In C, any malloc call may fail. OpenSSL attempts to handle this, but such code is error-prone and rarely tested. It's best to use a malloc which crashes on failure, but for the benefit of consumers who can't, we have a "malloc test" mode. This runs all tests repeatedly, causing each successive allocation to fail, looking for crashes.

We now have 1,139 TLS tests which gives us 70% coverage of the TLS code—still better than any other TLS library that we've used.

The future

Now that we've done the task of aligning Google around BoringSSL, we'll hopefully be able to turn a little bit more attention to some feature work. Support for the IETF-approved ChaCha20-Poly1305 is coming soon. (Brian Smith has a change waiting for me.) Curve25519 and Ed25519 support are likely too. Next year, we will probably start on TLS 1.3 support.

But more cleanups are probably more important. The big one is the elimination of the ASN.1 and X.509 code in many cases. If you recall, we imported that code whole without cleanups and it hasn't been touched since. We've been incrementally replacing uses of the ASN.1 code with the new CBS and CBB functions but X.509 remains as a substantial user. We're not going to be able to drop that code completely because too much expects the X.509 functions to be available for reading and writing certificates, but we can make it so that the rest of the code doesn't depend on it. Then we can put it in a separate library and drop in a new certificate verification library that some of my Chromium colleagues are writing. Most users of BoringSSL will then, transparently, end up using the new library.

In the SSL code, the SSL object itself is a mess. We need to partition state that's really needed for the whole connection from state that can be thrown away after the handshake from state that can be optionally discarded after the handshake. That will save memory in servers as well as improving the clarity of the code. Since we don't have ABI compatibility, we can also reorder the structs to pack them better.

Lastly, we need to make fuzzing part of our process. Michał Zalewski's AFL has substantially improved the state of fuzzing but, whether we're using AFL or LibFuzzer, it's still a one-off for us. It should be much more like our CI builders. So should running clang-analyzer.

(David Benjamin contributed to this post.)

The ICANN Public Comments on WHOIS Privacy

ICANN is currently considering a proposal that “domains used for online financial transactions for commercial purpose should be ineligible for [WHOIS] privacy and proxy registrations” [PDF]. Given the vagueness around what would count as “commercial purpose” (tip jars? advertising? promoting your Kickstarter?) and concerns that some commercial sites are for small, home-run businesses, quite a lot of people are grumpy about this.

ICANN has a public comment period on this document until July 7th and what's interesting is that the comments (those that were emailed at least) are all in a mailing list archive. When you submit to the comment address (comments-ppsai-initial-05may15@icann.org) you receive a confirmation email with a link that needs to be followed and quite a clear statement that the comments are public, so I think that this is deliberate.

I was curious what the comments box on this sort of topic is full of so did a quick analysis. The comment period doesn't close until July 7th so obviously I'm missing a couple of days worth of responses, but it was a two month comment period so that doesn't seem too bad.

When I checked there were 11,017 messages and 9,648 (87.6%) of them were strongly based on the Respect Our Privacy form letter. Several hundred additional messages included wording from it so I think that campaign resulted in about 90% of messages. (And it's worth noting the the primary flow on that site is to call ICANN—of course, I've no data on the volume of phone calls that resulted.)

Another campaign site, Save Domain Privacy, has a petition and quite a few messages included its wording.

I classified all such messages as “against” and wrote a quick program to manually review the remaining messages that weren't trivially classifiable by string matching against those template messages.

  • Nine messages were so odd or confused that it's unclear what the writer believed.
  • Three messages were asking questions and not expressing an opinion.
  • Two messages were sufficiently equivocal that they didn't express a clear feeling in either direction.
  • One message was commenting on a different section of the document.
  • One message suggested that WHOIS privacy be available to all, but that it should have a significant (monetary) cost in order to discourage its use.

Many more messages were against and not based on either of the two template letters. That leaves 13 messages that expressed support for the proposal. (That's 0.12% for those who are counting, although I very much question how much meaning that number has.):

  • Three messages suggested that private WHOIS registration was contrary to the openness of the Internet.
  • Three messages believed that shutting down sites that infringe copyright, or sell counterfeit trademarked goods, was a compelling reason.
  • Two writers believed that it was compelling, in general, to have contact details for websites. One of who claimed to be a security researcher and wanted CERTs to have access to full WHOIS details.
  • Two messages suggested that it would hinder “cyber-bullies” and “pædophiles”, one of which described how hard it was to have a stalker's site shut down.
  • One author believed that being able to contact site owners in the event of a domain-name dispute was a compelling reason.
  • One message suggested that WHOIS privacy should be removed for all .com sites, but no others.
  • One commenter opined that the Internet is inherently hostile to privacy and thus those who want privacy should not register domains.

The comment period opened on May 5th, but between then and June 22nd there were only seven messages. However, the week of the 22nd brought 10,015 messages. The the week of the 29th brought 995 more. So I think it's clear that, without significant outside promotion of these topics, almost nobody would have noticed this proposal.

AEADs: getting better at symmetric cryptography

I gave a talk a couple of weeks ago at the Yahoo Unconference. The conference was at the end of a particually hard week for a bunch of reasons and I fear that the talk wasn't that great. (Afterwards I got home about 3pm and pretty much slept until the following morning.) This post is a, hopefully clearer, articulation of its contents.

I've been primarily working on getting Google products switched over to BoringSSL for a little over a year now. (Chromium is done on many platforms and AOSP switched recently.) This is neccessary work, but it doesn't exactly lend itself to talk material. So the talk was titled “Lessons Learnt from Questions”—the idea being that smart developers at Google often ask lots of cryptography questions, and from those questions one can tell what is unclear in existing documentation. Points that are not obvious to smart, non-experts are thus good topics to talk about because it's likely that lots of people are missing them.

I was aiming for an hour but, due to a misunderstanding in the weeks prior, I thought that I had to cut it down to 20 minutes, so I ditched all but one question:

“How do I encrypt lots of records given a per-user key?” —Anonymous developer

This question is about applying symmetric cryptography, which is the originally motivation of all of cryptography. You would have thought that we would have figured it out by now, but we totally haven't.

In the 1990's it was common for cryptographers to produce block ciphers and hash functions, and then developers would be tasked with composing them. In the talk I stepped through the design of, and attacks on, the SSLv3/TLS CBC construction as a demonstration that plausible-sounding design features can lead to disaster, but I'm going to omit that here because it worked better live. (Phillip Rogaway tried to warn about these issues when SSL, IPSec, etc were being developed, but sadly the right people didn't listen to him.)

So, for quite a long time now, cryptographers have understood that block ciphers and hash functions are too low-level an abstraction and that the constructions themselves need to be studied and standardised. The result is called authenticated encryption (AE), and let's ponder what it might look like.

Obviously an AE function must take some plaintext and, if it's symmetric encryption, it must take a key. We also assume that its going to take care of both confidentiality and authenticity for us, so not only will an attacker not be able to tell the plaintext from the ciphertext but, if the ciphertext is altered in any way, the decryption will fail cleanly. So let's experiment with a hypothetical SSH-like protocol that's trying to protect a series of key presses:

AE(key, plaintext) → ciphertext

AE(key, ‘h’) → α
AE(key, ‘e’) → β
AE(key, ‘l’) → γ
AE(key, ‘l’) → γ
AE(key, ‘o’) → δ

Oh no! The two l's in “hello” turned into the same ciphertext, so an attacker can see patterns in the input reflected in the ciphertext. That's pretty terrible; clearly we need our AE function to map the same plaintext to different ciphertexts. There's two ways that can happen: either the function is non-deterministic (i.e. it reads entropy internally as a hidden input), or we pass in some varying argument. Explicit is better than implicit so we choose the latter and add an argument called the nonce:

AE(key, plaintext, nonce) → ciphertext

AE(key, ‘h’, n0) → (α, n0)
AE(key, ‘e’, n1) → (β, n1)
AE(key, ‘l’, n2) → (γ, n2)
AE(key, ‘l’, n3) → (ɛ, n3)
AE(key, ‘o’, n4) → (δ, n4)

We assume that n0…4 are all distinct and thus the two l's now get different ciphertexts and patterns in the plaintext have been eliminated. Note that (for now) we need to include the nonce value with the ciphertext because it'll be needed when decrypting.

(As an aside: if you've done anything with cryptography before you've probably come across the acronym “IV”, for initialisation vector. Although not completely standard, I differentiate an IV and an nonce thus: an IV needs to be unpredictable while an nonce needs only to be distinct. The values 0, 1, 2, … are distinct, but they aren't unpredictable. For something like CBC mode, which needs an IV, using a counter would be unacceptable.)

We solved one problem but there's another thing that an evil attacker might do: reorder messages:

AE(key, ‘h’, n0) → (α, n0)(γ, n2)
AE(key, ‘e’, n1) → (β, n1)(δ, n4)
AE(key, ‘l’, n2) → (γ, n2)…untrusted network…(α, n0)
AE(key, ‘l’, n3) → (ɛ, n3)(ɛ, n3)
AE(key, ‘o’, n4) → (δ, n4)(β, n1)

All the ciphertexts on the right are perfectly valid and coupled with a valid nonce, but they end up decrypting to “lohle”—hardly what was intended.

There are two solutions to this. The first is to use an implicit nonce: don't transmit the nonce, just use a counter for each direction. Now, if a message is out of order, the receiver will attempt to decrypt it with the wrong nonce and it'll fail to decrypt. That works perfectly well for transport protocols like SSH and TLS because it's easy to keep a counter in those situations, but sometimes the problem isn't quite so synchronous. For these cases, one could include a sequence number or other context in the plaintext and that works fine, but it's a waste of space if the receiver already knew the information and just wanted to confirm it.

This motivates a common extension to authenticated encryption called authenticated encryption with associated data (AEAD). The associated data is another argument, of arbitrary length, which must be equal at the encryption and decryption ends:

AEAD(key, plaintext, nonce, ad) → ciphertext

AEAD(key, ‘h’, n0, 0) → (α, n0)
AEAD(key, ‘e’, n1, 1) → (β, n1)
AEAD(key, ‘l’, n2, 2) → (γ, n2)
AEAD(key, ‘l’, n3, 3) → (ɛ, n3)
AEAD(key, ‘o’, n4, 4) → (δ, n4)

In this example the associated data happens to be a counter, but it could be anything. The associated data isn't included in the ciphertext, but it must be identical when decrypting otherwise the decryption will fail. In other protocols it could equally well be some context denoting, say, the first or last record. This is the second solution to the reordering problem.

The associated data might seem quite a lot like an nonce. They both need to be presented at encryption and decryption time and they must both match for decryption to succeed, so why do we bother having both? Mostly because the associated data is free-form: you can have as much or as little of it as you like and you can repeat it etc. The requirements of an nonce are much stricter, in fact if you remember only one thing from this please remember this, which I'm calling The Law of AEADs:

Thou shall never reuse the same (key, nonce) pair, for all time. (With high probability.)

So, if you generate a random key and use it to encrypt a single message, it's ok to set the nonce to zero. If you generate a random key and encrypt a series of messages you must ensure that the nonce never repeats. A counter is one way to do this, but if you need to store that counter on disk then stop: the chances of you screwing up and reusing an nonce value are way too high in designs like that.

It would be nice if reusing an nonce just meant that the same plaintext would result in the same ciphertext. That's the least bad thing that an AEAD could do in that situation. However the reality is significantly worse: common AEADs tend to lose confidentiality of messages with a repeated nonce and authenticity tends to collaspe completely for all messages. (I.e. it's very bad.) We like these common AEADs because they're fast, but you must have a solid story about nonce uniqueness. AEADs like AES-GCM and ChaCha20-Poly1305 fail in this fashion.

So what should you do when using a counter isn't trivially safe? One option is to generate the nonce at random and consider the probability of duplicates. AES-GCM takes a 96-bit nonce and NIST says that you can only encrypt 232 messages under a single key if using random nonces. This is because if you throw 232 balls at 296 buckets then you have roughly a 2-33 chance of getting two in the same bucket and NIST drew the line there. That probability might seem either high or low to you. It's pretty small in absolute terms and, unlike a work factor, an attacker can't spend resources against it, but it's a very long way from the safety margins that we usually use in cryptography. So there are also functions like crypto_­secretbox_­xsalsa20­poly1305 (from NaCl) that have a 192-bit nonce. The probabilities with random nonces are much more comforting at that size.

Another approach would be to use an AEAD that doesn't fail quite so catastrophically when an nonce is repeated. This is called an nonce-misuse resistant AEAD and we're hitting the boundary of developed practice now. The CAESAR competition has several nonce-misuse resistant entries in it, although it's not scheduled to conclude until 2018. Closer to established primitives, Gueron and Lindell propose an AES-GCM-SIV mode with a strong foundation (assuming that you trust AES-GCM) and good performance (assuming that you have good AES-GCM performance).

AEADs with large plaintexts

If you look at AEAD APIs you'll generally notice that they take the entire plaintext or ciphertext at once. In other words, they aren't “streaming” APIs. This is not a mistake, rather it's the streaming APIs that are generally a mistake.

I've complained about this in the past, so I'll be brief here. In short, old standards (e.g. PGP) will encrypt plaintexts of any length and then put an authenticator at the end. The likely outcome of such a design is that some implementations will stream out unauthenticated plaintext and only notice any problem when they get to the end of the ciphertext and try to check the authenticator. But by that time the damage has been done—it doesn't take much searching to find people suggesting piping the output of gpg to tar or even a shell.

So, if streaming is required, large plaintexts should be chunked and each chunk should easily fit into a “one-shot” API like those linked to above. That prevents unauthenticated plaintext from being processed. However, there is no standard for this; it's another case where we've hit the borders of existing practice. Implementations need to construct the nonces and associated data so that the first chunk is known to be the first, so that each chunk is in order and so that truncation is always detectable.

Streaming decryption always runs the risk of an attacker truncating the ciphertext however. Designs must always be able to detect truncation, but it's very easy to imagine that the rest of a system doesn't handle it well. (For example, see the Cookie Cutter attack against HTTPS.)

If streaming isn't essential then an all-or-nothing transform would be ideal. But, again, there's no substantial standards or practice around this. Hopefully in ten years or so there will be clear answers for this and for large-plaintext constructions with AEADs (AERO is a start). But, for now, even symmetric encryption isn't a “solved problem”.

Why not DANE in browsers

Thomas Ptacek laid out a number of arguments against DNSSEC recently (and in a follow up). We don't fully agree on everything, but it did prompt me to write why, even if you assume DNSSEC, DANE (the standard for speaking about the intersection of TLS and DNSSEC) is not a foregone conclusion in web browsers.

There are two ways that you might wish to use DANE in a web browser: either to block a certificate that would normally be considered valid, or to bless a certificate that would normally be rejected. The first, obviously, requires that DANE information always be obtained—if a lookup failure was ignored, a network attacker with a bad certificate would just simulate a lookup failure. But requiring that browsers always obtain DANE information (or a proof of absence) is nearly implausible:

Some years ago now, Chrome did an experiment where we would lookup a TXT record that we knew existed when we knew the Internet connection was working. At the time, some 4–5% of users couldn't lookup that record; we assume because the network wasn't transparent to non-standard DNS resource types. DANE records are going to be even more non-standard, are going to be larger and browsers would have to fetch lots of them because they'll need the DNSKEY/RRSIG chain up from the root. Even if DNSSEC record lookup worked flawlessly for everyone, we probably still wouldn't implement this aspect of DANE because each extra lookup is more latency and another chance for packet loss to cause an expensive timeout and retransmit.

Instead, for this we have HPKP, which is a memory-based pinning solution using HTTP headers. We also have pre-loaded pinning in Chrome for larger or more obvious targets. Thomas Ptacek seems bullish on pinning but I'm much more lukewarm. HPKP is quite complex and no doubt someone will write a “supercookies” story about it at some point, as they have for HSTS. Additionally, pinning is quite dangerous. Clients deciding that “pinning is good” have caused headaches at Google. It's also worth noting that CryptoCat has committed pinning-suicide in Chrome at at the moment due to their CA having switched intermediates between renewals. They're waiting for the release of Chrome 41 to recover.

But what about the other side of DANE: blessing certificates that would otherwise be considered untrusted? In this case, DNSSEC can be seen as something like another CA. The same problems with looking up DNSSEC records apply, but are much less painful when you only need to depend on the lookup for sites that are using DANE certificates. Indeed, Chrome even supported something very like DANE for a while. In that case the DNSSEC records were contained in the certificate to avoid the latency and complexity of looking them up in the client. (DNSSEC records contains signatures so need not be transported over the DNS protocol.)

But support for that was removed because it was a bunch of parsing code outside of the sandbox, wasn't really being used and it conflicted with two long-term plans for the health of the HTTPS ecosystem: eliminating 1024-bit RSA and Certificate Transparency. The conflicts are probably the most important reasons for not wanting to renew the experiment.

The effort to remove 1024-bit RSA from HTTPS has been going for years and is, perhaps, nearing completion now. (That noise that you can hear is my colleague, Ryan Sleevi, crying softly.). There are still some 1024-bit root certificates, but they are nearly gone from the Mozilla set. The amount of work involved is an order of magnitude greater than you expect because of the interactions of different X.509 validation libraries, intermediate chains and varying root stores on different platforms and versions.

DNSSEC, however, is littered with 1024-bit RSA. You literally can't avoid it because the root zone transits through a 1024-bit key. DNSSEC has them because of (I think) concerns about the size of responses and they are usually rotated every two or three months. The RFC suggests that 1024-bit RSA is good for “most zones” until 2022. Dan Bernstein's paper on Batch NFS deals well with the question of whether that's wise.

Next, Certificate Transparency is our effort to add strong, technical audits to the CA system by creating a trustworthy log of all valid certificates. CT logs only accept certificates from CA as an anti-spam measure but people can create DANE certificates for domains at will. This is far from an insurmountable problem, but it is a problem that would need to be solved and the CT team already have their hands full with the staged rollout of CT in Chrome.

The 1024-bit RSA problem isn't insurmountable either (although it's baked in much deeper), so it's possible that browsers might accept a DNSSEC signature chain in a certificate in the future, but it's a long way out.

The POODLE bites again

October's POODLE attack affected CBC-mode cipher suites in SSLv3 due to SSLv3's under-specification of the contents of the CBC padding bytes. Since SSLv3 didn't say what the padding bytes should be, implementations couldn't check them and that opened SSLv3 up to an oracle attack.

We're done pretty well at killing off SSLv3 in response to that. Chrome 39 (released Nov 18th) removed fallback to SSLv3 and Chrome 40 is scheduled to remove SSLv3 completely. Firefox 34 (released Dec 1st) has already removed SSLv3 support.

We're removing SSLv3 in favour of TLS because TLS fully specifies the contents of the padding bytes and thus stops the attack. However, TLS's padding is a subset of SSLv3's padding so, technically, you could use an SSLv3 decoding function with TLS and it would still work fine. It wouldn't check the padding bytes but that wouldn't cause any problems in normal operation. However, if an SSLv3 decoding function was used with TLS, then the POODLE attack would work, even against TLS connections.

This was noted by, at least, Brian Smith on the TLS list ([1][2]) and I was sufficiently cynical to assume that there were probably more instances of this than the old versions of NSS that Brian cited, and so wrote a scanner for the issue.

Unfortunately, I found a number of major sites that had this problem. At least one of whom I had good enough contacts at to quickly find that they used an F5 device to terminate connections. I contacted F5 on October 21st and they started working on a fix. Yngve Pettersen also independently found this issue and contacted me about it around this time.

F5 reported that some of the affected sites weren't customers of theirs, which meant that there was (at least) a second vendor with the same issue. After more digging, I found that some A10 devices also have this problem. I emailed a number of contacts at A10 on October 30th but sadly didn't get a reply from any of them. It wasn't until November 13th that I found the right person at A10 to deal with this.

F5 and A10 have posted patches for their products (F5's are here and A10's are here and they have an advisory here). I'm not completely sure that I've found every affected vendor but, now that this issue is public, any other affected products should quickly come to light. (Citrix devices have an odd behaviour in this area in that they'll accept padding bytes that are all zeros, but not random padding. That's unexpected but I can't make an attack out of it.)

(Update: since posting this, it appears that products from Fortinet, Cisco, IBM (WebSphere, Domino, Tivoli) and Juniper may also be affected.)

Ivan Ristić has added a test for this issue to his excellent scanner at SSLLabs. Affected sites will have their grade set to F and will report “This server is vulnerable to the POODLE attack against TLS servers”.

This seems like a good moment to reiterate that everything less than TLS 1.2 with an AEAD cipher suite is cryptographically broken. An IETF draft to prohibit RC4 is in Last Call at the moment but it would be wrong to believe that RC4 is uniquely bad. While RC4 is fundamentally broken and no implementation can save it, attacks against MtE-CBC ciphers have repeatedly been shown to be far more practical. Thankfully, TLS 1.2 support is about to hit 50% at the time of writing.

POODLE attacks on SSLv3

My colleague, Bodo Möller, in collaboration with Thai Duong and Krzysztof Kotowicz (also Googlers), just posted details about a padding oracle attack against CBC-mode ciphers in SSLv3. This attack, called POODLE, is similar to the BEAST attack and also allows a network attacker to extract the plaintext of targeted parts of an SSL connection, usually cookie data. Unlike the BEAST attack, it doesn't require such extensive control of the format of the plaintext and thus is more practical.

Fundamentally, the design flaw in SSL/TLS that allows this is the same as with Lucky13 and Vaudenay's two attacks: SSL got encryption and authentication the wrong way around – it authenticates before encrypting.

Consider the following plaintext HTTP request, which I've broken into 8-byte blocks (as in 3DES), but the same idea works for 16-byte blocks (as in AES) just as well:

image/svg+xml

The last block contains seven bytes of padding (represented as •) and the final byte is the length of the padding. (And I've used a fictional, 8-byte MAC, but that doesn't matter.) Before transmission, those blocks would be encrypted with 3DES or AES in CBC mode to provide confidentiality.

Now consider how CBC decryption works at the receiver, thanks to this public domain diagram from Wikipedia:

image/svg+xml block cipherdecryption Key Plaintext Ciphertext Initialization Vector (IV) block cipherdecryption Key Plaintext Ciphertext block cipherdecryption Key Plaintext Ciphertext

An attacker can't see the plaintext contents like we can in the diagram, above. They only see the CBC-encrypted ciphertext blocks. But what happens if the attacker duplicates the block containing the cookie data and overwrites the last block with it? When the receiver decrypts the last block it XORs in the contents of the previous ciphertext (which the attacker knows) and checks the authenticity of the data. Critically, since SSLv3 doesn't specify the contents of the padding (•) bytes, the receiver cannot check them. Thus the record will be accepted if, and only if, the last byte ends up as a seven.

An attacker can run Javascript in any origin in a browser and cause the browser to make requests (with cookies) to any other origin. If the attacker does this block duplication trick they have a 1-in-256 chance that the receiver won't reject the record and close the connection. If the receiver accepts the record then the attacker knows that the decryption of the cookie block that they duplicated, XORed with the ciphertext of the previous block, equals seven. Thus they've found the last byte of the cookie using (on average) 256 requests.

Now the attacker can increase the length of the requested URL and decrease the length of something after the cookies and make the request look like this:

image/svg+xml

Note that the Cookie data has been shifted so that the second to last byte of the data is now at the end of the block. So, with another 256 requests the attacker can expect to have decrypted that byte and so on.

Thus, with an average of 256×n requests and a little control of the layout of those requests, an attacker can decrypt n bytes of plaintext from SSLv3. The critical part of this attack is that SSLv3 doesn't specify the contents of padding bytes (the •s). TLS does and so this attack doesn't work because the attacker only has a 2-64 or 2-128 chance of a duplicated block being a valid padding block.

This should be an academic curiosity because SSLv3 was deprecated very nearly 15 years ago. However, the Internet is vast and full of bugs. The vastness means that a non-trivial number of SSLv3 servers still exist and workarounds for the bugs mean that an attacker can convince a browser to use SSLv3 even when both the browser and server support a more recent version. Thus, this attack is widely applicable.

SSL/TLS has a perfectly good version negotiation mechanism that should prevent a browser and server that support a modern TLS version from using anything less. However, because some servers are buggy and don't implement version negotiation correctly, browsers break this mechanism by retrying connections with lesser SSL/TLS versions when TLS handshaking fails. These days we're more aware of the fact that fallback behaviour like this is a landmine for the future (as demonstrated today) but this TLS fallback behaviour was enshrined long ago and we're stuck with it. It means that, by injecting some trivial errors on the network, an attacker can cause a browser to speak SSLv3 to any server and then run the above attack.

What's to be done?

It's no revelation that this fallback behaviour is bad news. In fact, Bodo and I have a draft out for a mechanism to add a second, less bug-rusted mechanism to prevent it called TLS_FALLBACK_SCSV. Chrome and Google have implemented it since February this year and so connections from Chrome to Google are already protected. We are urging server operators and other browsers to implement it too. It doesn't just protect against this specific attack, it solves the fallback problem in general. For example, it stops attackers from downgrading TLS 1.2 to 1.1 and 1.0 and thus removing modern, AEAD ciphers from a connection. (Remember, everything less than TLS 1.2 with an AEAD mode is cryptographically broken.) There should soon be an updated OpenSSL version that supports it.

Even with TLS_FALLBACK_SCSV, there will be a long tail of servers that don't update. Because of that, I've just landed a patch on Chrome trunk that disables fallback to SSLv3 for all servers. This change will break things and so we don't feel that we can jump it straight to Chrome's stable channel. But we do hope to get it there within weeks and so buggy servers that currently function only because of SSLv3 fallback will need to be updated.

Chrome users that just want to get rid of SSLv3 can use the command line flag --ssl-version-min=tls1 to do so. (We used to have an entry in the preferences for that but people thought that “SSL 3.0” was a higher version than “TLS 1.0” and would mistakenly disable the latter.)

In Firefox you can go into about:config and set security.tls.version.min to 1. I expect that other browser vendors will publish similar instructions over the coming days.

As a server operator, it is possible to stop this attack by disabling SSLv3, or by disabling CBC-mode ciphers in SSLv3. However, the compatibility impact of this is unclear. Certainly, disabling SSLv3 completely is likely to break IE6. Some sites will be happy doing that, some will not.

A little further down the line, perhaps in about three months, we hope to disable SSLv3 completely. The changes that I've just landed in Chrome only disable fallback to SSLv3 – a server that correctly negotiates SSLv3 can still use it. Disabling SSLv3 completely will break even more than just disabling the fallback but SSLv3 is now completely broken with CBC-mode ciphers and the only other option is RC4, which is hardly that attractive. Any servers depending on SSLv3 are thus on notice that they need to address that now.

We hardened SSLv3 and TLS 1.0 against the BEAST attack with 1/n-1 record splitting and, based on an idea by Håvard Molland, it is possible to do something called anti-POODLE record splitting this time. I'll omit the details, but one can ensure that the last block in a CBC record contains at least fixed six bytes using only one split for AES and two for 3DES. With this, CBC is probably less bad than RC4. However, while anti-POODLE record splitting should be easy to deploy because it's valid according to the spec, so was 1/n-1 and deploying that was very painful. Thus there's a high risk that this would also cause compatibility problems. Therefore I'm not proceeding with anti-POODLE record splitting and concentrating on removing SSLv3 fallback instead. (There's also the possibility that an attacker could run the attack on the server to client direction. If we assume that both the client and the server get patched then we might as well assume that they are patched for TLS_FALLBACK_SCSV, which makes record splitting moot.)

PKCS#1 signature validation

On Wednesday, Chrome and Mozilla did coordinated updates to fix an RSA signature verification bug in NSS — the crypto library that handles SSL in Firefox and (currently) Chrome on most platforms. The updates should be well spread now and the bug has been detailed on Reddit, so I think it's safe to talk about.

(Hilariously, on the same day, bash turned out to have a little security issue and so hardly anyone noticed the NSS one!)

The NSS bug is another variant of Bleichenbacher's 2006 attack on RSA signature validation. To recap: an RSA signature is roughly the cube root of the message hash modulo the RSA modulus. Verification involves cubing the signature and checking that the result is the hash that you expected. Cubing modulo an RSA modulus is easy but finding a cube root is infeasible.

There's a little more to it because one needs to eliminate some properties of the RSA operation by formatting the hash that you're going to sign — called padding.

The standard for RSA signing and encryption is PKCS#1 version 1.5. The PKCS series of standards are amazing. Although I've no doubt that the people writing them were doing their best, it was a long time ago and mistakes were made. In a modern light, they are all completely terrible. If you wanted something that was plausible enough to be widely implemented but complex enough to ensure that cryptography would forever be hamstrung by implementation bugs, you would be hard pressed to do better. If you can find some implementers, just say "PKCS#11!" or "PKCS#12!" as punchlines. You'll get a good laugh, although possibly also a few panic attacks.

(PKCS#1 version 2 is much better, but only because they took it from Bellare and Rogaway and slapped the PKCS#1 title on it.)

PKCS#1 version 1.5 wanted to include an identifier for the hash function that's being used, inside the signature. This is a fine idea, but they did it by encoding the algorithm and hash value with ASN.1. This caused many implementations to include the complexity of an ASN.1 parser inside signature validation and that let the bugs in.

Bleichenbacher's original attack was based on the observation that the ASN.1 parsers, in many cases, didn't reject extra trailing data. This is reasonable behaviour for a generic ASN.1 parser, but a disaster in signature verification. Because the parser could ignore so much of the signature, Bleichenbacher could arrange for a perfect cube to have a suitable prefix and thus calculate the cube root over the integers — ignoring the RSA modulus completely!

That was fixed, but there was another common bug: the ASN.1 structure used to identify the hash was a standard structure called AlgorithmIdentifier, which includes an optional parameter. Implementations were ignoring the parameter and that also introduced a vulnerability: arbitrary bytes could be inserted as a parameter and that was sufficient to forge signatures. (See section five of this paper for details.)

This week's vulnerability was similar. Antoine Delignat-Lavaud (who, by the way, has been doing stellar work along with the Prosecco team at INRIA: see Triple Handshake, Cookie Cutter, SPDY and virtual host attacks and miTLS) noticed that the check on the contents of the parameters in NSS wasn't very strict — it was only checking that the length was at most two bytes. This is because, due to complexity, there wasn't universal agreement on what the the parameter should be. The ASN.1 for the parameter is an ANY type (with the concrete type depending on a preceding object id) but also optional. So, when not needed, should it be an ASN.1 NULL (which is two bytes long), or should it be omitted completely? The answer is the former, but it was underspecified for a long time.

Once Antoine had put a spotlight on the code, Brian Smith noticed something worse: an integer overflow in the ASN.1 parser. ASN.1 (in DER form at least, because, of course, ASN.1 has multiple defined encodings) has variable-length lengths, and NSS was using a generic ASN.1 parser which didn't check for overflow. So you could specify that a length was arbitrarily long and the parser would do something similar to:

unsigned length = 0;
for (i = 0; i < length_of_length; i++) {
  length <<= 8;
  length |= length[i];
}

Thus, as long as the last 4 or 8 bytes (depending on whether the system was 32 or 64 bit) encoded the correct length, the bytes before that would be ignored. That allows arbitrary bytes in the signature again and the attack from section 5 of the previous paper can be still be used to make forged signatures.

(Intel's ATR group also reported the same issue to Mozilla a little bit later. Bugs often seem to work like scientific discovery.)

The moral of the story

The moral here is that an ASN.1 parser should never have been put somewhere so sensitive; parsing is dangerous. It was a mistake to have defined PKCS#1 that way, but it can be ameliorated: rather than parsing, generate the expected signature contents and compare it against the plaintext. Go has always done this. I believe that SSH does this. Bouncy Castle reportedly does this. BoringSSL does this because I changed it from OpenSSL (prior to learning about the NSS issue — it's just a good idea). NSS does it now, although they're still worried about the omitted parameter vs NULL parameter confusion so they serialise two ASN.1 outputs and compare against both.

I generated a number of tests of different possible points of flexibility in signature parsing so that other libraries can be tested. For example, you can test OpenSSL (which still uses an ASN.1 parser) against them like this:

for cert in $(ls *.crt | grep -v root2048.crt); do
  openssl verify -CAfile root2048.crt $cert 2>> /dev/null | grep -q "$cert: OK"
  if [ $? -eq 0 ] ; then
    echo "$cert accepted"
  fi
done

The file control.crt should always be accepted. The file missingnull.crt will be accepted if you allow the parameter to be omitted. (I recommend against that. BoringSSL doesn't allow it and we'll see whether we can make that stick.) Nothing else should be accepted.

Sadly, OpenSSL 1.0.1 also accepts highvaltag.crt (ordinary ASN.1 tags in high-value form) and indeflen.crt (BER indefinite lengths rather than DER). The OpenSSL development branch also accepts bernull.crt (superfluous zero bytes in a length). To be clear, there is no known vulnerability here, but it's unwelcome flexibility in something that should be rigid. (I let OpenSSL team know on Wednesday and that they're welcome to use the code from BoringSSL.)

Remember, we use PKCS in browsers because we have to deal with the world as we find it, not as we want it. If you are not so encumbered, consider something simpler.

A couple more formal systems

After my last post on formal analysis of C code, people pointed out several other options to me and I wanted to do a follow up because the conclusion is a little more positive now.

Firstly, I noted last time that Z3 was the best, Why3-compatible SMT solver for my problem but that it came with a non-commercial license. Craig Stuntz pointed out that Microsoft sell a commercial license for $10K. Sure, that's quite a lot for a hobbyist project, but it's easy for a corporation and a lot easier than having to negotiate specially.

The first additional thing that I've looked into is AutoCorres. I mentioned last time that SeL4's refinement proof to C code involved a C parser in Isabelle that output structures in the Simpl framework and that Simpl was pretty intimidating.

AutoCorres is a tool for verifiably simplifing the Simpl structures and thus making proofs easier. Let's repeat the example from last time and show how AutoCorres helps. Here's the test C function:

int add(int a, int b) {
  return a+b;
}

And here's the raw result from the C parser, exactly as I showed last time:

test_global_addresses.add_body ≡
TRY
  Guard SignedArithmetic ⦃-2147483648 ≤ sint ´a + sint ´b ∧ sint ´a + sint ´b ≤ 2147483647⦄
   (creturn global_exn_var_'_update ret__int_'_update (λs. a_' s + b_' s));;
  Guard DontReach {} SKIP
CATCH SKIP
END

And here's what AutoCorres makes of it:

add' ?a ?b ≡
  DO oguard (λs. INT_MIN ≤ ?a + ?b);
     oguard (λs. ?a + ?b ≤ INT_MAX);
     oreturn (?a + ?b)
  OD

That's very impressive! That's using a Maybe monad (using the Haskell name) in Isabelle to handle the possible overflow. Obviously that's a lot easier to deal with than the raw Simpl. Here's the field addition function that I was using for the other examples in the previous post:

void add(int *out, const int *x, const int *y) {
  unsigned int i;
  unsigned int size = 10;
  for (i = 0; i < size; i++) {
    out[i] = x[i] + y[i];
  }
}

And here it is after AutoCorres:

add' ?out ?x ?y ≡
do i ← whileLoop (λi s. i < 0xA)
          (λi. do guard (λs. is_valid_w32 s (ptr_coerce (?out +⇩p uint i)));
                  guard (λs. is_valid_w32 s (ptr_coerce (?x +⇩p uint i)));
                  guard (λs. is_valid_w32 s (ptr_coerce (?y +⇩p uint i)));
                  guard (λs. INT_MIN ≤ sint (ucast s[ptr_coerce (?x +⇩p uint i)]) + sint (ucast s[ptr_coerce (?y +⇩p uint i)]));
                  guard (λs. sint (ucast s[ptr_coerce (?x +⇩p uint i)]) + sint (ucast s[ptr_coerce (?y +⇩p uint i)]) ≤ INT_MAX);
                  modify (λs. s[ptr_coerce (?out +⇩p uint i) := s[ptr_coerce (?x +⇩p uint i)] + s[ptr_coerce (?y +⇩p uint i)]]);
                  return (i + 1)
               od)
         0;
   skip
od

While the AutoCorres output looks very clean, there isn't any documentation at the moment which means that it's a little hard to actually use. (There are a couple of papers on the home page that are well worth reading - but I didn't find anything like a user guide.) There are some examples from which it's possible to make some guesses, but I wasn't able to get very far in a proof. Also, since AutoCorres builds on Simpl and C parser, it has the same limitations - most notably that the address of local variables can't be taken.

None the less, AutoCorres holds a lot of promise.

Next, is VST from Andrew Appel and others (yes, that Appel). VST is a framework for proving the behaviour of C code with Coq by using one of the intermediate languages from CompCert. Coq (which I'm pronouncing as “coke”) is a proof system like Isabelle and CompCert is a formally verified compiler. That means that it's possible to have a chain of formal verification from C code all the way to the semantics of the machine language! Additionally, the C parser isn't correctness critical because VST proves properties using the same parse tree as the verified compiler. That's an extremely strong verification standard.

CompCert is a mixture of GPL and “non-commercial” licensed code. It appears that everything needed for VST is licensed as GPL or another free license. So, one can also use VST to verify C code and then use a different compiler to build it if the non-commercial restriction is a problem. That moves the CompCert C parser and the other compiler into the trusted set, but so it goes.

The VST logic also appears to be very capable. It doesn't have a problem with pointers to local variables and should even have the capability to support reasoning about concurrent programs. The restrictions that it does have are local changes: only one memory load per line (at the moment) or side effect per statement and void functions need an explicit return at the end. (Unfortunately, one doesn't get an error from violating these requirements - rather the proof cannot be completed.)

VST thankfully, has brief documentation online which is expanded upon in the form of a book ($15 as a DRMed download). The book contains a very abstract articulation of the underlying logic which, to be honest, defeated me initially. I'd recommend skipping chapters 2 through 21 on a first reading and coming back to them later. They are useful, but they're also a lot clearer after having played with the software for a while.

Here's the same example function, translated into VST style. I switched back from having a special output array because that was something that I changed to make the SPARK verification easier but it was never part of the original code and VST doesn't have problem with overlapping the input and output. I've also switched from a for loop to a while loop because the documentation and examples for the latter are better. Otherwise, the changes are just to limit each statement to a single memory access and to include the final return.

void add(int *a, int *b) {
	int t1, t2;
	int i = 0;
	while (i < 10) {
		t1 = a[i];
		t2 = b[i];
		a[i] = t1 + t2;
		i++;
	}
	return;
}

We can use clightgen from CompCert to parse that into the intermediate language that CompCert and VST use. (A word to the wise - if you want to install VST, use exactly Coq 8.4p2, not the latest version which is 8.4p4.) Then we can use Coq to prove its behaviour.

Compared to Isabelle, Coq's proofs aren't as clear as the Isar style proofs in Isabelle (although the terseness is more useful when proofs are this complex) and neither Proof General nor CoqIDE are as nice as Isabelle's jedit mode. But it's not otherwise dramatically different.

Here's a specification of add in VST/Coq's language:

Definition bound_int (v : val) (b : Z) :=
  match v with
  | Vint i => -b < (Int.signed i) < b
  | _ => False
  end.

Definition partial_add (i: Z) (l: Z -> val) (r: Z -> val) (j: Z) :=
  if zlt j i then Val.add (l j) (r j) else (l j).

Definition add_spec :=
  DECLARE _add
  WITH a0 : val, b0 : val, sh : share, orig_a : Z -> val, orig_b : Z -> val
  PRE [_a OF (tptr tint), _b OF (tptr tint)]
    PROP (writable_share sh;
          forall i, 0 <= i < 10 -> is_int (orig_a i);
          forall i, 0 <= i < 10 -> is_int (orig_b i);
          forall i, 0 <= i < 10 -> bound_int (orig_a i) 134217728;
          forall i, 0 <= i < 10 -> bound_int (orig_b i) 134217728)
    LOCAL (`(eq a0) (eval_id _a);
           `(eq b0) (eval_id _b);
           `isptr (eval_id _a);
           `isptr (eval_id _b))
    SEP (`(array_at tint sh orig_a 0 10 a0);
         `(array_at tint sh orig_b 0 10 b0))
  POST [ tvoid ]
    PROP ()
    LOCAL ()
    SEP (`(array_at tint sh (partial_add 10 orig_a orig_b) 0 10 a0);
         `(array_at tint sh orig_b 0 10 b0)).

The partial_add function implements a map that reflects the state of the a array at step i of the loop. I've written it like that so that it can also be used in the loop invariant.

And here's the proof: It is quite long, but at least half of that is because I've not used Coq before and I don't know what I'm doing. I wouldn't call it readable however.

Definition inv orig_a a0 orig_b b0 sh :=
  EX i:Z,
    (PROP (0 <= i < 11;
           isptr a0;
           isptr b0)
     LOCAL (`(eq a0) (eval_id _a);
            `(eq b0) (eval_id _b);
            `(eq (Vint (Int.repr i))) (eval_id _i))
   SEP (`(array_at tint sh (partial_add i orig_a orig_b) 0 10 a0);
        `(array_at tint sh orig_b 0 10 b0))).

Lemma mod_nop : forall (i : Z) (m : Z), 0 <= i < m -> m > 0 -> i mod m = i.
Proof.
intros.
rewrite Zdiv.Zmod_eq.
assert(i/m=0).
apply Zdiv_small.
exact H.
rewrite H1.
omega.
exact H0.
Qed.

Lemma body_sumarray: semax_body Vprog Gprog f_add add_spec.
Proof.
start_function.
forward.
forward_while
  (inv orig_a a0 orig_b b0 sh)
  (
     PROP ()
     LOCAL ()
     SEP (`(array_at tint sh (partial_add 10 orig_a orig_b) 0 10 a0);
          `(array_at tint sh orig_b 0 10 b0))).
apply exp_right with 0.
entailer!.
quick_typecheck.
entailer.
cancel.
assert(i=10).
omega.
rewrite H7.
cancel.
forward.
entailer!.
unfold partial_add.
assert(is_int (orig_a i)).
auto.
assert(is_int (orig_b i)).
auto.
if_tac.
omega.
exact H7.
forward.
entailer!.
forward.
entailer.
apply prop_right.
assert(Vint (Int.add _id1 _id) = partial_add (i+1) orig_a orig_b (Int.signed (Int.repr i))).
Focus 2.
symmetry in H9.
apply H9.
unfold partial_add.
assert(Int.signed (Int.repr i) = i).
rewrite Int.signed_repr.
reflexivity.
unfold Int.max_signed, Int.min_signed.
simpl.
omega.
rewrite H9.
if_tac.
unfold Val.add.
rewrite Int.signed_repr_eq in H7.
rewrite mod_nop in H7.
if_tac in H7.
unfold partial_add in H7.
if_tac in H7.
omega.
rewrite Int.signed_repr_eq in H6.
rewrite mod_nop in H6.
if_tac in H6.
unfold Int.add, Int.unsigned, Int.intval, Int.repr.
assert(bound_int (orig_a i) 134217728).
auto.
unfold bound_int in H12.
symmetry in H7, H6.
rewrite H7.
rewrite H6.
reflexivity.
omega.
assert(Int.modulus = 4294967296).
auto.
rewrite H13.
omega.
assert(Int.modulus = 4294967296).
auto.
omega.
assert(i < Int.half_modulus).
assert(Int.half_modulus = 2147483648).
auto.
rewrite H12.
rewrite H12 in H11.
omega.
omega.
assert(Int.modulus = 4294967296).
auto.
omega.
assert(Int.modulus = 4294967296).
auto.
rewrite H11.
omega.
omega.
forward.
unfold inv.
apply exp_right with (Zsucc i).
entailer!.
apply derives_refl'.
apply equal_f.
apply array_at_ext.
intros.
unfold upd.
if_tac.
rewrite H10.
unfold Z.succ.
reflexivity.
unfold partial_add.
if_tac.
if_tac.
reflexivity.
omega.
if_tac.
omega.
reflexivity.
forward.
Qed.

That's certainly quite a chunk! Practically you have to step through the proof in Proof General and see the state evolve to understand anything. Additionally, when in the proof, it would be very useful if subgoals has some sort of description like “show the loop invariant plus not the loop condition results in the loop post-condition” - it's very easy to get lost in the proof. But I do feel that, while it would be a lot of work, I could make progress with VST, while I don't (at the moment) feel with other systems. Prof. Appel recently did a formal verification of the SHA256 from OpenSSL.

However, there is no community around VST that I can find - no mailing list, wiki, etc. The Subversion repo is only accessible via HTTP - you can't clone it from what I can tell. I think I found a (completeness) bug in VST, but the only option would be to try emailing Prof. Appel. (I didn't; I'm probably wrong.)

Still, AutoCorres and especially VST leave me more optimistic than I was at the end of the last post!

Others

I didn't have time to look at everything that deserved attention. Cryptol is one. Cryptol is a tool written in Haskell designed for the specification and implementation of cryptographic code and can call out to SMT solvers to show certain properties. From its internal language it can export implementations to (at least) C.

Next, Frama-C, which I mentioned last time in relation to the Jessie plugin, has other plugins. One that's useful is the value analysis, which can be properly used to eliminate NULL derefs, array overruns etc. In fact, Polar SSL has been fully validated in certain configurations using that. That's certainly very valuable, and you can do it without that multi-page Coq proof! Although such analysis wouldn't have caught the carry bug in Ed25519 nor the reduction bug in Donna, not all code is going to be suitable for formal analysis.

Microsoft also have lots of things. There's Dafny, VCC (Verifier for Concurrent C), Boogie, Havoc, Z3 and others! I didn't look at them because I don't have a Windows box to hand this week (although I see that some are exposed via a web interface too). I was thinking that maybe I'd have time when a Windows box was to hand but, realistically, probably not. So I've updated this post to mention them here. If you are Windows based, you will probably do well to pay attention to them first.

A shallow survey of formal methods for C code

Two interesting things in formally verified software happened recently. The big one was the release of SeL4 - a formally verified L4 microkernel. The second was much smaller, but closer to my usual scope: a paper which showed the correctness of sections of a couple of the assembly implementations of Curve25519.

Overflow and carry bugs are subtle and, although with 32-bit words you might hope to be able to do enough random testing to eliminate them, that's much less plausible with 64-bit implementations. The TweetNaCl paper mentions a bug that lived in one of the assembly implementations of ed25519 for years and I've sinned too:

When I wrote curve25519-donna (back when it was the only 64-bit implementation of Curve25519), I first wrote a 32-bit version which mirrored the implementation of the athlon assembly version. This was just to provide a reference for the 64-bit code, but it was included in the repository for education. It was never really supposed to be used, and wasn't constant-time, but I was aware that it was being used by some groups.

Many years later, someone noticed that it was missing a reduction in the final contraction. Values between 2255-19 and 2255 would be output when they should have been reduced. This turned out to be harmless (as best I can tell), but embarrassing none the less.

And Curve25519 is a very nice curve to implement! Take a look at the overflow considerations for this version of P-256. I hope that I got everything right there, but it's very complex. More formal verification of the software would be welcome!

SeL4 has a multi-stage verification using refinement. Refinement is the process of showing that two pieces of code behave identically where one version is more abstract. SeL4's proof refines from an abstract specification to a Haskell implementation to the C implementation. It also has a SAT-solver based proof that the ARMv6 assembly matches the C implementation.

The refinement proofs are done in Isabelle/HOL which, along with Coq, are the major proof systems. (There are several other systems including HOL Light, HOL4 and Mizar.) Proof systems assist in the construction of automated proofs using simple rules of inference and axioms. Although not used for software verification, Metamath is a good introduction to the idea. It clearly explains its axioms (which Isabelle and Coq are very bad at) and gives an idea for the scale of formal proofs with an example of 2+2 = 4.

The best introduction to Isabelle that I know of is the book Concrete Semantics, although I do wish that it would start with Isar style proofs much sooner. If you're doing work in Isabelle I think you need page 57 printed and stuck to the wall and if you're going through that book and exercises I'd recommend peeking at the Isar chapter sooner.

But Isabelle's traditional workflow is to write in its functional language and export to OCaml or Haskell. That's no help if we're seeking to prove things about C code.

Isabelle's theorems are objects in its underlying ML language and so you can write a program in ML that parses C and use the result in Isabelle. That's what SeL4 does. The underlying framework for expressing imperative languages in Isabelle is Schirmer's Simpl and this imposes some odd limitations on the C that can be processed. For one, all the local variables with the same name in a given .c file have to have the same type because the local state of all functions in a .c file is represented in a single struct. For the same reason, it's not possible to take the address of a local variable.

But we can try parsing a very simple C file with SeL4's parser:

int add(int a, int b) {
  return a+b;
}

That fits SeL4's C subset (called StrictC - see l4v/tools/c-parser/doc in the SeL4 source) and results in this Simpl construct (you'll need to read the Simpl paper to really understand it):

test_global_addresses.add_body ≡
TRY
  Guard SignedArithmetic ⦃-2147483648 ≤ sint ´a + sint ´b ∧ sint ´a + sint ´b ≤ 2147483647⦄
   (creturn global_exn_var_'_update ret__int_'_update (λs. a_' s + b_' s));;
  Guard DontReach {} SKIP
CATCH SKIP
END

There's clearly an overflow check in there, which is good, but to call the process of starting to prove things about it intimidating to the beginner would be an understatement. That and the oddities of the C subset motivated me to look further.

Dependent types are closely related to this problem so I checked Wikipedia for the dependently typed languages which support imperative programming and that are still active. That's just ATS and F*, according to that table. ATS doesn't deal with overflows and, while F*/F7 is worthy of a look because of miTLS, it's a CLR language and so not applicable here.

Next up, based on searches, is SPARK. SPARK is a subset of Ada designed for verification. There's a commercial company behind it, but versions are published under the GPL. It even comes with an IDE.

SPARK uses an SMT solver to prove correctness, which is very different from a proof in Isabelle. While an Isabelle proof is, by construction, just an application of axioms and rules of inference, an SMT solver is essentially a magic box into which you feed a proposition and out of which (maybe) comes a true/false answer. Trusting in an SMT solver is much, much stronger than not doing so, but it is a step down from a formal proof system. SPARK uses a version of Why3 to abstract over SMT solvers and we'll discuss Why3 more later.

Anyway, here's the same, trivial function in Ada:

function Add(X, Y : in Integer_32) return Integer_32 is
begin
   return X + Y;
end Add;

If we ask SPARK to process that then it generates verification conditions (VCs): one for each possible thing that could go wrong - overflow, array indexing etc. For that code it throws an error saying that X + Y might overflow, which is promising! In order to eliminate that, one needs to prove the verification condition that says that the addition is safe by adding preconditions to the function (which then become VCs at each callsite):

function Add(X, Y : in Integer_32) return Integer_32 with
  Pre => X < 2**30 and X > -2*30 and Y < 2**30 and Y > -2**30;

With that precondition, the function is accepted. I should note that this is SPARK 2014; there are many older versions (SPARK has been going for a long time) and they kept preconditions and the like in comments. Lots of pages about SPARK still talk about the older versions and major projects in SPARK (like Ironsides - a DNS server) still use the older versions.

With that initial success, let's try something a tiny bit more involved. Curve25519 deals with 255-bit integers but CPUs don't. So, in the same way that a 64-int integer could be represented with a pair of 32-bit integers, 255-bit integers are represented in Donna with ten, 32-bit integers. This representation is a little odd because the values aren't spaced at 32-bit multiples, but rather at alternating 26 and 25 bit positions. That also means that the representation is non-unique (setting bit 26 of the first word is equal to setting bit zero of the second) and that's done for performance. See this for more details, but it's not important right now.

Adding these 255-bit integers is done simply by adding the underlying words without carry, with suitable bounds on the inputs so that overflow doesn't happen:

type Ints is array (Integer range 0 .. 9) of Integer_32;

function Add (X, Y : in Ints) return Ints with
   Pre => (for all i in Ints'Range => X(i) < 2**30 and X(i) > -2**30 and Y(i) < 2**30 and Y(i) > -2**30),
   Post => (for all i in Ints'Range => Add'Result(i) = X(i) + Y(i));

function Add (X, Y : in Ints) return Ints is
   Sum : Ints;
begin
   for i in Ints'Range loop
      Sum(i) := X(i) + Y(i);
   end loop;

   return Sum;
end Add;

That mostly works, but SPARK can't prove that the result is X(i) + Y(i) for all i despite it being exactly what the function says. One really needs to spoon feed the prover: in this case with a loop invariant in the for loop:

for i in Ints'Range loop
   pragma Loop_Invariant (for all j in Ints'First .. i-1 => Sum(j) = X(j) + Y(j));
   Sum(i) := X(i) + Y(i);
end loop;

Sadly, that doesn't work either, despite being correct, because SPARK doesn't seem to like i-1 when i might be zero. So, trying again:

for i in Ints'Range loop
   Sum(i) := X(i) + Y(i);
   pragma Loop_Invariant (for all j in Ints'First .. i => Sum(j) = X(j) + Y(j));
end loop;

That one works, but now SPARK isn't sure whether uninitialised elements of Sum are being referenced. The property of being initialised isn't part of SPARK's logic and seems that the only way to proceed here is to disable that warning!

But, there's some promise! We would now like to show a higher level property of Add: that the 255-bit integer that the returned array represents is the sum of the integers that the input arrays represent. This means that the logic needs to deal with arbitrary integers and that we need to write a function in the logic that turns an array into an abstract integer.

Enabling arbitrary integers in the logic is possible (although good luck finding how to do it in the documentation: the answer is to put a file called gnat.adc in the gnatprove subdirectory of the project containing pragma Overflow_Mode (General => Strict, Assertions => Eliminated);). However, I failed at writing functions in the logic without dropping down into code and triggering overflow checks. It appears that ghost functions should be able to do this but, after getting the answer to a documentation bug on StackOverflow, actually trying to use any logic functions, even the identity function, caused the proof not to terminate. SPARK claims to be able to use Isabelle for proofs but I couldn't get that to work at all: strace says that the .thy file isn't ever written.

Stuck, I moved on from SPARK and tried Frama-C and its Jessie plugin. Even if SPARK was working for me, using C (even a subset) has advantages: it's convenient to use in other projects, there exist verified compilers for it and there's an extension to CompCert that allows for static verification of constant-time behaviour (although I've lost the paper!)

So here's the same function in Frama-C:

/*@ requires \valid_range(out, 0, 9);
  @ requires \valid_range(x, 0, 9);
  @ requires \valid_range(y, 0, 9);
  @ requires \forall integer i; 0 <= i < 10 ==>
      (x[i] > -1000 && x[i] < 1000);
  @ requires \forall integer i; 0 <= i < 10 ==>
      (y[i] > -1000 && y[i] < 1000);
  @ ensures \forall integer i; 0 <= i < 10 ==> (out[i] == x[i] + y[i]);
*/
void add(int *out, const int *x, const int *y) {
  /*@ loop invariant i >= 0 && i <= 10 &&
        (\forall integer j; 0 <= j < i ==> out[j] == x[j] + y[j]);
    @ loop variant 10-i;
    */
  for (int i = 0; i < 10; i++) {
    out[i] = x[i] + y[i];
  }
}

Note that this time we not only need a loop invariant to show the postcondition, but we also need a loop variant to show that the for loop terminates: you really need to spoon feed the verification! But Frama-C/Jessie has no problem with functions in the logic:

/*@ logic integer felemvalue (int *x) =
      x[0] + x[1] * (1 << 26) + x[2] * (1 << 51) + x[3] * (1 << 77) +
      x[4] * (1 << 102) + x[5] * 340282366920938463463374607431768211456 +
      x[6] * 11417981541647679048466287755595961091061972992 +
      x[7] * 766247770432944429179173513575154591809369561091801088 +
      x[8] * 25711008708143844408671393477458601640355247900524685364822016 +
      x[9] * 1725436586697640946858688965569256363112777243042596638790631055949824; */

That function maps from an array to the abstract integer that it represents. Note that the terms switched from using shifts to represent the constants to literal numbers. That's because the proof process suddenly became non-terminating (in a reasonable time) once the values reached about 104-bits, but the literals still worked.

With that logic function in hand, the we can easily prove the higher-level concept that we wanted as a post condition of the add function:

@ ensures felemvalue(out) == felemvalue(x) + felemvalue(y);

Flushed with success, I moved onto the next most basic function: multiplication of 255-bit integers. The code that Donna uses is the textbook, polynomial multiplication code. Here's how the function starts:

/* Multiply two numbers: output = in2 * in
 *
 * output must be distinct to both inputs. The inputs are reduced coefficient
 * form, the output is not.
 *
 * output[x] <= 14 * the largest product of the input limbs. */
static void fproduct(limb *output, const limb *in2, const limb *in) {
  output[0] =       ((limb) ((s32) in2[0])) * ((s32) in[0]);
  output[1] =       ((limb) ((s32) in2[0])) * ((s32) in[1]) +
                    ((limb) ((s32) in2[1])) * ((s32) in[0]);
  output[2] =  2 *  ((limb) ((s32) in2[1])) * ((s32) in[1]) +
                    ((limb) ((s32) in2[0])) * ((s32) in[2]) +
                    ((limb) ((s32) in2[2])) * ((s32) in[0]);
  …

Each of the lines is casting a 64-bit number down to 32 bits and then doing a 32×32⇒64 multiplication. (The casts down to 32 bits are just to tell the compiler not to waste time on a 64×64⇒64 multiplication.) In Frama-C/Jessie we need to show that the casts are safe, multiplications don't overflow and nor do the sums and multiplications by two etc. This should be easy. Here are the preconditions that set the bounds on the input.

/*@ requires \valid_range(output, 0, 18);
  @ requires \valid_range(in, 0, 9);
  @ requires \valid_range(in2, 0, 9);
  @ requires \forall integer i; 0 <= i < 10 ==> -134217728 < in[i] < 134217728;
  @ requires \forall integer i; 0 <= i < 10 ==> -134217728 < in2[i] < 134217728;
*/

However, Why3, which is the prover backend that Jessie (and SPARK) uses, quickly gets bogged down. In an attempt to help it out, I moved the casts to the beginning of the function and put the results in arrays so that the code was less complex.

As you can see in the Jessie documentation, the Why3 UI allows one to do transforms on the verification conditions to try and make life easier for the prover. Quickly, splitting a requirement becomes necessary. But this throws away a lot of information. In this case, each of the 32×32⇒64 multiplications is assumed to fit only within its bounds and so the intermediate bounds need to be reestablished every time. This means that the equations have to be split like this:

limb t, t2;
t = ((limb) in2_32[0]) * in_32[1];
//@ assert -18014398509481984 < t < 18014398509481984;
t2 = ((limb) in2_32[1]) * in_32[0];
//@ assert -18014398509481984 < t2 < 18014398509481984;
output[1] = t + t2;

That's only a minor problem really, but the further one goes into the function, the harder the prover needs to work for some reason. It's unclear why because every multiplication has the same bounds - they are all instances of the same proof. But it's very clear that the later multiplications are causing more work.

Why3 is an abstraction layer for SMT solvers and a large number are supported (see “External provers” on its homepage for a list). I tried quite a few and, for this task, Z3 is clearly the best with CVC4 coming in second. However, Z3 has a non-commercial license which is very worrying - does that mean that a verified version of Curve25519 that used Z3 has to also have a non-commercial license? So I stuck to using CVC4.

However, about a quarter of the way into the function, both CVC4 and Z3 are stuck. Despite the bounds being a trivial problem, and just instances of the same problem that they can solve at the beginning of the function, somehow either Why3 is doing something bad or the SMT solvers are failing to discard irrelevant facts. I left it running overnight and Z3 solved one more instance after six hours but achieved nothing after 12 hours on the next one. Splitting and inlining the problem further in the Why3 GUI didn't help either.

Like SPARK, Jessie can use Isabelle for proofs too (and for the same reason: they are both using Why3, which supports Isabelle as a backend). It even worked this time, once I added Real to the Isabelle imports. However, in the same way that the C parser in Isabelle was an ML function that created Isabelle objects internally, the Why3 connection to Isabelle is a function (why3_open) that parses an XML file. This means that the proof has large numbers of generated variable names (o33, o34, etc) and you have no idea which intermediate values they are. Additionally, the bounds problem is something that Isabelle could have solved automatically, but the assumptions that you can use are similarly autonamed as things like local.H165. In short, the Isabelle integration appears to be unworkable.

Perhaps I could split up each statement in the original code into a separate function and then write the statement again in the logic in order in order to show the higher level properties, but at some point you have to accept that you're not going to be able to dig this hole with a plastic spoon.

Conclusion

The conclusion is a bit disappointing really: Curve25519 has no side effects and performs no allocation, it's a pure function that should be highly amenable to verification and yet I've been unable to find anything that can get even 20 lines into it. Some of this might be my own stupidity, but I put a fair amount of work into trying to find something that worked.

There seems to be a lot of promise in the area and some pieces work well (SMT solvers are often quite impressive, the Frama-C framework appears to be solid, Isabelle is quite pleasant) but nothing I found worked well together, at least for verifying C code. That makes efforts like SeL4 and Ironsides even more impressive. However, if you're happy to work at a higher level I'm guessing that verifying functional programs is a lot easier going.

HSTS for new TLDs

Whatever you might think of them, the new TLDs are rapidly arriving. New TLDs approved so far this month include alsace, sarl, iinet, poker, gifts, restaurant, fashion, tui and allfinanz. The full list for last month is over twice as long.

That means that there's lots of people currently trying to figure out how to differentiate themselves from other TLDs. Here's an idea: why not ask me to set HSTS for the entire TLD? That way, every single site runs over HTTPS, always. It strikes me that could be useful if you're trying to build trust with users unfamiliar with the zoo of new domains.

(I can't speak for Firefox and Safari but I think it's safe to assume that Firefox would be on board with this. It's still unclear whether IE's support for HSTS will include preloading.)

I'm guessing that, with such a large number of new TLDs, I should be able to reach at least some operators of them via this blog post.

Encrypting streams

When sending data over the network, chunking is pretty much a given. TLS has a maximum record size of 16KB and this fits neatly with authenticated encryption APIs which all operate on an entire message at once.

But file encryption frequently gets this wrong. Take OpenPGP: it bulk encrypts the data and sticks a single MAC on the end. Ideally everyone is decrypting to a temporary file and waiting until the decryption and verification is complete before touching the plaintext, but it takes a few seconds of searching to find people suggesting commands like this:

gpg -d your_archive.tgz.gpg | tar xz

With that construction, tar receives unauthenticated input and will happily extract it to the filesystem. An attacker doesn't (we assume) know the secret key, but they can guess at the structure of the plaintext and flip bits in the ciphertext. Bit flips in the ciphertext will produce a corresponding bit flip in the plaintext, followed by randomising the next block. I bet some smart attacker can do useful things with that ability. Sure the gpg command will exit with an error code, but do you think that the shell script writer carefully handled that case and undid the changes to the filesystem?

The flaw here isn't in CFB mode's malleability, but in OpenPGP forcing the use of unauthenticated plaintext in practical situations. (Indeed, if you are ever thinking about the malleability of ciphertext, you have probably already lost.)

I will even claim that the existance of an API that can operate in a streaming fashion over large records (i.e. will encrypt and defer the authenticator and will decrypt and return unauthenticated plaintext) is a mistake. Not only is it too easy to misunderstand and misuse (like the gpg example above) but, even if correctly buffered in a particular implementation, the existance of large records may force other implementations to do dangerous things because of a lack of buffer space.

If large messages are chunked at 16KB then the overhead of sixteen bytes of authenticator for every chunk is only 0.1%. Additionally, you can safely stream the decryption (as long as you can cope with truncation of the plaintext).

Although safer in general, when chunking one has to worry that an attacker hasn't reordered chunks, hasn't dropped chunks from the start and hasn't dropped chunks from the end. But sadly there's not a standard construction for taking an AEAD and making a scheme suitable for encrypting large files (AERO might be close, but it's not quite what I have in mind). Ideally such a scheme would take an AEAD and produce something very like an AEAD in that it takes a key, nonce and additional data, but can safely work in a streaming fashion. I don't think it need be very complex: take 64 bits of the nonce from the underlying AEAD as the chunk number, always start with chunk number zero and feed the additional data into chunk zero with a zero byte prefix. Prefix each chunk ciphertext with a 16 bit length and set the MSB to indicate the last chunk and authenticate that indication by setting the additional data to a single, 1 byte. The major worry might be that for many underlying AEADs, taking 64 bits of the nonce for the chunk counter leaves one with very little (or none!) left.

That requires more thought before using it for real but, if you are ever building encryption-at-rest, please don't mess it up like we did 20 years ago. (Although, even with that better design, piping the output into tar may still be unwise because an attacker can truncate the stream at a chunk boundary: perhaps eliminating important files in the process.)

Update: On Twitter, zooko points to Tahoe-LAFS as an example of getting it right. Additionally, taking the MAC of the current state of a digest operation and continuing the operation has been proposed for sponge functions (like SHA-3) under the name MAC-and-continue. The exact provenance of this isn't clear, but I think it might have been from the Keccak team in this paper. Although MAC-and-continue doesn't allow random access, which might be important for some situations.

BoringSSL

Earlier this year, before Apple had too many goto fails and GnuTLS had too few, before everyone learnt that TLS heart-beat messages were a thing and that some bugs are really old, I started a tidy up of the OpenSSL code that we use at Google.

We have used a number of patches on top of OpenSSL for many years. Some of them have been accepted into the main OpenSSL repository, but many of them don’t mesh with OpenSSL’s guarantee of API and ABI stability and many of them are a little too experimental.

But as Android, Chrome and other products have started to need some subset of these patches, things have grown very complex. The effort involved in keeping all these patches (and there are more than 70 at the moment) straight across multiple code bases is getting to be too much.

So we’re switching models to one where we import changes from OpenSSL rather than rebasing on top of them. The result of that will start to appear in the Chromium repository soon and, over time, we hope to use it in Android and internally too.

There are no guarantees of API or ABI stability with this code: we are not aiming to replace OpenSSL as an open-source project. We will still be sending them bug fixes when we find them and we will be importing changes from upstream. Also, we will still be funding the Core Infrastructure Initiative and the OpenBSD Foundation.

But we’ll also be more able to import changes from LibreSSL and they are welcome to take changes from us. We have already relicensed some of our prior contributions to OpenSSL under an ISC license at their request and completely new code that we write will also be so licensed.

(Note: the name is aspirational and not yet a promise.)

Early ChangeCipherSpec Attack

OpenSSL 1.0.1h (and others) were released today with a scary looking security advisiory and that's always an event worth looking into. (Hopefully people are practiced at updating OpenSSL now!)

Update: the original reporter has a blog post up. Also, I won't, personally, be answering questions about specific Google services. (I cut this blog post together from notes that I'm writing for internal groups to evaluate and this is still very fresh.)

Update: my initial thoughts from looking at the diff still seem to be holding up. Someone is welcome to write a more detailed analysis than below. HP/ZDI have a write up of one of the DTLS issues.

There are some critical bug fixes to DTLS (TLS over datagram transports, i.e. UDP), but most people will be more concerned about the MITM attack against TLS (CVE-2014-0224).

The code changes are around the rejection of ChangeCipherSpec messages, which are messages sent during the TLS handshake that mark the change from unencrypted to encrypted traffic. These messages aren't part of the handshake protocol itself and aren't linked into the handshake state machine in OpenSSL. Rather there's a check in the code that they are only received when a new cipher is ready to be used. However, that check (for s->s3->tmp.new_cipher in s3_pkt.c) seems reasonable, but new_cipher is actually set as soon as the cipher for the connection has been decided (i.e. once the ServerHello message has been sent/received), not when the cipher is actually ready! It looks like this is the problem that's getting fixed in this release.

Here's the code in question that handles a ChangeCipherSpec message:

int ssl3_do_change_cipher_spec(SSL *s)
	{
	int i;
	const char *sender;
	int slen;

	if (s->state & SSL_ST_ACCEPT)
		i=SSL3_CHANGE_CIPHER_SERVER_READ;
	else
		i=SSL3_CHANGE_CIPHER_CLIENT_READ;

	if (s->s3->tmp.key_block == NULL)1
		{
		if (s->session == NULL)
			{
			/* might happen if dtls1_read_bytes() calls this */
			SSLerr(SSL_F_SSL3_DO_CHANGE_CIPHER_SPEC,SSL_R_CCS_RECEIVED_EARLY);
			return (0);
			}

		s->session->cipher=s->s3->tmp.new_cipher;
		if (!s->method->ssl3_enc->setup_key_block(s)) return(0); 2
		}

	if (!s->method->ssl3_enc->change_cipher_state(s,i))
		return(0);

	/* we have to record the message digest at
	 * this point so we can get it before we read
	 * the finished message */
	if (s->state & SSL_ST_CONNECT)
		{
		sender=s->method->ssl3_enc->server_finished_label;
		slen=s->method->ssl3_enc->server_finished_label_len;
		}
	else
		{
		sender=s->method->ssl3_enc->client_finished_label;
		slen=s->method->ssl3_enc->client_finished_label_len;
		}

	i = s->method->ssl3_enc->final_finish_mac(s,
		sender,slen,s->s3->tmp.peer_finish_md); 3
	if (i == 0)
		{
		SSLerr(SSL_F_SSL3_DO_CHANGE_CIPHER_SPEC, ERR_R_INTERNAL_ERROR);
		return 0;
		}
	s->s3->tmp.peer_finish_md_len = i;

	return(1);
	}

If a ChangeCipherSpec message is injected into the connection after the ServerHello, but before the master secret has been generated, then ssl3_do_change_cipher_spec will generate the keys (2) and the expected Finished hash (3) for the handshake with an empty master secret. This means that both are based only on public information. Additionally, the keys will be latched because of the check at (1) - further ChangeCipherSpec messages will regenerate the expected Finished hash, but not the keys.

The oldest source code on the OpenSSL site is for 0.9.1c (Dec 28, 1998) and the affected code appears almost unchanged. So it looks like this bug has existed for 15+ years.

The implications of this are pretty complex.

For a client there's an additional check in the code that requires that a CCS message appear before the Finished and after the master secret has been generated. An attacker can still inject an early CCS too and the keys will be calculated with an empty master secret. Those keys will be latched - another CCS won't cause them to be recalculated. However, when sending the second CCS that the client code requires, the Finished hash is recalculated with the correct master secret. This means that the attacker can't fabricate an acceptable Finished hash. This stops the obvious, generic impersonation attack against the client.

For a server, there's no such check and it appears to be possible to send an early CCS message and then fabricate the Finished hash because it's based on an empty master secret. However, that doesn't obviously gain an attacker anything. It would be interesting if a connection with a client certificate could be hijacked, but there is a check in ssl3_get_cert_verify that a CCS hasn't been already processed so that isn't possible.

Someone may be able to do something more creative with this bug; I'm not ruling out that there might be other implications.

Things change with an OpenSSL 1.0.1 server however. In 1.0.1, a patch of mine was included that moves the point where Finished values are calculated. The server will now use the correct Finished hash even if it erroneously processed a CCS, but this interacts badly with this bug. The server code (unlike the client) won't accept two CCS messages in a handshake. So, if an attacker injects an early CCS at the server to fixate the bad keys, then it's not possible for them to send a second in order to get it to calculate the correct Finished hash. But with 1.0.1, the server will use the correct Finished hash and will reply with the correct hash to the client. This explains the 1.0.1 and 1.0.2 mention in the advisory. With any OpenSSL client talking to an OpenSSL 1.0.1 server, an attacker can inject CCS messages to fixate the bad keys at both ends but the Finished hashes will still line up. So it's possible for the attacker to decrypt and/or hijack the connection completely.

The good news is that these attacks need man-in-the-middle position against the victim and that non-OpenSSL clients (IE, Firefox, Chrome on Desktop and iOS, Safari etc) aren't affected. None the less, all OpenSSL users should be updating.

Update: I hacked up the Go TLS library to perform the broken handshake in both 0.9.8/1.0.0 and 1.0.1 modes. There's a patch for crypto/tls and a tool that will attempt to do those broken handshakes and tell you if either worked. For example (at the time of writing):

$ ./earlyccs_check google.com
Handshake failed with error: remote error: unexpected message
Looks ok.
$ ./earlyccs_check github.com
Server is affected (1.0.1).
$ ./earlyccs_check amazon.com
Server is affected (0.9.8 or 1.0.0).

Matching primitive strengths

It's a common, and traditional, habit to match the strengths of cryptographic primitives. For example, one might design at the 128-bit security level and pick AES-128, SHA-256 and P-256. Or if someone wants “better” security, one might target a 192-bit security level with AES-192, SHA-384 and P-384. NIST does this sort of thing, for example in 800-57/3, section 5.2.

The logic underlying this is that an attacker will attack at a system's weakest point, so anything that is stronger than the weakest part is a waste. But, while matching up the security level might have made sense in the past, I think it warrants reconsideration now.

Consider the energy needed to do any operation 2128 times: Landauer tells us that erasing a bit takes a minimum amount of energy any actual crypto function will involve erasing many bits. But, for the sake of argument, let's say that we're doing a single bit XOR. If we take the minimum energy at room temperature (to save energy on refrigeration) we get 2.85×10-21 × 2128 = 0.97×1018J. That's about “yearly electricity consumption of South Korea as of 2009” according to Wikipedia. If we consider that actually doing anything useful in each of those 2128 steps is going to take orders of magnitude more bit operations, and that we can't build a computer anywhere near the theoretic limit, we can easily reach, say, 5.5×1024J, which is “total energy from the Sun that strikes the face of the Earth each year”.

(The original version of this post discussed incrementing a 128-bit counter and said that it took two erasures per increment. Samuel Neves on Twitter pointed out that incrementing the counter doesn't involve destroying the information of the old value and several people pointed out that one can hit all the bit-patterns with a Gray code and thus only need to flip a single bit each time. So I changed the above to use a one-bit XOR as the basic argument still holds.)

So any primitive at or above the 128-bit security level is equally matched today, because they are all effectively infinitely strong.

One way to take this is to say that anything above 128 bits is a waste of time. Another is to say that one might want to use primitives above that level, but based on the hope that discontiguous advances leave it standing but not the lesser primitives. The latter involves a lot more guesswork than the old practice of making the security levels match up. Additionally, I don't think most people would think that an analytic result is equally likely for all classes of primitives, so one probably doesn't want to spend equal amounts of them all.

Consider the difference between P-256 and P-384 (or curve25519 and Curve41417 or Hamburg's Goldilocks-448). If they are all infinitely strong today then the reason to want to use a larger curve is because you expect some breakthrough, but not too much of a breakthrough. Discrete log in an elliptic-curve group is a square-root work function today. For those larger curves to make sense you have to worry about a breakthrough that can do it in cube-root time, but not really forth-root and certainly not fifth-root. How much would you pay to insure yourself against that, precise possibility? And how does that compare against the risk of a similar breakthrough in AES?

It is also worth considering whether the security-level is defined in a way matches your needs. AES-128 is generally agreed to have a 128-bit security level, but if an attacker is happy breaking any one of 2n keys then they can realise a significant speedup. An “unbalanced” choice of cipher key size might be reasonable if that attack is a concern.

So maybe a 256-bit cipher (ciphers do pretty well now, just worry about Grover's algorithm), SHA-384 (SHA-256 should be fine, but hash functions have, historically, had a bad time) and 128512(? see below)-bit McEliece is actually “balanced” these days.

(Thanks to Mike Hamburg and Trevor Perrin, with whom I was discussing this after Mike presented his Goldilocks curve. Thanks to JP Aumasson on Twitter and pbsd on HN for suggesting multi-target attacks as a motivation for large cipher key sizes. Also to pbsd for pointing out a quantum attack against McEliece that I wasn't aware of.)

SHA-256 certificates are coming

It's a neat result in cryptography that you can build a secure hash function given a secure signature scheme, and you can build a secure signature scheme given a secure hash function. However, far from the theory, in the real world, lots of signatures today depend on SHA-1, which is looking increasingly less like a secure hash function.

There are lots of properties by which one can evaluate a hash function, but the most important are preimage resistance (if I give you a hash, can you find an input with that hash value?), second preimage resistance (if I give you a message, can you find another that hashes to the same value?) and collision resistance (can you find two messages with the same hash?). Of those, the third appears to be much harder to meet than the first two, based on historical results against hash functions.

Back when certificates were signed with MD5, a chosen-prefix collision attack (i.e. given two messages, can you append different data to each so that the results have the same hash?) against MD5 was used at least twice to break the security of the PKI. First the MD5 Collisions Inc demo against RapidSSL and then the Flame malware.

Today, SHA-1 is at the point where a collision attack is estimated to take 261 work and a chosen-prefix collision to take 277. Both are below the design strength of 280 and even that design strength isn't great by today's standards.

We hope that we have a second line of defense for SHA-1: after the MD5 Collisions Inc demo, CAs were required to use random serial numbers. A chosen-prefix attack requires being able to predict the certificate contents that will be signed and the random serial number should thwart that. With random serials we should be resting on a stronger hash function property called target-collision resistance. (Although I'm not aware of any proofs that random serials actually put us on TCR.)

Still, it would be better not to depend on those band-aids and to use a hash function with a better design strength while we do it. So certificates are starting to switch to using SHA-256. A large part of that shift came from Microsoft forbidding certificates using SHA-1 starting in 2016.

For most people, this will have no effect. Twitter ended up with a SHA-256 certificate after they replaced their old one because of the OpenSSL heartbeat bug. So, if you can still load Twitter's site, then you're fine.

But there are a lot of people using Windows XP prior to Service Pack 3, and they will have problems. We've already seen lots of user reports of issues with Twitter (and other sites) from these users. Wherever possible, installing SP3 is the answer. (Or, better yet, updating from Windows XP.)

There are also likely to be problems with embedded clients, old phones etc. Some of these may not come to light for a while.

We've not yet decided what Google's timeline for switching is, but we will be switching prior to 2016. If you've involved with a something embedded that speaks to Google over HTTPS (and there's a lot of it), now is the time to test https://cert-test.sandbox.google.com, which is using a SHA-256 certificate. I'll be using other channels to contact the groups that we know about but, on the web, we don't always know what's talking to us.

Revocation still doesn't work

I was hoping to be done with revocation for a bit, but sadly not.

GRC have published a breathless piece attacking a straw man argument: “Google tells us ... that Chrome's unique CRLSet solution provides all the protection we need”. I call it a straw man because I need only quote my own posts that GRC have read to show it.

The original CRLSets announcement contained two points. Firstly that online revocation checking doesn't work and that we were switching it off in Chrome in most cases. Secondly, that we would be using CRLSets to avoid having to do full binary updates in order to respond to emergency incidents.

In the last two paragraphs I mentioned something else. Quoting myself:

“Since we're pushing a list of revoked certificates anyway, we would like to invite CAs to contribute their revoked certificates (CRLs) to the list. We have to be mindful of size, but the vast majority of revocations happen for purely administrative reasons and can be excluded. So, if we can get the details of the more important revocations, we can improve user security. Our criteria for including revocations are:

  1. The CRL must be crawlable: we must be able to fetch it over HTTP and robots.txt must not exclude GoogleBot.
  2. The CRL must be valid by RFC 5280 and none of the serial numbers may be negative.
  3. CRLs that cover EV certificates are taken in preference, while still considering point (4).
  4. CRLs that include revocation reasons can be filtered to take less space and are preferred.”

In short, since we were pushing revocations anyway, maybe we could get some extra benefit from it. It would be better than nothing, which is what browsers otherwise have with soft-fail revocation checking.

I mentioned it again, last week (emphasis added):

“We compile daily lists of some high-value revocations and use Chrome's auto-update mechanism to push them to Chrome installations. It's called the CRLSet and it's not complete, nor big enough to cope with large numbers of revocations, but it allows us to react quickly to situations like Diginotar and ANSSI. It's certainly not perfect, but it's more than many other browsers do.”

“The original hope with CRLSets was that we could get revocations categorised into important and administrative and push only the important ones. (Administrative revocations occur when a certificate is changed with no reason to suspect that any key compromise occurred.) Sadly, that mostly hasn't happened.”

And yet, GRC managed to write pages (including cartoons!) exposing the fact that it doesn't cover many revocations and attacking Chrome for it.

They also claim that soft-fail revocation checking is effective:

The claim is that a user will download and cache a CRL while not being attacked and then be protected from a local attacker using a certificate that was included on that CRL. (I'm paraphrasing; you can search for “A typical Internet user” in their article for the original.)

There are two protocols used in online revocation checking: OCSP and CRL. The problem is that OCSP only covers a single certificate and OCSP is used in preference because it's so much smaller and thus removes the need to download CRLs. So the user isn't expected to download and cache the CRL anyway. So that doesn't work.

However, it's clear that someone is downloading CRLs because Cloudflare are spending half a million dollars a month to serve CRLs. Possibly it's non-browser clients but the bulk is probably CAPI (the library that IE, Chrome and other software on Windows typically uses - although not Firefox). A very obscure fact about CAPI is that it will download a CRL when it has 50 or more OCSP responses cached for a given CA certificate. But CAs know this and split their CRLs so that they don't get hit with huge bandwidth bills. But a split CRL renders the claimed protection from caching ineffective, because the old certificate for a given site is probably in a different part.

So I think the claim is that doing blocking OCSP lookups is a good idea because, if you use CAPI on Windows, then you might cache 50 OCSP responses for a given CA certificate. Then you'll download and cache a CRL for a while and then, depending on whether the CA splits their CRLs, you might have some revocations cached for a site that you visit.

It seems that argument is actually for something like CRLSets (download and cache revocations) over online checking, it's just a Rube Goldberg machine to, perhaps, implement it!

So, once again, revocation doesn't work. It doesn't work in other browsers and CRLSets aren't going to cover much in Chrome, as I've always said. GRC's conclusions follow from those errors and end up predictably far from the mark.

In order to make this post a little less depressing, let's consider whether its reasonable to aim to cover everything with CRLSets. (Which I mentioned before as well, but I'll omit the quote.) GRC quote numbers from Netcraft claiming 2.85 million revocations, although some are from certificate authorities not trusted by mainstream web browsers. I spent a few minutes harvesting CRLs from the CT log. This only includes certificates that are trusted by reasonable number of people and I only waited for half the log to download. From that half I threw in the CRLs included by CRLSets and got 2356 URLs after discarding LDAP ones.

I tried downloading them and got 2164 files. I parsed them and eliminated duplicates and my quick-and-dirty parser skipped quite a few. None the less, this very casual search found 1,062 issuers and 4.2 million revocations. If they were all dropped into the CRLSet, it would take 39MB.

So that's a ballpark figure, but we need to design for something that will last a few years. I didn't find figures on the growth of HTTPS sites, but Netcraft say that the number of web sites overall is growing at 37% a year. It seems reasonable that the number of HTTPS sites, thus certificates, thus revocations will double a couple of times in the next five years.

There's also the fact that if we subsidise revocation with the bandwidth and computers of users, then demand for revocation will likely increase. Symantec, I believe, have experimented with certificates that are issued for the maximum time (five years at the moment) but sold in single year increments. When it comes to the end of the year, they'll remind you to renew. If you do, you don't need to do anything, otherwise the certificate gets revoked. This seems like a good deal for the CA so, if we make revocation really effective, I'd expect to see a lot more of it. Perhaps factor in another doubling for that.

So we would need to scale to ~35M revocations in five years, which is ~300MB of raw data. That would need to be pushed to all users and delta updated. (I don't have numbers for the daily delta size.)

That would obviously take a lot of bandwidth and a lot of resources on the client. We would really need to aim at mobile as well, because that is likely to be the dominant HTTPS client in a few years, making storage and bandwidth use even more expensive.

Some probabilistic data structure might help. I've written about that before. Then you have to worry about false positives and the number of reads needed per validation. Phones might use Flash for storage, but it's remarkably slow.

Also, would you rather spend those resources on revocation, or on distributing the Safe Browsing set to mobile devices? Or would you spend the memory on improving ASLR on mobile devices? Security is big picture full of trade-offs. It's not clear the revocation warrants all that spending.

So, if you believe that downloading and caching revocation is the way forward, I think those are the parameters that you have to deal with.

No, don't enable revocation checking

Revocation checking is in the news again because of a large number of revocations resulting from precautionary rotations for servers affected by the OpenSSL heartbeat bug. However, revocation checking is a complex topic and there's a fair amount of misinformation around. In short, it doesn't work and you are no more secure by switching it on. But let's quickly catch up on the background:

Certificates bind a public key and an identity (commonly a DNS name) together. Because of the way the incentives work out, they are typically issued for a period of several years. But events occur and sometimes the binding between public key and name that the certificate asserts becomes invalid during that time. In the recent cases, people who ran a server that was affected by the heartbeat bug are worried that their private key might have been obtained by someone else and so they want to invalidate the old binding, and bind to a new public key. However, the old certificates are still valid and so someone who had obtained that private key could still use them.

Revocation is the process of invalidating a certificate before its expiry date. All certificates include a statement that essentially says “please phone the following number to check that this certificate has not been revoked”. The term online revocation checking refers to the process of making that phone call. It's not actually a phone call, of course, rather browsers and other clients can use a protocol called OCSP to check the status of a certificate. OCSP supplies a signed statement that says that the certificate is still valid (or not) and, critically, the OCSP statement itself is valid for a much shorter period of time, typically a few days.

The critical question is what to do in the event that you can't get an answer about a certificate's revocation status. If you reject certificates when you can't get an answer, that's called hard-fail. If you accept certificates when you can't get an answer that's called soft-fail.

Everyone does soft-fail for a number of reasons on top of the general principle that single points of failure should be avoided. Firstly, the Internet is a noisy place and sometimes you can't get through to OCSP servers for some reason. If you fail in those cases then the level of random errors increases. Also, captive portals (e.g. hotel WiFi networks where you have to “login” before you can use the Internet) frequently use HTTPS (and thus require certificates) but don't allow you to access OCSP servers. Lastly, if everyone did hard-fail then taking down an OCSP service would be sufficient to take down lots of Internet sites. That would mean that DDoS attackers would turn their attention to them, greatly increasing the costs of running them and it's unclear whether the CAs (who pay those costs) could afford it. (And the disruption is likely to be significant while they figure that out.)

So soft-fail is the only viable answer but it has a problem: it's completely useless. But that's not immediately obvious so we have to consider a few cases:

If you're worried about an attacker using a revoked certificate then the attacker first must be able to intercept your traffic to the site in question. (If they can't even intercept the traffic then you didn't need any authentication to protect it from them in the first place.) Most of the time, such an attacker is near you. For example, they might be running a fake WiFi access point, or maybe they're at an ISP. In these cases the important fact is that the attacker can intercept all your traffic, including OCSP traffic. Thus they can block OCSP lookups and soft-fail behaviour means that a revoked certificate will be accepted.

The next class of attacker might be a state-level attacker. For example, Syria trying to intercept Facebook connections. These attackers might not be physically close, but they can still intercept all your traffic because they control the cables going into and out of a country. Thus, the same reasoning applies.

We're left with cases where the attacker can only intercept traffic between a user and a website, but not between the user and the OCSP service. The attacker could be close to the website's servers and thus able to intercept all traffic to that site, but not anything else. More likely, the attacker might be able to perform a DNS hijacking: they persuade a DNS registrar to change the mapping between a domain (example.com) and its IP address(es) and thus direct the site's traffic to themselves. In these cases, soft-fail still doesn't work, although the reasoning is more complex:

Firstly, the attacker can use OCSP stapling to include the OCSP response with the revoked certificate. Because OCSP responses are generally valid for some number of days, they can store one from before the certificate was revoked and use it for as long as it's valid for. DNS hijackings are generally noticed and corrected faster than the OCSP response will expire. (On top of that, you need to worry about browser cache poisoning, but I'm not going to get into that.)

Secondly, and more fundamentally, when issuing certificates a CA validates ownership of a domain by sending an email, or looking for a specially formed page on the site. If the attacker is controlling the site, they can get new certificates issued. The original owners might revoke the certificates that they know about, but it doesn't matter because the attacker is using different ones. The true owners could try contacting CAs, convincing them that they are the true owners and get other certificates revoked, but if the attacker still has control of the site, they can hop from CA to CA getting certificates. (And they will have the full OCSP validity period to use after each revocation.) That circus could go on for weeks and weeks.

That's why I claim that online revocation checking is useless - because it doesn't stop attacks. Turning it on does nothing but slow things down. You can tell when something is security theater because you need some absurdly specific situation in order for it to be useful.

So, for a couple of years now, Chrome hasn't done these useless checks by default in most cases. Rather, we have tried a different mechanism. We compile daily lists of some high-value revocations and use Chrome's auto-update mechanism to push them to Chrome installations. It's called the CRLSet and it's not complete, nor big enough to cope with large numbers of revocations, but it allows us to react quickly to situations like Diginotar and ANSSI. It's certainly not perfect, but it's more than many other browsers do.

A powerful attacker may be able to block a user from receiving CRLSet updates if they can intercept all of that user's traffic for long periods of time. But that's a pretty fundamental limit; we can only respond to any Chrome issue, including security bugs, by pushing updates.

The original hope with CRLSets was that we could get revocations categorised into important and administrative and push only the important ones. (Administrative revocations occur when a certificate is changed with no reason to suspect that any key compromise occurred.) Sadly, that mostly hasn't happened. Perhaps we need to consider a system that can handle much larger numbers of revocations, but the data in that case is likely to be two orders of magnitude larger and it's very unlikely that the current CRLSet design is still optimal when the goal moves that far. It's also a lot of data for every user to be downloading and perhaps efforts would be better spent elsewhere. It's still the case that an attacker that can intercept traffic can easily perform an SSL Stripping attack on most sites; they hardly need to fight revoked certificates.

In order to end on a positive note, I'll mention a case where online revocation checking does work, and another, possible way to solve the revocation problem for browsers.

The arguments above started with the point that an attacker using a revoked certificate first needs to be able to intercept traffic between the victim and the site. That's true for browsers, but it's not true for code-signing. In the case where you're checking the signature on a program or document that could be distributed via, say, email, then soft-fail is still valuable. That's because it increases the costs on the attacker substantially: they need to go from being able to email you to needing to be able to intercept OCSP checks. In these cases, online revocation checking makes sense.

If we want a scalable solution to the revocation problem then it's probably going to come in the form of short-lived certificates or something like OCSP Must Staple. Recall that the original problem stems from the fact that certificates are valid for years. If they were only valid for days then revocation would take care of itself. (This is the approach taken by DNSSEC.) For complex reasons, it might be easier to deploy that by having certificates that are valid for years, but include a marker in them that indicates that an OCSP response must be provided along with the certificate. The OCSP response is then only valid for a few days and the effect is the same (although less efficient).

TLS Triple Handshakes

Today, the TLS WG mailing list meeting received a note about the work of Karthikeyan Bhargavan, Antoine Delignat-Lavaud, Cedric Fournet, Alfredo Pironti and Pierre-Yves Strub on triple handshake attacks against TLS. This is more complex than just a duplicated goto and I'm not going to try and reproduce their explanation here. Instead, I'll link to their site again, which also includes a copy of their paper.

In short, the TLS handshake hashes in too little information, and always has. Because of that it's possible to synchronise the state of two TLS sessions in a way that breaks assumptions made in the rest of the protocol.

I'd like to thank the researchers for doing a very good job of disclosing this. The note today even included a draft for fixing the TLS key derivation to include all the needed information to stop this attack, and it'll be presented at the WG meeting tomorrow.

In the mean time, people shouldn't panic. The impact of this attack is limited to sites that use TLS client-certificate authentication with renegotiation, and protocols that depend on channel binding. The vast majority of users have never used client certificates.

The client-certificate issues can be fixed with a unilateral, client change to be stricter about verifying certificates during a renegotiation, as suggested by the authors. I've included an image, below, that is loaded over an HTTPS connection that renegotiates with unrelated certificates before returning the image data. Hopefully the image below is broken. If not, then it likely will be soon because of a browser update. (I took the server down.)

Protocols that depend on channel binding (including ChannelID) need other changes. Ideally, the proposed update to the master-secret computation will be finalised and implemented. (For ChannelID, we have already updated the protocol to include a change similar to the proposed draft.)

It's likely that there are still concrete problems to be found because of the channel-binding break. Hopefully with today's greater publicity people can start to find them.

TLS Symmetric Crypto

At this time last year, the TLS world was mostly running on RC4-SHA and AES-CBC. The Lucky 13 attack against CBC in TLS had just been published and I had spent most of January writing patches for OpenSSL and NSS to implement constant-time CBC decoding. The RC4 biases paper is still a couple of week away, but it's already clear that both these major TLS cipher suite families are finished and need replacing. (The question of which is worse is complicated.)

Here's Chrome's view of TLS by number of TLS connections from around this time (very minor cipher suite families were omitted):

Cipher suite familyPercentage of TLS connections made by Chrome
RC4-MD52.8%
RC4-SHA148.9%
AES128-SHA11.2%
AES256-SHA146.3%

A whole bunch of people, myself included, have been working to address that since.

AES-GCM was already standardised, implemented in OpenSSL and has hardware support in Intel chips, so Wan-Teh Chang and I started to implement it in NSS, although first we had to implement TLS 1.2, on which it depends. That went out in Chrome 30 and 31. Brian Smith (and probably others!) at Mozilla shipped that code in Firefox also, with Firefox 27.

AES-GCM is great if you have hardware support for it: Haswell chips can do it in just about 1 cycle/byte. However, it's very much a hardware orientated algorithm and it's very difficult to implement securely in software with reasonable speed. Also, since TLS has all the infrastructure for cipher suite negotiation already, it's nice to have a backup in the wings should it be needed.

So we implemented ChaCha20-Poly1305 for TLS in NSS and OpenSSL. (Thanks to Elie Bursztein for doing a first pass in OpenSSL.) This is an AEAD made up of primitives from djb and we're using implementations from djb, Andrew M, Ted Krovetz and Peter Schwabe. Although it can't beat AES-GCM when AES-GCM has dedicated hardware, I suspect there will be lots of chips for a long time that don't have such hardware. Certainly most mobile phones at the moment don't (although AArch64 is coming).

Here's an example of the difference in speeds:

ChipAES-128-GCM speedChaCha20-Poly1305 speed
OMAP 446024.1 MB/s75.3 MB/s
Snapdragon S4 Pro41.5 MB/s130.9 MB/s
Sandy Bridge Xeon (AESNI)900 MB/s500 MB/s

(The AES-128-GCM implementation is from OpenSSL 1.0.1f. Note, ChaCha20 is a 256-bit cipher and AES-128 obviously isn't.)

There's also an annoying niggle with AES-GCM in TLS because the spec says that records have an eight byte, explicit nonce. Being an AEAD, the nonce is required to be unique for a given key. Since an eight-byte value is too small to pick at random with a sufficiently low collision probability, the only safe implementation is a counter. But TLS already has a suitable, implicit record sequence counter that has always been used to number records. So the explicit nonce is at best a waste of eight bytes per record, and possibly dangerous should anyone attempt to use random values with it. Thankfully, all the major implementations use a counter and I did a scan of the Alexa, top 200K sites to check that none are using random values - and none are. (Thanks to Dr Henson for pointing out that OpenSSL is using a counter, but with a random starting value.)

For ChaCha20-Poly1305 we can save 8 bytes of overhead per record by using the implicit sequence number for the nonce.

Cipher suite selection

Given the relative merits of AES-GCM and ChaCha20-Poly1305, we wish to use the former when hardware support is available and the latter otherwise. But TLS clients typically advertise a fixed preference list of cipher suites and TLS servers either respect it, or override the client's preferences with their own, fixed list. (E.g. Apache's SSLHonorCipherOrder directive.)

So, first, the client needs to alter its cipher suite preference list depending on whether it has AES-GCM hardware support - which Chrome now does. Second, servers that enforce their own preferences (which most large sites do) need a new concept of an equal-preference group: a set of cipher suites in the server's preference order which are all “equally good”. When choosing a cipher suite using the server preferences, the server finds its most preferable cipher suite that the client also supports and, if that is in an equal preference group, picks whichever member of the group is the client's most preferable. For example, Google servers have a cipher suite preference that includes AES-GCM and ChaCha20-Poly1305 cipher suites in an equal preference group at the top of the preference list. So if the client supports any cipher suite in that group, then the server will pick whichever was most preferable for the client.

The end result is that Chrome talking to Google uses AES-GCM if there's hardware support at the client and ChaCha20-Poly1305 otherwise.

After a year of work, let's see what Chrome's view of TLS is now:

Cipher suite familyPercentage of TLS connections made by Chrome
RC4-MD53.1%
RC4-SHA114.1%
AES128-SHA12.6%
AES256-SHA140.2%
AES128-GCM & ChaCha20-Poly130539.9%

What remains is standardisation work and getting the code out to the world for others. The ChaCha20-Poly1305 cipher suites are unlikely to survive the IETF without changes so we'll need to respin them at some point. As for getting the code out, I hope to have something to say about that in the coming months. Before then I don't recommend that anyone try to repurpose code from the Chromium repos - it is maintained with only Chromium in mind.

Fallbacks

Now that TLS's symmetric crypto is working better there's another, long-standing problem that needs to be addressed: TLS fallbacks. The new, AEAD cipher suites only work with TLS 1.2, which shouldn't be a problem because TLS has a secure negotiation mechanism. Sadly, browsers have a long history of doing TLS fallbacks: reconnecting with a lesser TLS version when a handshake fails. This is because there are lots of buggy HTTPS servers on the Internet that don't implement version negotiation correctly and fallback allows them to continue to work. (It also promotes buggy implementations by making them appear to function; but fallback was common before Chrome even existed so we didn't really have a choice.)

Chrome now has a four stage fallback: TLS 1.2, 1.1, 1.0 then SSLv3. The problem is that fallback can be triggered by an attacker injecting TCP packets and it reduces the security level of the connection. In the TLS 1.2 to 1.1 fallback, AES-GCM and ChaCha20-Poly1305 are lost. In the TLS 1.0 to SSLv3 fallback, all elliptic curve support, and thus ECDHE, disappears. If we're going to depend on these new cipher suites then we cannot allow attackers to do this, but we also cannot break all the buggy servers.

So, with Chrome 33, we implemented a fallback SCSV. This involves adding a pseudo-cipher suite in the client's handshake that indicates when the client is trying a fallback connection. If an updated server sees this pseudo-cipher suite, and the connection is not using the highest TLS version supported by the server, then it replies with an error. This is essentially duplicating the version negotiation in the cipher suite list, which isn't pretty, but the reality is that the cipher suite list still works but the primary version negotiation is rusted over with bugs.

With this in place (and also implemented on Google servers) we can actually depend on having TLS 1.2, at least between Chrome and Google (for now).

Bugs

Few rollouts are trouble free and Chrome 33 certainly wasn't, although most of the problems were because of a completely different change.

Firstly, ChaCha20-Poly1305 is running at reduced speed on ARM because early-revision, S3 phones have a problem that causes the NEON, Poly1305 code to sometimes calculate the wrong result. (It's not clear if it's the MSM8960 chip or the power supply in the phone.) Of course, it works well enough to run the unit tests successfully because otherwise I wouldn't have needed two solid days to pin it down. Future releases will need to run a self-test to gather data about which revisions have the problem and then we can selectively disable the fast-path code on those devices. Until then, it's disabled everywhere. (Even so, ChaCha20-Poly1305 is still much faster than AES-GCM on ARM.)

But it was the F5 workaround that caused most of the headaches. Older (but common) firmware revisions of F5 devices hang the connection when the ClientHello is longer than 255 bytes. This appeared to be a major issue because it meant that we couldn't add things to the ClientHello without breaking lots of big sites that use F5 devices. After a plea and a discussion on the TLS list, an engineer from F5 explained that it was just handshake messages between 256 and 511 bytes in length that caused the issue. That suggested an obvious solution: pad the handshake message so that it doesn't fall into the troublesome size range.

Chrome 33 padded all ClientHellos to 512 bytes, which broke at least two “anti-virus” products that perform local MITM of TLS connections on Windows and at least one “filtering” device that doesn't do MITM, but killed the TCP connections anyway. All made the assumption that the ClientHello will never be longer than some tiny amount. In the firewall case, the fallback SCSV stopped a fallback to SSLv3 that would otherwise have hidden the problem by removing the padding.

Thankfully, two of the three vendors have acted quickly to address the issue. The last, Kaspersky, has known about these sorts of problems for 18 months now (since Chrome tried to deploy TLS 1.1 and hit a similar issue) but, thankfully, doesn't enable MITM mode by default.

Overall, it looks like the changes are viable. Hopefully the path will now be easier for any other clients that wish to do the same.

Apple's SSL/TLS bug

Yesterday, Apple pushed a rather spooky security update for iOS that suggested that something was horribly wrong with SSL/TLS in iOS but gave no details. Since the answer is at the top of the Hacker News thread, I guess the cat's out of the bag already and we're into the misinformation-quashing stage now.

So here's the Apple bug:

static OSStatus
SSLVerifySignedServerKeyExchange(SSLContext *ctx, bool isRsa, SSLBuffer signedParams,
                                 uint8_t *signature, UInt16 signatureLen)
{
	OSStatus        err;
	...

	if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
		goto fail;
	if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
		goto fail;
		goto fail;
	if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
		goto fail;
	...

fail:
	SSLFreeBuffer(&signedHashes);
	SSLFreeBuffer(&hashCtx);
	return err;
}

(Quoted from Apple's published source code.)

Note the two goto fail lines in a row. The first one is correctly bound to the if statement but the second, despite the indentation, isn't conditional at all. The code will always jump to the end from that second goto, err will contain a successful value because the SHA1 update operation was successful and so the signature verification will never fail.

This signature verification is checking the signature in a ServerKeyExchange message. This is used in DHE and ECDHE ciphersuites to communicate the ephemeral key for the connection. The server is saying “here's the ephemeral key and here's a signature, from my certificate, so you know that it's from me”. Now, if the link between the ephemeral key and the certificate chain is broken, then everything falls apart. It's possible to send a correct certificate chain to the client, but sign the handshake with the wrong private key, or not sign it at all! There's no proof that the server possesses the private key matching the public key in its certificate.

Since this is in SecureTransport, it affects iOS from some point prior to 7.0.6 (I confirmed on 7.0.4) and also OS X prior to 10.9.2 (confirmed on 10.9.1). It affects anything that uses SecureTransport, which is most software on those platforms although not Chrome and Firefox, which both use NSS for SSL/TLS. However, that doesn't mean very much if, say, the software update systems on your machine might be using SecureTransport.

I coded up a very quick test site at https://www.imperialviolet.org:1266. Note the port number (which is the CVE number), the normal site is running on port 443 and that is expected to work. On port 1266 the server is sending the same certificates but signing with a completely different key. If you can load an HTTPS site on port 1266 then you have this bug.

Because the certificate chain is correct and it's the link from the handshake to that chain which is broken, I don't believe any sort of certificate pinning would have stopped this. Also, this doesn't only affect sites using DHE or ECDHE ciphersuites - the attacker gets to choose the ciphersuite in this case and will choose the one that works for them.

Also, this doesn't affect TLS 1.2 because there's a different function for verifying the different ServerKeyExchange message in TLS 1.2. But, again, the attacker can choose any version that the client will accept. But if the client only enables TLS 1.2 then it appears that would workaround this issue. Likewise, if the client only enabled the plain, RSA ciphersuites then there's no ServerKeyExchange and that should also work around this issue. (Of the two, the former workaround is much more preferable.)

Based on my test site, both iOS 7.0.6 and OS X 10.9.2 fix the issue. (Update: it looks like the bug was introduced in 10.9 for OS X but existed in at least some versions of iOS 6. iOS 6.1.6 was released yesterday to fix it.)

This sort of subtle bug deep in the code is a nightmare. I believe that it's just a mistake and I feel very bad for whoever might have slipped in an editor and created it.

Here's a stripped down that code with the same issue:

extern int f();

int g() {
	int ret = 1;

	goto out;
	ret = f();

out:
	return ret;
}

If I compile with -Wall (enable all warnings), neither GCC 4.8.2 or Clang 3.3 from Xcode make a peep about the dead code. That's surprising to me. A better warning could have stopped this but perhaps the false positive rate is too high over real codebases? (Thanks to Peter Nelson for pointing out the Clang does have -Wunreachable-code to warn about this, but it's not in -Wall.)

Maybe the coding style contributed to this by allowing ifs without braces, but one can have incorrect indentation with braces too, so that doesn't seem terribly convincing to me.

A test case could have caught this, but it's difficult because it's so deep into the handshake. One needs to write a completely separate TLS stack, with lots of options for sending invalid handshakes. In Chromium we have a patched version of TLSLite to do this sort of thing but I cannot recall that we have a test case for exactly this. (Sounds like I know what my Monday morning involves if not.)

Code review can be effective against these sorts of bug. Not just auditing, but review of each change as it goes in. I've no idea what the code review culture is like at Apple but I strongly believe that my colleagues, Wan-Teh or Ryan Sleevi, would have caught it had I slipped up like this. Although not everyone can be blessed with folks like them.

Lastly, there was a lot of discussion yesterday that Apple missed checking the hostname in the certificate. It's true that curl on the OS X command line oddly accepts HTTPS connections to IP addresses when the IP address isn't in the certificate, but I can't find that there's anything more than that and Safari doesn't have that problem.

Implementing Elligator for Curve25519

There are some situations where you would like to encode a point on an elliptic curve in such a way that it appears to be a uniform, random string. This comes up in some password authenticated key exchange systems, where one doesn't want to leak any bits of the password. It also comes up in protocols that are trying to be completely uniform so as to make life hard for network censors.

For the purposes of this post, I'm only going to be considering Curve25519. Recall that Curve25519 is the set of points (u,v) where v2 = u3 + 486662x2 + u and u and v are taken from GF(q=2255-19). Curve25519 works with a Montgomery ladder construction and thus only exchanges v coordinates. So, when sending a v coordinate there are two ways that an attacker can distinguish it from a random string:

Firstly, since the field is only 255 bits, the 256th bit is always zero. Thus if an attacker sees a series of 32-byte strings where the top bit of the last byte is always zero, then they can be confident that they are not random strings. This is easy to fix however, just XOR in a random bit and mask it out before processing.

Secondly, the attacker can assume that a 32-byte string is a v coordinate and check whether v3 + 486662x2 + v is a square. This will always be true if the strings are v coordinates, by the curve equation, but will only be true 50% of the time otherwise. This problem is a lot harder to fix.

Thankfully; djb, Tanja Lange, Mike Hamburg and Anna Krasnova have published Elligator [paper]. In that paper they present two maps from elliptic curve points to uniform, random strings and you should, at least, read the introduction to the paper, which contains much more detail on the use cases and previous work.

Elligator 2 is suitable for Curve25519 and there are some hints about an implementation in section 5.5. However, this blog post contains my own notes about implementing each direction of the map with only a couple of exponentiations. You may need to read the paper first before the following makes much sense.

I'll start with the inverse map, because that's what's used during key-generation. Throughout I'll write (x, y) for affine coordinates on the twisted, Edwards isomorphism of Curve25519, as defined for Ed25519; (u,v) for affine coordinates on Curve25519 itself; and X, Y, Z for extended coordinates on the twisted, Edwards curve.

The inverse map, ψ-1, takes a point on the curve with the following limitations and produces a uniform, representative value:

  1. u ≠ -A. (The value A is a parameter of Curve25519 and has the value 486662.)
  2. -2u(u + A) is a square

Since we're generating the points randomly, I'm going to ignore the first condition because it happens far less frequently than malfunctions in the CPU instructions that I might use to detect it.

The second condition excludes about half the points on the curve: these points don't have any representation as a uniform string. When generating a key we need to detect and restart if we happen upon one of those points. (When working in the other direction, ψ(r) = ψ(-r), which explains why half the image is missing.)

If the point is in the image of the map then the uniform string representation, r, is defined by:

(If you can't see the equation, you need a browser that can handle inline SVG.)

So our key generation function is going to take a private key and either produce nothing (if the resulting point isn't in the image of the map), or produce a public key and the uniform representation of it. We're working with Curve25519, so the public key, if any, will exactly match the result of the scalar_base_mult operation on Curve25519.

Since I happened to have the ref10 code from Ed25519 close to hand, we'll be calculating the scalar multiplication on the twisted, Edwards isomorphism of Curve25519 that Ed25519 uses. I've discussed that before but, in short, it allows for the use of precomputed tables and so is faster than scalar_base_mult with the Curve25519 code.

The result of the scalar multiplication of the base point with the private key is a point on the twisted, Edwards curve in extended coordinates: (X : Y : Z), where x = X/Z and y = Y/Z and where (x, y) are the affine coordinates on the Edwards curve. In order to convert a point on the Edwards curve to a point on Curve25519, we evaluate:

This is looking quite expensive: first we have to convert from extended coordinates on the Edwards curve to affine, then to affine on Curve25519. In finite fields, inversions (i.e. division) are very expensive because they are done by raising the divisor to the power of q-2 (where q=2255-19). Compared to such large exponentiations, multiplications and additions are almost free.

But, we can merge all those inversions and do the whole conversion with just one. First, substitute the extended coordinate conversion: x = X/Z, y = Y/Z.

image/svg+xml

By calculating just 1/(X(Y-Z)), we can get (u,v). The reciprocal can be used directly to calculate v and then multiplied by X to calculate u.

What remains is to check whether -2u(u + A) is a square and then to calculate the correct square root. Happily, we can calculate all that (including both square roots because we want to be constant-time), with a single exponentiation.

Square roots are defined in the standard way for finite fields where q ≅ 5 mod 8:

Let's consider the square roots first. Both have a common, constant term of (-1/2) which we can ignore for now. With that gone, both are fractions that we need to raise to (q+3)/8. Section 5 of the Ed25519 paper contains a trick for merging the inversion with the exponentiation:

But let's see what else we can do with the c value.

So c7 gives us the variable part of the other square root without another large exponentiation! It only remains to multiply in the constant term of (-1/2)(q+3)/8 to each and then, for each, test whether the square root of -1 needs to be multiplied. For the first square root, for example, we should have that r2 = -u/(2(u+A)) thus 2r2(u+A)+u should be zero. If not, then we need to multiply r by sqrt(-1).

Lastly, we need to check that the point is in the image of the map at all! (An implementation may want to do this sooner so that failures can be rejected faster.) For this we need to check that -2u(u + A) is a square. We use the fact that an element of the field is a non-zero square iff raising it to (q-1)/2 results in one. Thus we need to calculate (-2)(q-1)/2(u(u+A))(q-1)/2. The first part is just a constant, but the latter would be an expensive, large exponentiation, if not for c again: