This is the third in a series of posts about running experiments on post-quantum confidentiality in TLS. The first detailed experiments that measured the estimated network overhead of three families of post-quantum key exchanges. The second detailed the choices behind a specific structured-lattice scheme. This one gives details of a full, end-to-end measurement of that scheme and a supersingular isogeny scheme, SIKE/p434. This was done in collaboration with Cloudflare, who integrated Microsoft's SIKE code into BoringSSL for the tests, and ran the server-side of the experiment.
Google Chrome installs, on Dev and Canary channels, and on all platforms except iOS, were randomly assigned to one of three groups: control (30%), CECPQ2 (30%), or CECPQ2b (30%). (A random ten percent of installs did not take part in the experiment so the numbers only add up to 90.) CECPQ2 is the hybrid X25519+structured-lattice scheme previously described. CECPQ2b is the name that we gave to the combination of X25519 and the SIKE/p434 scheme.
Because optimised assembly implementations are labour-intensive to write, they were only available/written for AArch64 and x86-64. Because SIKE is computationally expensive, it wasn’t feasible to enable it without an assembly implementation, thus only AArch64 and x86-64 clients were included in the experiment and ARMv7 and x86 clients did not contribute to the results even if they were assigned to one of the experiment groups.
Cloudflare servers were updated to include support for both CECPQ2 and CECPQ2b, and to support an empty TLS extension that indicated that they were part of the experiment. Depending on the experiment group, Chrome would either offer CECPQ2, CECPQ2b, or just non-post-quantum options, in its TLS 1.3 handshake, along with the signaling extension to indicate which clients were part of the control group. Measurements were taken of how long TLS handshakes took to complete using Chrome’s metrics system. Chrome knew which servers were part of the experiment because they echoed the signaling extension, thus all three groups were measuring handshake duration against the same set of servers.
After this phase of the trial was complete, client-side measurements were disabled and Chrome Canary was switched to a mode where it randomly picked one of CECPQ2, CECPQ2b, or neither to offer. This enabled some additional, server-side measurements to ensure that nothing unexpected was occuring.
(Cloudflare has a significantly more detailed write up of this experiment.)
We’re aware of a couple of biases and these need to be kept in mind when looking at the results. Firstly, since ARMv7 and x86 platforms were excluded, the population was significantly biased towards more powerful CPUs. This will make supersingular isogenies look better. Also, we’ve seen from past experiments that Canary and Dev Chrome users tend to have worse networks than the Chrome user population as a whole, and this too will tend to advantage supersingular isogenies since they require less network traffic.
Here are histograms of client-side results, first from Windows (representing desktops/laptops) and from Android (representing mobile devices):
From the histograms we can see that the CECPQ2b (SIKE) group shifts visibly to the right (i.e. slower) in both cases. (On Android, a similar but smaller shift is seen for CECPQ2.) Despite the advantages of removing the slower clients and experimenting with worse-than-usual networks, the computational demands of SIKE out-weigh the reduced network traffic. Only for the slowest 5% of connections are the smaller messages of SIKE a net advantage.
Cloudflare have a much more detailed analysis of the server-side results, which are very similar.
While there may be cases where the smaller messages of SIKE are a decisive advantage, that doesn’t appear to be the case for TLS, where the computational advantages of structured lattices make them a more attractive choice for post-quantum confidentiality.
Username (and password) free login with security keys (10 Aug 2019)
Most readers of this blog will be familiar with the traditional security key user experience: you register a token with a site then, when logging in, you enter a username and password as normal but are also required to press a security key in order for it to sign a challenge from the website. This is an effective defense against phishing, phone number takeover, etc. But modern security keys are capable of serving the roles of username and password too, so the user experience can just involve clicking a login button, pressing the security key, and (perhaps) entering a locally-validated PIN if the security key doesn't do biometrics. This is possible with the recently released Chromium 76 and also with Edge or Firefox on current versions of Windows.
On the plus side, this one-button flow frees users from having to remember and type their username and password for a given site. It also avoids sites having to receive and validate a password, potentially avoiding both having a password database (which, even with aggressively slow hashes, will leak many users' passwords if disclosed), and removing any possibility of accidentally logging the plaintext values (which both Google and Facebook have done recently). On the negative side, users will need a modern security key (or Windows Hello-enabled computer) and may still need to enter a PIN.
Which security keys count as “modern”? For most people it'll mean a series-5 black Yubikey or else a blue Yubikey that has a faint “2” printed on the upper side. Of course, there are other manufacturers who make security keys and, if it advertises “CTAP2” support, there's a good chance that it'll work too. But those Yubikeys certainly do.
In practical terms, web sites exercise this capability via WebAuthn, the same API that handles the traditional security key flow. (I'm not going to go into much detail about how to use WebAuthn. Readers wanting more introductory information can see what I've written previously or else see one of the several tutorials that come up in a Google search.)
In WebAuthn terms, a “resident” credential is one that can be discovered without knowing its ID. Generally, most security keys operate statelessly, i.e. the credential ID is an encrypted private seed, and the security key doesn't store any per-credential information itself. Thus the credential ID is required for the security key to function so the server sends a list of them to the browser during login, implying that the server already knows which user is logging in. Resident keys, on the other hand, require some state to be kept by the security key because they can be used without presenting their ID first. (Note that, while resident keys require some state to be kept, security keys are free to keep state for non-resident keys too: resident vs non-resident is all about whether the credential ID is needed.)
User verification is about whether the security key is providing one or two authentication factors. With the traditional experience, the security key is something you have and the password is something you know. In order to get rid of the password, the security key now needs to provide two factors all by itself. It's still something you have so the second security key factor becomes a PIN (something you know) or a biometric (something you are). That begs the question: what's the difference between a PIN and a password? On the surface: nothing. A security key PIN is an arbitrary string, not limited to numbers. (I think it was probably considered too embarrassing to call it a password since FIDO's slogan is “solving the world's password problem”.) So you should think of it as a password, but it is a password with some deeper advantages: firstly, it doesn't get sent to web sites, so they can't leak it and people can safely use a single password everywhere. Secondly, brute-force resistance is enforced by the hardware of the security key, which will only allow eight attempts before locking and requiring a reset. Still, it'll be nice when biometrics are common in security keys.
A user ID is an opaque identifier that should not be personally identifying. Most systems will have some database primary key that identifies a user and, if using that as a WebAuthn user ID, ensure that you encrypt it first with a key that is only used for this purpose. That way it doesn't matter if those primary keys surface elsewhere too. Security keys will only store a single credential for a given pair of domain name and user ID. So, if you register a second credential with the same user ID on the same domain, it'll overwrite the first.
The fact that you can register more than one credential for a given domain means that it's important to set the metadata correctly when creating a resident credential. This isn't unique to resident keys, but it's much more important in this context. The user name and displayName will be shown by the browser during login when there's more than one credential for a domain. Also the relying party name and displayName will be shown in interfaces for managing the contents of a security key.
When logging in, WebAuthn works as normal except you leave the list of credential IDs empty and set userVerification to required. That triggers the resident-credential flow and the resulting credential will include the user ID, with which you look up the user and their set of registered public keys, and then validate the public key and other parameters.
Microsoft have a good test site (enter any username) where you can experiment with crafting different WebAuthn requests.
In order to support the above, security keys obviously have a command that says “what credentials do you have for domain x?”. But what level of authentication is needed to run that command is a little complex. While it doesn't matter for the web, one might want to use security keys to act as, for example, door access badges; especially over NFC. In that case one probably doesn't want to bother with a PIN etc. Thus the pertinent resident credentials would have to be discoverable and exercisable given only physical presence. But in a web context, perhaps you don't want your security key to indicate that it has a credential stored for cia.gov (or mss.gov.cn) to anyone who can plug it in. Current security keys, however, will disclose whether they have a resident credential for a given domain, and the user ID and public key for that credential, to anyone with physical access. (Which is one reason why user IDs should not be identifying.) Future security keys will have a concept of a per-credential protection level which will prevent them from being disclosed without user verification (i.e. PIN or biometrics), or without knowing their random credential ID. While Chromium will configure credential protection automatically if supported, other browsers may not. Thus it doesn't hurt to set credProtect: 2 in the extensions dictionary during registration.
Zero-knowledge attestation (01 Jan 2019)
U2F/FIDO tokens (a.k.a. “Security Keys”) are a solid contender for doing something about the effectiveness of phishing and so I believe they're pretty important. I've written a fairly lengthy introduction to them previously and, as mentioned there, one concerning aspect of their design is that they permit attestation: when registering a key it's possible for a site to learn a cryptographically authenticated make, model, and batch. As a browser vendor who has dealt with User-Agent sniffing, and as a large-site operator, who has dealt with certificate pervasiveness issues, that's quite concerning for public sites.
It's already the case that one significant financial site has enforced a single-vendor policy using attestation (i.e. you can only register a token made by that vendor). That does not feel very congruent with the web, where any implementation that follows the standards is supposed to be a first-class citizen. (Sure, we may have undermined that with staggering levels of complexity, but that doesn't discredit the worth of the goal itself.)
Even in cases where a site's intended policy is more reasonable (say, they want to permit all tokens with some baseline competence), there are strong grounds for suspecting that things won't turn out well. Firstly, the policies of any two sites may not completely align, leading to a crappy user experience where a user needs multiple tokens to cover all the sites that they use, and also has to remember which works where. Secondly, sites have historically not been so hot about staying up-to-date. New token vendors may find themselves excluded from the market because it's not feasible to get every site to update their attestation whitelists. That feels similar to past issues with User-Agent headers but the solution there was to spoof other browsers. Since attestation involves a cryptographic signature, that answer doesn't work here.
So the strong recommendation for public sites is not to request attestation and not to worry about it. The user, after all, has control of the browser once logged in, so it's not terribly clear what threats it would address.
However, if we assume that certain classes of sites probably are going to use attestation, then users have a collective interest in those sites enforcing the same, transparent standard, and in them keeping their attestation metadata current. But without any impetus towards those ends, that's not going to happen. Which begs the question: can browsers do something about that?
Ultimately, in such a world, sites only operate on a single bit of information about any registration: was this public-key generated in a certified device or not? The FIDO Alliance wants to run the certification process, so then the problem reduces down to providing that bit to the site. Maybe they would simply trust the browser to send it: the browser could keep a current copy of the attestation metadata and tell the site whether the device is certified or not. I don't present that as a straw-man: if the site's aim is just to ensure that the vast majority of users aren't using some backdoored token that came out of a box of breakfast cereal then it might work, and it's certainly simple for the site.
But that would be a short blog post, and I suspect that trusting the browser probably wouldn't fly in some cases.
So what we're looking for is something like a group signature scheme, but we can't change existing tokens. So we need to retrospectively impose a group signature on top of signers that are using vanilla P-256 ECDSA.
It is a surprising but true result in cryptography that it's possible to create a convincing proof of any statement in NP that discloses nothing except the truth of the statement. As an example of such a statement, we might consider “I know a valid signature of message x from one of the public keys in this set”. That's a pretty dense couple of sentences but rather than write an introduction to zero-knowledge proofs here, I'm going to refer you to Matthew Green's posts. He does a better job than I would.
I obviously didn't pick that example at random. If there was a well-known set of acceptable public keys (say, as approved by the FIDO Alliance) then a browser could produce a zero-knowledge proof that it knew a valid attestation signature from one of those keys, without disclosing anything else, notably without disclosing which public key was used. That could serve as an “attestation valid” bit, as hypothesised above, that doesn't require trusting the browser.
As a concrete instantiation of zero-knowledge proofs for this task, I'll be using Bulletproofs [BBBPWM17]. (See zkp.science for a good collection of many different ZK systems. Also, dalek-cryptography have excellent notes on Bulletproofs; Cathie Yun and Henry de Valence from that group were kind enough to help me with a question about Bulletproofs too.)
The computational model for Bulletproofs is an arithmetic circuit: an acyclic graph where public and secret inputs enter and each node either adds or multiplies all its inputs. Augmenting that are linear constraints on the nodes of the circuit. In the tool that I wrote for generating these circuits, this is represented as a series of equations where the only operations are multiplication, addition, and subtraction. Here are some primitives that hopefully convince you that non-trivial functions can be built from this:
- IsBit(x): x² - x = 0
- NOT(x): 1 - x
- AND(x, y): x × y
- OR(x, y): x + y - (x × y)
- XOR(x, y): x + y - 2(x × y)
Using single bit values in an arithmetic circuit certainly works, but it's inefficient. Getting past single-bit values, the arithmetic circuits in Bulletproofs don't work in ℤ (i.e. arbitrary-length integers), rather they work over a finite field. Bulletproofs are built on top of an elliptic curve and the finite field of the arithmetic circuit is the scalar field of that curve.
When dealing with elliptic curves (as used in cryptography) there are two finite fields in play: the x and y coordinates of the points on the curve are in the coordinate field of the curve. Multiples of the base point (B) then generate a prime number (n) of points in the group before cycling back to the base point. So xB + yB = (x + y mod n)B — i.e. you can reduce the multiple mod n before multiplying because it'll give the same result. Since n is prime, reduction mod n gives a field, the scalar field.
(I'm omitting powers of primes, cofactors, and some other complications in the above, but it'll serve.)
So Bulletproofs work in the scalar field of whatever elliptic curve they're implemented with, but we want to build P-256 ECDSA verification inside of a Bulletproof, and that involves lots of operations in P-256's coordinate field. So, ideally, the Bulletproofs need to work on a curve whose scalar field is equal to P-256's coordinate field. Usually when generating a curve, one picks the coordinate field to be computationally convenient, iterates other parameters until the curve meets standard security properties, and the scalar field is whatever it ends up as. However, after some quality time with “Constructing elliptic curves of prime order” (Broker & Stevenhagen) and Sage, we find that y² = x³ - 3x + B over GF(PP) where:
- B= 0x671f37e49d38ff3b66fac0bdbcc1c1d8b9f884cf77f0d0e90271026e6ef4b9a1
- PP= 0xffffffff000000010000000000000000aaa0c132719468089442c088a05f455d
… gives a curve with the correct number of points, and which seems plausibly secure based on the SafeCurves criteria. (A more exhaustive check would be needed before using it for real, but it'll do for a holiday exploration.) Given its relationship to P-256, I called it “PP-256” in the code.
Reviewing the ECDSA verification algorithm, the public keys and message hash are obviously public inputs. The r and s values that make up the signature cannot be both be public because then the verifier could just try each public key and find which one generated the signature. However, one of r and s can be public. From the generation algorithm, r is the x-coordinate of a random point and s is blinded by the inverse of the nonce. So on their own, neither r nor s disclose any information and so can just be given to the verifier—moving work outside of the expensive zero-knowledge proof. (I'm not worrying about tokens trying to use a covert channel here but, if you do worry about that, see True2F.)
If we disclose s to the verifier directly then what's left inside the zero-knowledge proof is 1) selecting the public key; 2) checking that the secret r is in range; 3) u₂ = r/s mod n; 4) scalar-multiplication of the public key by u₂; 5) adding in the (now) public multiple of the base point; and 6) showing that the x-coordinate of resulting point equals the original r, mod n.
The public-key is a 4-tooth comb, which is a precomputed form that speeds up scalar multiplications. It consists of 30 values. The main measure that we want to minimise in the arithmetic circuit is the number of multiplications where both inputs are secret. When selecting from t possible public keys the prover supplies a secret t-bit vector where only one of the bits is set. The proof shows that each value is, indeed, either zero or one using IsBit (from above, at a cost of one multiply per bit), and that exactly one bit is set by requiring that the sum of the values equals one. Each of the 30t public-key values is multiplied by one of the bits and summed to select exactly one key.
Rather than checking that the secret r is within [0, n-1], which would cost 512 multiplies, we just check that it's not equal to zero mod n. That's the important condition here since an out of range r is otherwise just an encoding error. Showing that a number is not zero mod n just involves showing that it's not equal to zero or n, as 2n is outside of the arithmetic circuit field. Proving a ≠ b is easy: the prover just provides an inverse for a - b (since zero doesn't have an inverse) and the proof shows that (a - b) × (a - b)⁻¹ = 1.
Calculating r/s mod n is the most complex part of the whole proof! Since the arithmetic circuit is working mod P-256's p, working mod n (which is the order of P-256—slightly less than p) is awkward. The prover gives bit-wise breakdown of r; the proof does the multiplication as three words of 86, 86, and 84 bits; the prover supplies the values for the carry-chain (since bit-shifts aren't a native operation in the arithmetic circuit); the prover then gives the result in the form a×n + b, where b is a 256-bit number; and the proof does another multiplication and carry-chain to check that the results are equal. All for a total cost of 2152 multiplication nodes!
After that, the elliptic curve operation itself is pretty easy. Using the formulae from “Complete addition formulas for prime order elliptic curves” (Renes, Costello, and Batina) it takes 5365 multiplication nodes to do a 4-tooth comb scalar-mult with a secret scalar and a secret point. Then a final 17 multiplication nodes add in the public base-point multiple, supply the inverse to convert to affine form, and check that the resulting x-coordinate matches the original r value. The circuit does not reduce the x-coordinate mod n in order to save work: for P-256, that means that around one in 2¹²⁸ signatures may be incorrectly rejected, but that's below the noise floor of arithmetic errors in CPUs. Perhaps if this were to be used in the real world, that would be worth doing correctly, but I go back to work tomorrow so I'm out of time.
In total, the full circuit contains 7534 multiplication nodes, 2154 secret inputs, and 17 236 constraints.
(Pratyush Mishra points out that affine formulae would be more efficient than projective since inversion is cheap in this model. Oops!)
My tool for generating the matrices that Bulletproofs operate on outputs 136KB of LZMA-compressed data for the circuit described above. In some contexts, that amount of binary size would be problematic, but it's not infeasible. There is also quite a lot of redundancy: the data includes instructions for propagating the secret inputs through the arithmetic circuit, but it also includes matrices from which that information could be derived.
The implementation is based on BoringSSL's generic-curve code. It doesn't even use Shamir's trick for multi-scalar multiplication of curve points, it doesn't use Montgomery form in a lot of places, and it doesn't use any of the optimisations described in the Bulletproofs paper. In short, the following timings are extremely pessimistic and should not be taken as any evidence about the efficiency of Bulletproofs. But, on a 4GHz Skylake, proving takes 18 seconds and verification takes 13 seconds. That's not really practical, but there is a lot of room for optimisation and for multiple cores to be used concurrently.
The proof is 70 450 bytes, dominated by the 2154 secret-input commitments. That's not very large by the standards of today's web pages. (And Dan Boneh points out that I should have used a vector commitment to the secret inputs, which would shrink the proof down to just a few hundred bytes.)
Intermediates and FIDO2
One important limitation of the above is that it only handles one level of signatures. U2F allows an intermediate certificate to be provided so that only less-frequently-updated roots need to be known a priori. With support for only a single level of signatures, manufacturers would have to publish their intermediates too. (But we already require that for the WebPKI.)
Another issue is that it doesn't work with the updated FIDO2 standard. While only a tiny fraction of Security Keys are FIDO2-based so far, that's likely to increase. With FIDO2, the model of the device is also included in the signed message, so the zero-knowledge proof would also have to show that a SHA-256 preimage has a certain structure. While Bulletproofs are quite efficient for implementing elliptic curves, a binary-based algorithm like SHA-256 is quite expensive: the Bulletproofs paper notes a SHA-256 circuit using 25 400 multiplications. There may be a good solution in combining different zero-knowledge systems based on “Efficient Zero-Knowledge Proof of Algebraic and Non-Algebraic Statements with Applications to Privacy Preserving Credentials” (Chase, Ganesh, Mohassel), but that'll have to be future work.
Happy new year.
CECPQ2 (12 Dec 2018)
CECPQ1 was the experiment in post-quantum confidentiality that my colleague, Matt Braithwaite, and I ran in 2016. It's about time for CECPQ2.
I've previously written about the experiments in Chrome which lead to the conclusion that structured lattices were likely the best area in which to look for a new key-exchange mechanism at the current time. Thanks to the NIST process we now have a great many candidates to choose from in that space. While this is obviously welcome, it also presents a problem: the fitness space of structured lattices looks quite flat so there's no obviously correct choice. Would you like keys to be products (RLWE) or quotients (NTRU; much slower key-gen, but subsequent operations are faster; older, more studied)? Do you want the ring to be NTT-friendly (fast multiplication, but more structure), or to have just a power-of-two modulus (easy reduction), or to have as little structure as possible? What noise profile and failure probability? Smart people can reasonably end up with different preferences.
This begs the question of why do CECPQ2 now at all? In some number of years NIST will eventually whittle down the field and write standards. Adrian Stanger of the NSA said at CRYPTO this year that the NSA is looking to publish post-quantum standards around 2024, based on NIST's output. (And even said that they would be pure-PQ algorithms, not combined with an elliptic-curve operation as a safeguard.) So if we wait five years things are likely to be a lot more settled.
Firstly, you might not be happy with the idea of waiting five years if you believe Michele Mosca's estimate of a one sixth chance of a large quantum computer in ten years. More practically, as we sail past the two year mark of trying to deploy TLS 1.3, another concern is that if we don't exercise this ability now we might find it extremely difficult to deploy any eventual design.
TLS 1.3 should have been straightforward to deploy because the TLS specs make accommodations for future changes. However, in practice, we had to run a series of large-scale experiments to measure what patterns of bytes would actually weave through all the bugs in the TLS ecosystem. TLS 1.3 now has several oddities in the wire-format that exist purely to confuse various network intermediaries into working. Even after that, we're still dealing with issues. Gallingly, because we delayed our server deployment in order to ease the client deployment, we're now having to work around bugs in TLS 1.3 client implementations that wouldn't have been able to get established had we quickly and fully enabled it.
The internet is very large and it's not getting any easier to steer. So it seems dangerous to assume that we can wait for a post-quantum standard and then deploy it. Any CECPQ2 is probably, in the long-term, destined to be replaced. But by starting the deployment now it can hopefully make that replacement viable by exercising things like larger TLS messages. Also, some practical experience might yield valuable lessons when it comes to choosing a standard. If the IETF had published the TLS 1.3 RFC before feedback from deployment, it would have been a divisive mess.
At the time of CECPQ1, the idea of running both a post-quantum and elliptic-curve primitive concurrently (to ensure that even if the post-quantum part was useless, at least the security wasn't worse than before) wasn't universally embraced. But we thought it important enough to highlight the idea in the name: combined elliptic-curve and post-quantum. It's a much more widely accepted idea now and still makes sense, and the best choice of elliptic-curve primitive hasn't changed, so CECPQ2 is still a combination with X25519.
As for the post-quantum part, it's based on the HRSS scheme of Hülsing, Rijneveld, Schanck, and Schwabe. This is an instantiation of NTRU, the patent for which has expired since we did CECPQ1. (The list of people who've had a hand in NTRU is quite long. See the Wikipedia page for a start.)
Last year Saito, Xagawa, and Yamakawa (SXY) published a derivative of HRSS with a tight, QROM-proof of CCA2 security from an assumption close to CPA security. It requires changes (that slow HRSS down a little), but HRSS+SXY is currently the basis of CECPQ2. Since HRSS+SXY no longer requires an XOF, SHAKE-128 has been replaced with SHA-256.
Having said that it's hard to choose in the structured lattice space, obviously HRSS is a choice and there were motivations behind it:
- CCA2-security is worthwhile, even though TLS can do without. CCA2 security roughly means that a private-key can be used multiple times. The step down is CPA security, where a private-key is only safe to use once. NewHope, used in CECPQ1, was only CPA secure and that worked for TLS since its confidentiality keys are ephemeral. But CPA vs CCA security is a subtle and dangerous distinction, and if we're going to invest in a post-quantum primitive, better it not be fragile.
- Avoiding decryption failures is attractive. Not because we're worried about unit tests failing (hardware errors set a noise floor for that anyway), but because the analysis of failure probabilities is complex. In the time since we picked HRSS, a paper has appeared chipping away at these failures. Eliminating them simplifies things.
- Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused, this is interesting as long as the key-generation speed is still reasonable for TLS.
- NTRU has a long history. In a space without a clear winner, that's a small positive.
CECPQ2 will be moving slowly: It depends on TLS 1.3 and, as mentioned, 1.3 is taking a while. The larger messages may take some time to deploy if we hit middlebox- or server-compatibility issues. Also the messages are currently too large to include in QUIC. But working though these problems now is a lot of the reason for doing CECPQ2—to ensure that post-quantum TLS remains feasible.
Lastly, I want to highlight that this only addresses confidentiality, not authenticity. Confidentiality is more pressing since it can be broken retrospectively, but it's also much easier to deal with in TLS because it's negotiated independently for every connection. Post-quantum authenticity will be entangled with the certificate and CA ecosystem and thus will be significantly more work.
Post-quantum confidentiality for TLS (11 Apr 2018)
In 2016, my colleague, Matt Braithwaite, ran an experiment in Google Chrome which integrated a post-quantum key-agreement primitive (NewHope) with a standard, elliptic-curve one (X25519). Since that time, the submissions for the 1st round of NIST’s post-quantum process have arrived. We thus wanted to consider which of the submissions, representing the new state of the art, would be most suitable for future work on post-quantum confidentiality in TLS.
A major change since the 2016 experiment is the transition from TLS 1.2 to TLS 1.3 (a nearly-final version of which is now enabled by default in Chrome). This impacts where in the TLS handshake the larger post-quantum messages appear:
In TLS 1.2, the client offers possible cipher suites in its initial flow, the server selects one and sends a public-key in its reply, then the client completes the key-agreement in its second flow. With TLS 1.3, the client offers several possible public-keys in its initial flow and the server completes one of them in its reply. Thus, in a TLS 1.3 world, the larger post-quantum keys will be sent to every TLS server, whether or not they’ll use it.
(There is a mechanism in TLS 1.3 for a client to advertise support for a key-agreement but not provide a public-key for it. In this case, if the server wishes to use that key-agreement, it replies with a special message to indicate that the client should try again and include a public-key because it’ll be accepted. However, this obviously adds an extra round-trip and we don’t want to penalise more secure options—at least not if we want them adopted. Therefore I’m assuming here that any post-quantum key agreement will be optimistically included in the initial message. For a diagram of how this might work with a post-quantum KEM, see figure one of the Kyber paper. Also I'm using the term “public-key” here in keeping with the KEM constructions of the post-quantum candidates. It's not quite the same as a Diffie-Hellman value, but it's close enough in this context.)
In order to evaluate the likely latency impact of a post-quantum key-exchange in TLS, Chrome was augmented with the ability to include a dummy, arbitrarily-sized extension in the TLS ClientHello. To be clear: this was not an implementation of any post-quantum primitive, it was just a fixed number of bytes of random noise that could be sized to simulate the bandwidth impact of different options. It was included for all versions of TLS because this experiment straddled the enabling of TLS 1.3.
Post quantum families
Based on submissions to NIST’s process, we grouped likely candidates into three “families” of algorithms, based on size: supersingular isogenies (SI), structured lattices (SL), and unstructured lattices (UL). Not all submissions fit into these families, of course: in some cases the public-key + ciphertext size was too large to be viable in the context of TLS, and others were based on problems that we were too unfamiliar with to be able to evaluate.
As with our 2016 experiment, we expect any post-quantum algorithm to be coupled with a traditional, elliptic-curve primitive so that, if the post-quantum component is later found to be flawed, confidentiality is still protected as well as it would otherwise have been.
This led to the following, rough sizes for evaluation:
- Supersingular isogenies (SI): 400 bytes
- Structured lattices (SL): 1 100 bytes
- Unstructured lattices (UL): 10 000 bytes
Incompatibilities with unstructured lattices
Before the experiment began, in order to establish some confidence that TLS connections wouldn’t break, the top 2 500 sites from the Alexa list were probed to see whether large handshakes caused problems. Unfortunately, the 10 000-byte extension caused 21 of these sites to fail, including common sites such as
Oddly enough, reducing the size of the extension to 9 999 bytes reduced that number to eight, but
linkedin.com was still included. These remaining eight sites seem to reject ClientHello messages larger than 3 970 bytes.
This indicated that widespread testing of a 10 000-byte extension was not viable. Thus the unstructured lattice configuration was replaced with a 3 300-byte “unstructured lattice stand-in” (ULS). This size was chosen to be about as large as we could reasonably test (although four sites still broke.
In February 2018, a fraction of Chrome installs (Canary and Dev channels only) were configured to enable the dummy extension in one of four sizes, with the configuration randomly selected, per install, from:
- Control group: no extension sent
- Supersingular isogenies (SI): 400 bytes
- Structured lattices (SL): 1 100 bytes
- Unstructured lattice standing (ULS): 3 300 bytes
We measured TLS handshake latency at the 50th and 95th percentile, split out by mobile / non-mobile and by whether the handshake was a full or resumption handshake.
Given that we measured a stand-in for the UL case, we needed to extrapolate from that to an estimate for the UL latency. We modeled a linear cost of each additional byte in the handshake, 66 bytes of overhead per packet, and an MTU around 1 500 bytes. The number of packets for the UL case is the same for MTU values around 1 500, but the number of packets for the SL case is less certain: a session ticket could push the ClientHello message into two packets there. We chose to model the SL case as a single packet for the purposes of estimating the UL cost.
We also ignored the possibility that the client’s initial congestion window could impact the UL case. If that were to happen, a UL client would have to wait an additional round-trip, thus our UL estimates might be optimistic.
Despite the simplicity of the model, the control, SI, and SL points for the various configurations are reasonably co-linear under it and we draw the least-squares line out to 10 000 bytes to estimate the UL latency.
|Configuration||Additional latency over control group|
|Desktop, Full, Median||4.0%||6.4%||71.2%|
|Desktop, Full, 95%||4.7%||9.6%||117.0%|
|Desktop, Resume, Median||4.3%||12.5%||118.6%|
|Desktop, Resume, 95%||5.2%||17.7%||205.1%|
|Mobile, Full, Median||-0.2%||3.4%||34.3%|
|Mobile, Full, 95%||0.5%||7.2%||110.7%|
|Mobile, Resume, Median||0.6%||7.2%||66.7%|
|Mobile, Resume, 95%||4.2%||12.5%||149.5%|
(The fact that one of the SI values came out implausibly negative should give some feeling for the accuracy of the results—i.e. the first decimal place is probably just noise, even at the median.)
As can be seen, the estimated latency overhead of unstructured lattices is notably large in TLS, even without including the effects of the initial congestion window. For this reason, we feel that UL primitives are probably not preferable for real-time TLS connections.
It is also important to note that, in a real deployment, the server would also have to return a message to the client. Therefore the transmission overhead, when connecting to a compatible server, would be doubled. This phase of the experiment experiment did not take that into account at all. Which leads us to…
In the second phase of the experiment we wished to measure the effect of actually negotiating a post-quantum primitive, i.e. when both client and server send post-quantum messages. We again added random noise of various sizes to simulate the bandwidth costs of this. Cloudflare and Google servers were augmented to echo an equal amount of random noise back to clients that included it.
In this phase, latencies were only measured when the server echoed the random noise back so that the set of measured servers could be equal between control and experiment groups. This required that the control group send one byte of noise (rather than nothing).
Since unstructured lattices were not our focus after the phase one results, there were only three groups in phase two: one byte (control), 400 bytes (SI), and 1100 bytes (SL).
Unlike phase one, we did not break the measurements out into resumption / full handshakes this time. We did that in phase one simply because that’s what Chrome measures by default. However, it begs the question of what fraction of handshakes resume, since that value is needed to correctly weigh the two numbers. Thus, in phase two, we simply measured the overall latency, with resumption (or not) implicitly included.
Here are the latency results, this time in milliseconds in order to reason about CPU costs below:
|Configuration||Additional latency over control group (ms)|
So far we’ve only talked about families of algorithms but, to analyse these results, we have to be more specific. That’s easy in the case supersingular isogenies (SI) because there is only one SI-based submission to NIST: SIKE.
For SIKEp503, the paper claims an encapsulation speed on a 3.4GHz Skylake of 16 619 kilocycles, or 4.9ms of work on a server. They also report on an assembly implementation for Aarch64, which makes a good example client. There the key generation + decapsulation time is 86 078 kilocycles, or 43ms on their 2GHz Cortex-A79.
On the other hand, structured lattices (SL) are much faster. There are several candidates here but most have comparable speed, and it’s under a 10th of a millisecond on modern Intel chips.
Therefore, if computation were included, the SI numbers above would have 48ms added to them. That happens to makes SL clearly faster at the median, although it does not close the gap at the 95% level. (The SI key-generation could be amortised across connections, at the cost of code complexity and maybe some forward security. Amortising key-generation across 16 connections results in 28ms of computation per connection, which doesn’t affect the argument.)
We had previously deployed CECPQ1 (X25519 + NewHope) in TLS, which should be comparable to SL measurements, and gotten quite different results: we found a 1ms increase at the median and only 20ms at 95%. (And CECPQ1 took roughly 1 900 bytes.)
On a purely technical level, CECPQ1 was based on TLS 1.2 and therefore increased the sizes of the server’s first flow and the client’s second flow. It’s possible that bytes in the client’s second flow are less expensive, but much more significantly, TLS 1.2 does not do a fresh key-agreement for a resumption handshake. Therefore all the successful resumptions had no overhead with CECPQ1, but with TLS 1.3, they would. Going back to historical Chrome statistics from the time of the CECPQ1 experiment, we estimate that about 50% of connections would have been resumptions, which closes the gap.
Additionally, there may be a significant difference in Chrome user populations between the experiments. CECPQ1 went to Chrome Stable (non-mobile only) while the post-quantum measurements here have been done on Dev and Canary channels. It is the case that, for all non-mobile TLS connections (completely unrelated to any post-quantum experiments), the population of Dev and Canary channel users see higher TLS latencies and the effect is larger than the SI or SL effect measured above. Thus the user population for these experiments might have had poorer internet connections than the population for CECPQ1, exaggerating the costs of extra handshake bytes.
Apart from CPU time, a second major consideration is that we are not comparing like with like. Our 400-byte figure for SI maps to a NIST “category one” strength, while 1 100 bytes for SL is roughly in line with “category three”. A comparable, “category one” SL primitive comes in at more like 800 bytes, while a “category three” SI is much slower.
(Although there is a recent paper suggesting that SIKEp503 might be more like category three after all. So maybe this point is moot, but such gyrations highlight that SI is a relatively new field of study.)
Lastly, and much more qualitatively, we have had many years of dealing with subtle bugs in elliptic-curve implementations. The desire for performance pushes for carefully-optimised implementations, but missed carries, or other range violations, are typically impossible for random tests to find. After many years, we are at the point where we are starting to use formally-synthesised code for elliptic-curve field operations but SI would take us deeper into the same territory, likely ensuring that the pattern of ECC bugs continues into the future. (As Hamburg says in his NIST submission, “some cryptographers were probably hoping never to see a carry chain again”.)
On the other hand, SL designs are much more like symmetric primitives, using small fields and bitwise operations. While it’s not impossible to botch, say, AES, my experience is that the defect rates in implementations of things like AES and SHA-256 is dramatically lower, and this is strong argument for SL over SI. Combining post-quantum primitives with traditional ECC is a good defense against bugs, but we should still be reticent to adopt potentially fragile constructs.
(Not all SL designs have this benefit, mind you. Hamburg, quoted above, uses ℤ/(23120 - 21560 - 1) in ThreeBears and lattice schemes for other primitives may need operations like discrete Gaussian sampling. But it holds for a number of SL-based KEMs.)
Thus the overall conclusion of these experiments is that post-quantum confidentiality in TLS should probably be based on structured lattices although there is still an open question around QUIC, for which exceeding a single packet is problematic and thus the larger size of SL is more significant.