In my last post, I suggested that a register based modexp for 64-bit numbers could run at about 500K ops/sec. Well, I wrote one and got 450K ops/sec on an older Core2. (That's with gcc -O3, but no tuning of the code. Plus, I don't know the standard algorithm for 128-bit modulus using 64-bit operations, so I wrote my own, which is almost certainly suboptimal.). Roughly that's 220 ops/s, so a brute force solution of 64-bits would take about 242 seconds, which is more than enough for us.
However, there are much better solutions to the discrete log problem than that. Here I'm only dealing with groups of prime order. There are very good solutions for groups of order 2n, but DH uses prime order groups only.
The best information I found on this are a set of slides by djb. However, they are a little sparse (since they are slides after all). Quick summary:
- Brute force parallelises perfectly. An FPGA chip could do 230 modexps per second. An array of really good ones could push that upwards of 240 modexps/sec.
- Breaking n Diffie-Hellmans isn't much harder than breaking one of them when using brute force. Since you can look for collisions against all n public keys at once. If you were a sniffer trying to sniff hundreds of connections per second, that's actually a big advantage. That could give up an amortised benefit equal to 210 or more.
- You can use "random self reduction" to "split" a problem into many problems and solving any of them they breaks the original problem. Combine this with the previous point and you can speed up the breaking of a single problem.
- If you figure out the optimal number of subproblems to "split" the original problem into you have the "giant step, baby step" algorithm which takes only about 2√n modexps to break (where n is 64 in our case).
- Now things are getting complex, so I'm just going to include the results: Pollard's rho method lets us break 64-bits in 232 modexps.
- The Pohlig-Hellman method is even better, but you can choose a safe prime as your group order to stop it. (A safe prime, p, is such that (p-1)/2 is also prime.)
- The "index calculus" method uses lots of precomputation against the group order to find specific solutions in that group very quickly. I must admit that I'm a little shaky on how index calculus works, but I've found one empirical result where a Matlab solution was breaking 64-bit discrete logs in < 1 minute, including the precomputation.
In short, attacks against discrete log in prime order groups are a lot stronger that I suspected. The index calculus method, esp, seems be a killer against 64-bit DH exchanges providing any sort of security. Since we don't have the time (on the server) or the space (in the TCP options) to include a unique group for each exchange, the precomputation advantage means that it's very possible for a sniffer to be breaking these handshakes in real time.
So it would appear that we need larger key sizes and, possibly elliptic curve based systems (the EC systems, in general, can't be attacked with index calculus based methods). RFC 2385 suggests that 16 bytes in a TCP header is about as much as we would want to add (they are talking about SYN packets, which we don't need to put public values in, but the absolute max is 36 bytes.), which gives us 128-bit public values. Looks like I need to read up on EC systems.