This is rather a theme for NIST selected algorithms. AES includes a number of operations on a binary field and CPUs generally don't implement binary field operations because they aren't really useful for anything else. So many software implementations of AES include four, 1024-byte tables of results in that field in order to achieve reasonable performance. However, these table lookups cause a timing side-channel which is fairly effective even over a network. On the same machine, where the CPU cache state can be probed more directly, it's very effective.
In response to this, some AES implementations made their tables smaller to reduce the leak. That's obviously an improvement, but it came at the cost of slowing AES down and AES was never very quick to start off with. And while a smaller leak is better, AES is supposed to be operating at (at least) an 128-bit security level!
Next, the mode recommended for AES by NIST is GCM: it's AES in counter mode coupled with a polynomial authenticator. Cryptographically this is perfectly sensible but the polynomial authenticator uses binary fields again. For a hardware implementation, this is another gift but software implementations need, yet again, tables in memory to implement it at reasonable speed.
Tables not only cause problems with timing side-channels, they also cause slowdowns that don't appear in micro-benchmarks. Tables are great when the benchmark is running with a single key and they are contained in L1 cache. But when running in a larger system, the cache pressure causes slowdowns and the real-world speed of an such an algorithm can be dramatically less than the benchmarks suggest.
But there's no reason why a polynomial authenticator needs to use binary fields. Why not use a field that CPUs actually implement? If you do so you get a very fast authenticator with dramatically improved key agility.
So why does NIST continually pick algorithms that aren't suited to software implementations? One of my colleagues proffers (somewhat jokingly) a conspiracy theory but I suspect that the answer is much more mundane. At USENIX Security this year, Dickie George gave an excellent keynote involving some NSA history and described how they would spend a decade or more designing a cryptographic system from the chip upwards. Everything was custom built, carefully designed and it was a completely closed system. It's certainly nice to be able to do that!
If you're designing such a system, of course you would be orientated towards hardware. Every implementation of your system would be a custom-designed chip; there's no benefit to making something easy to implement on commodity systems. And, in that world, AES, GCM, and SHA3 make a lot of sense. As the GCM paper says, “binary Galois ﬁeld multiplication is especially suitable for hardware implementations”. I'm assuming that there's a fair amount of shared worldview between NIST and the NSA, but that doesn't seem unreasonable.
(In a NIST paper about the selection of AES they state: “table lookup: not vulnerable to timing attacks.” Oops. If you're building your own hardware, maybe.)
I'm not suggesting that the NSA is actually using AES, GCM etc in their own designs. But a generation or more of cryptographers from that world are probably comfortable with features that were chosen with hardware in mind. Those features, like binary fields, are what they will naturally gravitate towards.
People who aren't in the situation of designing custom chips, but rather have to design protocols for implementation on a menagerie of commodity chips, might want to consider that NIST doesn't appear to have them in mind. Although hardware implementations of AES have existed for a while, unless they are CPU instructions (like AES-NI), then the system-call overhead can cause the acceleration to be slower than the plain CPU for common, 1400-byte, chunk sizes. Even when you have a CPU instruction to implement AES and GCM, it's unlikely that all the chips that your code is going to run on will be so endowed.
Hardware implementations of these, poorly-suited primitives also take chip area and development time away from other tasks. Why not choose primitives that use CPU abilities that lots of code benefits from? Then optimisations of those CPU instructions lifts all boats.
So if you want to pick algorithms designed for CPUs, try Salsa20 rather than AES and Poly1305 rather than GCM. But we don't have a great answer for a modern hash function I'm afraid, and that's a little disappointing. SHA-2 is fine for now, although Zooko has some thoughts on the matter.
(I should note that, thanks to some great work by djb, Peter Schwabe, Emilia Käsper and others, we do have bitsliced, constant-time implementations of AES now that run reasonably fast (for AES). But it's only effective for counter mode, and
I believe that it needs long streams of data to achieve its headline speed. (A commenter pointed out that it can be made to work with only 8 blocks. Later, another pointed out that that I'm doubly wrong! We can do it for single blocks in reasonable time.) That's great, but it took 10 years of optimisation work and implies that AES can only be effectively implemented by some of the best practical cryptographers in the world.)