ImperialViolet

Why networked software should expire. (25 Dec 2008)

If your business is still writing letters in WordStar and printing them out on an original Apple LaserWriter, good for you. If you're writing your own LISP in vi on a PDP-10, best of luck. However, if you're still using IE6, that's just rude.

Networked software (by which I just mean programs that talk to other programs) on public networks has a different cost model than the first two examples, but our mental models haven't caught up with that fact yet. We're stuck with the idea that what software you run is your own business, and that's preventing needed changes. Here's one example:

ECN (Explicit Congestion Notification) is a modification to TCP and IP which allows routers to indicate congestion by altering packets as they pass through. Early routers dropped packets only when their buffers overflowed and this was taken as an indication of congestion. It was soon noticed that a more probabilistic method of indicating congestion performed better. So routers started using RED (random early drop) where, approximately, if a buffer is 50% full, a packet has a 50% chance of getting dropped. This gives an indication of congestion sooner and prevents cases where TCP timeouts for many different hosts would start to synchronise and resonate.
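To make that concrete, here's a minimal sketch of the RED decision in Python. The thresholds and smoothing weight are invented for illustration; real implementations also do things like counting packets since the last drop:

    import random

    # Invented example parameters: the drop probability rises linearly as
    # the averaged queue occupancy moves between MIN_TH and MAX_TH.
    MIN_TH, MAX_TH, MAX_P = 0.25, 0.75, 0.10
    WEIGHT = 0.002  # EWMA weight: smooths out short bursts

    avg_occupancy = 0.0

    def should_drop(instant_occupancy):
        """Decide whether to drop an arriving packet.

        instant_occupancy: current buffer fill level in [0, 1].
        """
        global avg_occupancy
        avg_occupancy += WEIGHT * (instant_occupancy - avg_occupancy)
        if avg_occupancy < MIN_TH:
            return False   # queue is short: never drop
        if avg_occupancy >= MAX_TH:
            return True    # queue is long: always drop
        # In between: drop with probability rising linearly towards MAX_P.
        p = MAX_P * (avg_occupancy - MIN_TH) / (MAX_TH - MIN_TH)
        return random.random() < p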

To indicate congestion, RED drops a packet that has already traversed part of the network, throwing away information. So ECN was developed to indicate congestion without dropping the packets. Network simulations and small-scale testing showed a small but significant benefit from it.
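Concretely, ECN (RFC 3168) repurposes the two low-order bits of the IP TOS byte: a congested router sets the Congestion Experienced codepoint on packets from ECN-capable senders instead of dropping them. A minimal sketch of the marking decision:

    NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11  # RFC 3168 codepoints

    def handle_congested_packet(tos):
        """Return (new_tos, dropped) for a packet arriving during congestion."""
        ecn = tos & 0b11
        if ecn in (ECT_0, ECT_1):
            # The sender speaks ECN: mark the packet and keep it.
            return (tos & ~0b11) | CE, False
        # The sender doesn't speak ECN: fall back to dropping.
        return tos, True

The receiver then echoes the mark back to the sender in the TCP header (the ECE flag), which reacts as if the packet had been dropped.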

But when ECN was enabled for vger.kernel.org, the server which handles the Linux kernel mailing lists, many people suddenly noticed that their mails weren't getting through. It turned out that many buggy routers and firewalls simply dropped all packets which were using ECN. This was clearly against the specifications and, in terms of code, an easy fix.

ECN wasn't enabled by default in order to give time for the routers to get fixed. In a year or so, it was hoped, it could start to be used and the Internet could start to benefit.

That was over eight years ago now. ECN is still not enabled by default in any major OS. The latest numbers I've seen (which I collected) suggest that 0.5% of destinations will still stop working if you enable ECN and the number of hosts supporting ECN appears to be dropping.
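If you want to check your own machine, the knob on Linux is the tcp_ecn sysctl. A small sketch (the exact meaning of the values has shifted a little between kernel versions, so treat the labels as approximate):

    def linux_ecn_setting():
        # /proc/sys/net/ipv4/tcp_ecn is the Linux knob for TCP's use of ECN.
        with open("/proc/sys/net/ipv4/tcp_ecn") as f:
            value = f.read().strip()
        return {
            "0": "ECN disabled",
            "1": "ECN requested on outgoing connections",
            "2": "ECN used only when the peer requests it",
        }.get(value, "unknown value: " + value)

    print(linux_ecn_setting())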

The world has paid a price for not having ECN for the past eight years. Not a lot, but downloads have been a little slower and maybe more infrastructure has been built than was really needed. But who actually paid? Every user of the Internet did, a little bit. But that cost was imposed by router manufacturers who didn't test their products and network operators who didn't install updates. Those people saved money by doing less and everyone else paid the price.

These problems are multiplying with the increasing amount of network middleware (routers, firewalls etc) getting deployed, often in homes and owned by people who don't know or care about them.

Recently, Linux 2.6.27 was released and broke Internet access for, probably, thousands of people. Ubuntu Intrepid shipped with it and had to disable TCP timestamps as a workaround while the issue was fixed.

But the issue wasn't a bug in 2.6.27. It was a bug in many home routers (WiFi access points and the like) which was triggered by a perfectly innocent change in Linux that caused the order of TCP options to change. (I felt particularly aggrieved about this because I made that change.) The order was soon changed back and everything started working again.

But, for the future, this now means that the order cannot change. It's not written down anywhere; it's a rule written in bugs. This imposes costs on anyone who might write a new TCP stack in the future: increased testing, and reduced sales as some customers find that it won't work with their routers. These are all costs created by router manufacturers and paid by others.
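To illustrate what "written in bugs" means, here's a sketch of serialising the TCP options of a SYN in one fixed order. The option kinds and lengths come from the RFCs; the particular ordering below is just an illustrative stand-in for whatever order the buggy middleboxes happen to expect:

    import struct

    def syn_options(mss, ts_val, wscale):
        opts = b""
        opts += struct.pack("!BBH", 2, 4, mss)           # kind 2: MSS
        opts += struct.pack("!BB", 4, 2)                 # kind 4: SACK permitted
        opts += struct.pack("!BBII", 8, 10, ts_val, 0)   # kind 8: timestamps
        opts += struct.pack("!BBBB", 1, 3, 3, wscale)    # NOP pad + kind 3: window scale
        opts += b"\x00" * (-len(opts) % 4)               # pad to a multiple of 4
        return opts

Nothing in the RFCs says these have to appear in this order, but once enough deployed firmware assumes it, the order is frozen.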

Economists call these sorts of costs externalities and they are seen as a market failure which needs to be addressed. Often, in other areas, they are addressed by regulation or privatisation. Neither of those options appeals in this case.

An uncontroversial suggestion that I'm going to make is that we require better test suites. For a router manufacturer, testing involves checking that your equipment works with a couple of flavours of Windows and, if we're lucky, Linux and some BSDs too. This is much too small a testing surface. There needs to be an open source test suite designed to test every corner of the RFCs. The NFS connectathons are similar in spirit and probably saved millions of man-hours of debugging over their lifetimes. Likewise, the ACID tests for web browsers focused attention on areas where they were poorly implementing the standards.

And, although my two examples above are both TCP/IP related, I don't want to suggest that the problem stops there. Every common RFC should have such a test suite. HTTP may be a simple protocol but I'll bet that most implementations can't cope with continued header lines. It's those corners which a test suite should address.
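For example, here's roughly what such a corner-case test might look like: a tiny parser that handles a continued header line (one starting with whitespace, per RFC 2616), with a check at the end. This is a sketch, not production code:

    def parse_headers(raw):
        """Parse an HTTP/1.x header block, handling continued header lines."""
        lines = raw.decode("latin-1").split("\r\n")
        headers = {}
        last = None
        for line in lines[1:]:           # lines[0] is the status/request line
            if not line:
                break                    # blank line ends the headers
            if line[0] in " \t":         # continuation of the previous header
                headers[last] += " " + line.strip()
            else:
                name, _, value = line.partition(":")
                last = name.strip()
                headers[last] = value.strip()
        return headers

    raw = (b"HTTP/1.1 200 OK\r\n"
           b"X-Long-Header: first part\r\n"
           b"  second part\r\n"
           b"\r\n")
    assert parse_headers(raw)["X-Long-Header"] == "first part second part"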

Testing should help, but I don't think that it'll be enough. Problems will slip through. Testing against specifications will also never catch problems with the specification itself.

DNS requests can carry multiple questions. There's a 16-bit counter in the packet header to say how many questions you are asking. However, the reply format can only hold one response code. Thus, I don't know of any DNS server which handles multiple questions (most consider the request to be invalid).
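For the curious, here's a sketch of what such a packet looks like on the wire, following the RFC 1035 header format (QDCOUNT is that 16-bit question counter):

    import struct

    def encode_name(name):
        out = b""
        for label in name.split("."):
            out += bytes([len(label)]) + label.encode("ascii")
        return out + b"\x00"

    def two_questions(name):
        header = struct.pack("!HHHHHH",
                             0x1234,   # ID
                             0x0100,   # flags: standard query, recursion desired
                             2,        # QDCOUNT: two questions...
                             0, 0, 0)  # ...and no other records
        q_aaaa = encode_name(name) + struct.pack("!HH", 28, 1)  # QTYPE=AAAA, QCLASS=IN
        q_a    = encode_name(name) + struct.pack("!HH", 1, 1)   # QTYPE=A, QCLASS=IN
        return header + q_aaaa + q_a

The packet is perfectly well-formed, but send it to a typical resolver and you'll most likely get an error back rather than two answers.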

The ability to ask multiple questions would be very helpful. Just look at the number of places which suggest that you turn off IPv6 to make your networking faster. That's because software will otherwise ask DNS the IPv6 (AAAA) question, wait for the reply and then ask the IPv4 (A) question. This delay, caused by not being able to ask both questions in a single request, is causing people to report slowdowns and disable IPv6.
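One workaround, if you don't want to disable IPv6, is for the application to issue the two lookups concurrently itself. A sketch:

    import socket
    from concurrent.futures import ThreadPoolExecutor

    def resolve_both(host, port):
        """Ask the AAAA and A questions in parallel instead of serially."""
        def lookup(family):
            try:
                return socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
            except socket.gaierror:
                return []
        with ThreadPoolExecutor(max_workers=2) as pool:
            v6 = pool.submit(lookup, socket.AF_INET6)
            v4 = pool.submit(lookup, socket.AF_INET)
            return v6.result() + v4.result()

If DNS could answer two questions in one packet, none of this machinery would be needed.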

We need to fix DNS, but we never can because one cannot afford to break the world. We can't even start a backwards compatible transition because of broken implementations.

That's why networked software should have an expiry date. After the expiry date, the code should make it very clear that it's time to upgrade. For a router, print a big banner when an administrator connects. Flash all the error lights. For software, pop up a dialog every time you start. For home routers, beep and flash a big indicator.
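In code, the check is almost trivially simple. A sketch, with an invented build date and lifetime:

    import datetime

    BUILD_DATE = datetime.date(2008, 12, 25)     # baked in at release time
    LIFETIME = datetime.timedelta(days=5 * 365)  # five years, say

    def startup_check():
        if datetime.date.today() > BUILD_DATE + LIFETIME:
            # Keep working, but complain loudly: a banner on the admin
            # console, error lights, a dialog at startup.
            print("*** This software is past its expiry date. ***")
            print("*** Please check for an update. ***")

The hard part isn't the code, of course; it's agreeing that shipping it is acceptable.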

We don't need everyone to update and, as manufacturers fold, maybe there won't be any firmware updates or software upgrades. Almost certainly the device shouldn't stop working. But we need to make more of an effort to recognise that large populations of old code hold everyone else back.

If we can know that nearly all the old code is going to be gone by some date, maybe we can make progress.

(Thanks to Evan for first putting this idea in my mind.)