Personally, all the spam I get is filtered by procmail without any fancy statistical magic, or indeed, without looking at the body of the message at all. So if everyone could be like me the spam problem would go away.
But it seems that spam is a big problem for other people, and whilst I don't really worry about other people's problems very much when I have such a wide choice myself, spam filtering provides a nice thought exercise for a while. Not to mention a chance to lever in a few better ways of doing things.
From a technical point of view I would start a company that runs sweatshops filtering spam by hand. The workers would need fair language skills, but English is pretty commonplace and there are enough sweatshop laborers, or so I keep getting told.
However, I have a few non-technical problems with running sweatshops, and it doesn't involve very much code, so it probably isn't much fun.
AMTP is a small extension to the SMTP protocol that makes TLS mandatory and sets an evil bit (more or less) for each message. If the sending host doesn't correctly set the evil bit then you have a CA issued identity to lynch.
This is basically a 2-level trust tree. Everyone trusts the elite CAs and they trust all the ISPs in the world and so on. The major problem with this is that a CA-issued identity costs a lot. From a management point of view this might seem like a very good idea. Get all those geeks off the Internet and then we can get down to making money off it ... somehow.
But it's making email sending exclusive (because it's expensive) and this is our end-to-end network goddammit.
There has been plenty of good work done by the reputation people about this sort of thing. But generally they are considering how to deal with reputation when you hold the whole graph. (Though anyone should feel free to point me at a paper which solves these issues). Dealing with reputation when one can only see a couple of small areas of the graph is a whole different matter.
Consider a simple system where a node (person) is free to set up a directed arc (reputation certificate) to any other node. Each arc carries a float between 0 and 1 which indicates how confident the source is that the destination will not send spam. Also assume that a node will accept a message if the sender can show a path from the target to the sender such that the product of all the arc weights is greater than 0.1.
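To make the acceptance rule concrete, here is a minimal sketch of the check a receiving node might run on a path it has been shown. The dictionary layout, the function names and the example people are all made up for illustration; only the 0.1 cut-off and the product rule come from the description above.

```python
from math import prod

# Hypothetical representation: arcs[(source, destination)] = confidence in [0, 1].
ACCEPT_THRESHOLD = 0.1  # the cut-off used in the text


def path_trust(arcs, path):
    """Product of the arc weights along path, or 0.0 if any arc is missing."""
    weights = []
    for src, dst in zip(path, path[1:]):
        if (src, dst) not in arcs:
            return 0.0
        weights.append(arcs[(src, dst)])
    return prod(weights)


def accepts(arcs, path, target, sender):
    """True if path runs from target to sender and its trust beats the cut-off."""
    if len(path) < 2 or path[0] != target or path[-1] != sender:
        return False
    return path_trust(arcs, path) > ACCEPT_THRESHOLD


arcs = {
    ("alice", "bob"): 0.9,
    ("bob", "carol"): 0.5,
    ("carol", "dave"): 0.2,
}
print(accepts(arcs, ["alice", "bob", "carol"], "alice", "carol"))
print(accepts(arcs, ["alice", "bob", "carol", "dave"], "alice", "dave"))
```

With these made-up weights, carol's message to alice gets through (0.9 x 0.5 = 0.45) while dave's does not (0.45 x 0.2 = 0.09), which shows how quickly trust decays with each extra hop.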
Without a good knowledge of the graph, the sender isn't going to be able to find such a path, even if it exists. Assuming that there is a way to walk the graph, it's going to take a connection-request-reply to lots of different servers to get the information. (Because we wouldn't have it on one central server as that would be Bad).
Once you have read the rest, see the aside below, in which I contradict myself.
However, most of the time I'm exchanging email with people that I have a good contact with. Messages which would require many hops of the trust graph are quite rare.
Thus it would be perfectly possible for search servers to hold much of the graph in memory. There wouldn't be a single central search server (as that would be Bad), but there wouldn't need to be as the server need not be trusted as it cannot lie. Possibly that would be enough to make the system work.
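As a rough sketch of what such a search server could do with the part of the graph it holds in memory, the following looks for the most trusted path from the target to the sender. It is a Dijkstra-style search, which is my choice rather than anything specified above; it works here because every weight is at most 1, so extending a path can never increase its product. The graph layout and names are hypothetical. The recipient would still check the returned arcs itself, which is presumably why the server cannot usefully lie.

```python
import heapq


def best_path(graph, target, sender, threshold=0.1):
    """Return (trust, path) for the most trusted path from target to sender.

    graph[node] = {neighbour: weight}. Returns None if no path beats threshold.
    """
    heap = [(-1.0, [target])]  # max-heap on trust, via negated values
    done = set()
    while heap:
        neg_trust, path = heapq.heappop(heap)
        trust, node = -neg_trust, path[-1]
        if node == sender:
            return trust, path
        if node in done:
            continue
        done.add(node)
        for nxt, weight in graph.get(node, {}).items():
            # Products only shrink, so anything at or below the cut-off is dead.
            if nxt not in done and trust * weight > threshold:
                heapq.heappush(heap, (-trust * weight, path + [nxt]))
    return None


graph = {
    "a": {"b": 0.9, "c": 0.3},
    "b": {"d": 0.5},
    "c": {"d": 0.9},
    "d": {},
}
print(best_path(graph, "a", "d"))
```

Pruning at the threshold also keeps the search small: with a cut-off of 0.1 and typical weights well below 1, only a handful of hops are ever worth exploring.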
Issues that I'm not going to think about till the morning... negative certs, caching issues, the problem of time delay if a trusted source goes 'bad' (which are all rooted in the same issue).
Above, I state that searching the trust network wouldn't work. But it occurs to me that it would be fairly simple to find a path quite efficiently.
The trust graph is going to have a power law distribution. I don't know why, but I would be very surprised if it didn't. So, starting from two points A and B, to find a path between them, walk up through nodes of increasing order until you hit a common meeting point at a high order node.
Walking up from B assumes that much of the time if C trusts D, then D trusts C. Because you actually want to find a path, in the end, that goes down to B. This assumption makes the graph look `symmetricish' and so the trick might produce a path pretty quickly. Unfortunately, the symmetric assumption falls down for the high order nodes.
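The walk-up trick might look something like the sketch below. It leans on the symmetric assumption by treating the graph as undirected, and it takes "order" to mean the number of neighbours a node has, which is my reading rather than something stated above. All names are hypothetical, and any candidate path it returns would still need checking against the real directed arcs and their weights.

```python
def climb(neighbours, start, max_steps=10):
    """Step to the highest-order unvisited neighbour until reaching a local peak."""
    chain = [start]
    for _ in range(max_steps):
        here = chain[-1]
        options = [n for n in neighbours.get(here, ()) if n not in chain]
        if not options:
            break
        best = max(options, key=lambda n: len(neighbours.get(n, ())))
        if len(neighbours.get(best, ())) <= len(neighbours.get(here, ())):
            break  # nowhere higher to go
        chain.append(best)
    return chain


def meet_in_the_middle(neighbours, a, b):
    """Climb from both ends; join the two chains at the first shared node."""
    up_a, up_b = climb(neighbours, a), climb(neighbours, b)
    for i, node in enumerate(up_a):
        if node in up_b:
            j = up_b.index(node)
            return up_a[:i + 1] + up_b[:j][::-1]
    return None


# Made-up undirected adjacency lists: one hub, two mid-level nodes, some leaves.
neighbours = {
    "a": ["m1"], "b": ["m2"],
    "m1": ["a", "hub", "y"], "m2": ["b", "hub", "z"],
    "hub": ["m1", "m2", "m3", "x"],
    "y": ["m1"], "z": ["m2"], "m3": ["hub"], "x": ["hub"],
}
print(meet_in_the_middle(neighbours, "a", "b"))
```

Each climb only needs to ask the current node about its neighbours, so the number of servers contacted grows with the height of the climb rather than with the size of the graph.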