Documents must be named. Naming via a filename is possible, but large and variable-length names are an unfortunate headache. If we assign random numbers then the birthday paradox tells us that we can only have the square root of the number of possible documents before the probability of a collision is 50%. However we can check if a document exists before using a document id, just we can get away with smaller document ids than random assignment would allow. To keep to a power of 2 bits in length, and so give enough document ids we use 64-bit ids.
All nodes can communicate on the general multicast channel (GMC), 224.97.103.108. Every node listens on this channel and keeps an passive track of:
This information is small and is carried in the payload of most small messages to the GMC.
The packet loss and ping times should be roughly the same for every node. If they aren't then we should probably alert an admin. We also expect response times to be normally distributed. The global response time dictates our timeouts.
(aside: In reality, there will be level distinct levels of nodes - those on the local switch and those one level removed. But networking hardware is fast, certainly a lot faster than a disk seek if a node needs to check if it has a given document.)
Node communications are UDP and are either unicast (in which case the target must acknowledge the packet and the source must retransmit a number of times), or broadcast - to the GMC. The replies to broadcast messages may have an expected number of respondents. If this is the case, each node that could respond (which is generally all of them) flips a weighted coin and, based on its knowledge of the rough number of active nodes in the network, may choose not to reply.
Nodes have a number of disks. A given document is never replicated across different disks of the same node (or across multiple copies on the same disk, for that matter) due to the increased likelihood of a failure destroying that whole node.
A node keeps bloom filters of the documents that can be found on each disk. This saves on expensive IO lookups. The size of those bloom filters is up to each node, but with 625000 documents per disk a 16 fold, 4-bit counted filter is only 5MB.
Each disk in a node keeps a full record of the docids stored on each other disk in the same node. Again, for 625000 documents per disk and three `other' disks, per disk and 8 bytes (64-bits) per documents that's only 15MB. In reality it will be a little more than this because we won't bother compacting that list very often when documents are removed.
Each node also watches a number of others. For each node that it watches it keeps a full list of the docids on that node on every disk. For each node that it's watching that's 625000*4(disks)*8(bytes per docid)*4(of our disks) = 80MB.
(aside: it would be possible to make savings by using erasure codes rather than duplicating this information across every disk. But given the small overhead it's questionable if the added complexity is worth it.)
Every document on a disk is either incoming, normal or superfluous. All are reported in reply to a lookup. Incoming documents are still being downloaded and superfluous documents are not reported as taking up any space and are deleted as space pressure requires. Only normal documents are marked as members of a disk when updating docids lists (on other disks and on watchers).
Every node has an HTTP port open which answers requests like http://node/0011223344556677. If the node has the requested docid it replies with that data. If not, it triggers a lookup and HTTP redirects to a node which does.
A complete virgin node announces itself to the GMC and requests watchers. The expected number of replies is 20 and the node picks those who are currently watching the least nodes. It unicasts with those nodes and, from then on, updates its watchers whenever the contents of its disks change. (With a small amount of buffering).
A node broadcasts a lookup request on the GMC and the first good reply is the `answer'. The node keeps track of other replies and the number received is either none, less than the replication level, more than the replication level or correct.
If none, we retransmit until we hit a 1 in a million chance of packet loss and give up.
(aside: we know the global GMC packet loss and we assume the worst - that only one node has the document. Thus, with a 1% GMC drop rate we need to retransmit three times.)
If the number is less then we need to replicate. We pick a node from our local knowledge that has a lot of free space (in percent) and redirect it to copy from one of the nodes that we know has it. If the node refuses we try again. (The node selection is probabilistic to avoid the whole network buggering the most virgin node).
There is a race condition here, but the window is small (as nodes will report incoming documents as soon as they have started replicating it) and the consequence are small and fixed in the next paragraph...
If we get too many replies then we need to tell some of the nodes that their copy is superfluous. We number the nodes by a 64 bit number which is the md4 hash of their IP + port number and the `distance' to a docid is the absolute difference of the docid and the node number. Those nodes most distant from the document get their copies marked as superfluous.
I don't believe that this is racable. At worst, some documents don't get marked as superfluous.
Insertion is covered above, except that the source isn't another node.
Every node performs two types of stir - block stir and doc stir. Block stir consists of checksumming documents and verifying their md4 sum against the value in the metadata. This guards against silent disk corruption (which may not be the disk's fault). If a document is corrupt it is deleted and a lookup is performed to correct its loss.
When serving up a document it's probably a good idea to md4 sum it anyway, since the IO cost is nill. Unfortunately, if a mistake is found it's a little too late but at least it won't happen again.
Doc stir is the continual process where by nodes perform lookups on documents which they hold. In theory this isn't needed, in practice Weird Shit Happens and doc stir causes the network to always move towards a more correct state. If a node has no better data (see loss panic below) then it picks documents at random for the stir. Starting on a random point on a random disk and going in order from there is as good a way as any.
(aside: with a replication level of three, and a GMC packet loss of 1% we expect 3% of packets to need a retransmit. If each node stirs one document per second that's 412 packets per second on the GMC. If we assume a couple hundred bytes each that's less than 100K per second, which is fine. There's about 1300 replies, but they are unicast. The load of 400 lookups per second on each node is a little troubling - even with the bloom filters. Our one billion documents would still take a month to stir completely with such a load)
Consider, for the moment, that we don't actually make nodes perform a disk lookup to confirm that they actually have a copy of a given file during stirring. Given the rate of stirring that we require in order to stir everything within a reasonable time period this could save a lot of disk io.
How big would the blooms need to be to reduce the chance of something bad happening to acceptable levels?
While stirring we know that at least one node has a copy of a given docid (the one that suggested the stir to us). So we consult the blooms of all other nodes and a number, call it n of them return a positive. Thus m (which is <= n) nodes actually have that document. Let p be the false positive rate for our blooms. Let there be N nodes in total.
Assuming a replication level of three, if we find less that n is less than two then we know that something is amiss. So, what is the probability that we get n bloom positives, but actually have no other copies in the network? In this case N-n nodes returned a true negative (with probability of 1-p) and n returned a false positive (probability p). Thus the total probability is NCn (1-p)N-n * pn.
We can plug in a few numbers here. With 16 bits per document in the bloom filter p=0.000459. Assume 400 other nodes and that 2 bloom filters returned positive. The gives a probability of 1.4% that there aren't actually any other copies of that document in the network. That is bad.
| Bits per doc in bloom | Probability of zero copies with 400 nodes and two hits |
| 16 | 0.01398 |
| 20 | 0.00035 |
| 24 | 7.66675e-6 |
| 28 | 1.64732e-7 |
| 32 | 3.52945e-9 |
| 36 | 7.55886e-11 |
That formula gives some nicer numbers for very small numbers of bits per doc, because the probability that there are only two false positives becomes so long - but that's unhelpful.
So, with 32 bits per doc we get a decent result. But consider that over 1 billion documents that means that 3 or 4 documents with only one copy in the network won't be rescued by the stir.
What if we get more than replication-1 bloom hits? What is the probability that the document is in danger then?
Again, assuming that we have a replication level of three, we get 3 blooms hits. The probability that we have no extra copies is NC3 (1-p)N-3 * p3. For 32 bits per doc the probability is 9.84781e-14 which is a good number. For 16 bits it's 0.00085 - which isn't.
But we not consider the number of times that we'll actually get 3 false positives. For 32 bits, it's a similar number: 9.80361e-14 - so the savings made aren't worth the extra code.
So, from the above. If we have 36-fold bloom filters then maybe we can avoid a lot of disk io. Something to think about. (a 36-fold, 4-bit counted filter for 625000*4 documents is only 45MB).
In the event of a disk loss the node that lost the disk takes a list of the lost docids (from one of the other disks) and starts firing panic messages to every other node that it knows about. Panic messages should be the maximum length (wrt the network MTU) and packed with docids. When receiving such a packet, those docids are put in the loss panic queue and random stirring is suspended until the LPQ is empty.
For a disk of 625000 documents that's 1600 docids per node and about 26 minutes as 400 stirs per second.
TODO: node loss panic.
| / | Root |
| Alternate | The Weird and Wonderful |
| Backlinks | What are backlinks |
| John Gilmore | What's Wrong with Copy Protection |
| Archives | Blog Archives |
| One | Archive 1 |
| Two | Archive 2 |
| Three | Archive 3 |
| Four | Archive 4 |
| Five | Archive 5 |
| Six | Archive 6 |
| Seven | Archive 7 |
| Eight | Archive 8 |
| Nine | Archive 9 |
| Ten | Archive 10 |
| Eleven | Archive 11 |
| Twelve | Archive 12 |
| Thirteen | Archive 13 |
| Fourteen | Archive 14 |
| Fifteen | Archive 15 |
| Sixteen | Archive 16 |
| Seventeen | Archive 17 |
| Eighteen | Archive 18 |
| Nineteen | Archive 19 |
| Twenty | Archive 20 |
| Twenty One | Archive 21 |
| Twenty Two | Archive 22 |
| Twenty Three | Archive 23 |
| Twenty Four | Archive 24 |
| Twenty Five | Archive 25 |
| Twenty Six | Archive 26 |
| Twenty Seven | Archive 27 |
| Twenty Eight | Archive 28 |
| Twenty Nine | Archive 29 |
| Thirty | Archive 30 |
| Photos | Poor People Caught on Film |
| Jack and the Beanstalk | Jack and the Beanstalk |
| RIP Scan | Results of a Stage Scan Fire |
| Yosemite | Yosemite National Park |
| Projects | Incomplete things from the lab |
| Seagull's Bane | Linux Automounter |
| bttrackd | BitTorrent Tracker |
| CAPTCHA | CAPTCHA CGI script |
| Conserv | Console Serving |
| Deerpark | Using Tor with Firefox/1.1 (Deerpark) |
| DNSFix | Fixing DNS |
| Xovers | XTA Crossover Control |
| IAFS | Archive Org Storage |
| JBIG2 | JBIG2 Encoder |
| Verify | PGP Key Verifier |
| MaxFlow | Maximal Flow in Python |
| PyBloom | Bloom Filters in Python |
| pyGnuTLS | Python wrapping of GnuTLS |
| Sxmap | Apache SuEXEC Map |
| Hellard | Union Server Notes |
| Recordings | Free recordings |
| ICSM Choir | St Paul's Church |
| School | Ancient School Stuff |
| Writings | Who knows |
| Cap Systems | Capability Systems |
| Intro | Introduction to me |
| Suprema | JMC2 Group Project |
| MP Letters | Letters I've written to my MP |
| Sound | Sound With Dramsoc |
| SyncThreading | The wonders of user-land threads |