The probability of data loss in large clusters — Martin Kleppmann's blog
Many distributed storage systems (e.g. Cassandra, Riak, HDFS, MongoDB, Kafka, …) use replication to make data durable. They are typically deployed in a "Just a Bunch of Disks" (JBOD) configuration – that is, without RAID to handle disk failure. If one of the disks on a node dies, that disk's data is simply lost. To avoid losing data permanently, the database system keeps a copy (replica) of the data on some other disks on other nodes.
The most common replication factor is 3 – that is, the database keeps copies of every piece of data on three separate disks attached to three different computers. The reasoning goes something like this: disks only die once in a while, so if a disk dies, you have a bit of time to replace it, and then you still have two copies from which you can restore the data onto the new disk. The risk that a second disk dies before you restore the data is quite low, and the risk that all three disks die at the same time is so tiny that you're more likely to get hit by an asteroid.
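The reasoning above can be made concrete with a back-of-envelope calculation. The sketch below is not from the article; the annual failure rate and repair window are illustrative assumptions, and it treats disk failures as independent, which the naive three-replica argument also assumes.

```python
# Back-of-envelope sketch: probability that all three replicas of a
# piece of data fail before any can be restored, assuming independent
# disk failures. The constants are assumptions for illustration only.

ANNUAL_FAILURE_RATE = 0.02    # assumed: ~2% of disks fail per year
REPAIR_WINDOW_HOURS = 24      # assumed: time to replace a disk and re-replicate

HOURS_PER_YEAR = 365 * 24

# Probability that a given disk fails within one repair window
p_one = ANNUAL_FAILURE_RATE * REPAIR_WINDOW_HOURS / HOURS_PER_YEAR

# Naive estimate: data is lost only if all three replicas fail within
# the same repair window
p_all_three = p_one ** 3

print(f"p(one disk fails in window) = {p_one:.2e}")
print(f"p(all three replicas lost)  = {p_all_three:.2e}")
```

With these assumed numbers, a single-disk failure in any given window is on the order of 10⁻⁵, and three simultaneous failures on the order of 10⁻¹³, which is where the "more likely to get hit by an asteroid" intuition comes from. The full article goes on to show why this per-item estimate is misleading at cluster scale, where many such three-disk combinations exist.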