Thoughts on British ICT, energy & environment, cloud computing and security from Memset's MD
I contend that the next stage of evolution of storage is “Just a Bunch of Disks” (JBOD), comprising a range of media types with different performance characteristics, and with software doing the cleverness.
In this first post (1 of 2) I shall address the resilience aspects of this evolution.
Large RAID (Redundant Array of Independent Disks) systems – the de facto standard for storage systems – are increasingly unreliable, especially with the large disks and slow rebuild times we see on busy arrays.
For example, let’s look at a RAID array with 24 x 2TByte disks. For this calculation I have used our RAID failure rate calculator, which takes some assumptions about disk failure rates and rebuild speed as inputs. The results for each configuration:
| | RAID5 | RAID5 + 1 hot spare | RAID5 + 2 hot spares | RAID6 | RAID6 + 1 hot spare | RAID6 + 2 hot spares |
|---|---|---|---|---|---|---|
| Usable array size (TB) | 46 | 44 | 42 | 44 | 42 | 40 |
| Chance of data loss per year | 1 in 6.0 | 1 in 10.3 | 1 in 10.5 | 1 in 348.5 | 1 in 1143.8 | 1 in 1189.4 |
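To see why a second (or third) failure during the rebuild window is what kills you, here is a deliberately simplified model. The parameters below (5% annual failure rate, 28-hour rebuild) are my illustrative assumptions, not the inputs to our calculator, and the model ignores unrecoverable read errors – which make real RAID5 arrays far worse than this – so it will not reproduce the table above:

```python
# Simplified model: data is lost if, after a first disk failure,
# `parity` more disks fail before the rebuild completes.
# All parameters here are illustrative assumptions.

HOURS_PER_YEAR = 24 * 365

def p_loss_per_year(n_disks, afr, rebuild_hours, parity):
    """afr = annual failure rate of one disk (e.g. 0.05 = 5%)."""
    first_failures = n_disks * afr                    # expected initial failures/year
    p_window = afr * rebuild_hours / HOURS_PER_YEAR   # one disk dying mid-rebuild
    p_cascade = 1.0
    for k in range(parity):                           # `parity` further failures needed
        p_cascade *= (n_disks - 1 - k) * p_window
    return first_failures * p_cascade

raid5 = p_loss_per_year(24, afr=0.05, rebuild_hours=28, parity=1)
raid6 = p_loss_per_year(24, afr=0.05, rebuild_hours=28, parity=2)
print(f"RAID5: 1 in {1 / raid5:,.0f} per year")
print(f"RAID6: 1 in {1 / raid6:,.0f} per year")
```

The structure is the point: each extra parity disk multiplies in another small `p_window` term, which is why RAID6 is orders of magnitude safer than RAID5 – and why both get worse as rebuild times stretch out.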
As you can see from the table, RAID5 is hopeless with large arrays of large disks. This should not be news to any professional sysadmin (anyone still using RAID5 for such setups should be lined up and shot for being a twit!), but many professionals are still labouring under the illusion that RAID6, especially with hot spares, is safe. Indeed, a few years ago they would have been right.
So, what has changed? Mainly, the size of the disks themselves. While disks have grown exponentially larger, their bandwidth (the speed at which data can be read from or written to them) has grown much more slowly. When a disk in a RAID array fails, the controller automatically starts rebuilding the array onto one of the hot spares (or, if you don’t have hot spares, onto the disk you replace the faulty one with). This is also why hot spares make such a difference: the machine is not left waiting on a human to do a disk swap.
Anyway, rebuilding an array takes a long time, especially on full and busy arrays. This is because all the data that was on the failed disk now has to be written to the newly-recruited hot spare, which involves reading data from all of the other disks. If those disks are busy, which is usually the case in a live array, the process can take a very long time; we often see rebuild rates as low as 10MB/sec (80Mbps) for very busy machines. I’ve been more generous in our example, but even then it would take 1 day, 3 hours to rebuild a full array.
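The arithmetic behind that rebuild figure is straightforward. A sketch, assuming a sustained rebuild rate of around 20MB/sec (my assumption for the “more generous” figure – roughly double the busy-machine worst case):

```python
# Back-of-envelope rebuild time for one failed 2 TB disk,
# assuming a sustained rebuild rate of ~20 MB/s.
disk_bytes = 2 * 10**12        # 2 TB of data to re-create
rate = 20 * 10**6              # assumed 20 MB/s sustained rebuild rate
seconds = disk_bytes / rate    # 100,000 seconds
hours = seconds / 3600
print(f"{hours:.1f} hours")    # ~27.8 hours, i.e. over a day
```

At the 10MB/sec worst case the same sum gives more than two days, the whole time spent exposed to further failures.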
Herein lies the problem: if you get one more failure (in the case of RAID5) or two more failures (in the case of RAID6) while the array is rebuilding itself after the initial failure, then you’re buggered; you can expect to lose all your data. This is also why having more hot spares does not really help: you are still vulnerable during that window. It should also be noted that, if you have hot spares, the time taken to physically replace a disk becomes fairly inconsequential.
However, by contrast, our cloud storage system Memstore, which is based on OpenStack Swift (the same system Rackspace uses for CloudFiles), boasts a 1 in 100,000,000 chance per year of losing any individual object. This is quite a fair comparison since, physically, Memstore consists of a bunch of 3U servers, each with 24 disks in, but not using RAID.
Not only are those odds vastly better than the RAID’s, they are the odds of losing just one object, which is in general a lot better than losing an entire 40TB of data in that rare failure scenario. The trade-off is that if you have 100,000,000 objects, you can expect to lose about one per year.
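That trade-off is just linear arithmetic: expected annual losses are the per-object loss probability times the number of objects you store.

```python
# Expected annual object losses at Memstore's quoted rate of
# 1 in 100,000,000 per object per year.
p_loss = 1e-8  # 1 in 100,000,000
expected = {n: n * p_loss for n in (10**6, 10**8, 10**10)}
for n_objects, losses in expected.items():
    print(f"{n_objects:>14,} objects -> {losses:g} expected losses/year")
```

So a customer with a million objects would expect to wait around a century before losing one.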
Memstore achieves its resilience by storing every object (or file) on at least three different nodes (hard disks). If any node fails, the system automatically re-distributes the objects that have lost a copy across to a new node, and it re-creates the lost node very swiftly because the cluster as a whole is more lightly loaded. Here the comparison is a little unfair, since we are being somewhat wasteful with disks: even our maximally-resilient RAID still uses 20 out of 24 disks for data (83%), whereas Memstore only makes 33% of the raw space available for storage.
This means that to lose data, three disks on different machines must fail within a short time of one another. A slim chance!
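Just how slim can be sketched with a toy model. The numbers below are hypothetical assumptions of mine (a 5% per-disk annual failure rate and roughly a day to re-create a lost replica), and the model ignores correlated failures and replication backlogs, so it comes out rather more optimistic than the 1-in-100,000,000 figure above – but it shows why needing three near-simultaneous failures makes the odds so long:

```python
# Toy model of triple-replication durability.
# Assumed (hypothetical) parameters, not Swift's real behaviour:
HOURS_PER_YEAR = 24 * 365
afr = 0.05             # assumed annual failure rate per disk (5%)
recovery_hours = 24.0  # assumed time to re-create a lost replica

# chance a given disk dies within one recovery window
p_window = afr * recovery_hours / HOURS_PER_YEAR

# an object is lost if one replica's disk fails (~afr per year) AND
# the other two replicas' disks also fail inside the recovery window
p_object_loss = afr * p_window ** 2
print(f"roughly 1 in {1 / p_object_loss:,.0f} per object per year")
```

The key contrast with RAID: each replica lives on an independent machine, and the “rebuild” of a single object is tiny and fast, so the vulnerability window is hours rather than the day-plus a full array rebuild takes.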
However, my contention is that this doesn’t matter, because disks are ridiculously cheap! Even with that triplication, our costs for Memstore are under £25/TByte.
Some of you will now rightly be thinking, “but what about the performance loss?” or, “but it is not a file system”. You would be right to do so, but even there we have emerging solutions which I shall discuss in my next blog post…