Evolution of storage #1: resilience

I contend that the next stage in the evolution of storage is “Just a Bunch of Disks” (JBOD): a range of media types with different performance characteristics, with software doing the cleverness.

In this first post (1 of 2) I shall address the resilience aspects of this evolution.

RAID failings

Large RAID (Redundant Array of Independent Disks) systems – the de facto standard for storage systems – are increasingly unreliable, especially with the large disks and slow rebuild times we see on busy arrays.

For example, let’s look at an array of 24 x 2TByte disks. For this calculation I have used our RAID failure rate calculator; a rough sketch of the arithmetic follows the list below. First, some assumptions:

  • RAID6, which means that any two disks may fail simultaneously and you still have your data, or RAID5, which means you can only lose one disk.
  • Between zero and two hot spares (see below). A hot spare is a disk in the chassis which the RAID controller can automatically add into the array in the event of a failure. This takes time since all the data that was on the failed disk in effect has to be re-created / re-written. This cannot be done in advance since the data on each of the disks in use will be different.
  • A rebuild speed of 20MBytes/second. This is also based on our data; most RAID arrays are quite busy doing their work and therefore have a limited amount of bandwidth to devote to rebuilding.
  • An Annual Failure Rate (AFR) of 3.6% – i.e. a 3.6% chance in each year that a single disk will fail. This is based on our actual collected data from thousands of 2TB disk-years in service.
  • A lifetime of 3 years.
  • That the probability of failure is evenly distributed across a disk’s lifetime. This is not a great assumption since, in our experience, disks are more likely to fail either in the first few weeks of service or increasingly as they age, but it will do for now.
  • A time to replace a faulty hard drive of 24 hours. Again, this is based on actual data; our customers prefer that we let them know and get permission before swapping disks even in hot swap chassis.
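
For those who like to check the arithmetic, below is a rough sketch of the kind of sum involved. It is emphatically not our actual RAID failure rate calculator (which presumably factors in things this toy model ignores, such as unrecoverable read errors during a rebuild), just a toy model in Python built from the assumptions above, so its figures will differ from the table – particularly for RAID6:

```python
# Toy model of RAID data-loss risk during a rebuild window, using the
# assumptions listed above. NOT Memset's RAID failure rate calculator.
from math import comb

N_DISKS = 24               # disks in the array
DISK_MB = 2_000_000        # 2 TBytes per disk, expressed in MBytes
AFR = 0.036                # 3.6% annual failure rate per disk
REBUILD_MB_S = 20.0        # assumed rebuild speed (MBytes/second)
REPLACE_HOURS = 24.0       # time for a human to swap a failed disk
HOURS_PER_YEAR = 365 * 24

# Time to rebuild one full disk: roughly 27.8 hours at 20 MB/s.
rebuild_hours = DISK_MB / REBUILD_MB_S / 3600

def p_disk_fails_within(hours):
    """Chance one disk fails during a window of `hours`, assuming failures
    are spread evenly over the year (as per the assumptions above)."""
    return AFR * hours / HOURS_PER_YEAR

def p_array_loss_per_year(extra_failures_needed, hot_spares):
    """Approximate chance per year of losing the whole array: an initial
    failure followed by too many further failures during the rebuild."""
    # Without a hot spare the array is also exposed during the 24h swap time.
    # (In this toy model a second hot spare makes no further difference.)
    window = rebuild_hours + (0 if hot_spares else REPLACE_HOURS)
    p_one = p_disk_fails_within(window)
    survivors = N_DISKS - 1
    # Binomial tail: at least `extra_failures_needed` of the surviving disks
    # also fail inside the vulnerability window.
    p_bad_window = sum(
        comb(survivors, k) * p_one**k * (1 - p_one)**(survivors - k)
        for k in range(extra_failures_needed, survivors + 1)
    )
    # Expected initial failures per year times chance the window goes badly.
    return N_DISKS * AFR * p_bad_window

for level, extra in (("RAID5", 1), ("RAID6", 2)):
    for spares in (0, 1, 2):
        chance = p_array_loss_per_year(extra, spares)
        print(f"{level}, {spares} hot spare(s): ~1 in {1/chance:,.0f} per year")
```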

RAID level:                   | RAID5    | RAID5     | RAID5     | RAID6      | RAID6       | RAID6
Hot spares:                   | 0        | 1         | 2         | 0          | 1           | 2
Usable array size (TB):       | 46       | 44        | 42        | 44         | 42          | 40
Chance of data loss per year: | 1 in 6.0 | 1 in 10.3 | 1 in 10.5 | 1 in 348.5 | 1 in 1143.8 | 1 in 1189.4

As you can easily see from the table, RAID5 is hopeless with large arrays of large disks. This should not be news to any sysadmin professional (anyone still using RAID5 for such setups should be lined up and shot for being a twit!), but many professionals are still labouring under the illusion that RAID6, especially with hot spares, is safe. Indeed, a few years ago they would have been right.

So, what has changed? Well, the main thing is the size of the disks themselves. While disks have grown exponentially larger, their bandwidth (the speed with which data can be read from or written to them) has grown much more slowly. As mentioned above, when a disk in a RAID array fails the controller automatically starts rebuilding the array onto one of the hot spares (or, if you don’t have hot spares, onto the disk you replace the faulty one with). This is also why hot spares make such a difference; the machine is not waiting on a human to do a disk swap.

Anyway, rebuilding an array takes a long time, especially on full and busy arrays. All the data that was on the failed disk now has to be written to the newly recruited hot spare, which involves reading data from all of the other disks; if they are busy, which is usually the case in a live array, the process drags on. We often see rebuild rates as low as 10MB/sec (80Mbps) for very busy machines. I’ve been more generous in our example, but even then it would take 1 day, 3 hours to rebuild a full array.
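
As a quick sanity check on that rebuild figure, here is the sum for a single 2 TByte disk at a few assumed rebuild speeds:

```python
# Back-of-envelope rebuild time for one 2 TByte disk at various rebuild speeds.
DISK_MB = 2_000_000  # 2 TBytes expressed in MBytes

for speed_mb_s in (10, 20, 50):
    hours = DISK_MB / speed_mb_s / 3600
    days, remainder = divmod(hours, 24)
    print(f"{speed_mb_s} MB/s -> about {int(days)} day(s), {remainder:.1f} hours")
```

At 20 MB/s that comes out at a little under 28 hours – in other words, over a day of vulnerability.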

Herein lies the problem: if you get one more failure (in the case of RAID5) or two more failures (in the case of RAID6) while the array is rebuilding itself after the initial failure then you’re buggered; you can expect to lose all data. This is also why having more hot spares does not really help; you are still vulnerable during that window. It should also be noted that, if you have hot spares, the time to replace a disk becomes fairly inconsequential.

Cloud storage

By contrast, our cloud storage system Memstore, which is based on OpenStack Swift (the same system RackSpace uses for CloudFiles), boasts a 1 in 100,000,000 chance per year of losing any individual object. This is also quite a fair comparison since, physically, Memstore consists of a bunch of 3U servers, each with 24 disks in them, but not using RAID.

Not only are those odds vastly better than the RAID arrays’, but they are the odds of losing just one object, which is generally a lot better than losing an entire 40TB of data in that rare failure scenario. The trade-off is that if you have 100,000,000 objects you can expect to lose about one per year.

Memstore achieves its resilience by storing every object (or file) on at least three different nodes (hard disks). If any node fails, it automatically re-distributes the objects that have lost a copy onto other nodes, and it re-creates the lost node’s contents very swiftly because the cluster as a whole is more lightly loaded. Here is where the comparison is a little unfair, since we are being somewhat wasteful with disks: even in our maximally-resilient RAID we are still using 20 out of 24 disks for data (83%), whereas Memstore only makes 33% of the space available for storage.
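
To put numbers on that overhead, here is the trivial usable-capacity sum for one 24-disk chassis, comparing the RAID6-plus-two-hot-spares example above with straight three-way replication (a simplification of how Swift actually places copies, but it matches the figures in this post):

```python
# Usable capacity of one 24 x 2 TByte chassis: RAID6 + 2 hot spares versus
# three-way replication. Figures follow the examples in the post.
DISKS, DISK_TB = 24, 2.0
raw_tb = DISKS * DISK_TB                      # 48 TB of raw disk

# RAID6 + 2 hot spares: 20 of the 24 disks hold data.
raid_usable = (DISKS - 2 - 2) * DISK_TB

# Three-way replication: every object is stored on three different disks.
replicated_usable = raw_tb / 3

print(f"RAID6 + 2 spares: {raid_usable:.0f} TB usable ({raid_usable / raw_tb:.0%})")
print(f"3x replication:   {replicated_usable:.0f} TB usable ({replicated_usable / raw_tb:.0%})")
```

That gives 40 TB usable (83%) for the RAID chassis against 16 TB usable (33%) for the replicated one.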

This means that to lose data, three disks on different machines must fail within a short time of one another. A slim chance!

However, my contention is that this doesn’t matter because disks are ridiculously cheap! Even with that triplication our costs for Memstore are <£25/TByte.

Performance

Some of you will now rightly be thinking, “but what about the performance loss?” or, “but it is not a file system”. Those are fair points, but even there we have emerging solutions, which I shall discuss in my next blog post…
