I contend that the next stage in the evolution of storage is “Just a Bunch of Disks” (JBOD), comprising a range of media types with different performance characteristics, with software doing the cleverness.
In this first post (1 of 2) I shall address the resilience aspects of this evolution.
Large RAID (Redundant Array of Independent Disks) systems – the de facto standard for storage systems – are increasingly unreliable, especially given the large disks and slow rebuild times we see on busy arrays.
For example, let’s look at a RAID array of 24 x 2TByte disks. For this calculation I have used our RAID failure rate calculator; the results are summarised below:
| RAID level: | RAID5 | RAID5 | RAID5 | RAID6 | RAID6 | RAID6 |
|---|---|---|---|---|---|---|
| Hot spares: | 0 | 1 | 2 | 0 | 1 | 2 |
| Usable array size (TB): | 46 | 44 | 42 | 44 | 42 | 40 |
| Chance of data loss per year: | 1 in 6.0 | 1 in 10.3 | 1 in 10.5 | 1 in 348.5 | 1 in 1143.8 | 1 in 1189.4 |
As you can see from the table, RAID5 is hopeless with large arrays of large disks. This should not be news to any professional sysadmin (anyone still using RAID5 for such setups should be lined up and shot for being a twit!), but many professionals are still labouring under the illusion that RAID6, especially with hot spares, is safe. Indeed, a few years ago they would have been right.
So, what has changed? Mainly the size of the disks themselves. While disks have grown exponentially larger, their bandwidth (the speed at which data can be read from or written to them) has grown much more slowly. When a disk in a RAID array fails, the controller automatically starts rebuilding the array onto one of the hot spares (or, if you don’t have hot spares, onto the disk you replace the faulty one with). This is also why hot spares make such a difference: the machine is not left waiting on a human to do a disk swap.
Anyway, rebuilding an array takes a long time, especially on full and busy arrays. All the data that was on the failed disk has to be reconstructed onto the newly-recruited hot spare, which involves reading data from all of the other disks; if they are busy, as is usually the case on a live array, the process drags on. We often see rebuild rates as low as 10MB/sec (100Mbps) on very busy machines. I have been more generous in our example, but even then it would take about 1 day and 3 hours to rebuild a full array.
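If you want to check that arithmetic, here it is as a quick Python snippet. The 20MB/sec rate is my assumed “more generous” figure, not a number taken from the calculator:

```python
# Back-of-envelope rebuild time for a single 2TByte disk.
# 20MB/sec is an assumed rate (somewhat more generous than the
# 10MB/sec we see on very busy machines).

DISK_BYTES = 2 * 10**12      # 2TByte disk
REBUILD_RATE = 20 * 10**6    # assumed sustained rebuild rate, bytes/sec

hours = DISK_BYTES / REBUILD_RATE / 3600
print(f"Rebuild time: about {hours:.1f} hours")  # ~27.8 hours, just over a day
```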
Herein lies the problem: if you get one more failure (in the case of RAID5) or two more failures (in the case of RAID6) while the array is rebuilding itself after the initial failure, then you’re buggered; you can expect to lose all your data. This is also why having more hot spares does not really help; you are still vulnerable during that window. It should also be noted that, if you have hot spares, the time it takes to physically replace a disk becomes fairly inconsequential.
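For the curious, here is a deliberately simplified Python sketch of that vulnerability-window reasoning. The per-disk failure rate is a hypothetical assumption, and the model ignores unrecoverable read errors (which a proper calculator must also account for, and which get much worse with multi-terabyte rebuilds), so its output will not match the table above:

```python
# Toy vulnerability-window model for RAID5. Illustrative only: the AFR
# is hypothetical and unrecoverable read errors are ignored, so this
# will not reproduce the table's figures.

AFR = 0.05           # hypothetical annual failure rate per disk
DISKS = 24           # disks in the array
REBUILD_HOURS = 27   # rebuild window from the example above
HOURS_PER_YEAR = 24 * 365

# Expected number of initial disk failures across the array per year.
first_failures_per_year = DISKS * AFR

# Chance that at least one surviving disk also fails while the array
# is still rebuilding (the window of vulnerability).
p_disk_in_window = AFR * REBUILD_HOURS / HOURS_PER_YEAR
p_second_failure = 1 - (1 - p_disk_in_window) ** (DISKS - 1)

# For RAID5, one further failure during the rebuild loses everything.
raid5_loss_per_year = first_failures_per_year * p_second_failure
print(f"RAID5: roughly 1 in {1 / raid5_loss_per_year:,.0f} per year")
```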
By contrast, our cloud storage system Memstore, which is based on OpenStack Swift (the same system Rackspace uses for Cloud Files), boasts a mere 1 in 100,000,000 chance per year of losing any individual object. This is quite a fair comparison since, physically, Memstore consists of a bunch of 3U servers, each with 24 disks in them, but not using RAID.
Not only are those odds vastly better than the RAIDs’, they are the odds of losing just a single object, which is generally a lot better than losing an entire 40TB of data in that rare failure scenario. The trade-off is that, if you have 100,000,000 objects, you can expect to lose about one of them per year.
Memstore achieves its resilience by storing every object (or file) on at least three different nodes (hard disks). If any node fails, the system automatically re-distributes the objects that have lost a copy onto a new node, and it re-creates the lost node very swiftly because the disk cluster as a whole is more lightly loaded. Here the comparison is a little unfair, since we are being somewhat wasteful with disks: even our maximally-resilient RAID still uses 20 out of 24 disks for data (83%), whereas Memstore only makes 33% of the raw space available for storage.
This means that, to lose data, three disks on different machines must fail within a short time of one another. A slim chance!
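As a rough sanity check on those odds, here is a Python sketch of the triple-replica reasoning. The failure rate and re-replication window are hypothetical assumptions, not Memstore’s actual parameters:

```python
# Rough sketch of per-object loss odds under 3-way replication.
# AFR and the re-replication window are hypothetical assumptions,
# not Memstore's actual parameters.

AFR = 0.05              # hypothetical annual failure rate per disk
REREPLICATE_HOURS = 2   # assumed time to re-copy a lost replica
HOURS_PER_YEAR = 24 * 365

# Chance a given disk fails during the re-replication window.
p_window = AFR * REREPLICATE_HOURS / HOURS_PER_YEAR

# An object is lost only if, after one replica's disk fails, both of
# the other replicas' disks also fail before re-replication finishes.
p_object_loss = (3 * AFR) * p_window ** 2
print(f"Per-object loss: about 1 in {1 / p_object_loss:.1e} per year")

# And the trade-off mentioned above: at 1 in 100,000,000 per object
# per year, 100,000,000 objects lose about one object per year.
print(f"Expected losses/year at the quoted rate: {1e8 * 1e-8:.1f}")
```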
However, my contention is that this doesn’t matter because disks are ridiculously cheap! Even with that triplication our costs for Memstore are <£25/TByte.
Some of you will now rightly be thinking, “but what about the performance loss?” or, “but it is not a file system”. You would be right to do so, but even there we have emerging solutions which I shall discuss in my next blog post…
When dealing with ULRDBMS, it is important not to overlook the benefits of RAID100.
RAID100 (1+0+0) offers great resilience, high read and write throughput (random and sequential) and some terrific real-world benefits: by splitting the array from time to time, you can simultaneously run backup and integrity-checking routines before rebuilding the mirror (which, of course, you haven’t changed, since you have logged all DB changes elsewhere whilst carrying out maintenance).
Adding asynchronous offsite replication to another site (or sites) allows even more flexibility in terms of resilience, throughput and continual integrity checking.
The cost of drives is largely incidental; the cost of getting data to and from them still isn’t.
In the real world, I still come across organisations which effectively batch-process yet back up temp DBs (which, as Molesworth says, are rebuilt each time they are started) and access data leaves when branch (index) traversal would be sufficient. This is wasteful, and not an activity that many non-DBAs are even aware of as they try to optimise datacentre throughput.
To optimise storage across the organisation, it can be useful to separate the transactional and analytical query loads. To make this happen with the least impact (automated), more ‘duplication’ may occur as data is replicated into ‘staging’ DB stores before being assembled into a federation of data marts tailored to analytical requirements, with no impact upon transactional throughput.
R+C