Technical Blog of Tux&Cía., Santa Cruz de la Sierra, BO

Wednesday, October 20, 2010

ZFS RAID recommendations

Space, performance, and MTTDL (Jan 30, 2007)

In this blog I connect the space vs MTTDL models with a performance model. This gives enough information for you to make RAID configuration trade-offs for various systems. The examples here target the Sun Fire X4500 (aka Thumper) server, but the models work for other systems too. In particular, the model results apply generically to RAID implementations on just a bunch of disks (JBOD) systems.
The best thing about a model is that it is a simplification of real life.
The worst thing about a model is that it is a simplification of real life.

Small, Random Read Performance Model

For this analysis, we will use a small, random read performance model. The calculations for the model can be made with data which is readily available from disk data sheets. We calculate the expected I/O operations per second (iops) based on the average read seek and rotational speed of the disk. We don't consider the command overhead, as it is generally small for modern drives and is not always specified in disk data sheets.
maximum rotational latency = 60,000 (ms/min) / rotational speed (rpm)
iops = 1000 (ms/s) / (average read seek time (ms) + (maximum rotational latency (ms) / 2))
Since most disks use consistent rotational speeds, this small table may help you to see what the rotational speed contribution will be.
Rotational Speed (rpm)    Maximum Rotational Latency (ms)
 4,200                    14.3
 5,400                    11.1
 7,200                     8.3
10,000                     6.0
15,000                     4.0
For example, if we have a 73 GByte, 2.5" Seagate Savvio SAS drive which has a 4.1 ms average read seek and rotational speed of 10,000 rpm:
iops = 1000 / (4.1 + (6.0 / 2)) = 140.8
By comparison, a 750 GByte, 3.5" Seagate Barracuda SATA drive has an 8.5 ms average read seek and a rotational speed of 7,200 rpm:
iops = 1000 / (8.5 + (8.3 / 2)) = 79.0
I purposely used those two examples because people are always wondering why we tend to prefer smaller, faster, and (unfortunately) more expensive drives over larger, slower, less expensive drives - a 78% performance improvement is rather significant. The 3.5" drives also use about 25-75% more power than their smaller cousins, largely due to the rotating mass. Small is beautiful in a SWaP sense.
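The model above can be sketched in a few lines of Python. This is only a minimal sketch: the function names `max_rotational_latency_ms` and `small_random_read_iops` are made up for illustration, and the seek times and rotational speeds are the ones quoted in the two examples. Because the code uses the unrounded latency (8.33 ms rather than 8.3 ms at 7,200 rpm), the Barracuda figure comes out near 78.9 instead of 79.0.

```python
def max_rotational_latency_ms(rpm):
    """Time for one full revolution, in milliseconds."""
    return 60_000.0 / rpm


def small_random_read_iops(avg_read_seek_ms, rpm):
    """Expected small, random read iops; command overhead is ignored."""
    avg_rotational_latency_ms = max_rotational_latency_ms(rpm) / 2.0
    return 1000.0 / (avg_read_seek_ms + avg_rotational_latency_ms)


# Reproduce the rotational latency table.
for rpm in (4_200, 5_400, 7_200, 10_000, 15_000):
    print(f"{rpm:>6} rpm: {max_rotational_latency_ms(rpm):4.1f} ms per revolution")

# Seek times and speeds taken from the two drive examples in the text.
savvio = small_random_read_iops(avg_read_seek_ms=4.1, rpm=10_000)
barracuda = small_random_read_iops(avg_read_seek_ms=8.5, rpm=7_200)

print(f"Savvio:      {savvio:.1f} iops")
print(f"Barracuda:   {barracuda:.1f} iops")
print(f"Improvement: {(savvio / barracuda - 1) * 100:.0f}%")
```

Swapping in the average read seek time and rotational speed from any other data sheet gives a comparable estimate for that drive.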
Next we use the RAID set configuration information to calculate the total small, random read iops for the zpool or volume. Here we need to talk about sets of disks which may make up a multi-level zpool or volume. For example, RAID-1+0 is a stripe (RAID-0) of mirrored sets (RAID-1). RAID-0 is a stripe of disks.
  • For dynamic striping (RAID-0), add the iops for each set or disk. On average the iops are spread randomly across all sets or disks, gaining concurrency.
  • For mirroring (RAID-1), add the iops for each set or disk. For reads, any set or disk can satisfy a read, so we also get concurrency.
  • For single parity raidz (RAID-5), the set operates at the performance of one disk. See below.
  • For double parity raidz2 (RAID-6), the set operates at the performance of one disk. See below.
For example, if you have 6 disks, there are many different ways you can configure them, each with a different performance result:
RAID Configuration (6 disks)                     Small, Random Read Performance Relative to a Single Disk
6-disk dynamic stripe (RAID-0)                   6
3-set dynamic stripe, 2-way mirror (RAID-1+0)    6
2-set dynamic stripe, 3-way mirror (RAID-1+0)    6
6-disk raidz (RAID-5)                            1
2-set dynamic stripe, 3-disk raidz (RAID-5+0)    2
2-way mirror, 3-disk raidz (RAID-5+1)            2
6-disk raidz2 (RAID-6)                           1
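The combination rules for the 6-disk configurations can be expressed as a small recursive calculation. The sketch below is illustrative only: the nested `(kind, members)` representation and the helper names `relative_iops` and `group` are assumptions made for this example, not anything ZFS itself exposes.

```python
def relative_iops(vdev):
    """Small, random read iops of a layout, in units of one disk's iops."""
    if vdev == "disk":
        return 1
    kind, members = vdev
    if kind in ("stripe", "mirror"):
        # Reads spread across stripe members, or any mirror side can serve them.
        return sum(relative_iops(member) for member in members)
    if kind in ("raidz", "raidz2"):
        # A parity set performs like a single disk for small, random reads.
        return 1
    raise ValueError(f"unknown layout kind: {kind}")


def group(kind, count, member="disk"):
    """Build a layout of `count` identical members (hypothetical helper)."""
    return (kind, [member] * count)


layouts = {
    "6-disk dynamic stripe (RAID-0)": group("stripe", 6),
    "3-set dynamic stripe, 2-way mirror (RAID-1+0)": group("stripe", 3, group("mirror", 2)),
    "2-set dynamic stripe, 3-way mirror (RAID-1+0)": group("stripe", 2, group("mirror", 3)),
    "6-disk raidz (RAID-5)": group("raidz", 6),
    "2-set dynamic stripe, 3-disk raidz (RAID-5+0)": group("stripe", 2, group("raidz", 3)),
    "2-way mirror, 3-disk raidz (RAID-5+1)": group("mirror", 2, group("raidz", 3)),
    "6-disk raidz2 (RAID-6)": group("raidz2", 6),
}

for name, layout in layouts.items():
    print(f"{name}: {relative_iops(layout)}")
```

Multiplying each result by the single-disk iops from the data sheet calculation gives an absolute estimate for the zpool or volume.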
Clearly, using mirrors improves both performance and data reliability. Using stripes increases performance, at the cost of data reliability. raidz and raidz2 offer data reliability, at the cost of performance. This leads us down a rathole...

The Parity Performance Rathole

Many people expect that data protection schemes based on parity, such as raidz (RAID-5) or raidz2 (RAID-6), will offer the performance of striped volumes, except for the parity disk. In other words, they expect that a 6-disk raidz zpool would have the same small, random read performance as a 5-disk dynamic stripe. Similarly, they expect that a 6-disk raidz2 zpool would have the same performance as a 4-disk dynamic stripe. ZFS doesn't work that way, today. ZFS uses a checksum to validate the contents of a block of data written. The block is spread across the disks (vdevs) in the set. In order to validate the checksum, ZFS must read the blocks from more than one disk, thus not taking advantage of spreading unrelated, random reads concurrently across the disks. In other words, the small, random read performance of a raidz or raidz2 set is, essentially, the same as the single disk performance. The benefit of this design is that writes should be more reliable and faster because you don't have the RAID-5 write hole or read-modify-write performance penalty.
Many people also think that this is a design deficiency. As a RAS guy, I value the data validation offered by the checksum over the performance supposedly gained by RAID-5. Reasonable people can disagree, but perhaps some day a clever person will solve this for ZFS.
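To make the gap between the common expectation and the model concrete, here is a rough sketch for a 6-disk set. The function names are hypothetical, and the 79 iops single-disk figure is simply borrowed from the 7,200 rpm SATA example as an assumption.

```python
SINGLE_DISK_IOPS = 79.0  # assumed: the 7,200 rpm SATA example from the text


def expected_stripe_like_iops(disks, parity_disks, disk_iops=SINGLE_DISK_IOPS):
    """What many people expect: a stripe across the non-parity disks."""
    return (disks - parity_disks) * disk_iops


def modeled_parity_set_iops(disks, parity_disks, disk_iops=SINGLE_DISK_IOPS):
    """What the model here predicts: one disk's worth of iops per set."""
    return disk_iops


for name, disks, parity_disks in (("raidz", 6, 1), ("raidz2", 6, 2)):
    expected = expected_stripe_like_iops(disks, parity_disks)
    modeled = modeled_parity_set_iops(disks, parity_disks)
    print(f"6-disk {name}: expected {expected:.0f} iops, model predicts {modeled:.0f} iops")
```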
So, what do other logical volume managers or RAID arrays do? The results seem mixed. I have seen some RAID array performance characterization data which is very similar to the ZFS performance for parity sets. I have heard anecdotes that other implementations will read the blocks and only reconstruct a failed block as needed. The problem is, how do such systems know that a block has failed? Anecdotally, it seems that some of them trust what is read from the disk. To implement a per-disk block checksum verification, you'd still have to perform at least two reads from different disks, so it seems to me that you are trading off data integrity for performance. In ZFS, data integrity is paramount. Perhaps there is more room for research here, or perhaps it is just one of those engineering trade-offs that we must live with.

Other Performance Models

I'm also looking for other performance models which can be applied to generic disks with data that is readily available to the public. The reason that the small, random read iops model works is that it doesn't need to consider caching or channel resource utilization. Adding these variables would require some knowledge of the configuration topology and the cache policies (which may also change with firmware updates). I've kicked around the idea of a total disk bandwidth model which would describe a range of possible bandwidths based upon the media speed of the drives, but it is not clear to me that it would offer any satisfaction. Drop me a line if you have a good model or further thoughts on this topic.
You should be cautious about extrapolating the performance results described here to other workloads. You could consider this to be a worst-case model because it assumes 0% disk cache hits. I would hope that most workloads exhibit better performance, but rather than guessing (hoping) the best way to find out is to run the workload and measure the performance. If you characterize a number of different configurations, then you might build your own performance graphs which fit your workload.

Putting It All Together

Now we have a method to compare a variety of different ZFS or RAID disk configurations by evaluating space, performance, and MTTDL. First, let's look at single parity schemes such as 2-way mirrors and raidz on the Sun Fire X4500 (aka Thumper) server.
[Graph: Single Parity Model Results]
Here you can see that a 2-way mirror (RAID-1, RAID-1+0) has better performance and MTTDL than raidz for any specific space requirement except for the case where we run out of hot spares for the 2-way mirror (using all 46 disks for data). By contrast, all of the raidz configurations here have hot spares. You can use this to help make design trade-offs by prioritizing space, performance, and MTTDL.
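The space and performance sides of this comparison can be roughed out for the 46 data disks. The sketch below makes two loud assumptions that are not from the original analysis: every set in a pool has the same width, and any disks left over after building whole sets become hot spares. It reports space and iops in units of a single disk, and it does not attempt the MTTDL calculation. The function name `summarize` is hypothetical.

```python
TOTAL_DISKS = 46  # disks available for data on the X4500, per the text


def summarize(kind, set_width, total_disks=TOTAL_DISKS):
    """Return sets, spares, space and iops (the last two in single-disk units)."""
    # Disks per set that do not add usable space.
    redundant_disks = {"mirror": set_width - 1, "raidz": 1, "raidz2": 2}[kind]
    sets = total_disks // set_width
    spares = total_disks - sets * set_width  # assumption: leftovers become hot spares
    space = sets * (set_width - redundant_disks)
    # Mirrors add the iops of every disk; a raidz or raidz2 set counts as one disk.
    iops = sets * (set_width if kind == "mirror" else 1)
    return {"sets": sets, "spares": spares, "space": space, "iops": iops}


for kind, width in (("mirror", 2), ("raidz", 5), ("raidz", 6), ("raidz", 9)):
    print(f"{kind:<7} width {width}: {summarize(kind, width)}")
```

Under these assumptions the 2-way mirror uses all 46 disks and leaves no hot spare, which matches the exception noted above.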
You'll also note that, once again, I did not label the left-side Y axis (MTTDL), for the reasons I explained previously, but I did label the right-side Y axis (small, random read iops). I did this with mixed emotion. I labeled the performance axis so that you can do a rough comparison to the double parity graph below. Note that in the double parity graph, the MTTDL axis is in units of millions of years, rather than the years used in the graph above.
[Graph: Double Parity Model Results]
Here you can see the same sort of comparison between 3-way mirrors and raidz2 sets. The mirrors still outperform the raidz2 sets and have better MTTDL.
Some people ask me which layout I prefer. For the vast majority of cases, I prefer mirroring. Then, of course, we get into some sort of discussion about how raidz, raidz2, RAID-5, RAID-6 or some other configuration offers more GBytes/$. For my 2007 New Year's resolution I've decided to help make the world a happier place.  If you want to be happier, you should use mirroring with at least one hot spare.

Conclusion

We can make design trade-offs between space, performance, and MTTDL for disk storage systems. As with most engineering decisions, there often is not a clear best solution given all of the possible solutions. By using some simple models, we can see the trade-offs more clearly.
