Why RAID-Z isn’t appropriate for me (or for almost any home user)

So, ZFS is cool. OpenSolaris derivatives are cool. RAID-Z is cool. But it lacks one simple feature that other software RAID solutions handle – the ability to grow an existing array by adding disks to it (widening the stripe). For instance, suppose you have three 2TB hard disks in a RAID-5 and you want to add two more to make a 5-disk volume. With ZFS, you have two options:

  • Back up everything on the current volume, destroy it, and create a 5-drive RAID-Z from scratch
  • Buy another 2TB drive, create a new vdev out of the 3 new drives, and add it to the zpool

Now, at first, the second option doesn’t sound too bad – until you realize that you’ve basically created a false RAID-Z2 (RAID-6), since you’re now spending two disks on parity. It’s false because if two disks fail in the same vdev you’re cooked; you can only lose one in each and be fine, whereas a real RAID-Z2 survives any two failures. Also, you’re wasting money on an extra disk when you’re a simple home user who wants to scale in small increments.
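For reference, the second option boils down to a single zpool command. This is only a sketch – the pool name (“tank”) and the c*t*d* device names are made-up examples:

    # Add a second, independent 3-disk raidz1 vdev to the existing pool.
    # The pool grows immediately, but fault tolerance stays per-vdev:
    # each raidz1 vdev can still only lose one disk.
    zpool add tank raidz c0t3d0 c0t4d0 c0t5d0
    zpool status tank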

Neither of these issues is a problem for larger deployments – they generally already have disk space for backups (or already have all the data backed up in the first place), or they’re building the entire thing from scratch to store future data. Buying extra disks isn’t a problem either – they have the money. Home users do not.

So, until this is possible, I’ll be using mdadm or a similar solution on OpenFiler or another Linux-based OS.  This is a real shame; I really wanted to start using OpenIndiana.


12 thoughts on “Why RAID-Z isn’t appropriate for me (or for almost any home user)”

  1. Jerry

    I think you should take a look at this blog post:
    http://www.itsacon.net/?p=158

    Not exactly as easy as adding another drive, but it does accomplish increasing the overall ZFS pool size. Whether you can live with this alternative depends on how fast you consume the original data space. I started with 5 500 GB drives in a hardware raid three years ago and now am looking to migrate to something larger. I’m also looking to move away from hardware raid so I can move the drives to a new platform without worry of cost/time to replace the hardware raid card. I have not started the rebuild yet, but ZFS does appear to be the right solution for me.
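    If I’m reading that post right, the trick is to swap each disk in the vdev for a larger one, resilvering after every swap; once the last disk is replaced the pool can use the extra space. Pool and device names below are just placeholders:

        # replace one member disk with a bigger one and wait for the resilver
        zpool replace tank c0t0d0 c0t6d0
        zpool status tank            # repeat for each remaining disk

        # once every disk is larger, let the pool expand into the new space
        # (with autoexpand support; older versions reportedly pick up the
        #  space after an export/import)
        zpool set autoexpand=on tank
        zpool online -e tank c0t6d0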

  2. Erik Post author

    Jerry – thanks for pointing that out. I should have discussed that option in the post. I had seen people mention it, and decided against it since it’s extremely inconvenient. It does depend massively on how fast you fill the space, as you say. If you’re on a 3-year plan, as you seem to be, it’s completely viable. However, I built my 4TB RAID in November, and now, less than a year later, I’m already getting low on space. So, for me, the ability to just slap another drive in there, run 4 commands, and have 2TB more space is extremely nice.
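    For the curious, the usual mdadm grow sequence is only a handful of commands. The device names and mount point below are just examples, and the last step depends on the filesystem (xfs_growfs for XFS, resize2fs for ext3/ext4):

        mdadm --add /dev/md0 /dev/sdf              # add the new disk as a spare
        mdadm --grow /dev/md0 --raid-devices=6     # reshape the RAID5 onto it
        cat /proc/mdstat                           # wait for the reshape to finish
        xfs_growfs /mnt/raid                       # then grow the filesystem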

    Migrating from hardware RAID is a really good idea, unless you have a $300+ RAID card (mainly because it probably has built-in memory). Modern software RAID can get you close to the same performance for most activities – you’re much more limited by the filesystem. Currently, I’m seeing 250MB/s+ for reads and writes using mdadm, and that’s on five 5400 RPM drives – it’s pretty amazing. I’m not using the disks for small files or random access – just large sequential reads – so XFS is working out really well. Deleting files is really slow, but I can live with that.

    Cheers!

  3. Charlie

    Hey Erik
    What solution did you end up with? Debian + mdadm? I’m really in the same position as you seem to be, and 250 MB/s reads/writes would be awesome. I’ve got 3×1.5TB and 3×2TB drives and would love some kind of pool with all the space I can get. What’s my best option?

  4. Erik Post author

    I did use Debian + mdadm, and have been using it since I wrote this post. It still works great. I’ve grown the RAID a couple of times since, and now it’s a 5x2TB RAID5. For your situation, since you have disks of varying sizes, I’d recommend two RAID5s – one with the 1.5TB drives and one with the 2TB drives. If you want them to appear as a single logical volume, you could go for RAID 50 (5+0), which is described here: https://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_50_.28RAID_5.2B0.29
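    A rough sketch of that nested setup with mdadm – every device name here is hypothetical:

        # two independent RAID5s: one from the 2TB disks, one from the 1.5TB disks
        mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc
        mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sdd /dev/sde /dev/sdf

        # stripe the two RAID5s together (RAID 0 over RAID 5 = RAID 50)
        mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/md0 /dev/md1
        mkfs.xfs /dev/md2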

  5. Charlie

    Sounds interesting! (And thanks for the fast reply.) Since I have data on the 3×1.5TB disks, I’m thinking about building a RAID5 with the 3×2TB drives, transferring the data off the 3×1.5TB drives, and then creating a new RAID5 with them (leaving me with two RAID5s, just as you said). Would a merge be possible without data loss at this point? (It’s possible with RAID-Z.)

  6. Isaac

    With 3×1.5TB and 3×2TB, RAID50 will give you a total usable space of 6TB, since the 3×2TB RAID5 will be forced to the same size as the 3×1.5TB RAID5 when you stripe them together. If you want to use all of both RAID5s, you should JBOD (md in linear mode) over the RAID5s. You will end up with 7TB total, with a false two-disk tolerance (it can sustain two disk failures as long as they are in different RAID sets).
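    In mdadm terms that concatenation might look something like this (device names are examples, and LVM over the two arrays is another common way to get the same result):

        # concatenate the two RAID5s instead of striping them, so the full
        # 3TB + 4TB of usable space (7TB) is available
        mdadm --create /dev/md2 --level=linear --raid-devices=2 /dev/md0 /dev/md1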

  7. bas

    I spent the last 14 hours power-searching the web trying to decide between RAID6 and RAIDZ2, and I think yours is the pivotal argument for me as well, along with the fact that RAID6 is already built into the common Linux distros.
    I think the standard Linux RAID is also a bit more flexible so you can do stuff like this:
    http://louwrentius.com/blog/2008/08/building-a-raid-6-array-of-mixed-drives/

    The big upside of RAIDZ, I read, is that it continually checks data against checksums, whereas regular RAID doesn’t error-check at all (it only reads one copy of the data and doesn’t check the redundant copies or parity, which could have silently gone bad). So I think you need a nightly or weekly cron job to check the entire array.

    RAIDZ (well, ZFS itself) also allows you to allocate sizes to different directories, i.e. if you have, say, a 2TB array you can allocate 500G to /home/myself, 50G to /home/grandma, 500G to /appdata, etc. I haven’t seen how to do this with mdadm. I guess you just partition the md device like any other HDD.
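    (For the record, the ZFS way is per-dataset quotas rather than anything RAID-Z-specific – a hypothetical sketch using the sizes above:)

        zfs create tank/home/myself
        zfs set quota=500G tank/home/myself
        zfs create tank/home/grandma
        zfs set quota=50G tank/home/grandma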

  8. Erik Post author

    Thanks for the feedback! You can set up the directory sizes just using UNIX quota:

    http://linux.die.net/man/1/quota

    There’s really no reason to have the filesystem itself do this when the OS can.
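    As a sketch, assuming an ext3/ext4 filesystem mounted with usrquota (XFS does this a bit differently through xfs_quota), per-user limits look something like this – the user name, mount point, and sizes are just examples:

        quotacheck -cum /mnt/raid          # build the quota files
        quotaon /mnt/raid                  # turn quotas on
        setquota -u myself 500000000 500000000 0 0 /mnt/raid   # ~500GB, in 1K blocks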

    WRT dead disks, I’ve definitely seen log events from the kernel for write failures to block devices, and I would expect mdadm to watch for those kinds of errors. I have a feeling that the raw device writes occurring in mdadm’s code pay attention to any errors returned and deal with them appropriately. Since RAID5/6 don’t use a dedicated parity drive, all disks are being written to all the time, so dead disks shouldn’t be too hard to detect. (To be fair, md doesn’t verify parity on normal reads – that’s what the periodic check is for – but any hard read or write error on a member disk does get noticed.) I’ve definitely seen disks kicked out of mdadm without running any sort of cron job – set to FAIL state and such. I’ve also seen rebuilds kick off after an error is detected. It’s pretty solid IMHO.
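    mdadm also has a monitor mode that will email you when a disk gets failed out of an array – roughly along these lines (the address is a placeholder; Debian’s package can set this up for you via MAILADDR in mdadm.conf):

        # watch all arrays in the background and mail on Fail/DegradedArray events
        mdadm --monitor --scan --daemonise --mail=admin@example.com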

    Good luck!

  9. bas

    Yes, I had previously set up a cron job to email me when there was a drive problem. One time my OS disk caught fire! And it took one of the data disks out with it. No email, just my wife waking me up saying the internet wasn’t working anymore :)
    The voltage regulator is what had burnt up. I was able to bypass it and save my data.

    Here’s how to check the entire array:
    http://unix.stackexchange.com/questions/28636/how-to-check-mdadm-raids-while-running

    Seems like good sense to put this on a weekly cron job.
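    The short version from that thread, for an array named md0 – and if I understand it right, Debian’s mdadm package already runs a checkarray script like this from cron once a month:

        echo check > /sys/block/md0/md/sync_action   # start a full read/parity check
        cat /proc/mdstat                             # watch progress
        cat /sys/block/md0/md/mismatch_cnt           # mismatches found, once done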

  10. bas

    On a related note, I spoke with a friend the other day and we agreed to try out BitTorrent Sync http://labs.bittorrent.com/experiments/sync.html

    You can synchronize directories across the internet. It’s similar to what you could do with an SSH tunnel and rsync, but it’s a nicely integrated package that’s also available for Mac and Windows. Knowing BitTorrent, it’s really good at transferring large files over shady connections and piecing them back together successfully (I don’t have a shady connection, but other folks may).
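    For comparison, the rsync-over-SSH version is basically a one-liner (the host and paths are made up):

        rsync -avz --partial -e ssh /data/photos/ user@remote.example.com:/backup/photos/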
