There have been many times over the years when we’ve had to call a customer to explain that their co-located server has crashed and won’t boot due to what appears to be a disk failure. Sometimes, the customer will inform us that this is impossible. Why? Because the server has RAID, so a disk failure can’t happen. That’s when we have to explain that RAID can and does break sometimes.
After a moment of silence, the customer will begin to pass through the same five emotional stages that you’ve heard applied to the terminally ill.
1. Denial: “This can’t be happening. We put more than one drive in that machine so this could never happen!”
2. Anger: “What the hell did you people do to our server? It has RAID, so you must have done something to cause this disaster. It’s not our fault! This isn’t fair!”
3. Bargaining: “OK, can we at least get the data off the drives? If we can retrieve the data, we’ll have lots of work to do in order to recover, but all won’t be lost.”
4. Depression: “Sigh. What’s the point? We’re offline and it seems that we’ve lost all our vital data. I don’t know what we’re going to do now.”
5. Acceptance: “I guess we should have been monitoring the RAID containers. Let’s get that server back online and we’ll have to try to restore from scratch. This time we’ll make sure we’re taking the proper steps to manage things correctly.”
[Warning: Cheesy infomercial reference, dead ahead!]
RAID is not the data equivalent of the Ronco Showtime Rotisserie. You can’t just “set it and forget it.”
[I told you it was cheesy.]
The fact that you installed multiple drives in your web/mail/database server is great. The fact that you’re using RAID to protect and preserve your data is great too. What many people seem to forget is that multiple drives and RAID are only two-thirds of the equation. Proper management is the too-often missing piece of the puzzle that leads to disaster.
All disk drives will fail at some point. It’s going to happen. There’s no sense denying it. When you set up your RAID, you must remember this. Those drives will fail over time. The beauty of RAID is that when a drive fails, you can replace it without having to go into panic mode. Recovery should be relatively painless. That is what RAID is all about: easy recovery from drive failure. Awesome, right? Yes, but only if you actually know that a drive has failed. If you don’t know, then you can’t do anything about it. After one drive failure you’re still in business, but when drive number two fails and you haven’t replaced drive number one because you weren’t paying attention, you’re up a creek without a paddle.
If you build a server with RAID and then never check on the health of your RAID containers, all you’ve done is buy yourself extra time before you have to panic. Maybe by the time that second (or third) drive fails and the server goes offline for a week, you’ll already be working somewhere else and it won’t matter to you. Or maybe, after this completely avoidable disaster, you’ll be working somewhere else even if you hadn’t planned on it.
How does, “Would you like fries with that?” strike you?
So what should you do to avoid your own RAID tale of woe? Three simple things:
A. Check the status of your RAID containers every day. Yes, I said EVERY day. There are usually ways to automate that, but if you can’t, then you’re going to have to log in to your server(s) and check it all manually every morning while you’re having your coffee and donut.
B. Always have spare drives ready. It’s no fun trying to find a 36 GB SCSI drive when all you see are SATA drives for miles around you. Even if you can use readily available drives, do you want to wait to have drives shipped before you can replace one that’s failed? Have a spare drive or two at the datacenter just in case.
C. Back up often. It’s not always the drives that sink RAID systems. Sometimes motherboards or RAID controllers die too, in which case you’ve got intact data on working drives that you can’t access. Have your data backed up someplace else at all times.
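To make point A a little more concrete, here is a minimal sketch of what an automated daily check might look like. It assumes Linux software RAID (md), where the kernel reports array status in /proc/mdstat; a hardware RAID controller will need its vendor’s own tool instead. The function names are made up for illustration, and this is a starting point, not a finished monitoring system.

```python
import re
from pathlib import Path

# An array header line looks like: "md0 : active raid1 sdb1[1] sda1[0]"
ARRAY_LINE = re.compile(r"^(\S+)\s*:\s*(active|inactive)\b")
# The member status looks like "[UU]" (healthy) or "[U_]" (one member missing)
STATUS = re.compile(r"\[([U_]+)\]")


def find_degraded_arrays(mdstat_text):
    """Return the names of md arrays that are inactive or missing a member."""
    problems = []
    current = None  # array whose status line we are still waiting for
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            name, state = header.groups()
            if state == "inactive":
                # Inactive arrays have no [UU]-style status line to inspect
                problems.append(name)
                current = None
            else:
                current = name
            continue
        status = STATUS.search(line)
        if current and status:
            # An underscore marks a failed or missing member
            if "_" in status.group(1):
                problems.append(current)
            current = None
    return problems


def check_host(path="/proc/mdstat"):
    """Read the kernel's md status file (assumed path) and report problem arrays."""
    return find_degraded_arrays(Path(path).read_text())
```

Something like this could be run from cron every morning, with any non-empty result from check_host() turned into an email or a page. It only tells you a drive has dropped out; you still need the spare drive and the backups.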
Not managing your RAID can lead to some nasty and totally avoidable situations. If you haven’t already, take a moment to review your servers to ensure that you’re practicing good RAID management. One day you’ll be happy you did.