RCBIG -- you might not think this could happen

3/18/2009. You might not think this could happen. When I got an email from Ada Watson, who'd indicated that her call to the helpdesk hadn't been answered, I assumed I was just looking at something routine. Well. One thing about a RAID controller is that it can hang. Not usually a terrible thing -- users get an incomprehensible error message, something about I/O. A system person goes to the server to reboot. And all is well.
Not today.
First note: I get an email message whenever something "interesting" happens to the RAID controller -- bad drive, timeout, etc. I hadn't received any for this machine. After rebooting the machine, I saw there was a bad drive. So I pulled that drive and replaced it. However, when I ran the administrative program for the RAID controller, it now showed there were two bad drives. One was the spare, and one was part of the RAID set. But the spare wasn't being pulled into the RAID set. Worse, when I tried to /force/ use of the spare drive, an error showed up.

Second note: Sometimes, it's helpful to reboot a server to allow it to find a drive that hasn't spun up properly. That's what I did. MISTAKE! On reboot, yet another drive mysteriously decided to be NOT PRESENT. A RAID5 can survive a missing spare and a missing drive, but it cannot survive a failed spare and two missing drives.
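The arithmetic behind that last sentence is simple enough to sketch. This is a toy model in Python, not anything the controller's administrative program actually exposes; the function name and the state labels are made up for illustration. The only real fact in it is that RAID5 carries one drive's worth of parity, so it tolerates exactly one missing member, and a spare only matters for rebuilding.

```python
def raid5_state(members_missing: int, usable_spares: int) -> str:
    """Toy model of a RAID5 set's health (hypothetical helper, not a vendor API).

    RAID5 holds a single drive's worth of parity, so the set stays readable
    with at most one member missing. A hot spare does not add redundancy by
    itself; it only lets a degraded set rebuild.
    """
    if members_missing == 0:
        return "optimal"
    if members_missing == 1:
        # Data is still reconstructible from parity.
        return "degraded, can rebuild" if usable_spares > 0 else "degraded"
    # Two or more members gone: parity cannot reconstruct the data.
    return "failed"


# Before the second reboot: one member missing, spare dead.
print(raid5_state(members_missing=1, usable_spares=0))
# After the second reboot: two members missing, spare still dead.
print(raid5_state(members_missing=2, usable_spares=0))
```

In this model the first case is unpleasant but alive, and the second is the one I ended up in.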

Third note: Result: a RAIDset, gone.

Fourth note: I began a restore from tape, and thought to let an ongoing backup continue. MISTAKE! The server hung. So, here I am, once again, wondering if I should be setting up a sleep-sofa in my office.

I'm rebalancing the indexes on the backup server. Then I'm going to start the restore again.
Here's an unpleasant observation: I've had to interrupt backups a number of times in the past week to restore the RAIDsets from RESSRV16. And here I am again. The question of balance is: Restore or Backup? I'm going with restore.

But it would be very bad if the backup server failed. Ulp.