This week, I was in the unenviable position of troubleshooting and recovering a RAID5 array which had TWO failed disks. If you know how RAID5 functions, then your heart probably already fell into your stomach and you are checking your own backups right now. 🙂 That’s right. A RAID5, which requires a minimum of three disks, can survive the failure of a single disk, but not two. So, when I got a call and the problem description included the words “blinking yellow lights on two of the disks”, I knew there was going to be trouble. I tried the standard stuff, like reseating the drives and rebooting the SAN first, but those had no effect.
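For anyone who hasn't looked at why RAID5 tolerates exactly one failure, here is a minimal sketch of the XOR parity idea on a single stripe. It assumes a toy three-disk layout with made-up four-byte blocks; the names `xor_blocks`, `d0`, `d1` and `parity` are mine for illustration and have nothing to do with how the MSA implements it internally.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical three-disk RAID5: two data blocks plus parity.
d0 = b"DATA"
d1 = b"MORE"
parity = xor_blocks(d0, d1)

# One disk lost (say the one holding d1): XOR the survivors to rebuild it.
rebuilt_d1 = xor_blocks(d0, parity)
assert rebuilt_d1 == d1

# Two disks lost (d0 and d1): only parity survives. Any guess for d0 yields
# a d1 that is perfectly consistent with the parity, so the original data
# cannot be determined from what is left.
guess_d0 = b"JUNK"
guess_d1 = xor_blocks(guess_d0, parity)
assert xor_blocks(guess_d0, guess_d1) == parity
assert guess_d1 != d1
```

With one block missing there is exactly one answer; with two missing there are as many "valid" answers as there are possible blocks, which is why a second failure takes the whole array offline.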
Most of the time, in a situation like this, the next step is to rebuild the array with new disks and restore from backup. In this case, there was no recent backup of some of the data. I needed another option.
Since this was an HP MSA2012FC disk enclosure, I had a possible method of bringing the failed array back up by way of the ‘trust’ command in the command-line interface.
The HP documentation describes the trust command as follows:
Enables an offline virtual disk to be brought online for emergency data collection
only. It must be enabled before each use.
Caution – This command can cause unstable operation and data loss if used
improperly. It is intended for disaster recovery only.
The trust command re-synchronizes the time and date stamp and any other
metadata on a bad disk drive. This makes the disk drive an active member of the
virtual disk again. You might need to do this when:
■ One or more disks of a virtual disk start up more slowly or were powered on after
the rest of the disks in the virtual disk. This causes the date and time stamps to
differ, which the system interprets as a problem with the “late” disks. In this case,
the virtual disk functions normally after being trusted.
■ A virtual disk is offline because a drive is failing, you have no data backup, and
you want to try to recover the data from the virtual disk. In this case, trust may
work, but only as long as the failing drive continues to operate.
When the “trusted” virtual disk is back online, back up its data and audit the data to
make sure that it is intact. Then delete that virtual disk, create a new virtual disk,
and restore data from the backup to the new virtual disk. Using a trusted virtual disk
is only a disaster-recovery measure; the virtual disk has no tolerance for any
The most important points here are: 1) you should audit any data recovered from a ‘trusted’ virtual disk, because it may be corrupted; and 2) this will only work if the failed disk is still actually spinning and just ‘fell out of the array’. It won’t help if the disk is completely dead.
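One way to approach the audit, sketched here under some assumptions: you have copied the recovered files to a local directory and you have an older backup to compare against. The function names (`sha256_of`, `audit`) and the report keys are hypothetical, and this is just one possible approach, not anything HP prescribes.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(recovered_root: Path, reference_root: Path) -> dict[str, list[str]]:
    """Compare every recovered file with the same relative path in the reference copy."""
    report: dict[str, list[str]] = {"match": [], "differs": [], "no_reference": []}
    for recovered in sorted(recovered_root.rglob("*")):
        if not recovered.is_file():
            continue
        relative = recovered.relative_to(recovered_root)
        reference = reference_root / relative
        if not reference.is_file():
            # Nothing to compare against: needs a manual look.
            report["no_reference"].append(relative.as_posix())
        elif sha256_of(recovered) == sha256_of(reference):
            report["match"].append(relative.as_posix())
        else:
            # Either legitimately changed since the backup, or corrupted.
            report["differs"].append(relative.as_posix())
    return report
```

Calling `audit(Path("recovered"), Path("old_backup"))` (both paths are placeholders) returns three lists. A file under `differs` is not necessarily corrupt, since it may simply have been edited after the backup was taken, but those files and the ones under `no_reference` are the ones worth opening and checking by hand.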
I was very fortunate: neither of the disks was completely dead, so the trust command worked, and I was able to copy almost all of the data off of the array. Even so, data which had been modified in the several days prior to the failure was corrupted. It was still better than a three-week-old copy of the data, which was the alternative.
This command is obviously no substitute for good, verified and tested backups. But it sure came in handy in a pinch!