Monday 5 January 2009

Enterprise Computing: RAID Is Not Enough


Happy New Year and welcome back to all my readers!
:-)

I've been messing about with some old hard drives this week and, unusually for me, one is sounding decidedly sickly. I've never had a personal hard drive go on me (I guess I always upgrade/move on before it happens), but rest assured I've had plenty "fail" in the Enterprise arena. Usually those failures are pre-emptive microcode soft-fails: the array seamlessly rebuilds onto a spare device and no data is lost.


Pity poor JournalSpace, who managed to total their business this week by relying purely on RAID within their main database server.


Exactly how the data was lost is not clear. The server had a RAID-1 configuration; follow the link and have a read, but I quote:


"There was no hardware failure. Both drives are operating fine; DriveSavers had no problem in making images of the drives. The data was simply gone. Overwritten."


Now RAID is a great technology for recovering from physical drive failure, and that is all it is: a mechanism to reduce the risk of data loss when a hard drive fails. It is not a solution for managing data correctly. In this instance JournalSpace must have suffered from one of the other things all good storage admins think (worry) about:



  • Sabotage

  • Server failure

  • Catastrophic array failure

  • Software bug

  • Site failure

  • User stupidity

If data is the lifeblood of your organisation then you *must* replicate it to another online copy, or at the very least back it up, and keep multiple copies in multiple locations.
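
To make the point concrete, here is a toy sketch in Python. It is not how any real RAID controller or backup product works; the `MirroredVolume` class, the `take_backup` helper and the block name are all made up for illustration. It simply shows that a mirror survives a drive failure but faithfully copies a bad write to both drives, whereas a point-in-time copy kept off the array does not.

```python
class MirroredVolume:
    """Toy RAID-1 (hypothetical): every write goes to every healthy member."""

    def __init__(self):
        # Two mirror members, each modelled as a dict of block -> data.
        self.disks = [{}, {}]

    def write(self, block, data):
        # The mirror has no idea whether a write is "good" or "bad";
        # it applies it to all surviving members.
        for disk in self.disks:
            if disk is not None:
                disk[block] = data

    def read(self, block):
        # Any surviving member can serve the read.
        for disk in self.disks:
            if disk is not None:
                return disk[block]
        raise IOError("all mirror members have failed")

    def fail_disk(self, index):
        # Simulate a physical drive failure.
        self.disks[index] = None


def take_backup(volume):
    """Hypothetical helper: point-in-time copy stored away from the array."""
    for disk in volume.disks:
        if disk is not None:
            return dict(disk)
    raise IOError("nothing left to back up")


volume = MirroredVolume()
volume.write("journal-1", "my holiday diary")

backup = take_backup(volume)           # taken before anything goes wrong

volume.fail_disk(0)                    # hardware failure: RAID copes
after_disk_failure = volume.read("journal-1")

volume.write("journal-1", "")          # bad overwrite: RAID mirrors it
after_overwrite = volume.read("journal-1")

restored = backup["journal-1"]         # only the separate copy still has it
```

In this sketch the read after the drive failure still returns the diary, the read after the overwrite returns an empty string, and only the backup taken beforehand holds the original data.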


If anyone out there is not sure they're protecting their data properly - then give me a call!

2 comments:

the storage anarchist said...

Replicating to another on-line copy won't work, unless you maintain several "snapshot" recovery points somewhere. Otherwise you'd probably just replicate the garbage that overwrote the real data, whether accidental or intentional wouldn't matter.

I've seen it happen...

No, backups or some form of CDP with snapshots are the only real answer...

Globe Treader™ - © Kiran Ghag said...

+1
"relying just on online replication" = ability to replicate user/application mistakes in near-real time