Thursday 10 April 2008

When Storage Planning Goes Bad

I was chatting with colleagues today, and we were reflecting on an installation that had just completed and already needed an additional storage tranche installed. Ironically, the initial disk installation on the new array hadn't been fully implemented because the vendor "forgot" to install the full quota of cache in the array. Although this was a simple gotcha, it reminded me of others I've encountered along the way in my career, including:

  • An engineer was testing the Halon system in the newly completed computer room extension at my first site. Unfortunately, he'd forgotten to turn the key to "test" before pressing the fire button and let off the Halon across the whole datacentre, with the equipment up and running and operators in the room mounting tapes. Needless to say, they were out of there like a rat up a drainpipe!
  • During a recent delivery of storage arrays, one array literally fell off the back of the lorry. It had to be shipped back for repair...
  • An array installation I managed at one site was mis-cabled by both the electricians and the vendor. When it was powered up, it exploded...
  • On a delivery of equipment, the vendor arrived at the loading bay at the datacentre. As the loading bay door was opened, it jammed and broke, leaving it just too low for the arrays being delivered to be pushed underneath. The vendor had to return the following day after the broken door had been repaired.
  • A tape drive upgrade on a StorageTek library I worked on took 12 hours and around six staff to complete. Halfway through the upgrade, we reached a go/no-go point and checked both the MVS and VM installations to ensure the new drives worked. The MVS-connected drives were fine; the VM drives had a "minor problem", so we proceeded in anticipation of resolving the VM issue. The following day we discovered the VM problem was not correctable and had to purchase additional drives at considerable cost.
  • After loaning out some disk space to a "temporary" project, we had a hardware failure three months later. It turned out the team had forgotten to ask for backups of their data, and three months of work by a dozen people was lost.


Fortunately, most of the above were not life threatening (except the first, which I was not involved in directly). However, one of these problems did result in data loss (albeit in a development environment). It shows how often the unexpected and unplanned can happen and mess up the best-laid plans.


Care to share any of your stories?
