Thursday, 21 June 2007

Hardware Replacement Lifecycle

How do you replace your hardware? I've always thought a sound technology replacement lifecycle is essential. Brand new arrays are great. You can't wait to get them in and start using the hardware, and before you know it they're starting to fill up. If you didn't order a fully populated array (and if we are buying DMX-3, then that's almost a certainty) then you might extend it a few times with more disks, cache and ports.

But then the end of the free maintenance period starts to loom. Nobody wants to think about the hassle of getting that old array out. Maintenance charges kick in and before you know it, you're paying for hardware you want to replace.

In some respects it's a bit like using a credit card. You tell yourself you absolutely won't use it for credit. You'll pay the balance off each month, but then something comes along that you just must have (like that shiny new plasma screen) and you think, "I can pay that off over a couple of months, no sweat". But then you're hooked and paying interest you never wanted to pay.

Although it's hard, it is essential to establish a hardware replacement strategy. There are lots of good reasons for shipping old hardware out:

  • you will have to pay maintenance
  • there may be compatibility issues
  • parts may become difficult to locate (e.g. small disk sizes)
  • the vendor may end support
  • power/cooling/density pressures may increase
  • no new features will be developed for non-strategic hardware

So here's my plan: agree a replacement schedule and stick to it. Firstly, you keep your risk exposure to a minimum; secondly, you'll have the vendors knocking at your door, as they will know you are in the market for new equipment. Users of the array can also be notified of the lifecycle and made aware of the refresh cycle ahead of time. Imagine a 4-year hardware lifetime. An example timeline could be (a rough sketch of the phase logic follows the list):

  1. Year 0 to end of Year 3 - GREEN: the array is a target for new allocations and receives maintenance upgrades on a regular cycle (say every 3-6 months, depending on release schedules).
  2. Year 4, first half - AMBER: the array is still a target for allocating storage to hosts already attached to it. No new disks, ports or cache will be added, and no new hosts will be connected. Maintenance continues on the standard cycle.
  3. Year 4, second half - RED: the array is a target for active migrations. Servers requiring new storage are forced to move off the array in order to receive it, and all servers are scheduled to move to a new array within the six months. No maintenance other than emergency fixes is to be applied.
  4. End of Year 4 - BLUE: the array is empty. Maintenance is terminated. The array is decommissioned and shipped out.
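To make the phase boundaries concrete, here's a minimal sketch of the check you could run against each array's age. It's purely illustrative: the function and constant names are my own, and it assumes "Year 4" means the fourth year of ownership, i.e. months 36 to 48 after deployment.

```python
# Illustrative sketch only: maps an array's age to the lifecycle phase
# described above. The 4-year term and all names here are assumptions,
# not part of any vendor tool or product.
from datetime import date

LIFETIME_YEARS = 4  # assumed replacement term

def lifecycle_phase(deployed: date, today: date) -> str:
    """Return GREEN/AMBER/RED/BLUE for an array deployed on `deployed`."""
    age_months = (today.year - deployed.year) * 12 + (today.month - deployed.month)
    if age_months < (LIFETIME_YEARS - 1) * 12:          # Year 0 to end of Year 3
        return "GREEN"   # target for new allocations, regular maintenance
    if age_months < (LIFETIME_YEARS - 1) * 12 + 6:      # Year 4, first half
        return "AMBER"   # existing hosts only; no new hardware or hosts
    if age_months < LIFETIME_YEARS * 12:                # Year 4, second half
        return "RED"     # active migrations off the array; emergency fixes only
    return "BLUE"        # end of term: empty, maintenance terminated, ship it out

# Example: an array deployed in June 2003 would be BLUE by June 2007.
print(lifecycle_phase(date(2003, 6, 1), date(2007, 6, 21)))
```

Run against an inventory list, something like this is enough to flag which arrays should stop taking new hosts and which should already be on the migration schedule.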

There's an obvious flaw in this policy - new hardware has to be on the floor during the last 6-12 months of the term to make it work. But that has to happen anyway: either you get the new array in early, or you pay maintenance on the old hardware while you deploy a new array and migrate to it.

It may be that vendor hardware releases dictate that the 4-year term is shortened or lengthened to ensure the latest technology can be put on the floor. That could be an issue depending on the maintenance terms for existing equipment; however, sorting that out is just a case of getting the right commercial arrangement.

What do you do?

2 comments:

Marc Farley said...

The storage product lifecycle is one of the key rationales for having a virtualization strategy. Virtualizing storage targets, coupled with automated data migration from old to new storage, can take a lot of pressure off the transition process. There may be opportunities to use less expensive or re-purposed storage during year four so end users don't have to deal with restrictions on their projects. All this depends on the capabilities of the virtualization system, of course.

In an EqualLogic environment (I work for them) - subsystems can be upgraded with higher-capacity drives during the product lifecycle. This happens without any downtime using the virtual paging capability of the systems (pages are moved from your production storage system to a temporary system and then moved back to the upgraded system). Considering that drives are the component most likely to fail, it's possible to get longer life from a subsystem that way. It doesn't keep you from having end-of-life scenarios, but it makes them less frequent. There's a lot more to the page-based architecture that I won't go into now, but the long and short of it is that end of life is much more graceful.

Chris M Evans said...

Marc, thanks for the post, I'd be interested in learning more about your products and the page-based architecture.