Monday 19 January 2009

Enterprise Computing: Migrating Petabyte Arrays

Background

The physical capacity of storage arrays continues to grow at an enormous rate, year on year. Using EMC as a benchmark, we can see how far a single array has grown over the years:

  • Symmetrix 3430 - 96 drives, 0.84TB
  • Symmetrix 5500 - 128 drives, 1.1TB
  • Symmetrix 8830 - 384 drives, 69.5TB
  • DMX3000 - 576 drives, 76.5TB
  • DMX-4 - 1920 drives, 1054TB

Note: these figures are indicative only!

DMX-3 and DMX-4 introduced arrays which scale to petabytes (1000TB) of available raw capacity. At some point these petabyte arrays will need to be replaced, and that will present a unique challenge to today's storage managers. Here's why.


Doing The Maths

From my experience, storage migrations from array to array can be complex and time-consuming. Issues include:


  • Identifying all hosts for migration
  • Identifying all owners for storage
  • Negotiating migration windows
  • Gap analysis on driver, firmware, O/S, patch levels
  • Change Control
  • Migration Planning
  • Migration Execution
  • Cleanup

With all of the above work to do, it's not surprising that, realistically, around 10 servers per week is a good estimate of the capability of a single FTE (Full Time Equivalent, e.g. a storage guy). Some organisations may find this figure can be pushed higher, but I'm talking about one person, day in day out, performing this work, so I'll stick with my 10/week figure.

Assume an array has 250 hosts, each with an average of 500GB; this equates to about 125TB of data and almost six months' effort for our single FTE! In addition, the weekly migration schedule requires moving an average of 5TB of data. If the target array differs from the source (e.g. a new vendor, different LUN size) then the migration task can be time-consuming and complex to execute.
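
To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch in Python. All of the inputs are the assumptions above (250 hosts, 500GB average, 10 hosts per week per FTE), not measured figures:

    # Back-of-the-envelope migration effort estimate.
    # All inputs are the assumptions from the text above.
    hosts = 250                  # hosts attached to the array
    avg_gb_per_host = 500        # average storage per host (GB)
    hosts_per_week_per_fte = 10  # realistic single-FTE migration rate

    total_tb = hosts * avg_gb_per_host / 1000  # ~125 TB
    weeks = hosts / hosts_per_week_per_fte     # 25 weeks (~6 months)
    tb_per_week = total_tb / weeks             # ~5 TB moved per week

    print(f"Total data:     {total_tb:.0f} TB")
    print(f"Elapsed effort: {weeks:.0f} weeks (~{weeks / 4.33:.0f} months)")
    print(f"Weekly rate:    {tb_per_week:.1f} TB/week")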




Look at the following diagram. It shows the lifecycle of physical storage in an array over time. Initially the array is deployed and storage configured. Over the lifetime of the array, more storage is added and presented to hosts until the array reaches either its maximum physical capacity or an acceptable capacity threshold. It stays there until migrations to another array begin. Up to the point migrations take place, storage is added and paid for as required; however, once migrations start, there is no refund from the vendor for the unused resources (those represented in green). They have been purchased but remain unused until the entire array is decommissioned.

If the decommissioning process is lengthy, the amount of unused resource becomes high, especially on petabyte arrays. Imagine a typical 4-year lifecycle; up to a year could be spent moving hosts to new arrays - at significant cost in terms of manpower and impact to the business.
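
As a rough illustration of that stranded capacity (the lifecycle figures here are assumptions for the sake of the example, not vendor or site data), the idle capacity can be sketched like this:

    # Rough sketch of capacity left idle while a petabyte array is drained.
    # All figures are illustrative assumptions, not vendor or site data.
    array_capacity_tb = 1000  # a petabyte-class array
    lifecycle_years = 4       # typical refresh cycle
    migration_years = 1       # time spent migrating hosts off the array

    # Assume hosts drain off roughly linearly over the migration window,
    # so on average half the capacity sits purchased-but-unused for that year.
    stranded_tb_years = array_capacity_tb * migration_years / 2
    total_tb_years = array_capacity_tb * lifecycle_years

    print(f"Stranded capacity-time: {stranded_tb_years:.0f} TB-years")
    print(f"Share of array lifetime: {stranded_tb_years / total_tb_years:.1%}")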

Solutions

So how should we adapt our migration processes to cope with these monster arrays?

  • Establish Standards.  This is an age-old issue but one that comes up time and time again.  Get your standards right.  These include consistent LUN sizes, naming standards and support matrix (compatibility) standards (see the sketch after this list).
  • Consider Virtualisation. Products including SVC, USP, InVista (EMC) and iNSP (Incipient) all allow the storage layer to be virtualised.  This can assist in the migration process.
  • Keep Accurate Records.  This may seem a bit obvious, but it is amazing how many sites don't know how to contact the owners of some of the servers connected to their storage.
  • Talk to Your Customers.  Migrations inevitably result in server changes and potentially an outage.  Knowing your customer and keeping them in the loop regarding change planning saves a significant amount of hassle.
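
As a trivial example of the kind of check that consistent standards make possible, here is a hypothetical Python sketch that validates an inventory of LUN records against a naming pattern and a menu of standard sizes. The pattern, sizes and inventory are made up for illustration only; they are not taken from any real standard or tool:

    import re

    # Hypothetical standard: LUNs named ARRAY-POOL-NNNN, sized from a fixed menu.
    LUN_NAME_PATTERN = re.compile(r"^[A-Z]{3,8}-[A-Z0-9]{2,6}-\d{4}$")
    STANDARD_SIZES_GB = {50, 100, 250, 500}

    def check_lun(name, size_gb):
        """Return a list of standards violations for a single LUN record."""
        problems = []
        if not LUN_NAME_PATTERN.match(name):
            problems.append(f"{name}: does not match the naming standard")
        if size_gb not in STANDARD_SIZES_GB:
            problems.append(f"{name}: non-standard size {size_gb}GB")
        return problems

    # Example inventory - the exceptions are what slow a migration down.
    inventory = [("PRDDMX-DB01-0001", 500), ("legacy_lun_7", 137)]
    for name, size in inventory:
        for problem in check_lun(name, size):
            print(problem)
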
Technology replacement is now part of standard operational work.  Replacing hardware is not all about technology; procedures and common sense will form an increasingly important part of the process.

1 comment:

Chris "Saundo" Saunderson said...

I've got another for you:

1) Do currency!

Don't allow your volume managers, filesystems, operating systems and HBAs/drivers to get too far out of date: support matrices are your friend when deciding what to replace the PB array with, and when.

Working in a site that has 5+PB of disk, the biggest problem we run into is the lack of currency/patching of OSes/volume managers. Sneaky stuff like the number of PVLinks in a disk group can be worked around, but getting into the quagmire of upgrading volume managers kills any timeline, especially when conflicting support matrices end up forcing OS upgrades along with the VM or HBA upgrades!