Thursday 1 June 2006

Planning for SAN resilience

Any storage design has to consider resilience. All infrastructure components are subject to failure; even five 9's of availability (99.999%) means an outage of just over 5 minutes per year. How do we plan for that?
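To make the arithmetic behind that figure concrete, here is a minimal sketch that converts a number of 9's into expected downtime per year. The function name and the 365-day year are my own assumptions for illustration, not anything a vendor publishes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def annual_downtime_minutes(nines: int) -> float:
    """Expected downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. five 9's -> 0.00001
    return MINUTES_PER_YEAR * unavailability


# Compare three, four and five 9's side by side.
for n in (3, 4, 5):
    print(f"{n} nines: {annual_downtime_minutes(n):.2f} minutes per year")
```

Five 9's works out at about 5.26 minutes per year; three 9's is about 525.6 minutes, or nearly 9 hours.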

Multipathing

This is a simple one: two or more entirely separate fabrics connecting hosts to storage. If one fabric fails, the other can take over. This design consideration is not just for recovery; it also assists in maintenance, as one fabric can be upgraded whilst the other maintains operation. Multipathing is of course expensive, since it means doubling up on all equipment. But it does reduce the risk of a complete outage to an almost negligible level.
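As a rough illustration of why a second fabric helps so much, the sketch below computes the combined availability of redundant fabrics. It assumes the fabrics fail independently, which is optimistic: shared power, a common firmware bug or operator error can take out both. The function name and the 0.999 input are hypothetical.

```python
def parallel_availability(per_fabric: float, fabrics: int = 2) -> float:
    """Availability when any one of `fabrics` independent fabrics is enough."""
    # The service is down only if every fabric is down at the same time.
    return 1.0 - (1.0 - per_fabric) ** fabrics


single = 0.999  # hypothetical three 9's fabric
dual = parallel_availability(single)

print(f"single fabric: {single:.6f}")
print(f"dual fabric:   {dual:.6f}")
```

Under that independence assumption, two three 9's fabrics give roughly six 9's between them, which is where the "almost negligible" comes from.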

Director Class versus Switch

As mentioned, director class switches offer at least five 9's availability. Departmental switches, on the other hand, offer more like three 9's (nearly 9 hours of downtime per year), which makes them considerably less resilient pieces of equipment. So, for a resilient SAN architecture, don't put departmental switches into the infrastructure at points of criticality.
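To see why one departmental switch at a critical point matters, consider a path that has to traverse several components in series. A minimal sketch, assuming independent failures and using the five 9's and three 9's figures above as illustrative inputs (the function name is my own):

```python
from math import prod


def path_availability(components: list[float]) -> float:
    """Availability of a path that needs every component in it to be up."""
    return prod(components)


director = 0.99999     # five 9's
departmental = 0.999   # three 9's

print(f"director only:           {path_availability([director]):.8f}")
print(f"director + departmental: {path_availability([director, departmental]):.8f}")
```

The weakest component dominates: putting a three 9's switch in the path drags the whole path below three 9's, however good the director is.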

Component Failure

Director class five 9's refers to the failure of an entire switch. It doesn't refer to the resilience of an individual component. So, plan to spread risk across multiple components. That may mean separate switches, or it may mean separate blades within a switch. Hardware capacity growth means blades have moved from 4-port (e.g. McDATA) to 32- and 48-port blades (Cisco), concentrating the risk back into a single blade. So, spread ISLs across blades, spread clustered servers across switches and so on.
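Spreading ISLs across blades can be sketched as a simple round-robin assignment, followed by a check on how much you would lose if any one blade failed. The ISL and blade names here are hypothetical, and a real layout would also have to respect port speeds and zoning.

```python
from collections import Counter


def spread_round_robin(items: list[str], blades: list[str]) -> dict[str, str]:
    """Assign each item to the next blade in turn, wrapping around."""
    return {item: blades[i % len(blades)] for i, item in enumerate(items)}


def worst_case_loss(assignment: dict[str, str]) -> float:
    """Largest fraction of items lost if any single blade fails."""
    per_blade = Counter(assignment.values())
    return max(per_blade.values()) / len(assignment)


isls = [f"isl{i}" for i in range(8)]               # hypothetical ISL names
blades = ["blade1", "blade2", "blade3", "blade4"]  # hypothetical blade names

spread = spread_round_robin(isls, blades)
concentrated = {isl: "blade1" for isl in isls}  # everything on one big blade

print(worst_case_loss(spread))        # 0.25
print(worst_case_loss(concentrated))  # 1.0
```

With eight ISLs over four blades, a single blade failure costs a quarter of the inter-switch links; with all eight on one high-density blade, it costs the lot.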

In summary, look at the failure points of your hardware. Where they can't be remedied with redundant components, plan to spread the risk across multiple components instead. If you can afford it then duplicate the equipment with multiple fabrics, HBAs and storage ports.
