Friday, 7 September 2007

Virtualisation Update

Thanks to everyone who commented on the previous post relating to using virtualisation for DR. I'm looking forward to Barry's more contemporaneous explanation of the way SVC works.

I guess I should have said I understand Invista is stateless - but I didn't - so thanks to 'zilla for pointing it out.

So here's another issue. If SVC and USP cache the data (which I knew they did) then what happens if the virtualisation appliance fails? I'm not just thinking about a total failure, but also a partial failure or another issue which compromises the data in cache.

I was always worried that a problem with a USP virtualising solution was understanding what would happen if a failure occurred in the data path. Where is the data? What is the consistency position? A datacentre power down could be a good example. What is the data status as the equipment is powered back up?


BarryWhyte said...

Chris, I must have pre-emptively thought of and answered your question. I think the cache operation section I included as background info should answer your questions for SVC.

The key is that there are two physical boxes that each store the cache, so other than total power failures it's very unlikely for a hardware error to strike both nodes simultaneously. Total power failures are covered by battery backup and fire-hose dump as usual.

Nigel said...


As I mentioned previously, I understand that the cached data destined for external storage on the USP is destaged according to the same rules and algorithms as data destined for internal disks.

One difference is that during a power outage you have two options on the USP -
1. Keep data in cache backed up by the batteries
2. Destage data in cache to disk while running on batteries
(both slightly oversimplified but fundamentally correct).

Both protect your data.

For external storage option two is not possible because the USP cannot guarantee that the external disks are spinning etc. Data destined for external disk is always kept in duplexed cache and protected by the batteries.
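The rule described above can be sketched in a few lines of Python. This is only an illustration of the decision as stated in this comment, not Hitachi's implementation; the names `Target`, `PowerLossAction`, `power_loss_action` and the `prefer_destage` flag are all made up for the example.

```python
from enum import Enum


class Target(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


class PowerLossAction(Enum):
    HOLD_IN_CACHE = "keep data in battery-backed cache"
    DESTAGE_TO_DISK = "destage cache to disk while on batteries"


def power_loss_action(target, prefer_destage):
    """Pick how dirty cache data is protected when mains power is lost.

    prefer_destage is a hypothetical setting standing in for whichever
    of the two options the array has been configured to use.
    """
    if target is Target.EXTERNAL:
        # The external disks may not be spinning, so destaging cannot be
        # guaranteed; the data stays in duplexed, battery-backed cache.
        return PowerLossAction.HOLD_IN_CACHE
    if prefer_destage:
        return PowerLossAction.DESTAGE_TO_DISK
    return PowerLossAction.HOLD_IN_CACHE


print(power_loss_action(Target.INTERNAL, prefer_destage=True).value)
print(power_loss_action(Target.EXTERNAL, prefer_destage=True).value)
```

The point of the sketch is that the external case ignores the preference entirely: whatever is configured for internal disks, externalised data depends on the batteries.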

So as far as what happens during a failure, I do not really see that much difference between data destined for internal or external disk. If I was a data centre manager I would not lie awake at night worrying about my data externalised behind my USPs.

the storage anarchist said...

I think that there must be some definite risk of data loss or corruption with any virtualization approach that terminates and regenerates the I/O (as do both SVC and UVM).

In fact, I think Hu Yoshida himself hints at the exposure at the end of this recent blog post.

"I do have a concern about striping a LUN across separate boxes, since each box may have different performance characteristics and, more importantly, different error recovery characteristics. I recommend that virtual LUNs be created in the same storage box and not spread across heterogeneous storage boxes."

If "different error recovery characteristics" across different external arrays put a striped LUN at risk of data loss or corruption, then this implies that the USP/USPV cannot handle all possible error conditions for the external arrays. Scary.

Now, I've heard that Hitachi has difficulty handling failure modes that hosts handle elegantly (loss of one half of a dual-controller array, as an example). And limiting the failure protection to battery-backed-up cache is short-sighted - Hitachi implemented destage to disk specifically in response to all the data that was lost/corrupted in NYC when the lights went out longer than the batteries lasted. If you can't use that with external storage, you're just playing Russian roulette with your Tier2-3-4 data.

So although we may not yet know WHAT can go wrong, the experts at Hitachi seem to know that indeed they CAN...

Are you feeling lucky?

BTW- Barry Whyte is perhaps just a bit optimistic about simultaneous failures...they happen, and more importantly, there's NOTHING either SVC or UVM can do to even DETECT data error or corruption, much less fix it...the whole notion works on the ASSUMPTION that the end array is infallible.

Chris M Evans said...

BarryB, I have to say I've always been slightly uncomfortable with the concept of two arrays, each with cache (and potentially differing methods of securing data during a powerdown) and how each would cope with power loss. I certainly wouldn't be happy creating virtual LUNs with storage taken from more than one array (even less so when they are different vendors) as I can't imagine what state my data is in. Touching on Nigel's comment, at least with internal disks (simplifying here) if an I/O is written to the disk there are no other components in the way spoofing me to believe the I/O was successful.

In these instances, regardless of the technology, if data is that important, I'd prescribe a second or third online copy (which could be cheap disk) remote from the primary copy anyway.

the storage anarchist said...

I couldn't agree more.

In fact, I think that's pretty much the only practical use case for UVM - to move copies of LUNs to external storage (which can also be accomplished using Open Replicator on Symmetrix).

Second is using UVM in cache-bypass mode, where you can get max performance and avoid most of the data integrity risks of two arrays.

But from a practical viewpoint, using UVM in full cache mode (which is required if you want to use local or remote replication, and presumably thin provisioning as well) - well, I think you're taking some pretty big risks, given that there's no end-to-end data integrity checking possible.
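To make "end-to-end data integrity checking" concrete, here is a minimal Python sketch of the general idea: the writer computes a checksum when the block is created, and the reader recomputes it after the block has passed through every layer in between. It uses a plain CRC-32 from the standard library and is not a description of how any particular array or virtualisation product works; `protect` and `verify` are hypothetical helper names.

```python
import zlib


def protect(block):
    """Return the block together with a CRC-32 computed at the source."""
    return block, zlib.crc32(block)


def verify(block, crc):
    """Recompute the CRC-32 at the destination and compare."""
    return zlib.crc32(block) == crc


payload, crc = protect(b"customer record 0001")

# An intact block passes the check.
print(verify(payload, crc))

# A block altered somewhere along the data path fails it.
corrupted = b"X" + payload[1:]
print(verify(corrupted, crc))
```

Without a check of this kind carried from the host all the way to the final disk and back, an intermediate layer has no independent way to tell that the block it was handed is not the block that was written.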

BarryWhyte said...

With SVC we generally recommend that you only stripe across mdisks with the same characteristics, that is, the same performance and reliability. So there is no problem striping across multiple controllers that are the same, i.e. several DS4800 boxes with the same RAID type. By its very nature, a stripe-set will perform only as well as its worst component.
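A rough way to see why the worst component sets the pace is the toy model below. It assumes, purely for illustration, that I/O is spread evenly across the members and that every member must finish its share, so the set is paced by the slowest one; the function name and the MB/s figures are invented and are not SVC measurements.

```python
def stripe_throughput_mb_s(member_mb_s):
    """Estimate stripe-set throughput under an even-spread assumption.

    Each member serves an equal share of every large transfer, so the
    whole set can go no faster than (number of members) x (slowest member).
    """
    if not member_mb_s:
        raise ValueError("a stripe set needs at least one member")
    return len(member_mb_s) * min(member_mb_s)


matched = [400, 400, 400]      # three identical mdisks
mismatched = [400, 400, 100]   # one slow mdisk in the set

print(stripe_throughput_mb_s(matched))
print(stripe_throughput_mb_s(mismatched))
```

Under this model the mismatched set is no faster than three of the slow mdisks would be, which is the reasoning behind keeping the members of a stripe set alike.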

When I talk about hardware failures, I mean a memory, CPU, planar or other such failure. SVC uses ECC chip-kill DDR, so there is some protection there. However, in the event of any detected hardware error a node will dump the cache internally and stop. Remember any write data is mirrored, so it's exactly the same as any other controller with a mirrored write cache, but at least they are in physically different boxes with SVC. Yes, we don't have end-to-end CRC checking, but you only get subsystem end-to-end on high-end controllers today - or mainframe.
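As a simplified picture of a mirrored write cache held in two separate boxes, consider the Python sketch below. It is a toy model under stated assumptions, not SVC code: the `Node` and `MirroredWriteCache` classes are invented, and a failed node is simply treated as unavailable rather than modelling the internal cache dump.

```python
class Node:
    """One physical box holding a copy of the write cache."""

    def __init__(self, name):
        self.name = name
        self.cache = {}
        self.online = True


class MirroredWriteCache:
    """Keeps every cached write on both nodes of a pair."""

    def __init__(self, node_a, node_b):
        self.nodes = (node_a, node_b)

    def write(self, lba, data):
        """Store the write on every online node; return the copy count."""
        online = [n for n in self.nodes if n.online]
        if not online:
            raise RuntimeError("no node available to accept the write")
        for node in online:
            node.cache[lba] = data
        return len(online)

    def fail(self, node):
        """Mark a node as stopped after a detected hardware error."""
        node.online = False

    def surviving_copies(self, lba):
        """Names of the online nodes that still hold this write."""
        return [n.name for n in self.nodes if n.online and lba in n.cache]


node1, node2 = Node("node1"), Node("node2")
pair = MirroredWriteCache(node1, node2)

pair.write(lba=42, data=b"dirty block")
pair.fail(node1)
print(pair.surviving_copies(42))
```

A single-node hardware failure leaves one copy of the write on the partner; only a failure that takes out both boxes at once, such as a total power loss, threatens the cached data in this model.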

It should be noted that we do also store the virtualization map out on virtualized storage too - in several places on our chosen 'quorum' disk extents. So even if you have a major problem, like the rack your nodes are in catches fire, then you can recover the cluster - any data in cache at this point in time could be lost - however this is no different from fire attacking a storage controller.