Monday, 7 January 2008

XiV Part II

Following on from BarryW's comment to my XiV post, I've been thinking over how the XiV architecture works. When a disk fails and the lost mirrors need to be recreated, the data is likely to exist across all or most of the configured drives. Logically it would make sense for the new copies to be targeted at all drives, so as a drive fails, all drives could be copying to all drives in an attempt to ensure the recreated mirrors are well distributed across the subsystem. If this is true, all drives would become busy with rebuild reads and writes for the duration of the rebuild, rather than the rebuild overhead being isolated to a single RAID group. Whilst that seems like a good thing for rebuild time, it seems like a bad thing for performance. Perhaps this isn't the case and the failed device is instead re-created on a spare drive by copying all of its mirrors back in from their other locations.
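To make that concrete, here's a rough Python sketch of how a many-to-many rebuild could pick its targets. Everything in it is my own guesswork for illustration (the chunk map, the random target choice); it isn't anything IBM has published about how XiV actually does it.

import random

# Hypothetical model: every logical chunk has two copies, each on a different drive.
# chunk_map maps a chunk id to the pair of drive ids holding its copies.
NUM_DRIVES = 200
drives = set(range(NUM_DRIVES))

def rebuild_plan(chunk_map, failed_drive):
    """For each chunk that lost a copy on failed_drive, pick a new target
    drive at random, excluding the failed drive and the drive still holding
    the surviving copy. Both the reads (from the scattered survivors) and
    the writes (to the random targets) end up spread across the whole array."""
    plan = []
    for chunk, (a, b) in chunk_map.items():
        if failed_drive not in (a, b):
            continue
        survivor = b if a == failed_drive else a
        target = random.choice(list(drives - {failed_drive, survivor}))
        plan.append((chunk, survivor, target))
    return plan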

Following the same line, in order to recreate a failed drive and rebuild the lost data across the array, each drive must have spare capacity; in, say, a 200-drive system, that would mean keeping about 1/200th of every drive free at any one time (the equivalent of one whole drive spread across the array), ready to receive rebuilt mirrors. Obviously the alternative option is just having spare drives, but that sounds less interesting!
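The arithmetic behind that is trivial, but worth writing down; the drive size below is just an assumed figure, not an XiV specification.

# Illustrative only: how much of each drive must be reserved so the array can
# absorb a drive failure without dedicated spare drives.
num_drives = 200
drive_capacity_gb = 1000          # assumed drive size
failures_to_tolerate = 1

reserve_fraction = failures_to_tolerate / num_drives      # 0.005, i.e. 0.5%
reserve_per_drive_gb = drive_capacity_gb * reserve_fraction

print("Reserve %.2f%% of every drive (%.0f GB each) = %d whole drive(s) "
      "of distributed spare space"
      % (reserve_fraction * 100, reserve_per_drive_gb, failures_to_tolerate))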

What about when the failed disk is replaced? There must be another algorithm which ensures the replaced disk is not a target for all new writes, so presumably, static mirrors are pro-actively moved onto the replaced device.
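If that's right, a rebalance pass for a replaced drive might look something like the sketch below: migrate existing copies onto the new drive until its load matches the array average, rather than letting it soak up every new write. Again, the structures and threshold are invented for illustration.

def rebalance_onto(new_drive, drive_load, chunk_map):
    """Hypothetical rebalance pass: move existing mirror copies onto a
    freshly replaced drive until its chunk count roughly matches the array
    average. drive_load maps drive id -> number of chunks held;
    chunk_map maps chunk id -> the pair of drives holding its copies."""
    target_load = sum(drive_load.values()) // len(drive_load)
    moves = []
    for chunk, (a, b) in chunk_map.items():
        if drive_load[new_drive] >= target_load:
            break
        if new_drive in (a, b):
            continue                  # already holds a copy of this chunk
        donor = a if drive_load[a] >= drive_load[b] else b
        moves.append((chunk, donor, new_drive))
        drive_load[donor] -= 1
        drive_load[new_drive] += 1
    return moves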

This architecture throws up some interesting questions, especially when trying to understand performance. I am starting to get excited about messing about with one!

7 comments:

Unknown said...

It gets even more interesting when you consider what happens when you take a snap of a virtual volume.

Chris M Evans said...

I guess it depends on whether it is a "thin" snapshot or not. If it is, then creation is simple pointer copying; if not, then I suppose the whole array will be busy copying data. I wonder if there is a way to segment the distribution of blocks into zones to divide up the workload into tiers?
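By "simple pointer copying" I mean something like the sketch below: the snapshot starts life as a copy of the parent's pointer table, so no data moves at creation time and space is only consumed as blocks diverge. A deliberately simplified model, not a description of XiV's actual implementation.

class Volume:
    def __init__(self, block_ptrs):
        # logical block number -> physical location of the current data
        self.block_ptrs = dict(block_ptrs)

    def snapshot(self):
        # A "thin" snapshot: creation is just duplicating the pointer table.
        return Volume(self.block_ptrs)

    def write(self, logical_block, new_physical_location):
        # Redirect-on-write: the live volume points at the freshly written data,
        # while any snapshot keeps pointing at the old physical location.
        self.block_ptrs[logical_block] = new_physical_location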

Unknown said...

We also still have the old problem of upgrade-balancing.

If you have X drives, but they're all full, do you have to add drives in a large batch to keep the write load spread out? It won't be very virtual if my data is grouped to physical devices by age.

Can this thing intelligently re-balance the data so that IO uses the whole virtual system?

Unknown said...

Hi Chris, I've tried to give an answer in this post:
http://www.ibm.com/developerworks/blogs/page/InsideSystemStorage?entry=spreading_out_the_re_replication

BarryWhyte said...

In answer to Carl's question, yes, as new storage is added the mirror sets will be intelligently re-balanced.

Ron Major said...

This is the same concept as the HP EVA. The EVA can level the IO performance by moving hot chunklets to less busy spindles. Chunklets are also spread to new disks when capacity is added to the array. Sparing is handled by reserving extra capacity equivalent to one or two drives across all of the disks. This eliminates idle spindles since all disks participate in the workload. RAID types are defined at the logical volume level rather than with disk groups.

It’s a great concept, but the EVA never became very popular. What is so remarkable about the XiV that has people so excited?

Chris M Evans said...

Ron, I think Nextra is generating more interest purely because of the involvement of Moshe Yanai in the company. Had he not been involved and the product come to market with storage "unknowns", then I'm sure it would have been a different story.