Thursday 3 January 2008

Two for the price of one

The holidays are over and it's back to work for me. In fact I returned yesterday; the break was good, but it's also good to be back.

It seems that I've returned to a flurry of acquisitions. Yesterday there was the heavily reported (in the blogosphere) purchase of XIV by IBM. Tony Pearson gives a summary of the features in his post. One thing that interests me is the use of distributed writes across an entire array, created by carving (presumably) LUNs and filesystems into 1MB blocks. If a drive fails, the data is still available on other disks in the system, spread across a great number of drives rather than held on a single drive (RAID-1) or a potentially small number of drives (RAID-5/6).
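To make the idea concrete, here's a rough sketch in Python of how such distributed mirroring might lay data out, assuming pseudo-random placement of each 1MB chunk and its mirror on two different drives; the actual XIV placement algorithm hasn't been published, so the details (and the 180-drive figure) are purely my own illustration.

import random

# Illustrative only: place each logical 1MB chunk and its mirror on two
# different drives chosen pseudo-randomly across the whole array.
def place_chunks(num_chunks, num_drives, seed=0):
    rng = random.Random(seed)
    layout = []
    for _ in range(num_chunks):
        primary = rng.randrange(num_drives)
        mirror = rng.randrange(num_drives - 1)
        if mirror >= primary:
            mirror += 1  # skip the primary so the two copies sit on different drives
        layout.append((primary, mirror))
    return layout

# With enough chunks the load spreads almost evenly across every drive,
# which is what gives the wide distribution described above.
layout = place_chunks(100_000, 180)
chunks_per_drive = [0] * 180
for primary, mirror in layout:
    chunks_per_drive[primary] += 1
    chunks_per_drive[mirror] += 1
print(min(chunks_per_drive), max(chunks_per_drive))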

I've been trying to get my head around what this means. On the one hand it sounds like a real problem, as a double drive failure could impact a large number of hosts; it all depends on how well the 1MB chunks are distributed. However, maybe it isn't that much of a problem, as the issue only arises when both copies of a 1MB block happen to sit on the failing drives. I would expect that as the number of physical drives increases, the impact of a double failure reduces, as does the number of 1MB blocks affected. In addition, a drive may fail in only one area rather than across the whole device, so the number of affected blocks could be quite small; the remainder could be perfectly readable and quickly moved elsewhere. No doubt Moshe and the team have done the maths to understand the risk, compared it to that of standard arrays, and wouldn't be selling the product if it were not inherently safer.
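To put some rough numbers on that, here's a back-of-the-envelope estimate (my own, not IBM's figures): if the mirrors of a failed drive's chunks are spread evenly across the remaining drives, a second failure before the rebuild completes only loses the chunks whose mirror happened to land on that second drive, roughly 1/(N-1) of the first drive's data.

import random

# Analytic estimate: fraction of the first failed drive's chunks whose mirror
# sits on the second failed drive, assuming an even pseudo-random spread.
def expected_loss_fraction(num_drives):
    return 1.0 / (num_drives - 1)

# Quick Monte Carlo check of the same figure.
def simulated_loss_fraction(num_drives, chunks_per_drive=1000, trials=200, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Mirrors of the failed drive's chunks, spread over the other drives.
        mirrors = [rng.randrange(num_drives - 1) for _ in range(chunks_per_drive)]
        second_failure = rng.randrange(num_drives - 1)
        lost = sum(1 for m in mirrors if m == second_failure)
        total += lost / chunks_per_drive
    return total / trials

for drives in (16, 60, 180):
    print(drives, round(expected_loss_fraction(drives), 4),
          round(simulated_loss_fraction(drives), 4))

So with a large array the exposure from a second failure shrinks to a fraction of a percent of one drive's data, which is consistent with the argument above (and, of course, assumes the even spread being claimed).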

The only other issue I can see is what market the product will slot into; Tony mentions that the product is not aimed at structured data (although I guess it supports it) but was designed for unstructured data such as large binary file types. So why use RAID-1 rather than, say, a 14+2 RAID-6 configuration, which would be much cheaper in terms of disk cost? Presumably another selling point is performance, but I would expect the target data profile (medical data, large binary objects) to be more sequential than random and not greatly impacted by using SATA.
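For the disk-cost point, the raw-versus-usable arithmetic is simple enough (my own illustration, ignoring spare capacity and metadata overhead):

# Raw capacity needed for a given usable capacity under each scheme.
def raw_needed(usable_tb, data_units, total_units):
    return usable_tb * total_units / data_units

usable = 100  # TB usable, purely illustrative
print("RAID-1 (mirroring):", raw_needed(usable, 1, 2), "TB raw")        # 200 TB
print("RAID-6 (14+2):", round(raw_needed(usable, 14, 16), 1), "TB raw")  # ~114.3 TB

In other words, mirroring needs roughly 75% more raw disk than a 14+2 RAID-6 layout for the same usable capacity, which is why the performance and rebuild arguments have to carry the case.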

I guess only time will tell. I look forward to seeing how things go.

The other purchase announced today was that of Onaro by NetApp. Onaro sells SANScreen, a tool that collects and analyses Fibre Channel SAN configuration data and highlights configuration issues. Whilst I think it's a good product, I don't see the fit with NetApp's business in the NAS market (in fact I'm sure SANScreen doesn't currently support NAS), so where's the benefit here, other than buying up a company which is making money or close to it?

I wonder who will be bought tomorrow?

1 comment:

BarryWhyte said...

Chris,

The beauty of the distributed nature of the 'chunks', and hence the fast rebuild times (15 mins), means that as soon as one mirrored copy has been lost (a single drive failure), a new second copy is made very soon afterwards. Most dual drive failures occur because of the extra stress put on drives by the rebuild operation; when this is targeted at the same disks that are already in a degraded state, it's a recipe for disaster. The benefits are twofold due to the distributed nature:

a) you don't put as much stress on any one set of drives (so are less likely to cause the second failure)

b) as you suggested, with a large enough distribution of drives the chances of a second drive failure causing both mirrored copies to be lost are lower.

I'm looking at the lower level architecture and still trying to get my head round some of it too!