Wednesday 29 October 2008

Understanding EVA - revisited

Thanks to all those who posted in response to Understanding EVA earlier this week, especially Cleanur, who added a lot of detail. Based on that additional information, I'd summarise again:



  • EVA disks are placed in disk groups - the usual recommendation is a single group unless there's a compelling reason not to (such as mixing disk types, e.g. FC/FATA).

  • Disk groups are logically divided into Redundancy Storage Sets (RSSs), which range from 6 to 11 disks in size depending on the number of disks in the group, but are ideally 8 drives.

  • Virtual LUNs are created across all disks in a group; however, to minimise the risk of data loss from disk failure, equal slices of each LUN (called PSEGs, each 2MB in size) are created in each RSS, with additional parity so that data can be recreated within the RSS if a disk failure occurs.

  • In the event of a drive failure, data is moved dynamically/automagically to spare space reserved on each remaining disk.

I've created a new diagram to show this relationship. The vRAID1 devices are pretty much as before, although now numbered as 1-1 & 1-2 to show the two mirrors of each PSEG. For vRAID5, there are 4 data and 1 parity PSEGs per stripe, which initially land in RSS1, then RSS2, then back to RSS1 again. I haven't shown it, but presumably the EVA does a calculation to ensure that the data resides evenly on each disk.
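
To make the layout a bit more concrete, here's a rough sketch in Python (a toy model of my own, not the EVA's actual placement algorithm) of how 2MB PSEGs for a vRAID5 LUN might be dealt out, with each 4+1 stripe kept inside a single 8-disk RSS and successive stripes alternating between RSS1 and RSS2, much as in the diagram:

# Toy model of vRAID5 PSEG placement - an illustration only, not HP's algorithm.
# Each 4 data + 1 parity stripe stays inside one RSS; successive stripes
# alternate across the RSSs in the disk group.

PSEG_MB = 2            # PSEG size
DISKS_PER_RSS = 8      # ideal RSS size
RSS_COUNT = 2          # two RSSs, as in the diagram

def place_vraid5(lun_mb):
    """Return (RSS number, disk indices) for each 4+1 stripe of the LUN."""
    stripes = -(-lun_mb // (4 * PSEG_MB))    # ceiling: 4 data PSEGs per stripe
    layout = []
    for s in range(stripes):
        rss = s % RSS_COUNT                  # RSS1, RSS2, RSS1, ...
        first = (s * 5) % DISKS_PER_RSS      # rotate the start disk to spread load
        disks = [(first + i) % DISKS_PER_RSS for i in range(5)]
        layout.append((rss + 1, disks))
    return layout

for rss, disks in place_vraid5(32):          # a 32MB LUN = 4 stripes
    print(f"RSS{rss}: PSEGs on disks {disks}")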

So here's some maths on the numbers. There are many good links worth reading; try here and here. I've taken the simplest formula and churned the numbers for a 168-drive array with a realistic MTBF (mean time between failures) of 100,000 hours. Before people leap in and quote the manufacturers' numbers that Seagate et al provide, which are higher, remember that arrays will predictively fail a drive, and in any case with temperature variation, heavy workload, manufacturing defects etc, the real-world figures come in lower than the manufacturers' ones (as Google have already pointed out).

I've also assumed a repair (i.e. replace) time of 8 hours, which seems reasonable for arrays left unattended overnight. If disks are not grouped, then the MTTDL (mean time to data loss) from a double disk failure is about 44,553 hours, or just over five years. This is for a single array - imagine if you had 70-80 of them; the risk would increase accordingly. Now, with the disks in groups of 8 (meaning that data is written across only 8 disks at a time), the MTTDL becomes 1,062,925 hours, or just over 121 years. This is without any parity.
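
For anyone who wants to reproduce the figures, the simple formula I used is MTTDL = MTBF^2 / (G × n × (n-1) × MTTR), where G is the number of groups, n the number of disks per group and MTTR the repair time. Here's my back-of-envelope working as a quick Python sketch (my own script, nothing official):

# Quick check of the MTTDL figures using the simple double-disk-failure formula:
#   MTTDL = MTBF^2 / (groups * n * (n - 1) * MTTR)

MTBF = 100_000        # hours per drive (assumed real-world figure)
MTTR = 8              # hours to notice, replace and rebuild a failed drive
DRIVES = 168

def mttdl(groups, drives_per_group):
    return MTBF ** 2 / (groups * drives_per_group * (drives_per_group - 1) * MTTR)

ungrouped = mttdl(1, DRIVES)        # the whole array treated as one group
grouped = mttdl(DRIVES // 8, 8)     # 21 RSS-style groups of 8 drives

print(f"Ungrouped:   {ungrouped:,.0f} hours (~{ungrouped / 8760:.1f} years)")
print(f"Groups of 8: {grouped:,.0f} hours (~{grouped / 8760:.1f} years)")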

Clearly grouping disks into RSSs does improve things, and quite considerably so, even if no parity is implemented, so thumbs up to RSSs from a mathematical perspective. However, if a double disk failure does occur, then every LUN in the disk group is impacted, as data is spread across the whole disk group. So it's a case of very low probability, very high impact.

Mark & Marc commented on 3Par's implementation being similar to EVA. I think XIV sounds similar too. I'll do more investigation on this as I'd like to understand the implications of double disk failures on all array types.

6 comments:

Cleanur said...

Although potentially all LUNs within the disk group are affected, the failure and therefore the rebuild from parity are isolated within the same RSS, and so do not require a rebuild of the entire underlying disk group or LUNs.

Just to clarify, much of the maths should assume a non-predicted double overlapping failure within the same RSS, rather than within the same disk group. Arguably this type of failure is much more likely to be induced by failure of non-disk components such as firmware, compound back-end failures, administrators or engineers.

In a double overlapping failure within the same RSS, if your LUNs reside on RAID5 then potentially they will fail; it depends on the nature of the failure, hard or transient. If they're on RAID1 then more than likely they would still survive, but again it depends on the nature of the failure: the two paired disks within the same RSS must both fail.

It's also worth keeping in mind that once sparing has completed, replacing the failed disk only re-allocates "reserved" space for additional sparing beyond that already available within the disk group; it doesn't necessarily improve the availability of the disk group.

HP do provide some good guidance on disk group usage based on availability, performance and cost. In some cases all three overlap, but it's difficult for any vendor to achieve all three goals in all scenarios. Any vendor who tells you he can should be treated with suspicion.

Some of the more vociferous vendors, with troops of marketing-savvy bloggers, have pushed the double disk overlap failure for all it's worth. This is more to do with their traditional approach to rebuilds, nearline spares and the inability of traditional RAID schemes to create large disk group structures reliably.

Linked below are some of the issued patents for RSS, leveling and distributed sparing; if you can get through the legal speak there's some good technical info.

http://www.patentstorm.us/patents/6895467/fulltext.html

http://www.patentstorm.us/patents/7146460/fulltext.html

Just a final thought: when HP get over their current acquisition trawl and decide to release something, just think of the potential availability inherent in a double parity scheme that also utilises RSS structures.

Chris M Evans said...

cleanur - once again thanks for the comments, I will check out the patent stuff. Also, I totally agree about the way some bloggers portray technology; however, that's their prerogative, as long as they're up front about who they're paid by.

marcfarley said...

Thanks Chris, great posts and comments! And in case I wasn't clear in my previous posts, I work for 3PAR. I don't try to hide that fact, but sometimes I forget to mention it.

It's important to understand how failure detection and data evacuation to spares is handled. It adds variables and complexity to the MTBF calculations, but don't you think assumptions about those mechanisms should be included? If a drive can have its data "spared" in 4 hours or less - and ALL data on spares is redundantly and independently protected - then the risk of data loss from another drive failure is further minimized.
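
For instance, plugging a 4-hour repair window into the same back-of-envelope formula used in the post roughly doubles the headline MTTDL number (a rough sketch only, ignoring the extra protection on the spared data):

# Same simple formula as the post: halving the repair window doubles MTTDL.
MTBF, GROUPS, N = 100_000, 21, 8     # hours per drive, 21 groups of 8 in a 168-drive array

def mttdl(mttr_hours):
    return MTBF ** 2 / (GROUPS * N * (N - 1) * mttr_hours)

print(f"8-hour repair: {mttdl(8):,.0f} hours")   # ~1.06 million hours
print(f"4-hour repair: {mttdl(4):,.0f} hours")   # ~2.13 million hours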

I agree with both you and cleanur about bloggers that have "pushed the double disk overlap failure for all it's worth." There are other ways to reduce the risk of data loss without the performance tradeoffs that n+2 RAID has.

But, RSSes aren't one of them. As cleanur said, restricting sparing to the same RSS as where the failure occurred creates the potential for data loss with a second drive failure.

As far as I can tell, the best thing about RSSes is that they make it easier for a human being to conceptualize data layout, but that doesn't mean it's optimal. Machines do a great job segmenting resources into virtual entities. Splitting RSSes when they exceed 11 disks has to be painful to storage admins due to the workload that must accompany that process.

If the 11-disk RSS maximum size was set to ensure RAID5 groups don't exceed the failure statistics of 10+1, there are virtualization techniques that can do this much better than RSSes.

Cleanur said...

Marc,

I'm not sure what you meant by "the workload that accompanies the process". RSS groups only go through the create, merge or split process when the EVA either sees a failure and must react to protect data integrity, or sees an opportunity to optimize the data layout through the addition of more disks to the pool.

In terms of management, both of these processes incur near-zero overhead for the Admin. In terms of workload on the EVA, then yes, these can have a transient impact on performance, but this also applies to any storage platform on the market. As always, the level of performance impact depends on the priority of the process being performed. In a disk failure scenario you have no choice of when this occurs, and the rebuild process will always have a high priority. If it's an expansion of an existing disk group then this is a controlled process and can be easily scheduled and managed to ensure performance isn't impacted within the group.

Just to reiterate, this entire RSS create, split, merge process is entirely transparent to the administrator. The discussion so far has been centered on how things work at the nuts-and-bolts conceptual level. All the EVA Admin sees during day-to-day operations is a pool, or multiple pools of differing storage tiers, from which he requests capacity of a specific RAID level.

The entire process is pretty much:
Which disk group ?
What do want to call it ?
What Raid Level ?
Who gets Access ?
Done.

The EVA will then make decisions, based on internal controller heuristics, on where data is placed and how the availability of that data can be further increased by intelligent distribution within the enclosures. The Admin just sees a LUN of the requested size for presentation to his host(s).

BTW I don't work for a specific vendor, but do specialise across quite a few within a VAR.
We all have opinions on how best to do things, and I'm just trying to explain this stuff in layman's terms in response to a question posed. The functionality discussed here is just a small subset of the overall package. Now, I've read quite a bit on how many of the other vendors such as yourself, EqualLogic, Pillar etc. handle the virtualisation process, and even though I may not agree with the way these vendors appear to implement it, I've not seen the deep-dive internal documentation on how it really works and have never worked on these products hands-on. So I don't think I'm qualified to pass comment on whether these systems do things the right way or not, given the limited data publicly available. I think you should consider this policy yourself before making blanket statements on other vendors' functionality, especially since this is in no way a competitive spat between vendors. If you're not careful you'll fall into that other vendor's trap of "the wrong way or our way".

Cheers

marcfarley said...

Good comment cleanur. Reading my last comment again, I agree that it was heavy-handed.

I like thinking about storage architectures. My reaction to RSS was simply "Why build an architecture with that in it?" It wasn't meant to be a competitive slam, but it did have a snarky tone.

FWIW, I believe you when you say the create and merge operations are easy on the administrator, but I can't imagine the split operation being anything other than a lot less wonderful. I don't see how it could be done without creating a pile of extra work for the system.

Cleanur said...

Marc,
No problem, and thanks for the response. I've just been getting a little tired of the "our way or the highway" attitude from some of the vendor corporate blogs. There's so much misinformation being spouted recently that it's becoming depressing trying to reverse-engineer the FUD that customers are being plied with. "Hey, x vendor said it so it must be true." ;-)

BTW, small correction but I should have said:-
The entire process is pretty much:
1. Which disk group ?
2. What do want to call it ?
3. *How big do you need it ?
4. What Raid Level ?
5. Who gets Access ?
Done.

The create, split & merge operations are completely hands-off processes. How much work does the split create for the system? From my experience of many upgrades, which in some cases will force a split, it's not something that customers notice. After all, this process wasn't an afterthought; it's part of the core virtualisation stack, so the system's designed to deal with it.