Monday, 27 October 2008

Understanding EVA

I've not had much exposure to HP EVA storage, however recently I've had a need (as part of a software tool project) to get into the depths of EVA and understand how it all works. The following is my understanding as I see it, plus some comments of my own. I'd be grateful for any feedback which helps improve my enlightenment or, equally, knocks me back for plain stupidity!


So, here goes. EVA arrays place disks into disk groups. The EVA system automagically sub-groups the disks into redundant storage sets (RSS). An RSS is simply a logical grouping of disks rather than a RAID implementation, as there's no underlying RAID deployment at the disk level.

Within each disk group, it is possible to assign a protection level. This figure is "none", "one" or "two", indicating the amount of storage to reserve for disk failure rebuilds. The figure doesn't represent an actual disk, but rather an amount of disk capacity that will be reserved across the whole pool. So, setting "one" in a pool of 16 disks will reserve 1/16th of each disk for rebuilds.

Now we get to LUNs themselves and it is at this point that RAID protection comes in. A LUN can be created in a group with either vRAID0 (no protection), vRAID1 (mirrored) or vRAID5 (RAID-5) protection. vRAID5 uses a RAID5 (4+1) configuration with 4-data and 1-parity.
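To make the space overheads concrete, here's a rough sketch (Python, nothing EVA-specific) of how much physical capacity a LUN of a given size might consume under each vRAID level. It assumes vRAID1 is a straight two-way mirror and that vRAID5 really is 4+1 as described above; the figures are illustrative only.

```python
# Back-of-envelope physical capacity consumed per vRAID level.
# Assumes vRAID1 = two-way mirror and vRAID5 = 4 data + 1 parity (4+1),
# per my understanding above, not an official HP formula.

def physical_capacity_gb(lun_size_gb, vraid_level):
    if vraid_level == "vRAID0":
        return lun_size_gb              # no protection, no overhead
    if vraid_level == "vRAID1":
        return lun_size_gb * 2          # every segment written twice
    if vraid_level == "vRAID5":
        return lun_size_gb * 5 / 4      # one parity chunk per four data chunks
    raise ValueError("unknown vRAID level")

for level in ("vRAID0", "vRAID1", "vRAID5"):
    print(level, physical_capacity_gb(100, level), "GB consumed for a 100GB LUN")
```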


From the literature I've read and playing with the EVA simulator, it appears that the EVA spreads a LUN across all volumes within a disk group. I've tried to show this allocation in the diagram on the right, using a different colour for each protection type, within a disk pool of 16 drives.


The picture shows two RSSs and a LUN of each RAID protection type (vRAID0, vRAID1, vRAID5). Understanding vRAID0 is simple; the capacity of the LUN is striped across all physical disks, providing no protection against the loss of any disk within the configuration. In large disk groups, vRAID0 is clearly pointless as it will almost always lead to data loss in the event of a physical failure.
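As a toy illustration of the "spread across every disk" idea, here's how a vRAID0 LUN's chunks might be laid out round-robin over a 16-disk group. The chunk size and placement policy here are purely my assumptions for illustration, not how the EVA actually allocates.

```python
# Toy round-robin layout of a vRAID0 LUN across a 16-disk group.
# Chunk size and placement policy are assumptions for illustration only.

CHUNK_MB = 2          # assumed chunk size, just to make the numbers work
DISKS = 16

def layout_vraid0(lun_size_mb):
    """Return a list of (chunk_index, disk_index) placements."""
    chunks = lun_size_mb // CHUNK_MB
    return [(c, c % DISKS) for c in range(chunks)]

# A 64MB LUN: chunk 0 -> disk 0, chunk 1 -> disk 1, and so on round the group.
print(layout_vraid0(64)[:5])
```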

vRAID1 mirrors each segment of the LUN, which is striped across all volumes twice, once for each mirror copy. I've shown these as A & B and assumed they will be allocated on separate RSS groups. In the event that a disk fails, a vRAID1 LUN can be recreated from the other mirror, using the spare space set aside on the remaining drives to achieve this.
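A similarly hand-waving sketch of the vRAID1 idea: each chunk is written twice, once to a disk in RSS "A" and once to a disk in RSS "B". Whether the EVA really keeps the two copies in separate RSS groups is exactly the assumption I'm making in the diagram.

```python
# Toy vRAID1 placement: each chunk written to one disk in RSS "A" and one in RSS "B".
# The assumption that the two copies always land in different RSSes is mine.

RSS_A = list(range(0, 8))    # disks 0-7
RSS_B = list(range(8, 16))   # disks 8-15

def layout_vraid1(num_chunks):
    placements = []
    for c in range(num_chunks):
        primary = RSS_A[c % len(RSS_A)]
        mirror = RSS_B[c % len(RSS_B)]
        placements.append((c, primary, mirror))
    return placements

print(layout_vraid1(4))   # [(0, 0, 8), (1, 1, 9), (2, 2, 10), (3, 3, 11)]
```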


Question: Does the EVA actively re-create failed mirrors immediately on failure of a physical disk? If so, does the EVA then actively rebuild the failed disk once it has been replaced?


Now for vRAID5, which is a little more tricky. My understanding is that EVA uses RAID-5 (4+1), so there will never be an even layout of data and parity stripes across the disk group. I haven't shown it on the diagram, but I presume that as data is written to a vRAID5 LUN it is split into smaller chunks (I think 128KB) and striped across the physical disks. In this way, there will be as close to an even distribution of data and parity as possible. In the event of a disk failure, the lost data can be recreated from the other data and parity components that make up that stripe.
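The nice thing about RAID-5 style parity is that the rebuild maths is just XOR. Here's a minimal sketch of recreating a lost chunk from the surviving four in a 4+1 stripe; the chunk size is arbitrary, since I don't know what the EVA actually uses (hence the question below).

```python
# Minimal RAID-5 (4+1) parity demo: parity = XOR of the four data chunks,
# so any single lost chunk can be rebuilt by XOR-ing the surviving four.

def xor_chunks(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

data = [bytes([d] * 8) for d in (1, 2, 3, 4)]   # four 8-byte data chunks
parity = xor_chunks(data)

# Simulate losing data chunk 2: rebuild it from the other three plus parity.
survivors = [data[0], data[1], data[3], parity]
rebuilt = xor_chunks(survivors)
assert rebuilt == data[2]
print("rebuilt chunk matches the lost one")
```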


Question: What size block does the EVA use for RAID-5 stripes?


At this point, I'm not sure of the benefit of Redundant Storage Sets. They aren't RAID groups, so there's no inherent protection if a disk in an RSS fails. If LUNs are created within the same RSS, then perhaps this minimises the impact of a disk failure to just that group of disks; see the second diagram.
The upshot is, I think the technique of dispersing the LUN across all disks is good for performance but bad for availability, especially as it isn't easy to see what the impact of a double disk failure can be. My assumption is that *all* data will be affected if a double disk failure occurs within the same RSS group. I may be wrong, but that doesn't sound good.
Feel free to correct me if I've got any of this wrong!

7 comments:

Cleanur said...

setting "one" in a pool of 16 disks will reserve 1/16th of each disk for rebuilds ?

By setting one you'll actually reserve 1/8th of each disk for rebuilds, or twice the capacity of the largest disk in the group. This is because a mirror (Vraid1) cannot exist on an uneven number of disks.
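To put numbers on that, a quick sketch, assuming 16 identical 300GB disks and that "one" reserves twice the capacity of the largest disk spread across the whole group (the disk size is just an example):

```python
# Sparing reservation arithmetic: protection level "one" reserves twice the
# capacity of the largest disk, spread evenly across every disk in the group.
# Figures are illustrative only.

disks = 16
disk_size_gb = 300                      # assume identical 300GB disks

reserved_total_gb = 2 * disk_size_gb    # "one" = 2x the largest disk
reserved_per_disk = reserved_total_gb / disks

print(reserved_per_disk)                # 37.5 GB, i.e. 1/8th of each 300GB disk
print(reserved_per_disk / disk_size_gb) # 0.125
```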

Was going to answer the rest myself, but the answer got longer and longer. Try this link on the HP forums.

http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=886230&admit=109447626+1225124028266+28353475

Stephen Foskett said...

This is great stuff, Chris! Keep it coming!

Chris M Evans said...

Cheers Stephen!

Cleanur - thanks for the link - this sort of helps and clouds the issue too. I'll have a read in more detail and post back.

marcfarley said...

Chris.

I don't want to speak for HP, but consistent with RAID 5 operations, it looks like a double disk failure within an RSS will result in data loss, unless the lost disk's data has been transferred to spare capacity within the RSS. If the failed disk's data (including parity data) has been transferred to spare capacity, then it is possible for the RSS to withstand the loss of a second disk.

Again, I don't want to speak for HP because I don't know how EVA handles this scenario, but this is the general idea of how sparing works in 3PAR InServ arrays.

Of course, a RAID 1 or RAID 5 LUN that spans multiple RSSes can survive a disk failure in each RSS without consideration of how sparing is implemented.

A static EVA configuration seems fairly straightforward, but upgrade scenarios are anything but clear. It looks like the EVA first extends existing RSSes up to a certain number of disks, but beyond that it forces the RSS to be subdivided - creating two smaller RSSes from the first.

It would not appear that this storage "cytokinesis" would be able to split the RSS on disk boundaries, but would need to redistribute the data from each disk over the two "child RSSes". This process would involve a high number of disk I/Os and intricate data movements and "housekeeping" to preserve PSEG alignments (a PSEG is the per-disk granular unit of data in each "micro-array").

This is much easier to describe than it is to do - especially when the array is under load. (I don't know if the EVA supports live operations).

Given this mess, I agree that the RSS, as an architectural element, appears to be flawed.

wobbly1 said...

Question: Does the EVA actively re-create failed mirrors immediately on failure of a physical disk. If so, does the EVA then actively rebuild the failed disk, once it has been replaced?

It actively recreates the failed mirror by recreating both source and mirror blocks on a remaining pair within the disk group. This is the reason that a sparing level of 1 is actually the equivalent capacity of two drives, and for HP's best practice of having an even number of drives. An odd number means that one disk will not be used, as there is no pair for it when vRaid 1 is used.

Question: What size block does the EVA use for RAID-5 stripes?

RSS writes data in 2MB blocks, with the optimum RSS size being 8 disks.

Mark Steel said...

Chris, can you compare to 3PAR? Yes, I know, my favourite vendor, but it really is completely different and a big improvement over Vraid from HP. If you haven't tested it, I will assist with a comparison.

Mark

Cleanur said...

Redundant Storage Sets: Because an EVA disk group can scale to many physical disks, and each Vdisk within the disk group is effectively formed of chunks or segments spread across all spindles within the disk group, there is an increased risk of multiple concurrent (overlapping) disk failures within such a large disk group, which in a traditional RAID environment would result in a vastly increased risk of data loss.

To avoid this issue, the EVA provides a seamless and dynamic mechanism to ensure availability of all Vraid protected data sets by splitting the disk group into sets of six to eleven physical disks, each referred to as a redundant storage set (RSS), with the ideal or target number of disks per RSS being eight. Each RSS is an autonomous set of physical disks that, in the case of Vraid 5, can handle the failure of a single physical disk within that RSS without loss of data; remember, disk groups are built of many RSS groups. In a Vraid 1 scenario the same holds true, but potentially multiple disks within the same RSS can fail and the Vdisk will survive.

The simplest way to think about this is that each Vdisk, which is a logical construct, consists of multiple Vraid arrays spliced together to offer much higher redundancy than a single large RAID array would be capable of. The EVA handles this data protection automatically and requires no user intervention to recover or redistribute data on failed physical disks. The RSS groups have specific rules around how they are formed and how they are migrated and reconstructed as failures occur. Note the description above is an over-simplification, but it illustrates the basic premise of RSS groups and their tight integration with EVA Vraid.
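As a rough illustration of those formation rules: the six-to-eleven range and the target of eight come from the description above, but the splitting policy in this sketch is a guess, not HP's actual algorithm.

```python
# Rough sketch of carving a disk group into RSS groups of 6-11 disks,
# aiming for 8 per RSS. The split policy is a guess, not HP's algorithm.

MIN_RSS, TARGET_RSS, MAX_RSS = 6, 8, 11

def partition_into_rss(total_disks):
    if total_disks < MIN_RSS:
        return [total_disks]                   # too small to form a full RSS
    groups = max(1, round(total_disks / TARGET_RSS))
    # keep every group within the 6-11 range where possible
    while total_disks / groups > MAX_RSS:
        groups += 1
    while groups > 1 and total_disks / groups < MIN_RSS:
        groups -= 1
    base, extra = divmod(total_disks, groups)
    return [base + 1] * extra + [base] * (groups - extra)

for n in (16, 20, 28, 56):
    print(n, "disks ->", partition_into_rss(n))
```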

Distributed Online Sparing: another feature of the EVA which provides a mechanism to reduce the risk of multiple concurrent disk failures (overlap failure) causing data loss. In traditional RAID arrays, single or multiple disks are pre-allocated as spare disks to individual arrays, only to be used in the event of a drive failure, at which point data can be rebuilt from parity onto the spare disk.

Although this mechanism is very common practice, it has in the past proved potentially unreliable and prone to additional failure during the rebuild process. The first reason is that an unused and non-stress-tested disk (the spare) is suddenly forced into life after months of effective inactivity and made to cope with an extremely heavy I/O load as the data is rebuilt from parity; if this disk has any undetected flaws, this is precisely the point at which it is most likely to fail. The second issue is that rebuilding data from parity onto a single spare is a slow process, as the spare disk becomes an I/O bottleneck for all of the other surviving disks within the array. Probability suggests that doubling the rebuild time also doubles the likelihood of a second (overlap) failure.

To combat this issue, the EVA does not reserve specific physical disks for sparing; instead it allocates spare capacity across the entire disk group. The spare space is a percentage of reserved space across all spindles within the given disk group. The percentage varies with the level of protection specified; effectively this is just a guarantee of how much space the EVA will reserve and make unavailable for user allocation.

This ensures that:
• All disks acting as spare capacity within the pool are known good and have been pre-stressed, since they are already in use.
• All disks within the RSS affected by the failure take part in rebuilding data to spare space across multiple target disks, ensuring many more I/Os are available to the rebuild process and thereby reducing the rebuild time and the risk of an overlap failure.
• The percentage of overall disk group capacity reserved for sparing reduces as the disk group grows, achieving further capacity utilisation efficiencies over traditional hot sparing without sacrificing availability (see the sketch below).
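A quick sketch of that last point, assuming (per the earlier comments) that a protection level of "one" equates to twice the largest disk's capacity; the disk size and group sizes are just examples:

```python
# Spare reservation as a percentage of the disk group, assuming protection
# level "one" reserves twice the largest disk's capacity (per earlier comments).

DISK_GB = 300

def spare_percentage(num_disks, protection_level=1):
    reserved_gb = 2 * protection_level * DISK_GB
    return 100.0 * reserved_gb / (num_disks * DISK_GB)

for n in (8, 16, 56, 168):
    print(n, "disks ->", round(spare_percentage(n), 1), "% reserved")
# 8 -> 25.0%, 16 -> 12.5%, 56 -> 3.6%, 168 -> 1.2%
```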

Even if spare space is exhausted by multiple failures with no replacement disk being added in between, the EVA sparing policy will continue to draw any available free space from the underlying disk group to ensure rebuilds can occur, at the expense of free capacity within the pool. Once failed disks are replaced, this capacity is returned to the disk group. The sparing mechanism can be triggered either by a hard failure, in which case a parity rebuild is required, or by a pre-failure event, in which case data is evacuated from the suspected failing disk to spare space.