The Storage Architect: October 2008

Friday, 31 October 2008

Get the Balance Right

It's not very often I side with one vendor or another; however, after BarryB's recent post regarding "Benchmarketing" I feel obliged to comment. Have a read of Barry Whyte's rebuttal too.

We see technology advancements because "concept" devices are used to drive innovation but don't necessarily translate directly to end-user products. Look at the fashion industry - some of the most outrageous outfits are paraded down the catwalk but the same dress, coat, hat or whatever isn't sold in the shops. Instead it influences the next fashion season.

Look at the motor industry - concept cars appear well before actual consumer products. We may laugh at some and marvel at others - take the Bugatti Veyron. It is well known that Volkswagen make a loss on each car produced; however, what counters this is the publicity, the research, and the kudos of being able to claim Veyron technology (arguably the fastest car in the world) is deployed in the standard VW range. Lexus is another good example of a brand created by Toyota to perform the same function. Much the same can be said for Formula 1.

Now, I'm not endorsing IBM per se here; however, I don't see the harm in IBM marketing a "concept" piece of technology which could lead to innovation in the future. After all, IBM is well known for research of this kind; the disk drive and the tape drive spring to mind.

Even EMC's own bloggers question whether EMC is known for innovation and other than Symmetrix, I can't think of one thing I view as an EMC "idea".

Anyway, 'nuff said. As previously offered - I would love to take the position of moderator in developing real world benchmarking - bring it on!!

Who Ya Gonna Call?

Here's a quality piece of reporting from TechCrunch on the state of Facebook and their data problems. I mentioned just last week in this post about their data growth. It's incredible that they're purchasing a new Netapp 3070 filer each week!

I'm surprised that Facebook would be continually purchasing NAS filers to grow their content. There must be a rolling set of pictures, thumbnails and so on that are frequently looked at, but there also must be a significant amount that aren't and could be archived to super-dense nearline type technology akin to the Copan products.

Unfortunately when data growth is so intense, it isn't always easy to see the wood for the trees and from previous and current experience, using Netapp creates the risk of wasted resources.

In my experience, looking at just block-based arrays, I've always seen around 10-15% of orphan or unused resources and sometimes higher. When host-based wastage is taken into consideration, the figure can be much worse, although host reclamation is a much more intense process.

I'm willing to offer to anyone out there who has more than 50TB of storage on storage arrays a free analysis of their environment - for a 50:50 split of any savings that can be made. As budgets tighten, I think there will be more and more focus on this kind of work.

Pillar Crumbles

I picked this up last night on Mike Workman's blog over at Pillar. Looks like they're suffering the downturn. Storagezilla thinks this could be 30% of the workforce. I'm sure this is going to be one of many bad news stories from the storage industry we hear over the next few months.

I've never understood the point of Pillar's offering. Differentiating performance tiers based on the specific place on a disk seems a dead end idea. Firstly, disks might appear to be random access devices but if you're accessing one cylinder on a drive you can't be accessing another at the same time. You need some pretty clever code to ensure that lower tiered I/O requests don't overwhelm tier 1 requests in this kind of shared environment. In addition, everyone says SSDs are the future. Once these devices are mainstream, Pillar's tiering model is defunct (unless SSDs have some performance variant across different parts of the silicon!) as there's no differential in performance across an SSD device.

For me, a Compellent type architecture still seems best - granular access to each storage tier (including SSD) with dynamic relocation of data.

** disclaimer - I have no affiliation with Pillar or Compellent **

Thursday, 30 October 2008

SMI-S Is Dead

Take it from me, SMI-S is a thing of the past. If there's one thing the last few months have taught me it's how different each vendor's products really are. I've been working on a tool called SRA (see the link here) which will report on storage in a consistent manner. Let me tell you that isn't easy...

EMC Symmetrix/DMX - Physical disks are carved into smaller segments called hypers. These are then recombined into LUNs which then might be recombined into composite devices (metas) and replicated, cloned or snapped. The hypers that make up a LUN can come from anywhere within an array and can be moved around at will by a tool designed to improve performance, completely ruining your original well-planned configuration. Combining hypers gives you RAID, which wasn't RAID before and was something called mirrors but is now, and is even RAID-6! Devices have personalities which survive their presentation or removal from a port. A device can have multiple personalities at the same time. LUNs use a nice numbering system based on hex - but don't expect them to number nicely if you destroy and create devices. Bit settings (flags) are used to ensure host SCSI commands work correctly.
HDS USP/HP XP - Physical disks are grouped into RAID groups from which LUNs are carved. Until recently you couldn't span RAID groups easily (unless you were combining some free space in each RAID group). Devices don't have a personality until they're presented to a host on a port, but they can have multiple personalities. HDS use a form of punishment known as CCI for anyone foolish enough to think they had made their arrays easy to manage. LUNs are numbered using a relic of the mainframe and yes, you can move things around to balance performance, but don't think you can do it unless there are spare LUNs (sorry LDEVs) around. Different host types are supported by a setting on a host group which lets you confuse the hell out of everyone by telling them their LUN numbers are all the same but unique. Oh, and the storage the user sees doesn't actually have to be in the array itself.
HP EVA - Phew! Physical disks are managed in groups (which it's recommended to only have one of, but you can have more if you really must) but they don't use RAID at the group level because that would be too easy. Instead disks are grouped into Redundancy Storage Sets, which reduce the *risk* of disk failures but don't protect directly against them. LUNs are created only when they need to be presented to a host and they don't have simple LUN numbers, but rather 32 digit UUIDs. RAID protection is done at the LUN level, making it more difficult to conceptualise than either of the previous two examples.
Pillar Axiom - now we're getting really abstract. With Axiom, you can tier data on different levels of performance, but wait for it - they will be on the same drive, but utilising different parts of the same spindle! Argh! Enough!

Clearly every vendor wants to differentiate their product so you'll buy from them and not the competition. In some respects they *have* to differentiate otherwise all the vendors would spend their time in litigation with each other over patent copyright! (wait a minute, they already are). So SMI-S or any other standard is going to have a near impossible time creating a single reference point. Add to the mix the need to retain some competitive advantage (a bit like Microsoft holding back the really useful API calls in Windows) and to sell their own management tools and you can see why SMI-S will be at best a watered down generic interface.

So why bother? There's no benefit. Every vendor will pay lip service to the standard and implement just what they can get away with.

The question is, what would replace it? There's no doubt something is needed. Most SRM tools are either bloated, poorly implemented, expensive, or plainly don't work, so some light touch software is a must.

I think the interim solution is to get vendors to conform to a standard API format, for example XML via an IP connection to the array. Then leave it to the vendor how to code up commands for querying or modifying the array. At least the access method would be consistent. We don't even see that today. All we need now is an acronym. How about Common Resource Access Protocol?
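To make the idea concrete, here is a minimal sketch of what a consistent XML request and response might look like from the client side. Everything in it is an assumption made up for illustration: the element names (`request`, `array`, `resource`, `lun`), the attributes and the array identifier do not come from any vendor or standard. It only shows the access method being uniform while the content stays vendor-defined.

```python
import xml.etree.ElementTree as ET


def build_query(array_id: str, resource: str) -> str:
    """Build a hypothetical XML query for one resource type on one array."""
    request = ET.Element("request", {"version": "1.0", "type": "query"})
    ET.SubElement(request, "array", {"id": array_id})
    ET.SubElement(request, "resource").text = resource
    return ET.tostring(request, encoding="unicode")


def parse_luns(response_xml: str) -> list:
    """Extract LUN ids and sizes from a hypothetical XML response."""
    root = ET.fromstring(response_xml)
    return [
        {"id": lun.get("id"), "size_gb": int(lun.get("size_gb"))}
        for lun in root.iter("lun")
    ]


# The same request shape would be sent to any array, whatever the vendor.
print(build_query("array01", "luns"))

# A made-up response; each vendor would decide what it actually returns.
sample = '<response><lun id="0A1F" size_gb="30"/><lun id="0A20" size_gb="60"/></response>'
print(parse_luns(sample))
```

The transport (the "IP connection to the array") is deliberately left out; the point is only that the envelope is the same for every array.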

Wednesday, 29 October 2008

Understanding EVA - revisited

Thanks to all those who posted in response to Understanding EVA earlier this week, especially Cleanur who added a lot of detail. Based on the additional knowledge, I'd summarise again:

EVA disks are placed in groups - usually recommended to be one single group unless there's a compelling reason not to (like different disk types e.g. FC/FATA).

Disk groups are logically divided into Redundancy Storage Sets, which can be from 6-11 disks in size, depending on the number of disks in the group, but ideally 8 drives.

Virtual LUNs are created across all disks in a group, however to minimise the risk of data loss from disk failure, equal slices of LUNs (called PSEGs) are created in each RSS with additional parity to recreate the data within the RSS if a disk failure occurs. PSEGs are 2MB in size.

In the event of a drive failure, data is moved dynamically/automagically to spare space reserved on each remaining disk.

I've created a new diagram to show this relationship. The vRAID1 devices are pretty much as before, although now numbered as 1-1 & 1-2 to show the two mirrors of each PSEG. For vRAID5, there are 4 data and 1 parity PSEG, which initially hits RSS1, then RSS2 then back to RSS1 again. I haven't shown it, but presumably the EVA does a calculation to ensure that the data resides evenly on each disk.

So here's some maths on the numbers. There are many good links worth reading; try here and here. I've taken the simplest formula and churned the numbers on a 168-drive array with a realistic MTBF (mean time between failures) of 100,000 hours. Before people leap in and quote the manufacturers' numbers that Seagate et al provide, which are higher figures, remember arrays will predictively fail a drive and in any case with temperature variation, heavy workload, manufacturing defects etc, the real-world MTBF is lower than the manufacturers' figures (as Google have already pointed out).

I've also assumed a repair (i.e. replace) time of 8 hours, which seems reasonable for arrays unattended overnight. If disks are not grouped, then the MTTDL (mean time to data loss) is about 44,553 hours, or just over five years. This is for a single array - imagine if you had 70-80 of them - the risk would be increased. Now, with the disks in groups of 8 (meaning that data will be written across only 8 disks at a time), the MTTDL from a double disk failure becomes 1,062,925 hours or just over 121 years. This is without any parity.
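For anyone who wants to check the arithmetic, here is a short sketch. The post doesn't spell out the formula, so I'm assuming the common simple approximation for a double disk failure, MTTDL = MTBF² / (N × (N-1) × MTTR), divided by the number of groups when the disks are split up. The function and variable names are mine. With the inputs above it reproduces both quoted figures.

```python
HOURS_PER_YEAR = 24 * 365


def mttdl_hours(disks, mtbf_hours, repair_hours, group_size=None):
    """Approximate mean time to data loss from a double disk failure.

    Assumed formula: MTBF^2 / (N * (N - 1) * MTTR) per group of N disks,
    divided by the number of groups in the array.
    """
    if group_size is None:
        group_size = disks  # ungrouped: the whole array is one group
    groups = disks / group_size
    per_group = mtbf_hours ** 2 / (group_size * (group_size - 1) * repair_hours)
    return per_group / groups


ungrouped = mttdl_hours(168, 100_000, 8)
grouped = mttdl_hours(168, 100_000, 8, group_size=8)

print(f"Ungrouped: {ungrouped:,.0f} hours ({ungrouped / HOURS_PER_YEAR:.1f} years)")
print(f"Groups of 8: {grouped:,.0f} hours ({grouped / HOURS_PER_YEAR:.1f} years)")
```

Changing `repair_hours` or `group_size` shows how sensitive the result is to replacement time and group size.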

Clearly grouping disks into RSSs does improve things and quite considerably so, even if no parity is implemented, so thumbs up to RSSs from a mathematical perspective. However if a double disk failure does occur then every LUN in the disk group is impacted as data is spread across the whole disk group. So it's a case of very low probability, very high impact.

Mark & Marc commented on 3Par's implementation being similar to EVA. I think XIV sounds similar too. I'll do more investigation on this as I'd like to understand the implications of double disk failures on all array types.

Tuesday, 28 October 2008

Virtualisation: LeftHand VSA Appliance - Part Two

In my previous post covering LeftHand's Virtual Storage Appliance, I discussed deploying a VSA guest under VMware. This post discusses performance of the VSA itself.

Deciding how to measure a virtual storage appliance's performance wasn't particularly difficult. VMware provides performance monitoring through the Virtual Infrastructure Client and gives some nice pretty graphs to work with. So from the appliance (called VSA1 in my configuration) I can see CPU, disk, memory and network throughput.

The tricky part comes in determining what to test. Initially I configured an RDM LUN from my Clariion array and ran the tests against that. Performance was poor and when I checked out the Clariion I found it was running degraded with a single SP and therefore no write cache. In addition, I also used a test Windows 2003 VM on the same VMware server - D'oh! That clearly wasn't going to give fair results as the iSCSI I/O would be going straight through the hypervisor and potentially VSA1 and the test W2K3 box would contend for hardware resources.

So, on to test plan 2, using another VMware server with only one single W2K3 guest, talking to VSA1 on the initial VMware hardware. So far so good - separate hardware for each component and a proper network in between (which is gigabit). To run the tests I decided to use Iometer. It's free, easy to use and you can perform a variety of tests with sequential and random I/O at different block sizes.

The first test was for 4K blocks, 50% sequential read/writes to an internal VMFS LUN on a SATA drive. The following two graphics show the VMware throughput; CPU wasn't max'd out and sat at an average of 80%. Data throughput averaged around 7MB/s for reads and only 704KB/s for writes.

I'm not sure why write performance is so poor compared to reads however I suspect there's a bit of caching going on somewhere. That's evident from looking at the network traffic which shows an equivalent amount of write traffic as there is network traffic. The read traffic doesn't add up. There's more read traffic on VSA1 than expected, which is shown in the figures from Iometer. It indicates around 700KB/s for both reads and writes.

I performed a few other tests, including a thin provisioned LUN. That showed a CPU increase for the same throughput - no surprise there. There's also a significant decrease in throughput when using 32KB blocks compared to 4KB and 512 bytes.

So, here's the $64,000 question - what kind of throughput can I expect per GHz of CPU and per GB of memory? Because remember there's no supplied hardware here from LeftHand, just the software. Perhaps with a 2TB limit per VSA the performance isn't that much of an issue but it would be good to know if there's a formula to use. This throughput versus CPU versus memory is the only indicator I can see that could be used to compare future virtual SANs against each other and when you're paying for the hardware, it's a good thing to know!

Monday, 27 October 2008

Understanding EVA

I've not had much exposure to HP EVA storage; however, recently I've had a need (as part of a software tool project) to get into the depths of EVA and understand how it all works. The following is my understanding as I see it, plus some comments of my own. I'd be grateful for any feedback which helps improve my enlightenment or equally, knocks me back for plain stupidity!

So, here goes. EVA arrays place disks into disk groups. The EVA system automagically sub-groups the disks into redundant storage sets (RSS). An RSS is simply a logical grouping of disks rather than some RAID implementation as there's no underlying RAID deployment at the disk level.

Within each disk group, it is possible to assign a protection level. This figure is "none", "one" or "two", indicating the amount of storage to reserve for disk failure rebuilds. The figure doesn't represent an actual disk, but rather an amount of disk capacity that will be reserved across the whole pool. So, setting "one" in a pool of 16 disks will reserve 1/16th of each disk for rebuilds.

Now we get to LUNs themselves and it is at this point that RAID protection comes in. A LUN can be created in a group with either vRAID0 (no protection), vRAID1 (mirrored) or vRAID5 (RAID-5) protection. vRAID5 uses a RAID5 (4+1) configuration with 4-data and 1-parity.

From the literature I've read and playing with the EVA simulator, it appears that the EVA spreads a LUN across all volumes within a disk group. I've tried to show this allocation in the diagram on the right, using a different colour for each protection type, within a disk pool of 16 drives.

The picture shows two RSSs and a LUN of each RAID protection type (vRAID0, vRAID1, vRAID5). Understanding vRAID0 is simple; the capacity of the LUN is striped across all physical disks, providing no protection against the loss of any disk within the configuration. In large disk groups, vRAID0 is clearly pointless as it will almost always lead to data loss in the event of a physical failure.

vRAID1 mirrors each segment of the LUN, which is striped across all volumes twice, one for each mirror. I've shown these as A & B and assumed they will be allocated on separate RSS groups. In the event that a disk fails, then a vRAID1 LUN can be recreated from the other mirror, using the spare space set aside on the remaining drives to achieve this.

Question: Does the EVA actively re-create failed mirrors immediately on failure of a physical disk? If so, does the EVA then actively rebuild the failed disk, once it has been replaced?

Now, vRAID5, a little more tricky. My understanding is that EVA uses RAID-5 (4+1), so there will never be an even layout of data and parity stripes across the disk group. I haven't shown it on the diagram, but I presume as data is written to a vRAID5 LUN it is split into smaller chunks (I think 128KB) and striped across the physical disks. In this way, there will be as close to an even distribution of data and parity as possible. In the event of a disk failure, the lost data can be recreated from the other data and parity components that make up that stripe.
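As a reminder of how that recreation works in a 4+1 layout, here is a small sketch using plain XOR parity, which is the textbook RAID-5 technique. I'm not claiming this is how the EVA does it internally; the chunk contents and sizes are made up and far smaller than any real stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Four data chunks in one stripe (tiny, illustrative values).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Pretend the disk holding chunk 2 has failed.
lost = 2
survivors = [chunk for i, chunk in enumerate(data) if i != lost] + [parity]

# XOR of the three surviving data chunks and the parity gives the lost chunk.
rebuilt = xor_blocks(survivors)
print(rebuilt == data[lost])
```

Losing a second chunk from the same stripe leaves nothing to XOR against, which is why a double disk failure is the case to worry about.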

Question: What size block does the EVA use for RAID-5 stripes?

At this point, I'm not sure of the benefit of Redundant Storage Sets. They aren't RAID groups, so there's no inherent protection if a disk in an RSS fails. If LUNs are created within the same RSS, then perhaps this minimises the impact of a disk failure to just that group of disks; see the second diagram.

The upshot is, I think the technique of dispersing the LUN across all disks is good for performance, but bad for availability - especially as it isn't easy to see what the impact of a double disk failure can be - my assumption is that it means *all* data will be affected if a double disk failure occurs within the same RSS group. I may be wrong but that doesn't sound good.

Feel free to correct me if I've got any of this wrong!

Wednesday, 22 October 2008

Virtualisation: LeftHand VSA Appliance - Part One

I've been running the LeftHand Networks Virtual SAN Appliance for a while now. As I previously mentioned, I can see virtual storage appliances as a great new category, worthy of investigation for the flexibility of being able to provide functionality (replication, snapshots etc) without having to deploy appliance hardware.

This post is one of a number covering the deployment and configuration of VSA.

So, installation on VMware is remarkably easy. After downloading the installation material, a new virtual machine can be created from the VMware OVF file. I've included a couple of screenshots of the installation process. Choose a Datastore and you're done.

When the virtual appliance is started for the first time, you set the IP address and an administration password and that's it.

The remainder of the configuration is managed through the Centralised Management Console, a separate piece of software installed on a Windows or Linux Management host. Presumably this could be a virtual machine itself, but in my configuration it is installed on my laptop.

From this point on, the configuration challenge begins! I like to test software by seeing how far I can get before having to resort to looking at the manual. Unfortunately I didn't get far as there's a restriction on having a single SCSI LUN assigned to the virtual appliance and it must be device SCSI(1:0). This needs to be done while the VMware host is down (D'oh!) and I think that's because although a hard disk can be added dynamically, a SCSI controller can't, so once the disk is added offline, further disks can be added to existing SCSI controllers (although I don't think they can be used) even if the VMware host is up and running.

Within the CMC GUI, RAID can now be enabled, which picks up the single device configured to the VSA appliance. RAID isn't real RAID but virtual so there's no underlying redundancy available. I was however, able to make my one data LUN a raw (RDM) device, so presumably in a real configuration the data LUN could be a hardware RAID configured device within the VMware server itself.

The final configuration step is to create a Management Group, cluster and volume, which can easily be achieved using the Management Group Wizard. See the screenshot of the completion of the build.

I've now got a 30GB LUN (thin provisioned) which I can access via iSCSI - once I've performed two more configuration steps. First, I need to create a Volume List. This is just a grouping of LUNs against which I can apply some security details. So, on the Management Group I've already defined, I create a Volume List and add the LUN. I then create an Authentication Group and associate it with my Volume List. At this point within the Authentication Group I can specify the iSCSI initiators which can access the LUNs and if necessary, configure CHAP protection.

From my iSCSI client (my laptop) I add the VSA target and then I can configure the LUN as normal through Computer Management->Disk Management.

Phew! This sounds complicated but in reality it isn't. The configuration tasks complete quickly and it's easy to see how the security and device framework is implemented.

In the next post, I'll dig down into what I've configured, talk about thin provisioning and performance, plus some of the other features the VSA offers.

Tuesday, 21 October 2008

Who needs FCoE anyway?

I've been working on getting my "home SAN" into a usable configuration over the last few weeks. One hassle has been VMware (and I won't even mention Hyper-V again) and the support for fibre channel.

I guess thinking logically about it, VMware can't support every type of HBA out there and the line has to be drawn somewhere, but that meant my Qlogic and JNI cards were no use to me. Hurrah for Ebay, as I was able to pick up Emulex Lp9002L HBA cards for £10 each! I remember when these cards retailed at £600 or more.

Now I have two VMware instances accessing Clariion storage through McDATA switches and it all works perfectly.

That leads me on to a couple of thoughts. How many thousands of HBA cards are out there that have been ditched as servers are upgraded to the latest models? Most of them are perfectly serviceable devices that will continue to give years of useful service, but "progress" to 4/8Gb fibre channel and FCoE dictates we must ditch these old devices and move on.

Why? It's not as if they cause a "green" issue - they're not power hungry, nor do they take up lots of space. I would also challenge anyone to claim that they need more than 2Gb/s bandwidth on all but a small subset of their servers. (Just for the record, I see the case for using 4/8Gb/s for large virtual server farms and the like as you've concentrated the I/O nicely and optimised resources.)

So we need two things; (a) a central repository for returning old and unwanted HBAs (b) vendors to "open source" the code for their older HBA models to allow enthusiast programmers to develop drivers for the latest O/S releases.

If we can reduce the amount of IT waste, then I think this is a key strategy to any company claiming to be green in the world moving forward.

Monday, 20 October 2008

Facebook Facts and Figures

Have a look at this link from Facebook - http://www.facebook.com/note.php?note_id=30695603919

They're now serving up 15 billion images a day! From the figures quoted, Facebook host 40 billion files as 1PB of storage, or 25KB per image. Peak load is 300,000 images a second, or 7.5GB per second of bandwidth.

Now I suspect (and it's not rocket science to guess) that Facebook don't store all their images in one place and they are distributed for performance and redundancy, so storage usage must be significantly higher than the quoted figure.

Growth rate is 2-3TB a day! That's up to a petabyte a year at the current rates. What's interesting is that potentially all of these images must be available to view (although most of the time the smaller preview clips will get shown first) so as Facebook grows, they must be hitting some serious issues maintaining reasonable response times.
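The back-of-envelope sums are easy to reproduce. This sketch uses only the figures quoted from the Facebook note, with decimal units (1KB = 1,000 bytes, 1PB = 10^15 bytes) as my assumption.

```python
KB, GB, TB, PB = 10**3, 10**9, 10**12, 10**15

files = 40 * 10**9          # 40 billion files
stored = 1 * PB             # quoted storage footprint

avg_image_bytes = stored // files
print(f"Average size: {avg_image_bytes // KB}KB per image")

peak_images_per_sec = 300_000
peak_bandwidth = peak_images_per_sec * avg_image_bytes
print(f"Peak bandwidth: {peak_bandwidth / GB}GB/s")

# Yearly growth at the low and high end of the quoted 2-3TB a day.
for tb_per_day in (2, 3):
    yearly = tb_per_day * TB * 365
    print(f"{tb_per_day}TB/day is {yearly / PB:.2f}PB a year")
```

The 3TB a day case is where the "up to a petabyte a year" comes from.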

So, how many of these images are also located elsewhere on other sites like Flickr? How many are sitting on memory sticks, hard drives and so on? I guess we'll never know, but maybe when we've One Cloud to Rule Them All, we'll have a shed load of spare disks lying around.

It's the little things that annoy

It annoys me when vendors release new products and don't update their global website correctly.

Like this link for HDS; http://www.hds.com/products/storage-systems/adaptable-modular-storage-2000-family/index.html?WT.ac=prodssams2000

Which should follow through to a page (link at the bottom) to AMS2000 Specifications. Except it doesn't - it only displays the older models. Doesn't a product release of the level of the AMS2000 warrant someone checking the website (i.e. the shop window) to ensure all the data is there?

Come on guys!!

Friday, 17 October 2008

Friday BlogRoll - EMC

Here's this Friday's list of bloggers I follow. Today it's the turn of EMC.

Andrew's Blog - Andrew Cohen (EMC General Counsel) - HomePage - RSS Feed
Chuck's Blog - Chuck Hollis, EMC Marketing - HomePage - RSS Feed
Never Talk When You Can Nod - Andrew Chapman, SharePoint GM - HomePage - RSS Feed
Confessions of an Ebiz Junkie - Len Devanna - HomePage - RSS Feed
Cornelia Davis's Weblog - Cornelia Davis - HomePage - RSS Feed
Craig's Musing's - Craig Randall - HomePage - RSS Feed
Dave Graham's Weblog - Dave Graham - HomePage - RSS Feed
Energy Matters - Dick Sullivan - HomePage - RSS Feed
Information Playground - Steve Todd - HomePage - RSS Feed
Mark's Blog - Mark Lewis - HomePage - RSS Feed
Oracle Storage Guy - Jeff Browning - HomePage - RSS Feed
Storagezilla - Mark Twomey - HomePage - RSS Feed
The Storage Anarchist - Barry Burke - HomePage - RSS Feed

Enjoy!

Thursday, 16 October 2008

Bl** Hyper-V!

Well, I wasted 3 hours of my life last night trying to get Hyper-V working on one of my PC/servers. Admittedly it's an ancient 2 years old, only has PCI-Express, SATA-II support and up to 4-core Intel processors, but for some reason, my attempts to install Hyper-V would get just so far and fail with a cryptic 0x8007045D error.

As a seasoned professional, I tried the obvious - shouting at the PC, kicking the PC, snapping at my children as they came in to ask innocent questions, then as a last resort I tried using different installation media, screwing about with BIOS settings and so on.

None of it worked. The error code, according to Google, seems to be hardware related, but I've no idea where and Hyper-V being a complex high-quality piece of software gave me no clues. Perhaps if the installation hadn't taken up to 30 minutes at a time (goodness knows what it was doing) I could have got back to Heroes an hour earlier.

After giving up, I re-installed VMware ESXi - an installation which, no kidding, took only 10 minutes end to end.

I have been planning a review of the virtualisation technologies, especially with respect to storage; clearly Hyper-V is going to make this a challenge.

Microsoft - you're not on my Christmas card list this year (which they weren't on in the first place as my wife writes all the cards in our house) - VMware welcome back.

Tuesday, 14 October 2008

Replacing the Virtualisation Component - II

Thanks to Chuck who pointed out to me SVC's ability to move virtual WWNs between nodes during replacement. At some stage in the future I may get to play with SVC but I haven't so far, so this feature eluded me. Question: is SVC the *only* block virtualisation appliance to offer this functionality and is it a seamless operation or does it require downtime?

How about InVista or Incipient (or any other vendor I may have missed off - we will assume USP doesn't have the facility)? Answers on a ~~postcard~~ comment please.

Compellent and SSDs

There's been a lot of talk this week about Compellent and their support for solid state drives. See the press release here. So now we have two vendors offering SSD devices in their arrays, Compellent join the club with EMC. Which is best?

At a meeting I had last week, we discussed SSD drives and EMC's implementation in particular. The consensus was that SSDs (or should I be calling them EFDs?) in existing DMX arrays were more of an "also supports" rather than a mainline feature. The reason for that thinking was that DMX was never engineered specifically to support EFDs, but rather they've been added on as a recent value-add option. What's not clear is whether this bolt-on approach really means you get the best from the drives themselves, something that's important with the price point they sit at. Consider that EFDs sit behind a shared architecture of director ports, memory, front-end ports and queues. Do EFDs get priority access? (I know they have to be placed in specific slots in the DMX storage cabinet so presumably they are affected by their position on the back-end directors.)

The other problem with the EMC approach is that entire EFD LUNs must be given up to a host. With large databases, how do you predict which parts of the database at any one time are the hot parts? How does a re-org or reload affect the layout of the data? Either you need to put all of your database on EFD or spend a lot more time with the DBAs and Sys Admins creating a design that segments out active areas (and possibly repeating this process often).

If Compellent's technology works as described, then LUNs will be analysed at the block level and the active blocks will remain on the fastest storage with the least active moved to lower tiers of disk (or to other parts of the disk) within the same array.

This should offer a more granular approach to using SSDs for active data. In addition, if data can dynamically move up/down the stack of storage tiers, then as data profiles change over time, no application re-mapping or layout should be necessary. Hopefully this means that SSDs are used as efficiently as possible, justifying their inflated cost.

Just to conclude, I'm not saying Compellent have the perfect solution for using SSDs but it is a step in the right direction for making storage usage as efficient as possible.

Monday, 13 October 2008

Replacing the Virtualisation Component

There's no doubting that storage virtualisation will prove to be a key component of IT architecture in the future. The overall benefit is to abstract the physical storage layer from servers either in the fabric, or through the use of a dedicated appliance or even in the array itself.

Over time, storage resources can be upgraded and replaced, potentially without any impact to the host. In fact, products such as USP from HDS are sold on the virtues of their migration features.

However at some stage the virtualisation platform itself needs to be replaced. So how do we do that?

The essential concept of virtual storage is the presentation of a virtual world wide name (WWN). Each WWN then provides virtual LUNs to the host. The virtualisation appliance manages the redirection of I/O to the physical device, which also includes responding to SCSI LUN information queries (like the size of the LUN).

Ultimately, the host believes the virtual WWN is the physical device and any change to the underlying storage is achieved without affecting this configuration. If the virtualisation appliance must be replaced, then the virtual WWN could change and this means host changes, negating the benefit of deploying a virtual infrastructure.

As an example, HDS and HP allow USP/XP arrays to re-present externally connected storage as if it is part of the array itself. LUNs can be moved between either physical storage medium (internal or external) without impact to the host. However, the WWN used by the host to access the storage is a WWN directly associated with the USP/XP array (and in fact decoding the WWN shows it is based on the array serial number). If the USP is to be replaced, then some method of moving the data to another physical array is needed. At the same time, the host->WWN relationship has to change. This is not easy to achieve without (a) an outage, (b) host reconfiguration or (c) using the host as the data mover.

There isn't an easy solution to the issue of replacing the virtualisation tool. Stealing an idea from networking, it could be possible to provide a DNS style reference for the WWN with a "name server" to look up the actual physical WWN from the "DNS WWN". Unfortunately whilst this would be relatively easy to implement (a name server already exists in Fibre Channel) the major problem would be maintaining data integrity as a DNS WWN entry is changed and reads/writes start occurring from a new device. What we'd need is a universal synchronous replicator to ensure all I/Os written to an array are also written to any other planned target WWN, so as the WWN DNS entry is changed, it can't become live until a guaranteed synchronous mirror exists. I can't see many vendors agreeing to open up their replication technology to enable this; perhaps they could offer an API for "replication lite" which was used solely for migration purposes while the main replication product does the big replication features.
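As a thought experiment only, the "DNS WWN" idea might look something like the sketch below. Nothing here corresponds to a real Fibre Channel name server interface; the `WwnNameServer` class, the alias name and the WWN values are all made up for illustration. The one rule it enforces is the one described above: an alias cannot be repointed until a synchronous mirror to the new target is confirmed.

```python
class MirrorNotSynchronised(Exception):
    """Raised when an alias is repointed before the mirror is in sync."""

class WwnNameServer:
    def __init__(self):
        self._alias_to_physical = {}
        self._in_sync = set()  # (source_wwn, target_wwn) pairs known to be mirrored

    def register(self, alias, physical_wwn):
        self._alias_to_physical[alias] = physical_wwn

    def mark_mirror_in_sync(self, source_wwn, target_wwn):
        # In this sketch the (hypothetical) replicator reports that every
        # write to the source is also committed on the target.
        self._in_sync.add((source_wwn, target_wwn))

    def resolve(self, alias):
        return self._alias_to_physical[alias]

    def repoint(self, alias, new_physical_wwn):
        current = self._alias_to_physical[alias]
        if (current, new_physical_wwn) not in self._in_sync:
            raise MirrorNotSynchronised(f"{current} -> {new_physical_wwn}")
        self._alias_to_physical[alias] = new_physical_wwn

# Made-up WWNs standing in for the old and new virtualisation engines.
OLD_ARRAY = "50:00:00:00:00:00:00:01"
NEW_ARRAY = "50:00:00:00:00:00:00:02"

ns = WwnNameServer()
ns.register("finance-db", OLD_ARRAY)

try:
    ns.repoint("finance-db", NEW_ARRAY)  # no mirror yet, so this is refused
    blocked = False
except MirrorNotSynchronised:
    blocked = True

ns.mark_mirror_in_sync(OLD_ARRAY, NEW_ARRAY)
ns.repoint("finance-db", NEW_ARRAY)      # now the switch is allowed
```

The hard part, of course, is everything hidden behind `mark_mirror_in_sync`: the vendor-neutral synchronous replication that makes the guarantee true.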

In the short term, we're going to have to accept that replacing the virtualisation engine is going to have some impact and just learn to work around it.

Monday, 6 October 2008

Could Netapp make a virtual NAS appliance?

As well as storage, one area of IT I find really interesting is virtualisation. Over the years I've used VM (i.e. the IBM mainframe platform), MVS (now morphed into z/OS) as well as products such as Iceberg. I've also used VMware since it was first released and have finally managed to deploy a permanent VMware ESX installation in my home/office datacentre. That has given me the opportunity to install and test virtual SAN appliances, such as VSA from LeftHand Networks and Network Storage Server Virtual Appliance from FalconStor. I'll publish more on these in a week or so once I've done some homework, but for now I want to discuss Netapp.

As many of you will know, Netapp have offered a simulator for ONTAP to their customers for some time (BTW, Dave and the crew, although I'm not a customer, I would be grateful for an up-to-date copy). The simulator is great for script testing and learning new commands without totally wrecking your production operations. However, I think it is about time Netapp took the plunge and offered ONTAP as a virtual appliance.

It shouldn't be hard to do for two reasons: (a) the code is mostly Unix anyway and (b) most if not all of the code exists in the simulator. It also seems to me to be an easy win; there are many organisations who wouldn't consider placing a Netapp filer into a branch office due to cost, but would deploy VMware for other services. A virtual filer could provide File & Print, iSCSI and SAN services *and*, most usefully, replicate that data back to core using standard Netapp protocols such as Snapmirror and Snapvault.

Perhaps Netapp haven't done it as they don't want to cut into their generous hardware margin on disk, but with a virtual offering to complement their physical ones, Netapp could retain their position as NAS vendor of choice.

Thursday, 2 October 2008

LeftHand - a case for a new application category

HP announced today their intention to acquire LeftHand Networks, an iSCSI and virtualised SAN player.

Now, I doubt HP needed to buy LeftHand for their iSCSI technology. I suspect the bigger play here is the virtualised SAN technology they have - also known as the Virtual SAN Appliance. This allows a SAN to be created in a VMware guest, utilising the storage of the underlying VMware server itself.

I think we have a new technology sector starting to mature; virtual storage appliances.

At first glance you might ask why virtualise the SAN and initially I was skeptical until I gave it some thought (especially with reference to a client I'm dealing with at the moment). Imagine you have lots of branch offices. Previously you may have deployed a DNS/Active Directory server, perhaps a file server and some storage, the amount of storage being dependent on demand within the branch. Deploying the storage becomes a scalability and support nightmare if you have lots of branches. But how does a virtual SAN help?

Well, it allows you to provide SAN capability out of the resilient architecture you've already deployed in that location. Chances are you've deployed more than one physical server for failure purposes. You may also not need a large amount of storage, but want advanced features like replication, snapshots etc. Deploying a virtual SAN lets you utilise these features but leverage both the hardware and storage of the ESX infrastructure you've deployed. The crucial point here is that you've benefited from getting the functionality you require without deploying bespoke hardware.

So you reduce costs while still maintaining a resilient infrastructure and providing scalable support for small and medium branches. The challenge moves from supporting hardware (which has become a commodity) to supporting software as part of a virtual infrastructure and that's a different issue. What you've gained is a consistent set of functional SAN operations which can be overlaid on different hardware - hardware which can be changed and upgraded without impacting the virtual SAN configuration.

I've downloaded VSA to test as I now have a resilient VMware environment. I'm looking forward to discovering more.

Wednesday, 1 October 2008

Fast Food Storage Provisioning

Yesterday I discussed beating the credit crunch by getting your house in order. This was picked up by Marc Farley over at StorageRap and he posted accordingly. Marc, thanks for the additional comments, I will be reviewing the diagram based on your thoughts.

Moving on, ask yourself: does this sound familiar?

Storage requests come in over time at a constant but unpredictable rate. When they arrive, you just provision them. Perhaps you check the requestor can "pay" for their storage (i.e. is authorised to request) but generally, storage is provisioned pretty much on demand. When you run out of storage, there's a minor panic and rush to place new hardware orders and then in a few weeks you're back in the game and provisioning again.

Welcome to Fast Food Storage Provisioning! I was going to use a brand name in this post, but then decided against it. After all, as these guys know, you're just asking for trouble.

How does this compare to storage? Easy. You walk into a fast food place and they're just there waiting to serve you, no questions asked, as long as you pay. They may have what you require ready, but if not, there's a panic in the kitchen area to cook what you want and so a delay in the delivery of your request. Those customers who eat food/storage every day become "overprovisioned" in both senses of the word.

Clearly Fast Food establishments have a vested interest in acquiring more customers as it builds their profits, however unless you are selling a service, storage growth is bad for the bottom line.

So, how about taking a few steps to make sure that storage is really needed?

When do you need the storage by? Poor project planning means storage requests can be placed long before servers and HBAs have even been delivered, never mind racked and configured.
Can the storage be delivered to you in increments? Most users who request 20TB immediately will not actually use it for days or weeks (and in extreme cases may never use it all).
Have you checked your existing server to see if you have free storage? You would be amazed how many users have free LUNs on their servers they didn't know were there.
What exactly is your requirement in detail? How will you use what we give you? By questioning the request, you can find out if users have simply doubled the estimate of required storage given to them by the DBA. Get to the real deal on what growth is anticipated.
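
For anyone who prefers their checklists executable, the questions above could be sketched roughly as follows. The `StorageRequest` fields, the `review` function and all the figures are hypothetical, invented for this illustration rather than taken from any real provisioning tool.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class StorageRequest:
    requested_tb: float
    needed_by: date
    hardware_ready_on: date            # when servers and HBAs are racked and configured
    free_tb_on_existing_servers: float
    first_increment_tb: float          # what the user can realistically consume at first

def review(request):
    """Return (tb_to_provision_now, notes) after asking the questions above."""
    notes = []
    if request.hardware_ready_on > request.needed_by:
        notes.append("servers/HBAs not ready by the requested date; provision later")
    # Free LUNs already on the servers reduce what has to be provisioned.
    outstanding_tb = max(0.0, request.requested_tb - request.free_tb_on_existing_servers)
    if outstanding_tb < request.requested_tb:
        notes.append("use existing free LUNs first")
    # Only hand over the first increment; the rest follows as usage grows.
    provision_now_tb = min(outstanding_tb, request.first_increment_tb)
    if provision_now_tb < outstanding_tb:
        notes.append("deliver the remainder in increments as usage grows")
    return provision_now_tb, notes

# Hypothetical 20TB request where 4TB is already sitting unused on the servers.
request = StorageRequest(
    requested_tb=20.0,
    needed_by=date(2008, 11, 1),
    hardware_ready_on=date(2008, 12, 1),
    free_tb_on_existing_servers=4.0,
    first_increment_tb=5.0,
)
provision_now_tb, notes = review(request)
```

In this made-up case the 20TB "urgent" request turns into 5TB now, with three notes to discuss with the requestor.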

I'm not advocating saying no to customers, just to be confident that what you're deploying is what you need. Then you won't have that guilty feeling ordering another burger - I mean array...
